Natural Language Processing with Spark NLP Glossary
A
- annotation - “In an NLP context, an annotation is a marking on a segment of text or audio with some extra information. Generally, an annotation will require character indices for the start and end of the annotated segment, as well as an annotation type.” (NLPwSprk 2020)
- Apache Hadoop - “Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig.” (NLPwSprk 2020)
- Apache Spark - “Spark is a distributed computing framework with a high-level interface and in memory processing. Spark was developed in Scala, but there are now APIs for Java, Python, R, and SQL.” (NLPwSprk 2020)
- application - “An application is a program with an end user. Many applications have a graphical user interface (GUI), though this is not necessary. In this book, we also consider programs that do batch data processing as “applications”.” (NLPwSprk 2020)
- array - “An array is a data structure where elements are associated with an index. They are implemented differently in different programming languages. Numpy arrays, `ndarrays`, are the most popular kind of arrays used by Python users (especially among data scientists).” (NLPwSprk 2020)
- autoencoder - “An autoencoder is a neural-network–based technique used to convert some input data into vectors, matrices, or tensors. This new representation is generally of a lower dimension than the input data.” (NLPwSprk 2020)
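The annotation entry above names three pieces of information: a start index, an end index, and an annotation type. A minimal Python sketch of such a record follows. The class name `Annotation`, its field names, and the exclusive end index are assumptions made for illustration; this is not the Spark NLP annotation class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record: a typed span over a text."""
    start: int            # index of the first character of the segment
    end: int              # index one past the last character (assumed exclusive)
    annotation_type: str  # e.g. "token" or "programming_language"

    def covered_text(self, text: str) -> str:
        # Slice the annotated segment out of the original text.
        return text[self.start:self.end]


text = "Spark was developed in Scala."

# Locate the segments by searching, so the indices are not hard-coded.
scala_start = text.index("Scala")
token = Annotation(start=0, end=len("Spark"), annotation_type="token")
language = Annotation(
    start=scala_start,
    end=scala_start + len("Scala"),
    annotation_type="programming_language",
)

print(token.covered_text(text))
print(language.covered_text(text))
```

Storing only indices and a type keeps the annotation separate from the text, so several annotations can overlap the same segment.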
B
- Bidirectional Encoder Representations from Transformers (BERT) - “BERT from Google is a technique for converting words into a vector representation. Unlike Word2vec, which disregards context, BERT uses the context a word is found in to produce the vector.” (NLPwSprk 2020)
C
- classification - “In a machine learning context, classification is the task of assigning classes to examples. The simplest form is the binary classification task where each example can have one of two classes. The binary classification task is a special case of the multiclass classification task where each example can have one of a fixed set of classes. There is also the multilabel classification task where each example can have zero or more labels from a fixed set of labels.” (NLPwSprk 2020)
- clustering - “In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms.” (NLPwSprk 2020)
- container - “In software there are two common senses of “container.” In this book, the term is primarily used to refer to a virtual environment that contains a program or programs. The term “container” is also sometimes used to refer to an abstract data type or data structure that contains a collection of elements.” (NLPwSprk 2020)
- CSV - “A CSV (Comma Separated Values) file is a common way to store structured data. Elements are separated by commas, and rows are separated by new lines. Another common separator is the tab character. Files that use the tab are called TSVs. It is not uncommon for files that use a separator other than a comma to still be called CSVs.” (NLPwSprk 2020)
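The CSV entry above notes that the same tabular layout can use a comma or a tab as the separator. The sketch below reads both variants with Python's standard `csv` module. The sample data and the helper name `read_rows` are made up for illustration.

```python
import csv
import io

# The same two-row table, once comma-separated and once tab-separated.
csv_data = "term,definition\ntoken,a unit of text\n"
tsv_data = "term\tdefinition\ntoken\ta unit of text\n"


def read_rows(raw, delimiter=","):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(raw), delimiter=delimiter))


csv_rows = read_rows(csv_data)
tsv_rows = read_rows(tsv_data, delimiter="\t")

print(csv_rows)
print(csv_rows == tsv_rows)
```

Only the `delimiter` argument changes between the two calls, which is one reason tab-separated files are often still loosely called CSVs.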
D
- data scientist - “A data scientist is someone who uses scientific techniques to analyze data or build applications that consume data.” (NLPwSprk 2020)
- DataFrame - “A DataFrame is a data structure that is used to manipulate tabular data.” (NLPwSprk 2020)
- decision tree - “In a machine learning context, a decision tree is a data structure that is built for classification or regression tasks. Each node in the tree splits on a particular feature.” (NLPwSprk 2020)
- deep learning - “Deep learning is a collection of neural-network techniques that generally use multiple layers.” (NLPwSprk 2020)
- dialect - “In a linguistics context, a dialect is a particular variety of a language associated with a specific group of people.” (NLPwSprk 2020)
- differentiate - “In a mathematics context, to differentiate is to find the derivative of a function. The derivative function is a function that maps from the domain to the instantaneous rate of change of the original function.” (NLPwSprk 2020)
- distributed computing - “Distributed computing is using multiple computers to perform parallelized computation.” (NLPwSprk 2020)
- distributional semantics - “In an NLP context, this refers to techniques that attempt to represent words in a numerical form, almost always a vector, based on the words’ distribution in a corpus. This name originally comes from linguistics where it refers to theories that attempt to use the distribution of words in data to understand the words’ semantics.” (NLPwSprk 2020)
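The differentiate entry above describes the derivative as the instantaneous rate of change of a function. The sketch below approximates that rate numerically with a central difference. This is an illustration only, not the symbolic or automatic differentiation that deep learning libraries perform; the helper name `derivative` and the step size `h` are assumptions.

```python
def derivative(f, h=1e-5):
    """Return a function that approximates f' using a central difference."""
    def f_prime(x):
        # Slope of the line through two points just either side of x.
        return (f(x + h) - f(x - h)) / (2 * h)
    return f_prime


def square(x):
    return x ** 2


square_prime = derivative(square)

# Analytically, d/dx x^2 = 2x, so the value at x = 3 should be close to 6.
print(square_prime(3.0))
```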
E
- embedding - “In an NLP context, an embedding is a technique of representing words (or other language elements) as a vector, especially when such a representation is produced by a neural network.” (NLPwSprk 2020)
F
G
- graph - “In a computer science or mathematics context, a graph is a set of nodes and edges that connect the nodes.” (NLPwSprk 2020)
- guidelines - “In a human labeling context, guidelines are the instructions given to the human labelers.” (NLPwSprk 2020)
H
- hyperparameter - “In a machine learning context, a hyperparameter is a setting of a learning algorithm. For example, in a neural network, the weights are parameters, but the number and size of the layers are hyperparameters.” (NLPwSprk 2020)
I
- interlabeler agreement - “In a human labeling context, interlabeler agreement is a measure of how much labelers agree (generally unknowingly) when labeling the same example.” (NLPwSprk 2020)
- inverted index - “In an information retrieval context, an inverted index is a mapping from words to the documents that contain the words.” (NLPwSprk 2020)
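The inverted index entry above describes a mapping from words to the documents that contain them. A minimal sketch in plain Python follows. The document IDs, the sample texts, and the lowercase whitespace tokenization are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


documents = {
    "doc1": "spark is a distributed computing framework",
    "doc2": "hadoop is an open source implementation",
}
index = build_inverted_index(documents)

print(sorted(index["is"]))
print(sorted(index["spark"]))
```

Looking up a word is then a single dictionary access, rather than a scan over every document.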
J
K
- knowledge base - “A knowledge base is a collection of knowledge or facts in a computationally usable format.” (NLPwSprk 2020)
L
- language model - “In an NLP context, a language model is a model of the probability distribution of word sequences.” (NLPwSprk 2020)
- latent semantic indexing (LSI) - “LSI is a technique for topic modeling that performs singular value decomposition on the term-document matrix.” (NLPwSprk 2020)
- linear algebra - “Linear algebra is the branch of mathematics focused on linear equations. In a programming context, linear algebra generally refers to the mathematics that describe vectors, matrices, and their associated operations.” (NLPwSprk 2020)
- linear regression - “Linear regression is a statistical technique for modeling the relationship between a single variable and one or more other variables. In a machine learning context, linear regression refers to a regression model based on this statistical technique.” (NLPwSprk 2020)
- linguistic typology - “Linguistic typology is a field of linguistics that groups languages by their traits.” (NLPwSprk 2020)
- logging - “In a software context, logging is information output by an application for use in monitoring and debugging the application.” (NLPwSprk 2020)
- logistic regression - “Logistic regression is a statistical technique for modeling the probability of an event. In a machine learning context, logistic regression refers to a classification model based on this statistical technique.” (NLPwSprk 2020)
- loss - “In a machine learning context, loss refers to a measure of how wrong a supervised model is.” (NLPwSprk 2020)
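The language model entry above describes a probability distribution over word sequences. One simple way to make that concrete is a bigram model estimated from counts, sketched below. The bigram approximation, the `<s>` and `</s>` boundary markers, the two-sentence corpus, and the function names are all assumptions for illustration; the source does not prescribe a particular model.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count context words and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])          # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def sequence_probability(sentence, context_counts, bigram_counts):
    """Probability of a sentence as a product of P(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        probability *= bigram_counts[(previous, current)] / context_counts[previous]
    return probability


corpus = ["the cat sat", "the dog sat"]
context_counts, bigram_counts = train_bigram_model(corpus)

p_seen = sequence_probability("the cat sat", context_counts, bigram_counts)
p_unseen = sequence_probability("cat the sat", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

Without smoothing, any sequence containing an unseen bigram gets probability zero, which is why practical language models redistribute some probability mass to unseen events.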
M
- machine learning - “Machine learning is a field of computer science and mathematics that focuses on algorithms for building and using models “learned” from data.” (NLPwSprk 2020)
- MapReduce - “MapReduce is a style of programming based on functional programming that was the basis of Hadoop.” (NLPwSprk 2020)
- matrix - “A matrix is a rectangular array of numeric values. The mathematical definition is much more abstract.” (NLPwSprk 2020)
- model - “In a general scientific context, a model is some formal description, especially a mathematical one, of a phenomenon or system. In the machine learning context, a model is a set of hyperparameters, a set of learned parameters, and an evaluation or prediction function, especially one learned from data. In Spark MLlib, a model is what is produced by an Estimator when fitted to data.” (NLPwSprk 2020)
- model publishing - “Once a machine learning model has been learned, it must be published to be used by other applications.” (NLPwSprk 2020)
- monitoring - “In a software context, monitoring is the process of recording and publishing information about a running application.” (NLPwSprk 2020)
- morphology - “Morphology is a branch of linguistics focused on structure and parts of a word (actually morphemes).” (NLPwSprk 2020)
N
- N-gram - “An N-gram is a subsequence of words. Sometimes, “N-gram” can refer to a subsequence of characters.” (NLPwSprk 2020)
- naïve Bayes - “Naïve Bayes is a classification technique built on the naïve assumption that the features are all independent of each other.” (NLPwSprk 2020)
- natural language - “Natural language is a language spoken or signed by people, in contrast to a programming language which is used for giving instruction to computers. Natural language also contrasts with artificial or constructed languages, which are designed by a person or group of people.” (NLPwSprk 2020)
- natural language processing (NLP) - “NLP is a field of computer science and linguistics focused on techniques and algorithms for processing data containing natural language.” (NLPwSprk 2020)
- neural network - “An artificial neural network is a collection of neurons connected by weights.” (NLPwSprk 2020)
- notebook - “In this book, a notebook refers to a programming and writing environment, for example Jupyter Notebook and Databricks notebooks.” (NLPwSprk 2020)
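The N-gram entry above covers subsequences of words and, sometimes, of characters. Because Python strings and lists are both sequences, one small helper can illustrate both cases. The function name `ngrams` and the sample inputs are assumptions.

```python
def ngrams(sequence, n):
    """Return every contiguous length-n subsequence of `sequence` as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


words = "natural language processing with spark".split()

word_bigrams = ngrams(words, 2)       # N-grams over words
char_trigrams = ngrams("spark", 3)    # N-grams over characters

print(word_bigrams)
print(char_trigrams)
```

A sequence shorter than `n` yields an empty list, since there is no complete N-gram to return.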
O
- object - “In an object-oriented programming context, an object is an instance of a class or type.” (NLPwSprk 2020)
- overfitting - “In machine learning, our data has biases as well as useful information for our task. The more exactly our machine learning model fits the data, the more it reflects these biases. This means that the predictions may be based on spurious relationships that incidentally occur in the training data.” (NLPwSprk 2020)
P
- pandas - “pandas is a Python library for data analysis and processing that uses DataFrames.” (NLPwSprk 2020)
- parallelism - “In computer science, parallelism is how much an algorithm is or can be distributed across multiple threads, processes, or machines.” (NLPwSprk 2020)
- parameter - “In a mathematics context, a parameter is a value in a mathematical model. In a programming context, a parameter is another name for an argument of a function. In a machine learning context, a parameter is a value learned in the training process using the training data.” (NLPwSprk 2020)
- parts of speech (POS) - “POS are word categories. The most well known are nouns and verbs. In an NLP context, the Penn Treebank tags are the most frequently used set of parts of speech.” (NLPwSprk 2020)
- pipeline - “In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a PipelineModel containing transformers, models, and evaluators.” (NLPwSprk 2020)
- product owner - “In software development, the product owner is the person or people who represent the customer in the development process. They also own the requirements and prioritize development tasks.” (NLPwSprk 2020)
- programming language - “A programming language is a formal language for writing high-level (human readable) instructions for a computer.” (NLPwSprk 2020)
- Python - “Python is a programming language that is popular among NLP developers and data scientists. It is a multi-paradigm language, allowing object-oriented, functional, and imperative programming.” (NLPwSprk 2020)
R
- random forest - “Random forest is a machine learning technique for training an ensemble of decision trees. The training data for each decision tree is a subset of the rows and features of the total data.” (NLPwSprk 2020)
- recurrent neural network (RNN) - “An RNN is a special kind of neural network used for modeling sequential data.” (NLPwSprk 2020)
- regression - “In a machine learning context, regression is the task of assigning a scalar value to examples.” (NLPwSprk 2020)
- regular expression - “A regular expression is a string that defines a pattern to be matched in text.” (NLPwSprk 2020)
- repository - “In a software context, a repository is a data store that contains the code and/or data for a project.” (NLPwSprk 2020)
- resilient distributed dataset (RDD) - “In Spark, an RDD is a distributed collection. In early versions of Spark, they were the fundamental elements of Spark programming.” (NLPwSprk 2020)
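The regular expression entry above describes a string that defines a pattern to be matched in text. The sketch below uses Python's standard `re` module. The pattern (a capitalized word: one uppercase ASCII letter followed by lowercase letters) and the sample sentence are chosen only for illustration.

```python
import re

# \b marks a word boundary, so the whole word must fit the pattern.
capitalized_word = re.compile(r"\b[A-Z][a-z]+\b")

sentence = "Spark was developed in Scala, with APIs for Java and Python."
matches = capitalized_word.findall(sentence)

print(matches)
```

Note that "APIs" is not matched, because its second letter is uppercase and the pattern requires lowercase letters after the first character.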
S
- semantics - “Semantics is a branch of linguistics focused on the meaning communicated by language.” (NLPwSprk 2020)
- serialization - “In computing, serialization is the process of converting objects or other programming elements into a format for storage.” (NLPwSprk 2020)
- software developer - “A software developer is someone who writes software, especially using software engineering.” (NLPwSprk 2020)
- software development - “Software development is the process of making an application (or an update to an application) available in the production environment.” (NLPwSprk 2020)
- software engineering - “Software engineering is the discipline and best practices used in developing software.” (NLPwSprk 2020)
- software library - “A software library is a piece of software that is not necessarily an application. Applications are generally built by combining libraries. Some software libraries also contain applications.” (NLPwSprk 2020)
- software test - “A software test is a program, function, or set of human instructions used to test or verify the behavior of a piece of software.” (NLPwSprk 2020)
- stakeholder - “In software development, a stakeholder is a person who has a vested interest in the software being developed. For example, customers and users are stakeholders.” (NLPwSprk 2020)
- stop word - “In an NLP context, a stop word is a word or token that is considered to have negligible value for the given task.” (NLPwSprk 2020)
- structured query language (SQL) - “SQL is a programming language used to interact with relational data.” (NLPwSprk 2020)
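The stop word entry above stresses that a stop word is negligible "for the given task", so the list is a choice rather than a fixed fact about the language. The sketch below filters tokens against a small list. The list `STOP_WORDS` and the helper `remove_stop_words` are hypothetical and deliberately tiny; real stop word lists are longer and task-specific.

```python
# Hypothetical stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The token is a unit of text".split()
content_tokens = remove_stop_words(tokens)

print(content_tokens)
```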
T
- tag - “In an NLP context, a tag is a kind of annotation where a subsequence, especially a token, is marked with a label from a fixed set of labels. For example, annotators that identify the POS of tokens are often called POS taggers.” (NLPwSprk 2020)
- TensorFlow - “TensorFlow is a data processing and mathematics library. It was popularized for its implementation of neural networks.” (NLPwSprk 2020)
- TF.IDF - “TF.IDF refers to the technique developed in information retrieval. TF refers to the term frequency of a given term in a given document, and IDF refers to the inverse of the document frequency of the given term. TF.IDF is the product of the term frequency and the inverse document frequency which is supposed to represent the relevance of the given document to the given term.” (NLPwSprk 2020)
- thread - “In computing, a thread is a subsequence of instructions in a program that may be executed in parallel.” (NLPwSprk 2020)
- token - “In an NLP context, a token is a unit of text, generally — but not necessarily — a word.” (NLPwSprk 2020)
- topic - “In an NLP context, a topic is a kind of cluster of meaning (or a quantified representation).” (NLPwSprk 2020)
- Transformer - “In a Spark MLlib context, a Transformer is a stage of a pipeline that does not need to be fit or trained on data.” (NLPwSprk 2020)
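The TF.IDF entry above defines the score as the product of a term frequency and an inverse document frequency. The sketch below computes one common variant, assuming a raw count for TF and `log(N / df)` for IDF. The logarithm is an assumption (the entry only says "inverse"), and the toy corpus and function name `tf_idf` are made up for illustration.

```python
import math


def tf_idf(term, document, corpus):
    """TF.IDF of `term` for one tokenized document within a tokenized corpus."""
    tf = document.count(term)                       # raw term frequency
    df = sum(1 for doc in corpus if term in doc)    # documents containing the term
    if df == 0:
        return 0.0                                  # term never occurs in the corpus
    idf = math.log(len(corpus) / df)                # assumed log-scaled inverse
    return tf * idf


corpus = [
    ["spark", "is", "a", "framework"],
    ["hadoop", "is", "a", "framework"],
    ["spark", "nlp", "is", "spark"],
]

score_common = tf_idf("is", corpus[0], corpus)     # appears in every document
score_spark = tf_idf("spark", corpus[2], corpus)   # frequent here, rarer elsewhere

print(score_common, score_spark)
```

A term that appears in every document scores zero under this variant, which matches the intuition that it says nothing about which document is relevant.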
U
V
W
- word - “In linguistics, a word is loosely defined as an unbound morpheme, that is, a unit of language that can be used alone and still have meaning.” (NLPwSprk 2020)
- word vector - “In distributional semantics, a word is represented as a vector. The mapping from word to vector is learned from data.” (NLPwSprk 2020)
- Word2vec - “Word2vec is a distributional semantics technique that learns word representations by building a neural network.” (NLPwSprk 2020)
X