https://DevOpsCloud.io -- Cloud Monk Losang Jinpa, Ph.D., MCSE/MCT, GitOps DevOps Engineer

Natural Language Processing with Spark NLP Glossary

A

algorithmic complexity - “The complexity of an algorithm is generally measured in the time it takes to run or how much space (memory or disk space) is needed to run it.” (NLPwSprk 2020)

annotation - “In an NLP context, an annotation is a marking on a segment of text or audio with some extra information. Generally, an annotation will require character indices for the start and end of the annotated segment, as well as an annotation type.” (NLPwSprk 2020)

annotator - “An annotator is a function that takes text and produces annotations. It is not uncommon for some annotators to have a dependency on another type of annotator.” (NLPwSprk 2020)
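
The two entries above can be illustrated with a small sketch. It is not the Spark NLP API: the `Annotation` class and the `token_annotator` function are hypothetical names, and the tokenization rule (runs of word characters) is an assumption chosen for brevity.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int            # index of the first character of the segment
    end: int              # index one past the last character (exclusive)
    annotation_type: str  # e.g. "token"
    text: str             # the covered text, kept for convenience


def token_annotator(text: str) -> list[Annotation]:
    """Hypothetical annotator: marks each run of word characters as a token."""
    return [
        Annotation(m.start(), m.end(), "token", m.group())
        for m in re.finditer(r"\w+", text)
    ]


annotations = token_annotator("Spark NLP annotates text.")
```

An annotator that depends on another annotator, such as a POS tagger, would take these token annotations as its input rather than the raw text.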

Apache Hadoop - “Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig.” (NLPwSprk 2020)

Apache Parquet - “Parquet is a data format originally created for Hadoop. It allows for efficient compression of columnar data. It is a popular format in the Spark ecosystem.” (NLPwSprk 2020)

Apache Spark - “Spark is a distributed computing framework with a high-level interface and in memory processing. Spark was developed in Scala, but there are now APIs for Java, Python, R, and SQL.” (NLPwSprk 2020)

application - “An application is a program with an end user. Many applications have a graphical user interface (GUI), though this is not necessary. In this book, we also consider programs that do batch data processing as “applications”.” (NLPwSprk 2020)

array - “An array is a data structure where elements are associated with an index. They are implemented differently in different programming languages. Numpy arrays, `ndarrays`, are the most popular kind of arrays used by Python users (especially among data scientists).” (NLPwSprk 2020)

autoencoder - “An autoencoder is a neural-network–based technique used to convert some input data into vectors, matrices, or tensors. This new representation is generally of a lower dimension than the input data.” (NLPwSprk 2020)

B

Bidirectional Encoder Representations from Transformers (BERT) - “BERT from Google is a technique for converting words into a vector representation. Unlike Word2vec, which disregards context, BERT uses the context a word is found in to produce the vector.” (NLPwSprk 2020)

C

classification - “In a machine learning context, classification is the task of assigning classes to examples. The simplest form is the binary classification task where each example can have one of two classes. The binary classification task is a special case of the multiclass classification task where each example can have one of a fixed set of classes. There is also the multilabel classification task where each example can have zero or more labels from a fixed set of labels.” (NLPwSprk 2020)

clustering - “In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms.” (NLPwSprk 2020)

container - “In software there are two common senses of “container.” In this book, the term is primarily used to refer to a virtual environment that contains a program or programs. The term “container” is also sometimes used to refer to an abstract data type of data structure that contains a collection of elements.” (NLPwSprk 2020)

context - “In an NLP context, “context” generally refers to the surrounding language data around a segment of text or audio. In linguistics, it can also refer to the “real world” context in which a language act occurs.” (NLPwSprk 2020)

CSV - “A CSV (Comma Separated Values) file is a common way to store structured data. Elements are separated by commas, and rows are separated by new lines. Another common separator is the tab character. Files that use the tab are called TSVs. It is not uncommon for files that use a separator other than a comma to still be called CSVs.” (NLPwSprk 2020)
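
As a minimal sketch of the separator point above, Python's standard `csv` module reads both variants; only the `delimiter` argument changes. The sample data is hypothetical.

```python
import csv
import io

# Hypothetical sample data; in practice this would come from a file.
csv_text = "word,count\nspark,3\nnlp,5\n"
tsv_text = "word\tcount\nspark\t3\nnlp\t5\n"

# The same reader handles both formats; only the delimiter differs.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```

Both calls produce the same rows. Note that every value is read as a string, so numeric columns need an explicit conversion.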

D

data scientist - “A data scientist is someone who uses scientific techniques to analyze data or build applications that consume data.” (NLPwSprk 2020)

DataFrame - “A DataFrame is a data structure that is used to manipulate tabular data.” (NLPwSprk 2020)

decision tree - “In a machine learning context, a decision tree is a data structure that is built for classification or regression tasks. Each node in the tree splits on a particular feature.” (NLPwSprk 2020)

deep learning - “Deep learning is a collection of neural-network techniques that generally use multiple layers.” (NLPwSprk 2020)

dialect - “In a linguistics context, a dialect is a particular variety of a language associated with a specific group of people.” (NLPwSprk 2020)

differentiate - “In a mathematics context, to differentiate is to find the derivative of a function. The derivative function is a function that maps from the domain to the instantaneous rate of change of the original function.” (NLPwSprk 2020)

discourse - “In a linguistics context, a discourse is a sequence of language acts, especially between two or more people.” (NLPwSprk 2020)

distributed computing - “Distributed computing is using multiple computers to perform parallelized computation.” (NLPwSprk 2020)

distributional semantics - “In an NLP context, this refers to techniques that attempt to represent words in a numerical form, almost always a vector, based on the words’ distribution in a corpus. This name originally comes from linguistics where it refers to theories that attempt to use the distribution of words in data to understand the words’ semantics.” (NLPwSprk 2020)

Docker - “Docker is software that allows users to create containers (virtual environments) with Docker scripts.” (NLPwSprk 2020)

document - “In an NLP context, a document is a complete piece of text especially if it contains multiple sentences.” (NLPwSprk 2020)

E

embedding - “In an NLP context, an embedding is a technique of representing words (or other language elements) as a vector, especially when such a representation is produced by a neural network.” (NLPwSprk 2020)

encoding - “In an NLP context, the encoding or character encoding refers to the mapping from characters, e.g. “a”, “?”, to bytes.” (NLPwSprk 2020)
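
The character-to-bytes mapping can be seen directly with Python's built-in `str.encode` and `bytes.decode`. This is a small sketch; the sample word is arbitrary.

```python
text = "na\u00efve"  # the word "naive" spelled with U+00EF (i with diaeresis)

utf8_bytes = text.encode("utf-8")      # U+00EF takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # U+00EF takes one byte in Latin-1

# Decoding with the wrong encoding silently produces garbled text.
garbled = utf8_bytes.decode("latin-1")
```

Here `utf8_bytes` is six bytes long and `latin1_bytes` is five, and `garbled` no longer equals the original string, which is why an NLP application needs to know the encoding of its input.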

estimator - “In a Spark MLlib context, an estimator is a stage of a pipeline that uses data to produce a model that transforms the data.” (NLPwSprk 2020)

evaluator - “In a Spark MLlib context, an evaluator is a stage of a pipeline that produces metrics from predictions.” (NLPwSprk 2020)

F

feature - “In a machine learning context, a feature is an attribute of an input, especially a numerical attribute. For example, if the input is a document, the number of unique tokens in the document is a feature. The words present in a document are also referred to as features.” (NLPwSprk 2020)

function - “In a programming context, a function is a sequence of instructions. In a mathematics context, a function is a mapping between two sets, the domain and the range, such that each element of the domain is mapped to a single element in the range.” (NLPwSprk 2020)

G

GloVe - “GloVe is a distributional semantics technique for representing words as vectors using word-to-word co-occurrences.” (NLPwSprk 2020)

graph - “In a computer science or mathematics context, a graph is a set of nodes and edges that connect the nodes.” (NLPwSprk 2020)

guidelines - “In a human labeling context, guidelines are the instructions given to the human labelers.” (NLPwSprk 2020)

H

hidden Markov model - “A hidden Markov model is a technique for modeling sequences using a hidden state that only uses the previous part of the sequence.” (NLPwSprk 2020)

hyperparameter - “In a machine learning context, a hyperparameter is a setting of a learning algorithm. For example, in a neural network, the weights are parameters, but the number and size of the layers are hyperparameters.” (NLPwSprk 2020)

I

index - “In an information retrieval context, an index is a mapping from documents to the words contained in the documents.” (NLPwSprk 2020)

interlabeler agreement - “In a human labeling context, interlabeler agreement is a measure of how much labelers agree (generally unknowingly) when labeling the same example.” (NLPwSprk 2020)

inverted index - “In an information retrieval context, an inverted index is a mapping from words to the documents that contain the words.” (NLPwSprk 2020)
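
The difference between the index and the inverted index defined above can be sketched with plain dictionaries. The corpus and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> text.
documents = {
    "doc1": "spark makes nlp fast",
    "doc2": "nlp needs data",
}

# Index: document -> the words it contains.
index = {doc_id: set(text.split()) for doc_id, text in documents.items()}

# Inverted index: word -> the documents that contain it.
inverted_index = defaultdict(set)
for doc_id, words in index.items():
    for word in words:
        inverted_index[word].add(doc_id)
```

Looking up `inverted_index["nlp"]` returns both document IDs without scanning the documents, which is what makes the inverted form useful for search.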

J

Java - “Java is an object-oriented programming language. Java is almost always compiled to run on the Java Virtual Machine (JVM). Scala and a number of other popular languages run on the JVM and so are interoperable with Java.” (NLPwSprk 2020)

Java Virtual Machine (JVM) - “The JVM is a virtual machine that runs programs that have been compiled into Java byte code. As the name suggests, Java is the primary language which uses the JVM, but Scala and a number of other programming languages use it as well.” (NLPwSprk 2020)

JSON - “JavaScript Object Notation (JSON) is a data format.” (NLPwSprk 2020)

K

K-Means - “K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid.” (NLPwSprk 2020)
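
The procedure in the K-Means entry can be sketched in a few lines. This version is a simplification under stated assumptions: it works on one-dimensional points only, picks the initial centroids from the data with a fixed seed, and runs a fixed number of iterations instead of testing for convergence. The function name `k_means_1d` is hypothetical.

```python
import random


def k_means_1d(points, k, iterations=20, seed=0):
    """Minimal K-Means on 1-D points; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial placement
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


centroids = k_means_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2)
```

On this well-separated toy data the centroids settle on the means of the two groups, 2.0 and 11.0. On less tidy data the result can depend on the random starting points.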

knowledge base - “A knowledge base is a collection of knowledge or facts in a computationally usable format.” (NLPwSprk 2020)

L

labeling - “In a machine learning context, labeling is the process of assigning labels to examples, especially when done by humans.” (NLPwSprk 2020)

language model - “In an NLP context, a language model is a model of the probability distribution of word sequences.” (NLPwSprk 2020)
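
One of the simplest language models is a bigram model estimated from counts. The sketch below assumes a tiny, pre-tokenized, hypothetical corpus and uses unsmoothed maximum-likelihood estimates, so any unseen word pair gets probability zero.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
sentences = [
    ["the", "dog", "barks"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "sleeps"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for tokens in sentences:
    padded = ["<s>"] + tokens + ["</s>"]
    unigram_counts.update(padded[:-1])             # count each history word
    bigram_counts.update(zip(padded, padded[1:]))  # count adjacent pairs


def bigram_probability(previous: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | previous)."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]


def sentence_probability(tokens: list[str]) -> float:
    """Probability of a whole sentence under the bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    probability = 1.0
    for previous, word in zip(padded, padded[1:]):
        probability *= bigram_probability(previous, word)
    return probability
```

For example, `bigram_probability("the", "dog")` is 2/3 here, because "dog" follows two of the three occurrences of "the".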

latent Dirichlet allocation (LDA) - “LDA is a technique for topic modeling that treats documents as a sequence of words selected from weighted topics (probability distributions over words).” (NLPwSprk 2020)

latent semantic indexing (LSI) - “LSI is a technique for topic modeling that performs singular value decomposition on the term-document matrix.” (NLPwSprk 2020)

linear algebra - “Linear algebra is the branch of mathematics focused on linear equations. In a programming context, linear algebra generally refers to the mathematics that describe vectors, matrices, and their associated operations.” (NLPwSprk 2020)

linear regression - “Linear regression is a statistical technique for modeling the relationship between a single variable and one or more other variables. In a machine learning context, linear regression refers to a regression model based on this statistical technique.” (NLPwSprk 2020)

linguist - “A linguist is a person who studies human languages.” (NLPwSprk 2020)

linguistic typology - “Linguistic typology is a field of linguistics that groups languages by their traits.” (NLPwSprk 2020)

logging - “In a software context, logging is information output by an application for use in monitoring and debugging the application.” (NLPwSprk 2020)

logistic regression - “Logistic regression is a statistical technique for modeling the probability of an event. In a machine learning context, logistic regression refers to a classification model based on this statistical technique.” (NLPwSprk 2020)

long short-term memory (LSTM) - “LSTM is a neural-network technique that is used for learning sequences. It attempts to learn when to use and update the context.” (NLPwSprk 2020)

loss - “In a machine learning context, loss refers to a measure of how wrong a supervised model is.” (NLPwSprk 2020)

M

machine learning - “Machine learning is a field of computer science and mathematics that focuses on algorithms for building and using models “learned” from data.” (NLPwSprk 2020)

MapReduce - “MapReduce is a style of programming based on functional programming that was the basis of Hadoop.” (NLPwSprk 2020)

matrix - “A matrix is a rectangular array of numeric values. The mathematical definition is much more abstract.” (NLPwSprk 2020)

metric - “In a machine learning context, a metric is a measure of how good or bad a particular model is at its task. In a software context, a metric is a measure defined for an application, program, or function.” (NLPwSprk 2020)

model - “In a general scientific context, a model is some formal description, especially a mathematical one, of a phenomenon or system. In the machine learning context, a model is a set of hyperparameters, a set of learned parameters, and an evaluation or prediction function, especially one learned from data. In Spark MLlib, a model is what is produced by an Estimator when fitted to data.” (NLPwSprk 2020)

model publishing - “Once a machine learning model has been learned, it must be published to be used by other applications.” (NLPwSprk 2020)

model training - “Model training is the process of fitting a model to data.” (NLPwSprk 2020)

monitoring - “In a software context, monitoring is the process of recording and publishing information about a running application.” (NLPwSprk 2020)

morphology - “Morphology is a branch of linguistics focused on structure and parts of a word (actually morphemes).” (NLPwSprk 2020)

N

N-gram - “An N-gram is a subsequence of words. Sometimes, “N-gram” can refer to a subsequence of characters.” (NLPwSprk 2020)
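
A short sketch of both senses of the term, word N-grams and character N-grams. The helper name `ngrams` is hypothetical, and it takes "subsequence" to mean a contiguous run, which is the usual reading.

```python
def ngrams(sequence, n):
    """Return every contiguous subsequence of length n as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


tokens = "the red dog wags his tail".split()
word_bigrams = ngrams(tokens, 2)   # N-grams over words
char_trigrams = ngrams("spark", 3)  # N-grams over characters
```

A sequence of six tokens yields five bigrams, and a sequence shorter than `n` yields an empty list.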

naïve Bayes - “Naïve Bayes is a classification technique built on the naïve assumption that the features are all independent of each other.” (NLPwSprk 2020)

named-entity recognition (NER) - “NER is a task in NLP that focuses on finding particular entities in text.” (NLPwSprk 2020)

natural language - “Natural language is a language spoken or signed by people, in contrast to a programming language which is used for giving instruction to computers. Natural language also contrasts with artificial or constructed languages, which are designed by a person or group of people.” (NLPwSprk 2020)

natural language processing (NLP) - “NLP is a field of computer science and linguistics focused on techniques and algorithms for processing data containing natural language.” (NLPwSprk 2020)

neural network - “An artificial neural network is a collection of neurons connected by weights.” (NLPwSprk 2020)

notebook - “In this book, a notebook refers to a programming and writing environment, for example Jupyter Notebook and Databricks notebooks.” (NLPwSprk 2020)

numpy - “Numpy is a Python library for performing linear algebra operations and an assortment of other mathematical operations.” (NLPwSprk 2020)

O

object - “In an object-oriented programming context, an object is an instance of a class or type.” (NLPwSprk 2020)

optical character recognition (OCR) - “OCR is the set of techniques used to identify characters in an image.” (NLPwSprk 2020)

overfitting - “In machine learning, our data has biases as well as useful information for our task. The more exactly our machine learning model fits the data, the more it reflects these biases. This means that the predictions may be based on spurious relationships that incidentally occur in the training data.” (NLPwSprk 2020)

P

pandas - “pandas is a Python library for data analysis and processing that uses DataFrames.” (NLPwSprk 2020)

parallelism - “In computer science, parallelism is how much an algorithm is or can be distributed across multiple threads, processes, or machines.” (NLPwSprk 2020)

parameter - “In a mathematics context, a parameter is a value in a mathematical model. In a programming context, a parameter is another name for an argument of a function. In a machine learning context, a parameter is a value learned in the training process using the training data.” (NLPwSprk 2020)

partition - “In Spark, a partition is a subset of the distributed data that is collocated on a machine.” (NLPwSprk 2020)

parts of speech (POS) - “POS are word categories. The most well known are nouns and verbs. In an NLP context, the Penn Treebank tags are the most frequently used set of parts of speech.” (NLPwSprk 2020)

PDF - “Portable document format (PDF) is a common file format for formatted text. It is a common input to NLP applications.” (NLPwSprk 2020)

phonetics - “Phonetics is the branch of linguistics focused on the study of speech sounds.” (NLPwSprk 2020)

phrase - “In linguistics, a phrase is a sequence of words that make up a constituency. For example, in the sentence “The red dog wags his tail,” “the red dog” is a noun phrase, but “the red” is not.” (NLPwSprk 2020)

pickle - “The pickle module is part of the Python standard library used for serializing data.” (NLPwSprk 2020)
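
A minimal round trip with the standard `pickle` module shows serialization in practice. The `vocabulary` dictionary is a hypothetical stand-in for whatever object needs to be stored.

```python
import pickle

# Hypothetical object standing in for a trained model or intermediate result.
vocabulary = {"spark": 0, "nlp": 1, "token": 2}

serialized = pickle.dumps(vocabulary)  # object -> bytes
restored = pickle.loads(serialized)    # bytes -> a new, equal object
```

The restored object is equal to the original but is a separate copy. Unpickling can execute arbitrary code, so it should only be applied to data from a trusted source.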

pipeline - “In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a Pipeline Model containing transformers, models, and evaluators.” (NLPwSprk 2020)

pragmatics - “Pragmatics is the branch of linguistics focused on understanding meaning in context.” (NLPwSprk 2020)

process - “In a computing context, a process is a running program.” (NLPwSprk 2020)

product owner - “In software development, the product owner is the person or people who represent the customer in the development process. They also own the requirements and prioritizing development tasks.” (NLPwSprk 2020)

production - “Production is the environment an application is deployed into.” (NLPwSprk 2020)

profiling - “In an application context, profiling is the process of measuring the resources an application or program requires to run.” (NLPwSprk 2020)

program - “A program is a set of instructions given to a computer.” (NLPwSprk 2020)

programming language - “A programming language is a formal language for writing high-level (human readable) instructions for a computer.” (NLPwSprk 2020)

Python - “Python is a programming language that is popular among NLP developers and data scientists. It is a multi-paradigm language, allowing object-oriented, functional, and imperative programming.” (NLPwSprk 2020)

R

random forest - “Random forest is a machine learning technique for training an ensemble of decision trees. The training data for each decision tree is a subset of the rows and features of the total data.” (NLPwSprk 2020)

recurrent neural network (RNN) - “An RNN is a special kind of neural network used for modeling sequential data.” (NLPwSprk 2020)

register - “In linguistics, a register is a variation of language that is defined by the context in which it is used. This contrasts with a dialect, which is defined by the group of people who speak it.” (NLPwSprk 2020)

regression - “In a machine learning context, regression is the task of assigning scalar value to examples.” (NLPwSprk 2020)

regular expression - “A regular expression is a string that defines a pattern to be matched in text.” (NLPwSprk 2020)
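
As a small illustration using Python's standard `re` module, the pattern below matches dates written as year-month-day. The sample text and its dates are made up for the example.

```python
import re

# Hypothetical sample text.
text = "Released 2020-06-25, updated 2021-01-04."

# Four digits, a hyphen, two digits, a hyphen, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
dates = date_pattern.findall(text)
```

`findall` returns every non-overlapping match in order. The pattern checks only the shape of the string, not whether the date is valid.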

repository - “In a software context, a repository is a data store that contains the code and/or data for a project.” (NLPwSprk 2020)

resilient distributed dataset (RDD) - “In Spark, an RDD is a distributed collection. In early versions of Spark, they were the fundamental elements of Spark programming.” (NLPwSprk 2020)

S

scale out - “In computing, scaling out is when more machines are used to increase the available resources.” (NLPwSprk 2020)

scale up - “In computing, scaling up is when a machine with more resources is used to increase available resources.” (NLPwSprk 2020)

schema - “In data engineering, a schema is the structure and some metadata (e.g. column names and types). In Spark, this is the metadata for defining a Spark DataFrame.” (NLPwSprk 2020)

script - “In programming, a script is a computer program that is generally written on a runnable code file (also called a script).” (NLPwSprk 2020)

Scrum - “Scrum is a style of agile software development. It is built around the idea of iterative development and short daily meetings (called scrums) where progress or problems are shared.” (NLPwSprk 2020)

search - “In computing, search is a task in information retrieval concerned with finding documents that are relevant to a query.” (NLPwSprk 2020)

semantics - “Semantics is a branch of linguistics focused on the meaning communicated by language.” (NLPwSprk 2020)

sentence - “In linguistics, a sentence is a special kind of phrase, especially a clausal phrase, that is considered complete.” (NLPwSprk 2020)

sentiment - “In an NLP context, sentiment is the emotion or opinion a human encodes in a language act.” (NLPwSprk 2020)

serialization - “In computing, serialization is the process of converting objects or other programming elements into a format for storage.” (NLPwSprk 2020)

software developer - “A software developer is someone who writes software, especially using software engineering.” (NLPwSprk 2020)

software development - “Software development is the process of making an application (or an update to an application) available in the production environment.” (NLPwSprk 2020)

software engineering - “Software engineering is the discipline and best practices used in developing software.” (NLPwSprk 2020)

software library - “A software library is a piece of software that is not necessarily an application. Applications are generally built by combining libraries. Some software libraries also contain applications.” (NLPwSprk 2020)

software test - “A software test is a program, function, or set of human instructions used to test or verify the behavior of a piece of software.” (NLPwSprk 2020)

Spark NLP - “Spark NLP is an NLP annotation library that extends Spark MLlib.” (NLPwSprk 2020)

stakeholder - “In software development, a stakeholder is a person who has a vested interest in the software being developed. For example, customers and users are stakeholders.” (NLPwSprk 2020)

stop word - “In an NLP context, a stop word is a word or token that is considered to have negligible value for the given task.” (NLPwSprk 2020)
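
Stop-word removal is usually a simple filter over tokens. The list below is a hypothetical example; since a stop word is defined relative to the task, real lists differ from one application to another.

```python
# Hypothetical stop-word list; real lists are larger and task-specific.
STOP_WORDS = {"the", "a", "an", "of", "is", "his"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


filtered = remove_stop_words("The red dog wags his tail".split())
```

The comparison is done on the lowercase form so that "The" at the start of a sentence is filtered as well.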

structured query language (SQL) - “SQL is a programming language used to interact with relational data.” (NLPwSprk 2020)

syntax - “In a linguistics context, syntax is a branch of linguistics focused on the structure of phrases and sentences. It is also used to refer to the rules used by a language for constructing phrases and sentences.” (NLPwSprk 2020)

T

tag - “In an NLP context, a tag is a kind of annotation where a subsequence, especially a token, is marked with a label from a fixed set of labels. For example, annotators that identify the POS of tokens are often called POS taggers.” (NLPwSprk 2020)

TensorFlow - “TensorFlow is a data processing and mathematics library. It was popularized for its implementation of neural networks.” (NLPwSprk 2020)

TF.IDF - “TF.IDF refers to the technique developed in information retrieval. TF refers to the term frequency of a given term in a given document, and IDF refers to the inverse of the document frequency of the given term. TF.IDF is the product of the term frequency and the inverse document frequency which is supposed to represent the relevance of the given document to the given term.” (NLPwSprk 2020)
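
The product described above can be computed directly. The entry does not fix a formula, so this sketch makes two assumptions: TF is the raw count of the term in the document, and IDF is the logarithm of the number of documents divided by the document frequency, which is one common variant among several. The corpus and the function name `tf_idf` are hypothetical.

```python
import math

# Hypothetical toy corpus: document ID -> tokens.
documents = {
    "doc1": ["spark", "nlp", "spark"],
    "doc2": ["nlp", "pipeline"],
    "doc3": ["data", "pipeline"],
}


def tf_idf(term, doc_id):
    """TF.IDF with raw-count TF and log(N / df) IDF (one common variant)."""
    tf = documents[doc_id].count(term)
    df = sum(1 for tokens in documents.values() if term in tokens)
    if df == 0:
        return 0.0  # the term appears nowhere in the corpus
    idf = math.log(len(documents) / df)
    return tf * idf
```

In this corpus "spark" scores higher than "nlp" for `doc1`: it occurs twice there and in no other document, while "nlp" occurs once and is shared with `doc2`.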

thread - “In computing, a thread is a subsequence of instructions in a program that may be executed in parallel.” (NLPwSprk 2020)

token - “In an NLP context, a token is a unit of text, generally — but not necessarily — a word.” (NLPwSprk 2020)

topic - “In an NLP context, a topic is a kind of cluster of meaning (or a quantified representation).” (NLPwSprk 2020)

Transformer - “In a Spark MLlib context, a Transformer is a stage of a pipeline that does not need to be fit or trained on data.” (NLPwSprk 2020)

U

Unicode - “Unicode is a standard for encoding characters.” (NLPwSprk 2020)

V

vector - “In a mathematics context, a vector is an element of a Cartesian space with more than one dimension.” (NLPwSprk 2020)

virtual machine - “A virtual machine is a software representation of a computer.” (NLPwSprk 2020)

W

word - “In linguistics, a word is loosely defined as an unbound morpheme, that is, a unit of language that can be used alone and still have meaning.” (NLPwSprk 2020)

word vector - “In distributional semantics, a word is represented as a vector. The mapping from word to vector is learned from data.” (NLPwSprk 2020)

Word2vec - “Word2vec is a distributional semantics technique that learns word representations by building a neural network.” (NLPwSprk 2020)

X

XML - “Extensible Markup Language is a markup language used to encode data.” (NLPwSprk 2020)

Fair Use Sources

B08BW4W4CZ (NLPwSprk 2020)
