Machine learning (ML) and other approaches used in natural language processing (NLP) usually work with numerical arrays known as vectors, one for each instance (also known as an observation, entity, or row) in the data set. The collection of all these arrays is referred to as a matrix, in which each row represents a single instance and each column represents a feature (or attribute).
The initial step in NLP is to turn the collection of text instances into a matrix, with each row being a numerical representation of a text instance (a vector). However, there are a few terms to understand before getting started with NLP.
Step by Step NLP process
A document is a single instance in NLP, whereas a corpus is a collection of instances. A document might be as simple as a short phrase or name or as complex as a complete book, depending on the problem at hand.
The next decision is how to decompose a document into smaller parts, a process known as tokenisation. The resulting parts are called tokens, and they are the smallest units of meaning that the algorithm can take into account. The vocabulary is the collection of all tokens found in the corpus.
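As a rough illustration of these terms, the sketch below tokenises a tiny corpus at the word level and collects the vocabulary. The two-document corpus, the `tokenise` helper, and the regular expression are assumptions made for this example, not a prescribed tokeniser.

```python
import re

def tokenise(document: str) -> list[str]:
    """Lowercase the text and split it into word tokens (one simple choice among many)."""
    return re.findall(r"[a-z0-9']+", document.lower())

# Hypothetical corpus: each string is one document.
corpus = [
    "This is the first document.",
    "This document is the second document.",
]

tokenised = [tokenise(doc) for doc in corpus]

# The vocabulary is the set of all tokens found anywhere in the corpus.
vocabulary = sorted({token for tokens in tokenised for token in tokens})

print(tokenised[0])   # ['this', 'is', 'the', 'first', 'document']
print(vocabulary)     # ['document', 'first', 'is', 'second', 'the', 'this']
```

A different tokeniser (characters, subwords, n-grams) would yield a different vocabulary from the same corpus, which is why this choice depends on the problem at hand.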
Taking words as tokens is a typical choice; in this case, a document is represented as a bag of words (BoW). The BoW model builds its vocabulary at the word level, which means that the vocabulary is the set of all the words found in the corpus. The algorithm then counts, for each document, the number of times each word in the vocabulary appears in that document.
Most terms in the corpus will not appear in most documents, so each document has zero counts for many tokens. That is essentially the whole concept, but when a data scientist generates the vectors from these counts, they must verify that the columns line up in the same way for each row.
Permuting the rows of this matrix, or of any other design matrix (a matrix that represents instances as rows and features as columns), has no effect on its meaning. The same is true of column permutations. Depending on how they map a token to a column index, data scientists get a different ordering of the columns but no meaningful change in the representation. Hashing is the process of mapping tokens to indexes, ideally in such a way that no two tokens map to the same index. A specific implementation is called a hash, hashing function, or hash function.
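The counting step can be sketched as follows. This is a minimal illustration that assumes the documents are already tokenised; the `bag_of_words` helper and the three sample documents are hypothetical. A single token-to-column mapping is built once and reused for every row, which is what keeps the columns aligned.

```python
from collections import Counter

def bag_of_words(tokenised_docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in document i."""
    vocabulary = sorted({token for doc in tokenised_docs for token in doc})
    # One shared mapping, so column j means the same token in every row.
    column = {token: j for j, token in enumerate(vocabulary)}
    matrix = []
    for doc in tokenised_docs:
        row = [0] * len(vocabulary)
        for token, count in Counter(doc).items():
            row[column[token]] = count
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical tokenised corpus.
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
]

vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Even in this small case several entries in each row are zero; with a realistic vocabulary the proportion of zeros is far higher.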
Vocabulary-based Hashing
NVIDIA constructed an implicit hash function while vectorising by hand. They allocated index 0 to the first previously unseen word, assuming a 0-indexing scheme. The index was then incremented, and the operation was repeated for each new word. Using this hash function, “This” was mapped to the 0-indexed column, “is” to the 1-indexed column, and “the” to the 3-indexed column. There are benefits and drawbacks to using a vocabulary-based hash function.
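The first-seen indexing scheme described above might look like the sketch below. The `build_vocabulary_index` helper and the sample documents are assumptions for illustration and do not reproduce the original worked example.

```python
def build_vocabulary_index(tokenised_docs):
    """Give each previously unseen token the next free column index, starting at 0."""
    index = {}
    for doc in tokenised_docs:
        for token in doc:
            if token not in index:
                index[token] = len(index)  # next unused index
    return index

# Hypothetical tokenised corpus.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
]

index = build_vocabulary_index(docs)
print(index)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
```

Because each new token receives a fresh index, no two tokens share a column, but the whole mapping has to be built, stored, and shared before any document can be vectorised.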
Fortunately, there is another way to hash tokens: apply a non-cryptographic mathematical hash function to each token. This form of hash function maps objects (represented by their bits) to a defined range of integers using a combination of arithmetic, modular arithmetic, and algebra. Because the range is known, its maximum value defines how many columns are in the matrix. The range is generally rather large; however, for most rows, the majority of columns will be 0. As a result, a sparse representation reduces the amount of memory needed to hold the matrix, and algorithms can efficiently execute sparse matrix-based operations.
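A minimal sketch of this idea follows, assuming CRC32 (from Python's standard `zlib` module) as the non-cryptographic hash and a hypothetical width of 2**20 columns; neither choice comes from the source. Each row is stored sparsely as a dictionary holding only its non-zero columns.

```python
import zlib
from collections import Counter

N_COLUMNS = 2 ** 20  # assumed matrix width; fixes the range of column indexes

def column_for(token: str, n_columns: int = N_COLUMNS) -> int:
    """Map a token to a column using CRC32 followed by modular arithmetic."""
    return zlib.crc32(token.encode("utf-8")) % n_columns

def hash_vectorise(tokens, n_columns: int = N_COLUMNS) -> dict:
    """Return a sparse row as {column index: count}, omitting all zero entries."""
    row = Counter()
    for token in tokens:
        row[column_for(token, n_columns)] += 1
    return dict(row)

row = hash_vectorise(["the", "cat", "saw", "the", "dog"])
print(len(row), "non-zero columns out of", N_COLUMNS)
```

Unlike the vocabulary-based scheme, two different tokens can occasionally land in the same column (a collision). A wide range makes that rare, at the cost of a wider, but still sparse, matrix.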
Furthermore, because there is no vocabulary, vectorisation with a mathematical hash function incurs no vocabulary storage overhead. As a result, parallelisation is not limited: the corpus can be split across any number of processes, each of which vectorises its part independently. Once each process has finished vectorising its part of the corpus, the generated matrices can be stacked to form the final matrix. By reducing bottlenecks, this parallelisation, made possible by the mathematical hash function, can substantially speed up the training pipeline.
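The split, vectorise, and stack pattern can be sketched as below. This is an illustration under several assumptions: CRC32 as the hash, a hypothetical width of 2**20 columns, and a thread pool standing in for the separate worker processes a real pipeline would more likely use. The point is only that the workers need no shared state.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

N_COLUMNS = 2 ** 20  # assumed matrix width

def hash_vectorise(tokens) -> dict:
    """Sparse row {column index: count}; depends only on the tokens themselves."""
    return dict(Counter(zlib.crc32(t.encode("utf-8")) % N_COLUMNS for t in tokens))

def vectorise_chunk(chunk):
    """Vectorise one part of the corpus independently of every other part."""
    return [hash_vectorise(doc) for doc in chunk]

def vectorise_in_parallel(docs, n_workers: int = 2):
    size = max(1, -(-len(docs) // n_workers))  # ceiling division
    chunks = [docs[i:i + size] for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(vectorise_chunk, chunks))  # map preserves chunk order
    # Stack the per-chunk matrices row-wise to form the final matrix.
    return [row for part in parts for row in part]

# Hypothetical tokenised corpus.
docs = [["the", "cat", "sat"], ["the", "dog"], ["a", "bird", "sang"], ["the", "end"]]
matrix = vectorise_in_parallel(docs, n_workers=2)
print(matrix == vectorise_chunk(docs))  # True: same result as a single pass
```

Stacking works without any reconciliation step because every worker maps a given token to the same column on its own.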