How an algorithm understands text in NLP


Machine learning (ML) and other approaches used in natural language processing (NLP) usually work with numerical arrays known as vectors, one for each instance (also known as an observation, entity, or row) in the data set. The collection of all these arrays is referred to as a matrix, and each row in the matrix represents a single instance. Viewed by columns, each column of the matrix represents a feature (or attribute).
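As a minimal sketch of this idea, the design matrix below (with made-up values) holds three instances as rows and two features as columns:

```python
# A toy design matrix: each row is one instance, each column one feature.
# The values here are illustrative only.
matrix = [
    [1.0, 0.0],  # instance 0
    [0.0, 2.0],  # instance 1
    [3.0, 1.0],  # instance 2
]

row = matrix[1]                  # one instance (a vector)
column = [r[0] for r in matrix]  # one feature across all instances
```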

The initial step in NLP is to turn the collection of text instances into a matrix, with each row being a numerical representation of a text instance (a vector). However, there are a few terms to understand before getting started with NLP.

Step-by-step NLP process

A document is a single instance in NLP, whereas a corpus is a collection of instances. A document might be as simple as a short phrase or name or as complex as a complete book, depending on the problem at hand.


A decision must be made regarding how to decompose a document into smaller parts through a process known as tokenisation. Tokens are created as a result of this operation. They are the smallest units of meaning that the algorithm can take into account. The vocabulary is the collection of all tokens found in the corpus.
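The decomposition into tokens and the resulting vocabulary can be sketched as follows; the corpus strings are illustrative, and the whitespace split is a deliberately naive tokeniser:

```python
# Minimal sketch: word-level tokenisation and the resulting vocabulary.
corpus = ["This is the first document", "This is the second one"]

def tokenise(document):
    # Lowercase and split on whitespace; real tokenisers handle
    # punctuation, contractions, etc. far more carefully.
    return document.lower().split()

tokens = [tokenise(doc) for doc in corpus]
# The vocabulary is the set of all tokens found anywhere in the corpus.
vocabulary = set(tok for doc in tokens for tok in doc)
```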

Taking words as tokens is a typical choice; in this case, a document is represented as a bag of words (BoW). The BoW model builds the vocabulary at the word level, which means that the vocabulary is the set of all the words found in the corpus. The algorithm then counts, for each document, the number of times each term appears.

Most terms in the corpus will not appear in most documents, resulting in a lot of zero counts for a lot of tokens in a document. That’s essentially it in terms of concept, but when a data scientist generates the vectors from these counts, they must verify that the columns line up in the same way for each row.
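The counting step above can be sketched like this; sorting the vocabulary is one simple way (an illustrative choice, not the only one) to fix a single column order so that every row lines up:

```python
# Hedged sketch of bag-of-words vectorisation. Fixing one column order
# (here: sorted vocabulary) guarantees the columns line up for every row.
from collections import Counter

corpus = ["this is the first document", "this document is the second document"]
docs = [doc.split() for doc in corpus]
vocabulary = sorted(set(tok for doc in docs for tok in doc))

# One row per document; most entries are zero for a large vocabulary.
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in docs]
```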

Hashing 

Permuting the rows of this matrix, or any other design matrix (a matrix that represents instances as rows and features as columns), has no effect on its meaning. The same is true of column permutations. Depending on how they map a token to a column index, data scientists get a different ordering of the columns, but no meaningful change in the representation. Hashing is the process of mapping tokens to indexes in such a way that no two tokens map to the same index. A hash, hashing function, or hash function is a specific implementation.

Vocabulary based Hashing 

In NVIDIA’s example, vectorising by hand constructs an implicit hash function. Assuming a 0-indexing scheme, the first unseen word is allocated the initial index, 0; the index is then incremented and the operation repeated for each new word. Under this hash function, “This” maps to the 0-indexed column, “is” to the 1-indexed column, and “the” to the 2-indexed column. There are benefits and drawbacks to using a vocabulary-based hash function.
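The incremental scheme described above can be written as a few lines of Python (a sketch; the function name is ours):

```python
# A minimal vocabulary-based hash: assign each previously unseen token
# the next free column index, incrementing as new tokens appear.
def build_vocab_hash(tokens):
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(index)  # next unused column index
    return index

mapping = build_vocab_hash(["this", "is", "the", "is", "this"])
# "this" -> 0, "is" -> 1, "the" -> 2; repeats get no new index
```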

Mathematical Hashing 

Fortunately, there is another way to hash tokens: use a non-cryptographic mathematical hash function for each instance. This form of hash function maps objects (represented by their bits) to a defined range of integers using a combination of arithmetic, modular arithmetic, and algebra. The maximum value defines how many columns are in the matrix because the range is known. The range is rather large in general; however, for most rows, the majority of columns will be 0. As a result, a sparse representation reduces the amount of memory needed to hold the matrix, and algorithms can efficiently execute sparse matrix-based operations.
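A minimal sketch of this idea, using `zlib.crc32` as the non-cryptographic hash and a small, illustrative bucket count (real systems use a much larger range):

```python
# Hashing-trick vectorisation with a non-cryptographic hash (zlib.crc32).
# The bucket count N fixes the number of columns in advance.
import zlib

N = 16  # illustrative; production systems often use 2**18 or more

def hash_vectorise(document, n_buckets=N):
    vec = [0] * n_buckets
    for tok in document.lower().split():
        # Map the token's bytes to a column index via hash modulo range.
        col = zlib.crc32(tok.encode("utf-8")) % n_buckets
        vec[col] += 1
    return vec

row = hash_vectorise("this is the first document")
```

Note that no vocabulary is built or stored; the hash alone determines each token's column, at the cost of possible collisions between tokens.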

Furthermore, because there is no vocabulary, vectorisation with a mathematical hash function does not necessitate any vocabulary storage overhead. As a result, parallelisation is not limited, and the corpus can be broken into any number of pieces, allowing each piece to be vectorised independently. Once each process has finished vectorising its part of the corpus, the generated matrices can be stacked to form the final matrix. By removing this bottleneck, the parallelisation enabled by a mathematical hash function can substantially speed up the training pipeline.
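The split-vectorise-stack pattern can be sketched as below; a thread pool stands in for the worker processes a real pipeline would use, since no shared vocabulary state is needed either way:

```python
# Because hashing needs no shared vocabulary, chunks of the corpus can be
# vectorised independently and the resulting matrices stacked afterwards.
import zlib
from concurrent.futures import ThreadPoolExecutor  # processes in real pipelines

N = 16

def hash_vectorise(document, n_buckets=N):
    vec = [0] * n_buckets
    for tok in document.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec

def vectorise_chunk(chunk):
    return [hash_vectorise(doc) for doc in chunk]

corpus = ["first doc", "second doc", "third doc", "fourth doc"]
chunks = [corpus[:2], corpus[2:]]  # split the corpus into independent pieces
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(vectorise_chunk, chunks))
# Stack the per-chunk matrices into the final design matrix.
matrix = [row for part in parts for row in part]
```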

Abhishree Choudhary
Abhishree is a budding tech journalist with a UGD in Political Science. In her free time, Abhishree can be found watching French new wave classic films and playing with dogs.
