Last updated February 19, 2022

Getting started with Gensim for basic NLP tasks

Published on February 19, 2022

by Vijaysinh Lendave

Gensim is an open-source Python package for natural language processing with a special focus on topic modelling. It is designed as a topic modelling library that lets users apply common academic models in production systems or projects. In this article, we will talk about the library, its main functions and features, and several NLP tasks it supports. Below are the major points we will discuss throughout this post.

What is Gensim?
Features of Gensim
Hands-on NLP with Gensim
1. Creating a dictionary from a list of sentences
2. Bag-of-words
3. Creating bigrams
4. Creating TF-IDF matrix

Let’s first discuss the Gensim library.

What is Gensim?

Gensim is open-source software that performs unsupervised topic modelling and natural language processing using modern statistical machine learning. Gensim is written in Python and Cython for performance. It is designed to handle large text collections using data streaming and incremental online algorithms, which sets it apart from most other machine learning software packages that are only designed for in-memory processing.

Gensim is not an all-encompassing NLP research library (like NLTK); rather, it is a mature, targeted, and efficient collection of NLP tools for topic modelling. It also includes tools for loading pre-trained word embeddings in a variety of formats, as well as for using and querying a loaded embedding.

Features of Gensim

The following are some of Gensim's features.

Gensim provides efficient multicore implementations of common techniques, including Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections (RP), and Hierarchical Dirichlet Process (HDP), to speed up processing and retrieval on machine clusters.

Using its incremental online training algorithms, Gensim can process massive, web-scale corpora. It is scalable because the entire input corpus never needs to be held in Random Access Memory (RAM) at once. In other words, the memory its algorithms need does not depend on the size of the corpus, since documents are streamed one at a time.
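
To make the streaming idea concrete, here is a minimal sketch in plain Python. It is not Gensim's implementation; the helper names (stream_documents, stream_bow) and the sample text are hypothetical, and io.StringIO only stands in for a large file on disk. The point is that each document is read, converted, and handed on before the next one is touched.

```python
import io
from collections import Counter


def stream_documents(handle):
    """Yield one tokenized document per line, reading lazily."""
    for line in handle:
        tokens = line.lower().split()
        if tokens:
            yield tokens


def stream_bow(handle, token2id):
    """Yield a list of (token_id, count) pairs per document.

    token2id grows as new words are seen, so only the vocabulary
    (not the corpus) has to live in memory.
    """
    for tokens in stream_documents(handle):
        counts = Counter(tokens)
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
        yield sorted((token2id[t], c) for t, c in counts.items())


# io.StringIO stands in for a large file on disk (hypothetical data)
fake_file = io.StringIO(
    "gensim streams documents\n"
    "documents are processed one at a time\n"
)
token2id = {}
bows = stream_bow(fake_file, token2id)

# Only the first document has been read at this point
first = next(bows)
print(first, len(token2id))
```

Because stream_bow is a generator, the second line of the file is not read until the next document is requested, which is the same access pattern a streamed corpus relies on.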

Gensim is a robust library that has been used in a wide range of systems by many different people. You can easily plug in your own input corpus or data stream, and it is also simple to extend with other vector space algorithms.

Hands-on NLP with Gensim

In this section, we’ll address some of the basic NLP tasks by using Gensim. Let’s first start with creating the dictionary.

1. Creating a dictionary from a list of sentences

Gensim requires that words (aka tokens) be translated to unique ids in order to work on text documents. To accomplish this, Gensim allows you to create a Dictionary object that maps each word to a unique id. We may do this by transforming our text/sentences to a list of words and passing it to the corpora.Dictionary() method.

In the following part, we’ll look at how to do this in practice. The Dictionary object is often used to generate a bag-of-words corpus. The Dictionary and the bag-of-words corpus are then used as inputs to Gensim’s topic modelling and other models.

Here is the snippet that creates the dictionary for a given text.

from pprint import pprint
from gensim import corpora

text = [
   "Gensim is an open-source library for",
   "unsupervised topic modeling and",
   "natural language processing."
]
# split each sentence into separate words (tokens)
text_tokens = [doc.split() for doc in text]
# create the dictionary
dict_ = corpora.Dictionary(text_tokens)
# show the tokens and their ids
pprint(dict_.token2id)

2. Bag-of-words

The corpus (a bag of words) is the next important object to learn if you want to use Gensim effectively. For each document, it holds the id of every word together with the number of times that word appears in the document.

To create a bag-of-words corpus, all that is required is to pass each tokenized list of words to the Dictionary's doc2bow() method. With allow_update=True, the Dictionary is updated with any new words as it goes. To generate the BoW, we’ll continue from the tokenized text of the previous example.

# tokens
text_tokens = [doc.split() for doc in text]
# start from an empty dictionary
dict_ = corpora.Dictionary()
# build the BoW; allow_update=True adds unseen tokens to the dictionary
BoW_corpus = [dict_.doc2bow(doc, allow_update=True) for doc in text_tokens]
pprint(BoW_corpus)

The (0, 1) in the first list indicates that the word with id 0 appears once in the first sentence. Similarly, the (10, 1) in the third list indicates that the word with id 10 appears once in the third sentence, and so forth.
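
To see what these (id, count) pairs stand for, the sketch below rebuilds a bag of words by hand with collections.Counter and then maps the ids back to words. It is only an illustration in plain Python: the id assignment rule (new words numbered in sorted order within each document) is an assumption made for this example, so the ids are not guaranteed to match the ones Gensim assigns.

```python
from collections import Counter

sentences = [
    "Gensim is an open-source library for",
    "unsupervised topic modeling and",
    "natural language processing.",
]

token2id = {}
bow_corpus = []
for sentence in sentences:
    counts = Counter(sentence.split())
    # Assumed rule: unseen words get the next free id, in sorted order
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    bow_corpus.append(sorted((token2id[t], n) for t, n in counts.items()))

# Invert the mapping so each (id, count) pair can be read as (word, count)
id2token = {i: t for t, i in token2id.items()}
readable = [[(id2token[i], n) for i, n in doc] for doc in bow_corpus]

for doc in readable:
    print(doc)
```

Reading the pairs as (word, count) makes it easy to check that each document only lists the words it actually contains, which is what keeps the representation sparse.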

3. Creating bigrams

Certain words in a text regularly appear together in pairs (bigrams) or in groups of three (trigrams), because the joined terms name a single entity. Forming bigrams and trigrams from sentences is important, especially when working with bag-of-words models. It’s simple and quick with Gensim’s Phrases model. Because a trained Phrases model supports indexing, simply pass a tokenized sentence (a list of words) to the trained model to get its bigrams.

from gensim.models.phrases import Phrases
# Build the bigram model
# (on a corpus this small, min_count=3 and threshold=10 are too strict to merge any pair)
bigram = Phrases(text_tokens, min_count=3, threshold=10)
# Apply the model to the first sentence
pprint(bigram[text_tokens[0]])
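
To give a feel for how min_count and threshold interact, here is a rough sketch of a collocation score in plain Python. It is not Gensim's code: the function bigram_scores, the toy corpus, and the threshold value are hypothetical, and the vocabulary size here counts single words only, which is a simplification. As far as we understand it, Gensim's default scorer follows a similar formula, (count(a, b) - min_count) * vocab_size / (count(a) * count(b)), computed over its own internal vocabulary.

```python
from collections import Counter


def bigram_scores(docs, min_count=1):
    """Score each adjacent word pair; higher means a stronger collocation."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores


# Hypothetical toy corpus in which "new york" recurs
docs = [
    ["new", "york", "is", "big"],
    ["i", "like", "new", "york"],
    ["new", "york", "has", "parks"],
    ["parks", "are", "big"],
]

scores = bigram_scores(docs, min_count=1)
threshold = 1.5  # illustrative value
phrases = sorted(pair for pair, s in scores.items() if s > threshold)
print(phrases)
```

A pair that occurs no more often than min_count gets a score of zero or less, so it can never pass a positive threshold. That is why the three-sentence sample above, where every pair occurs once, produces no bigrams with min_count=3.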

4. Creating TF-IDF matrix

Unlike the regular bag-of-words corpus, the Term Frequency-Inverse Document Frequency (TF-IDF) model reduces the weight of tokens (words) that appear frequently across documents. TF-IDF is calculated by multiplying a local component, the term frequency (TF), by a global component, the inverse document frequency (IDF), and then normalizing the result to unit length. As a result, terms that appear in many documents receive less weight.
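
The arithmetic can be sketched in a few lines of plain Python. This is one common variant, assumed here for illustration: raw term frequency times log2(N / df), followed by L2 normalization. The toy documents and the function name tfidf are hypothetical, and other TF and IDF formulas would give different numbers.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["gensim", "is", "a", "topic", "modeling", "library"],
    ["topic", "modeling", "finds", "topics"],
    ["gensim", "streams", "large", "corpora"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears
df = Counter(tok for doc in docs for tok in set(doc))


def tfidf(doc):
    """Return {token: weight} using tf * log2(N / df), scaled to unit length."""
    tf = Counter(doc)
    raw = {tok: n * math.log2(n_docs / df[tok]) for tok, n in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items() if norm > 0}


weights = [tfidf(doc) for doc in docs]
for doc_weights in weights:
    print({tok: round(w, 2) for tok, w in doc_weights.items()})
```

In the first document, "library" appears in one document only while "gensim" appears in two, so "library" ends up with the larger weight even though both occur once.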

There are several variants of the TF and IDF formulas. Below is how we can obtain the TF-IDF weights. The snippets below first show the raw frequencies given by the BoW and then the weights given by TF-IDF.

from gensim.utils import simple_preprocess
from gensim import corpora, models
import numpy as np
# data to be processed
docs = [
   "Gensim is an open-source library for",
   "unsupervised topic modeling and",
   "natural language processing."]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in docs])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in docs]

# Show the word frequencies in the corpus
for bow in corpus:
    print([[mydict[token_id], freq] for token_id, freq in bow])

Moving on to TF-IDF, we just need to fit the model on the corpus and then loop over the transformed documents to read the weight of each word.

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[token_id], np.around(weight, decimals=2)] for token_id, weight in doc])

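Because every token in this small corpus occurs exactly once and in only one document, all tokens within a document end up with the same weight after normalization.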

Final words

In this article, we discussed Gensim, a modular Python library that lets us build state-of-the-art algorithms and pipelines for NLP problems. This post is about getting started with Gensim: we worked through some basic NLP tasks hands-on, namely building a dictionary, a bag-of-words corpus, bigrams, and TF-IDF weights.