Guide to NLP’s TextRank Algorithm

In the modern era, the amount of available data and information is huge, and we want our ML and NLP models to perform precisely and accurately on every task. Developing a well-performing model involves various preprocessing techniques such as removing stop words, stemming and lemmatization. But some situations involve far too much text to process manually. For example, a product may have thousands of reviews, and we cannot read through all of them; instead, we need to summarize the text so that we can get an overview of the information. The TextRank algorithm provides automated summarization of large, unorganized bodies of text. Summarization is not the only task it supports: we can also extract keywords and rank phrases, making a huge amount of information understandable in a short, condensed form.

Introduction 

TextRank is a graph-based ranking algorithm inspired by Google’s PageRank, which has been successfully applied in citation analysis. TextRank is often used for keyword extraction, automated text summarization and phrase ranking. Basically, the algorithm measures the relationship between two or more words. Let’s dive deeper into how it works.

Suppose a paragraph contains four words: w1, w2, w3 and w4. We can create a table recording which words occur together in the paragraph.

Word    Related words
w1      w3, w4
w2      (none)
w3      w1
w4      w1

From the table, we can tell that:

  • w1 has occurred with w3 and w4
  • w2 has not occurred with any of them
  • w3 and w4 have each occurred only with w1

To rank the phrases, we need to give the words scores based on their co-occurrence. These scores tell us the probability of the words occurring together.

To measure these probabilities, we require a square matrix of size m × m, where m is the number of words.

The matrix is filled with the probabilities of the words occurring together.
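To make this concrete, here is a toy sketch of the ranking step in plain Python. It turns the co-occurrence relations from the table above into an m × m transition matrix and runs a PageRank-style damped power iteration (damping factor 0.85, as in PageRank). This only illustrates the principle; it is not how pytextrank actually implements the algorithm.

```python
# Toy illustration of matrix-based word ranking (not pytextrank's code).
words = ["w1", "w2", "w3", "w4"]

# Co-occurrence pairs from the table: w1 occurs with w3 and w4; w2 with none.
cooccur = {
    ("w1", "w3"), ("w3", "w1"),
    ("w1", "w4"), ("w4", "w1"),
}

m = len(words)

# Row-stochastic transition matrix: probability of moving from word i to word j.
matrix = [[0.0] * m for _ in range(m)]
for i, wi in enumerate(words):
    neighbours = [j for j, wj in enumerate(words) if (wi, wj) in cooccur]
    for j in neighbours:
        matrix[i][j] = 1.0 / len(neighbours)

# PageRank-style damped power iteration.
d = 0.85
scores = [1.0 / m] * m
for _ in range(50):
    scores = [
        (1 - d) / m + d * sum(scores[i] * matrix[i][j] for i in range(m))
        for j in range(m)
    ]

for w, s in sorted(zip(words, scores), key=lambda p: -p[1]):
    print(f"{w}: {s:.3f}")
```

As expected, w1, which co-occurs with two other words, receives the highest score, while the isolated w2 receives only the baseline score (1 - d) / m.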

So this is how the TextRank algorithm produces a ranking. To implement the algorithm, Python provides a package named pytextrank. Next in the article, we are going to see how we can use this package.

PyTextRank 

PyTextRank is a Python implementation of the TextRank algorithm, provided as a spaCy pipeline extension. It is popular for features like phrase extraction, extractive summarization of text documents and structured representation of unstructured documents.

Let’s see the implementation of some basic modules of the package using Google Colab.

Installing the packages 

Input:

!pip install pytextrank
!python -m spacy download en_core_web_sm

Importing the libraries.

Input:

import spacy
import pytextrank

Defining a document.

Input:

document = ("Not only did it only confirm that the film would be unfunny and generic, "
            "but it also managed to give away the ENTIRE movie; and I'm not exaggerating "
            "- every moment, every plot point, every joke is told in the trailer.")

Loading a spaCy English model and adding TextRank to the pipeline.

Input:

en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank")
doc = en_nlp(document)

Let’s check the time taken, in milliseconds, for processing by TextRank.

Input:

tr = doc._.textrank
print(tr.elapsed_time)

Output:

Printing the rank and count of each extracted phrase.

Input:

for combination in doc._.phrases:
    print(combination.text, combination.rank, combination.count)

Output:

Here in the output, we can see the ranks of the phrases and their counts in the document. However, the combinations include stop words; we can remove the stop words from the phrases. To demonstrate this, I am loading a yelp_labelled.txt file.

Input:

import pathlib
text = pathlib.Path("/content/drive/MyDrive/Yugesh/textrank/yelp_labelled.txt").read_text()
text

Output:

Here we can see that there are stop words present in the data.

Input:

en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
doc = en_nlp(text)
for phrase in doc._.phrases[:5]:
    print(phrase)

Output:

Here we can see that there are no stop words in the top five ranked combinations.

We can also plot the phrases according to their ranks.

Defining the text.

Input:

text = """“The current vaccination rate in India is far from satisfactory though in absolute
numbers India surpassed the U.S. in terms of total number of vaccinations. India's current
vaccination rate doesn’t match up to what is actually needed. This will delay covering the
people with vaccination, which is, till date, the only viable way of breaking the chain of
transmission and averting severe diseases and deaths in people who get infected,” said Gauri
Agarwal, IVF Expert, founder-Seeds of Innocence."""

Loading the model.

Input:

en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
doc = en_nlp(text)

Plotting the ranks of the phrases.

Input:

tr = doc._.textrank
tr.plot_keyphrases()

Output:

In the graph, we can see the 21 extracted phrases plotted against their ranks.

We can also summarize documents using this library. To demonstrate this, I am defining a longer text.

Input:

text = """India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it
registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and
family welfare data showed. The last time India's Covid-19 tally was below 30,000-mark was on 
March 16 when the country saw 28,903 fresh cases.

The country also saw 374 deaths due to Covid-19 in the last 24 hours, taking the death toll to 414,482. This is also the lowest death count India has seen after over three months. India witnessed deaths below 400 on March 30 when 354 fatalities were recorded.

Active cases of Covid-19 in the last 24 hours dipped sharply by 15,535, bringing the current infections in the country down to 406,130, the health ministry data showed. These account for 1.35% of the total infections reported in the country.

At least 45,254 people recovered from the infectious disease in the last 24 hours, taking India's recovery rate to 97.32%."""

Loading the package in a spaCy pipeline.

Input:

en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
doc = en_nlp(text)
tr = doc._.textrank

Generating a summary of the defined text.

Input:

for sent in tr.summary(limit_phrases=10, limit_sentences=2):
    print(sent)

Output:

India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and family welfare data showed.

Active cases of Covid-19 in the last 24 hours dipped sharply by 15,535, bringing the current infections in the country down to 406,130, the health ministry data showed.

Here we can see that TextRank has summarized the long passage into two sentences. There is one more way to summarize a document: the summa library, which also follows the TextRank algorithm.
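Before installing summa, it may help to see the sentence-level TextRank idea that both libraries share: treat each sentence as a graph node, weight the edges by word overlap, run a PageRank-style iteration, and keep the top-ranked sentences. The sketch below uses only the Python standard library; the tokenisation is deliberately crude and the sample text is made up, so treat it as an illustration of the principle rather than either library’s actual implementation.

```python
import math
import re

def sentences_of(text):
    """Crude sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    """Word overlap normalised by sentence lengths, in the spirit of TextRank."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def summarize(text, limit_sentences=2, d=0.85, iters=30):
    sents = sentences_of(text)
    n = len(sents)
    # Edge weights between every pair of sentences.
    sim = [[similarity(sents[i], sents[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    # PageRank-style damped power iteration over the sentence graph.
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                total = sum(sim[i])
                if total:
                    rank += scores[i] * sim[i][j] / total
            new.append((1 - d) / n + d * rank)
        scores = new
    top = sorted(range(n), key=lambda i: -scores[i])[:limit_sentences]
    return [sents[i] for i in sorted(top)]  # keep original sentence order

text = ("TextRank builds a graph over sentences. "
        "Each sentence is a node in the graph. "
        "Edges are weighted by word overlap between sentences. "
        "Unrelated filler text contributes little.")
for sent in summarize(text):
    print(sent)
```

Real implementations normalise the words (lemmatisation, stop-word removal) before computing overlap, which is why the libraries produce much better summaries than this sketch.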

Next, we will use summa to summarize the same long text and to extract the major keywords from it.

Installing the library:

Input:

!pip install summa

Importing the required libraries.

from summa import summarizer
from summa import keywords

Using the summarizer module to summarize the text.

Input:

print(summarizer.summarize(text))

Output:

India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and family welfare data showed.

Extracting the major keywords from the text.

Input:

print(keywords.keywords(text))

Output:

Here we can see that the keyword extraction is also working fine with the summa library. 

We can also restrict the summary to a given word count, or generate a summary as a percentage of the original content.

Summarization with restricted word count.

Input:

summarizer.summarize(text, words=50)

Output:

India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and family welfare data showed.\nThe last time India's Covid-19 tally was below 30,000-mark was on March 16 when the country saw 28,903 fresh cases.

Summarization using percentage of the content.

Input:

summarizer.summarize(text, ratio=0.5)

Output:

India recorded its lowest daily Covid-19 cases in over four months on Tuesday as it registered 30,093 fresh cases of the coronavirus disease, the Union ministry of health and family welfare data showed.\nThe last time India's Covid-19 tally was below 30,000-mark was on March 16 when the country saw 28,903 fresh cases.\nThe country also saw 374 deaths due to Covid-19 in the last 24 hours, taking the death toll to 414,482.\nActive cases of Covid-19 in the last 24 hours dipped sharply by 15,535, bringing the current infections in the country down to 406,130, the health ministry data showed.

In this article, we have seen how we can decompose a text document into phrases and how to measure the probability of those phrases occurring together. We performed automated summarization using two library packages, pyTextRank and summa. These are basic operations that we can apply to many NLP problems. Magazines, online news applications and other media platforms use these features to condense their content, giving users a better experience with the product.


Copyright Analytics India Magazine Pvt Ltd
