Hands-on Guide To Extractive Text Summarization With BERTSum

With the advent of new research into state-of-the-art models, more powerful algorithms are being developed that focus on low-code environments to make machine learning accessible to everyone. Analogous to transfer learning models for image classification or face recognition in computer vision, BERT, RoBERTa, and XLNet are highly powerful models for solving natural language processing problems. These models have dominated the world of NLP by making tasks like POS tagging, sentiment analysis, and text summarization easy yet effective.

This article is a demonstration of how simple and powerful transfer learning models are in the field of NLP. We will implement a text summarizer using BERT that can summarize large posts like blogs and news articles using just a few lines of code. 

Text summarization

Text summarization is the task of using a machine to condense a document or a set of documents into brief paragraphs or statements with mathematical and statistical methods. NLP broadly classifies text summarization into two groups.

  1. Extractive text summarization: the model selects the most important sentences from the original document and combines them into a shorter summary, keeping the original wording. 
  2. Abstractive text summarization: the model generates new sentences that paraphrase the source document rather than copying sentences from it.

We will understand and implement the first category here. 

Extractive text summarization with BERT (BERTSUM)

Unlike abstractive text summarization, extractive text summarization requires the model to “understand” the complete text, pick out the most important sentences, and assemble them into a coherent summary. There cannot be a loss of information either. So, how does BERT do all of this with such great speed and accuracy? 

[Figure: the BERTSUM input architecture compared with the original BERT model]

Looking at the image above, you may notice slight differences between the original model and the model used for summarization. The input format of BERTSUM differs from the original model: a [CLS] token is inserted at the start of each sentence in order to separate multiple sentences and to collect the features of the sentence that follows it. There is also a difference in segment embeddings. In BERTSUM, each sentence is assigned an embedding of Ea or Eb depending on whether its position in the document is odd or even. If the sequence is [s1, s2, s3], then the segment embeddings are [Ea, Eb, Ea]. This way, all sentences are embedded and passed into the further layers. 
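To make this input layout concrete, here is a small, purely illustrative sketch (not the library's actual code) of how the tokens and interval segment ids would be constructed for a three-sentence document, with 0 standing in for Ea and 1 for Eb:

# Illustrative sketch only: BERTSUM-style input construction, with a
# [CLS] token before every sentence and alternating segment ids.
sentences = [
    ["the", "cat", "sat"],    # s1
    ["it", "was", "happy"],   # s2
    ["then", "it", "slept"],  # s3
]

tokens, segment_ids = [], []
for i, sent in enumerate(sentences):
    sent_tokens = ["[CLS]"] + sent + ["[SEP]"]
    tokens.extend(sent_tokens)
    segment_ids.extend([i % 2] * len(sent_tokens))  # 0 = Ea, 1 = Eb

print(tokens)       # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', '[CLS]', ...]
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]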

BERTSUM assigns each sentence a score that represents how much value that sentence adds to the overall document. So, [s1, s2, s3] is assigned [score1, score2, score3]. The sentences with the highest scores are then collected and arranged in their original order to give the overall summary of the article. 
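Conceptually, the selection step looks something like the sketch below. The scores here are invented for illustration; in the real model they come from the summarization layers stacked on top of BERT:

# Illustrative sketch of the selection step: keep the top-k scoring
# sentences, then restore their original document order.
sentences = ["s1", "s2", "s3", "s4"]
scores = [0.9, 0.2, 0.7, 0.4]  # hypothetical per-sentence importance scores

k = 2
top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
summary = " ".join(sentences[i] for i in sorted(top))
print(summary)  # -> s1 s3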

Now that we have understood how BERTSUM works, let us apply it to a custom article. 

Loading the custom data

You can select a paragraph or two from any source of your choice. I have selected two paragraphs from the WHO report on coronavirus. Once you have your data ready, go ahead and install the bert-extractive-summarizer library using the command 

pip install bert-extractive-summarizer

After installing the library, let us read our file and summarize it. 

# Read the saved WHO article text from disk
get_corona_summary = open('corona.txt', 'r').read()


The library provides a Summarizer class in its summarizer module that takes in our data, processes it, and produces the summary within seconds. 

from summarizer import Summarizer

# Load the pre-trained BERT-based extractive summarizer
model = Summarizer()

# Summarize; min_length filters out sentences shorter than 20 characters
result = model(get_corona_summary, min_length=20)
summary = "".join(result)
print(summary)

Thus, our two-paragraph article has been condensed into a short summary.

Using a web scraped article

The third-party newspaper library makes data loading easy for us. It is a web scraper that can extract all textual information from the URL provided, and it handles multiple languages seamlessly: if no language is specified, newspaper will attempt to auto-detect it. 

To install this library, use the command 

pip install newspaper3k 
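As an aside, newspaper3k also offers an Article class that makes the download and parse steps explicit. The snippet below is a minimal sketch of that API, using the same URL that is summarized later in this section; since no language is passed, the library auto-detects it:

from newspaper import Article

# Download and parse a page; language is auto-detected when not specified
url = "https://analyticsindiamag.com/is-common-sense-common-in-nlp-models/"
page = Article(url)
page.download()
page.parse()

print(page.title)
print(page.text[:300])  # first few hundred characters of the body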

Once the library is installed, let us select an article to summarize. I have chosen this article for the purpose of this project. Let us load our data.

from newspaper import fulltext
import requests

# Fetch the page HTML and extract its main text content
article_url = "https://analyticsindiamag.com/is-common-sense-common-in-nlp-models/"
article = fulltext(requests.get(article_url).text)
print(article)
from summarizer import Summarizer

# Summarize the scraped article, filtering sentences by character length
model = Summarizer()
result = model(article, min_length=30, max_length=300)
summary = "".join(result)
print(summary)

Here is a one-paragraph summary of the contents of the article. Note that min_length and max_length bound the length, in characters, of the sentences the model will consider for the summary; here, sentences longer than 300 characters are excluded from consideration. You can tune these thresholds as per your needs. 
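If you want more direct control over the size of the output, the library also accepts a ratio argument that keeps roughly that fraction of the original sentences. A quick sketch, reusing the model and article from above:

# Keep roughly 20% of the article's sentences in the summary
short_summary = model(article, ratio=0.2)
print(short_summary)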

Conclusion

BERT is undoubtedly a breakthrough in the use of machine learning for natural language processing. The fact that it is approachable and allows fast fine-tuning will likely enable a wide range of practical applications in the future. Research in NLP is moving closer to human-level performance every day, and models like this find use in a wide range of applications. The ease with which these methods can be implemented makes them accessible not only to developers but also to business analysts. 


Bhoomika Madhukar

I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.