
How to Identify Entities in NLP?


Generally, when we read a text, we recognize entities straight away, such as people, values and locations. For example, in the sentence "Alexander the Great was a king of the ancient Greek kingdom of Macedonia.", we can identify three types of entities as follows:

  • Person: Alexander
  • Culture: Greek
  • Kingdom: Macedonia 

We generate an enormous amount of text data; with the help of modern machines, we can process this text to perform tasks like sentiment analysis, searching for specific content, named entity recognition, part-of-speech tagging, information retrieval, and more.

In this article, with the help of the Naive Bayes classifier, we will classify text into the category it belongs to. To perform this task, we are going to use the famous 20 newsgroups dataset, which comprises around 19,000 newsgroup posts on 20 different topics.

Code Implementation to Identify Entities

Create the Environment:

Create the necessary Python environment by importing the frameworks and libraries.

 import numpy as np
 import pandas as pd
 from sklearn.datasets import fetch_20newsgroups
 from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
 from sklearn.naive_bayes import MultinomialNB 

Let’s quickly walk through the dataset.

 dataset = fetch_20newsgroups(subset = 'all', shuffle = True, random_state = 42)
 dataset.target_names 

The 20 different topics are as follows:

 'alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc' 
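As a quick sanity check, we can confirm how many posts were loaded and peek at the integer labels. This is a minimal sketch; dataset.target simply holds indices into dataset.target_names.

 print(len(dataset.data))    # total number of posts loaded with subset='all'
 print(dataset.target[:5])   # integer class labels for the first five posts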

Let’s take a look at the first article.

(dataset.data[0].split('\n'))

 'From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>',
  'Subject: Pens fans reactions',
  'Organization: Post Office, Carnegie Mellon, Pittsburgh, PA',
  'Lines: 12',
  'NNTP-Posting-Host: po4.andrew.cmu.edu',
  '',
  '',
  '',
  'I am sure some bashers of Pens fans are pretty confused about the lack',
  'of any kind of posts about the recent Pens massacre of the Devils. Actually,',
  'I am  bit puzzled too and a bit relieved. However, I am going to put an end',
  "to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they",
  'are killing those Devils worse than I thought. Jagr just showed you why',
  'he is much better than his regular season stats. He is also a lot',
  'fo fun to watch in the playoffs. Bowman should let JAgr have a lot of',
  'fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final',
  'regular season game.          PENS RULE!!!' 

We cannot directly feed raw data like this to the machine. It should first be converted into a vector of numerical values representing each sentence of the document. Utilities like CountVectorizer and TfidfTransformer provided by sklearn are used to turn raw text into meaningful vectors.

count_vecto = CountVectorizer()


CountVectorizer is used to tokenize a given collection of text documents and build a vocabulary of known words. When you call fit_transform on a document collection, the result is an encoded vector with the length of the full vocabulary and an integer count of how many times each word appeared in each document. The vectors returned are mostly sparse. To inspect what the function has done, you can convert the result into a NumPy array by calling the toarray() method.
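To make this concrete, here is a minimal sketch on a made-up two-document corpus (toy_docs and toy_vecto are illustrative names, not part of the original walkthrough; get_feature_names_out needs scikit-learn 1.0+):

 toy_docs = ["the cat sat", "the cat sat on the mat"]
 toy_vecto = CountVectorizer()
 toy_counts = toy_vecto.fit_transform(toy_docs)
 print(toy_vecto.get_feature_names_out())  # vocabulary: ['cat' 'mat' 'on' 'sat' 'the']
 print(toy_counts.toarray())               # [[1 0 0 1 1]
                                           #  [1 1 1 1 2]]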

x_train_count = count_vecto.fit_transform(dataset.data)
x_train_count.data

This gives output like:

array([1, 1, 1, ..., 1, 1, 1])

This raw-count representation will not help much on its own, as words like ‘the’, ‘an’ and ‘a’ appear many times throughout the documents, and their large counts are not meaningful in the encoded vectors.
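If you stay with plain counts, one common mitigation is to drop English stop words at tokenization time via CountVectorizer's stop_words parameter (count_vecto_sw and x_train_sw are illustrative names); the TF-IDF weighting introduced next is another, more general fix.

 # Drop very common English words ('the', 'an', 'a', ...) before counting
 count_vecto_sw = CountVectorizer(stop_words='english')
 x_train_sw = count_vecto_sw.fit_transform(dataset.data)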

tfid_vecto = TfidfTransformer()

TfidfTransformer is an alternative way to encode a document: rather than keeping raw counts, it re-weights the count matrix produced by CountVectorizer. TF-IDF scores are word-frequency weights that try to highlight words with more relevance to the context.

The frequency of occurrence of terms in a document is measured by Term Frequency. Inverse Document Frequency ranks words based on their relevance across the corpus; in other words, it downscales words that appear frequently everywhere, such as ‘a’, ‘an’ and ‘the’. The use case of TF-IDF is similar to that of the CountVectorizer.
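Concretely, scikit-learn's default (smooth_idf=True, norm='l2') computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies each count by it, and L2-normalises each document row. A minimal sketch verifying this by hand on the toy counts from the earlier example:

 n = toy_counts.shape[0]
 # document frequency: each (doc, term) pair is stored once in the sparse matrix
 df = np.bincount(toy_counts.indices, minlength=toy_counts.shape[1])
 idf = np.log((1 + n) / (1 + df)) + 1
 manual = toy_counts.toarray() * idf
 manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise rows
 print(np.allclose(manual, TfidfTransformer().fit_transform(toy_counts).toarray()))  # True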

Here we have already performed the first step of TF-IDF: the term counts. We can directly feed our CountVectorizer output to the transformer to compute the inverse document frequencies, i.e. to downscale the common words.

 train_tfid = tfid_vecto.fit_transform(x_train_count)
 train_tfid.data
 array([0.02250144, 0.07093131, 0.02297898, ..., 0.03921079, 0.05632967,
        0.04197143])

Use of the Multinomial Naive Bayes Classifier in Identifying Entities

Why Naive Bayes?

So far, we have converted our raw data into a vector representation of word frequencies. A Naive Bayes classifier models the probability of each word appearing within each category, so to perform our classification task on these frequencies, we use MultinomialNB, a member of the Naive Bayes classifier family.

As our training data is represented as term frequencies, the MultinomialNB classifier is a natural fit: it is designed for discrete features such as word counts. It uses these frequencies to compute maximum likelihood estimates of the conditional word probabilities from the training data.

model = MultinomialNB().fit(train_tfid, dataset.target)
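Under the hood (a sketch of the estimation, not MultinomialNB's exact code path): with the default smoothing alpha=1.0, P(t | c) is estimated as (N_tc + alpha) / (N_c + alpha * n_features), where N_tc is the summed feature weight of term t over class-c documents and N_c is the class total. The fitted model exposes the logs of these estimates:

 print(model.class_log_prior_.shape)    # (20,): log P(c), from class frequencies
 print(model.feature_log_prob_.shape)   # (20, n_features): log P(t | c) per class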

Let’s try to predict some random text covering a few of the target categories of our trained data.

text = ["i have a motorbike which made by honda","i have TPU based system",'The Bible is simply the written core of that tradition.']

 test = count_vecto.transform(text)
 pred = model.predict(tfid_vecto.transform(test))
 pred
 array([ 7, 11, 15])

Cross-check the outcomes with the class. 

 for a, b in zip(text, pred):
     print(a, '--->is predicted as--->', dataset.target_names[b])

i have a motorbike which made by honda. --->is predicted as---> rec.autos

i have GPU based system. --->is predicted as---> sci.crypt

The Bible is simply the written core of that tradition. --->is predicted as---> soc.religion.christian

Note that we have used the complete dataset as the training set; on some systems this may give a memory error. In that case, set subset = 'train' where we defined our dataset.
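For reference, the same three steps can be chained with scikit-learn's Pipeline, which keeps the vectoriser, transformer and classifier in sync at prediction time. This is an equivalent sketch of what we built manually above (clf_pipeline is an illustrative name):

 from sklearn.pipeline import Pipeline
 clf_pipeline = Pipeline([
     ('count', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('nb', MultinomialNB()),
 ])
 clf_pipeline.fit(dataset.data, dataset.target)
 clf_pipeline.predict(text)   # raw strings go straight in; same labels as before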

End Points

We have learned about an NLP task in which we converted raw text into numeric vectors using the frequency distribution of the text content. This frequency distribution can be computed using various methods; we mainly focused on CountVectorizer and the TF-IDF transformer, depending on the use case. Finally, we used a Multinomial Naive Bayes classifier to classify random text and produce meaningful output.
