How to Identify Entities in NLP?

Generally, when we read a text, we recognize entities straight away: people, values, locations and more. For example, in the sentence "Alexander the Great was a king of the ancient Greek kingdom of Macedonia.", we can identify three types of entities as follows:

  • Person: Alexander
  • Culture: Greek
  • Kingdom: Macedonia 

We generate an enormous amount of text data; with the help of modern machines, we can process this text to perform tasks like sentiment analysis, searching for specific content, named entity recognition, part-of-speech tagging, information retrieval, and the list goes on.

In this article, with the help of the Naive Bayes classifier, we will classify text according to the category it belongs to. To perform this task, we are going to use the well-known 20 newsgroups dataset, which comprises around 19,000 newsgroup posts on 20 different topics.

Code Implementation to identify entities

Create the Environment:

Create the necessary Python environment by importing the frameworks and libraries.

 import numpy as np
 import pandas as pd
 from sklearn.datasets import fetch_20newsgroups
 from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
 from sklearn.naive_bayes import MultinomialNB 

Let’s quickly walk through the dataset.

 dataset = fetch_20newsgroups(subset = 'all', shuffle = True, random_state = 42)
 dataset.target_names 

The 20 different topics are as follows:

 'alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc' 
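
If memory is a concern, note that fetch_20newsgroups can also load just a handful of these topics through its categories parameter. A minimal sketch, with the two category names picked from the list above:

 # Load only two of the twenty topics to keep memory usage low
 small = fetch_20newsgroups(subset = 'all',
                            categories = ['rec.autos', 'sci.space'],
                            shuffle = True, random_state = 42)
 len(small.data)   # number of posts in the two chosen topics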

Let’s take a look at the first post.

 dataset.data[0].split('\n')

 'From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>',
  'Subject: Pens fans reactions',
  'Organization: Post Office, Carnegie Mellon, Pittsburgh, PA',
  'Lines: 12',
  'NNTP-Posting-Host: po4.andrew.cmu.edu',
  '',
  '',
  '',
  'I am sure some bashers of Pens fans are pretty confused about the lack',
  'of any kind of posts about the recent Pens massacre of the Devils. Actually,',
  'I am  bit puzzled too and a bit relieved. However, I am going to put an end',
  "to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they",
  'are killing those Devils worse than I thought. Jagr just showed you why',
  'he is much better than his regular season stats. He is also a lot',
  'fo fun to watch in the playoffs. Bowman should let JAgr have a lot of',
  'fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final',
  'regular season game.          PENS RULE!!!' 

We cannot feed raw text like this directly to the machine. It should first be converted into a vector of numerical values representing each document. Utilities like CountVectorizer and TfidfTransformer provided by scikit-learn are used to turn raw text into meaningful vectors.

count_vecto = CountVectorizer()


CountVectorizer is used to tokenize a collection of text documents and build a vocabulary of known words. When you call fit_transform on the collection, the result is one encoded vector per document, with the length of the full vocabulary and an integer count of how many times each word appeared in that document. The returned vectors are mostly sparse. To inspect what the function has done, you can convert the result into a NumPy array by calling the toarray() method.

x_train_count = count_vecto.fit_transform(dataset.data)
x_train_count.data

This gives an output like this:

array([1, 1, 1, ..., 1, 1, 1])
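
To see what the encoding looks like on something small, here is a minimal sketch on a made-up two-sentence corpus (assuming scikit-learn >= 1.0; older versions use get_feature_names() instead of get_feature_names_out()):

 # Toy corpus, invented for illustration
 toy = ["the cat sat on the mat", "the dog sat"]
 toy_vecto = CountVectorizer()
 toy_counts = toy_vecto.fit_transform(toy)
 toy_vecto.get_feature_names_out()
 # array(['cat', 'dog', 'mat', 'on', 'sat', 'the'], dtype=object)
 toy_counts.toarray()
 # array([[1, 0, 1, 1, 1, 2],
 #        [0, 1, 0, 0, 1, 1]])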

This representation alone will not help much, as it counts words like ‘the’, ‘an’, ‘a’ and so on, which appear many times throughout the documents, and their large counts are not meaningful in the encoded vectors.

tfid_vecto = TfidfTransformer()

TfidfTransformer is an alternative way to encode a text document: it re-weights an existing count matrix (its sibling, TfidfVectorizer, combines tokenization and TF-IDF encoding in a single step). TF-IDF scores are word-frequency scores that try to highlight words that have more relevance to the context.

Term Frequency measures how often a term occurs in a document. Inverse Document Frequency ranks words by how informative they are across the corpus; in other words, it downscales words that appear in most documents, such as ‘a’, ‘an’ and ‘the’. The use case of TF-IDF is similar to that of the CountVectorizer.

Here we have already performed the first step of TF-IDF: counting. We can directly feed our CountVectorizer output into the transformer to compute inverse document frequencies, i.e. to downscale the uninformative counts.

 train_tfid = tfid_vecto.fit_transform(x_train_count)
 train_tfid.data
 array([0.02250144, 0.07093131, 0.02297898, ..., 0.03921079, 0.05632967,
        0.04197143])
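
To see the downscaling concretely, here is a minimal sketch on the same kind of made-up corpus. By default, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then L2-normalizes each row:

 # 'the' occurs in every document, so it gets the smallest idf weight
 toy_counts = CountVectorizer().fit_transform(["the cat sat on the mat",
                                               "the dog sat"])
 TfidfTransformer().fit_transform(toy_counts).toarray().round(2)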

Use of the Multinomial Naive Bayes classifier in identifying entities:

Why Naive Bayes?

So far, we have converted our raw data into a vector representation that captures the frequency of each word within its category. To perform our classification task based on these frequencies, we use MultinomialNB, a member of the Naive Bayes family of classifiers.

Because our training data is represented as term frequencies, the MultinomialNB classifier is the most suitable choice: it is designed for discrete features such as word counts. It uses the term frequencies in the training data to compute maximum likelihood estimates of the conditional probability of each word given each class.
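
Concretely, MultinomialNB sums the feature values per class and converts them, with Laplace smoothing, into per-class word probabilities. A minimal sketch on made-up counts shows what is being estimated:

 # Made-up word counts: 3 documents, 3 words, 2 classes
 X = np.array([[3, 0, 1],
               [2, 0, 0],
               [0, 4, 1]])
 y = np.array([0, 0, 1])
 clf = MultinomialNB().fit(X, y)
 np.exp(clf.feature_log_prob_)   # P(word | class), Laplace-smoothed
 # class 0: (5+1)/9, (0+1)/9, (1+1)/9 -> [0.667, 0.111, 0.222]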

model = MultinomialNB().fit(train_tfid, dataset.target)

Let’s try to predict a few sentences that touch on some of the target categories of our trained data.

text = ["i have a motorbike which made by honda","i have TPU based system",'The Bible is simply the written core of that tradition.']

 test = count_vecto.transform(text)
 pred = model.predict(tfid_vecto.transform(test))
 pred
 array([ 7, 11, 15])

Cross-check the outcomes against the class names.

 for a, b in zip(text, pred):
     print(a, '--->is predicted as--->', dataset.target_names[b])

i have a motorbike which made by honda --->is predicted as---> rec.autos

i have GPU based system --->is predicted as---> sci.crypt

The Bible is simply the written core of that tradition. --->is predicted as---> soc.religion.christian
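
As a side note, the same three steps can be chained with scikit-learn's Pipeline, which avoids having to transform new text with the right objects in the right order by hand. A sketch:

 from sklearn.pipeline import Pipeline
 pipe = Pipeline([('counts', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('nb', MultinomialNB())])
 pipe.fit(dataset.data, dataset.target)
 pipe.predict(text)   # same predictions as above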

Note that we have used the complete dataset as the training set; on some systems, this will give a memory error. In that case, set subset = 'train' where we defined our dataset.
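
If you do split the data that way, the held-out half can be used to sanity-check the model. A minimal sketch (the accuracy you get will depend on the preprocessing):

 # Train on the 'train' split and measure accuracy on 'test'
 from sklearn.metrics import accuracy_score
 train = fetch_20newsgroups(subset = 'train', shuffle = True, random_state = 42)
 test = fetch_20newsgroups(subset = 'test', shuffle = True, random_state = 42)
 x_tr = tfid_vecto.fit_transform(count_vecto.fit_transform(train.data))
 clf = MultinomialNB().fit(x_tr, train.target)
 x_te = tfid_vecto.transform(count_vecto.transform(test.data))
 accuracy_score(test.target, clf.predict(x_te))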

Final Words

We have walked through an NLP task in which we converted raw text into numerical vectors using the frequency distribution of the text content. This can be done using various methods; we mainly focused on CountVectorizer and the TF-IDF transformer, depending on the use case. Finally, we used a Multinomial Naive Bayes classifier to classify new text and produce meaningful output.
