Generally, when we read a text, we immediately recognize entities such as people, values, and locations. For example, in the sentence “Alexander the Great was a king of the ancient Greek kingdom of Macedonia.”, we can identify three types of entities:
- Person: Alexander
- Culture: Greek
- Kingdom: Macedonia
We now generate an enormous amount of text data. With the help of modern machines, we can process this text to perform tasks such as sentiment analysis, content search, named entity recognition, part-of-speech tagging, information retrieval, and more.
In this article, we will use a Naive Bayes classifier to classify text into different entities, that is, into the category it belongs to. To perform this task, we are going to use the well-known 20 newsgroups dataset, which comprises around 19,000 newsgroup posts on 20 different topics.
Code Implementation to identify entities
Create the Environment:
Create the necessary Python environment by importing the frameworks and libraries.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
Let’s quickly walk through the dataset.
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
dataset.target_names
The 20 different topics are as follows:
'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'
Let’s take a look at the first article.
'From: Mamatha Devineni Ratnam <email@example.com>', 'Subject: Pens fans reactions', 'Organization: Post Office, Carnegie Mellon, Pittsburgh, PA', 'Lines: 12', 'NNTP-Posting-Host: po4.andrew.cmu.edu', '', '', '', 'I am sure some bashers of Pens fans are pretty confused about the lack', 'of any kind of posts about the recent Pens massacre of the Devils. Actually,', 'I am bit puzzled too and a bit relieved. However, I am going to put an end', "to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they", 'are killing those Devils worse than I thought. Jagr just showed you why', 'he is much better than his regular season stats. He is also a lot', 'fo fun to watch in the playoffs. Bowman should let JAgr have a lot of', 'fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final', 'regular season game. PENS RULE!!!'
We cannot feed raw data like this directly to the machine. It must first be converted into a vector of numerical values representing each document. Utilities like CountVectorizer and TfidfTransformer, provided by scikit-learn, are used to turn raw text into meaningful vectors.
CountVectorizer tokenizes a given collection of text documents and builds a vocabulary of known words. When you call fit_transform on the collection, each document is encoded as a vector with the length of the full vocabulary, holding an integer count of how many times each word appeared in that document, as shown in the above picture. The returned vectors are mostly sparse. To see what the function has done, you can convert the result into a NumPy array by calling its toarray() method.
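To make the counting step concrete, here is a minimal hand-rolled sketch of a bag-of-words encoding. It is not scikit-learn's implementation: the two-sentence corpus is hypothetical, and the tokenizer only approximates CountVectorizer's default behaviour (lowercasing and keeping tokens of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
docs = ["The Pens beat the Devils", "The Devils lost"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # This approximates CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token across the corpus, sorted alphabetically.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One count vector per document, aligned with the vocabulary order.
count_vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(count_vectors)
```

Each row has one slot per vocabulary word, so most entries are zero once the vocabulary grows; this is why the real vectorizer returns a sparse matrix.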
count_vecto = CountVectorizer()
x_train_count = count_vecto.fit_transform(dataset.data)
x_train_count.data
This gives output like the following:
array([1, 1, 1, ..., 1, 1, 1])
This representation alone is not very helpful, because it also counts words such as ‘the’, ‘an’, and ‘a’, which appear many times throughout the documents; their large counts are not meaningful in the encoded vectors.
tfid_vecto = TfidfTransformer()
TfidfTransformer addresses this by re-weighting the count matrix produced by CountVectorizer; it does not tokenize text itself. TF-IDF scores are word frequency scores that try to highlight the words that are most relevant to the context.
Term Frequency measures how often a term occurs in a document. Inverse Document Frequency weights each word by how rare it is across the documents; in other words, it downscales words that appear in most documents, such as ‘a’, ‘an’, and ‘the’. The use case of TF-IDF is similar to that of the CountVectorizer.
Here we have already performed the first step of TF-IDF, the term counts. We can pass the CountVectorizer output directly to the transformer to calculate inverse document frequencies, i.e. to downscale the data.
train_tfid = tfid_vecto.fit_transform(x_train_count)
train_tfid.data
array([0.02250144, 0.07093131, 0.02297898, ..., 0.03921079, 0.05632967, 0.04197143])
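The weighting itself can be worked through by hand. The sketch below assumes the scikit-learn defaults for TfidfTransformer (smooth_idf=True and norm='l2'), under which idf = ln((1 + n) / (1 + df)) + 1 and each document vector is then scaled to unit length. The count matrix and vocabulary are hypothetical.

```python
import math

# Hypothetical count matrix: 2 documents x 5 terms.
vocabulary = ["beat", "devils", "lost", "pens", "the"]
counts = [[1, 1, 0, 1, 2],
          [0, 1, 1, 0, 1]]

n_docs = len(counts)

# Document frequency: in how many documents each term occurs at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0)
            for j in range(len(vocabulary))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(vec):
    # Scale a vector to unit Euclidean length (all-zero vectors are left as-is).
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

# TF-IDF: multiply each count by its term's idf, then normalize each row.
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in counts]

for word, weight in zip(vocabulary, idf):
    print(f"{word:>6}: idf = {weight:.3f}")
```

Terms that occur in every document ('devils' and 'the' here) receive the minimum weight of 1, while terms found in only one document receive a larger weight.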
Use of the Multinomial Naive Bayes Classifier in identifying entities:
Up to now, we have converted our raw data into a vector representation. In short, it captures how often each word appears in each document, from which the probability of a word appearing within each category can be estimated. To perform our classification task based on these probabilities, we use MultinomialNB, a member of the Naive Bayes classifier family. I strongly encourage you to read this article to understand the Naive Bayes classifier.
MultinomialNB is designed for discrete features such as word counts, which makes it a natural fit for text represented by term frequencies; in practice, fractional values such as TF-IDF weights also work. It uses the term frequencies in the training data to compute smoothed maximum likelihood estimates of the conditional probability of each word given a category.
model = MultinomialNB().fit( train_tfid, dataset.target )
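To show what the classifier estimates, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is an illustration rather than scikit-learn's implementation; the labelled mini-corpus, the class names, and the helper functions log_posterior and predict are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled mini-corpus (already tokenized).
train = [
    ("autos", ["car", "engine", "car"]),
    ("autos", ["engine", "brakes"]),
    ("religion", ["bible", "church"]),
    ("religion", ["bible", "faith", "church"]),
]

alpha = 1.0  # Laplace smoothing strength
vocab = sorted({w for _, words in train for w in words})

# Number of training documents per class, and word counts per class.
class_docs = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, words in train:
    word_counts[label].update(words)

def log_posterior(words, label):
    # Unnormalized log posterior: log P(class) + sum of log P(word | class).
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values()) + alpha * len(vocab)
    for w in words:
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][w] + alpha) / total)
    return score

def predict(words):
    # Pick the class with the highest log posterior.
    return max(class_docs, key=lambda label: log_posterior(words, label))

print(predict(["my", "car", "engine"]))
print(predict(["the", "bible"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from driving that class's probability to zero.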
Let’s try to predict the category of a few new sentences that relate to some of the topics in our training data.
text = ["i have a motorbike which made by honda.", "i have GPU based system.", "The Bible is simply the written core of that tradition."]
test = count_vecto.transform(text)
pred = model.predict(tfid_vecto.transform(test))
pred
array([ 7, 11, 15])
Cross-check the outcomes with the class.
for a, b in zip(text, pred):
    print(a, '--->is predicted as--->', dataset.target_names[b])
i have a motorbike which made by honda. --->is predicted as---> rec.autos
i have GPU based system. --->is predicted as---> sci.crypt
The Bible is simply the written core of that tradition. --->is predicted as---> soc.religion.christian
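The count, TF-IDF, and classification steps can also be chained so that new text always passes through the same fitted transforms as the training data. The following is a minimal sketch using scikit-learn's make_pipeline; the four-document corpus, its labels, and the class names are hypothetical stand-ins for the newsgroup data, chosen so the example runs without downloading anything.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training corpus with two classes.
class_names = ["autos", "religion"]
train_docs = [
    "the engine and brakes of my car",
    "new tires for the car engine",
    "the church read the bible on sunday",
    "a sermon about the bible and faith",
]
train_labels = [0, 0, 1, 1]

# Chain the three steps; fit() learns the vocabulary, the IDF weights,
# and the class-conditional word probabilities in one call.
pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
pipeline.fit(train_docs, train_labels)

# predict() reuses the fitted vocabulary and IDF weights on the new text.
new_docs = ["my car engine is loud", "the bible was read aloud"]
pred = pipeline.predict(new_docs)

for doc, label in zip(new_docs, pred):
    print(doc, "--->is predicted as--->", class_names[label])
```

The design point is that the vectorizer and the TF-IDF transformer are fitted once on the training data and only applied, never refitted, at prediction time.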
Note that we have used the complete dataset as the training set; on some systems this may cause a memory error. In that case, set subset = 'train' where we defined our dataset.
We have learned about an NLP task in which we converted raw text into numeric vectors using the frequency distribution of the text content. This frequency distribution can be computed using various methods; we mainly focused on CountVectorizer and the TF-IDF transformer, depending on the use case. Finally, we used a Multinomial Naive Bayes classifier to classify new text and produce meaningful output.