Guide to Named Entity Recognition with spaCy and NLTK

Natural language processing (NLP) is the area of data science that deals with text data, from which we often need to extract features. Text carries a huge amount of information, and extracting it can yield important and insightful results. Typical NLP tasks involve steps such as part-of-speech tagging, stopword removal and named entity recognition. This article gives an overview of Named Entity Recognition (NER) along with a hands-on implementation. The major points discussed in the article are listed in the following table of contents.

Table of Contents

  1. What is the Named Entity?
  2. What is Named Entity Recognition (NER)?
  3. Implementation of NER using spaCy
  4. Implementation of NER using NLTK
  5. Applications of NER
  6. Difference between PoS tagging and NER

What is the Named Entity?

In any text data, named entities are objects that exist in the real world: a person, place or thing that can be referred to by its proper name. Examples of named entities are Narendra Modi, Mumbai and MacBook Pro, or anything else that can have a name.


More formally, a named entity denotes the proper name of an object. In the examples above, Narendra Modi is the name of a leader, Mumbai is the name of a city and MacBook Pro is the name of a laptop.

What is Named Entity Recognition (NER)?

Named entity recognition is the process of identifying named entities in text and assigning each one to its class. Raw text contains many kinds of words: some are stopwords, others are ordinary words that carry a part of speech, and some of them form named entities. Named entities do not express any feeling, but they can reveal relationships between two sentences or two words.

It is therefore often important to identify and classify them, so that a model working on the data can interpret the text more easily and produce accurate results. For example, take the sentence:


“Rahul sold his Maruti 800 at rupees 50000 in 2015”

A named entity recognition system would give a result such as:

“Rahul (person) sold his Maruti 800 (car/object) at rupees 50000 (price) in 2015 (time)”

Here we can see how a NER model works on the sentence: it classifies words as the name of a person, a car, a price and a time.
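To make the idea of "identify and classify" concrete, here is a minimal sketch that reproduces the annotation above with hand-written regular expressions. The patterns, the label names and the `tag_entities` helper are assumptions made up for this one sentence; a real NER model learns these distinctions from annotated data instead of relying on fixed rules.

```python
import re

# Hypothetical patterns that only cover the example sentence.
PATTERNS = [
    ("PERSON", r"\bRahul\b"),
    ("PRODUCT", r"\bMaruti 800\b"),
    ("MONEY", r"\brupees \d+\b"),
    ("DATE", r"\b(?:19|20)\d{2}\b"),
]

def tag_entities(sentence):
    """Return (text, label, start, end) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for match in re.finditer(pattern, sentence):
            found.append((match.group(), label, match.start(), match.end()))
    return sorted(found, key=lambda entity: entity[2])

sentence = "Rahul sold his Maruti 800 at rupees 50000 in 2015"
entities = tag_entities(sentence)
for text, label, start, end in entities:
    print(f"{text} -> {label} [{start}:{end}]")
```

Each result carries the entity text, its class and its character offsets, which is the same kind of information the libraries in the following sections return.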

Implementation of NER Using spaCy

There are various platforms that can be used for NER. Some of the notable ones are:

  • GATE (General Architecture for Text Engineering) – a Java-based toolkit.
  • Apache OpenNLP – a machine learning-based toolkit for natural language processing.
  • spaCy – features fast statistical NER and an open-source named entity visualizer.
  • NLTK – a standard Python library for various NLP tasks.

Since this article uses Python, I implement NER with some of the packages and models provided by spaCy and NLTK.

spaCy is an open-source NLP library that provides various facilities and packages that are helpful for NLP tasks such as POS tagging, lemmatization and fast sentence segmentation.

Let’s get started with importing libraries.

import spacy

Defining a sample text for testing the model. I have taken it from the Wikipedia page of the BCCI.

raw_text="""The Board of Control for Cricket in India (BCCI) is the governing body for cricket in India and is under the jurisdiction of Ministry of Youth Affairs and Sports, Government of India. The board was formed in December 1928 as a society, registered under the Tamil Nadu Societies Registration Act. It is a consortium of state cricket associations and the state associations select their representatives who in turn elect the BCCI Chief. Its headquarters are in Wankhede Stadium, Mumbai. Grant Govan was its first president and Anthony De Mello its first secretary. """

Loading the spaCy model with only the NER component enabled.

NER = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

Running the model on the sample text.

text= NER(raw_text)

Printing the named entities found by the model in our sample text.

for w in text.ents:
    print(w.text,w.label_)

Output:

We can also visualize the named entities in the text using the displacy module of spaCy.

from spacy import displacy
displacy.render(text, style="ent", jupyter=True)

Output:

One thing that can be confusing here is the named entity codes (labels). We can look up an explanation for any of them.

spacy.explain(u"NORP")

Output:

'Nationalities or religious or political groups'
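Once the entities are available as (text, label) pairs, as printed by the loop over text.ents earlier, it is often convenient to group them by label. The sketch below does this in plain Python. The `group_by_label` helper is a hypothetical name, and the sample pairs are hand-written stand-ins rather than actual model output.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Group (entity_text, label) pairs into {label: [texts]}.

    Duplicate texts under the same label are dropped; first-seen order is kept.
    """
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# Illustrative pairs only, written by hand for this example.
sample_pairs = [
    ("India", "GPE"),
    ("December 1928", "DATE"),
    ("Mumbai", "GPE"),
    ("India", "GPE"),
    ("Grant Govan", "PERSON"),
]
grouped = group_by_label(sample_pairs)
print(grouped)
```

With the spaCy document from above, the same helper could be called as group_by_label((w.text, w.label_) for w in text.ents).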

Implementation of NER using NLTK 

Let’s start by importing the library.

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

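NLTK provides some already tagged sentences, which we can access through the treebank corpus. Since ne_chunk needs the maxent_ne_chunker and words resources, we download them here as well. Newer NLTK releases may ask for differently named resources (for example maxent_ne_chunker_tab); in that case the error message names the one to download.

nltk.download('treebank')
nltk.download('maxent_ne_chunker')  # model used by ne_chunk
nltk.download('words')              # word list required by the chunker
sent = nltk.corpus.treebank.tagged_sents()
print(nltk.ne_chunk(sent[0]))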

Output:

We can also use NLTK for NER on our sample text.

Before extracting the named entities we need to tokenize the text and assign part-of-speech tags to the tokens.

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
raw_words= word_tokenize(raw_text)
tags=pos_tag(raw_words)

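Now we can perform NER on the tagged sample using the ne_chunk function of NLTK. With binary=True, entities are only marked as NE rather than being assigned a specific class such as PERSON or GPE.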

nltk.download('maxent_ne_chunker')
nltk.download('words')
ne = nltk.ne_chunk(tags,binary=True)
print(ne)

Output:

Since the result is very long, only a short excerpt of it is shown.

For better understanding, we can use the IOB tagging format. This format keeps the POS tag of each word and adds a tag that shows the word's position within an entity chunk and the chunk's type.

from nltk.chunk import tree2conlltags
iob = tree2conlltags(ne)
iob

Output:

Here the IOB tagging system contains tags of the form:

  • B-{CHUNK_TYPE} – for the word at the Beginning of a chunk
  • I-{CHUNK_TYPE} – for words Inside the chunk
  • O – for words Outside any chunk
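To show how these tags fit together, the following sketch collapses a list of (word, POS tag, IOB tag) triples, the shape returned by tree2conlltags, back into whole entities. The `iob_to_entities` helper is a hypothetical name and the triples are hand-written for illustration, not actual chunker output.

```python
def iob_to_entities(triples):
    """Collapse (word, pos, iob_tag) triples into (entity_text, chunk_type) pairs."""
    entities = []
    words, chunk_type = [], None

    def flush():
        # Store the chunk collected so far, if there is one.
        if words:
            entities.append((" ".join(words), chunk_type))

    for word, _pos, tag in triples:
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != chunk_type):
            # A B- tag, or an I- tag of a different type, starts a new chunk.
            flush()
            words, chunk_type = [word], tag[2:]
        elif tag.startswith("I-"):
            words.append(word)
        else:
            # "O": the word is outside any chunk, so close the open chunk.
            flush()
            words, chunk_type = [], None
    flush()
    return entities

# Hand-written triples in the binary=True style, where every chunk type is "NE".
triples = [
    ("Its", "PRP$", "O"),
    ("headquarters", "NNS", "O"),
    ("are", "VBP", "O"),
    ("in", "IN", "O"),
    ("Wankhede", "NNP", "B-NE"),
    ("Stadium", "NNP", "I-NE"),
    (",", ",", "O"),
    ("Mumbai", "NNP", "B-NE"),
    (".", ".", "O"),
]
chunks = iob_to_entities(triples)
print(chunks)
```

The B- tag is what lets two adjacent entities be told apart: without it, consecutive I- tags of the same type would merge into one chunk.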

Applications of NER

In the field of NLP, there are various use cases for NER models. Some examples are given below.

Information summarization from the documents

The amount of digital data is increasing rapidly, and most documents contain a lot of information that is not needed for a given task. For example, an insurance document can contain a great deal of information, of which an inspector needs only a few items. In such a scenario we can extract the name, email and phone number from the documents, so that inspecting them takes less time.
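As a rough illustration of this kind of extraction, the sketch below pulls email addresses and phone numbers out of a document with regular expressions. This is a pattern-based stand-in, not statistical NER: the two patterns are simplified assumptions, the sample document is invented, and person names would still need a NER model such as the ones shown earlier.

```python
import re

# Simplified, assumed patterns; real-world formats vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def extract_contacts(document):
    """Return the email addresses and phone numbers found in a document."""
    return {
        "emails": EMAIL_RE.findall(document),
        "phones": [phone.strip() for phone in PHONE_RE.findall(document)],
    }

# Invented sample text standing in for an insurance document.
document = (
    "Policy holder: A. Sharma. Contact: a.sharma@example.com "
    "or +91 98765 43210. Policy number 4471."
)
contacts = extract_contacts(document)
print(contacts)
```

The short policy number is not picked up as a phone number because the phone pattern requires at least ten characters.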

Optimizing search engine algorithms

A search engine holds a huge amount of information for any kind of query, but how does it know which website best fits the query? Take a search about named entity recognition as an example: somewhere in the results we would expect to see this article too. In such cases, the search engine can run a NER model on the articles, or on the information matching the query, and extract the named entities associated with them, so that the results for any query become more relevant and insightful.
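One simple way to picture this is an index that maps each extracted entity to the documents mentioning it. The sketch below is a toy version under stated assumptions: the document ids and entity lists are invented, the helper names are hypothetical, and the substring matching of queries is deliberately crude. It is not how any particular search engine works.

```python
from collections import defaultdict

def build_entity_index(doc_entities):
    """Map each lower-cased entity to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity.lower()].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of documents whose entities appear in the query text."""
    query = query.lower()
    hits = set()
    for entity, doc_ids in index.items():
        if entity in query:
            hits |= doc_ids
    return sorted(hits)

# Invented entity lists, as a NER model might produce per document.
doc_entities = {
    "doc1": ["BCCI", "Mumbai"],
    "doc2": ["Narendra Modi", "Mumbai"],
    "doc3": ["MacBook Pro"],
}
index = build_entity_index(doc_entities)
results = search(index, "cricket stadiums in Mumbai")
print(results)
```

Because the entities are extracted once per document, a query only needs to be compared against the index rather than against the full text of every document.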

Identification of different biomedical subparts

NER is used extensively on biomedical data for gene identification, DNA identification, and the identification of drug names and disease names. Such systems often use conditional random fields (CRFs) with features engineered for their domain data.

Content recommendations

Nowadays almost every application on our mobile phones asks for feedback so that it can keep improving its service. Applications such as Netflix and Prime ask for reviews of the content you have watched. When you provide a review or feedback, their algorithms can extract the important information from it using NER and, based on that, recommend suitable content to you or to similar users.

Difference between PoS Tagging and NER

  • In POS tagging we focus on the part of speech of each word in a sentence, whereas in NER we focus on recognizing the names of objects, persons, places, times and so on.
  • In the NLTK implementation we performed POS tagging before NER. POS tagging is thus applied to the whole data, whereas NER can build on the nouns recognized by POS tagging.
  • POS tagging goes through every word and assigns each one a class, whereas NER labels only the few words that appear as named entities in the data.
  • Because POS tagging adds a tag to every word, it increases the data size more than NER does.

Final Words

In this article we looked in detail at named entity recognition (NER). We discussed the different libraries that can help us perform NER and went through implementations with two popular ones, spaCy and NLTK. I encourage readers to try the other tools mentioned above as well, such as GATE and Apache OpenNLP; since they work in a different programming language, I have not provided implementations using them. NER plays a crucial part in many applications and also in some basic NLP processes that depend on it to give useful results.

Copyright Analytics India Magazine Pvt Ltd
