Complete Tutorial on Parts Of Speech (PoS) Tagging

Part-of-speech tagging (also written POS tagging or POST) is the task of classifying each word by its part of speech and labelling it accordingly. The set of labels, or tags, is called a tagset. This article shows how to implement the POS tagging step of an NLP task.

In school we learned that a word's part of speech indicates its function in a sentence. English is commonly described as having nine parts of speech: noun, pronoun, verb, adverb, article, adjective, preposition, conjunction, and interjection. A word must fit the proper part of speech for the sentence to make sense.

Part of speech is not only part of studying a language's grammar; it is also a very useful step in text preprocessing for NLP. NLP aims to let a machine communicate with a human or with another machine, so the machine needs to understand the part of speech of each word.

Code Implementation: Parts Of Speech Tagging

Importing the libraries.

Input:

import nltk
from nltk import word_tokenize

Downloading the packages.

Input:

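# Tokenizer models, the English POS tagger, and the tagset help files.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; if a LookupError appears, download the resource it names.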

Viewing the POS tagset.

Input:

nltk.help.upenn_tagset()

Output:

Here we can see the set of tags that NLTK provides (the Penn Treebank tagset). The tagger chooses a label for every word from this set.
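For quick reference while reading tagger output, a few of the most common tags can be kept in a small lookup table. The sketch below is a hand-picked subset of the Penn Treebank tagset; the names PENN_TAGS and describe_tag are made up for this illustration and are not part of NLTK.

```python
# A few Penn Treebank tags and their meanings (a small hand-picked subset,
# not the full tagset printed by nltk.help.upenn_tagset()).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "JJ": "adjective",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "RB": "adverb",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
}


def describe_tag(tag):
    """Return a short description for a tag, or a fallback for tags not in the subset."""
    return PENN_TAGS.get(tag, "unknown tag")


print(describe_tag("NN"))  # noun, singular or mass
print(describe_tag("JJ"))  # adjective
```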

Let’s check the tags for a sample sentence.

Input:

sentence = word_tokenize("whatever the world is a great place")
nltk.pos_tag(sentence)

Output:

Here we can see that every word has been given a tag. For example, the word “world” is tagged NN (noun), and “great” is tagged JJ (adjective).
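The tagger returns a list of (word, tag) pairs, so ordinary list operations are enough to pull out the words of one part of speech. The sketch below uses a hand-written sample in that format rather than real tagger output: only the tags for “world” and “great” come from the text above, the other tags are assumptions, and words_with_prefix is a hypothetical helper name.

```python
# Hand-written sample in the (word, tag) format that nltk.pos_tag() returns.
# Only the tags for "world" (NN) and "great" (JJ) come from the article;
# the remaining tags are assumptions made for illustration.
tagged = [
    ("the", "DT"),
    ("world", "NN"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("great", "JJ"),
    ("place", "NN"),
]


def words_with_prefix(tagged_words, prefix):
    """Collect words whose tag starts with the given prefix (e.g. 'NN' or 'JJ')."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


nouns = words_with_prefix(tagged, "NN")
adjectives = words_with_prefix(tagged, "JJ")
print(nouns)       # ['world', 'place']
print(adjectives)  # ['great']
```

Matching on a prefix rather than the exact tag means that "NN" also catches plural nouns (NNS) and proper nouns (NNP).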

Let’s look at some more examples; this time, we focus on the same word used in different grammatical roles.

Input:

sentence = word_tokenize("applicant is removed from applicant list of the job ")
nltk.pos_tag(sentence)

In the input, the word “applicant” appears twice, in different grammatical roles. Let’s check the labels the tagger gives to each occurrence.

Output:

Here we can see that the first “applicant” is labelled NN (noun), while the second, which modifies “list”, is labelled JJ (adjective).
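In a longer text it helps to find such words automatically. A minimal sketch, assuming the tagged sentence is available as a list of (word, tag) pairs: group the tags by word and keep the words that received more than one. The sample list is hand-written; only the two tags for “applicant” are taken from the result described above, and ambiguous_words is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written sample: the NN and JJ tags for "applicant" follow the article;
# every other tag here is an assumption made for illustration.
tagged = [
    ("applicant", "NN"), ("is", "VBZ"), ("removed", "VBN"), ("from", "IN"),
    ("applicant", "JJ"), ("list", "NN"), ("of", "IN"), ("the", "DT"), ("job", "NN"),
]


def ambiguous_words(tagged_words):
    """Return {word: sorted list of tags} for words tagged in more than one way."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}


print(ambiguous_words(tagged))  # {'applicant': ['JJ', 'NN']}
```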

Input:

sentence = word_tokenize("allow us to add lines in list of allow actions")
nltk.pos_tag(sentence)

Output:

Again, the tagger assigns a tag to each occurrence of ‘allow’ based on its context.

This is not all; there are more features we can use. To introduce them, let us import the Brown corpus.

Input:

nltk.download('brown')  # fetch the Brown corpus data if it is not already present
from nltk.corpus import brown
brown.categories()

Output:

Here we can see that the corpus has 15 categories. We are going to use the news category.

Input:

text_news = nltk.Text(word.lower() for word in nltk.corpus.brown.words(categories='news'))
text_news

Output:

Here we have loaded the news category of the Brown corpus as an nltk.Text object. One useful feature of this object is the similar() method, which finds words that appear in the same contexts as a given word. It works from the surrounding words rather than from the tags, but words that share contexts usually share a part of speech; for example, “man” is a noun (NN), and the words returned for it are mostly nouns too. Before trying that, let’s get an overview of how many words in the corpus belong to each part of speech.

Input:

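nltk.download('universal_tagset')  # mapping needed for tagset='universal'
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
# Count how often each universal tag occurs in the news category.
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()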

Output:

Here we can see the count of each universal tag in the corpus, from most to least frequent.
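nltk.FreqDist is built on Python's collections.Counter, so the same counting can be tried without downloading a corpus. The sketch below counts tags in a tiny hand-written sample that uses universal tag names; the sentence and its tags are illustrative and are not taken from the Brown corpus.

```python
from collections import Counter

# Tiny hand-written sample with universal tags (illustrative, not Brown data).
tagged = [
    ("the", "DET"), ("jury", "NOUN"), ("praised", "VERB"),
    ("the", "DET"), ("city", "NOUN"), ("election", "NOUN"), (".", "."),
]

# Same generator-expression pattern as the FreqDist call, with a plain Counter.
tag_counts = Counter(tag for _, tag in tagged)

print(tag_counts.most_common(2))  # [('NOUN', 3), ('DET', 2)]
```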

Input:

tag_fd.plot(cumulative=True)

Output:
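A cumulative plot shows, for the tags in descending order of frequency, the running total of tokens covered so far. The numbers behind such a curve can be sketched in plain Python; the counts below are made up for illustration and are not the real Brown news figures, and cumulative_counts is a hypothetical helper name.

```python
from itertools import accumulate

# Illustrative counts only; these are not the real Brown news figures.
tag_counts = [("VERB", 30), ("NOUN", 50), ("DET", 20)]


def cumulative_counts(counts):
    """Sort tags by descending count and pair each tag with the running total."""
    ordered = sorted(counts, key=lambda item: item[1], reverse=True)
    running = accumulate(count for _, count in ordered)
    return [(tag, total) for (tag, _), total in zip(ordered, running)]


print(cumulative_counts(tag_counts))
# [('NOUN', 50), ('VERB', 80), ('DET', 100)]
```

The last value always equals the total number of tokens, which is why a cumulative curve flattens out once the few most frequent tags have been counted.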

Now we can find words that are used in the same way as a given word, which usually means words with the same part of speech.

Input:

text_news.similar('man')

Output:

Here we can see that other nouns like “man” come out of the corpus. Similarly, we can do the same for other parts of speech.
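The idea behind this kind of lookup can be sketched in a few lines: two words count as similar when they occur between the same neighbouring words. The code below is a simplified illustration of that idea on a toy token list, not NLTK's actual implementation; build_contexts and similar_words are hypothetical names.

```python
from collections import defaultdict


def build_contexts(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    padded = ["<s>"] + tokens + ["</s>"]  # sentinels so edge words have two neighbours
    for i in range(1, len(padded) - 1):
        contexts[padded[i]].add((padded[i - 1], padded[i + 1]))
    return contexts


def similar_words(tokens, target):
    """Rank other words by how many contexts they share with the target word."""
    contexts = build_contexts(tokens)
    target_contexts = contexts[target]
    scores = {
        word: len(ctx & target_contexts)
        for word, ctx in contexts.items()
        if word != target and ctx & target_contexts
    }
    # Most shared contexts first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))


tokens = "the man said yes . the woman said no . the man left . the boy left .".split()
print(similar_words(tokens, "man"))  # ['boy', 'woman']
```

In the toy text, “woman” and “boy” each appear in a slot that “man” also fills, so both are returned, and both are nouns even though no tags were used.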

Input:

text_news.similar('said')

Output:

Input:

text_news.similar('the')

Output:

Observe the results for each word: most of the words returned share its part of speech. NLTK ships various corpora that we can use for practice. Tagged corpora are also available in other languages, such as Hindi, Portuguese, and Chinese.

We can also load a list of words together with their tags from such a corpus. In the next step, I load the Hindi portion of the Indian tagged corpus.

Input:

nltk.download('indian')
nltk.corpus.indian.tagged_words('hindi.pos')

Output:

In this article, we have seen how to assign part-of-speech tags to the words of a sentence and how to read those tags, using NLTK. POS tagging is widely used in NLP: once each word in a sequence carries a tag, an algorithm can tell apart the different uses of the same word in different contexts.
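One simple way to hand that distinction to a downstream algorithm is to join each word with its tag, so that the two uses of a word become two different tokens. A minimal sketch, assuming the tagged sentence is a list of (word, tag) pairs; the sample tags are illustrative and tag_tokens is a hypothetical helper name.

```python
# Hand-written sample: the NN and JJ tags for "applicant" follow the article;
# the other tags are assumptions made for illustration.
tagged = [
    ("applicant", "NN"), ("is", "VBZ"), ("removed", "VBN"),
    ("from", "IN"), ("applicant", "JJ"), ("list", "NN"),
]


def tag_tokens(tagged_words):
    """Join each word with its tag so different uses become different tokens."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_words]


features = tag_tokens(tagged)
print(features[0], features[4])  # applicant_NN applicant_JJ
```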

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
