Complete Tutorial on Parts Of Speech (PoS) Tagging

Back in school, we learned that the part of speech indicates the function of a word in a sentence. There are commonly nine parts of speech: noun, pronoun, verb, adverb, article, adjective, preposition, conjunction, and interjection, and a word needs to fit into the proper part of speech to make sense in a sentence.

The part of speech is not just a part of studying the grammar of a language; it is also a very useful step in text preprocessing for NLP. Since NLP aims to make a machine able to communicate with a human or with another machine, it becomes essential for the machine to understand the part of speech of each word.

Classifying words into their parts of speech and labelling them accordingly is called part of speech tagging, POS tagging, or POST. The set of labels/tags used for this is called a tagset. Next in the article, we will discuss how we can implement the POS tagging part of an NLP task.

Code Implementation: Parts Of Speech Tagging

Importing the libraries.

Input:

import nltk
from nltk import word_tokenize

Downloading the packages.

Input:

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

Viewing the POS tagsets.

Input:

nltk.help.upenn_tagset()

Output:

Here we can see the set of tags which NLTK provides; from these options, a label will be assigned to every word.
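
If we only need the meaning of a particular tag, the same helper also accepts a tag pattern; the tags NN and JJ below are just examples picked for illustration.

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('JJ')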

Let’s check the tags for a sentence.

Input:

sentence = word_tokenize("whatever the world is a great place")
nltk.pos_tag(sentence)

Output:

Here we can see that every word has been given a tag. For example, the word “world” has got the tag NN, a noun, and “great” has got the tag JJ, the tag for an adjective.
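
As a small sketch of how these tags can be used downstream (an addition, not part of the original walkthrough, reusing the sentence variable from above), we can filter the tagged sentence to keep only the nouns and adjectives:

tagged = nltk.pos_tag(sentence)
# keep only words whose tag starts with NN (nouns) or JJ (adjectives)
nouns_and_adjectives = [word for word, tag in tagged if tag.startswith(('NN', 'JJ'))]
print(nouns_and_adjectives)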

Let’s check some more examples; this time, we are focusing on homonyms.

Input:

sentence = word_tokenize("applicant is removed from applicant list of the job ")
nltk.pos_tag(sentence)

In the input, we have used the word ‘applicant’ twice with different parts of speech. So let’s check the labels the tagger gives to each of them.

Output:

Here we can see that the first ‘applicant’ gets the label NN, a noun, while the second ‘applicant’ is labelled JJ, an adjective.

Input:

sentence = word_tokenize("allow us to add lines in list of allow actions")
nltk.pos_tag(sentence)

Output:

Again, the tagger has given the right tag to each occurrence of the word ‘allow’.
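
For one more check on homonyms, we can borrow the well-known sentence from the NLTK book, where ‘refuse’ and ‘permit’ each appear once as a verb and once as a noun; the exact tags may vary slightly with the tagger version.

sentence = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(sentence)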

This is not all; there are some more features we can use. To introduce those features, let us import the Brown corpus.

Input:

from nltk.corpus import brown
brown.categories()

Output:

Here we can see that the corpus has 15 categories. We are going to use the news category of the corpus.
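
Before going further, a quick optional sanity check (not part of the original steps) shows how much text the news category actually contains:

print(len(brown.words(categories='news')))   # number of word tokens in the news category
print(len(brown.sents(categories='news')))   # number of sentences in the news category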

Input:

text_news = nltk.Text(word.lower() for word in nltk.corpus.brown.words(categories='news'))
text_news

Output:

Here we have loaded the news category of the Brown corpus. One of the important features of tagging is that we can find or extract words with similar tags; for example, ‘man’ is a noun with the tag NN, and using the similar() function we can find other words with a similar label or part of speech. Before finding these lexical categories, let’s first get an overview of the corpus’s word counts by part of speech.

Input:

nltk.download('universal_tagset')  # mapping needed to convert Brown tags to the universal tagset
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

Output:

Here we can see the counts of the universal tags in the corpus.
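
Going one step further (an illustrative addition that reuses the brown_news_tagged list from above), a conditional frequency distribution lets us list the most common words behind any one of these universal tags, for example NOUN:

cfd = nltk.ConditionalFreqDist((tag, word.lower()) for (word, tag) in brown_news_tagged)
cfd['NOUN'].most_common(10)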

Input:

tag_fd.plot(cumulative=True)

Output:

Now we can find the words with a similar part of speech.

Input:

text_news.similar('man')

Output:

Here we can see that other nouns like ‘man’ come out of the corpus. Similarly, we can extract words for other parts of speech as well.

Input:

text_news.similar('said')

Output:

Input:

text_news.similar('the')

Output:


Observe the words and the results we get for other words with a similar part of speech. There are various corpora available in NLTK that we can use for practice. Tagged corpora are also available in other languages like Hindi, Portuguese, Chinese, etc.

We can also import lists of words with their tag labels. In the next step, I am importing the Indian corpus for the tagged corpus of the Hindi language.

Input:

nltk.download('indian')
nltk.corpus.indian.tagged_words('hindi.pos')

Output:

In this article, we have seen how we can assign part of speech tags to words and extract the tags from a sentence, along with the usage of NLTK for POS tagging. Part of speech tagging is used heavily in NLP: when the words in a sequence carry tags, it becomes easier for algorithms to understand the correct sense of the same word in different situations.
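
As a closing sketch (an illustrative addition, assuming the imports and downloads from the session above), the tagged Brown sentences can even be used to train a very simple tagger of our own; a unigram tagger just memorises the most frequent tag for each word it has seen.

train_sents = brown.tagged_sents(categories='news')   # tagged training sentences
unigram_tagger = nltk.UnigramTagger(train_sents)
# words never seen in the training data will get the tag None
unigram_tagger.tag(word_tokenize("the world is a great place"))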
