Hands-On Guide To Different Tokenization Methods In NLP

Do you realize you can google almost anything today and be sure to find something related to it? That is because of the huge amount of text data freely available on the internet, and you may well be intrigued enough to use all this data in your machine learning models. The problem is that machines don’t recognize and understand text the way we do. Need a way to work around this?

The trick lies in the world of Natural Language Processing. Unstructured text data needs to be cleaned before we can proceed to the modelling stage, and tokenization, splitting a phrase or a paragraph into words or sentences, is where that cleaning begins.

In this article, we will start with the first step of data pre-processing, i.e. tokenization. We will then implement different methods in Python to tokenize text data.

Tokenize Words Using NLTK

Let’s start with the tokenization of words using the NLTK library. word_tokenize() splits the given string at whitespace and punctuation and returns a list of word tokens.

#Tokenize words
import nltk
nltk.download('punkt')  # the NLTK tokenizers below rely on the punkt models
from nltk.tokenize import word_tokenize
text = "Machine learning is a method of data analysis that automates analytical model building"
word_tokenize(text)

Here, we tokenize sentences instead of words. sent_tokenize() splits the text at sentence boundaries such as the full stop (.).

#Tokenize Sentence
from nltk.tokenize import sent_tokenize 
text = "Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
sent_tokenize(text) 
#Tokenize sentences in a different language (Spanish)
import nltk
nltk.download('punkt')
import nltk.data 
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle') 
text = 'Hola amigo. Me llamo Ankit.'
spanish_tokenizer.tokenize(text)
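
As an aside, the same Spanish sentence split can be done without loading the pickle by hand: sent_tokenize accepts a language argument. A minimal sketch, assuming the punkt models are already downloaded:

from nltk.tokenize import sent_tokenize
text = 'Hola amigo. Me llamo Ankit.'
sent_tokenize(text, language='spanish')  # ['Hola amigo.', 'Me llamo Ankit.']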

Regular Expression

Regular expressions match or find strings using a pattern built from characters such as letters and digits. We will use NLTK's RegexpTokenizer to tokenize words and the re library to split a paragraph into sentences.

from nltk.tokenize import RegexpTokenizer 
tokenizer = RegexpTokenizer(r"[\w']+")  # match runs of word characters and apostrophes
text = "Machine learning is a method of data analysis that automates analytical model building"
tokenizer.tokenize(text)
#Split Sentences
import re
text = """Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."""
sentences = re.compile('[.!?] ').split(text)
sentences
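
If you prefer to stay within the standard library, the same word tokenization can be done directly with re.findall, using the pattern from above. A minimal sketch:

import re
text = "Machine learning is a method of data analysis that automates analytical model building"
re.findall(r"[\w']+", text)  # returns the same list of word tokens as RegexpTokenizer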

Split()

The built-in split() method breaks the given string at the specified separator and returns a list of strings; by default it splits at whitespace.

text = """Machine learning is a method of data analysis that automates analytical model building"""
# Splits at space 
text.split()
#Split Sentence
text = """Machine learning is a method of data analysis that automates analytical model building.It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."""
# Splits at full stop (.)
text.split('.')
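
Splitting at a bare full stop leaves an empty string after the final '.' and keeps any leading spaces; one small sketch of how to clean that up:

# Strip whitespace and drop empty fragments produced by the trailing full stop
sentences = [s.strip() for s in text.split('.') if s.strip()]
sentences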

spaCy

spaCy is an open-source NLP library that tokenizes text into words and sentences. We will load the en_core_web_sm model, which supports the English language.

import spacy
sp = spacy.load('en_core_web_sm')
sentence = sp(u'Machine learning is a method of data analysis that automates analytical model building.')
print(sentence)
L=[]
for word in sentence:
    L.append(word)
L
#Split Sentences
sentence = sp(u'Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.')
print(sentence)
x = []
for sent in sentence.sents:
    x.append(sent.text)
x
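
The loops above collect spaCy Token and Span objects; if plain strings are more convenient, list comprehensions over token.text and sent.text do the same job. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy
sp = spacy.load('en_core_web_sm')
doc = sp('Machine learning is a method of data analysis that automates analytical model building. '
         'It is a branch of artificial intelligence based on the idea that systems can learn from data.')
word_strings = [token.text for token in doc]           # word tokens as plain strings
sentence_strings = [sent.text for sent in doc.sents]   # sentence tokens as plain strings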

Gensim

The last method that we will cover in this article uses Gensim, an open-source Python library for topic modelling and document similarity retrieval over large corpora.

from gensim.utils import tokenize
text = """Artificial intelligence, the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings."""
list(tokenize(text))
#Split Sentence
from gensim.summarization.textcleaner import split_sentences
text = """Artificial intelligence, the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience."""
split1 = split_sentences(text)
split1
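
Note that the gensim.summarization module was removed in Gensim 4.0, so split_sentences only works on older (3.x) releases. On a newer install, you can fall back on one of the sentence splitters shown earlier, for example (a minimal sketch):

# Fallback for Gensim >= 4.0: use NLTK's sentence tokenizer instead
from nltk.tokenize import sent_tokenize
sent_tokenize(text)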

Final Thoughts

In this article, we implemented different methods of tokenizing a given text. Before starting the model-building process, data needs to be cleaned, and tokenization is a crucial step in that cleaning/pre-processing. We can extend this further by exploring other pre-processing techniques such as part-of-speech tagging and stop-word removal. The complete code of the above implementation is available at AIM’s GitHub repository. Please visit this link to find the notebook of this code.
