
Hands-On Guide To Different Tokenization Methods In NLP


You can Google almost anything today and be fairly sure to find something related to it, thanks to the huge amount of text data freely available on the internet. It is tempting to feed all of this data into machine learning models, but there is a catch: machines do not read and understand text the way we do. So how do we work around this?

The answer lies in Natural Language Processing. Unstructured text data needs to be cleaned before we can proceed to the modelling stage, and tokenization, the splitting of a phrase or a paragraph into words or sentences, is where that cleaning begins.


In this article, we will start with the first step of data pre-processing, i.e. tokenization, and implement different methods in Python to tokenize text data.

Tokenize Words Using NLTK

Let’s start with the tokenization of words using the NLTK library. word_tokenize() breaks the given string into a list of word tokens, splitting on whitespace and punctuation.

#Tokenize words
import nltk
nltk.download('punkt')  # download the Punkt models used by NLTK's tokenizers
from nltk.tokenize import word_tokenize
text = "Machine learning is a method of data analysis that automates analytical model building"
word_tokenize(text)
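
If the Punkt models downloaded correctly, the call above should return a plain Python list of word tokens, roughly:

['Machine', 'learning', 'is', 'a', 'method', 'of', 'data', 'analysis',
 'that', 'automates', 'analytical', 'model', 'building']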

Here, we tokenize sentences instead of words. sent_tokenize() relies on NLTK's pre-trained Punkt model to detect sentence boundaries rather than simply splitting at every full stop (.).

#Tokenize Sentence
from nltk.tokenize import sent_tokenize
text = "Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
sent_tokenize(text)
#Tokenize text in a different language (Spanish)
import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
text = 'Hola amigo. Me llamo Ankit.'
spanish_tokenizer.tokenize(text)
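
A quick way to see what NLTK's word tokenizer actually does is to run word_tokenize on a sentence that contains punctuation; the comma and the full stop come back as tokens of their own. A minimal sketch using the same NLTK function as above:

#Punctuation becomes its own token
from nltk.tokenize import word_tokenize
word_tokenize("Systems can learn from data, identify patterns and make decisions.")
# ['Systems', 'can', 'learn', 'from', 'data', ',', 'identify', 'patterns', 'and', 'make', 'decisions', '.']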

Regular Expression

A regular expression (regex) is used to match or find strings using a pattern built from characters such as letters and numbers. We will use NLTK's RegexpTokenizer together with Python's re library to tokenize the words and sentences of a paragraph.

from nltk.tokenize import RegexpTokenizer
# Each token is a run of word characters or apostrophes; punctuation is dropped
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Machine learning is a method of data analysis that automates analytical model building"
tokenizer.tokenize(text)
#Split Sentences
import re
text = """Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."""
# Split wherever a ., ! or ? is followed by a space
sentences = re.compile('[.!?] ').split(text)
sentences
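
RegexpTokenizer can also work the other way round: with gaps=True the pattern describes the separators between tokens rather than the tokens themselves, so it behaves much like the re.split call above. A small sketch, reusing the paragraph stored in text:

#The pattern now describes the gaps between sentences
from nltk.tokenize import RegexpTokenizer
sentence_splitter = RegexpTokenizer(r'[.!?] ', gaps=True)
sentence_splitter.tokenize(text)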

Split()

The split() method breaks a string at the specified separator and returns a list of substrings; if no separator is given, it splits on whitespace.


text = """Machine learning is a method of data analysis that automates analytical model building"""
# Splits at space 
text.split()
#Split Sentence
text = """Machine learning is a method of data analysis that automates analytical model building.It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."""
# Splits at space 
text.split('.')
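
split() is the simplest option but also the most limited: it never treats punctuation as a separate token, and the separator itself is discarded, so splitting on '.' strips the full stops and leaves a trailing empty string. A quick illustration:

#Punctuation stays glued to the neighbouring word
"Machine learning automates model building.".split()
# ['Machine', 'learning', 'automates', 'model', 'building.']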

spaCy

spaCy is an open-source library that can be used for tokenizing both words and sentences. We will load the en_core_web_sm model, which supports the English language.

import spacy
sp = spacy.load('en_core_web_sm')
# Passing a text through the pipeline returns a Doc whose elements are word tokens
sentence = sp(u'Machine learning is a method of data analysis that automates analytical model building.')
print(sentence)
L = []
for word in sentence:
    L.append(word)
L
#Split Sentences
sentence = sp(u'Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.')
print(sentence)
# The .sents attribute iterates over the detected sentence spans
x = []
for sent in sentence.sents:
    x.append(sent.text)
x
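
The same results can be collected more idiomatically with list comprehensions, using each token's .text attribute to get plain strings instead of spaCy Token objects:

#List-comprehension equivalents of the loops above
words = [token.text for token in sentence]
sents = [s.text for s in sentence.sents]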

Gensim

The last method we will cover in this article is Gensim, an open-source Python library for topic modelling and document similarity retrieval on large corpora.

from gensim.utils import tokenize
text = """Artificial intelligence, the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings."""
# tokenize() returns a generator, so wrap it in list() to see the tokens
list(tokenize(text))
#Split Sentence
# Note: gensim.summarization was removed in Gensim 4.0, so this needs an older Gensim (< 4.0)
from gensim.summarization.textcleaner import split_sentences
text = """Artificial intelligence, the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience."""
split1 = split_sentences(text)
split1
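
gensim.utils.tokenize also accepts optional keyword arguments such as lowercase and deacc, which lowercase the tokens and strip accent marks respectively; handy when normalizing text before modelling. A minimal sketch (the Spanish sample sentence is just an illustration):

#Lowercase the tokens and remove accents while tokenizing
from gensim.utils import tokenize
list(tokenize("Árboles de decisión y Machine Learning", lowercase=True, deacc=True))
# expected output, roughly: ['arboles', 'de', 'decision', 'y', 'machine', 'learning']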

Final Thoughts

In this article, we implemented different methods of tokenizing a given text. Data needs to be cleaned before the model-building process starts, and tokenization is a crucial step in that cleaning/pre-processing pipeline. We can further extend this by exploring other pre-processing techniques such as part-of-speech tagging and stop word removal. The complete code of the above implementation is available in AIM's GitHub repository; please visit this link to find the notebook for this code.
