In any data science project life cycle, cleaning and preprocessing the data is one of the most important steps. If you take unstructured text data, which is among the most complex forms of data, and feed it straight into modeling, one of two things will happen: either you will run into errors, or your model will not perform as you expected. You might have wondered how modern voice assistants such as Google Assistant, Alexa and Siri can understand, process and respond to human language; the heavy lifter behind them is Natural Language Processing.
Natural language processing (NLP) is a technique that combines computer science and artificial intelligence to analyse data semantically. It is essentially the art of extracting meaningful information from raw data by bridging natural language and computers, which means analysing and modelling high volumes of natural language data. By utilizing the power of NLP, real-world business problems can be solved, such as document summarization, title generation, caption generation, fraud detection, speech recognition and, importantly, neural machine translation.
Text preprocessing is a method used in NLP to clean text and prepare it for model building. Raw text is messy and contains noise in various forms, such as emoticons, punctuation, and text written as numbers or special characters. We have to deal with these problems because machines do not understand raw text; they only work with numbers. To start with text preprocessing, certain libraries written in Python simplify the process, and their simple, straightforward syntax gives a lot of flexibility. The first is NLTK, which stands for Natural Language Toolkit, useful for tasks like stemming, POS tagging, tokenization, lemmatization and many more.
You probably know about contractions; hardly any sentence is free of them, since we tend to write words like didn’t instead of did not. When we tokenize such words, they end up split into pieces like ‘didn’ and ‘t’, which are useless on their own. To deal with such words, there is a library called contractions. BeautifulSoup is a library mainly used for web scraping, but since we sometimes get data containing HTML tags and URLs, BeautifulSoup is used to deal with those as well. And to convert numbers into words, we use the inflect library.
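Under the hood, the contractions library essentially keeps a large mapping from contracted forms to their expansions. A minimal sketch of the idea, using a tiny made-up mapping (the real library ships a far larger one):

```python
import re

# tiny illustrative mapping; the real contractions library covers many more forms
CONTRACTIONS = {"didn't": "did not", "can't": "cannot", "i'm": "i am"}

def expand_contractions(text):
    # match any known contraction as a whole word, case-insensitively
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I didn't know she can't come"))
# -> I did not know she cannot come
```

Expanding contractions before tokenization avoids the ‘didn’ ‘t’ fragments entirely.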
Implementing Text Preprocessing
Below, with Python code, we remove the noise from raw text data taken from a Twitter sentiment analysis dataset. Later we will perform stop word removal, stemming and lemmatization.
Import all dependencies:
!pip install contractions

import re, string, unicodedata
import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
The first step is removing the noise in the data. In the text domain, noise refers to anything that is not part of regular human language, and it comes in various forms such as special characters, parentheses, square brackets, white space, URLs and punctuation.
Below is the sample text that we will be processing:
As you can see, there are many HTML tags and one URL. First, we need to remove them, and for that we use BeautifulSoup. The code snippet below removes both:
# to remove HTML tags
def html_remover(data):
    beauti = BeautifulSoup(data, 'html.parser')
    return beauti.get_text()

# to remove URLs
def url_remover(data):
    return re.sub(r'https?://\S+', '', data)

def web_associated(data):
    text = html_remover(data)
    text = url_remover(text)
    return text

new_data = web_associated(data)
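URL removal relies on a regular expression; a self-contained sketch of the idea, using the pattern `https?://\S+`, which matches http or https followed by any run of non-space characters (the sample tweet here is made up for illustration):

```python
import re

def url_remover(data):
    # strip anything that looks like an http(s) URL
    return re.sub(r'https?://\S+', '', data)

tweet = "loved the talk https://example.com/watch?v=1 so much"
print(url_remover(tweet))
```

Note that `\S+` keeps consuming until the next whitespace, so query strings and fragments are removed along with the domain.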
After removing the HTML tags and the URL, there is still some noise in the form of punctuation, white space, and text inside parentheses; this needs to be treated as well:
def remove_round_brackets(data):
    return re.sub(r'\(.*?\)', '', data)

def remove_punc(data):
    trans = str.maketrans('', '', string.punctuation)
    return data.translate(trans)

def white_space(data):
    return ' '.join(data.split())

def complete_noise(data):
    new_data = remove_round_brackets(data)
    new_data = remove_punc(new_data)
    new_data = white_space(new_data)
    return new_data

new_data = complete_noise(new_data)
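As a quick sanity check, these helpers can be exercised on a small string (the sample sentence is made up for illustration; only the standard library is needed):

```python
import re, string

def remove_round_brackets(data):
    # drop any text wrapped in round brackets
    return re.sub(r'\(.*?\)', '', data)

def remove_punc(data):
    # strip all ASCII punctuation characters
    return data.translate(str.maketrans('', '', string.punctuation))

def white_space(data):
    # collapse runs of whitespace into single spaces
    return ' '.join(data.split())

def complete_noise(data):
    return white_space(remove_punc(remove_round_brackets(data)))

sample = "Great   movie (seen twice)!!  Loved it..."
print(complete_noise(sample))  # -> Great movie Loved it
```

The order matters: brackets are removed first so their contents never reach the punctuation and whitespace steps.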
Now, as you can see, we have successfully removed all the noise from the text.
Usually, text normalisation starts with tokenizing the text, which splits our longer corpus into chunks of words; NLTK's tokenizer can do this. Before that, we need to lowercase each word of our corpus, convert numbers to words and, finally, replace contractions.
def text_lower(data):
    return data.lower()

def contraction_replace(data):
    return contractions.fix(data)

def number_to_text(data):
    temp_str = data.split()
    new_words = []
    engine = inflect.engine()
    for i in temp_str:
        # if the word is a digit, convert it to
        # words; otherwise keep it as it is
        if i.isdigit():
            temp = engine.number_to_words(i)
            new_words.append(temp)
        else:
            new_words.append(i)
    return ' '.join(new_words)

def normalization(data):
    text = text_lower(data)
    text = number_to_text(text)
    text = contraction_replace(text)
    nltk.download('punkt')
    tokens = nltk.word_tokenize(text)
    return tokens

tokens = normalization(new_data)
print(tokens)
Now we are near the end of basic text preprocessing; we are left with one major thing: stop words. While analyzing text data, stop words carry hardly any meaning; they mostly serve a grammatical, decorative purpose. Therefore, to further reduce dimensionality, it is necessary to remove stop words from the corpus.
In the end, we have two choices for representing our corpus: stemmed or lemmatized words. Stemming tries to reduce a word to its root form, and mostly it does so by simply cutting off suffixes, so the result need not be a valid word. Lemmatization does the same job but in a proper way: it converts a word to its dictionary root, so, for example, ‘scenes’ becomes ‘scene’. One can choose between stemmed and lemmatized words.
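To see the tradeoff concretely, here is a toy illustration; both the naive suffix stripper and the tiny lemma dictionary are made up for this example, while real stemmers and lemmatizers such as NLTK's are far more sophisticated:

```python
def naive_stem(word):
    # crude suffix stripping, in the spirit of a stemmer:
    # chop common endings without checking the result is a real word
    for suffix in ('ing', 'ly', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# a tiny hand-made lemma lookup standing in for a real lemmatizer
LEMMAS = {'scenes': 'scene', 'studies': 'study', 'running': 'run'}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ['scenes', 'studies', 'running']:
    print(w, '->', naive_stem(w), 'vs', naive_lemmatize(w))
# scenes -> scen vs scene
# studies -> studi vs study
# running -> runn vs run
```

The stems (‘scen’, ‘studi’, ‘runn’) are not real words, while the lemmas are; this is the tradeoff between the two approaches.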
def stopword(data):
    nltk.download('stopwords')
    clean = []
    for i in data:
        if i not in stopwords.words('english'):
            clean.append(i)
    return clean

def stemming(data):
    stemmer = LancasterStemmer()
    stemmed = []
    for i in data:
        stem = stemmer.stem(i)
        stemmed.append(stem)
    return stemmed

def lemmatization(data):
    nltk.download('wordnet')
    lemma = WordNetLemmatizer()
    lemmas = []
    for i in data:
        lem = lemma.lemmatize(i, pos='v')
        lemmas.append(lem)
    return lemmas

def final_process(data):
    stopwords_remove = stopword(data)
    stemmed = stemming(stopwords_remove)
    lemm = lemmatization(stopwords_remove)
    return stemmed, lemm

stem, lemmas = final_process(tokens)
Below we can see the stemmed and lemmatized words:
In this article, we discussed why preprocessing text is necessary for model building. We started by learning how to remove noise such as HTML tags and URLs; before removing noise, it helps to take an overview of the corpus so the noise-removal steps can be customised to it. We also observed the clear tradeoff between stemmed and lemmatized words, and since lemmatization preserves valid words, we should generally proceed with lemmatized words.