The real world is a messy, messy place. Everyone has different opinions, but nobody can help agreeing on this fact! What else is messy? Data! Lots and lots of data, which we collect, scrape or extract from numerous sources. So messy that, according to one survey, data scientists spend around 60% of their time cleaning data. Unfortunately, approximately 50-55% find it quite enjoyable. Yeah, it’s enjoyable.
But we know that data cleaning is time-consuming, right? That is why lots of tools have popped up over time, all aiming to make this crucial task more bearable (at least a little more bearable). The Python community hosts a ton of libraries to make data orderly and umm…legible? They range from handling never-ending data frames to styling them and analyzing datasets.
NLTK and regex are so well known across the community that we often overlook what else is out there for this hefty task. This blog is about one such new library (released only last year, in January 2020) called CleanText. CleanText is an open-source Python package (as is almost every package we see) built specifically for cleaning raw text data (as the name suggests, and as I believe you might have guessed).
It is a simple, easy-to-use package that needs minimal code and offers a ton of features to leverage (we all want that, right?). There are two main methods (yeah, mainly only two in this case), namely:
- clean: perform cleaning on raw text and then return the cleaned text in the form of a string.
- clean_words: same as above, cleaning raw text, but returns a list of clean words (even better)
The beautiful thing about the CleanText package is not the number of operations it supports but how easily you can use them. Some of them are listed below, and we’ll later write some code showcasing them for better understanding.
- Removing digits from the text.
- Converting the entire text to a uniform lowercase structure.
- Removing stopwords, with a choice of language for the stopword list.
- Stemming, the process of reducing words that share a common stem to that stem. For example, eat, eats, eating and eaten share the stem eat and are hence converted to it.
- And so on.
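To make the list above a bit more concrete, here is a rough plain-Python sketch of three of those operations (lowercasing, digit removal and stopword removal). It does not use CleanText at all; the names toy_clean_words and TOY_STOPWORDS are made up for this illustration, and the stopword list is a tiny hand-picked one rather than NLTK's. Stemming is left out on purpose, since it needs a proper stemmer.

```python
import re

# Tiny hand-picked stopword list, for illustration only (NLTK ships full lists).
TOY_STOPWORDS = {"the", "is", "a", "an", "to", "of", "and", "in"}


def toy_clean_words(text, stopwords=TOY_STOPWORDS):
    """Lowercase, drop digits, split into words, and filter out stopwords."""
    text = text.lower()                  # uniform lowercase
    text = re.sub(r"\d+", "", text)      # remove digits
    words = re.findall(r"[a-z]+", text)  # keep runs of plain letters only
    return [w for w in words if w not in stopwords]


print(toy_clean_words("The 3 cats sat in the garden of a house"))
# ['cats', 'sat', 'garden', 'house']
```

The point of the sketch is only to show what each bullet means; the library wraps steps like these behind a single call.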
Enough introduction; let’s see how to install and use CleanText.
Code Implementation of CleanText
The CleanText package requires Python 3 and NLTK.
For installing using pip, use the following command.
!pip install cleantext
After this, import the library.
import cleantext
from cleantext import clean
We’ll need to leverage stopwords from the NLTK library to use in our implementation.
import nltk
nltk.download('stopwords')
As mentioned earlier, there are two methods we can use, shown below.
To return the cleaned text as a string:
cleantext.clean("your_raw_text_here", all=True)
To return a list of clean words:
cleantext.clean_words("your_raw_text_here", all=True)
Application using Examples
First, here are the two main methods discussed above in action.
cleantext.clean("the_text_input_by_you", all=True)
Output - ‘thetextinputbyy’
cleantext.clean_words('Your s$ample !!!! tExt3% to cleaN566556+2+59*/133 wiLL GO he123re', all=True)
Output - [‘sampl’, ‘text’, ‘clean’]
Notice that with all=True, every operation has been applied before the output is returned.
A word of caution before moving on: the keyword arguments used in the rest of this post (fix_unicode, to_ascii, no_urls and so on) appear to belong to the separately distributed clean-text package (installed with pip install clean-text), which is also imported under the name cleantext. If clean() rejects one of these arguments, check which of the two packages is installed.
Text often contains letters outside the basic English alphabet, each represented by its own Unicode code point. Unicode text can also be stored in different encodings such as UTF-8, UTF-32 and so on.
a1 = 'Zürich'
# the fix_unicode argument helps us fix the Unicode errors present in our text
clean(a1, fix_unicode=True)
Notice the ‘ü’: it is not a plain ASCII character, so we convert it into a normal ASCII character; otherwise it may not be recognised as an English letter and could be discarded.
This may be the case with many such words that English has borrowed from other languages.
CLOSEST ASCII REPRESENTATION
Abbreviated from American Standard Code for Information Interchange, ASCII is a character encoding standard that, unlike Unicode, covers only 128 characters. Such standards are used for representing text in computers and telecommunications equipment, so that different devices can communicate with each other.
a2 = "ko\u017eu\u0161\u010dek"
# the to_ascii argument converts non-ASCII characters to their closest ASCII representation
clean(a2, to_ascii=True)
This will output – ‘kozuscek’
As you can see, the plain ASCII letters are untouched, and the accented characters have been converted to their closest ASCII equivalents. Text like this turns up all the time in NLP tasks; hence this is a useful operation that can be easily performed.
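If you are curious what a "closest ASCII" conversion can look like under the hood, here is a minimal sketch using only Python's standard unicodedata module. This is not necessarily how the library does it; closest_ascii is a hypothetical helper name, and the approach assumed here is the common one of decomposing accented letters and dropping the accents.

```python
import unicodedata


def closest_ascii(text):
    """Decompose accented letters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(closest_ascii("ko\u017eu\u0161\u010dek"))  # kozuscek
print(closest_ascii("Zürich"))                   # Zurich
```

One caveat of this simple approach: characters that have no decomposition (for example ß or ø) are silently dropped rather than transliterated, which is one reason a dedicated library is handy.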
Uppercase and lowercase letters are treated as different characters, so we should convert the text to a single case (preferably lowercase). For understanding the meaning of the text, case hardly matters, so this step should usually be performed.
a3 = "My Name is Captain James Kirk"
# the simple lower argument is used for this operation
clean(a3, lower=True)
As I said, minimalistic code is required to handle these tasks using this library.
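To see why case matters, here is a small standard-library sketch (the sample sentence is made up) that counts words before and after lowercasing. Without lowercasing, the same word is counted as several different tokens.

```python
from collections import Counter

text = "Data data DATA cleaning Cleaning"

raw_counts = Counter(text.split())            # case-sensitive counts
lower_counts = Counter(text.lower().split())  # counts after lowercasing

print(len(raw_counts))       # 5 distinct tokens
print(lower_counts["data"])  # 3
print(len(lower_counts))     # 2 distinct tokens
```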
Many times we encounter situations where we have to replace URLs with some other particular string. Usually, this requires complex regular expressions (I hate them); the library's solution is shown below.
a4 = "https://www.Google.com has surpassed https://www.Bing.com in search volume"
# the no_urls argument makes sure we don't have any URLs in the output text;
# the replace_with_url argument, as the name suggests, replaces each URL with the text we provide
clean(a4, no_urls=True, replace_with_url="URL")
Using this package makes us believe that writing Python is really like writing code in English.
Straightforward methods and arguments. Simple in and simple out.
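For comparison, this is roughly what the hand-written regex route looks like. It is a deliberately simplified sketch: replace_urls and URL_PATTERN are hypothetical names, and the pattern only catches http(s) links that run up to the next whitespace, so it is far less thorough than a real URL matcher.

```python
import re

# Simplified pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_urls(text, replacement="URL"):
    """Replace every match of the simplified URL pattern with the given text."""
    return URL_PATTERN.sub(replacement, text)


a4 = "https://www.Google.com has surpassed https://www.Bing.com in search volume"
print(replace_urls(a4))  # URL has surpassed URL in search volume
```

Even this small version shows the catch: links without a scheme (like www.example.com) slip through, and trailing punctuation gets swallowed, which is exactly the kind of detail you would rather leave to a library.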
We also encounter cases where there are currency symbols in our text; we can either remove them completely (nope, won’t help) or replace them with text (which is much better). Below is an example using the Rupee, which is the standard currency in India.
a5 = "I want ₹ 40"
# argument for removing the currency symbols from text
clean(a5, no_currency_symbols=True)
# useful argument to replace that currency symbol with text
clean(a5, no_currency_symbols=True, replace_with_currency_symbol="Rupees")
Not only have we removed the currency symbol, which won’t mean anything to the model, but we also replaced it with our text seamlessly.
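One way to do something similar without the library is to lean on Unicode's own "currency symbol" category (Sc), as in the sketch below. This is an assumption about a reasonable approach, not a description of the library's internals, and replace_currency_symbols is a hypothetical helper name.

```python
import unicodedata


def replace_currency_symbols(text, replacement=""):
    """Swap every character in Unicode category 'Sc' (currency symbol) for the replacement."""
    return "".join(
        replacement if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )


print(replace_currency_symbols("I want ₹ 40", "Rupees"))  # I want Rupees 40
print(replace_currency_symbols("5 € or 6 $", "<CUR>"))    # 5 <CUR> or 6 <CUR>
```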
Removing punctuation is undoubtedly the most useful operation we require while handling language-related tasks.
Punctuation marks rarely add any value to the tasks we perform on the text dump we have.
a6 = "80,000 is greater than 70,000"
# just one argument, no_punct, will do this task
clean(a6, no_punct=True)
# another useful argument to use in combination with no_punct
clean(a6, no_punct=True, replace_with_punct="6")
You can also change which punctuation marks to keep and which to remove. Super-friendly, right?!
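As a point of reference, here is a small standard-library sketch of punctuation removal using string.punctuation and str.translate. The helper name strip_punctuation is hypothetical, and note that string.punctuation only covers ASCII punctuation, so curly quotes and the like would survive.

```python
import string


def strip_punctuation(text, replacement=""):
    """Replace each ASCII punctuation character with the replacement (default: delete it)."""
    table = {ord(ch): replacement for ch in string.punctuation}
    return text.translate(table)


a6 = "80,000 is greater than 70,000"
print(strip_punctuation(a6))       # 80000 is greater than 70000
print(strip_punctuation(a6, " "))  # 80 000 is greater than 70 000
```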
Removing digits is another important operation on text data, since digits often do not add any semantic or syntactic value.
a7 = 'abc123def456ghi789zero0'
# the no_digits argument, as the name suggests, removes all digits
clean(a7, no_digits=True)
# here too we can replace those digits using this argument
clean(a7, no_digits=True, replace_with_digit="")
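The plain-regex equivalent is short enough to show in full. Again this is only a sketch with a hypothetical helper name (strip_digits), meant to illustrate what removing or replacing digits does to the sample string.

```python
import re


def strip_digits(text, replacement=""):
    """Replace every digit with the replacement (default: delete it)."""
    return re.sub(r"\d", replacement, text)


a7 = "abc123def456ghi789zero0"
print(strip_digits(a7))       # abcdefghizero
print(strip_digits(a7, "0"))  # abc000def000ghi000zero0
```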
COMBINING IT ALL
final = """ Zürich has a famous website https://www.zuerich.com/ WHICH ACCEPTS 40,000 € and adding a random string, : abc123def456ghi789zero0 for this demo. ' """
clean(final,
      fix_unicode=True,
      to_ascii=True,
      lower=True,
      no_urls=True,
      no_numbers=True,
      no_digits=True,
      no_currency_symbols=True,
      no_punct=True,
      replace_with_punct="",
      replace_with_url="<URL>",
      replace_with_number="<NUMBER>",
      replace_with_digit="",
      replace_with_currency_symbol="<CUR>")
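To get a feel for what chaining these steps involves, here is a toy standard-library pipeline. It is only a sketch under my own assumptions (the function name toy_pipeline, the step order and the simplified URL pattern are all made up for illustration) and will not reproduce the library's output exactly.

```python
import re
import string
import unicodedata


def toy_pipeline(text):
    """Chain a few simple cleaning steps; the order matters."""
    # 1. Closest ASCII: decompose accents, drop anything non-ASCII
    #    (this also silently drops symbols such as the euro sign).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 2. Lowercase everything.
    text = text.lower()
    # 3. Replace URLs BEFORE stripping punctuation, while "://" and "." are intact.
    text = re.sub(r"https?://\S+", "URL", text)
    # 4. Remove digits.
    text = re.sub(r"\d", "", text)
    # 5. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 6. Collapse the leftover runs of whitespace.
    return " ".join(text.split())


sample = "Zürich has a website https://www.zuerich.com/ WHICH ACCEPTS 40,000 €"
print(toy_pipeline(sample))  # zurich has a website URL which accepts
```

The main design point is the ordering: if punctuation were stripped before the URL step, the link would be mangled into ordinary-looking words and the URL pattern would never match.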
I recommend using this package, which takes very little time to implement, and trying different combinations of the arguments mentioned above. The notebook is present here for reference. All code was written in Google Colab.
So this has been an introduction plus code implementation of the easiest text cleaning library I have ever used. This library is built on Python, which is the best and my favourite. The advantage of CleanText is that you have to write less code; moreover, it is like writing English! I highly recommend that the readers of this blog try out this package for their NLP tasks, because text cleaning is necessary.
Mudit is experienced in machine learning and deep learning. He is an undergraduate in Mechatronics and worked as a team lead (ML team) for several Projects. He has a strong interest in doing SOTA ML projects and writing blogs on data science and machine learning.