Text processing is the practice of extracting and analyzing information from textual datasets. Such datasets store information as free-form text, which usually has to be cleaned before it can be analyzed, visualized, or used to train machine learning models.
Texthero is a Python package built for exactly this kind of work. It aims to take textual datasets, in its own words, "from zero to hero", and it is designed to make working with text data quick and efficient.
In this article, we explore Texthero and its text processing capabilities, and see how easily it lets us process data.
Implementation:
Like any other library, Texthero first needs to be installed, which we can do with pip install texthero.
- Importing required libraries
We import texthero for text processing and pandas for loading and manipulating the dataset.
import pandas as pd
import texthero as hero
- Loading the dataset
The dataset we will be using here can be downloaded from Kaggle. It has several attributes, but we will mainly focus on the ‘content’ column.
df = pd.read_csv('text.csv')
df
- Processing the dataset
We can see that our dataset contains sentiment-labeled tweets from different authors. We will focus on the tweets themselves and apply Texthero's text processing functions to them.
We start by cleaning the text in the ‘content’ column, which holds the tweets, and store the result in a new column.
- Preprocessing the Text
- Cleaning the text
df['clean_content'] = hero.clean(df['content'])
df['clean_content'].head()
By default, the clean function applies a fixed set of steps: it converts the text to lowercase and removes stopwords, punctuation, digits, diacritics, and extra whitespace. Each of these steps is also available as a separate function (for example hero.remove_stopwords or hero.remove_digits), so we can apply only the ones we need.
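To make these cleaning steps concrete, here is a minimal sketch that reproduces them on a single string using only the Python standard library. It is not Texthero's implementation: the function name simple_clean and the tiny STOPWORDS set are hypothetical, chosen only for illustration, and Texthero's real stopword list is much larger.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "i", "my"}

# Translation table that maps every ASCII punctuation mark to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_clean(text):
    """Approximate the cleaning steps described above on one string."""
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"\d+", " ", text)         # remove digits
    text = text.translate(PUNCT_TO_SPACE)    # remove punctuation
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    return " ".join(words)                   # collapse extra whitespace


print(simple_clean("I LOVE the   new phone, bought in 2020!!"))
# prints: love new phone bought
```

In practice we would let hero.clean do this work on the whole column; the sketch only shows what each step contributes.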
- Tokenize the text
The tokenize function returns a pandas Series in which each row contains a list of tokens.
hero.tokenize(df['clean_content'])
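To show the shape of that result, here is a small standard-library sketch of tokenization. The helper simple_tokenize and its regular expression are assumptions made for illustration; Texthero's own tokenizer may split some inputs differently.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


rows = ["love new phone", "worst day ever"]
tokens = [simple_tokenize(row) for row in rows]
print(tokens)
# prints: [['love', 'new', 'phone'], ['worst', 'day', 'ever']]
```

Each input row becomes one list of tokens, which mirrors the one-list-per-row structure of the Series described above.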
- Stemming
Stemming reduces words to a root form by heuristically stripping their endings. The stem function can use either of two NLTK stemming algorithms, the Snowball stemmer or the Porter stemmer, selected through the stem parameter.
hero.stem(df['clean_content'], stem='snowball')
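The idea of heuristic suffix stripping can be illustrated with a deliberately naive sketch. This is not the Snowball or Porter algorithm, both of which apply far more rules; the SUFFIXES tuple and the naive_stem function are hypothetical and exist only to show the principle.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["walking", "walked", "walks", "quickly", "bus"]])
# prints: ['walk', 'walk', 'walk', 'quick', 'bus']
```

A rule set this small makes mistakes on many real words, which is why an established stemmer such as the ones behind hero.stem is the better choice for real data.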
- Visualize the Cleaned Text
There are many ways to visualize textual data. Here we use a word cloud to visualize the cleaned text we created.
hero.visualization.wordcloud(df['clean_content'], width=250, height=150, max_words=200, background_color='WHITE')
Similarly, we can find the most frequently used words with Texthero's top_words function.
hero.visualization.top_words(df['clean_content'])
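Conceptually, finding the top words comes down to counting how often each word occurs across all rows. The following standard-library sketch shows that idea on three made-up cleaned rows; the sample data is hypothetical and the code is not how Texthero computes its result.

```python
from collections import Counter

# Hypothetical cleaned rows, standing in for df['clean_content'].
cleaned = ["love new phone", "new phone broke", "new sunny day"]

# Count every word across all rows.
counts = Counter(word for row in cleaned for word in row.split())

print(counts.most_common(2))
# prints: [('new', 3), ('phone', 2)]
```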
- NLP Operations on Text
Now we will apply some of the NLP operations that Texthero provides to our data.
- Named Entities
The named_entities function returns a pandas Series in which each row contains a list of tuples describing the named entities found in that row (the entity text, its label, and its start and end positions). We use spaCy as the underlying package here.
hero.named_entities(df['clean_content'], package='spacy')
- Noun Chunks
The noun_chunks function returns groups of consecutive words that belong together, such as a noun and the words describing it. Because our dataset is fairly large, we analyze the noun chunks in only the first 100 rows.
hero.noun_chunks(df['clean_content'][:100])
Conclusion:
In this article, we learned about Texthero, a Python library for text processing. We used it for basic preprocessing and visualization, and then performed some NLP operations on the text. Texthero is simple to use and offers a wide variety of text processing functions.