Texthero Guide: A Python Toolkit for Text Processing


Text processing is the practice of extracting and analyzing information from textual datasets. Before text can be analyzed, visualized, or passed to a machine learning model, it usually has to be cleaned and brought into a consistent form.

Texthero is a Python package built for this purpose; its stated aim is to take text-based datasets "from zero to hero". It works directly on pandas Series and provides functions for preprocessing, visualizing, and analyzing text quickly.

In this article, we will explore Texthero's text processing capabilities and see how little code it takes to process a text dataset with it.



Implementation:

Like any other library, Texthero first needs to be installed, which can be done with pip install texthero.

  1. Importing required libraries

We will be importing texthero for text processing and pandas for loading the dataset and manipulating it.  

import pandas as pd

import texthero as hero

  2. Loading the dataset

The dataset used here can be downloaded from Kaggle. It has several columns, but we will focus mainly on the 'content' column, which holds the tweet text.

df = pd.read_csv('text.csv')

df

Dataset used

  3. Processing the dataset

The dataset contains tweets by different authors together with their sentiment. We will focus on the tweets and apply several of Texthero's text processing functions to them.

We start by cleaning the text in the 'content' column, which holds the users' tweets, and store the result in a new column.

  4. Preprocessing the Text
  • Cleaning the text

df['clean_content'] = hero.clean(df['content']) 

df['clean_content'].head()

By default, the clean function lowercases the text and removes digits, punctuation, diacritics, stopwords, and extra whitespace; it also replaces missing values with empty strings. Each of these steps is also available as a separate function (for example hero.remove_digits or hero.remove_stopwords), so we can apply them individually if we prefer.
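
To make the cleaning steps concrete, here is a rough plain-Python approximation of what they do to a single string. This is only a sketch and not Texthero's implementation: the function name rough_clean and the tiny stopword list are assumptions made for illustration.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "i", "my", "to", "of", "and"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def rough_clean(text: str) -> str:
    """Approximate the cleaning steps on one string (hypothetical helper)."""
    text = text.lower()                      # lowercase
    text = re.sub(r"\d+", " ", text)         # drop digits
    text = text.translate(PUNCT_TO_SPACE)    # drop punctuation
    words = [w for w in text.split() if w not in STOPWORDS]  # drop stopwords
    return " ".join(words)                   # collapse extra whitespace


tweets = ["The 3 cats are sleeping, again!", "  I love my new phone  "]
cleaned = [rough_clean(t) for t in tweets]
print(cleaned)  # ['cats sleeping again', 'love new phone']
```

Texthero applies the same kind of steps to a whole pandas Series at once, which is why a single call is enough in the article's example.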

  • Tokenize the text

The tokenize function returns a pandas Series in which each row contains a list of tokens.

hero.tokenize(df['clean_content'])
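
As a rough picture of the result, the sketch below tokenizes two strings with a simple regular expression in plain Python. The pattern and the helper name rough_tokenize are assumptions for illustration; Texthero's own tokenization rules may differ in detail.

```python
import re
from typing import List


def rough_tokenize(text: str) -> List[str]:
    """Split text into word tokens and standalone punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


rows = ["cats sleeping again", "good morning, world!"]
tokens = [rough_tokenize(r) for r in rows]
print(tokens)
# [['cats', 'sleeping', 'again'], ['good', 'morning', ',', 'world', '!']]
```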

  • Stemming

Stemming means removing the end of words with a heuristic process. The stem function supports two NLTK stemming algorithms, the Snowball stemmer (the default) and the Porter stemmer; the stem parameter selects which one is used.

hero.stem(df['clean_content'], stem='snowball')
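
To illustrate what "removing the end of words with a heuristic" means, here is a deliberately naive suffix stripper in plain Python. It is a toy and not the Snowball or Porter algorithm, both of which use far more careful rules; the suffix list and the name toy_stem are assumptions for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["sleeping", "played", "boxes", "cats", "bus"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['sleep', 'play', 'box', 'cat', 'bus']
```

A heuristic this crude breaks quickly (it turns "running" into "runn", for example), which is why established stemmers are preferred in practice.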

  5. Visualize the Cleaned Text

There are many ways to visualize text data. Here we will use a word cloud to visualize the cleaned text we created.

hero.visualization.wordcloud(df['clean_content'], width=250, height=150, max_words=200, background_color='WHITE')

Word-cloud of Clean Text, Texthero

Similarly, we can look at the most frequently used words with Texthero's top_words function, which returns the words together with their counts.

hero.visualization.top_words(df['clean_content'])
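
The idea behind a top-words listing is a plain frequency count over all rows. The following sketch shows that count with Python's collections.Counter on a few made-up cleaned strings; it illustrates the concept only and is not how Texthero computes its result.

```python
from collections import Counter

# Made-up cleaned rows, standing in for the 'clean_content' column.
cleaned = ["cats sleeping again", "love cats", "cats sleeping"]

# Count every whitespace-separated word across all rows.
counts = Counter(word for row in cleaned for word in row.split())
top = counts.most_common(2)
print(top)  # [('cats', 3), ('sleeping', 2)]
```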

Top words visualization, Texthero

  6. NLP Operations on Text

Now we will apply some of the NLP operations provided by Texthero to our data.

  • Named Entities

The named_entities function returns a pandas Series in which each row contains a list of tuples describing the named entities found in that row: the entity text, its label, and its start and end character positions. We will use spaCy as the underlying package here.

hero.named_entities(df['clean_content'], package='spacy')
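
Once the entities are extracted, each row can be processed like any list of tuples. The sketch below assumes each tuple has the layout (entity text, label, start character, end character); the sample tuples are written by hand for illustration and are not real model output.

```python
from collections import Counter

text = "apple opened a store in paris"

# Hand-written tuples in the assumed (entity, label, start, end) layout.
entities = [("apple", "ORG", 0, 5), ("paris", "GPE", 24, 29)]

# The character offsets index back into the original string.
for name, label, start, end in entities:
    assert text[start:end] == name

# Count entities per label and pick out the place names.
label_counts = Counter(label for _, label, _, _ in entities)
places = [name for name, label, _, _ in entities if label == "GPE"]
print(label_counts, places)
```

The same pattern applies row by row to the Series returned for the whole column, for example through the Series apply method.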

  • Noun Chunks

The noun_chunks function returns the noun chunks in each row, that is, groups of consecutive words that belong together. Because our dataset is fairly large, we will extract the noun chunks from only the first 100 rows.

hero.noun_chunks(df['clean_content'][:100])

Noun Chunks, Texthero

Conclusion:

In this article, we looked at Texthero, a Python library for text processing. We used it for basic preprocessing and visualization, and then ran some NLP operations on the text. Texthero is simple to use and offers a wide variety of text processing functions.
