Natural Language Processing (NLP) allows a computer to understand human language with the help of the different modules and packages that Python provides. In practice, NLP is used for speech recognition, voice search engines, and more. It supports a wide variety of operations on text data, such as tokenizing, lemmatizing, stemming, and POS tagging.
spaCy is a Python NLP library that performs a range of NLP operations. Its main features include named entity recognition (NER), POS tagging, dependency parsing, and word vectors. It also ships models for different languages that can be loaded as needed.
Scattertext is an open-source Python library that is used together with spaCy to create visualizations of which words and phrases are more characteristic of a given category. It is a tool for finding distinguishing terms in corpora and presenting them in an interactive HTML scatter plot. Scattertext visualizations are highly informative because the points corresponding to terms are selectively labeled so that the labels don’t overlap with other labels or points.
In this article, we will draw a sentiment analysis visualization using spaCy and scattertext and see how scattertext lets you visualize and search for terms in the data.
Implementation:
We will start by installing spacy and scattertext using pip install spacy and pip install scattertext respectively.
- Importing required libraries
We will be importing spacy and scattertext for visualization and pandas for loading our dataset.
import spacy
import pandas as pd
import scattertext as st
- Loading the Dataset
To create a sentiment analysis visualization, we will use the ‘Twitter Airline Sentiment Dataset’ from Kaggle, downloaded as Tweets.csv. The dataset contains attributes such as the username, tweet id, tweet text, and sentiment label. We will use the data to visualize the terms used for the different sentiments.
twitter_df = pd.read_csv('Tweets.csv')
twitter_df.dtypes
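Before building the corpus, it can help to confirm that the columns used in the later steps are present and to see how the tweets are split across sentiments. The sketch below uses a tiny hand-made DataFrame (hypothetical rows, not taken from the Kaggle file) so that it runs without Tweets.csv; the column names airline_sentiment, text and name are the ones this article relies on.

```python
import pandas as pd

# Hypothetical stand-in rows for Tweets.csv; only the columns used in this article.
sample_df = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "negative", "neutral"],
    "name": ["user_a", "user_b", "user_c", "user_d"],
    "text": [
        "Flight delayed by two hours again",
        "Great crew and a smooth flight",
        "Lost my bag and nobody helped",
        "What time does check-in open?",
    ],
})

# Columns that the corpus and visualization steps read from.
required_columns = ["airline_sentiment", "text", "name"]
missing = [col for col in required_columns if col not in sample_df.columns]

# Number of tweets per sentiment label.
sentiment_counts = sample_df["airline_sentiment"].value_counts()

print("Missing columns:", missing)
print(sentiment_counts)
```

With the real data, the same checks can be run on twitter_df instead of sample_df.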
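- Loading the English Model
As we have already discussed, spaCy provides models for different languages. Since we are working with English text, we will load the small English model. It has to be downloaded once with python -m spacy download en_core_web_sm; the older 'en' shortcut no longer works in spaCy 3 and later.
nlp = spacy.load('en_core_web_sm')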
- Creating the Scattertext Corpus
Next, we will create a scattertext corpus from the dataset. As we are working on sentiment analysis, we will set category_col to ‘airline_sentiment’, and the column that contains the tweets will be used as text_col.
corpus = st.CorpusFromPandas(twitter_df, category_col='airline_sentiment', text_col='text', nlp=nlp).build()
To create this corpus, we pass the English model loaded in the previous step as the nlp argument and then call the build() function.
- Creating the visualization
This is the main and the final step. Here we will create a visualization with the following parameters:
- category: We will set this to negative as we will denote negative sentiments using this.
- category_name: This will be set to “Negative” and displayed as the axis title.
- not_category_name: All tweets that are not in the negative category fall under this label, which we name “Positive”. Note that this groups every non-negative tweet together, including any labeled neutral.
- metadata: The label shown alongside each excerpt; here, the name of the user who posted the tweet.
Now let us define all these and create the visualization using produce_scattertext_explorer.
html = st.produce_scattertext_explorer(corpus,
    category='negative',
    category_name='Negative',
    not_category_name='Positive',
    width_in_pixels=1000,
    metadata=twitter_df['name'])
This command returns the visualization as an HTML string, which we write to an HTML file that can be opened standalone in a browser.
open("Twitter_Sentiment.html", 'wb').write(html.encode('utf-8'))
This is the final visualization we created using scattertext.
In the visualization, the x-axis displays the positive frequency and the y-axis displays the negative frequency. Each axis is divided into three sections:
- Frequent: Shows the words with the highest frequency.
- Average: Shows the words with an average frequency.
- Infrequent: Shows the words with the lowest frequency.
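To make the two axes concrete, here is a minimal sketch of the raw counts behind a plotted point: how often a term appears in the positive texts and how often in the negative ones. It uses only the standard library, simple whitespace tokenization, and hypothetical example tweets; scattertext itself tokenizes with spaCy and applies its own scaling to these frequencies, so this illustrates the idea rather than the library's implementation.

```python
from collections import Counter

# Hypothetical tweets grouped by category; not taken from the dataset.
tweets = {
    "negative": ["flight delayed two hours", "delayed again and no help"],
    "positive": ["great flight", "thanks for the great help"],
}


def term_counts(texts):
    """Count lowercase, whitespace-separated tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


negative_counts = term_counts(tweets["negative"])
positive_counts = term_counts(tweets["positive"])

# One (positive, negative) pair per term: the two numbers behind a plotted point.
vocabulary = sorted(set(negative_counts) | set(positive_counts))
points = {term: (positive_counts[term], negative_counts[term]) for term in vocabulary}

for term, (pos, neg) in points.items():
    print(f"{term:10s} positive={pos} negative={neg}")
```

A term such as "delayed", which appears only in the negative tweets, would sit high on the y-axis and low on the x-axis, while a term used equally in both groups would sit near the diagonal.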
The visualization also contains the ‘Top Negative Words’, ‘Top Positive Words’, and ‘Characteristics’ lists. In addition, there is a search bar that can be used to look up a word in the corpus and display its frequency along with the texts where it is used.
Let us search the word ‘hour’ and see the results.
Here we can see that the search results display the frequency of the word in the negative and the positive texts, along with some of the tweets where the word is used.
The visualization is highly interactive: when you hover over any word, a tooltip displays its frequency along with its score, and no word overlaps any other word.
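The kind of lookup the search bar performs can be approximated in a few lines. The sketch below is an assumption-laden illustration, not scattertext's own code: search_term is a hypothetical helper, the tweets are made up, matching is done on whitespace-separated lowercase tokens, and the count is the number of matching tweets per category.

```python
def search_term(term, tweets_by_category):
    """Return, per category, the tweets containing the term and how many there are."""
    term = term.lower()
    results = {}
    for category, texts in tweets_by_category.items():
        # Keep a tweet if the term appears as one of its whitespace-separated tokens.
        matches = [t for t in texts if term in t.lower().split()]
        results[category] = {"count": len(matches), "tweets": matches}
    return results


# Hypothetical example tweets; not taken from the dataset.
example_tweets = {
    "negative": ["Waited an hour on hold", "One hour delay and no updates", "Bag lost"],
    "positive": ["Landed an hour early"],
}

hits = search_term("hour", example_tweets)
for category, info in hits.items():
    print(category, info["count"], info["tweets"])
```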
Conclusion:
In this article, we saw how insightful and informative visualizations can be created using scattertext. We saw how the visualization's search bar can be used to find a word's frequency and the texts where it is used. Scattertext is easy to use and fast, and it can be applied to many kinds of text data visualization.