Automated machine learning, or AutoML, automates the end-to-end process of applying machine learning to real-world problems, making it easier and more efficient. Over the years, researchers have built tools like AutoKeras and AutoSklearn, as well as no-code platforms like WEKA and H2O, to automate these workflows.
One such area of automation is natural language processing. With AutoNLP, it is now possible to build a model such as a sentiment analyser with just a few basic lines of code and still get good results. Automation like this opens machine learning up to everyone instead of restricting it to developers and engineers.
In this article, we will learn what AutoNLP is and implement a sentiment analysis model on a Twitter dataset.
What is AutoNLP?
Using the concepts of AutoML, AutoNLP automates text preprocessing steps such as stemming, tokenization and lemmatization (a manual sketch of these steps follows the list below). It also handles text processing and picks the best model for the given dataset. AutoNLP was developed under AutoVIML, which stands for Automatic Variant Interpretable ML. Some of the features of AutoNLP are:
- Data cleansing: The entire dataset can be passed to the model without any preprocessing such as vectorization. It even fills in missing data and cleans the dataset automatically.
- Feature extraction with the featuretools library: Featuretools is another great library that makes feature engineering and extraction easy.
- Model performance and graphs are produced automatically: Just by setting the verbose parameter, the model's graphs and performance metrics are displayed.
- Automatic feature reduction: With huge datasets, it becomes tough to select the best features and perform EDA, but AutoNLP takes care of this.
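To appreciate what is being automated, here is a minimal sketch of the kind of manual preprocessing (tokenization, stemming, lemmatization) that AutoNLP handles for you. It uses NLTK, and the example sentence is purely illustrative.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")    # tokenizer models
nltk.download("wordnet")  # lemmatizer dictionary

text = "The movies were surprisingly entertaining"   # illustrative only
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(text.lower())                 # tokenization
stems = [stemmer.stem(t) for t in tokens]            # stemming
lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization
print(tokens, stems, lemmas, sep="\n")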
Implementation of AutoNLP
Let us now implement a sentiment analysis model on a Twitter dataset using AutoNLP. Without AutoNLP, the data would first have to be vectorized, stemmed and lemmatized, and often visualized as a word cloud, before training. With AutoNLP, all we have to do is follow a few simple steps.
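For comparison, the manual vectorization step would typically look something like this, a hedged sketch using scikit-learn's TfidfVectorizer on a hypothetical list of tweets:

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["loved the new update", "worst service ever", "not bad at all"]  # illustrative only
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(tweets)   # sparse document-term matrix
print(X.shape)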
Installing AutoNLP
To install it, we can use a simple pip command. Since AutoNLP ships as part of the autoviml package, that is what we install.
!pip install autoviml
After installing it, we can go ahead and download the dataset for the project. I will be using a Twitter dataset since we are doing sentiment analysis. You can download the dataset here. Once done, let us mount Google Drive and load the dataset.
from google.colab import drive
import pandas as pd

# Mount Google Drive and load the Twitter training data
drive.mount('/content/gdrive')
data = pd.read_csv("/content/gdrive/MyDrive/twitter_train.csv")
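Before training, it is worth taking a quick look at the data. The column names Sentiment and SentimentText used below are the same ones we pass to Auto_NLP later.

print(data.shape)
print(data.head())
print(data["Sentiment"].value_counts())   # check class balance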
Model
Now, we can use AutoNLP to build the model.
import numpy as np
from sklearn.model_selection import train_test_split
from autoviml.Auto_NLP import Auto_NLP
train, test = train_test_split(data, test_size=0.2)
Since this is a classification problem, we specify that in the Auto_NLP call through the modeltype argument. If top_num_features is not set, it defaults to a value above 300, which makes training noticeably slower than using 100 features.
input_feature, target = "SentimentText", "Sentiment"
train_x, test_x, final, predicted = Auto_NLP(
    input_feature, train, test, target,
    score_type="balanced_accuracy", top_num_features=100,
    modeltype="Classification", verbose=2, build_model=True)
Now you will see a series of graphs, and within a few minutes the trained output appears.
These graphs visualize the training process in detail, showing the word count, character count and density of the text. As training progresses the graphs update, and the final output is shown here. All punctuation and tags are removed automatically, and their density is also shown in the graphs.
The model has selected Multinomial Naive Bayes as the classifier and performed the training. If top_num_features had not been given, a random forest algorithm would have been used instead.
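For intuition about what such a pipeline amounts to, here is a rough hand-built equivalent with scikit-learn. This is only a sketch, not the exact pipeline Auto_NLP constructs internally.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# TF-IDF features followed by a Multinomial Naive Bayes classifier
manual_clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
manual_clf.fit(train[input_feature], train[target])
manual_preds = manual_clf.predict(test[input_feature])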
The final output is as shown below.
To understand the pipeline that was built, simply print the final object and you will see the following.
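For example:

print(final)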
Finally, you can make predictions as follows.
final.predict(test_x[input_feature])
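If the true labels for the held-out split are available, you can also score these predictions. The snippet below assumes the rows returned in test_x keep the same order as the original test split.

from sklearn.metrics import balanced_accuracy_score

preds = final.predict(test_x[input_feature])
# compare against the ground-truth labels from our earlier split
print(balanced_accuracy_score(test[target], preds))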
Conclusion
We saw how AutoNLP makes model building for sentiment analysis very easy. Not only that, it also pre-processes the data automatically and produces visualizations for different aspects of the dataset. Automation like this makes it easy to build even complex models.