Step By Step Guide To Reviews Classification Using SVC, Naive Bayes & Random Forest

In this article, we will analyze the reviews of customers about a restaurant and will predict based on these reviews whether the customer has liked the product of the restaurant or not.
review analysis using NLP

Analyzing the review of customers is becoming more interesting day by day.  This is because every customer-oriented company wants to know the feedback of its customers towards its product in order to customer satisfaction. The developments in the field of natural language processing are helping in analyzing the customer’s reviews that are available in the textual format. 

In this article, we will analyze the reviews of customers about a restaurant and will predict based on these reviews whether the customer has liked the product of the restaurant or not. That means whether it is a positive review or negative review, based on the available text review. The prediction will be carried out by classification models and will find out which classification model is best in this task of prediction. 

The Dataset

In this experiment, a restaurant’s reviews dataset is used that is publically available on Kaggle. The motivation behind using this dataset is that most of the restaurants request its customers for review. Based on these reviews, the restaurant will be able to make improvements in order to further customer satisfaction. This dataset comes in a TSV (Tab Separated Values) file and comprises 1,000 reviews of customers along with a binary feature indicating whether a customer has liked the product of the restaurant or not. 

Subscribe to our Newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Analyzing Reviews using Natural Language Processing

First, we need to import the required libraries. The below code snippet will import those libraries.

#Importing required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

The dataset that is downloaded from the Kaggle will be read here. As we are reading a TSV file, we need to give the separating parameter that is delimiter here. For TSV files, the tab space is the delimiter so ‘\t’ is used here. To ignore the quotes present in the text, we are including the quoting parameter here. 

# Reading the dataset
dataset = pd.read_csv('Restaurant_Reviews_New.tsv', delimiter = '\t', quoting = 3)

Now, once we read the dataset, we need to preprocess the data. It is expected that you are aware of the basics of natural language processing. Here we need text data only into consideration so a regular expression is used for taking alphabetical values only. Then we convert all the text into lower case. After that, it is split into words and converted to its original form by stemming. Finally, all the stopwords are removed from the text and every word is added to the corpus.

# Preprocessing'stopwords')
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)

Once the corpus of words is ready, we create a bag of words model, that is actually a sparse matrix where we are using a maximum of 1500 features. From this sparse matrix, we have created the input and output features. The output feature in this data set is the binary response to whether the customer has liked the product of the restaurant or not. After that, we have defined the training and test sets. 

# Creating the Bag of Words model
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

Classification of Reviews

Once we are ready with the training and test data, we will define three classification models through which we will predict whether the custom has liked the product or not, based on the reviews. We are going to define three classification models here – the Naive Bayes Model, the Random Forest Model, and the Support Vector Model.

# Fitting Naive Bayes
NB_classifier = GaussianNB(), y_train)
y_pred_NB = NB_classifier.predict(X_test)

# Random Forest
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0), y_train)
y_pred_rf = rf_classifier.predict(X_test)

#Support Vector
SVC_classifier = SVC(kernel = 'rbf'), y_train)
y_pred_SVC = SVC_classifier.predict(X_test)

Now once we have defined and trained the three models and made predictions using all of these, we need to see their classification accuracies. This we are going to check through the confusion matrix and CAP (Cumulative Accuracy Profile) Curve Analysis.

#Confusion Matrices
cm_NB = confusion_matrix(y_test, y_pred_NB)
cm_RandFor = confusion_matrix(y_test, y_pred_rf)
cm_SVC = confusion_matrix(y_test, y_pred_SVC)

confusion matrix of review analysis and classification

#CAP Analysis
total = len(y_test) 
one_count = np.sum(y_test) 
zero_count = total - one_count 
lm_NB = [y for _, y in sorted(zip(y_pred_NB, y_test), reverse = True)] 
lm_SVC = [y for _, y in sorted(zip(y_pred_SVC, y_test), reverse = True)] 
lm_RandFor = [y for _, y in sorted(zip(y_pred_rf, y_test), reverse = True)] 
x = np.arange(0, total + 1) 
y_NB = np.append([0], np.cumsum(lm_NB)) 
y_SVC = np.append([0], np.cumsum(lm_SVC)) 
y_RandFor = np.append([0], np.cumsum(lm_RandFor)) 

plt.figure(figsize = (10, 10))
plt.title('CAP Curve Analysis')
plt.plot([0, total], [0, one_count], c = 'k', linestyle = '--', label = 'Random Model')
plt.plot([0, one_count, total], [0, one_count, one_count], c = 'grey', linewidth = 2, label = 'Perfect Model') 
plt.plot(x, y_SVC, c = 'y', label = 'SVC', linewidth = 2)
plt.plot(x, y_NB, c = 'b', label = 'Naive Bayes', linewidth = 2)
plt.plot(x, y_RandFor, c = 'r', label = 'Rand Forest', linewidth = 2)

cap curver analysis of review analysis and classification

By analyzing the confusion matrices and CAP curves of each of the three classifiers, it is clear that the support vector classifier is best among all three. In the task of predicting whether the customer likes the product or not, based on reviews, the support vector classifier has outperformed the other two classifiers. 

Dr. Vaibhav Kumar
Dr. Vaibhav Kumar is a seasoned data science professional with great exposure to machine learning and deep learning. He has good exposure to research, where he has published several research papers in reputed international journals and presented papers at reputed international conferences. He has worked across industry and academia and has led many research and development projects in AI and machine learning. Along with his current role, he has also been associated with many reputed research labs and universities where he contributes as visiting researcher and professor.

Download our Mobile App

MachineHack | AI Hackathons, Coding & Learning

Host Hackathons & Recruit Great Data Talent!

AIMResearch Pioneering advanced AI market research

With a decade of experience under our belt, we are transforming how businesses use AI & data-driven insights to succeed.

The Gold Standard for Recognizing Excellence in Data Science and Tech Workplaces

With Best Firm Certification, you can effortlessly delve into the minds of your employees, unveil invaluable perspectives, and gain distinguished acclaim for fostering an exceptional company culture.

AIM Leaders Council

World’s Biggest Community Exclusively For Senior Executives In Data Science And Analytics.

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox