MITB Banner

Step By Step Guide To Reviews Classification Using SVC, Naive Bayes & Random Forest

In this article, we will analyze the reviews of customers about a restaurant and will predict based on these reviews whether the customer has liked the product of the restaurant or not.
Share
review analysis using NLP

Analyzing the review of customers is becoming more interesting day by day.  This is because every customer-oriented company wants to know the feedback of its customers towards its product in order to customer satisfaction. The developments in the field of natural language processing are helping in analyzing the customer’s reviews that are available in the textual format. 

In this article, we will analyze the reviews of customers about a restaurant and will predict based on these reviews whether the customer has liked the product of the restaurant or not. That means whether it is a positive review or negative review, based on the available text review. The prediction will be carried out by classification models and will find out which classification model is best in this task of prediction. 

The Dataset

In this experiment, a restaurant’s reviews dataset is used that is publically available on Kaggle. The motivation behind using this dataset is that most of the restaurants request its customers for review. Based on these reviews, the restaurant will be able to make improvements in order to further customer satisfaction. This dataset comes in a TSV (Tab Separated Values) file and comprises 1,000 reviews of customers along with a binary feature indicating whether a customer has liked the product of the restaurant or not. 

Analyzing Reviews using Natural Language Processing

First, we need to import the required libraries. The below code snippet will import those libraries.

#Importing required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

The dataset that is downloaded from the Kaggle will be read here. As we are reading a TSV file, we need to give the separating parameter that is delimiter here. For TSV files, the tab space is the delimiter so ‘\t’ is used here. To ignore the quotes present in the text, we are including the quoting parameter here. 

# Reading the dataset
dataset = pd.read_csv('Restaurant_Reviews_New.tsv', delimiter = '\t', quoting = 3)

Now, once we read the dataset, we need to preprocess the data. It is expected that you are aware of the basics of natural language processing. Here we need text data only into consideration so a regular expression is used for taking alphabetical values only. Then we convert all the text into lower case. After that, it is split into words and converted to its original form by stemming. Finally, all the stopwords are removed from the text and every word is added to the corpus.

# Preprocessing
nltk.download('stopwords')
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)

Once the corpus of words is ready, we create a bag of words model, that is actually a sparse matrix where we are using a maximum of 1500 features. From this sparse matrix, we have created the input and output features. The output feature in this data set is the binary response to whether the customer has liked the product of the restaurant or not. After that, we have defined the training and test sets. 

# Creating the Bag of Words model
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

Classification of Reviews

Once we are ready with the training and test data, we will define three classification models through which we will predict whether the custom has liked the product or not, based on the reviews. We are going to define three classification models here – the Naive Bayes Model, the Random Forest Model, and the Support Vector Model.

#Clasiification
# Fitting Naive Bayes
NB_classifier = GaussianNB()
NB_classifier.fit(X_train, y_train)
y_pred_NB = NB_classifier.predict(X_test)

# Random Forest
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)

#Support Vector
SVC_classifier = SVC(kernel = 'rbf')
SVC_classifier.fit(X_train, y_train)
y_pred_SVC = SVC_classifier.predict(X_test)

Now once we have defined and trained the three models and made predictions using all of these, we need to see their classification accuracies. This we are going to check through the confusion matrix and CAP (Cumulative Accuracy Profile) Curve Analysis.

#Confusion Matrices
cm_NB = confusion_matrix(y_test, y_pred_NB)
cm_RandFor = confusion_matrix(y_test, y_pred_rf)
cm_SVC = confusion_matrix(y_test, y_pred_SVC)

confusion matrix of review analysis and classification

#CAP Analysis
total = len(y_test) 
one_count = np.sum(y_test) 
zero_count = total - one_count 
lm_NB = [y for _, y in sorted(zip(y_pred_NB, y_test), reverse = True)] 
lm_SVC = [y for _, y in sorted(zip(y_pred_SVC, y_test), reverse = True)] 
lm_RandFor = [y for _, y in sorted(zip(y_pred_rf, y_test), reverse = True)] 
x = np.arange(0, total + 1) 
y_NB = np.append([0], np.cumsum(lm_NB)) 
y_SVC = np.append([0], np.cumsum(lm_SVC)) 
y_RandFor = np.append([0], np.cumsum(lm_RandFor)) 

plt.figure(figsize = (10, 10))
plt.title('CAP Curve Analysis')
plt.plot([0, total], [0, one_count], c = 'k', linestyle = '--', label = 'Random Model')
plt.plot([0, one_count, total], [0, one_count, one_count], c = 'grey', linewidth = 2, label = 'Perfect Model') 
plt.plot(x, y_SVC, c = 'y', label = 'SVC', linewidth = 2)
plt.plot(x, y_NB, c = 'b', label = 'Naive Bayes', linewidth = 2)
plt.plot(x, y_RandFor, c = 'r', label = 'Rand Forest', linewidth = 2)
plt.legend()

cap curver analysis of review analysis and classification

By analyzing the confusion matrices and CAP curves of each of the three classifiers, it is clear that the support vector classifier is best among all three. In the task of predicting whether the customer likes the product or not, based on reviews, the support vector classifier has outperformed the other two classifiers. 

PS: The story was written using a keyboard.
Share
Picture of Dr. Vaibhav Kumar

Dr. Vaibhav Kumar

Dr. Vaibhav Kumar is a seasoned data science professional with great exposure to machine learning and deep learning. He has good exposure to research, where he has published several research papers in reputed international journals and presented papers at reputed international conferences. He has worked across industry and academia and has led many research and development projects in AI and machine learning. Along with his current role, he has also been associated with many reputed research labs and universities where he contributes as visiting researcher and professor.
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India