A large share of the data that exists today is text, so it is important for a Data Scientist to have the skills to process it. Natural Language Processing (NLP) has been around for a long time and keeps growing in popularity. Today, many consumer devices include some form of NLP technology that lets them communicate with us.
NLP should be one of the best-maintained skills in a Data Scientist’s toolkit. In this article, we will implement Natural Language Processing with Machine Learning in the simplest way possible to solve MachineHack’s Whose Line Is It Anyway: Identify The Author Hackathon.
About The Data Set
The dataset we are going to use consists of sentences from thousands of books by 10 authors. The idea is to train our machine to predict which author wrote a specific sentence. This is an NLP classification problem where the objective is to classify each sentence based on who wrote it.
Where to get the dataset?
Head to MachineHack, sign up, and start the Whose Line Is It Anyway: Identify The Author Hackathon. You will find the dataset on the assignments page.
Natural Language Processing With Python
We will implement NLP in 8 simple steps as explained below.
Importing necessary libraries
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
The above code block consists of the necessary libraries that we need to implement our NLP classifier. We will look into each of them as we come across various methods.
Importing the dataset
dataset = pd.read_csv('TRAIN.csv')
The above code block reads the data from the CSV file and loads it into a pandas DataFrame using the read_csv function of the pandas library that we imported earlier.
Let’s have a peek at the dataset :
Cleaning and preprocessing the data
Cleaning the data is one of the most essential tasks in not just Natural Language Processing but in the entire Data Science spectrum. In Natural Language Processing, there are various stages of cleaning. Some of the basic stages are listed below :
- Cleaning the text of unnecessary data (noise such as symbols, emojis, special characters, etc.)
- Stemming or lemmatization to reduce words to their root form.
- Removing stopwords.
Stemming is the process of reducing a word to its root form. This helps remove redundancy in words. For example, if the words ‘run’, ‘runs’ and ‘running’ are present in a sentence, each word is reduced to its root form ‘run’ and counted as 3 occurrences of the same word instead of counting each word as unique. Note that a rule-based stemmer such as the Porter stemmer only strips suffixes, so an irregular form like ‘ran’ is left unchanged; mapping it to ‘run’ requires lemmatization.
Stopwords are words that occur so often in a natural language that they carry little information when comparing documents or sentences. For example, ‘the’, ‘a’, ‘an’, ‘has’, ‘do’, ‘what’, etc. are some of the stopwords. Such words are usually removed before further processing.
nltk.download('stopwords')  # Downloading the stopwords from nltk
corpus = []  # List for storing cleaned data
ps = PorterStemmer()  # Initializing object for stemming
stop_words = set(stopwords.words('english'))  # Built once, outside the loop
for i in range(len(dataset)):  # For each observation in the dataset
    # Replacing non-letter characters with spaces, lowercasing and splitting into words
    text = re.sub('[^a-zA-Z]', ' ', dataset['text'][i]).lower().split()
    # Stemming and removing stop words
    text = [ps.stem(word) for word in text if word not in stop_words]
    # Joining all the cleaned words to form a sentence
    text = ' '.join(text)
    # Adding the cleaned sentence to the list
    corpus.append(text)
The NLTK library comes with a collection of stopwords which we can use to clean the dataset. The PorterStemmer class from the nltk.stem.porter module is used to perform stemming. In the above code block, we traverse each observation in the dataset, removing special characters, performing stemming and removing stop words.
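To make the individual stages concrete, the sketch below runs a single sentence through them one at a time. It uses only the standard library: the stop word set `TOY_STOPWORDS`, the helper `clean_sentence` and the sample sentence are made up for illustration (the code above uses NLTK's much larger English list), and stemming is left out because it needs the NLTK stemmer.

```python
import re

# Assumption: a tiny hand-written stop word set, for illustration only.
# The article's code uses NLTK's full English list instead.
TOY_STOPWORDS = {"the", "a", "an", "has", "do", "what", "was", "and", "in"}

def clean_sentence(sentence):
    """Return the intermediate result of each cleaning stage."""
    # Stage 1: replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', sentence)
    # Stage 2: lowercase and split into words
    words = letters_only.lower().split()
    # Stage 3: drop stop words
    kept = [word for word in words if word not in TOY_STOPWORDS]
    return letters_only, words, kept

sample = "The door was open, and a cold wind (what a night!) blew in."
letters_only, words, kept = clean_sentence(sample)
print(words)            # all lowercase words, punctuation gone
print(' '.join(kept))   # the sentence after stop word removal
```

Punctuation disappears in the first stage, and the stop word filter then removes the short function words, leaving only the content words of the sentence.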
Let’s see the cleaned data :
Generating Count Vectors
cv = CountVectorizer(max_features = 120)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
With the above code block, we create a Bag-of-Words model. The CountVectorizer class imported from sklearn.feature_extraction.text builds a matrix in which each row holds the counts of each word in one sentence. The parameter max_features = 120 limits the vocabulary to the 120 most frequent words in the corpus. We transform the cleaned data in corpus into the count matrix X, which is the independent variable set for the text classifier that we will build in the coming steps. The labels y are read from the second column of the dataset.
Here is what X looks like :
Each row represents one observation (sentence) in the dataset and each column represents one of the 120 selected words.
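The idea behind the count matrix can be sketched by hand with the standard library. This is not how CountVectorizer is implemented: the helper `count_vectors` and the toy corpus are hypothetical, and the sketch simply splits on whitespace, whereas CountVectorizer by default also lowercases the text and ignores single-character tokens.

```python
from collections import Counter

def count_vectors(corpus, max_features):
    """Build a bag-of-words matrix from already-cleaned sentences."""
    # Count how often each word occurs across the whole corpus
    totals = Counter(word for sentence in corpus for word in sentence.split())
    # Keep the max_features most frequent words, sorted alphabetically
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    # One row per sentence, one column per vocabulary word
    matrix = []
    for sentence in corpus:
        counts = Counter(sentence.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up cleaned sentences, for illustration only
toy_corpus = ["door open cold wind", "cold wind cold night", "wind door"]
vocabulary, X_toy = count_vectors(toy_corpus, max_features=3)
print(vocabulary)
for row in X_toy:
    print(row)
```

With max_features=3, only 'cold', 'door' and 'wind' become columns; 'open' and 'night' occur once each and are dropped, which is the same trade-off that max_features = 120 makes on the real dataset.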
Splitting the dataset into the Training set and Validation set
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = 0)
In the above code block, we split the dataset into training and validation sets. The parameter test_size = 0.2 specifies that the validation set (X_val and y_val) should consist of 20% of the overall data in X and y. The random_state parameter sets a seed value so that the same split, and therefore the same results, can be reproduced.
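The role of the seed can be illustrated with a small standard-library sketch. The helper `simple_split` and the toy data are hypothetical, and this is not scikit-learn's algorithm, so it will not produce the same split as train_test_split for the same seed. It only shows the principle: shuffle with a fixed seed, then slice off the validation part.

```python
import random

def simple_split(rows, labels, val_fraction, seed):
    """Shuffle row indices with a fixed seed, then slice off a validation part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_val = round(len(rows) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

toy_rows = [[i, i * 2] for i in range(10)]  # 10 made-up feature rows
toy_labels = [i % 2 for i in range(10)]     # 10 made-up labels

first = simple_split(toy_rows, toy_labels, val_fraction=0.20, seed=0)
second = simple_split(toy_rows, toy_labels, val_fraction=0.20, seed=0)
X_tr, X_va, y_tr, y_va = first
print(len(X_tr), len(X_va))  # sizes of the training and validation parts
print(first == second)       # whether both runs produced the same split
```

Calling the function twice with the same seed yields identical splits, which is what makes an experiment repeatable; changing the seed changes which rows end up in the validation set.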
Building a classifier
classifier = SVC()
classifier.fit(X_train, y_train)
Since we are ready with the training data we can now use it to train a classifier. The above code block initializes a Support Vector Classifier and fits the training data for learning.
Predicting the author
y_pred = classifier.predict(X_val)
After training the classifier with X_train and y_train, we can now have it predict the authors for the texts in the validation set X_val.
Evaluating the model
After predicting for the validation set, we need to check how many of the predictions are actually right. To do this, we will make use of the confusion matrix. Using the confusion matrix, we will compare the predicted values in y_pred with the actual values in y_val. The accuracy can be calculated from a confusion matrix by summing up the diagonal elements and dividing the result by the total sum of elements in the matrix. We define a function as shown below:
def accuracy(conf_matrix):
    # Correct predictions lie on the diagonal of the confusion matrix
    diagonal_sum = conf_matrix.trace()
    sum_of_all_elements = conf_matrix.sum()
    return diagonal_sum / sum_of_all_elements
#Creating the confusion matrix with y_val and y_pred
cm = confusion_matrix(y_val, y_pred)
print("Accuracy : ", accuracy(cm))
Accuracy : 0.7461271882975549
This means that about 74.6% of the predictions on the validation set matched the actual authors.
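As a worked example of the diagonal-over-total calculation, here is a small sketch with a made-up 3-class confusion matrix. The numbers and the helper name `accuracy_from_matrix` are purely illustrative and are not taken from the hackathon data. Rows are the actual classes and columns the predicted classes, which is the convention scikit-learn's confusion_matrix uses.

```python
def accuracy_from_matrix(matrix):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    diagonal_sum = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return diagonal_sum / total

# Made-up confusion matrix: rows = actual class, columns = predicted class
toy_cm = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 39],
]
toy_accuracy = accuracy_from_matrix(toy_cm)
print(toy_accuracy)
```

Here the diagonal sums to 50 + 40 + 39 = 129 and the whole matrix to 150, so 129 of 150 predictions are correct, an accuracy of 86%. The off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.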
That’s it! You have now created your first NLP project in Machine Learning. You can use the above code blocks as a basic template for working with NLP. Happy Coding!