Natural Language Processing (NLP) is one of the most explored and successful domains in machine learning. It matters because complex communication is one of the strongest signs of intelligence, and the goal is to make machines communicate with humans just as effortlessly.
In this article, we will do a hands-on NLP with Python to solve MachineHack’s Predict The News Category hackathon.
Predict The News Category Hackathon
MachineHack has launched its second Natural Language Processing challenge for its large Data Science and ML audience. The hackathon is about predicting the category or section of a news piece from its content. The dataset consists of news pieces collected from a number of different sources, along with the category or section in which each piece was featured.
Given below is the description of the dataset.
Size of training set: 7,628 records
Size of test set: 2,748 records
FEATURES:
STORY: A part of the main content of the article to be published as a piece of news.
SECTION: The genre/category the STORY falls in.
There are four distinct sections into which each story may fall. The sections are labelled as follows:
Politics: 0
Technology: 1
Entertainment: 2
Business: 3
Getting The Datasets
Go to MachineHack, Sign Up as a user and click on the Predict The News Category Hackathon. Start the hackathon and find the dataset in the Attachment section.
Click here to register for the hackathon
Without further ado, let’s crack the Hackathon!
Solving The Hackathon
Let’s break the solution into six parts, as given below, for better understanding.
- Exploratory Data Analysis: A simple analysis of the data
- Data cleaning
- Data preprocessing: Count Vectors and TF-IDF Vectors
- Training the classifier
- Predicting for the test set
- Submitting your solution at MachineHack
Exploratory Data Analysis: A Simple Analysis Of The Data
Let’s start off with the usual drill and import all the necessary modules for our project.
#Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
#Download the following modules once
nltk.download('stopwords')
nltk.download('wordnet')
Let’s do a simple analysis of the data in hand.
#Importing the training set
train_data = pd.read_excel("Datasets/Data_Train.xlsx")
#Printing the top 5 rows
print(train_data.head(5))
#Printing the dataset info
print(train_data.info())
#Printing the shape of the dataset
print(train_data.shape)
Out: (7628, 2)
#Printing the group by description of each category
train_data.groupby("SECTION").describe()
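Before moving on, it is also worth checking whether the four sections are evenly represented, since a heavily skewed class distribution can bias a classifier. This quick check is optional and not part of the original walkthrough:
#Optional: counting the number of stories in each section
print(train_data['SECTION'].value_counts())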
Data Cleaning
#Removing duplicates to avoid overfitting
train_data.drop_duplicates(inplace = True)
#A punctuation string for reference (string.punctuation plus the curly quotes found in the dataset)
all_punctuations = string.punctuation + '‘’”'
#Method to remove punctuation marks from the data
def punc_remover(raw_text):
    no_punct = "".join([i for i in raw_text if i not in all_punctuations])
    return no_punct
#Method to remove stopwords from the data
def stopword_remover(no_punc_text):
    #Build the stopword set once instead of on every comparison
    stop_words = set(stopwords.words('english'))
    words = no_punc_text.split()
    no_stp_words = " ".join([i for i in words if i not in stop_words])
    return no_stp_words
#Method to lemmatize the words in the data
lemmer = nltk.stem.WordNetLemmatizer()
def lem(words):
    return " ".join([lemmer.lemmatize(word, 'v') for word in words.split()])
#Method to perform a complete cleaning
def text_cleaner(raw):
    cleaned_text = stopword_remover(punc_remover(raw))
    return lem(cleaned_text)
#Testing the cleaner method
text_cleaner("Hi!, this is a sample text to test the text cleaner method. Removes *@!#special characters%$^* and stopwords. And lemmatizes, go, going - run, ran, running")
Out: 'Hi sample text test text cleaner method Removes special character stopwords And lemmatizes go go run run run'
#Applying the cleaner method to the entire data
train_data['CLEAN_STORY'] = train_data['STORY'].apply(text_cleaner)
#Checking the new dataset
print(train_data.values)
Data Preprocessing: Count Vectors and TF-IDF Vectors
Creating Count vectors
#Importing sklearn’s Countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Creating a bag-of-words dictionary of words from the data
bow_dictionary = CountVectorizer().fit(train_data['CLEAN_STORY'])
#Total number of words in the bow_dictionary
len(bow_dictionary.vocabulary_)
Out: 35189
#Using the bow_dictionary to create count vectors for the cleaned data.
bow = bow_dictionary.transform(train_data['CLEAN_STORY'])
#Printing the shape of the bag of words model
print(bow.shape)
Out: (7551, 35189)
Creating TF-IDF Vectors
#Importing TfidfTransformer from sklearn
from sklearn.feature_extraction.text import TfidfTransformer
#Fitting the bag of words data to the TF-IDF transformer
tfidf_transformer = TfidfTransformer().fit(bow)
#Transforming the bag of words model to TF-IDF vectors
storytfidf = tfidf_transformer.transform(bow)
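TF-IDF down-weights words that occur across many stories and up-weights words that are distinctive to a particular story. If you want to sanity-check the transformation, the optional snippet below prints the ten highest-weighted terms of the first story; it reuses the bow_dictionary and storytfidf objects created above, and it assumes get_feature_names_out (available in scikit-learn 1.0 and later; older versions expose get_feature_names instead):
#Optional: inspecting the top-weighted terms of the first story
import numpy as np
feature_names = np.array(bow_dictionary.get_feature_names_out())
first_story_weights = storytfidf[0].toarray().flatten()
print(feature_names[first_story_weights.argsort()[::-1][:10]])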
Training The Classifier
#Creating a Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
#Fitting the training data to the classifier
classifier = MultinomialNB().fit(storytfidf, train_data['SECTION'])
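Note that the classifier above is trained on the full training set, and the hackathon’s test set carries no labels, so a held-out split is the only way to estimate accuracy locally. A minimal, optional sketch, reusing the storytfidf matrix and SECTION labels from above:
#Optional: estimating accuracy on a held-out validation split
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_val, y_train, y_val = train_test_split(storytfidf, train_data['SECTION'], test_size = 0.2, random_state = 42)
val_classifier = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_val, val_classifier.predict(X_val)))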
Predicting For The Test Set
#Importing and cleaning the test data
test_data = pd.read_excel("Datasets/Data_Test.xlsx")
test_data['CLEAN_STORY'] = test_data['STORY'].apply(text_cleaner)
#Printing the cleaned data
print(test_data.values)
Creating A Pipeline To Pre-Process The Data & Initialise The Classifier
#Importing the Pipeline module from sklearn
from sklearn.pipeline import Pipeline
#Initializing the pipeline with necessary transformations and the required classifier
pipe = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())])
#Fitting the training data to the pipeline
pipe.fit(train_data['CLEAN_STORY'], train_data['SECTION'])
#Predicting the SECTION
test_preds_mnb = pipe.predict(test_data['CLEAN_STORY'])
#Writing the predictions to an excel sheet
pd.DataFrame(test_preds_mnb, columns = ['SECTION']).to_excel("Predictions/predictions.xlsx")
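Note: to_excel writes the DataFrame index as an extra column by default; if the submission format expects only the SECTION column, pass index = False to to_excel.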
Submitting Your Solution At MachineHack
Finally, head to MachineHack and submit your Excel file at the Submission Deck of the hackathon.
1. Click on the Assignment.
2. Browse to your file and select it.
Note: Also provide a comment for your submission.
Check your score on the Hackathon Leaderboard; it is updated within 2 minutes.
And that’s it. You have successfully built a working solution. What’s left is tweaking the model for performance, and we will leave that part to you. Tune the model, improve your accuracy, top the leaderboard and win exciting prizes. One way to start is sketched below.
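For instance, since the whole workflow lives in a single Pipeline, you can grid-search its parameters end to end, such as the CountVectorizer n-gram range and the Naive Bayes smoothing parameter alpha. A minimal sketch; the parameter grid here is illustrative, not tuned:
#Optional: tuning the pipeline with a grid search
from sklearn.model_selection import GridSearchCV
param_grid = {'bow__ngram_range': [(1, 1), (1, 2)], 'classifier__alpha': [0.1, 0.5, 1.0]}
search = GridSearchCV(pipe, param_grid, cv = 5, scoring = 'accuracy')
search.fit(train_data['CLEAN_STORY'], train_data['SECTION'])
print(search.best_params_, search.best_score_)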
Happy Coding!