Complete Guide To Cracking MachineHack’s ‘Predict The News Category Hackathon’

Natural Language Processing (NLP) is one of the most explored and successful domains in machine learning. It matters because complex communication is one of the strongest signs of intelligence, and NLP is how we try to make machines communicate with humans effortlessly.

In this article, we will do hands-on NLP with Python to solve MachineHack’s Predict The News Category hackathon.


Predict The News Category Hackathon 

MachineHack has launched its second Natural Language Processing challenge for its large data science and ML audience. The hackathon is about predicting the category or section of a news piece from its content. The dataset consists of news pieces collected from a number of different sources, along with the category or section in which each piece was featured.

Given below is the description of the dataset.


Size of training set: 7,628 records
Size of test set: 2,748 records

FEATURES:

STORY: A part of the main content of the article to be published as a piece of news.
SECTION: The genre/category the STORY falls in.

There are four distinct sections into which each story may fall. The sections are labelled as follows:

Politics: 0
Technology: 1
Entertainment: 2
Business: 3

Getting The Datasets

Go to MachineHack, sign up as a user and click on the Predict The News Category Hackathon. Start the hackathon and find the dataset in the Attachment section.

Click here to register for the hackathon 

Without further ado, let’s crack the Hackathon!

Solving The Hackathon

Let’s break the solution into 6 parts as given below for better understanding.

  1. Exploratory Data Analysis: A simple analysis of the data
  2. Data cleaning 
  3. Data preprocessing: Count Vectors and TF-IDF Vectors
  4. Training the classifier
  5. Predicting for the test set
  6. Submitting your solution at MachineHack

Exploratory Data Analysis: A Simple Analysis Of The Data

Let’s start off with the usual drill and import all the necessary modules for our project.

#Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
#Download the following modules once
nltk.download('stopwords')
nltk.download('wordnet')

Let’s do a simple analysis of the data in hand.

#Importing the training set
train_data = pd.read_excel("Datasets/Data_Train.xlsx")

#Printing the top 5 rows
print(train_data.head(5))

#Printing the dataset info
print(train_data.info())

#Printing the shape of the dataset
print(train_data.shape)

Out: (7628, 2)

#Printing the group by description of each category
train_data.groupby("SECTION").describe()

Data Cleaning 

#Removing duplicates to avoid overfitting
train_data.drop_duplicates(inplace = True)

#A punctuations string for reference (added other valid characters from the dataset)
all_punctuations = string.punctuation + '‘’,:”][],' 

#Method to remove punctuation marks from the data
def punc_remover(raw_text):
    no_punct = "".join([i for i in raw_text if i not in all_punctuations])
    return no_punct

#Method to remove stopwords from the data
def stopword_remover(no_punc_text):
    words = no_punc_text.split()
    no_stp_words = " ".join([i for i in words if i not in stopwords.words('english')])
    return no_stp_words

#Method to lemmatize the words in the data
lemmer = nltk.stem.WordNetLemmatizer()
def lem(words):
    return " ".join([lemmer.lemmatize(word, 'v') for word in words.split()])

#Method to perform a complete cleaning
def text_cleaner(raw):
    cleaned_text = stopword_remover(punc_remover(raw))
    return lem(cleaned_text)

#Testing the cleaner method
text_cleaner("Hi!, this is a sample text to test the text cleaner method. Removes *@!#special characters%$^* and stopwords. And lemmatizes, go, going - run, ran, running")

Out: 'Hi sample text test text cleaner method Removes special character stopwords And lemmatizes go go run run run'

#Applying the cleaner method to the entire data
train_data['CLEAN_STORY'] = train_data['STORY'].apply(text_cleaner)

#Checking the new dataset
print(train_data.values) 

Data Preprocessing: Count Vectors and TF-IDF Vectors

Creating Count Vectors

#Importing sklearn’s CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Creating a bag-of-words dictionary of words from the data
bow_dictionary = CountVectorizer().fit(train_data['CLEAN_STORY'])

#Total number of words in the bow_dictionary
len(bow_dictionary.vocabulary_)

Out: 35189

#Using the bow_dictionary to create count vectors for the cleaned data.
bow = bow_dictionary.transform(train_data['CLEAN_STORY'])

#Printing the shape of the bag of words model
print(bow.shape)

Out: (7551, 35189)
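Each row of bow is a sparse vector of word counts for one story. Purely as an illustration, the first story’s vector can be inspected like this:

#Peeking at the count vector of the first story
first_vec = bow[0]
print(first_vec.nnz)   #Number of distinct vocabulary words in the story
print(first_vec.max()) #Highest count of any single word in the story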

Creating TF-IDF Vectors

#Importing TfidfTransformer from sklearn
from sklearn.feature_extraction.text import TfidfTransformer

#Fitting the bag of words data to the TF-IDF transformer
tfidf_transformer = TfidfTransformer().fit(bow)

#Transforming the bag of words model to TF-IDF vectors
storytfidf = tfidf_transformer.transform(bow)
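TF-IDF down-weights words that occur in many documents and up-weights rarer, more informative ones. As a quick sanity check, the learned IDF weight of any vocabulary word can be looked up; 'govern' below is just an assumed example token, so substitute any word from the vocabulary:

#Looking up the IDF weight of a sample word (an assumed example)
sample_word = 'govern'
word_index = bow_dictionary.vocabulary_.get(sample_word)
if word_index is not None:
    print(tfidf_transformer.idf_[word_index])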

Training The Classifier

#Creating a Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

#Fitting the training data to the classifier
classifier = MultinomialNB().fit(storytfidf, train_data['SECTION'])
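To sanity-check the freshly trained classifier, an arbitrary sentence can be passed through the same clean, count-vectorize and TF-IDF steps before predicting. A minimal sketch with a made-up, business-sounding sentence (the expected label is a guess, not a guarantee):

#Classifying a made-up sentence end to end
sample = text_cleaner("The company reported strong quarterly profits and revenue growth")
sample_tfidf = tfidf_transformer.transform(bow_dictionary.transform([sample]))
print(classifier.predict(sample_tfidf)) #Likely [3] i.e. Business, though not guaranteed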

Predicting For The Test Set

#Importing and cleaning the test data
test_data = pd.read_excel("Datasets/Data_Test.xlsx")
test_data['CLEAN_STORY'] = test_data['STORY'].apply(text_cleaner)

#Printing the cleaned data
print(test_data.values)

Creating A Pipeline To Pre-Process The Data & Initialise The Classifier

#Importing the Pipeline module from sklearn
from sklearn.pipeline import Pipeline

#Initializing the pipeline with necessary transformations and the required classifier
pipe = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())])

#Fitting the training data to the pipeline
pipe.fit(train_data['CLEAN_STORY'], train_data['SECTION'])
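Since the pipeline bundles the whole workflow into a single estimator, a quick cross-validation gives a rough accuracy estimate before submitting. This is an optional extra; the choice of 5 folds is arbitrary:

#Estimating accuracy with 5-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, train_data['CLEAN_STORY'], train_data['SECTION'], cv = 5)
print(scores.mean())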

#Predicting the SECTION
test_preds_mnb = pipe.predict(test_data['CLEAN_STORY'])

#Writing the predictions to an excel sheet
pd.DataFrame(test_preds_mnb, columns = ['SECTION']).to_excel("Predictions/predictions.xlsx")
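Note: to_excel writes the DataFrame index as an extra column by default. If the submission format expects only the SECTION column, pass index = False:

pd.DataFrame(test_preds_mnb, columns = ['SECTION']).to_excel("Predictions/predictions.xlsx", index = False)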

Submitting Your Solution At MachineHack

Finally, head to MachineHack and submit your Excel file at the Submission Deck of the hackathon.

  1. Click on the Assignment.

  2. Browse to your file and select it.

Note: Also provide a comment for your submission.

Check your score on the hackathon leaderboard, which is updated within two minutes.

And that’s it. You have successfully found a solution. What’s left is to tweak the model for performance, and we will leave that part up to you. Tune the model, improve your accuracy, top the leaderboard and win exciting prizes.
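As a starting point for that tuning, the pipeline’s parameters can be searched with sklearn’s GridSearchCV. The grid below is a minimal, hypothetical sketch rather than a recommended configuration; the parameter names are prefixed with the step names defined in the pipeline above:

#A hypothetical hyperparameter search over the pipeline
from sklearn.model_selection import GridSearchCV
param_grid = {
    'bow__ngram_range': [(1, 1), (1, 2)], #Unigrams vs. unigrams + bigrams
    'tfidf__use_idf': [True, False],      #Toggle IDF weighting
    'classifier__alpha': [0.1, 0.5, 1.0], #Naive Bayes smoothing strength
}
search = GridSearchCV(pipe, param_grid, cv = 5)
search.fit(train_data['CLEAN_STORY'], train_data['SECTION'])
print(search.best_params_, search.best_score_)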

Happy Coding!

 


