MachineHack is launching yet another hackathon to keep the data science and machine learning community occupied during the quarantine period amid the Covid-19 outbreak. To help the community use this time to expand their knowledge, MachineHack and Analytics India Magazine bring you Classifying Movie Scripts: Predict The Movie Genre Hackathon.
Problem Statement & Description
Given the entire script of a movie, can your ML model classify it into the right genre?
Labelling text data can be hard. Using the available information to create or predict labels automatically can be an interesting machine learning task. With Natural Language Processing (NLP), unstructured text data can be leveraged to generate the right classes for future test data.
To accomplish this, we have scraped close to 2000 movie scripts and the respective genres.

As some of the scripts are huge, it would be interesting to figure out new ways of feature extraction and different NLP techniques.
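One simple way to handle very long scripts is to split each one into overlapping windows before feature extraction, which is useful for models with a fixed input length (standard BERT checkpoints accept at most 512 tokens). The sketch below is illustrative only and is not part of the hackathon's starter notebook. The function name `chunk_tokens` and its default sizes are assumptions, and it counts whitespace-separated words, which only approximates a model's own subword tokens.

```python
def chunk_tokens(text, max_tokens=512, stride=256):
    """Split text into overlapping windows of whitespace-separated tokens.

    Hypothetical helper: max_tokens and stride are illustrative defaults,
    and whitespace tokens only approximate a model's subword tokens.
    """
    if max_tokens <= 0 or not 0 < stride <= max_tokens:
        raise ValueError("need max_tokens > 0 and 0 < stride <= max_tokens")

    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        # Take one window of at most max_tokens tokens.
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once this window has reached the end of the script.
        if start + max_tokens >= len(tokens):
            break
        # Advance by stride so consecutive windows overlap.
        start += stride
    return chunks
```

Per-window predictions could then be combined, for example by averaging the class probabilities across all windows of a script. Whether that beats simpler features such as TF-IDF on this dataset is something to test rather than assume.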
In this hackathon, participants are challenged to use movie scripts to design a natural language processing system that can help customers classify each script into the right genre.
The current platform struggles to classify movies with an accuracy above 90%. However, we at MachineHack feel that current state-of-the-art NLP models such as BERT and OpenAI's GPT have paved the way for more robust systems that can understand the context of the provided text data.
Data Description
The participants will have access to the following files:
- Train.csv – 1978 script file names with the class labels.
- Test.csv – 849 script file names without the class labels.
- Scripts – Folder with 2827 script .txt files.
- Sample Submission – Sample format for the submission.
- Starter Notebook – A simple benchmark notebook.
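The files listed above fit together as follows: each row of Train.csv or Test.csv names a script in the Scripts folder, and the 1978 training and 849 test file names together account for all 2827 scripts. Below is a minimal sketch of pairing each CSV row with its script text. The column names `File_Name` and `Labels` are assumptions, as is the idea that the file name column holds the full .txt file name; check the actual CSV headers before using it.

```python
import csv
from pathlib import Path


def load_split(csv_path, scripts_dir, name_col="File_Name", label_col=None):
    """Return a list of (script_text, label) pairs for one CSV split.

    name_col and label_col are hypothetical column names; the real headers
    in Train.csv and Test.csv may differ. label is None when label_col is
    not given, as for the unlabelled test split.
    """
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            script_path = Path(scripts_dir) / row[name_col]
            # Scraped scripts may contain odd bytes, so replace undecodable ones.
            text = script_path.read_text(encoding="utf-8", errors="replace")
            label = row[label_col] if label_col else None
            pairs.append((text, label))
    return pairs
```

For example, `load_split("Train.csv", "Scripts", label_col="Labels")` would return labelled pairs for training, while `load_split("Test.csv", "Scripts")` would return pairs whose label is `None`.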
Data Preview
Train.csv
Test.csv
Movie_Scripts_Sample_Submission.xlsx
Refer to the starter notebook; simply running it generates a benchmark submission.
Bounties
The hackathon provides participants with an exclusive opportunity to win free passes to Cypher 2020.
The top 3 competitors will each receive a free pass to Cypher 2020.
Cypher is India’s largest Analytics & AI summit. In its sixth year, Cypher has emerged as the ideal platform to network and learn from leading industry experts, companies and startups in the fields of analytics, data science and artificial intelligence.
Learning from transformative thinkers and connecting with like-minded innovators, Cypher provides a platform where you will be challenged to push yourself in data-driven processes while drawing inspiration from those thriving in the industry.
Rules
- There can only be one account per participant. Submissions from multiple accounts will lead to disqualification.
- The submission limit for the hackathon is three per day; any further submissions that day will not be evaluated.
- This hackathon will expire on May 15 at 16:00 IST.
- All registered users are eligible to participate in the hackathon.
- This competition counts towards our overall ranking points.
- You will not be able to submit once you click the “Complete Hackathon” button. You may ignore this feature.
- We ask that you respect the spirit of the competition and do not cheat.
Evaluation
Submissions are scored on the leaderboard using multi-class log loss (cross-entropy loss); a lower score is better.
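Multi-class log loss is the average negative log of the probability a submission assigns to each sample's true class, so a submission needs one probability per class for every script. The sketch below shows the standard definition in plain Python so you can score a validation split locally. The clipping constant `eps` and the row renormalisation are common conventions assumed here, not details confirmed by MachineHack.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices, one per sample.
    y_prob: list of per-sample probability lists, one value per class.
    eps:    assumed clipping constant that keeps log() finite.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")

    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip away exact 0 and 1, then renormalise so the row sums to 1.
        clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
        row_sum = sum(clipped)
        total += -math.log(clipped[label] / row_sum)
    return total / len(y_true)
```

A model that spreads probability evenly over K classes scores ln K, which makes a useful baseline. A confident wrong prediction is penalised heavily, so well-calibrated probabilities matter as much as picking the right genre.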