Last updated June 1, 2020
In Creative AI

Meet The MachineHack Champions Who Cracked The ‘Predict The Movie Genre’ Hackathon

Share

Published on June 1, 2020

by Amal Nair

MachineHack last week successfully concluded its Classifying Movie Scripts: Predict The Movie Genre Hackathon. With over 600 registrations and active participation from 60 participants, we introduce you to the top 2 competitors and approaches that helped them in cracking the problem.

#1: Amul Patil

Amul started his career as a financial analyst, where he was introduced to some basic inferential statistics used in business setups. Intrigued by the potential of data and its impact on decision making, he chose to follow data science as his career path. He has worked for multiple organisations developing his skills along the journey.

“MachineHack is a wonderful platform for learners, which keeps engaging with frequent new case studies and allows participants to compete while learning along the way.” – he shared his opinion on MachineHack.

Approach To Solving The Problem

Amul explains his approach as follows:

Performed data cleaning and created an initial baseline model using Logistic Regression
Performed hyperparameter search of different algorithms and used TFIDF Vectorizer
Ensemble of three models (Naive Bayes, Logistic Regression & SVD) and XGboost were used for the final probabilities.
1. The arithmetic mean of probabilities Naive Bayes (over cross-validated models)
2. The arithmetic mean of probabilities Logistic Regression( over cross-validated models)
3. Created SVD component features
4. Meta learner XGboost was used on the outputs of the above three results

I tried some of the latest techniques like ULMFIT (semi-supervised learning) and summarization of text followed by BERT. Both of the approaches took a lot of time to train on available resources (i,e Colab) and seemed to be an overkill for the task at hand. So, I decided to stick with the above setup which paid off in obtaining a top score in private leaderboard.

Get the complete code here.

#2: Sairam Chitreddy

Sairam completed his B.Tech in Engineering Physics from IIT Delhi and later joined FIITJEE in 2015 as a Physics faculty for IIT JEE. His passion for learning in general and problem-solving led him to machine learning and with no background in coding he enrolled in various MOOCs to learn about the demanding field. He uses his free time to venture into deep learning and natural language processing. He spends most of his time doing projects, participating in hackathons. He is also actively looking for an entry into the field of data science.

“This was my 2^nd competition on MachineHack after the Flight ticket price prediction. Though I did not fare well in that competition, later when I read the winner’s solution, it improved my perspective on how to approach the problem, and I also learnt some cool techniques. Personally, during this hackathon, not only my application skills but an understanding of underlying concepts also improved. Also, I felt that the learning doesn’t stop once the competition ends because, during the course of this hackathon, I realised there are still many more ideas to try, implement and enhance the performance. Another aspect of MachineHack, which I like is the variety in the competitions.”- Sairam shared his opinion on MachineHack

Approach To Solving The Problem

He explains his approach briefly as follows.

I used Google Colab to work with GPUs. The major problem which I have encountered is “CUDA: out of memory” errors. Following are the steps which I took to solve this issue:

1) Processing one script at a time instead of a batch of scripts

2) Using distilled models

3) Deleting unused variables at each step

4) Switching of gradient calculations. (Even when the model is in evaluation mode, it was still calculating gradients. So, need to use torch.no_grad())

While encoding, I used sequences of maximum possible lengths instead of individual sentences, because one of the strengths of transformer models is the self-attention mechanism. Longer sequences give better contextual word embeddings.

Then for each script, I took the mean of sequence encodings and applied a Logistic Regression on it to establish a baseline which finally turned out to be a winning solution.

My leaderboard score was too bad compared to validation scores. So, after referring to some Kaggle discussions, I realised that Stratified K-fold cross-validation gives a better estimate of model performance. And indeed, my final score was nearly within one standard deviation of my 5-fold CV score.

Apart from the above approach, I tried treating each encoding as an example, with script-label as its label and fitting a neural network over it but failed because of too much noise.

Other ideas worth exploring which I could not try :

1) Using more powerful models instead of just Logistic Regression

2) Address class imbalance with oversampling (not doing this might be the reason for bad leaderboard scores

My most important learning from this competition – Trust your K-fold cross-validation because the split used for calculating public leaderboard scores might be skewed.

Get the complete code here.

Check out for new hackathons here.

Access all our open Survey & Awards Nomination forms in one place