How Two Persistent Data Scientists Aced AIM’s Author Identification Hackathon

MachineHack recently concluded the Whose Line Is It Anyway: Identify The Author Hackathon and announced great prizes. Analytics India Magazine spoke to the winners about their experience of participating in and winning the hackathon.

Saurabh Kumar, who took the top rank on the leaderboard, is a Group Lead working on Financial Surveillance Analytics at Ameriprise Financial Services Inc and an avid data scientist. He first became interested in data science in 2014, when he heard that a machine learning algorithm named Random Forest was performing really well on classification tasks compared to traditional classifiers. He was overwhelmed by the amount of information available online and the variety of real-world problems he could solve using such ML algorithms. Since then, he has kept up his curiosity and has been consistent in learning about the field.

For the author classification problem, Saurabh started with basic feature extraction and bag-of-words (BoW)/TF-IDF methods. But as the competition on the leaderboard was strong, he went on to build deep learning LSTM models and fed their averaged cross-validated outputs as input to XGBoost to get the final probabilities for all the classes. He was also getting good local validation scores, so he was confident of getting good results on the test set as well.
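
To make the TF-IDF starting point concrete, here is a minimal sketch in plain Python. It is not Saurabh's code: the `tokenize` and `tfidf` helpers and the three sample sentences are hypothetical, and the weighting shown is the basic form (term frequency times log(N/df)), without the smoothing that libraries usually add.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would clean more.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document (basic, unsmoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample sentences, not taken from the hackathon data.
docs = [
    "the king spoke to the prince",
    "the detective examined the letter",
    "the prince left the palace",
]
vectors = tfidf(docs)
```

A word that occurs in every sentence, such as "the", gets a weight of zero, while a word unique to one sentence, such as "king", is weighted higher than one shared by two sentences. That is the property that makes TF-IDF a sensible baseline for telling authors apart.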

Saurabh first heard about MachineHack when it organised the Beer Hackathon, and he has participated in every hackathon since. About his experience on MachineHack.com he says, “My experience on this platform is great as they are continuously evolving in enriching user’s experience. Also, moderators are really helpful and prompt in answering participant’s queries.”

Mayank Kumar, who came in second on the leaderboard, graduated with a BTech in computer science. When he started out, he had no experience of coding or of any programming language. He faced many failures while learning to program: every day seemed like a failure, and he could not work out solutions even to easy questions.

He says about these failures, “But, after so many hard practices, participating in various competitions and sleepless nights, I win over it.  After some months, I came to know about a buzzword Data Science.”

He really enjoyed spending sleepless nights designing algorithms and solutions. In the beginning, Mayank found the field frustrating because so many subjects, such as linear algebra, probability, statistics and programming, together make up data science. What he loves most about the field is that there is always something he has never heard of or worked on before. The best way to reach Mayank Kumar is via LinkedIn (https://www.linkedin.com/in/mk9440/).

Mayank divides his problem-solving approach into three phases: EDA (Exploratory Data Analysis), data preprocessing and cleaning, and modelling. First, he explored the data to gain domain knowledge. Both the train and test data share a text column, which consists of sentences written by a particular author. Talking about the most interesting insights into the data, Mayank says, “During frequent word visualization through word cloud after removal of some common words in each of the article, it appeared that each author has some distinct plots/scenes which were a good way for machine learning model to capture that information to distinguish these articles between respective authors. Some plots were related to prince and king like stuff while some plots were related to Russian stuff, some were of Indian stuff while some were of detective stories of Sherlock Holmes and so on.”
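
The kind of exploration Mayank describes can be approximated without a word cloud by counting each author's most frequent words after dropping common ones. The sketch below is only an illustration under stated assumptions: the `top_words_by_author` helper, the tiny stop-word list and the sample rows are all hypothetical and do not come from the hackathon data.

```python
from collections import Counter, defaultdict

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "his", "her"}


def top_words_by_author(rows, n=3):
    """Map each author to their n most frequent words, ignoring stop words."""
    counts = defaultdict(Counter)
    for author, text in rows:
        # Strip surrounding punctuation and lowercase each token.
        words = (w.strip(".,;:!?\"'").lower() for w in text.split())
        counts[author].update(w for w in words if w and w not in STOP_WORDS)
    return {
        author: [word for word, _ in counter.most_common(n)]
        for author, counter in counts.items()
    }


# Hypothetical (author, text) rows for illustration only.
rows = [
    ("author_a", "The king and the prince rode to the castle."),
    ("author_a", "The prince greeted the king."),
    ("author_b", "Holmes examined the letter. Holmes smiled."),
]
top = top_words_by_author(rows, n=2)
```

On these toy rows, one author's list is dominated by royalty words and the other's by a detective's name, which mirrors the distinct "plots" that stood out in the word clouds.
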
In the modelling phase, Mayank came up with five different models: two SGD models, two PassiveAggressiveC models and one RidgeC model. He ran all five models with 40 KFold splits (a number chosen after tuning) on the train and test data, for all 12 of the data sets he had generated, and then stacked their results using XGBoost as a meta-model, which gave him his best score of 0.9889 (rank 2) on the leaderboard.
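
The stacking idea can be sketched in plain Python: each base model scores every training row while that row is held out of its fold, and those out-of-fold scores become the input features of a meta-model. This is not Mayank's pipeline. The sketch assumes a toy nearest-centroid learner in place of the SGD, PassiveAggressiveC and RidgeC models, the same toy learner in place of XGBoost as the meta-model, 4 folds instead of 40, and invented two-feature data. All helper names are hypothetical.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous validation folds."""
    folds, start = [], 0
    for i in range(n_folds):
        size = n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def make_centroid_learner(columns):
    """Toy binary base model: nearest class centroid on the chosen columns."""
    def fit(X, y):
        centroids = {}
        for label in (0, 1):
            rows = [[x[c] for c in columns] for x, t in zip(X, y) if t == label]
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

        def predict(x):
            point = [x[c] for c in columns]
            d0 = sum((a - b) ** 2 for a, b in zip(point, centroids[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(point, centroids[1]))
            return d0 - d1  # positive score leans towards class 1
        return predict
    return fit


def out_of_fold(fit, X, y, n_folds):
    """Score every row with a model that never saw it during training."""
    oof = [None] * len(X)
    for val_idx in kfold_indices(len(X), n_folds):
        held_out = set(val_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        predict = fit(train_X, train_y)
        for i in val_idx:
            oof[i] = predict(X[i])
    return oof


# Invented data: labels alternate so every training fold has both classes.
X = [[0.0, 0.2], [1.0, 0.9], [0.1, 0.0], [0.9, 1.1],
     [0.2, 0.1], [1.1, 1.0], [0.0, 0.3], [0.8, 0.8]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

base_learners = [make_centroid_learner(cols) for cols in ([0], [1], [0, 1])]

# One out-of-fold score column per base model, transposed into meta-features.
oof_columns = [out_of_fold(fit, X, y, n_folds=4) for fit in base_learners]
meta_X = [list(row) for row in zip(*oof_columns)]

# Meta-model trained on the stacked scores (toy stand-in for XGBoost).
meta_predict = make_centroid_learner([0, 1, 2])(meta_X, y)
# Scoring the training rows here only checks the plumbing; in a competition
# the meta-model would score meta-features built from the test set.
final_predictions = [1 if meta_predict(row) > 0 else 0 for row in meta_X]
```

The point of the out-of-fold step is that the meta-model only ever sees scores a base model produced for rows it was not trained on, which keeps the stacked features from leaking the training labels.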

Talking about his experience on MachineHack, Mayank says, “I first came to know about MachineHack via LinkedIn when I saw a post describing a launch of a competition titled ‘How To Choose The Perfect Beer’. I was very excited to join another Machine Learning community besides Kaggle and Hackerearth. Until now, my overall experience has been really enjoyable and I am expecting more challenging and fun competitions to participate at MachineHack.”

Abhijeet Katte
As a thorough data geek, most of Abhijeet's day is spent in building and writing about intelligent systems. He also has deep interests in philosophy, economics and literature.
