Meet The Winners Of TheMathCompany’s Data Scientist Hiring Hackathon

  • Based on the double hurdle format, TheMathCompany has announced the three cash prize winners.

The MATHCO.thon concluded on July 19, 2021. The Data Scientist Hiring Hackathon was followed by a SQL-based quiz for all shortlisted participants. Based on this double hurdle format, TheMathCompany has announced the three cash prize winners. Here, we look at their personal journeys, solution approaches and experiences at MachineHack.

First prize – Sai Deepak

Sai Deepak completed his undergraduate degree in production engineering. His final-year project studied air pollutants and automobile exhaust using IoT data.


After graduating, he took a data science course at Great Learning and later joined a telecom company as a data analyst. Sai Deepak aspires to move from a data analyst role to a data scientist role.

Approach

Sai Deepak performed EDA on all his features and combined high-cardinality columns into fewer columns. He built baseline models with multiple experiments on the output variable and zeroed in on a log transformation, which he also applied to two of the predictor variables, Mileage and Levy. Levy was particularly tricky due to missing data; its imputation was done by trial and error. He also incorporated the ‘ID’ column as a feature. Having found that outliers were hurting the model, he applied a quantile transformation to the data for robustness, which improved the model’s results. He used categorical encoding and one-hot encoding techniques to convert the categorical data.
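The two transformations at the heart of this step can be sketched in a few lines. This is a minimal illustration on synthetic data (the real Mileage, Levy and Price columns are not reproduced here), using scikit-learn's `QuantileTransformer`:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Toy stand-ins for skewed columns such as Mileage and the Price target.
mileage = rng.lognormal(mean=11, sigma=1.0, size=(500, 1))
price = rng.lognormal(mean=9, sigma=0.8, size=500)

# Log-transform the skewed target, as in the winning approach.
log_price = np.log1p(price)

# Quantile transformation maps a feature onto a normal distribution,
# which damps the influence of extreme values on the model.
qt = QuantileTransformer(output_distribution="normal",
                         n_quantiles=100, random_state=0)
mileage_qt = qt.fit_transform(mileage)
```

After the transform, `mileage_qt` is approximately standard normal regardless of how heavy-tailed the raw column was, which is what makes the downstream model robust to outliers.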

Sai Deepak used mathematical functions such as square roots, arithmetic operators and KMeans clustering to create new features. He tried various models, including tree-based regressors, SVC and stacking. The LightGBM, XGBoost, Random Forest and CatBoost regressors contributed most to the prediction, and he used a blending technique to combine the results of all the models.
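The KMeans-feature and blending ideas can be sketched as below. The data is synthetic, and scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` stand in for the LightGBM/XGBoost/CatBoost regressors named above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

# Engineered feature: the KMeans cluster label of each row.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
X_feat = np.column_stack([X, km.fit_predict(X)])

X_tr, X_te, y_tr, y_te = train_test_split(X_feat, y, random_state=0)

# Blending: average the predictions of several tree-based regressors.
models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]
preds = [m.fit(X_tr, y_tr).predict(X_te) for m in models]
blended = np.mean(preds, axis=0)
```

Simple averaging is the plainest form of blending; weighted averages, with weights tuned on a hold-out set, are a common refinement.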

Experience

This is Sai Deepak’s second outing at MachineHack. He said MachineHack is the only platform where data science problems are available at basic, intermediate and advanced levels. “MachineHack is the best platform to learn Data Science,” he said. The bootcamp videos helped him prepare for the hackathon.

Check out his solution HERE

Second prize – Akash Gupta

Akash Gupta completed his B.Tech in IT from Ajay Kumar Garg Engineering College (AKGEC), Ghaziabad, in 2020. In his second semester, he enrolled in Andrew Ng’s storied Machine Learning course. He has also taken courses such as the IBM Data Science Professional Certificate, Mathematics for Machine Learning by Imperial College London and the Deep Learning Specialization on Coursera.


He started actively participating in ML hackathons from his fourth semester. His data science accomplishments include Grandmaster at MachineHack (AIM), Kaggle Expert, 7 gold medals at Dockship, and top-3 finishes at 40+ ML hackathons.

Approach

To start with, he did EDA on the data and noted the deviation in the training set’s Price column. Looking at the top 10 car prices, he found a huge gap between them and the upper quartile. He removed some training rows that were not helping the model and clipped the outlier values to make the model robust.
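Clipping outliers is a one-liner with NumPy. A minimal sketch on made-up prices (the exact percentile thresholds here are illustrative, not taken from the winning solution):

```python
import numpy as np

# Toy price column with two extreme values at the top.
prices = np.array([4500, 5200, 6100, 7000, 8000, 9500,
                   250000, 310000], dtype=float)

# Clip to the 1st/99th percentiles to blunt the outliers
# without dropping the rows entirely.
lo, hi = np.percentile(prices, [1, 99])
clipped = np.clip(prices, lo, hi)
```

The clipped column keeps every row but pulls the extremes in toward the bulk of the distribution, which stabilises loss functions that are sensitive to large errors.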

Feature engineering included aggregating various features correlated with Mileage, along with statistical features such as the quartiles, mean and standard deviation.
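Group-wise statistical aggregates like these are typically built with a pandas groupby and merged back onto the rows. A small sketch with made-up data (the grouping key and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Manufacturer": ["Toyota", "Toyota", "BMW", "BMW", "BMW"],
    "Mileage": [120000, 80000, 60000, 90000, 30000],
})

# Per-group statistics of Mileage: mean, std and lower quartile.
agg = (df.groupby("Manufacturer")["Mileage"]
         .agg(["mean", "std", lambda s: s.quantile(0.25)]))
agg.columns = ["mileage_mean", "mileage_std", "mileage_q25"]

# Merge the aggregates back so every row carries its group's stats.
df = df.merge(agg.reset_index(), on="Manufacturer", how="left")
```

Each row now carries its manufacturer's mileage statistics as extra features, which lets tree models exploit group-level structure directly.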

For modelling, he first applied np.log1p to the price, followed by a train-test split. The LightGBM algorithm gave the best performance. He also experimented with different boosting types (GBDT and DART) and different random states, and obtained the final prediction using 10-fold cross-validation.

Experience

“MachineHack is one of the best hackathon organisers and data science knowledge portals in India,” said Akash. He’s been a MachineHacker since its inception. He also likes the new user interface. Additionally, the features like practice and Boot Camp have helped him in preparation for the MATHCO.thon. 

Check out his solution HERE

Third prize – N Sai Sandeep

Sai Sandeep got hooked on data science after watching a video of Google’s 2018 I/O conference. He then enrolled in an AppliedAI course and later interned with AppliedAI, where he built a chatbot using a sequence-to-sequence model in the PyTorch framework. He went on to work full time at the same organisation, guiding students through their projects and developing course content. Currently, he is working as a Data Scientist at Sutherland Global.

Approach

Sai Sandeep spent 95% of his time analysing and preparing the data for the model and the rest on model building. He said EDA and data preprocessing are the important steps, and that AutoML tools for model building work exceptionally well in hackathons. Here’s his step-by-step approach:


1) Univariate and multivariate analysis: EDA on all the independent features and the target column. Columns such as Mileage, Levy and Engine volume had missing values or additional strings attached to them (formatted as categorical instead of continuous). He performed multivariate analysis using Plotly to understand the relation of the independent variables to the target variable.

He noticed a high correlation of a few variables with respect to manufacturers, such as between missing Levy values and Price, and between median car prices and the year of production.

2) Data preprocessing & imputation: Based on the insights above, columns such as Engine volume, Levy, Doors and Mileage were transformed. Missing values were imputed from correlated features, by grouping on a related column and taking the median.
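Both the string cleanup and the group-wise median imputation are short pandas operations. A sketch on made-up rows (the " km" suffix and the Manufacturer grouping key are illustrative of the kind of cleanup described, not copied from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Manufacturer": ["Toyota", "Toyota", "BMW", "BMW"],
    "Levy": [500.0, np.nan, 900.0, np.nan],
    "Mileage": ["120000 km", "80000 km", "60000 km", "90000 km"],
})

# Strip the unit string so Mileage becomes a numeric column.
df["Mileage"] = df["Mileage"].str.replace(" km", "", regex=False).astype(int)

# Impute missing Levy with the median Levy of the same manufacturer.
df["Levy"] = (df.groupby("Manufacturer")["Levy"]
                .transform(lambda s: s.fillna(s.median())))
```

Group-wise medians respect the structure noticed in the EDA (Levy varies by manufacturer), so they usually beat a single global fill value.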

3) Target variable & metric analysis: The target column, Price, was skewed and contained a lot of extreme values, so it was log-transformed.

4) Feature engineering & feature importance: Combinations such as with and without the raw (pre-preprocessing) features, the ID column, etc. were trialled to find the feature configuration that gave the best result. Categorical features were transformed using one-hot encoding and label encoding, while numerical features were scaled. An Extra-Trees regressor was trained to identify and remove unhelpful features.
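The encoding plus importance-based filtering step can be sketched as below on synthetic data; the 0.01 importance cutoff is an illustrative threshold, not the one used in the winning solution:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "fuel": rng.choice(["petrol", "diesel"], size=200),
    "mileage": rng.normal(size=200),
    "noise": rng.normal(size=200),  # deliberately uninformative
})
y = df["mileage"] * 3 + rng.normal(scale=0.1, size=200)

# One-hot encode the categorical column.
X = pd.get_dummies(df, columns=["fuel"])

# Rank features by Extra-Trees importance and keep the useful ones.
et = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(et.feature_importances_, index=X.columns)
keep = importances[importances > 0.01].index
```

The informative `mileage` column ends up with a far higher importance than the pure-noise column, which is what lets the threshold prune dead weight from the feature set.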

5) Model tuning using AutoGluon: AutoGluon gave better results than other packages. However, applying it directly didn’t help, as the stacked models were overfitting. After going through the source code and logs, he found data leakage involving the Random Forest model, so RF was excluded from the list of models and AutoGluon was trained to output a three-level stacked model. The final model was a level-3 weighted ensemble with LightGBM, Extra-Trees and CatBoost as base models.

Experience 

“MachineHack has been an amazing platform for machine learning enthusiasts. I signed up for the first time to participate in the Great India Hiring Hackathon. Though I didn’t perform well back then, I applied my learning in the next Car Price Predictions challenge and got into the top 3. Also, the discussion forum has been very helpful and the UI was clutter-free and fast,” he said.

Check out his solution HERE


Copyright Analytics India Magazine Pvt Ltd
