MachineHack Winners: How These Data Science Enthusiasts Solved The ‘Predicting The Costs Of Used Cars’ Hackathon

MachineHack concluded its 15th successful competition by announcing the winners of its recent hackathon, Predicting The Costs Of Used Cars. The hackathon, sponsored by Imarticus Learning, was launched in June and drew over 1,700 participants, of whom three emerged as winners.



Participants Divyanshu Suri, Sandeep Rathod and Vasim Shaikh won the first, second and third places respectively on the hackathon leaderboard.

#1: Divyanshu Suri

Now a Senior Manager of Machine Learning at AXA XL, Divyanshu Suri is not new to MachineHack. Last year he won the How To Choose The Perfect Beer Hackathon and still holds the top spot on its leaderboard. Having done his Bachelor’s in Statistics from Delhi University and a Master’s in Applied Statistics from IIT Bombay, Divyanshu first saw the real power of data science at his second job, at EXL Service, where he worked in insurance analytics. He then went on to participate in many online hackathons, building his knowledge and sharpening his skills.

Now, as a Senior Manager, he applies predictive analytics to solve a variety of data science problems in commercial and speciality lines.

Divyanshu’s Approach To Solving The Problem

He started by demystifying the data through exploratory analysis, identifying missing values and cleaning the unstructured fields. He explained his process as follows:

The following variables were converted to numeric by handling their units appropriately –

  • Mileage 
  • Engine
  • Power
  • New_Price
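The unit-handling step could look like the following pandas sketch. The column values here are hypothetical examples in the style of the dataset (e.g. "18.9 kmpl"), not actual rows from the competition data:

```python
import pandas as pd

# Hypothetical raw values in the style of the dataset (not actual rows)
raw = pd.DataFrame({
    "Mileage": ["18.9 kmpl", "23.1 kmpl", None],
    "Engine": ["998 CC", "1497 CC", "1199 CC"],
    "Power": ["58.2 bhp", "117.3 bhp", "88.7 bhp"],
})

# Strip the unit suffix and cast to float; unparseable entries become NaN
for col in ["Mileage", "Engine", "Power"]:
    raw[col] = pd.to_numeric(raw[col].str.split().str[0], errors="coerce")
```

After this pass, the columns are plain floats (with NaNs marking the missing entries to be imputed later).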

Created dummy variables for the following categorical variables –

  • Location
  • Fuel_Type
  • Transmission
  • Owner_Type
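In pandas, dummy encoding for these columns is a one-liner; the category values below are illustrative:

```python
import pandas as pd

# Illustrative categorical values for the four columns
cats = pd.DataFrame({
    "Location": ["Mumbai", "Delhi", "Mumbai"],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Manual", "Automatic"],
    "Owner_Type": ["First", "Second", "First"],
})

# One-hot encode; drop_first avoids a redundant (collinear) column per group
dummies = pd.get_dummies(
    cats,
    columns=["Location", "Fuel_Type", "Transmission", "Owner_Type"],
    drop_first=True,
)
```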

Imputed missing values for the following variables based on other variables –

  • Mileage
  • Engine
  • Power
  • Seats
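One simple way to impute a variable "based on other variables" is a group-wise fill, e.g. replacing a missing Seats value with the median for the same car model. This is a hedged sketch of the idea with made-up rows, not the winner's exact code:

```python
import pandas as pd

# Made-up rows: Seats is missing for one Alto
cars = pd.DataFrame({
    "Car": ["Alto", "Alto", "Innova", "Innova"],
    "Seats": [5.0, None, 7.0, 7.0],
})

# Fill each missing value with the median Seats for the same car model
cars["Seats"] = cars.groupby("Car")["Seats"].transform(
    lambda s: s.fillna(s.median())
)
```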

I didn’t use “Name” as-is in the models; instead, I extracted the following two features from it –

  • Company Name (like Honda, Maruti, Hyundai, etc)
  • Car Name (Amaze, Alto, Innova, Jazz, etc)
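If the company name is the first token of "Name" and the model the second, the extraction is a simple string split (the example names are illustrative):

```python
import pandas as pd

# Illustrative "Name" strings in the style of the dataset
names = pd.Series([
    "Maruti Alto K10 LXI",
    "Honda Amaze S i-DTEC",
    "Toyota Innova 2.5 GX",
])

company = names.str.split().str[0]  # first token -> Company Name
car = names.str.split().str[1]      # second token -> Car Name
```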

Created prior-year (1, 2, 3 and all years) average Price, minimum Price and maximum Price features, grouped by the following variables –

  • Name
  • Power (after grouping it in bins)
  • Engine (after grouping it in bins)
  • Company Name
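The "all years" variant of these grouped price aggregates can be sketched with a pandas groupby; the frame below is synthetic, and the 1/2/3-year variants would apply the same aggregation to a filtered window of years:

```python
import pandas as pd

# Synthetic price history (not competition data)
hist = pd.DataFrame({
    "Company": ["Maruti", "Maruti", "Honda", "Honda"],
    "Year": [2014, 2015, 2014, 2015],
    "Price": [3.5, 4.0, 6.0, 7.0],
})

# All-years average / min / max Price per company
agg = hist.groupby("Company")["Price"].agg(["mean", "min", "max"])
```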

Imputed missing values of “New_Price” by building a separate XGBoost model with a 5-fold cross-validation strategy.
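The shape of such an imputation model is: fit on the rows where New_Price is observed, cross-validate to sanity-check it, then predict the missing rows. The sketch below uses synthetic data and scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(800, 3000, size=(200, 1))              # e.g. Engine (CC)
new_price = 2.0 + 0.004 * X[:, 0] + rng.normal(0, 0.5, 200)
observed = rng.random(200) < 0.7                        # ~70% have New_Price

model = GradientBoostingRegressor(random_state=0)
# 5-fold CV on the observed rows to check the imputer generalises
scores = cross_val_score(model, X[observed], new_price[observed], cv=5)

# Fit on observed rows, predict the missing ones
model.fit(X[observed], new_price[observed])
imputed = model.predict(X[~observed])
```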

Response Variable: Since the evaluation metric was RMSLE, I used the log(X+1) transformation in all my models except two, where I used a power of 0.1 as the transformation (the intent was to create diversified models for the ensemble).
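The log(X+1) transform is convenient here because RMSE computed on log(X+1) targets is exactly RMSLE, so standard regressors optimise the right metric. In NumPy the transform and its inverse are one-liners:

```python
import numpy as np

price = np.array([1.75, 12.5, 35.0])  # illustrative prices in lakhs

y = np.log1p(price)       # log(X + 1) target transform
back = np.expm1(y)        # invert model predictions before submission
```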

Validation Strategy: Since the training dataset was not big, I used a 5-fold validation strategy and selected a seed whose validation score was closest to the leaderboard score. I used that same seed to split the dataset across all the models.

Modelling: I started by trying different modelling techniques to see which gave me the best result. I tried Linear Regression, Lasso, Ridge, Random Forest, XGBoost, LightGBM, neural nets and so on. XGBoost and LightGBM gave me the best results, so most of my energy then went into optimizing the parameters for those two.

My best single XGBoost model scored 0.9464 on the leaderboard and my single LightGBM model scored 0.9463. I reached 0.9473 on the leaderboard by taking an average of 6 models.

Final Model: My final ensemble was inspired by the solution of the Netflix Prize winning team (“BellKor's Pragmatic Chaos”) in 2009, which used learnings from the leaderboard score to find the coefficients for the different models’ predictions in the ensemble.
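At prediction time, an ensemble of this kind reduces to a weighted average of the member models' predictions. The weights below are hypothetical placeholders; the winning approach tuned them against leaderboard feedback:

```python
import numpy as np

# Hypothetical predictions from three models for three cars
# (rows = cars, columns = models)
preds = np.array([
    [5.1, 5.0, 5.3],
    [9.8, 10.2, 10.0],
    [3.0, 2.9, 3.1],
])

# Hypothetical ensemble coefficients (in practice tuned via the leaderboard)
weights = np.array([0.5, 0.3, 0.2])

blend = preds @ weights  # weighted average per car
```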

Click here to view the code.

#2: Sandeep Rathod

Sandeep was introduced to data science by his colleagues at Infosys, and it became a passion in which he enjoyed the right blend of mathematics and analysis. Wanting to learn more and upskill in his newfound passion, he joined UpX Academy, which gave him the right tools and the confidence to win hackathons. Sandeep is now a Project Manager at Infosys.

Sandeep has come close to victory in other hackathons at MachineHack and is familiar with the platform. He said, “MachineHack use cases are a good encouragement for naïve people like me. I would like to thank them for providing such a platform to compete and learn machine learning.”

Approach to solving the problem:

Sandeep’s approach to the problem was straightforward: he focused on imputing missing values and on feature generation. He concentrated on one feature in particular, ‘New_Price’, which had a large proportion of missing values in the dataset, and imputed them by comparing the brand, model and price of the same cars. He spent most of his time cleaning the data, believing that clean data would yield more accurate results — and it paid off.

Click here to view the code.

#3: Vasim Shaikh

Having completed his Master’s in Finance, and with no background in computer programming, Vasim found that his passion for data science began out of curiosity. Working with numbers was the only thing connecting him to data, and he enjoyed playing around with it: generating insights, explaining complex patterns through visualizations and predicting numbers. His curiosity urged him to learn more from every available resource, from MOOCs to books.

Vasim is now an expert in analysing, exploring and wrangling tabular data. His skills with data and numbers have taken him to the position of Manager of Data Strategy & Analytics at Willis Towers Watson.

Like the other two winners, Vasim is a regular at MachineHack, having participated in multiple hackathons. Asked about his experience with MachineHack, he responded, “It is an amazing platform, where budding data scientists can showcase their skills and develop and learn new techniques.”

Approach to solving the problem:

Vasim started with feature generation, extracting as much information as possible from the given data. The steps of his approach, as he explained them, are given below.

Pre-Processing steps:

  • Extracted Manufacturer and Brand from Name
  • Extracted variants, wherever possible
  • Converted Year to the age of the car
  • Cleaned the Engine, Mileage and Power variables
  • Applied a log transformation to Engine and Power, which increased the correlation
  • Imputed missing Engine values with the mean
  • Imputed missing Power values with a linear regression prediction (using Engine as the predictor)
  • Created ratio features, e.g. Engine / Power, KM_Driven / Age
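A couple of the steps above — the log transform and the ratio features — can be derived like this (column values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative cleaned values (not actual competition rows)
df = pd.DataFrame({
    "Engine": [998.0, 1497.0],
    "Power": [58.2, 117.3],
    "KM_Driven": [40000, 25000],
    "Age": [5, 2],
})

# Log transform of Engine and Power
df["Log_Engine"] = np.log(df["Engine"])
df["Log_Power"] = np.log(df["Power"])

# Ratio features of the kind described above
df["Engine_per_Power"] = df["Engine"] / df["Power"]
df["KM_per_Year"] = df["KM_Driven"] / df["Age"]
```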

Model:

  • The data split was stratified on the variables Fuel_Type, Transmission and Seats
  • Used 25 bagged XGBoost models, with hyperparameters derived from 5-fold cross-validation
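The shape of this setup — a stratified split followed by bootstrap bagging — can be sketched as below, on synthetic data, stratifying on a single categorical stand-in (Vasim stratified on three) and using scikit-learn's GradientBoostingRegressor in place of XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
fuel = rng.integers(0, 2, n)                  # stand-in for a Fuel_Type code
y = X @ np.array([1.0, -0.5, 0.3, 0.1]) + fuel + rng.normal(0, 0.1, n)
features = np.column_stack([X, fuel])

# Stratify the train/validation split on the categorical variable
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y, test_size=0.25, stratify=fuel, random_state=0
)

# 25 bootstrap-bagged boosted models; average their predictions
models = []
for i in range(25):
    idx = rng.integers(0, len(X_tr), len(X_tr))   # bootstrap resample
    models.append(
        GradientBoostingRegressor(n_estimators=50, random_state=i)
        .fit(X_tr[idx], y_tr[idx])
    )
pred = np.mean([m.predict(X_te) for m in models], axis=0)
```

Bagging many boosted models this way trades extra training time for lower prediction variance.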

Click here to view the code.
