With the successful conclusion of yet another MachineHack hackathon — the Buyer's Time Prediction Challenge — on the 4th of January, 2020, this article shares the approaches and solutions of the participants who made it to the top of the leaderboard.
Centred around the massive transformation in consumer behaviour and buying patterns, the Buyer's Time Prediction Challenge asked participants to develop a machine learning model that predicts the time a buyer spends on an eCommerce platform. This fortnight-long hackathon saw participation from more than 700 data scientists and machine learning practitioners who worked tirelessly on innovative solutions to the problem statement.
After a careful evaluation using the RMSLE (Root Mean Squared Logarithmic Error) metric, along with the private and public leaderboard scores, MachineHack selected the following top three winners:
Winner 01: Prathik Shirolkar
Taking Ankit Fadia, the renowned white hat computer hacker, as a role model, and with an interest in networking and system security, Prathik Shirolkar started his data science journey through a series of internships during his college days. He began with a data science internship at Stride.ai, where he picked up web scraping, followed by a stint at the engineering company Robert Bosch, where he worked on image processing for its automated cars division. After that, Prathik worked full-time as a data scientist at Electronics For Imaging Inc (EFI), where he was finally exposed to predictive analytics.
During his tenure at the computation and data sciences department at the Indian Institute of Science (IISc), he worked on a distributed computing-based research problem. His work at the non-profit organisation Pratham Books gave him experience in segmentation, clustering, and improving recommendation engines.
To solve MachineHack's Buyer's Time Prediction Challenge, Prathik leaned heavily on feature engineering. "My approach to solving the challenge was very simple — concentrate a lot on feature engineering and spend very little time on modelling." His feature engineering included manually classifying client agents into handheld devices and desktops, and extracting the browser version.
Further, he created one combined feature from 'purchased', 'added_in_cart' and 'checked_out', giving each a numeric weight. He also calculated the average daily traffic and derived other features from the client_agent column, using a TfidfVectorizer at both the character and word level.
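The character- and word-level TF-IDF features described above can be sketched as follows. This is a minimal illustration, not the winner's actual code: the sample strings and n-gram ranges are assumptions.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the hackathon data; the example strings are assumptions.
df = pd.DataFrame({"client_agent": [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0) Safari/604.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/79.0",
]})

# Word-level and character n-gram TF-IDF, as described in the write-up.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

word_feats = word_vec.fit_transform(df["client_agent"])
char_feats = char_vec.fit_transform(df["client_agent"])

# Stack both sparse matrices side by side to feed the model.
agent_feats = hstack([word_feats, char_feats])
```

Keeping the matrices sparse matters here, since character n-grams on user-agent strings can produce a very wide feature space.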
One important consideration while building the model was to design a cross-validation strategy that matched the proportions of the actual train and test data provided. With that in place, Prathik built three machine learning models using CatBoost, XGBoost and Light Gradient Boosting Machine (LightGBM). LightGBM performed the best, predicting all the values within a narrow range, while XGBoost performed the worst.
Thus, to incorporate the variance contributed by XGBoost in a controlled way, Prathik took a weighted average of the three models, evaluating multiple weight permutations against the combined cross-validation: 50% weightage to LightGBM, 45% to CatBoost, and 5% to XGBoost. "I did try stacking these models, but a simple weighted average proved to be better for me," added Prathik.
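A weighted-average blend with the reported weights can be sketched in a few lines. The prediction arrays below are hypothetical placeholders for the three models' outputs.

```python
import numpy as np

# Hypothetical predictions from the three models on the same rows.
pred_lgbm = np.array([3.1, 4.0, 2.5])
pred_cat = np.array([3.0, 4.2, 2.7])
pred_xgb = np.array([2.8, 4.5, 2.2])

# Blend with the weights reported above (they sum to 1.0).
weights = {"lgbm": 0.50, "cat": 0.45, "xgb": 0.05}
blend = (weights["lgbm"] * pred_lgbm
         + weights["cat"] * pred_cat
         + weights["xgb"] * pred_xgb)
```

In practice such weights would be chosen by scoring each permutation on out-of-fold predictions, as the write-up describes.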
While speaking about his MachineHack experience, Prathik said, “As a beginner, I found MachineHack to have the least clutter in its UI/UX experience, which made me the least anxious.”
"The problem statement, although simple, has its own challenges, but it doesn't require special domain skills to solve," concluded Prathik.
Winner 02: Najeed Osmani
Telangana-based mechanical engineering student Mohommad Najeed Osmani is the second winner of the Buyer's Time Prediction Challenge. Having had a keen interest in computer science since childhood, Najeed relied on MOOCs, books and online blogs by statisticians and ML developers to learn data science.
Najeed has also been mentored by a data science expert who helped him get an initial grip on this complex subject. "I thought data science would be a cakewalk and could easily get into this by following my mentor's guidelines, but later I got to know how deep it is," said Najeed.
To solve this challenge, Najeed first removed the outliers and, since many categorical variables were involved, opted for tree-based algorithms. He then chose transformations for the continuous variables. He firmly believes that feature engineering played the key role in getting a strong baseline score. "Columns like client_agent and device details pretty much indicated the same," added Najeed.
Once feature engineering was done, he label-encoded all the categorical columns. With the help of mean encoding, Najeed generated more features that helped him produce a good score. He used the CatBoost algorithm to train the model while keeping the learning rate very low. Transforming the 'time_spent' target with 'log1p' also made training easier for him, since minimising ordinary squared error on the log-transformed target corresponds to minimising RMSLE on the original scale. The predictions were then exponentiated to map them back to the original scale before submission.
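The log1p/expm1 round trip can be sketched as below. To keep the example self-contained it uses scikit-learn's GradientBoostingRegressor rather than CatBoost (which the winner actually used), and the data is synthetic; the low learning rate mirrors the approach described.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 3))
# Hypothetical skewed target standing in for time_spent.
y = np.exp(rng.normal(2.0, 1.0, size=100))

# Train on log1p(time_spent): RMSE on this scale is RMSLE on the original.
model = GradientBoostingRegressor(learning_rate=0.05, random_state=0)
model.fit(X, np.log1p(y))

# Map predictions back to the original scale with expm1, the inverse of log1p.
preds = np.expm1(model.predict(X))
```

Using `expm1` rather than a bare exponential correctly undoes the `+1` inside `log1p`, so near-zero targets round-trip cleanly.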
This is the third MachineHack competition Najeed has participated in, and with each challenge, he gathered immense knowledge and experience in solving complex problems. “With MachineHack’s recognition, I hope to get some boost to succeed further in this domain.”
Winner 03: Deep Contractor
Our third winner, Deep Contractor, is a final-year computer science and engineering student currently working as a data science intern at Celebal Technologies, Jaipur. Essentially self-taught through online courses, YouTube tutorials and Medium articles, he began his data science journey in his early college days with a data wrangling task. "My data science journey started a year ago when one of my college professors gave me a data wrangling task. This got me interested in data science," said Deep. He also participates in hackathons to learn new skills and make new connections, especially with experts in the domain.
To solve this MachineHack problem statement, Deep started with some basic exploratory data analysis (EDA) to study the data. From the 'device_details' column he engineered three new columns: medium_used, device_used and os_used. He then used the 'date' column to engineer six new columns based on day, year, name of the day of the week, is_weekend, quarter and month.
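The feature engineering above can be sketched with pandas. The format of 'device_details' is an assumption here (medium, device and OS separated by slashes), and the sample rows are invented for illustration.

```python
import pandas as pd

# Toy rows; the real device_details format is an assumption.
df = pd.DataFrame({
    "device_details": ["app/mobile/android", "web/desktop/windows"],
    "date": ["2019-12-25", "2019-12-28"],
})

# Split device_details into the three new columns described above.
df[["medium_used", "device_used", "os_used"]] = (
    df["device_details"].str.split("/", expand=True)
)

# Engineer the six date-based features.
dt = pd.to_datetime(df["date"])
df["day"] = dt.dt.day
df["year"] = dt.dt.year
df["week"] = dt.dt.day_name()            # name of the day of the week
df["is_weekend"] = dt.dt.dayofweek >= 5  # Saturday/Sunday flag
df["quarter"] = dt.dt.quarter
df["month"] = dt.dt.month
```

Flags like `is_weekend` often matter in traffic data, since browsing patterns differ sharply between weekdays and weekends.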
Once feature engineering was done, Deep merged the test and training data sets to perform label encoding on columns like device_used, os_used, week, and year. Before building the model, he dropped columns like session_id, device_details, client_agent and date. While testing popular ML algorithms such as DecisionTreeRegressor and Random Forest, he got his best score — 2.003 — with the Decision Tree Regressor. He then tried Support Vector Regression (SVR), which reduced his score to 1.8 RMSLE on the leaderboard, and finally improved it to 1.76 by fine-tuning some of the SVR's parameters.
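Fitting the label encoder on the merged train and test data, as described above, ensures categories that appear only in the test set still get a code. A minimal sketch with hypothetical data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test frames sharing a categorical column.
train = pd.DataFrame({"os_used": ["android", "windows", "ios"]})
test = pd.DataFrame({"os_used": ["windows", "linux"]})

# Fit on the concatenation so test-only categories (here "linux")
# don't raise an error at transform time.
le = LabelEncoder()
le.fit(pd.concat([train["os_used"], test["os_used"]]))

train["os_used"] = le.transform(train["os_used"])
test["os_used"] = le.transform(test["os_used"])
```

Note that combining train and test this way is a common hackathon shortcut; in a production pipeline the encoder would be fitted on training data only, with an explicit policy for unseen categories.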