MachineHack organised a hackathon – Workation Price Prediction Challenge – from 26 March to 12 April 2021. The results were announced on 15 April 2021. In this article, we learn about the approaches and solutions of the top three winners.
The challenge was to build an ML model to predict the best workation packages optimised for price. The hackathon saw around 450 data scientists and machine learning practitioners from across geographies participate.
#1 | Panshul Pant
Panshul is an engineering student pursuing Computer Science. With a penchant for mathematics and problem-solving, it was only a matter of time before he stumbled on data science. He discovered the field during the pandemic and was immediately hooked, excited by the immense potential of machine learning and deep learning in shaping our future. Panshul spent a lot of time reading about data science and watching YouTube videos to dive deep into the subject; courses on platforms like Coursera also helped. He later realised that online courses and tutorials are not enough: hands-on experience is key to building expertise. He started looking for ways to improve his skills, with an emphasis on tackling real-world problems, and came across sites like Kaggle, MachineHack and HackerEarth, which offer real-world challenges for data enthusiasts to solve. Initially, he got stuck at various stages while solving problems, but he improved his skills over time. He is happy that his efforts paid off in his second MachineHack challenge.
Approach to solve the problem
The dataset's columns mainly consisted of text-based data, so Panshul engineered features after studying the text-based fields. He extracted the total number of destinations covered in each trip from the ‘Destination’ column and the total number of places seen in each trip from the ‘Sightseeing Places Covered’ column. From the ‘Hotel Details’ column, which lists the ratings of the hotels visited, he built a feature capturing the average rating of all hotels in a trip. From the ‘Itinerary’ column, he extracted the total number of days covered in each trip.
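Assuming pipe-separated values and ‘Name:rating’ pairs (the real dataset's delimiters and formats may differ), this kind of text-to-number feature engineering can be sketched in pandas:

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's text columns (column names from the article).
df = pd.DataFrame({
    "Destination": ["Delhi|Agra|Jaipur", "Goa"],
    "Sightseeing Places Covered": ["Taj Mahal|Amber Fort|Hawa Mahal", "Baga Beach"],
    "Hotel Details": ["Hotel A:4.2|Hotel B:3.8", "Resort C:4.5"],
    "Itinerary": ["1N Delhi . 2N Agra . 1N Jaipur", "3N Goa"],
})

# Count of destinations and sightseeing places per trip.
df["num_destinations"] = df["Destination"].str.split("|").str.len()
df["num_sights"] = df["Sightseeing Places Covered"].str.split("|").str.len()

# Average hotel rating, parsed from assumed "Name:rating" pairs.
def avg_rating(details):
    ratings = [float(part.rsplit(":", 1)[1]) for part in details.split("|")]
    return sum(ratings) / len(ratings)

df["avg_hotel_rating"] = df["Hotel Details"].map(avg_rating)

# Total nights, summed from the "<n>N <place>" tokens in the itinerary.
df["total_nights"] = df["Itinerary"].str.findall(r"(\d+)N").map(
    lambda nights: sum(int(n) for n in nights)
)
```

Each derived column is a plain numeric feature that tree-based models can consume directly.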
Panshul tried both the TF-IDF vectorizer and CountVectorizer to transform most of the text-based features in the dataset and found that CountVectorizer gave better results. Columns like ‘Package Type’ and ‘Start City’ were ordinally encoded. Dropping the ‘Travel Date’ feature proved a better choice, and ‘Places Covered’ was removed as it duplicated ‘Destination’. The target variable ‘Per Person Price’ was right-skewed, so he applied a log transformation during training and, after prediction, used the exponential function to revert it to the original scale. This step further reduced the RMSLE.
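On a toy dataset (the package texts and prices below are made up), the CountVectorizer-plus-log-transform recipe looks like this; a simple linear model stands in for the tree ensembles actually used:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# Made-up package descriptions and a right-skewed price target.
texts = [
    "beach resort goa",
    "palace tour jaipur agra",
    "beach shack goa",
    "palace fort jaipur",
]
prices = np.array([12000.0, 30000.0, 9000.0, 27000.0])

# Bag-of-words features (CountVectorizer outperformed TF-IDF here, per the article).
X = CountVectorizer().fit_transform(texts).toarray()

# Train on log1p(price); invert with expm1 at prediction time.
model = LinearRegression().fit(X, np.log1p(prices))
preds = np.expm1(model.predict(X))
```

Training on `log1p` of the target and inverting with `expm1` keeps the model's squared-error objective aligned with the RMSLE metric.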
For model training, he tried different tree-based models such as LightGBM, Random Forest, CatBoost and XGBoost. LightGBM and CatBoost performed better than the others, so he manually tuned their hyperparameters and stacked the two in an ensemble to arrive at his best model.
“MachineHack has been really good. In my opinion, three submissions per day are also fine as it would force the participants to rethink their approach and build a proper validation scheme to come up with a more generalised model and reduce overfitting. It’s the real-world problems that platforms like MachineHack and few others provide which help data science enthusiasts to improve and sharpen their skills. Fellow participants are also very competitive which makes it more exciting and challenging. I am thankful to the team for organising this competition and look forward to participating in future as well.” — Panshul Pant
#2 | Eric Vos
Eric is not a data scientist by training. He learned industrial IT and robotics 30 years ago and studied the basics of traditional AI as part of his coursework. A few years ago, his interest was piqued by modern ML techniques like neural networks and deep learning, and he signed up for MOOCs by Andrew Ng, Geoffrey Hinton and others. To practice and improve his skills, Eric participates in various data science competitions and hackathons across the world.
Approach to solve the problem
He started analysing the challenge with regular EDA (exploratory data analysis) and focused on systematic encoding of the categorical features using custom encoders. A feature engineering layer was added to extract additional valuable features, such as trip duration and average hotel rating. At this stage, almost 2,700 features had been created. An Auto ViML CatBoost model was used to quickly evaluate the effectiveness of the created features and to perform additional automated feature engineering. The best result came from an “Ensemble_Stacked” MLJAR model trained in complete mode.
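The article doesn't specify which custom encoders Eric used; frequency encoding is one common hand-rolled scheme for categorical features, sketched here on a made-up column:

```python
import pandas as pd

# Hypothetical categorical column; values and counts are invented for illustration.
df = pd.DataFrame({"Start City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Mumbai"]})

# Frequency encoding: replace each category with its relative frequency.
freq = df["Start City"].value_counts(normalize=True)
df["start_city_freq"] = df["Start City"].map(freq)
```

Unlike one-hot encoding, this produces a single numeric column per categorical feature, which keeps the feature count manageable for high-cardinality columns.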
“I participated in several MH hackathons and learned a lot from published solutions from top Machine Hackers. It’s a great place to improve my machine learning skills and play with different original datasets.” — Eric Vos
#3 | Rupesh Prasad
Rupesh completed his Master’s in mathematics from IIT Madras in 2011 and then started working as a predictive modeller in the BFSI domain, where he has developed and deployed various statistical and machine learning models, such as a pure premium model, fraud detection models, attrition models and next-best-offer optimisation. He pursued data science as a career as it is a good mixture of mathematics, statistics, coding and domain knowledge. Rupesh picked up his data science skills mainly through open-source courses and by reading articles on platforms like Analytics India Magazine, Kaggle etc.
Approach to solve the problem
- Exploratory Data Analysis (EDA)
- He used EDA to understand the data in detail, which helped him implement different feature engineering techniques.
- Feature Engineering
- Create separate columns for the first and last destination.
- Calculate the distance between the origin and the first destination, and similarly between the origin and the last destination. This required some cleaning of the data as well; for example, ‘Tiruchirapalli’ was converted to ‘Tiruchi’.
- The rating for each hotel (wherever possible) was extracted from the ‘Hotel Details’ field.
- Various date-related variables were also derived using the DatetimeIndex attributes of the pandas library.
- Variables like the number of places covered, the number of nights during the trip and the number of sightseeing places were also calculated.
- Finally, the following variables were combined to create a new variable ‘info’: ‘Package Name’, ‘Destination’, ‘Itinerary’, ‘Places Covered’, ‘Hotel Details’, ‘Sightseeing Places Covered’, ‘Cancellation Rules’ and ‘Airline’. To extract statistical features from ‘info’, he tokenised it using CountVectorizer (unigrams); he tried bigrams and TF-IDF without much success. He also capped the maximum number of features at 1,100.
- One-hot encoded the categorical features.
- Feature Extraction
- At this point, he had more than 1,200 variables to experiment with. Using univariate linear regression tests, he selected the top 900 variables.
- Model development:
- Finally, three different models (LightGBM, CatBoost and XGBoost) were developed with ten-fold cross-validation. Various combinations of hyperparameters were tried to arrive at the final set of parameters. All three performed well, with LightGBM being the best.
The best performance was obtained by taking their weighted average.
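The weighted-average blend can be sketched as a small grid search over convex weights on held-out predictions (the predictions, target values and model names below are made up for illustration):

```python
import numpy as np

# Hypothetical validation target and held-out predictions from three tuned models.
y_val = np.array([100.0, 200.0, 300.0, 400.0])
preds = {
    "lgbm": np.array([98.0, 205.0, 295.0, 410.0]),
    "cat": np.array([110.0, 190.0, 310.0, 390.0]),
    "xgb": np.array([105.0, 210.0, 290.0, 420.0]),
}

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, the competition metric."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Coarse grid search over convex weights for the three-model blend.
best_weights, best_score = None, np.inf
grid = np.arange(0.0, 1.01, 0.1)
for w1 in grid:
    for w2 in grid:
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue  # skip weight combinations that don't sum to 1
        w3 = max(w3, 0.0)
        blend = w1 * preds["lgbm"] + w2 * preds["cat"] + w3 * preds["xgb"]
        score = rmsle(y_val, blend)
        if score < best_score:
            best_weights, best_score = (w1, w2, w3), score
```

Because the grid includes the corner weights (1, 0, 0), the blend can never score worse on validation than the best single model.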
“MachineHack is an awesome platform for both freshers as well as experienced professionals. The hackathons organised during these tough times are great for both learning and competing. The articles published in Analytics India Magazine are quite informative, easy to follow and must be followed by all data science professionals. Hoping to crack more hackathons in the upcoming months with MachineHack.” — Rupesh Prasad
Krishna is currently working as an Associate Director at ADaSci. He has 6+ years of experience in research & development, taking cutting-edge engineering products from idea to deployment. He has expertise in building deep learning computer vision applications using both hardware and software solutions across several domains. His interests lie in distributed learning and Edge AI.