MachineHack concluded The Great Indian Hiring Hackathon on 23 November 2020. For the first time, it collaborated with 12 companies to help data science professionals land rewarding careers. In this hackathon, the MachineHack community was asked to come up with an algorithm to predict the price of retail items belonging to different categories. Run in partnership with Aditya Birla Group, Bridgei2i, Concentrix, Fractal, Genpact, Lowe’s, MiQ, Piramal, Scienaptic, VMware, Wells Fargo, and Zycus, the hackathon drew a whopping 5655 practitioners.
Predicting retail prices can be daunting: the datasets are huge and mix attribute types, from text and numbers (floats, integers) to dates and times. Outliers can also be a big problem when dealing with unit prices. The hackathon therefore asked participants to come up with a solution to forecast retail prices of items in different categories.
With the COVID pandemic shrinking the data science job market, this hackathon was designed to showcase the industry’s talent to potential recruiters. After several stages of evaluation, which included assessing participants on their Root Mean Square Error (RMSE) scores and their leaderboard scores, a number of candidates topped the charts. Here we introduce two of those champions of The Great Indian Hiring Hackathon and describe their approach to solving the problem.
Winner 01: Nilesh Verma
A computer science student, Nilesh Verma is familiar with Python but isn’t a professional in the AI and data science field. Data mining, one of the subjects he studied for his master’s, was his first step towards data science. In March 2020, Nilesh started working on machine learning projects at his university, where he developed six different projects spanning machine learning, deep learning, natural language processing, and computer vision. Some of his projects were even featured in local newspapers and on TV channels.
To solve this problem, Nilesh first loaded the dataset into a pandas data frame and began Exploratory Data Analysis (EDA). He noted some variance in the quantity, unit price and country columns, and pointed out that the values in these columns span a smaller range than those in other columns. He then checked the column data types for date-time values, ran a simple random forest (RF) model, and got a 28% RMSE score.
The aim was a lower RMSE score, so Nilesh turned to feature engineering: he extracted five features from the date-time columns and five statistical features, and rerunning the model gave a 23% RMSE score. Next, he removed the outlier and tried a power transformation of the Unit-Price column, which is highly skewed; this helped reach a 22% RMSE score. Removing duplicate values improved the score by a further 2-3%. Finally, Nilesh worked on the normal range and pulled out some extra data points that lay above it, which again improved the score by 3-4%. With all this, Nilesh managed to get a 16-18% RMSE score.
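The date-time features and the skew correction described above can be sketched roughly as follows. This is a minimal illustration, not Nilesh's code: the article does not say which five calendar features he extracted, so the five below are a guess, and a log transform (log1p) stands in for whichever power transformation he actually used. The toy values are made up.

```python
import numpy as np
import pandas as pd

def add_date_features(df: pd.DataFrame, col: str = "InvoiceDate") -> pd.DataFrame:
    """Return a copy of df with five calendar features derived from `col`.

    The choice of features is an assumption made for illustration.
    """
    out = df.copy()
    dates = pd.to_datetime(out[col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    out["weekday"] = dates.dt.dayofweek  # Monday = 0
    out["hour"] = dates.dt.hour
    return out

# Toy rows; the prices are invented to show a skewed column.
train = pd.DataFrame({
    "InvoiceDate": ["2011-05-06 16:54:00", "2011-12-09 12:50:00"],
    "UnitPrice": [1.25, 649.5],
})
train = add_date_features(train)

# log1p compresses the long right tail of a skewed price column;
# expm1 inverts it once the model has predicted on the log scale.
train["UnitPrice_log"] = np.log1p(train["UnitPrice"])
restored = np.expm1(train["UnitPrice_log"])
```

If a model is trained on the transformed target, its predictions need the inverse transform before RMSE is computed on the original price scale.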
To reduce the score further, Nilesh dug deeper into the data and found some loopholes. While checking the dataset, he realised that many values present in the test rows were absent from the training rows. To address this data-leakage problem, he decided to merge the training and test data. Since the test rows had no labels, he filled those values with zero, in effect overfitting the model with some dummy data. After saving the dataset, Nilesh used an Excel sheet for the data manipulation, which saved time. After all the manipulation, i.e. replacing large values, replacing zero values with the mean, median, etc., he managed to get a 4-6% RMSE score. Changing negative values to positive in the Quantity column improved the RMSE score by a further 2-3%.
Still, the aim was an RMSE score of 1%, so Nilesh started working on the models using Sklearn, adding some popular algorithms like Catboost, Xgboost, etc. He noticed that the two algorithms giving the lowest RMSE on the train data were DecisionTreeRegressor and ExtraTreesRegressor, and built an ensemble of them. Saving the model score across different runs, Nilesh got a 3-4% RMSE score, and on some runs it went as low as 1%.
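One simple way to combine the two tree models is to average their predictions. The sketch below assumes plain averaging, which the article does not confirm; the helper names, the synthetic data and the hyperparameters are all illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

def fit_average_ensemble(X, y, seed=0):
    """Fit both tree models on the same data and return them as a list."""
    models = [
        DecisionTreeRegressor(random_state=seed),
        ExtraTreesRegressor(n_estimators=50, random_state=seed),
    ]
    for model in models:
        model.fit(X, y)
    return models

def predict_average(models, X):
    """Average the predictions of all fitted models (assumed combination rule)."""
    return np.mean([m.predict(X) for m in models], axis=0)

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

models = fit_average_ensemble(X, y)
train_rmse = rmse(y, predict_average(models, X))
```

With default settings both models grow their trees to full depth, so they reproduce the training targets almost exactly. A near-zero RMSE on train data therefore says little about test performance, and a held-out split gives a more honest comparison.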
Winner 02: Harikrishnan V
Coming from a mechanical engineering background, Harikrishnan V has always had an analytical mindset. Curious by nature, he has always been passionate about digging deep and finding information. In his early years, Harikrishnan used to collect and analyse data to answer questions on topics such as IIM CAT scoring trends, the effect of the 2008 recession on placements, the crime scenario in India, etc. Realising that he could do well in a data career, he began his formal preparation in data science six months ago, taking an online course, learning from content on the internet, and practising on self-projects. According to him, the learning process has been intense yet exciting, and practising data science has been a very fulfilling journey.
Despite this being his first data science competition, Harikrishnan managed to top the charts with his brilliant approach. To solve this problem, he started with a thorough exploratory analysis, where he found many patterns within the data. He noticed that most of the rows to predict had a low UnitPrice and very few had exceptionally high values.
There was one extreme outlier and a few other high ones.
On further exploration to understand any pattern in the spread of prices in the data, he recognised a pattern by StockCodes.
He then grouped the train data by the number of unique prices per StockCode and found that all but five StockCodes had at most 11 unique prices. The other 5 StockCodes (3678, 3680, 3679, 3683, 3681) were the ones with maximum uncertainty and the high-value outliers.
InvoiceDate was converted to a float as days elapsed since 2010-1-1.
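A conversion of this kind might look like the sketch below. The timestamp format string is an assumption about how the dates were stored, and the function name is made up for illustration.

```python
from datetime import datetime

EPOCH = datetime(2010, 1, 1)  # reference date named in the article

def days_since_epoch(timestamp: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> float:
    """Convert a timestamp string to fractional days elapsed since 2010-01-01.

    `fmt` is an assumed storage format; adjust it to match the real column.
    """
    delta = datetime.strptime(timestamp, fmt) - EPOCH
    return delta.total_seconds() / 86400.0

half_past_day_one = days_since_epoch("2010-01-02 12:00:00")
```

Keeping the fraction preserves the time of day, so two invoices on the same date still get distinct values.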
For the StockCodes with just one unique price, Harikrishnan simply mapped those values onto the test set, which covered 8084 of the 122049 rows. Next, he tried many models and concluded that XGBRegressor gave the best results on this dataset.
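The grouping by unique prices and the direct mapping for single-price StockCodes can be sketched in a few lines. The function names, StockCodes and prices below are invented for illustration; only the idea comes from the article.

```python
from collections import defaultdict

def unique_prices_by_stock(rows):
    """Map each StockCode to the sorted list of distinct prices seen in train.

    `rows` is an iterable of (stock_code, unit_price) pairs.
    """
    prices = defaultdict(set)
    for stock_code, unit_price in rows:
        prices[stock_code].add(unit_price)
    return {code: sorted(values) for code, values in prices.items()}

def single_price_lookup(price_table):
    """Keep only the StockCodes that have exactly one distinct price."""
    return {code: values[0] for code, values in price_table.items()
            if len(values) == 1}

# Toy train rows: (StockCode, UnitPrice).
train_rows = [(101, 1.25), (101, 1.25), (202, 2.5), (202, 2.95), (303, 0.85)]
price_table = unique_prices_by_stock(train_rows)
lookup = single_price_lookup(price_table)

# None marks test rows that still need a model (or another fallback).
test_codes = [101, 202, 303, 404]
direct = [lookup.get(code) for code in test_codes]
```

Rows answered by the lookup never reach a model, which removes them as a source of prediction error.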
Further, he built nine different models, each with its own best hyperparameter settings, one for each set of StockCodes with a given number of unique prices from 2 up to 11, i.e. (2,3,4,5,6,7,8,9,11). He also kept a table of the unique prices of each StockCode. After getting predictions from these nine separate models, he used a function to snap each prediction to the closest unique UnitPrice for that particular StockCode. This brought the total to 121423 out of 122049 rows done.
Next, there were 91 rows in the test set with a StockCode not present in the train data; for these, he approximated the UnitPrice with the weighted price of the closest StockCode in the train data, bringing the total to 121514 out of 122049 rows done. Now only 535 rows remained to be predicted! On further exploration, he found that StockCodes 3678 and 3680 came from only one customer (14096) and that the prices had a strong correlation with the date.
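The snapping step is easy to picture in code. The function below is a hypothetical stand-in for the one Harikrishnan wrote; it simply returns whichever known price lies nearest to the raw model output.

```python
def snap_to_known_price(prediction, known_prices):
    """Return the entry of `known_prices` closest to `prediction`.

    Ties go to the price that appears first in `known_prices`.
    """
    if not known_prices:
        raise ValueError("known_prices must not be empty")
    return min(known_prices, key=lambda price: abs(price - prediction))

# A StockCode with two known prices (made-up values).
snapped_small = snap_to_known_price(2.7, [2.5, 2.95])

# The four majority prices the article reports for StockCode 3683.
snapped_3683 = snap_to_known_price(21.0, [15, 18, 28, 40])
```

Because the true price is almost always one of the listed values, snapping turns a near miss into an exact hit, which matters under a squared-error metric.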
The vertical lines show the position of the test points to predict.
After this, Harikrishnan created a feature ‘month’ by combining the month number and year number from InvoiceDate as a string. He then ran a model on the combined 3678 and 3680 StockCode points from the train data and predicted the test values. ‘Quantity’ and ‘Date’ were used as numeric features and ‘month’ and ‘StockCode’ as categorical ones. Six rows were done.
Now only 3 StockCodes (3679, 3681, 3683) and 529 rows remained to be predicted, and these were the trickiest and most time-consuming. In his exploratory analysis, he observed that Stocks 3679 and 3683 had extreme outliers and that 3679’s outlier matched as a pair (+1 & -1 quantity from the same customer on the same day) with a test entry for Stock 3681! So there was a possibility of mixing. It was further observed that a lot of high-value transactions occurred in pairs. Although these could be handpicked, Harikrishnan decided to build a model to detect such pairs and predict them accurately.
For this, he built a dataset from the full 3681 data combined with the 3679 and 3683 rows where the UnitPrice z-score exceeded three, and wrote a custom algorithm combined with an XGBRegressor to predict such pair values. New features were then needed to predict the remaining 3681 data and the rows of the other 2 StockCodes. He calculated monthly customer sales for all non-high-value StockCodes by combining the train data with the already-predicted test data, and stored the result in a data frame.
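The pairing pattern (+1 and -1 quantity from the same customer on the same day) can be detected with a simple grouping pass. This sketch covers only the matching step, not the custom algorithm or the XGBRegressor the article mentions; the row layout, key names and data are assumptions.

```python
from collections import defaultdict
from datetime import date

def find_offsetting_pairs(rows):
    """Pair up +1 and -1 quantity rows sharing a customer and a day.

    `rows` is a list of dicts with keys id, customer, day, quantity
    (a hypothetical layout). Returns a list of (plus_id, minus_id) tuples.
    """
    groups = defaultdict(lambda: {1: [], -1: []})
    for row in rows:
        if row["quantity"] in (1, -1):
            key = (row["customer"], row["day"])
            groups[key][row["quantity"]].append(row["id"])
    pairs = []
    for bucket in groups.values():
        # zip stops at the shorter list, so unmatched rows are left out.
        pairs.extend(zip(bucket[1], bucket[-1]))
    return pairs

rows = [
    {"id": 0, "customer": "A", "day": date(2011, 6, 10), "quantity": 1},
    {"id": 1, "customer": "A", "day": date(2011, 6, 10), "quantity": -1},
    {"id": 2, "customer": "B", "day": date(2011, 6, 10), "quantity": 1},
    {"id": 3, "customer": "A", "day": date(2011, 6, 11), "quantity": -1},
]
pairs = find_offsetting_pairs(rows)
```

When one member of a pair sits in the train data and the other in the test data, the known price of the first is a strong hint for the second.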
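Selecting the high-value rows by z-score could look like the sketch below. The article does not say whether the z-score was computed per StockCode or over a combined set, so this version simply scores whatever list of prices it is given; the function name and data are illustrative.

```python
from statistics import mean, pstdev

def high_value_rows(prices, threshold=3.0):
    """Return indices of prices whose z-score exceeds `threshold`.

    The z-score is computed over the given list using the population
    standard deviation (an assumption; the article does not specify).
    """
    mu = mean(prices)
    sigma = pstdev(prices)
    if sigma == 0:
        return []  # all prices identical, so nothing stands out
    return [i for i, p in enumerate(prices) if (p - mu) / sigma > threshold]

# Nineteen ordinary prices and one extreme value (made-up numbers).
prices = [10.0] * 19 + [1000.0]
outliers = high_value_rows(prices)
```

Note that a single extreme value inflates the standard deviation itself, so in very small groups no point can reach a z-score of three.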
Further, more features were created — hour group; month group; weekday group; month start; different country; monthly sales customer; total sales customer; days visited customer; months visited customer; invoice numbers customer; transact numbers customer; average spend per transaction customer; average spend per invoice customer; average spend per day customer, and average spend per month customer. Features of all customers were stored in a data frame.
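Several of the per-customer features listed above reduce to a single grouped aggregation. The sketch below covers a subset of them; the column names CustomerID and InvoiceNo are assumptions, since the article does not give the exact schema, and the toy rows are invented.

```python
import pandas as pd

def customer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build one row of spending features per customer.

    Expects columns CustomerID, InvoiceNo, InvoiceDate, Quantity, UnitPrice
    (CustomerID and InvoiceNo are assumed names).
    """
    df = df.copy()
    dates = pd.to_datetime(df["InvoiceDate"])
    df["Sales"] = df["Quantity"] * df["UnitPrice"]
    df["Day"] = dates.dt.strftime("%Y-%m-%d")
    df["Month"] = dates.dt.strftime("%Y-%m")
    feats = df.groupby("CustomerID").agg(
        total_sales=("Sales", "sum"),
        days_visited=("Day", "nunique"),
        months_visited=("Month", "nunique"),
        invoice_count=("InvoiceNo", "nunique"),
        transaction_count=("Sales", "size"),
    )
    feats["avg_spend_per_transaction"] = feats["total_sales"] / feats["transaction_count"]
    feats["avg_spend_per_invoice"] = feats["total_sales"] / feats["invoice_count"]
    feats["avg_spend_per_day"] = feats["total_sales"] / feats["days_visited"]
    feats["avg_spend_per_month"] = feats["total_sales"] / feats["months_visited"]
    return feats

toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": [10, 10, 11, 12],
    "InvoiceDate": ["2011-01-05 10:00:00", "2011-01-05 10:00:00",
                    "2011-02-07 09:30:00", "2011-03-01 14:00:00"],
    "Quantity": [2, 1, 4, -1],
    "UnitPrice": [5.0, 10.0, 2.5, 8.0],
})
feats = customer_features(toy)
```

The resulting frame is indexed by customer and can be joined back onto the transaction rows before modelling.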
For StockCode 3679, there was one extreme outlier, so that train row was removed, modelling was done on the remaining 3679 data, and the test values were predicted. A total of 121548 out of 122049 rows were done. For StockCode 3683, the majority of prices were 15, 18, 28 and 40. Countries 13 and 14 were predominant and mostly had price 18. Here too there was one extreme outlier, which again had an identical pair in the test data within StockCode 3683 itself; it was removed and modelling was done on the data. Predicted values in the range of 12 to 45 were snapped to the closest among [15,18,28,40]. Rows with high quantity and predictions less than 0 were approximated with the median price from the train data. Rows of Countries 31 & 32 were capped at their maximum value (40) from the train data, for a total of 121904 out of 122049 rows done. Only 120 rows remained to be predicted in StockCode 3681. These were the most unpredictable rows, with significant high-value entries.
On exploring the data, he found that the one extreme outlier, close to 40000, didn’t have a pair in the test data, so it was removed. The train data used here to capture maximum trends was all rows of categories >=3678 in the train data, combined with the high 3681 pair values predicted in the test set with the earlier model. The best model was run, and the values for the remaining 3681 Stock rows were predicted. High variance in prices occurred only for Quantities -2, -1 and 1; other high-quantity (+ve and -ve) rows with predictions less than 0 were approximated with the corresponding median values from the train data of Stock 3681.
Further, a few customers were identified as having few transactions and low quantities, and were judged to be low-value spenders in this category. Their transactions were approximated with low values in two groups, 1 and 5. 3681 Stock customers with no negative quantity and a high overall quantity, with at least one Quantity=1 entry in the test set, would have a relation with their high-quantity spends. Their one-quantity transactions were predicted with their average high-quantity transaction value using custom code. Customers who ended their dealings with the store with a high proportion of final transactions in StockCode 3681 tended to have their maximum earlier monthly sales value as the UnitPrice in this transaction. Such transactions were predicted accordingly with custom code. Customers with a high negative value for total sales would have approximately that value as the UnitPrice in their one-quantity 3681 entry, to even out the cash flow. Such customers’ transactions were predicted accordingly. For pairs like those mentioned above that had both members in the test set, the predicted values from his model were averaged and assigned.
In this hackathon, RMSE was the metric, which is why Harikrishnan wanted to predict as many points as possible in the test set. He further sought out-of-the-box ideas to use the available resources, namely the daily submissions, to improve his score, and realised that only a few dozen rows in the entire test set of 122049 rows were of particular interest. These included three rows from customers who had no transactions in the train data, a pair transaction (+1 & -1 quantity) with both rows in the test set, customers with high sales values, customers with a high occurrence in 3681 Stock in the test set, etc. All these identified rows amounted to only a few dozen. It was given that the public leaderboard scoring was done on 70% of the test data.
He therefore made a submission in which the value of one interesting point in the 70% data was changed, which in turn changed the RMSE score. From this, he could work out the value of the point by calculating the difference in the sum of squared errors using the RMSE equation. He wrote a function to do this calculation and tried to predict the values of a few more shortlisted points. Harikrishnan also shared his script, without the last block of code for this hack.
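The arithmetic behind this trick follows directly from the definition of RMSE. The sketch below is a reconstruction under stated assumptions, not Harikrishnan's function: it assumes the two submissions differ at exactly one row, that the row falls in the public portion, and that the number of public rows is known (about 70% of 122049 per the article; the exact count is an assumption). The helper names and toy numbers are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def recover_target(rmse_before, rmse_after, pred_before, pred_after, n_public):
    """Infer the true value y of one row from two leaderboard scores.

    With SSE = n * RMSE**2, changing one prediction from p1 to p2 gives
        SSE2 - SSE1 = (p2 - y)**2 - (p1 - y)**2 = (p2 - p1) * (p1 + p2 - 2*y)
    which rearranges to
        y = (p1 + p2) / 2 - (SSE2 - SSE1) / (2 * (p2 - p1)).
    """
    if pred_before == pred_after:
        raise ValueError("the two submissions must differ at the probed row")
    sse_delta = n_public * (rmse_after ** 2 - rmse_before ** 2)
    return (pred_before + pred_after) / 2 - sse_delta / (2 * (pred_after - pred_before))

# Simulated leaderboard: the true values are hidden from the participant.
truth = [3.0, 5.0, 250.0, 1.5]
base = [2.5, 5.5, 10.0, 1.0]
probe = list(base)
probe[2] = 100.0  # change a single row between the two submissions

estimate = recover_target(rmse(truth, base), rmse(truth, probe),
                          base[2], probe[2], len(truth))
```

In practice the accuracy of the recovered value depends on how many decimal places the leaderboard reports, because the score difference is multiplied by the number of public rows.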