MachineHack is not just a platform for data science enthusiasts to hone and practice their skills. It can also help organizations mine the right talent in an ever-expanding domain.
Data science is one of the most in-demand skills in the IT industry today. With the hype around AI and machine learning, the pool of enthusiasts is expanding exponentially, making it extremely hard for organizations to find a perfect fit for their job requirements. MachineHack, backed by its parent company Analytics India Magazine, has been continuously helping the machine learning and data science community grow by conducting exciting hackathons and challenging aspirants.
Recently, we collaborated with an IT firm to help them find the best talent in the domain. The Patient Drug-Switch Prediction Hackathon was both thrilling and challenging for the community. Despite being one of our shortest challenges, running for just 13 days, the hackathon proved a great success, with over 700 registrations and more than 60 active participants.
Out of the 66 participants, Tirthankar Das, Nikhil Kumar Mishra, and Amey More ended up as the top competitors and presented their work at MLDS 2020.
Analytics India Magazine introduces you to the winners and their approaches to the solution.
Tirthankar Das
Tirthankar started his data science career in the Banking, Financial Services and Insurance (BFSI) domain, when data science was still in its infancy. Currently working in the aviation industry, he has more than five years of experience working with data. Tirthankar has always been inclined towards statistics, and data science has helped him pursue his love for the subject. He believes that a good data scientist should be domain-agnostic.
Approach To Solving The Problem
Tirthankar explains his approach as follows:
The objective of the Patient Drug-Switch Prediction Hackathon was to identify the patients who are likely to switch products in the RA (Rheumatoid Arthritis) therapeutic market. The solution was evaluated in three rounds. The first round ranked solutions on AUC. The second round was based on time and memory complexity, which is especially relevant in a practical, deployment-oriented scenario and is a point most hackathons ignore; apart from the MacBook prize, this second round is what made the hackathon interesting. In the final round, the solutions were presented at MLDS in a room full of data science enthusiasts.
During the hackathon, we were given patients' drug-purchase transaction data for both the train and test sets, and had to engineer features from it. The data had six columns: Patient_id, Time, Event, Specialty, Plan_Type and Payment. Event, Specialty and Plan_Type are different sublevels representing the drug. I always try to follow the traditional approach to model building, which has three building blocks.
There were three broad types of features: Recency, Frequency and NormChange. Recency is how recently an Event/Plan_Type/Specialty occurred before the anchor date; Frequency is how many times an Event/Plan_Type/Specialty occurred in a specific time frame; and NormChange is whether the frequency of an Event/Plan_Type/Specialty increased or decreased in a recent time frame (not more than 1.5 years) compared to the previous time frame.
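The three feature families can be sketched on a toy transaction frame. The column names follow the schema described above, but the exact semantics of Time (assumed here to be days before the anchor date) and the window lengths are assumptions, not the winner's exact code:

```python
import pandas as pd

# Toy transaction data; Time is assumed to be days before the anchor date.
df = pd.DataFrame({
    "Patient_id": [1, 1, 1, 2, 2],
    "Time":       [10, 400, 700, 5, 30],
    "Event":      ["A", "A", "B", "A", "B"],
})

RECENT_WINDOW = 548  # roughly 1.5 years, per the write-up's cap

grouped = df.groupby(["Patient_id", "Event"])["Time"]

# Recency: smallest Time = most recent occurrence before the anchor date.
recency = grouped.min().unstack()

# Frequency: number of occurrences inside the recent window.
freq = (df[df["Time"] <= RECENT_WINDOW]
        .groupby(["Patient_id", "Event"]).size().unstack(fill_value=0))

# NormChange: recent-window frequency minus previous-window frequency.
prev = (df[(df["Time"] > RECENT_WINDOW) & (df["Time"] <= 2 * RECENT_WINDOW)]
        .groupby(["Patient_id", "Event"]).size().unstack(fill_value=0))
norm_change = freq.sub(prev, fill_value=0)
```

The same groupby-and-unstack pattern extends to Plan_Type and Specialty by swapping the grouping column.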
Around 40k variables were generated with the logic mentioned above. As I mentioned earlier, this hackathon was not only about the accuracy of the model; it was also about tackling time and memory complexity. Creating 40k features sequentially would take a lot of time on a machine with 16 GB of RAM and four cores (the given specification). To reduce the time taken, I introduced parallel processing for calculating the Frequency and NormChange features, while I used the default apply function for calculating the Recency features.
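One way to parallelise per-event feature computation is to map a feature function over the list of events with a worker pool. This is a hedged sketch, not the winner's code: the helper and data are hypothetical, and a thread pool (multiprocessing.dummy) is used so the snippet stays portable; swapping in multiprocessing.Pool gives true multi-core parallelism:

```python
from multiprocessing.dummy import Pool  # thread-based; multiprocessing.Pool for multi-core

import pandas as pd

# Toy transaction data (column names from the hackathon schema).
df = pd.DataFrame({
    "Patient_id": [1, 1, 2, 2],
    "Event":      ["A", "B", "A", "A"],
})

def frequency_feature(event):
    # Frequency of one event per patient (hypothetical helper).
    return (df[df["Event"] == event]
            .groupby("Patient_id").size().rename(f"freq_{event}"))

events = df["Event"].unique()
with Pool(4) as pool:  # compute the per-event features in parallel
    parts = pool.map(frequency_feature, events)

features = pd.concat(parts, axis=1).fillna(0)
```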
Once all the features were created, I performed two preprocessing steps before feeding them into the LightGBM model: imputing missing values with 9999999 and removing degenerate (single-valued) variables, which helped reduce the number of features. The remaining features were then fed into a LightGBM model. I did not do much tuning at this stage, as it was just a feature-selection step. I used feature importance to select the final set of features, with the straightforward logic of keeping every feature whose importance was greater than zero. That left about 7k features for the final model. The Recency features turned out to be the most important, ahead of NormChange and Frequency.
LightGBM has always been my favourite algorithm when memory and time are constrained. For the final model, I used stratified k-fold cross-validation with K = 30, and the final prediction was the average of those fold models. Tuning hyperparameters such as the learning rate, feature_fraction and num_leaves had an important impact on the result.
“MachineHack is a great platform where data science enthusiasts can experiment with different algorithms and look at their performances. I have been an active participant in MachineHack hackathons for the last seven to eight months. I have learned many ML techniques from fellow participants like Chetan Ambi, Rajat Rajan, and Saurabh Kumar. MachineHack’s initiative to share winners’ solutions with proper documentation helps many data scientists like myself,” Tirthankar shared about his MachineHack experience.
Get the complete code here.
Nikhil Kumar Mishra
Nikhil is currently a final-year Computer Science Engineering student at PESIT South Campus, Bangalore. He started his data science journey in his second year after being inspired by a YouTube video on self-driving cars. The technology intrigued him and drew him into the world of ML. He started with Andrew Ng’s famous course and applied his knowledge in the hackathons he participated in.
Kaggle’s Microsoft Malware Prediction hackathon – in which he finished 25th – was a turning point in his data science journey, which gave him the confidence to take it further and challenge himself with more hackathons on other platforms like MachineHack and Analytics Vidhya.
Approach To Solving The Problem
Nikhil explains his approach as follows:
“This was a very challenging competition. Feature engineering was the key component, just like in any other competition, along with a wide selection of models,” Nikhil said about the problem.
At first, it seemed that simple aggregations, such as the mean of the numerical data and the size and number of unique values of the categorical data for each patient id, would be enough. But my AUC was stuck well below 80 while other competitors could easily get above it; creating very deep features along those lines got me to a maximum AUC of 77. Then I realized I needed to use the features described in the feature-generation part of the problem. Recency was the feature with magic in it; events like event_439 and event_449 in particular gave the model a boost that took it easily above 85.
Recency-based features sparked ideas for much more creative feature engineering. Recency captures the latest time something occurred, such as an event, or a patient going for some plan_type or speciality. So why not create other time-based features like:
1. The first time something happened (Oldest_Time)
2. The number of unique times something happened
Note: all features were calculated for each event separately, not for all events at once; aggregating over all events, plan_types or specialities would lose a lot of information. A helpful way to think about it: the occurrence of event_1 could lead to the patient switching drugs, while the occurrence of event_2 could make him stay on the same drug. So all the events, plan_types and specialities should be treated separately. Verifying which of these were important could easily be done with EDA.
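The two extra time-based features above can be sketched per event with the same groupby pattern. Column names and data are toy stand-ins, and Time is assumed to be days before the anchor date (so the first occurrence is the largest value):

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_id": [1, 1, 1, 2],
    "Time":       [10, 10, 300, 25],
    "Event":      ["event_1", "event_1", "event_1", "event_2"],
})

g = df.groupby(["Patient_id", "Event"])["Time"]

# 1. Oldest_Time: the first time the event happened (assuming Time
#    counts days before the anchor date, that is the maximum).
oldest_time = g.max().unstack()

# 2. The number of unique times the event happened.
n_unique_time = g.nunique().unstack()
```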
Payment Based Aggregations
One can calculate the mean, max and total payment for each of the events, plan_types and specialities separately.
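In pandas, these per-event payment aggregations can be computed in one groupby, for example (toy data, hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_id": [1, 1, 2],
    "Event":      ["event_1", "event_1", "event_2"],
    "Payment":    [100.0, 50.0, 75.0],
})

# Mean, max and total payment per patient for each event, widened so
# every (statistic, event) pair becomes one feature column.
pay = (df.groupby(["Patient_id", "Event"])["Payment"]
         .agg(["mean", "max", "sum"])
         .unstack())
```

Repeating the same call with plan_type or speciality as the grouping column yields the other two feature sets.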
Time and Money
Recency and the other time-based features were calculated on categorical columns, while payment was numerical, so I discretized the payment into bins and calculated time-based features on those. This helped me a lot, and I reached 88 AUC.
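A minimal sketch of that idea, assuming quantile bins and a recency-style feature per bin (the bin count and column names are illustrative, not the competitor's exact choices):

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_id": [1, 1, 2, 2],
    "Time":       [10, 200, 40, 5],
    "Payment":    [20.0, 500.0, 80.0, 900.0],
})

# Discretize the numeric payment into quantile bins...
df["Payment_bin"] = pd.qcut(df["Payment"], q=2, labels=["low", "high"])

# ...then treat the bin like any other categorical column and compute
# a recency-style feature (most recent Time) per payment bin.
recency_by_bin = (df.groupby(["Patient_id", "Payment_bin"], observed=True)["Time"]
                    .min().unstack())
```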
Model Selection and Ensembling
I used an ensemble of two XGBoost and two LightGBM models. LightGBM was used with far fewer features because it was not memory-efficient here, and RAM usage easily spiked above the 16 GB memory limit on Kaggle.
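The write-up does not say how the four models were combined, so as an assumption this sketch shows the simplest option, an unweighted average of their predicted probabilities (the arrays are hypothetical):

```python
import numpy as np

# Hypothetical probability predictions from the four models.
preds = {
    "xgb_1":  np.array([0.10, 0.80, 0.55]),
    "xgb_2":  np.array([0.20, 0.70, 0.60]),
    "lgbm_1": np.array([0.15, 0.90, 0.50]),
    "lgbm_2": np.array([0.05, 0.85, 0.45]),
}

# Simple unweighted average of the four models' probabilities.
ensemble = np.mean(list(preds.values()), axis=0)
```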
The 180 Days Trick
The NormChange features hinted that a patient’s behaviour could change over time, so I created two models (LightGBM and XGBoost) using only the data from the past six months, or 180 days, which again gave me a significant boost in the ensemble. 180 days was by no means a strict threshold, and I encourage people to try only the past three months, one year, or any other time interval.
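The trick amounts to filtering the transaction history before rebuilding the features. A minimal sketch, again assuming Time counts days before the anchor date:

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_id": [1, 1, 2],
    "Time":       [30, 400, 90],  # assumed: days before the anchor date
    "Event":      ["A", "B", "A"],
})

# Keep only the last 180 days of history; models trained on features
# built from this slice focus on recent behaviour and diversify the
# ensemble. The 180-day cutoff is the write-up's choice, not a rule.
recent = df[df["Time"] <= 180]
recent_freq = (recent.groupby(["Patient_id", "Event"]).size()
                     .unstack(fill_value=0))
```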
The final model consisted of the ensemble of the four models, which led to my highest LB score.
“Understanding the problem is very critical in ML competitions. Also, no single model rules all kinds of problems, so I suggest fellow MachineHack competitors first build a baseline with different models and then proceed appropriately,” he said.
“MachineHack is an amazing platform, especially for beginners. The MachineHack team is very helpful, interacting with participants to get their doubts resolved. Also, the community is ever-growing, with new and brilliant participants coming up in every competition. I intend to continue using MachineHack to practice and refresh my knowledge of data science,” says Nikhil about his experience with MachineHack.
Get the complete code here.
Amey More
Amey is an engineering graduate with close to one year of industry experience. He was introduced to data science two years back, during his college days, through Andrew Ng’s Machine Learning course. He soon realized that the best way to enter this expanding domain was through hackathons. He also praises the vast and lively global data science community for its excellent support in sharing knowledge.
Approach To Solving The Problem
Amey explains his approach as follows:
Objective 1: Feature Creation
I decided to use built-in pandas functions like groupby, stack, unstack and describe rather than going the iterative way. I created the recency features with correct values, but observed that the frequency feature values were not entirely accurate.
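A sketch of that vectorised, non-iterative style, on toy data with hypothetical column names: describe computes a full set of per-group statistics in one call, and unstack widens the result to one row per patient:

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_id": [1, 1, 2],
    "Event":      ["A", "B", "A"],
    "Payment":    [100.0, 50.0, 75.0],
})

# describe() gives count/mean/std/min/quartiles/max per (patient, event);
# unstack() turns each (statistic, event) pair into its own column.
stats = (df.groupby(["Patient_id", "Event"])["Payment"]
           .describe()
           .unstack())
```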
Objective 2: Modelling
I realized that the features mentioned in the problem were extremely useful, as using only the features I engineered did not give a good score. Using the recency and frequency features, however, significantly increased the dimensionality of the dataset, and most of these features consisted mainly of null values.
I aggregated the following for each patient:
1. ‘patient_payment’ (min, max, mean, sum) over ‘event_name’ and ‘specialty’
2. ‘event_name’ frequency
3. ‘patient_payment’ aggregated overall
4. The count of total events
The dataset provided had a class imbalance of 85% (negative class) to 15% (positive class); a similar proportion was expected in the test set.
The final model was a bagged run of LightGBM across five folds created using stratified k-fold cross-validation. This strategy ensured the model was trained on data that followed the same class distribution as the train and test sets.
The above approach leads to more robust and better predictions; the idea was to get a more generalized sense of the error. The parameters used in the model script were already tuned on the training set.
The final test data was scored using these five models (one built on each fold).
To convert the predictions into hard classes, simple metric maximization was used: I took all the out-of-fold predictions, iterated over a range of thresholds and took the argmax, i.e. whichever threshold gave the maximum score was chosen as the threshold value.
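Since AUC itself does not depend on a classification threshold, the metric being swept must be a threshold-dependent one; this sketch assumes F1 as a stand-in, with synthetic out-of-fold predictions and labels:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical out-of-fold probabilities and true labels.
y_true = rng.integers(0, 2, size=200)
oof = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=200), 0, 1)

# Sweep a range of thresholds and keep the one maximising the metric
# (F1 here, as an assumed threshold-dependent stand-in).
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_true, (oof >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]

# Apply the chosen threshold to turn probabilities into hard classes.
hard_classes = (oof >= best_t).astype(int)
```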
“MachineHack is a great platform for budding data scientists to hone their skills, as well as test them against each other. The problems here are real-life, giving us an idea about the kind of work being done in industry” – he shared his opinion on the platform.
Get the complete code here.