Bad loans are a menace that weakens our financial system. In a bid to tackle the loan defaulter problem, Deloitte teamed up with MachineHack to organise a hackathon for data scientists and machine learning practitioners called “Machine Learning Challenge” from November 29 to December 13, 2021. Participants used attributes such as funded amount, location, loan and balance to predict whether a person will default on a loan.
The winners have now been declared and will take home cash prizes worth up to INR 1 lakh. Let’s get to know the top five on the leaderboard and the methods they used to ace the hackathon.
Rank 01: Chandrashekhar
Chandrashekhar has been crowned the winner of the Deloitte hackathon. One of his professors encouraged him to pursue a career in data science after Chandrashekhar graduated in information technology. He was always a math person and saw a close relationship between data science and maths. So he pursued a six-month course in data science, which changed his perception of data.
Chandrashekhar says he won the competition by focusing on feature engineering rather than model development. He extracted features from the existing variables, generated binning, PCA and arithmetic features from the numerical columns, and built combined features from the categorical columns.
Chandrashekhar adds, “I have generated aggregation features among the categorical and numerical columns. Then, I used Optuna to tune the model (lgbm) and extracted useful features using lgbm feature_importance. Finally, I built stratified_cv fold to predict their results and used the same method for Catboost and Gradientboost. Then, Ensembling the three models gave the best solution.”
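The aggregation features Chandrashekhar mentions can be illustrated with a minimal sketch, assuming the simplest variant: the mean of a numerical column within each class of a categorical column. The column names and values are hypothetical, and plain Python stands in for whatever tooling he actually used.

```python
from collections import defaultdict

# Hypothetical rows; the column names are illustrative, not taken from the solution.
rows = [
    {"grade": "A", "loan_amount": 10000.0},
    {"grade": "A", "loan_amount": 20000.0},
    {"grade": "B", "loan_amount": 5000.0},
]

def add_group_mean(rows, cat_col, num_col):
    """Attach the mean of num_col within each cat_col group as a new feature."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[cat_col]] += row[num_col]
        counts[row[cat_col]] += 1
    new_col = f"{num_col}_mean_by_{cat_col}"
    return [
        {**row, new_col: totals[row[cat_col]] / counts[row[cat_col]]}
        for row in rows
    ]

features = add_group_mean(rows, "grade", "loan_amount")
```

Each row now carries its group's average alongside its own value, which lets a tree-based model compare a loan with others in the same class.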
Chandrashekhar calls participating in the hackathon a good experience. He says that taking part in MachineHack hackathons has improved his fundamentals. He says, “I strongly believe MachineHack is providing a great platform for data science aspirants. This platform conducts hackathons on a variety of problems, and I met some cool people through MachineHack.”
Check out the solution here.
Rank 02: Felipe Carneiro Wanderley
Felipe graduated in Production Engineering and presently works as a bid analyst at IBGE, a government company. He started studying machine learning around nine months ago, after looking for an area that combined programming and statistics and finding data science a perfect fit.
Felipe says he invested most of his time preprocessing the data, especially in feature engineering. Three variables were engineered (Batch Enrolled, Grade and Loan Title), four encoded using the label encoder (Sub Grade, Employment Duration, Verification Status and Initial List Status), and four excluded (ID, Application Type, Accounts Delinquent and Payment Plan).
He adds, “For the features engineered, I used an ordinal encoding by calculating the percentage of defaulters for each class and ordering these. I tried to use the target encoding, but the model became overfitted, and the Log Loss metric for validation got worse. One Hot encoding also did not bring good results for the model.”
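The ordinal encoding Felipe describes can be sketched as follows. This is a minimal illustration under assumptions, not his code: the grades and labels are hypothetical, and ties between classes are broken alphabetically.

```python
from collections import defaultdict

def default_rate_ordinal_encoder(categories, targets):
    """Map each class to its rank when classes are sorted by share of defaulters."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for cat, y in zip(categories, targets):
        totals[cat] += 1
        defaults[cat] += y
    rates = {cat: defaults[cat] / totals[cat] for cat in totals}
    # Sort by default rate, then by name so that ties are deterministic.
    ordered = sorted(rates, key=lambda cat: (rates[cat], cat))
    return {cat: rank for rank, cat in enumerate(ordered)}

# Hypothetical grades and loan status labels (1 = defaulter).
grades = ["A", "A", "B", "B", "C", "C", "C", "C"]
status = [0, 0, 1, 0, 1, 1, 1, 0]
mapping = default_rate_ordinal_encoder(grades, status)
encoded = [mapping[g] for g in grades]
```

Unlike target encoding, only the order of the classes reaches the model, not the default rates themselves, which may be why this variant overfitted less in his experiments.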
In terms of preprocessing, since he used a tree-based model, it wasn’t necessary to normalise or standardise the data. However, the target is imbalanced, so he oversampled the minority class with SMOTE, choosing a sampling_strategy of 0.11.
He says that before using SMOTE, the mean probability predicted by his model was between 9.5% and 10.5%. However, by combining the Log Loss feedback from the public leaderboard with the mean of his submission, he could estimate the proportion of defaulters in the public leaderboard data, which came to between 11.3% and 12.0%.
He adds, “To test my theory, I submitted to achieve the Dumb-Log Loss, using only the value I assumed to be the proportion of Loan defaulters, that is 0.1167. Before submitting, I calculated the Log Loss of a submission that only contained the value of 0.1167 and got 0.3603. This solution (Sub 9) brought me a Log Loss of 0.36024, confirming my theory.”
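The arithmetic behind this check can be reproduced with a short sketch. It assumes every row receives the same constant probability; the function names are hypothetical.

```python
import math

def constant_log_loss(p, positive_rate):
    """Log loss of predicting p for every row when positive_rate of rows are positives."""
    return -(positive_rate * math.log(p) + (1 - positive_rate) * math.log(1 - p))

def implied_positive_rate(p, observed_log_loss):
    """Invert constant_log_loss: recover the positive share from a constant submission's score."""
    return (-observed_log_loss - math.log(1 - p)) / (math.log(p) - math.log(1 - p))

# Score expected if 11.67% of the rows are defaulters and every prediction is 0.1167.
expected = constant_log_loss(0.1167, 0.1167)

# Defaulter share implied by the score reported for the constant submission.
implied = implied_positive_rate(0.1167, 0.36024)
```

Here `expected` rounds to 0.3603, the value Felipe calculated before submitting, and `implied` comes out very close to 0.1167, which is consistent with his assumed proportion of defaulters.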
As the mean of his predictions was slightly off, he used SMOTE with a sampling_strategy of 0.11, which shifted the mean of his submission into the proper range. In addition, because he used a tree-based model, multicollinearity had little effect on the results.
He tried several algorithms, including Random Forest Classifier, KNeighborsClassifier, Logistic Regression, XGB Classifier and a Voting Classifier combining them. Random Forest Classifier produced the best results. After submitting the Random Forest baseline, he used GridSearch to choose the best parameters.
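To see what a sampling_strategy of 0.11 means, recall that in imbalanced-learn a float value is the desired ratio of minority to majority rows after resampling. The sketch below only does the arithmetic; it is not SMOTE itself, and the class counts are hypothetical rather than the competition's.

```python
def smote_targets(n_majority, n_minority, sampling_strategy):
    """Return (synthetic rows needed, minority share after resampling)."""
    target_minority = round(sampling_strategy * n_majority)
    synthetic = max(0, target_minority - n_minority)
    final_minority = n_minority + synthetic
    share = final_minority / (n_majority + final_minority)
    return synthetic, share

# Hypothetical class counts, not the competition's actual figures.
synthetic, share = smote_targets(n_majority=9000, n_minority=900, sampling_strategy=0.11)
```

With these counts, 90 synthetic defaulters are added. A ratio of 0.11 corresponds to a minority share of 0.11 / 1.11, or roughly 9.9% of all rows, so the ratio and the share should not be confused.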
Wanderley adds, “For the model evaluation, I used cross-validation with five folds. Additionally, I used a pipeline so that I could aggregate both transformers, SMOTE and Random Forest Classifier.”
His first contact with MachineHack was through a data science course in Brazil. The aim was to participate in a hackathon like this one and finish in the top 100 on the leaderboard. He chose the competition “Prediction House Prices in Bengaluru” and got fourth place.
Check out the solution here.
Rank 03: Tapas Das
Tapas is presently working as a delivery manager in The Math Company. He is a machine/deep learning enthusiast and first got interested in this subject area in 2018. He went through different MOOCs like the Andrew Ng ML course and the Deep Learning Specialization course on Coursera. He also spent a significant amount of time learning Python programming basics and then started picking diverse types of projects from various online sources like Kaggle, HackerEarth, Driven Data and slowly got comfortable with the analytical mindset and data science approach.
He started with basic EDA to explore the dataset and modified the data distribution for a few continuous (Funded Amount Investor, Interest Rate, Home Ownership, etc.) and categorical (Loan Title) variables. Next, he used Label Encoder to encode the categorical variables and the Featuretools library to generate 200 new features.
He adds, “Finally, I used a weighted average ensemble of LightGBM, CatBoost and XGBoost models to generate the final predictions. Also, I used the Optuna library for hyperparameters search for the different models.”
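A weighted average ensemble of this kind can be sketched in a few lines. The probabilities and weights below are hypothetical, since the article does not state the values Tapas used.

```python
def weighted_average(predictions, weights):
    """Blend per-model probability lists; weights are normalised to sum to 1."""
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]

# Hypothetical default probabilities from three models for two loans.
lightgbm_pred = [0.10, 0.30]
catboost_pred = [0.20, 0.20]
xgboost_pred = [0.30, 0.10]

# Hypothetical weights; in practice they would be chosen on validation data.
blend = weighted_average([lightgbm_pred, catboost_pred, xgboost_pred], [2, 1, 1])
```

Because the weights are normalised, the blended values stay valid probabilities as long as each model's outputs are.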
Tapas has been participating in different hackathons on the MachineHack platform for a while now. He says, “I love how the different problems mimic the real-world scenarios, which helps in a deeper understanding of that domain. Also, it’s fun to compete with the greatest minds in the area of data science. It really brings out the best in me.”
Check out the solution here.
Rank 04: Vivek Kumar
Kumar says he gets bored quickly if he is not exploring new things. He likes data science because it allows him to explore, iterate, and learn from different problems and experiments, just like we human beings learn. He added, “When I started learning, I realised that data science skills could be applied to many different domain areas, so I started making slow and steady progress by taking different training courses, participating in different hackathons.”
Kumar’s approach went beyond the technical steps: he focused on understanding the business process and the data before moving on to data preparation, modelling and evaluation.
In terms of data analysis, the training data had 67,463 rows and 35 features (26 numerical and nine categorical). No missing or duplicate entries were found in the training or testing data. The train and test distributions were similar for most features, except for Revolving Balance, Collection Recovery Fee, Accounts Delinquent, Total Collection Amount, and Total Current Balance. The collection- and recovery-based features showed a positive relationship with the target outcome.
Target Data Analysis
The target variable (Loan Status) was highly imbalanced.
Missing value analysis
No missing values were found in the train and test datasets.
Log transformation was applied to all the numeric features. Tree-based models like XGBoost are not sensitive to monotonic transformations, yet log-transforming all numeric features still helped improve the cross-validation and leaderboard scores.
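A minimal sketch of such a transformation, assuming the log(1 + x) variant so that zero values stay finite (the article does not say which variant Kumar used) and using hypothetical balances:

```python
import math

def log_transform(values):
    """Apply log(1 + x) to non-negative values; log1p keeps zeros finite."""
    return [math.log1p(v) for v in values]

# Hypothetical, right-skewed balances.
balances = [0.0, 99.0, 9999.0, 999999.0]
transformed = log_transform(balances)
```

The transformation is monotonic, so the order of the rows is unchanged, while the spread between small and very large values is compressed.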
New features were generated based on the interaction of some of the features. Kumar performed a Train/Test distribution check for each of the new features. Only the features showing a similar distribution across both train and test data have been included for model training to avoid any prediction drift. This allowed him to restrict the feature list.
- Calculated the sum of ‘Recoveries’ and ‘Collection Recovery Fee.’
- Calculated the sum of ‘Total Collection Amount’ and ‘Total Received Late Fee.’
- Created interest rate category based on ‘Interest Rate.’
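The three features listed above can be sketched as follows. The column names follow the article, but the name `Recoveries_plus_collection_fee`, the category labels and the interest rate cut-offs are hypothetical, because the actual bins are not given.

```python
def add_interaction_features(row):
    """Return a copy of row with the three engineered features added."""
    out = dict(row)
    out["Recoveries_plus_collection_fee"] = (
        row["Recoveries"] + row["Collection Recovery Fee"]
    )
    out["Collection_amt_plus_received_late_fee"] = (
        row["Total Collection Amount"] + row["Total Received Late Fee"]
    )
    rate = row["Interest Rate"]
    # Hypothetical cut-offs; the article does not give the actual bins.
    if rate < 10:
        out["Interest Rate Category"] = "low"
    elif rate < 15:
        out["Interest Rate Category"] = "medium"
    else:
        out["Interest Rate Category"] = "high"
    return out

# Hypothetical loan record.
loan = {
    "Recoveries": 2.5,
    "Collection Recovery Fee": 0.5,
    "Total Collection Amount": 30.0,
    "Total Received Late Fee": 0.5,
    "Interest Rate": 12.0,
}
enriched = add_interaction_features(loan)
```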
Kumar also performed Frequency encoding for the categorical features. It was done inside the cross-validation loop during model training and validation to avoid data leakage.
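Leakage-free frequency encoding can be sketched like this: the frequencies are learned from the training fold only and then applied to the validation fold. The category values are hypothetical, and encoding unseen categories as 0.0 is an assumption.

```python
from collections import Counter

def fit_frequency_encoding(train_values):
    """Learn each category's relative frequency from the training fold only."""
    counts = Counter(train_values)
    total = len(train_values)
    return {cat: n / total for cat, n in counts.items()}

def apply_frequency_encoding(values, mapping, default=0.0):
    """Encode values with a fitted mapping; unseen categories get the default."""
    return [mapping.get(v, default) for v in values]

# Hypothetical folds of one categorical column.
train_fold = ["RENT", "RENT", "OWN", "MORTGAGE"]
valid_fold = ["RENT", "OWN", "OTHER"]

freq = fit_frequency_encoding(train_fold)
valid_encoded = apply_frequency_encoding(valid_fold, freq)
```

Repeating the fit inside every fold means the validation rows never influence the values they are later encoded with.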
Model training and validation approach
Kumar chose gradient boosting as his final family of algorithms because it works best at identifying non-linear relationships between features.
- Spot training/validation was performed for 12 machine learning algorithms (a mix of linear and non-linear algorithms). This helped him understand which types of models could uncover the patterns needed for making predictions. The PyCaret package provides a very good and easy way to perform this activity.
- Based on the spot testing done in the 1st step, the XGBoost model was used as a final model for training/evaluation and prediction.
- The model was trained and evaluated on different training/validation splits using five-fold cross-validation. This ensured that the whole of the original training dataset was used for both training and validation, and the diversity of the splits helped improve the model’s robustness.
- The test predictions are generated in each of the five folds and then averaged to generate final test predictions.
- Variation in the local cross-validation score showed a similar trend to the leaderboard score. This helped him build a robust validation strategy and allowed him to experiment with different aspects of model building, such as validating new features, new encoding techniques and hyperparameter tuning. The cross-validation score is comparable with the leaderboard:
- Out of the Fold Log loss metric: 0.30822
- Public Leaderboard Log loss: 0.35350
- Private Leaderboard Log loss: 0.34099
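The fold-averaging scheme in the steps above can be sketched as follows. This is an illustration under assumptions: a trivial base-rate predictor stands in for XGBoost, the split is a simple interleaved one rather than whatever splitter Kumar used, and the labels are hypothetical.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; fold i validates on rows i, i + n_folds, ..."""
    for fold in range(n_folds):
        valid = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, valid

def fit_base_rate(labels):
    """Stand-in 'model': learns only the share of positives in its training fold."""
    rate = sum(labels) / len(labels)
    return lambda n: [rate] * n

# Hypothetical labels (1 = defaulter) and test set size.
y_train = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
n_test = 3
n_folds = 5

oof = [None] * len(y_train)   # out-of-fold predictions for the training rows
test_pred = [0.0] * n_test    # running average of the per-fold test predictions

for train_idx, valid_idx in k_fold_indices(len(y_train), n_folds):
    predict = fit_base_rate([y_train[i] for i in train_idx])
    for i, p in zip(valid_idx, predict(len(valid_idx))):
        oof[i] = p  # each row is scored by a model that never saw it
    fold_test = predict(n_test)
    test_pred = [acc + p / n_folds for acc, p in zip(test_pred, fold_test)]
```

Scoring `oof` against `y_train` gives an out-of-fold metric like the one reported above, while `test_pred` holds the average of the five fold models' test predictions.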
The Optuna package was used for the hyperparameter tuning.
Kumar said, “We can see that the features Loan Amount and Collection recovery fee have been given the highest importance score among all the features as they have been used many times by XGBoost for the split. It would be unfair to make any business decision based on XGBoost feature importance compared to other well established available techniques used for model interpretability, like SHAP. This is based on game theory backed by a solid mathematical foundation to justify the rationale behind global feature importance.”
Kumar added that the mean absolute value of the SHAP values for each feature is taken to get a standard bar plot.
The SHAP library provides easy ways to aggregate and plot the Shapley values for a set of points (in this case, the validation set) to give a global explanation for the model.
The new feature generated based on the summation of collection amount and received late fee has been given the highest importance score among all the features. The higher the value of Collection_amt_plus_received_late_fee, the more impact it will have on the loan default, which is intuitive.
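The aggregation behind that bar plot is simply the mean absolute SHAP value per feature. The sketch below does this by hand on hypothetical SHAP values; it does not call the SHAP library and makes no claim about the real scores.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by the mean absolute SHAP value across rows."""
    n_rows = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP values: one row per validation loan, one column per feature.
feature_names = ["Loan Amount", "Collection_amt_plus_received_late_fee"]
shap_values = [
    [0.25, -0.5],
    [-0.25, 1.0],
]
ranking = mean_abs_shap(shap_values, feature_names)
```

Taking absolute values first matters: a feature that pushes some predictions up and others down would otherwise average out to zero and look unimportant.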
In terms of his experience with MachineHack, Kumar said that the team is very supportive. He says, “It felt quite satisfying by solving some of the challenging problems faced by the industry, and MachineHack made it possible by bringing those problems to a platform where different hackers compete to solve them. Thank you, MachineHack.”
Check out the solution here.
Rank 05: Rahul Pednekar
Pednekar has always been passionate about new technologies, especially data science, AI and machine learning. His expertise lies in creating data visualisations to tell his data’s story & using feature engineering to add new features to give a human touch in the world of machine learning algorithms. In addition, he is very interested in developing software that solves real-world problems by leveraging data to make efficient decisions by predicting the future.
He heavily utilises Python to clean, analyse, and perform machine learning on data and has over 19 years of work experience in IT, project management, software development, application support, software system design, and requirement study.
He started with outlier removal, keeping only those rows in the training data where “Collection Recovery Fee” < 55, “Total Current Balance” < 1000000 and “Total Received Late Fee” < 40.
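The filter can be sketched in plain Python, using the thresholds quoted above. The rows are hypothetical, and Pednekar presumably applied the same conditions with a data-frame library.

```python
# Upper bounds quoted in the article; a row is kept only if it is below all of them.
LIMITS = {
    "Collection Recovery Fee": 55,
    "Total Current Balance": 1_000_000,
    "Total Received Late Fee": 40,
}

def remove_outliers(rows, limits=LIMITS):
    """Keep rows whose value is strictly below the limit for every listed column."""
    return [row for row in rows if all(row[col] < cap for col, cap in limits.items())]

# Hypothetical training rows.
train_rows = [
    {"Collection Recovery Fee": 1.2, "Total Current Balance": 250_000, "Total Received Late Fee": 0.1},
    {"Collection Recovery Fee": 60.0, "Total Current Balance": 250_000, "Total Received Late Fee": 0.1},
    {"Collection Recovery Fee": 1.2, "Total Current Balance": 1_000_000, "Total Received Late Fee": 0.1},
]
kept = remove_outliers(train_rows)
```

Only the first row survives: the second exceeds the recovery fee limit, and the third sits exactly on the balance limit, which the strict comparison excludes.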
For data type conversion, he converted the following numeric data type columns into Object data types as they were categorical columns:
- Delinquency – two years
- Inquiries – six months
- Public Record
- Open Account
- Total Accounts
- Last week Pay
Then, he converted the Target column “Loan Status” from the object data type into a numeric data type.
A few values in the test dataset were not present in the training dataset. Therefore, he replaced them with values present in both the train and test datasets.
- Column name = “Term”: Replace “60” with “59”
- Column name = “Delinquency – two years”: Replace “9” with “8”
- Column name = “Total Accounts”: Replace “73” with “72”
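These replacements amount to a per-column lookup, sketched below. Treating the values as integers and applying the mapping to the test rows are assumptions; the test row is hypothetical.

```python
# Replacements listed in the article: unseen value -> nearest value seen in training.
REPLACEMENTS = {
    "Term": {60: 59},
    "Delinquency – two years": {9: 8},
    "Total Accounts": {73: 72},
}

def replace_unseen(rows, replacements=REPLACEMENTS):
    """Return copies of rows with the listed values swapped; other values are untouched."""
    fixed_rows = []
    for row in rows:
        fixed = dict(row)
        for col, mapping in replacements.items():
            fixed[col] = mapping.get(fixed[col], fixed[col])
        fixed_rows.append(fixed)
    return fixed_rows

# Hypothetical test row containing two values that never occur in training.
test_rows = [{"Term": 60, "Delinquency – two years": 2, "Total Accounts": 73}]
cleaned = replace_unseen(test_rows)
```

This step matters once the columns are one-hot encoded, because a value seen only in the test data would otherwise create a column the model was never trained on.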
The column “Loan Title” contains many variants of the same value. He cleaned it by combining the various titles into the following 16 categories:
- Personal_Loan: All types of personal loans
- Vacation_Loan: Any loans taken for vacation
- Home_Loan: Any loan taken for buying a new home or renovating an existing home
- Medical_Loan: Loan taken for medical purpose
- Debt_Consolidation_Loan: Loan is taken to consolidate existing debt
- Consolidation_Loan: All types of consolidation loans
- Credit_Card_Consolidation: All types of credit card consolidation loans
- Debt_Free: Loan is taken to become debt-free
- CREDIT_CARDS: Loan taken over credit cards
- REFI_LOAN: Loan is taken to refinance existing loans
- Other_Loans: All other types of loans
- CAR_LOAN: Loan taken for a car
- Major purchase: Loan taken for major buying
- Business: Any type of business loans
- Moving and relocation: Loan is taken for moving and relocation
- Other: Any other type of loan
Then, a new column, “Loan Type”, with the above 16 categories, is created. Finally, he dropped four columns (ID, Payment Plan, Loan Title and Accounts Delinquent) as they were not adding any value.
Modelling and Prediction
For this, Pednekar used:
- One Hot Encoding: Use get_dummies() function to form around 400 columns for final modelling.
- RandomizedSearchCV to find the best hyperparameters using RandomForest Regressor
- Predict the test data using the best model given by the above hyperparameter tuned RandomForest Regressor.
In terms of his experience at MachineHack, Pednekar says, “I would like to thank MachineHack for providing me with the opportunity to participate in the Deloitte Machine Learning Challenge. It has been a wonderful learning experience, and I would like to participate in future hackathons. I would encourage them to organise many more hackathons.”
Check out the solution here.