Why Are Tree-Based Models Preferred in Credit Risk Modeling?

Credit risk refers to the likelihood that a borrower will be unable to make regular payments and will default on their obligations.

Credit risk modeling is a field where machine learning can offer analytical solutions because it can extract patterns from large amounts of heterogeneous data. It is also necessary to understand how individual features influence the outcome, because that insight is central to data-driven decision making. In this post, we will examine what credit risk is and how it can be modeled using various machine learning algorithms. We will implement credit risk modeling with several models and see how tree-based models outperform the others on this task. The following are the main points to be discussed.

Table of Contents

  1. What is Credit Risk
  2. What is Credit Risk Modeling
  3. How Machine Learning Is Used in Credit Risk Modelling?
  4. Implementing Credit Risk Modelling
  5. The Outperformance of Tree-Based Models

Let’s start the discussion by understanding what credit risk is.

What is Credit Risk 

Credit risk refers to the likelihood that a borrower will be unable to make regular payments and will default on their obligations. In other words, it is the possibility that a lender will not receive the principal or interest owed on time. When that happens, the lender’s cash flow is disrupted and the cost of recovery rises. In the worst-case scenario, the lender may be obliged to write off some or all of the loan, resulting in a loss.

Predicting a person’s likelihood of defaulting on a debt is difficult and complex. At the same time, assessing credit risk properly helps to limit losses due to default and late payments. As compensation for taking on credit risk, the lender receives interest payments from the borrower.

If the credit risk is higher, the lender or investor will either charge a higher interest rate or refuse to make the loan. For the same loan, an applicant with a solid credit history and regular income will be charged a lower interest rate than one with a poor credit history.

What is Credit Risk Modeling

A person’s credit risk is influenced by a variety of factors, so determining a borrower’s credit risk is a difficult undertaking. Because so much money depends on predicting that risk accurately, credit risk modelling has come into wide use. Credit risk modelling is the practice of applying data models to determine two key quantities. The first is the likelihood that the borrower will default on the loan. The second is the financial impact on the lender if the default occurs.
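The two quantities described above are often combined into a single expected-loss figure. The snippet below is a minimal sketch, assuming the simplification that expected loss is the probability of default multiplied by the amount lost if default occurs. The function name and the numbers are hypothetical and are not taken from the dataset used later in this post.

```python
def expected_loss(prob_default: float, loss_if_default: float) -> float:
    """Expected loss = probability of default x loss incurred if default occurs."""
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must be between 0 and 1")
    return prob_default * loss_if_default


# Hypothetical borrower: 5% chance of default, 10,000 lost if it happens
loss = expected_loss(0.05, 10_000)
print(f"expected loss: {loss:.2f}")
```

A lender could compare such a figure with the interest it expects to earn when deciding whether, and at what rate, to lend.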

Credit risk models are used by financial organizations to assess the credit risk of potential borrowers. Based on the output of a validated credit risk model, they decide whether or not to approve a loan, as well as the loan’s interest rate.

New means of estimating credit risk have emerged as technology has progressed, such as credit risk modelling in R and Python using up-to-date analytics and big data techniques. Other factors, such as the growth of economies and the emergence of various categories of credit risk, have also shaped credit risk modelling.

How Machine Learning Is Used in Credit Risk Modelling?

Machine learning enables more advanced modelling approaches, such as decision trees and neural networks, to be used. These introduce nonlinearities into the model, allowing more complex relationships between variables to be discovered. One such approach is an XGBoost model fed with features selected using the permutation importance technique.

ML models, on the other hand, are frequently so complex that they are difficult to interpret. Because interpretability is critical in a highly regulated field like credit risk assessment, XGBoost is sometimes combined with logistic regression.
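The permutation-based feature selection mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used in this post: scikit-learn’s GradientBoostingClassifier stands in for XGBoost, and the data is synthetic, with only the first column carrying any signal.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
# Synthetic label: only column 0 matters, the other three columns are noise
y = (X[:, 0] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each validation column in turn and measure how much the score drops
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("columns ranked by importance:", ranking)
```

Features whose shuffling barely changes the score are candidates for removal before fitting a simpler, more interpretable model.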

Implementing Credit Risk Modelling

Credit risk modelling in Python can assist banks and other financial institutions in reducing risk and preventing financial catastrophes in society. The goal of this article is to create a model that can predict the likelihood of a person defaulting on a loan. Let’s start by loading the dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# load the data
loan_data = pd.read_csv('/content/drive/MyDrive/data/loan_data_2007_2014.csv')

When you look at the Colab notebook for this implementation, you’ll find that numerous columns are identifiers and carry no meaningful information for building our machine learning model; id and member_id are examples. Remember that we want to build a model that predicts the likelihood of a borrower defaulting on a loan, so we cannot use features that relate to events occurring after a person defaults, because this information isn’t available at the time of loan approval. Examples of such features are recoveries and collection_recovery_fee. The code below drops these columns.

# dropping irrelevant columns
columns_to_drop = ['id', 'member_id', 'sub_grade', 'emp_title', 'url', 'desc', 'title', 'zip_code', 'next_pymnt_d',
                   'recoveries', 'collection_recovery_fee', 'total_rec_prncp', 'total_rec_late_fee', 'mths_since_last_record',
                   'mths_since_last_major_derog', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'open_acc_6m', 'open_il_6m',
                   'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m',
                   'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'policy_code']
loan_data.drop(columns=columns_to_drop, inplace=True)
# drop na values
loan_data.dropna(inplace=True)

As you may know, multicollinearity should be avoided when preparing the data: highly correlated variables carry the same information and are therefore redundant. If we keep them, models will struggle to estimate the relationship between the dependent and independent variables.

To check for multicollinearity, we will draw a heatmap of the correlation matrix obtained with the pandas corr() method. The heatmap is shown below.

As can be seen, several variables are highly correlated and should be eliminated. ‘loan_amnt’, ‘funded_amnt’, ‘funded_amnt_inv’, ‘installment’, ‘total_pymnt_inv’, and ‘out_prncp_inv’ are multicollinear variables.
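Besides reading the heatmap by eye, highly correlated pairs can be listed programmatically. The helper below is a hedged sketch: the function name, the 0.9 threshold, and the toy values are assumptions made for illustration, and only the column names echo the real dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |corr|) for numeric column pairs above the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    return [
        (a, b, round(float(corr.loc[a, b]), 3))
        for a, b in combinations(corr.columns, 2)  # each pair visited once
        if corr.loc[a, b] > threshold
    ]


# Toy frame: column names mimic the loan data, values are made up
rng = np.random.default_rng(0)
loan_amnt = rng.uniform(1_000, 35_000, size=200)
toy = pd.DataFrame({
    "loan_amnt": loan_amnt,
    "funded_amnt": loan_amnt * 0.98,  # almost a copy of loan_amnt
    "annual_inc": rng.uniform(20_000, 150_000, size=200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns can then be dropped before modelling.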

If you look through the notebook, you’ll notice that several variables aren’t stored in the right data types and need to be pre-processed into the right format. We will define some functions to help automate this procedure. The functions used to convert these variables are coded below.

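def term_numeric(data, col):
    # '36 months' -> 36
    data[col] = pd.to_numeric(data[col].str.replace(' months', '', regex=False))

term_numeric(loan_data, 'term')

def emp_length_convert(data, col):
    # '10+ years' -> 10, '< 1 year' -> 0, '3 years' -> 3; missing values become 0
    data[col] = data[col].str.replace('+ years', '', regex=False)
    data[col] = data[col].str.replace('< 1 year', '0', regex=False)
    data[col] = data[col].str.replace(' years', '', regex=False)
    data[col] = data[col].str.replace(' year', '', regex=False)
    data[col] = pd.to_numeric(data[col]).fillna(0)

def date_columns(data, col):
    # replace a 'Mon-YY' date column with the number of months elapsed up to a fixed reference date
    today_date = pd.to_datetime('2020-08-01')
    data[col] = pd.to_datetime(data[col], format="%b-%y")
    new_col = 'mths_since_' + col
    # whole calendar months between the dates; this avoids dividing by np.timedelta64(1, 'M'),
    # which newer pandas versions may not accept
    data[new_col] = (today_date.year - data[col].dt.year) * 12 + (today_date.month - data[col].dt.month)
    # two-digit years can be parsed into the future; replace the resulting negative values with the maximum
    max_months = data[new_col].max()
    data[new_col] = data[new_col].apply(lambda x: max_months if x < 0 else x)
    data.drop(columns=[col], inplace=True)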

In our dataset, the target column is loan_status, which has several unique values. These values must be converted to binary: 0 for a bad borrower and 1 for a good borrower. In our case, a bad borrower is someone whose loan status falls into one of the following categories: Charged Off, Default, Late (31-120 days), or Does not meet the credit policy. Status:Charged Off. The remaining borrowers are considered good borrowers.

# creating a new column based on the loan_status 
loan_data['good_bad'] = np.where(loan_data.loc[:, 'loan_status'].isin(['Charged Off', 'Default', 'Late (31-120 days)',
'Does not meet the credit policy. Status:Charged Off']), 0, 1)
# Drop the original 'loan_status' column
loan_data.drop(columns = ['loan_status'], inplace = True)

Some of the remaining variables are categorical and need to be converted into numbers before modelling. For that, we will use the LabelEncoder class from the sklearn library, as below.

categorical_columns = loan_data.select_dtypes('object').columns
for col in categorical_columns:
    le = LabelEncoder()
    loan_data[col] = le.fit_transform(loan_data[col])

Now we are all set to train the various algorithms and check which performs best. Here we evaluate one linear model, one nearest-neighbour model, two tree-based models, and one Naive Bayes model. We will cross-validate using KFold with 10 folds and compare the mean accuracy across those folds.

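# split features and target
# (test_size and random_state are assumptions; the article does not state them)
X = loan_data.drop(columns=['good_bad'])
y = loan_data['good_bad']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# compare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold)
    # keep the per-fold scores and report mean accuracy (std) for each model
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)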

The Outperformance of Tree-Based Models

As we can see from the above mean accuracies, the tree-based models performed far better than the others. This is because tree-based algorithms tend to give prediction models good accuracy, stability, and interpretability. Unlike linear models, they capture nonlinear relationships and feature interactions well. They are also versatile and can be applied to both classification and regression problems.

Because building a tree requires no domain knowledge or parameter configuration, it is well suited to exploratory knowledge discovery. Decision trees can also handle multidimensional data.
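The point about nonlinear interactions can be illustrated on a small synthetic problem. The sketch below assumes an XOR-style label, where the class depends on the combination of two features rather than on either one alone. It is a toy demonstration and says nothing about the scores obtained on the loan data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(1000, 2))
# XOR-style label: class 1 only when both features share the same sign
y = ((X[:, 0] > 0) == (X[:, 1] > 0)).astype(int)

# Mean 5-fold accuracy for a linear model and for a single decision tree
lr_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
dt_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print(f"logistic regression: {lr_acc:.2f}, decision tree: {dt_acc:.2f}")
```

A single straight decision boundary cannot separate the four quadrants, whereas a tree can carve them out with successive axis-aligned splits.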

Attribute selection measures are used during tree construction to choose the attribute that best splits the tuples into distinct classes. Many of the branches in a tree may reflect noise or outliers in the training data. Tree pruning aims to identify and remove such branches in order to improve classification accuracy on unseen data.
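Pruning can be sketched with scikit-learn’s cost-complexity pruning, exposed through the ccp_alpha parameter of DecisionTreeClassifier. The example below is an assumption-laden illustration: the data is synthetic with deliberately noisy labels, and the value 0.005 is arbitrary rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 10% flipped labels, giving a full tree noise to memorise
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# ccp_alpha > 0 removes subtrees whose impurity reduction does not justify their size
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.005).fit(X_tr, y_tr)

print("leaves:", full.get_n_leaves(), "->", pruned.get_n_leaves())
print("test accuracy:", full.score(X_te, y_te), "->", pruned.score(X_te, y_te))
```

In practice the pruning strength would be chosen by cross-validation rather than fixed by hand.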

In addition, in applications like credit risk modeling, feature importance plays a very important role because it shows which features drive the predictions. Using decision trees and similar algorithms, we can obtain feature importance maps and tune models accordingly. Below you can see the feature importance map given by the Decision Tree algorithm.

Methods such as decision trees, random forests, and gradient boosting are widely used across many kinds of data science problems.
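Feature importances can be read from a fitted tree through its feature_importances_ attribute. The following is a minimal sketch on made-up data: the column names echo the loan dataset, but the values and the labelling rule are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical features with random values
X = pd.DataFrame({
    "int_rate": rng.uniform(5, 25, size=500),
    "annual_inc": rng.uniform(20_000, 150_000, size=500),
    "dti": rng.uniform(0, 40, size=500),
})
# Made-up rule: an interest rate above 15 marks a bad borrower (0), otherwise good (1)
y = (X["int_rate"] <= 15).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

The same attribute is available on RandomForestClassifier, so the comparison can be repeated for the random forest trained earlier.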

Final Words

Through this post, we have discussed credit risk and credit risk modelling in detail. We have seen what credit risk is, what influences it, and how ML can be used to model it in place of conventional methods. We then walked through a practical implementation in which we tested various models and saw that tree-based algorithms outperformed the rest, which is why they are preferred for such tasks.


Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
