How to find feature importance in loan default prediction?

Analyzing the feature importance is necessary for certain predictive analytics works such as credit or loan default predictions

Understanding feature importance helps a data science practitioner select the features that matter most when training a machine learning model. Beyond minimizing default risk through prediction, it is also valuable to know which features drive a customer's default. So in this article, we will conduct a feature importance analysis on the popular loan default prediction problem using a Random Forest classifier. Below are the major points that we are going to discuss in this post.

Table of contents

  1. Feature Importance
  2. Predictive Modeling and Feature Analysis
  3. Training a Random Forest Classifier
  4. Analyzing feature importance

Let’s first understand the concept of feature importance.

Feature Importance

Feature importance is an essential part of the machine learning pipeline: it produces a list of features with their corresponding importance scores. Once we have the scores, we can select the most important features.



As mentioned in the title, we will use a loan default dataset. A loan defaulters’ dataset usually has many features, so which features a data scientist should focus on is a big question. Because closely examining each feature one by one is not feasible, we need a method that gives us a set of essential features, and this is where feature importance comes into play. With the help of tree-based algorithms such as random forest, we can obtain a list of important features.


Predictive Modeling and Feature Analysis

The random forest has built-in feature importance: it uses its Gini impurity criterion to score features. A feature that helps the model decrease impurity is considered important, so the more a feature contributes to reducing impurity, the more important it becomes.
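As a minimal sketch of this idea (on synthetic data, not the loan dataset), scikit-learn exposes these impurity-based scores through the fitted model's `feature_importances_` attribute, with one score per feature summing to 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a tabular credit dataset: 3 informative features out of 6
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# One Gini-based importance score per feature; the scores sum to 1
for i, score in enumerate(model.feature_importances_):
    print(f"feature_{i}: {score:.3f}")
```

The informative features receive noticeably higher scores than the noise features, which is exactly the signal we will use on the loan data below.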


We will use the Loan Default dataset to implement feature importance; loan default is one of the most widely solved problems in machine learning. Such a dataset contains information about the customers of a bank, a loan company, or a vehicle-financing company. Our dataset is bank data: it holds the information of customers to whom loans were given in the past, plus one target variable that tells whether a particular customer is a defaulter or not (0 means not a defaulter, 1 means defaulter).

Training a Random Forest Classifier

(Note: a few pre-processing steps are not covered in this post, but you can find them all in the Colab notebook referenced below.)

In this section, we will first perform classification on the Loan Default dataset from Kaggle, then generate feature importance plots.

First of all, import the essential libraries: NumPy for mathematical computation, seaborn and matplotlib for plotting graphs, and the preprocessing modules from sklearn.

import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

Read the CSV file and save it into pandas dataframe.

data_raw = pd.read_csv("/content/train.csv")

The dataset contains categorical features that must be encoded. For “years_in_current_job” we simply strip the ‘years’ text and keep the number. For the other categorical features, we encode the values into integer labels with a LabelEncoder.

data_raw["years_in_current_job"] = data_raw["years_in_current_job"].replace({'-1': -1, '10+ years': 10, '8 years': 8, '6 years': 6, 
                                                                              '7 years': 7, '5 years': 5, '1 year': 1, '< 1 year': 0, 
                                                                              '4 years': 4, '3 years': 3, '2 years': 2, '9 years': 9})

le = preprocessing.LabelEncoder()
data_raw["purpose"] = le.fit_transform(data_raw.purpose.values)
data_raw["home_ownership"] = le.fit_transform(data_raw.home_ownership.values)
data_raw["term"] = le.fit_transform(data_raw.term.values)
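For reference, `LabelEncoder` simply maps each distinct category to an integer; the classes are sorted alphabetically before being numbered. A quick sketch on made-up values (not the actual dataset columns):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# 'long term' sorts before 'short term', so it gets label 0
encoded = le.fit_transform(["short term", "long term", "short term"])
print(list(le.classes_))  # the categories in their label order
print(list(encoded))      # the integer labels assigned to each input value
```

Note that re-fitting the same encoder on a new column (as above with purpose, home_ownership, and term) overwrites its learned mapping, which is fine here because we only need the transformed values.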

The next step is to fill the NaN values. For ‘months_since_last_delinquent’ we fill NaNs with -1, because almost 50% of its values are missing and filling them with the mean would not make sense. Other features such as ‘annual_income’ and ‘credit_score’ are filled with the mean.

data_raw['months_since_last_delinquent'] = data_raw['months_since_last_delinquent'].fillna(-1)
data_raw['annual_income'].fillna(int(data_raw['annual_income'].mean()), inplace=True)
data_raw['credit_score'].fillna(int(data_raw['credit_score'].mean()), inplace=True)
data_raw['years_in_current_job'].fillna(int(data_raw['years_in_current_job'].mean()), inplace=True)
data_raw['bankruptcies'].fillna(int(data_raw['bankruptcies'].mean()), inplace=True)
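After imputation it is worth confirming that no missing values remain. Sketched here on a toy frame, since the Kaggle CSV is not bundled with this post:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataframe, with one NaN per column
df = pd.DataFrame({"annual_income": [50000.0, np.nan, 62000.0],
                   "credit_score": [700.0, 680.0, np.nan]})

# Same mean-imputation pattern as above
for col in ["annual_income", "credit_score"]:
    df[col] = df[col].fillna(int(df[col].mean()))

# Every column should now report 0 missing values
print(df.isnull().sum())
```

Running `data_raw.isnull().sum()` on the real dataframe after the fills above should likewise show zeros everywhere.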

Get all the features into the X variable and target column in the y variable.

X = data_raw[['home_ownership', 'annual_income', 'years_in_current_job', 'tax_liens',
              'number_of_open_accounts', 'years_of_credit_history',
              'maximum_open_credit', 'number_of_credit_problems',
              'months_since_last_delinquent', 'bankruptcies', 'purpose', 'term',
              'current_loan_amount', 'current_credit_balance', 'monthly_debt',
              'credit_score']]

y = data_raw[['credit_default']]

Now we will standardize the features.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)

Split the dataset in an 80:20 ratio: 80% for training and 20% for testing.

#split the data 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.2, random_state = 4)

Next, we will import and initialize the random forest classifier and fit the model on training data.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)
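Since `accuracy_score` was imported earlier, a quick hold-out evaluation of the fitted classifier is worth running before trusting its importances. A sketch on synthetic data in place of the loan split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled loan features and target
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=4)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_tr, y_tr)

# Accuracy on the 20% hold-out set
print("Test accuracy:", accuracy_score(y_te, rfc.predict(X_te)))
```

On the real data, the same two lines (`rfc.predict(X_test)` and `accuracy_score(y_test, ...)`) give the hold-out accuracy of the loan default model.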

Analyzing Feature Importance

Now that we have trained the random forest classifier, we will proceed with analyzing the feature importance. First, we convert the feature importances into a pandas Series, then print the importance scores.

feature_imp = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
print(feature_imp)

This output shows that credit_score and current_loan_amount are the most important features in making a credit default classification.

We can visualize these features with the help of seaborn.

f, ax = plt.subplots(figsize=(15, 7))
ax = sns.barplot(x=feature_imp, y=feature_imp.index)
ax.set_title("Visualize feature scores of the features")
ax.set_xlabel("Feature importance score")
plt.show()

The x-axis shows the feature importance score, and the y-axis shows the feature names. Credit_score has the longest bar, which shows it is the most important feature in the case of defaults.
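One caveat worth knowing: impurity-based scores can be biased toward high-cardinality numeric features. As a cross-check, scikit-learn's `permutation_importance` measures how much shuffling each feature degrades test performance instead. A sketch on synthetic data (the same call works on the loan model with `X_test`, `y_test`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in score
result = permutation_importance(rfc, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)  # one mean importance per feature
```

If both rankings agree on the top features, as they typically do here, we can be more confident in the impurity-based list above.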

Final words

In this article, we saw why feature importance is key in the machine learning pipeline and understood how random forests determine feature importance. Finally, we analyzed feature importance by fitting a random forest on the Loan Default dataset and found the features that matter most in the case of defaults.



Waqqas Ansari
Waqqas Ansari is a data science guy with a math background. He likes solving challenging business problems through predictive modelling, descriptive modelling, and machine learning algorithms. He is fascinated by new technologies, especially those relating to machine learning.
