
Hands-on Guide to Interpret Machine Learning with SHAP

SHAP, an open-source library, helps answer questions about the reliability of a machine learning model

In machine learning, a model is trained on a prepared dataset and then used to make predictions on unseen data. How much has each feature contributed to those predictions? We might expect certain features to matter most, while the model may have relied on quite different ones. With only the predictions in hand, we cannot understand or explain what happened inside the model, and sometimes a model’s behaviour is questionable. SHAP, an open-source library, helps answer this question about the reliability of a machine learning model.

What is SHAP or SHapley Additive exPlanations?

SHAP stands for SHapley Additive exPlanations. It is derived from Shapley values, which Lloyd Shapley introduced in 1951 as a solution concept in cooperative game theory. SHAP can be applied to any kind of machine learning or deep learning model. ‘TreeExplainer’ is a fast and accurate algorithm for tree-based models such as random forests, XGBoost, LightGBM, and decision trees.



‘DeepExplainer’ is an approximate algorithm for deep neural networks. ‘KernelExplainer’ is model-agnostic and, in general, applicable to any machine learning model. SHAP provides several visualizations that show feature importance and how features contribute to predictions. In this article we discuss various SHAP visualization techniques using a decision tree classification model.
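
To make the game-theory idea behind SHAP concrete, the sketch below computes exact Shapley values for a hypothetical three-player game by averaging each player's marginal contribution over every possible ordering. The payoff table and the function name `shapley_values` are invented for illustration. The SHAP library relies on much faster model-specific or approximate algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(players, payoff):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    `payoff` maps a frozenset of players to the value that coalition earns.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # marginal contribution of p when joining the current coalition
            totals[p] += payoff[with_p] - payoff[coalition]
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


# Hypothetical game: A and B earn more together than apart; C always adds 10
payoff = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 20,
    frozenset("C"): 10,
    frozenset("AB"): 50,
    frozenset("AC"): 20,
    frozenset("BC"): 30,
    frozenset("ABC"): 60,
}
values = shapley_values(["A", "B", "C"], payoff)
print(values)
```

The three values add up to the payoff of the full coalition, which is the "additive" property that SHAP carries over to model predictions: players become features and the payoff becomes the model output.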

Create the working environment

To install the open-source library in a Python environment:

!pip install shap

Import the necessary libraries and a built-in dataset for SHAP analysis. Here the ‘Breast Cancer Data’ from the sklearn datasets is used.

 import numpy as np
 import pandas as pd
 from matplotlib import pyplot as plt
 import seaborn as sns
 from sklearn.datasets import load_breast_cancer
 from sklearn.tree import DecisionTreeClassifier, export_graphviz
 from sklearn.model_selection import train_test_split
 import shap 
 import graphviz

Build and Train a Decision Tree model

Save the predictors and the target in the variables X and y respectively. Then split the dataset into train and test sets in an 80:20 ratio.

 # load the famous breast cancer data from sklearn inbuilt datasets
 # a supervised binary classification problem
 data = load_breast_cancer()
 # define predictors as pandas dataframe
 X = pd.DataFrame(data['data'], columns=data['feature_names'])
 # define target as pandas series
 y = pd.Series(data['target'])
 # split data into train and test sets
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
 # sample first few rows and few columns
 X_train.iloc[:5, :12] 

Construct a decision tree classifier and train it on the training set. To see how the model classifies the given data, we can use the graphviz tree visualizer to plot the tree.

 # develop a decision tree model
 model = DecisionTreeClassifier(random_state=1, max_depth=5)
 # train the model
 model.fit(X_train, y_train)
 # visualize how the model classifies the data
 tree_graph = export_graphviz(model, out_file=None, feature_names=data['feature_names'], rounded=True, filled=True)
 graphviz.Source(tree_graph)

SHAP Force Plot

Develop a tree-based SHAP explainer and calculate the SHAP values. The explainer returns one array of SHAP values per class in the target. Here the problem is binary classification, so there are two arrays, one for each class.
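
The one-array-per-class layout matches older SHAP releases. Newer releases may instead return a single array with the classes on the last axis, in which case `shap_values[0]` would select the first row rather than the first class. The helper below is a minimal sketch that handles both layouts. `select_class` is a hypothetical name, not part of the SHAP API, and the assumption that the class axis comes last should be checked against the installed version. The dummy arrays only stand in for real explainer output.

```python
import numpy as np


def select_class(shap_values, class_index):
    """Return the SHAP values for one class from either output layout.

    Assumption: a list holds one array per class, while a single array
    carries the classes on its last axis.
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[class_index])
    return np.asarray(shap_values)[..., class_index]


# Dummy data standing in for explainer output: 4 rows, 3 features, 2 classes
stacked = np.arange(24, dtype=float).reshape(4, 3, 2)
as_list = [stacked[..., 0], stacked[..., 1]]

print(select_class(as_list, 1).shape)
print(select_class(stacked, 1).shape)
```

With such a helper, an expression like `shap_values[1]` in the plotting calls of this article would become `select_class(shap_values, 1)`.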

SHAP values are floating-point numbers, one for each feature in each row of data. A SHAP value represents the contribution of that feature value to the model’s output for that row. If the value is close to zero, the data point contributes very little to the prediction. If it is strongly positive or strongly negative, the data point contributes greatly, pushing the output towards or away from the class being explained.

Force plots are suitable for row-wise SHAP analysis. A force plot takes a single row and shows, in rank order, how each feature contributed to the prediction. The wider a feature’s block, the larger its contribution.
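
A small worked case shows how to read these numbers. For a linear model with independent features, the SHAP value of a feature reduces to its weight times the distance of the feature value from its mean, and the base value plus all SHAP values reproduces the prediction. The weights, means, and row below are made up for illustration and have nothing to do with the breast cancer data.

```python
# Hypothetical linear model: prediction = bias + sum(weight * value)
weights = {"radius": 2.0, "texture": -1.0, "smoothness": 0.5}
means = {"radius": 10.0, "texture": 20.0, "smoothness": 4.0}
bias = 1.0
row = {"radius": 13.0, "texture": 20.5, "smoothness": 4.0}


def predict(x):
    return bias + sum(weights[f] * x[f] for f in weights)


# For a linear model the expected output equals the prediction at the feature means
base_value = predict(means)
# Linear-model SHAP value: weight * (value - mean), assuming independent features
shap_values = {f: weights[f] * (row[f] - means[f]) for f in weights}

for feature, value in sorted(shap_values.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature:>10}: {value:+.2f}")

reconstructed = base_value + sum(shap_values.values())
print(base_value, reconstructed, predict(row))
```

Here `radius` pushes the prediction up strongly, `texture` pulls it down slightly, and `smoothness`, which sits exactly at its mean, contributes nothing.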

 # Initialize JavaScript visualizations in notebook environment
 shap.initjs()
 # Define a tree explainer for the built model
 explainer = shap.TreeExplainer(model)
 # obtain shap values for the first row of the test data
 shap_values = explainer.shap_values(X_test.iloc[0])
 shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[0]) 

Force plots can be made interactive by plotting more data points. Here we plot all of the test data. Hovering the mouse pointer over regions of the plot shows the SHAP values interactively.

 # obtain shap values for the test data
 shap_values = explainer.shap_values(X_test)
 shap.force_plot(explainer.expected_value[0], shap_values[0], X_test) 

The interactive plot offers dropdown options to select the features of interest, which gives a better understanding of how two different features interact in predicting the outputs. Note that in the force plot, red marks features that push the model output higher and blue marks features that push it lower.

SHAP Summary Plot

Summary plots are easy-to-read visualizations that bring the whole dataset into a single plot. Features are listed on the y-axis in rank order, with the largest contributor to the predictions at the top and the smallest or zero contributor at the bottom. SHAP values are shown on the x-axis. As discussed already, a value of zero represents no contribution, and the contribution grows as the SHAP value moves away from zero. Each circular dot in the plot represents a single data point, and the color of the dot denotes the value of the corresponding feature. It can be observed that the feature ‘worst perimeter’ contributes greatly to the model’s prediction, with low values deciding one class and higher values deciding the other.

shap.summary_plot(shap_values[1], X_test)

The summary plot can also be drawn as a bar plot for quick reading with minimal detail.

shap.summary_plot(shap_values[1], X_test, plot_type='bar')

In this example, only the top eight ranked features contribute to the model’s predictions.
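
The bar variant ranks features by their mean absolute SHAP value across rows. The same ranking can be computed by hand, as sketched below with a small made-up matrix of SHAP values (rows are samples, columns are features). The feature names are borrowed from the dataset only as labels, and `mean_abs_importance` is a hypothetical helper name.

```python
# Made-up SHAP values: 4 samples x 3 features
feature_names = ["worst perimeter", "mean texture", "worst area"]
shap_matrix = [
    [0.30, -0.05, 0.00],
    [-0.40, 0.10, 0.00],
    [0.20, -0.15, 0.00],
    [-0.30, 0.10, 0.00],
]


def mean_abs_importance(matrix, names):
    """Rank features by mean |SHAP value|, largest first."""
    n_rows = len(matrix)
    scores = {
        name: sum(abs(row[j]) for row in matrix) / n_rows
        for j, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = mean_abs_importance(shap_matrix, feature_names)
for name, score in ranking:
    print(f"{name:<16} {score:.3f}")
```

A feature the tree never splits on, like `worst area` in this toy matrix, ends up with an importance of zero, which is why a shallow tree leaves only a handful of features with visible bars.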

SHAP Dependence Plot

Dependence plots can be of great use when analyzing feature importance and doing feature selection. A dependence plot sets one feature against another: it plots the SHAP values of one feature against that feature’s values and colors the dots by the value of a second, interaction feature.

 # use the whole of X to get more points on the plot
 shap_values = explainer.shap_values(X)
 shap.dependence_plot('worst perimeter', shap_values[1], X, interaction_index='worst concave points') 

If the user does not provide an interaction feature, SHAP picks a suitable feature on its own and uses it as the interaction feature.

shap.dependence_plot('worst concave points' , shap_values[1], X)

SHAP Decision Plot

Finally, we discuss the decision plot. Like the summary plot, it gives an overall picture of how features contribute to the predictions. From the bottom to the top of the decision plot, SHAP values are cumulatively added to the base value of the model to arrive at the output values. It can be observed that the lines colored blue end at class value 0 and the lines colored red end at class value 1.

shap.decision_plot(explainer.expected_value[1], shap_values[1], X)
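
The cumulative construction can be reproduced by hand. The sketch below starts from a base value and adds one feature’s SHAP value at a time, which traces the path of a single line in a decision plot. The numbers are invented, and listing the features from least to most important, so that the largest shifts come last, is an assumption about the plot’s default ordering.

```python
from itertools import accumulate

base_value = 0.60  # hypothetical expected model output
# Hypothetical SHAP values for one row, least important feature first
contributions = [
    ("mean texture", 0.02),
    ("worst area", -0.05),
    ("worst concave points", 0.13),
    ("worst perimeter", 0.25),
]

# Running total after each feature is added, starting from the base value
path = list(accumulate((v for _, v in contributions), initial=base_value))

print(f"{'base value':<22} {path[0]:.2f}")
for (name, _), total in zip(contributions, path[1:]):
    print(f"{name:<22} {total:.2f}")
```

The last entry of `path` is the model output for that row, equal to the base value plus the sum of all SHAP values.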

SHAP analysis can be used to interpret or explain a machine learning model. It can also be carried out as part of feature engineering, to tune the model’s performance or to generate new features.

Rajkumar Lakshmanamoorthy
A geek in Machine Learning with a Master's degree in Engineering and a passion for writing and exploring new things. Loves reading novels, cooking, practicing martial arts, and occasionally writing novels and poems.
