A Complete Guide to SHAP – SHapley Additive exPlanations for Practitioners

SHAP (SHapley Additive exPlanations) is a tool for explaining the prediction of any model by computing, and visualizing, the contribution of each feature to that prediction.

Many machine learning models are highly accurate, but one limitation we keep running into is that we cannot easily explain why they produce a given outcome. There is always a need to make model outcomes more explainable. In this article, we introduce SHAP (SHapley Additive exPlanations), a tool that helps make the outcomes of machine learning models more explainable. The major points to be discussed in this article are listed below.

Table of Contents

  1. What is SHAP?
  2. Installing SHAP
  3. Simple Implementation of SHAP
  4. Explaining Models With Shapley Values
    1. Examining the Model Coefficients
    2. Partial Dependence Plots
    3. Waterfall Plot

What is SHAP?

SHAP (SHapley Additive exPlanations) is a tool for making a machine learning model more explainable by visualizing its output. It can explain the prediction of any model by computing the contribution of each feature to that prediction. It brings together several earlier explanation methods, such as LIME, Shapley sampling values, DeepLIFT, and QII.

At the core of SHAP are Shapley values, a concept from game theory: through them, SHAP connects optimal credit allocation with local explanations. We can think of Shapley values as a method for fairly distributing a prediction among the features that contributed to it.
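
To make the idea of credit allocation concrete, here is a minimal sketch that computes exact Shapley values for a toy game with three "features". The payoff table and the helper name `shapley_values` are invented for illustration and are not part of the SHAP library. Each player's value is its marginal contribution averaged over every order in which the players can join.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


# Hypothetical payoff for every coalition of the three features a, b and c.
payoff = {
    frozenset(): 0,
    frozenset("a"): 10,
    frozenset("b"): 20,
    frozenset("c"): 0,
    frozenset("ab"): 40,
    frozenset("ac"): 10,
    frozenset("bc"): 20,
    frozenset("abc"): 40,
}

phi = shapley_values("abc", payoff.__getitem__)
print(phi)  # {'a': 15.0, 'b': 25.0, 'c': 0.0}
```

The three values add up to the payoff of the full coalition (40), and feature c, which never changes the payoff, receives no credit.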

One of the good things about SHAP is that it supports models built with libraries like scikit-learn, PySpark, TensorFlow, Keras, PyTorch, and many more. These are widely used libraries for data modelling, and a basic problem with them is that model outcomes are not easy to explain. Using SHAP, we can make outcomes more understandable for users who are not familiar with how machine learning models work. Because of this, we can also use SHAP for data visualization. Let’s start with installing SHAP in our environment.

Installing SHAP

We can install the SHAP tool by using the following pip command:

!pip install shap

Output:

Now that SHAP is installed, we can start by building models on simple data.

Simple Implementation of SHAP

As discussed, we can use SHAP with many modelling libraries. In this section, we will look at how simply we can use it to make the outcomes of simple models more explainable.

Let’s start by loading the data. The SHAP package ships with some ready-made datasets, which we will use here. In this article, we are going to use the Iris dataset for classification.

Loading the data

import shap
# display=True returns the class labels as names instead of integer codes
X, y = shap.datasets.iris(display=True)

Splitting the data

from sklearn.model_selection import train_test_split
# Split features and labels in one call so the rows stay aligned
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Checking the data

from google.colab import data_table
data_table.enable_dataframe_formatter()
X_train

Output:

For classification, we are using the SVM model from the scikit-learn library.

Importing and fitting model

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svc = SVC(kernel='linear', probability=True)
svc.fit(X_train, Y_train)
y_pred = svc.predict(X_test)
accuracy_score(Y_test, y_pred)

Output:

Here we get an accuracy of 100%. Now we can use SHAP to explain the predictions on the test set using visualization.

Explaining the prediction using an explainer

# The training data serves as the background distribution for the explainer
explainer = shap.KernelExplainer(svc.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)

Plotting the prediction 

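shap.initjs()
# Explain class 0. Older SHAP releases return a list with one array per class;
# if your version returns a single 3-D array, use shap_values[:, :, 0] instead.
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test)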

Output:

The output is interactive, and we can move the cursor over it to see the values; here I am only posting a picture of it. We have used the force plot to display the outcomes of the model. By looking at the force plot, we can understand the impact of every feature on the model's prediction, even for a specific instance of the data.

The force plot is an explanation of feature importance based on a game-theoretic method, i.e., Shapley values. It shows the influence of each feature on the current prediction. Values in red push the prediction higher (a positive influence), whereas values in blue push it lower (a negative influence).

The example above gives a general idea of how to apply SHAP to a model. Next, let’s look more closely at the SHAP values we computed in the last section.

Explaining Models With Shapley Values

In this section of the article, we will see how we can make a machine learning model more explainable using Shapley values. For this purpose, we will use a simple linear regression model on the Iris dataset that we already used in the last section.

Let’s start by fitting the model on the previously loaded data.

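import sklearn.linear_model
X, y = shap.datasets.iris()  # reload with integer class labels (0, 1, 2), which regression needs
model = sklearn.linear_model.LinearRegression()
model.fit(X, y)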

Output:

Examining the Model Coefficients

One of the most common ways to explain a linear model is to examine the coefficient it has learned for each feature. A coefficient tells us how much the output changes when we change that feature by one unit.

Examining the coefficients is the traditional way to inspect a model. From them, we can say that petal width is the most influential feature in the dataset. Since we have SHAP, we can get a clearer picture using the partial dependence plot.
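
As a small illustration of reading coefficients, the sketch below uses a hand-written linear model. The coefficients, the intercept, and the flower measurements are hypothetical values chosen for easy arithmetic; they are not the ones learned by the model fitted above.

```python
# Hypothetical coefficients and intercept, for illustration only.
coef = {
    "sepal length (cm)": -0.125,
    "sepal width (cm)": -0.25,
    "petal length (cm)": 0.25,
    "petal width (cm)": 0.5,
}
intercept = 0.25


def predict(row):
    """Linear model: intercept plus the weighted sum of the features."""
    return intercept + sum(coef[name] * row[name] for name in coef)


# A hypothetical flower.
row = {
    "sepal length (cm)": 5.0,
    "sepal width (cm)": 3.0,
    "petal length (cm)": 4.0,
    "petal width (cm)": 1.0,
}

# Increase petal width by one unit and keep everything else fixed.
changed = dict(row)
changed["petal width (cm)"] += 1.0

delta = predict(changed) - predict(row)
print(delta)  # 0.5, the coefficient of petal width
```

Note that a coefficient is tied to the unit of its feature, so it measures how sensitive the output is to that feature rather than how much the feature matters overall.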

Partial Dependence Plots

To understand the importance of a feature, we need to know both its impact on the model output and how its values are distributed. If we plot the model's response and the feature distribution in a single plot, the result is more informative. Let’s see how we can do this.

# Background sample used to estimate the partial dependence
X50 = shap.utils.sample(X, 50)
shap.plots.partial_dependence(
    "petal length (cm)", model.predict, X50, ice=False,
    model_expected_value=True, feature_expected_value=True
)

Output:

Along the X-axis, we can see a histogram of the distribution of the feature in the data. The blue line is the partial dependence curve, that is, the average model output when the feature is fixed at a given value. It passes through the point where the two expected-value lines intersect.

From this plot we can read off SHAP values, which are Shapley values applied to a conditional expectation function of the model.

To compute them, we need a background distribution. For example, we can draw a few instances from the data, say 50, and use that sample as the background when computing the SHAP values.

Computing the SHAP values

X50 = shap.utils.sample(X, 50)  # 50 background instances
explainer = shap.Explainer(model.predict, X50)
shap_values = explainer(X)

Partial dependence plot

sample_ind = 18  # index of the instance to explain
shap.plots.partial_dependence(
    "petal length (cm)", model.predict, X50, model_expected_value=True,
    feature_expected_value=True, ice=False,
    shap_values=shap_values[sample_ind:sample_ind+1,:]
)

Output:

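Here we can see a close correspondence between the partial dependence plot and the SHAP value. For a linear model, the SHAP value of a feature for a given instance is the difference between the partial dependence curve at that feature's value and the expected model output. Plotting the SHAP values of a feature therefore gives a mean-centred version of the partial dependence plot for that feature.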
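
The correspondence can be checked by hand on a tiny example. The sketch below is plain Python and does not use the SHAP library. The two-feature linear model, the background rows, and the helper name `partial_dependence` are all assumptions made for illustration.

```python
from statistics import fmean

# Hypothetical linear model over two features.
coef = [0.5, -0.25]
intercept = 1.0


def f(x):
    return intercept + sum(c * v for c, v in zip(coef, x))


# Hypothetical background sample (feature means are 4.0 and 5.0).
background = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]


def partial_dependence(feature, value):
    """Average output when `feature` is set to `value` in every background row."""
    outputs = []
    for row in background:
        modified = list(row)
        modified[feature] = value
        outputs.append(f(modified))
    return fmean(outputs)


expected_output = fmean(f(row) for row in background)

x = [6.0, 3.0]  # instance to explain
pd_value = partial_dependence(0, x[0])

# Gap between the partial dependence curve and the expected output ...
shap_value_feature0 = pd_value - expected_output
# ... equals coefficient * (feature value - feature mean) for a linear model.
closed_form = coef[0] * (x[0] - fmean(row[0] for row in background))

print(shap_value_feature0, closed_form)  # 1.0 1.0
```

This shortcut holds for linear (more generally, additive) models; for models with interactions, the SHAP value has to average over feature coalitions as well.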

Let’s check the distribution of the SHAP value.

shap.plots.scatter(shap_values[:,"petal length (cm)"])

Output:

This gives a clearer picture: the distribution of the SHAP values mirrors the distribution of the petal length feature.

Waterfall Plot

The SHAP values of all input features always sum to the difference between the model output for the instance being explained and the expected model output. This is how the prediction for that instance is explained. We can see it through the waterfall plot.

shap.plots.waterfall(shap_values[sample_ind])

Output:

The waterfall plot shows how the SHAP values move the output, one feature at a time, from the expected value to the predicted value.
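
The additivity behind the waterfall plot can be verified without the library. The sketch below computes exact Shapley values for a small hypothetical model with an interaction term, using a made-up background sample, and then prints a text version of the waterfall. Features outside a coalition are filled in from the background rows, which is one common way to define the value function and is assumed here for illustration.

```python
from itertools import combinations
from math import factorial, isclose
from statistics import fmean


def model(x):
    # Hypothetical model with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]


# Hypothetical background sample and instance to explain.
background = [[0.0, 1.0, 2.0], [2.0, 3.0, 0.0], [4.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
x = [3.0, 2.0, 5.0]
n = len(x)


def value(subset):
    """Mean output when features in `subset` come from x and the rest from the background."""
    return fmean(
        model([x[i] if i in subset else row[i] for i in range(n)])
        for row in background
    )


def shap_value(i):
    """Exact Shapley value of feature i, enumerating all coalitions of the other features."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            members = set(subset)
            total += weight * (value(members | {i}) - value(members))
    return total


phi = [shap_value(i) for i in range(n)]
base_value = value(set())  # expected model output over the background
prediction = model(x)

# Text waterfall: start at the expected value and add one contribution at a time.
running = base_value
print(f"expected value: {running:.3f}")
for i, contribution in enumerate(phi):
    running += contribution
    print(f"feature {i}: {contribution:+.3f} -> {running:.3f}")

assert isclose(running, prediction)
```

The final running total matches the model output for the instance, which is exactly the property the waterfall plot visualizes.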

Final Words

In this article, we have seen what SHAP is and how we can apply it to our models to make their outcomes more explainable. Along with this, we have seen how SHAP values can be used to improve the explainability of any model.

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
