In machine learning, a model is trained on a prepared dataset and then used to make predictions on unseen data. But how much did each feature contribute to those predictions? We might assume that certain features matter most, while the model may have relied on entirely different ones. With only the predictions in hand, we cannot understand or explain what happened inside the model, and sometimes a model’s behaviour is questionable. SHAP, an open-source library, helps answer this question about the reliability of a machine learning model.
SHAP is the acronym for SHapley Additive exPlanations. It is derived from Shapley values, which Lloyd Shapley introduced in 1951 as a solution concept for cooperative game theory. SHAP works with most kinds of machine learning and deep learning models. ‘TreeExplainer’ is a fast and accurate algorithm for tree-based models such as random forests, XGBoost, LightGBM, and decision trees.
‘DeepExplainer’ is an approximate algorithm for deep neural networks. ‘KernelExplainer’ is a model-agnostic algorithm that can, in general, be applied to any machine learning model. SHAP provides different visualizations to show feature importance and the way features contribute to predictions. In this article we discuss various SHAP visualization techniques using a decision tree classification model.
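To make the game-theoretic idea behind Shapley values concrete, here is a minimal sketch that computes them exactly for a tiny cooperative game. It is not how the SHAP library computes its values internally; the players and the `payoffs` table are made up for illustration, and brute-force enumeration like this is only feasible for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # weight given to coalitions of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # marginal contribution of p when it joins coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical game: the payoff earned by each coalition of players A, B and C
payoffs = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 20,
    frozenset("C"): 30,
    frozenset("AB"): 40,
    frozenset("AC"): 50,
    frozenset("BC"): 60,
    frozenset("ABC"): 90,
}

phi = shapley_values(["A", "B", "C"], payoffs.__getitem__)
print(phi)
print("sum of Shapley values:", sum(phi.values()))
```

The Shapley values add up to the payoff of the full coalition. SHAP relies on the same property, with features playing the role of players and the model output playing the role of the payoff.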
Create the working environment
Install the open-source library in the Python environment:
!pip install shap
Import the necessary libraries and a built-in dataset for SHAP analysis. Here the ‘Breast Cancer Data’ from sklearn datasets is used.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
import shap
import graphviz
sns.set_style('darkgrid')
Build and Train a Decision Tree model
Save the predictors and the target in the variables X and y respectively. Then split the dataset into train and test sets in an 80:20 ratio.
# load the famous breast cancer data from sklearn inbuilt datasets
# a supervised binary classification problem
data = load_breast_cancer()
# define predictors as pandas dataframe
X = pd.DataFrame(data['data'], columns=data['feature_names'])
# define target as pandas series
y = pd.Series(data['target'])
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# sample first few rows and few columns
X_train.iloc[:5, :12]
Construct a decision tree classifier model and train it with the training set of data. To know how the model classifies given data, we can use graphviz tree visualizer to plot the tree.
# develop a decision tree model
model = DecisionTreeClassifier(random_state=1, max_depth=5)
# train the model on the training data
model.fit(X_train, y_train)
# visualize the splits the fitted tree uses to classify the data
tree_graph = export_graphviz(model, out_file=None, feature_names=data['feature_names'], rounded=True, filled=True)
graphviz.Source(tree_graph)
SHAP Force Plot
Develop a tree-based SHAP explainer and calculate the shap values. For a classifier, the explainer returns one set of shap values per class in the target. Here the problem is binary classification, so there are two sets of shap values, one for each class. Depending on the installed SHAP version, these may come back as a list of two arrays or as a single three-dimensional array.
Shap values are floating-point numbers, one for each feature in each row. A shap value represents the contribution of that feature’s value in that row to the predicted output. If the shap value is close to zero, the feature contributes very little to the prediction for that row. If the shap value is strongly positive or strongly negative, the feature pushes the prediction strongly towards or away from the class being explained.
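The interpretation above can be checked on a toy example. The following is a brute-force sketch, not the TreeExplainer algorithm: it assumes a made-up `toy_model` and a made-up `background` dataset, replaces "unknown" features with values from the background rows, and computes exact Shapley values for a single row.

```python
from itertools import combinations
from math import factorial


def toy_model(row):
    # hypothetical model with an interaction between features 0 and 1
    return 2.0 * row[0] + row[0] * row[1] - 3.0 * row[2]


def expected_output(model, row, background, known):
    """Average output when features in `known` are taken from `row`
    and all other features are taken from each background row in turn."""
    outputs = []
    for b in background:
        mixed = [row[i] if i in known else b[i] for i in range(len(row))]
        outputs.append(model(mixed))
    return sum(outputs) / len(outputs)


def exact_shap_values(model, row, background):
    """Exact Shapley values of each feature for one row (exponential cost)."""
    n = len(row)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                known = set(subset)
                with_i = expected_output(model, row, background, known | {i})
                without_i = expected_output(model, row, background, known)
                phi += weight * (with_i - without_i)
        values.append(phi)
    return values


# made-up background data and a made-up row to explain
background = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [2.0, 2.0, 1.0], [0.0, 1.0, 1.0]]
row = [4.0, 3.0, 1.0]

base_value = expected_output(toy_model, row, background, set())
shap_row = exact_shap_values(toy_model, row, background)

print("base value:", base_value)
print("shap values:", shap_row)
print("base value + sum of shap values:", base_value + sum(shap_row))
print("model output for the row:", toy_model(row))
```

The base value plus the shap values reproduces the model output for the row. The third feature receives a shap value of zero here because it enters the toy model additively and its value in the row equals its average in the background data, so it does not move the prediction away from the base value.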
Force plots are suitable for row-wise SHAP analysis. A force plot takes a single row and shows, in rank order, how each feature contributed to the prediction. The wider a feature’s block, the larger its contribution.
Force plots can be made interactive by plotting them with more data points. Here we plot all of the test data. By hovering the mouse pointer over regions of the plot, we can inspect the shap values interactively.
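# build a tree-based explainer for the trained model
explainer = shap.TreeExplainer(model)
# obtain shap values for the test data
shap_values = explainer.shap_values(X_test)
# load the JavaScript that interactive plots need inside a notebook
shap.initjs()
# explain class 1; this indexing assumes a SHAP version that returns a list with one array per class
# (if your version returns a single 3-D array, use shap_values[:, :, 1] instead)
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test)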
The interactive plot provides dropdown options to select features of interest. This gives a better understanding of how two different features interact with each other in predicting the outputs. Note that red marks contributions that push the prediction higher and blue marks contributions that push it lower.
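Because the layout of the shap values for a classifier can differ between SHAP versions, a small helper can make the rest of the analysis less fragile. The sketch below is an assumption-based convenience, not part of the SHAP API: `class_shap_values` is a hypothetical name, and it assumes that older releases return a list with one (rows x features) array per class while newer releases return one array of shape (rows, features, classes). It is demonstrated on placeholder arrays rather than real shap values.

```python
import numpy as np


def class_shap_values(shap_values, class_index=1):
    """Return a 2-D (rows x features) array of shap values for one class.

    Hypothetical helper: handles a list with one array per class,
    a 3-D array laid out as (rows, features, classes), or an array
    that is already 2-D.
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[class_index])
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        return arr[:, :, class_index]
    return arr


# placeholder data standing in for real shap values: 4 rows, 3 features, 2 classes
list_layout = [np.zeros((4, 3)), np.ones((4, 3))]
array_layout = np.stack(list_layout, axis=-1)

print(class_shap_values(list_layout).shape)
print(class_shap_values(array_layout).shape)
```

With such a helper, the array for one class could be passed to the plotting functions regardless of which layout the installed version produces.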
SHAP Summary Plot
Summary plots are easy-to-read visualizations that bring the whole dataset into a single plot. All of the features are listed on the y-axis in rank order, with the top one contributing the most to the predictions and the bottom one contributing the least, or nothing at all. Shap values are shown on the x-axis. As discussed already, a value of zero represents no contribution, and the contribution increases as the shap value moves away from zero. Each circular dot in the plot represents a single data point. The color of the dot denotes the value of the corresponding feature. It can be observed that the feature ‘worst perimeter’ contributes greatly to the model’s predictions, with low values deciding one class and higher values deciding the other.
The summary plot can also be drawn as a bar plot for quick reading with minimal detail.
shap.summary_plot(shap_values, X_test, plot_type='bar')
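The bar plot shows that only the top 8 ranked features contribute to the model’s predictions. The remaining features have zero importance, which is what one would expect for features that the tree never splits on.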
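The ranking in the bar plot is based on the mean absolute shap value of each feature across all rows. A minimal sketch of that calculation follows; the numbers in `toy_shap` are made up for illustration and are not results from the breast cancer model.

```python
def mean_abs_importance(shap_matrix, feature_names):
    """Rank features by their mean absolute shap value across rows."""
    n_rows = len(shap_matrix)
    scores = {}
    for j, name in enumerate(feature_names):
        scores[name] = sum(abs(r[j]) for r in shap_matrix) / n_rows
    # most important feature first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# made-up shap values: 4 rows, 3 features
toy_shap = [
    [0.30, -0.05, 0.0],
    [-0.25, 0.10, 0.0],
    [0.40, 0.02, 0.0],
    [-0.35, -0.03, 0.0],
]
names = ["worst perimeter", "mean texture", "mean smoothness"]

ranking = mean_abs_importance(toy_shap, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values matters: the first feature has positive and negative shap values that would largely cancel in a plain average, yet it is clearly the most influential one.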
SHAP Dependence Plot
Dependence plots can be of great use when analyzing feature importance and doing feature selection. A dependence plot is a one-versus-one plot of two features: it plots the shap values of one feature against that feature’s values and colors the dots according to a second, interaction feature.
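# we use the whole of X to get more points on the plot
shap_values = explainer.shap_values(X)
# explain class 1; this indexing assumes a SHAP version that returns a list with one array per class
shap.dependence_plot('worst perimeter', shap_values[1], X, interaction_index='worst concave points')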
If the interaction feature is not provided by the user, SHAP picks a suitable feature on its own and uses that as the interaction feature.
# explain class 1; this indexing assumes a list with one array per class
shap.dependence_plot('worst concave points', shap_values[1], X)
SHAP Decision Plot
Finally, we discuss the decision plot. Like the summary plot, it gives an overall picture of how the features contribute to the predictions. From the bottom to the top of the decision plot, shap values are cumulatively added to the base value of the model to arrive at the output values. It can be observed that the lines colored in blue end at the final class value 0 and the remaining lines colored in red end at the final class value 1.
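# explain class 1; this indexing assumes a SHAP version that returns a list with one array per class
shap.decision_plot(explainer.expected_value[1], shap_values[1], X)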
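The cumulative addition that the decision plot draws can be sketched for a single row as follows. The base value and shap values here are made up, `decision_path` is a hypothetical helper, and ordering the features by their absolute shap value in this one row is a simplifying assumption for illustration.

```python
def decision_path(base_value, shap_row, feature_names):
    """Running total from the base value up to the final output for one row."""
    # least influential feature first, mirroring the bottom-to-top reading of the plot
    order = sorted(range(len(shap_row)), key=lambda j: abs(shap_row[j]))
    path = [("base value", base_value)]
    running = base_value
    for j in order:
        running += shap_row[j]
        path.append((feature_names[j], running))
    return path


# made-up base value and shap values for one row
features = ["worst perimeter", "mean texture", "worst concave points"]
path = decision_path(0.6, [0.25, -0.05, 0.10], features)

for label, running_total in path:
    print(f"{label}: {running_total:.2f}")
```

The last running total equals the base value plus the sum of all shap values for the row, which is the point where the corresponding line ends at the top of the decision plot.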
SHAP analysis can be used to interpret or explain a machine learning model. Also, it can be done as part of feature engineering to tune the model’s performance or generate new features!