
Handling categorical variables with KernelSHAP

Note

To enable SHAP support, you may need to run

pip install alibi[shap]
# shap.summary_plot currently doesn't work with matplotlib>=3.6.0,
# see bug report: https://github.com/slundberg/shap/issues/2687
!pip install matplotlib==3.5.3

Introduction

In this example, we show how the KernelSHAP method can be used for tabular data, which contains both numerical (continuous) and categorical attributes. Using a logistic regression model fitted to the Adult dataset, we examine the performance of the KernelSHAP algorithm against the exact shap values. We investigate the effect of the background dataset size on the estimated shap values and present two ways of handling categorical data.

import shap
shap.initjs()

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from alibi.explainers import KernelShap
from alibi.datasets import fetch_adult
from scipy.special import logit
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

Data preparation

Load and split

The fetch_adult function returns a Bunch object containing the features, the targets, the feature names and a mapping of categorical variables to numbers.

Note that for your own datasets you can use our utility function gen_category_map to create the category map.
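
A minimal sketch of the loading and splitting step is shown below (the variable names, seed and test size are illustrative choices, not values taken from the original run):

np.random.seed(0)

# Load the Adult dataset: a Bunch holding the features, the targets, the feature
# names and a map from categorical column indices to their possible values.
adult = fetch_adult()
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map = adult.category_map

# Hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)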

Create feature transformation pipeline

Create feature pre-processor. Needs to have 'fit' and 'transform' methods. Different types of pre-processing can be applied to all or part of the features. In the example below we will standardize ordinal features and apply one-hot-encoding to categorical features.

Ordinal features:
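
As a minimal sketch (the variable name and the median imputation strategy are assumptions), the ordinal columns can be taken to be those not listed in category_map, then imputed and standardized:

ordinal_features = [i for i in range(len(feature_names)) if i not in category_map]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])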

Categorical features:
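
Similarly, a sketch for the categorical columns (the most_frequent imputation strategy is an assumption); the drop='first' option is discussed in the note below:

categorical_features = list(category_map.keys())
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', handle_unknown='error')),
])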

Note that in order to be able to interpret the coefficients corresponding to the categorical features, the option drop='first' has been passed to the OneHotEncoder. This means that for a categorical variable with n levels, the length of the code will be n-1. This is necessary in order to avoid introducing feature multicolinearity, which would skew the interpretation of the results. For more information about the issue about multicolinearity in the context of linear modelling see [1].

Combine and fit:
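
A sketch of the combined pre-processor, fitted on the training split (assuming the transformer and split names defined above):

preprocessor = ColumnTransformer(transformers=[
    ('num', ordinal_transformer, ordinal_features),
    ('cat', categorical_transformer, categorical_features),
])
preprocessor.fit(X_train)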

| | Age | Capital Gain | Capital Loss | Hours per week | Workclass | Education | Marital Status | Occupation | Relationship | Race | Sex | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | 0 | 0 | 55 | Private | Prof-School | Never-Married | Professional | Not-in-family | White | Female | United-States |

[Decision plot: prediction paths for the original record and for fictitious records with altered Capital Gain]

The decision plot above informs us of the path to the decision Income < $50,000 for the original record (depicted in blue, and, for clarity, on its own below). Additionally, decision paths for fictitious records where only the Capital Gain feature was altered are displayed. For clarity, only a handful of these instances have been plotted. Note that the base value of the plot has been altered to be the classification threshold (6) as opposed to the expected prediction probability for individuals earning more than $50,000.

We see that the second highlighted instance (in purple) would have been predicted as making an income over $50,000 with approximately 0.6 probability, and that this change in prediction is largely driven by the Capital Gain feature. We can see below that the income predictor would have predicted the income of this individual to be more than $50,000 had the Capital Gain been over $3,500.

[Decision plot: the original record on its own, in probability space]

Note that by passing return_objects=True and using r.feature_idx as an input to the decision plot above, we were able to plot the original record along with the feature values in the same feature order. Additionally, by passing link='logit' to the plotting function, the scale of the axis is mapped from the margin to probability space (5).
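
As a rough sketch, such plots could be produced with shap.decision_plot as below; here expected_value, record_shap_values and record are hypothetical placeholders for the base value, the shap values of the explained record and its raw feature values computed earlier in the example:

# First plot: return the plotting structures so that the feature order can be reused.
r = shap.decision_plot(
    expected_value,           # base value of the explanation
    record_shap_values,       # shap values for the explained record
    features=record,          # raw feature values of the explained record
    feature_names=feature_names,
    link='logit',             # map the x-axis from margin to probability space
    return_objects=True,
)

# Second plot: reuse the feature ordering of the first plot.
shap.decision_plot(
    expected_value,
    record_shap_values,
    features=record,
    feature_names=feature_names,
    feature_order=r.feature_idx,   # same feature order as above
    link='logit',
)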

Combined, the two decision plots show that:

  • the largest decrease in the probability of earning more than $50,000 is driven by the individual having the marital status Never-Married

  • the largest increase in the probability of earning more than $50,000 is determined by the education level

  • the probability of making an income greater than $50,000 increases with the capital gain; notice how this implies that features such as Education or Occupation also contribute more to the increase in the probability of earning more than $50,000

Checking if prediction paths significantly differ for extreme probability predictions

One can employ the decision plot to check whether the prediction paths for low (or high) probability examples differ significantly; conceptually, examples whose prediction paths differ significantly are potential outliers.

Below, we seek to explain only those examples which are predicted to have an income above $50,000 with small probability.
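
A minimal sketch of one way to select such examples (the classifier name clf, the processed test set X_test_proc, the fitted explainer and the 0.1 threshold are all assumptions):

# Predicted probability of earning more than $50,000 for each test instance.
p_pos = clf.predict_proba(X_test_proc)[:, 1]

# Keep only the instances for which this probability is small.
low_prob_idx = np.where(p_pos <= 0.1)[0]

# Explain just these instances; their prediction paths can then be drawn on a single decision plot.
low_prob_explanation = explainer.explain(X_test_proc[low_prob_idx])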

[Decision plot: prediction paths for the test instances predicted to earn above $50,000 with low probability]

From the above plot, we see that the prediction paths for the samples with low probability of being class 1 are similar - no potential outliers are identified.

Investigating the effect of the background dataset size on shap value estimates

The shap values estimation relies on querying the model with samples in which certain inputs are toggled off, in order to infer the contribution of a particular feature. Since most models cannot accept arbitrary patterns of missing values, the background dataset is used to replace the values of the missing features, that is, it acts as a background model. In more detail, the algorithm first creates a number of copies of this dataset and then subsamples sets of features to be toggled off; the values of these features are taken from the background dataset, while the remaining features keep the values of the instance being explained.

Since the model predicts on these perturbed samples and regresses on the predictions to infer the shap values, the quality of the background model is key for the explanation model. Here we will not be concerned with modelling the background, but instead investigate whether simply increasing the background set size can give rise to wildly different shap values. This part of the example is long running so the graph showing our original results can be loaded instead.

The cell below is long running, so you can skip it and display the saved graph instead.
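
A rough sketch of what that cell could look like is given below; the intermediate fractions, the predictor clf.predict_proba and the processed arrays X_train_proc and X_explain are assumptions about objects defined earlier, while the 0.005 and 0.16 endpoints come from the discussion that follows:

fractions = [0.005, 0.01, 0.02, 0.04, 0.08, 0.16]  # fractions of the training set used as background
shap_values_per_fraction = []

for fraction in fractions:
    n_background = int(fraction * X_train_proc.shape[0])
    background_data = X_train_proc[:n_background]

    # Fit a fresh explainer on each background set and explain the same instances.
    explainer = KernelShap(clf.predict_proba, link='logit')
    explainer.fit(background_data)
    explanation = explainer.explain(X_explain)
    shap_values_per_fraction.append(explanation.shap_values[1])  # shap values for class 1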

[Figure: background-effect — shap value estimates for increasing background dataset sizes]

We notice that, with the exception of the Capital Gain and Capital Loss features, the differences between the shap value estimates are not significant as the fraction of the training set used as a background dataset increases from 0.005 to 0.16. Notably, the Capital Gain feature would be ranked as the second most important by all the approximate models, whereas in the initial experiment, which used the first 100 (0.003) examples from the training set, the ranking of the two features was reversed. How to select an appropriate background dataset is an open-ended question. In the future, we will explore whether clustering the training data can provide a more representative background model and increase the accuracy of the estimation.
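
As a sketch of the clustering idea, the shap library provides a kmeans utility that summarises a dataset into weighted centroids; using such a summary as the background data, rather than an arbitrary slice of the training set, is the direction alluded to above, though it has not been evaluated here:

# Summarise the processed training set into 10 weighted cluster centroids
# (the number of clusters is an arbitrary choice for illustration).
background_summary = shap.kmeans(X_train_proc, 10)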

A potential limitation of expensive explanation methods such as KernelShap, when used to draw insights about global model behaviour, is the fact that explaining large datasets can take a long time. Below, we explain a larger fraction of the testing set (0.4) in order to see whether different conclusions about the feature importances would be drawn.

[Plot: exact shap values for the larger explained fraction of the test set]

As expected, the exact shap values have the same ranking when a larger set is explained, since they are derived from the same model coefficients.

[Plot: approximate (KernelSHAP) shap values for the larger explained fraction of the test set]

The ranking of the features also remains unchanged for the approximate method even when significantly more instances are explained.

Footnotes

(1): As detailed in Theorem 1 in [3], the estimation process for a shap value of feature $i$ from instance $x$ involves taking a weighted average of the contribution of feature $i$ to the model output, where the weighting takes into account all the possible orderings in which the previous and successor features can be added to the set. This computation is thus performed by choosing subsets of features from the full feature set and setting the values of these features to a background value; the prediction on these perturbed samples is used in a least squares objective (Theorem 2), weighted by the Shapley kernel. Note that the optimisation objective involves a summation over all possible subsets. Enumerating all the feature subsets has exponential computational cost, so the smaller the feature set, the more samples can be drawn and more accurate shap values can be estimated. Thus, grouping the features can serve to reduce the variance of the shap values estimation by providing a smaller set of features to choose from.
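
For reference, the Shapley kernel used in the weighted least squares objective of Theorem 2 in [3] is

$$\pi_{x'}(z') = \frac{M - 1}{\binom{M}{|z'|}\,|z'|\,(M - |z'|)},$$

where $M$ is the number of features and $|z'|$ is the number of features present in the subset $z'$.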

(2): This is a kwarg to the shap_values method.

(3): Note, however, that the progress bars below show different runtimes for the two methods. No accurate timing analysis was carried out to study this aspect.

(4): Note that the shap library currently does not support grouping when the data is represented as a sparse matrix, so it should be converted to a numpy.ndarray object, both during explainer initialisation and when calling the shap_values method.
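
A minimal sketch of the conversion, assuming the preprocessor and X_train objects from the pipeline above:

X_train_proc = preprocessor.transform(X_train)

# The grouping functionality requires a dense array, so convert if the output is sparse.
if hasattr(X_train_proc, 'toarray'):
    X_train_proc = X_train_proc.toarray()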

(5): When link='logit' is passed to the plotting function, the model outputs are scaled to the probability space, so the inverse logit transformation is applied to the data and axis ticks. This is in contrast to passing link='logit' to the KernelExplainer, which maps the model output through the forward logit transformation, $\log \left( \frac{p}{1-p} \right)$.

(6): We could alter the base value by specifying the new_base_value argument to shap.decision_plot. Note that this argument has to be specified in the same units as the explanation - if we explained the instances in margin space then to switch the base value of the plot to, say, p=0.4 then we would pass new_base_value = log(0.4/(1 - 0.4)) to the plotting function.
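
As a small worked example (using the logit helper imported at the top of the notebook), the base value corresponding to p=0.4 could be computed as:

new_base_value = logit(0.4)  # log(0.4 / (1 - 0.4)) ≈ -0.405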

(7): In this context, bias refers to the bias-variance tradeoff; a simpler model will likely incur a larger error during training but will have a smaller generalisation gap compared to a more complex model which will have smaller training error but will generalise poorly.

References

[1] Mahto, K.K., 2019. "One-Hot-Encoding, Multicollinearity and the Dummy Variable Trap". Retrieved 02 Feb 2020 (link)

[2] Mood, C., 2017. "Logistic regression: Uncovering unobserved heterogeneity."

[3] Lundberg, S.M. and Lee, S.I., 2017. A unified approach to interpreting model predictions. In Advances in neural information processing systems (pp. 4765-4774).

[4] Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N. and Lee, S.I., 2020. From local explanations to global understanding with explainable AI for trees. Nature machine intelligence, 2(1), pp.56-67.

[5] Lundberg, S.M., Nair, B., Vavilala, M.S., Horibe, M., Eisses, M.J., Adams, T., Liston, D.E., Low, D.K.W., Newman, S.F., Kim, J. and Lee, S.I., 2018. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature biomedical engineering, 2(10), pp.749-760.
