```r
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 6,
  fig.asp = 0.618,
  out.width = "100%"
)
```
The Shapley value [@shapley-2016-value] is an idea from coalitional/cooperative game theory. In a coalitional game (i.e., a competitive game between groups of players called coalitions), assume we have $p$ players that form a grand coalition ($S$) worth a certain payout ($\Delta_S$). Suppose we also know how much any smaller coalition ($Q \subseteq S$) (i.e., any subset of the $p$ players) is worth ($\Delta_Q$). The goal is to distribute the total payout $\Delta_S$ among the individual $p$ players in a "fair" way; that is, so that each player receives their "fair" share. The Shapley value is one such solution and the only one that uniquely satisfies a particular set of "fairness properties."
Let $v$ be a characteristic function (or mapping) that assigns a value to each subset of players; in particular, $v : 2^p \rightarrow \mathbb{R}$, where $v\left(S\right) = \Delta_S$, $v\left(\emptyset\right) = 0$, and $\emptyset$ is the empty set (i.e., zero players). Let $\phi_i\left(v\right)$ be the contribution (or portion of the total payout) attributed to player $i$ in a particular game with total payout $v\left(S\right) = \Delta_S$. The Shapley values $\left\{\phi_i\right\}_{i=1}^p$ satisfy the following four properties:
Efficiency: $\sum_{i = 1} ^ p \phi_i\left(v\right) = \Delta_S$.
Null player: $\forall W \subseteq S \setminus \left\{i\right\}: \Delta_W = \Delta_{W \cup \left\{i\right\}} \implies \phi_i\left(v\right) = 0$.
Symmetry: $\forall W \subseteq S \setminus \left\{i, j\right\}: \Delta_{W \cup \left\{i\right\}} = \Delta_{W \cup \left\{j\right\}} \implies \phi_i\left(v\right) = \phi_j\left(v\right)$.
Linearity: If $v$ and $w$ are functions describing two coalitional games, then $\phi_i\left(v + w\right) = \phi_i\left(v\right) + \phi_i\left(w\right)$.
The above properties can be interpreted as follows: 1) the individual player contributions sum to the total payout, hence, are implicitly normalized; 2) if a player does not contribute to any coalition they receive a payout of zero; 3) if two players have the same impact across all coalitions, they receive equal payout; and 4) the local contributions are additive across different games.
@shapley-2016-value showed that the unique solution satisfying the above properties is given by
\begin{equation}
\phi_i\left(x\right) = \frac{1}{p!} \sum_{\mathcal{O} \in \pi\left(p\right)} \left[v\left(S^\mathcal{O} \cup i\right) - v\left(S^\mathcal{O}\right)\right], \quad i = 1, 2, \dots, p,
(\#eq:shapley-value)
\end{equation}
where $\mathcal{O}$ is a specific permutation of the player indices $\left\{1, 2, \dots, p\right\}$, $\pi\left(p\right)$ is the set of all such permutations of size $p$, and $S^\mathcal{O}$ is the set of players joining the coalition before player $i$.
In other words, the Shapley value is the average marginal contribution of a player across all possible coalitions in a game. Another way to interpret Equation \@ref(eq:shapley-value) is as follows. Imagine the coalitions (i.e., subsets of players) being formed one player at a time (which can happen in different orders), with the $i$-th player demanding a fair contribution/payout of $v\left(S^\mathcal{O} \cup i\right) - v\left(S^\mathcal{O}\right)$. The Shapley value for player $i$ is given by the average of this contribution over all possible permutations in which the coalition can be formed.
A simple example may help clarify the main ideas. Suppose three friends (players)---Alex, Brad, and Brandon---decide to go out for drinks after work (the game). They shared a few pitchers of beer, but nobody paid attention to how much each person drank (collaborated). What's a fair way to split the tab (total payout)? Suppose we knew the following information, perhaps based on historical data:
If Alex drank alone, he'd only pay \$10.
If Brad drank alone, he'd only pay \$20.
If Brandon drank alone, he'd only pay \$10.
If Alex and Brad drank together, they'd only pay \$25.
If Alex and Brandon drank together, they'd only pay \$15.
If Brad and Brandon drank together, they'd only pay \$13.
If Alex, Brad, and Brandon drank together, they'd only pay \$30.
Note that $S = \left\{\text{Alex}, \text{Brad}, \text{Brandon}\right\}$ and that $\Delta_S = \$30$. With only three players, we can enumerate all possible coalitions. In Table `r knitr::asis_output(ifelse(knitr::is_html_output(), '\\@ref(tab:drinks-html)', '\\@ref(tab:drinks-pdf)'))`, we list out all possible permutations of the three players along with the marginal contribution of each. Take the first row, for example. In this particular permutation, we start with Alex. We know that if Alex drinks alone, he'd spend \$10, so his marginal contribution by entering first is \$10. Next, we assume Brad enters the coalition. We know that if Alex and Brad drank together, they'd pay a total of \$25, leaving \$15 left over for Brad's marginal contribution. Similarly, if Brandon joins the party last, his marginal contribution would be only \$5 (the difference between \$30 and \$25). The Shapley value for each player is the average marginal contribution across all six possible permutations (these are the column averages reported in the last row).
```r
library(kableExtra)

# Generate data frame of table values
combos <- c(
  "Alex, Brad, Brandon",
  "Alex, Brandon, Brad",
  "Brad, Alex, Brandon",
  "Brad, Brandon, Alex",
  "Brandon, Alex, Brad",
  "Brandon, Brad, Alex",
  "Shapley contribution:"
)
tab <- data.frame(
  `Permutation/order of players` = combos,
  "Alex" = scales::dollar(c(10, 10, 5, 17, 5, 17, 10.67)),
  "Brad" = scales::dollar(c(15, 15, 20, 20, 15, 3, 14.67)),
  "Brandon" = scales::dollar(c(5, 5, 5, -7, 10, 10, 4.67)),
  check.names = FALSE
)

# Table caption
cap <- paste(
  "Marginal contribution for each permutation of the players",
  "{Alex, Brad, Brandon} (i.e., the order in which they arrive). The Shapley",
  "contribution is the average marginal contribution across all permutations.",
  "(Notice how each row sums to the total bill of $30.)"
)
```
knitr::kable(tab, format = "html", align = c("lrrr"), caption = cap) %>% row_spec(6, hline_after = TRUE)
# cap <- gsub("\\{", replacement = "\\\\{", x = cap) # cap <- gsub("\\}", replacement = "\\\\}", x = cap) cap <- gsub("\\$", replacement = "\\\\$", x = cap) knitr::kable(tab, format = "latex", align = c("lrrr"), caption = cap, booktabs = TRUE, linesep = "", position = "!htb") %>% row_spec(6, hline_after = TRUE)
In this example, Brandon would get away with the smallest payout (i.e., have to pay the smallest portion of the total tab). The next time the bartender asks how you want to split the tab, whip out a pencil and do the math! In the next section, we'll show how the same idea can be used to help quantify the contribution each feature value makes to its corresponding prediction in a machine learning model.
@strumbelj-2014-explaining suggested using the Shapley value in Equation \@ref(eq:shapley-value) to help explain predictions from a machine learning model. In the context of machine learning:
a game is represented by the prediction task for a single observation $\boldsymbol{x} = \left(x_1, x_2, \dots, x_p\right)$ (i.e., there are $p$ features in total);
the total payout/worth ($\Delta_S$) for $\boldsymbol{x}$ is the prediction for $\boldsymbol{x}$, denoted $\hat{f}\left(\boldsymbol{x}\right)$, minus the average prediction for all training observations (call this the baseline prediction, which we'll denote by $\bar{f}$);
the players are the individual feature values of $\boldsymbol{x}$ that collaborate to receive the payout (i.e., $\hat{f}\left(\boldsymbol{x}\right) - \bar{f}$).
From the last point, it's important to note that Shapley explanations are not trying to quantify the contribution each feature value in $\boldsymbol{x}$ makes to its prediction $\hat{f}\left(\boldsymbol{x}\right)$, but rather to the quantity $\hat{f}\left(\boldsymbol{x}\right) - \bar{f}$, the difference between its prediction and the baseline. This seems to be a common source of confusion in interpreting a set of Shapley explanations from a given model.
In the following sections, we'll discuss several popular ways to compute Shapley values in practice.
The challenge of using Shapley values for the purpose of explaining predictions is in defining the functional form of $v$. As discussed in @chen-2020-true, there are several ways to do this. However, since we are primarily interested in understanding how much each feature contributed to a particular prediction, $v$ is typically related to a conditional expectation of the model's prediction. @chen-2020-true make the distinction between two possibilities, each of which differs in their conditioning argument. The Shapley value implementations discussed in this paper (e.g., MC SHAP and Tree SHAP) rely on what @chen-2020-true call the interventional conditional expectation, which can be expressed using Pearl's $do\left(\cdot\right)$ operator [@pearl-2009-causality]:
\begin{equation}
\begin{split}
v\left(S\right) &= \mathbb{E}\left[f\left(\boldsymbol{x}_S, \boldsymbol{x}_{S^c}\right) | do\left(\boldsymbol{x}_S\right)\right] \\
&= \int f\left(\boldsymbol{x}_S, \boldsymbol{x}_{S^c}\right) p\left(\boldsymbol{x}_{S^c}\right) d \boldsymbol{x}_{S^c},
\end{split}
(\#eq:ice)
\end{equation}
where $S^c$ is the complement of $S$, $\boldsymbol{x}_S$ and $\boldsymbol{x}_{S^c}$ are the sets of features in $S$ and $S^c$, respectively, and $p\left(\boldsymbol{x}_{S^c}\right)$ is the joint probability density of $\boldsymbol{x}_{S^c}$. Equation \@ref(eq:ice) can be interpreted as the expected value of $f\left(\boldsymbol{x}\right)$ given some intervention on the features in $S$, which assumes independence between $\boldsymbol{x}_S$ and $\boldsymbol{x}_{S^c}$; a similar assumption is also used in the construction of partial dependence plots [@friedman-2001-greedy], with the connection to Pearl's $do\left(\cdot\right)$ operator established in @zhao-2021-causal. Shapley values based on this formulation of $v$ are referred to as interventional Shapley values [@chen-2020-true]. The various Shapley value algorithms discussed over the next several sections fall under this form.
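To make the connection to Equation \@ref(eq:ice) concrete, here is a minimal sketch (our own, not from any package) of a Monte Carlo estimate of the interventional value function: fix the features in $S$ at $\boldsymbol{x}$'s values and average the model's predictions over background draws for the remaining features. The function and argument names (`v.hat`, `f`, etc.) are hypothetical.

```r
# Rough Monte Carlo estimate of v(S): intervene on the features in S (set them
# to x's values) and average predictions over random draws for the complement.
# Here, `f` is a prediction function that accepts a data frame, `x` is the
# (single-row) instance being explained, `S` is a character vector of feature
# names, and `X` is a sample of training feature values.
v.hat <- function(f, x, S, X, nsim = 100) {
  bg <- X[sample(nrow(X), size = nsim, replace = TRUE), , drop = FALSE]
  for (col in S) bg[[col]] <- x[[col]]  # do(x_S): features in S fixed at x's values
  mean(f(bg))  # average prediction over the background sample
}
```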
The following sections detail several algorithms for estimating Shapley explanations in practice.
Computing the exact Shapley value is computationally infeasible, even for moderately large $p$. To that end, @strumbelj-2014-explaining suggest a Monte Carlo approximation, which we'll call SampleSHAP (or MC SHAP)^[Technically, the term SHAP (SHapley Additive exPlanations) was coined by @lundberg-2017-KernelSHAP; we're being a bit loose with the terminology here and use it to refer to Shapley-value-based feature attributions in general.], that assumes independent features^[While SampleSHAP, along with many other common Shapley value procedures, assumes independent features, several arguments can be made in favor of this assumption; see, for example, @chen-2020-true and the references therein.]. Their approach is described in Algorithm 1 below.
`r knitr::asis_output(if (knitr::is_latex_output()) '\\noindent')`
Algorithm 1: Approximating the $i$-th feature's contribution to $f\left(\boldsymbol{x}\right)$.
Here, a single estimate of the contribution of $x_i$ to $f\left(\boldsymbol{x}\right) - \bar{f}$ is nothing more than the difference between two predictions, each based on a "Frankenstein instance"^[The terminology used here takes inspiration from @molnar-2019-iml (p. 231).] constructed by swapping out values between the instance being explained ($\boldsymbol{x}$) and an instance selected at random from the training data. To help stabilize the results, the procedure is repeated a large number of times, say $R$, and the results averaged together. Note that Algorithm 1 can be parallelized across features or MC repetitions.
If there are $p$ features and $m$ instances to be explained, this requires $2 \times R \times p \times m$ predictions (or calls to a scoring function). In practice, this can be quite computationally demanding, especially since $R$ needs to be large enough to produce good approximations to each $\phi_i\left(x\right)$. How large does $R$ need to be to produce accurate explanations? It depends on the variance of each feature in the observed training data, but typically $R \in \left[30, 100\right]$ will suffice. In a later section, we'll discuss a particularly optimized implementation of Algorithm 1 that only requires $2mp$ calls to a scoring function.
Even with certain optimizations or parallel processing, MC SHAP can be computationally prohibitive if you need to explain a large number of predictions. Fortunately, you often only need to explain a handful of predictions, for example, the most extreme ones. However, generating individual explanations for the entire training set, or a large enough sample thereof, can be useful for generating aggregated (i.e., global) model summaries, like Shapley-based variable importance plots [@lundberg-2020-treeshap].
A simple R implementation of Algorithm 1 is given below. Here, `obj` is a fitted model with scoring function `f()` (e.g., `predict()`), `R` is the number of MC repetitions to perform, `x` is the instance (i.e., single row of feature values) being explained, `feature` gives the name of the feature in `x` whose contribution we want to estimate, and `X` is the training set of features.
```r
sample.shap <- function(f, obj, R, x, feature, X) {
  phi <- numeric(R)  # to store Shapley value estimates
  N <- nrow(X)       # sample size
  p <- ncol(X)       # number of features
  b1 <- b2 <- x      # initialize "Frankenstein" instances
  for (m in seq_len(R)) {
    w <- X[sample(N, size = 1), ]  # sample random observation from X
    ord <- sample(names(w))        # random permutation of features
    swap <- ord[seq_len(which(ord == feature) - 1)]  # features preceding `feature`
    b1[swap] <- w[swap]                          # b1 keeps x's value for `feature`
    b2[c(swap, feature)] <- w[c(swap, feature)]  # b2 swaps in w's value for `feature`
    phi[m] <- f(obj, newdata = b1) - f(obj, newdata = b2)  # marginal contribution
  }
  mean(phi)  # return approximate feature contribution
}
```
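As a quick (hypothetical) usage example, the code below applies `sample.shap()` to a simple linear model fit to the built-in `mtcars` data; any fitted model with a compatible scoring function would work the same way.

```r
fit <- lm(mpg ~ ., data = mtcars)  # toy model to explain
pfun <- function(object, newdata) predict(object, newdata = newdata)

set.seed(101)  # for reproducibility
sample.shap(pfun, obj = fit, R = 100, x = mtcars[1, -1], feature = "wt",
            X = mtcars[, -1])  # contribution of `wt` to the first row's prediction
```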
First, let's discuss how a feature's value contributes to a prediction $f\left(\boldsymbol{x}\right)$ in an additive linear model with independent features. That is, let's assume for a moment that $f$ takes the form
\begin{equation}
\nonumber
f\left(\boldsymbol{x}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p.
\end{equation}
Recall that the contribution of $x_i$ (the $i$-th feature component of $\boldsymbol{x}$) to the prediction $f\left(\boldsymbol{x}\right)$ is the difference between $f\left(\boldsymbol{x}\right)$ and the expected prediction if the $i$-th feature's value were not known:
\begin{equation}
\nonumber
\begin{split}
\phi_i\left(\boldsymbol{x}\right) &= \beta_0 + \dots + \beta_i x_i + \dots + \beta_p x_p \\
&\quad\quad - \left(\beta_0 + \dots + \beta_i \bar{x}_i + \dots + \beta_p x_p\right) \\
&= \beta_i \left(x_i - \bar{x}_i\right),
\end{split}
\end{equation}
where, for example, $\bar{x}_i$ corresponds to the sample mean of the $i$-th feature's values in the training sample. For a proof, see @aas-2020-explaining. The quantity $\phi_i\left(\boldsymbol{x}\right)$ is also referred to as the situational importance of $x_i$ [@achen-1982-interpreting].
Note that if you're using R, then $\beta_i \left(x_i - \bar{x}_i\right)$ is exactly what's returned by R's `predict()` method when applied to `lm`/`glm` models, provided you specify `type = "terms"`; see `?predict.lm` for details.
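Here's a quick sanity check of that claim using a toy linear model; the model and data are our own choices and are not part of the original analysis.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
x <- mtcars[1, ]  # instance to explain

# Per-term contributions reported by predict(); the `wt` column should equal
# beta_wt * (x_wt - mean(wt))
predict(fit, newdata = x, type = "terms")
coef(fit)["wt"] * (x$wt - mean(mtcars$wt))  # contribution of `wt` by hand
```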
Kernel SHAP [@lundberg-2017-KernelSHAP] uses a specially weighted local linear regression to estimate Shapley values from a given model. It is model-agnostic in the sense that it can be applied in the same way to any type of supervised learning model.
In the Kernel SHAP formulation, the computation is represented as a weighted linear model, with the weights determined by a specific Shapley kernel.
Kernel SHAP samples coalitions $z_k' \in \{0, 1\}^M$, with $k \in \{1, \ldots, K\}$, where '0' signifies that a feature is absent and '1' that it is present. A function $h_x: Z \mapsto X$ maps coalitions to the feature space, making it possible to get the predictions for the sampled coalitions: $f(h_x(z_k'))$. A data set is generated by sampling coalitions and computing their model predictions. A weighted linear model is then fitted with the coalition vectors as features and the model predictions as the target. The weights are given by the kernel:
$$\pi_{x}(z')=\frac{(M-1)}{\binom{M}{|z'|}|z'|(M-|z'|)}$$
where $M$ is the maximum coalition size (for tabular data, the number of features) and $|z'|$ is the number of 1's in the coalition vector (i.e., the number of features that are present). The function $h_x$ maps elements of the coalition vector with a 1 to the original feature vector, and the elements with a 0 to the respective feature values of a randomly sampled data point. The estimated coefficients in this weighted linear model can be interpreted as Shapley value estimates. The accuracy of the estimates depends on the size of the coalitions that are sampled. Like SampleSHAP and Linear SHAP, Kernel SHAP assumes independent features. Kernel SHAP reportedly requires less computational power than SampleSHAP to obtain similar approximation accuracy (i.e., it needs fewer replications).
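To make the recipe above concrete, here is a rough, hypothetical sketch of the core idea in R: sample coalition vectors, build the corresponding hybrid instances via $h_x$, and fit a weighted linear model using the Shapley kernel weights. This is not the \pkg{shap} library's algorithm (which, among other things, enforces the efficiency constraint and enumerates small coalitions exactly); all function and argument names below are ours.

```r
# `f` is a scoring function f(obj, newdata), `x` the (single-row) instance
# being explained, and `X` the training features (a data frame)
kernel.shap.sketch <- function(f, obj, x, X, K = 500) {
  M <- ncol(X)  # number of features
  Z <- matrix(rbinom(K * M, size = 1, prob = 0.5), nrow = K,
              dimnames = list(NULL, colnames(X)))  # sampled coalition vectors
  Z <- Z[rowSums(Z) > 0 & rowSums(Z) < M, , drop = FALSE]  # kernel undefined at 0 and M
  sz <- rowSums(Z)  # coalition sizes |z'|
  w <- (M - 1) / (choose(M, sz) * sz * (M - sz))  # Shapley kernel weights
  preds <- apply(Z, 1, FUN = function(z) {  # f(h_x(z')) for each coalition
    h <- X[sample(nrow(X), size = 1), ]  # random background instance
    h[z == 1] <- x[z == 1]  # present features take x's values
    f(obj, newdata = h)
  })
  fit <- lm.wfit(x = cbind("(Intercept)" = 1, Z), y = preds, w = w)
  fit$coefficients[-1]  # approximate Shapley values (intercept ~ baseline)
}
```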
Kernel SHAP, or at least its implementation in the Python \pkg{shap} library, distributes the number of replications unevenly across the different features; features with higher variance are apparently allocated more replications. Kernel SHAP has also been extended to handle dependent features in \citet{aas-2020-explaining}, and that extension is available in the R package \CRANpkg{shapr} \citep{R-shapr}.
SHAP decomposes a prediction into
\begin{equation}
f\left(x\right) = \phi_0 + \sum_{i = 1} ^ p \phi_i,
(\#eq:shap)
\end{equation}
where $f\left(x\right)$ is the prediction for $x$, $\phi_0 = \mathbb{E}\left[f\left(x\right)\right]$ is the expected prediction, $p$ is the number of predictors, and $\phi_i$ is the contribution of the $i$-th feature. It should be noted that the $\phi_i$ ($i = 1, 2, \dots, p$) depend on the observation $x$, whereas $\phi_0$ is constant. From Equation \@ref(eq:shap), it should be clear that $\sum_{i = 1}^p \phi_i = f\left(x\right) - \phi_0$. In other words, SHAP values help to explain the difference between a particular prediction and the global average prediction. The quantity $\phi_0$ is often referred to as the baseline prediction and is estimated in practice using the average prediction across all $N$ training observations (i.e., the quantity we denoted $\bar{f}$ earlier).
@covert-2021-improving (and the corresponding GitHub repo: \url{https://github.com/iancovert/shapley-regression}) offer a nice discussion on Kernel SHAP and some insight into its properties.
Shapley values can also be characterized as the solution to a weighted least squares problem; see @charnes-1988-extremal for details.
Unlike sampling-based approaches (e.g., MC SHAP), Kernel SHAP does not provide estimates of uncertainty. An improved version of Kernel SHAP, which is unbiased and has better convergence properties, was proposed in @covert-2021-improving.
Tree SHAP [@lundberg-2020-treeshap] is a model-specific approach that is only applicable to tree-based models, and it is currently implemented for only a few algorithms (e.g., XGBoost and LightGBM). Compared to the model-agnostic approaches discussed above, Tree SHAP assumes "less" feature independence in the sense that it accounts for some of the dependence between features, but not all [@aas-2020-explaining].
Probably the first, and most widely used, implementation of Shapley explanations is the Python \pkg{shap} library [@lundberg-2017-KernelSHAP], which provides Python implementations of SampleSHAP, Kernel SHAP, Tree SHAP, and a few other model-specific Shapley methods (e.g., DeepSHAP, which provides approximate Shapley values for deep learning models).
The \CRANpkg{shapper} package [@R-shapper] provides an R interface to the Python \pkg{shap} library using \CRANpkg{reticulate} \citep{R-reticulate}; however, it currently only supports Kernel SHAP (\pkg{shap} itself additionally supports MC SHAP, Tree SHAP, Linear SHAP, as well as various other model-specific Shapley explanation methods).
There are several R packages available for computing Shapley-based feature contributions. You can perform a quick search for CRAN packages related to Shapley values using the \CRANpkg{pkgsearch} package [@R-pkgsearch]:
pkgsearch::ps("Shapley") # set `format = "long"` for more detailed results
While we won't demonstrate its use, it's worth pointing readers to the \pkg{shapr} package \citep{R-sellereite}. As previously discussed, one drawback of traditional Shapley values (like the ones computed by the MC SHAP procedure) is the assumption of independent features (an assumption made by many IML procedures, in fact). To that end, the \pkg{shapr} package implements Shapley explanations that can account for the dependence between features \citep{aas-2019-explaining}, resulting in significantly more accurate approximations to the Shapley values. The package also includes an implementation of Kernel SHAP that's consistent with the \pkg{shap} package for Python.
Tree SHAP has been directly incorporated into most implementations of XGBoost \citep{chen-2016-xgboost} (including \CRANpkg{xgboost} \citep{R-xgboost}), CatBoost \citep{eronika-2017-catboost}, and LightGBM \citep{ke-2017-lightgbm}. Both \CRANpkg{fastshap} \citep{R-fastshap} and \CRANpkg{SHAPforxgboost} \citep{R-SHAPforxgboost} provide an interface to \pkg{xgboost}'s TreeSHAP implementation.
The remainder of this article will focus on applying Shapley values to machine learning using a handful of packages: \CRANpkg{fastshap} [@R-fastshap], \CRANpkg{iml} [@R-iml], \CRANpkg{iBreakDown} [@R-iBreakDown], and \CRANpkg{lightgbm} [@R-lightgbm]. The first three packages all provide an implementation of MC SHAP (i.e., Algorithm 1), while the latter includes an implementation of Tree SHAP; note that \CRANpkg{xgboost} [@R-xgboost], an efficient boosting library similar to \pkg{lightgbm}, provides similar Tree SHAP functionality.
\CRANpkg{fastshap} provides an efficient implementation of SampleSHAP, making it a viable option for explaining predictions from models for which efficient model-specific Shapley methods do not exist or are not yet implemented.
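As a quick sketch of the basic interface (using a toy model of our own; the arguments mirror those used in the examples later in this article), `fastshap::explain()` takes the fitted model, the training features, a prediction wrapper, and the number of MC repetitions (`nsim`):

```r
library(fastshap)

fit <- lm(mpg ~ ., data = mtcars)  # toy model
pfun <- function(object, newdata) predict(object, newdata = newdata)
X <- subset(mtcars, select = -mpg)  # feature columns only

set.seed(102)  # for reproducibility
ex <- explain(fit, X = X, pred_wrapper = pfun, nsim = 50, newdata = X[1, ])
ex  # one row of approximate Shapley values per row of `newdata`
```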
The \pkg{iml} package provides the `Shapley()` function, which is a direct implementation of Algorithm 1 and is written using \CRANpkg{R6} [@R-R6]. Moreover, the \pkg{iml} package provides a standard interface to several other interpretable machine learning (IML) algorithms, whence the package name.
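A minimal sketch of \pkg{iml}'s R6 workflow is given below (again using a toy model of our own); a `Predictor` object wraps the model and data, and `Shapley$new()` estimates the contributions for a single instance of interest:

```r
library(iml)

fit <- lm(mpg ~ ., data = mtcars)   # toy model
X <- subset(mtcars, select = -mpg)  # feature columns only

pred <- Predictor$new(fit, data = X, y = mtcars$mpg)
shap <- Shapley$new(pred, x.interest = X[1, ], sample.size = 100)
shap$results  # estimated feature contributions (phi) for the first row
```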
Package \pkg{iBreakDown} implements a general approach to explaining predictions from supervised models, called Break Down [@gosiewska-2019-iBreakDown]. MC SHAP values can be computed as a special case from random Break Down profiles; see `?iBreakDown::shap` for details.
While several of these packages provide their own plotting functions for visualizing the output, the \CRANpkg{shapviz} package [@R-shapviz] provides a generic set of functions for plotting Shapley explanations, with direct support for a number of packages (including \pkg{fastshap}, \pkg{lightgbm}, \pkg{xgboost}, and \pkg{shapr}, to name a few). The package is general enough to be applied to any set of Shapley explanations stored in an ordinary R matrix.
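For instance, here's a tiny (hypothetical) example that feeds \pkg{shapviz} a plain matrix of made-up Shapley values, along with the corresponding feature values and a baseline:

```r
library(shapviz)

S <- matrix(c(0.2, -0.1, 0.05), nrow = 1,
            dimnames = list(NULL, c("x1", "x2", "x3")))  # made-up Shapley values
X <- data.frame(x1 = 1.5, x2 = 0, x3 = "a")  # corresponding feature values
viz <- shapviz(S, X = X, baseline = 0.4)
sv_waterfall(viz)  # waterfall plot for the single explanation
```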
In this section, we'll look at a simple example related to predicting survival on the ill-fated Titanic. We'll use this as an opportunity to introduce all four packages mentioned above. To start, we'll load \pkg{fastshap} and grab one of its imputed versions of the Titanic data:
```r
library(fastshap)

# Use one of fastshap's imputed versions of the Titanic data
head(titanic <- titanic_mice[[1L]])
```
While \pkg{lightgbm} now supports categorical features, it's easier to just re-encode binary variables as 0/1, which we do below. We then construct a matrix (`X`) containing only the feature columns before calling `lightgbm()` to fit a model using log loss (the number of trees, or `nrounds`, was found using 5-fold cross-validation via the `lgb.cv()` function):
```r
library(lightgbm)

# Re-encode binary variables as 0/1
titanic$survived <- ifelse(titanic$survived == "yes", 1, 0)
titanic$sex <- ifelse(titanic$sex == "male", 1, 0)

# Matrix of only predictor values
X <- data.matrix(subset(titanic, select = -survived))

params <- list(
  num_leaves = 10L,
  learning_rate = 0.1,
  objective = "binary"
)

set.seed(1420)  # for reproducibility
bst <- lightgbm(X, label = titanic$survived, params = params, nrounds = 45,
                verbose = 0)
```
To illustrate the simplest use of Shapley values for quantifying feature contributions, we need an observation to predict. While we can use any observation from the training set, we'll construct an observation for a new passenger. Everyone, meet Jack:
```r
jack.dawson <- data.matrix(data.frame(
  #survived = 0L, # in case you haven't seen the movie
  pclass = 3L,  # third-class passenger
  age = 20.0,   # twenty years old
  sex = 1L,     # male
  sibsp = 0L,   # no siblings/spouses aboard
  parch = 0L    # no parents/children aboard
))  # lightgbm doesn't like data frames
```
Note that \pkg{fastshap}, \pkg{iml}, and \pkg{iBreakDown} typically require a predefined prediction wrapper; that is, a simple function that tells each package how to extract the appropriate predictions from the fitted model. In this case, for comparison with Tree SHAP, our prediction wrapper will return the predictions on the raw (i.e., logit) scale:
```r
pfun <- function(object, newdata) {  # prediction wrapper
  predict(object, data = data.matrix(newdata), rawscore = TRUE)
}

# Compute Jack's predicted likelihood of survival
(jack.logit <- pfun(bst, newdata = jack.dawson))  # logit scale
(jack.prob <- plogis(jack.logit))                 # probability scale
```
```r
ex.lightgbm <- predict(bst, data = jack.dawson, predcontrib = TRUE)
colnames(ex.lightgbm) <- c(colnames(X), "baseline")
ex.lightgbm
sum(ex.lightgbm)  # since the baseline is included, this should sum to the prediction
```
The code chunk below uses a set of precomputed timings (stored in `data/benchmark.rds`) to compare how long it takes \pkg{iBreakDown}, \pkg{iml}, and \pkg{fastshap} to compute MC SHAP explanations as the number of MC repetitions increases:

```r
benchmark <- readRDS("data/benchmark.rds")
maxy <- max(benchmark$iBreakDown, benchmark$iml, benchmark$fastshap)
palette("Okabe-Ito")
plot(iBreakDown ~ nreps, data = benchmark, type = "b", las = 1,
     xlab = "Number of Monte Carlo repetitions",
     ylab = "Time (in seconds)", ylim = c(0, maxy))
lines(iml ~ nreps, data = benchmark, type = "b", col = 2)
lines(fastshap ~ nreps, data = benchmark, type = "b", col = 3)
legend("topleft", legend = c("iBreakDown", "iml", "fastshap"),
       lty = c(1, 1, 1), col = 1:3, inset = 0.02)
palette("default")
```
In this example, for $R = 1000$ MC repetitions, \pkg{fastshap} is roughly `r max(benchmark$iml) / max(benchmark$fastshap)` times faster than \pkg{iml}, and nearly `r max(benchmark$iBreakDown) / max(benchmark$fastshap)` times faster than \pkg{iBreakDown}.
In this example, we'll do something a bit more interesting. Rather than demonstrating the use of \pkg{fastshap} on an ordinary prediction task, let's use it to help explain the output from a probabilistic regression framework that's (currently) only available in Python. To illustrate, we'll look at a brief example using the ALS data from \citet[p.~349]{efron-2016-computer}. A description of the data, along with the original source and download instructions, can be found at \url{https://web.stanford.edu/~hastie/CASI/}.
The data concern $N = 1,822$ observations on amyotrophic lateral sclerosis (ALS or Lou Gehrig's disease) patients. The goal is to predict ALS progression over time, as measured by the slope (or derivative) of a functional rating score (`dFRS`), using 369 available predictors obtained from patient visits. The data were originally part of the DREAM-Phil Bowen ALS Predictions Prize4Life challenge. The winning solution [@kuffner-2015-als] used a tree-based ensemble quite similar to a random forest [@breiman-2001-rf], while @efron-2016-computer (Chap. 17) analyzed the data using a gradient boosted tree ensemble [@friedman-2001-greedy; @friedman-2002-stochastic].
Many classification tasks are inherently probabilistic. For example, probability forests [@malley-2012-consistent] can be used to obtain consistent probability estimates for the different class outcomes (i.e., $Pr\left(y = j|\boldsymbol{x}\right)$). Regression tasks, on the other hand, are typically not probabilistic and the predictions correspond to some location estimate of $y|\boldsymbol{x}$; that is, the distribution of $y$ conditional on a set of predictor values $\boldsymbol{x}$. For instance, the terminal nodes in a regression tree---which are used to compute fitted values and predictions---provide an estimate of the conditional mean $E\left(y|\boldsymbol{x}\right)$. Often, it is of scientific interest to know about the probability of specific events conditional on a set of features, rather than a single point estimate like $E\left(y|\boldsymbol{x}\right)$. In the ALS example, rather than using an estimate of the conditional mean $\hat{f}\left(\boldsymbol{x}\right) = \hat{E}\left(\texttt{dFRS}|\boldsymbol{x}\right)$ to predict ALS progression for a new patient, it might be more useful to estimate $Pr\left(\texttt{dFRS} < c | \boldsymbol{x}\right)$, for some constant $c$. This is where probabilistic regression/forecasting comes in.
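For instance (with made-up numbers), if a probabilistic model estimated the conditional distribution of `dFRS` for a particular patient to be normal with mean $-0.7$ and standard deviation $0.3$, then estimating $Pr\left(\texttt{dFRS} < -1.1 | \boldsymbol{x}\right)$ (the threshold used later in this section) is just a CDF evaluation:

```r
# Hypothetical estimated conditional distribution: dFRS | x ~ N(-0.7, 0.3^2)
pnorm(-1.1, mean = -0.7, sd = 0.3)  # estimated Pr(dFRS < -1.1 | x)
```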
Probabilistic regression models provide estimates of the entire probability distribution of the response conditional on a set of predictors, denoted $\mathcal{D}_{\boldsymbol{\theta}}\left(y | \boldsymbol{x}\right)$, where $\boldsymbol{\theta}$ represents the parameters of the conditional distribution; for example, the normal distribution has $\boldsymbol{\theta} = \left(\mu, \sigma\right)$. Examples include generalized additive models for location, scale, and shape (GAMLSS) [@rigby-2005-gamlss], Bayesian additive regression trees (BART) [@chipman-2010-bart], and Bayesian deep learning. While several approaches to probabilistic regression exist, many of them are inflexible (e.g., GAMLSS), computationally expensive (e.g., BART), or inaccessible to non-experts (e.g., Bayesian deep learning) [@duan-2020-ngboost]. Natural gradient boosting (NGBoost) extends the simple ideas of gradient boosting to probabilistic regression by treating the parameters $\boldsymbol{\theta}$ as targets for a multiparameter boosting algorithm similar to gradient boosting. We say "multiparameter" because NGBoost fits a separate model for each parameter at every iteration.
The "natural" in "natural gradient boosting" refers to the fact that NGBoost uses something called the natural gradient, as opposed to the ordinary gradient. The natural gradient provides the direction of steepest descent in Riemannian space; this is necessary since gradient descent in the parameter space is not gradient descent in the distribution space because distances don't correspond. The important thing to remember is that NGBoost approximates the gradient of a proper scoring rule---similar to a loss function, but for predicted probabilities and probability distributions of the observed data---as a function of $\boldsymbol{\theta}$. Compared to alternative probabilistic regression methods, NGBoost is fast, flexible, scalable, and easy to use. NGBoost is available in the ngboost
package for Python. For more info, check out the NGBoost GitHub repository at https://github.com/stanfordmlgroup/ngboost.
To start, we'll read in the data from the companion website to @efron-2016-computer. Note that the data already include an indicator for training and validation, so we'll go ahead and split the data into train/validation sets:
```r
als <- read.table("https://web.stanford.edu/~hastie/CASI_files/DATA/ALS.txt",
                  header = TRUE)

# Split into train/validation sets
als.trn <- als[!als$testset, -1]  # train
als.val <- als[als$testset, -1]   # validation

# Print dimensions
dim(als.trn)
dim(als.val)
```
Next, we'll use \CRANpkg{reticulate} [@R-reticulate] to load the Python `ngboost` module:
```r
library(reticulate)

ngboost <- import("ngboost")  # requires installation of ngboost

# Construct an NGBoost regressor object
ngb <- ngboost$NGBRegressor(
  Dist = ngboost$distns$Normal,
  n_estimators = 2000L,
  learning_rate = 0.01,
  verbose_eval = 0,
  random_state = 1601L
)
```
In the next chunk, we call the `ngb` object's `fit()` method to actually train the model (the validation set is used with early stopping to determine the optimal number of trees in the ensemble):
```r
X.trn <- subset(als.trn, select = -dFRS)  # features only
X.val <- subset(als.val, select = -dFRS)  # features only

# Train the model
ngb$fit(
  X = X.trn,
  Y = als.trn$dFRS,
  X_val = X.val,
  Y_val = als.val$dFRS,
  early_stopping_rounds = 5L
)
```
```r
# Point prediction for the first validation observation
ngb$predict(X.val[1, ])

# Estimated distribution parameters for the first validation observation
(params <- ngb$pred_dist(X.val[1, ])$params)
```
The code chunk below generates a plot of the estimated cumulative distribution function (i.e., $Pr\left(\texttt{dFRS} \le t\right)$) for the first observation in the validation set; see Figure \@ref(fig:als-ngb-normal).
```r
plot(function(x) pnorm(x, mean = params$loc, sd = params$scale),
     from = min(als$dFRS), to = max(als$dFRS),
     xlab = "dFRS", ylab = "Cumulative probability")
```
Since `ngboost` is built on top of `sklearn` [@scikit-learn], we can actually use the Python `shap` package [@lundberg-2017-shap] to create efficient explanations for the model. Below, we construct a `TreeExplainer()` object for the fitted model, compute SHAP values for the training set, and generate some global interpretations: a variable importance plot and a SHAP dependence plot:
```r
library(ggplot2)
library(shapviz)

# Use the 'shap' package to compute SHAP values for the entire training set
shap <- import("shap")
explainer <- shap$TreeExplainer(ngb, model_output = 0L)
ex.trn <- explainer$shap_values(X.trn)  # training explanations
colnames(ex.trn) <- colnames(X.trn)

# Construct variable importance and SHAP dependence plots
viz <- shapviz(ex.trn, X = X.trn)
p1 <- sv_importance(viz) + theme_bw()
p2 <- sv_dependence(viz, v = "Onset.Delta", alpha = 0.3) + theme_bw()
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
A `dFRS` of less than -1.1 is considered to be fast progression [@kuffner-2015-als]. Hence, it could be useful to estimate the probability of `dFRS < -1.1` and provide an explanation for any individual whose corresponding probability is considered high. The prediction wrapper (`pfun()`) defined below computes the cumulative probability $Pr\left(\texttt{dFRS} < -1.1 | \boldsymbol{x}\right)$. We use it to determine the observation in the validation set with the highest cumulative probability:
```r
pfun <- function(object, newdata) {
  dist <- object$pred_dist(newdata)
  pnorm(-1.1, mean = dist$params$loc, sd = dist$params$scale)
}

max(prob.val <- pfun(ngb, newdata = X.val))
xval.max <- X.val[which.max(prob.val), ]  # obs with highest predicted prob
```
```r
library(fastshap)

system.time({
  set.seed(1110)
  ex <- explain(ngb, X = X.trn, nsim = 100, pred_wrapper = pfun,
                newdata = xval.max, adjust = TRUE)
})

# Visualize with a waterfall plot
viz <- shapviz(ex, X = xval.max, baseline = attr(ex, which = "baseline"))
sv_waterfall(viz, max_display = 10)
```