explain — R Documentation
Computes dependence-aware Shapley values for the observations in x_explain from the specified model, using the method given in approach to estimate the conditional expectations. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values. For an overview of the methodology and capabilities of the package, see the software paper Jullum et al. (2025) or the pkgdown site at norskregnesentral.github.io/shapr/.
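For reference, the quantity being estimated can be written out explicitly. Below is the standard Shapley value formula together with the conditional-expectation value function that makes the explanation dependence-aware (notation assumed here, following the setup in Aas et al. (2021)):

```latex
% Shapley value of feature j for an explicand x^*, with feature set M = {1, ..., m}
\phi_j = \sum_{S \subseteq M \setminus \{j\}}
  \frac{|S|!\,(m - |S| - 1)!}{m!}
  \left( v(S \cup \{j\}) - v(S) \right),
\qquad
% dependence-aware value function: expected prediction conditional on
% the features in coalition S taking the explicand's values
v(S) = \mathbb{E}\left[ f(\boldsymbol{x}) \mid \boldsymbol{x}_S = \boldsymbol{x}^*_S \right].
```

The approach argument controls how the conditional expectation v(S) is estimated; phi0 corresponds to v(∅).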
explain(
model,
x_explain,
x_train,
approach,
phi0,
iterative = NULL,
max_n_coalitions = NULL,
group = NULL,
n_MC_samples = 1000,
seed = NULL,
verbose = "basic",
predict_model = NULL,
get_model_specs = NULL,
prev_shapr_object = NULL,
asymmetric = FALSE,
causal_ordering = NULL,
confounding = NULL,
extra_computation_args = list(),
iterative_args = list(),
output_args = list(),
...
)
model
Model object. The model whose predictions you want to explain.
x_explain
Matrix or data.frame/data.table. Features for which predictions should be explained.
x_train
Matrix or data.frame/data.table. Data used to estimate the (conditional) feature distributions needed to properly estimate the conditional expectations in the Shapley formula.
approach
Character vector. Specifies the approach(es) used to estimate the conditional expectations; see the Details section for the supported options.
phi0
Numeric. The prediction value for unseen data, i.e., an estimate of the expected prediction without conditioning on any features. Typically set this equal to the mean of the response in the training data, but alternatives such as the mean of the training predictions are also reasonable.
iterative
Logical or NULL. Whether to use the iterative estimation procedure (see Details).
max_n_coalitions
Integer. Upper limit on the number of unique feature/group coalitions used to estimate the Shapley values.
group
List. If provided, Shapley values are computed for the specified groups of features rather than for individual features; each list element is a character vector of feature names (see the group-wise example below).
n_MC_samples
Positive integer. For most approaches, the maximum number of samples used in the Monte Carlo integration of every conditional expectation.
seed
Positive integer. Specifies the seed set before any code involving randomness is run.
verbose
String vector or NULL. Controls the verbosity (printout detail level) of the function.
predict_model
Function. Prediction function to use when model is not natively supported by shapr.
get_model_specs
Function. An optional function for checking model/data consistency.
prev_shapr_object
Output object from a previous explain() call; if provided, the estimation continues from where that object stopped.
asymmetric
Logical. Not applicable for (regular) non-causal explanations. If TRUE, Shapley values are computed using only the coalitions consistent with the (partial) causal ordering given by causal_ordering (see Details).
causal_ordering
List. Not applicable for (regular) non-causal or asymmetric explanations. An unnamed list of feature vectors specifying the components of the (partial) causal ordering, from first to last.
confounding
Logical vector. Not applicable for (regular) non-causal or asymmetric explanations. Specifies whether confounding is assumed for each component of the causal ordering.
extra_computation_args
Named list. Specifies extra arguments related to the computation of the Shapley values.
iterative_args
Named list. Specifies the arguments for the iterative estimation procedure.
output_args
Named list. Specifies certain arguments related to the output of the function.
...
Further arguments passed on to the approach-specific setup functions.
The shapr package implements kernelSHAP estimation of dependence-aware Shapley values with eight different Monte Carlo-based approaches for estimating the conditional distributions of the data. These are all introduced in the general usage vignette (from R: vignette("general_usage", package = "shapr")). For an overview of the methodology and capabilities of the package, please also see the software paper Jullum et al. (2025). Moreover, Aas et al. (2021) gives a general introduction to dependence-aware Shapley values and the approaches "empirical", "gaussian", and "copula", and also discusses "independence". Redelmeier et al. (2020) introduces the "ctree" approach. Olsen et al. (2022) introduces the "vaeac" approach. The "timeseries" approach is discussed in Jullum et al. (2021). shapr has also implemented two regression-based approaches, "regression_separate" and "regression_surrogate", as described in Olsen et al. (2024). It is also possible to combine the different approaches; see the general usage vignette for more information.
The package also supports the computation of causal and asymmetric Shapley values, as introduced by Heskes et al. (2020) and Frye et al. (2020). Asymmetric Shapley values were proposed by Frye et al. (2020) as a way to incorporate real-world causal knowledge by restricting the possible feature combinations/coalitions when computing the Shapley values to those consistent with a (partial) causal ordering. Causal Shapley values were proposed by Heskes et al. (2020) as a way to explain the total effect of features on the prediction, taking into account their causal relationships, by adapting the sampling procedure in shapr.
The package allows parallelized computation with progress updates through the tightly connected future::future and progressr::progressr packages. See the examples below.
For iterative estimation (iterative = TRUE), intermediate results may be printed to the console (according to the verbose argument). Moreover, the intermediate results are written to disk. This combined batch computation of the v(S) values enables fast and accurate estimation of the Shapley values in a memory-friendly manner.
Object of class c("shapr", "list"). Contains the following items:
shapley_values_est
data.table with the estimated Shapley values, with the explained observations in the rows and the features along the columns. The column none is the part of the prediction not attributed to any of the features (given by the argument phi0).
shapley_values_sd
data.table with the standard deviations of the Shapley values, reflecting the uncertainty in the coalition-sampling part of the kernelSHAP procedure. These are, by definition, 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.
internal
List with the different parameters, data, functions, and other output used internally.
pred_explain
Numeric vector with the predictions for the explained observations.
MSEv
List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage vignette for details.
timing
List containing timing information for the different parts of the computation. summary contains the time stamps for the start and end time, in addition to the total execution time. overall_timing_secs gives the time spent on the different parts of the explanation computation. main_computation_timing_secs further decomposes the main computation time into the different parts of the computation for each iteration of the iterative estimation routine, if used.
Martin Jullum, Lars Henry Berge Olsen
# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into test and training data
data_train <- head(airquality, -3)
data_explain <- tail(airquality, 3)
x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]
# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)
# Explain predictions
p <- mean(data_train[, y_var])
# (Optionally) enable parallelization via the future package
if (requireNamespace("future", quietly = TRUE)) {
future::plan("multisession", workers = 2)
}
# (Optionally) enable progress updates within every iteration via the progressr package
if (requireNamespace("progressr", quietly = TRUE)) {
progressr::handlers(global = TRUE)
}
# Empirical approach
explain1 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
phi0 = p,
n_MC_samples = 1e2
)
# Gaussian approach
explain2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = p,
n_MC_samples = 1e2
)
# Gaussian copula approach
explain3 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "copula",
phi0 = p,
n_MC_samples = 1e2
)
if (requireNamespace("party", quietly = TRUE)) {
# ctree approach
explain4 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "ctree",
phi0 = p,
n_MC_samples = 1e2
)
}
# Combined approach
approach <- c("gaussian", "gaussian", "empirical")
explain5 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = approach,
phi0 = p,
n_MC_samples = 1e2
)
## Printing
print(explain1) # The Shapley values
# The MSEv criterion (+sd). Smaller values indicate a better approach.
print(explain1, what = "MSEv")
print(explain2, what = "MSEv")
print(explain3, what = "MSEv")
## Summary
summary1 <- summary(explain1)
# Various additional info stored in the summary object
# Examples
summary1$shapley_est # A data.table with the Shapley values
summary1$timing$total_time_secs # Total computation time in seconds
summary1$parameters$n_MC_samples # Number of Monte Carlo samples used for the numerical integration
summary1$parameters$empirical.type # Type of empirical approach used
# Plot the results
if (requireNamespace("ggplot2", quietly = TRUE)) {
plot(explain1)
plot(explain1, plot_type = "waterfall")
}
# Group-wise explanations
group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))
explain_groups <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
group = group_list,
approach = "empirical",
phi0 = p,
n_MC_samples = 1e2
)
print(explain_groups)
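The causal and asymmetric Shapley values described in the Details section are not exercised above. Below is a hedged sketch of how the relevant arguments might be assembled for the airquality features; the ordering and confounding flags are illustrative assumptions, and the exact accepted formats of causal_ordering and confounding should be checked against the package vignettes.

```r
# Illustrative (partial) causal ordering over the airquality features:
# Solar.R first, then Wind and Temp jointly, then Month (an assumption,
# not a canonical causal model for these data).
causal_ordering <- list("Solar.R", c("Wind", "Temp"), "Month")

# One confounding flag per component of the ordering (also an assumption).
confounding <- c(FALSE, TRUE, FALSE)
stopifnot(length(confounding) == length(causal_ordering))

# These would then be passed to explain() together with asymmetric = TRUE
# (asymmetric Shapley values) or asymmetric = FALSE (causal Shapley values):
# explain(model, x_explain, x_train, approach = "gaussian", phi0 = p,
#         asymmetric = TRUE, causal_ordering = causal_ordering,
#         confounding = confounding)
```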
# Separate and surrogate regression approaches with linear regression models.
req_pkgs <- c("parsnip", "recipes", "workflows", "rsample", "tune", "yardstick")
if (all(vapply(req_pkgs, requireNamespace, logical(1), quietly = TRUE))) {
explain_separate_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p,
approach = "regression_separate",
regression.model = parsnip::linear_reg()
)
explain_surrogate_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p,
approach = "regression_surrogate",
regression.model = parsnip::linear_reg()
)
}
# Iterative estimation
# For illustration only. By default not used for such small dimensions as here.
# Restricting the initial and maximum number of coalitions as well.
explain_iterative <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = p,
iterative = TRUE,
iterative_args = list(initial_n_coalitions = 8),
max_n_coalitions = 12
)
# When not using all coalitions, we can also get the SD of the Shapley values,
# reflecting uncertainty in the coalition sampling part of the procedure.
print(explain_iterative, what = "shapley_sd")
## Summary
# For iterative estimation, convergence info is also provided
summary_iterative <- summary(explain_iterative)
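For models without native shapr support, the predict_model argument documented above takes a user-supplied prediction function. A minimal sketch follows; the wrapper name is illustrative, and it assumes the model's own predict() method returns something coercible to a numeric vector.

```r
# Custom prediction function: must take the model object and newdata, and
# return one numeric prediction per row of newdata.
predict_model_custom <- function(x, newdata) {
  as.numeric(predict(x, newdata = as.data.frame(newdata)))
}

# Sanity check on a plain lm fit (base R only, no shapr needed):
fit <- lm(Ozone ~ Wind + Temp, data = airquality)
preds <- predict_model_custom(fit, head(airquality, 3))
stopifnot(is.numeric(preds), length(preds) == 3)

# It would then be supplied to explain() via predict_model = predict_model_custom.
```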