explain (R Documentation)

Description
Computes dependence-aware Shapley values for observations in x_explain from the specified model, using the method specified in approach to estimate the conditional expectations. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.

Usage
explain(
model,
x_explain,
x_train,
approach,
phi0,
iterative = NULL,
max_n_coalitions = NULL,
group = NULL,
n_MC_samples = 1000,
seed = NULL,
verbose = "basic",
predict_model = NULL,
get_model_specs = NULL,
prev_shapr_object = NULL,
asymmetric = FALSE,
causal_ordering = NULL,
confounding = NULL,
extra_computation_args = list(),
iterative_args = list(),
output_args = list(),
...
)
Arguments

model
Model object. Specifies the model whose predictions we want to explain. Run get_supported_models() for a table of the models that explain() supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs; see details for more information.
x_explain
Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.
x_train
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions of the features, needed to properly estimate the conditional expectations in the Shapley formula.
approach
Character vector of length 1, or one less than the number of features/groups. All elements should be one of "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.
phi0
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices, such as the mean of the predictions in the training data, are also reasonable.
iterative
Logical or NULL. If NULL (default), the argument is set to TRUE if there are more than 5 features/groups, and FALSE otherwise. If TRUE, the Shapley values are estimated iteratively: coalitions are added in batches until the estimates are deemed sufficiently accurate, which typically reduces the computation time in higher dimensions.
max_n_coalitions
Integer. The upper limit on the number of unique feature/group coalitions to use in the iterative procedure (if iterative = TRUE), or the number of coalitions to use directly (if iterative = FALSE). If NULL (default), all 2^m coalitions are allowed, where m is the number of features/groups.
group
List. If NULL (default), regular feature-wise Shapley values are computed. If provided, group-wise Shapley values are computed instead. group then has length equal to the number of groups, and each list element is a character vector with the names of the features belonging to that group.
n_MC_samples
Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach = "ctree", it corresponds to the number of samples drawn from the selected leaf node, and for approach = "empirical" it is the maximum number of (largest-weight) training observations used, i.e. the K parameter in Aas et al. (2021).
seed
Positive integer. Specifies the seed to set before any randomness-based code is run. If NULL (default), no seed is set.
verbose
String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". NULL means no printout. The default, "basic", prints basic information about the computation being performed.
predict_model
Function. The prediction function to use when model is not natively supported. The function must take two arguments, model and newdata, and return a numeric vector with the predictions. If NULL (default), internal functions are used for natively supported model classes. See the sketch after this argument list for an illustration.
get_model_specs
Function. An optional function for checking model/data consistency when model is not natively supported. If NULL (default), internal functions perform the checks for natively supported model classes, and the checking is disabled for unsupported ones.
prev_shapr_object
shapr object or string. If an object of class shapr, or a string with the path to where intermediate results from a previous run are stored, is provided, the function continues the computation from that object. This is useful when a computation has been interrupted, or when you want to continue an iterative estimation to obtain higher accuracy.
asymmetric
Logical. Only relevant for asymmetric/causal explanations. If FALSE (default), explain() computes regular symmetric Shapley values. If TRUE, asymmetric Shapley values are computed based on the (partial) causal ordering given by causal_ordering, i.e. only coalitions consistent with that ordering are used.
causal_ordering
List. Only relevant for asymmetric/causal explanations. An unnamed list of vectors specifying the components of the (partial) causal ordering that the coalitions must respect; each vector contains one or more features/groups identified by name or index. If NULL (default), no causal ordering is assumed.
confounding
Logical vector. Only relevant for asymmetric/causal explanations. Specifies, for each component of causal_ordering, whether confounding is assumed. A single logical is recycled to all components. If NULL (default), no assumptions about confounding are made, and (symmetric or asymmetric) conditional Shapley values are computed.
extra_computation_args
Named list. Specifies extra arguments related to the computation of the Shapley values. See get_extra_comp_args_default() for a description of the arguments and their default values.
iterative_args
Named list. Specifies the arguments for the iterative procedure. See get_iterative_args_default() for a description of the arguments and their default values.
output_args
Named list. Specifies certain arguments related to the output of the function. See get_output_args_default() for a description of the arguments and their default values.
...
Further arguments passed on to the approach-specific setup functions, e.g. setup_approach.empirical() and setup_approach.gaussian(). See the documentation of these functions for the available approach-specific arguments.
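For unsupported model classes, a custom prediction wrapper is typically all that is needed. The sketch below is illustrative only (the my_custom_model object is hypothetical); the wrapper merely has to satisfy the (model, newdata) signature and numeric return value described under predict_model above.

# Sketch: prediction wrapper for a hypothetical unsupported model class.
# The only hard requirements are the two arguments (model, newdata) and
# a numeric vector as the return value.
predict_model_custom <- function(model, newdata) {
  # Replace with the prediction call appropriate for your model class
  as.numeric(predict(model, newdata = as.data.frame(newdata)))
}

# Hypothetical usage:
# explain(
#   model = my_custom_model,
#   x_explain = x_explain,
#   x_train = x_train,
#   approach = "empirical",
#   phi0 = p,
#   predict_model = predict_model_custom
# )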
Details

The shapr package implements kernelSHAP estimation of dependence-aware Shapley values with eight different Monte Carlo-based approaches for estimating the conditional distributions of the data. These are all introduced in the general usage vignette (from R: vignette("general_usage", package = "shapr")). Moreover, Aas et al. (2021) give a general introduction to dependence-aware Shapley values and the three approaches "empirical", "gaussian" and "copula", and also discuss "independence".
Redelmeier et al. (2020) introduce the "ctree" approach. Olsen et al. (2022) introduce the "vaeac" approach. The "timeseries" approach is discussed in Jullum et al. (2021). shapr has also implemented two regression-based approaches, "regression_separate" and "regression_surrogate", as described in Olsen et al. (2024). It is also possible to combine the different approaches; see the general usage vignette for more information.
The package also supports the computation of causal and asymmetric Shapley values, as introduced by Heskes et al. (2020) and Frye et al. (2020). Asymmetric Shapley values were proposed by Frye et al. (2020) as a way to incorporate real-world causal knowledge by restricting the feature combinations/coalitions used when computing the Shapley values to those consistent with a (partial) causal ordering. Causal Shapley values were proposed by Heskes et al. (2020) as a way to explain the total effect of features on the prediction, taking into account their causal relationships, by adapting the sampling procedure in shapr.
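As a rough sketch of an asymmetric explanation (illustrative only: the causal ordering below is made up for the airquality data used under Examples, where model, x_explain, x_train and p are defined):

# Sketch: asymmetric Shapley values under an assumed (partial) causal
# ordering in which "Month" precedes the remaining features
explain_asym <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  asymmetric = TRUE,
  causal_ordering = list("Month", c("Solar.R", "Wind", "Temp"))
)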
The package allows for parallelized computation with progress updates through the tightly connected future::future and progressr::progressr packages. See the examples below. For iterative estimation (iterative = TRUE), intermediate results may also be printed to the console (according to the verbose argument), and are in addition written to disk. This, combined with batched computation of the v(S) values, enables fast and accurate estimation of the Shapley values in a memory-friendly manner.
Value

Object of class c("shapr", "list"). Contains the following items:

shapley_values_est
data.table with the estimated Shapley values, with the explained observations in the rows and the features along the columns. The column none is the part of the prediction not attributed to any of the features (given by the argument phi0).

shapley_values_sd
data.table with the standard deviations of the Shapley values, reflecting the uncertainty. Note that this only reflects the coalition-sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.

internal
List with the different parameters, data, functions and other output used internally.

pred_explain
Numeric vector with the predictions for the explained observations.

MSEv
List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage vignette for details.

timing
List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.
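The items can be accessed directly from the returned object. A minimal sketch, assuming an explain() result such as the explain1 object created under Examples:

# Sketch: inspecting parts of a returned shapr object
explain1$shapley_values_est      # estimated Shapley values
explain1$pred_explain            # predictions for the explained observations
explain1$MSEv                    # MSEv evaluation criterion
explain1$timing$total_time_secs  # total computation time in seconds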
Author(s)

Martin Jullum, Lars Henry Berge Olsen

Examples
# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into training and test data
data_train <- head(airquality, -3)
data_explain <- tail(airquality, 3)
x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]
# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)
# Use the mean of the training response as phi0
p <- mean(data_train[, y_var])
# (Optionally) enable parallelization via the future package
if (requireNamespace("future", quietly = TRUE)) {
future::plan("multisession", workers = 2)
}
# (Optionally) enable progress updates within every iteration via the progressr package
if (requireNamespace("progressr", quietly = TRUE)) {
progressr::handlers(global = TRUE)
}
# Empirical approach
explain1 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
phi0 = p,
n_MC_samples = 1e2
)
# Gaussian approach
explain2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = p,
n_MC_samples = 1e2
)
# Gaussian copula approach
explain3 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "copula",
phi0 = p,
n_MC_samples = 1e2
)
if (requireNamespace("party", quietly = TRUE)) {
# ctree approach
explain4 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "ctree",
phi0 = p,
n_MC_samples = 1e2
)
}
# Combined approach
approach <- c("gaussian", "gaussian", "empirical")
explain5 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = approach,
phi0 = p,
n_MC_samples = 1e2
)
# Print the Shapley values
print(explain1$shapley_values_est)
# Plot the results
if (requireNamespace("ggplot2", quietly = TRUE)) {
plot(explain1)
plot(explain1, plot_type = "waterfall")
}
# Group-wise explanations
group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))
explain_groups <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
group = group_list,
approach = "empirical",
phi0 = p,
n_MC_samples = 1e2
)
print(explain_groups$shapley_values_est)
# Separate and surrogate regression approaches with linear regression models.
explain_separate_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p,
approach = "regression_separate",
regression.model = parsnip::linear_reg()
)
explain_surrogate_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p,
approach = "regression_surrogate",
regression.model = parsnip::linear_reg()
)
# Iterative estimation
# For illustration purposes only; by default, iterative estimation is not used
# for dimensions as small as here
# Gaussian approach
explain_iterative <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = p,
n_MC_samples = 1e2,
iterative = TRUE,
iterative_args = list(initial_n_coalitions = 10)
)
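# Continuing a previous computation is also possible. A minimal sketch based
# on the prev_shapr_object argument described above, here raising the
# coalition limit to all 2^4 = 16 coalitions for the four features:
explain_continued <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2,
  iterative = TRUE,
  prev_shapr_object = explain_iterative,
  max_n_coalitions = 16
)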