shapley | R Documentation |
Calculates weighted mean SHAP ratios and confidence intervals to assess feature importance
across a collection of models (e.g., a grid of fine-tuned models or base-learners
in a stacked ensemble). Rather than reporting relative SHAP contributions for
only a single model, this function accounts for variability in feature importance
across multiple models. Each model's performance metric is used as a weight.
The function also provides a plot of weighted SHAP values with confidence intervals.
Currently, only models trained by the h2o machine learning platform,
autoEnsemble, and the HMDA R packages are supported.
shapley(
models,
newdata,
plot = TRUE,
performance_metric = "r2",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
top_n_features = NULL,
n_models = 10,
sample_size = nrow(newdata)
)
models |
H2O search grid, AutoML grid, or a character vector of H2O model IDs. |
newdata |
An |
plot |
Logical. If TRUE, the weighted mean and confidence intervals of the SHAP values are plotted. The default is TRUE. |
performance_metric |
Character specifying which performance metric to use
as weights. The default is |
standardize_performance_metric |
Logical, indicating whether to standardize
the performance metric used as weights so
their sum equals the number of models. The
default is |
performance_type |
Character. Specify which performance metric should be
reported: |
minimum_performance |
Numeric. Specify the minimum performance metric
for a model to be included in calculating weighted
mean SHAP ratio. Models below this threshold receive
zero weight. The default is |
method |
Character. Specify the method for selecting important features
based on their weighted mean SHAP ratios. The default is
|
cutoff |
Numeric. The cutoff used by 'method' when selecting the top features. |
top_n_features |
Integer. If specified, the top n features with the highest weighted SHAP values are selected, overruling the 'cutoff' and 'method' arguments. Specifying 'top_n_features' is also a way to reduce computation time if the data set contains many features. The default is NULL, which means the SHAP values are computed for all features. |
n_models |
Minimum number of models that must meet the 'minimum_performance' criterion in order to compute WMSHAP (weighted mean SHAP) and its confidence interval (CI). To compute global summary SHAP values (at the feature level) for a single model, set n_models to 1. The default is 10. |
sample_size |
Integer. Number of rows in the |
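As a rough illustration of how 'minimum_performance', 'standardize_performance_metric', and 'n_models' might interact, the following base R sketch builds a weights vector from made-up performance values. It is not the package's internal code: the model names, the metric values, the threshold of 0.10, and the choice to rescale over all models (including zero-weight ones) are assumptions made for the example.

```r
# Hypothetical performance metrics (e.g., cross-validated R2) for five models
performance <- c(m1 = 0.62, m2 = 0.55, m3 = 0.48, m4 = 0.05, m5 = 0.71)

# Assumed threshold, playing the role of 'minimum_performance'
minimum_performance <- 0.10

# Models below the threshold receive zero weight
weights <- ifelse(performance >= minimum_performance, performance, 0)

# Optional standardization (assumption: rescale so that the weights
# sum to the total number of models)
standardized <- weights / sum(weights) * length(weights)

# Number of models that meet the threshold; this is the count that
# would be compared against 'n_models'
n_eligible <- sum(weights > 0)

print(round(standardized, 3))
print(n_eligible)
```

Standardizing in this way changes the scale of the weights but not their relative sizes, so a weighted mean computed from them is unaffected.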
The function works as follows:
SHAP contributions are computed at the individual level (row) for each model for the given "newdata".
Each model's feature-level SHAP ratios (i.e., share of total SHAP) are computed.
The performance metrics of the models are used as weights.
Using the weights vector and the SHAP ratios of the features for each model, the weighted mean SHAP ratios and their confidence intervals are computed.
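The steps above can be sketched in base R with simulated data. This is a minimal sketch under stated assumptions, not the function's actual implementation: the SHAP matrices are random numbers, the feature labels and weights are made up, the "share of total SHAP" is taken over absolute contributions, and the confidence interval uses a weighted standard deviation with a t quantile, which is only one of several reasonable choices.

```r
set.seed(1)
features <- c("AGE", "PSA", "GLEASON")  # hypothetical feature labels

# Step 1 (simulated): row-level SHAP contributions for three models,
# each a 50 x 3 matrix (rows = observations, columns = features)
shap_by_model <- lapply(1:3, function(i) {
  matrix(rnorm(50 * 3), nrow = 50, dimnames = list(NULL, features))
})

# Step 2: per-model SHAP ratio, i.e., each feature's share of the total
# absolute SHAP (assumption: absolute values are used)
shap_ratio <- t(sapply(shap_by_model, function(s) {
  total <- colSums(abs(s))
  total / sum(total)
}))  # result: models x features, each row sums to 1

# Step 3: performance metrics used as weights (made-up values)
w <- c(0.62, 0.55, 0.71)

# Step 4: weighted mean SHAP ratio per feature
wmean <- colSums(shap_ratio * w) / sum(w)

# Weighted variance around the weighted mean, then an approximate 95% CI
# (assumption: t-based interval with n_models - 1 degrees of freedom)
n_models <- nrow(shap_ratio)
wvar <- colSums(w * sweep(shap_ratio, 2, wmean)^2) / sum(w)
se <- sqrt(wvar / n_models)
half_width <- qt(0.975, df = n_models - 1) * se
lower <- wmean - half_width
upper <- wmean + half_width

print(round(cbind(lower, wmean, upper), 3))
```

Because each model's ratios sum to 1, the weighted means also sum to 1, which makes them directly comparable across features.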
A list including the ggplot2 object, the data frame of SHAP values, and the performance metric of all models, as well as the model IDs.
E. F. Haghish
## Not run:
# load the required libraries for building the base-learners and the ensemble models
library(h2o) #shapley supports h2o models
library(shapley)
# initiate the h2o server
h2o.init(ignore_config = TRUE, nthreads = 2, bind_to_localhost = FALSE, insecure = TRUE)
# upload data to h2o cloud
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
set.seed(10)
### H2O provides 2 types of grid search for tuning the models, which are
### AutoML and Grid. Below, I demonstrate how weighted mean shapley values
### can be computed for both types.
#######################################################
### PREPARE AutoML Grid (takes a couple of minutes)
#######################################################
# run AutoML to tune various models (GBM) for up to 120 seconds
y <- "CAPSULE"
prostate[,y] <- as.factor(prostate[,y]) #convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 120,
include_algos=c("GBM"),
# this setting ensures the models are comparable for building a meta learner
seed = 2023, nfolds = 10,
keep_cross_validation_predictions = TRUE)
### call 'shapley' function to compute the weighted mean and weighted confidence intervals
### of SHAP values across all trained models.
### Note that the 'newdata' should be the testing dataset!
result <- shapley(models = aml, newdata = prostate, performance_metric = "aucpr", plot = TRUE)
#######################################################
### PREPARE H2O Grid (takes a couple of minutes)
#######################################################
# make sure equal number of "nfolds" is specified for different grids
grid <- h2o.grid(algorithm = "gbm", y = y, training_frame = prostate,
hyper_params = list(ntrees = seq(1,50,1)),
grid_id = "ensemble_grid",
# this setting ensures the models are comparable for building a meta learner
seed = 2023, fold_assignment = "Modulo", nfolds = 10,
keep_cross_validation_predictions = TRUE)
result2 <- shapley(models = grid, newdata = prostate, performance_metric = "aucpr", plot = TRUE)
#######################################################
### PREPARE autoEnsemble STACKED ENSEMBLE MODEL
#######################################################
### get the models' IDs from the AutoML and grid searches.
### this is all that is needed before building the ensemble,
### i.e., to specify the model IDs that should be evaluated.
library(autoEnsemble)
ids <- c(h2o.get_ids(aml), h2o.get_ids(grid))
autoSearch <- ensemble(models = ids, training_frame = prostate, strategy = "search")
result3 <- shapley(models = autoSearch, newdata = prostate,
performance_metric = "aucpr", plot = TRUE)
## End(Not run)