hmda.wmshap | R Documentation |
This function is a wrapper for the shapley package. It computes
Weighted Mean SHAP (WMSHAP) values and corresponding confidence intervals for a
grid of models (or an ensemble of base-learners) by calling the shapley()
function. It uses the specified performance metric to assess each
model's performance, uses that metric as a weight,
and returns both the weighted mean SHAP values and, if requested, a plot of these
values with confidence intervals. This approach considers the variability of feature
importance across multiple models rather than reporting SHAP values from a single model.
For more details about the shapley algorithm, see https://github.com/haghish/shapley
hmda.wmshap(
models,
newdata,
plot = TRUE,
performance_metric = "r2",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = c("mean"),
cutoff = 0.01,
top_n_features = NULL,
n_models = 10,
sample_size = nrow(newdata)
)
models |
A grid object, an AutoML grid, an autoEnsemble object, or a character vector of H2O model IDs from which the SHAP values will be computed. |
newdata |
An H2OFrame (or data frame already uploaded to the H2O server) on which the SHAP values will be evaluated. |
plot |
Logical. If |
performance_metric |
Character. Specifies the performance metric to be used as
weights for the SHAP values. The default is "r2". |
standardize_performance_metric |
Logical. If |
performance_type |
Character. Specifies whether the performance metric should be
retrieved from the training data ("train"), validation data ("valid"), or
cross-validation ("xval"). Default is "xval". |
minimum_performance |
Numeric. The minimum performance threshold; any model with
a performance equal to or lower than this threshold will have a weight of zero in
the weighted SHAP calculation. Default is 0. |
method |
Character. Specify the method for selecting important features
based on their weighted mean SHAP ratios. The default is "mean". |
cutoff |
Numeric. The cutoff value used in the feature selection method
(default is 0.01). |
top_n_features |
Integer. If specified, only the top |
n_models |
Integer. The minimum number of models that must meet the
|
sample_size |
Integer. The number of rows in |
This function is designed as a wrapper for the HMDA package and calls the
shapley() function from the shapley package. It computes the weighted
average of SHAP values across multiple models, using a specified performance
metric (e.g., R Squared, AUC, etc.) as the weight. The performance metric can be
standardized if required. Additionally, the function selects top features based on
different methods (e.g., "mean" or "lowerCI") and
can limit the number of features considered via top_n_features. The
n_models parameter controls how many models must meet a minimum performance
threshold to be included in the weighted SHAP calculation.
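The weighting idea described above can be illustrated with a small base R sketch. This is not the implementation used by shapley() or hmda.wmshap(); the SHAP ratios, performance values, and object names (shap_ratio, performance, wmshap_sketch) are hypothetical, and the normalization of weights and the rule "keep features whose weighted mean exceeds the cutoff" are assumptions made for illustration only.

```r
# Hypothetical per-model SHAP ratios: rows are models, columns are features.
# Each row sums to 1 (a feature's share of that model's total contribution).
shap_ratio <- rbind(
  model_a = c(x1 = 0.60, x2 = 0.30, x3 = 0.10),
  model_b = c(x1 = 0.50, x2 = 0.45, x3 = 0.05),
  model_c = c(x1 = 0.20, x2 = 0.20, x3 = 0.60)
)

# Hypothetical performance metric per model (e.g., cross-validated R squared)
performance <- c(model_a = 0.80, model_b = 0.60, model_c = -0.10)
minimum_performance <- 0

# Models at or below the threshold receive a weight of zero
weights <- ifelse(performance > minimum_performance, performance, 0)

# Assumption: weights are rescaled to sum to 1
weights <- weights / sum(weights)

# Weighted mean SHAP ratio per feature; the weight vector is recycled
# down the rows, so each model's row is scaled by that model's weight
wmshap_sketch <- colSums(shap_ratio * weights)
print(round(wmshap_sketch, 3))

# Assumption: the "mean" method keeps features whose weighted mean
# exceeds the cutoff
cutoff <- 0.01
selected_sketch <- names(wmshap_sketch)[wmshap_sketch > cutoff]
print(selected_sketch)
```

In this sketch, model_c falls below minimum_performance and therefore contributes nothing, so the large share it assigns to x3 does not inflate the weighted mean for that feature.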
For more information on the shapley and WMSHAP approaches used in HMDA, please refer to the shapley package documentation and the following resources:
shapley GitHub: https://github.com/haghish/shapley
shapley CRAN: https://CRAN.R-project.org/package=shapley
A list with the following components:
A ggplot2 object showing the weighted mean SHAP values and
confidence intervals (if plot = TRUE).
A data frame of the weighted mean SHAP values and confidence intervals for each feature.
A data frame of performance metrics for all models used in the analysis.
A vector of model IDs corresponding to the models evaluated.
A list including the ggplot2 object, the data frame of SHAP values, the performance metrics of all models, and the model IDs.
E. F. Haghish
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
# build an autoEnsemble model & test it with the testing dataset
meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
print(h2o.performance(model = meta$model, newdata = test))
# compute weighted mean shap values
wmshap <- hmda.wmshap(models = hmda_grid1,
newdata = test,
performance_metric = "aucpr",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
plot = TRUE)
# identify the important features
selected <- hmda.feature.selection(wmshap,
method = c("mean"),
cutoff = 0.01)
print(selected)
# View the plot of weighted mean SHAP values and confidence intervals
print(wmshap$plot)
## End(Not run)