#' @title Weighted Mean SHAP Ratio and Confidence Interval for a ML Grid
#' of Fine-Tuned Models or Base-Learners of a Stacked Ensemble Model
#'
#' @description
#' Calculates weighted mean SHAP ratios and confidence intervals to assess feature importance
#' across a collection of models (e.g., a grid of fine-tuned models or base-learners
#' in a stacked ensemble). Rather than reporting relative SHAP contributions for
#' only a single model, this function accounts for variability in feature importance
#' across multiple models. Each model's performance metric is used as a weight.
#' The function also provides a plot of weighted SHAP values with confidence intervals.
#' Currently, only models trained with the \code{h2o} machine learning platform or
#' with the \code{autoEnsemble} and \code{HMDA} R packages are supported.
#'
#' @details
#' The function works as follows:
#' \enumerate{
#' \item SHAP contributions are computed at the individual (row) level for each model on the given \code{newdata}.
#' \item Each model's feature-level SHAP ratios (i.e., share of total SHAP) are computed.
#' \item The performance metrics of the models are used as weights.
#' \item Using the weight vector and each model's feature-level SHAP ratios,
#' the weighted mean SHAP ratios and their confidence intervals are computed
#' (see the formulas below).
#' }
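#'
#' Formally (mirroring the computation in the function body), let \eqn{s_{jk}}
#' denote the SHAP ratio of feature \eqn{j} in model \eqn{k} and \eqn{w_k} the
#' performance weight of model \eqn{k} across the \eqn{K} retained models. Then
#' \deqn{\bar{s}_j = \frac{\sum_{k=1}^{K} w_k s_{jk}}{\sum_{k=1}^{K} w_k}, \qquad
#' \hat{\sigma}_j^2 = \frac{\sum_{k=1}^{K} w_k (s_{jk} - \bar{s}_j)^2}{\sum_{k=1}^{K} w_k - 1},}
#' and the reported 95\% confidence interval is
#' \deqn{\bar{s}_j \pm 1.96\,\hat{\sigma}_j / \sqrt{K}.}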
#'
#' @param models An \code{H2OGrid} object, an \code{H2OAutoML} object, an
#'        \code{autoEnsemble} object, or a character vector of H2O model IDs.
#' @param newdata An \code{h2o} frame (or \code{data.frame}) already uploaded to the
#' \code{h2o} server. This data is used to compute the SHAP contributions
#' of each model, which are then weighted by the model's performance.
#' @param performance_metric Character specifying which performance metric to use
#' as weights. The default is \code{"r2"}, which can
#' be used for both regression and classification.
#' For binary classification, other options include:
#' \code{"aucpr"} (area under the precision-recall curve),
#' \code{"auc"} (area under the ROC curve),
#' and \code{"f2"} (F2 score).
#' @param standardize_performance_metric Logical, indicating whether to standardize
#' the performance metric used as weights so
#' their sum equals the number of models. The
#' default is \code{FALSE}.
#' @param performance_type Character. Specify which performance metric should be
#' reported: \code{"train"} for training data, \code{"valid"}
#' for validation, or \code{"xval"} for cross-validation (default).
#' @param minimum_performance Numeric. The minimum performance a model must
#' reach to be included in computing the weighted mean SHAP ratio.
#' Models below this threshold receive zero weight. The default is \code{0}.
#' @param method Character. Specify the method for selecting important features
#' based on their weighted mean SHAP ratios. The default is
#' \code{"mean"}, which selects features whose weighted mean shap ratio (WMSHAP)
#' exceeds the \code{cutoff}. The alternative is
#' \code{"lowerCI"}, which selects features whose lower bound of confidence
#' interval exceeds the \code{cutoff}.
#' @param cutoff Numeric. The cutoff used by the selected \code{method} for
#' identifying the top features. The default is \code{0.01}.
#' @param top_n_features Integer. If specified, the top n features with the
#' highest weighted SHAP values are selected, overruling the
#' \code{cutoff} and \code{method} arguments. Specifying \code{top_n_features}
#' is also a way to reduce computation time when many features
#' are present in the data set. The default is \code{NULL}, which means
#' the SHAP values will be computed for all features.
#' @param n_models Integer. The minimum number of models that must meet the
#' \code{minimum_performance} criterion for the WMSHAP and CI to be
#' computed. If the intention is to compute global summary SHAP values
#' (at the feature level) for a single model, set \code{n_models} to 1.
#' The default is 10.
#' @param sample_size Integer. The number of rows of \code{newdata} that should
#' be used for the SHAP assessment. By default, all rows are used,
#' which is the recommended procedure for scientific analyses.
#' However, SHAP analysis is time-consuming; during code development,
#' lower values can be used to speed up the analysis.
#' @param plot Logical. If \code{TRUE}, the weighted means and confidence intervals
#' of the SHAP values are plotted. The default is \code{TRUE}.
#'
# @param normalize_shap_per_model Logical. If TRUE, the SHAP contribution ratio for each
# model is normalized to range between 0 and 1. The default is
# FALSE, which encourages reporting weighted mean SHAP ratios (not ranks)
# and comparison between models without scaling each model.
# This option is only provided for rare use cases where
# WMSHAP values are preferred to be scaled between 0 and 1, but
# should otherwise be avoided.
# @param normalize_to Character. The default value is "upperCI", which sets the feature with
# the maximum SHAP value to one, allowing the upper CI to
# go beyond one. Setting this value is mainly for aesthetic
# reasons to adjust the plot, but it can also influence the
# feature selection process, depending on the method in use,
# because it changes how the SHAP values are normalized.
# The alternative is "feature", specifying that
# in the normalization of the SHAP values, the maximum confidence
# interval of the weighted SHAP values should equal
# "1", in order to limit the plot values to a maximum of one.
#'
#' @importFrom utils setTxtProgressBar txtProgressBar globalVariables
#' @importFrom stats weighted.mean
#' @importFrom h2o h2o.stackedEnsemble h2o.getModel h2o.auc h2o.aucpr h2o.r2
#' h2o.F2 h2o.mean_per_class_error h2o.giniCoef h2o.accuracy
#' h2o.shap_summary_plot
# @importFrom h2otools h2o.get_ids
#' @importFrom curl curl
#' @importFrom ggplot2 ggplot aes geom_col geom_errorbar coord_flip ggtitle xlab
#' ylab theme_classic theme scale_y_continuous margin expansion
#' @return A list including the \code{ggplot2} object, the data frame of SHAP values,
#' the performance metrics of all models, and the model IDs.
#' @examples
#'
#' \dontrun{
#' # load the required libraries for building the base-learners and the ensemble models
#' library(h2o) #shapley supports h2o models
#' library(shapley)
#'
#' # initiate the h2o server
#' h2o.init(ignore_config = TRUE, nthreads = 2, bind_to_localhost = FALSE, insecure = TRUE)
#'
#' # upload data to h2o cloud
#' prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
#' prostate <- h2o.importFile(path = prostate_path, header = TRUE)
#'
#' set.seed(10)
#'
#' ### H2O provides 2 types of grid search for tuning the models, which are
#' ### AutoML and Grid. Below, I demonstrate how weighted mean shapley values
#' ### can be computed for both types.
#'
#' #######################################################
#' ### PREPARE AutoML Grid (takes a couple of minutes)
#' #######################################################
#' # run AutoML to tune various models (GBM) for 120 seconds
#' y <- "CAPSULE"
#' prostate[,y] <- as.factor(prostate[,y]) #convert to factor for classification
#' aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 120,
#'                   include_algos = c("GBM"),
#'
#'                   # this setting ensures the models are comparable for building a meta learner
#'                   seed = 2023, nfolds = 10,
#'                   keep_cross_validation_predictions = TRUE)
#'
#' ### call 'shapley' function to compute the weighted mean and weighted confidence intervals
#' ### of SHAP values across all trained models.
#' ### Note that the 'newdata' should be the testing dataset!
#' result <- shapley(models = aml, newdata = prostate, performance_metric = "aucpr", plot = TRUE)
#'
#' #######################################################
#' ### PREPARE H2O Grid (takes a couple of minutes)
#' #######################################################
#' # make sure equal number of "nfolds" is specified for different grids
#' grid <- h2o.grid(algorithm = "gbm", y = y, training_frame = prostate,
#'                  hyper_params = list(ntrees = seq(1, 50, 1)),
#'                  grid_id = "ensemble_grid",
#'
#'                  # this setting ensures the models are comparable for building a meta learner
#'                  seed = 2023, fold_assignment = "Modulo", nfolds = 10,
#'                  keep_cross_validation_predictions = TRUE)
#'
#' result2 <- shapley(models = grid, newdata = prostate, performance_metric = "aucpr", plot = TRUE)
#'
#' #######################################################
#' ### PREPARE autoEnsemble STACKED ENSEMBLE MODEL
#' #######################################################
#'
#' ### get the models' IDs from the AutoML and grid searches.
#' ### this is all that is needed before building the ensemble,
#' ### i.e., to specify the model IDs that should be evaluated.
#' library(autoEnsemble)
#' ids <- c(h2o.get_ids(aml), h2o.get_ids(grid))
#' autoSearch <- ensemble(models = ids, training_frame = prostate, strategy = "search")
#' result3 <- shapley(models = autoSearch, newdata = prostate,
#'                    performance_metric = "aucpr", plot = TRUE)
#'
#'
#' }
#' @export
#' @author E. F. Haghish
shapley <- function(models,
newdata,
#nboot = NULL,
plot = TRUE,
performance_metric = "r2",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
top_n_features = NULL,
n_models = 10,
sample_size = nrow(newdata)
#normalize_shap_per_model = FALSE
#normalize_to = "upperCI"
) {
# Variables definitions
# ============================================================
BASE <- NULL #contribution SHAP plot
w <- NULL #performance metric (weight)
results <- NULL #data frame of SHAP values
selectedFeatures <- NULL #list of selected features
Plot <- NULL #GGPLOT2 object
feature_importance <- list() #list of feature importance
z <- 0 #counter for the progress bar
# models with low minimum_performance are stored in 'ignored_models' data.frame
ignored_models <- data.frame(id = character(), performance = numeric())
included_models <- NULL
# Where should the performance type be retrieved from?
train <- FALSE
valid <- FALSE
xval <- FALSE
#if (normalize_shap_per_model) message("using 'normalize_shap_per_model' is discouraged! This is for rare use only")
# Syntax check
# ============================================================
if (performance_metric != "r2" &
performance_metric != "aucpr" &
performance_metric != "auc" &
performance_metric != "f2") stop("performance metric must be 'r2', 'aucpr', 'auc', or 'f2'")
if (performance_type == "train") train <- TRUE
else if (performance_type == "valid") valid <- TRUE
else if (performance_type == "xval") xval <- TRUE
else stop("performance_type must be 'train', 'valid', or 'xval'")
# STEP 0: prepare the models, by either confirming that the models are 'h2o' or 'autoEnsemble'
# or by extracting the model IDs from these objects
# ============================================================
if (inherits(models,"H2OAutoML") | inherits(models,"H2OAutoML")
| inherits(models,"H2OGrid")) {
ids <- h2o.get_ids(models)
}
else if (inherits(models,"autoEnsemble")) {
ids <- models[["top_rank_id"]]
}
else if (inherits(models,"character")) {
ids <- models
}
# Initiate the progress bar after identifying the ids
# ------------------------------------------------------------
pb <- txtProgressBar(z, length(ids), style = 3)
# STEP 1: Compute SHAP values and performance metric for each model
# ============================================================
for (i in ids) {
z <- z + 1
performance <- NULL
model <- h2o.getModel(i)
# Compute performance metrics
# ----------------------------------------------------------
# for regression and classification
if (performance_metric == "r2") performance <- h2o.r2(model, train = train, valid = valid, xval = xval)
# for classification
else if (performance_metric == "aucpr") performance <- h2o.aucpr(model, train = train, valid = valid, xval = xval)
else if (performance_metric == "auc") performance <- h2o.auc(model, train = train, valid = valid, xval = xval)
else if (performance_metric == "f2") performance <- h2o.F2(model)
# If the performance of the model is below the minimum_performance,
# ignore processing the model to save runtime
# ----------------------------------------------------------
if (performance <= minimum_performance) {
ignored_models <- rbind(ignored_models, data.frame(id = i, performance = performance))
}
# otherwise, continue processing the model
# ----------------------------------------------------------
else {
included_models <- c(included_models, i)
w <- c(w, performance)
# if top_n_features is not specified, evaluate SHAP for ALL FEATURES;
# otherwise, evaluate SHAP only for the top_n_features. Evaluating SHAP
# for all features can be very time-consuming, which is the only reason
# to limit the number of features.
if (!is.null(top_n_features)) {
m <- h2o.shap_summary_plot(
model = model,
newdata = newdata,
top_n_features = top_n_features,
sample_size = sample_size # honor the user-specified sample size
)
}
else {
m <- h2o.shap_summary_plot(
model = model,
newdata = newdata,
columns = model@allparameters$x, #get SHAP for all predictors
sample_size = sample_size # honor the user-specified sample size
)
}
# create the summary dataset
# ----------------------------------------------------------
if (length(included_models) == 1) {
# reserve the first model's plot object for the SHAP contribution plot
# and its data for the row-mean SHAP computation
BASE <- m
data <- m$data
data <- data[order(data$Row.names), ]
results <- data[, c("id.x", "Row.names", "feature", "contribution")]
names(results)[1] <- "index"
}
else if (length(included_models) > 1) {
holder <- m$data[, c("Row.names", "feature", "contribution")]
colnames(holder) <- c("Row.names", "feature", paste0("contribution", z))
holder <- holder[order(holder$Row.names), ]
#results <- cbind(results, holder[, 2, drop = FALSE])
# NOTE: instead of cbind, use merge, because the sets of "important" features
# are not identical across models. therefore, merge the
# datasets; if a new feature is added, the value of this
# feature for previous models should be zero
results <- merge(results, holder, by = "Row.names", all = TRUE)
findex <- is.na(results[,"feature.x"])
results[findex,"feature.x"] <- results[findex,"feature.y"]
results[,"feature.y"] <- NULL
names(results)[names(results) == "feature.x"] <- "feature"
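# (illustrative: if the first model reports features {A, B} and a later model
# adds feature C, the rows for C get NA in the earlier contribution columns;
# these NAs are replaced with zero after the loop)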
}
}
setTxtProgressBar(pb, z)
}
# the number of included models must be at least n_models
if (length(included_models) < n_models) stop("the number of models that met the minimum_performance criterion is too low")
# Check that the sum of weights is larger than 1 to avoid negative variance computation
# ============================================================
if (sum(w) <= 1 & !standardize_performance_metric) stop("the sum of the models' performance weights is not larger than 1. enable 'standardize_performance_metric' by setting it to TRUE")
else if (standardize_performance_metric) w <- w * length(w) / sum(w)
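# (illustrative: with w = c(0.5, 0.3), sum(w) = 0.8 <= 1; standardizing gives
# w * 2 / 0.8 = c(1.25, 0.75), whose sum equals the number of models, 2)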
# notify the user about ignored models
# ============================================================
if (nrow(ignored_models) > 0) {
colnames(ignored_models) <- c("id", "performance")
warning(paste(nrow(ignored_models),
"models did not meet the minimum_performance criterion and were excluded.",
"see 'ignored_models' in the returned shapley object"))
}
# NOTE: if a feature is not important for a model, then its SHAP value is NA.
# replace these NAs with zero
# ============================================================
results[is.na(results)] <- 0
# STEP 2: Calculate the summary shap values for each feature and store the mean
# SHAP values in a list, for significance testing
# ============================================================
ratioDF <- NULL
UNQ <- unique(results$feature)
summaryShaps <- data.frame(
feature = UNQ,
mean = NA,
sd = NA,
ci = NA,
lowerCI = NA,
upperCI = NA)
# CALCULATE THE TOTAL CONTRIBUTION PER MODEL (don't collapse to vector)
TOTAL <- colSums(abs(results[, grep("^contribution", names(results)), FALSE]),
na.rm = TRUE)
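# (illustrative: TOTAL is a named vector with one total per contribution
# column, i.e., per model, e.g., c(contribution = 12.3, contribution2 = 11.8);
# toy numbers)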
# Calculate the ratio of contribution of each feature per model
# -------------------------------------------------------------
for (j in UNQ) {
# get all contribution columns for feature j
tmp <- results[results$feature == j, grep("^contribution", names(results)), FALSE]
# compute the ratio of absolute shap values for the feature across all models
mat <- matrix(colSums(abs(tmp), na.rm = TRUE) / TOTAL, nrow = 1)
# append the feature's ratio row to the ratio matrix
ratioDF <- rbind(ratioDF, mat)
}
# Take absolute values (the ratios are already non-negative) and create a data frame
# -------------------------------------------------------------
for (i in 1:ncol(ratioDF)) ratioDF[,i] <- abs(ratioDF[,i])
ratioDF <- as.data.frame(ratioDF)
names(ratioDF) <- paste0("ratio", 1:ncol(ratioDF))
feature <- unique(summaryShaps$feature)
ratioDF <- cbind(feature, ratioDF)
# Compute the weighted mean, sd, and ci for each feature
# -------------------------------------------------------------
for (j in UNQ) {
# get all ratio columns for feature j
tmp <- ratioDF[ratioDF$feature == j, grep("^ratio", names(ratioDF)), FALSE]
weighted_mean <- weighted.mean(tmp, w, na.rm = TRUE)
weighted_var <- sum(w * (tmp - weighted_mean)^2, na.rm = TRUE) / (sum(w, na.rm = TRUE) - 1)
weighted_sd <- sqrt(weighted_var)
# update the summaryShaps data frame
summaryShaps[summaryShaps$feature == j, "mean"] <- weighted_mean #mean(tmp)
summaryShaps[summaryShaps$feature == j, "sd"] <- weighted_sd
summaryShaps[summaryShaps$feature == j, "ci"] <- 1.96 * weighted_sd / sqrt(length(tmp))
}
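# (illustrative sketch with toy numbers: for per-model ratios
# tmp = c(0.12, 0.10, 0.15, 0.11) and weights w = c(0.81, 0.79, 0.84, 0.80),
# weighted_mean ~ 0.120, weighted_sd ~ 0.023, and ci ~ 0.022)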
# Compute the lower and upper confidence intervals
# -------------------------------------------------------------
summaryShaps$lowerCI <- summaryShaps$mean - summaryShaps$ci
summaryShaps$upperCI <- summaryShaps$mean + summaryShaps$ci
# compute feature_importance used by shapley.test function
# -------------------------------------------------------------
for (j in unique(results$feature)) {
tmp <- results[results$feature == j, grep("^contribution", names(results)), FALSE]
tmp <- colSums(abs(tmp))
feature_importance[[j]] <- tmp
}
# Compute row means of SHAP contributions for each subject
# ============================================================
cols <- grep("^contribution", names(results))
# for (r in 1:nrow(data)) {
# data[r, "contribution"] <- weighted.mean(results[r, cols], w)
# }
#???
BASE$data <- BASE$data[order(BASE$data$Row.names), ]
BASE$data$contribution <- data$contribution
BASE$labels$title <- "SHAP Mean Summary Plot\n"
# STEP 3: NORMALIZE the SHAP contributions and their CI
# ============================================================
# the minimum contribution should not be normalized as zero. instead,
# it should be the ratio of minimum value to the maximum value.
# The maximum would be the highest mean + the highest CI
# if (normalize_to == "upperCI") {
# max <- max(summaryShaps$mean + summaryShaps$ci)
# }
# else {
# max <- max(summaryShaps$mean)
# }
# #??? I might still give the minimum value to be zero!
# min <- 0 # min(summaryShaps$mean)/max
#
# summaryShaps$normalized_mean <- normalize(x = summaryShaps$mean,
# min = min,
# max = max)
#
# summaryShaps$normalized_ci <- normalize(x = summaryShaps$ci,
# min = min,
# max = max)
# compute relative shap values
# summaryShaps$shapratio <- summaryShaps$mean / sum(summaryShaps$mean)
# # compute lowerCI
# summaryShaps$lowerCI <- summaryShaps$normalized_mean - summaryShaps$normalized_ci
# summaryShaps$upperCI <- summaryShaps$normalized_mean + summaryShaps$normalized_ci
# STEP 4: Feature selection
# ============================================================
selectedFeatures <- summaryShaps[order(summaryShaps$mean, decreasing = TRUE), ]
if (!is.null(top_n_features)) {
# head() guards against top_n_features exceeding the number of available features
selectedFeatures <- head(selectedFeatures, top_n_features)
}
else {
if (method == "mean") {
selectedFeatures <- selectedFeatures[selectedFeatures$mean > cutoff, ]
}
# else if (method == "shapratio") {
# selectedFeatures <- selectedFeatures[selectedFeatures$shapratio > cutoff, ]
# }
else if (method == "lowerCI") {
selectedFeatures <- selectedFeatures[selectedFeatures$lowerCI > cutoff, ]
}
else stop("method must be one of 'mean' or 'lowerCI'")
}
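# (illustrative: with cutoff = 0.01, a feature with mean = 0.02 but
# lowerCI = 0.005 is selected under method = "mean" and dropped under
# method = "lowerCI")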
# STEP 5: Create the shapley object
# ============================================================
obj <- list(ids = ids,
plot = Plot,
contributionPlot = BASE,
summaryShaps = summaryShaps,
selectedFeatures = selectedFeatures$feature,
feature_importance = feature_importance,
weights = w,
results = results,
shap_contributions_by_ids = results,
ignored_models = ignored_models,
included_models = included_models)
class(obj) <- c("shapley", "list")
# STEP 6: PLOT
# ============================================================
if (plot) {
obj$plot <- shapley.plot(obj,
plot = "bar",
method = method,
cutoff = cutoff,
top_n_features = top_n_features)
print(obj$plot)
}
return(obj)
}