View source: R/hmda.feature.selection.R
hmda.feature.selection | R Documentation |
This function classifies features as "important", "inessential", or "irrelevant"
based on a summary of weighted mean SHAP values obtained from a prior
analysis. It uses the SHAP summary table (found in the wmshap object)
to identify features that are deemed important according to a specified method
and cutoff. Features whose lower confidence bound (lowerCI) is below zero
are labeled "irrelevant"; the remaining features that do not meet the
importance criteria are classified as "inessential".
hmda.feature.selection(
wmshap,
method = c("mean"),
cutoff = 0.01,
top_n_features = NULL
)
wmshap
A list object (typically returned by a weighted SHAP analysis)
that must contain a data frame holding the SHAP summary table.
method
Character. The method for selecting important features
based on their weighted mean SHAP ratios. The default is "mean".
cutoff
Numeric. The threshold cutoff for the selection method. Features
with a weighted SHAP value (or ratio) greater than or equal to this value
are considered important. Default is 0.01.
top_n_features
Integer. If specified, the function selects the top n features
(based on the sorted order) as important. Default is NULL.
The function performs the following steps:
1. Retrieves the SHAP summary table from the wmshap object.
2. Sorts the summary table in descending order of the mean SHAP value.
3. Identifies all features available in the summary.
4. Classifies features as irrelevant if their lowerCI value is below zero.
5. If top_n_features is not specified, selects important features as those
   whose value in the specified method column meets or exceeds the cutoff;
   the remaining features (excluding those marked as irrelevant) are
   classified as inessential.
6. If top_n_features is provided, selects the top n features (based on the
   sorted order) as important, with the rest (excluding irrelevant ones)
   being inessential.
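The steps above can be sketched in base R on a small mock table. This is a minimal illustration, not the package's implementation: the data frame shap_summary, the column name feature, the helper classify_features, and all numeric values are hypothetical. Only the lowerCI column and the "mean" method column are named in this page. The description does not say how a feature that is both top-ranked and has lowerCI below zero is handled, so the sketch leaves such overlap unresolved.

```r
# Hypothetical SHAP summary table; the `feature` column name and all values
# are illustrative assumptions, not output from HMDA.
shap_summary <- data.frame(
  feature = c("x1", "x2", "x3", "x4"),
  mean    = c(0.30, 0.008, 0.05, 0.002),
  lowerCI = c(0.25, 0.001, 0.03, -0.004),
  stringsAsFactors = FALSE
)

# Hypothetical helper that mirrors the documented steps
classify_features <- function(tbl, method = "mean", cutoff = 0.01,
                              top_n_features = NULL) {
  # Sort in descending order of the mean SHAP value
  tbl <- tbl[order(tbl$mean, decreasing = TRUE), ]
  all_features <- tbl$feature

  # Features whose lower confidence bound is below zero are irrelevant
  irrelevant <- tbl$feature[tbl$lowerCI < 0]

  if (is.null(top_n_features)) {
    # Cutoff rule: keep features whose method column meets the cutoff
    important <- tbl$feature[tbl[[method]] >= cutoff]
  } else {
    # Top-n rule: keep the first n features in the sorted order
    important <- head(all_features, top_n_features)
  }

  # Everything that is neither important nor irrelevant is inessential
  inessential <- setdiff(all_features, c(important, irrelevant))

  list(important = important,
       inessential = inessential,
       irrelevant = irrelevant)
}

by_cutoff <- classify_features(shap_summary, cutoff = 0.01)
by_top_n  <- classify_features(shap_summary, top_n_features = 1)
print(by_cutoff)
print(by_top_n)
```

With the mock values, the cutoff rule and the top-n rule give different important sets, which is the main practical difference between steps 5 and 6.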
A list with three elements:
A character vector of features deemed important.
A character vector of features considered inessential (present in the data but not meeting the importance criteria).
A character vector of features deemed irrelevant, defined as those with a lower confidence interval (lowerCI) below zero.
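One common use of such a result is to narrow the predictor set before retraining. The sketch below works on a mock list rather than a real return value. The element names important, inessential, and irrelevant are an assumption taken from the labels used in this page (the extracted text does not show the actual names), and the feature names are made up.

```r
# Mock result; element names are assumed, not confirmed by this page
selected <- list(
  important   = c("x1", "x3"),
  inessential = "x2",
  irrelevant  = "x4"
)

# Hypothetical full predictor vector
x <- c("x1", "x2", "x3", "x4")

# Keep only the important predictors, preserving the order in x
x_important <- intersect(x, selected$important)

# Alternatively, drop only the irrelevant predictors
x_relevant <- setdiff(x, selected$irrelevant)

print(x_important)
print(x_relevant)
```

Checking names(selected) on a real result before relying on these element names would be a sensible first step.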
E. F. Haghish
## Not run:
library(HMDA)
library(h2o)
hmda.init()
h2o.removeAll()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
# Define the hyperparameter search space for the grid
params <- list(learn_rate = c(0.01, 0.1),
               max_depth = c(3, 5, 9),
               sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                        grid_id = "hmda_grid1",
                        training_frame = train,
                        nfolds = 10,
                        ntrees = 100,
                        seed = 1,
                        hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
# build an autoEnsemble model & test it with the testing dataset
meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
print(h2o.performance(model = meta$model, newdata = test))
# Compute weighted mean SHAP values
wmshap <- hmda.wmshap(models = hmda_grid1,
                      newdata = test,
                      performance_metric = "aucpr",
                      standardize_performance_metric = FALSE,
                      performance_type = "xval",
                      minimum_performance = 0,
                      method = "mean",
                      cutoff = 0.01,
                      plot = TRUE)
# Identify the important features
selected <- hmda.feature.selection(wmshap,
                                   method = c("mean"),
                                   cutoff = 0.01)
print(selected)
## End(Not run)