Ensemble feature ranking for variable selection in SuperLearner ensembles [@polley2021package], based on @effrosynidis2021evaluation. Multiple algorithms each estimate a ranking of the strength of the relationship between the predictors and the outcome in the training set, and these rankings are combined into a single ranking via an aggregation method (currently reciprocal rank aggregation). The final ranking can then be cut at a certain number of variables (e.g. the top 10 predictors or the top 70%) to create one or more feature selection wrappers for SuperLearner. The result should generally be more robust and stable than feature selection based on a single algorithm. See also @neumann2017efs for a similar method.
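To make the aggregation step concrete, here is a toy sketch of reciprocal rank aggregation. It only illustrates the idea; the package's own implementation is `agg_reciprocal_rank()`, used at the end of the demo below.

```r
# Each row is one algorithm's ranking of three features (1 = strongest
# relationship with the outcome).
ranks = rbind(alg1 = c(x1 = 1, x2 = 2, x3 = 3),
              alg2 = c(x1 = 2, x2 = 1, x3 = 3),
              alg3 = c(x1 = 1, x2 = 3, x3 = 2))

# Score each feature by summing its reciprocal ranks; higher is better.
scores = colSums(1 / ranks)

# Aggregated ranking: x1 (2.50) > x2 (~1.83) > x3 (~1.17). Cutting this
# at the top k features yields a feature selection wrapper.
sort(scores, decreasing = TRUE)
```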
Install from GitHub:

```r
# install.packages("remotes")
remotes::install_github("ck37/featurerank")
```
Currently implemented feature ranking algorithms are:

- `featrank_cor` (correlation)
- `featrank_glm` (generalized linear model)
- `featrank_glmnet` (penalized regression via glmnet)
- `featrank_randomForest` (random forest)
- `featrank_dbarts` (BART via dbarts)
- `featrank_shap` (SHAP values)
Below is a minimal example demonstrating how the package can be used.
```r
# The packages are loaded here so that their startup output is suppressed;
# library() calls appear again later in the demo so that readers can see them.
library(SuperLearner)
library(glmnet)
# https://github.com/cjcarlson/embarcadero
library(embarcadero)
library(dbarts)
library(weights)
library(randomForest)
library(ck37r)

# Ignore warnings, e.g. from glm().
options("warn" = -1)
```
```r
# TODO: switch to a less problematic demo dataset.
data(Boston, package = "MASS")

# Use "chas" as our outcome variable, which is binary.
y = Boston$chas
x = subset(Boston, select = -chas)
```
Specify the feature ranking wrappers for the ensemble library.
```r
library(featurerank)

# Modify the random forest feature ranker to use 100 trees
# (faster than the default of 500).
featrank_randomForest100 = function(...)
  featrank_randomForest(ntree = 100L, ...)

# Specify the set of feature ranking algorithms.
ensemble_rank_custom = function(top_vars, ...)
  ensemble_rank(fn_rank = c(featrank_cor,
                            featrank_randomForest100,
                            featrank_glm,
                            featrank_glmnet),
                            #featrank_shap, # too verbose currently
                            #featrank_dbarts), # skip for speed
                top_vars = top_vars, ...)

# There are 13 total vars, so try dropping 1 of them.
top12 = function(...) ensemble_rank_custom(top_vars = 12, ...)
# Try dropping the worst 2 predictors.
top11 = function(...) ensemble_rank_custom(top_vars = 11, ...)
# Drop the worst 3 predictors.
top10 = function(...) ensemble_rank_custom(top_vars = 10, ...)
```
```r
library(SuperLearner)

set.seed(1)
# Takes 93 seconds with 1 core.
sl = SuperLearner(y, x, family = binomial(),
                  # 10-fold cross-validation, stratified on the outcome.
                  cvControl = list(V = 10L, stratifyCV = TRUE),
                  SL.library =
                    list("SL.glm", # Baseline estimator uses all predictors.
                         # Try three ensemble screening options, giving the
                         # screened variable list to logistic regression (SL.glm).
                         c("SL.glm", "top12", "top11", "top10")))

# Review timing.
sl$times$everything

# We do achieve a modest AUC benefit.
ck37r::auc_table(sl, y = y)[, -6]

# Which features were dropped (dropped features show FALSE below)?
t(sl$whichScreen)
```
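As a small follow-up not in the original demo, the fitted SuperLearner can be printed to review each learner's cross-validated risk and ensemble weight, and predictions can be generated with the standard `predict` method:

```r
# Cross-validated risk and ensemble weight (Coef) for each learner,
# including the three screened variants of SL.glm.
sl

# Ensemble predictions, shown on the training data purely for illustration;
# in practice, predict on held-out data.
pred = predict(sl, newdata = x, onlySL = TRUE)
summary(pred$pred)
```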
```r
# Check whether we see stability across multiple runs, especially for
# comparison to individual feature ranking algorithms.
# (See the stability scores in Table 3 of the paper.)
set.seed(2)
# Takes about 90 seconds using 1 core.
system.time({
  results = do.call(rbind.data.frame,
    lapply(1:10, function(i)
      top12(y, x, family = binomial(),
            # The default is 3 replications; more replications increase stability.
            replications = 10,
            detailed_results = TRUE)$ranking))
})
names(results) = names(x)

# Stability looks excellent.
results

# What if we treated each iteration as its own ranking and then aggregated?
agg_reciprocal_rank(t(results))
```
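To put a simple number on that stability (a quick sketch, assuming each column of `results` holds one variable's rank across the 10 runs), we can examine the per-variable standard deviation of the ranks:

```r
# Standard deviation of each variable's rank across the 10 runs;
# values near 0 indicate a highly stable rank.
apply(results, 2, sd)
```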