Ensemble feature ranking for variable selection in SuperLearner ensembles [@polley2021package], based on @effrosynidis2021evaluation. Multiple algorithms each estimate a ranking of the strength of the relationship between the predictors and the outcome in the training set, and these rankings are combined into a single ranking via an aggregation method (currently reciprocal rank aggregation). The final ranking can then be cut at a certain number of variables (e.g. the top 10 predictors or the top 70%) to create one or more feature selection wrappers for SuperLearner. The result should generally be more robust and stable than feature selection based on a single algorithm. See also @neumann2017efs for a similar method.
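To make the aggregation step concrete, here is a toy sketch of reciprocal rank aggregation. It only illustrates the idea; the package's own implementation is `agg_reciprocal_rank()`, used at the end of the demo below.

```r
# Each row is one algorithm's ranking of three features (1 = strongest
# relationship with the outcome).
ranks = rbind(alg1 = c(x1 = 1, x2 = 2, x3 = 3),
              alg2 = c(x1 = 2, x2 = 1, x3 = 3),
              alg3 = c(x1 = 1, x2 = 3, x3 = 2))

# Score each feature by summing its reciprocal ranks; higher is better.
scores = colSums(1 / ranks)

# Aggregated ranking: x1 (2.50) > x2 (~1.83) > x3 (~1.17). Cutting this
# at the top k features yields a feature selection wrapper.
sort(scores, decreasing = TRUE)
```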
Install from GitHub:

```r
# install.packages("remotes")
remotes::install_github("ck37/featurerank")
```
Currently implemented feature ranking algorithms are:

- `featrank_cor` (correlation)
- `featrank_glm` (generalized linear model)
- `featrank_glmnet` (penalized regression via glmnet)
- `featrank_randomForest` (random forest)
- `featrank_dbarts` (BART via dbarts)
- `featrank_shap` (SHAP values)
Below is a minimal example demonstrating how the package can be used.
```r
# The packages are loaded here so that their startup output is suppressed;
# library() calls appear again later in the demo so that readers can see them.
library(SuperLearner)
library(glmnet)
# https://github.com/cjcarlson/embarcadero
library(embarcadero)
library(dbarts)
library(weights)
library(randomForest)
library(ck37r)

# Ignore warnings, e.g. from glm().
options("warn" = -1)
```
```r
# TODO: switch to a less problematic demo dataset.
data(Boston, package = "MASS")

# Use "chas" as our outcome variable, which is binary.
y = Boston$chas
x = subset(Boston, select = -chas)
```
Specify the feature ranking wrappers for the ensemble library.
```r
library(featurerank)

# Modify the random forest feature ranker to use 100 trees
# (faster than the default of 500).
featrank_randomForest100 = function(...)
  featrank_randomForest(ntree = 100L, ...)

# Specify the set of feature ranking algorithms.
ensemble_rank_custom = function(top_vars, ...)
  ensemble_rank(fn_rank = c(featrank_cor,
                            featrank_randomForest100,
                            featrank_glm,
                            featrank_glmnet),
                            #featrank_shap, # too verbose currently
                            #featrank_dbarts), # skip for speed
                top_vars = top_vars, ...)

# There are 13 total vars, so try dropping 1 of them.
top12 = function(...) ensemble_rank_custom(top_vars = 12, ...)
# Try dropping the worst 2 predictors.
top11 = function(...) ensemble_rank_custom(top_vars = 11, ...)
# Drop the worst 3 predictors.
top10 = function(...) ensemble_rank_custom(top_vars = 10, ...)
```
```r
library(SuperLearner)

set.seed(1)
# Takes 93 seconds with 1 core.
sl = SuperLearner(y, x, family = binomial(),
                  # 10-fold cross-validation, stratified on the outcome.
                  cvControl = list(V = 10L, stratifyCV = TRUE),
                  SL.library =
                    list("SL.glm", # Baseline estimator uses all predictors.
                         # Try three ensemble screening options, giving the
                         # screened variable list to logistic regression (SL.glm).
                         c("SL.glm", "top12", "top11", "top10")))

# Review timing.
sl$times$everything

# We do achieve a modest AUC benefit.
ck37r::auc_table(sl, y = y)[, -6]

# Which features were dropped (dropped features show FALSE below)?
t(sl$whichScreen)
```
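As a small follow-up not in the original demo, the fitted SuperLearner can be printed to review each learner's cross-validated risk and ensemble weight, and predictions can be generated with the standard `predict` method:

```r
# Cross-validated risk and ensemble weight (Coef) for each learner,
# including the three screened variants of SL.glm.
sl

# Ensemble predictions, shown on the training data purely for illustration;
# in practice, predict on held-out data.
pred = predict(sl, newdata = x, onlySL = TRUE)
summary(pred$pred)
```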
```r
# Check whether we see stability across multiple runs, especially for
# comparison to individual feature ranking algorithms.
# (See the stability scores in Table 3 of the paper.)
set.seed(2)
# Takes about 90 seconds using 1 core.
system.time({
  results = do.call(rbind.data.frame,
    lapply(1:10, function(i)
      top12(y, x, family = binomial(),
            # The default is 3 replications; more replications increase stability.
            replications = 10,
            detailed_results = TRUE)$ranking))
})
names(results) = names(x)

# Stability looks excellent.
results

# What if we treated each iteration as its own ranking and then aggregated?
agg_reciprocal_rank(t(results))
```
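To put a simple number on that stability (a quick sketch, assuming each column of `results` holds one variable's rank across the 10 runs), we can examine the per-variable standard deviation of the ranks:

```r
# Standard deviation of each variable's rank across the 10 runs;
# values near 0 indicate a highly stable rank.
apply(results, 2, sd)
```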