knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(flevr)
flevr
is a R
package for doing variable selection based on flexible ensembles. The package provides functions for extrinsic variable selection using the Super Learner and for intrinsic variable selection using the Shapley Population Variable Importance Measure (SPVIM).
The author and maintainer of the flevr
package is Brian Williamson. For details on the method, check out our preprint.
You can install a development release of flevr
from GitHub via devtools
by running the following code:
# install devtools if you haven't already # install.packages("devtools", repos = "https://cloud.r-project.org") devtools::install_github(repo = "bdwilliamson/flevr")
This section should serve as a quick guide to using the flevr
package --- we will cover the main functions for doing extrinsic and intrinsic variable selection using a simulated data example. More details are given in the specific vignettes for extrinsic selection and intrinsic selection.
First, we create some data:
# generate the data -- note that this is a simple setting, for speed set.seed(4747) p <- 2 n <- 500 # generate features x <- replicate(p, stats::rnorm(n, 0, 1)) x_df <- as.data.frame(x) x_names <- names(x_df) # generate outcomes y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
This creates a matrix of covariates x
with 2 columns and a vector y
of normally-distributed outcome values for a sample of n = 500
study participants.
There are two main types of variable selection available in flevr
: extrinsic and intrinsic. Extrinsic selection is the most common type of variable selection: in this approach, a given algorithm (and perhaps its associated algorithm-specific variable importance) is used for variable selection. The lasso is a widely-used example of extrinsic selection. Intrinsic selection, on the other hand, uses estimated intrinsic variable importance (a population quantity) to perform variable selection. This intrinsic importance is both defined and estimated in a model-agnostic manner.
We recommend using the Super Learner @ref(vanderlaan2007) to do extrinsic variable selection to protect against model misspecification; more details on this procedure are available in the vignette on extrinsic selection. This requires specifying a library of candidate learners (e.g., lasso, random forests). We can do this in flevr
using the following code:
set.seed(1234) # fit a Super Learner ensemble; note its simplicity, for speed library("SuperLearner") learners <- c("SL.glm", "SL.mean") V <- 2 fit <- SuperLearner::SuperLearner(Y = y, X = x_df, SL.library = learners, cvControl = list(V = V)) # extract importance based on the whole Super Learner sl_importance_all <- extract_importance_SL( fit = fit, feature_names = x_names, import_type = "all" ) sl_importance_all
These results suggest that feature 2 is more important than feature 1 within the Super Learner ensemble (since a lower rank is better). If we want to scrutinize the importance of features within the best-fitting algorithm in the Super Learner ensemble, we can do the following:
sl_importance_best <- extract_importance_SL( fit = fit, feature_names = x_names, import_type = "best" ) sl_importance_best
Finally, to do variable selection, we need to select a threshold (ideally before looking at the data). In this case, since there are only two variables, we choose a threshold of 1.5, which means we will select only one variable:
extrinsic_selected <- extrinsic_selection( fit = fit, feature_names = x_names, threshold = 1.5, import_type = "all" ) extrinsic_selected
In this case, we select only variable 2.
Intrinsic variable selection is based on population variable importance @ref(williamson2020c); more details on this procedure are available in the vignette on intrinsic selection. Intrinsic selection also uses the Super Learner under the hood, and requires specifying a useful measure of predictiveness (e.g., R-squared or classification accuracy). The first step in doing intrinsic selection is estimating the variable importance:
set.seed(1234) # set up a library for SuperLearner learners <- "SL.glm" univariate_learners <- "SL.glm" V <- 2 # estimate the SPVIMs library("vimp") est <- suppressWarnings( sp_vim(Y = y, X = x, V = V, type = "r_squared", SL.library = learners, gamma = .1, alpha = 0.05, delta = 0, cvControl = list(V = V), env = environment()) ) est
This procedure again shows (correctly) that variable 2 is more important than variable 1 in this population.
The next step is to choose an error rate to control and a method for controlling the family-wise error rate. Here, we choose the generalized family-wise error rate to control overall and choose Holm-adjusted p-values to control the individual family-wise error rate:
intrinsic_set <- intrinsic_selection( spvim_ests = est, sample_size = n, alpha = 0.2, feature_names = x_names, control = list( quantity = "gFWER", base_method = "Holm", k = 1) ) intrinsic_set
In this case, we select both variables.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.