```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The `fastFeatures` package is designed for high-speed variable selection in a variety of settings. Examples of these data situations are all around us, and they bring all sorts of interesting mathematical challenges to the statistician and data scientist. Typical regression problems are of the form $N \gg P$, where the $P$ predictors are assumed independent, so that the matrix inversion $(X^{\prime}X)^{-1}$ is stable. Moreover, $N$ is often small enough that the problem fits neatly into available system memory.
We find that data are rarely so well behaved. `fastFeatures` is one package that helps the analyst or scientist quickly arrive at a reduced model and generate quality "first-pass" analyses.
The `fastFeatures` function `cVIP` implements a modified version of the Conditional Variable Inclusion Probability presented in Bunea et al. (2010). The modified algorithm runs `glmnet::cv.glmnet` once to obtain a reasonably "optimal" $\lambda$. This is a subtle distinction from the method in Bunea et al., who run a cross-validation to determine the mathematically optimal $\lambda$ at each iteration; frankly, I am too impatient for that.
The authors of the original method suggest:
> "If a predictor is selected in more than 50% of the bootstrap samples, we [may] declare it important and investigate its impact further. The bootstrap frequency associated with a certain predictor indicates to what extent the predictor is more likely to be included in the model, rather than be excluded, should we repeat the experiment a large number of times. We refer to this frequency as the variable inclusion probability (VIP). Of course, the threshold of 50% is user specified, we regard this value as a minimum requirement on investigating further a specific predictor. Since our goal is not to miss any possibly relevant predictors, we have opted for using the 50% conservative threshold."
We can change this threshold; however, preliminary testing against the data from the Santander challenge has shown that 50% captures a reasonable amount of information.
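To make the VIP idea concrete, here is a minimal sketch of the bootstrap inclusion-probability calculation on simulated data. This is illustrative only, not the package's `cVIP` implementation: the simulated predictors, the number of replications `n_boot`, and the fixed `lambda` are all assumptions for the toy example.

```r
library(glmnet)

set.seed(123)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))  # only the first two predictors matter

n_boot <- 50
lambda <- 0.05          # assumed pre-tuned penalty (see cv.glmnet below)
counts <- numeric(p)

for (b in seq_len(n_boot)) {
  idx <- sample(n, n, replace = TRUE)       # bootstrap resample of the rows
  fit <- glmnet(x[idx, ], y[idx], family = "binomial", lambda = lambda)
  # count a predictor as "selected" when its coefficient is nonzero
  counts <- counts + (as.matrix(coef(fit))[-1, 1] != 0)
}

vip <- counts / n_boot  # variable inclusion probabilities
which(vip > 0.5)        # predictors passing the 50% threshold
```

On data like these, the truly informative predictors should pass the 50% threshold far more often than the noise columns.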
Care should be used when setting up a call to `cVIP`. The algorithm is built on the modified bootstrap LASSO, and several parameters control how it explores the conditional distribution of the (modified) Variable Inclusion Probability. In particular, `l1_lambda`, `column_proportion`, and `record_proportion` all control how the algorithm explores the posterior density over the `n_iterations`-many bootstrap replications.
General guidelines are to pre-tune for a good-enough value of $\lambda$, and to choose a value of `record_proportion` that ensures roughly 10-20x coverage. That is to say, we want the quantity

$$ \text{n\_iterations} \times \text{record\_proportion} > 20 $$
Simulation studies in bootstrapping and MCMC suggest that 20 is a good rule of thumb; you may need to explore this further for your own data.
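The two guidelines above can be sketched as follows: pre-tune the penalty once with `glmnet::cv.glmnet`, then check the coverage rule of thumb for your chosen settings. The toy data and the use of `lambda.1se` as the "good enough" penalty are my assumptions, not a prescription from the package.

```r
library(glmnet)

set.seed(4321)
x <- matrix(rnorm(500 * 20), 500, 20)
y <- rbinom(500, 1, plogis(x[, 1]))

# one cross-validated fit to pre-tune the L1 penalty
cv_fit    <- cv.glmnet(x, y, family = "binomial")
l1_lambda <- cv_fit$lambda.1se     # a "good enough" (assumed) choice

# check the coverage rule of thumb before running cVIP
n_iterations      <- 100
record_proportion <- 0.25
n_iterations * record_proportion > 20   # TRUE: roughly 25x coverage
```

The resulting `l1_lambda` would then be passed to `cVIP` in place of a hand-picked value.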
For this vignette, we will look at the data-set for the Santander Customer Transaction Prediction challenge at Kaggle.
```r
library(fastFeatures)
library(ggplot2)
library(magrittr)
```
Point R to the correct directory for the data. It is advised to use `data.table` because it is much faster than other data-frame packages in R (and Python).
```r
data("kaggleSample")

# random seed
set.seed(4321)

# grab a subset of data
train_idx <- sample(1:dim(kaggleSample)[1],
                    size = round(x = 0.75 * dim(kaggleSample)[1], digits = 0),
                    replace = FALSE)
train <- kaggleSample[train_idx, ]
```
```r
dim(train)
table(train$target)
object.size(train)
```
Set up for `fastFeatures` (or any easy-to-follow data-science problem):
```r
# id columns ----
record_identifier <- "ID_code"

# target ----
target_variable <- "target"

# optional columns to exclude ----
opt_cols_exclude <- NULL

# features ----
feature_variables <- names(train)[!names(train) %in%
                                    c(target_variable, record_identifier, opt_cols_exclude)]
```
Run `cVIP`:

```r
results <- fastFeatures::cVIP(df = train,
                              target_column = target_variable,
                              feature_columns = feature_variables,
                              column_proportion = 0.25,
                              record_proportion = 0.25,
                              n_iterations = 10,
                              l1_lambda = 0.0099,
                              glmnet_family = "binomial")
```
```r
ggplot(results, aes(x = `Conditional Variable Inclusion Probability`)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white", binwidth = 0.02) +
  geom_density(alpha = 0.2, fill = "#FF6666") +
  xlab("Conditional Variable Inclusion Probability")
```
```r
thresh <- 0.5
included_vars_index <- results[["Conditional Variable Inclusion Probability"]] > thresh
print(sum(included_vars_index) / length(feature_variables))
```
```r
subset(results,
       subset = included_vars_index,
       select = "Conditional Variable Inclusion Probability") %>%
  knitr::kable(caption = "Included Variables", digits = 2)
```