knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Background

The fastFeatures package is designed for high-speed variable selection in a variety of settings:

  1. Large N
  2. Large P
  3. P >> N

where N is the number of observations (rows) and P is the number of predictors (columns).

Examples of these types of data situations are all around us, and they bring all sorts of interesting mathematical challenges to the statistician and data scientist. Typical regression problems are of the form N >> P, where the $P$ predictors are assumed to be independent, so that the matrix inversion $(X^{\prime}X)^{-1}$ is "stable". Moreover, N is often small enough that the problem fits neatly into available system memory.
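As a small simulated illustration (not package code) of why that inversion breaks down when P > N:

#simulate N = 10 observations of P = 50 predictors
set.seed(1)
X <- matrix(rnorm(10 * 50), nrow = 10, ncol = 50)

#X'X is 50 x 50 but has rank at most N = 10, so it is singular and cannot be inverted
qr(crossprod(X))$rank
#solve(crossprod(X))  # would fail: the system is computationally singular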

We find that data are rarely so well behaved. fastFeatures is one package that helps the analyst/scientist quickly arrive at a reduced model to generate quality "first-pass" analyses.

The Algorithm

The fastFeatures package function cVIP implements a modified version of the Conditional Variable Inclusion Probability presented in Bunea et al. (2010). The modified algorithm:

  1. [Global (pre)-Tuning] The user performs cross-validation using glmnet::cv.glmnet to obtain a reasonably "optimal" $\lambda$
  2. [Bootstrap]
    • Select random sub-sample of data with replacement
    • Select random columns without replacement
    • Fit a LASSO on the random sub-sample of data using the selected predictors
  3. [Summarize] Record the frequency with which selected variables are kept in the model conditioned on the provided $\lambda$

This is a subtle distinction from the method in Bunea et al., as they run a cross-validation to determine the mathematically optimal $\lambda$ at each bootstrap iteration; frankly, I am too impatient.
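To make the loop concrete, below is a minimal sketch of the bootstrap step written directly against glmnet. It is not the package's internal implementation; the function name bootstrap_vip_sketch and its arguments (X, y, lambda, n_iterations, record_proportion, column_proportion) are illustrative assumptions chosen to mirror the cVIP arguments used later in this vignette.

library(glmnet)

bootstrap_vip_sketch <- function(X, y, lambda,
                                 n_iterations = 100,
                                 record_proportion = 0.25,
                                 column_proportion = 0.25) {
  #track how often each column is offered to the LASSO, and how often it survives
  trial_counts     <- setNames(numeric(ncol(X)), colnames(X))
  selection_counts <- setNames(numeric(ncol(X)), colnames(X))

  for (i in seq_len(n_iterations)) {
    #random sub-sample of rows with replacement, random columns without replacement
    rows <- sample(nrow(X), size = round(record_proportion * nrow(X)), replace = TRUE)
    cols <- sample(colnames(X), size = round(column_proportion * ncol(X)), replace = FALSE)

    #fit a LASSO at the single, pre-tuned lambda (no per-iteration cross-validation)
    fit <- glmnet(as.matrix(X[rows, cols, drop = FALSE]), y[rows],
                  family = "binomial", lambda = lambda)

    #a column "survives" if its coefficient is non-zero at this lambda
    beta <- coef(fit)
    kept <- setdiff(rownames(beta)[as.vector(beta) != 0], "(Intercept)")

    trial_counts[cols]     <- trial_counts[cols] + 1
    selection_counts[kept] <- selection_counts[kept] + 1
  }

  #conditional inclusion probability: times kept / times the column was offered
  selection_counts / pmax(trial_counts, 1)
}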

Interpreting Results

The authors of the original method suggest:

"If a predictor is selected in more than 50% of the bootstrap samples, we [may] declare it important and investigate its impact further. The bootstrap frequency associated with a certain predictor indicates to what extent the predictor is more likely to be included in the model, rather than be excluded, should we repeat the experiment a large number of times. We refer to this frequency as the variable inclusion probability (VIP). Of course, the threshold of 50% is user specified, we regard this value as a minimum requirement on investigating further a specific predictor. Since our goal is not to miss any possibly relevant predictors, we have opted for using the 50% conservative threshold."

We can change this threshold; however, preliminary testing against the data from the Santander challenge has shown that 50% captures a reasonable amount of information.

A Note on "Mixing"

Care should be used when setting up the process to use cVIP. The algorithm is built on the modified bootstrap LASSO, and several parameters control how it explores the conditional distribution of the (modified) Variable Inclusion Probability. In particular, n_iterations, record_proportion, column_proportion, and l1_lambda all control how the algorithm explores the posterior density over the n_iterations-many bootstrap replications.

General guidelines are to pre-tune for a good-enough value of $\lambda$ and to choose a value of record_proportion that ensures roughly 10-20x coverage. That is to say, we want the quantity

$$ \text{n\_iterations} \times \text{record\_proportion} > 20 $$

Simulation studies in bootstrapping and MCMC seem to suggest that 20 is a "good" rule of thumb. You may have to explore this more.
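As a quick check of the coverage heuristic (the settings below are hypothetical, chosen only to illustrate the arithmetic):

#hypothetical settings for illustrating the coverage rule of thumb
n_iterations      <- 100
record_proportion <- 0.25
n_iterations * record_proportion  #25, comfortably above the suggested minimum of 20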

Getting Started

For this vignette, we will look at the data-set for the Santander Customer Transaction Prediction challenge at Kaggle.

Packages

library(fastFeatures)
library(ggplot2)
library(magrittr)

Load-In

For this vignette we load the sample data shipped with the package. For full-sized files, using data.table is advised because it is much faster than other data-frame packages in R (and Python).
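For reference, a minimal sketch of reading a full-sized file with data.table (the file path is a hypothetical placeholder; this vignette simply loads the bundled sample below):

#illustrative only: fast CSV read with data.table; the path is a placeholder
library(data.table)
full_data <- fread("path/to/santander_train.csv")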

data("kaggleSample")

#random seed
set.seed(4321)

#grab a subset of the data for training
train_idx <- sample(seq_len(nrow(kaggleSample)),
                    size = round(0.75 * nrow(kaggleSample)),
                    replace = FALSE)
train <- kaggleSample[train_idx, ]

Data Information

dim(train)
table(train$target)
object.size(train)

Setup

Set up the column roles for fastFeatures (or for any easy-to-follow data-science problem):

#id columns ----
record_identifier <- "ID_code"

#target ----
target_variable <- "target"

#optional columns to exclude ---- 
opt_cols_exclude <- NULL

#features ----
feature_variables <- names(train)[!names(train) %in% c(target_variable,
                                                       record_identifier,
                                                       opt_cols_exclude)]

Using cVIP

Usage:

results <- fastFeatures::cVIP(df = train,
                              target_column = target_variable,
                              feature_columns = feature_variables,
                              column_proportion = 0.25,
                              record_proportion = 0.25,
                              n_iterations = 10,
                              l1_lambda = 0.0099,
                              glmnet_family = "binomial")

Distribution of (modified) Conditional Variable Inclusion Probability:

ggplot(results, aes(x = `Conditional Variable Inclusion Probability`)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white", binwidth = 0.02) +
  geom_density(alpha = 0.2, fill = "#FF6666") +
  xlab("Conditional Variable Inclusion Probability")

The proportion of variables that are candidates for further investigation:

thresh <- 0.5
included_vars_index <- results[["Conditional Variable Inclusion Probability"]] > thresh

print(sum(included_vars_index) / length(feature_variables))

Variables that we may (and shall) investigate:

subset(results, subset = included_vars_index, select = "Conditional Variable Inclusion Probability") %>%
  knitr::kable(caption = "Included Variables", digits = 2)

