fastFeatures
fastFeatures is an R package built from the ground up with one goal in mind: high-speed variable selection in typical statistical/data science settings so that you may (and shall) build a predictive model that is easy to explain and share.
Namely, we are focusing on problems of the type:
where
fastFeatures provides an easy-to-use interface to fast variable selection methods. The implemented feature selection functions are:
cVIP is built from the notion of (Conditional) Variable Inclusion Probability as introduced by Bunea et al. [1]. It is built on glmnet and accepts any regression from the following families:
gaussian
binomial
poisson
multinomial
cox
mgaussian
[TODO] rf_cVIP (should be) a ranger-powered bootstrapped random forest. The same methodology applies, with a twist that I have not yet figured out.
You can install the released version of fastFeatures from GitHub with:
devtools:
install.packages("devtools")
UPDATE: It turns out I was wrong about installation via devtools, and there appears to be a bit of a community kerfuffle between Hadley and the folks at devtools. Per https://community.rstudio.com/t/vignettes-suddenly-stopped-installing/18391, try installing as:
devtools::install_github("jameshorine/fastFeatures", build_vignettes = TRUE, build_opts = c("--no-resave-data", "--no-manual"), force = TRUE)
Note: you will need to install knitr and kableExtra so that R can build the markdown vignette.
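As a small convenience, the check below reports which of the two vignette dependencies are missing before you run the install. It is only a sketch: the variable names are made up for illustration, and the actual install call is left commented out so that nothing is installed without your say-so.

```r
# Packages the vignette build needs, per the note above.
needed <- c("knitr", "kableExtra")

# requireNamespace() returns TRUE/FALSE without attaching the package.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before building the vignette: ",
          paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```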
You can access the package vignette via
browseVignettes(package = "fastFeatures")
The general recipe for using this package is:
Below is an outline of how to call cVIP. The inputs:
df
target_column
feature_columns
column_proportion
n_iterations
l1_lambda
glmnet_family
are required user-defined inputs.
The record_proportion parameter has a default value of 5%. This was chosen intentionally because the author (James) desires speed for this application. If you (the user) want to sample more of the records, simply increase the value. Keep in mind that doing so will increase the expected run time of the algorithm.
fastFeatures::cVIP(df = train, target_column = target_variable, feature_columns = feature_variables, column_proportion = 0.25, record_proportion = 0.05, n_iterations = 1000, l1_lambda = 0.0099, glmnet_family = "binomial")
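The call above refers to train, target_variable, and feature_variables without defining them. One plausible way to set them up is sketched here on a made-up data frame. It assumes that target_column takes a single column name and feature_columns a character vector of column names; this README does not confirm that, so check the function's help page before relying on it. The column names are hypothetical.

```r
set.seed(1)

# Toy training data; "churned" is a hypothetical binary target.
train <- data.frame(
  churned = rbinom(200, 1, 0.4),
  tenure  = rnorm(200),
  spend   = rnorm(200),
  visits  = rpois(200, 3)
)

# Assumption: the target is passed by name ...
target_variable <- "churned"
# ... and the candidate features are every other column, also by name.
feature_variables <- setdiff(names(train), target_variable)
```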
This is not built for the Windows platform. I am making use of pbmcapply::pbmclapply() because speed is as important as user feedback. It is simply nice to see progress indicators for a long calculation.
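For readers wondering what a cross-platform fallback could look like in their own code, here is a minimal sketch. The wrapper run_iterations() is hypothetical and not part of fastFeatures: it falls back to a plain sequential lapply() on Windows (or when pbmcapply is not installed) and otherwise uses the forked, progress-reporting pbmclapply().

```r
# Hypothetical wrapper, not part of fastFeatures.
run_iterations <- function(X, FUN, ..., cores = 2L) {
  on_windows <- .Platform$OS.type == "windows"
  has_pbmc   <- requireNamespace("pbmcapply", quietly = TRUE)

  if (on_windows || !has_pbmc) {
    # Sequential fallback: no forking, no progress bar.
    lapply(X, FUN, ...)
  } else {
    # Forked workers with a progress bar in interactive sessions.
    pbmcapply::pbmclapply(X, FUN, ..., mc.cores = cores)
  }
}

squares <- run_iterations(1:3, function(i) i^2)
```

The sequential branch gives up the speed that motivated pbmclapply() in the first place, so this is a portability trade-off rather than a like-for-like replacement.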
The "mixing" properties of this algorithm have not been explored at this time. Users should use their judgement in parameter settings. If you have many predictors, you may not adequately explore the (conditional) Variable Inclusion Probability distribution.
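A coarse sanity check on parameter settings is to count how often each feature can be expected to enter a subsample. The sketch below assumes that every iteration draws a fixed number of features uniformly without replacement, which this README does not state; the helper expected_coverage() is hypothetical. It says nothing about mixing itself, only whether each feature and each pair of features is seen a reasonable number of times.

```r
# Hypothetical helper, not part of fastFeatures.
# Assumes k = round(column_proportion * n_features) features are drawn
# uniformly without replacement in every iteration.
expected_coverage <- function(n_features, column_proportion, n_iterations) {
  k <- max(1, round(column_proportion * n_features))
  c(
    features_per_iteration     = k,
    expected_draws_per_feature = n_iterations * k / n_features,
    expected_draws_per_pair    = n_iterations * k * (k - 1) /
      (n_features * (n_features - 1))
  )
}

# Settings from the example call, with a made-up count of 2000 features.
generous <- expected_coverage(2000, column_proportion = 0.25, n_iterations = 1000)

# A much thinner setting: few iterations and a small column proportion.
thin <- expected_coverage(2000, column_proportion = 0.01, n_iterations = 100)
```

Under these assumptions the thin setting sees each feature only about once on average and almost never sees a given pair together, which is the kind of configuration the warning above is about.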
Contrary to [1], [2] (below), l1_lambda is NOT optimized at every iteration of the algorithm.
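To make the fixed-lambda idea concrete, here is a minimal sketch of a subsampled lasso that estimates an inclusion probability per feature. It is not the package's implementation. The function sketch_cvip() is hypothetical and only borrows cVIP's parameter names, and several details are assumptions: rows and columns are sampled without replacement, the estimate is the number of nonzero coefficients divided by the number of draws, and only single-response families (for example gaussian, binomial, poisson) are handled.

```r
library(glmnet)

# Hypothetical helper, not part of fastFeatures.
sketch_cvip <- function(df, target_column, feature_columns,
                        column_proportion, record_proportion = 0.05,
                        n_iterations, l1_lambda, glmnet_family = "gaussian") {
  p <- length(feature_columns)
  # glmnet needs at least two predictor columns per fit.
  n_cols <- min(p, max(2L, ceiling(column_proportion * p)))
  n_rows <- max(2L, ceiling(record_proportion * nrow(df)))

  drawn    <- setNames(numeric(p), feature_columns)  # times each feature was sampled
  selected <- drawn                                  # times it got a nonzero coefficient

  for (i in seq_len(n_iterations)) {
    cols <- sample(feature_columns, n_cols)  # assumption: without replacement
    rows <- sample(nrow(df), n_rows)         # assumption: without replacement
    x <- as.matrix(df[rows, cols, drop = FALSE])
    y <- df[[target_column]][rows]

    # The same lambda is reused in every iteration; it is never re-tuned.
    fit  <- glmnet(x, y, family = glmnet_family, alpha = 1, lambda = l1_lambda)
    beta <- as.matrix(fit$beta)[, 1]

    drawn[cols]    <- drawn[cols] + 1
    selected[cols] <- selected[cols] + (beta[cols] != 0)
  }

  # A feature that was never drawn yields NaN here.
  selected / drawn
}

# Toy data: only x1 and x2 carry signal.
set.seed(42)
toy <- as.data.frame(matrix(rnorm(400 * 10), ncol = 10))
names(toy) <- paste0("x", 1:10)
toy$y <- 3 * toy$x1 - 3 * toy$x2 + rnorm(400)

cvip <- sketch_cvip(toy, "y", paste0("x", 1:10),
                    column_proportion = 0.3, record_proportion = 0.25,
                    n_iterations = 200, l1_lambda = 1)
round(sort(cvip, decreasing = TRUE), 2)
```

Because the penalty is fixed, each iteration costs one lasso fit rather than a full cross-validated path, which is where the speed comes from. The price is that the result depends on the l1_lambda you choose.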
rf_cVIP
Please direct any feedback to the issues section!