Given a set of candidate predictors, this function uses random forests to select the best ones in a stepwise fashion. Both the procedure and the algorithm borrow heavily from the VSURF package, with some modifications. These modifications allow for unbiased computation of variable importance via the cforest function in the party package.
formula: a formula, such as y ~ x1 + x2.
data: the dataset containing the predictors and response.
nruns: how many times should random forests be run to compute variable importance? Defaults to 50.
silent: should the algorithm talk to you?
importance: the type of variable importance to compute; either "permutation" or "gini".
nmin: the number of times the "minimum value" is multiplied to set the threshold value.
...: other arguments passed to cforest.
What follows is the documentation for the original algorithm in VSURF:
Three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") is dedicated to eliminating irrelevant variables from the dataset. The second step ("interpretation step") aims to select all variables related to the response for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.
First step ("thresholding step"): first, nfor.thres random forests are computed using the function randomForest with the argument importance=TRUE. Then variables are sorted according to their mean variable importance (VI), in decreasing order. This order is kept throughout the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of the VIs. Finally, the actual "thresholding step" is performed: only variables with a mean VI larger than nmin * min.thres are kept.
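The thresholding step can be sketched in base R. This is a simplified illustration on simulated importance scores, not the package's implementation: the matrix vi, the variable names, and all values are made up, and min(vi_sd) stands in for the minimum predicted value of the pruned CART tree that VSURF actually fits to the SD curve.

```r
# Simulated variable-importance matrix: nruns rows x p variables
# (all names and values are illustrative).
set.seed(1)
vi <- matrix(abs(rnorm(50 * 6)), nrow = 50,
             dimnames = list(NULL, paste0("x", 1:6)))

# Sort variables by mean VI, decreasing; this order is kept throughout.
vi_mean <- sort(colMeans(vi), decreasing = TRUE)
vi_sd   <- apply(vi, 2, sd)[names(vi_mean)]

# Simplified threshold: min of the VI standard deviations stands in for
# the pruned-CART minimum (min.thres) used by the real algorithm.
nmin <- 1
min.thres <- min(vi_sd)

# Keep only variables whose mean VI exceeds nmin * min.thres.
kept <- names(vi_mean)[vi_mean > nmin * min.thres]
```

Raising nmin makes the threshold stricter, discarding more variables at this stage.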
Second step ("interpretation step"): the variables selected by the first step are considered. nfor.interp embedded random forest models are grown, starting with the random forest built with only the most important variable and ending with one containing all variables selected in the first step. Then err.min, the minimum mean out-of-bag (OOB) error of these models, and its associated standard deviation sd.min are computed. Finally, the smallest model (and hence its corresponding variables) having a mean OOB error less than err.min + nsd * sd.min is selected.
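The selection rule of the interpretation step reduces to a one-liner once the OOB errors of the nested models are in hand. In this sketch the error and SD vectors are hypothetical (model k uses the k most important variables):

```r
# Hypothetical OOB errors and SDs for six nested models.
oob_err <- c(0.30, 0.22, 0.19, 0.17, 0.17, 0.16)
sd_err  <- c(0.04, 0.03, 0.02, 0.02, 0.02, 0.02)

nsd <- 1
err.min <- min(oob_err)                # 0.16, reached by the full model
sd.min  <- sd_err[which.min(oob_err)]  # SD associated with that model

# Smallest model whose mean OOB error is below err.min + nsd * sd.min.
k <- which(oob_err < err.min + nsd * sd.min)[1]
```

Here the threshold is 0.18, so the four-variable model is the smallest one that qualifies; the "1 SD" slack (nsd = 1) is what lets a smaller model beat the global minimum.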
Third step ("prediction step"): the starting point is the same as in the second step. However, the variables are now added to the model in a stepwise manner. mean.jump, the mean jump value, is calculated using the variables that were left out by the second step, and is set as the mean absolute difference between the mean OOB errors of one model and of the model that follows it. A variable is then included in the model only if the mean OOB error decrease is larger than nmj * mean.jump.
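The inclusion test of the prediction step can be sketched as follows. All error values here are hypothetical; err_discarded plays the role of the OOB errors of successive models built on the variables the second step discarded, from which mean.jump is derived:

```r
# Hypothetical OOB errors of successive models over discarded variables.
err_discarded <- c(0.200, 0.198, 0.201, 0.199)

# mean.jump: mean absolute difference between one model's OOB error
# and the next model's.
mean.jump <- mean(abs(diff(err_discarded)))

# A candidate variable is kept only if it lowers the mean OOB error
# by more than nmj * mean.jump.
nmj <- 1
current_err   <- 0.180  # error of the model so far (illustrative)
candidate_err <- 0.172  # error after adding the candidate (illustrative)
include <- (current_err - candidate_err) > nmj * mean.jump
```

Intuitively, mean.jump estimates the noise-level fluctuation in OOB error, so a variable is admitted only when its improvement clearly exceeds that noise floor.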
The object returned has the following attributes:
variable.importance: a sorted vector of each variable's importance measure.
importance.sd: the standard deviation of variable importance, measured across the nruns runs.
stepwise.error: the OOB error after each variable is added to the model.
response: the response variable that was modeled.
variables: a vector of strings indicating which variables were included in the initial model.
nruns: how many times the random forest was initially run.
formula: the formula used for the last model.
data: the dataset used to fit the model.
oob: the OOB error of the entire model.
time: how long the algorithm ran.
rfmodel: the final model used, a randomForest object.
Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot, with modifications by Dustin Fife
Genuer, R., Poggi, J.-M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236.
Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007), Bias in random forest variable importance measures: Illustrations, sources and a solution, BMC Bioinformatics 8(25), 1-21. doi: 10.1186/1471-2105-8-25.