Given a set of candidate predictors, this function uses random forests to select the best ones in a stepwise fashion. Both the procedure and the algorithm borrow heavily from the VSURF package, with some modifications. These modifications allow for unbiased computation of variable importance via the cforest function in the party package.
formula: a formula, such as y ~ x1 + x2.
data: the dataset containing the predictors and response.
nruns: how many times should random forests be run to compute variable importance? Defaults to 50.
silent: should the algorithm talk to you?
importance: the type of variable importance to compute; either "permutation" or "gini".
nmin: the number of times the "minimum value" is multiplied to set the threshold value.
...: other arguments passed to cforest.
What follows is the documentation for the original algorithm in VSURF:
Three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") is dedicated to eliminating irrelevant variables from the dataset. The second step ("interpretation step") aims to select all variables related to the response for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.
First step ("thresholding step"): first, nfor.thres random forests are computed using the function randomForest with the argument importance=TRUE. Then variables are sorted according to their mean variable importance (VI), in decreasing order. This order is kept throughout the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of the VIs. Finally, the actual "thresholding step" is performed: only variables with a mean VI larger than nmin * min.thres are kept.
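The thresholding step can be sketched in base R. This is a simplified illustration on simulated importance scores, not the package's implementation: the matrix vi, the variable names, and all values are made up, and min(vi_sd) stands in for the minimum predicted value of the pruned CART tree that VSURF actually fits to the SD curve.

```r
# Simulated variable-importance matrix: nruns rows x p variables
# (all names and values are illustrative).
set.seed(1)
vi <- matrix(abs(rnorm(50 * 6)), nrow = 50,
             dimnames = list(NULL, paste0("x", 1:6)))

# Sort variables by mean VI, decreasing; this order is kept throughout.
vi_mean <- sort(colMeans(vi), decreasing = TRUE)
vi_sd   <- apply(vi, 2, sd)[names(vi_mean)]

# Simplified threshold: min of the VI standard deviations stands in for
# the pruned-CART minimum (min.thres) used by the real algorithm.
nmin <- 1
min.thres <- min(vi_sd)

# Keep only variables whose mean VI exceeds nmin * min.thres.
kept <- names(vi_mean)[vi_mean > nmin * min.thres]
```

Raising nmin makes the threshold stricter, discarding more variables at this stage.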
Second step ("interpretation step"): the variables selected by the first step are considered. nfor.interp embedded random forest models are grown, starting with the random forest built with only the most important variable and ending with one containing all variables selected in the first step. Then err.min, the minimum mean out-of-bag (OOB) error of these models, and its associated standard deviation sd.min are computed. Finally, the smallest model (and hence its corresponding variables) having a mean OOB error less than err.min + nsd * sd.min is selected.
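The selection rule of the interpretation step reduces to a one-liner once the OOB errors of the nested models are in hand. In this sketch the error and SD vectors are hypothetical (model k uses the k most important variables):

```r
# Hypothetical OOB errors and SDs for six nested models.
oob_err <- c(0.30, 0.22, 0.19, 0.17, 0.17, 0.16)
sd_err  <- c(0.04, 0.03, 0.02, 0.02, 0.02, 0.02)

nsd <- 1
err.min <- min(oob_err)                # 0.16, reached by the full model
sd.min  <- sd_err[which.min(oob_err)]  # SD associated with that model

# Smallest model whose mean OOB error is below err.min + nsd * sd.min.
k <- which(oob_err < err.min + nsd * sd.min)[1]
```

Here the threshold is 0.18, so the four-variable model is the smallest one that qualifies; the "1 SD" slack (nsd = 1) is what lets a smaller model beat the global minimum.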
Third step ("prediction step"): the starting point is the same as in the second step. However, the variables are now added to the model in a stepwise manner. mean.jump, the mean jump value, is calculated using the variables that were left out by the second step, and is set as the mean absolute difference between the mean OOB errors of one model and of the model that follows it. A variable is then included in the model only if the mean OOB error decrease is larger than nmj * mean.jump.
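The inclusion test of the prediction step can be sketched as follows. All error values here are hypothetical; err_discarded plays the role of the OOB errors of successive models built on the variables the second step discarded, from which mean.jump is derived:

```r
# Hypothetical OOB errors of successive models over discarded variables.
err_discarded <- c(0.200, 0.198, 0.201, 0.199)

# mean.jump: mean absolute difference between one model's OOB error
# and the next model's.
mean.jump <- mean(abs(diff(err_discarded)))

# A candidate variable is kept only if it lowers the mean OOB error
# by more than nmj * mean.jump.
nmj <- 1
current_err   <- 0.180  # error of the model so far (illustrative)
candidate_err <- 0.172  # error after adding the candidate (illustrative)
include <- (current_err - candidate_err) > nmj * mean.jump
```

Intuitively, mean.jump estimates the noise-level fluctuation in OOB error, so a variable is admitted only when its improvement clearly exceeds that noise floor.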
The object returned has the following attributes:
variable.importance: a sorted vector of each variable's importance measure.
importance.sd: the standard deviation of variable importance, measured across the nruns runs.
stepwise.error: the OOB error after each variable is added to the model.
response: the response variable that was modeled.
variables: a vector of strings indicating which variables were included in the initial model.
nruns: how many times the random forest was initially run.
formula: the formula used for the last model.
data: the dataset used to fit the model.
oob: the OOB error of the entire model.
time: how long the algorithm ran.
rfmodel: the final model used, a randomForest object.
Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot, with modifications by Dustin Fife
Genuer, R., Poggi, J.-M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236.
Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007), Bias in random forest variable importance measures: Illustrations, sources and a solution, BMC Bioinformatics 8(25), 1-21. doi: 10.1186/1471-2105-8-25.