rfThresh: Variable Selection Using Random Forests


View source: R/rf.thresh.R

Description

Given a set of predictors, this function uses random forests to select the best ones in a stepwise fashion. Both the procedure and the algorithm are borrowed heavily from the VSURF package, with some modifications. These modifications allow for unbiased computation of variable importance via the cforest function in the party package.

Usage

rfThresh(
  formula,
  data,
  nruns = 50,
  silent = FALSE,
  importance = "permutation",
  nmin = 1,
  ...
)
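
A minimal call might look like the sketch below. It assumes the fifer package has been installed from GitHub (dustinfife/fifer) and uses the built-in iris data purely for illustration; the choice of response and predictors is hypothetical, and the call is skipped when the package is not available.

```r
# Illustrative formula on the built-in iris data (not from the package docs)
f <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

if (requireNamespace("fifer", quietly = TRUE)) {
  set.seed(1)  # random forests are stochastic; fix the seed for repeatability
  fit <- fifer::rfThresh(f, data = iris, nruns = 10, silent = TRUE,
                         importance = "permutation", nmin = 1)
  str(fit, max.level = 1)  # inspect the components of the returned object
} else {
  message("fifer is not installed; skipping the rfThresh() call")
}
```

A small nruns keeps the example quick; the default of 50 gives more stable importance estimates at the cost of run time.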

Arguments

formula

A formula, such as y ~ x1 + x2, where y is the response variable and the terms following ~ are the predictors.

data

the dataset containing the predictors and response.

nruns

How many times should random forests be run to compute variable importance? Defaults to 50.

silent

Should progress messages be suppressed? Defaults to FALSE, so the algorithm reports its progress.

importance

Either "permutation" or "gini". Defaults to "permutation".

nmin

Number of times the "minimum value" is multiplied to set the threshold value. Defaults to 1.

...

other arguments passed to cforest or randomForest
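
To make the "permutation" option more concrete, here is a rough sketch of the general idea behind permutation importance: shuffle one predictor at a time and measure how much the prediction error grows. This is not how rfThresh computes it internally. The sketch swaps in a linear model and in-sample error as a stand-in for a random forest and its out-of-bag error, and the data and variable names are simulated for illustration.

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 + 0.5 * dat$x2 + rnorm(n)  # x3 is pure noise

# Stand-in model: a linear regression instead of a random forest
fit <- lm(y ~ x1 + x2 + x3, data = dat)
mse <- function(model, d) mean((d$y - predict(model, newdata = d))^2)
base_mse <- mse(fit, dat)

perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and y
  mse(fit, shuffled) - base_mse           # increase in error = importance
})

print(sort(perm_importance, decreasing = TRUE))
```

A predictor that carries real signal (x1 here) produces a large increase in error when shuffled, while a noise predictor (x3) produces an increase close to zero.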

Details

What follows is the documentation for the original algorithm in VSURF:

Three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") is dedicated to eliminating irrelevant variables from the dataset. The second step ("interpretation step") aims to select all variables related to the response, for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.
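
The thresholding step can be sketched roughly as follows. The importance values are simulated rather than produced by a forest, and the threshold rule is a simplification: VSURF derives its "minimum value" from a CART tree fitted to the standard deviations of importance, whereas this sketch uses the raw minimum standard deviation. The exact rule used by rfThresh may differ.

```r
set.seed(7)
nruns <- 50
vars <- c("x1", "x2", "x3", "x4", "x5")
true_mean <- c(8, 3, 0.5, 0, 0)          # made-up importance levels
true_sd   <- c(0.5, 0.3, 0.1, 0.05, 0.05)

# One row per run, one column per variable (simulated, not real forest output)
imp <- sapply(seq_along(vars), function(j) rnorm(nruns, true_mean[j], true_sd[j]))
colnames(imp) <- vars

# Mean and standard deviation of importance across the runs
imp_mean <- colMeans(imp)
imp_sd   <- apply(imp, 2, sd)

# Sort variables by decreasing mean importance
ord <- order(imp_mean, decreasing = TRUE)
imp_mean <- imp_mean[ord]
imp_sd   <- imp_sd[ord]

# Simplified threshold: nmin times the smallest standard deviation
nmin <- 1
threshold <- nmin * min(imp_sd)
kept <- names(imp_mean)[imp_mean > threshold]
print(kept)
```

Raising nmin raises the threshold, so fewer variables survive the first step.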

Value

The object returned has the following attributes:

variable.importance

A sorted vector of the importance measure for each variable.

importance.sd

the standard deviation of variable importance, measured across the nruns iterations.

stepwise.error

The out-of-bag (OOB) error after each variable is added to the model.

response

The response variable that was modeled.

variables

A vector of strings that indicate which variables were included in the initial model.

nruns

How many times the random forest was initially run.

formula

the formula used for the last model.

data

the dataset used to fit the model.

oob

The out-of-bag (OOB) error of the entire model.

time

How long the algorithm ran.

rfmodel

The final model used, a randomForest object.
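
As an illustration of what a stepwise error curve such as stepwise.error represents, the sketch below adds variables one at a time in an assumed importance order and records an error estimate after each addition. It is only an analogy: it uses a linear model with leave-one-out cross-validation error as a stand-in for a random forest and its OOB error, and the data and ranking are simulated.

```r
set.seed(3)
n <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 + 1 * dat$x2 + rnorm(n)  # x3 is pure noise

ranked <- c("x1", "x2", "x3")  # assumed order of decreasing importance

loocv_mse <- function(model) {
  # Leave-one-out residuals for a linear model: e_i / (1 - h_ii)
  mean((residuals(model) / (1 - hatvalues(model)))^2)
}

# Error estimate after each variable is added, in ranked order
stepwise_error <- sapply(seq_along(ranked), function(k) {
  f <- reformulate(ranked[1:k], response = "y")
  loocv_mse(lm(f, data = dat))
})
names(stepwise_error) <- ranked

print(round(stepwise_error, 3))
```

The error typically drops sharply while informative variables are being added and then flattens once only noise variables remain.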

Author(s)

Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot, with modifications by Dustin Fife

References

Genuer, R., Poggi, J.M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225-2236.

Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 1-21. doi: 10.1186/1471-2105-8-25. URL http://dx.doi.org/10.1186/1471-2105-8-25.

See Also

rfInterp, rfPred


dustinfife/fifer documentation built on Oct. 31, 2020, 3:36 p.m.