Three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") eliminates irrelevant variables from the dataset. The second step ("interpretation step") selects all variables related to the response, for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.
VSURF(x, ...)
## Default S3 method:
VSURF(x, y, ntree = 2000, mtry = max(floor(ncol(x)/3), 1),
nfor.thres = 50, nmin = 1, nfor.interp = 25, nsd = 1,
nfor.pred = 25, nmj = 1, parallel = FALSE, ncores = detectCores() - 1,
clusterType = "PSOCK", ...)
## S3 method for class 'formula'
VSURF(formula, data, ..., na.action = na.fail)

x, formula 
A data frame or a matrix of predictors, whose columns represent the variables; or a formula describing the model to be fitted. 
... 
Other parameters to be passed on to the randomForest function. 
y 
A response vector (must be a factor for classification problems and numeric for regression ones). 
ntree 
Number of trees in each forest grown. Standard parameter of randomForest. 
mtry 
Number of variables randomly sampled as candidates at each split. Standard parameter of randomForest. 
nfor.thres 
Number of forests grown for "thresholding step" (first of the three steps). 
nmin 
Number of times the "minimum value" is multiplied to set threshold value. 
nfor.interp 
Number of forests grown for "interpretation step" (second of the three steps). 
nsd 
Number of times the standard deviation of the minimum mean OOB error (sd.min) is multiplied. 
nfor.pred 
Number of forests grown for "prediction step" (last of the three steps). 
nmj 
Number of times the mean jump is multiplied. 
parallel 
A logical indicating whether VSURF should run in parallel on multiple cores (defaults to FALSE). 
ncores 
Number of cores to use. Default is set to the number of cores detected by R minus 1. 
clusterType 
Type of the multiple cores cluster used to run VSURF in
parallel. Must be chosen among "PSOCK" (default: SOCKET cluster available
locally on all OS), "FORK" (local too, only available for Linux and Mac OS)
and "MPI" (can be used on a remote cluster, which needs 
data 
a data frame containing the variables in the model. 
na.action 
A function to specify the action to be taken if NAs are
found. (NOTE: If given, this argument must be named, and as

First step ("thresholding step"): first, nfor.thres random forests are computed using the function randomForest with argument importance=TRUE and our choice of default values for ntree and mtry (which are higher than the defaults in randomForest, to get a more stable variable importance measure). Then variables are sorted according to their mean variable importance (VI), in decreasing order. This order is kept throughout the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. Finally, the actual "thresholding step" is performed: only variables with a mean VI larger than nmin * min.thres are kept.
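As a rough illustration of the thresholding rule described above, the following base R sketch applies it to a small made-up matrix of variable importances. The numbers, the object names and in particular the value of min_thres are assumptions for illustration only; in VSURF the threshold comes from a pruned CART tree fitted to the standard deviations of VI, which is not reproduced here.

```r
# Toy VI matrix (made-up values): one row per forest, one column per variable.
vi <- matrix(
  c(0.02, 0.31,  0.001, 0.12,
    0.03, 0.29, -0.001, 0.10,
    0.01, 0.30,  0.000, 0.11),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("v1", "v2", "v3", "v4"))
)

imp_mean <- colMeans(vi)        # mean VI of each variable over the forests
imp_sd   <- apply(vi, 2, sd)    # standard deviation of VI over the forests

# Sort variables by decreasing mean VI; this order is kept afterwards.
ord <- order(imp_mean, decreasing = TRUE)

# Hypothetical threshold: a placeholder standing in for the minimum
# prediction of the pruned CART tree fitted to the curve of imp_sd[ord].
min_thres <- 0.005
nmin <- 1

# Keep only variables whose mean VI exceeds nmin * min_thres.
keep <- ord[imp_mean[ord] > nmin * min_thres]
keep
```

Here keep holds column indexes sorted by decreasing mean VI, which mirrors how varselect.thres is described in the Value section.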
Second step ("interpretation step"): the variables selected by the first step are considered. nfor.interp embedded random forest models are grown, starting with the random forest built with only the most important variable and ending with all variables selected in the first step. Then err.min, the minimum mean out-of-bag (OOB) error of these models, and its associated standard deviation sd.min are computed. Finally, the smallest model (and hence its corresponding variables) having a mean OOB error less than err.min + nsd * sd.min is selected.
Note that for this step (and the next one), the mtry parameter of randomForest is set to its default value (see randomForest) if nvm, the number of variables in the model, is not greater than the number of observations, while it is set to nvm/3 otherwise. This is to ensure the quality of OOB error estimates along embedded RF models.
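The selection rule of the interpretation step can be sketched in base R as follows. The OOB errors are made-up numbers and the object names are illustrative assumptions; this is not the package's internal code.

```r
# Toy OOB errors (made-up values): one row per repetition, one column per
# nested model built with the 1, 2, ..., 5 most important variables.
err <- matrix(
  c(0.30, 0.12, 0.100, 0.095, 0.099,
    0.32, 0.14, 0.110, 0.105, 0.105,
    0.31, 0.13, 0.102, 0.100, 0.102),
  nrow = 3, byrow = TRUE
)

err_mean <- colMeans(err)       # mean OOB error of each nested model
best     <- which.min(err_mean) # model attaining the minimum mean error
err_min  <- err_mean[best]
sd_min   <- sd(err[, best])     # standard deviation at that minimum
nsd      <- 1

# Smallest model whose mean error is within nsd standard deviations of the
# minimum. `<=` is used so the rule stays defined even if sd_min is 0.
k <- min(which(err_mean <= err_min + nsd * sd_min))
k
```

With these toy numbers the minimum is reached by the 4-variable model, but the 3-variable model is already within one standard deviation of it, so the smaller model is retained.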
Third step ("prediction step"): the starting point is the same as in the second step. However, now the variables are added to the model in a stepwise manner. mean.jump, the mean jump value, is calculated using variables that have been left out by the second step, and is set as the mean absolute difference between the mean OOB errors of one model and its first following model. Hence a variable is included in the model if the mean OOB error decrease is larger than nmj * mean.jump.
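The stepwise rule of the prediction step can be illustrated with the base R sketch below. All error values are made up, oob_error() is a hypothetical stand-in for "grow forests on these variables and return the mean OOB error", and the way mean_jump is indexed is an assumption based on the description above rather than a copy of the package's code.

```r
# Mean OOB errors of the nested models from the interpretation step
# (made-up values); suppose that step kept the first 3 variables.
err_interp <- c(0.31, 0.13, 0.104, 0.100, 0.102, 0.101)
k_interp   <- 3

# Mean jump: mean absolute difference between consecutive models that
# involve the variables left out by the interpretation step.
mean_jump <- mean(abs(diff(err_interp[k_interp:length(err_interp)])))
nmj <- 1

# Hypothetical error lookup standing in for fitting random forests.
oob_error <- function(vars) {
  toy_err <- c("1" = 0.31, "1-2" = 0.13, "1-2-3" = 0.129)
  toy_err[[paste(vars, collapse = "-")]]
}

# Start from the most important variable, then test the others in order.
selected <- 1
err_cur  <- oob_error(selected)
for (v in 2:k_interp) {
  err_try <- oob_error(c(selected, v))
  # Include v only if the error decrease is larger than nmj * mean_jump.
  if (err_cur - err_try > nmj * mean_jump) {
    selected <- c(selected, v)
    err_cur  <- err_try
  }
}
selected
```

In this toy run the third variable lowers the error by only 0.001, which is below the mean jump, so it is treated as redundant and left out.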
As for the interpretation step, the mtry parameter of randomForest is set to its default value if nvm, the number of variables in the model, is not greater than the number of observations, while it is set to nvm/3 otherwise.
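The mtry rule used in the interpretation and prediction steps can be written as a small helper. The function name choose_mtry and its arguments are hypothetical; the default values follow randomForest's documented defaults (floor(sqrt(p)) for classification, max(floor(p/3), 1) for regression).

```r
# Hypothetical helper illustrating the mtry rule described above.
# nvm: number of variables in the model; n_obs: number of observations.
choose_mtry <- function(nvm, n_obs, classification = TRUE) {
  if (nvm <= n_obs) {
    # randomForest default value of mtry
    if (classification) floor(sqrt(nvm)) else max(floor(nvm / 3), 1)
  } else {
    # more variables than observations: use nvm / 3
    nvm / 3
  }
}

m_class <- choose_mtry(16, 100)                          # classification
m_reg   <- choose_mtry(10, 100, classification = FALSE)  # regression
m_wide  <- choose_mtry(300, 100)                         # nvm > n_obs
```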
VSURF is able to run using multiple cores in parallel (see the parallel, clusterType and ncores arguments).
An object of class VSURF, which is a list with the following components:
varselect.thres 
A vector of indexes of variables selected after "thresholding step", sorted according to their mean VI, in decreasing order. 
varselect.interp 
A vector of indexes of variables selected after "interpretation step". 
varselect.pred 
A vector of indexes of variables selected after "prediction step". 
nums.varselect 
A vector of the 3 numbers of variables selected by "thresholding step", "interpretation step" and "prediction step", respectively. 
imp.varselect.thres 
A vector of importances of the variables selected after "thresholding step". 
min.thres 
The minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. 
imp.mean.dec 
A vector of the mean variable importances (over the nfor.thres forests), in decreasing order. 
imp.mean.dec.ind 
The ordering index vector associated to the sorting of variables importance means. 
imp.sd.dec 
A vector of standard deviations of all variable importances. The order is given by imp.mean.dec.ind. 
mean.perf 
Mean OOB error rate, obtained by random forests built on all variables. 
pred.pruned.tree 
Predictions of the CART tree fitted to the curve of the standard deviations of VI. 
err.interp 
A vector of the mean OOB error rates of the embedded random forest models built during the "interpretation step". 
sd.min 
The standard deviation of OOB error rates associated to the random forests model attaining the minimum mean OOB error rate during the "interpretation step". 
err.pred 
A vector of the mean OOB error rates of the random forest models built during the "prediction step". 
mean.jump 
The mean jump value computed during the "prediction step". 
nmin,nsd,nmj 
Corresponding parameters values. 
overall.time 
Overall computation time. 
comput.times 
A list of the 3 computation times respectively associated with the 3 steps: "thresholding", "interpretation" and "prediction". 
ncores 
The number of cores used to run VSURF in parallel. 
clusterType 
The type of the cluster used to run VSURF in parallel. 
call 
The original call to VSURF. 
terms 
Terms associated with the formula (only if a formula-type call was used). 
na.action 
Method used to deal with missing values (only if a formula-type call was used). 
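Since the varselect components are vectors of indexes, it can be convenient to map them back to variable names. The helper below is a hypothetical sketch (selected_names is not part of the package) and assumes the indexes refer to columns of the predictor data x. It is demonstrated on a mock list with made-up index values, not on actual VSURF output.

```r
# Hypothetical helper: return the names of the variables selected at a step.
# fit: a list with components varselect.thres / varselect.interp /
#      varselect.pred (as in a VSURF object); x: the predictors used.
selected_names <- function(fit, x, step = c("thres", "interp", "pred")) {
  step <- match.arg(step)
  idx  <- fit[[paste0("varselect.", step)]]
  colnames(x)[idx]
}

# Mock object with made-up indexes, standing in for a fitted VSURF object.
mock_fit <- list(
  varselect.thres  = c(4, 3, 1, 2),
  varselect.interp = c(4, 3),
  varselect.pred   = c(4, 3)
)

res_pred <- selected_names(mock_fit, iris[, 1:4], "pred")
res_pred
```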
Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot
Genuer, R. and Poggi, J.-M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236
Genuer, R. and Poggi, J.-M. and Tuleau-Malot, C. (2015), VSURF: An R Package for Variable Selection Using Random Forests, The R Journal 7(2):19-33
plot.VSURF, summary.VSURF, VSURF_thres, VSURF_interp, VSURF_pred, tune
library(VSURF)
data(iris)
iris.vsurf <- VSURF(iris[,1:4], iris[,5], ntree = 100, nfor.thres = 20,
                    nfor.interp = 10, nfor.pred = 10)
iris.vsurf
## Not run:
# A more interesting example with toys data (see toys)
# (a few minutes to execute)
data(toys)
toys.vsurf <- VSURF(toys$x, toys$y)
toys.vsurf
# VSURF run on 2 cores in parallel (using a SOCKET cluster):
data(toys)
toys.vsurf.parallel <- VSURF(toys$x, toys$y, parallel = TRUE, ncores = 2)
## End(Not run)
