VSURF_thres | R Documentation |
The thresholding step roughly eliminates irrelevant variables from the
dataset. This is the first step of the VSURF function. For refined variable
selection, see the other VSURF steps: VSURF_interp and VSURF_pred.
VSURF_thres(x, ...)
## Default S3 method:
VSURF_thres(
x,
y,
mtry = max(floor(ncol(x)/3), 1),
ntree.thres = 500,
nfor.thres = 20,
nmin = 1,
RFimplem = "randomForest",
parallel = FALSE,
clusterType = "PSOCK",
ncores = parallel::detectCores() - 1,
verbose = TRUE,
ntree = NULL,
...
)
## S3 method for class 'formula'
VSURF_thres(formula, data, ..., na.action = na.fail)
x , formula |
A data frame or a matrix of predictors, whose columns represent the variables, or a formula describing the model to be fitted. |
... |
other parameters to be passed on to the |
y |
A response vector (must be a factor for classification problems and numeric for regression ones). |
mtry |
Number of variables randomly sampled as candidates at each split.
Standard parameter of |
ntree.thres |
Number of trees of each forest grown. |
nfor.thres |
Number of forests grown. |
nmin |
Number of times the "minimum value" is multiplied to set the threshold value. See details below. |
RFimplem |
Choice of the random forests implementation to use:
"randomForest" (default), "ranger" or "Rborist" (note that if "Rborist" is
chosen, "randomForest" will still be used for the first step
|
parallel |
A logical indicating whether VSURF should run in parallel on
multiple cores (defaults to FALSE). If a vector of length 3 is given,
each coordinate is passed to each intermediate function: |
clusterType |
Type of the multiple cores cluster used to run VSURF in
parallel. Must be chosen among "PSOCK" (default: SOCKET cluster available
locally on all OS), "FORK" (local too, only available for Linux and Mac
OS), "MPI" (can be used on a remote cluster, which needs |
ncores |
Number of cores to use. Default is set to the number of cores detected by R minus 1. |
verbose |
A logical indicating whether information about the method's progress (including progress bars for each step) must be printed (defaults to TRUE). Adds a small computational overhead. |
ntree |
(deprecated) Number of trees in each forest grown for the "thresholding step". |
data |
a data frame containing the variables in the model. |
na.action |
A function to specify the action to be taken if NAs are
found. (NOTE: If given, this argument must be named, and as
|
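As an illustration of the arguments above, the following sketch calls the formula method on the iris data with smaller forests than the defaults, then repeats the call with a larger nmin. It assumes the VSURF package is installed; the object names iris.f and iris.strict and the chosen values (100 trees, 10 forests, nmin = 5) are picked for illustration only.

```r
# Hedged usage sketch: assumes the VSURF package is installed.
if (requireNamespace("VSURF", quietly = TRUE)) {
  library(VSURF)
  set.seed(2024)

  # Formula interface: Species is the response, all other columns are predictors.
  # Small ntree.thres / nfor.thres values keep the run short (illustrative choice).
  iris.f <- VSURF_thres(Species ~ ., data = iris,
                        ntree.thres = 100, nfor.thres = 10,
                        nmin = 1, verbose = FALSE)

  # Same call with a larger nmin, which multiplies the threshold by 5.
  iris.strict <- VSURF_thres(Species ~ ., data = iris,
                             ntree.thres = 100, nfor.thres = 10,
                             nmin = 5, verbose = FALSE)

  # Number of variables kept under each setting
  print(iris.f$num.varselect.thres)
  print(iris.strict$num.varselect.thres)
}
```

Because each call grows its own random forests, the two runs are not strictly comparable; for a fixed set of forests, a larger nmin can only raise the threshold.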
First, nfor.thres random forests are computed using the function
randomForest with arguments importance=TRUE, and our choice of default
values for ntree and mtry (which are higher than default in randomForest
to get a more stable variable importance measure). Then variables are
sorted according to their mean variable importance (VI), in decreasing
order. This order is kept all along the procedure. Next, a threshold is
computed: min.thres, the minimum predicted value of a pruned CART tree
fitted to the curve of the standard deviations of VI. Finally, the actual
thresholding is performed: only variables with a mean VI larger than
nmin * min.thres are kept.
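The steps above can be sketched on simulated importance scores, without growing any forest. This is an illustrative sketch, not the package source: the importance matrix is made up, rpart is assumed as the CART implementation, and pruning at the complexity value with the smallest cross-validated error is an assumption of this sketch. All object names (vi, ord, min_thres, keep) are hypothetical.

```r
# Illustrative sketch of the thresholding logic on simulated data.
# NOT the VSURF source code; the pruning rule below is an assumption.
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
n_for <- 20   # plays the role of nfor.thres
p <- 30       # number of candidate variables (made up)

# Simulated VI scores: one row per forest, one column per variable.
# The first 5 variables are informative, the others are noise centred on 0.
true_vi <- c(seq(0.5, 0.1, length.out = 5), rep(0, p - 5))
true_sd <- 0.005 + 0.1 * true_vi
vi <- sapply(seq_len(p), function(j) rnorm(n_for, true_vi[j], true_sd[j]))

# Mean and standard deviation of VI over the forests
imp_mean <- colMeans(vi)
imp_sd <- apply(vi, 2, sd)

# Sort variables by mean VI, in decreasing order
ord <- order(imp_mean, decreasing = TRUE)
imp_mean_dec <- imp_mean[ord]
imp_sd_dec <- imp_sd[ord]

# Fit a CART tree to the curve of standard deviations (rank versus sd),
# then prune it at the cp value with the smallest cross-validated error.
sd_curve <- data.frame(rank = seq_len(p), vi_sd = imp_sd_dec)
tree <- rpart(vi_sd ~ rank, data = sd_curve,
              control = rpart.control(cp = 0, minsplit = 2))
cp_tab <- tree$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)
pred <- predict(pruned, newdata = sd_curve)

# Threshold: minimum predicted value of the pruned tree, scaled by nmin
min_thres <- min(pred)
nmin <- 1
keep <- ord[imp_mean_dec > nmin * min_thres]
print(keep)  # indices of retained variables, most important first
```

Using the standard deviation of the noise variables as a yardstick means the threshold adapts to how unstable the importance estimates are, rather than relying on a fixed cut-off.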
An object of class VSURF_thres, which is a list with the following
components:
varselect.thres |
A vector of indices of selected variables, sorted according to their mean VI, in decreasing order. |
imp.varselect.thres |
A vector of importance of the
|
min.thres |
The minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. |
num.varselect.thres |
The number of selected variables. |
imp.mean.dec |
A vector of the variables importance means (over
|
imp.mean.dec.ind |
The ordering index vector associated with the sorting of the variable importance means. |
imp.sd.dec |
A vector of standard deviations of all variables
importance. The order is given by |
mean.perf |
The mean OOB error rate, obtained by random forests built with all variables. |
pred.pruned.tree |
The predictions of the CART tree fitted to the curve of the standard deviations of VI. |
nmin |
Value of the parameter in the call. |
comput.time |
Computation time. |
RFimplem |
The RF implementation used to run
|
ncores |
The number of cores used to run |
clusterType |
The type of the cluster used to run |
call |
The original call to |
terms |
Terms associated to the formula (only if formula-type call was used). |
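A short sketch of how the components listed above might be inspected after a run on the iris data. It assumes the VSURF package is installed; the reduced ntree.thres and nfor.thres values and the name vi_table are illustrative choices, not recommendations.

```r
# Hedged sketch: assumes the VSURF package is installed.
if (requireNamespace("VSURF", quietly = TRUE)) {
  library(VSURF)
  set.seed(42)

  # Smaller forests than the defaults, only to keep the run short
  iris.thres <- VSURF_thres(iris[, 1:4], iris[, 5],
                            ntree.thres = 100, nfor.thres = 10,
                            verbose = FALSE)

  # Names of the retained predictors, most important first
  print(colnames(iris)[iris.thres$varselect.thres])

  # Threshold actually applied to the mean VI values
  print(iris.thres$nmin * iris.thres$min.thres)

  # Mean and standard deviation of VI, both in decreasing order of mean VI
  vi_table <- data.frame(
    variable = colnames(iris)[iris.thres$imp.mean.dec.ind],
    mean_vi  = iris.thres$imp.mean.dec,
    sd_vi    = iris.thres$imp.sd.dec
  )
  print(vi_table)
}
```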
Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot
Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236
Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2015), VSURF: An R Package for Variable Selection Using Random Forests, The R Journal 7(2):19-33
VSURF, tune
data(iris)
iris.thres <- VSURF_thres(iris[,1:4], iris[,5])
iris.thres
## Not run:
# A more interesting example with toys data (see ?toys)
# (a few minutes to execute)
data(toys)
toys.thres <- VSURF_thres(toys$x, toys$y)
toys.thres
## End(Not run)