VSURF_thres: Thresholding step of VSURF

VSURF_thres    R Documentation

Thresholding step of VSURF

Description

The thresholding step roughly eliminates irrelevant variables from the dataset. This is the first step of the VSURF function. For refined variable selection, see the other VSURF steps: VSURF_interp and VSURF_pred.

Usage

VSURF_thres(x, ...)

## Default S3 method:
VSURF_thres(
  x,
  y,
  mtry = max(floor(ncol(x)/3), 1),
  ntree.thres = 500,
  nfor.thres = 20,
  nmin = 1,
  RFimplem = "randomForest",
  parallel = FALSE,
  clusterType = "PSOCK",
  ncores = parallel::detectCores() - 1,
  verbose = TRUE,
  ntree = NULL,
  ...
)

## S3 method for class 'formula'
VSURF_thres(formula, data, ..., na.action = na.fail)

Arguments

x, formula

A data frame or a matrix of predictors, whose columns represent the variables, or a formula describing the model to be fitted.

...

Other parameters to be passed on to the randomForest function (see ?randomForest for further information).

y

A response vector (must be a factor for classification problems and numeric for regression ones).

mtry

Number of variables randomly sampled as candidates at each split. Standard parameter of randomForest.

ntree.thres

Number of trees of each forest grown.

nfor.thres

Number of forests grown.

nmin

Factor by which the "minimum value" is multiplied to set the threshold value. See Details below.

RFimplem

Choice of the random forests implementation to use: "randomForest" (default), "ranger" or "Rborist" (note that if "Rborist" is chosen, "randomForest" will still be used for the first step VSURF_thres). If a vector of length 3 is given, each coordinate is passed to each intermediate function: VSURF_thres, VSURF_interp, VSURF_pred, in this order.

parallel

A logical indicating whether VSURF should run in parallel on multiple cores (defaults to FALSE). If a vector of length 3 is given, each coordinate is passed to each intermediate function: VSURF_thres, VSURF_interp, VSURF_pred, in this order.

clusterType

Type of the multi-core cluster used to run VSURF in parallel. Must be chosen among "PSOCK" (default: SOCKET cluster available locally on all OS), "FORK" (also local, only available on Linux and Mac OS), "MPI" (can be used on a remote cluster, which needs the snow and Rmpi packages installed), "ranger" and "Rborist" for the internal parallelizations of those packages (note that if "Rborist" is chosen, "SOCKET" will still be used for the first step VSURF_thres). If a vector of length 2 is given, each coordinate is passed to each intermediate function: VSURF_thres, VSURF_interp, in this order.

ncores

Number of cores to use. Default is set to the number of cores detected by R minus 1.

verbose

A logical indicating whether information about the method's progress (including progress bars for each step) must be printed (defaults to TRUE). Adds a small computational overhead.

ntree

(deprecated) Number of trees in each forest grown for "thresholding step".

data

a data frame containing the variables in the model.

na.action

A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named, and, as in randomForest, it is only used with the formula-type call.)
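As an illustration of the arguments above, the following sketch works out the default mtry for the iris predictors and then calls both interfaces. The reduced ntree.thres and nfor.thres values are arbitrary choices made here for a quick run, not recommendations, and the calls are skipped when the VSURF package is not installed.

```r
data(iris)
x <- iris[, 1:4]
y <- iris[, 5]

# Default mtry from the Usage section: a third of the predictors, at least 1.
mtry_default <- max(floor(ncol(x) / 3), 1)  # 4 predictors give floor(4/3) = 1

# Run the calls only if the VSURF package is available.
if (requireNamespace("VSURF", quietly = TRUE)) {
  # Default (x, y) interface, with fewer and smaller forests than the
  # defaults (illustrative values only).
  thres_xy <- VSURF::VSURF_thres(x, y, mtry = mtry_default,
                                 ntree.thres = 100, nfor.thres = 10,
                                 verbose = FALSE)

  # Formula interface; na.action is passed by name, as required.
  thres_f <- VSURF::VSURF_thres(Species ~ ., data = iris,
                                ntree.thres = 100, nfor.thres = 10,
                                verbose = FALSE, na.action = na.omit)
}
```

Lowering ntree.thres and nfor.thres shortens the run but may make the variable importance estimates less stable.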

Details

First, nfor.thres random forests are computed using the function randomForest with arguments importance=TRUE, and our choice of default values for ntree and mtry (which are higher than default in randomForest to get a more stable variable importance measure). Then variables are sorted according to their mean variable importance (VI), in decreasing order. This order is kept all along the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. Finally, the actual thresholding is performed: only variables with a mean VI larger than nmin * min.thres are kept.
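The sorting and thresholding logic described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the importance scores are made up, and min.thres is set by hand to an assumed value instead of being obtained from a pruned CART tree.

```r
# Hypothetical VI scores: 3 forests (rows) x 5 variables (columns).
vi <- rbind(
  c(0.30, 0.02, 0.15, 0.005,  0.00),
  c(0.34, 0.03, 0.13, 0.000,  0.01),
  c(0.32, 0.01, 0.17, 0.004, -0.01)
)
colnames(vi) <- paste0("V", 1:5)

# Mean and standard deviation of VI over the forests.
imp.mean <- colMeans(vi)
imp.sd <- apply(vi, 2, sd)

# Sort variables by mean VI, in decreasing order.
ord <- order(imp.mean, decreasing = TRUE)
imp.mean.dec <- imp.mean[ord]
imp.sd.dec <- imp.sd[ord]

# Assumed value standing in for the minimum prediction of the pruned
# CART tree fitted to imp.sd.dec (not computed here).
min.thres <- 0.01
nmin <- 1

# Keep only variables whose mean VI exceeds nmin * min.thres.
keep <- imp.mean.dec > nmin * min.thres
varselect <- ord[keep]
print(varselect)
```

With these made-up numbers, raising nmin to 5 moves the threshold to 0.05 and only V1 and V3 remain, which shows how nmin makes the elimination more aggressive.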

Value

An object of class VSURF_thres, which is a list with the following components:

varselect.thres

A vector of indices of selected variables, sorted according to their mean VI, in decreasing order.

imp.varselect.thres

A vector of importance of the varselect.thres variables.

min.thres

The minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI.

num.varselect.thres

The number of selected variables.

imp.mean.dec

A vector of the variable importance means (over nfor.thres runs), in decreasing order.

imp.mean.dec.ind

The ordering index vector associated with the sorting of variable importance means.

imp.sd.dec

A vector of standard deviations of the importance of all variables. The order is given by imp.mean.dec.ind.

mean.perf

The mean OOB error rate, obtained by random forests built with all variables.

pred.pruned.tree

The predictions of the CART tree fitted to the curve of the standard deviations of VI.

nmin

Value of the parameter in the call.

comput.time

Computation time.

RFimplem

The RF implementation used to run VSURF_thres.

ncores

The number of cores used to run VSURF_thres in parallel (NULL if VSURF_thres did not run in parallel).

clusterType

The type of the cluster used to run VSURF_thres in parallel (NULL if VSURF_thres did not run in parallel).

call

The original call to VSURF.

terms

Terms associated with the formula (only if formula-type call was used).
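A short sketch of how the components listed above might be inspected, assuming the VSURF package is installed (the code is skipped otherwise). The helper names kept and cutoff are introduced here for illustration only.

```r
# Run only if the VSURF package is available.
if (requireNamespace("VSURF", quietly = TRUE)) {
  data(iris)
  x <- iris[, 1:4]
  iris.thres <- VSURF::VSURF_thres(x, iris[, 5], verbose = FALSE)

  # Names of the variables kept, most important first
  # (varselect.thres holds column indices of x).
  kept <- colnames(x)[iris.thres$varselect.thres]
  print(kept)

  # Threshold that was applied to the mean VI: nmin * min.thres.
  cutoff <- iris.thres$nmin * iris.thres$min.thres
  print(cutoff)

  # Mean VI and its standard deviation for all variables, sorted by mean VI.
  print(data.frame(
    variable = colnames(x)[iris.thres$imp.mean.dec.ind],
    mean.VI  = iris.thres$imp.mean.dec,
    sd.VI    = iris.thres$imp.sd.dec
  ))
}
```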

Author(s)

Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot

References

Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236

Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2015), VSURF: An R Package for Variable Selection Using Random Forests, The R Journal 7(2):19-33

See Also

VSURF, tune

Examples


library(VSURF)
data(iris)
iris.thres <- VSURF_thres(iris[,1:4], iris[,5])
iris.thres

## Not run: 
# A more interesting example with toys data (see ?toys)
# (a few minutes to execute)
data(toys)
toys.thres <- VSURF_thres(toys$x, toys$y)
toys.thres
## End(Not run)


robingenuer/VSURF documentation built on July 15, 2024, 8:18 p.m.