VSURF: Variable Selection Using Random Forests

Description

Three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") eliminates irrelevant variables from the dataset. The second step ("interpretation step") selects all variables related to the response, for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.

Usage

VSURF(x, ...)

## Default S3 method:
VSURF(x, y, ntree = 2000, mtry = max(floor(ncol(x)/3), 1),
  nfor.thres = 50, nmin = 1, nfor.interp = 25, nsd = 1,
  nfor.pred = 25, nmj = 1, RFimplementation = "randomForest",
  parallel = FALSE, ncores = detectCores() - 1, clusterType = "PSOCK",
  ...)

## S3 method for class 'formula'
VSURF(formula, data, ..., na.action = na.fail)
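
The default for mtry in the signature above depends only on the number of predictor columns. A minimal base R sketch that reproduces that expression follows; the helper name default_mtry is hypothetical and is not part of the package.

```r
# Hypothetical helper mirroring the default expression for `mtry`
# shown in the usage above: max(floor(ncol(x)/3), 1).
default_mtry <- function(x) {
  max(floor(ncol(x) / 3), 1)
}

data(iris)
mtry_iris <- default_mtry(iris[, 1:4])         # 4 predictor columns
mtry_wide <- default_mtry(matrix(0, 10, 200))  # 200 predictor columns
print(c(iris = mtry_iris, wide = mtry_wide))
```

With very few predictors the max(..., 1) guard keeps mtry from falling to zero.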

Arguments

x, formula

A data frame or a matrix of predictors, whose columns represent the variables; or a formula describing the model to be fitted.

...

Other parameters to be passed on to the randomForest function (see ?randomForest for further information).

y

A response vector (must be a factor for classification problems and numeric for regression ones).

ntree

Number of trees in each forest grown. Standard parameter of randomForest.

mtry

Number of variables randomly sampled as candidates at each split. Standard parameter of randomForest.

nfor.thres

Number of forests grown for "thresholding step" (first of the three steps).

nmin

Number of times the "minimum value" is multiplied to set the threshold value. See Details below.

nfor.interp

Number of forests grown for "interpretation step" (second of the three steps).

nsd

Number of times the standard deviation of the minimum value of err.interp is multiplied. See details below.

nfor.pred

Number of forests grown for "prediction step" (last of the three steps).

nmj

Number of times the mean jump is multiplied. See details below.

RFimplementation

Choice of the random forests implementation to use: "randomForest" (default) or "ranger".

parallel

A logical indicating whether VSURF should run in parallel on multiple cores (defaults to FALSE).

ncores

Number of cores to use. Default is set to the number of cores detected by R minus 1.

clusterType

Type of the multi-core cluster used to run VSURF in parallel. Must be one of "PSOCK" (default: a socket cluster, available locally on all operating systems), "FORK" (also local, only available on Linux and Mac OS) and "MPI" (can be used on a remote cluster; requires the snow and Rmpi packages to be installed).

data

A data frame containing the variables in the model.

na.action

A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named and, as in randomForest, it is only used with the formula-type call.)
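
To make the effect of na.action concrete, the following sketch uses base R's model.frame to show what na.fail and na.omit do with a formula and a data frame containing one missing value. That VSURF treats them the same way internally is an assumption, based on the note that the argument behaves as in randomForest; the VSURF call itself is shown only as a comment.

```r
data(iris)
iris_na <- iris
iris_na[1, "Sepal.Length"] <- NA   # introduce a single missing value

# na.fail (the default of the formula method) stops with an error.
failed <- inherits(
  try(model.frame(Species ~ ., data = iris_na, na.action = na.fail),
      silent = TRUE),
  "try-error"
)

# na.omit drops the incomplete row instead.
mf <- model.frame(Species ~ ., data = iris_na, na.action = na.omit)
n_kept <- nrow(mf)

# Corresponding call shape (not run here; requires the VSURF package).
# Note that na.action is passed by name:
# VSURF(Species ~ ., data = iris_na, na.action = na.omit)
```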

Details

VSURF can run in parallel on multiple cores (see the parallel, clusterType and ncores arguments).
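
One way to choose the parallel settings is sketched below, using only the parallel package that ships with R. The fallback to a single core and the platform test are assumptions of this sketch, not behavior documented for VSURF; the VSURF call is shown only as a comment.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to 1 core.
n_detected <- detectCores()
ncores <- if (is.na(n_detected)) 1L else max(n_detected - 1L, 1L)

# "FORK" is only available on Linux and Mac OS; "PSOCK" works on all OS.
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"

# Call shape (not run here; requires the VSURF package and its toys data):
# VSURF(toys$x, toys$y, parallel = TRUE,
#       ncores = ncores, clusterType = cluster_type)
```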

Value

An object of class VSURF, which is a list with the following components:

varselect.thres

A vector of indexes of variables selected after "thresholding step", sorted according to their mean VI, in decreasing order.

varselect.interp

A vector of indexes of variables selected after "interpretation step".

varselect.pred

A vector of indexes of variables selected after "prediction step".

nums.varselect

A vector of the three numbers of variables selected by the "thresholding step", "interpretation step" and "prediction step", respectively.

imp.varselect.thres

A vector of importance of the varselect.thres variables.

min.thres

The minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI.

imp.mean.dec

A vector of the mean variable importances (over nfor.thres runs), in decreasing order.

imp.mean.dec.ind

The ordering index vector associated with the sorting of the mean variable importances.

imp.sd.dec

A vector of the standard deviations of all variable importances. The order is given by imp.mean.dec.ind.

mean.perf

Mean OOB error rate, obtained by a random forest built on all variables.

pred.pruned.tree

Predictions of the CART tree fitted to the curve of the standard deviations of VI.

err.interp

A vector of the mean OOB error rates of the embedded random forest models built during the "interpretation step".

sd.min

The standard deviation of the OOB error rates associated with the random forest model attaining the minimum mean OOB error rate during the "interpretation step".

err.pred

A vector of the mean OOB error rates of the random forest models built during the "prediction step".

mean.jump

The mean jump value computed during the "prediction step".

nmin,nsd,nmj

Corresponding parameter values.

overall.time

Overall computation time.

comput.times

A list of the 3 computation times respectively associated with the 3 steps: "thresholding", "interpretation" and "prediction".

RFimplementation

The RF implementation used to run VSURF.

ncores

The number of cores used to run VSURF in parallel (NULL if VSURF did not run in parallel).

clusterType

The type of the cluster used to run VSURF in parallel (NULL if VSURF did not run in parallel).

call

The original call to VSURF.

terms

Terms associated with the formula (only if formula-type call was used).

na.action

Method used to deal with missing values (only if formula-type call was used).
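
The components listed above can be combined after a run. The sketch below uses a hand-written list as a stand-in for a VSURF result (illustrative values only, not real output) to show how the varselect.* indices map back to column names and how the stored parameters relate to the cut-offs. The two rules used here, keeping variables whose mean VI exceeds nmin * min.thres and taking the smallest nested model whose mean OOB error is below the minimum of err.interp plus nsd * sd.min, are this sketch's reading of the package documentation and should be treated as assumptions.

```r
data(iris)
x <- iris[, 1:4]

# Stand-in for a VSURF result: illustrative values, NOT real output.
res <- list(
  varselect.thres  = c(4, 3, 1, 2),
  varselect.interp = c(4, 3),
  varselect.pred   = c(4, 3),
  min.thres  = 0.01,
  err.interp = c(0.08, 0.04, 0.05, 0.05),
  sd.min     = 0.005,
  nmin = 1, nsd = 1
)

# The varselect.* components are column indices into x.
selected_names <- colnames(x)[res$varselect.pred]

# Cut-offs recomputed from the stored components (assumed rules).
vi_threshold <- res$nmin * res$min.thres
err_cutoff   <- min(res$err.interp) + res$nsd * res$sd.min

# Size of the smallest nested model whose error is below the cut-off.
n_interp <- min(which(res$err.interp < err_cutoff))

print(selected_names)
print(c(vi_threshold = vi_threshold, err_cutoff = err_cutoff,
        n_interp = n_interp))
```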

Author(s)

Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot

References

Genuer, R., Poggi, J.-M. and Tuleau-Malot, C. (2010), Variable selection using random forests, Pattern Recognition Letters 31(14), 2225-2236.

Genuer, R., Poggi, J.-M. and Tuleau-Malot, C. (2015), VSURF: An R Package for Variable Selection Using Random Forests, The R Journal 7(2), 19-33.

See Also

plot.VSURF, summary.VSURF, VSURF_thres, VSURF_interp, VSURF_pred, tune

Examples

library(VSURF)
data(iris)
iris.vsurf <- VSURF(iris[,1:4], iris[,5], ntree = 100, nfor.thres = 20,
                    nfor.interp = 10, nfor.pred = 10)
iris.vsurf

## Not run: 
# A more interesting example with toys data (see ?toys)
# (a few minutes to execute)
data(toys)
toys.vsurf <- VSURF(toys$x, toys$y)
toys.vsurf

# VSURF run on 2 cores in parallel (using a SOCKET cluster):
data(toys)
toys.vsurf.parallel <- VSURF(toys$x, toys$y, parallel = TRUE, ncores = 2)

## End(Not run)

robingenuer/VSURF documentation built on April 14, 2018, 10:16 a.m.