varSelRFBoot: Bootstrap the variable selection procedure in varSelRF
In varSelRF: Variable Selection using Random Forests

Description

Use the bootstrap to estimate the prediction error rate (with the .632+ rule) and the stability of the variable selection procedure implemented in varSelRF.
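For orientation, the .632+ rule of Efron and Tibshirani (1997) can be sketched in base R as below. This is an illustration of the basic formula only, not the code used inside varSelRFBoot: the function names err632plus and noInfoRate are hypothetical, and truncating the relative overfitting rate to [0, 1] is a simplifying assumption.

```r
# Sketch of the basic .632+ estimator (hypothetical helper, not package code).
# err.resub: resubstitution error of the rule fitted to all the data
# err.loo:   leave-one-out bootstrap error
# gamma:     no-information error rate
err632plus <- function(err.resub, err.loo, gamma) {
  # Relative overfitting rate R; set to 0 when it is not well defined
  R <- if (gamma > err.resub && err.loo > err.resub) {
    (err.loo - err.resub) / (gamma - err.resub)
  } else {
    0
  }
  R <- min(R, 1)                      # assumption: truncate R to [0, 1]
  w <- 0.632 / (1 - 0.368 * R)        # weight given to the bootstrap error
  (1 - w) * err.resub + w * err.loo
}

# No-information rate for classification: sum over classes of
# p_k * (1 - q_k), with p_k the observed and q_k the predicted proportions.
noInfoRate <- function(observed, predicted) {
  lev <- levels(observed)
  p <- table(factor(observed, levels = lev)) / length(observed)
  q <- table(factor(predicted, levels = lev)) / length(predicted)
  sum(p * (1 - q))
}

obs  <- factor(c("A", "A", "B", "B"))
pred <- c("A", "B", "B", "B")
gam  <- noInfoRate(obs, pred)
est  <- err632plus(err.resub = 0, err.loo = 0.2, gamma = gam)
```

With no overfitting (R = 0) the weight is 0.632, which gives the plain .632 estimator; as R approaches 1 the estimate moves toward the leave-one-out bootstrap error.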

Usage

    varSelRFBoot(xdata, Class, c.sd = 1,
                 mtryFactor = 1, ntree = 5000, ntreeIterat = 2000,
                 vars.drop.frac = 0.2, bootnumber = 200,
                 whole.range = TRUE,
                 recompute.var.imp = FALSE,
                 usingCluster = TRUE,
                 TheCluster = NULL, srf = NULL, verbose = TRUE, ...)

Arguments

Most arguments are the same as for varSelRF.

 xdata: A data frame or matrix, with subjects/cases in rows and variables in columns. NAs not allowed.

 Class: The dependent variable; must be a factor.

 c.sd: The factor that multiplies the sd. to decide on stopping the iterations or choosing the final solution. See reference for details.

 mtryFactor: The multiplication factor of sqrt(number.of.variables) for the number of variables to use for the mtry argument of randomForest.

 ntree: The number of trees to use for the first forest; same as ntree for randomForest.

 ntreeIterat: The number of trees to use (ntree of randomForest) for all additional forests.

 vars.drop.frac: The fraction of variables, from those in the previous forest, to exclude at each iteration.

 whole.range: If TRUE, continue dropping variables until a forest with only two variables is built, and choose the best model from the complete series of models. If FALSE, stop the iterations if the current OOB error becomes larger than the initial OOB error (plus c.sd*OOB standard error) or if the current OOB error becomes larger than the previous OOB error (plus c.sd*OOB standard error).

 recompute.var.imp: If TRUE, recompute variable importances at each new iteration.

 bootnumber: The number of bootstrap samples to draw.

 usingCluster: If TRUE, use a cluster to parallelize the calculations.

 TheCluster: The name of the cluster, if one is used.

 srf: An object of class varSelRF. If used, the ntree and mtryFactor parameters are taken from this object, not from the arguments to this function. Supplying it also makes it possible to skip the first step of fitting the random forest to the complete, original data set.

 verbose: Give more information about what is being done.

 ...: Not used.
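To make the roles of mtryFactor, vars.drop.frac, and c.sd more concrete, here is a small base R sketch. It does not call varSelRF; the rounding rules, the variable names (n.vars, n.cases, sizes, oob.err), and the OOB error values are all assumptions made for illustration, and the package's internal arithmetic may differ.

```r
# Hypothetical problem size and argument values
n.vars <- 30
n.cases <- 25
mtryFactor <- 1
vars.drop.frac <- 0.2
c.sd <- 1

# mtry handed to randomForest: mtryFactor * sqrt(number of variables),
# rounded down here (the rounding is an assumption)
mtry <- floor(mtryFactor * sqrt(n.vars))

# Sizes of the successive forests when whole.range = TRUE: drop a fraction
# of the remaining variables until only two are left
n.remaining <- n.vars
sizes <- n.remaining
while (n.remaining > 2) {
  n.drop <- max(1, round(n.remaining * vars.drop.frac))  # drop at least one
  n.remaining <- max(2, n.remaining - n.drop)
  sizes <- c(sizes, n.remaining)
}

# Made-up OOB error rates, one per forest in 'sizes'
oob.err <- c(0.20, 0.20, 0.16, 0.16, 0.12, 0.12,
             0.12, 0.16, 0.16, 0.20, 0.24, 0.28)

# Choosing the final solution with c.sd: the smallest forest whose OOB error
# is within c.sd standard errors of the minimum OOB error
best <- min(oob.err)
se.best <- sqrt(best * (1 - best) / n.cases)
chosen.size <- min(sizes[oob.err <= best + c.sd * se.best])
```

With c.sd = 0 the rule reduces to picking the forest with the minimum OOB error; larger values favour smaller sets of variables.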

Details

If a cluster is used for the calculations, it will be used for the embarrassingly parallelizable task of building as many random forests as bootstrap samples.
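The pattern described here can be sketched with the parallel package. The example below is not the package's implementation: oneBootRun is a hypothetical stand-in that only reports the out-of-bag cases of a bootstrap sample, where the real procedure would run the variable selection on that sample.

```r
library(parallel)

set.seed(1)
n <- 25            # hypothetical number of cases
bootnumber <- 10   # hypothetical number of bootstrap samples

# One vector of resampled case indices per bootstrap replicate
boot.idx <- replicate(bootnumber, sample(n, replace = TRUE), simplify = FALSE)

# Stand-in for the work done on one bootstrap sample: return the cases
# that were left out of the sample (the out-of-bag cases)
oneBootRun <- function(idx, n) setdiff(seq_len(n), idx)

# Each bootstrap sample is independent, so the runs can be spread over workers
cl <- makeCluster(2)
oob.cases <- parLapply(cl, boot.idx, oneBootRun, n = n)
stopCluster(cl)
```

Because the replicates do not depend on each other, the same list could be produced with lapply; the cluster only changes where the work is done.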

Value

An object of class varSelRFBoot, which is a list with components:

 number.of.bootsamples: The number of bootstrap replicates.

 bootstrap.pred.error: The .632+ estimate of the prediction error.

 leave.one.out.bootstrap: The leave-one-out estimate of the error rate (used when computing the .632+ estimate).

 all.data.randomForest: A random forest built from all the data, but after the variable selection. Thus, beware because the OOB error rate is severely biased down.

 all.data.vars: The variables selected in the run with all the data.

 all.data.run: An object of class varSelRF; the one obtained from a run of varSelRF on the original, complete, data set. See varSelRF.

 class.predictions: The out-of-bag predictions from the bootstrap, of type "response". See predict.randomForest. This is an array, with dimensions number of cases by number of bootstrap replicates.

 prob.predictions: The out-of-bag predictions from the bootstrap, of type "class probability". See predict.randomForest. This is a 3-way array; the last dimension is the bootstrap replication; for each bootstrap replication, the 2D array has dimensions case by number of classes, and each value is the probability of belonging to that class.

 number.of.vars: A vector with the number of variables selected for each bootstrap sample.

 overlap: The "overlap" between the variables selected from the run in the original sample and the variables returned from a bootstrap sample. Overlap between the sets of variables A and B is defined as $\frac{|variables.in.A \cap variables.in.B|}{\sqrt{|variables.in.A| \, |variables.in.B|}}$, that is, size (cardinality) of the intersection between the two sets / sqrt(product of the size of each set).

 all.vars.in.solutions: A vector with all the genes selected in the runs on all the bootstrap samples. If the same gene is selected in several bootstrap runs, it appears multiple times in this vector.

 all.solutions: Each solution is a character vector with all the variables in a particular solution concatenated by a "+". Thus, all.solutions is a vector, with length equal to number.of.bootsamples, of the solution from each bootstrap run.

 Class: The original class argument.

 allBootRuns: A list of length number.of.bootsamples. Each component of this list is an element of class varSelRF and stores the results from the runs on each bootstrap sample.
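The overlap measure defined above is easy to compute by hand. The sketch below uses a hypothetical helper, overlapSets, and made-up variable names; the regular expression used to split a "+"-joined solution string tolerates optional spaces around the "+", which is an assumption about how the strings are formatted.

```r
# Overlap between two sets of variable names:
# |A intersect B| / sqrt(|A| * |B|)
overlapSets <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / sqrt(length(a) * length(b))
}

# Hypothetical solutions: one from the original data, one from a bootstrap run
original.vars <- c("v1", "v2", "v3")
boot.solution <- "v2 + v3 + v4 + v5"   # in the style of all.solutions

# Split the "+"-joined string back into variable names
boot.vars <- strsplit(boot.solution, "\\s*\\+\\s*")[[1]]

ov <- overlapSets(original.vars, boot.vars)   # 2 shared / sqrt(3 * 4)
```

The measure is 1 when both sets are identical and 0 when they share no variables.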

Note

The out-of-bag predictions stored in class.predictions and prob.predictions are NOT the OOB votes from random forest itself for a given run. These are predictions from the out-of-bag samples for each bootstrap replication. Thus, these are samples that have not been used at all in any of the variable selection procedures in the given bootstrap replication.
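As an illustration of how such predictions lead to a leave-one-out bootstrap error, the base R sketch below works on a small made-up matrix. It assumes that entries are NA for cases that were part of a given bootstrap sample; that layout, and the names truth and preds, are assumptions for the example rather than a description of the stored object.

```r
# Hypothetical true classes for 4 cases
truth <- factor(c("A", "A", "B", "B"))

# Hypothetical predictions: 4 cases (rows) by 3 bootstrap replicates (columns).
# NA marks a case that was in the bootstrap sample, so it has no
# out-of-bag prediction for that replicate (assumed layout).
preds <- matrix(c("A", NA,  "B",
                  NA,  "A", "A",
                  "B", "A", NA,
                  NA,  NA,  "B"),
                nrow = 4, byrow = TRUE)

# TRUE where an out-of-bag prediction is wrong; NA entries stay NA
wrong <- preds != as.character(truth)

# Error of each case, averaged over the replicates in which it was out-of-bag
per.case <- rowMeans(wrong, na.rm = TRUE)

# Leave-one-out bootstrap error: average of the per-case errors
loo.boot.error <- mean(per.case, na.rm = TRUE)
```

Averaging within each case first, and only then across cases, keeps cases that happen to be out-of-bag more often from dominating the estimate.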

Author(s)

Ramon Diaz-Uriarte rdiaz02@gmail.com

References

Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.

Diaz-Uriarte, R. and Alvarez de Andres, S. (2005) Variable selection from random forests: application to gene expression data. Tech. report. http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html

Efron, B. & Tibshirani, R. J. (1997) Improvements on cross-validation: the .632+ bootstrap method. J. American Statistical Association, 92, 548–560.

Svetnik, V., Liaw, A., Tong, C. & Wang, T. (2004) Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. Pp. 334-343 in F. Roli, J. Kittler, and T. Windeatt (eds.). Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. Lecture Notes in Computer Science, vol. 3077. Berlin: Springer.

See Also

randomForest, varSelRF, summary.varSelRFBoot, plot.varSelRFBoot

Examples

    ## Not run: 
    ## This is a small example, but can take some time.

    library(varSelRF)
    library(parallel)  # provides makeForkCluster; fork clusters are not available on Windows

    ## make a small cluster, for the sake of illustration
    forkCL <- makeForkCluster(2)
    clusterSetRNGStream(forkCL, iseed = 123)
    clusterEvalQ(forkCL, library(varSelRF))

    ## simulated data: 25 cases, 30 variables; the first two variables
    ## are shifted for the 10 cases of class "A"
    x <- matrix(rnorm(25 * 30), ncol = 30)
    x[1:10, 1:2] <- x[1:10, 1:2] + 2
    cl <- factor(c(rep("A", 10), rep("B", 15)))

    ## variable selection on the complete data, reused below through srf
    rf.vs1 <- varSelRF(x, cl, ntree = 200, ntreeIterat = 100,
                       vars.drop.frac = 0.2)
    rf.vsb <- varSelRFBoot(x, cl,
                           bootnumber = 10,
                           usingCluster = TRUE,
                           srf = rf.vs1,
                           TheCluster = forkCL)
    rf.vsb
    summary(rf.vsb)
    plot(rf.vsb)
    stopCluster(forkCL)

    ## End(Not run)