#' Variable Selection Using Random Forests
#'
#' Three-step variable selection procedure based on random forests for
#' supervised classification and regression problems. The first step
#' ("thresholding step") eliminates irrelevant variables from the dataset.
#' The second step ("interpretation step") selects all variables related to
#' the response, for interpretation purposes. The third step ("prediction
#' step") refines the selection by eliminating redundancy in the set of
#' variables selected by the second step, for prediction purposes.
#'
#' \itemize{ \item First step ("thresholding step"): first, \code{nfor.thres}
#' random forests are computed using the function \code{randomForest} with
#' argument \code{importance=TRUE} and our choice of default values for
#' \code{ntree} and \code{mtry} (which are higher than default in
#' \code{\link[randomForest]{randomForest}} to get a more stable variable importance measure).
#' Then variables are sorted according to their mean variable importance (VI),
#' in decreasing order. This order is kept all along the procedure. Next, a
#' threshold is computed: \code{min.thres}, the minimum predicted value of a
#' pruned CART tree fitted to the curve of the standard deviations of VI.
#' Finally, the actual "thresholding step" is performed: only variables with a
#' mean VI larger than \code{nmin} * \code{min.thres} are kept.
#'
#' \item Second step ("interpretation step"): the variables selected by the
#' first step are considered. \code{nfor.interp} embedded random forest models
#' are grown, starting with the random forest built with only the most important
#' variable and ending with all variables selected in the first step. Then,
#' \code{err.min}, the minimum mean out-of-bag (OOB) error of these models, and
#' its associated standard deviation \code{sd.min} are computed. Finally, the
#' smallest model (and hence its corresponding variables) having a mean OOB
#' error less than \code{err.min} + \code{nsd} * \code{sd.min} is selected.
#'
#' Note that for this step (and the next one), the \code{mtry} parameter of
#' \code{randomForest} is set to its default value (see
#' \code{\link[randomForest]{randomForest}}) if \code{nvm}, the number of variables in the
#' model, is not greater than the number of observations, while it is set to
#' \code{nvm/3} otherwise. This is to ensure quality of OOB error estimations
#' along embedded RF models.
#'
#' \item Third step ("prediction step"): the starting point is the same as in
#' the second step. However, now the variables are added to the model in a
#' stepwise manner. \code{mean.jump}, the mean jump value, is calculated using
#' variables that have been left out by the second step, and is set as the mean
#' absolute difference between mean OOB errors of one model and its first
#' following model. Hence a variable is included in the model if the mean OOB
#' error decrease is larger than \code{nmj} * \code{mean.jump}.
#'
#' As for the interpretation step, the \code{mtry} parameter of \code{randomForest}
#' is set to its default value if \code{nvm}, the number of variables in the
#' model, is not greater than the number of observations, while it is set to
#' \code{nvm/3} otherwise.}
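#'
#' As a rough illustration of the first two selection rules, the following
#' sketch applies them to made-up numbers. All values below are hypothetical
#' and are not produced by VSURF, and the internal implementation may differ.
#' \preformatted{
#' # Thresholding: keep variables whose mean VI exceeds nmin * min.thres
#' imp.mean.dec <- c(0.20, 0.11, 0.04, 0.002, 0.001)  # hypothetical sorted mean VI
#' min.thres <- 0.003                                 # hypothetical threshold
#' nmin <- 1
#' kept <- which(imp.mean.dec > nmin * min.thres)     # positions 1, 2, 3
#'
#' # Interpretation: smallest nested model with error below
#' # err.min + nsd * sd.min
#' err.interp <- c(0.30, 0.12, 0.078, 0.07, 0.075)    # hypothetical mean OOB errors
#' sd.min <- 0.01                                     # hypothetical sd at the minimum
#' nsd <- 1
#' err.min <- min(err.interp)
#' nvar <- min(which(err.interp < err.min + nsd * sd.min))
#' # nvar is 3: one variable fewer than the error-minimizing model
#' }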
#'
#' VSURF is able to run using multiple cores in parallel (see \code{parallel},
#' \code{clusterType} and \code{ncores} arguments).
#'
#' @param data a data frame containing the variables in the model.
#' @param na.action A function to specify the action to be taken if NAs are
#' found. (NOTE: If given, this argument must be named, and as
#' \code{randomForest} it is only used with the formula-type call.)
#' @param x,formula A data frame or a matrix of predictors, the columns
#' represent the variables. Or a formula describing the model to be fitted.
#' @param y A response vector (must be a factor for classification problems and
#' numeric for regression ones).
#' @param mtry Number of variables randomly sampled as candidates at each split.
#' Standard parameter of \code{randomForest}.
#' @param ntree.thres Number of trees of each forest grown for "thresholding
#' step" (first of the three steps).
#' @param nfor.thres Number of forests grown for "thresholding step".
#' @param nmin Number of times the "minimum value" is multiplied to set
#' threshold value. See details below.
#' @param ntree.interp Number of trees of each forest grown for "interpretation
#' step" (second of the three steps).
#' @param nfor.interp Number of forests grown for "interpretation step".
#' @param nsd Number of times the standard deviation of the minimum value of
#' \code{err.interp} is multiplied. See details below.
#' @param ntree.pred Number of trees of each forest grown for "prediction
#' step" (last of the three steps).
#' @param nfor.pred Number of forests grown for "prediction step".
#' @param nmj Number of times the mean jump is multiplied. See details below.
#' @param RFimplem Choice of the random forests implementation to use:
#' "randomForest" (default), "ranger" or "Rborist" (note that if "Rborist" is
#' chosen, "randomForest" will still be used for the first step
#' \code{VSURF_thres}). If a vector of length 3 is given, each coordinate is
#' passed to each intermediate function: \code{VSURF_thres},
#' \code{VSURF_interp}, \code{VSURF_pred}, in this order.
#' @param parallel A logical indicating whether VSURF should run in parallel on
#' multiple cores (defaults to FALSE). If a vector of length 3 is given,
#' each coordinate is passed to each intermediate function: \code{VSURF_thres},
#' \code{VSURF_interp}, \code{VSURF_pred}, in this order.
#' @param ncores Number of cores to use. Default is set to the number of cores
#' detected by R minus 1.
#' @param clusterType Type of the multiple cores cluster used to run VSURF in
#' parallel. Must be chosen among "PSOCK" (default: SOCKET cluster available
#' locally on all OS), "FORK" (local too, only available for Linux and Mac
#' OS), "MPI" (can be used on a remote cluster, which needs \code{snow} and
#' \code{Rmpi} packages installed), "ranger" and "Rborist" for internal
#' parallelizations of those packages (note that if "Rborist" is
#' chosen, "SOCKET" will still be used for the first step
#' \code{VSURF_thres}). If a vector of length 2 is given, each
#' coordinate is passed to each intermediate function: \code{VSURF_thres},
#' \code{VSURF_interp}, in this order.
#' @param verbose A logical indicating whether information about the method's
#' progress (including progress bars for each step) must be printed (defaults
#' to TRUE). Adds a small computational overhead.
#' @param ntree (deprecated) Number of trees in each forest grown for
#' "thresholding step".
#' @param ... other parameters to be passed on to the \code{randomForest}
#' function (see ?randomForest for further information).
#'
#' @return An object of class \code{VSURF}, which is a list with the following
#' components:
#'
#' \item{varselect.thres}{A vector of indexes of variables selected after
#' "thresholding step", sorted according to their mean VI, in decreasing
#' order.}
#'
#' \item{varselect.interp}{A vector of indexes of variables selected after
#' "interpretation step".}
#'
#' \item{varselect.pred}{A vector of indexes of variables selected after
#' "prediction step".}
#'
#' \item{nums.varselect}{A vector of the 3 numbers of variables selected resp.
#' by "thresholding step", "interpretation step" and "prediction step".}
#'
#' \item{imp.varselect.thres}{A vector of importance of the
#' \code{varselect.thres} variables.}
#'
#' \item{min.thres}{The minimum predicted value of a pruned CART tree fitted to
#' the curve of the standard deviations of VI.}
#'
#' \item{imp.mean.dec}{A vector of the variables importance means (over
#' \code{nfor.thres} runs), in decreasing order.}
#'
#' \item{imp.mean.dec.ind}{The ordering index vector associated to the sorting
#' of variables importance means.}
#'
#' \item{imp.sd.dec}{A vector of standard deviations of all variables
#' importance. The order is given by \code{imp.mean.dec.ind}.}
#'
#' \item{mean.perf}{Mean OOB error rate, obtained by random forests built on
#' all variables.}
#'
#' \item{pred.pruned.tree}{Predictions of the CART tree fitted to the curve of
#' the standard deviations of VI.}
#'
#' \item{err.interp}{A vector of the mean OOB error rates of the embedded
#' random forest models built during the "interpretation step".}
#'
#' \item{sd.min}{The standard deviation of OOB error rates associated to the
#' random forests model attaining the minimum mean OOB error rate during the
#' "interpretation step".}
#'
#' \item{err.pred}{A vector of the mean OOB error rates of the random forest
#' models built during the "prediction step".}
#'
#' \item{mean.jump}{The mean jump value computed during the "prediction step".}
#'
#' \item{nmin,nsd,nmj}{Corresponding parameters values.}
#'
#' \item{overall.time}{Overall computation time.}
#'
#' \item{comput.times}{A list of the 3 computation times respectively
#' associated with the 3 steps: "thresholding", "interpretation" and
#' "prediction".}
#'
#' \item{RFimplem}{The RF implementation used to run \code{VSURF},
#' among "randomForest" (default), "ranger" and "Rborist" or a vector of length
#' 3 with those.}
#'
#' \item{ncores}{The number of cores used to run \code{VSURF} in parallel (NULL
#' if VSURF did not run in parallel).}
#'
#' \item{clusterType}{The type of the cluster used to run \code{VSURF} in
#' parallel (NULL if VSURF did not run in parallel).}
#'
#' \item{call}{The original call to \code{VSURF}.}
#'
#' \item{terms}{Terms associated to the formula (only if formula-type call was
#' used).}
#'
#' \item{na.action}{Method used to deal with missing values (only if
#' formula-type call was used).}
#'
#' @author Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot
#' @seealso \code{\link{plot.VSURF}}, \code{\link{summary.VSURF}},
#' \code{\link{VSURF_thres}}, \code{\link{VSURF_interp}},
#' \code{\link{VSURF_pred}}, \code{\link{tune}}
#' @references Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2010),
#' \emph{Variable selection using random forests}, Pattern Recognition Letters
#' 31(14), 2225-2236
#' @references Genuer, R. and Poggi, J.M. and Tuleau-Malot, C. (2015),
#' \emph{VSURF: An R Package for Variable Selection Using Random Forests}, The
#' R Journal 7(2):19-33
#' @examples
#'
#' data(iris)
#' iris.vsurf <- VSURF(iris[,1:4], iris[,5])
#' iris.vsurf
#'
#' \dontrun{
#' # A more interesting example with toys data (see \code{\link{toys}})
#' # (a few minutes to execute)
#' data(toys)
#' toys.vsurf <- VSURF(toys$x, toys$y)
#' toys.vsurf
#'
#' # VSURF run on 2 cores in parallel (using a SOCKET cluster):
#' data(toys)
#' toys.vsurf.parallel <- VSURF(toys$x, toys$y, parallel = TRUE, ncores = 2)
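#'
#' # Illustrative sketch: map the selected indices back to variable names
#' # (the indices refer to the columns of the x argument).
#' colnames(toys$x)[toys.vsurf$varselect.pred]
#'
#' # Illustrative sketch: use a different random forests implementation for
#' # each step by passing a vector of length 3 to RFimplem (this assumes the
#' # ranger package is installed).
#' toys.vsurf.mixed <- VSURF(toys$x, toys$y,
#'   RFimplem = c("randomForest", "ranger", "ranger"))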
#' }
#'
#' @importFrom parallel detectCores
#' @export
VSURF <- function (x, ...) {
UseMethod("VSURF")
}
#' @rdname VSURF
#' @export
VSURF.default <- function(
x, y, mtry = max(floor(ncol(x)/3), 1),
ntree.thres = 500, nfor.thres = 20, nmin = 1,
ntree.interp = 100, nfor.interp = 10, nsd = 1,
ntree.pred = 100, nfor.pred = 10, nmj = 1,
RFimplem = "randomForest", parallel = FALSE, ncores = detectCores() - 1,
clusterType = "PSOCK", verbose = TRUE, ntree = 2000, ...) {
start <- Sys.time()
thres <- VSURF_thres(
x=x, y=y, ntree.thres=ntree.thres, mtry=mtry, nfor.thres=nfor.thres, nmin=nmin,
RFimplem = ifelse(length(RFimplem) == 3, RFimplem[1], RFimplem),
parallel = ifelse(length(parallel) == 3, parallel[1], parallel),
clusterType = ifelse(length(clusterType) > 1, clusterType[1], clusterType),
ncores=ncores, verbose = verbose, ...)
interp <- VSURF_interp(
x=x, y=y, ntree.interp=ntree.interp, vars=thres$varselect.thres, nfor.interp=nfor.interp,
nsd=nsd, RFimplem = ifelse(length(RFimplem) == 3, RFimplem[2], RFimplem),
parallel = ifelse(length(parallel) == 3, parallel[2], parallel),
clusterType = ifelse(length(clusterType) > 1, clusterType[2], clusterType),
ncores=ncores, verbose = verbose, ...)
pred <- VSURF_pred(x=x, y=y, ntree.pred=ntree.pred, err.interp=interp$err.interp,
varselect.interp=interp$varselect.interp, nfor.pred=nfor.pred, nmj=nmj,
RFimplem = ifelse(length(RFimplem) == 3, RFimplem[3], RFimplem),
parallel = ifelse(length(parallel) == 3, parallel[3], parallel),
ncores = ncores, verbose = verbose, ...)
cl <- match.call()
cl[[1]] <- as.name("VSURF")
if (identical(parallel, FALSE) || identical(parallel, rep(FALSE, 3))) {
clusterType <- NULL
ncores <- NULL
}
overall.time <- Sys.time() - start
output <- list('varselect.thres'=thres$varselect.thres,
'varselect.interp'=interp$varselect.interp,
'varselect.pred'=pred$varselect.pred,
'nums.varselect'=c(thres$num.varselect.thres,
interp$num.varselect.interp,
pred$num.varselect.pred),
'imp.varselect.thres'=thres$imp.varselect.thres,
'min.thres'=thres$min.thres,
'imp.mean.dec'=thres$imp.mean.dec,
'imp.mean.dec.ind'=thres$imp.mean.dec.ind,
'imp.sd.dec'=thres$imp.sd.dec,
'mean.perf'=thres$mean.perf,
'pred.pruned.tree'=thres$pred.pruned.tree,
'err.interp'=interp$err.interp,
'sd.min'=interp$sd.min,
'err.pred'=pred$err.pred,
'mean.jump'=pred$mean.jump,
'nmin'=nmin,
'nsd'=nsd,
'nmj'=nmj,
'overall.time'=overall.time,
'comput.times'=list(thres$comput.time, interp$comput.time, pred$comput.time),
'RFimplem'=RFimplem,
'ncores'=ncores,
'clusterType'=clusterType,
'call'=cl)
class(output) <- c("VSURF")
output
}
#' @rdname VSURF
#' @export
VSURF.formula <- function(formula, data, ..., na.action = na.fail) {
### formula interface for VSURF.
### code gratefully stolen from svm.formula (package e1071).
###
if (!inherits(formula, "formula"))
stop("method is only for formula objects")
m <- match.call(expand.dots = FALSE)
## Catch xtest and ytest in arguments.
if (any(c("xtest", "ytest") %in% names(m)))
stop("xtest/ytest not supported through the formula interface")
names(m)[2] <- "formula"
if (is.matrix(eval(m$data, parent.frame())))
m$data <- as.data.frame(data)
m$... <- NULL
m$na.action <- na.action
m[[1]] <- as.name("model.frame")
m <- eval(m, parent.frame())
y <- model.response(m)
Terms <- attr(m, "terms")
attr(Terms, "intercept") <- 0
attr(y, "na.action") <- attr(m, "na.action")
## Drop any "negative" terms in the formula.
## test with:
## randomForest(Fertility~.-Catholic+I(Catholic<50),data=swiss,mtry=2)
m <- model.frame(terms(reformulate(attributes(Terms)$term.labels)),
data.frame(m))
## if (!is.null(y)) m <- m[, -1, drop=FALSE]
## Loop over every column: seq(along=ncol(m)) only ever visited the first one.
for (i in seq_along(m)) {
if (is.ordered(m[[i]])) m[[i]] <- as.numeric(m[[i]])
}
ret <- VSURF.default(x=m, y=y, ...)
cl <- match.call()
cl[[1]] <- as.name("VSURF")
ret$call <- cl
ret$terms <- Terms
if (!is.null(attr(y, "na.action"))) {
ret$na.action <- attr(y, "na.action")
}
class(ret) <- c("VSURF.formula", class(ret))
warning(
"VSURF with a formula-type call outputs selected variables
which are indices of the input matrix based on the formula:
you may reorder these to get indices of the original data")
return(ret)
}