screen.ranger: Screen features via a fast implementation of Random Forest

View source: R/rf.R

screen.ranger    R Documentation

Screen features via a fast implementation of Random Forest

Description

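A faster alternative to screen.randomForest or screen.randomForest.imp, built on ranger. Features are selected by applying one of the FSelector cutoff selectors to the variable importance scores.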

Usage

screen.ranger(
  Y,
  X,
  family,
  selector = c("cutoff.biggest.diff", "cutoff.k", "cutoff.k.percent"),
  k = switch(selector, cutoff.k = ceiling(0.5 * ncol(X)), cutoff.k.percent = 0.5, NULL),
  nTree = 1000,
  mTry = ifelse(family$family == "gaussian", floor(sqrt(ncol(X))), max(floor(ncol(X)/3),
    1)),
  nodeSize = ifelse(family$family == "gaussian", 5, 1),
  importanceType = c("permutation", "impurity"),
  scalePermutationImportance = TRUE,
  probabilityTrees = FALSE,
  numThreads = 1,
  verbose = FALSE,
  ...
)

Arguments

Y

Outcome (numeric vector). See SuperLearner for specifics.

X

Predictor variable(s) (data.frame or matrix). See SuperLearner for specifics.

family

Error distribution to be used in the model: gaussian or binomial. Determines the default values of mTry and nodeSize and the type of trees grown. See SuperLearner for specifics.

selector

A string corresponding to a subset selecting function implemented in the FSelector package. One of: cutoff.biggest.diff (default), cutoff.k, or cutoff.k.percent.

k

Passed through to the selector in the case where selector is cutoff.k or cutoff.k.percent. Otherwise, should remain NULL (the default). For cutoff.k, this is an integer indicating the number of features to keep from X. For cutoff.k.percent, this is instead the proportion of features to keep.
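As a rough illustration of how the three selectors turn importance scores into a set of kept features, here is a minimal base R sketch. It does not call FSelector. The helper names (keep_top_k, keep_top_fraction, keep_biggest_diff) and the importance values are hypothetical, and the rounding used for the proportion is an assumption about how cutoff.k.percent behaves.

```r
# Hypothetical importance scores for four features (made-up numbers).
imp <- c(Sepal.Length = 0.10, Sepal.Width = 0.02,
         Petal.Length = 0.45, Petal.Width = 0.43)

# Like cutoff.k: keep the k highest-scoring features.
keep_top_k <- function(imp, k) {
  ranked <- names(sort(imp, decreasing = TRUE))
  names(imp) %in% ranked[seq_len(k)]
}

# Like cutoff.k.percent: keep a proportion k of the features
# (rounding to the nearest count is an assumption).
keep_top_fraction <- function(imp, k) {
  keep_top_k(imp, round(k * length(imp)))
}

# Like cutoff.biggest.diff: keep everything ranked above the largest
# gap between consecutive sorted scores.
keep_biggest_diff <- function(imp) {
  sorted <- sort(imp, decreasing = TRUE)
  gaps <- -diff(sorted)
  keep_top_k(imp, which.max(gaps))
}

keep_top_k(imp, 2)
keep_top_fraction(imp, 0.75)
keep_biggest_diff(imp)
```

Each helper returns a logical vector in the original column order, which is the same shape that screen.ranger returns.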

nTree

Integer. Number of trees. Default: 1000.

mTry

Integer. Number of columns of X sampled at each split. Default: square root (gaussian() family) or one third (binomial() family) of the total number of features, rounded down; for the binomial() family the default is never less than 1.

nodeSize

Integer. Minimum number of observations in terminal nodes. Default: 5 (gaussian() family) or 1 (binomial() family).
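The default expressions for mTry and nodeSize shown under Usage can be evaluated by hand to see what a given call will use. The sketch below copies those expressions into two hypothetical helper functions (default_mtry and default_nodesize are not part of the package).

```r
# Hypothetical helpers mirroring the default expressions under Usage.
default_mtry <- function(family, p) {
  if (family$family == "gaussian") floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_nodesize <- function(family) {
  if (family$family == "gaussian") 5 else 1
}

# mtcars has 10 predictors once mpg is removed.
default_mtry(gaussian(), 10)
default_nodesize(gaussian())

# iris has 4 predictors once Species is removed.
default_mtry(binomial(), 4)
default_nodesize(binomial())
```

With these inputs the gaussian() call gives mTry = 3 and nodeSize = 5, and the binomial() call gives mTry = 1 and nodeSize = 1.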

importanceType

Importance type. "permutation" (default) indicates mean decrease in accuracy (for binomial() family) or percent increase in mean squared error (for gaussian() family) when comparing predictions using the original variable versus a permuted version of the variable (column of X). "impurity" indicates increase in node purity achieved by splitting on that column of X (for binomial() family, measured by Gini index; for gaussian(), measured by variance of the responses). See ranger for more details.

scalePermutationImportance

Scale permutation importance by standard error. Ignored if importanceType = "impurity". See ranger for more details.
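To inspect the importance scores that the selector would operate on, ranger can be called directly. The sketch below assumes that importanceType and scalePermutationImportance correspond to ranger's importance and scale.permutation.importance arguments; that mapping is an assumption about the internals of screen.ranger, not something this page states.

```r
if (requireNamespace("ranger", quietly = TRUE)) {
  set.seed(1)
  # Regression forest on mtcars with scaled permutation importance.
  fit <- ranger::ranger(
    mpg ~ ., data = mtcars,
    num.trees = 200,
    importance = "permutation",
    scale.permutation.importance = TRUE,
    num.threads = 1
  )
  # Named numeric vector with one score per predictor.
  imp <- fit$variable.importance
  print(sort(imp, decreasing = TRUE))
}
```

Switching importance to "impurity" returns scores on a different scale, so cutoffs chosen for one importance type do not carry over to the other.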

probabilityTrees

Logical. If family is binomial() and probabilityTrees is FALSE (the default), classification trees are grown. If family is binomial() and probabilityTrees is TRUE, probability trees are grown (Malley et al., 2012). Ignored if family is gaussian(), for which regression trees are always grown. See ranger for more details.
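The difference between classification trees and probability trees is easiest to see in ranger itself. The following sketch assumes that probabilityTrees is passed to ranger's probability argument and that a binary outcome is supplied as a factor; both are assumptions about how screen.ranger calls ranger.

```r
if (requireNamespace("ranger", quietly = TRUE)) {
  set.seed(1)
  # Binary outcome as a factor so that ranger grows classification trees.
  dat <- data.frame(setosa = factor(iris$Species == "setosa"), iris[, 1:4])

  class_fit <- ranger::ranger(setosa ~ ., data = dat, num.trees = 100,
                              probability = FALSE)
  prob_fit <- ranger::ranger(setosa ~ ., data = dat, num.trees = 100,
                             probability = TRUE)

  # Classification trees predict a class label per observation.
  class_pred <- predict(class_fit, data = dat)$predictions
  # Probability trees predict one probability column per class.
  prob_pred <- predict(prob_fit, data = dat)$predictions

  str(class_pred)
  str(prob_pred)
}
```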

numThreads

Number of threads. Default: 1.

verbose

Should debugging messages be printed? Default: FALSE.

...

Currently unused.

Value

A logical vector with length equal to ncol(X). TRUE marks the columns of X that are kept; FALSE marks the columns that are screened out.
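A logical vector of this kind can be used directly to subset the columns of X. In the sketch below the mask is made up for illustration and is not actual output from screen.ranger.

```r
X <- mtcars[, setdiff(colnames(mtcars), "mpg")]

# Hypothetical screening result: one TRUE/FALSE per column of X.
keep <- colnames(X) %in% c("cyl", "disp", "wt")

# drop = FALSE keeps a data.frame even if only one column survives.
X_screened <- X[, keep, drop = FALSE]
colnames(X_screened)
```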

References

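Wright and Ziegler (2017), ranger: http://dx.doi.org/10.18637/jss.v077.i01

Breiman (2001), Random Forests: http://dx.doi.org/10.1023/A:1010933404324

Malley et al. (2012), probability machines: http://dx.doi.org/10.3414/ME00-01-0052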

Examples

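# Binary outcome: keep the top 75% of features by permutation importance
data(iris)
Y <- as.numeric(iris$Species == "setosa")
X <- iris[, -which(colnames(iris) == "Species")]
screen.ranger(Y, X, binomial(), selector = "cutoff.k.percent", k = 0.75)

# Continuous outcome: default selector (cutoff.biggest.diff) on impurity importance
data(mtcars)
Y <- mtcars$mpg
X <- mtcars[, -which(colnames(mtcars) == "mpg")]
screen.ranger(Y, X, gaussian(), importanceType = "impurity")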

# based on examples in SuperLearner package
set.seed(1)
n <- 100
p <- 20
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)

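library(SuperLearner)
# Compare SL.glm on all features with SL.glm on the screen.ranger subset
sl <- SuperLearner(Y, X, family = gaussian(), cvControl = list(V = 2),
                   SL.library = list(c("SL.glm", "All"),
                                     c("SL.glm", "screen.ranger")))
sl
# Which columns of X each screening algorithm kept
sl$whichScreen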

saraemoore/SLScreenExtra documentation built on Nov. 4, 2023, 9:31 p.m.