screen.randomForest.imp: "Best of both worlds" Random Forest screening algorithm

screen.randomForest.imp    R Documentation

Description

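Combines the customizability of screen.randomForest with the cutoff selectors of FSelector: features are ranked by random forest variable importance, and the chosen selector decides how many of the top-ranked features to keep.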

Usage

screen.randomForest.imp(
  Y,
  X,
  family,
  obsWeights,
  id,
  selector = c("cutoff.biggest.diff", "cutoff.k", "cutoff.k.percent"),
  k = switch(selector, cutoff.k = ceiling(0.5 * ncol(X)), cutoff.k.percent = 0.5, NULL),
  nTree = 1000,
  mTry = ifelse(family$family == "gaussian", floor(sqrt(ncol(X))), max(floor(ncol(X)/3),
    1)),
  nodeSize = ifelse(family$family == "gaussian", 5, 1),
  importanceType = c("permutation", "impurity"),
  maxNodes = NULL,
  verbose = FALSE,
  ...
)

Arguments

Y

Outcome (numeric vector). See SuperLearner for specifics.

X

Predictor variable(s) (data.frame or matrix). See SuperLearner for specifics.

family

Error distribution to be used in the model: gaussian or binomial. The defaults of mTry and nodeSize depend on family$family (see Usage). See SuperLearner for specifics.

obsWeights

Optional numeric vector of observation weights. Currently unused.

id

Cluster identification variable. Currently unused.

selector

The name of a subset-selecting function implemented in the FSelector package. One of: cutoff.biggest.diff (default), cutoff.k, or cutoff.k.percent.
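To make the three selectors concrete, here is a minimal base R sketch of what each one is assumed to do with a vector of importance scores. The helper select_by_cutoff is hypothetical and is not part of FSelector or SLScreenExtra; in particular, the rounding used for cutoff.k.percent is an assumption.

```r
# Hypothetical helper, not part of FSelector or SLScreenExtra.
# Returns a logical vector marking which features a selector would keep.
select_by_cutoff <- function(importance,
                             selector = c("cutoff.biggest.diff", "cutoff.k",
                                          "cutoff.k.percent"),
                             k = NULL) {
  stopifnot(length(importance) >= 2)
  selector <- match.arg(selector)
  # Indices of features from most to least important
  ranked <- order(importance, decreasing = TRUE)
  n_keep <- switch(selector,
    # keep the k highest-ranked features
    cutoff.k = min(k, length(importance)),
    # keep a proportion k of the features (rounding is an assumption)
    cutoff.k.percent = round(k * length(importance)),
    # keep everything ranked above the largest drop between neighbours
    cutoff.biggest.diff = which.max(-diff(importance[ranked]))
  )
  keep <- rep(FALSE, length(importance))
  keep[ranked[seq_len(n_keep)]] <- TRUE
  names(keep) <- names(importance)
  keep
}

imp <- c(a = 0.9, b = 0.1, c = 0.85, d = 0.2)
select_by_cutoff(imp)                           # largest gap lies between c and d
select_by_cutoff(imp, "cutoff.k", k = 3)
select_by_cutoff(imp, "cutoff.k.percent", k = 0.25)
```

The result is returned in the original feature order, which matches the logical-vector convention described under Value.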

k

Passed through to the selector when selector is cutoff.k or cutoff.k.percent; otherwise it should remain NULL. For cutoff.k, this is an integer giving the number of features to keep from X (default: half the columns of X, rounded up). For cutoff.k.percent, this is instead the proportion of features to keep (default: 0.5).
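The default for k refers to selector, which itself defaults to a vector of three choices. One way such a signature can work is through R's lazy evaluation of default arguments: if selector is reduced to a single string before k is first used, switch sees a scalar. The sketch below assumes this mechanism; resolve_defaults is a hypothetical stand-in, and the package's actual internals may differ.

```r
# Hypothetical stand-in that reuses the default expressions shown under Usage.
resolve_defaults <- function(X,
                             selector = c("cutoff.biggest.diff", "cutoff.k",
                                          "cutoff.k.percent"),
                             k = switch(selector,
                                        cutoff.k = ceiling(0.5 * ncol(X)),
                                        cutoff.k.percent = 0.5,
                                        NULL)) {
  # Collapse selector to one string BEFORE k is evaluated for the first time
  selector <- match.arg(selector)
  list(selector = selector, k = k)
}

X <- iris[, 1:4]
resolve_defaults(X)                        # k stays NULL for cutoff.biggest.diff
resolve_defaults(X, "cutoff.k")            # k is ceiling(0.5 * 4)
resolve_defaults(X, "cutoff.k.percent")    # k is 0.5
```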

nTree

Integer. Number of trees. Default: 1000.

mTry

Integer. Number of columns of X sampled at each split. Default: square root (gaussian() family) or one third (binomial() family, with a minimum of 1) of the total number of features, rounded down.

nodeSize

Integer. Minimum number of observations in terminal nodes. Default: 5 (gaussian() family) or 1 (binomial() family).
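As a quick illustration of how the mTry and nodeSize defaults play out, the following sketch evaluates the default expressions from Usage for a given family and number of columns. The helper rf_defaults is hypothetical and exists only for this example.

```r
# Hypothetical helper mirroring the default expressions shown under Usage.
rf_defaults <- function(family, p) {
  is_gaussian <- family$family == "gaussian"
  list(
    mTry     = if (is_gaussian) floor(sqrt(p)) else max(floor(p / 3), 1),
    nodeSize = if (is_gaussian) 5 else 1
  )
}

rf_defaults(gaussian(), p = 20)   # mTry = floor(sqrt(20))
rf_defaults(binomial(), p = 20)   # mTry = floor(20 / 3)
rf_defaults(binomial(), p = 2)    # floor(2 / 3) is 0, so the minimum of 1 applies
```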

importanceType

Importance type. "permutation" (default) indicates mean decrease in accuracy (for binomial() family) or percent increase in mean squared error (for gaussian() family) when comparing predictions using the original variable versus a permuted version of the variable (column of X). "impurity" indicates increase in node purity achieved by splitting on that column of X (for binomial() family, measured by Gini index; for gaussian(), measured by residual sum of squares). See randomForest for more details, where "permutation" corresponds to type = 1 and "impurity" corresponds to type = 2.
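The correspondence between importanceType and the type argument of randomForest's importance measures can be sketched as a small lookup. The function importance_type_code is hypothetical; it only illustrates the mapping stated above and how a vector-valued default resolves to its first element.

```r
# Hypothetical helper illustrating the mapping described above.
importance_type_code <- function(importanceType = c("permutation", "impurity")) {
  # With no argument supplied, match.arg() picks the first choice
  importanceType <- match.arg(importanceType)
  switch(importanceType, permutation = 1L, impurity = 2L)
}

importance_type_code()             # "permutation" corresponds to type = 1
importance_type_code("impurity")   # "impurity" corresponds to type = 2
```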

maxNodes

Maximum number of terminal nodes allowed in a tree. Default (NULL) indicates that trees should be grown to maximum possible size. See randomForest for more details.

verbose

Should debugging messages be printed? Default: FALSE.

...

Currently unused.

Value

A logical vector with length equal to ncol(X), where TRUE marks a column of X retained by the screen.
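A returned vector of this shape can be used directly to subset the predictors. In the sketch below, keep is a made-up value chosen for illustration, not actual output of screen.randomForest.imp.

```r
X <- iris[, -which(colnames(iris) == "Species")]

# Made-up screening result for illustration only (one entry per column of X)
keep <- c(TRUE, FALSE, TRUE, TRUE)
stopifnot(is.logical(keep), length(keep) == ncol(X))

# drop = FALSE keeps X a data.frame even if only one column survives
X_screened <- X[, keep, drop = FALSE]
colnames(X_screened)
```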

Examples

library(SLScreenExtra)

# Binary outcome: keep the top 75% of predictors, ranked by permutation importance
data(iris)
Y <- as.numeric(iris$Species == "setosa")
X <- iris[, -which(colnames(iris) == "Species")]
screen.randomForest.imp(Y, X, binomial(), selector = "cutoff.k.percent", k = 0.75)

# Continuous outcome: impurity importance with the default selector
data(mtcars)
Y <- mtcars$mpg
X <- mtcars[, -which(colnames(mtcars) == "mpg")]
screen.randomForest.imp(Y, X, gaussian(), importanceType = "impurity")

# based on examples in SuperLearner package
set.seed(1)
n <- 100
p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)

library(SuperLearner)
# Compare a GLM on all predictors with a GLM on the screened predictors
sl <- SuperLearner(Y, X, family = gaussian(), cvControl = list(V = 2),
                   SL.library = list(c("SL.glm", "All"),
                                     c("SL.glm", "screen.randomForest.imp")))
sl
# Columns retained by each screening algorithm
sl$whichScreen

saraemoore/SLScreenExtra documentation built on Nov. 4, 2023, 9:31 p.m.