screen.FSelector.random.forest.importance: Random Forest screening algorithm

screen.FSelector.random.forest.importance    R Documentation

Random Forest screening algorithm

Description

The random.forest.importance algorithm uses randomForest (with ntree = 1000) to estimate the specified type of importance for each column of X.
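
As a rough illustration of the underlying computation, the sketch below fits a forest with the randomForest package directly and extracts one importance score per column of X. It assumes the randomForest package is installed; how this screening function calls it internally is not shown here, so the exact arguments are an assumption.

```r
library(randomForest)

set.seed(1)
X <- mtcars[, setdiff(colnames(mtcars), "mpg")]
Y <- mtcars$mpg

# Regression forest with 1000 trees; importance = TRUE is required
# for the permutation-based measure (type = 1) to be available.
rf <- randomForest(x = X, y = Y, ntree = 1000, importance = TRUE)

# One-column matrix of scores, one row per column of X
imp <- importance(rf, type = 1)

# Columns of X ranked from most to least important
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```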

Usage

screen.FSelector.random.forest.importance(
  Y,
  X,
  family,
  type = formals(random.forest.importance)$importance.type,
  selector = c("cutoff.biggest.diff", "cutoff.k", "cutoff.k.percent"),
  k = switch(selector, cutoff.k = ceiling(0.5 * ncol(X)), cutoff.k.percent = 0.5, NULL),
  verbose = FALSE,
  ...
)

Arguments

Y

Outcome (numeric vector). See SuperLearner for specifics.

X

Predictor variable(s) (data.frame or matrix). See SuperLearner for specifics.

family

Error distribution to be used in the model: gaussian or binomial. Currently unused. See SuperLearner for specifics.

type

Importance type, an integer. 1 selects the permutation-based measure: mean decrease in accuracy (for the binomial() family) or percent increase in mean squared error (for the gaussian() family) when predictions made with the original variable (column of X) are compared with predictions made with a permuted version of it. 2 selects the increase in node purity achieved by splitting on that column of X, measured by the Gini index for binomial() and by the residual sum of squares for gaussian(). For the default value, see random.forest.importance.
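
To make the two importance types concrete, the sketch below computes both for a classification forest using the randomForest package directly. Converting the outcome to a factor is an assumption made here so that randomForest fits a classification forest; it is not necessarily how this screening function handles a numeric 0/1 outcome.

```r
library(randomForest)

set.seed(1)
X <- iris[, setdiff(colnames(iris), "Species")]
Y <- factor(iris$Species == "setosa")  # factor outcome => classification forest

rf <- randomForest(x = X, y = Y, ntree = 1000, importance = TRUE)

acc  <- importance(rf, type = 1)  # permutation-based: mean decrease in accuracy
gini <- importance(rf, type = 2)  # node purity: mean decrease in Gini index

# Side-by-side scores, one row per column of X
cbind(acc, gini)
```

The two measures are on different scales and need not rank the columns identically, so the choice of type can change which columns a selector keeps.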

selector

A string naming a subset-selecting function implemented in the FSelector package. One of: cutoff.biggest.diff, cutoff.k, cutoff.k.percent, or "all". Note that "all" is not a function; it indicates that pass-through should be performed in the case of a filter that selects rather than ranks features. Default: "cutoff.biggest.diff".

k

Passed through to the selector when selector is cutoff.k or cutoff.k.percent; otherwise it should remain NULL (the default). For cutoff.k, this is an integer giving the number of features to keep from X (default: ceiling(0.5 * ncol(X))). For cutoff.k.percent, this is instead the proportion of features to keep (default: 0.5).
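
The three selectors can be tried on their own with a small table of scores. The sketch below assumes the FSelector package is installed; the scores and the feature names a to d are hypothetical, made up only to show how k is interpreted by each selector.

```r
library(FSelector)

# Hypothetical importance scores, shaped like the output of
# random.forest.importance(): one column, row names = feature names.
w <- data.frame(attr_importance = c(0.9, 0.8, 0.2, 0.1),
                row.names = c("a", "b", "c", "d"))

top_k    <- cutoff.k(w, k = 2)            # the 2 highest-scoring features
top_half <- cutoff.k.percent(w, k = 0.5)  # the top 50% of features
top_gap  <- cutoff.biggest.diff(w)        # features above the largest gap in sorted scores

# Selected names converted to a logical vector aligned with the rows of w
rownames(w) %in% top_k
```

Note that cutoff.biggest.diff takes no k, which is why k should stay NULL for that selector.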

verbose

Should debugging messages be printed? Default: FALSE.

...

Currently unused.

Value

A logical vector with length equal to ncol(X), in which TRUE marks a column of X retained by the screen.
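
A returned vector of this shape can be used directly to subset the predictors. The sketch below uses base R only; the vector keep is written by hand as a hypothetical stand-in for a real screening result.

```r
X <- mtcars[, setdiff(colnames(mtcars), "mpg")]

# Hypothetical screening result: one TRUE/FALSE per column of X
keep <- colnames(X) %in% c("cyl", "disp", "hp", "wt")

# drop = FALSE keeps a data.frame even if only one column survives
X_screened <- X[, keep, drop = FALSE]
colnames(X_screened)
```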

Examples

# Binary outcome: keep the top 75% of predictors ranked by importance
data(iris)
Y <- as.numeric(iris$Species == "setosa")
X <- iris[, -which(colnames(iris) == "Species")]
screen.FSelector.random.forest.importance(Y, X, binomial(), selector = "cutoff.k.percent", k = 0.75)

# Continuous outcome: rank by increase in node purity (type = 2), default selector
data(mtcars)
Y <- mtcars$mpg
X <- mtcars[, -which(colnames(mtcars) == "mpg")]
screen.FSelector.random.forest.importance(Y, X, gaussian(), type = 2)

# based on examples in SuperLearner package
set.seed(1)
n <- 100
p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)

library(SuperLearner)
# Fit SL.glm twice: on all columns, and on the columns kept by the screen
sl <- SuperLearner(Y, X, family = gaussian(), cvControl = list(V = 2),
                   SL.library = list(c("SL.glm", "All"),
                                     c("SL.glm", "screen.FSelector.random.forest.importance")))
sl
# Logical matrix showing which columns each screening algorithm retained
sl$whichScreen

saraemoore/SLWeightedScreen documentation built on Nov. 7, 2023, 5:18 a.m.