screen.FSelector.entropy: Entropy-based screening algorithms

View source: R/fselector.R

screen.FSelector.entropy    R Documentation

Entropy-based screening algorithms

Description

Information gain, gain ratio, and symmetrical uncertainty scores are calculated from the Shannon entropy of X and Y. Information gain (information.gain), equivalently the mutual information, measures the reduction in the entropy of the outcome Y achieved by a feature. The information gain ratio (gain.ratio) is information gain normalized by the entropy of the feature. Symmetrical uncertainty (symmetrical.uncertainty) is a normalized and bias-corrected version of information gain. Implemented for the binomial() family only and designed for binary or categorical X; continuous X will be discretized by FSelector and Discretize using the MDL method (Fayyad & Irani, 1993).
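The three scores can be written in terms of the marginal entropies H(X), H(Y) and the joint entropy H(X, Y). A minimal base-R sketch of the definitions for categorical vectors follows; it is shown only to illustrate the formulas and is not the FSelector implementation (which also handles discretization and ties):

```r
# Shannon entropy of a categorical vector (natural log)
shannon_entropy <- function(v) {
  p <- table(v) / length(v)
  -sum(p * log(p))
}

# Information gain = mutual information: H(X) + H(Y) - H(X, Y)
information_gain <- function(x, y) {
  shannon_entropy(x) + shannon_entropy(y) - shannon_entropy(paste(x, y))
}

# Gain ratio: information gain normalized by the entropy of the feature
gain_ratio <- function(x, y) {
  information_gain(x, y) / shannon_entropy(x)
}

# Symmetrical uncertainty: 2 * IG / (H(X) + H(Y))
symmetrical_uncertainty <- function(x, y) {
  2 * information_gain(x, y) / (shannon_entropy(x) + shannon_entropy(y))
}

x <- c("a", "a", "b", "b")
y <- c(0, 0, 1, 1)
information_gain(x, y)         # log(2): x fully determines y
symmetrical_uncertainty(x, y)  # 1: x and y carry identical information
```

Because symmetrical uncertainty divides by the sum of both entropies, it lies in [0, 1] and compensates for information gain's bias toward features with many levels.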

Usage

screen.FSelector.entropy(
  Y,
  X,
  family,
  filter = c("symmetrical.uncertainty", "gain.ratio", "information.gain"),
  unit = formals(information.gain)$unit,
  selector = c("cutoff.biggest.diff", "cutoff.k", "cutoff.k.percent"),
  k = switch(selector, cutoff.k = ceiling(0.5 * ncol(X)), cutoff.k.percent = 0.5, NULL),
  verbose = FALSE,
  ...
)

Arguments

Y

Outcome (numeric vector). See SuperLearner for specifics.

X

Predictor variable(s) (data.frame or matrix). See SuperLearner for specifics.

family

Error distribution to be used in the model: gaussian or binomial. Currently unused. See SuperLearner for specifics.

filter

Character string. One of: "symmetrical.uncertainty" (default), "gain.ratio", or "information.gain".

unit

Character string giving the unit in which entropy is measured, passed through to FSelector's entropy-based filters. One of: "log" (default, natural logarithm), "log2", or "log10".

selector

A string corresponding to a subset-selecting function implemented in the FSelector package. One of: "cutoff.biggest.diff", "cutoff.k", "cutoff.k.percent", or "all". Note that "all" is not a function but indicates that pass-through should be performed in the case of a filter which selects rather than ranks features. Default: "cutoff.biggest.diff".

k

Passed through to the selector in the case where selector is cutoff.k or cutoff.k.percent. Otherwise, should remain NULL (the default). For cutoff.k, this is an integer indicating the number of features to keep from X. For cutoff.k.percent, this is instead the proportion of features to keep.

verbose

Should debugging messages be printed? Default: FALSE.

...

Currently unused.

Value

A logical vector with length equal to ncol(X).
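A TRUE element indicates that the corresponding column of X passed the screen, so the vector can be used directly to subset the predictors. A small base-R illustration with a made-up screening result:

```r
X <- data.frame(a = 1:3, b = 4:6, c = 7:9)

keep <- c(TRUE, FALSE, TRUE)  # hypothetical output of a screening algorithm
X[, keep, drop = FALSE]       # retains columns a and c
```

The drop = FALSE guards against the result collapsing to a plain vector when only one column survives screening.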

References

Fayyad, U. M. & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.4643
http://hdl.handle.net/2014/35171

Examples

data(iris)
Y <- as.numeric(iris$Species=="setosa")
X <- iris[,-which(colnames(iris)=="Species")]
screen.FSelector.entropy(Y, X, binomial(), selector = "cutoff.k.percent", k = 0.75)

# based on example in SuperLearner package
set.seed(1)
n <- 100
p <- 20
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
X <- data.frame(X)
Y <- rbinom(n, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] + .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))

library(SuperLearner)
sl <- SuperLearner(Y, X, family = binomial(), cvControl = list(V = 2),
                   SL.library = list(c("SL.lm", "All"),
                                     c("SL.lm", "screen.FSelector.entropy")))
sl
sl$whichScreen

saraemoore/SLScreenExtra documentation built on Nov. 4, 2023, 9:31 p.m.