eliminator: The Eliminator algorithm
In larssnip/mpda: Classification in a multivariate setting

View source: R/eliminator.R

eliminator

R Documentation

The Eliminator algorithm

Description

Variable elimination using pda.

Usage

eliminator(
  y,
  X,
  reg = 0.5,
  prior = NULL,
  max.dim = NULL,
  frac = 0.25,
  vip.lim = 1,
  n.seg = 10,
  verbose = TRUE
)

Arguments

`y`	Vector of responses, a factor of exact 2 levels.
`X`	Matrix of predictor values.
`reg`	The regularization parameter, see below.
`prior`	Vector of prior probabilities, one value for each factor level in `y`.
`max.dim`	Integer, the maximum number of dimensions to consider.
`frac`	Fraction of unimportant variables to eliminate in each iteration (default is 0.25).
`vip.lim`	The threshold for the VIP criterion (default is 1.0).
`n.seg`	Integer, the number of cross-validation segments (default is 10).
`verbose`	Logical, turns on/off output during computations.

Details

This is a variable selection (elimination) algorithm based on the pda method where the response (y) is a factor with 2 levels, i.e. a two-class problem. The restriction to a two-class problem comes from the use of the VIP criterion. The idea is a slight modification of the one described in Mehmood et al, (2011).

The algorithm starts out by using all variables, fitting a pda model, using the pdaDim function to estimate the optimal dimensionality for the PLS step after each elimination. The regularization parameter reg and the number of cross-validation segments n.seg are needed for this, see pdaDim for details.

Based on the fitted model, the VIP-criterion is computed for all variables, see vipCriterion for details. Variables are ranked by this criterion, and those with a VIP-score below vip.lim are considered unimportant. The fraction frac of the unimportant variables are eliminated. If no unimportant variables are left, the algorithm terminates. It will mostly continue until only 1 variable is left, and then terminate.

After each elimination, the classification accuracy, i.e. the fraction of correctly classified samples, is computed based on the cross-validation of the pdaDim results. The reduced model accuracy is tested against the maximum so far, using the mcnemar.test. The p-value of this test is reported for each elimination step, indicating if the reduced model has a significantly poorer accuracy than the maximum so far. The idea is to keep on eliminating variables until the accuracy becomes significantly poorer than the maximum.

The argument max.dim allows you to specify the maximum number of PLS components to consider. The optimal dimension found by pdaDim is between 1 and max.dim. If it turns out identical to your max.dim, increase its value slightly and re-run.

Value

A list with two matrices, Elimination and Selected.

Elimination has one row for each iteration, with accuracy results after each. The first column is the number of variables left, the second is the fraction of correctly classified samples, and the third is the p-value of the McNemar-test, see above. You will typically look down this matrix for iterations where you have the maximum accuracy. Then, continue down the matrix from that row, until there is a significant drop in accuracy (column 2) and a corresponding small p-value (column 3), and choose the result just before this as your final selection.

In Selected you also find one row for each iteration. This logical indicates which variables are selected after each iteration. Thus, if cell [i,j] is TRUE variable j is still included after iteration i.

Author(s)

Lars Snipen.

References

Mehmood, T, Martens, H, Saeboe, S, Warringer, J, Snipen, L (2011). A Partial Least Squares based algorithm for parsimonious variable selection. Algorithms for Molecular Biology, 6:27.

Examples

data(microbiome)
y <- microbiome[1:40, 1]
X <- as.matrix(microbiome[1:40, -1])
lst <- eliminator(y, X, max.dim = 10)
print(lst$Elimination)
# Seems like iteration 23 is the place to stop, since a significant drop
# in performance is found at iteration 24.
# There are 2 variables left in iteration 23. These are variables
print(which(lst$Selected[23,]))

larssnip/mpda documentation built on March 28, 2022, 3:37 p.m.