Variable selection using the PLS weights

Description

The function variable.selection performs variable selection for binary classification.

Usage

variable.selection(X, Y, nvar=NULL)

Arguments

X

a (n x p) data matrix of predictors. X may be a matrix or a data frame. Each row corresponds to an observation and each column corresponds to a predictor variable.

Y

a vector of length n giving the classes of the n observations. The two classes must be coded as 1 and 2.

nvar

the number of variables to be returned. If nvar=NULL, all the variables are returned.
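Since Y must use the codes 1 and 2, class labels stored in another form need recoding first. The following is a minimal sketch in base R; the label vectors `labels` and `Y01` are hypothetical and used only for illustration.

```r
# Hypothetical character labels for four observations
labels <- c("ALL", "AML", "AML", "ALL")

# factor() sorts levels alphabetically, so "ALL" -> 1 and "AML" -> 2
Y <- as.integer(factor(labels))

# Hypothetical 0/1 coding: shift by one to obtain the codes 1 and 2
Y01 <- c(0, 0, 1, 1)
Y_from01 <- Y01 + 1

# Sanity check before passing Y on: only the codes 1 and 2 may occur
stopifnot(all(Y %in% c(1, 2)), all(Y_from01 %in% c(1, 2)))
```

Note that with `factor()` the assignment of 1 and 2 follows the sorted level order, so it is worth checking which class received which code.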

Details

The function variable.selection orders the variables by the absolute value of their weights in the first PLS component. This ordering is equivalent to the ordering obtained with the F-statistic or with the t-test assuming equal variances (Boulesteix, 2004).

For computational reasons, the function variable.selection does not run the PLS algorithm itself, but the resulting ordering of the variables is identical to the ordering obtained from the PLS weights output by pls.regression.
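The equivalence described above can be checked by hand on a small data set. The sketch below is not the package's implementation. It assumes that the columns of X are standardized to unit variance, and it uses the standard fact that, for a single centered response, the first PLS weight vector is proportional to t(X) %*% y. All object names are chosen for illustration only.

```r
# Toy data: 4 observations, 3 variables, two classes coded as 1 and 2
X <- matrix(c(4, 3, 3, 4, 1, 0, 6, 7, 3, 5, 5, 9), 4, 3, byrow = FALSE)
Y <- c(1, 1, 2, 2)

# Assumption: columns are centered and scaled to unit variance
Xs <- scale(X)
yc <- Y - mean(Y)                 # centered class codes

# First PLS weight vector, proportional to t(Xs) %*% yc, scaled to unit length
w <- drop(crossprod(Xs, yc))
w <- w / sqrt(sum(w^2))

# Order the variables by decreasing absolute weight
ord_pls <- order(abs(w), decreasing = TRUE)

# Two-sample t-statistic with equal variances, one per variable
tstat <- apply(X, 2, function(x) {
  t.test(x[Y == 1], x[Y == 2], var.equal = TRUE)$statistic
})
ord_t <- order(abs(tstat), decreasing = TRUE)

ord_pls   # 2 3 1
ord_t     # 2 3 1
```

On this toy data both criteria rank variable 2 first and variable 1 last. Variable 1 has identical class means, so its weight and its t-statistic are both zero.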

Value

A vector of length nvar (or of length p if nvar=NULL) containing the indices of the variables to be selected. The variables are ordered from the best to the worst variable.

Author(s)

Anne-Laure Boulesteix (http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/index.html)

References

A. L. Boulesteix (2004). PLS dimension reduction for classification with microarray data, Statistical Applications in Genetics and Molecular Biology 3, Issue 1, Article 33.

A. L. Boulesteix, K. Strimmer (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data, Briefings in Bioinformatics 8, 32-44.

S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.

See Also

pls.regression.

Examples

# load plsgenomics library
library(plsgenomics)

# generate X and Y (4 observations and 3 variables)
X<-matrix(c(4,3,3,4,1,0,6,7,3,5,5,9),4,3,byrow=FALSE)
Y<-c(1,1,2,2)

# select the 2 best variables
variable.selection(X,Y,nvar=2)
# order the 3 variables
variable.selection(X,Y)

# load the leukemia data 
data(leukemia)

# select the 50 best variables from the leukemia data
variable.selection(leukemia$X,leukemia$Y,nvar=50)
