GeneSelection: General method for variable selection with various methods

Description

For each learning set defined by the argument learningsets, this method either ranks the genes from the most relevant to the least relevant using one of several 'filter' criteria, or provides a sparse collection of variables (Lasso, ElasticNet, Boosting). The results are typically used for variable selection in the classification procedure that follows.
For S4 class information, see GeneSelection-methods.

Usage

GeneSelection(X, y, f, learningsets, method = c("t.test", "welch.test", "wilcox.test", "f.test", "kruskal.test", "limma", "rfe", "rf", "lasso", "elasticnet", "boosting", "golub", "shrinkcat"), scheme, trace = TRUE, ...)

Arguments

X

Gene expression data. Can be one of the following:

  • A matrix. Rows correspond to observations, columns to variables.

  • A data.frame, when f is not missing (see below).

  • An object of class ExpressionSet.

y

Class labels. Can be one of the following:

  • A numeric vector.

  • A factor.

  • A character if X is an ExpressionSet.

  • missing, if X is a data.frame and a proper formula f is provided.

f

A two-sided formula, if X is a data.frame. The left-hand side corresponds to the class labels, the right-hand side to the variables.
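As a rough illustration of what a two-sided formula picks out of a data.frame, the following base R sketch separates labels from variables with model.frame. The data.frame toy and its column names are hypothetical, and this is not necessarily how GeneSelection processes f internally.

```r
# Toy data.frame (hypothetical): one label column and three expression columns.
toy <- data.frame(
  class = factor(c("ALL", "ALL", "AML", "AML")),
  g1 = c(0.1, 0.3, 1.2, 1.5),
  g2 = c(2.0, 1.8, 0.4, 0.2),
  g3 = c(0.5, 0.6, 0.5, 0.7)
)

# Two-sided formula: left-hand side = class labels, right-hand side = variables.
f <- class ~ g1 + g2

mf <- model.frame(f, data = toy)
y <- model.response(mf)                 # class labels taken from the left-hand side
X <- as.matrix(mf[, -1, drop = FALSE])  # variables taken from the right-hand side

dim(X)      # observations in rows, selected variables in columns
levels(y)
```

Only g1 and g2 end up in X here, because g3 does not appear on the right-hand side of the formula.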

learningsets

An object of class learningsets. May be missing, in which case the complete dataset is used as the learning set.

method

A character specifying the method to be used:

t.test

Two-sample t-test (equal variances assumed for both classes).

welch.test

Welch modification of the t-test (unequal variances allowed for the two classes).

wilcox.test

Wilcoxon rank sum test.

f.test

F test of the linear hypothesis that the mean is the same for all classes. Usually used with the multiclass scheme; equivalent to method = "t.test" in the two-class case.

kruskal.test

Multiclass generalization of the Wilcoxon rank sum test, and the nonparametric counterpart of the F test.

limma

'Moderated t' statistic for the two-class case and 'moderated F' statistic for the multiclass case, described in Smyth et al. (2003). Requires the package limma.

rfe

One-step Recursive Feature Elimination, based on the Support Vector Machine. The method is described in Guyon et al. (2002). Requires the package e1071. Take care that appropriate hyperparameters are passed via the ... argument.

rf

Random Forest variable importance measure. Requires the package randomForest.

lasso

L1-penalized logistic regression, which leads to sparsity with respect to the variables used. Calls the function LassoCMA, which requires the package glmpath. Warning: take care that appropriate hyperparameters are passed via the ... argument.

elasticnet

Penalized logistic regression with both an L1 and an L2 penalty, claimed by Zou and Hastie (2005) to select 'variable groups'. Calls the function ElasticNetCMA, which requires the package glmpath. Warning: take care that appropriate hyperparameters are passed via the ... argument.

boosting

Componentwise boosting (Buehlmann and Yu, 2003), which has been shown to mimic the LASSO (Efron et al., 2004; Buehlmann and Yu, 2006). Calls the function compBoostCMA. Take care that appropriate hyperparameters are passed via the ... argument.

golub

The (theoretically unfounded) variable selection criterion used by Golub et al. (1999), see golub.

shrinkcat

The correlation-adjusted t-score from Zuber and Strimmer (2009).
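To make the idea of a 'filter' ranking concrete, here is a minimal base R sketch of ranking variables by an equal-variance two-sample t statistic on simulated data. It assumes that importance is the absolute value of the statistic; the simulated data and the helper tstat are hypothetical, and the code is an illustration rather than the package's implementation.

```r
set.seed(1)
n <- 20   # observations
p <- 50   # variables ("genes")

# Rows correspond to observations, columns to variables.
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- factor(rep(c("A", "B"), each = n / 2))

# Make the first three variables clearly differential between the classes.
X[y == "B", 1:3] <- X[y == "B", 1:3] + 5

# Equal-variance two-sample t statistic for a single variable.
tstat <- function(x, y) {
  lev <- levels(y)
  unname(t.test(x[y == lev[1]], x[y == lev[2]], var.equal = TRUE)$statistic)
}

# Importance = absolute t statistic (assumption); rank from most to least relevant.
importance <- abs(apply(X, 2, tstat, y = y))
ranking <- order(importance, decreasing = TRUE)

head(data.frame(index = ranking, importance = importance[ranking]), 5)
```

Swapping var.equal = TRUE for var.equal = FALSE would give the Welch variant of the same ranking.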

scheme

The scheme to be used in the case of a non-binary response. Must be one of "pairwise", "one-vs-all" or "multiclass". The last option only makes sense if method is one of f.test, limma, rf or boosting, which can be applied directly to the multiclass case.
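The difference between the "pairwise" and "one-vs-all" schemes can be pictured as two ways of splitting a multiclass response into binary problems. The base R sketch below uses a hypothetical three-class label vector and only shows the decomposition; how the per-problem rankings are combined afterwards is not covered here.

```r
# Hypothetical three-class response.
y <- factor(c("a", "a", "b", "b", "c", "c"))

# one-vs-all: one binary problem per class (class k against all others).
one_vs_all <- lapply(levels(y), function(k) {
  factor(ifelse(y == k, k, "rest"), levels = c(k, "rest"))
})
names(one_vs_all) <- levels(y)

# pairwise: one binary problem per pair of classes,
# restricted to the observations belonging to that pair.
pairs <- combn(levels(y), 2, simplify = FALSE)
pairwise <- lapply(pairs, function(pr) which(y %in% pr))
names(pairwise) <- vapply(pairs, paste, character(1), collapse = "-vs-")

length(one_vs_all)  # number of one-vs-all problems
length(pairwise)    # number of pairwise problems
```

With K classes this gives K one-vs-all problems and K(K-1)/2 pairwise problems, so the two counts coincide only for K = 3.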

trace

Should progress be traced? Default is TRUE.

...

Further arguments passed to the function performing variable selection, see method.

Value

An object of class genesel.

Note

Most of the methods described above are only suited to the binary classification case. The only ones that can be used without restriction in the multiclass case are f.test, limma, rf and boosting (see the scheme argument).

For the rest, pairwise or one-vs-all schemes are used.

Author(s)

Martin Slawski ms@cs.uni-sb.de

Anne-Laure Boulesteix boulesteix@ibe.med.uni-muenchen.de

Christoph Bernau bernau@ibe.med.uni-muenchen.de

References

Smyth, G. K., Yang, Y.-H., Speed, T. P. (2003).
Statistical issues in microarray data analysis.
Methods in Molecular Biology, 224, 111-136.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002).
Gene selection for cancer classification using support vector machines.
Machine Learning, 46, 389-422.

Zou, H., Hastie, T. (2005).
Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society B, 67(2), 301-320.

Buehlmann, P., Yu, B. (2003).
Boosting with the L2 loss: Regression and Classification.
Journal of the American Statistical Association, 98, 324-339.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004).
Least Angle Regression.
Annals of Statistics, 32, 407-499.

Buehlmann, P., Yu, B. (2006).
Sparse Boosting.
Journal of Machine Learning Research, 7, 1001-1024.

Slawski, M., Daumer, M., Boulesteix, A.-L. (2008).
CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data.
BMC Bioinformatics, 9, 439.

See Also

filter, GenerateLearningsets, tune, classification

Examples

library(CMA)
### load Golub AML/ALL data
data(golub)
### extract class labels (first column)
golubY <- golub[,1]
### extract gene expression for all genes (drop the label column)
golubX <- as.matrix(golub[,-1])
### generate five different learningsets (stratified 5-fold CV)
set.seed(111)
five <- GenerateLearningsets(y=golubY, method = "CV", fold = 5, strat = TRUE)
### simple t-test:
selttest <- GeneSelection(golubX, golubY, learningsets = five, method = "t.test")
### show result:
show(selttest)
### ten top-ranked genes for the first learningset
toplist(selttest, k = 10, iter = 1)
plot(selttest, iter = 1)

Example output

GeneSelection: iteration 1 
GeneSelection: iteration 2 
GeneSelection: iteration 3 
GeneSelection: iteration 4 
GeneSelection: iteration 5 
gene selection performed with 't.test'
scheme used :'pairwise'
number of genes:  3051 
number of different learningsets:  5 
top  10  genes for iteration  1 
 
   index importance
1    829   9.195902
2   2670   8.502019
3    378   8.178456
4   1009   7.792680
5   2124   7.695132
6    896   7.659007
7    515   7.522447
8    808   7.163428
9   1448   7.013725
10   394   6.937388

CMA documentation built on Nov. 8, 2020, 5:02 p.m.