GeneSelection: General method for variable selection with various methods
In CMA: Synthesis of microarray-based classification

For different learning data sets as defined by the argument learningsets, this method ranks the genes from the most relevant to the least relevant using one of various 'filter' criteria, or provides a sparse collection of variables (Lasso, ElasticNet, Boosting). The results are typically used for variable selection in the classification procedure that follows.
For S4 class information, see GeneSelection-methods.

GeneSelection(X, y, f, learningsets, method = c("t.test", "welch.test", "wilcox.test", "f.test", "kruskal.test", "limma", "rfe", "rf", "lasso", "elasticnet", "boosting", "golub", "shrinkcat"), scheme, trace = TRUE, ...)

`X`	Gene expression data. Can be one of the following: A `matrix`. Rows correspond to observations, columns to variables. A `data.frame`, when `f` is not missing (see below). An object of class `ExpressionSet`.
`y`	Class labels. Can be one of the following: A `numeric` vector. A `factor`. A `character` if `X` is an `ExpressionSet`. `missing`, if `X` is a `data.frame` and a proper formula `f` is provided.
`f`	A two-sided formula, if `X` is a `data.frame`. The left part corresponds to class labels, the right to variables.
`learningsets`	An object of class `learningsets`. May be missing, in which case the complete dataset is used as the learning set.
`method`	A character specifying the method to be used:
`t.test` Two-sample t-test (equal variances for both classes assumed).
`welch.test` Welch modification of the t-test (unequal variances for both classes).
`wilcox.test` Wilcoxon rank sum test.
`f.test` F test belonging to the linear hypothesis that the mean is the same for all classes. Usually used for the multiclass scheme; equivalent to `method = "t.test"` in the two-class case.
`kruskal.test` Multiclass generalization of the Wilcoxon rank sum test and the nonparametric counterpart of the F test.
`limma` 'Moderated t' statistic for the two-class case and 'moderated F' statistic for the multiclass case, described in Smyth (2003). Requires the package `limma`.
`rfe` One-step Recursive Feature Elimination, based on the Support Vector Machine. The method is described in Guyon et al. (2002). Requires the package `e1071`. Take care that appropriate hyperparameters are passed via the `...` argument.
`rf` Random Forest Variable Importance Measure. Requires the package `randomForest`.
`lasso` `L1` penalized logistic regression, which leads to sparsity with respect to the variables used. Calls the function `LassoCMA`, which requires the package `glmpath`. Warning: Take care that appropriate hyperparameters are passed via the `...` argument.
`elasticnet` Penalized logistic regression with both `L1` and `L2` penalty, claimed by Zou and Hastie (2005) to select 'variable groups'. Calls the function `ElasticNetCMA`, which requires the package `glmpath`. Warning: Take care that appropriate hyperparameters are passed via the `...` argument.
`boosting` Componentwise boosting (Buehlmann and Yu, 2003) has been shown to mimic the LASSO (Efron et al., 2004; Buehlmann and Yu, 2006). Calls the function `compBoostCMA`. Take care that appropriate hyperparameters are passed via the `...` argument.
`golub` The (theoretically unfounded) variable selection criterion used by Golub et al. (1999), see `golub`.
`shrinkcat` The correlation-adjusted t-score from Zuber and Strimmer (2009).
`scheme`	The scheme to be used in the case of a non-binary response. Must be one of `"pairwise"`, `"one-vs-all"` or `"multiclass"`. The last case only makes sense if `method` is one of `f.test`, `limma`, `rf`, `boosting`, which can be applied directly to the multiclass case.
`trace`	Should the progress be traced? Default is `TRUE`.
`...`	Further arguments passed to the function performing variable selection, see `method`.
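
To make the idea of a 'filter' ranking concrete, the following base-R sketch computes a pooled-variance two-sample t statistic for every column of a simulated matrix and orders the variables by its absolute value. It only illustrates the principle behind `method = "t.test"` and is not CMA's internal implementation; the simulated data, the helper `pooled_t`, and the choice to rank by absolute value are assumptions made for this example.

```r
# Illustrative sketch only; this is not CMA's internal code.
set.seed(1)
n <- 20   # observations
p <- 50   # variables
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # rows = observations, columns = variables
y <- rep(c(0, 1), each = n / 2)                # binary class labels
X[y == 1, 1:3] <- X[y == 1, 1:3] + 3           # shift the first three variables in class 1

# Pooled-variance (equal-variance) two-sample t statistic for one variable
pooled_t <- function(x, y) {
  x0 <- x[y == 0]
  x1 <- x[y == 1]
  n0 <- length(x0)
  n1 <- length(x1)
  sp2 <- ((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / (n0 + n1 - 2)
  (mean(x1) - mean(x0)) / sqrt(sp2 * (1 / n0 + 1 / n1))
}

# Importance = absolute t statistic; ranking = variables ordered by importance
importance <- abs(apply(X, 2, pooled_t, y = y))
ranking <- order(importance, decreasing = TRUE)
head(data.frame(index = ranking, importance = importance[ranking]), 5)
```

The resulting table has the same two columns (`index`, `importance`) as the `toplist` output of the package, which makes it easy to compare by eye, but the numbers are not expected to match any CMA result.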

An object of class genesel.

Most of the methods described above are suitable only for the binary classification case. The only ones that can be used without restriction in the multiclass case are:

f.test
kruskal.test
rf
boosting

For the rest, pairwise or one-vs-all schemes are used.
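
As a rough illustration of what the two schemes mean, the base-R sketch below recodes a three-class label into the binary problems that a one-vs-all or a pairwise scheme would give rise to. The labels and all object names are hypothetical, and how CMA combines the per-problem rankings internally is not shown here.

```r
# Illustrative sketch only; the labels and object names are made up.
y <- factor(c("A", "B", "C", "A", "B", "C"))

# one-vs-all: one binary problem per class (1 = this class, 0 = any other class)
one_vs_all <- lapply(levels(y), function(k) as.integer(y == k))
names(one_vs_all) <- levels(y)

# pairwise: one binary problem per pair of classes, restricted to the
# observations that belong to either class of the pair
class_pairs <- combn(levels(y), 2, simplify = FALSE)
pairwise <- lapply(class_pairs, function(pr) {
  keep <- which(y %in% pr)
  list(classes = pr, obs = keep, label = as.integer(y[keep] == pr[2]))
})

length(one_vs_all)  # number of one-vs-all problems
length(pairwise)    # number of pairwise problems
```

With K classes, one-vs-all yields K binary problems and pairwise yields K(K-1)/2, so the pairwise scheme grows faster as the number of classes increases.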

Martin Slawski ms@cs.uni-sb.de

Anne-Laure Boulesteix boulesteix@ibe.med.uni-muenchen.de

Christoph Bernau bernau@ibe.med.uni-muenchen.de

Smyth, G. K., Yang, Y.-H., Speed, T. P. (2003).
Statistical issues in microarray data analysis.
Methods in Molecular Biology 224, 111-136.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002).
Gene selection for cancer classification using support vector machines.
Machine Learning, 46, 389-422.

Zou, H., Hastie, T. (2005).
Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society B, 67(2), 301-320.

Buehlmann, P., Yu, B. (2003).
Boosting with the L2 loss: Regression and Classification.
Journal of the American Statistical Association, 98, 324-339

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004).
Least Angle Regression.
Annals of Statistics, 32, 407-499.

Buehlmann, P., Yu, B. (2006).
Sparse Boosting.
Journal of Machine Learning Research, 7, 1001-1024.

Slawski, M., Daumer, M., Boulesteix, A.-L. (2008).
CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data.
BMC Bioinformatics, 9, 439.

filter, GenerateLearningsets, tune, classification

library(CMA)
### load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression for all genes (every column except the class labels)
golubX <- as.matrix(golub[,-1])
### Generate five different learningsets
set.seed(111)
five <- GenerateLearningsets(y=golubY, method = "CV", fold = 5, strat = TRUE)
### simple t-test:
selttest <- GeneSelection(golubX, golubY, learningsets = five, method = "t.test")
### show result:
show(selttest)
toplist(selttest, k = 10, iter = 1)
plot(selttest, iter = 1)

GeneSelection: iteration 1 
GeneSelection: iteration 2 
GeneSelection: iteration 3 
GeneSelection: iteration 4 
GeneSelection: iteration 5 
gene selection performed with 't.test'
scheme used :'pairwise'
number of genes:  3051 
number of different learningsets:  5 
top  10  genes for iteration  1 
 
   index importance
1    829   9.195902
2   2670   8.502019
3    378   8.178456
4   1009   7.792680
5   2124   7.695132
6    896   7.659007
7    515   7.522447
8    808   7.163428
9   1448   7.013725
10   394   6.937388