selectFeatures: Select features using frequency-based or ensemble feature...

Description

Performs a multivariate feature selection using frequency-based feature selection (based on RF-RFE, RJ-RFE or SVM-RFE) or ensemble feature selection (based on SVM-RFE).

Usage

selectFeatures(elist = NULL, n1 = NULL, n2 = NULL, label1 = "A", label2 = "B",
  log = NULL, cutoff = 10, selection.method = "rf.rfe",
  preselection.method = "mMs", subruns = 100, k = 10, subsamples = 10,
  bootstraps = 10, candidate.number = 300, above = 1500, between = 400,
  panel.selection.criterion = "accuracy", importance.measure = "MDA",
  ntree = 500, mtry = NULL, plot = FALSE, output.path = NULL, verbose = FALSE,
  method = "frequency")

Arguments

elist

EListRaw or EList object containing all microarray data (mandatory).

n1

integer indicating the sample number in group 1 (mandatory).

n2

integer indicating the sample number in group 2 (mandatory).

label1

class label of group 1 (default: "A").

label2

class label of group 2 (default: "B").

log

indicates whether the data is in log scale (mandatory; note: if TRUE log2 scale is expected).

cutoff

integer indicating how many features will be selected (default: 10).

selection.method

string indicating the feature selection method: "rf.rfe" (default), "svm.rfe" or "rj.rfe". Has no effect when method="ensemble".

preselection.method

string indicating the feature preselection method: "mMs" (default), "tTest", "mrmr" or "none". Has no effect when method="ensemble".

subruns

integer indicating the number of resampling repeats to be performed (default: 100). Has no effect when method="ensemble".

k

integer indicating the number of k-fold cross validation subsets (default: 10, i.e., 10-fold CV).

subsamples

integer indicating the number of subsamples for ensemble feature selection (default: 10). Has no effect when method="frequency".

bootstraps

integer indicating the number of bootstrap samples for ensemble feature selection (default: 10). Has no effect when method="frequency".

candidate.number

integer indicating how many features shall be preselected (default: 300). Has no effect when method="ensemble".

above

integer indicating the mMs above parameter (default: 1500). Has no effect when method="ensemble".

between

integer indicating the mMs between parameter (default: 400). Has no effect when method="ensemble".

panel.selection.criterion

string indicating the panel selection criterion: "accuracy" (default), "sensitivity" or "specificity". Has no effect when method="ensemble".

importance.measure

string indicating the random forest importance measure: "MDA" (default) or "MDG". Has no effect when method="ensemble".

ntree

random forest parameter ntree (default: 500). Has no effect when method="ensemble".

mtry

random forest parameter mtry (default: sqrt(p) where p is the number of predictors). Has no effect when method="ensemble".

plot

logical indicating whether performance plots shall be plotted (default: FALSE).

output.path

string indicating the results output folder (optional).

verbose

logical indicating whether additional information shall be printed to the console (default: FALSE).

method

the feature selection method: "frequency" (default) for frequency-based or "ensemble" for ensemble feature selection.

Details

This function takes an EListRaw or EList object, group-specific sample numbers, group labels, and parameters that choose and configure a multivariate feature selection method (frequency-based or ensemble feature selection) to select a panel of differential features. If an output path is defined (via output.path), results are saved to disk; if verbose is TRUE, additional information is printed to the console.

Frequency-based feature selection (method="frequency"): The data set is split into k cross validation pairs of training and test sets. For each training set, a multivariate feature selection procedure is performed. The resulting k feature subsets are then tested on the corresponding test sets (via classification). As a result, selectFeatures() returns the average k-fold cross validation classification accuracy as well as the selected feature panel (i.e., the union of the k fold-specific feature subsets). The supported multivariate feature selection methods are random forest recursive feature elimination (RF-RFE), random jungle recursive feature elimination (RJ-RFE) and support vector machine recursive feature elimination (SVM-RFE). To reduce running time, an optional univariate feature preselection can be performed (controlled via preselection.method). The supported preselection methods are mMs ("mMs"), Student's t-test ("tTest") and mRMR ("mrmr"); alternatively, preselection can be switched off ("none"). This approach is similar to the method proposed in Baek et al.
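
The cross validation scheme of the frequency-based method can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not PAA's implementation: the data are simulated, the helper select_top is hypothetical and ranks features by absolute t-statistic as a simple stand-in for RF-RFE, RJ-RFE or SVM-RFE, and the per-fold classification step is omitted.

```r
set.seed(1)

# Simulated data (assumption): 40 samples x 50 features, two groups of 20.
n1 <- 20; n2 <- 20; p <- 50
x <- matrix(rnorm((n1 + n2) * p), nrow = n1 + n2,
            dimnames = list(NULL, paste0("f", seq_len(p))))
y <- factor(rep(c("A", "B"), times = c(n1, n2)))
x[y == "A", 1:3] <- x[y == "A", 1:3] + 2   # make f1..f3 differential

# Hypothetical stand-in selector: top `cutoff` features by |t| statistic.
select_top <- function(x, y, cutoff) {
  tstat <- vapply(colnames(x),
                  function(f) unname(t.test(x[, f] ~ y)$statistic),
                  numeric(1))
  names(sort(abs(tstat), decreasing = TRUE))[seq_len(cutoff)]
}

# Assign each sample to one of k folds.
k <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(x)))

# Run the selector on each training set (all folds except the held-out one).
subsets <- lapply(seq_len(k), function(i) {
  train <- fold != i
  select_top(x[train, , drop = FALSE], y[train], cutoff = 5)
})

# The panel is the union of the k fold-specific subsets.
panel <- Reduce(union, subsets)
```

Because the panel is a union, it can contain more than cutoff features when the fold-specific subsets disagree.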

Ensemble feature selection (method="ensemble"): From the whole data set, the number of subsamples defined by subsamples is drawn, each defining a pair of training and test sets. For each training set, the number of bootstrap samples defined by bootstraps is drawn. Then, SVM-RFE is performed on each bootstrap sample to obtain a feature ranking. To obtain a final ranking for a particular training set, all associated bootstrap rankings are aggregated into a single ranking. To score the top features (their number is defined by cutoff), the test set of each subsample is classified using an SVM trained on the top features of the corresponding training set, and the classification accuracy is determined. Finally, the stability of the subsample-specific panels is assessed (via the Kuncheva index; Kuncheva, 2007), all subsample-specific rankings are aggregated, the top features (their number is defined by cutoff) are selected, the average classification accuracy is computed, and all these results are returned in a list. This approach has been proposed in Abeel et al.
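
Two building blocks of the ensemble method, rank aggregation and the Kuncheva index, can be sketched in base R. The function names aggregate_ranks, kuncheva and stability are hypothetical, and aggregation by mean rank is an assumption, since this page does not state which aggregation PAA uses. The index itself follows Kuncheva (2007): for two subsets of size k drawn from n features with r features in common, it is (r * n - k^2) / (k * (n - k)).

```r
# Aggregate several rankings by mean rank (assumption, see above).
# Each column of `ranks` is one ranking; rank 1 is best.
aggregate_ranks <- function(ranks) {
  sort(rowMeans(ranks))
}

# Kuncheva index for two feature subsets of equal size k out of n features.
kuncheva <- function(a, b, n) {
  k <- length(a)
  stopifnot(length(b) == k, k > 0, k < n)
  r <- length(intersect(a, b))
  (r * n - k^2) / (k * (n - k))
}

# Stability of a list of panels: mean index over all pairs of panels.
stability <- function(panels, n) {
  pairs <- combn(length(panels), 2)
  mean(apply(pairs, 2, function(ij) {
    kuncheva(panels[[ij[1]]], panels[[ij[2]]], n)
  }))
}

# Toy example: three bootstrap rankings of five features.
ranks <- cbind(b1 = c(1, 2, 3, 4, 5),
               b2 = c(2, 1, 3, 5, 4),
               b3 = c(1, 3, 2, 4, 5))
rownames(ranks) <- paste0("f", 1:5)
top2 <- names(aggregate_ranks(ranks))[1:2]

# Toy example: three subsample-specific panels of size 2 out of 5 features.
panels <- list(c("f1", "f2"), c("f1", "f2"), c("f1", "f3"))
stab <- stability(panels, n = 5)
```

In the toy example, top2 contains f1 and f2, and stab equals 4/9: one pair of panels is identical (index 1) and the other two pairs share one feature (index 1/6 each).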

Value

If method is "frequency", the results list contains the following elements:

accuracy

average k-fold cross validation accuracy.

sensitivity

average k-fold cross validation sensitivity.

specificity

average k-fold cross validation specificity.

features

selected feature panel.

all.results

complete cross validation results.

If method is "ensemble", the results list contains the following elements:

accuracy

average accuracy regarding all subsamples.

sensitivity

average sensitivity regarding all subsamples.

specificity

average specificity regarding all subsamples.

features

selected feature panel.

all.results

all feature ranking results.

stability

stability of the feature panel (i.e., Kuncheva index for the subsample-specific panels).
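
The accuracy, sensitivity and specificity elements can be illustrated with a short base R sketch that assumes the standard confusion-matrix definitions. The function performance is hypothetical, and which group PAA treats as the positive class is not stated on this page; the sketch takes it as an explicit argument.

```r
# Accuracy, sensitivity and specificity from true and predicted class labels.
# Assumption: `positive` names the class treated as positive.
performance <- function(truth, predicted, positive) {
  tp <- sum(truth == positive & predicted == positive)
  tn <- sum(truth != positive & predicted != positive)
  fp <- sum(truth != positive & predicted == positive)
  fn <- sum(truth == positive & predicted != positive)
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy test set: two "AD" and two "NDC" samples, one "AD" sample misclassified.
truth     <- c("AD", "AD", "NDC", "NDC")
predicted <- c("AD", "NDC", "NDC", "NDC")
perf <- performance(truth, predicted, positive = "AD")
```

For this toy test set, perf holds an accuracy of 0.75, a sensitivity of 0.5 and a specificity of 1.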

Author(s)

Michael Turewicz, michael.turewicz@rub.de

References

Baek S, Tsai CA, Chen JJ: Development of biomarker classifiers from high-dimensional data. Brief Bioinform. 2009 Sep;10(5):537-46.

Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010 Feb 1;26(3):392-8.

Kuncheva LI: A stability index for feature selection. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications. 2007 Feb 12-14:390-5.

Examples

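library(PAA)

# Load the Alzheimer example data shipped with PAA (provides `elist`).
cwd <- system.file(package="PAA")
load(paste(cwd, "/extdata/Alzheimer.RData", sep=""))
# Restrict the data to blocks 1 to 9.
elist <- elist[elist$genes$Block < 10,]

# Column names of the 20 AD and the 20 NDC samples.
c1 <- paste(rep("AD",20), 1:20, sep="")
c2 <- paste(rep("NDC",20), 1:20, sep="")

# Univariate preselection (t-test); drop the features flagged for discarding.
pre.sel.results <- preselect(elist=elist, columns1=c1, columns2=c2, label1="AD",
    label2="NDC", log=FALSE, discard.threshold=0.1, fold.thresh=1.9,
    discard.features=TRUE, method="tTest")
elist <- elist[-pre.sel.results$discard,]

# Ensemble feature selection with 2 subsamples and 1 bootstrap sample each.
selectFeatures.results <- selectFeatures(elist, n1=20, n2=20, label1="AD",
    label2="NDC", log=FALSE, subsamples=2, bootstraps=1, candidate.number=20,
    method="ensemble")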

Example output

Loading required package: Rcpp
selectFeatures - current subsample: 1 of 2

selectFeatures - best features for current subsample: 
5 15 176 9 71 5 193 7 212 16 52 19 18 10 78 8 56 16 33 7 1, 


selectFeatures - classification results - accuracy:0.75, sensitivity: 0.5, specificity: 1


selectFeatures - current subsample: 2 of 2

selectFeatures - best features for current subsample: 
8 8 172 19 12 16 56 9 71 5 198 8 51 2 136 16 34 16 133 7 21, 


selectFeatures - classification results - accuracy:0.75, sensitivity: 0.5, specificity: 1


selectFeatures - finally selected features: 
6 9 71 5 192 16 52 19 15 15 173 7 218 8 56 16 31 2 138 10 7, 


selectFeatures - average classification resultds - accuracy: 0.75, sensitivity: 0.5, specificity: 1



There were 50 or more warnings (use warnings() to see the first 50)

PAA documentation built on Nov. 8, 2020, 8:30 p.m.