Performs a multivariate feature selection using frequency-based feature selection (based on RF-RFE, RJ-RFE or SVM-RFE) or ensemble feature selection (based on SVM-RFE).
selectFeatures(elist = NULL, n1 = NULL, n2 = NULL, label1 = "A", label2 = "B",
log=NULL, cutoff = 10, selection.method = "rf.rfe",
preselection.method = "mMs", subruns = 100, k = 10, subsamples = 10,
bootstraps = 10, candidate.number = 300, above=1500, between=400,
panel.selection.criterion="accuracy", importance.measure="MDA", ntree = 500,
mtry = NULL, plot = FALSE, output.path = NULL, verbose = FALSE,
method = "frequency")
elist |
EListRaw or EList object. |
n1 |
integer indicating the sample number in group 1 (mandatory). |
n2 |
integer indicating the sample number in group 2 (mandatory). |
label1 |
class label of group 1 (default: "A"). |
label2 |
class label of group 2 (default: "B"). |
log |
indicates whether the data is in log scale (mandatory; note: if TRUE log2 scale is expected). |
cutoff |
integer indicating how many features will be selected (default: 10). |
selection.method |
string indicating the feature selection method:
|
preselection.method |
string indicating the feature preselection
method: |
subruns |
integer indicating the number of resampling repeats to be
performed (default: 100). Has no effect when |
k |
integer indicating the number of k-fold cross validation subsets (default: 10, i.e., 10-fold CV). |
subsamples |
integer indicating the number of subsamples for ensemble
feature selection (default: 10). Has no effect when
|
bootstraps |
integer indicating the number of bootstrap samples for
ensemble feature selection (default: 10). Has no effect when
|
candidate.number |
integer indicating how many features shall be preselected. Default is 300. |
above |
mMs above parameter (integer). Default is 1500. |
between |
mMs between parameter (integer). Default is 400. |
panel.selection.criterion |
indicating the panel selection
criterion: |
importance.measure |
string indicating the random forest importance
measure: |
ntree |
random forest parameter ntree (default: 500). |
mtry |
random forest parameter mtry (default: NULL). |
plot |
logical indicating whether performance plots shall be plotted (default: FALSE). |
output.path |
string indicating the results output folder (optional). |
verbose |
logical indicating whether additional information shall be printed to the console (default: FALSE). |
method |
the feature selection method: "frequency" (default) for frequency-based or "ensemble" for ensemble feature selection. |
This function takes an EListRaw or EList object, group-specific sample numbers,
group labels and parameters choosing and configuring a multivariate feature
selection method (frequency-based or ensemble feature selection) to select a
panel of differential features. When an output path is defined (via
output.path), results will be saved on the hard disk, and when verbose is TRUE,
additional information will be printed to the console.
Frequency-based feature selection (method="frequency"): The whole data set is
split into k cross validation training and test set pairs. For each training
set a multivariate feature selection procedure is performed. The resulting k
feature subsets are tested using the corresponding test sets (via
classification). As a result, selectFeatures() returns the average k-fold
cross validation classification accuracy as well as the selected feature panel
(i.e., the union set of the k particular feature subsets). Random forest
recursive feature elimination (RF-RFE), random jungle recursive feature
elimination (RJ-RFE) and support vector machine recursive feature elimination
(SVM-RFE) are supported as multivariate feature selection methods. To reduce
running times, univariate feature preselection can optionally be performed
(controlled via preselection.method). The supported univariate preselection
methods are mMs ("mMs"), Student's t-test ("tTest") and mRMR ("mrmr").
Alternatively, no preselection can be chosen ("none"). This approach is similar
to the method proposed in Baek et al.
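The frequency-based scheme can be illustrated with a minimal base R sketch. It is not the PAA implementation: the toy data, the helper select_top() and the nearest-centroid classifier are hypothetical stand-ins, and a simple t-statistic ranking replaces RF-RFE, RJ-RFE or SVM-RFE. It only shows the k-fold split, the per-fold selection, and the union of the k feature subsets.

```r
set.seed(1)

# Toy data (assumption): 40 samples x 50 features, features 1-3 shifted in group B.
n1 <- 20; n2 <- 20; p <- 50
x <- matrix(rnorm((n1 + n2) * p), nrow = n1 + n2, ncol = p)
y <- factor(rep(c("A", "B"), times = c(n1, n2)))
x[y == "B", 1:3] <- x[y == "B", 1:3] + 3

k <- 10       # number of cross validation folds
cutoff <- 5   # number of features kept per fold

# Randomly assign each sample to one of the k folds.
folds <- sample(rep(seq_len(k), length.out = nrow(x)))

# Stand-in selector (NOT an RFE method): rank features by absolute t-statistic.
select_top <- function(x_train, y_train, cutoff) {
  t_abs <- apply(x_train, 2, function(f) abs(t.test(f ~ y_train)$statistic))
  order(t_abs, decreasing = TRUE)[seq_len(cutoff)]
}

fold_results <- lapply(seq_len(k), function(i) {
  train <- folds != i
  feats <- select_top(x[train, , drop = FALSE], y[train], cutoff)
  # Classify the held-out fold with a nearest-centroid rule on the selected features.
  cen_a <- colMeans(x[train & y == "A", feats, drop = FALSE])
  cen_b <- colMeans(x[train & y == "B", feats, drop = FALSE])
  x_test <- x[!train, feats, drop = FALSE]
  d_a <- rowSums(sweep(x_test, 2, cen_a)^2)
  d_b <- rowSums(sweep(x_test, 2, cen_b)^2)
  pred <- ifelse(d_a <= d_b, "A", "B")
  list(features = feats, accuracy = mean(pred == as.character(y[!train])))
})

# Final panel: union of the k fold-specific feature subsets.
panel <- sort(unique(unlist(lapply(fold_results, `[[`, "features"))))
# Average k-fold cross validation accuracy.
accuracy <- mean(vapply(fold_results, `[[`, numeric(1), "accuracy"))
```

Because the panel is a union, it can contain more than cutoff features when the folds disagree on which features to keep.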
Ensemble feature selection (method="ensemble"): From the whole data set the
previously defined number of subsamples is drawn, defining pairs of training and
test sets. Moreover, for each training set a previously defined number of
bootstrap samples is drawn. Then, for each bootstrap sample SVM-RFE is performed
and a feature ranking is obtained. To obtain a final ranking for a particular
training set, all associated bootstrap rankings are aggregated to a single
ranking. To score the cutoff best features, for each subsample a classification
of the test set is performed (using an SVM trained with the cutoff best features
from the training set) and the classification accuracy is determined. Finally,
the stability of the subsample-specific panels is assessed (via Kuncheva index,
Kuncheva LI, 2007), all subsample-specific rankings are aggregated, the top n
features (defined by cutoff) are selected, the average classification accuracy
is computed, and all these results are returned in a list. This approach has
been proposed in Abeel et al.
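The aggregation of several rankings into a single ranking can be sketched as follows. This is a hedged illustration only: the helper aggregate_rankings() and the feature names are hypothetical, and aggregation by mean rank position is an assumption, since the text above does not state which aggregation rule PAA applies.

```r
# Hypothetical example: three bootstrap rankings of five features,
# each listed from best to worst.
bootstrap_rankings <- list(
  c("f1", "f2", "f3", "f4", "f5"),
  c("f2", "f1", "f3", "f5", "f4"),
  c("f1", "f3", "f2", "f4", "f5")
)

# Aggregate rankings by mean rank position (assumed rule, see lead-in).
aggregate_rankings <- function(rankings) {
  features <- sort(unique(unlist(rankings)))
  # rank_matrix[i, j] = position of feature i in ranking j
  rank_matrix <- vapply(rankings,
                        function(r) match(features, r),
                        integer(length(features)))
  rownames(rank_matrix) <- features
  mean_rank <- rowMeans(rank_matrix)
  # Features ordered from lowest (best) to highest mean rank.
  names(sort(mean_rank))
}

final_ranking <- aggregate_rankings(bootstrap_rankings)

# Keep the best features, analogous to the cutoff argument.
cutoff <- 2
top_features <- head(final_ranking, cutoff)
```

The same helper could be applied a second time to the subsample-specific rankings to obtain the final ranking from which the cutoff best features are taken.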
If method is "frequency", the results list contains the following elements:
accuracy |
average k-fold cross validation accuracy. |
sensitivity |
average k-fold cross validation sensitivity. |
specificity |
average k-fold cross validation specificity. |
features |
selected feature panel. |
all.results |
complete cross validation results. |
If method is "ensemble", the results list contains the following elements:
accuracy |
average accuracy regarding all subsamples. |
sensitivity |
average sensitivity regarding all subsamples. |
specificity |
average specificity regarding all subsamples. |
features |
selected feature panel. |
all.results |
all feature ranking results. |
stability |
stability of the feature panel (i.e., Kuncheva index for the subsample-specific panels). |
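For orientation, the Kuncheva index for two feature subsets of equal size k, drawn from n features and sharing r features, is (r * n - k^2) / (k * (n - k)). The sketch below computes it in base R. The function names and example panels are hypothetical, and averaging over all pairs of panels is an assumption about how an overall stability value is formed, not a statement about the PAA internals.

```r
# Kuncheva index for two feature subsets a and b of equal size k,
# selected from n features in total; r is the size of their overlap.
kuncheva_pair <- function(a, b, n) {
  k <- length(a)
  stopifnot(length(b) == k, k > 0, k < n)
  r <- length(intersect(a, b))
  (r * n - k^2) / (k * (n - k))
}

# Overall stability (assumed): mean index over all pairs of panels.
kuncheva_stability <- function(panels, n) {
  pairs <- combn(length(panels), 2)
  mean(apply(pairs, 2, function(p) {
    kuncheva_pair(panels[[p[1]]], panels[[p[2]]], n)
  }))
}

# Hypothetical panels of three features each, selected from ten features.
panels <- list(c(1, 2, 3), c(1, 2, 4), c(1, 2, 3))
stability <- kuncheva_stability(panels, n = 10)
```

Identical panels give an index of 1, while values near 0 indicate an overlap no larger than expected by chance.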
Michael Turewicz, michael.turewicz@rub.de
Baek S, Tsai CA, Chen JJ: Development of biomarker classifiers from high-dimensional data. Brief Bioinform. 2009 Sep;10(5):537-46.
Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010 Feb 1;26(3):392-8.
Kuncheva LI: A stability index for feature selection. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications. February 12-14, 2007. Pages: 390-395.
# Load the example data shipped with PAA and keep a subset of blocks
cwd <- system.file(package="PAA")
load(paste(cwd, "/extdata/Alzheimer.RData", sep=""))
elist <- elist[elist$genes$Block < 10,]
# Column names of the two groups (20 AD and 20 NDC samples)
c1 <- paste(rep("AD",20), 1:20, sep="")
c2 <- paste(rep("NDC",20), 1:20, sep="")
# Univariate preselection; discard the features it flags
pre.sel.results <- preselect(elist=elist, columns1=c1, columns2=c2, label1="AD",
label2="NDC", log=FALSE, discard.threshold=0.1, fold.thresh=1.9,
discard.features=TRUE, method="tTest")
elist <- elist[-pre.sel.results$discard,]
# Ensemble feature selection with small settings to keep the run time short
selectFeatures.results <- selectFeatures(elist, n1=20, n2=20, label1="AD",
label2="NDC", log=FALSE, subsamples=2, bootstraps=1, candidate.number=20,
method="ensemble")
Loading required package: Rcpp
selectFeatures - current subsample: 1 of 2
selectFeatures - best features for current subsample:
5 15 176 9 71 5 193 7 212 16 52 19 18 10 78 8 56 16 33 7 1,
selectFeatures - classification results - accuracy:0.75, sensitivity: 0.5, specificity: 1
selectFeatures - current subsample: 2 of 2
selectFeatures - best features for current subsample:
8 8 172 19 12 16 56 9 71 5 198 8 51 2 136 16 34 16 133 7 21,
selectFeatures - classification results - accuracy:0.75, sensitivity: 0.5, specificity: 1
selectFeatures - finally selected features:
6 9 71 5 192 16 52 19 15 15 173 7 218 8 56 16 31 2 138 10 7,
selectFeatures - average classification resultds - accuracy: 0.75, sensitivity: 0.5, specificity: 1
There were 50 or more warnings (use warnings() to see the first 50)