feature.selection | R Documentation |
Functions to create functions that perform feature selection (or at least feature reduction) using statistics that access class labels.
keepAll(data, group)
fsTtest(fdr, ming=500)
fsModifiedFisher(q)
fsPearson(q = NULL, rho)
fsSpearman(q = NULL, rho)
fsMedSplitOddsRatio(q = NULL, OR)
fsChisquared(q = NULL, cutoff)
fsEntropy(q = 0.9, kind=c("information.gain", "gain.ratio", "symmetric.uncertainty"))
fsFisherRandomForest(q)
fsTailRank(specificity=0.9, tolerance=0.5, confidence=0.5)
data |
A matrix containing the data; columns are samples and rows are features. |
group |
A factor with two levels defining the sample classes. |
fdr |
A real number between 0 and 1 specifying the target false discovery rate (FDR). |
ming |
An integer specifying the minimum number of features to return; overrides the FDR. |
q |
A real number between 0.5 and 1 specifying the fraction of features to discard. |
rho |
A real number between 0 and 1 specifying the absolute value of the correlation coefficient used to filter features. |
OR |
A real number specifying the desired odds ratio for filtering features. |
cutoff |
A real number specifying the targeted cutoff rate when using the statistic to filter features. |
kind |
The kind of information metric to use for filtering features. |
specificity |
See |
tolerance |
See |
confidence |
See |
Following the usual conventions introduced from the world of gene expression microarrays, a typical data matrix is constructed from columns representing samples on which we want to make predictions and rows representing the features used to construct the predictive model. In this context, we define a feature selector or pruner to be a function that accepts a data matrix and a two-level factor as its only arguments and returns a logical vector, whose length equals the number of rows in the matrix, where 'TRUE' indicates features that should be retained.
Most pruning functions belong to parametrized families. We implement this idea using a set of function-generating functions, whose arguments are the parameters that pick out the desired member of the family. The return value is an instantiation of a particular filtering function. We define things this way so that the methods can be applied in cross-validation (or other) loops, where we want to ensure that the same feature selection rule is used each time.
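The function-generating pattern described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: fsAbsMeanDiff is a hypothetical generator, invented here, that scores each feature by the absolute difference in group means and discards the fraction q with the lowest scores.

```r
# Hypothetical generator (not part of the package): returns a pruner that
# keeps features whose absolute difference in group means exceeds the
# q-th quantile of that score.
fsAbsMeanDiff <- function(q) {
  function(data, group) {
    isA <- group == levels(group)[1]
    score <- abs(rowMeans(data[, isA, drop = FALSE]) -
                 rowMeans(data[, !isA, drop = FALSE]))
    as.vector(score > quantile(score, q))
  }
}

set.seed(1)
x <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
g <- factor(rep(c("A", "B"), each = 10))

pruner <- fsAbsMeanDiff(q = 0.9)   # pick one member of the family
keep <- pruner(x, g)               # logical vector, one entry per row
length(keep) == nrow(x)
sum(keep)
```

The parameter q is fixed once, when the pruner is created; every later call to pruner(data, group) then applies the same rule.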
We have implemented the following algorithms:
keepAll
: retain all features; do nothing.
fsTtest
: Keep features based on the false discovery rate
from a two-group t-test, but always retain a specified minimum number
of features.
fsModifiedFisher
: Retain the top quantile of features
for the statistic
\frac{(m_A - m)^2 + (m_B - m)^2}{v_A + v_B}
where m is the overall mean, m_A and m_B are the group means, and v_A and v_B are the group variances.
fsPearson
: Retain the top quantile of features based on
the absolute value of the Pearson correlation with the binary outcome.
fsSpearman
: Retain the top quantile of features based on
the absolute value of the Spearman correlation with the binary outcome.
fsMedSplitOddsRatio
: Retain the top quantile of
features based on the odds ratio to predict the binary outcome,
after first dichotomizing the continuous predictor using a split at
the median value.
fsChisquared
: retain the top quantile of features based
on a chi-squared test comparing the binary outcome to continuous
predictors discretized into ten bins.
fsEntropy
: retain the top quantile of features based on
one of three information-theoretic measures of entropy.
fsFisherRandomForest
: retain the top features based on
their importance in a random forest analysis, after first filtering
using the modified Fisher statistic.
fsTailRank
: Retain features that are significant based
on the TailRank test, which is a measure of whether the tails of the
distributions are different.
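As a concrete instance of one of these rules, the modified Fisher statistic can be computed row by row in base R. The sketch below rests on assumptions: it takes m to be the overall mean of each feature and v_A, v_B to be the within-group variances, and the helper name modifiedFisher is invented for illustration. It is not the code used by fsModifiedFisher.

```r
# Sketch of the modified Fisher statistic, one value per feature (row).
# Assumes m is the overall row mean and v_A, v_B are within-group variances.
modifiedFisher <- function(data, group) {
  isA <- group == levels(group)[1]
  m  <- rowMeans(data)
  mA <- rowMeans(data[, isA, drop = FALSE])
  mB <- rowMeans(data[, !isA, drop = FALSE])
  vA <- apply(data[, isA, drop = FALSE], 1, var)
  vB <- apply(data[, !isA, drop = FALSE], 1, var)
  ((mA - m)^2 + (mB - m)^2) / (vA + vB)
}

set.seed(2)
x <- matrix(rnorm(100 * 12), nrow = 100, ncol = 12)
x[1:5, 1:6] <- x[1:5, 1:6] + 3      # first five features differ by group
g <- factor(rep(c("A", "B"), each = 6))

stat <- modifiedFisher(x, g)
keep <- stat > quantile(stat, 0.9)  # discard the bottom 90 percent
which(keep)
```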
The keepAll
function is a "pruner"; it takes the data matrix and
grouping factor as arguments, and returns a logical vector indicating
which features to retain.
Each of the other nine functions described here uses its
arguments to construct and return a pruning function,
f
, that has the same interface as keepAll
.
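Because every pruner shares this interface, code that loops over resampled data can accept any of them interchangeably. The following base R sketch illustrates that with two stand-ins written for this example (keepAllSketch and the hypothetical generator fsMeanGap); neither is the package's code, and the fold assignment is likewise an assumption made for the illustration.

```r
# Stand-ins written for this sketch; neither is the package's code.
keepAllSketch <- function(data, group) rep(TRUE, nrow(data))
fsMeanGap <- function(q) {            # hypothetical generator
  function(data, group) {
    isA <- group == levels(group)[1]
    gap <- abs(rowMeans(data[, isA, drop = FALSE]) -
               rowMeans(data[, !isA, drop = FALSE]))
    as.vector(gap > quantile(gap, q))
  }
}

set.seed(3)
x <- matrix(rnorm(300 * 24), nrow = 300, ncol = 24)
g <- factor(rep(c("A", "B"), each = 12))
# Assign samples to 4 folds, balanced within each class.
fold <- rep(rep(1:4, each = 3), times = 2)

# Count the retained features per fold for any pruner with this interface.
countKept <- function(pruner) {
  sapply(1:4, function(k) {
    train <- fold != k                # select features on training samples only
    sum(pruner(x[, train], g[train]))
  })
}

countKept(keepAllSketch)
countKept(fsMeanGap(q = 0.8))
```

The pruner is created once, outside the loop, so the same selection rule is applied to each training set even though the selected features may differ from fold to fold.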
Kevin R. Coombes <krc@silicovore.com>
See Modeler-class
and Modeler
for details
about how to train and test models.
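set.seed(246391)
# Simulate 1000 features (rows) measured on 36 samples (columns).
data <- matrix(rnorm(1000*36), nrow=1000, ncol=36)
# Shift the first 50 features upward in the first 18 samples (group A).
data[1:50, 1:18] <- data[1:50, 1:18] + 1
status <- factor(rep(c("A", "B"), each=18))
# Pearson filter specified by the fraction of features to discard.
fsel <- fsPearson(q = 0.9)
summary(fsel(data, status))
# Pearson filter specified by an absolute correlation of 0.3.
fsel <- fsPearson(rho=0.3)
summary(fsel(data, status))
# Entropy filter using the gain ratio, with the default q = 0.9.
fsel <- fsEntropy(kind="gain.ratio")
summary(fsel(data, status))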