Filtering of Features in an ExpressionSet
Description
The function nsFilter tries to provide a one-stop shop for
different options of filtering (removing) features from an ExpressionSet.
Filtering features exhibiting little variation, or a consistently low
signal, across samples can be advantageous for
the subsequent data analysis (Bourgon et al.).
Furthermore, one may decide that there is little value in considering
features with insufficient annotation.
Usage
nsFilter(eset, require.entrez=TRUE,
    require.GOBP=FALSE, require.GOCC=FALSE,
    require.GOMF=FALSE, require.CytoBand=FALSE,
    remove.dupEntrez=TRUE, var.func=IQR,
    var.cutoff=0.5, var.filter=TRUE,
    filterByQuantile=TRUE, feature.exclude="^AFFX", ...)

varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)

featureFilter(eset, require.entrez=TRUE,
    require.GOBP=FALSE, require.GOCC=FALSE,
    require.GOMF=FALSE, require.CytoBand=FALSE,
    remove.dupEntrez=TRUE, feature.exclude="^AFFX")

Arguments
eset 
an ExpressionSet object
var.func 
The function used as the per-feature filtering statistic. This function should return a numeric vector of length one when given a numeric vector as input.
var.filter 
A logical indicating whether to perform
filtering based on var.func.
filterByQuantile 
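A logical indicating whether var.cutoff is to be interpreted as a quantile of all var.func values (the default) or as an absolute value.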
var.cutoff 
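A numeric value. If var.filter is TRUE, features whose value of var.func is less than either the var.cutoff quantile of all var.func values (if filterByQuantile is TRUE) or var.cutoff itself (if filterByQuantile is FALSE) will be removed.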
require.entrez 
If TRUE, filter out features without an Entrez Gene ID annotation.
require.GOBP, require.GOCC, require.GOMF 
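If TRUE, filter out features whose target genes are not annotated to at least one GO term in the BP, CC or MF ontology, respectively.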
require.CytoBand 
If TRUE, filter out features whose target genes have no mapping to cytoband locations.
remove.dupEntrez 
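If TRUE and several features map to the same Entrez Gene ID, then the feature with the largest value of var.func is retained and the others are removed.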
feature.exclude 
A character vector of regular expressions.
Feature identifiers (i.e. value of featureNames(eset)) that match one of the specified patterns will be filtered out.
... 
Unused, but available for specializing methods. 
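The requirement on var.func stated above (one numeric value returned for one numeric vector) can be checked before a custom statistic is passed in. The following is a minimal sketch in base R; the helper name is_valid_var_func and its test vector are hypothetical and are not part of genefilter.

```r
# Hypothetical helper (not part of genefilter): checks that a candidate
# filter statistic returns a numeric vector of length one.
is_valid_var_func <- function(f, x = c(1.2, 3.4, 2.2, 5.0)) {
  out <- f(x)
  is.numeric(out) && length(out) == 1L
}

is_valid_var_func(IQR)    # one value per feature: acceptable
is_valid_var_func(mad)    # one value per feature: acceptable
is_valid_var_func(sd)     # one value per feature: acceptable
is_valid_var_func(range)  # returns two values: not acceptable
```

Any function passing this check could in principle be supplied as var.func, for example var.func=mad.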
Details
This section explains the effect of filtering on type I error rate estimation and control in subsequent hypothesis testing. See also the paper by Bourgon et al.
Marginal type I errors:
Filtering on the basis of a statistic which is independent of the test
statistic used for detecting differential gene expression can increase
the detection rate at the same marginal type I error. This is
clearly the case for filter criteria that do not depend on the data,
such as the annotation based criteria provided by the nsFilter and
featureFilter functions. However, marginal type I error can
also be controlled for certain types of data-dependent criteria.
Call U^1 the stage 1 filter statistic: a function that is applied
feature by feature, that depends only on the data for that feature
and not on any other feature, and whose value determines whether the
feature passes to stage 2. Call U^2 the stage 2 test statistic for
differential expression.
Sufficient conditions for marginal type I error control are:

U^1 the overall (across all samples) variance or mean, U^2 the t-statistic (or any other scale and location invariant statistic), data normally distributed and exchangeable across samples.

U^1 the overall mean, U^2 the moderated t-statistic (as in limma's eBayes function), data normally distributed and exchangeable.

U^1 a sample-class label independent function (e.g. overall mean, median, variance, IQR), U^2 the Wilcoxon rank sum statistic, data exchangeable.
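The first of the conditions above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated standard normal values with no group effect, the group sizes and the 50% cutoff are arbitrary choices, and base R's t.test is used in place of any genefilter function.

```r
set.seed(1)
n_features  <- 5000
n_per_group <- 5
group <- factor(rep(c("A", "B"), each = n_per_group))

# Null data: normal, exchangeable across samples, no group effect.
x <- matrix(rnorm(n_features * 2 * n_per_group), nrow = n_features)

# Stage 1 (U^1): overall variance, computed without the group labels.
u1   <- apply(x, 1, var)
keep <- u1 > quantile(u1, 0.5)

# Stage 2 (U^2): equal-variance two-sample t-test per retained feature.
p <- apply(x[keep, , drop = FALSE], 1, function(row) {
  t.test(row[group == "A"], row[group == "B"], var.equal = TRUE)$p.value
})

# Fraction of (all null) retained features rejected at the 5% level.
type1_rate <- mean(p < 0.05)
type1_rate
```

If the filter and test statistics are independent under the null hypothesis, type1_rate should stay close to the nominal 0.05 even though half of the features were removed on the basis of the data.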
Experiment-wide type I error: Marginal type I error control provided by the conditions above is sufficient for control of the family-wise error rate (FWER). Note, however, that common false discovery rate (FDR) methods depend not only on the marginal behaviour of the test statistics under the null hypothesis, but also on their joint distribution. The joint distribution can be affected by filtering, even when this filtering leaves the marginal distributions of true-null test statistics unchanged. Filtering might, for example, change the correlation structure. In practice the effect is negligible in many cases, but this depends on the dataset and the filter used, and assessing it is the responsibility of the data analyst.
Annotation Based Filtering: Arguments require.entrez, require.GOBP,
require.GOCC, require.GOMF and require.CytoBand
filter based on available annotation data. The annotation
package is determined by calling annotation(eset).
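The logic of annotation based filtering can be sketched in base R. The annotation table below is hypothetical (the probe names and Entrez IDs are made up for illustration); in nsFilter the mapping comes from the annotation package named by annotation(eset), and the internal implementation may differ from this sketch.

```r
# Hypothetical annotation table, standing in for an annotation package.
annot <- data.frame(
  probe  = c("1000_at", "1001_at", "1002_f_at", "AFFX-BioB-5_at", "1003_s_at"),
  entrez = c("5595", NA, "1557", NA, "643"),
  stringsAsFactors = FALSE
)

require_entrez  <- TRUE     # mirrors require.entrez
feature_exclude <- "^AFFX"  # mirrors feature.exclude

keep <- rep(TRUE, nrow(annot))
# Drop features whose identifier matches any exclusion pattern.
for (pattern in feature_exclude) {
  keep <- keep & !grepl(pattern, annot$probe)
}
# Drop features that lack an Entrez Gene ID.
if (require_entrez) {
  keep <- keep & !is.na(annot$entrez)
}

annot$probe[keep]
```

Here the control probe and the probe without an Entrez ID are removed, leaving three features.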
Variance Based Filtering: The var.filter, var.func, var.cutoff
and filterByQuantile arguments
control numerical cutoff-based filtering.
Probes for which var.func returns NA are removed.
The default var.func is IQR, which we here define as
rowQ(eset, ceiling(0.75 * ncol(eset))) - rowQ(eset, floor(0.25 * ncol(eset)));
this choice is motivated by the observation that unexpressed genes are
detected most reliably through low variability of their features
across samples.
Additionally, IQR is robust to outliers (see note below). The
default var.cutoff is 0.5 and is motivated by a rule of
thumb that in many tissues only 40% of genes are expressed.
Please adapt this value to your data and question.
By default the numerical-filter cutoff is interpreted as a quantile, so with the default settings, 50% of the genes are filtered.
Variance filtering is performed last, so that
(if filterByQuantile=TRUE and remove.dupEntrez=TRUE) this final step
excludes precisely the var.cutoff fraction of the unique genes that
remain after all other filters have been applied.
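The order-statistic definition of IQR and the two interpretations of the cutoff can be sketched in base R on a plain matrix. The helper row_order_stat is hypothetical and stands in for rowQ, assuming that rowQ returns the k-th smallest value of each row; the strict inequality used for the cutoff is likewise an assumption of this sketch.

```r
# Hypothetical stand-in for rowQ(): the k-th smallest value in each row.
row_order_stat <- function(m, k) apply(m, 1, function(r) sort(r)[k])

set.seed(42)
x <- matrix(rnorm(200 * 8), nrow = 200,
            dimnames = list(paste0("f", 1:200), NULL))

n   <- ncol(x)
iqr <- row_order_stat(x, ceiling(0.75 * n)) - row_order_stat(x, floor(0.25 * n))

var_cutoff <- 0.5
# filterByQuantile=TRUE: the cutoff is a quantile of the statistic.
keep_quantile <- iqr > quantile(iqr, var_cutoff)
# filterByQuantile=FALSE: the cutoff is an absolute value of the statistic.
keep_absolute <- iqr > var_cutoff

sum(keep_quantile)  # half of the 200 simulated features
sum(keep_absolute)  # depends on the scale of the data
```

With the quantile interpretation the number of retained features is fixed by var_cutoff, whereas with the absolute interpretation it depends on the measurement scale.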
The standalone function varFilter does only var.func based filtering
(and no annotation based filtering).
featureFilter does only
annotation based filtering and duplicate removal; it always
performs duplicate removal to retain the highest-IQR
probe for each gene.
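Duplicate removal that retains the highest-IQR probe per gene can be sketched as follows. The table, its column names and its values are hypothetical; this illustrates the idea rather than the code used inside featureFilter.

```r
# Hypothetical probe-to-gene table with a precomputed per-probe IQR.
probes <- data.frame(
  probe  = c("p1", "p2", "p3", "p4", "p5"),
  entrez = c("10", "10", "20", "30", "30"),
  iqr    = c(0.8, 1.5, 0.4, 2.1, 0.9),
  stringsAsFactors = FALSE
)

# Order by decreasing IQR, then keep the first probe seen for each gene.
ordered       <- probes[order(probes$iqr, decreasing = TRUE), ]
unique_probes <- ordered[!duplicated(ordered$entrez), ]

sort(unique_probes$probe)
```

For gene 10 the probe p2 is kept over p1, and for gene 30 the probe p4 is kept over p5.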
Value
For nsFilter, a list consisting of:
eset 
the filtered ExpressionSet
filter.log 
a list giving details of how many probe sets were removed for each filtering step performed.
For both varFilter and featureFilter, the filtered ExpressionSet.
Note
IQR is a reasonable variance-filter choice when the dataset
is split into two roughly equal and relatively homogeneous phenotype
groups. If your dataset has important groups smaller than 25% of the
overall sample size, or if you are interested in unusual
individual-level patterns, then IQR may not be sensitive enough
for your needs. In such cases, you should consider using less robust
and more sensitive measures of variance (the simplest of which would
be sd).
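The point about small groups can be seen in an idealized toy example, using base R's IQR and sd on made-up values: when only 15% of the samples differ from the rest, the interquartile range does not register them at all, while the standard deviation does.

```r
# 20 samples: 17 at baseline, 3 (15% of the samples) strongly shifted.
small_group <- c(rep(0, 17), rep(10, 3))

# Both quartiles fall among the 17 baseline samples, so the IQR is zero
# and this feature would look uninformative to an IQR-based filter.
IQR(small_group)

# The standard deviation responds to the three shifted samples.
sd(small_group)
```

In such a setting, passing var.func=sd would keep the feature in play where the default would likely discard it.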
Author(s)
Seth Falcon (somewhat revised by Assaf Oron)
References
R. Bourgon, R. Gentleman, W. Huber, Independent filtering increases power for detecting differentially expressed genes, Technical Report.
Examples
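A minimal sketch of typical usage follows. It assumes that the Bioconductor packages genefilter, Biobase and hgu95av2.db (the annotation package for Biobase's sample.ExpressionSet) are installed; the code is skipped when they are not.

```r
# Assumption: these Bioconductor packages are installed.
pkgs <- c("genefilter", "Biobase", "hgu95av2.db")
have_pkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs) {
  library(genefilter)
  library(Biobase)
  data(sample.ExpressionSet)

  # Default settings: annotation based and variance based filtering.
  ans <- nsFilter(sample.ExpressionSet)
  print(ans$eset)
  print(ans$filter.log)

  # Skip variance based filtering.
  ans_novar <- nsFilter(sample.ExpressionSet, var.filter = FALSE)

  # Standalone functions: each returns a filtered ExpressionSet.
  a1 <- varFilter(sample.ExpressionSet)
  a2 <- featureFilter(sample.ExpressionSet)
}
```

The filter.log element is the quickest way to see which filtering step removed how many features.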