empiricalFDR: Control the empirical FDR

View source: R/empiricalFDR.R

empiricalFDR    R Documentation

Control the empirical FDR

Description

Control the empirical FDR across clusters for comparisons to negative controls, based on tests that are significant in the “wrong” direction.

Usage

empiricalFDR(
  ids,
  tab,
  weights = NULL,
  pval.col = NULL,
  fc.col = NULL,
  fc.threshold = 0.05,
  neg.down = TRUE
)

Arguments

ids

An integer vector or factor containing the cluster ID for each test.

tab

A data.frame of results with PValue and at least one logFC field for each test.

weights

A numeric vector of weights for each test. Defaults to 1 for all tests.

pval.col

An integer scalar or string specifying the column of tab containing the p-values. Defaults to "PValue".

fc.col

An integer or string specifying the single column of tab containing the log-fold change.

fc.threshold

A numeric scalar specifying the FDR threshold to use within each cluster for counting tests changing in each direction. See ?"cluster-direction" for more details.

neg.down

A logical scalar indicating if negative log-fold changes correspond to the “wrong” direction.

Details

Some experiments involve comparisons to negative controls where there should be no signal/binding. In such cases, genuine differences should only occur in one direction, i.e., up in the non-control samples. Thus, the number of significant tests that change in the wrong direction can be used as an estimate of the number of false positives.

This function converts two-sided p-values in tab[,pval.col] to one-sided counterparts in the wrong direction. It combines the one-sided p-values for each cluster using combineTests. The number of significant clusters at some p-value threshold represents the estimated number of false positive clusters.
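The conversion and combining steps can be illustrated with a minimal base R sketch. It assumes a symmetric two-sided test, so the one-sided p-value is half the two-sided value when the log-fold change lies in the tested direction and one minus that half otherwise. The helper names one_sided and simes_combine are hypothetical, and an unweighted Simes combination is used here as a simplified stand-in for combineTests; this is not the package's internal code.

```r
# Hypothetical helper: convert two-sided p-values to one-sided p-values
# for the requested direction, assuming a symmetric two-sided test.
one_sided <- function(pval, logfc, up = TRUE) {
    half <- pval / 2
    agrees <- if (up) logfc > 0 else logfc < 0
    ifelse(agrees, half, 1 - half)
}

# Hypothetical helper: unweighted Simes combined p-value for one cluster.
simes_combine <- function(p) {
    min(sort(p) * length(p) / seq_along(p))
}

# Toy data: four tests in two clusters.
pval  <- c(0.02, 0.40, 0.01, 0.80)
logfc <- c(1.5, -0.3, -2.0, 0.1)
ids   <- c(1, 1, 2, 2)

# With negative log-fold changes as the "wrong" direction (neg.down=TRUE).
p.wrong <- one_sided(pval, logfc, up = FALSE)
p.right <- one_sided(pval, logfc, up = TRUE)

# One combined p-value per cluster in each direction.
cluster.wrong <- tapply(p.wrong, ids, simes_combine)
cluster.right <- tapply(p.right, ids, simes_combine)
```

In this toy input, the second cluster is driven by a strongly negative log-fold change, so it receives a small combined p-value in the wrong direction and would count towards the estimated false positives at most thresholds.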

The same approach is applied to one-sided p-values in the right direction, where the number of detected clusters at the threshold represents the total number of discoveries. Dividing the number of false positives by the number of discoveries yields the empirical FDR at each p-value threshold. Monotonicity is enforced (i.e., the empirical FDR never increases as the p-value decreases), and the empirical FDR is capped at unity.
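The ratio described above can be sketched in base R as follows. The function name empirical_fdr_sketch and its arguments are hypothetical, and each cluster's right-direction p-value is assumed to serve as its own threshold; this illustrates the calculation and is not the package's implementation.

```r
# Hypothetical sketch: empirical FDR per cluster from combined one-sided
# p-values in the right (p.right) and wrong (p.wrong) directions.
empirical_fdr_sketch <- function(p.right, p.wrong) {
    # Discoveries: clusters with right-direction p-values at or below
    # each cluster's own p-value.
    n.disc <- rank(p.right, ties.method = "max")

    # Estimated false positives: clusters with wrong-direction p-values
    # at or below the same threshold.
    n.fp <- findInterval(p.right, sort(p.wrong))

    # The ratio is capped at unity.
    fdr <- pmin(1, n.fp / n.disc)

    # Enforce monotonicity: a smaller p-value never gets a larger FDR.
    o <- order(p.right, decreasing = TRUE)
    fdr[o] <- cummin(fdr[o])
    fdr
}

p.right <- c(0.1, 0.2, 0.3)
p.wrong <- c(0.05, 0.9, 0.95)
fdr <- empirical_fdr_sketch(p.right, p.wrong)
```

Here the raw ratios are 1, 1/2 and 1/3 for the three clusters. Enforcing monotonicity from the largest p-value downwards replaces the first two with 1/3, since the least significant cluster already achieves that value.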

The p-values specified in pval.col are assumed to have been computed from a two-sided test, where the distribution of p-values is the same regardless of the direction of the log-fold change (under both the null and alternative hypotheses). This rules out p-values computed from ANODEV, where multiple contrasts are tested at once, as well as p-values from methods that yield asymmetric p-value distributions, e.g., GLM-based TREAT.

Value

A DataFrame with one row per cluster and various fields:

  • An integer field num.tests, specifying the total number of tests in each cluster.

  • Two integer fields num.up.* and num.down.* for each log-FC column in tab, containing the number of tests with log-FCs significantly greater or less than 0, respectively. See ?"cluster-direction" for more details.

  • A numeric field containing the cluster-level p-value. If pval.col=NULL, this column is named PValue; otherwise it takes the name of the column of tab specified by pval.col.

  • A numeric field FDR, containing the empirical FDR corresponding to that cluster's p-value.

  • A character field direction (if fc.col is of length 1), specifying the dominant direction of change for tests in each cluster. See ?"cluster-direction" for more details.

  • One integer field rep.test containing the row index (for tab) of a representative test for each cluster. See ?"cluster-direction" for more details.

  • One numeric field rep.* for each log-FC column in tab, containing a representative log-fold change for the differential tests in the cluster. See ?"cluster-direction" for more details.

Each row is named according to the ID of the corresponding cluster.

Caution

Control of the empirical FDR is best used for very noisy data sets where the BH method is not adequate. The BH method only protects against statistical false positives under the null hypothesis that the log-fold change is zero. In contrast, the empirical FDR also protects against experimental false positives, caused by non-specific binding that yields uninteresting (but statistically significant) differential binding (DB).

The downside is that the empirical FDR calculation relies on the availability of a good estimate of the number of false positives. It also assumes that the distribution of p-values is the same for non-specific binding events in both directions (i.e., known events with negative log-FCs and unknown events among those with positive log-FCs). Even if the log-fold changes are symmetric around zero, this does not mean that the p-value distributions will be the same, due to differences in the size and number of libraries between control and ChIP samples.

In summary, the BH method in combineTests is more statistically rigorous and should be preferred for routine analyses.

Author(s)

Aaron Lun

References

Zhang Y et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, R137.

See Also

combineTests, used to combine the p-values in each direction.

Examples

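library(csaw)

# Simulate cluster IDs and per-test results for 100 tests.
ids <- round(runif(100, 1, 10))
tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))

# Compute cluster-level p-values and the empirical FDR.
empirical <- empiricalFDR(ids, tab)
head(empirical)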


LTLA/csaw documentation built on Dec. 21, 2024, 1:10 a.m.