clusterFDR: Compute the cluster-level FDR
In csaw: ChIP-Seq Analysis with Windows

Description Usage Arguments Value Definition of the cluster-level FDR Searching for the best threshold Note about weights Author(s) See Also Examples

View source: R/clusterFDR.R

Compute the FDR across clusters based on the test-level FDR threshold

clusterFDR(ids, threshold, weights=NULL)

controlClusterFDR(target, adjp, FUN, ..., weights=NULL, 
    grid.length=21, iterations=4)

`ids`	An integer vector of cluster IDs for each significant test below `threshold`.
`threshold`	A numeric scalar, specifying the FDR threshold used to define the significant tests.
`target`	A numeric scalar specifying the desired cluster-level FDR threshold.
`adjp`	A numeric vector of window-level adjusted p-values.
`FUN`	A clustering function that takes a logical vector indicating which windows are significant, and returns an integer vector of cluster IDs (see below).
`...`	Additional arguments to be passed to `FUN`.
`weights`	A numeric vector of frequency weights, for internal use.
`grid.length`	Integer scalar specifying the number of points to use in the grid search.
`iterations`	Integer scalar specifying the number of iterations of the grid search.

For clusterFDR, a numeric scalar is returned as the cluster-level FDR.

For controlClusterFDR, a list is returned containing two numeric scalars – threshold, the window-level FDR threshold to control the cluster-level FDR near target; and FDR, the estimate of the cluster-level FDR corresponding to threshold.

The clusterFDR function computes an informal estimate of the cluster-level FDR, where each cluster is formed by aggregating only significant tests. In the context of ChIP-seq, each significant test refers to a DB window that is detected at a FDR below threshold. The idea is to obtain an error rate while reporting the precise coordinates of a DB subinterval in a complex region.

This complements the standard pipeline based on combineTests, which defines regions independently of the DB status of the windows. In a complex region, the precise coordinates of the DB subinterval cannot be reported. Here, we overcome this by clustering directly on DB windows and applying post-hoc control of the cluster-level FDR. See clusterWindows for more details.

The cluster-level FDR is defined as the proportion of reported clusters that have no true positives. Simply using threshold on the window-level adjusted p-values is not sufficient to control this, as the cluster- and window-level FDRs are not equivalent. Instead, the observed number of false positive tests is estimated based on threshold and the total number of significant tests, and a conservative estimate for the number of false positive clusters (where all tests are true nulls) is computed.

However, note that the calculation of the cluster-level FDR here is not statistically rigorous. This is not guaranteed to be an upper bound, especially with few or correlated tests. Thus, users should use the standard combineTests-based pipeline wherever possible. Clustering on significant windows should only be performed where the precise coordinates of the DB subinterval are important for interpretation.

controlClusterFDR will identify the window-level FDR threshold required to control the cluster-level FDR at target. The former is not a simple function of the latter (neither continuous nor guaranteed to be monotonic), so a grid search is used. Clusters of significant windows are identified at each window-level threshold, and the corresponding cluster-level FDR is computed with clusterFDR.

The grid is initially defined with grid.length equally spaced points in [0, target]. At each iteration, the grid points with cluster-level FDRs above and below target are chosen, and the grid is redefined within that interval. This is repeated for iterations iterations, and the largest window-level threshold that achieves a cluster-level FDR below target is chosen.

The FUN argument should be a function that accepts a logical vector specifying significance, and returns an integer vector of cluster IDs. If, for example, it accepts an input vector ix, then the output should contain cluster IDs corresponding to the entries of which(ix). This is because cluster IDs are only defined for significant tests, given that only those tests are used for clustering.

A consequence of this search strategy is that the returned window-level FDR threshold will always be less than target. In other words, each window should be significantly DB on its own merits (i.e., after controlling the window-level FDR) before it is placed into a cluster. This protects against scenarios where very large thresholds yield low cluster-level FDRs, due to the formation of a few large clusters.

In both functions, the weights argument is assumed to contain frequency weights of significant tests/windows. For example, a weight of 2 for a test would be equivalent to repeating that test (i.e., repeating the same window so it shows up twice in your analysis). These weights should be the same as those used during weighted FDR control to compute adjusted p-values. In general, you should not set this argument unless you know what you're doing.

Aaron Lun

mergeWindows, combineTests, clusterWindows

# Setting up the windows and p-values.
set.seed(100)
windows <- GRanges("chrA", IRanges(1:1000, 1:1000))
test.p <- runif(1000)
test.p[c(1:10, 100:110, 220:240)] <- 0 # 3 significant subintervals.

# Defining significant windows.
threshold <- 0.05
is.sig <- p.adjust(test.p, method="BH") <= threshold

# Assuming that we only cluster significant windows.
merged <- mergeWindows(windows[is.sig], tol=0)
clusterFDR(merged$id, threshold)

# Setting up another example with more subintervals.
test.p <- runif(1000)
test.p[rep(1:2, 50) + rep(0:49, each=2) * 20] <- 0 
adj.p <- p.adjust(test.p, method="BH")
clusterFUN <- function(x) { mergeWindows(windows[x], tol=0)$id }
controlClusterFDR(0.05, adj.p, clusterFUN)