minimalTests: Require rejection of a minimal number of tests
In csaw: ChIP-Seq Analysis with Windows

Compute a p-value for each cluster based around the rejection of a minimal number or proportion of tests from that cluster.

minimalTests(
  ids,
  tab,
  min.sig.n = 3,
  min.sig.prop = 0.4,
  weights = NULL,
  pval.col = NULL,
  fc.col = NULL,
  fc.threshold = 0.05
)

`ids`	An integer vector or factor containing the cluster ID for each test.
`tab`	A data.frame of results with `PValue` and at least one `logFC` field for each test.
`min.sig.n`	Integer scalar containing the minimum number of significant tests required in each cluster.
`min.sig.prop`	Numeric scalar containing the minimum proportion of significant tests required in each cluster.
`weights`	A numeric vector of weights for each test. Defaults to 1 for all tests.
`pval.col`	An integer scalar or string specifying the column of `tab` containing the p-values. Defaults to `"PValue"`.
`fc.col`	An integer or character vector specifying the columns of `tab` containing the log-fold changes. Defaults to all columns in `tab` starting with `"logFC"`.
`fc.threshold`	A numeric scalar specifying the FDR threshold to use within each cluster for counting tests changing in each direction, see `?"cluster-direction"` for more details.

All tests with the same value of `ids` are used to define a single cluster. For each cluster, this function applies the Holm-Bonferroni correction to the p-values from all of its tests. It then takes the x-th smallest adjusted p-value as the cluster-level p-value, where x is the larger of `min.sig.n` and the product of `min.sig.prop` and the number of tests in the cluster. (If x is larger than the total number of tests, the largest per-test p-value is used instead.)
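The calculation described above can be sketched in base R for a single cluster. This is only an illustration, not the code used by csaw: `holm_min_sketch` is a hypothetical helper, rounding the product up with `ceiling` is an assumption, and capping x at the number of tests is one reading of the fallback described in the text.

```r
# Hypothetical helper, not part of csaw: cluster-level p-value for ONE cluster.
holm_min_sketch <- function(p, min.sig.n = 3, min.sig.prop = 0.4) {
  n <- length(p)
  # Holm-Bonferroni adjustment of the per-test p-values.
  adj <- p.adjust(p, method = "holm")
  # x is the larger of min.sig.n and min.sig.prop * n
  # (assumption: the product is rounded up to a whole number of tests).
  x <- max(min.sig.n, ceiling(min.sig.prop * n))
  # Assumption: if x exceeds the number of tests, fall back to the last rank.
  x <- min(x, n)
  # The x-th smallest adjusted p-value is the cluster-level p-value.
  sort(adj)[x]
}

p <- c(0.001, 0.01, 0.02, 0.5, 0.8)
holm_min_sketch(p)  # 0.06, the 3rd-smallest Holm-adjusted p-value
```

With `min.sig.n = 1` and `min.sig.prop = 0`, the sketch returns the smallest Holm-adjusted p-value, which is the Bonferroni-corrected minimum for the cluster.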

Here, a cluster can only achieve a low p-value if at least x tests also have low p-values. This favors clusters that exhibit consistent changes across all tests, which is useful for detecting, e.g., systematic increases in binding across a broad genomic region spanning many windows. By comparison, combineTests will detect a strong change in a small subinterval of a large region, which may not be of interest in some circumstances.

The importance of each test within a cluster can be adjusted by supplying different relative values in `weights`. This may be useful for downweighting low-confidence tests, e.g., those in repeat regions. In the weighted Holm procedure, weights are used to downscale the per-test p-values, effectively adjusting the distribution of per-test errors that contribute to family-wise errors. Note that these weights have no effect between clusters.
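As a rough illustration of how weights enter a Holm-style correction, the sketch below implements the textbook weighted Holm adjustment in base R. `weighted_holm_sketch` is a hypothetical name, and whether csaw computes its weighted p-values in exactly this way is an assumption; the point is only that each p-value is divided by its weight and that only relative weights matter.

```r
# Hypothetical helper, not part of csaw: textbook weighted Holm adjustment.
weighted_holm_sketch <- function(p, w = rep(1, length(p))) {
  # Order tests by weight-scaled p-value (larger weight = smaller scaled value).
  o <- order(p / w)
  ps <- p[o]
  ws <- w[o]
  # Total weight of the tests at this rank and all later ranks.
  remaining <- rev(cumsum(rev(ws)))
  # Step-down adjustment with monotonicity enforced, capped at 1.
  adj <- pmin(1, cummax(ps / ws * remaining))
  # Return adjusted p-values in the original order of the tests.
  out <- numeric(length(p))
  out[o] <- adj
  out
}

p <- c(0.001, 0.01, 0.02, 0.5, 0.8)
# Equal weights reproduce the ordinary Holm correction.
all.equal(weighted_holm_sketch(p), p.adjust(p, method = "holm"))
# Rescaling all weights by a constant changes nothing.
all.equal(weighted_holm_sketch(p, rep(2, 5)), weighted_holm_sketch(p))
```

Halving the weight of a single test (e.g., `w = c(1, 1, 0.5, 1, 1)`) increases that test's scaled p-value, so it contributes less to the cluster-level result.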

To obtain ids, a simple clustering approach for genomic windows is implemented in mergeWindows. However, anything can be used so long as it is independent of the p-values and does not compromise type I error control, e.g., promoters, gene bodies, independently called peaks. Any tests with NA values for ids will be ignored.

A DataFrame with one row per cluster and various fields:

An integer field num.tests, specifying the total number of tests in each cluster.
Two integer fields num.up.* and num.down.* for each log-FC column in tab, containing the number of tests with log-FCs significantly greater or less than 0, respectively. See ?"cluster-direction" for more details.
A numeric field containing the cluster-level p-value. If pval.col=NULL, this column is named PValue, otherwise it takes the name of the column of tab specified by pval.col.
A numeric field FDR, containing the BH-adjusted cluster-level p-value.
A character field direction (if fc.col is of length 1), specifying the dominant direction of change for tests in each cluster. See ?"cluster-direction" for more details.
One integer field rep.test containing the row index (for tab) of a representative test for each cluster. See ?"cluster-direction" for more details.
One numeric field rep.* for each log-FC column in tab, containing a representative log-fold change for the differential tests in the cluster. See ?"cluster-direction" for more details.

Each row is named according to the ID of the corresponding cluster.

Aaron Lun

Holm S (1979). A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65-70.

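library(csaw)
set.seed(100)  # make the simulated values reproducible

# Simulate cluster IDs and per-test results for 100 tests.
ids <- round(runif(100, 1, 10))
tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))
minimal <- minimalTests(ids, tab)
head(minimal)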