getBestTest: Get the best test in a cluster
In LTLA/csaw: ChIP-Seq Analysis with Windows

getBestTest

R Documentation

Description

Find the test with the greatest significance or the highest abundance in each cluster.

Usage

getBestTest(
  ids,
  tab,
  by.pval = TRUE,
  weights = NULL,
  pval.col = NULL,
  fc.col = NULL,
  fc.threshold = 0.05,
  cpm.col = NULL
)

Arguments

`ids`	An integer vector or factor containing the cluster ID for each test.
`tab`	A data.frame of results with `PValue` and at least one `logFC` field for each test.
`by.pval`	Logical scalar indicating whether the best test should be selected on the basis of the smallest p-value. If `FALSE`, the best test is defined as that with the highest abundance.
`weights`	A numeric vector of weights for each test. Defaults to 1 for all tests.
`pval.col`	An integer scalar or string specifying the column of `tab` containing the p-values. Defaults to `"PValue"`.
`fc.col`	An integer or character vector specifying the columns of `tab` containing the log-fold changes. Defaults to all columns in `tab` starting with `"logFC"`.
`fc.threshold`	A numeric scalar specifying the FDR threshold to use within each cluster for counting tests changing in each direction, see `?"cluster-direction"` for more details.
`cpm.col`	An integer scalar or string specifying the column of `tab` containing the log-CPM values. Defaults to `"logCPM"`.

Details

Each cluster is defined as a set of tests with the same value of `ids` (any `NA` values are ignored). If `by.pval=TRUE`, this function identifies the test with the lowest p-value as that with the strongest evidence against the null in each cluster. The p-value of the chosen test is adjusted using the (Holm-)Bonferroni correction, based on the total number of tests in the parent cluster. This is necessary to obtain strong control of the family-wise error rate such that the best test can be taken from each cluster for further consideration.
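The adjustment described above can be illustrated with a minimal base-R sketch. This is not the code used inside `getBestTest`; the toy vectors `ids` and `p` and the helper names `best.p` and `best.idx` are made up for illustration. It assumes the unweighted case, where the adjusted p-value of the best test is the minimum p-value multiplied by the number of tests in the cluster, capped at 1.

```r
# Toy data: five tests in two clusters, plus one test with a missing cluster ID.
ids <- c(1, 1, 1, 2, 2, NA)
p   <- c(0.01, 0.20, 0.50, 0.03, 0.04, 0.001)

# Drop tests without a cluster assignment, mirroring how NA IDs are ignored.
keep <- !is.na(ids)

# Bonferroni-adjusted minimum p-value per cluster: n * min(p), capped at 1.
best.p <- tapply(p[keep], ids[keep], function(x) min(1, length(x) * min(x)))

# Index (into the original vectors) of the lowest p-value in each cluster.
best.idx <- tapply(which(keep), ids[keep], function(i) i[which.min(p[i])])

print(best.p)
print(best.idx)
```

The test with the missing ID has the smallest p-value overall, yet it contributes to neither cluster.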

The importance of each window in each cluster can be adjusted by supplying different relative values in `weights`. Each weight is interpreted as a different threshold for each test in the cluster using the weighted Holm procedure. Larger weights correspond to less stringent thresholds, i.e., less evidence is needed to reject the null for tests deemed to be more important. This may be useful for upweighting particular tests, such as those for windows containing a motif for the TF of interest.
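The effect of weighting can be sketched in base R as follows. This is an illustration of the first step of a weighted Holm procedure under the assumption that each p-value is scaled by `sum(w) / w`; it is not taken from the csaw source, and `weighted_min_p` is a hypothetical helper name.

```r
# Hypothetical helper: weighted Bonferroni-style adjustment of the best test.
weighted_min_p <- function(p, w) {
    # Scale each p-value by sum(w) / w, so that a larger weight
    # yields a smaller adjusted p-value for that test.
    adj <- p * sum(w) / w
    min(1, min(adj))
}

p <- c(0.02, 0.03, 0.40)

# Equal weights reduce to the ordinary correction: 3 * 0.02.
equal <- weighted_min_p(p, w = c(1, 1, 1))

# Upweighting the second test makes it the best test in the cluster.
upweighted <- weighted_min_p(p, w = c(1, 4, 1))

print(c(equal = equal, upweighted = upweighted))
```

With equal weights the first test determines the cluster p-value; once the second test is upweighted, its adjusted p-value becomes the smallest.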

Note the difference between this function and `combineTests`. The latter presents evidence for any rejections within a cluster. This function specifies the exact location of the rejection in the cluster, which may be more useful in some cases but comes at the cost of being more conservative. In both cases, clustering procedures such as `mergeWindows` can be used to identify the cluster.
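The extra conservativeness can be seen in a small base-R comparison. The sketch assumes that `combineTests` combines p-values with Simes' method, as its own help page describes; both calculations below are hand-written illustrations on made-up p-values rather than calls into csaw.

```r
# Made-up p-values for four tests in one cluster.
p <- c(0.010, 0.012, 0.015, 0.020)
n <- length(p)

# Simes' combined p-value: min over i of n * p_(i) / i.
simes <- min(n * sort(p) / seq_len(n))

# Bonferroni-adjusted minimum p-value, as used for the best test.
bonf.min <- min(1, n * min(p))

print(c(simes = simes, bonf.min = bonf.min))
```

The Simes value can never exceed the Bonferroni-adjusted minimum, since the latter is the `i = 1` term of the former. When several tests in a cluster have similarly low p-values, as here, the gap between the two can be substantial.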

If `by.pval=FALSE`, the best test is defined as that with the highest log-CPM value. This should be independent of the p-value, so no adjustment is necessary. Weights are not applied here. This mode may be useful when abundance is correlated with rejection under the alternative hypothesis, e.g., picking high-abundance regions that are more likely to contain peaks.
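Selection by abundance can be sketched in base R as below. The vectors are invented for illustration, and the code only mirrors the description above (pick the highest log-CPM per cluster and report that test's p-value unadjusted); it is not the package's implementation.

```r
# Toy data: five tests in two clusters.
ids    <- c(1, 1, 1, 2, 2)
logCPM <- c(2.5, 6.1, 4.0, 3.3, 1.2)
p      <- c(0.01, 0.30, 0.05, 0.20, 0.002)

# Index of the most abundant test in each cluster.
top.idx <- tapply(seq_along(ids), ids, function(i) i[which.max(logCPM[i])])

# The p-value of that test is reported as-is, without adjustment.
top.p <- p[as.integer(top.idx)]

print(top.idx)
print(top.p)
```

Note that the chosen test need not have the lowest p-value in its cluster, which is exactly why no multiplicity correction is applied in this mode.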

To obtain `ids`, a simple clustering approach for genomic windows is implemented in `mergeWindows`. However, anything can be used so long as it is independent of the p-values and does not compromise type I error control, e.g., promoters, gene bodies, independently called peaks. Any tests with `NA` values for `ids` will be ignored.

Value

A DataFrame with one row per cluster and various fields:

An integer field `num.tests`, specifying the total number of tests in each cluster.
Two integer fields `num.up.*` and `num.down.*` for each log-FC column in `tab`, containing the number of tests with log-FCs significantly greater or less than 0, respectively. See `?"cluster-direction"` for more details.
A numeric field containing the cluster-level p-value. If `pval.col=NULL`, this column is named `PValue`; otherwise its name is set to `colnames(tab[,pval.col])`.
A numeric field `FDR`, containing the BH-adjusted cluster-level p-value.
A character field `direction` (if `fc.col` is of length 1), specifying the dominant direction of change for tests in each cluster. See `?"cluster-direction"` for more details.
One integer field `rep.test` containing the row index (for `tab`) of a representative test for each cluster. See `?"cluster-direction"` for more details.
One numeric field `rep.*` for each log-FC column in `tab`, containing a representative log-fold change for the differential tests in the cluster. See `?"cluster-direction"` for more details.

Each row is named according to the ID of the corresponding cluster.
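The relationship between the cluster-level p-value and the `FDR` field can be illustrated with base R's `p.adjust`. The named vector `cluster.p` is a made-up stand-in for the p-value column of the output, and the 0.05 cutoff is only an example.

```r
# Made-up cluster-level p-values, named by cluster ID.
cluster.p <- c(a = 0.001, b = 0.02, c = 0.03, d = 0.5)

# Benjamini-Hochberg adjustment across clusters.
fdr <- p.adjust(cluster.p, method = "BH")

# Clusters passing an example FDR cutoff of 5%.
sig <- names(fdr)[fdr <= 0.05]

print(fdr)
print(sig)
```

The adjustment is applied across clusters rather than across individual tests, so the FDR is controlled at the cluster level.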

Author(s)

Aaron Lun

Examples

library(csaw)

# Simulate 100 tests assigned to clusters, with log-FCs, abundances and p-values.
ids <- round(runif(100, 1, 10))
tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))
best <- getBestTest(ids, tab)
head(best)

# Specifying the abundance and p-value columns by name.
best <- getBestTest(ids, tab, cpm.col="logCPM", pval.col="PValue")
head(best)

# With window weighting.
w <- round(runif(100, 1, 5))
best <- getBestTest(ids, tab, weights=w)
head(best)

# By logCPM.
best <- getBestTest(ids, tab, by.pval=FALSE)
head(best)

# Specifying the abundance and p-value columns by index.
best <- getBestTest(ids, tab, by.pval=FALSE, cpm.col=2, pval.col=3)
head(best)

LTLA/csaw documentation built on Jan. 30, 2025, 8:21 p.m.