csaw: ChIP-Seq Analysis with Windows

Documented in combineTests

#' Combine statistics across multiple tests
#' 
#' Combines p-values across clustered tests using Simes' method to control the cluster-level FDR.
#' 
#' @param ids An integer vector or factor containing the cluster ID for each test.
#' @param tab A data.frame of results with \code{PValue} and at least one \code{logFC} field for each test.
#' @param weights A numeric vector of weights for each test. 
#' Defaults to 1 for all tests.
#' @param pval.col An integer scalar or string specifying the column of \code{tab} containing the p-values.
#' Defaults to \code{"PValue"}.
#' @param fc.col An integer or character vector specifying the columns of \code{tab} containing the log-fold changes.
#' Defaults to all columns in \code{tab} starting with \code{"logFC"}.
#' @param fc.threshold A numeric scalar specifying the FDR threshold to use \emph{within} each cluster for counting tests changing in each direction, see \code{?"\link{cluster-direction}"} for more details.
#' 
#' @return
#' A \linkS4class{DataFrame} with one row per cluster and various fields:
#' \itemize{
#' \item An integer field \code{num.tests}, specifying the total number of tests in each cluster.
#' \item Two integer fields \code{num.up.*} and \code{num.down.*} for each log-FC column in \code{tab}, containing the number of tests with log-FCs significantly greater or less than 0, respectively.
#' See \code{?"\link{cluster-direction}"} for more details.
#' \item A numeric field containing the cluster-level p-value. 
#' If \code{pval.col=NULL}, this column is named \code{PValue}, otherwise its name is set to \code{colnames(tab[,pval.col])}.
#' \item A numeric field \code{FDR}, containing the BH-adjusted cluster-level p-value.
#' \item A character field \code{direction} (if \code{fc.col} is of length 1), specifying the dominant direction of change for tests in each cluster.
#' See \code{?"\link{cluster-direction}"} for more details.
#' \item One integer field \code{rep.test} containing the row index (for \code{tab}) of a representative test for each cluster.
#' See \code{?"\link{cluster-direction}"} for more details.
#' \item One numeric field \code{rep.*} for each log-FC column in \code{tab}, containing a representative log-fold change for the differential tests in the cluster.
#' See \code{?"\link{cluster-direction}"} for more details.
#' }
#' Each row is named according to the ID of the corresponding cluster.
#' 
#' @details
#' All tests with the same value of \code{ids} are used to define a single cluster.
#' This function applies Simes' procedure to the per-test p-values to compute the combined p-value for each cluster,
#' which represents evidence against the global null hypothesis, i.e., all individual nulls are true in each cluster. 
#' The BH method is then applied to control the FDR across all clusters.
#' 
#' Rejection of the global null is more relevant than the significance of each individual test when multiple tests in a cluster represent parts of the same underlying event, e.g., differentially bound genomic regions consisting of clusters of windows.
#' Control of the FDR across tests may not equate to control of the FDR across clusters;
#' we ensure the latter by explicitly computing cluster-level p-values for use in the BH method.
#'
#' We use Simes' method as it is relatively relaxed and rejects the global null upon observing any change in the cluster.
#' More stringent methods are available in functions like \code{\link{minimalTests}} and \code{\link{getBestTest}}.
#'
#' The importance of each test within a cluster can be adjusted by supplying different relative \code{weights} values. 
#' This may be useful for downweighting low-confidence tests, e.g., those in repeat regions. 
#' In Simes' procedure, weights are interpreted as relative frequencies of the tests in each cluster. 
#' Note that these weights have no effect between clusters.
#' 
#' To obtain \code{ids}, a simple clustering approach for genomic windows is implemented in \code{\link{mergeWindows}}.
#' However, anything can be used so long as it is independent of the p-values and does not compromise type I error control, e.g., promoters, gene bodies, independently called peaks. 
#' Any tests with \code{NA} values for \code{ids} will be ignored.
#' 
#' @examples
#' ids <- round(runif(100, 1, 10))
#' tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))
#' combined <- combineTests(ids, tab)
#' head(combined)
#' 
#' # With window weighting.
#' w <- round(runif(100, 1, 5))
#' combined <- combineTests(ids, tab, weights=w)
#' head(combined)
#' 
#' # With multiple log-FCs.
#' tab$logFC.whee <- rnorm(100, 5)
#' combined <- combineTests(ids, tab)
#' head(combined)
#' 
#' # Manual specification of column IDs.
#' combined <- combineTests(ids, tab, fc.col=c(1,4), pval.col=3)
#' head(combined)
#' 
#' combined <- combineTests(ids, tab, fc.col="logFC.whee", pval.col="PValue")
#' head(combined)
#' 
#' @seealso
#' \code{\link{groupedSimes}}, which does the heavy lifting.
#'
#' \code{\link{minimalTests}} and \code{\link{getBestTest}}, for another method of combining p-values for each cluster.
#'
#' \code{\link{mergeWindows}}, for one method of generating \code{ids}.
#' 
#' \code{\link{glmQLFTest}}, for one method of generating \code{tab}.
#' 
#' @author Aaron Lun
#' 
#' @references
#' Simes RJ (1986). 
#' An improved Bonferroni procedure for multiple tests of significance. 
#' \emph{Biometrika} 73, 751-754.
#' 
#' Benjamini Y and Hochberg Y (1995). 
#' Controlling the false discovery rate: a practical and powerful approach to multiple testing. 
#' \emph{J. R. Stat. Soc. Series B} 57, 289-300. 
#' 
#' Benjamini Y and Hochberg Y (1997). 
#' Multiple hypotheses testing with weights. 
#' \emph{Scand. J. Stat.} 24, 407-418.
#' 
#' Lun ATL and Smyth GK (2014). 
#' De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. 
#' \emph{Nucleic Acids Res.} 42, e95
#' 
#' @keywords testing
#'
#' @export
#' @importFrom metapod groupedSimes
combineTests <- function(ids, tab, weights=NULL, pval.col=NULL, fc.col=NULL, fc.threshold=0.05) {
    .general_test_combiner(ids=ids, tab=tab, weights=weights, 
        pval.col=pval.col, fc.col=fc.col, fc.threshold=fc.threshold,
        FUN=groupedSimes, count.correct="BH")
}

#' @importFrom S4Vectors DataFrame
#' @importFrom BiocGenerics cbind
#' @importFrom stats p.adjust
#' @importFrom metapod countGroupedDirection summarizeGroupedDirection
.general_test_combiner <- function(ids, tab, weights=NULL, pval.col=NULL, fc.col=NULL, fc.threshold=0.05, FUN, count.correct) {
	stopifnot(length(ids)==nrow(tab))
    if (!is.null(weights)) {
        stopifnot(length(ids)==length(weights))
    }

    if (!all(okay.ids <- !is.na(ids))) { 
        originals <- which(okay.ids)
        ids <- ids[okay.ids]
        weights <- weights[okay.ids]
        tab <- tab[okay.ids,,drop=FALSE]
    } else {
        originals <- seq_along(ids)
    }

    is.pval <- .getPValCol(pval.col, tab)
    p.vals <- tab[[is.pval]]
    p.out <- FUN(p.values=p.vals, grouping=ids, weights=weights)
    com.p <- p.out$p.value
    ref.names <- names(com.p)

    fc.col <- .parseFCcol(fc.col, tab) 
    fcs <- as.list(tab[fc.col])
    fc.out <- vector("list", length(fcs))
    for (f in seq_along(fcs)) {

        # NOTE: p.threshold=fc.threshold is not a bug.
        n.out <- countGroupedDirection(p.vals, ids, fcs[[f]], 
            p.threshold=fc.threshold, method=count.correct) 

        stopifnot(identical(names(n.out$up), ref.names))
        stopifnot(identical(names(n.out$down), ref.names))
        n.out <- n.out[c("up", "down")]
        names(n.out) <- sprintf("num.%s.%s", names(n.out), names(fcs)[f])
        fc.out[[f]] <- n.out
    }
    
    ntests <- table(ids)
    ntests <- ntests[ref.names]
    ntests <- as.integer(ntests)

    if (length(fc.out)) {
        fc.out <- unlist(fc.out, recursive=FALSE)
        combined <- DataFrame(num.tests=ntests, lapply(fc.out, unname), row.names=ref.names)
    } else {
        combined <- DataFrame(num.tests=ntests, row.names=ref.names)
    }
    com.p <- unname(com.p)
    combined[[colnames(tab)[is.pval]]] <- com.p
    combined$FDR <- p.adjust(com.p, method="BH")

    # Adding direction, if we can.
    if (length(fcs)==1L) {
        dir.out <- summarizeGroupedDirection(fcs[[1]], ids, p.out$influential) 
        stopifnot(identical(names(dir.out), ref.names))
        combined$direction <- unname(dir.out)
    }

    # Adding representative log-fold changes.
    combined$rep.test <- originals[p.out$representative]
    if (length(fc.col)!=0L) {
        combined <- cbind(combined, rep=tab[p.out$representative,fc.col,drop=FALSE])
    }

    combined
}