R/combineTests.R

Defines functions .general_test_combiner combineTests

Documented in combineTests

#' Combine statistics across multiple tests
#' 
#' Combines p-values across clustered tests using Simes' method to control the cluster-level FDR.
#' 
#' @param ids An integer vector or factor containing the cluster ID for each test.
#' @param tab A data.frame of results with \code{PValue} and at least one \code{logFC} field for each test.
#' @param weights A numeric vector of weights for each test. 
#' Defaults to 1 for all tests.
#' @param pval.col An integer scalar or string specifying the column of \code{tab} containing the p-values.
#' Defaults to \code{"PValue"}.
#' @param fc.col An integer or character vector specifying the columns of \code{tab} containing the log-fold changes.
#' Defaults to all columns in \code{tab} starting with \code{"logFC"}.
#' @param fc.threshold A numeric scalar specifying the FDR threshold to use \emph{within} each cluster for counting tests changing in each direction, see \code{?"\link{cluster-direction}"} for more details.
#' 
#' @return
#' A \linkS4class{DataFrame} with one row per cluster and various fields:
#' \itemize{
#' \item An integer field \code{num.tests}, specifying the total number of tests in each cluster.
#' \item Two integer fields \code{num.up.*} and \code{num.down.*} for each log-FC column in \code{tab}, containing the number of tests with log-FCs significantly greater or less than 0, respectively.
#' See \code{?"\link{cluster-direction}"} for more details.
#' \item A numeric field containing the cluster-level p-value. 
#' If \code{pval.col=NULL}, this column is named \code{PValue}, otherwise its name is set to \code{colnames(tab[,pval.col])}.
#' \item A numeric field \code{FDR}, containing the BH-adjusted cluster-level p-value.
#' \item A character field \code{direction} (if \code{fc.col} is of length 1), specifying the dominant direction of change for tests in each cluster.
#' See \code{?"\link{cluster-direction}"} for more details.
#' \item One integer field \code{rep.test} containing the row index (for \code{tab}) of a representative test for each cluster.
#' See \code{?"\link{cluster-direction}"} for more details.
#' \item One numeric field \code{rep.*} for each log-FC column in \code{tab}, containing a representative log-fold change for the differential tests in the cluster.
#' See \code{?"\link{cluster-direction}"} for more details.
#' }
#' Each row is named according to the ID of the corresponding cluster.
#' 
#' @details
#' All tests with the same value of \code{ids} are used to define a single cluster.
#' This function applies Simes' procedure to the per-test p-values to compute the combined p-value for each cluster,
#' which represents evidence against the global null hypothesis, i.e., all individual nulls are true in each cluster. 
#' The BH method is then applied to control the FDR across all clusters.
#' 
#' Rejection of the global null is more relevant than the significance of each individual test when multiple tests in a cluster represent parts of the same underlying event, e.g., differentially bound genomic regions consisting of clusters of windows.
#' Control of the FDR across tests may not equate to control of the FDR across clusters;
#' we ensure the latter by explicitly computing cluster-level p-values for use in the BH method.
#'
#' We use Simes' method as it is relatively relaxed and rejects the global null upon observing any change in the cluster.
#' More stringent methods are available in functions like \code{\link{minimalTests}} and \code{\link{getBestTest}}.
#'
#' The importance of each test within a cluster can be adjusted by supplying different relative \code{weights} values. 
#' This may be useful for downweighting low-confidence tests, e.g., those in repeat regions. 
#' In Simes' procedure, weights are interpreted as relative frequencies of the tests in each cluster. 
#' Note that these weights have no effect between clusters.
#' 
#' To obtain \code{ids}, a simple clustering approach for genomic windows is implemented in \code{\link{mergeWindows}}.
#' However, anything can be used so long as it is independent of the p-values and does not compromise type I error control, e.g., promoters, gene bodies, independently called peaks. 
#' Any tests with \code{NA} values for \code{ids} will be ignored.
#' 
#' @examples
#' ids <- round(runif(100, 1, 10))
#' tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))
#' combined <- combineTests(ids, tab)
#' head(combined)
#' 
#' # With window weighting.
#' w <- round(runif(100, 1, 5))
#' combined <- combineTests(ids, tab, weights=w)
#' head(combined)
#' 
#' # With multiple log-FCs.
#' tab$logFC.whee <- rnorm(100, 5)
#' combined <- combineTests(ids, tab)
#' head(combined)
#' 
#' # Manual specification of column IDs.
#' combined <- combineTests(ids, tab, fc.col=c(1,4), pval.col=3)
#' head(combined)
#' 
#' combined <- combineTests(ids, tab, fc.col="logFC.whee", pval.col="PValue")
#' head(combined)
#' 
#' @seealso
#' \code{\link{mergeWindows}}, for one method of generating \code{ids}.
#' 
#' \code{\link{glmQLFTest}}, for one method of generating \code{tab}.
#' 
#' @author Aaron Lun
#' 
#' @references
#' Simes RJ (1986). 
#' An improved Bonferroni procedure for multiple tests of significance. 
#' \emph{Biometrika} 73, 751-754.
#' 
#' Benjamini Y and Hochberg Y (1995). 
#' Controlling the false discovery rate: a practical and powerful approach to multiple testing. 
#' \emph{J. R. Stat. Soc. Series B} 57, 289-300. 
#' 
#' Benjamini Y and Hochberg Y (1997). 
#' Multiple hypotheses testing with weights. 
#' \emph{Scand. J. Stat.} 24, 407-418.
#' 
#' Lun ATL and Smyth GK (2014). 
#' De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. 
#' \emph{Nucleic Acids Res.} 42, e95
#' 
#' @keywords testing
#'
#' @export
combineTests <- function(ids, tab, weights=NULL, pval.col=NULL, fc.col=NULL, fc.threshold=0.05) {
    .general_test_combiner(ids=ids, tab=tab, weights=weights, 
        pval.col=pval.col, fc.col=fc.col, fc.threshold=fc.threshold,
        FUN=function(fcs, p, ids, weights, threshold, tab) {
            .Call(cxx_compute_cluster_simes, fcs, p, ids, weights, threshold)
        }
    )
}

#' @importFrom S4Vectors DataFrame
#' @importFrom BiocGenerics cbind
#' @importFrom stats p.adjust
.general_test_combiner <- function(ids, tab, weights=NULL, pval.col=NULL, fc.col=NULL, fc.threshold=0.05, FUN) {
    input <- .check_test_inputs(ids, tab, weights)
    ids <- input$ids
    tab <- input$tab
    groups <- input$groups
    weights <- input$weight
 
    fc.col <- .parseFCcol(fc.col, tab) 
    is.pval <- .getPValCol(pval.col, tab)
    out <- FUN(fcs=as.list(tab[fc.col]), p=tab[[is.pval]], ids=ids, weights=weights, threshold=fc.threshold, tab=tab)

    names(out[[2]]) <- sprintf("num.%s.%s", c("up", "down"), rep(colnames(tab)[fc.col], each=2))
    combined <- DataFrame(num.tests=out[[1]], out[[2]], row.names=groups)
    combined[[colnames(tab)[is.pval]]] <- out[[3]]
    combined$FDR <- p.adjust(out[[3]], method="BH")

    # Adding direction.
    if (length(fc.col)==1L) {
        labels <- c("mixed", "up", "down")
        combined$direction <- labels[out[[4]] + 1L]
    }

    # Adding representative log-fold changes.
    combined$rep.test <- input$original[out[[5]]+1L]
    if (length(fc.col)!=0L) {
        combined <- cbind(combined, rep=tab[out[[5]] + 1L,fc.col,drop=FALSE])
    }

    combined
}

Try the csaw package in your browser

Any scripts or data that you put into this service are public.

csaw documentation built on Nov. 12, 2020, 2:03 a.m.