R/cluster-direction.R
In csaw: ChIP-Seq Analysis with Windows

#' Reporting cluster-level direction in \pkg{csaw}
#'
#' An overview of the strategies used to obtain cluster-level summaries of the direction of change,
#' based on the directionality information of individual tests.
#' This is relevant to all functions that aggregate per-test statistics into a per-cluster summary,
#' e.g., \code{\link{combineTests}}, \code{\link{minimalTests}}.
#' It assumes that there are zero, one or many columns of log-fold changes in the data.frame of per-test statistics,
#' typically specified using a \code{fc.col} argument.
#'
#' @section Counting the per-test directions:
#' For each cluster, we will report the number of tests that are up (positive values) or down (negative) for each column of log-fold change values listed in \code{fc.col}.
#' This provides some indication of whether the change is generally positive or negative - or both - across tests in the cluster.
#' If a cluster contains non-negligible numbers of both up and down tests, this indicates that there may be a complex differential event within that cluster (see comments in \code{\link{mixedTests}}).
#'
#' To count up/down tests, we apply a multiple testing correction to the p-values \emph{within} each cluster.
#' Only the tests with adjusted p-values no greater than \code{fc.threshold} are counted as being up or down.
#' We can interpret this as a test of conditional significance; assuming that the cluster is interesting (i.e., contains at least one true positive), what is the distribution of the signs of the changes within that cluster?
#' Note that this procedure has no bearing on the p-value reported for the cluster itself.
#'
#' The nature of the per-test correction within each cluster varies with each function.
#' In most cases, there is a per-test correction that naturally accompanies the per-cluster p-value:
#' \itemize{
#' \item For \code{\link{combineTests}}, the Benjamini-Hochberg correction is used.
#' \item For \code{\link{minimalTests}}, the Holm correction is used. 
#' \item For \code{\link{getBestTest}} with \code{by.pval=TRUE}, the Holm correction is used.
#' The Bonferroni correction could also be used here, but Holm is uniformly more powerful, so it is used instead.
#' \item For \code{\link{getBestTest}} with \code{by.pval=FALSE}, 
#' all tests bar the one with the highest abundance are simply ignored,
#' which mimics the application of an independent filter.
#' No correction is applied as only one test remains.
#' \item For \code{\link{mixedTests}} and \code{\link{empiricalFDR}}, the Benjamini-Hochberg correction is used, 
#' given that both functions just call \code{\link{combineTests}} on the one-sided p-values in each direction.
#' Here, the number of up tests is obtained using the one-sided p-values for a positive change;
#' similarly, the number of down tests is obtained using the one-sided p-values for a negative change.
#' }
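#'
#' As a minimal sketch of the counting step for a single cluster (base R only, not the package's internal code),
#' consider the Benjamini-Hochberg case.
#' The vectors \code{pval} and \code{logFC} are hypothetical per-test statistics with illustrative values,
#' and 0.05 is an assumed value for \code{fc.threshold}.
#' \preformatted{# Hypothetical per-test statistics for one cluster (illustrative values only).
#' pval  <- c(0.001, 0.004, 0.02, 0.2, 0.6)
#' logFC <- c(1.2, 0.8, -1.5, 0.3, -0.1)
#' fc.threshold <- 0.05 # assumed threshold for this sketch
#'
#' # Correct the p-values within the cluster only.
#' adj <- p.adjust(pval, method="BH")
#'
#' # Count only the tests that pass the threshold, split by sign.
#' keep <- adj <= fc.threshold
#' num.up <- sum(keep & logFC > 0)
#' num.down <- sum(keep & logFC < 0)
#' }
#' Swapping \code{method="BH"} for \code{method="holm"} gives the analogous count under the Holm correction.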
#' 
#' @section Representative tests and their log-fold changes:
#' For each combining procedure, we identify a representative test for the entire cluster.
#' This is based on the observation that, in each method, 
#' there is often one test that is especially important for computing the cluster-level p-value.
#' \itemize{
#' \item For \code{\link{combineTests}}, the representative is the test with the lowest BH-adjusted p-value before enforcing monotonicity.
#' This is because the p-value for this test is directly used as the combined p-value in Simes' method.
#' \item For \code{\link{minimalTests}}, the test with the \eqn{x}th-smallest p-value is used as the representative.
#' See the function's documentation for the definition of \eqn{x}.
#' \item For \code{\link{getBestTest}} with \code{by.pval=TRUE}, the test with the lowest p-value is used.
#' \item For \code{\link{getBestTest}} with \code{by.pval=FALSE}, the test with the highest abundance is used.
#' \item For \code{\link{mixedTests}}, two representative tests are reported, one for each direction.
#' The representative test in each direction is defined using \code{\link{combineTests}} as described above.
#' \item For \code{\link{empiricalFDR}}, the test is chosen in the same manner as described for \code{\link{combineTests}}
#' after converting all p-values to their one-sided counterparts in the \dQuote{desirable} direction,
#' i.e., up tests when \code{neg.down=TRUE} and down tests otherwise.
#' }
#'
#' The index of the associated test is reported in the output as the \code{"rep.test"} field along with its log-fold changes.
#' For clusters with simple differences, the log-fold change for the representative is a good summary of the effect size for the cluster.
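#'
#' The choice of representative for Simes' method can be illustrated with a small base R sketch.
#' It assumes unweighted tests and a hypothetical vector \code{pval} for one cluster; it is not the package's internal code.
#' Note that the representative need not be the test with the smallest p-value.
#' \preformatted{# Hypothetical per-test p-values for one cluster (illustrative values only).
#' pval <- c(0.2, 0.011, 0.6, 0.012, 0.01)
#' n <- length(pval)
#'
#' # BH-adjusted values before enforcing monotonicity: p * n / rank.
#' raw.bh <- pval * n / rank(pval, ties.method="first")
#'
#' # The smallest of these is the Simes combined p-value,
#' # and the test that achieves it is taken as the representative.
#' rep.test <- which.min(raw.bh)
#' simes.p <- raw.bh[rep.test]
#' }
#' Here the fourth test is chosen even though the fifth test has the smallest p-value,
#' because the fourth has the smallest value of p * n / rank.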
#'
#' @section Determining the cluster-level direction:
#' When only one log-fold change column is specified, we will try to determine which direction contributes to the combined p-value.
#' This is done by tallying the directions of all tests with (weighted) p-values below that of the representative test.
#' If all of these tests have positive or negative log-fold changes, that cluster's direction is reported as \code{"up"} or \code{"down"} respectively; otherwise it is reported as \code{"mixed"}.
#' This is stored as the \code{"direction"} field in the returned data frame.
#' 
#' Assessing the contribution of per-test p-values to the cluster-level p-value is roughly equivalent to asking whether the latter would increase if all tests in one direction were assigned p-values of unity.
#' If there is an increase, then tests changing in that direction must contribute to the combined p-value calculations. 
#' In this manner, clusters are labelled based on whether their combined p-values are driven by tests with only positive, negative or mixed log-fold changes.
#' (Note that this interpretation is not completely correct for \code{\link{minimalTests}} due to equality effects from enforcing monotonicity in the Holm procedure, but this is of little practical consequence.)
#' 
#' Users should keep in mind that the label only describes the direction of change among the most significant tests in the cluster.
#' Clusters with complex differences may still be labelled as changing in only one direction, if the tests changing in one direction have much lower p-values than the tests changing in the other direction (even if both sets of p-values are significant).
#' More rigorous checks for mixed changes should be performed with \code{\link{mixedTests}}.
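#'
#' The tallying can be sketched in base R as follows.
#' This is an illustration under stated assumptions rather than the package's internal code:
#' \code{pval}, \code{logFC} and \code{rep.test} are hypothetical, the tests are unweighted,
#' and the comparison uses \code{<=} so that the representative itself is included.
#' \preformatted{# Hypothetical per-test statistics for one cluster (illustrative values only).
#' pval  <- c(0.2, 0.011, 0.6, 0.012, 0.01)
#' logFC <- c(-2.0, 1.1, -0.4, 0.9, 1.5)
#' rep.test <- 4 # index of the representative test, taken as given here
#'
#' # Tests whose p-values are no greater than that of the representative.
#' driving <- pval <= pval[rep.test]
#' signs <- sign(logFC[driving])
#'
#' # Label the cluster by the signs of those tests only.
#' direction <- if (all(signs > 0)) "up" else if (all(signs < 0)) "down" else "mixed"
#' }
#' In this sketch the cluster is labelled \code{"up"} although two of its tests have negative log-fold changes,
#' because those tests have larger p-values than the representative.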
#'
#' There are several functions for which the \code{"direction"} is set to a constant value:
#' \itemize{
#' \item For \code{\link{mixedTests}}, it is simply set to \code{"mixed"} for all clusters.
#' This reflects the fact that the reported p-value represents the evidence for mixed directionality in this function;
#' indeed, the field itself is simply reported for consistency, given that we already know we are looking for mixed clusters! 
#' \item For \code{\link{empiricalFDR}}, it is set to \code{"up"} when \code{neg.down=TRUE} and \code{"down"} otherwise.
#' This is because the empirical FDR describes the significance of changes in the desired direction.
#' }
#' 
#' @author Aaron Lun
#'
#' @seealso
#' \code{\link{combineTests}}, \code{\link{minimalTests}}, \code{\link{getBestTest}},
#' \code{\link{empiricalFDR}} and \code{\link{mixedTests}} for the functions that do the work.
#' @name cluster-direction
NULL