#' Search for aberrantly methylated regions
#'
#' @description
#' `getAMR` returns a `GRanges` object with aberrantly methylated
#' regions (AMRs / epimutations) for all samples in a data set.
#'
#' @details
#' In the provided data set, `getAMR` finds stretches of outlier beta values
#' to identify rare long-range methylation
#' aberrations (epimutations) in one or several samples. As a rule, methods for
#' differential methylation analysis rely on between-group comparisons ---
#' `getAMR` performs this comparison within-group, which is not only faster,
#' but also more sensitive. The logic of computations is described below.
#'
#' \subsection{Compute}{
#' This section describes computations that are performed in order to identify
#' individual outlier beta values --- but
#' \bold{before values are deemed as outliers}.
#' At the moment, the two supported methods are:
#'
#' \describe{
#' \item{"IQR"}{
#' When `compute=="IQR"`, for every genomic location (CpG) in
#' `data.ranges` the IQR-normalized deviation from the median value is
#' calculated using the following formula:
#' \deqn{xIQR_i=\frac{x_i-M}{IQR}}
#' where \eqn{x_i} is a beta value for i-th sample,
#' \eqn{M} and \eqn{IQR} are a median and an interquartile range
#' of all beta values at this genomic location, respectively.
#' }
#' \item{"beta+binom"}{
#' When `compute=="beta+binom"`, for every genomic location (CpG),
#' `getAMR` will estimate the probability of each beta value to occur.
#' For all beta values inside the open \eqn{(0,1)} interval, beta
#' distribution is used. For all \eqn{\{0;1\}} endpoint (extreme) values
#' where beta distribution is not defined, binomial probability is
#' calculated using coverage data (supplied using `data.coverage`
#' parameter).
#'
#' Alpha \eqn{{\alpha}} and beta \eqn{{\beta}} parameters of beta
#' distribution can be estimated using one of the following methods:
#'
#' \enumerate{
#' \item{
#' Method of moments (`compute.estimate="mom"`) based on
#' (both optionally weighted) mean and unbiased variance
#' \deqn{\text{sample mean} = \bar{x} = \frac{ \sum\limits_{i=1}^n w_i x_i}{\sum\limits_{i=1}^n w_i}}
#' \deqn{\text{unbiased variance} =\bar{v} = \frac {\sum\limits_{i=1}^N w_i (x_i - \bar{x})^2} {\sum_{i=1}^N w_i - (\sum_{i=1}^N w_i^2 / \sum_{i=1}^N w_i)} }
#' where \eqn{x_i} and \eqn{w_i} are a beta value and its reliability
#' weight for i-th sample.
#' \deqn{\hat{\alpha} = \bar{x} \left(\frac{\bar{x} (1 - \bar{x})}{\bar{v}} - 1 \right)}
#' \deqn{\hat{\beta} = (1-\bar{x}) \left(\frac{\bar{x} (1 - \bar{x})}{\bar{v}} - 1 \right)}
#' Note: all beta values from the closed \eqn{[0,1]} interval are used
#' for calculation of arithmetic mean value.
#' For more details on used formulas, visit
#' \href{https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition}{Weighted arithmetic mean},
#' \href{https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights}{Weighted variance with reliability weights},
#' and \href{https://en.wikipedia.org/wiki/Beta_distribution#Method_of_moments}{Method of moments for beta distribution}
#' }
#' \item{
#' Approximate maximum likelihood (`compute.estimate="amle"`) based on
#' (both optionally weighted) geometric means of \eqn{x} and \eqn{(1-x)}
#' \deqn{ \hat{G}_x = \left(\prod_{i=1}^n x_i^{w_i}\right)^{1 / \sum_{i=1}^n w_i} }
#' \deqn{ \hat{G}_{(1-x)} = \left(\prod_{i=1}^n {(1-x_i)}^{w_i}\right)^{1 / \sum_{i=1}^n w_i} }
#' where \eqn{x_i} and \eqn{w_i} are a beta value and its reliability
#' weight for i-th sample.
#' \deqn{ \hat{\alpha}\approx \tfrac{1}{2} + \frac{\hat{G}_{x}}{2(1-\hat{G}_x-\hat{G}_{(1-x)})} }
#' \deqn{ \hat{\beta}\approx \tfrac{1}{2} + \frac{\hat{G}_{(1-x)}}{2(1-\hat{G}_x-\hat{G}_{(1-x)})} }
#' Note: only beta values from the open \eqn{(0,1)} interval are used
#' for calculation of geometric means.
#' For more details, visit
#' \href{https://en.wikipedia.org/wiki/Weighted_geometric_mean}{Weighted geometric mean},
#' and \href{https://en.wikipedia.org/wiki/Beta_distribution#Two_unknown_parameters_2}{Maximum likelihood for beta distribution}
#' }
#' \item{
#' And a numerical maximum likelihood (`compute.estimate="nmle"`) which
#' is yet to be implemented.
#' }
#' }
#'
#' After estimating parameters of beta distribution,
#' probabilities of observed beta values from the open \eqn{(0,1)} interval
#' are computed using regularized incomplete beta function
#' \eqn{I_x(\alpha, \beta)}. For more details, visit
#' \href{https://en.wikipedia.org/wiki/Beta_distribution#Cumulative_distribution_function}{Cumulative distribution function for beta distribution}.
#'
#' Probabilities of \eqn{\{0;1\}} endpoint values are computed using
#' mean value (either arithmetic for `compute.estimate="mom"` or geometric
#' for `compute.estimate="*mle"`) using the following formulas:
#' \deqn{{p^0}_i=(1-\bar{x})^k}
#' \deqn{{p^1}_i=\bar{x}^k}
#' where \eqn{{p^0}_i} and \eqn{{p^1}_i} are probabilities of
#' observing 0 or 1 for i-th sample, respectively,
#' \eqn{\bar{x}} is a mean of all beta values for this genomic
#' position, and \eqn{k} is a sequencing coverage of this genomic
#' position for i-th sample.
#' }
#' }
#'
#'
#' }
#'
#' \subsection{Combine}{
#' This section describes how individual beta values are deemed as outliers
#' and how these outliers are combined into extended genomic regions
#' (aberrantly methylated regions, AMRs, or epimutations).
#' The only method supported at the moment is thresholding as
#' described below.
#'
#' \describe{
#' \item{"threshold"}{
#' This method applies a fixed threshold (specified using
#' `combine.threshold` parameter) to all `xIQR` or probability
#' values computed at the previous step. All observed beta values that
#' are passing this threshold (above or equal to the threshold for `xIQR`;
#' below or equal in case of probability values) are considered to be
#' outliers and therefore retained.
#'
#' Next, for all outlier beta values per sample, corresponding genomic
#' positions are merged into genomic intervals using the window of
#' `combine.window` and keeping the strand information unless
#' `combine.ignore.strand` is TRUE.
#' }
#' \item{"comb-p"}{
#' This method of combining outliers into genomic ranges is yet to be
#' implemented.
#' }
#' }
#'
#' Resulting genomic intervals are then filtered:
#' only the regions containing at least `combine.min.cpgs`
#' outliers and which are at leas as wide as `combine.min.width` are
#' reported back.
#'
#' As a final step, the following average values are computed for every
#' aberrantly methylated region:
#' \enumerate{
#' \item{
#' arithmetic mean of distances of all outlier beta values to
#' a median beta value (`dbeta`)
#' }
#' \item{
#' if `compute=="IQR"`, arithmetic mean of xIQR values of all outlier
#' genomic position (`xiqr`) OR
#' }
#' \item{
#' if `compute=="beta+binom"`, geometric mean of probability values
#' of all outlier genomic position (`pval`)
#' }
#' }
#'
#' }
#'
#' @note
#' NA values within metadata columns of `data.ranges` are silently dropped
#' in all computations.
#'
#' @param data.ranges A `GRanges` object with genomic locations and
#' corresponding beta values included as metadata.
#' @param data.samples A character vector with sample names (e.g., a subset of
#' metadata column names). If `NULL` (the default), then all samples (metadata
#' columns) are included in the analysis.
#' @param data.coverage An optional `data.frame` object with coverage data. If
#' provided, must be an all-integer `data.frame`, with the same dimensions
#' and column names as the `data.ranges` metadata columns.
#' @param transform A character scalar specifying if beta values should be
#' used as supplied ("identity", the default) or linearly transformed ("linear")
#' to force all \eqn{\{0;1\}} endpoint (extreme) values within open \eqn{(0,1)}
#' interval using the following formula: \deqn{x'=\frac{x(N-1)+0.5}{N}}
#' where \eqn{x} is the beta value before transformation, and \eqn{N} is the
#' number of samples. Such transformation is recommended only
#' if beta values contain \eqn{\{0;1\}} values \bold{and} coverage data is NOT
#' available. See \href{https://doi.org/10.1037/1082-989x.11.1.54}{doi: 10.1037/1082-989X.11.1.54}
#' and \href{https://stats.stackexchange.com/questions/31300/dealing-with-0-1-values-in-a-beta-regression}{Dealing with 0,1 values in a beta regression} for more details.
#' @param exclude.range A numeric vector of length two. Unless `NULL`
#' (the default), all `data.ranges` genomic locations with their median
#' methylation beta value within the `exclude.range` interval are filtered out.
#' @param compute A character scalar: when "IQR" (the
#' default), AMR search based on interquantile range is performed.
#' When "beta+binom" - search is based on fitting optionally weighted
#' beta distribution (with the possibility of estimating binomial probability
#' of \eqn{\{0;1\}} endpoint values). See Details section for explanation.
#' @param compute.estimate A character scalar for the method of parameter
#' estimation of beta distribution. The default ("mom") stands for the method
#' of moments based on the unbiased estimator of variance and includes
#' \eqn{\{0;1\}} endpoints in calculation of moments (mean, unbiased variance).
#' Other options are "amle" (approximation of maximum likelihood estimation)
#' and "nmle" (numeric maximum likelihood estimation) - both ignore
#' \eqn{\{0;1\}} endpoints in calculations. See Details section for explanation.
#' @param compute.weights A character scalar for the weights assigned to
#' individual observations (beta values) and used to compute weighted
#' means and variance as described in "Compute" section below. Four available
#' weighing schemes result in different sensitivity of outlier detection and
#' rate of false positive (FP) findings: "equal" (the default) is the least
#' sensitive and gives least number of FPs, while "invDist" is the most
#' sensitive but may result in a very high number of FPs especially when
#' `combine.threshold` is too high (1e-3 or higher).
#' "logInvDist" is recommended when one desires a balance between relatively
#' low type I error rate and higher
#' detection sensitivity for both unique and non-unique AMRs.
#'
#' If "equal", all weights are equal to 1 (\eqn{w_i=1}).
#' Otherwise, weights of observations inversely depend on their distance from
#' the median (thus emphasizing outliers) and are calculated using the
#' following formulas:
#' \deqn{\text{"logInvDist": } w_i = \log{\frac{1}{|M-x_i|+\epsilon}}}
#' \deqn{\text{"sqrtInvDist": } w_i = \sqrt{\frac{1}{|M-x_i|+\epsilon}}}
#' \deqn{\text{"invDist": } w_i = \frac{1}{|M-x_i|+\epsilon}}
#' where \eqn{x_i} is a beta value for i-th sample, \eqn{M} is a median of
#' all beta values at this genomic location, and \eqn{\epsilon} is a very
#' small number (= FLT_EPSILON \eqn{\approx} 1.192093e-07).
#' @param combine A character scalar for the method used to combine individual
#' outlier genomic positions into genomic intervals (ranges). When "threshold"
#' (the default) is used, simple thresholding is applied.
#' More details are given in the "Combine" subsection below.
#' @param combine.threshold A numeric scalar setting the threshold for an
#' outlier value. When `compute=="IQR"`, methylation beta values differing
#' from the median value by at least `combine.threshold` interquartile ranges
#' are considered to be outliers (the default: 5). When `compute=="beta+binom"`,
#' all probability values not higher than `combine.threshold` are considered
#' to be outliers (the default: 0.001).
#' @param combine.window A positive integer. All significant (survived the
#' filtering stage) `data.ranges` genomic locations within this distance will be
#' merged to create AMRs (the default: 300).
#' @param combine.min.cpgs A single integer >= 1. All AMRs containing less than
#' `combine.min.cpgs` significant genomic locations are filtered out (the
#' default: 7).
#' @param combine.min.width A single integer >= 1 (the default).
#' Only AMRs with the width of at least `combine.min.width` are returned.
#' @param combine.ignore.strand A boolean scalar to ignore strand information
#' in the input `data.ranges` and when outlier genomic positions are combined
#' into genomic intervals (the default: FALSE).
#' @param ncores A single integer >= 1. Number of OpenMP threads for parallel
#' computation (the default: half of the available cores). Results of parallel
#' processing are fully reproducible.
#' @param verbose Boolean to report progress and timings (default: TRUE).
#' @return The output is a `GRanges` object that contains all the aberrantly
#' methylated regions (AMRs / epimutations) for all `data.samples` samples in
#' `data.ranges` object. The following metadata columns may be present:
#' \itemize{
#' \item `revmap` -- integer list of significant CpGs (`data.ranges` genomic
#' locations) that are included in this AMR region
#' \item `ncpg` -- number of significant CpGs within this AMR region
#' \item `sample` -- contains an identifier of a sample to which
#' corresponding AMR belongs
#' \item `dbeta` -- average deviation of beta values for significant CpGs from
#' their corresponding median values
#' \item `pval` -- geometric mean of p-values for significant CpGs
#' \item `xiqr` -- average IQR-normalised deviation of beta values for
#' significant CpGs from their corresponding median values
#' }
#' @seealso \code{\link{plotAMR}} for plotting AMRs, \code{\link{getUniverse}}
#' for info on enrichment analysis, \code{\link{simulateAMR}} and
#' \code{\link{simulateData}} for the generation of simulated test data sets,
#' and `ramr` vignettes for the description of usage and sample data.
#' @examples
#' data(ramr)
#' getAMR(data.ranges=ramr.data, data.samples=ramr.samples,
#' compute="beta+binom", compute.estimate="amle",
#' combine.min.cpgs=5, combine.window=1000, combine.threshold=1e-3)
#' @export
getAMR <- function (data.ranges,
data.samples=NULL,
data.coverage=NULL,
transform=c("identity", "linear"),
exclude.range=NULL,
compute=c("IQR", "beta+binom"),
compute.estimate=c("mom", "amle", "nmle"),
compute.weights=c("equal", "logInvDist", "sqrtInvDist", "invDist"),
combine=c("threshold", "comb-p"),
combine.threshold=ifelse(compute=="IQR", 5, 1e-3),
combine.window=300,
combine.min.cpgs=7,
combine.min.width=1,
combine.ignore.strand=FALSE,
ncores=NULL,
verbose=TRUE)
{
if (!methods::is(data.ranges,"GRanges"))
stop("'data.ranges' must be a GRanges object")
data.mcols <- GenomicRanges::mcols(data.ranges)
if (is.null(data.samples)) {
data.samples <- colnames(data.mcols)
} else {
if (!all(data.samples %in% colnames(data.mcols)))
stop("'data.ranges' metadata must include 'data.samples'")
}
if (length(data.samples)<3)
stop("at least three 'data.samples' must be provided")
if (is.null(data.coverage)) {
data.coverage <- as.data.frame(
sapply(data.samples, function (s) integer(0))
)
is.coverage <- FALSE
} else {
if (!methods::is(data.coverage, "data.frame") |
!identical(dim(data.mcols), dim(data.coverage)) |
!identical(colnames(data.mcols), colnames(data.coverage)) |
!all(apply(data.coverage, 2, is.integer)))
stop("When provided, 'data.coverage' must be",
" an all-integer 'data.frame' object",
" of the same dimensions as, and with the colnames identical to",
" 'data.ranges' metadata")
is.coverage <- TRUE
}
if (is.null(exclude.range))
exclude.range <- c(2,0) # // exclude those that are >= than 2 and <= than 0
transform <- match.arg(transform)
compute <- match.arg(compute)
compute.estimate <- match.arg(compute.estimate)
compute.weights <- match.arg(compute.weights)
combine <- match.arg(combine)
if (compute.estimate=="nmle")
stop("compute.estimate=='nmle' is not available yet")
if (combine=="comb-p")
stop("combine=='comb-p' is not available yet")
#####################################################################################
.data <- .preprocessData(
data.ranges=data.ranges, data.samples=data.samples,
data.coverage=data.coverage, transform=transform,
exclude.range=exclude.range, ncores=ncores, verbose=verbose
)
if (compute=="IQR") {
.getAMR.IQR(
data.list=.data, threshold=combine.threshold, verbose=verbose
)
} else if (compute=="beta+binom") {
.getAMR.beta(
data.list=.data,
estimate=compute.estimate,
weights=compute.weights,
coverage=is.coverage,
threshold=log(combine.threshold),
verbose=verbose
)
}
amr.ranges <- .createGranges(
data.list=.data,
compute=compute,
window=combine.window,
min.cpgs=combine.min.cpgs,
min.width=combine.min.width,
ignore.strand=combine.ignore.strand,
verbose=verbose
)
return(amr.ranges)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.