distBioCond: Quantify the Distance between Each Pair of Samples in a...
In MAnorm2: Tools for Normalizing and Comparing ChIP-seq Samples

distBioCond

R Documentation

Quantify the Distance between Each Pair of Samples in a `bioCond`

Description

Given a bioCond object, distBioCond deduces, for each pair of samples contained in it, the average absolute difference in signal intensities of genomic intervals between them. Specifically, the function calculates a weighted Minkowski (i.e., p-norm) distance between each pair of vectors of signal intensities, with the weights being inversely proportional to the variances of individual intervals (see also "Details"). distBioCond returns a dist object recording the deduced average |M| values. The object effectively quantifies the distance between each pair of samples and can be passed to hclust to perform a clustering analysis (see "Examples" below).

Usage

distBioCond(
  x,
  subset = NULL,
  method = c("prior", "posterior", "none"),
  min.var = 0,
  p = 2,
  diag = FALSE,
  upper = FALSE
)

Arguments

`x`	A `bioCond` object.
`subset`	An optional vector specifying a subset of genomic intervals to be used for deducing the distances between samples of `x`. In practice, you may want to use only the intervals associated with large variations across the samples to calculate the distances, as such intervals are most helpful for distinguishing between the samples (see `varTestBioCond` and "Examples" below).
`method`	A character string indicating the method to be used for calculating the variances of individual intervals. Must be one of `"prior"` (default), `"posterior"` and `"none"`. Can be abbreviated. Note that the `"none"` method does not consider the mean-variance trend associated with `x` (see "Details").
`min.var`	Lower bound of variances read from the mean-variance curve associated with `x`. Any variance read from the curve less than `min.var` will be adjusted to this value. It's primarily used for safely reading positive values from the curve and taking into account the practical significance of a signal variation. Ignored if `method` is set to `"none"`.
`p`	The power used to calculate the p-norm distance between each pair of samples (see "Details" for the specific formula). Any positive real number can be specified, though values other than 1 and 2 are rarely meaningful in practice. The default, 2, corresponds to the (weighted) Euclidean distance.
`diag, upper`	Two arguments to be passed to `as.dist`.

Details

Variance of signal intensity varies considerably across genomic intervals, due to the heteroscedasticity inherent to count data as well as most of their transformations. On this account, separately scaling the signal intensities of each interval in a bioCond should lead to a more reasonable measure of distances between its samples. Suppose that X and Y are two vectors of signal intensities representing two samples of a bioCond and that xi, yi are their ith elements corresponding to the ith interval. distBioCond calculates the distance between X and Y as follows:

d(X, Y) = (sum(wi * |yi - xi| ^ p) / sum(wi)) ^ (1 / p)

where wi is the reciprocal of the scaled variance (see below) of interval i, and p defaults to 2. Since the weights of intervals are normalized to have a sum of 1, the resulting distance can be interpreted as an average absolute difference in signal intensities of intervals between the two samples.
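The formula above can be reproduced on toy data with a few lines of base R. This is only a minimal sketch: weighted_minkowski is a hypothetical helper written for illustration and is not part of MAnorm2, and the intensities and variances are made-up numbers.

```r
# Hypothetical helper (not part of MAnorm2): weighted p-norm distance
# between two vectors of signal intensities, following the formula above.
weighted_minkowski <- function(x, y, var, p = 2) {
  stopifnot(length(x) == length(y), length(x) == length(var),
            all(var > 0), p > 0)
  w <- 1 / var                              # weights: reciprocal variances
  (sum(w * abs(y - x)^p) / sum(w))^(1 / p)  # weights normalized by sum(w)
}

x <- c(1.0, 2.0, 3.0)   # toy signal intensities of sample X
y <- c(1.5, 2.0, 5.0)   # toy signal intensities of sample Y
v <- c(0.5, 1.0, 2.0)   # toy (scaled) variances of the three intervals

d_p1 <- weighted_minkowski(x, y, v, p = 1)  # weighted mean of |yi - xi|
d_p2 <- weighted_minkowski(x, y, v, p = 2)  # weighted Euclidean-type distance
```

With p = 1 and equal variances, the result reduces to the plain mean of the absolute differences, which is why the distance reads as an average |M| value.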

Since there typically exists a clear mean-variance dependence across genomic intervals, distBioCond takes advantage of the mean-variance curve associated with the bioCond to improve the variance estimates of individual intervals. By default, prior variances, which are the ones read from the curve, are used to deduce the weights of intervals for calculating the distances. Alternatively, setting method to "posterior" uses posterior variances of intervals, which are weighted averages of prior and observed variances, with the weights being proportional to their respective numbers of degrees of freedom (see fitMeanVarCurve for details). Since the observed variances of intervals are associated with large uncertainty when the total number of samples is small, using posterior variances is not recommended in such cases. Note that if method is set to "none", distBioCond considers all genomic intervals to be associated with a constant variance. In this case, neither the prior variance nor the observed variance of each interval is used to deduce its weight for calculating the distances. This method is particularly suited to bioCond objects that have gone through a variance-stabilizing transformation (see vstBioCond for details and "Examples" below) as well as bioConds whose structure matrices have been specifically designed (see below and "References" also).
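To make the notion of a posterior variance more concrete, the sketch below computes a weighted average of prior and observed variances with weights proportional to their degrees of freedom, as described above. It is an illustration under stated assumptions: posterior_var, df_prior and df_obs are hypothetical names, the numbers are invented, and the computation performed inside MAnorm2 may differ in detail (see fitMeanVarCurve).

```r
# Hypothetical helper (not the MAnorm2 implementation): combine prior and
# observed variances, weighting each by its number of degrees of freedom.
posterior_var <- function(prior_var, obs_var, df_prior, df_obs) {
  (df_prior * prior_var + df_obs * obs_var) / (df_prior + df_obs)
}

prior_var <- c(0.20, 0.10, 0.05)  # toy variances read from a mean-variance curve
obs_var   <- c(0.50, 0.02, 0.05)  # toy variances observed across samples

# Few samples: the result stays close to the prior variances.
post_few  <- posterior_var(prior_var, obs_var, df_prior = 10, df_obs = 2)

# Many samples: the observed variances dominate.
post_many <- posterior_var(prior_var, obs_var, df_prior = 10, df_obs = 40)
```

When the observed degrees of freedom are small, the observed variances are noisy, which is the situation in which the text advises relying on prior variances alone.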

distBioCond also accounts for the possibility that genomic intervals in the supplied bioCond are associated with different structure matrices. In order to objectively compare signal variation levels between genomic intervals, distBioCond further scales the variance of each interval (deduced by using whichever method is selected) by multiplying it by the geometric mean of the diagonal elements of the interval's structure matrix. See bioCond and setWeight for a detailed description of structure matrices.
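The scaling step described above can be sketched in base R as follows. The helper geo_mean_diag and the toy structure matrix S are assumptions made for illustration; they are not objects or functions provided by MAnorm2.

```r
# Hypothetical helper: geometric mean of the diagonal elements of a
# structure matrix, computed on the log scale for numerical stability.
geo_mean_diag <- function(S) {
  d <- diag(S)
  stopifnot(all(d > 0))
  exp(mean(log(d)))
}

S <- diag(c(1, 4, 16))              # toy structure matrix for one interval
scale_factor <- geo_mean_diag(S)    # cube root of 1 * 4 * 16

interval_var <- 0.3                 # toy variance deduced for this interval
scaled_var <- interval_var * scale_factor
```

An interval whose structure matrix is the identity has a scale factor of 1, so its variance is left unchanged by this step.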

Given a set of bioCond objects, distBioCond can also be used to quantify the distance between each pair of them, by first combining the bioConds into a single bioCond and fitting a mean-variance curve for it (see cmbBioCond and "Examples" below).

Value

A dist object quantifying the distance between each pair of samples of x.
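For readers unfamiliar with dist objects, the following base R sketch shows how one behaves once obtained. The matrix m stands in for distances that distBioCond would return; its values and the sample names s1 to s3 are invented for illustration.

```r
# Toy symmetric distance matrix standing in for the output of distBioCond.
m <- matrix(c(0.0, 0.4, 1.2,
              0.4, 0.0, 1.0,
              1.2, 1.0, 0.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), c("s1", "s2", "s3")))

# diag and upper only control how the dist object is printed.
d <- as.dist(m, diag = FALSE, upper = FALSE)

# A dist object can be converted back to a full matrix for lookups ...
d_s1_s3 <- as.matrix(d)["s1", "s3"]

# ... or passed directly to hclust for hierarchical clustering.
hc <- hclust(d, method = "average")
```

Here s1 and s2 are merged first because they are the closest pair, and s3 then joins them at the average of its distances to the two.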

References

Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.

Examples

data(H3K27Ac, package = "MAnorm2")
attr(H3K27Ac, "metaInfo")

## Cluster a set of ChIP-seq samples from different cell lines (i.e.,
## individuals).

# Perform MA normalization and construct a bioCond.
norm <- normalize(H3K27Ac, 4:8, 9:13)
cond <- bioCond(norm[4:8], norm[9:13], name = "all")

# Fit a mean-variance curve.
cond <- fitMeanVarCurve(list(cond), method = "local",
                        occupy.only = FALSE)[[1]]
plotMeanVarCurve(list(cond), subset = "all")

# Measure the distance between each pair of samples and accordingly perform
# a hierarchical clustering. Note that biological replicates of each cell
# line are clustered together.
d1 <- distBioCond(cond, method = "prior")
plot(hclust(d1, method = "average"), hang = -1)

# Measure the distances using only hypervariable genomic intervals. Note the
# change of scale of the distances.
res <- varTestBioCond(cond)
f <- res$fold.change > 1 & res$pval < 0.05
d2 <- distBioCond(cond, subset = f, method = "prior")
plot(hclust(d2, method = "average"), hang = -1)

# Apply a variance-stabilizing transformation and associate a constant
# function with the resulting bioCond as its mean-variance curve.
vst_cond <- vstBioCond(cond)
vst_cond <- setMeanVarCurve(list(vst_cond), function(x)
                            rep_len(1, length(x)), occupy.only = FALSE,
                            method = "constant prior")[[1]]
plotMeanVarCurve(list(vst_cond), subset = "all")

# Repeat the clustering analyses on the VSTed bioCond.
d3 <- distBioCond(vst_cond, method = "none")
plot(hclust(d3, method = "average"), hang = -1)
res <- varTestBioCond(vst_cond)
f <- res$fold.change > 1 & res$pval < 0.05
d4 <- distBioCond(vst_cond, subset = f, method = "none")
plot(hclust(d4, method = "average"), hang = -1)

## Cluster a set of individuals.

# Perform MA normalization and construct bioConds to represent individuals.
norm <- normalize(H3K27Ac, 4, 9)
norm <- normalize(norm, 5:6, 10:11)
norm <- normalize(norm, 7:8, 12:13)
conds <- list(GM12890 = bioCond(norm[4], norm[9], name = "GM12890"),
              GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))
conds <- normBioCond(conds)

# Group the individuals into a single bioCond and fit a mean-variance curve
# for it.
cond <- cmbBioCond(conds, name = "all")
cond <- fitMeanVarCurve(list(cond), method = "local",
                        occupy.only = FALSE)[[1]]
plotMeanVarCurve(list(cond), subset = "all")

# Measure the distance between each pair of individuals and accordingly
# perform a hierarchical clustering. Note that GM12891 and GM12892 are
# actually a couple and they are clustered together.
d1 <- distBioCond(cond, method = "prior")
plot(hclust(d1, method = "average"), hang = -1)

# Measure the distances using only hypervariable genomic intervals. Note the
# change of scale of the distances.
res <- varTestBioCond(cond)
f <- res$fold.change > 1 & res$pval < 0.05
d2 <- distBioCond(cond, subset = f, method = "prior")
plot(hclust(d2, method = "average"), hang = -1)

MAnorm2 documentation built on Oct. 29, 2022, 1:12 a.m.