| distBioCond | R Documentation |
Given a bioCond object, distBioCond deduces, for each
pair of samples contained in it, the average absolute difference in signal
intensities of genomic intervals between them. Specifically, the function
calculates a weighted Minkowski (i.e., p-norm) distance between each
pair of vectors of signal intensities, with the weights being inversely
proportional to variances of individual intervals (see also
"Details"). distBioCond returns a dist object
recording the deduced average |M| values. The object effectively
quantifies the distance between each pair of samples and can be passed to
hclust to perform a clustering analysis (see
"Examples" below).
distBioCond(
x,
subset = NULL,
method = c("prior", "posterior", "none"),
min.var = 0,
p = 2,
diag = FALSE,
upper = FALSE
)
x |
A bioCond object. |
subset |
An optional vector specifying a subset of genomic intervals to
be used for deducing the distances between samples of x. |
method |
A character string indicating the method to be used for
calculating the variances of individual intervals. Must be one of
"prior" (the default), "posterior", and "none" (see "Details"). |
min.var |
Lower bound of variances read from the mean-variance
curve associated with x. |
p |
The power used to calculate the p-norm distance between
each pair of samples (see "Details" for the specific formula).
Any positive real could be
specified, though setting |
diag, upper |
Two arguments to be passed to
|
Variance of signal intensity varies considerably
across genomic intervals, due to
the heteroscedasticity inherent to count data as well as most of their
transformations. On this account, separately scaling the signal intensities
of each interval in a bioCond should lead to a more
reasonable measure of distances between its samples.
Suppose that X and Y are two vectors of signal intensities
representing two samples of a bioCond and that x_i, y_i
are their i-th elements corresponding to the i-th interval.
distBioCond calculates the distance between X and Y as
follows:
d(X, Y) = (sum(w_i * |y_i - x_i| ^ p) / sum(w_i)) ^ (1 / p)
where w_i is the reciprocal of the scaled variance (see below) of interval i, and p defaults to 2. Since the weights of intervals are normalized to have a sum of 1, the resulting distance can be interpreted as an average absolute difference in signal intensities of intervals between the two samples.
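The formula can be sketched in a few lines of base R. This is only an illustration of the arithmetic, not the code used inside distBioCond: the function name weighted_minkowski and its arguments are hypothetical, and the scaled variances v are assumed to be already available.

```r
# Illustrative sketch only; weighted_minkowski is a hypothetical helper,
# not a function exported by MAnorm2.
weighted_minkowski <- function(x, y, v, p = 2) {
  # x, y: signal intensities of two samples across the same intervals
  # v:    (scaled) variances of the intervals, assumed strictly positive
  stopifnot(length(x) == length(y), length(x) == length(v),
            all(v > 0), p > 0)
  w <- 1 / v                                # weights: reciprocal variances
  (sum(w * abs(y - x)^p) / sum(w))^(1 / p)  # normalized weighted p-norm
}

x <- c(1, 2, 3)
y <- c(2, 4, 3)
v <- c(1, 2, 4)
weighted_minkowski(x, y, v, p = 1)  # (1 * 1 + 0.5 * 2 + 0.25 * 0) / 1.75 = 8 / 7
weighted_minkowski(x, y, v, p = 2)  # sqrt((1 * 1 + 0.5 * 4 + 0.25 * 0) / 1.75)
```

With p = 1 the result is a weighted mean of the absolute differences, so intervals with large variances contribute less to the distance.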
Since there typically exists a clear mean-variance dependence across genomic
intervals, distBioCond takes advantage of the mean-variance curve
associated with the bioCond to improve estimates of variances of
individual intervals. By default, prior variances, which are the ones read
from the curve, are used to deduce the weights of intervals for calculating
the distances. Alternatively, one can choose to use posterior variances of
intervals by setting method to "posterior", which are weighted
averages of prior and observed variances, with the weights being
proportional to their respective numbers of degrees of freedom (see
fitMeanVarCurve for details). Since the observed variances of
intervals are associated with large uncertainty when the total number of
samples is small, it is not recommended to use posterior variances in such
cases. Note that if method is set to "none",
distBioCond will consider all genomic intervals to be associated with
a constant variance. In this case, neither the prior variance nor the
observed variance of each interval is used
to deduce its weight for calculating the distances.
This method is particularly suited to bioCond objects
that have gone through a variance-stabilizing transformation (see
vstBioCond for details and "Examples" below) as well as
bioConds whose structure matrices have been specifically
designed (see below and "References" also).
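The relation between prior, observed and posterior variances described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in MAnorm2: the function posterior_variance and the argument names d0 (number of prior degrees of freedom) and df (degrees of freedom of the observed variance) are hypothetical names chosen for this example.

```r
# Illustrative sketch only; posterior_variance, d0 and df are hypothetical
# names. The posterior variance is a weighted average of the prior and
# observed variances, with weights proportional to their degrees of freedom.
posterior_variance <- function(prior_var, obs_var, d0, df) {
  stopifnot(d0 >= 0, df >= 0, d0 + df > 0)
  (d0 * prior_var + df * obs_var) / (d0 + df)
}

# With few samples (small df) the result stays close to the prior variance.
posterior_variance(prior_var = 0.2, obs_var = 0.8, d0 = 10, df = 2)  # 3.6 / 12 = 0.3
```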
Another point deserving special attention is that distBioCond
accounts for the possibility that
genomic intervals in the supplied bioCond
are associated with different structure matrices. To compare
signal variation levels between genomic intervals objectively,
distBioCond further scales the variance of each interval
(deduced by whichever method is selected) by
multiplying it by the geometric mean of the diagonal
elements of the interval's structure matrix. See bioCond and
setWeight for a detailed description of structure matrices.
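As a rough illustration of this scaling step, the sketch below multiplies a variance by the geometric mean of the diagonal elements of a structure matrix. The helper scale_variance and the toy matrix S are assumptions made for this example and are not part of MAnorm2.

```r
# Illustrative sketch only; scale_variance is a hypothetical helper.
scale_variance <- function(v, S) {
  # v: variance of an interval (from whichever method is selected)
  # S: the interval's structure matrix, assumed square with a positive diagonal
  stopifnot(is.matrix(S), nrow(S) == ncol(S), all(diag(S) > 0))
  gm <- exp(mean(log(diag(S))))  # geometric mean of the diagonal elements
  v * gm
}

S <- diag(c(1, 4))      # toy structure matrix with diagonal elements 1 and 4
scale_variance(0.5, S)  # geometric mean is 2, so the scaled variance is 1
```

With an identity structure matrix the geometric mean is 1 and the variance is left unchanged.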
Given a set of bioCond objects,
distBioCond can also be used to quantify the distance between
each pair of them, by first combining the bioConds into a
single bioCond and fitting a mean-variance curve for
it (see cmbBioCond and "Examples" below).
A dist object quantifying the distance between
each pair of samples of x.
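For readers unfamiliar with dist objects, the following sketch shows how such an object can be inspected and passed to hclust. It uses only base R: the toy matrix m is made up for illustration, and the Euclidean distance from stats::dist merely stands in for the output of distBioCond.

```r
# Illustrative sketch only; m is toy data and stats::dist() is used as a
# stand-in for distBioCond(), which returns an object of the same class.
m <- matrix(c(1.0, 2.0, 3.0,
              1.1, 2.1, 2.9,
              5.0, 6.0, 7.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), NULL))

d <- dist(m)      # pairwise distances between rows (samples)
as.matrix(d)      # view the distances as a full symmetric matrix

hc <- hclust(d, method = "average")
hc$merge          # s1 and s2 are the closest pair and are merged first
```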
Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
bioCond for creating a bioCond object;
fitMeanVarCurve for fitting a mean-variance curve;
cmbBioCond for combining a set of bioCond objects
into a single one; hclust for performing a
hierarchical clustering on a dist object;
vstBioCond for applying a variance-stabilizing
transformation to signal intensities of samples of a bioCond.
data(H3K27Ac, package = "MAnorm2")
attr(H3K27Ac, "metaInfo")
## Cluster a set of ChIP-seq samples from different cell lines (i.e.,
## individuals).
# Perform MA normalization and construct a bioCond.
norm <- normalize(H3K27Ac, 4:8, 9:13)
cond <- bioCond(norm[4:8], norm[9:13], name = "all")
# Fit a mean-variance curve.
cond <- fitMeanVarCurve(list(cond), method = "local",
                        occupy.only = FALSE)[[1]]
plotMeanVarCurve(list(cond), subset = "all")
# Measure the distance between each pair of samples and accordingly perform
# a hierarchical clustering. Note that biological replicates of each cell
# line are clustered together.
d1 <- distBioCond(cond, method = "prior")
plot(hclust(d1, method = "average"), hang = -1)
# Measure the distances using only hypervariable genomic intervals. Note the
# change of scale of the distances.
res <- varTestBioCond(cond)
f <- res$fold.change > 1 & res$pval < 0.05
d2 <- distBioCond(cond, subset = f, method = "prior")
plot(hclust(d2, method = "average"), hang = -1)
# Apply a variance-stabilizing transformation and associate a constant
# function with the resulting bioCond as its mean-variance curve.
vst_cond <- vstBioCond(cond)
vst_cond <- setMeanVarCurve(list(vst_cond),
                            function(x) rep_len(1, length(x)),
                            occupy.only = FALSE,
                            method = "constant prior")[[1]]
plotMeanVarCurve(list(vst_cond), subset = "all")
# Repeat the clustering analyses on the VSTed bioCond.
d3 <- distBioCond(vst_cond, method = "none")
plot(hclust(d3, method = "average"), hang = -1)
res <- varTestBioCond(vst_cond)
f <- res$fold.change > 1 & res$pval < 0.05
d4 <- distBioCond(vst_cond, subset = f, method = "none")
plot(hclust(d4, method = "average"), hang = -1)
## Cluster a set of individuals.
# Perform MA normalization and construct bioConds to represent individuals.
norm <- normalize(H3K27Ac, 4, 9)
norm <- normalize(norm, 5:6, 10:11)
norm <- normalize(norm, 7:8, 12:13)
conds <- list(GM12890 = bioCond(norm[4], norm[9], name = "GM12890"),
              GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))
conds <- normBioCond(conds)
# Group the individuals into a single bioCond and fit a mean-variance curve
# for it.
cond <- cmbBioCond(conds, name = "all")
cond <- fitMeanVarCurve(list(cond), method = "local",
                        occupy.only = FALSE)[[1]]
plotMeanVarCurve(list(cond), subset = "all")
# Measure the distance between each pair of individuals and accordingly
# perform a hierarchical clustering. Note that GM12891 and GM12892 are
# actually a couple and they are clustered together.
d1 <- distBioCond(cond, method = "prior")
plot(hclust(d1, method = "average"), hang = -1)
# Measure the distances using only hypervariable genomic intervals. Note the
# change of scale of the distances.
res <- varTestBioCond(cond)
f <- res$fold.change > 1 & res$pval < 0.05
d2 <- distBioCond(cond, subset = f, method = "prior")
plot(hclust(d2, method = "average"), hang = -1)