estimateJDiv | R Documentation |
Given the methylation levels from two individuals at a given cytosine site, this function computes the J information divergence (JD) between the methylation levels. The motivation to introduce JD in Methyl-IT is founded on the following:
It is a symmetrized form of the Kullback-Leibler divergence (D_{KL}). Kullback and Leibler themselves actually defined the divergence as D_{KL}(P || Q) + D_{KL}(Q || P), which is symmetric and nonnegative, where the probability distributions P and Q are defined on the same probability space (see reference (1) and Wikipedia).
In general, JD is highly correlated with the Hellinger divergence, which is the main divergence currently used in Methyl-IT (see the examples for function estimateDivergence).
By construction, the unit of measurement of JD is the bit of information, which sets the basis for further information-thermodynamics analyses.
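The symmetry and nonnegativity claims above can be checked numerically with a minimal sketch in plain R. The functions kl and jdiv below are illustrations written for this page, not part of Methyl-IT:

```r
# Kullback-Leibler divergence between two Bernoulli distributions,
# in bits (base-2 logarithm); assumes p and q lie strictly in (0, 1)
kl <- function(p, q) {
  p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))
}

# J divergence as the symmetrized (averaged) KL divergence
jdiv <- function(p, q) (kl(p, q) + kl(q, p)) / 2

jdiv(0.8, 0.3) == jdiv(0.3, 0.8)  # symmetric by construction
```

Averaging the two KL terms keeps the result on the same scale as a single KL divergence while removing its asymmetry.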
estimateJDiv(
p,
logbase = 2,
stat = FALSE,
n = NULL,
output = c("single", "all")
)
p | A numerical vector of the methylation levels. |
logbase | Logarithm base used to compute the JD. Logarithm base 2 is used as default. Use logbase = exp(1) for the natural logarithm. |
stat | logical(1). Whether to compute the statistic with asymptotic Chi-squared distribution with one degree of freedom (see Details). |
n | An integer vector of the total counts (coverage) used to compute the methylation levels; required when stat = TRUE. |
output | Optional. If stat = TRUE, whether to return only the statistic ("single") or all the computed values ("all"). |
The methylation levels p_{ij}, at a given cytosine site i from an individual j, lead to the probability vectors p^{ij} = (p_{ij}, 1 - p_{ij}). Then, the J-information divergence between the methylation levels p^{ij} and q^i = (q_i, 1 - q_i), used as reference, is given by (1-2):

JD(p^{ij}, q^i) = ((p_{ij} - q_i) * log(p_{ij}/q_i) +
                   (q_i - p_{ij}) * log((1 - p_{ij})/(1 - q_i)))/2
Missing methylation levels, reported as NA or NaN, are replaced with zero.
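The formula above can be transcribed directly into plain R. The function jdiv_site below is a sketch for illustration, not the package implementation; it assumes both levels lie strictly in (0, 1), whereas the package handles the boundary values 0 and 1 with a numerical trick (see Examples):

```r
# Direct transcription of the JD formula: half the symmetrized KL
# divergence between the Bernoulli distributions (p_ij, 1 - p_ij)
# and (q_i, 1 - q_i); assumes p_ij, q_i strictly inside (0, 1)
jdiv_site <- function(p_ij, q_i, logbase = 2) {
  ((p_ij - q_i) * log(p_ij / q_i, base = logbase) +
   (q_i - p_ij) * log((1 - p_ij) / (1 - q_i), base = logbase)) / 2
}

jdiv_site(0.8, 0.3)  # JD in bits with the default logbase = 2
```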
The statistic with asymptotic Chi-squared distribution is based on the statistic suggested by Kupperman (1957) (2) for D_{KL} and commented in references (3-4). That is,

2*(n_1 + 1)*(n_2 + 1)*JD(p^{ij}, q^i)/(n_1 + n_2 + 2)

where n_1 and n_2 are the total counts (coverage in the case of methylation) used to compute the probabilities p_{ij} and q_i. A basic Bayesian correction is added to prevent zero counts.
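This statistic can be sketched in plain R as follows. The helper jdiv_stat is hypothetical (not the package API), and the natural logarithm is used here, as in the classical asymptotic likelihood-ratio result behind Kupperman's statistic:

```r
# Kupperman-type statistic 2*(n1 + 1)*(n2 + 1)*JD/(n1 + n2 + 2),
# asymptotically Chi-squared with one degree of freedom; the JD term
# is computed inline with the natural logarithm (a sketch assumption)
jdiv_stat <- function(p_ij, q_i, n1, n2) {
  jd <- ((p_ij - q_i) * log(p_ij / q_i) +
         (q_i - p_ij) * log((1 - p_ij) / (1 - q_i))) / 2
  2 * (n1 + 1) * (n2 + 1) * jd / (n1 + n2 + 2)
}

stat <- jdiv_stat(0.8, 0.3, n1 = 30, n2 = 25)
pval <- pchisq(stat, df = 1, lower.tail = FALSE)  # asymptotic p-value
```

Note that the statistic grows with the coverages n_1 and n_2, so the same divergence value becomes more significant at higher coverage.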
The J divergence value for the given methylation levels is returned.
Robersy Sanchez 11/27/2019 https://github.com/genomaths
Kullback S, Leibler RA. On Information and Sufficiency. Ann Math Stat. 1951;22: 79–86. doi:10.1214/aoms/1177729694.
Kupperman, M., 1957. Further applications of information theory to multivariate analysis and statistical inference. Ph.D. Dissertation, George Washington University.
Salicrú M, Morales D, Menéndez ML, Pardo L. On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis. 1994. pp. 372–391. doi:10.1006/jmva.1994.1068.
Basu A, Mandal A, Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat Probab Lett. Elsevier B.V.; 2010;80: 206–214. doi:10.1016/j.spl.2009.10.008.
J. K. Chung, P. L. Kannappan, C. T. Ng, P. K. Sahoo, Measures of distance between probability distributions. J. Math. Anal. Appl. 138, 280–292 (1989).
Lin J. Divergence Measures Based on the Shannon Entropy. IEEE Trans Inform Theory, 1991, 37:145–51.
Sanchez R, Mackenzie SA. Information thermodynamics of cytosine DNA methylation. PLoS One, 2016, 11:e0150427.
See https://en.wikipedia.org/wiki/Kullback-Leibler_divergence for more details, and estimateDivergence for an example of using it.
p <- c(0.5, 0.5)
estimateJDiv(p)
## A numerical trick is implemented. The J-divergence values are the
## same for the following vectors:
p <- c(0.9999999999999999, 0)
q <- c(1, 0)
estimateJDiv(p) == estimateJDiv(q)