automorphism_prob | R Documentation |
This function applies a Dirichlet-Multinomial Modelling (in Bayesian framework) to compute the posterior probability of each type of mutational event. DNA bases are classified based on the physicochemical criteria used to ordering the set of codons: number of hydrogen bonds (strong-weak, S-W), chemical type (purine-pyrimidine, Y-R), and chemical groups (amino versus keto, M-K) (see reference 4). Preserved codon positions are labeled with letter “H”.
As a result, mutational events are grouped by type of mutation covering all the possible combinations of symbols: "Y", "R", "M", "W", "K", "S", and "H", for example: "YSH", "RSH", "MSH", "WSH", "KSH", "SSH", "HSH", "YHH", "RHH", and so on. Insertion/deletion mutations are not considered.
Given the observed data n = (n_1, ..., n_k)
, where n_i
is the
frequency for the mutation type i
, we want to estimate the parameters
of the Dirichlet distribution, \alpha = (\alpha_1, ..., \alpha_k)
,
that maximize the marginal likelihood.
The marginal likelihood of the data under the Dirichlet-multinomial model is given by
P\left(n\middle|\alpha\right) = \frac{N!}{\prod_{i=1}^{k}n_i}
\frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\Gamma
\left(N+\sum_{i=1}^{k}\alpha_i\right)}\prod_{i=1}^{k}\frac{\Gamma
\left(n_i+\alpha_i\right)}{\Gamma\left(\alpha_i\right)}
where N = \sum_{i=1}^{k}n_i
.
To perform MLE, we maximize the log of this likelihood:
log(P\left(n\middle|\alpha\right))
. That is, we aim to maximize this
log-likelihood with respect to \alpha
. This is done numerically
because there's no closed-form solution. Here, we use:
Arg\max_{\alpha} {\{log\left(P\left(n\middle|\alpha\right)\right\}}
with initial guess set as \alpha_i = 1 / (n_i + 1)
.
Let be \theta
the vector of probabilities for each mutation type. It's
the parameter we're estimating, which represents the probabilities of
observing each mutation type in the multinomial distribution. The conjugate
distribution in this context refers to the Dirichlet distribution, which is
the prior distribution for \theta
. When we have observed data, the
posterior distribution of \theta
given this data is also a Dirichlet
distribution due to the conjugate property.
The prior distribution, before observing any data, our belief about the
distribution of \theta
is given by:
\theta\sim\ Dirichlet\left(\alpha_1,\alpha_k,\ldots,\alpha_k\right)
where the \alpha_i
are known as initial concentration parameters,
which represents our initial belief or assumption about the distribution
of the mutation types. These parameters control how the probability mass is
distributed among the mutation types. We might set these initial parameters
based on some prior knowledge or simply to provide a non-informative prior.
Once we have the MLE for \alpha
, the posterior distribution of
\theta
given the data updates these parameters to incorporate the
observed data:
\theta\mid\ n\sim\ Dirichlet\left(\alpha_1^\ast+n1,\alpha_k^\ast+n_2,
\ldots,\alpha_k^\ast+n_k\right)
where \alpha^\ast_i
are the MLE estimates, i.e.,
\alpha_i^\ast+n_i
are our updated belief about the frequencies of
mutation types in the population sample.
The expected values of \theta_i
, given the data under the posterior
Dirichlet distribution is:
E\left[\theta_i\right]=\frac{\alpha_i^*+n_i}
{\sum_{j=1}^{k}\left(\alpha_j^*+n_j\right)}
These expected values directly corresponds to the "posterior probability" of
observing mutation type i
.
This approach provides a rigorous estimation of the Dirichlet parameters under the Dirichlet-multinomial model using MLE.
automorphism_prob(x, ...)
## S4 method for signature 'AutomorphismByCoef'
automorphism_prob(
x,
initial_alpha = NULL,
method = c("L-BFGS-B", "Nelder-Mead", "BFGS", "CG", "SANN", "Brent"),
maxit = 500,
abstol = 10^-8,
...
)
## S4 method for signature 'AutomorphismByCoefList'
automorphism_prob(
x,
initial_alpha = NULL,
method = c("L-BFGS-B", "Nelder-Mead", "BFGS", "CG", "SANN", "Brent"),
maxit = 500,
abstol = 10^-8
)
x |
An AutomorphismByCoefList-class object returned by function automorphism_bycoef. |
... |
Not in use yet. |
initial_alpha |
A vector of initial guess values for pseudo counts
|
method , maxit , abstol |
Parameter values to pass into
|
A data frame with the posterior probabilities.
automorphism_bycoef
## Load the data set
data("autby_coef", package = "GenomAutomorphism")
post_prob <- automorphism_prob(autby_coef[1:10])
head(post_prob,10)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.