knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The goal of LDTM is to ...
You can install the development version of LDTM from GitHub with:
# install.packages("devtools") devtools::install_github("Goodgolden/LDTM")
This is a basic example which shows you how to solve a common problem:
library(LDTM) ## basic example code
microbial genomic sequencess clustered by sequence similarity
partition sequences into discrete groups instead of traditional taxonomic units.
the most abundant sequence in an OTU is the representative sequence
representative sequences from all the OTUs are used to construct a phylogenetic tree among all the OTUs
Microbial community information == OTUs + counts + phylogenetic relationship + taxonomy
the effect of diet on gut microbiome composition
identify a few gut microbiome associated nutrients
unable to provide information on how dietary nutrient affect bacterial taxa
identify both the key nutrients as well as the taxa the nutrient affect
Chen and Li (2013) adopted a regression-based approach
OTU abundance data as multivariate count responses, and nutrient as covaraite
$Y = (Y_1, ..., Y_K)^T$
$y = (y_1, ..., y_K)^T$
$\sum_{k=1}^K Y_k = \sum_{k=1}^K y_k$
$p = (p_1, ..., p_K)^T, \sum_{k=1} ^K p_k =1$
$f_M(y;\ p) = \frac {\Gamma ({\sum_{k=1}^K y_k + 1})} {\prod_{k=1}^K \Gamma ({y_k + 1})} \prod_{k=1}^Kp_k^{y_k}$
The link function is a multinomial-Poisson transformation
might need to use Poissonization to simulate the data
the multinomial distribution is not appropriate
$p$ is a random variable with some prior distribution
$\Phi^d$ is a (d-1) dimensional simplex,
the support of p is $\Phi^{K-1}$
Dirichlet distribution density at $u = (u_1, ..., u_K) \in \Phi^{K-1}$
Dirichlet is a conjugate prior to multinomial distribution
posterior is Dirichlet multionmial distribution, aka the Dirichlet compound multinomial distribution.
$a = (a_1, ..., a_K)^T,\ a_i > 0$
$f_D (u;a) = \frac {\Gamma ({\sum_{k=1}^K \alpha_k})} {\prod_{k=1}^K \Gamma ({\sum_{k=1}^K \alpha_k})} \prod_{k=1}^K u_k^{\alpha_k - 1}$ $f_{DM}(y; \alpha) = \int_{u \in \Phi^{K-1}} f_M(y; u) f_D(u; \alpha)$
$= \frac {\Gamma ({\sum_{k=1}^K y_k + 1}) \Gamma ({\sum_{k=1}^K \alpha_k})} {\Gamma ({\sum_{k=1}^K y_k} + {\sum_{k=1}^K \alpha_k})} \prod_{k=1}^K \frac {\Gamma (y_k + \alpha_k)} {\Gamma ({y_k + 1}) \Gamma ({\alpha_k})}$
all components must share a common variance parameter
components are mutually independent, up to the constraint that must sum up to 1
distribution fails to take into account the special and inherent property of microbiome count data (evolutionary relationships in the phylogenetic tree)
the relationships among the components of the count vector can be represented as a tree, node-by-node.
each component has a independent variance
components are correlated at subtree levels
a regression model with the effects of covariates
Billheimer with Aitchison's logistic normal distribution instead of Dirichlet
covaraites to the count vector (Billheimer et al. 2001)
total number of counts, determined by the sequence depth, as an ancillary statistics
analysis conditioning on this number
A Tree $T$ representing the hierarchical structure over the count responses
we will incorporate the tree structural information into the modeling
$\mathcal L$ the set of leaf nodes
$\mathcal V$ the set of internal nodes of $T$
for each $v \in \mathcal V$, $\mathcal C_v$ the child nodes of $v$
for each $v$ and $c \in \mathcal C_v$, $\delta_{vc} (l) = 1(v\rightarrow c, l \in \mathcal L)$
$\mathcal L = { 1, ..., K}$
$y_{vc} = \sum_{l\in\mathcal L} \delta_{vc} (l) \ y_l$
$p_{vc} = \sum_{l\in \mathcal L} \delta_{vc} (l) \ p_l$
$c \in \mathcal C_v$ is the index for the subtree counts and probabilities
Let $v_0$ to be the root node of $T$.
each leaf node is $(v_0 \equiv v_0^l )\rightarrow v_1^l \rightarrow ... \rightarrow v_{D_l -1}^l \rightarrow (v_{D_l}^l \equiv l)$
the path from $v_0$ to $l$, where $v_d^l \in \mathcal V, \forall d = 0, ..., D_{l-1}$
$p_{l} = b_{v_0 v_1^l} \times b_{v_1^l v_2^l} \times ... \times b_{v_{D_l - 1}^l} = \prod_{v\in \mathcal V} \prod_{c \in \mathcal C} b_{vc}^{\delta^{vc}(l)}$
for $p_l$ is the product of branch probabilities from root $v_0$ to final leaf $l$
this is for the leaf level
$f_M(y; p) = f_M(y; b_v, v\in \mathcal V) = \prod_{v \in \mathcal V} \frac {\Gamma (\sum_{c \in \mathcal C_v} y_{vc} + 1)} {\prod_{c \in \mathcal C_v} \Gamma ({y_{vc} + 1})} \prod_{c \in \mathcal C_v} b_{vc}^{y_vc}$
$b_v = {b_{vc}, c\in \mathcal C_v}$
$f_{DT} (u_v; \alpha_v) = \prod_{v\in \mathcal V} f_D(u_v; \alpha_v) = \prod_{v \in \mathcal V} \frac {\Gamma (\sum_{c \in \mathcal C_v} \alpha_{vc})} {\prod_{c \in \mathcal C_v} \Gamma ({\alpha_{vc}})} \prod_{c \in \mathcal C_v} u_{vc}^{\alpha_vc - 1}$
$u_v = (u_{vc}, c\in \mathcal C_v) \in \Phi^{K_v -1}$
$K$ is the number of children of $v$
$\alpha_v = (\alpha_vc > 0, c \in \mathcal C_v)$
$f_{DTM} (y; \alpha_v, v\in\mathcal V)$
$= \prod_{v\in\mathcal V} \int_{u_v \in \Phi^{K_{v}-1}} f_M(y; u_v) f_{D}(u_v; \alpha_v) du_v$
$= \prod_{v\in\mathcal V} \frac {\Gamma ({\sum_{c\in \mathcal C_v} y_{vc} + 1}) \Gamma ({\sum_{c\in \mathcal C_v} \alpha_{vc}})} {\Gamma ({\sum_{c \in \mathcal C_v} y_{vc} } + {\sum_{c\in \mathcal C_v}\alpha_{vc}})} \prod_{c\in \mathcal C_v} \frac {\Gamma (y_{vc} + \alpha_{vc})} {\Gamma ({y_{vc} + 1}) \Gamma ({\alpha_{vc}})}$
that each component in the product $\prod_{v \in \mathcal V}$ corresponds to a interior node in the tree
the Dirichlet multinomial distribution based on the accumulated counts along branches of given node.
$E[p_l] = \prod_{v\in\mathcal V} \prod_{c\in\mathcal C_v}\bigg(E[b_{vc}] \bigg)^{\delta_{vc}(l)} = \prod_{v\in\mathcal V} \prod_{c\in\mathcal C_v}\bigg( \frac {\alpha_{vc}} {\sum_{c\in \mathcal C_v} \alpha_{vc}} \bigg)^{\delta_{vc}(l)}$
given $\sum_{k=1}^K Y_k = \sum_{k=1}^K y_k; \ E[Y_l] = E[p_l] \sum_{k=1} ^Ky_k$
when $\mathcal V \neq {v_0}$, each $p_l$ has an independent variance
a regression of $Y = (y_1, ..., y_K)^T$ on $p$ covariates $X_1, ... X_p$
for each $v \in \mathcal V$ and $c \in \mathcal C_v$ we express $\log(\alpha_{vc})$ as a linear combination of the covariates.
$\log(\alpha_{vc}) = X^T \beta_{vc} \ \ \ \ (2.a)$
$\log(\alpha_{(i)vc}) = X_i^T \beta_{(i)vc} + Z_i^T b_{(i){vc}}\ \ \ \ (2.b)$
$X = (X_1, ..., X_p)^T$
$\beta_{vc} = (\beta_{vc1}, ..., \beta_{vcp})^T$
$y_i = (y_{i1}, ..., y_{iK})^T$
$x_i = (x_{i1}, ..., x_{ip})^T$
$b_{vc} = (b_{vc1}, ..., b_{vcq})^T$
$z_i = (z_{i1}, ..., z_{iq})^T$
$\beta_v = (\beta_{vc}, c \in \mathcal C_v)$
$\beta = (\beta_v, v \in \mathcal V)$
$\beta_{vi} | b_{vi} = (\beta_{(i)vc} | b_{(i)vc}, c \in \mathcal C_v)$
$\beta_i | b_{i} = (\beta_{(i)v}| b_{(i)v}, v \in \mathcal V)$
$l_{DTM}(\beta) = \log L_{DTM}(\beta) = \sum_{v \in \mathcal V} \log L_v(\beta_v) \ \ \ \ (3.a)$
$l_v(\beta_v) = log L_v(\beta_v) = \sum_{i=1}^n \Bigg[\tilde \Gamma \Big(\sum_{c\in \mathcal C_v} \alpha_{ivc}\Big) - \tilde \Gamma \Big(\sum_{c\in \mathcal C_v} y_{ivc} + \sum_{c\in \mathcal C_v} \alpha_{ivc}\Big) +c\sum_{c \in \mathcal C_v} \bigg{ \tilde \Gamma \Big(y_{ivc} + \alpha_{ivc}\Big) - \tilde \Gamma \Big(\alpha_{ivc}\Big) \bigg}\Bigg]$
$\tilde \Gamma (.) = \log \big(\Gamma (.) \big)$
$\alpha_{ivc} = \exp(x_i^T\beta_{vc})$
$y_{ivc} = \sum_{l \in \mathcal L} \delta_{vc} (l) y_{il}$
$l_v(\beta_v) = \log L_v(\beta_v) = \sum_{i=1}^n \Bigg[\log \bigg(\Gamma \Big(\sum_{c\in \mathcal C_v} \exp(x_i^T\beta_{vc})\Big) \bigg) - \log \bigg(\Gamma \Big(\sum_{c\in \mathcal C_v} y_{ivc} + \sum_{c\in \mathcal C_v} \exp(x_i^T\beta_{vc})\Big) \bigg) +\sum_{c \in \mathcal C_v} \bigg{ \log \bigg(\Gamma \Big(y_{ivc} + \exp(x_i^T\beta_{vc})\Big)\bigg) -\log \bigg(\Gamma \Big(\exp(x_i^T\beta_{vc})\Big) \bigg)\bigg}\Bigg]$
parametrization for Dirichlet Tree Multinomial Regression Model
the number of $p$ time the number of tree branches $\sum_{v\in \mathcal V} K_v$
$pnl_{DTM}(\beta; \lambda, \gamma) = -l_{DTM}(\beta) + \lambda \bigg{ (1- \gamma)\sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|{L1} + \gamma \sum{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|_{L2} \bigg}$
approximate $l_{DTM} (\beta)$ at $\beta^{(t)}$
$- \tilde l_{DTM}(\beta; \eta^{(t)}) = -l_{DTM}(\eta^{(t)}) - \langle\beta - \eta^{(t)}, \nabla l_{DTM} (\eta^{(t)})\rangle + \frac {C} {2} \|\beta - \eta^{(t)}\|^2_2 \ \ \ \ (4)$
$pn \tilde{l}{DMT}(\beta) = - \tilde l{DTM} (\beta; \eta^{(t)}) + \lambda_1 \sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|{1} + \lambda_2\sum{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|_{2} \ \ \ \ (5)$
$\beta^{(t+1)} = \underset {\beta} {arg min} \big(pn \tilde{l}_{DTM} (\beta)\big) \ \ \ \ (6)$
$\eta^{(t+1)} = \beta^{(t+1)} + \frac {1-\alpha_t} {\alpha_t} \alpha_{t+1} (\beta^{(t+1)} - \beta^{(t)}) \ \ \ \ (7)$
$\alpha_t = 2/(t+2)$
$\beta_{vc}^{(t+1)} = \underset {\beta_{vc}} {argmin} \Bigg[ \frac 1 2 \bigg \|\beta_{vc} - \Big{\eta_{vc}^{(t)} - \frac 1 C \nabla l_{vc} (\eta_v^{(t)}) \Big}\bigg\|2^2 + \frac {\lambda_1} {C}\|\beta{vc}\|1 + \frac {\lambda_2} {C} \|\beta{vc}\|_2 \Bigg]$
$\text {s.t.} \ \ \beta_{vc}^{(t+1)} = \underset {\beta_{vc}} {argmin} \Bigg[ \frac 1 2 \bigg \| \beta_{vc} +\frac 1 C \nabla l_{vc} (0) \bigg\|2^2 + \lambda \bigg(\frac {1-\gamma} {C}\|\beta{vc}\|1 + \frac {\gamma} {C} \|\beta{vc}\|_2\bigg) \Bigg]$
$\tilde \beta_{vc}^{(t+1)} = sgn \bigg{ \eta_{vc}^{(t)} - \frac 1 C \nabla l_{vc} (\eta_{vc}^{(t)})\bigg} \ \max \bigg{0, \Big|\eta_{vc}^{(t)} - \frac 1 C \nabla l_{vc}(\eta_v^{(t)}) \Big| - \frac {\lambda_1} {C}\bigg}$
$\beta_{vc}^{(t+1)} = \frac {\tilde \beta_{vc}^{(t+1)}}{\|\tilde \beta_{vc}^{(t+1)}\|2} \max \bigg{0, \Big\|\tilde \beta{vc}^{(t+1)} \Big\|_2 - \frac {\lambda_1} {C}\bigg} \ \ \ \ (8)$
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.