

# LDTM

The goal of LDTM is to …

## Installation

You can install the development version of LDTM from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("Goodgolden/LDTM")
```

## Example

This is a basic example which shows you how to solve a common problem:

``` r
library(LDTM)
#> Loading required package: tidyverse
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
#> ✔ tibble  3.1.6     ✔ dplyr   1.0.8
#> ✔ tidyr   1.2.0     ✔ stringr 1.4.0
#> ✔ readr   2.1.2     ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> Welcome to my package
## basic example code
```

## Introduction

- OTU: operational taxonomic unit, a cluster of microbial genomic sequences grouped by sequence similarity.
- OTUs partition sequences into discrete groups instead of traditional taxonomic units.
- The most abundant sequence in an OTU is taken as its representative sequence.
- The representative sequences from all the OTUs are used to construct a phylogenetic tree among the OTUs.
- Microbial community information = OTUs + counts + phylogenetic relationships + taxonomy.

A human gut microbiome study at the University of Pennsylvania examined the effect of diet on gut microbiome composition, analyzed with distance-based methods.

### Goal

Identify both the key nutrients and the taxa they affect.

Chen and Li (2013) adopted a regression-based approach, treating the OTU abundance data as multivariate count responses and the nutrients as covariates.

## Model

### Multinomial logistic regression

Let Y = (Y_1, ..., Y_K)^T denote the vector of OTU counts, with observed value y = (y_1, ..., y_K)^T and total count \sum_{k=1}^K Y_k = \sum_{k=1}^K y_k, and let p = (p_1, ..., p_K)^T, with \sum_{k=1}^K p_k = 1, be the vector of OTU proportions. The multinomial density is

f_M(y;\ p) = \frac {\Gamma ({\sum_{k=1}^K y_k + 1})} {\prod_{k=1}^K \Gamma ({y_k + 1})} \prod_{k=1}^K p_k^{y_k}

The link function is based on the multinomial-Poisson transformation.

Poissonization may be needed to simulate the data; a sketch is given below.
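
As a rough illustration of this relationship (not code from the package), the sketch below draws independent Poisson counts and compares them with a multinomial draw conditioned on the same total; the rates `lambda` are arbitrary placeholders.

``` r
# Sketch only: the multinomial-Poisson relationship. Independent Poisson
# counts, conditioned on their total, follow a multinomial distribution
# with probabilities proportional to the Poisson rates.
set.seed(1)
lambda <- c(2, 5, 1, 8)                      # arbitrary rates, one per OTU
y_pois <- rpois(length(lambda), lambda)      # Poissonized draw
n      <- sum(y_pois)                        # condition on the observed total
p      <- lambda / sum(lambda)               # implied multinomial probabilities
y_mult <- rmultinom(1, size = n, prob = p)   # equivalent multinomial draw
```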

### Overdispersion

To account for overdispersion, the proportion vector is given a Dirichlet distribution with parameters \alpha = (\alpha_1, ..., \alpha_K)^T,\ \alpha_k > 0:

f_D (u; \alpha) = \frac {\Gamma ({\sum_{k=1}^K \alpha_k})} {\prod_{k=1}^K \Gamma (\alpha_k)} \prod_{k=1}^K u_k^{\alpha_k - 1}

Integrating the proportions out gives the Dirichlet-multinomial density

f_{DM}(y; \alpha) = \int_{u \in \Phi^{K-1}} f_M(y; u)\, f_D(u; \alpha)\, du

= \frac {\Gamma ({\sum_{k=1}^K y_k + 1}) \Gamma ({\sum_{k=1}^K \alpha_k})} {\Gamma ({\sum_{k=1}^K y_k} + {\sum_{k=1}^K \alpha_k})} \prod_{k=1}^K \frac {\Gamma (y_k + \alpha_k)} {\Gamma ({y_k + 1}) \Gamma ({\alpha_k})}
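
A minimal sketch of this closed-form density written with `lgamma()`; the function name `ddirmult_log` and the example inputs are hypothetical, not part of the package API.

``` r
# Sketch: Dirichlet-multinomial log-density, coded directly from the
# closed form above; y (counts) and alpha (positive parameters) are
# hypothetical length-K vectors.
ddirmult_log <- function(y, alpha) {
  lgamma(sum(y) + 1) + lgamma(sum(alpha)) - lgamma(sum(y) + sum(alpha)) +
    sum(lgamma(y + alpha) - lgamma(y + 1) - lgamma(alpha))
}

ddirmult_log(y = c(3, 0, 7), alpha = c(0.5, 1, 2))
```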

### Limitation

Generalizations of the Dirichlet-multinomial distribution:

- Logistic-normal multinomial distribution
- Dirichlet-tree multinomial distributions (next section)

## Dirichlet-Tree Multinomial Distributions

p_{l} = b_{v_0 v_1^l} \times b_{v_1^l v_2^l} \times ... \times b_{v_{D_l - 1}^l v_{D_l}^l} = \prod_{v\in \mathcal V} \prod_{c \in \mathcal C_v} b_{vc}^{\delta_{vc}(l)}

where \delta_{vc}(l) = 1 if the branch from node v to child c lies on the path from the root to leaf l, and 0 otherwise.

f_M(y; p) = f_M(y; b_v, v\in \mathcal V) = \prod_{v \in \mathcal V} \frac {\Gamma (\sum_{c \in \mathcal C_v} y_{vc} + 1)} {\prod_{c \in \mathcal C_v} \Gamma ({y_{vc} + 1})} \prod_{c \in \mathcal C_v} b_{vc}^{y_{vc}}

b_v = \{b_{vc}, c\in \mathcal C_v\}

f_{DT} (u; \alpha) = \prod_{v\in \mathcal V} f_D(u_v; \alpha_v) = \prod_{v \in \mathcal V} \frac {\Gamma (\sum_{c \in \mathcal C_v} \alpha_{vc})} {\prod_{c \in \mathcal C_v} \Gamma ({\alpha_{vc}})} \prod_{c \in \mathcal C_v} u_{vc}^{\alpha_{vc} - 1}

f_{DTM} (y; \alpha_v, v\in\mathcal V)

= \prod_{v\in\mathcal V} \int_{u_v \in \Phi^{K_{v}-1}} f_M(y; u_v) f_{D}(u_v; \alpha_v) du_v

= \prod_{v\in\mathcal V} \frac {\Gamma ({\sum_{c\in \mathcal C_v} y_{vc} + 1}) \Gamma ({\sum_{c\in \mathcal C_v} \alpha_{vc}})} {\Gamma ({\sum_{c \in \mathcal C_v} y_{vc} } + {\sum_{c\in \mathcal C_v}\alpha_{vc}})} \prod_{c\in \mathcal C_v} \frac {\Gamma (y_{vc} + \alpha_{vc})} {\Gamma ({y_{vc} + 1}) \Gamma ({\alpha_{vc}})}

E[p_l] = \prod_{v\in\mathcal V} \prod_{c\in\mathcal C_v}\bigg(E[b_{vc}] \bigg)^{\delta_{vc}(l)} = \prod_{v\in\mathcal V} \prod_{c\in\mathcal C_v}\bigg( \frac {\alpha_{vc}} {\sum_{c\in \mathcal C_v} \alpha_{vc}} \bigg)^{\delta_{vc}(l)}
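
As a toy illustration of how leaf probabilities factor along the tree, the sketch below uses a hypothetical three-leaf tree (root `v0` with children `v1` and leaf `L3`; node `v1` with leaves `L1` and `L2`); none of these objects come from the package.

``` r
# Sketch: leaf probabilities as products of branch probabilities b_vc along
# the root-to-leaf path, for a hypothetical tree v0 -> {v1, L3}, v1 -> {L1, L2}.
b_v0 <- c(v1 = 0.6, L3 = 0.4)    # branch probabilities at the root v0
b_v1 <- c(L1 = 0.25, L2 = 0.75)  # branch probabilities at internal node v1

p_leaf <- c(L1 = unname(b_v0["v1"] * b_v1["L1"]),   # path v0 -> v1 -> L1
            L2 = unname(b_v0["v1"] * b_v1["L2"]),   # path v0 -> v1 -> L2
            L3 = unname(b_v0["L3"]))                # path v0 -> L3
sum(p_leaf)   # the leaf probabilities sum to 1
```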

## Dirichlet-Tree Multinomial Regression Model

l_v(\beta_v) = \log L_v(\beta_v) = \sum_{i=1}^n \Bigg[\tilde \Gamma \Big(\sum_{c\in \mathcal C_v} \alpha_{ivc}\Big) - \tilde \Gamma \Big(\sum_{c\in \mathcal C_v} y_{ivc} + \sum_{c\in \mathcal C_v} \alpha_{ivc}\Big) + \sum_{c \in \mathcal C_v} \bigg\{ \tilde \Gamma \Big(y_{ivc} + \alpha_{ivc}\Big) - \tilde \Gamma \Big(\alpha_{ivc}\Big) \bigg\}\Bigg]

\tilde \Gamma (\cdot) = \log \big(\Gamma (\cdot) \big)

\alpha_{ivc} = \exp(x_i^T\beta_{vc})

y_{ivc} = \sum_{l \in \mathcal L} \delta_{vc} (l) y_{il}
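
A small sketch of this aggregation for the same hypothetical three-leaf tree as above: each branch count is the sum of the leaf counts descending through that branch.

``` r
# Sketch: aggregating leaf counts y_il into branch counts y_ivc via the
# path indicator delta_vc(l), for the hypothetical tree v0 -> {v1, L3}, v1 -> {L1, L2}.
y_leaf <- c(L1 = 3, L2 = 5, L3 = 2)     # leaf counts for one sample i
descendants <- list(                    # leaves descending through each branch (v, c)
  v0 = list(v1 = c("L1", "L2"), L3 = "L3"),
  v1 = list(L1 = "L1", L2 = "L2")
)
y_branch <- lapply(descendants, function(children)
  sapply(children, function(leaves) sum(y_leaf[leaves])))
# y_branch$v0 is c(v1 = 8, L3 = 2); y_branch$v1 is c(L1 = 3, L2 = 5)
```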

l_v(\beta_v) = \log L_v(\beta_v) = \sum_{i=1}^n \Bigg[\log \bigg(\Gamma \Big(\sum_{c\in \mathcal C_v} \exp(x_i^T\beta_{vc})\Big) \bigg) - \log \bigg(\Gamma \Big(\sum_{c\in \mathcal C_v} y_{ivc} + \sum_{c\in \mathcal C_v} \exp(x_i^T\beta_{vc})\Big) \bigg) +\sum_{c \in \mathcal C_v} \bigg\{ \log \bigg(\Gamma \Big(y_{ivc} + \exp(x_i^T\beta_{vc})\Big)\bigg) -\log \bigg(\Gamma \Big(\exp(x_i^T\beta_{vc})\Big) \bigg)\bigg\}\Bigg]
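
A sketch of this node-level objective in R, written directly from the formula with `lgamma()`; `X`, `Y_v`, and `beta_v` are hypothetical placeholders (covariate matrix, branch-count matrix, and coefficient matrix for node v), not the package's internal representation.

``` r
# Sketch: negative log-likelihood contribution of one internal node v,
# coded from the formula above. X is an n x p covariate matrix, Y_v an
# n x K_v matrix of branch counts y_ivc, and beta_v a p x K_v coefficient
# matrix.
negloglik_node <- function(beta_v, X, Y_v) {
  alpha <- exp(X %*% beta_v)                       # alpha_ivc = exp(x_i' beta_vc)
  ll <- lgamma(rowSums(alpha)) -
    lgamma(rowSums(Y_v) + rowSums(alpha)) +
    rowSums(lgamma(Y_v + alpha) - lgamma(alpha))
  -sum(ll)                                         # negated, for minimization
}
```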

## Regularized Likelihood Estimation

pnl_{DTM}(\beta; \lambda, \gamma) = -l_{DTM}(\beta) + \lambda \bigg\{ (1- \gamma)\sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v} \|\beta_{vc}\|_{L1} + \gamma \sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v} \|\beta_{vc}\|_{L2} \bigg\}
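
A sketch of the penalty term, under the assumption that `beta_list` holds one coefficient vector per branch (v, c); this illustrates the formula rather than the package's implementation.

``` r
# Sketch: the combined lasso / group-lasso penalty above. beta_list is a
# hypothetical list with one coefficient vector beta_vc per branch (v, c).
penalty <- function(beta_list, lambda, gamma) {
  l1 <- sum(vapply(beta_list, function(b) sum(abs(b)), numeric(1)))     # sum of L1 norms
  l2 <- sum(vapply(beta_list, function(b) sqrt(sum(b^2)), numeric(1)))  # sum of L2 norms
  lambda * ((1 - gamma) * l1 + gamma * l2)
}
```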

## Algorithm: Accelerated Proximal Gradient Method
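
A generic FISTA-style accelerated proximal gradient loop, sketched only to show the structure of the updates; `grad_f` and `prox` are placeholders for the gradient of the smooth loss (the negative DTM log-likelihood) and the proximal operator of the penalty, and the soft-thresholding example covers only the lasso part of the penalty.

``` r
# Sketch only: a generic accelerated proximal gradient (FISTA-style) loop.
apg <- function(beta0, grad_f, prox, step, n_iter = 200) {
  beta <- beta_prev <- beta0
  for (t in seq_len(n_iter)) {
    w         <- (t - 1) / (t + 2)                 # Nesterov momentum weight
    z         <- beta + w * (beta - beta_prev)     # extrapolation point
    beta_prev <- beta
    beta      <- prox(z - step * grad_f(z), step)  # proximal gradient update
  }
  beta
}

# Example proximal operator for the lasso part of the penalty only
# (soft-thresholding), assuming a penalty of the form lambda * ||beta||_1:
soft_threshold <- function(x, step, lambda = 0.1) {
  sign(x) * pmax(abs(x) - step * lambda, 0)
}
```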


