knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

LTN-LDA (LeBlanc and Ma XXXX) is a mixed-membership model which seeks to appropriately incorporate cross-sample heterogeneity in subcommunity compositions: a characteristic of the data prevalent in most microbiome studies. Incorporating such cross-sample heterogeneity leads to substantially improved inference compared to existing models.

Data

To demonstrate the use of LTNLDA function, we load the data included with the LTNLDA package. This is a modified version of the dataset analyzed in Dethlefsen and Relman 2011. The original study analyzed the microbiome composition of three patients who were given two different five-day courses of the antibiotic ciproflaxin over the course of ten months. We limit ourselves to only patient F. The original data for patient F featured $2,852$ unique ASVs across $54$ samples. We merged ASVs into taxa at their finest known taxa and pruned all taxa which did not total at least $100$ sequencing reads across all $54$ samples. This resulted in $44$ taxa totaling $99.86$ percent of the original counts.

library(phyloseq)
data("ps",package = "LTNLDA")
ps

The LTNLDA function takes phyloseq objects as one of its inputs. Moreover, any phyloseq object used as in input in the LTNLDA function must have two features. The first is that is must have a table of otu counts accessible with otu_table(ps).

otu = otu_table(ps)
otu[1:5,1:5]

This object is a matrix where rows correspond to taxa, columns correspond to samples, and entry $i,j$ is the number of sequencing reads assigned to taxa $i$ in sample $j$. The second object is a phylogenetic tree accessible with phy_tree(ps)$edge.

tree = phy_tree(ps)
tree.edge = tree$edge
head(tree.edge)

This object is an edge matrix encoding a phylogenetic tree. Each row details one edge of the tree, where the entry in the first column is the tail of the edge and the entry in the second column is the head. The root node appears only in the first column, and the leaves appear only in the second column.

As this vignette is purely for instructional purposes, it will suffice to consider only the first five samples in the dataset.

ps = subset_samples(ps,time < 6)

LTNLDA

library(LTNLDA)

Given that we have a phyloseq object ps which possess the requisite data structures, we need specify only one more variable to run an LTN-LDA Gibbs sampler. This is the number of subcommunities, $K$, we choose to model. We now demonstrate the default use of the LTNLDA function:

K = 2
model = LTNLDA(ps,K)

We also note here that the value of the second tuning parameter, $C$, the threshold controlling the level of cross-sample heterogeneity, has a default value equal to $5$. This is unlikely to be optimal for many phylogenetic trees, and so we allow the user to set the value of $C$. We will detail how to choose $K$ and $C$ using perplexity in another vignette, for now take $K = 2$ and $C = 10$.

K = 2
C = 10
model = LTNLDA(ps,K,C)

Additional Options

We do not run the full version of the above code because it prints quite a bit of text to standard out and clutters the document. Instead, we run a reduced version, but first we shall cover some additional and optional inputs to the LTNLDA function.

We now run the LTNLDA function for a smaller than default number of iterations:

K = 2
burnin = 300
iterations = 30
thin = 10
model = LTNLDA(ps,K,iterations = iterations,burnin = burnin,thin = thin)

The LTNLDA function prints messages to standard out to inform the user that it is running and how much longer it has to run. Note that for actual implementation we recommend running the Gibbs sampler for far longer than it was above --- the default values ought to be long enough for most applications.

Summary

We also provide a function, Summary, to provide a high level summary of the result of an LTNLDA function.

sum = Summary(model)

Summary returns a list with two components. The first, Abundance, is a matrix which provides the average abundance of each subcommunity in descending oder.

sum$Abundance

The other component, Top_ASVs, is a list providing the top $n$ ASVs in each subcommunity. The $k^{th}$ entry in the list is a matrix where each row corresponds to a certain ASV and the entry is the prevalene of that ASV in subcommunity $k$. The subcommunities are ordered according to descending average subcommunity abundance so that the subcommunities in Abundance and Top_ASVs have the same ordering.

sum$Top_ASVs

By default, the function has $n = 5$, however this parameter can be modified.

top_n = 10
sum = Summary(model, top_n)
sum$Top_ASVs

LTNLDA Output

The output of the LTNLDA function is a list. There are two main types of data contained in this list: posterior mean parameter estimates and Markov chains. The first provides an easily accessible source of posterior inference while the second allows users to assess variance and Markov Chain convergence. We will now cover every output in greater detail.

Mean_Post_Phi_d

Mean_Post_Phi_d returns a matrix containing posterior mean estimates for the $\phi_{d}$. The $d^{th}$ row is a probability vector such that the $k^{th}$ entry gives the proportion of subcommunity $k$ in sample $d$.

post_phi = model$Mean_Post_Phi_d
d = 1
k = 1
post_phi[d,]
post_phi[d,k]

Mean_Post_Beta_kd

Mean_Post_Beta_kd returns an array containing posterior mean estimates for the $\beta_{k,d}$. The probability vector in the $k^{th}$ row and $d^{th}$ column gives the sample-specific multinomial ASV-subcommunity distribution for the $k^{th}$ subcommunity in the $d^{th}$ sample. The $v^{th}$ entry in this vector gives the incidence of the $v^{th}$ ASV.

post_beta_kd = model$Mean_Post_Beta_kd
k = 1
d = 1
post_beta_kd[k,d,]
v = 1
post_beta_kd[k,d,v]

Mean_Post_Beta_k

Mean_Post_Beta_k returns a matrix containing posterior mean estimates for the $\beta_{k}$. The probability vector in the $k^{th}$ row gives the "average" multinomial ASV-subcommunity distribution for the $k^{th}$ subcommunity. The $v^{th}$ entry in this vector gives the incidence of the $v^{th}$ ASV.

post_beta_k = model$Mean_Post_Beta_k
k = 1
post_beta_k[k,]
v = 1
post_beta_k[k,v]

Chain_Phi

Chain_Phi returns an array containing the Markov chain for the $\phi_d$. The $d^{th}$ row and the $k^{th}$ column contain the Markov chain for $\phi_{d,k}$.

chain_phi = model$Chain_Phi
k = 1
d = 1
plot(chain_phi[d,k,],type="l",
     xlab = "Iteration",
     ylab = "Phi_{d,k}",
     main = "Trace Plot")

Chain_Psi

Chain_Psi returns a list containing the Markov chains for the $\psi_{p,d,k}$. The $k^{th}$ entry in this list returns an array containing the Markov chains for the $\psi_{\cdot,\cdot,k}$. The $p^{th}$ column and $d^{th}$ slice contains the Markov chain for $\psi_{p,d,k}$.

chain_psi = model$Chain_Psi
k = 1
chain_psi_k = chain_psi[[k]]
p = 1
d = 1
plot(chain_psi_k[,p,d],type="l",
     xlab = "Iteration",
     ylab = "Psi_{p,d,k}",
     main = "Trace Plot")

Chain_Mu

Chain_Mu returns a list containing the Markov chains for the $\mu_k$. The $k^{th}$ entry in this list returns a matrix containing the Markov chains for the $\mu_{k,\cdot}$. The $p^{th}$ column contains the Markov chain for $\mu_{k,p}$.

chain_mu = model$Chain_Mu
k = 1
chain_mu_k = chain_mu[[k]]
p = 1
plot(chain_mu_k[,p],type="l",
     xlab = "Iteration",
     ylab = "Mu_{p,k}",
     main = "Trace Plot")

Chain_Sigma

Chain_Sigma returns a list containing the Markov chains for the $\Sigma_k$. The $k^{th}$ entry in this list returns an array containing the Markov chains for the $\Sigma_{\cdot, \cdot,k}$. In particular, only the diagonal entries are nonzero and thus of interest. The Markov chain for the variance parameter associated with node $p$ is contained in the $p^{th}$ column and $p^{th}$ slice of this array.

chain_sigma = model$Chain_Sigma
k = 1
chain_sigma_k = chain_sigma[[k]]
p = 1
plot(chain_sigma_k[,p,p],type="l",
     xlab = "Iteration",
     ylab = "Sigma_{p,p,k}",
     main = "Trace Plot")

phyloseq

phyloseq returns the phyloseq object analyzed by the data.

ps = model$phyloseq
ps

References



PatrickLeBlanc/LTNLDA documentation built on May 22, 2022, 12:49 p.m.