library(diffTop)
knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, dev = "png", message = FALSE, error = FALSE, warning = TRUE)
Note: if you use diffTop in published research, please cite:
Jeganathan, P. and Holmes, S.P. (2021) A statistical perspective on the challenges in molecular microbial biology. Journal of Agricultural, Biological and Environmental Statistics, 26(2):131 - 160. 10.1007/s13253-021-00447-1
Here we show the steps for the differential topic analysis. The following code assume that the phyloseq
object is available. Please refer to phyloseq
for creating a phyloseq
class object.
K <- 11 alpha <- 0.8 gamma <- 0.8 ps <- psE_BARBI stan.data <- setUpData( ps = ps, K = K, alpha = alpha, gamma = gamma ) iter <- 2000 chains <- 4 stan.fit <- LDAtopicmodel( stan_data = stan.data, iter = iter, chains = chains, sample_file = NULL, diagnostic_file = NULL, cores = 4, control = list(adapt_delta = 0.9), save_dso = TRUE, algorithm = "NUTS" ) samples <- rstan::extract( stan.fit, permuted = TRUE, inc_warmup = FALSE, include = TRUE ) theta <- samples$theta aligned <- alignmentMatrix( theta, ps, K, iterUse = iter/2, chain = chains, SampleID_name = "unique_names" ) theta_aligned <- thetaAligned( theta, K, aligned, iterUse = iter/2, chain = chains ) beta <- samples$beta beta_aligned <- betaAligned( beta, K, aligned, iterUse = iter/2, chain = chains ) dds_all <- diffTopAnalysis( design = ~ pna, ps, theta_aligned, subsetSample = sample_names(psE_BARBI) ) res_all <- results(dds_all) res_all
We set-up data for latent Dirichlet allocation (LDA) based on a phyloseq
object, the number of topics, hyperparameters for topic proportion in each sample and ASVs proportion in each topic.
We set the hyperparameters $\alpha$ and $\gamma$ less than one to generate mixtures that are different from each other. We choose 0.8 for $\alpha$ across all samples so that we avoid generating unrealistic topics.
data("psE_BARBI") ps <- psE_BARBI
K <- 11 alpha <- 0.8 gamma <- 0.8 stan.data <- setUpData( ps = ps, K = K, alpha = alpha, gamma = gamma )
We need to estimate $\theta$ and $B = \left[\beta_{1}, \cdots, \beta_{K} \right]^{T}.$
We estimate the parameters using HMC NUTS with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples.
LDAtopicmodel
uses an LDA model written in lda.stan. The user doesn't need to write the Stan model. We use stan
function from rstan
package. Please refer to ?stan
to specify the inputs for the stan
function. For the whole dataset, we can run the following chunk in high-performance computing and save the results.
iter <- 2000 chains <- 4 stan.fit <- LDAtopicmodel( stan_data = stan.data, iter = iter, chains = chains, sample_file = NULL, diagnostic_file = NULL, cores = 4, control = list(adapt_delta = 0.9), save_dso = TRUE, algorithm = "NUTS" )
We extract posterior samples from the results.
samples <- rstan::extract( stan.fit, permuted = TRUE, inc_warmup = FALSE, include = TRUE )
We estimated the parameters using the HMC-NUTS sampler with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples and discarded. Label switching across chains makes it difficult to directly compute log predictive density, split-$\hat{R}$, effective sample size for model assessment, and evaluate convergence and mixing of chains.
To address this issue, we fixed the order of topics in chain one and then found the permutation that best aligned the topics across all four chains. For each chain two to four, we identified the estimated topics pair with the highest correlation, then found the next highest pair among the remaining, and so forth.
We create a Topic $*$ Chain matrix.
theta <- samples$theta aligned <- alignmentMatrix( theta, ps, K, iterUse = iter/2, chain = chains, SampleID_name = "unique_names" )
We align the topic proportion in each specimen across four chains.
theta_aligned <- thetaAligned( theta, K, aligned, iterUse = iter/2, chain = chains )
We align the ASVs proportion in each topic across four chains.
# an array (iterations *topic * ASV) beta <- samples$beta # an array (iterations *topic * ASV) beta_aligned <- betaAligned( beta, K, aligned, iterUse = iter/2, chain = chains )
Plot the topic proportion in specimens and we can save it for publication purposes. We can draw an informative summary of bacterial communities in each phenotype.
plotTopicProportion( ps, theta_aligned, K, col_names_theta_all = c("iteration", "Sample", "Topic", "topic.dis"), chain = 4, iterUse = iter/2, SampleIDVarInphyloseq = "unique_names", design = ~ pna )
We plot the ASVs distribution in each topic.
plotASVCircular( ps, K, iterUse = iter/2, chain = chains, beta_aligned, varnames = c("iterations", "topic", "rsv_ix"), value_name = "beta_h", taxonomylevel = "Class", thresholdASVprop = 0.008 )
We do goodness of fit test using the posterior estimates and observed data.
modelAssessment( ps, stan.fit, iterUse = iter/2, statistic = "max", ASVsIndexToPlot = c(1, 3, 10:14, 19:26, 36, 51:53, 148) )
We compute effective sample size for each parameter.
p <- diagnosticsPlot( theta_aligned = theta_aligned, beta_aligned = beta_aligned iterUse = iter/2, chain = chains ) p_ess_bulk <- p[[[1]]] p_ess_bulk
We can check whether $\hat{R}$ is less than 1.02
p_rhat <- p[[2]]
p_rhat
dds_all <- diffTopAnalysis( design = ~ pna, ps, theta_aligned, subsetSample = sample_names(ps), fitType = "mean" ) results(dds_all)
Let's test on the paired-specimens.
dds_all_paired <- diffTopAnalysis( design = ~ pna, ps, theta_aligned, subsetSample = paired_specimens ) results(dds_all_paired)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.