In PratheepaJ/diffTop: Differential topic analysis using latent Dirichlet allocation

library(diffTop)

knitr::opts_chunk$set(tidy = FALSE,
                      cache = FALSE,
                      dev = "png",
                      message = FALSE, error = FALSE, warning = TRUE)

Standard workflow

Note: if you use diffTop in published research, please cite:

Jeganathan, P. and Holmes, S.P. (2021) A statistical perspective on the challenges in molecular microbial biology. Journal of Agricultural, Biological and Environmental Statistics, 26(2):131 - 160. 10.1007/s13253-021-00447-1

Quick start

Here we show the steps for the differential topic analysis. The following code assume that the phyloseq object is available. Please refer to phyloseq for creating a phyloseq class object.

K <- 11
alpha <- 0.8
gamma <- 0.8
ps <- psE_BARBI
stan.data <- setUpData(
  ps = ps,
  K = K,
  alpha = alpha,
  gamma = gamma
)

iter <- 2000
chains <- 4
stan.fit <- LDAtopicmodel(
  stan_data = stan.data, 
  iter = iter, 
  chains = chains, 
  sample_file = NULL,
  diagnostic_file = NULL,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_dso = TRUE,
  algorithm = "NUTS"
  )

samples <- rstan::extract(
  stan.fit, 
  permuted = TRUE, 
  inc_warmup = FALSE, 
  include = TRUE
  )

theta <- samples$theta 

aligned <- alignmentMatrix(
  theta, 
  ps, 
  K, 
  iterUse = iter/2,
  chain = chains,
  SampleID_name = "unique_names"
  )

theta_aligned <- thetaAligned(
  theta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  )

beta <- samples$beta 

beta_aligned <- betaAligned(
  beta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  ) 


dds_all <- diffTopAnalysis(
  design = ~ pna,
  ps,
  theta_aligned,
  subsetSample = sample_names(psE_BARBI)
)
res_all <- results(dds_all)
res_all

Input data

We set-up data for latent Dirichlet allocation (LDA) based on a phyloseq object, the number of topics, hyperparameters for topic proportion in each sample and ASVs proportion in each topic.

We set the hyperparameters $\alpha$ and $\gamma$ less than one to generate mixtures that are different from each other. We choose 0.8 for $\alpha$ across all samples so that we avoid generating unrealistic topics.

data("psE_BARBI")
ps <- psE_BARBI

K <- 11
alpha <- 0.8
gamma <- 0.8
stan.data <- setUpData(
  ps = ps,
  K = K,
  alpha = alpha,
  gamma = gamma
)

Posterior sampling

We need to estimate $\theta$ and $B = \left[\beta_{1}, \cdots, \beta_{K} \right]^{T}.$

We estimate the parameters using HMC NUTS with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples.

LDAtopicmodel uses an LDA model written in lda.stan. The user doesn't need to write the Stan model. We use stan function from rstan package. Please refer to ?stan to specify the inputs for the stan function. For the whole dataset, we can run the following chunk in high-performance computing and save the results.

iter <- 2000
chains <- 4

stan.fit <- LDAtopicmodel(
  stan_data = stan.data, 
  iter = iter, 
  chains = chains, 
  sample_file = NULL,
  diagnostic_file = NULL,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_dso = TRUE,
  algorithm = "NUTS"
  )

We extract posterior samples from the results.

samples <- rstan::extract(
  stan.fit, 
  permuted = TRUE, 
  inc_warmup = FALSE, 
  include = TRUE
  )

Alignment

We estimated the parameters using the HMC-NUTS sampler with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples and discarded. Label switching across chains makes it difficult to directly compute log predictive density, split-$\hat{R}$, effective sample size for model assessment, and evaluate convergence and mixing of chains.

To address this issue, we fixed the order of topics in chain one and then found the permutation that best aligned the topics across all four chains. For each chain two to four, we identified the estimated topics pair with the highest correlation, then found the next highest pair among the remaining, and so forth.

We create a Topic $*$ Chain matrix.

theta <- samples$theta 

aligned <- alignmentMatrix(
  theta, 
  ps, 
  K, 
  iterUse = iter/2,
  chain = chains,
  SampleID_name = "unique_names"
  )

Align topic proportion

We align the topic proportion in each specimen across four chains.

theta_aligned <- thetaAligned(
  theta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  )

Align ASV proportion

We align the ASVs proportion in each topic across four chains.

# an array (iterations *topic * ASV)
beta <- samples$beta 

# an array (iterations *topic * ASV)
beta_aligned <- betaAligned(
  beta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  )

Visualization

Plot topic proportion

Plot the topic proportion in specimens and we can save it for publication purposes. We can draw an informative summary of bacterial communities in each phenotype.

plotTopicProportion(
  ps,
  theta_aligned,
  K,
  col_names_theta_all = c("iteration", "Sample", "Topic", "topic.dis"),
  chain = 4,
  iterUse = iter/2,
  SampleIDVarInphyloseq = "unique_names",
  design = ~ pna
)

Plot ASVs distribution

We plot the ASVs distribution in each topic.

plotASVCircular(
  ps,
  K,
  iterUse = iter/2,
  chain = chains,
  beta_aligned,
  varnames = c("iterations", "topic", "rsv_ix"),
  value_name = "beta_h",
  taxonomylevel = "Class",
  thresholdASVprop = 0.008
  )

Model assessement

We do goodness of fit test using the posterior estimates and observed data.

modelAssessment(
  ps,
  stan.fit,
  iterUse = iter/2,
  statistic = "max",
  ASVsIndexToPlot = c(1, 3, 10:14, 19:26, 36, 51:53, 148)
)

Diagnostics

Effective sample size (ESS)

We compute effective sample size for each parameter.

p <- diagnosticsPlot(
  theta_aligned = theta_aligned,
  beta_aligned = beta_aligned
  iterUse = iter/2,
  chain = chains
)
p_ess_bulk <- p[[[1]]]
p_ess_bulk

$\hat{R}$

We can check whether $\hat{R}$ is less than 1.02

p_rhat <- p[[2]]
p_rhat

Differential topic analysis

Test on all specimens

dds_all <- diffTopAnalysis(
  design = ~ pna,
  ps,
  theta_aligned,
  subsetSample = sample_names(ps),
  fitType = "mean"
)
results(dds_all)

Test on selected specimens

Let's test on the paired-specimens.

dds_all_paired <- diffTopAnalysis(
  design = ~ pna,
  ps,
  theta_aligned,
  subsetSample = paired_specimens
)
results(dds_all_paired)

PratheepaJ/diffTop documentation built on Dec. 18, 2021, 7:48 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

PratheepaJ/diffTop
Differential topic analysis using latent Dirichlet allocation

In PratheepaJ/diffTop: Differential topic analysis using latent Dirichlet allocation

Standard workflow

Quick start

Input data

Posterior sampling

Alignment

Align topic proportion

Align ASV proportion

Visualization

Plot topic proportion

Plot ASVs distribution

Model assessement

Diagnostics

Effective sample size (ESS)

$\hat{R}$

Differential topic analysis

Test on all specimens

Test on selected specimens

R Package Documentation

Browse R Packages

We want your feedback!

PratheepaJ/diffTop Differential topic analysis using latent Dirichlet allocation

In PratheepaJ/diffTop: Differential topic analysis using latent Dirichlet allocation

Standard workflow

Quick start

Input data

Posterior sampling

Alignment

Align topic proportion

Align ASV proportion

Visualization

Plot topic proportion

Plot ASVs distribution

Model assessement

Diagnostics

Effective sample size (ESS)

$\hat{R}$

Differential topic analysis

Test on all specimens

Test on selected specimens

R Package Documentation

Browse R Packages

We want your feedback!

PratheepaJ/diffTop
Differential topic analysis using latent Dirichlet allocation