In PratheepaJ/BARBI: A Bayesian Approach to Contamination Removal in Molecular Microbial Studies

knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE)

Install R and RStudio. Open this Rmd r Githubpkg("rstudio/rmarkdown") file in vignettes folder in RStudio. Then run the following code to install all required packages, including BARBI from Github repository.

pkgs <- c("phyloseq",
         "dplyr",
         "HDInterval",
         "grid",
         "gtable",
         "gridExtra",
         "magrittr",
         "ggplot2",
         "DESeq2")

if (!requireNamespace("BiocManager")){
  install.packages("BiocManager")
}

BiocManager::install(setdiff(pkgs, installed.packages()))

devtools::install_github("PratheepaJ/BARBI")

Load packages:

library(BARBI)
library(phyloseq)
library(dplyr)
library(HDInterval)
library(grid)
library(gtable)
library(gridExtra)
library(magrittr)
library(ggplot2)
library(DESeq2)
library(reshape2)
library(ggwordcloud)

Example dataset

We load an example dataset stored as a phyloseq object in the BARBI package (or use your own data stored as a phyloseq object)

Load the phyloseq object

We validated our the BARBI method for removing DNA contamination using a dilution series of samples of eight known bacterial species in the standard ZymoBIOMICS microbial community (Zymo Research).
We saved this data as phyloseq object in the BARBI package.
Seven rounds of six-fold dilutions from the standard, from 1:1 up to 1:279,936 ($n_{1} = 8$), as well as ten negative extraction controls $n_{2} = 10$, were made.
Then all $N =18$ specimens were processed and analyzed with the ZymoBIOMICS® Service: Targeted Metagenomic Sequencing (Zymo Research, Irvine, CA), which leverages 16S rRNA gene sequencing.

ps <- ps

Specify that the samples are on the columns and taxa are on the rows of otu_table.

if(dim(otu_table(ps))[1]!=ntaxa(ps)){
  otu_table(ps) <- t(otu_table(ps))}

Adding blocks/batches

To reduce the batch-effects of contamination, we can specify the block information and analyze each block separately with BARBI.

We highly recommend that you keep track of batch effects (Especially DNA extraction and library prep batches), visualize your data with PCA on ranks, and separate your data into different blocks of necessary.

In the example dataset, all samples are in one block.

blocks <- rep("Set1", nsamples(ps))

sample_data(ps)$block <- blocks

Remove taxa not in dilution series samples

Identify the taxa that are not present in at least one dilution series sample and removed them from the phyloseq object. Label these species as contaminants.

ps <- prune_taxa(taxa_sums(ps) > 0, ps)
ps_specimen <-  subset_samples(ps, 
                               SampleType %in% c("Standard"))
prevTaxaP <- apply(otu_table(ps_specimen), 1,
                   function(x){sum(x>0)})

Contaminants1 <- names(prevTaxaP)[prevTaxaP == 0]
ps <- prune_taxa(prevTaxaP > 0, ps)
ps

We identified 142 ASVs not is any dilution series samples, and they are classified as contaminants before using BARBI.

We use BARBI to infer true ASVs in each dilution series sample.

Library depth

We check the distribution of sample library depth to see whether there are samples with very small library depth that should be dropped from the analysis.

totalReads <- colSums(otu_table(ps))
hist(log(totalReads), 
     yaxs="i", 
     xaxs="i", 
     main="Distribution of total reads per sample", 
     breaks=50)

We do not need to drop any sample.

Summary

We look at a summary of the specimens and negative control samples in each block.

table(sample_data(ps)$SampleType,sample_data(ps)$block)

BARBI

Prepare the phyloseq object for the BARBI method

We use BARBI to identify contaminants in each block separately. Thus, we split the phyloseq object into multiple phyloseq objects corresponding to each block, and store the phyloseq objects as a list of phyloseq objects, psByBlock.

We select negative control samples from each block and store as a list of phyloseq objects, psNCbyBlock.

We select all taxa that have a prevalence of zero (i.e., have zero reads) in all negative control samples for each block and store as a list of phyloseq objects, psallzeroInNC.

We select all specimen samples from each block and store as a list of phyloseq objects, psPlByBlock.

psBlockResult <- psBlockResults(ps, 
                               sampleTypeVar = "SampleType",
                               caselevels = c("Standard"),
                               controllevel = c("Negative"),
                               sampleName = "sampleID", 
                               blockVar = "block")

psByBlock <- psBlockResult[[1]]
psNCbyBlock <- psBlockResult[[2]]
psallzeroInNC <- psBlockResult[[3]]
psPlByBlock <- psBlockResult[[4]]

Estimate the density parameters for the contaminant intensities in negative control samples

Estimate the gamma density parameters for the contaminant intensities using the negative control samples for each block.

$\lambda_{il}^{(c)} \sim \text{gamma }\left(\frac{1}{\gamma_{i}^{0}},\frac{1}{\gamma_{i}^{0} \mu_{il}^{0}}\right),$ $l$ is the $l$-th negative control.

con_int_neg_ctrl <- alphaBetaNegControl(psNCbyBlock = psNCbyBlock)

Estimate the density parameters for the contaminant intensities in each specimen

For each specimen, we estimate the gamma density parameters for the contaminant intensities using the scaling property of the gamma distribution.

$\lambda_{ij}^{(c)} \sim \text{gamma }\left( \frac{d^{0}{l} }{d{j}} \frac{1}{\gamma_{i}^{0}},\frac{1}{\gamma_{i}^{0} \mu_{il}^{0}}\right),$ where $j$ is the $j$-th specimen.

num_blks <- length(con_int_neg_ctrl)
blks <- seq(1, num_blks) %>% as.list

con_int_specimen <- lapply(blks, function(x){
    con_int_specimen_each_blk <- alphaBetaContInPlasma(psPlByBlock = psPlByBlock,
                                                       psallzeroInNC = psallzeroInNC,
                                                       blk = x,
                                                       alphaBetaNegControl = con_int_neg_ctrl)
        return(con_int_specimen_each_blk)
})

Sample from the marginal posterior for the true intensities

For all specimen samples and for all taxa, sample from the posterior for the true intensities using the Metropolis-Hasting MCMC.

We need to specify the number of iterations in the MCMC using the option itera.

Save the gamma prior for the intensity of contamination and the posterior samples.

The suggested itera is 10,000.

itera = 100
t1 <- proc.time()

mar_post_true_intensities <- lapply(blks,function(x){
    mar_post_true_intensities_each_blk <- samplingPosterior(psPlByBlock = psPlByBlock,
                                                            blk = x,
                                                            gammaPrior_Cont = con_int_specimen[[x]],
                                                            itera = itera)
    return(mar_post_true_intensities_each_blk)
})

proc.time()-t1


con_int_specimen_mar_post_true_intensities <- list(con_int_specimen, mar_post_true_intensities)

Save the results.

saveRDS(con_int_specimen_mar_post_true_intensities, 
        file= "./con_int_specimen_mar_post_true_intensities_vignettes.rds")

Make summaries from the BARBI results.

Make tables for each sample

Choose the number of MCMC to be removed using the option burnIn. It must be less than itera.

Choose the coverage probability to construct the highest posterior density (HPD) interval$\left(L_{ij}^{(r)}, U_{ij}^{(r)}\right)$ (for each taxon $i$ in a specimen $j$) using the option cov.pro for true intensities.

Compute the highest density interval (HDI) for the contaminant intensities $\left(L_{ij}^{(c)}, U_{ij}^{(c)}\right)$ for each taxon $i$ in a specimen $j$.

For a contaminant taxon, the lower limit $L_{ij}^{(r)}$ will be smaller than the upper limit $U_{ij}^{(c)}$.

The suggested burnIn is 5000 for itera <- 10,000.

ASV <- as.character(paste0("ASV_",seq(1,ntaxa(ps))))
ASV.Genus <- paste0("ASV_",seq(1,ntaxa(ps)),"_",as.character(tax_table(ps)[,6]))
ASV.Genus.Species <- paste0(ASV,"_",as.character(tax_table(ps)[,6]),"_", as.character(tax_table(ps)[,7]))

df.ASV <- data.frame(seq.variant = taxa_names(ps), ASV = ASV, ASV.Genus = ASV.Genus, ASV.Genus.Species = ASV.Genus.Species)

itera <- 100
burnIn <- 10
cov.pro <- .95
mak_tab <- FALSE # Save tables or print tables 

# con_int_specimen_mar_post_true_intensities <- readRDS("./con_int_specimen_mar_post_true_intensities_vignettes.rds")

con_int_specimen <- con_int_specimen_mar_post_true_intensities[[1]]
mar_post_true_intensities <- con_int_specimen_mar_post_true_intensities[[2]]

## Keep true 
all_true_taxa_blk <- list()

for(blk in 1:num_blks){

  mar_post_true_intensities_blk <- mar_post_true_intensities[[blk]]
  con_int_specimen_blk <- con_int_specimen[[blk]]

  all_true_taxa <- character()

  for(sam in 1:nsamples(psPlByBlock[[blk]])){
      taxa_post <- mar_post_true_intensities_blk[[sam]]
      acceptance <- list()
      lower.r <- list()
      upper.r <- list()
      lower.c <- list()
      upper.c <- list()
      all.zero.nc <- list()

      for(taxa in 1:length(taxa_post)){
        burnIn  <- burnIn
        acceptance[[taxa]]  <-  1 - mean(duplicated(taxa_post[[taxa]][-(1:burnIn),]))

        HPD.r <- hdi(taxa_post[[taxa]][-(1:burnIn),],
                    credMass = cov.pro)
        lower.r[[taxa]] <- round(HPD.r[1], digits = 0)
        upper.r[[taxa]] <- round(HPD.r[2], digits = 0)
        lamda.c <- rgamma((itera-burnIn+1), 
                    shape= con_int_specimen_blk[[sam]][[1]][taxa],
                    rate = con_int_specimen_blk[[sam]][[2]][taxa])

        HDI.c <- hdi(lamda.c, credMass = cov.pro)
        lower.c[[taxa]] <- round(HDI.c[1], digits = 0)
        upper.c[[taxa]] <- round(HDI.c[2], digits = 0)

        all.zero.nc[[taxa]] <-  con_int_specimen_blk[[sam]][[5]][taxa]
      }

    tax_names <- taxa_names(psPlByBlock[[blk]])
    tax_names <- df.ASV$ASV.Genus[which(as.character(df.ASV$seq.variant) %in%  tax_names)]

    df <- data.frame(Species = tax_names,
                    xj = as.numeric(con_int_specimen_blk[[sam]][[3]]),
                    l.r = unlist(lower.r),
                    u.r = unlist(upper.r),
                    l.c = unlist(lower.c),
                    u.c = unlist(upper.c),
                    all.zero.nc = unlist(all.zero.nc))


      # List all true taxa
      df <- arrange(filter(df,(l.r > u.c) & (l.r > 0)),
                   desc(xj))

      # If there is no true taxa
      if(dim(df)[1]==0){
          df <- data.frame(Species="Negative",
                           xj="Negative",
                           l.r="Negative",
                           u.r="Negative",
                           l.c ="Negative",
                           u.c="Negative",
                           all.zero.nc = "Negative")
      }



      # collect all true taxa in the specimen
      all_true_taxa <- c(all_true_taxa,
                        as.character(df$Species))

      if(mak_tab){
        filname <- paste("./",
                         sample_names(psPlByBlock[[blk]])[sam],
                        ".png",
                        sep = "")

        png(filname, height = 600, width = 750)

        df.p <- tableGrob(df)
        title <- textGrob(sample_names(psPlByBlock[[blk]])[sam], 
                         gp = gpar(fontsize = 12))

        padding <- unit(0.5,"line")

        df.p <- gtable_add_rows(df.p, 
                               heights = grobHeight(title) + padding, 
                               pos = 0)

        df.p <- gtable_add_grob(df.p, 
                               list(title),
                               t = 1, 
                               l = 1, 
                               r = ncol(df.p))

        grid.newpage()
        grid.draw(df.p)
        dev.off()

      }else{
        df.p <- tableGrob(df)
        title <- textGrob(sample_names(psPlByBlock[[blk]])[sam], 
                         gp = gpar(fontsize = 12))

        padding <- unit(0.5,"line")

        df.p <- gtable_add_rows(df.p, 
                               heights = grobHeight(title) + padding, 
                               pos = 0)

        df.p <- gtable_add_grob(df.p, 
                               list(title),
                               t = 1, 
                               l = 1, 
                               r = ncol(df.p))
        grid.newpage()
        grid.draw(df.p)
      }


      all_true_taxa <- unique(all_true_taxa)
  }

  all_true_taxa_blk[[blk]] <- all_true_taxa
}

Construct a phyloseq object with the true taxa

all_true_taxa_blk <- unlist(all_true_taxa_blk)
ASV = df.ASV$seq.variant[which(as.character(df.ASV$ASV.Genus) %in% as.character(all_true_taxa_blk))] %>% as.character()
ps_decon <- prune_taxa(ASV, ps)
ps_decon

Histograms

set.seed(10000)
itera <- 100
burnIn <- 10
cov.pro <- .95

num_blks <- length(psByBlock)

# con_int_specimen_mar_post_true_intensities <- readRDS("./con_int_specimen_mar_post_true_intensities_vignettes.rds")

con_int_specimen <- con_int_specimen_mar_post_true_intensities[[1]]
mar_post_true_intensities <- con_int_specimen_mar_post_true_intensities[[2]]

blk <- 1

con_int_specimen_blk <- con_int_specimen[[blk]]
mar_post_true_intensities_blk <- mar_post_true_intensities[[blk]]


sample.names <- sample_names(psPlByBlock[[blk]])

for(j in 1: length(sample.names)){
    desired.sample.name <- sample.names[j]
    desired.sample.index <- which(sample_names(psPlByBlock[[blk]]) %in% desired.sample.name)
    tax_interested <- rownames(sort(otu_table(psPlByBlock[[blk]])[,desired.sample.index],decreasing = TRUE))[c(1:16)]
    tax_interested_ind <- which(as.character(taxa_names(psPlByBlock[[blk]])) %in% tax_interested)
    tax_names <- taxa_names(psPlByBlock[[blk]])[tax_interested_ind]
    tax_names <- df.ASV$ASV.Genus[which(as.character(df.ASV$seq.variant) %in%  tax_names)] %>% as.character()

    taxa.post <- mar_post_true_intensities_blk[[desired.sample.index]]

    burnIn <- burnIn
    signal.hist <- taxa.post[tax_interested_ind]
    signal.hist <- lapply(signal.hist,function(x){x[-(1:burnIn),]})
    signal.df <- data.frame(do.call("cbind", signal.hist))
    colnames(signal.df) <- tax_names
    signal.df$group <- rep("True", length=dim(signal.df)[1])

    bg <- list()
    for(ind in 1:length(tax_interested_ind)){
        bg[[ind]] <- rgamma((itera-burnIn+1),
                            shape = con_int_specimen_blk[[desired.sample.index]][[1]][tax_interested_ind[ind]],
                            rate = con_int_specimen_blk[[desired.sample.index]][[2]][tax_interested_ind[ind]])
    }

    bg.df <- data.frame(do.call("cbind",bg))
    colnames(bg.df) <- tax_names
    bg.df$group <- rep("Contaminant", length=dim(bg.df)[1])

    bg.signal <- rbind(signal.df, bg.df)
    bg.signal$group <- as.factor(bg.signal$group)
    bg_sig_long <- tidyr::gather(bg.signal,key="Taxa",
                                 value="Reads",1:(dim(bg.signal)[2]-1))
    bg_sig_long$Taxa <- as.factor(bg_sig_long$Taxa)

    p <- ggplot(bg_sig_long, aes(x= Reads))+
      geom_density(aes(y = ..scaled.., fill = group, color = group)) +
      facet_wrap(~Taxa,scales = "free")+
      scale_fill_manual(values=c("blue","brown")) +
      scale_color_manual(values=c("blue","brown")) +
      ggtitle(desired.sample.name)+
      theme(plot.title = element_text(hjust = 0.5),
            legend.title=element_blank(), 
            strip.text.x = element_text(size=5),
            strip.background = element_blank(), 
            panel.grid = element_blank(), 
            panel.background = element_blank()) + 
      xlab("") + 
      ylab("density")

    print(p)
    # fileN <- paste0("./Figures/", desired.sample.name,"_hist",".eps")
    # ggsave(fileN, plot = p, width = 10, height = 5, device = "eps")
}

Latent Dirichlet Allocation

Co-occurrences of contaminants provide important information about their potential source and we provide a topic modeling approach that identifies "contaminant topics" and the distribution of taxa in each topic. This model helps identify contaminant sources related to each topic. This is important in the appropriate design of followup experiments.

short.sample.names = c(paste0("NC.", c(1,10)), 
                       paste0("Di.", seq(1,8)), 
                       paste0("NC.", seq(2,9)))
sample_names(ps) = short.sample.names

x = t(get_taxa(ps))
dimnames(x) = NULL
K = 4
stan.data <- list(K = K, 
  V = ncol(x), 
  D = nrow(x), 
  n = x, 
  alpha = rep(1, K), 
  gamma = rep(0.5, ncol(x))
)

stan.fit = LDAtopicmodel(stan_data = stan.data, iter = 100, chains = 1)

Extract posterior samples

$\beta$ and $\theta$

samples = rstan::extract(stan.fit, permuted = TRUE, inc_warmup = FALSE, include = TRUE)

Posterior distribution of topic in each sample

theta = samples$theta 
names(theta) = c("theta", "Sample", "Topic")
dimnames(theta)[[2]] = sample_names(ps)
dimnames(theta)[[3]] = c(paste0("Topic ", seq(1,K)))

theta.all = melt(theta)
colnames(theta.all) = c("Iteration", "Sample", "Topic", "topic.dis")

theta.all$Sample = factor(theta.all$Sample)
theta.all$Topic = factor(theta.all$Topic)

# add control or dilution series
sam = sample_data(ps) %>% data.frame()
sam$unique_names = rownames(sam)
theta.all = left_join(theta.all, sam, by =c("Sample"= "unique_names"))
theta.all$Sample_Type = factor(theta.all$SampleType)

Topic distirbution in each sample

ggplot(data = theta.all) + 
  geom_boxplot(aes(x = Sample, 
                   y = topic.dis, 
                   color = Topic)) + 
  facet_grid(Topic ~ Sample_Type, scales = "free_x") +
  ylab(bquote(theta[k])) + 
  ggtitle("") + 
  theme(plot.title = element_text(hjust = 0.5, size = 15), 
    strip.text.y= element_text(size = 12), 
    strip.text.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "none")

ASV cloud in each topic

beta = samples$beta
dimnames(beta)[[2]] = c(paste0("Topic ", seq(1,K)))

tax_tab = tax_table(ps) %>% data.frame()
tax_tab = mutate(tax_tab, seq.variant = rownames(tax_tab))

dimnames(beta)[[3]] =tax_tab[, "seq.variant"]
beta.all = melt(beta)
colnames(beta.all) = c("Chain", "Topic", "ASV", "ASV.distribution")
beta.all$ASV = as.character(beta.all$ASV)
beta.all = left_join(beta.all, tax_tab, by = c("ASV"= "seq.variant"))
beta.all$Topic = factor(beta.all$Topic)
beta.all$ASV = factor(beta.all$ASV)

max.beta.in.each.asv.all.topics = group_by(beta.all, 
                                           Topic, 
                                           Family, 
                                           Genus) %>% summarise(max_beta = max(ASV.distribution)) %>% top_n(10, max_beta) %>% as.data.frame()


ggplot(max.beta.in.each.asv.all.topics, 
                 aes(label = Genus, size = max_beta, color = Family)) + 
  geom_text_wordcloud() +
  theme_minimal() +
  scale_size_area(max_size = 8) + 
  facet_wrap(~ Topic) + 
  theme(strip.text.x = element_text(size = 12, face = "bold"))

We observed two contaminant topics.

Session Info

sessionInfo()

PratheepaJ/BARBI documentation built on April 16, 2020, 11:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

PratheepaJ/BARBI
A Bayesian Approach to Contamination Removal in Molecular Microbial Studies

In PratheepaJ/BARBI: A Bayesian Approach to Contamination Removal in Molecular Microbial Studies

Example dataset

Load the phyloseq object

Adding blocks/batches

Remove taxa not in dilution series samples

Library depth

Summary

BARBI

Prepare the phyloseq object for the BARBI method

Estimate the density parameters for the contaminant intensities in negative control samples

Estimate the density parameters for the contaminant intensities in each specimen

Sample from the marginal posterior for the true intensities

Save the results.

Make summaries from the BARBI results.

Make tables for each sample

Construct a phyloseq object with the true taxa

Histograms

Latent Dirichlet Allocation

Extract posterior samples

Posterior distribution of topic in each sample

Topic distirbution in each sample

ASV cloud in each topic

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

PratheepaJ/BARBI A Bayesian Approach to Contamination Removal in Molecular Microbial Studies

In PratheepaJ/BARBI: A Bayesian Approach to Contamination Removal in Molecular Microbial Studies

Example dataset

Load the phyloseq object

Adding blocks/batches

Remove taxa not in dilution series samples

Library depth

Summary

BARBI

Prepare the phyloseq object for the BARBI method

Estimate the density parameters for the contaminant intensities in negative control samples

Estimate the density parameters for the contaminant intensities in each specimen

Sample from the marginal posterior for the true intensities

Save the results.

Make summaries from the BARBI results.

Make tables for each sample

Construct a phyloseq object with the true taxa

Histograms

Latent Dirichlet Allocation

Extract posterior samples

Posterior distribution of topic in each sample

Topic distirbution in each sample

ASV cloud in each topic

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

PratheepaJ/BARBI
A Bayesian Approach to Contamination Removal in Molecular Microbial Studies