knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Install R and RStudio. Open this Rmd
r Githubpkg("rstudio/rmarkdown")
file in vignettes
folder in RStudio. Then run the following code to install all required packages, including BARBI from Github repository.
pkgs <- c("phyloseq", "dplyr", "HDInterval", "grid", "gtable", "gridExtra", "magrittr", "ggplot2", "DESeq2") if (!requireNamespace("BiocManager")){ install.packages("BiocManager") } BiocManager::install(setdiff(pkgs, installed.packages())) devtools::install_github("PratheepaJ/BARBI")
Load packages:
library(BARBI) library(phyloseq) library(dplyr) library(HDInterval) library(grid) library(gtable) library(gridExtra) library(magrittr) library(ggplot2) library(DESeq2) library(reshape2) library(ggwordcloud)
We load an example dataset stored as a phyloseq object in the BARBI
package (or use your own data stored as a phyloseq object)
We validated our the BARBI method for removing DNA contamination using a dilution series of samples of eight known bacterial species in the standard ZymoBIOMICS microbial community (Zymo Research).
We saved this data as phyloseq object in the BARBI package.
Seven rounds of six-fold dilutions from the standard, from 1:1 up to 1:279,936 ($n_{1} = 8$), as well as ten negative extraction controls $n_{2} = 10$, were made.
Then all $N =18$ specimens were processed and analyzed with the ZymoBIOMICS® Service: Targeted Metagenomic Sequencing (Zymo Research, Irvine, CA), which leverages 16S rRNA gene sequencing.
ps <- ps
Specify that the samples are on the columns and taxa are on the rows of otu_table
.
if(dim(otu_table(ps))[1]!=ntaxa(ps)){ otu_table(ps) <- t(otu_table(ps))}
To reduce the batch-effects of contamination, we can specify the block information and analyze each block separately with BARBI.
We highly recommend that you keep track of batch effects (Especially DNA extraction and library prep batches), visualize your data with PCA on ranks, and separate your data into different blocks of necessary.
In the example dataset, all samples are in one block.
blocks <- rep("Set1", nsamples(ps)) sample_data(ps)$block <- blocks
Identify the taxa that are not present in at least one dilution series sample and removed them from the phyloseq object. Label these species as contaminants.
ps <- prune_taxa(taxa_sums(ps) > 0, ps) ps_specimen <- subset_samples(ps, SampleType %in% c("Standard")) prevTaxaP <- apply(otu_table(ps_specimen), 1, function(x){sum(x>0)}) Contaminants1 <- names(prevTaxaP)[prevTaxaP == 0] ps <- prune_taxa(prevTaxaP > 0, ps) ps
We identified 142 ASVs not is any dilution series samples, and they are classified as contaminants before using BARBI.
We use BARBI to infer true ASVs in each dilution series sample.
We check the distribution of sample library depth to see whether there are samples with very small library depth that should be dropped from the analysis.
totalReads <- colSums(otu_table(ps)) hist(log(totalReads), yaxs="i", xaxs="i", main="Distribution of total reads per sample", breaks=50)
We do not need to drop any sample.
We look at a summary of the specimens and negative control samples in each block.
table(sample_data(ps)$SampleType,sample_data(ps)$block)
We use BARBI to identify contaminants in each block separately.
Thus, we split the phyloseq object into multiple phyloseq objects corresponding to each block, and store the phyloseq objects as a list of phyloseq objects, psByBlock
.
We select negative control samples from each block and store as a list of phyloseq objects, psNCbyBlock
.
We select all taxa that have a prevalence of zero (i.e., have zero reads) in all negative control samples for each block and store as a list of phyloseq objects, psallzeroInNC
.
We select all specimen samples from each block and store as a list of phyloseq objects, psPlByBlock
.
psBlockResult <- psBlockResults(ps, sampleTypeVar = "SampleType", caselevels = c("Standard"), controllevel = c("Negative"), sampleName = "sampleID", blockVar = "block") psByBlock <- psBlockResult[[1]] psNCbyBlock <- psBlockResult[[2]] psallzeroInNC <- psBlockResult[[3]] psPlByBlock <- psBlockResult[[4]]
Estimate the gamma density parameters for the contaminant intensities using the negative control samples for each block.
$\lambda_{il}^{(c)} \sim \text{gamma }\left(\frac{1}{\gamma_{i}^{0}},\frac{1}{\gamma_{i}^{0} \mu_{il}^{0}}\right),$ $l$ is the $l$-th negative control.
con_int_neg_ctrl <- alphaBetaNegControl(psNCbyBlock = psNCbyBlock)
For each specimen, we estimate the gamma density parameters for the contaminant intensities using the scaling property of the gamma distribution.
$\lambda_{ij}^{(c)} \sim \text{gamma }\left( \frac{d^{0}{l} }{d{j}} \frac{1}{\gamma_{i}^{0}},\frac{1}{\gamma_{i}^{0} \mu_{il}^{0}}\right),$ where $j$ is the $j$-th specimen.
num_blks <- length(con_int_neg_ctrl) blks <- seq(1, num_blks) %>% as.list con_int_specimen <- lapply(blks, function(x){ con_int_specimen_each_blk <- alphaBetaContInPlasma(psPlByBlock = psPlByBlock, psallzeroInNC = psallzeroInNC, blk = x, alphaBetaNegControl = con_int_neg_ctrl) return(con_int_specimen_each_blk) })
For all specimen samples and for all taxa, sample from the posterior for the true intensities using the Metropolis-Hasting MCMC.
We need to specify the number of iterations in the MCMC using the option itera
.
Save the gamma prior for the intensity of contamination and the posterior samples.
The suggested itera is 10,000.
itera = 100 t1 <- proc.time() mar_post_true_intensities <- lapply(blks,function(x){ mar_post_true_intensities_each_blk <- samplingPosterior(psPlByBlock = psPlByBlock, blk = x, gammaPrior_Cont = con_int_specimen[[x]], itera = itera) return(mar_post_true_intensities_each_blk) }) proc.time()-t1 con_int_specimen_mar_post_true_intensities <- list(con_int_specimen, mar_post_true_intensities)
saveRDS(con_int_specimen_mar_post_true_intensities, file= "./con_int_specimen_mar_post_true_intensities_vignettes.rds")
Choose the number of MCMC to be removed using the option burnIn
. It must be less than itera
.
Choose the coverage probability to construct the highest posterior density (HPD) interval$\left(L_{ij}^{(r)}, U_{ij}^{(r)}\right)$ (for each taxon $i$ in a specimen $j$) using the option cov.pro
for true intensities.
Compute the highest density interval (HDI) for the contaminant intensities $\left(L_{ij}^{(c)}, U_{ij}^{(c)}\right)$ for each taxon $i$ in a specimen $j$.
For a contaminant taxon, the lower limit $L_{ij}^{(r)}$ will be smaller than the upper limit $U_{ij}^{(c)}$.
The suggested burnIn is 5000 for itera <- 10,000.
ASV <- as.character(paste0("ASV_",seq(1,ntaxa(ps)))) ASV.Genus <- paste0("ASV_",seq(1,ntaxa(ps)),"_",as.character(tax_table(ps)[,6])) ASV.Genus.Species <- paste0(ASV,"_",as.character(tax_table(ps)[,6]),"_", as.character(tax_table(ps)[,7])) df.ASV <- data.frame(seq.variant = taxa_names(ps), ASV = ASV, ASV.Genus = ASV.Genus, ASV.Genus.Species = ASV.Genus.Species)
itera <- 100 burnIn <- 10 cov.pro <- .95 mak_tab <- FALSE # Save tables or print tables # con_int_specimen_mar_post_true_intensities <- readRDS("./con_int_specimen_mar_post_true_intensities_vignettes.rds") con_int_specimen <- con_int_specimen_mar_post_true_intensities[[1]] mar_post_true_intensities <- con_int_specimen_mar_post_true_intensities[[2]] ## Keep true all_true_taxa_blk <- list() for(blk in 1:num_blks){ mar_post_true_intensities_blk <- mar_post_true_intensities[[blk]] con_int_specimen_blk <- con_int_specimen[[blk]] all_true_taxa <- character() for(sam in 1:nsamples(psPlByBlock[[blk]])){ taxa_post <- mar_post_true_intensities_blk[[sam]] acceptance <- list() lower.r <- list() upper.r <- list() lower.c <- list() upper.c <- list() all.zero.nc <- list() for(taxa in 1:length(taxa_post)){ burnIn <- burnIn acceptance[[taxa]] <- 1 - mean(duplicated(taxa_post[[taxa]][-(1:burnIn),])) HPD.r <- hdi(taxa_post[[taxa]][-(1:burnIn),], credMass = cov.pro) lower.r[[taxa]] <- round(HPD.r[1], digits = 0) upper.r[[taxa]] <- round(HPD.r[2], digits = 0) lamda.c <- rgamma((itera-burnIn+1), shape= con_int_specimen_blk[[sam]][[1]][taxa], rate = con_int_specimen_blk[[sam]][[2]][taxa]) HDI.c <- hdi(lamda.c, credMass = cov.pro) lower.c[[taxa]] <- round(HDI.c[1], digits = 0) upper.c[[taxa]] <- round(HDI.c[2], digits = 0) all.zero.nc[[taxa]] <- con_int_specimen_blk[[sam]][[5]][taxa] } tax_names <- taxa_names(psPlByBlock[[blk]]) tax_names <- df.ASV$ASV.Genus[which(as.character(df.ASV$seq.variant) %in% tax_names)] df <- data.frame(Species = tax_names, xj = as.numeric(con_int_specimen_blk[[sam]][[3]]), l.r = unlist(lower.r), u.r = unlist(upper.r), l.c = unlist(lower.c), u.c = unlist(upper.c), all.zero.nc = unlist(all.zero.nc)) # List all true taxa df <- arrange(filter(df,(l.r > u.c) & (l.r > 0)), desc(xj)) # If there is no true taxa if(dim(df)[1]==0){ df <- data.frame(Species="Negative", xj="Negative", l.r="Negative", u.r="Negative", l.c ="Negative", u.c="Negative", all.zero.nc = "Negative") } # collect all true taxa in the specimen all_true_taxa <- c(all_true_taxa, as.character(df$Species)) if(mak_tab){ filname <- paste("./", sample_names(psPlByBlock[[blk]])[sam], ".png", sep = "") png(filname, height = 600, width = 750) df.p <- tableGrob(df) title <- textGrob(sample_names(psPlByBlock[[blk]])[sam], gp = gpar(fontsize = 12)) padding <- unit(0.5,"line") df.p <- gtable_add_rows(df.p, heights = grobHeight(title) + padding, pos = 0) df.p <- gtable_add_grob(df.p, list(title), t = 1, l = 1, r = ncol(df.p)) grid.newpage() grid.draw(df.p) dev.off() }else{ df.p <- tableGrob(df) title <- textGrob(sample_names(psPlByBlock[[blk]])[sam], gp = gpar(fontsize = 12)) padding <- unit(0.5,"line") df.p <- gtable_add_rows(df.p, heights = grobHeight(title) + padding, pos = 0) df.p <- gtable_add_grob(df.p, list(title), t = 1, l = 1, r = ncol(df.p)) grid.newpage() grid.draw(df.p) } all_true_taxa <- unique(all_true_taxa) } all_true_taxa_blk[[blk]] <- all_true_taxa }
all_true_taxa_blk <- unlist(all_true_taxa_blk) ASV = df.ASV$seq.variant[which(as.character(df.ASV$ASV.Genus) %in% as.character(all_true_taxa_blk))] %>% as.character() ps_decon <- prune_taxa(ASV, ps) ps_decon
set.seed(10000) itera <- 100 burnIn <- 10 cov.pro <- .95 num_blks <- length(psByBlock) # con_int_specimen_mar_post_true_intensities <- readRDS("./con_int_specimen_mar_post_true_intensities_vignettes.rds") con_int_specimen <- con_int_specimen_mar_post_true_intensities[[1]] mar_post_true_intensities <- con_int_specimen_mar_post_true_intensities[[2]] blk <- 1 con_int_specimen_blk <- con_int_specimen[[blk]] mar_post_true_intensities_blk <- mar_post_true_intensities[[blk]] sample.names <- sample_names(psPlByBlock[[blk]]) for(j in 1: length(sample.names)){ desired.sample.name <- sample.names[j] desired.sample.index <- which(sample_names(psPlByBlock[[blk]]) %in% desired.sample.name) tax_interested <- rownames(sort(otu_table(psPlByBlock[[blk]])[,desired.sample.index],decreasing = TRUE))[c(1:16)] tax_interested_ind <- which(as.character(taxa_names(psPlByBlock[[blk]])) %in% tax_interested) tax_names <- taxa_names(psPlByBlock[[blk]])[tax_interested_ind] tax_names <- df.ASV$ASV.Genus[which(as.character(df.ASV$seq.variant) %in% tax_names)] %>% as.character() taxa.post <- mar_post_true_intensities_blk[[desired.sample.index]] burnIn <- burnIn signal.hist <- taxa.post[tax_interested_ind] signal.hist <- lapply(signal.hist,function(x){x[-(1:burnIn),]}) signal.df <- data.frame(do.call("cbind", signal.hist)) colnames(signal.df) <- tax_names signal.df$group <- rep("True", length=dim(signal.df)[1]) bg <- list() for(ind in 1:length(tax_interested_ind)){ bg[[ind]] <- rgamma((itera-burnIn+1), shape = con_int_specimen_blk[[desired.sample.index]][[1]][tax_interested_ind[ind]], rate = con_int_specimen_blk[[desired.sample.index]][[2]][tax_interested_ind[ind]]) } bg.df <- data.frame(do.call("cbind",bg)) colnames(bg.df) <- tax_names bg.df$group <- rep("Contaminant", length=dim(bg.df)[1]) bg.signal <- rbind(signal.df, bg.df) bg.signal$group <- as.factor(bg.signal$group) bg_sig_long <- tidyr::gather(bg.signal,key="Taxa", value="Reads",1:(dim(bg.signal)[2]-1)) bg_sig_long$Taxa <- as.factor(bg_sig_long$Taxa) p <- ggplot(bg_sig_long, aes(x= Reads))+ geom_density(aes(y = ..scaled.., fill = group, color = group)) + facet_wrap(~Taxa,scales = "free")+ scale_fill_manual(values=c("blue","brown")) + scale_color_manual(values=c("blue","brown")) + ggtitle(desired.sample.name)+ theme(plot.title = element_text(hjust = 0.5), legend.title=element_blank(), strip.text.x = element_text(size=5), strip.background = element_blank(), panel.grid = element_blank(), panel.background = element_blank()) + xlab("") + ylab("density") print(p) # fileN <- paste0("./Figures/", desired.sample.name,"_hist",".eps") # ggsave(fileN, plot = p, width = 10, height = 5, device = "eps") }
Co-occurrences of contaminants provide important information about their potential source and we provide a topic modeling approach that identifies "contaminant topics" and the distribution of taxa in each topic. This model helps identify contaminant sources related to each topic. This is important in the appropriate design of followup experiments.
short.sample.names = c(paste0("NC.", c(1,10)), paste0("Di.", seq(1,8)), paste0("NC.", seq(2,9))) sample_names(ps) = short.sample.names x = t(get_taxa(ps)) dimnames(x) = NULL K = 4 stan.data <- list(K = K, V = ncol(x), D = nrow(x), n = x, alpha = rep(1, K), gamma = rep(0.5, ncol(x)) ) stan.fit = LDAtopicmodel(stan_data = stan.data, iter = 100, chains = 1)
samples = rstan::extract(stan.fit, permuted = TRUE, inc_warmup = FALSE, include = TRUE)
theta = samples$theta names(theta) = c("theta", "Sample", "Topic") dimnames(theta)[[2]] = sample_names(ps) dimnames(theta)[[3]] = c(paste0("Topic ", seq(1,K))) theta.all = melt(theta) colnames(theta.all) = c("Iteration", "Sample", "Topic", "topic.dis") theta.all$Sample = factor(theta.all$Sample) theta.all$Topic = factor(theta.all$Topic) # add control or dilution series sam = sample_data(ps) %>% data.frame() sam$unique_names = rownames(sam) theta.all = left_join(theta.all, sam, by =c("Sample"= "unique_names")) theta.all$Sample_Type = factor(theta.all$SampleType)
ggplot(data = theta.all) + geom_boxplot(aes(x = Sample, y = topic.dis, color = Topic)) + facet_grid(Topic ~ Sample_Type, scales = "free_x") + ylab(bquote(theta[k])) + ggtitle("") + theme(plot.title = element_text(hjust = 0.5, size = 15), strip.text.y= element_text(size = 12), strip.text.x = element_text(size = 12), axis.title.y = element_text(size = 12), legend.position = "none")
beta = samples$beta dimnames(beta)[[2]] = c(paste0("Topic ", seq(1,K))) tax_tab = tax_table(ps) %>% data.frame() tax_tab = mutate(tax_tab, seq.variant = rownames(tax_tab)) dimnames(beta)[[3]] =tax_tab[, "seq.variant"] beta.all = melt(beta) colnames(beta.all) = c("Chain", "Topic", "ASV", "ASV.distribution") beta.all$ASV = as.character(beta.all$ASV) beta.all = left_join(beta.all, tax_tab, by = c("ASV"= "seq.variant")) beta.all$Topic = factor(beta.all$Topic) beta.all$ASV = factor(beta.all$ASV)
max.beta.in.each.asv.all.topics = group_by(beta.all, Topic, Family, Genus) %>% summarise(max_beta = max(ASV.distribution)) %>% top_n(10, max_beta) %>% as.data.frame() ggplot(max.beta.in.each.asv.all.topics, aes(label = Genus, size = max_beta, color = Family)) + geom_text_wordcloud() + theme_minimal() + scale_size_area(max_size = 8) + facet_wrap(~ Topic) + theme(strip.text.x = element_text(size = 12, face = "bold"))
We observed two contaminant topics.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.