knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
NOTE While we continue to maintain this R package, the development has been discontinued as we have shifted to supporting methods development based on the new TreeSummarizedExperiment data container, which provides added capabilities for multi-omics data analysis. Check the miaverse project for details.
The microbiomeutilities
R package is part of the microbiome-verse tools that provides additional data handling and visualization support for the microbiome R/BioC package
Philosophy: "Seemingly simple tasks for experienced R users can always be further simplified for novice users"
install.packages("devtools") devtools::install_github("microsud/microbiomeutilities")
Load libraries
library(microbiomeutilities) library(microbiome) library(knitr) library(tibble) library(dplyr)
Package data from Zackular et al., 2014: The Gut Microbiome Modulates Colon Tumorigenesis.
Import test data.
data("zackular2014") ps0 <- zackular2014 ps0
The print_ps
from microbiomeutilities
can give additional information. See also microbiome::summarize_phyloseq
.
print_ps(ps0)
Most commonly it is observed that the taxonomy file has classification until a given taxonomic level. We can change the names in both otu table and taxonomy table with the best taxonomic classification available. This can be useful if the analysis has to be done at OTU/ASVs level. Only ID are less useful.
Check the taxonomy in phyloseq object.
kable(head(tax_table(ps0)))
Some have only g of s information.
data("zackular2014") p0 <- zackular2014 # reduce size for example ps0 <- core(p0, detection = 10, prevalence = 20 / 100) # Add a prefix to taxa labels ps0.f2 <- format_to_besthit(ps0, prefix = "MyBug-") kable(head(tax_table(ps0.f2))[3:6])
Now the available taxonomy is added.
As can be seen, the rownames have the OTUIDs and available taxonomic name(s).
Useful for QC purposes. Check for distribution of sequencing depth.
data("zackular2014") p0 <- zackular2014 p <- plot_read_distribution(p0, groups = "DiseaseState", plot.type = "density") print(p + theme_biome_utils())
This is a diagnostic step. Key to check if there is variation between groups that will be compared for downstream analysis.
Useful if the user wants to plot specific features.
Converting to long data format opens several opportunities to work with Tidyverse.
data("zackular2014") p0 <- zackular2014 # reduce size for example ps0 <- core(p0, detection = 10, prevalence = 20 / 100) pseq_df <- phy_to_ldf(ps0, transform.counts = NULL) kable(head(pseq_df))
One of the first questions arise regarding which taxa are present , how they are distributed in the data. This can be done with following functionality.
A quick check to see how different taxa are distributed in your data.
library(microbiomeutilities) library(RColorBrewer) library(patchwork) library(ggpubr) data("zackular2014") pseq <- zackular2014 # check healthy health_ps <- subset_samples(pseq, DiseaseState=="H") p_hc <- taxa_distribution(health_ps) + theme_biome_utils() + labs(title = "Healthy") # check CRC crc_ps <- subset_samples(pseq, DiseaseState=="CRC") p_crc <- taxa_distribution(crc_ps) + theme_biome_utils() + labs(title = "CRC") # harnessing the power of patchwork p_hc / p_crc + plot_layout(guides = "collect") + plot_annotation(tag_levels = "A")
There are Cyanobacteria/Chloroplast related sequences which can be removed if not expected in the samples.
Sometimes, we are interested in identifying the most dominant taxa in each sample. We may also wish to check what percent of samples within a given group are these taxa dominating.
library(microbiomeutilities) library(dplyr) data("zackular2014") p0 <- zackular2014 p0.gen <- aggregate_taxa(p0,"Genus") x.d <- dominant_taxa(p0,level = "Genus", group="DiseaseState") head(x.d$dominant_overview)
As seen in the table above, 50% of the samples in H
group are dominated by g__Bacteroides and so on...
This can be used for entire dataset.
library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 tx.sum1 <- taxa_summary(p0, "Phylum") tx.sum1
For group specific abundances of taxa get_group_abundances
.
library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 grp_abund <- get_group_abundances(p0, level = "Phylum", group="DiseaseState", transform = "compositional")
How to visualize these data?
mycols <- c("brown3", "steelblue","grey50") # clean names grp_abund$OTUID <- gsub("p__", "",grp_abund$OTUID) grp_abund$OTUID <- ifelse(grp_abund$OTUID == "", "Unclassified", grp_abund$OTUID) mean.plot <- grp_abund %>% # input data ggplot(aes(x= reorder(OTUID, mean_abundance), # reroder based on mean abundance y= mean_abundance, fill=DiseaseState)) + # x and y axis geom_bar( stat = "identity", position=position_dodge()) + scale_fill_manual("DiseaseState", values=mycols) + # manually specify colors theme_bw() + # add a widely used ggplot2 theme ylab("Mean Relative Abundance") + # label y axis xlab("Phylum") + # label x axis coord_flip() # rotate plot mean.plot
This plot is diagnostic to have an idea about the taxonomy in each group. Statements based on comparisons between groups may not make sense here because only mean abundance values are plotted, and not standard deviation within groups. For visualizing error bars and more check Statistical tools for high-throughput data analysis.
Finding samples dominated by user provided taxa in a phyloseq object. This is useful especially if user suspects a taxa to be contaminant and wishes to identify which samples are dominated by the contaminant taxa.
library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 p0.f <- aggregate_taxa(p0, "Genus") bac_dom <- find_samples_taxa(p0.f, taxa = "g__Bacteroides") bac_dom #get samples dominated by g__Bacteroides ps.sub <- prune_samples(sample_names(p0.f) %in% bac_dom, p0.f)
A common approach to check for sequencing depth and diversity measures.
Here, we can access the numerous alpha diversity measures supported by microbiome R package.
NOTE: This can take sometime to complete.
library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 # set seed set.seed(1) subsamples <- seq(0, 5000, by=100)[-1] #subsamples = c(10, 5000, 10000, 20000, 30000) p <- plot_alpha_rcurve(p0, index="observed", subsamples = subsamples, lower.conf = 0.025, upper.conf = 0.975, group="DiseaseState", label.color = "brown3", label.size = 3, label.min = TRUE) # change color of line mycols <- c("brown3", "steelblue","grey50") p <- p + scale_color_manual(values = mycols) + scale_fill_manual(values = mycols) print(p)
Utility plot function for diversity measures calculated by microbiome
package.
library(microbiome) data("zackular2014") ps0 <- zackular2014 mycols <- c("brown3", "steelblue","grey50") p <- plot_alpha_diversities(ps0, type = "dominance", index.val = "all", plot.type = "stripchart", variableA = "DiseaseState", palette = mycols) p <- p + theme_biome_utils() + ggplot2::theme(legend.position = "top", text = element_text(size=14)) print(p)
comps <- make_pairs(sample_data(ps0)$DiseaseState) p <- p + stat_compare_means( comparisons = comps, label = "p.format", tip.length = 0.05, method = "wilcox.test") p
Alternatively, one can plot one index at a time with pair-wise stats.
library(gghalves) library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 mycols <- c("brown3", "steelblue","grey50") p.m <- plot_diversity_stats(p0, group = "DiseaseState", index = "diversity_shannon", group.order = c("H", "CRC", "nonCRC"), group.colors = mycols, label.format="p.format", stats = TRUE) p.m + ylab("Shannon Diversity") + xlab("")
library(microbiome) library(microbiomeutilities) library(dplyr) data("zackular2014") p0 <- zackular2014 tern_df <- prep_ternary(p0, group="DiseaseState", abund.thres=0.000001, level= "Genus", prev.thres=0.01) head(tern_df)
# install.packages("ggtern") require(ggtern) # Replace empty with Other tern_df$Phylum[tern_df$Phylum==""] <- "Other" ggtern(data=tern_df, aes(x=CRC, y=H, z=nonCRC)) + geom_point(aes(color= Phylum), alpha=0.25, show.legend=T, size=3) + #scale_size(range=c(0, 6)) + geom_mask() + scale_colour_brewer(palette = "Paired") + theme_biome_utils() detach("package:ggtern", unload=TRUE)
Plot relative abundance of top taxa specified by user.
library(microbiomeutilities) library(RColorBrewer) data("zackular2014") ps0 <- zackular2014 mycols <- c("brown3", "steelblue", "grey50") pn <- plot_taxa_boxplot(ps0, taxonomic.level = "Family", top.otu = 3, group = "DiseaseState", add.violin= FALSE, title = "Top three family", keep.other = FALSE, group.order = c("H","CRC","nonCRC"), group.colors = mycols, dot.size = 1) print(pn + theme_biome_utils())
Using a list of taxa specified by the user for comparisons.
library(microbiome) library(microbiomeutilities) library(ggpubr) data("zackular2014") p0 <- zackular2014 p0.f <- format_to_besthit(p0) #top_taxa(p0.f, 5) select.taxa <- c("d__denovo1:g__Blautia", "d__denovo3:g__Bacteroides") mycols <- c("brown3", "steelblue", "grey50") p <- plot_listed_taxa(p0.f, select.taxa, group= "DiseaseState", group.order = c("H","CRC","nonCRC"), group.colors = mycols, add.violin = TRUE, violin.opacity = 0.3, dot.opacity = 0.25, box.opacity = 0.25, panel.arrange= "grid") print(p + ylab("Relative abundance") + scale_y_continuous(labels = scales::percent))
Adding statistical test with ggpubr::stat_compare_means()
# If more than two variables comps <- make_pairs(sample_data(p0.f)$DiseaseState) print(comps) p <- p + stat_compare_means( comparisons = comps, label = "p.format", tip.length = 0.05, method = "wilcox.test") p + scale_y_continuous(labels = scales::percent)
Plot top four Genera
library(microbiome) library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 p0.f <- aggregate_taxa(p0, "Genus") top_four <- top_taxa(p0.f, 4) top_four mycols <- c("brown3", "steelblue", "grey50") p <- plot_listed_taxa(p0.f, top_four, group= "DiseaseState", group.order = c("H","CRC","nonCRC"), group.colors = mycols, add.violin = TRUE, violin.opacity = 0.3, dot.opacity = 0.25, box.opacity = 0.25, panel.arrange= "wrap") comps <- make_pairs(sample_data(p0.f)$DiseaseState) p <- p + stat_compare_means( comparisons = comps, label = "p.format", tip.length = 0.05, method = "wilcox.test") print(p + ylab("Relative abundance") + scale_y_continuous(labels = scales::percent))
Plots mean Abundance-Prevalence for taxa. Mean abundance, mean prevalence, and upper and lower confidence interval for each taxa is calculated by random sub-sampling.
This can be useful in identifying highly prevalent taxa and their mean relative abundance in a group of samples. Taxa that are highly prevalent with low variation in lower and upper CI can be identified at varying values of mean relative abundance. These are likely core taxa in the groups of samples.
See core microbiota analysis in microbiome R package
library(microbiomeutilities) library(dplyr) library(ggrepel) asv_ps <- zackular2014 asv_ps <- microbiome::transform(asv_ps, "compositional") # Select healthy samples asv_ps <- subset_samples(asv_ps, DiseaseState=="H") asv_ps <- core(asv_ps,detection = 0.0001, prevalence = 0.50) # reduce size for example asv_ps <- format_to_besthit(asv_ps) set.seed(14285) p_v <- plot_abund_prev(asv_ps, label.core = TRUE, color = "Phylum", # NA or "blue" mean.abund.thres = 0.01, mean.prev.thres = 0.99, dot.opacity = 0.7, label.size = 3, label.opacity = 1.0, nudge.label=-0.15, bs.iter=9, # increase for actual analysis e.g. 999 size = 20, # increase to match your nsamples(asv_ps) replace = TRUE, label.color="#5f0f40") p_v <- p_v + geom_vline(xintercept = 0.95, lty="dashed", alpha=0.7) + geom_hline(yintercept = 0.01,lty="dashed", alpha=0.7) + scale_color_brewer(palette = "Dark2") p_v
Useful for visualizing differences in top OTUs between sample groups.
library(microbiomeutilities) library(pheatmap) library(RColorBrewer) data("zackular2014") ps0 <- zackular2014 #optional step if required to rename taxa_names taxa_names(ps0) <- gsub("d__denovo", "OTU:", taxa_names(ps0)) # create a gradient color palette for abundance grad_ab <- colorRampPalette(c("#faf3dd","#f7d486" ,"#5e6472")) grad_ab_pal <- grad_ab(10) # create a color palette for varaibles of interest meta_colors <- list(c("positive" = "#FFC857", "negative" = "#05B083"), c("CRC" = "steelblue", "nonCRC" = "grey50", "H"="brown3")) # add labels for pheatmap to detect names(meta_colors) <- c("FOBT.result", "DiseaseState") p <- plot_taxa_heatmap(ps0, subset.top = 25, VariableA = c("DiseaseState","FOBT.result"), heatcolors = grad_ab_pal, #rev(brewer.pal(6, "RdPu")), transformation = "log10", cluster_rows = T, cluster_cols = F, show_colnames = F, annotation_colors=meta_colors)
#the plot is stored here p$plot # table used for plot is here p$tax_tab[1:3,1:3]
library(microbiomeutilities) data("zackular2014") p0 <- zackular2014 p0.rel <- transform(p0, "compositional") #heat.cols <- c("#a8dadc","#457b9d", "#1d3557") # create a gradient color palette for abundance grad_ab <- colorRampPalette(c("#96d4ca","#d3f3f1", "#7c65a9")) heat.cols <- grad_ab(10) simple_heatmap(p0.rel, group.facet = "DiseaseState", group.order = NULL, abund.thres = 0.01, prev.thres = 0.1, level = "Genus", scale.color = "log10", na.fill = "white", color.fill = heat.cols, taxa.arrange=TRUE, remove.other=TRUE, panel.arrange="grid", ncol=NULL, nrow=NULL)
For heatmap options see microbiome::heat() and ampvis2
For other longitudinal related functionality check the [Longitudinal data analysis and visualization] section in the Articles section.
We provide access to a subset of studies included in the MicrobiomeHD
database from Duvallet et al 2017: Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nature communications.
The phyloseq objects are stored and accessed from microbiomedatarepo.
study <- list_microbiome_data(printtab = FALSE) knitr::kable(study)
Below is the per study reference.
NOTE: When using these studies, please cite Duvallet et al. 2017 and the respective studies.
file <- system.file("extdata", "microbiomeHD_ref.txt", package = "microbiomeutilities") reference <- read.table(file, header = T, sep = "\t") knitr::kable(reference)
library(microbiomeutilities) library(RColorBrewer) data("zackular2014") p0 <- zackular2014 ps1 <- format_to_besthit(p0) #ps1 <- subset_samples(ps1, DiseaseState == "H") ps1 <- prune_taxa(taxa_sums(ps1) > 0, ps1) prev.thres <- seq(.05, 1, .05) det.thres <- 10^seq(log10(1e-4), log10(.2), length = 10) pseq.rel <- microbiome::transform(ps1, "compositional") # reduce size for example pseq.rel <- core(pseq.rel, detection = 0.001, prevalence = 20 / 100) ord.bray <- ordinate(pseq.rel, "NMDS", "bray") p <- plot_ordiplot_core(pseq.rel, ord.bray, prev.thres, det.thres, min.prevalence = 0.9, color.opt = "DiseaseState", shape = NULL, Sample = TRUE) print(p)
Useful resources:
The microbiomeutilties
depends on the phyloseq data structure and core Phyloseq functions.
Tools for microbiome analysis in R. Microbiome package URL: microbiome package.
For more tutorials and examples of data analysis in R please check:
Contributions are welcome:
Issue Tracker
Pull requests
Star us on the Github page
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.