knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)

NOTE While we continue to maintain this R package, the development has been discontinued as we have shifted to supporting methods development based on the new TreeSummarizedExperiment data container, which provides added capabilities for multi-omics data analysis. Check the miaverse project for details.

The microbiomeutilities R package is part of the microbiome-verse tools that provides additional data handling and visualization support for the microbiome R/BioC package

Philosophy: "Seemingly simple tasks for experienced R users can always be further simplified for novice users"

Install

install.packages("devtools")
devtools::install_github("microsud/microbiomeutilities")

Load libraries

library(microbiomeutilities)
library(microbiome)
library(knitr)
library(tibble)
library(dplyr)

Basics

Example data

Package data from Zackular et al., 2014: The Gut Microbiome Modulates Colon Tumorigenesis.

Import test data.

data("zackular2014")
ps0 <- zackular2014
ps0

The print_ps from microbiomeutilities can give additional information. See also microbiome::summarize_phyloseq.

print_ps(ps0)

Formatting the Phyloseq Object

Most commonly it is observed that the taxonomy file has classification until a given taxonomic level. We can change the names in both otu table and taxonomy table with the best taxonomic classification available. This can be useful if the analysis has to be done at OTU/ASVs level. Only ID are less useful.

Check the taxonomy in phyloseq object.

kable(head(tax_table(ps0)))

Some have only g of s information.

data("zackular2014")
p0 <- zackular2014
# reduce size for example
ps0 <- core(p0, detection = 10, prevalence = 20 / 100)
# Add a prefix to taxa labels
ps0.f2 <- format_to_besthit(ps0, prefix = "MyBug-")

kable(head(tax_table(ps0.f2))[3:6])

Now the available taxonomy is added.

As can be seen, the rownames have the OTUIDs and available taxonomic name(s).

Distribution of reads

Useful for QC purposes. Check for distribution of sequencing depth.

data("zackular2014")
p0 <- zackular2014
p <- plot_read_distribution(p0, groups = "DiseaseState", 
                            plot.type = "density")
print(p + theme_biome_utils())

This is a diagnostic step. Key to check if there is variation between groups that will be compared for downstream analysis.

Convert phyloseq object to long data format

Useful if the user wants to plot specific features.
Converting to long data format opens several opportunities to work with Tidyverse.

data("zackular2014")
p0 <- zackular2014
# reduce size for example
ps0 <- core(p0, detection = 10, prevalence = 20 / 100)

pseq_df <- phy_to_ldf(ps0, transform.counts = NULL)

kable(head(pseq_df))

Taxa overview

One of the first questions arise regarding which taxa are present , how they are distributed in the data. This can be done with following functionality.

Check distribution

A quick check to see how different taxa are distributed in your data.

library(microbiomeutilities)
library(RColorBrewer)
library(patchwork)
library(ggpubr)
data("zackular2014")
pseq <- zackular2014

# check healthy
health_ps <- subset_samples(pseq, DiseaseState=="H")
p_hc <- taxa_distribution(health_ps) + 
  theme_biome_utils() + 
  labs(title = "Healthy")

# check CRC
crc_ps <- subset_samples(pseq, DiseaseState=="CRC")
p_crc <- taxa_distribution(crc_ps) + 
  theme_biome_utils() + 
  labs(title = "CRC")

# harnessing the power of patchwork
p_hc / p_crc + plot_layout(guides = "collect") + 
  plot_annotation(tag_levels = "A")

There are Cyanobacteria/Chloroplast related sequences which can be removed if not expected in the samples.

Dominant taxa

Sometimes, we are interested in identifying the most dominant taxa in each sample. We may also wish to check what percent of samples within a given group are these taxa dominating.

library(microbiomeutilities)
library(dplyr)
data("zackular2014")
p0 <- zackular2014
p0.gen <- aggregate_taxa(p0,"Genus")
x.d <- dominant_taxa(p0,level = "Genus", group="DiseaseState")
head(x.d$dominant_overview)

As seen in the table above, 50% of the samples in H group are dominated by g__Bacteroides and so on...

Get taxa summary

This can be used for entire dataset.

library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
tx.sum1 <- taxa_summary(p0, "Phylum")
tx.sum1

Get taxa summary by group(s)

For group specific abundances of taxa get_group_abundances.

library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
grp_abund <- get_group_abundances(p0, 
                                  level = "Phylum", 
                                  group="DiseaseState",
                                  transform = "compositional")

How to visualize these data?

mycols <- c("brown3", "steelblue","grey50")

# clean names 
grp_abund$OTUID <- gsub("p__", "",grp_abund$OTUID)
grp_abund$OTUID <- ifelse(grp_abund$OTUID == "", 
                          "Unclassified", grp_abund$OTUID)


mean.plot <- grp_abund %>% # input data
  ggplot(aes(x= reorder(OTUID, mean_abundance), # reroder based on mean abundance
             y= mean_abundance,
             fill=DiseaseState)) + # x and y axis 
  geom_bar(     stat = "identity", 
                position=position_dodge()) + 
  scale_fill_manual("DiseaseState", values=mycols) + # manually specify colors
  theme_bw() + # add a widely used ggplot2 theme
  ylab("Mean Relative Abundance") + # label y axis
  xlab("Phylum") + # label x axis
  coord_flip() # rotate plot 

mean.plot

This plot is diagnostic to have an idea about the taxonomy in each group. Statements based on comparisons between groups may not make sense here because only mean abundance values are plotted, and not standard deviation within groups. For visualizing error bars and more check Statistical tools for high-throughput data analysis.

Find samples dominated by specific taxa

Finding samples dominated by user provided taxa in a phyloseq object. This is useful especially if user suspects a taxa to be contaminant and wishes to identify which samples are dominated by the contaminant taxa.

library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
p0.f <- aggregate_taxa(p0, "Genus")
bac_dom <- find_samples_taxa(p0.f, taxa = "g__Bacteroides")
bac_dom

#get samples dominated by g__Bacteroides
ps.sub <- prune_samples(sample_names(p0.f) %in% bac_dom, p0.f)

Alpha diversity

Rarefaction curves for alpha diversity indices

A common approach to check for sequencing depth and diversity measures.
Here, we can access the numerous alpha diversity measures supported by microbiome R package.

NOTE: This can take sometime to complete.

library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
# set seed
set.seed(1)
subsamples <- seq(0, 5000, by=100)[-1]
#subsamples = c(10, 5000, 10000, 20000, 30000)

p <- plot_alpha_rcurve(p0, index="observed", 
                       subsamples = subsamples,
                       lower.conf = 0.025, 
                       upper.conf = 0.975,
                       group="DiseaseState",
                       label.color = "brown3",
                       label.size = 3,
                       label.min = TRUE) 
# change color of line 
mycols <- c("brown3", "steelblue","grey50")

p <- p + scale_color_manual(values = mycols) + 
  scale_fill_manual(values = mycols)
print(p)

Plot alpha diversities

Utility plot function for diversity measures calculated by microbiome package.

library(microbiome)
data("zackular2014")
ps0 <- zackular2014
mycols <- c("brown3", "steelblue","grey50")
p <- plot_alpha_diversities(ps0,
                            type = "dominance",
                            index.val = "all",
                            plot.type = "stripchart",
                            variableA = "DiseaseState",
                            palette = mycols)

p <- p + theme_biome_utils() + 
  ggplot2::theme(legend.position = "top",
                 text = element_text(size=14))
print(p)
comps <- make_pairs(sample_data(ps0)$DiseaseState)

p <- p + stat_compare_means(
      comparisons = comps,
      label = "p.format",
      tip.length = 0.05,
      method = "wilcox.test")
p

Alternatively, one can plot one index at a time with pair-wise stats.

library(gghalves)
library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
mycols <- c("brown3", "steelblue","grey50")

p.m <- plot_diversity_stats(p0, group = "DiseaseState", 
                            index = "diversity_shannon", 
                            group.order = c("H", "CRC", "nonCRC"), 
                            group.colors = mycols,
                            label.format="p.format",
                            stats = TRUE)
p.m + ylab("Shannon Diversity") + xlab("")

Composition plots

Ternary plot

library(microbiome)
library(microbiomeutilities)
library(dplyr)
data("zackular2014")
p0 <- zackular2014
tern_df <- prep_ternary(p0, group="DiseaseState", 
               abund.thres=0.000001, level= "Genus", prev.thres=0.01)
head(tern_df)
# install.packages("ggtern")
require(ggtern)

# Replace empty with Other
tern_df$Phylum[tern_df$Phylum==""] <- "Other"

ggtern(data=tern_df, aes(x=CRC, y=H, z=nonCRC)) + 
  geom_point(aes(color= Phylum), 
             alpha=0.25, 
             show.legend=T, 
             size=3) +
  #scale_size(range=c(0, 6)) + 
  geom_mask() + 
  scale_colour_brewer(palette = "Paired") +
  theme_biome_utils()

detach("package:ggtern", unload=TRUE)

Plot taxa boxplot

Plot relative abundance of top taxa specified by user.

library(microbiomeutilities)
library(RColorBrewer)
data("zackular2014")
ps0 <- zackular2014
mycols <- c("brown3", "steelblue", "grey50")
pn <- plot_taxa_boxplot(ps0,
                        taxonomic.level = "Family",
                        top.otu = 3, 
                        group = "DiseaseState",
                        add.violin= FALSE,
                        title = "Top three family", 
                        keep.other = FALSE,
                        group.order = c("H","CRC","nonCRC"),
                        group.colors = mycols,
                        dot.size = 1)

print(pn + theme_biome_utils())

Plotting selected taxa

Using a list of taxa specified by the user for comparisons.

library(microbiome)
library(microbiomeutilities)
library(ggpubr)
data("zackular2014")
p0 <- zackular2014
p0.f <- format_to_besthit(p0)
#top_taxa(p0.f, 5)
select.taxa <- c("d__denovo1:g__Blautia", "d__denovo3:g__Bacteroides")

mycols <- c("brown3", "steelblue", "grey50")

p <- plot_listed_taxa(p0.f, select.taxa, 
                      group= "DiseaseState",
                      group.order = c("H","CRC","nonCRC"),
                      group.colors = mycols,
                      add.violin = TRUE,
                      violin.opacity = 0.3,
                      dot.opacity = 0.25,
                      box.opacity = 0.25,
                      panel.arrange= "grid")
print(p + ylab("Relative abundance") + scale_y_continuous(labels = scales::percent))

Adding statistical test with ggpubr::stat_compare_means()

# If more than two variables
comps <- make_pairs(sample_data(p0.f)$DiseaseState)
print(comps)
p <- p + stat_compare_means(
      comparisons = comps,
      label = "p.format",
      tip.length = 0.05,
      method = "wilcox.test")
p + scale_y_continuous(labels = scales::percent)

Plot top four Genera

library(microbiome)
library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
p0.f <- aggregate_taxa(p0, "Genus")
top_four <- top_taxa(p0.f, 4)
top_four

mycols <- c("brown3", "steelblue", "grey50")

p <- plot_listed_taxa(p0.f, top_four, 
                      group= "DiseaseState",
                      group.order = c("H","CRC","nonCRC"),
                      group.colors = mycols,
                      add.violin = TRUE,
                      violin.opacity = 0.3,
                      dot.opacity = 0.25,
                      box.opacity = 0.25,
                      panel.arrange= "wrap")

comps <- make_pairs(sample_data(p0.f)$DiseaseState)
p <- p + stat_compare_means(
      comparisons = comps,
      label = "p.format",
      tip.length = 0.05,
      method = "wilcox.test")

print(p + ylab("Relative abundance") + scale_y_continuous(labels = scales::percent))

Abundance-Prevalence relationship

Plots mean Abundance-Prevalence for taxa. Mean abundance, mean prevalence, and upper and lower confidence interval for each taxa is calculated by random sub-sampling.
This can be useful in identifying highly prevalent taxa and their mean relative abundance in a group of samples. Taxa that are highly prevalent with low variation in lower and upper CI can be identified at varying values of mean relative abundance. These are likely core taxa in the groups of samples.

See core microbiota analysis in microbiome R package

library(microbiomeutilities)
library(dplyr)
library(ggrepel)
asv_ps <- zackular2014
asv_ps <- microbiome::transform(asv_ps, "compositional")
# Select healthy samples
asv_ps <- subset_samples(asv_ps, DiseaseState=="H")
asv_ps <- core(asv_ps,detection = 0.0001, prevalence = 0.50) # reduce size for example
asv_ps <- format_to_besthit(asv_ps)

set.seed(14285)
p_v <- plot_abund_prev(asv_ps, 
                       label.core = TRUE,
                       color = "Phylum", # NA or "blue"
                       mean.abund.thres = 0.01,
                       mean.prev.thres = 0.99,
                       dot.opacity = 0.7,
                       label.size = 3,
                       label.opacity = 1.0,
                       nudge.label=-0.15,
                       bs.iter=9, # increase for actual analysis e.g. 999
                       size = 20, # increase to match your nsamples(asv_ps)
                       replace = TRUE,
                       label.color="#5f0f40")  
p_v <- p_v + 
  geom_vline(xintercept = 0.95, lty="dashed", alpha=0.7) + 
  geom_hline(yintercept = 0.01,lty="dashed", alpha=0.7) +
  scale_color_brewer(palette = "Dark2")
p_v

Heatmaps

Heatmap using phyloseq and pheatmap

Useful for visualizing differences in top OTUs between sample groups.

library(microbiomeutilities)
library(pheatmap)
library(RColorBrewer)
data("zackular2014")
ps0 <- zackular2014

#optional step if required to rename taxa_names
taxa_names(ps0) <- gsub("d__denovo", "OTU:", taxa_names(ps0))

# create a gradient color palette for abundance
grad_ab <- colorRampPalette(c("#faf3dd","#f7d486" ,"#5e6472"))
grad_ab_pal <- grad_ab(10)

# create a color palette for varaibles of interest
meta_colors <- list(c("positive" = "#FFC857", 
                      "negative" = "#05B083"), 
                    c("CRC" = "steelblue", 
                      "nonCRC" = "grey50", 
                      "H"="brown3"))
# add labels for pheatmap to detect
names(meta_colors) <- c("FOBT.result", "DiseaseState")

p <- plot_taxa_heatmap(ps0,
                       subset.top = 25,
                       VariableA = c("DiseaseState","FOBT.result"),
                       heatcolors = grad_ab_pal, #rev(brewer.pal(6, "RdPu")),
                       transformation = "log10",
                       cluster_rows = T,
                       cluster_cols = F,
                       show_colnames = F,
                       annotation_colors=meta_colors)
#the plot is stored here
p$plot

# table used for plot is here
p$tax_tab[1:3,1:3]

Elegant option with ggplot2

library(microbiomeutilities)
data("zackular2014")
p0 <- zackular2014
p0.rel <- transform(p0, "compositional")
#heat.cols <- c("#a8dadc","#457b9d", "#1d3557")
# create a gradient color palette for abundance
grad_ab <- colorRampPalette(c("#96d4ca","#d3f3f1", "#7c65a9"))
heat.cols <- grad_ab(10)
simple_heatmap(p0.rel,
               group.facet = "DiseaseState",
               group.order = NULL,
               abund.thres = 0.01,
               prev.thres = 0.1,
               level = "Genus",
               scale.color = "log10",
               na.fill = "white",
               color.fill = heat.cols,
               taxa.arrange=TRUE,
               remove.other=TRUE,
               panel.arrange="grid",
               ncol=NULL,
               nrow=NULL)

For heatmap options see microbiome::heat() and ampvis2

For other longitudinal related functionality check the [Longitudinal data analysis and visualization] section in the Articles section.

MicrobiomeHD datasets as phyloseq objects

We provide access to a subset of studies included in the MicrobiomeHD database from Duvallet et al 2017: Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nature communications.

The phyloseq objects are stored and accessed from microbiomedatarepo.

study <- list_microbiome_data(printtab = FALSE)

knitr::kable(study)

Below is the per study reference.

NOTE: When using these studies, please cite Duvallet et al. 2017 and the respective studies.

file <- system.file("extdata", "microbiomeHD_ref.txt", package = "microbiomeutilities")
reference <- read.table(file, header = T, sep = "\t")

knitr::kable(reference)

Experimental

Plot ordination and core

library(microbiomeutilities)
library(RColorBrewer)
data("zackular2014")
p0 <- zackular2014

ps1 <- format_to_besthit(p0)

#ps1 <- subset_samples(ps1, DiseaseState == "H")
ps1 <- prune_taxa(taxa_sums(ps1) > 0, ps1)
prev.thres <- seq(.05, 1, .05)
det.thres <- 10^seq(log10(1e-4), log10(.2), length = 10)
pseq.rel <- microbiome::transform(ps1, "compositional")
# reduce size for example
pseq.rel <- core(pseq.rel, detection = 0.001, prevalence = 20 / 100)

ord.bray <- ordinate(pseq.rel, "NMDS", "bray")

p <- plot_ordiplot_core(pseq.rel, ord.bray,
                        prev.thres, det.thres,
                        min.prevalence = 0.9,
                        color.opt = "DiseaseState", 
                        shape = NULL, Sample = TRUE)

print(p)

Useful resources:

The microbiomeutilties depends on the phyloseq data structure and core Phyloseq functions.

Tools for microbiome analysis in R. Microbiome package URL: microbiome package.

For more tutorials and examples of data analysis in R please check:

Contributions are welcome:

Issue Tracker
Pull requests
Star us on the Github page

sessionInfo()


microsud/microbiomeutilities documentation built on Nov. 29, 2022, 12:18 a.m.