knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This tutorial illustrates the process of estimating bias from control samples and using this estimate to calibrate the relative abundances of the control taxa in samples whose true composition is unknown. The data we use are from the synthetic-community colonization experiment of Leopold and Busby (2020).

Leopold DR, Busby PE. 2020. Host Genotype and Colonist Arrival Order Jointly Govern Plant Microbiome Composition and Function. Curr Biol 30:3260-3266.e5. doi:10.1016/j.cub.2020.06.011

These authors performed a colonization experiment in which black cottonwood (Populus trichocarpa) was inoculated with the fungal rust pathogen Melampsora × columbiana and 8 species of foliar fungi. Fungal relative abundances were measured using ITS amplicon sequencing. To enable quantification of bias arising from PCR and from variation in ITS copy number, they also created a set of DNA mocks of these fungi, which they measured along with the experimental samples. We will use the mock samples to estimate the bias of the 9 focal species, and then calibrate these species' abundances in the full set of samples.

Setup

First, let's set up our R environment. This tutorial uses phyloseq objects and functions to store and manipulate the microbiome data, tidyverse packages for data manipulation and plotting, and some add-ons to ggplot2 from the ggbeeswarm and cowplot packages.

# Tools for microbiome data
library(phyloseq)
# Tools for general purpose data manipulation and plotting
library(tidyverse)
# ggplot helpers
library(ggbeeswarm)
library(cowplot)
theme_set(theme_cowplot())

library(metacal); packageVersion("metacal")

Next, we download the source data from Leopold and Busby (2020), which is available in Zenodo record 3872145. Set data_path to wherever you would like the data to be downloaded.

data_path <- here::here("vignettes", "data", "leopold2020host")
# To use a temporary directory:
# data_path <- file.path(tempdir(), "leopold2020host")
if (!dir.exists(data_path)) {
  dir.create(data_path, recursive = TRUE)
  download.file(
    "https://zenodo.org/record/3872145/files/dleopold/Populus_priorityEffects-v1.2.zip",
    file.path(data_path, "Populus_priorityEffects-v1.2.zip")
  )
  unzip(
    file.path(data_path, "Populus_priorityEffects-v1.2.zip"), 
    exdir = data_path
  )
}

The microbiome data is stored in a phyloseq object,

ps <- file.path(data_path, 
  "dleopold-Populus_priorityEffects-8594f7c/output/compiled/phy.rds") %>%
  readRDS %>%
  print

The (expected) actual compositions of the mock (control) samples can be computed from a csv file that contains the dilution factors used during sample construction. I will read in the csv file, convert it to a phyloseq object, and then compute the relative abundances of each species from the reciprocals of the dilution factors.

mock_actual <- file.path(data_path, 
  "dleopold-Populus_priorityEffects-8594f7c/data/MockCommunities.csv") %>%
  read.csv(row.names = 1) %>%
  select(-Sym4) %>%
  as("matrix") %>%
  otu_table(taxa_are_rows = FALSE) %>%
  transform_sample_counts(function(x) close_elts(1 / x))
mock_taxa <- taxa_names(mock_actual)

Note that I have converted the actual relative abundances to proportions, but the observed abundances in the ps object have the form of read counts,

otu_table(ps) %>% prune_taxa(mock_taxa, .) %>% corner
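Since mock_actual stores proportions, each of its samples should sum to 1; a quick check,

sample_sums(mock_actual)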

Let's take a look at the sample data and taxonomy table in the phyloseq object,

sam <- sample_data(ps) %>% as("data.frame") %>% as_tibble(rownames = "Sample")
tax <- tax_table(ps) %>% as("matrix") %>% as_tibble(rownames = "Taxon")
sam %>% glimpse
sam %>%
  count(Samp_type)

The control samples are the 10 samples with Samp_type == "Mock".

Now let's look at the taxonomy table,

tax %>% head
tax %>% filter(Taxon %in% mock_taxa) %>% select(Taxon, Family:Species)
ntaxa(ps) 

There are 219 OTUs, of which 9 have been assigned to the mock taxa (the pathogen, here named "Melampsora", and the 8 foliar fungi). The names of the mock taxa in ps have already been set to exactly match those in mock_actual, which is a requirement for the steps that follow.
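We can quickly verify this requirement; the following should return TRUE,

all(mock_taxa %in% taxa_names(ps))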

Inspect mock community measurements

To get a sense of how the mock mixtures were created, we can take a look at the expected proportions of the 9 taxa across the 10 samples,

mock_actual %>%
  psmelt %>%
  mutate(
    across(OTU, factor, levels = mock_taxa),
    across(Sample, factor, levels = sample_names(mock_actual)),
  ) %>%
  ggplot(aes(Sample, OTU, fill = Abundance)) +
  geom_tile() +
  scale_fill_viridis_c(trans = "log10", breaks = c(0.02, 0.05, 0.1, 0.2)) +
  theme_cowplot(font_size = 11)

All taxa are in every sample, but in varying proportions that range from around 0.01 to 0.3.
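We can confirm this range directly from the actual proportions,

mock_actual %>% as("matrix") %>% range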

To visualize and estimate bias, we'll work with a phyloseq object of the observed compositions subset to just the mock community samples and taxa,

ps.mock <- ps %>% 
  subset_samples(Samp_type == "Mock") %>%
  prune_taxa(mock_taxa, .)

Next, let's compare the observed to actual taxon proportions in the mocks. I'll first combine the observed and actual proportions in a data frame in "long" format,

props <- list(Actual = mock_actual, Observed = ps.mock) %>%
  map(transform_sample_counts, close_elts) %>%
  map_dfr(psmelt, .id = "Type") %>%
  select(Type, OTU, Sample, Abundance)

We can plot the observed vs. actual proportions with ggplot as follows,

brks <- c(0, 1e-3, 1e-2, 1e-1, 0.3)
axes <- list(
  scale_y_continuous(trans = scales::pseudo_log_trans(1e-3), 
    breaks = brks, labels = brks),
  scale_x_continuous(trans = scales::pseudo_log_trans(1e-3),
    breaks = brks, labels = brks),
  coord_fixed()
)

props %>%
  pivot_wider(names_from = Type, values_from = Abundance) %>%
  ggplot(aes(Actual, Observed, color = OTU, shape = Sample == "Mock.5")) +
  geom_abline(color = "darkgrey") +
  geom_quasirandom() +
  axes +
  scale_shape_manual(values = c(16, 1)) +
  scale_color_brewer(type = "qual", palette = 3)

Some notes about this plot:

- The axes use a pseudo-log transformation, so observed proportions of 0 can be shown alongside the positive proportions.
- Points are jittered with geom_quasirandom() to reduce overplotting of taxa that share the same actual proportion.
- Sample Mock.5 is drawn with open circles to distinguish it from the other mock samples.

Estimate bias

Metacal provides a high-level interface for estimating bias, estimate_bias(), which expects two matrices or phyloseq objects with the observed and actual abundances for a set of control samples. To obtain sensible estimates, we must first ensure that every taxon has an observed abundance greater than 0 in each sample where its actual abundance is greater than 0. We will therefore add a pseudocount of 1 to the observed counts to remove the 0s; other zero-replacement methods, such as cmultRepl() from zCompositions, would also work. The estimation method also requires that taxa not supposed to be in a sample have an observed abundance of exactly 0, but estimate_bias() will zero out these observations for you automatically.

sum(otu_table(ps.mock) == 0)
ps.mock.pseudo <- ps.mock %>%
  transform_sample_counts(function(x) x + 1)
all(otu_table(ps.mock.pseudo) > 0)

Now we can estimate bias using ps.mock.pseudo as our observed abundances. We use the boot = TRUE option to also generate bootstrap replicates that will give us a way to quantify the uncertainty in our estimate.

mc_fit <- estimate_bias(ps.mock.pseudo, mock_actual, boot = TRUE) %>% print
class(mc_fit)

estimate_bias() returns an "mc_bias_fit" object that can be interacted with through commonly used S3 methods such as print(), summary(), and coef(). The bias estimate can be extracted with coef(),

bias <- coef(mc_fit) %>% print

The summary() method uses the bootstrap replicates in the "mc_bias_fit" object to estimate the (geometric) standard error of the bias estimate,

mc_fit.summary <- summary(mc_fit)
print(mc_fit.summary)

producing a data frame with the estimate and standard error that can be accessed with

coef_tb <- mc_fit.summary$coefficients

Let's use this data frame to plot the bias estimate with two geometric standard errors,

coef_tb %>%
  mutate(taxon = fct_reorder(taxon, estimate)) %>%
  ggplot(aes(taxon, estimate, 
      ymin = estimate / gm_se^2, ymax = estimate * gm_se^2)) +
  geom_hline(yintercept = 1, color = "grey") +
  geom_pointrange() +
  scale_y_log10() +
  coord_flip()

This plot shows that these estimates are consistent with, but not identical to, those in Figure S2 of Leopold and Busby (2020). The difference is due to our inclusion of the Mock.5 sample, which leads to a lower point estimate and a larger standard error for Epicoccum.
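As a quick sensitivity check (not part of the original analysis), we can re-estimate the bias with Mock.5 excluded and compare the resulting summary to the one above,

# Drop Mock.5 from both the observed and actual compositions, then re-estimate
mc_fit.no5 <- estimate_bias(
  prune_samples(sample_names(ps.mock.pseudo) != "Mock.5", ps.mock.pseudo),
  prune_samples(sample_names(mock_actual) != "Mock.5", mock_actual),
  boot = TRUE
)
summary(mc_fit.no5)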

Check model fit

The fitted values of the model can be accessed with fitted(); these are the proportions that we predict we'd observe for the control samples given the actual proportions and the estimated bias. The fitted() function returns a matrix with samples as rows and taxa as columns, which I'll convert into a phyloseq object.

observed.fitted <- fitted(mc_fit) %>% otu_table(taxa_are_rows = FALSE)

To plot the fitted values against the observed values, I'll add the fitted proportions to our earlier props data frame,

props.fitted <- bind_rows(
  props, 
  psmelt(observed.fitted) %>% add_column(Type = "Fitted")
)

Next, I'll use some tidyr manipulations and ggplot to plot the observed proportions vs. the actual and fitted predictions,

props.fitted %>%
  pivot_wider(names_from = Type, values_from = Abundance) %>%
  pivot_longer(c(Fitted, Actual), names_to = "Type", values_to = "Predicted") %>%
  ggplot(aes(Predicted, Observed, color = OTU)) +
  geom_abline(color = "darkgrey") +
  geom_point() +
  axes + 
  facet_wrap(~Type) +
  scale_color_brewer(type = "qual", palette = 3)

This plot shows that the fitted model is doing a good job of explaining the error across the full range of predicted proportions, though it is slightly underpredicting the observed proportions at low values.
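As a rough numerical check (not part of the original tutorial), we can summarize the log error of the observed proportions relative to the fitted values, splitting at an arbitrary actual proportion of 0.05; a positive mean log error indicates underprediction,

props.fitted %>%
  pivot_wider(names_from = Type, values_from = Abundance) %>%
  filter(Observed > 0) %>%                   # drop taxa with no observed reads
  mutate(Log_error = log10(Observed / Fitted)) %>%
  group_by(Low_actual = Actual < 0.05) %>%   # arbitrary cutoff for "low" proportions
  summarize(mean_log_error = mean(Log_error), sd_log_error = sd(Log_error))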

Calibrate

Now we can use the estimated bias to calibrate the relative abundances of the mock taxa in all samples. Metacal provides the calibrate() function for this purpose, which takes as its arguments the observed compositions and the bias estimate returned by coef(mc_fit).

There are a few things to keep in mind about calibrate():

- The bias estimate covers only the 9 mock taxa, so calibration only adjusts the relative abundances among these taxa.
- Calibration acts multiplicatively on the observed abundances, so taxa with an observed abundance of exactly 0 remain 0 after calibration; as with bias estimation, zeros should be handled beforehand if this is a concern.

Here I will use a pseudocount to make all the taxa abundances positive prior to calibration.

ps.pseudo <- transform_sample_counts(ps, function(x) x + 1)
ps.pseudo.cal <- calibrate(ps.pseudo, bias) %>% print

To verify that calibration worked as expected, let's compare the observed, calibrated, and actual taxon proportions in the 10 control samples. Again, I'll start by adding the calibrated proportions to the data frame with observed and actual proportions,

ps.pseudo.cal.mock <- ps.pseudo.cal %>% 
  subset_samples(Samp_type == "Mock")
props.cal <- bind_rows(
  props, 
  psmelt(ps.pseudo.cal.mock) %>% add_column(Type = "Calibrated")
) %>%
  select(Type, OTU, Sample, Abundance)

First, let's look at bar plots comparing the observed, calibrated, and actual compositions,

props.cal %>%
  mutate(
    across(Type, fct_rev),
    across(Sample, factor, levels = sample_names(mock_actual)),
    ) %>%
  rename(Proportion = Abundance) %>%
  ggplot(aes(Type, Proportion, fill = OTU)) +
  geom_col(width = 0.7) +
  scale_fill_brewer(type = "qual", palette = 3) +
  facet_wrap(~Sample, ncol = 5) +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5)
  )

We can see that the calibrated compositions are typically substantially closer to the actual proportions than the original observations. We can also plot the calibrated vs. actual (pseudo-log-transformed) proportions and compare to the earlier plot of observed vs. actual.

props.cal %>%
  pivot_wider(names_from = Type, values_from = Abundance) %>%
  pivot_longer(c(Observed, Calibrated), 
    names_to = "Type", values_to = "Proportion") %>%
  mutate(across(Type, fct_relevel, "Observed")) %>%
  ggplot(aes(Actual, Proportion, color = OTU)) +
  geom_abline(color = "darkgrey") +
  geom_quasirandom() +
  facet_wrap(~Type) +
  axes +
  scale_color_brewer(type = "qual", palette = 3)

This plot shows essentially the same information as the earlier plot examining the model fit, but seen from a different angle. We can see that calibration in this dataset clearly reduces the relative error in proportions across the tested range of actual proportions, but leads to estimates that are systematically too low at low proportions.
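As a rough numerical summary of the improvement (again, not part of the original tutorial), we can compute, for each mock sample, the total variation distance (half the sum of absolute differences) between the observed or calibrated proportions and the actual proportions,

props.cal %>%
  group_by(Type, Sample) %>%
  mutate(Abundance = close_elts(Abundance)) %>%  # ensure each composition is closed to proportions
  ungroup() %>%
  pivot_wider(names_from = Type, values_from = Abundance) %>%
  group_by(Sample) %>%
  summarize(
    Observed_error = sum(abs(Observed - Actual)) / 2,
    Calibrated_error = sum(abs(Calibrated - Actual)) / 2
  )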

Session info

sessioninfo::session_info()

