knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial illustrates the process of estimating bias from control samples using this estimate to calibrate the relative abundances of the control taxa in samples whose true composition is unknown. The data we use is from the synthetic-community colonization experiment of Leopold and Busby (2020).
Leopold DR, Busby PE. 2020. Host Genotype and Colonist Arrival Order Jointly Govern Plant Microbiome Composition and Function. Curr Biol 30:3260-3266.e5. doi:10.1016/j.cub.2020.06.011
These authors performed a colonization experiment in which black cottonwood (Populus trichocarpa) was inoculated with the fungal rust pathogen Melampsora × columbiana and 8 species of foliar fungi. Fungal relative abundances were measured using ITS amplicon sequencing. To enable quantification of bias due to PCR bias and variation in ITS copy number, they also created a set of DNA mocks of these fungi, which they measured along with the experimental samples. We will use the mock samples to estimate the bias of the 9 focal species, and then calibrate these species' abundances in the full set of samples.
First, let's set up our R environment. This tutorial uses phyloseq objects and functions to store and manipulate the microbiome data, tidyverse packages for data manipulation and plotting, and some add-ons to ggplot2 from the ggbeeswarm and cowplot packages.
# Tools for microbiome data library(phyloseq) # Tools for general purpose data manipulation and plotting library(tidyverse) # ggplot helpers library(ggbeeswarm) library(cowplot) theme_set(theme_cowplot()) library(metacal); packageVersion("metacal")
Next, we download the source data from Leopold and Busby (2020), which is
available in Zenodo record 3872145. Set
data_path
to wherever you would like the data to be downloaded.
data_path <- here::here("vignettes", "data", "leopold2020host") # To use a temporary directory: # data_path <- file.path(tempdir(), "leopold2020host") if (!dir.exists(data_path)) { dir.create(data_path, recursive = TRUE) download.file( "https://zenodo.org/record/3872145/files/dleopold/Populus_priorityEffects-v1.2.zip", file.path(data_path, "Populus_priorityEffects-v1.2.zip") ) unzip( file.path(data_path, "Populus_priorityEffects-v1.2.zip"), exdir = data_path ) }
The microbiome data is stored in a phyloseq object,
ps <- file.path(data_path, "dleopold-Populus_priorityEffects-8594f7c/output/compiled/phy.rds") %>% readRDS %>% print
The (expected) actual compositions of the mock (control) samples can be computed from a csv file that contains the dilution factors used during sample construction. I will read in the csv file, convert it to a phyloseq object, and then compute the relative abundances of each species from the reciprocals of the dilution factors.
mock_actual <- file.path(data_path, "dleopold-Populus_priorityEffects-8594f7c/data/MockCommunities.csv") %>% read.csv(row.names = 1) %>% select(-Sym4) %>% as("matrix") %>% otu_table(taxa_are_rows = FALSE) %>% transform_sample_counts(function(x) close_elts(1 / x)) mock_taxa <- taxa_names(mock_actual)
Note, I have converted the actual relative abundances to proportions, but the
observed abundances in the ps
object have the form of read counts,
otu_table(ps) %>% prune_taxa(mock_taxa, .) %>% corner
Let's take a look at the sample data and taxonomy table in the phyloseq object,
sam <- sample_data(ps) %>% as("data.frame") %>% as_tibble(rownames = "Sample") tax <- tax_table(ps) %>% as("matrix") %>% as_tibble(rownames = "Taxon") sam %>% glimpse sam %>% count(Samp_type)
The control samples are the 10 samples with Samp_type == "Mock"
.
Now let's look at the taxonomy table,
tax %>% head tax %>% filter(Taxon %in% mock_taxa) %>% select(Taxon, Family:Species) ntaxa(ps)
There are 219 OTUs, of which 9 have been assigned to the mock taxa (the
pathogen, here named "Melampsora", and the 8 foliar fungi). The names of the
mock taxa in ps
have already been set to exactly match those in
mock_actual
, which is a requirement for the steps that follow.
To get a sense of how the mock mixtures were created, we can take a look at the expected proportions of the 9 taxa across the 10 samples,
mock_actual %>% psmelt %>% mutate( across(OTU, factor, levels = mock_taxa), across(Sample, factor, levels = sample_names(mock_actual)), ) %>% ggplot(aes(Sample, OTU, fill = Abundance)) + geom_tile() + scale_fill_viridis_c(trans = "log10", breaks = c(0.02, 0.05, 1e-1, 0.2)) + theme_cowplot(font_size = 11)
All taxa are in every sample, but in varying proportions that range from around 0.01 to 0.3.
To visualize and estimate bias, we'll work with a phyloseq object of the observed compositions subset to just the mock community samples and taxa,
ps.mock <- ps %>% subset_samples(Samp_type == "Mock") %>% prune_taxa(mock_taxa, .)
Next, let's compare the observed to actual taxon proportions in the mocks. I'll first combine the observed and actual proportions in a data frame in "long" format,
props <- list(Actual = mock_actual, Observed = ps.mock) %>% map(transform_sample_counts, close_elts) %>% map_dfr(psmelt, .id = "Type") %>% select(Type, OTU, Sample, Abundance)
We can plot the observed vs. actual proportions with ggplot as follows,
brks <- c(0, 1e-3, 1e-2, 1e-1, 0.3) axes <- list( scale_y_continuous(trans = scales::pseudo_log_trans(1e-3), breaks = brks, labels = brks), scale_x_continuous(trans = scales::pseudo_log_trans(1e-3), breaks = brks, labels = brks), coord_fixed() ) props %>% pivot_wider(names_from = Type, values_from = Abundance) %>% ggplot(aes(Actual, Observed, color = OTU, shape = Sample == "Mock.5")) + geom_abline(color = "darkgrey") + geom_quasirandom() + axes + scale_shape_manual(values = c(16, 1)) + scale_color_brewer(type = "qual", palette = 3)
Some notes about this plot:
Axes transformations: The multiplicative error created by bias can only be
easily seen for low-proportion taxa if we use some form of log
transformation. Options to consider include a log (or log10) transformation,
logit transformation, and a pseudo-log transformation (which is linear at low
values and logarithmic at higher values). Here I will use the pseudo-log
transform since it can more easily handle any 0s that happen to be in the
data than these other transforms. I set 1e-3
as the linear-to-log
transition point; you should play around with this value with your own data.
I saved the transformed axes in a variable axes
to reuse in later plots.
Because the actual proportions are clustered at just 4 values, I used
geom_quasirandom()
from the ggbeeswarm package to jitter the points.
I have highlighted the sample Mock.5 as it was identified as an outlier by Leopold2020. It is the only sample where a taxon (Epicoccum) was undetected in a sample where it should have been present (the point at (0.01, 0)). Other than this non-detection, the observations of this sample seem in line with the others. I will include this sample in this tutorial, but we need to be aware that how we deal with this zero value may have undue influence on our bias estimate.
Metacal provides a high-level interface to estimating bias, estimate_bias()
,
which expects two matrices or phyloseq objects with the observed and actual
abundances for a set of control samples. To obtain sensible estimates using
this function, we must first ensure that all taxa have an observed abundance
that is greater than 0 in samples where their actual abundance is greater than
0. We will therefore add a pseudocount of 1 to the observed counts to
remove 0s; other zero-replacement methods, such as cmultRepl()
from
zCompositions, would also work. The estimation method used by estimate_bias()
also requires that taxa not supposed to be in a sample have an abundance of
exactly 0, but the estimate_bias()
function will automatically zero-out these
observations for you.
sum(otu_table(ps.mock) == 0) ps.mock.pseudo <- ps.mock %>% transform_sample_counts(function(x) x + 1) all(otu_table(ps.mock.pseudo) > 0)
Now we can estimate bias using ps.mock.pseudo
as our observed abundances. We
use the boot = TRUE
option to also generate bootstrap replicates that will
give us a way to quantify the uncertainty in our estimate.
mc_fit <- estimate_bias(ps.mock.pseudo, mock_actual, boot = TRUE) %>% print class(mc_fit)
estimate_bias()
returns an "mc_bias_fit" object that can be interacted with
through commonly used S3 methods such as print()
, summary()
, and coef()
.
The bias estimate can be extracted with coef()
,
bias <- coef(mc_fit) %>% print
The summary()
method uses the bootstrap replicates in the "mc_bias_fit"
object to estimate the (geometric) standard error of the bias estimate,
mc_fit.summary <- summary(mc_fit) print(mc_fit.summary)
producing a data frame with the estimate and standard error that can be accessed with
coef_tb <- mc_fit.summary$coefficients
Let's use this data frame to plot the bias estimate with two geometric standard errors,
coef_tb %>% mutate(taxon = fct_reorder(taxon, estimate)) %>% ggplot(aes(taxon, estimate, ymin = estimate / gm_se^2, ymax = estimate * gm_se^2)) + geom_hline(yintercept = 1, color = "grey") + geom_pointrange() + scale_y_log10() + coord_flip()
This plot shows that these estimates are consistent but not identical with those in Leopold2020 Fig S2. The difference is due to our inclusion of the Mock.5 sample, which leads to a lower point estimate and larger standard error for Epicoccum.
The fitted values of the model can be accessed with fitted()
; these are the
proportions that we predict we'd observe for the control samples given the
actual proportions and the estimated bias. The fitted()
function returns a
matrix with samples as rows and taxa as columns, which I'll convert into a
phyloseq object.
observed.fitted <- fitted(mc_fit) %>% otu_table(taxa_are_rows = FALSE)
To plot the fitted values against the observed values, I'll add the fitted
proportions to our earlier props
data frame,
props.fitted <- bind_rows( props, psmelt(observed.fitted) %>% add_column(Type = "Fitted") )
Next, I'll use some tidyr manipulations and ggplot to plot the observed proportions vs. the actual and fitted predictions,
props.fitted %>% pivot_wider(names_from = Type, values_from = Abundance) %>% pivot_longer(c(Fitted, Actual), names_to = "Type", values_to = "Predicted") %>% ggplot(aes(Predicted, Observed, color = OTU)) + geom_abline(color = "darkgrey") + geom_point() + axes + facet_wrap(~Type) + scale_color_brewer(type = "qual", palette = 3)
This plot shows that the fitted model is doing a good job of explaining the error across the full range of predicted proportions, though it is slightly underpredicting the observed proportions at low values.
Now we can use the estimated bias to calibrate the relative abundances of
the mock taxa in all samples. Metacal provides the calibrate()
function for
this purpose, which takes as its arguments the observed compositions and the
bias estimate returned by coef(mc_fit)
.
There are a few things to keep in mind about calibrate()
.
Because of the relative nature of microbiome measurement and the metacal bias
estimate, it is only possible to calibrate the relative abundances of the
taxa whose bias has been estimated. For this reason, calibrate()
automatically drops taxa that are missing from the bias estimate.
A deterministic calibration method is used in which the observed abundances are simply divided by the bias and (optionally) normalized to sum to 1. Uncertainty in the bias estimate and in the observed abundances due to the noise inherent in microbiome measurement (such as random sampling of sequencing reads) is not accounted for.
An implication is that taxa whose observed abundance is 0 will have a calibrated abundance of 0. Whether and how you replace 0s prior to calibration may have a significant impact on downstream analyses.
Here I will use a pseudocount to make all the taxa abundances positive prior to calibration.
ps.pseudo <- transform_sample_counts(ps, function(x) x + 1) ps.pseudo.cal <- calibrate(ps.pseudo, bias) %>% print
To verify that calibration worked as expected, let's compare the observed, calibrated, and actual taxon proportions in the 10 control samples. Again, I'll start by adding the calibrated proportions to the data frame with observed and actual proportions,
ps.pseudo.cal.mock <- ps.pseudo.cal %>% subset_samples(Samp_type == "Mock") props.cal <- bind_rows( props, psmelt(ps.pseudo.cal.mock) %>% add_column(Type = "Calibrated") ) %>% select(Type, OTU, Sample, Abundance)
First, let's look at bar plots comparing the observed, calibrated, and actual compositions,
props.cal %>% mutate( across(Type, fct_rev), across(Sample, factor, levels = sample_names(mock_actual)), ) %>% rename(Proportion = Abundance) %>% ggplot(aes(Type, Proportion, fill = OTU)) + geom_col(width = 0.7) + scale_fill_brewer(type = "qual", palette = 3) + facet_wrap(~Sample, ncol = 5) + theme( axis.text.x = element_text(angle = 90, vjust = 0.5) )
We can see that the calibrated compositions are typically substantially closer to the actual proportions than the original observations. We can also plot the calibrated vs. actual (pseudo-log-transformed) proportions and compare to the earlier plot of observed vs. actual.
props.cal %>% pivot_wider(names_from = Type, values_from = Abundance) %>% pivot_longer(c(Observed, Calibrated), names_to = "Type", values_to = "Proportion") %>% mutate(across(Type, fct_relevel, "Observed")) %>% ggplot(aes(Actual, Proportion, color = OTU)) + geom_abline(color = "darkgrey") + geom_quasirandom() + facet_wrap(~Type) + axes + scale_color_brewer(type = "qual", palette = 3)
This plot has essentially same information as the earlier plot examining the model fit, but seen from a different angle. We can see that calibration in this dataset clearly reduces the relative error in proportions across the tested range of actual proportions, but leads to estimates that are systematically too low at low proportions.
sessioninfo::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.