This vignette presents two data examples to illustrate our suggested workflow for using DivNet to conduct beta diversity analyses with amplicon data. The first example covers beta diversity analyses in settings without technical replicate measurements, and the second demonstrates how we suggest incorporating technical replicate data into beta diversity analyses.
We would be remiss not to acknowledge Mike Lee, a bioinformatics wizard who recently put together some fantastic tutorials on bioinformatics for analysing microbiome data here. Mike has very kindly allowed us to distribute the phyloseq data object from this tutorial via DivNet. Thanks so much, Mike!
Let's load some required packages and check our session info:
library(magrittr) if ("speedyseq" %in% row.names(installed.packages())) { library(speedyseq) # if speedyseq is installed, use it for tax_glom } else { library(phyloseq) # if not, use phyloseq for tax_glom } library(breakaway) library(tidyr) library(ggplot2) library(DivNet) sessionInfo()
DivNet allows testing of hypotheses concerning the location of community centroids across conditions. Intuitively, we define "centroid" as the archetypal community from a given environment or under a given condition. (We give concrete definitions of various kinds of centroids below.)
Let's take a quick look at the Lee et al. dataset.
data(Lee)
Lee
16 samples, 1490 ASVs, and taxonomy information. Fantastic!
Suppose we are interested in determining how much evidence we have that
community centroids vary by the type of site (water, glassy, altered, biofilm, etc...) from which samples were taken,
as encoded in the
the char
column in the data.frame
below:
sample_data(Lee)
Before we start our analysis proper, we need to ensure a DivNet model can be
meaningfully fit on this data. Under each condition (unique entry of the char
column above), we need to be able to do two things:
(1) Estimate an archetypal composition (i.e., a centroid) (2) Estimate the (within-condition) variability in community composition
If you have encountered discussions of degrees of freedom in ANOVA, this may feel somewhat familiar (but don't worry if not). In brief, the story here is that more observations per condition is generally better, and at a minimum we need at least two observations per condition to fit a beta diversity model in any reasonable way. (If we have only one observation, we can estimate a centroid but not variability around that centroid.)
Since we only have a single observation under the biofilm and carbonate conditions, we will proceed by excluding these from our analysis and considering only the water, glassy, and altered conditions.
Somewhat arbitrarily (since this is an example analysis), we will compute beta diversities, and hence community centroids, on data aggregated to the phylum level.
Lee_phylum <- tax_glom(Lee, taxrank="Phylum") Lee_phylum <- subset_samples(Lee_phylum, sample_data(Lee_phylum)$char %in% c("glassy","altered","water"))
Now we are ready to run divnet()
. Note in the code below, that we have chosen a diagonal design matrix.
This is the recommended
setting for testing beta diversity (i.e. location of centroids) with DivNet -- other
choices of design matrix may result in estimated beta diversity within conditions being zero,
which is generally unrealistic.
(However, if you have technical replicate samples from the same
specimen, we do recommend you specify a design matrix that reflects this -- see below for an
analysis in this situation.) Variance estimation (and bootstrapping) is done in a subsequent step, so
we set variance equal to "none" here.
X <- diag(nrow(sample_data(Lee_phylum))) colnames(X) <- rownames(X) <- as.character(1:ncol(X)) divnet_phylum <- divnet(W = Lee_phylum, X = X, variance = "none")
After divnet()
runs, we can test a null hypothesis of equality of
centroid location across conditions (glassy, water, and altered).
We'll start with a test using Bray-Curtis dissimilarity,
which amounts to a test of whether the median composition
of samples differs across conditions (glassy, water, and altered).
set.seed(454) bc_test <- testBetaDiversity(divnet_phylum, h0 = "bray-curtis", n_boot = 1000, sample_specimen_matrix = divnet_phylum$X, groups = sample_data(Lee_phylum)$char) bc_test$p_value
We used 1000 bootstrap iterations for this example to keep it relatively fast, but we encourage you to use 10,000 or more for an analysis that you are preparing for publication.
On the basis of this test, we can conclude that estimated differences in group centroids are not particularly unusual under our null hypothesis of no difference between (true) group centroids -- if we repeat our experiment with new observations, we expect to observe a test statistic as large as we do in one of about every 9 repeated experiments (~10.5%) if the null of identical median community composition across conditions holds.
In a publication, we might summarize this result as follows:
We estimated specimen composition from read counts via a DivNet model (Willis and Martin, 2020) with a diagonal design matrix. To test a null hypothesis of equal median measured relative abundance across groups "glassy", "water", and "altered", we used a bootstrapped pseudo-F test with 1,000 bootstrap iterations as implemented in the
testBetaDiversity
function within DivNet. On the basis of this test, we fail to reject the null of equal median measured relative abundance across groups at the 0.05 level (bootstrap p =r round(bc_test$p_value, 2)
).
Hence the outcome of our test is that we lack sufficient evidence to conclude that measured median relative abundance varies by group. We include the word "measured" here because we cannot, on the basis of 16S data alone, reliably estimate true relative abundance profiles in our samples (McLaren et al., 2019).
We can also use Euclidean distance to test whether mean composition differs across condition.
set.seed(3532) euc_test <- testBetaDiversity(dv = divnet_phylum , h0 = "euclidean", n_boot = 1000, sample_specimen_matrix = divnet_phylum$X, groups = sample_data(Lee_phylum)$char) euc_test$p_value
We again (at the $p = 0.05$ significance level) fail to find evidence for a difference in (measured) group centroids -- in particular, we do not find strong evidence that mean measured relative abundance differs across groups. We could write this up in a scientific publication as follows:
We estimated specimen composition from read counts via a DivNet model (Willis and Martin, 2020) with a diagonal design matrix. To test a null hypothesis of equal mean measured relative abundance across groups "glassy", "water", and "altered", we used a bootstrapped pseudo-F test with 1,000 bootstrap iterations as implemented in the $testBetaDiversity$ function within DivNet. On the basis of this test, we do not reject the null of equal mean measured relative abundance across groups at the 0.05 level (p =
r round(euc_test$p_value,2)
).
The third beta diversity test DivNet enables is based on Aitchison distance, or equivalently Euclidean distance on the centered-log-ratio scale. In this test, we test whether mean centered-log-ratio-transformed community composition varies across conditions.
set.seed(3423) ait_test <- testBetaDiversity(dv = divnet_phylum, sample_specimen_matrix = divnet_phylum$X, h0 = "aitchison", n_boot = 1000, groups = sample_data(Lee_phylum)$char) ait_test$p_value
In this case, we observe a small p-value, which reflects (in part) the fact that Aitchison distance gives more weight to differences in relative abundance among rarer taxa than does Euclidean distance or Bray-Curtis dissimilarity. We might write this result as follows:
We estimated specimen composition from read counts via a DivNet model (Willis and Martin, 2020) with a diagonal design matrix. To test a null hypothesis of equal mean centered-log-ratio relative abundance across groups "glassy", "water", and "altered", we used a bootstrapped pseudo-F test with 1,000 bootstrap iterations as implemented in the $testBetaDiversity$ function within DivNet. On the basis of this test, we reject the null of equal mean centered-log-ratio measured relative abundance across groups at the 0.05 level (p =
r round(ait_test$p_value, 3)
).
As this wording indicates, a small p-value here leads us to reject the null
hypothesis that there is no difference in (measured) centroids across groups defined by the
char
variable.
Note, however, that the p-value tells us nothing about across which groups centroids may differ or how they differ: a small p-value does not guarantee that all group centroids differ or that any groups differ in scientifically meaningful ways.
As a side note, we're presenting the three analyses above to demonstrate the capabilities of DivNet. In a research setting, we recommend choosing a dissimilarity on scientific grounds (i.e., on the basis of what kind of differences between groups you care about). If you choose to run multiple analyses with varying choices of dissimilarity, it is important to report all of them. Selective reporting -- running many analyses but only reporting a subset of them -- can be very misleading for readers (and is in fact a form of p-hacking if analyses returning significant p-values are preferentially reported).
Using DivNet, we can also conduct beta-diversity analyses of amplicon data containing technical replicates. For clarity, by "technical replicates" we mean replicate samples from a single unique specimen -- e.g., aliquots from a single (homogenized) fecal sample. By contrast, for example, samples taken from the same study participant at different times are not technical replicates based on our definition here.
To demonstrate beta-diversity analysis with technical replicates, we'll look at a subset of 16S data published by Sinha et al. (2017) as part of the
Microbiome Quality Control Project (MBQC). In particular, we'll examine measurements on 7 freeze-dried specimens taken from patients awaiting
elective surgery for colorectal cancer ($n = 4$) or non-oncologic, elective surgery ($n = 3$). We'll limit ourselves to read counts reported
by a single bioinformatics laboratory, BL-2
, on sequencing data produced by a single sequencing laboratory, HL-B
(the MBQC data includes
read counts produced by many bioinformatics laboratories on sequencing data from many sequencing laboratories). Information about the specimens can be found in the data object B2
:
data(B2) ### how many technical replicates per specimen? table(B2$specimen,B2$Bioinformatics.ID) %>% apply(1, sum)
Here we see our 7 specimens (sample55, sample56, etc...) and depending on the specimen, we have between 3 and 6 technical replicates each. How can we use technical replicates in our analysis? In short, the recommended approach here is essentially the same as in the case without technical replicates, but instead of fitting DivNet with a diagonal design matrix, we construct a design matrix $X$ where rows are samples (these are our replicates or aliquots if you prefer that terminology) and columns are specimens. That is, we let the $i,j$-th entry of $X$ be $1$ if sample $i$ was taken from specimen $j$ and 0 otherwise. This tells DivNet to estimate a single common composition for each specimen using all technical replicates of that specimen.
Before we use DivNet, let's construct our design matrix that encodes
which samples are taken from which specimens and create a phyloseq
object
to give to DivNet. Recall that our design matrix $X$ has samples as rows and specimens as columns.
unique_specimens <- unique(B2$specimen) # create "specimen matrix" to be used as design matrix X in DivNet specimen_matrix <- do.call(cbind,lapply(unique_specimens,function(x) as.numeric(B2$specimen ==x))) specimen_df <- as.data.frame(specimen_matrix) %>% (function(x){ rownames(x) <- B2$Bioinformatics.ID; return(x)})
The counts data is available in DivNet as a phyloseq object called B2_phyloseq
. For this analysis, we will aggregate to the species level using the taxonomic information MBQC
has provided along with OTU counts.
data("B2_phyloseq") mbqc_species <- tax_glom(B2_phyloseq, taxrank="species")
Now we are ready to fit a DivNet model! Note again that we are not estimating any variances in the initial fit.
set.seed(43) mbqc_div <- divnet(W = mbqc_species, X = as.matrix(specimen_df), variance = "none")
Now using testBetaDiversity()
, we can conduct a hypothesis test via a nonparametric cluster bootstrap. Here,
we have decided to use Aitchison distance, so we are testing a hypothesis of equality of mean
centered-log-ratio-transformed relative abundance across groups (colorectal cancer vs. control).
set.seed(14321) ait_test_np <- testBetaDiversity(dv = mbqc_div, h0 = "aitchison", groups = B2$health_status, sample_specimen_matrix = as.matrix(specimen_df), n_boot = 1000) ait_test_np$p_value
So, we don't find sufficient evidence to reject a null of no difference in measured centroids (mean centered-log-ratio measured relative abundance, since we used Aitchison distance) between groups. We could write this up as follows:
We estimated specimen composition from read counts on all technical replicates via a DivNet model (Willis and Martin, 2020). To test a null hypothesis of equal mean centered-log-ratio measured relative abundance among colorectal cancer patients and among control patients, we used a cluster-bootstrapped pseudo-F test with 1,000 bootstrap iterations as implemented in the $testBetaDiversity$ function within DivNet. On the basis of this test, we do not reject the null of equal mean centered-log-ratio measured relative abundance across groups at the 0.05 level (p =
r round(ait_test_np$p_value, 2)
).
Some words of caution here -- we've conducted this example analysis on measurements taken on only 7(!) study participants across two groups, and the bootstrap test we have used has an asymptotic justification. In an ideal world, we would include more participants if we were going to do this analysis in earnest (as opposed to as a use-case).
Note: the inclusion of technical replicates does not solve the problems associated with small sample size -- more measurements on a given study participant may give us more information about that participant, but they do not help us characterize how typical that participant is of the larger population(s) we are interested in.
The function $\textit{testBetaDiversity}()$ tests a null of equality of centroids across groups using a pseudo-F test.
For observations on $p$ groups, with number of unique specimens observed in group $g$ given by $n_g$ for $g = 1, \dots, p$ the pseudo-F-statistic is given by
$$F = \dfrac{\sum_{g = 1}^p \sum_{f>g} \sum_{i = 1}^{n_g}\sum_{k = 1}^{n_f} h(\hat{Y}{gi},\hat{Y}{fk})}{\sum_{g = 1}^p \sum_{i = 1}^{n_g} \sum_{k>i} h(\hat{Y}{gi},\hat{Y}{gk})} \cdot \dfrac{\sum_{g = 1}^p n_g - p - 1}{p - 1}$$ where $h()$ is either squared distance (Euclidean or Aitchison) or non-squared Bray-Curtis dissimilarity and the $\hat{Y}$ are estimated specimen compositions. In other words, the pseudo-F-statistic is a scaled ratio of 1) the cross-group sum of dissimilarities between specimens 2) the within-group sum of dissimilarities between specimens.
The null distribution is estimated by bootstrap resampling specimens and subtracting from each (selected) specimen's composition the estimated centroid for the group it is a member of, recalculating dissimilarities (Euclidean, Aitchison, or Bray-Curtis) on resulting centered specimen compositions, and calculating a pseudo-F-statistic from these. Subtraction of centroids from specimen composition is performed on the natural scale (no transformation prior to subtraction) for Euclidean distance and Bray-Curtis dissimilarity. For Aitchison distance, we subtract group centroids from estimated specimen composition on the centered log-ratio scale.
Lee, Michael D., Nathan G. Walworth, Jason B. Sylvan, Katrina J. Edwards, and Beth N. Orcutt. "Microbial communities on seafloor basalts at Dorado Outcrop reflect level of alteration and highlight global lithic clades." Frontiers in microbiology 6 (2015): 1470.
McLaren, Michael R., Amy D. Willis, and Benjamin J. Callahan. "Consistent and correctable bias in metagenomic sequencing experiments." Elife 8 (2019): e46923.
Sinha, Rashmi, Galeb Abu-Ali, Emily Vogtmann, Anthony A. Fodor, Boyu Ren, Amnon Amir, Emma Schwager et al. "Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium." Nature biotechnology 35, no. 11 (2017): 1077-1086.
Willis, Amy D., and Bryan D. Martin. "Estimating diversity in networked ecological communities." Biostatistics (2020).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.