Here, we will demonstrate four strategies to enrich microbiome-gut-brain axis experiments. In this document we expand on the demonstration in the supplementary files of the companion piece to this manuscript. We strongly recommend readers go through that analysis before this document, for the sake of continuity and clarity.
For this demonstration, we have adapted some shotgun metagenomic data from the `curatedMetagenomicData` library in R. We're looking at a human cohort from the Metagenome-wide association of gut microbiome features for schizophrenia study (DOI: 10.1038/s41467-020-15457-9). After downloading, the data was simplified by summing together all strains by genus, which makes it easier to analyse without access to a server. For the set of operations used to pre-process, please see the section Download and pre-process microbiome data in this document. Briefly, this data set contains WGS data from faecal samples of both patients with schizophrenia and healthy volunteers, who will be referred to as "healthy" in the figure legends. This data has been included in the `Tjazi` library on GitHub for easy access.
All R code used to transform, wrangle (reorganise) and plot the data is also shown below, so as to provide a toolkit for aspiring and veteran bioinformaticians alike.
```r
#Statistical tools. Primarily the CLR transformation.
library(Tjazi)      #devtools::install_github("thomazbastiaanssen/Tjazi")

#Data wrangling
library(tidyverse)  #install.packages("tidyverse")
library(knitr)      #install.packages("knitr")

#Plotting
library(ggplot2)    #install.packages("ggplot2")
library(ggforce)    #install.packages("ggforce")
library(patchwork)  #install.packages("patchwork")

#Section-specific packages
library(mediation)  #install.packages("mediation")
library(anansi)     #devtools::install_github("thomazbastiaanssen/anansi")
library(propr)      #devtools::install_github("tpq/propr")
library(volatility) #devtools::install_github("thomazbastiaanssen/volatility")

#Load prepared data from the schizophrenia study stored in the Tjazi library
data(guidebook_data)
```
\newpage
Mediation analysis is a statistical tool that can help us answer causal questions, and it can be used to address core questions in the microbiome-gut-brain axis field. One common and important example is as follows: how do we make sure that the association between the microbiome and mental health isn't due to the fact that they share a common driver, namely diet, which would render our observed associations between the microbiome and the brain merely spurious? Conversely, mediation analysis could help us determine to which degree the effect of diet on mental health is mediated by the effect of diet on the microbiome.
Here, we will demonstrate how one could go about performing mediation analysis to ask and answer this question. We will use the 2020 schizophrenia data set to demonstrate this point. It should be noted here that in reality, one would need strong biological mechanistic reasons to perform a mediation analysis in order to make the 'causal' part in causal mediation analysis carry any meaning. Here we are just performing the analysis for demonstrative purposes.
First, let's load the schizophrenia data set into our environment. We'll also quickly clean and CLR-transform our genus-level count table.
```r
#Set a seed for the purposes of reproducibility in this document.
set.seed(1)

#Load the mediation library and the relevant files for this demo
library(mediation)
data(guidebook_data)
counts <- counts ; metadata <- metadata ; diet <- diet

#Repeat the cleaning and transformation steps from part 1
metadata$master_ID <- gsub(metadata$master_ID, pattern = "-", replacement = ".")

counts <- counts[, metadata$master_ID]

#Fork off your count data so that you always have an untouched version handy.
genus <- counts

#Make sure our count data is all numbers
genus <- apply(genus, c(1, 2), function(x) as.numeric(as.character(x)))

#Remove features with prevalence < 10% in two steps:
#First, determine how often every feature is absent in a sample
n_zeroes <- rowSums(genus == 0)

#Then, remove features that are absent in more than your threshold (90% in this case).
genus <- genus[n_zeroes <= round(ncol(genus) * 0.90), ]

#Perform a CLR transformation
genus.exp <- clr_c(genus)
```
Furthermore, we will need to prepare the dietary data. For this study, we have access to a table of five columns denoting how frequently our participants eat certain food groups. It should be noted that we are only using this data for demonstration purposes. In reality, the resolution of the dietary information provided with this study would not be sufficient to make any type of causal claim, for several reasons. That said, let's convert the dietary frequency categories into numbers and then perform a principal component analysis.
```r
diet.pca = diet %>%

  #Replace text with numbers
  mutate(across(everything(), ~str_replace(., "hardly",                      "1")),
         across(everything(), ~str_replace(., "often",                       "2")),
         across(everything(), ~str_replace(., "twice or three times a week", "3")),
         across(everything(), ~str_replace(., "every day",                   "4")),
         across(everything(), as.numeric)) %>%

  #Center and rescale data, then perform a principal component analysis
  scale() %>%
  prcomp()

#Let's check out the loadings of the first principal component using a histogram
PC1 = diet.pca$x[, 1]
hist(PC1)
```
It's always a good idea to visually inspect your data. From this histogram, it looks like there may be two roughly normal populations in our first principal component. We'll take note of this and proceed with our preparations.
We'll take two of the bacteria, Cronobacter and Gordonibacter, that were found to be statistically associated with schizophrenia in our initial analysis. We'll also take the first component of the dietary intake PCA we generated above.
```r
df_mediation <- data.frame(
  Cronobacter   = unlist(genus.exp["Enterobacteriaceae_Cronobacter", ]),
  Gordonibacter = unlist(genus.exp["Eggerthellaceae_Gordonibacter", ]),
  diet_PC1      = PC1,
  phenotype     = metadata$Group == "schizophrenia")
```
\newpage
For mediation analysis, we need to consider a few models. Microbes will be playing the role of (potential) mediator here.
First, we estimate the effect of diet on our phenotype: $phenotype \sim diet$. Does diet explain phenotype? If yes, we can proceed.
Second, we estimate the effect of the microbe on our phenotype of interest: $phenotype \sim microbe$. Does the microbe also explain phenotype? If also yes, great, we have a potential mediation on our hands.
Third, we estimate the effect of diet on our microbe of interest: $microbe \sim diet$. Does diet explain the abundance of our microbe? If yes, we can proceed. Now things are getting interesting as we have the scenario we laid out earlier.
Fourth, we estimate the joint effect of diet and the microbe on our phenotype of interest: $phenotype \sim diet + microbe$. Does diet now explain phenotype less well in the presence of the microbe? If so, we have a potential mediation on our hands. If diet no longer explains phenotype at all, we may be dealing with a full mediation; if its explanatory potential is merely reduced, we speak of a potential partial mediation.
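For intuition: in the fully linear case, these four models map onto the classic product-of-coefficients view of mediation,

$$microbe \sim a \cdot diet, \qquad phenotype \sim c' \cdot diet + b \cdot microbe$$

where $a \times b$ corresponds to the indirect (mediated) effect and $c'$ to the direct effect. Our phenotype is binary (see below), so the mediation package will estimate the analogous quantities by simulation rather than by this simple product, but the intuition carries over.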
To check for a mediation effect, we use the `mediation` package.
In our case, the phenotype (schizophrenia diagnosis) is a binary outcome. Because of that, we'll use a logistic regression model rather than a 'regular' linear model whenever we're trying to explain phenotype. We'll use a link function for this.
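Concretely, the logit link models the log-odds of having the phenotype as a linear function of the predictors:

$$\log\left(\frac{P(\text{phenotype})}{1-P(\text{phenotype})}\right) = \beta_0 + \beta_1 \cdot \text{diet} \;\left(+\; \beta_2 \cdot \text{microbe}\right)$$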
Let's give it a shot with our two bacteria.
```r
#Cronobacter
#Does diet explain phenotype?
diet.fit1 <- glm(phenotype ~ diet_PC1,
                 family = binomial("logit"), data = df_mediation)

#Does diet explain Cronobacter?
diba.fit1 <- lm(Cronobacter ~ diet_PC1, data = df_mediation)

#Does Cronobacter explain phenotype?
bact.fit1 <- glm(phenotype ~ Cronobacter,
                 family = binomial("logit"), data = df_mediation)

#Does diet explain phenotype in the presence of Cronobacter?
both.fit1 <- glm(phenotype ~ diet_PC1 + Cronobacter,
                 family = binomial("logit"), data = df_mediation)

#Is there a mediation effect here?
crono = mediate(diba.fit1, both.fit1,
                treat = 'diet_PC1', mediator = 'Cronobacter', boot = T)

#Gordonibacter
#Does diet explain phenotype?
diet.fit2 <- glm(phenotype ~ diet_PC1,
                 family = binomial("logit"), data = df_mediation)

#Does diet explain Gordonibacter?
diba.fit2 <- lm(Gordonibacter ~ diet_PC1, data = df_mediation)

#Does Gordonibacter explain phenotype?
bact.fit2 <- glm(phenotype ~ Gordonibacter,
                 family = binomial("logit"), data = df_mediation)

#Does diet explain phenotype in the presence of Gordonibacter?
both.fit2 <- glm(phenotype ~ diet_PC1 + Gordonibacter,
                 family = binomial("logit"), data = df_mediation)

#Is there a mediation effect here?
gordo = mediate(diba.fit2, both.fit2,
                treat = 'diet_PC1', mediator = 'Gordonibacter', boot = T)
```
Notice that in the `mediate` function calls, we're essentially estimating the explanatory potential of diet on our phenotype in the presence of our bacteria, in light of the fact that diet also explains our bacterial abundance.
\newpage
Let's take a look at the model summaries. We've picked the first bacterium, Cronobacter, to display a statistically significant mediation effect.
```r
#Collect the relevant data to display in tables
res_diet.fit1 <- coefficients(summary(diet.fit1))
res_diba.fit1 <- coefficients(summary(diba.fit1))
res_bact.fit1 <- coefficients(summary(bact.fit1))
res_both.fit1 <- coefficients(summary(both.fit1))

#Plot the results in nice-looking tables:
kable(res_diet.fit1, digits = 3,
      caption = "Diet significantly explains phenotype (Estimate of 0.241).")

kable(res_diba.fit1, digits = 3,
      caption = "Diet significantly explains Cronobacter.")

kable(res_bact.fit1, digits = 3,
      caption = "Cronobacter significantly explains phenotype. This comes as no surprise, as we saw this in our initial differential abundance analysis.")

kable(res_both.fit1, digits = 3,
      caption = "In the presence of Cronobacter, diet significantly explains phenotype less well (Estimate of 0.199).")
```
Looks like we may have a mediation effect here, so let's check out the results of our mediation analysis. ACME stands for Average Causal Mediation Effect, whereas ADE stands for Average Direct Effect.
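As a reference point for reading the output: in this framework, the total effect of diet on the phenotype decomposes into the two quantities reported below,

$$\text{Total Effect} = \text{ACME} + \text{ADE}$$

so the ACME captures the portion of the effect that travels through the mediator (the microbe), while the ADE captures the remainder.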
```r
summary(crono)

#Let's view the same information in a figure:
plot(crono, main = "Cronobacter as a mediator of the effect of diet on schizophrenia")
```
Here, we see that our estimate of the Average Causal Mediation Effect lies clearly away from zero: we have a partial mediation. In other words, in this example, it could be the case that part of the effect of diet on schizophrenia happens because diet also influences Cronobacter, which in turn influences schizophrenia.
\newpage
Now, let's take a look at a negative example with Gordonibacter.
```r
#Collect the relevant data to display in tables
res_diet.fit2 <- coefficients(summary(diet.fit2))
res_diba.fit2 <- coefficients(summary(diba.fit2))
res_bact.fit2 <- coefficients(summary(bact.fit2))
res_both.fit2 <- coefficients(summary(both.fit2))

#Plot the results in nice-looking tables:
kable(res_diet.fit2, digits = 3,
      caption = "Diet significantly explains phenotype (Estimate of 0.241).")

kable(res_diba.fit2, digits = 3,
      caption = "Diet does not significantly explain Gordonibacter.")

kable(res_bact.fit2, digits = 3,
      caption = "Gordonibacter significantly explains phenotype. This comes as no surprise, as we saw this in our initial differential abundance analysis.")

kable(res_both.fit2, digits = 3,
      caption = "In the presence of Gordonibacter, diet explains phenotype even better (Estimate of 0.263).")
```
This time, it doesn't look like there is a mediation effect, but let's confirm by checking the results of our mediation analysis. Again, ACME stands for Average Causal Mediation Effect, whereas ADE stands for Average Direct Effect.
```r
summary(gordo)

#Let's view the same information in a figure:
plot(gordo, main = "Gordonibacter as a mediator of the effect of diet on schizophrenia")
```
Here, we see that our estimate of the Average Causal Mediation Effect lies squarely on zero. No mediation. This makes sense, as diet couldn't significantly explain the abundance of Gordonibacter.
\newpage
Here, we will demonstrate how to integrate and subsequently analyse two 'omics data sets. For this, we will use the `anansi` framework and package. Studies including both microbiome and metabolomics data are becoming more common. Often, it would be helpful to integrate both data sets in order to see if they corroborate each other's patterns. Naively correlating all features against all features is imprecise and likely to yield spurious associations; for instance, testing 100 metabolites against 1,000 functions already amounts to 100,000 comparisons, around 5,000 of which would be expected to pass p < 0.05 by chance alone.
The `anansi` package computes and compares the associations between the features of two 'omics data sets that are known to interact, based on a database such as KEGG. `anansi` takes a knowledge-based approach to constrain the association search space, only considering metabolite-function interactions that have been recorded in a pathway database.
We'll load a complementary training data set using `data(FMT_data)`. This loads a curated snippet from the data set described in more detail here: https://doi.org/10.1038/s43587-021-00093-9. A very early version of `anansi` was used to generate "Extended Data Fig. 7" in that paper.
```r
#Install and load anansi
#devtools::install_github("thomazbastiaanssen/anansi")
library(anansi)

#Load ggplot2 and ggforce to plot results
library(ggplot2)
library(ggforce)

#Load the dictionary and complementary human-readable names for KEGG compounds and orthologues
data(dictionary)
#Load example data + metadata from the FMT Aging study
data(FMT_data)
```
The main `anansi` function expects data in the `anansiWeb` format: basically, a list with exactly three tables.

The first table, `tableY`, should be a count table of metabolites. The second table, `tableX`, should be a count table of functions. Both tables should have columns as features and rows as samples.

The third table should be a binary adjacency matrix with the column names of `tableY` as rows and the column names of `tableX` as columns. Such an adjacency matrix is provided in the `anansi` library and is referred to as a dictionary (because you use it to look up which metabolites interact with which functions). We can load it using `data(dictionary)`.
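To make the expected format concrete, below is a minimal toy sketch of the three tables. All sample and feature names here (sample_1, cpd_A, KO_1, ...) are made up for illustration; in the real analysis we use the dictionary shipped with `anansi` instead.

```r
#A minimal toy sketch of the anansiWeb input format (hypothetical feature names).
tableY <- matrix(rnorm(12), nrow = 6,
                 dimnames = list(paste0("sample_", 1:6), c("cpd_A", "cpd_B")))

tableX <- matrix(rnorm(18), nrow = 6,
                 dimnames = list(paste0("sample_", 1:6), paste0("KO_", 1:3)))

#Binary adjacency matrix: row names match colnames(tableY), column names match
#colnames(tableX). TRUE marks a known metabolite-function interaction.
toy_dictionary <- matrix(c(TRUE,  FALSE, TRUE,
                           FALSE, TRUE,  FALSE),
                         nrow = 2, byrow = TRUE,
                         dimnames = list(c("cpd_A", "cpd_B"), paste0("KO_", 1:3)))
```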
Though this example uses metabolites and functions, `anansi` is able to handle any type of 'omics data, as long as there is a dictionary available. Because of this, `anansi` uses the type-naive nomenclature `tableY` and `tableX`.
The Y and X refer to the position these measurements will have in the linear modeling framework:
$$Y \sim X \times {\text{covariates}}$$ \newpage
Two common questions in the host-microbiome field are "Who's there?" and "What are they doing?".
Techniques like 16S sequencing and shotgun metagenomics sequencing are most commonly used to answer the first question.
The second question can be a bit more tricky; often we'll need functional inference software to address it.
For 16S sequencing, algorithms like `PICRUSt2` and `Piphillin` can be used to infer function. For shotgun metagenomics, `HUMAnN3` from the bioBakery suite or `woltka` can be used.
All of these algorithms can produce functional count data in terms of KEGG Orthologues (KOs). These tables can be directly plugged into `anansi`.
```r
#Clean and CLR-transform the KEGG orthologue table.

#Only keep functions that are represented in the dictionary.
KOs <- FMT_KOs[row.names(FMT_KOs) %in% sort(unique(unlist(anansi_dic))), ]

#Cut the decimal part off.
KOs <- floor(KOs)

#Ensure all entries are numbers.
KOs <- apply(KOs, c(1, 2), function(x) as.numeric(as.character(x)))

#Remove all features with < 10% prevalence in the dataset.
KOs <- KOs[apply(KOs == 0, 1, sum) <= (ncol(KOs) * 0.90), ]

#Perform a centered log-ratio transformation on the functional count table.
KOs.exp <- clr_c(KOs)

#anansi expects samples to be rows and features to be columns.
t1 <- t(FMT_metab)
t2 <- t(KOs.exp)
```
The `weaveWebFromTables()` function can be used to parse the tables that we prepared above into an `anansiWeb` object. The `anansiWeb` format is the required input for the main `anansi` workflow.
```r
web <- weaveWebFromTables(tableY = t1, tableX = t2, dictionary = anansi_dic)
```
\newpage
The main workspider in this package is called `anansi`. Generally, you want to give it three arguments. First, there's `web`, an `anansiWeb` object such as the one we generated in the step above. Second, there's `formula`, which should be a formula. For instance, to assess differential associations between treatments, we use the formula `~Treatment`, provided we have a column with that name in our `metadata` object, the third argument.
```r
anansi_out <- anansi(web      = web,          #Generated above
                     method   = "pearson",    #Define the type of correlation used
                     formula  = ~ Legend,     #Compare associations between treatments
                     metadata = FMT_metadata, #With data referred to in formula as column
                     adjust.method = "BH",    #Apply the Benjamini-Hochberg procedure
                     verbose  = T             #To let you know what's happening
                     )
```
`anansi` gives a complex nested `anansiYarn` object as output. Two functions exist that will wrangle your data into more friendly formats: `spinToLong()` and `spinToWide()`, which return long and wide format data.frames, respectively. For general reporting, we recommend sticking to the wide format, as it's the most legible. You can also use the `plot()` method on an `anansiYarn` object to gain some insight into the state of your p, q, R and R^2^ parameters.
```r
anansiLong <- spinToLong(anansi_output = anansi_out, translate = T,
                         Y_translation = anansi::cpd_translation,
                         X_translation = anansi::KO_translation)

#Now it's ready to be plugged into ggplot2, though let's clean up a bit more.
#Only consider interactions where the entire model fits well enough.
anansiLong <- anansiLong[anansiLong$model_full_q.values < 0.1, ]
```
\newpage
The long format is convenient for plugging the data into `ggplot2`. Here, we recreate part of the results from the FMT Aging study.
```r
ggplot(data = anansiLong,
       aes(x     = r.values,
           y     = feature_X,
           fill  = type,
           alpha = model_disjointed_Legend_p.values < 0.05)) +

  #Make a vertical dashed red line at x = 0
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red") +

  #Points show raw correlation coefficients
  geom_point(shape = 21, size = 3) +

  #Facet per compound
  ggforce::facet_col(~feature_Y, space = "free", scales = "free_y") +

  #Fix the scales, labels, theme and other layout
  scale_y_discrete(limits = rev, position = "right") +
  scale_alpha_manual(values = c("TRUE" = 1, "FALSE" = 1/3),
                     "Disjointed association\np < 0.05") +
  scale_fill_manual(values = c("Young yFMT" = "#2166ac",
                               "Aged oFMT"  = "#b2182b",
                               "Aged yFMT"  = "#ef8a62",
                               "All"        = "gray"),
                    breaks = c("Young yFMT", "Aged oFMT", "Aged yFMT", "All"),
                    "Treatment") +
  theme_bw() +
  ylab("") +
  xlab("Pearson's rho")
```
Here, we can see the per-group correlations between metabolite-function pairs in terms of Pearson's correlation coefficient on the x-axis. Opaque points indicate significantly disjointed associations, meaning that these associations have a significantly different slope between groups.
\newpage
Functional modules, such as the Gut Brain modules shown here, can be a valuable framework through which to investigate your data. One major benefit of this framework is that it greatly reduces your search space to the specific functional pathways you're interested in. This means far fewer statistical tests, which in turn preserves statistical power when correcting for the false discovery rate (FDR).
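As a toy illustration of that last point (simulated numbers, not from the study), consider what the Benjamini-Hochberg procedure does to the same raw p-value in a KO-sized versus a module-sized search space:

```r
#Toy illustration: one true signal (p = 0.0001) among 10,000 KO-level tests
#versus among 40 module-level tests; the null p-values are simulated as uniform.
set.seed(1)
p_KOs  <- c(0.0001, runif(9999))
p_GBMs <- c(0.0001, runif(39))

min(p.adjust(p_KOs,  method = "BH")) #Signal is washed out by the large search space
min(p.adjust(p_GBMs, method = "BH")) #Signal comfortably survives FDR correction
```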
Here, we will demonstrate how we got to the Gut Brain modules from the schizophrenia data set. Be warned that the first few operations here are computationally very expensive and should be performed on a server or similar if possible.
First, let's download the necessary data from `curatedMetagenomicData`:
```r
#Load the relevant libraries
library(curatedMetagenomicData)
library(SummarizedExperiment)

#Define what data we're interested in. The "|" sign here signifies that we want both.
query <- "2021-03-31.ZhuF_2020.relative_abundance|2021-03-31.ZhuF_2020.gene_families"

#Download the specified data. This will take time.
ZhuF <- curatedMetagenomicData(query, counts = T, dryrun = F)

#Extract the relevant data from complex SummarizedExperiment objects
Zhu_F_gene_families = as.matrix(SummarizedExperiment::assay(ZhuF[[1]]))    #Functions
Zhu_F_microbiome   = data.frame(SummarizedExperiment::assay(ZhuF[[2]]))   #Taxa
Zhu_F_metadata     = data.frame(SummarizedExperiment::colData(ZhuF[[2]])) #Metadata

genus_lv_counts <- Zhu_F_microbiome %>%
  rownames_to_column("X") %>%

  #Collapse to genus level
  mutate(X = str_remove(X, "\\|s__.*")) %>%
  group_by(X) %>%
  summarise(across(everything(), sum)) %>%
  ungroup() %>%

  #Clean up the names
  mutate(X = str_remove(X, ".*\\|f__")) %>%
  mutate(X = str_replace(X, "\\|g__", "_")) %>%
  mutate(X = str_replace(X, "Clostridiales_unclassified_", "Clostridiales_")) %>%
  filter(X != "Clostridiales_Clostridiales_unclassified") %>%

  #Restore row.names
  column_to_rownames("X")

#Should be identical to the file used
waldo::compare(sort(row.names(genus_lv_counts)), sort(row.names(counts)))

#Now that we have our data, we can write the individual tables to csv files.
#We'll focus on the gene families here, as they're necessary to compute Gut Brain modules.
write.csv(Zhu_F_gene_families, file = "uniref90.csv")
write.csv(genus_lv_counts,     file = "counts.csv")
```
`Zhu_F_gene_families` contains the functional microbiome in terms of UniRef90 gene families. For functional module analysis, we typically want to get to KEGG orthologues (KOs). Because `curatedMetagenomicData` gives essentially bioBakery output, we can use commands from the Python-based `HUMAnN3` pipeline to translate our UniRef90 table to KEGG orthologues. We will not go into this deeply here, but see the excellent documentation at https://github.com/biobakery/humann#humann_regroup_table
The next snippet is not R code, but rather Bash code.
```bash
#First, we may want to change our uniref90 file so that it uses tabs instead of commas
sed -E 's/("([^"]*)")?,/\2\t/g' uniref90.csv > uniref90.tsv

#Then, let's use humann_regroup_table from HUMAnN3 to convert to KEGG orthologues:
humann_regroup_table -i uniref90.tsv -g uniref90_ko -o guidebook_out_KOs.tsv
```
\newpage
Now that we have our functional microbiome in terms of KEGG orthologues, we can load it back into R. An added benefit is that the KEGG orthologue table is much smaller than the UniRef90 table, so we can deal with it locally on our computer if we so choose. For this, we will require the `omixerRpm` library, which can be found on GitHub (omixer/omixer-rpmR). If you're working with functional inference data from a 16S experiment, such as output from `PICRUSt2`, you should be able to read in the table and compute functional modules starting at this step. In that case, the file you're looking for would be called something like `pred_metagenome_unstrat.tsv`.
```r
#Load the required package
library(omixerRpm) #devtools::install_github("omixer/omixer-rpmR")

#Load the KEGG orthologue table into R.
#Note that KEGG orthologue names should be in the first column, not the row names.
KOs <- read.delim("guidebook_out_KOs.tsv", header = T)

#listDB will tell you which databases are available to annotate the functional data with.
listDB()

#Pick the most recent GBM database
db <- loadDB(name = "GBMs.v1.0")

#Calculate GBM abundance and convert the output to a nice data.frame
GBMs <- rpm(x = KOs, module.db = db)
GBMs <- asDataFrame(GBMs, "abundance")

#Write the file to a csv to save it.
write.csv(GBMs, file = "GBMs_guidebook.csv")

#While we're at it, let's do GMMs too. First, check the names of the available databases:
listDB()

#Pick the most recent GMM database
db <- loadDB(name = "GMMs.v1.07")

#Calculate GMM abundance and convert the output to a nice data.frame
GMMs <- rpm(x = KOs, module.db = db)
GMMs <- asDataFrame(GMMs, "abundance")

#Write the file to a csv to save it.
write.csv(GMMs, file = "GMMs_guidebook.csv")
```
And the resulting files are ready for statistical analysis! I would like to note here that it's also possible to perform a stratified functional module analysis, where the contribution of each taxon to each functional module is also considered. However, this explosively increases the dimensionality of your data (i.e. you get way more rows in our case). I would only recommend doing this as a targeted analysis as the power of any statistical tests will suffer greatly and the results will be almost impossible to interpret.
\newpage
Volatility refers to the degree of instability (or change over time) in the microbiome. High volatility, i.e. an unstable microbiome, has been associated with an exaggerated stress response and conditions like IBS. Here, we'll demonstrate how one would go about calculating volatility in a real data set. Volatility analysis requires at least two time points per subject. Because of this, we cannot use the schizophrenia data set, which only features single snapshots of the microbiome.
We'll be taking a look at the data sets used in the original volatility paper: Volatility as a Concept to Understand the Impact of Stress on the Microbiome (DOI: 10.1016/j.psyneuen.2020.105047). Very briefly, mice were separated into two groups: Control and Stress. Faecal samples were taken twice, with a 10-day period in between. During this 10-day period, the mice in the Stress group were subjected to daily social defeat stress, whereas the Control mice were left alone. When we compared the degree of change in the microbiome (i.e. volatility) between the two groups of mice, the stressed mice consistently displayed higher levels of volatility than the control mice. Our reviewers asked us to replicate the experiment, which we did; the two cohorts are labeled discovery and validation. This data has been included in the `volatility` library on GitHub for easy access.
Traditionally, microbiome studies featuring high-throughput sequencing data only consider a single time point. However, there is utility in considering microbiomes as dynamic microbial ecosystems that change over time. By measuring the microbiome longitudinally and computing volatility, additional information can be revealed that would otherwise be missed. For instance, in the original volatility study, we found that volatility after stress is positively associated with severity of the stress response, including in terms of behaviour and hypothalamic-pituitary-adrenal (HPA) axis activity, both in mice and in humans.
OK, now let's get started.
```r
#Install and load the volatility library
library(volatility) #devtools::install_github("thomazbastiaanssen/volatility")

#Load tidyverse to wrangle and plot results.
library(tidyverse)

#Load example data + metadata from the volatility study.
data(volatility_data)
```
\newpage
The main `volatility` function does all the heavy lifting here. It expects two arguments. The first argument is `counts`, a microbiome feature count table with columns as samples and rows as features. The `vola_genus_table` object is an example of an appropriately formatted count table:
```r
vola_genus_table[4:9, 1:2]
```
The second argument is `metadata`, a vector in the same order as the count table, denoting which samples are from the same source. The column `ID` in `vola_metadata` is appropriate for this:
```r
head(vola_metadata, 5)
```
```r
#This part should feel very reminiscent of what we did in chapters 1 & 3.
counts <- vola_genus_table[, vola_metadata$sample_ID]

#Fork off your count data so that you always have an untouched version handy.
genus <- counts

#Make sure our count data is all numbers
genus <- apply(genus, c(1, 2), function(x) as.numeric(as.character(x)))

#Remove features with prevalence < 10% in two steps:
#First, determine how often every feature is absent in a sample
n_zeroes <- rowSums(genus == 0)

#Then, remove features that are absent in more than your threshold (90% in this case).
genus <- genus[n_zeroes <= round(ncol(genus) * 0.90), ]

#Perform a CLR transformation
genus.exp <- clr_c(genus)

#Apply the base R principal component analysis function on our CLR-transformed data.
data.a.pca <- prcomp(t(genus.exp))
```
\newpage
```r
#Extract the amount of variance the first four components explain for plotting.
pc1 <- round(data.a.pca$sdev[1]^2 / sum(data.a.pca$sdev^2), 4) * 100
pc2 <- round(data.a.pca$sdev[2]^2 / sum(data.a.pca$sdev^2), 4) * 100
pc3 <- round(data.a.pca$sdev[3]^2 / sum(data.a.pca$sdev^2), 4) * 100
pc4 <- round(data.a.pca$sdev[4]^2 / sum(data.a.pca$sdev^2), 4) * 100

#Extract the scores for every sample for the first four components for plotting.
pca = data.frame(PC1 = data.a.pca$x[, 1],
                 PC2 = data.a.pca$x[, 2],
                 PC3 = data.a.pca$x[, 3],
                 PC4 = data.a.pca$x[, 4])

#Add relevant information from the metadata.
#Note that ID here refers to the mouse ID, not the sample ID.
pca$ID        = vola_metadata$ID
pca$Legend    = vola_metadata$treatment
pca$Timepoint = vola_metadata$timepoint
pca$Cohort    = vola_metadata$cohort

#Plot the first two components of the PCA.
ggplot(pca, aes(x = PC1, y = PC2,
                fill = Legend, colour = Legend,
                shape = Timepoint, group = ID)) +

  #Add a line first; it will link points that share an ID.
  geom_line() +

  #Then add the points.
  geom_point(size = 3, col = "black") +

  #Plot the two cohorts separately.
  facet_wrap(~Cohort, scales = "free_x", strip.position = "top") +

  #Improve appearance.
  scale_fill_manual(  values = c("Control" = "#1f78b4", "Stress" = "#e31a1c")) +
  scale_colour_manual(values = c("Control" = "#3690c0", "Stress" = "#cb181d")) +
  scale_shape_manual( values = c("Pre" = 21, "Post" = 24)) +
  theme_bw() +
  xlab(paste("PC1: ", pc1, "%", sep = "")) +
  ylab(paste("PC2: ", pc2, "%", sep = "")) +
  theme(text = element_text(size = 12)) +
  guides(fill = guide_legend(override.aes = list(shape = 22)))
```
We can see that points from the same mouse are connected by a line. It looks like some lines are longer than others, implying that some microbiomes have changed more than others over the 10 days. However, we're only looking at about 30% of the variance here, so it's hard to say anything conclusive.
Volatility between two samples can be easily calculated using the titular `volatility` function in the library of the same name. Under the hood, volatility is calculated as the Euclidean distance over CLR-transformed count data.
```r
vola_out <- volatility(counts = genus, metadata = vola_metadata$ID)

head(vola_out)
```
The output of the main `volatility` function is a data.frame with two columns. `ID` corresponds to the pairs of samples passed on in the `metadata` argument, whereas `volatility` shows the measured volatility between those samples in terms of Aitchison distance (Euclidean distance of CLR-transformed counts).
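For intuition, here is a minimal sketch of that computation, assuming (hypothetically) that the first two columns of `genus.exp` came from the same mouse at the two timepoints:

```r
#Manual volatility between two (hypothetically paired) CLR-transformed samples.
clr_t1 <- genus.exp[, 1]
clr_t2 <- genus.exp[, 2]

#Aitchison distance: Euclidean distance between CLR-transformed compositions.
sqrt(sum((clr_t1 - clr_t2)^2))
```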
\newpage
```r
vola_out %>%
  #Merge the volatility output with the rest of the metadata using the shared "ID" column.
  left_join(vola_metadata[vola_metadata$timepoint == "Pre", ], "ID") %>%

  #Pipe into ggplot.
  ggplot(aes(x = treatment, y = volatility, fill = treatment)) +

  #Define geoms: boxplots overlaid with data points in this case.
  geom_boxplot(alpha = 1/2, coef = 10) +
  geom_point(shape = 21) +

  #Split the plot by cohort.
  facet_wrap(~cohort) +

  #Tweak appearance.
  scale_fill_manual(values = c("Control" = "#3690c0", "Stress" = "#cb181d")) +
  theme_bw() +
  xlab("") +
  ylab("Volatility (Aitchison distance)")
```
\newpage
Here, we have presented four separate primers for techniques from four distinct topics, namely causal inference, multi-omics integration, mesoscale analysis and temporal analysis, and how one may go about applying them to microbiome-gut-brain axis experiments. While these techniques and corresponding fields may seem unrelated, we are of the opinion that all four will be essential to move the microbiome-gut-brain axis field forward towards an ecology-oriented, mechanistic understanding.
As stated in the discussion of part 1 of this perspective, this document is just a template. Depending on the experimental setup, findings and experimental questions, you may want to choose a different approach. Given the highly complex nature of microbiome data, one should ideally avoid blindly applying models and pipelines without understanding what they are doing. D.R. Cox is famously ascribed the statement: "Most real life statistical problems have one or more nonstandard features. There are no routine statistical questions; only questionable statistical routines." We find this holds true for the microbiome as well, even more so for these more advanced techniques than for the ones presented in part 1 of this perspective piece. In particular, it is crucial that we apply our biological knowledge to determine how exactly to use these techniques. For instance, when performing the mediation analysis in section 1, it is crucial to have a biological reason why one would expect the effect of a variable on an outcome to be mediated by a second variable. Similarly, with the multi-omics integration using anansi in section 2, we rely on pre-existing biological knowledge, in the form of the KEGG database, to only investigate interactions between metabolites and enzymatic functions that could actually take place, rather than naively assessing all of them.
Clear communication, both in terms of describing and explaining our methods and in terms of figure presentation, is essential for the health of the field. Indeed, failing to do so can lead to confusion among our peers. We hope that both aspiring and veteran bioinformaticians will find our guide helpful. We have tried to model this piece after what we would have loved to have had access to ourselves when we first set out to study the microbiome.
\newpage
```r
sessioninfo::session_info()
```