This document indicates how to use our model of HOst-Microbiota Evolution.
The first part presents how to simulate mock microbiota.
The second part explains how to perform an empirical application; it is illustrated with the great apes microbiota data.
The last part shows how to interpret the HTML outputs of HOME.
Citation: Perez‐Lamarque, B., & Morlon, H. (2019). Characterizing symbiont inheritance during host‐microbiota evolution: application to the great apes gut microbiota. Molecular Ecology Resources 19:1659-1671.
Contact: Benoît Perez-Lamarque, benoit.perez.lamarque@gmail.com
Installation\ Performing Simulations\ Running Empirical application:\ * Creating microbial alignments for each OTU\ * Example of empirical applications - great apes microbiota\ * Running HOME on the empirical application\ Interpretation of the Results:\ * Example 1: Results from a simulation with horizontal transmission\ * Example 2: Results from a simulation with strict vertical transmission\ * Example 3: Results from a simulation with environmental acquisition
Our model is part of the R-package HOME and can be installed from GitHub using devtools:
library(devtools)
install_github("BPerezLamarque/HOME", dependencies = TRUE)
Alternatively, our model is also part on the R package RPANDA (Morlon et al., 2016) available from GitHub.
library(devtools)
install_github("hmorlon/PANDA",ref="Benoit", dependencies = TRUE)
You can provide a host tree (e.g. an empirical tree) and simulate the evolution of a mock microbiota on it. Your tree must be binary, rooted and ultrametric. You can either directly provide a host tree in the function sim_microbiota, or you can have the host tree saved in your working directory (thus, its filename must be well-formated host_tree_"name".tre in a Newick format).
If you don't provide a host tree, the function sim_microbiota will randomly simulate a host tree (with a pure-birth process) according to the number of tips (i.e. number of host species) you have provided.
setwd("/my_working_directory/") # working directory where all the files and folders will be created.
# name of your simulation
name <- "simulation_tree_1"
# name of the different microbial OTU of your simulations (hereafter you will model the evolution of 6 OTUs on your host tree)
name_index <- c("OTU1","OTU2","OTU3","OTU4","OTU5","OTU6")
# simulated scenarii for each OTU:
simul <- c(0,1,3,5,"indep","indep")
# each simulated OTU has its own evolutionary history:
# i) "0" simulates strict vertical transmission (i.e. O host-switch)
# ii) any positive integer simulates horizontal transmissions (with this given number of host-switches)
# iii) "indep" simulates environmental acquisition (i.e. independent evolutions - the simulated OTU evolved on an independent simulated tree)
# Load the host tree to simulate the evolution of the OTUs on it
host_tree <- read.tree("my_tree.tre")
# Alternatively, if you don't provide a host tree, you must add the number of tips of the simulated host tree
# n <- 20
# proportion of variable nucleotides in the alignment
proportion_variant <- 0.1
# simulated substitution rate
simulated_mu <- 1 # simulated_mu=1 corresponds to on average 1 mutation per variable nucleotide
# total number of nucleotides in the alignment
N <- 300
# number of cores to run the analyses
nb_cores <- 1 # if you don't run in a multi-cores machine, the default value is 1
# seed used for simulations
seed <- 1
sim_microbiota(name=name, name_index=name_index, simul=simul,
provided_tree=host_tree, mu=simulated_mu, N=N, proportion_variant=proportion_variant,
seed=seed, nb_cores=nb_cores)
Then, you can proceed to the parameters estimation by running HOME on the simulated host-microbiota data.
# possible numbers of host-switches to test during the inference
lambda <- c(1:25)
# number of trees (for Monte Carlo estimation of the number of switches)
nb_tree <- 10000
# number of randomizations in the model selection testing independent evolutions (R parameter)
nb_random <- 10
# rarefactions on the number of trees
raref <- FALSE # if TRUE rarefactions on the number of trees will be performed
HOME_model(name=name, name_index=name_index, provided_tree=host_tree, nb_tree=nb_tree, lambda=lambda,
empirical=FALSE, raref=raref, nb_random=nb_random, seed=seed, nb_cores=nb_cores)
In order to run HOME, you need first to make the microbial alignments for each OTU of your empirical microbiota. The first step consists in the usual clustering of the raw reads into OTUs. Thus, the second step makes the OTU alignments for running HOME on them (using our own bash script: for each core OTU, it picks the most abundant sequence for every host and align them). Different pipelines can be used to obtain these microbial alignments for each OTU.
At the end of these two steps, all OTU alignments must be stored in a specific folder (i.e. the working directory where you will run HOME) with the filenames formatted as alignment_"name"_"OTUXXX".fas, where "name" will be the name our your HOME run ("name"), and "OTUXXX" is the name of the specific OTU ("name_index").
Examples of bash pipelines to prepare alignments before running HOME are available in the following.
This pipeline clusters OTU given a fixed similarity threshold (97%, or any other value) by combining QIIME1 and UPARSE thanks to the BMP pipelines. The bash pipelines used for studying the microbiota of great apes (Perez-Lamarque & Morlon, 2019) is available here. Although, this pipeline has been designed for 454 sequencing technology, it can also be used on Illumina datasets.
This pipeline clusters reads into SWARM OTUs (Mahé et al., 2014; Mahé et al., 2015): OTU are defined without chosing an global clustering threshold. The step 1 of this pipeline is a direct application of Frédéric Mahé's metabarcoding pipeline using VSEARCH and SWARM. This pipeline is specifically designed for Illumina datasets and is available here.
This pipeline clusters reads into OTUs given a fixed similarity threshold (97%, or any other value). The step 1 of this pipeline is derived from Frédéric Mahé's metabarcoding pipeline using VSEARCH. This pipeline is specifically designed for Illumina datasets and is available here.
A one-to-one correspondance between host tips in the tree and the fasta header of each OTU alignment is mandatory. If multiple sequences correspond to a unique host, there are several solutions : (1) you can only keep one sequence (the most abundant, and assumes that the other sequences are likely to come from PCR/sequencing errors); but if you really want to keep these sequences in the analyses, you have to add new tips in the host tree: (2) either by adding new tips with zero branch lengths to match the number of sequences (similar to what we did in Perez-Lamarque & Morlon (2019) ; see here for more details and a script to add new tips with branch lengths close to zero ), or (3) alternatively by adding a coalescent tree shape at the tip of each species (to model population differentiation for each species that gives the multiple sequences per host).
Let's run HOME for 3 OTUs from the great apes microbiota. First, your working directory must contain the host tree saved with a filename host_tree_"name".tre (Newick format), and all the OTU alignments with the filenames alignment_"name"_"OTUXXX".fas (FASTA format). You can also directly provide the host tree in the function HOME_model.
For instance, in this empirical application (name="great_apes"), you must provide:
You can directly download this example from RPANDA:
setwd("/my_working_directory/") # working directory where all the files and folders will be created.
example_great_apes_microbiota(name="great_apes")
# name of the empirical application
name <- "great_apes"
# name of the different OTUs
name_OTU <- c("OTU0001","OTU0002","OTU0003")
# possible numbers of host-switches to test during the inference
lambda <- c(1:25)
# number of trees (for Monte Carlo estimation of the number of switches)
nb_tree <- 10000
# number of randomizations in the model selection testing independent evolutions
nb_random <- 10
# rarefactions on the number of trees
raref <- FALSE # if TRUE rarefactions on the number of trees are performed
# number of cores to run the analyses
nb_cores <- 1 # if you don't run in a multi-cores machine, the default value is 1
# seed used for simulations
seed <- 1
provided_tree <- read.tree(paste0("host_tree_",name,".tre"))
HOME_model(name=name, name_index=name_OTU, provided_tree=provided_tree, nb_tree=nb_tree, lambda=lambda,
empirical=TRUE, raref=raref, nb_random=nb_random, seed=seed, nb_cores=nb_cores)
The results of each HOME run are available in the folder "/my_working_directory/figures/". For each OTU, a HTML file summarizes all the results with tables and figures. Here, we provide 3 examples of results interpretations.
Parameters | Values ------------- | ------------- Number of symbiont-host association:| 20 Simulated substitution rate:| 1 Number of simulated switches:| 3 Seed for simulations:| 30 Sequences length: |300 Probability segregated sites:| 0.1 Number of unvaried sites:| 281 Number of strictly variant sites:| 19
Parameters | Values ------------- | ------------- Inferred substitution model:| K80 Transition/Transversion ratio:| 0.6558 Estimated substitution rate:| 0.8519 Estimated number of switches:| 3 Associated likelihood:| 137.151 p-value: Strict vertical transmission:| 6e-04 p-value: Independent evolutions (nb switches):| 0 p-value: Independent evolutions (subs. rate):| 0
The most likely number of host-switches:
ksi | mu | -log(Likelihood) ------------- | ------------- | ------------- 3 | 0.8519 | 137.151
Minus log likelihood as a function of the number of host-switches (the minimum of this plot corresponds to the estimated number of switches):
Estimated substitution rate as a function of the number of switches:
Likelihood ratio test testing the model of strict vertical transmission (ksi=0). Strict vertical transmission is rejected if p-value < 0.05.
Results of the likelihood ratio test. The grey curve correponds to the Chi2 distribution with df=1. The dark blue line (resp. light) stands for the 0.05 (resp. 0.01) p-value threshold and the dashed orange line is the observed LRT ratio.
Model selection on independent evolutions.
Test | p-values ------------- | ------------- Empirical ranking (ksi distribution)| 0 Empirical ranking (mu distribution)| 0
Representation of the estimated numbers of switches and the estimated substitution rates for the randomized alignments (in blue) and the empirical alignment (in orange). Independent evolutions can be rejected if the orange dot stands alone in the bottom left corner (i.e. rejected if p-values < 0.05).
Estimated rate matrix:
Rates | A |C |G |T ------------- | -------------| -------------| -------------| ------------- A | -1 | 0.17 | 0.66 | 0.17 | C | 0.17 | -1 | 0.17 |0.66 | G | 0.66 | 0.17 | -1 | 0.17 | T | 0.17 | 0.66 | 0.17 | -1 |
Nucleotide frequencies:
A |C |G |T ------------- | -------------| -------------| ------------- 0.25 | 0.25 | 0.25 | 0.25
Simulated switch(es): | 1 |2 |3 ------------- | -------------| -------------| ------------- Branch origin | 1 |36| 35 Branch arrival |35| 2 |28 Absolute position| 0.022 |0.089 |0.13
Host tree:
Parameters | Values ------------- | ------------- Number of symbiont-host association:| 20 Simulated substitution rate:| 1 Number of simulated switches:| 0 Seed for simulations:| 30 Sequences length: |300 Probability segregated sites:| 0.1 Number of unvaried sites:| 279 Number of strictly variant sites:| 21
Parameters | Values ------------- | ------------- Inferred substitution model:| K80 Transition/Transversion ratio:| 0.70 Estimated substitution rate:| 0.70 Estimated number of switches:| 0 Associated likelihood:| 149.8 p-value: Strict vertical transmission:| 1 p-value: Independent evolutions (nb switches):| 0 p-value: Independent evolutions (subs. rate):| 0
CONCLUSION: STRICT VERTICAL TRANSMISSION.
The most likely number of host-switches:
ksi | mu | -log(Likelihood) ------------- | ------------- | ------------- 3 | 0.70 | 149.8
Minus log likelihood as a function of the number of switches (the minimum of this plot corresponds to the estimated number of switches):
Estimated substitution rate as a function of the number of switches:
Likelihood ratio test testing the model of strict vertical transmission (ksi=0). Strict vertical transmission is rejected if p-value < 0.05.
Results of the likelihood ratio test. The grey curve correponds to the Chi2 distribution with df=1. The dark blue line (resp. light) stands for the 0.05 (resp. 0.01) p-value threshold and the dashed orange line is the observed LRT ratio.
Model selection on independent evolutions.
Test | p-values ------------- | ------------- Empirical ranking (ksi distribution)| 0 Empirical ranking (mu distribution)| 0
Representation of the estimated numbers of switches and the estimated substitution rates for the randomized alignments (in blue) and the empirical alignment (in orange). Independent evolutions can be rejected if the orange dot stands alone in the bottom left corner (i.e. rejected if p-values < 0.05).
Estimated rate matrix:
Rates | A |C |G |T ------------- | -------------| -------------| -------------| ------------- A | -1 | 0.15 | 0.7 | 0.15 | C | 0.15 | -1 | 0.15 |0.7 | G | 0.7 | 0.15 | -1 | 0.15 | T | 0.15 | 0.7 | 0.15 | -1 |
Nucleotide frequencies:
A |C |G |T ------------- | -------------| -------------| ------------- 0.25 | 0.25 | 0.25 | 0.25
Parameters | Values ------------- | ------------- Number of symbiont-host association:| 20 Simulated substitution rate:| 1 Number of simulated switches:| independent Seed for simulations:| 30 Sequences length: |300 Probability segregated sites:| 0.1 Number of unvaried sites:| 278 Number of strictly variant sites:| 22
Parameters | Values ------------- | ------------- Inferred substitution model:| K80 Transition/Transversion ratio:| 0.54 Estimated substitution rate:| 3.3793 Estimated number of switches:| 25 Associated likelihood:| 270.3 p-value: Strict vertical transmission:| 0 p-value: Independent evolutions (nb switches):| 0.56 p-value: Independent evolutions (subs. rate):| 0.89
CONCLUSION:INDEPENDENT EVOLUTIONS (BASED ON THE NUMBER OF SWITCHES AND THE SUBSTITUTION RATE).
The most likely number of host-switches:
ksi | mu | -log(Likelihood) ------------- | ------------- | ------------- 25 | 3.38 | 270.3
Minus log likelihood as a function of the number of switches (the minimum of this plot corresponds to the estimated number of switches):
Estimated substitution rate as a function of the number of switches:
Likelihood ratio test testing the model of strict vertical transmission (ksi=0). Strict vertical transmission is rejected if p-value < 0.05.
Results of the likelihood ratio test. The grey curve correponds to the Chi2 distribution with df=1. The dark blue line (resp. light) stands for the 0.05 (resp. 0.01) p-value threshold and the dashed orange line is the observed LRT ratio.
Model selection on independent evolutions.
Test | p-values ------------- | ------------- Empirical ranking (ksi distribution)| 0.56 Empirical ranking (mu distribution)| 0.88
Representation of the estimated numbers of switches and the estimated substitution rates for the randomized alignments (in blue) and the empirical alignment (in orange). Independent evolutions can be rejected if the orange dot stands alone in the bottom left corner (i.e. rejected if p-values < 0.05).
Estimated rate matrix:
Rates | A |C |G |T ------------- | -------------| -------------| -------------| ------------- A | -1 | 0.23 | 0.54 | 0.23 | C | 0.23 | -1 | 0.23 |0.54 | G | 0.54 | 0.23 | -1 | 0.23 | T | 0.23 | 0.54 | 0.23 | -1 |
Nucleotide frequencies:
A |C |G |T ------------- | -------------| -------------| ------------- 0.25 | 0.25 | 0.25 | 0.25
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.