knitr::opts_chunk$set(echo = TRUE)
This walk-through provides instructions for implementing the RERconverge package to identify genes whose evolutionary rates shift in association with change in a binary trait. For information on how to download and install RERconverge, see the README on github.
In R, load the RERConverge library.
if (!require("RERconverge", character.only=T, quietly=T)) { require(devtools) install_github("nclark-lab/RERconverge") } library(RERconverge)
This should also download all the files we will be working with to your computer, in the directory where your R library lives. If you'd like to visualize or work with any of these files separately, this is where you can find them:
print(paste(.libPaths()[1],"/RERconverge",sep="")) #This is the path to the files rerpath = paste(.libPaths()[1],"/RERconverge",sep="")
readTrees
To run RERconverge, you will first need to supply a file containing gene trees for all genes to be included in your analysis. This is a tab delimited file with the following information on each line:
Gene_name Newick_tree
An example file is provided in extdata/mammal62aa_meredplus_wCM.trees, which you can view in any text editor.
Now in R, read in the gene trees. The readTrees
function takes quite a while to read in trees for all genes, so we will limit ourselves to the first 200 using max.read
(this will still take a minute or so, so be patient):
toytreefile = "subsetMammalGeneTrees.txt" #change filename once toy trees available mamTrees=readTrees(paste(rerpath,"/extdata/",toytreefile,sep=""), max.read = 200)
First, the code tells us that there are 38298 items, or gene trees, in the file. Since we have set max.read = 200
, it will only read the first 200 of these. Then it says that the maximum number of tips in the gene trees is 62 and, later, it reports that it will use the 32 genes in this set that have data for all 62 species to estimate a master tree. The tree topology for the master tree (but not its branch lengths) will be used for subsequent analyses.
getAllResiduals
The next step is to estimate relative evolutionary rates, or RER, for all branches in the tree for each gene. Intuitively, a gene's RER for a given branch represents how quickly or slowly the gene is evolving on that branch, relative to its overall rate of evolution throughout the tree. For more details about how RER are computed, see [@Chikina2016] and [@Partha2017].
We will use the getAllResiduals
function to calculate RER. This uses the following input variables (all the options set here are also the defaults):
useSpecies
: a vector that can be used to specify a subset of species to use in the analysis. Here we will use the full set of tip labels in the master tree.transform
: the method used to transform the raw data. By transforming the raw data, we reduce the heteroscedasticity (relationship between mean and variance) and the influence of outliers. Here we will use a square-root transform.weighted
: whether to use a weighted regression to estimate RER. Weighting allows further correction for the relationship between mean and variance, which can be directly estimated from the data. scale
: whether to scale the individual branches of the gene trees to account for variance across trees. This scales the variance, though not the mean, of each branch length, using the R function scale
.Here is the basic method, with the recommended settings:
mamRERw=getAllResiduals(mamTrees,useSpecies=mamTrees$masterTree$tip.label, transform = "sqrt", weighted = T, scale = T)
The plots generated by this function show the heteroscedasticity in the original data (on the left) and the data after transformation and weighted regression (on the right). The x-axis displays bins of branch lengths on the tree, and the y-axis is the (log-scaled) variance in these branch lengths across trees. As you can see by comparing the right plot to the left plot, transforming and performing a weighted regression reduces the relationship between the mean branch length (x-axis) and the variance in branch length (y-axis).
Now that we have RERs, we can visualize these for any given gene using the plotRers
function. Here is an example.
#make average and gene tree plots noneutherians <- c("Platypus","Wallaby","Tasmanian_devil","Opossum") par(mfrow=c(1,2)) #The following function is not in the compiled R package but is available in 'plottingFuncs.R' on the 'UpdateVignette' branch.... repopath = '~/repos/RERconverge' source(paste(repopath,'/R/plottingFuncs.R',sep='')) avgtree=plotTreeHighlightBranches(mamTrees$masterTree, outgroup=noneutherians, hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), main="Average tree") #plot average tree bend3tree=plotTreeHighlightBranches(mamTrees$trees$BEND3, outgroup=noneutherians, hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), main="BEND3 tree") #plot individual gene tree #plot RERs par(mfrow=c(1,1)) phenvExample <- foreground2Paths(c("Vole","Squirrel"),mamTrees) plotRers(mamRERw,"BEND3",phenv=phenvExample) #plot RERs
The upper left plot is a tree with branch lengths representing the average rates across all genes. The upper right plot is the same tree, but with branch lengths representing rates specifically for the BEND3 gene. The plot below these represents the estimated RERs for terminal branches. The foreground branches (set here using foreground2Paths
) are highlighted in red. Notice how the RER for vole is negative; this is because the branch leading to vole in the BEND3 tree is shorter than average. On the other hand, the RER for squirrel is positive because the branch leading to squirrel in the BEND3 tree is longer than average.
Here we describe the minimal set of steps to test for association between relative evolutionary rates and a binary trait.
Now we will associate variation in these RERs with variation in traits across the tree. To do so, we first need to provide information about which branches of the tree have the trait of interest (foreground branches). There are several possible ways to do this:
1) Provide a binary trait tree file. This should be a file in Newick format with branch lengths zero for background branches and one for foreground branches. An example is provided in data/MarineTreeBin.txt.
marineb=read.tree("../data/MarineTreeBinCommonNames.txt") marinebrooted = root(marineb,outgroup=noneutherians) mb1 = marineb mb1$edge.length = c(rep(1,length(mb1$edge.length))) par(mfrow=c(1,2)) plot(marinebrooted, main="Trait tree from file (1)") binplot1=plotTreeHighlightBranches(mb1, outgroup=noneutherians, hlspecies=which(marineb$edge.length==1), hlcols="blue", main="Foreground branches highlighted (1)") #alternative way of representing the tree
The plot on the left shows the tree you provided, with branch lengths 0 for all background lineages and branch lengths 1 for foreground lineages. The plot on the right displays the tree with all branch lengths 1 and the foreground lineages highlighted in blue. This binary tree represents the following as foreground lineages: all terminal branches leading to extant marine species, plus the branch leading to the common ancestor of the killer whale and the dolphin.
2) Generate a binary tree from a vector of foreground species using foreground2Tree
. By modifying the clade
argument, you can choose one of several options for how to handle the branches ancestral to the foreground species, as follows:
2a) clade = "ancestral"
: Use maximum parsimony to infer where transitions from background to foreground occurred in the tree, and set those transition lineages to foreground.
marineextantforeground = c("Walrus","Seal","Killer_whale","Dolphin","Manatee") marineb2a = foreground2Tree(marineextantforeground, mamTrees, clade="ancestral")
Here the branch leading to the common ancestor of killer whale and dolphin, as well as the branch leading to the common ancestor of walrus and seal, are foreground, along with the terminal branch leading to the manatee.
2b) clade = "terminal"
: Set only terminal lineages leading to foreground species as foreground.
marineb2b = foreground2Tree(marineextantforeground, mamTrees, clade="terminal")
Here the each terminal branch leading to a marine species is foreground, but no internal branches are foreground.
2c) clade = "all"
: Use maximum parsimony to infer where transitions from background to foreground occurred in the tree, and set those transition lineages, along with all daughter lineages, to foreground.
marineb2c = foreground2Tree(marineextantforeground, mamTrees, clade="all")
Here the foreground branches are all those inferred to be transitional from 2a, as well as the terminal branches. If we had a case in which some branches daughter to the transition branches were not terminal, those would be included in the foreground as well.
2d) clade = "weighted"
: Infer transition and daughter branches as in 2c, but spread a weight of 1 evenly across all branches within each independent foreground clade.
marineb2d = foreground2Tree(marineextantforeground, mamTrees, clade="weighted")
Here all branches in the cetacean and pinniped clades have length 1/3, whereas the terminal branch leading to the manatee has length 1. This is a way of distributing the weight given to each independent convergence event evenly across clades, rather than across lineages.
If you plot all the resulting trees, you can see how the different choices for clade
influence the resulting branch lengths.
par(mfrow=c(2,2)) plot(marine2a,main="ancestral") plot(marine2b,main="terminal") plot(marine2c,main="all") plot(marine2d,main="weighted")
None of these are the same as the binary tree that you specified in (1)!
plot(marineb,main="Manually specified binary tree (1)")
Note that, in this tree, the branch representing the ancestor of the killer whale and the dolphin has a branch length of 1, whereas the branch representing the ancestor of seal and walrus has a branch length of 0. This shows that providing a binary trait tree file (1) (or manually specifying foreground branches (3)) allows you to specify which branches should be foreground with more flexibility than providing a set of foreground species to foreground2Tree
.
3) Use the interactive branch selection tool. The following should open a plot of the master tree. When the GUI opens, select the marine foreground branches (Walrus, Seal, Killer whale, Dolphin, Killer whale-Dolphin ancestor, and Manatee), and click 'End selection.'
marineb3=selectforegroundbranches(mamTrees$masterTree)
You can double check that the tree you created by manual selection (3) matches the binary tree you provided in (1) (it should) using the following code:
marineb3=selectforegroundbranches(mamTrees$masterTree) marineb3rooted=root(marineb3,outgroup=c("Platypus", "Wallaby","Tasmanian_devil","Opossum")) plot(marineb3rooted, main = "Trait tree from manual selection (3)")
tree2Paths
or foreground2Paths
Some of the genes (like BEND3 above) may not have data for all species, meaning that their phylogeny will be a subset of the full phylogeny. To plot RERs for these genes and to correlate them with trait evolution, we run one of two functions that determine how the trait would evolve along all/many possible subsets of the full phylogeny, generating a set of paths. The function tree2Paths
takes a binary tree as input, and the function foreground2Paths
takes a set of foreground species as input. It has the same arguments as foreground2Tree
described above (see option 2 in Reading in or generating trait trees).
Important note: If you are using a binary tree to create paths, it must have the same topology and include the same species as the master tree (see previous section for how this is generated).
phenvMarine=tree2Paths(marineb, mamTrees) phenvMarine2=foreground2Paths(marineextantforeground, mamTrees, clade="all")
correlateWithBinaryPhenotype
Now that we have estimates for the RERs for all genes of interest, as well as a representation of how the trait of interest evolves across the tree, we can use correlateWithBinaryPhenotype
to test for an association between relative evolutionary rate and trait across all branches of the tree.
This uses the following input variables (all the options set here are also the defaults):
min.sp
: the minimum number of species in the gene tree for that gene to be included in the analysis. The default is 10, but you may wish to modify it depending upon the number of species in your master tree.min.pos
: the minimum number of independent foreground (non-zero) lineages represented in the gene tree for that gene to be included in the analysis. The default is 2, requiring the trait to have evolved in at least two separate clades with representatives in the gene tree.weighted
: whether to perform a weighted correlation where branch lengths represent weights. This can be used with the clade=weighted
option in foreground2Tree
to distribute phenotype weights across multiple lineages in a clade. The default is "auto", which will use weighted correlation if any branch lengths are between 0 and 1, and will use standard correlation otherwise.corMarine=correlateWithBinaryPhenotype(mamRERw, phenvMarine, min.sp=10, min.pos=2, weighted="auto")
The text displayed shows which correlation method is used to test for association. Here it uses the default for binary traits: the Kendall rank correlation coefficient, or Tau. For more details on correlation methods for binary and continuous traits, see ???.
The correlateWithBinaryPhenotype
function generates a table with the following output for each gene:
1) Rho: the correlation between relative evolutionary rate and trait across all branches
2) N: the number of branches in the gene tree
3) P: an estimate of the P-value for association between relative evolutionary rate and trait.
Let's take a look at some of the top genes within this set.
head(corMarine[order(corMarine$P),])
ANO2 and BMP10 are two of the top genes.
plotRers(mamRERv,"ANO2",phenv=phenvMarine) plotRers(mamRERv,"BMP10",phenv=phenvMarine)
In these RER plots, the marine lineages specified in the foreground are highlighted in red. The terminal branches are listed in alphabetical order, and internal branches are displayed at the bottom; note the red point at the bottom of each plot indicating the RER for killer whale-dolphin ancestral branch.
In the ANO2 tree, marine lineages have high RER, leading to a positive Rho and a low p-value. In contrast, in the BMP10 tree, marine lineages have low RER. This also yields a low p-value, but with a negative Rho.
To see what the overall pattern of association is across all genes in the set, we can plot a p-value histogram.
hist(corMarine$P, breaks=15, xlab="Kendall P-value", main="P-values for correlation between 200 genes and marine environment")
There appears to be a slight enrichment of low p-values, but since we have only evaluated the first 200 genes from our ~19,000 gene genome-wide set, we should hold off on drawing conclusions from this.
We've now walked through the basic workflow for RERConverge. For more information about these methods and some results relevant to marine and subterranean adaptation, see [@Chikina2016] and [@Partha2017].
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.