knitr::opts_chunk$set(echo = TRUE)

Replace this with a common intro combined with continuous traits.

Include links throughout to allow skipping and returning to earlier sections.

This walk-through provides instructions for implementing the RERconverge package to identify genes whose evolutionary rates shift in association with change in a binary trait. For information on how to download and install RERconverge, see the README on github.

Installing and loading RERconverge

In R, load the RERConverge library.

if (!require("RERconverge", character.only=T, quietly=T)) {
    require(devtools)
    install_github("nclark-lab/RERconverge")
}
library(RERconverge)

This should also download all the files we will be working with to your computer, in the directory where your R library lives. If you'd like to visualize or work with any of these files separately, this is where you can find them:

print(paste(.libPaths()[1],"/RERconverge",sep="")) #This is the path to the files
rerpath = paste(.libPaths()[1],"/RERconverge",sep="")

Reading in gene trees with readTrees

To run RERconverge, you will first need to supply a file containing gene trees for all genes to be included in your analysis. This is a tab delimited file with the following information on each line:

Gene_name Newick_tree

An example file is provided in extdata/mammal62aa_meredplus_wCM.trees, which you can view in any text editor.

Now in R, read in the gene trees. The readTrees function takes quite a while to read in trees for all genes, so we will limit ourselves to the first 200 using max.read (this will still take a minute or so, so be patient):

toytreefile = "subsetMammalGeneTrees.txt" #change filename once toy trees available
mamTrees=readTrees(paste(rerpath,"/extdata/",toytreefile,sep=""), max.read = 200)

First, the code tells us that there are 38298 items, or gene trees, in the file. Since we have set max.read = 200, it will only read the first 200 of these. Then it says that the maximum number of tips in the gene trees is 62 and, later, it reports that it will use the 32 genes in this set that have data for all 62 species to estimate a master tree. The tree topology for the master tree (but not its branch lengths) will be used for subsequent analyses.

Estimating relative evolutionary rates (RER) with getAllResiduals

The next step is to estimate relative evolutionary rates, or RER, for all branches in the tree for each gene. Intuitively, a gene's RER for a given branch represents how quickly or slowly the gene is evolving on that branch, relative to its overall rate of evolution throughout the tree. For more details about how RER are computed, see [@Chikina2016] and [@Partha2017].

We will use the getAllResiduals function to calculate RER. This uses the following input variables (all the options set here are also the defaults):

Here is the basic method, with the recommended settings:

mamRERw=getAllResiduals(mamTrees,useSpecies=mamTrees$masterTree$tip.label, 
                           transform = "sqrt", weighted = T, scale = T)

The plots generated by this function show the heteroscedasticity in the original data (on the left) and the data after transformation and weighted regression (on the right). The x-axis displays bins of branch lengths on the tree, and the y-axis is the (log-scaled) variance in these branch lengths across trees. As you can see by comparing the right plot to the left plot, transforming and performing a weighted regression reduces the relationship between the mean branch length (x-axis) and the variance in branch length (y-axis).

Now that we have RERs, we can visualize these for any given gene using the plotRers function. Here is an example.

#make average and gene tree plots
noneutherians <- c("Platypus","Wallaby","Tasmanian_devil","Opossum")
par(mfrow=c(1,2))
#The following function is not in the compiled R package but is available in 'plottingFuncs.R' on the 'UpdateVignette' branch....
repopath = '~/repos/RERconverge'
source(paste(repopath,'/R/plottingFuncs.R',sep=''))
avgtree=plotTreeHighlightBranches(mamTrees$masterTree, outgroup=noneutherians, hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), main="Average tree") #plot average tree
bend3tree=plotTreeHighlightBranches(mamTrees$trees$BEND3, outgroup=noneutherians, hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), main="BEND3 tree") #plot individual gene tree
#plot RERs
par(mfrow=c(1,1))
phenvExample <- foreground2Paths(c("Vole","Squirrel"),mamTrees)
plotRers(mamRERw,"BEND3",phenv=phenvExample) #plot RERs

The upper left plot is a tree with branch lengths representing the average rates across all genes. The upper right plot is the same tree, but with branch lengths representing rates specifically for the BEND3 gene. The plot below these represents the estimated RERs for terminal branches. The foreground branches (set here using foreground2Paths) are highlighted in red. Notice how the RER for vole is negative; this is because the branch leading to vole in the BEND3 tree is shorter than average. On the other hand, the RER for squirrel is positive because the branch leading to squirrel in the BEND3 tree is longer than average.

Associating RERs with binary traits

Here we describe the minimal set of steps to test for association between relative evolutionary rates and a binary trait.

Reading in or generating binary trait trees

Now we will associate variation in these RERs with variation in traits across the tree. To do so, we first need to provide information about which branches of the tree have the trait of interest (foreground branches). There are several possible ways to do this:

1) Provide a binary trait tree file. This should be a file in Newick format with branch lengths zero for background branches and one for foreground branches. An example is provided in data/MarineTreeBin.txt.

marineb=read.tree("../data/MarineTreeBinCommonNames.txt")
marinebrooted = root(marineb,outgroup=noneutherians)
mb1 = marineb
mb1$edge.length = c(rep(1,length(mb1$edge.length)))
par(mfrow=c(1,2))
plot(marinebrooted, main="Trait tree from file (1)")
binplot1=plotTreeHighlightBranches(mb1, outgroup=noneutherians, hlspecies=which(marineb$edge.length==1), hlcols="blue", main="Foreground branches highlighted (1)") #alternative way of representing the tree

The plot on the left shows the tree you provided, with branch lengths 0 for all background lineages and branch lengths 1 for foreground lineages. The plot on the right displays the tree with all branch lengths 1 and the foreground lineages highlighted in blue. This binary tree represents the following as foreground lineages: all terminal branches leading to extant marine species, plus the branch leading to the common ancestor of the killer whale and the dolphin.

Modify the following to demonstrate all possible options for converting foreground to trees.

2) Generate a binary tree from a vector of foreground species using foreground2Tree. By modifying the clade argument, you can choose one of several options for how to handle the branches ancestral to the foreground species, as follows:

2a) clade = "ancestral": Use maximum parsimony to infer where transitions from background to foreground occurred in the tree, and set those transition lineages to foreground.

marineextantforeground = c("Walrus","Seal","Killer_whale","Dolphin","Manatee")
marineb2a = foreground2Tree(marineextantforeground, mamTrees, clade="ancestral")

Here the branch leading to the common ancestor of killer whale and dolphin, as well as the branch leading to the common ancestor of walrus and seal, are foreground, along with the terminal branch leading to the manatee.

2b) clade = "terminal": Set only terminal lineages leading to foreground species as foreground.

marineb2b = foreground2Tree(marineextantforeground, mamTrees, clade="terminal")

Here the each terminal branch leading to a marine species is foreground, but no internal branches are foreground.

2c) clade = "all": Use maximum parsimony to infer where transitions from background to foreground occurred in the tree, and set those transition lineages, along with all daughter lineages, to foreground.

marineb2c = foreground2Tree(marineextantforeground, mamTrees, clade="all")

Here the foreground branches are all those inferred to be transitional from 2a, as well as the terminal branches. If we had a case in which some branches daughter to the transition branches were not terminal, those would be included in the foreground as well.

2d) clade = "weighted": Infer transition and daughter branches as in 2c, but spread a weight of 1 evenly across all branches within each independent foreground clade.

marineb2d = foreground2Tree(marineextantforeground, mamTrees, clade="weighted")

Here all branches in the cetacean and pinniped clades have length 1/3, whereas the terminal branch leading to the manatee has length 1. This is a way of distributing the weight given to each independent convergence event evenly across clades, rather than across lineages.

If you plot all the resulting trees, you can see how the different choices for clade influence the resulting branch lengths.

par(mfrow=c(2,2))
plot(marine2a,main="ancestral")
plot(marine2b,main="terminal")
plot(marine2c,main="all")
plot(marine2d,main="weighted")

None of these are the same as the binary tree that you specified in (1)!

plot(marineb,main="Manually specified binary tree (1)")

Note that, in this tree, the branch representing the ancestor of the killer whale and the dolphin has a branch length of 1, whereas the branch representing the ancestor of seal and walrus has a branch length of 0. This shows that providing a binary trait tree file (1) (or manually specifying foreground branches (3)) allows you to specify which branches should be foreground with more flexibility than providing a set of foreground species to foreground2Tree.

Determine whether to include the following. It is based on the implementation

of the interactive tool to select foreground branches using RShiny.

3) Use the interactive branch selection tool. The following should open a plot of the master tree. When the GUI opens, select the marine foreground branches (Walrus, Seal, Killer whale, Dolphin, Killer whale-Dolphin ancestor, and Manatee), and click 'End selection.'

marineb3=selectforegroundbranches(mamTrees$masterTree)

If including the interactive selection tool....

You can double check that the tree you created by manual selection (3) matches the binary tree you provided in (1) (it should) using the following code:

marineb3=selectforegroundbranches(mamTrees$masterTree)
marineb3rooted=root(marineb3,outgroup=c("Platypus", "Wallaby","Tasmanian_devil","Opossum"))
plot(marineb3rooted, main = "Trait tree from manual selection (3)")

Generating paths for all genes using tree2Paths or foreground2Paths

Some of the genes (like BEND3 above) may not have data for all species, meaning that their phylogeny will be a subset of the full phylogeny. To plot RERs for these genes and to correlate them with trait evolution, we run one of two functions that determine how the trait would evolve along all/many possible subsets of the full phylogeny, generating a set of paths. The function tree2Paths takes a binary tree as input, and the function foreground2Paths takes a set of foreground species as input. It has the same arguments as foreground2Tree described above (see option 2 in Reading in or generating trait trees).

Important note: If you are using a binary tree to create paths, it must have the same topology and include the same species as the master tree (see previous section for how this is generated).

phenvMarine=tree2Paths(marineb, mamTrees)
phenvMarine2=foreground2Paths(marineextantforeground, mamTrees, clade="all")

Correlating gene evolution with binary trait evolution using correlateWithBinaryPhenotype

Now that we have estimates for the RERs for all genes of interest, as well as a representation of how the trait of interest evolves across the tree, we can use correlateWithBinaryPhenotype to test for an association between relative evolutionary rate and trait across all branches of the tree.

This uses the following input variables (all the options set here are also the defaults):

corMarine=correlateWithBinaryPhenotype(mamRERw, phenvMarine, min.sp=10, min.pos=2, weighted="auto")

The text displayed shows which correlation method is used to test for association. Here it uses the default for binary traits: the Kendall rank correlation coefficient, or Tau. For more details on correlation methods for binary and continuous traits, see ???.

The correlateWithBinaryPhenotype function generates a table with the following output for each gene:

1) Rho: the correlation between relative evolutionary rate and trait across all branches

2) N: the number of branches in the gene tree

3) P: an estimate of the P-value for association between relative evolutionary rate and trait.

Let's take a look at some of the top genes within this set.

head(corMarine[order(corMarine$P),])

ANO2 and BMP10 are two of the top genes.

plotRers(mamRERv,"ANO2",phenv=phenvMarine) 
plotRers(mamRERv,"BMP10",phenv=phenvMarine) 

In these RER plots, the marine lineages specified in the foreground are highlighted in red. The terminal branches are listed in alphabetical order, and internal branches are displayed at the bottom; note the red point at the bottom of each plot indicating the RER for killer whale-dolphin ancestral branch.

In the ANO2 tree, marine lineages have high RER, leading to a positive Rho and a low p-value. In contrast, in the BMP10 tree, marine lineages have low RER. This also yields a low p-value, but with a negative Rho.

To see what the overall pattern of association is across all genes in the set, we can plot a p-value histogram.

hist(corMarine$P, breaks=15, xlab="Kendall P-value", 
     main="P-values for correlation between 200 genes and marine environment")

There appears to be a slight enrichment of low p-values, but since we have only evaluated the first 200 genes from our ~19,000 gene genome-wide set, we should hold off on drawing conclusions from this.

Conclusion

We've now walked through the basic workflow for RERConverge. For more information about these methods and some results relevant to marine and subterranean adaptation, see [@Chikina2016] and [@Partha2017].

References



nclark-lab/RERconverge documentation built on March 2, 2024, 8:51 a.m.