knitr::opts_chunk$set(echo = TRUE)
#Include links throughout to allow skipping and returning to earlier sections.

This walkthrough provides instructions for implementing the RERconverge package to identify genes whose evolutionary rates shift in association with change in a trait. For information on how to download and install RERconverge, see the README on github.

Overview

The following document describes the steps necessary to perform a standard RERconverge analysis to identify genomic elements with convergent rates of evolution in phenotypically convergent species using a binary trait or a continuous trait.

Output is a the list of genomic elements with statistics that represent the strength and directon of the relationship between the genomic element's evolutionary rate and the phenotype. These statistics can be used to make inferences about the genomic elements' importances to the phenotype. Genomic elements that evolve more slowly for a given phenotype may be under increased evolutionary constraint because their function is important for the development of the convergent phenotype, for example. On the other hand, genomic elements that evolve more quickly for a given phenotype may either be under decreased evolutionary constraint due to loss of function or relatively decreased functional importance or, conversely, undergoing directional selection to increase or alter functionality. The ranked gene list can be further used as input for functional enrichment methodologies to find pathways or other functional groups under convergent evolutionary pressures.

Data Input Requirements and Formatting

The analysis requires two sources of data as input:

  1. Phylogenetic trees for every genomic element with branch lengths that represent element-specific evolutionary rates
    • Trees should be in Newick format with tip labels and no node labels
  2. Species-labelled phenotype values
    • Species labels must match tree tip labels
    • For continuous traits: a named numeric vector
    • For binary traits: foreground/background species or a tree with 0 and 1 branch lengths

When choosing a dataset to work with, consider the availability and accuracy of both genomic and phenotypic data, and be sure to select a valid convergent phenotype that is observed at high and low levels in multiple independent clades in your phylogeny.

For a more detailed description of data formatting requirements and examples, please see the relevant sections of the walkthrough.

Detailed Walkthrough

Installing and loading RERconverge

In R, load the RERConverge library.

if (!require("RERconverge", character.only=T, quietly=T)) {
    require(devtools)
    install_github("nclark-lab/RERconverge")
}
library(RERconverge)

This should also download all the files we will be working with to your computer, in the directory where your R library lives. If you'd like to visualize or work with any of these files separately, this is where you can find them:

print(paste(.libPaths()[1],"/RERconverge",sep="")) #This is the path to the files
rerpath = paste(.libPaths()[1],"/RERconverge",sep="")

Reading in gene trees with readTrees

To run RERconverge, you will first need to supply a file containing gene trees for all genes to be included in your analysis. This is a tab delimited file with the following information on each line:

Gene_name Newick_tree

An example file is provided in extdata/mammal62aa_meredplus_wCM.trees, which you can view in any text editor.

Now in R, read in the gene trees. The readTrees function takes quite a while to read in trees for all genes, so we will limit ourselves to the first 200 using max.read (this will still take a minute or so, so be patient):

toytreefile = "subsetMammalGeneTrees.txt" #change filename once toy trees available
mamTrees=readTrees(paste(rerpath,"/extdata/",toytreefile,sep=""), max.read = 200)

First, the code tells us that there are 1000 items, or gene trees, in the file. Since we have set max.read = 200, it will only read the first 200 of these. Then it says that the maximum number of tips in the gene trees is 62 and, later, it reports that it will use the 32 genes in this set that have data for all 62 species to estimate a master tree. The master tree will be used for subsequent analyses.

Estimating relative evolutionary rates (RER) with getAllResiduals

The next step is to estimate relative evolutionary rates, or RERs, for all branches in the tree for each gene. Intuitively, a gene's RER for a given branch represents how quickly or slowly the gene is evolving on that branch relative to its overall rate of evolution throughout the tree. For a more detailed description of how RER are computed, see [@Chikina2016] and [@Partha2017].

Briefly, RERs are calculated by normalizing branch lengths across all trees by the master branch lengths. Branch lengths are then corrected for the heteroskedastic relationship between average branch length and variance using weighted regression.

We will use the getAllResiduals function to calculate RERs. This uses the following input variables (all the options set here are also the defaults):

Here is the basic method, with the recommended settings:

mamRERw=RERconverge::getAllResiduals(mamTrees,useSpecies=mamTrees$masterTree$tip.label, 
                           transform = "sqrt", weighted = T, scale = T)

The plots generated by this function show the heteroscedasticity in the original data (on the left) and the data after transformation and weighted regression (on the right). The x-axis displays bins of branch lengths on the tree, and the y-axis is the (log-scaled) variance in these branch lengths across trees. As you can see by comparing the right plot to the left plot, transforming and performing a weighted regression reduces the relationship between the mean branch length (x-axis) and the variance in branch length (y-axis).

Now that we have RERs, we can visualize these for any given gene using the plotRers function. Here is an example.

# #make average and gene tree plots
# noneutherians <- c("Platypus","Wallaby","Tasmanian_devil","Opossum")
# par(mfrow=c(1,2))
# #The following function is not in the compiled R package but is available in 'plottingFuncs.R' on the 'UpdateVignette' branch....
# source("../R/plottingFuncs.R")
# avgtree=plotTreeHighlightBranches(mamTrees$masterTree, outgroup=noneutherians, hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), main="Average tree") #plot average tree
# bend3tree=plotTreeHighlightBranches(mamTrees$trees$BEND3, outgroup=noneutherians, hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), main="BEND3 tree") #plot individual gene tree
# #plot RERs
# par(mfrow=c(1,1))
# phenvExample <- foreground2Paths(c("Vole","Squirrel"),mamTrees)
# plotRers(mamRERw,"BEND3",phenv=phenvExample) #plot RERs

The upper left plot is a tree with branch lengths representing the average rates across all genes. The upper right plot is the same tree, but with branch lengths representing rates specifically for the BEND3 gene. The plot below these represents the estimated RERs for terminal branches. The foreground branches (set here using foreground2Paths) are highlighted in red. Notice how the RER for vole is negative; this is because the branch leading to vole in the BEND3 tree is shorter than average. On the other hand, the RER for squirrel is positive because the branch leading to squirrel in the BEND3 tree is longer than average.



nclark-lab/RERconverge documentation built on Feb. 4, 2025, 8:38 a.m.