knitr::opts_chunk$set(echo = TRUE)

RERconverge Analysis on Continuous Traits

Overview

The following document will describe the steps necessary to perform a standard RERconverge analysis to identify genomic elements with convergent rates of evolution in phenotypically convergent species using a continuous trait. Output is a the list of genomic elements with statistics that represent the strength and directon of the correlation between the genomic element's evolutionary rate and the phenotype. These statistics can be used to make inferences about the genomic elements' importances to the phenotype. Genomic elements that evolve more slowly for a given phenotype may be under increased evolutionary constraint because their function is important for the development of the convergent phenotype, for example. On the other hand, genomic elements that evolve more quickly for a given phenotype may either be under decreased evolutionary constraint due to loss of function or relatively decreased functional importance or, conversely, undergoing directional selection to increase or alter functionality. The ranked gene list can be further used as input for functional enrichment methodologies to find pathways or other functional groups under convergent evolutionary pressures.

Data Input Requirements and Formatting

The analysis requires two sources of data as input:

  1. A named vector with numeric phenotype values
  2. Phylogenetic trees for every genomic element with branch lengths that represent element-specific evolutionary rates

When choosing a dataset to work with, consider the availability and accuracy of both genomic and phenotypic data, and be sure to select a valid convergent phenotype that is observed at high and low levels in multiple independent clades in your phylogeny. In the following lines of code, I load my phenotypic data of mammal adult weight:

<<<<<<< HEAD

head(logAdultWeightcm)
length(logAdultWeightcm)
=======
<<<<<<< HEAD
```r
library("RERconverge")
head(AdultWeightLog)
length(AdultWeightLog)
=======
```r
head(logAdultWeightcm)
length(logAdultWeightcm)
>>>>>>> master
>>>>>>> UpdateVignette

Note that my phenotype vector contains numeric values that are assigned species names. These names, by design, are the same names that are used as terminal node labels in my phylogenetic trees. A dataset of phylogenetic trees for a real analysis would contain one tree for every genomic element of interest with branch lengths representing element-specific evolutionary rates. Trees should be in Newick format in a tab-delimited file in which each line contains the tree name and the tree. Importantly, trees must preseve the same topology or encode the same evolutionary relationships among species. The "readTrees" function takes the filepath of the trees and reads them into a tree object that is used in later analysis steps. Reading in trees and performing subsequent analyses can take a long time for large genomic datasets over large phylogenies, so for the sake of brevity, we will read in a subset of 200 trees using the parameter "max.read=200".

<<<<<<< HEAD

<<<<<<< HEAD

rerpath = paste(.libPaths()[1],"/RERconverge",sep="")
toytreefile = "subsetMammalGeneTrees.txt"
toyTrees=readTrees(paste(rerpath,"/extdata/",toytreefile,sep=""), max.read = 200)
=======
>>>>>>> UpdateVignette
```r
library("RERconverge")
toytrees=readTrees("../data/MammalToyTrees.txt", max.read = 500)
<<<<<<< HEAD
=======
>>>>>>> master
>>>>>>> UpdateVignette

The output from "readTrees" indicates that there are 3170 total trees in the file provided with 62 total species. In addition to reading in the trees, "readTrees" also created a master tree with branch lengths that represent the average branch lengths across many trees. In this case, the master tree branch lengths were estimated from 55 genes because 55 of the trees had all species present.

At this point, note that the number of species in my trees does not match the number of species in my phenotype vector. In this case, the phenotype vector contains more species than the phlogenetic trees, which I could easily rememdy by trimming the trait vector. A more difficult problem that you may experience is the opposite situation in which you have more species in your phlogenetic trees than in your phenotype vector. Fortunately, RERconverge can handle both of these issues, so we will continue onward to the meat of the RERconverge functions.

Analysis

With our trees and phenotype successfully loaded, we can start to run the RERconverge analysis. First, we will calculate RERs, or relative evolutionary rates, for all of our trees. RERs are calculated by nromalizing all branch lengths across all tree lengths by the master branch lengths. Branch lengths are then corrected for the heteroskedastic relationship between average branch length and variance useing weighted regression. This function has a lot going on under the hood, so let me show you the line I use to calculate RERs before I explain the input parameters I have chosen:

<<<<<<< HEAD
RER=getAllResiduals(toytrees, transform="sqrt", weighted=T, scale=T, useSpecies =names(adult.weights.log), cutoff=.001)
=======
<<<<<<< HEAD
RER=RERconverge::getAllResiduals(toyTrees, transform="sqrt", weighted=T, scale=T, useSpecies =names(AdultWeightLog), cutoff=.001)
=======
RER=getAllResiduals(toytrees, transform="sqrt", weighted=T, scale=T, useSpecies =names(adult.weights.log), cutoff=.001)
>>>>>>> master
>>>>>>> UpdateVignette

The first parameter is my previously-loaded trees object. Next, I specify a square-root transformation on the branch lengths to help correct for heteroskedasticity. The square-root correction is the recommended method based on its ability to increase power more significantly than other methods, but you may also want to try a log correction for your data. Setting weighted=T uses a weighted regression to further correct for heteroskedasticity. The two plots indicate the reduction in heteroskedasticity with the transformation and weighted regression. The plot on the left shows an increase in variance along the y-axis as mean branch length increases along the x-axis in untransformed data. That trend is not seen in the corrected data on the right. Setting scale=T scales the variance of the branch lengths to 1 after all corrections. I chose this parameter based on further analysis of permutations of node labels that showed a uniform distribution of p-values when variance scaling was used and a non-uniform distribution without variance scaling. I set useSpecies to the names in my phenotype vector so RERs are only calculated based on species for which I have phenotypes. Finally, cutoff=.001 removes very short, noisy branches from trees. RERs are stored in a paths matrix with rows representing genomic elements and columns representing every potential branch that could exist along the phylogenetic topology. Non-NA values in the RER matrix are branch RERs for branches that exist in that genomic element phylogeny.

We must convert my trait vector to paths comparable to the paths in the RER matrix. To do that, I can use the function "char2Paths" as shown here:

<<<<<<< HEAD
charpaths=char2Paths(adult.weights.log, toytrees)
=======
<<<<<<< HEAD
charpaths=char2Paths(AdultWeightLog, toyTrees)
=======
charpaths=char2Paths(adult.weights.log, toytrees)
>>>>>>> master
>>>>>>> UpdateVignette

Note that there is a species present in my trees that is not present in my phenotype vector and that this will not impact our ability to continue with our analysis, but is important to note if it is unexpected. I am also using "metric diff", which means that branch lengths are assigned to my trait tree based on the difference in trait values on the nodes connected to that branch. The "char2Paths" function creates a paths vector with length equal to the number of columns in the RER matrix. The phylogenetic relationships represented in the "char2Paths" output are the same as those represented in the RER matrix.

Finally, we can perform our ultimate analysis to find correlations between the rate of evolution of a genomic element (encoded in the RER matrix) and the rate of change of a phenotype (encoded in charpaths). The final output is the list of input genes with relevant statistics. As input, I provide the RER matrix and trait path. I set method="p" to use a Pearson correlation, which is the most appropriate option for continuous traits. I set min.pos=0 (min.pos for "minimum positive") so no foreground species are required; correlation statistics will be calculated even if all species have negative RER values. Finally, winsorize=3 pulls the outermost three points in all four directions to the center of the data before calculating correlations to mitigate the effects of outlier points.

res=getAllCor(RER, charpaths, method = "p", min.pos = 0, winsorize = 3)
head(res[order(res$P),])

In these results, Rho is the standard statistic for a Pearson correlation, N is the length of the genomic element, and P is the uncorrected correlation p-value. Since huge numbers of statistical tests are being performed in these analyses, it is essential to correct p-values using a method such as the Benjamini-Hochberg correction.

We can also plot a distribution of our uncorrected p-values to allow us to speculate if they will remain significant after correction. A mostly uniform distribution with an elevated frequency of low p-values indicates the presence of genomic elements whose rates of evolution are correlated with the phenotype (note that this trend will also be increasingly distinct when larger numbers of genomic elements are considered).

hist(res$P, breaks=100)

One important consideration of the results is the impact of one or a few species on the overall correlation. To assess this risk, we can examine individual correlation plots as follows:

x=charpaths
<<<<<<< HEAD
y=RER['ADSS',]
pathnames=namePathsWSpecies(toytrees$masterTree)
=======
<<<<<<< HEAD
y=RER['TTN',]
pathnames=namePathsWSpecies(toyTrees$masterTree)
=======
y=RER['ADSS',]
pathnames=namePathsWSpecies(toytrees$masterTree)
>>>>>>> master
>>>>>>> UpdateVignette
names(y)=pathnames
plot(x,y, cex.axis=1, cex.lab=1, cex.main=1, xlab="Weight Change", ylab="Evolutionary Rate", main="Gene ADSS Pearson Correlation",pch=19, cex=1, xlim=c(-2,2))
text(x,y, labels=names(y), pos=4)
abline(lm(y~x), col='red',lwd=3)

In this case, we see that the positive correlation is driven by all species and not just a single clade. Note that labelled points are terminal nodes in the phylogeny and unlabelled points are internal nodes.

Further analyses could include using functional enrichment detection methods to find functionally-related genomic elements that are experiencing convergent evolutionary rates as a group and using branch-site models to determine if fast-evolving genes are experiencing relaxation of constraint or undergoing directional selection.



nclark-lab/RERconverge documentation built on March 2, 2024, 8:51 a.m.