knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warnings = FALSE, cache = TRUE )
Phylogenies play an important role in computational biology and bioinformatics. Phylogenetics itself is an obligatly computational field that only began rapid growth when computational power allowed the many algorithms it relies on to be done rapidly. Phylogeneies of species, genes and proteins are used to address many biological issues, including
The actual building of a phylogeny is a computationally intensive task; moreover, there are many bioinformatics and computational tasks the precede the construction of a phylogeny:
Once all of these steps have been carried out, the building of a phylogeny involves
In this chapter we will work through many of these steps. In most cases we will pick the easiest or fastest option; in later chapters we will unpack the various options. This chapter is written as an interactive R sessions. You can follow along by opening the .Rmd file of the chapter or typing the appropriate commands into your own script. I assume that all the necessary packages have been installed and they only need to be loaded into R using the library() command.
A few things need to be done to get started with most R sessions.
If you are workign through this in RStudio Cloud the necessary packages should already by downloaded and you just need to load them into memory as don before. See the previous sections of this book for information on downloading R packages and Bioconductor.
Even if we are in RStudio cloud, we do need to load up all our bioinformatics and phylogenetics software into R. This is done with the library() command.
If you opened up this code as a .Rmd file, all you need to do to load these packages is click on the green triangle on the far right hand side of the code chunk.
NOTE: You'll likely see some (maybe LOTS!) red code appear on your screen. No worries, totally normal!
library(msa) library(rentrez) library(seqinr) library(ape) library(Biostrings)
This book uses functions and data from a package called dayoff that is actively being developed. It is therefore important that it is re-downloaded reguarly. If you are working from RStudio Cloud then this should already have been done. If you are working from RStudio on your desktop then you need to make sure you have the devtools package downloaded, then run the code below; note that his will output a lot of code - no worry. Again, only do this if you are running RStudio form your desktop. Lots of code will be spit into the console too- no wories.
devtools::install_github("brouwern/dayoff") # Skip this if you are running RStudio Cloud
Assuming this works, load the package into memory:
library(dayoff)
You can check that this worked by loading a dataset using the data() command.
data("shroom_table")
Now check that the data are there; the head() command just prints out the first few rows.
head(shroom_table)
Macromolecular sequences include DNA, RNA (in all its many forms), and polypeptides. We're going to explore some protein sequences. First we need to download them. To do this we'll use a function, entrez_fretch(), which acesses the Entrez system of database (ncbi.nlm.nih.gov/search/). This function is from the rentrez package, which stands for "R-Entrez."
We need to tell entrez_fetch() several things
(Formally, these things are called arguements by R.)
We'll use these settings:
We'll set id = ... to 2 sequences whose accession numbers are:
There are two relatively large and highly conserved regions of shroom 3 1. ASD 1: aa 884 to aa 1062 in hShroom3 (178 aa) 1. ASD 2: aa 1671 to aa 1955 in hShroom3 (284 aa)
Normally we'd have to download these sequences by hand. In R we can do it automatically; this might take a second.
# Human shroom 3 (H. sapiens) hShroom3 <- entrez_fetch(db = "protein", id = "NP_065910", rettype = "fasta") # Mouse shroom 3a (M. musculus) mShroom3a <- entrez_fetch(db = "protein", id = "AAF13269", rettype = "fasta") # Human shroom 2 (H. sapiens) hShroom2 <- entrez_fetch(db = "protein", id = "CAA58534", rettype = "fasta") # Sea-urchin shroom sShroom <- entrez_fetch(db = "protein", id = "XP_783573", rettype = "fasta")
The rentrez package is actively under development and so there may be compatibility issues, especially if you are using R from your desktop. I you have trouble running entrez_fetch() you can get all of these data objects from the dayoff package using the data() command, eg.
data("hShroom3") #only do this if you had trouble with the previous code chunk
We can take a peek at what we just got
cat(hShroom3)
There's lots of labels and ID nubmers and stuff here, so I'm going to check about how long each of these sequences is - each should have an at least slightly different length. If any are identical, I might have repeated an accession number or re-used an object name. The function nchar() counts of the number of characters in an R object.
nchar(hShroom3) nchar(mShroom3a) nchar(sShroom) nchar(hShroom2)
"90% of data analysis is data cleaning" (-Just about every data analyst and data scientist on twitter)
We have our sequences, but the current format isn't useable for us yet. We can set it up using a function I wrote called fasta_cleaner().
If we run the name of the command with out any quotation marks we can see the code:
fasta_cleaner
Now use the function to clean our sequences
hShroom3 <- fasta_cleaner(hShroom3, parse = F) mShroom3a <- fasta_cleaner(mShroom3a, parse = F) hShroom2 <- fasta_cleaner(hShroom2, parse = F) sShroom <- fasta_cleaner(sShroom, parse = F)
Now let's take a peek at what our sequences look like:
hShroom3
We can do a global alignment of two sequences using the pairwiseAlignment() function from the Biostrings.
align.h3.vs.m3a <- pairwiseAlignment( hShroom3, mShroom3a)
I've given this kind of an awkward name "h3.vs.m3a" means "human shroom 3 aligned versus mouse shroom 3a".
We can peek at part of the alignment by just calling up the R object:
align.h3.vs.m3a
The score tells us how closely they are aligned; higher scores mean the sequences are more similar. Its hard to interpret the number on its own so we can get the percent sequence identity using the pid() function.
pid(align.h3.vs.m3a)
So, shroom3 from humans and shroom3 from mice are about 71% similar (at least using this particular method of alignment, and there are many ways to do this!)
What about human shroom 3 and human shroom 2?
align.h3.vs.h2 <- pairwiseAlignment( hShroom3, hShroom2)
First check out the raw score
score(align.h3.vs.h2)
Now the percent sequence alignment.
pid(align.h3.vs.h2)
How does it work out evolutionarily that Human shroom 3 and Mouse shroom 3 are more similar to each other than human shroom 3 and human shroom 2? What are the evolutionary relationships among these genes within the shroom gene family?
The shroom genes were discovered in frogs (Xenopus), humans and mice in the 1990s. In the early 2000s Hagens et al (2006) compiled information on genes known to resemble shroom and proposed a way to refer to all of them to keep track of them. A table from this paper is in teh dayoff package. Take a look:
shroom_table
Instead of getting one sequence at a time we can download several by accessing the "accession" column from the table. We can get just this column using a dollar sign "$"
shroom_table$accession
We can give this whole set of accessions to entrez_fetch(). Downloading all of these might take a second.
shrooms <- entrez_fetch(db = "protein", id = shroom_table$accession, rettype = "fasta")
We can look at what we got here
cat(shrooms)
We can align all of the sequences we downloaded and use that alignment to build a phylogenetic tree. Again, we first need to set things up properly. In this case, we need to make a text file (.txt) that contain our data. This code is kinda dense, such close your eyes and run it.
sink("shrooms.txt") cat(shrooms) sink()
You can view this file using file.show(), which will cause a text editor to open.
file.show("shrooms.txt")
Close this when you are done.
Now some more cleaning. We need to read in the .txt file we just made and make a particular type of R object for our alginment program. Luckily this is easy using a function from the Biostrings package, readAAStringSet.
shrooms_stringset <- readAAStringSet("shrooms.txt")
unlink(here::here("shrooms.txt"))
Let's take a look at the names of these proteins:
names(shrooms_stringset)
At the bottom of the list are two entries that say either "PREDICTED: hypothetical protein" or "PREDICTED: protein". What organisms are these from? Google the species names. What does it mean the protein is predicted.
These names we're looking at are really long because they are a combination of the accession number, annotation, the species names and some other stuff. Let's overwrite them with the short versions of the names from the table, which are:
shroom_table$name.new
Now overwrite them:
names(shrooms_stringset) <- shroom_table$name.new
Now, we'll use the software msa, which implements the ClustalW multiple sequence alignment algorithm. ClustalW is a classic alignment program, but there are many other options such as MUSCLE and T-COFFEE. We'll use ClustalW because it can easily be called from R, but it appears that the ClustalW algoritm is not as powerful as other software.
Normally we'd have to download the ClustalW program and either point-and-click our way through it or use the command line*, but these folks wrote up the algorithm in R so we can do this with a line of R code.
This will take a second or two (or minute if everyone runs this at the same time).
shrooms_align <- msa(shrooms_stringset, method = "ClustalW")
You should see a message pop up that says "use default substitution matrix." There's some important biology and computation going on here, but we'll ignore this for now.
Now we can take a look at the alignment in PDF format. I this case I'm going to just show about 100 amino acids near the end of the alignment, where there is the most overlap accross all of the sequences:
msaPrettyPrint(shrooms_align, #output="asis", y=c(2000, 2100), showNames="left", #showLogo="top", consensusColor="ColdHot", shadingMode = "functional", shadingModeArg = "structure", showLegend=TRUE, showConsensus = "bottom", askForOverwrite=FALSE)
If you are running from your desktop this may run into problems (no easy solution yet...)
First, we need to convert to a different data structure. Sigh...
shrooms_align_seqinr <- msaConvert(shrooms_align, type="seqinr::alignment")
Next we need to get an estimate of how similar each sequences is. THis will be the basis for our phylogenetic tree.
shrooms_dist <- dist.alignment(shrooms_align_seqinr, "identity")
We've made a matrix using dist.alginment(); let's round it off so its easier to look at:
shrooms_dist_rounded <- round(shrooms_dist,1)
Now let's look at it
shrooms_dist_rounded
Now let's use these data to build a tree. First, we let R figure out the structure of the tree. We'll use the neighbor joining algorith via the functin nj():
tree <- nj(shrooms_dist)
Now we'll make a quick plot of our tree.
plot(tree, main="Phylogenetic Tree") mtext(text = "Shroom family gene tree")
Let's make a fancier plot:
# plot(tree, main="Phylogenetic Tree") # mtext(text = "Shroom family gene tree") # # x <- 0.551 # x2 <- 0.6 # # segments(x0 = x, y0 = 1, # x1 = x, y1 = 4, # lwd=2) # text(x = x*1.01, y = 2.5, "Shrm 3",adj = 0) # # segments(x0 = x, y0 = 5, # x1 = x, y1 = 6, # lwd=2) # text(x = x*1.01, y = 5.5, "Shrm 2",adj = 0) # # segments(x0 = x, y0 = 7, # x1 = x, y1 = 9, # lwd=2) # text(x = x*1.01, y = 8, "Shrm 1",adj = 0) # # segments(x0 = x, y0 = 10, # x1 = x, y1 = 13, # lwd=2) # text(x = x*1.01, y = 12, "Shrm ?",adj = 0) # # # segments(x0 = x, y0 = 14, # x1 = x, y1 = 15, # lwd=2) # text(x = x*1.01, y = 14.5, "Shrm 4",adj = 0) # # # segments(x0 = x2, y0 = 1, # x1 = x2, y1 = 6, # lwd=2) # # segments(x0 = x2, y0 = 7, # x1 = x2, y1 = 9, # lwd=2) # # segments(x0 = x2, y0 = 10, # x1 = x2, y1 = 15, # lwd=2)
library() source() head() nchar()
entrez_fretch()
msa rentrez seqinr ape
Entrez: ncbi.nlm.nih.gov/search/
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.