In brouwern/dayoff: Bioinformatics in R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warnings = FALSE,
  cache = TRUE
)

"Worked example: Building a phylogeny in R"

Introduction

Phylogenies play an important role in computational biology and bioinformatics. Phylogenetics itself is an obligatly computational field that only began rapid growth when computational power allowed the many algorithms it relies on to be done rapidly. Phylogeneies of species, genes and proteins are used to address many biological issues, including

Patterns of protein evolution
Origin and evolution of phenotypic traits
Origin and progression of epidemics
Origin of evolution of diseases (eg, zooenoses)
Prediction of protein function from its sequence
... and many more

The actual building of a phylogeny is a computationally intensive task; moreover, there are many bioinformatics and computational tasks the precede the construction of a phylogeny:

genome sequencing and assembly
computational gene prediction and annotation
database searching and results screening
pairwise sequence alignment
data organization and cleaning
multiple sequence alignment
evaluation and validation of alignment accuracy

Once all of these steps have been carried out, the building of a phylogeny involves

picking a model of sequence evolution or other description of evolution
picking a statistical approach to tree construction
evaluationg uncertainty in the final tree

In this chapter we will work through many of these steps. In most cases we will pick the easiest or fastest option; in later chapters we will unpack the various options. This chapter is written as an interactive R sessions. You can follow along by opening the .Rmd file of the chapter or typing the appropriate commands into your own script. I assume that all the necessary packages have been installed and they only need to be loaded into R using the library() command.

Software Preliminaires

A few things need to be done to get started with most R sessions.

Download necessary packages

If you are workign through this in RStudio Cloud the necessary packages should already by downloaded and you just need to load them into memory as don before. See the previous sections of this book for information on downloading R packages and Bioconductor.

Load packages with library()

Even if we are in RStudio cloud, we do need to load up all our bioinformatics and phylogenetics software into R. This is done with the library() command.

If you opened up this code as a .Rmd file, all you need to do to load these packages is click on the green triangle on the far right hand side of the code chunk.

NOTE: You'll likely see some (maybe LOTS!) red code appear on your screen. No worries, totally normal!

library(msa)
library(rentrez)
library(seqinr)
library(ape)
library(Biostrings)

This book uses functions and data from a package called dayoff that is actively being developed. It is therefore important that it is re-downloaded reguarly. If you are working from RStudio Cloud then this should already have been done. If you are working from RStudio on your desktop then you need to make sure you have the devtools package downloaded, then run the code below; note that his will output a lot of code - no worry. Again, only do this if you are running RStudio form your desktop. Lots of code will be spit into the console too- no wories.

devtools::install_github("brouwern/dayoff") # Skip this if you are running RStudio Cloud

Assuming this works, load the package into memory:

library(dayoff)

You can check that this worked by loading a dataset using the data() command.

data("shroom_table")

Now check that the data are there; the head() command just prints out the first few rows.

head(shroom_table)

Downloading macromolecular sequences

Macromolecular sequences include DNA, RNA (in all its many forms), and polypeptides. We're going to explore some protein sequences. First we need to download them. To do this we'll use a function, entrez_fretch(), which acesses the Entrez system of database (ncbi.nlm.nih.gov/search/). This function is from the rentrez package, which stands for "R-Entrez."

We need to tell entrez_fetch() several things

db = the type of entrez database.
id = the accession (ID) number of the sequence
rettype = file type what we want the function to return.

(Formally, these things are called arguements by R.)

We'll use these settings:

db = "protein" to access the Entrez database of protein sequences
rettype = "fasta", which is a standard file format for nucleic acid and protein sequences

We'll set id = ... to 2 sequences whose accession numbers are:

NP_065910: Human shroom 3
AAF13269: Mouse shroom 3a
CAA58534: Human shroom2
XP_783573: Sea urchin shroom

There are two relatively large and highly conserved regions of shroom 3 1. ASD 1: aa 884 to aa 1062 in hShroom3 (178 aa) 1. ASD 2: aa 1671 to aa 1955 in hShroom3 (284 aa)

Normally we'd have to download these sequences by hand. In R we can do it automatically; this might take a second.

# Human shroom 3 (H. sapiens)
hShroom3 <- entrez_fetch(db = "protein", 
                          id = "NP_065910", 
                          rettype = "fasta")

# Mouse shroom 3a (M. musculus)
mShroom3a <- entrez_fetch(db = "protein", 
                          id = "AAF13269", 
                          rettype = "fasta")

# Human shroom 2 (H. sapiens)
hShroom2 <- entrez_fetch(db = "protein", 
                          id = "CAA58534", 
                          rettype = "fasta")


# Sea-urchin shroom
sShroom <- entrez_fetch(db = "protein", 
                          id = "XP_783573", 
                          rettype = "fasta")

The rentrez package is actively under development and so there may be compatibility issues, especially if you are using R from your desktop. I you have trouble running entrez_fetch() you can get all of these data objects from the dayoff package using the data() command, eg.

data("hShroom3") #only do this if you had trouble with the previous code chunk

We can take a peek at what we just got

cat(hShroom3)

There's lots of labels and ID nubmers and stuff here, so I'm going to check about how long each of these sequences is - each should have an at least slightly different length. If any are identical, I might have repeated an accession number or re-used an object name. The function nchar() counts of the number of characters in an R object.

nchar(hShroom3)
nchar(mShroom3a)
nchar(sShroom)
nchar(hShroom2)

Prepping macromolecular sequences

"90% of data analysis is data cleaning" (-Just about every data analyst and data scientist on twitter)

We have our sequences, but the current format isn't useable for us yet. We can set it up using a function I wrote called fasta_cleaner().

If we run the name of the command with out any quotation marks we can see the code:

fasta_cleaner

Now use the function to clean our sequences

hShroom3  <- fasta_cleaner(hShroom3,  parse = F)
mShroom3a <- fasta_cleaner(mShroom3a, parse = F)
hShroom2  <- fasta_cleaner(hShroom2,  parse = F)
sShroom   <- fasta_cleaner(sShroom,   parse = F)

Now let's take a peek at what our sequences look like:

hShroom3

Aligning sequences

We can do a global alignment of two sequences using the pairwiseAlignment() function from the Biostrings.

align.h3.vs.m3a <- pairwiseAlignment(
                  hShroom3,
                  mShroom3a)

I've given this kind of an awkward name "h3.vs.m3a" means "human shroom 3 aligned versus mouse shroom 3a".

We can peek at part of the alignment by just calling up the R object:

align.h3.vs.m3a

The score tells us how closely they are aligned; higher scores mean the sequences are more similar. Its hard to interpret the number on its own so we can get the percent sequence identity using the pid() function.

pid(align.h3.vs.m3a)

So, shroom3 from humans and shroom3 from mice are about 71% similar (at least using this particular method of alignment, and there are many ways to do this!)

What about human shroom 3 and human shroom 2?

align.h3.vs.h2 <- pairwiseAlignment(
                  hShroom3,
                  hShroom2)

First check out the raw score

score(align.h3.vs.h2)

Now the percent sequence alignment.

pid(align.h3.vs.h2)

How does it work out evolutionarily that Human shroom 3 and Mouse shroom 3 are more similar to each other than human shroom 3 and human shroom 2? What are the evolutionary relationships among these genes within the shroom gene family?

Downloading multiple sequences

The shroom genes were discovered in frogs (Xenopus), humans and mice in the 1990s. In the early 2000s Hagens et al (2006) compiled information on genes known to resemble shroom and proposed a way to refer to all of them to keep track of them. A table from this paper is in teh dayoff package. Take a look:

shroom_table

Instead of getting one sequence at a time we can download several by accessing the "accession" column from the table. We can get just this column using a dollar sign "$"

shroom_table$accession

We can give this whole set of accessions to entrez_fetch(). Downloading all of these might take a second.

shrooms <- entrez_fetch(db = "protein", 
                          id = shroom_table$accession, 
                          rettype = "fasta")

We can look at what we got here

cat(shrooms)

Building a phylogenetic tree

We can align all of the sequences we downloaded and use that alignment to build a phylogenetic tree. Again, we first need to set things up properly. In this case, we need to make a text file (.txt) that contain our data. This code is kinda dense, such close your eyes and run it.

sink("shrooms.txt")
cat(shrooms)
sink()

You can view this file using file.show(), which will cause a text editor to open.

file.show("shrooms.txt")

Close this when you are done.

Now some more cleaning. We need to read in the .txt file we just made and make a particular type of R object for our alginment program. Luckily this is easy using a function from the Biostrings package, readAAStringSet.

shrooms_stringset <- readAAStringSet("shrooms.txt")

unlink(here::here("shrooms.txt"))

Let's take a look at the names of these proteins:

names(shrooms_stringset)

At the bottom of the list are two entries that say either "PREDICTED: hypothetical protein" or "PREDICTED: protein". What organisms are these from? Google the species names. What does it mean the protein is predicted.

These names we're looking at are really long because they are a combination of the accession number, annotation, the species names and some other stuff. Let's overwrite them with the short versions of the names from the table, which are:

shroom_table$name.new

Now overwrite them:

names(shrooms_stringset) <- shroom_table$name.new

Multiple sequence alignment

Now, we'll use the software msa, which implements the ClustalW multiple sequence alignment algorithm. ClustalW is a classic alignment program, but there are many other options such as MUSCLE and T-COFFEE. We'll use ClustalW because it can easily be called from R, but it appears that the ClustalW algoritm is not as powerful as other software.

Normally we'd have to download the ClustalW program and either point-and-click our way through it or use the command line*, but these folks wrote up the algorithm in R so we can do this with a line of R code.

This will take a second or two (or minute if everyone runs this at the same time).

shrooms_align <- msa(shrooms_stringset,
                     method = "ClustalW")

You should see a message pop up that says "use default substitution matrix." There's some important biology and computation going on here, but we'll ignore this for now.

Now we can take a look at the alignment in PDF format. I this case I'm going to just show about 100 amino acids near the end of the alignment, where there is the most overlap accross all of the sequences:

msaPrettyPrint(shrooms_align, 
               #output="asis", 
               y=c(2000, 2100),
               showNames="left", 
               #showLogo="top",
               consensusColor="ColdHot", 
               shadingMode = "functional",
               shadingModeArg = "structure",
               showLegend=TRUE,
               showConsensus = "bottom",
               askForOverwrite=FALSE)

If you are running from your desktop this may run into problems (no easy solution yet...)

Gene tree

First, we need to convert to a different data structure. Sigh...

shrooms_align_seqinr <- msaConvert(shrooms_align, type="seqinr::alignment")

Next we need to get an estimate of how similar each sequences is. THis will be the basis for our phylogenetic tree.

shrooms_dist <- dist.alignment(shrooms_align_seqinr, "identity")

We've made a matrix using dist.alginment(); let's round it off so its easier to look at:

shrooms_dist_rounded <- round(shrooms_dist,1)

Now let's look at it

shrooms_dist_rounded

Now let's use these data to build a tree. First, we let R figure out the structure of the tree. We'll use the neighbor joining algorith via the functin nj():

tree <- nj(shrooms_dist)

Now we'll make a quick plot of our tree.

plot(tree, main="Phylogenetic Tree")
mtext(text = "Shroom family gene tree")

Let's make a fancier plot:

# plot(tree, main="Phylogenetic Tree")
# mtext(text = "Shroom family gene tree")
# 
# x <- 0.551
# x2 <- 0.6
# 
# segments(x0 = x, y0 = 1, 
#          x1 = x, y1 = 4,
#          lwd=2)
# text(x = x*1.01, y = 2.5, "Shrm 3",adj = 0)
# 
# segments(x0 = x, y0 = 5, 
#          x1 = x, y1 = 6,
#          lwd=2)
# text(x = x*1.01, y = 5.5, "Shrm 2",adj = 0)
# 
# segments(x0 = x, y0 = 7, 
#          x1 = x, y1 = 9,
#          lwd=2)
# text(x = x*1.01, y = 8, "Shrm 1",adj = 0)
# 
# segments(x0 = x, y0 = 10, 
#          x1 = x, y1 = 13,
#          lwd=2)
# text(x = x*1.01, y = 12, "Shrm ?",adj = 0)
# 
# 
# segments(x0 = x, y0 = 14, 
#          x1 = x, y1 = 15,
#          lwd=2)
# text(x = x*1.01, y = 14.5, "Shrm 4",adj = 0)
# 
# 
# segments(x0 = x2, y0 = 1, 
#          x1 = x2, y1 = 6,
#          lwd=2)
# 
# segments(x0 = x2, y0 = 7, 
#          x1 = x2, y1 = 9,
#          lwd=2)
# 
# segments(x0 = x2, y0 = 10, 
#          x1 = x2, y1 = 15,
#          lwd=2)

Functions used

library() source() head() nchar()

entrez_fretch()

msa rentrez seqinr ape

Entrez: ncbi.nlm.nih.gov/search/

brouwern/dayoff documentation built on Nov. 4, 2019, 8:15 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

brouwern/dayoff
Bioinformatics in R

In brouwern/dayoff: Bioinformatics in R

"Worked example: Building a phylogeny in R"

Introduction

Software Preliminaires

Download necessary packages

Load packages with library()

Downloading macromolecular sequences

Prepping macromolecular sequences

Aligning sequences

Downloading multiple sequences

Building a phylogenetic tree

Multiple sequence alignment

Gene tree

Functions used

R Package Documentation

Browse R Packages

We want your feedback!

brouwern/dayoff Bioinformatics in R

In brouwern/dayoff: Bioinformatics in R

"Worked example: Building a phylogeny in R"

Introduction

Software Preliminaires

Download necessary packages

Load packages with library()

Downloading macromolecular sequences

Prepping macromolecular sequences

Aligning sequences

Downloading multiple sequences

Building a phylogenetic tree

Multiple sequence alignment

Gene tree

Functions used

R Package Documentation

Browse R Packages

We want your feedback!

brouwern/dayoff
Bioinformatics in R