suppressPackageStartupMessages({ library(distSTRING) library(Biostrings) library(ape) library(dplyr) library(tidyr) library(ggplot2) })
Calculating pairwise distances of either DNA or AA sequences is a common task for evolutionary biologist. The distance calculations are either based on specific nucleotide, codon or amino acid models or on a scoring matrix.
Note: Sequences need to be pre-aligned into so called multiple sequence
alignments (MSA), which can be done with a multitude of existing software.
Just to mention for example mafft,
muscle or the R package
msa
.
The R package ape
for example offers the ape::dist.dna()
function, which has implemented a
collection of different evolutionary models.
distSTRING
extends the possibility to directly calculate pairwise nucloetide
distances of an Biostrings::DNAStringSet
object or pairwise amino acid
distances of an Biostrings::AAStringSet
object. The scoring matrix based
calculations are implemented in c++
with RcppThread
to parallelise pairwise
combinations.
It is a non-trivial part to resolve haploid (1n) sequences
from a diploid (2n) individual (aka phasing) to further use the
haploid sequences for distance calculations. To cope with this situation,
distSTRING
uses a literal distance [@chang2017whole] which can be directly
applied on IUPAC
nucleotide ambiguity encoded sequences with the
dnastring2dist()
function. IUPAC
sequences can be for example obtained
directly from mapped BAM
files and the angsd -doFasta 4
option [@korneliussen2014angsd].
The Grantham's score [@grantham1974amino] attempts to predict the distance
between two amino acids, in an evolutionary sense considering the amino acid
composition, polarity and molecular volume. distSTRING
offers with the
aastring2dist()
function the possibility to obtain pairwise distances of all
sequences in an Biostrings::AAStringSet
(needs to be pre-aligned). The
resulting distance matrix can be used to calculate neighbor-joining trees via
the R package ape
.
Calculating synonymous (Ks) and nonsynonymous (Ka) substitutions from coding
sequences and its ratio Ka/Ks can be used as an indicator of selective
pressure acting on a protein. The dnastring2kaks()
function can be applied on
pre-aligned Biostrings::DNAStringSet()
objects to calculate these values
either according to [@li1993unbiased] via the R package
seqinr
or
according to the model of [@nei1986simple].
Further, all codons can be evaluated among the coding sequence alignment and
be plotted to for example protein domains with substitutions or indels with the
codonmat2xy()
function.
To install this package, start R (version "4.1") and enter:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("distSTRING")
# load distSTRING library(distSTRING) # load example data data(hiv, package="distSTRING") data(AAMatrix, package="distSTRING") data(woodmouse, package="ape")
To be able to use distance calculation functions from other R packages, like
ape
or
seqinr
, it is
necessary to have dedicated sequence format conversion functions. Here, some
examples are shown, how to convert from and to a Biostrings::DNAStringSet
object.
?Biostrings::DNAStringSet()
>>> ?seqinr::as.alignment()
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## define names names(cds1.cds2.aln) <- c("seq1", "seq2") ## convert into alignment cds1.cds2.aln |> dnastring2aln()
?seqinr::as.alignment()
>>> ?Biostrings::DNAStringSet()
## convert back into DNAStringSet cds1.cds2.aln |> dnastring2aln() |> aln2dnastring()
?Biostrings::DNAStringSet()
>>> ?ape::DNAbin()
## convert into alignment cds1.cds2.aln |> dnastring2dnabin()
?ape::DNAbin()
>>> ?Biostrings::DNAStringSet()
## convert back into DNAStringSet cds1.cds2.aln |> dnastring2dnabin() |> dnabin2dnastring() ## use woodmouse data woodmouse |> dnabin2dnastring()
?Biostrings::AAStringSet()
>>> ?seqinr::as.alignment()
## translate cds into aa aa1.aa2.aln <- cds1.cds2.aln |> cds2aa() ## convert into alignment aa1.aa2.aln |> aastring2aln()
?seqinr::as.alignment()
>>> ?Biostrings::AAStringSet()
## convert back into AAStringSet aa1.aa2.aln |> aastring2aln() |> aln2aastring()
?Biostrings::AAStringSet()
>>> ?ape::as.AAbin()
## convert into AAbin aa1.aa2.aln |> aastring2aabin()
?ape::as.AAbin()
>>> ?Biostrings::AAStringSet()
## convert back into AAStringSet aa1.aa2.aln |> aastring2aabin() |> aabin2aastring()
Biostrings::DNAStringSet
translation (cds2aa()
)To be able to translate a coding sequence into amino acids, sometimes the
sequences do not start at the first frame. The cds2aa
function can take an
alternative codon start site into account (frame = 1
or frame = 2
or
frame = 3
). However, sometimes it is also
necessary that the resulting coding sequence length is a multiple of three.
This can be forced by using the shorten = TRUE
option.
Simple translation:
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## define names names(cds1.cds2.aln) <- c("seq1", "seq2") ## translate cds into aa cds1.cds2.aln |> cds2aa() aa1.aa2.aln <- cds1.cds2.aln |> cds2aa()
Translation keeping multiple of three sequence length:
## translate cds into aa using frame = 2 ## result is empty due to not multiple of three cds1.cds2.aln |> cds2aa(frame=2) ## translate cds into aa using frame = 2 and shorten = TRUE cds1.cds2.aln |> cds2aa(frame=2, shorten=TRUE) ## translate cds into aa using frame = 3 and shorten = TRUE cds1.cds2.aln |> cds2aa(frame=3, shorten=TRUE) ## use woodmouse data woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE)
Translation using alternative genetic code:
As you can see from the above example, the initial amino acids I
will change
into M
due to the mitochondrial translation code and also some *
stop
codons will change into a W
amino acid.
## alternative genetic code ## use woodmouse data woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, genetic.code=Biostrings::getGeneticCode("2"))
aastring2dist()
)## calculate pairwise AA distances based on Grantham's distance aa.dist <- hiv |> cds2aa() |> aastring2dist(score=granthamMatrix()) ## obtain distances head(aa.dist$distSTRING) ## obtain pairwise sites used head(aa.dist$sitesUsed)
## create and plot bionj tree aa.dist.bionj <- ape::bionj(as.dist(aa.dist$distSTRING)) plot(aa.dist.bionj)
To use a different score matrix, here as an example the AAMatrix
from the
R package alakazam
is used:
## use AAMatrix data head(AAMatrix) aa.dist.AAMatrix <- hiv |> cds2aa() |> aastring2dist(score=AAMatrix) head(aa.dist.AAMatrix$distSTRING)
dnastring2dist()
)ape::dist.dna
models## use hiv data dna.dist <- hiv |> dnastring2dist(model="K80") ## obtain distances head(dna.dist$distSTRING) ## obtain pairwise sites used head(dna.dist$sitesUsed)
## create and plot bionj tree dna.dist.bionj <- ape::bionj(as.dist(dna.dist$distSTRING))
It is also possible to compare the amino acid and nucleotide based trees:
## creation of the association matrix: association <- cbind(aa.dist.bionj$tip.label, aa.dist.bionj$tip.label) ## cophyloplot ape::cophyloplot(aa.dist.bionj, dna.dist.bionj, assoc=association, length.line=4, space=28, gap=3, rotate=FALSE)
IUPAC
distance## use hiv data hiv.dist.iupac <- head(hiv |> dnastring2dist(model="IUPAC")) head(hiv.dist.iupac$distSTRING) ## run multi-threaded system.time(hiv |> dnastring2dist(model="IUPAC", threads=1)) system.time(hiv |> dnastring2dist(model="IUPAC", threads=2))
Woodmouse data example:
## use woodmouse data woodmouse.dist <- woodmouse |> dnabin2dnastring() |> dnastring2dist() head(woodmouse.dist$distSTRING)
dnastring2kaks()
)## use hiv data ## model Li head(hiv |> dnastring2kaks(model="Li")) ## model NG86 head(hiv |> dnastring2kaks(model="NG86", threads=1))
As an example for the codon comparison data from the Human Immunodeficiency Virus Type 1 is used [@ganeshan1997human], [@yang2000codon].
The window plots are constructed with the R package
ggplot2
.
dnastring2codonmat()
)## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment cds1.cds2.aln |> dnastring2codonmat()
Like the cds2aa()
function, also the dnastring2codonmat()
function is
implemented to handle different frames.
## use frame 2 and shorten to circumvent multiple of three error cds1 <- Biostrings::DNAString("-ATGCAACATTGC-") cds2 <- Biostrings::DNAString("-ATG---CATTGC-") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) cds1.cds2.aln |> dnastring2codonmat(frame=2, shorten=TRUE)
codonmat2xy()
)## use hiv data ## calculate average behavior hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()
print(hiv.xy |> dplyr::select(Codon,SynMean,NonSynMean,IndelMean) |> tidyr::gather(variable, values, -Codon) |> ggplot2::ggplot(aes(x=Codon, y=values)) + ggplot2::geom_line(aes(colour=factor(variable))) + ggplot2::geom_point(aes(colour=factor(variable))) + ggplot2::ggtitle("HIV-1 sample 136 patient 1 from Sweden envelope glycoprotein (env) gene"))
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.