suppressPackageStartupMessages({
    library(distSTRING)
    library(Biostrings)
    library(ape)
    library(dplyr)
    library(tidyr)
    library(ggplot2)
    })

Introduction

Calculating pairwise distances of either DNA or AA sequences is a common task for evolutionary biologist. The distance calculations are either based on specific nucleotide, codon or amino acid models or on a scoring matrix.

Note: Sequences need to be pre-aligned into so called multiple sequence alignments (MSA), which can be done with a multitude of existing software. Just to mention for example mafft, muscle or the R package msa.

The R package ape for example offers the ape::dist.dna() function, which has implemented a collection of different evolutionary models. distSTRING extends the possibility to directly calculate pairwise nucloetide distances of an Biostrings::DNAStringSet object or pairwise amino acid distances of an Biostrings::AAStringSet object. The scoring matrix based calculations are implemented in c++ with RcppThread to parallelise pairwise combinations.

It is a non-trivial part to resolve haploid (1n) sequences from a diploid (2n) individual (aka phasing) to further use the haploid sequences for distance calculations. To cope with this situation, distSTRING uses a literal distance [@chang2017whole] which can be directly applied on IUPAC nucleotide ambiguity encoded sequences with the dnastring2dist() function. IUPAC sequences can be for example obtained directly from mapped BAM files and the angsd -doFasta 4 option [@korneliussen2014angsd].

The Grantham's score [@grantham1974amino] attempts to predict the distance between two amino acids, in an evolutionary sense considering the amino acid composition, polarity and molecular volume. distSTRING offers with the aastring2dist() function the possibility to obtain pairwise distances of all sequences in an Biostrings::AAStringSet (needs to be pre-aligned). The resulting distance matrix can be used to calculate neighbor-joining trees via the R package ape.

Calculating synonymous (Ks) and nonsynonymous (Ka) substitutions from coding sequences and its ratio Ka/Ks can be used as an indicator of selective pressure acting on a protein. The dnastring2kaks() function can be applied on pre-aligned Biostrings::DNAStringSet() objects to calculate these values either according to [@li1993unbiased] via the R package seqinr or according to the model of [@nei1986simple].

Further, all codons can be evaluated among the coding sequence alignment and be plotted to for example protein domains with substitutions or indels with the codonmat2xy() function.

Installation

To install this package, start R (version "4.1") and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("distSTRING")

Load distSTRING

# load distSTRING
library(distSTRING)
# load example data
data(hiv, package="distSTRING")
data(AAMatrix, package="distSTRING")
data(woodmouse, package="ape")

Sequence Format conversion

To be able to use distance calculation functions from other R packages, like ape or seqinr, it is necessary to have dedicated sequence format conversion functions. Here, some examples are shown, how to convert from and to a Biostrings::DNAStringSet object.

?Biostrings::DNAStringSet() >>> ?seqinr::as.alignment()

## define two cds sequences
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATG---CATTGC")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
## define names
names(cds1.cds2.aln) <- c("seq1", "seq2")
## convert into alignment
cds1.cds2.aln |> dnastring2aln()

?seqinr::as.alignment() >>> ?Biostrings::DNAStringSet()

## convert back into DNAStringSet
cds1.cds2.aln |> dnastring2aln() |> aln2dnastring()

?Biostrings::DNAStringSet() >>> ?ape::DNAbin()

## convert into alignment
cds1.cds2.aln |> dnastring2dnabin()

?ape::DNAbin() >>> ?Biostrings::DNAStringSet()

## convert back into DNAStringSet
cds1.cds2.aln |> dnastring2dnabin() |> dnabin2dnastring()
## use woodmouse data
woodmouse |> dnabin2dnastring()

?Biostrings::AAStringSet() >>> ?seqinr::as.alignment()

## translate cds into aa
aa1.aa2.aln <- cds1.cds2.aln |> cds2aa()
## convert into alignment
aa1.aa2.aln |> aastring2aln()

?seqinr::as.alignment() >>> ?Biostrings::AAStringSet()

## convert back into AAStringSet
aa1.aa2.aln |> aastring2aln() |> aln2aastring()

?Biostrings::AAStringSet() >>> ?ape::as.AAbin()

## convert into AAbin
aa1.aa2.aln |> aastring2aabin()

?ape::as.AAbin() >>> ?Biostrings::AAStringSet()

## convert back into AAStringSet
aa1.aa2.aln |> aastring2aabin() |> aabin2aastring()

Frame aware Biostrings::DNAStringSet translation (cds2aa())

To be able to translate a coding sequence into amino acids, sometimes the sequences do not start at the first frame. The cds2aa function can take an alternative codon start site into account (frame = 1 or frame = 2 or frame = 3). However, sometimes it is also necessary that the resulting coding sequence length is a multiple of three. This can be forced by using the shorten = TRUE option.

Simple translation:

## define two cds sequences
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATG---CATTGC")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
## define names
names(cds1.cds2.aln) <- c("seq1", "seq2")
## translate cds into aa
cds1.cds2.aln |> cds2aa()
aa1.aa2.aln <- cds1.cds2.aln |> cds2aa()

Translation keeping multiple of three sequence length:

## translate cds into aa using frame = 2
## result is empty due to not multiple of three
cds1.cds2.aln |> cds2aa(frame=2)
## translate cds into aa using frame = 2 and shorten = TRUE
cds1.cds2.aln |> cds2aa(frame=2, shorten=TRUE)
## translate cds into aa using frame = 3 and shorten = TRUE
cds1.cds2.aln |> cds2aa(frame=3, shorten=TRUE)
## use woodmouse data
woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE)

Translation using alternative genetic code:

As you can see from the above example, the initial amino acids I will change into M due to the mitochondrial translation code and also some * stop codons will change into a W amino acid.

## alternative genetic code
## use woodmouse data
woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE,
    genetic.code=Biostrings::getGeneticCode("2"))

Pairwise sequence comparison

Calculate pairwise AA distances (aastring2dist())

Grantham's distance

## calculate pairwise AA distances based on Grantham's distance
aa.dist <- hiv |> cds2aa() |> aastring2dist(score=granthamMatrix())
## obtain distances
head(aa.dist$distSTRING)
## obtain pairwise sites used
head(aa.dist$sitesUsed)
## create and plot bionj tree
aa.dist.bionj <- ape::bionj(as.dist(aa.dist$distSTRING))
plot(aa.dist.bionj)

To use a different score matrix, here as an example the AAMatrix from the R package alakazam is used:

## use AAMatrix data
head(AAMatrix)
aa.dist.AAMatrix <- hiv |> cds2aa() |> aastring2dist(score=AAMatrix)
head(aa.dist.AAMatrix$distSTRING)

Calculate pairwise DNA distances (dnastring2dist())

ape::dist.dna models

## use hiv data
dna.dist <- hiv |> dnastring2dist(model="K80")
## obtain distances
head(dna.dist$distSTRING)
## obtain pairwise sites used
head(dna.dist$sitesUsed)
## create and plot bionj tree
dna.dist.bionj <- ape::bionj(as.dist(dna.dist$distSTRING))

It is also possible to compare the amino acid and nucleotide based trees:

## creation of the association matrix:
association <- cbind(aa.dist.bionj$tip.label, aa.dist.bionj$tip.label)
## cophyloplot
ape::cophyloplot(aa.dist.bionj,
    dna.dist.bionj,
    assoc=association,
    length.line=4,
    space=28,
    gap=3,
    rotate=FALSE)

IUPAC distance

## use hiv data
hiv.dist.iupac <- head(hiv |> dnastring2dist(model="IUPAC"))
head(hiv.dist.iupac$distSTRING)
## run multi-threaded
system.time(hiv |> dnastring2dist(model="IUPAC", threads=1))
system.time(hiv |> dnastring2dist(model="IUPAC", threads=2))

Woodmouse data example:

## use woodmouse data
woodmouse.dist <- woodmouse |> dnabin2dnastring() |> dnastring2dist()
head(woodmouse.dist$distSTRING)

Coding sequences

Calculating synonymous and nonsynonymous substitutions (dnastring2kaks())

## use hiv data
## model Li
head(hiv |> dnastring2kaks(model="Li"))
## model NG86
head(hiv |> dnastring2kaks(model="NG86", threads=1))

Codon comparison

As an example for the codon comparison data from the Human Immunodeficiency Virus Type 1 is used [@ganeshan1997human], [@yang2000codon].

The window plots are constructed with the R package ggplot2.

Create codon matrix (dnastring2codonmat())

## define two cds sequences
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATG---CATTGC")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
## convert into alignment
cds1.cds2.aln |> dnastring2codonmat()

Like the cds2aa() function, also the dnastring2codonmat() function is implemented to handle different frames.

## use frame 2 and shorten to circumvent multiple of three error
cds1 <- Biostrings::DNAString("-ATGCAACATTGC-")
cds2 <- Biostrings::DNAString("-ATG---CATTGC-")
cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1),
    Biostrings::DNAStringSet(cds2))
cds1.cds2.aln |> dnastring2codonmat(frame=2, shorten=TRUE)

Calculate average behavior of each codon (codonmat2xy())

## use hiv data
## calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()

Plot average behavior of each codon

print(hiv.xy |> dplyr::select(Codon,SynMean,NonSynMean,IndelMean) |>
    tidyr::gather(variable, values, -Codon) |>
    ggplot2::ggplot(aes(x=Codon, y=values)) + 
    ggplot2::geom_line(aes(colour=factor(variable))) + 
    ggplot2::geom_point(aes(colour=factor(variable))) + 
    ggplot2::ggtitle("HIV-1 sample 136 patient 1 from
        Sweden envelope glycoprotein (env) gene"))

References


nocite: '@*'

Session Info

sessionInfo()


kullrich/distSTRING documentation built on Dec. 21, 2021, 8:42 a.m.