molEvolSim | R Documentation |
Functions to simulate the evolution of DNA sequences following a Markov substitution process.
Calculates the shift rate matrix associated with one of eight DNA evolution models.
Instantiate a DNA (molecular) evolution simulator from a shift intensity matrix over a given evolutionary time.
Generates a random sequence of gaps or DNA bases.
Generates a set of random nucleotide evolution rate from a gamma distribution.
Generates a set of DNA sequences along the edge of a phylogenetic network by evolving filial sequences from parental ones following a random Markov process.
DNArate(
model = c("JC69", "K80", "K81", "F81", "HKY85", "T92", "TN93", "GTR"),
piGap = 0,
deletionRate = 0,
insertionRate = 0,
pi,
piGC,
par
)
molEvolSim(Q, step, rho)
drawDNASequence(NN, piGap = 0, pi = rep(0.25, 4L))
drawEvolRate(NN, gamma.shape = 5, gamma.scale = 5e-04)
simulateSequence(
x,
Q,
sqn,
rate,
contrib = function(x, a = 0, ...) (x^-a)/sum(x^-a),
...
)
model |
A character string identifying one of eight DNA evolution
models, namely |
piGap |
Numeric; the equilibrium frequency of the gaps (a value between 0 and 1, default: 0). |
deletionRate |
Numeric; the rate of nucleotide deletion (default: 0). |
insertionRate |
Numeric; the rate of nucleotide insertion (default: 0). |
pi |
Optional. A length 4 numeric between 0 and 1, summing to 1 and
giving the equilibrium base frequencies or the probabilities for drawing one
of the four DNA bases (A, G, C, or T, in that order). When missing, the
function assumes equal equilibrium base frequencies
( |
piGC |
Optional. A numeric between 0 and 1; the GC relative frequency
with respect to AT. When missing, the function assumes equal equilibrium base
frequencies ( |
par |
Optional. A set with 1 to 6 evolution rate parameters (numeric; see the details below). |
Q |
A 5 x 5 shift intensity matrix (transition and transversion) such as
the one obtained from function |
step |
The simulation time step (arbitrary units of time). |
rho |
An evolution rate factor to be used on top of the shift intensity matrix. |
NN |
An integer; the number of locations to be generated. A locations is either one of the four nucleotides or a gap. |
gamma.shape |
A numeric; shape parameter of the beta distribution used to draw the nucleotide (and gaps) evolution rates. |
gamma.scale |
A numeric; scale parameter of the beta distribution used to draw the nucleotide (and gaps) evolution rates. |
x |
A |
sqn |
A raw vector or matrix containing one or more initial DNA sequence. |
rate |
A numeric vector of mean evolution rates. |
contrib |
A function that determine the contribution of the parent
vertices as a function of the evolutionary distances (passed as its first
argument. It may have any number of argument, each with a default value, and
must accept arbitrary arguments (...). Default:
|
... |
Any named arguments to be internally passed to other functions or methods. |
The molecular evolution model is based on a set of locations that can take one of five states, namely gap ('-'), adenine ('A'), guanine ('G') cytosine ('C'), or Thymine ('T'). A seed sequence is evolved one location at a time. Changes from one nucleotides to another appear as nucleotide transitions or transversions (shifts), whereas changes from a gap to one of the nucleotides appear as an insertion and changes from one the nucleotides to a gap as a deletion.
The changes are simulated as a simple Markov process, using a shift
probability matrix, which is calculated using a shift intensity matrix
Q
. The off-diagonal elements of this matrix can be calculated from a
DNA evolution model by DNArate()
, The diagonal elements are defined by
construct. Multiple shift intensity matrices can be employed for various
simulated sequences, it necessary.
Gap-only locations are discarded at the outset of the process, yielding sets
of sequences that are more or less shorter than the prescribed number of
nucleotides depending on the gap frequency used in the initial sequence. The
initial (root) sequence can be drawn from a uniform distribution with
user-defined frequencies (i.e., using function
link{drawDNASequence}
), whereas each location's evolution rate can be
drawn from a gamma distribution (i.e., using function
link{drawEvolRate}
).
The molecular evolution simulator can be instantiated multiple times, by
calling molEvolSim
multiple times and storing the results into a list.
For instance, if every single location is assigned its own evolution rate, a
simulator is implemented for every single location. Then, member function
$evolve(N)
is called to evolve a location 'N', with the returned value
being the new location value. When the time step of evolution rate change,
member function $recalculate(step, rho)
is called to update the
shift probability matrix with time step 'step' and evolution rate 'rho'.
The shift matrix is obtained using member function $getMt()
.
The gap opening and closure rates (arguments gepOpen
and
gspClose
, respectively) correspond to the total of the four changes
(from A, G, C, or T to a gap in the case of the gap opening rate and from a
gap to A, G, C, or T, in the case of a gap closure).
Models "JC69"
, "K80"
, and "K81"
do not require
equilibrium base frequencies (argument pi
), since they assume equal
equilibrium base frequencies, whereas model "T92"
uses parameter
piGC
to calculate these frequencies. Therefore, these models will
disregard any values provided to argument pi
. For the other model,
unequal equilibrium base frequencies may be provided through argument
pi
. Otherwise, equal equilibrium base frequencies are assumed in all
cases. Argument par
is disregarded by models "JC69"
and
"F81"
. The number of values provided to that argument varies depending
on the model. "K80"
, "HKY85"
, and "T92"
require a single
value, "K81"
and "TN93"
, require two; and "GTR"
requires
six. In all cases, omitting par
will yield a model equivalent to
"F81"
(or "JC69"
with equal equilibrium base frequencies).
Permissible values for argument par
are positive values and are
subjected to further constraints that depend on the model.
The "JC69"
and "F81"
models take no par
value. The
value of the single parameter (par[1]
) of models "K80"
,
"HKY85"
, and "T92"
, as well as the two parameters of model
"TN93"
, have to be between 0 and 1. For the "K81"
model, the
two parameters also must have a sum smaller than, or equal to 1. Finally, the
six parameters of the "GTR"
model must sum to 6, but can be
individually larger than 1.
Function simulateSequence()
is assigned a random sequence at the
origin vertex (or vertices) of the evolutionary graph through its argument
sqn
, and also takes a set of evolution rate (one for each location)
through its argument rate
. The DNA sequence evolution is simulated as
a Markov process described by the transition intensity matrix given as
argument Q
.
DNArate
A 5 x 5 shift intensity matrix (transition and transversion)
molEvolSim
...
drawDNASequence
A raw vector of length NN
containing
the ASCII values for characters '-' (0x2d), 'A' (0x41), 'C' (0x43), 'G'
(0x47), and 'T' (0x54) representing random nucleotides to be used as the
seed sequence for a DNA fragment.
drawEvolRate
A numeric vectors of length NN
containing the evolution rate for each of the nucleotides in the sequence.
simulateSequence
A raw matrix with as many rows as the number of vertices in the graph and as many columns as the number of nucleotides in the simulated sequence. The elements of the matrix are ASCII values for characters '-' (0x2d), 'A' (0x41), 'C' (0x43), 'G' (0x47), and 'T' (0x54) representing the simulated nucleotides.
DNArate()
:
molEvolSim()
:
drawDNASequence()
:
drawEvolRate()
:
simulateSequence()
:
Guillaume Guénard [aut, cre] (<https://orcid.org/0000-0003-0761-3072>), Pierre Legendre [ctb] (<https://orcid.org/0000-0002-3838-3305>)
Jukes, T.H. & Cantor, C.R. (1969). Evolution of Protein Molecules. New York: Academic Press pp. 21-132. doi:10.3389/fgene.2015.00319
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16(2): 111-120. doi:10.1007/BF01731581
Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences of the United States of America 78(1): 454-458. doi:10.1073/pnas.78.1.454
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6): 368-376. doi:10.1007/BF01734359
Hasegawa, M.; Kishino, H. & Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22(2): 160-174. doi:10.1007/BF02101694
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Molecular Biology and Evolution. 9(4): 678-687. doi:10.1093/oxfordjournals.molbev.a040752
Tamura, K. & Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10(3): 512-526. doi:10.1093/oxfordjournals.molbev.a040023
Tavaré S. 1986. Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sciences. 17: 57-86.
## Examples of various molecular evolution models
## Jukes and Cantor (1969):
DNArate("JC69", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
## Kimura 1980:
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=0)
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=2/3)
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=1)
## Kimura 1981:
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par=c(0,1/3))
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par=c(1/3,0))
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par=c(1/3,0))
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par=c(0.1,0.4))
## Felsenstein 1981:
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.5,0.15,0.10,0.25))
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.15,0.5,0.12,0.23))
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.15,0.5,0.05,0.3))
## Hasegawa, Kishino, and Yano (1985)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.5,0.15,0.10,0.25))
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.5,0.15,0.10,0.25), par=0.1)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.5,0.15,0.10,0.25), par=0.5)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.5,0.15,0.10,0.25), par=0.9)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=0.9)
## Hasegawa, Kishino, and Yano (1985)
DNArate("HKY85")
DNArate("HKY85", pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.1, pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.5, pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.9, pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.5)
## Tamura (1992)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=0.1)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02, piGC=0.25)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02, piGC=1/3,
par=0.2)
## Tamura & Nei (1993)
DNArate("TN93", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("TN93", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par=c(0.1,0.2))
DNArate("TN93", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par = c(1/3,1/2))
## Generalized time-reversible (GTR; Tavaré 1986)
DNArate("GTR", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("GTR", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
par=c(1.5,1,0.5,0,2,1))
DNArate("GTR", , piGap=0.30, insertionRate=0.02, deletionRate=0.02,
pi=c(0.5,0.25,0.15,0.10), par=c(1.5,1,0.5,0,2,1))
## The transition intensity matrix from a Kimura (1980) model:
Q <- DNArate(model="K80", piGap = 0.3, deletionRate=0.1, insertionRate=0.1)
Q
## Implement the molecular evolution simulator for a single nucleotide:
molEvolSim(Q = Q, step = 1, rho = 5) -> em1
## Get the transition probability matrix as follows:
em1$getMt()
## A vector of raw as examples of initial traits:
tr <- charToRaw("-AGCT")
## Set the RNG seed.
set.seed(28746549L)
## Simulate molecular evolution from:
rawToChar(em1$evolve(tr[1L])) ## a gap.
rawToChar(em1$evolve(tr[2L])) ## an adenine base.
rawToChar(em1$evolve(tr[3L])) ## a guanine base.
rawToChar(em1$evolve(tr[4L])) ## a cytosine base.
rawToChar(em1$evolve(tr[5L])) ## a thymine base.
## Recalculate the probabilities for a lower mean evolution rate (one tenth
## the previous one):
em1$recalculate(1, 0.1)
em1$getMt() ## The recalculated transition probability matrix.
## Simulate molecular evolution from:
rawToChar(em1$evolve(tr[1L])) ## a gap.
rawToChar(em1$evolve(tr[2L])) ## an adenine base.
rawToChar(em1$evolve(tr[3L])) ## a guanine base.
rawToChar(em1$evolve(tr[4L])) ## a cytosine base.
rawToChar(em1$evolve(tr[5L])) ## a thymine base.
## Base changes are now less probable.
## Simulate the evolution of a sequence with 100 base pairs for 250 generations.
Ngeneration <- 250
Nnucleotide <- 100
## This is the matrix holding the sequences:
seq <- matrix(raw(), Ngeneration, Nnucleotide)
## Drawing the initial sequence:
seq[1,] <- drawDNASequence(Nnucleotide, piGap = 0.25, pi = c(0.4,0.4,0.1,0.1))
## Each site has its own mean evolution rate, which are drawn as follows:
erate <- drawEvolRate(Nnucleotide, gamma.shape = 5, gamma.scale = 5e-03)
## Using the Hasegawa, Kishino, and Yano (1985) model:
DNArate(
model = "HKY85",
piGap = 0.25,
deletionRate = 0.1,
insertionRate = 0.1,
pi = c(0.4, 0.4, 0.1, 0.1),
par = 0.25
) -> Q
## Instantiating a molecular evolution models for each site using the single
## change rate matrix, a constant time step of 1 (Ma), and individual mean
## evolution rates drawn previously:
em <- list()
for(j in 1:Nnucleotide)
em[[j]] <- molEvolSim(Q = Q, step = 1, rho = erate[j])
## These for loops call the $evolve() function to evolve each site for each
## generation as follows:
for(i in 2:Ngeneration)
for(j in 1:Nnucleotide)
seq[i, j] <- em[[j]]$evolve(seq[i - 1, j])
## Sequences with the gaps (perfect alignment):
concatenate(seq) %>%
show.sequence
## Sequences with the gaps removed (prior to multiple sequence alignment):
concatenate(seq, discard="-") %>%
show.sequence
### Examples for molEvolSim() here...
## Clean up:
rm(Q, em1, tr, Ngeneration, Nnucleotide, seq, erate, em, i, j)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.