molEvolSim: Molecular Evolution Simulator
In guenardg/MPSEM: Modelling Phylogenetic Signals using Eigenvector Maps

View source: R/molEvolSim.R

molEvolSim

R Documentation

Molecular Evolution Simulator

Description

Functions to simulate the evolution of DNA sequences following a Markov substitution process.

Calculates the shift rate matrix associated with one of eight DNA evolution models.

Instantiate a DNA (molecular) evolution simulator from a shift intensity matrix over a given evolutionary time.

Generates a random sequence of gaps or DNA bases.

Generates a set of random nucleotide evolution rate from a gamma distribution.

Generates a set of DNA sequences along the edge of a phylogenetic network by evolving filial sequences from parental ones following a random Markov process.

Usage

DNArate(
  model = c("JC69", "K80", "K81", "F81", "HKY85", "T92", "TN93", "GTR"),
  piGap = 0,
  deletionRate = 0,
  insertionRate = 0,
  pi,
  piGC,
  par
)

molEvolSim(Q, step, rho)

drawDNASequence(NN, piGap = 0, pi = rep(0.25, 4L))

drawEvolRate(NN, gamma.shape = 5, gamma.scale = 5e-04)

simulateSequence(
  x,
  Q,
  sqn,
  rate,
  contrib = function(x, a = 0, ...) (x^-a)/sum(x^-a),
  ...
)

Arguments

`model`	A character string identifying one of eight DNA evolution models, namely `"JC69"`: Jukes and Cantor (1969); `"K80"`: Kimura (1980; aka K2P); `"K81"`: Kimura (1981, aka K3P); `"F81"`: Felsenstein (1981); `"HKY85"`: Hasegawa, Kishino and Yano (1985); `"T92"`: Tamura (1992); `"TN93"`: Tamura and Nei (1993); `"GTR"`: Tavaré (1986); or any unambiguous abbreviation thereof (default: `"JC69"`).
`piGap`	Numeric; the equilibrium frequency of the gaps (a value between 0 and 1, default: 0).
`deletionRate`	Numeric; the rate of nucleotide deletion (default: 0).
`insertionRate`	Numeric; the rate of nucleotide insertion (default: 0).
`pi`	Optional. A length 4 numeric between 0 and 1, summing to 1 and giving the equilibrium base frequencies or the probabilities for drawing one of the four DNA bases (A, G, C, or T, in that order). When missing, the function assumes equal equilibrium base frequencies (`pi = rep(0.25, 4)`).
`piGC`	Optional. A numeric between 0 and 1; the GC relative frequency with respect to AT. When missing, the function assumes equal equilibrium base frequencies (`piGC = 0.5`).
`par`	Optional. A set with 1 to 6 evolution rate parameters (numeric; see the details below).
`Q`	A 5 x 5 shift intensity matrix (transition and transversion) such as the one obtained from function `DNAevol` (see details).
`step`	The simulation time step (arbitrary units of time).
`rho`	An evolution rate factor to be used on top of the shift intensity matrix.
`NN`	An integer; the number of locations to be generated. A locations is either one of the four nucleotides or a gap.
`gamma.shape`	A numeric; shape parameter of the beta distribution used to draw the nucleotide (and gaps) evolution rates.
`gamma.scale`	A numeric; scale parameter of the beta distribution used to draw the nucleotide (and gaps) evolution rates.
`x`	A `graph-class` object.
`sqn`	A raw vector or matrix containing one or more initial DNA sequence.
`rate`	A numeric vector of mean evolution rates.
`contrib`	A function that determine the contribution of the parent vertices as a function of the evolutionary distances (passed as its first argument. It may have any number of argument, each with a default value, and must accept arbitrary arguments (...). Default: `function(x, a = 0, ...) (x^-a)/sum(x^-a)`.
`...`	Any named arguments to be internally passed to other functions or methods.

Details

The molecular evolution model is based on a set of locations that can take one of five states, namely gap ('-'), adenine ('A'), guanine ('G') cytosine ('C'), or Thymine ('T'). A seed sequence is evolved one location at a time. Changes from one nucleotides to another appear as nucleotide transitions or transversions (shifts), whereas changes from a gap to one of the nucleotides appear as an insertion and changes from one the nucleotides to a gap as a deletion.

The changes are simulated as a simple Markov process, using a shift probability matrix, which is calculated using a shift intensity matrix Q. The off-diagonal elements of this matrix can be calculated from a DNA evolution model by DNArate(), The diagonal elements are defined by construct. Multiple shift intensity matrices can be employed for various simulated sequences, it necessary.

Gap-only locations are discarded at the outset of the process, yielding sets of sequences that are more or less shorter than the prescribed number of nucleotides depending on the gap frequency used in the initial sequence. The initial (root) sequence can be drawn from a uniform distribution with user-defined frequencies (i.e., using function link{drawDNASequence}), whereas each location's evolution rate can be drawn from a gamma distribution (i.e., using function link{drawEvolRate}).

The molecular evolution simulator can be instantiated multiple times, by calling molEvolSim multiple times and storing the results into a list. For instance, if every single location is assigned its own evolution rate, a simulator is implemented for every single location. Then, member function $evolve(N) is called to evolve a location 'N', with the returned value being the new location value. When the time step of evolution rate change, member function $recalculate(step, rho) is called to update the shift probability matrix with time step 'step' and evolution rate 'rho'. The shift matrix is obtained using member function $getMt().

The gap opening and closure rates (arguments gepOpen and gspClose, respectively) correspond to the total of the four changes (from A, G, C, or T to a gap in the case of the gap opening rate and from a gap to A, G, C, or T, in the case of a gap closure).

Models "JC69", "K80", and "K81" do not require equilibrium base frequencies (argument pi), since they assume equal equilibrium base frequencies, whereas model "T92" uses parameter piGC to calculate these frequencies. Therefore, these models will disregard any values provided to argument pi. For the other model, unequal equilibrium base frequencies may be provided through argument pi. Otherwise, equal equilibrium base frequencies are assumed in all cases. Argument par is disregarded by models "JC69" and "F81". The number of values provided to that argument varies depending on the model. "K80", "HKY85", and "T92" require a single value, "K81" and "TN93", require two; and "GTR" requires six. In all cases, omitting par will yield a model equivalent to "F81" (or "JC69" with equal equilibrium base frequencies). Permissible values for argument par are positive values and are subjected to further constraints that depend on the model.

The "JC69" and "F81" models take no par value. The value of the single parameter (par[1]) of models "K80", "HKY85", and "T92", as well as the two parameters of model "TN93", have to be between 0 and 1. For the "K81" model, the two parameters also must have a sum smaller than, or equal to 1. Finally, the six parameters of the "GTR" model must sum to 6, but can be individually larger than 1.

Function simulateSequence() is assigned a random sequence at the origin vertex (or vertices) of the evolutionary graph through its argument sqn, and also takes a set of evolution rate (one for each location) through its argument rate. The DNA sequence evolution is simulated as a Markov process described by the transition intensity matrix given as argument Q.

Value

DNArate: A 5 x 5 shift intensity matrix (transition and transversion)
molEvolSim: ...
drawDNASequence: A raw vector of length NN containing the ASCII values for characters '-' (0x2d), 'A' (0x41), 'C' (0x43), 'G' (0x47), and 'T' (0x54) representing random nucleotides to be used as the seed sequence for a DNA fragment.
drawEvolRate: A numeric vectors of length NN containing the evolution rate for each of the nucleotides in the sequence.
simulateSequence: A raw matrix with as many rows as the number of vertices in the graph and as many columns as the number of nucleotides in the simulated sequence. The elements of the matrix are ASCII values for characters '-' (0x2d), 'A' (0x41), 'C' (0x43), 'G' (0x47), and 'T' (0x54) representing the simulated nucleotides.

Functions

DNArate():
molEvolSim():
drawDNASequence():
drawEvolRate():
simulateSequence():

Author(s)

Guillaume Guénard [aut, cre] (<https://orcid.org/0000-0003-0761-3072>), Pierre Legendre [ctb] (<https://orcid.org/0000-0002-3838-3305>)

References

Jukes, T.H. & Cantor, C.R. (1969). Evolution of Protein Molecules. New York: Academic Press pp. 21-132. doi:10.3389/fgene.2015.00319

Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16(2): 111-120. doi:10.1007/BF01731581

Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences of the United States of America 78(1): 454-458. doi:10.1073/pnas.78.1.454

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6): 368-376. doi:10.1007/BF01734359

Hasegawa, M.; Kishino, H. & Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22(2): 160-174. doi:10.1007/BF02101694

Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Molecular Biology and Evolution. 9(4): 678-687. doi:10.1093/oxfordjournals.molbev.a040752

Tamura, K. & Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10(3): 512-526. doi:10.1093/oxfordjournals.molbev.a040023

Tavaré S. 1986. Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sciences. 17: 57-86.

Examples

## Examples of various molecular evolution models

## Jukes and Cantor (1969):
DNArate("JC69", piGap=0.30, insertionRate=0.02, deletionRate=0.02)

## Kimura 1980:
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=0)
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=2/3)
DNArate("K80", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=1)

## Kimura 1981:
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par=c(0,1/3))
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par=c(1/3,0))
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par=c(1/3,0))
DNArate("K81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par=c(0.1,0.4))

## Felsenstein 1981:
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.5,0.15,0.10,0.25))
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.15,0.5,0.12,0.23))
DNArate("F81", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.15,0.5,0.05,0.3))

## Hasegawa, Kishino, and Yano (1985)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.5,0.15,0.10,0.25))
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.5,0.15,0.10,0.25), par=0.1)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.5,0.15,0.10,0.25), par=0.5)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.5,0.15,0.10,0.25), par=0.9)
DNArate("HKY85", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=0.9)

## Hasegawa, Kishino, and Yano (1985)
DNArate("HKY85")
DNArate("HKY85", pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.1, pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.5, pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.9, pi = c(0.5, 0.15, 0.10, 0.25))
DNArate("HKY85", par = 0.5)

## Tamura (1992)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02, par=0.1)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02, piGC=0.25)
DNArate("T92", piGap=0.30, insertionRate=0.02, deletionRate=0.02, piGC=1/3,
        par=0.2)

## Tamura & Nei (1993)
DNArate("TN93", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("TN93", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par=c(0.1,0.2))
DNArate("TN93", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par = c(1/3,1/2))

## Generalized time-reversible (GTR; Tavaré 1986)
DNArate("GTR", piGap=0.30, insertionRate=0.02, deletionRate=0.02)
DNArate("GTR", piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        par=c(1.5,1,0.5,0,2,1))
DNArate("GTR", , piGap=0.30, insertionRate=0.02, deletionRate=0.02,
        pi=c(0.5,0.25,0.15,0.10), par=c(1.5,1,0.5,0,2,1))

## The transition intensity matrix from a Kimura (1980) model:
Q <- DNArate(model="K80", piGap = 0.3, deletionRate=0.1, insertionRate=0.1)
Q

## Implement the molecular evolution simulator for a single nucleotide:
molEvolSim(Q = Q, step = 1, rho = 5) -> em1

## Get the transition probability matrix as follows:
em1$getMt()

## A vector of raw as examples of initial traits:
tr <- charToRaw("-AGCT")

## Set the RNG seed.
set.seed(28746549L)

## Simulate molecular evolution from:
rawToChar(em1$evolve(tr[1L]))    ## a gap.
rawToChar(em1$evolve(tr[2L]))    ## an adenine base.
rawToChar(em1$evolve(tr[3L]))    ## a guanine base.
rawToChar(em1$evolve(tr[4L]))    ## a cytosine base.
rawToChar(em1$evolve(tr[5L]))    ## a thymine base.

## Recalculate the probabilities for a lower mean evolution rate (one tenth
## the previous one):
em1$recalculate(1, 0.1)

em1$getMt()        ## The recalculated transition probability matrix.

## Simulate molecular evolution from:
rawToChar(em1$evolve(tr[1L]))    ## a gap.
rawToChar(em1$evolve(tr[2L]))    ## an adenine base.
rawToChar(em1$evolve(tr[3L]))    ## a guanine base.
rawToChar(em1$evolve(tr[4L]))    ## a cytosine base.
rawToChar(em1$evolve(tr[5L]))    ## a thymine base.

## Base changes are now less probable.

## Simulate the evolution of a sequence with 100 base pairs for 250 generations.
Ngeneration <- 250
Nnucleotide <- 100

## This is the matrix holding the sequences:
seq <- matrix(raw(), Ngeneration, Nnucleotide)

## Drawing the initial sequence:
seq[1,] <- drawDNASequence(Nnucleotide, piGap = 0.25, pi = c(0.4,0.4,0.1,0.1))

## Each site has its own mean evolution rate, which are drawn as follows:
erate <- drawEvolRate(Nnucleotide, gamma.shape = 5, gamma.scale = 5e-03)

## Using the Hasegawa, Kishino, and Yano (1985) model:
DNArate(
  model = "HKY85",
  piGap = 0.25,
  deletionRate = 0.1,
  insertionRate = 0.1,
  pi = c(0.4, 0.4, 0.1, 0.1),
  par = 0.25
) -> Q

## Instantiating a molecular evolution models for each site using the single
## change rate matrix, a constant time step of 1 (Ma), and individual mean
## evolution rates drawn previously:
em <- list()
for(j in 1:Nnucleotide)
  em[[j]] <- molEvolSim(Q = Q, step = 1, rho = erate[j])

## These for loops call the $evolve() function to evolve each site for each
## generation as follows:
for(i in 2:Ngeneration)
  for(j in 1:Nnucleotide)
    seq[i, j] <- em[[j]]$evolve(seq[i - 1, j])

## Sequences with the gaps (perfect alignment):
concatenate(seq) %>%
  show.sequence

## Sequences with the gaps removed (prior to multiple sequence alignment):
concatenate(seq, discard="-") %>%
  show.sequence

### Examples for molEvolSim() here...

## Clean up:
rm(Q, em1, tr, Ngeneration, Nnucleotide, seq, erate, em, i, j)

guenardg/MPSEM documentation built on April 14, 2025, 3:53 p.m.