Description Usage Arguments Details Value Author(s) References Examples
Simulate the evolution of a pangenome using a random ultrametric tree as guide to resemble a coalescent process, and according to both the neutral model (mutations) and the Infinitely Many Genes (IMG) model (gene presence/absence).
1 2 3 4 |
ref |
A multi-fasta file from where to sample genes. Each fasta header is expected to identify a single gene. Each sampled gene is taken as the most recent common ancestor of each gene family sampled at the end of the simulation (final time); in other words, each gene family will coalesce to each of the original genes sampled from this file. |
norg |
The number of organisms sampled at final time. |
br |
|
ne |
Effective population size. Baumdicker et al. (2012) estimates the effective population sizes of Prochlorococcus and Synechococcus to be around 10e11, which was taken as default. This is also the number of generations from the MRCA to the sampled (final) organisms. |
C |
The coregenome size (default is 100). |
u |
Probability of gene gain per generation. Default is 1e-8, to be in accordance to values presented in Baumdicker et al. (2012). (theta = 2Ne*u, and taking Ne = 10e11, and estimates of theta ~2000 for Prochlorococcus). |
v |
Probability of gene loss rate per generation. Default is 1e-11, to be in accordance to values presented in Baumdicker et al. (2012). (rho = 2Ne*v, and taking Ne = 10e11, and estimates of rho ~2 for Prochlorococcus). |
mu |
The per site per generation substitution rate. According to Duchene et al. (2016), a typical substitution rate is around 1e-5 and 1e-9 per site, per annum. Taking 1e-9, and assuming 200 generation per annum (Shierup and Wuif, 2010), this gives mu = 5e-12, which was taken as default. |
write_by |
One of |
dir_out |
The non-existent directory where to put the simulated orthologous groups. |
replace |
|
smat |
A codon substitution matrix with probability weights for
transition. It must be a |
force |
|
verbose |
|
The algorithm first simulates a random ultrametric tree (rcoal) describing the evolutionary history of organisms sampled at the final time. Effective population size (ne) is also interpreted as the number of generations from the MRCA to the sampling time.
A IMG process is simulated using this tree, and parameters C (coregenome size), u (probability of gene gain, per generation), and v (probability of gene loss, per generation). At the end of this process a final panmatrix is obtained, describing the presence or absence of each gene for each organism, at the sampling time. The final gene presence/absence pattern is in accordance with the previously simulated coalescent tree, as the model describes.
As many genes as in the above panmatrix are sampled from the reference
file (ref). For each one the MRCA node in the tree is identified, and from
this node to the final time the gene is subjected to mutations at rate mu
(probability of substitution, per generation). Genes are first replicated
as many times it appears at the final time, and mutations are computed in
accordance to the simulated tree (each gene will follow an independent path).
This implementation mutates codons, rather than single nucleotides, because
we wanted to be able to avoid stop codons of being created at the middle of
a gene. The probability of substitution from one codon to another is decribed
by the smat
parameter, which by defaults assigns equal probability of
changing one codon to another with a single nucleotide difference between
them, 0 probability of changing from one codon to another with more than one
difference, and 0 probability of changing from any codon to a stop codon.
This behavior can be changed by modifying smat parameter, which by default
takes the substitution matrix '.codon.subst.mat'
in the namespace of
this package. To see a reference of how it was created, the file subs_mat.R
in the package source shows details, and can be taken as guide to create a
custom one.
Finally, sequences are written to the output directory (dir_out) by gene (group of orthologous sequences), or by genome, in fasta format. Each header contains information about which gene in the reference file was used to generate it. One can set the algorithm to be able to sample genes from the reference file with replace (replace = TRUE), but it is not recommended since the IMG model states that each gene only is gained once in the evolutionary history of the pan organism (so replace = FALSE, by default, and gives error if number of columns in the final panmatrix is greater than available genes in ref), although if it is set to TRUE, each copy will follow an independent evolutionary path.
An object of class pangenomeSimulation
, which consist in a
list of 5 elements: (1) The simulated coalescent, as an object of class
phylo (ape package); (2) A list with gene families;
(3) The panmatrix; (4) a list with substitution codon position for each
gene, per branch, since gene birth to sampling time, and (5) a list of
genetic distances, for each gene family. Also a series of attributes are
returned, with calling parameters. A summary method is also provided to
retrieve statistics from the simulation.
Ignacio Ferres
Baumdicker, F., Hess, W. R., & Pfaffelhuber, P. (2010). "The Diversity of a Distributed Genome in Bacterial Populations". The Annals of Applied Probability.
Baumdicker, F., Hess, W. R., & Pfaffelhuber, P. (2012). "The Infinitely Many Genes Model for the Distributed Genome of Bacteria". Genome Biology and Evolution.
Duchene, S. (2016). "Genome-scale rates of evolutionary change in bacteria". Microbial Genomics.
Ashley Robinson, D., Falush, D. Fiel, E. J. (2010). "Bacterial Population Genetics in Infectious Disease." Wiley-Blackwell, Chapter 1: "The Coalescent of Bacterial Populations"; Section 1.5; Schierup, M. H. & Wiuf, C. "From Coalescent Time to Real Time".
Jukes, T. H., Cantor, C. R. (1969). "Evolution of Protein Molecules". New York: Academic Press.
Kimura, M. (1983). "The neutral theory of molecular evolution". Cambridge.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | ## Not run:
library(simurg)
ref_file <- system.file('extdata', 'ref_tutorial.tar.gz', package = 'simurg')
untar(tarfile = ref_file)
ref <- 'ref_tutorial.fasta'
pg <- simpg(ref = ref, norg = 10, ne = 1e10, C = 100, u = 1e-8, v = 1e-11,
mu = 5e-12, write_by = 'genome', dir_out = 'sim_pg',
replace = TRUE)
summary(pg)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.