simpg: Simulate a bacterial pangenome
In iferres/simba: simurg: Simulate Bacterial Pangenomes

Description Usage Arguments Details Value Author(s) References Examples

View source: R/simpg.R

Simulate the evolution of a pangenome using a random ultrametric tree as guide to resemble a coalescent process, and according to both the neutral model (mutations) and the Infinitely Many Genes (IMG) model (gene presence/absence).

simpg(ref = "pan_genome_reference.fa", norg = 10, br, ne = 1e+11,
  C = 100, u = 1e-08, v = 1e-11, mu = 5e-12, write_by = "gene",
  dir_out = "sim_pg", replace = FALSE, smat, force = FALSE,
  verbose = TRUE)

`ref`	A multi-fasta file from where to sample genes. Each fasta header is expected to identify a single gene. Each sampled gene is taken as the most recent common ancestor of each gene family sampled at the end of the simulation (final time); in other words, each gene family will coalesce to each of the original genes sampled from this file.
`norg`	The number of organisms sampled at final time.
`br`	`numeric` vector of length `norg - 1` for setting the coalescent branch lengths. By default is calculated as: `rexp( norg - 1, choose(seq(norg, 2, -1), 2) )` as expected by the coalescent theory (see Wakeley, 2008 "Coalescent Theory: An Introduction". Roberts & Company Publishers).
`ne`	Effective population size. Baumdicker et al. (2012) estimates the effective population sizes of Prochlorococcus and Synechococcus to be around 10e11, which was taken as default. This is also the number of generations from the MRCA to the sampled (final) organisms.
`C`	The coregenome size (default is 100).
`u`	Probability of gene gain per generation. Default is 1e-8, to be in accordance to values presented in Baumdicker et al. (2012). (theta = 2Neu, and taking Ne = 10e11, and estimates of theta ~2000 for Prochlorococcus*).
`v`	Probability of gene loss rate per generation. Default is 1e-11, to be in accordance to values presented in Baumdicker et al. (2012). (rho = 2Nev, and taking Ne = 10e11, and estimates of rho ~2 for Prochlorococcus*).
`mu`	The per site per generation substitution rate. According to Duchene et al. (2016), a typical substitution rate is around 1e-5 and 1e-9 per site, per annum. Taking 1e-9, and assuming 200 generation per annum (Shierup and Wuif, 2010), this gives mu = 5e-12, which was taken as default.
`write_by`	One of `'gene'` of `'genome'`. Whether to write sequences in files grouped by orthology or by genome, respectively.
`dir_out`	The non-existent directory where to put the simulated orthologous groups.
`replace`	`logical`. Whether to replace (see sample) when sampling genes from reference. Default and recommended is FALSE, since IMG model states that each new gene is a different one. Setting `replace = TRUE` will probably cause duplicated genes, although each one will follow a different evolutionary history in the simulation.
`smat`	A codon substitution matrix with probability weights for transition. It must be a `data.frame` with `dim(smat) == c(64, 64)`, where column and row names are the possible combination of codons (i.e. "aaa", "taa", "caa", etc). The values in the `data.frame` correspond to probability weights for obtaining the elements of the vector being sampled. See sample for more information. The default is a transition matrix with 0 probability of transition from a coding codon to any stop codon and to itself, and equal probability weights of transition to any other codon which differ in, at most, one base (point mutations; one site per codon at most is mutated each time). If you want to use a different codon matrix, I recommend to use the default as reference: `simba:::.codon.subst.mat` . See package source (R/subs_mat.R) for a guide on how to generate a similar matrix.
`force`	`logical`. If `dir_out` exists, whether to force directory creation by overwriting the existing one. It will eliminate any existing content.
`verbose`	`logical`. Show (or not) progress messages.

The algorithm first simulates a random ultrametric tree (rcoal) describing the evolutionary history of organisms sampled at the final time. Effective population size (ne) is also interpreted as the number of generations from the MRCA to the sampling time.

A IMG process is simulated using this tree, and parameters C (coregenome size), u (probability of gene gain, per generation), and v (probability of gene loss, per generation). At the end of this process a final panmatrix is obtained, describing the presence or absence of each gene for each organism, at the sampling time. The final gene presence/absence pattern is in accordance with the previously simulated coalescent tree, as the model describes.

As many genes as in the above panmatrix are sampled from the reference file (ref). For each one the MRCA node in the tree is identified, and from this node to the final time the gene is subjected to mutations at rate mu (probability of substitution, per generation). Genes are first replicated as many times it appears at the final time, and mutations are computed in accordance to the simulated tree (each gene will follow an independent path). This implementation mutates codons, rather than single nucleotides, because we wanted to be able to avoid stop codons of being created at the middle of a gene. The probability of substitution from one codon to another is decribed by the smat parameter, which by defaults assigns equal probability of changing one codon to another with a single nucleotide difference between them, 0 probability of changing from one codon to another with more than one difference, and 0 probability of changing from any codon to a stop codon. This behavior can be changed by modifying smat parameter, which by default takes the substitution matrix '.codon.subst.mat' in the namespace of this package. To see a reference of how it was created, the file subs_mat.R in the package source shows details, and can be taken as guide to create a custom one.

Finally, sequences are written to the output directory (dir_out) by gene (group of orthologous sequences), or by genome, in fasta format. Each header contains information about which gene in the reference file was used to generate it. One can set the algorithm to be able to sample genes from the reference file with replace (replace = TRUE), but it is not recommended since the IMG model states that each gene only is gained once in the evolutionary history of the pan organism (so replace = FALSE, by default, and gives error if number of columns in the final panmatrix is greater than available genes in ref), although if it is set to TRUE, each copy will follow an independent evolutionary path.

An object of class pangenomeSimulation, which consist in a list of 5 elements: (1) The simulated coalescent, as an object of class phylo (ape package); (2) A list with gene families; (3) The panmatrix; (4) a list with substitution codon position for each gene, per branch, since gene birth to sampling time, and (5) a list of genetic distances, for each gene family. Also a series of attributes are returned, with calling parameters. A summary method is also provided to retrieve statistics from the simulation.

Ignacio Ferres

Baumdicker, F., Hess, W. R., & Pfaffelhuber, P. (2010). "The Diversity of a Distributed Genome in Bacterial Populations". The Annals of Applied Probability.

Baumdicker, F., Hess, W. R., & Pfaffelhuber, P. (2012). "The Infinitely Many Genes Model for the Distributed Genome of Bacteria". Genome Biology and Evolution.

Duchene, S. (2016). "Genome-scale rates of evolutionary change in bacteria". Microbial Genomics.

Ashley Robinson, D., Falush, D. Fiel, E. J. (2010). "Bacterial Population Genetics in Infectious Disease." Wiley-Blackwell, Chapter 1: "The Coalescent of Bacterial Populations"; Section 1.5; Schierup, M. H. & Wiuf, C. "From Coalescent Time to Real Time".

Jukes, T. H., Cantor, C. R. (1969). "Evolution of Protein Molecules". New York: Academic Press.

Kimura, M. (1983). "The neutral theory of molecular evolution". Cambridge.

## Not run: 
library(simurg)

ref_file <- system.file('extdata', 'ref_tutorial.tar.gz', package = 'simurg')
untar(tarfile = ref_file)
ref <- 'ref_tutorial.fasta'

pg <- simpg(ref = ref, norg = 10, ne = 1e10, C = 100, u = 1e-8, v = 1e-11,
            mu = 5e-12, write_by = 'genome', dir_out = 'sim_pg',
            replace = TRUE)

summary(pg)

## End(Not run)