knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 7, fig.width = 7, fig.retina = 1, dpi = 72 )
This package provides a function to simulate a bacterial pangenome in accordance to the neutral model of mutation, as well as the infinitely many genes model for accessory gene gain and loss proceses. It first generates a random ultrametric tree with as many tips as user defines. Then a process of gene gain and loss is simulated mimicking the Infinitely Many Genes (IMG) model, and obtaining at the end of this step a panmatrix describing the gene presence/absence frequencies at the final time (sampling time). As many genes as columns in the final panmatrix are sampled from a reference multi-fasta file, and each one becomes the most recent common ancestor of each gene family at final time. These gene families will have the presence/absence frequency described in the panmatrix, and each individual gene will follow an evolutionary (mutation) path from the time of gene birth, to the final time, at a given constant rate of substitution and a transition matrix. More details of the algorithm are described in the documentation.
First load the package to the R session:
# Load package.. library(simurg)
simurg
provides a single function , simpg()
, and some methods associated to the output class. To access the documentation, run:
?simpg # or help('simpg')
This function needs a reference multi-fasta file from where to sample the genes that would be subjected to mutation, and this package provides one for tutorial purposes only since, depending on running parameters, the user would probably need a bigger one. A well suited reference file would be roary's pan_genome_referebce.fa
output file. For this tutorial we will use the attached reference file, which needs to be first decompressed:
wd <- tempdir() # Set the directory where to decompress to /tmp/ tgz <- system.file('extdata', 'ref_tutorial.tar.gz', package = 'simurg') untar(tarfile = tgz, exdir = wd) ref <- list.files(path = wd, pattern = 'ref_tutorial[.]fasta$', full.names = TRUE) ref
Now we are ready to run a simulation: ````r set.seed(123) pg <- simpg(ref = ref, # Reference multi-fasta file norg = 20, # Number of organisms to sample ne = 1e11, # Number of generations from MRCA C = 100, # Core genome size u = 1e-8, # Probability of gene gain, per generation v = 1e-11, # Probability of gene loss, per generation mu = 5e-12, # Probability of substitution, per site, per generation write_by = 'genome', # Write simulated sequences to files by genome rather by gene family dir_out = paste(wd, 'dir_out', sep = '/'), # Directory where to write sequences replace = TRUE, # Permit gene sampling with replacement force = TRUE, # Force re writing of output directory if exists verbose = TRUE)
Now lets inspect the output: ```r pg class(pg) mode(pg) length(pg) sapply(pg, class)
As you see, the class pangenomeSimulation
consists in a list
of 5 elements.
The first one is the simulated ultrametric tree, of class phylo
(ape package). You can, for example, plot it:
plot(pg$coalescent)
The second one is a gene list, showing each gene family members.
head(pg$gene_list, n = 3)
The third one is the panmatrix. The first columns are core genes, accesory genes are next to them.
# First five core genes: pg$panmatrix[, 1:5] # Some accesory genes, for this example: pg$panmatrix[, 100:105]
Next field shows each substitution (codon position) for each gene, per branch. This list is an anidated list that shows, for each gene, all the substitutions that took place on each branch. The names of the branches indicates the segments formed by 2 subsequet nodes in the tree (pg$coalescent
); i.e. "11-14" is the segment formed between nodes 11 and 14.
str(pg$substitutions, list.len = 3)
To see the nodes in the plotted coalescent tree:
plot(pg$coalescent) nodelabels()
The last field is a list with genetic distances between all pairs of sequences on each gene family.
length(pg$distances) pg$distances[['gene1']]
Finally, sequences are witten at dir_out
:
list.files(path = paste(wd, 'dir_out', sep = '/'), full.names = TRUE)
The write_by
parameter allows to write sequences by gene family rather by genome.
A summary
method for class pangenoneSimulation
is also provided:
spg <- summary(pg) # Coalescence times (log transform) brln <- spg$Coal_times plot(log(brln), xlab = 'Expected coalescence times (log)', ylab = 'Simulated coalescence times (log)') #Expect some dispersion at the begining, but correlation towards the end.. abline(0, 1) # Gene family frequency freqs <- spg$Gene_family_frequency barplot(freqs) # Evolutionary distance (expected vs observed) dis <- spg$Evo_dist head(dis) plot(x = dis$norm_cophenetic, y = dis$mean_gene_dist, xlim = c(0, 1), xlab = 'Normalized cophenetic distances', ylim = c(0, 1), ylab = 'Mean genetic distances') #Expect saturation towards the end.. abline(0, 1)
unlink(paste(wd, 'dir_out', sep = '/'), recursive = TRUE)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.