```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE
)
```
yasss is designed to simulate genealogies, which are stored in data.frames. A high level summary of the core simulation functions is provided below:

- `sim_pop`
- `sim_next_gen`
- `mutator`: mutate a parent sequence into an offspring sequence
- `assign_fitness`: for the most recent generation only
- `get_fit_offspring`
It quickly became clear that most simulations will involve many datasets generated with distinct sets of arguments. Functionality to track this extra structure was added to yasss in a function called `sim_proc_many_pops`, which calls `sim_pop` in a for loop and packages up and processes all the results from those calls.
NOTE: Issue to resolve: Since a sequence recombines with a partner, each sequence has two chances of being involved in recombination: once when the recombinants are chosen and once when the partners are chosen. How should we modify the parameters because of this? Should we modify them at all? CORRECTION: It is a double chance of being involved in a recombination event, not a double chance of becoming a recombinant. However, we do not allow recombinants to recombine with other recombinants that were formed in the same generation, so there is a little weirdness in the selection of the recombination partner.
NOTE: The `recomb_*` columns in the genealogy data structure do not inherit the values of the parent sequences. This makes sense because they can easily be recomputed later using the `parent_id`s. However, it should be communicated clearly to the user.
Recombination is added into `sim_pop` before the fitness computation step. It must be performed before fitness is computed, since recombination alters the sequence and fitness depends on the sequence. The sequences generated before the recombination step should therefore be seen as placeholder sequences that are only finalized after they have been exposed to recombination.

The problem with recombination is always the same: there are too many parameters, and below certain thresholds recombination is simply inconsequential. Getting the various parameters and their interactions just right is the key.
The parameters of interest:
Why do you need all these parameters? Because you need to describe the levels of recombination accurately. Consider a statement like: 25% of sequences are recombinants. If all the sequences are identical, then this means nothing, and you can seriously question whether it is even possible to know that the sequences actually are recombinants. If 50% of the sequences are duplicates, does the 25% rate also apply to recombination between duplicates? How about recombination between sequences that differ in only one position? In that case the recombinant will be identical to one of the input sequences, so the effect of the recombination is just to remove a unique sequence and add a duplicate.
Note that there is another super key detail that is not even captured by any of the above definitions:

- Is a sequence a recombinant if it is the result of recombination between identical sequences?
- Is a sequence a recombinant if the recombinant is identical to one of the input sequences?
- Is a sequence a recombinant if the recombination event occurred 5 cycles ago? What about 10 cycles ago?
- If one sequence recombined in the previous cycle and another sequence recombined 10 cycles ago, do we differentiate between them?
### How is this relevant to yasss?
When simulating recombination, the effect of the recombination should be reported in a sensible way. Attached to a simulated dataset should be some number indicating the amount of recombination present in the dataset. Given what was discussed above, it is really hard to decide which metric should be used.

FOR NOW I am only comfortable reporting the per nucleotide per cycle recombination rate and the number of recombination events per cycle. (Cycles coincide with generations, so cycle 0, which doesn't happen, is linked with the ancestral sequences, and cycle 1 produces generation 1.)
Note that there are also some implications for how to implement recombination in the simulation process. For computational purposes we will treat recombination as if it replaces one of the two sequences that recombined and does not affect the other sequence. Recombination is applied after producing the next generation and before assigning fitness scores.

What metrics should be produced during this process? Only fill out the columns provided by the genealogy data structure:
- `recomb_pos`: The position of the breakpoint.
- `recomb_replaced`: Which side of the sequence was replaced: to the left of the breakpoint, or to the right of the breakpoint?
- `recomb_partner`: The id of the sequence that the recombination was with.
- `recomb_muts`: The number of mutations introduced into the sequence as a result of the recombination. Specifically, this is how many mutations would have been required to modify the sequence that was removed to make it identical to the recombinant. It is appropriate for `recomb_muts` to track the number of mutations relative to the sequence that was removed, since the same quantity can be computed for the recombination partner at any time. The usefulness of this column lies in reporting the size of the effect of recombination.
Two functions are needed to implement this:

- `recombine_gen`: Given a generation and some parameters, add recombination to some sequences based on random draws.
- `recombine_seqs`: Given two sequences, recombine them.

The `recombine_gen` function is responsible for selecting, via random draws, which sequences recombine and with which partners, and for calling `recombine_seqs` on each selected pair. It is implemented using vectors on which frequent updates are performed. This yields massive performance gains over updating a data.frame directly.

The `recombine_seqs` function is responsible for performing the actual recombination and computing the `recomb_muts` value. All the recombination functionality lives in the recombination.R script.
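To make the mechanics concrete, here is a minimal sketch of what `recombine_seqs` might look like. The function name and the column semantics come from the description above, but the signature and internals are assumptions:

```r
# Sketch of recombine_seqs (assumed signature). Given two sequences and a
# breakpoint, replace one side of seq1 with the corresponding side of seq2
# and count the mutations this introduced relative to the removed sequence.
recombine_seqs <- function(seq1, seq2, recomb_pos, replace_left = TRUE) {
  stopifnot(nchar(seq1) == nchar(seq2))
  c1 <- strsplit(seq1, '')[[1]]
  c2 <- strsplit(seq2, '')[[1]]
  recombinant <- c1
  if (replace_left) {
    recombinant[1:recomb_pos] <- c2[1:recomb_pos]
  } else {
    recombinant[(recomb_pos + 1):length(c1)] <- c2[(recomb_pos + 1):length(c2)]
  }
  list(
    recombinant     = paste(recombinant, collapse = ''),
    recomb_pos      = recomb_pos,
    recomb_replaced = if (replace_left) 'left' else 'right',
    # mutations relative to the sequence that was removed (seq1):
    recomb_muts     = sum(recombinant != c1)
  )
}
```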
A popular way to interpret recombination rates is via linkage disequilibrium. Linkage disequilibrium measures how much the co-occurrence of SNPs deviates from random, and it changes with the distance between the SNPs.

The metric used to measure the linkage disequilibrium between two positions is:
$$D = \frac{|p_{12} - p_{1}p_{2}|}{D_{max}}$$

where

$$D_{max} = \min(p_{1}p_{2}, (1-p_{1})(1-p_{2})) \quad \text{if} \quad p_{12} < p_{1}p_{2}$$

and

$$D_{max} = \min(p_{1}(1-p_{2}), (1-p_{1})p_{2}) \quad \text{if} \quad p_{12} > p_{1}p_{2}$$
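For reference, a direct transcription of these formulas into a small R helper (the function name is ours, not part of the package):

```r
# Normalized linkage disequilibrium between two positions.
# p1, p2: marginal frequencies of the majority variants;
# p12: their joint frequency across sequences.
d_prime <- function(p1, p2, p12) {
  D <- p12 - p1 * p2
  d_max <- if (D < 0) {
    min(p1 * p2, (1 - p1) * (1 - p2))
  } else {
    min(p1 * (1 - p2), (1 - p1) * p2)
  }
  abs(D) / d_max
}
```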
### Computation of linkage disequilibrium

Computing the joint occurrence probability $p_{12}$ is non-trivial. To ensure that minor mutations do not skew the result, only consider positions where the major variant is prevalent in between 20% and 80% of the sequences. Each sequence must be considered individually for each pair of positions.

The co-occurrence data is summarized in a 2-d array: element $(i, j)$ is the number of sequences that have the majority variant at both position $i$ and position $j$. The structure storing the co-occurrence counts is called `jot`, short for joint occurrence tracker.
### Algorithm to compute linkage disequilibrium

```r
# Runnable version of the original pseudocode; assumes Biostrings and that
# all_sequences is an equal-width DNAStringSet.
library(Biostrings)
cm <- consensusMatrix(all_sequences)[c('A', 'C', 'G', 'T'), ]
seq_length <- ncol(cm)
n_seq <- length(all_sequences)
max_freq <- apply(cm, 2, max) / n_seq   # frequency of the majority variant
max_nuc <- rep('X', seq_length)
for (i in 1:seq_length) {
  max_indx <- min(which(cm[, i] == max(cm[, i])))
  max_nuc[i] <- c('A', 'C', 'G', 'T')[max_indx]
}
# Fill the joint occurrence tracker (upper triangle only).
seq_mat <- as.matrix(all_sequences)
jot <- matrix(0, seq_length, seq_length)
for (L1 in 1:(seq_length - 1)) {
  for (L2 in (L1 + 1):seq_length) {
    jot[L1, L2] <- sum(seq_mat[, L1] == max_nuc[L1] & seq_mat[, L2] == max_nuc[L2])
  }
}
stopifnot(all(jot[lower.tri(jot)] == 0))  # lower triangle must be all zero
# Collect unnormalized D values grouped by the distance between positions.
linkages <- vector('list', seq_length - 1)
for (L1 in 1:(seq_length - 1)) {
  for (L2 in (L1 + 1):seq_length) {
    link_dist <- L2 - L1
    D <- jot[L1, L2] / n_seq - max_freq[L1] * max_freq[L2]
    linkages[[link_dist]] <- c(linkages[[link_dist]], D)
  }
}
```
### Tests for linkage diseq. calculation

Create 10 sequences of length 10, with a base of all A. These should result in D scores of:

Then swap them around so that position 9 becomes position 2, and check that the same results are obtained.
NOTE: This section must be updated.
To simulate the quasispecies, a function is needed that takes a parent sequence as input and produces a child sequence, possibly introducing some mutations into the sequence.

The constructs yasss uses to produce single children from a parent are called mutators, introduced into yasss as an S3 class. A mutator is simply a list with three elements: fun, args and arg_checker.

- The `fun` element is a function with the argument `the_seq` and an arbitrary number of other arguments.
- The `args` element is a list with the actual arguments that `fun` will use.
- The `arg_checker` element performs basic sanity checks on the arguments. It is separated from the function itself so that error checking and debugging can be simplified by checking the construction of the mutator at the start of the program, without generating a very deep trace. (Since the calls happen through mechanisms like `do.call(mutator$fun, args)`, the trace would also be hard to read.)
A constructor lives in mutator_constructor. Each mutator has its own script and its own test file. Presently no mutator is allowed to produce indels; hopefully this restriction can be lifted in the future.

At the moment six possible mutators are envisioned:

1. uniform_mutator
2. pp_uniform_mutator
3. nucleotide_based_mutator
4. pp_nucleotide_based_mutator
5. codon_based_mutator
6. pp_codon_based_mutator

pp stands for per position and allows the user to specify variable mutation rates for different positions.
Obviously each mutator is just a special case of the mutator that comes after it. As such, the ultimate goal should be to write a single general function that produces each of the special cases. However, that requires a lot more engineering than just getting a simple one up and running right now. So I will start with the uniform mutator, implement most of the package with only that mutator, and then come back to produce the special mutators later. Also, there might be performance issues with the most general versions that will not be an issue at all for the simpler ones. In theory the performance mismatches can be completely neutralized by moving this operation to C++, since the main performance overhead will be the lookups.
The output of a mutator is a list with the elements:

1. parent: the input sequence.
2. child: the produced offspring.
3. mutation_statistics: metrics about the mutations that occurred. Initially it will be a list with a single element: n_mut.
4. mutation_rates: the possibly updated mutation rate array that was used to compute the mutation probabilities. This is included to future proof the code so that the addition of indel functionality will not require major changes.
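As a concrete illustration, here is a minimal hypothetical sketch of a uniform mutator with this structure and output format. The real constructor lives in mutator_constructor; the internals below are assumptions:

```r
# A hand-built uniform mutator: list(fun, args, arg_checker) with class 'mutator'.
uniform_mutator <- structure(
  list(
    fun = function(the_seq, mu) {
      seq_chars <- strsplit(the_seq, '')[[1]]
      hits <- runif(length(seq_chars)) < mu        # positions that mutate
      seq_chars[hits] <- vapply(seq_chars[hits], function(nuc) {
        sample(setdiff(c('A', 'C', 'G', 'T'), nuc), 1)   # no indels allowed
      }, character(1))
      list(parent = the_seq,
           child = paste(seq_chars, collapse = ''),
           mutation_statistics = list(n_mut = sum(hits)),
           mutation_rates = matrix(mu / 3, nrow = 4, ncol = 4))  # placeholder
    },
    args = list(mu = 1/250),
    arg_checker = function(args) {
      stopifnot(is.numeric(args$mu), args$mu >= 0, args$mu <= 1)
    }
  ),
  class = 'mutator'
)

# Invocation mirrors the do.call mechanism described above:
result <- do.call(uniform_mutator$fun,
                  c(list(the_seq = 'ACGTACGTACGT'), uniform_mutator$args))
result$mutation_statistics$n_mut
```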
Overall process for developing the mutators:

- Develop the uniform mutator together with its test file.
- There will be a whole slew of tests that are not specific to the uniform mutator. Split them out into a general mutator tests script. Use a for loop over all the different types of mutators to apply these general tests to each mutator.
- After the uniform mutator is completed, write the mutator_constructor.
The genealogy data.frames rapidly become too large to store in memory. Thus there are two different approaches to managing the data from multiple simulations. The first is to store the full genealogies, which is useful for smaller simulations and for troubleshooting. The second approach reduces memory consumption by summarizing each genealogy as it is simulated and only saving the summary metrics.

The data structure that saves only the summary metrics is called a `dsum`, and the summary metrics include the percentiles of the pairwise distances, their average, and an estimate produced by `density()` (see the `dsum` description below).
Another structure that exists in the data is a pair (or set) generated by applying different fitness cutoffs (criteria) to a genealogy. While the computation of fitness scores is part of the `sim_pop` function, these scores can be updated at a later stage using the `assign_fitness` function with the `last_generation_only` argument set to FALSE. However, the current attempt at managing the data will not handle the case where different `fitness_evaluator`s are used on a single genealogy. Instead, we will assume that the fitness scores are only computed once, during the call to `sim_pop`. Future versions may lift this restriction on the data management component, which will require a complex mechanism for managing the argument sets, since the arguments that pertain specifically to `sim_pop` will have to be separated from the arguments that are used only in the `fitness_evaluator`.
A genealogy is typically used in one of three states:

- the `fitness_scores` are ignored,
- `get_fit_offspring` is used to obtain a genealogy based only on fit individuals whose ancestors were also fit, or
- a random sample is drawn from the genealogy while ignoring the fitness scores.

In general, we will further group these three states: either we completely ignore the `fitness_scores` and just use the full genealogy, or we construct a `fit_unfit_pair` in which `get_fit_offspring` is used to generate a fitness-restricted genealogy and then a random sample is drawn ignoring fitness scores to generate a matched (based on number of individuals) pair.
This complexity is tracked in the following data structures:

- `fit_unfit_pair`: Two genealogies that were simulated with a single call to `sim_pop`. In one, the fitness scores were used to restrict the individuals, while in the other a number of individuals were randomly sampled to ensure that the matched pair has the same number of individuals in the final generation.
- `arg_set`: A list that contains the arguments for `sim_pop` AND a required fitness AND a label. Thus an `arg_set` contains enough information to produce a fit/unfit pair. The label will be used later to associate the simulated data with the `arg_set` that was used to produce it. Note: In the future multiple fitness evaluators may be applied to the same genealogy, which will require a more complex `arg_set` concept.
- `arg_collection`: A list of `arg_set`s.
- `gcol`: A collection of genealogies. This is the main memory consuming culprit. It is just a set of genealogy data.frames rbinded together, with extra columns added that track the `genealogy_id`, the `arg_set` that was used to produce it, and the fitness requirement that had to be satisfied.
- `dsum`: Summary of the distance matrix (percentiles, avg, density). This is a list with the three elements that summarize a dmat, and optionally the identifiers used to track which simulation (and fitness selection) the dmat belonged to.
- `dcollection`: A collection of `dsum`s. Each element is a single `dsum`.

### arg_set and arg_collection
An example of the current version of an `arg_set` is shown below. Note that this will probably be redesigned in a future version to allow multiple `fitness_evaluator`s to be applied to a single genealogy. Multiple `arg_set`s are collected into a long list called the `arg_collection`. Each element in an `arg_collection` is a single `arg_set`. The elements in an `arg_collection` should not be named, as the label that is part of the `arg_set` must be used as the identifier.
```r
arg_set <- list(
  label = 'A',
  ancestors = paste(rep("A", 500), collapse = ''),
  r0 = 2,
  n_gen = n_gen,
  n_pop = Inf,
  mutator = list(fun = "mutator_uniform_fun",
                 args = list(mu = 1/250)),
  fitness_evaluator = list(
    fun = "fitness_evaluator_homology_fun",
    args = list(comparators = paste(rep('XXXXC', 100), collapse = ''),
                h2fs = "h2fs_univariate_linear_fun")),
  required_fitness = 0.02
)
```
### gcol

This is just a `data.frame` built by `rbind`ing many different genealogies together and adding some columns to track the arguments that were used to call `sim_pop` and that were used to remove individuals from the population.
The extra columns that must be added:

- `sim_id`: The id of the simulation. If many simulations are run in a loop using the same `arg_set`, this column is the primary mechanism that distinguishes the simulations from each other.
- `label`: The label is used to match the simulation to the arguments that were used to call `sim_pop` or sample from the genealogy.
- `sampling`: The sampling approach applied to the original genealogy that was produced by `sim_pop`. Valid values include: `fit_threshold`, `size_matched_sampling`, `none`.
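For illustration, a hypothetical sketch of how a `gcol` could be assembled (the genealogy objects and the values are made up):

```r
# Tag each genealogy data.frame with its identifying columns, then stack them.
genealogy_a$sim_id   <- 1L
genealogy_a$label    <- 'A'
genealogy_a$sampling <- 'fit_threshold'

genealogy_b$sim_id   <- 1L
genealogy_b$label    <- 'A'
genealogy_b$sampling <- 'size_matched_sampling'

gcol <- rbind(genealogy_a, genealogy_b)
```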
### dsum and dcollection

Due to the memory requirements of storing full genealogies (or distance matrices), it is often preferable to immediately compute and summarize a distance matrix on the last generation of a genealogy and then delete both the genealogy and the distance matrix from memory. A `dcollection` is a list in which each element summarizes a single distance matrix and contains additional elements to help identify it. An element of a `dcollection` is referred to as a `dsum`. Outside of a `dcollection`, the `dsum` does not necessarily have the identification elements. These additional identification elements are the same as the extra columns added to the `gcol` data structure:
- `sim_id`: The id of the simulation. If many simulations are run in a loop using the same `arg_set`, this column is the primary mechanism that distinguishes the simulations from each other.
- `label`: The label is used to match the simulation to the arguments that were used to call `sim_pop` or sample from the genealogy.
- `sampling`: The sampling approach applied to the original genealogy that was produced by `sim_pop`. Valid values include: `fit_threshold`, `size_matched_sampling`, `none`.
Each `dsum` must have three core elements that contain the summary metrics:

- `perc`: the percentiles of the distribution,
- `avg_hd`: the average of all the pairwise distances,
- `dens`: an estimate of the distribution as produced by `density()`.

In addition, the `clara2` element holds the key summary metrics of the call to clara:

- `avg_within_cluster`: The average of all the pairwise distances between all the elements in the same cluster.
- `avg_between_cluster`: The average of all the pairwise distances between all the elements not in the same cluster.
- `cluster_sizes`: The number of sequences in each cluster.

The root level elements of the `dcollection` list should not be named, since the identifier elements that match the extra columns of a `gcol` should be used to identify the elements.
To make analysis easier, specifically plotting with `ggplot`, a function that converts a `dcollection` into a `data.frame` is also provided. The function `dcollection_to_df` produces a `dcollection_df`, which is a list with three elements, each a data.frame containing all the data of a specific kind from all the datasets summarized in the `dcollection`. The three data.frames are:

- `dmat_metrics`: High level summary metrics of the distribution of the pairwise distances. Uses `metric` and `value` columns to record many different metric types in the same data.frame.
- `dmat_distribution_df`: The `x` and `y` output of the `density` function, so that a density function can be plotted with `geom_smooth`. Uses columns named x and y.
- `dmat_clara2_df`: The summary statistics of the clustering of the distance matrix into 2 sub-clusters using the `clara` function. Uses `metric` and `value` columns to record many different metric types in the same `data.frame`.

Each of these data.frames uses the identifying columns:

- `sim_id`
- `label`
- `sampling`
- `group_label`: Concatenation of the `label` and `sampling` columns.
- `uniq_id`: Concatenation of the `group_label` and `sim_id` columns.

Since this package is not implemented in an object oriented framework, the data structures are not explicitly defined in code. The main focus of the implementation is functions, so the data structures described here need to guide the development of specific functions that operate on and produce the described datasets.
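A hypothetical usage sketch, assuming `dcollection_to_df` takes just the `dcollection` (here `geom_line` simply draws the precomputed density curves):

```r
library(ggplot2)

dcdf <- dcollection_to_df(dcollection)

# One density curve per simulated dataset, grouped by the uniq_id column.
ggplot(dcdf$dmat_distribution_df, aes(x = x, y = y, group = uniq_id)) +
  geom_line(aes(colour = group_label))
```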
### check_arg_set

This function validates an `arg_set` (a single element of an `arg_collection`). Its role is similar to that of the validator of a constructor. Given an `arg_set`, it runs through a series of checks and produces a list in which each name describes the check that was performed and each entry is a boolean indicating pass (TRUE) or failure (FALSE).
### check_dsum

Checks that a `dsum` conforms to the required formats. The function has a flag that toggles whether the identifiers are included in the checks.

- If identifiers are set to off, then only check `avg_hd`, `perc` and `dens`.
- If identifiers are set to on, also check the presence, format and length of `sim_id`, `label`, and `sampling`.
- Optionally also pass in the `arg_collection` to check the label.
### check_dcollection

Checks that a `dcollection` is a list, that each element in the `dcollection` is a `dsum` with identifiers (optionally checking that the label exists in an `arg_collection`), and that the elements are unnamed.
### summarize_dmat

Summarizes a distance matrix into an average pairwise distance (`avg_hd`), a set of percentiles (`perc`) and a density estimate (`dens`). No identifiers are added; that is not part of this function.
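A minimal sketch of what `summarize_dmat` could look like (the percentile grid is an assumption):

```r
# Summarize the lower triangle of a symmetric pairwise distance matrix.
summarize_dmat <- function(dmat) {
  pairwise <- dmat[lower.tri(dmat)]
  list(
    avg_hd = mean(pairwise),
    perc   = quantile(pairwise, probs = seq(0, 1, by = 0.05)),
    dens   = density(pairwise)
  )
}
```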
### Complications related to the bandwidth

The `density` function estimates the height of the density function of the distribution by considering a number of points in a certain region. (It is more complex than that, but this idea is good enough to explain what is wrong.) Since mutations are such rare events, if we simulate short reads, the number of mutations that occur can be low. Thus, the set of possible pairwise distances that can occur can be limited to, say, 0, 1, 2, ..., 10. This means that the 'density function' has only 11 possible values that it can assume, so it really is a probability mass function and not a density function.

The estimation performed by the `density` function in R is designed to capture complex functions which might have high levels of squiggliness. This means that a probability mass function as described above is modelled as a multimodal distribution with high density at the mentioned points and much lower density elsewhere. This is not what we want from an estimate of the density function of the pairwise distances between the sequences.

To fix this, one of two things can be done:

1. Detect when the distribution should be considered discrete, and then estimate it as a probability mass function.
2. Auto-tune the bandwidth argument of the `density` function so that it always covers at least one real point with each kernel.
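A hypothetical sketch of option 2: force the bandwidth to be at least as wide as the largest gap between adjacent observed distances, so that every kernel overlaps a neighbouring point (the function name and the exact rule are assumptions):

```r
# Widen the bandwidth when the observed distances form a sparse grid.
auto_bw_density <- function(pairwise_distances) {
  pts <- sort(unique(pairwise_distances))
  min_bw <- if (length(pts) > 1) max(diff(pts)) else 1
  default_bw <- density(pairwise_distances)$bw
  density(pairwise_distances, bw = max(default_bw, min_bw))
}
```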
### Clustering the distance matrix

It is sometimes of interest to see if there are clusters in the resulting simulations; specifically, to see if the consideration of the fitness scores splits the population into sub-clusters.

There are multiple ways to approach this question. The simplest is to just divide the population into two clusters and to compare the within and between distances to each other. Within yasss this is referred to as the `clara2` approach.
### clara2

Use the `clara` function to split the population into two clusters based on the distance matrix. Using the clustering vector produced by `clara`, compute all the within-cluster distances and average them to obtain a single average within-cluster distance metric. Likewise, compute the average of the between-cluster distances.
These metrics are stored inside the `dsum` result in a list called `clara2`. The `clara2` sub-list of the `dsum` list contains the elements:

- `avg_within_cluster`: A single numeric value that is the average of all the within-cluster distances.
- `avg_between_cluster`: A single numeric value that is the average of all the between-cluster distances.
- `cluster_sizes`: An integer vector of length two that lists the sizes of the clusters.
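A rough sketch of this computation, assuming the cluster package. `seq_features` is a hypothetical numeric representation of the sequences (since `clara` clusters a data matrix) and `dmat` is the pairwise distance matrix:

```r
library(cluster)

cl <- clara(seq_features, k = 2)                  # split into two clusters
same <- outer(cl$clustering, cl$clustering, '==')  # TRUE if same cluster

clara2 <- list(
  avg_within_cluster  = mean(dmat[same  & lower.tri(dmat)]),
  avg_between_cluster = mean(dmat[!same & lower.tri(dmat)]),
  cluster_sizes       = as.integer(table(cl$clustering))
)
```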
### sim_proc_many_pops

A large wrapper for `sim_pop`, `get_fit_offspring` and `summarize_dmat`. It has many arguments that control how much of the intermediary results are stored.
Primary arguments:

- `arg_collection`: All the `arg_set`s that will be used to simulate genealogies with `sim_pop` and `get_fit_offspring`.
- `n_pop`: The number of times `sim_pop` will be called for each `arg_set`. Note that the number of datasets generated will depend on the way the fitness scores are used to sample from the genealogies.
- `fitness_processing`: The way that the fitness scores should be used to sample from the genealogy. Valid options include:
  - `none`: This will produce one dataset per genealogy.
  - `fit_unfit_pair`: This will use the threshold approach to remove all unfit individuals and their offspring. The remaining members of the last generation will be considered the fit member of the pair. An equal number of individuals will be sampled at random from the original genealogy to produce the unfit member, such that it is the same size as the fit member.
  - `fit_unfit_unmatched_pair`: Same as `fit_unfit_pair` except that the unfit member will not be down-sampled to size-match the fit member.
- `output_genealogy`: Should the genealogy data sets be deleted to reduce memory usage? Valid options: `last_gen_only`, `none`, `full`.
- `output_dmat`: Should the distance matrix be included in the output?

Algorithm:
```
loop over arg_collection
  loop over n_pop
    call sim_pop
    if fitness_processing == none:
      add the genealogy to new_genealogies
    if fitness_processing == fit_unfit_pair:
      call get_fit_offspring with fitness_threshold -> add to new_genealogies
      draw a size matched sample from the full genealogy
        -> add to new_genealogies (full genealogy)
    if fitness_processing == fit_unfit_unmatched_pair:
      call get_fit_offspring with fitness_threshold -> add to new_genealogies
      just take the full genealogy
        -> add to new_genealogies (full genealogy)
    if output_genealogy == last_gen_only
      reduce new_genealogies to last_gen
      append last_gen to output_genealogies
    if output_genealogy == full
      append new_genealogies to output_genealogies
    loop over new_genealogies
      compute the dmat on last_gen
      call summarize_dmat
      add the new dsum into the dcollection
      if output_dmat
        add the dmat to all_dmats
```
The return value from `sim_proc_many_pops` is a list with the following elements:

- `dcollection`
- `all_dmats`: [OPTIONAL] list of lists containing the dmats and the identifiers.
- `all_genealogies`: [OPTIONAL] `data.frame` that contains all the genealogies, or, in the case that `last_gen_only` was specified, only the last generations of the genealogies, plus all the identifier columns.
- `arg_collection`: The input `arg_collection`.
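A hypothetical invocation, using only the arguments listed above (the defaults and exact signature are assumptions):

```r
results <- sim_proc_many_pops(
  arg_collection     = arg_collection,
  n_pop              = 10,
  fitness_processing = 'fit_unfit_pair',
  output_genealogy   = 'none',
  output_dmat        = FALSE
)
str(results$dcollection, max.level = 1)
```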
All this data management functionality is built to make it easier to simulate batches of data to illustrate key concepts. There are three primary types of comparison that are of interest (right now):
- An `arg_collection` with just one `arg_set`, with `fitness_processing` set to `none`.
- An `arg_collection` with just one `arg_set`, with `fitness_processing` set to either `fit_unfit_pair` or `fit_unfit_unmatched_pair`.
- An `arg_collection` with multiple `arg_set`s where the base simulation parameters are not modified, but different `fitness_evaluator`s and/or different `fitness_requirements` are specified.

In almost all cases, only the summarized distance matrix will be produced. As far as I can tell, PFitter only requires the average pairwise Hamming distance as input, so do we even need to hack in PFitter?
Simulate the growth of the quasispecies

- Inputs
  - Ancestors
  - Number of copies of each ancestor
  - (Vector) of generation sizes
  - Required population size or the number of generations
  - (Vector over generations) of recombination profiles
  - (Vector) of per codon mutation matrices (specified on the nucleotide level, 64x64)
  - (Vector) of per codon indel generation

Simulate the sampling achieved by reverse transcription

- Inputs
  - Primers
  - Primer binding affinity function
  - Number of primers to generate

Simulate PCR

- Inputs
  - Number of cycles
  - Per cycle efficiency
  - Max number of sequences to generate
  - (Vector) per nucleotide mutation matrix (4x4)
  - PCR recombination profile
  - PCR recombination boost parameter (increase recombination rate each cycle)

Simulate the sequencing

- Inputs
  - Number of reads sequenced
  - Length of reads
  - (Vector) per nucleotide error matrix (4x4)