knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE
)

Overall Design

yasss is designed to simulate genealogies, which are stored in data.frames. A high-level summary of the core simulation functions is provided below:

It quickly became clear that most simulations will involve many datasets generated with distinct sets of arguments. Functionality to track this extra structure was added to yasss in a function called sim_proc_many_pops, which calls sim_pop in a for loop and then packages up and processes all the results from the calls to sim_pop.

sim_pop

sim_next_gen

assign_fitness

get_fit_offspring

recombination
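The following sketch shows roughly how these functions compose (illustrative pseudocode: make_genealogy and the argument names are placeholders, not the actual yasss signatures):

sim_pop_sketch <- function(ancestors, n_gen, mutator, fitness_evaluator) {
  genealogy <- make_genealogy(ancestors)   # generation 0 (placeholder helper)
  for (gen in seq_len(n_gen)) {
    # produce the offspring with mutations, then score them
    genealogy <- sim_next_gen(genealogy, mutator)
    genealogy <- assign_fitness(genealogy, fitness_evaluator)
  }
  genealogy
}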

Overview of recombination

NOTE: Issue to resolve: since a sequence recombines with a partner, each sequence has two chances of being a recombinant: once when the recombinants are chosen and once when the partners are chosen. How should we modify the parameters because of this? Should we modify the parameters at all? CORRECTION: It is a double chance to be involved in a recombination event, not a double chance to become a recombinant. However, we do not allow recombinants to recombine with other recombinants that were formed in the same generation, so there is a little weirdness going on in the selection of the recombination partner.

NOTE: The recomb_x columns in the genealogy data structure do not inherit the values of the parent sequences. This makes sense because they can easily be recomputed later using the parent_ids. However, it should be communicated clearly to the user.

Recombination is added into sim_pop before the fitness computation step. It must happen there, since recombination alters the sequence and fitness depends on the sequence. The sequences generated before the recombination step should thus be seen as placeholder sequences that are only finalized after they have been exposed to recombination.

The problem with recombination is always the same: there are too many parameters, and below certain thresholds the recombination is simply inconsequential. Getting the various parameters and their interactions just right is the key.

The parameters of interest:

Why do you need all these parameters? Because you need to describe the levels of recombination accurately. Consider a statement like: 25% of sequences are recombinants. If all the sequences are identical, then this means nothing, and you can seriously question whether it is even possible to know that the sequences are recombinants. If 50% of the sequences are duplicates, does the 25% rate also apply to recombination between duplicates? How about recombination between sequences that differ at only one position? In that case the recombinant will be identical to one of the input sequences, so the effect of the recombination is just to remove a unique sequence and add a duplicate.

Note that there is another key detail that is not captured by any of the above definitions:

Is a sequence a recombinant if it is the result of recombination between identical sequences?

Is a sequence a recombinant if the recombinant is identical to one of the input sequences?

Is a sequence a recombinant if the recombination event occurred 5 cycles ago? What about 10 cycles ago?

If one sequence recombined in the previous cycle and another sequence recombined 10 cycles ago, do we differentiate between them?

How is this relevant to yasss?

When simulating recombination, the effect of the recombination should be reported in a sensible way. Some number indicating the amount of recombination present in the dataset should be attached to each simulated dataset. Given what was discussed above, it is hard to settle on a single metric.

FOR NOW I am only comfortable reporting the per nucleotide per cycle recombination rate and the number of recombination events per cycle. (Cycles coincide with generations: cycle 0, which does not actually happen, is linked with the ancestral sequences, and cycle 1 produces generation 1.)

Note that there are also some implications for how to implement recombination in the simulation process. For computational purposes we will treat recombination as if it replaces one of the two sequences that recombined and does not affect the other sequence.
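To make the replacement behaviour concrete, here is a minimal sketch of a single-breakpoint recombination event in which the recombinant replaces the first sequence and the partner is untouched (illustrative only, not the actual recombine_seqs implementation):

recombine_pair_sketch <- function(seq_a, seq_b) {
  stopifnot(nchar(seq_a) == nchar(seq_b))
  bp <- sample(seq_len(nchar(seq_a) - 1), 1)   # single random breakpoint
  recombinant <- paste0(substr(seq_a, 1, bp),
                        substr(seq_b, bp + 1, nchar(seq_b)))
  list(recombinant = recombinant,   # replaces seq_a in the generation
       partner     = seq_b,         # the partner is unaffected
       breakpoint  = bp)
}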

Implementation details

After producing the next generation and before assigning fitness scores:

What metrics should be produced during this process?

Only fill out the columns provided by the genealogy data structure.

It is appropriate for recomb_muts to track the number of mutations relative to the sequence that was removed, since the same quantity can be computed for the recombination partner at any time. The usefulness of this column lies in reporting the size of the effect of recombination.

Two functions are needed to implement this:

The recombine_gen function is responsible for:

The recombine_gen function is implemented using plain vectors for the values that are updated frequently. This yields massive performance gains over updating a data.frame directly.

The recombine_seqs function is responsible for:

All the recombination functionality lives in the recombination.R script.

Linkage Disequilibrium

A popular way to interpret recombination rates is as linkage disequilibrium. Linkage disequilibrium measures how much the co-occurrence of SNPs deviates from random. This changes with the distance between the SNPs.

To compute the linkage disequilibrium:

The metric used to measure the linkage disequilibrium between these two positions is:

$$D = \frac{|p_{12} - p_{1}p_{2}|}{D_{max}}$$

where

$$D_{max} = \min(p_{1}p_{2}, (1-p_{1})(1-p_{2})) \quad \text{if} \quad p_{12} < p_{1}p_{2}$$

and

$$D_{max} = \min(p_{1}(1-p_{2}), (1-p_{1})p_{2}) \quad \text{if} \quad p_{12} > p_{1}p_{2}$$

Computation of linkage disequilibrium

Computing the joint occurrence probability $(p_{12})$ is non-trivial. To ensure that minor mutations do not skew the result, only consider positions where the major variant is prevalent in between 20% and 80% of the sequences. Each sequence must be considered individually for each pair of positions.

The co-occurrence data is summarized in a 2-d array:

Each element is the number of sequences that carried the majority variant at both of the two positions.

The structure storing the co-occurrence counts is called jot, short for joint occurrence tracker.
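As an illustration, here is a minimal sketch of computing the metric for a single pair of positions, assuming the sequences are stored as a character matrix with one row per sequence (the function name is made up; the real computation builds the jot structure across all qualifying pairs):

ld_two_positions <- function(seq_mat, i, j) {
  # majority variant at each position
  maj_i <- names(which.max(table(seq_mat[, i])))
  maj_j <- names(which.max(table(seq_mat[, j])))
  x <- seq_mat[, i] == maj_i
  y <- seq_mat[, j] == maj_j
  p1  <- mean(x)
  p2  <- mean(y)
  p12 <- mean(x & y)   # the jot entry for (i, j) divided by the number of sequences
  d_max <- if (p12 < p1 * p2) {
    min(p1 * p2, (1 - p1) * (1 - p2))
  } else {
    min(p1 * (1 - p2), (1 - p1) * p2)
  }
  abs(p12 - p1 * p2) / d_max
}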

Algorithm to compute linkage disequilibrium

Tests for linkage disequilibrium calculation

Create 10 sequences of length 10, with a base of all A:

These should result in D scores of:

Then swap them around so that position 9 becomes position 2 and check that the same results are obtained.
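These tests can be sketched with the ld_two_positions helper from above (the exact sequence layout is an assumption):

seqs <- matrix("A", nrow = 10, ncol = 10)
seqs[1:5, c(2, 9)] <- "C"      # positions 2 and 9 are perfectly linked
ld_two_positions(seqs, 2, 9)   # expect D = 1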

Mutators

This section must be updated

To simulate the quasispecies, a function is needed that takes a parent sequence as input and produces a child sequence, possibly introducing some mutations into the sequence.

The constructs yasss uses to produce single children from a parent are called mutators, introduced as an S3 class into yasss.

It is simply a list with three elements: fun, args and arg_checker.

The fun element is a function with the argument the_seq and an arbitrary number of other arguments.

The args element is a list with the actual arguments that fun will use.

The arg_checker performs basic sanity checks on the arguments. It is separated from the function itself so that error checking and debugging can be simplified by checking the construction of the mutator at the start of the program, without generating a very deep trace. (Since the calls will happen through mechanisms like do.call(mutator$fun, args), the trace would also be hard to read.)
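As an illustration, a mutator could be constructed by hand like this (the element names follow the text; the function bodies are simplified placeholders):

uniform_mutator <- list(
  fun = function(the_seq, mu) {
    # placeholder: a real mutator returns a child derived from the_seq
    the_seq
  },
  args = list(mu = 1/250),
  arg_checker = function(args) {
    # basic sanity checks, run once at the start of the program
    is.numeric(args$mu) && args$mu >= 0 && args$mu <= 1
  }
)

# calls then happen through do.call, as mentioned above
do.call(uniform_mutator$fun, c(list(the_seq = "AAAA"), uniform_mutator$args))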

A constructor lives in mutator_constructor.

Each mutator has its own script and its own test file.

Presently no mutator will be allowed to produce indels. Hopefully this restriction can be lifted in the future.

At the moment six possible mutators are envisioned: 1) uniform_mutator 2) pp_uniform_mutator 3) nucleotide_based_mutator 4) pp_nucleotide_based_mutator 5) codon_based_mutator 6) pp_codon_based_mutator

pp stands for per position and allows the user to specify variable mutation rates for different positions.

Obviously each mutator is just a special case of the mutator that comes after it. As such, the ultimate goal should be to write a single general function that produces each of the special cases. However, that requires a lot more engineering than just getting a simple one up and running right now. So I will start with the uniform mutator, implement most of the package with only that mutator, and then come back to produce the special mutators later. Also, there might be performance issues with the most general versions that will not be an issue at all for the simpler ones. In theory the performance mismatches can be completely neutralized by moving this operation to C++, since the main performance overhead will be the lookups.

The output of a mutator is a list with the elements: 1) parent 2) child 3) mutation_statistics 4) mutation_rates

The parent is the input sequence.

The child is the produced offspring.

The mutation_statistics tracks metrics about the mutations that occurred. Initially it will be a list with a single element: n_mut

The mutation_rates element is the possibly updated mutation rate matrix that is used to compute the mutation probabilities. This is included to future proof the code so that the addition of indel functionality will not require major changes.
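A minimal sketch of a uniform mutator's fun element that returns this four-element list (the sampling logic is illustrative, not the exact yasss implementation):

uniform_mutator_fun_sketch <- function(the_seq, mu) {
  bases <- strsplit(the_seq, "")[[1]]
  hit <- runif(length(bases)) < mu   # positions that mutate
  bases[hit] <- vapply(bases[hit], function(b) {
    sample(setdiff(c("A", "C", "G", "T"), b), 1)   # substitutions only, no indels
  }, character(1))
  list(parent = the_seq,
       child = paste(bases, collapse = ""),
       mutation_statistics = list(n_mut = sum(hit)),
       mutation_rates = matrix(mu, nrow = 1))   # placeholder rate array
}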

Overall process for developing the mutators:

- Develop the uniform mutator together with its test file.
- There will be a whole slew of tests that are not specific to the uniform mutator. Split them out into a general mutator tests script. Use a for loop over all the different types of mutators to apply these general tests to each mutator.
- After the uniform mutator is completed, write the mutator_constructor.

Fitness evaluators

Managing multiple simulations

The genealogy data.frames rapidly become too large to store in memory. Thus there are two different approaches to managing the data from multiple simulations. The first is to store the full genealogies, which is useful for smaller simulations and for troubleshooting. The second approach reduces memory consumption by summarizing each genealogy as it is simulated and only saving the summary metrics.

Data structures

The data structure that saves only the summary metrics is called a dsum and the summary metrics include:

Another structure that exists in the data is a pair (or set) generated by applying different fitness cutoffs (criteria) to a genealogy. While the computation of fitness scores is part of the sim_pop function, these scores can be updated at a later stage using the assign_fitness function with the last_generation_only argument set to FALSE. However, the current attempt at managing the data does not handle the case where different fitness_evaluators are used on a single genealogy. Instead, we will assume that the fitness scores are computed only once, during the call to sim_pop. Future versions may lift this restriction on the data management component, which will require a complex mechanism for managing the argument sets, since the arguments that pertain specifically to sim_pop will have to be separated from the arguments that are used only in the fitness_evaluator.

A genealogy is typically used in one of three states:

In general, we will further group these three states. Either we will completely ignore the fitness scores and just use the full genealogy, or we will construct a fit_unfit_pair in which get_fit_offspring is used to generate a fitness-restricted genealogy and a random sample that ignores fitness scores is then drawn to produce a matched pair (based on the number of individuals).

This complexity is tracked in the following data structures:

arg_set and arg_collection

An example of the current version of an arg_set is shown below. Note that this will probably be redesigned in a future version to allow multiple fitness_evaluators to be applied to a single genealogy. Multiple arg_sets are collected into a long list called the arg_collection. Each element in an arg_collection is a single arg_set. The elements in an arg_collection should not be named, since the label that is part of each arg_set must be used as the identifier.

arg_set <- list(
  label = 'A',
  ancestors = paste(rep("A", 500), collapse = ''),
  r0 = 2,
  n_gen = n_gen,   # n_gen is assumed to be defined in the calling environment
  n_pop = Inf,
  mutator = list(fun = "mutator_uniform_fun",
                 args = list(mu = 1/250)),
  fitness_evaluator = list(fun = "fitness_evaluator_homology_fun",
                           args = list(comparators = paste(rep('XXXXC', 100), collapse = ''),
                                       h2fs = "h2fs_univariate_linear_fun")),
  required_fitness = 0.02
)

gcol

This is just a data.frame built by rbinding many different genealogies together and adding some columns to track the arguments that were used to call sim_pop and to remove individuals from the population.

The extra columns that must be added:

dsum and dcollection

Due to the memory requirements of storing full genealogies (or distance matrices), it is often preferable to immediately compute and summarize a distance matrix on the last generation of a genealogy and then delete both the genealogy and the distance matrix from memory. A dcollection is a list in which each element summarizes a single distance matrix and contains additional elements to help identify it. An element of a dcollection is referred to as a dsum. Outside of a dcollection, a dsum does not necessarily have the identification elements. These additional identification elements are the same as the extra columns added to the gcol data structure:

Each dsum must have three core elements that contain the summary metrics:

The root level elements of the dcollection list should not be named since the identifier elements that match the extra columns of a gcol should be used to identify the elements.

To make analysis easier, specifically plotting with ggplot, a function that converts a dcollection into data.frames is also provided. The function dcollection_to_df produces a dcollection_df, which is a list with three elements, each a data.frame containing all the data of a specific kind from all the datasets summarized in the dcollection. The three data.frames are:

Each of these data.frames uses the identifying columns:

Functions

Since this package is not implemented in an object-oriented framework, the data structures are not explicitly defined in code. The main focus of the implementation is functions, so the data structures described above must guide the development of the specific functions that operate on and produce these datasets.

check_arg_set

This function validates an arg_set (a single element of an arg_collection). Its role is similar to that of the validator of a constructor. Given an arg_set, it runs through a series of checks and produces a list in which each name describes the check that was performed and each entry is a boolean indicating pass (TRUE) or failure (FALSE).
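For illustration, the result could have this shape (the check names are invented examples, not the actual set of checks; arg_set refers to the example shown earlier):

check_result <- list(
  label_is_character = is.character(arg_set$label),
  r0_is_positive     = is.numeric(arg_set$r0) && arg_set$r0 > 0,
  mutator_has_fun    = !is.null(arg_set$mutator$fun)
)
all(unlist(check_result))   # TRUE only if every check passed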

check_dsum

Checks that a dsum conforms to the required formats. The function has a flag that allows the inclusion of the identifiers to be toggled.

If identifiers are set to off, then only avg_hd, perc and dens are checked.

If identifiers are set to on, also check presence, format and length of sim_id, label, and sampling

Optionally also give the arg_collection to check the label.

check_dcollection

Checks that a dcollection is a list, that each element in the dcollection is a dsum with identifiers (optionally check for label existence in an arg_collection) and that the elements are unnamed.

summarize_dmat

Summarizes a distance matrix into an average pairwise distance (avg_hd), a set of percentiles (perc) and a density estimate (dens). No identifiers are added; that is not part of this function.
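A minimal sketch of such a summary, assuming dmat is a symmetric matrix of pairwise distances (the percentile choices here are illustrative):

summarize_dmat_sketch <- function(dmat) {
  d <- dmat[lower.tri(dmat)]   # the unique pairwise distances
  list(avg_hd = mean(d),
       perc   = quantile(d, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)),
       dens   = density(d))
}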

Complications related to the bandwidth

The density function estimates the height of the density function of the distribution by considering the number of points in a certain region. (The real procedure is more complex, but this idea is good enough to explain what goes wrong.) Since mutations are such rare events, if we simulate short reads, the number of mutations that occur can be low. Thus the set of possible pairwise distances that can occur can be limited to, say, 0, 1, 2, ..., 10. This means that the 'density function' has only 11 possible values that it can assume, so it really is a probability mass function and not a density function.

The estimation performed by the density function in R is designed to capture complex functions which might have high levels of squiggliness. This means that a probability mass function as described above is modelled as a multimodal distribution with high density at the mentioned points and much lower density elsewhere. This is not what we want from an estimate of the density function of the pairwise distances between the sequences.

To fix this, one of two things can be done:

1) Detect when the distribution should be considered discrete, and then estimate it as a probability mass function.

2) Auto-tune the bandwidth argument of the density function so that each kernel always covers at least one real point.
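A sketch of option 2, assuming d is the vector of pairwise distances (a heuristic, not the yasss implementation):

auto_bw_density <- function(d) {
  # widen the bandwidth so each kernel spans at least the gap
  # between adjacent observed distance values
  gaps <- diff(sort(unique(d)))
  bw <- max(stats::bw.nrd0(d), if (length(gaps) > 0) min(gaps) else 1)
  density(d, bw = bw)
}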

Clustering the distance matrix

It is sometimes of interest to see if there are clusters in the resulting simulations; specifically, to see if the consideration of the fitness scores splits the population into sub-clusters.

There are multiple ways to approach this question. The simplest is to just divide the population into two clusters and to compare the within and between distances to each other. Within yasss this is referred to as the clara2 approach.

clara2

Use the clara function to split the population into two clusters based on the distance matrix. Using the clustering vector produced by clara, compute all the within-cluster distances and average them to obtain a single average within-cluster distance metric. Likewise, compute the average of the between-cluster distances.
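A minimal sketch of the idea, assuming dmat is a precomputed distance matrix. Note that cluster::clara clusters raw observations rather than dissimilarities, so this sketch substitutes cluster::pam, which accepts a dissimilarity object, for the two-cluster split:

library(cluster)

clara2_sketch <- function(dmat) {
  cl <- pam(as.dist(dmat), k = 2, diss = TRUE)$clustering
  lower <- lower.tri(dmat)
  same  <- outer(cl, cl, "==")[lower]   # TRUE for within-cluster pairs
  d     <- dmat[lower]
  list(avg_within  = mean(d[same]),
       avg_between = mean(d[!same]))
}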

These metrics are stored inside the dsum result in a list called clara2. The clara2 sub-list of the dsum list contains the elements:

sim_proc_many_pops

A large wrapper around sim_pop, get_fit_offspring and summarize_dmat. It has many arguments that control how many of the intermediary results are stored.

Primary arguments:

Algorithm:

The return value of sim_proc_many_pops is a list with the following elements:

Intended use

All this data management functionality is built to make it easier to simulate batches of data to illustrate key concepts. There are three primary types of comparison that are of interest (right now):

In almost all cases, only the summarized distance matrix will be produced. As far as I can tell, PFitter only requires the average pairwise Hamming distance as input, so we should not need to hack PFitter itself into the package.

Very optimistic pie in the sky dreams

Simulate the growth of the quasispecies
- Inputs
  - Ancestors
  - Number of copies of each ancestor
  - (Vector) of generation sizes
  - Required population size or the number of generations
  - (Vector over generations) of recombination profiles
  - (Vector) of per codon mutation matrices (specified on the nucleotide level, 64x64)
  - (Vector) of per codon indel generation

Simulate the sampling achieved by reverse transcription
- Inputs
  - primers
  - primer binding affinity function
  - Number of primers to generate

Simulate PCR
- Inputs
  - Number of cycles
  - Per cycle efficiency
  - Max number of sequences to generate
  - (Vector) per nucleotide mutation matrix (4x4)
  - PCR recombination profile
  - PCR recombination boost parameter (increase recombination rate each cycle)

Simulate the sequencing
- Inputs
  - Number of reads sequenced
  - Length of reads
  - (Vector) per nucleotide error matrix (4x4)


