```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE
)
```
yasss is designed to simulate genealogies, which are stored in data.frames. A high level summary of the core simulation functions is provided below:

- `sim_pop`
- `sim_next_gen`
- `mutator`: mutate a parent sequence into an offspring sequence
- `assign_fitness`: for the most recent generation only
- `get_fit_offspring`
It quickly became clear that most simulations will involve many datasets generated with distinct sets of arguments. Functionality to track this extra structure was added to yasss in a function called `sim_proc_many_pops`, which calls `sim_pop` in a for loop and packages up and processes all the results from those calls.
NOTE: Issue to resolve: Since a sequence recombines with a partner, each sequence has two chances of being involved in recombination: once when the recombinants are chosen and once when the partners are chosen. How should we modify the parameters because of this? Should we modify them at all? CORRECTION: It is a double chance of being involved in a recombination event, not a double chance of becoming a recombinant. However, we do not allow recombinants to recombine with other recombinants that were formed in the same generation, so there is a little weirdness in the selection of the recombination partner.
NOTE: The `recomb_*` columns in the genealogy data structure do not inherit the values of the parent sequences. This makes sense because they can easily be recomputed later using the `parent_id`s. However, it should be communicated clearly to the user.
Recombination is added into `sim_pop` before the fitness computation step. It must be performed before fitness is computed, since recombination alters the sequence and fitness depends on the sequence. The sequences generated before the recombination step should therefore be seen as placeholder sequences that are only finalized after they have been exposed to recombination.

The problem with recombination is always the same: there are too many parameters, and below certain thresholds recombination is simply inconsequential. Getting the various parameters and their interactions just right is the key.
The parameters of interest:
Why do you need all these parameters? Because you need to describe the levels of recombination accurately. Consider a statement like: 25% of sequences are recombinants. If all the sequences are identical, then this means nothing, and you can seriously question whether it is even possible to know that the sequences actually are recombinants. If 50% of the sequences are duplicates, does the 25% rate also apply to recombination between duplicates? How about recombination between sequences that differ in only one position? In that case the recombinant will be identical to one of the input sequences, so the effect of the recombination is just to remove a unique sequence and add a duplicate.
Note that there is another super key detail that is not even captured by any of the above definitions:

- Is a sequence a recombinant if it is the result of recombination between identical sequences?
- Is a sequence a recombinant if the recombinant is identical to one of the input sequences?
- Is a sequence a recombinant if the recombination event occurred 5 cycles ago? What about 10 cycles ago?
- If one sequence recombined in the previous cycle and another sequence recombined 10 cycles ago, do we differentiate between them?
### How is this relevant to yasss?
When simulating recombination, the effect of the recombination should be reported in a sensible way. Attached to a simulated dataset should be some number indicating the amount of recombination present in the dataset. Given what was discussed above, it is really hard to decide which metric should be used.

FOR NOW I am only comfortable reporting the per nucleotide per cycle recombination rate and the number of recombination events per cycle. (Cycles coincide with generations, so cycle 0, which doesn't happen, is linked with the ancestral sequences, and cycle 1 produces generation 1.)
Note that there are also some implications for how to implement recombination in the simulation process. For computational purposes we will treat recombination as if it replaces one of the two sequences that recombined and does not affect the other sequence. Recombination is applied after producing the next generation and before assigning fitness scores.

What metrics should be produced during this process? Only fill out the columns provided by the genealogy data structure:
- `recomb_pos`: The position of the breakpoint.
- `recomb_replaced`: Which side of the sequence was replaced: to the left of the breakpoint, or to the right of the breakpoint?
- `recomb_partner`: The id of the sequence that the recombination was with.
- `recomb_muts`: The number of mutations introduced into the sequence as a result of the recombination. Specifically, this is how many mutations would have been required to modify the sequence that was removed to make it identical to the recombinant. It is appropriate for `recomb_muts` to track the number of mutations relative to the sequence that was removed, since the same quantity can be computed for the recombination partner at any time. The usefulness of this column lies in reporting the size of the effect of recombination.
Two functions are needed to implement this:

- `recombine_gen`: Given a generation and some parameters, add recombination to some sequences based on random draws.
- `recombine_seqs`: Given two sequences, recombine them.

The `recombine_gen` function is responsible for selecting, via random draws, which sequences recombine and with which partners, and for calling `recombine_seqs` on each selected pair. It is implemented using vectors on which frequent updates are performed. This yields massive performance gains over updating a data.frame directly.

The `recombine_seqs` function is responsible for performing the actual recombination and computing the `recomb_muts` value. All the recombination functionality lives in the recombination.R script.
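To make the mechanics concrete, here is a minimal sketch of what `recombine_seqs` might look like. The function name and the column semantics come from the description above, but the signature and internals are assumptions:

```r
# Sketch of recombine_seqs (assumed signature). Given two sequences and a
# breakpoint, replace one side of seq1 with the corresponding side of seq2
# and count the mutations this introduced relative to the removed sequence.
recombine_seqs <- function(seq1, seq2, recomb_pos, replace_left = TRUE) {
  stopifnot(nchar(seq1) == nchar(seq2))
  c1 <- strsplit(seq1, '')[[1]]
  c2 <- strsplit(seq2, '')[[1]]
  recombinant <- c1
  if (replace_left) {
    recombinant[1:recomb_pos] <- c2[1:recomb_pos]
  } else {
    recombinant[(recomb_pos + 1):length(c1)] <- c2[(recomb_pos + 1):length(c2)]
  }
  list(
    recombinant     = paste(recombinant, collapse = ''),
    recomb_pos      = recomb_pos,
    recomb_replaced = if (replace_left) 'left' else 'right',
    # mutations relative to the sequence that was removed (seq1):
    recomb_muts     = sum(recombinant != c1)
  )
}
```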
A popular way to interpret recombination rates is via linkage disequilibrium. Linkage disequilibrium measures how much the co-occurrence of SNPs deviates from random, and it changes with the distance between the SNPs.

The metric used to measure the linkage disequilibrium between two positions is:
$$D = \frac{|p_{12} - p_{1}p_{2}|}{D_{max}}$$

where

$$D_{max} = \min(p_{1}p_{2}, (1-p_{1})(1-p_{2})) \quad \text{if} \quad p_{12} < p_{1}p_{2}$$

and

$$D_{max} = \min(p_{1}(1-p_{2}), (1-p_{1})p_{2}) \quad \text{if} \quad p_{12} > p_{1}p_{2}$$
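For reference, a direct transcription of these formulas into a small R helper (the function name is ours, not part of the package):

```r
# Normalized linkage disequilibrium between two positions.
# p1, p2: marginal frequencies of the majority variants;
# p12: their joint frequency across sequences.
d_prime <- function(p1, p2, p12) {
  D <- p12 - p1 * p2
  d_max <- if (D < 0) {
    min(p1 * p2, (1 - p1) * (1 - p2))
  } else {
    min(p1 * (1 - p2), (1 - p1) * p2)
  }
  abs(D) / d_max
}
```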
### Computation of linkage disequilibrium

Computing the joint occurrence probability $p_{12}$ is non-trivial. To ensure that minor mutations do not skew the result, only consider positions where the major variant is prevalent in between 20% and 80% of the sequences. Each sequence must be considered individually for each pair of positions.

The co-occurrence data is summarized in a 2-d array: element $(i, j)$ is the number of sequences that have the majority variant at both position $i$ and position $j$. The structure storing the co-occurrence counts is called `jot`, short for joint occurrence tracker.
### Algorithm to compute linkage disequilibrium

```r
# Runnable version of the original pseudocode; assumes Biostrings and that
# all_sequences is an equal-width DNAStringSet.
library(Biostrings)
cm <- consensusMatrix(all_sequences)[c('A', 'C', 'G', 'T'), ]
seq_length <- ncol(cm)
n_seq <- length(all_sequences)
max_freq <- apply(cm, 2, max) / n_seq   # frequency of the majority variant
max_nuc <- rep('X', seq_length)
for (i in 1:seq_length) {
  max_indx <- min(which(cm[, i] == max(cm[, i])))
  max_nuc[i] <- c('A', 'C', 'G', 'T')[max_indx]
}
# Fill the joint occurrence tracker (upper triangle only).
seq_mat <- as.matrix(all_sequences)
jot <- matrix(0, seq_length, seq_length)
for (L1 in 1:(seq_length - 1)) {
  for (L2 in (L1 + 1):seq_length) {
    jot[L1, L2] <- sum(seq_mat[, L1] == max_nuc[L1] & seq_mat[, L2] == max_nuc[L2])
  }
}
stopifnot(all(jot[lower.tri(jot)] == 0))  # lower triangle must be all zero
# Collect unnormalized D values grouped by the distance between positions.
linkages <- vector('list', seq_length - 1)
for (L1 in 1:(seq_length - 1)) {
  for (L2 in (L1 + 1):seq_length) {
    link_dist <- L2 - L1
    D <- jot[L1, L2] / n_seq - max_freq[L1] * max_freq[L2]
    linkages[[link_dist]] <- c(linkages[[link_dist]], D)
  }
}
```
### Tests for linkage diseq. calculation

Create 10 sequences of length 10, with a base of all A. These should result in D scores of:

Then swap them around so that position 9 becomes position 2, and check that the same results are obtained.
NOTE: This section must be updated.
To simulate the quasispecies, a function is needed that takes a parent sequence as input and produces a child sequence, possibly introducing some mutations into the sequence.

The constructs yasss uses to produce single children from a parent are called mutators, introduced into yasss as an S3 class. A mutator is simply a list with three elements: fun, args and arg_checker.

- The `fun` element is a function with the argument `the_seq` and an arbitrary number of other arguments.
- The `args` element is a list with the actual arguments that `fun` will use.
- The `arg_checker` element performs basic sanity checks on the arguments. It is separated from the function itself so that error checking and debugging can be simplified by checking the construction of the mutator at the start of the program, without generating a very deep trace. (Since the calls happen through mechanisms like `do.call(mutator$fun, args)`, the trace would also be hard to read.)
A constructor lives in mutator_constructor. Each mutator has its own script and its own test file. Presently no mutator is allowed to produce indels; hopefully this restriction can be lifted in the future.

At the moment six possible mutators are envisioned:

1. uniform_mutator
2. pp_uniform_mutator
3. nucleotide_based_mutator
4. pp_nucleotide_based_mutator
5. codon_based_mutator
6. pp_codon_based_mutator

pp stands for per position and allows the user to specify variable mutation rates for different positions.
Obviously each mutator is just a special case of the mutator that comes after it. As such, the ultimate goal should be to write a single general function that produces each of the special cases. However, that requires a lot more engineering than just getting a simple one up and running right now. So I will start with the uniform mutator, implement most of the package with only that mutator, and then come back to produce the special mutators later. Also, there might be performance issues with the most general versions that will not be an issue at all for the simpler ones. In theory the performance mismatches can be completely neutralized by moving this operation to C++, since the main performance overhead will be the lookups.
The output of a mutator is a list with the elements:

1. parent: the input sequence.
2. child: the produced offspring.
3. mutation_statistics: metrics about the mutations that occurred. Initially it will be a list with a single element: n_mut.
4. mutation_rates: the possibly updated mutation rate array that was used to compute the mutation probabilities. This is included to future proof the code so that the addition of indel functionality will not require major changes.
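As a concrete illustration, here is a minimal hypothetical sketch of a uniform mutator with this structure and output format. The real constructor lives in mutator_constructor; the internals below are assumptions:

```r
# A hand-built uniform mutator: list(fun, args, arg_checker) with class 'mutator'.
uniform_mutator <- structure(
  list(
    fun = function(the_seq, mu) {
      seq_chars <- strsplit(the_seq, '')[[1]]
      hits <- runif(length(seq_chars)) < mu        # positions that mutate
      seq_chars[hits] <- vapply(seq_chars[hits], function(nuc) {
        sample(setdiff(c('A', 'C', 'G', 'T'), nuc), 1)   # no indels allowed
      }, character(1))
      list(parent = the_seq,
           child = paste(seq_chars, collapse = ''),
           mutation_statistics = list(n_mut = sum(hits)),
           mutation_rates = matrix(mu / 3, nrow = 4, ncol = 4))  # placeholder
    },
    args = list(mu = 1/250),
    arg_checker = function(args) {
      stopifnot(is.numeric(args$mu), args$mu >= 0, args$mu <= 1)
    }
  ),
  class = 'mutator'
)

# Invocation mirrors the do.call mechanism described above:
result <- do.call(uniform_mutator$fun,
                  c(list(the_seq = 'ACGTACGTACGT'), uniform_mutator$args))
result$mutation_statistics$n_mut
```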
Overall process for developing the mutators:

- Develop the uniform mutator together with its test file.
- There will be a whole slew of tests that are not specific to the uniform mutator. Split them out into a general mutator tests script. Use a for loop over all the different types of mutators to apply these general tests to each mutator.
- After the uniform mutator is completed, write the mutator_constructor.
The genealogy data.frames rapidly become too large to store in memory. Thus there are two different approaches to managing the data from multiple simulations. The first is to store the full genealogies, which is useful for smaller simulations and for troubleshooting. The second approach reduces memory consumption by summarizing each genealogy as it is simulated and only saving the summary metrics.

The data structure that saves only the summary metrics is called a `dsum`, and the summary metrics include the percentiles of the pairwise distances, their average, and an estimate produced by `density()` (see the `dsum` description below).
Another structure that exists in the data is a pair (or set) generated by applying different fitness cutoffs (criteria) to a genealogy. While the computation of fitness scores is part of the `sim_pop` function, these scores can be updated at a later stage using the `assign_fitness` function with the `last_generation_only` argument set to FALSE. However, the current attempt at managing the data will not handle the case where different `fitness_evaluator`s are used on a single genealogy. Instead, we will assume that the fitness scores are only computed once, during the call to `sim_pop`. Future versions may lift this restriction on the data management component, which will require a complex mechanism for managing the argument sets, since the arguments that pertain specifically to `sim_pop` will have to be separated from the arguments that are used only in the `fitness_evaluator`.
A genealogy is typically used in one of three states:

- the `fitness_scores` are ignored,
- `get_fit_offspring` is used to obtain a genealogy based only on fit individuals whose ancestors were also fit, or
- a random sample is drawn from the genealogy while ignoring the fitness scores.

In general, we will further group these three states: either we completely ignore the `fitness_scores` and just use the full genealogy, or we construct a `fit_unfit_pair` in which `get_fit_offspring` is used to generate a fitness-restricted genealogy and then a random sample is drawn ignoring fitness scores to generate a matched (based on number of individuals) pair.
This complexity is tracked in the following data structures:

- `fit_unfit_pair`: Two genealogies that were simulated with a single call to `sim_pop`. In one, the fitness scores were used to restrict the individuals, while in the other a number of individuals were randomly sampled to ensure that the matched pair has the same number of individuals in the final generation.
- `arg_set`: A list that contains the arguments for `sim_pop` AND a required fitness AND a label. Thus an `arg_set` contains enough information to produce a fit/unfit pair. The label will be used later to associate the simulated data with the `arg_set` that was used to produce it. Note: In the future multiple fitness evaluators may be applied to the same genealogy, which will require a more complex `arg_set` concept.
- `arg_collection`: A list of `arg_set`s.
- `gcol`: A collection of genealogies. This is the main memory consuming culprit. It is just a set of genealogy data.frames rbinded together, with extra columns added that track the `genealogy_id`, the `arg_set` that was used to produce it, and the fitness requirement that had to be satisfied.
- `dsum`: Summary of the distance matrix (percentiles, avg, density). This is a list with the three elements that summarize a dmat, and optionally the identifiers used to track which simulation (and fitness selection) the dmat belonged to.
- `dcollection`: A collection of `dsum`s. Each element is a single `dsum`.

### arg_set and arg_collection
An example of the current version of an `arg_set` is shown below. Note that this will probably be redesigned in a future version to allow multiple `fitness_evaluator`s to be applied to a single genealogy. Multiple `arg_set`s are collected into a long list called the `arg_collection`. Each element in an `arg_collection` is a single `arg_set`. The elements in an `arg_collection` should not be named, as the label that is part of the `arg_set` must be used as the identifier.
```r
arg_set <- list(
  label = 'A',
  ancestors = paste(rep("A", 500), collapse = ''),
  r0 = 2,
  n_gen = n_gen,
  n_pop = Inf,
  mutator = list(fun = "mutator_uniform_fun",
                 args = list(mu = 1/250)),
  fitness_evaluator = list(
    fun = "fitness_evaluator_homology_fun",
    args = list(comparators = paste(rep('XXXXC', 100), collapse = ''),
                h2fs = "h2fs_univariate_linear_fun")),
  required_fitness = 0.02
)
```
### gcol

This is just a `data.frame` built by `rbind`ing many different genealogies together and adding some columns to track the arguments that were used to call `sim_pop` and that were used to remove individuals from the population.
The extra columns that must be added:

- `sim_id`: The id of the simulation. If many simulations are run in a loop using the same `arg_set`, this column is the primary mechanism that distinguishes the simulations from each other.
- `label`: The label is used to match the simulation to the arguments that were used to call `sim_pop` or sample from the genealogy.
- `sampling`: The sampling approach applied to the original genealogy that was produced by `sim_pop`. Valid values include: `fit_threshold`, `size_matched_sampling`, `none`.
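For illustration, a hypothetical sketch of how a `gcol` could be assembled (the genealogy objects and the values are made up):

```r
# Tag each genealogy data.frame with its identifying columns, then stack them.
genealogy_a$sim_id   <- 1L
genealogy_a$label    <- 'A'
genealogy_a$sampling <- 'fit_threshold'

genealogy_b$sim_id   <- 1L
genealogy_b$label    <- 'A'
genealogy_b$sampling <- 'size_matched_sampling'

gcol <- rbind(genealogy_a, genealogy_b)
```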
### dsum and dcollection

Due to the memory requirements of storing full genealogies (or distance matrices), it is often preferable to immediately compute and summarize a distance matrix on the last generation of a genealogy and then delete both the genealogy and the distance matrix from memory. A `dcollection` is a list in which each element summarizes a single distance matrix and contains additional elements to help identify it. An element of a `dcollection` is referred to as a `dsum`. Outside of a `dcollection`, the `dsum` does not necessarily have the identification elements. These additional identification elements are the same as the extra columns added to the `gcol` data structure:
- `sim_id`: The id of the simulation. If many simulations are run in a loop using the same `arg_set`, this column is the primary mechanism that distinguishes the simulations from each other.
- `label`: The label is used to match the simulation to the arguments that were used to call `sim_pop` or sample from the genealogy.
- `sampling`: The sampling approach applied to the original genealogy that was produced by `sim_pop`. Valid values include: `fit_threshold`, `size_matched_sampling`, `none`.
Each `dsum` must have three core elements that contain the summary metrics:

- `perc`: the percentiles of the distribution,
- `avg_hd`: the average of all the pairwise distances,
- `dens`: an estimate of the distribution as produced by `density()`.

In addition, the `clara2` element holds the key summary metrics of the call to clara:

- `avg_within_cluster`: The average of all the pairwise distances between all the elements in the same cluster.
- `avg_between_cluster`: The average of all the pairwise distances between all the elements not in the same cluster.
- `cluster_sizes`: The number of sequences in each cluster.

The root level elements of the `dcollection` list should not be named, since the identifier elements that match the extra columns of a `gcol` should be used to identify the elements.
To make analysis easier, specifically plotting with `ggplot`, a function that converts a `dcollection` into a `data.frame` is also provided. The function `dcollection_to_df` produces a `dcollection_df`, which is a list with three elements, each a data.frame containing all the data of a specific kind from all the datasets summarized in the `dcollection`. The three data.frames are:

- `dmat_metrics`: High level summary metrics of the distribution of the pairwise distances. Uses `metric` and `value` columns to record many different metric types in the same data.frame.
- `dmat_distribution_df`: The `x` and `y` output of the `density` function, so that a density function can be plotted with `geom_smooth`. Uses columns named x and y.
- `dmat_clara2_df`: The summary statistics of the clustering of the distance matrix into 2 sub-clusters using the `clara` function. Uses `metric` and `value` columns to record many different metric types in the same `data.frame`.

Each of these data.frames uses the identifying columns:

- `sim_id`
- `label`
- `sampling`
- `group_label`: Concatenation of the `label` and `sampling` columns.
- `uniq_id`: Concatenation of the `group_label` and `sim_id` columns.

Since this package is not implemented in an object oriented framework, the data structures are not explicitly defined in code. The main focus of the implementation is functions, so the data structures described here need to guide the development of specific functions that operate on and produce the described datasets.
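A hypothetical usage sketch, assuming `dcollection_to_df` takes just the `dcollection` (here `geom_line` simply draws the precomputed density curves):

```r
library(ggplot2)

dcdf <- dcollection_to_df(dcollection)

# One density curve per simulated dataset, grouped by the uniq_id column.
ggplot(dcdf$dmat_distribution_df, aes(x = x, y = y, group = uniq_id)) +
  geom_line(aes(colour = group_label))
```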
### check_arg_set

This function validates an `arg_set` (a single element of an `arg_collection`). Its role is similar to that of the validator of a constructor. Given an `arg_set`, it runs through a series of checks and produces a list in which each name describes the check that was performed and each entry is a boolean indicating pass (TRUE) or failure (FALSE).
### check_dsum

Checks that a `dsum` conforms to the required formats. The function has a flag that toggles whether the identifiers are included in the checks.

- If identifiers are set to off, then only check `avg_hd`, `perc` and `dens`.
- If identifiers are set to on, also check the presence, format and length of `sim_id`, `label`, and `sampling`.
- Optionally also pass in the `arg_collection` to check the label.
### check_dcollection

Checks that a `dcollection` is a list, that each element in the `dcollection` is a `dsum` with identifiers (optionally checking that the label exists in an `arg_collection`), and that the elements are unnamed.
### summarize_dmat

Summarizes a distance matrix into an average pairwise distance (`avg_hd`), a set of percentiles (`perc`) and a density estimate (`dens`). No identifiers are added; that is not part of this function.
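A minimal sketch of what `summarize_dmat` could look like (the percentile grid is an assumption):

```r
# Summarize the lower triangle of a symmetric pairwise distance matrix.
summarize_dmat <- function(dmat) {
  pairwise <- dmat[lower.tri(dmat)]
  list(
    avg_hd = mean(pairwise),
    perc   = quantile(pairwise, probs = seq(0, 1, by = 0.05)),
    dens   = density(pairwise)
  )
}
```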
### Complications related to the bandwidth

The `density` function estimates the height of the density function of the distribution by considering a number of points in a certain region. (It is more complex than that, but this idea is good enough to explain what is wrong.) Since mutations are such rare events, if we simulate short reads, the number of mutations that occur can be low. Thus, the set of possible pairwise distances that can occur can be limited to, say, 0, 1, 2, ..., 10. This means that the 'density function' has only 11 possible values that it can assume, so it really is a probability mass function and not a density function.

The estimation performed by the `density` function in R is designed to capture complex functions which might have high levels of squiggliness. This means that a probability mass function as described above is modelled as a multimodal distribution with high density at the mentioned points and much lower density elsewhere. This is not what we want from an estimate of the density function of the pairwise distances between the sequences.

To fix this, one of two things can be done:

1. Detect when the distribution should be considered discrete, and then estimate it as a probability mass function.
2. Auto-tune the bandwidth argument of the `density` function so that it always covers at least one real point with each kernel.
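A hypothetical sketch of option 2: force the bandwidth to be at least as wide as the largest gap between adjacent observed distances, so that every kernel overlaps a neighbouring point (the function name and the exact rule are assumptions):

```r
# Widen the bandwidth when the observed distances form a sparse grid.
auto_bw_density <- function(pairwise_distances) {
  pts <- sort(unique(pairwise_distances))
  min_bw <- if (length(pts) > 1) max(diff(pts)) else 1
  default_bw <- density(pairwise_distances)$bw
  density(pairwise_distances, bw = max(default_bw, min_bw))
}
```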
### Clustering the distance matrix

It is sometimes of interest to see if there are clusters in the resulting simulations; specifically, to see if the consideration of the fitness scores splits the population into sub-clusters.

There are multiple ways to approach this question. The simplest is to just divide the population into two clusters and to compare the within and between distances to each other. Within yasss this is referred to as the `clara2` approach.
### clara2

Use the `clara` function to split the population into two clusters based on the distance matrix. Using the clustering vector produced by `clara`, compute all the within-cluster distances and average them to obtain a single average within-cluster distance metric. Likewise, compute the average of the between-cluster distances.
These metrics are stored inside the `dsum` result in a list called `clara2`. The `clara2` sub-list of the `dsum` list contains the elements:

- `avg_within_cluster`: A single numeric value that is the average of all the within-cluster distances.
- `avg_between_cluster`: A single numeric value that is the average of all the between-cluster distances.
- `cluster_sizes`: An integer vector of length two that lists the sizes of the clusters.
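A rough sketch of this computation, assuming the cluster package. `seq_features` is a hypothetical numeric representation of the sequences (since `clara` clusters a data matrix) and `dmat` is the pairwise distance matrix:

```r
library(cluster)

cl <- clara(seq_features, k = 2)                  # split into two clusters
same <- outer(cl$clustering, cl$clustering, '==')  # TRUE if same cluster

clara2 <- list(
  avg_within_cluster  = mean(dmat[same  & lower.tri(dmat)]),
  avg_between_cluster = mean(dmat[!same & lower.tri(dmat)]),
  cluster_sizes       = as.integer(table(cl$clustering))
)
```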
### sim_proc_many_pops

A large wrapper for `sim_pop`, `get_fit_offspring` and `summarize_dmat`. It has many arguments that control how much of the intermediary results are stored.
Primary arguments:

- `arg_collection`: All the `arg_set`s that will be used to simulate genealogies with `sim_pop` and `get_fit_offspring`.
- `n_pop`: The number of times `sim_pop` will be called for each `arg_set`. Note that the number of datasets generated will depend on the way the fitness scores are used to sample from the genealogies.
- `fitness_processing`: The way that the fitness scores should be used to sample from the genealogy. Valid options include:
  - `none`: This will produce one dataset per genealogy.
  - `fit_unfit_pair`: This will use the threshold approach to remove all unfit individuals and their offspring. The remaining members of the last generation will be considered the fit member of the pair. An equal number of individuals will be sampled at random from the original genealogy to produce the unfit member, such that it is the same size as the fit member.
  - `fit_unfit_unmatched_pair`: Same as `fit_unfit_pair` except that the unfit member will not be down-sampled to size-match the fit member.
- `output_genealogy`: Should the genealogy data sets be deleted to reduce memory usage? Valid options: `last_gen_only`, `none`, `full`.
- `output_dmat`: Should the distance matrix be included in the output?

Algorithm:
```
loop over arg_collection
  loop over n_pop
    call sim_pop
    if fitness_processing == none:
      add the genealogy to new_genealogies
    if fitness_processing == fit_unfit_pair:
      call get_fit_offspring with fitness_threshold -> add to new_genealogies
      draw a size matched sample from the full genealogy
        -> add to new_genealogies (full genealogy)
    if fitness_processing == fit_unfit_unmatched_pair:
      call get_fit_offspring with fitness_threshold -> add to new_genealogies
      just take the full genealogy
        -> add to new_genealogies (full genealogy)
    if output_genealogy == last_gen_only
      reduce new_genealogies to last_gen
      append last_gen to output_genealogies
    if output_genealogy == full
      append new_genealogies to output_genealogies
    loop over new_genealogies
      compute the dmat on last_gen
      call summarize_dmat
      add the new dsum into the dcollection
      if output_dmat
        add the dmat to all_dmats
```
The return value from `sim_proc_many_pops` is a list with the following elements:

- `dcollection`
- `all_dmats`: [OPTIONAL] list of lists containing the dmats and the identifiers.
- `all_genealogies`: [OPTIONAL] `data.frame` that contains all the genealogies, or, in the case that `last_gen_only` was specified, only the last generations of the genealogies, plus all the identifier columns.
- `arg_collection`: The input `arg_collection`.
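A hypothetical invocation, using only the arguments listed above (the defaults and exact signature are assumptions):

```r
results <- sim_proc_many_pops(
  arg_collection     = arg_collection,
  n_pop              = 10,
  fitness_processing = 'fit_unfit_pair',
  output_genealogy   = 'none',
  output_dmat        = FALSE
)
str(results$dcollection, max.level = 1)
```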
All this data management functionality is built to make it easier to simulate batches of data to illustrate key concepts. There are three primary types of comparison that are of interest (right now):
- An `arg_collection` with just one `arg_set`, with `fitness_processing` set to `none`.
- An `arg_collection` with just one `arg_set`, with `fitness_processing` set to either `fit_unfit_pair` or `fit_unfit_unmatched_pair`.
- An `arg_collection` with multiple `arg_set`s where the base simulation parameters are not modified, but different `fitness_evaluator`s and/or different `fitness_requirements` are specified.

In almost all cases, only the summarized distance matrix will be produced. As far as I can tell, PFitter only requires the average pairwise Hamming distance as input, so do we even need to hack in PFitter?
Simulate the growth of the quasispecies

- Inputs
  - Ancestors
  - Number of copies of each ancestor
  - (Vector) of generation sizes
  - Required population size or the number of generations
  - (Vector over generations) of recombination profiles
  - (Vector) of per codon mutation matrices (specified on the nucleotide level, 64x64)
  - (Vector) of per codon indel generation

Simulate the sampling achieved by reverse transcription

- Inputs
  - Primers
  - Primer binding affinity function
  - Number of primers to generate

Simulate PCR

- Inputs
  - Number of cycles
  - Per cycle efficiency
  - Max number of sequences to generate
  - (Vector) per nucleotide mutation matrix (4x4)
  - PCR recombination profile
  - PCR recombination boost parameter (increase recombination rate each cycle)

Simulate the sequencing

- Inputs
  - Number of reads sequenced
  - Length of reads
  - (Vector) per nucleotide error matrix (4x4)