sim_biallelic | R Documentation |
Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level minor allele frequencies (PLMAF), and some parameters dictating skew and error distributions. Outputs include the phased haplotypes and the unphased read count and coverage data.
sim_biallelic(
coi,
plmaf = runif(10, 0, 0.5),
coverage = 200,
alpha = 1,
overdispersion = 0,
relatedness = 0,
epsilon = 0
)
coi |
Complexity of infection. |
plmaf |
Vector of population-level minor allele frequencies at each locus. |
coverage |
Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci. |
alpha |
Shape parameter of the symmetric Dirichlet prior on strain proportions. |
overdispersion |
The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \mjseqn\fracpoverdispersion and \mjseqn\frac1-poverdispersion. |
relatedness |
The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (\Sexpr[results=rd]{tools:::Rd_expr_doi("https://doi.org/10.1371/journal.pcbi.1005348")}): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))." |
epsilon |
The probability of a single read being miscalled as the other allele. This error is applied in both directions. |
Simulated data are drawn from a simple statistical model:
Strain proportions are drawn from a symmetric Dirichlet
distribution with shape parameter alpha
.
Phased haplotypes are drawn at every locus, one for each
coi
. The allele at each locus is drawn from a Bernoulli
distribution with probability given by the plmaf
.
The "true" within-sample allele frequency at every locus is
obtained by multiplying haplotypes by their strain proportions, and
summing over haplotypes. Errors are introduced through the equation
\mjsdeqnwsmaf_error = wsmaf(1-e) + (1-wsmaf)e
where \mjseqnwsmaf is the WSMAF without error and \mjseqne is
the error parameter epsilon
.
Final read counts are drawn from a beta-binomial distribution with
expectation \mjseqnw_error. The raw number of draws is given by the
coverage
, and the skew of the distribution is given by the
overdispersion
parameter. If the overdispersion
is equal to
zero, then the distribution is binomial, rather than beta-binomial.
An object of class sim
. Contains a list of
tibbles:
parameters
contains each parameter and the value used to simulate data.
strain_proportions
contains the proportion of each strain.
phased_haplotypes
contains the phased haplotype for each strain at each
locus.
data
contains the following columns:
plmaf
: The population-level minor allele frequency.
coverage
: The coverage at each locus.
counts
: The count at each locus.
wsaf
: The within-sample minor allele frequency.
Other simulated data functions:
plot-simulation
,
process_sim()
sim_biallelic(coi = 5)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.