sim_biallelic: Simulate biallelic data

View source: R/simulate.R

sim_biallelicR Documentation

Simulate biallelic data

Description

Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level minor allele frequencies (PLMAF), and some parameters dictating skew and error distributions. Outputs include the phased haplotypes and the unphased read count and coverage data.

Usage

sim_biallelic(
  coi,
  plmaf = runif(10, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0
)

Arguments

coi

Complexity of infection.

plmaf

Vector of population-level minor allele frequencies at each locus.

coverage

Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci.

alpha

Shape parameter of the symmetric Dirichlet prior on strain proportions.

overdispersion

The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \mjseqn\fracpoverdispersion and \mjseqn\frac1-poverdispersion.

relatedness

The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (\Sexpr[results=rd]{tools:::Rd_expr_doi("https://doi.org/10.1371/journal.pcbi.1005348")}): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))."

epsilon

The probability of a single read being miscalled as the other allele. This error is applied in both directions.

Details

\loadmathjax

Simulated data are drawn from a simple statistical model:

  1. Strain proportions are drawn from a symmetric Dirichlet distribution with shape parameter alpha.

  2. Phased haplotypes are drawn at every locus, one for each coi. The allele at each locus is drawn from a Bernoulli distribution with probability given by the plmaf.

  3. The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation \mjsdeqnwsmaf_error = wsmaf(1-e) + (1-wsmaf)e where \mjseqnwsmaf is the WSMAF without error and \mjseqne is the error parameter epsilon.

  4. Final read counts are drawn from a beta-binomial distribution with expectation \mjseqnw_error. The raw number of draws is given by the coverage, and the skew of the distribution is given by the overdispersion parameter. If the overdispersion is equal to zero, then the distribution is binomial, rather than beta-binomial.

Value

An object of class sim. Contains a list of tibbles:

  • parameters contains each parameter and the value used to simulate data.

  • strain_proportions contains the proportion of each strain.

  • phased_haplotypes contains the phased haplotype for each strain at each locus.

  • data contains the following columns:

    • plmaf: The population-level minor allele frequency.

    • coverage: The coverage at each locus.

    • counts: The count at each locus.

    • wsaf: The within-sample minor allele frequency.

See Also

Other simulated data functions: plot-simulation, process_sim()

Examples

sim_biallelic(coi = 5)

bailey-lab/coiaf documentation built on April 26, 2023, 6:32 p.m.