coalescent.sim: Simulate a tree, phenotype, and genetic data.
In caitiecollins/treeWAS: Phylogenetic tree-based microbial GWAS

coalescent.sim

R Documentation

Simulate a tree, phenotype, and genetic data.

Description

This funtion allows the user to simulate a phylogenetic tree, as well as phenotypic and genetic data, including associated and unassociated loci.

Usage

coalescent.sim(
  n.ind = 100,
  n.snps = 10000,
  n.subs = 1,
  n.snps.assoc = 0,
  assoc.prob = 100,
  n.phen.subs = 15,
  phen = NULL,
  plot = TRUE,
  heatmap = FALSE,
  reconstruct = FALSE,
  dist.dna.model = "JC69",
  grp.min = 0.25,
  row.names = TRUE,
  set = 1,
  tree = NULL,
  coaltree = TRUE,
  s = 20,
  af = 10,
  filename.plot = NULL,
  seed = NULL
)

Arguments

`n.ind`	An integer specifying the number of individual genomes to simulate (ie. the number of terminal nodes in the tree).
`n.snps`	An integer specifying the number of genetic loci to simulate.
`n.subs`	Either an integer or a vector (containing a distribution) that is used to determine the number of substitutions to occur on the phylogenetic tree for each genetic locus (see details).
`n.snps.assoc`	An optional integer specifying the number of genetic loci
`assoc.prob`	An optional integer (> 0, <= 100) specifying the strength of the association between the n.snps.assoc loci and the phenotype (see details).
`n.phen.subs`	An integer specifying the expected number of phenotypic substitutions to occur on the phylogenetic tree (through the same process as the n.subs parameter when n.subs is an integer (see details)).
`phen`	An optional vector containing a phenotype for each of the n.ind individuals if no phenotypic simulation is desired.
`plot`	A logical indicating whether to generate a plot of the phylogenetic tree (`TRUE`) or not (`FALSE`, the default).
`heatmap`	A logical indicating whether to produce a heatmap of the genetic distance between the simulated genomes of the n.ind individuals.
`reconstruct`	Either a logical indicating whether to attempt to reconstruct a phylogenetic tree using the simulated genetic data, or one of c("UPGMA", "nj", "ml") to specify that tree reconstruction is desired by one of these three methods (Unweighted Pair Group Method with Arithmetic Mean, Neighbour-Joining, Maximum-Likelihood).
`dist.dna.model`	A character string specifying the type of model to use in reconstructing the phylogenetic tree for calculating the genetic distance between individual genomes, only used if `tree` is a character string (see ?dist.dna).
`grp.min`	An optional number between 0.1 and 0.9 to control the proportional size of the smaller phenotypic group.
`row.names`	An optional vector containing row names for the individuals to be simulated.
`set`	An integer (1, 2, or 3) required to select the method of generating associated loci if `n.snps.assoc` is not zero.
`coaltree`	A logical indicating whether to generate a coalescent tree (`TRUE`, the default), or an rtree-type tree (`FALSE`, see ?rtree).
`s`	If `set` is 3, the `s` parameter controls a baseline number of substitutions to be experienced by the phenotype and associated loci: by default, 20.
`af`	If `set` is 3, the `af` parameter provides an association factor, controlling the preference for association over non-association at associated loci: by default, 10 (for a 10x preference).
`filename.plot`	An optional character string denoting the file location for saving any plots produced; else `NULL`.
`seed`	An optional integer to control the pseudo-randomisation process and allow for identical repeat runs of the function; else `NULL`.

Details

Homoplasy Distribution

The homoplasy distribution contains the number of substitutions per site.

If the value of the n.subs parameter is set to an integer, this integer is used as the parameter of a Poisson distribution from which the number of substitutions to occur on the phylogenetic tree is drawn for each of the n.snps simulated genetic loci.

The n.subs argument can also be used to provide a distribution to define the number of substitutions per site.

It must be in the form of a named vector (or table), or a vector in which the i'th element contains the number of loci that have been estimated to undergo i substitutions on the tree. The vector must be of length max n.subs, and "empty" indices must contain zeros. For example: the vector n.subs = c(1833, 642, 17, 6, 1, 0, 0, 1), could be used to define the homoplasy distribution for a dataset with 2500 loci, where the maximum number of substitutions to be undergone on the tree by any locus is 8, and no loci undergo either 6 or 7 substitutions.

Association Probability

The assoc.prob parameter is only functional when set is set to 1. If so, assoc.prob controls the strength of association through a process analagous to dilution. All n.snps.assoc loci are initially simulated to undergo a substitution every time the phenotype undergoes a substitution (ie. perfect association). The assoc.prob parameter then acts like a dilution factor, removing (100 - assoc.prob)% of the substitutions that occurred during simulation under perfect association.

Author(s)

Caitlin Collins caitiecollins@gmail.com

Examples

## Not run: 
## load example homoplasy distribution
data(dist_0)
str(dist_0)

## simulate a matrix with 10 associated loci:
dat <- coalescent.sim(n.ind = 100,
                        n.snps = 1000,
                        n.subs = dist_0,
                        n.snps.assoc = 10,
                        assoc.prob = 90,
                        n.phen.subs = 15,
                        phen = NULL,
                        plot = TRUE,
                        heatmap = FALSE,
                        reconstruct = FALSE,
                        dist.dna.model = "JC69",
                        grp.min = 0.25,
                        row.names = NULL,
                        coaltree = TRUE,
                        s = NULL,
                        af = NULL,
                        filename = NULL,
                        set = 1,
                        seed = 1)

## examine output:
str(dat)

## isolate elements of output:
snps <- dat$snps
phen <- dat$phen
snps.assoc <- dat$snps.assoc
tree <- dat$tree

## End(Not run)

caitiecollins/treeWAS documentation built on Feb. 11, 2025, 11:25 a.m.