simulate_data: Generate Simulated Data
In cbirdlab/impostar: ImPoStAR: Implement Population Structure Analyses in R


Generates a simulated SNP dataset for any number of populations (pools), given the number of individuals, the number of reads per SNP, and the reference allele frequency for each population, plus the number of SNPs to simulate. Intended for use with the runLogRegTest or runAMOVA functions.

simulate_data(nIndiv, nReads, RefProb, nSNPs, file_name = FALSE)

`nIndiv`	a vector of integers specifying the number of individuals in each pool. The vector length indicates the number of pools and MUST be of equal length to nReads and RefProb. Required.
`nReads`	a vector of integers specifying the number of reads per SNP in each pool. The vector length indicates the number of pools and MUST be of equal length to nIndiv and RefProb. Required.
`RefProb`	a numeric vector specifying the true reference allele frequency in each pool; the same frequency is used for every SNP in that pool. The vector length indicates the number of pools and MUST be of equal length to nIndiv and nReads. Required.
`nSNPs`	an integer indicating the number of SNPs to simulate. Required.
`file_name`	a logical or a string controlling whether the simulated dataset is written to a csv file in the working directory. If TRUE, a file is written and named from the numbers of individuals, reads, and allele frequencies. If a string, '.csv' is appended to the string and the file is written under that name. If FALSE (default), no file is written.
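The length requirement on nIndiv, nReads, and RefProb can be checked before calling simulate_data. The helper below is a minimal sketch: `check_pool_args` is a hypothetical name and is not part of impostar, and the range check assumes RefProb holds proportions between 0 and 1.

```r
# Hypothetical helper, NOT part of impostar: checks the documented
# equal-length requirement before the vectors are passed to simulate_data().
check_pool_args <- function(nIndiv, nReads, RefProb) {
  lens <- c(nIndiv = length(nIndiv),
            nReads = length(nReads),
            RefProb = length(RefProb))
  if (length(unique(lens)) != 1) {
    stop("nIndiv, nReads, and RefProb must have equal length; got ",
         paste(names(lens), lens, sep = "=", collapse = ", "))
  }
  # Assumption: allele frequencies are expressed as proportions in [0, 1].
  if (any(RefProb < 0 | RefProb > 1)) {
    stop("RefProb values must lie in [0, 1]")
  }
  invisible(lens[[1]])  # the number of pools
}

nPools <- check_pool_args(c(20, 45, 50, 55),
                          c(100, 115, 200, 150),
                          c(0.1, 0.15, 0.5, 0.9))
```

Here `nPools` holds the number of pools implied by the vector lengths; mismatched lengths stop with an error that reports each length.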

The simulate_data function returns a data frame and, optionally, writes it to a csv file. Each row is a SNP. For each population, the columns give the total number of alleles (2n, TotAlleles), read depth (DP), reference alleles (RefAl), alternate alleles (AltAl), reference reads (RD), and alternate reads (AD); the number in each column name identifies the population.
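To illustrate how the six per-population columns relate to one another, the sketch below builds a data frame with the same layout using two rounds of binomial sampling (alleles into the pool, then reads from the pool). This is an assumed model for illustration only, not the impostar implementation; `sketch_pool_sim` and the exact column names (base name plus pool number) are hypothetical.

```r
# Illustrative two-stage binomial sketch; NOT the impostar implementation.
sketch_pool_sim <- function(nIndiv, nReads, RefProb, nSNPs) {
  stopifnot(length(nIndiv) == length(nReads),
            length(nIndiv) == length(RefProb))
  pools <- lapply(seq_along(nIndiv), function(i) {
    tot <- 2 * nIndiv[i]  # 2n alleles per pool, as in TotAlleles
    # Stage 1 (assumed): reference alleles sampled into the pool.
    refAl <- rbinom(nSNPs, size = tot, prob = RefProb[i])
    # Stage 2 (assumed): reference reads sampled from the pooled alleles.
    rd <- rbinom(nSNPs, size = nReads[i], prob = refAl / tot)
    out <- data.frame(TotAlleles = tot, DP = nReads[i],
                      RefAl = refAl, AltAl = tot - refAl,
                      RD = rd, AD = nReads[i] - rd)
    names(out) <- paste0(names(out), i)  # hypothetical naming: suffix = pool number
    out
  })
  do.call(cbind, pools)  # one row per SNP, six columns per pool
}

set.seed(1)
sim <- sketch_pool_sim(c(20, 45), c(100, 115), c(0.1, 0.5), nSNPs = 5)
```

In this layout, RefAl and AltAl always sum to TotAlleles, and RD and AD always sum to DP, within each pool.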

Rebecca M. Hamner, Jason D. Selwyn, Evan Krell, Scott A. King, Christopher E. Bird

Hamner, R.M., J.D. Selwyn, E. Krell, S.A. King, and C.E. Bird. In review. Modeling next-generation sequencer sampling error in pooled population samples dramatically reduces false positives in genetic structure tests.

runLogRegTest, runAMOVA

# if arguments are stored in variables (the values below are illustrative)
nIndiv <- c(20, 45, 50, 55)
nReads <- c(100, 115, 200, 150)
RefProb <- c(0.1, 0.15, 0.5, 0.9)
nSNPs <- 1000
simdata <- simulate_data(nIndiv, nReads, RefProb, nSNPs, file_name = FALSE)
# three pools with equal numbers of individuals, reads, and reference allele frequencies; csv written with an automatic name
simdata <- simulate_data(rep(50, 3), rep(100, 3), rep(0.5, 3), 1000, file_name = TRUE)
# arguments supplied directly to the function for four pools, with a specified csv file name
simdata <- simulate_data(c(20, 45, 50, 55), c(100, 115, 200, 150), c(0.1, 0.15, 0.5, 0.9), 1000, file_name = "mySimulatedData")