ExpectedHindHe: Simulate Data to Get Expected Distribution of Hind/He

View source: R/simulation.R

ExpectedHindHeR Documentation

Simulate Data to Get Expected Distribution of Hind/He

Description

These functions were created to help users determine an appropriate cutoff for filtering loci based on H_{ind}/H_E after running HindHe and InbreedingFromHindHe. ExpectedHindHe takes allele frequencies, sample size, and read depths from a RADdata object, simulates genotypes and allelic read depths from these assuming Mendelian inheritance, and then estimates H_{ind}/H_E for each simulated locus. ExpectedHindHeMapping performs similar simulation and estimation, but in mapping populations based on parental genotypes and expected distribution of progeny genotypes. SimGenotypes, SimGenotypesMapping, and SimAlleleDepth are internal functions used by ExpectedHindHe and ExpectedHindHeMapping but are provided at the user level since they may be more broadly useful.

Usage

ExpectedHindHe(object, ploidy = object$possiblePloidies[[1]], inbreeding = 0,
               overdispersion = 20, contamRate = 0, errorRate = 0.001,
               reps = ceiling(5000/nLoci(object)),
               quiet = FALSE, plot = TRUE)

ExpectedHindHeMapping(object, ploidy = object$possiblePloidies[[1]],
                      n.gen.backcrossing = 0, n.gen.selfing = 0,
                      overdispersion = 20, contamRate = 0, errorRate = 0.001,
                      freqAllowedDeviation = 0.05, 
                      minLikelihoodRatio = 10, reps = ceiling(5000/nLoci(object)),
                      quiet = FALSE, plot = TRUE)

SimGenotypes(alleleFreq, alleles2loc, nsam, inbreeding, ploidy)

SimGenotypesMapping(donorGen, recurGen, alleles2loc, nsam,
                    ploidy.don, ploidy.rec,
                    n.gen.backcrossing, n.gen.selfing)

SimAlleleDepth(locDepth, genotypes, alleles2loc, overdispersion = 20,
               contamRate = 0, errorRate = 0.001)

Arguments

object

A RADdata object.

ploidy

A single integer indicating the ploidy to use for genotype simulation. For ExpectedHindHe and ExpectedHindHeMapping, this number will be multiplied by the values in GetTaxaPloidy(object) then divided by two to determine the ploidy of each individual for simulation.

inbreeding

A number ranging from 0 to 1 indicating the amount of inbreeding (F). This represents inbreeding from all sources (population structure, self-fertilization, etc.) and can be estimated with InbreedingFromHindHe.

overdispersion

Overdispersion parameter as described in AddGenotypeLikelihood. Lower values will cause allelic read depth distributions to deviate further from expectations based on allele copy number.

contamRate

Sample cross-contamination rate to simulate. Although 0 is the default, 0.001 is also reasonable.

errorRate

Sequencing error rate to simulate. For Illumina reads, 0.001 is a reasonable value. An error is assumed to have an equal chance of converting an allele to any other allele at the locus, although this is somewhat of an oversimplification.

reps

The number of times to simulate the data and estimate H_{ind}/H_E. This can generally be left at the default, but set it higher than 1 if you want to see within-locus variance in the estimate.

quiet

Boolean indicating whether to suppress messages and results printed to console.

plot

Boolean indicating whether to plot a histogram of H_{ind}/H_E values.

n.gen.backcrossing

An integer indicating the number of generations of backcrossing.

n.gen.selfing

An integer indicating the number of generations of self-fertilization.

freqAllowedDeviation

The amount by which allele frequencies are allowed to deviate from expected allele frequencies. See AddAlleleFreqMapping.

minLikelihoodRatio

Minimum likelihood ratio for determining the most likely parental genotypes. See GetLikelyGen.

alleleFreq

A vector of allele frequencies, as can be found in the $alleleFreq slot of a RADdata object after running AddAlleleFreqHWE.

alleles2loc

An integer vector assigning alleles to loci, as can be found in the $alleles2loc slot of a RADdata object.

nsam

An integer indicating the number of samples (number of taxa) to simulate.

donorGen

A vector indicating genotypes of the donor parent (which can be either parent if backcrossing was not performed), with one value for each allele in the dataset, and numbers indicating the copy number of each allele.

recurGen

A vector indicating genotypes of the recurrent parent, as with donorGen.

ploidy.don

A single integer indicating the ploidy of the donor parent.

ploidy.rec

A single integer indicating the ploidy of the recurrent parent.

locDepth

An integer matrix indicating read depth at each taxon and locus. Formatted as the $locDepth slot of a RADdata object, notably with columns named by locus number rather than locus name.

genotypes

A numeric matrix, formatted as the output of GetProbableGenotypes or SimGenotypes, indicating genotypes as allele copy number.

Details

To prevent highly inflated values in the output, ExpectedHindHe filters loci with minor allele frequencies below five times the sequencing error rate.

Value

ExpectedHindHe and ExpectedHindHeMapping invisibly return a matrix, with loci in rows and reps in columns, containing H_{ind}/H_E from the simulated loci.

SimGenotypes and SimGenotypesMapping return a numeric matrix of allele copy number, with samples in rows and alleles in columns, similar to that produced by GetProbableGenotypes.

SimAlleleDepth returns an integer matrix of allelic read depth, with samples in rows and alleles in columns, similar to the $alleleDepth slot of a RADdata object.

Author(s)

Lindsay V. Clark

References

Clark, L. V., Mays, W., Lipka, A. E. and Sacks, E. J. (2022) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 23, 101, doi:10.1186/s12859-022-04635-9.

Examples

# Load dataset for the example
data(exampleRAD)
exampleRAD <- AddAlleleFreqHWE(exampleRAD)

# Simulate genotypes
simgeno <- SimGenotypes(exampleRAD$alleleFreq, exampleRAD$alleles2loc, 10, 0.2, 2)

# Simulate reads
simreads <- SimAlleleDepth(exampleRAD$locDepth[1:10,], simgeno, exampleRAD$alleles2loc)

# Get expected Hind/He distribution if all loci in exampleRAD were well-behaved
ExpectedHindHe(exampleRAD, reps = 10)

# Mapping population example
data(exampleRAD_mapping)
exampleRAD_mapping <- SetDonorParent(exampleRAD_mapping, "parent1")
exampleRAD_mapping <- SetRecurrentParent(exampleRAD_mapping, "parent2")
exampleRAD_mapping <- AddAlleleFreqMapping(exampleRAD_mapping,
                                           expectedFreqs = c(0.25, 0.75),
                                           allowedDeviation = 0.08)
exampleRAD_mapping <- AddGenotypeLikelihood(exampleRAD_mapping)
exampleRAD_mapping <- EstimateParentalGenotypes(exampleRAD_mapping,
                                                n.gen.backcrossing = 1)

simgenomap <- SimGenotypesMapping(exampleRAD_mapping$likelyGeno_donor[1,],
                                  exampleRAD_mapping$likelyGeno_recurrent[1,],
                                  exampleRAD_mapping$alleles2loc,
                                  nsam = 10, ploidy.don = 2, ploidy.rec = 2,
                                  n.gen.backcrossing = 1,
                                  n.gen.selfing = 0)
                                  
ExpectedHindHeMapping(exampleRAD_mapping, n.gen.backcrossing = 1, reps = 10)

lvclark/polyRAD documentation built on Jan. 15, 2024, 4:19 a.m.