sim.parent.assign.fun: sim.parent.assign.fun


View source: R/sim.parent.assign.fun.R

Description

This function uses stochastic simulation to determine the proportion of correct assignments and, for maximum likelihood approaches, the critical delta LOD values. For each repetition, the 'snp.dat.indiv', 'snp.dat.pools', 'snp.param.indiv', 'snp.param.pools', 'fam.set.combns' and 'fam.set.combns.by.pool' data frames for one pooled DNA sample are generated from the user-defined 'ped', 'map', 'true.snp.param.indiv' and 'sim.fam.sets' data frames. Parentage is then assigned for each simulated pool using parent.assign.fun.

Usage

sim.parent.assign.fun(
  n_repetitions,
  ped,
  map,
  missing.parents = NULL,
  true.snp.param.indiv,
  sim.fam.sets = NULL,
  method,
  beta.min.ss = FALSE,
  discrete.method = "geno.probs",
  threshold.indiv = NULL,
  threshold.pools = NULL,
  n.in.pools,
  min.intensity = 0,
  snp.error.assumed = NULL,
  snp.error.underlying = NULL,
  min.sd = 0,
  fams,
  skip.checks = FALSE
)

Arguments

n_repetitions

is an integer variable defining the number of repetitions in the simulation.

ped

is a data frame and a conventional pedigree file with one additional column. It must include all SAMPLE_IDs used to construct snp.param.indiv and all possible parents of pooled samples. It contains the following headings (class in parentheses):

  • 'SAMPLE_ID' is the individual (i.e. not a pooled sample) sample identifier. Individuals with no true SAMPLE_ID should be assigned a dummy SAMPLE_ID (integer).

  • 'SIRE_ID' is the SAMPLE_ID of the sire (0 if unknown) (integer).

  • 'DAM_ID' is the SAMPLE_ID of the dam (0 if unknown) (integer).

  • 'SAMPLED' is TRUE if the individual was used to generate snp.param.indiv (logical).
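As a rough illustration of the layout described above, a toy 'ped' data frame might look like the following. The identifiers and the choice of which individuals are flagged as SAMPLED are invented for this sketch and are not taken from the package data; the final check (that every non-zero parent also appears as a SAMPLE_ID) is a sanity check one might run, not something the package is known to require in this form.

```r
# Hypothetical toy pedigree: four founders (unknown parents coded 0)
# and two offspring. All values are invented for illustration.
ped <- data.frame(
  SAMPLE_ID = 1:6,
  SIRE_ID   = c(0L, 0L, 0L, 0L, 1L, 3L),
  DAM_ID    = c(0L, 0L, 0L, 0L, 2L, 4L),
  SAMPLED   = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Sanity check (an assumption of this sketch): every non-zero sire or dam
# is itself listed as a SAMPLE_ID.
parents <- setdiff(c(ped$SIRE_ID, ped$DAM_ID), 0L)
all(parents %in% ped$SAMPLE_ID)
```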

map

is a data frame containing a genetic map that identifies the position of each SNP. It contains the following headings (class in parentheses):

  • 'CHROMOSOME' is the chromosome number. To assume that SNP are not linked, provide a unique CHROMOSOME number for each SNP_ID (integer).

  • 'SNP_ID' is the SNP identifier (ordered by physical position within chromosome) (character).

  • 'GENETIC_POSITION' is the SNP genetic position in Morgans. To assume that SNP are not linked, make all GENETIC_POSITION = 0 (numeric).

  • 'B_ALLELE_FREQ' is the frequency of the B allele in the population.

  • 'ERROR_RATE' is the SNP error rate (i.e. the proportion of individuals/pools with signal intensity data from a random genotype rather than the true genotype for the SNP_ID). Refer to Hamilton 2020 (numeric).

  • 'PROP_MISS' is the proportion missing data for the SNP_ID (numeric).
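A minimal sketch of a 'map' data frame for three unlinked SNP is shown below. It applies both options mentioned above at once (a unique CHROMOSOME per SNP_ID and GENETIC_POSITION = 0); the SNP names, allele frequencies, error rate and missing-data proportion are made-up values for illustration only.

```r
# Hypothetical map for three unlinked SNP. All values are invented.
map <- data.frame(
  CHROMOSOME       = 1:3,                           # unique per SNP: unlinked
  SNP_ID           = c("snp_1", "snp_2", "snp_3"),
  GENETIC_POSITION = 0,                             # Morgans; 0 for all SNP
  B_ALLELE_FREQ    = c(0.25, 0.50, 0.40),
  ERROR_RATE       = 0.01,
  PROP_MISS        = 0.02,
  stringsAsFactors = FALSE
)

# Confirm that no two SNP share a chromosome number.
anyDuplicated(map$CHROMOSOME) == 0
```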

missing.parents

is a vector identifying parents with no SNP data (i.e. known missing parents). Samples/individuals in missing.parents must be present as a SIRE_ID or a DAM_ID in ped (default = NULL).

true.snp.param.indiv

is a data frame detailing the assumed SNP parameters of the population with the following headings (class in parentheses):

  • 'SNP_ID' is the SNP identifier (character).

  • 'MEAN_P_AA' is the mean of allelic proportion for homozygous A genotypes (numeric).

  • 'SD_P_AA' is the standard deviation of allelic proportion for homozygous A genotypes (numeric).

  • 'MEAN_P_AB' is the mean of allelic proportion for heterozygous (AB) genotypes (numeric).

  • 'SD_P_AB' is the standard deviation of allelic proportion for heterozygous (AB) genotypes (numeric).

  • 'MEAN_P_BB' is the mean of allelic proportion for homozygous B genotypes (numeric).

  • 'SD_P_BB' is the standard deviation of allelic proportion for homozygous B genotypes (numeric).

  • 'A_ALLELE' is the base represented by allele A (i.e. 'A', 'C', 'G' or 'T') (character).

  • 'B_ALLELE' is the base represented by allele B (i.e. 'A', 'C', 'G' or 'T') (character).
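For orientation, a toy 'true.snp.param.indiv' data frame with the headings listed above could be built as follows. The means and standard deviations are invented, and the assumption that homozygous A genotypes sit near 0 and homozygous B genotypes near 1 is a guess about the scale of the allelic proportion, not something stated on this page.

```r
# Hypothetical SNP parameters for two SNP. All values are invented.
true.snp.param.indiv <- data.frame(
  SNP_ID    = c("snp_1", "snp_2"),
  MEAN_P_AA = c(0.05, 0.04), SD_P_AA = c(0.02, 0.02),
  MEAN_P_AB = c(0.50, 0.48), SD_P_AB = c(0.03, 0.04),
  MEAN_P_BB = c(0.95, 0.96), SD_P_BB = c(0.02, 0.02),
  A_ALLELE  = c("A", "C"),
  B_ALLELE  = c("G", "T"),
  stringsAsFactors = FALSE
)

# Check that all documented headings are present.
required <- c("SNP_ID", "MEAN_P_AA", "SD_P_AA", "MEAN_P_AB", "SD_P_AB",
              "MEAN_P_BB", "SD_P_BB", "A_ALLELE", "B_ALLELE")
all(required %in% names(true.snp.param.indiv))
```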

sim.fam.sets

is a data frame with the following headings (class in parentheses) (default = NULL). Note: if sim.fam.sets = NULL (see the example below with n.in.pools = 8), the FAMILY_IDs are taken from 'fams' and duplicated n.in.pools times, with FAM_SET_ID = 1 for the first duplication of FAMILY_IDs, 2 for the second, and so on, and with PROBABILITY = NA:

  • 'FAM_SET_ID' is the family set identifier (integer). A 'family set' is a group of families, one of which is known to be the true family of one of the individuals in a pooled sample. Within each 'family set combination' there must be a 'family set' for each individual in a pooled sample (e.g. if n.in.pools = 2 there must be two family sets in each family set combination).

  • 'FAMILY_ID' is the family identifier (integer).

  • 'PROBABILITY' is the probability that an individual from this family is represented in the pooled sample. If all are NA, it is assumed that the probability is equal for each family within the family set.
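The default behaviour described in the note above (sim.fam.sets = NULL) can be sketched as follows. This is a reconstruction from the description, not the package's internal code, and the 'fams' values are hypothetical.

```r
# Hypothetical families; only FAMILY_ID is used in this sketch.
fams <- data.frame(
  FAMILY_ID = 1:3,
  SIRE_ID   = c(11L, 12L, 13L),
  DAM_ID    = c(21L, 22L, 23L)
)
n.in.pools <- 2L

# Duplicate the FAMILY_IDs n.in.pools times, numbering each duplication
# as its own family set and leaving PROBABILITY as NA (equal probability).
sim.fam.sets <- data.frame(
  FAM_SET_ID  = rep(seq_len(n.in.pools), each = nrow(fams)),
  FAMILY_ID   = rep(fams$FAMILY_ID, times = n.in.pools),
  PROBABILITY = NA_real_
)
sim.fam.sets
```

With three families and n.in.pools = 2 this gives six rows: families 1 to 3 in family set 1, then the same families again in family set 2.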

method

is a vector of methods to be implemented (e.g. c("Quantitative", "Discrete", "Exclusion", "Least_squares"))

beta.min.ss

is a logical variable applicable to the "Least_squares" method only (default = FALSE). If TRUE, the sum of squares of all parental combinations is computed and the combination with the minimum value is identified. Refer to Hamilton 2020.

discrete.method

is a character variable applicable to the "Discrete" or "Exclusion" methods only (default = "geno.probs"). It must equal either:

  • "geno.probs" in which case discrete genotypes for parents and pools are derived from genotype probabilities.

  • "assigned.genos" in which case discrete genotypes for parents and pools are obtained directly from the snp.dat.indiv and snp.dat.pools inputs.

threshold.indiv

is a numeric variable between 0 and 1 inclusive applicable to the "Discrete" or "Exclusion" methods only when discrete.method = "geno.probs" (default = NULL). The most likely genotype in the quantitative ordered genotype probability matrix Gij is assigned as the discrete genotype if its probability is greater than threshold.indiv (or threshold.indiv / 2 for the two heterozygous genotypes). Otherwise the genotype is deemed missing (refer to the left-hand side of page 5 of Henshall et al. 2014).

threshold.pools

is a numeric variable between 0 and 1 inclusive applicable to the "Discrete" or "Exclusion" methods only when discrete.method = "geno.probs" (default = NULL). Equivalent to threshold.indiv for pooled DNA samples.
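The thresholding rule described for threshold.indiv and threshold.pools can be sketched for a single sample and SNP as below. The function name call_genotype and the genotype labels are hypothetical, and the sketch reflects one reading of the description above rather than the code used by parent.assign.fun.

```r
# Hypothetical helper: assign a discrete genotype from ordered genotype
# probabilities (AA, AB, BA, BB), or NA if the threshold is not exceeded.
call_genotype <- function(probs, threshold) {
  stopifnot(all(c("AA", "AB", "BA", "BB") %in% names(probs)))
  best <- names(probs)[which.max(probs)]
  # Each of the two heterozygous genotypes is compared with threshold / 2.
  cutoff <- if (best %in% c("AB", "BA")) threshold / 2 else threshold
  if (probs[[best]] > cutoff) {
    if (best == "BA") "AB" else best  # report the unordered genotype
  } else {
    NA_character_                     # genotype deemed missing
  }
}

call_genotype(c(AA = 0.99, AB = 0.005, BA = 0.005, BB = 0), threshold = 0.98)
call_genotype(c(AA = 0.01, AB = 0.495, BA = 0.495, BB = 0), threshold = 0.98)
call_genotype(c(AA = 0.60, AB = 0.20,  BA = 0.20,  BB = 0), threshold = 0.98)
```

The first call clears the homozygote threshold, the second clears the halved heterozygote threshold, and the third clears neither and is treated as missing.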

n.in.pools

is an integer variable representing the number of individuals that contributed DNA to each sample in snp.dat.pools.

min.intensity

is a numeric variable (default = 0). If the square root of the sum of INTENSITY_A squared and INTENSITY_B squared in snp.dat.indiv or snp.dat.pools is less than min.intensity then this record is excluded. That is, observations that fall into an arc with a radius equal to min.intensity in the lower left of signal intensity scatter plots are excluded.
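The min.intensity filter described above amounts to a radius check on the two signal intensities. A minimal sketch with invented intensity values:

```r
# Hypothetical signal intensity records. All values are invented.
snp.dat <- data.frame(
  INTENSITY_A = c(0.05, 0.80, 0.10, 0.30),
  INTENSITY_B = c(0.05, 0.10, 0.90, 0.40)
)
min.intensity <- 0.2

# Distance of each record from the origin of the intensity scatter plot.
radius <- sqrt(snp.dat$INTENSITY_A^2 + snp.dat$INTENSITY_B^2)

# Records with a radius less than min.intensity are excluded.
kept <- snp.dat[radius >= min.intensity, ]
kept
```

Only the first record (radius of about 0.07) falls inside the arc and is dropped.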

snp.error.assumed

Must be one of (default = NULL):

  • NULL. Note that if snp.error.assumed is NULL then snp.error.underlying must not be NULL.

  • a numeric variable between 0 and 1, in which case the 'assumed error rate' (see Henshall et al. 2014) is the same across all SNP.

  • a data frame with columns SNP_ID and SNP_ERROR_TILDE (see Henshall et al. 2014).

fams

is a data frame with the following headings (class in parentheses):

  • 'FAMILY_ID' is the family identifier (integer).

  • 'SIRE_ID' is the sire identifier (integer).

  • 'DAM_ID' is the dam identifier (integer).

skip.checks

is a logical variable (default = FALSE). If TRUE, parent.assign.fun data checks are not undertaken.

snp.error.underlying

Not used if snp.error.assumed is not NULL (default = NULL). Must be either:

  • NULL.

  • a numeric variable between 0 and 1 inclusive. Used to compute SNP_ERROR_TILDE from SNP_ERROR_HAT according to the approach outlined on the left-hand side of page 5 of Henshall et al. 2014, using individual (i.e. not pooled) data only. If snp.error.underlying = 0 then SNP_ERROR_TILDE = SNP_ERROR_HAT.

min.sd

is a numeric variable defining a lower bound to be applied to estimates of the standard deviation of allelic proportion for genotypes in snp.param.indiv and snp.param.pools (default = 0).
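Applying a lower bound of this kind can be sketched with pmax(). The column names below are borrowed from true.snp.param.indiv on the assumption that the estimated parameter tables use the same headings, and the values are invented.

```r
# Hypothetical estimated SNP parameters. All values are invented.
snp.param <- data.frame(
  SNP_ID  = c("snp_1", "snp_2"),
  SD_P_AA = c(0.000, 0.020),
  SD_P_AB = c(0.015, 0.004),
  SD_P_BB = c(0.030, 0.001)
)
min.sd <- 0.01

# Raise any standard deviation below min.sd up to min.sd.
sd.cols <- c("SD_P_AA", "SD_P_AB", "SD_P_BB")
snp.param[sd.cols] <- lapply(snp.param[sd.cols], pmax, min.sd)
snp.param
```

A bound above zero avoids zero-variance genotype clusters, which would otherwise arise when all observations for a genotype share the same allelic proportion.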

Value

'summary' is a data frame containing a summary of simulated pedigree assignments:

ggplot.log.quant:

ggplot.log.discrete:

'quant.sim.out' is a detailed summary for the 'Quantitative' method (refer to Hamilton 2020):

'discrete.sim.out' is a detailed summary for the 'Discrete' method (refer to Hamilton 2020):

'exclusion.sim.out' is a detailed summary for the 'Exclusion' method (refer to Hamilton 2020):

'ls.sim.beta.constrain.out' is a detailed summary for the 'least squares' method where beta hat is constrained to equal 1/n.in.pools within each FAM_SET_ID (refer to Hamilton 2020):

'ls.sim.min.ss.out' is a detailed summary for the 'least squares' method where the family combination with the lowest sum of squares is identified (refer to Hamilton 2020):

References

Henshall JM, Dierens L, Sellars MJ (2014) Quantitative analysis of low-density SNP data for parentage assignment and estimation of family contributions to pooled samples. Genetics Selection Evolution 46, 51. https://doi.org/10.1186/s12711-014-0051-y

Hamilton MG (2020) Maximum likelihood parentage assignment using quantitative genotypes

Examples

#Retrieve data for 'pooling by phenotype' example from Hamilton 2020
data(shrimp.ped)
data(shrimp.map)
data(shrimp.true.snp.param.indiv)
data(shrimp.sim.fam.sets)
data(shrimp.fams)

#Run simulation for all methods with n.in.pools = 2.  Note that 3 is not enough repetitions (1000 may be).
sim.parent.assign.fun(n_repetitions = 3, 
                      ped = shrimp.ped,
                      map = shrimp.map,
                      true.snp.param.indiv = shrimp.true.snp.param.indiv,
                      sim.fam.sets = shrimp.sim.fam.sets, # equivalent to sim.fam.sets = NULL in this case
                      method = c("Quantitative", "Discrete", "Exclusion", "Least_squares"),     
                      beta.min.ss = TRUE, 
                      discrete.method = "geno.probs",   
                      threshold.indiv = 0.98,              
                      threshold.pools = 0.98, 
                      n.in.pools = 2,                
                      snp.error.assumed = 0.01,        
                      fams = shrimp.fams
)

#Run simulation using the "Least_squares" method (beta.min.ss = FALSE) with n.in.pools = 8.
#Do not attempt large pool sizes with any other method, or with beta.min.ss = TRUE, as your
#computer is likely to say no. Note that 3 is not enough repetitions but is okay as an example.
sim.parent.assign.fun(n_repetitions = 3, 
                      ped = shrimp.ped,
                      map = shrimp.map,
                      true.snp.param.indiv = shrimp.true.snp.param.indiv,
                      sim.fam.sets = NULL, #shrimp.sim.fam.sets only appropriate for n.in.pools = 2
                      method = "Least_squares",     
                      beta.min.ss = FALSE,  
                      n.in.pools = 8,                
                      snp.error.assumed = 0.01,        
                      fams = shrimp.fams
)

mghamilton/SNPpools documentation built on Feb. 13, 2021, 12:52 a.m.