makeUR: Make an unrelated (UR) population

View source: R/makeUR.R

makeUR R Documentation

Make an unrelated (UR) population

Description

Create a UR object from an RA object, perform standard filtering, and compute statistics specific to unrelated populations.

Usage

makeUR(
  RAobj,
  ploid = 2,
  indsubset = NULL,
  filter = list(MAF = 0.01, MISS = 0.5, BIN = 100, HW = c(-0.05, Inf), MAXDEPTH = 500),
  mafEst = TRUE,
  nThreads = 2
)

Arguments

RAobj

Object of class RA created via the readRA function.

ploid

An integer number specifying the ploidy level of the population. Currently, only a ploidy level of two (diploid) is implemented.

indsubset

Integer vector specifying which samples of the RA dataset to retain in the UR population.

filter

Named list of thresholds for the various criteria used to filter SNPs. See below for details.

mafEst

Logical value indicating whether the allele frequencies and sequencing error parameters are to be estimated for each SNP (see details).

nThreads

Integer value specifying the number of threads to use in the parallelization (see details). Only used in the estimation of allele frequencies when mafEst=TRUE.

Details

If mafEst=TRUE, then the major allele frequency and sequencing error rate for each SNP are estimated by optimizing the likelihood

P(Y=a) = \sum_{G} P(Y=a|G)P(G)

where P(G) are the genotype probabilities under Hardy-Weinberg equilibrium (HWE) and P(Y=a|G) are the probabilities given in Equation (5) of \insertCite{bilton2018genetics2;textual}{GUSbase}. Otherwise, the allele frequencies are taken as the mean of the allele ratio (defined as the number of reference reads divided by the total number of reads) and the sequencing error rate is assumed to be zero.
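The two estimation routes described above can be sketched in base R. This is an illustrative sketch only, not the GUSbase implementation: it assumes a simple binomial read model with a per-read error rate `ep`, whereas the package uses the form of P(Y=a|G) given in the cited equation, which may differ. It also assumes that individuals with no reads are dropped from the mean allele ratio, and it takes `p` to be the frequency of the reference allele. The names `ref`, `alt`, `naive_freq` and `snp_loglik` are hypothetical.

```r
# Hypothetical read counts for one SNP (one entry per individual)
ref <- c(3, 0, 5, 2, 0, 4)   # reference reads
alt <- c(0, 2, 1, 2, 0, 0)   # alternate reads

# mafEst = FALSE route: mean allele ratio (reference reads / total reads).
# ASSUMPTION: individuals with no reads are ignored.
naive_freq <- function(ref, alt) {
  depth <- ref + alt
  mean(ref[depth > 0] / depth[depth > 0])
}

# mafEst = TRUE route: log-likelihood summed over individuals of
# log sum_G P(Y=a|G) P(G), with P(G) taken from HWE.
# ASSUMPTION: P(Y=a|G) is binomial with error rate ep (not the package's exact model).
snp_loglik <- function(p, ep, ref, alt) {
  depth <- ref + alt
  # HWE probabilities for G = 2, 1, 0 copies of the reference allele
  pG <- c(p^2, 2 * p * (1 - p), (1 - p)^2)
  # Probability that a single read shows the reference allele, given G
  pRead <- c(1 - ep, 0.5, ep)
  lik <- sapply(seq_along(ref), function(i) {
    sum(dbinom(ref[i], depth[i], pRead) * pG)
  })
  sum(log(lik))
}

# Maximise over (p, ep) on transformed scales so that 0 < p < 1 and 0 < ep < 0.5
fit <- optim(c(0, -3), function(par) {
  -snp_loglik(plogis(par[1]), 0.5 * plogis(par[2]), ref, alt)
})
p_hat  <- plogis(fit$par[1])
ep_hat <- 0.5 * plogis(fit$par[2])
```

An individual with no reads contributes a factor of one to the likelihood in this sketch, so missing data neither help nor hurt the fit.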

The filtering criteria currently implemented are

  • Minor allele frequency (MAF): SNPs are discarded if their MAF is less than the threshold (default is 0.01)

  • Proportion of missing data (MISS): SNPs are discarded if the proportion of individuals with no reads (i.e. a missing genotype) is greater than the threshold value (default is 0.5)

  • Bin size for SNP selection (BIN): SNPs are binned together if the distance (in base pairs) between them is less than the threshold value (default is 100). One SNP is then randomly selected from each bin and retained for the final analysis. This filter ensures that there is only one SNP on each sequence read.

  • Hardy-Weinberg distance (HW): SNPs are discarded if their Hardy-Weinberg distance is less than the first threshold value (default is -0.05) or greater than the second threshold value (default is Inf). This filtering criterion has been taken from the KGD software (https://github.com/AgResearch/KGD).

  • Maximum average SNP read depth (MAXDEPTH): SNPs are discarded if the average read depth for the SNP is larger than the threshold (default is 500)

If filter = NULL, then no filtering is performed.
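The MAF, MISS, MAXDEPTH and BIN criteria listed above can be illustrated with a small base R sketch on made-up read counts. It is not the package's code: the allele frequency is taken as the mean allele ratio, the mean depth is assumed to include individuals with zero reads, and the binning is assumed to chain consecutive SNPs on a single chromosome. The HW criterion is omitted because its exact definition comes from KGD. All object names are hypothetical.

```r
# Hypothetical read counts: rows are individuals, columns are SNPs
ref <- cbind(c(4, 3, 5, 2), c(2, 0, 1, 3), c(600, 700, 500, 800), c(0, 0, 0, 1))
alt <- cbind(c(0, 0, 0, 0), c(1, 2, 1, 0), c(500, 400, 600, 300), c(0, 0, 0, 1))
depth <- ref + alt

# Thresholds matching the documented defaults
filter <- list(MAF = 0.01, MISS = 0.5, BIN = 100, MAXDEPTH = 500)

ratio <- ref / depth                  # NaN where an individual has no reads
p <- colMeans(ratio, na.rm = TRUE)    # mean allele ratio per SNP
maf <- pmin(p, 1 - p)                 # minor allele frequency
miss <- colMeans(depth == 0)          # proportion of individuals with no reads
meandepth <- colMeans(depth)          # ASSUMPTION: zeros are included in the mean

# A SNP is kept only if it passes every criterion
keep <- maf >= filter$MAF & miss <= filter$MISS & meandepth <= filter$MAXDEPTH

# BIN: group SNPs whose neighbours are closer than the threshold, then pick one per bin.
# ASSUMPTION: positions are sorted, on one chromosome, and bins chain consecutive SNPs.
pos <- c(100, 150, 220, 1000, 1050, 5000)
bin <- cumsum(c(TRUE, diff(pos) >= filter$BIN))
picked <- vapply(split(seq_along(pos), bin),
                 function(i) i[sample.int(length(i), 1)],
                 integer(1))
```

Here the first SNP fails on MAF (it is monomorphic), the third on MAXDEPTH and the fourth on MISS, so only the second SNP is kept. The six positions fall into three bins, and `picked` holds one randomly chosen SNP index from each.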

Estimation of the allele frequencies when mafEst=TRUE is parallelized using OpenMP in compiled C code, where the number of threads used in the parallelization is specified by the argument nThreads.
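As a usage sketch, the arguments described above might be combined as follows. The threshold values and the name `my_filter` are arbitrary illustrative choices, not recommendations, and the data are built with the same calls as in the Examples section. The call is guarded so that the snippet does nothing when GUSbase is not installed.

```r
# Illustrative (hypothetical) thresholds; the names match the documented filter list
my_filter <- list(MAF = 0.05, MISS = 0.2, BIN = 100,
                  HW = c(-0.05, Inf), MAXDEPTH = 500)

if (requireNamespace("GUSbase", quietly = TRUE)) {
  library(GUSbase)
  # Simulated dataset, as in the Examples section
  file <- simDS()
  RAfile <- VCFtoRA(file$vcf)
  simdata <- readRA(RAfile)

  # Stricter filtering, with allele frequencies estimated on 4 threads
  urpop <- makeUR(simdata, filter = my_filter, mafEst = TRUE, nThreads = 4)

  # No filtering, using the mean allele ratio instead of the likelihood estimates
  urpop_raw <- makeUR(simdata, filter = NULL, mafEst = FALSE)
}
```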

Value

An R6 object of class UR.

Author(s)

Timothy P. Bilton and Ken G. Dodds

References

\insertRef{bilton2018genetics2}{GUSbase}

Examples

file <- simDS()
RAfile <- VCFtoRA(file$vcf)
simdata <- readRA(RAfile)

## make unrelated population
urpop <- makeUR(simdata)

tpbilton/GUSbase documentation built on March 8, 2024, 1:35 p.m.