BIGRED: Bayes Inferred Genotype Replicate Error Detector (BIGRED)


Description

Detects outliers among putative replicate sequence runs of a given genotype.

Usage

BIGRED(L, chrom, mafrange, thinby, eUSER, proband, aliases.fn, expid,
  headersuffix, ncores, outprefix, returnwhat, threshold)

Arguments

L

(numeric integer) specifies the number of sites to sample and use for analysis. BIGRED samples a site only if each putative replicate has at least one read at that site.
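The sampling criterion above can be illustrated with a minimal sketch. The read-depth matrix, site names, and variable names here are hypothetical and are not taken from the BIGRED source; the sketch only shows the "at least one read in every putative replicate" rule.

```r
# Hypothetical read-depth matrix: rows are sites, columns are putative replicates.
depth <- matrix(
  c(3, 1, 2,
    0, 4, 1,
    2, 2, 5,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("1_1050", "1_2310", "1_8872", "2_440"),
                  c("rep1", "rep2", "rep3"))
)

# A site is eligible only if every replicate has at least one read there.
eligible <- rowSums(depth >= 1) == ncol(depth)
eligible_sites <- rownames(depth)[eligible]

# Sample L sites from the eligible set (or keep all of them if fewer exist).
L <- 2
sampled <- if (length(eligible_sites) > L) sample(eligible_sites, L) else eligible_sites
print(sampled)
```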

chrom

(numeric vector) specifies the chromosome(s) from which to sample the L sites.

mafrange

(numeric vector; length 2) a site is sampled only if its minor allele frequency (MAF) falls within this range. The default is c(0.0, 0.5), i.e., BIGRED samples sites regardless of MAF. In simulation experiments (paper in review), BIGRED was most accurate when analyzing sites with MAFs in the range (0.4, 0.5]. We do not recommend sampling sites with rare minor alleles.
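As a rough illustration of the MAF filter, the sketch below keeps sites whose MAF lies within mafrange. The MAF values and site names are hypothetical, and treating both endpoints as inclusive is an assumption; the documentation does not say how BIGRED handles the boundaries.

```r
# Hypothetical minor allele frequencies for five sites.
maf <- c(site1 = 0.02, site2 = 0.41, site3 = 0.50, site4 = 0.25, site5 = 0.40)
mafrange <- c(0.4, 0.5)

# Assumption: both endpoints of mafrange are treated as inclusive.
in_range <- maf >= mafrange[1] & maf <= mafrange[2]
kept_sites <- names(maf)[in_range]
print(kept_sites)
```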

thinby

(numeric integer) specifies the minimum distance (in bp) between any two sampled sites.
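One simple way to enforce a minimum spacing is a greedy pass over sorted positions on a single chromosome, sketched below. This is an assumption for illustration only; the documentation does not describe how BIGRED thins sites, and thin_sites is a hypothetical helper.

```r
# Greedy thinning (hypothetical helper): walk the sorted positions and keep a
# site only if it lies at least `thinby` bp from the last site that was kept.
thin_sites <- function(pos, thinby) {
  pos <- sort(pos)
  keep <- logical(length(pos))
  last_kept <- -Inf
  for (i in seq_along(pos)) {
    if (pos[i] - last_kept >= thinby) {
      keep[i] <- TRUE
      last_kept <- pos[i]
    }
  }
  pos[keep]
}

thinned <- thin_sites(c(100, 15000, 20100, 26000, 45000, 61000), thinby = 20000)
print(thinned)
```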

eUSER

(numeric float) specifies the fixed sequencing error rate used to calculate genotype likelihoods. This value may range from 0 to 1. For an error rate of 1 percent, enter 0.01.
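To show how a fixed error rate can enter a genotype likelihood, the sketch below uses a common binomial model for a biallelic site in a diploid. This model is an assumption chosen for illustration; the documentation does not state which likelihood BIGRED uses, and genotype_likelihoods is a hypothetical helper.

```r
# Assumed model: the probability of observing an alternate-allele read is
# e for a homozygous-reference genotype, 0.5 for a heterozygote, and
# 1 - e for a homozygous-alternate genotype.
genotype_likelihoods <- function(n_alt, n_total, e) {
  stopifnot(e >= 0, e <= 1, n_alt <= n_total)
  p_alt <- c(hom_ref = e, het = 0.5, hom_alt = 1 - e)
  sapply(p_alt, function(p) dbinom(n_alt, size = n_total, prob = p))
}

# Four alternate reads out of eight, with a 1 percent error rate.
lik <- genotype_likelihoods(n_alt = 4, n_total = 8, e = 0.01)
print(lik)
```

Under this assumed model, a balanced read count strongly favors the heterozygous genotype, and a larger eUSER makes the homozygous genotypes more tolerant of discordant reads.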

proband

(character) specifies the name of the genotype with k putative replicates, where k > 1.

aliases.fn

(character) specifies the name of the alias text file listing the names of the proband's k putative sequence runs. Refer to README.md for a description of an alias text file and formatting requirements.

expid

(character) only specify an argument for this parameter if you wish to run the function on a given proband multiple times simultaneously. One reason to apply BIGRED to the same proband more than once is to average the results of multiple runs rather than rely on the results of a single run. If the runs are executed simultaneously, use a different value of this parameter for each run to avoid overwriting output files.
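Averaging across runs, as mentioned above, could look like the following sketch. The pS vectors, their element names, and the run labels are hypothetical; the sketch assumes every run reports posterior probabilities for the same identity vectors in the same order.

```r
# Hypothetical pS vectors from three runs on one proband with k = 2,
# so there are two possible identity vectors.
runs <- list(
  run1 = c("S=(1,1)" = 0.97, "S=(1,2)" = 0.03),
  run2 = c("S=(1,1)" = 0.95, "S=(1,2)" = 0.05),
  run3 = c("S=(1,1)" = 0.99, "S=(1,2)" = 0.01)
)

# Element-wise mean across runs.
avg_pS <- Reduce(`+`, runs) / length(runs)
print(avg_pS)
```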

headersuffix

(character) BIGRED() requires a header file for each chromosome (refer to README.md for a description of this file type and formatting requirements). Each header file must follow the naming format chr[chromosome]_[headersuffix]. Enter [headersuffix] as the argument for this parameter. Example: the headersuffix associated with header file chr001_gbsheader is 'gbsheader'.
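The naming format can be sketched as below. Zero-padding the chromosome number to three digits is inferred from the single example chr001_gbsheader and may not hold for every dataset; header_file is a hypothetical helper.

```r
# Build header file names of the form chr[chromosome]_[headersuffix].
# Assumption: the chromosome number is zero-padded to three digits.
header_file <- function(chrom, headersuffix) {
  sprintf("chr%03d_%s", as.integer(chrom), headersuffix)
}

header_names <- header_file(1:3, "gbsheader")
print(header_names)
```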

ncores

(numeric integer) specifies the number of cores to use while running the function (only portions of the function are parallelized).

outprefix

(character) specifies the output filename prefix. Results are saved as Rds files in the current working directory, and output filenames end with the suffix '.rds'. If no filename prefix is supplied, the function generates a prefix of the form $proband_chrom$chromosome_L$L_maf$mafrange_thinby$thinby_BIGRED, where the L field holds either L or, when fewer than L sites are available, the number of sites available for sampling.

Example filename: I011206_chrom1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18_L1000_maf0.4,0.5_thinby20000_BIGRED.rds
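The default filename pattern can be reproduced with a short sketch. The helper default_outfile is hypothetical and is written only to match the example filename above; BIGRED's own string construction may differ.

```r
# Hypothetical helper that rebuilds the documented default output filename.
default_outfile <- function(proband, chrom, n_sites, mafrange, thinby) {
  paste0(
    proband,
    "_chrom", paste(chrom, collapse = ","),
    "_L", format(n_sites, scientific = FALSE),
    "_maf", paste(mafrange, collapse = ","),
    "_thinby", format(thinby, scientific = FALSE),
    "_BIGRED.rds"
  )
}

outfile <- default_outfile("I011206", 1:18, 1000, c(0.4, 0.5), 20000)
print(outfile)
```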

returnwhat

(character) specifies what the function returns. The user may select one of three options: "pS", "truereplicates", or "all".

"pS" returns the posterior probability of each identity vector.

"truereplicates" returns the IDs of the replicates that the algorithm determines to originate from the proband. The user must also supply an argument for the threshold parameter when selecting this option (refer to the threshold parameter description).

NOTE: The algorithm selects the source that has a clear majority. For example, when Pr( S=(1,1,2) | X ) > threshold, the algorithm returns the IDs of putative replicates d=1 and d=2 (source 1). When there is no clear majority, the algorithm randomly selects a source to return. For example, when Pr( S=(1,1,2,2) | X ) > threshold, the algorithm returns the IDs of either replicates d=1 and d=2 (source 1) or d=3 and d=4 (source 2). Likewise, when Pr( S=(1,2,3) | X ) > threshold, the algorithm returns the ID of either d=1 (source 1), d=2 (source 2), or d=3 (source 3).

"all" returns a list of four elements:
(1) replicatenames: (named list) the sample IDs of the k putative replicates associated with the proband
(2) pS: (numeric vector) the posterior probability of each identity vector
(3) statistics: (numeric vector) the mean read depth across the thinned set of sites for each sample
(4) sitenames: (character vector) the sites sampled by the algorithm, listed using the notation $chromosome_$physical position
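The majority rule described for "truereplicates" can be sketched as follows. The function select_source is a hypothetical helper, not BIGRED's code, and it covers only the step after an identity vector has already exceeded the threshold.

```r
# Given an identity vector S (one source label per putative replicate),
# return the indices of the replicates assigned to the majority source.
# Ties are broken at random, mirroring the behavior described above.
select_source <- function(S) {
  counts <- table(S)
  top <- names(counts)[counts == max(counts)]
  chosen <- if (length(top) == 1) top else sample(top, 1)
  which(S == as.integer(chosen))
}

clear_majority <- select_source(c(1, 1, 2))   # replicates 1 and 2 (source 1)
tie <- select_source(c(1, 1, 2, 2))           # either replicates 1, 2 or 3, 4
print(clear_majority)
print(tie)
```

Because ties are resolved with sample(), repeated calls on a tied identity vector can return different replicate sets unless a seed is set first.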

threshold

(numeric float) only specify an argument for this parameter if returnwhat="truereplicates" (see the returnwhat parameter); the value must fall in the range (0.5,1].
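Putting the arguments together, a call might look like the sketch below. All values are hypothetical: the proband, chromosomes, L, mafrange, thinby, and headersuffix echo the examples on this page, while "aliases.txt", ncores = 2, and threshold = 0.99 are placeholders. The sketch assumes that expid and outprefix may be omitted, as their descriptions suggest, and it only calls BIGRED() if the function is available in the session.

```r
# Hypothetical argument values; replace them with values for your own data.
bigred_args <- list(
  L = 1000,
  chrom = 1:18,
  mafrange = c(0.4, 0.5),
  thinby = 20000,
  eUSER = 0.01,                  # 1 percent sequencing error rate
  proband = "I011206",
  aliases.fn = "aliases.txt",    # placeholder alias file name
  headersuffix = "gbsheader",
  ncores = 2,
  returnwhat = "truereplicates",
  threshold = 0.99               # must lie in (0.5, 1]
)

# Only attempt the call when BIGRED() has been loaded.
if (exists("BIGRED", mode = "function")) {
  true_reps <- do.call(BIGRED, bigred_args)
}
```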

Information regarding warning messages: If L exceeds the number of available sites, the function will return a warning message informing the user of how many sites were available given the $thinby and $mafrange criteria. The function will continue to estimate the posterior probability of each identity vector regardless of this warning message. The prefix of the output filename will specify how many sites were available for sampling (see outprefix parameter description above).
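The warning behavior described above can be sketched as follows. The helper n_sites_to_use and the wording of the message are hypothetical; the sketch only shows the idea of warning and then continuing with the sites that are available.

```r
# Hypothetical helper: warn when fewer than L sites are available,
# then continue with however many sites there are.
n_sites_to_use <- function(L, n_available) {
  if (L > n_available) {
    warning(sprintf(
      "Requested L = %d sites, but only %d were available; continuing with %d.",
      as.integer(L), as.integer(n_available), as.integer(n_available)
    ))
    return(n_available)
  }
  L
}

enough <- n_sites_to_use(500, 640)
too_few <- suppressWarnings(n_sites_to_use(1000, 640))
print(c(enough, too_few))
```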

Value

The function _outputs_ an Rds file storing a list with two elements: 'results' (class: list) and 'runtime' (class: proc_time).

results (list; length 4):
(1) replicatenames: (list) the sample IDs of the k putative replicates associated with the proband
(2) pS: (numeric vector) the posterior probability of each identity vector
(3) statistics: (numeric vector) the mean read depth across the thinned set of sites for each sample
(4) sitenames: (character vector) the sites sampled by the algorithm, listed using the notation $chromosome_$physical position

runtime: specifies how much real and CPU time (in seconds) was required to run the function.

The function _returns_ one of three possible objects depending on the $returnwhat parameter (refer to the parameter description for $returnwhat).

The function also outputs a log file indicating when the number of sites satisfying the $mafrange and missingness criteria is less than L.
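The saved Rds file can be inspected with readRDS(). The sketch below writes a mock object with the documented structure to a temporary file and reads it back; every value in the mock (sample IDs, probabilities, depths, site names) is hypothetical and stands in for real BIGRED output.

```r
# Mock object with the documented layout: 'results' (list of 4) and 'runtime'.
mock <- list(
  results = list(
    replicatenames = list(I011206 = c("run_A", "run_B")),
    pS = c(0.98, 0.02),
    statistics = c(run_A = 5.2, run_B = 4.7),
    sitenames = c("1_1050", "1_8872")
  ),
  runtime = proc.time()
)

# Save and reload, as you would with a file written by BIGRED().
path <- tempfile(fileext = ".rds")
saveRDS(mock, path)
out <- readRDS(path)

print(names(out))                   # top-level elements
print(out$results$pS)               # posterior probability of each identity vector
print(out$runtime[["elapsed"]])     # real (wall-clock) time in seconds
```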

Author(s)

Ariel W Chan, ac2278@cornell.edu


ac2278/BIGRED documentation built on May 28, 2019, 3:22 p.m.