updog_vanilla: Using Parental Data for Offspring Genotyping.

Description Usage Arguments Details Value Author(s) See Also

View source: R/update_all.R

Description

This function fits a hierarchical model to sequence counts from a collection of siblings — or a population of individuals in Hardy-Weinberg equilibrium — and returns genotyped information. The hierarchy comes from either the fact that they share the same parents or they come from a population in Hardy-Weinberg equilibrium. If you also have parental sequencing data, then you can include this to improve estimates.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
updog_vanilla(ocounts, osize, ploidy, p1counts = NULL, p1size = NULL,
  p2counts = NULL, p2size = NULL, commit_num = 4, min_disp = 0,
  print_val = FALSE, update_outmean = FALSE, update_outdisp = FALSE,
  update_outprop = TRUE, update_bias_val = TRUE, update_seq_error = TRUE,
  update_od_param = TRUE, update_pgeno = TRUE, p1geno = 3, p2geno = 3,
  seq_error = 0.01, od_param = 0.01, bias_val = 1, out_prop = 0.001,
  out_mean = 0.5, out_disp = 1/3, non_mono_max = 2, bound_bias = FALSE,
  seq_error_mean = -4.7, seq_error_sd = 1, bias_val_mean = 0,
  bias_val_sd = 0.7, allele_freq = 0.5, model = c("f1", "s1", "hw",
  "uniform"), verbose = FALSE)

Arguments

ocounts

A vector of non-negative integers. The ith element is the number of reads of the reference allele in the ith child.

osize

A vector of positive integers. The ith element is the total number of reads for the ith child.

ploidy

A positive integer. The number of copies of the genome in the species. This is the assumed to be the same for all individuals.

p1counts

A vector of non-negative integers. The ith element is the number of reads of the reference allele in the ith sample of parent 1. If NULL then the prior probabilities on parent 1's genotype will default to uniform.

p1size

A vector of positive integers. The ith element is the total number of reads in the ith sample of parent 1. If NULL then the prior probabilities on parent 1's genotype will default to uniform.

p2counts

A vector of non-negative integers. The ith element is the number of reads of the reference allele in the ith sample of parent 2. If NULL then the prior probabilities on parent 2's genotype will default to uniform.

p2size

A vector of positive integers. The ith element is the total number of reads in the ith sample of parent 2. If NULL then the prior probabilities on parent 2's genotype will default to uniform.

commit_num

The number of consecutive iterations where the parental genotypes do not change before we commit to those parental genotypes.

min_disp

The minimum value for the over-dispersion parameter of the outlier distribution. If this gets too small, then we can be overly confident in points being outliers.

print_val

A logical. Should we print the updates (TRUE) or not (FALSE)?

update_outmean

A logical. Should we update out_mean (TRUE) or not (FALSE)?

update_outdisp

A logical. Should we update out_mean (TRUE) or not (FALSE)?

update_outprop

A logical. Should we update out_prop (TRUE) or not (FALSE)?

update_bias_val

A logical. Should we update bias_val (TRUE) or not (FALSE)?

update_seq_error

A logical. Should we update seq_error (TRUE) or not (FALSE)?

update_od_param

A logical. Should we update od_param (TRUE) or not (FALSE)?

update_pgeno

A logical. Should we update p1geno and p1geno (TRUE) or not (FALSE)?

p1geno

The initial value of the first parental genotype.

p2geno

The initial value of the second parental genotype.

seq_error

A non-negative numeric. This is the known sequencing error rate. The default is to estimate this using data that are all approximately the reference allele.

od_param

The initial value of the overdispersion parameter.

bias_val

The initial value of the bias parameter.

out_prop

The initial value of the proportion of points that are outliers.

out_mean

The initial value of the mean of the outlier distribution.

out_disp

The initial value of the over-dispersion parameter of the outlier distribution.

non_mono_max

The maximum number of iterations to allow non-monotonicity of likelihood.

bound_bias

A logical. Should we bound bias_val by a somewhat arbitrary value (TRUE) or not (FALSE)?

seq_error_mean

The mean of the logit-normal prior on the sequencing error rate, which corresponds to parvec[2]. Set seq_error_sd = Inf to have no penalty on the sequencing error rate. The default is -4.7, which roughly corresponds to a mean sequencing error value of 0.009. If you want to constain seq_error to be zero, you need to set update_seq_error = FALSE, seq_error = 0, andseq_error_mean = -Inf.

seq_error_sd

The standard deviation of the logit-normal prior on the sequencing error rate, which corresponds to parvec[2]. The default is 1, which at three standard deviations is about a sequencing error rate of 0.15. Set seq_error_sd = Inf to have no penalty on the sequencing error rate.

bias_val_mean

The prior mean on the log of bias_val (corresponding to parvec[1]). Set bias_val_sd = Inf to have no penalty on the bias parameter.

bias_val_sd

The prior standard deviation on the log of bias_val (corresponding to parvec[1]). Set bias_val_sd = Inf to have no penalty on the bias parameter.

allele_freq

The allele frequency if model = "hw".

model

The model for the genotype distribution. Do we assume an F1 population ("f1"), an S1 population ("s1"), Hardy-Weinberg equilibrium ("hw"), or a uniform distribution ("uniform").

verbose

A logical. Should we output a lot more (TRUE) or not (FALSE)?

Details

The key improvements in updog are its abilities to account for common features in GBS data: sequencing error rate, read-mapping bias, overdispersion, and outlying points.

Value

A list of class updog with some or all of the following elements:

ogeno

A vector. Each element of which is the maximum a posteriori estimate of each individual's genotype.

maxpostprob

A vector. The maximum posterior probability of a genotype for each individual.

postmean

A vector. The posterior mean genotype for each individual.

bias_val

The estimated bias parameter. This is a value greater than 0 which is the ratio of the probability of correctly mapping a read containing the alternative allele to the probability of correctly mapping a read containing the reference allele. A value of 1 indicates no bias. A value less than one indicates bias towards the reference alelle. A value greater than 1 indiciates bias towards the alternative allele.

seq_error

The estimated sequencing error rate. This is between 0 and 1.

od_param

The estimated overdispersion parameter. Also known as the "intra-class correlation", this is the overdispersion parameter in the underlying beta of the beta-binomial distribution of the counts. Between 0 and 1, a value closer to 0 indicates less overdispersion and a value greater than 1 indicates greater overdispersion. In real data, we we typically see estimates between 0 and 0.01.

p1geno

The estimated genotype of one parent. The number of copies of the reference allele one of the parents has.

p1geno

The estimated genotype of the other parent. The number of copies of the reference allele the other parent has.

allele_freq

The estimated allele-frequency of the reference allele. This is the binomial proportion. Between 0 and 1, a value closer to 1 indicates a larger amount of reference alleles in the population.

out_prop

The estimated proportion of points that are outliers.

out_mean

The estimated mean of the outlier distribution. The outlier distribution is beta-binomial.

out_disp

The estimated overdispersion parameter of the outlier distribution. This is the "intra-class correlation" parameter of the beta-binomial outlier distribution.

prob_out

A vector. Each element of which is the posterior probability that a point is an outlier.

prob_ok

The posterior probability that a point is a non-outlier.

p1_prob_out

The posterior probability that parent 1 is an outlier.

p2_prob_out

The posterior probability that parent 2 is an outlier.

num_iter

The number of iterations the optimization program was run.

convergence

1 if we reached maxiter and 0 otherwise.

llike

The final log-likelihood of the estimates.

hessian

The negative-Fisher information under the parameterization (s, ell, r), where s = log(bias_val) = log(d), ell = logit(seq_error) = logit(eps), and r = - logit(od_param) = - logit(tau). If you want standard errors for these parameters (in the described parameterization), simply take the negative inverse of the hessian.

input

A list with the input counts (ocounts), the input sizes (osize), input parental counts (p1counts and p2counts), input parental sizes (p2size and p1size), the ploidy (ploidy) and the model (model).

log_bias

The log of bias_val

logit_seq_error

The logit of seq_error. I.e. log(seq_error / (1 - seq_error))

neg_logit_od_param

The negative logit of od_param. I.e. log((1 - od_param) / od_param)

covmat

The observed Fisher information matrix of c(log_bias, logit_seq_error, neg_logit_od_param). This can be used as the covariance matrix of these estimates.

prop_mis

The (posterior) proportion of individuals mis-genotyped. Equal to 1 - maxpostprob.

Author(s)

David Gerard

See Also

plot.updog For plotting the output of updog_vanilla.


dcgerard/updogAlpha documentation built on May 14, 2019, 3:10 a.m.