fb_haploid: Inferring hidden ancestry states from haploids

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

Uses the forward-backward algorithm to estimate ancestry along a given chromosome for a given genotyped tetrad or simulated data.

Usage

1
fb_haploid(snp_locations, p0, p1, p_assign, scale)

Arguments

snp_locations

a numeric vector specifying the locations of each snp (in bps). This vector is assumed to be ordered (sorted from smallest to largest snp).

p0

a vector specifying the number of reads that mapped to parent 0. p0 is assumed to be in the smae order as snp_locations.

p1

a vector specifying the number of reads that mapped to parent 1. p1 is assumed to be in the smae order as snp_locations.

p_assign

a value specifying the assignment probabilty (see details).

scale

a numeric specifying the the genome-wide recombination rate (Morgans / bp). scale is assumed to be between 0 and 1 but in practice it is usually quite small.

Details

The function fb_haploid attempts to estimate parental genotypic 'states' along a chromosome given empirical or simulated F2 cross data. Next-generation data inherits both sequencing error and missing data – especially when sequencing coverage is low. Also, parental populations can be polymorphic with respect to gene frequencies before admixture. In these cases the parental state is ambiguous or unknown. fb_haploid takes these uncertainties into account and estimates the most likely sequence of ancestry.

The three steps taken in fb_haploid are:

  1. Calculate forward probabilities for each state (5' to 3')

  2. Calculate backward probabilities for each state (3' to 5')

  3. From these two probabilities, calculate the posterior probability that a snp location is of a given parental state.

The two element vector of forward probabilities, f_{i}, for each parental state at each position i is calculated as:

f_{i} = \mathbf{e_{i}}~\mathbf{T_{i}}~\mathbf{f_{i-1}}

where e_{i} is the emission probabilities for each state (see below), T_{i} is a 2-by-2 matrix describing the transition (recombination) probabilities between two states, and f_{i-1} is the forward probability at the previous position along the chromosome (5' of position i). It is assumed that each state is equally likely to occur at the first position.

The emmission probabilities are calculated independently for each snp and depends on the sequence reads assigned to parent "0" and parent "1" and p_assign. The emission probabilities are calculated using the binomial equation. For example, the emission probability for parental state "0" at snp position i is:

e_{i}^{(0)} = {n \choose{k_{0}}} p^{k_{0}} (1-p)^{n-k_{0}}

where n is the sum of the number of reads from both parents at snp i and p is the assignment probability, p_assign.

Like the emission probabilities, the transition probabilities are also calculated for each snp position. The transition probabilities in matrix \mathbf{T_{i}} are calcuated from the genome-wide recombination rate, scale, and the physical distance (in bps) between the two snps. Specifically, Haldane's function is used to estimate the probabilty of getting an odd number of crossovers between snp i and snp i-1. For more details, see Hether et al. (in prep).

To avoid underflow, we rescale the forward (and backward) probabilities each iteration to that they sum to unity.

The backward probabilities are calculated similarly but in the 3' to 5' direction. It is assumed that the backward probability is equally likely in each state. Following Durbin et al. (1998) the backward probability at snp position i is:

b_{i} = \mathbf{T_{i}}~\mathbf{e_{i+1}}~\mathbf{b_{i+1}}

Again, these probabilities are rescaled to avoid underflow.

The posterior probability that the state at position i is j given the observed sequence read counts, x_{i} is calculated by:

P(π_{i} = j~|~x_{i}) = \frac{\mathbf{f}_{i}^{(j)} \mathbf{b}_{i}^{(j)}}{\mathbf{f}_{i}~\mathbf{b}_{i}}

Ties in posterior probabilities are rare with sufficient coverage and/or signal. However, in the event of a tie state 1 is picked.

Finally, we can determine the log likelihood of the whole sequence of observations by summing up the log of all the scale factors in the forward probability calculation.

Value

a dataframe with the following columns:

Author(s)

Tyler D. Hether

References

Hether, T.D., C. G. Wiench1, and P.A. Hohenlohe (in review). 2015. Novel molecular and analytical tools for efficient estimation of rates of meiotic crossover, non-crossover and gene conversion

Drubin, R. S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of proteins and nucleic acids. Cambridge University Press, Cambridge CB2 8RU, UK.

See Also

fb_diploid

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
set.seed(1234567)        # For reproducibility
n_spores <- 1            # number of spores
l <- 75                  # number of snps to simulate
c <- 3.5e-05             # recombination rate between snps (Morgan/bp)
snps <- c(1:l)*1.3e4     # snps are evenly spaced 20kbp apart
p_a <- 0.95              # assignment probability
coverage <- 2.1          # mean coverage
# Now simulate
sim1 <- sim_en_masse(n.spores=n_spores, scale=c, snps=snps,
 p.assign=p_a, mu.rate=0, f.cross=0.8, f.convert=0.3,
 length.conversion=2e3, coverage=coverage)
fb_haploid(snp_locations=sim1$Snp, p0=sim1$p0, p1=sim1$p1, p_assign=p_a, scale=c)

tylerhether/fbgenotyper documentation built on May 3, 2019, 1:53 p.m.