fb_haploid: Inferring hidden ancestry states from haploids
In tylerhether/fbgenotyper: R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

Description Usage Arguments Details Value Author(s) References See Also Examples

Uses the forward-backward algorithm to estimate ancestry along a given chromosome for a given genotyped tetrad or simulated data.

1	fb_haploid(snp_locations, p0, p1, p_assign, scale)

`snp_locations`	a numeric vector specifying the locations of each snp (in bps). This vector is assumed to be ordered (sorted from smallest to largest snp).
`p0`	a vector specifying the number of reads that mapped to parent 0. p0 is assumed to be in the smae order as snp_locations.
`p1`	a vector specifying the number of reads that mapped to parent 1. p1 is assumed to be in the smae order as snp_locations.
`p_assign`	a value specifying the assignment probabilty (see details).
`scale`	a numeric specifying the the genome-wide recombination rate (Morgans / bp). `scale` is assumed to be between 0 and 1 but in practice it is usually quite small.

The function fb_haploid attempts to estimate parental genotypic 'states' along a chromosome given empirical or simulated F2 cross data. Next-generation data inherits both sequencing error and missing data – especially when sequencing coverage is low. Also, parental populations can be polymorphic with respect to gene frequencies before admixture. In these cases the parental state is ambiguous or unknown. fb_haploid takes these uncertainties into account and estimates the most likely sequence of ancestry.

The three steps taken in fb_haploid are:

Calculate forward probabilities for each state (5' to 3')
Calculate backward probabilities for each state (3' to 5')
From these two probabilities, calculate the posterior probability that a snp location is of a given parental state.

The two element vector of forward probabilities, f_{i}, for each parental state at each position i is calculated as:

f_{i} = \mathbf{e_{i}}~\mathbf{T_{i}}~\mathbf{f_{i-1}}

where e_{i} is the emission probabilities for each state (see below), T_{i} is a 2-by-2 matrix describing the transition (recombination) probabilities between two states, and f_{i-1} is the forward probability at the previous position along the chromosome (5' of position i). It is assumed that each state is equally likely to occur at the first position.

The emmission probabilities are calculated independently for each snp and depends on the sequence reads assigned to parent "0" and parent "1" and p_assign. The emission probabilities are calculated using the binomial equation. For example, the emission probability for parental state "0" at snp position i is:

e_{i}^{(0)} = {n \choose{k_{0}}} p^{k_{0}} (1-p)^{n-k_{0}}

where n is the sum of the number of reads from both parents at snp i and p is the assignment probability, p_assign.

Like the emission probabilities, the transition probabilities are also calculated for each snp position. The transition probabilities in matrix \mathbf{T_{i}} are calcuated from the genome-wide recombination rate, scale, and the physical distance (in bps) between the two snps. Specifically, Haldane's function is used to estimate the probabilty of getting an odd number of crossovers between snp i and snp i-1. For more details, see Hether et al. (in prep).

To avoid underflow, we rescale the forward (and backward) probabilities each iteration to that they sum to unity.

The backward probabilities are calculated similarly but in the 3' to 5' direction. It is assumed that the backward probability is equally likely in each state. Following Durbin et al. (1998) the backward probability at snp position i is:

b_{i} = \mathbf{T_{i}}~\mathbf{e_{i+1}}~\mathbf{b_{i+1}}

Again, these probabilities are rescaled to avoid underflow.

The posterior probability that the state at position i is j given the observed sequence read counts, x_{i} is calculated by:

P(π_{i} = j~|~x_{i}) = \frac{\mathbf{f}_{i}^{(j)} \mathbf{b}_{i}^{(j)}}{\mathbf{f}_{i}~\mathbf{b}_{i}}

Ties in posterior probabilities are rare with sufficient coverage and/or signal. However, in the event of a tie state 1 is picked.

Finally, we can determine the log likelihood of the whole sequence of observations by summing up the log of all the scale factors in the forward probability calculation.

a dataframe with the following columns:

snp_locations
p0
p1
The emission matrix (2 columns) specifying the emmision probability of belong to the parent 0 (emiss.1) and parent 2 (emiss.2)
the forward probability matrix (2 columns) giving the scaled forward probabilities of belong to the parent 0 (forward.1) and parent 2 (forward.2)
the backward probability matrix (2 columns) giving the scaled backward probabilities of belong to the parent 0 (backward.1) and parent 2 (backward.2)
Fscale The forward scaling factor for each snp
Bscale The backward scaling factor for each snp
posterior probability matrix (2 columns) posterior probabilities for belong to each parental state
states_inferred a vector of length snp_locations giving the state with the highest posterior.
lnL a numeric giving the total likelihood of the data (see details).

Tyler D. Hether

Hether, T.D., C. G. Wiench1, and P.A. Hohenlohe (in review). 2015. Novel molecular and analytical tools for efficient estimation of rates of meiotic crossover, non-crossover and gene conversion

Drubin, R. S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of proteins and nucleic acids. Cambridge University Press, Cambridge CB2 8RU, UK.

fb_diploid

set.seed(1234567)        # For reproducibility
n_spores <- 1            # number of spores
l <- 75                  # number of snps to simulate
c <- 3.5e-05             # recombination rate between snps (Morgan/bp)
snps <- c(1:l)*1.3e4     # snps are evenly spaced 20kbp apart
p_a <- 0.95              # assignment probability
coverage <- 2.1          # mean coverage
# Now simulate
sim1 <- sim_en_masse(n.spores=n_spores, scale=c, snps=snps,
 p.assign=p_a, mu.rate=0, f.cross=0.8, f.convert=0.3,
 length.conversion=2e3, coverage=coverage)
fb_haploid(snp_locations=sim1$Snp, p0=sim1$p0, p1=sim1$p1, p_assign=p_a, scale=c)