Description Usage Arguments Details Value Author(s) References See Also Examples
Uses the forward-backward algorithm to estimate ancestry along a given chromosome for a given genotyped tetrad or simulated data.
1 | fb_haploid(snp_locations, p0, p1, p_assign, scale)
|
snp_locations |
a numeric vector specifying the locations of each snp (in bps). This vector is assumed to be ordered (sorted from smallest to largest snp). |
p0 |
a vector specifying the number of reads that mapped to parent 0. p0 is assumed to be in the smae order as snp_locations. |
p1 |
a vector specifying the number of reads that mapped to parent 1. p1 is assumed to be in the smae order as snp_locations. |
p_assign |
a value specifying the assignment probabilty (see details). |
scale |
a numeric specifying the the genome-wide recombination rate (Morgans / bp). |
The function fb_haploid
attempts to estimate
parental genotypic 'states' along a chromosome given empirical or
simulated F2 cross data.
Next-generation data inherits both sequencing error and missing data
– especially when sequencing coverage is low. Also, parental populations
can be polymorphic with respect to gene frequencies before admixture. In these cases
the parental state is ambiguous or unknown. fb_haploid
takes
these uncertainties into account and estimates the most likely sequence of
ancestry.
The three steps taken in fb_haploid
are:
Calculate forward probabilities for each state (5' to 3')
Calculate backward probabilities for each state (3' to 5')
From these two probabilities, calculate the posterior probability that a snp location is of a given parental state.
The two element vector of forward probabilities, f_{i}, for each parental state at each position i is calculated as:
f_{i} = \mathbf{e_{i}}~\mathbf{T_{i}}~\mathbf{f_{i-1}}
where e_{i} is the emission probabilities for each state (see below), T_{i} is a 2-by-2 matrix describing the transition (recombination) probabilities between two states, and f_{i-1} is the forward probability at the previous position along the chromosome (5' of position i). It is assumed that each state is equally likely to occur at the first position.
The emmission probabilities are calculated independently for
each snp and depends on the sequence reads assigned to parent "0"
and parent "1" and p_assign
. The emission probabilities are
calculated using the binomial equation. For example, the emission
probability for parental state "0" at snp position i is:
e_{i}^{(0)} = {n \choose{k_{0}}} p^{k_{0}} (1-p)^{n-k_{0}}
where n is the sum of the number of reads from both parents at snp i and p
is the assignment probability, p_assign
.
Like the emission probabilities, the transition probabilities are also
calculated for each snp position. The transition probabilities in matrix \mathbf{T_{i}}
are calcuated from the genome-wide recombination rate, scale
, and the physical
distance (in bps) between the two snps. Specifically, Haldane's function is used
to estimate the probabilty of getting an odd number of crossovers between snp i
and snp i-1. For more details, see Hether et al. (in prep).
To avoid underflow, we rescale the forward (and backward) probabilities each iteration to that they sum to unity.
The backward probabilities are calculated similarly but in the 3' to 5' direction. It is assumed that the backward probability is equally likely in each state. Following Durbin et al. (1998) the backward probability at snp position i is:
b_{i} = \mathbf{T_{i}}~\mathbf{e_{i+1}}~\mathbf{b_{i+1}}
Again, these probabilities are rescaled to avoid underflow.
The posterior probability that the state at position i is j given the observed sequence read counts, x_{i} is calculated by:
P(π_{i} = j~|~x_{i}) = \frac{\mathbf{f}_{i}^{(j)} \mathbf{b}_{i}^{(j)}}{\mathbf{f}_{i}~\mathbf{b}_{i}}
Ties in posterior probabilities are rare with sufficient coverage and/or signal. However, in the event of a tie state 1 is picked.
Finally, we can determine the log likelihood of the whole sequence of observations by summing up the log of all the scale factors in the forward probability calculation.
a dataframe with the following columns:
snp_locations
p0
p1
The emission matrix (2 columns) specifying the emmision probability of belong to the parent 0 (emiss.1) and parent 2 (emiss.2)
the forward probability matrix (2 columns) giving the scaled forward probabilities of belong to the parent 0 (forward.1) and parent 2 (forward.2)
the backward probability matrix (2 columns) giving the scaled backward probabilities of belong to the parent 0 (backward.1) and parent 2 (backward.2)
Fscale
The forward scaling factor for each snp
Bscale
The backward scaling factor for each snp
posterior probability matrix (2 columns) posterior probabilities for belong to each parental state
states_inferred a vector of length snp_locations giving the state with the highest posterior.
lnL a numeric giving the total likelihood of the data (see details).
Tyler D. Hether
Hether, T.D., C. G. Wiench1, and P.A. Hohenlohe (in review). 2015. Novel molecular and analytical tools for efficient estimation of rates of meiotic crossover, non-crossover and gene conversion
Drubin, R. S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of proteins and nucleic acids. Cambridge University Press, Cambridge CB2 8RU, UK.
1 2 3 4 5 6 7 8 9 10 11 12 | set.seed(1234567) # For reproducibility
n_spores <- 1 # number of spores
l <- 75 # number of snps to simulate
c <- 3.5e-05 # recombination rate between snps (Morgan/bp)
snps <- c(1:l)*1.3e4 # snps are evenly spaced 20kbp apart
p_a <- 0.95 # assignment probability
coverage <- 2.1 # mean coverage
# Now simulate
sim1 <- sim_en_masse(n.spores=n_spores, scale=c, snps=snps,
p.assign=p_a, mu.rate=0, f.cross=0.8, f.convert=0.3,
length.conversion=2e3, coverage=coverage)
fb_haploid(snp_locations=sim1$Snp, p0=sim1$p0, p1=sim1$p1, p_assign=p_a, scale=c)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.