HMMancestry: R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

Documented in fb_diploid fb_haploid

# This file was generated by Rcpp::compileAttributes
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#' @title Inferring hidden ancestry states from haploids
#'
#' @description Uses the forward-backward algorithm to estimate ancestry 
#' along a given chromosome for a given genotyped tetrad or simulated data.
#'
#' @param snp_locations a numeric vector specifying the locations of each snp (in bps). This
#' vector is assumed to be ordered (sorted from smallest to largest snp).
#' 
#' @param p0 a vector specifying the number of reads that mapped to parent 0. 
#' p0 is assumed to be in the smae order as snp_locations. 
#'
#' @param p1 a vector specifying the number of reads that mapped to parent 1. 
#' p1 is assumed to be in the smae order as snp_locations.
#'
#' @param p_assign a value specifying the assignment probabilty (see details).
#'
#' @param scale a numeric specifying the the genome-wide recombination rate (Morgans / bp). \code{scale} is assumed to be 
#' between 0 and 1 but in practice it is usually quite small. 
#'
#' @details The function \code{fb_haploid} attempts to estimate 
#' parental genotypic 'states' along a chromosome given empirical or 
#' simulated F2 cross data. 
#' Next-generation data inherits both sequencing error and missing data 
#' -- especially when sequencing coverage is low. Also, parental populations
#' can be polymorphic with respect to gene frequencies before admixture. In these cases 
#' the parental state is ambiguous or unknown. \code{fb_haploid} takes 
#' these uncertainties into account and estimates the most likely sequence of
#' ancestry. 
#' 
#' The three steps taken in \code{fb_haploid} are: 
#' \enumerate{
#'  \item Calculate forward probabilities for each state (5' to 3')
#'  \item Calculate backward probabilities for each state (3' to 5')
#'  \item From these two probabilities, calculate the posterior probability
#'  that a snp location is of a given parental state.
#' } 
#' 
#' The two element vector of forward probabilities, \eqn{f_{i}}, for each parental state 
#' at each position \eqn{ i } is calculated as:
#' \deqn{ f_{i} = \mathbf{e_{i}}~\mathbf{T_{i}}~\mathbf{f_{i-1}} }
#' 
#' where \eqn{e_{i}} is the emission probabilities for each state (see below), \eqn{T_{i}}
#' is a 2-by-2 matrix describing the transition (recombination) probabilities
#' between two states, and \eqn{f_{i-1}} is the forward probability at the 
#' previous position along the chromosome (5' of position \eqn{i}). It is assumed 
#' that each state is equally likely to occur at the first position. 
#' 
#' 
#' The emmission probabilities are calculated independently for 
#' each snp and depends on the sequence reads assigned to parent "0" 
#' and parent "1" and \code{p_assign}. The emission probabilities are
#' calculated using the binomial equation. For example, the emission
#' probability for parental state "0" at snp position i is: 
#' 
#' \deqn{e_{i}^{(0)} = {n \choose{k_{0}}} p^{k_{0}} (1-p)^{n-k_{0}}}
#' 
#' where \eqn{n} is the sum of the number of reads from both parents at snp \eqn{i} and \eqn{p} 
#' is the assignment probability, \code{p_assign}. 
#' 
#' Like the emission probabilities, the transition probabilities are also
#' calculated for each snp position. The transition probabilities in matrix \eqn{\mathbf{T_{i}}}
#' are calcuated from the genome-wide recombination rate, \code{scale}, and the physical
#' distance (in bps) between the two snps. Specifically, Haldane's function is used
#' to estimate the probabilty of getting an odd number of crossovers between snp \eqn{i}
#' and snp \eqn{i-1}. For more details, see Hether et al. (in prep).
#' 
#' To avoid underflow, we rescale the forward (and backward) probabilities
#' each iteration to that they sum to unity. 
#' 
#' The backward probabilities are calculated similarly but in the 3' to 5' 
#' direction. It is assumed that the backward probability is equally likely 
#' in each state. Following Durbin et al. (1998) the backward probability at
#' snp position \eqn{i} is: 
#' 
#' \deqn{b_{i} = \mathbf{T_{i}}~\mathbf{e_{i+1}}~\mathbf{b_{i+1}}}
#' 
#' Again, these probabilities are rescaled to avoid underflow. 
#' 
#' The posterior probability that the state at position \eqn{i} is \eqn{j}
#' given the observed sequence read counts, \eqn{x_{i}}
#' is calculated by:  
#' 
#' \deqn{P(\pi_{i} = j~|~x_{i}) = \frac{\mathbf{f}_{i}^{(j)} \mathbf{b}_{i}^{(j)}}{\mathbf{f}_{i}~\mathbf{b}_{i}}}
#' 
#' Ties in posterior probabilities are rare with sufficient coverage and/or signal. However, in the event of a tie state 1 is picked.
#'
#' Finally, we can determine the log likelihood of the whole sequence 
#' of observations by summing up the log of all the scale factors in the 
#' forward probability calculation.
#'
#' @return a dataframe with the following columns: 
#' \itemize{
#'     \item \code{snp_locations}
#'     \item \code{p0}
#'     \item \code{p1}
#'     \item The emission matrix (2 columns) specifying the emmision probability of belong
#' to the parent 0 (emiss.1) and parent 2 (emiss.2) 
#'     \item the forward probability matrix (2 columns) giving the scaled
#'     forward probabilities of belong
#' to the parent 0 (forward.1) and parent 2 (forward.2)
#'     \item the backward probability matrix (2 columns) giving the scaled
#'     backward probabilities of belong
#' to the parent 0 (backward.1) and parent 2 (backward.2)
#'     \item \code{Fscale} The forward scaling factor for each snp
#'     \item \code{Bscale} The backward scaling factor for each snp
#'     \item posterior probability matrix (2 columns) posterior
#'     probabilities for belong to each parental state
#'     \item states_inferred a vector of length snp_locations giving the state with the
#'     highest posterior. 
#'     \item lnL a numeric giving the total likelihood of the data (see details).
#' }
#'
#' @references Hether, T.D., C. G. Wiench1, and P.A. Hohenlohe (in review). 2015. Novel molecular and analytical tools 
#' for efficient estimation of rates of meiotic crossover, non-crossover and gene conversion
#'
#' Drubin, R. S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models
#' of proteins and nucleic acids. Cambridge University Press, Cambridge CB2 8RU, UK.
#'
#' @seealso \code{\link{fb_diploid}}
#'
#' @author Tyler D. Hether
#'
#' @examples
#' set.seed(1234567)        # For reproducibility
#' n_spores <- 1            # number of spores
#' l <- 75                  # number of snps to simulate
#' c <- 3.5e-05             # recombination rate between snps (Morgan/bp)
#' snps <- c(1:l)*1.3e4     # snps are evenly spaced 20kbp apart
#' p_a <- 0.95              # assignment probability
#' coverage <- 2.1          # mean coverage
#' # Now simulate
#' sim1 <- sim_en_masse(n.spores=n_spores, scale=c, snps=snps, 
#'  p.assign=p_a, mu.rate=0, f.cross=0.8, f.convert=0.3, 
#'  length.conversion=2e3, coverage=coverage)
#' fb_haploid(snp_locations=sim1$Snp, p0=sim1$p0, p1=sim1$p1, p_assign=p_a, scale=c)
#'
#' @export
fb_haploid <- function(snp_locations, p0, p1, p_assign, scale) {
    .Call('HMMancestry_fb_haploid', PACKAGE = 'HMMancestry', snp_locations, p0, p1, p_assign, scale)
}

#' @title Inferring hidden ancestry states from diploids
#'
#' @description Uses the forward-backward algorithm to estimate ancestry 
#' along a given chromosome for a given genotyped diploid.
#'
#' @param snp_locations a numeric vector specifying the locations of each snp (in bps). This
#' vector is assumed to be ordered (sorted from smallest to largest snp).
#' 
#' @param p0 a vector specifying the number of reads that mapped to parent 0. 
#' p0 is assumed to be in the smae order as snp_locations. 
#'
#' @param p1 a vector specifying the number of reads that mapped to parent 1. 
#' p1 is assumed to be in the smae order as snp_locations.
#'
#' @param p_assign a value specifying the assignment probabilty (see details).
#'
#' @param scale a numeric specifying the the genome-wide recombination rate (Morgans / bp). \code{scale} is assumed to be 
#' between 0 and 1 but in practice it is usually quite small. 
#'
#' @details This is an extension of \code{\link{fb_haploid}} with three hidden states:
#' two homozygous states and a heterozygous state. 
#'
#' @return a dataframe with the following columns: 
#' \itemize{
#'     \item \code{snp_locations}
#'     \item \code{p0}
#'     \item \code{p1}
#'     \item The emission matrix (3 columns) specifying the emmision probability of belong
#' to the only parent 0 (emiss.1), both parents (emiss.1), and only parent 2 (emiss.3) 
#'     \item the forward probability matrix (3 columns) giving the scaled
#'     forward probabilities of belong
#' to the only parent 0 (forward.1), both parents (forward.2), and only parent 2 (forward.3)
#'     \item the backward probability matrix (3 columns) giving the scaled
#'     backward probabilities of belong
#' to the only parent 0 (backward.1), both parents (backward.2), and only parent 2 (backward.3)
#'     \item \code{Fscale} The forward scaling factor for each snp
#'     \item \code{Bscale} The backward scaling factor for each snp
#'     \item posterior probability matrix (3 columns) posterior
#'     probabilities for belong to each parental state
#'     \item states_inferred a vector of length snp_locations giving the state with the
#'     highest posterior. 
#'     \item lnL a numeric giving the total likelihood of the data (see \code{\link{fb_haploid}} ).
#' }
#'
#' @references Hether, T.D., C. G. Wiench1, and P.A. Hohenlohe (in review). 2015. Novel molecular and analytical tools 
#' for efficient estimation of rates of meiotic crossover, non-crossover and gene conversion
#'
#' Drubin, R. S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models
#' of proteins and nucleic acids. Cambridge University Press, Cambridge CB2 8RU, UK.
#'
#' @seealso \code{\link{fb_haploid}}
#'
#' @author Tyler D. Hether
#'
#' @examples
#' set.seed(1234567)        # For reproducibility
#' n_spores <- 1            # number of spores
#' l <- 75                  # number of snps to simulate
#' c <- 3.5e-05             # recombination rate between snps (Morgan/bp)
#' snps <- c(1:l)*1.3e4     # snps are evenly spaced 20kbp apart
#' p_a <- 0.95              # assignment probability
#' coverage <- 2.1          # mean coverage
#' # Now simulate two haploids
#' sim1 <- sim_en_masse(n.spores=n_spores, scale=c, snps=snps, 
#'  p.assign=p_a, mu.rate=0, f.cross=0.8, f.convert=0.3, 
#'  length.conversion=2e3, coverage=coverage)
#' sim2 <- sim_en_masse(n.spores=n_spores, scale=c, snps=snps, 
#'  p.assign=p_a, mu.rate=0, f.cross=0.8, f.convert=0.3, 
#'  length.conversion=2e3, coverage=coverage)
#' # Now merge the two haploids to make a diploid
#' p0 <- sim1$p0+sim2$p0
#' p1 <- sim1$p1+sim2$p1
#' res <- fb_diploid(snp_locations=sim1$Snp, p0=p0, p1=p1, p_assign=p_a, scale=c)
#' res
#'@export
fb_diploid <- function(snp_locations, p0, p1, p_assign, scale) {
    .Call('HMMancestry_fb_diploid', PACKAGE = 'HMMancestry', snp_locations, p0, p1, p_assign, scale)
}

tylerhether/HMMancestry documentation built on May 3, 2019, 1:53 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

tylerhether/HMMancestry
R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

R/RcppExports.R
In tylerhether/HMMancestry: R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

Defines functions fb_haploid fb_diploid

Documented in fb_diploid fb_haploid

R Package Documentation

Browse R Packages

We want your feedback!

tylerhether/HMMancestry R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

R/RcppExports.R In tylerhether/HMMancestry: R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

Defines functions fb_haploid fb_diploid

Documented in fb_diploid fb_haploid

R Package Documentation

Browse R Packages

We want your feedback!

tylerhether/HMMancestry
R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data

R/RcppExports.R
In tylerhether/HMMancestry: R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data