Nothing
#' get posterior estimates for m from different read depth categories
#'
#' This function calls internal C++ routines that perform Markov
#' chain Monte Carlo (MCMC) to sample from the posterior distribution of the
#' heterozygote miscall rate for each read depth category.
#'
#' The read depth bins are
#' determined by passing to the function \code{minBin}---the minimum number of observations desired for each read
#' depth bin. The function then breaks the observations into bins so that each read depth bin
#' has at least minBin observations.
#'
#' Note that if you want to estimate the heterozygote miscall rate \bold{overall} (i.e., not conditioning each
#' estimate on a read depth bin), then simply give a very large number (larger than the number of
#' markers times the number of individuals) for minBin. For example, you could use a number like 1e15
#' for minBin. As a consequence, all the genotypes will be put into a single read depth bin.
#' @param v a `vcfR' object holding the information from a variant call format (VCF) file with the genotype and depth data.
#' If you are going to be breaking the estimates down by read depth categories, the VCF file must have a DP field for
#' every genotype.
#' @param indivs a character vector holding
#' the names of the individuals from the VCF file to include in the analysis. By default
#' this is NULL, in which case everyone from the file is included.
#' @param minBin minimum number of observations for each read depth bin. If you have 10K markers
#' and 50 individuals then you have about 500,000 genotypes to play with. Requiring bins with at
#' least 5,000 genotypes in them will give you less than 100 bins. You can play around with this
#' number to get the right number of bins. The algorithm breaks the read depths up into bins that
#' have at least minBin genotypes in them.
#' @param init_m the initial value of the heterozygote miscall rate assumed for each read depth bin.
#' By default this is 0.1.
#' @param num_sweeps the number of sweeps of the MCMC algorithm to do. Default is 500, which is a
#' little on the short side. Run multiple times and make sure the values obtained are similar across
#' runs to assess convergence.
#' @param burn_in how many sweeps from the beginning of the chain to discard when computing the
#' posterior means and quantiles. Default is 100. Note that full traces of the visited m values
#' for every read depth bin are returned as well, so that the behavior of the chain in those early
#' steps can be investigated.
#' @return A list with six components:
#' \describe{
#' \item{m_posteriors}{A tibble with 6 columns: bin = the index of the read depth bin;
#' mean = the posterior mean estimate of the the heterozygote miscall rate in that bin;
#' lo95 = the low endpoint of the 95% credible interval; hi95 = the high endpoint of the
#' 95% credible interval; total_n = the total number of genotypes in the read depth bin;
#' and mean_dp = the mean read depth in the bin. }
#' \item{m_traces}{A tibble with all the values visited for m for every read depth bin. This tibble
#' has three columns: bin = the index of the read depth bin; sweep = the sweep number, value = the value
#' of m for that read depth bin in that particular sweep. }
#' \item{dp_summary}{A tibble summarizing how many genotypes of different read depths
#' appear in each bin.}
#' \item{bin_stats}{A tibble with a different summary of the read-depth bins.}
#' \item{num_sweeps}{Number of MCMC sweeps used.}
#' \item{burn_in}{Number of sweeps discarded as burn in.}
#' }
#' @export
#' @examples
#' # Shorter run than recommended (for quick example...)
#' im <- infer_m(lobster_buz_2000, minBin = 1000, num_sweeps = 100, burn_in = 20)
infer_m <- function(v, minBin, indivs = NULL, init_m = 0.1, num_sweeps = 500, burn_in = 100) {
message("Preparing data structures for MCMC")
# deal with the indivs
if(is.null(indivs)) {
indivs <- colnames(v@gt)[-1]
}
# check to make sure that all of indivs are included in the VCF
wrongos <- setdiff(indivs, colnames(v@gt)[-1])
if(length(wrongos) > 0) {
stop("Requesting indivs that are not in v: ", paste(wrongos, collapse = ", "))
}
# extract just those indivs. This assumes that vcfR has named the first column FORMAT
v@gt <- v@gt[,c("FORMAT", indivs)]
# prep it
p <- prep_vcf_for_est_m_rd(v, "DP", minBin)
message("Running MCMC")
# estimate it
b <- estimate_m_rd(Y = p$mat012,
R = p$dp_bins_list$dp_bins,
init_m = init_m,
num_cats = p$dp_bins_list$num_cats,
p_prior = c(0.5, 0.5),
m_prior = c(0.5, 0.5),
num_reps = num_sweeps)
message("Tidying output")
# tidy the output
res <- tidy_m_ests(E = b,
S = p$dp_bins_list$bin_stats,
TB = p$dp_bins_list$tidy_bins,
burn = burn_in)
res
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.