knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This is a small, lightweight package that lets users investigate the distribution of genotypes in genotype-by-sequencing (GBS) data where they expect (by and large) Hardy-Weinberg equilibrium, in order to assess rates of genotyping errors and the dependence of those rates on read depth.
The name comes from the bolded letters in this sentence:
Where's my Heterozygotes at? Observations on genotyping Accuracy.
It also fits well with my reaction when I started investigating heterozygote miscall rates (rates at which true heterozygotes are incorrectly called as homozygotes) in some restriction associated digest (RAD) sequencing data sets---My eyes bugged out and I said, "Whoa!"
The package comes with a small bit of data from lobster to play with. The rest of this document shows a quick run through a few of the functions to do an analysis of a data set.
# load up the package: library(whoa)
Read about the lobster data here. Execute this if you want:
help("lobster_buz_2000")
The main thing to know is that it is a 'vcfR' object. You can
make such an object yourself by reading in a variant call format (VCF) file
using vcfR::read.vcfR()
.
# first get compute expected and observed genotype frequencies gfreqs <- exp_and_obs_geno_freqs(lobster_buz_2000) # then plot those. Set max_plot_loci so that all 2000 # loci will be plotted geno_freqs_scatter(gfreqs, max_plot_loci = 2000)
If we want to estimate the het miscall rate (over all read depth bins)
we just set the minimum bin size to a very large value so it make just one bin, and
use the infer_m
function which implements a Markov chain Monte Carlo (MCMC) algorithm
to estimate the heteryzogote miscall rate.
overall <- infer_m(lobster_buz_2000, minBin = 1e15)
Now look at that:
overall$m_posteriors
Wow! (Or should we say "WHOA!") A het miscall rate of around 25%.
See the total_n above is about 65,000. That means 65,000 genotypes.
(2000 loci typed at 36 individuals, with some missing data).
We will bin those up so that there are at least 2000 genotypes in each
bin and then estimate the het miscall rate for each read depth bin.
binned <- infer_m(lobster_buz_2000, minBin = 2000)
And then we can plot the posterior mean and CIs for each read depth bin.
posteriors_plot(binned$m_posteriors)
Again, WHOA! The het miscall rate at low read depths is super high!
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.