In dcgerard/updogAlpha: Using Parental Data for Offspring Genotyping

knitr::opts_chunk$set(fig.width=4.5, fig.height=3.5)

What's Updog?

Updog is an empirical Bayes approach to genotype individuals (particularly polyploids) from next generation sequencing (NGS) data. We had in mind NGS data that results from a reduced representation library, such as "genotyping-by-sequencing" (GBS) [@elshire2011robust] or "restriction site-associated DNA sequencing" (RAD-seq) [@baird2008rapid].

Updog wields the power of hierarchical modeling to account for some key features of NGS data overlooked in most other analyses: allelic bias, overdispersion, and outlying observations. Updog will also automatically account for sequencing errors.

To efficiently account for these features, updog needs to know the distribution of the individual genotypes in the population. Updog can accurately estimate this distribution if the population consists of full siblings (option model = "s1" or model = "f1") or if the population is in Hardy-Weinberg equilibrium (model = "hw").

You can read more about the updog method in @gerard2018harnessing.

Example from an S1 Population

Fit updog

Load updog and the snpdat dataset. The data frame snpdat contains three example SNPs (single nucleotide polymorphisms) from the study of @shirasawa2017high. The individuals in this dataset resulted from a single generation of selfing (an S1 population). You can read more about it by typing ?snpdat.

set.seed(1)
library(updogAlpha)
data("snpdat")

We'll use the tidyverse to extract the First SNP.

library(tidyverse)
snpdat %>% 
  filter(snp == "SNP1") %>%
  select(counts, size, id) ->
  smalldat
head(smalldat)

We will separate the counts between the children and the parent (the first individual). Note that you do not need the parental counts to fit updog, but they can help improve estimates of the parameters in the updog model.

pcounts <- smalldat$counts[1]
psize   <- smalldat$size[1]
ocounts <- smalldat$counts[-1]
osize   <- smalldat$size[-1]
ploidy  <- 6 # sweet potatoes are hexaploid

We can first use plot_geno to visualize the raw data.

plot_geno(ocounts = ocounts, osize = osize, ploidy = ploidy)

Now we fit updog. We use model = "s1" because the individuals resulted from one generation of selfing of the same parent.

uout <- updog(ocounts = ocounts, osize = osize, ploidy = ploidy, 
              p1counts = pcounts, p1size = psize, model = "s1")

Analyze Output

We use plot.updog to visualize the fit. Points are color coded according to the genotype with the highest posterior probability. For example, a genotype of "4" represents four copies of the reference allele and two copies of the alternative allele (AAAAaa). The level of transparency is proportional to the posterior probability of a point being an outlier. The lines represent the mean counts at a given genotype. The "+" symbol with a black dot is the location of the parent.

plot(uout)

I have incorporated a colorblind safe palate.

plot(uout, use_colorblind = TRUE)

The fit seems reasonable. We can test how reasonable our genotype distribution is using summary.updog.

summary(uout)

The goodness of fit $p$-value is quite large, indicating no evidence of lack of fit.

We can sample from the fitted model using rupdog.

rout <- rupdog(uout)

If we plot the sampled output, it looks similar to the real data.

plot(rout, show_outlier = FALSE)

Filtering SNPs

For downstream analyses, you might want to filter out poorly behaved SNPs. These SNPs might be poorly behaved for a variety of reasons (they might not be real SNPs, it might be much more very difficult to map one allele to the correct location relative to the other allele, etc). Updog gives you some measures to filter out these SNPs.

The most intuitive measure would be the (posterior) proportion of individuals mis-genotyped:

uout$prop_mis

For this SNP, we expect about r round(uout$prop_mis, digits = 4) * 100 percent of the individuals to be mis-genotyped. The specific cutoff you use is context and data dependent. But as a starting point, you could try a loose cutoff by keeping SNPs only if they have a prop_mis of less than 0.2.

From our simulation studies, we also generally get rid of SNPs with overdispersion parameters greater than 0.05 or SNPs with bias parameters either less than 0.5 or greater than 2. However, if you have higher or lower read depths than what we looked at in our simulations, you should adjust these levels accordingly.

You should probably also get rid of SNPs if they have a high sequencing error rate, but we didn't explore this in simulations so I don't have any heuristic recommendations.

Special Considerations for Population Studies

For population studies, I would always use the options model = "hw", update_outprop = FALSE, and out_prop = 0. This is for two (extremely heuristic) reasons:

Updog (particularly the bias estimates) seems to be very sensitive to the form of the prior distribution and it seems that a binomial distribution is more common (or at least a better fit) in most datasets than a uniform distribution. Thus, I would never use model = "uniform".
I don't trust outlier estimates unless I have a very strong prior. If your population consists of full siblings, then we have this strong prior. But in population settings, our prior is much less strong. Thus, I would force updog to not allow for outliers with update_outprop = FALSE, and out_prop = 0.