Genotyping Many SNPs with multidog()"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=4.5,
  fig.height=3.5
)

Abstract

multidog() provides support for genotyping many SNPs by iterating flexdog() over the SNPs. Support is provided for parallel computing through the future package. The genotyping method is described in Gerard et al. (2018) and Gerard and Ferrão (2020).

Fit multidog()

Let's load updog, future, and the data from Uitdewilligen et al. (2013).

library(future)
library(updog)
data("uitdewilligen")

uitdewilligen$refmat is a matrix of reference counts while uitdewilligen$sizemat is a matrix of total read counts. In these data, the rows index the individuals and the columns index the loci. But for insertion into multidog() we need it the other way around (individuals in the columns and loci in the rows). So we will transpose these matrices.

refmat  <- t(uitdewilligen$refmat)
sizemat <- t(uitdewilligen$sizemat)
ploidy  <- uitdewilligen$ploidy

sizemat and refmat should have the same row and column names. These names identify the loci and the individuals.

setdiff(colnames(sizemat), colnames(refmat))
setdiff(rownames(sizemat), rownames(refmat))

If we want to do parallel computing, we should check that we have the proper number of cores:

future::availableCores()

Now let's run multidog():

mout <- multidog(refmat = refmat, 
                 sizemat = sizemat, 
                 ploidy = ploidy, 
                 model = "norm",
                 nc = 2)

By default, parallelization is run using

future::plan(future::multisession, workers = nc)

if nc is greater than 1. You can choose your own evaluation strategy by running future::plan() prior to running multidog(), and then setting nc = NA. This should be particularly useful in higher performance computing environments that use schedulers, where you can control the evaluation strategy through the future.batchtools package. For example, the following will run multidog() using forked R processes:

future::plan(future::multicore, workers = 2)
mout <- multidog(refmat = refmat, 
                 sizemat = sizemat, 
                 ploidy = ploidy, 
                 model = "norm",
                 nc = NA)

## Shut down parallel workers
future::plan(future::sequential)

multidog() Output

There is a plot method for the output of multidog().

plot(mout, indices = c(1, 5, 100))

The output of multidog contains two data frame. The first contains properties of the SNPs, such as estimated allele bias and estimated sequencing error rate.

str(mout$snpdf)

The second data frame contains properties of each individual at each SNP, such as the estimated genotypes (geno) and the posterior probability of being genotyping correctly (maxpostprob).

str(mout$inddf)

You can obtain the columns in inddf in matrix form with format_multidog().

genomat <- format_multidog(mout, varname = "geno")
head(genomat)

To filter SNPs based on quality metrics (bias, sequencing error rate, overdispersion, etc), you can use filter_snp(), which uses the same non-standard evaluation you are used to from dplyr::filter(). That is, you can define predicates in terms of the variable names in the snpdf data frame from the output of mupdog(). It then keeps rows in both snpdf and inddf where the predicate for a SNP evaluates to TRUE.

dim(mout$snpdf)
dim(mout$inddf)
mout_cleaned <- filter_snp(mout, prop_mis < 0.05 & bias > exp(-1) & bias < exp(1))
dim(mout_cleaned$snpdf)
dim(mout_cleaned$inddf)

An experimental function is provided to export the multidog() output to a VCF file. You need to make sure you have the the following Bioconductor packages installed to use it: VariantAnnotation, GenomicRanges, S4Vectors, and IRanges.

# install.packages("BiocManager")
# BiocManager::install(c("VariantAnnotation", "GenomicRanges", "S4Vectors", "IRanges"))
export_vcf(obj = mout, filename = "./multidog_fit.vcf")

References

Gerard, David, and Luís Felipe Ventorim Ferrão. "Priors for genotyping polyploids." Bioinformatics 36, no. 6 (2020): 1795-1800. https://doi.org/10.1093/bioinformatics/btz852.

Gerard, David, Luís Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. "Genotyping Polyploids from Messy Sequencing Data." Genetics 210 (3). Genetics: 789–807. https://doi.org/10.1534/genetics.118.301468.

Uitdewilligen, Anne-Marie A. AND D’hoop, Jan G. A. M. L. AND Wolters. 2013. "A Next-Generation Sequencing Method for Genotyping-by-Sequencing of Highly Heterozygous Autotetraploid Potato." PLOS ONE 8 (5). Public Library of Science: 1–14. https://doi.org/10.1371/journal.pone.0062355.



Try the updog package in your browser

Any scripts or data that you put into this service are public.

updog documentation built on Oct. 18, 2022, 9:07 a.m.