polyRAD: Genotype Calling with Uncertainty from Sequencing Data in Polyploids and Diploids

This R package is ready for use, although new features are still being developed. See the list of future features for where it is headed.

polyRAD is part of an upcoming suite of interoperable software called the ploidyverse.

I'm always interested in new collaboration! If you find polyRAD to be helpful in your research, let me know if you'd be interested in sharing your data and results for coauthorship in the publication describing the software. I would also like to hear your feature requests. Contact: Lindsay Clark, University of Illinois, Urbana-Champaign.

Purpose

Genotypes derived from genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq) have inherent uncertainty associated with them due to sampling error, i.e. some alleles might not get sequenced at all, or might not be sequenced in exact proportion to their copy number in the genome. This package imports read depth in a variety of formats output by various bioinformatics pipelines and estimates the probability of each possible genotype for each taxon and locus. Unlike similar pipelines, polyRAD can account for population structure and variable inheritance modes (autopolyploid, allopolyploid, intermediate). Genotypes and/or probability distributions can then be exported for downstream analysis such as genome-wide association, genomic selection, QTL mapping, or population structure analysis.
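To make the sampling-error point concrete, here is a minimal sketch in base R. It assumes each read samples one chromosome copy independently and without bias, which is a simplification and not necessarily the model polyRAD uses. The ploidy, dosage, and read depths are illustrative values only.

```r
# Illustrative values (assumptions, not taken from polyRAD): a tetraploid
# carrying one copy of an allele, sequenced at several read depths.
ploidy <- 4
dosage <- 1                    # copies of the allele in the genome
depths <- c(2, 5, 10, 20)      # total reads at the locus

# Probability that the allele is missed entirely: every read happens to
# sample one of the other (ploidy - dosage) copies.
p_missed <- (1 - dosage / ploidy) ^ depths
names(p_missed) <- depths

# Probability that reads match the true 1:3 ratio exactly at depth 20,
# i.e. exactly 5 of the 20 reads carry the allele.
p_exact_20 <- dbinom(5, 20, dosage / ploidy)

print(p_missed)
print(p_exact_20)
```

Under these assumptions the chance of missing the allele falls as depth rises, but even at moderate depth the observed read ratio rarely matches the true dosage exactly.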

Why polyRAD?

If you're like me, you don't want to waste a lot of money sequencing your DNA samples at a higher depth than is necessary. You would rather spend that money adding more samples to the project, or using a different restriction enzyme to get more markers! You may have also noticed that some loci get sequenced at a much higher depth than others, which means that if you sequence the same library a second time, you aren't likely to get a lot of reads for the loci that need it most. So how can we get the maximum amount of information out of sequencing data where many loci are low depth? And, for example, if we only have five reads, how can we estimate allele dosage in a heterozygous octoploid?

The answer that polyRAD provides is a Bayesian genotype caller with many options for specifying genotype prior probabilities. When read depth is low, accurate priors make a big difference in the accuracy of genotype calls. And because some genotype calls are going to be uncertain no matter how sophisticated our algorithm is, polyRAD can export genotypes as continuous numeric variables reflecting the probabilities of all possible allele copy numbers. This includes genotypes with zero reads, where the priors themselves are used for imputation.
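The five-reads-in-an-octoploid question can be worked through by hand. The sketch below is base R only and does not call any polyRAD function. It assumes a plain binomial likelihood with no sequencing error and a Hardy-Weinberg prior for an autopolyploid; polyRAD's actual model may well differ. The read counts and the allele frequency are made-up values.

```r
# Hypothetical single-locus data: one octoploid individual with five reads,
# two of which carry the allele of interest.
ploidy <- 8
copies <- 0:ploidy             # possible allele copy numbers
allele_reads <- 2
total_reads <- 5

# Likelihood of the reads under each genotype: binomial sampling with
# success probability equal to allele dosage (sequencing error ignored).
likelihood <- dbinom(allele_reads, total_reads, copies / ploidy)

# Prior: Hardy-Weinberg proportions for an autopolyploid, assuming an
# allele frequency of 0.25 in the population (illustrative value).
allele_freq <- 0.25
prior <- dbinom(copies, ploidy, allele_freq)

# Posterior probability of each copy number via Bayes' theorem.
posterior <- likelihood * prior / sum(likelihood * prior)

# Continuous genotype: the posterior mean copy number.
posterior_mean <- sum(copies * posterior)

# With zero reads the likelihood is flat, so the posterior equals the prior.
flat_likelihood <- dbinom(0, 0, copies / ploidy)
posterior_zero_reads <- flat_likelihood * prior / sum(flat_likelihood * prior)

print(round(posterior, 3))
print(posterior_mean)
```

In this sketch the posterior mean lands between the prior mean (2 copies) and the naive estimate from the read proportion (2/5 of 8, or 3.2 copies). That is the sense in which an accurate prior matters most when depth is low.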

Genotype priors in diversity panels and natural populations:

Genotype priors in biparental mapping populations:

In particular, by using population structure and linkage to inform genotype priors on a per-individual basis, high depth markers are used by polyRAD to improve the accuracy of genotyping at low depth markers. All pipelines allow autopolyploidy, allopolyploidy, or some mixture of the two. And because non-model organisms need some love, reference genomes are optional.
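As one illustration of how a mapping population can inform priors, the sketch below derives genotype prior probabilities for F1 progeny from two parental genotypes. It is base R only and is not polyRAD's implementation. It assumes autotetraploid parents with known genotypes, random chromosome segregation, and no double reduction or preferential pairing; the parental copy numbers are hypothetical.

```r
# Hypothetical parents (assumed values): autotetraploids carrying 1 and 2
# copies of an allele.
ploidy <- 4
parent1 <- 1
parent2 <- 2
gamete_copies <- 0:(ploidy / 2)   # a gamete carries half the chromosomes

# Each gamete samples ploidy / 2 chromosomes without replacement, so the
# number of allele copies it carries is hypergeometric.
gametes1 <- dhyper(gamete_copies, parent1, ploidy - parent1, ploidy / 2)
gametes2 <- dhyper(gamete_copies, parent2, ploidy - parent2, ploidy / 2)

# Progeny prior: distribution of the sum of copies from the two gametes.
joint <- outer(gametes1, gametes2)
copy_sum <- outer(gamete_copies, gamete_copies, "+")
f1_prior <- vapply(0:ploidy,
                   function(k) sum(joint[copy_sum == k]),
                   numeric(1))
names(f1_prior) <- 0:ploidy

print(f1_prior)
```

With these parents, four copies is impossible in the progeny. A prior like this rules that genotype out no matter how few reads an individual has.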

Formats supported

To hopefully answer the question, "Can I use polyRAD?":

polyRAD requires as input the sequence read depth at each allele for each sample. Alleles must also be grouped into loci. The bioinformatics pipeline that you used for SNP discovery did not have to assume polyploidy, as long as it faithfully reported allelic read depth. Genomic alignment information is optional. Right now there are data import functions for the following formats:

Currently there are export functions for the following software. Genotypes are exported as continuous variables for these three formats. There are also functions to generate matrices of continuous or discrete genotypes, which can be used in custom export functions.
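To show the difference between continuous and discrete genotype matrices, here is a small base R sketch. It does not use polyRAD's own functions or object classes; the array layout, taxon and locus names, and probabilities are all hypothetical.

```r
# Hypothetical posterior probabilities for a diploid, stored as an array
# indexed by copy number (0, 1, 2) x taxon x locus. Values are made up.
copies <- 0:2
post <- array(
  c(0.90, 0.10, 0.00,    # taxonA, locus1
    0.05, 0.60, 0.35,    # taxonB, locus1
    0.20, 0.50, 0.30,    # taxonA, locus2
    0.00, 0.02, 0.98),   # taxonB, locus2
  dim = c(3, 2, 2),
  dimnames = list(copies = as.character(copies),
                  taxon = c("taxonA", "taxonB"),
                  locus = c("locus1", "locus2"))
)

# Continuous genotypes: posterior mean copy number for each taxon and locus.
continuous <- apply(post, c(2, 3), function(p) sum(copies * p))

# Discrete genotypes: the single most probable copy number.
discrete <- apply(post, c(2, 3), function(p) copies[which.max(p)])

print(continuous)
print(discrete)
```

Both results are taxon-by-locus matrices. The continuous version keeps the uncertainty of ambiguous calls (taxonA at locus2, for instance), while the discrete version discards it.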

There is an export function for discrete genotypes for the following software:

Installation

polyRAD depends on some Bioconductor packages. Before attempting to install polyRAD, run

source("https://bioconductor.org/biocLite.R")
biocLite("pcaMethods")

If you plan to import from VCF, also run

biocLite("VariantAnnotation")
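The biocLite script applies to the Bioconductor releases current when this was written. From Bioconductor 3.8 onward (R 3.5.0 or later), the BiocManager package from CRAN replaces it. A sketch of the equivalent commands, assuming a recent R installation:

```r
# Newer Bioconductor releases: install BiocManager from CRAN, then use it
# in place of biocLite().
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("pcaMethods")
BiocManager::install("VariantAnnotation")  # only needed for VCF import
```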

polyRAD can then be installed from CRAN with

install.packages("polyRAD")

Alternatively, if there are new features not yet on the CRAN version that you want to use, you can install the development version here on GitHub at your own risk. There are R packages such as devtools and githubinstall that facilitate installing directly from GitHub.
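For example, with devtools the development version could be installed roughly as follows. This is a sketch that assumes the repository path is lvclark/polyRAD and that the Bioconductor dependencies above are already installed.

```r
# Assumption: the development version lives in the GitHub repository
# lvclark/polyRAD. Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("lvclark/polyRAD")
```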

Tutorial

The tutorial document for the package is available on GitHub.

Citation

polyRAD is described in a preprint manuscript:

Clark LV, Lipka AE, and Sacks EJ (2018) polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids. bioRxiv, doi:10.1101/380899

Citable Zenodo DOI for the software: DOI

Version 0.1 was also presented in a poster:

Clark LV, Lipka AE, and Sacks EJ (2018) polyRAD: Genotype Calling with Uncertainty from Sequencing Data in Polyploids and Diploids. Plant and Animal Genome Conference XXVI, January 13-17, San Diego, California, USA. doi:10.13140/RG.2.2.27134.08001

Funding

This material is based upon work supported by the National Science Foundation under Grant No. 1661490. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.



lvclark/polyRAD documentation built on Oct. 3, 2018, 1:01 p.m.