VCF2RADdata: Create a RADdata Object from a VCF File
In lvclark/polyRAD: Genotype Calling with Uncertainty from Sequencing Data in Polyploids and Diploids

VCF2RADdata

R Documentation

Create a RADdata Object from a VCF File

Description

This function reads a Variant Call Format (VCF) file containing allelic read depth and SNP alignment positions, such as can be produced by TASSEL or GATK, and generates a RADdata dataset to be used for genotype calling in polyRAD.

Usage

VCF2RADdata(file, phaseSNPs = TRUE, tagsize = 80, refgenome = NULL, 
            tol = 0.01, al.depth.field = "AD", min.ind.with.reads = 200, 
            min.ind.with.minor.allele = 10, possiblePloidies = list(2), 
            taxaPloidy = 2L, contamRate = 0.001, 
            samples = VariantAnnotation::samples(VariantAnnotation::scanVcfHeader(file)),
            svparam = VariantAnnotation::ScanVcfParam(fixed = "ALT", info = NA,
                                                      geno = al.depth.field,
                                                      samples = samples), 
            yieldSize = 5000, expectedAlleles = 5e+05, expectedLoci = 1e+05,
            maxLoci = NA)

Arguments

`file`	The path to a VCF file to be read. This can be uncompressed, bgzipped using Samtools or Bioconductor, or a `TabixFile` object from Bioconductor.
`phaseSNPs`	If `TRUE`, markers that appear to have come from the same set of reads will be phased and grouped into haplotypes. Otherwise, each row of the file will be kept as a distinct marker.
`tagsize`	The read length, minus any barcode sequence, that was used for genotyping. In TASSEL, this is the same as the kmerLength option. This argument is used for grouping SNPs into haplotypes and is ignored if `phaseSNPs = FALSE`.
`refgenome`	Optional. The name of a FASTA file, or an `FaFile` object, containing the reference genome. When grouping SNPs into haplotypes, if provided this reference genome is used to insert non-variable nucleotides between the variable nucleotides in the `alleleNucleotides` slot of the `RADdata` output. Ignored if `phaseSNPs = FALSE`. Useful if exact SNP positions need to be retained for downstream analysis after genotype calling in polyRAD. In particular this argument is necessary if you plan to export genotype calls back to VCF.
`tol`	The proportion by which two SNPs can differ in read depth and still be merged into one group for phasing. Ignored if `phaseSNPs = FALSE`.
`al.depth.field`	The name of the genotype field in the VCF file that contains read depth at each allele. This should be "AD" unless your format is very unusual.
`min.ind.with.reads`	Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with reads.
`min.ind.with.minor.allele`	Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with the minor allele. When there are more than two alleles, at least two alleles must have at least this many samples with reads for the SNP to be retained.
`possiblePloidies`	A list indicating inheritance modes that might be encountered in the dataset. See `RADdata`.
`taxaPloidy`	A single integer, or an integer vector with one value per taxon, indicating ploidy. See `RADdata`.
`contamRate`	A number indicating the expected sample cross-contamination rate. See `RADdata`.
`samples`	A character vector containing the names of samples from the file to export to the `RADdata` object. The default is all samples. If a subset is provided, filtering with `min.ind.with.reads` and `min.ind.with.minor.allele` is performed within that subset. Ignored if a different `samples` argument is provided within `svparam`.
`svparam`	A `ScanVcfParam` object to be used with `readVcf`. The primary reasons to change this from the default would be 1) if you want additional FIXED or INFO fields from the file to be exported to the `locTable` slot of the `RADdata` object, and/or 2) if you only want to import particular regions of the genome, as specified with the `which` argument of `ScanVcfParam`.
`yieldSize`	An integer indicating the number of lines of the file to read at once. Increasing this number will make the function faster but consume more RAM.
`expectedAlleles`	An integer indicating the approximate number of alleles that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function.
`expectedLoci`	An integer indicating the approximate number of loci that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function.
`maxLoci`	An integer indicating the approximate maximum number of loci to return. If provided, the function will stop reading the file once it has found at least this many loci that pass filtering and phasing. This argument is intended to be used for generating small `RADdata` objects for testing purposes, and should be left `NA` under normal circumstances.

Details

This function requires the BioConductor package VariantAnnotation. See https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html for installation instructions.

If you anticipate running VCF2RADdata on the same file more than once, it is recommended to run bgzip and indexTabix from the package Rsamtools once before running VCF2RADdata. See examples. If the reference genome is large enough to require a .csi index rather than a .tbi index, after bgzipping the file you can generate the index from the bash terminal using tabix --csi from Samtools.

min.ind.with.minor.allele is used for filtering SNPs as the VCF file is read. Additionally, because phasing SNPs into haplotypes can cause some haplotypes to fail to pass this threshold, VCF2RADdata internally runs MergeRareHaplotypes with min.ind.with.haplotype = min.ind.with.minor.allele, then RemoveMonomorphicLoci, before returning the final RADdata object.

Value

A RADdata object.

Note

In the python directory of the polyRAD installation, there is a script called tassel_vcf_tags.py that can identify the full tag sequence(s) for every allele imported by VCF2RADdata.

Author(s)

Lindsay V. Clark

References

Variant Call Format specification: http://samtools.github.io/hts-specs/

TASSEL GBSv2 pipeline: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline

GATK: https://gatk.broadinstitute.org/hc/en-us

Tassel4-Poly: https://github.com/guilherme-pereira/tassel4-poly

Examples

# get the example VCF installed with polyRAD
exampleVCF <- system.file("extdata", "Msi01genes.vcf", package = "polyRAD")


# loading VariantAnnotation namespace takes >10s,
# so is excluded from CRAN checks

require(VariantAnnotation)

# Compress and index the VCF before reading, if not already done
if(!file.exists(paste(exampleVCF, "bgz", sep = "."))){
  vcfBG <- bgzip(exampleVCF)
  indexTabix(vcfBG, "vcf")
}

# Read into RADdata object
myRAD <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500)

# Example of subsetting by genomic region (first 200 kb on Chr01)
mysv <- ScanVcfParam(fixed = "ALT", info = NA, geno = "AD",
                     samples = samples(scanVcfHeader(exampleVCF)),
                     which = GRanges("01", IRanges(1, 200000)))
myRAD2 <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500,
                      svparam = mysv, yieldSize = NA_integer_)

lvclark/polyRAD documentation built on June 14, 2025, 10:08 p.m.