VCF2RADdata: Create a RADdata Object from a VCF File

View source: R/data_import.R

VCF2RADdataR Documentation

Create a RADdata Object from a VCF File

Description

This function reads a Variant Call Format (VCF) file containing allelic read depth and SNP alignment positions, such as can be produced by TASSEL or GATK, and generates a RADdata dataset to be used for genotype calling in polyRAD.

Usage

VCF2RADdata(file, phaseSNPs = TRUE, tagsize = 80, refgenome = NULL, 
            tol = 0.01, al.depth.field = "AD", min.ind.with.reads = 200, 
            min.ind.with.minor.allele = 10, possiblePloidies = list(2), 
            taxaPloidy = 2L, contamRate = 0.001, 
            samples = VariantAnnotation::samples(VariantAnnotation::scanVcfHeader(file)),
            svparam = VariantAnnotation::ScanVcfParam(fixed = "ALT", info = NA,
                                                      geno = al.depth.field,
                                                      samples = samples), 
            yieldSize = 5000, expectedAlleles = 5e+05, expectedLoci = 1e+05,
            maxLoci = NA)

Arguments

file

The path to a VCF file to be read. This can be uncompressed, bgzipped using Samtools or Bioconductor, or a TabixFile object from Bioconductor.

phaseSNPs

If TRUE, markers that appear to have come from the same set of reads will be phased and grouped into haplotypes. Otherwise, each row of the file will be kept as a distinct marker.

tagsize

The read length, minus any barcode sequence, that was used for genotyping. In TASSEL, this is the same as the kmerLength option. This argument is used for grouping SNPs into haplotypes and is ignored if phaseSNPs = FALSE.

refgenome

Optional. The name of a FASTA file, or an FaFile object, containing the reference genome. When grouping SNPs into haplotypes, if provided this reference genome is used to insert non-variable nucleotides between the variable nucleotides in the alleleNucleotides slot of the RADdata output. Ignored if phaseSNPs = FALSE. Useful if exact SNP positions need to be retained for downstream analysis after genotype calling in polyRAD. In particular this argument is necessary if you plan to export genotype calls back to VCF.

tol

The proportion by which two SNPs can differ in read depth and still be merged into one group for phasing. Ignored if phaseSNPs = FALSE.

al.depth.field

The name of the genotype field in the VCF file that contains read depth at each allele. This should be "AD" unless your format is very unusual.

min.ind.with.reads

Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with reads.

min.ind.with.minor.allele

Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with the minor allele. When there are more than two alleles, at least two alleles must have at least this many samples with reads for the SNP to be retained.

possiblePloidies

A list indicating inheritance modes that might be encountered in the dataset. See RADdata.

taxaPloidy

A single integer, or an integer vector with one value per taxon, indicating ploidy. See RADdata.

contamRate

A number indicating the expected sample cross-contamination rate. See RADdata.

samples

A character vector containing the names of samples from the file to export to the RADdata object. The default is all samples. If a subset is provided, filtering with min.ind.with.reads and min.ind.with.minor.allele is performed within that subset. Ignored if a different samples argument is provided within svparam.

svparam

A ScanVcfParam object to be used with readVcf. The primary reasons to change this from the default would be 1) if you want additional FIXED or INFO fields from the file to be exported to the locTable slot of the RADdata object, and/or 2) if you only want to import particular regions of the genome, as specified with the which argument of ScanVcfParam.

yieldSize

An integer indicating the number of lines of the file to read at once. Increasing this number will make the function faster but consume more RAM.

expectedAlleles

An integer indicating the approximate number of alleles that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function.

expectedLoci

An integer indicating the approximate number of loci that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function.

maxLoci

An integer indicating the approximate maximum number of loci to return. If provided, the function will stop reading the file once it has found at least this many loci that pass filtering and phasing. This argument is intended to be used for generating small RADdata objects for testing purposes, and should be left NA under normal circumstances.

Details

This function requires the BioConductor package VariantAnnotation. See https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html for installation instructions.

If you anticipate running VCF2RADdata on the same file more than once, it is recommended to run bgzip and indexTabix from the package Rsamtools once before running VCF2RADdata. See examples. If the reference genome is large enough to require a .csi index rather than a .tbi index, after bgzipping the file you can generate the index from the bash terminal using tabix --csi from Samtools.

min.ind.with.minor.allele is used for filtering SNPs as the VCF file is read. Additionally, because phasing SNPs into haplotypes can cause some haplotypes to fail to pass this threshold, VCF2RADdata internally runs MergeRareHaplotypes with min.ind.with.haplotype = min.ind.with.minor.allele, then RemoveMonomorphicLoci, before returning the final RADdata object.

Value

A RADdata object.

Note

In the python directory of the polyRAD installation, there is a script called tassel_vcf_tags.py that can identify the full tag sequence(s) for every allele imported by VCF2RADdata.

Author(s)

Lindsay V. Clark

References

Variant Call Format specification: http://samtools.github.io/hts-specs/

TASSEL GBSv2 pipeline: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline

GATK: https://gatk.broadinstitute.org/hc/en-us

Tassel4-Poly: https://github.com/guilherme-pereira/tassel4-poly

See Also

MakeTasselVcfFilter for filtering to a smaller VCF file before reading with VCF2RADdata.

To export to VCF: RADdata2VCF

Other data import functions: readStacks, readHMC, readTagDigger, readTASSELGBSv2, readProcessIsoloci, readDArTag

Examples

# get the example VCF installed with polyRAD
exampleVCF <- system.file("extdata", "Msi01genes.vcf", package = "polyRAD")


# loading VariantAnnotation namespace takes >10s,
# so is excluded from CRAN checks

require(VariantAnnotation)

# Compress and index the VCF before reading, if not already done
if(!file.exists(paste(exampleVCF, "bgz", sep = "."))){
  vcfBG <- bgzip(exampleVCF)
  indexTabix(vcfBG, "vcf")
}

# Read into RADdata object
myRAD <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500)

# Example of subsetting by genomic region (first 200 kb on Chr01)
mysv <- ScanVcfParam(fixed = "ALT", info = NA, geno = "AD",
                     samples = samples(scanVcfHeader(exampleVCF)),
                     which = GRanges("01", IRanges(1, 200000)))
myRAD2 <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500,
                      svparam = mysv, yieldSize = NA_integer_)


lvclark/polyRAD documentation built on Jan. 15, 2024, 4:19 a.m.