The Variant Call Format, or VCF, is the standard in bioinformatics for storing information about DNA sequence variation. One VCF file can contain genotypes across many loci and individuals. Because VCF is highly flexible, widely used, and self-documenting, we adopt it here as the primary format for transferring data among ploidyverse packages, and between the ploidyverse and other software.
VCF is stored as tab-delimited text, with a corresponding binary format called BCF. Here we describe guidelines for VCF files imported into or exported from ploidyverse packages, and generally set standards for storing genotype data from polyploid organisms. These guidelines are meant to make it easier to transfer data among software and researchers, rather than impose restrictions. If your needs deviate in some way from what is described here and you would like to suggest some changes, you may file an issue or pull request on GitHub, or start a discussion on the ploidyverse Google Group.
Additionally, the
VariantAnnotation
package, from the Bioconductor collection of R packages, implements flexible
S4 classes for storing any or all data from a VCF file within an R object, and
has functions for reading and writing VCF files. The ploidyverseVcf
R
package extends VariantAnnotation
and is designed to assist package
developers and users with meeting the ploidyverse VCF guidelines.
VCF data within the ploidyverse should:
For VCFs containing genotype calls, it is important to include information on what software and model was used for generating those calls. ploidyverse VCFs should include a header formatted as follows:
##ploidyverse=<ID=GenotypeCalls,Software=EBG,Version=1,Model=hwe,Description="Genotype calls generated from EBG">
Additional lines can be added for other pieces of software used in the
pipeline for generating the VCF. Lines with the following ID
values are
suggested:
SNPdiscovery
: How variant loci were identified.ReadDepth
: How allelic read depth was calculated for each sample and locus.GenotypeCalls
: How genotypes and genotype posterior probabilities were
inferred from allelic read depth.Phasing
: How alleles were arranged into haplotypes.Imputation
: How missing genotypes were imputed.In ploidyverseVcf
, the software
function can be used to add these lines.
If you aren't studying a model organism, it is likely that you are using a reference genome that is likely to be updated in the next few years, and/or a reference genome for a related species rather than the actual species that you are studying. Therefore it is important to indicate what reference was used.
Lines beginning with ##contig
provide information about each contig (usually,
each chromosome) that was used for alignment. Below are examples of contig
lines that should be found in a ploidyverse VCF.
##contig=<ID=Chr01,length=150798994,assembly="Miscanthus sinensis v7.1 on Phytozome">
##contig=<ID=Chr02,length=146084461,assembly="Miscanthus sinensis v7.1 on Phytozome">
Optionally, you may wish to indicate the file name for the reference genome with a line formatted as follows:
##reference=Msinensis_DH1_v7_0.fasta
If you are using a non-reference pipeline, include the following line:
##contig=<ID=NonRef>
And use "NonRef" in the CHROM column for all variants in the genotype table.
In VariantAnnotation
, data from contig lines can be retrieved or assigned with
the seqinfo
function.
seqnames(seqinfo(vcf)) # The 'ID' field
seqlengths(seqinfo(vcf)) # The 'length' field
genome(seqinfo(vcf)) # The 'assembly' field
A URL and MD5 checksum can also be specified in the contig lines, although I am
not sure that these are currently supported by VariantAnnotation
.
The INFO
column of the genotypes table can be used to store
information about each variant in addition to what is already found in the
CHROM
, POS
, REF
, ALT
, QUAL
, and FILTER
columns. Header lines
beginning with ##INFO
describe all additional fields in the INFO
column.
When there is no reference genome, ploidyverse VCFs require an INFO
field
called TAG
that provides the sequence tag flanking the site. When there
is a reference genome, we still recommend this field but do not require it.
For reduced representation technologies such as genotyping-by-sequencing (GBS)
or restriction site-associated DNA sequencing (RADseq), the entire sequence
tag beginning with the restriction cut site should be provided. Either the
reference allele or most common allele should be represented here. The
purpose is to facilitate alignment to future reference genomes and
cross-referencing to other projects using the same reduced representation
technology.
The following lines define these INFO
fields:
##INFO=<ID=TAG,Number=1,Type=String,Description="DNA sequence of reference RAD tag">
##INFO=<ID=TAGREVCOMPL,Number=0,Type=Flag,Description="The tag aligns to the bottom strand of the reference genome">
##INFO=<ID=TAGPOS,Number=1,Type=Integer,Description="Position of variant with respect to the beginning of the RAD tag">
Need to add details about how to infer the variant tags from this information
In VCF, sample names are used for column headers in the genotype table.
It is also possible to store more data about samples in header lines that
begin with ##SAMPLE
. (See "Sample field format" in the VCF specification.)
ploidyverse VCFs require the following information for each sample:
The following header lines define the above two fields and are mandatory:
##META=<ID=Species,Type=String,Number=.,Description="Species name">
##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome">
Optional fields:
SAMN00000000
.doi:xxxxx
or https://doi.org/xxxxx
format.##META
lines are included that
contain descriptive definitions of those fields. If you have pertinent
information about your samples stored in a spreadsheet, it should go here.Below are suggested header lines for these optional fields:
##META=<ID=CollectionLocation,Type=String,Number=.,Description="Collection location of wild samples">
##META=<ID=MaterialProvider,Type=String,Number=.,Description="Person or organization curating biological materials">
##META=<ID=PopulationDesign,Type=String,Number=.,Values=[Mother, Father, Progeny, Control],Description="Identity within artificial population">
##META=<ID=BioSample,Type=String,Number=.,Description="Accession number in NCBI BioSample database">
##META=<ID=DOI,Type=String,Number=.,Description="Digital Object Identifier">
And below are examples of what the sample lines might look like:
##SAMPLE=<ID=PMS-014,Species=Miscanthus sinensis,Ploidy=2x,BioSample=SAMN02213066,CollectionLocation=29.657 N 109.1195 E,Description="Miscanthus sinensis 'PMS-014'">
##SAMPLE=<ID=JM2014-S-4,Species=Miscanthus x giganteus,Ploidy=3x,BioSample=SAMN05752168,CollectionLocation=34.3014 N 134.1304 E,Description="Miscanthus x giganteus 'JM2014-S-4'">
In the above example, "PMS-014" and "JM2014-S-4" would also be found as column headers for the genotype table.
Within the VariantAnnotation
package, the table of sample information can be
accessed by:
meta(header(vcf))$SAMPLE
or, using the ploidyverseVcf
package:
sampleinfo(vcf)
where vcf
is the name of a vcf object.
Within the tab-delimited genotype table, for every sample by locus, multiple
pieces of information can be stored, separated by a colon (:
). The FORMAT
column indicates which fields are stored in the table and what order they are
in. See the official VCF specification for more information.
In VariantAnnotation
, data from the genotype table can be retrieved using the
geno
function, which returns a named list of matrices. Names of the list
correspond to the genotype fields found in the file. Each matrix stores data
from the table for that field. For fields such as AD
or GP
that can store
multiple comma-separated values, the matrix is a two-dimensional list of
vectors.
In keeping with the VCF specification and with existing software for calling
genotypes in polyploids, ploidyverse VCFs code unphased genotypes in the format
0/0/0/1
and phased genotypes in the format 0|0|0|1
. The example given is a
tetraploid with three copies of the reference allele and one copy of the first
alternative allele. The genotype with the highest posterior probability should
be given in the GT field.
This is the only genotype format field that is mandatory for VCFs that are to be used as input to a ploidyverse genotype calling pipeline. It is also mandatory for VCFs output by the genotype calling pipelines, in order to enable independent evaluation of genotyping quality.
This field contains integers separated by commas. The first integer corresponds
to the reference allele, the second integer to the first alternative allele,
the third integer to the second alternative allele, and so on. Every integer
indicates the read depth for the corresponding allele. So, if the reference
allele is A and there is only one alternative allele G, a value of 12,4
would indicate 12 reads of A and four reads of G. In a tetraploid that value
in the AD field might correspond to the genotype 0/0/0/1
.
This field contains genotype posterior probabilities, ranging from zero to one and summing to one, for all possible genotypes. These are comma-separated. In order to conserve disk space, we recommend rounding to the nearest 0.001.
See the VCF specification for the ordering of genotypes. For example, in a tetraploid with one alternative allele, there are five possible genotypes, and a posterior probability must be provided for each one.
If genotypes were called with a likelihood approach rather than a Bayesian approach, uniform priors can be used to convert likelihoods to posterior probabilities.
This field is needed only if SNPs are phased. If the data were produced by RAD-seq or similar reduced-representation DNA sequencing technology, we highly recommend phasing SNPs that were called within the span of one RAD tag, i.e. alleles that can be unequivocally phased because they were on the same sequencing read.
As suggested in the VCF specification, the phase set ID should be the alignment position of the first SNP in the phase set. For non-reference pipelines, phase set IDs can simply be numbered sequentially from one.
This is a custom field not described in the VCF specification, containing a
numeric genotype ranging from 0 to 1, indicating the estimated intra-individual
allele frequency for the alternative allele. For example, if the genotype
0/0/0/1
was called with high confidence, the value would be 0.25.
Generally this value should be the posterior mean genotype, i.e. the copy number of the alternative allele multiplied by the posterior probability of that genotype, summed across all possible genotypes, divided by the ploidy. If the value for this field is calculated differently, it should be indicated in the file header.
This field should have one value for each alternative allele, separated by commas when there are multiple alternative alleles.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.