scanVcf-methods: Import VCF files

scanVcfR Documentation

Import VCF files

Description

Import Variant Call Format (VCF) files in text or binary format

Usage

scanVcfHeader(file, ...)
## S4 method for signature 'character'
scanVcfHeader(file, ...)

scanVcf(file, ..., param)
## S4 method for signature 'character,ScanVcfParam'
scanVcf(file, ..., param)
## S4 method for signature 'character,missing'
scanVcf(file, ..., param)
## S4 method for signature 'connection,missing'
scanVcf(file, ..., param)

## S4 method for signature 'TabixFile'
scanVcfHeader(file, ...)
## S4 method for signature 'TabixFile,missing'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,ScanVcfParam'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,GRanges'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,IntegerRangesList'
scanVcf(file, ..., param)

Arguments

file

For scanVcf and scanVcfHeader, the character() file name, TabixFile, or class connection (file() or bgzip()) of the ‘VCF’ file to be processed.

param

A instance of ScanVcfParam influencing which records are parsed and the ‘INFO’ and ‘GENO’ information returned.

...

Additional arguments for methods

Details

The argument param allows portions of the file to be input, but requires that the file be bgzip'd and indexed as a TabixFile.

scanVcf with param="missing" and file="character" or file="connection" scan the entire file. With file="connection", an argument n indicates the number of lines of the VCF file to input; a connection open at the beginning of the call is open and incremented by n lines at the end of the call, providing a convenient way to stream through large VCF files.

The INFO field of the scanned VCF file is returned as a single ‘packed’ vector, as in the VCF file. The GENO field is a list of matrices, each matrix corresponds to a field as defined in the FORMAT field of the VCF header. Each matrix has as many rows as scanned in the VCF file, and as many columns as there are samples. As with the INFO field, the elements of the matrix are ‘packed’. The reason that INFO and GENO are returned packed is to facilitate manipulation, e.g., selecting particular rows or samples in a consistent manner across elements.

Value

scanVcfHeader returns a VCFHeader object with header information parsed into five categories, samples, meta, fixed, info and geno. Each can be accessed with a ‘getter’ of the same name (e.g., info(<VCFHeader>)). If the file header has multiple rows with the same name (e.g., 'source') the row names of the DataFrame are made unique in the usual way, 'source', 'source.1' etc.

scanVcf returns a list, with one element per range. Each list has 7 elements, obtained from the columns of the VCF specification:

rowRanges

GRanges instance derived from CHROM, POS, ID, and the width of REF

REF

reference allele

ALT

alternate allele

QUAL

phred-scaled quality score for the assertion made in ALT

FILTER

indicator of whether or not the position passed all filters applied

INFO

additional information

GENO

genotype information immediately following the FORMAT field in the VCF

The GENO element is itself a list, with elements corresponding to those defined in the VCF file header. For scanVcf, elements of GENO are returned as a matrix of records x samples; if the description of the element in the file header indicated multiplicity other than 1 (e.g., variable number for “A”, “G”, or “.”), then each entry in the matrix is a character string with sub-entries comma-delimited.

Author(s)

Martin Morgan and Valerie Obenchain>

References

http://vcftools.sourceforge.net/specs.html outlines the VCF specification.

http://samtools.sourceforge.net/mpileup.shtml contains information on the portion of the specification implemented by bcftools.

http://samtools.sourceforge.net/ provides information on samtools.

See Also

readVcf BcfFile TabixFile

Examples

  fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
  scanVcfHeader(fl)
  vcf <- scanVcf(fl)
  ## value: list-of-lists
  str(vcf)
  names(vcf[[1]][["GENO"]])
  vcf[[1]][["GENO"]][["GT"]]

Bioconductor/VariantAnnotation documentation built on Nov. 2, 2024, 7:22 a.m.