scanVcf | R Documentation |
Import Variant Call Format (VCF) files in text or binary format
scanVcfHeader(file, ...)
## S4 method for signature 'character'
scanVcfHeader(file, ...)
scanVcf(file, ..., param)
## S4 method for signature 'character,ScanVcfParam'
scanVcf(file, ..., param)
## S4 method for signature 'character,missing'
scanVcf(file, ..., param)
## S4 method for signature 'connection,missing'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile'
scanVcfHeader(file, ...)
## S4 method for signature 'TabixFile,missing'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,ScanVcfParam'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,GRanges'
scanVcf(file, ..., param)
## S4 method for signature 'TabixFile,IntegerRangesList'
scanVcf(file, ..., param)
file |
For |
param |
An instance of |
... |
Additional arguments for methods |
The argument param allows portions of the file to be input, but requires that the file be bgzip'd and indexed as a TabixFile.
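The requirement above can be illustrated with a minimal sketch. It assumes the Rsamtools helpers bgzip, indexTabix and TabixFile are available (Rsamtools is attached along with VariantAnnotation), writes the compressed copy to a temporary file, and uses two ranges on chromosome 20 chosen purely for illustration.

```r
library(VariantAnnotation)
library(Rsamtools)

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")

## Compress with bgzip and build a tabix index next to the temporary copy
zipped <- bgzip(fl, tempfile(fileext=".vcf.gz"))
idx <- indexTabix(zipped, format="vcf")
tbx <- TabixFile(zipped, idx)

## Illustrative ranges only; any ranges on sequences present in the file work
rngs <- GRanges("20", IRanges(c(14370, 1110000), c(17330, 1234600)))

## Only records overlapping the ranges are read
res <- scanVcf(tbx, param=rngs)
length(res)
```

The result has one list element per range, so subsetting by range happens at input time rather than after the whole file has been read.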
scanVcf with param="missing" and file="character" or file="connection" scans the entire file. With file="connection", an argument n indicates the number of lines of the VCF file to input; a connection that is open at the beginning of the call remains open and is advanced by n lines at the end of the call, providing a convenient way to stream through large VCF files.
The INFO field of the scanned VCF file is returned as a single ‘packed’ vector, as in the VCF file. The GENO field is a list of matrices; each matrix corresponds to a field as defined in the FORMAT field of the VCF header. Each matrix has as many rows as records scanned in the VCF file, and as many columns as there are samples. As with the INFO field, the elements of the matrix are ‘packed’. INFO and GENO are returned packed to facilitate manipulation, e.g., selecting particular rows or samples in a consistent manner across elements.
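As a rough sketch of working with packed values, the base R helper below splits a packed INFO string into its key=value parts. The helper name unpack_info and the two example strings are hypothetical and are not part of the package; flags that carry no value are recorded here as "TRUE" by assumption.

```r
## Hypothetical helper: split one packed INFO string into a named
## character vector. Flags without "=" (e.g. "DB") become "TRUE".
unpack_info <- function(x) {
    fields <- strsplit(x, ";", fixed=TRUE)[[1]]
    keys <- sub("=.*$", "", fields)
    values <- ifelse(grepl("=", fields, fixed=TRUE),
                     sub("^[^=]*=", "", fields),
                     "TRUE")
    setNames(values, keys)
}

## Illustrative packed INFO strings, in the style of a VCF file
info <- c("NS=3;DP=14;AF=0.5;DB;H2",
          "NS=2;DP=10;AF=0.333,0.667;AA=T;DB")

unpacked <- lapply(info, unpack_info)

## Multi-valued entries stay comma-delimited until split explicitly
af <- as.numeric(strsplit(unpacked[[2]][["AF"]], ",", fixed=TRUE)[[1]])
af
```

Because the packed vector keeps one string per record, rows can be selected first and unpacked afterwards, so only the records of interest are parsed.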
scanVcfHeader returns a VCFHeader object with header information parsed into five categories: samples, meta, fixed, info and geno. Each can be accessed with a ‘getter’ of the same name (e.g., info(<VCFHeader>)). If the file header has multiple rows with the same name (e.g., 'source') the row names of the DataFrame are made unique in the usual way, 'source', 'source.1' etc.
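A short sketch of the getters described above, using the ex2.vcf file shipped with the package:

```r
library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
hdr <- scanVcfHeader(fl)

## Each category has a getter of the same name
samples(hdr)   # sample names
info(hdr)      # one row per INFO field declared in the header
geno(hdr)      # one row per FORMAT (genotype) field declared in the header
```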
scanVcf returns a list, with one element per range. Each element is itself a list of 7 elements, obtained from the columns of the VCF specification:
a GRanges instance derived from CHROM, POS, ID, and the width of REF
REF: reference allele
ALT: alternate allele
QUAL: phred-scaled quality score for the assertion made in ALT
FILTER: indicator of whether or not the position passed all filters applied
INFO: additional information
GENO: genotype information immediately following the FORMAT field in the VCF
The GENO element is itself a list, with elements corresponding to those defined in the VCF file header. For scanVcf, elements of GENO are returned as a matrix of records x samples; if the description of the element in the file header indicated multiplicity other than 1 (e.g., variable number for “A”, “G”, or “.”), then each entry in the matrix is a character string with sub-entries comma-delimited.
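To show how comma-delimited entries might be handled, the sketch below builds a small records x samples character matrix by hand and extracts the first sub-entry of each cell. The matrix hq, its sample names, and the helper split_entry are made up for illustration; treating "." as missing follows the VCF convention.

```r
## Hypothetical packed GENO matrix: 2 records (rows) x 3 samples (columns)
hq <- matrix(c("51,51", "58,50", "48,48", "65,3", "43,5", "."),
             nrow=2, dimnames=list(NULL, c("S1", "S2", "S3")))

## Split one entry into integers; "." marks a missing value
split_entry <- function(x) {
    if (x == ".") return(NA_integer_)
    as.integer(strsplit(x, ",", fixed=TRUE)[[1]])
}

## Keep the records x samples shape, taking the first sub-entry per cell
first <- apply(hq, c(1, 2), function(x) split_entry(x)[1])
first
```

Row and column subsetting can be applied to hq before splitting, which keeps selections consistent across GENO elements.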
Martin Morgan and Valerie Obenchain
http://vcftools.sourceforge.net/specs.html outlines the VCF specification.
http://samtools.sourceforge.net/mpileup.shtml contains information on the portion of the specification implemented by bcftools.
http://samtools.sourceforge.net/ provides information on samtools.
readVcf
BcfFile
TabixFile
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
scanVcfHeader(fl)
vcf <- scanVcf(fl)
## value: list-of-lists
str(vcf)
names(vcf[[1]][["GENO"]])
vcf[[1]][["GENO"]][["GT"]]