Description Usage Arguments Details Value Author(s) References See Also Examples
Read Variant Call Format (VCF) files
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | ## S4 method for signature 'TabixFile,ScanVcfParam'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'TabixFile,IntegerRangesList'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'TabixFile,GRanges'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'TabixFile,GRangesList'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'TabixFile,missing'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'character,ANY'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'character,missing'
readVcf(file, genome, param,
..., row.names=TRUE)
## S4 method for signature 'character,missing'
readVcf(file, genome, param,
..., row.names=TRUE)
## Lightweight functions to read a single variable
readInfo(file, x, param=ScanVcfParam(), ..., row.names=TRUE)
readGeno(file, x, param=ScanVcfParam(), ..., row.names=TRUE)
readGT(file, nucleotides=FALSE, param=ScanVcfParam(), ..., row.names=TRUE)
## Import wrapper
## S4 method for signature 'VcfFile,ANY,ANY'
import(con, format, text, ...)
|
file |
A Use of the |
genome |
A
|
param |
An instance of |
x |
|
row.names |
A |
nucleotides |
A |
con |
The |
format, text |
Ignored. |
... |
Additional arguments, passed to methods. For
|
readVcf
imports records from bzip compressed or uncompressed
VCF files. Data are parsed into a VCF
object
using the file header information if available. To import a subset
of ranges the VCF must have an index file. An index file can be
created with bzip
and indexVcf
functions.
The readInfo
, readGeno
and readGT
functions
are lightweight versions of readVcf
and import a single
variable. The return object is a vector, matrix or CompressedList
instead of a VCF class.
readVcf
calls scanVcf
, the details of which can be
found with ?scanVcf
.
readVcf() reads and parses fields according to the multiplicity and
data type specified in the header lines. Fields without header lines are
skipped (not read or parsed). To see what fields are present in the
header use scanVcfHeader()
. See ?VCFHeader
for more details.
Passing verbose = TRUE
to readVcf()
prints the fields
with header lines that will be parsed by readVcf
.
CHROM, POS, ID and REF fields are used to create the GRanges
stored in the VCF
object and accessible with the rowRanges
accessor.
REF, ALT, QUAL and FILTER are parsed into the DataFrame
in the
fixed
slot. Because ALT can have more than one value per variant
it is represented as a DNAStringSetList
. REF is a DNAStringSet
,
QUAL is numeric
and FILTER is a character
. Accessors include
fixed
, ref
, alt
, qual
, and filt
.
Data from the INFO field can be accessed with the info
accessor.
Genotype data (i.e., data immediately following the FORMAT field in the
VCF) can be accessed with the geno
accessor. INFO and genotype data
types are determined according to the ‘Number’ and ‘Type’
information in the file header as follows:
‘Number’ should only be 0 when ‘Type’ is 'flag'. These fields are parsed as logical vectors.
If ‘Number’ is 1, ‘info’ data are parsed into a
vector
and ‘geno’ into a matrix
.
If ‘Number’ is >1, ‘info’ data are parsed into a
DataFrame
with the same number of columns. ‘geno’ are
parsed into an array
with the same dimensions as ‘Number’.
Columns of the ‘geno’ matrices are the samples.
If ‘Number’ is ‘.’, ‘A’ or ‘G’,
both ‘info’ and ‘geno’ data are parsed into a matrix
.
When the header does not contain any ‘INFO’ lines, the data are returned as a single, unparsed column.
Missing data in VCF files on disk are represented by a dot (".").
readVcf
retains the dot as a character string for data type
character and converts it to NA
for data types numeric or double.
Because the data are stored in rectangular data structures there is a
value for each info
and geno
field element in the VCF
class. If the element was missing or was not collected for a particular
variant the value will be NA
.
In the case of the ALT field we have the following treatment of special characters / missing values:
'.' true missings become empty characters
'*' are treated as missing and become empty characters
'I' are treated as undefined and become '.'
Subsets of data (i.e., specific variables, positions or samples) can
be read from a VCF file by providing a ScanVcfParam
object in
the call to readVcf
. Other lightweight options are the
readGT
, readInfo
and readGeno
functions which
return data as a matrix instead of the VCF
class.
Another option for handling large files is to iterate through the
data in chunks by setting the yieldSize
parameter in a
VcfFile
object. Iteration can be through all data fields or
a subset defined by a ScanVcfParam
. See example below,
'Iterating through VCF with yieldSize'.
readVcf
returns a VCF
object. See ?VCF
for
complete details of the class structure. readGT
, readInfo
and
readGeno
return a matrix
.
The CHROM, POS, ID and REF fields are used to create a GRanges
object. Ranges are created using POS as the start value and width of
the reference allele (REF). By default, the IDs become the rownames
('row.names = FALSE' to turn this off). If IDs are
missing (i.e., ‘.’) a string of CHROM:POS_REF/ALT is used instead.
The genome
argument is stored in the seqinfo of the GRanges
and can be accessed with genome(<VCF>)
.
One metadata column, paramRangeID
, is included with the
rowRanges
. This ID is meaningful when multiple ranges are
specified in the ScanVcfParam
and distinguishes which records
match each range.
REF, ALT, QUAL and FILTER fields of the VCF are parsed into a
DataFrame
.
REF is returned as a DNAStringSet.
ALT is a CharacterList when it contains structural variants and a DNAStringSetList otherwise. See also the 'Details' section for 'Missing data'.
Data from the INFO field of the VCF is parsed into a DataFrame
.
If present, the genotype data are parsed into a list of matrices
or arrays
. Each list element represents a field in the FORMAT
column of the VCF file. Rows are the variants, columns are the samples.
This slot contains a DataFrame
describing the samples. If present,
the sample names following FORMAT in the VCF file become the row names.
Header information present in the file is put into a list
in metadata
.
See references for complete details of the VCF file format.
Valerie Obenchain>
http://vcftools.sourceforge.net/specs.html outlines the VCF specification.
http://samtools.sourceforge.net/mpileup.shtml contains
information on the portion of the specification implemented by
bcftools
.
http://samtools.sourceforge.net/ provides information on
samtools
.
indexVcf
,
VcfFile
,
indexTabix
,
TabixFile
,
scanTabix
,
scanBcf
,
expand,CollapsedVCF-method
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## vcf <- readVcf(fl, c("20"="hg19")) ## 'genome' as named vector
## ---------------------------------------------------------------------
## Header and genome information
## ---------------------------------------------------------------------
vcf
## all header information
hdr <- header(vcf)
## header information for 'info' and 'fixed' fields
info(hdr)
fixed(hdr)
## ---------------------------------------------------------------------
## Accessors
## ---------------------------------------------------------------------
## fixed fields together
head(fixed(vcf), 5)
## fixed fields separately
filt(vcf)
ref(vcf)
## info data
info(hdr)
info(vcf)
info(vcf)$DP
## geno data
geno(hdr)
geno(vcf)
head(geno(vcf)$GT)
## genome
unique(genome(rowRanges(vcf)))
## ---------------------------------------------------------------------
## Data subsets with lightweight read* functions
## ---------------------------------------------------------------------
## Import a single 'info' or 'geno' variable
DP <- readInfo(fl, "DP")
HQ <- readGeno(fl, "HQ")
## Import GT as numeric representation
GT <- readGT(fl)
## Import GT as nucleotides
GT <- readGT(fl, nucleotides=TRUE)
## ---------------------------------------------------------------------
## Data subsets with ScanVcfParam
## ---------------------------------------------------------------------
## Subset on genome coordinates:
## 'file' must have an index
rngs <- GRanges("20", IRanges(c(14370, 1110000), c(17330, 1234600)))
names(rngs) <- c("geneA", "geneB")
param <- ScanVcfParam(which=rngs)
compressVcf <- bgzip(fl, tempfile())
tab <- indexVcf(compressVcf)
vcf <- readVcf(tab, "hg19", param)
## When data are subset by range ('which' argument in ScanVcfParam),
## the 'paramRangeID' column provides a map back to the original
## range in 'param'.
rowRanges(vcf)[,"paramRangeID"]
vcfWhich(param)
## Subset on samples:
## Consult the header for the sample names.
samples(hdr)
## Specify one or more names in 'samples' in a ScanVcfParam.
param <- ScanVcfParam(samples="NA00002")
vcf <- readVcf(tab, "hg19", param)
geno(vcf)$GT
## Subset on 'fixed', 'info' or 'geno' fields:
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "HQ"), info=c("NS", "AF"))
vcf_tab <- readVcf(tab, "hg19", param)
info(vcf_tab)
geno(vcf_tab)
## No ranges are specified in the 'param' so tabix file is not
## required. Instead, the uncompressed VCF can be used as 'file'.
vcf_fname <- readVcf(fl, "hg19", param)
## The header will always contain information for all variables
## in the original file reguardless of how the data were subset.
## For example, all 'geno' fields are listed in the header
geno(header(vcf_fname))
## but only 'GT' and 'HQ' are present in the VCF object.
geno(vcf_fname)
## Subset on both genome coordinates and 'info', 'geno' fields:
param <- ScanVcfParam(geno="HQ", info="AF", which=rngs)
vcf <- readVcf(tab, "hg19", param)
## When any of 'fixed', 'info' or 'geno' are omitted (i.e., no
## elements specified) all records are retrieved. Use NA to indicate
## that no records should be retrieved. This param specifies
## all 'fixed fields, the "GT" 'geno' field and none of 'info'.
ScanVcfParam(geno="GT", info=NA)
## ---------------------------------------------------------------------
## Iterate through VCF with 'yieldSize'
## ---------------------------------------------------------------------
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
tab <- VcfFile(fl, yieldSize=4000)
open(tab)
while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
cat("vcf dim:", dim(vcf_yield), "\n")
close(tab)
## ---------------------------------------------------------------------
## Debugging with 'verbose'
## ---------------------------------------------------------------------
## readVcf() uses information in the header lines to parse the data to
## the correct number and type. Fields without header lines are skipped.
## If a call to readVcf() results in no info(VCF) or geno(VCF) data the
## file may be missing header lines. Set 'verbose = TRUE' to get
## a listing of fields found in the header.
## readVcf(myfile, "mygenome", verbose=TRUE)
## Header fields can also be discovered with scanVcfHeader().
hdr <- scanVcfHeader(fl)
geno(hdr)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.