options(showHeadLines=3) options(showTailLines=3)
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
VCFArray is a Bioconductor package that represents VCF files as
objects derived from the DelayedArray package and DelayedArray
class. It converts data entries from VCF file into a
DelayedArray
-derived data structure. The backend VCF file could
either be saved on-disk locally or remote as online resources. Data
entries that could be extracted include the fixed data fields (REF,
ALT, QUAL, FILTER), information field (e.g., AA, AF...), and the
individual format field (e.g., GT, DP...). The array data generated
from fixed/information fields are one-dimensional VCFArray
, with the
dimension being the length of the variants. The array data generated
from individual FORMAT
field are always returned with the first
dimension being variants
and the second dimension being
samples
. This feature is consistent with the assay data saved in
SummarizedExperiment
, and makes the VCFArray
package interoperable
with other established Bioconductor data infrastructure.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("VCFArray")
The development version is also available to download from Github.
BiocManager::install("Bioconductor/VCFArray")
library(VCFArray)
To construct a VCFArray
object, 4 arguments are needed: file
,
vindex
and name
, and pfix
. The file
argument could take either
a character string (VCF file name), or VcfFile
object, or a
RangedVcfStack
object. name
argument must be specified to indicate
which data entry we want to extract from the input file. It's
case-sensitive, and must be consistent with the names from VCF header
file. vindex
argument will only be used to indicate the file path
of the index file if it does not exist. pfix
is used to spefify the
category that the name
field belongs to. NOTE that the pfix
needs to be provided specifically when there are same name
in
multiple categories, otherwise, error will return.
The vcfFields()
method takes the VCF file path, VcfFile
object or
RangedVcfStack
object as input, and returns a CharacterList with all
available VCF fields within specific categories. Users should consult
the fixed
, info
and geno
category for available data entries
that could be converted into VCFArray
instances. The data entry
names can be used as input for the name
argument in VCFArray
constructor.
args(VCFArray) fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation") library(VariantAnnotation) vcfFields(fl)
Since the index file for our vcf file already exists, the vindex
argument would not be needed (which is the most common case for
on-disk VCF files). So we can construct the VCFArray
object for the
GT
data entry in the provided VCF file with arguments of file
and
name
only.
VCFArray(file = fl, name = "GT")
We can also construct a VCFArray
object with the file
argument
being a VcfFile
object.
vcf <- VariantAnnotation::VcfFile(fl) VCFArray(file = vcf, name = "DS")
The file
argument could also take RangedVcfStack
object. Note that
an ordinary VcfStack
object without Range
information could not
be used to construct a VCFArray
.
extdata <- system.file(package = "GenomicFiles", "extdata") files <- dir(extdata, pattern="^CEUtrio.*bgz$", full=TRUE)[1:2] names(files) <- sub(".*_([0-9XY]+).*", "\\1", basename(files)) seqinfo <- as(readRDS(file.path(extdata, "seqinfo.rds")), "Seqinfo") stack <- GenomicFiles::VcfStack(files, seqinfo) gr <- as(GenomicFiles::seqinfo(stack)[rownames(stack)], "GRanges") ## RangedVcfStack rgstack <- GenomicFiles::RangedVcfStack(stack, rowRanges = gr) rgstack
Here we choose the name = SB
, which returns a 3-dimensional
VCFArray
object, with the first 2 dimensions correspond to variants
and samples respectively.
vcfFields(rgstack)$geno VCFArray(rgstack, name = "SB")
As the vignette title suggest, the backend VCF file could also be
remote files. Here we included an example of representing VCF file of
chromosome 22 from the 1000 Genomes Project (Phase 3). NOTE that for
a remote VCF file, the vindex
argument must be specified. Since
this VCF files is relatively big, and it takes longer time, we only
show the code here without evaluation.
chr22url <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" chr22url.tbi <- paste0(chr22url, ".tbi") va <- VCFArray(chr22url, vindex =chr22url.tbi, name = "GT")
VCFArray
represents VCF files as DelayedArray
instances. It has
methods like dim
, dimnames
defined, and it inherits array-like
operations and methods from DelayedArray
, e.g., the subsetting
method of [
.
NOTE that for 1-dimensional VCFArray
objects that are generated
from the fixed / information data field of VCF file, drop = FALSE
should always be used with [
subsetting to ensure VCFArray
object
as returned value.
seed
returns the VCFArraySeed
of the VCFArray
object, which
includes information about the backend VCF file, e.g., the vcf file
path, index file path, name of the data entry (with a prefix of
category), dimension and etc.
va <- VCFArray(fl, name = "GT") seed(va)
vcffile
returns the VcfFile
object corresponding to the backend
VCF file.
vcffile(va)
dim()
and dimnames()
The dimnames(VCFArray)
returns an unnamed list, with the length of
each element being the same as return from dim(VCFArray)
.
va <- VCFArray(fl, name = "GT") dim(va) class(dimnames(va)) lengths(dimnames(va))
[
subsettingVCFArray
instances can be subsetted, following the usual R
conventions, with numeric or logical vectors; logical vectors are
recycled to the appropriate length.
va[1:3, 1:3] va[c(TRUE, FALSE), ]
Numeric calculations could be evaluated on VCFArray
objects.
ds <- VCFArray(fl, name = "DS") log(ds+5)
The VCFArraySeed
class represents the 'seed' for the VCFArray
object. It is not exported from the VCFArray package. Seed objects
should contain the VCF file path, and are expected to satisfy the
“seed contract” of DelayedArray, i.e. to support dim()
and
dimnames()
.
seed <- VCFArray:::VCFArraySeed(fl, name = "GT", pfix = NULL) seed path(vcffile(seed))
The seed can be used to construct a VCFArray
instance.
(va <- VCFArray(seed))
The DelayedArray()
constructor with VCFArraySeed
object as inputs
will return the same content as the VCFArray()
constructor over the
same VCFArraySeed
.
da <- DelayedArray(seed) class(da) all.equal(da, va)
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.