VCFArray: DelayedArray objects with on-disk/remote VCF backend

options(showHeadLines=3)
options(showTailLines=3)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

VCFArray is a Bioconductor package that represents VCF files as objects derived from the DelayedArray package and DelayedArray class. It converts data entries from VCF file into a DelayedArray-derived data structure. The backend VCF file could either be saved on-disk locally or remote as online resources. Data entries that could be extracted include the fixed data fields (REF, ALT, QUAL, FILTER), information field (e.g., AA, AF...), and the individual format field (e.g., GT, DP...). The array data generated from fixed/information fields are one-dimensional VCFArray, with the dimension being the length of the variants. The array data generated from individual FORMAT field are always returned with the first dimension being variants and the second dimension being samples. This feature is consistent with the assay data saved in SummarizedExperiment, and makes the VCFArray package interoperable with other established Bioconductor data infrastructure.

Installation

  1. Download the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VCFArray")

The development version is also available to download from Github.

BiocManager::install("Bioconductor/VCFArray")
  1. Load the package into R session.
library(VCFArray)

VCFArray

VCFArray constructor

To construct a VCFArray object, 4 arguments are needed: file, vindex and name, and pfix. The file argument could take either a character string (VCF file name), or VcfFile object, or a RangedVcfStack object. name argument must be specified to indicate which data entry we want to extract from the input file. It's case-sensitive, and must be consistent with the names from VCF header file. vindex argument will only be used to indicate the file path of the index file if it does not exist. pfix is used to spefify the category that the name field belongs to. NOTE that the pfix needs to be provided specifically when there are same name in multiple categories, otherwise, error will return.

The vcfFields() method takes the VCF file path, VcfFile object or RangedVcfStack object as input, and returns a CharacterList with all available VCF fields within specific categories. Users should consult the fixed, info and geno category for available data entries that could be converted into VCFArray instances. The data entry names can be used as input for the name argument in VCFArray constructor.

args(VCFArray)
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
library(VariantAnnotation)
vcfFields(fl)

Since the index file for our vcf file already exists, the vindex argument would not be needed (which is the most common case for on-disk VCF files). So we can construct the VCFArray object for the GT data entry in the provided VCF file with arguments of file and name only.

VCFArray(file = fl, name = "GT")

We can also construct a VCFArray object with the file argument being a VcfFile object.

vcf <- VariantAnnotation::VcfFile(fl)
VCFArray(file = vcf, name = "DS")

The file argument could also take RangedVcfStack object. Note that an ordinary VcfStack object without Range information could not be used to construct a VCFArray.

extdata <- system.file(package = "GenomicFiles", "extdata")
files <- dir(extdata, pattern="^CEUtrio.*bgz$", full=TRUE)[1:2]
names(files) <- sub(".*_([0-9XY]+).*", "\\1", basename(files))
seqinfo <- as(readRDS(file.path(extdata, "seqinfo.rds")), "Seqinfo")
stack <- GenomicFiles::VcfStack(files, seqinfo)
gr <- as(GenomicFiles::seqinfo(stack)[rownames(stack)], "GRanges")
## RangedVcfStack
rgstack <- GenomicFiles::RangedVcfStack(stack, rowRanges = gr)  
rgstack

Here we choose the name = SB, which returns a 3-dimensional VCFArray object, with the first 2 dimensions correspond to variants and samples respectively.

vcfFields(rgstack)$geno
VCFArray(rgstack, name = "SB")

As the vignette title suggest, the backend VCF file could also be remote files. Here we included an example of representing VCF file of chromosome 22 from the 1000 Genomes Project (Phase 3). NOTE that for a remote VCF file, the vindex argument must be specified. Since this VCF files is relatively big, and it takes longer time, we only show the code here without evaluation.

chr22url <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
chr22url.tbi <- paste0(chr22url, ".tbi")
va <- VCFArray(chr22url, vindex =chr22url.tbi, name = "GT")

VCFArray methods

VCFArray represents VCF files as DelayedArray instances. It has methods like dim, dimnames defined, and it inherits array-like operations and methods from DelayedArray, e.g., the subsetting method of [.
NOTE that for 1-dimensional VCFArray objects that are generated from the fixed / information data field of VCF file, drop = FALSE should always be used with [ subsetting to ensure VCFArray object as returned value.

slot accessors

seed returns the VCFArraySeed of the VCFArray object, which includes information about the backend VCF file, e.g., the vcf file path, index file path, name of the data entry (with a prefix of category), dimension and etc.

va <- VCFArray(fl, name = "GT")
seed(va)

vcffile returns the VcfFile object corresponding to the backend VCF file.

vcffile(va)

dim() and dimnames()

The dimnames(VCFArray) returns an unnamed list, with the length of each element being the same as return from dim(VCFArray).

va <- VCFArray(fl, name = "GT")
dim(va)
class(dimnames(va))
lengths(dimnames(va))

[ subsetting

VCFArray instances can be subsetted, following the usual R conventions, with numeric or logical vectors; logical vectors are recycled to the appropriate length.

va[1:3, 1:3]
va[c(TRUE, FALSE), ]

Some numeric calculation

Numeric calculations could be evaluated on VCFArray objects.

ds <- VCFArray(fl, name = "DS")
log(ds+5)

Internals: VCFArraySeed

The VCFArraySeed class represents the 'seed' for the VCFArray object. It is not exported from the VCFArray package. Seed objects should contain the VCF file path, and are expected to satisfy the “seed contract” of DelayedArray, i.e. to support dim() and dimnames().

seed <- VCFArray:::VCFArraySeed(fl, name = "GT", pfix = NULL)
seed
path(vcffile(seed))

The seed can be used to construct a VCFArray instance.

(va <- VCFArray(seed))

The DelayedArray() constructor with VCFArraySeed object as inputs will return the same content as the VCFArray() constructor over the same VCFArraySeed.

da <- DelayedArray(seed)
class(da)
all.equal(da, va)

sessionInfo()

sessionInfo()


Try the VCFArray package in your browser

Any scripts or data that you put into this service are public.

VCFArray documentation built on Nov. 8, 2020, 5:30 p.m.