options(showHeadLines=3) options(showTailLines=3)
With the rapid development of the biotechnologies, the sequencing (e.g., DNA, bulk/single-cell RNA, etc.) and other types of biological data are getting increasingly larger-profile. The memory space in R has been an obstable for fast and efficient data processing, because most available R or Bioconductor packages are developed based on in-memory data manipulation. SingleCellExperiment has achieved efficient on-disk saving/reading of the large-scale count data as HDF5Array objects. However, there was still no such light-weight containers available for high-throughput variant data (e.g., DNA-seq, genotyping, etc.).
We have developed VariantExperiment, a Bioconductor package to
contain variant data into RangedSummarizedExperiment
object. The
package converts and represent VCF/GDS files using standard
SummarizedExperiment
metaphor. It is a container for high-through
variant data with GDS back-end.
In VariantExperiment
, The high-throughput variant data is saved in
DelayedArray objects with GDS back-end. In addition to the
light-weight Assay
data, it also supports the on-disk saving of
annotation data for both features and samples (corresponding to
rowData/colData
respectively) by implementing the
DelayedDataFrame data structure. The on-disk representation of
both assay data and annotation data realizes on-disk reading and
processing and saves R memory space significantly. The interface of
RangedSummarizedExperiment
data format enables easy and common
manipulations for high-throughput variant data with common
SummarizedExperiment metaphor in R and Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("VariantExperiment")
Or install the development version of the package from Github.
BiocManager::install("Bioconductor/VariantExperiment")
library(VariantExperiment)
GDSArray is a Bioconductor package that represents GDS
files as
objects derived from the DelayedArray package and DelayedArray
class. It converts GDS
nodes into a DelayedArray
-derived data
structure. The rich common methods and data operations defined on
GDSArray
makes it more R-user-friendly than working with the GDS
file directly.
The GDSArray()
constructor takes 2 arguments: the file path and the
GDS node name (which can be retrieved with the gdsnodes()
function)
inside the GDS file.
library(GDSArray) file <- GDSArray::gdsExampleFileName("seqgds") gdsnodes(file) GDSArray(file, "genotype/data") GDSArray(file, "sample.id")
More details about GDS
or GDSArray
format can be found in the
vignettes of the gdsfmt, SNPRelate, SeqArray, GDSArray
and DelayedArray packages.
DelayedDataFrame is a Bioconductor package that implements
delayed operations on DataFrame
objects using standard DataFrame
metaphor. Each column of data inside DelayedDataFrame
is represented
as 1-dimensional GDSArray
with on-disk GDS file. Methods like
show
,validity check
, [
, [[
subsetting, rbind
, cbind
are
implemented for DelayedDataFrame
. The DelayedDataFrame
stays lazy
until an explicit realization call like DataFrame()
constructor or
as.list()
triggered. More details about DelayedDataFrame data
structure could be found in the vignette of DelayedDataFrame
package.
VariantExperiment
classVariantExperiment
classVariantExperiment
class is defined to extend
RangedSummarizedExperiment
. The difference would be that the assay
data are saved as DelayedArray
, and the annotation data are saved by
default as DelayedDataFrame
(with option to save as ordinary
DataFrame
), both of which are representing the data on-disk with
GDS
back-end.
Conversion methods into VariantExperiment
object are
defined directly for VCF
and GDS
files. Here we show one simple
example to convert a DNA-sequencing data in GDS format into
VariantExperiment
and some class-related operations.
ve <- makeVariantExperimentFromGDS(file) ve
In this example, the sequencing file in GDS format was converted into
a VariantExperiment
object, with all available assay data saved into
the assay
slot, all available feature annotation nodes into
rowRanges/rowData
slot, and all available sample annotation nodes
into colData
slot. The available values for each arguments in
makeVariantExperimentFromGDS()
function can be retrieved using the
showAvailable()
function.
args(makeVariantExperimentFromGDS) showAvailable(file)
Assay data are in GDSArray
format, and could be retrieve by the
assays()/assay()
function. NOTE that when converted into a
VariantExperiment
object, the assay data will be checked and
permuted, so that the first 2 dimensions always match to features
(variants/snps) and samples respectively, no matter how are the
dimensions are with the original GDSArray that can be constructed.
assays(ve) assay(ve, 1) GDSArray(file, "genotype/data") ## original GDSArray from GDS file before permutation
In this example, the original GDSArray
object from genotype data was
2 x 90 x 1348
. But it was permuted to 1348 x 90 x 2
when
constructed into the VariantExperiment
object.
The rowData()
of the VariantExperiment
is by default saved in
DelayedDataFrame
format. We can use rowRanges()
/ rowData()
to
retrieve the feature-related annotation file, with/without a
GenomicRange format.
rowRanges(ve) rowData(ve)
sample-related annotation is by default in DelayedDataFrame
format,
and could be retrieved by colData()
.
colData(ve)
The gdsfile()
will retrieve the gds file path associated with the
VariantExperiment
object.
gdsfile(ve)
Some other getter function like metadata()
will return any metadata
that we have saved inside the VariantExperiment
object.
metadata(ve)
To take advantage of the functions and methods that are defined on
SummarizedExperiment
, from which the VariantExperiment
extends, we
have defined coercion methods from VCF
and GDS
to
VariantExperiment
.
VCF
to VariantExperiment
The coercion function of makeVariantExperimentFromVCF
could
convert the VCF
file directly into VariantExperiment
object. To
achieve the best storage efficiency, the assay data are saved in
DelayedArray
format, and the annotation data are saved in
DelayedDataFrame
format (with no option of ordinary DataFrame
),
which could be retrieved by rowData()
for feature related
annotations and colData()
for sample related annotations (Only when
sample.info
argument is specified).
vcf <- SeqArray::seqExampleFileName("vcf") ve <- makeVariantExperimentFromVCF(vcf, out.dir = tempfile()) ve
Internally, the VCF
file was converted into a on-disk GDS
file,
which could be retrieved by:
gdsfile(ve)
assay data is in DelayedArray
format:
assay(ve, 1)
feature-related annotation is in DelayedDataFrame
format:
rowData(ve)
User could also have the opportunity to save the sample related
annotation info directly into the VariantExperiment
object, by
providing the file path to the sample.info
argument, and then
retrieve by colData()
.
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt", package="VariantExperiment") vevcf <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo) colData(vevcf)
Arguments could be specified to take only certain info columns or format columns from the vcf file.
vevcf1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP")) rowData(vevcf1)
In the above example, only 2 info entries ("OR" and "GP") are read
into the VariantExperiment
object.
The start
and count
arguments could be used to specify the start
position and number of variants to read into Variantexperiment
object.
vevcf2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000) vevcf2
For the above example, only 1000 variants are read into the
VariantExperiment
object, starting from the position of 101.
GDS
to VariantExperiment
The coercion function of makeVariantExperimentFromGDS
coerces GDS
files into VariantExperiment
objects directly, with the assay data
saved as DelayedArray
, and the rowData()/colData()
in
DelayedDataFrame
by default (with the option of ordinary DataFrame
object).
gds <- SeqArray::seqExampleFileName("gds") ve <- makeVariantExperimentFromGDS(gds) ve
Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
makeVariantExperimentFromGDS
arguments could be retrieved by the
showAvailable()
function with the gds file name as input.
showAvailable(gds)
Note that the infoColumns
from gds file will be saved as columns
inside the rowData()
, with the prefix of
"info.". rowDataOnDisk/colDataOnDisk
could be set as FALSE
to
save all annotation data in ordinary DataFrame
format.
ve3 <- makeVariantExperimentFromGDS(gds, rowDataColumns = c("allele", "annotation/id"), infoColumns = c("AC", "AN", "DP"), rowDataOnDisk = TRUE, colDataOnDisk = FALSE) rowData(ve3) ## DelayedDataFrame object colData(ve3) ## DataFrame object
For GDS formats of SEQ_ARRAY
(defined in SeqArray as
SeqVarGDSClass
class) and SNP_ARRAY
(defined in SNPRelate as
SNPGDSFileClass
class), we have made some customized transfer of
certain nodes when reading into VariantExperiment
object for users'
convenience.
The allele
node in SEQ_ARRAY
gds file is converted into 2 columns
in rowData()
asn REF
and ALT
.
veseq <- makeVariantExperimentFromGDS(file, rowDataColumns = c("allele"), infoColumns = character(0)) rowData(veseq)
The snp.allele
node in SNP_ARRAY
gds file was converted into 2
columns in rowData()
as snp.allele1
and snp.allele2
.
snpfile <- SNPRelate::snpgdsExampleFileName() vesnp <- makeVariantExperimentFromGDS(snpfile, rowDataColumns = c("snp.allele")) rowData(vesnp)
VariantExperiment
supports basic subsetting operations using [
,
[[
, $
, and ranged-based subsetting operations using
subsetByOverlap
.
ve[1:10, 1:5]
$
subsettingThe $
subsetting can be operated directly on colData()
columns,
for easy sample extraction. NOTE that the colData/rowData
are
(by default) in the DelayedDataFrame
format, with each column saved
as GDSArray
. So when doing subsetting, we need to use as.logical()
to convert the 1-dimensional GDSArray
into ordinary vector.
colData(ve) ve[, as.logical(ve$family == "1328")]
subsetting by rowData()
columns.
rowData(ve) ve[as.logical(rowData(ve)$REF == "T"),]
VariantExperiment
objects support all of the findOverlaps()
methods and associated functions. This includes subsetByOverlaps()
,
which makes it easy to subset a VariantExperiment
object by an
interval.
ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933")) ve1
In this example, only 23 out of 1348 variants were retained with the
GRanges
subsetting.
VariantExperiment
objectNote that after the subsetting by [
, $
or Ranged-based
operations
, and you feel satisfied with the data for downstream
analysis, you need to save that VariantExperiment
object to
synchronize the gds file (on-disk) associated with the subset of data
(in-memory representation) before any statistical analysis. Otherwise,
an error will be returned.
0
VariantExperiment
objectUse the function saveVariantExperiment
to synchronize the on-disk
and in-memory representation. This function writes the processed data
as ve.gds
, and save the R object (which lazily represent the
backend data set) as ve.rds
under the specified directory. It
finally returns a new VariantExperiment
object into current R
session generated from the newly saved data.
a <- tempfile() ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE, chunk_size = 30)
VariantExperiment
objectYou can alternatively use loadVariantExperiment
to load the
synchronized data into R session, by providing only the file
directory. It reads the VariantExperiment
object saved as ve.rds
, as lazy
representation of the backend ve.gds
file under the specific
directory.
ve3 <- loadVariantExperiment(dir=a) gdsfile(ve3) all.equal(ve2, ve3)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.