GDS Data with GDS Backend

options(showHeadLines=3)
options(showTailLines=3)

Introduction

With the rapid development of the biotechnologies, the sequencing (e.g., DNA, bulk/single-cell RNA, etc.) and other types of biological data are getting increasingly larger-profile. The memory space in R has been an obstable for fast and efficient data processing, because most available R or Bioconductor packages are developed based on in-memory data manipulation. SingleCellExperiment has achieved efficient on-disk saving/reading of the large-scale count data as HDF5Array objects. However, there was still no such light-weight containers available for high-throughput variant data (e.g., DNA-seq, genotyping, etc.).

We have developed VariantExperiment, a Bioconductor package to contain variant data into RangedSummarizedExperiment object. The package converts and represent VCF/GDS files using standard SummarizedExperiment metaphor. It is a container for high-through variant data with GDS back-end.

In VariantExperiment, The high-throughput variant data is saved in DelayedArray objects with GDS back-end. In addition to the light-weight Assay data, it also supports the on-disk saving of annotation data for both features and samples (corresponding to rowData/colData respectively) by implementing the DelayedDataFrame data structure. The on-disk representation of both assay data and annotation data realizes on-disk reading and processing and saves R memory space significantly. The interface of RangedSummarizedExperiment data format enables easy and common manipulations for high-throughput variant data with common SummarizedExperiment metaphor in R and Bioconductor.

Installation

Download the package from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VariantExperiment")

Or install the development version of the package from Github.

BiocManager::install("Bioconductor/VariantExperiment")

Load the package into R session.

library(VariantExperiment)

Background

GDSArray

GDSArray is a Bioconductor package that represents GDS files as objects derived from the DelayedArray package and DelayedArray class. It converts GDS nodes into a DelayedArray-derived data structure. The rich common methods and data operations defined on GDSArray makes it more R-user-friendly than working with the GDS file directly.

The GDSArray() constructor takes 2 arguments: the file path and the GDS node name (which can be retrieved with the gdsnodes() function) inside the GDS file.

library(GDSArray)
file <- GDSArray::gdsExampleFileName("seqgds")
gdsnodes(file)
GDSArray(file, "genotype/data")
GDSArray(file, "sample.id")

More details about GDS or GDSArray format can be found in the vignettes of the gdsfmt, SNPRelate, SeqArray, GDSArray and DelayedArray packages.

DelayedDataFrame

DelayedDataFrame is a Bioconductor package that implements delayed operations on DataFrame objects using standard DataFrame metaphor. Each column of data inside DelayedDataFrame is represented as 1-dimensional GDSArray with on-disk GDS file. Methods like show,validity check, [, [[ subsetting, rbind, cbind are implemented for DelayedDataFrame. The DelayedDataFrame stays lazy until an explicit realization call like DataFrame() constructor or as.list() triggered. More details about DelayedDataFrame data structure could be found in the vignette of DelayedDataFrame package.

`VariantExperiment` class

VariantExperiment class is defined to extend RangedSummarizedExperiment. The difference would be that the assay data are saved as DelayedArray, and the annotation data are saved by default as DelayedDataFrame (with option to save as ordinary DataFrame), both of which are representing the data on-disk with GDS back-end.

Conversion methods into VariantExperiment object are defined directly for VCF and GDS files. Here we show one simple example to convert a DNA-sequencing data in GDS format into VariantExperiment and some class-related operations.

ve <- makeVariantExperimentFromGDS(file)
ve

In this example, the sequencing file in GDS format was converted into a VariantExperiment object, with all available assay data saved into the assay slot, all available feature annotation nodes into rowRanges/rowData slot, and all available sample annotation nodes into colData slot. The available values for each arguments in makeVariantExperimentFromGDS() function can be retrieved using the showAvailable() function.

args(makeVariantExperimentFromGDS)
showAvailable(file)

slot accessors

Assay data are in GDSArray format, and could be retrieve by the assays()/assay() function. NOTE that when converted into a VariantExperiment object, the assay data will be checked and permuted, so that the first 2 dimensions always match to features (variants/snps) and samples respectively, no matter how are the dimensions are with the original GDSArray that can be constructed.

assays(ve)
assay(ve, 1)
GDSArray(file, "genotype/data")  ## original GDSArray from GDS file before permutation

In this example, the original GDSArray object from genotype data was 2 x 90 x 1348. But it was permuted to 1348 x 90 x 2 when constructed into the VariantExperiment object.

The rowData() of the VariantExperiment is by default saved in DelayedDataFrame format. We can use rowRanges() / rowData() to retrieve the feature-related annotation file, with/without a GenomicRange format.

rowRanges(ve)
rowData(ve)

sample-related annotation is by default in DelayedDataFrame format, and could be retrieved by colData().

colData(ve)

The gdsfile() will retrieve the gds file path associated with the VariantExperiment object.

gdsfile(ve)

Some other getter function like metadata() will return any metadata that we have saved inside the VariantExperiment object.

metadata(ve)

Coercion methods

To take advantage of the functions and methods that are defined on SummarizedExperiment, from which the VariantExperiment extends, we have defined coercion methods from VCF and GDS to VariantExperiment.

From `VCF` to `VariantExperiment`

The coercion function of makeVariantExperimentFromVCF could convert the VCF file directly into VariantExperiment object. To achieve the best storage efficiency, the assay data are saved in DelayedArray format, and the annotation data are saved in DelayedDataFrame format (with no option of ordinary DataFrame), which could be retrieved by rowData() for feature related annotations and colData() for sample related annotations (Only when sample.info argument is specified).

vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf, out.dir = tempfile())
ve

Internally, the VCF file was converted into a on-disk GDS file, which could be retrieved by:

gdsfile(ve)

assay data is in DelayedArray format:

assay(ve, 1)

feature-related annotation is in DelayedDataFrame format:

rowData(ve)

User could also have the opportunity to save the sample related annotation info directly into the VariantExperiment object, by providing the file path to the sample.info argument, and then retrieve by colData().

sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
vevcf <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
colData(vevcf)

Arguments could be specified to take only certain info columns or format columns from the vcf file.

vevcf1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(vevcf1)

In the above example, only 2 info entries ("OR" and "GP") are read into the VariantExperiment object.

The start and count arguments could be used to specify the start position and number of variants to read into Variantexperiment object.

vevcf2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
vevcf2

For the above example, only 1000 variants are read into the VariantExperiment object, starting from the position of 101.

From `GDS` to `VariantExperiment`

The coercion function of makeVariantExperimentFromGDS coerces GDS files into VariantExperiment objects directly, with the assay data saved as DelayedArray, and the rowData()/colData() in DelayedDataFrame by default (with the option of ordinary DataFrame object).

gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve

Arguments could be specified to take only certain annotation columns for features and samples. All available data entries for makeVariantExperimentFromGDS arguments could be retrieved by the showAvailable() function with the gds file name as input.

showAvailable(gds)

Note that the infoColumns from gds file will be saved as columns inside the rowData(), with the prefix of "info.". rowDataOnDisk/colDataOnDisk could be set as FALSE to save all annotation data in ordinary DataFrame format.

ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("allele", "annotation/id"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object 
colData(ve3)  ## DataFrame object

customization for certain gds types

For GDS formats of SEQ_ARRAY (defined in SeqArray as SeqVarGDSClass class) and SNP_ARRAY (defined in SNPRelate as SNPGDSFileClass class), we have made some customized transfer of certain nodes when reading into VariantExperiment object for users' convenience.

The allele node in SEQ_ARRAY gds file is converted into 2 columns in rowData() asn REF and ALT.

veseq <- makeVariantExperimentFromGDS(file,
                                      rowDataColumns = c("allele"),
                                      infoColumns = character(0))
rowData(veseq)

The snp.allele node in SNP_ARRAY gds file was converted into 2 columns in rowData() as snp.allele1 and snp.allele2.

snpfile <- SNPRelate::snpgdsExampleFileName()
vesnp <- makeVariantExperimentFromGDS(snpfile,
                                      rowDataColumns = c("snp.allele"))
rowData(vesnp)

Subsetting methods

VariantExperiment supports basic subsetting operations using [, [[, $, and ranged-based subsetting operations using subsetByOverlap.

two-dimensional subsetting

ve[1:10, 1:5]

`$` subsetting

The $ subsetting can be operated directly on colData() columns, for easy sample extraction. NOTE that the colData/rowData are (by default) in the DelayedDataFrame format, with each column saved as GDSArray. So when doing subsetting, we need to use as.logical() to convert the 1-dimensional GDSArray into ordinary vector.

colData(ve)
ve[, as.logical(ve$family == "1328")]

subsetting by rowData() columns.

rowData(ve)
ve[as.logical(rowData(ve)$REF == "T"),]

Range-based operations

VariantExperiment objects support all of the findOverlaps() methods and associated functions. This includes subsetByOverlaps(), which makes it easy to subset a VariantExperiment object by an interval.

ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
ve1

In this example, only 23 out of 1348 variants were retained with the GRanges subsetting.

Save / load `VariantExperiment` object

Note that after the subsetting by [, $ or Ranged-based operations, and you feel satisfied with the data for downstream analysis, you need to save that VariantExperiment object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned.

save `VariantExperiment` object

Use the function saveVariantExperiment to synchronize the on-disk and in-memory representation. This function writes the processed data as ve.gds, and save the R object (which lazily represent the backend data set) as ve.rds under the specified directory. It finally returns a new VariantExperiment object into current R session generated from the newly saved data.

a <- tempfile()
ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE, chunk_size = 30)

load `VariantExperiment` object

You can alternatively use loadVariantExperiment to load the synchronized data into R session, by providing only the file directory. It reads the VariantExperiment object saved as ve.rds, as lazy representation of the backend ve.gds file under the specific directory.

ve3 <- loadVariantExperiment(dir=a)
gdsfile(ve3)
all.equal(ve2, ve3)

Session Info

sessionInfo()

Bioconductor/VariantExperiment documentation built on Nov. 3, 2024, 8:11 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Bioconductor/VariantExperiment
A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend

In Bioconductor/VariantExperiment: A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend

Introduction

Installation

Background

GDSArray

DelayedDataFrame

`VariantExperiment` class

`VariantExperiment` class

slot accessors

Coercion methods

From `VCF` to `VariantExperiment`

From `GDS` to `VariantExperiment`

customization for certain gds types

Subsetting methods

two-dimensional subsetting

`$` subsetting

Range-based operations

Save / load `VariantExperiment` object

save `VariantExperiment` object

load `VariantExperiment` object

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

Bioconductor/VariantExperiment A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend

In Bioconductor/VariantExperiment: A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend

Introduction

Installation

Background

GDSArray

DelayedDataFrame

VariantExperiment class

VariantExperiment class

slot accessors

Coercion methods

From VCF to VariantExperiment

From GDS to VariantExperiment

customization for certain gds types

Subsetting methods

two-dimensional subsetting

$ subsetting

Range-based operations

Save / load VariantExperiment object

save VariantExperiment object

load VariantExperiment object

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

Bioconductor/VariantExperiment
A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend

`VariantExperiment` class

`VariantExperiment` class

From `VCF` to `VariantExperiment`

From `GDS` to `VariantExperiment`

`$` subsetting

Save / load `VariantExperiment` object

save `VariantExperiment` object

load `VariantExperiment` object