GNU General Public License, GPLv3
Data management of whole-genome sequence variant calls for hundreds of thousands of individuals: genotypic data (e.g., SNVs, indels and structural variation calls) and annotations are stored in SeqArray GDS files in an array-oriented, compressed manner, with efficient data access from the R programming language.
The SeqArray package is built on top of the Genomic Data Structure (GDS) data format and defines the data structure required for a SeqArray file. GDS is a flexible and portable data container with a hierarchical structure for storing multiple scalable, array-oriented data sets. It is suited to large-scale datasets, especially data that are much larger than the available random-access memory. It also offers efficient operations designed specifically for integers of fewer than 8 bits, since a diploid genotype usually occupies less than a byte. Data compression and decompression are available with relatively efficient random access. A high-level R interface to GDS files is available in the package gdsfmt.
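To make the sub-byte storage idea concrete, here is a minimal base-R sketch that packs four 2-bit allele codes into one byte and unpacks them again. It is a conceptual illustration only: `pack2bit()` and `unpack2bit()` are hypothetical helpers written for this example, the bit layout is an assumption, and none of this is the actual gdsfmt implementation.

```r
# Conceptual sketch only: pack four 2-bit values into one byte with base R.
# pack2bit() and unpack2bit() are hypothetical helpers, not gdsfmt functions.
pack2bit <- function(x) {
  stopifnot(all(x >= 0L & x <= 3L))
  # pad with zeros to a multiple of four values so every byte is complete
  n <- length(x)
  x <- c(as.integer(x), integer((4L - n %% 4L) %% 4L))
  m <- matrix(x, nrow = 4L)
  # within each group, value k occupies bits 2*(k-1) and 2*(k-1)+1
  bytes <- m[1L, ] + 4L * m[2L, ] + 16L * m[3L, ] + 64L * m[4L, ]
  as.raw(bytes)
}

unpack2bit <- function(bytes, n) {
  b <- as.integer(bytes)
  # extract the four 2-bit fields of every byte, lowest bits first
  m <- rbind(b %% 4L, (b %/% 4L) %% 4L, (b %/% 16L) %% 4L, b %/% 64L)
  as.vector(m)[seq_len(n)]
}

# allele codes for 5 diploid genotypes
# (0 = reference, 1 = alternate; 3 is used as a missing code in this sketch)
alleles <- c(0L, 1L, 1L, 1L, 0L, 0L, 3L, 3L, 0L, 1L)
packed <- pack2bit(alleles)
length(packed)   # 3 bytes hold all 10 allele codes
identical(unpack2bit(packed, length(alleles)), alleles)
```

Stored this way, each allele code takes a quarter of a byte before any compression is applied, which is the motivation for the `Bit2` storage type that GDS uses for genotypes.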
Release Version: v1.28.1
http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html
Development Version: v1.29.2
http://www.bioconductor.org/packages/devel/bioc/html/SeqArray.html
Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SeqArray")
library("devtools")
install_github("zhengxwen/gdsfmt")
install_github("zhengxwen/SeqArray")
The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system; see the R FAQ for your operating system. You may also need to install dependencies manually.
wget --no-check-certificate https://github.com/zhengxwen/gdsfmt/tarball/master -O gdsfmt_latest.tar.gz
wget --no-check-certificate https://github.com/zhengxwen/SeqArray/tarball/master -O SeqArray_latest.tar.gz
R CMD INSTALL gdsfmt_latest.tar.gz
R CMD INSTALL SeqArray_latest.tar.gz
## Or
curl -L https://github.com/zhengxwen/gdsfmt/tarball/master/ -o gdsfmt_latest.tar.gz
curl -L https://github.com/zhengxwen/SeqArray/tarball/master/ -o SeqArray_latest.tar.gz
R CMD INSTALL gdsfmt_latest.tar.gz
R CMD INSTALL SeqArray_latest.tar.gz
library(SeqArray)
gds.fn <- seqExampleFileName("gds")
# open a GDS file
f <- seqOpen(gds.fn)
# display the contents of the GDS file
f
# close the file
seqClose(f)
## Object of class "SeqVarGDSClass"
## File: SeqArray/extdata/CEU_Exon.gds (298.6K)
## + [ ] *
## |--+ description [ ] *
## |--+ sample.id { Str8 90 LZMA_ra(35.8%), 258B } *
## |--+ variant.id { Int32 1348 LZMA_ra(16.8%), 906B } *
## |--+ position { Int32 1348 LZMA_ra(64.6%), 3.4K } *
## |--+ chromosome { Str8 1348 LZMA_ra(4.63%), 158B } *
## |--+ allele { Str8 1348 LZMA_ra(16.7%), 902B } *
## |--+ genotype [ ] *
## | |--+ data { Bit2 2x90x1348 LZMA_ra(26.3%), 15.6K } *
## | |--+ ~data { Bit2 2x1348x90 LZMA_ra(29.3%), 17.3K }
## | |--+ extra.index { Int32 3x0 LZMA_ra, 19B } *
## | \--+ extra { Int16 0 LZMA_ra, 19B }
## |--+ phase [ ]
## | |--+ data { Bit1 90x1348 LZMA_ra(0.91%), 138B } *
## | |--+ ~data { Bit1 1348x90 LZMA_ra(0.91%), 138B }
## | |--+ extra.index { Int32 3x0 LZMA_ra, 19B } *
## | \--+ extra { Bit1 0 LZMA_ra, 19B }
## |--+ annotation [ ]
## | |--+ id { Str8 1348 LZMA_ra(38.4%), 5.5K } *
## | |--+ qual { Float32 1348 LZMA_ra(2.26%), 122B } *
## | |--+ filter { Int32,factor 1348 LZMA_ra(2.26%), 122B } *
## | |--+ info [ ]
## | | |--+ AA { Str8 1348 LZMA_ra(25.6%), 690B } *
## | | |--+ AC { Int32 1348 LZMA_ra(24.2%), 1.3K } *
## | | |--+ AN { Int32 1348 LZMA_ra(19.8%), 1.0K } *
## | | |--+ DP { Int32 1348 LZMA_ra(47.9%), 2.5K } *
## | | |--+ HM2 { Bit1 1348 LZMA_ra(150.3%), 254B } *
## | | |--+ HM3 { Bit1 1348 LZMA_ra(150.3%), 254B } *
## | | |--+ OR { Str8 1348 LZMA_ra(20.1%), 342B } *
## | | |--+ GP { Str8 1348 LZMA_ra(24.4%), 3.8K } *
## | | \--+ BN { Int32 1348 LZMA_ra(20.9%), 1.1K } *
## | \--+ format [ ]
## | \--+ DP [ ] *
## | |--+ data { Int32 90x1348 LZMA_ra(25.1%), 118.8K } *
## | \--+ ~data { Int32 1348x90 LZMA_ra(24.1%), 114.2K }
## \--+ sample.annotation [ ]
## \--+ family { Str8 90 LZMA_ra(57.1%), 222B }
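The node paths in the listing above can be passed to `seqGetData()` to read individual variables. The following is a minimal sketch against the example file shipped with the package; the object names (`sample.id`, `position`, `depth`, `geno`) are chosen here for illustration.

```r
library(SeqArray)

# open the example GDS file that ships with SeqArray
f <- seqOpen(seqExampleFileName("gds"))

# variable names follow the node paths shown in the file listing
sample.id <- seqGetData(f, "sample.id")
position  <- seqGetData(f, "position")
depth     <- seqGetData(f, "annotation/info/DP")

length(sample.id)   # number of samples in the file
length(position)    # number of variants in the file

# genotypes are returned as an array: allele x sample x variant
geno <- seqGetData(f, "genotype")
dim(geno)

seqClose(f)
```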
| Function      | Description                                 |
|:--------------|:-------------------------------------------|
| seqVCF2GDS    | Reformat VCF files |
| seqSetFilter  | Define a data subset of samples or variants |
| seqGetData    | Get data from a SeqArray file with a defined filter |
| seqApply      | Apply a user-defined function over array margins |
| seqBlockApply | Apply a user-defined function over array margins via blocking |
| seqParallel   | Apply functions in parallel |
| ...           | |
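As a rough sketch of how several of these functions fit together, the example below converts the bundled example VCF file to GDS, selects a subset, and computes a per-variant reference allele frequency. The temporary output path and the choice of the first 10 samples and first 100 variants are arbitrary assumptions made for illustration.

```r
library(SeqArray)

# convert the example VCF file shipped with the package into a temporary GDS file
vcf.fn <- seqExampleFileName("vcf")
out.fn <- tempfile(fileext = ".gds")
seqVCF2GDS(vcf.fn, out.fn, verbose = FALSE)

f <- seqOpen(out.fn)

# restrict subsequent reads to the first 10 samples and the first 100 variants
seqSetFilter(f, sample.sel = 1:10, variant.sel = 1:100)

# seqGetData() respects the active filter
geno <- seqGetData(f, "genotype")
dim(geno)   # allele x selected samples x selected variants

# reference allele frequency, computed one variant at a time
ref.af <- seqApply(f, "genotype",
                   FUN = function(x) mean(x == 0L, na.rm = TRUE),
                   margin = "by.variant", as.is = "double")
length(ref.af)

# clear the filter, close the file, and remove the temporary file
seqResetFilter(f)
seqClose(f)
unlink(out.fn)
```

`seqApply()` visits one variant at a time, so the full genotype array is never held in memory; `seqBlockApply()` follows the same pattern but hands the function a block of variants per call.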
(the number of samples is ~100k)
(BioC3.8: gdsfmt_v1.18.1, SeqArray_v1.22.6; BioC3.4: gdsfmt_v1.10.1, SeqArray_v1.14.1)
seqBlockApply() was unexpectedly slow in versions ≤ v1.26.2. See: https://github.com/zhengxwen/SeqArray/issues/59.
(update in progress ...)
gds2bgen: Format conversion from BGEN to GDS
JSeqArray.jl: Data manipulation of whole-genome sequencing variant data in Julia
PySeqArray: Data manipulation of whole-genome sequencing variant data in Python