knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(DSArray)
DSArray ("desiree") provides an efficient in-memory representation of 3-dimensional arrays that contain many duplicate slices via the DSArray (Duplicate Slice Array) S4 class. A basic array-like API is provided for instantiating, subsetting, and combining DSArray objects.
This vignette introduces the DSArray class and demonstrates common operations. The benchmarking vignette compares the use of DSArray objects, array objects from the base package, sparse matrix objects from the Matrix package, and HDF5Array objects from the HDF5Array package.
The DSArray package serves a niche purpose. However, since I've found it useful, I'm making it publicly available. Here is the problem in words and a picture illustrating the solution that DSArray offers.
Suppose you have data on a set of n samples where each sample's data can be represented as a matrix (x1, ..., xn) where dim(x1) = ... = dim(xn) = c(nrow, ncol). We can combine these matrices along a given dimension to form a 3-dimensional array, x. DSArray is designed for the special case where there are many duplicate slices of x. Continuing our example, if each of x1, ..., xn has duplicate rows and we combine x1, ..., xn to form x such that x[, j, ] represents xj, then for this special case we can efficiently represent the data by storing only the unique rows of x1, ..., xn and an associated index. A picture will hopefully help make this clearer:
set.seed(666)
# TODO: Figure doesn't look the same in vignette as it does in RStudio
n <- 3
nrow <- 20
ncol <- 8
DSArray:::.drawDSArray(n = 3, nrow = 20, ncol = 8)
In this example we have n = 3 matrices, each shown as a slice of x (x[, 1, ], x[, 2, ], x[, 3, ]) with nrow = 20 and ncol = 8, where the colour of the row identifies identical rows. Note that the same row may be found multiple times within a sample and may also be common to multiple samples. We can construct the DSArray representation of x by calling DSArray(x). The DSArray representation has a key and a val, much like an associative array, map, or dictionary. The j-th column of the key is the key for the j-th sample (note the colour ordering of each sample). The val contains all unique rows found in the n samples.
We can reconstruct the data for a particular sample by expanding the val by the relevant column of the key. We can often compute the required summaries of the data while retaining this sparse representation. In this way, using a DSArray is similar to using a run-length encoding of a vector or a sparse matrix representation: each leverages additional structure in the data to reduce memory use.
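The key/val idea can be sketched in base R. This is only an illustration of the concept under simple assumptions (two small numeric matrices with identical dimensions); it is not how DSArray is implemented internally, and the helper row_id() is a hypothetical name introduced for this example.

```r
# Two samples with identical dimensions; several rows are duplicated
# within and across samples.
x1 <- matrix(c(1, 2,
               1, 2,
               3, 4), ncol = 2, byrow = TRUE)
x2 <- matrix(c(3, 4,
               5, 6,
               1, 2), ncol = 2, byrow = TRUE)
samples <- list(x1, x2)

# Stack all rows, sample after sample, and keep only the unique rows.
all_rows <- do.call(rbind, samples)
val <- unique(all_rows)

# Hypothetical helper: a string identifier for each row of a matrix.
row_id <- function(m) apply(m, 1, paste, collapse = "\r")

# The key records, for every row of every sample, which row of val it is.
# Filling column-wise makes column j of the key belong to sample j.
key <- matrix(match(row_id(all_rows), row_id(val)),
              nrow = nrow(x1), ncol = length(samples))

# Reconstruct a sample by expanding val with the relevant key column.
x2_rebuilt <- val[key[, 2], , drop = FALSE]
identical(x2_rebuilt, x2)
```

Here val holds 3 rows rather than the 6 rows of the stacked input; the saving grows with the amount of duplication.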
The DSArray() function provides several different ways to construct a DSArray object. We demonstrate its use when working with both a single sample and multiple samples.
Here we have data from a single sample in a matrix, from which we wish to construct the DSArray representation:
m <- matrix(1:10, ncol = 2, dimnames = list(letters[1:5], LETTERS[1:2]))
m
m_dsa <- DSArray(m)
m_dsa
Note that the columns of m become the slices of m_dsa; this is because a DSArray uses columns to represent samples. Also note that by default DSArray() constructs the dimnames from the input:
dimnames(m)
dimnames(m_dsa)
We can override these by supplying them as the dimnames argument, in particular to set the column names (sample names) for these data:
dimnames(DSArray(m, dimnames = list(rownames(m), "sample-1", colnames(m))))
When we have data on multiple samples, these might already be represented as a 3-dimensional array:
a <- array(c(1, 3, 5, 10, 30, 50, 100, 300, 500,
             2, 4, 6, 20, 40, 60, 200, 400, 600),
           dim = c(3, 3, 2),
           dimnames = list(letters[1:3], LETTERS[1:3], letters[25:26]))
Here, each sample's data are a column of a:
# Sample A
a[, "A", ]
# Sample B
a[, "B", ]
# Sample C
a[, "C", ]
By default, the DSArray() constructor assumes the columns of an array input represent the samples:
a_dsa <- DSArray(a)
a_dsa
But we can specify this explicitly by setting the MARGIN argument:
# Default: Columns (2) as samples
DSArray(a, MARGIN = 2)
# Rows (1) as samples
DSArray(a, MARGIN = 1)
# Slices (3) as samples
DSArray(a, MARGIN = 3)
Alternatively, the data may be represented as a list of matrix objects, one per sample, where the dimensions of each matrix are identical:
l <- list(A = a[, "A", ], B = a[, "B", ], C = a[, "C", ])
l
l_dsa <- DSArray(l)
l_dsa
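The list input and the array input carry the same information. As a base R sketch (it does not call DSArray(), and the name a_from_list is introduced only for this example), a list of per-sample matrices can be assembled into a 3-dimensional array whose columns are the samples:

```r
# The same array as above.
a <- array(c(1, 3, 5, 10, 30, 50, 100, 300, 500,
             2, 4, 6, 20, 40, 60, 200, 400, 600),
           dim = c(3, 3, 2),
           dimnames = list(letters[1:3], LETTERS[1:3], letters[25:26]))

# One matrix per sample, as in the list-based input.
l <- list(A = a[, "A", ], B = a[, "B", ], C = a[, "C", ])

# Allocate an array with samples along the second dimension.
a_from_list <- array(
  NA_real_,
  dim = c(nrow(l[[1]]), length(l), ncol(l[[1]])),
  dimnames = list(rownames(l[[1]]), names(l), colnames(l[[1]]))
)

# Fill in sample j so that a_from_list[, j, ] is the j-th matrix.
for (j in seq_along(l)) {
  a_from_list[, j, ] <- l[[j]]
}

identical(a_from_list, a)
```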
The aim is to allow a DSArray to be used as a drop-in replacement for an array when the need arises. The DSArray API is therefore written to mimic the array API so that DSArray objects behave as if they were 3-dimensional array objects. However, the API coverage is not 100% complete. I am adding these missing methods as needed, so if something you require is missing then please get in touch by filing a feature request at https://github.com/PeteHaitch/DSArray/issues.
Where possible, DSArray aims to avoid "densifying" the data (i.e. converting it to an array for intermediate calculations) since doing so obviously negates the memory efficiency of using a DSArray. In the DSArray documentation, we refer to methods that avoid densifying the data as being optimally implemented and methods that densify the data as being sub-optimally implemented. Not all operations are optimally implemented, some because they are difficult (or perhaps impossible) and others because I haven't yet taken the time to optimise them.
As an example, one operation to avoid if at all possible is subset replacement with the [<- operator; this is a very expensive operation since it first densifies the data and then re-sparsifies.
We can subset a DSArray just as we would an array by using the [ operator[^drop]:
[^drop]: The drop argument to [ is always set to FALSE when subsetting a DSArray.
# Extract the first feature
a_dsa[1, , ]
# Extract the first sample
a_dsa[, 1, ]
# Extract the first slice
a_dsa[, , 1]
# Extract the first 2 features for the first 3 samples
a_dsa[1:2, 1:3, ]
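For comparison, this is what the drop argument does for a base array. The sketch below uses only base R (the array b is made up for the example) and shows that drop = FALSE keeps all three dimensions, which matches the behaviour the footnote describes for DSArray subsetting:

```r
# A small base array with 3 rows, 3 columns, and 2 slices.
b <- array(1:18, dim = c(3, 3, 2))

# Default (drop = TRUE): the dimension of extent 1 is dropped.
dim_dropped <- dim(b[1, , ])

# drop = FALSE: the result is still a 3-dimensional array.
dim_kept <- dim(b[1, , , drop = FALSE])

dim_dropped
dim_kept
```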
Rather than provide cbind() and rbind() methods, DSArray provides methods for the acbind() and arbind() generics defined in IRanges. acbind() and arbind() generalise cbind() and rbind() to array-like objects. These generics bind array-like objects with an arbitrary number of dimensions along their rows (arbind) or columns (acbind). All DSArray objects must have the same number of rows (resp. columns) when acbind()-ing (resp. arbind()-ing).
acbind(a_dsa, a_dsa[, 1, ])
acbind(a, a[, 1, , drop = FALSE])
arbind(a_dsa, a_dsa[1, , ])
arbind(a, a[1, , , drop = FALSE])
Arith, Compare, Ops, Logic, Math, Math2, Summary, and Complex are group generic functions. Each group generic function has a number of member generic functions associated with it. DSArray provides methods for each of these generics for the DSArray class.
Arith, Compare, and Logic
It is trivial to implement high-performance scalar-DSArray arithmetic, comparison, and logic methods:
# Arithmetic
a_dsa + 3
a_dsa - 3
a_dsa * 3
a_dsa ^ 3
a_dsa %% 3
a_dsa %/% 3
a_dsa / 3
# Comparison
a_dsa == 3
a_dsa > 3
a_dsa < 3
a_dsa != 3
a_dsa <= 3
a_dsa >= 3
# Logic
a_dsa & TRUE
a_dsa | TRUE
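To see why the scalar case is cheap, consider a base R sketch with a hand-written key and val. This illustrates the general idea rather than the DSArray source code, and the helper expand() is a hypothetical name: an element-wise operation with a scalar only needs to touch the unique rows in val, because the key is unchanged.

```r
# A toy key/val pair: 3 unique rows shared by 2 samples of 3 rows each.
val <- matrix(c(1, 2,
                3, 4,
                5, 6), ncol = 2, byrow = TRUE)
key <- matrix(c(1, 1, 2,
                2, 3, 1), ncol = 2)

# Hypothetical helper: the dense matrix for sample j.
expand <- function(val, key, j) val[key[, j], , drop = FALSE]

# Densify first, then add the scalar ...
dense_then_add <- expand(val, key, 2) + 3

# ... or add the scalar to the unique rows only, then expand.
add_then_expand <- expand(val + 3, key, 2)

identical(dense_then_add, add_then_expand)
```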
Unfortunately, the same is not true of vector-DSArray, array-DSArray, or DSArray-DSArray operations, which all currently require the densification of the DSArray argument(s):
a_dsa + 1:2
We can check that the DSArray method gives an identical result to the array method using the non-exported DSArray:::dsa_identical_to_array() function. For example:
DSArray:::dsa_identical_to_array(a_dsa * 3, a * 3)
Math
All Math member generic functions are optimally implemented except for cummax(), cummin(), cumprod(), and cumsum().
# Optimally implemented
abs(a_dsa * -1)
sign(a_dsa)
sqrt(a_dsa)
ceiling(a_dsa + 0.3)
floor(a_dsa + 0.3)
trunc(a_dsa + 0.7)
log(a_dsa)
log10(a_dsa)
log2(a_dsa)
log1p(a_dsa)
acos(a_dsa)
acosh(a_dsa)
asin(a_dsa)
asinh(a_dsa)
atan(a_dsa)
atanh(a_dsa)
exp(a_dsa)
expm1(a_dsa)
cos(a_dsa)
cosh(a_dsa)
cospi(a_dsa)
sin(a_dsa)
sinh(a_dsa)
tan(a_dsa)
tanh(a_dsa)
tanpi(a_dsa)
gamma(a_dsa)
lgamma(a_dsa)
digamma(a_dsa)
trigamma(a_dsa)
# Sub-optimally implemented
cummin(a_dsa)
cummax(a_dsa)
cumprod(a_dsa)
cumsum(a_dsa)
Math2
All Math2 member generic functions are optimally implemented.
# Optimally implemented
round(a_dsa + 0.37, 1)
signif(a_dsa + 0.37, 2)
Summary
All Summary member generic functions are optimally implemented.
# Optimally implemented
all(a_dsa)
all(a_dsa - 1L)
any(a_dsa)
any(a_dsa * 0L)
sum(a_dsa)
prod(a_dsa)
min(a_dsa)
max(a_dsa)
range(a_dsa)
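As a rough illustration of how a summary can be computed without densifying, the base R sketch below works on a hand-written key and val. It is an assumption about the general approach, not a description of DSArray's actual methods: the total is obtained by weighting each unique row's sum by how often the key refers to it, and the maximum is read from val alone.

```r
# A toy key/val pair: 3 unique rows shared by 2 samples of 3 rows each.
val <- matrix(c(1, 2,
                3, 4,
                5, 6), ncol = 2, byrow = TRUE)
key <- matrix(c(1, 1, 2,
                2, 3, 1), ncol = 2)

# Dense reference: expand every sample and stack the rows.
dense <- val[as.vector(key), , drop = FALSE]

# Sparse sum: how often each row of val is used, times that row's sum.
times_used <- tabulate(key, nbins = nrow(val))
sparse_sum <- sum(rowSums(val) * times_used)

# Sparse max: every row of val is referenced by the key at least once,
# so the maximum of val is the maximum of the dense data.
sparse_max <- max(val)

c(sparse_sum, sum(dense))
c(sparse_max, max(dense))
```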
Complex
None of the Complex member generic functions are implemented because DSArray objects do not currently support complex numbers.
DSArray(array(1i))
An efficient representation of sparse 3-dimensional arrays within a SummarizedExperiment was the motivation for the development of DSArray. The SummarizedExperiment package defines an important base class in the Bioconductor project. I needed an efficient way to store DNA methylation patterns, a particular kind of genomic data, that was compatible with the SummarizedExperiment package.
Here is a simple example showing that a DSArray works within a SummarizedExperiment.
library(SummarizedExperiment)
se <- SummarizedExperiment(list(counts = a_dsa))
assays(se)
assay(se)
sum(assay(se))
se[, 2]
se[c(1, 3), ]
dimnames(se)
rbind(se, se)
cbind(se, se)
TODO