knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(DSArray)

DSArray ("desiree") provides efficient in-memory representation of 3-dimensional arrays that contain many duplicate slices via the DSArray (Duplicate Slice Array) S4 class. A basic array-like API is provided for instantiating, subsetting, and combining DSArray objects.

This vignette introduces the DSArray class and demonstrates common operations. The benchmarking vignette compares the use of DSArray objects, array objects from the base package, sparse matrix objects from the r CRANpkg("Matrix") package, and HDF5Array objects from the r Biocpkg("HDF5Array") package.

What the hell do I do with this?

The DSArray package serves a niche purpose. However, since I've found it useful, I'm making it publicly available. Here is the problem in words and a picture illustrating the solution that r Biocpkg("DSArray") offers.

Suppose you have data on a set of n samples where each sample's data can be represented as a matrix (x1, ..., xn) where dim(x1) = ... = dim(xn) = c(nrow, ncol). We can combine these matrices along a given dimension to form a 3-dimensional array, x. DSArray is designed for the special case where there are many duplicate slices of x. Continuing our example, if each of the x1, ..., xn have duplicate rows and we combine x1, ..., xn to form x such that x[, j, ] represents xj, then for this special case we can efficiently represent the data by storing only the unique rows of the x1, ..., xn and an associated index. A picture will hopefully help make this clearer:

set.seed(666)
# TODO: Figure doesn't look the same in vignette as it does in RStudio
n <- 3
nrow <- 20
ncol = 8
DSArray:::.drawDSArray(n = 3, nrow = 20, ncol = 8)

In this example we have n = 3 matrices, each shown as a slice of x (x[, 1, ], x[, 2, ], x[, 3, ]) with nrow = 20 and ncol = 8, where the colour of the row identifies identical rows. Note that the same row may be found multiple times within a sample and may also be common to multiple samples. We can construct the DSArray representation of x by calling DSArray(x). The DSArray representation has a key and a val, much like an associative array, map, or dictionary. The j-th column of the key is the key for the j-th sample (note the colour ordering of each sample). The val contains all unique rows found in the n samples.

We can reconstruct the data for a particular sample by expanding the val by the relevant column of the key. We can often compute the required summaries of the data while retaining this sparse representation. In this way, a DSArray is similar to using a run length encoding of a vector or a sparse matrix representation to leverage the additional structure in the object.

Constructing a DSArray

The DSArray() function provides several different ways to construct a DSArray object. We demonstrate its use when working with both a single sample and multiple samples.

Single sample

Here we have data from a single sample in a matrix, from which we wish to construct the DSArray representation:

m <- matrix(1:10, ncol = 2, dimnames = list(letters[1:5], LETTERS[1:2]))
m 
m_dsa <- DSArray(m)
m_dsa

Note that columns of m becomes the slices of m_dsa; this is because a DSArray uses columns to represent samples. Also note that by default DSArray() constructs the dimnames from the input:

dimnames(m)
dimnames(m_dsa)

We can override these by supplying them as the dimnames argument, in particular to set the column names (sample names) for these data:

dimnames(DSArray(m, dimnames = list(rownames(m), "sample-1", colnames(m))))

Multiple samples

When we have data on multiple samples, these might already be represented as a 3-dimensional array:

a <- array(c(1, 3, 5, 10, 30, 50, 100, 300, 500, 2, 4, 6, 20, 40, 60, 200, 
             400, 600),
           dim = c(3, 3, 2),
           dimnames = list(letters[1:3], LETTERS[1:3], letters[25:26]))

Here, each sample's data are a column of a:

# Sample A
a[, "A", ]
# Sample B
a[, "B", ]
# Sample C
a[, "C", ]

By default, the DSArray() constructor assumes the columns of an array input represent the samples:

a_dsa <- DSArray(a)
a_dsa

But we can specify this explicitly by setting the MARGIN argument:

# Default: Columns (2) as samples
DSArray(a, MARGIN = 2)
# Rows (1) as samples
DSArray(a, MARGIN = 1)
# Slices (3) as samples
DSArray(a, MARGIN = 3)

Alternatively, the data may be represented as a list of matrix objects, one per sample, where the dimensions of each matrix are identical:

l <- list(A = a[, "A", ],
          B = a[, "B", ],
          C = a[, "C", ])
l
l_dsa <- DSArray(l)
l_dsa

API and overview of methods

The aim is to allow a DSArray to be used as a drop-in replacement for an array when the need arises. The DSArray API is therefore written to mimic the array API so that DSArray objects behave as if they were 3-dimensional array objects. However, the API coverage is not 100% complete. I am adding these missing methods as needed, so if something you require is missing then please get in touch by filing a feature request at https://github.com/PeteHaitch/DSArray/issues.

Where possible, DSArray aims to avoid "densifying" the data (i.e. converting it to an array for intermediate calculations) since doing so obviously negates the memory efficiency of using a DSArray. In the DSArray documentation, we refer to methods that avoid densifying the data as being optimally implemented and methods that densify the data as being sub-optimally implemented. Not all operations are optimally implemented, some because they are difficult (or perhaps impossible) and others because I haven't yet taken the time to optimise them.

As an example, one operation to avoid if at all possible is subset replacement with the [<- operator; this is a very expensive operation since it first densifies the data and then re-sparsifies.

Subsetting

We can subset a DSArray just as we would an array by using the [ operator[^drop]:

[^drop]: The drop argument to [ is always set to FALSE when subsetting a DSArray.

# Extract the first feature
a_dsa[1, , ]
# Extract the first sample
a_dsa[, 1, ]
# Extract the first slice
a_dsa[, , 1]
# Extract the first 2 features for the first 3 samples
a_dsa[1:2, 1:3, ]

Combining

Rather than provide cbind() and rbind() methods, DSArray provides methods for the acbind() and arbind() generics defined in r Biocpkg("IRAnges"). acbind() and arbind() generalise cbind() and rbind() to array-like objects. These generics bind array-like objects with an arbitrary number of dimensions along their rows (arbind) or columns (acbind). All DSArray objects must have the same number of rows (resp. columns) when acbind()-ing (resp. arbind()-ing).

acbind(a_dsa, a_dsa[, 1, ])
acbind(a, a[ , 1, , drop = FALSE])
arbind(a_dsa, a_dsa[1, , ])
arbind(a, a[1, , , drop = FALSE])

S4 Group Generic Functions

Arith, Compare, Ops, Logic, Math, Math2, Summary, and Complex are group generic functions. Each group generic function has a number of member generic functions associated with it. DSArray provides methods for each of these generics for the DSArray class.

Arith, Compare, and Logic

It is trivial to implement high-performance scalar-DSArray arithmetic, comparison, and logic methods:

# Arithmetic
a_dsa + 3
a_dsa - 3
a_dsa * 3
a_dsa ^ 3
a_dsa %% 3
a_dsa %/% 3
a_dsa / 3

# Comparison
a_dsa == 3
a_dsa > 3
a_dsa < 3
a_dsa != 3
a_dsa <= 3
a_dsa >= 3

# Logic
a_dsa & TRUE
a_dsa | TRUE

Unfortunately, the same is not true of vector-DSArray, array-DSArray, or DSArray-DSArray operations, which all currently require the densification of the DSArray argument(s):

a_dsa + 1:2

We can check that the DSArray method gives an identical result to the array method using the non-exported DSArray:::dsa_identical_to_array() function. For example:

DSArray:::dsa_identical_to_array(a_dsa * 3, a * 3)

Math

All Math member generic functions are optimally implemented except for cummax(), cummin(), cumprod(), and cumsum().

# Optimally implemented
abs(a_dsa * -1)
sign(a_dsa)
sqrt(a_dsa)
ceiling(a_dsa + 0.3)
floor(a_dsa + 0.3)
trunc(a_dsa + 0.7)
log(a_dsa)
log10(a_dsa)
log2(a_dsa)
log1p(a_dsa)
acos(a_dsa)
acosh(a_dsa)
asin(a_dsa)
asinh(a_dsa)
atan(a_dsa)
atanh(a_dsa)
exp(a_dsa)
expm1(a_dsa)
cos(a_dsa)
cosh(a_dsa)
cospi(a_dsa)
sin(a_dsa)
sinh(a_dsa)
tan(a_dsa)
tanh(a_dsa)
tanpi(a_dsa)
gamma(a_dsa)
lgamma(a_dsa)
digamma(a_dsa)
trigamma(a_dsa)

# Sub-optimally implemented
cummin(a_dsa)
cummax(a_dsa)
cumprod(a_dsa)
cumsum(a_dsa)

Math2

All Math2 member generic functions are optimally implemented.

# Optimally implemented
round(a_dsa + 0.37, 1)
signif(a_dsa + 0.37, 2)

Summary

All Summary member generic functions are optimally implemented.

# Optimally implemented
all(a_dsa)
all(a_dsa - 1L)
any(a_dsa)
any(a_dsa * 0L)
sum(a_dsa)
prod(a_dsa)
min(a_dsa)
max(a_dsa)
range(a_dsa)

Complex

None of the Complex member generic functions are implemented because DSArray object do not currently support complex numbers.

DSArray(array(1i))

Using DSArray within a SummarizedExperiment

An efficient representation of sparse 3-dimensional arrays within a SummarizedExperiment was the motivation for the development of DSArray. The SummarizedExperiment package defines an important base class in the Bioconductor project. I needed an efficient way to store DNA methylation patterns, a particular kind of genomic data, that was compatible with the SummarizedExperiment package.

Here is a simple example showing that a DSArray works within a SummarizedExperiment.

library(SummarizedExperiment)
se <- SummarizedExperiment(list(counts = a_dsa))
assays(se)
assay(se)
sum(assay(se))
se[, 2]
se[c(1, 3), ]
dimnames(se)
rbind(se, se)
cbind(se, se)

Future work

TODO



PeteHaitch/DRMatrix documentation built on May 8, 2019, 1:30 a.m.