knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE) library(DSArray)
TODO: Consider rendering this separately from the R CMD build to save time.
This vignette discusses the performance of the DSArray class and methods by comparing these to alternative representations of the data. For a basic introduction to the DSArray package, please see the introduction (TODO: link) vignette.
Remember,
Using a DSArray object is only desirable if your data can be represented as a 3-dimensional array - features as rows, samples as columns, and measurements as slices - where many of the slices are duplicated within and between samples and these slices have length > 1. DSArray objects are then an efficient, in-memory representation of the data.
To emphasise, the DSArray package serves a niche purpose, and this vignette only analyses the performance of the DSArray class when being used for this specific purpose.
Prior to (and while) developing the DSArray package, I investigated alternative representations of these type of data. The below table describes 3 alternative data structures, which I compare DSArray against in this vignette.
| Class | Package | Sparse representation | Storage mode |
|-------------------------|--------------------------|-----------------------|--------------|
| array | base (TODO: link) | No | In memory |
| simple_sparse_array | r CRANpkg("slam")
| Yes | In memory |
| HDF5Array | r Biocpkg("HDF5Array")
| No | On disk |
The array class is the simplest representation, but is obviously going to be large in memory since it stores the array in dense form. It does, however, come with the benefit that it is familiar and simple for most R users to manipulate.
To try to reduce the size of the data in memory, I investigated using a sparse array representation for the type data. The r CRANpkg("slam")
package provides data structures and (a limited number of) algorithms and methods for sparse arrays[^Matrix]. Unfortunately, the type of data I needed to work with are not well suited a classical sparse representation because these data contain both missing data (encoded as NA
) and data that are truly zero (encoded as 0
). This does not lend itself well to most sparse array representation, which categories each data point to be zero or non-zero; NA
values are non-zero and therefore contribute to making the data less sparse[^NA_to_zero].
[^Matrix]: The r CRANpkg("Matrix")
package is a popular choice for storing sparse matrices. However, it does not support multidimensional arrays, which means it is not immediately suitable for storing the type of data I have.
[^NA_to_zero]: Of course, NA
could be encoded as 0
, but this incurs a loss of information.
Finally, an obvious way to reduce memory usage is to store the data on disk rather than in memory. The r Biocpkg("HDF5Array")
package is one such R package for storing arrays on disk, using the Hierarchical Data Format (HDF).
The toy examples in the introduction vignette demonstrate how to construct a DSArray, but these data are too small and simple to benefit from a DSArray representation. For the remainder of this vignette we will use the dsa
example dataset that is included with DSArray. We also construct an array (a
), simple_sparse_array (ssa
), and HDF5Array (hdf5a
) representation of these data.
# Load the example data in DSArray representation dsa # Create the array representation a <- as(dsa, "array") # Create the simple_sparse_array representation library(slam) ssa <- as.simple_sparse_array(a) # Create the HDF5Array reprsentation library(HDF5Array) hdf5a <- writeHDF5Dataset(a, file = paste0(tempdir(), "hdf5a"), name = "hdf5a") # NOTE: We need to wrap the result in a HDF5Array object (see ?HDF5Array for # details) # TODO: Is it okay to write a .h5 during tests? hdf5a <- HDF5Array(hdf5a)
While these data are extracted from real biological experiments, it is not necessary to understand the details to follow the vignette. Briefly, dsa
contains counts of patterns of DNA methylation at 1,000,000 randomly sampled features (rows) from 17 samples (columns). For each feature and sample, the number of occurences of each of the $2^4 = 16$ observable patterns of DNA methylation is recorded (slices). Hence, the dimensions of dsa
are $1000000 \times 17 \times 16$.
dim(dsa) dim(a) dim(ssa) dim(hdf5a)
Importantly, these data are very sparse. Although at least one sample has at least one observation for every feature, the majority of samples have no observations for the majority of features (these are recorded as a slice of NA
s). Furthermore, there are many features and samples with identical (duplicated) counts, hence these data are perfectly suited to storing as a DSArray.
# TODO: Write a summary,DSArray-method; might be interesting to know percentage # NA for each sample and summary() of non-NA values per sample/slice
We can compute the size of each object in memory using pryr::object_size()
from the r CRANpkg("pryr")
package:
pryr::object_size(dsa) pryr::object_size(a) pryr::object_size(ssa) pryr::object_size(hdf5a)
Unsurprisingly, the HDF5Array has a very small size in memory since the data are stored on disk[^hdf5_size]. The DSArray representation is very efficient, only r as.numeric(round(100 * pryr::object_size(dsa) / pryr::object_size(a), 0))
% the size of the array representation. As hinted aboce, the simple_sparse_array is not suitable for these data because it must record NA
values as non-zero elements, thereby making these data very dense indeed. In fact, the simple_sparse_array representation is r as.numeric(round(100 * pryr::object_size(ssa) / pryr::object_size(a), 0))
% the size of the array representation. For this reason we do not consider simple_sparse_array in what follows[^ssa].
[^hdf5_size]: The HDF file containing the data is r round(file.size(hdf5a@seed@file) / 1024^2, 1)
MB on disk
[^ssa]: It is worth noting, however, that if NA
and 0
counts can be considered equivalent, then the simple_sparse_array representation is very efficient (73.4 MB in memory).
# Not using this object in what follows rm(ssa)
The above example demonstrates the efficiency of a DSArray for a specific dataset. We next demonstrate the memory efficiency of a DSArray relative to array over a range of nrow
, ncol
, nslice
, and the proportion of unique slices (pus
).
UP TO HERE: Make this plot and include a reference point for the dsa
dataset
nrow <- 100 ^ seq(1, 5, 1) ncol <- c(1, 5, 10, 15, 20, 30) nslice <- 2 ^ seq(0, 5, 1) pus <- seq(0, 0.9, 0.1) grid <- expand.grid(nrow = nrow, ncol = ncol, nslice = nslice, pus = pus) grid$size_of_dsarray <- apply(grid, 1, function(x) { DSArray:::.sizeDSArray(nrow = x[1], ncol = x[2], nslice = x[3], pus = x[4], so = 4) }) grid$size_of_array <- apply(grid, 1, function(x) { DSArray:::.sizeBaseArray(nrow = x[1], ncol = x[2], nslice = x[3], so = 4) }) grid$size_ratio <- apply(grid, 1, function(x) { DSArray:::.sizeRatio(nslice = x[3], pus = x[4], so = 4) }) library(ggplot2) ggplot(data = grid, aes(x = 1 - pus, y = size_of_dsarray, col = factor(nslice))) + geom_line() + facet_grid(ncol ~ nrow, labeller = labeller(ncol = label_both, nrow = label_both)) + scale_y_log10() + annotation_logticks(sides = "l") # TODO: Normalise above plot by size of worst case ggplot(data = grid, aes(x = 1 - pus, y = size_of_array, col = factor(nslice))) + geom_line() + facet_grid(ncol ~ nrow, labeller = labeller(ncol = label_both, nrow = label_both)) + scale_y_log10() + annotation_logticks(sides = "l") ggplot(data = grid, aes(x = 1 - pus, y = size_ratio, col = factor(nslice))) + geom_line() + facet_grid(ncol ~ nrow, labeller = labeller(ncol = label_both, nrow = label_both)) + geom_hline(yintercept = 1) ggplot(data = grid, aes(x = 1 - pus, y = size_ratio, col = factor(nslice))) + geom_line() + facet_grid(ncol ~ nrow, labeller = labeller(ncol = label_both, nrow = label_both)) + scale_y_log10() + annotation_logticks(sides = "l") # NOTE: Slope is constant and independent of nrow and ncol in above plot, so # just focus on one subplot ggplot(data = subset(grid, nrow == 10000 & ncol == 30), aes(x = 1 - pus, y = size_ratio, col = factor(nslice))) + geom_line() + geom_hline(yintercept = 1) ggplot(data = subset(grid, nrow == 10000 & ncol == 30), aes(x = 1 - pus, y = size_ratio, col = factor(nslice))) + geom_line() + geom_hline(yintercept = 1) + scale_y_continuous(trans = "log2") + annotation_logticks(sides = "l") # TODO: You 'break even' when (1 - pus) = 1 / nslice; figure the general # formula for when you make a x% saving
Here are some simple speed benchmarks[^benchmarks] of commonly used operations.
TODO: Specs of my machine
[^benchmark]: We use the r CRANpkg("microbenchmark")
package to perform the timing benchmarks
Note that subsetting a HDF5Array appears to be super fast, but is not an apples-to-apples comparison with the other methods. In the HDF5Array package, subsetting is a delayed operation that is not not realised until wrapped in a call to as.array()
(to realise in memory) or TODO()
(to realise on disk as a new .h5
file); see ?HDF5Array::DelayedArray
for further detail. Therefore, the fairer comparison is to the method that wraps the result in as.array()
.
library(microbenchmark) set.seed(666) # TODO: Up this to 5 or 10 for the final rendering # TODO: Uncomment as.array() calls times <- 5 # Subsetting by rows (unsorted index) i <- sample(nrow(dsa), nrow(dsa) / 10) microbenchmark(dsa[i, , ], a[i, , , drop = FALSE], hdf5a[i, , ], # as.array(hdf5a[i, , ]), times = times) # Subsetting by rows (sorted index) sorted_i <- sort(i) microbenchmark(dsa[sorted_i, , ], a[sorted_i, , , drop = FALSE], hdf5a[sorted_i, , ], # as.array(hdf5a[sorted_i, , ]), times = times) # Subsetting by columns (unsorted index) j <- sample(ncol(dsa), 8) microbenchmark(dsa[, j, ], a[, j, , drop = FALSE], hdf5a[, j, ], # as.array(hdf5a[, j, ]), times = times) # Subsetting by columns (sorted index) sorted_j <- sort(j) microbenchmark(dsa[, sorted_j, ], a[, sorted_j, , drop = FALSE], hdf5a[, sorted_j, ], # as.array(hdf5a[, sorted_j, ]), times = times) # Subsetting by slices (unsorted index) k <- sample(nslice(dsa), 4) microbenchmark(dsa[, , k], a[, , k, drop = FALSE], hdf5a[, , k], # as.array(hdf5a[, , k]), times = times) # Subsetting by slices (sorted index) sorted_k <- sort(k) microbenchmark(dsa[, , sorted_k], a[, , sorted_k, drop = FALSE], hdf5a[, , sorted_k], # as.array(hdf5a[, , sorted_k]), times = times) # Subsetting by rows and columns (unsorted indices) microbenchmark(dsa[i, j, ], a[i, j, ], hdf5a[i, j, ], # as.array(hdf5a[i, j, ]), times = times) # Subsetting by rows and columns (sorted indices) microbenchmark(dsa[sorted_i, sorted_j, ], a[sorted_i, sorted_j, ], hdf5a[sorted_i, sorted_j, ], # as.array(hdf5a[sorted_i, sorted_j, ]), times = times)
TODO: Summarise results
# Sum microbenchmark(sum(dsa, na.rm = TRUE), sum(a, na.rm = TRUE), sum(hdf5a, na.rm = TRUE), times = 5)
TODO: More summaries
TODO: Not really necessary for initial release
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.