In PeteHaitch/fstArray: 'fst' backend for DelayedArray objects

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

fstArray

fstArray provides an on-disk backend for a DelayedArray object using the fst file format. fstArray provides a convenient way to store matrix-like data in an .fst file and operate on it using delayed operations and block processing.

The fst file format is designed for serializing data frames. Therefore, fstArray can only handle 2-dimensional arrays (matrices), which it does by (ab)using the fst file format to store columns of the matrix as columns of the serialized data frame. Consequently, fstArray is best used for tasks where data are processed column-wise (or by all columns at once) and, ideally, by processing contiguous rows.

HDF5Array is the obvious alternative to fstArray. The HDF5 file format provides arrays of arbitrary dimension and allows alternative chunking and serialization strategies to better support arbitrary data access patterns.

In initial testing, I find that fstArray is generally faster at reading data from disk, surprisingly often even when reading data using sub-optimal access patterns.

Limitations: Currently, there is no equivalent to HDF5Array::writeHDF5Array(), i.e. writefstArray(). This will be added in a future version of fstArray once the core fst library supports appending data directly on/to disk. For now, an fstArray can be created by wrapping a fst file created by fst::write_fst().

Installation

You can install fstArray from github with:

# install.packages("devtools")
devtools::install_github("PeteHaitch/fstarray/")

Example

devtools::load_all()
library(fst)
library(HDF5Array)
library(microbenchmark)

library(fstArray)
library(fst)
library(HDF5Array)
library(microbenchmark)

Here's a simple example comparing the time to read subsets of the data back into memory using fstArray and HDF5Array. fstArray and HDF5Array are compared without and with data compression[^compression]. To make the results more comparable, we chunk the HDF5Array by column when using compression, mimicing the storage scheme of the fstArray.

[^compression]: The level of compression is the default used by HDF5Array, scaled for use with fstArray.

The data are a 'long' matrix with 1 million rows and 10 columns filled with pseudorandom numbers between 0 and 1.

# Simulate a 'long' matrix with numeric data
nrow <- 1e6
ncol <- 10
x <- matrix(runif(nrow * ncol), 
            ncol = ncol, 
            dimnames = list(NULL, letters[seq_len(ncol)]))

TODO: Also demonstate on data with better compressibility (e.g. sequencing coverage-like data), and 'wide' data. So, 4 datasets: long runif, wide runif, long coverage, wide scRNA-seq
TODO: HDF5Array with default chunking

Summary of results

fstArray is slightly-to-noticeably faster than HDF5Array when reading contiguous rows of data into memory as an ordinary matrix, with the performance gap increasing when data compression is used. This is the best-case data access pattern for a fstArray.

As expected, fstArray is slower than HDF5Array when reading random rows of uncompressed data into memory as an ordinary matrix. Surprisingly, however, fstArray is faster than HDF5Array on this test when data compression is used.

Finally, the worst-case data access pattern for fstArray is reading random elements of data into memory as an ordinary vector. Surprisingly, fstArray is 2x faster than HDF5Array on this benchmark.

Loading data into memory

TODO: Revisit text after re-running benchmarking

We now compare the performance of loading the data into memory as an ordinary matrix. We first create fstArray and HDF5Array representations of the data to be used in this benchmarking:

# Construct fstArray and HDF5Array instances

# Compression levels
# NOTE: Amount of HDF5 compression (`level`) is on a scale of {0, 1, ..., 9}
hdf5_compression_level <- getHDF5DumpCompressionLevel()
hdf5_compression_level
# NOTE: Amount of fst compression (`compress`) is on a scale of {0, 1, ..., 100}
fst_compression_level <- as.integer(hdf5_compression_level / 9 * 101)

# NOTE: Each fst file can only contain one dataset whereas a HDF5 file can 
#       contain multiple datasets
fst_no_compression <- tempfile(fileext = ".fst")
fst::write_fst(x = as.data.frame(x),
               path = fst_no_compression,
               compress = 0)
fstarray_no_compression <- fstArray(fst_no_compression)
fst_with_compression <- tempfile(fileext = ".fst")
fst::write_fst(x = as.data.frame(x),
               path = fst_with_compression,
               compress = fst_compression_level)
fstarray_with_compression <- fstArray(fst_with_compression)
hdf5array_no_compression <- writeHDF5Array(x = x,
                                           chunk_dim = c(nrow(x), 1L),
                                           level = 0)
hdf5array_with_compression <- writeHDF5Array(x = x,
                                             chunk_dim = c(nrow(x), 1L),
                                             level = hdf5_compression_level)

All objects contain the same data[^dimnames]:

[^dimnames]: Currently, a HDF5Array cannot store the dimnames of the array whereas an fstArray can stored column names but not rownames.

all.equal(as.matrix(fstarray_no_compression), 
          as.matrix(fstarray_with_compression))
all.equal(as.matrix(fstarray_no_compression), 
          as.matrix(hdf5array_no_compression),
          check.attributes = FALSE)
all.equal(as.matrix(fstarray_no_compression), 
          as.matrix(hdf5array_with_compression),
          check.attributes = FALSE)

Similar file sizes are achieved for fstArray and HDF5Array instances:

file_sizes_df <- data.frame(
  name = c("fstarray_no_compression",
           "fstarray_with_compression",
           "hdf5array_no_compression",
           "hdf5array_with_compression"),
  class = c("_fstArray_", "_fstArray_", "_HDF5Array_", "_HDF5Array_"),
  compression = c(FALSE, TRUE, FALSE, TRUE),
  `size (bytes)` = c(file.size(seed(fstarray_no_compression)@file),
                     file.size(seed(fstarray_with_compression)@file),
                     file.size(seed(hdf5array_no_compression)@file),
                     file.size(seed(hdf5array_with_compression)@file)),
  check.names = FALSE)
knitr::kable(file_sizes_df)

Loading all data into memory

It is faster to load an entire fstArray than an entire HDF5Array. This is despite fstArray currently having to do a coercion from a data frame to a matrix when loading data into memory[^coercion].

# Benchmark loading all data from disk as an ordinary matrix
# TODO: Make a plot of result
microbenchmark(
  fstarray_no_compression = as.matrix(fstarray_no_compression),
  fstarray_with_compression = as.matrix(fstarray_with_compression),
  hdf5array_no_compression = as.matrix(hdf5array_no_compression),
  hdf5array_with_compression = as.matrix(hdf5array_with_compression),
  times = 10)

Loading contiguous rows of data into memory

Now, loading 10,000 contiguous rows from the middle of the matrix. Again, using an fstArray is a faster than an HDF5Array.

rows <- 500001:600000
# TODO: Make a plot of result (either ggplot2::autoplot() or what's done in fst
#       README)
microbenchmark(
  fstarray_no_compression = as.matrix(fstarray_no_compression[rows, ]),
  fstarray_with_compression = as.matrix(fstarray_with_compression[rows, ]),
  hdf5array_no_compression = as.matrix(hdf5array_no_compression[rows, ]),
  hdf5array_with_compression = as.matrix(hdf5array_with_compression[rows, ]),
  times = 10)

Loading random rows of data into memory

We load 10,000 randomly selected rows, the worst data access pattern for an fstArray. Unsurprisingly, an fstArray is slower than HDF5Array, so algorithms should be designed accordingly when targetting fstArray.

rows <- sample(nrow(x), 10000)
microbenchmark(
  fstarray_no_compression = as.matrix(fstarray_no_compression[rows, ]),
  fstarray_with_compression = as.matrix(fstarray_with_compression[rows, ]),
  hdf5array_no_compression = as.matrix(hdf5array_no_compression[rows, ]),
  hdf5array_with_compression = as.matrix(hdf5array_with_compression[rows, ]),
  times = 10)

Loading random elements of data into memory

Benchmark loading 100,000 random elements of data from disk as an ordinary vector

TODO: How does [,DelayedArray-method work?

elements <- sample(length(x), 100000)
microbenchmark(
  fstarray_no_compression = fstarray_no_compression[elements],
  fstarray_with_compression = fstarray_with_compression[elements],
  hdf5array_no_compression = hdf5array_no_compression[elements],
  hdf5array_with_compression = hdf5array_with_compression[elements],
  times = 10)