rhdf5 Practical Tips
In rhdf5: R Interface to HDF5

library(rhdf5)
library(dplyr)
library(ggplot2)

Introduction

There are scenarios where the most intuitive approach to working with rhdf5 or HDF5 will not be the most efficient. This may be due to unfamiliar bottlenecks when working with data on-disk rather than in memory, or idiosyncrasies in either the HDF5 library itself or the rhdf5 package. This vignette is intended to present a collection of hints for circumventing some common pitfalls.

Reading subsets of data

One of the cool features about the HDF5 file format is the ability to read subsets of the data without (necessarily) having to read the entire file, keeping both the memory usage and execution times of these operations to a minimum. However this is not always as performant as one might hope.

To demonstrate we'll create some example data. This takes the form of a matrix with 100 rows and 20,000 columns, where the content of each column is the index of the column i.e. column 10 contains the value 10 repeated, column 20 contains 20 repeated etc. This is just so we can easily check we've extracted the correct columns. We then write this matrix to an HDF5 file, calling the dataset 'counts'. ^[You'll probably see a warning here regarding chunking, something we'll touch on later]

m1 <- matrix(rep(1:20000, each = 100), ncol = 20000, byrow = FALSE)
ex_file <- tempfile(fileext = ".h5")
h5write(m1, file = ex_file, name = "counts", level = 6)

Using the `index` argument

Now we'll use the index argument to selectively extract the first 10,000 columns and time how long this takes.

system.time(
  res1 <- h5read(file = ex_file, name = "counts", 
                 index = list(NULL, 1:10000))
)

Next, instead of selecting 10,000 consecutive columns we'll ask for every other column. This should still return the same amount of data and since our dataset is not chunked involves reading the same volume from disk.

index <- list(NULL, seq(from = 1, to = 20000, by = 2))
system.time(
  res2 <- h5read(file = ex_file, name = "counts", 
                 index = index)
)

As we can see this is massively slower than the previous example. This is because creating unions of hyperslabs is currently very slow in HDF5 (see Union of non-consecutive hyperslabs is very slow for another report of this behaviour), with the performance penalty increasing exponentially relative to the number of unions. When we use the index argument rhdf5 creates a hyperslab for each disjoint set of values we want to extract and then merges them. In our first example this only require the creation of a single 100 $\times$ 10,000 hyperslab, where as in the second case we require 10,000 hyperslabs of dimension 100 $\times$ 1 and 9,999 merge operations.

Using hyperslab selections

If there is a regular pattern to the regions you want to access, then it is likely you could also apply use HDF5's hyperslab selection method ^[The parameters for defining hyperslab selection start, stride, block, & count are not particularly intuitive if you are used to R's index selection methods. More examples discussing how to specify them can be found at www.hdfgroup.org]. The following code defines the parameters to select every other column, the same as in our previous example.

start <- c(1,1)
stride <- c(1,2)
block <- c(100,1)
count <- c(1,10000)
system.time(
  res3 <- h5read(file = ex_file, name = "counts", start = start,
                 stride = stride, block = block, count = count)
)
identical(res2, res3)

This is clearly significantly quicker than using the index argument in the example, and the call to identical() confirms we're returning the same data.

rhdf5 is sophisticated enough to combine consecutive columns into a single call, so selecting completely disjoint alternative columns represents a worst case scenario. The impact would be far less if, for example, we wanted to extract columns 1 - 5,000 and 6,001 - 11,000. In that scenario it would probably not be noticeably beneficial to move away from using the index argument, but for less contiguous selections making use of the hyperslab selection parameters can be extremely beneficial.

## this code demonstrates the exponential increase in time as the 
## number of hyberslab unions increases

select_index <- function(n = 1) {

  ## open the dataspace for the count table
  fid <- H5Fopen(ex_file)
  did  <- H5Dopen(fid, name = "counts")
  sid <- H5Dget_space(did)

  ## column choice based on number of unions required
  columns <- c(head(1:10001, n = -n), head(seq(10001-n+2, 20000, 2), n = n-1))
  index <- list(100, columns)
  H5Sselect_index(sid, index = index)

  ## tidy up
  H5Sclose(sid)
  H5Dclose(did)
  H5Fclose(fid)
}

bm <- microbenchmark::microbenchmark(
  select_index(1), select_index(2), select_index(5), 
  select_index(10), select_index(20), select_index(50),
  select_index(100), select_index(200), select_index(500),
  select_index(1000), select_index(2000), select_index(5000),
  select_index(10000),
  times = 3
) %>% mutate(n = gsub(".*\\(([0-9]+)\\)", "\\1", expr) 
             %>% as.integer(),
             time = time / 1e9)

ggplot(bm,aes(x = n, y = time)) +
  geom_point() + 
  scale_x_log10() + 
  scale_y_log10() +
  theme_bw() +
  xlab('number of hyperslab unions') +
  ylab('time (seconds)')

Irregular selections

If there isn't a regular pattern to the columns you want to select, what are the options? Perhaps the most obvious thing we can try is to skip the use of either index or the hyperslab parameters and use 10,000 separate read operations instead. Below we choose a random selection of columns and then apply the function f1() to each in turn.

set.seed(1234)
columns <- sample(x = seq_len(20000), size = 10000, replace = FALSE) %>%
  sort()

f1 <- function(cols, name) { 
  h5read(file = ex_file, name = name, 
         index = list(NULL, cols))
  }
system.time(res4 <- vapply(X = columns, FUN = f1, 
                           FUN.VALUE = integer(length = 100), 
                           name = 'counts'))

This is clearly a terrible idea, it takes ages! For reference, using the index argument with this set of columns takes r system.time(h5read(file = ex_file, name = "counts", index = list(NULL, columns)))['elapsed'] seconds. This poor performance is driven by two things:

Our dataset had no chunking applied to it when it was created. This means for each access the entire dataset is read from disk, which we end up doing 10,000 times.
rhdf5 does a lot of validation on the objects that are passed around internally. Within a call to h5read() HDF5 identifiers are created for the file, dataset, file dataspace, and memory dataspace, each of which are checked for validity. This overhead is negligible when only one call to h5read() is made, but become significant when we make 10,000 separate calls.

There's not much more you can do if the dataset is not chunked, and using the index argument is reasonable. However storing data in this format defeats of of HDF5's key utilities, namely rapid random access. As such it's probably fairly rare to encounter datasets that aren't chunked. With this in mind we'll create a new dataset in our file, based on the same matrix but this time split into 100 $\times$ 100 chunks.

h5createDataset(file = ex_file, dataset = "counts_chunked", 
                dims = dim(m1), storage.mode = "integer", 
                chunk = c(100,100), level = 6)
h5write(obj = m1, file = ex_file, name = "counts_chunked")

If we rerun the same code, but reading from the chunked datasets we get an idea for how much time is wasted exctracted the entire dataset over and over.

system.time(res5 <- vapply(X = columns, FUN = f1, 
                           FUN.VALUE = integer(length = 100), 
                           name = 'counts_chunked'))

This is still quite slow, and the remaining time is being spent on the overheads associated with multiple calls to h5read(). To reduce these the function f2()^[This is not the greatest function ever, things like the file name are hardcoded out of sight, but it illustrates the technique.] defined below splits the list of columns we want to return into sets grouped by the parameter block_size. In the default case this means any columns between 1 & 100 will be placed together, then any between 101 & 200, etc. We then lapply our previous f1() function over these groups. The effect here is to reduce the number of calls to h5read(), while keeping the number of hyperslab unions down by not having too many columns in any one call.

f2 <- function(block_size = 100) {
  cols_grouped <- split(columns,  (columns-1) %/% block_size)
  res <-  lapply(cols_grouped, f1, name = 'counts_chunked') %>%
    do.call('cbind', .)
}
system.time(f2())

bm <- microbenchmark::microbenchmark(
  f2(10), f2(25), f2(50), f2(100), 
  f2(250), f2(500), f2(1000), 
  f2(2000), f2(5000), f2(10000),
  times = 3
) %>% mutate(block_size = gsub(".*\\(([0-9]+)\\)", "\\1", expr) 
             %>% as.integer(),
             time = time / 1e9)

We can see this has a significant effect, although it's still an order of magnitude slower than when we were dealing with regularly spaced subsets. The efficiency here will vary based on a number of factors including the size of the dataset chunks and the sparsity of the column index, and you varying the block_size argument will produce differing performances. The plot below shows the timings achived by providing a selection of values to block_size. It suggests the optimal parameter in this case is probably a block size of r bm %>% arrange(time) %>% slice(1) %>% .$block_size, which took r bm %>% arrange(time) %>% slice(1) %>% .$time %>% round(2) seconds - noticeably faster than when passing all columns to the index argument in a single call.

ggplot(bm, aes(x = block_size, y = time)) + 
  geom_point() + 
  scale_x_log10() +
  theme_bw() + 
  ylab('time (seconds)')

Summary

Efficiently extracting arbitrary subsets of a HDF5 dataset with rhdf5 is a balancing act between the number of hyperslab unions, the number of calls to h5read(), and the number of times a chunk is read. For a (mostly) contiguous subset using the index argument is sufficient, while for regularly spaced but disjoint subsets the hyperslab selection parameters offer an efficient, if slightly more complex, alternative. Otherwise finding the optimal strategy may involve some experimentation in a similar fashion to we have seen above.