In vjcitn/benchOOM: benchmarking out-of-memory strategies for R

Introduction

In many large-data situations, it is impractical to load and retain data in R's working memory space. We have had a look at HDF5, SQLite and tabix-indexed text as possible solutions to problems arising with memory constraints. We'll call these "out-of-memory" (OOM) approaches

How can we obtain data on which approach will be most effective for a given task? Comparative benchmarking is a very useful skill and we give a very rudimentary account of this here.

The harness

It is common to speak of a program that drives other programs as a "harness" (see wikipedia for related discussion). We have such a program in ph525x:

library(benchOOM)
benchOOM

This program is going to help us assess performance of various OOM approaches. We consider a very limited problem, that of managing data that could reside in an R matrix. The main parameters are

NR and NC: row and column dimensions
times: number of benchmark replications for averaging
inseed: a seed for random number generation to ensure reproducibility
methods: a list of methods

The methods parameter is most complex. Each element of the list is assumed to be a function with the matrix to be managed via OOM as the first argument, some additional parameters, and a parameter intimes that gives the number of benchmark replicates.

Our objective is to produce a table that looks like

> b1
    NR  NC times       meth        wr     ingFull      ing1K
1 5000 100     5       hdf5  10.71714   9.4100810 14.2984402
2 5000 100     5         ff  25.34365  63.0977338  4.4320688
3 5000 100     5     sqlite 174.89003 105.1254638 28.4717496
4 5000 100     5 data.table  49.35190   7.9871552 13.9007588
5 5000 100     5  bigmemory  23.39697   0.9660878  0.9950034

where each method listed in meth is asked to perform the same task a fixed number of times for averaging. The construction of the table occurs by binding together metadata about the task and method to the result of getStats. We'll leave the details of getStats to independent investigation.

An example method (OOM benchmarker for HDF5 matrix)

Let's look at the method for HDF5:

benchOOM:::.h5RoundTrip

The program has three main phases

HDF5-related setup, cleaning out any previous archives and establishing the basic target file
Benchmarking of data export via h5write
Benchmarking of ingestion via h5read with various restrictions

The results of microbenchmark are assembled in a list.

vjcitn/benchOOM documentation built on April 19, 2021, 11:20 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com