In many large-data situations, it is impractical to load and retain data in R's working memory space. We have had a look at HDF5, SQLite and tabix-indexed text as possible solutions to problems arising with memory constraints. We'll call these "out-of-memory" (OOM) approaches
How can we obtain data on which approach will be most effective for a given task? Comparative benchmarking is a very useful skill and we give a very rudimentary account of this here.
It is common to speak of a program that drives other programs as a "harness" (see wikipedia for related discussion). We have such a program in ph525x:
library(benchOOM)
benchOOM
This program is going to help us assess performance of various OOM approaches. We consider a very limited problem, that of managing data that could reside in an R matrix. The main parameters are
NR
and NC
: row and column dimensionstimes
: number of benchmark replications for averaginginseed
: a seed for random number generation to ensure reproducibilitymethods
: a list of methodsThe methods
parameter is most complex. Each element of the list
is assumed to be a function with the matrix to
be managed via OOM as the first argument, some additional
parameters, and a parameter intimes
that gives the number
of benchmark replicates.
Our objective is to produce a table that looks like
> b1 NR NC times meth wr ingFull ing1K 1 5000 100 5 hdf5 10.71714 9.4100810 14.2984402 2 5000 100 5 ff 25.34365 63.0977338 4.4320688 3 5000 100 5 sqlite 174.89003 105.1254638 28.4717496 4 5000 100 5 data.table 49.35190 7.9871552 13.9007588 5 5000 100 5 bigmemory 23.39697 0.9660878 0.9950034
where each method listed in meth
is asked to perform the same
task a fixed number of times for averaging. The construction of
the table occurs by binding together metadata about the task and
method to the result of getStats
. We'll leave the details
of getStats
to independent investigation.
Let's look at the method for HDF5:
benchOOM:::.h5RoundTrip
The program has three main phases
h5write
h5read
with various restrictionsThe results of microbenchmark
are assembled in a list.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.