In many large-data situations, it is impractical to load and retain data in R's working memory space. We have looked at HDF5, SQLite and tabix-indexed text as possible solutions to problems arising from memory constraints. We'll call these "out-of-memory" (OOM) approaches.
How can we obtain data on which approach will be most effective for a given task? Comparative benchmarking is a useful skill, and we give a rudimentary account of it here.
It is common to speak of a program that drives other programs as a "harness" (see Wikipedia for related discussion). We have such a program in ph525x:
library(benchOOM)
benchOOM
This program is going to help us assess performance of various OOM approaches. We consider a very limited problem, that of managing data that could reside in an R matrix. The main parameters are:
NR and NC: row and column dimensions
times: number of benchmark replications for averaging
inseed: a seed for random number generation to ensure reproducibility
methods: a list of methods
The methods parameter is most complex. Each element of the list is assumed to be a function with the matrix to be managed via OOM as the first argument, some additional parameters, and a parameter intimes that gives the number of benchmark replicates.
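As a rough illustration of that calling convention, the sketch below defines a hypothetical method, rdsRoundTrip, which is not part of benchOOM. It assumes base R's saveRDS/readRDS as a stand-in storage backend and uses system.time in place of microbenchmark, so the timings are only indicative.

```r
# Hypothetical method following the convention described above:
# the matrix comes first, then extra parameters, then intimes.
# saveRDS/readRDS stand in for a real OOM backend.
rdsRoundTrip <- function(x, file = tempfile(fileext = ".rds"), intimes = 5) {
  on.exit(unlink(file))
  # elapsed seconds for each replicate of the write step
  wr <- vapply(seq_len(intimes), function(i) {
    system.time(saveRDS(x, file))[["elapsed"]]
  }, numeric(1))
  # elapsed seconds for each replicate of a full read
  ingFull <- vapply(seq_len(intimes), function(i) {
    system.time(readRDS(file))[["elapsed"]]
  }, numeric(1))
  list(wr = wr, ingFull = ingFull)
}

set.seed(1234)  # fixed seed, in the spirit of the inseed parameter
m <- matrix(rnorm(5000 * 100), nrow = 5000, ncol = 100)
res <- rdsRoundTrip(m, intimes = 3)
sapply(res, mean)  # average time per phase
```

A list of such functions, one per backend, is the kind of object the methods parameter is described as expecting.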
Our objective is to produce a table that looks like
> b1
    NR  NC times       meth        wr     ingFull      ing1K
1 5000 100     5       hdf5  10.71714   9.4100810 14.2984402
2 5000 100     5         ff  25.34365  63.0977338  4.4320688
3 5000 100     5     sqlite 174.89003 105.1254638 28.4717496
4 5000 100     5 data.table  49.35190   7.9871552 13.9007588
5 5000 100     5  bigmemory  23.39697   0.9660878  0.9950034
where each method listed in meth is asked to perform the same
task a fixed number of times for averaging. The construction of
the table occurs by binding together metadata about the task and
method to the result of getStats. We'll leave the details
of getStats to independent investigation.
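The binding step can be sketched in base R. Here meanStats is a hypothetical stand-in for getStats (whose actual behavior is not shown here), bindRow is likewise a made-up helper, and the timing vectors are invented purely to show the shape of the result.

```r
# Hypothetical stand-in for getStats: reduce each timing vector to its mean.
meanStats <- function(timings) {
  as.data.frame(lapply(timings, mean))
}

# Bind task metadata to the per-method statistics, giving one row per method.
bindRow <- function(NR, NC, times, meth, timings) {
  cbind(data.frame(NR = NR, NC = NC, times = times, meth = meth,
                   stringsAsFactors = FALSE),
        meanStats(timings))
}

# Invented timing vectors, for illustration only (not measurements).
t1 <- list(wr = c(1, 2, 3), ingFull = c(4, 5, 6), ing1K = c(0.5, 1, 1.5))
t2 <- list(wr = c(2, 4, 6), ingFull = c(1, 2, 3), ing1K = c(2, 2, 2))

tab <- rbind(bindRow(5000, 100, 3, "methodA", t1),
             bindRow(5000, 100, 3, "methodB", t2))
tab
```

Stacking one such row per method with rbind yields a data frame with the same columns as the table above.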
Let's look at the method for HDF5:
benchOOM:::.h5RoundTrip
The program has three main phases:
h5write
h5read with various restrictions
The results of microbenchmark are assembled in a list.
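A write-then-read round trip of this kind might look roughly like the following. This is a hypothetical sketch, not the body of benchOOM:::.h5RoundTrip: the function name h5Sketch, the dataset name "x", and the choice of the first 1000 rows as the restricted read (guessed from the ing1K column name) are all assumptions. It requires the rhdf5 and microbenchmark packages and only runs the example when both are installed.

```r
# Hypothetical sketch of an HDF5 round trip timed with microbenchmark.
h5Sketch <- function(x, file = tempfile(fileext = ".h5"), intimes = 5) {
  stopifnot(requireNamespace("rhdf5", quietly = TRUE),
            requireNamespace("microbenchmark", quietly = TRUE))
  on.exit(unlink(file))
  rhdf5::h5createFile(file)
  # assumed restriction: the first 1000 rows, all columns
  rows <- seq_len(min(1000, nrow(x)))
  list(
    # write the matrix to the dataset named "x"
    wr = microbenchmark::microbenchmark(
      rhdf5::h5write(x, file, "x"), times = intimes),
    # read the whole dataset back
    ingFull = microbenchmark::microbenchmark(
      rhdf5::h5read(file, "x"), times = intimes),
    # read only the selected rows
    ing1K = microbenchmark::microbenchmark(
      rhdf5::h5read(file, "x", index = list(rows, NULL)), times = intimes)
  )
}

if (requireNamespace("rhdf5", quietly = TRUE) &&
    requireNamespace("microbenchmark", quietly = TRUE)) {
  m <- matrix(rnorm(5000 * 100), nrow = 5000)
  res <- h5Sketch(m, intimes = 3)
  names(res)
}
```

Each list element is a microbenchmark object, so summary() can be applied to it to extract statistics such as the mean or median time.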