An introduction to simpleCache

Your first cache

simpleCache has 2 main use cases: First, it can help you pick up where you left off in an R session, and second, it can help you parallelize code by enabling you to share results across R sessions.

The workhorse of simpleCache is the eponymous simpleCache() function, which in the simplest case requires just two parameters: a cache name, and a block of code. The cache name should be considered unique and its underlying object immutable, while the block of code (or instruction) is the R code that generates the object you wish to cache.

But before we start creating caches, it's important to tell simpleCache where to store the caches. simpleCache uses a global variable (RCACHE.DIR) for caches, and provides a setter function (setCacheDir()) to change this. To get started, choose a cache directory, and generate some random data.

library(simpleCache)
cacheDir = tempdir()
setCacheDir(cacheDir)
simpleCache("normSamp", { rnorm(1e7, 0,1) })

Now, watch what happens when we run that same function call again:

simpleCache("normSamp", { rnorm(1e7, 0,1) })

Notice that the second call to simpleCache() doesn't re-run the rnorm calculation. In fact, it doesn't even re-load the cache, because it notices that it's already in memory. If the cache weren't already in memory, this call would load it from disk. This means you can put this code in multiple scripts and pull the same randomized data, without re-doing the compute work.

You can also force a cache to reload using the reload option. This could be useful, for example, if you've loaded a cache and then accidentally changed it, and want to reset. By default, a call to simpleCache() will not reload an object that already exists in your environment. But you can always force it with the reload parameter:

normSamp = NA  # Oops broke my object in memory.
# Regular call won't reload because we have an object called normSamp already:
simpleCache("normSamp", { rnorm(1e7, 0,1) })
# But we can force reload and get it back with reload=TRUE
simpleCache("normSamp", { rnorm(1e7, 0,1) }, reload=TRUE)

What if we want to start over and blow that cache, getting a new random set? Use the recreate flag if you want to ensure that the cache is produced and overwritten even if it already exists:

simpleCache("normSamp", { rnorm(1e7, 0,1) }, recreate=TRUE)

With just those parameters (cache name, instruction, recreate, and reload), you should be able to make good use of simpleCache. The essence is: if the object exists in memory already: do nothing. If it does not exist in memory, but exists on disk: load it into memory. If it exists neither in memory or on disk: create it and store it to disk and memory. Now you've got the basics.

But there's more if you want it: read on!

Comparison to base R save() and load()

Of course, R has base functions that accomplish this (save() and load()), so what does simpleCache add? Well, simpleCache is essentially a convenience wrapper around the base R functions. The first advantage is that we now require only a single function: simpleCache() handles both saving and loading. This means your script does not need to be written differently depending on whether it's generating or loading a cache, because the same function can do either, depending on whether the cache exists or not. The second advantage is that caches are keyed by cache name instead of by filename. So instead of putting a whole path to an Rdata file into load(), we just pass a unique identifier for the cache, and simpleCache handles the rest. Third, simpleCache tries to be smart: if you already have the object in memory, it won't re-load it. For big caches, this can save you time if you accidentally call simpleCache() multiple times on the same cache (or if you write functions to populate an R environment with a bunch of pre-existing data).

Beyond that, simpleCache also offers several convenient options that just make it really easy to save and re-load R objects. Let's go into a bit more detail into these features.

Cache names

By default, the object will be loaded into a variable with the same name as the cache. You can change this behavior with the assignTo parameter:

simpleCache("normSamp", { rnorm(1e7, 0,1) }, assignTo="mySamp")

After doing this command, we have both normSamp (from the previous calls, not from this one) and mySamp (loaded in this call) in the workspace, and these objects are identical:

identical(normSamp, mySamp)

This assignTo concept is useful if you want to create caches but not load them, or load caches one at a time. Which leads us to...

Creating but not loading caches

It may be that you want to create a bunch of caches that are quite memory intensive, and you don't actually need them all in this particular R workspace at the same time. If you just create each object and save it, you'll end with all those objects in memory at the same time. Instead, you can use the noload parameter, which will create the caches but not load them into memory (so the object will be cached, but will not persist in this R environment). I use this frequently in a setup script to build caches that I will need later in individual scripts that will run on each one individually. Let's make 5 caches but not load them:

for (i in 1:5) {
    cacheName = paste0("normSamp_", i)
    simpleCache(cacheName, { rnorm(1e6, 0,1) }, recreate=TRUE, noload=TRUE)
}

We've now produced 5 different sample data caches. They exist on disk, but not in memory. This could, for example, be done in an initial data-generation or setup script. We then may be interested in using these (same) caches in several downstream scripts, and we could do some iterative operation on them and use assignTo to avoid loading more than 1 at a time into memory:

overallMinimum = 1e6  # pick some high number to start
for (i in 1:5) {
    cacheName = paste0("normSamp_", i)
    simpleCache(cacheName, assignTo="temp")
    overallMinimum = min(overallMinimum, temp)
}

message(overallMinimum)

In this code block, by assigning the caches to the variable temp, we only have 1 in memory at a time, because each cache load overwrites the previous one, which is exactly what we want in this case. We keep track of the minimum value of each one independently, and we've effectively calculated an overall minimum while loading only a single cache in memory at a time.

Loading multiple caches

If you've got a bunch of caches and you want them all in memory, you could just load all the caches into memory with this convenience alias:

loadCaches(paste0("normSamp_", 1:5))

The disadvantage of doing it this way is that you've lost the advantage of using the single simpleCache() function for both saving and loading, but this may be desirable in some cases.

By the way, once a cache is created, you no longer need to provide instructions:

simpleCache("normSamp")

simpleCache will load it if it can; if not, it will give you an error saying it requires an instruction.

Timing cache creating

If you want to record how long it takes to create a new cache, you can set timer=TRUE.

simpleCache("normSamp", { rnorm(1e6, 0,1) }, recreate=TRUE, timer=TRUE)

Complicated code

So far, our examples have cached the result of a very simple instruction code block: the rnorm call to randomly generate some numbers. But really, simpleCache can be used to cache anything. The code block can be whatever you want; whatever it returns will be cached. For example, let's cache the result of a call to t.test():

simpleCache("tResult", { 
    dat2 = rnorm(1e5, 0.05,2)
    t.test(normSamp, dat2)
    }, recreate=TRUE)

tResult
tResult$p.value

The point is that the code could be quite complicated and time-consuming. You may only want to calculate it once, and then re-use the result in another script -- or in this same script next time you run it. simpleCache makes that, well, simple.

That's the end of the basics. There are a few more advanced options as well, such as using a shared cache directory, submitting compute requests to a cluster using batchtools, tweaking the loading environment with the loadEnvir parameter (if you need to call simpleCache() from within a function), and tweaking the cache building resources with the buildEnvir parameter. But these options are more advanced and probably not needed for 95% of simpleCache use cases. If you do need more information, you can find further help in the other vignettes or in the detailed R function documentation (see ?simpleCache).

deleteCaches("normSamp", force=TRUE)
deleteCaches(paste0("normSamp_", 1:5), force=TRUE)
deleteCaches("tResult", force=TRUE)


nsheff/simpleCache documentation built on April 24, 2021, 1:38 a.m.