Introduction to Cache

SuggestedPkgsNeeded <- c("terra")
hasSuggests <- all(sapply(SuggestedPkgsNeeded, require, character.only = TRUE, quietly = TRUE))
useSuggests <- !(tolower(Sys.getenv("_R_CHECK_DEPENDS_ONLY_")) == "true")

knitr::opts_chunk$set(eval = hasSuggests && useSuggests)

Reproducible workflow

As part of a reproducible workflow, caching of function calls, code chunks, and other elements of a project is a critical component. The objective of a reproducible workflow is is likely that an entire work flow from raw data to publication, decision support, report writing, presentation building etc., could be built and be reproducible anywhere, on any computer, operating system, with any starting conditions, on demand. The reproducible::Cache function is built to work with any R function.

Differences with other approaches

Cache users DBI as a backend, with key functions, dbReadTable, dbRemoveTable, dbSendQuery, dbSendStatement, dbCreateTable and dbAppendTable. These can all be accessed via Cache, showCache, clearCache, and keepCache. It is optimized for speed of transactions, using fastdigest::fastdigest on R memory objects and digest::digest on files. The main function is superficially similar to archivist::cache, which uses digest::digest in all cases to determine whether the arguments are identical in subsequent iterations. It also but does many things that make standard caching with digest::digest don't work reliably between systems. For these, the function .robustDigest is introduced to make caching transferable between systems. This is relevant for file paths, environments, parallel clusters, functions (which are contained within an environment), and many others (e.g., see ?.robustDigest for methods). Cache also adds important elements like automated tagging and the option to retrieve disk-cached values via stashed objects in memory using memoise::memoise. This means that running Cache 1, 2, and 3 times on the same function will get progressively faster. This can be extremely useful for web apps built with, say shiny.

Function-level caching

Any function can be cached by wrapping Cache around the function call, or by using base pipe |> or using: Cache(FUN = functionName, ...)

This will be a slight change to a function call, such as: terra::project(raster, crs = terra::crs(newRaster)) to Cache(terra::project(raster, crs = terra::crs(newRaster))) or Cache(terra::project, raster, crs = terra::crs(newRaster)) or with the pipe terra::project(raster, crs = terra::crs(newRaster)) |> Cache()

This is particularly useful for expensive operations.

library(reproducible)
library(data.table)

tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)

ras <- terra::rast(terra::ext(0, 300, 0, 300), vals = 1:9e4, res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"

newCRS <- "+init=epsg:4326" # A longlat crs

# No Cache
system.time(suppressWarnings(map1 <- terra::project(ras, newCRS))) # Warnings due to new PROJ

# Try with memoise for this example -- for many simple cases, memoising will not be faster
opts <- options("reproducible.useMemoise" = TRUE)
# With Cache -- a little slower the first time because saving to disk
system.time({
  suppressWarnings({
    map1 <- Cache(terra::project, ras, newCRS, cachePath = tmpDir, notOlderThan = Sys.time())
  })
})

# faster the second time; improvement depends on size of object and time to run function
system.time({
  map2 <- Cache(terra::project, ras, newCRS, cachePath = tmpDir)
})

options(opts)

all.equal(map1, map2, check.attributes = FALSE) # TRUE

Caching examples

Basic use

try(clearCache(tmpDir, ask = FALSE), silent = TRUE) # just to make sure it is clear

ranNumsA <- Cache(rnorm, 10, 16, cachePath = tmpDir)

# All same
ranNumsB <- Cache(rnorm, 10, 16, cachePath = tmpDir) # recovers cached copy
ranNumsD1 <- Cache(quote(rnorm(n = 10, 16)), cachePath = tmpDir) # recovers cached copy
ranNumsD2 <- Cache(rnorm(n = 10, 16), cachePath = tmpDir) # recovers cached copy
# pipe
ranNumsD3 <- rnorm(n = 10, 16) |> Cache(cachePath = tmpDir) # recovers cached copy

# Any minor change makes it different
ranNumsE <- Cache(rnorm, 10, 6, cachePath = tmpDir) # different

Example 1: Basic cache use with tags

ranNumsA <- Cache(rnorm, 4, cachePath = tmpDir, userTags = "objectName:a")
ranNumsB <- Cache(runif(4), cachePath = tmpDir, userTags = "objectName:b")

showCache(tmpDir, userTags = c("objectName"))
showCache(tmpDir, userTags = c("^a$")) # regular expression ... "a" exactly
showCache(tmpDir, userTags = c("runif")) # show only cached objects made during runif call

clearCache(tmpDir, userTags = c("runif"), ask = FALSE) # remove only cached objects made during runif call
showCache(tmpDir) # all

clearCache(tmpDir, ask = FALSE)

Example 2: using the "accessed" tag

ranNumsA <- Cache(rnorm, 4, cachePath = tmpDir, userTags = "objectName:a")
ranNumsB <- Cache(runif, 4, cachePath = tmpDir, userTags = "objectName:b")

# access it again, from Cache
Sys.sleep(1)
ranNumsA <- Cache(rnorm, 4, cachePath = tmpDir, userTags = "objectName:a")
wholeCache <- showCache(tmpDir)

# keep only items accessed "recently" (i.e., only objectName:a)
onlyRecentlyAccessed <- showCache(tmpDir, userTags = max(wholeCache[tagKey == "accessed"]$tagValue))

# inverse join with 2 data.tables ... using: a[!b]
# i.e., return all of wholeCache that was not recently accessed
#   Note: the two different ways to access -- old way with "artifact" will be deprecated
toRemove <- unique(wholeCache[!onlyRecentlyAccessed, on = "cacheId"], by = "cacheId")$cacheId
clearCache(tmpDir, toRemove, ask = FALSE) # remove ones not recently accessed
showCache(tmpDir) # still has more recently accessed

Example 3: using keepCache

keepCache does the same as previous example, but more simply.

ranNumsA <- Cache(rnorm, 4, cachePath = tmpDir, userTags = "objectName:a")
ranNumsB <- Cache(runif(4), cachePath = tmpDir, userTags = "objectName:b")

# keep only those cached items from the last 24 hours
oneDay <- 60 * 60 * 24
keepCache(tmpDir, after = Sys.time() - oneDay, ask = FALSE)

# Keep all Cache items created with an rnorm() call
keepCache(tmpDir, userTags = "rnorm", ask = FALSE)
showCache(tmpDir)

# Remove all Cache items that happened within a rnorm() call
clearCache(tmpDir, userTags = "rnorm", ask = FALSE)
showCache(tmpDir) ## empty

# Also, can set a time before caching happens and remove based on this
#  --> a useful, simple way to control Cache
ranNumsA <- Cache(rnorm, 4, cachePath = tmpDir, userTags = "objectName:a")
startTime <- Sys.time()
Sys.sleep(1)
ranNumsB <- Cache(rnorm, 5, cachePath = tmpDir, userTags = "objectName:b")
keepCache(tmpDir, after = startTime, ask = FALSE) # keep only those newer than startTime

clearCache(tmpDir, ask = FALSE)

Example 4: searching for multiple objects in the cache

# default userTags is "and" matching; for "or" matching use |
ranNumsA <- Cache(runif, 4, cachePath = tmpDir, userTags = "objectName:a")
ranNumsB <- Cache(rnorm, 4, cachePath = tmpDir, userTags = "objectName:b")

# show all objects (runif and rnorm in this case)
showCache(tmpDir)

# show objects that are both runif and rnorm
# (i.e., none in this case, because objecs are either or, not both)
showCache(tmpDir, userTags = c("runif", "rnorm")) ## empty

# show objects that are either runif or rnorm ("or" search)
showCache(tmpDir, userTags = "runif|rnorm")

# keep only objects that are either runif or rnorm ("or" search)
keepCache(tmpDir, userTags = "runif|rnorm", ask = FALSE)

clearCache(tmpDir, ask = FALSE)

Example 5: using caching to speed up rerunning expensive computations

ras <- terra::rast(terra::ext(0, 5, 0, 5),
  res = 1,
  vals = sample(1:5, replace = TRUE, size = 25),
  crs = "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +ellps=WGS84"
)

rasCRS <- terra::crs(ras)
# A slow operation, like GIS operation
notCached <- suppressWarnings(
  # project raster generates warnings when run non-interactively
  terra::project(ras, rasCRS, res = 5)
)

cached <- suppressWarnings(
  # project raster generates warnings when run non-interactively
  # using quote works also
  Cache(terra::project, ras, rasCRS, res = 5, cachePath = tmpDir)
)

# second time is much faster
reRun <- suppressWarnings(
  # project raster generates warnings when run non-interactively
  Cache(terra::project, ras, rasCRS, res = 5, cachePath = tmpDir)
)

# recovered cached version is same as non-cached version
all.equal(notCached, reRun, check.attributes = FALSE) ## TRUE

Nested Caching

Nested caching, which is when Caching of a function occurs inside an outer function, which is itself cached. This is a critical element to working within a reproducible work flow. It is not enough during development to cache flat code chunks, as there will be many levels of "slow" functions. Ideally, at all points in a development cycle, it should be possible to get to any line of code starting from the very initial steps, running through everything up to that point, in less than a few seconds. If the workflow can be kept very fast like this, then there is a guarantee that it will work at any point.

##########################
## Nested Caching
# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm, n = 3, mean = mean)
}
outer <- function(n) {
  Cache(inner, 0.1, cachePath = tmpdir2)
}

# make 2 different cache paths
tmpdir1 <- file.path(tempdir(), "first")
tmpdir2 <- file.path(tempdir(), "second")

# Run the Cache ... notOlderThan propagates to all 3 Cache calls,
#   but cachePath is tmpdir1 in top level Cache and all nested
#   Cache calls, unless individually overridden ... here inner
#   uses tmpdir2 repository
Cache(outer, n = 2, cachePath = tmpdir1, notOlderThan = Sys.time())

showCache(tmpdir1) # 2 function calls
showCache(tmpdir2) # 1 function call

# userTags get appended
# all items have the outer tag propagate, plus inner ones only have inner ones
clearCache(tmpdir1, ask = FALSE)
outerTag <- "outerTag"
innerTag <- "innerTag"
inner <- function(mean) {
  d <- 1
  Cache(rnorm, n = 3, mean = mean, notOlderThan = Sys.time() - 1e5, userTags = innerTag)
}
outer <- function(n) {
  Cache(inner, 0.1)
}
aa <- Cache(outer, n = 2, cachePath = tmpdir1, userTags = outerTag)
showCache(tmpdir1) # rnorm function has outerTag and innerTag, inner and outer only have outerTag

cacheId

Sometimes, it is not absolutely desirable to maintain the work flow intact because changes that are irrelevant to the analysis, such as changing messages sent to a user, may be changed, without a desire to rerun functions. The cacheId argument is for this. Once a piece of code is run, then the cacheId can be manually extracted (it is reported at the end of a Cache call) and manually placed in the code, passed in as, say, cacheId = "ad184ce64541972b50afd8e7b75f821b".

### cacheId
set.seed(1)
Cache(rnorm, 1, cachePath = tmpdir1)
# manually look at output attribute which shows cacheId: 7072c305d8c69df0
Cache(rnorm, 1, cachePath = tmpdir1, cacheId = "422bae4ed2f770cc") # same value
# override even with different inputs:
Cache(rnorm, 2, cachePath = tmpdir1, cacheId = "422bae4ed2f770cc")

Working with the Cache manually

Since the cache is simply a DBI data table (of an SQLite database by default). In addition, there are several helpers in the reproducible package, including showCache, keepCache and clearCache that may be useful. Also, one can access cached items manually (rather than simply rerunning the same Cache function again).

# As of reproducible version 1.0, there is a new backend directly using DBI
mapHash <- unique(showCache(tmpDir, userTags = "project")$cacheId)
map <- loadFromCache(mapHash[1], cachePath = tmpDir)
terra::plot(map)
## cleanup
unlink(dirname(tmpDir), recursive = TRUE)

Alternative database backends

By default, caching relies on a sqlite database for it's backend. While this works in many situations, there are some important limitations of using sqlite for caching, including 1) speed; 2) concurrent transactions; 3) sharing database across machines or projects. Fortunately, Cache makes use of DBI package and thus supports several database backends, including mysql and postgresql.

See https://github.com/PredictiveEcology/SpaDES/wiki/Using-alternate-database-backends-for-Cache for further information on configuring these additional backends.

Reproducible workflow

In general, we feel that a liberal use of Cache will make a re-usable and reproducible work flow. shiny apps can be made, taking advantage of Cache. Indeed, much of the difficulty in managing data sets and saving them for future use, can be accommodated by caching.



Try the reproducible package in your browser

Any scripts or data that you put into this service are public.

reproducible documentation built on Nov. 22, 2023, 9:06 a.m.