Data caching in madrat
In madrat: May All Data be Reproducible and Transparent (MADRaT) *

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

A central feature of the madrat framework is its ability to load data from cache rather than recompute it when the calculation have run already before. Here, we explain what user should know about the caching to avoid unwanted behavior.

Basics

By default every read- or calc-function creates a cache file from its computations and stores it in the cachefolder. Where this folder is located can be checked via

library(madrat, quietly = TRUE)
getConfig("cachefolder", verbose = FALSE)

When running data processing via retrieveData it currently offers two types of cache folders: cachetype = "def" will use a shared cachefolder in which all processes write their cache files by default. RetrieveData will check in this folder for fitting cache files and read them if available. Whether they are fitting or not will depend on their fingerprint which is explained further down. With cachetype = "rev" retrieveData will create a new, revision-specific cachefolder and set setConfig(forcecache = TRUE) (default is FALSE). Via this approach calculations will start with new cache files at all but created cache files will be read if a calculation is repeated. The forcecache option will in this case make sure that any available cache file which fits the function call is read in, independent of whether the content of the cache file might be outdated or not.

Fingerprint

In order to estimate whether a calculation should be rerun or whether the data can be read from cache madrat creates fingerprints for each function. If the fingerprint of the current function call agrees with the fingerprint of the corresponding cache file the cache is assumed up-to-date and read in. If they disagree, the cache file is assumed to be potentially outdated and ignored (except for forcecache = TRUE in which case it would be read in anyways).

The fingerprint is created by looking at the dependency graph of a function which can be retrieved via getDependencies:

getDependencies("calcTauTotal", packages = "madrat")

The dependency graph lists all calls of calc, read and tool functions a function depends on (not only calls in the function itself, but also calls in the functions which have been called in order to run the function). The fingerprinting function creates hashes of all these functions and all source folders involved in this process and combines them to one single hash which is the fingerprint of that specific function:

setConfig(verbosity = 3)
fp <- madrat:::fingerprint("calcTauTotal", details = TRUE, packages = "madrat")

As a hash has the characteristic to change when its input changes, an unchanged hash means that also the respective function or source folder did not change. Hence, an identical fingerprint means that the involved functions and source data did not change. So, if the fingerprint of the cache file agrees with the fingerprint calculated for the calculation it is quite likely that the data contained in the cache file also agrees with the output of the calculation one would run it again.

The reason why it is only quite likely but not certain is that not all parts of the calculation are covered: The dependency graph only considers madrat-style functions, e.g. functions not starting with download, read, correct, convert, calc, or tool will not be considered. In many cases this should be ok, considering that external functions used in the calculation will likely keep their behavior over time, but there might be instances in which this assumption is violated (e.g. if parts of the calculation are outsourced in a function not following these conventions).

On the other hand, the dependency graph might also include dependencies which only exist on the paper, as it does only scan for calls of the corresponding functions in the code, but cannot interpret which calls are actually be computed for a given calculation, e.g. there could be if clauses in a calc-function selecting different source data types. The dependency graph will show a dependency to all sources even if only one of these sources might be used at the end.

Customize fingerprinting

To make sure that the fingerprint is appropriately reflecting the current status of a calculation there are a few possibilities to steer its behavior:

Use madrat-style functions for all calculation that should get monitored by the fingerprinting algorithm (e.g. if part of the calculation is outsourced, call this new function tool.. to have it monitored.)
Adjust the fingerprinting via control flags for all other cases.

Control flags

Control flags can be used to manually include or exclude functions in the fingerprinting. Control flags are comments in the functions which are put in quotes and start with !#. They can look like:

"!# @monitor madrat:::sysdata magclass:::ncells"
  "!# @ignore  madrat:::toolAggregate"

Each line contains a control flag starting with the flag name (here monitor or ignore) and afterwards with the arguments of this control flag. The monitor flag specifies calls which should get monitored in addition to the ones anyway monitored (in the example the sysdata object in madrat and the ncells functions of the magclass package are additionally being monitored). The ignore flag specifies which calls should not be monitored even so getDepenendencies says otherwise.

While the ignore statement has to be mentioned explicitly for each function, the monitor statement will be passed on automatically to all subsequent functions (e.g. if a read function has a monitor statement all calc-functions used that read function will also monitor the additional calls of that statement, but in the same example the ignore statement would only be used for the read function itself).

In particular the ignore statement has to be handled with care as a wrong information here might lead to outdated cache files being read in. So, only use it if really necessary and if you know exactly what you are doing.

Examples

setConfig(globalenv = TRUE)
  readData <- function() return(1)
  readData2 <- function() return(2)
  calcExample <- function() {
    a <- readSource("Data")
    return(a)
  }
  calcExample2 <- function() {
    a <- readSource("Data")
    if (FALSE) a <- readSource("Data2")
    return(a)
  }

In this example are two source data sets and two calculation functions. calcExample only depends on readData while calcExample2 depends on both data sources.

fp <- madrat:::fingerprint("calcExample", details = TRUE, packages = "madrat")
  fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat")

Looking at the fingerprints this is reflected in the hash components of each fingerprint (please NOTE that the source folders are not hashed in this example as they do not exist yet. If they exist they would show up here as well as hash components). One can see, that the hash for readData is the same in both fingerprints but as the other hashes differ also the resulting fingerprint for both calculations is different.

readData <- function() return(99)
fp <- madrat:::fingerprint("calcExample", details = TRUE, packages = "madrat")
fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat")

Changing the readData function changes the hash of this function and thereby also the fingerprints of both calc functions even so the hash of the calc functions itself did not change.

readData2 <- function() {
    "!# @monitor madrat:::toolAggregate"
    return(99)
  }
fp <- madrat:::fingerprint("calcExample", details = TRUE, packages = "madrat")
fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat")

Adding a monitor control flag in readData also add this hash component to all subsequent fingerprint calculations.

calcExample2 <- function() {
    "!# @ignore readData2"
    a <- readSource("Data")
    if (FALSE) a <- readSource("Data2")
    return(a)
  }

  calcExample3 <- function() {
    a <- calcOutput("Example2")
    return(a)
  }

fp2 <- madrat:::fingerprint("calcExample2", details = TRUE, packages = "madrat")
fp3 <- madrat:::fingerprint("calcExample3", details = TRUE, packages = "madrat")

The ignore flag in calcExample2 excludes readData2 from the fingerprint calculation. But in contrast to the monitor statement this information is not forwarded to calcExample3. Hence, the latter does not only monitor madrat:::toolAggregate but also readData2!

forcecache

Before the introduction of fingerprinting forcing the use of cache files was the default approach. However, in the new setup the argument forcecache = TRUE should only be used under very specific circumstances, as it does not guarantee that the data agrees with the code of the corresponding package. In particular production runs should always use forcecache = FALSE.

A scenario in which forcecache = TRUE might still make sense are development cases in which up-to-date inputs are not required for proper function development. In these cases development can be speed up by using potentially outdated cache files as a starting point to avoid lengthy calculations of parts irrelevant for the current development stage.

If you are unsure what to use, always go with forcecache = FALSE.