See the convenience wrappers `do()` and `do_()`.
Example usage:
```r
do(mean, x=1:10)  # is an equivalent of: .mean <- cachedCall(mean, x=1:10)
do_(mean, x=1:10) # is an equivalent of: .mean <- cachedCallConcur(mean, x=1:10)
```
So, the demo example below could be simplified:
```r
# Original code:
{
  hundred <- 1:100
  F1 <- cachedCall(f1, vec=hundred, val=3) # <-- explicitly named returned value F1
  F2 <- cachedCall(f2, F1)                 # <-- explicitly named returned value F2
  Res3 <- extractVal(cachedCall(f3, val1=F2, val2=50))
}

# Simplified code:
{
  hundred <- 1:100
  do(f1, vec=hundred, val=3) # <-- implicit returned value .f1
  do(f2, .f1)                # <-- implicit returned value .f2
  Res3 <- extractVal(cachedCall(f3, val1=.f2, val2=50))
}
```
```r
devtools::install_github('alekrutkowski/cacheflow')
```
Often an R script is re-run with only some of the parts modified.
With `cacheflow`, the function changes and the argument changes are all
automatically detected, and only the necessary re-evaluations are done.
If there is no change in the function definition or in its arguments,
it does not make sense to load and pass the cached value -- it is
sufficient to pass downstream only the information that the value is the
same, without extracting the value itself. This kind of automatic
lazy re-evaluation is
particularly useful if there are long, chained, and complicated workflows.
In such workflows, it is cumbersome and risky to track manually which
functions/inputs/arguments have changed and which parts of
the script should be re-evaluated. It is easier to trigger a re-run
of the whole script and let the computer do the comparison with the
cached results of the previous runs, avoiding the unnecessary and
costly re-evaluations.
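The change detection described above can be sketched with hashes. `digest` is among cacheflow's dependencies, but `call_hash` below is a hypothetical helper for illustration only, not the package's actual code:

```r
library(digest)  # one of cacheflow's dependencies, used for hashing

# Hash a function definition together with its arguments:
call_hash <- function(f, args) digest(list(deparse(f), args))

f <- function(x) x + 1
h1 <- call_hash(f, list(x = 1:10))
h2 <- call_hash(f, list(x = 1:10))
identical(h1, h2)  # TRUE  -> nothing changed, re-evaluation can be skipped
h3 <- call_hash(f, list(x = 1:11))
identical(h1, h3)  # FALSE -> an argument changed, re-evaluate
```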
Automatic caching => no need for manual selection and re-runs of code chunks => saving human time and machine time, and lowering the risk of errors.
But `cacheflow` is simpler -- pure R script/code, no need for external non-R
files such as YAML, and no cognitive switching cost. With its functional
syntax, R seems to be a much better workflow description language.
Your R code/script is your workflow!
Assuming that your workflow consists of many functions, most of which are
pure (deterministic), it makes sense to re-evaluate a function only if
its definition or its inputs have changed.
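To see why purity matters for caching, here is a hypothetical illustration (not taken from the package): a pure function's result depends only on its arguments, so a cached value stays valid; an impure function's does not.

```r
# Pure: the result depends only on the arguments --
# a cached value can be reused safely.
scale_vec <- function(vec, by) vec * by

# Impure: the result also depends on hidden state (the clock) --
# a cached value could silently go stale.
stamp_vec <- function(vec) list(data = vec, at = Sys.time())
```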
`cacheflow` is simple -- it caches the necessary information on disk,
in the working directory.
`cacheflow` caches single return values in separate `.Rds`
files, thus reducing the risk. It also works when
[parallel](https://cran.r-project.org/web/packages/parallel/index.html)
is used (see the demo below), as long as the concurrent R instances can access the same
working directory (which implies running them on one computer or using a shared
network drive with the same paths mapped if running on multiple computers).

Functions:
- `initCache`, `removeCache`, `removeOldCache`, `keepCacheFor`
- the "workhorses" `cachedCall` and `cachedCallConcur`, and `extractVal`
- convenience wrappers for `cachedCall` and `cachedCallConcur`: `do` and `do_`
- `withGraph`; if [parallel](https://cran.r-project.org/web/packages/parallel/index.html) is used inside `withGraph`: `makeGraphAware`
Dependencies:

- [digest](https://cran.r-project.org/web/packages/digest/index.html)::[digest](http://www.rdocumentation.org/packages/digest/functions/digest) -- efficiently hashes R objects and files
- [DiagrammeR](https://cran.r-project.org/web/packages/DiagrammeR/index.html)::[grViz](http://www.rdocumentation.org/packages/DiagrammeR/functions/grViz) -- plots diagrams from GraphViz dot code
- [memoise](https://cran.r-project.org/web/packages/memoise/index.html)::[memoise](http://www.rdocumentation.org/packages/memoise/functions/memoise) -- for the in-session memoisation to avoid re-loads of `.Rds` files, when cached values need to be re-extracted from the saved `.Rds` files at some point. Not strictly needed but further increasing efficiency (at the cost of more RAM usage) if the `.Rds` files are large.
- [codetools](https://cran.r-project.org/web/packages/codetools/index.html)::[findGlobals](http://www.rdocumentation.org/packages/codetools/functions/findGlobals) -- used for passing the values to concurrent R instances (in `cachedCallConcur`).

```r
# Always remember to set your working directory
# in the beginning of your workflow!
setwd('//ci1homes11/homes095/rutkoal/R files/cacheflow-gh')
```
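For intuition, the hash-then-save-to-`.Rds` mechanism described earlier can be sketched in a few lines of plain R (`cached_call_sketch` is a hypothetical helper, not cacheflow's actual implementation):

```r
# A minimal sketch of disk-based call caching
# (hypothetical helper, NOT cacheflow's actual code):
cached_call_sketch <- function(f, ...) {
  key <- digest::digest(list(deparse(f), list(...)))
  path <- file.path('.cache', paste0(key, '.Rds'))
  if (file.exists(path))
    return(readRDS(path))          # unchanged call: reuse the saved value
  val <- f(...)
  dir.create('.cache', showWarnings = FALSE)
  saveRDS(val, path)               # each return value in its own .Rds file
  val
}
```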
```r
library(magrittr)  # for the pipe operator %>%
library(cacheflow)

# Create the necessary subdirectories in your working directory (only once)
initCache()

# Let's pretend we have 3 complicated pure functions
# each consuming some time when re-evaluated:
f1 <- function(vec, val) {
  Sys.sleep(1)
  vec + val
}
f2 <- function(vec) {
  Sys.sleep(1)
  mean(vec)
}
f3 <- function(val1, val2) {
  Sys.sleep(1)
  val1/val2
}

system.time(
  Res1 <- 1:100 %>%
    cachedCall(f1, vec=., val=3) %>%
    cachedCall(f2, .) %>%
    cachedCall(f3, val1=., val2=50) %>%
    extractVal
)
system.time(
  Res2 <- 1:100 %>%
    cachedCall(f1, vec=., val=3) %>%
    cachedCall(f2, .) %>%
    cachedCall(f3, val1=., val2=50) %>%
    extractVal
)

# The same workflow but without the pipe operator and not timed
{
  hundred <- 1:100
  F1 <- cachedCall(f1, vec=hundred, val=3)
  F2 <- cachedCall(f2, F1)
  Res3 <- extractVal(cachedCall(f3, val1=F2, val2=50))
}

Res1 == Res2
Res2 == Res3

# Just that function (f3) is re-evaluated due to a change in
# the value of one of the args i.e. val2 (if there were further
# steps beyond f3, they would be also re-evaluated):
system.time(
  1:100 %>%
    cachedCall(f1, vec=., val=3) %>%
    cachedCall(f2, .) %>%
    cachedCall(f3, val1=., val2=100) %>%
    extractVal
)

# Of course, a modification of a function also triggers re-evaluation
# of the modified and the subsequent (dependent) step(s):
f2 <- function(vec) {
  Sys.sleep(1)
  mean(vec)/3
}
system.time(
  1:100 %>%
    cachedCall(f1, vec=., val=3) %>%
    cachedCall(f2, .) %>%
    cachedCall(f3, val1=., val2=100) %>%
    extractVal
)

# Paths to files need to be wrapped in File()
# when used as arguments inside cachedCall
# (so that possible changes in the contents
# of the files are assessed instead of the
# changes in the paths):
tmpf <- tempfile()
cat(letters, file=tmpf, sep='\n')
f4 <- function(filepath) {
  Sys.sleep(1)
  readLines(filepath)
}
system.time(ResA <- cachedCall(f4, File(tmpf)) %>% extractVal)
tmpf2 <- tempfile()
file.copy(tmpf, tmpf2)
system.time(ResB <- cachedCall(f4, File(tmpf2)) %>% extractVal)
identical(ResA, ResB)

# Re-evaluated when the file modified:
cat(c(letters,1:10), file=tmpf, sep='\n')
system.time(cachedCall(f4, File(tmpf)) %>% extractVal)
```
```r
# Let's pretend we have 3 complicated pure functions
# each consuming some time when re-evaluated:
z1 <- function(v1, v2) {
  Sys.sleep(5)
  v1 + v2
}
z2 <- function(vec) {
  Sys.sleep(5)
  mean(vec)
}
z3 <- function(val1, val2) {
  val1/val2
}

# With `cachedCallConcur` we can evaluate `z1` and `z2`
# concurrently:
system.time({
  zz1 <- cachedCallConcur(z1, 1, 2)
  zz2 <- cachedCallConcur(z2, 1:10)
  # No concurrency here, because this is the final value
  # so we need to wait for the results anyway:
  zz3 <- cachedCall(z3, zz1, zz2)
})
# The waiting time is ca. 5s (plus the time needed for
# saving inputs for Rscript and opening Rscript)
# instead of 5s + 5s.
```
```r
do(mean, x=1:10)
# is an equivalent of:
.mean <- cachedCall(mean, x=1:10)

do_(sum, 1:10)
# is an equivalent of `cachedCallConcur` below:
Sys.sleep(1) # just to make sure the concurrent call is completed
.sum <- cachedCallConcur(sum, 1:10)

# then use .mean or .sum as input arguments in the subsequent
# cached calls (simple or concurrent), e.g.:
do(max, .mean, .sum)
# which is an equivalent of:
.max <- cachedCall(max, .mean, .sum)
```
```r
# Let's touch the first function to trigger re-evaluations
f1 <- sum
withGraph(
  1:100 %>%
    cachedCall(f1, vec=., val=3) %>%
    cachedCall(f2, .) %>%
    cachedCall(f3, val1=., val2=50)
) %>%
  plot
```
```r
# Now the same but with named values and no pipes
ResY <- withGraph({
  hundred <- 1:100
  F1 <- cachedCall(f1, vec=hundred, val=3)
  F2 <- cachedCall(f2, F1)
  ResX <- cachedCall(f3, val1=F2, val2=50)
})
ResY
summary(ResY)
extractVal(ResY)
```
```r
# Compare with the previous version -- no reds because
# there were no re-evaluations
plot(ResY)
```
```r
# Using `cacheflow` together with the package `parallel`
# Here's a contrived trivial example for simplicity
library(parallel)
cl <- detectCores() %>% makeCluster
eval(bquote(clusterEvalQ(cl, .libPaths(.(.libPaths()))))) # needed in my private setting
clusterEvalQ(cl, library(cacheflow))
pRes <- withGraph({
  makeGraphAware(cl) # this is needed!
  pairs <- data.frame(a=1:4, b=101:104) %>%
    split(row.names(.))
  # each R instance to be fed with a pair of a and b
  # e.g. the first R instance gets a=1 and b=101
  # the second one gets a=2 and b=102, etc.
  P <- parLapply(cl, pairs, function(x)
    cachedCall(`+`, x$a, x$b))
  B <- cachedCall(`-`, 30, 12)
  Z <- c(P, list(B))
  do.call(cachedCall, c(sum, Z))
})
stopCluster(cl)
pRes
summary(pRes)
plot(pRes)
```
```r
# The same parallel example but with
# a trick: using `bquote` to see in the diagram
# the actual values passed to `cachedCall`
# inside the anonymous function
library(parallel)
cl <- detectCores() %>% makeCluster
eval(bquote(clusterEvalQ(cl, .libPaths(.(.libPaths()))))) # needed in my private setting
clusterEvalQ(cl, library(cacheflow))
pRes <- withGraph({
  makeGraphAware(cl) # this is needed!
  pairs <- data.frame(a=1:4, b=101:104) %>%
    split(row.names(.))
  P <- parLapply(cl, pairs, function(x)
    # see the difference here:
    eval(bquote(cachedCall(`+`, .(x$a), .(x$b)))))
  B <- cachedCall(`-`, 30, 12)
  Z <- c(P, list(B))
  do.call(cachedCall, c(sum, Z))
})
stopCluster(cl)
pRes
summary(pRes)
# The L next to the numbers means
# a standard (long) integer in R, see
# help(NumericConstants)
plot(pRes)
```