fun.club: workflow manager

fun.club    R Documentation

Description

This is a workflow manager that controls the generation of R objects, their caching in memory and their storage on disk. It automatically tracks object dependencies, so that if an object is invalidated, e.g. by modifying its generating function, it is deleted together with all objects that depend on it. Later, when referenced, it is automatically regenerated, always with the most recent generating functions. All of this happens behind the scenes, and the interface is the same for the user whether or not an object currently exists; see Examples.

Details

One can have many fun.clubs open at the same time if they all point to different physical directories.

Functions are considered equivalent if deparse() turns them into the same character string. In particular, code outside the functions is not checked: e.g. if a function object calls another function that is not in fun.club, and that function changes, the objects will not be deleted.
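The deparse()-based comparison can be illustrated in base R alone. The following is a minimal sketch, not fun.club's actual implementation; the names same.function, g1, g2, g3, helper and caller are made up for the illustration.

```r
## Hypothetical helper: two functions are "the same" if they deparse
## into identical character vectors.
same.function <- function(f, g) identical(deparse(f), deparse(g))

g1 <- function(x) 2*x
g2 <- function(x)   2 * x      # differs from g1 only in whitespace
g3 <- function(x) 2 * x + 1    # different code

same.function(g1, g2)          # TRUE
same.function(g1, g3)          # FALSE

## Code outside the function is not part of its deparsed text:
helper <- function(x) x + 1
caller <- function(x) helper(x)
before <- deparse(caller)
helper <- function(x) x + 100  # helper changes ...
identical(before, deparse(caller))  # TRUE: ... but caller looks unchanged
```

The last comparison shows why a change in a function outside fun.club goes unnoticed: the text of the calling function stays the same.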

The package does not impose any limitation on function object names: any R name can be used (note, however, that all variable names in R are limited to 10000 bytes, and were limited to 256 bytes in versions of R before 2.13.0, see ?name). Any arguments can be used: named, positional and .... Equivalent argument combinations such as a=1, 2, c=3 and c=3, 1, 2 for a function function(a=1, b=2, ..., c=3) are recognized, and a new object is generated only for new arguments.
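Why the two argument combinations above are equivalent can be checked with base R's match.call(). This is only a sketch of the idea, under the assumption that arguments are compared after standard R argument matching; normalise.args is a hypothetical helper and not part of fun.club.

```r
## Hypothetical helper: bring the arguments of a call into the canonical
## form produced by R's own argument matching.
normalise.args <- function(fun, ...) {
  cl <- as.call(c(list(as.name("fun")), list(...)))
  as.list(match.call(definition = fun, call = cl))[-1]
}

f <- function(a = 1, b = 2, ..., c = 3) NULL

args.1 <- normalise.args(f, a = 1, 2, c = 3)
args.2 <- normalise.args(f, c = 3, 1, 2)

identical(args.1, args.2)   # TRUE: both are a = 1, b = 2, c = 3
```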

Advanced: There are two special function arguments, output.env and file.ext, described below. They relate to how the objects are stored in memory and on disk, respectively. By default, storage is fully automatic and hidden from the user. These two arguments, however, alter the default algorithms.

Advanced: The output.env argument can be useful for storing big objects. For example, consider

fun.club[typical.use] <- function(n) 1:n

Then, e.g., the call typical.use[100000] generates a "big object" which is returned by the function and then copied to its final destination by the library. To avoid copying, the object can be placed directly into its final place or, more precisely, into its final environment. This environment acts as a directory holding R objects and is referred to by output.env. For example,

fun.club[advanced.use] <- function(n, output.env) {
  output.env[[ 'advanced.use' ]] <- 1:n
}

By writing output.env[[ 'advanced.use' ]] <- 1:n the user stores the "big object" directly, so no extra copying is needed. output.env must appear as an argument of the function, as in function(n, output.env), but it should not

  1. have a default value, nor

  2. be set by the caller, e.g. as in advanced.use[100000, output.env = new.env()].

Then, behind the scenes, the library assigns a newly created environment to output.env, so that the expression output.env[[ 'advanced.use' ]] <- 1:n in the function body becomes valid.

In the output.env environment the object is always stored under the name of the function object, i.e. advanced.use in our case.

If output.env appears as a function argument, the library assumes that it is the responsibility of the user to store the object and does not try to do that itself.
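The mechanism relies on the fact that R environments are modified in place. The following base R sketch imitates it outside fun.club: the environment store is created by hand here and merely stands in for the one the library would supply, and advanced.use.body is a hypothetical name.

```r
## The same body as in advanced.use above, written as a plain R function.
advanced.use.body <- function(n, output.env) {
  output.env[[ 'advanced.use' ]] <- 1:n
  invisible(NULL)               # nothing big is returned to the caller
}

store <- new.env()              # assumption: stand-in for the library's environment
advanced.use.body(10, store)

ls(store)                       # "advanced.use"
store[[ 'advanced.use' ]]       # 1 2 3 4 5 6 7 8 9 10
```

Because the environment is passed by reference, the object written inside the function is visible to the caller without being returned.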

Advanced: The way files are stored on disk is determined by the extension.selector and savers arguments of the make.fun.club function. Depending on the R object to be saved, the former decides which file name extension to choose, while the latter holds the storage function for each extension. This works well for saving any R object. Sometimes, however, one needs to store files external to R. E.g. one may want to download remote files to the local disk and then process them in R. The download itself may be performed in R, but the files with the "raw" data may not correspond to any R object. Such external data cannot be saved by the default method. It is still advantageous, however, to keep the download algorithms and the downloaded files under the control of the fun.club library: the files are then automatically deleted if the algorithms change and, on the other hand, only the necessary files are stored, without duplication.

Since the automatic fun.club algorithms do not know how to save such "raw" data, this task is passed to the user, who can perform it using the file.ext argument. When the function is called, file.ext should hold the desired file name extensions. Internally, before the function is executed, this argument is expanded to the full absolute file names with the corresponding extensions. file.ext, now holding the file names, can be used in the function body (but the user should not modify it). The files are saved in the same internal directories where fun.club stores other objects.

The syntax is explained in the following example:

fun.club[ write.external.files ] <-
  function(x, file.ext = c("txt", "txt.gz"))
{
  writeLines(as.character(x), con = file.ext[1])
  system(paste("gzip -c", file.ext[1], ">", file.ext[2]))
  file.ext
}

Then, write.external.files[1:10] stores the numbers 1:10 in the files <name>.txt and (in gzip'ed form) <name>.txt.gz controlled by fun.club. The exact <name> is unique and is chosen by fun.club internal algorithms.

Since the function above returns file.ext, the return value is the vector (<name>.txt, <name>.txt.gz). Calling write.external.files[1:10] with the same arguments always returns these file names without regenerating the files.

By using the file.ext argument, the user informs the library that <name>.txt and <name>.txt.gz depend on their generating function and should be deleted if the latter (or any function object it might contain) changes.

In the example above file.ext was given as a default argument, but it can also be overridden by the caller, e.g.

write.external.files[1:10, file.ext = c('raw', 'raw.gz')]

Since the argument combination is different here, this will generate a new object.

If several function objects are defined using one function at once, file.ext should be given as a list of character vectors, one per function object:

fun.club[ writer.1, writer.2 ] <-
  function(x, file.ext = list(c("txt", "txt.gz"), "gz"))
{
  writeLines(as.character(x), con = file.ext[[ 1 ]][ 1 ])
  system(paste("gzip -c ", file.ext[[ 1 ]][ 1 ], ">",
               file.ext[[ 1 ]][ 2 ]))
  writeLines(as.character(2*x), con = file.ext[[ 2 ]][ 1 ])
  file.ext
}

In this case file.ext is expanded to the corresponding list of file names, with one element per function object. If there is only one function object, as in the first example, file.ext may alternatively be given as a list with a single element, e.g. list(c("txt", "txt.gz")). It is then expanded to list(c("<name>.txt", "<name>.txt.gz")) instead of a character vector.
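The shape of this expansion (character vector in, character vector out; list in, list out) can be imitated in base R. This is a sketch only: expand.file.ext, the directory "dir" and the base names "w1" and "w2" are hypothetical, and the real names and directories are chosen by fun.club internally.

```r
## Hypothetical helper: turn extensions into file names, keeping the
## vector or list structure of file.ext.
expand.file.ext <- function(file.ext, dir, base.names) {
  expand.one <- function(ext, base) file.path(dir, paste0(base, ".", ext))
  if (is.list(file.ext)) {
    unname(Map(expand.one, file.ext, base.names))  # one element per function object
  } else {
    expand.one(file.ext, base.names[1])            # single function object
  }
}

one <- expand.file.ext(c("txt", "txt.gz"), "dir", "w1")
two <- expand.file.ext(list(c("txt", "txt.gz"), "gz"), "dir", c("w1", "w2"))

one   # "dir/w1.txt" "dir/w1.txt.gz"
str(two)
```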

Author(s)

Vladislav BALAGURA balagura@cern.ch

Examples

## create `fun.club`: a factory to generate `fun.objects`, i.e. special
## functions equipped with the capabilities to track and to cache all
## generated objects.
##
fc <- make.fun.club(dir = 'my_fun_club_directory')
##
## create the first "function object" `f1`
##
fc[f1] = function(x) x
##
## which can generate other objects as
##
f1[100]
##
## all such generated objects are cached and their dependencies are
## automatically tracked:
##
fc[f1] = function(x) 2*x
##
## f1[100] is automatically deleted and can be regenerated on demand:
##
f1[100]
##
## More complicated function with variable number of arguments in `...`
##
fc[f2] = function(y=1, ...) f1[y] * sum(unlist(list(...)))
f2[10, 1, 2, 3]
##
## The functions without arguments are also allowed. The functions can
## return arbitrary R objects (eg. other functions):
##
fc[f3] = function() { function(n) { rnorm(n) } }
##
## The function can return several objects placed in a `list`: `f4` below
## will return `f1[a+b]`, `f5` - `f2[a,b]` and `f6` - `f3[]`. This is
## useful if eg. the calculation gives two `data.frames` as a result, but
## they should be stored separately. This can be desirable eg. if the
## sizes of two objects are significantly different: there will be no need
## to keep in memory or reread from a file the big object to access the
## small one.
##
fc[f4, f5, f6] = function(a, b) list(f1[a+b], f2[a,b], f3[])
f4[1,2]
##
## Calling `f4` automatically generates `f5` and `f6`.
## `f4` and `f5` can be used as separate functions:
##
fc[f7] = function(a, b) f4[a,b] + f5[a,b]
##
## The request to generate `f7` object triggers the generation of all other
## objects it depends on
##
f7[1,2]
##
## since `f7[1,2]` depends on `f1` (eg. through `f5` and `f2`), changing `f1`
## deletes it together with all other dependent objects:
##
fc['f1'] = function(x) x^2
##
## regardless of whether the objects were generated or not, syntactically
## they are always referred to in the same way, so the user might operate
## with them as if they were always available:
##
## (`f6[3,4]` is the function returned by `f3[]`, so it is called here)
f7[1,2] + f6[3,4](1)


balagura/fun.club documentation built on June 11, 2025, 11:27 p.m.