R as a language is perfect for remote code execution. Functions are discoverable objects, the session can be queried for dependencies, and the language itself comes with a variety of tools to compute on the language. An indirect proof of that is the variety of packages that bring remote execution to R, for example: foreach, opencpu, Rserve, or SparkR with its dapply().
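To see what that means in practice: the body of a function can be inspected like any other object, and the codetools package (shipped with every R installation) can list the names a function uses but does not define itself.

f <- function (x) x * mean(x)

# the body of a function is data that can be inspected and transformed
body(f)

# names f refers to but does not define locally
codetools::findGlobals(f, merge = TRUE)   # "*" "mean"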
All these packages need to solve a shared challenge: they need to handle user-provided code and prepare it for execution in a remote R session. Each of them has its own way of doing that and, to the best of my knowledge, there is no mechanism that could be shared regardless of the context of each particular package.
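The crux of that challenge can be seen with plain serialization alone; helper() and f() below are hypothetical stand-ins for user code. A closure defined in the global environment is serialized without that environment, so its dependencies are silently left behind.

helper <- function (x) x + 1
f <- function (x) helper(x) * 2

path <- tempfile(fileext = ".rds")
saveRDS(f, path)

# in a fresh R session, the deserialized function no longer finds its
# dependency and fails with: could not find function "helper"
# g <- readRDS(path)
# g(1)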
This package, defer, is intended to close that gap and propose a more systematic approach to preparing a "deferred execution package". I hope it will be useful in a variety of scenarios where a user-defined function needs to be run in a separate, and possibly remote, R session.
Here is the shortest possible example of what defer aims at: a user-provided function is wrapped in a deferred function and then run by means of that wrapper.
library(defer)

fun <- function (x) x * x
deferred <- defer(fun)
print(deferred)
deferred(10)
rm(fun, deferred)
Because defer
can do much more than just wrap a single function, we
will now take a look at a longer example.
# verify: compare thresholded model predictions with the true labels
verify <- function (model, test_data) {
  test_data$predicted <- predict(model, test_data) > .5
  with(test_data, predicted == is_setosa)
}

# model: fit a linear model predicting the is_setosa indicator
model <- function (train_data) {
  lm(is_setosa ~ petal_area + sepal_area, data = train_data)
}

# etl: normalize column names, derive area and species indicator columns
etl <- function (data) {
  names(data) <- tolower(names(data))
  names(data) <- gsub("\\.", "_", names(data))
  data$sepal_area <- with(data, sepal_width * sepal_length)
  data$petal_area <- with(data, petal_width * petal_length)
  data$is_setosa <- data$species == "setosa"
  data$is_virginica <- data$species == "virginica"
  data$is_versicolor <- data$species == "versicolor"
  data$species <- NULL
  data
}
Let's say we have a simple modelling pipeline that consists of four functions: etl(), model(), verify() and glue(). Below is the glue() function; etl(), model() and verify() are defined above.
First, glue() transforms the input data set via etl(). Then the new data set is split into training and testing subsets, and model() builds a new predictive model using the training data. Finally, verify() checks the quality of that model and returns a vector of TRUE/FALSE (success/failure) responses, one for each row in the test data set. The single value returned by glue() is the ratio of examples identified correctly.
glue <- function (data, test_size) {
  data <- etl(data)
  test <- sample.int(nrow(data), test_size)
  train <- setdiff(seq(nrow(data)), test)
  m <- model(data[train, ])
  mean(verify(m, data[test, ]))
}
Let's run this simple example, first locally:
glue(iris, 50)
Now we can package our simple pipeline and prepare it for remote execution. defer() will automatically identify the dependencies of glue() and include etl(), model() and verify() in the final package.
library(defer)
d <- defer(glue)
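To build an intuition for what this discovery involves, below is a rough sketch of recursive dependency resolution based on codetools. It is only an illustration of the general idea, not defer's actual implementation, and find_deps() is a made-up name.

# walk a function's global references and collect user-defined functions;
# illustration only, not how defer really does it
find_deps <- function (fun, found = character()) {
  for (name in setdiff(codetools::findGlobals(fun, merge = TRUE), found)) {
    if (!exists(name, envir = environment(fun))) next
    value <- get(name, envir = environment(fun))
    # descend only into user-defined functions, skip base and package ones
    if (is.function(value) && identical(environment(value), globalenv())) {
      found <- find_deps(value, c(found, name))
    }
  }
  found
}

find_deps(glue)   # "etl" "model" "verify"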
Here is how we can "simulate" remote execution. First, we serialize the deferred function d() and clean the environment; then we deserialize d() and run it on sample data. At that point all functions need to be a part of d(), because they are no longer present in the R session (that is, in the global environment).
# serialize
storage_path <- tempfile(fileext = ".rds")
saveRDS(d, storage_path)

# removing these functions "simulates" a new R session
rm(d, glue, etl, verify, model)
ls()

# deserialize and run
d <- readRDS(storage_path)
d(iris, 50)
Ta-da!
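The simulation above stays within a single session. To run the deferred function in a genuinely separate R process, something like the callr package can be used; the sketch below assumes callr is available and defer is installed on the "remote" side, since the deserialized function may need its namespace.

# execute the deferred function in a child R process;
# assumes the callr and defer packages are installed
callr::r(function (path) {
  d <- readRDS(path)
  d(datasets::iris, 50)
}, args = list(path = storage_path))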
For the sake of completeness: the actual code that implements our sample data-processing pipeline, that is etl(), model() and verify(), is listed at the beginning of this vignette.