R as a language is perfect for remote code execution. Functions are discoverable objects, the session can be queried for dependencies, and the language itself comes with a variety of tools to compute on the language. An indirect proof of that is the variety of packages that bring remote execution to R, for example: foreach, opencpu, Rserve, or SparkR with its dapply().
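To see what that means in practice: the body of a function can be inspected like any other object, and the codetools package (shipped with every R installation) can list the names a function uses but does not define itself.

f <- function (x) x * mean(x)

# the body of a function is data that can be inspected and transformed
body(f)

# names f refers to but does not define locally
codetools::findGlobals(f, merge = TRUE)   # "*" "mean"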
All these packages need to solve a shared challenge: they need to handle user-provided code and prepare it for execution in a remote R session. Each of them has its own way of doing that and, to the best of my knowledge, there is no mechanism that could be shared regardless of the context of each particular package.
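The crux of that challenge can be seen with plain serialization alone; helper() and f() below are hypothetical stand-ins for user code. A closure defined in the global environment is serialized without that environment, so its dependencies are silently left behind.

helper <- function (x) x + 1
f <- function (x) helper(x) * 2

path <- tempfile(fileext = ".rds")
saveRDS(f, path)

# in a fresh R session, the deserialized function no longer finds its
# dependency and fails with: could not find function "helper"
# g <- readRDS(path)
# g(1)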
This package, defer, is intended to close that gap and propose a more systematic approach to preparing a "deferred execution package". I hope it will be useful in a variety of scenarios where a user-defined function needs to be run in a separate, and possibly remote, R session.
Here is the shortest possible example of what defer aims at: a user-provided function is wrapped in a deferred function and then run by means of that wrapper.
library(defer)

fun <- function (x) x * x
deferred <- defer(fun)
print(deferred)
deferred(10)
rm(fun, deferred)
Because defer
can do much more than just wrap a single function, we
will now take a look at a longer example.
# verify: compare thresholded model predictions with the true labels
verify <- function (model, test_data) {
  test_data$predicted <- predict(model, test_data) > .5
  with(test_data, predicted == is_setosa)
}

# model: fit a linear model predicting the is_setosa indicator
model <- function (train_data) {
  lm(is_setosa ~ petal_area + sepal_area, data = train_data)
}

# etl: normalize column names, derive area and species indicator columns
etl <- function (data) {
  names(data) <- tolower(names(data))
  names(data) <- gsub("\\.", "_", names(data))
  data$sepal_area <- with(data, sepal_width * sepal_length)
  data$petal_area <- with(data, petal_width * petal_length)
  data$is_setosa <- data$species == "setosa"
  data$is_virginica <- data$species == "virginica"
  data$is_versicolor <- data$species == "versicolor"
  data$species <- NULL
  data
}
Let's say we have a simple modelling pipeline that consists of four functions: etl(), model(), verify() and glue(). Below is the glue() function; etl(), model() and verify() are defined above.
First, glue() transforms the input data set via etl(). Then the new data set is split into training and testing subsets, and model() builds a new predictive model using the training data. Finally, verify() checks the quality of that model and returns a vector of TRUE/FALSE (success/failure) responses, one for each row in the test data set. The single value returned by glue() is the ratio of examples identified correctly.
glue <- function (data, test_size) {
  data <- etl(data)
  test <- sample.int(nrow(data), test_size)
  train <- setdiff(seq(nrow(data)), test)
  m <- model(data[train, ])
  mean(verify(m, data[test, ]))
}
Let's run this simple example, first locally:
glue(iris, 50)
Now we can package our simple pipeline and prepare it for remote execution. defer() will automatically identify the dependencies of glue() and include etl(), model() and verify() in the final package.
library(defer)
d <- defer(glue)
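To build an intuition for what this discovery involves, below is a rough sketch of recursive dependency resolution based on codetools. It is only an illustration of the general idea, not defer's actual implementation, and find_deps() is a made-up name.

# walk a function's global references and collect user-defined functions;
# illustration only, not how defer really does it
find_deps <- function (fun, found = character()) {
  for (name in setdiff(codetools::findGlobals(fun, merge = TRUE), found)) {
    if (!exists(name, envir = environment(fun))) next
    value <- get(name, envir = environment(fun))
    # descend only into user-defined functions, skip base and package ones
    if (is.function(value) && identical(environment(value), globalenv())) {
      found <- find_deps(value, c(found, name))
    }
  }
  found
}

find_deps(glue)   # "etl" "model" "verify"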
Here is how we can "simulate" remote execution. First, we serialize the deferred function d() and clean the environment; then we deserialize d() and run it on sample data. At that point all functions need to be a part of d(), because they are no longer present in the R session (that is, in the global environment).
# serialize
storage_path <- tempfile(fileext = ".rds")
saveRDS(d, storage_path)

# removing these functions "simulates" a new R session
rm(d, glue, etl, verify, model)
ls()

# deserialize and run
d <- readRDS(storage_path)
d(iris, 50)
Ta-da!
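The simulation above stays within a single session. To run the deferred function in a genuinely separate R process, something like the callr package can be used; the sketch below assumes callr is available and defer is installed on the "remote" side, since the deserialized function may need its namespace.

# execute the deferred function in a child R process;
# assumes the callr and defer packages are installed
callr::r(function (path) {
  d <- readRDS(path)
  d(datasets::iris, 50)
}, args = list(path = storage_path))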
For the sake of completeness: the actual code that implements our sample data-processing pipeline, that is etl(), model() and verify(), is listed at the beginning of this vignette.