TODO: include use case
knitr::opts_chunk$set( eval = FALSE )
Goal is to release on CRAN in early June 2018. Include the following features:
These should work with Unix fork (priority) and SNOW parallel setups, which means properly detecting and synchronizing state.
Not sure what kind of for loops I need to do. I'll try an experiment to see how the shared memory works out.
For later:
I want to export as few functions as possible, probably just 1 or 2.
This package is meant to simplify parallel programming in R by automating common tasks.
My goal in creating this package was to produce something that I find personally useful.
As data sizes and processor counts increase, parallelism becomes more important.
Parallel programming can be challenging, because it requires further levels of expertise. The core of R is a functional language, and the functional paradigm is well suited to parallel programming.
R functions typically don't have side effects. They don't modify their arguments; instead they produce new objects. This is what makes R functional and what allows us to do parallel computing.
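A minimal sketch of why the absence of side effects matters, using only base R and the parallel package that ships with R. The function `scale_up` and the inputs are made up for illustration and are not part of this package.

```r
library(parallel)

# Illustrative function: it changes its local copy of v, never the caller's x.
scale_up <- function(v) {
  v[1] <- 0
  v * 2
}

x <- c(1, 2, 3)
y <- scale_up(x)   # x still holds 1, 2, 3 after this call

# Because each call depends only on its own input, the calls can run in any
# order or on different workers and still give the same answer.
inputs <- list(1:3, 4:6, 7:9)
serial <- lapply(inputs, sum)

# Forking is unavailable on Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "unix") 2L else 1L
forked <- mclapply(inputs, sum, mc.cores = cores)

identical(serial, forked)
```

If the function wrote to shared state instead of returning a new object, the serial and forked results could differ, which is the kind of case that needs detecting.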
SNOW and the parallel package. The parallel package, which incorporates much of SNOW, is now included with R as a base package.
Böhringer's dynamic parallelization.
Bengtsson's futures.
Most users should interact with this software through the functions described in this subsection. Our goal is to make this easy to use by providing only a few functions that are extensible.
Suppose you have a script my_script.R in R's current working directory. If you just want a quick transformation of your code into a parallel form, then you can do the following:
library(autoparallel)
autoparallel("my_script.R")
This generates a parallel version of my_script.R.
For more control, the user can split up these steps. The first step is to create a task graph. This infers the dependency structure of the expressions and identifies known apply-type or vectorized functions.
code = parse(text = " ")
g = taskgraph(code)
plot(g)
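To make the idea of inferring dependencies concrete, here is a rough sketch in base R. It is not how taskgraph() is implemented; the helpers `assigned` and `used` and the four-line script are hypothetical, and the approach only handles simple top-level assignments.

```r
# Stand-in script; the statements are illustrative only.
code <- parse(text = "
x <- 1:10
y <- x * 2
z <- rnorm(5)
w <- sum(y) + sum(z)
")

# Name of the variable assigned by a top-level `<-` or `=`, otherwise NA.
assigned <- function(e) {
  is_assign <- is.call(e) && is.name(e[[1]]) &&
    as.character(e[[1]]) %in% c("<-", "=")
  if (is_assign && is.name(e[[2]])) as.character(e[[2]]) else NA_character_
}

# Variables read by a statement: the right-hand side of an assignment,
# or the whole statement if it is not an assignment.
used <- function(e) {
  if (!is.na(assigned(e))) all.vars(e[[3]]) else all.vars(e)
}

defs <- vapply(code, assigned, character(1))

# One edge per use, pointing from the most recent earlier definition.
edges <- do.call(rbind, lapply(seq_along(code), function(i) {
  from <- vapply(used(code[[i]]), function(v) {
    prior <- which(defs[seq_len(i - 1)] == v)
    if (length(prior)) max(prior) else NA_integer_
  }, integer(1))
  from <- from[!is.na(from)]
  if (length(from)) data.frame(from = unname(from), to = i) else NULL
}))

edges
```

In this toy script statement 3 has no incoming edges from statements 1 or 2, so it could run at the same time as they do.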
TODO: bring picture over from doc.
We can annotate the task graph with additional information to improve the scheduling. Knowing the object sizes and the time it takes for each statement to execute allows more efficient static scheduling.
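As a rough illustration of how timing estimates can feed a static schedule, the sketch below assigns tasks to workers greedily, longest task first. This is a generic textbook heuristic, not necessarily the scheduler this package uses. The task names and times are invented, and the sketch ignores dependencies between tasks and the cost of moving objects between workers.

```r
# Hypothetical per-task run times in seconds.
times <- c(a = 4, b = 3, c = 3, d = 2, e = 1)
nworkers <- 2

load <- numeric(nworkers)              # total time assigned to each worker
assignment <- integer(length(times))   # worker index chosen for each task
names(assignment) <- names(times)

# Visit tasks from longest to shortest, always giving the next task
# to the worker with the least work so far.
for (task in names(sort(times, decreasing = TRUE))) {
  w <- which.min(load)
  assignment[[task]] <- w
  load[w] <- load[w] + times[[task]]
}

assignment
load
```

Without the timings, the only options would be arbitrary assignments such as round robin, which can leave one worker idle while another finishes a long task.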
If you need to run this same script many times then it may be easier to run it once in serial and gather all of these object sizes and timings automatically from that run:
g = taskgraph("my_script.R", timing_run = TRUE)
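The kind of information a timing run collects can be sketched with base R alone. This is only an illustration of the idea, not the behavior of the timing_run argument; the two-statement script is a stand-in.

```r
# Stand-in for the contents of a script; the statements are illustrative only.
code <- parse(text = "
x <- rnorm(1e5)
y <- cumsum(x)
")

# Evaluate in a fresh environment so the measurements see only script objects.
env <- new.env()

# Run each statement in order, recording elapsed wall-clock seconds.
seconds <- vapply(seq_along(code), function(i) {
  system.time(eval(code[[i]], envir = env))[["elapsed"]]
}, numeric(1))

# Size in bytes of every object the script created.
bytes <- vapply(ls(env), function(v) {
  as.numeric(object.size(get(v, envir = env)))
}, numeric(1))

timings <- data.frame(statement = seq_along(code), seconds = seconds)
timings
bytes
```

Measurements from a single run are noisy, so very short statements may report zero elapsed time.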
It's also fine to supply only partial information.
For example, suppose line 2 of my_script.R is x = read.csv("x.csv"). Suppose we know that this line results in an object x of size 32768 bytes and it takes 0.00354 seconds to run. We can express that as follows:
g = annotate(g, line = 2, size = list(x = 32768), time = 0.00354)
FEEDBACK: I may implement this with attributes. This type of user API gives me the freedom to pick the implementation as I like.
makeParallel() takes user code and figures out an intelligent way to make it parallel by inferring the dependency structure of the expressions and the other patterns described in this document. It produces executable R code that is now parallel.
pcode = makeParallel("my_script.R"
    , clean_first = FALSE
    , run_now = FALSE
    , cluster_type = "FORK"
    , nnodes = 4
    )

# visual representation of the graph structure
plot(pcode)

# Save the parallel version of the script
save_code(pcode, "pmy_script.R")

# Run the whole thing interactively
run_code(pcode)
TODO: I'm not satisfied with the extensibility of these high-level functions. To tell if it's worth it to parallelize we would really like to know the object.size() of as many variables as we can, as well as the times to execute each statement. So we need a way for users to provide that information if they have it. I can think of a few hacky ways, see the bottom of this document below Scratch:
Things that we haven't yet implemented, but we plan to.