knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
```{R, eval = FALSE}
library(knitr) opts_knit$set(eval=FALSE)
Skimming through what is here, I realize that what I really need is a more systematic form of algebra rather than all these heuristics. ```{R} library(autoparallel) autoparallel("my_script.R")
This generates a parallel version of my_script.R
.
Internally it takes the following steps:
my_script.R
This document is not intended for users. It contains the following:
Estimate more precisely the costs of serialization. Is it much cheaper
to do it directly vs using mccollect()
?
Write a function to check if code is IO bound.
Are there cheaper ways to share read only copies of objects between
processes in R? There's bigmemory
, but what about something a little more
generalized like the plasma object store?
Data analysis workflow where at first you're not sure what you want.
Idea: You could even work totally in the interpreter, and provide a
function such as save_last_statement()
. This function would look through all
previous correct statements executed through the interpreter and use them
to create the minimal script necessary to reproduce that last statement.
The idea is that as you explore a data set you probably enter many more
statemenets than you actually need, because you're not sure at first what
you're after.
Idea: Can we load only columns of a data.frame
with R's load
/ save
commands?
Idea: remove_nse()
function to replace nonstandard evaluation with
equivalant standard evaluation call. Transforms subset(mtcars, select = mpg)
into
mtcars[, "mpg", drop = FALSE]
Idea: The idea of intrinsically useful statments keeps coming back. For
example, one may have head(dframe)
written somewhere in the code. This
implies that all columns of dframe
are needed. But it's likely the case
that one never actually needed that call to head(dframe)
; it was just
being inspected interactively. Then it can be removed from the final
product. So this would be another useful preprocessing step. As it stands,
let's assume that this preprocessing has happened, so that all statements
are strictly necessary.
Idea: Transform code written with %>%
into regular R saving intermediate
variables so that it can be debugged.
Is it possible to tell a priori if code uses the RNG? If we knew it didn't I'd feel better about evaluating it on the workers in parallel.
Idea: Detecting the use of global variables inside functions and modifying them. For example:
x = 10 f = function(y) y + x
This should become
x = 10 f = function(y, .x = x) y + .x
Idea: Computing the same things when they can't change. Here's a real example:
y[f(x), 1] = y[f(x), 2]
This should become:
tmp = f(x) y[tmp, 1] = y[tmp, 2] rm(tmp)
More generally with the above two examples I can think of all kinds of ways
that R code could be improved if it was written quickly or by a novice. We
could do two things: identify things that might be mistakes, similar to
pyflakes
, and offer to programmatically rewrite the code.
Idea: For the functions that parallelize things based on timings we could have a "timeout" where if it takes over a certain threshold time to run the whole script then we just stop and parallelize what we can from there.
Idea: Often we can tell statically which functions are being used, ie. lm
is actually stats::lm
. A preprocessing step could identify all of these
and distinguish them. We have to be careful, since using ::
takes 2
orders of magnitude more time than a simple lookup of an object in the base
package. More generally I'm curious to know when it's safe to evaluate
simple literal code such as c(1, 3)
.
I may want to make some of the following assumptions, depending on convenience.
Input in a canonical form. Have to say precisely what this means. For example:
<-
is used for assignment rather than =
or assign
for
loops are unavoidable, ie. they do something like
iterative updates. Otherwise they should be lapply
type calls.This can probably be done through preprocessing.
Transpile here refers to translating and changing code to make it more efficient. Consider this simple script:
```{R, eval = FALSE}
d = read.csv("data.csv")
hist(d[, 1])
This script only uses the first column of `d`, which means that all the other columns were unnecessary. If `data.csv` is sufficiently large then this program will spend an excessive amount of time reading data that are never used. In general we'd like to do all the preprocessing as early as possible, saving only what we need. ## Design If external packages such as `data.table` are being used, can we preserve these calls? This requires understanding the semantics of the library functions to use. So we'd have to do it custom for every package that we depend on. Unless these packages provide drop in replacement for the target statements. More generally we could keep the calls that read everything in and then remove what we don't need immediately after. Call this the "general approach" This helps with saving memory, but probably won't help with performance otherwise. In other words, we might see a difference if we're up against memory limits. __Possible Use Cases__ - Reducing memory footprint - Increasing speed Or should I just pick one? Only for reading data? And should I just depend on and tie myself to data.table? Ideally I don't make this specific to data.table. Suppose I take the general approach. In which cases will this help with memory pressure? The goal is to avoid making copies of a large data frame. I need to experiment to see exactly when this happens. The experiment shows that the unused columns in a data frame will not be copied. For a matrix they should be, but in my experience people are more likely to use a subset of the columns of a data frame versus a matrix. In the general case we read in the unused columns and then never copy them. The only way this can alleviate memory pressure is if the program meets the following conditions: 1. Has enough memory to contain the whole object in the first place 2. Subsequently requires more memory than is available This seems rare to me. ## Assumptions Assume that column selections can be inferred by static code analysis. For example, we can statically analyze the following: ```{R, eval = FALSE} # Literals mtcars[, 1L] mtcars[, "mpg"] mtcars$mpg # Variables defined from literals. This use case requires # constant propagation cols = c("cyl", "disp") mtcars[, cols]
While we cannot statically analyze these: ```{R, eval = FALSE}
col = read.csv("something_external.csv")[1, 1] mtcars[, col]
mtcars[, min(rpois(1, 1) + 1, ncol(mtcars))]
So we assume that column selection is a literal after constant propagation. ## Design For Minimal Use Case I'm going to start off getting something common and simple to work, and then generalize from there. The example from the beginning is as simple as possible. ```{R, eval = FALSE} d = read.csv("data.csv") hist(d[, 2])
This should be transformed to the following:
```{R, eval = FALSE}
d = data.table::fread("data.csv", select = 2)
hist(d[, 1])
__Basic Steps__ Necessary for the minimal use case to work 1. Infer that a data frame `d` is created by a call to `read.csv()` 2. Identify all calls which subset `d` and transform them into a common form. 4. Find `usedcolumns` the set of all columns which are used 5. Transform the `read.csv(...)` call into `data.table::fread(..., select = usedcolumns)` 6. Transform the calls which subset `d` into new indices. __More Advanced__ Functionality going beyond the minimal use case. I'll think about these once the basics are working. Account for indirect use of variables. The following should infer that the 4th column is used. ```{R, eval = FALSE} d = read.csv("data.csv") d2 = d * 2 d2[, 4]
Verify the indices subsetting d
can be computed at the time of
static analysis.
Check if any commands imply that the whole data set must be loaded. For example:
```{R, eval = FALSE} d = read.csv("data.csv") d[, 5] = 2 write.csv(d, "data2.csv") # Uses all columns of d
Read subsets without corresponding variable assignments, for example: ```{R, eval = FALSE} hist(read.csv("data.csv")[, 2])
More complicated forms of assignment
```{R, eval = FALSE} a = b = read.csv("data.csv")
## Implementation How can we tell which columns are used? Fri Sep 22 10:31:23 PDT 2017 For the moment I'm ignoring NSE such as `subset`. Thinking now that it's fine to depend on `data.table`. `data.table::fread` has a `select` parameter for column names. It would be more convenient for our purposes here if `select` took an integer vector of the column indices instead. Indices are more general because: - Not every text file has column names - Not every data frame has meaningful column names - Column names may not be unique One approach is to take all the uses of column names and map them into integers. The code will go through three representations then: __Original__ including mixed names and integers: ```{R, eval = FALSE} mpg = mtcars$mpg cyl = mtcars[, 2] disp = mtcars[, "disp"] wt = mtcars[[5]]
Name Replacement substitutes the names with integers, and converts all
data.frame
subsetting commands into single [
. Assume that we know the
column names.
```{R, eval = FALSE}
mpg = mtcars[, 1] cyl = mtcars[, 2] disp = mtcars[, 3] wt = mtcars[, 5]
As we replace names we can update the set of variables which are used, so that after processing all statements we know which are used. __Subset mapping__ maps the original indices to corresponding indices in the smaller `data.frame`. The index map is a sorted integer vector of the columns that are used. This step cannot happen with the previous because it's necessary to first know all the columns which will be used. ```{R, eval = FALSE} index_map = c(1, 2, 3, 5) # Suppose fread supports integers here .mtcars = fread(..., select = index_map) mpg = .mtcars[, 1] cyl = .mtcars[, 2] disp = .mtcars[, 3] wt = .mtcars[, 4] # This one changes
Nested subsetting
Suppose that x
is the data frame of interest. Consider the following
reasonable code:
x[x[, "d"] > 10, "b"]
Replacing names gives us the following in standard form:
x[x[, 4] > 10, 2]
Because there is nested subsetting we need to respect the structure of the parse tree to correctly substitute these variables with indices.
TODO: What is the issue in my mind? I don't want this to happen:
# First step updates the inner x[x[, 4] > 10, "b"] # Second step updates the outer based on the original statement x[x[, "d"] > 10, 2]
This leaves us with the task of having to merge the parse trees. We definitely want to avoid this. So we need to update the tree in place, incrementally. In the more general case it may happen that the locations of the parse tree change as it is modified. Then we'll need a way to guarantee that nothing is overwritten. Maybe applying the changes depth first?
Algorithm For Code Modification
The goal is to automate this to the greatest extent possible. Here's the approach:
First preprocess the script. - Single static assignment should be used so that variables aren't written over. - Unnecessary statements should be removed.
Find every instance of the calls to be replaced, ie. read.csv()
and
read.table()
. These calls should return data frames.
For each call (referred to below as readcall
) to be replaced, do the following:
var
.var
. nrows = 1
into the call so
that only 1 row is read. Attempt to evaluate the code and determine
the column names. var
in the code following the assignment. var
into a common form: var[, i]
where
i
is an integer vector. If the column names remain unknown but are
used to subset then raise an error.index_map
, a vector representing the set of all columns
of the data which were usedreadcall
into something of the form data.table::fread(...,
select = index_map)
index_map
to map the code of the form var[, i]
into var[, i*]
representing a subset of the data frame.What are the limitations of the approach that I've just outlined?
It's really only designed for data frames. So it would be a little
dangerous if I think something is a data frame, when in fact it's a list.
Then if I replace [[
and $
with [
it won't work. I can get around
this by focusing on functions that return data frames, for example
read.csv()
.
I haven't yet considered subsetting rows, there may be a way to do that efficiently. A common way is to subset based on the value of some column. I could do this by keeping on open file pointer, reading a chunk of the data, subset it, add that subset to a list, then rbind the subsets together. This potentially lets R quickly process files larger than memory.
How to get every column which is used when new copies of the data frame are created? For example:
```{R, eval = FALSE}
mtcars2 = mtcars[possible_subset, ]
gear
column must be read in.mtcars2$gear
Stepping back, R has many ways to write programs. To simplify tasks here we first put the code into a canonical form, and then do "surgery" on it. # SNOW forking is easier than SNOW clusters, since forked workers all have a consistent, current state. Here's an R script:
code = parse(text = ' library(MASS) start = 0 f = function(end) area(sin, start, end) d = as.list(1:5) lapply(d, f) ')
autoparallel:::snow_fork(code)
It's not sufficient to substitute `parallel::parLapply` in for `lapply`, because the workers will not load the `MASS` library or have the `start` variable. The following works: ```{R, eval = FALSE} library(parallel) cl = makeCluster(2L) code = parse(text = " library(MASS) start = 0 f = function(end) area(sin, start, end) ") eval(code) clusterCall(cl, eval, code, .GlobalEnv) d = as.list(1:5) out = parLapply(cl, d, f) # Check that it's the same: all.equal(out, lapply(d, f))
This suggests the approach of evaluating all code before we reach the lapply
statement. We need several assumptions about the code:
1. It's necessary to run
2. It will do the same thing on each worker (not the case for random things)
3. It won't take so long to run that we lose the subsequent benefit of
parallelism.
We can probably check the first two conditions. It's not easy to know how long it will take the whole thing to run though.
info = lapply(code, CodeDepends::getInputs)
read.csv
func is called, which isread.csv
iscode2 = quote(x <- rnorm(n = read.csv("data.csv")))
CodeDepends::getInputs(code2)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.