scheduleDataParallel: Schedule Based On Data Parallelism
In clarkfitzg/codedoctor: Transform Serial R Code into Parallel R Code


If you are running a series of computations over a large data set, start with this scheduler. It fuses as many chunkable expressions as it can into large blocks that run in parallel. The initial data chunks and intermediate objects stay on the workers and do not return to the manager, so you can think of it as "chunk fusion".
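The idea of chunk fusion can be illustrated in plain base R, without this package. The sketch below is only an illustration of the concept, not the code this scheduler generates: the chunked data and the helper `fused_block` are hypothetical, and `lapply` stands in for a set of parallel workers.

```r
# Hypothetical data, already split into three chunks (one per worker).
x <- 1:12
chunks <- split(x, rep(1:3, each = 4))

# Unfused: each chunkable step is a separate pass over the chunks, so the
# intermediate result y would travel back to the manager between steps.
y_chunks <- lapply(chunks, function(ch) ch * 2)
z_unfused <- lapply(y_chunks, function(ch) sqrt(ch + 1))

# Fused: both chunkable steps run in one block per chunk, so the
# intermediate y never leaves the worker.
fused_block <- function(ch) {
  y <- ch * 2
  sqrt(y + 1)
}
z_fused <- lapply(chunks, fused_block)

# Both approaches agree with the serial computation.
z_serial <- sqrt(x * 2 + 1)
stopifnot(isTRUE(all.equal(unlist(z_fused, use.names = FALSE), z_serial)))
stopifnot(isTRUE(all.equal(unlist(z_unfused, use.names = FALSE), z_serial)))
```

The fused version makes one pass per chunk, which is why keeping intermediate objects on the workers saves data movement.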

scheduleDataParallel(graph, data, platform = Platform(),
  nWorkers = platform@nWorkers, chunkFuncs = character(),
  reduceFuncs = list(), knownReduceFuncs = getKnownReduceFuncs(),
  knownChunkFuncs = getKnownChunkFuncs(),
  allChunkFuncs = c(knownChunkFuncs, chunkFuncs))

`graph`	TaskGraph, code dependency graph
`data`	list of data descriptions. Each element is a DataSource. The names of the list elements correspond to the variables in the code that these objects are bound to.
`platform`	Platform describing resource to compute on
`nWorkers`	number of parallel workers to schedule for, defaults to the number of workers in `platform`.
`chunkFuncs`	character, names of additional chunkable functions known to the user.
`reduceFuncs`	list of ReduceFun objects, these can override the knownReduceFuncs.
`knownReduceFuncs`	list of known ReduceFun objects
`knownChunkFuncs`	character, the names of chunkable functions from recommended and base packages.
`allChunkFuncs`	character, names of all chunkable functions to use in the analysis.

The scheduler statically balances the load of the data chunks among the workers, assuming that loading and processing times are linear in the size of the data.
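Under the linear-time assumption, balancing the load reduces to spreading the total chunk size as evenly as possible across workers. The following is a minimal sketch of one common heuristic (largest chunk first, assigned to the least-loaded worker). It is an assumption for illustration and not necessarily the algorithm this function uses; `balance_chunks` and the chunk sizes are hypothetical.

```r
# Assign chunks to workers so that the total size per worker is roughly even.
# Greedy heuristic: visit chunks from largest to smallest and give each one
# to the worker with the smallest load so far.
balance_chunks <- function(sizes, nWorkers) {
  assignment <- integer(length(sizes))
  load <- numeric(nWorkers)
  for (i in order(sizes, decreasing = TRUE)) {
    w <- which.min(load)  # least-loaded worker, ties go to the lowest index
    assignment[i] <- w
    load[w] <- load[w] + sizes[i]
  }
  list(assignment = assignment, load = load)
}

# Hypothetical chunk sizes, for example file sizes in megabytes.
sizes <- c(100, 40, 60, 30, 70)
result <- balance_chunks(sizes, nWorkers = 2)
result$assignment  # worker index for each chunk
result$load        # total size assigned to each worker
```

Because the assignment depends only on the chunk sizes, it can be computed once before any work starts, which is what makes the balancing static.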

TODO:

Populate chunkableFuncs based on code analysis.
Identify which parameters a function is chunkable in, and respect these by matching arguments. See update_resource.Call.
Clarify behavior of subexpressions, handling cases such as min(sin(large_object))
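For the last item, one plausible reading of `min(sin(large_object))` is that `sin` is chunkable and `min` is a reduction, so each worker can compute a partial minimum and the manager combines them. The sketch below shows that reading in base R. It is an assumption, since the intended behavior is still an open item here, and the chunking shown is hypothetical.

```r
# Hypothetical large object, split into four chunks as a stand-in for data
# distributed across workers.
large_object <- seq(0, 10, by = 0.01)
chunks <- split(large_object, cut(seq_along(large_object), 4, labels = FALSE))

# On each worker: apply the chunkable function, then a partial reduce.
partial_mins <- vapply(chunks, function(ch) min(sin(ch)), numeric(1))

# On the manager: combine the per-chunk summaries.
result <- min(partial_mins)

# The chunked computation agrees with the serial one.
stopifnot(isTRUE(all.equal(result, min(sin(large_object)))))
```

Only one number per chunk returns to the manager, rather than the full `sin(large_object)` vector.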

makeParallel, schedule

clarkfitzg/codedoctor documentation built on Nov. 18, 2020, 4:34 p.m.