scheduleDataParallel: Schedule Based On Data Parallelism


View source: R/scheduleDataParallel.R

Description

If you're running a series of computations over a large data set, start with this scheduler. It fuses as many chunkable expressions as it can into large blocks that run in parallel. The initial data chunks and intermediate objects stay on the workers and do not return to the manager, so you can think of it as "chunk fusion".
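The idea of chunk fusion can be illustrated in base R without the package. The sketch below is only an illustration under stated assumptions: `lapply` stands in for parallel workers, and `x`, `chunks`, and `fused_block` are hypothetical names, not part of this package's API.

```r
# Hypothetical data: one vector split into 4 chunks, one per simulated worker.
x <- as.numeric(1:20)
chunks <- split(x, rep(1:4, each = 5))

# Unfused: each chunkable step hands its intermediate result back
# to the manager before the next step starts.
step1 <- lapply(chunks, sqrt)
step2 <- lapply(step1, function(v) v + 1)
unfused <- unlist(step2, use.names = FALSE)

# Fused: both chunkable steps run as one block per chunk, so the
# intermediate object `y` never leaves the (simulated) worker.
fused_block <- function(chunk) {
  y <- sqrt(chunk)
  y + 1
}
fused <- unlist(lapply(chunks, fused_block), use.names = FALSE)

# Both strategies give the same answer as the serial computation.
stopifnot(identical(fused, unfused), identical(fused, sqrt(x) + 1))
```

The fused version produces the same values while moving only the final per-chunk results, which is the saving the scheduler aims for.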

Usage

scheduleDataParallel(graph, data, platform = Platform(),
  nWorkers = platform@nWorkers, chunkFuncs = character(),
  reduceFuncs = list(), knownReduceFuncs = getKnownReduceFuncs(),
  knownChunkFuncs = getKnownChunkFuncs(),
  allChunkFuncs = c(knownChunkFuncs, chunkFuncs))

Arguments

graph

TaskGraph, code dependency graph

data

list of data descriptions. Each element is a DataSource. The names of the list elements correspond to the variables in the code that these objects are bound to.

platform

Platform describing resource to compute on

nWorkers

number of workers to schedule across; defaults to platform@nWorkers.

chunkFuncs

character, names of additional chunkable functions known to the user.

reduceFuncs

list of ReduceFun objects, these can override the knownReduceFuncs.

knownReduceFuncs

list of known ReduceFun objects

knownChunkFuncs

character, the names of chunkable functions from recommended and base packages.

allChunkFuncs

character, names of all chunkable functions to use in the analysis.

Details

The scheduler statically balances the load of the data chunks among workers, assuming that loading and processing times are linear in the size of the data.
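One common way to balance load statically under a linear cost assumption is a greedy "largest chunk first" assignment. The sketch below shows that general technique; it is not necessarily the algorithm this function uses. `balanceChunks` and the chunk sizes are hypothetical.

```r
# Hypothetical helper: visit chunks from largest to smallest and give
# each one to the worker with the smallest load so far.
balanceChunks <- function(sizes, nWorkers) {
  load <- numeric(nWorkers)
  assignment <- integer(length(sizes))
  for (i in order(sizes, decreasing = TRUE)) {
    w <- which.min(load)          # first least-loaded worker
    assignment[i] <- w
    load[w] <- load[w] + sizes[i]
  }
  list(assignment = assignment, load = load)
}

sizes <- c(120, 80, 75, 60, 40, 25)   # made-up chunk sizes, e.g. in MB
plan <- balanceChunks(sizes, nWorkers = 2)
plan$assignment   # 1 2 2 1 2 1
plan$load         # 205 195
```

Because cost is assumed proportional to size, balancing total size per worker is the same as balancing estimated run time, and the plan can be fixed before any code runs.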

TODO:

  1. Populate chunkableFuncs based on code analysis.

  2. Identify which parameters a function is chunkable in, and respect these by matching arguments. See update_resource.Call.

  3. Clarify behavior of subexpressions, handling cases such as min(sin(large_object)).
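The nested case in item 3 combines a chunkable call (`sin`) with a reducible one (`min`). A minimal base R sketch of how such an expression could be evaluated chunk by chunk follows. It is an assumption about the intended behavior, not a description of the current implementation, and `vapply` stands in for the workers.

```r
# Stand-in for a large vector, split into 4 contiguous chunks.
large_object <- seq(0, 10, length.out = 1000)
chunks <- split(large_object,
                cut(seq_along(large_object), 4, labels = FALSE))

# On each (simulated) worker: apply the chunkable sin(), then reduce
# the chunk locally with min() so only one number is returned.
partial <- vapply(chunks, function(chunk) min(sin(chunk)), numeric(1))

# On the manager: combine the per-chunk minima.
result <- min(partial)

# Matches the serial evaluation of the nested expression.
stopifnot(identical(result, min(sin(large_object))))
```

This works because `min` can be combined across chunks with `min` itself; a reduction such as `mean` would need a separate summary and combine step.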

See Also

makeParallel, schedule


clarkfitzg/codedoctor documentation built on Nov. 18, 2020, 4:34 p.m.