View source: R/execute_parallel.R
ex_data_table_parallel: Execute an rquery pipeline with data.table in parallel

Execute an rquery pipeline with data.table in parallel, partitioned by a given column.
Note: usually the overhead of partitioning and distributing the work will far outweigh any parallel speedup. Also, data.table itself already seems to exploit some thread-level parallelism (one often sees user time > elapsed time). Requires the parallel package. For a worked example with significant speedup please see https://github.com/WinVector/rqdatatable/blob/master/extras/Parallel_rqdatatable.md.
ex_data_table_parallel(
  optree,
  partition_column,
  cl = NULL,
  ...,
  tables = list(),
  source_limit = NULL,
  debug = FALSE,
  env = parent.frame()
)
optree: relop operations tree.

partition_column: character, name of the column to partition work by.

cl: a cluster object, created by package parallel or by package snow. If NULL, use the registered default cluster.

...: not used, forces later arguments to bind by name.

tables: named list mapping table names used in nodes to data.tables and data.frames.

source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).

debug: logical; if TRUE use lapply instead of parallel::clusterApplyLB.

env: environment to look for values in.
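A minimal usage sketch (the example data, column names, and cluster setup are assumptions for illustration, not taken from the package documentation):

library(wrapr)       # provides the dot pipe %.>% and :=
library(rquery)
library(rqdatatable)
library(parallel)

# assumed example data: values v grouped by key g
d <- data.frame(
  g = rep(c("a", "b", "c"), each = 3),
  v = 1:9
)

# rquery operation tree: per-group maximum of v
optree <- local_td(d) %.>%
  project(., groupby = "g", v_max := max(v))

cl <- parallel::makeCluster(2)
# depending on the cluster type, workers may need the package loaded
parallel::clusterEvalQ(cl, library("rqdatatable"))

# partition the work by the grouping column g
res <- ex_data_table_parallel(
  optree,
  partition_column = "g",
  cl = cl,
  tables = list(d = d)
)

parallel::stopCluster(cl)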
Care must be taken that the calculation partitioning is coarse enough to ensure a correct calculation. For example: anything one is joining on, aggregating over, or ranking over must be grouped so that all elements affecting a given result row are in the same level of the partition.
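As an illustration of this requirement, continuing the assumed sketch above: the pipeline aggregates v by g, so g is a safe partition column, while a finer key such as a per-row id would not be.

# Continuing the assumed sketch above (optree aggregates v by g over table d).
# Because the partition column g is also the grouping column, no group is
# split across partitions, and the parallel result agrees (up to row order)
# with single-process execution:
res_serial <- ex_data_table(optree, tables = list(d = d))

# A per-row id, by contrast, would be too fine a partition for this pipeline:
# rows of one group could land in different partitions, and each partition
# would then report its own per-group maximum.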
Returns the resulting data.table (intermediate tables can sometimes be mutated, as is the practice with data.table).