localDiskControl: Specify Control Parameters for MapReduce on a Local Disk...
In datadr: Divide and Recombine for Large, Complex Data


Specify control parameters for a MapReduce job on a local disk connection. The currently supported parameters are described below.

localDiskControl(cluster = NULL, map_buff_size_bytes = 10485760, reduce_buff_size_bytes = 10485760, map_temp_buff_size_bytes = 10485760)

`cluster`	a "cluster" object obtained from `makeCluster` to allow for parallel processing
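`map_buff_size_bytes`	determines how much data, in bytes, should be sent to each map task (default 10485760, i.e. 10 MiB)
`reduce_buff_size_bytes`	determines how much data, in bytes, should be sent to each reduce task (default 10485760, i.e. 10 MiB)
`map_temp_buff_size_bytes`	determines the size, in bytes, of the chunks written to disk between the map and reduce phases (default 10485760, i.e. 10 MiB)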

If your data is on a shared drive that multiple nodes can access, or on a high-performance shared file system such as Lustre, you can run a local disk MapReduce job on multiple nodes by creating a multi-node cluster with `makeCluster`.
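
The multi-node setup described above might look roughly like the following sketch. It assumes that `parallel::makeCluster` can reach each host (typically over SSH) and that every host sees the data under the same shared path. The helper `make_multinode_control` and the host names are hypothetical and are not part of datadr.

```r
# Sketch only: `hosts` is a hypothetical vector of reachable host names,
# and every host is assumed to see the same shared path.
make_multinode_control <- function(hosts) {
  # one PSOCK worker is started per entry in `hosts` (requires remote access)
  cl <- parallel::makeCluster(hosts, type = "PSOCK")
  # control object telling local disk MapReduce to use these workers
  control <- datadr::localDiskControl(cluster = cl)
  list(cluster = cl, control = control)
}

# Hypothetical usage; "node1" and "node2" are placeholder host names:
# res <- make_multinode_control(c("node1", "node2"))
# ... run local disk operations with control = res$control ...
# parallel::stopCluster(res$cluster)
```

Returning the cluster alongside the control object keeps a handle available for `parallel::stopCluster` once the job is finished.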

If you are using multiple cores and the input data is very small, set `map_buff_size_bytes` to a small value so that the key-value pairs are split across cores.
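
To see why a small buffer matters for small inputs, the sketch below estimates how many map tasks a given buffer size would yield. It assumes the input is split into blocks of roughly `map_buff_size_bytes` each, which is a simplification rather than datadr's exact splitting logic. `estimate_map_tasks` is a hypothetical helper, not a datadr function, and the in-memory size of `iris` is used only as a stand-in for the size of the data on disk.

```r
# Hypothetical helper (not part of datadr): estimate the number of map tasks,
# assuming the input is split into blocks of roughly `buff_size` bytes.
estimate_map_tasks <- function(data_bytes, buff_size = 10485760) {
  max(1, ceiling(data_bytes / buff_size))
}

# in-memory size of the iris data frame (a few kilobytes)
iris_bytes <- as.numeric(object.size(iris))

# with the 10 MiB default, all of the data fits into a single map task
default_tasks <- estimate_map_tasks(iris_bytes)

# with a 1 KiB buffer, the data is spread over several map tasks
small_tasks <- estimate_map_tasks(iris_bytes, buff_size = 1024)
```

Under this assumption, a control object created with something like `localDiskControl(cluster = cl, map_buff_size_bytes = 1024)` would give each core some work even for a very small input; the value 1024 is purely illustrative.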

library(datadr)

# create a cluster with 2 local worker processes for parallel processing
cl <- parallel::makeCluster(2)
# create a local disk control object that uses this cluster, so that
# operations given this control object run in parallel
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations

# convert in-memory ddf to local-disk ddf
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
bySpeciesLD <- convert(divide(iris, by = "Species"), ldConn)

# update attributes using parallel cluster
updateAttributes(bySpeciesLD, control = control)

# remove temporary directories
unlink(ldPath, recursive = TRUE)

# shut down the cluster
parallel::stopCluster(cl)