localDiskControl: Specify Control Parameters for MapReduce on a Local Disk...


Description

Specify control parameters for a MapReduce job on a local disk connection. The currently supported parameters are described under Arguments.

Usage

localDiskControl(cluster = NULL, map_buff_size_bytes = 10485760,
  reduce_buff_size_bytes = 10485760, map_temp_buff_size_bytes = 10485760)

Arguments

cluster

a "cluster" object obtained from parallel::makeCluster, used to enable parallel processing

map_buff_size_bytes

determines how much data, in bytes, should be sent to each map task

reduce_buff_size_bytes

determines how much data, in bytes, should be sent to each reduce task

map_temp_buff_size_bytes

determines the size, in bytes, of the chunks written to disk between the map and the reduce
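
The three buffer arguments share the default 10485760, which is 10 * 1024^2 bytes (10 MiB). The following is a minimal sketch of setting them from more readable units. The helper `mib` is hypothetical and not part of datadr, and the values passed to localDiskControl are arbitrary illustrations rather than recommendations.

```r
# Hypothetical helper (not part of datadr): convert mebibytes to bytes
mib <- function(x) x * 1024^2

# The default used for all three buffer arguments in the Usage section
default_bytes <- mib(10)
stopifnot(default_bytes == 10485760)

# Guarded so that the sketch still runs when datadr is not installed
if (requireNamespace("datadr", quietly = TRUE)) {
  control <- datadr::localDiskControl(
    map_buff_size_bytes      = mib(50),  # illustrative value only
    reduce_buff_size_bytes   = mib(50),  # illustrative value only
    map_temp_buff_size_bytes = mib(10)   # same as the default
  )
}
```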

Note

If you have data on a shared drive that multiple nodes can access or a high performance shared file system like Lustre, you can run a local disk MapReduce job on multiple nodes by creating a multi-node cluster with makeCluster.

If you are using multiple cores and the input data is very small, map_buff_size_bytes needs to be small so that the key-value pairs will be split across cores.
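
As a rough sketch of the small-input case, the buffer size can be chosen relative to the size of the data. The division heuristic below is an assumption made for illustration, not something datadr prescribes, and `object.size` measures the in-memory size, which may differ from the size of the files on disk.

```r
# Approximate in-memory size of the input in bytes (on-disk size may differ)
input_bytes <- as.numeric(object.size(iris))
n_cores <- 2

# Hypothetical heuristic: aim for roughly one buffer's worth of data per core
buff_bytes <- max(1, floor(input_bytes / n_cores))

# The default buffer (10485760 bytes) is far larger than this small input
stopifnot(buff_bytes < 10485760)

# Guarded so that the sketch still runs when datadr is not installed
if (requireNamespace("datadr", quietly = TRUE)) {
  cl <- parallel::makeCluster(n_cores)
  control <- datadr::localDiskControl(cluster = cl,
    map_buff_size_bytes = buff_bytes)
  parallel::stopCluster(cl)
}
```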

Examples

# create a cluster of 2 local workers that can be used to process in parallel
cl <- parallel::makeCluster(2)
# create a local disk control object that uses this cluster, so that
# operations given this control run in parallel
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations

# convert in-memory ddf to local-disk ddf
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
bySpeciesLD <- convert(divide(iris, by = "Species"), ldConn)

# update attributes using parallel cluster
updateAttributes(bySpeciesLD, control = control)

# remove temporary directories
unlink(ldPath, recursive = TRUE)

# shut down the cluster
parallel::stopCluster(cl)

datadr documentation built on May 1, 2019, 8:06 p.m.