pddply: Parallel wrapper for plyr::ddply
In Smisc: Sego Miscellaneous


Parallel implementation of plyr::ddply that suppresses a spurious warning when plyr::ddply is called in parallel. All of the arguments except njobs are passed directly to arguments of the same name in plyr::ddply.


pddply(.data, .variables, .fun = NULL, ..., njobs = parallel::detectCores() - 1,
  .progress = "none", .inform = FALSE, .drop = TRUE, .paropts = NULL)

`.data`	data frame to be processed
`.variables`	character vector of variables in `.data` that will define how to split the data
`.fun`	function to apply to each piece
`...`	other arguments passed on to `.fun`
`njobs`	the number of parallel jobs to launch, defaulting to one less than the number of available cores on the machine
`.progress`	name of the progress bar to use, see `plyr::create_progress_bar`
`.inform`	produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging
`.drop`	should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)
`.paropts`	a list of additional options passed into the `foreach::foreach` function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages. Use the `.export` and `.packages` arguments to supply them so that all cluster nodes have the correct environment set up for computing.

An innocuous warning is thrown when plyr::ddply is called in parallel (see https://github.com/hadley/plyr/issues/203). This function catches and hides that warning, which looks like this:

Warning messages:
1: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'
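
The catch-and-hide behavior described above can be illustrated with base R condition handling. The following is only a sketch of the general technique, not the Smisc source: `muffle_matching` and `noisy` are hypothetical names introduced for illustration, and the assumption is that the warning is identified by matching a fixed substring of its message.

```r
# Hypothetical helper (not part of Smisc): evaluate `expr` and muffle only
# those warnings whose message contains `pattern`. Any other warning
# propagates to the caller as usual.
muffle_matching <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w), fixed = TRUE)) {
        # Resume evaluation of `expr` without reporting this warning
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Hypothetical stand-in for a call that emits the spurious warning
noisy <- function() {
  warning("... may be used in an incorrect context")
  "done"
}

# The matching warning is hidden and the value is still returned
res <- muffle_matching(noisy(), "may be used in an incorrect context")
```

Because the handler muffles only matching messages, a genuine warning raised by `.fun` would still reach the user under this approach.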

If njobs = 1, a call to plyr::ddply is made without parallelization, and anything supplied to .paropts is ignored. See the documentation for plyr::ddply for additional details.

The data frame returned by plyr::ddply

plyr::ddply

data(baseball, package = "plyr")


# Summarize the number of entries for each year in the baseball dataset with 2 jobs
o1 <- pddply(baseball, ~ year, nrow, njobs = 2)
head(o1)

# Verify it's the same as the non-parallel version of plyr::ddply()
o2 <- plyr::ddply(baseball, ~ year, nrow)
identical(o1, o2)


# Another possibility
o3 <- pddply(baseball, "lg", c("nrow", "ncol"), njobs = 2)
o3

o4 <- plyr::ddply(baseball, "lg", c("nrow", "ncol"))
identical(o3, o4)


# A nonsense example where we need to pass objects and packages into the cluster
number1 <- 7

f <- function(x, number2 = 10) {
 paste(x$id[1], padZero(number1, num = 2), number2, sep = "-")
}

# In parallel
o5 <- pddply(baseball[1:100,], "year", f, number2 = 13, njobs = 2,
            .paropts = list(.packages = "Smisc", .export = "number1"))
o5


# Non parallel
o6 <- plyr::ddply(baseball[1:100,], "year", f, number2 = 13)
identical(o5, o6)