pddply: Parallel wrapper for plyr::ddply


View source: R/pddply.R

Description

Parallel implementation of plyr::ddply that suppresses a spurious warning when plyr::ddply is called in parallel. All of the arguments except njobs are passed directly to arguments of the same name in plyr::ddply.

Usage

pddply(.data, .variables, .fun = NULL, ..., njobs = parallel::detectCores() - 1,
  .progress = "none", .inform = FALSE, .drop = TRUE, .paropts = NULL)

Arguments

.data

data frame to be processed

.variables

character vector of variables in .data that will define how to split the data

.fun

function to apply to each piece

...

other arguments passed on to '.fun'

njobs

the number of parallel jobs to launch, defaulting to one less than the number of available cores on the machine

.progress

name of the progress bar to use, see plyr::create_progress_bar

.inform

produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging

.drop

should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)

.paropts

a list of additional options passed into the foreach::foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages. Use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.
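The default for njobs is parallel::detectCores() - 1, which evaluates to 0 on a single-core machine and to NA when the core count cannot be determined. The sketch below shows one way a caller might compute a safe value before passing it to pddply. The helper name safe_njobs is hypothetical and is not part of Smisc.

```r
# Hypothetical helper (not part of Smisc): pick a job count that is always
# at least 1, even when the number of cores is 1 or unknown (NA).
safe_njobs <- function(cores = parallel::detectCores()) {
  if (is.na(cores)) {
    # detectCores() could not determine the core count
    return(1L)
  }
  # Leave one core free, but never drop below one job
  max(1L, as.integer(cores) - 1L)
}

njobs <- safe_njobs()
```

The result could then be supplied explicitly, for example pddply(baseball, "year", nrow, njobs = safe_njobs()).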

Details

An innocuous warning is thrown when plyr::ddply is called in parallel (see https://github.com/hadley/plyr/issues/203). This function catches and hides that warning, which looks like this:

Warning messages:
1: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'
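Catching and hiding one specific warning, as described above, can be sketched in base R with withCallingHandlers and the "muffleWarning" restart. This is a minimal illustration of the general technique and is not taken from the Smisc source, so pddply may do it differently. The names muffle_matching and noisy are hypothetical, and noisy merely stands in for a parallel plyr::ddply call.

```r
# Hypothetical helper (not part of Smisc): evaluate `expr`, muffling only
# warnings whose message contains `pattern`; other warnings pass through.
muffle_matching <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Stand-in for a call that emits the spurious warning and then returns a value
noisy <- function() {
  warning("... may be used in an incorrect context: '.fun(piece, ...)'")
  42
}

# The matching warning is suppressed and the return value is kept
result <- muffle_matching(noisy(), "may be used in an incorrect context")
```

Matching on the message text keeps unrelated warnings visible, which a blanket suppressWarnings() call would not.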

If njobs = 1, a call to plyr::ddply is made without parallelization, and anything supplied to .paropts is ignored. See the documentation for plyr::ddply for additional details.

Value

The data frame returned by plyr::ddply

See Also

plyr::ddply

Examples

# Attach Smisc so that pddply() and padZero() are available
library(Smisc)

data(baseball, package = "plyr")


# Summarize the number of entries for each year in the baseball dataset with 2 jobs
o1 <- pddply(baseball, ~ year, nrow, njobs = 2)
head(o1)

#  Verify it's the same as the non-parallel version of plyr::ddply()
o2 <- plyr::ddply(baseball, ~ year, nrow)
identical(o1, o2)


# Another possibility
o3 <- pddply(baseball, "lg", c("nrow", "ncol"), njobs = 2)
o3

o4 <- plyr::ddply(baseball, "lg", c("nrow", "ncol"))
identical(o3, o4)


# A nonsense example where we need to pass objects and packages into the cluster
number1 <- 7

f <- function(x, number2 = 10) {
 paste(x$id[1], padZero(number1, num = 2), number2, sep = "-")
}

# In parallel
o5 <- pddply(baseball[1:100,], "year", f, number2 = 13, njobs = 2,
            .paropts = list(.packages = "Smisc", .export = "number1"))
o5


# Non parallel
o6 <- plyr::ddply(baseball[1:100,], "year", f, number2 = 13)
identical(o5, o6)

Smisc documentation built on May 2, 2019, 2:46 a.m.