dfplapply: Parallelized single row processing of a data frame
In Smisc: Sego Miscellaneous


Applies a function to each row of a data frame in a parallelized fashion (by submitting multiple batch R jobs). It is a convenient wrapper for plapply, modified especially for parallel, single-row processing of data frames.
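Conceptually, the row-wise processing resembles the sequential base-R sketch below. It only illustrates the calling convention (each call to `FUN` receives a one-row data frame) and is not the package's implementation: the helper name `apply_rows_sequentially` is hypothetical, and no batch R jobs are launched.

```r
# Hypothetical helper: a sequential stand-in for the row-wise behaviour.
# It is NOT part of Smisc and does no parallel processing.
apply_rows_sequentially <- function(X, FUN, ...) {
  stopifnot(is.data.frame(X), is.function(FUN))
  lapply(seq_len(nrow(X)), function(i) {
    # drop = FALSE keeps the selected row as a one-row data frame
    FUN(X[i, , drop = FALSE], ...)
  })
}

X <- data.frame(a = 1:3, b = letters[1:3])

# Each element of the result corresponds to one row of X
res <- apply_rows_sequentially(X, function(x) x$a^2)
```

The `drop = FALSE` subsetting is what lets `FUN` refer to columns by name, such as `x$a`, as the examples on this page do.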

dfplapply(X, FUN, ..., output.df = FALSE, njobs = parallel::detectCores() -
  1, packages = NULL, header.file = NULL, needed.objects = NULL,
  needed.objects.env = parent.frame(), workDir = "plapply",
  clobber = TRUE, max.hours = 24, check.interval.sec = 1,
  collate = FALSE, random.seed = NULL, rout = NULL, clean.up = TRUE,
  verbose = FALSE)

`X`	The data frame, each row of which will be processed using `FUN`
`FUN`	A function whose first argument is a single-row data frame, i.e. a single row of `X`. The value returned by `FUN` can be any object
`...`	Additional named arguments to `FUN`
`output.df`	logical indicating whether the value returned by `dfplapply` should be a data frame. If `output.df = TRUE`, then the value returned by `FUN` should be a data frame. If `output.df = FALSE`, a list is returned by `dfplapply`.
`njobs`	The number of jobs (subsets). Defaults to one less than the number of cores on the machine.
`packages`	Character vector giving the names of packages that will be loaded in each new instance of R, using `library`.
`header.file`	Text string indicating a file that will be sourced prior to calling `lapply` in order to create an 'environment' that will satisfy all potential dependencies for `FUN`. If `NULL`, no file is sourced.
`needed.objects`	Character vector giving the names of objects that `FUN` may need. These objects reside in the environment specified by `needed.objects.env` and are loaded into the global environment of each new instance of R that is launched. If `NULL`, no additional objects are passed.
`needed.objects.env`	Environment where `needed.objects` reside. This defaults to the environment in which `plapply` is called.
`workDir`	Character string giving the name of the working directory that will be used for the files needed to launch the separate instances of R.
`clobber`	Logical indicating whether the directory designated by `workDir` will be overwritten if it exists and contains files. If `clobber = FALSE`, and `workDir` contains files, `plapply` throws an error.
`max.hours`	The maximum number of hours to wait for the `njobs` to complete.
`check.interval.sec`	The number of seconds to wait between checking to see whether all `njobs` have completed.
`collate`	`= TRUE` creates a 'first-in-first-out' processing order of the elements of the input list `X`. This logical is passed to the `collate` argument of `parseJob`.
`random.seed`	An integer setting the random seed, which will result in randomizing the elements of the list assigned to each job. This is useful when the computing time for each element varies significantly because it helps to even out the run times of the parallel jobs. If `random.seed = NULL`, no randomization is performed and the elements of the input list are subdivided sequentially among the jobs. This variable is passed to the `random.seed` argument of `parseJob`. If `collate = TRUE`, no randomization is performed and `random.seed` is ignored.
`rout`	A character string giving the name of the file where all of the `.Rout` files will be gathered. If `rout = NULL`, the `.Rout` files are not gathered, but left alone in `workDir`.
`clean.up`	`= TRUE` will delete the working directory.
`verbose`	`= TRUE` prints messages which show the progress of the jobs.
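The descriptions of `collate` and `random.seed` can be pictured with a small base-R sketch of three ways to divide element indices among jobs. This is an illustration under assumptions only: the actual partitioning is done by `parseJob` and may differ in detail, and the variable names (`n_elements`, `block_id`, and so on) are hypothetical.

```r
# Illustrative only: NOT the algorithm used by parseJob
n_elements <- 7   # e.g. the number of rows of X
njobs <- 3

# Job labels for contiguous blocks: 1 1 1 2 2 3 3
block_id <- sort(rep_len(seq_len(njobs), n_elements))

# Sequential subdivision: each job receives a contiguous block
seq_assign <- split(seq_len(n_elements), block_id)

# Collated (round-robin) subdivision: elements are dealt out in turn,
# so early elements are spread across all jobs
col_assign <- split(seq_len(n_elements), rep_len(seq_len(njobs), n_elements))

# Randomized subdivision: shuffle the indices first, then cut into blocks
set.seed(42)
rand_assign <- split(sample(seq_len(n_elements)), block_id)
```

Shuffling before subdividing is one way to avoid placing all of the slow elements in the same job, which is the motivation given for `random.seed`.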

A list or data frame containing the results of processing each row of X with FUN.
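To make the two return shapes concrete, the base-R sketch below builds a list with one element per row and then stacks those elements into a single data frame. Row binding with `rbind` is an assumption about how per-row data frames could be combined; the documentation only states that a data frame is returned when `output.df = TRUE`. The function name `two_rows` is hypothetical.

```r
X <- data.frame(a = 1:3, b = letters[1:3])

# Hypothetical FUN that returns a two-row data frame for each input row
two_rows <- function(x) {
  data.frame(ab = rep(paste(x$a, x$b, sep = "-"), 2), a2 = rep(x$a^2, 2))
}

# List shape: one element per row of X
res_list <- lapply(seq_len(nrow(X)), function(i) two_rows(X[i, , drop = FALSE]))

# Data frame shape (assumed): stack the per-row results
res_df <- do.call(rbind, res_list)
```

Because each call contributes two rows, the stacked result has more rows than `X`; the output need not align row for row with the input.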

Landon Sego

plapply

X <- data.frame(a = 1:3, b = letters[1:3])


# Function that will operate on each row of X, producing a simple list
test.1 <- function(x) {
  list(ab = paste(x$a, x$b, sep = "-"), a2 = x$a^2, bnew = paste(x$b, "new", sep = "."))
}

# Data frame output
dfplapply(X, test.1, output.df = TRUE, njobs = 2)

# List output
dfplapply(X, test.1, njobs = 2)

# Function with 2 rows of output
test.2 <- function(x) {
  data.frame(ab = rep(paste(x$a, x$b, sep = "-"), 2), a2 = rep(x$a^2, 2))
}

dfplapply(X, test.2, output.df = TRUE, njobs = 2, verbose = TRUE)


# Passing in other objects needed by FUN
a.out <- 10
test.3 <- function(x) {
  data.frame(a = x$a + a.out, b = paste(x$b, a.out, sep="-"))
}

dfplapply(X, test.3, output.df = TRUE, needed.objects = "a.out", njobs = 2)