plapply: Simple parallelization of lapply
In Smisc: Sego Miscellaneous


Parses a large list into subsets and submits a separate batch R job for each subset, which calls lapply on that subset. plapply has some features that may not be readily available in other parallelization functions such as mclapply and parLapply:

The .Rout files produced by each R instance are easily accessible for convenient debugging of errors or warnings. The .Rout files can also serve as an explicit record of the work that was performed by the workers.
Three options are available for the ordering of the processing of the list elements: the original list order, randomized, or collated (first-in-first-out).
In each R instance, pre-processing or post-processing steps can be performed before and after the call to lapply.

These pre-processing and post-processing steps can depend on the instance of R, such that each instance can be treated differently, if desired. These features give greater control over the computing process, which can be especially useful for large jobs.

plapply(X, FUN, ..., njobs = parallel::detectCores() - 1, packages = NULL,
  header.file = NULL, needed.objects = NULL,
  needed.objects.env = parent.frame(), workDir = "plapply",
  clobber = TRUE, max.hours = 24, check.interval.sec = 1,
  collate = FALSE, random.seed = NULL, rout = NULL, clean.up = TRUE,
  verbose = FALSE)

`X`	A list or vector, each element of which will be the input to `FUN`
`FUN`	A function whose first argument is an element of `X`
`...`	Additional named arguments to `FUN`
`njobs`	The number of jobs (subsets). Defaults to one less than the number of cores on the machine.
`packages`	Character vector giving the names of packages that will be loaded in each new instance of R, using `library`.
`header.file`	Text string indicating a file that will be sourced prior to calling `lapply` in order to create an 'environment' that will satisfy all potential dependencies for `FUN`. If `NULL`, no file is sourced.
`needed.objects`	Character vector giving the names of objects that reside in the environment specified by `needed.objects.env` and that may be needed by `FUN`. These objects are loaded into the global environment of each new instance of R that is launched. If `NULL`, no additional objects are passed.
`needed.objects.env`	Environment where `needed.objects` reside. This defaults to the environment in which `plapply` is called.
`workDir`	Character string giving the name of the working directory that will be used for the files needed to launch the separate instances of R.
`clobber`	Logical indicating whether the directory designated by `workDir` will be overwritten if it exists and contains files. If `clobber = FALSE`, and `workDir` contains files, `plapply` throws an error.
`max.hours`	The maximum number of hours to wait for the `njobs` to complete.
`check.interval.sec`	The number of seconds to wait between checking to see whether all `njobs` have completed.
`collate`	`= TRUE` creates a 'first-in-first-out' processing order of the elements of the input list `X`. This logical is passed to the `collate` argument of `parseJob`.
`random.seed`	An integer setting the random seed, which will result in randomizing the elements of the list assigned to each job. This is useful when the computing time for each element varies significantly because it helps to even out the run times of the parallel jobs. If `random.seed = NULL`, no randomization is performed and the elements of the input list are subdivided sequentially among the jobs. This variable is passed to the `random.seed` argument of `parseJob`. If `collate = TRUE`, no randomization is performed and `random.seed` is ignored.
`rout`	A character string giving the name of the file into which all of the `.Rout` files will be gathered. If `rout = NULL`, the `.Rout` files are not gathered, but left alone in `workDir`.
`clean.up`	`= TRUE` will delete the working directory.
`verbose`	`= TRUE` prints messages which show the progress of the jobs.
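The interaction between `max.hours` and `check.interval.sec` can be pictured as a polling loop. The sketch below is only an illustration: the helper `wait_for_files` is hypothetical and not part of Smisc, and it assumes that job completion is detected by the appearance of one output file per job, which the documentation does not state.

```r
# Hypothetical helper (not part of Smisc): poll until every expected file
# exists, or stop with an error once max.hours have elapsed.
wait_for_files <- function(files, max.hours = 24, check.interval.sec = 1) {
  deadline <- Sys.time() + max.hours * 3600
  while (!all(file.exists(files))) {
    if (Sys.time() > deadline) {
      stop("Jobs did not complete within ", max.hours, " hours")
    }
    # Pause between checks, as check.interval.sec does for plapply
    Sys.sleep(check.interval.sec)
  }
  invisible(TRUE)
}

# Usage: a file that already exists is detected on the first check
done.file <- tempfile()
file.create(done.file)
finished <- wait_for_files(done.file, max.hours = 1 / 3600,
                           check.interval.sec = 0.1)
unlink(done.file)
```

A short interval detects completion sooner at the cost of more frequent file-system checks; a longer one suits jobs that run for hours.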

plapply applies FUN to each element of the list X by parsing the list into njobs sublists of equal (or almost equal) size and then applying FUN to each sublist using lapply.
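The sequential subdivision described above can be sketched in base R. The helper `split_sequential` is hypothetical and only approximates what plapply does through `parseJob`; it is not the package's implementation.

```r
# Hypothetical helper (not part of Smisc): split X into njobs contiguous
# sublists whose sizes differ by at most one.
split_sequential <- function(X, njobs) {
  n <- length(X)
  njobs <- min(njobs, n)
  # Job label for each element, e.g. 1 1 1 1 2 2 2 3 3 3
  job <- sort(rep_len(seq_len(njobs), n))
  split(X, job)
}

X <- as.list(1:10)
sublists <- split_sequential(X, njobs = 3)

# Sizes of the sublists
sizes <- lengths(sublists)

# Apply the function to each sublist with lapply, then reassemble
out <- do.call(c, unname(lapply(sublists, function(s) lapply(s, sqrt))))
```

Here the three sublists are processed one after another in the same session; plapply instead hands each sublist to its own batch instance of R.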

A separate batch instance of R is launched for each sublist, thus utilizing another core of the machine. After the jobs complete, the njobs output lists are reassembled. The global environments for each batch instance of R are created by writing/reading data to/from disc.
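Passing objects between sessions through disk can be illustrated with base R's `save` and `load`. This is a minimal sketch under the assumption that the needed objects are written to a single file; the file name and the use of a separate environment to stand in for the worker's global environment are illustrative choices, not details taken from plapply.

```r
# Objects in the calling environment that a worker would need
b1 <- c(0.5, 1.5, 2.5)
b2 <- c(3L, 7L)
needed.objects <- c("b1", "b2")

# Write the named objects to a temporary file
img <- tempfile(fileext = ".Rdata")
save(list = needed.objects, file = img, envir = environment())

# A fresh environment stands in for the global environment of a worker
worker.env <- new.env()
loaded <- load(img, envir = worker.env)
unlink(img)

# Names of the objects now available to the worker
loaded
```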

If collate = TRUE or random.seed is set to an integer value, the output list returned by plapply is reordered to reflect the original ordering of the input list, X.
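Restoring the original order amounts to inverting the permutation that assigned elements to jobs. The sketch below assumes that collating deals elements out to the jobs in turn (job 1 gets elements 1, 3, 5, and so on); that reading of 'first-in-first-out' is an assumption, and the code is not taken from Smisc.

```r
X <- list(a = 1, b = 2, c = 3, d = 4, e = 5)
njobs <- 2

# Assumed collated assignment: elements are dealt out to the jobs in turn
job <- rep_len(seq_len(njobs), length(X))   # 1 2 1 2 1
idx <- split(seq_along(X), job)             # job 1: 1 3 5, job 2: 2 4

# Each job applies the function to its own sublist
out.by.job <- lapply(idx, function(i) lapply(X[i], function(x) x^2))

# Concatenate the job outputs, then undo the permutation
perm <- unlist(idx, use.names = FALSE)      # 1 3 5 2 4
out <- do.call(c, unname(out.by.job))
out <- out[order(perm)]
```

The same `order(perm)` step works for a randomized assignment, where `perm` would come from `sample(length(X))` after setting the seed.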

An object called process.id (consisting of an integer indicating the process number) is available in the global environment of each instance of R.

Each instance of R runs a script that performs the following steps:

Any other packages indicated in the packages argument are loaded via calls to library()
The process.id global variable is assigned to the global environment of the R instance (having been passed in via a command line argument)
The header file (if there is one) is sourced
The expression pre.process.expression is evaluated if an object of that name is present in the global environment. The object pre.process.expression may be passed in via the header file or via needed.objects
lapply is called on the sublist, which is named X.i
The expression post.process.expression is evaluated if an object of that name is present in the global environment. The object post.process.expression may be passed in via the header file or via needed.objects
The output returned by lapply is assigned to the object X.i.out, and is saved to a temporary file where it will be collected after all jobs have completed
Warnings are printed
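The core of these steps can be emulated in a single session. In the sketch below, the helper `run_worker_steps` is hypothetical, an ordinary environment stands in for the worker's global environment, and the pre- and post-processing objects are assumed to be created with `expression()`; the actual worker script in Smisc may differ.

```r
# An environment standing in for the global environment of one worker
worker.env <- new.env()

# Assumed form of the optional pre- and post-processing objects
worker.env$pre.process.expression <- expression(started <- paste("job", process.id))
worker.env$post.process.expression <- expression(n.items <- length(X.i))

# Hypothetical helper (not part of Smisc) that mimics the listed steps
run_worker_steps <- function(X.i, FUN, process.id, env) {
  env$process.id <- process.id
  env$X.i <- X.i
  # Evaluate the pre-processing expression only if it is present
  if (exists("pre.process.expression", envir = env, inherits = FALSE)) {
    eval(env$pre.process.expression, envir = env)
  }
  out <- lapply(env$X.i, FUN)
  # Evaluate the post-processing expression only if it is present
  if (exists("post.process.expression", envir = env, inherits = FALSE)) {
    eval(env$post.process.expression, envir = env)
  }
  env$X.i.out <- out
  env$X.i.out
}

res <- run_worker_steps(list(1, 2, 3), function(x) x * 2,
                        process.id = 1, env = worker.env)
```

Because the expressions are evaluated in the worker's environment, they can refer to `process.id` and so behave differently in each instance.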

If njobs = 1, none of the previous steps are executed, only this call is made: lapply(X, FUN, ...)

A list equivalent to that returned by lapply(X, FUN, ...).

Landon Sego

parLapplyW, dfplapply, parLapply, mclapply

# Load the package that provides plapply() and more()
library(Smisc)

# Create a simple list
a <- list(a = rnorm(10), b = rnorm(20), c = rnorm(15), d = rnorm(13),
          e = rnorm(15), f = rnorm(22))

# Some objects that will be needed by f1:
b1 <- rexp(20)
b2 <- rpois(10, 20)

# The function
f1 <- function(x) mean(x) + max(b1) - min(b2)

# Call plapply
res1 <- plapply(a, f1, njobs = 2, needed.objects = c("b1", "b2"),
                check.interval.sec = 0.5, max.hours = 1/120,
                workDir = "example1", rout = "example1.Rout",
                clean.up = FALSE)

print(res1)

# Look at the collated 'Rout' file
more("example1.Rout")

# Look at the contents of the working directory
dir("example1")

# Remove working directory and Rout file
unlink("example1", recursive = TRUE, force = TRUE)
unlink("example1.Rout")
 
# Verify the result with lapply
res2 <- lapply(a, f1)

# Compare results
identical(res1, res2)