dfplapply: Parallelized single row processing of a data frame

Description

Applies a function to each row of a data frame in a parallelized fashion (by submitting multiple batch R jobs). It is a convenient wrapper for plapply, modified especially for parallel, single-row processing of data frames.
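As a rough mental model, the row-wise pattern can be written serially in base R. The sketch below launches no batch jobs and is not the package's implementation; row_fun is a hypothetical stand-in for FUN.

```r
# Serial stand-in for the row-wise pattern; no batch jobs are launched here.
X <- data.frame(a = 1:3, b = letters[1:3])

# Hypothetical FUN: receives a one-row data frame and returns a list
row_fun <- function(x) {
  list(ab = paste(x$a, x$b, sep = "-"), a2 = x$a^2)
}

# drop = FALSE keeps each row as a data frame instead of collapsing it
res <- lapply(seq_len(nrow(X)), function(i) row_fun(X[i, , drop = FALSE]))
str(res)
```

dfplapply presumably yields the same kind of per-row results, with the rows spread across njobs separate R instances.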

Usage

dfplapply(
  X,
  FUN,
  ...,
  output.df = FALSE,
  njobs = parallel::detectCores() - 1,
  packages = NULL,
  header.file = NULL,
  needed.objects = NULL,
  needed.objects.env = parent.frame(),
  workDir = "plapply",
  clobber = TRUE,
  max.hours = 24,
  check.interval.sec = 1,
  collate = FALSE,
  random.seed = NULL,
  rout = NULL,
  clean.up = TRUE,
  verbose = FALSE
)

Arguments

X

The data frame, each row of which will be processed using FUN

FUN

A function whose first argument is a single-row data frame, i.e. a single row of X. The value returned by FUN can be any object

...

Additional named arguments to FUN

output.df

Logical indicating whether the value returned by dfplapply should be a data frame. If output.df = TRUE, then the value returned by FUN should be a data frame. If output.df = FALSE, a list is returned by dfplapply.
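To illustrate the difference between the two output forms, the sketch below combines per-row data frames into one. It assumes simple row-binding with base R; how dfplapply combines results internally is not stated here, and the pieces object is made up for illustration.

```r
# Hypothetical per-row results, as FUN might return them when output.df = TRUE
pieces <- list(
  data.frame(ab = "1-a", a2 = 1),
  data.frame(ab = "2-b", a2 = 4),
  data.frame(ab = "3-c", a2 = 9)
)

# Assumption: row-binding stands in for whatever dfplapply does internally
combined <- do.call(rbind, pieces)
combined
```

With output.df = FALSE, the analogue would be the pieces list itself, left uncombined.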

njobs

The number of jobs (subsets). Defaults to one less than the number of cores on the machine.

packages

Character vector giving the names of packages that will be loaded in each new instance of R, using library.

header.file

Text string indicating a file that will be sourced prior to calling lapply in order to create an 'environment' that will satisfy all potential dependencies for FUN. If NULL, no file is sourced.

needed.objects

Character vector giving the names of objects that reside in the environment specified by needed.objects.env and that may be needed by FUN. These objects are loaded into the global environment of each new instance of R that is launched. If NULL, no additional objects are passed.

needed.objects.env

Environment where needed.objects reside. This defaults to the environment in which plapply is called.

workDir

Character string giving the name of the working directory that will be used for the files needed to launch the separate instances of R.

clobber

Logical indicating whether the directory designated by workDir will be overwritten if it exists and contains files. If clobber = FALSE, and workDir contains files, plapply throws an error.

max.hours

The maximum number of hours to wait for the njobs to complete.

check.interval.sec

The number of seconds to wait between checking to see whether all njobs have completed.

collate

Logical. If collate = TRUE, the elements of the input list X are processed in a 'first-in-first-out' order. This logical is passed to the collate argument of parseJob.

random.seed

An integer setting the random seed, which will result in randomizing the elements of the list assigned to each job. This is useful when the computing time for each element varies significantly because it helps to even out the run times of the parallel jobs. If random.seed = NULL, no randomization is performed and the elements of the input list are subdivided sequentially among the jobs. This variable is passed to the random.seed argument of parseJob. If collate = TRUE, no randomization is performed and random.seed is ignored.
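The three ways of subdividing elements among jobs (sequential, collated, randomized) can be sketched as follows. assign_jobs is a hypothetical helper written for illustration and is not parseJob; in particular, reading collate = TRUE as round-robin interleaving is an assumption.

```r
# Hypothetical helper, not Smisc::parseJob: split indices 1..n among njobs
assign_jobs <- function(n, njobs, collate = FALSE, random.seed = NULL) {
  idx <- seq_len(n)
  if (collate) {
    # Assumed round-robin: job 1 gets 1, njobs + 1, ...; random.seed is ignored
    return(unname(split(idx, rep_len(seq_len(njobs), n))))
  }
  if (!is.null(random.seed)) {
    set.seed(random.seed)
    idx <- sample(idx)  # shuffle before subdividing
  }
  # Contiguous blocks of near-equal size
  unname(split(idx, sort(rep_len(seq_len(njobs), n))))
}

assign_jobs(7, 3)                   # sequential blocks
assign_jobs(7, 3, collate = TRUE)   # interleaved
assign_jobs(7, 3, random.seed = 1)  # shuffled, then blocked
```

Shuffling matters when slow elements cluster together: in the sequential split they would all land in one job, while a random split tends to spread them out.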

rout

A character string giving the name of the file where all of the .Rout files will be gathered. If rout = NULL, the .Rout files are not gathered, but are left in workDir.

clean.up

Logical. If clean.up = TRUE, the working directory is deleted.

verbose

Logical. If verbose = TRUE, messages showing the progress of the jobs are printed.

Value

A list or data frame containing the results of processing each row of X with FUN.

Author(s)

Landon Sego

See Also

plapply

Examples

X <- data.frame(a = 1:3, b = letters[1:3])


# Function that will operate on each row of X, producing a simple list
test.1 <- function(x) {
  list(ab = paste(x$a, x$b, sep = "-"), a2 = x$a^2, bnew = paste(x$b, "new", sep = "."))
}

# Data frame output
dfplapply(X, test.1, output.df = TRUE, njobs = 2)

# List output
dfplapply(X, test.1, njobs = 2)

# Function with 2 rows of output
test.2 <- function(x) {
  data.frame(ab = rep(paste(x$a, x$b, sep = "-"), 2), a2 = rep(x$a^2, 2))
}

dfplapply(X, test.2, output.df = TRUE, njobs = 2, verbose = TRUE)


# Passing in other objects needed by FUN
a.out <- 10
test.3 <- function(x) {
  data.frame(a = x$a + a.out, b = paste(x$b, a.out, sep="-"))
}

dfplapply(X, test.3, output.df = TRUE, needed.objects = "a.out", njobs = 2)

pnnl/Smisc documentation built on Oct. 18, 2020, 6:18 p.m.