cb_apply: Function designed to handle anything that lapply can but can...
In cole-brokamp/CB: CB


Ideally, a function that returns a data.frame should be supplied. This gives the user the advantage of specifying the names of the columns in the resulting data.frame. If the function does not return a data.frame, then column names will be automatically generated.
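The benefit of returning a data.frame can be seen in a minimal base R sketch of the "apply, then row-bind" pattern. This is an illustration only, not the cb_apply implementation; `summarise_col`, `pieces`, and `result` are names made up for the example.

```r
# Toy input: a data.frame is a list of columns, so lapply() visits each column.
X <- data.frame(a = c(1, 2, 3), b = c(4, 5, 60))

# Hypothetical summary function: returning a data.frame lets it choose
# the column names of the combined result.
summarise_col <- function(x) data.frame(mean = mean(x), median = median(x))

pieces <- lapply(X, summarise_col)        # named list of one-row data.frames
result <- do.call(rbind, unname(pieces))  # stack the rows into one data.frame

names(result)  # column names come from summarise_col
```

If the function returned a bare number instead, the combining step would have no column name to use and would need to generate one.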

cb_apply(X, FUN., fill = TRUE, .id = "id", output = "data.frame",
  pb = TRUE, parallel = FALSE, cache = FALSE, error.na = TRUE,
  num.cores = NULL, ...)

`X`	List of objects to apply over
`FUN.`	Function to apply; allows for compact anonymous functions (see ?purrr::as_function for details)
`fill`	(defaults to TRUE) use plyr::rbind.fill to fill in missing columns when rbinding together results
`.id`	controls how each piece of the output is identified based on the input object; see Details
`output`	Output type. Defaults to 'data.frame', but can also be set to 'list' to suppress rbinding of the list.
`pb`	logical; use progress bar?
`parallel`	logical; use parallel processing?
`cache`	(defaults to FALSE) cache the results locally in a folder called "cache" using the memoise package
`error.na`	(defaults to TRUE) use purrr::possibly to replace errors with NA instead of interrupting the process
`num.cores`	The number of cores used for parallel processing. Can be specified as an integer; if left as NULL, the number of available cores is guessed with detectCores(). If parallel is FALSE, this value is set to 1.
`...`	Additional arguments to the function
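The effect described for `error.na` can be sketched in base R. The package documentation says it uses purrr::possibly; the `tryCatch` wrapper below is a stand-in with similar behavior, and `na_on_error` and `safe_log` are hypothetical names used only for this illustration.

```r
# Hypothetical helper: wrap f so that any error yields NA instead of
# stopping the whole run (a base R stand-in for purrr::possibly).
na_on_error <- function(f) {
  function(...) tryCatch(f(...), error = function(e) NA)
}

safe_log <- na_on_error(function(x) {
  if (x < 0) stop("negative input")
  log(x)
})

# The second element fails inside the function, but the loop continues.
out <- lapply(list(1, -1, exp(2)), safe_log)
```

The failing element is recorded as NA, so one bad input does not discard the results already computed for the others.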

Use .id to control how each output is labelled with the input that generated it. Set it to NULL to suppress naming. By default, output lists will be named and output data.frames will have an added column named id. The name of this inserted column can be changed by specifying a character string. Alternatively, a vector of character strings can be used to manually identify the output (the column is called id if the output is a data.frame). Names will be autogenerated even if the input object has incomplete names or no names at all. Note that this also works with functions that return a data.frame with more than one row.
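The labelling behavior can be approximated in base R as follows. This is a sketch under assumptions, not the package's code: `add_id` is a hypothetical helper, and falling back to the element position for missing names is a guess at what "autogenerated" means here.

```r
# Hypothetical helper: prepend an identifier column to each piece, then
# row-bind. Labels are repeated so multi-row pieces stay identified.
add_id <- function(pieces, id = "id") {
  ids <- names(pieces)
  if (is.null(ids)) ids <- rep("", length(pieces))
  # Assumption: missing or empty names fall back to the element position.
  blank <- is.na(ids) | ids == ""
  ids[blank] <- as.character(seq_along(pieces))[blank]

  labelled <- Map(function(df, label) {
    id_col <- data.frame(rep(label, nrow(df)), stringsAsFactors = FALSE)
    cbind(setNames(id_col, id), df)
  }, pieces, ids)

  out <- do.call(rbind, unname(labelled))
  rownames(out) <- NULL
  out
}

pieces <- list(A = data.frame(stat = c(1, 2)), B = data.frame(stat = c(3, 4)))
labelled_result <- add_id(pieces, id = "group")
```

Each input contributes two rows, and both rows carry the name of the input they came from in the `group` column.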

Parallel processing is carried out by pbapply::mclapply. Use the parallel option to switch parallel processing on or off. Only specify the number of cores when really needed as the function will detect the maximum number of available cores. This makes it easy to rerun the script with a higher number of available cores without having to change the code.
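The core-selection rule described above can be sketched with the base `parallel` package. `pick_cores` is a hypothetical helper written for this illustration, and the sketch calls `parallel::mclapply` directly rather than going through any progress-bar wrapper. Note that `mclapply` relies on forking, so more than one core is only supported on Unix-like systems.

```r
library(parallel)

# Hypothetical helper mirroring the documented rule: 1 core when parallel
# is off, otherwise the requested number or, if NULL, the detected number.
pick_cores <- function(parallel = FALSE, num.cores = NULL) {
  if (!parallel) return(1L)
  if (is.null(num.cores)) {
    detected <- detectCores()
    if (is.na(detected)) detected <- 1L  # detectCores() may return NA
    return(as.integer(detected))
  }
  as.integer(num.cores)
}

cores <- pick_cores(parallel = TRUE, num.cores = 2)
squares <- mclapply(1:4, function(i) i^2, mc.cores = cores)
```

Leaving `num.cores` as NULL keeps the script portable: the same code uses however many cores the current machine reports.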

A progress bar can be shown in the terminal when using an interactive R session, or in an .Rout file when using R CMD BATCH to submit R scripts for non-interactive completion. Although RStudio supports the progress bar for single-process workers, it has a problem showing the progress bar when parallel processing is used (see the discussion at http://stackoverflow.com/questions/27314011/mcfork-in-rstudio). In this specific case (RStudio + parallel processing), text updates will be printed to the file '.progress'. Use a shell and 'tail -f .progress' to see the updates.

## Not run: 
X <- as.data.frame(matrix(runif(100),ncol=10))

fun. <- function(x) {
   Sys.sleep(0.5)
   mean(x)
}

cb_apply(X,fun.,cache=TRUE)

fun. <- function(x) {
  Sys.sleep(0.5)
  data.frame('mean'=mean(x),'median'=median(x))
}

cb_apply(X,fun.)

# when setting names of input object, function will attempt to assign them to
# the output in a new column
names(X) <- LETTERS[1:10]
cb_apply(X,fun.,output='list')
cb_apply(X,fun.)
# name the id column something else
cb_apply(X,fun.,.id='group')
# specify a new identifier manually
cb_apply(X,fun.,.id=LETTERS[11:20])
# set .id to NULL to suppress the addition of the id column
cb_apply(X,fun.,.id=NULL)
# naming still works even if the function returns a data.frame with two rows
fun. <- function(x) {
  Sys.sleep(0.5)
  data.frame('stat'=c(mean(x),median(x)))
}
cb_apply(X,fun.)

## End(Not run)