setup: Set up SGE jobs

Description Usage Arguments Details Value Methods (by class) See Also Examples

Description

In a given directory, writes the argument grid produced by grid_apply(.f, ..., .eval=FALSE), an R script to run .f on one set of arguments, a submission script to run .f on all combinations of arguments, and directories to store results and job log files.

Usage

setup(object, ...)

## S3 method for class 'gapply'
setup(object, .dir = getwd(), .reps = 1, .seed = NULL,
  .mc.cores = 1, .verbose = 1, .queue = "long",
  .script.name = "doone.R", .job.name = "distributr",
  .out.dir = "SGE_Output", .R.version = "3.2.5", .email.options = "a",
  .email.addr = NULL, .shell = "bash", ...)

Arguments

object

object from grid_apply or gapply with .eval=FALSE

...

arguments to methods

.dir

directory name relative to the current working directory (no trailing slash)

.reps

total number of replications for each condition

.seed

An integer or NULL (the default; no seeds are set automatically). If given, controls random number generation using L'Ecuyer-CMRG streams (as in the parallel package) by saving and accessing unique seeds in seeds.Rdata.

.mc.cores

number of cores used to run replications in parallel (can be a range)

.verbose

verbosity level: 1 prints '.' for each replication; 2 prints '.' on completion and prints the current arguments; 3 prints the current arguments and results

.queue

name of queue

.script.name

name of script (default doone.R)

.job.name

name of job

.out.dir

name of directory in which to put SGE output files.

.R.version

name of the R version to load. Possible values include any version listed by module avail. Default is "3.2.5".

.email.options

one or more characters from "bea", requesting email when the job begins ("b"), ends ("e"), or aborts ("a"). Default is "a".

.email.addr

email address

.shell

shell to use. Default is 'bash'

Details

Long-running grid_apply computations can be run in parallel on SGE using array tasks. Each row in the argument grid given by grid_apply(f, ...) is mapped to a unique task id, which is run on a separate node. setup() arranges this by writing, in a given directory, the argument grid (arg_grid.Rdata), an R script that runs one combination of arguments, a submission script assigning each row to a unique task id, seeds (if specified), and folders to store results. Jobs are submitted to the scheduler by running qsub submit at the shell prompt, or by running submit() within R.
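A minimal end-to-end sketch, assuming a project directory "sim" and a toy function do.one (both hypothetical names); the calls mirror the workflow above:

do.one <- function(a, b) c(sum = a + b, sub = a - b)
plan <- grid_apply(do.one, a = 1:3, b = 1:2, .eval = FALSE)
plan <- setup(plan, .dir = "sim")   # writes arg_grid.Rdata, doone.R, submit, results/, SGE_Output/
submit()                            # or run `qsub submit` at the shell prompt in "sim"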

The argument grid (arg_grid) is saved to .dir as arg_grid.Rdata. It contains the columns of expand.grid(...) from grid_apply(.f, ...). A column $.sge_id is appended that assigns each row a unique job id.

A simple R script (doone.R) is provided that runs .f on one row of arg_grid. Running doone.R at the command line exactly replicates how the script will be run on each node.

A file (submit) is also written, which specifies a task array for qsub covering all jobs in arg_grid. It can be submitted to the queue by running qsub submit at the command line. Job status can be monitored with qstat. Email notifications for job events can be requested with .email.options and .email.addr.

Results are stored in results/ as $SGE_TASK_ID.Rdata, where SGE_TASK_ID is the array task id corresponding to a unique row in arg_grid. It is sometimes convenient to access this variable within .f, which can be done with Sys.getenv("SGE_TASK_ID"). This might be used, for example, to cache intermediate results.
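As a sketch of the caching idea, .f can read the task id from the environment and save an intermediate object under a task-specific name. The cache/ directory and object names below are illustrative only, not part of distributr:

do.one <- function(a, b) {
  id <- Sys.getenv("SGE_TASK_ID", unset = "interactive")   # "interactive" when run outside SGE
  interm <- a * b                                           # stand-in for an expensive intermediate step
  save(interm, file = file.path("cache", paste0("interm_", id, ".Rdata")))  # assumes cache/ exists
  c(sum = a + b, sub = a - b)
}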

If .seed is given, a list of seeds is generated in seeds.Rdata using L'Ecuyer-CMRG streams for reproducible random number generation. A unique seed is generated for each independent job in the argument grid. Subsequent calls to setup using the same .seed generate the same seeds and reproducible results. See parallel::nextRNGStream for more details.
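The exact contents of seeds.Rdata are internal to distributr, but the stream mechanism can be sketched with base R and the parallel package (five jobs assumed here):

set.seed(123, kind = "L'Ecuyer-CMRG")   # master seed, analogous to .seed
seeds <- vector("list", 5)              # one entry per job in the argument grid
s <- .Random.seed
for (i in seq_along(seeds)) {
  seeds[[i]] <- s                       # seed for job i
  s <- parallel::nextRNGStream(s)       # advance to an independent stream
}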

The function .f can be run multiple times for every row in arg_grid by setting .reps > 1. These replications can be run in parallel using mclapply by setting .mc.cores > 1. To decrease waiting times in the queue, .mc.cores can be given a range (e.g. .mc.cores = c(1, 8)), and the job will be scheduled as soon as a number of cores in that range is available. To access the number of cores given to each job, use Sys.getenv("NSLOTS").
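For example, assuming plan was created with .eval = FALSE, the call below requests 20 replications per row on anywhere from 1 to 8 cores; inside .f the number of granted cores can be read from the environment (NSLOTS is set by SGE):

plan <- setup(plan, .reps = 20, .mc.cores = c(1, 8))
nslots <- as.integer(Sys.getenv("NSLOTS", unset = "1"))   # inside .f, if needed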

It is easy to corrupt arg_grid.Rdata by running setup on different sets of arguments, making future merges of results with arguments based on .sge_id invalid. If arg_grid.Rdata already exists, setup prompts the user for verification that an overwrite is intended, or stops with an error if not run interactively.

Value

Invisibly, the original object with the argument grid modified to append a column $.sge_id assigning each row a unique job id.

As side effects, the function writes the following objects to .dir:

arg_grid.Rdata

Data frame containing the argument grid, appended with a column .sge_id corresponding to the task id of each row

doone.R

Script to run one job, or one row from arg_grid

submit

Submission script specifying a task array over the grid of parameters in (all rows of) arg_grid.Rdata

seeds.Rdata

If .seed is specified, a list of seeds for each job.

results/

Folder to store results. Each file is 1.Rdata, 2.Rdata, ... corresponding to the task id (row in arg_grid)

SGE_Output/

Folder for output from SGE

Methods (by class)

See Also

grid_apply to define the grid, jobs to see the grid, collect to collect completed results, and tidy to merge completed results with the argument grid. test_job runs a job with a given id on the head node. filter_jobs writes a submission script for jobs matching conditions, as in dplyr::filter. sge_env can be used to access environment variables.

Examples

## Not run: 
do.one <- function(a, b){c(sum=a+b, sub=a-b)}
plan <- grid_apply(do.one, a=1:5, b=3, .eval=FALSE)
jobs(plan)  # shows the original argument grid
plan <- setup(plan, .reps=5, .mc.cores=c(1, 5))
jobs(plan)  # modified with a column showing unique job ids
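# A possible next step (not run): submit the task array from within R,
# equivalent to running `qsub submit` at the shell prompt.
submit()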

## End(Not run)
