setup: Set up SGE jobs

Description Usage Arguments Details Value Methods (by class) See Also Examples

Description

In a given directory, writes the argument grid produced by grid_apply(.f, ..., .eval=FALSE), an R script to run .f on one set of arguments, a submission script to run .f on all combinations of arguments, and directories to store results and job log files.

Usage

setup(object, ...)

## S3 method for class 'gapply'
setup(object, .dir = getwd(), .reps = 1, .seed = NULL,
  .mc.cores = 1, .verbose = 1, .queue = "long",
  .script.name = "doone.R", .job.name = "distributr",
  .out.dir = "SGE_Output", .R.version = "3.2.5", .email.options = "a",
  .email.addr = NULL, .shell = "bash", ...)

Arguments

object

object from grid_apply or gapply with .eval=FALSE

...

arguments to methods

.dir

directory name relative to the current working directory (no trailing slash)

.reps

total number of replications for each condition

.seed

An integer or NULL (the default; no seeds are set automatically). If given, controls random number generation using L'Ecuyer-CMRG streams (as in the parallel package) by saving and accessing unique seeds in seeds.Rdata.

.mc.cores

number of cores used to run replications in parallel (can be a range)

.verbose

verbosity level: 1 prints '.' for each replication; 2 prints '.' on completion and prints the current arguments; 3 prints the current arguments and results

.queue

name of queue

.script.name

name of script (default doone.R)

.job.name

name of job

.out.dir

name of directory in which to put SGE output files.

.R.version

name of the R version to load. Possible values include any version listed by module avail. Default is "3.2.5".

.email.options

one or more characters from "bea", requesting email when the job begins ("b"), ends ("e"), or aborts ("a"). Default is "a".

.email.addr

email address

.shell

shell to use. Default is 'bash'

Details

Long-running grid_apply computations can be run in parallel on SGE using array tasks. Each row in the argument grid given by grid_apply(f, ...) is mapped to a unique task id, which is run on a separate node. setup() arranges this by writing, in a given directory, the argument grid (arg_grid.Rdata), an R script that runs one combination of arguments, a submission script assigning each row to a unique task id, seeds (if specified), and folders to store results. Jobs are submitted to the scheduler by running qsub submit at the shell prompt, or by running submit() within R.
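A minimal end-to-end sketch, assuming a project directory "sim" and a toy function do.one (both hypothetical names); the calls mirror the workflow above:

do.one <- function(a, b) c(sum = a + b, sub = a - b)
plan <- grid_apply(do.one, a = 1:3, b = 1:2, .eval = FALSE)
plan <- setup(plan, .dir = "sim")   # writes arg_grid.Rdata, doone.R, submit, results/, SGE_Output/
submit()                            # or run `qsub submit` at the shell prompt in "sim"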

The argument grid (arg_grid) is saved to .dir as arg_grid.Rdata. It contains the columns of expand.grid(...) from grid_apply(.f, ...). A column $.sge_id is appended that assigns each row a unique job id.

A simple R script (doone.R) is provided that runs .f on one row of arg_grid. Running doone.R at the command line exactly replicates how the script will be run on each node.

A file (submit) is also written, which specifies a task array for qsub covering all jobs in arg_grid. It can be submitted to the queue by running qsub submit at the command line. Job status can be monitored with qstat. Email notifications for job events can be requested with .email.options and .email.addr.

Results are stored in results/ as $SGE_TASK_ID.Rdata, where SGE_TASK_ID is the array task id corresponding to a unique row in arg_grid. It is sometimes convenient to access this variable within .f, which can be done with Sys.getenv("SGE_TASK_ID"). This might be used, for example, to cache intermediate results.
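As a sketch of the caching idea, .f can read the task id from the environment and save an intermediate object under a task-specific name. The cache/ directory and object names below are illustrative only, not part of distributr:

do.one <- function(a, b) {
  id <- Sys.getenv("SGE_TASK_ID", unset = "interactive")   # "interactive" when run outside SGE
  interm <- a * b                                           # stand-in for an expensive intermediate step
  save(interm, file = file.path("cache", paste0("interm_", id, ".Rdata")))  # assumes cache/ exists
  c(sum = a + b, sub = a - b)
}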

If .seed is given, a list of seeds is generated in seeds.Rdata using L'Ecuyer-CMRG streams for reproducible random number generation. A unique seed is generated for each independent job in the argument grid. Subsequent calls to setup using the same .seed generate the same seeds and reproducible results. See parallel::nextRNGStream for more details.
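The exact contents of seeds.Rdata are internal to distributr, but the stream mechanism can be sketched with base R and the parallel package (five jobs assumed here):

set.seed(123, kind = "L'Ecuyer-CMRG")   # master seed, analogous to .seed
seeds <- vector("list", 5)              # one entry per job in the argument grid
s <- .Random.seed
for (i in seq_along(seeds)) {
  seeds[[i]] <- s                       # seed for job i
  s <- parallel::nextRNGStream(s)       # advance to an independent stream
}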

The function .f can be run multiple times for every row in arg_grid by setting .reps > 1. These replications can be run in parallel using mclapply by setting .mc.cores > 1. To decrease waiting times in the queue, .mc.cores can be given a range (e.g. .mc.cores = c(1, 8)), and the job will be scheduled as soon as a number of cores in that range is available. To access the number of cores given to each job, use Sys.getenv("NSLOTS").
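For example, assuming plan was created with .eval = FALSE, the call below requests 20 replications per row on anywhere from 1 to 8 cores; inside .f the number of granted cores can be read from the environment (NSLOTS is set by SGE):

plan <- setup(plan, .reps = 20, .mc.cores = c(1, 8))
nslots <- as.integer(Sys.getenv("NSLOTS", unset = "1"))   # inside .f, if needed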

It is easy to corrupt arg_grid.Rdata by running setup on different sets of arguments, making future merges of results with arguments based on .sge_id invalid. If arg_grid.Rdata already exists, setup prompts the user for verification that an overwrite is intended, or stops with an error if not run interactively.

Value

Invisibly, the original object with the argument grid modified to append a column $.sge_id assigning each row a unique job id.

As side effects, the function writes the following objects to .dir:

arg_grid.Rdata

Data frame containing the argument grid, appended with a column .sge_id corresponding to the task id of each row

doone.R

Script to run one job, or one row from arg_grid

submit

Submission script specifying a task array over the grid of parameters in (all rows of) arg_grid.Rdata

seeds.Rdata

If .seed is specified, a list of seeds for each job.

results/

Folder to store results. Each file is 1.Rdata, 2.Rdata, ... corresponding to the task id (row in arg_grid)

SGE_Output/

Folder for output from SGE

Methods (by class)

See Also

grid_apply to define the grid, jobs to see the grid, collect to collect completed results, and tidy to merge completed results with the argument grid. test_job runs a job with a given id on the head node. filter_jobs writes a submission script for jobs matching conditions, as in dplyr::filter. sge_env can be used to access environment variables.

Examples

## Not run: 
do.one <- function(a, b){c(sum=a+b, sub=a-b)}
plan <- grid_apply(do.one, a=1:5, b=3, .eval=FALSE)
jobs(plan)  # shows the original argument grid
plan <- setup(plan, .reps=5, .mc.cores=c(1, 5))
jobs(plan)  # modified with a column showing unique job ids
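# A possible next step (not run): submit the task array from within R,
# equivalent to running `qsub submit` at the shell prompt.
submit()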

## End(Not run)
