Intro

rSubmitter is a package that allows simple communication between R and a SLURM cluster to achieve three main tasks:

  1. Easy submission, monitoring and management of individual cluster jobs.
  2. Easy and fast submission of many jobs through SLURM job arrays.
  3. Seamless lapply (loop) parallelization.

This section demonstrates the simplest way to implement these tasks.

Installation

Currently rSubmitter can only be installed via the R devtools package:

library("devtools")
install_github("pablo-gar/rSubmitter")

Single job submission

The creation, management and submission of individual jobs is achieved through the Job class. The first step is to create an object of the Job class using the Job$new() constructor method. The only required argument is a character vector, where each element is an independent shell command. You can then submit the job to the SLURM cluster using the $submit() method of the Job object, and use the $wait() method to block until the job is completed or has failed.

In its simplest form, the submission of a job looks like this:

library("rSubmitter")
myJob <- Job$new(commandVector = c("echo hola world!", "sleep 30"))
myJob$submit()
myJob$wait()
#  2018-04-23 18:22:35 --- Cluster Status |  PENDING = 1 |
#  2018-04-23 18:22:40 --- Cluster Status |  RUNNING = 1 |
#  2018-04-23 18:23:30 --- Cluster Status |  COMPLETED = 1 |

STDERR and STDOUT are streamed to files of the form rSubmitter_job_[randomNumber].[err|out] in the current working directory. Refer to the advanced section if you wish to choose the destination and names of these files, as well as to set resource requirements such as memory, time, CPUs, and nodes.
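
As a preview, a resource-aware job might look like the sketch below. The argument names used here (jobName, outDir, time, mem, proc) are assumptions for illustration only; check ?Job for the exact interface:

library("rSubmitter")
myJob <- Job$new(commandVector = c("echo hola world!", "sleep 30"),
                 jobName = "myFirstJob", # assumed argument: SLURM job name
                 outDir = "~/logs",      # assumed argument: destination of the .err/.out files
                 time = "1:00:00",       # assumed argument: maximum wall time
                 mem = "4G",             # assumed argument: requested memory
                 proc = 1)               # assumed argument: requested CPUs
myJob$submit()
myJob$wait()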

You can also access the full documentation of the Job class from R:

?Job

Efficient multiple job submission

To submit many jobs at once in an efficient way you can use the JobArray class, which implements SLURM job arrays. Similar to the Job class, you first create a JobArray object with the JobArray$new() constructor method. The difference is that you pass a list of character vectors, where each element of the list is a vector of shell commands that will be submitted as an independent job. All jobs in the array share the same resource requirements (memory, CPUs, etc.). As with the Job class, you submit the array using the $submit() method, and you can use the $wait() method to block until all jobs are completed or one or more fail.

library("rSubmitter")
commands <- list(c( # First Job
                    "echo hola",
                    "sleep 20"
                   ),
                 c( # Second Job
                   "echo adios",
                   "sleep 60"
                 )
               )

jobArray <- JobArray$new(commandList = commands)
jobArray$submit()
jobArray$wait()
#   2018-04-25 17:49:30 --- Cluster Status |  PENDING = 2 |
#   2018-04-25 17:49:45 --- Cluster Status |  RUNNING = 2 |
#   2018-04-25 17:50:50 --- Cluster Status |  COMPLETED = 2 |
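
Since commandList is just an R list, you can also build it programmatically, for instance one job per input file. The file names below are placeholders:

library("rSubmitter")
inputFiles <- c("sample1.txt", "sample2.txt", "sample3.txt") # hypothetical inputs

# One character vector of shell commands per job
commands <- lapply(inputFiles, function(f) {
    c(paste("echo processing", f),
      paste("wc -l", f))
})

jobArray <- JobArray$new(commandList = commands)
jobArray$submit()
jobArray$wait()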

STDERR and STDOUT are streamed to files of the form rSubmitter_job_[randomNumber]_[jobNumber].[err|out] in the current working directory. Refer to the advanced section if you wish to choose the destination and names of these files, as well as to set resource requirements such as memory, time, CPUs, and nodes.

You can also access the full documentation of the JobArray class from R:

?JobArray

Lapply parallelization (loop parallelization)

lapply is a base R function that applies a function FUN to each element of a list or vector (similar to map() in Python). It is one of the standard ways of writing loops in R.

For example, this:

x <- lapply(1:4, as.character)

Is equivalent to:

x <- list()
for(i in 1:4)
    x[[i]] <- as.character(i)

rSubmitter comes with an implementation of lapply, superApply, that enables parallelization of lapply calls on a SLURM cluster. Under the hood it uses the JobArray class (described above), along with some behind-the-scenes file management, to partition one lapply call into many independent ones that are executed in parallel; once they are all done, the results of the individual calls are compiled and returned to the user.

The number of parallel tasks is defined by the tasks argument of superApply. In its simplest form, a parallel lapply call looks like this:

myFun <- function(x) {
    return(rep(x, 3)) 
}

library("rSubmitter")
x <- superApply(1:100, myFun, tasks = 4)
#   2018-05-04 15:29:34 Partitioning function calls 
#   2018-05-04 15:29:35 Submmiting parallel Jobs
#   2018-05-04 15:29:40 --- Cluster Status |  PENDING = 4 |
#   2018-05-04 15:29:45 --- Cluster Status |  COMPLETED = 4 |
#   2018-05-04 15:29:45 Merging parellel results
#   2018-05-04 15:29:45 Merge done
#   2018-05-04 15:29:46 Cleaning partitioned data
#   2018-05-04 15:29:47 Cleaning done

This will partition the lapply call into 4 smaller ones, each receiving an equal share of the elements of 1:100. These partitions will be submitted as independent jobs, and their results will be compiled and returned without any further action on your part.
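
Because superApply is designed as a drop-in replacement for lapply, the returned list should match what a plain lapply call would produce:

x <- superApply(1:100, myFun, tasks = 4)
y <- lapply(1:100, myFun)
identical(x, y) # should be TRUE if the parallel results were merged correctly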

It is important to note that not all systems come with an out-of-the-box R executable. Since superApply performs independent R calls on the SLURM cluster, you have to make sure that these R calls can be executed. If you have to execute any commands in your shell environment to be able to open the R interpreter, include those commands in the extraBashLines argument of superApply.

For example, if in order to execute R on your system you usually do:

$ module load R
$ R

Then you have to execute superApply like this:

x <- superApply(1:100, myFun, tasks = 4, extraBashLines = "module load R")
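
Like lapply, superApply is also expected to forward extra named arguments to FUN; that behavior, and the myRep helper below, are assumptions for illustration (see ?superApply for the exact signature):

myRep <- function(x, times) {
    rep(x, times)
}

# "times" does not match a known superApply argument, so it is assumed
# to be passed through to myRep, just as lapply would do
x <- superApply(1:100, myRep, times = 3, tasks = 4, extraBashLines = "module load R")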

You can see the full documentation of superApply from R:

?superApply

