This package will allow you to send function calls as jobs on a computing
cluster with a minimal interface provided by the `Q` function:

```r
# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2

# queue the function call on your scheduler
Q(fx, x=1:3, n_jobs=1)
```
Computations are done entirely on the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. All calculations are load-balanced, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or one worker has a high load.
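As a minimal sketch of this behavior (the sleep times are arbitrary and only serve to make runtimes unequal), with two workers and one slow call, the remaining fast calls are picked up by whichever worker is free:

```r
# a sketch of load balancing: one slow call and several fast ones
# spread over two workers; each worker pulls a new call as it finishes
fx = function(x) { Sys.sleep(x); x }
Q(fx, x=c(5, 1, 1, 1, 1, 1), n_jobs=2)
```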
Install the `clustermq` package in R from CRAN (including the bundled
ZeroMQ system library):

```r
install.packages('clustermq')
```
Alternatively you can use the `remotes` package to install directly from
Github. Note that this version needs `autoconf`/`automake` for compilation:

```r
# install.packages('remotes')
remotes::install_github('mschubert/clustermq')
```
In the `develop` branch, we will introduce code changes and new features.
These may contain bugs, poor documentation, or other inconveniences. This
branch may not install at times. However, feedback is very welcome.

```r
# install.packages('remotes')
remotes::install_github('mschubert/clustermq', ref="develop")
```
You should be good to go!
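As a quick sanity check before involving a scheduler, `Q` can run in local mode, which evaluates calls in the current R session without submitting any jobs (a minimal sketch):

```r
library(clustermq)
# run calls locally instead of submitting to a scheduler
options(clustermq.scheduler = "local")
Q(function(x) x + 1, x=1:3, n_jobs=1)
```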
By default, `clustermq` will look for `sbatch` (SLURM), `bsub` (LSF), or
`qsub` (SGE) in your `$PATH` and use the scheduler that is available. If the
examples don't run out of the box, you might need to set your scheduler
explicitly.
An HPC cluster's scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what `clustermq` interfaces with in order to do computations.
We currently support the following schedulers (either locally or via SSH):
options(clustermq.scheduler="multiprocess")
options(clustermq.scheduler="PBS"/"Torque")
options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)
Default submission templates are provided and can be customized, e.g. to activate compute environments or containers.
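If you need to pin these settings across sessions, one place to put them is your `~/.Rprofile`. Here is a sketch; the scheduler value and template path are placeholders for your own setup:

```r
# a sketch: fix the scheduler instead of relying on auto-detection,
# and point to a customized submission template (path is a placeholder)
options(
    clustermq.scheduler = "slurm",
    clustermq.template = "~/clustermq_slurm.tmpl"
)
```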
The package is designed to distribute arbitrary function calls on HPC worker nodes. There are, however, a couple of caveats to observe as the R session running on a worker does not share your local memory.
The simplest example is a function call that is completely self-sufficient,
with one argument (`x`) that we iterate through:

```r
fx = function(x) x * 2
Q(fx, x=1:3, n_jobs=1)
```
Non-iterated arguments are supported by the `const` argument:

```r
fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)
```
If a function relies on objects in its environment that are not passed as
arguments, they can be exported using the `export` argument:

```r
fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)
```
If we want to use a package function we need to load it on the worker using a
`library()` call or reference it with `package_name::`:

```r
fx = function(x) {
    `%>%` = dplyr::`%>%`
    x %>%
        dplyr::mutate(area = Sepal.Length * Sepal.Width) %>%
        head()
}
Q(fx, x=list(iris), n_jobs=1)
```
`clustermq` can also be used as a parallel backend for `foreach`. As this is
also used by `BiocParallel`, we can run those packages on the cluster as well:

```r
library(foreach)
register_dopar_cmq(n_jobs=2, memory=1024) # accepts same arguments as `workers`
foreach(i=1:3) %dopar% sqrt(i) # this will be executed as jobs
```

```r
library(BiocParallel)
register(DoparParam()) # after register_dopar_cmq(...)
bplapply(1:3, sqrt)
```
More examples are available in the user guide.
The following arguments are supported by `Q`:

* `fun` - The function to call. This needs to be self-sufficient (because it
  will not have access to the `master` environment)
* `...` - All iterated arguments passed to the function. If there is more than
  one, all of them need to be named
* `const` - A named list of non-iterated arguments passed to `fun`
* `export` - A named list of objects to export to the worker environment

Behavior can further be fine-tuned using the options below:

* `fail_on_error` - Whether to stop if one of the calls returns an error
* `seed` - A common seed that is combined with job number for reproducible results
* `memory` - Amount of memory to request for the job (`bsub -M`)
* `n_jobs` - Number of jobs to submit for all the function calls
* `job_size` - Number of function calls per job. If used in combination with
  `n_jobs`, the latter will be the overall limit
* `chunk_size` - How many calls a worker should process before reporting back
  to the master. Default: every worker will report back 100 times total

The full documentation is available by typing `?Q`.
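As an illustration of how these options interact, here is a sketch; the function and all values are made up for demonstration:

```r
# a sketch combining several Q() options; values are illustrative
fx = function(x) rnorm(1, mean = x)
Q(fx, x = 1:1000,
  n_jobs = 10,            # submit at most 10 jobs ...
  job_size = 200,         # ... with up to 200 calls per job
  memory = 512,           # memory requested per job
  seed = 42,              # combined with job number for reproducible draws
  fail_on_error = FALSE)  # warn on errors instead of stopping
```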
There are some packages that provide high-level parallelization of R function calls on a computing cluster. A thorough comparison of features and performance is available on the wiki.
Briefly, we compare how long it takes different HPC scheduler tools to submit,
run and collect function calls of negligible processing time (multiplying a
numeric value by 2). This serves to quantify the maximum throughput we can
reach with `BatchJobs`, `batchtools` and `clustermq`.
We find that `BatchJobs` is unable to process 10^6 calls or more, instead
failing with a reproducible `RSQLite` error. `batchtools` is able to process
more function calls, but the file system practically limits it at about
10^6 calls. `clustermq` has no problems processing 10^9 calls, and is still
faster than `batchtools` at 10^6 calls.
In short, use `clustermq` if you want a minimal-setup, load-balanced way to
run many function calls on a cluster without network storage I/O. Use
`batchtools` if you prefer a mature, file-based package and don't need
run-time load balancing. Use Snakemake (or `flowr`, `remake`, `drake`) if you
want to design and run an entire workflow rather than parallelize individual
function calls. Don't use `batch` (last updated 2013) or `BatchJobs` (issues
with SQLite on network-mounted storage).