Parallel computing on a cluster can be more challenging than running things locally because it's often the first time that you need to package up code to run elsewhere, and when things go wrong it's more difficult to get information on why things failed.
Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can't physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.
This set of packages (`didehpc`, `queuer` and `context`, along with a couple of support packages, `conan`, `rrq` and `storr`) aims to remove the pain of getting everything set up, getting cluster tasks running, and retrieving your results.
Once everything is set up, running a job on the cluster should be as straightforward as running things locally.
The documentation here runs through a few of the key concepts, then walks through setting this all up. There's also a "quick start" guide that contains much less discussion.
The biggest conceptual move is from thinking about running scripts that generate files to running functions that return objects. The reason for this is that it gives a well-defined interface to build everything else around.
The problem with scripts is that they might do almost anything. They depend on untold files and packages, which they load from wherever; they produce any number of objects. That's fine, but it makes them hard to reason about: to plan deploying them elsewhere, to capture the outputs appropriately, or to orchestrate looping over a bunch of parameter values. If you've ever found yourself writing a series of near-identical script files, changing values by text substitution, you have run into this.
In contrast, functions do (ideally) one thing. They have a well defined set of inputs (their arguments) and outputs (their return value). We can loop over a range of input values by iterating over a set of arguments.
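For example (a toy sketch; the function and values are invented for illustration), instead of a script that hard-codes a parameter, write a function and map over the values you care about:

```r
# A function with explicit inputs and an explicit return value...
simulate_one <- function(n) {
  mean(rnorm(n))
}
# ...can be looped over a range of inputs without any text substitution:
results <- lapply(c(10, 100, 1000), simulate_one)
```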
This set of packages tends to work best if you let it look after filenames. Rather than trying to come up with a naming scheme for different files based on parameter values, just return objects and the packages will arrange for them to be saved and reloaded.
The DIDE cluster needs everything to be available on a filesystem that the cluster can read. Practically this means the filesystems `//fi--didef3.dide.ic.ac.uk/tmp` or `//fi--san03.dide.ic.ac.uk/homes/username` and the like.
You probably have access to network shares that are specific to a project, too. For Windows users these are probably mapped to drives (`Q:` or `T:` or similar) already, but for other platforms you will need to do a little extra work to get things set up (see below).
It is simplest if everything that is needed for a project is present in a single directory that is visible on the cluster. For most of this document I will assume that everything is in one directory, which is on a network share.
IMPORTANT: If you are not sure whether you are working on a network share, run `getwd()`; on Windows the drive letter should be something like `Q:` or another letter that represents a network drive. If it says `C:` or similar, nothing below here will work.
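For example, a quick sanity check (illustrative only; your drive letter and path will differ):

```r
# On Windows this should print a path on a network drive (e.g., "Q:/..."),
# not a local drive such as "C:/..."
getwd()
```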
The initial setup will feel like a headache at first, but it should ultimately take only a few lines. Once everything is set up, the payback is that the job-submission part becomes a lot simpler.
Install the packages using `drat`:
```r
# install.packages("drat") # if you don't have it already
drat:::add("mrc-ide")
install.packages("didehpc")
```
Be sure to run this in a fresh session.
The configuration is handled in a two-stage process. First, some bits that are machine-specific are set using `options`, with option names that are prefixed with `didehpc`. Then, when a queue is created, further values can be passed along via the `config` argument, which will use the "global" options as defaults.
The reason for this separation is that ideally the machine-specific options will not end up in scripts, because that makes things less portable (for example, we need to get your username, but your username is unlikely to work for your collaborators).
Ideally, in your `~/.Rprofile` file, you will add something like:
```r
options(
  didehpc.username = "rfitzjoh",
  didehpc.home = "~/net/home")
```
and then set only the options (such as `cluster`, `cores` or `template`) that vary with a project.
If you use the "big" cluster, you can add didehpc.cluster =
"fi--didemrchnb"
here.
(to set this up, try running usethis::edit_r_profile()
)
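Putting those together, a complete `~/.Rprofile` entry might look like this (a sketch; substitute your own username and paths):

```r
options(
  didehpc.username = "rfitzjoh",
  didehpc.home = "~/net/home",
  didehpc.cluster = "fi--didemrchnb") # only if you use the big cluster
```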
Windows users will not need to provide anything unless they are on a non-domain machine or they are in the unfortunate situation of juggling multiple usernames across systems. Non-domain machines will need the credentials set as above.
Mac users will need to provide their username here as above.
If you have a Linux system and have configured your smb mounts as described below, you might as well take advantage of this and set `credentials = "~/.smbcredentials"`; then you will never be prompted for your password:

```r
options(didehpc.credentials = "~/.smbcredentials")
```
To see the configuration that will be used if you don't do anything (else), run:
didehpc::didehpc_config()
In here you can see the cluster (here, `fi--didemrchnb`), credentials and username, the job template (`GeneralNodes`), information about the resources that will be requested (1 core) and information on filesystem mappings. There are a few other bits of information that may be explained further down. The possible options are explained further in the help for `didehpc::didehpc_config`. If you request help, we will almost always want to see this!
If you refer to network shares in your functions, e.g., to refer to data, you'll need to map these too. To do that, pass them as the `shares` argument to `didehpc::didehpc_config`.

To describe each share, use the `didehpc::path_mapping` function, which takes arguments:

- `name`: a descriptive name for the share
- `path_local`: the point where the share is mounted on your computer
- `path_remote`: the network path that the share refers to (forward slashes are much easier to enter here than backward slashes)
- `drive_remote`: the drive this should be mapped to on the cluster

So to map your "M drive", which points at `\\fi--didef3.dide.ic.ac.uk\malaria`, to `M:` on the cluster you can write:
```r
share <- didehpc::path_mapping("malaria", "M:",
                               "//fi--didef3.dide.ic.ac.uk/malaria", "M:")
config <- didehpc::didehpc_config(shares = share)
```
If you have more than one share to map, pass them through as a list (e.g., `didehpc::didehpc_config(shares = list(share1, share2, ...))`).
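For example (a sketch; the `tmp` share and its local/remote `T:` drive letters are arbitrary choices for illustration):

```r
share_m <- didehpc::path_mapping("malaria", "M:",
                                 "//fi--didef3.dide.ic.ac.uk/malaria", "M:")
share_t <- didehpc::path_mapping("tmp", "T:",
                                 "//fi--didef3.dide.ic.ac.uk/tmp", "T:")
config <- didehpc::didehpc_config(shares = list(share_m, share_t))
```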
For most systems we hope that `didehpc` will do a reasonable job of detecting the shares that you are running on, so this should (hopefully) only be necessary for mapping additional shares. The issue there is that you'll need to use absolute paths to refer to the resources, and that's going to complicate things...
To recreate your work environment on the cluster, we use a package called `context`. This package works on the assumption that most working environments can be recreated by a combination of R packages and sourcing a set of function definitions.
Every context has a "root"; this is the directory that everything will be saved in. Most of the examples in the help use `contexts`, which is fairly self-explanatory, but it can be any directory. Generally it will be in the current directory.
root <- "contexts"
This directory is going to get large over time and will eventually need to be deleted. Don't treat these as archival storage - more as long-lived temporary directories and don't be afraid to create a new one and delete old ones when you've collected your results.
If you list packages as a character vector then all of those packages will be installed for you, and they will also be attached; this is what happens when you use the function `library()`. So, for example, if you need to depend on the `rstan` and `dplyr` packages you could write:
ctx <- context::context_save(root, packages = c("rstan", "dplyr"))
Attaching packages is not always what is wanted, especially if you have packages that clobber functions in base packages (e.g., `dplyr`!). An alternative is to list a set of packages that you want installed, split into packages you would like attached and packages you would only like loaded:
```r
packages <- list(loaded = "rstan", attached = "dplyr")
ctx <- context::context_save(root, packages = packages)
```
In this case, the packages in the `loaded` section will be installed (along with their dependencies) and, before anything runs, we will run `loadNamespace` on them to confirm that they are properly available. Access functions from these packages with the double-colon operator, like `rstan::stanc`; they will not be attached, so they will not modify the search path. In contrast, packages listed in `attached` will be loaded with `library`, so they will be available without qualification (e.g., `select` and `dplyr::select` will both work).
If you define any of your own functions you will need to tell the cluster about them. The easiest way to do this is to save them in a file that contains only function definitions (and does not read data, etc).
For example, I have a file `mysources.R` with a very simple simulation in it. Imagine this is some slow function that, given an integer `n_steps`, yields (after a bunch of calculation) a random walk of `n_steps` steps starting from point `x`.
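A minimal sketch of what such a file might contain (a guess at the contents, since the real `mysources.R` is not shown on this page):

```r
## mysources.R -- function definitions only; no data loading etc.
random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, mean = x)  # each step perturbs the previous position
    ret[[i]] <- x
  }
  ret
}
```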
To set this up, we'd write:
ctx <- context::context_save(root, sources = "mysources.R")
`sources` can be a character vector, or `NULL`/`character(0)` if you have no sources (or you can just omit it).
If you depend on packages that are not on CRAN (e.g., your personal research code) you'll need to tell `context` where to find them with its `package_sources` argument.
If the packages are on GitHub and public, you can pass the GitHub username/repo pair, in `devtools` style:
context::context_save(..., package_sources = conan::conan_sources("mrc-ide/dust"))
As with `devtools`, you can use subdirectories, specific commits or tags in the specification.
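For example (hypothetical references, shown only to illustrate the spec syntax):

```r
# A specific tag or commit, "user/repo@ref" style:
conan::conan_sources("mrc-ide/dust@v0.9.0")
# A package in a subdirectory of a repository, "user/repo/subdir" style:
conan::conan_sources("mrc-ide/someproject/rpackage")
```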
Once a context has been created, we can create a queue with it. This is separate from the actual cluster queue, but will be our interface to it. Running this step takes a while because it installs all the packages that the cluster will need into the context directory.
obj <- didehpc::queue_didehpc(ctx)
If the above command does not throw an error, then you have successfully logged in and the cluster is ready to use.
When you first run `queue_didehpc` it will install Windows versions of all required packages within the context directory (here, "contexts"). This is necessary even when you are on Windows, because the cluster cannot see files that are on your computer.
`obj` is a weird sort of object, an `R6` class. It's a bit like a Python or Java class, if you've come from those languages. The thing you need to know is that the object is like a list and contains a number of functions that can be run by running `obj$functionname()`. These functions all act by side effect; they interact with a little database stored in the context root directory, or they communicate with the cluster using the web interface that Wes created.
obj
For documentation about the individual methods, see the help for `didehpc::queue_didehpc` and in `queuer` (much of this still needs writing!). For example, to see the overall cluster load you can run:
obj$cluster_load(TRUE)
(if you're on an ANSI-compatible terminal this will be in glorious Technicolor).
Before running a real job, let's test that everything works correctly by running the `sessionInfo` command on the cluster. When run locally, `sessionInfo` prints information about the state of your R session:
sessionInfo()
To run this on the cluster, we wrap it in `obj$enqueue`. This prevents the evaluation of the expression and instead organises it to be run on the cluster:
t <- obj$enqueue(sessionInfo())
We can then poll the cluster for results until it completes:
t$wait(100)
(see the next section for more information about this).
The important part to notice here is that the R "Platform" (second and third line) is Windows Server, as opposed to the host machine which is running Linux. If we had added packages to the context they would be shown too.
Let's run something more interesting now, by running the `random_walk` function defined in the `mysources.R` file.
As above, jobs are queued by running:
t <- obj$enqueue(random_walk(0, 10))
Like the queue object `obj`, task objects are R6 objects that can be used to get information and results back from the task.
t
You can get the task's status:
t$status()
...which will move from `PENDING` to `RUNNING` to `COMPLETE` or `ERROR`. You can get information on submission and running times, and you can try to get the result of running the task:
t$result()
This errors if the task is not yet complete.
The `wait` method, used above, is like `result`, but it will repeatedly poll for the task to be completed, for up to `timeout` seconds:
t$wait(100)
Once the task has completed, `t$result()` and `t$wait()` are equivalent:
t$result()
You can query the times of your tasks:
t$times()
which will show you when the task was submitted, started and stopped.
Every task creates a log:
t$log()
Warning messages and other output will be printed here. So if you include `message()`, `cat()` or `print()` calls in your task, they will appear between `start` and `end`.
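For example (a sketch, assuming `enqueue` is happy with a braced compound expression, and reusing `random_walk` from above):

```r
t <- obj$enqueue({
  message("simulating now")  # this line should show up in the task log
  random_walk(0, 5)
})
t$wait(100)
t$log()
```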
There is another bit of log that happens before this, containing information about getting the system started up. You should only need to look at this when a job seems to get stuck with status `PENDING` for ages:
obj$dide_log(t)
The queue knows which tasks it has created and you can list them:
obj$task_list()
The long identifiers are random and are long enough that collisions are unlikely.
Notice that the task ran remotely, but we never had to indicate which filename things were written to. There is a small database, based on `storr`, that holds all the information within the context root (here, "contexts"). This means you can close down R, later regenerate the `ctx` and `obj` objects, recreate the task objects, and re-get your results. At the same time, it provides the illusion that the cluster has passed an object directly back to you.
```r
id <- t$id
id
## [1] "40934f0a0d28ca7385b8eb201b1146b7"
```
```r
t2 <- obj$task_get(id)
t2$result()
```
There are two broad options here:

- apply a function over a vector of values with `$lapply`
- apply a function over rows of a data.frame of parameters with `$enqueue_bulk`

The second approach is more general, and `$lapply` is implemented using it.
Suppose we want to make a bunch of random walks of different lengths. This would involve mapping our `random_walk` function over a vector of sizes:
```r
sizes <- 3:8
grp <- obj$lapply(sizes, random_walk, x = 0)
```
By default, `$lapply` returns a "task_bundle" with an automatically generated name. You can customise the name with the `name` argument.
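For example (a sketch; the bundle name here is arbitrary):

```r
grp <- obj$lapply(sizes, random_walk, x = 0, name = "walks")
```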
In contrast to `lapply`, this is not blocking (i.e., submitting tasks and collecting the results is done asynchronously), but if you pass a `timeout` argument to `$lapply` then it will poll until the jobs are done, in the same way as `wait()`, below.
Get the status of all the jobs:
grp$status()
Wait until they are all complete, and get the results:
res <- grp$wait(120)
The other bulk interface is where you want to run a function over a combination of parameters. Suppose we wanted to run random walks of a number of lengths from a number of starting positions, in all combinations. We might enumerate the possibilities like:
pars <- expand.grid(x = c(-1, 0, 1), n_steps = c(5, 10))
We can submit this as a group of 6 jobs with `enqueue_bulk`. Here we add the `timeout` option, which makes this a blocking operation:
obj$enqueue_bulk(pars, random_walk, timeout = 120)
This has applied the function `random_walk` over each row of `pars`.
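In other words, it has queued the equivalent of (a sketch of the expansion):

```r
random_walk(x = -1, n_steps = 5)
random_walk(x = 0,  n_steps = 5)
# ...and so on, one task per row of pars (6 tasks in total)
```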
Suppose you fire off a bunch of jobs and realise that you have the wrong data or they're all going to fail - you can stop them fairly easily.
Here's a job that will run for an hour and return nothing:
t <- obj$enqueue(Sys.sleep(3600))
Wait for the job to start up:
```r
while (t$status() == "PENDING") {
  Sys.sleep(0.5)
}
```
Now that it's started, it can be cancelled with the `$unsubmit` method:
obj$unsubmit(t$id)
Unsubmitting multiple times is safe, and has no further effect:
obj$unsubmit(t$id)
Alternatively, you can use `obj$task_delete(t$id)`, which unsubmits the task and then deletes it.
Note that the task is not actually deleted (see below); you can still get at the expression:
t$expr()
but you cannot retrieve results:

```r
t$result()
```

The argument to `unsubmit` can be a vector. For example, if you had a task bundle `grp` you could unsubmit all members of the group with:

```r
obj$unsubmit(grp$ids)
```
Deleting tasks is supported but it isn't entirely encouraged. Not all of the functions behave well with missing tasks, so if you delete things and still have old task handles floating around you might get confusing results.
There is a delete method (`obj$task_delete`) that will delete tasks, first unsubmitting them if they have been submitted. It takes a vector of task ids as an argument.
If you are running tasks that can use more than one core, you can request more resources for your task and use process-level parallelism with the `parallel` package. To request 8 cores, you could run:
didehpc::didehpc_config(cores = 8)
When your task starts, 8 cores will be allocated to it and a `parallel` cluster will be created. You can use it with things like `parallel::parLapply`, specifying `cl` as `NULL`. So if within your cluster job you needed to apply a function `f` to each element of a list `x`, you could write:
```r
run_f <- function(x) {
  parallel::parLapply(NULL, x, f)
}
obj$enqueue(run_f(x))
```
The parallel bits can be embedded within larger blocks of code. All functions in `parallel` that take `cl` as a first argument can be used. You do not need to (and should not) set up the cluster yourself, as this will happen automatically as the job starts.
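Putting this together (a sketch; `f` and `x` are placeholders for your own function and data), the resource request is made when the queue is created:

```r
# Request 8 cores for tasks submitted through this queue
config <- didehpc::didehpc_config(cores = 8)
obj <- didehpc::queue_didehpc(ctx, config = config)
# Inside the task, parallel functions use the pre-created cluster (cl = NULL)
t <- obj$enqueue(run_f(x))
```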
Alternatively, if you want to control cluster creation (e.g., you are using software that does this for you), then pass `parallel = FALSE` to the config call:
didehpc::didehpc_config(cores = 8, parallel = FALSE)
In this case you are responsible for setting up the cluster.
As an alternative to requesting cores, you can use a different job template:
didehpc::didehpc_config(template = "16Core")
which will reserve you the entire node. Again, a cluster will be started with all available cores unless you also specify `parallel = FALSE`.
If you have thousands and thousands of jobs to submit at once, you may not want to flood the cluster with them all at once. Each job submission is relatively slow (the HPC tools that the web interface has to use are relatively slow). The actual queue that the cluster uses doesn't seem to like processing tens of thousands of jobs, and can slow down. And if you take up the whole cluster, someone may come and knock on your office door and complain at you. At the same time, batching your jobs up into little bits and manually sending them off is a pain, and work better done by a computer.
An alternative is to submit a set of "workers" to the cluster, and then submit jobs to them. This is done with the `rrq` package, along with a Redis server running on the cluster. See the "workers" vignette for details.
For all operating systems, if you are on the wireless network you will need to connect to the VPN. If you can get on a wired network you'll likely have a better time, because the VPN and wireless network seem less stable in general. Instructions for setting up the VPN are here.
Your network drives are likely already mapped for you. In fact, you should not even need to map drives, as fully qualified network names (e.g. `//fi--didef3/tmp`) should work for you.
In Finder, go to `Go -> Connect to Server...` or press `Command-K`. In the address field write the name of the share you want to connect to. Useful ones are:

- `smb://fi--san03.dide.ic.ac.uk/homes/<username>` -- your home share
- `smb://fi--didef3.dide.ic.ac.uk/tmp` -- the temporary share

At some point in the process you should get prompted for your username and password, but I can't remember what that looks like.
These directories will be mounted at `/Volumes/<username>` and `/Volumes/tmp` (so the last bit of the filename will be used as the mountpoint within `Volumes`). There may be a better way of doing this, and the connection will not be re-established automatically, so if anyone has a better way let me know.
This is what I have done for my computer and it seems to work, though it's not incredibly fast. Full instructions are on the Ubuntu community wiki.
First, install `cifs-utils`:
sudo apt-get install cifs-utils
In your `/etc/fstab` file, add:

```
//fi--san03/homes/<dide-username> <home-mount-point> cifs uid=<local-userid>,gid=<local-groupid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp <tmp-mount-point> cifs uid=<local-userid>,gid=<local-groupid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,vers=2.0,sec=ntlmssp,iocharset=utf8 0 0
```
where:
- `<dide-username>` is your DIDE username, without the `DIDE\` bit
- `<local-username>` is your local username (i.e., `echo $USER`)
- `<local-userid>` is your local numeric user id (i.e., `id -u $USER`)
- `<local-groupid>` is your local numeric group id (i.e., `id -g $USER`)
- `<home-mount-point>` is where you want your DIDE home directory mounted
- `<tmp-mount-point>` is where you want the DIDE temporary directory mounted

Please back this file up before editing.
So for example, I have:
```
//fi--san03/homes/rfitzjoh /home/rich/net/home cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp /home/rich/net/temp cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
```
The file `.smbcredentials` contains:

```
username=<dide-username>
password=<dide-password>
```

Set its permissions to 600 (`chmod 600 ~/.smbcredentials`) for a modicum of security, but be aware that your password is stored in plaintext.
This setup is clearly insecure. I believe that if you omit the credentials line, the system will prompt you for a password interactively, but I'm not sure how that works with automatic mounting.
Finally, run

```
sudo mount -a
```

to mount all drives; with any luck it will all work, and you won't have to do this again until you get a new computer.