Parallel computing on a cluster can be more challenging than running things locally because it's often the first time that you need to package up code to run elsewhere, and when things go wrong it's more difficult to get information on why things failed.
Much of the difficulty of getting things running involves working out what your code depends on, and getting that installed in the right place on a computer that you can't physically poke at. The next set of problems is dealing with the ballooning set of files that end up being created - templates, scripts, output files, etc.
This set of packages (`didehpc`, `queuer` and `context`, along with a couple of support packages, `conan`, `rrq` and `storr`) aims to remove the pain of getting everything set up, getting cluster tasks running, and retrieving your results.
Once everything is set up, running a job on the cluster should be as straightforward as running things locally.
The documentation here runs through a few of the key concepts, then walks through setting this all up. There's also a "quick start" guide that contains much less discussion.
The biggest conceptual move is from thinking about running scripts that generate files to running functions that return objects. The reason for this is that it gives a well-defined interface to build everything else around.
The problem with scripts is that they might do almost anything. They depend on untold files and packages, which they load from wherever; they produce any number of objects. That's fine, but it makes them hard to reason about: to plan deploying them elsewhere, to capture the outputs appropriately, or to orchestrate looping over a bunch of parameter values. If you've ever found yourself writing a series of near-identical script files, changing values by text substitution, you have run into this.
In contrast, functions do (ideally) one thing. They have a well defined set of inputs (their arguments) and outputs (their return value). We can loop over a range of input values by iterating over a set of arguments.
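For example (a toy sketch; the function and values are invented for illustration), instead of a script that hard-codes a parameter, write a function and map over the values you care about:

```r
# A function with explicit inputs and an explicit return value...
simulate_one <- function(n) {
  mean(rnorm(n))
}
# ...can be looped over a range of inputs without any text substitution:
results <- lapply(c(10, 100, 1000), simulate_one)
```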
This set of packages tends to work best if you let it look after filenames. Rather than trying to come up with a naming scheme for different files based on parameter values, just return objects and the packages will arrange for them to be saved and reloaded.
The DIDE cluster needs everything to be available on a filesystem that the cluster can read. Practically this means the filesystems `//fi--didef3.dide.ic.ac.uk/tmp` or `//fi--san03.dide.ic.ac.uk/homes/username` and the like.
You probably have access to network shares that are specific to a project, too. For Windows users these are probably mapped to drives (`Q:` or `T:` or similar) already, but for other platforms you will need to do a little extra work to get things set up (see below).
It is simplest if everything that is needed for a project is present in a single directory that is visible on the cluster. For most of this document I will assume that everything is in one directory, which is on a network share.
IMPORTANT: If you are not sure whether you are working on a network share, run `getwd()`; on Windows the drive letter should be something like `Q:` or another letter that represents a network drive. If it says `C:` or similar, nothing below here will work.
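For example, a quick sanity check (illustrative only; your drive letter and path will differ):

```r
# On Windows this should print a path on a network drive (e.g., "Q:/..."),
# not a local drive such as "C:/..."
getwd()
```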
The initial setup will feel like a headache at first, but it should ultimately take only a few lines. Once everything is set up, the payback is that the job-submission part becomes a lot simpler.
Install the packages using `drat`:
```r
# install.packages("drat") # if you don't have it already
drat:::add("mrc-ide")
install.packages("didehpc")
```
Be sure to run this in a fresh session.
The configuration is handled in a two-stage process. First, some bits that are machine-specific are set using `options`, with option names that are prefixed with `didehpc`. Then, when a queue is created, further values can be passed along via the `config` argument, which will use the "global" options as defaults.
The reason for this separation is that ideally the machine-specific options will not end up in scripts, because that makes things less portable (for example, we need to get your username, but your username is unlikely to work for your collaborators).
Ideally, in your `~/.Rprofile` file, you will add something like:
```r
options(
  didehpc.username = "rfitzjoh",
  didehpc.home = "~/net/home")
```
and then set only the options (such as `cluster`, `cores` or `template`) that vary with a project.
If you use the "big" cluster, you can add didehpc.cluster =
"fi--didemrchnb"
here.
(to set this up, try running usethis::edit_r_profile()
)
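Putting those together, a complete `~/.Rprofile` entry might look like this (a sketch; substitute your own username and paths):

```r
options(
  didehpc.username = "rfitzjoh",
  didehpc.home = "~/net/home",
  didehpc.cluster = "fi--didemrchnb") # only if you use the big cluster
```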
Windows users will not need to provide anything unless they are on a non-domain machine or they are in the unfortunate situation of juggling multiple usernames across systems. Non-domain machines will need the credentials set as above.
Mac users will need to provide their username here as above.
If you have a Linux system and have configured your smb mounts as described below, you might as well take advantage of this and set `credentials = "~/.smbcredentials"`; then you will never be prompted for your password:

```r
options(didehpc.credentials = "~/.smbcredentials")
```
To see the configuration that will be used if you don't do anything (else), run:
didehpc::didehpc_config()
In here you can see the cluster (here, `fi--didemrchnb`), credentials and username, the job template (`GeneralNodes`), information about the resources that will be requested (1 core) and information on filesystem mappings. There are a few other bits of information that may be explained further down. The possible options are explained further in the help for `didehpc::didehpc_config`. If you request help, we will almost always want to see this!
If you refer to network shares in your functions, e.g., to refer to data, you'll need to map these too. To do that, pass them as the `shares` argument to `didehpc::didehpc_config`.

To describe each share, use the `didehpc::path_mapping` function, which takes arguments:

- `name`: a descriptive name for the share
- `path_local`: the point where the share is mounted on your computer
- `path_remote`: the network path that the share refers to (forward slashes are much easier to enter here than backward slashes)
- `drive_remote`: the drive this should be mapped to on the cluster

So to map your "M drive", which points at `\\fi--didef3.dide.ic.ac.uk\malaria`, to `M:` on the cluster you can write:
```r
share <- didehpc::path_mapping("malaria", "M:",
                               "//fi--didef3.dide.ic.ac.uk/malaria", "M:")
config <- didehpc::didehpc_config(shares = share)
```
If you have more than one share to map, pass them through as a list (e.g., `didehpc::didehpc_config(shares = list(share1, share2, ...))`).
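For example (a sketch; the `tmp` share and its local/remote `T:` drive letters are arbitrary choices for illustration):

```r
share_m <- didehpc::path_mapping("malaria", "M:",
                                 "//fi--didef3.dide.ic.ac.uk/malaria", "M:")
share_t <- didehpc::path_mapping("tmp", "T:",
                                 "//fi--didef3.dide.ic.ac.uk/tmp", "T:")
config <- didehpc::didehpc_config(shares = list(share_m, share_t))
```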
For most systems we hope that `didehpc` will do a reasonable job of detecting the shares that you are running on, so this should (hopefully) only be necessary for mapping additional shares. The issue there is that you'll need to use absolute paths to refer to the resources, and that's going to complicate things...
To recreate your work environment on the cluster, we use a package called `context`. This package works on the assumption that most working environments can be recreated by a combination of R packages and sourcing a set of function definitions.
Every context has a "root"; this is the directory that everything will be saved in. Most of the examples in the help use `contexts`, which is fairly self-explanatory, but it can be any directory. Generally it will be in the current directory.
root <- "contexts"
This directory is going to get large over time and will eventually need to be deleted. Don't treat these as archival storage - more as long-lived temporary directories and don't be afraid to create a new one and delete old ones when you've collected your results.
If you list packages as a character vector then all of those packages will be installed for you, and they will also be attached; this is what happens when you use the function `library()`. So, for example, if you need to depend on the `rstan` and `dplyr` packages you could write:
ctx <- context::context_save(root, packages = c("rstan", "dplyr"))
Attaching packages is not always what is wanted, especially if you have packages that clobber functions in base packages (e.g., `dplyr`!). An alternative is to list a set of packages that you want installed, split into packages you would like attached and packages you would only like loaded:
```r
packages <- list(loaded = "rstan", attached = "dplyr")
ctx <- context::context_save(root, packages = packages)
```
In this case, the packages in the `loaded` section will be installed (along with their dependencies) and, before anything runs, we will run `loadNamespace` on them to confirm that they are properly available. Access functions from these packages with the double-colon operator, like `rstan::stanc`; they will not be attached, so they will not modify the search path. In contrast, packages listed in `attached` will be loaded with `library`, so they will be available without qualification (e.g., `select` and `dplyr::select` will both work).
If you define any of your own functions you will need to tell the cluster about them. The easiest way to do this is to save them in a file that contains only function definitions (and does not read data, etc).
For example, I have a file `mysources.R` with a very simple simulation in it. Imagine this is some slow function that, given an integer `n_steps`, yields (after a bunch of calculation) a random walk of `n_steps` steps starting from point `x`.
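A minimal sketch of what such a file might contain (a guess at the contents, since the real `mysources.R` is not shown on this page):

```r
## mysources.R -- function definitions only; no data loading etc.
random_walk <- function(x, n_steps) {
  ret <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    x <- rnorm(1, mean = x)  # each step perturbs the previous position
    ret[[i]] <- x
  }
  ret
}
```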
To set this up, we'd write:
ctx <- context::context_save(root, sources = "mysources.R")
`sources` can be a character vector, or `NULL`/`character(0)` if you have no sources (or you can just omit it).
If you depend on packages that are not on CRAN (e.g., your personal research code) you'll need to tell `context` where to find them with its `package_sources` argument.
If the packages are on GitHub and public, you can pass the GitHub username/repo pair, in `devtools` style:
context::context_save(..., package_sources = conan::conan_sources("mrc-ide/dust"))
As with `devtools`, you can use subdirectories, specific commits or tags in the specification.
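For example (hypothetical references, shown only to illustrate the spec syntax):

```r
# A specific tag or commit, "user/repo@ref" style:
conan::conan_sources("mrc-ide/dust@v0.9.0")
# A package in a subdirectory of a repository, "user/repo/subdir" style:
conan::conan_sources("mrc-ide/someproject/rpackage")
```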
Once a context has been created, we can create a queue with it. This is separate from the actual cluster queue, but will be our interface to it. Running this step takes a while because it installs all the packages that the cluster will need into the context directory.
obj <- didehpc::queue_didehpc(ctx)
If the above command does not throw an error, then you have successfully logged in and the cluster is ready to use.
When you first run `queue_didehpc` it will install Windows versions of all required packages within the context directory (here, "contexts"). This is necessary even when you are on Windows, because the cluster cannot see files that are on your computer.
`obj` is a weird sort of object, an `R6` class. It's a bit like a Python or Java class, if you've come from those languages. The thing you need to know is that the object is like a list and contains a number of functions that can be run by running `obj$functionname()`. These functions all act by side effect; they interact with a little database stored in the context root directory, or they communicate with the cluster using the web interface that Wes created.
obj
For documentation about the individual methods, see the help for `didehpc::queue_didehpc` and in `queuer` (much of this still needs writing!). For example, to see the overall cluster load you can run:
obj$cluster_load(TRUE)
(if you're on an ANSI-compatible terminal this will be in glorious Technicolor).
Before running a real job, let's test that everything works correctly by running the `sessionInfo` command on the cluster. When run locally, `sessionInfo` prints information about the state of your R session:
sessionInfo()
To run this on the cluster, we wrap it in `obj$enqueue`. This prevents the evaluation of the expression and instead organises it to be run on the cluster:
t <- obj$enqueue(sessionInfo())
We can then poll the cluster for results until it completes:
t$wait(100)
(see the next section for more information about this).
The important part to notice here is that the R "Platform" (second and third line) is Windows Server, as opposed to the host machine which is running Linux. If we had added packages to the context they would be shown too.
Let's run something more interesting now, by running the `random_walk` function defined in the `mysources.R` file.
As above, jobs are queued by running:
t <- obj$enqueue(random_walk(0, 10))
Like the queue object `obj`, task objects are R6 objects that can be used to get information and results back from the task.
t
You can get the task's status:
t$status()
...which will move from `PENDING` to `RUNNING` to `COMPLETE` or `ERROR`. You can get information on submission and running times, and you can try to get the result of running the task:
t$result()
This errors if the task is not yet complete.
The `wait` method, used above, is like `result`, but it will repeatedly poll for the task to be completed, for up to `timeout` seconds:
t$wait(100)
Once the task has completed, `t$result()` and `t$wait()` are equivalent:
t$result()
You can query the times of your tasks:
t$times()
which will show you when the task was submitted, started and stopped.
Every task creates a log:
t$log()
Warning messages and other output will be printed here. So if you include `message()`, `cat()` or `print()` calls in your task, they will appear between `start` and `end`.
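For example (a sketch, assuming `enqueue` is happy with a braced compound expression, and reusing `random_walk` from above):

```r
t <- obj$enqueue({
  message("simulating now")  # this line should show up in the task log
  random_walk(0, 5)
})
t$wait(100)
t$log()
```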
There is another bit of log that happens before this, containing information about getting the system started up. You should only need to look at this when a job seems to get stuck with status `PENDING` for ages:
obj$dide_log(t)
The queue knows which tasks it has created and you can list them:
obj$task_list()
The long identifiers are random and are long enough that collisions are unlikely.
Notice that the task ran remotely, but we never had to indicate which filename things were written to. There is a small database, based on `storr`, that holds all the information within the context root (here, "contexts"). This means you can close down R, later regenerate the `ctx` and `obj` objects, recreate the task objects, and re-get your results. At the same time, it provides the illusion that the cluster has passed an object directly back to you.
```r
id <- t$id
id
## [1] "40934f0a0d28ca7385b8eb201b1146b7"
```
```r
t2 <- obj$task_get(id)
t2$result()
```
There are two broad options here:

- apply a function over a vector of values with `$lapply`
- apply a function over rows of a data.frame of parameters with `$enqueue_bulk`

The second approach is more general, and `$lapply` is implemented using it.
Suppose we want to make a bunch of random walks of different lengths. This would involve mapping our `random_walk` function over a vector of sizes:
```r
sizes <- 3:8
grp <- obj$lapply(sizes, random_walk, x = 0)
```
By default, `$lapply` returns a "task_bundle" with an automatically generated name. You can customise the name with the `name` argument.
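For example (a sketch; the bundle name here is arbitrary):

```r
grp <- obj$lapply(sizes, random_walk, x = 0, name = "walks")
```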
In contrast to `lapply`, this is not blocking (i.e., submitting tasks and collecting the results is done asynchronously), but if you pass a `timeout` argument to `$lapply` then it will poll until the jobs are done, in the same way as `wait()`, below.
Get the status of all the jobs:
grp$status()
Wait until they are all complete, and get the results:
res <- grp$wait(120)
The other bulk interface is where you want to run a function over a combination of parameters. Suppose we wanted to run random walks of a number of lengths from a number of starting positions, in all combinations. We might enumerate the possibilities like:
pars <- expand.grid(x = c(-1, 0, 1), n_steps = c(5, 10))
We can submit this as a group of 6 jobs with `enqueue_bulk`. Here we add the `timeout` option, which makes this a blocking operation:
obj$enqueue_bulk(pars, random_walk, timeout = 120)
This has applied the function `random_walk` over each row of `pars`.
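In other words, it has queued the equivalent of (a sketch of the expansion):

```r
random_walk(x = -1, n_steps = 5)
random_walk(x = 0,  n_steps = 5)
# ...and so on, one task per row of pars (6 tasks in total)
```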
Suppose you fire off a bunch of jobs and realise that you have the wrong data or they're all going to fail - you can stop them fairly easily.
Here's a job that will run for an hour and return nothing:
t <- obj$enqueue(Sys.sleep(3600))
Wait for the job to start up:
```r
while (t$status() == "PENDING") {
  Sys.sleep(0.5)
}
```
Now that it's started, it can be cancelled with the `$unsubmit` method:
obj$unsubmit(t$id)
Unsubmitting multiple times is safe, and has no further effect:
obj$unsubmit(t$id)
Alternatively, you can use `obj$task_delete(t$id)`, which unsubmits the task and then deletes it.
Note that the task is not actually deleted (see below); you can still get at the expression:
t$expr()
but you cannot retrieve results:

```r
t$result()
```

The argument to `unsubmit` can be a vector. For example, if you had a task bundle `grp` you could unsubmit all members of the group with:

```r
obj$unsubmit(grp$ids)
```
Deleting tasks is supported but it isn't entirely encouraged. Not all of the functions behave well with missing tasks, so if you delete things and still have old task handles floating around you might get confusing results.
There is a delete method (`obj$task_delete`) that will delete tasks, first unsubmitting them if they have been submitted. It takes a vector of task ids as an argument.
If you are running tasks that can use more than one core, you can request more resources for your task and use process-level parallelism with the `parallel` package. To request 8 cores, you could run:
didehpc::didehpc_config(cores = 8)
When your task starts, 8 cores will be allocated to it and a `parallel` cluster will be created. You can use it with things like `parallel::parLapply`, specifying `cl` as `NULL`. So if within your cluster job you needed to apply a function `f` to each element of a list `x`, you could write:
```r
run_f <- function(x) {
  parallel::parLapply(NULL, x, f)
}
obj$enqueue(run_f(x))
```
The parallel bits can be embedded within larger blocks of code. All functions in `parallel` that take `cl` as a first argument can be used. You do not need to (and should not) set up the cluster yourself, as this will happen automatically as the job starts.
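Putting this together (a sketch; `f` and `x` are placeholders for your own function and data), the resource request is made when the queue is created:

```r
# Request 8 cores for tasks submitted through this queue
config <- didehpc::didehpc_config(cores = 8)
obj <- didehpc::queue_didehpc(ctx, config = config)
# Inside the task, parallel functions use the pre-created cluster (cl = NULL)
t <- obj$enqueue(run_f(x))
```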
Alternatively, if you want to control cluster creation (e.g., you are using software that does this for you), then pass `parallel = FALSE` to the config call:
didehpc::didehpc_config(cores = 8, parallel = FALSE)
In this case you are responsible for setting up the cluster.
As an alternative to requesting cores, you can use a different job template:
didehpc::didehpc_config(template = "16Core")
which will reserve you the entire node. Again, a cluster will be started with all available cores unless you also specify `parallel = FALSE`.
If you have thousands and thousands of jobs to submit at once, you may not want to flood the cluster with them all at once. Each job submission is relatively slow (the HPC tools that the web interface has to use are relatively slow). The actual queue that the cluster uses doesn't seem to like processing tens of thousands of jobs, and can slow down. And if you take up the whole cluster, someone may come and knock on your office door and complain at you. At the same time, batching your jobs up into little bits and manually sending them off is a pain, and work better done by a computer.
An alternative is to submit a set of "workers" to the cluster, and then submit jobs to them. This is done with the `rrq` package, along with a Redis server running on the cluster. See the "workers" vignette for details.
For all operating systems, if you are on the wireless network you will need to connect to the VPN. If you can get on a wired network you'll likely have a better time, because the VPN and wireless network seem less stable in general. Instructions for setting up the VPN are here.
Your network drives are likely already mapped for you. In fact, you should not even need to map drives, as fully qualified network names (e.g. `//fi--didef3/tmp`) should work for you.
In Finder, go to `Go -> Connect to Server...` or press `Command-K`. In the address field write the name of the share you want to connect to. Useful ones are:

- `smb://fi--san03.dide.ic.ac.uk/homes/<username>` -- your home share
- `smb://fi--didef3.dide.ic.ac.uk/tmp` -- the temporary share

At some point in the process you should get prompted for your username and password, but I can't remember what that looks like.
These directories will be mounted at `/Volumes/<username>` and `/Volumes/tmp` (so the last bit of the filename will be used as the mountpoint within `Volumes`). There may be a better way of doing this, and the connection will not be re-established automatically, so if anyone has a better way let me know.
This is what I have done for my computer and it seems to work, though it's not incredibly fast. Full instructions are on the Ubuntu community wiki.
First, install `cifs-utils`:
sudo apt-get install cifs-utils
In your `/etc/fstab` file, add:

```
//fi--san03/homes/<dide-username> <home-mount-point> cifs uid=<local-userid>,gid=<local-groupid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp <tmp-mount-point> cifs uid=<local-userid>,gid=<local-groupid>,credentials=/home/<local-username>/.smbcredentials,domain=DIDE,vers=2.0,sec=ntlmssp,iocharset=utf8 0 0
```
where:
- `<dide-username>` is your DIDE username, without the `DIDE\` bit
- `<local-username>` is your local username (i.e., `echo $USER`)
- `<local-userid>` is your local numeric user id (i.e., `id -u $USER`)
- `<local-groupid>` is your local numeric group id (i.e., `id -g $USER`)
- `<home-mount-point>` is where you want your DIDE home directory mounted
- `<tmp-mount-point>` is where you want the DIDE temporary directory mounted

Please back this file up before editing.
So for example, I have:
```
//fi--san03/homes/rfitzjoh /home/rich/net/home cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
//fi--didef3/tmp /home/rich/net/temp cifs uid=1000,gid=1000,credentials=/home/rich/.smbcredentials,domain=DIDE,sec=ntlmssp,iocharset=utf8 0 0
```
The file `.smbcredentials` contains:

```
username=<dide-username>
password=<dide-password>
```

Set its permissions to 600 (`chmod 600 ~/.smbcredentials`) for a modicum of security, but be aware that your password is stored in plaintext.
This setup is clearly insecure. I believe that if you omit the credentials line, the system will prompt you for a password interactively, but I'm not sure how that works with automatic mounting.
Finally, run

```
sudo mount -a
```

to mount all drives; with any luck it will all work, and you won't have to do this again until you get a new computer.