The goal of slurmworkflow is to provide a way to run complex multi-step computations on a slurm-equipped HPC.
Definitions:

- sbatch: the slurm command used to submit jobs to the cluster.
- workflow: a set of steps to be run on the HPC.

By default the steps are run sequentially, but slurmworkflow provides tools for changing the execution order. This allows conditional execution of steps and loop-like behavior.
In this vignette we walk through the creation of a four-step workflow showcasing the main utilities provided by slurmworkflow.
Any HPC using slurm as its workload manager should work with slurmworkflow.

HPC tested: the RSPH HPC (Emory University).
We highly recommend using renv when working with an HPC.
```r
library(slurmworkflow)

wf <- create_workflow(
  wf_name = "test_slurmworkflow",
  default_sbatch_opts = list(
    "partition" = "epimodel",
    "mail-type" = "FAIL",
    "mail-user" = "user@emory.edu"
  )
)
```
First we create a new workflow called "test_slurmworkflow" and store a summary of it in the `wf` object. The second argument specifies that by default each step should run on the "epimodel" partition and send an email to "user@emory.edu" if the step fails.

Calling `create_workflow()` results in the creation of the workflow directory "workflows/test_slurmworkflow/", which contains the code to send to the HPC. A workflow summary is returned and stored in the `wf` variable. We'll use it to add elements to the workflow.
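The exact contents of this directory depend on the slurmworkflow version, but it contains at least the launcher script and, once the workflow has run, a log directory; both are used later in this vignette. As a rough sketch:

```
workflows/test_slurmworkflow/
├── start_workflow.sh   # used later to launch the workflow from the HPC
└── log/                # where the slurm logs of each step are written
```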
The first step we use in most of our workflows ensures that our local project and the HPC are in sync.
```r
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_bash_lines(c(
    "git pull",
    ". /projects/epimodel/spack/share/spack/setup-env.sh",
    "spack load r@4.2.1",
    "Rscript -e \"renv::init(bare = TRUE)\"",
    "Rscript -e \"renv::restore()\""
  )),
  sbatch_opts = list(
    "mem" = "16G",
    "cpus-per-task" = 4,
    "time" = 120
  )
)
```
The `add_workflow_step()` function takes three arguments:

- `wf_summary`: the object we made with `create_workflow()`, indicating onto which workflow we want to add a step.
- `step_tmpl`: a step template, a helper function defining what to run on the HPC (more on this later).
- `sbatch_opts`: arguments to be passed to `sbatch`. Here we specify that we want 16GB of RAM, 4 CPUs and that the job should not take more than 120 minutes. The default options defined in `create_workflow()` will also be used.

The step template we are using here, `step_tmpl_bash_lines()`, takes a vector of bash lines to be run by `sbatch` on the HPC.
Here we tell the step to:

1. run `git pull`
2. load our own version of spack and load the `r@4.2.1` module
3. ensure that `renv` is initialized in the project (no effect if it already is)
4. update the packages to match the "renv.lock" file
As we usually want to run R code and not bash, slurmworkflow provides step templates simplifying this process.

On HPCs, R is usually not available directly. On the RSPH HPC we use spack to manage our modules. Therefore, we store the lines used to set up R on the HPC in a variable, as they will be used by all R step templates.
```r
setup_lines <- c(
  ". /projects/epimodel/spack/share/spack/setup-env.sh",
  "spack load r@4.2.1"
)
```
Our next step will run the following script on the HPC.
```r
# filename: R/01-test_do_call.R
cat(paste0("var1 = ", var1, ", var2 = ", var2))

if (!file.exists("did_run")) {
  file.create("did_run")
  current_step <- slurmworkflow::get_current_workflow_step()
  slurmworkflow::change_next_workflow_step(current_step)
} else {
  file.remove("did_run")
}
```
This very simple script prints the content of `var1` and `var2` to the standard output. Note that these variables are never declared in the script. We will pass them as arguments to the step template.
The second part checks for the existence of a file called "did_run". If it does not exist yet, it's created and we instruct slurmworkflow to change the next step to the current step. This is how you make a loop in slurmworkflow.
If the file exists, which means that it's the second time this step is run, it removes it. In this case `change_next_workflow_step()` is not called and the workflow will just continue to the next step.
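The same mechanism can drive a fixed number of repetitions. Below is a minimal sketch using a counter file; the file name "loop_counter" and the limit of 3 are hypothetical, and only `get_current_workflow_step()` and `change_next_workflow_step()` come from slurmworkflow:

```r
# Hypothetical variation: repeat this step 3 times using a counter file.
n_max <- 3
counter <- if (file.exists("loop_counter")) readRDS("loop_counter") else 0
counter <- counter + 1

if (counter < n_max) {
  # not done yet: save the counter and re-run this step next
  saveRDS(counter, "loop_counter")
  current_step <- slurmworkflow::get_current_workflow_step()
  slurmworkflow::change_next_workflow_step(current_step)
} else {
  # done: clean up and let the workflow move on
  file.remove("loop_counter")
}
```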
Let's now see how we add this script as a workflow step.
```r
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call_script(
    r_script = "R/01-test_do_call.R",
    args = list(var1 = "ABC", var2 = "DEF"),
    setup_lines = setup_lines
  ),
  sbatch_opts = list(
    "cpus-per-task" = 1,
    "time" = "00:10:00",
    "mem" = "4G"
  )
)
```
As before we use the `add_workflow_step()` function, but we change the `step_tmpl` to use `step_tmpl_do_call_script()` with 3 arguments:

- `r_script`: the path to the script to be run, here "R/01-test_do_call.R". Note that this path must be valid on the HPC.
- `args`: a list of variables that will be available to the step. These are the `var1` and `var2` that were missing from the script.
- `setup_lines`: some bash code to be run before sourcing the script. These are the lines used to load the R module that we defined earlier.

For the `sbatch` options, we ask here for 1 CPU, 4GB of RAM and a maximum of 10 minutes.
One common task on an HPC is to run the same code many times, varying only the value of some arguments. In a regular R session, `lapply()`, `Map()` and `mapply()` are available for this purpose.

slurmworkflow provides `step_tmpl_map_script()` to run a script with a syntax similar to the `Map()` function. This creates an array job where the elements of the inputs are processed in parallel.
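For intuition only, here is what the equivalent `Map()` call would look like in a regular R session (this is an illustration, not part of the workflow):

```r
# plain R analogy of what the step template will distribute over a slurm array
Map(
  function(iterator1, iterator2) paste(iterator1, iterator2),
  seq_len(5),     # iterator1: 1 to 5
  seq_len(5) + 5  # iterator2: 6 to 10
)
```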
First let's take a look at the script to be run.
```r
# filename: R/02-test_map.R
library(future.apply)
plan(multicore, workers = ncores)

future_lapply(seq_len(ncores), function(i) {
  msg <- paste0(
    "On core: ", i, "\n",
    "iterator1: ", iterator1, "\n",
    "iterator2: ", iterator2, "\n",
    "var1 = ", var1, ", var2 = ", var2, "\n\n"
  )
  cat(msg)
})
```
This script needs 4 undeclared variables:

- `iterator1` and `iterator2`: varying values
- `ncores`, `var1` and `var2`: fixed values shared by all replications

As before, these values will be set by the step template. In this script we print the message in parallel over `ncores` cores.
Now let's add this step to the workflow.
```r
cores_to_use <- 2

wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_map_script(
    r_script = "R/02-test_map.R",
    # arguments passed to the script
    iterator1 = seq_len(5),
    iterator2 = seq_len(5) + 5,
    MoreArgs = list(
      ncores = cores_to_use,
      var1 = "IJK",
      var2 = "LMN"
    ),
    setup_lines = setup_lines,
    max_array_size = 2
  ),
  sbatch_opts = list(
    "cpus-per-task" = cores_to_use,
    "time" = "00:10:00",
    "mem-per-cpu" = "4G"
  )
)
```
`step_tmpl_map_script()` takes an `r_script` argument similar to `step_tmpl_do_call_script()`. The next two arguments, `iterator1` and `iterator2`, will be iterated over using `sbatch` arrays. Each replication of the job will only receive one value of each (1-6, 2-7, 3-8, 4-9 and 5-10). Similar to `Map()`, the `MoreArgs` argument defines variables to be shared across replications.

A new argument, `max_array_size`, has been set to 2. This prevents the array jobs from being larger than 2. In our case, we get submissions of array sizes 2, 2 and 1. In a real analysis the value would be around 500. This argument prevents slurm from refusing a job submission because of the size of the array (a limit of 1000 submissions is common). With EpiModel we have already had cases where 30000 array jobs needed to be run. This template simply submits them in sequential chunks of 500.
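To illustrate the chunking idea in plain R (this is not the package's internal code, just how 5 replications are grouped when `max_array_size = 2`):

```r
# groups of at most 2 replications -> array submissions of sizes 2, 2 and 1
split(seq_len(5), ceiling(seq_len(5) / 2))
#> $`1`
#> [1] 1 2
#>
#> $`2`
#> [1] 3 4
#>
#> $`3`
#> [1] 5
```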
In the `sbatch_opts` we specified `mem-per-cpu = "4G"`. This means that if we change the `cores_to_use` value, the requested memory will scale as well.
To recap, this step will submit an array of 5 jobs, each receiving a different value for `iterator1` and `iterator2`. Each of these jobs will run over `cores_to_use` cores. We use this approach with EpiModel, where we run huge arrays of jobs and each job is a set of around 30 parallel simulations. We therefore have 2 levels of parallelization: one in slurm and one in the script itself.
Sometimes we want to run a simple function directly without storing it in an R script. `step_tmpl_do_call()` and `step_tmpl_map()` do exactly that for one-off functions and `Map()`s.
```r
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call(
    what = function(var1, var2) {
      cat(paste0("var1 = ", var1, ", var2 = ", var2))
    },
    args = list(var1 = "XYZ", var2 = "UVW"),
    setup_lines = setup_lines
  ),
  sbatch_opts = list(
    "cpus-per-task" = 1,
    "time" = "00:10:00",
    "mem" = "4G",
    "mail-type" = "END"
  )
)
```
The syntax of these two templates is almost identical to the previous two that we discussed. The main difference is the first argument, where we pass a function instead of a path to a script.

Note: the function will be run in a clean R session on the HPC. All the values used by the function must be either created by it, loaded by it or passed as arguments.
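For reference, here is a sketch of what a `step_tmpl_map()` call could look like. The argument names are assumptions: we take them to mirror `step_tmpl_map_script()` with a function in place of the script path, so check the package documentation for the exact signature. This template is not added to our example workflow:

```r
# Sketch only, not added to the example workflow.
# Argument names are assumptions; see ?step_tmpl_map for the real signature.
tmpl_map_example <- step_tmpl_map(
  FUN = function(iterator1, iterator2) {
    cat("iterator1 =", iterator1, "- iterator2 =", iterator2, "\n")
  },
  iterator1 = seq_len(5),
  iterator2 = seq_len(5) + 5,
  setup_lines = setup_lines
)
```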
Finally, as this will be our last step, we override the `mail-type` sbatch option to receive an email when this step finishes, whatever the outcome. This way we receive an email telling us that the workflow is finished.
Now that our workflow is created, how do we actually run the code on the HPC?
We assume that we are working on a project called "test_proj", that this project was cloned on the HPC at the following path: "~/projects/test_proj" and that the "~/projects/test_proj/workflows/" directory exists.
The following commands are to be run from your local computer.
MacOS or GNU/Linux
```sh
# bash - local
scp -r workflows/test_slurmworkflow <user>@clogin01.sph.emory.edu:projects/test_proj/workflows/
```
Windows
```sh
# bash - local
set DISPLAY=
scp -r workflows\test_slurmworkflow <user>@clogin01.sph.emory.edu:projects/test_proj/workflows/
```
Forgetting `set DISPLAY=` will prevent `scp` from working correctly when using the RStudio terminal.

Note that the path is `workflows\test_slurmworkflow`: Windows uses back-slashes for directories and Unix OSes use forward-slashes.
For this step, you must be at the command line on the HPC. This means that you have run `ssh <user>@clogin01.sph.emory.edu` from your local computer (run `set DISPLAY=` on Windows first if you get the error `ssh_askpass: posix_spawnp: No such file or directory`).

You also need to be at the root directory of the project (where the ".git" folder is, as well as the "renv.lock" file). In this example you would get there by running `cd ~/projects/test_proj`. The following steps will not work if you are not at the root of your project.
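For reference, these two commands in the same convention as the other blocks:

```sh
# bash - local
ssh <user>@clogin01.sph.emory.edu
```

```sh
# bash - hpc
cd ~/projects/test_proj
```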
Running the workflow is done by executing the file "workflows/test_slurmworkflow/start_workflow.sh" with the following command:
```sh
# bash - hpc
./workflows/test_slurmworkflow/start_workflow.sh
```
If the workflow was uploaded from Windows, the script may not be executable. You can fix this with the following command:
```sh
# bash - hpc
chmod +x workflows/test_slurmworkflow/start_workflow.sh
```
The workflow will not work if you source the file (with `source <script>` or `. <script>`).
You can check the state of your running workflow as usual with `squeue -u <user>`.
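In the convention used above:

```sh
# bash - hpc
squeue -u <user>
```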
The logs for the workflows are in "workflows/test_slurmworkflow/log/".
This start script additionally allows you to start a workflow at a specific step with the `-s` argument.
```sh
# bash - hpc
./workflows/test_slurmworkflow/start_workflow.sh -s 3
```
This will start the workflow at the 3rd step, skipping steps 1 and 2.
It is sometimes desirable to start the workflow from outside of the project it has to run on. The `-d` argument allows you to set a different working directory for the workflow.
```sh
# bash - hpc
cd /
~/projects/test_proj/workflows/test_slurmworkflow/start_workflow.sh -d ~/projects/test_proj
```
The previous block places us at the root of the file system with `cd /`. Then we call the "start_workflow.sh" script using its absolute path and specify that the working directory for the workflow must be the root of the project.

Remember that for `renv` to work, R must be called from the directory where the ".Rprofile" file is. This is also the directory where you can find the "renv.lock" file.
slurmworkflow is a very low-level package providing only basic building blocks for complex HPC computation.

This package is used for EpiModel applied projects through higher-level functions in EpiModelHPC and swfcalib, an automated calibration system used for our models (work in progress).