`slurmworkflow` is a package to construct workflows on a SLURM-equipped High Performance Computer (HPC). In this vignette, a workflow refers to a set of tasks to be executed on the HPC, one after the other.
We will describe how to construct and use workflows using the EpiModel/EpiModelHIV-Template project.
This project uses renv and requires access to the EpiModelHIV-p private repository. This vignette assumes that your project is hosted on a git repository checked out on your local computer and on the HPC.
This vignette will use the Rollins School of Public Health (RSPH) High Performance Computing cluster (HPC) from Emory University as an example.
The R scripts are all located in the "R/" subdirectory and use the following naming conventions:

- numbered scripts ("R/01-snake_case_name.R"): the analysis steps, meant to be run in order;
- utility scripts ("R/utils-snake_case_name.R"): sourced by the steps or workflows, they limit code repetition;
- workflow scripts ("R/workflow_01-snake_case_name.R"): they create the HPC workflows described in this vignette.

The "data/" directory contains the project data; some of its sub-directories (e.g. "data/intermediate/") are not tracked by git.

These applied projects aim to accurately represent the population of Men who have Sex with Men (MSM) in the Atlanta metro area and to simulate how the HIV epidemic would behave under different intervention scenarios.
A (massive) oversimplification of the project would be to see it as the following 3 steps:

1. estimate the contact networks;
2. calibrate the epidemic model;
3. simulate the intervention scenarios.
All the numbered scripts ("R/01-snake_case_name.R") are meant to be executed on your local computer. They allow you to explore all the steps and sub-steps on small networks with few replications. Most of them define a `context` variable at the top. This variable takes the value "local" or "hpc" and is used to set the context of the computation:
- "local": Uses small networks and few replications, unfit for publication.
This is used to test the code and processes.
- "hpc": Uses full size networks and many replications. This is used to produce
the analysis for publication. Once the code has been validated locally.
A lot of the numbered scripts are re-used by the workflows with the `context` variable set to "hpc". The goal is: if your scripts run locally, they should run on the HPC without modification.
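To make this concrete, here is a minimal sketch of how a numbered script can branch on `context` (an illustration only, not the actual code of the template scripts):

```r
# `context` is defined at the top when working locally,
# or pre-defined by the workflow step when running on the HPC.
if (context == "local") {
  # small and fast settings, only for testing the code
  estimation_method <- "Stochastic-Approximation"
  estimation_ncores <- 1
} else if (context != "hpc") {
  stop("The `context` variable must be \"local\" or \"hpc\"")
}
# with context == "hpc", variables like `estimation_method` are
# injected by the workflow (see the estimation step below)
```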
This project simulates HIV dynamics in a population of 100 000 individuals. The first step is to estimate 3 networks of 100 000 nodes representing respectively main, casual and one-off partnerships. This step is carried out in the script "R/01-networks_estimation.R". Afterwards we will want to diagnose the estimations using the script "R/02-networks_diagnostics.R" and finally explore these diagnostics interactively in the script "R/03-networks_diagnostics_explore.R".
You should run these 3 scripts locally and make sure you understand what they do.
Without modification, the `context` variable will be set to "local", producing fast "Stochastic-Approximation" estimations for 5k-node networks.
NOTE: You should run each script in a fresh R console each time to avoid starting with a polluted environment, which would lead to very complicated debugging. In RStudio: Ctrl-Shift-F10 or `.rs.restartR()` (aliased to `rs()` in the project). Do not use `rm(list = ls())` instead; it does not do the same thing and should not be used.
Now that you have run the 3 scripts locally, we will define an HPC workflow to run the first 2 parts, networks estimation and networks diagnostics, with 100k-node networks and the full MCMLE estimation method. Trying to run this locally would take multiple days and probably crash your computer before finishing.
Instead, we will create a workflow locally, send it to the HPC, run it there and collect the results for analysis.
The script "R/workflow_01-networks_estimation.R" is responsible to the creation
of the first workflow. We will walk through it block by block to understand
the basics of slurmworkflow
.
A workflow exists on your computer as a directory inside the "workflows/" directory of your project. Our first workflow is called "networks_estimation" and will live in "workflows/networks_estimation/".
First we load the required libraries and source "R/utils-0_project_settings.R".
```r
# Libraries --------------------------------------------------------------------
library("slurmworkflow")
library("EpiModelHPC")

# Settings ---------------------------------------------------------------------
source("./R/utils-0_project_settings.R")
```
This last script contains variables used throughout the project. You should make sure that `current_git_branch` is correct and put your email address in `mail_user`. This way the HPC will send you an e-mail when the workflow is finished.
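For reference, a hypothetical excerpt of "R/utils-0_project_settings.R" could look like this (the values are placeholders to edit, not the template's actual content):

```r
current_git_branch <- "main"  # the branch the HPC checkout must follow
mail_user <- "user@emory.edu" # address for the SLURM notification e-mails
```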
Then we set a `max_cores` variable to 10. This will be the number of CPU cores to be used for the network estimations. 10 usually works fine.
```r
max_cores <- 10
source("./R/utils-hpc_configs.R") # creates `hpc_configs`
```
The "R/utils-hpc_configs.R" script contains helper functions to simplify the HPC setup. Here is the part that should be uncommented for using the RSPH cluster:
```r
# Must be sourced **AFTER** "./R/utils-0_project_settings.R"
hpc_configs <- EpiModelHPC::swf_configs_rsph(
  partition = "epimodel",
  r_version = "4.2.1",
  mail_user = mail_user
)
```
We use the `EpiModelHPC::swf_configs_rsph` helper function to create the `hpc_configs` object that holds some pre-defined configurations. We specify that we want to work on the "epimodel" partition and that e-mails telling us when the jobs are done should be sent to "user@emory.edu".
Note: the HPC uses the SLURM workload manager to allocate tasks to computing nodes. The "epimodel" partition allocates jobs to nodes reserved for the EpiModel team. The other option is "preemptable", where you can use any empty node but may be kicked out if the node is reclaimed by the team that reserved it.
The `slurmworkflow::create_workflow` function takes 2 mandatory arguments:

- `wf_name`: the name of the new workflow
- `default_sbatch_opts`: a list of default options for the `sbatch` command. They will be shared among all steps.

```r
# Workflow creation ------------------------------------------------------------
wf <- create_workflow(
  wf_name = "networks_estimation",
  default_sbatch_opts = hpc_configs$default_sbatch_opts
)
```
Here we created a workflow called "networks_estimation" and used the `sbatch` options stored in `hpc_configs$default_sbatch_opts`.
```r
hpc_configs$default_sbatch_opts
#> list(
#>   "partition" = "epimodel",
#>   "mail-type" = "FAIL",
#>   "mail-user" = "user@emory.edu"
#> )
```
It specifies that we want to use the "epimodel" partition and that an e-mail should be sent if a task fails.
With this we have created the directory "workflows/networks_estimation" and stored a summary of it in the `wf` variable. For now our workflow has no steps.
Notes:

- SLURM configurations can vary: on HYAK for instance there is an accounting module and we would have to specify the "account" option. An equivalent `swf_configs_hyak` function exists for the HYAK ecosystem.
- The `default_sbatch_opts` and `sbatch_opts` parameters accept all the options to `sbatch` that start with "--" (e.g. "account" is valid but "A" is not, as it corresponds to the "-A" shorthand); see the sketch after these notes.
- If a "workflows/networks_estimation" directory already exists, `create_workflow` will throw an error. You have to delete the previous versions of the workflow yourself if you want to overwrite them.
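As a quick illustration of the option naming rule from the notes above (the values here are arbitrary):

```r
# Long option names only, without the leading "--"
sbatch_opts <- list(
  "cpus-per-task" = 4,    # "--cpus-per-task"
  "mem"           = "8G", # "--mem"
  "account"       = "xyz" # "--account"; the "-A" shorthand is not accepted
)
```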
The renv::restore Step

Before running the actual calculation, we want to make sure that the project on the HPC is up to date with the right package versions. This translates to running `git pull` on the HPC and `renv::restore()` from the project.
To do this we will add a step to the workflow. The `slurmworkflow::add_workflow_step` function takes 2 mandatory arguments:

- `wf_summary`: a summary of the workflow to edit (the `wf` variable)
- `step_tmpl`: a step template. These are made by a special kind of function from `slurmworkflow`.

Here we will also use the optional `sbatch_opts` argument to override some of the default options defined above.
```r
# Update RENV on the HPC -------------------------------------------------------
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_renv_restore(
    git_branch = current_git_branch,
    setup_lines = hpc_configs$r_loader
  ),
  sbatch_opts = hpc_configs$renv_sbatch_opts
)
```
The step template here comes from the function `EpiModelHPC::step_tmpl_renv_restore`, which takes two arguments:

- `git_branch`: the branch that the repository must follow. If the branch followed on the HPC is not the right one, the step will stop there to avoid potential data loss and undefined behaviors. Here we use the `current_git_branch` variable defined in "R/utils-0_project_settings.R".
- `setup_lines`: some boilerplate `bash` code to allow running R code on the HPC.

Internally this function sets up an `sbatch` task that will run `git pull` and `renv::restore()` on the HPC.
For this specific task we need to change some of the `sbatch` options using `hpc_configs$renv_sbatch_opts`.
```r
hpc_configs$renv_sbatch_opts
#> list(
#>   "mem" = "16G",
#>   "cpus-per-task" = 4,
#>   "time" = 120
#> )
```
It asks for 16GB of RAM and 4 CPUs, and tells SLURM that the job should take less than 120 minutes.
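If a step needs different resources, you can build your own option list instead of using the pre-defined one; a minimal sketch with base R (the `my_renv_opts` name is ours, for illustration):

```r
# Start from the pre-defined options and override only the time limit
my_renv_opts <- utils::modifyList(
  hpc_configs$renv_sbatch_opts,
  list("time" = 240) # minutes
)
```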
We assigned the result of the call to `wf` (`wf <- add_workflow_step(...)`), and the function modified the "workflows/networks_estimation/" folder.
Note: on the MOX cluster from HYAK, `renv_sbatch_opts` also changes the `partition` to "build", as on MOX the "default" partition does not have internet access.
Now that we have ensured that the project will be up to date on the HPC, we want to run the "R/01-networks_estimation.R" script there with `context <- "hpc"`.
To do this we add another step with `slurmworkflow::add_workflow_step`, but with a different step template.
```r
# Estimate the networks --------------------------------------------------------
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call_script(
    r_script = "./R/01-networks_estimation.R",
    args = list(
      context = "hpc",
      estimation_method = "MCMLE",
      estimation_ncores = max_cores
    ),
    setup_lines = hpc_configs$r_loader
  ),
  sbatch_opts = list(
    "cpus-per-task" = max_cores,
    "time" = "24:00:00",
    "mem" = "0"
  )
)
```
The `slurmworkflow::step_tmpl_do_call_script` template sets up a step to run the script located on the HPC under the path `r_script`, here "R/01-networks_estimation.R", with some variables pre-defined.
We set the following variables:

- `context = "hpc"`: signals the script to use the "hpc" settings.
- `estimation_method = "MCMLE"`: we want the slower but more accurate estimation method.
- `estimation_ncores = max_cores`: this estimation method benefits from being parallelized (more than 10 cores can slow things down significantly).
If you take a look at the "R/01-networks_estimation.R" script, you will see that `estimation_method` and `estimation_ncores` are set if `context == "local"` but not for "hpc". Thanks to our step template they will be defined when the script runs as part of the workflow.
Note: the syntax of `step_tmpl_do_call_script` for passing arguments to a script is similar to that of `base::do.call`.
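The analogy can be made concrete with a small local sketch (an assumption about the behavior, for illustration only, not the template's actual implementation): each element of `args` becomes a variable visible to the script before it runs.

```r
# Rough local analogue of what the step does with `args`
run_script_with_args <- function(r_script, args) {
  e <- new.env(parent = globalenv())
  list2env(args, envir = e)        # e.g. `context`, `estimation_method`, ...
  sys.source(r_script, envir = e)  # the script then sees those variables
}

# run_script_with_args("./R/01-networks_estimation.R",
#                      list(context = "local", estimation_ncores = 1))
```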
Important note: some users like to clear their R environment by placing `rm(list = ls())` at the start of their scripts. In addition to being generally discouraged, it will actually prevent a script from working with `step_tmpl_do_call_script`, as it deletes the pre-defined variables at the start of the script. Restarting the R session or using the `callr` package are better alternatives when working interactively.
Finally, we also provide the `setup_lines` as before and some new `sbatch_opts`. As no "partition" option is provided, it will default to "epimodel" (using the values set in `create_workflow` at the beginning).
This step will write 3 files on the HPC (see the script itself for details).
Finally we want to generate diagnostics for these networks with "R/02-networks_diagnostics.R".
```r
# Generate the diagnostics data ------------------------------------------------
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call_script(
    r_script = "./R/02-networks_diagnostics.R",
    args = list(
      context = "hpc",
      ncores = max_cores,
      nsims = 50
    ),
    setup_lines = hpc_configs$r_loader
  ),
  sbatch_opts = list(
    "cpus-per-task" = max_cores,
    "time" = "04:00:00",
    "mem-per-cpu" = "4G",
    "mail-type" = "FAIL,END"
  )
)
```
This step uses the same template as before, with 3 variables passed to the script: `context`, `ncores` and `nsims`.
As it is the last step of this workflow we override the "mail-type" sbatch
option to receive a mail when this step ends. We do so to be notified when the
workflow is finished.
This step will write 3 files on the HPC (see the script itself for details).
Now that our estimation workflow is set up, we need to send it to the HPC, run it and download the results.
We assume that the "workflows/" and "data/intermediate/" directories are not tracked by git (using ".gitignore" for example) and that the user has an SSH access to the HPC.
We will use `scp` to copy the folder over to the HPC as it is available on Windows, MacOS and GNU/Linux.
In this example, the "EpiModelHIV-Template" repository is located at "~/projects/EpiModelHIV-Template" on the HPC.
Before sending the workflow, make sure that the project on the HPC has renv initialized. This means running `renv::init()` from the root of the project on the HPC.
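Concretely, this can be done from an R session started at the root of the project on the HPC (how you start R there, e.g. which module to load, depends on your cluster):

```r
# Run once, from ~/projects/EpiModelHIV-Template on the HPC
renv::init()
```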
If you have never used the command line before, we recommend using the terminal from RStudio (not the R console).
Everything written `<between angle brackets>` is to be replaced with the correct value.
You should make sure to understand what each part of the commands does before running them. It will make your life easier.
The following commands are to be run from your local computer.
MacOS or GNU/Linux
```sh
# bash - local
scp -r workflows/networks_estimation <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/workflows/
```
Windows
```sh
# bash - local
set DISPLAY=
scp -r workflows\networks_estimation <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/workflows/
```
Forgetting `set DISPLAY=` will prevent `scp` from working correctly.
Note that it is `workflows\networks_estimation`: Windows uses back-slashes for directories and Unix OSes use forward-slashes.
For this step, you must be at the command line on the HPC. This means that you have run `ssh <user>@clogin01.sph.emory.edu` from your local computer.
On Windows, run `set DISPLAY=` beforehand if you get this error: `ssh_askpass: posix_spawnp: No such file or directory`.
You also need to be at the root directory of the project (where the ".git" folder is, as well as the "renv.lock" file). In this example you would get there by running `cd ~/projects/EpiModelHIV-Template`. The following steps will not work if you are not at the root of your project.
Running the workflow is done by executing the file "workflows/networks_estimation/start_workflow.sh" with the following command:
```sh
# bash - hpc
./workflows/networks_estimation/start_workflow.sh
```
If you are using Windows, the file may not be executable. You can solve this with the following command:
```sh
# bash - hpc
chmod +x workflows/networks_estimation/start_workflow.sh
```
The workflow will not work if you source the file (with `source <script>` or `. <script>`).
Provided that the workflow worked correctly, you should receive an e-mail telling you that the last step ended with exit code 0 (success).
We want to download the "data/intermediate/estimates/" and "data/intermediate/diagnostics/" directories back to our local machine:
These commands are to be run from your local machine, not from the SSH session on the HPC.
MacOS or GNU/Linux
```sh
# bash - local
scp -r <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/data/intermediate/estimates data/intermediate/
scp -r <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/data/intermediate/diagnostics data/intermediate/
```
Windows
Same notes as before for Windows
```sh
# bash - local
set DISPLAY=
scp -r <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/data/intermediate/estimates data\intermediate\
scp -r <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/data/intermediate/diagnostics data\intermediate\
```
We can now run the R script "R/03-networks_diagnostics_explore.R" to see if everything is correct. Don't forget to set the `context` to "hpc" at the top of the file to assess the right networks.
We will skip directly to the intervention scenarios workflow as the process is very similar for calibration.
At this point, we assume that you have a "data/intermediate/estimates/restart-hpc.rds" file and a bunch of scenarios defined in "data/input/scenarios.csv".
Further, we will not differentiate the command lines for MacOS, GNU/Linux and Windows anymore. We will present only the UNIX version, and Windows users can apply the same rules as previously when required.
Here we will define a 3-step workflow:

1. a renv::restore step, as before;
2. a step running `n_rep` replications of each scenario;
3. a processing step.
The script "R/workflow_05-intervention_scenario.R" is responsible of the creation of this workflow.
The renv::restore Step
```r
# Libraries --------------------------------------------------------------------
library("slurmworkflow")
library("EpiModelHPC")
library("EpiModelHIV")

# Settings ---------------------------------------------------------------------
source("./R/utils-0_project_settings.R")
context <- "hpc"
max_cores <- 32
source("./R/utils-default_inputs.R") # make `path_to_est`, `param` and `init`
source("./R/utils-hpc_configs.R")    # creates `hpc_configs`

# ------------------------------------------------------------------------------

# Workflow creation
wf <- create_workflow(
  wf_name = "intervention_scenarios",
  default_sbatch_opts = hpc_configs$default_sbatch_opts
)

# Update RENV on the HPC
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_renv_restore(
    git_branch = current_git_branch,
    setup_lines = hpc_configs$r_loader
  ),
  sbatch_opts = hpc_configs$renv_sbatch_opts
)
```
We go over this part quickly as it is similar to the previous workflow. If you tested the numbered scripts locally, all the sourced files should make sense to you.
This step assumes that you know how to run an EpiModel network simulation. This part is similar to the local script "R/40-intervention_scenarios.R".
Terminology:

- simulation: one run of an epidemic model.
- scenario: a set of parameters for a simulation. See `vignette("Working with Model Parameters", package = "EpiModel")`.
- batch: a set of `ncores` simulations to be run on a single cluster node. They all share the same scenario.
In this step we need the `path_to_restart`, `param`, `init` and `control` objects, as for an `EpiModelHPC::netsim_scenarios` call. They are loaded from "R/utils-default_inputs.R".
The `control` object differs from its usual form as the `nsims` and `ncores` arguments will be overridden by the workflow.
```r
# Controls
source("./R/utils-targets.R")
control <- control_msm(
  start = restart_time,
  nsteps = intervention_end,
  nsims = 1,
  ncores = 1,
  initialize.FUN = reinit_msm,
  cumulative.edgelist = TRUE,
  truncate.el.cuml = 0,
  .tracker.list = calibration_trackers,
  verbose = FALSE
)
```
As in other scripts, `restart_time` and `intervention_end` are loaded from the "R/utils-0_project_settings.R" script.
Note that for these simulations we restart not from time zero but from a previous simulation. Therefore we need to specify a different `initialize.FUN` to handle the restarting process. `reinit_msm` is such a function in `EpiModelHIV-p`.
We then load a `tibble` of 2 scenarios found in "data/input/scenarios.csv" and transform it into a scenario list.
```r
scenarios_df <- readr::read_csv("./data/input/scenarios.csv")
scenarios_list <- EpiModel::create_scenario_list(scenarios_df)
```
To account for the variability in our models, we want each scenario to be run 120 times (usually 500 to 1000 times for the final paper).
As before we use `add_workflow_step` to create the step. This time we use the step template `EpiModelHPC::step_tmpl_netsim_scenarios`. It takes as arguments:

- `path_to_est`, `param`, `init`, and `control`. `path_to_est` is where the workflow should look for the `est` file on the HPC. Here we will pass `path_to_restart`, which is "data/intermediate/estimates/restart-hpc.rds".
- `scenarios_list`: the list of scenarios produced by `create_scenario_list`.
- `output_dir`: a path to a directory to store the results.
- `libraries`: a character vector of the libraries required to run the model. Here we only need "EpiModelHIV".
- `save_pattern`: what part of the `sim` object should be kept. "simple" will keep only `epi`, `param` and `control`. Other values can be used in other use cases.
- `n_rep`: the number of times each scenario must be simulated (here 120).
- `n_cores`: the number of cores to be used on each node.
- `max_array_size`: detailed below, but a value of 500 is usually fine.
- `setup_lines`: same as before.

```r
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_netsim_scenarios(
    path_to_restart, param, init, control,
    scenarios_list = scenarios_list,
    output_dir = "./data/intermediate/scenarios",
    libraries = "EpiModelHIV",
    save_pattern = "simple",
    n_rep = 120,
    n_cores = max_cores,
    max_array_size = 500,
    setup_lines = hpc_configs$r_loader
  ),
  sbatch_opts = list(
    "mail-type" = "FAIL,TIME_LIMIT",
    "cpus-per-task" = max_cores,
    "time" = "04:00:00",
    "mem" = 0
  )
)
```
This step will run the simulations and save the results to `output_dir` using the following file name format: `paste0("sim__", scenario[["id"]], "__", batch_num, ".rds")`.
As we are running the simulations on 32-core machines, each scenario will be run over 4 batches, with the last one containing only 24 simulations to get to the desired 120 (`3 * 32 + 24 == 120`).
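The batch arithmetic can be checked directly (illustration only; the scenario ids in the file names are made up):

```r
n_rep   <- 120
n_cores <- 32
n_batches <- ceiling(n_rep / n_cores) # 4 batches per scenario
n_rep - (n_batches - 1) * n_cores     # 24 simulations in the last batch

# resulting files look like:
# "sim__scen_a__1.rds", ..., "sim__scen_a__4.rds",
# "sim__scen_b__1.rds", ..., "sim__scen_b__4.rds"
```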
NOTE: The `max_array_size` argument allows us to constrain how many runs can be submitted at once. On the RSPH HPC one is limited to around 1000 job submissions at a time. Trying to submit more will result in SLURM rejecting all the jobs. To prevent this, `slurmworkflow` will split the job into parts that will be submitted automatically one after the other. If the length of `scenarios_list` were 3000, `max_array_size = 500` would split it into 6 parts, where each part would not be submitted before the previous one is over.
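The splitting described above is simple arithmetic (using the numbers from the note):

```r
n_jobs <- 3000                    # hypothetical number of jobs for a step
max_array_size <- 500
ceiling(n_jobs / max_array_size)  # 6 parts, submitted one after the other
```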
Now that all the batches have been run, we will process them and create a small summary `tibble` to be downloaded and evaluated locally.
We return to `step_tmpl_do_call_script` for this step.
```r
# Process calibrations
#
# produce a data frame with the calibration targets for each scenario
wf <- add_workflow_step(
  wf_summary = wf,
  step_tmpl = step_tmpl_do_call_script(
    r_script = "./R/41-intervention_scenarios_process.R",
    args = list(
      context = "hpc",
      ncores = 15
    ),
    setup_lines = hpc_configs$r_loader
  ),
  sbatch_opts = list(
    "cpus-per-task" = max_cores,
    "time" = "04:00:00",
    "mem-per-cpu" = "4G",
    "mail-type" = "END"
  )
)
```
The arguments we pass to the script are `ncores`, how many cores to use, as this step will process the files in parallel using the `future.apply` package, and `context = "hpc"` as before.
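A minimal sketch of this kind of parallel post-processing, assuming `ncores` is defined (as it is when passed by the workflow) and that the batch files follow the naming scheme shown earlier (this is not the actual "R/41-intervention_scenarios_process.R" script):

```r
library(future)
library(future.apply)

plan(multisession, workers = ncores)

batch_files <- list.files(
  "data/intermediate/scenarios",
  pattern = "^sim__.*[.]rds$",
  full.names = TRUE
)
sims <- future_lapply(batch_files, readRDS)
```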
This script will save two files:

1. "data/intermediate/scenarios/assessments_raws.rds"
2. "data/intermediate/scenarios/assessments.rds"

See the script itself to see what it does.
We send the workflow as before with:
```sh
# bash - local
scp -r workflows/intervention_scenarios <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/workflows/
```
run it from our project directory on the HPC with:
```sh
# bash - hpc
./workflows/intervention_scenarios/start_workflow.sh
```
and finally we download the results for evaluation:
```sh
# bash - local
scp -r <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/data/intermediate/scenarios/assessments_raws.rds data/intermediate/scenarios/
scp -r <user>@clogin01.sph.emory.edu:projects/EpiModelHIV-Template/data/intermediate/scenarios/assessments.rds data/intermediate/scenarios/
```
We can now use these files as we please locally.