knitr::opts_chunk$set(echo = TRUE)
An HPC cluster is a group of networked high-performance computers (nodes).
A typical cluster will have multiple compute nodes which can perform heavy computation, as well as a head node which serves as the user's access point to the cluster, and may also be responsible for scheduling jobs among the compute nodes. The compute nodes in a cluster typically have compute resources (CPU cores, RAM, disk space) that far exceed those of a typical laptop or desktop computer.
Because HPC clusters are intended to serve a group of people (e.g. a biostatistics department) rather than a single user, HPC clusters use the concept of jobs to allow multiple users to effectively share the cluster's resources.
When a user is ready to run something (an analysis, processing pipeline, etc) on the cluster, they will submit a new job to the cluster. The cluster will then schedule and run the job as soon as compute resources are available.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/cluster_basic_diagram.png")
The figure above shows an example cluster with three users. Each user connects to the Head Node to submit their jobs.
User 1 has submitted two jobs, which are both running on Compute Node 1. User 2 has submitted a single job, which is also running on Compute Node 1. User 3 has submitted a job which requires a larger amount of compute resources (CPU cores, RAM, etc.). This job runs on Compute Node 2 to provide the user with the resources they requested.
HPC clusters are useful for:
- Long-running analyses or processing pipelines that would otherwise tie up your own computer.
- Jobs that require more CPU cores, RAM, or disk space than a typical laptop or desktop provides.
- Running many jobs in parallel across the cluster's compute nodes.
The CSPH Biostats cluster consists of four nodes. The csphbiostats.ucdenver.pvt node serves as both a head node and one of the compute nodes (i.e. submitted jobs may also run on this node), and the cidalappc[1-3].ucdenver.pvt nodes serve as compute nodes.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/Biostats_HPC_diagram.png")
The CIDA/Biostats server uses the SLURM system for job scheduling and resource management on the cluster.
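Once you are logged in (described below), you can get a quick view of the nodes SLURM manages and their current state using the standard sinfo command. The exact columns shown depend on the SLURM version; this is just a sketch of two common invocations:

sinfo          # summary of partitions, nodes, and their states
sinfo -N -l    # one line per node, with more detail (CPUs, memory, state)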
To access the CSPH Biostats cluster, first submit a support ticket on the SOM IS web page requesting access to the cluster and a personal directory under /biostats_share (i.e. /biostats_share/<your_username>). Once approved, an account will be created for you on the server.
To log in to the CSPH Biostats Cluster, you can use SSH from the command line or an SSH client of your choice.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/biostats_hpc_ssh.png")
If you are connecting from the command line (like the above example), run:
ssh <your_username>@csphbiostats.ucdenver.pvt
where <your_username> is your CU system username (i.e. your username for UCDAccess, Outlook, etc.).
On login, you will be prompted for a password, which is your CU system password (i.e. your password for UCDAccess, Outlook, etc.).
Once you've successfully logged in, you should see a prompt like the final line in the screenshot, showing that you are logged in to csphbiostats.ucdenver.pvt.
To make future logins more convenient, you could configure an SSH config profile and RSA key pair, which will enable password-less login.
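For example, a minimal setup might look like the sketch below. The profile name biostats is just an example, and the exact steps can vary by operating system and SSH client:

# One-time setup on your local machine:
ssh-keygen -t rsa                                        # generate an RSA key pair (accept the defaults)
ssh-copy-id <your_username>@csphbiostats.ucdenver.pvt    # copy your public key to the server

# Then add a profile to ~/.ssh/config:
Host biostats
    HostName csphbiostats.ucdenver.pvt
    User <your_username>

# Future logins are then as simple as:
ssh biostats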
In the sections below, I will describe a few different ways of submitting a job on the cluster, along with their potential use cases.
sbatch
The most common way of submitting a job on the cluster is to use the sbatch command.
To submit a job using sbatch, you should first create a batch script which lists the commands to be executed as part of your job.
The example below shows a simple batch script, my_batch.sh, which executes a single R script.
#!/bin/bash
#SBATCH --job-name=my_batch
#SBATCH --output=my_batch.log
#SBATCH --error=my_batch.err

Rscript my_analysis.R
The first line:
#!/bin/bash
is required, and is used to determine how your script will be executed. In this case, the script will be executed using bash.
The next few lines will be parsed by SLURM to set parameters/options for your batch job.
#SBATCH --job-name=my_batch
#SBATCH --output=my_batch.log
#SBATCH --error=my_batch.err
In order:
- --job-name - Sets a name for your job.
- --output - Sets the output log file for your job. Any log messages or outputs from your script will be sent to this file.
- --error - Sets the error log file for your job. Any log or error messages produced by your script will be sent to this file.
Any comment line containing #SBATCH before the first command in your script will be parsed by SLURM.
Although #SBATCH lines are not required, I recommend at least providing the --output and --error options, since otherwise your output and error streams will both be directed to the default file slurm-<job_id>.out.
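Beyond the logging options, SLURM accepts many other standard #SBATCH directives for requesting resources. The sketch below adds a few of them to the same script; the specific values are only illustrative, and the cluster's defaults apply to anything you omit:

#!/bin/bash
#SBATCH --job-name=my_batch
#SBATCH --output=my_batch.log
#SBATCH --error=my_batch.err
#SBATCH --ntasks=1            # number of tasks (processes) to launch
#SBATCH --cpus-per-task=4     # CPU cores available to the R script
#SBATCH --mem=16G             # total memory for the job
#SBATCH --time=24:00:00       # wall-clock time limit (HH:MM:SS)

Rscript my_analysis.R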
We can submit this job by running:
sbatch my_batch.sh
knitr::include_graphics("figures/CIDA_BIOS_Cluster/cluster_sbatch.png")
When we execute the sbatch command, SLURM will assign the job an ID and schedule it to execute. In this case, our job is assigned ID 6179 (first arrow in the figure).
By using the squeue command, we can obtain the status of all jobs currently scheduled across the cluster (including those submitted by other users).
Using the assigned job ID, we can use the JOBID column to identify our job in the queue (second arrow in the figure).
The squeue output also provides some useful information about our job:
- The ST column tells us that our job is in the running (R) state.
- The TIME column tells us that our job has been running for 9 seconds.
- The NODELIST(REASON) column tells us that our job is executing on the csphbiostats.ucdenver.pvt node.
sbatch
Using sbatch is most beneficial for long-running scripts or analyses. After the job has been submitted using sbatch, your job will execute on the cluster until completion. You can continue to work on the cluster (submitting other jobs, etc.) or log out without affecting any of your running jobs.
You can monitor the progress of your job by checking your log (--output) and error (--error) output files to see any outputs/messages printed by your script.
Additionally, you can use the squeue command to obtain other information about your job, including the current state (ST), the elapsed runtime (TIME), and the location of your job (NODELIST).
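For example, a few standard commands that are useful while an sbatch job is running (6179 is the example job ID from the figure above, and my_batch.log is the --output file from the example script):

squeue -u <your_username>   # list only your own jobs
tail -f my_batch.log        # follow the job's output log as it is written
scancel 6179                # cancel the job by ID if needed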
srun
Submitting a job using SLURM's srun command will schedule and run the job as soon as possible. When the job begins to run, you will see the output of the job in your terminal as it executes.
Similar to running a script locally on the command line (i.e. ./my_script.sh), you will not be able to execute other commands in your terminal window until the submitted job is complete.
An example below shows execution of a simple shell script using srun.
The command:
srun --exclude=csphbiostats.ucdenver.pvt ./my_script.sh
will schedule a new job to run the script ./my_script.sh.
The --exclude flag tells SLURM to schedule the job on any node except the node(s) listed. In this case, we exclude the csphbiostats.ucdenver.pvt node to ensure our job runs on one of the cidalappc[1-3].ucdenver.pvt nodes (and the first line of the script output shows our job ran on cidalappc01).
knitr::include_graphics("figures/CIDA_BIOS_Cluster/cluster_srun.png")
srun is also useful for running interactive jobs. An interactive job opens a terminal session on a compute node, allowing you to run commands and scripts interactively.
Interactive jobs are useful for running quick commands or scripts, or anytime you'd like to directly monitor the output of your script/command.
The command:
srun --pty --exclude=csphbiostats.ucdenver.pvt /bin/bash -i
will schedule a new interactive job on a compute node.
This command is similar to the previous srun command we used but has a few key differences:
- The --pty flag tells SLURM that this is an interactive job.
- /bin/bash -i will execute an interactive shell on the compute node.
After submitting the job, we see the prompt string has changed from:
[hillandr@csphbiostats job_example]
to
[hillandr@cidalappc01 job_example].
This indicates that we are now running an interactive job on the compute node cidalappc01.
At this point, we can execute any commands or scripts normally.
To end the interactive job, type exit.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/interactive_job.png")
srun
srun is useful for running interactive jobs with fast-running commands/scripts, or for debugging issues with a larger script you will eventually submit using sbatch.
I don't recommend using srun for long-running jobs, since the job may be cancelled if you disconnect from the cluster while the srun command is still executing.
To upload and download data from the cluster, it is most convenient to use SFTP. There are multiple ways to use SFTP, including through the command line using the sftp command (Mac/Unix-like systems only).
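For reference, a minimal command-line SFTP session looks like the sketch below (the file names are placeholders):

sftp <your_username>@csphbiostats.ucdenver.pvt
# At the sftp> prompt:
put my_data.csv     # upload a file from your local machine to the cluster
get results.rds     # download a file from the cluster to your local machine
exit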
On Mac, the Cyberduck application is a free and intuitive GUI SFTP client. On Windows systems, WinSCP is another popular choice.
The screenshots below show an example of connecting to csphbiostats.ucdenver.pvt using Cyberduck on Mac.
First, open Cyberduck and click the 'Open Connection' button on the top bar.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/cyberduck_1.png")
Next, ensure that the 'SFTP' option is selected in the dropdown, then input your SSH credentials.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/cyberduck_2.png")
If successful, you should see a file browser interface showing your home directory on csphbiostats.ucdenver.pvt.
You can use the interface to navigate and download existing files. You can also drag-and-drop files from your local machine to upload them to the cluster.
knitr::include_graphics("figures/CIDA_BIOS_Cluster/cyberduck_3.png")
The CSPH Biostats cluster uses a shared filesystem for many important directories, including /home and /biostats_share. This means that any files (scripts, data, etc.) you upload to one of these directories on csphbiostats.ucdenver.pvt will be accessible from any node in the cluster.
Similarly, any output files generated by a job running on a compute node will also be accessible from csphbiostats.ucdenver.pvt, making it easy to download the results of a job.
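For example, a job script can write its results to the shared filesystem, and the same files are then immediately visible on the head node (the directory and file names below are placeholders):

# Inside a job script running on any compute node:
mkdir -p /biostats_share/<your_username>/results
cp output.rds /biostats_share/<your_username>/results/

# Back on csphbiostats.ucdenver.pvt (or via SFTP), the same file is available:
ls /biostats_share/<your_username>/results/output.rds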
The SLURM system has an excellent website, with documentation for each command. I recommend reading the documentation for at least the sbatch command, as it contains information about many configuration options not listed in this document.