knitr::opts_chunk$set(echo = TRUE)
This document will guide you through running interactive RStudio (and/or JupyterLab) sessions on the Biostats cluster.
Once configured, you'll be able to easily submit RStudio/JupyterLab SLURM jobs, and access them through your browser.
We'll cover:

- First-time setup (SSH configuration and copying the helper scripts)
- Launching an RStudio session
- Launching a JupyterLab session
- Quitting RStudio/JupyterLab jobs
- Optional job customization
If you've never used the `csphbiostats` cluster before, you'll need to request an account. Information on how to access the cluster is available in the Accessing the CSPH Biostats Cluster section of the main Biostats HPC article.
If you previously followed this guide to set up RStudio/JupyterLab and just need a refresher on how to launch an RStudio/JupyterLab job, check out the Quickstart section.
Using RStudio/JupyterLab on the `csphbiostats` cluster allows you to work in a familiar, interactive interface while taking advantage of the cluster's compute resources (CPU cores and memory) and direct access to shared storage such as `/biostats_share`.
The RStudio container runs R 4.5.0 and supports the usual RStudio features, including RMarkdown/Quarto notebooks, interactive plotting, etc. For convenience, the RStudio container has a variety of commonly used packages pre-installed, including the `tidyverse` family, and also includes many system libraries needed for compiling R packages from source.
The JupyterLab container supports running Jupyter Notebooks (`.ipynb`) and all standard JupyterLab features. By default, the JupyterLab container has kernels installed for the languages included in the base container image: Python, R, and Julia.
This project is hosted on GitHub here.
If you have questions, encounter difficulties, or have suggestions for improvements, feel free to either open an issue on GitHub or email me directly.
Although the RStudio/JupyterLab instance will run on an HPC compute node, you'll log in to the cluster normally (via SSH) and access the RStudio/JupyterLab interface through your local browser.
In order for this to work correctly, we need to configure a few SSH options when logging in to the cluster. These SSH options should only need to be configured once.
The simplest way to configure SSH connections is to use the SSH `config` file, which allows us to define persistent options that will be used each time we log in to the cluster (rather than having to remember which parameters to pass on the command line each time).
On Mac/Unix-like systems, this file is located at `~/.ssh/config`, and on Windows it is located at `C:/Users/<username>/.ssh/config`. If you don't see a file named `config` in that location, that's OK; just create it with your text editor of choice.
Note: Some GUI clients (like PuTTY on Windows) do not support port forwarding to Unix sockets (required functionality for using RStudio/JupyterLab on the cluster). However, modern command-line versions of SSH on Mac and Windows do support this.
Open up the `config` file in your text editor of choice. If you had a pre-existing `config` file, check whether it already contains an entry with `Host csphbiostats.ucdenver.pvt` or `HostName csphbiostats.ucdenver.pvt`. If it does, you can modify that entry instead of creating a new one.
IMPORTANT: If you have an existing `config` file and there is an entry that begins with `Host *`, you should add the new configuration above the existing `Host *` section to ensure that the new configuration takes precedence.
Configure your entry to look something like the template below, replacing the three highlighted `your_username` placeholders with the username you use to log in to the cluster with SSH:
Host biostats
HostName csphbiostats.ucdenver.pvt
User your_username
Port 22
LocalForward 8895 /tmp/jupyter-your_username.sock
LocalForward 8896 /tmp/rstudio-your_username.sock
As an example, a fully-configured entry looks like this:
Host biostats
HostName csphbiostats.ucdenver.pvt
User coopdale
Port 22
LocalForward 8895 /tmp/jupyter-coopdale.sock
LocalForward 8896 /tmp/rstudio-coopdale.sock
The three `your_username` placeholders from the template have been replaced with the user's username, `coopdale`.
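If you'd rather not edit a config file (or just want to test quickly), the same forwarding can usually be supplied directly on the command line. The command below is an equivalent sketch using the example username `coopdale`; substitute your own username and preferred ports:

```{bash, eval=FALSE, echo=TRUE}
# One-off equivalent of the config entry above (requires an OpenSSH client
# that supports forwarding a local port to a remote Unix socket).
ssh -L 8895:/tmp/jupyter-coopdale.sock \
    -L 8896:/tmp/rstudio-coopdale.sock \
    coopdale@csphbiostats.ucdenver.pvt
```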
As a quick breakdown of the above configuration options:
The `Host biostats` line assigns this configuration the alias/name 'biostats'. We can use this name when running SSH to automatically apply all of the configuration options we specified, like:
```{bash, eval=FALSE, echo=TRUE}
ssh biostats
```
The `User` field should be your cluster username, and `Port` can be set to `22` (the default port for SSH). The two key elements are the `LocalForward` fields:
LocalForward 8895 /tmp/jupyter-<your_username>.sock
LocalForward 8896 /tmp/rstudio-<your_username>.sock
These options tell the SSH client to forward local traffic on these ports to specific Unix socket paths on the cluster, allowing you to access the running RStudio/JupyterLab jobs. In this case, port `8895` will be used for JupyterLab and port `8896` will be used for RStudio, but feel free to customize these to any non-reserved ports you like (most ports in the 8800-8999 range should be available).

Once configured, save and close the file.

### 1C. Test the SSH Connection

Now, we'll run a quick check to verify that the SSH configuration is working as expected. First, we need to open a terminal window, which varies slightly by platform:

| OS | How to open a terminal |
|----|------------------------|
| Windows | Search for 'Powershell' or 'Command Prompt' |
| Mac | Search for 'Terminal' |

Once the terminal window is open, type:

```{bash, eval=FALSE, echo=TRUE}
ssh biostats
```
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_ssh_2.png")
Once you've logged in via SSH, open a web browser on your local machine (but keep the SSH window open!) and enter:
```{bash, eval=FALSE, echo=TRUE}
localhost:8895
```

in the address bar. If you assigned different ports in your config file, modify the address accordingly. In the browser, you should see an error screen. This is expected.

```r
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/localhost_test.png")
```
Moving back to your SSH window, you should see some error messages populate the screen:
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_ssh_local_forward_test.png")
If you see these error messages in your SSH window, congratulations! This means you've configured SSH correctly! The error messages appear because the browser is attempting to connect, but there is nothing on the other end (yet) to accept the connection.
These error messages will continue to appear in your SSH window as long as you keep the `localhost:8895` browser tab open. Feel free to close the browser tab now and type `clear` in your SSH window to clear the messages from your terminal screen.
If you don't see the error messages in your SSH window:

- Double-check that the `localhost:<port>` address you're visiting in the browser matches one of the `LocalForward <port>` lines in your SSH config file.
- Make sure you logged in using the `ssh biostats` alias from your SSH config file.

To launch an RStudio/JupyterLab session, you'll need a copy of the `rstudio_helper.sh` and `jupyter_helper.sh` scripts. These scripts help automate some of the setup for RStudio/JupyterLab sessions.
In your SSH window, navigate to your home directory (`cd ~`) and run:
```{bash, eval=FALSE, echo=TRUE}
cp /biostats_share/hillandr/containers/scripts/rstudio_helper.sh .
cp /biostats_share/hillandr/containers/scripts/jupyter_helper.sh .
```

This makes copies of both scripts in your home directory (`~`).

## 2. Launching an RStudio Session

Once you have copies of the launch scripts, in your SSH window try running:

```{bash, eval=FALSE, echo=TRUE}
./rstudio_helper.sh
```
You should see a message `Submitted batch job <job_id>`. This is the Job ID for your RStudio job, which you'll need again when you want to shut down the RStudio session. This ID will change each time you launch a new job.
Now, run `squeue`. You should see an entry for your Job ID, similar to my example below.
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_rstudio_launch.png")
This job will also produce two output files, `rstudio_helper.out` and `rstudio_helper.err`, which will log any output or errors that the system encounters. If you don't see your Job ID in the `squeue` list, try inspecting these two files for any error messages.
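If the `squeue` output is long, you can limit it to your own jobs (`$USER` expands to your cluster username):

```{bash, eval=FALSE, echo=TRUE}
# Show only jobs submitted by your user account.
squeue -u "$USER"
```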
If you can see your RStudio job running in `squeue`, go back to your web browser and visit
```{bash, eval=FALSE, echo=TRUE}
localhost:8896
```

in the address bar. You should see a loading screen, and eventually the RStudio interface!

```r
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_rstudio_interface.png")
```
Congratulations! You can now use RStudio on the cluster!
Feel free to try out the interface and verify that everything works as expected.
When you're finished, shut down the RStudio job by following the instructions in the Quit the RStudio/JupyterLab Session section.
If you don't see the RStudio interface in your browser:

- Double-check that you're visiting the correct port in your browser (`8896`).
- Verify that your SSH config's `LocalForward` directives match the template from 1B.1 SSH Config Template.
- Run `squeue` again and verify that your job is still running. If the job is not running, check `rstudio_helper.err` and `rstudio_helper.out` for more information.

The `rstudio_helper.sh` script essentially just automates some cleanup tasks and submits a new `sbatch` batch script which launches RStudio.
If you open `rstudio_helper.sh` and scroll down to the `sbatch <<==SBATCH=SCRIPT==` section, you can customize the parameters of the underlying SLURM job to suit your needs:
```{bash, eval=FALSE, echo=TRUE}
sbatch <<==SBATCH=SCRIPT==
apptainer run \
  --no-mount /etc/localtime \
  --bind /biostats_share:/biostats_share \
  /biostats_share/hillandr/containers/rstudio.sif
==SBATCH=SCRIPT==
```
The default `rstudio_helper.sh` script submits a job which:

- Runs on any node *except* the head node (`--exclude=csphbiostats.ucdenver.pvt`)
- Requests 8 CPU cores (`--cpus-per-task=8`)
- Requests 64GB of RAM (`--mem=64G`)

You can change any of these values to meet your needs. The available resources of each compute node are listed in the article [Computing on the CSPH Biostats Cluster](https://cida-csph.github.io/CIDAtools/articles/CIDA_BIOS_Cluster.html#csph-biostats-hpc-cluster).

## 3. Launching a JupyterLab Session

Once you have copies of the launch scripts, try running:

```{bash, eval=FALSE, echo=TRUE}
./jupyter_helper.sh
```
You should see a message `Submitted batch job <job_id>`. This is the Job ID for your JupyterLab job, which you'll need again when you want to shut down the JupyterLab session. This ID will change each time you launch a new job.
Now, run `squeue`. You should see an entry for your Job ID, similar to my example below.
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_jupyter_launch.png")
This job will also produce two output files, `jupyter_helper.out` and `jupyter_helper.err`, which will log any output or errors that the system encounters. If you don't see your Job ID in the `squeue` list, try inspecting these two files for any error messages.
IMPORTANT: The first time you launch JupyterLab, it will create a Python `virtualenv` under `~/jupyterlab_venv`. This process takes time (in my experience, up to 10 minutes). If you are unable to connect to your JupyterLab instance immediately after launching the job, don't worry! The setup process will log its progress to the output and error logs listed above.
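If you'd like to watch the first-time setup progress, you can follow those log files from your SSH window (press Ctrl+C to stop following):

```{bash, eval=FALSE, echo=TRUE}
# Stream new lines from the JupyterLab job's log files as they are written.
tail -f jupyter_helper.out jupyter_helper.err
```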
If you can see your JupyterLab job running in `squeue`, go back to your web browser and visit
```{bash, eval=FALSE, echo=TRUE}
localhost:8895
```

in the address bar. You should see a loading screen, and eventually the JupyterLab interface!

```r
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_jupyter_interface.png")
```
Congratulations! You can now use JupyterLab on the cluster!
Feel free to try out the interface and verify that everything works as expected.
When you're finished, shut down the JupyterLab job by following the instructions in the Quit the RStudio/JupyterLab Session section.
If you don't see the JupyterLab interface in your browser:

- Double-check that you're visiting the correct port in your browser (`8895`).
- Verify that your SSH config's `LocalForward` directives match the template from 1B.1 SSH Config Template.
- Run `squeue` again and verify that your job is still running. If the job is not running, check the `jupyter_helper.err` and `jupyter_helper.out` files for more information.
- Remember that the first launch needs to create the `~/jupyterlab_venv` virtual environment. If the job is running (via `squeue`) but you can't connect through your browser, try waiting ~10 minutes for the environment to initialize. Subsequent runs of `./jupyter_helper.sh` should take only a few seconds.

The `jupyter_helper.sh` script essentially just automates some cleanup tasks and submits a new `sbatch` batch script which launches JupyterLab.
If you open `jupyter_helper.sh` and scroll down to the `sbatch <<==SBATCH=SCRIPT==` section, you can customize the parameters of the underlying SLURM job to suit your needs:
```{bash, eval=FALSE, echo=TRUE}
sbatch <<==SBATCH=SCRIPT==
apptainer run \
  --bind /biostats_share:/biostats_share \
  /biostats_share/hillandr/containers/jupyterlab.sif \
  --notebook-dir "$HOME"
==SBATCH=SCRIPT==
```
The default `jupyter_helper.sh` script submits a job which:

- Runs on any node *except* the head node (`--exclude=csphbiostats.ucdenver.pvt`)
- Requests 8 CPU cores (`--cpus-per-task=8`)
- Requests 64GB of RAM (`--mem=64G`)

You can change any of these values to meet your needs. The available resources of each compute node are listed in the article [Computing on the CSPH Biostats Cluster](https://cida-csph.github.io/CIDAtools/articles/CIDA_BIOS_Cluster.html#csph-biostats-hpc-cluster).

## 4. Quit the RStudio/JupyterLab Session

When you're finished using JupyterLab or RStudio, run
`scancel <job_id>` from your SSH window to kill the job (make sure everything is saved!). A few things to note:

1. Like other `sbatch` jobs, the RStudio/JupyterLab jobs will continue to run after you log out of your SSH connection (you can try this yourself!).
    - This means you can log out (via SSH), then log back in later and keep using the same session.
    - Although you *can* leave RStudio/JupyterLab running, it's best to quit the session (using `scancel`) when you're finished to free up resources for other users.
2. **IMPORTANT: Check for a running RStudio/JupyterLab job (using `squeue`) before submitting a new one with `./rstudio_helper.sh` or `./jupyter_helper.sh`!**
    - Submitting a second job will not work, and will likely break your original session as well.
    - By running `squeue` you can quickly locate the Job ID of your running `rstudio_helper` or `jupyter_helper` job (if one is running). You can then either cancel it with `scancel` or just connect through your browser and keep working.

## 5. Quickstart (for returning users)

The complete process for launching an RStudio/JupyterLab instance (once the [first-time setup](#first-time-setup) is complete) looks like:

1. Open a command prompt and run `ssh biostats`.
    - `biostats` is the alias for `csphbiostats.ucdenver.pvt` that we configured in the [first-time setup section](#first-time-setup).
2. Check for a currently running RStudio/JupyterLab session using `squeue`.
    - To reconnect to a running session, skip to step 5.
    - To cancel the existing session, use `scancel <job_id>`.
3. If no existing session is running, from your SSH terminal run either `./jupyter_helper.sh` or `./rstudio_helper.sh`.
    - The script should exit after printing a Job ID for your new session.
4. Verify that your new session is running using `squeue` and look for the Job ID.
5. Open your browser to `localhost:8895` (JupyterLab) or `localhost:8896` (RStudio).
6. Done!

## 6. Conclusion

At this point, you should be able to launch, connect to, and cancel RStudio and JupyterLab jobs on the cluster! The sections below provide some additional configuration options and information, but they are completely optional.

## 7. (Optional) Additional Options and Job Customization

This section provides some additional information and configuration options.

### 7A. Apptainer Containers

The JupyterLab and RStudio instances both run inside of Apptainer containers. Apptainer (formerly Singularity) containers are stored as self-contained files with the `.sif` extension. The benefit of hosting JupyterLab and RStudio within containers is that they offer a consistent environment where all dependencies are installed by default.

The RStudio and JupyterLab containers are derived from a [base container image](https://github.com/CIDA-CSPH/Biostats-Cluster-Tools/blob/main/containers/RBase/docker/Dockerfile) which holds installations of R, Python, and Julia, along with packages/libraries to support installing most other common R/Python libraries.

#### 7A.1 Bind Mounts

One important concept when working with containers is the 'bind mount'. Containers are like small Linux installations which run in a separate, read-only filesystem. This means we cannot access any files *outside* of the container by default. *Bind mounts* allow us to overlay specific parts of the host's filesystem onto the container, so we can read from and write our own files (on the host system) from inside the container.
By default, Apptainer sets up bind mounts for a few directories, including your home directory (`~`), your working directory (`.`), and `/tmp`. The `jupyter_helper.sh` and `rstudio_helper.sh` scripts also include an additional bind mount, `--bind /biostats_share:/biostats_share`, which allows us to access the `/biostats_share` directory from inside the container.

If you find yourself unable to access files from within the container, you can add the specific directory as a bind mount by modifying the `*_helper.sh` script to include additional bind mounts:

```{bash, eval=FALSE, echo=TRUE}
# This command redirects the rest of this script to a sbatch call.
sbatch <<==SBATCH=SCRIPT==
#!/bin/bash
#SBATCH --job-name=jupyter_helper
#SBATCH --output=jupyter_helper.out
#SBATCH --error=jupyter_helper.err
#SBATCH --exclude=csphbiostats.ucdenver.pvt
#SBATCH --mem=64G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Launch the container.
apptainer run \
  --bind /biostats_share:/biostats_share \
  --bind /another_special_directory:/another_special_directory \
  /biostats_share/hillandr/containers/jupyterlab.sif \
  --notebook-dir "$HOME" \
  --venv "$HOME/my_custom_venv"
==SBATCH=SCRIPT==
```
The line `--bind /another_special_directory:/another_special_directory` makes the `/another_special_directory` path available in the container (make sure to use an absolute path).
In general, bind mount arguments look like:

`--bind <local_directory>:<container_path>`

which mounts the `<local_directory>` path on the cluster to `<container_path>` inside the container.
To reduce confusion, I usually mount each local path to the same path within the container (as I did with `/biostats_share`).
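Apptainer also accepts several bind specifications in a single `--bind` flag, separated by commas, which is equivalent to repeating the flag. In this sketch, the second path is purely hypothetical:

```{bash, eval=FALSE, echo=TRUE}
# Two bind mounts in one flag (equivalent to two separate --bind flags).
--bind /biostats_share:/biostats_share,/scratch/my_project:/scratch/my_project
```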
### 7B. JupyterLab Options

The JupyterLab container holds an installation of Python 3.13. When launched for the first time, the container will create a new `virtualenv` environment in your home directory at `~/jupyterlab_venv`, which will hold all Python dependencies (including JupyterLab itself).
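If you need extra Python packages in your notebooks, you can install them into this environment yourself; the package names below are examples only. Calling the venv's own `pip` ensures the packages land in `~/jupyterlab_venv` even if another Python happens to be first on your `PATH`:

```{bash, eval=FALSE, echo=TRUE}
# Install additional packages into the default JupyterLab environment.
~/jupyterlab_venv/bin/pip install polars plotnine
```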
#### 7B.1 Setting an Alternative venv for JupyterLab

Although the `~/jupyterlab_venv` environment is used by default, you can specify a different virtualenv if you wish. You can do this by modifying the `jupyter_helper.sh` script and passing a value for the `--venv` parameter to the `jupyterlab.sif` container.

This would look like:
```{bash, eval=FALSE, echo=TRUE}
sbatch <<==SBATCH=SCRIPT==
apptainer run \
  --bind /biostats_share:/biostats_share \
  /biostats_share/hillandr/containers/jupyterlab.sif \
  --notebook-dir "$HOME" \
  --venv "$HOME/my_custom_venv"
==SBATCH=SCRIPT==
```
The second-to-last line, `--venv "$HOME/my_custom_venv"`, sets the path to the alternative venv. If this path already exists, the launch script will attempt to activate and use this virtual environment. If the path does *not* exist, it will be created and initialized with a default set of packages, which includes JupyterLab.

**NOTE:** If you do attempt to use a pre-existing virtual environment, ensure that JupyterLab is installed *before* running `jupyter_helper.sh`, as this script will fail if no JupyterLab installation is found.

#### 7B.2 Setting an Alternative home

By default, the JupyterLab interface will use your `$HOME` (`~`) directory as the launch directory. If you wish to use a different directory, you can change the `--notebook-dir` argument in `jupyter_helper.sh`.

### 7C. RStudio Options

The RStudio container uses an installation of R 4.5.0, which is the latest version at the time of writing. By default, the container comes with many libraries pre-installed, including the `tidyverse` family of packages. Additionally, the container has many development libraries installed, which should allow you to compile most R packages from source if required.

#### 7C.1 Renv Usage

Although the container has many libraries pre-installed, I highly recommend using `renv` in each of your RStudio projects. You can run `renv::init()` from an RStudio project to initialize `renv`.

#### 7C.2 Changing the R Version

By default, the container uses R-4.5.0, but R-4.4.3 is also installed for compatibility. To change the version of R used when launching RStudio, you can change the `R_PATH` variable defined in the `rstudio_helper.sh` script:

```{bash, eval=FALSE, echo=TRUE}
# Default version of R is 4.5.0.
#R_PATH="/opt/R/4.5.0/bin/R"
R_PATH="/opt/R/4.4.3/bin/R"
```
Changing this variable changes the path passed to the `--r-path` argument later in the script. Note that these paths are relative to the container, not to the cluster's filesystem. Currently, the only two supported paths (and versions of R) are:
| Version | Path |
|---------|------|
| R-4.5.0 (default) | `/opt/R/4.5.0/bin/R` |
| R-4.4.3 | `/opt/R/4.4.3/bin/R` |
If you need an alternative version of R installed, open an issue in the GitHub repository for this project and I can add it to the container.
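To double-check which R builds are present, you can query them from the Terminal pane inside a running RStudio session (the paths come from the table above):

```{bash, eval=FALSE, echo=TRUE}
# Run from the RStudio Terminal pane (i.e., inside the container).
/opt/R/4.5.0/bin/R --version
/opt/R/4.4.3/bin/R --version
```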
If you are curious as to how the RStudio/JupyterLab interfaces are configured, this section provides a brief explanation.
The challenge with running RStudio/JupyterLab on the cluster is that the RStudio/JupyterLab server needs to run on a compute node, which is not directly accessible (via the network) from your local machine. Additionally, hosting a public web server on these nodes could present a security risk.
knitr::include_graphics("figures/CIDA_BIOS_Cluster_Interactive_IDEs/biostats_hpc_rstudio_diagram.png")
To facilitate a secure connection, each RStudio/JupyterLab job creates a link between your (local) computer and the compute node running the job via a pair of SSH tunnels.
The first SSH tunnel is configured locally on your machine (see the First-time Setup section above):
This SSH tunnel forwards any local traffic on `localhost:8895` or `localhost:8896` to specific Unix sockets (not ports) on the `csphbiostats` head node. We use Unix sockets for this since they can be protected with ordinary file permissions and avoid conflicts over shared TCP ports on the head node.
The second SSH tunnel is opened by the RStudio/JupyterLab job on the compute node. Once the job begins executing on the compute node, the RStudio/JupyterLab launch script will:

- Start the RStudio/JupyterLab server, listening on a Unix socket on the compute node.
- Open an SSH tunnel back to the `csphbiostats` node, which forwards all traffic on the `csphbiostats` node's Unix socket to the compute node's Unix socket (which is hosting the JupyterLab/RStudio instance).

In order for this to work seamlessly, the compute node needs to be able to SSH into the head node without prompting for a password. The launch script takes care of this by generating an SSH keypair (located at `~/.ssh/cluster_rsa`) which the compute node may use to log in to the head node.
Together, these two SSH tunnels form a connection between your browser (visiting `localhost:8895` or `localhost:8896`) and the RStudio/JupyterLab instance running on the compute node.
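Conceptually, the tunnel opened from the compute node back to the head node looks something like the command below. This is only an illustrative sketch; the real socket paths and options are managed by the launch script, and the compute-node socket name shown here is hypothetical:

```{bash, eval=FALSE, echo=TRUE}
# Illustration only: reverse-forward the head node's Unix socket to a
# (hypothetical) socket on the compute node, authenticating with the
# generated ~/.ssh/cluster_rsa key.
ssh -i ~/.ssh/cluster_rsa -N -f \
    -R /tmp/rstudio-<your_username>.sock:/tmp/rstudio-compute.sock \
    csphbiostats.ucdenver.pvt
```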
In summary, the container's launch script will:

- Verify that the SSH keypair at `~/.ssh/cluster_rsa` exists (generating it if necessary).
- Determine whether the job is running on `csphbiostats` or a compute node (any node which is not `csphbiostats`).
- If running directly on `csphbiostats`, skip the second SSH tunnel, since your local machine will already forward traffic directly to the configured Unix socket.