notes/batch_notes.md

Overview

This document describes the setup process to allow you to run jobs on remote machines directly from your local machine.

Fixme: earnserv1 is dead

Set up your remote account

SHARCnet (graham.mcmaster.ca)

You need an active Compute Canada account, which then allows you to request a SHARCnet account, which needs to be verified by a PI. My (BMB) CC identifier is cze-501, in case that's useful. Once you get your account, I believe you have to sit through a training webinar before you can be certified to use more than 8 CPUs at a time or run jobs longer than 24 hours. (After that the cap goes up to 256 CPUs.)

yushan (jdserv.mcmaster.ca)

Ask Jonathan Dushoff for access.

earnserv (earnserv[12].mcmaster.ca)

Ask David Earn for access. (If you have access, your username and password are shared with the math & stats (ms.mcmaster.ca) server.)

If necessary, set up an RSA key pair: see here

Use your own judgment/degree of paranoia about whether to password-protect your RSA key and/or set up a local key-storage daemon such as ssh-agent (so you can have the security of a password-protected key without the hassle of entering your password multiple times per session).
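The key setup described above can be sketched as follows. This is a minimal illustration, not a copy of the linked instructions: the scratch directory and the file name id_rsa_demo are hypothetical, and the empty passphrase (-N '') is used only so that the sketch runs unattended. Drop -N to be prompted for a passphrase instead.

```shell
## Sketch: generate an RSA key pair in a scratch directory
## (hypothetical location; a real key would normally live in ~/.ssh).
demo_dir=$(mktemp -d)
keyfile="$demo_dir/id_rsa_demo"

if command -v ssh-keygen >/dev/null 2>&1; then
    ## -t rsa: key type; -N '': empty passphrase (demo only);
    ## -f: output file; -q: quiet
    ssh-keygen -t rsa -N '' -f "$keyfile" -q
    ls "$demo_dir"    ## private key plus the matching .pub file
else
    echo "ssh-keygen not found; install OpenSSH first" >&2
fi

## With a passphrase-protected key, an agent caches the unlocked key:
##   eval "$(ssh-agent -s)"    ## start the agent for this session
##   ssh-add ~/.ssh/id_rsa     ## prompts for the passphrase once
```

The commented-out ssh-agent lines show the usual way to get the "enter the password once per session" behaviour mentioned above.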

Copy SSH public key (from local machine to remote machine) [one time]

You need this for passwordless communication from your local machine to the remote machine.

## id_rsa.pub is the ssh-keygen default name; substitute your own public key file if it differs
ssh-copy-id -i ~/.ssh/id_rsa.pub USERNAME@REMOTE

Set up VPN (install once on local machine; run every session on local machine)

Store GitHub credentials (on remote machine)

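This lets you pull from GitHub onto the remote machine without entering credentials each time. Don't do this if you're paranoid/careful: the file stores your credentials in plain text. (GitHub stopped accepting account passwords for Git operations over HTTPS in August 2021, so the password field now needs to hold a personal access token.)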

In your home directory:

cat >~/.netrc <<EOF
machine github.com login YOURGITHUBNAME password YOURGITHUBPASSWORD
EOF
chmod 600 ~/.netrc  ## make file readable/writable by owner only
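To see what the permission setting does, here is a minimal sketch that writes a .netrc-style file to a scratch location and inspects its mode. The scratch path is hypothetical and the credentials are the same placeholders as above; nothing is written to your real ~/.netrc.

```shell
## Sketch: create a credentials file that only its owner can read or write.
netrc_demo="$(mktemp -d)/netrc_demo"    ## hypothetical scratch path, not ~/.netrc

## umask 077 inside the subshell means the file is never group- or
## world-readable, even for the instant between creation and chmod.
(
    umask 077
    cat > "$netrc_demo" <<EOF
machine github.com login YOURGITHUBNAME password YOURGITHUBPASSWORD
EOF
)
chmod 600 "$netrc_demo"    ## octal 600 = rw for owner, nothing for group/other

## The first ten characters of ls -l are the file type and permission bits.
perms=$(ls -l "$netrc_demo" | cut -c1-10)
echo "$perms"    ## -rw-------
```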

Optional (yushan only, one time): set up passwordless communication from head node to servers

Clone the repo (on remote machine; one time)

git clone https://github.com/mac-theobio/PHAC_covid.git

(I normally use git@ rather than https://, but I think we can't use git@ unless we also set up an SSH private key on the remote machine for talking to GitHub.)

Create personal library (on remote machine; one time)

## get major.minor version (e.g. 4.0); splitting on "." also handles two-digit minor versions
mver=$(R --version | grep "R version" | awk '{print $3}' | cut -d. -f1,2)
mkdir -p ~/R/x86_64-pc-linux-gnu-library/"$mver"
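As a quick check of the version-parsing step, the sketch below applies a field-based split to sample version banners instead of a live R installation. The function name r_major_minor and the second sample string are hypothetical; the point is that splitting on the dot keeps a two-digit minor version intact.

```shell
## Sketch: reduce an "R version x.y.z ..." banner line to "x.y".
r_major_minor() {
    ## $1: the first line of `R --version` output
    printf '%s\n' "$1" | awk '/R version/ {print $3}' | cut -d. -f1,2
}

v1=$(r_major_minor 'R version 4.0.0 (2020-04-24) -- "Arbor Day"')
v2=$(r_major_minor 'R version 4.10.1 (hypothetical future release)')
echo "$v1"    ## 4.0
echo "$v2"    ## 4.10

## The personal library path is then built as in the notes:
echo "$HOME/R/x86_64-pc-linux-gnu-library/$v1"
```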

Load modules (on graham, every session)

This is built into the snmake alias, so you don't actually need it unless you're working interactively (e.g. for debugging, or for the step below).

module load nixpkgs/16.09  gcc/8.3.0; module load r/4.0.0

Installing packages (on remote machine; once)

On graham, this has to be done on the head node (which should be OK, it's not too intensive) or we have to download all the relevant tarballs and install locally. The latter is a pain in the butt because there isn't/I don't know of a good way to handle dependencies/ordering properly.

(Let's hope we don't need Stan! (Would need to install dependencies; download tarball; set up a batch job with enough memory & time to install it))

Setting up paths (on yushan, one time)

Non-interactive logins (running a single command through ssh) work a little differently from interactive shell logins, and on yushan they don't work quite as expected. Create a .bashrc file in your home directory if you don't have one already, and add the following lines to it:

export PATH=$PATH:/usr/local/sge/2011.11p1/bin/linux-x64
export SGE_ROOT=/usr/local/sge/2011.11p1

To test this, try ssh USERNAME@yushan qstat -g c from your local machine (with the VPN running).

(Presumably a configuration issue RHPCS could fix centrally ... ?)
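One way to add the two export lines above without duplicating them on repeated runs is sketched below. The helper name append_once is hypothetical, and the sketch writes to a scratch file; point it at ~/.bashrc to apply it for real.

```shell
## Sketch: append a line to a file only if that exact line is not already there.
append_once() {
    local file=$1 line=$2
    touch "$file"
    ## -q quiet, -x whole-line match, -F fixed string (no regex interpretation)
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

rc_file=$(mktemp)    ## stand-in for ~/.bashrc
for i in 1 2; do     ## run twice to show that nothing is duplicated
    append_once "$rc_file" 'export PATH=$PATH:/usr/local/sge/2011.11p1/bin/linux-x64'
    append_once "$rc_file" 'export SGE_ROOT=/usr/local/sge/2011.11p1'
done
cat "$rc_file"
```

The single quotes matter: they keep $PATH literal so that it is expanded when .bashrc is read, not when the line is written.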

Make cache on remote machine

To make the wrapR make rules work when Dropbox isn't set up on the remote machine, create the cache directory by hand:

ssh REMOTEHOST mkdir $PHACDIR/cache
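Note that in a command of this form the local shell expands $PHACDIR before ssh runs, so the variable has to be set on your local machine. The sketch below illustrates the difference between local and remote-side expansion. It uses a child bash process as a stand-in for the remote shell (an assumption made so that the example runs without a network connection), and the path /home/USERNAME/PHAC_covid is hypothetical.

```shell
## Sketch: where does $PHACDIR get expanded?
unset PHACDIR                        ## start clean (also drops any export flag)
PHACDIR=/home/USERNAME/PHAC_covid    ## set locally, deliberately NOT exported

## Unquoted or double-quoted: the local shell substitutes the value first,
## so the "remote" side receives the full path.
local_expanded=$(bash -c "echo $PHACDIR/cache")

## Single-quoted: the literal text $PHACDIR is passed through and expanded by
## the child shell, where the variable is unset (as it would be on a remote
## host that does not define it).
remote_expanded=$(bash -c 'echo $PHACDIR/cache')

echo "$local_expanded"     ## /home/USERNAME/PHAC_covid/cache
echo "$remote_expanded"    ## /cache
```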

Local configuration (on local machine)

(From here on I'm assuming you're using bash as your shell.)

Create the batch_setup file that defines your username and the location of your PHAC_covid directory on the remote machines (yushan/SHARCnet/earnserv), as well as the maximum number of cores to use for local runs. Mine looks like this:

## default user name
export USER=`whoami`
export WD=PHAC_covid
## username and location for yushan (change defaults if necessary)
export YUSHAN_USER=bbolker  ## override default
export YUSHAN_WD=$WD
export SN_USER=$USER
export SN_WD=$WD
export EARNSERV_USER=$USER
export EARNSERV_WD=$WD
export earnserv_MAXCORES=10
## max number of cores to use locally
export MAXCORES=5
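Before launching jobs, it can help to confirm that the variables are actually defined. The sketch below is one way to do that in bash; the function name check_batch_vars and the variable name NOT_DEFINED_ANYWHERE are hypothetical, and a few values from the example above are set inline rather than sourced from a real batch_setup file.

```shell
## Sketch: report which of the expected configuration variables are unset or empty.
check_batch_vars() {
    local name missing=0
    for name in "$@"; do
        ## ${!name} is bash indirect expansion: the value of the variable named by $name
        if [ -z "${!name:-}" ]; then
            echo "missing: $name" >&2
            missing=$((missing + 1))
        fi
    done
    return "$missing"    ## 0 (success) only if nothing is missing
}

## Stand-in for `source batch_setup`
export WD=PHAC_covid
export YUSHAN_USER=bbolker
export YUSHAN_WD=$WD
export MAXCORES=5

check_batch_vars YUSHAN_USER YUSHAN_WD MAXCORES && echo "all set"
check_batch_vars NOT_DEFINED_ANYWHERE || echo "at least one variable is missing"
```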

Running jobs (on local machine)

Testing/troubleshooting

Files

To do/issues

Further notes/comments

The current system is set up to use furrr/future.batchtools as an interface to whatever batch system is running on a given server (i.e. SLURM scheduler [SHARCnet], SGE scheduler [yushan], nothing [earnserv]). The following components need to fit together:

There are a variety of ways one can interact with the scheduler to submit a bunch of jobs:



bbolker/McMasterPandemic documentation built on Aug. 25, 2024, 6:35 p.m.