wait_on_slurm_job_id: Wait for a Slurm job to finish (find 'RUNNING|PENDING' State jobs)

View source: R/wait_on_slurm_job_id.R

wait_on_slurm_job_id    R Documentation

Wait for a Slurm job to finish (find 'RUNNING|PENDING' State jobs).

Description

Option to break if some jobs fail (find 'FAILED' State jobs).

Usage

wait_on_slurm_job_id(
  job_id,
  initial_sleep_sec = 30,
  cycle_sleep_sec = 30,
  filter_by = c("jobidraw"),
  filter_regex = NULL,
  break_on_failure = FALSE,
  dryrun = FALSE,
  batch_size = 500
)
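
A minimal usage sketch (the job IDs and sleep values below are hypothetical; assumes the jobs were already submitted with 'sbatch'):

```r
# Hypothetical JobIDs returned by prior sbatch submissions
job_ids <- c(1234567L, 1234568L)

# Block until both jobs leave the RUNNING/PENDING states,
# erroring out early if any job reaches a FAILED state
wait_on_slurm_job_id(
  job_id            = job_ids,
  initial_sleep_sec = 60,  # give Slurm time to register the jobs
  cycle_sleep_sec   = 30,  # poll sacct every 30 seconds
  break_on_failure  = TRUE
)
```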

Arguments

job_id

[int] a Slurm 'JobID' (single or vector)

initial_sleep_sec

[int] how long to sleep before initial check for jobs on cluster

cycle_sleep_sec

[int] how long to wait between checks

filter_by

[chr] vector of sacct fields to search e.g. 'c("User", "JobName")' (case insensitive)

filter_regex

[regex] required if 'filter_by' includes '"JobName"' or '"Account"'

break_on_failure

[lgl] if _any_ of your jobs fail, should this function break? Failure itself is always determined based on the user's filters, but failure _feedback_ always returns 'JobID's, regardless of filtering, due to how jobs are initially queried. This may include unwanted recycled 'JobID's, and it is **_up to the user_** to determine which are relevant to their work.

dryrun

[lgl] return a list of commands built by this function, but do not wait on jobs (only returns command for the first batch - see 'batch_size')

batch_size

[int] how many jobs to group together to wait on (grep limitation in max number, 500 is a good default, could increase near 900). All jobs in batch 1 must finish before batch 2 is checked.

Details

Option to filter 'sacct' results by multiple fields with 'grep -P'.

Works in batches to accommodate 'grep' limitations (default 500 'job_id's per batch); all batch 1 jobs must be checked before moving to batch 2.

First find all jobs with a given base 'JobID' (fastest search method):

- default behavior: next filter for 'JobID' matching 'JobIDRaw'
- there may be duplicate 'JobIDRaw's, so you could also filter by active 'User'

If you are submitting _array jobs_, they may overlap with old 'JobID's:

- you'll get one 'JobID' back from the system when you submit an 'sbatch' for an array
- it will only match a single 'JobIDRaw'
- e.g. '1234' is returned for an array of '1234_1' and '1234_2', which have 'JobIDRaw' of '1234' and '1235' under the hood
- if you filtered on 'JobIDRaw', you'd only find the first array job, and miss the others
- instead of filtering on 'JobIDRaw', it's probably more helpful to filter on 'User' and/or 'JobName'
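
For the array-job case above, a hedged sketch of filtering by user and job name instead of 'JobIDRaw' (the base 'JobID' and 'JobName' pattern below are hypothetical):

```r
# JobIDRaw filtering would miss array tasks 1234_2, 1234_3, ...,
# so match on the submitting user and the sbatch job name instead
wait_on_slurm_job_id(
  job_id           = 1234L,  # base JobID returned by sbatch for the array
  filter_by        = c("User", "JobName"),
  filter_regex     = "my_array_job",  # hypothetical JobName pattern
  break_on_failure = TRUE
)
```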

**NOTE:** Slurm recycles the 'JobID' field, which may cause ambiguity between the user's current job and another user's prior job. This 'JobID' may further share an ID with a prior, recycled array job. To resolve this fundamental weakness, the user may filter on various 'sacct' fields. Supported fields are listed below (case-insensitive).

- filtering is somewhat limited compared to data.frames ('grep' limitation)
- all specified fields are filtered simultaneously, rather than individually

Currently supported:

- 'NULL' - apply no filters
- '"JobIDRaw"' - **default option**
  - strictly filter for specified 'JobID' = 'JobIDRaw'
  - filter to strictly include single jobs and _exclude any array jobs_ that may match the base 'JobID'
  - will filter to the most recent unique 'JobIDRaw' if duplicates exist (Slurm behavior at time of writing)
- '"User"' - filter to only the current active user's jobs
  - **NOTE:** The 'User' field can only find the active user in Rstudio
  - Cannot find other 'User's - Singularity container limitation (returns 'nobody')
- '"JobName"' - filter according to a 'filter_regex' regex pattern
- '"Account"' - filter according to a 'filter_regex' regex pattern

Slurm 'sacct' field documentation:

- https://slurm.schedmd.com/sacct.html#OPT_format
- https://slurm.schedmd.com/sacct.html#OPT_helpformat
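
Because the exact 'sacct'/'grep' invocation is an internal detail, 'dryrun = TRUE' can be used to inspect what would run without waiting on anything; a sketch (the 'Account' regex is hypothetical, and only the first batch's command is returned):

```r
# Build and return the shell command(s) without polling the cluster
cmds <- wait_on_slurm_job_id(
  job_id       = 1234L,
  filter_by    = c("Account"),
  filter_regex = "^proj_epi$",  # hypothetical Slurm account name
  dryrun       = TRUE
)
print(cmds)
```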

Value

[std_out/std_err] 'std_out' reports sleep cycle duration and successful completion; 'std_err' prints failed 'JobID's


epi-sam/SamsElves documentation built on June 12, 2025, 7 a.m.