wait_on_slurm_job_id: Wait for a Slurm job to finish (find 'RUNNING|PENDING' State jobs)

View source: R/wait_on_slurm_job_id.R

wait_on_slurm_job_id    R Documentation

Wait for a Slurm job to finish (find 'RUNNING|PENDING' State jobs).

Description

Option to break if some jobs fail (find 'FAILED' State jobs).

Usage

wait_on_slurm_job_id(
  job_id,
  initial_sleep_sec = 30,
  cycle_sleep_sec = 30,
  filter_by = c("jobidraw"),
  filter_regex = NULL,
  break_on_failure = FALSE,
  dryrun = FALSE,
  batch_size = 500
)
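
A minimal usage sketch (the job IDs and sleep values below are hypothetical; assumes the jobs were already submitted with 'sbatch'):

```r
# Hypothetical JobIDs returned by prior sbatch submissions
job_ids <- c(1234567L, 1234568L)

# Block until both jobs leave the RUNNING/PENDING states,
# erroring out early if any job reaches a FAILED state
wait_on_slurm_job_id(
  job_id            = job_ids,
  initial_sleep_sec = 60,  # give Slurm time to register the jobs
  cycle_sleep_sec   = 30,  # poll sacct every 30 seconds
  break_on_failure  = TRUE
)
```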

Arguments

job_id

[int] a Slurm 'JobID' (single or vector)

initial_sleep_sec

[int] how long to sleep before initial check for jobs on cluster

cycle_sleep_sec

[int] how long to wait between checks

filter_by

[chr] vector of sacct fields to search e.g. 'c("User", "JobName")' (case insensitive)

filter_regex

[regex] required if 'filter_by' includes '"JobName"' or '"Account"'

break_on_failure

[lgl] if _any_ of your jobs fail, should this function break? Failure itself is always determined based on the user's filters, but failure _feedback_ always returns 'JobID's, regardless of filtering, due to how jobs are initially queried. This may include unwanted recycled 'JobID's, and it is **_up to the user_** to determine which are relevant to their work.

dryrun

[lgl] return a list of commands built by this function, but do not wait on jobs (only returns command for the first batch - see 'batch_size')

batch_size

[int] how many jobs to group together to wait on (grep limitation in max number, 500 is a good default, could increase near 900). All jobs in batch 1 must finish before batch 2 is checked.

Details

Option to filter 'sacct' results by multiple fields with 'grep -P'.

Works in batches to accommodate 'grep' limitations (default 500 'job_id's per batch); all batch 1 jobs must be checked before moving to batch 2.

First find all jobs with a given base 'JobID' (fastest search method):

- default behavior: next filter for 'JobID' matching 'JobIDRaw'
- there may be duplicate 'JobIDRaw's, so you could also filter by active 'User'

If you are submitting _array jobs_, they may overlap with old 'JobID's:

- you'll get one 'JobID' back from the system when you submit an 'sbatch' for an array
- it will only match a single 'JobIDRaw'
- e.g. '1234' is returned for an array of '1234_1' and '1234_2', which have 'JobIDRaw' of '1234' and '1235' under the hood
- if you filtered on 'JobIDRaw', you'd only find the first array job, and miss the others
- instead of filtering on 'JobIDRaw', it's probably more helpful to filter on 'User' and/or 'JobName'
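
For the array-job case above, a hedged sketch of filtering by user and job name instead of 'JobIDRaw' (the base 'JobID' and 'JobName' pattern below are hypothetical):

```r
# JobIDRaw filtering would miss array tasks 1234_2, 1234_3, ...,
# so match on the submitting user and the sbatch job name instead
wait_on_slurm_job_id(
  job_id           = 1234L,  # base JobID returned by sbatch for the array
  filter_by        = c("User", "JobName"),
  filter_regex     = "my_array_job",  # hypothetical JobName pattern
  break_on_failure = TRUE
)
```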

**NOTE:** Slurm recycles the 'JobID' field, which may cause ambiguity between the user's current job and another user's prior job. This 'JobID' may further share an ID with a prior, recycled array job. To resolve this fundamental weakness, the user may filter on various 'sacct' fields. Supported fields are listed below (case-insensitive).

- filtering is somewhat limited compared to data.frames ('grep' limitation)
- all specified fields are filtered simultaneously, rather than individually

Currently supported:

- 'NULL' - apply no filters
- '"JobIDRaw"' - **default option**
  - strictly filter for specified 'JobID' = 'JobIDRaw'
  - filter to strictly include single jobs and _exclude any array jobs_ that may match the base 'JobID'
  - will filter to the most recent unique 'JobIDRaw' if duplicates exist (Slurm behavior at time of writing)
- '"User"' - filter to only the current active user's jobs
  - **NOTE:** The 'User' field can only find the active user in Rstudio
  - Cannot find other 'User's - Singularity container limitation (returns 'nobody')
- '"JobName"' - filter according to a 'filter_regex' regex pattern
- '"Account"' - filter according to a 'filter_regex' regex pattern

Slurm 'sacct' field documentation:

- https://slurm.schedmd.com/sacct.html#OPT_format
- https://slurm.schedmd.com/sacct.html#OPT_helpformat
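
Because the exact 'sacct'/'grep' invocation is an internal detail, 'dryrun = TRUE' can be used to inspect what would run without waiting on anything; a sketch (the 'Account' regex is hypothetical, and only the first batch's command is returned):

```r
# Build and return the shell command(s) without polling the cluster
cmds <- wait_on_slurm_job_id(
  job_id       = 1234L,
  filter_by    = c("Account"),
  filter_regex = "^proj_epi$",  # hypothetical Slurm account name
  dryrun       = TRUE
)
print(cmds)
```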

Value

[std_out/std_err] 'std_out' reports sleep cycle duration and successful completion; 'std_err' prints failed 'JobID's


epi-sam/SamsElves documentation built on June 12, 2025, 7 a.m.