```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The SGE accounting file holds information on all finished jobs that were submitted to the SGE scheduler. Queued or currently running jobs are not included in this file. A job is considered "finished" if it completed successfully, terminated due to an error, or was cancelled. The file is a colon-delimited text file with one job entry per row, with the most recently finished job appended at the end. Contrary to what one might expect, the file is not perfectly ordered by the `end_time` of the jobs; we might find consecutive entries whose `end_time` values differ by a few seconds in the "wrong" order. It is not clear to me why this is, but it might be that the scheduler updates the SGE accounting file at regular intervals, say every few minutes, and, when it does, goes through the jobs in order of job index.
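To get a feel for the raw format, we can peek at the first few entries with base R. This is only a sketch: it assumes the file starts with four header lines (matching the `skip = 4L` used when indexing below) and uses plain `readLines()` and `strsplit()`, so it is not meant for the full multi-gigabyte file.

```r
## Sketch: inspect the raw, colon-delimited format of the first few entries.
## Assumes four header lines, as in the indexing code further below.
library(wyntonquery)
pathname <- sge_accounting_file()
lines <- readLines(pathname, n = 10L)
entries <- lines[-seq_len(4L)]                        ## drop the header lines
fields <- strsplit(entries, split = ":", fixed = TRUE)
lengths(fields)                                       ## number of fields per job entry
```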
In March 2020, the Wynton SGE accounting file was ~12 GB and took 6-8 minutes to read. In June 2021, it was ~58 GB and had 161 million job entries. In October 2021, it was ~69 GB and had 192 million job entries. The accounting file was "rolled over" on 2022-03-02, that is, the old file was renamed and replaced by a new one. As of December 2024, the "new" accounting file was ~112 GB and had 300 million job entries.
As seen below, in May 2025, the new accounting file was ~131 GB and had ~353 million job entries.
It is rare to be interested in all entries; more commonly, we work with a subset of the job entries. To do this efficiently, we start by indexing the SGE accounting file to identify the file byte offset of each job entry;
```r
library(wyntonquery)
library(progressr)
handlers(global = TRUE)  ## Report on progress
handlers("cli")

pathname <- sge_accounting_file()
cat(sprintf("File size: %.3g GB\n", file.size(pathname)/1000^3))
#> File size: 131 GB

## It takes ~20 minutes to index a 131 GB SGE accounting file.
## Indexing requires only a small amount of memory, i.e. only
## a portion of the accounting file is in memory at any time.
index <- make_file_index(pathname, skip = 4L)
cat(sprintf("Number of job entries: %d\n", length(index)))
#> Number of job entries: 352616562

## Save per-job index to file
save_file_index(index, file = "accounting.index_by_row")
cat(sprintf("File size: %.3g GB\n", file.size("accounting.index_by_row")/1000^3))
#> File size: 2.82 GB
```
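Conceptually, the index produced above is just the byte offset at which each job entry starts. As a rough illustration only, and not how `make_file_index()` is implemented, here is how such offsets could be computed with base R for a small, hypothetical file; the real indexing works through the file in pieces so that only a portion of it is in memory at any time.

```r
## Rough illustration (not the actual implementation): byte offsets of the
## job entries in a small, hypothetical accounting-style file 'small_file'.
## Assumes single-byte ("\n") line endings.
lines <- readLines(small_file)
sizes <- nchar(lines, type = "bytes") + 1L         ## +1 byte for each newline
offsets <- cumsum(c(0L, sizes[-length(sizes)]))    ## start offset of each line
offsets <- offsets[-seq_len(4L)]                   ## skip the four header lines
```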
Next, we want to group the job entries by the ISO 8601 week of the job end times.
```r
library(wyntonquery)
library(progressr)
handlers(global = TRUE)  ## Report on progress
handlers("cli")

pathname <- sge_accounting_file()
index <- read_file_index("accounting.index_by_row")
cat(sprintf("Number of job entries: %d\n", length(index)))
#> Number of job entries: 352616562

## It takes ~50 minutes to build the week index for 350 million entries
week_index <- sge_make_week_index(pathname, index = index)
saveRDS(week_index, file = "accounting.index_by_week.rds")
cat(sprintf("File size: %.3g kB\n", file.size("accounting.index_by_week.rds")/1000))
#> File size: 2.4 kB

range(week_index$week, na.rm = TRUE)
#> [1] "2022W09" "2025W20"

print(week_index)
#> # A tibble: 166 × 3
#>    week    nbr_of_jobs file_offset
#>    <chr>         <dbl>       <dbl>
#>  1 NA               28        1188
#>  2 2022W09     1177975        9628
#>  3 2022W10     1578885   421146956
#>  4 2022W11     1672371  1000012580
#>  5 2022W12      843393  1617407461
#>  6 2022W13     1688776  1930692024
#>  7 2022W14     1324258  2551671766
#>  8 2022W15     1413045  3044101933
#>  9 2022W16     1435363  3567520111
#> 10 2022W17     1825431  4101619930
#> # ℹ 156 more rows
#> # ℹ Use `print(n = ...)` to see more rows
```
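For reference, the week labels above ("2022W09", ..., "2025W20") follow the ISO 8601 week-numbering convention, which base R can produce directly from a timestamp. The small helper below is only an illustration and is not part of wyntonquery:

```r
## ISO 8601 week label ("YYYYWww") from a timestamp, using base R:
## %G is the ISO week-based year and %V the ISO week number (01-53).
iso_week <- function(x) format(x, "%GW%V")
iso_week(as.POSIXct("2025-04-27 23:59:59"))  ## Sunday -> "2025W17"
iso_week(as.POSIXct("2025-04-28 00:00:01"))  ## Monday -> "2025W18"
```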
Here is an example of how we can read the job entries for a couple of weeks:
```r
library(wyntonquery)
library(dplyr)

pathname <- sge_accounting_file()
week_index <- readRDS("accounting.index_by_week.rds")
weeks <- subset(week_index, week %in% c("2025W18", "2025W19"))
print(weeks)
#> # A tibble: 2 × 3
#>   week    nbr_of_jobs  file_offset
#>   <chr>         <dbl>        <dbl>
#> 1 2025W18     3723095 129069474766
#> 2 2025W19     1623099 130432327795

offset <- weeks$file_offset[1]
n_max <- sum(weeks$nbr_of_jobs)
cat(sprintf("Number of job entries to read: %d\n", n_max))
#> Number of job entries to read: 5346194

## It takes ~45 seconds to read the ~5.3 million job entries of interest
jobs <- read_sge_accounting(pathname, offset = offset, n_max = n_max)

## We anonymize the content so we can share it publicly
jobs <- anonymize(jobs)

print(head(select(jobs, -account)))
#> # A tibble: 6 × 44
#>   qname    hostname  group    owner    job_name job_number priority submission_time
#>   <chr>    <chr>     <chr>    <chr>    <chr>         <int>    <int> <dttm>
#> 1 member.q qb3-id284 group005 owner082 batch_3d    3439679        0 2025-04-27 23:49:43
#> 2 member.q qb3-id244 group005 owner082 batch_3d    3439671        0 2025-04-27 23:47:42
#> 3 member.q qb3-id202 group005 owner082 batch_3d    3439671        0 2025-04-27 23:47:42
#> 4 member.q qb3-id333 group005 owner082 batch_3d    3439671        0 2025-04-27 23:47:42
#> 5 long.q   qb3-as9   group005 owner082 batch_3d    3439679       19 2025-04-27 23:49:43
#> 6 long.q   qb3-id290 group005 owner082 batch_3d    3439680       19 2025-04-27 23:49:45
#> # ℹ 36 more variables: start_time <dttm>, end_time <dttm>, failed <int>,
#> #   exit_status <int>, ru_wallclock <drtn>, ru_utime <drtn>, ru_stime <drtn>,
#> #   ru_maxrss <chr>, ru_ixrss <chr>, ru_ismrss <chr>, ru_idrss <chr>,
#> #   ru_isrss <chr>, ru_minflt <dbl>, ru_majflt <dbl>, ru_nswap <dbl>,
#> #   ru_inblock <dbl>, ru_oublock <dbl>, ru_msgsnd <dbl>, ru_msgrcv <dbl>,
#> #   ru_nsignals <dbl>, ru_nvcsw <dbl>, ru_nivcsw <dbl>, project <chr>,
#> #   department <chr>, granted_pe <chr>, slots <int>, task_number <int>, …
```
```r
period <- range(jobs$end_time, na.rm = TRUE)
cat(sprintf("Period: %s/%s\n", period[1], period[2]))
#> Period: 2025-04-27 23:59:59/2025-05-11 23:59:58

jobs <- add_weeks(jobs)
period <- range(jobs$end_time_week, na.rm = TRUE)
cat(sprintf("Period (weeks): %s/%s\n", period[1], period[2]))
#> Period (weeks): 2025W17/2025W19

cat(sprintf("Number of jobs finished during this period: %d\n", nrow(jobs)))
#> Number of jobs finished during this period: 5346194

nusers <- length(unique(jobs$owner))
cat(sprintf("Number of unique users: %d\n", nusers))
#> Number of unique users: 301
```
Let's see how many of the jobs finished successfully and how many failed.
```r
## Get successful and failed jobs
groups <- list(
  success = subset(jobs, failed == 0L),
  fail    = subset(jobs, failed > 0L)
)
stats <- sapply(groups, nrow)
print(stats)
#> success    fail
#> 5247273   98921

print(stats/sum(stats))
#>    success       fail
#> 0.98149693 0.01850307
```
We see that the failure rate among the ~5.3 million jobs that ran during this period was ~1.9%. Next, let's see how much CPU time this corresponds to:
```r
## CPU time consumed
cpu <- lapply(groups, function(jobs) {
  d <- sum(jobs$cpu)
  units(d) <- "days"
  d
})
total <- tibble(outcome = names(cpu), cpu = do.call(c, cpu))
cat(sprintf("Total CPU processing time: %.1f %s\n", sum(total$cpu), units(total$cpu)))
#> Total CPU processing time: 159065.0 days

print(total)
#> # A tibble: 2 × 2
#>   outcome cpu
#>   <chr>   <drtn>
#> 1 success 116262.86 days
#> 2 fail     42802.14 days

## CPU-time fractions
ratio <- mutate(total, cpu = { x <- as.numeric(cpu); x / sum(x) })
print(ratio)
#> # A tibble: 2 × 2
#>   outcome   cpu
#>   <chr>   <dbl>
#> 1 success 0.731
#> 2 fail    0.269
```
From this, we find that during these two weeks, ~27% of the CPU time was consumed by jobs that failed. Among the failed jobs, the failure code was distributed as:
```r
codes <- groups$fail$failed
print(table(codes))
#> codes
#>     1     7     8    11    19    25    26    27    28    37   100
#>  8754     4     1     1    37  3319  1804     1    52 27438 57510
```
For details on these codes, see `help("read_sge_accounting", package = "wyntonquery")`, or:
```r
subset(sge_failed_codes(), Code %in% unique(codes), select = c(Code, Explanation))
#> # A tibble: 11 × 2
#>     Code Explanation
#>    <int> <chr>
#>  1     1 failed early in execd
#>  2     7 failed before prolog
#>  3     8 failed in prolog
#>  4    11 failed in shepherd before starting job
#>  5    19 shepherd didnt write reports correctly - probably program or machine crash
#>  6    25 ran, will be rescheduled
#>  7    26 failed opening stderr/stdout file
#>  8    27 failed finding specified shell
#>  9    28 failed changing to start directory
#> 10    37 ran, but killed due to exceeding run time limit
#> 11   100 ran, but killed by a signal (perhaps due to exceeding resources), task died, shepherd died (e.g. node crash), etc.
```
The jobs that exhausted their run-time limit (failure code 37) consumed ~23,773 days of CPU time;
```r
d <- sum(subset(groups$fail, failed == 37)$cpu); units(d) <- "days"
print(d)
#> Time difference of 23773.49 days
```
which corresponds to ~15% of all CPU time spent:
```r
as.numeric(d)/sum(as.numeric(total$cpu))
#> [1] 0.1494577
```
The jobs that failed because they were killed by the scheduler (failure code 100) consumed ~18,968 days of CPU time;
```r
d <- sum(subset(groups$fail, failed == 100)$cpu); units(d) <- "days"
print(d)
#> Time difference of 18967.52 days
```
which corresponds to ~12% of all CPU time spent:
```r
as.numeric(d)/sum(as.numeric(total$cpu))
#> [1] 0.1192438
```
In total, these two types of failures (codes 37 and 100) wasted ~26.9% of the CPU time spent, corresponding to 42,741 days;
```r
d <- sum(subset(groups$fail, failed %in% c(37, 100))$cpu); units(d) <- "days"
print(d)
#> Time difference of 42741.01 days

as.numeric(d)/sum(as.numeric(total$cpu))
#> [1] 0.2687015
```
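As a compact alternative to subsetting per failure code, the wasted CPU time can also be summarized per code in a single dplyr pipeline. This is only a sketch using the `failed` and `cpu` columns seen above; the output is omitted here.

```r
## Sketch: CPU days and number of jobs per failure code, sorted by CPU days.
library(dplyr)
wasted <- groups$fail |>
  group_by(failed) |>
  summarize(
    nbr_of_jobs = n(),
    cpu_days = as.numeric(sum(cpu), units = "days")
  ) |>
  arrange(desc(cpu_days))
print(wasted)
```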