Base queue object:

ctx <- context::context_save("contexts")
#> [ open:db   ]  rds
#> [ save:id   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ save:name ]  discretionary_stingray
obj <- didehpc::queue_didehpc(ctx)
#> Loading context 81b64478fe2182e45e83fec2156c6ec7
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]

My job has failed

My job status is ERROR

Caused by an error in your code

If your job status is ERROR, that probably indicates an error in your code. There are lots of possible reasons for this, and the first challenge is working out what happened.

t <- obj$enqueue(mysimulation(10))
#> (-) waiting for f416387...bac, giving up in 9.5 s (\) waiting for f416387...bac,
#> giving up in 9.0 s

This job will fail, and $status() will report ERROR:

t$status()
#> [1] "ERROR"

The first place to look is the result of the job itself. Unlike an error in your console, an error that happens on the cluster can be returned and inspected:

t$result()
#> <context_task_error in mysimulation(10): could not find function "mysimulation">

In this case the error is because the function mysimulation does not exist.
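
The usual fix is to define the function in a file that the context loads via its sources argument, then rebuild the queue. A minimal sketch, assuming mysimulation lives in a hypothetical file simulation.R:

```
# simulation.R must define mysimulation(); the sources argument tells
# context to load that file on the cluster before running your task
ctx <- context::context_save("contexts", sources = "simulation.R")
obj <- didehpc::queue_didehpc(ctx)
t <- obj$enqueue(mysimulation(10))
```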

The other place worth looking is the job log:

t$log()
#> [ hello     ]  2021-08-17 14:53:09
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:09.042
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3672
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ task      ]  f4163877615d5f6fa6fb6392b6739bac
#> [ expr      ]  mysimulation(10)
#> [ start     ]  2021-08-17 14:53:09.199
#> [ error     ]
#>     Error in mysimulation(10): could not find function "mysimulation"
#> [ end       ]  2021-08-17 14:53:09.308
#>     Error in context:::main_task_run() : Error while running task:
#>     Execution halted

Sometimes there will be additional diagnostic information there.

Here's another example:

t <- obj$enqueue(read.csv("c:/myfile.csv"))
#> (-) waiting for 9df1d9f...c74, giving up in 9.5 s (\) waiting for 9df1d9f...c74,
#> giving up in 9.0 s

This job will fail, and $status() will report ERROR:

t$status()
#> [1] "ERROR"

Here is the error, which is a bit less informative this time:

t$result()
#> <context_task_error in file(file, "rt"): cannot open the connection>

The log gives a better idea of what is going on: the file c:/myfile.csv does not exist, because it cannot be found from the cluster (which is why relative paths are much preferred to absolute paths):

t$log()
#> [ hello     ]  2021-08-17 14:53:10
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:10.777
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  2000
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]
#> [ parallel  ]  running as single core job
#> [ root      ]  Q:\didehpc\20210817-145020\contexts
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ task      ]  9df1d9f60717796422e1e62c8e206c74
#> [ expr      ]  read.csv("c:/myfile.csv")
#> [ start     ]  2021-08-17 14:53:10.949
#> [ error     ]
#>     Error in file(file, "rt"): cannot open the connection
#> [ end       ]  2021-08-17 14:53:11.074
#>     Error in context:::main_task_run() : Error while running task:
#>     In addition: Warning message:
#>     In file(file, "rt") :
#>       cannot open file 'c:/myfile.csv': No such file or directory
#>     Execution halted

The real content of the error message is present in the warning! You can also get the warnings with:

t$result()$warnings
#> [[1]]
#> <simpleWarning in file(file, "rt"): cannot open file 'c:/myfile.csv': No such file or directory>

This will be a list of all warnings generated during the execution of your task (even if it succeeds). The traceback also shows what happened:

t$result()$trace
#>  [1] "context:::main_task_run()"
#>  [2] "task_run(task_id, ctx)"
#>  [3] "eval_safely(dat$expr, dat$envir, \"context_task_error\", 3)"
#>  [4] "tryCatch(withCallingHandlers(eval(expr, envir), warning = function(e) warni"
#>  [5] "tryCatchList(expr, classes, parentenv, handlers)"
#>  [6] "tryCatchOne(expr, names, parentenv, handlers[[1]])"
#>  [7] "doTryCatch(return(expr), name, parentenv, handler)"
#>  [8] "withCallingHandlers(eval(expr, envir), warning = function(e) warnings$add(e"
#>  [9] "eval(expr, envir)"
#> [10] "eval(expr, envir)"
#> [11] "read.csv(\"c:/myfile.csv\")"
#> [12] "read.table(file = file, header = header, sep = sep, quote = quote, dec = de"
#> [13] "file(file, \"rt\")"
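
The fix in a case like this is usually to keep the file somewhere under the project directory (which lives on a network share the cluster can see) and refer to it with a relative path. A minimal sketch, where data/myfile.csv is a hypothetical location inside the working directory:

```
# The relative path is resolved against the job's working directory on the
# share, which both your computer and the cluster can read
t <- obj$enqueue(read.csv("data/myfile.csv"))
```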

Caused by an error during startup

These are harder to troubleshoot, but we can still pull some information out. The example here was a real-world case and illustrates one of the issues with using a shared filesystem in the way that we do here.

Suppose you have a context that uses some code in mycode.R:

times2 <- function(x) {
  2 * x
}

You create a connection to the cluster:

ctx <- context::context_save("contexts", sources = "mycode.R")
#> [ open:db   ]  rds
#> [ save:id   ]  dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ save:name ]  fuzzy_bass
obj <- didehpc::queue_didehpc(ctx)
#> Loading context dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ context   ]  dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mycode.R

Everything seems to work fine:

t <- obj$enqueue(times2(10))
t$wait(10)
#> (-) waiting for d478d26...d70, giving up in 9.5 s (\) waiting for d478d26...d70,
#> giving up in 9.0 s
#> [1] 20

...but then you edit the file and save a version that is not syntactically correct:

times2 <- function(x) {
  2 * x
}
newfun <- function(x)

Then you either submit a new job, or a job that you previously submitted gets run (which could happen ages after submission if the cluster is busy).

t <- obj$enqueue(times2(10))
t$wait(10)
#> (-) waiting for ef79e3b...390, giving up in 9.5 s (\) waiting for ef79e3b...390,
#> giving up in 9.0 s
#> <context_task_error in source(s, envir): mycode.R:5:0: unexpected end of input
#> 3: }
#> 4: newfun <- function(x)
#>   ^>
t$status()
#> [1] "ERROR"

The error here happens before your code is even reached: it occurs when context loads the source files. The log makes this a bit clearer:

t$log()
#> [ hello     ]  2021-08-17 14:53:14
#> [ wd        ]  Q:/didehpc/20210817-145020
#> [ init      ]  2021-08-17 14:53:14.324
#> [ hostname  ]  FI--DIDECLUST26
#> [ process   ]  3528
#> [ version   ]  0.3.0
#> [ open:db   ]  rds
#> [ context   ]  dd8c63ce681e6586c8e4fd8c1b5f6925
#> [ library   ]
#> [ namespace ]
#> [ source    ]  mycode.R
#>     Error in source(s, envir) : mycode.R:5:0: unexpected end of input
#>     3: }
#>     4: newfun <- function(x)
#>       ^
#>     Calls: <Anonymous> -> withCallingHandlers -> context_load -> source
#>     Execution halted
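
A simple local safeguard (this is plain R, not a didehpc feature) is to check that your source files still parse before queued jobs can pick them up:

```
# parse() raises the same "unexpected end of input" error locally, so you
# catch the broken mycode.R before any queued job tries to load it
parse("mycode.R")
```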

My jobs are getting stuck at PENDING

This is the most annoying situation, and it can happen for many reasons. The web interface or the Microsoft cluster tools show that your job has failed, but didehpc still reports it as PENDING. This happens when something fails in the script that runs on the cluster before any didehpc code gets a chance to run.

Several different things have triggered this situation in the past, and there are doubtless others. Here I'll simulate one so you can see how to troubleshoot it: I'm going to deliberately misconfigure the network share that this is running on, so that the cluster will not be able to map it and the job will fail to start.

home <- didehpc::path_mapping("home", getwd(),
                              "//fi--wronghost/path", "Q:")

The host fi--wronghost does not exist, so things will likely fail on startup.

config <- didehpc::didehpc_config(home = home)
ctx <- context::context_save("contexts")
#> [ open:db   ]  rds
obj <- didehpc::queue_didehpc(ctx, config)
#> Loading context 81b64478fe2182e45e83fec2156c6ec7
#> [ context   ]  81b64478fe2182e45e83fec2156c6ec7
#> [ library   ]
#> [ namespace ]
#> [ source    ]

Submit a job:

t <- obj$enqueue(sessionInfo())

And wait...

t$wait(10)
#> (-) waiting for 36dece3...711, giving up in 9.5 s (\) waiting for 36dece3...711,
#> giving up in 9.0 s (|) waiting for 36dece3...711, giving up in 8.5 s (/) waiting
#> for 36dece3...711, giving up in 8.0 s (-) waiting for 36dece3...711, giving
#> up in 7.5 s (\) waiting for 36dece3...711, giving up in 7.0 s (|) waiting for
#> 36dece3...711, giving up in 6.4 s (/) waiting for 36dece3...711, giving up
#> in 5.9 s (-) waiting for 36dece3...711, giving up in 5.4 s (\) waiting for
#> 36dece3...711, giving up in 4.9 s (|) waiting for 36dece3...711, giving up
#> in 4.4 s (/) waiting for 36dece3...711, giving up in 3.9 s (-) waiting for
#> 36dece3...711, giving up in 3.4 s (\) waiting for 36dece3...711, giving up
#> in 2.9 s (|) waiting for 36dece3...711, giving up in 2.4 s (/) waiting for
#> 36dece3...711, giving up in 1.9 s (-) waiting for 36dece3...711, giving up
#> in 1.4 s (\) waiting for 36dece3...711, giving up in 0.9 s (|) waiting for
#> 36dece3...711, giving up in 0.3 s (/) waiting for 36dece3...711, giving up in
#> 0.0 s
#> Error in task_wait(self$root$db, self$id, timeout, time_poll, progress): task not returned in time

It's never going to succeed, and yet its status will stay as PENDING:

t$status()
#> [1] "PENDING"

To get the log from the DIDE cluster you can run:

obj$dide_log(t)
#> [1] "Task failed during execution with exit code . Please check task's output for error details."
#> [2] "Output                          : The network path was not found."

which here indicates that the network path was not found (because it was wrong!).

You can also update any incorrect statuses by running:

obj$reconcile()
#> Fetching job status from the cluster...
#>   ...done
#> manually erroring task 36dece32d6c38f71464ed1fd9a9be711
#> Tasks have failed while context booting:
#>   - 36dece32d6c38f71464ed1fd9a9be711

This will print information about anything that was adjusted.
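
Once the underlying problem is fixed (here, pointing the path mapping at the correct host), rebuild the queue object and resubmit. A sketch, where //fi--realhost/path stands in for your actual share:

```
# Recreate the mapping with the correct host and rebuild the queue;
# the host and path below are placeholders, not a real share
home <- didehpc::path_mapping("home", getwd(), "//fi--realhost/path", "Q:")
config <- didehpc::didehpc_config(home = home)
obj <- didehpc::queue_didehpc(ctx, config)
```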

My job works on my computer but not on the cluster

In that case, something is different between how the cluster sees the world and how your computer sees it.
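
A useful first diagnostic is to capture sessionInfo() on the cluster and compare it with your local session; differing package versions, missing packages, or different paths often explain the discrepancy. A sketch using the queue object from above:

```
# Run sessionInfo() on the cluster, wait for the result, and compare it
# with the local session, looking for differences in packages and versions
t <- obj$enqueue(sessionInfo())
remote <- t$wait(120)
remote
sessionInfo()
```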

Some of my jobs work on the cluster, but others fail.

My job is slower on the cluster than running locally!

Asking for help

If you need help, you can ask in the "Cluster" Teams channel or try your luck emailing Rich and Wes (they may or may not have time to respond, or may be on leave).

When asking for help it is really important that you make it as easy as possible for us to help you. This is surprisingly hard to do well, and we would ask that you first take a look at these two short articles:

Things we will need to know:

Too often, we get requests where we have no information about what was run, what packages or versions are installed, and so on. This means your message sits there until we see it, we ask for clarification, that message sits there until you see it, you respond with a little more information, and it may be days before we finally discover the root cause of your problem, by which point we're both quite fed up. We will never complain if you provide "too much" information in a good effort to outline where your problem is.

Don't say

Hi, I was running a cluster job, but it seems like it failed. I'm sure it worked the other day though! Do you know what the problem is?

Do say

Since yesterday, my cluster job has stopped working.

My dide username is alicebobson and my dide config is:

```
<didehpc_config>
 - cluster: fi--dideclusthn
 - username: rfitzjoh
(etc)
```

I am working in the myproject directory of the malaria share (\\projects\malaria).

I have set up my cluster job with

```

include short script here if you can!

```

The job 43333cbd79ccbf9ede79556b592473c8 is one that failed with an error, and the log says

```

contents of t$log() here

```

With this sort of information, the problem may just jump out at us, or we may be able to recreate the error ourselves. Either way, we can start working on the problem and get back to you with a solution rather than a request for more information.


