knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = FALSE
)
r_output <- function(x) {
  cat(c("```r", x, "```"), sep = "\n")
}
Base queue object:
ctx <- context::context_save("contexts")
obj <- didehpc::queue_didehpc(ctx)
ERROR
If your job status is ERROR, that probably indicates an error in your code. There are many possible causes, and the first challenge is working out what happened.
t <- obj$enqueue(mysimulation(10))
t$wait(10)
This job will fail, and $status() will report ERROR:
t$status()
The first place to look is the result of the job itself. Unlike an error in your console, an error that happens on the cluster can be returned and inspected:
t$result()
In this case the error is because the function mysimulation does not exist.
The other place worth looking is the job log:
t$log()
Sometimes there will be additional diagnostic information there.
Here's another example:
t <- obj$enqueue(read.csv("c:/myfile.csv"))
t$wait(10)
This job will fail, and $status() will report ERROR:
t$status()
Here is the error, which is a bit less informative this time:
t$result()
The log gives a better idea of what is going on: the file c:/myfile.csv does not exist, because it cannot be found from the cluster. Using relative paths is much preferred to absolute paths.
t$log()
The real content of the error message is present in the warning! You can also get the warnings with
t$result()$warnings
Which will be a list of all warnings generated during the execution of your task (even if it succeeds). The traceback also shows what happened:
t$result()$trace
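The pattern of an uninformative error paired with an informative warning can be reproduced locally with base R alone. The sketch below is not how didehpc or context collect conditions internally; `run_and_collect` is a hypothetical helper written for illustration, and the file name is made up. It records every warning and returns the error as a value, which mirrors the idea of inspecting the warnings alongside the result.

```r
# Hypothetical helper (not part of didehpc): evaluate an expression,
# keep any warnings it raises, and return an error as a value.
run_and_collect <- function(expr) {
  warnings <- list()
  value <- withCallingHandlers(
    tryCatch(expr, error = function(e) e),
    warning = function(w) {
      # Record the warning, then carry on without printing it
      warnings[[length(warnings) + 1L]] <<- w
      invokeRestart("muffleWarning")
    })
  list(value = value, warnings = warnings)
}

# Reading a file that does not exist raises a warning and then an error
res <- run_and_collect(read.csv("no-such-file.csv"))

inherits(res$value, "error")
conditionMessage(res$value)
vapply(res$warnings, conditionMessage, character(1))
```

The error message only says that the connection could not be opened; the file name appears in the warning, which is why checking the warnings is worthwhile.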
Errors that happen before your own code runs are harder to troubleshoot, but we can still pull some information out. The example here was a real-world case and illustrates one of the issues with using a shared filesystem in the way that we do here.
writeLines("times2 <- function(x) {\n 2 * x\n}", "mycode.R")
Suppose you have a context that uses some code in mycode.R:
r_output(readLines("mycode.R"))
You create a connection to the cluster:
ctx <- context::context_save("contexts", sources = "mycode.R")
obj <- didehpc::queue_didehpc(ctx)
Everything seems to work fine:
t <- obj$enqueue(times2(10))
t$wait(10)
...but then you edit the file and save it in a state that is not syntactically correct:
writeLines("times2 <- function(x) {\n 2 * x\n}\nnewfun <- function(x)", "mycode.R")
r_output(readLines("mycode.R"))
And then you either submit a job, or a job that you have previously submitted gets run (which could happen ages after you submit it if the cluster is busy).
t <- obj$enqueue(times2(10))
t$wait(10)
t$status()
The error here has happened before getting to your code - it is happening when context loads the source files. The log makes this a bit clearer:
t$log()
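One way to reduce the chance of this happening is to check that a source file still parses before you submit anything. The following is a minimal sketch using base R's `parse()`; `check_parses` is a hypothetical helper, not a didehpc or context function, and it only catches syntax errors, not errors that occur when the file is actually sourced. Temporary files stand in for mycode.R so the example can be run anywhere.

```r
# Hypothetical helper: TRUE if the file parses as R code, FALSE otherwise
check_parses <- function(path) {
  tryCatch({
    parse(file = path)
    TRUE
  }, error = function(e) {
    message("Cannot parse ", path, ": ", conditionMessage(e))
    FALSE
  })
}

# A complete file, like the original mycode.R
good <- tempfile(fileext = ".R")
writeLines("times2 <- function(x) {\n  2 * x\n}", good)

# A half-edited file, like the broken version of mycode.R
bad <- tempfile(fileext = ".R")
writeLines("times2 <- function(x) {\n  2 * x\n}\nnewfun <- function(x)", bad)

check_parses(good)
check_parses(bad)
```

A check like this does not help with jobs that were already queued when the file was saved in a broken state, because those jobs load the source files only when they start running.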
PENDING
This is the most annoying one, and can happen for many reasons. You can see via the web interface or the Microsoft cluster tools that your job has failed but didehpc is reporting it as pending. This happens when something has failed during the script that runs before any didehpc code runs on the cluster.
Things that have triggered this situation in the past:
There are doubtless others. Here, I'll simulate one so you can see how to troubleshoot it. I'm going to deliberately misconfigure the network share that this is running on so that the cluster will not be able to map it and the job will fail to start.
home <- didehpc::path_mapping("home", getwd(), "//fi--wronghost/path", "Q:")
The host fi--wronghost does not exist so things will likely fail on startup.
config <- didehpc::didehpc_config(home = home)
ctx <- context::context_save("contexts")
obj <- didehpc::queue_didehpc(ctx, config)
Submit a job:
t <- obj$enqueue(sessionInfo())
And wait...
t$wait(10)
It's never going to succeed and yet its status will stay as PENDING:
t$status()
To get the log from the DIDE cluster you can run:
obj$dide_log(t)
which here indicates that the network path was not found (because it was wrong!).
You can also update any incorrect statuses by running:
obj$reconcile()
Which will print information about anything that was adjusted.
In that case, something is different between how the cluster sees the world, and how your computer sees it.
C: for instance?

top (linux) running, and watch to see what the memory usage is. If the job is single-core, consider the total memory used if you run 8 or 16 instances on the same cluster machine. If the total memory exceeds what is available, then behaviour will be undefined, and some jobs will likely fail.

If you need help, you can ask in the "Cluster" teams channel or try your luck emailing Rich and Wes (they may or may not have time to respond, or may be on leave).
When asking for help it is really important that you make it as easy as possible for us to help you. This is surprisingly hard to do well, and we would ask that you first take a look at these two short articles:
Things we will need to know:
obj$config if you have managed to create an object)

Too often, we get requests from people where we have no information about what was run, what packages or versions are being installed, etc. This means your message sits there until we see it, and we then ask for clarification. That message sits there until you see it, you respond with a little more information, and it may be days until we finally discover the root cause of your problem, by which point we're both quite fed up. We will never complain if you provide "too much" information in a good effort to outline where your problem is.
Don't say
Hi, I was running a cluster job, but it seems like it failed. I'm sure it worked the other day though! Do you know what the problem is?
Do say
Since yesterday, my cluster job has stopped working.
My dide username is alicebobson and my dide config is:

    <didehpc_config>
     - cluster: fi--dideclusthn
     - username: rfitzjoh
     (etc)
I am working on the myproject directory of the malaria share (\\projects\malaria)

I have set up my cluster job with
```
include short script here if you can!
```
The job 43333cbd79ccbf9ede79556b592473c8 is one that failed with an error, and the log says
```
contents of t$log() here
```
With this sort of information the problem may just jump out at us, or we may be able to create the error ourselves. Either way we may be able to work on the problem and get back to you with a solution rather than a request for more information.
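Some of the basics for a help request can be gathered with a few lines of base R. The sketch below is only a starting point: `collect_help_info` is a hypothetical helper (not part of didehpc), and the choice of packages to report is an assumption that you should adjust to whatever your job actually uses. It does not replace the config, script and log described above.

```r
# Hypothetical helper: summarise the local R setup as lines of text
# that can be pasted into a help request.
collect_help_info <- function(packages = c("didehpc", "context")) {
  versions <- vapply(packages, function(p) {
    # system.file() returns "" when the package is not installed
    if (nzchar(system.file(package = p))) {
      as.character(utils::packageVersion(p))
    } else {
      "not installed"
    }
  }, character(1))
  c(paste("R:", R.version.string),
    paste("Platform:", R.version$platform),
    paste("Working directory:", getwd()),
    sprintf("%s: %s", packages, versions))
}

report <- collect_help_info()
writeLines(report)
```

Pasting this output alongside the job id and the contents of the log gives whoever is helping you the package versions without a round trip to ask for them.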