DIAL release 1.30: Job failures

There are many points of failure in a distributed system and it is inevitable that some jobs will fail. Here we discuss what is meant by failure and how one can handle it.

What is a failed job?

An endpoint job, i.e. one with no constituents, is defined as a failure if its run script returns a nonzero value or if it does not produce an output dataset that can be parsed by the scheduler managing the job. It may also be marked as failed if the scheduler loses access for some other reason.

A compound job is declared a failure when too many of its subjobs are marked as failed. Here "too many" is defined by the scheduler, possibly making use of parameters (e.g. max_retry) in the job preferences.
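
The two definitions above can be summarized in a short sketch. It is illustrative only and does not use the DIAL API: the function names, the parse_output callback, and the max_failed threshold are all hypothetical, and the real rule for "too many" is whatever the scheduler implements.

```python
from typing import Callable, Iterable, Optional


def endpoint_failed(return_code: int,
                    raw_output: Optional[str],
                    parse_output: Callable[[str], object]) -> bool:
    """Return True if an endpoint job counts as failed (hypothetical rule)."""
    if return_code != 0:
        return True          # run script returned nonzero
    if raw_output is None:
        return True          # no output dataset was produced
    try:
        parse_output(raw_output)
    except ValueError:
        return True          # output dataset could not be parsed
    return False


def compound_failed(subjob_failed: Iterable[bool], max_failed: int = 0) -> bool:
    """Return True if more than max_failed subjobs failed (assumed threshold)."""
    return sum(subjob_failed) > max_failed
```

Here parse_output stands in for whatever parsing the scheduler performs; it is assumed to raise ValueError on unreadable output.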

Resubmit

The easiest way to handle failure is to resubmit the job. When problems are intermittent or confined to resources that are not always accessed, this may be successful.

Increase the fault tolerance in the preferences

With the current schedulers, there is a preference parameter max_retry that sets an upper limit on the number of times subjobs are resubmitted. Modestly increasing this value may resolve the problem.
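
As a rough illustration of how a retry limit bounds resubmission, the sketch below runs a subjob and resubmits it at most max_retry times. It is not scheduler code: run_subjob is a hypothetical callable that returns True on success, and the assumption that max_retry counts resubmissions after the first attempt is ours.

```python
from typing import Callable


def run_with_retries(run_subjob: Callable[[], bool], max_retry: int) -> bool:
    """Run a subjob, resubmitting it at most max_retry times after a failure.

    run_subjob is a hypothetical callable returning True on success.
    """
    attempts_allowed = 1 + max_retry   # first submission plus resubmissions
    for _ in range(attempts_allowed):
        if run_subjob():
            return True                # stop as soon as one attempt succeeds
    return False                       # every allowed attempt failed
```

With this rule, a subjob that fails twice and then succeeds is recovered when max_retry is 2 but is reported as failed when max_retry is 1, which is why a modest increase can be enough for intermittent problems.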

Use the job monitor

The job monitor provided by an analysis service may be used to examine compound jobs and understand whether their failure is due to subjob failure or some other problem. Compare the return code of a failed job with those supported by the run script in the job's application. Examine log files that may be found in the run directory. The job description includes the location of this directory and the identity of a logical file holding a tarball of that directory.
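
Once the run-directory tarball has been copied to a local path, its log files can be inspected without unpacking it. The sketch below uses only Python's standard tarfile module; the local path, the ".log" suffix, and the helper names are assumptions for illustration, not part of DIAL.

```python
import tarfile
from typing import List


def list_log_files(tarball_path: str, suffix: str = ".log") -> List[str]:
    """Return names of regular files in the tarball ending with suffix."""
    with tarfile.open(tarball_path) as tar:   # compression is auto-detected
        return [m.name for m in tar.getmembers()
                if m.isfile() and m.name.endswith(suffix)]


def read_tail(tarball_path: str, member: str, n_lines: int = 20) -> str:
    """Return the last n_lines lines of one file inside the tarball."""
    with tarfile.open(tarball_path) as tar:
        fh = tar.extractfile(member)
        if fh is None:
            raise ValueError(f"{member} is not a regular file")
        with fh:
            lines = fh.read().decode("utf-8", errors="replace").splitlines()
    return "\n".join(lines[-n_lines:])
```

The end of a log is often where the error that produced the nonzero return code is reported, so reading the tail first is usually a reasonable starting point.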

Get help

If you are unable to resolve the problem on your own, pass it along to the DIAL managers or other relevant experts. Your job ID should be enough to allow us to reproduce the problem and resolve it.


DIAL release 1.30: Defining a job, updated 14nov05