DIAL release 1.30: Job failures
There are many points of failure in a distributed system and it is inevitable
that some jobs will fail. Here we discuss what is meant by failure and how
one can handle it.
What is a failed job?
An endpoint job, i.e. one with no constituents, is defined as a failure if
its run script returns a nonzero code or if it does not produce an output dataset
that can be parsed by the scheduler managing the job. It may also be marked
as failed if the scheduler loses access to the job for some other reason.
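The rule above can be sketched in Python as follows. This is only an illustration and not DIAL code: the function name endpoint_job_failed and its arguments are hypothetical, and the dataset parser is passed in as a callable because the real parsing is done by the scheduler.

```python
from typing import Callable, Optional


def endpoint_job_failed(
    return_code: int,
    output_text: Optional[str],
    parse_dataset: Callable[[str], object],
) -> bool:
    """Return True if an endpoint job counts as failed under the rule above.

    All names here are hypothetical; DIAL schedulers apply their own checks.
    """
    # A nonzero return code from the run script is a failure.
    if return_code != 0:
        return True
    # No output dataset was produced at all.
    if output_text is None:
        return True
    # An output dataset exists but cannot be parsed.
    try:
        parse_dataset(output_text)
    except Exception:
        return True
    return False
```

The third way a job can be marked as failed, the scheduler losing access, is not modelled here because it depends on the scheduler.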
A compound job is declared a failure when too many of its subjobs are
marked as failed. Here "too many" is defined by the scheduler, possibly making
use of parameters (e.g. max_retry) in the job preferences.
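Because each scheduler defines "too many" for itself, the following Python sketch shows just one possible policy and should not be read as the rule any DIAL scheduler uses. It assumes a subjob is permanently failed once it has failed more than max_retry times, and it introduces a hypothetical threshold max_failed_subjobs that is not a DIAL preference.

```python
from typing import Sequence


def compound_job_failed(
    subjob_failures: Sequence[int],
    max_retry: int,
    max_failed_subjobs: int = 0,
) -> bool:
    """One possible "too many" policy (an assumption, not DIAL's rule).

    subjob_failures[i] is the number of times subjob i has failed.
    A subjob is treated as permanently failed once it has failed
    more than max_retry times, i.e. its resubmissions are exhausted.
    max_failed_subjobs is a hypothetical tolerance, not a DIAL parameter.
    """
    permanently_failed = sum(1 for n in subjob_failures if n > max_retry)
    return permanently_failed > max_failed_subjobs
```

With max_retry set to 2, a subjob that failed twice is still eligible for another attempt, so a compound job with failure counts [0, 1, 2] is not yet a failure under this policy.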
Resubmit
The easiest way to handle failure is to resubmit the job. When problems are
intermittent or confined to resources that are not always accessed, this
may be successful.
Increase the fault tolerance in the preferences
With the current schedulers, there is a preference parameter max_retry that
sets an upper limit on the number of times subjobs are resubmitted.
Modestly increasing this value may resolve the problem.
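To show how an upper limit on resubmissions behaves, here is a minimal Python sketch of a retry loop. It is not the scheduler's implementation: run_with_retries and the submit callable are hypothetical names, and only max_retry comes from the job preferences described above.

```python
from typing import Callable


def run_with_retries(submit: Callable[[], bool], max_retry: int) -> int:
    """Submit a subjob, resubmitting up to max_retry times on failure.

    submit is a hypothetical callable that returns True on success.
    Returns the number of resubmissions used. Raises RuntimeError if
    the subjob still fails after max_retry resubmissions.
    """
    # One initial attempt plus at most max_retry resubmissions.
    for resubmissions in range(max_retry + 1):
        if submit():
            return resubmissions
    raise RuntimeError(f"subjob failed after {max_retry} resubmissions")
```

A subjob that fails twice and then succeeds completes with max_retry set to 2 but is given up on with max_retry set to 1, which is why a modest increase can be enough for intermittent problems.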
Use the job monitor
The job monitor provided by an analysis service may be used to examine
compound jobs and understand whether their failure is due to subjob failure
or some other problem. Compare the return code of a failed job with those
supported by the run script in the job's application. Examine log files
that may be found in the run directory. The job description includes the
location of this directory and the identity of a logical file holding a
tarball of that directory.
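Once the tarball of the run directory has been retrieved to local disk, its log files can be scanned without unpacking it. The Python sketch below uses the standard tarfile module and rests on assumptions that may not hold for a given application: that log files have names ending in .log and that problems are flagged with the word "error". The function name find_error_lines is hypothetical.

```python
import tarfile
from typing import Dict, List


def find_error_lines(tarball_path: str, keyword: str = "error") -> Dict[str, List[str]]:
    """Map each .log member of the tarball to its lines containing keyword.

    The .log suffix and the default keyword are assumptions; adjust them
    to match what the job's run script actually writes.
    """
    hits: Dict[str, List[str]] = {}
    # Mode "r" detects gzip or bzip2 compression automatically.
    with tarfile.open(tarball_path) as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".log")):
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="replace")
            # Case-insensitive match on the keyword.
            matched = [ln for ln in text.splitlines() if keyword.lower() in ln.lower()]
            if matched:
                hits[member.name] = matched
    return hits
```

The returned dictionary is keyed by the member path inside the tarball, so the matching lines can be traced back to a specific file in the run directory.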
Get help
If you are unable to resolve the problem on your own, pass it along to
the DIAL managers or other relevant experts. Your job ID should be enough
to allow us to reproduce the problem and resolve it.
DIAL release 1.30: Defining a job, updated 14nov05