DIAL release 1.20: Failed jobs
David Adams
06jul05
Introduction
There are many steps in the processing of an ADA job: the task
is built, the input dataset is split, a subjob is run for each
subdataset, and the results are merged as these subjobs complete.
In order to understand why a job fails, one must first understand
which step failed and then examine the appropriate logs to
determine the cause.
Before digging too far into log files, it is often wise to
resubmit the job and see if the same problem occurs.
Problems due to disk, DB or network failure may go away.
In the event of failure, it might also be useful to go back and run
one of the demos to ensure that the system is working well.
The following uses some root client commands. If you are using
PyDIAL/GANGA, similar commands should be available. You can
connect to your job from any dialroot client by creating two
files: one named scheduler holding the URL of the service, and
one named jid holding the ID of your job.
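As a minimal sketch of that setup, the two files could be written with a few lines of Python. The file names scheduler and jid come from the text above. The function name is hypothetical, the values in the usage comment are placeholders, and the assumption that each file holds a single line of text is not confirmed here.

```python
from pathlib import Path


def write_connection_files(directory, service_url, job_id):
    """Write the 'scheduler' and 'jid' files used to connect to a job.

    Assumption: each file holds exactly one line of text.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    scheduler_file = directory / "scheduler"
    jid_file = directory / "jid"
    scheduler_file.write_text(service_url + "\n")
    jid_file.write_text(job_id + "\n")
    return scheduler_file, jid_file


# Example call with placeholder values (substitute your own):
# write_connection_files(".", "<service-URL>", "<job-ID>")
```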
Log files
At present log files are written to disks local to the analysis
service and users must log in to the corresponding site to gain
access to these files. There are four relevant directories:
- Service directory - This is where the service is run. The
scheduler logs and the stdout and stderr logs may be found here.
- Task directory - This is where the task is built:
$DIAL_TASKS/<application-ID>/<task-ID>.
- Master job directory - This is where the merging is done.
- Subjob directories - Each subjob has its own directory,
where its logs may be found.
To access the BNL service directory, one must log in to the service
machine and look in subdirectories of
/home/dial/server/<DIAL-VERSION>.
The file env.log in that directory lists the environment variables,
including DIAL_TASKS, presently set to /usatlas/dial/local/tasks/sl3.
We do plan to provide means for clients to directly access the logs
and this page will be updated when this is possible.
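Once logged in, the task directory can be derived from env.log. The sketch below assumes that env.log holds one NAME=value pair per line, as printed by the env command. That format is a guess and is not stated on this page. The function names are hypothetical.

```python
from pathlib import Path


def read_env_log(path):
    """Parse NAME=value lines into a dict (assumed format of env.log)."""
    env = {}
    for line in Path(path).read_text().splitlines():
        name, sep, value = line.partition("=")
        # Skip lines that have no '=' or an empty name.
        if sep and name.strip():
            env[name.strip()] = value.strip()
    return env


def task_directory(env, application_id, task_id):
    """Return $DIAL_TASKS/<application-ID>/<task-ID>."""
    return Path(env["DIAL_TASKS"]) / application_id / task_id
```

The application and task IDs must be supplied by the caller; they are not recorded in env.log.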
No job created
If a job submission returns an invalid ID, then the job could
not be created. This may be because the input application,
task or dataset is not valid or could not be accessed from the
service, or because the dataset could not be split.
The most likely cause is that the task could not be built.
In the event of a submission failure, request that the
scheduler write its log with the command
root> msch.log()
This is done automatically with the submit() command.
The files master_scheduler.log and slave_scheduler.log
will be written in the service directory. Examine the last
entries in each of these files. The latter will include an
obvious message and the full path to the task directory if
the task build has failed. In this case, examine the logs in
that directory to try to determine why the build failed and
rectify that problem.
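Examining the last entries of both scheduler logs can be scripted on the service machine. The following is a sketch only. It assumes one log entry per line, which this page does not confirm. The function name and the default of 20 lines are arbitrary choices.

```python
from collections import deque
from pathlib import Path

SCHEDULER_LOGS = ("master_scheduler.log", "slave_scheduler.log")


def last_entries(service_dir, count=20):
    """Return the last `count` lines of each scheduler log that exists."""
    result = {}
    for name in SCHEDULER_LOGS:
        path = Path(service_dir) / name
        if path.is_file():
            with path.open(errors="replace") as handle:
                # A bounded deque keeps only the final `count` lines.
                tail = deque(handle, maxlen=count)
            result[name] = [line.rstrip("\n") for line in tail]
    return result
```

A log file that is missing from the directory is simply absent from the returned dictionary.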
Job starts but fails due to subjob failure
If a job reports failure, look to see if any of the subjobs
have failed. Typically there will only be one because all
other subjobs are killed when the main job fails. If there
is a failed subjob, get its ID with the command
root> print(msch.job(jid).failed_subjobs())
and then fetch the description of the subjob with
root> print(msch.job("123-456"))
Replace 123-456 with the job ID output from the previous command.
The subjob description will include the directory where the
subjob was run and where its logs may be found. See stdout.log
and athena.log in that directory. If the problem is one you can
fix, then do so and resubmit.
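A quick first pass over the two subjob logs might look like the sketch below. It assumes that problem lines contain the word ERROR or FATAL. That is a common logging convention, but it is an assumption about these particular files. The function name is hypothetical.

```python
import re
from pathlib import Path

SUBJOB_LOGS = ("stdout.log", "athena.log")
# Assumption: lines reporting problems contain one of these words.
PROBLEM_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def find_problem_lines(subjob_dir):
    """Return (file name, line number, text) for each matching log line."""
    hits = []
    for name in SUBJOB_LOGS:
        path = Path(subjob_dir) / name
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PROBLEM_PATTERN.search(line):
                    hits.append((name, number, line.rstrip("\n")))
    return hits
```

An empty result does not prove that the subjob succeeded; the logs should still be read if the cause is unclear.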
Job starts and fails, but subjobs are done or killed
If a job fails and no subjobs have failed, then the problem is likely
in the merging. Look at the log files in the master job directory.
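The triage described in the last three sections can be summarized as a small decision function. This is only a restatement of the text in Python. The function and its arguments are hypothetical and do not call any DIAL interface. It assumes the job is already known to have failed.

```python
def where_to_look(job_id_valid, failed_subjob_ids):
    """Map the symptoms of a failed job to the logs worth examining first."""
    if not job_id_valid:
        # No job was created: most likely the task build failed.
        return "scheduler logs in the service directory, then the task directory"
    if failed_subjob_ids:
        # Usually only one subjob fails; the others are killed.
        return "stdout.log and athena.log for subjob " + failed_subjob_ids[0]
    # No failed subjobs: the merging step is the likely culprit.
    return "logs in the master job directory"
```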
Problem cannot be resolved
If you cannot resolve the problem, whether because you cannot
access the logs, because you believe it is not a problem with your
submission, or for any other reason, then please send mail to
dladams@bnl.gov and we will look into it. Please include the URL
of the service and the description of your job, or at least its ID.