DIAL release 1.20: Failed jobs

David Adams
06jul05


Introduction

There are many steps in the processing of an ADA job: the task is built, the input dataset is split, a subjob is run for each subdataset, and the results are merged as these subjobs complete. To understand why a job fails, one must first determine which step failed and then examine the appropriate logs to find the cause.

Before digging too far into log files, it is often wise to resubmit the job and see if the same problem occurs. Problems due to disk, DB or network failure may go away.

In the event of failure, it might be useful to go back and run one of the demos to ensure that the system is working well.

The following uses some root client commands. If you are using PyDIAL/GANGA, similar commands should be available. You can connect to your job from any dialroot client by creating a file scheduler holding the URL for the service and a file jid holding the ID of your job.
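As a minimal sketch of preparing those two files, the following Python helper writes them into a working directory. The function name, the example URL and the one-value-per-file layout are assumptions made for illustration; check what your dialroot client actually expects.

```python
from pathlib import Path


def write_connection_files(directory, service_url, job_id):
    """Write the 'scheduler' and 'jid' files into the given directory."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    # Assumption: each file holds just its value followed by a newline.
    scheduler_file = target / "scheduler"
    jid_file = target / "jid"
    scheduler_file.write_text(service_url + "\n")
    jid_file.write_text(job_id + "\n")
    return scheduler_file, jid_file


if __name__ == "__main__":
    # Both values are placeholders; use your own service URL and job ID.
    write_connection_files("dial_work", "http://dial.example.org/service", "123-456")
```

Start the dialroot client from the directory that holds these files.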

Log files

At present, log files are written to disks local to the analysis service, and users must log in to the corresponding site to gain access to these files. There are four relevant directories: To access the BNL service directory, one must log in to the service machine and look in subdirectories of /home/dial/server/<DIAL-VERSION>. The file env.log in that directory lists the environment variables, including DIAL_TASKS, presently set to /usatlas/dial/local/tasks/sl3.
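Once logged in, the value of DIAL_TASKS can be pulled out of env.log with a few lines of Python. This is only a sketch: it assumes that env.log holds one NAME=value pair per line, which is not confirmed here, and the function name and the PATH line in the sample are hypothetical.

```python
def parse_env_log(text):
    """Return a dict of variables from env.log text (assumed NAME=value lines)."""
    variables = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        # Skip lines that do not look like NAME=value.
        if separator and name.strip():
            variables[name.strip()] = value.strip()
    return variables


# Sample text in the assumed format; only the DIAL_TASKS value comes from this page.
sample = "PATH=/usr/bin:/bin\nDIAL_TASKS=/usatlas/dial/local/tasks/sl3\n"
print(parse_env_log(sample)["DIAL_TASKS"])
```

In practice, pass the contents of env.log from the service directory instead of the sample string.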

We do plan to provide means for clients to directly access the logs and this page will be updated when this is possible.

No job created

If a job submission returns an invalid ID, then the job could not be created. This may be because the input application, task or dataset is not valid or could not be accessed from the service, or because splitting the dataset failed. The most likely cause is that the task could not be built.

In the event of a submission failure, request that the scheduler write its log with the command

root> msch.log()
This is done automatically with the submit() command. The files master_scheduler.log and slave_scheduler.log will be written in the service directory. Examine the last entries in each of these files. The latter will include an obvious message and the full path to the task directory if the task build has failed. In this case, examine the logs in that directory to try to determine why the build failed and rectify that problem.
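To look at the last entries of both scheduler logs in one go, a small Python helper such as the following may be convenient. It is a sketch only: the function names and the default of 20 lines are arbitrary choices, and the service directory path must be supplied by the caller.

```python
from collections import deque
from pathlib import Path


def last_entries(log_path, count=20):
    """Return the last `count` lines of a log file, or [] if it is missing."""
    path = Path(log_path)
    if not path.is_file():
        return []
    with path.open(errors="replace") as handle:
        # A deque with maxlen keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def show_scheduler_logs(service_dir, count=20):
    """Print the tail of master_scheduler.log and slave_scheduler.log."""
    for name in ("master_scheduler.log", "slave_scheduler.log"):
        print("==", name)
        for line in last_entries(Path(service_dir) / name, count):
            print(line)
```

Call show_scheduler_logs() with the service directory for your DIAL version.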

Job starts but fails due to subjob failure

If a job reports failure, look to see if any of the subjobs have failed. Typically there will only be one because all other subjobs are killed when the main job fails. If there is a failed subjob, get its ID with the command
root> print(msch.job(jid).failed_subjobs())
and then fetch the description of the subjob with
root> print(msch.job("123-456"))
Replace 123-456 with the job ID output from the previous command.

The subjob description will include the directory where the job was run and where its logs may be found. See stdout.log and athena.log in that directory. If the problem is one you can fix, then do so and resubmit.
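These logs can be long, so it may help to list only the lines that look like errors. The sketch below assumes that problems are flagged with the words ERROR or FATAL, as is usual for athena message levels; the function name and the marker list are illustrative, and a failure may well be reported in some other form.

```python
from pathlib import Path


def find_error_lines(job_dir, names=("stdout.log", "athena.log"),
                     markers=("ERROR", "FATAL")):
    """Return (file name, line number, text) for each line containing a marker."""
    hits = []
    for name in names:
        path = Path(job_dir) / name
        if not path.is_file():
            continue  # A missing log is skipped rather than treated as an error.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((name, number, line.rstrip("\n")))
    return hits
```

Pass the run directory taken from the subjob description as job_dir.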

Job starts but fails although subjobs are done or killed

If a job fails and no subjobs have failed, then the problem is likely in the merging. Look at the log files in the main job directory.
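The three failure cases described on this page can be summarized as a small decision function. This is only an illustrative Python sketch of the triage order; the function name and its arguments are hypothetical and are not part of DIAL.

```python
def where_to_look(job_created, failed_subjob_ids):
    """Suggest which logs to examine first for a failed job."""
    if not job_created:
        # Submission returned an invalid ID: most likely the task build failed.
        return "scheduler logs in the service directory, then the task directory logs"
    if failed_subjob_ids:
        # Typically only one subjob has failed; the others are killed.
        return ("stdout.log and athena.log in the run directory of subjob "
                + failed_subjob_ids[0])
    # No failed subjobs: the problem is likely in the merging.
    return "log files in the main job directory"


print(where_to_look(True, ["123-456"]))
```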

Problem cannot be resolved

If you cannot resolve the problem, whether because you cannot access the logs, because you believe it is not a problem with your submission, or for any other reason, then please send mail to dladams@bnl.gov and we will resolve it. Please include the URL of the service and the description of your job, or at least its ID.


dladams@bnl.gov