Deploying an analysis service with DIAL 1.20

D. Adams
12jul05


Installation
The DIAL 1.20 release includes the software required to deploy a DIAL analysis service. First install the software following the instructions on the installation page. The release page has links to the DIAL release kits and tarballs for all external packages.

It is probably wise to run the client to test the installation. Try running the demos (e.g. demo 6) provided with the DIAL client.

Basic configuration
Create a directory in which to run the service. I use the script start to start the service and the script stop to stop it. Copy these to the service directory and modify appropriately for your deployment. Both of these scripts must be executable.

The start script may be used to define some of the environment variables that control the behavior of the DIAL service. These include:
   DIAL_APPS - where applications are stored
   DIAL_TASKS - where tasks are constructed
   DIAL_JOBS - where job directories are created and accessed
   DIAL_MASTER_JOBS - same for compound jobs
   DIAL_TASKBUILDER - script called to construct tasks
   DIAL_UIDS - unique ID server
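
The variable list above can be captured in the start script along the following lines. All directory paths, the unique ID server URL, and the server executable name are placeholders for illustration, not values from the release; adjust them for your deployment.

```shell
#!/bin/sh
# start -- sketch of a DIAL service start script.
# Every path, URL, and executable name below is a placeholder.

export DIAL_APPS=/dialsvc/apps                  # where applications are stored
export DIAL_TASKS=/dialsvc/tasks                # where tasks are constructed
export DIAL_JOBS=/dialsvc/jobs                  # where job directories are created
export DIAL_MASTER_JOBS=/dialsvc/mjobs          # same for compound jobs
export DIAL_TASKBUILDER=/dialsvc/taskbuild.sh   # script called to construct tasks
export DIAL_UIDS=http://uids.example.org:8000   # unique ID server URL

# Launch the service in the background, capturing stdout and stderr.
# Replace "dialserver" with the actual server executable from the release:
# nohup dialserver > stdout.log 2>&1 &
echo "DIAL environment configured"
```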
Here is an example task build script that creates an LSF job to build the task. It is used to take the load off the server and to ensure that the correct platform is used.
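
As a rough sketch of such a script, the following hands the build off to LSF. The argument convention (application directory, task directory), the queue name, and the build_task helper are assumptions for illustration, not part of the DIAL interface; the echo makes it a dry run.

```shell
# Write a sketch of a task build script ($DIAL_TASKBUILDER) that
# submits the build to LSF, then exercise it.  The arguments, queue
# name, and build_task helper are hypothetical.
cat > taskbuild.sh <<'EOF'
#!/bin/sh
appdir=$1    # application directory
taskdir=$2   # task directory to build into
# Submit the build to LSF; the leading "echo" makes this a dry run --
# remove it to submit for real.
echo bsub -q build -o "$taskdir/build.log" "$appdir/build_task" "$taskdir"
EOF
chmod +x taskbuild.sh
./taskbuild.sh /dialsvc/apps/myapp /dialsvc/tasks/t1
```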

In the service directory, create a file named port that holds the number of the port on which the service should listen for requests. On any given machine, only one service may listen on a given port. If you are behind a firewall, you will have to open this port if you wish to handle requests from external users. Here is an example port file.
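
Creating the file is a one-liner; the port number 3150 here is only an example value, so pick any free port at your site.

```shell
# Create the port file in the service directory; 3150 is just an
# example -- use any free port above 1024 at your site.
echo 3150 > port
cat port
```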

The service reads an XML description of the type of scheduler to use for job submission. Copy the description appropriate for your site, modify it as needed, and modify the start script to point at it. Here are descriptions for some of the job types supported by DIAL:
   Condor
   LSF
   fork
   scripted

The last runs a scripted job which is configured by putting instructions for creating, starting, updating and killing a job in a script. This makes it easy to extend DIAL to run jobs on almost any batch or grid workload management system. Here is an example script which runs the job on the local machine. It requires a run script to be in the same directory. Both of these scripts must be executable.
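
The scripted-scheduler idea can be sketched as a script that dispatches on a command name. The subcommand names, the PID-file convention, and the run script invocation below are assumptions for illustration; consult the example script shipped with the release for the actual interface.

```shell
# Sketch of a scripted-scheduler job script.  The subcommand names and
# the PID-file convention are assumptions, not the documented interface.
cat > scriptjob.sh <<'EOF'
#!/bin/sh
cmd=$1
dir=$2          # job directory
case "$cmd" in
  create)  mkdir -p "$dir" ;;
  start)   cd "$dir" && { ./run > run.log 2>&1 & echo $! > pid; } ;;
  update)  kill -0 "$(cat "$dir/pid")" 2>/dev/null \
             && echo running || echo done ;;
  kill)    kill "$(cat "$dir/pid")" 2>/dev/null ;;
  *)       echo "usage: $0 {create|start|update|kill} jobdir" ;;
esac
EOF
chmod +x scriptjob.sh
./scriptjob.sh create /tmp/dialjob1
```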

GSI configuration
The following are needed if you want to run with GSI security. This is highly recommended.

Obtain a host certificate for the server machine and install hostcert.pem and hostkey.pem in the directory grid-security. The key file must only be readable by the owner (chmod 400 hostkey.pem).

Obtain or create a gridmap file listing the authorized users. The mapped account for each DN is not used by DIAL. Here is an example: authorized_dn. This lists all ATLAS users registered at the time of its creation. Install this file in the service directory with the name authorized_dn. Typically you will use the gridmap provided by your local Globus installation, e.g. the one in /etc/grid-security/grid-mapfile.
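
An entry in this file follows the standard grid-mapfile form: a quoted DN followed by a local account name. The DN below is fictitious; real entries come from your site's grid-mapfile.

```shell
# Create an authorized_dn file in the service directory.  The DN is
# made up for illustration; as noted above, the local account after
# the DN is ignored by DIAL.
cat > authorized_dn <<'EOF'
"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 12345" usatlas
EOF
grep -c CN= authorized_dn
```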

Ensure that the CA files are installed in the standard location on your machine or install them yourself (see the DIAL external packages) and set X509_CERT_DIR to their location.

Starting the service
After configuration, enter the service directory and execute the command ./start to start the service. All calls to the service will be recorded in the file server.log. Check stdout.log for messages to stdout and stderr.

The following job definition may be used to create a trivial job for testing:
   aname = "test1"
   tname = "test1"
   dname = "empty"

Debugging
In case of problems, first look in stdout.log for error messages.

For scheduler problems (e.g. client reports invalid scheduler), use the client method WsClientScheduler::log() (e.g. msch.log()) to trigger creation of master_scheduler.log and slave_scheduler.log in the service directory. These may be inspected for error and other processing messages.

If a job or task build fails, check the log files in the associated directory. The job directory appears in the job printout (see the client or server.log). Subjob IDs may be obtained with Job::subjobs() or Job::failed_subjobs(). The task directory is $DIAL_TASKS/application-ID/task-ID. Task results are cached: once a task build fails, it will continue to report failure until the directory is deleted to force a rebuild.
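
Forcing a rebuild is just a matter of removing the cached task directory. The application and task IDs below are placeholders; read the real ones from the job printout or server.log.

```shell
# Remove a cached (failed) task build so the next request rebuilds it.
# "myapp-1.0" and "task-42" are placeholder IDs.
DIAL_TASKS=${DIAL_TASKS:-/dialsvc/tasks}
rm -rf "$DIAL_TASKS/myapp-1.0/task-42"
```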

Detailed messages during job processing may be obtained by creating debug files in the local directory. Relevant files include:
   debug_MasterScheduler
   debug_LocalScheduler
   debug_CompoundJob
   debug_WorkingDirectory
Each triggers messages from the named component. Simply delete or rename the file to stop the messages. All output will appear in stdout.log.
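
Turning the messages on and off is done entirely with empty files in the service directory:

```shell
# Enable detailed messages from the master and local schedulers by
# creating the corresponding (empty) debug files.
touch debug_MasterScheduler debug_LocalScheduler

# Remove (or rename) a file to stop its messages again:
# rm debug_MasterScheduler debug_LocalScheduler
```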

Unique ID service
The files for starting a unique ID service may be found here. Modify them for your area, update the URL in your DIAL configuration or set DIAL_UIDS by hand, and start the service. The command uidtest should return the next value in your sequence. Contact D. Adams to obtain truly unique sequences for other object types (Dataset, application, ...).


dladams@bnl.gov