Installation
The DIAL 1.20 release includes the software required to deploy a DIAL
analysis service. First install the software following the instructions on the
installation page. The
release page has links to the DIAL release kits and tarballs
for all external packages.
It is probably wise to run the client to test the installation. Try running the demos (e.g. demo 6) provided with the DIAL client.
Basic configuration
Create a directory in which to run the service. I use the script
start to start the service and the script
stop to stop it. Copy these to the service directory and modify
appropriately for your deployment.
Both of these scripts must be executable.
The start script may be used to define some of the environment variables
used to control the behavior of the DIAL service. These include:
DIAL_APPS - where applications are stored
DIAL_TASKS - where tasks are constructed
DIAL_JOBS - where job directories are created and accessed
DIAL_MASTER_JOBS - same for compound jobs
DIAL_TASKBUILDER - script called to construct tasks
DIAL_UIDS - unique ID server
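Before starting the service it can help to confirm that these variables are set as intended. The following is a minimal sketch and is not part of DIAL: the helper name check_dial_env is hypothetical, and treating the first four variables as directory paths and all six as required are assumptions drawn from the descriptions above.

```python
import os

# Assumption: these variables each hold a directory path.
DIR_VARS = ["DIAL_APPS", "DIAL_TASKS", "DIAL_JOBS", "DIAL_MASTER_JOBS"]
# Assumption: these hold something else (a script path, a server address),
# so only their presence is checked.
OTHER_VARS = ["DIAL_TASKBUILDER", "DIAL_UIDS"]


def check_dial_env(env=None):
    """Return a list of problems found; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []
    for name in DIR_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name}={value} is not a directory")
    for name in OTHER_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    return problems


if __name__ == "__main__":
    # Print one line per problem; print nothing if all checks pass.
    for problem in check_dial_env():
        print(problem)
```

Run it from a shell that has sourced the same settings as the start script. If your deployment leaves some variables at their defaults, remove them from the lists.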
Here is an example task build script
that creates an LSF job to build the task. It is used to take the load off the server
and to ensure the correct platform is used.
In the service directory, create a file named port that holds the number of the port on which the service should listen for requests. On any given machine, only one service may listen on any given port. If you are behind a firewall, you will have to open this port if you wish to handle requests from external users. Here is an example port file.
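Since only one service may listen on a given port, it can be useful to check the port file before starting the service. The sketch below is not part of DIAL. It assumes the port file contains just the port number as text, and the function names read_port and port_is_free are hypothetical.

```python
import socket
from pathlib import Path


def read_port(path="port"):
    """Read the port number from the port file and check that it is in range."""
    text = Path(path).read_text().strip()
    port = int(text)  # raises ValueError if the file does not hold an integer
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} is outside 1-65535")
    return port


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    port = read_port()
    state = "free" if port_is_free(port) else "already in use or not permitted"
    print(f"port {port} is {state}")
```

This only tests whether the port can be bound on the local machine. It says nothing about whether a firewall allows external clients to reach it.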
The service reads an XML description of the type of scheduler to use for
job submission. Copy the appropriate description for your site, modify
it as needed, and modify the start script to point at this description.
Here are descriptions for some of the job types supported by DIAL:
Condor
LSF
fork
scripted
The last runs a scripted job which is configured by putting instructions for creating, starting, updating and killing a job in a script. This makes it easy to extend DIAL to run jobs on almost any batch or grid workload management system. Here is an example script which runs the job on the local machine. It requires that a run script be in the same directory. Both of these scripts must be executable.
GSI configuration
The following are needed if you want to run with GSI security. This is
highly recommended.
Obtain a host certificate for the server machine and install hostcert.pem and hostkey.pem in the directory grid-security. The key file must only be readable by the owner (chmod 400 hostkey.pem).
Obtain or create a gridmap file listing the authorized users. The mapped account for each DN is not used by DIAL. Here is an example: authorized_dn. This lists all ATLAS users registered at the time of its creation. Install this file in the service directory with the name authorized_dn. Typically you will use the gridmap provided by your local Globus installers, e.g. the one in /etc/grid-security/grid-mapfile.
Ensure that the CA files are installed in the standard location on your machine or install them yourself (see the DIAL external packages) and set X509_CERT_DIR to their location.
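Two of the steps above lend themselves to a quick check: the permissions on hostkey.pem and the contents of the gridmap file. The sketch below is not part of DIAL. The function names are hypothetical, and it assumes the usual grid-mapfile layout of a double-quoted DN followed by an account name on each line, with blank lines and lines starting with # ignored.

```python
import os
import shlex
import stat


def key_is_owner_only(path):
    """Return True if the file has no group or other permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def authorized_dns(lines):
    """Extract the DN (the first field) from each gridmap line."""
    dns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # honors the double quotes around the DN
        dns.append(fields[0])
    return dns


if __name__ == "__main__":
    key = os.path.join("grid-security", "hostkey.pem")
    print("hostkey.pem owner-only:", key_is_owner_only(key))
    with open("authorized_dn") as gridmap:
        for dn in authorized_dns(gridmap):
            print(dn)
```

A file with mode 400 or 600 passes the first check. Any group or world access fails it.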
Starting the service
After configuration, enter the service directory and execute the command
./start to start the service. All calls to the service will be recorded in
the file server.log. Check stdout.log for messages to stdout and stderr.
The following job definition may be used to create a trivial job for testing:
aname = "test1"
tname = "test1"
dname = "empty"
Debugging
In case of problems, first look in stdout.log for error messages.
For scheduler problems (e.g. client reports invalid scheduler), use the client method WsClientScheduler::log() (e.g. msch.log()) to trigger creation of master_scheduler.log and slave_scheduler.log in the service directory. These may be inspected for error and other processing messages.
If a job or task build fails, check the log files in the associated directory. The job directory appears in the job printout (see the client or server.log). Subjob IDs may be obtained with Job::subjobs() or Job::failed_subjobs(). The task directory is $DIAL_TASKS/application-ID/task-ID. Task results are cached: once a task build fails, the task will continue to report failure until the directory is deleted to force a rebuild.
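Clearing a failed task build comes down to removing the cached task directory. The following is a minimal sketch, not a DIAL tool: the function names task_dir and force_rebuild are hypothetical, and the directory layout is taken from the path given above. The IDs are whatever appears in your job printout.

```python
import os
import shutil
from pathlib import Path


def task_dir(app_id, task_id, tasks_root=None):
    """Build the path $DIAL_TASKS/application-ID/task-ID."""
    root = tasks_root if tasks_root is not None else os.environ["DIAL_TASKS"]
    return Path(root) / str(app_id) / str(task_id)


def force_rebuild(app_id, task_id, tasks_root=None):
    """Delete the cached task directory; return True if something was removed."""
    path = task_dir(app_id, task_id, tasks_root)
    if path.is_dir():
        shutil.rmtree(path)  # irreversible: inspect the build logs first
        return True
    return False
```

Read the log files in the directory before calling force_rebuild, since the deletion also removes the record of why the build failed.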
Detailed messages during job processing may be obtained by creating debug
files in the local directory. Relevant files include:
debug_MasterScheduler
debug_LocalScheduler
debug_CompoundJob
debug_WorkingDirectory
Each triggers messages from the named component. Simply delete or rename the
file to stop the messages. All output will appear in stdout.log.
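Switching these messages on and off can be scripted. The sketch below is not part of DIAL. It assumes that an empty file is enough to trigger the messages (only the file's presence is described above) and that the "local directory" is the directory passed in, by default the current one. The helper name set_debug is hypothetical.

```python
from pathlib import Path

# Component names taken from the debug file names listed above.
COMPONENTS = ["MasterScheduler", "LocalScheduler", "CompoundJob", "WorkingDirectory"]


def set_debug(component, enabled, directory="."):
    """Create or remove the debug_<component> file and return its path."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    path = Path(directory) / f"debug_{component}"
    if enabled:
        path.touch()  # assumption: an empty file is sufficient
    else:
        path.unlink(missing_ok=True)  # requires Python 3.8 or later
    return path
```

For example, set_debug("MasterScheduler", True) creates debug_MasterScheduler in the current directory, and calling it again with False removes the file.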
Unique ID service
The files for starting a unique ID service may be found
here. Modify them for your area, update the URL
in your DIAL configuration or set DIAL_UIDS by hand,
and start the service. The command uidtest should return the next
value in your sequence. Contact D. Adams to obtain truly unique
sequences for other object types (Dataset, application, ...).