Introduction
We do not expect that a single analysis service instance can meet
all the needs of the ATLAS collaboration. Instead, a hierarchy of
services will provide the required scaling, robustness, and ability
to respond to different types of job requests. See the
deployment page for specific ATLAS models.
Here we examine the different types of services that can be used to
construct this hierarchy.
One important property of an analysis service is the backend system it uses to carry out its processing. We identify four categories of analysis services based on this property: local, grid, switch and ATLAS production. Details are provided below.
All existing analysis services are based on DIAL. DIAL services run different types of jobs, and each job type holds the means to interact with its backend system. Existing job types directly support local job forking and the LSF and Condor batch systems. There is also a compound job type that handles the processing of a collection of subjobs and the merging of their results. The latest DIAL release (1.20) adds a scripted job type, which can be extended to any system by providing a script.
Processing has three stages: task building, running and result merging. The task build is usually cached for each application-task-platform combination so that it need not be repeated for each job or subjob. If the task is not yet built, then this is done before any jobs are started. The present DIAL analysis services allow the service manager to specify a directory where tasks should be read and written. The manager may also provide a wrapper script to build the task, to ensure the correct platform is used.
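The caching of task builds described above can be illustrated with a minimal sketch. This is not DIAL code: the function names, the directory layout and the BUILD_OK marker file are all assumptions made for illustration. It only shows the idea of keying a build on the application-task-platform combination and building at most once.

```python
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def task_dir(cache_root: Path, application: str, task: str, platform: str) -> Path:
    """Directory for one application-task-platform combination (assumed layout)."""
    return cache_root / application / task / platform


def get_or_build_task(
    cache_root: Path,
    application: str,
    task: str,
    platform: str,
    build: Callable[[Path], None],
) -> Tuple[Path, bool]:
    """Return (directory, was_built). The build callable runs only on a cache miss."""
    target = task_dir(cache_root, application, task, platform)
    marker = target / "BUILD_OK"  # hypothetical marker written after a successful build
    if marker.exists():
        return target, False
    target.mkdir(parents=True, exist_ok=True)
    build(target)  # stand-in for the manager's wrapper script
    marker.write_text("ok\n")
    return target, True


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    for _ in range(2):
        directory, built = get_or_build_task(
            root, "myapp", "mytask", "myplatform", lambda d: None
        )
        print(directory, "built" if built else "reused")
```

Writing the marker only after the build returns means a failed build is retried on the next request rather than being treated as cached.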
Merging can and should take place as soon as subjobs complete so that partial results can be made available to the user. At present DIAL does this inside the service but there are plans to enable the service to create jobs for this purpose.
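A minimal sketch of merging as subjobs complete is given here. It is illustrative only and does not reflect how DIAL implements merging: the subjob is a hypothetical function that counts integer values, results are represented as Counter objects, and a thread pool stands in for the real backend.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_subjob(events):
    """Hypothetical subjob: count occurrences of each value (a toy histogram)."""
    return Counter(events)


def process_with_incremental_merge(subjob_inputs, on_partial=None):
    """Merge each subjob result as soon as it finishes.

    on_partial, if given, is called with (done, total, snapshot) after every
    merge so that a partial result can be shown to the user.
    """
    merged = Counter()
    done = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_subjob, chunk) for chunk in subjob_inputs]
        for future in as_completed(futures):
            merged.update(future.result())  # Counter.update adds counts
            done += 1
            if on_partial is not None:
                on_partial(done, len(futures), Counter(merged))
    return merged


if __name__ == "__main__":
    final = process_with_incremental_merge(
        [[1, 1, 2], [2, 3], [3, 3]],
        on_partial=lambda d, t, m: print(f"{d}/{t} subjobs merged: {dict(m)}"),
    )
    print("final:", dict(final))
```

Because the merge here is a simple addition of counts, the order in which subjobs finish does not affect the final result.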
Local service
A local service will typically split a user request into a collection of
subjobs and then process each subjob on the local machine or by submission to
a local batch queue. The service must also handle task building and the
merging of results from the subjobs.
DIAL provides direct support for LSF and Condor.
Other batch systems can be added by adding a class to DIAL or by
providing a job script for use with the scripted job type.
ATLAS presently runs LSF services at BNL and CERN.
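The split-and-process pattern of a local service can be sketched as follows. This is a toy under stated assumptions, not the DIAL implementation: the request is assumed to be a list of file names, each subjob is run by forking a child Python process that merely counts its input files, and the merge is a sum. A real service would run the built task, or submit to a batch queue, in place of the child process.

```python
import subprocess
import sys


def split(files, files_per_subjob):
    """Split the request's file list into consecutive subjob chunks."""
    return [
        files[i:i + files_per_subjob]
        for i in range(0, len(files), files_per_subjob)
    ]


def run_local(subjob_files):
    """Run one subjob as a forked local process (stand-in application)."""
    # The child prints how many file arguments it received.
    code = "import sys; print(len(sys.argv) - 1)"
    completed = subprocess.run(
        [sys.executable, "-c", code, *subjob_files],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


def process_request(files, files_per_subjob=2):
    """Split, run each subjob locally, then merge (here: sum the counts)."""
    results = [run_local(chunk) for chunk in split(files, files_per_subjob)]
    return sum(results)


if __name__ == "__main__":
    print(process_request(["a.root", "b.root", "c.root", "d.root", "e.root"]))
```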
Grid service
A grid-based service also splits the user request into subjobs and must
handle all three steps of processing. Jobs are run by submitting them
to a gatekeeper, CE (computing element) or WMS (workload management
system) depending on the destination grid. A task might be built once, with
the results distributed using existing data or software management systems,
or a request can be sent to build the task at each site. Results can be
merged by sending additional jobs to the grid site where the data was
produced or by bringing the data back to the service machine.
ATLAS is developing a service based on the gLite/LCG WMS.
Switch service
A switch service forwards requests on to other analysis services based
on job requirements and the states of those services and the systems
supporting them. It may do job splitting in which case it is responsible
for merging results. DIAL does not yet provide such services but will
soon do so. Such services allow users to submit requests to a generic
service which can then select an appropriate service based on the location
of software, data and compute resources.
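Since DIAL does not yet provide a switch service, the following is only a sketch of how such a selection might look. The ServiceInfo fields, the service names and the policy (filter on availability, data and software, then prefer the most free slots) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ServiceInfo:
    """Hypothetical description of one downstream analysis service."""
    name: str
    datasets: Set[str] = field(default_factory=set)   # data located at the service
    releases: Set[str] = field(default_factory=set)   # software installed there
    free_slots: int = 0                                # idle compute resources
    available: bool = True                             # current service state


def select_service(
    services: List[ServiceInfo], dataset: str, release: str
) -> Optional[ServiceInfo]:
    """Pick an available service holding the data and software, or None."""
    candidates = [
        s for s in services
        if s.available and dataset in s.datasets and release in s.releases
    ]
    if not candidates:
        return None
    # Assumed policy: prefer the service with the most free slots.
    return max(candidates, key=lambda s: s.free_slots)


if __name__ == "__main__":
    services = [
        ServiceInfo("site-a", {"ds1"}, {"rel1"}, free_slots=10),
        ServiceInfo("site-b", {"ds1", "ds2"}, {"rel1"}, free_slots=40),
        ServiceInfo("site-c", {"ds1"}, {"rel1"}, free_slots=99, available=False),
    ]
    chosen = select_service(services, "ds1", "rel1")
    print(chosen.name if chosen else "no suitable service")
```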
ATLAS production
ATLAS has developed a sophisticated system to handle production
of very large data samples. This system does not have the extensibility,
merging or responsiveness desired for an analysis system but it does
efficiently harness a large set of resources and has a large support base.
Work is in progress to construct an analysis service that provides an interface
to this system and adds the splitting of user requests and the merging of the
corresponding results.