ARDA Comments from DIAL
David Adams
September 8, 2003

Introduction
------------

Torre asked me to provide some feedback on the draft ARDA report based on our experiences with DIAL. The document I have is version 0.2 dated 6/8/2003. I begin with a couple of general comments and then provide the same decomposition for DIAL that the report provides for AliEn. I would be happy to provide any further feedback that might be useful.

A description of DIAL (including the recent release 0.40) may be found at http://www.usaltlas.bnl.gov/~dladams/dial. At present DIAL provides distributed scheduling at the local site, but it has been designed with more widely distributed processing in mind.

Datasets
--------

The unit of data processing and provenance tracking in DIAL is the dataset rather than the logical file. For more information on this notion of a dataset, see http://www.usatlas.bnl.gov/~dladams/dataset and, in particular, the note "Datasets for the GRID" linked from that page.

A common realization of a dataset is a set of logical files, and I infer this is implicitly supported by AliEn and ARDA. However, a dataset is also allowed to be represented by distinct collections of logical files, in the same way that a logical file has multiple physical replicas. I would add a dataset catalog to the decomposition, distinct from the file and metadata catalogs.

Analysis
--------

DIAL's primary emphasis is on interactive analysis, while AliEn and the ARDA report seem to focus on user-level data production. The former leads to a requirement that users be able to specify a response time and that the analysis system be fast and flexible enough to attempt to meet the demand. This applies not only to the creation and running of jobs but also to the gathering of results and, in particular, partial results while the job is processing.
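
As an illustration of the dataset notion in the Datasets section above, the following is a minimal Python sketch of a dataset that may have several representations, each a distinct collection of logical files, together with a dataset catalog kept separate from any file catalog. All names here (Dataset, DatasetCatalog, add_representation and the sample IDs and file names) are hypothetical and chosen for illustration; they are not DIAL's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset identified by ID, with zero or more representations.

    Each representation is a tuple of logical file names that holds the
    same content, analogous to the physical replicas of a logical file.
    """
    dataset_id: str
    representations: list = field(default_factory=list)

    def add_representation(self, logical_files):
        # Store an immutable copy so later changes to the caller's
        # list do not alter the recorded representation.
        self.representations.append(tuple(logical_files))


class DatasetCatalog:
    """Hypothetical catalog mapping dataset IDs to datasets."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        if dataset.dataset_id in self._datasets:
            raise ValueError("duplicate dataset ID: " + dataset.dataset_id)
        self._datasets[dataset.dataset_id] = dataset

    def lookup(self, dataset_id):
        return self._datasets[dataset_id]


# Usage: one dataset with two representations of the same content.
catalog = DatasetCatalog()
ds = Dataset("ds-0001")
ds.add_representation(["lfn:run1.part1", "lfn:run1.part2"])
ds.add_representation(["lfn:run1.merged"])
catalog.register(ds)
```

The catalog stores only dataset IDs and logical file names; resolving a logical file to a physical replica would remain the job of the file catalog.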

Decomposition
-------------

5.1 API and User interface

A DIAL user selects a dataset, specifies an application to carry out the processing, and defines a task to configure that application, including specification of the result to be generated. The application, task and dataset are then handed to a scheduler, which carries out the (distributed) processing and returns the result.

DIAL is written in C++. Work has begun to provide a Python wrapper, and the same might be done for Java. The application, task, dataset and result all have XML representations to enable communication between different sites and languages.

DIAL provides a very simple command line interface. It is expected that most users will work inside an external analysis framework such as ROOT, JAS, GANGA or PI. These may enable users to interact directly with DIAL objects or may provide graphical interfaces.

5.2-3 Authentication and Authorization

Local login is sufficient for the current version of DIAL, except that our LSF does not forward AFS credentials. We expect to make use of certificates when we move to the GRID. This is clearly an area where we expect to leverage other work.

5.4-5 Auditing and Information system

We expect to work with services provided by experiments and the grid to provide global information about schedulers and jobs. A DIAL scheduler provides information about installed applications, tasks and jobs relevant to the scheduler.

5.6 Computing element

DIAL presently supports fork, LSF, lsrun and Condor. Through its scheduler hierarchy, it will allow users to transparently view farms, sites or grids as compute elements. The scheduler is responsible for verifying the existence of applications and installing tasks.

5.7 Storage elements

DIAL interacts with storage through a logical file class which deals with various file "catalogs". At present these include the local file system, NFS, AFS and Magda. Replication is not yet supported but is envisioned.
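
The idea of a logical file class that works against several file "catalogs" can be sketched as follows. This is only an illustration in Python under stated assumptions: the class names, the locate and resolve methods, and the sample paths are all hypothetical and do not reflect the actual DIAL C++ interface or the Magda API.

```python
class FileCatalog:
    """Hypothetical catalog mapping logical file names to locations."""

    def __init__(self, name, entries):
        self.name = name
        self._entries = dict(entries)

    def locate(self, logical_name):
        # Return the physical location, or None if this catalog
        # does not know the logical name.
        return self._entries.get(logical_name)


class LogicalFile:
    """A file known by logical name and resolved through catalogs."""

    def __init__(self, name):
        self.name = name

    def resolve(self, catalogs):
        # Try each catalog in order and return the first match as a
        # (catalog name, physical location) pair.
        for catalog in catalogs:
            location = catalog.locate(self.name)
            if location is not None:
                return catalog.name, location
        raise LookupError("no catalog knows " + self.name)


# Usage with two stand-in catalogs (paths are made up).
local = FileCatalog("local", {"lfn:a": "/data/a.root"})
afs = FileCatalog("afs", {"lfn:b": "/afs/example/b.root"})
where = LogicalFile("lfn:b").resolve([local, afs])
```

Because the caller sees only the logical name, a replication-aware catalog could later be added to the list without changing code that uses LogicalFile.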

5.8 Workload management

The heart of DIAL is a hierarchy of schedulers, each of which takes an application, task and dataset as input and returns a result. The dataset may be split, and each sub-dataset, along with the application and task, is used to define a job in a traditional management system such as LSF. Alternatively, the requests may be passed to another scheduler which handles them in a similar manner. Each scheduler is responsible for monitoring its jobs, combining results if splitting was done, and returning the combined result to the calling user or scheduler. Partial results are generated on request.

5.9 Data management

As indicated above, it is expected that replication will be handled through the logical file interface. The replication system is dictated (and provided) by the experiment.

5.10 File catalog

At present Magda is the only file replica catalog that is supported. We envision adding RLS, possibly through the POOL interface, when it becomes relevant to ATLAS. We will soon implement dataset catalogs.

5.11 Metadata catalog

At present users can select a dataset from a short list. It is clear that we need to catalog a much more extensive collection and provide metadata from which users can make selections. The result of a selection would be a dataset name or ID.

5.12 Job monitoring

Again we will connect to such a system as required.

5.13 Job provenance

I would prefer to call this dataset provenance or data provenance. DIAL does not provide a provenance tracking system but is designed with provenance in mind. Produced datasets and other data are included in results, which are generated using a recorded application, task and dataset. The application is identified by name and version, and the task and dataset have unique IDs. Details of production can be obtained from jobs and sub-jobs, which also have unique IDs.
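
The scheduler hierarchy of 5.8 and the job IDs mentioned in 5.13 can be sketched together. The Python below is a toy model under several assumptions that are not taken from DIAL: the application is a plain callable, the dataset is a list of numbers, results are combined by addition, and the class and function names are hypothetical. Partial results and monitoring are left out.

```python
from dataclasses import dataclass, field
from itertools import count

# Source of unique job IDs for this sketch.
_next_job_id = count(1)


@dataclass
class Result:
    value: int = 0
    job_ids: list = field(default_factory=list)


class LeafScheduler:
    """Runs one job directly; stands in for a batch system such as LSF."""

    def submit(self, application, task, dataset):
        job_id = next(_next_job_id)
        return Result(application(task, dataset), [job_id])


class SplittingScheduler:
    """Splits the dataset, delegates to child schedulers, combines results."""

    def __init__(self, children):
        self.children = children

    def submit(self, application, task, dataset):
        n = len(self.children)
        # Round-robin split into one sub-dataset per child.
        pieces = [dataset[i::n] for i in range(n)]
        combined = Result()
        for child, piece in zip(self.children, pieces):
            if not piece:
                continue  # nothing to process for this child
            partial = child.submit(application, task, piece)
            combined.value += partial.value
            combined.job_ids.extend(partial.job_ids)
        return combined


def count_above(task, events):
    """Toy application: count events above the task's threshold."""
    return sum(1 for e in events if e > task["threshold"])


# Usage: a "site" scheduler that delegates to a "farm" and to one node.
farm = SplittingScheduler([LeafScheduler(), LeafScheduler()])
site = SplittingScheduler([farm, LeafScheduler()])
result = site.submit(count_above, {"threshold": 5}, list(range(10)))
```

Since both scheduler types expose the same submit call, the caller does not need to know whether it is talking to a single node, a farm or a site, and the job IDs collected in the result give a starting point for tracing how it was produced.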

5.14 Package management

A DIAL application will typically correspond to a lightweight package whose dependencies specify those packages that must be present when the application runs. DIAL does not provide any tools for package installation but envisions that some schedulers will eventually be able to install missing packages. This will be accomplished using an external package management system such as PACMAN.

5.15 Grid monitoring

High-level schedulers will make use of, and may provide feedback to, grid monitoring systems.