DIAL JDL Proposal

D. Adams
20oct03


Introduction

One of the principal results of the DIAL project is recognition of the need for a high-level job definition language. This language should enable users to formulate job requests and receive the result when the job is complete. It should also support querying job status and aborting running jobs.

The word job is used in the DIAL sense. The job is specified with an input dataset and an application that describes the action to perform on the dataset to produce a result. The job also carries a task, which is a collection of user-supplied data used to configure the application. The task may include parameters, code or scripts. The application provides a mechanism to prepare the task (e.g. compile code and archive in a shared library) and another mechanism to process the dataset using the prepared task.

Typically job processing is distributed by breaking the submitted job into sub-jobs and then concatenating the sub-results to create the overall result. Here we are especially interested in the case where the dataset is split into sub-datasets and each of these is used along with the full application and task to define a sub-job.
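
The splitting described above can be sketched in a few lines of Python. This is only an illustration: the Job class, the representation of a dataset as a list of item identifiers, and the function names split_job and run_distributed are assumptions made for the sketch and are not part of the proposal.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Job:
    """Hypothetical in-memory job: names stand in for the real components."""
    application: str     # application name (assumed representation)
    task: str            # task identifier (assumed representation)
    dataset: List[str]   # dataset modeled as a list of item identifiers


def split_job(job: Job, chunk_size: int) -> List[Job]:
    """Split the dataset; every sub-job keeps the full application and task."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [
        Job(job.application, job.task, job.dataset[i:i + chunk_size])
        for i in range(0, len(job.dataset), chunk_size)
    ]


def run_distributed(job: Job, chunk_size: int,
                    process: Callable[[Job], List[str]]) -> List[str]:
    """Process each sub-job and concatenate the sub-results in order."""
    result: List[str] = []
    for sub_job in split_job(job, chunk_size):
        result.extend(process(sub_job))
    return result
```

Here the sub-jobs run one after another; a real scheduler would dispatch them to different nodes, but the split and concatenate steps keep the same shape.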

XML is used to express the job definition language. Thus parsers are available for common programming and scripting languages such as C++, Java and Python. This also makes it straightforward to wrap these components in SOAP messages and exchange them in web and grid service environments.

The job definition language is used to define a SOAP interface for scheduler services for job management. This interface is described in a separate document.


Notation

To improve readability, we drop the tags in our XML definitions and use indentation to set off attributes and child elements from their parent element. An equals sign followed by a sample value is used to distinguish attributes from child elements. For example, the XML code
<Book title="How to use the GRID" publisher="Anything Co.">
  <TableOfContents>
    <Chapter number="1" />
    <Chapter number="2" />
    <Chapter number="3" />
    <Chapter number="4" />
  </TableOfContents>
</Book>
is represented by
Book
  title = How to use the GRID
  publisher = Anything Co.
  TableOfContents
    Chapter
      number = 1
    Chapter
      number = 2
    Chapter
      number = 3
    Chapter
      number = 4
We use ellipses to indicate more of the same and indented ellipses to abbreviate content that has already been indicated:
Book
  title = How to use the GRID
  publisher = Anything Co.
  TableOfContents
    Chapter
      number = 1
    Chapter
      ...            [as in previous Chapter]
    ...              [may be more Chapter elements]
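
As an illustration of this notation, the following Python sketch converts an XML element into the indented form using the standard xml.etree.ElementTree module. The function name to_notation and the two-space indent are our assumptions, and the sketch does not produce the ellipsis abbreviations.

```python
import xml.etree.ElementTree as ET


def to_notation(element, indent=0):
    """Return the indented, tag-free notation for an element as a list of lines."""
    pad = "  " * indent
    lines = [pad + element.tag]
    # Attributes are written as "name = value", one level below the element.
    for name, value in element.attrib.items():
        lines.append(f"{pad}  {name} = {value}")
    # Child elements follow at the same level as the attributes.
    for child in element:
        lines.extend(to_notation(child, indent + 1))
    return lines


book = ET.fromstring(
    '<Book title="How to use the GRID">'
    '<Chapter number="1" /><Chapter number="2" />'
    '</Book>'
)
print("\n".join(to_notation(book)))
```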


Extensibility, polymorphism and schema

The job definition language defined here must be rich enough to support a standard scheduler interface and must also be extensible so that it can be used to pass information to non-standard schedulers and to applications. In the former case, we extend rather than replace the language so that the same request can be handled by a standard scheduler.

A natural way to make the language extensible is to introduce polymorphism, i.e. add types which extend other types. For example, a dataset with a file extends a dataset. Plain XML does not directly support polymorphism but XML Schema can be used to add this functionality at the price of carrying around a schema description for each XML object.

Here we try to allow the use of schemas without requiring them. This is done by allowing substitution of elements with different names as long as the attribute "extends" is present and includes the substituted element name. For our above example, the type Dataset

Dataset
  name = mydataset
  BaseDatasetStuff
can be replaced with
FileDataset
  extends = Dataset
  Dataset
    name = myfiledataset
    BaseDatasetStuff
  File
    name = myfile.dat
Other conventions are possible. We could elevate the Dataset attributes to the FileDataset level and drop the Dataset entry. Or we could require Dataset as the element name and embed the extended type. We could require use of schemas or could adopt a convention (such as the latter) that makes their use impossible.

For definiteness, we follow the convention of this example in the discussion below. We are open to adopting any convention for a common standard. A different convention will change the formatting of the data but will not affect the content and the discussion below remains relevant.
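
To make the convention concrete, here is a minimal Python sketch of how a reader of the language might decide whether an element can stand in for a given type and how it might locate the embedded base element. The function names extended_types, can_substitute and view_as are hypothetical, and we assume that "extends" holds a comma-separated list of type names.

```python
import xml.etree.ElementTree as ET


def extended_types(element):
    """Return the type names listed in the element's "extends" attribute."""
    raw = element.get("extends", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def can_substitute(element, type_name):
    """True if the element is of the named type or declares that it extends it."""
    return element.tag == type_name or type_name in extended_types(element)


def view_as(element, type_name):
    """Return the element carrying the data for type_name, or None."""
    if element.tag == type_name:
        return element
    if type_name in extended_types(element):
        # Under the convention used here, the base type is embedded as a child.
        return element.find(type_name)
    return None


file_dataset = ET.fromstring(
    '<FileDataset extends="Dataset">'
    '<Dataset name="myfiledataset"><BaseDatasetStuff /></Dataset>'
    '<File name="myfile.dat" />'
    '</FileDataset>'
)
base = view_as(file_dataset, "Dataset")
```

A standard scheduler that knows only Dataset can then work with the element returned by view_as and ignore the File entry.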


Components

We identify the following components:
  Application
  Task
  Dataset
  Result
  JobId
  Job


Application

An application is specified by name and version.
Application
  name = DataCruncher
  version = 1.2.3
This is assumed to be enough that a package management system can identify the required software and verify its presence or install it if needed. A scheduler will use this application description to discover the scripts used to prepare tasks and start jobs.

An alternative application description might add a list of required packages and embed the scripts required for preparing tasks and running jobs. This would avoid the need to have the package management system install the interface package carrying these scripts.

The application schema described here is not extensible.


Task

Tasks carry the user-supplied data used to configure an application. These may be parameters, scripts or code. This data is carried in a collection of embedded text files and a collection of logical files.
Task
  id = 123-456
  TextFiles
    TextFile
      name = data.dat
      [text data]
    TextFile
      name = code.cpp
      [text data]
    ...
  NamedLogicalFiles
    NamedLogicalFile
      name = caloped.dat
      LogicalFile
        catalog_type = Magda
        catalog_name = ATLAS
        file_id = calo_pedastals.0003367
    ...
The scheduler has the responsibility of extracting the text files and fetching replicas of the logical files before calling the application script to prepare the task. The extracted files are placed in a directory with the indicated names.

It is up to the installed application to prepare the task using the file names to identify the files. This preparation may fail if files are missing or have unexpected content.

The task is not extensible. It is assumed that arbitrary lists of embedded and logical files provide sufficient flexibility.
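
The text file extraction step performed by the scheduler might look like the following Python sketch. It assumes that the text data is carried as the text content of each TextFile element, which the notation above does not settle, and the function name extract_text_files is hypothetical. Fetching replicas of the logical files depends on the file catalog and is not shown.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def extract_text_files(task, directory):
    """Write each embedded text file into directory under its indicated name."""
    written = []
    for text_file in task.findall("./TextFiles/TextFile"):
        name = text_file.get("name")
        # Refuse missing names and names that would escape the target directory.
        if not name or name in (".", "..") or os.path.basename(name) != name:
            raise ValueError(f"unsafe or missing file name: {name!r}")
        path = os.path.join(directory, name)
        with open(path, "w", encoding="utf-8") as handle:
            # Assumption: the file body is the element's text content.
            handle.write(text_file.text or "")
        written.append(name)
    return written


task = ET.fromstring(
    '<Task id="123-456"><TextFiles>'
    '<TextFile name="data.dat">1 2 3</TextFile>'
    '<TextFile name="code.cpp">int main() { return 0; }</TextFile>'
    '</TextFiles></Task>'
)
with tempfile.TemporaryDirectory() as work_dir:
    names = extract_text_files(task, work_dir)
```

The application script that prepares the task would then be run in that directory.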


Dataset

Datasets are discussed in great detail in the document "Datasets for the Grid" available at http://www.usatlas.bnl.gov/~dladams/dataset. That note identifies a collection of dataset properties. The properties relevant to this discussion are
  identity - a unique identifier for referencing the dataset
  content - a description of the kind of data in the dataset
  location - where the data can be found, e.g. a list of logical files
  mutability - is the dataset finished?
  compositeness - is the dataset made up of other datasets?
A virtual dataset (all but location) is described by a base dataset:
BaseDataset
  identity = 123-456
  mutability = locked    [or appending]
  location = virtual     [or logical or physical or staged or mixed]
  composite = false
  ContentList
    Content
      type = RecoJets
      key = cone-0.3
    Content
      type = Tracks
      key = Kalman-refit
    ...
The content here does not include event information.

A case of particular importance is the event dataset which is made up of a collection of "events" where each event may be processed independently. Each event is assumed to have an identifier.

EventDataset
  extends = BaseDataset
  BaseDataset
    ...
  EventDatasetData
    event_count = 100
    EventIdList                      [optional]
      count = 100
      EventId
        id = 4200-201
      EventId
        id = 4200-202
      ...
The event ID list is optional and is typically absent for a composite dataset where it can be derived from the lists in the sub-datasets. Note that we add the element EventDatasetData to hold the data that this type adds beyond its base types. This pattern allows us to make the base type BaseDataset explicit in subclasses. This pattern will be repeated in every type that is both a sub-type and intended to be used as a base for other types.
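
A reader of an event dataset has to cope with the optional list. The Python sketch below returns the listed event identifiers, or None when the list is absent, and checks that the counts agree when it is present. The function names are hypothetical, and the rule that event_count, count and the number of EventId entries must all match is our assumption rather than something the proposal states.

```python
import xml.etree.ElementTree as ET


def event_ids(event_dataset):
    """Return the listed event ids, or None when the optional list is absent."""
    id_list = event_dataset.find("./EventDatasetData/EventIdList")
    if id_list is None:
        return None
    return [event.get("id") for event in id_list.findall("EventId")]


def counts_consistent(event_dataset):
    """Check event_count against the list, if a list is present."""
    data = event_dataset.find("EventDatasetData")
    ids = event_ids(event_dataset)
    if ids is None:
        return True  # nothing to compare against
    declared = int(data.get("event_count"))
    listed = int(data.find("EventIdList").get("count"))
    # Assumed rule: all three numbers must agree.
    return declared == listed == len(ids)


dataset = ET.fromstring(
    '<EventDataset extends="BaseDataset"><BaseDataset />'
    '<EventDatasetData event_count="2"><EventIdList count="2">'
    '<EventId id="4200-201" /><EventId id="4200-202" />'
    '</EventIdList></EventDatasetData></EventDataset>'
)
ids = event_ids(dataset)
```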

If the data for a logical dataset is in a single logical file, then a logical file dataset can be used.

LogicalFileDataset
  extends = BaseDataset
  BaseDataset
    ...
  LogicalFileDatasetData
    LogicalFile
      id = dc1.ttbar.recon008.sim
      FileCatalog
        type = Magda
        name = ATLAS
Typically users might add their own element type to describe a dataset which has events and for which data can be found in a single logical file:
MonteCarloFileDataset
  extends = BaseDataset,EventDataset,LogicalFileDataset
  BaseDataset
    ...
  EventDatasetData
    ...
  LogicalFileDatasetData
    ...
  geometry = 4.10
  simulator = geant4
If the data is distributed over multiple logical files (the usual case), then datasets can be merged to form a composite dataset.
CompositeDataset
  extends = BaseDataset
  BaseDataset
    ...
  CompositeDatasetData
    DatasetIdList
      DatasetId
        id = 333-101
      DatasetId
        id = 333-102
      ...
A composite dataset may include events. For these datasets, two types of composition are allowed: merging datasets with the same content and different events and merging datasets with the same events and different content. We use the flag event for the first case and content for the latter:
CompositeEventDataset
  extends = BaseDataset,EventDataset,CompositeDataset
  BaseDataset
    ...
  EventDatasetData
    ...
  CompositeDatasetData
    ...
  CompositeEventDatasetData
    merge_type = event         [or content]
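
The two merge types imply different rules for deriving the event ID list of the composite from the lists of its sub-datasets. A minimal Python sketch follows; the function name derive_event_ids is hypothetical, and rejecting overlapping or mismatched lists is our reading of "different events" and "same events", not a stated requirement.

```python
def derive_event_ids(merge_type, sub_lists):
    """Derive the composite event id list from the sub-dataset lists."""
    if not sub_lists:
        return []
    if merge_type == "event":
        # Same content, different events: concatenate and reject duplicates.
        merged, seen = [], set()
        for ids in sub_lists:
            for event_id in ids:
                if event_id in seen:
                    raise ValueError(f"event {event_id} is in more than one sub-dataset")
                seen.add(event_id)
                merged.append(event_id)
        return merged
    if merge_type == "content":
        # Same events, different content: every sub-dataset must list the same events.
        first = sub_lists[0]
        for ids in sub_lists[1:]:
            if set(ids) != set(first):
                raise ValueError("content merge requires the same events in every sub-dataset")
        return list(first)
    raise ValueError(f"unknown merge_type: {merge_type!r}")


by_event = derive_event_ids("event", [["4200-201"], ["4200-202"]])
by_content = derive_event_ids("content", [["4200-201", "4200-202"],
                                          ["4200-202", "4200-201"]])
```
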
More to come...
dladams@bnl.gov