The word job is used in the DIAL sense. The job is specified with an input dataset and an application that describes the action to perform on the dataset to produce a result. The job also carries a task, which is a collection of user-supplied data used to configure the application. The task may include parameters, code or scripts. The application provides a mechanism to prepare the task (e.g. compile code and archive in a shared library) and another mechanism to process the dataset using the prepared task.
Typically job processing is distributed by breaking the submitted job into sub-jobs and then concatenating the sub-results to create the overall result. Here we are especially interested in the case where the dataset is split into sub-datasets and each of these is used along with the full application and task to define a sub-job.
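The splitting scheme described above can be sketched in Python as follows. This is only an illustration under simple assumptions: the dataset is modelled as a list of event identifiers, the application as a plain function and the task as a dictionary. The names split_dataset, run_job and select_events are hypothetical and are not part of DIAL.

```python
from typing import Callable, List, Sequence


def split_dataset(events: Sequence[str], max_size: int) -> List[List[str]]:
    """Split a dataset (here just a list of event IDs) into sub-datasets."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [list(events[i:i + max_size]) for i in range(0, len(events), max_size)]


def run_job(application: Callable[[List[str], dict], List[str]],
            task: dict,
            events: Sequence[str],
            max_size: int) -> List[str]:
    """Run one sub-job per sub-dataset and concatenate the sub-results.

    Every sub-job receives the full application and task; only the
    dataset differs between sub-jobs.
    """
    result: List[str] = []
    for sub_dataset in split_dataset(events, max_size):
        result.extend(application(sub_dataset, task))
    return result


def select_events(sub_dataset, task):
    """Toy application: keep the event IDs that end with the task's suffix."""
    return [e for e in sub_dataset if e.endswith(task["suffix"])]


events = ["4200-201", "4200-202", "4200-203", "4200-211", "4200-212"]
print(run_job(select_events, {"suffix": "1"}, events, max_size=2))
# prints ['4200-201', '4200-211']
```

In a real scheduler the sub-jobs would run concurrently on different nodes; the sequential loop here only shows how the pieces fit together.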
The job definition language is expressed in XML, so parsers are available for common programming and scripting languages such as C++, Java and Python. This also makes it straightforward to wrap these components in SOAP messages and exchange them in web and grid service environments.
The job definition language is used to define a SOAP interface for scheduler services for job management. This interface is described in a separate document.
The XML
<Book title="How to use the GRID" publisher="Anything Co.">
  <TableOfContents />
  <Chapter number="1" />
  <Chapter number="2" />
  <Chapter number="3" />
  <Chapter number="4" />
</Book>
is represented by
Book
  title = How to use the GRID
  publisher = Anything Co.
  TableOfContents
  Chapter
    number = 1
  Chapter
    number = 2
  Chapter
    number = 3
  Chapter
    number = 4
We use ellipses to indicate more of the same and indented ellipses
to abbreviate content that has already been indicated:
Book
  title = How to use the GRID
  publisher = Anything Co.
  TableOfContents
  Chapter
    number = 1
  Chapter
    ... [as in previous Chapter]
  ... [may be more Chapter elements]
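The mapping from XML to this tree notation is mechanical. The following Python sketch uses the standard xml.etree.ElementTree module to produce it; the function name to_tree and the two-space indent are assumptions made for illustration. The sketch covers attributes and child elements only and ignores text content and the ellipsis abbreviations.

```python
import xml.etree.ElementTree as ET


def to_tree(element, depth=0, indent="  "):
    """Return the lines of the tree notation for an XML element."""
    pad = indent * depth
    lines = [pad + element.tag]
    # Attributes are listed first, one "name = value" line each.
    for name, value in element.attrib.items():
        lines.append(f"{pad}{indent}{name} = {value}")
    # Child elements follow, indented one level deeper.
    for child in element:
        lines.extend(to_tree(child, depth + 1, indent))
    return lines


xml_text = """
<Book title="How to use the GRID" publisher="Anything Co.">
  <TableOfContents />
  <Chapter number="1" />
  <Chapter number="2" />
</Book>
"""
print("\n".join(to_tree(ET.fromstring(xml_text))))
```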
A natural way to make the language extensible is to introduce polymorphism, i.e. to add types that extend other types. For example, a dataset with a file extends a dataset. Plain XML does not directly support polymorphism, but XML Schema can be used to add this functionality at the price of carrying around a schema description for each XML object.
Here we try to allow the use of schemas without requiring them. This is done by allowing substitution of elements with different names as long as the attribute "extends" is present and includes the substituted element name. For the example above, the type Dataset
Dataset
  name = mydataset
  BaseDatasetStuff
can be replaced with
FileDataset
  extends = Dataset
  Dataset
    name = myfiledataset
    BaseDatasetStuff
  File
    name = myfile.dat
Other conventions are possible. We could elevate the Dataset attributes
to the FileDataset level and drop the Dataset entry. Or we could require
Dataset as the element name and embed the extended type. We could
require the use of schema or could adopt a convention (such as the
latter) that makes their use impossible.
For definiteness, we follow the convention of this example in the discussion below. We are open to adopting any convention for a common standard. A different convention will change the formatting of the data but will not affect the content and the discussion below remains relevant.
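A consumer of the language can honor the "extends" convention without any schema. The Python sketch below shows one way to do so. The helper names type_names, can_substitute and base_part are hypothetical, and the comma-separated form of "extends" is taken from the multi-base examples later in this document.

```python
import xml.etree.ElementTree as ET


def type_names(element):
    """Return the element's own type followed by the types it extends."""
    extended = element.get("extends", "")
    return [element.tag] + [n.strip() for n in extended.split(",") if n.strip()]


def can_substitute(element, expected):
    """True if the element may appear where type `expected` is required."""
    return expected in type_names(element)


def base_part(element, base):
    """Return the embedded base-type element, or the element itself."""
    if element.tag == base:
        return element
    if not can_substitute(element, base):
        raise TypeError(f"{element.tag} does not extend {base}")
    return element.find(base)


file_dataset = ET.fromstring("""
<FileDataset extends="Dataset">
  <Dataset name="myfiledataset"><BaseDatasetStuff /></Dataset>
  <File name="myfile.dat" />
</FileDataset>
""")
print(can_substitute(file_dataset, "Dataset"))          # prints True
print(base_part(file_dataset, "Dataset").get("name"))   # prints myfiledataset
```

Code written against Dataset can call base_part and continue to work when it is handed a FileDataset, which is the point of the convention.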
Application
Task
Dataset
Result
JobId
Job
Application
  name = DataCruncher
  version = 1.2.3
This is assumed to be enough that a package management system can identify the required software and verify its presence or install it if needed. A scheduler will use this application description to discover the scripts used to prepare tasks and start jobs.
An alternative task description might add a list of required packages and embed the scripts required for preparing tasks and running jobs. This would avoid the need to have the package management system install the interface package carrying these scripts.
The application schema described here is not extensible.
Task
  id = 123-456
  TextFiles
    TextFile
      name = data.dat
      [text data]
    TextFile
      name = code.cpp
      [text data]
    ...
  NamedLogicalFiles
    NamedLogicalFile
      name = caloped.dat
      LogicalFile
        catalog_type = Magda
        catalog_name = ATLAS
        file_id = calo_pedastals.0003367
    ...
The scheduler has the responsibility of extracting the text files
and fetching replicas of the logical files before calling the
application script to prepare the task. The extracted files are
placed in a directory with the indicated names.
It is up to the installed application to prepare the task using the file names to identify the files. This preparation may fail if files are missing or have unexpected content.
The task is not extensible. It is assumed that arbitrary lists of embedded and logical files provide sufficient flexibility.
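The scheduler's part of this work can be sketched as follows. The sketch assumes that the text data is carried as the text content of each TextFile element, which the tree notation above does not specify. The function names extract_text_files and logical_files are hypothetical. Fetching replicas from a catalog is not shown; the second function only lists what would have to be fetched.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_text_files(task, directory):
    """Write every embedded TextFile into `directory` under its given name."""
    written = []
    for text_file in task.findall("./TextFiles/TextFile"):
        name = text_file.get("name")
        # Accept plain file names only, so nothing lands outside `directory`.
        if not name or name in (".", "..") or Path(name).name != name:
            raise ValueError(f"bad text file name: {name!r}")
        path = Path(directory) / name
        path.write_text(text_file.text or "")
        written.append(path)
    return written


def logical_files(task):
    """List (local name, catalog type, catalog name, file ID) to be fetched."""
    found = []
    for named in task.findall("./NamedLogicalFiles/NamedLogicalFile"):
        lf = named.find("LogicalFile")
        found.append((named.get("name"), lf.get("catalog_type"),
                      lf.get("catalog_name"), lf.get("file_id")))
    return found


task = ET.fromstring("""
<Task id="123-456">
  <TextFiles>
    <TextFile name="data.dat">1 2 3</TextFile>
    <TextFile name="code.cpp">int main() { return 0; }</TextFile>
  </TextFiles>
  <NamedLogicalFiles>
    <NamedLogicalFile name="caloped.dat">
      <LogicalFile catalog_type="Magda" catalog_name="ATLAS"
                   file_id="calo_pedastals.0003367" />
    </NamedLogicalFile>
  </NamedLogicalFiles>
</Task>
""")

with tempfile.TemporaryDirectory() as workdir:
    paths = extract_text_files(task, workdir)
    print(sorted(p.name for p in paths))
print(logical_files(task))
```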
identity - a unique identifier for referencing the dataset
content - a description of the kind of data in the dataset
location - where the data can be found, e.g. a list of logical files
mutability - is the dataset finished?
compositeness - is the dataset made up of other datasets?
A virtual dataset (all but location) is described by a base dataset:
BaseDataset
  identity = 123-456
  mutability = locked [or appending]
  location = virtual [or logical or physical or staged or mixed]
  composite = false
  ContentList
    Content
      type = RecoJets
      key = cone-0.3
    Content
      type = Tracks
      key = Kalman-refit
    ...
The content here does not include event information.
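The allowed values shown in brackets above lend themselves to a simple consistency check. The Python sketch below is one possible form of it. The function name check_base_dataset is hypothetical, and the value "true" for the composite flag is an assumption, since only "false" appears in the example.

```python
import xml.etree.ElementTree as ET

MUTABILITY = {"locked", "appending"}
LOCATION = {"virtual", "logical", "physical", "staged", "mixed"}


def check_base_dataset(element):
    """Return a list of problems found in a BaseDataset element."""
    problems = []
    if element.tag != "BaseDataset":
        problems.append(f"unexpected element {element.tag}")
    if not element.get("identity"):
        problems.append("missing identity")
    if element.get("mutability") not in MUTABILITY:
        problems.append(f"bad mutability: {element.get('mutability')}")
    if element.get("location") not in LOCATION:
        problems.append(f"bad location: {element.get('location')}")
    # Assumption: the flag is spelled "true" or "false".
    if element.get("composite") not in ("true", "false"):
        problems.append(f"bad composite flag: {element.get('composite')}")
    for content in element.findall("./ContentList/Content"):
        if not content.get("type") or not content.get("key"):
            problems.append("Content needs both type and key")
    return problems


good = ET.fromstring("""
<BaseDataset identity="123-456" mutability="locked" location="virtual"
             composite="false">
  <ContentList>
    <Content type="RecoJets" key="cone-0.3" />
    <Content type="Tracks" key="Kalman-refit" />
  </ContentList>
</BaseDataset>
""")
print(check_base_dataset(good))  # prints []
```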
A case of particular importance is the event dataset which is made up of a collection of "events" where each event may be processed independently. Each event is assumed to have an identifier.
EventDataset
  extends = BaseDataset
  BaseDataset
    ...
  EventDatasetData
    event_count = 100
    EventIdList [optional]
      count = 100
      EventId
        id = 4200-201
      EventId
        id = 4200-202
      ...
The event ID list is optional and is typically absent for a composite
dataset where it can be derived from the lists in the sub-datasets.
Note that we add the element EventDatasetData to hold the data that
this type adds beyond its base types. This pattern allows us to make
the base type BaseDataset explicit in subclasses. This pattern will
be repeated in every type that is both a sub-type and intended to be
used as a base for other types.
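With this pattern, code that only understands one type can locate that type's data inside any extending element. The Python sketch below illustrates the lookup. The helper names type_data and event_ids are hypothetical, and the example document uses only two events for brevity.

```python
import xml.etree.ElementTree as ET


def type_data(element, type_name):
    """Return the child that holds what `type_name` contributes, or None.

    The base type is embedded under its own name; other types in the
    "extends" chain store their additions in "<type_name>Data".
    """
    found = element.find(type_name)
    if found is None:
        found = element.find(type_name + "Data")
    return found


def event_ids(dataset):
    """Return the explicit event ID list, or None if it was omitted."""
    data = type_data(dataset, "EventDataset")
    if data is None:
        raise TypeError(f"{dataset.tag} carries no event data")
    id_list = data.find("EventIdList")
    if id_list is None:
        return None
    return [e.get("id") for e in id_list.findall("EventId")]


dataset = ET.fromstring("""
<EventDataset extends="BaseDataset">
  <BaseDataset identity="123-456" />
  <EventDatasetData event_count="2">
    <EventIdList count="2">
      <EventId id="4200-201" />
      <EventId id="4200-202" />
    </EventIdList>
  </EventDatasetData>
</EventDataset>
""")
print(type_data(dataset, "BaseDataset").get("identity"))  # prints 123-456
print(event_ids(dataset))  # prints ['4200-201', '4200-202']
```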
If the data for a logical dataset is in a single logical file, then a logical file dataset can be used.
LogicalFileDataset
  extends = BaseDataset
  BaseDataset
    ...
  LogicalFileDatasetData
    LogicalFile
      id = dc1.ttbar.recon008.sim
      FileCatalog
        type = Magda
        name = ATLAS
Typically users might add their own element type to describe a dataset
which has events and for which data can be found in a single logical
file:
MonteCarloFileDataset
  extends = BaseDataset,EventDataset,LogicalFileDataset
  BaseDataset
    ...
  EventDatasetData
    ...
  LogicalFileDatasetData
    ...
  geometry = 4.10
  simulator = geant4
If the data is distributed over multiple logical files (the usual case),
then datasets can be merged to form a composite dataset.
CompositeDataset
  extends = BaseDataset
  BaseDataset
    ...
  CompositeDatasetData
    DatasetIdList
      DatasetId
        id = 333-101
      DatasetId
        id = 333-102
      ...
A composite dataset may include events.
For these datasets, two types of composition are allowed:
merging datasets with the same content and different events and
merging datasets with the same events and different content.
We use the flag event for the first case and content for the latter:
CompositeEventDataset
  extends = BaseDataset,EventDataset,CompositeDataset
  BaseDataset
    ...
  EventDatasetData
    ...
  CompositeDatasetData
    ...
  CompositeEventDatasetData
    merge_type = event [or content]
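The two merge types can be illustrated with a simplified in-memory model. The Python sketch below is not the DIAL implementation: the class EventData, the function merge and the identifiers 333-103, 333-201 and 333-202 are hypothetical. It also shows how the event ID list of a composite dataset can be derived from its sub-datasets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EventData:
    """Simplified stand-in for an event dataset."""
    identity: str
    content: List[Tuple[str, str]]   # (type, key) pairs
    event_ids: List[str]


def merge(identity, parts, merge_type):
    """Combine event datasets according to merge_type ("event" or "content")."""
    if not parts:
        raise ValueError("nothing to merge")
    first = parts[0]
    if merge_type == "event":
        # Same content, different events: concatenate the event lists.
        if any(p.content != first.content for p in parts):
            raise ValueError("event merge requires identical content")
        ids = [e for p in parts for e in p.event_ids]
        if len(set(ids)) != len(ids):
            raise ValueError("event merge requires distinct events")
        return EventData(identity, list(first.content), ids)
    if merge_type == "content":
        # Same events, different content: concatenate the content lists.
        if any(p.event_ids != first.event_ids for p in parts):
            raise ValueError("content merge requires identical events")
        content = [c for p in parts for c in p.content]
        return EventData(identity, content, list(first.event_ids))
    raise ValueError(f"unknown merge_type: {merge_type!r}")


jets_a = EventData("333-101", [("RecoJets", "cone-0.3")], ["4200-201", "4200-202"])
jets_b = EventData("333-102", [("RecoJets", "cone-0.3")], ["4200-203"])
tracks_a = EventData("333-103", [("Tracks", "Kalman-refit")], ["4200-201", "4200-202"])

print(merge("333-201", [jets_a, jets_b], "event").event_ids)
# prints ['4200-201', '4200-202', '4200-203']
print(merge("333-202", [jets_a, tracks_a], "content").content)
# prints [('RecoJets', 'cone-0.3'), ('Tracks', 'Kalman-refit')]
```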
More to come...