DIAL release 1.30: Creating a dataset

This section describes how a user can create his or her own DIAL dataset. Of course, if you run a DIAL job, the output is already a dataset. Here we describe how to create a dataset from a collection of files, most likely produced outside the DIAL framework.

DIAL is not yet fully integrated with the ATLAS data management system. Once that is done (release 1.4 expected in early or mid 2006), ATLAS users will be able to use the tools from that system to create datasets and then reference them in DIAL jobs.

In the meantime for ATLAS users, and in the longer term for other users, the following sections describe how to create your own dataset. We use a typical ATLAS example: creating a dataset from a collection of AOD files.

DIAL supports many different kinds of datasets. Here we describe how to create three of them: multifile, event collection and merged event.


Multifile dataset

A multifile dataset is simply a list of files or, more precisely, a list of file URL's. Even in the common case that the files carry event data, the dataset does not contain information about those events. Thus it is not possible to use the dataset interface to select events or split the dataset into sub-datasets with a given number of events. On the other hand, it is quite simple and fast to construct such datasets. The dataset carries only a content label (e.g. ESD, AOD or HIST) but no details about the content such as the list of storegate type-keys or histogram names.

Such datasets are constructed from the content label and a list of URL's, either using the MultifileDataset class in C++, ROOT or Python, or with the following shell command:

   > make_multifile_dataset -c AOD myfiles.dat
In this example, the content label is AOD and the list of file URL's is contained in myfiles.dat. Alternatively, the URL's may be given directly on the command line in place of the file name.
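
The list file itself can be prepared with a few lines of Python. The sketch below is only an illustration and is not part of DIAL: the helper name write_url_list is hypothetical, and it assumes that the list file holds one URL per line and that local files are written as file: URLs of the form used elsewhere in this section. Check the -h output of make_multifile_dataset before relying on either assumption.

```python
from pathlib import Path


def write_url_list(data_dir, list_path, pattern="*.pool.root"):
    """Write one file: URL per matching file in data_dir and return the URLs.

    Assumptions (not taken from the DIAL documentation): the list file
    holds one URL per line, and a local file is written as
    file:<absolute path>.
    """
    # Sort so that the list is reproducible from run to run.
    files = sorted(Path(data_dir).resolve().glob(pattern))
    urls = ["file:" + f.as_posix() for f in files]
    Path(list_path).write_text("".join(url + "\n" for url in urls))
    return urls
```

Calling write_url_list(".", "myfiles.dat") would then produce a list that can be passed to the command shown above.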


Event collection

An event collection dataset is constructed from a POOL explicit collection ROOT file. These files contain a list of events with a POOL object pointer for each event along with some metadata, at minimum the run and event numbers. The event pointers include the GUID for the file containing the event. The DIAL dataset holds a URL for the event collection file and a list of GUID URL's for the files holding the data. It also holds the list of events, so it is possible to select events to construct a new dataset or to split the dataset into a list of sub-datasets each with a given number of events. It carries content information about the event metadata but not about the data files. Typically the content label would be TAG or AODTAG rather than AOD.

Event collection datasets may be constructed using the ApecDataset class in C++, ROOT or Python, or with the single-file command:

   > make_file_dataset -t ApecDataset -c TAG file:/home/data/mycollection.root
The -t option is used to specify the dataset type and -c to specify the content label. The last argument is a URL of any known type. If -p is included, then the file will be copied into the default storage element and a logical GUID will be recorded in the dataset.

Note that if a multifile dataset already exists, the transformation atlasopt:make_collection (i.e. application atlasopt with task make_collection) may be used to construct a corresponding event collection dataset.


Merged event

Merged event datasets are built by creating a single-file dataset with full event and content information for each file and then merging these datasets. The resulting dataset has very complete information, including the full list of events and detailed content, and guarantees that the included files are consistent, i.e. that all have the same content (list of type-keys) and that no event ID's are duplicated. However, it is much more difficult to create such a dataset.

Create single-file datasets

The first step is to create a dataset to describe each file. Use the following command:
  > make_file_dataset -t AtlasPoolEventDataset -p -c AOD myzeedata._00001.AOD.pool.root
for each file. Note that this and all DIAL commands provide a -h option that displays a help screen. The last word on the command line is the file name.

In the example the dataset type is AtlasPoolEventDataset, which can be used for any ATLAS implicit event collection. It will extract the list of event ID's and Storegate type-keys from the file and include them in the dataset description.

The content label is set to AOD in this example. It must be the same for all files in the dataset and should follow ATLAS conventions.

The flag -p indicates that the file should be copied to the local storage element.

The command produces a dataset description file dataset.xml. You can examine this file (or any dataset) using the dataset_property command:

  > dataset_property -f dataset.xml

Record the ID of the dataset, which can be read from the output above or displayed directly with

  > dataset_property -f dataset.xml id
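
Because the command has to be repeated for each file, the step lends itself to scripting. The following Python sketch is an illustration under stated assumptions, not a DIAL tool: the function name make_single_file_dataset is hypothetical, and it assumes that make_file_dataset writes dataset.xml into the current working directory, that it accepts an absolute path as the file name, and that "dataset_property -f dataset.xml id" prints only the ID. Each file is processed in its own directory so that one dataset.xml does not overwrite another.

```python
import subprocess
from pathlib import Path


def make_single_file_dataset(data_file, work_dir, content="AOD",
                             dataset_type="AtlasPoolEventDataset",
                             run=subprocess.run):
    """Describe one data file and return the ID of the resulting dataset.

    Assumptions (not confirmed by the DIAL documentation): dataset.xml is
    written to the working directory, and the 'id' query prints only the ID.
    The 'run' argument exists so the commands can be replaced in a dry run.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # Use an absolute path because the command runs inside work_dir.
    data_file = str(Path(data_file).resolve())
    # Same options as in the example above: -t dataset type, -p copy to the
    # local storage element, -c content label, then the file name.
    run(["make_file_dataset", "-t", dataset_type, "-p", "-c", content,
         data_file], cwd=work_dir, check=True)
    # Ask for the ID of the dataset that was just described.
    result = run(["dataset_property", "-f", "dataset.xml", "id"],
                 cwd=work_dir, check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Calling the function once per data file, with a different work_dir each time, yields the list of single-file dataset ID's needed in the later steps.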

Insert single-file datasets in the repository

The next step is to insert each of the single-file dataset descriptions into the dataset repository. This can be done as follows:
  > dataset_insert -f dataset.xml
If successful, this command returns the ID of the inserted dataset. This should match the ID recorded in the previous step.

Create the compound dataset

The next step is to merge the single-file datasets into a compound dataset. This can be done as follows:
  > dataset_merge_events 10013-256018 10013-256019
where the command line includes the list of dataset ID's, or
  > dataset_merge_events -l dataset_ids.dat
where dataset_ids.dat is a file containing the single-file dataset ID's.
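
The two calling forms can be prepared from a list of ID's with a small helper. This is a hedged sketch rather than a DIAL utility: the function names are hypothetical, and the layout of dataset_ids.dat (one ID per line) is an assumption that should be checked against the -h output of dataset_merge_events. The functions only build the ID file and the argument list; they do not run the command.

```python
from pathlib import Path


def write_id_list(dataset_ids, list_path="dataset_ids.dat"):
    """Write the IDs one per line (assumed format) and return them.

    Repeated IDs are dropped, keeping the first occurrence of each.
    """
    unique_ids = list(dict.fromkeys(dataset_ids))  # preserves input order
    Path(list_path).write_text("".join(i + "\n" for i in unique_ids))
    return unique_ids


def merge_command(dataset_ids=None, list_path=None):
    """Return the dataset_merge_events argument list for either form."""
    if (dataset_ids is None) == (list_path is None):
        raise ValueError("give exactly one of dataset_ids or list_path")
    if list_path is not None:
        # Form 2: IDs are read from a file named with the -l option.
        return ["dataset_merge_events", "-l", str(list_path)]
    # Form 1: IDs are listed directly on the command line.
    return ["dataset_merge_events", *dataset_ids]
```

The returned list can be passed to subprocess.run, or joined with spaces to reproduce the command lines shown above.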

Again the output is in dataset.xml by default and you can use the dataset_property command to examine it.

Note that the merge command does consistency checking and will fail if datasets have different event content (either label or list of type-keys) or if event ID's are duplicated.
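
To make the three conditions concrete, here is a small Python model of the kind of check described. It is not DIAL's implementation: the dictionary layout (keys id, content, type_keys and event_ids) and the function name check_mergeable are hypothetical and chosen only for illustration.

```python
def check_mergeable(datasets):
    """Return a list of problems that would make a merge inconsistent.

    Each dataset is a dict with the keys 'id', 'content' (the label),
    'type_keys' (iterable of strings) and 'event_ids' (iterable of
    hashable IDs). This layout is hypothetical, not DIAL's own.
    """
    problems = []
    if not datasets:
        return problems
    first = datasets[0]
    seen = {}  # event ID -> ID of the dataset where it first appeared
    for ds in datasets:
        # Condition 1: every dataset carries the same content label.
        if ds["content"] != first["content"]:
            problems.append(
                f"{ds['id']}: content label {ds['content']!r} differs "
                f"from {first['content']!r}")
        # Condition 2: every dataset has the same set of type-keys.
        if set(ds["type_keys"]) != set(first["type_keys"]):
            problems.append(f"{ds['id']}: type-keys differ from {first['id']}")
        # Condition 3: no event ID occurs more than once overall.
        for event in ds["event_ids"]:
            if event in seen:
                problems.append(
                    f"event {event} appears in {seen[event]} and {ds['id']}")
            else:
                seen[event] = ds["id"]
    return problems
```

An empty result means that none of the three conditions is violated in this model; the real merge command remains the authority on whether datasets can be combined.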

Insert compound dataset in the repository

Once you are satisfied all is OK, insert the compound dataset in the repository using the dataset_insert command just as for each of the single-file datasets. You may now use this dataset as input to a DIAL job by referencing its ID in a job description file.

Hierarchical datasets

It is possible to construct datasets with more complex trees by merging subsets of single-file datasets into intermediate datasets and then merging these intermediate datasets into the top-level dataset. The Rome AOD datasets were created in this manner by first creating intermediate datasets each with a maximum of 50 single-file datasets.
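
The grouping logic can be sketched in Python as follows. This is an illustration only: merge_hierarchically is a hypothetical name, and the merge argument stands for whatever the caller uses to merge one group and obtain the ID of the result (for example, running dataset_merge_events followed by dataset_insert). The default group size of 50 follows the Rome AOD example.

```python
def merge_hierarchically(dataset_ids, merge, max_per_group=50):
    """Merge IDs level by level and return the ID of the top-level dataset.

    'merge' is a caller-supplied function that takes a list of dataset IDs,
    merges them, and returns the ID of the merged dataset. It is a
    placeholder for the real merge and insert steps.
    """
    if max_per_group < 2:
        raise ValueError("max_per_group must be at least 2")
    ids = list(dataset_ids)
    if not ids:
        raise ValueError("no dataset IDs given")
    while len(ids) > 1:
        next_level = []
        for start in range(0, len(ids), max_per_group):
            group = ids[start:start + max_per_group]
            # A group with a single member needs no merge of its own.
            next_level.append(group[0] if len(group) == 1 else merge(group))
        ids = next_level
    return ids[0]
```

With 120 single-file datasets and the default group size, this sketch would call merge on groups of 50, 50 and 20 ID's, and then once more on the three intermediate ID's.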


DIAL release 1.30: Creating a dataset, updated 02feb06