ADA datasets

Contact: David Adams


Introduction

Datasets provide the means to specify and examine collections of data. Here we are interested in identifying the datasets that are required to support ATLAS activities. All datasets meet the AJDL Dataset interface; in practice this means that DIAL provides a Dataset subclass used to construct the dataset.

For DC1, we only provide the capability to extract histograms from the combined ntuple files that are the endpoint of reconstruction. The relevant datasets describe combined ntuples and HBOOK histograms.

For DC2, we will support a much broader range of activities including simulation, reconstruction, selection and the extraction of summary data from any data type. Most of the event data will be in the form of POOL event collections; raw data in bytestream format is one notable exception. We assume that it is now more useful to write ROOT histograms than HBOOK histograms.

The relevant datasets are described below. Unless otherwise noted, these are already implemented in DIAL.


Dataset properties

There are a number of properties that characterize a dataset.

Each dataset (and most other AJDL objects) is assigned a unique identity.

Content describes the data held by the dataset. If the dataset holds event data, it includes a list of event ID's and a list of content ID's that are type-key pairs (as in StoreGate). For non-event data, only the list of content ID's is present. The content is expressed as a list of content blocks, each with this information. Each content block also carries a label and its dataset type name.

Location tells where the data may be found and is most often a list of logical file names.

If a dataset is composed of other datasets, then the sub-datasets are called the constituents. The content and location may be explicit, i.e. included directly in the dataset description, or implicit, if one must examine the content or location of the constituents to determine their values.

See "Datasets for the grid" on the ADA documents page for more details.
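The properties above can be sketched as a small data structure. This is only an illustration and not the DIAL or AJDL implementation: the class names ContentBlock and SketchDataset, the method names, the use of integers for event ID's and the example file names are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ContentBlock:
    """One content block (hypothetical layout, not the AJDL schema)."""
    label: str                              # content label, e.g. a processing stage
    dataset_type: str                       # dataset type name carried by the block
    content_ids: List[Tuple[str, str]]      # (type, key) pairs, as in StoreGate
    event_ids: Optional[List[int]] = None   # None for non-event data


@dataclass
class SketchDataset:
    """Hypothetical stand-in for a dataset; not the DIAL class."""
    identity: str
    content: Optional[List[ContentBlock]] = None   # None means implicit
    location: Optional[List[str]] = None           # logical file names; None means implicit
    constituents: List["SketchDataset"] = field(default_factory=list)

    def resolved_location(self) -> List[str]:
        """Return the explicit location, or gather it from the constituents."""
        if self.location is not None:
            return list(self.location)
        files: List[str] = []
        for sub in self.constituents:
            files.extend(sub.resolved_location())
        return files

    def resolved_content(self) -> List[ContentBlock]:
        """Return the explicit content, or gather it from the constituents."""
        if self.content is not None:
            return list(self.content)
        blocks: List[ContentBlock] = []
        for sub in self.constituents:
            blocks.extend(sub.resolved_content())
        return blocks


# Made-up example values: two single-file datasets and a compound parent.
block = ContentBlock("ESD", "AtlasPoolEventDataset", [("TrackCollection", "Tracks")], [1, 2])
first = SketchDataset("ds-1", content=[block], location=["lfn:first.pool.root"])
second = SketchDataset("ds-2", content=[block], location=["lfn:second.pool.root"])
parent = SketchDataset("ds-3", constituents=[first, second])
print(parent.resolved_location())
```

Here the parent carries neither content nor location itself, so both are implicit and are found by examining its constituents.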


Generic dataset types

Here we describe dataset types not specific to ATLAS or any other experiment.

Dataset
This defines the interface for all datasets.

GenericDataset
This is an implementation of the Dataset interface that is used to hold the data for all current dataset types. This class defines a common XML schema that is used to describe all these dataset types.

SimpleCompoundDataset
A dataset which is made up of a collection of other datasets. The content and location are implicit.

EventMergeDataset
Collection of event datasets with the same type-keys and different events.
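The constraint on an event merge (same type-keys, different events) might be checked as in the sketch below. The names EventPart and merge_event_parts are hypothetical, event ID's are simplified to integers, and the type-key values are made up; DIAL may enforce or represent this differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple

TypeKey = Tuple[str, str]  # (type, key) pair


@dataclass(frozen=True)
class EventPart:
    """Hypothetical summary of one event dataset."""
    event_ids: Tuple[int, ...]
    type_keys: FrozenSet[TypeKey]


def merge_event_parts(parts: Sequence[EventPart]) -> EventPart:
    """Combine event datasets that share type-keys and have distinct events."""
    if not parts:
        raise ValueError("need at least one constituent")
    type_keys = parts[0].type_keys
    seen = set()
    merged: List[int] = []
    for part in parts:
        # Every constituent must hold the same set of type-keys.
        if part.type_keys != type_keys:
            raise ValueError("constituents must share the same type-keys")
        for event_id in part.event_ids:
            # No event may appear more than once in the merged dataset.
            if event_id in seen:
                raise ValueError(f"event {event_id} appears more than once")
            seen.add(event_id)
            merged.append(event_id)
    return EventPart(tuple(merged), type_keys)


# Made-up type-key and event ID's for illustration.
keys = frozenset({("TrackCollection", "Tracks")})
part_a = EventPart((1, 2), keys)
part_b = EventPart((3,), keys)
merged_part = merge_event_parts([part_a, part_b])
print(merged_part.event_ids)
```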


ATLAS dataset types

Here we describe dataset types specific to ATLAS. Each of these holds a single file and can be constructed from that file.

CbntDataset
This dataset describes a single HBOOK file containing a DC1 combined ntuple. It provides access to the list of event ID's and the list of blocks in the ntuple.

HbookDataset
This dataset describes a single HBOOK file containing histograms. It provides a merge method to append the histogram contents from another such dataset.

AtlasPoolEventDataset
This event dataset describes an ATLAS-POOL event collection, holding a single file that is an implicit collection of ATLAS event headers. The content includes the ATLAS event ID's and the StoreGate type-keys.

RootHistogramDataset
This dataset describes a single ROOT file containing histograms and/or ntuples. It provides a merge method to append the contents from another such dataset.

AtlasRaw
This dataset has not been implemented. It will hold an ATLAS bytestream file.
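The merge method of the histogram datasets above is not specified in detail here. The sketch below shows one plausible reading, assumed for illustration only: histograms with the same name are added bin by bin, and histograms present only in the other dataset are copied over. The class HistogramSet is hypothetical and uses plain lists of bin contents rather than HBOOK or ROOT objects.

```python
from typing import Dict, List, Optional


class HistogramSet:
    """Hypothetical in-memory stand-in for the histograms in one file."""

    def __init__(self, histograms: Optional[Dict[str, List[float]]] = None):
        # Copy the bin lists so the caller's data is never modified.
        self.histograms: Dict[str, List[float]] = {
            name: list(bins) for name, bins in (histograms or {}).items()
        }

    def merge(self, other: "HistogramSet") -> None:
        """Add the contents of another set into this one (assumed semantics)."""
        for name, bins in other.histograms.items():
            if name not in self.histograms:
                # Histogram only present in the other set: copy it.
                self.histograms[name] = list(bins)
                continue
            mine = self.histograms[name]
            if len(mine) != len(bins):
                raise ValueError(f"binning mismatch for histogram {name!r}")
            # Same name and binning: add bin contents.
            self.histograms[name] = [a + b for a, b in zip(mine, bins)]


# Made-up histogram names and contents.
total = HistogramSet({"pt": [1.0, 2.0]})
extra = HistogramSet({"pt": [0.5, 0.5], "eta": [3.0]})
total.merge(extra)
print(total.histograms)
```

Merging in place mirrors the description of appending the contents from another dataset; the other dataset is left unchanged.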


ATLAS content labels

Most of the ATLAS datasets will be of type AtlasPoolEventDataset, but we identify many stages in the processing. The content labels carry the information about the processing stage. Below are some of the labels relevant to ATLAS along with the corresponding dataset type.

EVGEN (AtlasPoolEventDataset)
Holds data produced by event generators such as PYTHIA and ISAJET.

HITS (AtlasPoolEventDataset)
Holds hits and truth information produced by a detector simulation program such as GEANT4.

DIGI (AtlasPoolEventDataset)
Holds the digits (aka digitizations) simulating detector responses to hits.

RAW (AtlasRaw)
Holds raw data from the detector or a simulation thereof.

ESD (AtlasPoolEventDataset)
Holds event summary data (ESD), a summary of the reconstruction of raw data or digits.

AOD (AtlasPoolEventDataset)
Holds analysis-oriented data (AOD), a summary of ESD.

TAG (no type yet)
Summary of AOD in a relational table.

NTUP (no type yet)
Ntuples often with each entry describing an event.

HISTO (RootHistogramDataset)
Histograms.
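The correspondence between content labels and dataset types listed above can be written as a simple lookup table. The names LABEL_TO_TYPE and dataset_type_for are hypothetical and exist only in this sketch; labels with no type yet are mapped to None.

```python
from typing import Dict, Optional

# Content label -> dataset type name, as listed above (None: no type yet).
LABEL_TO_TYPE: Dict[str, Optional[str]] = {
    "EVGEN": "AtlasPoolEventDataset",
    "HITS": "AtlasPoolEventDataset",
    "DIGI": "AtlasPoolEventDataset",
    "RAW": "AtlasRaw",
    "ESD": "AtlasPoolEventDataset",
    "AOD": "AtlasPoolEventDataset",
    "TAG": None,
    "NTUP": None,
    "HISTO": "RootHistogramDataset",
}


def dataset_type_for(label: str) -> Optional[str]:
    """Return the dataset type name for a content label, or None if undefined."""
    key = label.upper()
    if key not in LABEL_TO_TYPE:
        raise KeyError(f"unknown content label: {label}")
    return LABEL_TO_TYPE[key]


print(dataset_type_for("ESD"))
```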


Last modified 27jun05 by dla