Data history

D. Adams, S. Rajagopalan and P. Calafiura
Version 1.3.2
24dec01 1300


Here we describe requirements and a design for maintaining the history for the collection of data objects generated in a HEP experiment. Our global requirement is that any piece of event data be self describing. This means one can recover the complete chain of operations used to produce the data.


Contents

Status

Definitions

Use cases

Requirements

Design issues

Current implementation

Comments

Talks

Calafiura and Rajagopalan


Definitions

Data object

A data object is the smallest piece of data that we write into or read out of the data store. We expect these to be self describing.

Event data object (EDO)

This is a data object that is part of the reconstruction of a particular event, i.e. a particular triggered beam crossing or simulation thereof.

Event data contained object

EDO's are typically containers of physics data objects such as clusters, tracks or electrons. We call these event data contained objects or simply contained objects. These are accessed by first fetching the containing EDO from the data store and then extracting the contained object from that container.

Event data history

Event data history is the history information associated with an EDO. This specifies the immediate origin of this object including its parent objects and the algorithm used to create it.

Algorithm

An algorithm is something which takes a well defined collection of EDO's as input and produces one or more output EDO's. There may be input from other non-event sources such as calibration or alignment data. There is no other input from event sources.

EDO replication

It may be desirable to reorganize data after processing by clustering commonly accessed events and widely used EDO's from those events. This implies that the EDO's from one or more files may be copied to another file. We call this EDO replication.

EDO regeneration

EDO regeneration is the process of using the history of an EDO to recreate the EDO. If the history is complete and the software and hardware environment are sufficiently close to their original values, then the regenerated EDO will be identical to the original and can be used in its place.


Use cases

Some of the following use cases provide long lists of tasks to be carried out by the "user". It is expected that tools will be provided to encapsulate most of these activities to hide them from the end user.

Check history

A user has a reconstructed electron from an event and would like to know which fitting algorithm was used to evaluate the parameters for the associated track. The electron was taken from a particular EDO (event data object) holding a collection of electrons for this event. This EDO was built from two parent EDO's: one containing tracks and the other EM clusters.

The history of the electron EDO is used to fetch the history of the parent track EDO. The history of the latter is then used to fetch the name of the fitting algorithm, its version (or the release version) and the run time parameters associated with this algorithm.
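
The lookup described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the classes AlgorithmHistory and DataHistory, the "kind" label, the function parent_history and all algorithm names and parameter values are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AlgorithmHistory:
    name: str
    version: str
    parameters: Dict[str, float]


@dataclass
class DataHistory:
    edo_id: str
    kind: str  # hypothetical label, e.g. "tracks" or "em_clusters"
    algorithm: AlgorithmHistory
    parent_ids: List[str] = field(default_factory=list)


def parent_history(histories: Dict[str, DataHistory],
                   child_id: str, kind: str) -> Optional[DataHistory]:
    """Return the history of the first parent of child_id with the given kind."""
    for parent_id in histories[child_id].parent_ids:
        parent = histories[parent_id]
        if parent.kind == kind:
            return parent
    return None


# Illustrative data only: names, versions and parameters are invented.
histories = {
    "edo-tracks": DataHistory(
        "edo-tracks", "tracks",
        AlgorithmHistory("KalmanFitter", "1.2", {"max_chi2": 25.0})),
    "edo-clusters": DataHistory(
        "edo-clusters", "em_clusters",
        AlgorithmHistory("EMClusterBuilder", "0.9", {"seed_energy": 1.5})),
}
histories["edo-electrons"] = DataHistory(
    "edo-electrons", "electrons",
    AlgorithmHistory("ElectronBuilder", "2.0", {}),
    parent_ids=["edo-tracks", "edo-clusters"])

# Electron EDO history -> parent track EDO history -> fitting algorithm.
track_history = parent_history(histories, "edo-electrons", "tracks")
fitter = track_history.algorithm
print(fitter.name, fitter.version, fitter.parameters)
```

Only history objects are touched here; the physics data in the track EDO is never read.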

Select on history

A user has a list of track EDO's associated with an event. Each EDO contains a list of reconstructed tracks found with a particular algorithm and set of run time parameters. The user wishes to find all track collections generated with a particular algorithm for some range of run time parameters.

The user iterates over the list of EDO's, fetches the history for each, fetches the algorithm and run time parameters for each history and then keeps those which meet the desired criteria.
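
A selection of this kind might look like the sketch below. The class TrackHistory, the function select_on_history and the parameter name "pt_min" are hypothetical and serve only to illustrate the iteration and the range cut.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrackHistory:
    edo_id: str
    algorithm_name: str
    parameters: Dict[str, float]


def select_on_history(histories: List[TrackHistory], algorithm_name: str,
                      parameter: str, low: float, high: float) -> List[str]:
    """Return ids of EDO's made by algorithm_name with low <= parameter <= high."""
    selected = []
    for history in histories:
        if history.algorithm_name != algorithm_name:
            continue
        value = history.parameters.get(parameter)
        # Histories that do not record the parameter are rejected.
        if value is not None and low <= value <= high:
            selected.append(history.edo_id)
    return selected


# Illustrative histories for the track EDO's of one event.
track_histories = [
    TrackHistory("tracks-a", "KalmanFitter", {"pt_min": 0.5}),
    TrackHistory("tracks-b", "KalmanFitter", {"pt_min": 2.0}),
    TrackHistory("tracks-c", "GlobalChi2Fitter", {"pt_min": 0.5}),
]
selected = select_on_history(track_histories, "KalmanFitter", "pt_min", 0.0, 1.0)
print(selected)
```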

Virtual data

A user has developed a new track fitting algorithm and would like to refit the tracks in a collection of events. Each track is a list of pointers to clusters. For each event, the tracks are stored in an EDO on disk but the clusters were stored in a separate EDO which has been discarded. The user has access to copies of the raw data used to construct the clusters.

For each event the user fetches the track EDO history and fetches an index for the parent cluster EDO from that history. The clustering algorithm name, version and run time parameters are obtained from the cluster history and the algorithm is rerun to reproduce the cluster data in the EDO. The tracks are then refit using the new clusters.
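
The regeneration step can be sketched as a recursive lookup: if an EDO is missing from the store, its history names the algorithm, parameters and parents needed to rebuild it. Everything below is an assumption made for illustration, including the History class, the regenerate function, the algorithm registry keyed by (name, version) and the toy strip clustering algorithm.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class History:
    edo_id: str
    algorithm: Tuple[str, str]  # (name, version)
    parameters: Dict[str, float]
    parent_ids: List[str] = field(default_factory=list)


def regenerate(edo_id: str, histories: Dict[str, History], store: Dict[str, Any],
               algorithms: Dict[Tuple[str, str], Callable]) -> Any:
    """Return the EDO data, rerunning the recorded algorithm if it was discarded."""
    if edo_id in store:
        return store[edo_id]
    history = histories[edo_id]
    inputs = [regenerate(parent_id, histories, store, algorithms)
              for parent_id in history.parent_ids]
    store[edo_id] = algorithms[history.algorithm](inputs, history.parameters)
    return store[edo_id]


def cluster_strips(inputs, parameters):
    """Toy clustering: group strip numbers separated by at most max_gap."""
    clusters = []
    for strip in sorted(inputs[0]):
        if clusters and strip - clusters[-1][-1] <= parameters["max_gap"]:
            clusters[-1].append(strip)
        else:
            clusters.append([strip])
    return clusters


algorithms = {("StripClusterer", "1.0"): cluster_strips}
store = {"raw-1": [3, 4, 9, 10, 11]}  # raw data is available, clusters are not
histories = {
    "clusters-1": History("clusters-1", ("StripClusterer", "1.0"),
                          {"max_gap": 1}, ["raw-1"]),
    "tracks-1": History("tracks-1", ("TrackFinder", "2.1"), {}, ["clusters-1"]),
}

# Track history -> index of the parent cluster EDO -> rerun the clustering.
cluster_id = histories["tracks-1"].parent_ids[0]
clusters = regenerate(cluster_id, histories, store, algorithms)
print(clusters)
```

The regenerated clusters are cached in the store under the original index, so later references to the same index find them without rerunning the algorithm.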

Replicated data

Tracking data is processed by first creating clusters from adjacent strips and then running a track finding algorithm to group clusters into tracks. An EDO containing clusters is written after the first stage and a second EDO containing tracks is created in the second stage. A track is a kinematic fit plus "pointers" to the clusters associated with the track.

A user replicates the tracking data and brings the copy to his local site. Later he decides to refit the tracks and replicates the clusters in a separate file. The two files are then used as input to the refitting job. The program must recognize that the cluster references in the replicated track EDO can be satisfied using the clusters in the replicated cluster EDO.
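
One way to satisfy such references is to resolve an EDO index against every open input file rather than against the file the reference came from. The sketch below assumes this approach; the function resolve, the file names, the index strings and the dictionary layout of the EDO's are all invented for the example.

```python
from typing import Any, Dict, Tuple


def resolve(index: str, open_files: Dict[str, Dict[str, Any]]) -> Tuple[str, Any]:
    """Find the EDO with the given index in any open file."""
    for file_name, contents in open_files.items():
        if index in contents:
            return file_name, contents[index]
    raise KeyError(f"no open file holds an EDO with index {index!r}")


# Two replicated files: one with the track EDO, one with the cluster EDO.
open_files = {
    "tracks_copy.dat": {
        "evt42/tracks": {
            "cluster_edo": "evt42/clusters",  # index of the parent EDO
            "tracks": [[0, 2], [1]],          # positions of clusters per track
        },
    },
    "clusters_copy.dat": {
        "evt42/clusters": ["c0", "c1", "c2"],
    },
}

_, track_edo = resolve("evt42/tracks", open_files)
source_file, clusters = resolve(track_edo["cluster_edo"], open_files)
track_clusters = [[clusters[i] for i in track] for track in track_edo["tracks"]]
print(source_file, track_clusters)
```

Because the track EDO stores only the index of its parent, the same lookup works whether the clusters sit in the original file or in a replica.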

Creation in Gaudi environment

Normal data production will be in the Gaudi environment. The controlling framework (Athena) will invoke an algorithm and then create a history object and associate it with each produced event data object. The creator of the history object is given the name (identifying string) of the algorithm.


Requirements

1. In addition to the physics data (which may not be present), each EDO (event data object) must contain or provide access to essential history information. This essential history information includes: With the exception of the parent EDO's, this history information may be common to (and shared by) EDO's from many events.

2. The above information must be sufficient to enable exact reproduction of the data. If not, it should be expanded.

3. In addition to the essential information above, the history may include the following nonessential information:

This information is not required to reproduce the data. Much of it will change if the data are regenerated.

4. The history information may still be present even if the data has been discarded or is stored in another place.

5. There must be a way to index EDO's (and other data objects) so that a child EDO can reference its parents (the first piece of history information in item 1).

6. These indices should span files, federations, database technologies and geographical locations so that an EDO can be copied without carrying along its ancestry (EDO's and other data objects).

7. EDO's (and other data objects) can be replicated. The copies have exactly the same data, essential history and index. Some of the optional history data may differ.

8. The uniqueness of the indices and reproducibility of the data guarantee that two data objects with the same index can be freely interchanged. Each will have the same (or equivalent) data and essential history.

9. Replicated or regenerated EDO's may carry extra information reflecting their true origin.


Design issues

Object identifiers

There must be a persistent way to label event data objects so that we can record parents. Object identification can be done via global identifiers.
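
As an illustration, a global identifier could be a randomly generated UUID assigned when an EDO is created and preserved when it is replicated. The use of UUIDs is an assumption of this sketch, not something the design prescribes; the class EventDataObject and the function replicate are likewise hypothetical.

```python
import copy
import uuid
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class EventDataObject:
    payload: Any
    parent_indices: List[str] = field(default_factory=list)
    # Assigned once at creation; never regenerated for a copy.
    index: str = field(default_factory=lambda: str(uuid.uuid4()))


def replicate(edo: EventDataObject) -> EventDataObject:
    """Copy an EDO, keeping its index so the copy can stand in for the original."""
    return copy.deepcopy(edo)


clusters = EventDataObject(payload=["c0", "c1"])
tracks = EventDataObject(payload=[[0, 1]], parent_indices=[clusters.index])
clusters_copy = replicate(clusters)

# The parent reference recorded in the track EDO matches the replica as well.
print(tracks.parent_indices == [clusters_copy.index])
```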

Distributing history information

Most of the history data is shared by more than one event data object. The data should be distributed over histories of different types, which are shared to minimize duplication.

Here is a possible structure:

Job history (created at the beginning of the job)

Algorithm history (created from the algorithm after initialization but before event processing)

Data history (created for each EDO produced by an algorithm)

Algorithm and job histories are associated with the data history and so the latter can be used to access all history data.
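
The sharing in this structure can be sketched with three small classes. The field names (release, parameters, edo_index and so on) and the example values are assumptions chosen for illustration; the source does not specify the contents of each history.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JobHistory:
    release: str  # hypothetical field: software release used by the job


@dataclass
class AlgorithmHistory:
    name: str
    parameters: Dict[str, float]
    job: JobHistory  # shared by all algorithms in the job


@dataclass
class DataHistory:
    edo_index: str
    algorithm: AlgorithmHistory  # shared by all EDO's the algorithm produces
    parent_indices: List[str] = field(default_factory=list)


# One job history and one algorithm history serve the EDO's of many events.
job = JobHistory(release="release-x")
clusterer = AlgorithmHistory("StripClusterer", {"max_gap": 1.0}, job)
data_histories = [
    DataHistory(f"evt{n}/clusters", clusterer, [f"evt{n}/raw"])
    for n in range(3)
]

# Starting from any data history, all other history data is reachable.
last = data_histories[-1]
print(last.algorithm.name, last.algorithm.job.release, last.parent_indices)
```

Only the parent indices differ from event to event, so the per-EDO data history stays small.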

Transient interface

History information must be persistent, i.e. we must be able to recover the history of an event data object taken from the persistent store. However we would like to construct persistent history data from transient objects (algorithms, data objects, ...) and hide the persistency from the user.

Historian

Job, algorithm and data histories are created whenever jobs are run to produce data. The class Historian provides a convenient interface for creating, accessing and managing all these history objects.
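
A possible shape for such an interface is sketched below. The method names (register_algorithm, record_data, data_history) and the history fields are assumptions made for this example and should not be read as the interface of the actual Historian class.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobHistory:
    release: str


@dataclass
class AlgorithmHistory:
    name: str
    parameters: Dict[str, float]
    job: JobHistory


@dataclass
class DataHistory:
    edo_index: str
    parent_indices: List[str]
    algorithm: AlgorithmHistory


class Historian:
    """Creates and looks up job, algorithm and data histories for one job."""

    def __init__(self, release: str):
        # The job history is created at the beginning of the job.
        self.job = JobHistory(release)
        self._algorithms: Dict[str, AlgorithmHistory] = {}
        self._data: Dict[str, DataHistory] = {}

    def register_algorithm(self, name: str,
                           parameters: Dict[str, float]) -> AlgorithmHistory:
        """Call after algorithm initialization, before event processing."""
        history = AlgorithmHistory(name, dict(parameters), self.job)
        self._algorithms[name] = history
        return history

    def record_data(self, edo_index: str, algorithm_name: str,
                    parent_indices: List[str]) -> DataHistory:
        """Call for each EDO produced; only the algorithm name is needed."""
        history = DataHistory(edo_index, list(parent_indices),
                              self._algorithms[algorithm_name])
        self._data[edo_index] = history
        return history

    def data_history(self, edo_index: str) -> DataHistory:
        return self._data[edo_index]


historian = Historian(release="release-x")
historian.register_algorithm("TrackFinder", {"pt_min": 0.5})
historian.record_data("evt1/tracks", "TrackFinder", ["evt1/clusters"])
found = historian.data_history("evt1/tracks")
print(found.algorithm.name, found.algorithm.job.release, found.parent_indices)
```

In this sketch the caller passes only the identifying string of the algorithm when recording a data history, and the Historian attaches the matching algorithm and job histories.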


dladams@bnl.gov