Data history
D. Adams, S. Rajagopalan and P. Calafiura
Version 1.3.2
24dec01 1300
Here we describe requirements and a design for maintaining the history
for the collection of data objects generated in a HEP experiment.
Our global requirement is that any piece of event data be self describing.
This means one can recover the complete chain of operations used to
produce the data.
Definitions
Data object
A data object is the smallest
piece of data that we write into or read out of the data store.
We expect these to be self describing.
Event data object (EDO)
This is a data object that is part of the reconstruction of a
particular event, i.e. a particular triggered beam crossing or
simulation thereof.
Event data contained object
EDO's are typically containers of physics data objects such as
clusters, tracks or electrons. We call these event data contained
objects or simply contained objects. These are accessed by first
fetching the containing EDO from the data store and then extracting
the contained object from that container.
Event data history
Event data history is the history information associated with an EDO.
This specifies the immediate origin of this object including its
parent objects and the algorithm used to create it.
Algorithm
An algorithm is something which takes a well defined collection of
EDO's as input and produces one or more output EDO's.
There may be input from other non-event sources such as calibration
or alignment data. There is no other input from event sources.
EDO replication
It may be desirable to reorganize data after processing by clustering
commonly accessed events and widely used EDO's from those events. This
implies that the EDO's from one or more files may be copied to
another file. We call this EDO replication.
EDO regeneration
EDO regeneration is the process of using the history of an EDO to
recreate the EDO. If the history is complete and the software and
hardware environment are sufficiently close to their original values,
then the regenerated EDO will be identical to the original and can
be used in its place.
Use cases
Some of the following use cases provide long lists of tasks to be
carried out by the "user". It is expected that tools will be provided
to encapsulate most of these activities to hide them from the
end user.
Check history
A user has a reconstructed electron from an event and would like to
know which fitting algorithm was used to evaluate the parameters for
the associated track.
The electron was taken from a particular EDO (event data object)
holding a collection of electrons for this event. This EDO was
built from two parent EDO's: one containing tracks and the other
EM clusters.
The history of the electron EDO is used to fetch the history of the
parent track EDO. The history of the latter is then used to fetch
the name of the fitting algorithm, its version (or the release version)
and the run time parameters associated with this algorithm.
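
The lookup above can be sketched as follows. This is a minimal
illustration, not the proposed interface: the classes DataHistory and
AlgorithmHistory, their fields, the helper find_ancestor and all
algorithm names and parameter values are hypothetical. For brevity the
parents are held as direct object references, whereas the design
expects them to be referenced by index.

```python
from dataclasses import dataclass, field


@dataclass
class AlgorithmHistory:
    name: str
    version: str
    parameters: dict = field(default_factory=dict)


@dataclass
class DataHistory:
    data_type: str                                # kind of contained object
    algorithm: AlgorithmHistory
    parents: list = field(default_factory=list)   # parent DataHistory objects


def find_ancestor(history, data_type):
    """Depth-first search of the ancestry for the first history of data_type."""
    for parent in history.parents:
        if parent.data_type == data_type:
            return parent
        found = find_ancestor(parent, data_type)
        if found is not None:
            return found
    return None


# Hypothetical example values.
track_hist = DataHistory(
    "Track", AlgorithmHistory("KalmanFitter", "1.2", {"max_chi2": 10.0}))
cluster_hist = DataHistory("EMCluster", AlgorithmHistory("EMClusterMaker", "0.9"))
electron_hist = DataHistory(
    "Electron", AlgorithmHistory("ElectronMaker", "2.0"),
    parents=[track_hist, cluster_hist])

# Electron history -> parent track history -> fitting algorithm.
fitter = find_ancestor(electron_hist, "Track").algorithm
print(fitter.name, fitter.version, fitter.parameters)
```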
Select on history
A user has a list of track EDO's associated with an event. Each EDO contains
a list of reconstructed tracks found with a particular algorithm and
set of run time parameters. The user wishes to find all track collections
generated with a particular algorithm for some range of run time parameters.
The user iterates over the list of EDO's, fetches the history for each,
fetches the algorithm and run time parameters for each history and then
keeps those which meet the desired criteria.
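
A minimal sketch of this selection is shown below. The classes History
and TrackEDO, the function select_on_history and the algorithm and
parameter names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class History:
    algorithm: str
    parameters: dict = field(default_factory=dict)


@dataclass
class TrackEDO:
    index: str
    history: History


def select_on_history(edos, algorithm, parameter, low, high):
    """Return the EDOs made by `algorithm` with low <= parameter <= high."""
    selected = []
    for edo in edos:
        history = edo.history                       # fetch the history
        if history.algorithm != algorithm:
            continue
        value = history.parameters.get(parameter)   # fetch the run time parameter
        if value is not None and low <= value <= high:
            selected.append(edo)
    return selected


# Hypothetical track collections for one event.
edos = [
    TrackEDO("trk-a", History("ConeFinder", {"pt_min": 0.5})),
    TrackEDO("trk-b", History("ConeFinder", {"pt_min": 2.0})),
    TrackEDO("trk-c", History("RoadFinder", {"pt_min": 0.5})),
]
kept = select_on_history(edos, "ConeFinder", "pt_min", 0.0, 1.0)
print([edo.index for edo in kept])
```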
Virtual data
A user has developed a new track fitting algorithm and would like to
refit the tracks in a collection of events. Each track is a list of
pointers to clusters. For each event, the tracks are stored in an
EDO on disk but the clusters were stored in a separate EDO which has
been discarded. The user has access to copies of the raw data used
to construct the clusters.
For each event the user fetches the track EDO history and fetches
an index for the parent cluster EDO from that history. The clustering
algorithm name, version and run time parameters are obtained from
the cluster history and the algorithm is rerun to reproduce the
cluster data in the EDO. The tracks are then refit using the new
clusters.
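
The regeneration step in this use case might look like the sketch
below. It assumes a history store that outlives the data store and a
registry that maps an algorithm name and version to runnable code. The
names History, ALGORITHMS, make_clusters and fetch, the index strings
and the toy clustering itself are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class History:
    algorithm: str
    version: str
    parameters: tuple    # (name, value) pairs
    parents: tuple       # indices of the parent data objects


def make_clusters(inputs, threshold):
    """Toy clustering: keep the raw values at or above threshold."""
    (raw,) = inputs
    return [x for x in raw if x >= threshold]


# Hypothetical registry: (algorithm name, version) -> callable.
ALGORITHMS = {("ClusterMaker", "1.0"): make_clusters}


def fetch(index, data_store, history_store):
    """Return the data for index, rerunning its algorithm if the data is gone."""
    if index in data_store:
        return data_store[index]
    hist = history_store[index]
    inputs = [fetch(p, data_store, history_store) for p in hist.parents]
    algorithm = ALGORITHMS[(hist.algorithm, hist.version)]
    data = algorithm(inputs, **dict(hist.parameters))
    data_store[index] = data
    return data


data_store = {"raw-1": [3, 8, 1, 9]}      # the cluster EDO "clu-1" was discarded
history_store = {
    "clu-1": History("ClusterMaker", "1.0", (("threshold", 5),), ("raw-1",)),
}
clusters = fetch("clu-1", data_store, history_store)
print(clusters)
```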
Replicated data
Tracking data is processed by first creating clusters from adjacent
strips and then running a track finding algorithm to group clusters
into tracks. An EDO containing clusters is written after the first
stage and a second EDO containing tracks is created in the second step.
A track is a kinematic fit plus "pointers" to the clusters associated
with the track.
A user replicates the tracking data and brings the copy to his local
site. Later he decides to refit the tracks and replicates the
clusters in a separate file. The two files are then used as input
to the refitting job. The program must recognize that the cluster
references in the replicated track EDO can be satisfied using the
clusters in the replicated cluster EDO.
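
One way to satisfy such references is to build a catalog of EDO indices
over all input files and to resolve each cluster reference through it.
The sketch below assumes that a replica keeps the index of its
original. The file layout, the dictionary keys and the function names
are hypothetical.

```python
def build_catalog(files):
    """Map each EDO index to an EDO, over all input files."""
    catalog = {}
    for edos in files:
        # Replicas share an index, so any copy found will do.
        catalog.update(edos)
    return catalog


def clusters_for(track, track_edo, catalog):
    """Resolve the cluster "pointers" of a track via the parent cluster EDO."""
    cluster_edo = catalog[track_edo["cluster_edo"]]
    return [cluster_edo["clusters"][i] for i in track["cluster_ids"]]


# Two hypothetical replica files, each mapping EDO index -> EDO.
track_file = {
    "trk-1": {"cluster_edo": "clu-1", "tracks": [{"cluster_ids": [0, 2]}]},
}
cluster_file = {
    "clu-1": {"clusters": ["c0", "c1", "c2"]},
}

catalog = build_catalog([track_file, cluster_file])
track_edo = catalog["trk-1"]
resolved = clusters_for(track_edo["tracks"][0], track_edo, catalog)
print(resolved)
```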
Creation in Gaudi environment
Normal data production will be in the Gaudi environment. The
controlling framework (Athena) will invoke an algorithm and then
create a history object and associate it with each produced
event data object. The creator of the history object is given the
name (identifying string) of the algorithm.
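
The sequence can be illustrated with a toy framework. This is not the
Gaudi or Athena interface: the class ToyFramework, its execute method
and the record layout are assumptions chosen only to show the order of
operations (invoke the algorithm, then attach a history to each
output).

```python
class ToyFramework:
    """Toy stand-in for a controlling framework (hypothetical)."""

    def __init__(self):
        self.store = {}       # key -> event data object
        self.histories = {}   # key -> history record

    def execute(self, name, algorithm, input_keys):
        """Run the algorithm, then create a history for each produced object."""
        inputs = [self.store[key] for key in input_keys]
        outputs = algorithm(*inputs)            # dict: key -> data
        for key, data in outputs.items():
            self.store[key] = data
            # The history creator is given the algorithm name.
            self.histories[key] = {"algorithm": name,
                                   "parents": list(input_keys)}
        return list(outputs)


def make_clusters(strips):
    """Toy algorithm: keep the strips with a nonzero signal."""
    return {"clusters": [s for s in strips if s > 0]}


fw = ToyFramework()
fw.store["strips"] = [0, 4, 0, 7]
produced = fw.execute("ClusterMaker", make_clusters, ["strips"])
print(produced, fw.store["clusters"], fw.histories["clusters"])
```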
Requirements
1. In addition to the physics data (which may not be present),
each EDO (event data object) must contain or provide access to
essential history information.
This essential history information includes:
- Parent EDO's
- Other parent data objects (e.g. calibration or alignment data)
- The algorithm which produced the EDO
- Type
- Version (if not specified by the release)
- Run time parameters
- Release version
- Relevant run time environment (OS, OS version, shared library versions)
With the exception of the parent EDO's, this history information may be
common to (and shared by) EDO's from many events.
2. The above information must be sufficient to enable exact reproduction
of the data. If not, it should be expanded.
3. In addition to the essential information above, the history may
include the following nonessential information:
- event identifier (tells to which crossing this data belongs)
- time stamp
- computer identifier
- reference to a description of the job that produced it
- CPU time consumed (and other system resources?)
- algorithm return status
- checksum to verify the data
This information is not required to reproduce the data. Much of it will
change if the data are regenerated.
4. The history information may still be present even if the data has
been discarded or is stored in another place.
5. There must be a way to index EDO's (and other data objects) so that a
child EDO can reference its parents (the first piece of history information
in item 1).
6. These indices should span files, federations, database technologies and
geographical locations so that an EDO can be copied without carrying
along its ancestry (EDO's and other data objects).
7. EDO's (and other data objects) can be replicated. The copies have exactly
the same data, essential history and index. Some of the optional history
data may differ.
8. The uniqueness of the indices and reproducibility of the data guarantee
that two data objects with the same index can be freely interchanged.
Each will have the same (or equivalent) data and essential history.
9. Replicated or regenerated EDO's may carry extra information reflecting
their true origin.
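
The split between essential and nonessential history in items 1, 3, 7
and 8 suggests a simple consistency check for replicas, sketched below.
The class and field names and all values are hypothetical. The check
ignores the optional history, which may legitimately differ between
copies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EssentialHistory:
    parents: tuple             # indices of parent EDO's and other data objects
    algorithm_type: str
    algorithm_version: str
    parameters: tuple          # (name, value) pairs
    release: str
    environment: str


@dataclass
class EDORecord:
    index: str
    essential: EssentialHistory
    optional: dict = field(default_factory=dict)   # time stamp, host, ...


def interchangeable(a, b):
    """Same index and same essential history; optional history is ignored."""
    return a.index == b.index and a.essential == b.essential


essential = EssentialHistory(("clu-1",), "TrackFitter", "1.0",
                             (("max_chi2", 10.0),), "rel-A", "os-A")
original = EDORecord("trk-1", essential, {"host": "site-a"})
replica = EDORecord("trk-1", essential, {"host": "site-b"})
other = EDORecord("trk-2", essential)

print(interchangeable(original, replica), interchangeable(original, other))
```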
Design issues
Object identifiers
There must be a persistent way to label event data objects so that
we can record parents.
Object identification can be done via global
identifiers.
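
As one possible realization, the sketch below assigns each data object
a random UUID at creation and lets a replica keep the identifier of the
original. The use of UUIDs is an assumption, not a choice made in this
document. The class DataObject and the function replicate are
hypothetical.

```python
import copy
import uuid
from dataclasses import dataclass, field


@dataclass
class DataObject:
    payload: list
    parents: tuple = ()     # identifiers of the parent objects
    oid: str = field(default_factory=lambda: str(uuid.uuid4()))


def replicate(obj):
    """Copy an object; the replica keeps the identifier of the original."""
    return copy.deepcopy(obj)


clusters = DataObject(payload=[1.2, 3.4])
tracks = DataObject(payload=[0.7], parents=(clusters.oid,))
replica = replicate(clusters)

# The parent reference held by the tracks is satisfied by either copy.
print(tracks.parents[0] == replica.oid)
```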
Distributing history information
Clearly most of the history data is shared by more than one
event data object. The data should be distributed over histories of
different types which are shared to minimize duplication.
Here is a possible structure:
Job history (created at the beginning of the job)
- Release version
- Relevant runtime environment (OS, shared libs and their versions)
- CPU identifier
- Start time
Algorithm history (created from algorithm after initialization but
before event processing)
- algorithm type
- algorithm name or identifier
- algorithm version
- algorithm properties
- subalgorithm histories
Data history (created for each EDO produced by an algorithm)
- job history
- algorithm history
- parent EDO's
- global data identifiers (calibration, alignment, ...)
- event identifier
- algorithm start, stop and CPU time
- algorithm return status
- data checksum
Algorithm and job histories are associated with the data history and
so the latter can be used to access all history data.
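
The structure above could be expressed with three record types, as in
the sketch below. Field names follow the lists above where possible.
The Python types, the default values and the example values are
assumptions, and the timing and checksum fields are omitted for
brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobHistory:
    release: str
    environment: str
    cpu_id: str
    start_time: str


@dataclass(frozen=True)
class AlgorithmHistory:
    algorithm_type: str
    name: str
    version: str
    properties: tuple = ()       # (name, value) pairs
    subalgorithms: tuple = ()    # AlgorithmHistory objects


@dataclass
class DataHistory:
    job: JobHistory              # shared by all data from one job
    algorithm: AlgorithmHistory  # shared by all data from one algorithm
    parents: tuple               # indices of the parent EDO's
    event_id: int
    global_data: tuple = ()      # calibration, alignment, ...
    return_status: int = 0


job = JobHistory("rel-A", "os-A", "node-1", "start-time-A")
fitter = AlgorithmHistory("TrackFitter", "fit1", "1.0", (("max_chi2", 10.0),))

# One data history per produced EDO; job and algorithm histories are shared.
histories = [DataHistory(job, fitter, ("clu-%d" % n,), n) for n in range(3)]
print(all(h.job is job and h.algorithm is fitter for h in histories))
```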
Transient interface
History information must be persistent, i.e. we must be able to
recover the history of an event data object taken from the persistent
store. However we would like to construct persistent history data
from transient objects (algorithms, data objects, ...) and hide
the persistency from the user.
Historian
Job, algorithm and data histories are created whenever jobs are run
to produce data. The class Historian provides a convenient interface
for creating, accessing and managing all these history objects.
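
A minimal sketch of such an interface follows. Only the class name
Historian comes from this document. The method names, their arguments
and the dictionary layout of the history records are assumptions.

```python
class Historian:
    """Creates and hands out job, algorithm and data histories (sketch)."""

    def __init__(self, release, environment):
        # Job history: created once, at the beginning of the job.
        self._job = {"release": release, "environment": environment}
        self._algorithms = {}
        self._data = {}

    def job_history(self):
        return self._job

    def algorithm_history(self, name, version, properties):
        """Create the history for an algorithm once, then reuse it."""
        if name not in self._algorithms:
            self._algorithms[name] = {"name": name, "version": version,
                                      "properties": dict(properties)}
        return self._algorithms[name]

    def record(self, index, algorithm_name, parents):
        """Create the data history for one produced EDO."""
        self._data[index] = {"job": self._job,
                             "algorithm": self._algorithms[algorithm_name],
                             "parents": tuple(parents)}
        return self._data[index]

    def data_history(self, index):
        return self._data[index]


historian = Historian("rel-A", "os-A")
historian.algorithm_history("fit1", "1.0", {"max_chi2": 10.0})
historian.record("trk-1", "fit1", ["clu-1"])

history = historian.data_history("trk-1")
print(history["algorithm"]["name"], history["parents"])
```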
dladams@bnl.gov