Message-ID: <3CB7539A.3090707@bnl.gov>
Date: Fri, 12 Apr 2002 17:37:30 -0400
From: David Adams
To: RD Schaffer
CC: Atlas Database Group
Subject: Re: Comments on the HES document

RD:

Thanks for your comments. I will try to quickly respond to your points.
We discussed some of these on Wednesday, so I have had time to think over
some of the issues. Comments below.

da

RD Schaffer wrote:

> Hi David,
>
> Better late than never. Sorry for the delay. I include below some
> general comments followed by some detailed comments. The detailed comments
> section is not complete. I will try to complete it over the weekend.
>
> see you, RD
>
> Email: R.D.Schaffer@cern.ch
> Address: LAL BAT 200 tel(Orsay): 33-1 64 46 8378
> BP 34 tel(cern) : 41-22 76 71267
> 91898 ORSAY
> France
>
>
>
> There are comments on the Hybrid Event Store document Version 1.5.3.
>
> 12 April 2002 RD Schaffer
>
> -----------------
> General comments:
> -----------------
>
> ------------------------------
> Event collections and datasets.
> ------------------------------
>
> The DB architecture document and many earlier discussions/documents
> have had event collections and iteration/selection over events in a
> collection as basic concepts in the model, with event collections
> being explicit components in the model. In the HES document, the term
> "dataset" appears to be synonymous with the concept of event
> collection. I would suggest replacing the term "dataset" with "event
> collection".

I am not sure the concepts are exactly the same.
Both dataset and event collection refer to a well-defined set of event
ID's, but a dataset also specifies which placement categories are
included (implying that others may be excluded). I am not sure this is
true for an event collection.

>
> A second point concerns the "role" of event collections
> (i.e. dataset). In the architecture document, the event collection is
> the "starting point" for access to events (see e.g.
> http://documents.cern.ch/cgi-bin/setlink?base=agenda&categ=a01560&id=a01560s5t1/transparencies
> slides 10-13 or the doc itself,
> http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/event_store/ev-arch-index.html).
> Event collections contain an "index" which is some "externalizable
> reference" to an event. Thus one normally identifies an event
> collection of interest and then iterates over its "indices",
> navigating to the events. The emphasis is on collections independent
> of whether the different components of each event lies in one or more
> files. I guess it is just not clear to me from the HES document that
> event collections play this central role.

We do not have an event or an event index, but a dataset (or a subset of
the event ID's in a dataset) is expected to be the typical user input
for reconstruction or analysis. We expect that the dataset would be used
to construct an input stream and that Athena would effectively use this
stream to iterate over events.

>
> -----------------
> Stream definition:
> -----------------
>
> Related to the above discussion is the definition of a stream. I
> would expect that the primary definition of a "stream" is that it is
> associated with a sequence of algorithms in the Athena/Gaudi sense
> where event filtering/selection may have been applied. There is
> always a corresponding event collection associated with a stream. One
> of course must specify which objects (EDOs) are written to each
> stream. It appears the HES document maps a stream to a set of
> placement categories to a single file for any single event.
> I would rather expect that a single stream can map EDOs to one or MORE files
> for any single event, via the placement categories. So perhaps the
> HES stream definition is a "lower level" definition than I would
> apply.

I agree that we may be using the word stream in different ways. For HES
it is important to distinguish between input and output streams, but they
do have some common properties. For either input or output, there is a
named stream type which specifies the PC's (placement categories) that
are present in the stream. I expect an output stream would have one file
for any given event, but that events can be distributed over many files
by having multiple output streams with different PC's diverted to
different streams. I agree this is probably too restrictive for input,
and I propose we now allow an input stream to take data for one event
from different files. Each of these files is also an input stream, so we
are saying that input streams can be combined to form a new input
stream. I believe this provides the functionality you describe. I agree
the HES stream is a lower-level concept.

>
> -------------------------------------
> Sharing between input/output streams:
> -------------------------------------
>
> Again related to the above, all output event collections should
> contain by reference or by value all of the objects in the input
> collection. Thus, for example, given a event collection resulting
> from the reconstruction of geant-simulated event, one should be able
> to access the digits or the generator tracks/vertices, whether or not
> they are contained by value in the recon output stream. I am not
> saying that this is not possible given what is in the HES design. But
> I would think that we would want this to be explicitly part of the
> default design.

You are correct. We do not require this. A dataset is a view of the
event that does not necessarily contain all the pieces used in its
construction.
Note, however, that referenced pieces may still be accessible through
the references even though they are not part of the directly visible
data that defines the event. This may be an important difference.

>
> ---------------------------------------------------
> Event headers and sharing and placement categories:
> ---------------------------------------------------
>
> I think that I would like to have more discussion on event headers
> and sharing and placement categories. As far as I understand, there
> is no event header in the HES document. So the "externalizable
> reference" to an event mentioned above points to what exactly? Why
> not an event header? An event header could also be the means of
> "sharing" data objects, e.g. the events in an output stream can
> maintain a reference to the event header in the input stream. If a
> data object is not found in the output stream event, the input stream
> event can be checked for data object.

Yes, there is no "event" and so no event header. The user simply asks
Athena for the next event, and the type-keys that are included in the
event view presented by the input stream appear in the transient data
store. I suppose an event header could be constructed from the list of
type-keys, but I believe this is a transient concept and is not needed
for interaction with the persistent store. Again, this may be an
important point.

>
>
> ---------------------
> Placement categories:
> ---------------------
>
> If placement categories continues to play the role of both sharing
> categories and placement categories, a better name may be needed. In
> most cases doesn't the term refer to "sharing" rather than
> "placement"? Should look at this in more detail.

As I've said in the past, sharing might be a better description of their
functionality, but placement also has some relevance in that all EDO's
in a PC are placed in the same file(s).
At the risk of making things even less clear, I will note that the PC's
may be closer to sub-event headers. Each specifies a subset of the EDO's
that make up the view of the event. We may have elevated PC's to a level
higher than originally intended for either placement or sharing
categories.

>
>
> ----------
> Store view:
> ----------
>
> I am not sure if I have a good understanding of what this really
> is?

The Store view is the interface to the persistent data store. It uses
input streams to define the event ID loop and to define what data is
visible for each event. It uses output streams and a placement policy to
determine what data is written out.

>
>
> ------------------------------
> Specific comments on the text:
> ------------------------------
>
> Here are some detailed comments/questions on the text. I copy here
> the text inside "" and provide comments in <<>>.
>
> NOTE: I have only included here comments on defintions. I will include
> comments on the rest of the text in a subsequent version.
>
> 1)
>
> "1. Introduction
> The architecture of the ATLAS event store is described in a separate
> document[2]. Here we present a realization of that architecture in
> which a relational database is used to keep track of files that store
> and provide access to the objects that make up the event data. This is
> called the hybrid event store. "
>
> << Is the rdb intended to only keep track of files? >>

No. But this is its most important function. The files are
self-describing, but references between files are expressed in terms of
file ID's, and the RDB is needed to locate the referenced files. It is
not yet clear whether dataset definitions will primarily exist in files
or in the RDB. And of course later sections detail how RDB's are used to
keep stream and PC type definitions and to catalog some of the data
inside files.

>
>
> 2)
>
> "Event
>
> The word event is used in many ways.
> It can refer to a beam crossing
> or to a subset data associated with a particular crossing. The latter
> might include the data in a file or that visible during event
> processing."
>
> << The meaning of the last sentence is not clear. >>

The point is that the word event is used in many ways. The sentence
means that one user might use event to mean all the data associated with
an event ID in a particular file while another might mean a subset or
might append data from another file.

>
>
> 3)
>
> "Global data
> We use the term ..."
>
> << I would not introduce a new term here. How about using Detector
> Description data, where the time-varying part is Conditions data. See
> elsewhere is it used. >>
>
> 4)
>
>
> "Reconstruction
>
> The reconstruction of event data is the process of using existing data
> from a beam crossing to produce new data associated with that same
> beam crossing. For example raw data may be used to produce clusters
> that are used to find tracks that are used to create electrons. Global
> data may also be used as input in some stages of reconstruction."
>
> << Sorry, I don't like your def. How about:
>
> This start from the raw data and uses the detector description to
> correct for detector response and positions building candidates for
> the particles of a beam crossing which traverse the detector. For
> example raw data may be used to produce clusters that are used to find
> tracks that are used to create electrons. detector description is
> required to perform some stages of reconstruction. >>
>
>
> 5)
>
>
> "Algorithm
> Reconstruction is carried out in a series of well-defined steps."
> <>
>
> 6)
>
> "Algorithm instance
>
> The algorithm is characterized by type, version, release version,
> run-time parameters and name. A unique combination of these is called
> an algorithm instance. This is sometimes abbreviated as algorithm."
>
> < run-time parameters and name.
> An algorithm which runs in a particular
> job has a unique combination of these and is called an algorithm
> instance. This is sometimes abbreviated as algorithm.>>
>
>
> 7)
>
> "Event data object (EDO)
>
> In our object-oriented view of the world it is natural to describe the
> data from a particular beam crossing as a collection of
> objects. Rather than attempt to keep track of low-level objects such
> as individual tracks or electrons, relational databases are used to
> keep track of files that, in turn, manage collections of these
> low-level objects. We refer to these collections as event data objects or
> EDO's. These are called DataObjects in StoreGate."
>
> << How about: An EDO is the smallest granularity of data which is
> storable as a unit and which can be externally referenced. Generally,
> EDO's are collections of more basic objects, such as tracks. EDO's are
> synonymous with DataObject in Athena. >>
>
>
> 8)
>
> "Referenced EDO
>
> In the context of an EDO, a referenced EDO is another (or the same)
> EDO that contains objects that are referenced from object in the first
> EDO. A referenced EDO may be the current one, a parent or one from any
> previous generation."
>
> << I would be explicit about the direction of refs: This means that
> EDO's can only be referenced by subsequently produced EDO's. (perhaps
> this should be said elsewhere?) >>
>
> 9)
>
> "EDO identifier
>
> We require that objects in one EDO be able to reference those in
> another, e.g. a track points to its clusters or an electron points to
> its EM cluster and track. These references must persist even if the
> two EDO's are in different files and if either or both files or EDO's
> are replicated. These references are implemented by specifying an
> identifier for the referenced EDO and an index for the object within
> the EDO."
>
> << This appears to be describing more than just EDO id's - change
> title?
> >>
>
> 10)
>
> "Production thread
>
> A production thread specifies a series of algorithm instances within a
> production environment. The parameters of these algorithms specify the
> EDO type-keys used as input and output. The thread also specifies
> collections of input and output stream types, a selection algorithm
> for each output stream and placement policy for the output data. The
> placement policy is used to determine whether data is written by value
> or reference and to assign an owner for data that is written by value
> to more than one file.
>
> The algorithms in a production thread transform one collection of EDO
> type-keys into another. The thread transforms input streams into
> output streams."
>
> << this def is not just proper to the DB domain. Is this an accepted
> computing model def? E.g. "The parameters of these algorithms specify
> the EDO type-keys used as input and output." >>
>
> 11)
>
> "Store view
>
> The event ID list and input and output streams used to define a job
> define a view of the event store (or views of the event stores). The
> store view is the interface between the event-processing framework
> (Athena/StoreGate in ATLAS) and the persistent event store."
>
> << Not too clear to me on the first reading. >>
>
>
> 12)
>
> "Regenerated data
>
> It may be desirable in some cases to regenerate data because the
> original data is lost or difficult to access. Regeneration is done at
> the EDO level. The regenerated EDO's are assigned new ID's but also
> carry the original ID to indicate their origin."
>
> << This means that ids are saved elsewhere if an EDO is lost? Held by
> a placement category object? Not too sure of the use case here. >>
>
>
> 13)
>
> "FILE
> ...
>
> A placement category in a file may be replaced by a reference to
> another file where the placement category may be found. In any file
> except its owner, an EDO may be replaced with a reference that holds
> an EDO identifier.
> There is an important difference between placement
> category and EDO references: a placement category reference can only
> be satisfied in the referenced file. An EDO reference can be satisfied
> by the original version of the EDO or any replica."
>
> << I'm not sure I understand what's going on here and the importance
> of this statement?? >>
>
>
> 14)
>
> "File ID
>
> Each logical file is assigned an identifier called the file ID so that
> it can be referenced from other files. . There is a one-to-one
> mapping between file ID's and logical file names."
>
> << Why not call this the logical file id to be clearer. It refers to
> the logical name and not the physical file. >>
>
> 15)
>
> "Dataset"
>
> << rename event collection >>
>
>
>

Some (all?) of your detailed points are valid, but most are minor. I
would like to understand the major issues and clarify the future of the
document before addressing them in a rewrite. Let me know if there are
points to which I should respond in advance of rewriting the document.

Thanks again.

da

--
David Adams                  desk:  631-344-6049
Brookhaven National Lab      fax:   631-344-5078
PAS group, Building 510A     email: dladams@bnl.gov
Upton, NY 11973-5000         http://www.usatlas.bnl.gov/~dladams