Date: Mon, 25 Mar 2002 14:22:29 -0500
From: David Adams
To: Luc GOOSSENS
CC: Atlas - Database Group
Subject: Re: HES store comments

My responses are included below. I hope we are converging at least on a
set of issues for the upcoming meeting. You raise some important points.
Thanks for your input.

da

Luc GOOSSENS wrote:
> Hi David,
>
> Thanks for your reaction, here are my comments to your comments on my
> comments.
>
> David Adams wrote:
>
>>Here are some off-the-cuff responses indexed by slide number:
>>
>>3. We left the class definitions out of the information that a file is
>>required to carry. This was a deliberate but not widely discussed
>>choice. HES supports multiple file formats and leaves it up to them
>>whether or not to include this information.
>>
>>We tried to stay away from the transient world. The job of HES is to
>>locate persistent objects. The individual file formats take care of
>>conversion to and from transient representation.
>>
>
> Can you explain then a bit more what "traverse" means to you, and what
> consultation of the file catalog could be involved with it?
>
>
I don't see the word traverse and I'm not sure what you are asking. I
can imagine high-level input streams whose files consist mostly or
entirely of references, and which use catalogs to locate replicas of the
referenced EDO's.
>>5. Of course we added an integer index because we wanted a compact way
>>to identify files but you may be right that the saving in space is not
>>worth maintaining a global mapping.
>>My experience is that string identifiers for files can get very long as
>>people try to pack more and more information into them.
>>
>>This is a point to reconsider.
>>
>
> Note that it's not the global map I am worried about; this will exist
> anyway, mapping LFNs to PFNs. However, LFNs are *hierarchical* strings
> allowing the uniqueness constraint to be managed at several hierarchical
> levels. Introducing a unique number shortcuts this completely,
> re-introducing the need for global management.
>
>
This is a pervasive problem. Compact representation of an identifier
implies central management of identifiers to avoid wasting values.
Hierarchical strings can almost eliminate this management but can be
quite large. For EDO identifiers, we went the other way and use a
hierarchical scheme. For the file identifiers, we are somewhere in the
middle. There are enough extra bits that we can assign ranges of values
to production centers and require only a small level of central
management. Again, it is straightforward to move away from the integer
identifiers if they are felt to be a large burden.
>>7. Including the file ID in the EDO ID makes it easy to guarantee the
>>latter is unique as well as providing a pointer to the location of the
>>original version of the object.
>>
>>The EDO ID serves as the EDO reference.
>>
>
> I guess this is more a difference in taste of terminology.
> However, am I right in thinking that if I copy an EDO replica "by
> reference" it will contain a reference to the file of the original? I
> think it makes more sense to refer to the file of the replica (or to
> both).
>
Yes, a reference to a replica is the same as a reference to the original
EDO. It is just the original EDO identifier.
>
>>I'm not sure what you mean by your last point. We know an EDO is a
>>replica if the file ID in its EDO ID differs from the ID of the file
>>that holds it. Note that a replica may be bitwise identical to the
>>original if the two file formats are the same.
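To make the identifier scheme discussed above concrete, here is a
minimal sketch. Everything in it is an illustrative assumption rather
than something taken from the HES document: the bit widths, the names
make_file_id, EdoId and is_replica, and the idea that an EDO ID is a
(file ID, index) pair. It shows only how a range of file IDs could be
reserved per production center and how the replica test from item 7
could be expressed.

```python
from dataclasses import dataclass

# Assumed layout: high bits name the production center, low bits are a
# serial number that each center hands out without central coordination.
CENTER_BITS = 8
SERIAL_BITS = 24


def make_file_id(center: int, serial: int) -> int:
    """Combine a centrally assigned center number with a local serial."""
    if not 0 <= center < (1 << CENTER_BITS):
        raise ValueError("center number out of range")
    if not 0 <= serial < (1 << SERIAL_BITS):
        raise ValueError("serial number out of range")
    return (center << SERIAL_BITS) | serial


@dataclass(frozen=True)
class EdoId:
    """Hypothetical EDO ID: original file ID plus an index in that file."""
    file_id: int
    index: int


def is_replica(edo: EdoId, holding_file_id: int) -> bool:
    """An EDO is a replica if it is held by a file other than its origin."""
    return edo.file_id != holding_file_id
```

In this sketch only the center numbers need central management; each
center is free to assign serial numbers within its own range.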
>>
>
> It was more a general question. How does HES assure that replicas are
> really replicas? Obviously it can probably not prevent malicious false
> replicas, but what can it do to prevent unintentional false
> replication?
>
>
There is an implication that a replica is equivalent to the original.
Equivalence is defined by the application and HES does nothing to check
or enforce it. Clearly we would want to test this equivalence, but that
testing is not part of HES. We have tried to define object references so
that the replica of an EDO is a bitwise copy of the original, to help
ensure this equivalence and to simplify the process of replication.
>>10. Again I'm not sure what you mean. It is a requirement that we be
>>able to point to an object inside an EDO. HES takes the responsibility
>>for defining this mechanism so that it is possible to point to objects
>>in a different file format.
>>
>
> If the resolution of inside-EDO pointers was left to the transient world
> altogether, HES would not have to worry about it at all. Note that HES
> cannot do anything with the part of the reference beyond the EDO
> pointer anyway.
>
>
We have left it up to the transient world to dereference object (meaning
inside-EDO) pointers. We define a class and its representation for the
pointers so that different file formats will have a common
implementation. This enables cross-format pointers.
>>11. At least you recognize it as such. I hope this will be clearer in
>>the next rewrite.
>>
>>12. Event is a surprisingly subtle concept. One could define all the data
>>within a file associated with one event ID to be an event. But this is
>>just a collection of placement categories and we didn't see the need to
>>make it explicit because of the likelihood of misinterpretation.
>>
>
> The combination of an event ID and a set of type-keys defines an event
> view/projection.
> I see two issues here.
> 1. Is there NO use case to refer to complete events?
> Note that "complete" event is a floating definition. If type-keys are
> added, events grow.
>
If you want all the data associated with an event ID, you can go to the
file catalogs and find all files with data for that event. But it is
unlikely that this sample would be useful because of varying production
histories. It is also likely there will be type-key duplicates.
> 2. Having this floating definition of event is a bit tricky BUT do you
> really think we'll have static definitions of placement categories?
> This means that whenever you want to add something, you have to think of
> a new name.
> This of course opens up a can of worms. On the one hand, if the
> definition of PC foo changes, older datasets for which foo was saved
> will not contain the extensions. On the other hand, I don't want to
> think of a new name for my PC each time a new item needs to be added.
>
>
I started with fixed global PC type definitions but it became clear that
this was likely too restrictive. This is one reason files carry the
definitions of their PC types. I imagine definitions will change when it
is felt that new and old data can still be usefully combined. For
catalogs it may be useful to introduce PC type aliases or versions to
keep track of such changes.
>>13. Our placement categories are essentially an implementation of the
>>sharing categories and, to a lesser extent, the placement categories
>>introduced in the ADM (architecture DB model). I don't feel we have the
>>freedom to drop them unless the notions of sharing and placement
>>categories are dropped from the ADM.
>>
>
> That may happen, but I am not holding my breath. :-)
>
>
>>That said, personally I have become used to the idea of sharing
>>categories and would be reluctant to drop them. I think it may be very
>>useful to have a level of sharing that is between that on the EDO and
>>the event.
>>
>
> What do you mean with "a level of sharing that is between that on the
> EDO and event level"?
> What I proposed to the DBA team is to view the PC and SC as symbolic
> names for collections of type-keys and *nothing more*. This means that
> it is a convenience to have them, it is not essential.
>
>
Again, if you define PC's that allow all types, you get exactly the
behavior you request (assuming you are willing to make these definitions
at the time the file is written). The output stream, not the type-key,
determines which type-keys go to which PC's. Restricting the type-keys
associated with a PC provides a check at output time and, more
importantly, can speed up the search for data of a particular type-key
by restricting the candidate PC's. Most important, it is a hint to the
user as to what type of data is likely to be present in the PC.
>>A file writer can largely ignore placement categories by creating a
>>single category that allows all types and keys. Readers can ignore the
>>categories by concatenating the contents of all categories. It would
>>probably be useful to add a method to the file interface that does the
>>latter.
>>
>>Yes, I think you understand (but I should probably make clearer) that a
>>placement category is not required to hold all the type-keys that its
>>type allows.
>>
>
> That's a pretty strange statement. I guess this means one can have PC
> instances being a subset of the type-key set defined in the PC
> definition. Isn't this taking things a bit too far? In the end what
> matters is what type-keys are present for each event and nothing more.
> What's the point of saying that you support set PC1 if in the end you
> have to additionally say which subset {tk1 tk3 tk7} of it, AND at the
> end of the day even that spec is an upper bound.
>
>
Personally, I like the idea of requiring that all type-keys be present
but I was voted down. I can understand how some might feel this is too
restrictive.
>>23.
>>While it is essential to have catalogs, I agree that they certainly
>>can and should be kept distinct from the file handling pieces that make
>>up HES proper. It is important to have a rudimentary view of the
>>catalogs because that defines what data must be held by the
>>"self-describing" files in HES and defines how HES files reference data
>>that is stored in the catalogs (e.g. JobOptions).
>>
>>24. I think the concept of stream type is important. Physics analyses
>>will be done on collections of files (datasets) for which the data
>>should all be in the same state. The stream type is a way of labeling
>>this state. Even if we dropped placement categories, it would be useful
>>to keep the notion of a stream type.
>>
>
> See my remarks above. What does "common state" mean? A shared upper
> bound for the type-key set?
>
>
Common state means the same set of PC names. It is up to the production
coordinators to assign and use the PC names in a consistent manner.
>>26. Once you have stream types, input streams have to assign these types
>>to files.
>>
>>31. I agree it would be okay for two EDO's to have the same type-key as
>>long as they have the same EDO ID.
>>
>>general 1:
>>I think we agree that the usual unit of analysis is not the single file
>>but is a collection of files in a common state. I call this a dataset.
>>This is discussed in the HES document but not the talk. Input and output
>>streams will generally correspond to datasets rather than single files.
>>
>>general 2:
>>Now you are moving back into the world of cataloging. The first versions
>>of input streams will only handle single files or small collections of
>>files. I do imagine later versions could support large numbers of files
>>with catalogs to locate desired EDO's. Or the appropriate files might be
>>retrieved before the job is run. In any case I believe replica
>>management can be layered on top of the HES file interface that we
>>propose.
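On the "common state" point under item 24 above, the definition (the
same set of PC names in every file of a dataset) can be sketched as a
simple check. This is only an illustration under assumptions: the
mapping from file name to PC names is a stand-in for whatever the files
or catalogs would actually provide, and the names stream_type and
common_state are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Optional


def stream_type(pc_names: Iterable[str]) -> FrozenSet[str]:
    """Treat a stream type as the (unordered) set of PC names in a file."""
    return frozenset(pc_names)


def common_state(files: Dict[str, Iterable[str]]) -> Optional[FrozenSet[str]]:
    """Return the stream type shared by all files in a dataset.

    `files` maps a file name to the PC names that file holds. The result
    is None if the dataset is empty or the files disagree.
    """
    types = {stream_type(pcs) for pcs in files.values()}
    if len(types) != 1:
        return None
    return next(iter(types))
```

Note that this compares PC names only. Consistent use of those names is
still left to the production coordinators, as stated above.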
>>
>>
>
> Cheers,
>
> Luc
>

--
David Adams                  desk: 631-344-6049
Brookhaven National Lab      fax: 631-344-5078
PAS group, Building 510A     email: dladams@bnl.gov
Upton, NY 11973-5000         http://www.usatlas.bnl.gov/~dladams
____________________________________________________________________
This mail has been sent to everyone on the atlas-database list
____________________________________________________________________