Dear All,

as promised, I have read D. Adams's paper carefully, and here are my comments. Before diving into them, let me express my appreciation for David's attempt to define such a complex concept. All that follows should be taken as a constructive attempt to contribute my 0.02c to the discussion.

By attempting to define a very high-level concept, this paper is in effect an attempt to describe part of a grid architecture, at least the architecture of the data management system, even if it also ventures into job splitting, provenance and so on. It is well known that the most effective way to do this is to layer the functionality, building new and more complex functionality on top of lower-level functions. In my opinion this is where the paper fails, and this is why the discussion is becoming endless. The concept of dataset as described in the paper encompasses everything from bytes-on-disk and records to meta-metadata and externalisable pointers. There is no way to come to a reasonable conclusion, as the number of variables is too large. For instance, allowing a dataset to directly contain information about the CPU nodes is like requiring that a UML diagram contain the mapping between variables and CPU registers!

What the paper (and iVDGL) calls a Dataset is, in HEPCAL terminology, a collection of datasets, so these are two different things. The concept is interesting and certainly useful, but only if the design is properly layered. To avoid confusion, I will call the iVDGL Dataset a Datacollection from here on. The name is so bad that nobody can think I am seriously proposing it, and in fact I am not, but in this context it will suffice.

My proposal is to build on the HEPCAL definition of Datasets and to concentrate on the definition of Datacollections. In this sense a Datacollection will always be logical, because it will only reference Logical Dataset names.
The system may use a physical or even a staged Datacollection, but this is of no concern to anybody, because the system will obtain it by recursively dereferencing the information contained in the Logical Dataset names, in a completely transparent way. I do not think that a paper describing Datacollections should ever mention records or files. These are just low-level concepts that are handled by lower-level layers.

Also, I would not confuse provenance and materialisation information. Provenance is where something comes from, so it is an account of the past. For me, a Datacollection is virtual if the list of datasets and pointers does not exist but there is a recipe to create it, which I would call materialisation information. Of course, a Datacollection with at least one virtual Dataset in it is virtual. However, we do not have to worry about how the virtual Dataset is handled or materialised, because this belongs to the Dataset layer.

I think that if these suggestions are adopted, the paper on Datacollections (or whatever you want to call them) will become much easier to write and much less controversial, as the focus will be better defined. In this sense I would welcome the introduction of such a concept in HEPCAL-II, and this would not require any change to the dataset definition in HEPCAL.

I have tried to make my suggestions in the paper itself, which you can find at http://cern.ch/fca/gds_fca.doc

Let me know what you think.

Best regards,
Fed
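
P.S. The layering proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a design taken from the paper or from HEPCAL: the class name, the field names, the `catalogue` dictionary standing in for the Dataset layer, and the example dataset names are all hypothetical. The point it tries to show is that a Datacollection holds nothing but Logical Dataset names (or a recipe to produce them), and that whether a Dataset is virtual is a question answered by the layer below.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the Dataset layer: logical dataset name -> "is virtual?".
# In a real system the Datacollection layer would ask the Dataset layer instead.
DatasetCatalogue = Dict[str, bool]


@dataclass
class Datacollection:
    name: str
    # Logical Dataset names only: no files, no records, no physical locations.
    # None means the list does not exist (yet).
    dataset_names: Optional[List[str]] = None
    # Materialisation information: a recipe that creates the list of names.
    recipe: Optional[Callable[[], List[str]]] = None

    def is_virtual(self, catalogue: DatasetCatalogue) -> bool:
        """Virtual if the list does not exist but a recipe does,
        or if at least one referenced Dataset is itself virtual."""
        if self.dataset_names is None:
            if self.recipe is None:
                raise ValueError("Datacollection has neither a list nor a recipe")
            return True
        return any(catalogue[n] for n in self.dataset_names)

    def materialise(self) -> List[str]:
        """Create the list of Logical Dataset names from the recipe if needed.
        Materialising the Datasets themselves is left to the Dataset layer."""
        if self.dataset_names is None:
            if self.recipe is None:
                raise ValueError("Datacollection has neither a list nor a recipe")
            self.dataset_names = self.recipe()
        return self.dataset_names
```

Note that provenance does not appear in the sketch at all: the recipe only says how the list can be created, which is a different thing from an account of where it came from.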