From David Adams
To Federico Carminati
September 8, 2003

Federico:

Thanks for your comments. When I read the HEPCAL documents, I had the impression that we were using the term dataset in a nearly consistent manner, but now I am less sure.

I do not see the point of introducing another level of data management, your Datacollection. My basic premise is that any collection of data that is deemed to belong together may be declared as such by calling it a dataset. A user wishing to carry out an analysis would normally begin by selecting a dataset for study. Perhaps a second dataset might be used for background or simulation.

I recognize there is some ambiguity as to whether the term dataset refers to the data itself or to the properties that I outline in my dataset document. These properties are those required for production and analysis and for recording the state to guarantee reproducibility. One can debate whether dataset refers to the data and/or some subset of properties. However, I infer from the term DataCollection that this is not your main objection.

A file is a collection of data that happen to be physically clustered and thus easily accessed collectively. A tape record and a database table also share these properties. Is this what you intend the term dataset to mean? If so, I believe we have a similar model: you speak of datasets and Datacollections where I speak of files (or tape records or ...) and datasets. I prefer to reserve dataset for the higher level concept.

The identification of various dataset categories (virtual, logical and physical) is an attempt to handle the many levels in the system. (I defer discussion of the controversial staged dataset.)

Perhaps an example will clarify the meaning of logical dataset. The data in a group of physical files is declared to be a dataset. Those files then constitute the location for a physical representation of the dataset.
If the files are assigned logical file names, then that set of logical names constitutes the location for a logical representation of the dataset. Replication may produce a different collection of physical files and thus a different physical dataset (or more precisely, another physical representation with a different location). Next, the data in the physical files may be shuffled and copied into a different set of files, e.g. with file sizes of 100 MB instead of GB. These new physical files define a third physical representation, and if they are put in a file catalog, their logical names are the basis for a second logical representation. Both logical datasets and all three physical datasets are equivalent, and it is useful to speak of all of them as being representations of the same virtual dataset.

I envision that a typical user would select a virtual dataset from a dataset metadata catalog and then submit it to a scheduler (workflow management system) for processing. The scheduler would interact with a dataset catalog (dataset catalog is to virtual dataset as file replica catalog is to logical file) to extract a physical dataset (if you prefer, a physical representation of the dataset). Or the scheduler might receive a logical dataset and then use a file catalog to find the corresponding physical files. The latter can then be used to define a physical dataset, which might be recorded for the next user.

Are we reduced to the point of only disagreeing on the name, Datacollection or dataset? Or am I still missing an important point?

da

P.S. A few minor points on your comments in the paper. Perhaps I was unclear.

1. Records were intended only as a non-HEP way of saying events. They have nothing to do with Zebra or tape records or.... Indeed, most of our data is event data and thus, by my definition, record-oriented.

2. On the subject of mapping, even if our data is randomly accessible, the events (aka records) will have a natural order in which they will be most rapidly processed.
It is important to keep this distinction, especially if the files are ever split for processing on more than one node.

--da
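
To make the virtual/logical/physical distinction concrete, here is a minimal sketch in Python of the scheduler lookup described above. It is only an illustration under stated assumptions: every name in it (PhysicalDataset, LogicalDataset, FileCatalog, DatasetCatalog, resolve) is hypothetical and does not correspond to any existing catalog or scheduler interface. A physical representation is modelled as a tuple of physical file names and a logical representation as a tuple of logical file names; picking the first replica of each logical file is an arbitrary simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogicalDataset:
    """A logical representation: its location is a set of logical file names."""
    virtual_id: str
    logical_files: Tuple[str, ...]


@dataclass(frozen=True)
class PhysicalDataset:
    """A physical representation: its location is a set of physical files."""
    virtual_id: str
    physical_files: Tuple[str, ...]


class FileCatalog:
    """Hypothetical file replica catalog: logical file name -> physical replicas."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[str]] = {}

    def register(self, lfn: str, pfn: str) -> None:
        self._replicas.setdefault(lfn, []).append(pfn)

    def replicas(self, lfn: str) -> List[str]:
        return list(self._replicas.get(lfn, []))


class DatasetCatalog:
    """Hypothetical dataset catalog: virtual dataset ID -> known representations."""

    def __init__(self) -> None:
        self._logical: Dict[str, List[LogicalDataset]] = {}
        self._physical: Dict[str, List[PhysicalDataset]] = {}

    def add_logical(self, ds: LogicalDataset) -> None:
        self._logical.setdefault(ds.virtual_id, []).append(ds)

    def add_physical(self, ds: PhysicalDataset) -> None:
        self._physical.setdefault(ds.virtual_id, []).append(ds)

    def logical(self, virtual_id: str) -> List[LogicalDataset]:
        return list(self._logical.get(virtual_id, []))

    def physical(self, virtual_id: str) -> List[PhysicalDataset]:
        return list(self._physical.get(virtual_id, []))


def resolve(virtual_id: str,
            datasets: DatasetCatalog,
            files: FileCatalog) -> PhysicalDataset:
    """Return a physical representation of a virtual dataset.

    First try the dataset catalog directly. Otherwise take a logical
    representation, look up each logical file in the file catalog, and
    record the resulting physical dataset for the next user.
    """
    known = datasets.physical(virtual_id)
    if known:
        return known[0]
    for logical in datasets.logical(virtual_id):
        pfns = []
        for lfn in logical.logical_files:
            found = files.replicas(lfn)
            if not found:
                break  # this logical representation cannot be fully resolved
            pfns.append(found[0])  # simplification: take the first replica
        else:
            physical = PhysicalDataset(virtual_id, tuple(pfns))
            datasets.add_physical(physical)
            return physical
    raise LookupError(f"no resolvable representation for {virtual_id!r}")
```

In this sketch a user would hand only the virtual dataset ID to the scheduler; whether the scheduler goes through the dataset catalog or through a logical dataset and the file catalog is hidden inside resolve.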