ATLAS dataset catalogs for DC1

D. Adams
April 8, 2004


Introduction

Here we describe an initial implementation of datasets for ATLAS using the dataset model developed as part of the DIAL project. The implementation has four components:
1. the dataset selection catalog (DSC),
2. the dataset replica catalog (DRC),
3. the dataset file catalog (DFC) and
4. the dataset database (DDB).

The DSC is the primary user interface to datasets and plays the role of what is often called a metadata catalog. It enables users to select a dataset based on its content, provenance and other metadata. This dataset is virtual; that is, it need not have a unique mapping to any particular collection of logical or physical files. The data may not even exist (still or yet).

Virtual datasets are identified both by name and by ID, and the mapping between these is found in the DSC. The dataset corresponding to a given ID is normally immutable (although nonvirtual representations may come and go). The mapping from name to ID may change. For example, as more data is acquired, a new dataset (with a new ID) might be formed by appending this data to an existing dataset and then reassigning the name from the old dataset to the new one. An analyzer referencing a dataset by ID can expect to always get the same result, while one referencing by name may see different results at different times. We expect most users to request datasets by name while a provenance system records IDs.
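The distinction between lookup by name and lookup by ID can be illustrated with a minimal in-memory sketch. This is not the DSC implementation; the class name, method names, dataset name and metadata keys are all hypothetical and chosen only for illustration.

```python
class DatasetSelectionCatalog:
    """Toy stand-in for the DSC (hypothetical; not the DIAL code)."""

    def __init__(self):
        self._id_by_name = {}      # mutable mapping: name -> current ID
        self._metadata_by_id = {}  # immutable entries: ID -> metadata

    def register(self, dataset_id, metadata):
        # A given ID always describes the same dataset, so refuse to overwrite.
        if dataset_id in self._metadata_by_id:
            raise ValueError(f"dataset ID {dataset_id} is already registered")
        self._metadata_by_id[dataset_id] = dict(metadata)

    def assign_name(self, name, dataset_id):
        # A name may be reassigned to a newer dataset at any time.
        if dataset_id not in self._metadata_by_id:
            raise KeyError(f"unknown dataset ID {dataset_id}")
        self._id_by_name[name] = dataset_id

    def id_for_name(self, name):
        return self._id_by_name[name]

    def metadata(self, dataset_id):
        return dict(self._metadata_by_id[dataset_id])


dsc = DatasetSelectionCatalog()
dsc.register(101, {"nfiles": 10})
dsc.assign_name("example.cbnt", 101)

# More data arrives: a new dataset is formed and the name is moved to it.
dsc.register(102, {"nfiles": 12})
dsc.assign_name("example.cbnt", 102)
```

In this sketch a user asking for "example.cbnt" now receives ID 102, while a provenance record that stored ID 101 still resolves to the original metadata.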

The DRC provides a mapping between a virtual dataset and one or more nonvirtual datasets, where the latter are associated with a collection of logical or physical files or some other prescription for locating the referenced data. This mapping of virtual to nonvirtual datasets is analogous to the mapping of logical to physical files in a file replica system such as RLS or Magda.
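The one-to-many mapping held by the DRC might be sketched as follows. The class, its methods and the numeric IDs are assumptions made for illustration and do not reflect the actual catalog interface.

```python
from collections import defaultdict


class DatasetReplicaCatalog:
    """Toy stand-in for the DRC (hypothetical; not the DIAL code)."""

    def __init__(self):
        # virtual dataset ID -> list of nonvirtual dataset IDs
        self._replicas = defaultdict(list)

    def add_replica(self, virtual_id, nonvirtual_id):
        if nonvirtual_id not in self._replicas[virtual_id]:
            self._replicas[virtual_id].append(nonvirtual_id)

    def remove_replica(self, virtual_id, nonvirtual_id):
        # Nonvirtual representations may come and go.
        self._replicas[virtual_id].remove(nonvirtual_id)

    def replicas(self, virtual_id):
        # Return a copy; an unknown virtual ID simply has no replicas.
        return list(self._replicas.get(virtual_id, []))


drc = DatasetReplicaCatalog()
drc.add_replica(102, 2001)
drc.add_replica(102, 2002)
drc.remove_replica(102, 2001)
```

The virtual dataset 102 keeps its identity throughout, even though the set of nonvirtual datasets representing it changes.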

The DDB is a repository of dataset objects indexed by ID. At present, the datasets have an XML representation and are stored as files but it would also be possible to store these objects in another manner such as an XML database.
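As a rough illustration of a file-based repository indexed by ID, the sketch below writes one XML file per dataset and reads it back. The element and attribute names (dataset, file, id, lfn), the file naming scheme and the helper functions are invented for this example; the actual DIAL XML representation is not shown here and may differ.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def store_dataset(repo_dir, dataset_id, logical_files):
    """Write a dataset as <repo_dir>/<ID>.xml (hypothetical layout)."""
    root = ET.Element("dataset", {"id": str(dataset_id)})
    for lfn in logical_files:
        ET.SubElement(root, "file", {"lfn": lfn})
    path = os.path.join(repo_dir, f"{dataset_id}.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


def load_dataset(repo_dir, dataset_id):
    """Look a dataset up by ID and return its logical file names."""
    path = os.path.join(repo_dir, f"{dataset_id}.xml")
    root = ET.parse(path).getroot()
    return [elem.get("lfn") for elem in root.findall("file")]


repo = tempfile.mkdtemp()
store_dataset(repo, 2002, ["a.root", "b.root"])
loaded = load_dataset(repo, 2002)
```

Replacing the two helper functions with calls to an XML database would leave the lookup-by-ID interface unchanged.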

In many cases, a dataset will be formed from a collection of logical files, and this can be sensibly done by creating one dataset for each file and then merging the resulting datasets. In this case, we register the association between logical file names and dataset IDs in a DFC so that a future user can discover whether a file has already been used to create a dataset.
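The per-file creation and merge described above might look like the following sketch, in which plain dictionaries stand in for the DFC and DDB. The function names, the sequential ID allocation and the file names are assumptions for illustration only.

```python
import itertools

# Hypothetical ID allocator; real dataset IDs are assigned elsewhere.
_next_id = itertools.count(1)


def dataset_for_file(dfc, ddb, lfn):
    """Return the ID of the single-file dataset for lfn, creating it if needed."""
    if lfn in dfc:
        # The file has already been used to create a dataset: reuse it.
        return dfc[lfn]
    dataset_id = next(_next_id)
    ddb[dataset_id] = [lfn]
    dfc[lfn] = dataset_id
    return dataset_id


def merge_datasets(ddb, dataset_ids):
    """Form a new dataset holding the files of all the given datasets."""
    merged_id = next(_next_id)
    ddb[merged_id] = [lfn for i in dataset_ids for lfn in ddb[i]]
    return merged_id


dfc, ddb = {}, {}
ids = [dataset_for_file(dfc, ddb, lfn) for lfn in ["f1.root", "f2.root"]]
merged = merge_datasets(ddb, ids)

# A later request for the same file finds the existing dataset in the DFC.
again = dataset_for_file(dfc, ddb, "f1.root")
```

Because the DFC is consulted first, the second request for f1.root does not create a duplicate single-file dataset.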


The implementation

Dataset query page

This page will allow users to select a dataset with a query on the DSC. At present, one can retrieve the DSC data for a given dataset name.

DSC (dataset selection catalog)

This is a web interface to the DSC implemented as a MySQL table.

DRC (dataset replica catalog)

This is a web interface to the DRC implemented as a MySQL table.

DFC (dataset file catalog)

This is a web interface to the DFC implemented as a MySQL table.

All DIAL catalogs (ATLAS password, GSI)

This is a web interface allowing users to browse all DIAL catalogs.


Update procedure

In addition to its programmatic interface, the dataset package provides executables and scripts for managing datasets from the command line. See the dataset packages for the former and the link to dataset_com for the latter. For convenience, we provide here links to a directory holding the Makefile and scripts that we use to construct and catalog ATLAS CBNT datasets. When new files come in for a dataset, we need only update the file infiles/dataset_name.infiles and run make.


dladams@bnl.gov