Introduction
Data access and movement are critical issues for distributed analysis.
Data is specified in terms of datasets with varying degrees of
abstraction. Typically, the ultimate access is via physical files
which are replicas of logical files. It is expected that users will
work with datasets: a job definition includes the name or ID of an input
dataset and the result of that job is an output dataset. A user wishing
to examine the result obtains replicas of the relevant files (e.g.
a ROOT histogram file) from the output dataset.
Users will often want to know whether the data associated with a dataset exists and whether it is "nearby", i.e. rapidly accessible at low cost. A user may also want to localize the data in a manner that allows access without a network connection, e.g. to examine it on a laptop computer.
Replication
Data movement is accomplished by replication: when a local copy of a file
is required but not present, it is copied from a remote site or a tape
archive. Typically the original file is left in place. File replica
catalogs track these replicas so that the file need not be copied the next
time that file is required at the same location. When the destination site
is a user machine, e.g. a laptop, it is likely that there is no file catalog in
place and it is then the user's responsibility to track replicas.
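The replicate-on-demand behavior described above can be sketched as follows. This is only an illustration under stated assumptions: `ReplicaCatalog`, `localize` and the site names are hypothetical and do not correspond to the DIAL, Magda or DQ interfaces, and a plain file copy stands in for a grid transfer.

```python
import shutil
from pathlib import Path


class ReplicaCatalog:
    """Hypothetical in-memory replica catalog (not the DIAL or Magda API).

    Maps logical file name -> {site name: physical path}.
    """

    def __init__(self):
        self._replicas = {}

    def add(self, logical_name, site, path):
        self._replicas.setdefault(logical_name, {})[site] = Path(path)

    def lookup(self, logical_name, site):
        """Physical path of the replica at `site`, or None if not cataloged."""
        return self._replicas.get(logical_name, {}).get(site)

    def first_available(self, logical_name):
        """Any cataloged replica whose file still exists, or None."""
        for path in self._replicas.get(logical_name, {}).values():
            if path.exists():
                return path
        return None


def localize(catalog, logical_name, site, site_dir):
    """Return (path, copied) for a replica of `logical_name` at `site`."""
    local = catalog.lookup(logical_name, site)
    if local is not None and local.exists():
        return local, False              # replica already present: no copy
    source = catalog.first_available(logical_name)
    if source is None:
        raise FileNotFoundError(f"no replica available for {logical_name}")
    dest = Path(site_dir) / source.name
    shutil.copy2(source, dest)           # the original file is left in place
    catalog.add(logical_name, site, dest)  # record it so the next request is free
    return dest, True
```

Calling `localize` twice for the same logical file and site copies the file only the first time; the second call is answered from the catalog. On a machine without any catalog, the user would have to keep the equivalent of this mapping by hand.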
Lifetime management
Distributed analysis will generate many temporary results
because users are trying out ideas or examining partial results. To
avoid wasting storage resources, there must be means to release replicas
when they are no longer needed locally and to release logical files
when they are no longer of global interest. Support for file and dataset
lifetimes and individual user claims on these lifetimes
are important requirements for an analysis data management system.
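One way to picture lifetimes with individual user claims is the minimal sketch below. The `ManagedFile` class, its method names and the use of plain numeric timestamps are assumptions made for illustration; the requirements documents may define claims differently.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedFile:
    """Hypothetical record for one replica or logical file."""
    name: str
    claims: dict = field(default_factory=dict)  # user -> expiry time (seconds)

    def claim(self, user, expires_at):
        """Keep the file until at least `expires_at` on behalf of `user`."""
        self.claims[user] = max(expires_at, self.claims.get(user, expires_at))

    def release(self, user):
        """Drop the claim held by `user`, if any."""
        self.claims.pop(user, None)

    def expires_at(self):
        """Latest expiry among outstanding claims, or None if unclaimed."""
        return max(self.claims.values(), default=None)

    def releasable(self, now):
        """True when no claim extends beyond `now`."""
        expiry = self.expires_at()
        return expiry is None or expiry <= now


def sweep(files, now):
    """Names of files whose claims have all lapsed or been released."""
    return [f.name for f in files if f.releasable(now)]
```

In this sketch a file survives as long as any one user still holds an unexpired claim, so one user releasing a temporary result does not remove it from under another.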
User interface
The user interface for accessing data is provided by the AJDL dataset
implemented in DIAL. Concrete datasets make use of the DIAL FileCatalog
interface to gain access to file replicas. At present there are
implementations of this interface for NFS, AFS and Magda. It is a high
priority for ADA to provide a similar connection to the evolving
ATLAS data management system, Don Quijote.
Documents
A couple of documents specifying requirements and use cases may be found on the
ADA documents page. Files are discussed in
"File management on the grid" and datasets in "Dataset management".
ADA data management systems
The important data management systems for ADA are NFS/AFS, Magda and Don Quijote.
NFS and AFS
In the DIAL NFS and AFS file catalogs, "logical" files are identified by
physical file name and are trivially accessible as long as the file system
is accessible. These give the appearance of a data management system where
none is available (e.g. a user laptop) or where one might want to circumvent
an existing system for performance reasons, e.g. for files with limited
lifetime or only of local interest.
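As a rough illustration of a catalog in which the logical name is simply the physical file name, consider the sketch below. `PathFileCatalog` and its method are hypothetical names chosen for this example; they are not the DIAL FileCatalog interface.

```python
import os


class PathFileCatalog:
    """Hypothetical catalog where the logical name is the physical path."""

    def physical_name(self, logical_name):
        """Return the path itself if the file system can see it, else None."""
        return logical_name if os.path.exists(logical_name) else None
```

No lookup table is needed: the file is "cataloged" exactly when the file system that holds it is mounted and reachable.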
Magda
Magda maintains a central table that catalogs logical file names and the
physical locations of each replica. It provides various mechanisms for replicating
files between known Magda sites. ATLAS DC1 data and some of the DC2 data are
cataloged in Magda. Wensheng Deng maintains Magda and its DIAL interface.
For more information about Magda, see the Magda home page.
Most of the ATLAS data presently available for analysis with ADA (in particular the Rome AOD samples) are cataloged using Magda. Although this solution is not expected to scale to the long-term ATLAS requirements, it is easy to add data and sites in the short term. Sites need only provide the Magda client, which is included in the DIAL installation.
Don Quijote
The ATLAS data management system is called Don Quijote, or DQ for short.
It is under development and also presents a user view based on "datasets",
there defined as collections of logical files. It will provide means to catalog
and replicate datasets from one location to another with local (site) file replica
catalogs providing the mapping between logical file names (or IDs) and local
physical file replicas.
Work needs to be done to understand how to integrate ADA with the DQ file catalogs and how to make the DQ datasets available for analysis as ADA datasets.
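The relationship between a dataset (a collection of logical files) and a site replica catalog can be pictured with the short sketch below. It is not the DQ API, which is still under development; the function name, the dictionary-based catalog and the example file names are all assumptions made for illustration.

```python
def resolve_dataset(dataset, site_catalog):
    """Split a dataset into files with a local replica and files without.

    dataset      -- iterable of logical file names
    site_catalog -- dict mapping logical file name -> local physical path
    """
    local, missing = {}, []
    for lfn in dataset:
        pfn = site_catalog.get(lfn)
        if pfn is None:
            missing.append(lfn)      # would have to be replicated to this site
        else:
            local[lfn] = pfn         # already "nearby"
    return local, missing


# Example with made-up names: one file is present at the site, one is not.
dataset = ["aod.0001.root", "aod.0002.root"]
site_catalog = {"aod.0001.root": "/data/atlas/aod.0001.root"}
local, missing = resolve_dataset(dataset, site_catalog)
```

A dataset whose `missing` list is empty is fully available at the site; otherwise the listed files would first need to be replicated there.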
Other products and systems
There are also a number of products or ideas that are not specific to
ATLAS.
RLS
RLS has been developed by Globus and EDG to provide replica cataloging
in the grid context. Unfortunately the two implementations are not compatible.
At present, the three ATLAS grids (LCG, NorduGrid and Grid3) each maintain
separate RLS catalogs for the data they produce. RLS does not provide for
data movement or management and so is only a partial solution.
SRM
SRM provides an interface and a couple of implementations for local management
of replicas (and other files) including space allocation and lifetime
management. SRM also provides an interface for moving data between SRM sites.
RMS
There is at least one attempt to tie some of these pieces together.
The RMS was described in a PPDG talk (ppt) in June 2004. One component of
this system is the RRS, for which a draft specification (doc) exists.
LCG
DQ depends on lcg_util to do file transfers on LCG. See this
talk for more information.
gLite
gLite is developing an extensive set of services for file cataloging and
data management and movement. It will likely adopt the SRM and EDG RLS
interfaces. It envisions queues to manage the movement of data between
sites. It is not clear whether DIAL should connect directly to gLite
or DQ should be extended so it can be used as a bridge between DIAL and
gLite.
Castor
Castor is the primary data store at CERN.