DIAL use cases

David Adams
August 2, 2002

This document gives use cases for DIAL (Distributed Interactive Analysis of Large datasets).

The DIAL home page is http://www.usatalas.bnl.gov/~dladams/dial.

A. Event data specification

A1. Dataset definition

User defines a dataset by specifying which events are included (i.e. which event or beam-crossing IDs) and which data are part of each event. The data for any event do not depend on the data in any other event.

A2. Data version

The same type of data may be generated multiple times for a given event as the code evolves. The user specification of which data are included for each event includes this version.

A3. Dataset content

A user processing data specifies that only a subset of the types of data in any event will be required. The system is able to define a dataset which includes only this subset and process only this restricted dataset (to reduce the cost of data access) but report back to the user that the full dataset has been processed.
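
Use cases A1 through A3 can be illustrated with a minimal sketch. The `Dataset` class, its fields, and the `restrict` method are hypothetical names chosen for this example and are not part of DIAL; the sketch assumes a dataset is simply a set of event IDs plus the (data type, version) pairs kept for every event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical dataset: event IDs plus the content kept for each event."""
    event_ids: frozenset  # which events are included (A1)
    content: tuple        # (data type, version) pairs stored per event (A2)

    def restrict(self, wanted_types):
        """Return a dataset with the same events but only the requested data types (A3)."""
        kept = tuple((dtype, version) for dtype, version in self.content
                     if dtype in wanted_types)
        return Dataset(self.event_ids, kept)


full = Dataset(
    frozenset({101, 102, 103}),
    (("tracks", "v2"), ("clusters", "v1"), ("jets", "v3")),
)
# Only tracks and jets are needed, so the cluster data are never accessed.
slim = full.restrict({"tracks", "jets"})
```

The restricted dataset keeps every event of the original, which is why the system can report that the full dataset has been processed.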

A4. Dataset input

The user specifies a dataset as the input to any of the processing options described below.

A5. Dataset persistence

Means are provided so that a user can record a dataset produced in one process and use it as the basis for analysis in a later job.

B. Event loop processing

B1. Event selection

User provides code that is applied to each event independently and returns whether the event is accepted or rejected. A new dataset containing the accepted events is created from the input dataset.
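
As a rough sketch of this use case, the selection can be written as a filter over the input dataset. The function `select_events`, the dictionary layout of the events, and the `n_tracks` field are assumptions made for illustration only.

```python
def select_events(events, accept):
    """Apply accept() to each event independently and build a new dataset.

    `events` maps event ID to event data; the input dataset is left unchanged.
    """
    return {event_id: data for event_id, data in events.items() if accept(data)}


# Hypothetical input dataset with one quantity per event.
events = {1: {"n_tracks": 3}, 2: {"n_tracks": 12}, 3: {"n_tracks": 7}}

# User-provided selection code: accept events with at least five tracks.
selected = select_events(events, lambda event: event["n_tracks"] >= 5)
```

Because the user code sees one event at a time, the same selection could be applied to any subset of the events without changing the result for each event.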

B2. Fill histogram

User defines a histogram and provides code to fill it for each event independently. The histogram is filled and returned to the user to view and manipulate in a selected analysis tool, e.g. ROOT.
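
The following is a minimal sketch of this use case in plain Python rather than in ROOT. The `Histogram` class, the `fill_histogram` driver, and the `pt` field are hypothetical and stand in for whatever histogram type the selected analysis tool provides.

```python
class Histogram:
    """Hypothetical one-dimensional histogram with equal-width bins."""

    def __init__(self, n_bins, low, high):
        self.n_bins, self.low, self.high = n_bins, low, high
        self.counts = [0] * n_bins

    def fill(self, value):
        # Values outside [low, high) are ignored in this sketch.
        if self.low <= value < self.high:
            index = int((value - self.low) / (self.high - self.low) * self.n_bins)
            self.counts[min(index, self.n_bins - 1)] += 1


def fill_histogram(events, hist, fill_code):
    """Run the user-provided fill code once per event, independently."""
    for event in events:
        fill_code(hist, event)
    return hist


events = [{"pt": 5.0}, {"pt": 15.0}, {"pt": 17.5}, {"pt": 42.0}]
hist = fill_histogram(events, Histogram(4, 0.0, 40.0),
                      lambda h, event: h.fill(event["pt"]))
```

The user supplies only the histogram definition and the per-event fill code; the loop over events belongs to the system.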

B3. Fill tuple

User defines a tuple (collection of named variables) and provides code to add any number of entries to this tuple for each event independently. The tuple is filled for each event and returned to the user for examination in a selected analysis tool.
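
A possible shape for this use case is sketched below. The names `fill_tuple` and `track_entries` and the event layout are assumptions for illustration; the point is that the user code may produce zero, one, or many entries per event.

```python
def fill_tuple(events, names, entry_code):
    """Collect tuple entries; entry_code(event) yields any number of entries."""
    rows = []
    for event in events:
        for entry in entry_code(event):
            # Keep only the named variables that define the tuple.
            rows.append({name: entry[name] for name in names})
    return rows


def track_entries(event):
    """User-provided code: one entry per track in the event."""
    for pt in event["track_pts"]:
        yield {"event_id": event["id"], "pt": pt}


events = [
    {"id": 1, "track_pts": [4.2, 9.1]},
    {"id": 2, "track_pts": []},
    {"id": 3, "track_pts": [1.5]},
]
tuple_rows = fill_tuple(events, ("event_id", "pt"), track_entries)
```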

C. Single event processing

C1. Fetch

User specifies an event (by ID) and all the data associated with that event are returned. Alternatively, the returned data may be limited to a predefined subset of the data in the event.

C2. Visualization

User specifies a view for an event and provides code to fill that view from the event. The view is filled by running that code on a specified event (by ID).

D. Distributed processing

D1. Remote processing

In any of the above, the data are located on a machine different from that of the user. The user describes the job on the local machine, the job is created and run on the remote machine, and the results are returned to the local machine.

D2. Parallel processing

As in the previous except the dataset is divided into multiple datasets, each containing a subset of the events in the original dataset. Each dataset is processed in a separate process (or thread) and the results are combined and returned to the user.
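
The split, process, and combine pattern described here might look like the sketch below, which uses threads from the Python standard library. The helper names, the round-robin split, and the toy selection (accept even event IDs) are assumptions for illustration and say nothing about how DIAL itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def split(event_ids, n_parts):
    """Divide the dataset into n_parts sub-datasets (round-robin by position)."""
    return [event_ids[i::n_parts] for i in range(n_parts)]


def count_accepted(sub_dataset):
    """Process one sub-dataset: count events passing a toy selection."""
    return sum(1 for event_id in sub_dataset if event_id % 2 == 0)


def process_in_parallel(event_ids, n_parts):
    """Process each sub-dataset in its own thread and combine the results."""
    parts = split(event_ids, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_results = list(pool.map(count_accepted, parts))
    return sum(partial_results)  # combining step: add the partial counts


total = process_in_parallel(list(range(100)), 4)
```

Combining works here because the result is a sum over events. Histograms and tuples combine in the same way, by adding bin contents or concatenating entries.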

D3. Multi-node processing

As in the previous except the processing jobs are distributed over multiple compute nodes.

D4. Multi-site processing

Same as the previous except the distribution is over different sites.

D5. GRID processing

Special case of the previous where job specification, submission and authentication are all done in the GRID framework.