Large-scale production of simulation data

D. Adams
12dec02 1230 EST


Introduction

This scenario is produced as part of the US ATLAS production and analysis project. It describes the production of Monte Carlo data in that environment and is based on the actual production that took place in late 2002.

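Monte Carlo data are produced in three steps: event generation, which produces MC tracks; detector simulation, which produces MC hits; and digitization, which produces digitized data.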

Produced data are organized into datasets. A dataset is a collection of events that may be distributed over multiple files. In a typical production chain, an MC track dataset is generated for the physics process of interest. GEANT detector simulation takes this dataset as input and produces a dataset of MC hits. In the last step, the MC hit dataset and background datasets are used as input to digitization, which produces a digitization dataset.

In this scenario we assume the production is carried out by a production manager. The manager has been given a physics process (e.g. Higgs to four muons) and asked to produce some number of events. The ultimate goal is to produce and verify a dataset holding simulated digitized data for those events.

The production of a dataset may be distributed over multiple processes, nodes and even sites. A job specifies the data to be generated in any one process. For simplicity, we assume that any one output file is produced by a single job.

Virtual data

We adopt a virtual data model where the unit of data is the dataset. Datasets are created by applying derivations of predefined transformations. A derivation specifies which input datasets (if any) are used. In the general model, a derivation may introduce additional parameters, but we assume a simpler model in which all parameters are specified by the transformation.
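The relationship between transformations, derivations and datasets can be sketched as a small data structure. This is only an illustration of the simplified model described above, not the Chimera schema: the class names, field names, dataset IDs and the parameter shown are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Transformation:
    """An application plus all of its parameters (simplified model)."""
    transformation_id: str
    application: str
    parameters: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Derivation:
    """Names a transformation and the input datasets (if any) it is applied to."""
    transformation_id: str
    input_dataset_ids: Tuple[str, ...] = ()


@dataclass
class Dataset:
    """The unit of data: a derivation plus the files that hold its events."""
    dataset_id: str
    derivation: Derivation
    file_ids: List[str] = field(default_factory=list)


def provenance(dataset_id, datasets):
    """Return the IDs of every dataset this one was derived from, nearest first."""
    found = []
    pending = list(datasets[dataset_id].derivation.input_dataset_ids)
    while pending:
        current = pending.pop(0)
        if current not in found:
            found.append(current)
            pending.extend(datasets[current].derivation.input_dataset_ids)
    return found


# Hypothetical transformation and production chain: tracks -> hits -> digits.
generate = Transformation("generate", "generator", (("channel", "H -> 4 mu"),))
datasets = {
    "tracks": Dataset("tracks", Derivation(generate.transformation_id)),
    "hits": Dataset("hits", Derivation("simulate", ("tracks",))),
    "background": Dataset("background", Derivation("simulate-background")),
    "digits": Dataset("digits", Derivation("digitize", ("hits", "background"))),
}

print(provenance("digits", datasets))
```

Because each dataset records only its derivation, the full history of the digitization dataset can be recovered by following input dataset IDs back to the datasets that have no input.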

The virtual data model is being developed under the auspices of the GriPhyN project. The Chimera virtual data system is an important component.

A very important feature of a virtual data system is that it tracks the history of data production. Transformations, i.e. the application and its parameters, are recorded in a transformation catalog.

History

We require that the data be reproducible. This implies that the above history information exists and that our algorithm is inherently reproducible. The production of simulation data depends on pseudorandom generators, so the latter requirement implies that the seeds for these generators be specified in advance, or at least recorded so that they can be recovered if the data are regenerated.

The complete history of a piece of data includes information that is specific to the job that produced it. This includes a description of the producing node, including hardware, operating system software and libraries, and the date and time. This information is recorded in a production catalog.
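As a rough illustration of the job-specific history, the following sketch collects a description of the producing node and a timestamp using only the Python standard library. The function name and the choice of fields are assumptions made for this example; a real production catalog would define its own record layout and would likely also record library versions.

```python
import platform
from datetime import datetime, timezone


def node_description():
    """Collect a minimal description of the node running this job."""
    return {
        "hostname": platform.node(),        # network name of the node
        "machine": platform.machine(),      # hardware type, e.g. x86_64
        "system": platform.system(),        # operating system name
        "os_release": platform.release(),   # operating system release
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = node_description()
for key, value in record.items():
    print(f"{key}: {value}")
```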

Events, distributed processing and files

The Monte Carlo data are organized into events that are generated and processed independently. This makes it possible to distribute the production of a dataset over multiple independent processes, each of which sequentially processes a distinct subset of the events in the dataset. We assume processing is distributed in this manner.
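One simple way to assign distinct subsets of events to processes is to cut the dataset into contiguous blocks. The sketch below assumes a fixed block size per job; the function name `split_events` is hypothetical, and other partitioning policies would serve equally well.

```python
def split_events(total_events, events_per_job):
    """Return a (first_event_index, event_count) pair for each job."""
    if total_events < 0 or events_per_job <= 0:
        raise ValueError("need total_events >= 0 and events_per_job > 0")
    return [
        (start, min(events_per_job, total_events - start))
        for start in range(0, total_events, events_per_job)
    ]


# Ten events in blocks of four: the last job receives the remainder.
print(split_events(10, 4))
```

Every event falls in exactly one block, so the jobs can run independently and in any order.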

For simplicity we assume a one-to-one correspondence between jobs and output files.

The specification of the executable, libraries, input files and input parameters required to run a process is called a job. The files in a dataset are created by running a collection of jobs. The progress of these jobs is recorded in the production catalog.

Each event is assigned a unique ID. A job that creates events must have a policy for assigning these IDs. In practice this is accomplished by specifying the first ID and then incrementing it for each following event.
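The ID policy described above can be sketched as follows. The function names are hypothetical, and the assumption that consecutive jobs receive consecutive, equally sized ID ranges is made only for this example.

```python
def assign_event_ids(first_id, n_events):
    """IDs for one job: start at first_id and increment for each event."""
    return list(range(first_id, first_id + n_events))


def first_ids_for_jobs(first_id, events_per_job, n_jobs):
    """First event ID for each job so that the ID ranges never overlap."""
    return [first_id + job * events_per_job for job in range(n_jobs)]


starts = first_ids_for_jobs(first_id=1000, events_per_job=3, n_jobs=2)
all_ids = [event_id for start in starts for event_id in assign_event_ids(start, 3)]
print(all_ids)
```

Spacing the first IDs by the number of events per job keeps the IDs unique across the whole dataset without any communication between jobs.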

Random numbers

Simulation software makes use of random numbers or, more precisely, pseudorandom sequences. In order to avoid correlations between events, it is desirable to ensure that a different sequence is used for each event. This is achieved by starting each process with a different seed, i.e. by assigning a different seed to each job.
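The effect of per-job seeds can be illustrated with Python's standard `random` module standing in for the simulation's generator. The seed policy shown here (a base seed plus the job index) is an assumption chosen for simplicity; the actual generators and seeding scheme used in production are not specified in this scenario.

```python
import random


def job_seeds(base_seed, n_jobs):
    """Assign a distinct seed to each job (assumed policy: base + job index)."""
    return [base_seed + job for job in range(n_jobs)]


def sample(seed, n=3):
    """Draw n pseudorandom numbers from a generator started with this seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


seeds = job_seeds(base_seed=12345, n_jobs=4)

# Rerunning a job with its recorded seed reproduces the same sequence,
# while jobs with different seeds draw different sequences.
print(sample(seeds[0]) == sample(seeds[0]))
print(sample(seeds[0]) == sample(seeds[1]))
```

Recording the seed with each job is what allows the data to be regenerated identically later.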

Generating MC tracks

The first step in the production chain is to simulate a physics process and produce MC tracks.

The production manager consults the transformation catalog and, if no appropriate transformation is present, adds one to the catalog. The transformation specification includes the application (e.g. Pythia), physics channel, decay table and kinematic cuts. The transformation also specifies the total number of events, the number of events per output file, the event IDs and the random number seeds. Output file names might also be included.

The next step is to add the dataset to the dataset catalog. There it is assigned an ID. The only required information for this entry is the transformation ID. There is no input dataset.

Next the production manager submits the dataset ID to the production system, which creates the jobs that produce the files holding the generated data. Once these jobs have all completed successfully, the production catalog holds a complete list of the files that hold the data for the generated dataset. The files are stored and registered in a file catalog and the file IDs are added to the entry in the dataset catalog.
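The generation workflow above can be sketched end to end with toy in-memory catalogs. Everything here is hypothetical: the `Catalog` class, the ID formats, the field names and the file naming are assumptions for illustration, and the jobs are only recorded rather than actually submitted and run.

```python
class Catalog:
    """Toy in-memory catalog that hands out sequential IDs."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.entries = {}

    def add(self, entry):
        entry_id = f"{self.prefix}{len(self.entries) + 1}"
        self.entries[entry_id] = entry
        return entry_id


transformations = Catalog("T")  # transformation catalog
datasets = Catalog("D")         # dataset catalog
jobs = Catalog("J")             # production catalog
files = Catalog("F")            # file catalog


def produce(dataset_id):
    """Create one job per output file, record it, and register its file."""
    dataset = datasets.entries[dataset_id]
    spec = transformations.entries[dataset["transformation_id"]]
    total, per_file = spec["total_events"], spec["events_per_file"]
    for index, start in enumerate(range(0, total, per_file)):
        job_id = jobs.add({
            "dataset_id": dataset_id,
            "first_event_id": spec["first_event_id"] + start,
            "n_events": min(per_file, total - start),
            "seed": spec["first_seed"] + index,
            "status": "done",  # a real system would submit and monitor the job
        })
        file_id = files.add({"job_id": job_id, "name": f"{dataset_id}.{index:04d}"})
        dataset["file_ids"].append(file_id)
    return dataset["file_ids"]


# The manager adds a transformation, then a dataset, then submits the dataset ID.
tid = transformations.add({
    "application": "generator",
    "total_events": 2500,
    "events_per_file": 1000,
    "first_event_id": 1,
    "first_seed": 100,
})
did = datasets.add({"transformation_id": tid, "file_ids": []})
print(produce(did))
```

In this sketch the dataset catalog entry starts with only a transformation ID and gains file IDs as jobs complete, mirroring the order of steps in the text.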

Detector simulation

In the second step, the MC track dataset from the first step is used as input to a detector simulation program such as GEANT3 or GEANT4, which propagates tracks through the detector and produces MC hits. These hits are collected into a

More to come ...

Identified components

dataset
job
transformation catalog
dataset catalog
production catalog
file catalog
production system


dladams@bnl.gov