Monte Carlo data are produced in three steps:
In this scenario we assume the production is carried out by a production manager. The manager has been given a physics process (e.g. Higgs to four muons) and been asked to produce some number of events. The ultimate goal is to produce and verify a dataset holding simulation of digitized data for those events.
The production of a dataset may be distributed over multiple processes, nodes and even sites. A job specifies the data to be generated in any one process. For simplicity, we assume that each output file is produced by a single job.
We adopt a virtual data model where the unit of data is the dataset. Datasets are created by applying derivations of predefined transformations. A derivation specifies which input datasets (if any) are used. In the general model, a derivation may introduce additional parameters, but we assume a simpler model in which all parameters are specified by the transformation.
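As a rough illustration of this model, the following Python sketch represents a transformation, a derivation and a dataset as plain records. The class names, field names and IDs are assumptions made for this example; they are not the Chimera schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transformation:
    """An application together with all of the parameters it is run with."""
    transformation_id: str
    application: str
    parameters: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Derivation:
    """Applies one transformation to zero or more input datasets."""
    transformation_id: str
    input_dataset_ids: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Dataset:
    """The unit of data; its content is defined by its derivation."""
    dataset_id: str
    derivation: Derivation


# Hypothetical IDs: a generated dataset with no input, and a
# simulated dataset derived from it.
generated = Dataset("ds-gen-001", Derivation("tf-generate"))
simulated = Dataset("ds-sim-001", Derivation("tf-simulate", (generated.dataset_id,)))
```

In the simpler model assumed here, the derivation carries no parameters of its own, so it reduces to a transformation ID plus a list of input dataset IDs.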
The virtual data model is being developed under the auspices of the GriPhyN project. The Chimera virtual data system is an important component.
A very important feature of a virtual data system is that it tracks the history of data production. Transformations, i.e. the application and its parameters, are recorded in a transformation catalog.
We require that the data be reproducible. This implies the existence of the above history information and that our algorithm be inherently reproducible. The production of simulation data depends on pseudorandom generators, and the latter requirement implies that the seeds for these generators be specified in advance, or at least be recorded so that they can be recovered if the data is regenerated.
The complete history of a piece of data includes information that is specific to the job that produced it. This includes a description of the producing node, including its hardware, operating system software and libraries, and the date and time of production. This data is recorded in a production catalog.
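The kind of job-specific information meant here might be collected as in the sketch below, which uses only the Python standard library. The function name and the record keys are assumptions for illustration, and the Python version merely stands in for the library versions a real job would record.

```python
import platform
from datetime import datetime, timezone


def describe_producing_node():
    """Collect job-specific provenance for the node running this process."""
    return {
        "hostname": platform.node(),
        "machine": platform.machine(),    # hardware architecture
        "system": platform.system(),      # operating system name
        "release": platform.release(),    # operating system release
        # Stand-in for the versions of the libraries used by the job.
        "python_version": platform.python_version(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


record = describe_producing_node()
```

A production catalog entry would pair such a record with the ID of the job and of the output file it produced.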
For simplicity we assume a one-to-one correspondence between jobs and output files.
The specification of the executable, libraries, input files and input parameters required to run a process is called a job. The files in a dataset are created by running a collection of jobs. The progress of these jobs is recorded in the production catalog.
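A minimal sketch of a job specification and of progress tracking follows. The status values, class names and method names are assumptions for this example; the text above only says that progress is recorded.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    """Everything required to run one process."""
    job_id: str
    executable: str
    libraries: List[str] = field(default_factory=list)
    input_files: List[str] = field(default_factory=list)
    parameters: Dict[str, str] = field(default_factory=dict)


class ProductionCatalog:
    """In-memory stand-in for a catalog that records job progress."""

    def __init__(self):
        self._status = {}

    def register(self, job):
        self._status[job.job_id] = JobStatus.PENDING

    def update(self, job_id, status):
        if job_id not in self._status:
            raise KeyError(f"unknown job: {job_id}")
        self._status[job_id] = status

    def all_done(self):
        # True only if at least one job is registered and every job is DONE.
        return bool(self._status) and all(
            status is JobStatus.DONE for status in self._status.values()
        )
```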
Each event is assigned a unique ID. A job that creates events must have a policy for assigning these IDs. In practice this is accomplished by specifying the first ID and then incrementing it for each following event.
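The policy of specifying a first ID and incrementing can be sketched as follows. Giving each job a contiguous, non-overlapping block of IDs is an assumption of this example, as are the function names.

```python
def job_event_ids(first_id, n_events):
    """Return the event IDs for one job: first_id, first_id + 1, ..."""
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    return range(first_id, first_id + n_events)


def first_ids_for_jobs(dataset_first_id, events_per_job, n_jobs):
    """Return the first event ID of each job so that no two jobs overlap."""
    return [dataset_first_id + index * events_per_job for index in range(n_jobs)]


# Hypothetical numbers: 4 jobs of 250 events each, starting at ID 1000.
first_ids = first_ids_for_jobs(1000, 250, 4)
all_ids = [event_id for first in first_ids for event_id in job_event_ids(first, 250)]
```

With equal-sized jobs, the IDs are unique across the dataset as long as every job uses the same number of events per job that was used to compute its first ID.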
Simulation software makes use of random numbers or, more precisely, pseudorandom sequences. In order to avoid correlations between events, it is desirable to ensure that a different sequence is used for each event. This is achieved by starting each process with a different seed, i.e. by assigning a different seed to each job.
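The sketch below illustrates one seed per job using Python's standard `random` module in place of the generators used by real simulation software. Deriving the seed as base seed plus job index is an assumption of this example. Distinct seeds give distinct sequences, but they do not by themselves demonstrate that the sequences are statistically independent.

```python
import random


def job_seed(base_seed, job_index):
    """Return a distinct seed for each job (hypothetical policy)."""
    return base_seed + job_index


def first_draws(seed, n=3):
    """Return the first n numbers of the sequence started from seed."""
    rng = random.Random(seed)  # private generator, no global state
    return [rng.random() for _ in range(n)]


# Two jobs get different sequences; rerunning a job with its recorded
# seed reproduces the same sequence.
job0 = first_draws(job_seed(12345, 0))
job1 = first_draws(job_seed(12345, 1))
job0_again = first_draws(job_seed(12345, 0))
```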
The first step in the production chain is to simulate a physics process and produce MC tracks.
The production manager consults the transformation catalog and, if no appropriate transformation is present, adds one to the catalog. The transformation specification includes the application (e.g. Pythia), physics channel, decay table and kinematic cuts. The transformation also specifies the total number of events, the number of events per output file, the event IDs and the random number seeds. Output file names might also be included.
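The consult-then-add step can be sketched with an in-memory catalog keyed on the full specification. The class, the ID format, the field names and all values in the example specification are hypothetical.

```python
import json


class TransformationCatalog:
    """In-memory stand-in for a transformation catalog."""

    def __init__(self):
        self._ids_by_key = {}
        self._next_number = 1

    @staticmethod
    def _key(spec):
        # A canonical JSON string, so that equal specifications match
        # regardless of the order in which their fields were written.
        return json.dumps(spec, sort_keys=True)

    def find(self, spec):
        """Return the ID of a matching transformation, or None."""
        return self._ids_by_key.get(self._key(spec))

    def find_or_add(self, spec):
        """Return the ID of a matching transformation, adding it if absent."""
        key = self._key(spec)
        if key not in self._ids_by_key:
            self._ids_by_key[key] = f"tf-{self._next_number}"
            self._next_number += 1
        return self._ids_by_key[key]


# Hypothetical specification for the generation step.
spec = {
    "application": "generator-app",
    "physics_channel": "Higgs to four muons",
    "kinematic_cuts": {"min_muon_pt_gev": 5.0},
    "total_events": 10000,
    "events_per_file": 500,
    "first_event_id": 1,
    "base_seed": 12345,
}
catalog = TransformationCatalog()
transformation_id = catalog.find_or_add(spec)
```

Matching on the complete specification means that any change to a parameter, including a seed, yields a new transformation, which is consistent with the assumption that all parameters live in the transformation.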
The next step is to add the dataset to the dataset catalog. There it is assigned an ID. The only required information for this entry is the transformation ID. There is no input dataset.
Next the production manager submits the dataset ID to the production system, which creates the jobs that produce the files holding the generated data. Once these jobs have all completed successfully, the production catalog holds a complete list of the files that hold the data for the generated dataset. The files are stored and registered in a file catalog, and the file IDs are added to the entry in the dataset catalog.
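The flow just described might look like the following sketch, in which the catalogs are plain dictionaries and jobs run one after another. The function name, the `run_job` callback (assumed to return an output file name, or None on failure) and the ID formats are all assumptions of this example.

```python
def produce_dataset(dataset_id, dataset_catalog, n_jobs, run_job, file_catalog):
    """Create and run the jobs for a dataset, then register the output files."""
    # Production catalog for this dataset: job ID -> output file name.
    production_catalog = {}
    for index in range(n_jobs):
        job_id = f"{dataset_id}-job{index}"
        production_catalog[job_id] = run_job(job_id)

    # Register files only once every job has completed successfully.
    failed = [job_id for job_id, name in production_catalog.items() if name is None]
    if failed:
        raise RuntimeError(f"jobs failed: {failed}")

    file_ids = []
    for file_name in production_catalog.values():
        file_id = f"file-{len(file_catalog) + 1}"
        file_catalog[file_id] = file_name
        file_ids.append(file_id)

    # Add the file IDs to the dataset's entry in the dataset catalog.
    dataset_catalog[dataset_id]["file_ids"] = file_ids
    return file_ids


# Hypothetical usage: a dataset entry holding only its transformation ID.
dataset_catalog = {"ds-1": {"transformation_id": "tf-1"}}
file_catalog = {}
file_ids = produce_dataset(
    "ds-1", dataset_catalog, 3, lambda job_id: job_id + ".dat", file_catalog
)
```

A real production system would submit the jobs to remote nodes and poll for completion; the sequential loop here only fixes the order of the bookkeeping steps.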
More to come ...