r2 - 08 Apr 2009 - 23:13:15 - AkiraShibata

US Grid Stress Test 2009

Introduction (from Jim Cochran)

As you know, we are planning a one-day stress test of the US analysis queues and users sometime in May. To make the activity meaningful for the users, and thus to increase participation, some sort of large, appropriately mixed (preferably blind) dataset is desired. The initial plan had been to use the recently mixed top+W+jets sample together with a weighted addition of the soon-to-be-available 10 pb-1 QCD/multijet sample. This is unlikely to significantly stress the system, and it may be difficult to get all the users to implement the addition and weighting correctly (if we have to give them code to run over the data, we might just as well run robot tests).

The problem here, as with the FDR, is that we don't have enough QCD/multijet background. I understand that the 10 pb-1 QCD/multijet sample itself is not yet ready but is to be run through full simulation in the very near future. I further understand that the main reason these QCD/multijet samples have been so difficult to produce is that they are generated with ALPGEN, which is a rather painful two-step procedure requiring some amount of manual intervention (although it's thought to do a more accurate job on the jet multiplicities; not sure if that's so relevant for our stress test/challenge).

I would like to explore the possibility of making a Pythia QCD/multijet (or better yet, SM) sample, perhaps in simple Q^2 ranges, with some basic generator-level cuts, which could be run through the fast simulation (not ideal, but all we can consider on this timescale). Since getting such a sample into the production queues would likely take a very long time (long past May), we are considering whether we can instead do this on the US analysis queues. We still need to do some work to see whether such an exercise is even feasible, but wanted to check with you whether some fraction of the T2 analysis queues might be spared for it (which itself could be a useful test of the T2 analysis queues).

Summary (from March facility meeting)

Ideas for stress test: This exercise will stress-test the analysis queues at the T2 sites with analysis jobs that are as realistic as possible, both in volume and in quality. We would like to make sure that the T2 sites are ready to accept real data and that their analysis queues are ready to analyze it. The stress test will be organized sometime near the end of May.

Basic outline of the exercise: To make the exercise more useful and interesting, we will generate and simulate (Atlfast-II) a large mixed sample at the T2s. We are currently trying to define the job for this and expect it to be finalized after the BNL jamboree next week. The mixed sample is a blind mix of all SM processes, which we call "data" in this exercise. For the one-day stress test, we will invite people with existing analyses to try to analyze the data using T2 resources only. It was suggested that a list of people able to participate be compiled.

Estimates of data volume: A very rough estimate of the data volume is 100M-1B events. Assuming 100 kB/event (realistic, given that no truth or trigger information is kept), this sets an upper limit of 100 TB in total. It was mentioned that the currently available USER/GROUP disk at the T2s (which is in addition to the MC/DATA/PROD and CALIB disk) probably imposes roughly this upper limit, but this needs to be checked.
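The volume estimate is simple arithmetic (events × size per event). A minimal sketch, assuming the 100 kB/event figure from the text and decimal units (1 TB = 10^9 kB); the function name disk_tb is hypothetical:

```sh
#!/bin/sh
# Back-of-envelope disk estimate for the stress-test "data" sample.
# Assumptions: ~100 kB per event (no truth, no trigger info) and
# decimal units, i.e. 1 TB = 10^9 kB. Integer arithmetic only.
set -eu

# disk_tb EVENTS [KB_PER_EVENT]: print the total size in TB (rounded down)
disk_tb() {
  events=$1
  kb_per_event=${2:-100}
  echo $(( events * kb_per_event / 1000000000 ))
}

# Range quoted in the text: 100M to 1B events
for n in 100000000 1000000000; do
  echo "$n events -> $(disk_tb "$n") TB"
done
```

This reproduces the 10 TB to 100 TB range implied by 100M-1B events; passing a different second argument shows how sensitive the total is to the assumed event size.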

Estimate of computing capability: Right now there are "plenty" of machines assigned to analysis, though the current load on the analysis queues is rather low. The computing nodes are usually shared between production and analysis and are typically configured with an upper limit and a priority for analysis. For example, MWT2 has 1200 cores and is set up to run analysis jobs with priority, up to a limit of 400 cores. If no production jobs are coming in, the number of running analysis jobs can exceed this limit.
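The sharing policy described above (analysis has priority up to a cap, production fills the rest, analysis may use whatever production leaves idle) can be modelled roughly as follows. This is a simplified sketch, not the actual batch-system configuration at MWT2 or any other site; the functions and their arguments are hypothetical, and only the 1200-core and 400-core figures come from the text.

```sh
#!/bin/sh
# Simplified model of a node pool shared by production and analysis.
# Assumption (not a real site config): analysis has priority up to CAP cores,
# production then takes what it needs, and analysis may use any cores that
# production leaves idle.
set -eu

min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# analysis_cores TOTAL CAP PROD_DEMAND ANA_DEMAND
analysis_cores() {
  total=$1 cap=$2 prod=$3 ana=$4
  # Cores analysis gets by priority, up to the cap
  guaranteed=$(min "$ana" "$(min "$cap" "$total")")
  # Production fills what it can of the remainder
  prod_used=$(min "$prod" $(( total - guaranteed )))
  # Cores left idle by production can absorb extra analysis demand
  idle=$(( total - guaranteed - prod_used ))
  extra=$(min $(( ana - guaranteed )) "$idle")
  echo $(( guaranteed + extra ))
}

# MWT2 numbers from the text: 1200 cores, analysis cap of 400
echo "busy production: $(analysis_cores 1200 400 1200 1000) analysis cores"
echo "no production:   $(analysis_cores 1200 400 0 1000) analysis cores"
```

With production saturating the site, analysis stays at the 400-core cap; with no production demand, the same 1000-core analysis demand is fully served.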

Site configuration: Site configuration varies among the T2 sites. For this exercise, it is useful to identify which configuration processes analysis jobs most efficiently. It was suggested that a table be compiled showing the basic settings of each site's analysis queue.

Pre-stress-test test: To make the most of the exercise and to avoid stumbling on trivial issues during the stress test, a pre-stress-test exercise was suggested. It was requested that the people responsible for each site be notified before a large number of jobs is launched.

To do:
- Data generation/simulation job to be defined by Akira
- List of possible participants to be compiled by Rik
- A table of site configuration to be produced by Rob
- Someone to define pre-stress-test test routine

Data preparation

Right now I'm in the process of analyzing JF17. We have quite a few full-sim events, but given the huge cross section (according to AMI, 0.1 mb, i.e. 0.1 millibarn) they only correspond to 0.017 pb-1. The bad news is that we need ~100M events per pb-1. Assuming a 100 kB event size, this means we need 10 TB per 1 pb-1. So we really need to push to the limit of what we were suggesting: 1B events, 10 pb-1, 100 TB. Yet this is still not that large a sample. If we want to do 100 pb-1, we need some serious skimming to reduce the number of events by a factor of 10. This is possible, but we need to decide what selection is appropriate.
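These numbers follow from N = σ × L. A minimal sketch using only the figures quoted above (σ ≈ 0.1 mb for JF17 as listed in AMI, 100 kB/event, decimal TB); the helper names are hypothetical. Note that 0.1 mb = 10^-4 b = 10^8 pb, since 1 b = 10^12 pb.

```sh
#!/bin/sh
# N = sigma * L, with sigma in pb and L in pb^-1.
# 0.1 mb = 1e-4 b = 1e8 pb (1 b = 1e12 pb); JF17 value as quoted from AMI.
set -eu

XSEC_PB=100000000      # 0.1 mb expressed in pb
KB_PER_EVENT=100       # assumed AOD-like event size
KB_PER_TB=1000000000   # decimal units

events_for_lumi() { echo $(( XSEC_PB * $1 )); }            # $1: L in pb^-1
tb_for_events()   { echo $(( $1 * KB_PER_EVENT / KB_PER_TB )); }

for lumi in 1 10 100; do
  n=$(events_for_lumi "$lumi")
  echo "${lumi} pb-1 -> ${n} events -> $(tb_for_events "$n") TB"
done
```

At 100 pb-1 this gives 10^10 events and 1000 TB, which is why a skim keeping roughly 1 event in 10 would be needed to stay at the ~100 TB scale.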

Event Generation

We will follow the production recipe for tag e347. Use lxplus for the setup. First make sure (among other things) that you have

macro ATLAS_TEST_AREA "" \
      "${HOME}/scratch0/athena/"\
      "${HOME}/scratch0/athena/"\

in your requirements file. Then set up:

source setup.sh -tag=,AtlasProduction,releases
mkdir $TestArea
cd $TestArea
mkdir -p WorkArea/run
cd WorkArea/run

get_files -jo EvgenJobOptions/MC8.105802.JF17_pythia_jet_filter.py

We will use a transformation called csc_evgen08_trf.py. If you have set up correctly, you can already invoke it from your terminal to get a help menu. For our purposes, we need the following:

csc_evgen08_trf.py 105802 1 -1 5555 ./MC8.105802.JF17_pythia_jet_filter.py test105802.evgen.pool.root NONE NONE NONE

This will take a long time, since the filter efficiency is only about 0.075. You cannot reduce the number of events, because the transformation will complain. But once it starts generating events, you can assume that the setup is OK.
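The slowness comes from the filter: Pythia has to generate about N/ε events to yield N accepted ones. A sketch assuming ε ≈ 0.075 as stated, written as 75 per mille so that shell arithmetic stays integer; the function name is hypothetical.

```sh
#!/bin/sh
# Generated events needed to obtain N filtered events at filter efficiency eps.
# eps is passed in per mille (0.075 -> 75) so the shell can use integers.
set -eu

# generated_needed N_ACCEPTED EPS_PER_MILLE: ceil(N * 1000 / eps)
generated_needed() {
  echo $(( ($1 * 1000 + $2 - 1) / $2 ))
}

echo "5000 accepted events need ~$(generated_needed 5000 75) generated events"
```

So a 5000-event job implies roughly 66667 generated Pythia events, more than 13 times the nominal job size.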

Now send this to PanDA. First set up the client:

source /afs/cern.ch/atlas/offline/external/GRID/DA/panda-client/latest/etc/panda/panda_setup.sh

Then try the following command:

pathena --tmpDir /tmp  --outDS user09.AkiraShibata.mc08.105802.JF17_pythia_jet_filter.recon.AOD.e347   --cloud US  --trf  "csc_evgen08_trf.py 105802 1 5000 5555 ./MC8.105802.JF17_pythia_jet_filter.py %OUT.evgen.pool.root NONE NONE NONE"

This will produce 5000 events on the grid. Remember to change the random seed when submitting more than one job with the same configuration.
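One way to avoid reusing a seed is to generate the submissions in a loop. A minimal dry-run sketch: it only prints the pathena command from above, with the seed incremented per job. The ".seed<N>" suffix on the output dataset name is a hypothetical naming choice, not a convention from this page.

```sh
#!/bin/sh
# Dry run: print one pathena evgen submission per job, each with its own seed.
# The ".seed<N>" outDS suffix is a hypothetical naming choice.
set -eu

BASE_OUTDS="user09.AkiraShibata.mc08.105802.JF17_pythia_jet_filter.recon.AOD.e347"
JO="./MC8.105802.JF17_pythia_jet_filter.py"

# print_evgen_submissions NJOBS FIRST_SEED
print_evgen_submissions() {
  i=0
  while [ "$i" -lt "$1" ]; do
    seed=$(( $2 + i ))
    echo "pathena --tmpDir /tmp --outDS ${BASE_OUTDS}.seed${seed} --cloud US" \
      "--trf \"csc_evgen08_trf.py 105802 1 5000 ${seed} ${JO} %OUT.evgen.pool.root NONE NONE NONE\""
    i=$(( i + 1 ))
  done
}

print_evgen_submissions 3 5555
```

Once the printed commands look right, the output could be piped to sh in a shell where the PanDA client has been set up.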

Fast Simulation

We will now simulate the evgen files. We will follow the production recipe for tag a84 (see also: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlfastII)

source setup.sh -tag=,AtlasProduction,releases
mkdir $TestArea
cd $TestArea
mkdir -p WorkArea/run
cd WorkArea/run

We will use a transformation called csc_simul_reco_trf.py. If you have set up correctly, you can already invoke it from your terminal to get a help menu. For our purposes, we need the following:

csc_simul_reco_trf.py test105802.evgen.pool.root test105802.AOD.pool.root 5000 0 5555 ATLAS-GEO-02-01-00 1  2 QGSP_BERT jobConfig.VertexPosFastIDKiller.py FastSimulationJobTransforms/FastCaloSimAddCellsRecConfig.py DBRelease=/afs/cern.ch/atlas/www/GROUPS/DATABASE/pacman4/DBRelease/DBRelease-6.5.1.tar.gz

This will extract a lot of database files, so you should run it in your /tmp directory: first copy the evgen file there, then execute the command above.

You might have to reduce the number of events to a smaller value. I don't understand why.
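The local test above can be wrapped in a small helper that creates a fresh scratch directory under /tmp and copies the evgen file into it, so that the database files are extracted there rather than in your AFS area. A sketch; the function name is hypothetical, and the transformation itself still has to be run by hand in an Athena-configured shell.

```sh
#!/bin/sh
# Stage an evgen file into a fresh scratch directory under /tmp, so that the
# database files extracted by the transformation do not land in AFS.
set -eu

# stage_fastsim_workdir EVGEN_FILE: prints the new working directory
stage_fastsim_workdir() {
  workdir=$(mktemp -d /tmp/fastsim.XXXXXX)
  cp "$1" "$workdir/"
  echo "$workdir"
}
```

Typical use: cd "$(stage_fastsim_workdir test105802.evgen.pool.root)", then run the csc_simul_reco_trf.py command given above. Using mktemp avoids clashes with other users' files in the shared /tmp.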

Now send this to PanDA. First set up the client:

source /afs/cern.ch/atlas/offline/external/GRID/DA/panda-client/latest/etc/panda/panda_setup.sh

Then try the following command:

pathena --tmpDir /tmp  --inDS user09.AkiraShibata.mc08.105802.JF17_pythia_jet_filter.recon.AOD.e347 --outDS user09.AkiraShibata.mc08.105802.JF17_pythia_jet_filter.recon.AOD.e347_a84   --cloud US  --trf  "csc_simul_reco_trf.py %IN %OUT.AOD.pool.root 5000 0 5555 ATLAS-GEO-02-01-00 1  2 QGSP_BERT jobConfig.VertexPosFastIDKiller.py FastSimulationJobTransforms/FastCaloSimAddCellsRecConfig.py DBRelease=/afs/cern.ch/atlas/www/GROUPS/DATABASE/pacman4/DBRelease/DBRelease-6.5.1.tar.gz"
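With 5000 events per job, as in the commands above, the sample sizes discussed under Data preparation translate into a large number of grid jobs for each of the evgen and fast-simulation steps. A rough sketch (hypothetical helper name; it ignores failed or retried jobs):

```sh
#!/bin/sh
# Number of 5000-event jobs per step (evgen, then fast simulation) needed
# to reach a target number of events, rounded up. Ignores job failures.
set -eu

EVENTS_PER_JOB=5000

jobs_needed() { echo $(( ($1 + EVENTS_PER_JOB - 1) / EVENTS_PER_JOB )); }

for n in 100000000 1000000000; do
  echo "${n} events -> $(jobs_needed "$n") jobs per step"
done
```

The 1B-event target corresponds to 200000 jobs per step, each of which needs its own random seed at the evgen stage.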

Major updates:
-- TWikiAdminGroup - 27 May 2020

About This Site

Please note that this site is a content mirror of the BNL US ATLAS TWiki.


Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.