r3 - 07 Apr 2009 - 11:02:14 - RobertGardner

FacilityWGAPMinutesApr7

Introduction

Meeting of the Facilities working group on analysis queue performance, April 7, 2009

Attending

  • Meeting attendees: Horst, Charles, Paul, Rob, Nurcan

Analysis Queue testing reports

  • Daily stress testing of analysis queues started on April 6th with SUSYValidation jobs described in AnalysisQueueJobTests.
  • A bash script submits 204 jobs to each Tier 2, 1020 jobs in total. Submitting them takes one hour.
  • The script can be found at /usatlas/u/nurcan/StressTest/14.5.0/WorkArea/run/submitter_burstSubmit.sh
  • Input datasets used:
     mc08.105401.SU1_jimmy_susy.recon.AOD.e352_s462_r541/
     mc08.105403.SU3_jimmy_susy.recon.AOD.e352_s462_r541/
     mc08.106400.SU4_jimmy_susy.recon.AOD.e352_s462_r604/
     mc08.105404.SU6_jimmy_susy.recon.AOD.e352_s462_r541/
  • The grid certificate in production is used for job submission. These jobs have output dataset names of the form: user09.NurcanOzturk.StressTest.$name.`uuidgen`
  • The progress of job submission (job id and time stamp) can be seen at /usatlas/u/nurcan/StressTest/14.5.0/WorkArea/run/progress_submission_burstSubmit.txt
  • Analysis job summary can be seen at Analysis dashboard
  • I'll look at the failures today; I need help from site admins and pandashift.
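
The numbers in the report above can be cross-checked with a short sketch. It is not the submitter script itself (that lives at the path given above and is not reproduced here). The function names are hypothetical, and `uuid.uuid4()` is used only as a Python stand-in for the shell's `uuidgen`; the `name` argument is a placeholder for whatever `$name` expands to in the real script.

```python
import uuid

JOBS_PER_SITE = 204        # from the minutes
TOTAL_JOBS = 1020          # from the minutes
SUBMISSION_SECONDS = 3600  # "one hour", from the minutes


def implied_site_count(total_jobs, jobs_per_site):
    """Number of Tier 2 sites implied by the job totals."""
    sites, remainder = divmod(total_jobs, jobs_per_site)
    if remainder:
        raise ValueError("total is not a whole multiple of jobs per site")
    return sites


def seconds_per_submission(total_jobs, duration_seconds):
    """Average wall time spent submitting one job."""
    return duration_seconds / total_jobs


def output_dataset_name(name):
    """Build a name following the pattern quoted in the minutes.

    `name` is a hypothetical placeholder for the script's $name variable.
    """
    return f"user09.NurcanOzturk.StressTest.{name}.{uuid.uuid4()}"


if __name__ == "__main__":
    print(implied_site_count(TOTAL_JOBS, JOBS_PER_SITE))
    print(round(seconds_per_submission(TOTAL_JOBS, SUBMISSION_SECONDS), 2))
    print(output_dataset_name("SU4"))
```

The totals imply five Tier 2 sites and roughly 3.5 seconds per submitted job on average.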

Comments

  • SU4 is the largest dataset
  • Statistics are gathered from the panda database by searching on dataset name.
  • Will continue submissions later today.
  • pathena retries - are they happening?

Analysis queue chart (Rob)

  • Need to set this up.

Additional ANALY queue jobs (Rik)

  • Need a viable TAG job
  • Will get several people to start submitting jobs at MWT2
  • It is difficult to figure out which datasets are at the Tier 2s, especially if you're not sure what you're looking for. The browser is very slow, and so are wildcard searches.
  • mc08 datasets should be available at Tier 2s.

Job metrics

  • Hammer cloud tests - Alden working with Dan
  • Alden working on metrics - expect something soon.

Stress testing plan (revisited)

Recall discussion from last meeting:

Hi all

Below, I summarized what we discussed about the stress test. More input welcome. We are still looking for contributors for preparing this exercise, if you know anyone, please let me know.

Cheers Akira

Ideas for stress test:

  • This exercise will stress test the analysis queues at the T2 sites with analysis jobs that are as realistic as possible in both volume and quality. We would like to make sure that the T2 sites and their analysis queues are ready to accept real data and analyze it. The stress test will be organized sometime near the end of May.

Basic outline of the exercise:

  • To make the exercise more useful and interesting, we will generate and simulate (Atlfast-II) a large mixed sample at the T2s. We are currently trying to define the job for this, and we expect it to be finalized after the BNL jamboree next week. The mixed sample is a blind mix of all SM processes, which we call "data" in this exercise. For the one-day stress test, we will invite people with existing analyses to try to analyze the data using T2 resources only. It was suggested to compile a list of people who are able to participate.

Estimates of data volume:

  • A very rough estimate of the data volume is 100M-1B events. Assuming 100 kB/event (realistic considering no truth info and no trigger info), this gives 10-100 TB, i.e. an upper limit of 100 TB in total. It was mentioned that this is probably also the upper limit set by the current availability of USER/GROUP disk on the T2s (which is in addition to MC/DATA/PROD and CALIB disk), but this needs to be checked.
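
The arithmetic behind this estimate can be written out as a minimal sketch. It assumes decimal units (1 kB = 1000 bytes, 1 TB = 10^12 bytes), which the minutes do not specify; the function name is hypothetical.

```python
KB = 1000       # bytes, decimal convention (an assumption)
TB = 1000 ** 4  # bytes, decimal convention (an assumption)


def sample_volume_tb(n_events, bytes_per_event=100 * KB):
    """Total sample size in TB for a given event count and event size."""
    return n_events * bytes_per_event / TB


if __name__ == "__main__":
    for n_events in (100_000_000, 1_000_000_000):
        print(n_events, "events:", sample_volume_tb(n_events), "TB")
```

With 100 kB/event the two ends of the range come out at 10 TB and 100 TB.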

Estimate of computing capability:

  • Right now there are "plenty" of machines assigned to analysis, though the current load on the analysis queues is rather low. The computing nodes are usually shared between production and analysis and are typically configured with an upper limit and a priority. For example, MWT2 has 1200 cores and is set up to run analysis jobs with priority, with an upper limit of 400 cores. If production jobs are not coming in, the number of running analysis jobs can exceed this limit.
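
The sharing behaviour described for MWT2 can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual batch scheduler configuration: it assumes analysis is always granted up to its cap, and may go beyond the cap only onto cores that production does not claim. The function and parameter names are hypothetical.

```python
def analysis_cores(total_cores, analysis_cap, analysis_demand, production_demand):
    """Toy model of cores available to analysis on a shared cluster.

    Assumption: analysis has priority up to `analysis_cap`; beyond the
    cap it can only use cores left unclaimed by production.
    """
    unclaimed_by_production = max(total_cores - production_demand, 0)
    ceiling = max(analysis_cap, unclaimed_by_production)
    return min(analysis_demand, ceiling, total_cores)


if __name__ == "__main__":
    # Cluster saturated with production: analysis is held at the cap.
    print(analysis_cores(1200, 400, 600, 1200))
    # No production jobs coming in: analysis exceeds the cap.
    print(analysis_cores(1200, 400, 600, 0))
```

In this model, 600 queued analysis jobs run on 400 cores when production fills the cluster and on all 600 when production is idle.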

Site configuration:

  • Site configuration varies among the T2 sites. For this exercise, it is useful to identify which configuration is most efficient at processing analysis jobs. It was suggested that a table be compiled showing the basic settings of the analysis queue at each site.

Pre-stress-test test:

  • To make the most of the exercise and avoid stumbling on trivial issues during the stress test, a pre-stress-test exercise was suggested. It was requested that the people responsible for each site be notified before a large number of jobs is launched.

To do:

  • Data generation/simulation job to be defined by Akira
  • List of possible participants to be compiled by Rik
  • A table of site configuration to be produced by Rob
  • Someone to define pre-stress-test test routine

AOB

  • Next meeting in two weeks.


-- RobertGardner - 06 Apr 2009
