r14 - 28 Aug 2012 - 15:31:27 - SaulYoussef

Analysis Queue Performance

Goals

Pressure on resources keeps growing and the number of analysis jobs is increasing. Unlike production jobs, whose resource consumption we more or less understand, analysis jobs differ wildly and can put much higher stress on sites. There is no single framework used by all physicists that we could instrument to find usage patterns or measure performance. We do, however, have HC functional tests submitted to all the sites. Using the results of these tests, complemented with information from gratia and/or other site-collected metrics, we can try to understand the optimal way to run analysis queues at each site separately. Here we will try to document the knowledge collected.

People and sites

In HC tests these US ATLAS ANALY queues are included:

  • ANALY_AGLT2
  • ANALY_BNL_ATLAS_1
  • ANALY_HU_ATLAS_Tier2
  • ANALY_MWT2
  • ANALY_NET2
  • ANALY_OU_OCHEP_SWT2
  • ANALY_SWT2_CPB

- all T2Ds except SLAC. Should any others be included?

People

https://oim.grid.iu.edu/oim/resource

Geographically:

  • Northeastern US:
    • BNL T1
    • Boston U & Harvard U => NET2
  • Midwest US:
    • UChicago & U of Illinois & Indiana U => MWT2
    • U of Michigan & Michigan State U => AGLT2
  • Southwest US:
    • UTA, OU => SWT2
  • Western US:
    • SLAC (Tier 2)

TO DO list

  1. Collect the addresses of all the people needed at the meeting. Will use usatlas-t2-l@lists.bnl.gov
  2. Document all the resources. Only three sites have filled out the forms.
  3. Collect baseline results.
  4. Prepare a fast and simple web interface to the most important results. Half done.
  5. Add e-mail alarms.
  6. Get the number of finished and queued jobs per hour per site.
  7. Try to get a quantitative measure of the improvements made.
  8. Do data mining on the dCache billing DB, to understand whether it can give us useful information in this context.
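
For item 6, a minimal sketch of the per-hour, per-site counting is given here. It assumes the job records have already been extracted as (site, status, timestamp) tuples. That layout, the function name `jobs_per_hour` and the sample records are made up for illustration and are not the actual HC or gratia output format.

```python
from collections import Counter
from datetime import datetime


def jobs_per_hour(records):
    """Count jobs per (site, hour, status).

    `records` is an iterable of (site, status, timestamp) tuples.
    This record layout is an assumption made for illustration only.
    """
    counts = Counter()
    for site, status, ts in records:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(site, hour, status)] += 1
    return counts


# Made-up sample records, not real test results.
sample = [
    ("ANALY_MWT2", "finished", datetime(2012, 6, 1, 10, 5)),
    ("ANALY_MWT2", "finished", datetime(2012, 6, 1, 10, 40)),
    ("ANALY_MWT2", "queued", datetime(2012, 6, 1, 10, 59)),
    ("ANALY_AGLT2", "finished", datetime(2012, 6, 1, 11, 15)),
]

counts = jobs_per_hour(sample)
for (site, hour, status), n in sorted(counts.items()):
    print(site, hour.isoformat(), status, n)
```

The same table of counts could also feed the e-mail alarms of item 5, for example by flagging a site whose finished count stays at zero for several consecutive hours.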

Issues

  1. The 24-core AGLT2 machines have not given any results since 1st Jun.
  2. MWT2 was not getting any analysis jobs since 1st Jun. Solved by Sarah et al.
  3. MWT2 has been giving mostly empty results since 8th May and still has an unexpectedly small number of finished jobs. To be investigated.
  4. MWT2 - the number of jobs processed increased a lot after switching to copy-to-scratch. To be understood.
  5. NET2 HU, BU - need to understand the sites' configuration and ANALY queues.
  6. Sudden increase in CPU times, but not at all sites. Was it due to switching to compiled read.C code?
  7. Understand the 120+ stage-out and 330+ stage-in times in copy-to-scratch mode. Partly done.
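
One way to check which sites are affected by the CPU time increase in issue 6 is to compare the mean CPU time per site before and after a suspected change date. The sketch below assumes samples of the form (site, day index, CPU seconds). The function name `cpu_time_ratio`, the record layout and all numbers are hypothetical and only illustrate the comparison.

```python
from statistics import mean


def cpu_time_ratio(samples, change_day):
    """Return {site: mean CPU time after the change / mean before it}.

    `samples` is an iterable of (site, day, cpu_seconds) tuples; this
    layout is assumed for illustration. Sites without data on both sides
    of `change_day` are skipped.
    """
    before, after = {}, {}
    for site, day, cpu in samples:
        bucket = after if day >= change_day else before
        bucket.setdefault(site, []).append(cpu)
    return {
        site: mean(after[site]) / mean(before[site])
        for site in before
        if site in after
    }


# Made-up numbers, not measured values.
samples = [
    ("ANALY_MWT2", 1, 100.0), ("ANALY_MWT2", 2, 110.0),
    ("ANALY_MWT2", 5, 210.0), ("ANALY_MWT2", 6, 210.0),
    ("ANALY_AGLT2", 1, 100.0), ("ANALY_AGLT2", 2, 100.0),
    ("ANALY_AGLT2", 5, 100.0), ("ANALY_AGLT2", 6, 110.0),
]

ratios = cpu_time_ratio(samples, change_day=4)
for site, ratio in sorted(ratios.items()):
    print(site, round(ratio, 2))
```

A ratio well above 1 at some sites and close to 1 at others would match the "not at all sites" observation and point to where to look first.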

Results

Resources

Site configuration

Three Google Forms were made to collect this information. You can view or change your entries at any time. Please let me know if anything important is missing.

Monitoring resources

  • HC
    • tests - look for template: "443: ROOTIOTests SVN eGamma 16.6.7 Panda"
    • results
  • gratia link

Per site

Admin tips & tricks

Meetings

Presentations

-- Main.ivukotic - 01 Jun 2012
