


Meeting of the Facilities working group on analysis queue performance, Feb 24, 2009


  • Meeting attendees: Rob, Marco, Patrick, Jim C, Charles, Nurcan, Saul, Xin, Torre, Paul, Wei, Horst, Fred

Meeting logistics

  • Time: switch to 9:30 CST / 10:30 EST
  • Frequency: bi-weekly
  • An ad-hoc phone bridge will be used.

Working group goals (Rob)


  • Support testing and validation of analysis queues by Nurcan, US ATLAS physicists, and others
  • Address issues uncovered during HammerCloud and other functional tests ongoing in ATLAS
  • Measure analysis queue performance against a number of metrics.

An overarching goal is to assess the facility's readiness for analysis workloads:

  • In terms of scale - response to large numbers of I/O intensive jobs
  • In terms of stability and reliability of supporting services (gatekeepers, doors, etc)
  • In terms of physics-throughput efficiency


  • Use this meeting to spot systematic problems across sites
  • Provide feedback to site administrators
  • Provide feedback to ADC development team
  • Provide feedback to ADC operations


  • Not a general user-support meeting (debugging user transformations, for example) but a forum for understanding systematic problems with the facility infrastructure and distributed services supporting analysis.


  • As facility or distributed-software issues are reported on HyperNews and other discussion lists, we will track them to resolution, liaising with administrators, developers, and outside providers as appropriate.

Job archetypes

  • We want a good mix of well-defined job types with which to establish baseline performance
    • Tutorial examples (FDR-like)
    • TAG selection jobs
  • Data access methods
    • direct reading
    • copy to local work scratch
    • SE type (dccp, dcap, bm-xrd, bm-gpfs, bm-nfs)
  • Queue measurement jobs via prun (a payload sketch follows this list)
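
As a starting point for the prun-based queue measurements, here is a minimal payload sketch that times a write and a re-read in the job work directory. The script name, the 1 GB test size, and the example prun command (dataset and site names are placeholders) are illustrative assumptions, not choices made at this meeting.

    # Illustrative queue-measurement payload (iotimer.py is a made-up name):
    # time a write and a re-read of a test file in the job work directory.
    # Example submission, with placeholder dataset and site names:
    #   prun --exec "python iotimer.py" --outDS user.<nick>.queuetest.001 --site ANALY_<SITE>
    import os
    import time

    def time_io(path, size_mb=1024, block=1024 * 1024):
        """Write and then re-read size_mb megabytes at path; return (write_s, read_s)."""
        buf = b"x" * block
        t0 = time.time()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.time() - t0

        t0 = time.time()
        with open(path, "rb") as f:
            while f.read(block):
                pass
        read_s = time.time() - t0
        os.remove(path)
        return write_s, read_s

    if __name__ == "__main__":
        # The pilot runs the payload from the job work directory.
        w, r = time_io("iotest.dat")
        print("workdir write: %.1f s  read: %.1f s" % (w, r))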

Queue metrics (examples)

  • Completion times for reference benchmarks (e.g. Alessandro's swmgr.sh kit validation for evgen, sim, digi, reco of Z-->ee)
    • Characterize queues
  • Standardized I/O tests to local disk (job work dir) and to the queue's storage element (as a function of protocol)
  • Task & job latency testing: elapsed user response times for a reference job set.
    • Submission to queued
    • Time in queue
    • Total elapsed to completion
    • View as distributions of jobs versus queue
    • View as distributions of job-sets versus queue
  • Measurement modality:
    • Regular testing (robot-like)
    • Stress testing
    • Response versus scale (for example, queue response to I/O-intensive job sets of 10/100/200/500/1000 jobs of various types) to determine breaking points
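
For the latency metrics above, a minimal bookkeeping sketch is shown below. It assumes per-job timestamps have already been extracted from the Panda job records; the field names ('queue', 'creation', 'queued', 'start', 'end') are illustrative, not the actual database columns.

    # Sketch of per-queue latency distributions; the dict keys are illustrative
    # names for the timestamps we would pull from the Panda job records.
    from collections import defaultdict

    def latency_distributions(jobs):
        """jobs: iterable of dicts with datetime values. Returns
        {queue: {metric: [seconds, ...]}} for the three latency metrics."""
        dists = defaultdict(lambda: defaultdict(list))
        for j in jobs:
            q = j["queue"]
            dists[q]["submit_to_queued"].append((j["queued"] - j["creation"]).total_seconds())
            dists[q]["time_in_queue"].append((j["start"] - j["queued"]).total_seconds())
            dists[q]["total_elapsed"].append((j["end"] - j["creation"]).total_seconds())
        return dists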

Deliverables (preliminary)

  • March 15: Specify and test a well-defined set of job archetypes representing likely user analysis workflows hitting the analysis queues.
  • March 31: Specify the set of queue metrics and make test measurements
  • April 15: Facility-wide set of metrics taken


  • Need the ability to look at old analysis jobs - expected to improve with Oracle
  • Oracle migration expected in a few weeks

TAG selection job problems (Nurcan)

Report from Nurcan on jobs using panda-client-0.1.8 and inDS=fdr08_run2.0052283.physics_Egamma.merge.TAG.o3_f8_m10:

  • TAG selection jobs work at AGLT2, SLAC, and BNL, but do not work at UTA, OU, NET2, or MWT2. The jobs at the latter sites still end with a status of finished; however, PFC.xml is empty, so no associated AODs are inserted and the TAG selection did not actually run. These jobs should have a status of failed (a minimal post-job catalog check is sketched after this item).
    • Paul and Tadashi are discussing this. Should Athena fail the job, rather than the pilot?
    • See Marco's attached report - there may be an option in the job options to throw a failure if a file is missing. Marco will follow up regarding NET2.
    • UTA - is the pilot back-navigating into xrootd (via a root URL, as at SLAC)?
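
A minimal sketch of the kind of post-job check being discussed: parse the POOL file catalog (PFC.xml) produced by the TAG selection step and fail if it lists no AODs. The File/physical/pfn layout is the standard POOL XML catalog; matching "AOD" in the PFN and the exit-code convention are illustrative assumptions, not the agreed pilot or Athena behaviour.

    # Hedged sketch: fail the job if the POOL file catalog written by the TAG
    # selection step contains no AOD entries (so "finished" jobs with an empty
    # PFC.xml would be flagged instead of silently succeeding).
    import sys
    import xml.etree.ElementTree as ET

    def count_aod_entries(catalog="PFC.xml"):
        root = ET.parse(catalog).getroot()
        pfns = [p.get("name", "") for p in root.findall(".//File/physical/pfn")]
        # Illustrative heuristic: count PFNs that look like AOD files.
        return sum(1 for name in pfns if "AOD" in name)

    if __name__ == "__main__":
        n = count_aod_entries()
        if n == 0:
            print("PFC.xml has no AOD entries - TAG selection did not run; failing job")
            sys.exit(1)
        print("PFC.xml lists %d AOD file(s)" % n)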

  • Error at UTA, OU, NET2 (PandaIDs 26835256, 26835254, 26835252): ImportError: /atlasgrid/osg-wn-1.0.0/lcg/lib64/python/_lfc.so: cannot open shared object file: No such file or directory. Comment from Tadashi: this message means that the 32-bit Python in Athena failed to import lfc.py because of lib64/python/_lfc.so, so the native 64-bit Python was used instead. The plugin was then tried with the reported LD_LIBRARY_PATH but failed. His impression is that the 64-bit OSG WN-client runtime was broken by something in 32-bit Athena. Tadashi reported that this works at CERN, TRIUMF, etc., which use a 64-bit OS. What Tadashi needs done: check whether the shared object file exists, and then check whether lfc.py can be imported in the 64-bit WN-client runtime and in the 32-bit Athena runtime (a diagnostic sketch follows).
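
A hedged diagnostic along the lines of the checks Tadashi asked for, to be run in both the 32-bit Athena and 64-bit WN-client runtimes on an affected worker node. The .so path is the one quoted in the error message; everything else is illustrative.

    # Report the Python bitness, check that the shared object exists, and try
    # the lfc import under the current environment.
    import os
    import struct
    import sys

    SO_PATH = "/atlasgrid/osg-wn-1.0.0/lcg/lib64/python/_lfc.so"

    print("python %s (%d-bit)" % (sys.version.split()[0], 8 * struct.calcsize("P")))
    print("_lfc.so exists: %s" % os.path.exists(SO_PATH))
    print("LD_LIBRARY_PATH: %s" % os.environ.get("LD_LIBRARY_PATH", "<unset>"))

    try:
        import lfc  # LFC python bindings shipped with the WN-client
        print("lfc imported OK from %s" % getattr(lfc, "__file__", "?"))
    except ImportError as exc:
        print("lfc import failed: %s" % exc)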

  • Error at MWT2: the error above is not seen; the job only prints: RuntimeWarning: Python C API version mismatch for module _lfc: This Python has API version 1013, module _lfc has version 1012. PFC.xml is empty. PandaID=26835630.
    • Probably the AOD dataset (FDR files) is missing at the site. Charles will re-subscribe.

  • Error at ANALY_BNL_test: send2nsd: NS002 - send error : client_establish_context: Could not find or use a credential ERROR : LFC access failure - Bad credentials. The VOMS proxy (valid for one day) expired before the pilot (new pilots sent by Paul) picked up the job.
    • Not sure what is causing this - whether it is transient or not. It works at ANALY_BNL_ATLAS_1.

  • Note: at ANALY_BNL_ATLAS_1 the ESD files are also inserted into PFC.xml together with the AODs - any idea why?
    • Seems like a transformation "bug", but why does it happen at only one site? Marco believes it only happens if the ESDs are present at the site. The TAG group is working with Tadashi on limiting the list of files included.

  • The BNL VOMS server is not recognizing the nickname. Horst will ping John Hover.

BNL queue scheduling issue

  • https://rt-racf.bnl.gov/rt/Ticket/Display.html?id=11880
  • The long and short queues each have 420 slots allocated, but we are not occupying all of them. It is a problem with Condor: pilots are stuck waiting for a slot for long periods (currently ~40 minutes), as last reported on 2/16.
    • A proxy is needed, so we had to switch to pilot submission.
    • Xin believes there were too many stage-out jobs running simultaneously; the Condor submit host was moved to a more powerful machine on Feb 16, and the problem has not been seen since.
    • Things look fine this morning, but we need to check regularly (see the condor_q sketch below).
    • An API is available to get the number of running jobs.
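
A simple check that could be run regularly on the Condor submit host to spot pilots stuck waiting for slots. condor_q -format is standard Condor; restricting the query to the analysis pilots (by owner or constraint) is left as a site-specific choice and is not shown here.

    # Count idle/running/held jobs in the Condor queue on the submit host.
    import subprocess
    from collections import Counter

    STATUS = {1: "Idle", 2: "Running", 5: "Held"}

    def condor_job_counts():
        out = subprocess.check_output(
            ["condor_q", "-format", "%d\n", "JobStatus"]).decode()
        codes = [int(line) for line in out.splitlines() if line.strip()]
        return Counter(STATUS.get(c, "Other(%d)" % c) for c in codes)

    if __name__ == "__main__":
        for state, count in sorted(condor_job_counts().items()):
            print("%-10s %d" % (state, count))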

Time scale and scope of next stress test

  • The plan above looks okay.
  • We would like to run a stress test and report at the next software workshop, if possible.
  • Monitoring should be close to what HammerCloud does (HammerCloud is not yet running on Panda sites). There is a development version in the ARDA dashboard to display Panda analysis jobs. Job timings are available via the Panda database - Sergei has query tools to capture these.
  • Which dataset to use? Reprocessed datasets are being distributed now.
  • Suggestion: Marco and Nurcan will define the jobs and run some tests.
  • Large datasets? Perhaps use container datasets.


  • Fred: pilot jobs are not setting up CMT properly. He will send the thread to Paul.
  • Next meeting 9:30 am CST, March 10.

-- RobertGardner - 16 Feb 2009



Attachment: pathena-report-marco.txt (1.1K) - MarcoMambelli, 24 Feb 2009 - 11:33