r5 - 20 Aug 2008 - 08:46:26 - RobertGardner

MinutesJune4

Introduction

Minutes of the Facilities Integration Program meeting, June 4, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Charles, Rob, Shawn, Sarah, Rich, Justin, John, Horst, Karthik, Bob, Mark, Kaushik, Nurcan, Wei, Xin, Hiro, Fred, John Hover
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Out of jobs once again. There is a new sample to fill the queues, but there are problems with DQ2 callbacks. Expect scout jobs to finish quickly; after that there will be 50K jobs.
  • Overall, 5M single particle events - short jobs - being defined now.
  • Unhappy about lack of planning and information dissemination.
  • Validation samples keep coming in - low quantity, but important jobs.
  • Re-processing: 3 issues discovered:
    • Panda mover timeouts were optimized for jobs with 2 GB files. The dbrelease file is 4 GB, causing timeouts. Tadashi increased the timeout and will add a modification to adjust it to the file size.
    • LCG-Utils: lcg-ls was being used, and a signed-integer bug gave it a maximum value of 3 GB. A newer release fixes the problem. Xin to check the LCG-Utils in the new OSG worker-node client with Wensheng.
    • Bamboo starvation and site-specific scheduling.
  • Rod asked to define a task for reprocessing at Tier2 sites.
  • Michael suggests we summarize the experience and make available to ATLAS.
  • What about Oracle database access? The BNL oracle instance will be used. What is the fall back? Use Triumf's instance. This is written into the transformation.
  • Reprocessing job: 2 GB input file; 4.5 GB dbrelease file. Immediately after untarring.
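
The planned change to make the mover timeout follow the file size could look roughly like the sketch below. This is only an illustration, not the Panda mover code: the function name transfer_timeout, the 600 s base value and the 1 MB/s worst-case rate are all assumptions chosen for the example.

```python
# Hypothetical sketch: scale a transfer timeout with file size instead of
# using one fixed value tuned for 2 GB files. None of these names or
# numbers come from the Panda mover itself.

GB = 10**9  # decimal gigabyte, used only for this example


def transfer_timeout(file_size_bytes, base_timeout_s=600,
                     min_rate_bytes_per_s=1_000_000):
    """Return a timeout in seconds: a fixed base allowance plus the time
    needed to move the file at an assumed worst-case transfer rate."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    return base_timeout_s + file_size_bytes // min_rate_bytes_per_s


if __name__ == "__main__":
    for label, size in [("2 GB input file", 2 * GB),
                        ("4.5 GB dbrelease file", 4_500_000_000)]:
        print(f"{label}: {transfer_timeout(size)} s")
```

With these assumed parameters the larger dbrelease file simply gets a proportionally longer allowance, so no per-file tuning is needed when file sizes change again.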

Shifts (Mark)

  • Welcoming Marco back into the mix - thanks.
  • Not many jobs, but expect the number to increase shortly.
  • AGLT2 - small bug in site mover - Paul fixed.
  • Autopilot ready for MWT2 - expected by the end of the week.
  • 22441 - large number of failed jobs, don't worry.
  • Heavy Ion jobs - 'looping job killed by pilot' - was this issue resolved? Kaushik believes HI tasks redefined by Pavel increased to 200K. For dedicated sites, keep queues with very long wall-time limits.

Analysis queues, FDR analysis (Nurcan)

  • Expecting increase in activity w/ FDR2 data - w/ release 14.
  • Have requested a validation data sample to be replicated. Done. Will do validation.
  • Helping with tutorial at Vancouver workshop. Working with Akira.
  • Will run DPD maker package.

Operations: DDM (Hiro)

RSV probe updates

  • Upgrade takes about 30 minutes. Sarah will circulate instructions.
  • Fred: a couple of sites are not passing the CE monitoring. AGLT2 (failing one test) and BU_ATLAS_Tier2 (failing everything - will look into it).
  • Tomasz is working on RSV-Nagios probe.

WLCG accounting

SRM v2 and Space Tokens

  • Follow-up:
  • OU - will not update to SRM v2.2 until new storage arrives.
  • Which roles should space tokens support? Role usatlas production vs atlas production. Two roles? And the mappings are different. Is there only one binding between the attribute and the space token?
  • Note - jobs are being defined w/ space tokens.
  • Enable multiple roles in the certificate?
  • Unify all production with the simple atlas production.

Pilot upgrade for space tokens

  • AGLT2 has a site setup; SE - ATLASMCDISK, and an SE path. In contact with Paul. SE prod path.
  • ATLASENDUSER disk also included. (At AGLT2, put 18 TB)

Unified LHC client (Marco)

  • Available for testing, see: WlcgClient
  • Issue that has come up is the new set of "dash".

LFC status (John)

  • Almost up and running - off an Oracle cluster backend.
  • Then Hiro will test an LRC migration script.
  • Then will decide on an exact migration path. Lazy migration.
  • Don't expect any issues with Panda integration.
  • We need to come to a formal decision about our own deployment model. If we stick to our existing model, we should prepare arguments, and the converse. Fault tolerance and scalability issues. Suggestion is to revisit.
  • Revisit in 3 weeks

Next procurements

  • Standing agenda item, see CapacitySummary.
  • ATLAS meeting on benchmarks: http://indico.cern.ch/conferenceDisplay.py?confId=34293
  • Looking at the pledges - we're short by about 20% in summary, but at specific sites there are severe shortcomings. In 2008-2009 we need to double, from 1.5 to 2.5 PB, a significant growth.
  • 5 to 6.3 MSI2K is a minor step, while storage grows by a factor of 2.
  • Kaushik should give guidance about what production and analysis will need.
  • Jim has given requirements as well. Will start from those numbers.
  • Deployed by September 15
  • Implies need to go out for bids in July.
  • Action item next week for rough guidance.
  • Want to be finished with this by end of June - technology and how much
  • Need specifications from Internet2 for network monitoring hosts. Encourage Rich to get these specifications by next week.

OSG 1.0

  • Expect release next week - some early testing at MWT2 in advance of release.

Throughput initiative - status (Shawn)

Nagios monitoring subcommittee (Dantong)

  • WT2 and SWT2 will be reporting available space.
  • Tomasz organizing a meeting to test globus-job-run.

Release installation via Pacballs (Xin)

  • Follow-up
  • Progress - a meeting this morning to discuss this. Fred - hoping this week to have first set of pacballs installed in DQ2. Will test with some older releases on some test machines.
  • Need official naming scheme.
  • Get installed with a special Panda pilot job using the software role. Expect performance to improve.
  • Expect a couple of weeks of testing.
  • Goal to bring into production by end of the month.

Site news and issues (all sites)

  • T1: lots of activities last week regarding FDR preparation, mixing jobs, and a group studying triggers. Busy deploying storage and network infrastructure (Foundry core, 2 Force10s) for connection to 10G thumpers. Expect farm extension this week (3M SI2K).
  • AGLT2: Busy with FDR2 calibration work. NFS-lock problems with SQLite databases. Site issue? Need to follow up with ATLAS on this file locking.
  • NET2: All is well.
  • MWT2: All is well.
  • SWT2 (UTA): All is well.
  • SWT2 (OU): Got replacement server for gatekeeper, to be installed. 10G switch to be connected still.
  • WT2: All is well.

AOB

  • Updated instructions for LRC data deletion.


-- RobertGardner - 27 May 2008

  • Panglia for the week:
    week4June2008.png


Attachments


png week4June2008.png (43.8K) | RobertGardner, 04 Jun 2008 - 07:56 | Panglia for the week