r10 - 29 Aug 2007 - 13:04:14 - KaushikDe



Minutes of the Facilities Integration Program meeting, Aug 22, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.


  • Meeting attendees: Horst, Karthik, Rich, Rob, Charles, Marco, Joe, Xin, Hiro, Tom, Alexei, Nurcan, Dantong, Wei, Michael, Jay, Kaushik, Patrick, Bob
  • Apologies: none

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • A very hectic and painful week. Various problems. Logger and database servers failed, but are back up now.
    • The huge backlog of last week was cleared.
    • Production was ramped back up to 3500 jobs, just before the hardware failure of the Panda server.
    • Back online now, ramping back up.
  • Shifters report and other production issues (Nurcan)
    • OSCER showing pending pilots - queues filled by other users.
    • UC Teraport's job manager is hung.
    • People on shift are following up on the new sites.

Operations: DDM (Alexei)

  • Functional testing
    • Ran for two weeks - BNL's DQ2 server crashed often, and there were dcache instabilities and a backlog built up.
    • RAL/UK worked very well. Discovered why Tier0 was so successful - there is a weekly cleaning of the subscriptions database. There will be a dedicated meeting to discuss this further, with Kors.
    • In summary, received 50% of the test data.
  • DQ2 0.4
    • It is under pre-testing right now.
    • Is it visible? Can it be deployed yet? Not quite.
    • Timescale is a few weeks.
  • M4 ESD solutions
    • All agreements since July have been abandoned.
    • AOD will not be produced by T0.
    • Will be organized as datasets, though.
    • What will we distribute to Tier2s?
    • Will start this on Monday morning.
    • We should try to get as much of the ESDs to as many of the sites as possible. 4TB.
    • Should we try dq2-cr?
    • Start with subscriptions.
  • Testbed
    • We need to get machines ready.
    • Will involve upgrade to site services.
    • LRC upgrade? This is a US responsibility. Hiro will look into the progress.
  • AOD replication
  • FT: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listFunctionalTests
  • Back to FTS monitoring - Rod Walker sent links to FTS monitoring tools used at TRIUMF and FZK:

ATLAS releases (Xin)

  • 13.0.20, 13.0.10 have recently been deployed.
  • 13.0.10 slc3 32bit, slc4 32bit. For now, simply deployed slc3 everywhere.
  • Two sites with 64bit platforms require compat libs (teraport, aglt2)
  • Suggestion from D. Quarrie to try slc4 64bit. Note this release still has problems.
  • No one has tested 32bit SLC4 on a 64bit OS.
  • Alessandro's system could be extended to cover OSG sites. US ATLAS sites need to be published to BDII, and a convention decided for the installation area.
  • Pilot-based system as well. Need a working glexec to change the role from a production to installation role.
  • Kaushik points out that we should move towards interoperability.
  • Need to get an update w/ Torre, before making a decision.
  • Publishing info to EGEE, we should discuss this a bit. Start first with the ITB.

Load testing (Jay)

  • Putting in hosts and directories, and is running tests between sites.
  • Addressing firewall problems, investigating number of streams.
  • Goal is to get plots of performance to establish benchmarks.
  • Jay and Dantong will work out firewall issues.
  • Load tests - will do these one at a time. This is designed to run in parallel to production, so it won't be aggressive. Need to assess capabilities and trans

Network Performance (Shawn, Dantong)

  • NetworkPerformanceP1
  • http://netmon.usatlas.bnl.gov/netflow/tier2.html
  • Dantong has results for NDT vs iperf performance. NDT and iperf perform similarly, but the 10-second time interval is not representative.
  • BNL, AGL, UTA have deployed NDT.
  • Has found NDT stability problems.
  • Rich: NDT was designed to give you a starting point to figure out what's wrong. It is also meant to be used in conjunction with other tools.
  • Results (bnl-umich): iperf 157-200 Mb/s, up to 940 Mb/s; NDT ~300 Mb/s.
  • Concern about versions? Shouldn't be much - the basic test engine is the same, though there may be a change in the protocol. Note: on the disk, you have both iperf and NDT.
  • Installation
    • OU - will wait
    • SLAC - there was a security issue, Don Petravic. Need to check with Bob.
    • UC:
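
As a small illustration of the point above that a single short run is not representative, repeated throughput measurements can be summarized by their spread rather than by one number. This is only a sketch: the function name `summarize_throughput` is hypothetical, and the input values are simply the three iperf figures quoted above for bnl-umich, not additional measurements.

```python
from statistics import median


def summarize_throughput(samples_mbps):
    """Return (min, median, max) of repeated throughput measurements in Mb/s."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    return ordered[0], median(ordered), ordered[-1]


# Illustrative input only: the iperf figures quoted in the minutes (Mb/s).
samples = [157, 200, 940]
low, mid, high = summarize_throughput(samples)
print(f"min={low} median={mid} max={high} Mb/s")
```

Reporting the range alongside the median makes it easier to compare iperf and NDT runs taken at different times.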

Analysis Queues (Bob, Mark)

  • Bob and Mark are working on this. Trying to get an ANALY queue put into siteinfo.py.
  • Also considering local analysis pilots.
  • All working.

OSG Integration (Rob)

Some informational pointers:

Site news and issues (All Sites)

Follow-up from last week's news:
  • T1: weekend dcache problems solved.
  • AGLT2: running testjobs, and now starting production jobs. Files highlighted brown - assigned files have not yet been picked up.
  • NET2: no report.
  • MWT2_IU: firmware update on all 500 GB Western Digital drives. All OK.
  • MWT2_UC: all OK
  • SWT2_UTA: new cluster is built. Issue with power cords. Otherwise things are going well. Ready to host DQ2 testbed, has a new machine for this. Happy to test the new LRC.
  • SWT2_OU: all OK; still no date for a final move. Phone conf with Dell on Friday to plan reconfig of cluster, 23 nodes. Storage order to arrive next week. 5 headnodes, 3 segment servers, rest in worker nodes. OSCER still down.
  • WT2: No news. Power outage rescheduled for Aug 29. Will shut down DQ2 server on Tuesday.
  • UC Teraport: job manager taking too long to come back. Gratia still not reporting.

RT Queues and pending issues (Dantong, Tomasz)

  • Decided to defer to next week.
  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.
    • Dantong has guidelines.
    • There are a couple of tickets that are not getting attention.

Carryover action items

  • Encryption to syslog-ng: still to do, carryover.
  • Install NDT at each site - put in site certification table. Follow-up next week.

New Action Items

  • See items in carry-overs and new in bold above.


  • TBD

-- RobertGardner - 21 Aug 2007

ppt facilities-status-aug07_3.ppt (101.5K) | KaushikDe, 29 Aug 2007 - 13:03 | Production Summary Slides