r11 - 01 Aug 2007 - 14:37:29 - RobertGardner



Minutes of the Facilities Integration Program meeting, Aug 1, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • OLD: 1-510-665-5437 #24743
    • NEW: (605) 475-6000, Access code: 735188


  • Meeting attendees: Tom, Kaushik, Shawn, Saul, John Brunelle, Xin, Horst, Karthik, Fred, Wei, Charles, Rich, Rob, Joe, Marco, Patrick, Jay
  • Apologies: Bob Ball (see AnalysisQueueP1 updates); Alexei (see notes below).

Operations: Production

  • Production summary from Kaushik
    • facilities-status-aug07_1.ppt: Production Summary Slides
    • See plot on page 2 comparing previous week to the week just past.
    • Job shortage issue seems to have been fixed.
    • UBC is nearly ready to start production. There were some VOMS issues with the mapping of the production cert. DQ2 site services at TRIUMF were down due to load issues.
  • Shifters report and other production issues (Mark)
    • Sites have been mostly full.
    • The DQ2 central server failed briefly, causing about 800 failures. This is also causing site services to crash. BU, IU, AGLT2, and SLAC are all experiencing problems.
      • Patrick recommends restarting fetcher and agents. Shawn reports this didn't make a difference.
      • Miguel's email claims problem was solved.
      • Wensheng is serving as point person on DQ2 problems.
      • Shawn claims fetcher is not running correctly.
    • UT-Dallas: Mark will follow up. There are Globus issues that are not fully understood, perhaps firewall related.
    • UC Teraport: Greg and Xin are still working through release install issues on RHEL4 x86_64.
  • Site issues
    • IU_ATLAS_Tier2 - has production been paused? Kaushik claims it's enabled in Panda. Gatekeeper is being rebooted - Fred will follow up offline.

Operations: DDM (Alexei)

AOD replication

  • AOD replication to all Tier1 sites has resumed. Replication to Tier2s should start Monday, July 30, according to the plan as of June 18.
  • Shawn reports data flowing into AGLT2.
  • What about queues? Are the defaults sensible? Hiro will send info on this. DONE
    • See his note here.
    • Have all sites implemented this? (AGLT2 changed its production fraction from 40% to 60%. Is this working correctly?)
    • Questions abound...what do the fractions mean? Patrick will send for clarification.
  • Is there interference? Kaushik noticed problems early Monday. Is the weighting scheme working?

Functional tests and Cosmic run

  • Functional tests the second week of August.
  • M4 Cosmic run in Aug/Sep, about 10 TB. Agreement is to distribute this as well to Tier2s.

DDM Testbed

  • Last week: Need to set up a flexible testbed. Rob will develop a little plan for this. DONE

Site certification review for Phase 1

  • Note: Phase 1 report is due by end of week (Rob).
  • The site certification table, SiteCertificationP1 (all sites please update prior to meeting)
  • No Analysis Queues have been set up. Note - there is information from Bob on
  • Load Tests: Jay
    • Has a plugin that runs a number of tests: g-u-c (globus-url-copy), g-j-r (globus-job-run), srmcp, bonnie++, etc. Lots of options.

OSG update

Network Performance (Shawn, Dantong)

  • NetworkPerformanceP1
  • NDT security issues (Shawn)
    • Bob Cowles complained about not consulting site security officers
    • Need to involve them from the beginning.
    • Not aware of a specific problem for NDT. Note - some Web100 sites in the past have not kept up with security patches.
    • Rich - SLAC is the only DOE site that has expressed any concerns. Does not think these kernels (2.6.20) are any more vulnerable than others. No current issue. Don Petravic to lead a discussion on this, including Bob.
  • Dantong will run tests from BNL to each NDT server running at the Tier2s.
    • Three sites available: BNL, UTA, AGLT2
    • Starts a web browser outside the firewall. Finding a discrepancy - not sure if it's a local or network problem.
    • BNL-AGLT2: a 10 second outbound test. ~30 MB/s, far below the results of tests with iperf (Gbps to UM).
    • BNL-UTA: 6.5 Mb/s.
    • BNL-BNL: 928 Mb/s - so local tests don't have problems.
    • Shawn suggests checking "more info" on what NDT was measuring.
  • Need to follow up on this issue in a broader context. Meeting with Bob, Don, Michael, Rob.
  • Continue deploying NDT images.
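
The NDT figures above are quoted in mixed units, so a small conversion helps when comparing them. The following is a minimal sketch, not part of the test setup itself. It assumes "MB/s" means 10^6 bytes per second and that the "Mbs" figures are megabits per second. The function and variable names are hypothetical.

```python
# Sketch: put the NDT results quoted in these minutes on a common scale.
# Assumptions (not stated in the minutes): 1 MB/s = 8 Mb/s, decimal units,
# and "Mbs" in the original notes means megabits per second.

BITS_PER_BYTE = 8


def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert megabytes per second to megabits per second."""
    return mbytes_per_s * BITS_PER_BYTE


# Rates in Mb/s, taken from the bullets above.
measurements_mbps = {
    "BNL-AGLT2": mbytes_to_mbits(30.0),  # quoted as ~30 MB/s
    "BNL-UTA": 6.5,
    "BNL-BNL": 928.0,  # local test, used here as the baseline
}


def fraction_of_baseline(results: dict, baseline: str = "BNL-BNL") -> dict:
    """Return each path's rate as a fraction of the baseline path's rate."""
    base = results[baseline]
    return {path: rate / base for path, rate in results.items()}


if __name__ == "__main__":
    fractions = fraction_of_baseline(measurements_mbps)
    for path, rate in measurements_mbps.items():
        print(f"{path}: {rate:.1f} Mb/s ({fractions[path]:.1%} of local)")
```

Under these assumptions the BNL-AGLT2 result is about 240 Mb/s, which makes the gap to the Gbps-scale iperf results easier to see.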

Carryover action items from previous meetings

  • All sites: check off actions taken in SiteCertificationP1 by this Thursday.
  • Follow up with Shawn and folks on first set of network I/O load tests. Need to install NDT on a dedicated host at each site. Then on-demand tests can take place. Dantong driving this.
  • Follow up with OSG troubleshooting on AGLT2 gatekeeper problem. Shawn and Yuri to follow up.
  • Encryption to syslog-ng: still to do, carryover.
  • Install NDT at each site - put in site certification table. In progress.
  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.

New Action Items

  • Patrick is working on a cleanse script. Marco will commit to CVS when ready.
  • See items in bold above.

-- RobertGardner - 01 Aug 2007

ppt facilities-status-aug07_1.ppt (72.5K) | KaushikDe, 01 Aug 2007 - 10:59 | Production Summary Slides