r4 - 20 Aug 2008 - 08:46:25 - RobertGardner



Minutes of the Facilities Integration Program meeting, February 13, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Shawn, Michael, Rob, Charles, John, Horst, Kaushik, Mark, Jay, John, Saul, Fred, Tom, Bob, Nurcan, Xin, Wensheng, Gabriele, Patrick, Hiro, Torre
  • Apologies: none

Integration program update (Rob, Michael)

Next procurement

  • Contingent on having the funding.
  • Expect to have full funding for FY08, available to us in a short while. Next couple of weeks.
  • Discuss technologies available, benchmarking.
  • Q: how do we spend it - all at once, sooner rather than later? WLCG wants all sites ready w/ 08 pledges by April 2008.
  • Also - the decision about priority of job slot versus storage. Do need to look at the WLCG commitment.
  • Action item - Rob: Compare pledge amounts to current capacity.
  • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • There was a figure of 20% above the pledge - should this be revisited?
  • One versus two purchases, etc.
  • We may be short by about 20% for storage.
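For the action item on comparing pledges to current capacity, the arithmetic can be sketched as below. This is only an illustration: the function name `shortfall_fraction` and all numbers are hypothetical and are not taken from the minutes or from any pledge table.

```python
# Hypothetical helper; figures below are placeholders, not real site numbers.
def shortfall_fraction(pledge, capacity, headroom=0.0):
    """Return the fraction of the target (pledge plus headroom) not yet covered.

    headroom=0.20 models a target set 20% above the pledge.
    A result of 0.0 means the target is met or exceeded.
    """
    target = pledge * (1.0 + headroom)
    if target <= 0:
        raise ValueError("pledge must be positive")
    return max(0.0, (target - capacity) / target)


if __name__ == "__main__":
    # Placeholder example: 1000 TB pledged, 800 TB installed.
    print(f"vs. bare pledge:     {shortfall_fraction(1000, 800):.0%}")
    print(f"vs. pledge plus 20%: {shortfall_fraction(1000, 800, headroom=0.20):.0%}")
```

Running the same function separately for CPU and storage would show which side of the split is further from its pledge.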

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Going quite well for the past week.
    • No agreement from ADC on a few days of shutdown. We've been taking jobs from other clouds - we'll probably run out of MC jobs tomorrow.
    • How long will the quiet period last? Not sure.
    • Michael: Note - we need to be getting ready for analysis, this is the top priority. Also, not convinced we are prepared for the ramp-up of use of our storage elements.
    • Kaushik: please send tasks that may be needed for the next week.
    • Follow-up items
      • Issue of md5sum - for DQ2 0.6 we would like adler32 (implemented in DQ 0.5.2). Hiro will work with Charles on testing the upgrade.
        • New FDR data has adler32 checksums, but they are going into the LRC as md5sum, so the files will show up as corrupted.
        • Hiro - it's a small change in the LRC and web interface: adds an adler32 column. No checksum type? What is the effect on the Panda server? Will need to verify at BNL. Also need to fix pilots. Need to upgrade DQ2.
      • Follow-up on Eowyn scalability problems, roll-out update for Panda-based system. Progress being made: Bamboo now exists and is being tested by Rod. Tadashi will switch.
  • Production shift report (Nurcan)
    • All is well.
    • Handshaking w/ European ADC.
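On the md5sum/adler32 follow-up item above: the two checksums differ in length and value, so an adler32 value stored in a field that is later compared as md5 will never match. The sketch below uses only the Python standard library (`zlib.adler32`, `hashlib.md5`). The function names and the `checksum_type` argument are illustrative assumptions, not the LRC, DQ2, or pilot interfaces.

```python
import hashlib
import io
import zlib


def stream_checksums(stream, chunk_size=1 << 20):
    """Return (adler32_hex, md5_hex) for a binary file-like object, read in chunks."""
    adler = 1  # Adler-32 running value starts at 1
    md5 = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        adler = zlib.adler32(chunk, adler)
        md5.update(chunk)
    return f"{adler & 0xFFFFFFFF:08x}", md5.hexdigest()


def checksum_matches(stream, catalog_value, checksum_type):
    """Compare data against a catalog entry, using the recorded checksum type.

    checksum_type is a hypothetical field: "adler32" or "md5".
    """
    adler_hex, md5_hex = stream_checksums(stream)
    computed = {"adler32": adler_hex, "md5": md5_hex}[checksum_type]
    return computed == catalog_value.lower()


if __name__ == "__main__":
    data = b"example file contents"
    adler_hex, md5_hex = stream_checksums(io.BytesIO(data))
    print("adler32:", adler_hex, "md5:", md5_hex)
    # An adler32 value checked as if it were md5 fails, so the file looks corrupted.
    print(checksum_matches(io.BytesIO(data), adler_hex, "md5"))
```

With real files, the same functions can be called on `open(path, "rb")`.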

LFC (John)

  • Following up:

Operations: DDM (Kaushik/Hiro)

  • Status of FDR AOD replication for analysis at Tier2s
    • Most sites working except UC (dCache).
    • BU (DQ problem - fixed).
    • AGLT2 - restarted things this morning after DQ2 queue recreated.
    • Hiro will make a DQ2 monitor for test subscriptions.
  • Status of CCRC datasets at Tier2s
    • Not started yet, maybe later today - Hiro.
    • What data will be distributed? M5 data? If so, won't be able to re-process at Tier2s. Are there other use-cases (apart from random data)?
    • Will use large files, optimized for data replication.
  • Follow-ups from last week (not covered; follow up next week):
    • Issue of AODs on Tier2s before FDR
      • Suggestion was to delete them all, and start over. Can we do this in a more selective fashion?
      • What should we be doing? Patrick and Hiro will discuss w/ Miguel. Charles interested in contributing.

Analysis Queues (Nurcan)

  • See AnalysisQueues
  • Latest test results by site
    • SWT2_CPB site available
    • MWT2 running
    • NET2, SLAC - still need test jobs.
    • Need to patch some releases (rel 12) at OU
    • Create site-level notes at AnalysisQueues
  • Follow-up on: compatibility libraries for 64bit OS
    • Is there a definitive list? Update on site-by-site basis AnalysisQueues page, feedback to SIT group.
  • Another analysis package in progress - high PtView
    • Requires 13.0.40, about to be released.
  • Fred suggests 13.0.30 should be checked, and potentially updated on all sites.

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Inconsistencies noted by Shawn - there were differences stemming from when the data get uploaded.
  • Follow-up from last meeting:
    • Follow-up - last week: SWT2_UTA (Patrick) - Now being accounted. DONE. Still need to add the SWT2_CPB.
    • US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.

Throughput initiative - status (Shawn)

  • Update from meeting this week
  • Tests continuing, see LoadTestsP5
  • Going through site-by-site. See table for storage endpoints and network diagrams.
  • Informing optimization in terms of RTT (round-trip time) and number of streams.
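As background to the round-trip-time and stream-count tuning: a single TCP stream cannot exceed roughly window size divided by round-trip time, so longer paths need larger windows or more parallel streams. The sketch below is an idealized estimate that ignores packet loss and congestion. The function names and example numbers are assumptions for illustration, not measurements from the load tests.

```python
import math


def per_stream_limit_mbps(window_bytes, rtt_ms):
    """Idealized ceiling for one TCP stream: window / round-trip time, in Mbit/s."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def streams_needed(target_mbps, window_bytes, rtt_ms):
    """Smallest number of parallel streams whose combined ceiling reaches the target."""
    return math.ceil(target_mbps / per_stream_limit_mbps(window_bytes, rtt_ms))


if __name__ == "__main__":
    # Placeholder values: 1 MiB window, 20 ms round trip, 1000 Mbit/s target.
    window = 1024 * 1024
    print(f"{per_stream_limit_mbps(window, 20):.1f} Mbit/s per stream")
    print(streams_needed(1000, window, 20), "streams")
```

In practice the measured rates in the load tests are what matter; this only gives a starting point for choosing the stream count.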

Panda release installation jobs (Xin)

  • From last week (any update?) Still waiting for a firewall to open up.
  • Working at all sites.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Facility Nagios
    • Follow-up: Split of Nagios server into internal and external - still working on this. Work has now started. The server has been built. The external server will be moved to a new server. Delayed.
  • RSV publishing to WLCG
    • We want to participate in the SAM program, but not necessarily advertise our sites to EGEE resource brokers and other WMS to send jobs to us.
  • Local RSV to Nagios publishing
    • Sent code for RSV->Nagios reporting.
    • Waiting for a reply from the RSV team.
  • Nagios problems
    • OU - timeouts - but Horst disputes this.
    • UC, IU timeouts having to do w/ managed fork

Site news and issues (all sites)

  • T1: Condor autopilot crashed due to /var log filling up; increasing space. A few SRM glitches, otherwise all is fine. Started to work on procurement of the farm extensions, 3M SI2K(?). Tony ran some benchmarks: 140 8-core machines, Xeon 5440 at 2.8 GHz. Schedule a benchmarking/technology session; invite Tony.
  • AGLT2: Experimenting w/ space token reservation. All okay.
  • NET2: All okay. DQ2 list agents - sometimes reported as stopped. Had a case of two UDP loggers running at once - that was the problem.
  • MWT2: dCache problems persisting for DQ2. Working on a cleanup plan for 0-sized files. 5% failure rate on writes.
  • SWT2_UTA: working on the analysis queue on CPB cluster. xrootd configuration. Shutdown CPB for some electrical work.
  • SWT2_OU: filling back up after autopilot probs this morning. No news on 10G.
  • WT2: Wei.

RT Queues and pending issues (Tomasz)

Carryover action items

New Action Items

  • See items in carry-overs and new in bold above.


  • none

-- RobertGardner - 12 Feb 2008
