r4 - 20 Aug 2008 - 08:46:25 - RobertGardner

MinutesFeb27

Introduction

Minutes of the Facilities Integration Program meeting, February 27, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Marco, Hiro, Xin, Michael, Horst, Jay, Rich, Saul, Nurcan, Bob, Wei, John H, Tom, Mark, Gabriele, Tom W, John/BU
  • Apologies: Torre

Integration program update (Rob, Michael)

Next procurements

  • Standing agenda item
  • Carryover items
    • Action item - Rob: Compare pledge amounts to current capacity. See: CapacitySummary
    • Next update - end of March.

Facility FDR analysis roundup (Nurcan)

  • Happy to report that all Tier2's have passed all FDR tests!
  • Will likely have to follow up with different users, e.g., Akiro.
  • NET2 now finishing fine.
  • HighPt package also running okay at OU.
  • Thanks to all, esp. Patrick, Horst!
  • ANALY_GLOW_ATLAS - will add. Not completely certified, jobs currently failing with config errors.
  • Major Facility milestone achieved.
  • Issue - special tags in the release. Discuss / follow up with Fred.

LRC, Adler32 updates at Tier2 sites (Hiro, all)

  • Follow-up - any remaining issues?
  • All sites updated. Any problems?
  • Adler32 update script for FDR data - have all sites updated? Reminder to do so.
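
For sites checking their own files against catalog entries, the following is a minimal sketch of computing an Adler32 checksum in Python. It is not the update script mentioned above; the function name `adler32_of_file` and the 8-digit lowercase hex output format are assumptions made for illustration.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string.

    Hypothetical helper for illustration only; the hex formatting is an
    assumption, not a statement about what any catalog expects.
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits and zero-pad so leading zeros are not dropped
    return format(value & 0xFFFFFFFF, "08x")
```

Reading in chunks keeps memory use flat for large data files, and the zero-padded format avoids mismatches caused by a checksum that happens to begin with zeros.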

Adler32 updates for Panda services (Kaushik)

  • Follow-up from last week:
    • Mark reports - there have been pilot-specific issues. Paul is on top of this. Updates to pilot3 are automatic. Pilot2 updates have to be done manually - will check.
  • No update.

Summary of problems at BNL

  • Analysis jobs at BNL are running too slowly. The new autopilot + BNL gatekeeper take 20 minutes to schedule jobs. Currently troubleshooting; expect an update tonight/tomorrow.
  • Dantong suggested to Kaushik to revert to the old version for the time being.

Operations: Production (Kaushik)

  • Production summary (Michael)
    • FDR-2 preparations are now beginning.
    • Reprocess
  • Production shift report (Mark)
    • Barry on shift.

Operations: DDM (Kaushik/Hiro)

  • CCRC08 replication plan
  • Hiro started creating subscriptions - checking size.
  • Would like to run for 3 days. Setting up a DQ2 subscriptions monitor.

ATLAS requirements for storage elements

  • Space tokens at the Tier2s are now a formal requirement. Outlined in Kor's document for CCRC08.
  • We can no longer afford gridftp-only entry points.
  • Need srm v2.2 at Tier2s.
  • Need firm plan for equipping T2's with srm v2.2 by April 2
  • Space token - a method to reserve space for dedicated purposes. Necessary for managing the space. Basically a quota system.
  • See further ToA file for how tokens are used.
  • We will continue the prep phase for FDR2 in a mixed mode.
  • Role determines authorization to use the space. Will start with a single space token for analysis groups.
  • Let's reserve some time to discuss this at UNC.
  • What about Storm? Tightly coupled with LCG and glite.
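
The "quota system" description of space tokens above can be pictured with a small toy model. This is only a sketch of the idea (reserved space plus role-based authorization); it is not the srm v2.2 interface, and the class name `SpaceToken`, its fields, and the example token and role names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpaceToken:
    """Toy model of a space token: reserved capacity plus authorized roles."""
    name: str                 # hypothetical token name
    capacity_bytes: int       # space reserved for this purpose
    allowed_roles: frozenset  # roles authorized to write into the space
    used_bytes: int = 0

    def write(self, role, size_bytes):
        """Account for a write, enforcing role and quota checks."""
        if role not in self.allowed_roles:
            raise PermissionError(f"role {role!r} may not use {self.name}")
        if self.used_bytes + size_bytes > self.capacity_bytes:
            raise ValueError(f"{self.name} has insufficient reserved space")
        self.used_bytes += size_bytes
        return self.capacity_bytes - self.used_bytes  # remaining space


# Example: a single token shared by analysis groups (names are illustrative)
token = SpaceToken("EXAMPLE_ANALYSIS", 100, frozenset({"analysis"}))
remaining = token.write("analysis", 60)
```

In this picture a write fails either because the role is not authorized for the token or because the reservation is exhausted, which is the sense in which a space token behaves like a quota.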

LFC integration (John/Mark/Hiro)

  • Still waiting on a test site to be installed in scheddb. Mark will get Torre.
  • Hiro and John working on migration problem. Takes a lot of time. There may be speedups by running locally.

Accounting (Shawn, Rob)

Summary of existing accounting issues.

Throughput initiative - status (Shawn)

  • Report from meeting last Friday
  • Jay: adding tests from T2-T1 to the graphs. Finding some asymmetries.
  • Can UTA install iperf? Mark will follow-up with Patrick.

Panda release installation issues (Xin)

  • Any release installation issues to follow up?
  • Should we use a separate submit host for this purpose? Needed at the moment since we need the software voms role.
  • Dantong will discuss maintenance issues with Torre.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Facility Nagios
    • Follow-up: Split of Nagios server into internal and external. Still waiting.
    • Access to OSG RSV data for use within US ATLAS (Tomasz).
  • Local RSV to Nagios publishing
    • Port now working at MWT2_IU
  • RSV to SAM
    • OSG will use an interoperability list.
  • Selected Nagios facility alerts:
    • UTA - gatekeeper and LRC.
  • There are several OSG footprints tickets in our RT

Site news and issues (all sites)

  • Review SiteCertificationP4 table
  • T1: Gabriele - need to upgrade dCache again. Plan for next Tuesday. All day exercise. Upgrade will make pools start more quickly. 1.8p6, released on Monday. Need to consult Kaushik and Hong.
  • AGLT2: Scheduled maintenance shutdown tomorrow. Been getting lots of Nagios tickets lately - 50 seconds. Suggest forming a small task force to look into these details. Calibration center - need to decide whether you want to use a dedicated channel to the T0. This requires an FTS channel (maybe not?). Endpoint needs to be defined in the GOC database (for WLCG BDII). Bob will be in touch with Hiro.
  • NET2: Shutdown was on Tuesday - upgraded worker nodes to RHEL4. Production back up this evening; analysis queue setup.
  • MWT2: Postgres database filling up - caused a partition to fill.
  • SWT2_UTA: UTA_dpcc - building lost power, recovering. (Note - this was cause for many of the Nagios tickets). Will retire SWT2_UTA analysis queue.
  • SWT2_OU: Subscribed FDR data - in support of analysis queue. Ibrix crash - to be cleaned up.
  • WT2: Updating endpoint - doesn't work with dq2_get. The external commands used are not working on the SLAC srm server. It is hard-coded for BNL. lcg-copy

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • DDM
    • Issue of AOD's on Tier2's before FDR
      • Suggestion was to delete them all, and start over. Can we do this in a more selective fashion?
      • What should we be doing? Patrick and Hiro will discuss w/ Miguel. Charles interested in contributing.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.
  • RSV publishing to WLCG
    • We want to participate in the SAM program, but not necessarily advertise our sites to EGEE resource brokers and other WMS to send jobs to us.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • No meeting next week (OSG).

-- RobertGardner - 26 Feb 2008
