
MinutesFeb20

Introduction

Minutes of the Facilities Integration Program meeting, February 20, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Michael, Rich, Mark, Wei, John, Jay, Shawn, Hiro, Tom, Marco, Patrick, Horst, Fred, Saul, Xin, Hal, Alexei
  • Apologies: Torre, Nurcan

Integration program update (Rob, Michael)

Next procurements

  • Standing agenda item
  • Carryover items
    • Expect full FY08 funding to become available to us within the next couple of weeks. Any update?
    • Action item - Rob: Compare pledge amounts to current capacity.
  • What is our understanding of dedicated versus leveraged resources? There is an April 2007 request that sites review their capacities, especially for 2008 and 2009.
  • Are the table numbers representative of the fiscal or the calendar year? Calendar.
  • In the table, the figure should represent mid-year, ±2 months.
  • Still no word from Columbia on FY08 funds.

LRC, Adler32 updates at Tier2 sites (Hiro, all)

  • A few sites have upgraded: AGLT2, BU. IU and UC still have problems. UWISC was working. (A minimal checksum sketch follows this list.)
  • OU - still doing clean up, expect to be finished in an hour.
  • UTA should be updated.
  • SLAC - still to do.
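
Not from the meeting, but for reference: a minimal sketch of the checksum computation behind these Adler32 updates, streaming a file and producing both the Adler32 and md5 values a site verification script would register. The chunk size, the command-line interface, and the hex formatting are illustrative assumptions, not anyone's actual tooling.

  #!/usr/bin/env python
  # Minimal sketch (assumptions noted above): compute Adler32 and md5 checksums
  # of a file by reading it in chunks, as a site-level verification script
  # might do before updating the catalog entry for that file.
  import sys
  import zlib
  import hashlib

  def checksums(path, chunk_size=1024 * 1024):
      adler = 1                 # zlib.adler32 of an empty stream is 1
      md5 = hashlib.md5()
      f = open(path, 'rb')
      try:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              adler = zlib.adler32(chunk, adler)
              md5.update(chunk)
      finally:
          f.close()
      # Mask to 32 bits so older Pythons never return a negative value
      return "%08x" % (adler & 0xffffffff), md5.hexdigest()

  if __name__ == '__main__':
      for name in sys.argv[1:]:
          adler_hex, md5_hex = checksums(name)
          print("%s adler32:%s md5:%s" % (name, adler_hex, md5_hex))

Streaming in chunks keeps memory use flat even on multi-GB data files.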

Adler32 updates for Panda services (Kaushik)

  • Mark reports there have been pilot-specific issues; Paul is on top of this. Updates to pilot3 are automatic; pilot2 updates have to be done manually - will check.

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Follow-up on downtime schedules
    • Defer for now.
  • Production shift report (Mark)
    • Running at a reduced rate over the last few days; sites are using downtimes to do site upgrades. The focus is on Adler32 and md5 checksum issues, which are being worked on.
    • If sites notice any failures related to checksums, etc., please report them back to the shift team.
    • Expect ramp-up towards beginning of next week.
  • Kaushik's report:
    • Adler32 update issue - not completely resolved across the various components (Panda, pilot, and DQ2 services).
    • Downtime needs - would like to propose one day per site "down" and roll through sites over the next week.

Operations: DDM (Kaushik/Hiro)

  • Status of FDR AOD replication for analysis at Tier2s
    • Complete at all sites DONE
    • Adler32 updates for FDR datasets - Hiro will send a script later today. (A rough sketch of such an update script appears at the end of this section.)
  • Status of CCRC08 replication
    • Still coming, but expect many are complete.
    • Should bring replication to the Tier2s.
    • For Tier2 replication - to be handled within the cloud.
    • Tier2s may wish to set up separate locations in the SE, distinct from the default. See Hiro.
  • Any news about DQ2 0.6?
    • Alexei: testing is finished, ready to start with UK cloud and all Tier2's.
    • Hiro has not been contacted by Miguel.
    • Alexei says this was intended for M6.
    • DQ2 0.6 required for hierarchical datasets.
    • TID and physics datasets.
  • Other news from Alexei:
    • T1-T1 in progress.
    • CCRC 10-18 datasets - but BNL not getting 100%. Disagreement on this point. Is the plugin to be changed? (The plugin needs frequent changes - will be given to the ADC operation.)
    • ~1000 CCRC datasets: between 3 and 30 files/dataset, with sizes 2.5-3 GB.
    • Free notes:
1st week
   700 datasets (random number of ESDs, AODs); Miguel debugging.

2nd phase, green week
   T1-T1: subscribed from all T1s; 11 datasets from other T1s. If it goes smoothly, increase the scale over 2-3 days.

Next set: M5 samples. Predefined samples selected, renamed to ddm5.

What reprocessing at the T1s? Perhaps reprocess the FDR data, then redistribute.

French cloud - bulk subscriptions.
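
Also not from the meeting: a rough sketch of the kind of update script mentioned under the FDR item above, pushing precomputed Adler32 values into the replica catalog. The database name, table name, and column names are hypothetical placeholders; the real LRC schema and Hiro's actual script may look quite different.

  #!/usr/bin/env python
  # Rough sketch only: read "lfn adler32_hex" pairs from a text file and update
  # a checksum field in a MySQL-backed catalog. The table and column names
  # (t_lrc_files, checksum, lfname) are hypothetical, not the real LRC schema.
  import sys
  import MySQLdb

  def update_checksums(pairs_file, host, user, passwd, db):
      conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
      cur = conn.cursor()
      for line in open(pairs_file):
          parts = line.split()
          if len(parts) != 2:
              continue
          lfn, adler = parts
          cur.execute("UPDATE t_lrc_files SET checksum=%s WHERE lfname=%s",
                      (adler, lfn))
      conn.commit()
      conn.close()

  if __name__ == '__main__':
      # Connection parameters are placeholders for illustration.
      update_checksums(sys.argv[1], 'localhost', 'lrc_user', 'secret', 'localreplicas')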

Analysis Queues (Mark)

  • See AnalysisQueues and HN threads for FDR analysis progress.
  • Please update the twiki pages.
  • Lots of activity with sending jobs over the past week.
  • Two classes of problems:
    • System level patches at sites - under control, Xin providing updates.
    • Real user jobs - new types of problems: is it an analysis code problem or a site problem? Heads-up: this is potentially a big time sink, debugging jobs and user code errors.
  • How to deal with this? We can't afford to sink lots of site admin time into troubleshooting.
  • It's a subtle question as to whether a given failure is a user problem or a "site" problem.
  • Lots of issues come up:
    • file system, schedulers, user account mappings
  • Kaushik thinks the problems are mostly software release issues - not site problems.

LFC integration (John/Mark/Hiro)

  • Following up: report from this week's meeting
  • Phone meeting yesterday morning (John, Paul, Mark, Hiro, Rob)
  • Better understanding of what is needed to migrate to LFC: setting up a test site and running some Panda jobs.
  • Paul thinks that required pilot changes are known.
  • A test setup should happen by next week.
  • Comment from Hiro - transferring entries between the LRC and LFC will be time-consuming. Hiro notes the LFC is a pseudo file system, which adds complication (see the mapping sketch after this list). Work is proceeding with the MySQL database.
  • Problem with GSI access to storage via the LFC. Michael will address this next week. If this were available, the pilot code would be simplified.
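
To make the "pseudo file system" point concrete, a small sketch of the extra step the LFC introduces: the LRC is a flat LFN/GUID catalog, while the LFC needs every entry placed somewhere in a directory tree. The prefix and the grouping rule below are illustrative assumptions, not the agreed migration layout.

  #!/usr/bin/env python
  # Illustrative only: map a flat LFN onto a hierarchical LFC path by grouping
  # files under a dataset-like stem. The /grid/atlas prefix and the stem rule
  # are assumptions for this sketch, not the actual migration scheme.

  def lfc_path_for(lfn, prefix='/grid/atlas'):
      # Use everything before the first "._" as the directory name so that
      # files of one dataset end up in one directory.
      stem = lfn.split('._')[0]
      return '%s/%s/%s' % (prefix, stem, lfn)

  if __name__ == '__main__':
      example = 'some.dataset.AOD._00001.pool.root'   # made-up file name
      print(lfc_path_for(example))
      # -> /grid/atlas/some.dataset.AOD/some.dataset.AOD._00001.pool.root

The migration then has to create those directories and register each replica, which is part of why the LRC-to-LFC transfer is expected to be time-consuming.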

Accounting (Shawn, Rob)

Summary of existing accounting issues.

Throughput initiative - status (Shawn)

  • Meeting scheduled for this Friday
  • No update.

Panda release installation jobs (Xin)

  • From last week (any update?): was waiting for a firewall to be opened up - now resolved. DONE
  • Continuing validation tests now on production.
  • 13.0.30, 13.0.40 updated on all sites.
  • Need to automate detection of newly available releases - will continue this discussion with the SIT group. (A rough sketch of such a check follows this list.)
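
A rough sketch of the automation idea in the last item: compare a published list of releases against what a site already has installed and report what still needs an installation job. The release-list file and the install-area path are placeholders; the authoritative list would come out of the discussion with the SIT group.

  #!/usr/bin/env python
  # Sketch under the assumptions above: "releases.txt" holds one release tag
  # per line (e.g. 13.0.30, 13.0.40) and the install area contains one
  # directory per installed release. Both locations are placeholders.
  import os

  def published_releases(path='releases.txt'):
      return set(line.strip() for line in open(path) if line.strip())

  def installed_releases(install_area='/osg/app/atlas_app/atlas_rel'):
      if not os.path.isdir(install_area):
          return set()
      return set(os.listdir(install_area))

  if __name__ == '__main__':
      for rel in sorted(published_releases() - installed_releases()):
          print("needs install job: %s" % rel)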

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Facility Nagios
    • Follow-up: split of the Nagios server into internal and external instances - still in progress. Work has started and the new server has been built; the external instance will be moved onto it. Delayed.
  • Local RSV to Nagios publishing
    • Work underway between Tomasz, MWT2_IU, and the RSV team. (See the passive-check sketch after this list.)
  • Selected Nagios facility alerts:
    • UC, IU timeouts having to do w/ managed fork - Condor team investigating.
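
For the local RSV-to-Nagios publishing work, a sketch of the simplest plumbing: translate a probe result into a Nagios passive service check by appending a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. The command-file path, the status mapping, and the example host/metric names are assumptions; the pipeline Tomasz and the RSV team settle on may work differently.

  #!/usr/bin/env python
  # Sketch (assumptions noted above): publish one passive service check result
  # to Nagios through its external command file.
  import time

  # Nagios return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
  STATUS_MAP = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}

  def publish_passive_result(host, service, status, detail,
                             cmd_file='/var/spool/nagios/cmd/nagios.cmd'):
      code = STATUS_MAP.get(status, 3)
      line = '[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n' % (
          int(time.time()), host, service, code, detail)
      f = open(cmd_file, 'a')
      try:
          f.write(line)
      finally:
          f.close()

  if __name__ == '__main__':
      # Placeholder host and metric names for illustration.
      publish_passive_result('gatekeeper.example.edu',
                             'org.osg.general.ping-host', 'OK', 'ping ok')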

Site news and issues (all sites)

  • T1: all okay.
  • AGLT2: got a direct circuit from the Tier2 to building 513 at CERN - 150 Mbps, as a backup path for muon calibration. It uses the existing LHCNet lightpath, subdividing a 10G link (can ask for 50 Mbps increments). RTT 124 ms. (See the buffer-sizing arithmetic after this list.)
  • NET2: have hardware for a storage upgrade. Doubling GPFS storage. Expect a new gatekeeper soon. Adler
  • MWT2: dCache cleanup completed, services restored. http interface issues being addressed.
  • SWT2_UTA: all is well.
  • SWT2_OU: Adler32 nearly done.
  • WT2: storage purchase in progress. Reorganizing the computing room; will probably run at reduced capacity.
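
A quick sizing note on the AGLT2 circuit above (not discussed in the meeting): the bandwidth-delay product tells you roughly how much data must be in flight, and therefore how large the TCP windows need to be, to fill a 150 Mbps path with a 124 ms round-trip time.

  # Bandwidth-delay product for the AGLT2 muon-calibration circuit:
  # 150 Mbps link, 124 ms RTT.
  rate_bps = 150e6
  rtt_s = 0.124
  bdp_bytes = rate_bps * rtt_s / 8
  print("BDP ~ %.1f MB in flight" % (bdp_bytes / 1e6))   # ~2.3 MB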

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • DDM
    • Issue of AODs on Tier2s before the FDR
      • Suggestion was to delete them all, and start over. Can we do this in a more selective fashion?
      • What should we be doing? Patrick and Hiro will discuss w/ Miguel. Charles interested in contributing.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.
  • RSV publishing to WLCG
    • We want to participate in the SAM program, but not necessarily advertise our sites to EGEE resource brokers and other WMS to send jobs to us.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 19 Feb 2008
