r5 - 20 Aug 2008 - 08:46:26 - RobertGardner

MinutesMar19

Introduction

Minutes of the Facilities Integration Program meeting, March 19, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Bob, Michael, Horst, Wei, Dick and Michael/LTU, Nurcan, Kaushik, Rich, Tom, John, Jay, John B, Saul, Patrick, Wensheng, Hiro, Xin, Torre
  • Apologies: none

Integration program update (Rob, Michael)

Special topic: DQ2 0.6 upgrade (Hiro)

  • News on deployment readiness - final version still testing, not available.

Next procurements

  • Standing agenda item, see CapacitySummary
  • Invite Tony to give a server evaluation report.

Facility FDR analysis follow-ups (Nurcan)

  • Jamboree - see presentation on Monday morning.
  • There is a new wiki page for the tutorial. Most users went through the instructions successfully.
    • Panda monitor problems - causing MWT2 job failures; autopilot types and AGLT2; see usatlas-prodsys-l thread.
    • Some people continue to submit jobs to ANALY queues at BNL; there might have been transient problems caused by the thumper, but these were solved quickly.
  • Follow-up: Will collect job types from users - will create wiki page by next week.
    • Will generalize the page from the Jamboree.
    • Need instructions for local Tier3 submit hosts.
    • Concern expressed from users about not having info about site maintenance issues, up/down.

Operations: Production (Kaushik)

  • Production summary
    • We have jobs defined now, but a large transferring backlog. Hiro: dCache and gridftp doors were not working last night, but now fixed.
    • Once thumpers came back things became stable and running well.
    • Failure rate went up today. Lots of lost heartbeat jobs at AGLT2. Lots of jobs have gone into a holding state. There were problems with the pilot wrapper downloading the actual script from an Apache web server (caching from SVN). There were also problems with gridui02, downloading from queuedata (which was down).
  • Production shift report
    • Holding jobs at Boston - problem was fixed on Monday. Two worker nodes didn't have nfs-access, jobs failed. Should be fine now.
    • Lost heartbeats - job related, not site related (seen at many sites). Kaushik: might also be related to a Panda server issue.

Operations: DDM (Kaushik/Hiro)

  • Follow-up
    • CCRC08 replication plan (Hiro)
    • Will need to check whether data was replicated.
  • AOD replication to Tier2s for analysis - we need a higher level plan for identifying and managing the placement of data to Tier2s. Kaushik - have been discussing this, we have two new Panda hires coming on board. These are higher level services coming above the Tier2s.
    • Need to bridge the gap between now and when these services would be available.
    • Small group to discuss this offline.

ATLAS requirements for storage elements (April 2)

  • Follow-up - plans from sites
    • Space tokens are now a formal requirement at the Tier2s, as outlined in Kor's document for CCRC08.
  • AGLT2 v2.2 running. Space reservation setup.
  • MWT2 v2.2 running at MWT2_IU but needs v2.2 space-token and other functional tests; debugging _UC (pools on a private network issue).
  • WT2 v2.2 in development (Bestman-xrootd). Not much progress. Andy is ready to deliver his piece by the end of the week. Waiting
  • NET2 - schedule meeting. Saul: will install Bestman, getting some experience. Will contact about GPFS.
  • SWT2 - will be installing Bestman-xrootd on SWT2_CPB; waiting on getting some additional hardware.
  • Space tokens go to a specific directory, but users will only need to know a name. Consult ToA file.

LFC integration (John/Mark/Hiro)

  • Follow-up:
    • Hiro ran a test of migration - which was slow. Will run a test on the same host to speed up. Looking into writing something to do a lazy migration.
    • Panda site testing (Mark) - working with Paul to work through pilot issues. Will schedule another call.
    • Will update schedule next week.
  • Hiro is running an update load.
  • Will schedule meeting for next Tuesday.

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • Follow-up - last week: Still need to add the SWT2_CPB. Still waiting to get done. DONE
    • IU_OSG - Fred is following up.
    • Michael urges all sites to give this priority, and to check that information is reported correctly.
  • Current accounting plot from WLCG: (Normalised CPU time [units 1K.SI2K.Hours] )
    wlcg-accounting.png

Throughput initiative - status (Jay)

  • Report from Monday's meeting
  • Currently waiting on upgrades to sites; when ready, ask Hiro.
  • Also will have update

Panda release installation issues (Xin)

  • Any release installation issues to follow up? Xin thinks we can start using it now. The issue is how to submit the pilots to do these jobs. One possible plan is to use autopilot, but there may be problems. Xin will follow up with Torre.
  • No update - still tied up with Condor-G. Revisit next week.

Nagios Alerts - Focus review (Dantong)

  • Any follow-up
  • No update from Dantong.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Local RSV to Nagios publishing
    • Port now working at MWT2_IU - some problems to be cleaned up, but basically working. To be included in next OSG release.
  • RSV to SAM

PROOF / Xrootd

Tier3 issues

  • LTU - in the process of installing storage resources in the next two weeks.

Site news and issues (all sites)

  • Review SiteCertificationP4 table
  • T1: A number of issues have been resolved. Still working on the Condor-G issue with Wisconsin, specifically in the BNL environment. On the storage side, progress has been made - all dCache services up, should be no access problems.
  • AGLT2: Heavy loads on our gatekeeper (generating timeouts). Set up a separate server for non-ATLAS VOs. Otherwise running steadily.
  • NET2: No outstanding problems; new hardware 85 TB, 96 cores in blades (Intel).
  • MWT2: dCache v1.8 upgrade
  • SWT2 (UTA): May need to shut down CPB's cluster next week for electrical work. _UTA back up.
  • SWT2 (OU): Waiting for 10G network. Working with Dell on server crashes. Problems with g-u-c on OU_OSCER.
  • WT2: New storage acquisition - 7 thumpers, 230-240 usable TB. End of April.

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon to follow up with APEL developers; expect something in about a month.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • Update from Nurcan: regarding jobs using the same output datasets - Tadashi will put in a protection for this.


-- RobertGardner - 18 Mar 2008


Attachments


png wlcg-accounting.png (21.9K) | RobertGardner, 18 Mar 2008 - 19:45 | Current accounting plot from WLCG
 