r4 - 20 Aug 2008 - 08:46:26 - RobertGardner

MinutesMar12

Introduction

Minutes of the Facilities Integration Program meeting, March 12, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Michael, Shawn, Patrick, John H, Jay, Fred, Nurcan, Kaushik, Mark, Horst, Tom, Alexei, Justin, Bob, Wei
  • Apologies: none

Integration program update (Rob, Michael)

Special topic: DQ2 0.6 upgrade (Hiro, Alexei, Miguel)

  • Hiro believes it's not stable enough for deployment - no stable version of the rpms has been declared yet.
  • Alexei: no urgency for 0.6.2 upgrade. PIC and LYON upgraded - will run functional tests. For BNL - important that it is upgraded for functional tests - all sites will send data to BNL - BNL-OSG2_DATADISK.
  • Leave alone BNLDISK, BNLTAPE for now.
  • Alexei: FT will be happening over the next few weeks. Each T1 will subscribe to LYON, few terabytes in size.
  • AOD distribution within the US - equipping Hiro with tools. But what other AODs are needed, e.g. Release 13 AODs? All AODs to Tier2s.
    • Alexei will update the monitoring page for the fdr data.
    • Pilots query both the site's LRC and the DQ2 catalog - Kaushik.
  • All M6 have replicated to BNL.

Following up from US ATLAS Transparent Distributed Facility Workshop

Next procurements

  • Standing agenda item, see CapacitySummary
  • MWT2 - 105 dual-quad core AMD servers, delayed to April.
  • UTA: last of FY07 large purchase from Dell - 540 cores, 240 TB raw (will be xrootd). This will be merged with SWT2_CPB.

Facility FDR analysis follow-ups (Nurcan)

  • Have continued submitting jobs from the workshop.
  • MWT2 - used a file from IU, now all data replicated.
  • UW site problems resolved; dq2_get works to retrieve files from its SE.
  • Need to provide more nodes for the analysis queues.
  • 20% for US ATLAS production purposes, that we can use (RAC) as we wish, for example, dedicating to analysis queues.
    • For now, we've agreed to run 20 job slots per site.
  • Alden and Amir will be running highptview with trigger updates on all FDR data on all analysis queues - may also have ntuples ready.
  • susy dpd maker will be available later today, to make ntuples for d1pd's.
  • Will collect job-types from users - will create wiki page by next week.

Operations: Production (Kaushik)

  • Production summary
    • limited by lack of jobs lately - issue has been raised with Ian, Alex
    • Now getting some FDR-2 jobs for the US.
    • Today having Panda Mover problems.
  • Production shift report
    • Last few days slow - no other updates.

Operations: DDM (Kaushik/Hiro)

  • Follow-up
    • CCRC08 replication plan
    • Hiro started creating subscriptions - checking size.
    • Would like to run for 3 days. Setting up a DQ2 subscriptions monitor.
  • There will be a new set of data to the Tier2s.

ATLAS requirements for storage elements (April 2)

  • Follow-up - plans from sites
    • Space tokens at the Tier2s are now a formal requirement, outlined in Kor's document for CCRC08.
  • AGLT2 v2.2 running. Space reservation setup, need to
  • MWT2 upgrading tomorrow.
  • NET2 - will need to follow-up, as with all other sites.

LFC integration (John/Mark/Hiro)

  • Hiro ran a test of migration - which was slow. Will run a test on the same host to speed up. Looking into writing something to do a lazy migration.
  • Panda site testing (Mark) - working with Paul to work through pilot issues. Will schedule another call.
  • Will update schedule next week.
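
As an illustration only of what "lazy migration" could mean here: entries are copied from the old catalog to the new one the first time they are looked up, rather than in one slow bulk pass. This is a minimal sketch, not the tool Hiro is looking into. The class name, the dict-backed catalogs, and the example file name and URL are all hypothetical stand-ins and do not use the real LRC or LFC interfaces.

```python
class LazyMigratingCatalog:
    """Hypothetical wrapper: serve lookups from the new catalog,
    falling back to the old one and copying entries over on first access."""

    def __init__(self, old_catalog, new_catalog):
        # Both catalogs are plain dicts here (logical file name -> replica URL);
        # real LRC/LFC clients would replace them.
        self.old = old_catalog
        self.new = new_catalog

    def lookup(self, lfn):
        # Already migrated: answer from the new catalog.
        if lfn in self.new:
            return self.new[lfn]
        # Not migrated yet: consult the old catalog.
        replica = self.old.get(lfn)
        if replica is not None:
            # Migrate this single entry now, so the next lookup is served
            # from the new catalog.
            self.new[lfn] = replica
        return replica


# Illustrative data only (made-up file name and URL).
lrc = {"fileA": "srm://site.example/data/fileA"}
lfc = {}
catalog = LazyMigratingCatalog(lrc, lfc)
first = catalog.lookup("fileA")    # found in the old catalog, copied to the new one
missing = catalog.lookup("fileB")  # unknown to both catalogs, nothing is copied
```

The trade-off in this sketch is that the first lookup of each entry pays for an extra query against the old catalog, and entries that are never requested are never migrated, so a background sweep would still be needed to finish the job.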

Accounting (Shawn, Rob)

Summary of existing accounting issues.

Throughput initiative - status (Shawn)

  • Next meeting? Probably next week.
  • Jay has generated graphs for the Tier2 --> BNL.

Panda release installation issues (Xin)

  • Any release installation issues to follow up?
  • Xin thinks we can start using it now. The issue is how to submit the pilots to do these jobs. One possible plan is to use autopilot, but there may be problems.
  • Xin will follow-up with Torre.

Nagios Alerts - Focus review (Dantong)

  • Follow-up next.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Facility Nagios
    • Follow-up: Split of Nagios server into internal and external. Done.
  • Local RSV to Nagios publishing
    • Port now working at MWT2_IU - some problems to be cleaned up, but basically working.
  • RSV to SAM
    • Now working - needs review.

PROOF / Xrootd

  • See presentation from Sergey at last week's workshop.
  • There will be a meeting this Thursday, see ProofXrootd
  • Hiro points out there is an xrootd daemon for a site's LRC.

Tier3 issues

  • No issues this week.

Site news and issues (all sites)

  • Review SiteCertificationP4 table
  • T1: Gabriele working on stabilizing the dCache instance. Finding lots of hanging instances - perhaps related to the patches to the OS's on the thumpers. Hope the build process will be completed later today, fully completed by the end of the week. There are performance problems regarding job throughput through BNL. New machines came in, some machines going to other network segments, configuration changes through autopilot, etc - lots of things changed at once. Last week John Hover and Xin were able to talk with the Condor-G team - discovered problems within Condor-G that resolved part of the issues. Still not completely understood.
  • AGLT2: Tom - all okay. Monalisa at OSG being deprecated. Jay is going to host this - only needs configuration files.
  • NET2: all okay. Need to register
  • MWT2: dcache upgrade.
  • SWT2 (UTA): Moving to a rocks install for upgrade to dpcc.
  • SWT2 (OU): all okay. Still waiting for 10G equipment from UTA; switch now in place. Will also have to upgrade ibrix segment servers to rhel5.
  • WT2: starting to work on implementation of space tokens for srm v2.2 - working with xrootd developer, sent to Alex. Long power outage in April, last weekend 2/3 day.

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • None


-- RobertGardner - 11 Mar 2008
