
MinutesSep12

Introduction

Minutes of the Facilities Integration Program meeting, Sep 12, 2007
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Shawn, Rob, Kaushik, Torre, Horst, Karthik, Jay, Joe, Fred, Charles, John Brunelle, Wei, Bob, Tom, Saul, Marco, John Hover, Hiro
  • Apologies: Michael, Alexei

Integration program update (Rob, Michael)

  • Phase 2 schedule and Phase2
  • Updates coming for:
    • RSV/Nagios/SAM deliverables
    • Throughput initiative to be led by Shawn (100 MB/s routinely to all Tier2s)

Accounting

  • Follow-up on WLCG reports on accounting
  • Accounting portal, http://www3.egee.cesga.es/gridsite/accounting/CESGA/osg_view.html and https://goc.gridops.org
  • Shawn - no data listed for July and August at CERN.
  • Wei - should be PROD_SLAC - correct name has been applied (prod-slac).
  • BU - still at OSG 0.4
  • UTA_DPCC is reporting okay in the EGEE.
  • What is the VO name? (atlas or usatlas)
  • At SWT2, Gratia doesn't build correctly - binary release not compatible with machines with gcc 3.2.3 libs.
  • AGLT2 - reporting, but not representative.
  • Action item: follow-up with John W and WLCG on these issues (Rob)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Running well again. Last two weeks Canada has been playing a big role. This week the big issue is the HPSS upgrade at BNL. Thanks to Hiro for manually staging input files. HPSS will come back Friday.
  • Shifters report and other production issues (Mark/Nurcan)
    • Validation of the new 13.0.20.2 release - lots of job failures.
    • Teraport - still debugging; please inform pandashift when ready to put back into full production.
    • Router issues at Indiana gigapop resolved.
    • Eowyn has been updated, which seems to have helped the scaling problems. A second Eowyn is handling jobs for Canada; running an Eowyn instance at CERN is under consideration.
    • Missing software at sites - mostly a problem at LCG sites (Rod suggests matchmaking). Paul is implementing a backup mechanism to download the software automatically.
    • Marco asks about transfer timeouts for output files. Are there disk problems?
    • New PandaMover being tested. The idea is to use Panda jobs (which run dq2_cr) to move data rather than DQ2 subscriptions. It has HPSS pre-staging capability.
  • (Note from the last meeting about upcoming workloads):
    • Will need to reprocess at least twice, dating back to November, using rel 12, 13. Any preparations for the sites? Expect higher than usual network traffic, and lots of I/O intensive jobs. And these are shorter jobs. The load will be most at BNL - since that is where the data is.
    • Ratio of input/output? There should be a factor of 5 compression. For data that comes back to BNL - it should be custodial.
    • Saul: I/O per unit CPU for these reconstruction jobs? Kaushik will look up. When do these jobs hit the Tier2's? "Next week".
    • Kaushik will summarize these metrics and provide a table. No update yet.
    • RDO file ~100-200MB as input for 30 minute jobs.
    • Release 13.0.20.2 - new - supposed to do trigger reconstruction, validation. Still a big question mark since the failure rate is so high.
    • Release 12.0.7.2 will happen first. Heard that there will be no 12.0.7.3. Expect to generate a list of datasets to be re-processed.

Operations: DDM (Alexei)

  • M4 replication review.
  • Review plan in DQ2SiteServicesP2 carryover
  • Report from Alexei:
DQ2 0.4
 Rod Walker found bugs in 0.4, and according to Miguel there is no reason 
to release a new version before the bugs are fixed. I'll have a dedicated 
meeting with Miguel Thursday morning, so more news on Thursday.

M4 runs

 BNL received ~1030 datasets out of ~1500 (similar numbers for other 
sites). I'm working on overall statistics; some plots and tables will 
be presented at Thursday's DDM operations meeting.
 The reason why 30% of the M4 RAW data wasn't delivered is not yet 
understood. The working assumption is that DQ2 internally cancelled the 
subscriptions after several attempts to transfer the data. This isn't 
obvious, because there is no indication that BNL (or CERN) had problems.
 Other issues (md5sum; T1-T1 and T2-T2 data transfers during M4, though 
the hierarchical path T0-T1-T2 was expected) are also under investigation.

Load testing update, issues (Jay)

  • Jay has updated LoadTestsP2 with detailed site information tables covering contact points, TCP parameters, MTU sizes, and load test results
  • Varying TCP buffer size and number of streams for gridftp memory-to-memory tests (best performance was 17 MB/s).
  • Question about deleting files - edg-gridftp-rm can remove files at SLAC; John H will help him with this.
  • Monitoring - parse files and create graphs in monalisa.
  • Throughput initiative to be led by Shawn. Focus first on two sites (BNL and AGLT2) and once 120 MB/s is achieved move onto the other Tier2s.
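The memory-to-memory gridftp tests above can be sketched as follows. This is a sketch only: the globus-url-copy flags are standard, but the endpoints and file paths are placeholders, and the cleanup step mirrors the edg-gridftp-rm discussion. It defaults to a dry run that prints the commands, since the grid tools and credentials may not be available.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Memory-to-memory transfer: /dev/zero as source and /dev/null as sink
# keeps disks out of the measurement. 8 parallel streams and a 4 MB TCP
# buffer are example values to vary, per the tests above; hostnames are
# placeholders.
run globus-url-copy -p 8 -tcp-bs 4194304 \
    gsiftp://source-door.example.edu/dev/zero \
    gsiftp://dest-door.example.edu/dev/null

# Removing leftover test files at the remote site, as discussed for SLAC
# (path is a placeholder):
run edg-gridftp-rm gsiftp://dest-door.example.edu/scratch/loadtest.dat
```

Set DRY_RUN=0 to actually run the transfers once a valid grid proxy is in place.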

Network Performance (Shawn) - closely related to the above

  • Two approaches - simple iperf measurements on existing hosts, and NDT tests
  • Review plan in NetworkPerformanceP2
  • iperf (or netperf) is simple to install with RPMs; do we require it? The issue is leaving an iperf server running that remote clients could "exploit" to use up bandwidth.
  • For sites which have a gatekeeper, run an iperf service within a grid job. Most US ATLAS sites run gatekeepers on their gridftp doors, so this is preferred.
  • For sites with dCache gridftp doors, need to install an iperf server (and publish this to Shawn and Jay).
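A minimal sketch of the iperf measurements discussed above, using classic iperf2 flags; the hostname and port are placeholders, and the script defaults to a dry run that prints the commands, since a remote iperf peer will not generally be available.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# Server side (e.g. on a dCache gridftp door): leave iperf listening.
# The hostname and port would be the details published to Shawn and Jay.
run iperf -s -p 5001

# Client side, from the remote site: a 30-second test with 4 parallel
# streams and a 4 MB TCP window (values to vary while tuning).
run iperf -c door.tier2.example.edu -p 5001 -P 4 -w 4M -t 30
```

Restricting the server port at the site firewall to known peer subnets is one way to limit the bandwidth-"exploit" concern raised above.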

Analysis Queues (Bob, Mark)

  • Bob has setup instructions for implementing analysis queues for Condor sites, see: AnalysisQueueP2
  • Mark will provide similar instructions for PBS.
  • Action items moving forward (each site):
    • Set up analysis queues
    • Allocate a small number of CPUs to them.
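For Condor sites, one way to reserve a few CPUs per node for analysis might look like the fragment below. This is only an illustrative sketch, not the AnalysisQueueP2 instructions: the SlotID cutoff and the +AnalysisJob custom attribute are assumptions.

```
# condor_config.local fragment (sketch). Production jobs may start on
# slots 1-6; slots 7-8 of each 8-core node are reserved for jobs that
# advertise an assumed custom attribute +AnalysisJob.
START = ( (SlotID <= 6) || (TARGET.AnalysisJob =?= True) )
```

A submitter would then add `+AnalysisJob = True` to the job description so the job matches the reserved slots.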

Site performance jobs and metrics (Rob)

  • At CHEP we discussed a simple program using the Panda infrastructure to inject test jobs which measure performance metrics, useful for diagnostics, capacity measurements, and job metrics.
  • Will start this on the OSG ITB, and branch out into the production infrastructure.

LFC Evaluation

  • Will start looking into this - John Hover will have a preliminary report next week.

RSV, Nagios, SAM (WLCG) site availability monitoring program

  • Discussed with James Basney (EGEE), Rob Quick (OSG), Dantong, Tomasz, John Hover at CHEP
  • Will define a program of work around this.

OSG Integration (Rob)

  • Testing on ITB 0.7, Site Validation Table
  • Call for help to submit test jobs to the ITB.
  • UTA student Aaron working with Torre at BNL.

Site news and issues (All Sites)

  • T1: no report.
  • AGLT2: selected Dell for next round purchase. Waiting for final quote. Hope to have systems at end of this month at UM, and mid-October.
  • NET2: no issues currently.
  • MWT2: will work on re-enabling jumbo frames. 10G fiber card misbehavior. The iu-atlas-tier2 cluster (IU_OSG, 128 cores) will be offline this Friday.
  • SWT2_UTA: cluster will be offline Tuesday and Wednesday for electrical work. Otherwise all okay.
  • SWT2_OU: moved cluster on Monday - all went well. Expect to be down most of next week.
  • WT2: no report.
  • UC Teraport: working on the lost heartbeat jobs.

RT Queues and pending issues (Dantong, Tomasz)

  • Review of RT tickets - need experts.
  • Recommendation to at least assign ownership within a day.
  • Sometimes there are false alarms, but Tomasz is working to eliminate them.
  • Saul notes we sometimes get confusing messages from Nagios - and so we hesitate to respond to them. Tomasz is working on adding protection against these.
  • If there are questions, reply to the message w/ cc to Tomasz.
  • Suggestion to be able to resolve a message via email.
  • Suggestion to have a related summary website.
  • No solution yet to listing the cc'd recipients.
  • Tomasz will send URL for RT issues, to be on next week's agenda.
  • Suggestion about building a hierarchy of services in Nagios - Tomasz is working on this, and will report on Nagios service groupings in two weeks.

Carryover action items

Syslog-ng

  • Encryption to syslog-ng: still to do, carryover.

FTS monitoring

  • FTS 2.0 deployment: no news, carryover.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • None

-- RobertGardner - 11 Sep 2007
