
MinutesAug29

Introduction

Minutes of the Facilities Integration Program meeting, Aug 29, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Michael, John, Mark, Kaushik, Charles, Marco, Nurcan, Xin, Bob, Saul, Jay, Wensheng, Tom, Rich, Kunal, Joe.
  • Apologies: none

Integration program update (Rob, Michael)

Accounting

  • Accounting portal, http://www3.egee.cesga.es/gridsite/accounting/CESGA/osg_view.html
  • See also https://goc.gridops.org
  • We need to assign a person to keep track of these figures and their reporting to WLCG, and to follow up with the sites, which is very important. This is largely an administrative issue. Note this is a mandatory requirement that has been agreed with WLCG and OSG.
  • Need to spell out tasks and milestones in detail, and decide these.

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • facilities-status-aug07_3.ppt: Production Summary Slides,
    • New sites joining - TRIUMF (the Canadian Tier 1) and the WestGrid Tier 2s, all working well.
    • Site by site comparisons.
    • Eowyn scaling - 12,000 finished jobs required 5 hours to update the ATLAS proddb at CERN (see the rate sketch at the end of this list). Will take this up with the ATLAS prodsys group.
      • How scalable is it, and how much time will it take to develop improvements (Luc, who also has T0 responsibilities)?
      • The issue is the update of post-job status: it blocks new jobs. The issue has been raised previously. Simultaneous Eowyns? That introduces new problems.
    • Many reconstruction jobs coming - some datasets are to be reconstructed twice; all are for CSC physics notes, so high priority. Can we get lists of datasets to pre-stage in advance? Adding a pre-stage job in Panda.
    • Large number of jobs failing due to Athena seg faults? These are caused by validation of new software on the grid. There is a multi-stage validation process: the RTT and the nightly tests. There is a validation group under Andrew that looks at standard samples before release. Then there is "sample A", 10-100 jobs that are run before the full release. Have been working with the validation group to reduce this. Look at the error summaries.
  • Shifters report and other production issues (Mark/Nurcan)
    • One update to the pilot code this week. Updates from Paul and Marco on data movement utilities.
    • New job submission has been slow due to Eowyn.
    • Lots of new bug reports submitted, because of the validation tasks.
  • What else is coming up?
    • There have been more delays in releases.
    • Will need to reprocess at least twice, dating back to November, using releases 12 and 13. Any preparations for the sites? Expect higher than usual network traffic and lots of I/O-intensive jobs. And these are shorter jobs. The load will be greatest at BNL, since that is where the data is.
    • Ratio of input to output? There should be a factor of 5 compression. Data that comes back to BNL should be custodial.
    • Saul: what is the I/O per unit CPU for these reconstruction jobs? Kaushik will look it up. When do these jobs hit the Tier 2s? "Next week".
    • Kaushik will summarize these metrics and provide a table.
    • Release 13.0.20.2 - new - is supposed to do trigger reconstruction and validation.
    • Release 12.0.7.2 will happen first.
    • Both of these will include updates to the ...
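
For scale, a minimal back-of-the-envelope sketch of the Eowyn update rate implied by the figures above (the 12,000 jobs and 5 hours are the numbers quoted in the meeting; the rest is illustrative arithmetic only):

# Rate implied by the Eowyn figures above: 12,000 finished jobs taking
# 5 hours to update the ATLAS proddb at CERN.
jobs_finished = 12000
hours_elapsed = 5

jobs_per_minute = jobs_finished / (hours_elapsed * 60)
seconds_per_job = hours_elapsed * 3600 / jobs_finished

print(f"{jobs_per_minute:.0f} jobs/min updated")  # ~40 jobs/min
print(f"{seconds_per_job:.1f} s per job update")  # ~1.5 s per finished job
# At ~1.5 s per status update, a large batch of finished jobs keeps Eowyn busy
# for hours, which is why post-job updates block the submission of new jobs.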

Operations: DDM (Alexei)

  • M4 replication news?
  • Review plan in DQ2SiteServicesP2 carryover
  • ESDs subscribed to Tier 2s.
  • The AGLT2 host for the DQ2 server has crashed a couple of times in the last few days.

Load testing update, issues (Jay)

  • App ready to go, once we have site configurations.
  • First results from gridftp memory-to-memory transfers (/dev/zero to /dev/null); see the sketch after this list.
  • Charles will write up some instructions for FUSE.
  • Review plan in LoadTestsP4
  • Some concern about the tests being disruptive.
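
A minimal sketch of what one memory-to-memory gridftp test could look like, assuming globus-url-copy is installed and a valid grid proxy exists; the destination host, stream count, buffer size, and transfer length below are illustrative placeholders, not the actual load-test configuration:

#!/usr/bin/env python
# Sketch of a single memory-to-memory gridftp transfer (/dev/zero -> /dev/null).
# Assumes globus-url-copy is on the PATH and a valid grid proxy is in place.
import subprocess
import time

DEST = "gsiftp://gridftp.example.edu:2811/dev/null"  # hypothetical destination
NBYTES = 1 * 1024 ** 3                               # read 1 GB from /dev/zero

cmd = [
    "globus-url-copy",
    "-p", "4",              # 4 parallel TCP streams
    "-tcp-bs", "4194304",   # 4 MB TCP buffer
    "-len", str(NBYTES),    # stop after NBYTES from the otherwise endless /dev/zero
    "file:///dev/zero",
    DEST,
]

start = time.time()
subprocess.check_call(cmd)
elapsed = time.time() - start
print("Moved %.1f GB in %.1f s (%.2f Gbit/s)"
      % (NBYTES / 1024 ** 3, elapsed, NBYTES * 8 / elapsed / 1e9))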

Network Performance (Shawn, Dantong)

  • NDT news and tests
  • Review plan in NetworkPerformanceP2
  • Results:
    • NDT (a 10-second test in each direction, because it is a web client) versus longer-running iperf tests.
    • Do we want to do repetitive tests?
    • Ran a 60-second test: 900 Mbps (4 MB window size). See the bandwidth-delay sketch below.
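
For context, a minimal sketch of the TCP bandwidth-delay-product arithmetic behind the 4 MB window and ~900 Mbps result; the round-trip time used is an assumed illustrative value, not a measurement from these tests:

# Single-stream TCP throughput is bounded by window_size / round_trip_time.
WINDOW_BYTES = 4 * 1024 * 1024   # 4 MB window, as used in the 60-second test
RTT_SECONDS = 0.035              # assumed 35 ms RTT, for illustration only

max_throughput_mbps = WINDOW_BYTES * 8 / RTT_SECONDS / 1e6
print(f"Window-limited throughput: {max_throughput_mbps:.0f} Mbps")
# -> ~959 Mbps, i.e. a 4 MB window is enough to approach the observed
#    ~900 Mbps on paths with RTT up to roughly 35 ms.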

Analysis Queues (Bob, Mark)

  • Review plan in AnalysisQueueP2
  • New site at AGLT2: ANALY_AGLT2, a new entry in siteinfo. Have submitted a small number of test jobs (analysis pilots). Are the jobs doing the right thing on the back end (Condor, at Michigan)? Bob is investigating whether these are working correctly.
  • Need to specify a queue name; need to cap this for analysis jobs.
  • If this works at UM, then we will have a template for Condor sites; we already have one for PBS (existing at UTA).
  • Xin is doing this with local pilot submissions at Brookhaven.
  • Questions moving forward:
    • setting up analysis queues at each site
    • allocating the number of CPUs (a small number)
  • Today's results - Mark submitted 6 jobs, and it looks like the jobs ran (running local pilots).
  • Pretty simple to set up for dedicated queues - a few modifications.
  • Most likely will use Auto-pilot to schedule analysis jobs.
  • Next week - will have instructions ready for PBS and Condor. Identify two sites to setup.
  • Question about the siteinfo file: what does "nodes" mean? It means the number of cores, though it's not an important parameter.

OSG Integration (Rob)

Site news and issues (All Sites)

  • T1: none
  • AGLT2: problems with DQ2 servers; RFQ back this week, PO next week. Three gridftp servers for load testing.
  • NET2: reboot of gatekeepers; atlas.bu.edu is not authenticating.
  • MWT2_IU: no problems; iut2-dc1.iu.edu.
  • MWT2_UC: production slowed due to Eowyn. Finding huge variation in dccp transfers from local disk into dCache, with very slow performance. This was caused by the write pools being too busy. Too much load on the write-pool nodes? Will consult with Robert Petkus at BNL.
  • SWT2_UTA: campus-wide network problems, now resolved. DQ2 problems two nights ago. They are having problems with Tomcat processes on the gatekeeper, causing authentication problems on the gatekeeper. What is the underlying problem? Is there a patch available? Repeatedly restarting the Tomcat server on the GUMS host.
  • SWT2_OU: everything is running fine. Moving day is Monday, Sep 10; Dell will install new nodes that week. Will come back with 260 cores and 20 TB of space.
  • WT2: power outage rescheduled for Aug 29. Will shut down the DQ2 server on Tuesday.
  • UC Teraport: none.

RT Queues and pending issues (Dantong, Tomasz)

  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.
    • Dantong has guidelines.
    • There are a couple of tickets that are not getting attention.

Carryover action items

Syslog-ng

  • Encryption to syslog-ng: still to do, carryover.

FTS monitoring

ATLAS releases (Xin)

  • Note
  • Alessandro's system could be extended to cover OSG sites. US ATLAS sites need to be published to the BDII, and a convention decided for the installation area.
  • A pilot-based system is possible as well. Need a working glexec to change the role from a production role to an installation role.
  • Kaushik points out that we should move towards interoperability.
  • Need to get an update w/ Torre, before making a decision.
  • Publishing info to EGEE: we should discuss this a bit. Start first with the ITB.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • TBD

-- RobertGardner - 28 Aug 2007


Attachments


facilities-status-aug07_3.ppt (101.5K) | KaushikDe, 29 Aug 2007 - 13:06 | Production Summary Slides
 