r11 - 10 Oct 2007 - 15:03:43 - RobertGardner

MinutesOct10

Introduction

Minutes of the Facilities Integration Program meeting, October 10, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Michael, Rob, Horst, Karthik, Tom, Saul, Wei, Dantong, Fred, Patrick, Jay, Hiro, Xin, Wensheng, Nurcan, Kaushik, Alexei
  • Apologies: Bob, Shawn

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Plagued by DQ2 problems - concerns about DQ2 0.4, the same version as on LCG, and on BNLPANDA since Sunday. Good discussions with Miguel last week - lots of new developments. The real issue: are these production quality? Serious impact on production over the past week. Communications between BNL and CERN are creating timeouts/truncation - needs fixing in the DQ2 code. Now have site services problems. BNL-fetcher.
    • LRC-BNL timeout problems (Hiro) - the current web-interface code creates temp files and is supposed to remove them. Lots of temp files did not get deleted, and there were problems with queries. Patrick noticed a problem with the query XML output. (This is the code that provides http access to the LRC.)
    • User analysis impact - dCache was hammered and started experiencing problems on Friday. A single user submitted 0.25M jobs through pathena, with failure rates above 90%; the user was apparently not concerned. Results / files were archived to tape, and there was massive logfile creation. Shut down the user analysis queues. Have set up quota enforcement in Panda.
    • Eowyn - could not get sufficient jobs - Karthik discovered a built-in limit of 30K jobs, which we reached due to the number of transfer jobs. Another Eowyn instance will be hosted by Rod Walker for Canada, France.
  • Shifters report and other production issues (Nurcan)
    • Will try the OU ITB site again - 10K pilots sent, but there are problems running there. Need help from Marco and Xin.
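
A note on the LRC temp-file leak reported above: the usual guard is to delete the scratch file in a finally block, so it is removed even when the query fails partway through. The following is only a minimal sketch of that pattern, not the actual LRC web-interface code; the function name render_query_result, the work_dir parameter and the XML layout are all hypothetical.

```python
import os
import tempfile
from xml.sax.saxutils import escape


def render_query_result(rows, work_dir=None):
    """Write rows to a scratch XML file, read it back, and always delete it.

    Hypothetical helper: the name, arguments and XML layout are assumptions
    made for illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=".xml", dir=work_dir)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("<results>\n")
            for row in rows:
                # Escape &, < and > so that file names cannot break the XML.
                handle.write(f"  <file>{escape(row)}</file>\n")
            handle.write("</results>\n")
        with open(path) as handle:
            return handle.read()
    finally:
        # Runs on success and on failure, so no scratch file is left behind.
        os.remove(path)
```

The key design choice is that cleanup does not depend on the request completing normally, which is the case where leftover files tend to accumulate.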

Operations: DDM (Alexei)

  • DQ2 0.4 status. Deployed on LCG production sites 3 weeks ago. Functional tests at BNLDISK and UTA were successful. Request from Miguel for more changes to DQ2 0.4 - then ready for deployment on Friday. There were additional problems with the DQ2 client (an old bug resurfaced).
  • Three hosts: running the file catalog, agents, and services on separate hosts is recommended. More specifically, the agents need to run on a relatively powerful machine.
  • For GL, to test the fair share.
  • M4 has been reprocessed at CERN, ready for distribution (1.4 TB). Will start subscriptions from CERN to BNL. Old versions can be deleted on sites.

DQ2 0.4 testing (Hiro, Patrick)

  • See further DQ2SiteServices to capture deployment experience, known issues.
  • Hiro: not much different from previous versions - running with agents and queue catalog on same machine without problems (on BNLDISK). Upgraded BNLPANDA already.
  • Backlog issue: too many obsolete datasets - sitting in sub catalog in central services, taking up the queue.
    • Alexei: BNLDISK - 780K files - will send around the subscriptions. Notes that there were problems with Tadashi's subscription cancellations at the central catalogs due to stability issues, with errors not returned.
    • There was a problem with the DQ2 site services machine communicating with central catalogs.
  • UTA: installation complete. Have pulled db-release files, but no large scale tests.

FTS (Hiro)

  • Experience with the deployed FTS 2.0 in past two weeks
  • FTS monitoring - update?

Mysql LRC (John)

  • Evaluation progress

Accounting (Shawn, Rob)

  • Follow-up on pending accounting issues, see AccountingP2 and items therein.
  • WT2 - reporting correctly at Gratia level
  • NET2 - should be reporting correctly from October onwards.
  • SWT2 - normalization factors missing. DPCC should be okay. For UTA_SWT2 - need to change the machine doing the Gratia reporting. Need to make sure data is being attributed to ATLAS correctly. Expect to be complete be
  • AGLT2 - reporting okay, just some historical items.

Load testing update, issues (Jay)

  • iperf is there, but not at all the sites. Plan is to use bwctl at all sites.
  • There is a question of the accuracy of the plots.
  • Load test graph:
    loadtest_10_10_2007.jpg
  • Need an iperf endpoint for BU and Michigan.
  • BU will install iperf.

Network Performance and Throughput initiative (Shawn, Dantong)

  • See work in progress at NetworkPerformanceP2
  • Follow-up with Dantong on UTA, SLAC, BU, OU sites
  • BNL to UTA: RTT 54 msec, 8 MB buffer size chosen. Found 30-40 Mbps per stream from UTA to BNL, but 100 Mbps in the other direction. Applied tunings, but no improvement. Discovered a firewall at the UTA site. Started a large number of iperf streams - found 500 Mbps with 10 streams.
  • Will work with SLAC next week.
  • Defer BU until SLC4 and 10G fiber installed.
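
For context on the BNL-UTA numbers above, a rough bandwidth-delay calculation shows how far the measured rates sit below what an 8 MB buffer allows at 54 msec RTT. This is a sketch only: it assumes "8 MB" means 8 MiB, that a single TCP stream is limited to one window per round trip, and, for the second figure, a hypothetical 1 Gbps path (the minutes do not state the link speed).

```python
def window_limited_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound for one TCP stream: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def bdp_bytes(rate_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_bps * rtt_seconds / 8


RTT_S = 0.054                     # 54 msec, from the minutes
BUFFER_BYTES = 8 * 1024 * 1024    # assumption: 8 MB taken as 8 MiB

# Ceiling for a single stream with this buffer and RTT (about 1.24 Gbps).
ceiling_bps = window_limited_throughput_bps(BUFFER_BYTES, RTT_S)

# Buffer needed to fill an assumed 1 Gbps path (hypothetical link speed).
needed_bytes = bdp_bytes(1e9, RTT_S)

# Average per-stream rate in the 10-stream test that reached 500 Mbps.
per_stream_bps = 500e6 / 10

print(f"single-stream ceiling: {ceiling_bps / 1e6:.0f} Mbps")
print(f"buffer to fill 1 Gbps: {needed_bytes / 1e6:.2f} MB")
print(f"10-stream average:     {per_stream_bps / 1e6:.0f} Mbps per stream")
```

Under these assumptions the buffer size is not the bottleneck: the observed 30-40 Mbps per stream is far below the window-limited ceiling, which is consistent with something else in the path (such as the firewall found at UTA) limiting each stream.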

RSV, Nagios, SAM (WLCG) site availability monitoring program (Dantong, Tomasz)

  • Tomasz is working on RSV probes - finding they are slow. Job manager test - taking 5 minutes.
  • SE probe already sent to Arvind.

Site news and issues (All Sites)

  • T1: working on decoupling services to support individual users and monitor the systems. Coarse grain monitoring of storage usage. In addition, DQ2 tests and other
  • AGLT2: racking up new servers in Ann Arbor; at MSU: Dell order is in, 50 quad-core compute nodes, 200 TB disk (in next 10 days); room to be completed in the next couple of weeks.
  • NET2: problems cleaning up disk from cleanse script (due to bnl-lrc); completed.
  • MWT2: working on integrating IU_OSG; UC_ATLAS_MWT2 memory upgrade; 20 new dual duals coming online to MWT2_UC soon.
  • SWT2_UTA: working on getting the next round of equipment installed and running. Working on cleanse.
  • SWT2_OU: bringing equipment online this week.
  • WT2: no news.

RT Queues and pending issues (Dantong, Tomasz)

Carryover action items

Panda release installation jobs

  • Need to find a Facilities person to work with Tadashi

Analysis Queues (Bob, Mark)

  • See AnalysisQueueP2
  • Action item: Mark will provide similar instructions for PBS. - Mark still working on it.
  • Main problem is that the set of AODs is not available.
  • Action items moving forward (each site):
    • We need to set up analysis queues
    • Allocate a small number of CPUs to this site

Syslog-ng

  • Encryption for syslog-ng: still to do, carryover.

Site performance jobs and metrics

  • Carryover

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 09 Oct 2007


Attachments


loadtest_10_10_2007.jpg (92.8K) | JayPackard, 10 Oct 2007 - 12:05 | Load test graph