
MinutesNov7

Introduction

Minutes of the Facilities Integration Program meeting, November 7, 2007
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Charles, Nurcan, Mark, Kaushik, Saul, Jay, John, Gabriele, Hiro, Xin, John B, Wei, Horst, Karthik, Torre

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Production still up and down with very few jobs. There are lots of jobs assigned but not activated; the problem is RDO files on tape that cannot be fed out to the sites (which process them quickly).
    • Michael: currently about 120 TB of disk is available - move to a larger BNLPANDA disk-only area. In total over 1100 TB is available, but much of it has been assigned to dedicated physics tasks and service challenges, whose meaning and role are changing. Starting the FY08 installation process - the first batch of upgrades should be available in January.
    • Gabriele: should be more in contact with the production managers as to how to use the new scratch area. Plans a short visit during the Tier1 meeting next week (Thursday). Also need to publish which datasets are available on disk.
    • Everything on BNLPANDA is in a mixed disk/tape situation, with no management or factorization - for example, frequently needed datasets can get flushed off disk. Pin files on disk at the dataset level.
  • Production shift report (Mark)
    • Input files
    • MWT2 - DQ2 0.4 upgrade complete
    • siteinfo.py updated
    • Pusher process died, restarted
    • New release of Pilot3 installed at BNL
    • BNL - two operators available until midnight (Kevin/Enrique) to cover BNL site issues after hours. Will post contact info on the shift twiki.
    • eLog - a student at UTA is working on this. Need to find a machine - hope to have it in a week or two.

Operations: DDM (Alexei)

  • Status of M5 processing and distribution of datasets to the facility

DQ2 0.4 deployment (Hiro, Patrick, Shawn)

  • See DQ2SiteServices, which captures deployment experience and known issues.
  • Next site: MWT2 - done.
  • Patrick reports a DQ2 host failed - not sure whether it was a hardware failure. He also notices a large number of zombie processes wrapped around glite-transfer processes (possibly a status call). Wei reports the same at SLAC (3 started since November 3, still there). Also happening at BU, AGLT2 (1500), and UC. Submit as a DDM Savannah ticket. Could be an FTS issue. A sketch for spotting these zombies follows.
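
A minimal sketch for spotting the zombies described above, assuming they show up as defunct children on the DQ2 site-services host (the script and its output format are illustrative, not part of the minutes): it scans /proc on a Linux host and counts defunct processes whose command name starts with glite-transfer.

    #!/usr/bin/env python
    # Sketch only: count defunct ("zombie") glite-transfer* processes on a Linux
    # host by reading /proc/<pid>/stat. Could run from cron as a quick check
    # before filing or updating the DDM Savannah ticket.
    import os, re

    def count_zombie_glite_transfers():
        count = 0
        for pid in os.listdir('/proc'):
            if not pid.isdigit():
                continue
            try:
                f = open('/proc/%s/stat' % pid)
                stat = f.read()
                f.close()
            except IOError:
                continue  # process exited while we were scanning
            # /proc/<pid>/stat format: "pid (comm) state ..."
            m = re.match(r'\d+ \((.*)\) (\S)', stat)
            if m and m.group(2) == 'Z' and m.group(1).startswith('glite-transfer'):
                count += 1
        return count

    if __name__ == '__main__':
        print('defunct glite-transfer processes: %d' % count_zombie_glite_transfers())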

Analysis Queues (Bob, Mark)

  • See AnalysisQueueP2 - Nothing changed since last week.
  • Looking at sites with analysis queues - starting with sites with Condor queues. Proceed with OU first (waiting for it to come back online). Can start by getting an entry in siteinfo.py. (Horst: still working on the Ibrix issue - hopes to get back to it later this week or early next week.)
  • Then move onto PBS/LSF queues.
    • Action item: Mark will provide similar instructions for PBS. Mark is still working on it.
    • Set up one machine, solely reserved for analysis (a Condor sketch follows this list).
    • Wei - running LSF - will probably run another queue using fairshare; what share to give it? Need to discuss this in depth.
    • Site admins: please contact Bob at ball@umich.edu to discuss setting up analysis queues.
  • Can test w/ event generator jobs. Can direct to a specific site, need analysis pilots going in. Needs to be based on pilot3.
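
For sites starting on the Condor side, one possible way to reserve a single machine for analysis pilots is a START expression keyed on a job attribute. This is a sketch under assumed conventions only - the attribute names IsAnalysisNode and AnalysisJob are illustrative, not an agreed facility standard:

    # condor_config.local on the one machine reserved for analysis (sketch)
    IsAnalysisNode = True
    STARTD_ATTRS = $(STARTD_ATTRS), IsAnalysisNode
    # Only start jobs that declare themselves analysis jobs
    START = (TARGET.AnalysisJob =?= True)

    # In the analysis pilot's Condor submit file (sketch)
    +AnalysisJob = True
    requirements = (IsAnalysisNode =?= True)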

Accounting (Shawn, Rob)

Follow-up on issues (see AccountingP2).

Network Performance and Throughput initiative (Dantong)

  • See work in progress at NetworkPerformanceP2
  • OU progress - cannot get good performance with a single stream (~250 Mbps); up to 940 Mbps with multiple threads. Also some directional dependence.
  • Follow-up on BU - no changes to atlas.bu.edu.
    • Worked with Augustine and Shawn step-by-step; changed the TCP buffer size - BNL-to-BU now at 950 Mbps. BU is now on 10 Gbps (a tuning sketch follows this list).
    • Finding lower performance BU-to-BNL due to 2% packet losses, killing TCP performance.
    • Traced to a problem with a dirty fiber causing CRC errors at the NOX.
  • This week - fix BU, do OU.
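
For reference, the host tuning involved is of this kind: raising the kernel's TCP buffer limits so a single stream can fill a long, high-bandwidth path. The values below are illustrative only, not the exact settings applied at BU or OU:

    # /etc/sysctl.conf (illustrative values; apply with "sysctl -p")
    # Maximum socket buffer sizes
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    # min / default / max for TCP receive and send buffer autotuning
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216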

Throughput initiative - overview (Shawn)

  • First step is to make sure the hosts are properly tuned, above.
  • Hiro is preparing the files and has initiated a transfer: 3.6 GB files in a test subscription.
  • Ramping up to a higher rate, using FTS controlled by Hiro (a hand-submission example follows).
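
For sites wanting to reproduce a single transfer by hand outside the subscription machinery, the gLite FTS client takes a service endpoint plus source and destination SURLs. The endpoint and SURLs below are placeholders, not the actual BNL endpoint or load-test files:

    # Placeholders only - substitute the BNL FTS endpoint and real SURLs
    glite-transfer-submit -s https://fts.example.gov:8443/glite-data-transfer-fts/services/FileTransfer \
        srm://source.se.example.gov/pnfs/example.gov/data/loadtest/testfile \
        srm://dest.se.example.gov/atlas/dq2/loadtest/testfile
    # Poll the returned job ID for status
    glite-transfer-status -s https://fts.example.gov:8443/glite-data-transfer-fts/services/FileTransfer <job-id>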

Load test displays, issues from the last week (Jay)

  • 1 stream:
    1stream.png

  • 12 streams:
    12streams.png

  • Making live graphs available on a web page via the MonALISA repository.
  • Looking into gridview plots via web service publisher
  • Focus on AGLT2

OSG

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • No change regarding the RSV probes.
  • There were some firewall issues creating false alarms.
  • Plan to split this into two instances - one "internal" and one "external". Need some hardware for the second server. Target: before the SLAC meeting.

Site news and issues (All Sites)

  • T1: Lots of issues regarding missing data on disk necessary for production. Currently investigating the dCache-to-HPSS service. M5 data taking has stopped. US ATLAS did an outstanding job in getting the data: more than 72,000 files in over 2,000 datasets. 100% of raw data (all except one file) replicated to BNL. Transfers started on Tuesday of last week - since then more than 40 TB, about 20 TB/day, 250-300 MB/s sustained over multiple days. Written to tape and disk. Internal bandwidth 500-600 MB/s.
    • OSG update to 0.8: one server this week, the second during the week of the 15th.
  • AGLT2: OSG 0.8 - worked fine (outside AFS). Upgraded to Condor 6.9.4. All working well. Problem with channels in FTS; create a channel per SE.
  • NET2: Update on accounting issue - looking at WLCG accounting - "not obviously wrong". Problem with cleanse.py. DB release subscription. Upgrade to OSG 0.8 right after SC07.
  • MWT2: DQ2 0.4 upgraded. At IU_OSG, having trouble with GPFS. OSG 0.8 in the next week.
  • SWT2_UTA: No major hardware issues for the last week. Busy getting latest cluster online - storage characterized. OSG 0.8 before SLAC meeting.
  • SWT2_OU: bug-fix RPM for the Ibrix issues. DQ2 LRC is working - latest DB releases. Expect to get 10 Gbps into the machine room in about a month. Expect to ask Xin for releases this afternoon.
  • WT2: Still working on prep for the Tier2 meeting. Working on remote access. OSG 0.8 before Thanksgiving. Filesystem interface to Xrootd. 10 Gbps network upgrade postponed to next year.

RT Queues and pending issues (Tomasz)

Carryover action items

Panda release installation jobs

  • Wei is working with Tadashi; fixed a proxy problem.

Syslog-ng

  • Encryption for syslog-ng: still to do, carryover (a hedged stunnel sketch follows).
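
One common way to add encryption without native TLS support in the deployed syslog-ng is to tunnel the TCP destination through stunnel; this is an assumption about the approach, not a decision from the meeting, and all host names, ports, and paths below are illustrative:

    ; stunnel.conf on a sending host (illustrative)
    client = yes
    [syslog-tls]
    accept  = 127.0.0.1:5140
    connect = logcollector.example.gov:6514

    ; stunnel.conf on the central collector (illustrative)
    cert = /etc/stunnel/syslog-collector.pem
    [syslog-tls]
    accept  = 6514
    connect = 127.0.0.1:514

The sending host's syslog-ng destination then points at the local stunnel endpoint, e.g. tcp("127.0.0.1" port(5140)), and the collector keeps its existing tcp(port(514)) source bound to localhost.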

Site performance jobs and metrics

  • Carryover; some benchmarking work with quad-core Opterons.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 05 Nov 2007



Attachments


png 12streams.png (69.2K) | JayPackard, 07 Nov 2007 - 08:50 | 12 streams
xls ATLAS-CE-benchmark.xls (34.5K) | MarcoMambelli, 07 Nov 2007 - 13:46 |
png 1stream.png (70.6K) | JayPackard, 07 Nov 2007 - 08:50 | 1 stream
 