
MinutesApr15

Introduction

Minutes of the Facilities Integration Program meeting, April 15, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Rob, Michael, Nurcan, Mark, Fred, BNL (Kaushik, Alexei, Pedro, Armen, Jim S, Torre), Saul, Hiro, Xin, Shawn, John B, Sarah, Karthik, Horst, Wei, Bob
  • Apologies: Patrick

Integration program update (Rob, Michael)

  • IntegrationPhase8 - concluded
  • Special meetings
    • Tuesday bi-weekly (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS Facilities meetings - Tier 2, Tier 3 workshops
  • ATLAS meetings this fortnight
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Other remarks:
    • Quarterly report due today
    • Discussion of ATLAS schedule: in particular STEP09 analysis challenge, May 25-June 6. 300M events are going to be produced. Two cosmic runs. June 25 reprocessing, Cosmics, analysis
    • SL5 migration preparations are underway at CERN. Need to schedule this within the facility. Execution of migration in July.
    • How long will SL4 be supported? Alexei: will ask IT tomorrow at ADC operations meeting.
    • What about next purchases - need to discuss later w/ revised ATLAS resource requests
    • HEP-SPEC benchmark, see CapacitySummary.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Next task will be large-scale reprocessing, using release 14.5.2.4/5. Files must come from tape, as an exercise. Expect this to begin tomorrow. Expect Tier 1 resources alone to be sufficient, though SLAC will be added as a Tier 2 to augment capacity.
    • Reprocessing jobs (200K jobs, only getting few K per day) & pileup - both need files from tape
    • Lots of requests from users for files from tape as well
    • Options? Merging to larger files. Processing data quickly while it's still on disk. Increasing disk or tape capacity?
    • Michael: Note - we don't have enough jobs in a state that would allow optimizing the I/O - can't do "near future" scheduling.
    • Some concern about ATLAS policy regarding requests for raw data (10% on disk).
    • Idle CPUs at Tier 2's. We have regional requests from US users, but we can't get them approved through physics coordination. There are also problems getting tasks scheduled for the US cloud by Panda - jobs getting blocked. The Panda team is working on this by classifying jobs and allowing all types to flow. However, the US currently has no assignments for evgen or simul. There has to be some care in the priority assignments - so that, e.g., reprocessing maintains its priority over simulation.
    • Reprocessing issues:
    • Increase input queue to 6000. PNFS load shouldn't be a problem - Pedro.
    • Fraction of jobs with transformation errors - US cloud getting more than its fair share. Cosmic stream reprocessing tasks - we got 80% for the US. 10K job failures. 2000 real jobs - but skimming jobs are getting flagged as an error. Pavel is allowing the jobs to fail 5 times before re-defining the transformation.
    • 62K reprocessing jobs to do.

  • this week:
    • New brokering algorithm deployed a week ago - simulation jobs ramped back up - full.
    • Backlog of data needed from tape, queue filled w/ 120K requests; merges. Panda will wait a week for a job.
    • Reprocessing: 25K jobs left to do. Huge backlog from merge and pileup will slow this down.
    • We're beginning to run low on simulation jobs - may run out (5K left). The max cputime might be too high. Can this be corrected? Throttled at Bamboo level. Double limit for US.
    • Supposed to be 200K jobs? Where are these getting held up? Pavel's scout jobs? The RW limit - "running wait", a.k.a. the Rod Walker limit? The US limit is 50,000.
    • Why the 2 day buffer? Increase to 4 days.
    • Production required for May stress test - still not finalized. 1 month production of mixed sample. 45 TB of AOD data.
    • Note about DDM requests - some users are asking for lots of data on tape - there is lots of load.

Shifters report (Mark)

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • See AnalysisQueueJobTests, from FacilityWGAP
    • Last meeting: FacilityWGAPMinutesApr7
    • Started daily stress-tests on Monday.
    • Two key issues so far. First, LFC registration failures ("file exists"): the same library input dataset names were being used - a pathena bug, now fixed. Second, an AGLT2 SRM authentication failure (GUMS); Shawn investigating. Resending to AGLT2.
    • No proper monitoring yet - checking manually. Torre, Aaron being consulted for a stress-test monitor.
    • Will run some queries against Panda DB for failures - will summarize into metrics.
    • Need to make sure input datasets are present at Tier 2s.
    • Will solicit users for job types and plans. Will measure success/failure rates - triage most critical failures. Need job timing information - to help sites optimize performance. Some of these figures have already been measured - need to collect and plot.
  • this meeting:
    • AnalysisQueueJobTests - set of tests defined
    • AnalysisSiteCertification - facilities readiness schedule for May 28 stress test
    • AnalyQueueCertificationPrep - has profile for ANALY queues
    • AnalyQueueMetrics - performance metrics (starting with Hammer cloud definitions)
    • Some progress on running HammerCloud in the US, http://gangarobot.cern.ch/st/test_240/, pie charts showing number of successes/failures, metrics still to come, the links at the bottom need to be pointed at the pandamon.
    • Summary of stress tests last week:
      • Runs on Monday (all sites), Wednesday (all sites), Thursday (AGLT2)
      • Finished/failed job count: SLAC: 238/0, MWT2: 235/3, SWT2: 209/29, NET2: 179/59, AGLT2: 146/90 (see the success-rate sketch at the end of this section).
      • LFC registration failures (burstSubmit option fixed in pathena) and AGLT2 SRM authentication failures (GUMS server upgrade) were reported last week.
      • Missing datasets/files from the SU3 and SU4 container datasets were identified at SWT2 and NET2. I placed data transfer requests; the transfers completed.
      • Jobs failures with "lfc-mkdir failed" at AGLT2: The source of the problem was a clock which had badly drifted. The clocks were updated and the 'ntpd' process was restarted.
      • Write permission issue at AGLT2: jobs were mapped to 'usatlas3' instead of 'usatlas1'. There was a problem with the pilot proxy.
      • "Get error: Staging input file failed" at AGLT2: a complete re-registration of files in dCache as of late Friday evening.
      • "Put error: Error in copying the file from job workdir to localSE" at MWT2: FRom Paul: This failure type (athena output transferred ok but not the log) put the jobs in holding state, but they could not be recovered since the pilot failed to find a control file. This was corrected in yesterdays pilot update ( version 36g).
    • Status of this week's test:
      • Attempted to run a test yesterday. Tested the new feature of pathena in version 0.1.40: an option made available by Tadashi to label stress test jobs so that they can be monitored in the Panda monitor (http://panda.cern.ch:25880/server/pandamon/query?dash=analysis&processingType=stresstest&reload=yes), --processingType=stresstest. This worked; however, I encountered a new problem with "ddm: Adder._updateOutputs() could not add files". Tadashi fixed this today in version 0.1.41.
      • A data-loss incident last week at MWT2 (lost 2 RAID arrays while bringing new storage online): the SU4 dataset disappeared, and others may have as well; working on a recovery.
      • I'll start a run today for all sites.
    • Large backlog of jobs in ANALY_BNL_LONG - jobs held over from last week. Xin: at the local Condor level, all 340 long slots are in use, at odds with the Panda monitor. Is there a Condor-G problem?
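    • As a rough illustration of the success/failure metrics discussed above, the sketch below computes a per-site efficiency from the finished/failed counts reported in last week's stress-test summary. This is illustrative Python only, not part of any production tooling; the numbers are copied from the summary above.

        # Illustrative sketch: per-site success rates from last week's
        # stress-test finished/failed job counts (counts copied from the
        # summary in these minutes).
        counts = {
            "SLAC":  (238, 0),
            "MWT2":  (235, 3),
            "SWT2":  (209, 29),
            "NET2":  (179, 59),
            "AGLT2": (146, 90),
        }

        for site, (finished, failed) in counts.items():
            total = finished + failed
            efficiency = 100.0 * finished / total if total else 0.0
            print("%-6s %3d/%-3d finished (%5.1f%% success)" % (site, finished, total, efficiency))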

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
  • this meeting:
    • Two problems at BNL last week - bad drivers in the Thumper resulted in some corrupted files; the problem is fixed, but cleanup is also needed at the Tier 2s. Converting PNFS IDs to PFNs now - may take a week or more. Also cleaning specific datasets - will be done today. Could be ~0.5 M files affected.
    • A network interruption.
    • New version of DQ2 - need a Tier 2 to volunteer - AGLT2.
    • Will configure DQ2 to reject bad transfers based on the Adler-32 checksum (a minimal checksum-comparison sketch follows this list).
    • Pilot bug whereby a file is registered in DQ2 but fails to be transferred to the SE, causing DQ2 to fail the transfer back to BNL. There is an email thread.
    • Adding the pnfsID to the LFC, via the standard client API using the setname field. Adding a field in ToA for fs. Can be used at AGLT2 and MWT2 to reduce PNFS load.
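    • A minimal sketch of the kind of Adler-32 comparison involved in rejecting bad transfers. This is not the DQ2 site-services code; the file path and catalog value below are placeholders for illustration only.

        import zlib

        def adler32_of_file(path, blocksize=1024 * 1024):
            """Compute the Adler-32 checksum of a file, reading it in chunks."""
            value = 1  # Adler-32 is seeded with 1
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(blocksize)
                    if not chunk:
                        break
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xFFFFFFFF)

        # Hypothetical usage: compare the transferred copy against the catalog value;
        # on a mismatch the transfer would be failed and retried.
        local_sum = adler32_of_file("/path/to/transferred/file.root")  # placeholder path
        catalog_sum = "1a2b3c4d"                                       # placeholder catalog value
        if local_sum != catalog_sum:
            print("checksum mismatch: transfer should be failed and retried")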

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • US decision for ATLASSCRATCHDISK needed (ref here)
    • Space clean-up at the Tier 2s. What about MCDISK and DATADISK? AGLT2 and SWT2 have already run into the problem. Big mess!
    • Hiro has installed the Adler-32 plugin for dq2 site services at BNL. It checks the value in dCache against the one in dq2, catching corruption during transfer. Running in passive mode - no corrupted files in a week. Active mode will fail the transfer if there's a mismatch.
    • Another big issue is when BNL migrates to storage tokens.
    • Pedro and Hiro are working on services to reduce load on the pnfs servers: using pnfs IDs rather than filenames, and callbacks from dCache when a file is staged rather than polling.
    • Alexei's group has developed a nice way to categorize file usage at each site. There's a webpage prototype.
    • ATLASSCRATCH deadline?
    • Wei - MCDISK - full - are there any activities to delete old data? Will ask Stephane to delete obsolete datasets.
    • Need some dataset deletions.
  • this week:

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
   USATLAS Throughput Call
             -----------------------------------
 
April  14, 2009  at 3 PM Eastern
 
Attending:  Shawn, Jason,  Horst, Karthik,  Sarah, Saul
 
General discussion of perfSONAR.   BU installation is “up” but is being worked on (firewall issue?).   We (USATLAS) need to begin evaluating how we will use perfSONAR in the near future.   Test type, duration, alerting, graphing, etc. are all important areas that will take experience to evaluate how best to configure/utilize.
 
Throughput testing needs to resume.  MWT2 (Sarah) plans to contact Hiro today about some new tests because many of the MWT2 issues have been resolved.   All other sites should try for more testing by next meeting.   Need to meet our milestones soon and should be easily achievable as sites fix all the problems we are finding.
 
Site Reports:
 
                BNL – No report
                AGLT2 – Working on bonded/trunked links for resiliency.  perfSONAR use for network debugging has resolved all known WAN issues.  Still cleaning/reorganizing dCache pools to optimize the number of storage nodes available for load testing.  Lastly, exploring jumbo frames for public NICs and want to test with MWT2 once we are ready.
                MWT2 – Resolved many network issues.  Have set up VLAN trunks on 10GE connections.  Jumbo frames are implemented on public NICs at UC.  Working on IU issues with Chelsio cards (similar UC issues resolved).  LRO problems with Myricom cards being tested.  Sarah asked about nettest10g (at BNL); Shawn told her to contact Jay Packard.  There is also an lhcmon box (10GE) at BNL, which John Bigroy manages.
                NET2 -  Working on fixing perfSONAR boxes to work with firewall settings.  Should be ready soon.  A replacement Chelsio NIC fixed the framing problems seen.  In the market for a new 10GE NIC.  Still seeing some data corruption (1/2000 files?)… need to find out if this is specific to NET2 or if everyone has this and NET2 is just the only one looking!
                SWT2 -  Still working on finding out from network folks what may have changed to improve the network measurements from perfSONAR.  Interested in 10GE tests (maybe between Michigan and OU?).
                WT2 – No report.
 
No other business.  Send along corrections and additions via email.   Plan to meet next week at the usual time.
 
Shawn
 
From: usatlas-ddm-l-bounces@lists.bnl.gov [mailto:usatlas-ddm-l-bounces@lists.bnl.gov] On Behalf Of McKee, Shawn
Sent: Tuesday, April 14, 2009 10:10 AM
To: Usatlas-ddm-l@lists.bnl.gov
Subject: [Usatlas-ddm-l] Throughput Meeting Today April 14th, 2009 at 3PM Eastern
 
Hi Everyone,
 
Our next Throughput meeting is today, Tuesday April 14th, 2009 at 3 PM Eastern time.
 
The ESnet call-in number:
 ES net phone number:
 Call: 510-665-5437
 *Dial up number does not apply to Data Only ( T-120) Conferencing
 When: April 7th, 2009, 03:00 PM Eastern/America/Detroit Meeting ID: 1234
 
The agenda is:
 
1) perfSONAR status and related issues
   a. Updates/questions?
   b. Status of the BU nodes… seem to be up but not functional?
   c. OU status
   d. BNL status
2) Throughput testing
   a. Need to complete milestones – reschedule based upon network issue resolution(s) – this week?
3) Site reports
   a. BNL
   b. AGLT2
   c. MWT2/IU/UC
   d. NET2/BU/Harvard
   e. SWT2/OU/UTA
   f. WT2/SLAC
4) AOB
 
Let me know if there are other topics we should add to the agenda.
 
Thanks,

Shawn
 

  • last week:
    • Perfsonar for Tier 3 sites - good idea, once we have a turn-key solution. Primary purpose would be as a test point for that site. Also - Tier 3's can test with "partner" Tier 2's.
  • this week:
    • Perfsonar working well; getting deployment simplified for easy maintenance. Michael suggests having a dedicated session to review how the probe information is being presented. Perhaps a tutorial at some point, when we're in a robust situation.
    • Getting closer to doing throughput benchmark tests.
    • Next week: NET2, AGLT2

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both the Tier 1 and the Tier 2 facilities. I've put John down to report regularly on updates as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at the Tier 2's as developments are made. - rwg
  • last meeting(s):
    • Sites need to identify hardware. Help available for setting up test beds.
    • Established a load balancer in front of the BNL Frontier server.
  • this week:
    • AGLT2 - two servers setup, working, talking to a front-end at BNL

Site news and issues (all sites)

  • T1:
    • last week: Have two additional 10G ESnet circuits in place. Can now establish dedicated circuits w/ all Tier 2s. Expect a second 10G link between BNL and CERN, which will need an additional ESnet circuit. Additional storage deployment - Armen gathering requirements for additional space tokens. Note - there are new numbers for requested resources, 20% lower than the October 2008 RRB. Deployment schedule granularity is now given quarterly. Discussed at the WLCG GDB meeting yesterday. Will look at numbers and schedule next week. Tier 2 numbers probably unchanged.
    • this week: Expecting 10 more tape drives on Friday - should speed up the staging process. Will require a 3 hour downtime early next week. An additional PB of disk is now available; the BNLPANDA migration is being done by Pedro and Armen. There was a failure of one of the core switches - it took 5 hours to get back online. The switch itself was fixed within 2 hours, but there were problems with connectivity to the OPN network (to CERN and other Tier 1's); not sure why. Line modules in the Cisco 6509 were being switched off automatically.

  • AGLT2:
    • last week: GUMS issues - the attempted upgrade is not working, and we currently have GK issues as a result. The problem was that GUMS was getting read timeouts under heavy load (causing authorization failures).
    • this week: Low-level SRM failures have been going on for a long time - the "file exists" problem. Changed some postgres parameters - no errors since then; still watching. ntpd failed and the clock drifted, causing auth problems; implemented MONIT to watch running processes (see the clock-offset sketch below). Wenjing has developed a pool balancer. Bob working on benchmarking; the Harpertown benchmark is running and takes about 2 hours.
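    • Since clock drift has now caused problems twice (the lfc-mkdir failures noted above and the auth problems here), a simple periodic offset check alongside MONIT can catch it early. Below is a minimal sketch, assuming the third-party ntplib package; the NTP server and the one-second threshold are placeholders, not site policy.

        import sys
        import ntplib  # third-party package; assumed available on the monitoring host

        NTP_SERVER = "pool.ntp.org"   # placeholder; a site would point at its local time source
        MAX_OFFSET_SECONDS = 1.0      # placeholder threshold

        def clock_offset(server=NTP_SERVER):
            """Return the offset (in seconds) between the local clock and the NTP server."""
            response = ntplib.NTPClient().request(server, version=3)
            return response.offset

        if __name__ == "__main__":
            offset = clock_offset()
            if abs(offset) > MAX_OFFSET_SECONDS:
                # A cron job or a MONIT 'check program' hook could alert or restart ntpd here.
                print("WARNING: clock offset %.3f s exceeds %.1f s" % (offset, MAX_OFFSET_SECONDS))
                sys.exit(1)
            print("clock offset %.3f s within tolerance" % offset)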

  • NET2:
    • last week(s): HU site is up and working well. 128 cores added at BU. 130 TB into production. Perfsonar machines up and working. New dedicated circuit to BNL - this Friday. Continuing cleanup operation from corrupted files (3K files).
    • this week: Running smoothly at BU and HU. ANALY queue full from users. Two perfSONAR machines up and reporting since yesterday. Direct circuit to BNL working. Still receiving corrupt files at a rate of 1 per 2000, mostly from Pandamover but some from DDM; not sure if this is due to a network card, but it seems intermittent. HU running smoothly. Running into a situation where, if a transfer takes too long, a pilot process may be left around, resulting in a "file exists" error.

  • MWT2:
    • last week(s): Long-standing dcache instability probably due to dropped packets in the network. Progress on network configuration at UC (many thanks to Shawn!). Found possible source of packet loss - MTU mismatch on the private VLAN. Reconfigured Dell, Cisco switches yesterday, 10G NICs. No packet loss, ethernet NIC errors, or giants reported in the switch. Studies continuing today. 21 compute nodes hopefully online this week.
    • this week: Last Friday lost two dCache pools while adding two new MD1000 shelves. Lost 47K files across 4000 datasets. About 2000 subscriptions for MC_DISK. Aggravated by "write pools available" errors - in discussion w/ OSG storage and dcache.org. Downtime planned for tomorrow.

  • SWT2 (UTA):
    • last week: All is well
    • this week: LFC - crashing a lot in the last few days. Two simultaneous data cleanup operations - central deletion and local clean-up. Of 3.5M replicas, ~400K were cleaned up locally and ~0.5M entries were deleted from the LFC centrally. File deletion still going on.
    • From Mark:
      1) Heavy local and central clean-up of the storage starting mid-last week.
      2) Over the weekend (Sunday) the LFC daemon died a couple of times. Re-started, stayed up for most of Monday, but then died again several times yesterday.
      3) Last night we noticed that certain SQL queries against the database were failing. Further checks indicated database corruption.
      4) Was this due to the activity in 1)? Not sure, but the coincidence seems too big.
      5) We set up a new SLC 4.7 32-bit host (to mimic our current LFC server), and installed the LFC software.
      6) Plan is to load a recent back-up of the db onto this test host, and attempt to repair the corruption there. If this is successful we'll have more confidence about modifying the production version of the db.

  • SWT2 (OU):
    • last week: All is well. 100 TB storage ordered.
    • this week: Everything running fine.

  • WT2:
    • last week: SRM died this morning, otherwise all okay. April 16 power outage.
    • this week: SRM problem persisting - possibly due to a bad client. Had to set up a firewall rule to block the traffic and kill the client; worked fine afterwards. Central deletion is running, but not very fast (2-10 Hz), and it doesn't run all the time. Power outage tomorrow.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
    • The new system is in production.
    • Discussion to add pacball creation into official release procedure; waiting for this for 15.0.0 - not ready yet. Issue is getting pacballs created quickly.
    • Trying to get the procedures standardized so it can be done by the production team. Fred will try to get Stan Thompson to do this.
    • Testing release installation publication against the development portal. Will move to the production portal next week.
    • Future: define a job that compares what's at a site with what is in the portal.
    • Tier 3 sites - this is difficult for Panda - the site needs to have a production queue. Probably need a new procedure.
    • Question: how are production caches installed in releases? Each is in its own pacball and can be installed in the directory of the release that it's patching. Should Xin be a member of the SIT? Fred will discuss next week.
    • Xin will develop a plan and present in 3 weeks.
  • this meeting:

Tier 3 coordination plans (Doug, Jim C)

  • last report:
    • Doug would like to report bi-weekly.
    • Would like to consider Tier 2 - Tier 3 affinities - especially with regard to distributing datasets.
    • Writing up a twiki for Tier 3 configuration expectations
    • Will be polling Tier 3's for their expertise.
    • Tier 3 meeting at Argonne, mid-May, for Tier 3 site admins.
    • Should Tier 3's have perfSONAR boxes? The question is the timeframe for deployment. To be discussed at the throughput call.
  • this report:
    • Workshop at Argonne, May 18-19 - for sites to set up Tier 3's ready for analysis stress testing.
    • Rate tests between Tier 2s and Tier 3s using dq2-get, with a reasonably sized dataset available within the Tier 2 cloud (a rough timing sketch follows this list).
    • Tier 3 support group to be formed. Need some input from Tier 2's.
    • BAF xrootd-bestman setup at BNL. Question as to whether space tokens should be implemented? Need an ANALY_BAF queue?
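    • A rough sketch of the kind of dq2-get rate test mentioned above, written as a timing wrapper. It assumes dq2-get is on the PATH, accepts a dataset name as its only argument, and writes the files into a directory named after the dataset (client behaviour and options vary by version); the dataset name shown is a placeholder.

        import os
        import subprocess
        import time

        def dir_size_bytes(path):
            """Total size in bytes of all files under 'path'."""
            total = 0
            for root, _dirs, files in os.walk(path):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            return total

        def time_dq2_get(dataset):
            """Run 'dq2-get <dataset>' and print an approximate average rate in MB/s."""
            start = time.time()
            subprocess.check_call(["dq2-get", dataset])  # assumes default client behaviour
            elapsed = time.time() - start
            size_mb = dir_size_bytes(dataset) / 1e6      # assumes output lands in ./<dataset>/
            print("%s: %.1f MB in %.0f s -> %.1f MB/s" % (dataset, size_mb, elapsed, size_mb / elapsed))

        if __name__ == "__main__":
            time_dq2_get("mc08.EXAMPLE.AOD.e123_s456_r789")  # placeholder dataset name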

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksum to xrootd. Alex developing something to calculate this on the fly. Expects to release this very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week

Tier3 networking (Rich)

  • last week
    • Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
    • http://events.internet2.edu/2009/spring-mm/index.html
    • Engage with the CIOs and program managers
    • Session 2:30-3:30 on Monday, April 27, to focus on Tier 3 issues
    • Another session added for Wednesday, 2-4 pm.
  • this week

Local Site Mover

AOB

  • None.


-- RobertGardner - 14 Apr 2009



Attachments


ppt ATLAS-Summer09-Schedule.ppt (106.0K) | RobertGardner, 15 Apr 2009 - 08:51 | ATLAS run schedule
pdf ATLAS_resource_requests_Apr09.pdf (308.8K) | RobertGardner, 15 Apr 2009 - 09:00 | Revised ATLAS resource requests
 