
MinutesApr22

Introduction

Minutes of the Facilities Integration Program meeting, April 22, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Pedro, Sarah, Charles, Fred, Rich, Saul, Bob, Michael, Justin, Doug, Armen, John, Nurcan, Kaushik, Mark, Tom, Wensheng, Karthik
  • Apologies: Patrick, Horst, Wei

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Planning issues
    • Discussion of the ATLAS schedule: in particular the STEP09 analysis challenge, May 25 - June 6. 300M events are going to be produced, plus two cosmic runs. June 25: reprocessing, cosmics, analysis.
    • SL5 migration preparations are underway at CERN. Need to schedule this within the facility. Execution of migration in July.
    • HEP-SPEC benchmark, see CapacitySummary.
  • Other remarks

HEP-SPEC (Bob)

  • Ran benchmarks last week - instructions from Alex
  • Opteron 285, Intel Xeon 5335, 5440
  • SPEC06 for the entire machine; reproducible across machines.
  • Numbers are of order 10 per core.
  • Michael - testing at BNL
  • Doug - testing with a supermicro
  • Factor of ~4 to convert SI2K to HEP-SPEC (a worked-example sketch follows this list).
  • Takes about 6-7 hours to run.
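
As a worked example of the figures quoted above (a sketch, not the official benchmarking procedure): assuming the conventional relation HEP-SPEC06 ≈ 4 × kSI2K and that the whole-machine score is simply divided by the core count, the numbers hang together as follows. The example machine values are hypothetical.

    def per_core_hepspec(machine_score, cores):
        """Per-core HEP-SPEC06 from a whole-machine benchmark result."""
        return machine_score / cores

    def ksi2k_to_hepspec(ksi2k, factor=4.0):
        """Rough kSI2K -> HEP-SPEC06 conversion using the ~4x factor quoted above."""
        return ksi2k * factor

    # Hypothetical example: an 8-core box scoring ~80 HEP-SPEC06 gives ~10 per core,
    # consistent with the "of order 10 per core" figure; a ~2 kSI2K core maps to ~8.
    print(per_core_hepspec(80.0, 8))   # -> 10.0
    print(ksi2k_to_hepspec(2.0))       # -> 8.0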

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • New brokering algorithm deployed a week ago - simulation jobs ramped back up - full.
    • Backlog of data needed from tape, queue filled w/ 120K requests; merges. Panda will wait a week for a job.
    • Reprocessing: 25K jobs left to do. Huge backlog from merge and pileup will slow this down.
    • We're beginning to run low on simulation jobs - may run out (5K left). The max cputime might be too high; can this be corrected? Jobs are throttled at the Bamboo level; double the limit for the US.
    • There are supposed to be 200K jobs - where are these getting held up? Pavel's scout jobs? The RW ("running wait", a.k.a. the Rod Walker limit); the US limit is 50,000.
    • Why the 2 day buffer? Increase to 4 days.
    • Production required for May stress test - still not finalized. 1 month production of mixed sample. 45 TB of AOD data.
    • Note about DDM requests - some users are asking for lots of data on tape - there is lots of load.
  • this week:
    • Lots of simulation samples to do
    • There was a throughput issue getting input files to sites at some point, but it improved about 12 hours ago.
    • Not quite finished with reprocessing jobs - about 1 day's worth remains. This will then clear the tape backlog for BNL and SLAC.
    • Plenty of activated jobs everywhere; 30K jobs
    • Lots of site issues, mainly regarding
    • MegaJam
      • JF17 sample, unbiased, with all SM processes turned on.
      • 88M event evgens already produced; subscribed to US
      • ATLFAST2
      • Borut will start 200M evgen tomorrow
      • This will be high priority jobs
      • Tier 3 request - have some fraction at every Tier 2 for testing access.
      • Tier 2s: reserve 15-20 TB for this in MCDISK.

Shifters report (Mark)

  • Reference
  • last meeting:
    • New pilot version available yesterday. After a job completes, the job log files get transferred.
    • Transformation errors are being ignored for some tasks - make a note of this in the monitoring.
  • this meeting:
    • 134K production jobs completed world-wide per day at 90%.
    • Illinois issue - there were some missing RPMs on worker nodes - good success rate now. Need the correct list of missing RPMs (compat libs) for the 64-bit RHEL OS.
    • LFC issues at swt2-cpb resolved, back into production. UTA_SWT2 - cleaning out files on the gatekeeper, probably leftover gridmanager/PBS files.
    • Tasks 595223 and 59222 had large failure rates over the weekend.
    • Three clouds have migrated to the Oracle backend database.
    • New pilot version 36h.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
  • this meeting:
    • AnalysisQueueJobTests - set of tests defined
    • AnalysisSiteCertification - facilities readiness schedule for May 28 stress test
    • AnalyQueueCertificationPrep - has profile for ANALY queues
    • AnalyQueueMetrics - performance metrics (starting with HammerCloud definitions). Paul reported that all metrics used in HammerCloud are now available for Panda jobs. Changes have been made in the dev pilot; the new pilot will be released after the new pilot testing framework is finished, within a week.
    • HammerCloud is running in the US now, http://gangarobot.cern.ch/st/, with pie charts showing the number of successes/failures. Code was provided by Alden for Panda metrics to be put into the HammerCloud code; Dan has implemented this and is now doing some tests. I encourage site admins to start looking at their site's performance. Two new mailing lists have been set up to coordinate the stress-testing efforts and for site admins to share their experience in understanding how their clusters perform under load and whether any changes in local configurations are needed, etc. They will be announced after the ADC Oper meeting tomorrow.
    • Stress testing (see the attached plot analy.png):
      • First round of stress testing is almost done with the SUSYValidation job. Only AGLT2 needs to be tested one more time (to reach a success rate of 95%). See your site's certification status at AnalysisSiteCertification.
      • The missing files in SUSY datasets at NET2, MWT2 and AGLT2 were recovered.
      • The LFC errors (lfc_creatg failed, lfc_setfsizeg failed, lfc_statg failed) at BNL were understood. Wensheng reported that this was due to an expired VOMS proxy extension (a proxy-check sketch follows this list).
      • Truncated files on disk were found at AGLT2. Shawn deleted them, but there still seems to be a problem with some input files (pilot: Get error: Copy command returned error code 256). Being investigated.
      • I set up a second job, D3PD making with TopPhysTools; instructions are available at AnalysisQueueJobTests. I ran a stress test on all sites yesterday (4/21); results can be seen at: http://panda.cern.ch:25880/server/pandamon/query?dash=analysis&processingType=stresstest&reload=yes (broken currently). These are long jobs, 12 to 16 hours, filling up the queues nicely.
      • Results to be looked at today. Will run on the container dataset once the missing tid datasets are replicated at MWT2 and NET2 (mc08.105200.T1_McAtNlo_Jimmy.recon.AOD.e357_s462_r579/).
      • I asked Mark Slater to put the SUSYValidation and D3PD-making jobs into HammerCloud.
      • Will continue setting up other jobs listed at AnalysisQueueJobTests.
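
The LFC failures above were traced to an expired VOMS proxy extension. As a minimal, hypothetical sketch (not part of any site's actual tooling), the check below uses the standard voms-proxy-info --timeleft / --actimeleft flags to warn before the plain proxy or the VOMS attribute certificate runs out; the one-hour threshold is an assumption.

    import subprocess

    def seconds_left(flag):
        # voms-proxy-info prints the remaining lifetime in seconds
        out = subprocess.run(["voms-proxy-info", flag],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.split()[0])

    proxy_left = seconds_left("--timeleft")    # plain proxy lifetime
    voms_left = seconds_left("--actimeleft")   # VOMS attribute certificate (extension) lifetime

    # Assumed threshold: warn when less than an hour remains, since LFC calls
    # (lfc_creatg, lfc_statg, ...) start failing once the extension expires.
    if min(proxy_left, voms_left) < 3600:
        print("Proxy or VOMS extension expires within an hour; renew it before running more jobs.")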

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Two problems at BNL last week - bad drivers in the Thumper resulted in some corrupted files; the problem is fixed, but cleanup is also needed at the Tier 2s. Converting PNFS IDs to PFNs now - may take a week or more. Also cleaning specific datasets - will be done today. Roughly 0.5M files could be affected.
    • A network interruption.
    • New version of DQ2 - need a Tier 2 to volunteer - AGLT2.
    • Will configure DQ2 to reject bad transfers based on the Adler-32 checksum (a checksum sketch follows this list).
    • Pilot bug whereby a file is registered in DQ2 but failed to be transferred to the SE causing DQ2 to fail the transfer back to BNL. There is an email thread.
    • Adding pnfsID to LFC; standard client API using setname field. Adding a field in ToA for fs. Can use at AGLT2 and MWT2, to reduce PNFS load.
  • this meeting:
    • AGLT2 - DQ2 has been upgraded. There are some caveats, but it went relatively easily. Suggest using the Campfire chat.
    • BNL_DQ2 - now checking for bad transfers. A few come in per day (~1 in 45K). These don't get registered in the LFC; the Panda job will be failed.
    • Saul: sees a corruption rate of about 1 in 10K.
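
Since both the DQ2 checks above and the site reports below revolve around Adler-32 verification, here is a minimal sketch (an assumption-level illustration, not the DQ2 implementation) of validating a transferred file against the catalogued checksum. Only zlib.adler32 and the usual zero-padded 8-hex-digit grid convention are assumed; the function names are hypothetical.

    import zlib

    def adler32_of(path, blocksize=1024 * 1024):
        """Adler-32 of a file, as the usual zero-padded 8-digit hex string."""
        value = 1  # Adler-32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(blocksize), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xFFFFFFFF)

    def transfer_ok(local_path, catalog_checksum):
        """True if the local copy matches the checksum recorded in the catalog."""
        return adler32_of(local_path) == catalog_checksum.lower()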

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
 
  • last week:
    • perfSONAR working well; deployment is being simplified for easy maintenance. Michael suggests having a dedicated session to review how the probe information is being presented - perhaps a tutorial at some point, once we're in a robust situation.
    • Getting closer to doing throughput benchmark tests.
    • Next week: NET2, AGLT2
  • this week:
    • Hiro changed the TCP buffer size to 8 MB (an FTS client setting); this crashed the machine. Rich suggests using autotuning on the host instead (see the sketch after this list).
    • Internet2 meeting next week in DC. Chip will speak Monday; Tuesday there is a 10-minute slot for Michael.
    • A meeting for CIOs to understand LHC usage.
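
As context for the autotuning suggestion above, a minimal sketch (assuming a standard Linux host; not an FTS or site procedure) that reports whether TCP receive-buffer autotuning is enabled and what buffer limits the kernel will allow:

    def sysctl(name):
        # Read a kernel tunable from /proc/sys (e.g. net.ipv4.tcp_rmem).
        with open("/proc/sys/" + name.replace(".", "/")) as f:
            return f.read().split()

    autotune = sysctl("net.ipv4.tcp_moderate_rcvbuf")[0] == "1"
    rmem_min, rmem_default, rmem_max = (int(v) for v in sysctl("net.ipv4.tcp_rmem"))

    print("receive-buffer autotuning enabled:", autotune)
    print("tcp_rmem min/default/max (bytes):", rmem_min, rmem_default, rmem_max)
    if rmem_max < 8 * 1024 * 1024:
        print("max receive buffer is below 8 MB; autotuning cannot reach the "
              "window a fixed 8 MB setting was trying to force")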

Site news and issues (all sites)

  • T1:
    • last week: expecting 10 more tape drives on Friday - should speed up the staging process. Will require a 3-hour downtime early next week. An additional PB of disk is now available; the BNLPANDA migration is being done by Pedro and Armen. There was a failure of one of the core switches - it took 5 hours to get back online. It was fixed within 2 hours, but there were problems with connectivity to the OPN network (to CERN and the other Tier 1's); not sure why. Line modules in the Cisco 6509 were being switched off automatically.
    • this week: Lots of storage management issues over the past week due to the job profile. 120K requests in the tape queue through dCache clogged things up; had to clean up, then it ran fine over the weekend. The 10 additional tape drives ordered have arrived and will go into production early next week, doubling the data rate to/from tape.

  • AGLT2:
    • last week: Low-level SRM failures had been going on for a long time (the "File exists" problem). Changed some postgres parameters - no errors since then; still watching. ntpd failed, the clock drifted, and this caused auth problems. Implemented the MONIT program to watch running processes. Wenjing has developed a pool balancer. Bob is working on benchmarking; the Harpertown benchmark run takes about 2 hours.
    • this week: running well right now. Working on getting rid of dark data on MCDISK.

Aside:

  • There was a ticket from GGUS to the GOC that didn't make it into RT (#12323) and went neglected for a while. Jason is maintaining the RT system. There is a manual process - it needs to be automated.
  • Jason will discuss w/ Dantong, and will follow-up w/ the GOC. Keeping Fred in the loop.

  • NET2:
    • last week(s): Running smoothly at BU and HU. The ANALY queue is full from users. Two perfSONAR machines have been up and reporting since yesterday. The direct circuit to BNL is working. Still receiving corrupt files at a rate of about 1 per 2000 - mostly from Pandamover, but some from DDM. Not sure if this is due to a network card, but it seems intermittent. HU running smoothly. Running into a situation where, if a transfer takes too long, a pilot process may be left around, resulting in a "file exists" error.
    • this week: Saul: the inventory of corrupted files is being replaced; RSV problems are being looked into. John: in replacing the data, wrote a script that does lcg-cp to re-copy each file (a sketch follows below). It sometimes hangs - the data is probably on tape. Probably should delete the bad files and re-subscribe. Jobs are running at Harvard.
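
A minimal sketch of the kind of lcg-cp replacement loop John described, with a timeout so a file that is actually on tape does not hang the whole run. The SURL, destination path, and 10-minute cutoff are hypothetical, and this is not the actual NET2 script.

    import subprocess

    TIMEOUT_S = 600  # assumed cutoff; files that time out are likely still on tape

    def replace_file(source_surl, local_path):
        """Re-copy one corrupted file with lcg-cp; False on timeout or error."""
        cmd = ["lcg-cp", source_surl, "file://" + local_path]
        try:
            subprocess.run(cmd, check=True, timeout=TIMEOUT_S)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            return False

    # Hypothetical input: (source SURL, local destination) pairs for the bad files.
    bad_files = [
        ("srm://example-se.example.edu/atlas/somedataset/file.root",
         "/data/atlas/somedataset/file.root"),
    ]
    for surl, dest in bad_files:
        if not replace_file(surl, dest):
            print("skipped (timeout/error); consider delete + re-subscribe:", surl)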

  • MWT2:
    • last week(s): Last Friday we lost two dCache pools while adding two new MD1000 shelves - 47K files lost across 4000 datasets. About 2000 subscriptions for MC_DISK. Aggravated by "write pools available" errors - in discussion with OSG storage and dcache.org. Downtime planned for tomorrow.
    • this week: last week we reported the loss of a couple of pools; working on new network settings for WAN access. MCDISK cleanup. Also dCache adjustments for direct writes from the WAN to servers on public nodes.

  • SWT2 (UTA):
    • last week:
      1. Heavy local and central clean-up of the storage starting mid-last week.
      2. Over the weekend (Sunday) the LFC daemon died a couple of times. It was re-started and stayed up for most of Monday, but then died again several times yesterday.
      3. Last night we noticed that certain SQL queries against the database were failing. Further checks indicated database corruption.
      4. Was this due to the activity in item 1? Not sure, but the coincidence seems too big.
      5. We set up a new SLC 4.7 32-bit host (to mimic our current LFC server) and installed the LFC software.
      6. The plan is to load a recent back-up of the db onto this test host and attempt to repair the corruption there. If this is successful we'll have more confidence about modifying the production version of the db.
    • this week:
      • covered above
      • FTS proxy delegation issue - happened twice. Hiro is planning a patch to FTS tomorrow.

  • SWT2 (OU):
    • last week: All is well. 100 TB storage ordered.
    • this week: All seems to be running fine, though jobs are slow(?). perfSONAR tests are going much better now - there was a fix on the OU network side, but we're not sure of the details.

  • WT2:
    • last week: The SRM problem persisted - possibly due to a bad client. Had to set up a firewall to block the traffic and kill the client; things worked fine afterwards. Central deletion runs, but not very fast (2-10 Hz), and it doesn't run all the time. Power outage tomorrow.
    • this week:

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both the Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates since he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at the Tier 2's as developments are made. - rwg
  • last meeting(s):
    • AGLT2 - two servers setup, working, talking to a front-end at BNL
  • this week:
    • Presentation at ATLAS Computing Workshop 4/15: Slides

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
    • The new system is in production.
    • Discussion to add pacball creation into official release procedure; waiting for this for 15.0.0 - not ready yet. Issue is getting pacballs created quickly.
    • Trying to get the procedures standardized so it can be done by the production team. Fred will try to get Stan Thompson to do this.
    • Testing release installation publication against the development portal. Will move to the production portal next week.
    • Future: define a job that compares what's at a site with what is in the portal.
    • Tier 3 sites - this is difficult for Panda - the site needs to have a production queue. Probably need a new procedure.
    • Question: how are production caches installed in releases? Each is in its own pacball and can be installed in the directory of the release that it's patching. Should Xin be a member of the SIT? Fred will discuss next week.
    • Xin will develop a plan and present in 3 weeks.
  • this meeting:

Tier 3 coordination plans (Doug, Jim C)

  • last report:
    • Workshop at Argonne, May 18-19 - for sites to setup Tier 3's ready for analysis stress testing
    • Rate tests between Tier 2s and Tier 3s using dq2-get, with a reasonably sized dataset available within the Tier 2 cloud.
    • Tier 3 support group to be formed. Need some input from Tier 2's.
    • A BAF xrootd/BestMan setup at BNL. Question as to whether space tokens should be implemented. Is an ANALY_BAF queue needed?
  • this report:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksum to xrootd. Alex developing something to calculate this on the fly. Expects to release this very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week

Tier3 networking (Rich)

  • last week
    • Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
    • http://events.internet2.edu/2009/spring-mm/index.html
    • Engage with the CIOs and program managers
    • Session 2:30-3:30 on Monday (April 27) to focus on Tier 3 issues
    • Another session added for Wednesday, 2-4 pm.
  • this week
    • Upcoming meeting at ANL
    • Sent survey to computing-contacts
    • A Tier 3 support list is being set up.
    • Need an RT queue for Tier 3

Local Site Mover

AOB

  • dCache service interruption tomorrow. The postgres vacuum seems to flush the write-ahead logs to disk frequently. Will increase the logging buffer (via checkpoint segments) to 1-2 GB, as well as the write-ahead logging buffers, to decrease the load while vacuuming. May need to do another one at some point. Will publish the settings.
  • OSG 1.0.1 to be released shortly.


-- RobertGardner - 21 Apr 2009

Attachments


txt day-summary-2.php.png.txt (39.2K) | RobertGardner, 22 Apr 2009 - 08:27 | Stress tests 4/21-4/22
png analy.png (39.2K) | RobertGardner, 22 Apr 2009 - 08:30 | Stress testing
 