
MinutesMar11

Introduction

Minutes of the Facilities Integration Program meeting, March 11, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Sarah, Pedro, Saul, Michael, Rich, Shawn, Patrick, Mark, Fred, Douglas, Karthik, John, Wensheng
  • Apologies: Nurcan, Kaushik (late)
  • Guests:

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • End of month - reprocessing and DDM stress test; schedule unknown - still beginning of March? Graeme - 'son of 10M transfer'.
    • Torre's development/migration update: schema and bulk data migrated to Oracle; Martin Novak at CERN is working on this. Everything will be affected: autopilots, schedulers, etc. Also migrating to SVN. For monitoring, both Oracle and MySQL will be supported. The CERN-IT Panda instance will be replaced with a CERN Oracle instance - clouds will be added after that. Expect completion after re-processing. BNL instances will be deprecated for ATLAS purposes; there may still be some OSG usage.
    • Saul - reports lots of jobs getting killed at the moment.
    • Wall time limits - need a survey. Recommendation is 48 hours minimum, 72 hours preferred. AGLT2: 72; NE: no limit; MW: 120; OU: 48; SW: 75; SLAC: 18 hours (needs to be raised).
  • this week:
    • Reprocessing validation jobs on-going, limited by what can be pulled from tape.
    • Large number of MC jobs to backfill.
    • Will there be competing requests for re-processing, MC production?

Shifters report (Mark)

  • Reference
  • last meeting:
    • There was a problem with input files being on tape - cleared up early in the week.
    • UTD - still working on getting them back into production.
    • AGLT2 - working on an issue with GUMS servers.
    • OSCER integration? Horst working w/ Paul;
    • SIGKILL - affecting 5 or 6 sites - perhaps a problem between submit host and gatekeeper
    • See Yuri's summary
  • this meeting:
    • Marco is on shift
    • Prior issues mostly resolved.
    • Time limit in batch system increased at SLAC.
    • A few pilot version updates: 34a is a major update; 34b is a minor patch. See Paul's email.
    • DDM dashboard improved
    • UTD-HEP job failures similar to failures at PIC. Lots of discussion and mail threads; seems to be site-specific and release-specific. No good answer yet.
    • Potential difference in Adler32 checksums based on 32-bit versus 64-bit Python (a minimal illustration follows this list). Paul thinks this is being handled correctly in the pilot; give feedback to Paul. Shawn checked pilots at AGLT2 - looked okay.
    • NET2: Saul reports 50-90% of files in proddisk have wrong Adler32 checksums for files > 2 GB, all DAC files. High failure rates last night. These are input files from Panda mover; John will be checking this. Kaushik believes this is a different problem than the AGLT2 issue.
    • BNL_PANDA was down - back up now, backlog being reduced.
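
As background on the 32-bit/64-bit point above: on some 32-bit Python builds zlib.adler32 returns a signed (possibly negative) integer, while 64-bit builds return the unsigned value, so the same data can appear to have two different checksums unless the result is masked to 32 bits. A minimal sketch of the fix, not the pilot's actual code:

    import zlib

    def adler32_hex(data, value=1):
        # zlib.adler32 can return a negative (signed) result on 32-bit
        # Python builds; masking to 32 bits gives the same hex string
        # on every platform.
        return '%08x' % (zlib.adler32(data, value) & 0xffffffff)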

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Another 10M transfer jobs planned - mid-Feb. During this phase there will be real throughput tests combined with the stress tests. And planning to include the Tier 2's.
    • UC - large number of errors in the DQ2 logs; Hiro will send dq2ping datasets. Errors of the form 'failed to submit FTS transfer'.
    • AGLT2 - fixed problems last night
    • WISC - there was a firewall issue
    • Deletion program completed - will start monitoring this week.
    • DDM SAM test - one dataset per hour to every site, see:
  • this meeting:
    • BNL_PANDA and other areas upgraded to the latest DQ2 release, 1.2.4. Hiro suggests other sites wait until it gets cleaned up. It's a major upgrade: services can be split by share (important for BNL and AGLT2), and it includes blacklist/whitelist subscription DNs.
      • Expect the backlog to disappear in an hour.
      • See the monitoring tool above - new summaries added.
      • dq2ping datasets got removed from the central catalog - it's working for the Tier 2's.
      • Dataset deletion service - how to monitor?

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
  • this week:
    • MinutesDataManageMar10
    • US decision for ATLASSCRATCHDISK needed (ref here)
    • Shawn has a list of files produced on a faulty node. Checking checksums - 130 wrong, 6 correct (verifying this). The list will need to be removed from DQ2 and sent along to Stephan. Believes this is isolated.
    • Sites should check that only Adler32 is being used.
    • Proposal: the pilot should check the checksum both locally and after the file is placed in storage, before registration.
    • For xrootd sites there is a program that computes Adler32; it is an additional program, not part of xrootd (a sketch of such a checker follows this list).
    • GPFS/IBRIX - there is a POSIX program.
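
On the two checksum points above (the pilot verification proposal and the standalone Adler32 tool for xrootd sites), a minimal sketch of what such a checker does: stream the file in fixed-size blocks so large files never have to fit in memory, then compare the local value with the one the storage reports before registration. The paths below are hypothetical:

    import zlib

    def adler32_file(path, blocksize=1024 * 1024):
        # Adler32 is chainable, so the running value can be updated
        # block by block instead of reading the whole file at once.
        value = 1  # Adler32 seed
        with open(path, 'rb') as f:
            block = f.read(blocksize)
            while block:
                value = zlib.adler32(block, value) & 0xffffffff
                block = f.read(blocksize)
        return '%08x' % value

    # Hypothetical usage: verify local and storage copies agree before
    # registering the file.
    # assert adler32_file('/local/copy') == adler32_file('/storage/copy')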

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Horst - all okay w/ Bestman-Gateway (NFS and Ibrix backends); still working
    • Wei - no updates; may need some space token updates
    • Patrick - will be installing bm-gw from VDT on production cluster
    • Doug - Bestman install.
    • Sarah - been going through install process; getting issues sorted out.
  • this week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate the checksum on the fly and expects to release it very soon; want to supply this to the gridftp server (an illustration of the idea follows this list).
    • Need to communicate w/ CERN regarding how this will work with FTS.
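
Alex's on-the-fly calculation lives inside xrootd itself (C++), so the sketch below is only an illustration of the idea in Python: keep a running Adler32 as data is written, so the checksum is available the moment the transfer completes, with no second pass over the file.

    import zlib

    class Adler32Writer(object):
        """Wrap a writable file object and maintain a running Adler32."""
        def __init__(self, fileobj):
            self._f = fileobj
            self._value = 1  # Adler32 seed
        def write(self, data):
            # Update the checksum on the fly, then pass the data through.
            self._value = zlib.adler32(data, self._value) & 0xffffffff
            self._f.write(data)
        def checksum(self):
            return '%08x' % self._value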

Tier3 networking (Rich)

  • last week
    • Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
  • this week

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
  • last week:
    • No meeting
    • Timeline
      • Existing perfSONAR sites should reconfigure their "Communities of Interest" this week (LHC USATLAS)
      • AGLT2_UM will document steps needed to setup regular USATLAS peer tests for perfSONAR by next throughput meeting.
      • Hiro (& Jay for graphics?) will create a prototype standardized dataset copy test to measure end-to-end throughput by the next meeting.
      • Missing perfSONAR sites need to update the spreadsheet to provide their timelines ASAP
    • Rich: Also discussing Nagios and syslog-ng extensions for monitoring these boxes
    • Patrick: LHC and USATLAS are two separate communities of interest
    • Michael: Last week's reviewers felt perfSONAR is a good tool for monitoring our infrastructure.
  • this week:
    • Getting ready to get back to disk-to-disk throughput tests (a rough sketch of such a measurement follows this section).
    • Collecting data at Brookhaven from 6 sites. Working on getting data out of the database - troubleshooting. See email from Rich today about what to check on the machines.
    • Shawn is creating instructions on how to bring up monitoring hosts.
    • The BNL site has plots showing traffic.
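
The standardized dataset copy test is not spelled out in the minutes; purely as a rough illustration, a disk-to-disk throughput measurement amounts to timing a copy and reporting MB/s. The file names below are hypothetical, and the file should be large enough to make the timing meaningful:

    import os
    import shutil
    import time

    def copy_throughput_mbs(src, dst):
        # Time a straight file copy and report end-to-end MB/s.
        size = os.path.getsize(src)
        start = time.time()
        shutil.copyfile(src, dst)
        elapsed = time.time() - start
        return size / (1024.0 * 1024.0) / elapsed

    # Hypothetical usage:
    # print(copy_throughput_mbs('/data/testfile', '/dest/testfile'))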

Release installation validation (Xin)

The issue of validating the presence and completeness of releases on sites.
  • The new system is in production.
  • Discussion to add pacball creation into the official release procedure; waiting for this for 15.0.0 - not ready yet. The issue is getting pacballs created quickly.
  • Trying to get the procedures standardized so they can be done by the production team. Fred will try to get Stan Thompson to do this.
  • Testing release installation publication against the development portal. Will move to the production portal next week.
  • Future: define a job that compares what's at a site with what is in the portal.
  • Tier 3 sites - this is difficult for Panda; the site needs to have a production queue. Probably need a new procedure.
  • Question: how are production caches installed in releases? Each cache is its own pacball and can be installed in the directory of the release it patches. Should Xin be a member of the SIT? Fred will discuss next week.
  • Xin will develop a plan and present in 3 weeks.

Site news and issues (all sites)

  • T1:
    • last week: problems w/ analysis jobs over the weekend due to a large backlog of stageout requests; Xin debugged on Monday and moved to a more powerful machine, so the queue is better served now. dCache upgrade yesterday, 1.9.09. Backend Oracle work for FTS and LFC. We need a new version of DQ2 site services - a pressing issue because of the registration backlog. Deployment of 30 Thors in progress, 10G NIC driver matches; 5 units to be installed by week's end. FTS migration to version 2.1 this week. Network - making progress w/ dedicated circuits; the first, UC-BNL, is in place; now working on BNL-BU, discussion on Friday (need another meeting next week). Next would be AGLT2.
      • Dantong: VOMS certificate updated at BNL; FTS 2.1 upgrade went well - moved to LHC OPN network.
      • Hiro will provide a script for sites to update voms host certs at Tier 2 sites.
    • this week:
      • no report

  • AGLT2:
    • last week: sshd incident discussion
    • this week: everything fine except the one node. 6 files were okay, checked with BNL. SRM auth difficulties yesterday.

  • NET2:
    • last week(s): BU (Saul): 224 new Harpertown cores have arrived, to be installed. New storage not yet online - HW problems w/ DS3000s (IBM working on it). HU (John): gatekeeper load problems of last week were related to polling old jobs; fixed by stopping the server and removing old state files. Also looking into Frontier. Frontier evaluation meeting Fridays at 1pm EST, run by Dantong (new BNL mailing list). Fred notes BDII needs to be configured at HU.
    • this week:
      • Communicating with Hiro via email about the Adler32 issue. HU - there was a bug in NFS in the RH5 kernel causing lots of problems (gatekeeper loads, slow pacman installs): kernel 2.6.18-53, replaced with 2.6.18-92. The bug caused an order of magnitude more lookups, especially with lots of modules in the LD library path. Expect to bring HU back online soon.

  • MWT2:
    • last week: 21 new compute servers (PE1950); 52 TB of storage to be added.
    • this week: looking into latency issues with xrootd. Getting some strange behavior with XrootdFS. (Had a new data server on the Bestman server.) In communication with Wei.

  • SWT2 (UTA):
    • last week:
      • UTA_SWT2 failed Saturday; still diagnosing problems. Otherwise nothing to report.
    • this week:
      • Had a problem with the CE gatekeeper on SWT2_UTA - needed to reboot; probably something in Ibrix. Running into an issue with xrootd latency on the Dell servers, causing timeouts - jobs are told the file doesn't exist. Is there a spin-down option enabled on the Perc5e controller to the MD1000? Happening mostly on older machines, where disks have a chance to be idle. Have seen this with production as well, during ramp-up. Also finding Adler32 mismatches on some files users are requesting, for data delivered in October; possibly corrupt at BNL as well - 3 files hit so far. TAG selection jobs running correctly on OSG sites with the 64-bit wn-client installed; discussing with Tadashi. At SLAC they install the 32-bit client and don't see the problem. Marco is working with VDT to find a solution - possibly remove XPATH from the VDT.

  • SWT2 (OU):
    • last week: OSCER - still working w/ Paul (up to 500 cores); need to talk about this in detail. Release 14.5.0 needed. OSCER still uses the uberftp client, but the new version (2.0) in wn-client doesn't accept the same syntax for md5sum. Needs a new dq2-put.
    • this week:
      • All is well

  • WT2:
    • last week: checksum issue resolved. Migrating the LFC database to a dedicated machine run by the database group, to improve reliability. Still working on network monitoring machines, but this will be postponed until April. Had a reboot over the weekend; otherwise okay.
    • this week:
      • A data server failed; otherwise all is well.

Tier 3 coordination plans (Doug, Jim C)

Carryover issues (any updates?)

HTTP interface to LFC (Charles)

Pathena & Tier-3 (Doug B)

  • Last week(s):
    • relies on http-LFC interface
  • this week

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Will use the new system to install 14.5.0 on all the sites; pacballs have been subscribed.
  • this week:

Squids and Frontier (Douglas S)

  • last meeting(s):
    • Harvard examining use of Squid for muon calibrations (John B)
    • There is a twiki page, SquidTier2, to organize work at the Tier-2 level
    • Douglas requesting help with real applications for testing Squid/Frontier
    • Some related discussions this morning at the database deployment meeting here.
    • Fred in touch w/ John Stefano.
    • AGLT2 tests - 130 simultaneous (short) jobs; looks like a 6x speed-up. Doing tests without Squid.
    • Wei - what is the squid cache refreshing policy?
    • John - BNL, BU conference
  • this week:

Local Site Mover

AOB

  • None.


-- RobertGardner - 05 Mar 2009
