
MinutesApril1

Introduction

Minutes of the Facilities Integration Program meeting, April 1, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Shawn, Kaushik, Mark, Michael, Sarah, Saul, Wei, Bob, Patrick, Xin, Mark
  • Apologies: Charles, Fred
  • Guests: none

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • End of month - reprocessing, and DDM stress test; schedule unknown - still beginning of March? Graeme - 'son of 10M transfer'
    • Reprocessing validation jobs on-going, limited by what can be pulled from tape.
    • Large number of MC jobs to backfill.
    • Will there be competing requests for re-processing, MC production?
  • this week:
    • Probably now 90% of the jobs are pile-up jobs. I/O intensive (20 input files, 1-hour jobs). HITS files are often on tape, as they were generated months ago. Hitting HPSS hard. That's why we're not running any jobs at the moment.
    • Some simulation channels, but priority is for pileup.
    • Have asked for regional simulation tasks for backfill - but that's hung up in physics coordination and ATLAS-wide.
    • We have no control over regional production tasks.
    • Next task will be large-scale reprocessing, using release 14.5.2.4/5. Files must come from tape, as an exercise. Expect this to begin tomorrow. Expect Tier 1 resources alone to be sufficient, though SLAC will be added as a Tier 2 to augment.

Shifters report (Mark)

  • Reference
  • last meeting:
    • Prior issues mostly resolved.
    • Time limit in batch system increased at SLAC.
    • Few pilot version updates; 34a is a major update; 34b is a minor patch. See Paul's email.
    • DDM dashboard improved
    • UTD-HEP job failures similar to failures at PIC. Lots of discussion, mail threads; seems to be site-specific, release-specific. No good answer yet.
    • Potential difference in Adler32 checksums computed with 32-bit versus 64-bit Python (see the sketch at the end of this section). Paul thinks this is being handled correctly in the pilot. Give feedback to Paul. Shawn checked pilots at AGLT2 - looked okay.
    • NET2: Saul reports that 50-90% of files > 2 GB in proddisk have the wrong Adler32; all DAC files. High failure rates last night. These are input files from Panda mover. John will be checking this. Kaushik believes this is a different problem from the AGLT2 issue.
    • BNL_PANDA was down - back up now, backlog being reduced.
  • this meeting:
    • Several pilot upgrades from Paul recently.
    • High failure rate for last week's task; aborted.
    • Working on getting Harvard site into production.
    • Two UTA sites were offline - storage space constraints. Back online now.
    • Updated procedure for setting sites off/on line. See https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Controlling_Panda_Queues
    • Lack of pilots during a stretch at AGLT2 - seemed transient. Sites going off/online irregularly?
    • See Yuri's summary.
    • Panda monitor problems. There are still transition errors. Oracle backend problems - still optimizing.
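
The 32-bit versus 64-bit Python difference noted above comes from zlib.adler32 returning a signed integer, so the same file can appear to have different checksum strings depending on the interpreter build. Below is a minimal, illustrative sketch (not the pilot's actual code) of a file checksum that both builds agree on; the command-line usage is just an example.

# Illustrative only: zlib.adler32 returns a signed integer, so on 32-bit
# builds large checksums come back negative.  Masking to an unsigned 32-bit
# value before formatting gives the same hex string on both builds.
import zlib

def adler32(filename, blocksize=1024 * 1024):
    """Compute the Adler32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler32 starting value
    with open(filename, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            value = zlib.adler32(block, value)
    # Force the result into the unsigned 32-bit range so 32-bit and 64-bit
    # interpreters agree on the final hex string.
    return '%08x' % (value & 0xffffffff)

if __name__ == '__main__':
    import sys
    print(adler32(sys.argv[1]))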

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • BNL_PANDA and other areas upgraded to the latest DQ2 release, 1.2.4. Suggests other sites wait until it gets cleaned up. It's a major upgrade: can split service by share, important for BNL and AGLT2. Includes blacklist/whitelist of subscription DNs.
    • Expect backlog to disappear in an hour.
    • See monitoring tool above - new summaries added.
    • DQ2 ping datasets were removed from the central catalog - it's working for the Tier 2s.
    • Dataset deletion service - how to monitor?
  • this meeting:
    • Problem at UC fixed.
    • Preparing for reprocessing exercise - cleaned input files off disk.
    • lcg-voms.cern.ch.2009-03-03.pem needs to be updated on all sites. Need to get John Hover involved.
    • Hiro is now sending the DQ2 site services monitoring from BNL to CERN. Requires an upgrade of DQ2.

from Hiro:

Jim has requested that the status of the BNL DQ2 site services be made available to the CERN SLS monitor. The SLS home page is at https://sls.cern.ch/sls/index.php

The DQ2 site services should be at https://sls.cern.ch/sls/service.php?id=ATLAS_DDM_VOBOXES

You can expand each box by clicking its name. However, I don't see BNL any more; I have reported this to them. In any case, BNL should be there.

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Last data management meeting(s): MinutesDataManageMar10, MinutesDataManageMar24, MinutesDataManageMar31
    • US decision for ATLASSCRATCHDISK needed (ref here)
    • Shawn has a list of files produced on a faulty node. Checking checksums - 130 wrong, 6 correct (still being checked). The list will need to be removed from DQ2 and sent along to Stephan. Believes this is isolated.
    • Sites should check that only Adler32 is being used.
    • The proposal was that the pilot should check the checksum both locally and after the file is placed in storage, before registration.
    • For xrootd sites there is a program that computes Adler32; it is an additional program, not part of xrootd.
    • GPFS/IBRIX - there is a posix program.
  • this week:
    • See minutes from past meetings.
    • Space clean-up at the Tier 2s. What about MCDISK and DATADISK? AGLT2 and SWT2 have already run into the problem. Big mess!
    • Hiro has installed an Adler32 plugin for the DQ2 site services at BNL. It compares the value in dCache with the value in DQ2 and catches corruption during transfer. Running in passive mode - no corrupted files in a week. Active mode will fail the transfer if there's a mismatch (see the sketch after this list).
    • Another big issue is when BNL migrates to storage tokens.
    • Pedro and Hiro are working on services to reduce the load on the pnfs servers: using pnfs IDs rather than filenames, and callbacks from dCache when a file is staged rather than polling.
    • Alexei's group has developed a nice way to categorize file usage at each site. There's a webpage prototype.
    • ATLASSCRATCH deadline?
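
As a rough illustration of the verification flow discussed above (compute the checksum locally, compare with the catalogued value, and either warn or fail the transfer), here is a hedged sketch; it is not the DQ2 plugin or pilot code, and the catalogued-checksum argument and mode flag are placeholders for whatever the real services pass around.

# Hedged sketch of the checksum-verification idea: compare a locally computed
# Adler32 with the value recorded in the catalogue, and either log (passive
# mode) or raise (active mode) on a mismatch.  Not the actual DQ2/pilot code.
import zlib

def local_adler32(path, blocksize=1024 * 1024):
    value = 1
    with open(path, 'rb') as f:
        block = f.read(blocksize)
        while block:
            value = zlib.adler32(block, value)
            block = f.read(blocksize)
    return '%08x' % (value & 0xffffffff)

def verify_transfer(path, catalog_checksum, active=False):
    """Return True if checksums match; in active mode raise on a mismatch."""
    local = local_adler32(path)
    if local == catalog_checksum:
        return True
    message = 'checksum mismatch for %s: local=%s catalog=%s' % (
        path, local, catalog_checksum)
    if active:
        # Active mode: fail the transfer.
        raise RuntimeError(message)
    # Passive mode: just record the mismatch and carry on.
    print('WARNING: ' + message)
    return False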

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
  • last week:
    • No meeting
    • Getting ready to get back to disk-to-disk throughput tests.
    • Collecting data at Brookhaven from 6 sites. Working on getting data out of the database - troubleshooting. See Rich's email today about what to check on the machines.
    • Shawn is creating instructions on how to bring up monitoring hosts.
    • On the BNL site there are pictures showing traffic.
  • this week:
    • There will be a meeting next week. We're a little behind on throughput milestones - some problems with the perfSONAR boxes observed.
    • ESnet working with BNL to resolve issues; also UltraLight.
    • Hiro will regularly send a small number of large files to each site and will plot the throughput (a back-of-the-envelope sketch follows this list).
    • See also Jay's page for perfSONAR monitoring.
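
As a back-of-the-envelope aid for the timed large-file transfers mentioned above, the sketch below converts a transfer of a given total size and duration into MB/s and Gbps; the file size and duration in the example call are made-up numbers, not measured values.

# Simple throughput arithmetic for a timed transfer of a few large files.
# The example numbers are illustrative only, not real measurements.

def throughput(total_bytes, seconds):
    """Return (MB/s, Gbps) for a transfer of total_bytes over seconds."""
    mb_per_s = total_bytes / 1e6 / seconds   # decimal megabytes per second
    gbps = total_bytes * 8 / 1e9 / seconds   # decimal gigabits per second
    return mb_per_s, gbps

if __name__ == '__main__':
    # e.g. ten 3.6 GB files transferred in 300 seconds (hypothetical numbers)
    mbs, gbps = throughput(10 * 3.6e9, 300.0)
    print('%.1f MB/s  (%.2f Gbps)' % (mbs, gbps))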

Site news and issues (all sites)

  • T1:
    • last week:
    • this week: Storage capacity - a petabyte of Sun storage. Needed the right 10G cards - only the Myrinet cards worked properly in the Thors. All together now. 25 units given to Pedro's group to get dCache configured there. Lots of activity in storage and data management. Working on implementing prioritization of staging requests. Networking - next round of USLHCNet upgrades under way for transatlantic capacity: 20 Gbps by October, contingent on the final budget. Improving T1-T2 connectivity: progress now with the BU-BNL circuit; next would be connecting AGLT2.

  • AGLT2:
    • last week:
    • this week: MCDISK getting full. Putting together scripts and consistency checking. Wenjing has a hot-replica script service for dCache. Some nodes at MSU off for AC work.

  • NET2:
    • last week(s): BU (Saul): 224 new Harpertown cores have arrived, to be installed. New storage not yet online - HW problems with the DS3000s (IBM working on it). HU (John): gatekeeper load problems of last week were related to polling old jobs; fixed by stopping the server and removing old state files. Also looking into Frontier. Frontier evaluation meetings on Fridays at 1pm EST, run by Dantong (new mailing list at BNL). Fred notes BDII needs to be configured at HU. Communicating with Hiro via email about the Adler32 issue. HU: there was a bug in NFS in the RH5 kernel causing lots of problems (gatekeeper loads, slow pacman installs) - kernel 2.6.18-53, replaced with 2.6.18-92; an order of magnitude more lookups, especially when lots of modules are in the ld library path. Expect to bring the site back online soon.
    • this week: the data corruption issue turned out to be a hardware problem. Doing a complete inventory of the data; ~a few thousand files corrupted. New BM installed. perfSONAR boxes up - one working already. New rack of GPFS storage. 224 cores to be added. Progress in networking: NOX decided on a direct connection to ESnet. HU (John): no pilots coming to the site - working with shifters.

  • MWT2:
    • last week(s): 21 new compute servers (PE1950), 52 TB of storage to be added. Looking into latency issues with xrootd. Getting some strange behavior with xrootFS (had a new data server on the bestman server). In communication with Wei.
    • this week: Problems with network cards in the new Dells - dropped packets; need to contact Myricom. Analysis queue stress test working. At IU - a large number of jobs stuck in the transferring state - a problem in the pilot. There were some releases missing.

  • SWT2 (UTA):
    • last week:
      • UTA_SWT2 failed Saturday; still diagnosing the problem. Otherwise nothing to report.
      • Had a problem with the CE gatekeeper on SWT2_UTA - needed to reboot; probably something in IBRIX. Running into an issue with xrootd latency on the Dell servers - causing timeouts, with jobs told the file doesn't exist. Is there a spin-down option enabled on the PERC 5/E controller to the MD1000? Happening mostly on older machines, where disks have a chance to be idle. Have seen this with production as well, during ramp-up. Also finding Adler32 mismatches on some files users are requesting, for data delivered in October; possibly corrupt at BNL as well. 3 files hit so far. TAG selection jobs not running correctly on OSG sites with the 64-bit wn-client installed; discussing with Tadashi. At SLAC they install the 32-bit client and don't see the problem. Marco is working with VDT to find a solution; possibly remove XPATH from the VDT.
    • this week: Space problems on the CPB cluster. Ran proddisk-cleanse. Working on cleaning up old data - deciding what can be deleted. Some files on disk are unknown to the LFC (see the dark-data sketch after the site reports). SWT2_UTA was not getting pilots from BNL; cleaned up grid monitor debris, back online.

  • SWT2 (OU):
    • last week: OSCER - still working with Paul (up to 500 cores). Need to talk about this in detail; release 14.5.0 needed. OSCER still uses the uberftp client, but the new version (2.0) in wn-client doesn't accept the same syntax for md5sum. Needs a new dq2-put.
    • this week: all is well. 100 TB storage ordered.

  • WT2:
    • last week:
    • this week: all is well. Replaced a bad hard drive in a Thumper. Running the clean-up script in proddisk. When will the central operations team begin regular cleanup? The agreement is that they will delete test data; user data we manage ourselves (US decision).
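
Related to the SWT2 note above about files on disk that are unknown to the LFC, here is a rough "dark data" check: walk a storage area and report files absent from a catalogue dump. The catalogue-dump format (one physical path per line) and the directory layout are hypothetical placeholders, not the actual LFC export or proddisk-cleanse inputs.

# Rough sketch of a dark-data check: list what is on disk, compare it with a
# catalogue dump, and report files the catalogue does not know about.
import os

def files_under(top):
    """Yield full paths of all regular files under a directory tree."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name)

def dark_data(storage_root, catalog_dump):
    """Return files under storage_root that are missing from the dump."""
    with open(catalog_dump) as f:
        known = set(line.strip() for line in f if line.strip())
    return sorted(p for p in files_under(storage_root) if p not in known)

if __name__ == '__main__':
    import sys
    for path in dark_data(sys.argv[1], sys.argv[2]):
        print(path)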

Carryover issues (any updates?)

Release installation validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
    • The new system is in production.
    • Discussion to add pacball creation into official release procedure; waiting for this for 15.0.0 - not ready yet. Issue is getting pacballs created quickly.
    • Trying to get the procedures standardized so they can be carried out by the production team. Fred will try to get Stan Thompson to do this.
    • Testing release installation publication against the development portal. Will move to the production portal next week.
    • Future: define a job that compares what's at a site with what is in the portal (see the sketch after this list).
    • Tier 3 sites - this is difficult for Panda - the site needs to have a production queue. Probably need a new procedure.
    • Question: how are production caches installed in releases? Each is in its own pacball and can be installed in the directory of the release that it's patching. Should Xin be a member of the SIT? Fred will discuss next week.
    • Xin will develop a plan and present in 3 weeks.
  • this meeting:
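
A rough sketch of the comparison job mentioned under "Future" above: check the releases a site reports as installed against the list published in the portal. Both inputs (one release tag per line) are hypothetical placeholders; the real job would query the portal and the site's tags publication rather than read local files.

# Sketch of a site-vs-portal release comparison under the assumptions above.

def read_tags(filename):
    with open(filename) as f:
        return set(line.strip() for line in f if line.strip())

def compare_releases(site_tags_file, portal_tags_file):
    site = read_tags(site_tags_file)
    portal = read_tags(portal_tags_file)
    return {
        'missing_at_site': sorted(portal - site),  # in portal, not installed
        'unpublished': sorted(site - portal),      # installed, not in portal
    }

if __name__ == '__main__':
    import sys
    report = compare_releases(sys.argv[1], sys.argv[2])
    for key, tags in report.items():
        print('%s: %s' % (key, ', '.join(tags) or 'none'))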

Tier 3 coordination plans (Doug, Jim C)

  • Doug would like to report bi-weekly.
  • Would like to consider Tier 2 - Tier 3 affinities - especially with regard to distributing datasets.
  • Writing up a twiki for Tier 3 configuration expectations
  • Will be polling Tier 3's for their expertise.
  • Tier 3 meeting at Argonne, mid-May, for Tier 3 site admins.
  • Should Tier 3s have perfSONAR boxes? The question is the timeframe for deployment. To be discussed on the throughput call.

HTTP interface to LFC (Charles)

Pathena & Tier-3 (Doug B)

  • Last week(s):
    • relies on http-LFC interface
  • this week

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Will use the new system to install 14.5.0 on all the sites; pacballs have been subscribed.
  • this week:

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2s as developments are made. - rwg
  • last meeting(s):
    • Harvard examining use of Squid for muon calibrations (John B)
    • There is a twiki page, SquidTier2, to organize work at the Tier-2 level
    • Douglas requesting help with real applications for testing Squid/Frontier
    • Some related discussions this morning at the database deployment meeting here.
    • Fred in touch with John DeStefano.
    • AGLT2 tests - 130 simultaneous (short) jobs. Looks like a 6x speed-up. Also doing tests without Squid.
    • Wei - what is the squid cache refreshing policy?
    • John - BNL, BU conference
  • this week:
    • Dantong reporting. There is a weekly Frontier meeting chaired by John DeStefano.
    • Friday afternoons, 1pm Eastern.
1) Get the documentation onto the TWiki within a week.
2) We will finalize the BNL Frontier infrastructure two to three weeks from now (timeline: April 15, 2009, tax day). Two instances of Frontier services behind an F5 switch. See Dave's suggestion (attached) on testing Frontier functionality.
3) Tier 2 centers will identify their servers for local Squid and study the documentation we provided. During the operations meeting, let us discuss the hardware and software requirements for the Tier 2 configuration.
4) Tier 2s will set up their infrastructure one to two weeks after we finalize ours (April 30).
    • John - recommendations for Tier 2s. AGLT2 connection to BNL via Squids working.
    • Documentation setup.
    • Sites need to start identifying hardware. Single-threaded - only 1 CPU required; 2 GB RAM, at least 400 GB of disk (a rough cache-test sketch follows this list).
    • John will send regular announcements for the Friday meeting.
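
As a very rough way to see whether a local Squid is actually caching repeated Frontier-style requests (in the spirit of the AGLT2 speed-up test noted earlier), the sketch below times two fetches of the same URL through the proxy; the proxy address and test URL are hypothetical placeholders, and whether the second fetch is served from cache depends on the response's cache headers.

# Very rough cache check: fetch the same URL twice through the local proxy
# and compare wall-clock times; a cached second fetch should be much faster.
# The proxy and URL below are hypothetical, not a site's real settings.
import time
import urllib2

PROXY = 'http://squid.example.edu:3128'        # hypothetical local Squid
URL = 'http://frontier.example.org/somequery'  # hypothetical test URL

opener = urllib2.build_opener(urllib2.ProxyHandler({'http': PROXY}))

def timed_fetch():
    start = time.time()
    data = opener.open(URL).read()
    return time.time() - start, len(data)

if __name__ == '__main__':
    t1, n1 = timed_fetch()   # first fetch: likely a cache miss
    t2, n2 = timed_fetch()   # second fetch: may be served from the cache
    print('first: %.2fs (%d bytes), second: %.2fs (%d bytes)' % (t1, n1, t2, n2))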

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the GridFTP server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week

Tier3 networking (Rich)

  • last week
  • this week
    • Another session added for Wednesday, 2-4 pm.

Local Site Mover

AOB

  • None.


-- RobertGardner - 24 Mar 2009
