r5 - 25 Feb 2009 - 14:41:39 - RobertGardner



Minutes of the Facilities Integration Program meeting, Feb 25, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: John B, Torre, Rich, Sarah, Pedro, Rob, Mark, Kaushik, Patrick, Shawn, Nurcan, Justin, Tom, Saul, Armen, Horst, Karthik, Neng, Charles
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/summary
  • last meeting(s):
    • Working on job submission to HU. Problems at BU - perhaps missing files. John will work the issues w/ Mark offline. done.
    • Pilot queue data misloaded when scheddb server not reachable; gass_cache abused. Mark will follow-up with Paul. (carryover) DONE
    • Retries for transferring files & job recovery - pilot option. Kaushik will follow-up with Paul.
    • Pilot problems introducing Adler32 changes at SWT2, checksums stored in LFC are wrong. Resolved? Yes. DONE
    • Backlog of transfers to BNL - across several Tier 2 sites - not understood. Hiro? There are still some backlogs, not as bad.
    • End of month - reprocessing, and DDM stress test; schedule unknown - still beginning of March? Graeme - 'son of 10M transfer'
    • Slowness in US ramp-up traced to a Panamover queue problem - solved.
    • There was a (demonstration) Panda security incident last week requiring a large number of changes (caused some pilot problems); now using secure curl and https in pilot. This was done in a rush, so some notifications didn't go out. Plugging other holes to prevent malicious text insertions into the monitoring database.
  • this week:
    • There were some overall glitches due to the migration of panda to CERN - resulting in lack of pilots. There were changes for oracle affecting submit hosts.
    • Saw some slowness in filling last week.
    • Torre's development/migration update: Schema and bulk data migrated to Oracle. Martin Novak at CERN working on this. Everything will be affected: autopilots, schedulers, etc. Also migrating to SVN. For monitoring, both Oracle and MySQL to be supported. CERN-IT panda instance to be replaced with a CERN Oracle instance - will start adding clouds after that. Expect completion after re-processing. BNL instances will be deprecated, for ATLAS purposes. There still may be some OSG usage.
    • Saul - reports lots of jobs getting killed at the moment.
    • Use atlas-project-adc-operations-shifts@cern.ch to report problems.
    • Wall time limits - need a survey. Recommendation is 48 hours min, preferred 72 hours. Current limits (hours): AGLT2: 72; NE: no limit; MW: 120; OU: 48; SW: 75; SLAC: 18 (need to raise).
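The wall time survey above can be checked mechanically. The following is a minimal sketch, not an existing facility tool: the site limits and the 48/72 hour thresholds are taken from the minutes, while the names `WALL_TIME_HOURS`, `classify` and `survey` are hypothetical, and treating "no limit" as meeting the preferred value is an assumption.

```python
# Wall-time limits in hours as reported in the minutes; None means "no limit".
WALL_TIME_HOURS = {
    "AGLT2": 72,
    "NE": None,
    "MW": 120,
    "OU": 48,
    "SW": 75,
    "SLAC": 18,
}

MIN_HOURS = 48        # recommended minimum from the minutes
PREFERRED_HOURS = 72  # preferred limit from the minutes


def classify(limit_hours):
    """Return 'preferred', 'minimum' or 'below' for one site's limit."""
    # Assumption: a site with no limit is treated as meeting the preferred value.
    if limit_hours is None or limit_hours >= PREFERRED_HOURS:
        return "preferred"
    if limit_hours >= MIN_HOURS:
        return "minimum"
    return "below"


def survey(limits):
    """Map each site name to its classification."""
    return {site: classify(hours) for site, hours in limits.items()}


if __name__ == "__main__":
    for site, status in survey(WALL_TIME_HOURS).items():
        print(f"{site}: {status}")
```

With the numbers above, only SLAC falls below the recommended minimum, and OU meets the minimum but not the preferred value.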

Shifters report (Mark)

  • Distributed Computing Operations Meetings
  • BNL cloud in the Dashboard
  • last meeting:
    • SWT2 seeing I/O load problems with some tasks
    • UTD problems - looks like LFC permissions issues
    • Pilot code updates for security
    • MWT2 dcache upgrade issues
    • Task 4380 - large number of failed jobs at SLAC; the transformation was missing (not installed), but the pilot then seems to have installed it successfully on the fly, though not in all cases.
    • FTS upgrade tomorrow at BNL
  • this meeting:
    • There was a problem with input jobs being on tape - cleared up early in the week.
    • UTD - still working on getting them back into production
    • AGLT2 - working on issue with GUMS servers
    • OSCER integration? Horst working w/ Paul;
    • SIGKILL - affecting 5 or 6 sites - perhaps a problem between submit host and gatekeeper
    • See Yuri's summary


Analysis queues, FDR analysis (Nurcan)

  • http://panda.cern.ch:25880/server/pandamon/query?dash=analysis
  • Analysis shifters meeting on 1/26/09
  • last meeting:
    • Nurcan would like to run a stress test in advance of March software week. Reprocessed DPDs should be available.
  • this meeting:
    • Facility working group on analysis queue performance: FacilityWGAP; First meeting: FacilityWGAPMinutesFeb24
    • Meetings will be bi-weekly.
    • Main problem at the moment is running TAG jobs on 64-bit OS. (Working on EGEE sites - perhaps a problem with the wn-client)
    • Re-processed datasets - data08_cos, data08_cosmic, data08_1b, data08_? good for next operational tests
    • Nurcan and Marco will be defining jobs to run over these datasets.

Operations: DDM (Hiro)

  • last meeting(s):
    • New DDM monitor up and running (dq2ping); testing with a few sites. Can clean up test files with srmrm. Plan to monitor all the disk areas, except proddisk.
    • Another 10M transfer jobs planned - mid-Feb. During this phase there will be real throughput tests combined with the stress tests. And planning to include the Tier 2's.
    • Proxy delegation problem w/ FTS - the patch has been developed and is in the process of being released. Requires FTS 2.1. A back-port was done, though it is only operational on SL4 machines. We would need to carefully plan migrating to this.
    • BNL_MCDISK has a problem - files are not being registered. New DQ2 version coming up the end of the week which will hopefully fix this.
    • BNL_PANDA - many datasets are still open. Is this an operations issue?
    • Pedro: there may be precision problems querying the DQ2 catalog. Will check creation date of the file.
  • this meeting:
    • UC - large number of errors in the DQ2 logs. Hiro will send dq2ping datasets. Failed submit FTS transfer error.
    • AGLT2 - fixed problems last night
    • WISC - there was a firewall issue
    • Deletion program completed - will start monitoring this week.
    • DDM SAM test - one dataset per hour to every site, see: http://www.usatlas.bnl.gov/dq2/monitor

Data Management & Storage Validation (Kaushik)

  • See new task StorageValidation
  • last week(s):
    • AGLT2 - what about MCDISK (now at 60 TB, 66 TB allocated)? These subscriptions are central subscriptions - should be AODs. Does the estimate need revision? Kaushik will follow-up.
    • Need a tool for examining token capacities and allocations. Hiro working on this.
    • Armen - a tool will be supplied to list obsolete datasets. Have been analyzing BNL - Hiro has a monitoring tool under development. Will delete obsolete datasets from Tier 2's too.
    • proddisk-cleanse questions - may need a dedicated phone meeting to discuss space management; more tools becoming available.
    • Discussing data deletion procedures for users. - Armen.
    • Deletions at BNL incorrectly "deleted" in the DQ2 central.
    • Timeframe for a weekly meeting: Tuesday 3pm Central (see scope above)
  • this week:
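The token capacity question raised for AGLT2 MCDISK (60 TB used of 66 TB allocated) comes down to simple arithmetic. The sketch below is illustrative only and is not the tool Hiro is working on: the function names and the 90% warning threshold are assumptions, and only the 60 TB and 66 TB figures come from the minutes.

```python
def fill_fraction(used_tb, allocated_tb):
    """Fraction of a space token's allocation that is in use."""
    if allocated_tb <= 0:
        raise ValueError("allocation must be positive")
    return used_tb / allocated_tb


def remaining_tb(used_tb, allocated_tb):
    """Space still free under the allocation, in TB."""
    return allocated_tb - used_tb


def needs_attention(used_tb, allocated_tb, threshold=0.9):
    """True when usage reaches the (assumed) warning threshold."""
    return fill_fraction(used_tb, allocated_tb) >= threshold


if __name__ == "__main__":
    used, allocated = 60, 66  # AGLT2 MCDISK figures from the minutes
    print(f"fill: {fill_fraction(used, allocated):.1%}")
    print(f"free: {remaining_tb(used, allocated)} TB")
    print(f"needs attention: {needs_attention(used, allocated)}")
```

For AGLT2 this gives roughly 91% full with 6 TB free, which is why the allocation estimate is being revisited.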

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Horst - sending feedback now to osg-storage.
    • Wei providing updates to documentation
    • Doug having lots of problems at Duke
  • this week
    • Horst - all okay w/ Bestman-Gateway (NFS and Ibrix backends); still working
    • Wei - no updates; may need some spacetoken updates
    • Patrick - will be installing bm-gw from VDT on production cluster
    • Doug - bm install.
    • Sarah - been going through install process; getting issues sorted out.

Tier3 networking (Rich)

  • Reminder: we need to advise campus infrastructure; Internet2 member meeting, April, in DC

Throughput Initiative (Shawn)

  • Notes from meeting this week:

  • last week:
    • Presentation on USATLAS perfSONAR Status
    • Timeline
      • Existing perfSONAR sites should reconfigure their “Communities of Interest” this week (LHC USATLAS)
      • AGLT2_UM will document steps needed to set up regular USATLAS peer tests for perfSONAR by next throughput meeting.
      • Hiro (& Jay for graphics?) will create a prototype standardized dataset copy test to measure end-to-end throughput by the next meeting.
      • Missing perfSONAR sites need to update the spreadsheet to provide their timelines ASAP
    • Rich: Also discussing Nagios and syslog-ng extensions for monitoring these boxes
    • Patrick: LHC and USATLAS are two communities of interest
    • Michael: Last week's reviewers felt perfsonar is a good tool for monitoring our infrastructure.
  • this week:
    • No meeting this week, next meeting in two weeks.
    • There has been perfsonar work over the past week. Next week there will be a webpage about how to set up T2-T2 and T2-T1 testing. BW measurements currently.


  • Should do some checking of releases installed at sites?

Site news and issues (all sites)

  • T1:
    • last week:
      • Probs w/ analy jobs over weekend due to large backlog of stageout requests; Xin debugged on Monday; moved to a more powerful machine so the queue is better served now.
      • dcache upgrade yesterday 1.9.09. Backend oracle work for FTS and LFC.
      • We need a new version of DQ2 site services. Becomes a pressing issue because of the registration backlog.
      • Deployment of 30 Thors in progress, 10G NIC driver matches; 5 units to be installed by week's end.
      • FTS migration to version 2.1 this week.
      • Network - making progress w/ dedicated circuits; first UC-BNL in place; now working on BNL-BU, discussion on Friday (need another meeting next week). Next would be AGLT2.
    • this week:
      • Dantong: VOMS certificate updated at BNL; FTS 2.1 upgrade went well - moved to LHC OPN network.
      • Hiro will provide a script for sites to update voms host certs at Tier 2 sites.

  • AGLT2:
    • last week:
      • Poor WAN performance issue tracked down w/ help from Mike OConner. 10% packet loss, rate independent. US LHCnet and Esnet jumper at starlight - bad fiber - fixed. Now getting packet loss down to 0. Dzero transfers to FNAL down from hours to seconds.
      • Throughput test back to BNL - needs work.
      • dCache maintenance - removed pools on compute nodes, and using Berkeley database for metadata.
      • Dell switch stack at MSU again causing problems.
      • Upgraded rocks install for nodes.
      • Frontier tests - 700 running processes to reach saturation (caused server process crash at BNL).
    • this week:
      • sshd incident discussion

  • NET2:
    • last week:
      • BU (Saul): 224 new harpertown cores have arrived, to be installed. New storage not yet online - HW problems w/ DS3000s (IBM working on it).
      • HU (John): gatekeeper load problems of last week related to polling old jobs. Fixed by stopping the server and removing old state files. Also looking into Frontier.
      • Frontier evaluation meeting on Fridays at 1pm EST run by Dantong (new mailing list BNL).
      • Fred notes BDII needs to be configured at HU.

    • this week:
      • BU: nothing to report this year; HU: offline this week.

  • MWT2:
    • last week: Upgraded dCache at both sites. Processes on old pools didn't shut down properly; these errors got mopped up. Pilot changes and problems with curl as distributed in workernode-client.
    • this week:
      • 21 new compute servers (PE1950), 52 TB of storage to be added.

  • SWT2 (UTA):
    • last week: xrootd site mover change last week orphaned 5000 jobs - corrected w/ Hiro's help. Now running smoothly. Updated perfsonar boxes to include communities of interest.
    • this week:
      • UTA_SWT2 failed Saturday, still diagnosing problems. Otherwise nothing to report.

  • SWT2 (OU):
    • last week: OSCER - still working w/ Paul (up to 500 cores). Need to talk about this in detail.
    • this week: 14.5.0 release needed. OSCER: still uses uberftp client, but the new version (2.0) in wn-client doesn't accept the same syntax for md5sum. Needs new dq2-put.

  • WT2:
    • last week: checksum issue resolved. Migrating lfc database to a dedicated machine run by database group, to improve reliability. Still working on network monitoring machines, but will be postponed until April.
    • this week:
      • Reboot over the weekend, otherwise okay.
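Several of the site items above involve file checksums (the Adler32 changes in the pilot, the md5sum syntax problem for OSCER, the checksum issue at WT2). As a reference point, an Adler32 value can be computed incrementally over a file with Python's standard `zlib` module. This is a minimal sketch and not the pilot's code: the function name `adler32_hex` and the zero-padded 8-digit lowercase hex output are assumptions about the format a catalog might expect.

```python
import io
import zlib


def adler32_hex(stream, chunk_size=1024 * 1024):
    """Compute the Adler32 checksum of a binary stream, reading it in chunks.

    Returns an 8-character, zero-padded, lowercase hex string (assumed format).
    """
    value = 1  # Adler32 starting value
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed the running value back in so the result covers the whole stream.
        value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    # In-memory example; for a real file use open(path, "rb").
    print(adler32_hex(io.BytesIO(b"Wikipedia")))
```

Dropping leading zeros in the hex string is a common source of mismatches between tools, so the sketch pads explicitly.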

Carryover issues (any updates?)

Pathena & Tier-3 (Doug B)

  • Last week(s):
    • Meeting this week to discuss options for a lightweight panda at tier 3 - Doug, Torre, Marco, Rob
    • Local pilot submission, no external data transfers
    • Needs http interface for LFC
    • Common output space at the site
    • Run locally - from pilots to panda server. Tier 3 would need to be in Tiers of Atlas (needs to be understood)
    • No OSG CE required
    • Need a working group of the Tier 3's to discuss these issues in detail.
    • http-LFC interface: Charles had developed a proof-of-concept setup. Pedro has indicated willingness to help - pass knowledge of apache configuration and implement oracle features.
  • this week
    • http-LFC interface work

Release installation via Pacballs + DDM (Xin, Fred)

  • last week:
    • Can run in production mode now, but there are two things to finish. Path of logfiles of install jobs to permanent location; publication of EGEE portal.
    • Installation pilots are hanging at OU.
  • this week:
    • Hung pilots disappeared
    • Will use the new system to install 14.5.0 on all the sites; pacballs have been subscribed.

Squids and Frontier (Douglas S)

  • last meeting(s):
    • Harvard examining use of Squid for muon calibrations (John B)
    • There is a twiki page, SquidTier2, to organize work at the Tier-2 level
    • Douglas requesting help with real applications for testing Squid/Frontier
    • Some related discussions this morning at the database deployment meeting here.
    • Fred in touch w/ John Stefano.
    • AGLT2 tests - 130 simultaneous (short) jobs. Looks like x6 speed up. Doing tests without squid.
    • Wei - what is the squid cache refreshing policy?
    • John - BNL, BU conference
  • this week:
    • Dantong will report on weekly Friday meeting

Local Site Mover


  • Wei: questions about release 15 coming up - which platforms (SL4, SL5) and gcc 4.3. Kaushik will develop a validation and migration plan for the production system and facility, and will follow up.
    • Kaushik: spoke w/ Quarrie - no change in routine. Will try to make 64 bit releases.
    • They will test various gcc versions. If they choose a non-standard gcc, ATLAS will package it along w/ the release.
  • Next regular meeting in two weeks.

-- RobertGardner - 23 Feb 2009
