


Minutes of the Facilities Integration Program meeting, Aug 12, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Shawn, Charles, Rob, Michael, Sarah, Brian, Booker Bense, Pedro, Wen, Wei, Bob, Rich, Saul, Horst, Kaushik, Mark, Rupam
  • Apologies: Nurcan, Fred, John De Stefano
  • Guests: Brian Bockelman, Rob Quick

Integration program update (Rob, Michael)

Guest topic: Gratia transfer probes @ Tier 2 sites (Brian Bockelman)

Guest topic: Getting OIM registrations correct for WLCG installed pledged capacity (Brian, Rob Quick, Karthik)

  • WLCGReporting.pdf (attached below)
  • Note - if you have more than one CE, the availability will take the "OR"; see the sketch after this list.
  • Make sure installed capacity is no greater than the pledge.
  • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystems); requires OSG 1.0.4 or later. Note - this is not important for WLCG, since it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
  • We have not yet seen a draft report.
  • Double-check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but verify.
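  • A minimal sketch (hypothetical CE names, not the actual WLCG availability code) of the "OR" noted above: with more than one CE registered, the site counts as available for a reporting interval if any one of its CEs passed the availability tests.
      # Hypothetical per-CE test results for one reporting interval
      ce_results = {
          "gk01.example.edu": True,    # this CE passed its availability tests
          "gk02.example.edu": False,   # this one did not
      }
      site_available = any(ce_results.values())   # "OR" over all registered CEs
      print("site available:", site_available)    # -> True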

Hot topic: SL5 migration

  • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
  • Kaushik: we have the green light from ATLAS to do this; however, there are some validation jobs still going on and some problems to solve. If anyone wants to migrate, go ahead, but we are not pushing right now. We want plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2 and 7 weeks from now - for both SL5 and OSG 1.2.
  • Consensus: start mid-September for both SL5 and OSG 1.2
  • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they hold off? Probably okay (Michael) - it would yield some good information. Sites: use this time to sort out migration issues.
  • Milestone: by mid-October all sites should be migrated.

From: "Ernst, Michael" 
Date: August 12, 2009 6:48:56 AM CDT
Cc: "Rob Gardner" 
Subject: SL5 migration
Dear Colleagues, Now that computing management of all four LHC experiments have signed off on the migration to SL(C)5 the WLCG management asks all Tier-1 and Tier-2 sites to carry out the transition as soon as possible. Following the migration of the U.S. ATLAS Tier-1 center about two weeks ago numerous validation tasks of various kinds were run by ADC central operations and have proven that the ATLAS software is in a state allowing ATLAS computing management to confirm readiness for an SL(C)5 based processing infrastructure. At our weekly computing meeting later today we will address this point in detail.

Regards, Michael

From: Ian Bird [mailto:Ian.Bird@cern.ch] 
Sent: Wednesday, August 12, 2009 4:11 AM
To: worldwide-lcg-management-board (LCG Management Board)
Subject: MB Summary: agreement SL5 migration

Dear colleagues,

As agreed in the Management Board on August 4, now is the time to push to complete the SL5 migration at all sites, including the Tier 2s. It was understood in the meeting that the experiments are all ready and able to use SL5 resources. A web page is available to provide pointers to the relevant information to support the migration, including links to the necessary packages. (https://twiki.cern.ch/twiki/bin/view/LCG/SL4toSL5wnMigration)

It is now expected that Tier 1s and Tier 2s should plan this migration as rapidly as possible, so that the majority of resources are available under SL5 as soon as possible.



Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • MC production: we had high failure rates this morning (27K jobs) due to a task problem. Event numbers became too large and hit an Athena limit (in evgen). Ran into the problem at 250M events (2B unfiltered). Request to Borut/Pavel to abort the task. Ticket submitted. Not sure how many days of processing we have left, since this large sample is dead.
    • Sites continuing to do well. Good efficiency.
    • Reprocessing exercise forthcoming: ESD-based reprocessing to start mid-August. It won't take a long time, but the data could be on tape. Armen is looking into which datasets will be required, working with Hiro to get them staged. The SL5 upgrade has exposed a CORAL issue for releases 15.1-3. All Tier 2's should participate. There is a concern that there may be hacks left over from the fast reprocessing done previously; we want to continue to have this capability at the Tier 2's, so we should test.
    • Update: Borut has created a new tag and series for the dataset, so we can continue generating new events. Therefore there should be no disruption. Kaushik was also able to get access to proddb.
  • this week:
    • All doing okay - enough jobs for the next two weeks.
    • Note: nothing in Panda except what's in for the US cloud - forward requests.
    • All T2's were running reprocessing validation jobs. There are a couple of issues that may require a new cache (quotation marks in the XML generated by the trf; two DB release files are not getting their modification times updated, so they're being deleted by scratch cleaners - needs a fix in the trf). Want to start on August 24.
    • There was a problem with pilot jobs mixing data and MC, which caused Oracle access (task 78356); reprocessing has minimal Oracle access.

Shifters report (Mark)

  • Reference
  • last meeting:
    • Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=65709
      1)  HU_ATLAS_Tier2 set back to on-line after issue with credentials (i.e., TRUSTED_CA) was resolved.  RT # 13657.  7/29 p.m.
      2)  Evgen task 76944 aborted due to very high failure rate (tens of thousands).  eLog # 4881.
      3)  7/30: Bob at AGLT2 noticed some issues with pilots at their site (too many queued relative to the value of 'nqueue' etc.)  Turned out to be a problem with the scheduler -- from Torre:
      It was happening at BNL, an intermittent failure of a curl to CERN. Would have affected any scheduler; I think only you noticed because 1) your nqueue is high 2) the excess pilots were sitting for a long time waiting for slots so they piled up and 3) you're attentive! Not that others aren't of course. The bad effect on queueing should be fixed now (I've updated your scheduler and the rest will get updated code when the crons refresh in a couple hours).
      4)  7/31:  SWT2_CPB was set back on-line following recovery from a major power failure and a subsequent hardware issue with the cluster gk.
      5)  Discussions about procedures for requesting / making changes to panda site configurations -- came up in context of modifications to site ANALY_MWT2.  Details in mail thread(s). 7/31.
      6)  Pilot updates from Paul:
      v37r ==>
      * Job recovery is now grabbing output file guids from file metadata-.xml (produced by the pilot) instead of PoolFileCatalog.xml
      * Pilot is only generating output file guids when it can not find them in either PoolFileCatalog.xml or metadata.xml and only for non-recovery jobs
      v37s ==>
      * The job recovery algorithm has been updated to handle guids of log files in a more robust way. E.g. the stored value in the jobState file is now verified against that of the pilot metadata file (the jobState value is replaced with the metadata value in case of problems). Also, it is now not allowed to generate a new log guid in recovery mode. Previously this could happen if the value was missing in the jobState file (not normal). 
      7)  8/3: Issues with the jobmanager at MWT2_UC resolved -- production and analysis sites set back to on-line.
      8)  8/4: HPSS tape library maintenance at BNL completed --
      SL8500 Tape Library code upgrade has been completed and access to library resident HPSS data restored. The maintenance window had to be extended because two handbots were found defective and had to be replaced by SUN as a preventive measure. 
      9)  8/4:  Two issues affecting MWT2_IU -- (i) Heavy rains led to some water coming into the server room; (ii) Hardware fault in the PNFS server.  All issues resolved -- site set back on-line.
      10)  Some occasional issues have been seen running production at SL(C)5 sites -- for example Savannah # 53885, 54048.  Under investigation.
      11)  Maintenance downtime at SLAC -- from Wei:
      SLAC will go offline from Wednesday 8/5 until Monday 8/10 for storage maintenance. We will still need to take an outage on the week of 8/17 but will only need one day, and we will announce the date later.
      12)  Follow-ups from earlier reports: 
      (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    • Fred would like to discuss the handling of this rt ticket: https://rt-racf.bnl.gov/rt/Ticket/Display.html?id=13449.
      • GGUS filed a ticket for a data transfer problem between NET2 and SWT2, filed to the SWT2 queue. Wrong queue - the problem was with NET2.
      • Several attempts from SWT2 to indicate the problem was not at SWT2. The ticket sat for 3 weeks with no action.
      • GOC complained to Fred. Who was responsible for switching the ticket?
      • Who is monitoring the queue, who is responsible? Could SW have the ability to move to another queue? (Answer: yes)
      • Michael - will discuss with Tier 1 staff.
      • Note: in future, contact John De Stefano for any RT concerns (unofficially)
  • this meeting: Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=66077
    [ Production generally running very smoothly this past week -- most tasks have low error rates.  Majority of failed jobs the past few days were from site validation tasks. ]
    [ ESD reprocessing postponed until ~August 24. ]
    1)  8/6 a.m.: ATLAS Conditions oracle database maintenance at BNL completed successfully.
    2)  8/6 a.m.: BNL - brief outage on several WAN circuits due to a problem with the provider (Level3).  Resolved.
    3)  8/6 a.m.: BNL dq2 site services moved to a new host.  An issue with the monitoring was reported post-move -- presumably resolved?
    4)  8/7: Problematic WN at MWT2_IU (iut2-c036) created stage-in failures -- issue resolved.
    5)  8/8 a.m.: Issue with BNL_OSG2_MCDISK resolved -- from Pedro:
    We had a problem with some pools in one of our storage servers.  The problem has been fixed and all data is again online.
    6)  This past weekend: attempt to migrate site name BU_ATLAS_Tier2o to BU_ATLAS_Tier2 resulted in some failed jobs, as they were assigned to the new name during the brief time it was implemented -- rolled back for now.
    7)  8/10-8/11: BNL -- Jobs failed at BNL with "Get error: dccp get was timed out after 18000 seconds" -- from Pedro:
    There were some problems with some pools.  We've restarted them and tested some of the files that failed.  The problem seems to be fixed now.  We will continue to check for other files according to the pnfsids on the Panda monitor.  RT 13775.
    8)  8/11: Storage maintenance completed at SLAC -- test jobs finished successfully -- SLACXRD & ANALY_SLAC set back 'on-line'.  (Possible power outage on 8/25?)
    9)  8/11: Tier3 UTD-HEP set back to 'on-line' following: (i) disk clean-up in their SE areas; (ii) fixed some issues related to RSV; (iii) successful test jobs.
    10)  8/11: Some jobs from task 78328 were failing at SWT2_CPB with the error "Auxiliary file sqlite200/*.db not found in DATAPATH" - eventually tracked down to the fact that the file ALLP200.db retains a modification time of July 8th after it is unpacked from a DB tar/zipped container, and hence was getting removed by an automated script that cleans old debris from the WN scratch areas.  A patch to the DB file is planned. eLog 5088.
    11) Today: It was announced that all sites should make plans for migrating to SL(C)5 -- from Ian Bird:
    As agreed in the Management Board on August 4, now is the time to push to complete the SL5 migration at all sites, including the Tier 2s.  It was understood in the meeting that the experiments are all ready and able to use SL5 resources.   A web page is available to provide pointers to the relevant information to support the migration, including links to the necessary packages.  (https://twiki.cern.ch/twiki/bin/view/LCG/SL4toSL5wnMigration)
    It is now expected that Tier 1s and Tier 2s should plan this migration as rapidly as possible, so that the majority of resources are available under SL5 as soon as possible.
    12)  Follow-ups from earlier reports: 
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues. 
    • Production running smoothly w/ low error rates
    • Most errors from validation tasks - experts notified.
    • Sites offline have been brought back into production: SLAC, UTD_HEP
    • Attempt to modify the site name for BU failed; will need to re-try later. Saul: it turns out it's not so easy - will hold off until going to SL5 and OSG 1.2.
    • Time-stamp issue on dbrelease files (see the sketch below).
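  • A sketch of the dbrelease time-stamp problem noted above (an illustration, not the actual trf patch): tarfile preserves the archive's original mtimes on extraction, so a file like ALLP200.db looks weeks old the moment it is unpacked and an age-based scratch cleaner will remove it; resetting the mtimes after unpacking is one way to protect the files.
      import os
      import tarfile
      import time

      MAX_AGE_DAYS = 7   # hypothetical cleaner policy

      def unpack_dbrelease(archive, dest):
          with tarfile.open(archive) as tar:
              tar.extractall(dest)            # extracted files keep their archive mtimes
          now = time.time()
          for root, _dirs, files in os.walk(dest):
              for name in files:
                  # reset mtime to "now" so the cleaner does not treat fresh files as old debris
                  os.utime(os.path.join(root, name), (now, now))

      def is_old_debris(path, max_age_days=MAX_AGE_DAYS):
          # what an mtime-based WN scratch cleaner keys on
          age_days = (time.time() - os.path.getmtime(path)) / 86400.0
          return age_days > max_age_days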

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Any news from NET2 on the missing files from the medcut sample (mc08.105807.JF35_pythia_jet_filter.merge.AOD.e418_a84_t53_tid070499 in particular) and on the registration problems (SFN not set in LFC for guid ...) for the dataset step09.00000011.jetStream_lowcut.recon.AOD.a84/ ?
      • Saul: these have been replaced, should be good to go. Nurcan: worried registration problem has not been addressed. Saul will track this down.
    • How ANALY_SWT2_CPB was being set to offline: understood. Aaron reported that 'client=crawl-66-249-71-XXX.googlebot.com' had been setting 'ANALY_SWT2_CPB-pbs' offline without using a proxy. An https and proxy requirement was put in place for the curl command (a sketch of the idea appears after this list).
    • SLAC requested a HammerCloud test last Friday: http://gangarobot.cern.ch/hc/all/test/, tests 533-536, all completed. Any news on the results?
    • ANALY_MWT2_SHORT - decommissioned. Charles working with Alden to get internal name straightened out.
  • this meeting:
    • Status of DB access job at SWT2 and SLAC: Sebastien Binet made one last tag (PyUtils-00-03-56-03) to modify the PyUtils/AthFile.py module for xrootd sites. Need to test it. If this does not work with the job, we need to switch to a release 15 job.
    • Status of step09 containers at BU: Saul reported that the missing files as well as the missing SFN errors were caused by a bug in cleanse which can delete files outside of the proddisk area. A cleanup is being done. Charles has the current version with the fix. In the meantime, the containers are being repaired.
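  • Regarding the ANALY_SWT2_CPB/googlebot item in last meeting's notes, a sketch of the client side of such a fix (hypothetical endpoint and paths, not the actual Panda curl command): queue-control calls go over https and present an X.509 grid proxy, so an anonymous crawler hitting the URL cannot change a queue's state.
      import os
      import requests

      # grid proxy file; this path is the conventional default, adjust as needed
      PROXY = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

      def set_queue_status(base_url, queue, status):
          resp = requests.get(
              base_url + "/setQueueStatus",              # hypothetical path
              params={"queue": queue, "status": status},
              cert=(PROXY, PROXY),                       # proxy file holds both cert and key
              verify="/etc/grid-security/certificates",  # CA directory used on grid hosts
              timeout=30,
          )
          resp.raise_for_status()                        # unauthenticated calls fail here
          return resp.text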

DDM Operations (Hiro)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
  • this week
    • Main issue is confirmation of environment settings at AGLT2
    • Notes at: https://lists.bnl.gov/pipermail/racf-frontier-l/2009-August/000380.html
    • ATLAS_FRONTIER_CONF needs to be defined if squid+frontier is used; Xin installed a new release.
    • How will users use this transparently - is it to be configured site-wide? Would need to change releases. Also in setup.sh in /etc/profile.d/ and in the OSG_LOCAL setup for grid jobs. Site-specific solution at AGLT2 right now. Goal is to do this in one place. Main issue is whether the environment is getting set up correctly for each job; worried about this (see the sketch below).
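  • A sketch of a per-job environment check for the point above (a hypothetical validator, not an official tool): confirm that ATLAS_FRONTIER_CONF is actually defined in the environment a job sees and that it names a proxy (the local squid). The exact contents of the variable are site-specific.
      import os
      import sys

      def check_frontier_env():
          conf = os.environ.get("ATLAS_FRONTIER_CONF")
          if not conf:
              print("ATLAS_FRONTIER_CONF is not defined in the job environment")
              return False
          if "proxyurl" not in conf:
              print("ATLAS_FRONTIER_CONF is set but no proxyurl (local squid?) found: " + conf)
              return False
          print("Frontier configuration looks sane: " + conf)
          return True

      if __name__ == "__main__":
          sys.exit(0 if check_frontier_env() else 1)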

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • See meeting MinutesDataManageJul28
    • All storage at BNL is now Thumper/Thor-based (WN storage retired)
    • New release of prodiskcleanse.py and ccc.py this week from Charles.
  • this week:
    • No meeting this week
    • Monitoring:
    • ccc updates (Charles), http://repo.mwt2.org/viewvc/admin-scripts/lfc/ccc.py?view=markup. Has web output.
       From: Charles Waldman 
      Date: August 5, 2009 8:36:46 PM CDT
      To: Kaushik De 
      Cc: Horst Severini , Wei Yang , Sarah Williams , mwt2-core-l@LISTSERV.INDIANA.EDU, David Lesny , John Brunelle , Shawn McKee 
      Subject: New version of CCC consistency checker, I'm seeking collaborators
      There's a new version (currently 1.32) of my "ccc" consistency
      checker available:
      Code is here:
      Sample output here:
      (Kaushik, you might be interested to take a look at this - we have
      a large number of files in dCache which are not associated with any
      dq2 datasets, and they don't seem to be PandaMover files).
      This version has nice html output, fixes a few bugs in previous
      versions, and also compiles a list of dq2 orphans - files in
      mass storage that are not associated with any dq2 dataset.  This
      skips anything with _dis[0-9]+/ or _sub[0-9]+/ in the path, since
      these are PandaMover datasets not registered in dq2.  (It would
      be really nice if there were only 1 data distribution system,
      instead of dq2+PandaMover - the 2 different systems make bookkeeping
      more difficult.  Perhaps there's a better way to identify pandamover
      datasets, but this seems to work).
      However, this version only works for sites using dCache + PNFS,
      so that means only UC, IU and IllinoisHEP (T3).  I'd like to make
      this usable for everyone, which means a version that works with
      dCache+Chimera, a version for xrootd, and versions for "plain posix"
      (Ibrix, etc), plus whatever y'all are doing at Northeast (GPFS?)
      In order to do this I need a little help.  This is what I'm
      looking for:
      A) For your storage system, is there a way to dump the namespace,
      like pnfsDump?  This would generate at the very least a list of
      filenames that are supposed to be present, but more info would
      be good: filesizes and times would be useful.  Should this be
      done by issuing some command from inside ccc, or would it be
      better to dump to a file, then have ccc read that file?
      B) For your storage system, is there a way to verify what is
      actually on disk, as opposed to what's in the namespace?  If
      you're using "straight posix", then there's no difference between
      the namespace and storage, but if you have Chimera or xrootd,
      there should be some way to check what's in the namespace against
      what is on disk.  So, in that case, is there a way to check what's
      on the pools?
      C) Does the namespace report file sizes correctly, or is there
      some extra step required?  (For PNFS, files >2GB are reported 
      incorrectly due to limitations in NFSv2, but there is a 'backdoor'
      to get the actual file sizes).
      Once I get these questions answered I can add command-line flags
      to ccc.py to make it work with different storage systems.  Thanks
      for any comments/input.
      	   - Charles
    • Need to get this going for xrootd sites.
    • Tier0 share to be removed (from Hiro, done)
    • Large numbers of dq2 orphans (3 TB at UC!)
    • There was a file transferred from AGLT2 with 0 size and no checksum. Hiro notified.... how did it make it through the system?
    • Charles: the checking code in dq2 passes these; it needs to be tightened - will follow up. (A sketch of the kind of consistency check involved appears after this list.)
    • Doug notes still waiting for T3 data mover.
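  • A sketch of the kind of cross-check ccc.py performs (illustration only, not the actual script): compare a storage namespace dump against the files dq2/LFC knows about, skip PandaMover working datasets (_disNNN/_subNNN in the path, as in Charles's mail above), and flag dq2 orphans plus suspicious zero-size entries like the AGLT2 file noted above.
      import re

      PANDAMOVER = re.compile(r"_(dis|sub)[0-9]+/")

      def find_orphans(namespace_entries, dq2_registered):
          """namespace_entries: iterable of (path, size_bytes) from a PNFS/xrootd/posix dump;
          dq2_registered: set of paths known to dq2/LFC."""
          orphans, zero_size = [], []
          for path, size in namespace_entries:
              if PANDAMOVER.search(path):
                  continue                    # PandaMover data, not registered in dq2
              if size == 0:
                  zero_size.append(path)      # zero-length files deserve a closer look
              if path not in dq2_registered:
                  orphans.append(path)        # on disk but in no dq2 dataset
          return orphans, zero_size

      # toy example (made-up paths):
      ns = [("/pnfs/site/atlasmcdisk/mc08/AOD.123._0001.pool.root", 1024),
            ("/pnfs/site/atlasproddisk/panda/task_dis123456/x.root", 2048),
            ("/pnfs/site/atlasmcdisk/stray/old.root", 0)]
      known = {"/pnfs/site/atlasmcdisk/mc08/AOD.123._0001.pool.root"}
      print(find_orphans(ns, known))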

Throughput Initiative (Shawn)

  • last week(s):
    • perfsonar status: RC1 of the next version is available. Karthik will help test this version. When released, we want to deploy it on all sites by August 10. There are new one-way delay measurements.
    • Throughput tests of last week: not quite able to reach GB/s rates. The virtual circuits are not performing as well as they had. Mike O'Conner is studying flows, seeing packet-loss ramps when circuits are in place. The packet loss is not caused by the circuit. Will use UC as a test case to study this.
    • Hiro: will run regular tests of 20 files, 600 MB/s - at least once per day to each site.
  • this week:
    • See minutes to list
    • A little behind on the perfsonar milestone. Working on RC2 bugs at a few sites.
    • Will have details about the mesh tests to be set up: all T2's testing against each other, and against the T1 (see the sketch after this list).
    • Feed lessons learned to the T3's.
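  • A sketch of the mesh mentioned above (illustrative site names, not the actual test configuration): the mesh is simply every ordered source/destination pair, each T2 testing against every other T2 and against the T1.
      from itertools import permutations

      SITES = ["BNL_T1", "AGLT2", "MWT2", "NET2", "SWT2", "WT2"]   # illustrative names

      mesh = list(permutations(SITES, 2))    # ordered (source, destination) pairs
      for src, dst in mesh:
          print("schedule throughput/latency test: %s -> %s" % (src, dst))
      print("%d test pairs for %d sites" % (len(mesh), len(SITES)))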

OSG 1.2 deployment (Rob, Xin)

WLCG reliability and availability

From: Lcg Office 
Date: August 12, 2009 5:00:09 AM CDT
To: "project-wlcg-cb (Members of the WLCG CB)" 
Cc: "project-lcg-gdb (LCG - Grid Deployment Board)" 
Subject: July 2009 - Tier-2 Reliability and Availability Report

Dear Collaboration Board Members,

Find below the draft WLCG Reliability and Availability Report for the Tier-2 sites for last month.


Please send your comments and corrections to lcg.office@cern.ch before Friday 21 August 2009. The report will then be published on the WLCG web site and reported to the WLCG Overview Board.

Best regards Alberto Aimar, LCG Office

LCG Office IT Dept - CERN CH-1211 Genève, Switzerland http://www.cern.ch/lcg

WLCG Accounting

From: Lcg Office 
Date: August 11, 2009 4:25:16 AM CDT
To: Lcg Office , "worldwide-lcg-management-board (LCG Management Board)" , "project-wlcg-tier2 (List of LCG Tier 2 sites)" 
Subject: Tier 2 accounting - July 2009

Dear WLCG Tier 2 collaborator and associated members,

Please find attached the draft Tier 2 Accounting Report for July 2009, which will be published on the WLCG Accounting page http://lcg.web.cern.ch/LCG/accounts.htm next week.


Please therefore signal any anomalies that you may find to the LCG Office (lcg.office@cern.ch) by Friday 21 August (to take into account holiday absences across the Collaboration).

Kind regards, Cath

Site news and issues (all sites)

  • T1:
    • last week(s): All the storage deployed (1.5 PB usable) is now in production. It was a smooth transition. New hardware is working well; staging performance greatly improved. Nehalem bids coming in the next few days. OS SL5 upgrade. Decoupled file systems for interactive and grid production queues. Lots of disk I/O - considering moving to an SSD system. Upgraded AFS. Change in the BNL site name, now handled properly.
    • this week:

  • AGLT2:
    • last week: all running smoothly. Did have a Dell switch crash; recovered. Test pilot failures showing Athena failures; strange. All identified as seg faults. File not found, possibly? Did not seem to be the case. Do we have old test jobs?
    • this week: all okay. Will upgrade a blade chassis; have implemented a ROCKS 5 headnode setup. Near term: next Tuesday, to convert a chassis. OSG 1.2 on the second gatekeeper (the non-ATLAS gatekeeper has already been upgraded).

  • NET2:
    • last week(s): Just back from vacation - John working on HU problems. Myricom cards installed; should be ready for throughput testing. There are a few data management issues still not quite understood, though all have been resolved; could be related to file corruption from faulty NICs. At HU, jobs failing with an lfc-mkdir problem. Site not reporting to ReSS.
    • this week: also planning to convert a blade to SL5. Running smoothly at both sites.

  • MWT2:
    • last week(s): ANALY_MWT2 set at 400; lsm switchover at IU on Saturday. Uncovered a problem with PNFS latency (files taking a long time to appear); it turned out to be a mount option on the compute nodes. Monday: lost a shelf of data during an OS install. Memory requirements discussion - added lots more swap: 2G/2G RAM/swap. Fred following up on how the formal requirement is determined. Accounting discrepancy for MWT2_IU corrected.
    • this week: good week - 0 errors last few days!

  • SWT2 (UTA):
    • last week: Power outage event yesterday. The building generator control system failed; fixed. Lost power to the cluster - spent most of yesterday bringing systems back online. Problem with the gatekeeper node - should be back up today.
    • this week: no issues.

  • SWT2 (OU):
    • last week:
    • this week: no issues.

  • WT2:
    • last week(s): preload library problem fixed. Working on a procurement - ~10 Thor units. ZFS tuning. Looking for a window to upgrade the Thumpers. The latest fix for the xrootd client is in the latest ROOT release. Downtime for power outage planned during the week of the Tier 2 meeting. Running smoothly. 60%->45% CPU/walltime reduction - trying to understand. Need to update OS and ZFS ASAP.
    • this week: completed storage upgrade. Requesting another stress run to check things out. Will schedule another outage for August 25 - power, difficult to change date.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate the checksum on the fly and expects to release it very soon. Want to supply this to the gridftp server (see the streaming sketch below).
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
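  • A sketch of the on-the-fly Adler32 calculation mentioned under "last week" (a standalone illustration, not the xrootd implementation Alex is working on): the checksum is accumulated block by block as data streams through, and the final value is reported in the 8-hex-digit form commonly used for grid transfers.
      import zlib

      def adler32_of_file(path, blocksize=1024 * 1024):
          value = 1                                    # Adler32 starts at 1, not 0
          with open(path, "rb") as f:
              while True:
                  block = f.read(blocksize)
                  if not block:
                      break
                  value = zlib.adler32(block, value)   # accumulate block by block
          return "%08x" % (value & 0xffffffff)

      # usage: print(adler32_of_file("/path/to/file"))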

Local Site Mover


  • last week
  • this week
    • Next week - workshop at UC

-- RobertGardner - 11 Aug 2009



pdf GratiaTransferATLAS.pdf (139.0K) | RobertGardner, 12 Aug 2009 - 12:29 |
pdf WLCGReporting.pdf (87.7K) | RobertGardner, 12 Aug 2009 - 12:29 |