
MinutesAug26

Introduction

Minutes of the Facilities Integration Program meeting, Aug 26, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Mark, Kaushik, Michael, Rob, Sarah, Charles, Patrick, John, Saul, Wei, Bob, Nurcan, Hiro
  • Apologies: Fred

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04 NEW
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • We also need to discuss the OSG 1.2 upgrade - should we do this upgrade together with SL5?
      • Reprocessing exercise - postponed until August 24 - room for sites to do maintenance; otherwise it will need to be postponed.
    • this week:
      • Site certification table has a new column for the lcg-utils update, as well as curl. Will update as needed.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • All doing okay - enough jobs for the next two weeks.
    • Notes there is nothing in Panda except what's in for the US cloud - forward requests.
    • All T2's were running reprocessing validation jobs. There are a couple of issues that may require a new cache (quotation marks in XML generated by the trf; two DB release files are not getting their modification times updated, so they're getting deleted by scratch cleaners - needs a fix in the trf; see the sketch at the end of this section). Want to start on August 24.
    • There was a problem with pilot jobs mixing data and MC which caused Oracle access (task 78356); reprocessing has minimal Oracle access.
  • this week:
    • Reprocessing - will run a set of validation tasks for the next five days - shifters should not file bug reports. Depending on results, will be another week before a decision is made.
    • Otherwise tasks are running fine. Have a month's worth of tasks defined.
    • May get more requests from the Jet/ETmiss group (Eric Feng)
    • Central production - will get 10 TeV with release 15 rather than 7 TeV. Why? Detailed schedule is available.
    • Need to plan the replication of conditions data & Squid access at the Tier 2's.
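    • As an aside, a minimal sketch of the kind of age-based workaround the DB release timestamp issue implies, assuming a scratch cleaner that removes files older than some threshold; the directory and threshold below are illustrative assumptions, and the real fix is expected in the trf itself:
      # Hypothetical workaround sketch: refresh modification times on unpacked
      # DB release files so an age-based scratch cleaner does not remove them.
      # The directory and age threshold are assumptions, not the actual trf fix.
      import os
      import time

      DBRELEASE_DIR = "/scratch/DBRelease"   # assumed unpack location
      MAX_AGE_DAYS = 7                       # assumed cleaner threshold

      cutoff = time.time() - MAX_AGE_DAYS * 86400

      for dirpath, dirnames, filenames in os.walk(DBRELEASE_DIR):
          for name in filenames:
              path = os.path.join(dirpath, name)
              try:
                  if os.path.getmtime(path) < cutoff:
                      os.utime(path, None)   # touch: reset mtime/atime to "now"
              except OSError:
                  pass                       # file may disappear underneath us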

Shifters report (Mark)

  • Reference
  • last meeting: Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=66077
    [ Production generally running very smoothly this past week -- most tasks have low error rates.  Majority of failed jobs the past few days were from site validation tasks. ]
    [ ESD reprocessing postponed until ~August 24. ]
    1)  8/6 a.m.: ATLAS Conditions oracle database maintenance at BNL completed successfully.
    2)  8/6 a.m.: BNL - brief outage on several WAN circuits due to a problem with the provider (Level3).  Resolved.
    3)  8/6 a.m.: BNL dq2 site services moved to a new host.  An issue with the monitoring was reported post-move -- presumably resolved?
    4)  8/7: Problematic WN at MWT2_IU (iut2-c036) created stage-in failures -- issue resolved.
    5)  8/8 a.m.: Issue with BNL_OSG2_MCDISK resolved -- from Pedro:
    We had a problem with some pools in one of our storage servers.  The problem has been fixed and all data is again online.
    6)  This past weekend: attempt to migrate site name BU_ATLAS_Tier2o to BU_ATLAS_Tier2 resulted in some failed jobs, as they were assigned to the new name during the brief time it was implemented -- rolled back for now.
    7)  8/10-8/11: BNL -- Jobs failed at BNL with "Get error: dccp get was timed out after 18000 seconds" -- from Pedro:
    There were some problems with some pools.  We've restarted them and tested some of the files that failed.  The problem seems to be fixed now.  We will continue to check for other files according to the pnfsids on the Panda monitor.  RT 13775.
    8)  8/11: Storage maintenance completed at SLAC -- test jobs finished successfully -- SLACXRD & ANALY_SLAC set back 'on-line'.  (Possible power outage on 8/25?)
    9)  8/11: Tier3 UTD-HEP set back to 'on-line' following: (i) disk clean-up in their SE areas; (ii) fixed some issues related to RSV; (iii) successful test jobs.
    10)  8/11: Some jobs from task 78328 were failing at SWT2_CPB with the error "Auxiliary file sqlite200/*.db not found in DATAPATH" - eventually tracked down to the fact that the file ALLP200.db has a modification time of July 8th, after it is unpacked from a DB tar/zipped container, and hence was getting removed by an automated script that cleans old debris from the WN scratch areas.  A patch to the DB file is planned. eLog 5088.
    11) Today: It was announced that all sites should make plans for migrating to SL(C)5 -- from Ian Bird:
    As agreed in the Management Board on August 4, now is the time to push to complete the SL5 migration at all sites, including the Tier 2s.  It was understood in the meeting that the experiments are all ready and able to use SL5 resources.   A web page is available to provide pointers to the relevant information to support the migration, including links to the necessary packages.  (https://twiki.cern.ch/twiki/bin/view/LCG/SL4toSL5wnMigration)
    It is now expected that Tier 1s and Tier 2s should plan this migration as rapidly as possible, so that the majority of resources are available under SL5 as soon as possible.
    12)  Follow-ups from earlier reports: 
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues. 
    • Production running smoothly w/ low error rates
    • Most errors from validation tasks - experts notified.
    • Sites offline have been brought back into production: SLAC, UTD_HEP
    • Attempt to modify the site name for BU failed; will need to re-try later. Saul: turns out it's not so easy - will hold off until going to SL5 and OSG 1.2
    • Time-stamp on dbrelease files
  • this meeting:
    • Notes:
      ====================================================== 
      
      Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: 
      See attached file.
      
      [ ESD reprocessing postponed until ~August 31(?) ]
      [Production generally running very smoothly this past week -- most tasks have low error rates.  Exception -- see 4) below.]
      
      1)  8/19: Job failures at HU_ATLAS_Tier2 with the error "Bad credentials" -- resolved -- from John:
      We had a problem with the disk that's hosting the TRUSTED_CA for the wn-client installation, causing jobs to fail with `Bad credentials' errors.  This should be fixed now. 
      2)  8/19: file transfer errors due to problem at WISC_MCDISK -- issue resolved.  RT 13841.
      3)  8/20: file transfer errors at AGLT2_MCDISK -- GGUS 51024 -- resolved, from Shawn:
      The two pools with the bulk of the MCDISK free space are being recovered. Once they are back online we will respond with further details. The AGLT2_MCDISK area is very full in general.
      4)  8/20:  High failure rate at most U.S. sites for jobs like valid1 csc_physVal_Mon*, 15.3.1.1 merge tasks 79222-79524.  Tasks aborted.
      5)  Weekend of 8/22-23: Software upgrades (OSG, dCache) completed at AGLT2.  An issue with AFS access from the WN's affecting pilots was resolved -- test jobs succeeded -- site set back to 'online'. 
      6)  8/24: ~250 jobs failed at HU_ATLAS_Tier2 with the error "No space left on device."  GGUS 51087.  Apparently resolved -- I must have missed a follow-up?
      7)  8/25-26: Power outage over at SLAC -- test jobs completed successfully -- site is set back to 'online'.
      8)  Follow-ups from earlier reports: 
      (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
      (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
    • Quiet week
    • A few sites have been in downtime for s/w upgrades (AGLT2) and a power outage (SLAC). Both are back online.
    • Had some aborted tasks - see item 4) in the notes above.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Status of DB access job at SWT2 and SLAC: Sebastien Binet made one last tag (PyUtils-00-03-56-03) to modify the PyUtils/AthFile.py module for xrootd sites. Need to test it. If this does not work with the 14.5.1.4 job, we need to switch to a release 15 job.
    • Status of step09 containers at BU: Saul reported that the missing files as well as the missing SFN errors were caused by a bug in cleanse, which can delete files outside of the proddisk area. A cleanup is being done. Charles has the current version with the fix. In the meantime the containers are being repaired.
  • this meeting:
    • Working to get analysis jobs using TAGs and database access into HammerCloud. There will be a discussion on job database access next week during the S&C workshop; Fred and David Front are invited. Sasha suggests using release 15 with two options:
      • Adapt for HammerCloud the lightweight database access testing scripts from Xavi. These Athena jobs do not require any input data files per se. The tests are most useful when a large set of jobs is submitted with varied input parameters for each job.
      • Also recommends using the job from Fred Luehring for the HammerCloud tests. Efforts to adapt this job for Ganga will be well invested, as this particular job can later be developed into a Frontier testing job as well as a conditions POOL file access job.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • While looking at the transfer problem from BU to FZK (which is not part of the ATLAS computing model for file transfers), I noticed that some T2s (NET2, SWT2, WT2) are not publishing SRM information even to the OSG BDII. Also, some sites in the OSG BDII (port 2170) are not in the OSG BDII (port 2180). This is really confusing (for debugging). Are there any plans to publish all US T2 SRMs to the OSG BDII (both ports 2170 and 2180)? Although it is not part of the ATLAS computing model, this type of transfer from US T2 scratch space to other clouds will happen as more users become active with real data. Unless ATLAS T2s publish this info via BDII, DDM will fail because the FTS servers in other clouds depend on this information. So shouldn't we push to do this?
  • this meeting:
    • Publishing storage into the BDII - SE's only, as certain transfers are failing (foreign Tier 1's to US Tier 2's). See the sketch at the end of this section.
    • Need to understand the technical implications. Why are we breaking the hierarchical model? Need a discussion in a different forum.
    • Let's discuss this next week at CERN.
    • A throughput test BNL->AGLT2 revealed a dCache glitch; Shawn is investigating.
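    • To illustrate the kind of BDII check being discussed, a small sketch (Python) that lists the GlueService endpoints published by a BDII and looks for a site's SRM host; the BDII host, port, and SRM hostname are placeholders, not actual endpoints:
      # Sketch: list GlueService endpoints published by a BDII and check whether
      # a given SRM host appears.  Host names below are placeholders.
      import subprocess

      BDII_HOST = "bdii.example.org"   # placeholder for the OSG BDII host
      BDII_PORT = 2170                 # the minutes mention ports 2170 and 2180
      SRM_HOST = "se.example.edu"      # placeholder for a Tier-2 SRM endpoint

      cmd = [
          "ldapsearch", "-x", "-LLL",
          "-H", "ldap://%s:%d" % (BDII_HOST, BDII_PORT),
          "-b", "o=grid",
          "(objectClass=GlueService)",
          "GlueServiceEndpoint", "GlueServiceType",
      ]
      output = subprocess.check_output(cmd).decode()

      endpoints = []
      for line in output.splitlines():
          if line.startswith("GlueServiceEndpoint:"):
              endpoints.append(line.split(":", 1)[1].strip())

      matches = [e for e in endpoints if SRM_HOST in e]
      print("%d endpoints published, %d mention %s" % (len(endpoints), len(matches), SRM_HOST))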

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • last week(s):
    • Hiro: will run regular testing - 20 files, 600 MB/s - at least once per day to each site.
    • A little behind on the perfSONAR milestone. Working on RC2 bugs at a few sites.
    • Will have details about the mesh tests to be set up: all T2's testing against each other and against the T1 (see the sketch at the end of this section).
  • this week:
    • Yesterday had a briefing from the DOE Office of Science project office - there was a discussion of T3 throughput.
    • We need to get T3 data transfers working, and documented. Want this consolidated in one place for documentation purposes.
    • Doug: we need the I2 tools in place, and we need to get the Pandamover transfers to T3's.
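    • A small sketch of how the full mesh of test pairs could be enumerated (the site list comes from these minutes; the pairing logic is illustrative and not the actual automated test harness):
      # Sketch: enumerate source -> destination pairs for a full-mesh throughput
      # test among the US Tier-2s plus the Tier-1.  Illustrative only.
      from itertools import permutations

      SITES = ["BNL", "AGLT2", "MWT2", "NET2", "SWT2", "WT2"]

      # Every ordered pair (src != dst) gets tested once per cycle.
      mesh = list(permutations(SITES, 2))
      for src, dst in mesh:
          print("schedule transfer test: %s -> %s (20 files)" % (src, dst))
      print("total pairs per cycle: %d" % len(mesh))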

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • Nothing for this week - schedule remains to have this, and the SL5 upgrade, complete by October.
    • Probably should focus on SL5 upgrade first.
    • See discussion above
  • this week:
    • Update to OSG 1.2.1 release this past week for pyopenssl patch

Site news and issues (all sites)

  • T1:
    • last week(s): All the storage deployed (1.5 PB usable) is now in production. It was a smooth transition. New hardware is working well; staging performance is greatly improved. Nehalem bids coming in the next few days. OS SL5 upgrade. Decoupled file systems for interactive and grid production queues. Lots of disk I/O - considering moving to an SSD system. Upgraded AFS. Change in the BNL site name, now properly registered.
    • this week: have completed the site name consolidation; now have a proper WLCG site name. This was nontrivial to implement.

  • AGLT2:
    • last week: all okay. Will upgrade a blade chassis; have implemented a ROCKS 5 headnode setup. Near term - next Tuesday to convert a chassis. OSG 1.2 on the second gatekeeper. (Non-ATLAS gatekeeper already upgraded).
    • this week: Upgraded the gatekeeper to OSG 1.2 - went well; but there are dCache upgrade problems and throughput is very low. Adding space to MCDISK.

  • NET2:
    • last week(s): also planning to convert a blade to SL5. Running smoothly at both sites.
    • this week: all okay; HU running below capacity.

  • MWT2:
    • last week(s): no issues
    • this week: looking into data corruption - some jobs have failed as a result. Doing a full scan of dCache, calculating checksums and comparing them to the LFC (see the sketch below). Found over 300 files with mismatched checksums out of ~2M files. Otherwise running with a low error rate.
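    • A sketch of the comparison step only, assuming the dCache scan and an LFC export have each been dumped to text files of "name adler32" pairs; the file names and dump format are assumptions, not the actual MWT2 tooling:
      # Sketch: compare two checksum dumps (storage scan vs. catalog export)
      # and report mismatches.  The "name checksum" dump format is an assumption.
      def load(path):
          table = {}
          with open(path) as f:
              for line in f:
                  parts = line.split()
                  if len(parts) == 2:
                      name, checksum = parts
                      table[name] = checksum.lower()
          return table

      storage = load("dcache_scan.txt")   # assumed output of the dCache scan
      catalog = load("lfc_dump.txt")      # assumed export from the LFC

      mismatched = [n for n in storage if n in catalog and storage[n] != catalog[n]]
      missing = [n for n in catalog if n not in storage]

      print("files checked: %d" % len(storage))
      print("checksum mismatches: %d" % len(mismatched))
      print("in catalog but not in storage: %d" % len(missing))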

  • SWT2 (UTA):
    • last week: no issues
    • this week: nothing to report

  • SWT2 (OU):
    • last week: no issues
    • this week: all is well.

  • WT2:
    • last week(s): completed storage upgrade. Requesting another stress run to check things out. Will schedule another outage for August 25 - power, difficult to change date.
    • this week: completed power outage, site coming back online. Low efficiency of xrootd client - have an update from developer to test.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server. (See the sketch at the end of this section.)
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
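  • For reference, a minimal sketch of computing an Adler32 checksum over a file in a streaming fashion, formatted as 8 hex digits (the usual grid convention); this illustrates the checksum itself, not Alex's actual xrootd/gridftp implementation:
    # Sketch: chunked Adler32 over a file, printed as 8 hex digits.
    import sys
    import zlib

    def adler32(path, chunk_size=1024 * 1024):
        value = 1  # Adler32 starting value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        for filename in sys.argv[1:]:
            print("%s  %s" % (adler32(filename), filename))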

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, there are some validation jobs still going on and some problems to solve. If anyone wants to migrate, go ahead, but we're not pushing right now. Want to have plenty of time before data comes (meaning within the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now - for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
  • this week

Getting OIM registrations correct for WLCG installed pledged capacity

  • last week
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for POSIX-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
  • this week

AOB

  • last week
  • this week
    • Procurement. Disappointed in pricing at Tier 1. Rebids came back more favorable. 120 nodes.


-- RobertGardner - 25 Aug 2009
