r4 - 27 Jul 2011 - 16:24:36 - MarkSosebee

MinutesJuly27

Introduction

Minutes of the Facilities Integration Program meeting, July 27, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line: (6 to mute) ** announce yourself in a quiet moment after you connect **
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Integration program from this quarter, FY11Q3
      • SiteCertificationP17 - quarterly report deadline approaching
      • Need to discuss issue of LFC-bindings and workernode-client, wlcg-client updates.
      • Michael: there is a WLCG activity in progress for a new FTS version, FTS 3. It will not rely on statically configured channels - it will figure out paths dynamically, using historical information about transfer performance. This was discussed at a Tier 1 services coordination meeting. Would like to offer a testbed for the WLCG team, e.g., the ITB. Shawn: have they incorporated network awareness, perfSONAR data, LHCONE? Hooks in place? Michael - yes - this would be the central component.
      • Discussion on workshop on federated data storage for the LHC using xrootd - there is a workshop Sep 13-15 in Lyon. Federated access, monitoring, security and support. Andy is sending out a letter of invitation shortly.
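The dynamic path selection described above (replacing static FTS channels with choices driven by historical performance data) can be sketched roughly as follows. This is purely illustrative - the data layout, site names, and throughput numbers are invented for the example and are not the real FTS 3 interface.

```python
# Hypothetical sketch of FTS 3-style dynamic path selection: instead of a
# statically configured channel, pick the route between two sites with the
# best average historical throughput (the kind of data perfSONAR collects).

def best_path(src, dst, history):
    """Return the candidate path with the highest mean historical throughput (MB/s)."""
    candidates = [p for p in history if p["src"] == src and p["dst"] == dst]
    if not candidates:
        raise ValueError(f"no known path {src} -> {dst}")
    return max(candidates, key=lambda p: sum(p["throughput"]) / len(p["throughput"]))

# Illustrative measurements for two possible routes between the same endpoints.
history = [
    {"src": "BNL", "dst": "AGLT2", "via": "LHCONE", "throughput": [480, 510, 495]},
    {"src": "BNL", "dst": "AGLT2", "via": "general IP", "throughput": [120, 140, 110]},
]

print(best_path("BNL", "AGLT2", history)["via"])  # picks the higher-throughput route
```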
    • this week

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Problems fetching data with dq2-get at SWT2, but it worked from lxplus; feedback sent to the user.
    • ADC meeting on Monday - low level reports from users having problems fetching data. Larger than US issue.
    • Doug: Can tracer logs give transfer failures?
    • Michael: heard claim of 10% loss due to "dq2-get problems".
    • July 3 - several hours with communication problems getting to the Panda server. Jobs ran twice - being addressed by Tadashi.
  • this week:

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • MinutesDataManageJul5
    • Main issues: LFC errors assumed to be recurring - ACL issues? Slowness of deletion rate - dq2 database bottleneck? Deletion volume has increased 5x, is the claim.
    • We saw the deletion rate fall 10x in one day in May; the last two months it has been much lower. Issue tracked down to the migration of the dq2 db to a new cluster, not sure why; indexing rebuilt and optimized, still not solved. (The migration was supposed to be an improvement.)
    • Discussing moving deletion service locally - to the sites. This is a high priority to solve for DDM team.
    • Deletion agent is part of site services - Hiro attempting to do this at BNL. Not sure about prospects for Tier 2 sites. Testing with Vincent. Hiro does not believe it's a locality issue.
  • this week:

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-7_18_2011.html
    
    1)  7/14: BNL - brief intervention affecting storage services completed as of ~11:40 a.m. EST.  (Added additional storage space to the SRM db 
    server - required a reboot of the machine.)  eLog 27263.
    2)  7/15: WISC - file transfer errors ("the server responded with an error500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:797:500-open() fail500 End.").  Wen reported the problem had been fixed.  ggus 72664 closed, eLog 27276.
    3)  7/15: SWT2_CPB - file transfer errors from UPENN_LOCALGROUPDISK to SWT2_CPB_USERDISK (three files - checksum errors).  
    These are the same files associated with the issue of Panda re-running the same jobs twice, thus resulting in inconsistent checksum values in 
    the LFC vs. on disk (see item #12 from 7/6 shift summary).  The files were declared bad to DQ2, but there is an issue with long vs. short SURLs.  
    https://savannah.cern.ch/bugs/?84428, eLog 27295.
    4)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and the 
    CosmicCalo stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses release 17.0.2.3 
    and DBrelease 16.2.1.1.  Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    5)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on 
    lxplus, so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    6)  7/19 a.m., from Shawn at AGLT2: In an attempt to fix our dCache GridFTP door hangs, AGLT2 will be deploying a patched version of dCache 
    (1.9.12-6 -> 1.9.12-7rc) which fixes a race condition.  We will be doing a rolling upgrade of the doors/pools and so expect intermittent outages from 
    8 AM to noon today.  Later in the day the site experienced a network outage 
    (see details here: http://www-hep.uta.edu/~sosebee/ADCoS/network-problem-AGLT2-7_19_2011.html).  Issues resolved, site was set back on-line as 
    of 7/20 a.m.  See: eLog 27468/60/39, https://savannah.cern.ch/support/index.php?122331.  ggus 72787 was opened during this period, now closed.
    7)  7/20: new pilot release from Paul (v48a).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS//pilot-version_SULU_48a.html
    
    Follow-ups from earlier reports:
    (i)  7/12: UTD-HEP - site admin requested that the site be set off-line for a maintenance outage.  https://savannah.cern.ch/support/?122180, 
    eLog 27209.
    Update 7/16: additionally site blacklisted in DDM due to file transfer errors.  ggus 72698 opened, eLog 27306/10.
    (ii)  7/12: NERSC - three DDM endpoints set off-line in advance of downtime 7/13. 
    https://savannah.cern.ch/support/index.php?122179.
    Update 7/18: site was set back on-line in DDM on 7/15, but had to be set back off due to file transfer errors.  Under investigation.  eLog 27294.
    Later that day: site admin reported that the SRM service had been running correctly for several days, so site was again un-blacklisted.  Latest 
    attempts at file transfers have succeeded, so https://savannah.cern.ch/support/index.php?122179 closed.  eLog 27390.
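The checksum mismatches in item 3 above (values in the LFC disagreeing with what is on disk) are the kind of thing that can be verified locally by recomputing the file checksum. A minimal sketch of a chunked Adler-32 file checksum, the algorithm ATLAS uses for data integrity, assuming Python 3.8+; the helper name is ours, not from any ATLAS tool:

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file as an 8-digit hex string.

    Reads in 1 MB chunks so arbitrarily large files fit in memory; the
    running value is threaded through zlib.adler32 on each chunk.
    """
    value = 1  # Adler-32 initial value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

A shifter would compare this against the checksum recorded in the catalog: a mismatch means the replica (or the catalog entry) is bad and the file should be declared to DDM.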
     
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=148585 
    
    1)  7/26: Job failures and DDM transfer errors at two sites (IllinoisHEP & Duke).  Issue was an expired cert on the tier-3 LFC host, now updated.  
    ggus tickets 72962 & 72974 closed, eLog 27688/707.
    2)  7/26: NERSC_SCRATCHDISK file transfer errors ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  
    ggus 72961 in-progress, eLog 27681, https://savannah.cern.ch/bugs/index.php?84879 (Savannah DDM ticket).
    
    Follow-ups from earlier reports:
    
    (i)  7/12: UTD-HEP - site admin requested that the site be set off-line for a maintenance outage. https://savannah.cern.ch/support/?122180, eLog 27209.
    Update 7/16: additionally site blacklisted in DDM due to file transfer errors.  ggus 72698 opened, eLog 27306/10.
    Update 7/19: downtime was declared, so now possible to close ggus 72698 & Savannah 122180.  eLog 27706.
    Update 7/26: A ToA update is needed, so the site was again blacklisted in DDM.  http://savannah.cern.ch/support/?122471 (Savannah site 
    exclusion ticket).
    (ii)  7/15: SWT2_CPB - file transfer errors from UPENN_LOCALGROUPDISK to SWT2_CPB_USERDISK (three files - checksum errors).  These are 
    the same files associated with the issue of Panda re-running the same jobs twice, thus resulting in inconsistent checksum values in the LFC vs. on 
    disk (see item #12 from 7/6 shift summary).  The files were declared bad to DQ2, but there is an issue with long vs. short SURLs. 
    https://savannah.cern.ch/bugs/?84428, eLog 27295.
    (iii)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and the 
    CosmicCalo stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses release 17.0.2.3 
    and DBrelease 16.2.1.1.  Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    (iv)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on 
    lxplus, so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to 
    come up with a solution. 
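Expired host certificates like the one behind item 1 above (the tier-3 LFC host) are easy to catch before they bite. A minimal sketch using standard openssl commands; the self-signed certificate generated here is just a stand-in so the example is runnable, and the paths and CN are invented:

```shell
# Stand-in for a grid host certificate (normally /etc/grid-security/hostcert.pem).
openssl req -x509 -newkey rsa:2048 -keyout /tmp/lfc_key.pem -out /tmp/lfc_cert.pem \
    -days 30 -nodes -subj "/CN=lfc.example.org" 2>/dev/null

# Print the expiry date of the certificate.
openssl x509 -enddate -noout -in /tmp/lfc_cert.pem

# -checkend N exits 0 if the cert is still valid N seconds from now.
# 604800 s = 7 days: a convenient cron check to warn before expiry.
if openssl x509 -checkend 604800 -noout -in /tmp/lfc_cert.pem; then
    echo "certificate valid for at least another 7 days"
else
    echo "certificate expires within 7 days - renew now"
fi
```

Run against the real host certificate from cron, this turns a silent service outage into an advance warning.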
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Meeting this week - notes posted
    • For next quarter, focus on inter-cloud problems. The Italian and Canadian clouds will deploy perfSONAR-PS, as well as LHCOPN.
    • How best to leverage the deployed infrastructure. Can't do full mesh - select representative sites.
    • Related - enhance the infrastructure to allow tests on demand, e.g., T2-T2 transfers; schedule an on-demand test. What's the network path - isolate the site from network problems.
    • Using Tomasz's infrastructure
    • Distribute tests over resources
  • this week:

Federated Xrootd deployment in the US

last week(s):

this week:

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • Will need to track down "unknown" category
  • Booker reports progress with SLAC security - no roadblocks - should get HCC jobs online soon.
this week
  • Ramping up Engage - Mats' latest email
    • BNL: Has been working for a long time now.
    • AGLT2: Works.
    • MWT2: enabling - today. We are not seeing this site in the ReSS feed, so probably not enabled.
    • SWT2-UTA : enabled. Works.
    • SWT2-OU: We are not seeing this site in the ReSS feed, so probably not enabled.
    • WT2: We are not seeing this site in the ReSS feed, so probably not enabled.
    • NET2: forgot to ask

CVMFS

See TestingCVMFS

last week:

  • Dave - updated docs on the CERN twiki. There is a nagging problem related to job failures; thought it might be a squid issue, but sees jobs failing with python mismatch problems. Investigated python logs. Only happens on certain jobs. Happening both at MWT2 and IllinoisHEP.
  • Patrick - have asked to get a new resource approved for OSG for testing; awaiting WLCG BDII.
  • John: which sites are using the new namespace?
this week:

Site news and issues (all sites)

  • T1:
    • last week:
      • As mentioned earlier, empty directories issue. Otherwise very smooth.
      • Uptick in analysis jobs - at the Tier 1 and generally overall.
      • Chimera upgrade is taking shape.
    • this week:

  • AGLT2:
    • last week(s):
      • Major changes last week. New AS number, own routing entity for AGLT2 now. Added virtual routing config at UM site. Updated firmware on switches.
      • Current golden release: dCache 1.9.12-6; there are some issues.
      • Updated condor.
    • this week:

  • NET2:
    • last week(s):
      • Smooth running
    • this week:

  • MWT2:
    • last week:
      • Downtime tomorrow - DHCP for Cobbler, firmware on s-nodes, omreport on s-nodes, osg-gk (add two slots), srm (add two slots), UPS work
    • this week:
      • Some new HC tests of dcap and xrootd direct access
        • 2011-07-19 ANALY_MWT2, ANALY_MWT2_X result: both are direct access xrootd; analy only runs at UC, analy_x at IU; mc10_7TeV*NTUP_SUSY*p428*
        • 2011-07-18 ANALY_MWT2, ANALY_MWT2_X result: both are direct access xrootd; analy only runs at UC, analy_x at IU; group10.phys-susy.SUSYD3PD.mc09*00614*
        • 2011-07-18 ANALY_MWT2, ANALY_MWT2_X result: both are direct access xrootd; analy_x only runs at UC; group10.phys-susy.SUSYD3PD.mc09*00614*
        • 2011-07-17 ANALY_MWT2, ANALY_MWT2_X result: both are direct access dcap; analy_x only runs at UC; mc10_7TeV*NTUP_SUSY*p428*
        • 2011-07-17 ANALY_MWT2, ANALY_MWT2_X result: both are direct access dcap; analy_x only runs at UC; group10.phys-susy.SUSYD3PD.mc09*00614*

  • SWT2 (UTA):
    • last week:
      • Quiet - except for doubly-run jobs. Working on clean-up - following Hiro's suggestion to declare the files as bad.
      • CVMFS work
      • Tested fetch for job options from atlas web server
    • this week:

  • SWT2 (OU):
    • last week:
      • on vacation
    • this week:
      • Ran into a problem with load on OSCER_ATLAS cluster.

  • WT2:
    • last week(s):
      • Problem with LFC hardware yesterday, replaced.
      • DDM transfer failures from Germany and France - all logfiles. ROOT files are working fine. Is FTS failing to get these? Email sent to Hiro. NET2 also seeing performance problems.
      • Hiro - notes many of these are never started, they're in the queue too long.
      • Suspects these are group production channels.
      • T2D channels?
      • FZK to SLAC seems to be failing all the time. Official Tier 1 service contact for FZK?
    • this week:

Carryover issues (any updates?)

Python + LFC bindings, clients

last week(s):
  • New package from VDT delivered, but still has some missing dependencies (testing provided by Marco)
this week:

AOB

last week
  • None.
this week


-- RobertGardner - 20 Jul 2011
