r6 - 18 Aug 2011 - 10:10:53 - RobertGardner

MinutesAug10

Introduction

Minutes of the Facilities Integration Program meeting, Aug 10, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press *6 to mute); announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Shawn, Torre, AK, Dave, Nate, Rob, Fred, Michael, Patrick, Armen, Kaushik, Mark, John D, Tom, Bob, Sarah, Saul, John B, Wei, Xin
  • Apologies: none
  • Guest: Dan Bradley

Integration program update (Rob, Michael)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week
  • Dan Bradley joined the meeting
  • SupportingGLOW
  • Science being supported - economists, animal science (statistics), physics; varies month-to-month
  • Space: no space requirements at the site, other than on the worker node
  • Few requirements - just must run under Condor
  • Some jobs use http-squid to get files; others use condor transfer mechanisms with TCP.
  • Using the site squid proxy, as advertised in the environment; (may need to check this)
  • Typical run time? They aim for 2 hours, though some run longer; jobs run under glideins, so sites will see only the pilot. Glidein lifespan? A day or so.
  • Preemption is okay, expected.
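The notes above mention that some GLOW jobs pull input files over HTTP via the site squid proxy advertised in the environment. A minimal sketch of what that looks like from a job's point of view; the proxy host, port, and URL below are placeholders, not real site values:

```python
import os
import urllib.request

# Hypothetical illustration: a job routes HTTP downloads through the site
# squid proxy if one is advertised in the environment. The proxy address
# here is an assumption for illustration only.
os.environ.setdefault("http_proxy", "http://squid.example.edu:3128")

def fetch(url):
    # urllib honors the http_proxy environment variable by default,
    # so this request transparently goes through the squid cache.
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

In practice the variable is set by the site or pilot environment rather than by the job itself, which is part of what the "may need to check this" note above refers to.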

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
  • this week:
    • On Friday Borut aborted a large number of mc11 s128-n tag tasks; he said he would resubmit.
    • On Monday production started to drain; s127-tag jobs are available. mc10 is finished, but some groups are still requesting it.
    • Production is now completely drained, and there has been no email from Borut.
    • User analysis has been mostly constant.
    • Reprocessing campaign (Jonas)? It has not started; it is scheduled to begin today or tomorrow.

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
  • this week:
    • Overall things are okay; maintenance activities in various places
    • Good news: progress in understanding deletions, with a factor-of-2 gain in rate and better performance, mainly at BNL. Rates are now higher than 7 Hz, sometimes 10-12 Hz, reducing the backlog. Expect to finish in the next week; this had been a struggle for several months.
    • MCDISK cleanup proceeding. BNLTAPE - finished, BNLDISK - nearly finished. Hope to complete the legacy space token.
    • LFC errors still with us - Hiro will talk with Shawn and Saul (user directories need to be fixed - ACL problems).
    • Space needed for BNLGROUPDISK
    • Next USERDISK clean-up - two or three weeks. Will need to send email by end of the week.
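As a back-of-the-envelope check on the deletion numbers above (7 Hz rising to 10-12 Hz, with the backlog expected to clear within a week), a short sketch; the backlog size used here is a made-up assumption, and only the rates come from the report:

```python
# Rough check of the reported deletion-rate improvement. The rates are
# from the report; the backlog size is a hypothetical round number.
old_rate_hz = 7.0          # previous sustained deletion rate
new_rate_hz = 11.0         # midpoint of the reported 10-12 Hz
backlog_files = 5_000_000  # assumed backlog size, for illustration only

def days_to_clear(n_files, rate_hz):
    # files divided by files-per-second gives seconds; convert to days
    return n_files / rate_hz / 86400

old_days = days_to_clear(backlog_files, old_rate_hz)  # ~8.3 days
new_days = days_to_clear(backlog_files, new_rate_hz)  # ~5.3 days
```

The clearance time shrinks in direct proportion to the rate gain, which is consistent with the expectation of finishing within the next week.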

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=148585 
    
    1)  7/26: Job failures and DDM transfer errors at two sites (IllinoisHEP & Duke).  Issue was an expired cert on the tier-3 LFC host, now updated.  
    ggus tickets 72962 & 72974 closed, eLog 27688/707.
    2)  7/26: NERSC_SCRATCHDISK file transfer errors ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  
    ggus 72961 in-progress, eLog 27681, https://savannah.cern.ch/bugs/index.php?84879 (Savannah DDM ticket).
    
    Follow-ups from earlier reports:
    
    (i)  7/12: UTD-HEP - site admin requested that the site be set off-line for a maintenance outage. https://savannah.cern.ch/support/?122180, eLog 27209.
    Update 7/16: additionally site blacklisted in DDM due to file transfer errors.  ggus 72698 opened, eLog 27306/10.
    Update 7/19: downtime was declared, so now possible to close ggus 72698 & Savannah 122180.  eLog 27706.
    Update 7/26: A ToA update is needed, so the site was again blacklisted in DDM.  http://savannah.cern.ch/support/?122471 (Savannah site 
    exclusion ticket).
    (ii)  7/15: SWT2_CPB - file transfer errors from UPENN_LOCALGROUPDISK to SWT2_CPB_USERDISK (three files - checksum errors).  These are 
    the same files associated with the issue of Panda re-running the same jobs twice, thus resulting in inconsistent checksum values in the LFC vs. on 
    disk (see item #12 from 7/6 shift summary).  The files were declared bad to DQ2, but there is an issue with long vs. short SURLs. 
    https://savannah.cern.ch/bugs/?84428, eLog 27295.
    (iii)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and the 
    CosmicCalo stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses release 17.0.2.3 
    and DBrelease 16.2.1.1.  Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    (iv)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on 
    lxplus, so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to 
    come up with a solution. 
     
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=150159
    
    1)  8/3: Description of the db downtime at CERN, from Rod Walker:
    Due to an ADCR  Db intervention tomorrow August 3rd, from 16:00 till 17:00, there will be an interruption in ADC services including:
    DDM - nothing will work
    Panda - monitor, job submission will not work. Jobs finishing during downtime will be lost.
    Datri - transfer requests will not work
    You should expect disruption for an hour either side, where services are shutdown and brought back, so 15-18:00.
    2)  8/4: All clouds set off-line for approximately four hours due to central database / DDM slowness.  Backlog eventually reduced, and clouds were 
    re-enabled.  See: https://ggus.eu/ws/ticket_info.php?ticket=73192, eLog 28008.
    3)  8/5: ANALY_AGLT2 inadvertently set off-line due to a bug in a private downtime list used for testing a new automated queue management tool.  
    Thanks to Bob for reporting - queue set back on-line.
    4)  8/6-8/8: Transfer of some datasets to AGLT2_CALIBDISK was reported to be progressing slowly.  Not a site issue, but rather some modifications 
    to the FTS settings are needed to better optimize the data flows.  Experts working on this issue.  See eLog 28085 / 28167.
    5)  8/8: BNL - job failures with errors such as "TRF_UNKNOWN | 'poolToObject: caught error: Could not connect to the file 
    ( POOL : "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )' | "createObj PoolToDataObject() failed."  From Hiro: There 
    seems to have been a problem accessing this file for about 1.5 hours in the morning, apparently related to an overloaded storage pool. There is 
    no longer such an issue.  ggus 73311 closed, eLog 28222.
    6)  8/9: AGLT2 network problem.  Shawn requested that the site be set off-line.  https://savannah.cern.ch/support/index.php?122744 
    (site exclusion ticket), eLog 28227.
    7)  8/10: New pilot release from Paul (v48b).  Among others addresses an issue affecting BNL (incorrect handling of an exit code).  
    See: http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_SULU_48b.html.  eLog 28255 (BNL issue).
    
    Follow-ups from earlier reports:
    
    (i)  7/12: UTD-HEP - site admin requested that the site be set off-line for a maintenance outage. https://savannah.cern.ch/support/?122180, 
    eLog 27209.
    Update 7/16: additionally site blacklisted in DDM due to file transfer errors.  ggus 72698 opened, eLog 27306/10.
    Update 7/19: downtime was declared, so now possible to close ggus 72698 & Savannah 122180.  eLog 27706.
    Update 7/26: A ToA update is needed, so the site was again blacklisted in DDM.  http://savannah.cern.ch/support/?122471 (Savannah site 
    exclusion ticket).
    Updates:
    7/28: Requested changes to ATLAS ToA & panda schedconfigdb now done.
    8/1: Site admin reported maintenance period was over, and thus ready for testing.  Jobs submitted, but as yet not running due to missing 
    release(s).  Alessandro and Xin notified.  Also, DDM failures were observed after the site was unblacklisted, most likely due to an incorrect 
    SRM port value in the ToA.  Also, site admin is investigating a hardware problem with Dell.  eLog 27913/32,  
    https://ggus.eu/ws/ticket_info.php?ticket=73116 in-progress.
    Update 8/9: Still see a problem with pilots at the site.  May be related to WN client s/w.  Savannah 122471 updated.
    (ii)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and 
    the CosmicCalo stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses 
    release 17.0.2.3 and DBrelease 16.2.1.1.  Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    (iii)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully 
    on lxplus, so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  
    ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan 
    developers to come up with a solution.
    (iv)  7/31: SMU_LOCALGROUPDISK - DDM transfer failures with the error "[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic 
    Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]."  From Justin: Certificate has expired. New cert request 
    was put in a few days ago.  https://ggus.eu/ws/ticket_info.php?ticket=73070 in-progress, eLog 27876, site blacklisted: 
    https://savannah.cern.ch/support/index.php?122540
    
    • Database intervention last Thursday - and lots of DDM backlogs in all clouds.
    • AGLT2 analysis queue set offline by mistake - glitch from development of auto-scheduling of test jobs
    • Pilot update from Paul
    • Tier 3 at UTD - Marco helped the admin get ToA updated.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
  • this week:
    • LHCOPN has nearly completed deployment of perfSONAR and is using Tom's monitoring system
    • Italian cloud expanding, now eager to extend to cross-cloud; Canadian cloud coming online.
    • Demonstrators for DYNES - phase A deployment complete; beginning phase B. Demo could be available shortly - to demonstrate circuits between sites.

Federated Xrootd deployment in the US

last week(s):

this week:

CVMFS

See TestingCVMFS

last week:

  • Dave - updated docs on the CERN twiki. There is a nagging problem related to job failures; thought it might be a squid issue, but jobs are failing with python mismatch problems. Investigated the python logs; it only happens on certain jobs, both at MWT2 and IllinoisHEP.
  • Patrick - have asked to get a new resource approved for OSG for testing; awaiting WLCG BDII.
  • John: which sites are using the new namespace?
this week:
  • Dave: On Monday stratum 1 servers switched over to the new, final URL; working just fine at Illinois.
  • New rpm to become available by end of week.
  • Generally - we should not convert a large site for scalability.
  • MWT2, AGLT2 would be ready to do a large scale test.
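Since the stratum-1 servers were just switched over to the new, final URL, one quick way to sanity-check such a switch-over is to fetch a repository's ".cvmfspublished" manifest, which every CVMFS repository serves at the root of its URL. The host and repository names below are placeholders, not the real ones:

```python
import urllib.request

# Hypothetical check that a stratum-1 is serving a CVMFS repository after
# a URL change. Every CVMFS repository exposes a small manifest file named
# ".cvmfspublished"; a successful fetch confirms the server answers for
# that repo. Host and repo names are placeholders.
STRATUM1 = "http://cvmfs-stratum1.example.org"
REPO = "atlas.example.org"

def manifest_url(stratum1, repo):
    # the manifest lives at <stratum1>/cvmfs/<repo>/.cvmfspublished
    return "%s/cvmfs/%s/.cvmfspublished" % (stratum1, repo)

def repo_is_served(stratum1, repo, timeout=10):
    try:
        with urllib.request.urlopen(manifest_url(stratum1, repo),
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A check like this only confirms the server is reachable and serving the repo; it does not validate catalog contents, which the clients' own verification handles.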

Site news and issues (all sites)

  • T1:
    • last week:
      • As mentioned earlier, empty directories issue. Otherwise very smooth.
      • Uptick in analysis jobs, both at the Tier-1 and overall.
      • Chimera upgrade is taking shape.
    • this week:

  • AGLT2:
    • last week(s):
      • Major changes last week. New AS number, own routing entity for AGLT2 now. Added virtual routing config at UM site. Updated firmware on switches.
      • Current golden release: dCache 1.9.12-6; there are some issues
      • Updated condor.
    • this week:
      • Major problem - lost local networking at UM; a switch had flow-control enabled while others didn't. Fixed, now recovering services.

  • NET2:
    • last week(s):
      • Smooth running
    • this week:
      • Storage arrived for new rack with 3 TB drives
      • Worker node purchase in the plans
      • Next week ATLAS physics workshop in Boston
      • Still busy with IO program
      • Wide area networking to other sites - will discuss in next throughput meeting.

  • MWT2:
    • last week:
    • this week:
      • Retiring IU endpoints; both physical sites will be represented by one endpoint
      • Working on storage procurement

  • SWT2 (UTA):
    • last week:
    • this week:
      • Purchase cycle

  • SWT2 (OU):
    • last week:
    • this week:

  • WT2:
    • last week(s):
    • this week:
      • Upgraded the CE and added a new CE; encountered problems with GIP and BDII - got help from Burt H.

Carryover issues (any updates?)

Python + LFC bindings, clients

last week(s):
  • New package from VDT delivered, but still has some missing dependencies (testing provided by Marco)
this week:
  • New OSG release is coming with updated LFC, python client interfaces, etc., supporting the new worker-node client and wlcg-client

AOB

last week
  • None.
this week


-- RobertGardner - 09 Aug 2011
