
MinutesAug24

Introduction

Minutes of the Facilities Integration Program meeting, Aug 24, 2011
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (*6 to mute); announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll: (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

Compute node discussion

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Dave: On Monday the Stratum 1 servers switched over to the new, final URL; working just fine at Illinois (see the availability-check sketch below).
  • A new rpm should become available by the end of the week.
  • Generally, we should not convert a large site until scalability has been tested.
  • MWT2 and AGLT2 would be ready to do a large-scale test.
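A minimal sketch of how a site could verify that a Stratum 1 is serving a repository at the new URL; the server and repository names are hypothetical placeholders (every CVMFS repository publishes a .cvmfspublished manifest at its root):

    # Probe a CVMFS Stratum 1 for a repository manifest (.cvmfspublished).
    # Server and repository names are hypothetical placeholders.
    import urllib2

    STRATUM1 = "http://cvmfs-s1.example.org:8000"  # hypothetical Stratum 1
    REPO = "atlas.example.org"                     # hypothetical repository

    def stratum1_ok(base, repo, timeout=10):
        """Return True if the Stratum 1 serves the repository manifest."""
        url = "%s/cvmfs/%s/.cvmfspublished" % (base, repo)
        try:
            manifest = urllib2.urlopen(url, timeout=timeout).read()
        except urllib2.URLError:
            return False
        return len(manifest) > 0

    print "Stratum 1 OK:", stratum1_ok(STRATUM1, REPO)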
this week:

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • On Friday Borut aborted a large number of mc11 s128-n tags; he said he would resubmit.
    • On Monday production started to drain; s127-tag jobs are available. mc10 is finished, but some groups are still requesting it.
    • Now completely out of jobs; no email from Borut yet.
    • User analysis has been mostly constant.
    • Reprocessing campaign (Jonas)? Not yet started; scheduled to begin today or tomorrow.
  • this week:

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • Overall things are okay; maintenance activities in various places.
    • Good news: progress in understanding deletions, with a factor-of-two gain in rate, mainly at BNL. Rates are now higher than 7 Hz, sometimes 10-12 Hz, and the backlog is shrinking; expect to finish in the next week. This had been a struggle for several months. (For a sense of the timescales these rates imply, see the sketch after this list.)
    • MCDISK cleanup proceeding. BNLTAPE: finished; BNLDISK: nearly finished. Hope to complete the legacy space token.
    • LFC errors are still with us; Hiro will talk with Shawn and Saul (user directories need to be fixed: ACL problems).
    • Space is needed for BNLGROUPDISK.
    • Next USERDISK clean-up in two or three weeks; will need to send email by the end of the week.
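    A back-of-the-envelope sketch of what the quoted deletion rates imply for clearing a backlog. The backlog size below is a hypothetical illustrative number, not a reported figure:

        # Rough drain-time estimate for a deletion backlog at a steady rate.
        # Rates are those quoted above; the backlog size is hypothetical.
        SECONDS_PER_DAY = 86400.0

        def days_to_drain(backlog_files, rate_hz):
            """Days needed to delete backlog_files at rate_hz deletions/second."""
            return backlog_files / (rate_hz * SECONDS_PER_DAY)

        backlog = 5e6  # hypothetical backlog of 5 million files
        for rate in (5.0, 7.0, 10.0, 12.0):
            print "at %4.1f Hz: %5.1f days" % (rate, days_to_drain(backlog, rate))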
  • this week:

Shifters report (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (this week provided by Torsten Harenberg):
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=150160
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_16_2011.pdf
    
    1)  8/12: BNL - job failures due to stage-out errors ("lsm-put failed ([201] Copy command failed! Error ( POLLIN POLLERR POLLHUP) (with
    data) on control line [7] Failed to create a control line").  Hiro reported that a dCache misconfiguration caused several storage pools to go offline.  
    T0 exports also affected.  Resolved.  http://savannah.cern.ch/bugs/?85478, ggus 73446 closed, eLog 28347/55.
    2)  8/13: SWT2_CPB - DDM transfer errors ("[TRANSFER_TIMEOUT] globus_ftp_client_size: Connection timed out]").  Issue resolved - from Patrick: 
    One of our gridftp door servers hung up. The machine has been restarted.  ggus 73450/RT 20726 closed, eLog 28473.
    3)  8/13: ggus 73467 was opened for job failures at BNL apparently due to stage-out errors.  Instead this turned out to be a task definition problem, 
    tracked in https://savannah.cern.ch/bugs/index.php?85597.  ggus ticket closed, eLog 28460.
    4)  8/14:  Job failures at BNL with the error "poolToObject: caught error: FID "7C7CDC6C-B85A-E011-9642-003048F0E782" is not existing in the 
    catalog."  Again not a site issue, but rather the jobs were defined with an out-of-date ATLAS release.  ggus 73475/RT 20729 closed.
    5)  8/15: Shifter reported a problem with the Frontier service at BNL (via this monitoring: http://sls.cern.ch/sls/service.php?id=ATLAS-BNL-Frontier).  
    John DeStefano at BNL reported that a spike in network traffic coincided with the monitor alert, and during this period the Frontier servers handled 
    roughly twice the average number of requests.  More details in eLog 28463.
    6)  8/16: SWT2_CPB - data transfer errors to SWT2_CPB_DATADISK and SWT2_CPB_PHYS-TOP (" has trouble with canonical path. cannot access it").  
    From Patrick: The XrootdFS layer stopped on the SRM host. Transfers are succeeding after a restart.  ggus 73563/RT 20738 closed, eLog 28550.
    
    Follow-ups from earlier reports:
    
    (i)  7/12: UTD-HEP - site admin requested that the site be set off-line for a maintenance outage. https://savannah.cern.ch/support/?122180, eLog 27209.
    Update 7/16: additionally site blacklisted in DDM due to file transfer errors.  ggus 72698 opened, eLog 27306/10.
    Update 7/19: downtime was declared, so now possible to close ggus 72698 & Savannah 122180.  eLog 27706.
    Update 7/26: A ToA update is needed, so the site was again blacklisted in DDM.  http://savannah.cern.ch/support/?122471 (Savannah site exclusion ticket).
    Updates:
    7/28: Requested changes to ATLAS ToA & panda schedconfigdb now done.
    8/1: Site admin reported maintenance period was over, and thus ready for testing.  Jobs submitted, but as yet not running due to missing release(s).  
    Alessandro and Xin notified.  Also, DDM failures were observed after the site was unblacklisted, most likely due to an incorrect SRM port value in the ToA.  
    Also, site admin is investigating a hardware problem with Dell.  eLog 27913/32,  https://ggus.eu/ws/ticket_info.php?ticket=73116 in-progress.
    Update 8/9: Still see a problem with pilots at the site.  May be related to WN client s/w.  Savannah 122471 updated.
    Update 8/12: All issues appear to be resolved.  Test jobs successful - queue set 'on-line'.  Savannah 122471 closed.
    (ii)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and the CosmicCalo 
    stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses release 17.0.2.3 and DBrelease 16.2.1.1.  
    Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    (iii)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on lxplus, so 
    presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to come up 
    with a solution.
    (iv)  7/31: SMU_LOCALGROUPDISK - DDM transfer failures with the error "[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=
    Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]."  From Justin: Certificate has expired. New cert request was put in a few days ago.  
    https://ggus.eu/ws/ticket_info.php?ticket=73070 in-progress, eLog 27876, site blacklisted: https://savannah.cern.ch/support/index.php?122540
    (v)  8/9: AGLT2 network problem.  Shawn requested that the site be set off-line.  https://savannah.cern.ch/support/index.php?122744 (site exclusion ticket), eLog 28227.
    Update 8/11: Issue resolved, test jobs successful, queues set back on-line.  eLog 28273.
     

  • this meeting: Operations summary:
     
     Yuri's summary from the weekly ADCoS meeting (this week provided by Hiroshi Sakamoto):
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=1&confId=150162
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_23_2011.txt
    
    1)  8/18: WISC_DATADISK - DDM checksum errors on some files used for functional tests.  Problem fixed by Wen - ggus 73598 closed, eLog 28576.  
    ggus 73595 was also opened for WISC around this time due to DDM gridftp errors (and the space tokens were blacklisted).  Problem resolved, ggus ticket 
    closed.  https://savannah.cern.ch/support/index.php?122928 (Savannah site exclusion), eLog 28635.
    2)  8/19: Large numbers of failed jobs across many sites with the pilot error "No reply to sent job."  Issue traced to an out-of-date lcg-vomscerts package on 
    panda & bamboo servers,  now updated.  See: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/28613.
    3)  8/19: OU_OCHEP_SWT2 - large number of jobs were stuck in the 'transferring' state for > one day.  Issue was somehow related to a networking problem, 
    resolved by rebooting the cluster headnode.  Backlog of transferring jobs cleared up, analysis queue set back on-line after successful HammerCloud tests.
    4)  8/19: Shifter reported the BNL Frontier service at 50% availability (via SLS monitoring).  From John at BNL: A minor intervention was needed on one of BNL's 
    two Frontier servers, causing a service degradation, but the service remained available throughout and has been restored to normal.  
    https://savannah.cern.ch/bugs/index.php?85776, eLog 28592.
    5)  8/19: AGLT2 - too many issues after converting to CVMFS access to ATLAS s/w releases, so the site had to revert to standard NFS-based access.
    Post-rollback test jobs were successful (both production and analysis via HammerCloud), so the site is back on-line.  eLog 28648.
    6)  8/22: AGLT2 - job failures due to checksum error on input file FSRelease-0.7.1.2.tar.gz.  ggus 73694 in-progress, eLog 28679.
    7)  8/22: NERSC_LOCALGROUPDISK - DDM errors such as "destination file failed on the SRM with error [SRM_ABORTED]."  ggus 73717 in-progress, 
    eLog 28700.
    8)  8/22 early a.m.: AGLT2 - network problem in the UM server room.  Issue resolved as of ~2:00 p.m. EST.  Production and analysis queues back on-line.  
    eLog 28715.
    9)  8/22: OUHEP_OSG - DDM transfer errors such as "Error:/bin/mkdir: cannot create directory `/raid04/data/atlashotdisk': Permission denied."  Issue understood 
    and resolved - from Horst: the atlashotdisk directory hadn't been created yet.  I fixed it, and transfers are flowing now.  ggus 73729/RT 20764 closed, eLog 28722.
    10)  8/22-23: OUHEP_OSG_HOTDISK - transfer failures due to checksum errors.  All but one of the transfers eventually succeeded, and Horst/Hiro are working 
    on this last one.  ggus 73741/RT 20765 closed, eLog 28730.  (ggus 73742/RT 20766 also opened around this time - closed as duplicates.)  See the adler32 sketch after the follow-ups below.
    
    Follow-ups from earlier reports:
    
    (i)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and the CosmicCalo 
    stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses release 17.0.2.3 and DBrelease 16.2.1.1.  
    Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    (ii)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on lxplus, so 
    presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to come up 
    with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses this issue.
    (iii)  7/31: SMU_LOCALGROUPDISK - DDM transfer failures with the error "[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=
    Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]."  From Justin: Certificate has expired. New cert request was put in a few days ago.  
    https://ggus.eu/ws/ticket_info.php?ticket=73070 in-progress, eLog 27876, site blacklisted: https://savannah.cern.ch/support/index.php?122540
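
Checksum mismatches like those in items 6 and 10 above are typically diagnosed by recomputing the adler32 of the local copy (the checksum convention DDM uses) and comparing it against the catalog value. A minimal sketch:

    # Compute the adler32 checksum of a file for comparison with the catalog value.
    import sys
    import zlib

    def adler32(path, blocksize=1024 * 1024):
        """Return the adler32 of path as an 8-digit lowercase hex string."""
        checksum = 1  # adler32 seed value
        f = open(path, "rb")
        try:
            block = f.read(blocksize)
            while block:
                checksum = zlib.adler32(block, checksum)
                block = f.read(blocksize)
        finally:
            f.close()
        # Mask to 32 bits: zlib.adler32 can return negative values on some builds.
        return "%08x" % (checksum & 0xFFFFFFFF)

    print adler32(sys.argv[1])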
    
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • LHCOPN has nearly completed its perfSONAR deployment and is using Tom's monitoring system.
    • The Italian cloud deployment is expanding, now eager to extend to cross-cloud tests; the Canadian cloud is coming online.
    • DYNES demonstrators: phase A deployment is complete; phase B is beginning. A demo of circuits between sites could be available shortly.
  • this week:

Federated Xrootd deployment in the US

last week(s):

this week:

Site news and issues (all sites)

  • T1:
    • last week(s):
      • As mentioned earlier, the empty-directories issue remains. Otherwise very smooth.
      • Uptick in analysis jobs, at the Tier 1 and generally overall.
      • Chimera upgrade is taking shape.
    • this week:

  • AGLT2:
    • last week(s):
      • Major problem: lost local networking at UM; one switch had flow control enabled while others didn't. Fixed; services are recovering.
    • this week:

  • NET2:
    • last week(s):
      • ATLAS physics workshop at Boston University next week: http://physics.bu.edu/sites/atlas-workshop-2011/agenda/
      • Storage for a new rack arrived, with 3 TB drives.
      • A worker node purchase is in the plans.
      • Still busy with the I/O program.
      • Wide-area networking to other sites: will discuss in the next throughput meeting.
      • Smooth running.
    • this week:

  • MWT2:
    • last week:
      • Retiring the IU endpoints; both physical sites will be represented by one set of endpoints.
      • Working on storage procurement
    • this week:

  • SWT2 (UTA):
    • last week:
      • Purchase cycle
    • this week:

  • SWT2 (OU):
    • last week:
    • this week:

  • WT2:
    • last week(s):
      • Upgraded a CE and added a new one; encountered problems with GIP and BDII; got help from Burt H.
    • this week:

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • Dan Bradley joined the meeting
  • SupportingGLOW
  • Science being supported: economists, animal science (statistics), physics; the mix varies month to month.
  • Space: no space requirements on the site beyond the worker node.
  • Few requirements: jobs just must run under Condor.
  • Some jobs use HTTP via Squid to get files; others use Condor transfer mechanisms with TCP (see the proxy-fetch sketch after this list).
  • Jobs use the site Squid proxy, as advertised in the environment (may need to check this).
  • Typical run time? They aim for 2 hours, though some run longer. The jobs run under glideins, so sites will see only the pilot. Glidein lifespan? A day or so.
  • Preemption is okay and expected.
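A sketch of the HTTP-over-Squid fetch pattern mentioned above, assuming the site advertises its proxy through the conventional http_proxy environment variable; the download URL is a hypothetical example:

    # Fetch an input file through the site Squid proxy, if one is advertised.
    # urllib2 honors the http_proxy environment variable by default.
    import os
    import urllib2

    URL = "http://data.example.org/inputs/dataset.tar.gz"  # hypothetical input

    proxy = os.environ.get("http_proxy")
    if proxy:
        print "using site proxy:", proxy  # e.g. set in the glidein environment

    data = urllib2.urlopen(URL, timeout=60).read()
    out = open("dataset.tar.gz", "wb")
    out.write(data)
    out.close()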
this week

Python + LFC bindings, clients

last week(s):
  • A new OSG release is coming with updated LFC and Python client interfaces, etc., supporting the new worker-node client and wlcg-client.
this week:
  • OSG 1.2.21 has been released and includes these updates; a sketch of the Python LFC bindings follows below.
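A minimal sketch of the Python LFC bindings in use, assuming LFC_HOST is set in the environment; the host and LFN below are hypothetical placeholders, and lfc_getreplica returns a status code plus a list of replica entries:

    # List the replicas of a catalog entry via the LFC Python bindings.
    import os
    import sys
    import lfc  # Python bindings shipped with the LFC client

    os.environ.setdefault("LFC_HOST", "lfc.example.org")  # hypothetical host

    lfn = "/grid/atlas/user/example/file.root"  # hypothetical LFN
    rc, replicas = lfc.lfc_getreplica(lfn, "", "")
    if rc != 0:
        sys.exit("lfc_getreplica failed for %s" % lfn)
    for rep in replicas:
        print rep.host, rep.sfn  # storage host and full replica SFN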

AOB

last week
  • None.
this week


-- RobertGardner - 18 Aug 2011
