
MinutesNov9

Introduction

Minutes of the Facilities Integration Program meeting, November 9, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (*6 to mute) - announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Friday (1pm CDT, bi-weekly - convened by Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New integration phase - work in progress
      • Networking issues:
        • Slow transfers from our Tier 2s and Tier 1s (across the Atlantic and to Canada). We need to get into the habit of asking network experts for help rather than relying only on our own tools. ESnet can help.
        • We had problems getting CERN's attention regarding calibration data transfers to AGLT2.
        • perfSONAR was key to resolving this issue.
        • The Tier 1 service coordination meeting is a forum for raising such issues.
      • LHCONE
        • Nowhere near production quality.
        • A testbed at the moment.
        • See last week's talk.
        • Agreement with ADC operations to step back: ask a small number of sites to join the infrastructure in a well-managed way, and present this list at an upcoming ADC meeting.
        • An approach in which sites attach by themselves will not be accepted.
      • News on participation in technical working groups
        • WLCG storage management technical working group -- 2015
        • Call for volunteers for participation.
      • Cloud
        • Cui Lin's discussion at the SMU meeting continued last week at Software Week (John's presentation) - it was well received and is aligned with ATLAS plans. More concrete activities to come in the future.
      • Federated Xrootd
        • Wei gave an excellent presentation on where we are.
        • Dan van der Ster tried it out, found shortcomings, and made good progress.
        • The potential is still great - for Tier 3s as well as Tier 2s.
        • We should continue to pursue this.
    • this week

Focus topic: Illinois Campus Cluster Pilot Project (Dave Lesny)

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Spoke with Borut today about making sure all clouds are full; there will be enough G4 simulation to keep us going for many months. We should never see drains - if we see them, raise an alarm, as it's a technical issue.
    • Regional production may be coming to the RAC soon.
    • All is well otherwise.
    • Doug: what about output of group-level activities into the US cloud? Not sure - CREM meeting tomorrow. Kaushik urges Doug to make subscriptions. PD2P only looks at DATADISK - Kaushik suggests this as an alternative.
  • this week:

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (presented this week by Pavol Strizenec):
    
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_1_2011.html
    
    
    1)  10/26: AGLT2_CALIBDISK transfer errors ("[TRANSFER_TIMEOUT] gridftp_copy_wait: Connection timed out").  From Bob: We are using a 
    "site aware" dCache configuration, and we _think_ that the number of cached files, not showing in tokens, increased to a point where all the 
    pools were full or nearly full. We have since freed up 1TB minimum on all pools, and the srmwatch tool here is now showing mostly green.  
    ggus 75739 / RT 21022 closed, eLog 30812.  (ggus 75765 also opened during this period for job stage-out errors - since closed - eLog 30813.)
    2)  10/27: New pilot release from Paul (SULU 49a).  See details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_SULU_49a.html
    3)  10/28: SWT2_CPB - DDM errors ("...has trouble with canonical path. cannot access it.").  Problem fixed by restarting the xrootdfs service on 
    the SRM host.  ggus 75811 / RT 21032 closed, eLog 31039.
    4)  10/31: WISC_* spacetokens - file transfer errors ("[GRIDFTP_ERROR] globus_ftp_client: the server responded with an error500").  Issue 
    fixed by the site - srm-copy now works.  ggus 75855 closed, eLog 30999.
    5)  10/31-11/1: Very large backlog of 'holding' jobs in the Panda system.  Clouds were set off-line for several hours to allow the backlog to clear out.  
    Extensive discussion about the cause / possible solutions.  See for example: eLog 31029, e-mail threads in the ADCoS experts/shifters lists.
    6)  11/1: SWT2_CPB - destination file transfer errors from various sites - for example "Timelimit of 604800 seconds exceeded in 
    TAIWAN-LCG2_DATADISK->SWT2_CPB_DATADISK queue."  Only indication of a problem was a large number of waiting jobs in one of the 
    FTS channels.  The concurrency limit was increased from 10 => 20.  ggus 75885 / RT 21143 in-progress, eLog 31041.
    7)  11/1: UTD-HEP set off-line due to a power outage at the site.  https://savannah.cern.ch/support/index.php?124408 (Savannah site exclusion), 
    eLog 31044.
    8)  11/1: NET2 - Saul reported the site experienced a power failure affecting the LFC.  Panda queues were kept off-line overnight in preparation 
    for a GPFS upgrade the next morning.  Upgrade completed - however, ggus 75915 was opened during this time - since closed.  Savannah site 
    exclusion: https://savannah.cern.ch/support/?124427.
    
    Follow-ups from earlier reports:
    
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  This is the 
    same type of issue that has been observed at several US tier-2Ds when attempting to copy job output files to other clouds.  Working to understand 
    this problem and decide how best to handle these situations.  Discussion in eLog 30170.
    (iii)  10/14: AGLT2 file transfer errors ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds").  Probably not an 
    issue on the AGLT2 side, but rather slowness on the remote end (in this case TRIUMF).  There was a parallel issue causing job failures (NFS 
    server hung up), which has since been resolved.  ggus 75302 closed, eLog 30415.
    Update 10/15: ggus 75348 was opened for the site, again initially related to job failures due to slow output file transfer timeouts.  What was 
    probably a real site issue (jobs failing with "ERROR 201 Copy command exited with status 256 | Log put error: lsm-put failed (201)") began 
    appearing on 10/17.  Most likely related to a dCache service restart.  Ticket in-progress, eLog 30443, https://savannah.cern.ch/support/?124073 
    (Savannah site exclusion).
    (iv)  10/18: NET2 - file transfer errors ("file has trouble with canonical path. cannot access it").  Saul reported that the issue was possibly a transient 
    GPFS problem causing several nodes to briefly lose their mounts.  ggus 75430 in-progress, eLog 30502.
    Update 10/19-10/20: File transfer errors at NET2.  Issue was due to checksum problem after some new storage was brought on-line.  See details in 
    https://ggus.eu/ws/ticket_info.php?ticket=75430 (in-progress), https://savannah.cern.ch/bugs/?87984, eLog 30586.
    Update 11/2: issues resolved (including GPFS clients upgrade) - ggus 75430 / RT 20993 closed, eLog 30819.
    (v)  10/22: OUHEP_OSG - file transfer errors ("failed to contact on remote SRM [httpg://ouhep2.nhn.ou.edu:8443/srm/v2/server]").  Horst reported 
    there is a hardware problem with the SRM host which will require some time to repair.  Site set off-line in DDM (https://savannah.cern.ch/support/?124250), 
    ggus 75608 / RT 21007 in-progress, eLog 30744.
    Update 11/1: hardware problem with the SRM host being worked on - tickets updated.
    (vi)  10/25: MWT2_UC - job failures with the error "Transformation not installed in CE" (16.6.7.13).  Alessandro reported that the release installation 
    system indicates the software has been installed.  Under investigation - ggus 75692, eLog 30762.
    Update 10/27: release is at the site (missing link?) - jobs using the release are finishing successfully.  ggus 75692 closed.
    
    
    • UTD HEP finally got back online (thanks to Marco helping with local-site-mover)
    • Notes that the SSB daily summaries have been useful. Not sure whether "no activity" indicates a problem.
    • Hiro: BNL has had problems with getting excluded (blacklisted) during the past week.
      • Alessandro's script runs twice a day; it is 6000 lines of bash and python.
      • It sometimes temporarily removes a local setup script, resulting in a mix-up of the 32-bit and 64-bit libraries being available.
      • Has the job profile changed recently, requiring more conditions data to be read?

  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=162147
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_7_2011.html
    
    
    1)  11/2: From Bob at AGLT2: We took two power hits this morning at AGLT2/UM campus, one of which lasted long enough that it took down several 
    hundred job slots.  We are back up now, and hoping for no further outages (one campus building will be off for 8 hours).  HC Jobs are once again 
    succeeding here.
    2)  11/2: various U.S. cloud sites - thousands of job failures with LFC errors (mostly "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, 
    Could not secure the connection)").  Some sites needed to update the lcg-vomscerts package - link provided by John Hover (thanks).  Errors stopped 
    as sites performed the update.  ggus 75931 was opened during this period at AGLT2 - the errors appeared to be the same LFC error shown above.  AGLT2 
    was also set off-line during this time.  Once the issue seemed to be resolved, test jobs were submitted and completed successfully, and the site was 
    set back on-line.  ggus 75931 closed, eLog 31101.
    3)  11/3: LFC problem at OU_OCHEP - the issue was an incorrect location for the ".lsc" files - now fixed by Horst.  (A sketch of the expected 
    vomsdir layout is given after this operations summary.)
    4)  11/4: NET2 - file transfer failures ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  Update from Saul: We had a chain 
    of events: file system problem with small files => adler checksum spreader getting stuck => high loads on gatekeeper => SRM unresponsive.  It should 
    be all fixed now.  ggus 76015 closed, eLog 31120.
    5)  11/7: AGLT2 - file transfers were failing with the error "Unable to connect to msufs01.aglt2.org:2811globus_xio: Operation was canceled globus_xio: 
    Operation timed out."  Due to an issue with the network link between MSU & Chicago.  Being worked on - details in
     https://ggus.eu/ws/ticket_info.php?ticket=76072, eLog 31195.
    6)  11/7-11/9: BNL maintenance outage.  More details: eLog 31163, http://savannah.cern.ch/support/?124485 (Savannah site exclusion),
    http://www-hep.uta.edu/~sosebee/ADCoS/BNL-Intervention-Nov7-Nov9.html
    7)  11/8: SWT2_CPB: DDM errors (SRM down) - the site had scheduled a maintenance outage in OIM, which correctly propagated forward and caused 
    a DDM blacklisting.  It became necessary to extend the downtime for several hours, but the extension wasn't processed in time to prevent the site from 
    being whitelisted, hence the transfer errors.  Maintenance completed, SRM now back up.  ggus 76162 / RT 21177 closed, eLog 31229.
    
    Follow-ups from earlier reports:
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  This is the same 
    type of issue that has been observed at several US tier-2Ds when attempting to copy job output files to other clouds.  Working to understand this 
    problem and decide how best to handle these situations.  Discussion in eLog 30170.
    (iii)  10/14: AGLT2 file transfer errors ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds").  Probably not an issue 
    on the AGLT2 side, but rather slowness on the remote end (in this case TRIUMF).  There was a parallel issue causing job failures (NFS server hung up), 
    which has since been resolved.  ggus 75302 closed, eLog 30415.
    Update 10/15: ggus 75348 was opened for the site, again initially related to job failures due to slow output file transfer timeouts.  What was probably a 
    real site issue (jobs failing with "ERROR 201 Copy command exited with status 256 | Log put error: lsm-put failed (201)") began appearing on 10/17.  
    Most likely related to a dCache service restart.  Ticket in-progress, eLog 30443, https://savannah.cern.ch/support/?124073 (Savannah site exclusion).
    Update 11/3: https://savannah.cern.ch/support/?124073 closed, but ggus 75348 is still 'in-progress'.
    (iv)  10/22: OUHEP_OSG - file transfer errors ("failed to contact on remote SRM [httpg://ouhep2.nhn.ou.edu:8443/srm/v2/server]").  Horst reported there 
    is a hardware problem with the SRM host which will require some time to repair.  Site set off-line in DDM (https://savannah.cern.ch/support/?124250), 
    ggus 75608 / RT 21007 in-progress, eLog 30744.
    Update 11/1: hardware problem with the SRM host being worked on - tickets updated.
    (v)  11/1: SWT2_CPB - destination file transfer errors from various sites - for example "Timelimit of 604800 seconds exceeded in 
    TAIWAN-LCG2_DATADISK->SWT2_CPB_DATADISK queue."  Only indication of a problem was a large number of waiting jobs in one of the FTS channels.  
    The concurrency limit was increased from 10 => 20.  ggus 75885 / RT 21143 in-progress, eLog 31041.
    Update 11/3: Since the ggus & RT tickets were still open they were updated with info regarding different (unrelated) transfer errors ("failed to contact on 
    remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").  This latter problem was due to a hardware problem with a storage server.  It was 
    replaced that evening and the issue resolved.  ggus 75885 / RT 21143 closed, eLog 31227.
    (vi)  11/1: UTD-HEP set off-line due to a power outage at the site.  https://savannah.cern.ch/support/index.php?124408 (Savannah site exclusion), 
    eLog 31044.
    Update 11/3: power restored - test jobs completed successfully - back to 'on-line'.  eLog 31108.
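
    For reference, here is a minimal sketch of the vomsdir layout that the ".lsc" fix in item 3 above refers to.  This illustrates the usual layout, not the 
    actual OU_OCHEP configuration; the DNs shown are examples - take the authoritative values from the ATLAS VO ID card.

      # One <voms-hostname>.lsc file per VOMS server used by the VO, placed under /etc/grid-security/vomsdir/<vo>/
      $ cat /etc/grid-security/vomsdir/atlas/voms.cern.ch.lsc
      /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch
      /DC=ch/DC=cern/CN=CERN Trusted Certification Authority

    Each .lsc file holds two lines: the subject DN of the VOMS server's host certificate, followed by the DN of the CA that issued it.  A misplaced or 
    missing .lsc file can lead to the same kind of "Could not secure the connection" failures described in item 2 above.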
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US

last week(s):

this week:

Tier 3 GS

last meeting:
  • Hari: UTD back online, all is well.

this meeting:

Site news and issues (all sites)

  • T1:
    • last week(s):
      • Have a PB of disk on order (Nexsan - 3 TB drives); need to have this in place for the next reprocessing round. Will evaluate Hitachi.
    • this week:

  • AGLT2:
    • last week(s):
      • Went down on Thursday for UPS testing of the backup generator. OSG 1.2.23 - CE, wn-client, and wlcg-client all upgraded, along with Xin's new ATLAS-wn.
      • 4 am incident today - transfers failing; seemed to be related to the site-aware configuration, and a sweeper had to be run to free up space. Caused srmwatch to go red. Consulting the dCache team.
      • Discussing transfer rates to TRIUMF with Ian Gable.
      • Tom: a dCache pool node failed and caused a network storm. Looking at rate limits on the switches. Seeing lots of packets - about 10M packets/sec; the crashed node had the highest input rate. The network uses a 24-port 10G switch blade.
    • this week:

  • NET2:
    • last week(s):
      • Major problem on Oct 20 with new storage and the GPFS configuration; the next 24 hours resulted in corrupted files. Cannot make the problem come back. Not sure if it's a GPFS problem; may upgrade. Following Hiro's recipe for declaring bad files.
      • Working on conversion to CVMFS.
      • New gatekeeper and some new server machines
      • Doug: will you add more to the groupdisk area? Guideline is to provide 100 TB (per group, within the groupdisk area)
    • this week:

  • MWT2:
    • last week:
      • Took delivery of first part of 720 TB storage upgrade
      • Sarah and Fred busy working on rack re-arrangement at IU
      • https://integrationcloud.campfirenow.com/room/192194
      • Illinois - Dave is updating phase 1 of the pilot; tying the Illinois site to the campus cluster has been successful - it is running Panda jobs. Will document this for ultimate incorporation into the MWT2 queues. Working on specifications for worker nodes for the second instance of the campus cluster. Otherwise things are fine at Illinois.
    • this week:

  • SWT2 (UTA):
    • last week:
      • HEPSPEC survey - will update table
      • Cleaned up PRODDISK; seeing on the order of a TB per day being cycled (written) into storage.
      • Looking at bestman/xrootd install from VDT repos; had a problem with ATLAS certificates; there is a missing use-case in the install instructions.
      • Deletion service configuration changed; saw a load issue when the rate was high. Not sure of cause, since local proddisk-cleanse does not cause this kind of spike.
    • this week:

  • SWT2 (OU):
    • last week:
      • Had a storage reboot yesterday. Load issue on the Lustre servers - 9900 issues after the reboot. The load is gone now, though.
      • Asked IBM about adding 3 TB drives - can't do this without voiding the warranty.
    • this week:

  • WT2:
    • last week(s):
      • Working on installation of the new R410s.
      • New storage online - 550 TB. Some problems with the Dell storage servers are continuing.
      • Reconfiguring disk for performance; will lose 250 TB as a result. Net gain is ~300 TB of storage (total will be up to 2.1-2.2 PB).
    • this week:

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Having problems with Engage, since they want GridFTP from the worker nodes. Possible problems with the GLOW VO certificate.
this week

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Wei: the instructions seem to ask us to use the squid at BNL. Hardware recommendation? The same as what you're already using.
  • Load has not been observed to be high at AGLT2. Squid is single-threaded, so multiple cores are not an issue. You want a good amount of memory, so as to avoid hitting local disk.
  • At AGLT2 - recommend multiple squids, with compute nodes configured not to hit a remote proxy. Doug claims the client will still fail over to the stratum 1 regardless (see the configuration sketch below).
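
For reference, here is a minimal sketch of a CVMFS client configuration along the lines discussed above. The hostnames and cache size are placeholders, not actual AGLT2 settings; the proxy-string syntax is standard CVMFS ("|" separates load-balanced proxies within a group, ";" separates fallback groups, and DIRECT permits proxy-less connections to the Stratum 1).

    # /etc/cvmfs/default.local - minimal sketch with placeholder values
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    # Two load-balanced local squids; DIRECT allows fail-over to the Stratum 1 if both are unreachable.
    # Dropping ";DIRECT" forbids proxy-less connections from the worker nodes.
    CVMFS_HTTP_PROXY="http://squid01.example.edu:3128|http://squid02.example.edu:3128;DIRECT"
    CVMFS_QUOTA_LIMIT=10000   # local cache size in MB

Whether a client with no reachable local squid can still fall back to the Stratum 1 thus depends on whether DIRECT is included in the proxy string, which bears on the fail-over behavior Doug mentions above.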
this week:

Python + LFC bindings, clients

last week(s):
  • NET2 - planning
  • AGLT2 - 3rd week of October
  • Now would be a good time to upgrade; future OSG releases will be rpm-based with new configuration methods.
this week:

AOB

last week
  • Doug: AutoPyFactory and Tier 3 Panda work will require schedconfig work.
this week


-- RobertGardner - 08 Nov 2011

Attachments


  • T2_Pilot.ppt.pdf (52.6K) - RobertGardner, 09 Nov 2011 - 07:46
  • T2_Pilot-1.pdf (110.8K) - RobertGardner, 09 Nov 2011 - 07:46
 