
MinutesFeb202013

Introduction

Minutes of the Facilities Integration Program meeting, Feb 20, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Patrick, Michael, Sarah, Dave, Jason, Rob, Saul, James (MSU), John H, Horst, Doug, Wei, Ilija, Armen and Mark
  • Apologies: Shawn
  • Guests:

Integration program update (Rob, Michael)

Transition from DOEGrids to DigiCerts

Storage deployment

last meeting(s):
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2:DONE
  • MWT2: 500 TB
  • NET2: 576 TB usable, to be part of the GPFS. Expect it available about a week after the SGE migration is complete; est. 2 weeks.
  • SWT2_UTA: Expect remainder of delivery this week or next week.
  • SWT2_OU: Storage is online, but waiting to incorporate it into Lustre.
this meeting:
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2:DONE
  • MWT2:
  • NET2: up and running, being tested
  • SWT2_UTA: still waiting for equipment
  • SWT2_OU: Storage is online, but waiting to incorporate it into Lustre.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Saul notes continued problems with jobs getting stuck in the transferring state. Kaushik notes the brokering limit on the transferring-to-running ratio (currently > 2) will be raised (see the sketch after this list). Also, why has the number of transferring jobs increased? There is also an autopyfactory job submission issue; this will need to be discussed with John Hover. Saul and John to discuss with Jose and John Hover.
    • Hiro notes transfers back to FZK might be slowing this.
  • this meeting:
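
A minimal sketch (in Python) of the transferring-to-running guard discussed in the last-meeting notes above. The function, threshold names, and defaults are illustrative assumptions, not the actual PanDA brokerage code:

# Illustrative sketch of a transferring/running brokerage guard (hypothetical
# names; not the actual PanDA brokerage code). A site is skipped for new job
# brokering when its ratio of transferring to running jobs exceeds a threshold.
def skip_site_for_brokering(n_transferring, n_running, max_ratio=2.0, min_running=10):
    """Return True if the site should be skipped by brokerage."""
    if n_running < min_running:
        # With very few running jobs the ratio is noisy; do not skip.
        return False
    return n_transferring > max_ratio * n_running

# Example: 2500 transferring vs. 1000 running exceeds the default ratio of 2,
# so the site would be skipped until the backlog drains or the limit is raised.
print(skip_site_for_brokering(2500, 1000))                 # True
print(skip_site_for_brokering(2500, 1000, max_ratio=3.0))  # False
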

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Main issue has been PRODDISK filling quickly, but there were no significant issues at our sites; generally okay.
    • Big chunks deployed at AGLT2 and BNL.
    • USERDISK cleanups submitted - problem at SLAC.
    • There will be a data management meeting on Tuesday.
  • this meeting:
    • A note was sent yesterday; in contact with sites to adjust space tokens.
    • Hiro will send the USERDISK cleanup list; the actual cleanup will be in two weeks.
    • Is DATADISK being used? Armen claims it is primary data. It is a question of popularity. We need to work with ADC to discuss policy for effective use by physicists.
    • Issue reported by Doug: the Top group has space at SWT2 and NET2. About 25% is at NET2, which has the most space but is having issues - many hundreds of queued datasets and lots of deletions on the books. Can't direct output of D3PD production there for use by US physicists, or by FAX. Some datasets have been stalled for two weeks; 178 TB affected. Michael: may need to find an interim solution.

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=235568
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-2_11_2013.html
    
    1)  2/8: AGLT2 -  transfer failures with the error "First non-zero marker not received within 600 seconds."  Issue understood and resolved - from Shawn: 
    We had two pool nodes in a bad state (due to load). They have been restarted. In addition we had 3 dcache door VMs also showing some problematic 
    error messages and they were also restarted. https://ggus.eu/ws/ticket_info.php?ticket=91286 closed - eLog 42778.
    2)  2/11: New pilot release from Paul (v.56c).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_56c.html
    3)  2/11: AGLT2 - file transfer failures with "[GRIDFTP_ERROR] globus_ftp_client: the operation was aborted]."  From Shawn (2/13): Continuing problems 
    with inter-site bandwidth congestion. We applied a minor dcache update on the doors and headnodes at AGLT2 about 1.5 hours ago. Things are looking better. 
     In addition we increased the allowed number of processes on our dCache doors to remove a limitation we were hitting. We will continue to watch this but we 
    expect the problems should be resolved.  https://ggus.eu/ws/ticket_info.php?ticket=91379 in-progress, eLog 42878.  (ggus 91477 was also opened on 2/13 
    for SRM transfer errors. eLog 42908.)
    4)  2/12: Very large number of job failures at several U.S. cloud sites with the error "TRF_UNKNOWN."  Not a site issue, but rather a problem with multiple tasks.  
    See the discussion in https://savannah.cern.ch/support/?135848 - eLog 42858.  (ggus 91441 was incorrectly assigned to OU_OCHEP_SWT2 for these errors - 
     again not a site issue.  Ticket closed - eLog 42884.)
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to 
    protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion 
    ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (iii)  1/18: ggus 90548 / RT 22875 were opened for job failures at OU_OCHEP_SWT2 with "replica not found" pilot errors.  Not a site issue - such failures were 
    seen at most U.S. sites.  Seems to be an issue with PanDA/LFC.  Is the problem understood?  Both tickets closed, eLog 42183.
    Update 1/30: https://savannah.cern.ch/bugs/index.php?100175 opened for this issue.
     (iv)  1/21: SWT2_CPB DDM deletion errors - probably not a site issue, as some of the errors are related to datasets with malformed names, others are deletion 
    attempts for very old datasets.  Working with DDM experts to resolve the issue.  https://ggus.eu/ws/ticket_info.php?ticket=90644 in-progress.
    Update 1/23: Opened https://savannah.cern.ch/support/?135310 - awaiting a response from DDM experts.
    Update 2/13: Duplicate ggus ticket 91451 was opened/closed - eLog 42890.  Still awaiting feedback from the deletions team.
    
    
     • 4000 job failures from "get replica" errors. PanDA brokering issue? Mark noted that the input files were never there at UTA. Mark will push on this.
     • Also noted deletion errors at UTA - clearly something wrong.
  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=236858
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-2_18_2013.html
    
    1)  2/15: BNL - file transfer errors ("[TRANSFER_MARKERS_TIMEOUT] No transfer markers received for more than 180 seconds").  Issue understood - several 
    dCache pools were heavily loaded, resulting in timeouts.  Write operations on these pools were disabled, and the LAN/WAN mover numbers on them were lowered.  
    Solved the problem.  https://ggus.eu/ws/ticket_info.php?ticket=91548 was closed on 2/16.  eLog 42940.
    2)  2/15: From Saul: we have just set the BU_ATLAS_Tier2o queue to brokeroff as we're nearing the completion of our migration from PBS to SGE.
     3)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been seen at the 
     site a couple of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try to implement a more 
     permanent fix (see the watchdog sketch at the end of this section).  eLog 42963.
    4)  2/19:  Sarah reported a large number of analysis jobs (~3k) were stuck in the 'defined' state at ANALY_MWT2. Yuri noticed there were some brokerage 
    messages/warnings for the site in the panda event logs, but not clear whether this was the problem.  Issue cleared up as of late Tuesday evening, only ~230 'defined' 
    jobs by that time.  eLog 43009.
    5)  2/19: NET2 - file transfer failures.  Initially the errors were SRM connection ones.  These may have coincided with admins at the site working on the central 
    deletions issue.  Later there were new errors like "No markers indicating progress received for more than 180 seconds" & " source file doesn't exist."  Issue under 
    investigation.  https://ggus.eu/ws/ticket_info.php?ticket=91641, eLog  43018.
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to 
    protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), 
    eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (iii)  1/18: ggus 90548 / RT 22875 were opened for job failures at OU_OCHEP_SWT2 with "replica not found" pilot errors.  Not a site issue - such failures were seen at 
    most U.S. sites.  Seems to be an issue with PanDA/LFC.  Is the problem understood?  Both tickets closed, eLog 42183.
    Update 1/30: https://savannah.cern.ch/bugs/index.php?100175 opened for this issue.
     (iv)  1/21: SWT2_CPB DDM deletion errors - probably not a site issue, as some of the errors are related to datasets with malformed names, others are deletion 
    attempts for very old datasets.  Working with DDM experts to resolve the issue.  https://ggus.eu/ws/ticket_info.php?ticket=90644 in-progress.
    Update 1/23: Opened https://savannah.cern.ch/support/?135310 - awaiting a response from DDM experts.
    Update 2/13: Duplicate ggus ticket 91451 was opened/closed - eLog 42890.  Still awaiting feedback from the deletions team.
    (v)  2/11: AGLT2 - file transfer failures with "[GRIDFTP_ERROR] globus_ftp_client: the operation was aborted]."  From Shawn (2/13): Continuing problems with inter-site 
    bandwidth congestion. We applied a minor dcache update on the doors and headnodes at AGLT2 about 1.5 hours ago. Things are looking better. In addition we increased 
     the allowed number of processes on our dCache doors to remove a limitation we were hitting. We will continue to watch this but we expect the problems should be resolved.  
    https://ggus.eu/ws/ticket_info.php?ticket=91379 in-progress, eLog 42878.  (ggus 91477 was also opened on 2/13 for SRM transfer errors. eLog 42908.)
    Update 2/14: No more transfer errors - closed ggus 91379.  On 2/15 ggus 91477 was also closed (the issue was a brief overload condition on the SRM host).  eLog 42924.
    
    • Replica not found error - was evidently resolved, but not sure of the fix.
    • TRF unknown errors - multi-cloud; this was eventually fixed.
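
Regarding item 3 above (UPENN SRM): a minimal watchdog sketch of the interim workaround described there, i.e. restarting BeStMan when the SRM stops answering. The hostname, port, service name, and check interval below are assumptions for illustration, not configuration taken from the ticket:

# Hypothetical SRM watchdog sketch (assumed port 8443 and init-script name
# "bestman2"); restarts BeStMan only if the SRM port stops accepting connections.
import socket
import subprocess
import time

SRM_HOST = "srm.example.edu"   # placeholder hostname
SRM_PORT = 8443                # assumed BeStMan SRM port
CHECK_INTERVAL = 300           # seconds between probes

def srm_alive(host, port, timeout=30):
    """Return True if a TCP connection to the SRM port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

def restart_bestman():
    # Assumed service name; adjust to the site's actual init script.
    subprocess.call(["service", "bestman2", "restart"])

if __name__ == "__main__":
    while True:
        if not srm_alive(SRM_HOST, SRM_PORT):
            restart_bestman()
        time.sleep(CHECK_INTERVAL)
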

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • See notes in email yesterday
    • New version of perfSONAR under test at AGLT2 and MWT2 (rc1).
    • Release expected by March, with 10G support. Goal is to deploy across the facility by end of March.
    • Amazon connectivity to AGLT2 being worked on (default uses commercial network, which is relatively slow). Hopefully soon.
  • this meeting:
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2
    • Prepare with discussions at NET2, even if the setup will come with the move to Holyoke; get organized. The move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing in LHCONE.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • link to FDR twiki https://twiki.cern.ch/twiki/bin/viewauth/Atlas/JanuaryFDR
  • Wei is testing the VOMS module provided by Gerry. Still a few things to sort out.
  • Added a few more sites. Working with the Spanish cloud; expecting new French sites.
  • New monitoring collector needed at CERN? The collector at SLAC is not very stable, working with Matevz.
  • All FAX information is coming from AGIS now.
  • Found orphaned jobs from HC.
  • Progress on skim-slim service at UC
this week
  • Release 3.3, required for the security module change, is out.
  • Ilija notes release 3.3 supports the f-stream; we should switch to it from detailed monitoring.
  • Global name space change and Rucio - may need to address this with DDM.
  • BNL overwrite options need to be set correctly (probably using an old xrdcp client)

Site news and issues (all sites)

  • T1:
    • last meeting(s): John is currently writing a cost-aware cloud scheduler. Adds policies that are cost driven, to expand to "pay as you go" resources. The current demand-driven event is helping drive better understanding with Amazon policies for provisioning and cost modeling. No indication of bottlenecks into/out of storage.
    • this meeting:

  • AGLT2:
    • last meeting(s): Things are running well. Lots of jobs in the analysis queue. Attempting to get a second peering into MiLAR, which will provide direct routes to Omnipop and MWT2. Heavy usage - more than 50k analysis jobs / day. Have had some crashes on the new storage. Addressing bottleneck between the two sites. 150 cores brought online.
    • this meeting: Still working with new storage servers - unstable.

  • NET2:
    • last meeting(s):
      1) Subscription backlog/FTS/gridftp: We moved gridftp traffic to a new host as an attempt to help 
      with the pre-Moriond "Tier 2 starvation issue".  The performance of individual gridftp transfers 
      is good, but there are long pauses even when there is plenty of networking and I/O capacity.  It's 
      acting as if there is an FTS bottleneck; however, changing #files and #streams in FTS doesn't seem to 
      have any major effect.
      
      As a result of this, there is a subscription bottleneck (~2000 subscriptions currently).  We'd like
      to have a consultation from DDM to help with this if possible.
      
      2) We borrowed Tier 3 resources to boost production, re: pre-Moriond.
      
      3) Our new SGE PanDA queues (BU_ATLAS_Tier2 and ANALY_BU_ATLAS_Tier2) are both working in PanDA with
      HC and real analysis jobs.  The only remaining problem is getting condor feedback of running jobs
      to work for APF.  Thanks to Jose, John H. and OSG guys for helping with this.
      
      4) New storage is racked and stacked, electrical work is done.  Getting this up is high on our  to do list.  Currently there is plenty of free disk space.
      
      5) We reported a problem with slow transferring in PanDA.  This has drastically different effects on
      different sites, e.g. HU drained almost completely because of this while BU had plenty of 
      activated jobs.  
      
      6) We see a problem similar to what Horst sees at OU, with quite a few jobs using only a few minutes of 
      CPU time but more than a day of wall time (see the sketch after this site report).
      
      7) perfSONAR 10Gbps optics are installed on the new bandwidth node.
      
      8) Our usual end-to-end WAN study found a weak link from MANLAN to CERN causing, e.g. outgoing traffic 
      from NET2 to France to be 10x faster than NET2 to Switzerland.  
      
      9) Bestman deletion errors continue at about a 10% rate in spite of reducing the load on our SRM host.  
      We'll deal with this as soon as we get SGE squared away.
      
      10) We still have approximately 1/2 of the 2012 hardware funds to spend.
      
      11) We saw a problem with monitoring jobs using ping at Harvard.  ping is blocked at Harvard, and this 
      had the effect of gradually using up all the LSF job slots until someone noticed and killed them.  We 
      think that this is resolved.
      
      12) Since the gridftp move on Jan. 17, we have a problem with both BU and HU being reported in "Unknown" 
      state to WLCG (thanks to Fred Luehring for pointing this out).  We still don't know what's going on here.
      
      13) Last week Ueda made some improvements to our AGIS entries, especially so that BU and HU belong to
      "US-NET2".  This is generally an improvement, but if you are sensitive to AGIS changes, you might notice 
      something.
      
      14) Preparations for Holyoke are actively underway.  Planning for networking, a 10Gbps link dedicated 
      for LHCone, late stages of negotiating with vendors to move the equipment.  The move will not occur before
      March 31, with ~1 month of slippage fairly likely.
      
      15) We sometimes run into situations where it looks like we're not getting enough pilots but have no feedback
      as to why.  
      
      16) Paul Nilsson's ErrorDiagnosis.py
      
    • this week: Running 100% analysis on the BU side. Michael: would like to include in Panglia.
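
Regarding item 6 in the NET2 report above (jobs using only minutes of CPU but more than a day of wall time), a minimal sketch of how such jobs could be flagged from batch accounting records. The record format and thresholds are illustrative assumptions, not NET2's actual tooling:

# Illustrative scan for "stuck" jobs: very little CPU time but very long wall
# time. The job-record format here is a made-up example, not actual batch output.
DAY = 86400  # seconds

def flag_stuck_jobs(jobs, max_cpu=600, min_wall=DAY):
    """Return jobs with <= max_cpu seconds of CPU but >= min_wall seconds of wall time."""
    return [j for j in jobs if j["cpu_s"] <= max_cpu and j["wall_s"] >= min_wall]

if __name__ == "__main__":
    # Hypothetical accounting records (job id, CPU seconds, wall-clock seconds).
    jobs = [
        {"id": 1001, "cpu_s": 180,   "wall_s": 2 * DAY},  # a few minutes CPU, 2 days wall
        {"id": 1002, "cpu_s": 40000, "wall_s": 45000},    # normal, efficient job
    ]
    for j in flag_stuck_jobs(jobs):
        print("suspect job %s: %.1f%% CPU efficiency" %
              (j["id"], 100.0 * j["cpu_s"] / j["wall_s"]))
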

  • MWT2:
    • last meeting(s): 500 TB added. Will be working on DQ2 + PY 2.6 client.
    • this meeting: Preparing for downtime during week of March 18: UC network reconfiguration, add 500 TB, investigate network issues at IU.

  • SWT2 (UTA):
    • last meeting(s): Deletion errors with Bestman for certain datasets. Have a ticket open. Network issue still being worked on by Jason; looks like it might be a busy link in Chicago. Continued work on storage. Finalizing purchase for additional compute nodes. Will circulate specifications to the tier 2 list.
    • this meeting: Still tracking an issue with the deletion service to clear up old deletions.

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): Unused job slots, lack of pilots.
    • this meeting: Working on getting FAX jobs to run - unintentionally brought the site down. Will need to experiment with Bestman and a new version of Java. Is OSG aware of this?

AOB

  • last meeting(s):
  • this meeting:


-- RobertGardner - 19 Feb 2013

Attachments


pdf rwg-monitoring.pptx.pdf (3442.6K) | RobertGardner, 20 Feb 2013 - 12:38 |
 