
MinutesMar202013

Introduction

Minutes of the Facilities Integration Program meeting, March 20, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Rob, Dave, Shawn, Joel (for Horst), Saul, James, Bob, Armen, Patrick, John Brunelle, Sarah, Wei, Hiro, Mark
  • Apologies: Michael, Kaushik, Doug, Horst, Jason, Fred (joining late), Alden
  • Guests:

Integration program update (Rob, Michael)

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2: DONE
  • MWT2: no change. Downtime delay for remaining 500 TB and required network upgrades, likely week of March 25
  • NET2: 576 TB will be added within one or two days.
  • SWT2_UTA: waiting for delivery (est. tomorrow)
  • SWT2_OU: Need to request service from DDN team.
this meeting:
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2: DONE
  • MWT2: no change. Downtime delay for remaining 500 TB and required network upgrades, likely week of March 25.
  • NET2: 576 TB is now in GPFS, though still UNALLOCATED at the moment. (Armen: there has been a rearrangement of GROUPDISK quotas; NET2 was lowered by 300 TB, ... and other changes.) NET2 expects to bring it online today.
  • SWT2_UTA: Storage equipment has been delivered, along with network equipment. Head nodes need to be added and everything racked and stacked, which will require a big downtime (a couple of days). Still expecting delivery of the new networking gear needed for the new storage. Delivery from Dell has also been very slow; Shawn notes we can contact Gary Kriegal and Roger Goff.
  • SWT2_OU: Lustre expansion scheduled for April 8.

Supporting opportunistic usage from OSG VOs (Rob)

last week

this week

  • UC3 VO support: MWT2, AGLT2 (working on it, updating GUMS server), NET2 (working on it), SWT2_UTA, _OU (will upgrade GUMS)
  • CMS VO support: MWT2, WT2 (only for debugging, due to lack of outbound connectivity), AGLT2, SWT2_UTA, NET2 (will do)
  • OSG VO support: MWT2, AGLT2, SWT2_UTA, NET2 (will do)

Integration program issue

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

  • June 1 is the milestone date to get all sites on.
  • BNL, AGLT2, and two of the MWT2 sites are on (MWT2_IU still needs action).
  • NET2 - Holyoke to MANLAN is now connected. The Tier 2 subnets will need to be routed. The Holyoke move is postponed. Saul expects the June 1 milestone will be met -- either at Holyoke, or in place at BU. What about Harvard? Will bring this up with HU networking.
  • SWT2 - both sites are not on LHCONE presently. OU goes through ONENET (which has a connection to the I2 AL2S system?; the plan would be to ride layer 2 to MANLAN and peer there). For SWT2, LEARN - Kansas City may be the connection point; Dale Finkleson thought it could happen quickly. Patrick: this has not been discussed locally with UTA networking; concerned about how UTA traffic is handled. Shawn: I2 can bring up a peering at any of their PoPs, e.g. in Houston, if that is the best place to peer. Will talk with the network managers.
  • SLAC is already on LHCONE.

Supporting OASIS usability by OSG VOs

  • See presentations at OSG All Hands
  • Still about a month away

Deprecating PandaMover in the US ATLAS Computing Facility (Kaushik)

Supporting opportunistic access from OSG by ATLAS

The transition to SL6

last meeting
  • WLCG working group
  • Shawn is in this group already; Horst volunteers

this meeting

  • Yesterday was the first meeting of the working group. All experiments have signed off except ATLAS. There are compilation issues for CMS. Alessandro de Salvo gave the ATLAS summary: the timeframe for site migration is June through October (many sites have already gone to SL6), so sites should start thinking about this.

Evolving the ATLAS worker node environment

  • Meeting at OSG AHM with Brian, Tim (OSG); John, Rob, Dave, Jose
  • See notes

Virtualizing Tier 2 resources to backend ageing Tier 3's (Lincoln)

Transition from DOEGrids to DigiCerts

last week

this week

  • This week is your last chance to renew DOE certs
  • See OSG AH presentations
  • Saul reports the service is very slow; nothing is happening. A user cert needs an RA to sign off, and Michael has to sponsor the first user at a site.
  • John Hover suggested renewing anything near expiration.
  • AGLT2 has converted everything and has found no problems.
  • Let's avoid certificate surprises!

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Have plenty of production to do, sites should remain full. There will be periodic bursts of important work.
    • A good time for downtimes, starting next week.
    • Try to coordinate downtimes.
  • this meeting:
    • Kaushik notes production is low, good time to take downtimes.
    • Mark: things have been running smoothly. From last meeting:
      • WISC issue resolved
      • PENN still needs help; Mark will restart thread

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • A note sent yesterday. In contact with sites to adjust space tokens.
    • Hiro will send the USERDISK cleanup list; the actual cleanup will be in two weeks.
    • Is DATADISK being used? Armen claims it is primary data. It is a question of popularity. We need to work with ADC to discuss policy for effective use by physicists.
    • NERSC scratchdisk deletion issue. Reporting not correct. Lots of consistency checking.
    • Daily summary reporting errors - a complaint was given to the SSB team but no response.
    • NET2 central deletion errors are continuing. Focus has been on low transfer rates. Have tried everything except upgrading Bestman2. (Error rate of 50%; the service sees "dropped connection"; ~400 errors/hour.) Have updated Java and increased the allowed threads; have not updated the Java heap size (Wei will send pointers; a hedged heap-size sketch follows this list).
    • USERDISK cleanup at the end of week
  • this meeting:
    • USERDISK cleanup was done at most places. Moving 25 TB from USERDISK to GROUPDISK at AGLT2.
    • Hiro believes there is something wrong with the deletion service - deleted 1/3 of the USERDISK datasets, but saw only a 10% reduction in storage and number of files. The same was seen at other sites as well.
    • Armen believes dark data is accumulating in USERDISK (70 TB). Is a user doing logical deletion? What is the strategy for handling dark data in USERDISK?
    • Ilija has an example of a source of dark data.
    • Should all sites bring the GROUPDISK quotas back to 650 TB? (It must be group-wide.)
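Regarding the NET2 BeStMan2 deletion errors above: a minimal sketch of raising the Java heap for a BeStMan2-style SRM service, assuming the heap is controlled through the standard JVM -Xmx/-Xms options passed at service startup. The file location and variable name below are illustrative assumptions, not the actual BeStMan2 configuration keys (see Wei's pointers for the real ones).

    # Hedged sketch only: the variable name is hypothetical; the underlying
    # mechanism is simply the JVM -Xmx (max heap) / -Xms (initial heap) options.
    # e.g. in an environment/sysconfig file read when the SRM service starts:
    BESTMAN2_JVM_OPTS="-Xmx4096m -Xms1024m"    # raise max heap to 4 GB (illustrative value)
    # equivalently, on the java command line that launches the server:
    #   java -Xmx4096m -Xms1024m ... <bestman2 server jar/class>

A larger heap mainly helps when a heavy deletion load keeps many concurrent SRM connections and threads alive; it is not by itself a fix for dropped connections.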

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-3_11_2013.html
    
    1)  3/9: BU_ATLAS_Tier2 squid service was down.  Restarted a few hours later - issue resolved.  https://ggus.eu/ws/ticket_info.php?ticket=92361 closed, 
    eLog 43268.
    2)  3/11: https://ggus.eu/ws/ticket_info.php?ticket=92434 ("source file doesn't exist") was opened and incorrectly assigned to BNL.  Problem was instead on 
    the NET2 side.  From Saul:  these are real missing files on our side from PRODDISK. I believe it's from us cleansing files from PRODDISK before they had a 
    chance to be transferred out when we were having our DDM slowness a week or so ago. We've declared some of the files missing, but still have to do an 
    inventory.  ggus ticket was closed - eLog 43300.
    3)  3/11: OU_OCHEP_SWT2 - jobs failing with "lost heartbeat" errors (~50% failure rate).  Horst investigated but did not find any site-related issues.  Errors went 
    away after a few hours.  https://gus.fzk.de/ws/ticket_info.php?ticket=92477 closed - eLog 43321.
    
    Follow-ups from earlier reports:
    
    (i)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (ii)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been seen 
    at the site a couple of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try and implement 
    a more permanent fix.  eLog 42963.
    Update 3/2: still an ongoing issue.  BeStMan restarts are required every few days (4-5?).  eLog 43194.
    (iii)  2/27: AGLT2 file transfer errors ("locality is UNAVAILABLE").  One or more storage servers experiencing heavy loads.  https://ggus.eu/ws/ticket_info.php?ticket=91835 
    in-progress, eLog 43143.  https://savannah.cern.ch/support/index.php?136180 (Savannah site exclusion).  https://ggus.eu/ws/ticket_info.php?ticket=91896 was 
    also opened on 3/3 for file transfer problems.  Update 3/5: the SRM errors that day were a separate issue (ownership of host certs on some dCache servers); issue resolved.  
    Also, a new kernel was deployed on the storage nodes to rectify recent problems. 
    Update 3/8: all issues resolved - ggus 91835, 91896 closed - eLog 43298.
    (iv)  3/5: WISC-ATLAS LOCALGROUPDISK: functional test transfer failures (efficiency 0%) with DESTINATION errors (" has trouble with canonical path").  
    https://ggus.eu/ws/ticket_info.php?ticket=92164, eLog 43227.
    Update 3/11: still seeing these file transfer errors - added a comment to ggus 92164.
    
  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=235568
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-3_18_2013.htm
    
    1)  3/14: SWT2_CPB - file transfer failures ("failed to contact on remote SRM").  Two issues around this time: (i) several storage servers were heavily loaded; (ii) the 
    SRM host was intermittently unresponsive.  Issue cleared up after a few hours -  https://ggus.eu/ws/ticket_info.php?ticket=92519 closed early a.m. 3/15.  eLog 43356.
    2)  3/15: DUKE_ATLASGCE - site had requested to be set to 'brokeroff' on 3/7.  On 3/15 jobs were heavily failing at the site with "lost heartbeat" errors.  Issue was 
    due to an overload of the batch queue head node.
    3)  3/16: HU_ATLAS_Tier2 - two WN's with high job failure rates.  Site confirmed the problem, and the nodes were removed from production.  
    https://ggus.eu/ws/ticket_info.php?ticket=92551 closed, eLog 43360.
    4)  3/16: OU_OCHEP_SWT2 - one WN had a high failure rate.  Horst removed the node from production, as it had a bad hard drive.  
    https://ggus.eu/ws/ticket_info.php?ticket=92554 closed, eLog 43364.
    5)  3/18: SLACXRD file transfer failures ("failed to contact on remote SRM").  Wei reported the problem was fixed - transfers began to succeed.  Closed 
    https://ggus.eu/ws/ticket_info.php?ticket=92604 - eLog 43395.
    
    Follow-ups from earlier reports:
    
    (i)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    Update 3/17: the deletion service team has been engaged to provide some support to address this problem.  See: https://savannah.cern.ch/bugs/index.php?100884.  eLog 43373.
    (ii)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been seen 
    at the site a couple of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try and implement 
    a more permanent fix.  eLog 42963.
    Update 3/2: still an ongoing issue.  BeStMan restarts are required every few days (4-5?).  eLog 43194.
    (iii)  3/5: WISC-ATLAS LOCALGROUPDISK: functional test transfer failures (efficiency 0%) with DESTINATION errors (" has trouble with canonical path").  
    https://ggus.eu/ws/ticket_info.php?ticket=92164, eLog 43227.
    Update 3/11: still seeing these file transfer errors - added a comment to ggus 92164.
    Update 3/19: Site admin resolved a problem with xrootdfs - transfers now succeeding.  Closed ggus 92164 - eLog 43410.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • New perfSONAR release expected by March, with 10G support. Goal is to deploy across the facility by the end of March.
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2 - timeline?
    • Prepare with discussions at NET2, even if the setup will come in with the move to Holyoke; get organized. The move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing up LHCONE.
  • this meeting:
    • rc2 for perfSONAR is available; rc3 expected next week. Sites should prepare to upgrade.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s) this week

  • Wei: had a meeting at CERN with ADC regarding the global namespace with Rucio. Everyone is on board with the implementation; it is a compromise involving space tokens. Implementations: C++ (Andy/Wei), Java (Ilija), DPM (?).
  • RPM for the VOMS security module: Xrootd 3.3.1 plus the RPM. Gerry will provide this within a week.
  • Ilija: New sites: FZK, RAL-PP
  • A new student is investigating all the monitoring data sent to CERN, studying IO rates, etc.; this will be useful.
  • Slim Skim Service - recommended as a central service.
  • xrootd-door on dCache: Wei has a work-around for a few problems. Discussing dCache support.

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting:

  • AGLT2:
    • last meeting(s):
    • this meeting: Ordered an upgrade to the blade system for 10G systems; 416 job slots to be added.

  • NET2:
    • last meeting(s):
    • this week: Central deletion group is troubleshooting problems.

  • MWT2:
    • last meeting(s):
    • this meeting: Waiting on the network reconfiguration for the Dell hardware. IU - Brocade networking problem. UIUC - campus cluster IP address space has been converted; a GPFS problem was found and needs an update. New Dell hardware has arrived.

  • SWT2 (UTA):
    • last meeting(s):
    • this meeting:

  • SWT2 (OU):
    • last meeting(s):
    • this meeting:

  • WT2:
    • last meeting(s):
    • this meeting: Power issue on March 13 broke a Dell switch; downtime tomorrow to replace it. The per-user process limit on RHEL6 caused problems with the SSD arrays crashing - the "90-nproc" problem (now fixed and stable; see the note below).
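For reference on the "90-nproc" item above: stock RHEL6/SL6 ships /etc/security/limits.d/90-nproc.conf, which caps the soft nproc limit for non-root users (typically at 1024) and can starve heavily threaded services. A minimal sketch of the usual workaround follows, assuming a limits.d override; the 65536 value and the service account are illustrative, not WT2's actual settings.

    # /etc/security/limits.d/90-nproc.conf (RHEL6/SL6)
    # The shipped default caps processes/threads for non-root users, e.g.:
    #   *          soft    nproc     1024
    # Hedged sketch of a raised limit (values and accounts illustrative only):
    *            soft    nproc     65536
    # or scope the override to the affected service account (hypothetical 'xrootd' user):
    # xrootd     soft    nproc     65536

Changes in limits.d take effect on new login sessions and service starts; already-running processes keep their old limits.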

AOB

last meeting this meeting


-- RobertGardner - 20 Mar 2013
