Minutes of the Facilities Integration Program meeting, April 3, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”


  • Meeting attendees: Dave, Rob, Fred, Saul, Bob, Ilija, Wei, Patrick, Mark, Armen, Alden, Sarah, Horst, Hiro
  • Apologies: Mark, Michael, Jason
  • Guests:

Integration program update (Rob, Michael)

Facility storage deployment review

last meeting(s):
  • Tier 1: DONE
  • WT2: DONE
  • MWT2: no change. Downtime delayed for the remaining 500 TB and required network upgrades, likely the week of March 25.
  • NET2: 576 TB is now in GPFS, though still in UNALLOCATED at the moment. (Armen: there has been a rearrangement of GROUPDISK quotas; NET2 was lowered by 300 TB, among other changes.) Expects to bring it online today.
  • SWT2_UTA: Equipment has been delivered, including network equipment. Need to add head nodes, which will require a big downtime, and to rack and stack (a couple of days). Still expecting delivery of additional networking gear needed for the new storage. Delivery from Dell has been very slow; Shawn notes we can contact Gary Kriegal and Roger Goff.
  • SWT2_OU: Lustre expansion scheduled for April 8.
this meeting:
  • Tier 1: DONE
  • WT2: DONE
  • MWT2: Network updates in place. Bringing the remaining 500 TB online this week and next.
  • NET2: DONE
  • SWT2_UTA: Equipment received, storage racked. Planning the networking changes; coming along. Will take a downtime to make the network changes.
  • SWT2_OU: Still on for April 8. Network configuration on Tuesday. Will add 120 TB.

Supporting opportunistic usage from OSG VOs (Rob)

last week
  • AccessOSG
  • In particular, I would like to request all US ATLAS sites support the UC3 VO (see SupportingUC3).
  • UC3 VO support: MWT2, AGLT2 (working on it, updating GUMS server), NET2 (working on it), SWT2_UTA, SWT2_OU (will upgrade GUMS)
  • CMS VO support: MWT2, WT2 (only for debugging, due to lack of outbound connectivity), AGLT2, SWT2_UTA, NET2 (will do)
  • OSG VO support: MWT2, AGLT2, SWT2_UTA, NET2 (will do)
this week
  • UC3 VOMS issue seen at AGLT2 - getting an access error. Elisabeth from the GOC is attempting to reproduce it. Fred can also help.
  • CMS - NET2 attempted, but had an authentication issue.
  • OSG - NET2 will try again.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting
  • June 1 is the milestone date to get all sites on.
  • BNL, AGLT2, and 2 sites from MWT2 are connected (MWT2_IU needs action).
  • NET2 - Holyoke to MANLAN is now connected. Tier 2 subnets will need to be routed. The Holyoke move is postponed. Saul expects the June 1 milestone will be met -- either at Holyoke, or in place at BU. What about Harvard? Will bring this up with HU networking.
  • SWT2 - both sites are not on LHCONE presently. OU goes through ONENET (has a connection to the I2 AL2S system? ride layer 2 to MANLAN and peer there). For SWT2, LEARN - Kansas City may be the connection point. Dale Finkleson thought it could happen quickly. Patrick: have not discussed this locally with UTA networking; concerned about how UTA traffic is handled. Shawn: I2 can bring up a peering at any of their PoPs, e.g. in Houston, if that is the best place to peer. Will talk with network managers.
  • SLAC is already on LHCONE.

this meeting

  • Updates?
  • Fred - still struggling with Brocade at IU. Problem has been reproduced, testing.
  • SWT2_OU: still waiting to hear from I2 and ESnet. Zane Gray at OU is leading this.

Supporting OASIS usability by OSG VOs

last meeting
  • See presentations at OSG All Hands
  • Still about a month away
  • Note this is CVMFS for OSG VOs

this meeting

  • Fred will invite Scott to report in two weeks.

Deprecating PandaMover in the US ATLAS Computing Facility (Kaushik)

  • No update.

The transition to SL6

last meeting
  • WLCG working group
  • Shawn is in this group already; Horst volunteers
  • Yesterday was the first meeting of the working group. All experiments have signed off except ATLAS. There are compilation issues for CMS. Alessandro de Salvo gave the ATLAS summary - timeframe for site migration starting in June (lots of sites have already gone to SL6). June - October migration window, so sites should think about this.

this meeting

  • Discussed above.

Evolving the ATLAS worker node environment

last meeting
  • Meeting at OSG AHM with Brian, Tim (OSG); John, Rob, Dave, Jose

this meeting

  • Nothing to report this meeting.

Virtualizing Tier 2 resources to back-end ageing Tier 3s (Lincoln)

  • No update

Transition from DOEGrids to DigiCerts

last week
  • This week is your last chance to renew DOE certs.
  • See OSG AH presentations
  • Saul reports that the service is very slow, with nothing happening. A user cert needs an RA to sign off. Michael has to sponsor the first user for a site.
  • John Hover suggested renewing anything near expiration.
  • AGLT2 has converted everything and has found no problems.
  • Let's avoid certificate surprises!

this week

  • As of March 26 there is a list of sponsors.

Operations overview: Production and Analysis (Kaushik absent, Mark reporting)

  • Production reference:
  • last meeting(s):
    • Kaushik notes production is low, a good time to take downtimes.
    • Mark: things have been running smoothly. From last meeting:
      • WISC issue resolved
      • PENN still needs help; Mark will restart thread
  • this meeting:
    • Running smoothly.
    • Pilot update from Paul
    • Large number of waiting jobs from the US - premature assignment?
    • Following up with PENN: email got no response. Paul Keener reported an auto-update to his storage; reverted to the previous version (March 28). Transfers are now stable at the site, and the ticket has been closed.
    • Discussion about site storage blacklisting. It's essentially an automatic blacklisting. Discussed using atlas-support-cloud-us@cern.ch. The problem is what to do with the Tier 3s. Doug will make sure the Tier 3 sites have correct email addresses in AGIS.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • USERDISK cleanup was done at most places. Move 25 TB from USERDISK to GROUPDISK at AGLT2.
    • Hiro believes there is something wrong with the deletion service - it deleted 1/3 of USERDISK datasets, but there was only a 10% reduction in storage or number of files. Saw this at other sites as well.
    • Armen believes dark data in USERDISK is accumulating (70 TB). Is a user doing logical deletion? What is the strategy for handling dark data in USERDISK?
    • Ilija has an example of a source of dark data.
    • Should all sites bring the GROUPDISK quotas back to 650 TB? (It must be group-wide.)
  • this meeting:
    • Will need another USERDISK cleanup. Hiro will send an email.
    • PRODDISK needs to be cleaned at NET2 and MWT2.
    • The SRM values are reported incorrectly at SLAC and SWT2.
    • The SRM values for the SWT2 space tokens dropped, then came back, except for GROUPDISK. Noted as transient behavior.
    • GROUPDISK loss at MWT2 - site report below.
    • NET2 deletion issue - it is still slow, and there is a reporting issue here as well. This is reported every day in Savannah, and there is an active discussion. Saul believes it is functional after lowering the deletion chunk size to 10 (other sites use 80-120); see the chunked-deletion sketch below. The lcg-del command used by Tomas sometimes drops connections - about 1 out of 10 from BU. Next step: try duplicating the problem on a new machine. Dropped packets?
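
For illustration only - a minimal Python sketch (not the actual central deletion service) of deleting a list of SURLs in chunks via lcg-del, just to show what changing the chunk size amounts to. The SURLs are hypothetical and production lcg-del options are omitted.

    #!/usr/bin/env python
    # Minimal sketch, NOT the ATLAS central deletion service: delete SURLs in
    # small chunks by shelling out to lcg-del, to illustrate the effect of the
    # per-request chunk size. Production invocations would add lcg-del options
    # that are omitted here; the SURLs below are hypothetical.
    import subprocess

    CHUNK_SIZE = 10  # NET2 currently uses 10; other sites reportedly use 80-120

    def delete_in_chunks(surls, chunk_size=CHUNK_SIZE):
        """Issue one lcg-del call per chunk of SURLs; return the SURLs that failed."""
        failed = []
        for i in range(0, len(surls), chunk_size):
            chunk = surls[i:i + chunk_size]
            if subprocess.call(["lcg-del"] + chunk) != 0:
                failed.extend(chunk)
        return failed

    if __name__ == "__main__":
        surls = ["srm://se.example.edu/atlas/userdisk/file%04d" % n for n in range(100)]
        print("failed deletions: %d" % len(delete_in_chunks(surls)))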

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  3/21: New pilot release from Paul (v56d).  Details here:
    2)  3/21: DDM dashboard update released:
    3)  3/21: NET2 - file transfer failures with SRM errors.  https://ggus.eu/ws/ticket_info.php?ticket=92762 in-progress, eLog 43449.  Blacklisted: 
    Update 3/27: issue seems to be resolved, no errors over the previous 12 hours.  Closed ggus 92762.  eLog 43543.
    4)  3/21: WISC - https://ggus.eu/ws/ticket_info.php?ticket=92164 was re-opened (previously closed on 3/19), due to all transfers failing with 
    "service timeout during [srm2__srmPrepareToPut]."  No site update, but transfers succeeding as of 3/25.  Again closed ggus 92164.  eLog 43505.
    5)  3/22: NERSC - file transfer failures ("[GRIDFTP_ERROR] an end-of-file was reached globus_xio: An end of file occurred (possibly the destination 
    disk is full)"). https://ggus.eu/ws/ticket_info.php?ticket=92768 was closed on 3/25 after the errors went away.  eLog 43506.
    6)  3/24: SWT2_CPB - https://ggus.eu/ws/ticket_info.php?ticket=92519 was re-opened after some of the SRM transfer errors came back.  Several 
    storage servers are intermittently heavily loaded - issue under investigation. eLog 43476.
    7)  3/24: AGLT2 - file transfer errors ("[GENERAL_FAILURE] ... ThePinCallbacks Timeout").  A storage server was hung up and had to be restarted.  
    Also free space increased on the system since it was fuller than dCache settings should have allowed.  https://ggus.eu/ws/ticket_info.php?ticket=92802 
    closed, eLog 43507.
    8)  3/25: SLAC - file transfer errors (SRM).  Issue was a temporary one related to a networking issue at the site (problematic switch caused most of 
    the storage to be inaccessible).  eLog 43493/94.
    Follow-ups from earlier reports:
    (i)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    Update 3/17: deletion service team involved to provide some support to address this problem.  See:
    https://savannah.cern.ch/bugs/index.php?100884.  eLog 43373.
    Update 3/25: working with the deletion service team, this issue was resolved.  Still working to increase the rate.  ggus 89339 closed.
    (ii)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been 
    seen at the site a couple of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try 
    and implement a more permanent fix.  eLog 42963.
    Update 3/2: still an ongoing issue.  BeStMan restarts are required every few days (4-5?).  eLog 43194.

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  3/28: MWT2 - file transfer errors (SRM). Issues following a major networking upgrade at the site forced the scheduled maintenance downtime to be 
    extended, hence these transfer errors.   Issue resolved, https://ggus.eu/ws/ticket_info.php?ticket=92948 closed as of 3/29.  eLog 43568.  
    https://savannah.cern.ch/support/index.php?136789 (Savannah site exclusion - site was blacklisted during this incident).
    2)  3/29:  NERSC - https://ggus.eu/ws/ticket_info.php?ticket=92768 was re-opened after the file transfer errors reappeared.  eLog 43629.
    3)  4/1: NET2 - file transfer errors (SRM).  From Saul: We had a GPFS problem which led to our SRM hanging. SRM is restarted. We think that the problem 
    is resolved.  No additional errors over the next ~12 hours, so closed https://ggus.eu/ws/ticket_info.php?ticket=93004.  eLog 43666.
    Follow-ups from earlier reports:
    (i)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been 
    seen at the site a couple of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try and 
    implement a more permanent fix.  eLog 42963.
    Update 3/2: still an ongoing issue.  BeStMan restarts are required every few days (4-5?).  eLog 43194.
    Update 3/28: recent errors were traced to an unannounced update to xrootd that broke the installed system. Reverted to the earlier version, and so far this 
    situation is improved.  ggus 91122 was closed, eLog 43573.
    (ii)  3/24: SWT2_CPB - https://ggus.eu/ws/ticket_info.php?ticket=92519 was re-opened after some of the SRM transfer errors came back.  Several storage 
    servers are intermittently heavily loaded - issue under investigation. eLog 43476.
    Update 3/28: It was discovered that a disk in a RAID array failed in such a way that the hot spare did not swap in correctly.  Replacing the disk cleared up the 
    issue.  After a restart of xrootdfs transfers are again succeeding.  Closed ggus 92519, eLog 43586.

DDM Operations (Hiro)

  • this meeting:

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Release by March, with 10G. Goal to deploy across the facility by the end of March.
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2 - timeline?
    • Prepare with discussions at NET2, even if the setup will come in with the move to Holyoke; get organized. Move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing LHCONE.
    • rc2 for perfsonar; rc3 next week; sites should prepare to upgrade.
  • this meeting:
    • Network problems seen generally for westbound traffic.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei: had a meeting at CERN with ADC regarding the global namespace with Rucio. Everyone is on board as to the implementation. It is a compromise regarding space tokens. Implementations: C++ (Andy/Wei), Java (Ilija), DPM (?).
  • rpm for VOMS security module. Xrootd 3.3.1, plus rpm. Gerry will provide this within a week.
  • Ilija: New sites: FZK, RAL-PP
  • A new student is investigating all monitoring data sent to CERN, studying IO rates, etc.; this will be useful.
  • Slim Skim Service - recommended as a central service.
  • xrootd-door on dCache: Wei has a work-around for a few problems. Discussing dCache support.
this week
  • Wei - Gerri packed the voms security module into an rpm; it works at SLAC and with DPM sites. Once Gerri has a permanent location for the rpm, he will create a set of detailed instructions.
  • Ilija has requested a central git repo at CERN for FAX. Can the WLCG task force provide a point of coordination?
  • Ilija - doing FDR testing, also testing using the Slim Skim Service. Seeing nice results for file transfers to UC3 - now seeing 4 GB/s. How is the collector holding up? (See the throughput sketch below.)
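
As a rough illustration of the kind of per-file read test used in this sort of exercise - a minimal sketch that copies one file through an xrootd redirector with xrdcp and reports the achieved rate; the redirector hostname and file path are hypothetical, not the actual FDR inputs.

    #!/usr/bin/env python
    # Minimal sketch of a single-file FAX read test: copy one file through an
    # xrootd redirector with xrdcp and report the achieved throughput. The
    # redirector hostname and file path are hypothetical placeholders.
    import os
    import subprocess
    import tempfile
    import time

    REDIRECTOR = "fax-redirector.example.org"           # hypothetical
    TESTFILE = "/atlas/user/somedataset/AOD.test.root"  # hypothetical

    def timed_copy(redirector, path):
        """Copy one file with xrdcp and return (seconds, MB/s)."""
        dest = tempfile.NamedTemporaryFile(delete=False)
        dest.close()
        url = "root://%s/%s" % (redirector, path)
        t0 = time.time()
        ret = subprocess.call(["xrdcp", "-f", url, dest.name])
        elapsed = time.time() - t0
        size_mb = os.path.getsize(dest.name) / 1e6
        os.unlink(dest.name)
        if ret != 0:
            raise RuntimeError("xrdcp failed for %s" % url)
        return elapsed, size_mb / elapsed

    if __name__ == "__main__":
        secs, rate = timed_copy(REDIRECTOR, TESTFILE)
        print("copied in %.1f s at %.1f MB/s" % (secs, rate))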

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting:

  • AGLT2:
    • last meeting(s): Ordered an upgrade to the blade system for 10G systems; 416 job slots to be added.
    • this meeting: Working on moving dCache pool servers to SL6.3; most are converted. Moving VMs. Shawn is working with Patrick and Sarah on PRODDISK cleanup, then on getting the CCC mechanism working again.

  • NET2:
    • last meeting(s): Central deletion group is troubleshooting problems.
    • this week: Had a spike in lost heartbeat jobs due to a bug in the gatekeeper SEG; OSG is working on a patch. Still have the slow deletion issue.

  • MWT2:
    • last meeting(s): Waiting on Dell hardware network reconfig. IU - Brocade networking problem. UIUC - campus cluster IP address space converted; GPFS problem found, needs an update. New Dell hardware has arrived.
    • this meeting: Upgrades to the UC network - reconfigured with bonded 40G ports between the Cisco and Dell stacks; 2x10G bonded for some s-nodes. IU reconfiguration for jumbo frames (see the jumbo-frame check sketch at the end of the site reports). New compute nodes at UIUC - 14 R420s, being built with SL5. Also adding more disk for the DDN, but there are continued GPFS issues; working closely with the campus cluster admins. GROUPDISK data loss - CCC was incorrectly reporting a large amount of dark data. Recovering what we can from other sites, notifying users, and modifying procedures so it doesn't happen again.

  • SWT2 (UTA):
    • last meeting(s):
    • this meeting: A couple of storage issues possibly creating the dips in the accounting plots. New version of panda mover in the git repo that uses python 2.6. Use with caution. Will be busy getting new storage online.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: Downtime next week. 10g perfsonar node as well.

  • WT2:
    • last meeting(s): Power issue on March 13 broke a Dell switch. Downtime tomorrow to replace the switch. User process limit on RHEL6 had a problem with SSD arrays crashing - the 90-nproc problem (fixed, stable).
    • this meeting: The Dell switch issue was minor. Two channel-bonded 10G uplinks sometimes have trouble; possibly a problem with the environment.
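
Related to the jumbo-frame reconfiguration noted under MWT2 above - a minimal sketch, assuming Linux iputils ping, for checking that 9000-byte frames survive end to end; the hostnames are hypothetical placeholders.

    #!/usr/bin/env python
    # Minimal sketch: check that jumbo frames (MTU 9000) survive end to end by
    # sending non-fragmentable ICMP payloads of 8972 bytes (+ 28 bytes of
    # IP/ICMP headers = 9000). Hostnames are hypothetical placeholders.
    import subprocess

    HOSTS = ["storage01.example.edu", "storage02.example.edu"]
    PAYLOAD = 9000 - 28  # payload size that fills a 9000-byte frame exactly

    def jumbo_ok(host):
        """Return True if a full-size jumbo frame reaches the host unfragmented."""
        return subprocess.call(
            ["ping", "-M", "do", "-c", "3", "-s", str(PAYLOAD), host]) == 0

    if __name__ == "__main__":
        for host in HOSTS:
            print("%s: %s" % (host, "jumbo OK" if jumbo_ok(host) else "FAILED"))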


last meeting

this meeting
  • HPC activities at Argonne - Doug to report next meeting.

-- RobertGardner - 03 Apr 2013
