


Minutes of the Facilities Integration Program meeting, October 16, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843


  • Meeting attendees: Chris, Dave, Saul, Michael, Joel, Sarah, Mayuko, Bob, John Brunelle, Hiro, Wei, Mark, Alden, Armen, Patrick, Fred, Doug
  • Apologies: Horst, Jason, Shawn, Mark
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Procurement status
      • Rucio support, SL6 - all set, final checks
      • Tier 3 flocking, Campus group formation: a collection of services to utilize and leverage resources outside the institution. Formulate a plan to materialize the capabilities and functionality; need to work closely with Analysis Support.
      • SHA-2 transition
      • High level comments: the SL6 transition (needed to support the latest dq2 clients) took way too long; we ended up three months late. We must prepare better and hold to a schedule next time. Special thanks to David Lesny for spearheading the technical issues. Equipment procurement is also taking too long. When can we expect final numbers?
      • An update on the network program is due next week.
      • Perfsonar registration in OIM. There are ambiguities to resolve.
    • this week
      • Review of progress on network upgrades
      • Procurement updates
      • Rucio re-naming
      • December US ATLAS Computing Facilities meeting. December 11-12, 2013, University of Arizona (Tucson). Agenda, theme in progress.
      • OSG All Hands Meeting announced:
        We are pleased to give you first information for the 2014 OSG All Hands Meeting. This will be April 7-11th 2014 at the SLAC National Accelerator Lab in California. http://www.slac.stanford.edu/.
        The schedule will follow the successful format from previous years:
          * US ATLAS and US CMS distributed facility – Tier-2 and Tier-3 – and the next Campus Infrastructure Community (CIC)  meetings on the Monday and Tuesday.
          * Plenary talks from scientists, researchers and OSG leaders on the Wednesday.
          * "Ask the Experts" and other workshops on Thursday.
          * And the OSG Council face-to-face – open to Consortium members – at the end of the week. 
        Information about hotels and other logistics will be posted in about a month. Don't hesitate to contact us for more information, as well as if you are interested to contribute and participate in the program planning and program itself.
        Program Committee: osg-ahm-program@opensciencegrid.org
        Organizer: Amber Boehnlein, AHM2014 Host
      • Michael: possible impacts of a government shutdown. Will try to do everything to keep the Tier-2s running; storage alternatives to be discussed. We might consider making this a tracked program of work.

Reports on program-funded network upgrade activities


  • Ordered Juniper EX9208 (100 Gbps on a channel) for both UM and MSU. Getting them installed now.
  • Will be retargeting some of the Tier-2 funds to complete the circuits between sites.
  • LR optics being purchased ($1200 per transceiver for the Junipers).
  • Need to get a 40g line card for the MX router on campus.
  • Probably a month away from 40g or 80g connectivity to CC-NIE.
  • UM-MSU routing will be only 40g.
  • Likely end of November.



  • Replacing the 6248 backbone with a Z9000 as the central switch, plus additional satellite switches connected to the central switch, likely Dell 8132s.
  • Might even put compute nodes onto the 8132Fs (5 - 6) at 10g; the 8132F has a QSFP module for uplinks.
  • Waiting for quotes from Dell
  • Michael: should look at per-port cost when considering compute nodes
  • Early December timeframe
  • 100g from campus - still no definite plans

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting

this meeting:

  • SLAC: PO was sent to Dell, but now pulled back.
  • AGLT2: Intel came through - have new quotes, getting them re-worked.
  • NET2:
  • SWT2:
  • MWT2: R620 with Ivy Bridge

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE


  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance, i.e. a virtual router for the LHCONE subnets; (3) a physical router as gateway for the LHCONE subnets. (A conceptual sketch of the PBR approach follows this list.)
  • NET2: have not been pushing it, but will get ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed; the direct link from BNL to OU has been put back in place. Will start on LHCONE next.
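
  For illustration only, a minimal sketch of the idea behind option (1), PBR: traffic sourced from the subnets announced to LHCONE gets the LHCONE next hop, everything else follows the default path. The subnets and next-hop names below are made-up documentation values, not any site's actual configuration, and this is Python pseudocode for the concept, not a router config.

      # Conceptual sketch of policy-based routing for LHCONE (illustrative values only)
      import ipaddress

      LHCONE_SUBNETS = [ipaddress.ip_network(u"198.51.100.0/24"),   # example announced subnets
                        ipaddress.ip_network(u"10.10.0.0/16")]
      LHCONE_NEXT_HOP = "lhcone-gw"       # hypothetical LHCONE gateway / routing instance
      DEFAULT_NEXT_HOP = "campus-gw"      # hypothetical general-purpose campus path

      def next_hop(src_ip):
          """Pick the next hop based on the source address - the essence of PBR."""
          addr = ipaddress.ip_address(src_ip)
          if any(addr in net for net in LHCONE_SUBNETS):
              return LHCONE_NEXT_HOP
          return DEFAULT_NEXT_HOP

      print(next_hop(u"198.51.100.17"))   # -> lhcone-gw
      print(next_hop(u"203.0.113.5"))     # -> campus-gw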

previous meeting

  • NET2: status unclear: waiting on instructions from Mike O'Conner, unless there have been direct communications with Chuck. Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then re-establish the BNL link. Believes the throughput matrix has improved (a packet-loss problem seems to be resolved). Timeline unknown. Will ping existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus. Could PBR be implemented properly? Can provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of the network staff. A new manager is coming online. Will see about implementing PBR. Update at the next meeting.

previous meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. Then perfsonar issues, now resolved. Expect to have either the Tier 2 or the OSCER site done soon.
  • BU and Holyoke: the network engineers have been put in touch. Still unknown when it will happen; have not been able to extract a date.
  • UTA - no progress.

previous meeting (9/4/13)

  • Updates?
  • UTA: meeting with the new network director scheduled for this Friday or next week. Back on the same page.

previous meeting (9/18/13)

  • Updates?
  • UTA - no update; working to get time with the new director before the next meeting.
  • BU & HU - made some headway with Chuck and Mike O'Conner. NOX at Holyoke to be at 100g in 6 months. (Michael: from LHCONE operations call, NOX will extend to MANLAN, initially 10g link on short notice; sounded promising.)
  • OU - OU network folks think we can be on LHCONE by Oct 1

this meeting (10/16/13)

  • Updates?
  • UTA - had a meeting with the new director of campus network computing and a LEARN representative. A separate routing instance is a possibility. Will meet with them tomorrow morning.
  • OU - a new switch is being purchased that will also support a separate routing instance, so as to separate the traffic.
  • BU - no news. Will HU not join LHCONE? Michael: this raises the question of the NET2 architecture. Saul: HU is connected by 2x10g links; discussing it with James.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Two drain periods over the last week; Saturday's drain was caused by new certificates installed on Friday.
    • A bad task got into the system; tens of thousands of failures. Missing scout jobs?
  • this meeting:
    • Had a user running multi-threaded jobs in the ANALY queues. Should we set up a dedicated multicore queue?
    • In production, multicore jobs tend to be validation tasks, but require only around 100 slots.
    • Bring this up at next week's software week.

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    1)  10/5: MWT2 - file transfer failures ("[CONNECTION_ERROR] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]").  From Sarah: 
    I found that the PoolManager service (running inside dCacheDomain on uct2-dc4) had failed with: OpenJDK 64-Bit Server VM warning: Attempt to allocate stack guard
    pages failed. java.lang.OutOfMemoryError: unable to create new native thread.
    I restarted dCacheDomain, and transfers have started to succeed. https://ggus.eu/ws/ticket_info.php?ticket=97789 was closed, eLog 46293.
    2)  10/7 early a.m.: US sites were auto-excluded by HC testing due to a problem with the VOMS service at CERN (proxy was getting created, but the verification would fail).  
    Issue appeared to be resolved after several hours (during this period sites were bouncing between off-line/on-line several times).  
    Related ggus tickets: https://ggus.eu/ws/ticket_info.php?ticket=97808, https://ggus.eu/ws/ticket_info.php?ticket=97813 were closed (not a BNL site issue). eLog 46321, 46327
    3)  10/8: ADC Weekly meeting:
    4)  10/8: SLACXRD - file transfers failing ("[GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 500 500-Command failed: 
    globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914: 500-open() fail 500 End]").   https://ggus.eu/ws/ticket_info.php?ticket=97880 in-progress, eLog 46350.
    Follow-ups from earlier reports:

  • this week: Operations summary:
    Summary from the weekly ADCoS meeting:
    No meeting this week
    1)  10/11: AGLT2 - production jobs failing with errors like "Error details: ddm: Could not add files to DDM: dq2.repository.DQRepositoryException.DQFrozenDatasetException : 
    [USER][OTHER] The dataset ... is already frozen!"  From Shawn: Our postgresql yum-auto-updated on our dCache headnode around 04:43 AM Eastern time today. This causes 
    the existing dCache processes to lose access to the DB. I am restarting dCache services which should resolve the problem. https://ggus.eu/ws/ticket_info.php?ticket=97956 closed 
    later the same day, eLog 46461.
    2)  10/12: MWT2 - file transfer failures with "failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]." From Dave: I found our SRM service down 
    this morning, I suspect due to a full disk partition. Some cleanup has been done to open space, and SRM has been restarted. Issue resolved, 
    https://ggus.eu/ws/ticket_info.php?ticket=97981 was closed, eLog 46405.  (https://savannah.cern.ch/support/index.php?140158 - site was blacklisted for several hours by 
    SAAB during this period.)
    3) 10/12 p.m. - Bob at AGLT2 reported a major networking outage at the site.  Downtime declared in OIM.
    Update 10/14 p.m. - networking issues resolved.  By the next day HC tests were succeeding, so the sites were set back on-line.
    4)  10/14: MWT2 - production jobs were failing with errors like "pilot: getProperSiterootAndCmtconfig: Missing installation: No such file or directory: 
    /cvmfs/atlas.cern.ch/repo/sw/software/17.2.1/cmtsite/setup.sh."  Dave reported that several nodes had corrupted CVMFS caches (related to a known bug in CVMFS 2.1.14).
    The nodes were taken off-line until the caches can be cleaned. Issue resolved, https://ggus.eu/ws/ticket_info.php?ticket=97994 was closed, eLog 46462.
    5)  10/15: ADC Weekly meeting:
    (Meeting not held this week, but the AMOD/ADCoS report is available on the meeting Indico page.)
    Follow-ups from earlier reports:
    (i)  10/8: SLACXRD - file transfers failing ("[GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 500 500-Command failed: 
    globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914: 500-open() fail 500 End]").   https://ggus.eu/ws/ticket_info.php?ticket=97880 in-progress, eLog 46350.
    Update 10/15: No recent updates to the ticket, but transfers have been succeeding over the past 48 hours, so ggus 97880 was closed.  eLog 46460.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • MinutesSep17DataManage2013
    • Estimating that 5% of all data in the US facility is dark, mostly in USERDISK
    • About 100 TB going from primary to secondary
    • Kaushik's action items: dark data on DATADISK should not happen (it is handled by the subscription services); Saul and others will look into it. USERDISK: much of it is quite old, possibly caused by glitches in the deletion service? Far more complex to sort out; simplest is to set a cutoff date and delete everything prior to it (see the sketch at the end of this section).
    • Hiro: the Rucio File Catalog will be replicated to BNL
  • this meeting:
    • Various PRODDISK, USERDISK, and LOCALGROUPDISK cleanup campaigns; removed more than 300 TB.
    • MWT2 reporting problem due to a misconfiguration after the upgrade; being corrected.
    • Doug: what are the long-term plans for LOCALGROUPDISK? Quotas? When will Rucio provide this? Kaushik: the policy will be set by the RAC, but we need tools, which have been in short supply; in any case we need a policy first.
    • Kaushik will bring this up with the RAC. Mayuko (UTA, but stationed at BNL) will be working on distributed operations and taking shifts.
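
    A minimal sketch, not an agreed procedure, of the cutoff-date USERDISK cleanup idea above. It assumes a plain-text dump with one "dataset-name YYYY-MM-DD" pair per line; the dump file name, format, and cutoff date are illustrative assumptions, not an existing tool.

        # Hypothetical sketch: list USERDISK datasets older than a cutoff date so they can be
        # fed to the normal deletion machinery. Input format and file name are assumptions.
        import sys
        from datetime import datetime

        CUTOFF = datetime(2013, 1, 1)          # example cutoff; the real date is a policy choice

        def old_datasets(dump_path):
            """Yield dataset names whose recorded creation date is older than CUTOFF."""
            with open(dump_path) as dump:
                for line in dump:
                    fields = line.split()
                    if len(fields) != 2:
                        continue                # skip blank or malformed lines
                    name, created = fields
                    if datetime.strptime(created, "%Y-%m-%d") < CUTOFF:
                        yield name

        if __name__ == "__main__":
            for ds in old_datasets(sys.argv[1]):    # e.g. userdisk_dump.txt
                print(ds)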

DDM Operations (Hiro)

  • this meeting:
    • Rucio re-naming progress: AGLT2 and SLAC are now renaming; MWT2 will start tomorrow. Estimated 10 days to completion; SLAC estimates two weeks.
    • Rename FDR datasets - Hiro will send a script for sites to run.
    • Working on BNL - there is an issue: jobs are still writing non-Rucio files. BNL has 50M files to rename.
    • Doug: users with issues should send email to DAST.
    • In case of a BNL shutdown, we may need to move FTS and the LFC out of BNL. Michael: according to the NYT, a deal might have been reached. We need to have a contingency plan in place, possibly a cloud-based solution.
    • Cleanup issues - after the rename is complete, dark data should be simple to identify and delete (see the sketch below).
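
    A minimal, hypothetical sketch of the dark-data check referred to above: once all files on disk follow the Rucio naming convention, anything present in a storage namespace dump but absent from the catalog dump is dark and can be queued for deletion. The file names and one-path-per-line formats below are illustrative assumptions.

        # Hypothetical dark-data check: set difference between a storage dump and a catalog dump.
        def load_paths(dump_file):
            """Return the set of file paths listed one per line in a dump file."""
            with open(dump_file) as f:
                return set(line.strip() for line in f if line.strip())

        storage = load_paths("storage_dump.txt")   # namespace dump from the storage system
        catalog = load_paths("catalog_dump.txt")   # dump of the files the catalog knows about

        dark = storage - catalog       # on disk but not in the catalog: candidates for deletion
        missing = catalog - storage    # in the catalog but not on disk: worth investigating too

        with open("dark_files.txt", "w") as out:
            out.write("\n".join(sorted(dark)))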

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Not much progress. There is a concern about rpm-upgrade of xrootd. Horst provided a solution.
  • BU completed all the steps and the config seems to be correct, but for some reason the service quits right away after starting. Wei is debugging.
  • Question about FAX monitoring information - needs to be reported to the Rucio popularity database.

this week

  • N2N change needed for DPM has been checked in; also Lincoln has created rpms.
  • There is a bug fix needed in the US - the AGIS lookup is broken.
  • Wei will send out an email describing a needed N2N module fix for the Rucio hash calculation; this applies to both the Java and C++ versions. (A sketch of the path convention follows below.)
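
  For reference, a minimal sketch of the deterministic Rucio path convention that the N2N module has to compute: the path is built from the first bytes of md5("<scope>:<filename>"). The storage prefix and example names below are hypothetical, and user/group scopes containing dots are handled slightly differently in the full convention.

      # Sketch of the Rucio deterministic path derived from md5("<scope>:<filename>")
      import hashlib

      def rucio_path(scope, filename, prefix="/atlas/rucio"):
          """Return the deterministic storage path for a Rucio file replica."""
          h = hashlib.md5(("%s:%s" % (scope, filename)).encode("utf-8")).hexdigest()
          return "%s/%s/%s/%s/%s" % (prefix, scope, h[0:2], h[2:4], filename)

      # Example mapping of a (hypothetical) logical file name to its local path
      print(rucio_path("mc12_8TeV", "EVNT.01234567._000001.pool.root.1"))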

Site news and issues (all sites)

  • T1:
    • last meeting(s): Implementing a Science DMZ with a Juniper router; a prerequisite for the new 100g Arista router. Will be used to interface to the WAN circuit, for the transatlantic demonstration; connection through a recognized network architecture. Moving to the next-generation OpenStack release (Grizzly), which reorganizes the way communication happens with the underlying network infrastructure and adds a new virtual layer for control. Client SDN controller to dynamically create VLANs; will be doing this in the context of cloud activities. (The lab contributed the equipment.) Relation to DYNES? That is really a hardware setup, not the app layer.
    • this meeting: MapR evaluation with 5 servers with 60 drives behind each; 8.8 Gbps from 200 farm nodes. Uses the MapR API rather than NFS. Compression turned off, checksums enabled, 100% cache miss: 4 TB in 7.5 minutes for a Hadoop copy. 2.5 Gbps over the network (saturating the 10g NICs). Will provide a writeup. We should have a dedicated session on this at some point.

  • AGLT2:
    • last meeting(s): Lots of CVMFS issues - manifests with HC clouds as "missing installation". Working with Dave Dykstra and Jakob Blomer on worker-node access. Too many files get pinned within the cache. Have a script which runs once a day, at a random time, to do the reconfig. Updating wn-client and the OSG releases; probably will do this well beforehand. Will send CVMFS info to the t2 list. 2.1.15 timescale? Unknown.
    • this meeting: Problem over the weekend with UPS units at UM. Networking problems resulted; fixed as of Monday evening. SHA-2 compliant almost everywhere, except the dCache xrootd doors. HEPiX 2013 in two weeks.

Need to track SHA-2 updates at sites.

  • NET2:
    • last meeting(s): Appears that network performance to BNL goes way down, resulting in DDM errors. Perfsonar needs investigation.
    • this week: Planning to do CVMFS 2.1.15. No work on SHA-2.

  • MWT2:
    • last meeting(s): Saw CVMFS problems as well. Seems to be related to OASIS - the catalog is so large (in 2.0.19 the caches were separate). Currently working on 2.1.14 - made sure the cache is large; no problem. SHA-2 compliant except for dCache; Sarah is working on the plan. Registration will be fixed in OIM (a separate service group for each instance). Working with OSG folks on a Condor Gratia bug (7.8.8): occasionally a job record claims 2B seconds used. An update to the Gratia Condor probe has a workaround.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): Have moved to the new dq2 client; implemented a new lsm to satisfy the new naming conventions. Dark data deletion, LFC orphans - working on a version of clean-pm that works on the output of CCC; will circulate it. FAX upgraded at CPB, implemented the new Rucio N2N; seems to be much better than before. Upgraded perfsonar to the latest version; still need to do the OIM registration. Possibly flaky hardware on the throughput node. Checked software versions for SHA-2 - will need to update wn-client.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Migration done and validation going well, almost complete, but currently working on hardware issues with the gatekeeper and IPMI network issues with the Lustre servers. Also still working on xrootd crashing our server; it seems to be triggering a Lustre bug.
    • this meeting: iDRAC Enterprise units are on order to fix the IPMI issues. Will require another downtime, probably the week of Oct 28. OU_OSCER_ATLAS releases validated, currently awaiting HC jobs to turn it on; apparently there is a problem with HC jobs to this site, and experts are investigating.

  • WT2:
    • last meeting(s): Will do the correct resource registration for perfsonar in OIM. lsm for SLAC running for both prod and analy, seems fine; will convert the rest of the SLAC site. Procurement order went out. Stabilized GT5 and LSF (a workaround; hoping OSG will provide a fix).
    • this meeting:


last meeting
  • Will have a meeting next week, to get back in phase.
  • Fall US ATLAS Facilities workshop. Tentative dates: December 11, 12. Location: University of Arizona (Tucson).
this meeting

-- RobertGardner - 16 Oct 2013



Attachment: MWT2-network-upgrade-status-v1.pdf (3644.1K) - RobertGardner, 16 Oct 2013 - 12:43