r5 - 30 Oct 2013 - 14:37:14 - RobertGardner



Minutes of the Facilities Integration Program meeting, October 30, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843


  • Meeting attendees: Michael, Sarah, Rob, Bob, Ilija, Saul, Patrick, John Brunelle, Horst, Wei, Mayuko, Kaushik
  • Apologies: Mark Neubauer, Jason, Shawn, Dave
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Review of progress on network upgrades
      • Procurement updates
      • Rucio re-naming
      • December US ATLAS Computing Facilities meeting. December 11-12, 2013, University of Arizona (Tucson).
      • OSG All Hands Meeting announced:
        We are pleased to give you the first information for the 2014 OSG All Hands Meeting. This will be April 7-11, 2014 at the SLAC National Accelerator Lab in California. http://www.slac.stanford.edu/.
        The schedule will follow the successful format from previous years:
          * US ATLAS and US CMS distributed facility – Tier-2 and Tier-3 – and the next Campus Infrastructure Community (CIC)  meetings on the Monday and Tuesday.
          * Plenary talks from scientists, researchers and OSG leaders on the Wednesday.
          * "Ask the Experts" and other workshops on Thursday.
          * And the OSG Council face-to-face – open to Consortium members – at the end of the week. 
        Information about hotels and other logistics will be posted in about a month. Don't hesitate to contact us for more information, or if you are interested in contributing to and participating in the program planning and the program itself.
        Program Committee: osg-ahm-program@opensciencegrid.org
        Organizer: Amber Boehnlein, AHM2014 Host
      • Michael: potential impacts of a possible shutdown. Will try to do everything possible to keep the Tier 2s running. Storage alternatives should be discussed. We might consider making this a tracked program of work.
    • this week
      • Sites are requested to register updates to SiteCertificationP27 for facility integration tracking purposes
      • New Google docs for reported capacities, see CapacitySummary.
        • Review for: accuracy, retirements, and consistency with OIM
        • Update with upgrades as deployed this quarter.
      • Registration now open for US ATLAS Workshop on Distributed Computing, December 11-12, 2013: https://indico.cern.ch/conferenceDisplay.py?confId=281153
      • Interesting S&C workshop last week, https://indico.cern.ch/conferenceDisplay.py?confId=210658. Would like to get the Facilities more involved with HPC resources, which appear as cluster resources. E.g., TACC might be an initial environment for integration work.
      • We should begin looking for opportunistic resources on OSG and integrating them into our workflow. Need to work through the integration issues - e.g., there were issues with Parrot on the TACC OS.
      • SWT2 collaboration meeting within the next two weeks.

Reports on program-funded network upgrade activities


last meeting
  • Ordered Juniper EX9208 (100 Gbps on a channel) for both UM and MSU. Getting them installed now.
  • Will be retargeting some of the tier2 funds to complete the circuits between sites.
  • LR optics being purchased ($1200 per transceiver at the Junipers).
  • Need to get a 40g line card for the MX router on campus.
  • Probably a month away from 40g or 80g connectivity to CC-NIE.
  • UM-MSU routing will be only 40g.
  • Likely end of November.
this meeting
  • LR optics from ColorChip have been shipped. (for UM)
  • Still waiting on info needed to connect to the CC-NIE router
  • Also, final budget info
  • Hope to get this by Friday.


last meeting

this meeting

  • No update


last meeting
  • Replacing the 6248 backbone with a Z9000 as the central switch, plus additional satellite switches connected to it, likely Dell 8132s.
  • Might even put compute nodes into 8132Fs (5-6) at 10g; these have a QSFP module for uplinks.
  • Waiting for quotes from Dell
  • Michael: should look at per-port cost when considering compute nodes
  • Early December timeframe
  • 100g from campus - still no definite plans

this meeting

  • Waiting for another set of quotes from Dell.
  • No news on 100g from campus; likely will be 10g to and from campus, though the LEARN route will change.
  • Not sure what the prognosis is going to be for 100g. Kaushik has had discussions with OIT and networking management. There are 2x10g links at the moment.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting
  • AdHocComputeServerWG
  • SLAC: PO was sent to Dell, but now pulled back.
  • AGLT2: Intel came through - have new quotes, getting them re-worked.
  • NET2:
  • SWT2:
  • MWT2: R620 with Ivy Bridge

this meeting:

  • AGLT2:
  • NET2: have a request for quote to Dell for 38 nodes. Option for C6200s.
  • SWT2: no updates
  • MWT2: 48 R620s with Ivy Bridge - POs have gone out to Dell. 17 compute nodes.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE


  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. It will have examples of how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance - a virtual router for LHCONE subnets; (3) physical routers as gateways for LHCONE subnets.
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.

previous meeting

  • NET2: status unsure: waiting on instructions from Mike O'Conner, unless there have been direct communications with Chuck. Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then will reestablish the BNL link. Believes the throughput matrix has improved (a packet loss problem seems to be resolved). Timeline unknown. Will ping existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus. Investigating whether PBR could be implemented properly. Can provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of network staff. A new manager is coming online. Will see about implementing PBR. Update at the next meeting.

previous meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. Then perfsonar issues, since resolved. Expect to have either the Tier 2 or the OSCER site done soon.
  • BU and Holyoke: put the network engineers in touch. Still unknown when it will happen; have not been able to extract a date.
  • UTA - no progress.

previous meeting (9/4/13)

  • Updates?
  • UTA: meeting with the new network director scheduled for this Friday or next week. Back on track.

previous meeting (9/18/13)

  • Updates?
  • UTA - no update; will get time with the new director's manager before the next meeting.
  • BU & HU - made some headway with Chuck and Mike O'Conner. NOX at Holyoke to be at 100g in 6 months. (Michael: from LHCONE operations call, NOX will extend to MANLAN, initially 10g link on short notice; sounded promising.)
  • OU - OU network folks think we can be on LHCONE by Oct 1

previous meeting (10/16/13)

  • Updates?
  • UTA - had meeting with new director of campus network computing, and LEARN representative. Possible separate routing instance. Will meet with them tomorrow morning.
  • OU - new switch being purchased, that also sets a separate routing instance, so as to separate traffic.
  • BU - no news. HU will not join LHCONE? Michael: raises question of NET2 architecture. Saul: HU is connected by 2x10g links; discussing it with James.

this meeting (10/30/13)

  • Updates?
  • UTA (Mark): There is a second 2x10g link into campus, a UT research network. The link is on campus. Trying to decide where the traffic should route.
  • OU (Horst):
  • BU (Saul): News from Chuck was that it would be very expensive (though hearing things second-hand).

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Had a user running multi-threaded jobs in the ANALY queues. Should we set up a dedicated queue?
    • In production these tend to be validation tasks, but they require only around 100 slots.
    • Bring this up at next week's software week.
  • this meeting:
    • See Torre's overview presentation

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    No meeting this week due to ATLAS computing & s/w week.  AMOD/ADCoS report:
    1) 10/16: MWT2 - https://ggus.eu/ws/ticket_info.php?ticket=98067 was opened for SRM transfer errors, but the site was in a scheduled downtime for dCache maintenance. Ticket closed, eLog 46481. (Probably a situation where FTS continues attempting to process transfers that were scheduled prior to the downtime.)
    2) 10/17: 2.1 upgrade for the ATLAS DDM Dashboard - see:
    3) 10/17: AGLT2 - file transfers failing with "[GRIDFTP_ERROR] globus_gass_copy_register_url_to_url transfer timed out]." From Bob: The dCacheDomain process began to throw errors at 9:30pm EDT, according to the log file. It was unresponsive, and has been restarted. Issue resolved, https://ggus.eu/ws/ticket_info.php?ticket=98143 was closed, eLog 46501.
    4) 10/21: NET2 - Saul reported a DDM problem at the site, so the storage and panda services were set off-line while the issue was being investigated. Hardware/GPFS problem was fixed, and services restored after ~five hours.
    5) 10/23: AGLT2 - Frontier service was unavailable. From Shawn: Both cache.aglt2.org and cache3.aglt2.org had their Frontier-squid services hung. Both have been restarted. Issue resolved, https://ggus.eu/ws/ticket_info.php?ticket=98332 was closed. eLog 46565.
    6) 10/23: SWT2_CPB - file transfer failures. A storage server was taken off-line when the NIC in the machine shut down due to a failed cooling fan. Problem fixed; earlier failed transfers eventually succeeded. eLog 46585.
    7) 10/23: ADC Weekly meeting - link to the ADC Operations session during computing & s/w week: https://indico.cern.ch/conferenceDisplay.py?confId=210658 (search for "ADC Operations")
    Follow-ups from earlier reports:

  • this week: Operations summary:
    Summary from the weekly ADCoS meeting:
    1) 10/24: SLAC - file transfers failing with "[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management." Issue quickly resolved by the site - https://ggus.eu/ws/ticket_info.php?ticket=98364 was closed. eLog 46581.
    2) 10/24: ANLASC - file transfers failing with "checksum mismatch" errors (i.e., "[INTERNAL_ERROR] Checksum mismatch]"). Issue resolved - from Doug: fixed - use the xrootd-dsi interface for globus. https://ggus.eu/ws/ticket_info.php?ticket=98366 was closed, eLog 46612.
    3) 10/24: WISC - file transfers failing with "Unable to connect to c091.chtc.wisc.edu:2811 globus_xio: System error in connect: Connection timed out globus_xio: A system call failed: Connection timed out." On 10/28 the site admin reported that the systems had been upgraded to slc6 and osg3.1, but there were some lingering issues with the mapping of grid users. Transfer errors returned on 10/30. https://ggus.eu/ws/ticket_info.php?ticket=98365 in progress, eLog 46655.
    4) 10/29: HammerCloud testing was suspended while a database intervention was occurring. (This meant that auto-exclusion of sites was not available.) System back up as of 10/30 a.m. eLog 46637, 46648.
    5) 10/29: ADC Weekly meeting:
    Follow-ups from earlier reports:

  • Things have been quiet, good, no open or old tickets.
  • HC testing was offline - backend database intervention

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Various proddisk, userdisk, and localgroupdisk cleanup campaigns. Removed more than 300 TB.
    • MWT2 reporting problem from a misconfig after the upgrade. Problem being corrected.
    • Doug: what are the long-term plans for LOCALGROUPDISK? Quotas? When will Rucio provide this? Kaushik: policy will be provided by the RAC, but we need tools, which have been in short supply; first we need a policy.
    • Kaushik will bring this up with the RAC. Mayuko (UTA, but stationed at BNL) will be working on distributed operations and taking shifts.
  • this meeting:
    • DATADISK discussion about primary data. The primary level should be around 50% rather than 80%.
    • Victor is not reporting correctly. Discrepancies with local availability and the report - following up.
    • Kaushik: need to stay on top of ADC - keep reminding them.
    • MWT2 DATADISK - can probably allocate more free space now that about 200 TB has been deleted.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Rucio re-naming progress. AGLT2 and SLAC are now renaming. MWT2 will start tomorrow. 10 day estimate to completion. SLAC: two weeks.
    • Rename FDR datasets - Hiro will send a script for sites to run
    • Working on BNL - there is an issue: jobs are still writing non-Rucio files. 50M files to rename.
    • Doug: User issues should send email to DAST
    • In case of a BNL shutdown, we may need to move FTS and LFC out of BNL. Michael: according to the NYT a deal might have been reached. We need to have a contingency plan in place - a cloud solution contingency.
    • Cleanup issues - after the rename is complete, dark data should be simple to delete.
  • this meeting:
    • Rucio re-naming: there is a problem with the script; Hiro is following up.
    • Ilija reported on re-naming efforts: running at 5 Hz. Expect to complete re-naming in two days. It's about 2M files.
    • Saul: running now; expect to be finished in a few days.
    • We need to synchronize with Hiro. There was a problem at AGLT2 - problems no longer being found in the inventory. How do we validate?
    • UTA: finished UTA_SWT2 without problems; restarted at CPB. Seeing errors on about 1/3 of the renames.
    • OU: paused, waiting.
    • Wei believes there is a problem with the dump itself.
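As a back-of-the-envelope check on the renaming figures quoted above (about 2M files at a reported 5 Hz) - a hedged sketch; the implied-rate line is an inference, since a two-day finish would require a higher effective rate (e.g. several renaming streams in parallel):

```python
# Back-of-the-envelope renaming timeline, using figures quoted above.
files = 2_000_000        # approximate number of files to rename
rate_hz = 5              # reported sustained renaming rate

seconds = files / rate_hz
days = seconds / 86400.0          # about 4.6 days at a sustained 5 Hz

# A two-day finish would imply a higher effective rate,
# e.g. several renaming streams running in parallel.
implied_rate_hz = files / (2 * 86400.0)   # about 11.6 Hz

print(f"{days:.1f} days at 5 Hz; ~{implied_rate_hz:.1f} Hz needed for 2 days")
```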

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • N2N change needed for DPM has been checked in; also Lincoln has created rpms.
  • There is a bugfix needed in the US - the AGIS lookup is broken.
  • Wei will send out an email describing a needed N2N module fix for Rucio hash calculation. This will be for both Java and C++ versions.
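For context on the hash fix mentioned above: Rucio places files at a deterministic path derived from the md5 of "scope:name", which is what an N2N plugin must reproduce when translating global names to local paths. A minimal sketch of that convention in Python - the function name, scope, and filename below are illustrative, not from the minutes:

```python
import hashlib

def rucio_deterministic_path(scope, name, prefix="rucio"):
    # The Rucio convention hashes "scope:name" with md5 and uses the
    # first two hex-digit pairs as intermediate directories.
    md5 = hashlib.md5(f"{scope}:{name}".encode("utf-8")).hexdigest()
    return f"{prefix}/{scope}/{md5[0:2]}/{md5[2:4]}/{name}"

# Example with hypothetical scope/file names:
path = rucio_deterministic_path("user.jdoe", "myfile.root")
```

Both the Java and C++ N2N implementations mentioned by Wei would need to compute the same md5-derived subdirectories for lookups to resolve.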

this week

  • Wei - made a tag for the rpm to get into the WLCG repo. Just need a confirmation from Lincoln.
  • Ilija - N2N: tested at MWT2, all worked correctly. Tested at ECDF (a DPM SE), but caused the door to crash.
  • Deployment: SWT2 service is down. BNL - has an old version, cannot do Rucio N2N.
  • SWT2_CPB: Patrick - not sure what is going on; the service won't stay running. Tried installing the latest N2N last night. Running as a proxy server. Can Andy help with this?
  • BNL - running 3.3.1. Need to upgrade to 3.3.3 and the new Rucio N2N; would also like glrd updated.
  • The rest of the US facility is okay.
  • Working with Valeri on changing database schema to hold more historical data. Prognosis? Next few days. No big issues.
  • Episode at HU - 4600 jobs failed over; mostly recovered normally. Saul believes the local site mover was disabled for a period.
  • Two Polish sites joined - but which cloud? (Wahid coordinated.)
  • Horst: noticed downstream redirection is failing - Ilija is investigating.

Site news and issues (all sites)

  • T1:
    • last meeting(s): MapR evaluation with 5 servers, 60 drives behind each; 8.8 Gbps from 200 farm nodes. Uses the MapR API rather than NFS. Compression turned off, checksums enabled, 100% cache miss: 4 TB in 7.5 minutes (a Hadoop copy). 2.5 Gbps over the network (saturating the 10g NICs). Will provide a writeup. We should have a dedicated session on this at some point.
    • this meeting: Still evaluating alternative storage solutions: object storage technology, as well as GPFS. Have completed the first stage of moving to 100g infrastructure; will demonstrate between CERN and BNL November 12-14, then full production.

  • AGLT2:
    • last meeting(s): Problem over the weekend with UPS units at UM; networking problems resulted, fixed as of Monday evening. SHA-2 compliant almost everywhere, except the dCache xrootd doors. HEPiX 2013 in two weeks.
    • this meeting:

  • NET2:
    • last meeting(s): Planning to do CVMFS 2.1.15. No work on SHA-2.
    • this week: Converting to new CVMFS.

  • MWT2:
    • last meeting(s): Saw CVMFS problems as well. Seems to be related to OASIS - the catalog is so large (in 2.0.19 the caches were separate). Currently working on 2.1.14 - made sure the cache is large; no problem. SHA-2 compliant except for dCache; Sarah working on the plan. Registration will be fixed in OIM (a separate service group for each instance). Working with OSG folks on a Condor-Gratia bug in 7.8.8 - occasionally a job record claims 2B seconds used. An update to the Gratia Condor probe has a workaround.
    • this meeting: Rucio conversion, network upgrades. IU will be doing some reconfiguring when UC does (preparing for this now).

  • SWT2 (UTA):
    • last meeting(s): Have moved to the new dq2 client; implemented a new lsm to satisfy the new naming conventions. Dark data deletion, LFC orphans - working on a version of clean-pm to work on the output of CCC; will circulate. FAX upgraded at CPB, implemented the new Rucio N2N; seems to be much better than before. Upgraded perfsonar to the latest version. Still need to do OIM registration. Possibly flaky hardware in the throughput node. Checked versions of software for SHA-2 - will need to update wn-client.
    • this meeting: Tracking down an issue with the new Panda tracer functionality - a job is throwing an exception. Rucio re-naming underway. SHA-2: the only remaining item is the latest wn-client package. USERDISK is getting tight. Would like to get an LFC dump.

  • SWT2 (OU):
    • last meeting(s): iDRAC Enterprise modules on order to fix IPMI issues. Will require another downtime, probably the week of Oct 28. OU_OSCER_ATLAS releases validated, currently awaiting HC jobs to turn on; apparently there is a problem with HC jobs at this site, experts are investigating.
    • this meeting: SHA-2 compliant except for Xrootd.

  • WT2:
    • last meeting(s): Will do the correct resource registration for perfsonar in OIM. lsm for SLAC - running for prod and analy, seems fine; will convert the rest of the SLAC site. Procurement order went out. Stabilized GT5 and LSF (work-around in place; hoping OSG provides a proper fix).
    • this meeting: Upgraded rpm-based OSG services, now SHA-2 compliant. One machine is doing Gratia reporting. Still have GUMS and MyProxy services to do. Working with astrophysics groups to get ATLAS jobs running on their HPC cluster (so they need to co-exist).


last meeting
  • Will have a meeting next week, to get back in phase.
  • Fall US ATLAS Facilities workshop. Tentative dates: December 11, 12. Location: University of Arizona (Tucson).
this meeting

-- RobertGardner - 30 Oct 2013
