
MinutesNov132013

Introduction

Minutes of the Facilities Integration Program meeting, November 13, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Torre, Bob, Ilija, Rob, Saul, Michael, Joel, Shawn, Patrick, John Brunelle, Kaushik, Mark, Dave, Armen, Wei, Hiro, Mayuko
  • Apologies: Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Sites are requested to register updates to SiteCertificationP27 for facility integration tracking purposes
      • New Google docs for reported capacities, see CapacitySummary.
        • Review for: accuracy, retirements, and consistency with OIM
        • Update with upgrades as deployed this quarter.
      • Registration now open for US ATLAS Workshop on Distributed Computing, December 11-12, 2013: https://indico.cern.ch/conferenceDisplay.py?confId=281153
      • Interesting S&C workshop last week, https://indico.cern.ch/conferenceDisplay.py?confId=210658. Would like to get the Facilities more involved with HPC resources, which appear as cluster resources. E.g., TACC might be an initial environment for integration work.
      • We should begin looking for opportunistic resources on OSG, and integrate them into our workflow. Need to work through the integration issues - e.g. there were issues with Parrot on the TACC OS.
      • SWT2 collaboration meeting within the next two weeks.
    • this week
      • Managing LOCALGROUPDISK - Kaushik will report in two weeks.
      • Rucio renaming task underway; need to accelerate pace of renaming.
      • CPU upgrades, network upgrades
      • Kaushik: had an SWT2 meeting last week at OU. Lots of issues discussed, covering all aspects of planning and operations. Discussed alternative sources of funding for SWT2. Langston got an MRI grant. Now have 1300 additional CPUs for SWT2.

New production site: LUCILLE (Horst, Joel)

  • This is a new 34-node Sandy Bridge cluster: dual eight-core Intel(R) Xeon(R) E5-2650 0 @ 2.00GHz CPUs with 128 GB of RAM and a 1 TB hard drive per node, internal and external 10-gig networking, 110 TB of XFS storage, and a fully functioning Bestman2 SE. It is SHA-2 compliant.
  • As this is a non-pledged, non-dedicated, opportunistic, production-only resource (no analysis), there are currently no plans for deploying FAX. We have to talk to the network folks about LHCONE, and find funding for perfSONAR boxes.
  • 26 compute nodes
  • 10g connectivity to OU
  • Latency between LU and other sites. 40 ms UTA to OU.
  • Have some nodes contributed by Bioinformatics - 5 nodes - will be integrated.
  • 100% ATLAS right now, will go down.
  • Condor. Using Rocks.
  • OSG opportunistic access.
  • Perfsonar support - willing to support it.

Reports on program-funded network upgrade activities

AGLT2

last meeting
  • Ordered Juniper EX9208 (100 Gbps on a channel) for both UM and MSU. Getting them installed now.
  • Will be retargeting some of the tier2 funds to complete the circuits between sites.
  • LR optics being purchased ($1200 per transceiver at the Junipers).
  • Need to get a 40g line card for the MX router on campus.
  • Probably a month away before 40g or 80g connectivity to CC NIE.
  • UM-MSU routing will be only 40g.
  • Likely end of November.
previous meeting
  • LR optics from ColorChip have been shipped. (for UM)
  • Still waiting on info to connect to the CC NIE router
  • Also, final budget info
  • Hope to get this by Friday.
this meeting
  • Juniper connected at 2x40g to the cluster; 100g in place to Chicago
  • New wavelength for MiLR
  • MSU router to be procured.

MWT2

last meeting

this meeting

  • Juniper in place at UC, connected to the SciDMZ
  • IU - still at 2x10g
  • UIUC - network configuration change next Wednesday; move the campus cluster consolidation switch to 100g.

SWT2-UTA

last meeting(s)
  • Replacing the 6248 backbone with a Z9000 as the central switch, plus additional satellite switches connected to the central switch, likely Dell 8132s.
  • Might even put compute nodes onto 8132Fs (5-6) at 10g; the 8132F has a QSFP module for uplinks.
  • Waiting for quotes from Dell
  • Michael: should look at per-port cost when considering compute nodes
  • Early December timeframe
  • 100g from campus - still no definite plans

last meeting (10/30/13)

  • Waiting for another set of quotes from Dell.
  • No news on 100g from campus; likely will be 10g to and from campus, though LEARN route will change.
  • Not sure what the prognosis is going to be for 100g. Kaushik has had discussions with OIT and networking management. There are 2x10g links at the moment.

this meeting (11/13/13)

  • Will get Dell quotes into purchasing this week; this is for the internal networking, close to storage.
  • Kaushik: we still have to meet with the new network manager at UTA.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting
  • AdHocComputeServerWG
  • SLAC: PO was sent to Dell, but now pulled back.
  • AGLT2:
  • NET2: have a request for quote to Dell for 38 nodes. Option for C6200s.
  • SWT2: no updates
  • MWT2: 48 R620 with Ivybridge - POs have gone out to Dell. 17 compute nodes.

this meeting:

  • AGLT2: have two quotes for R620s with differing memory. Some equipment money will go into networking; probably purchase 11-14 nodes.
  • NET2: quotes just arrived from Dell. Will likely go for the C6000s. Will submit immediately.
  • SWT2: putting together a package with Dell. Timing: have funds at OU; but not at UTA.
  • MWT2: 48 nodes

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy based routing); (2) a dedicated routing instance, i.e. a virtual router for LHCONE subnets; (3) physical routers acting as the gateway for LHCONE subnets. (A toy sketch of the PBR idea appears after this list.)
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.
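  • To illustrate option (1) above, the toy Python sketch below captures the decision a policy-based routing rule encodes: traffic bound for a subnet announced on LHCONE goes to a separate next hop, while everything else follows the normal campus default route. The prefixes and gateway addresses are made-up placeholders (documentation ranges), not real LHCONE announcements; this is a conceptual illustration, not an actual router configuration.

    import ipaddress

    # Placeholder prefixes -- the real list comes from the LHCONE VRF announcements.
    LHCONE_PREFIXES = [ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")]

    DEFAULT_GW = "10.0.0.1"  # normal campus border router (placeholder)
    LHCONE_GW = "10.0.0.2"   # gateway toward the LHCONE VRF (placeholder)

    def pbr_next_hop(dst_ip):
        """Pick the next hop the way a PBR rule would: destinations inside an
        LHCONE-announced prefix are routed via the LHCONE gateway."""
        addr = ipaddress.ip_address(dst_ip)
        if any(addr in net for net in LHCONE_PREFIXES):
            return LHCONE_GW
        return DEFAULT_GW

    print(pbr_next_hop("192.0.2.17"))   # -> 10.0.0.2 (policy-routed to LHCONE)
    print(pbr_next_hop("203.0.113.5"))  # -> 10.0.0.1 (default campus route)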

previous meeting

  • NET2: status unclear: waiting on instructions from Mike O'Conner (unless there have been direct communications with Chuck). Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then will reestablish the BNL link. Believes the throughput matrix has improved (a packet loss problem seems to be resolved). Timeline unknown. Will ping the existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus; need to determine whether PBR can be implemented properly. Can provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of network staff. A new manager is coming online. Will see about implementing PBR. Update at the next meeting.

previous meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. Then perfSONAR issues, since resolved. Expect to have either the Tier 2 or the OSCER site done within a few weeks.
  • BU and Holyoke. Put the network engineers in touch. Still unknown when it will happen. Have not been able to extract a date to do it.
  • UTA - no progress.

previous meeting (9/4/13)

  • Updates?
  • UTA: meeting with the new network director scheduled for this Friday or next week. Back on the same page.

previous meeting (9/18/13)

  • Updates?
  • UTA - no update; trying to get on the new director's calendar before the next meeting.
  • BU & HU - made some headway with Chuck and Mike O'Conner. NOX at Holyoke to be at 100g in 6 months. (Michael: from LHCONE operations call, NOX will extend to MANLAN, initially 10g link on short notice; sounded promising.)
  • OU - OU network folks think we can be on LHCONE by Oct 1

previous meeting (10/16/13)

  • Updates?
  • UTA - had meeting with new director of campus network computing, and LEARN representative. Possible separate routing instance. Will meet with them tomorrow morning.
  • OU - new switch being purchased, that also sets a separate routing instance, so as to separate traffic.
  • BU - no news. HU will not join LHCONE? Michael: raises question of NET2 architecture. Saul: HU is connected by 2x10g links; discussing it with James.

previous meeting (10/30/13)

  • Updates?
  • UTA (Mark): There is a second 2x10g link into campus, a UT research network; the link is on campus. Trying to decide where the traffic should route.
  • OU (Horst):
  • BU (Saul): News from Chuck was that it would be very expensive (but hearing things second hand).

this meeting (11/13/13)

  • Updates?
  • UTA (Patrick, Kaushik): a previous attempt to peer with LHCONE failed and had to be backed out. Have had conversations with UTA and LEARN - now have options; there are additional paths. Estimate: the next couple of weeks.
  • OU (Horst):
    From Matt Runion:
    The fiber terminations are done.  We are still awaiting approval for a couple of connections within the 4PP datacenter.
    I've also begun coordination with the OSCER folks as to a date for installation and cutover for the new switch.  Unfortunately, with SC2013, cutover is unlikely until after Thanksgiving. 
    We're tentatively shooting for Wed the 4th or Wed the 11th for installation and cutover. (Wednesdays there is a built-in maintenance window for OSCER).
    Following that, some configuration/coordination with OneNet, and finally VLAN provisioning and router configuration.
    Realistically, factoring in holiday leave, end of semester, etc., I'm guessing it will be sometime in January before we have packets flowing in and out of LHCONE.
  • LU (Horst): Have to talk to OneNet and LU Networking folks.
  • BU (Saul): Nothing definitive, but met with the people at Holyoke who manage it. Spoke with Leo Donnelly. Not yet ready to proceed technically. Michael - is the dedicated BU-BNL circuit still used? Perhaps use it to connect NET2 to MANLAN and hook into the VRF.
  • HU (John): Same data center as BU. Just getting started with it.

Should we ask for a dedicated meeting with experts?

  • Yes, Shawn will convene a phone/video meeting for the network experts.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Had a user running multi-threaded jobs in the ANALY queues. Should we set up a dedicated multicore queue?
    • In production, these tend to be validation tasks, but they require only around 100 slots.
    • Bring this up at next week's software week.
  • this meeting:
    • There might be lulls in production. There is also beyond-pledge production to do.
    • Michael: there was a request that could be filled by HPC resources - massively parallel, long jobs. Could go to Argonne, Oak Ridge, and LBL.

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    Not available this week
    
    1)  10/31: New pilot version from Paul (v58f).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58f.html
    2)  11/2: BNL - SRM outage.  From Michael: The problem is fixed. A hardware failure in the SSD based RAIDset caused the outage. The server is back and the service is
    fully restored. eLog 46709. On 11/4 Hiro reported there were still some lingering issues associated with this incident. Problems fixed as of 11/5 - see: 
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/46743, https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/46751.
    3)  11/5: ADC Weekly meeting:
    http://indico.cern.ch/conferenceDisplay.py?confId=277466
    
    Follow-ups from earlier reports:
    
    (i)  10/24: WISC - file transfers failing with "Unable to connect to c091.chtc.wisc.edu:2811 globus_xio: System error in connect: Connection timed out globus_xio: 
    A system call failed: Connection timed out." On 10/28 the site admin reported that the systems had been upgraded to slc6 and osg3.1, but there were some lingering 
    issues with the mapping of grid users.  Transfer errors returned on 10/30. https://ggus.eu/ws/ticket_info.php?ticket=98365 in-progress, eLog 46655.
    Update 11/4: site reported the problem was fixed.  ggus 98365 was closed, eLog 46733.
    

  • this week: Operations summary:
    
    Summary from the weekly ADCoS meeting:
    Not available this week
    
    1)  11/6 p.m.: MWT2 - file transfers failing with "[SECURITY_ERROR] globus_ftp_client: the server responded with an error 550 Permission denied]." Problem fixed - from 
    Sarah: Auth config changes yesterday caused the DDM admin user to be mapped to usatlas3 instead of usatlas1. That is corrected as of 11/7 a.m.  
    https://ggus.eu/ws/ticket_info.php?ticket=98694 was closed on 11/8, eLog 46787, 46813.  https://savannah.cern.ch/support/index.php?140605 (site was blacklisted during this period).
    2)  11/7: From John at NET2: We had one of the hosts handling our LSM i/o get in a bad state, creating a batch of both stage-in and stage-out errors at HU_ATLAS_Tier2 
    and ANALY_HU_ATLAS_Tier2.  The issue is fixed now.
    3)  11/8: SWT2_CPB - file transfers were failing with SRM errors.  Issue was actually due to a loss of DNS in the cluster.  Restarting 'named' on the admin node fixed the problem.
    4)  11/9: WISC - https://ggus.eu/ws/ticket_info.php?ticket=98365 was re-opened when file transfer failures reappeared ("Unable to connect to c091.chtc.wisc.edu:2811 globus_xio: 
    System error in connect: Connection timed out"). eLog 46835. Still see the errors on 11/13.
    5)  11/10 p.m.: UTA_SWT2 - lost the main circuit which provides WAN connectivity to the cluster.  Networking experts restored the link overnight.  OIM downtime canceled 
    and all services back on-line 11/11 early a.m.
    6)  11/12 p.m.: SWT2_CPB - xrootd process died on a storage server, resulting in failed transfers over a several hour period.  No ticket, since the issue had been resolved by 
    the time the shifter noticed the problem, and some of the earlier failed transfers had started to succeed.
    7)  11/12: ADC Weekly meeting:
    http://indico.cern.ch/conferenceDisplay.py?confId=281271
    
    Follow-ups from earlier reports:
    
    None
    

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • DATADISK discussion about primary data. The primary level should be around 50% rather than 80%.
    • Victor is not reporting correctly. Discrepancies with local availability and the report - following up.
    • Kaushik: need to stay on top of ADC - keep reminding them.
    • MWT2 DATADISK - can probably allocate more free space now that about 200 TB has been deleted.
  • this meeting:
    • Not much to report. There was a deletion problem with Lucille - understood now.
    • Request to DDM operations about reducing primary data at Tier 2s. There was some cleanup, but then filled again.
    • 500 TB at BNL that was related to a RAC request, "Extra" category. Armen will make a proposal.
    • Another 600 TB at BNL in "default" - status unknown, a difficult category to figure out.
    • USERDISK cleanup is scheduled for the end of next week.
    • Zero secondary datasets at BNL - meaning PD2P is shut down at BNL.
    • Is there any hope of using DATADISK more effectively, such that we could reduce usable capacity but replicate data by a factor of two? (E.g., a 1 PB DATADISK holding every dataset twice would provide roughly 500 TB of unique capacity.) Kaushik and Michael will get in touch with Borut.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Rucio re-naming progress. AGLT2 and SLAC are now renaming. MWT2 will start tomorrow. 10 day estimate to completion. SLAC: two weeks.
    • Rename FDR datasets - Hiro will send a script for sites to run
    • Working on BNL - there is an issue. Jobs are still writing non-rucio files. Has 50M files to rename.
    • Doug: User issues should send email to DAST
    • In case of a BNL shutdown, we may need to move FTS and LFC out of BNL. Michael: according to NYT a deal might have been reached. We need to have a contingency plan in place. Cloud solution contingency.
    • Cleanup issues - after the rename is complete, dark data should be simple to delete.
  • previous meeting (10/30/13):
    • Rucio re-naming: there is a problem with the script; Hiro is aware.
    • Ilija reporting on re-naming efforts: running at 5 Hz. Expect to complete re-naming in two days. It's about 2M files.
    • Saul: running now; expect to be finished in a few days.
    • We need to synchronize with Hiro. There was a problem at AGLT2 - problems no longer being found in the inventory. How do we validate?
    • UTA: finished UTA_SWT2 without problems; restarted at CPB. Seeing errors on about 1/3 of the renames.
    • OU: paused, waiting.
    • Wei believes there is a problem with the dump itself.
  • this meeting:
    • BNL FTS and DDM site services - problem - the subscription fails. Happened three times in the past three weeks.
    • Rucio re-naming: has an okay from Tomas to change the AGIS entries for all the US sites to end with /rucio as the path. Will do so today. Then all new files will be created following the Rucio convention. All Tier 2 DATADISKs are set that way. (A sketch of the deterministic path convention appears at the end of this section.)
    • The re-naming script had two bugs, and there was a bug in the dump file; this applies to all the sites.
    • Ilija and Sarah re-wrote the code, running at 10 Hz.
    • Can Hiro create a new dump file with the special paths needed for re-naming?
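    • For reference, the sketch below shows the deterministic Rucio-style path convention the renaming moves files to, as commonly implemented: two directory levels are taken from the md5 of "scope:filename" under a /rucio prefix. This is also the hash calculation the FAX N2N module needs (see the FAX section below). The prefix and the example file name are illustrative assumptions, not authoritative values; consult the DDM/Rucio documentation for the exact per-site convention.

      import hashlib

      def rucio_pfn(scope, name, prefix="/atlas/rucio"):
          """Deterministic Rucio-style path: two directory levels come from the
          md5 of 'scope:name'. The prefix is a per-site assumption."""
          h = hashlib.md5("{0}:{1}".format(scope, name).encode("utf-8")).hexdigest()
          return "{0}/{1}/{2}/{3}/{4}".format(prefix, scope, h[0:2], h[2:4], name)

      # Illustrative (made-up) dataset file name:
      print(rucio_pfn("mc12_8TeV", "EVNT.01234567._000001.pool.root.1"))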

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • N2N change needed for DPM has been checked in; also Lincoln has created rpms.
  • There is a bugfix needed in the US - AGIS lookup broken.
  • Wei will send out an email describing a needed N2N module fix for Rucio hash calculation. This will be for both Java and C++ versions.

previous week (10/30/13)

  • Wei - made a tag for the rpm to get into the WLCG repo. Just need a confirmation from Lincoln.
  • Ilija - N2N: tested at MWT2, all worked correctly. Tested at ECDF (a DPM SE), but caused the door to crash.
  • Deployment: SWT2 service is down. BNL - has an old version, cannot do Rucio N2N.
  • SWT2_CPB: Patrick - not sure what is going on. The service won't stay running. Tried to put in the latest N2N last night. Running as a proxy server. Can Andy help with this?
  • BNL - running 3.3.1. Needs to upgrade to 3.3.3 and the new Rucio N2N; would also like glrd updated.
  • The rest of the US facility is okay.
  • Working with Valeri on changing database schema to hold more historical data. Prognosis? Next few days. No big issues.
  • Episode at HU - 4600 failed jobs failed over to FAX; most recovered normally. Saul believes the local site mover was disabled for a period.
  • Two Polish sites joined - but into which cloud? (Wahid coordinated.)
  • Horst: notices downstream redirection is failing - Ilija is investigating.

this week (11/13/13)

  • Starting to push N2N at DE sites, and are finding a few minor bugs, will produce a new rpm.
  • New deployment document to review
  • Hiro has updated BNL - latest Xrootd 3.3.3 and N2N
  • Mail from Gerd - stat issue with dCache will be solved.
  • Ilija - noted there were a large number of FAX failovers that were pilot config related. Will be fixed.
  • Ilija - working with Valeri Fine to get better monitoring for FAX failovers.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Still evaluating alternative storage solutions: object storage technology, as well as GPFS. Have completed the first stage of moving to the 100g infrastructure; will demonstrate between CERN and BNL, November 12-14, in full production.
    • this meeting:

  • AGLT2:
    • last meeting(s): Problem over the weekend with UPS units at UM; networking problems resulted, fixed as of Monday evening. SHA-2 compliant almost everywhere, except the dCache xrootd doors. HEPiX 2013 in two weeks.
    • this meeting: HEPiX meeting was a great success. Networking for SC2013. OpenFlow does not work on stacked F10 switches. dCache upgrade the week after Thanksgiving, along with a Condor upgrade.

  • NET2:
    • last meeting(s): Planning to do CVMFS 2.1.15. No work on SHA-2.
    • this week: Will get going on LHCONE using the existing BNL-BU link. SAM was broken for a while - having trouble getting both SAM and availability/reliability working correctly; there may be a problem with the OIM configuration. Updated to CVMFS 2.1.15. Still working on SHA-2 compatibility.

  • MWT2:
    • last meeting(s): Rucio conversion, network upgrades. IU will be doing some reconfiguring when UC does (preparing for this now).
    • this meeting: The major task was the dCache 2.6.15 upgrade. Upgraded CVMFS to 2.1.15. Went to Condor 8.0.4. Completely SHA-2 compliant. (Hiro: can you test?)

  • SWT2 (UTA):
    • last meeting(s): Tracking down an issue with the new Panda tracer functionality - the job is throwing an exception. Rucio re-naming. SHA-2: the only remaining item is the latest wn-client package. USERDISK is getting tight. Would like to get an LFC dump.
    • this meeting: How to test SHA-2? Michael: John Hover will test a SHA-2 compatible certificate at all the sites, both CE and storage. Contacted Von Welch. (A minimal certificate-check sketch follows below.)
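    • A minimal sketch (assuming openssl is available on the node) of how one might check whether a given certificate was signed with a SHA-2 family digest: inspect its signature algorithm and look for sha256/sha512. The certificate path is a placeholder, and this checks a single certificate only - it is not a full SHA-2 compatibility test of CE or storage services.

      import subprocess

      def signature_algorithm(cert_path):
          """Return the signature algorithm of a PEM X.509 certificate via openssl."""
          text = subprocess.check_output(
              ["openssl", "x509", "-in", cert_path, "-noout", "-text"]
          ).decode("utf-8")
          for line in text.splitlines():
              if "Signature Algorithm" in line:
                  return line.split(":", 1)[1].strip()
          return "unknown"

      # Placeholder path -- point this at a real host certificate or proxy.
      algo = signature_algorithm("/etc/grid-security/hostcert.pem")
      print(algo, "(SHA-2 family)" if ("sha256" in algo or "sha512" in algo) else "(not SHA-2)")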

  • SWT2 (OU, OSCER):
    • last meeting(s): SHA-2 compliant except for Xrootd.
    • this meeting: All updated and green.

  • SWT2 (LU):
    • last meeting(s):
    • this meeting: Fully functional, operational, and active.

  • WT2:
    • last meeting(s): Upgraded rpm-based OSG services, now SHA-2 compliant. One machine is doing Gratia reporting. Still have GUMS and MyProxy services to do. Working with astrophysics groups to get ATLAS jobs running on their HPC cluster (so they need to co-exist).
    • this meeting:

AOB

last meeting
this meeting


-- RobertGardner - 12 Nov 2013
