
MinutesJan222014

Introduction

Minutes of the Facilities Integration Program meeting, January 22, 2014

Attending

  • Meeting attendees:
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New Google docs for reported capacities, see CapacitySummary.
        • Review for: accuracy, retirements, and consistency with OIM
        • Update with upgrades as deployed this quarter.
      • Arizona meeting. 13 registrants.
      • New information in SiteCertificationP27 - Rucio renaming column information.
      • Multicore resources. Two Tier2 sites have not yet provided MCORE resources; they will provide them.
      • LHCONE - a proposal was made to increase transatlantic capacity using the new 100g link and integrate it with the existing VRFs. This will allow running pilot applications that require the bandwidth. Use FAX as a demonstrator. Count on a four-week period.
      • Queue consolidation - how to handle Panda queues in the future. Discussed at SW at CERN - want to follow up. Artificial separation between PROD and ANALY queues. Requires lots of coding changes in Panda. Wei: How would requirements be communicated to the job scheduler? Kaushik: will require changes in autopyfactory, but it's possible. Michael: goals are to reduce fragmentation at the site level, while allowing priority adjustments at the Panda level. Kaushik: not yet on the Panda to-do list (most work is now on DEFT and JEDI). Kaushik: will get to this this year, will provide a proposal. BNL can be used as an integration test site.
      • Management of LOCALGROUPDISK; Kaushik and Armen are discussing it.
      • Observed issue with single streams > 10Gbps, as reported by Brian Tierney at the Tucson meeting. Would like to work with the ESnet outreach team to address the issue. Ho
    • this week
      • Final updates to SiteCertificationP27 and v29 of the Google Docs spreadsheet in CapacitySummary are due for the quarterly report.
        • screenshot_01.jpg (attached)
      • Some activities for next quarter:
        • MCORE availability
        • DC14 readiness tests
          • Want to make it easy to deploy at other sites, taking out
          • Invite John to explain openstack environment at BNL
        • Facility-wide flexible infrastructure activity
      • Transatlantic FAX demonstrator. The goal is to demonstrate large-scale use of the 100 Gbps transatlantic test circuit using direct reads or copies of data via FAX from multiple sites in the U.S.
        • Writing this up. To be used by the network providers on both sides. Timeframe is early Feb. Adding the 100g circuit later in the month. What performance can be achieved with direct reads?
      • ATLAS Connect
        • Starting some beta testing now. http://connect.usatlas.org
        • Currently, bosco flocking with small quotas is in place to MWT2, AGLT2, and CSU-Fresno
        • http://connect.usatlas.org/accounting-summary/
        • Working on connecting to TACC-Stampede (first steps); later, add autopyfactory to support Panda jobs to these targets.
        • Full description is available in Tier3 implementation committee google docs
      • Last week's Condo-of-Condos meeting at NCSA, http://www.ncsa.illinois.edu/Conferences/ARCC/. Met with some of our institutional campus computing partners (in particular the Holyoke and OU directors), who expressed interest in ATLAS Connect and more generally http://ci-connect.net.

Managing LOCALGROUPDISK at all sites (Kaushik)

previously
  • LOCALGROUPDISK - first draft from Armen, Kaushik reviewing.
  • Beyond pledge production storage
  • Tools will be needed for policy enforcement.
  • Rucio features for quota management not available yet.
  • Hard limits versus soft limits. Enforcement.
  • Will present the plan in Arizona.

this meeting

  • See Indico for slides.
  • Q: (Doug) why 3 TB? 30 TB total. A: (Kaushik): 3 TB x 300 users ~ 1 PB, which would be un-pledged resources. At what scale do we go to the RAC? Of course, if everyone used 30 TB we'd be out of space; we're assuming a factor of 10 overcommitment (see the back-of-the-envelope sketch after this list).
  • Q: (Doug): week interval? Kaushik: three warnings, then go to RAC before deleting.
  • Q: (Doug): sample lifetime? Ask the user to set a lifetime.
  • Q: (Rob): what's the granularity? A: (Kaushik): will put it in at the dataset level. Allow wildcards. Will need a lot of software, monitoring, and web-based front-ends. We have a rough idea of what it will be like. Will be ready before the next run. We think Hiro, Mayuko, and Armen will be doing the development.
  • Q: (Saul): is LGD for US users only? A: (Kaushik): yes, only U.S.
  • Q: (Doug): where does the policy go for approval? A: (Kaushik): reviewed within the facilities; then to analysis support (Jason & Erich); then to the RAC. Once approved by the RAC, give it to US ATLAS management. Throughout the process, this will be reviewed. Doug: the IB? A: good suggestion.
  • Q: (Doug): what about archival?
  • Michael: there is a natural opportunity to present at an IB meeting - there is a standard time slot, suggest we use it.
  • Michael: we reserve 20% for US physicists. We should be consistent. This is more like 2 PB. We should discuss where these go - presumably majority to LGD, but there may be others.
  • Kaushik - notes most users tend to stay at a single site; a user's entire quota can be consumed at one site.
  • Rob: notes
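A back-of-the-envelope sketch of the quota arithmetic discussed above, using only the numbers quoted in the Q&A (3 TB default, 30 TB cap, roughly 300 U.S. users); the script is illustrative only and not part of any proposed tooling:

    # Rough check of the LOCALGROUPDISK numbers quoted above; illustrative only.
    DEFAULT_QUOTA_TB = 3      # default per-user allocation (from the discussion)
    HARD_CAP_TB = 30          # per-user ceiling before going to the RAC
    N_USERS = 300             # approximate number of U.S. users

    nominal_tb = DEFAULT_QUOTA_TB * N_USERS    # ~900 TB, i.e. roughly 1 PB un-pledged
    worst_case_tb = HARD_CAP_TB * N_USERS      # 9,000 TB if every user hit the cap
    overcommit = worst_case_tb / nominal_tb    # the assumed factor-of-10 overcommitment

    print("nominal: %d TB, worst case: %d TB, overcommitment: ~%dx"
          % (nominal_tb, worst_case_tb, overcommit))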

Reports on program-funded network upgrade activities

AGLT2

last meeting
  • Ordered Juniper EX9208 (100 Gbps on a channel) for both UM and MSU. Getting them installed now.
  • Will be retargeting some of the tier2 funds to complete the circuits between sites.
  • LR optics being purchased ($1200 per transceiver at the Junipers).
  • Need to get a 40g line card for the MX router on campus.
  • Probably a month away before 40g or 80g connectivity to CC NIE.
  • UM-MSU routing will be only 40g.
  • Likely end of November.
previous meeting
  • LR optics from ColorChip have been shipped. (for UM)
  • Still waiting on info to connect to the CC NIE router
  • Also, final budget info
  • Hope to get this by Friday.
prior meeting, 11/13
  • Juniper connection at 2x40g to the cluster in place; 100g in place to Chicago
  • New wavelength for Mylar
  • MSU router to be procured.
prior meeting, (1/8/14)
  • MSU funding issues getting resolved; will soon order parts for the Juniper.
  • At UM, all parts have been ordered: 40g line card for the CC-NIE router.
this meeting, (1/22/14)
  • 40g line card got hung up, warranty problem.
  • Transponders ordered, given to Merit.
  • Will get a new esi

MWT2

last meeting(s)
  • Timeframe - end of November
  • Juniper in place at UC, connected to SciDMZ
  • IU - still at 2x10g
  • UIUC - network configuration change next Wednesday; move the campus cluster consolidation switch to 100g.
last meeting (11/27/13)
  • IU: Network: All our 10Gb hosts, including storage servers, are attached to one of two 4810 switches, each with a 4x10Gb uplink. The 1Gb hosts are on the 6248 switch stack, which is connected to our 100Gb switch via a 2x10Gb uplink. The two pieces we are missing are the VLT connections between the 4810 switches, and moving the 6248 switch stack to uplink to the 4810s. We attempted to move the 6248 to the 4810 when we moved the 10Gb hosts, but found the combination of the trunk to the 6248 and the VLT caused routing issues. We also found that the VLT was causing routing asymmetries for the directly-connected 10Gb hosts. We have the VLT disabled while we investigate that issue. We plan to roll out a new test config on Mon Dec 2, and to iterate on that through the week until we are in the final configuration.
  • Illinois: testing of 40 Gbps next week. There have been some checksum errors that are being investigated. 100Gb wave to 710S LSD: fiber cleaned, but not enough testing at load to know if it fixed the low-level checksum issue. Working with I2 to try to bring it up as a 40Gb link for testing; currently we have a 10Gb link. Plans for a go/no-go on 100Gb are a week from this Friday. A second wave via a west route (Peoria and up I-55) to 600W did not get funding via the CC-NIE grant; other funding sources are being looked into. On campus: the Campus Cluster consolidation switch is now directly connected to the CARNE router (100Gb Juniper). The current connection is a 2x10Gb LAG. The equipment for an 8x10Gb LAG is in place; however, there are not enough fibers between ACB and node-1 (where CARNE lives) for 8 connections. Spare fibers are not passing tests. We could pull more fibers, but the conduits are full; options are being looked into. We can use working fibers and add to the LAG without any downtime. So I believe right now we are limited to 10Gb to 710S LSD (uplink to Chicago), but the limit will soon be the 2x10Gb LAG (CCS to CARNE - 40Gb to Chicago), which will be raised as the LAG is increased. In two weeks we might have 100Gb.
  • UC: 40 Gbps to server room. Will start transitioning hosts next week to new VLANs.

prior meeting (1/8/14)

this meeting, (1/22/14)

  • IU: already done at 8x10g
  • UC: hardware and fiber in hand for an additional 6x10g
  • UIUC: working on fiber to ACB building where campus cluster resides. Word from John Towns: Imminent. Dave: 8x10g is now ready, but some infrastructure at the ACB is still needed.
  • Midwest connectivity issue for LHCONE: how do we get to whichever VRF we use? We need to arrange a meeting with I2, ESnet, MWT2, AGLT2, and OmniPoP. There is an issue since the LHCONE infrastructure is primarily 10g.

SWT2-UTA

last meeting(s)
  • Replacing the 6248 backbone with a Z9000 as the central switch, plus additional satellite switches connected to the central switch, likely Dell 8132s.
  • Might even put compute nodes into 8132Fs (5 - 6) at 10g. Has a QSFP module for uplinks.
  • Waiting for quotes from Dell
  • Michael: should look at per-port cost when considering compute nodes
  • Early December timeframe
  • 100g from campus - still no definite plans
last meeting (10/30/13)
  • Waiting for another set of quotes from Dell.
  • No news on 100g from campus; likely will be 10g to and from campus, though LEARN route will change.
  • Not sure what the prognosis is going to be for 100g. Kaushik has had discussions with OIT and networking management. There are 2x10g links at the moment.
last meeting (11/13/13)
  • Will get Dell quotes into purchasing this week; this is for the internal networking, close to storage.
  • Kaushik: we still have to meet with the new network manager at UTA.
previous meeting (11/27/13)
  • Had a long series of meetings last week with new director of networking. Much better understanding of the UTA networking roadmap. LEARN and UT system research networks. Problem is now coordinating among the different groups.
  • Right now there are multiple 10g links; two 100g links are coming soon. CIO is about to sign for this.
  • Provisioning has started for the campus. Will need to make sure we're plugged into it. Need to make sure SWT2 is as close to the edge router as possible - #1 priority. Will create a DMZ. Now a problem: currently exceeding 8 Gbps.
  • Logical diagram of WAN and LAN networks?
  • Michael: interested in the 2x100g beyond campus (e.g. to Internet2). How is LEARN connected?
  • OU: 20-40g coming. Will produce a diagram.
prior meeting (1/8/14)
  • Orders have been placed. Replacing stacks of 6248s. Waiting on delivery date.
  • Waiting on second round of funds.
this meeting, (1/22/14)
  • No change.
  • Michael: money should be available.
  • Kaushik: estimate 1 month.
  • First NS6000 plus some top-of-rack switches. The full buy includes two NS6000s, which will be LAG'd.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting
  • AdHocComputeServerWG
  • SLAC: PO was sent to Dell, but now pulled back.
  • AGLT2:
  • NET2: have a request for quote to Dell for 38 nodes. Option for C6200s.
  • SWT2: no updates
  • MWT2: 48 R620 with Ivybridge - POs have gone out to Dell. 17 compute nodes.
Previous meeting: (11/13/13)
  • AGLT2: have two quotes for R620s with differing memory. Some equipment money will go into networking; probably purchase 11-14 nodes.
  • NET2: quotes just arrived from Dell. Will likely go for the C6000s. Will submit immediately.
  • SWT2: putting together a package with Dell. Timing: have funds at OU; but not at UTA.
  • MWT2: 48 nodes
previous meeting: (11/27/13)
  • AGLT2:
  • MWT2:16 new servers for the UIUC ICC. Four servers are already running. 12 more coming early December.
  • NET2: Placed an order for 42 nodes. Not sure about delivery. Expect after Jan 1. Have not decided whether these will be BU or HU.
  • SWT2: Still waiting for next round of funding. Expect January or Feb.
prior meeting: (1/8/14)
  • AGLT2: three R620s arrived, another 16 ordered. Will be replacing PE1950s at UM.
  • MWT2: See FabricUpgradeP27#MWT2. 48 nodes at UC are racked, half are powered. Need to benchmark nodes, network cable to top-of-rack switches, complete power cabling. Expect to have online next week. UIUC: 12 of 16 nodes arrived and are online.
  • NET2: 42 C6220 nodes have arrived. Will install within the next few days. Install all on the BU side.
  • SWT2: Still waiting for next round of funding. Expect January or Feb.
this meeting, (1/22/14)
  • AGLT2: 19 R620s in hand. One racked, set up as a dynamic-slot machine. Decommissioning PE1950s. 17 running within the next week. Working on dynamic slots - but finding the multi-core jobs bursty. Plea to get better brokering of multi-core jobs when they are available. Kaushik: will have better luck with JEDI. Bob noticed there are different brokering issues going on here: a problem getting jobs moved into activated. Saul: release validation warning; seen at many sites. Doug: can analysis be used as backfill? Kaushik: we might hold off optimizing until there is more demand (at the moment, only 1200 jobs running).
  • MWT2: First 14 servers brought online Friday; benchmarking, testing opportunistic jobs. Work continues.
  • NET2: 42 nodes are installed. Expect to have them online in a day or two. Bob: HEP-SPEC numbers?
  • SWT2: NA
  • WT2: purchased 60 M620s; waiting. HS06 benchmarks will be needed.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance, i.e. a virtual router for LHCONE subnets; (3) physical routers as the gateway for LHCONE subnets. (A conceptual sketch of the common prefix-selection idea follows this list.)
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.
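All three options implement the same underlying idea: select traffic by whether the remote prefix is announced on LHCONE, and send it over the LHCONE path rather than the general path. A minimal conceptual sketch of that selection logic is below; the prefixes are placeholder documentation ranges and the function merely stands in for what real router configuration (PBR rules, a VRF/routing instance, or a dedicated gateway) would do:

    # Conceptual sketch only: prefix-based selection of the LHCONE path.
    # Real deployments do this in the router (PBR, a routing instance/VRF,
    # or a dedicated gateway); the prefixes below are documentation ranges,
    # not actual LHCONE announcements.
    import ipaddress

    LHCONE_PREFIXES = [
        ipaddress.ip_network("192.0.2.0/24"),     # placeholder prefix
        ipaddress.ip_network("198.51.100.0/24"),  # placeholder prefix
    ]

    def path_for(dst_ip):
        """Return which path a flow to dst_ip would take in this toy model."""
        addr = ipaddress.ip_address(dst_ip)
        if any(addr in net for net in LHCONE_PREFIXES):
            return "LHCONE path (VRF / dedicated routing instance)"
        return "general R&E / commodity path"

    print(path_for("192.0.2.17"))    # -> LHCONE path
    print(path_for("203.0.113.5"))   # -> general path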
previous meeting
  • NET2: status unsure; waiting on instructions from Mike O'Conner (unless there have been direct communications with Chuck). Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then re-establish the BNL link. Believes the throughput matrix has improved (a packet-loss problem seems to be resolved). Timeline unknown. Will ping existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus. Could PBR be implemented properly? Can provide an update after the visit.
previous meeting (8/14/13)
  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get a hold of network staff. A new manager coming online. Will see about implementing PBR. Update next.
previous meeting (8/21/13)
  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. Then perfSONAR issues, since resolved. Expect to have either a Tier 2 or the OSCER site done within a few.
  • BU and Holyoke. Put the network engineers in touch. Still unknown when it will happen. Have not been able to extract a date to do it.
  • UTA - no progress.
previous meeting (9/4/13)
  • Updates?
  • UTA: meeting with the new network director scheduled for this Friday or next week. Back on the page.
this meeting (9/18/13)
  • Updates?
  • UTA - no update; getting on the new director's manager. Before the next meeting.
  • BU & HU - made some headway with Chuck and Mike O'Conner. NOX at Holyoke to be at 100g in 6 months. (Michael: from LHCONE operations call, NOX will extend to MANLAN, initially 10g link on short notice; sounded promising.)
  • OU - OU network folks think we can be on LHCONE by Oct 1
previous meeting (10/16/13)
  • Updates?
  • UTA - had meeting with new director of campus network computing, and LEARN representative. Possible separate routing instance. Will meet with them tomorrow morning.
  • OU - new switch being purchased, that also sets a separate routing instance, so as to separate traffic.
  • BU - no news. HU will not join LHCONE? Michael: raises question of NET2 architecture. Saul: HU is connected by 2x10g links; discussing it with James.
previous meeting (10/30/13)
  • Updates?
  • UTA (Mark): There is a second 2x10g link into campus, a UT research network. The link is on campus. Trying to decide where the traffic should route.
  • OU (Horst):
  • BU (Saul): News from Chuck was that it would be very expensive (but hearing things second-hand).
previous meeting (11/13/13)
  • Updates?
  • UTA (Patrick). Kaushik, previous attempt to peer to LHCONE failed, had to back out of it. Have had conversations with UTA and LEARN - now have options, there are additional paths. Estimate - next couple of weeks.
  • OU (Horst):
    From Matt Runion: The fiber terminations are done. We are still awaiting approval for a couple of connections within the 4PP datacenter. I've also begun coordination with the OSCER folks as to a date for installation and cutover for the new switch. Unfortunately, with SC2013, cutover is unlikely until after Thanksgiving. We're tentatively shooting for Wed the 4th or Wed the 11th for installation and cutover. (Wednesdays there is a built-in maintenance window for OSCER.) Following that, some configuration/coordination with OneNet, and finally VLAN provisioning and router configuration. Realistically, factoring in holiday leave, end of semester, etc., I'm guessing it will be sometime in January before we have packets flowing in and out of LHCONE.
  • LU (Horst): Have to talk to OneNet and LU Networking folks.
  • BU (Saul): Nothing definitive, but met with people at Holyoke who manage it. Spoke with Leo Donnelly. Not yet ready to work technically. Michael - is the BU and BNL dedicated circuit still used? Perhaps use it to connect NET2 to MANLAN, and hook into the VRF.
  • HU (John): Same data center as BU. Just getting started with it.
SHOULD we ask for a dedicated meeting with experts?
  • Yes - Shawn will convene a phone/video meeting for network experts.
prior meeting (11/27/13)
  • UTA: campus has ordered Cisco switches, two weeks ago. 4500x switches. Expect to complete LHCONE peering before the holidays. Will this include the two Z9000's? No. Dell 4810.
  • OU: nothing new. Got info from Matt Runion for Shawn's document. Don't expect anything until after the new year - right after the beginning of the year, definitely. LU: will discuss following the new year.
  • BU: nothing new. Will have meeting on December 5 - will meet with Holyoke networking people. Next step for LHCONE? Expect nothing will happen until January.
  • Shall we convene a general Tier2-LHCONE meeting? Yes.
prior meeting (1/8/14)
  • UTA: need to meet with campus infrastructure people. Will schedule a meeting.
  • OU: no updates. Expect to hear something soon.
  • BU: no update. December 5 meeting: all agreed to use dedicated circuit to BNL, but no schedule. Then discussion about how 100g will get to the room (via Albany? Move NOX there?). Some time in 2014 will have 100g. Michael: can we make it work now?
  • Michael: At the LHCONE-LHCOPN Pasadena meeting it was agreed that it is timely to discuss the networking requirements of the LHC, on 10-11 Feb at CERN. The idea is to bring network providers together to see how the infrastructure should develop in the future. Perhaps merge the infrastructures? A refined usage of the LHCOPN allowing Tier2s? A comprehensive discussion is planned.
this meeting, (1/22/14)
  • UTA: Mark - have set up a machine
  • OU: some changes to 100g infrastructure.
  • BU: no change.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Multi-core decision not yet made, but may ask sites to split 50/50.
    • mc14 will be multi-core only. 1-2 weeks.
  • this meeting:
    • Completely full of production jobs. Plenty of analysis.
    • Getting requests to speed up US regional production; 1000 tasks waiting to be assigned. Wolfgang - when we're full, we're really full.

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    Not available this week
    
    1)  1/9: https://savannah.cern.ch/support/index.php?141526 was opened for job failures at BNL_ATLAS_RCF with "Job killed by signal 15: Signal handler has set job result to 
    FAILED, ec = 1201." However, RCF is a known opportunistic resource, so not a site issue. Ticket was closed, eLog 47609.
    2)  1/9: BNL - file transfer failures ("Marking Space as Being Used failed"). Issue with the SRM quickly resolved. eLog 47613.
    3)  1/13: SLACXRD - power outage at the site, resulting in failed file transfers with SRM errors. Power was restored by the next day, but then another power outage occurred. 
    Site recovering from the outages. https://ggus.eu/ws/ticket_info.php?ticket=100265 in-progress, eLog 47662. Blacklisted.
    4)  1/14: BNL - Michael reported on an SE problem at the site.  Issue was traced to auto-vacuum kicking in to clean the SRM postgres database
    from obsolete records, which turns out to be I/O intensive. Configuration was adjusted to reduce the effect of the cleaning process. eLog 47664.
    5) 1/14: File transfers from ES/NCG-INGRID-PT => BNL were failing with " [INTERNAL_ERROR] Invalid SRM version." Issue was due to a misconfiguration at the FTS level at BNL, 
    and was fixed. https://ggus.eu/ws/ticket_info.php?ticket=100266 was closed on 1/15. eLog 47678.
    6)  1/14: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=294394
    
    Follow-ups from earlier reports:
    
    (i)  12/12: WISC DDM deletion errors ("atlas07.cs.wisc.edu    [SE][srmRm][] httpg://atlas07.cs.wisc.edu:8443/srm/v2/server: CGSI-gSOAP running on voatlas311.cern.ch reports 
    Error reading token data header: Connection reset by peer"). https://ggus.eu/ws/ticket_info.php?ticket=99731, eLog 47329.
    Update 12/30: deletion errors continue - no response to the ticket from the site. eLog 47514.
    (ii)  1/7: MWT2 - frontier squid at the site was shown as down in the monitor (had been in this state for several weeks) - update from Dave: Several weeks ago, the assigned IP 
    space for MWT2 nodes at UChicago was changed. It appears that the DNS registration of the newly assigned IP for uct2-grid1.uchicago.edu was overlooked. The node has been 
    up and operational for internal use, but DNS to sites outside UChicago does not reflect its new address. https://ggus.eu/ws/ticket_info.php?ticket=100091 in-progress, eLog 47574.
    Update 1/9: DNS and AGIS changes were made to reflect a new hostname. ggus 100091 was closed.
    1/13: The newly registered squid instance was not configured in the monitoring. https://ggus.eu/ws/ticket_info.php?ticket=100228 in-progress,
    

  • this week: Operations summary:
    Summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=294116 (ADCoS)
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=295993 (ADC Weekly)
    
    1)  1/17: BNL - source/destination file transfer failures with SRM errors. The SRM database had again become unresponsive for a short period because of postgres auto-vacuum 
    operations to remove obsolete db entries. See message from Hiro on 1/21 to the usatlas-ddm-l@lists.bnl.gov list with details of changes made at BNL to address this problem. 
    https://ggus.eu/ws/ticket_info.php?ticket=100350 was closed, eLog 47731, 47763.
    2)  1/21: WISC - file transfer failures with the error "Unable to connect to c091.chtc.wisc.edu:2811." https://ggus.eu/ws/ticket_info.php?ticket=100406 was closed on 1/22 after the 
    site admin reported the problem was fixed (along with the DDM deletion problem - see (i) in the follow-up section below). eLog 47798.
    3)  1/21: NET2: transfers were failing with "certificate has expired." Updated certificate installed, issue resolved.
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/47783
    4)  1/21: MWT2- transfers were failing with errors like "Cannot determine address of local host," etc. Sarah reported a problem with the internal DNS. Issue was resolved. 
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/47787
    5)  1/21: UTA_SWT2: transfers were failing with "certificate has expired." Updated certificate installed, issue resolved.
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/47789
    6)  1/14: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=295993
    
    Follow-ups from earlier reports:
    
    (i)  12/12: WISC DDM deletion errors ("atlas07.cs.wisc.edu    [SE][srmRm][] httpg://atlas07.cs.wisc.edu:8443/srm/v2/server: CGSI-gSOAP running on voatlas311.cern.ch reports 
    Error reading token data header: Connection reset by peer"). https://ggus.eu/ws/ticket_info.php?ticket=99731, eLog 47329.
    Update 12/30: deletion errors continue - no response to the ticket from the site. eLog 47514.
    Update 1/22: site admin reported the problem was fixed. No recent DDM deletion errors. ggus 99731 was closed, eLog 47797.
    (ii)  1/7: MWT2 - frontier squid at the site was shown as down in the monitor (had been in this state for several weeks) - update from Dave: Several weeks ago, the assigned IP 
    space for MWT2 nodes at UChicago was changed. It appears that the DNS registration of the newly assigned IP for uct2-grid1.uchicago.edu was overlooked. The node has been 
    up and operational for internal use, but DNS to sites outside UChicago does not reflect its new address. https://ggus.eu/ws/ticket_info.php?ticket=100091 in-progress, eLog 47574.
    Update 1/9: DNS and AGIS changes were made to reflect a new hostname. ggus 100091 was closed.
    1/13: The newly registered squid instance was not configured in the monitoring. https://ggus.eu/ws/ticket_info.php?ticket=100228 in-progress.
    1/15: The squid service now being monitored correctly. ggus 100228 was closed, eLog 47692.
    (iii)  1/13: SLACXRD - power outage at the site, resulting in failed file transfers with SRM errors. Power was restored by the next day, but then another power outage occurred. 
    Site recovering from the outages. https://ggus.eu/ws/ticket_info.php?ticket=100265 in-progress, eLog 47662. Blacklisted.
    Update 1/16: All issues related to the power outages resolved. ggus 100265 was closed, eLog 47721.
    

  • Finally got a response from UWISC on a very old ticket.
  • SMU missing data problem is being dealt with. Doug: is support at SMU slipping (as Justin left)? Mark: seems to be.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Armen: DATADISK issue at BNL. Default type of data must be defined by ADC and wasn't. Then "Extra" needs to be defined. (Archive tags: primary, secondary, default, extra). BNL is 98% non-deleted. All new tasks stopped being assigned to the US cloud. Kaushik: raised it with ADC, which evidently has no plan.
    • Hiro: can move some files to tape.
    • Kaushik: ADC is responsible for managing DATADISK.
    • Michael: We need to act now, bring this up with computing management and ADC. There was no reaction from ADC at yesterday's meeting.
    • Armen: the main problem is primary datasets.
    • Kaushik will send a draft email to send to Borut and Richard.
    • Hiro: what is the future of Pandamover? Kaushik: plan was to abandon it with Rucio.
    • Armen: cleaned up 250 TB of LOCALGROUPDISK at BNL. 30 TB at MWT2.
    • USERDISK cleanup ...
  • this meeting:

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last meeting (11/27/13)

  • Wei: still working with German sites, deploying Rucio N2N, a few minor issues to resolve.
  • Deployment document updated.
  • Ilija: stability issue - the dcache-xrootd door stopped responding. Still trying to understand the cause. Working with Asoka to get a user script for optimal redirector location. Working with Valeri to get FAX failover monitoring improved. A few weeks at the earliest.
  • UTA stability issues. Wei gave Patrick some suggestions. A week of stability. Memory allocation set via an environment variable (since RHEL6). Configuration change in the xrootd configuration. Stress test?
  • Wei: prefers a small stress test on Rucio-converted sites.
  • Ilija - will be stress-testing MWT2. Also, there will be a change in notification for fax endpoint problems. A new test dataset has been distributed to all sites.

prior meeting (1/8/14)

  • BNL has enabled Rucio for its N2N, restarted the service.
  • All other US sites have updated and are stably running.
  • Real time mailing proved to be rather infrequent.
  • Still a lot of sites, mainly in the FR and DE clouds, have not deployed the Rucio-enabled N2N.
  • localSetupFAX deployed - significantly simplifies the end user's life.
  • Still debugging cmsd instabilities at a few sites.

this meeting (1/22/14)

  • Ilija - doing some stress testing. There were some test datasets missing - one remains to be restored at BNL.
  • LFC-free N2N - adds stability for FAX
  • Please use the new Rucio gLFNs (an illustrative sketch follows this list)
  • Wei - discovered a load condition with new N2N, awaiting Andy to troubleshoot, could be an Xrootd problem itself.
  • Rucio renaming status: UTA - should be done, but Patrick wants to check a few things.
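For context, a minimal sketch of fetching a file through a FAX redirector using a Rucio-style gLFN; the redirector host, scope, and filename are placeholders, and the /atlas/rucio/<scope>:<filename> form is an assumption of this sketch rather than something stated in the minutes:

    # Illustrative only: copy one file through a FAX redirector with xrdcp,
    # using an assumed Rucio-style gLFN. The hostname, scope, and filename
    # are placeholders, not real endpoints or datasets.
    import subprocess

    REDIRECTOR = "root://fax-redirector.example.org:1094"    # placeholder host
    GLFN = "/atlas/rucio/user.someuser:some.test.file.root"  # assumed gLFN form

    # Equivalent command line: xrdcp <redirector>/<gLFN> local.root
    subprocess.check_call(["xrdcp", REDIRECTOR + "/" + GLFN, "local.root"])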

Site news and issues (all sites)

  • T1:
    • last meeting(s): New Arista switch is providing many internal 100g links. New dcache version running.
    • this meeting: Buying > 2 PB of storage, replacing old storage; progressing. Replacing F10 equipment. Moving towards a high-capacity interlink fabric, 3 Arista 7500s (100g trunks between cores). Discovered problems with the space manager in dCache 2.6. (What settings did Hiro use to fix the problem?) Happens with a high file-transfer completion rate. Avoids growth.

  • AGLT2:
    • last meeting(s): Upgraded dcache to 2.6.19. Working on MCORE configuration.
    • this meeting: Looking into OMD - Open Monitoring Distribution

  • NET2:
    • last meeting(s): installing new nodes, bringing up new MCORE queues. Introducing new SGE queues.
    • this week: working WAN,

  • MWT2:
    • last meeting(s): Static MCORE queues now; will try out partitionable slots. dCache server testing on GPFS as a locality cache.
    • this meeting: David and Lincoln were able to get the 6248 connected to the Juniper - getting the new R620s online. Confirmation from UIUC of additional fiber inside ACB.

  • SWT2 (UTA):
    • last meeting(s): MCORE, LHCONE. Resurrecting some old UTA_SWT2 servers. CPB issues. Rucio.
    • this meeting:

  • SWT2 (OU, OSCER):
    • last meeting(s):
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting:

  • WT2:
    • last meeting(s): several Thors are having issues, no redundancy. Setting them in read-only mode. CPU purchases: 60 blades in receiving.
    • this meeting: converted 600 old cores to RHEL6 for ATLAS jobs. An additional 1600 are still running RHEL5; it's unlikely we'll use these. Site-wide power issue; some ATLAS nodes were affected.

AOB

last meeting
  • Need to check v29 spreadsheet for SL6.
this meeting
  • Wei: new WLCG availability matrix is being calculated, see email sent to usatlas-t2-l. Awaiting clarification from Alessandro about including grid3-locations.txt in the availability test.


-- RobertGardner - 22 Jan 2014


Attachments


screenshot_01.jpg (45.4K) | RobertGardner, 22 Jan 2014 - 12:29
 