
MinutesNov272013

Introduction

Minutes of the Facilities Integration Program meeting, November 27, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Rob, Michael, Patrick, Saul, Dave, Myoko, Wei, Ilija, Mark, Kaushik, Horst
  • Apologies: Bob, Alden, Jason, Sarah, Fred
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Sites are requested to register updates to SiteCertificationP27 for facility integration tracking purposes
      • New Google docs for reported capacities, see CapacitySummary.
        • Review for: accuracy, retirements, and consistency with OIM
        • Update with upgrades as deployed this quarter.
      • Registration now open for US ATLAS Workshop on Distributed Computing, December 11-12, 2013: https://indico.cern.ch/conferenceDisplay.py?confId=281153
      • Interesting S&C workshop last week, https://indico.cern.ch/conferenceDisplay.py?confId=210658. Would like to get the Facilities more involved with HPC resources, which appear as cluster resources. E.g., TACC might be an initial environment to allow integration work.
      • Managing LOCALGROUPDISK - in two weeks Kaushik will report.
      • Rucio renaming task underway; need to accelerate pace of renaming.
      • CPU upgrades, network upgrades
      • Kaushik: had a SWT2 meeting last week at OU. Lots of issues discussed. All aspects of planning and operations. Discussed alternative sources of funding for SWT2. Langston got an MRI. Now have 1300 additional CPUs for SWT2.
    • this week
      • Arizona meeting. 13 registrants.
      • New information in SiteCertificationP27 - Rucio renaming column information.

Managing LOCALGROUPDISK at all sites (Kaushik)

  • LOCALGROUPDISK - first draft from Armen, Kaushik reviewing.
  • Beyond pledge production storage
  • Tools will be needed for policy enforcement.
  • Rucio features for quota management not available yet.
  • Hard limits versus soft limits. Enforcement.
  • Will present the plan in Arizona.

Reports on program-funded network upgrade activities

AGLT2

last meeting
  • Ordered Juniper EX9208 (100 Gbps on a channel) for both UM and MSU. Getting them installed now.
  • Will be retargeting some of the tier2 funds to complete the circuits between sites.
  • LR optics being purchased ($1200 per transceiver at the Junipers).
  • Need to get a 40g line card for the MX router on campus.
  • Probably a month away before 40g or 80g connectivity to CC NIE.
  • UM-MSU routing will be only 40g.
  • Likely end of November.
previous meeting
  • LR optics from ColorChip have been shipped (for UM).
  • Still waiting on info to connect to the CC NIE router
  • Also, final budget info
  • Hope to get this by Friday.
previous meeting, 11/13
  • Juniper connected at 2x40g to the cluster; 100g in place to Chicago
  • New wavelength for MiLR
  • MSU router to be procured.
this meeting, (11/27/13)

MWT2

last meeting(s)
  • See report in SiteCertificationP27
  • Timeframe - end of November
  • Juniper in place at UC, connected to the SciDMZ
  • IU - still at 2x10g
  • UIUC - network configuration change next Wednesday; move the campus cluster consolidation switch to 100g.
this meeting, (11/27/13)
  • IU:
    Network:
    All our 10Gb hosts, including storage servers, are attached to one of
    two 4810 switches, each with a 4x10Gb uplink. The 1Gb hosts are on the
    6248 switch stack, which is connected to our 100Gb switch via a 2x10Gb
    uplink. The two pieces we are missing are the VLT connections between
    the 4810 switches, and moving the 6248 switch stack to uplink to the
    4810s. We attempted to move the 6248 to the 4810 when we moved the 10Gb
    hosts, but found that the combination of the trunk to the 6248 and the
    VLT caused routing issues. We also found that the VLT was causing
    routing asymmetries for the directly-connected 10Gb hosts. We have the
    VLT disabled while we investigate that issue. We plan to roll out a new
    test config on Mon Dec 2, and to iterate on that through the week until
    we are in the final configuration.
  • Illinois: testing of 40 Gbps next week. There have been some checksum errors that are being investigated.
    100Gb wave to 710S LSD:

        Fiber cleaned, but not enough testing at load to know whether it fixed the low-level checksum issue.
        Working with I2 to try to bring it up as a 40Gb link for testing. Currently we have a 10Gb link.
        Plans for a go/no-go decision on 100Gb are a week from this Friday.

    Second wave via the west route (Peoria and up I-55) to 600W did not get funding via the CC-NIE grant.
    Other funding sources are being looked into.

    On campus:
        The Campus Cluster consolidation switch is now directly connected to the CARNE router (100Gb Juniper).
        The current connection is a 2x10Gb LAG. The equipment for an 8x10Gb LAG is in place; however,
        there are not enough fibers between ACB and node-1 (where CARNE lives) for 8 connections.
        Spare fibers are not passing tests. We could pull more fibers, but the conduits are full. Options are being looked into.
        We can use working fibers and add to the LAG without any downtime. So I believe right now we are limited
        to 10Gb to 710S LSD (uplink to Chicago), but the limit will soon be the 2x10Gb LAG (CCS to CARNE -
        40Gb to Chicago), which will be raised as the LAG is increased. In two weeks we might have 100Gb.
  • UC: 40 Gbps to server room. Will start transitioning hosts next week to new VLANs.

SWT2-UTA

last meeting(s)
  • Replacing the 6248 backbone with a Z9000 as the central switch, plus additional satellite switches connected to the central switch, likely Dell 8132s.
  • Might even put compute nodes into 8132Fs (5 - 6) at 10g. Has a QSFP module for uplinks.
  • Waiting for quotes from Dell
  • Michael: should look at per-port cost when considering compute nodes
  • Early December timeframe
  • 100g from campus - still no definite plans

last meeting (10/30/13)

  • Waiting for another set of quotes from Dell.
  • No news on 100g from campus; likely will be 10g to and from campus, though LEARN route will change.
  • Not sure what the prognosis is going to be for 100g. Kaushik has had discussions with OIT and networking management. There are 2x10g links at the moment.

last meeting (11/13/13)

  • Will get Dell quotes into purchasing this week; this is for the internal networking, close to storage.
  • Kaushik: we still have to meet with the new network manager at UTA.

this meeting (11/27/13)

  • Had a long series of meetings last week with new director of networking. Much better understanding of the UTA networking roadmap. LEARN and UT system research networks. Problem is now coordinating among the different groups.
  • Right now there are multiple 10g links; two 100g links are coming soon. CIO is about to sign for this.
  • Provisioning has started for the campus. Will need to make sure we're plugged into it. Need to make sure SWT2 is as close to the edge router as possible - #1 priority. Will create a DMZ. The current problem is that traffic is now exceeding 8 Gbps.
  • Logical diagram of WAN and LAN networks?
  • Michael: interested in the 2x100g beyond campus (e.g. to Internet2). How is LEARN connected?
  • OU: 20-40g coming. Will produce a diagram.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting
  • AdHocComputeServerWG
  • SLAC: PO was sent to Dell, but now pulled back.
  • AGLT2:
  • NET2: have a request for quote to Dell for 38 nodes. Option for C6200s.
  • SWT2: no updates
  • MWT2: 48 R620 with Ivybridge - POs have gone out to Dell. 17 compute nodes.

Previous meeting: (11/13/13)

  • AGLT2: have two quotes for R620s with differing memory. Some equipment money will go into networking; probably purchase 11-14 nodes.
  • NET2: quotes just arrived from Dell. Will likely go for the C6000s. Will submit immediately.
  • SWT2: putting together a package with Dell. Timing: have funds at OU; but not at UTA.
  • MWT2: 48 nodes

this meeting:

  • AGLT2:
  • MWT2: 16 new servers for the UIUC IIC. Four servers are already running. 12 more coming early December.
  • NET2: Placed an order for 42 nodes. Not sure about delivery. Expect after Jan 1. Have not decided whether these will be BU or HU.
  • SWT2: Still waiting for next round of funding. Expect January or Feb.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance - a virtual router for the LHCONE subnets; (3) physical routers as the gateway for the LHCONE subnets. (A minimal PBR sketch follows this list.)
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. The direct BNL-OU circuit has been restored. Will start on LHCONE next.
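
As an illustration of option (1), policy-based routing on a Linux-based gateway: the sketch below only generates the iproute2 commands that would steer traffic sourced from locally announced LHCONE subnets into a dedicated routing table whose default route points at the LHCONE next hop. The subnet list, table number, and gateway address are placeholders, and actual site configurations should follow Mike O'Conner's best-practices document rather than this sketch.

    #!/usr/bin/env python
    """Minimal sketch of policy-based routing (PBR) for LHCONE on a Linux gateway.

    Traffic *from* the locally announced LHCONE subnets consults a separate
    routing table whose default route is the LHCONE next hop; everything else
    keeps using the general-purpose path.  All values below are placeholders.
    """

    LHCONE_TABLE = 200                    # hypothetical routing-table number
    LHCONE_GATEWAY = "192.0.2.1"          # hypothetical LHCONE/VRF next hop
    LOCAL_LHCONE_SUBNETS = [              # hypothetical subnets announced to LHCONE
        "198.51.100.0/24",
        "203.0.113.0/25",
    ]

    def pbr_commands(subnets, table, gateway):
        """Return the iproute2 commands implementing the source-based policy."""
        cmds = ["ip route add default via %s table %d" % (gateway, table)]
        for net in subnets:
            # Packets sourced from this subnet consult the LHCONE table.
            cmds.append("ip rule add from %s table %d" % (net, table))
        return cmds

    if __name__ == "__main__":
        # Print the commands for review; an operator (or configuration
        # management) would apply them on the site's gateway host.
        for cmd in pbr_commands(LOCAL_LHCONE_SUBNETS, LHCONE_TABLE, LHCONE_GATEWAY):
            print(cmd)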

previous meeting

  • NET2: status unclear - waiting on instructions from Mike O'Conner (unless there have been direct communications with Chuck). Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then re-establish the BNL link. Believes the throughput matrix has improved (a packet loss problem seems to be resolved). Timeline unknown. Will ping existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus. Checking whether PBR can be implemented properly; will provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get a hold of network staff. A new manager coming online. Will see about implementing PBR; update at the next meeting.

previous meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. Then perfSONAR issues, which were resolved. Expect to have either the Tier 2 or the OSCER site done within a few weeks.
  • BU and Holyoke. Put the network engineers in touch. Still unknown when it will happen. Have not been able to extract a date to do it.
  • UTA - no progress.

previous meeting (9/4/13)

  • Updates?
  • UTA: meeting with the new network director scheduled for this Friday or next week. Getting back on the same page.

previous meeting (9/18/13)

  • Updates?
  • UTA - no update; working to get time with the new network director before the next meeting.
  • BU & HU - made some headway with Chuck and Mike O'Conner. NOX at Holyoke to be at 100g in 6 months. (Michael: from LHCONE operations call, NOX will extend to MANLAN, initially 10g link on short notice; sounded promising.)
  • OU - OU network folks think we can be on LHCONE by Oct 1

previous meeting (10/16/13)

  • Updates?
  • UTA - had meeting with new director of campus network computing, and LEARN representative. Possible separate routing instance. Will meet with them tomorrow morning.
  • OU - a new switch is being purchased that also provides a separate routing instance, so as to separate the traffic.
  • BU - no news. HU will not join LHCONE? Michael: raises question of NET2 architecture. Saul: HU is connected by 2x10g links; discussing it with James.

previous meeting (10/30/13)

  • Updates?
  • UTA (Mark): There is a second 2x10g link into campus, a UT research network. The link is already on campus; trying to decide where the traffic should route.
  • OU (Horst):
  • BU (Saul): News from Chuck was that it would be very expensive (but hearing things second-hand).

previous meeting (11/13/13)

  • Updates?
  • UTA (Patrick): Kaushik - a previous attempt to peer with LHCONE failed and had to be backed out. Have had conversations with UTA and LEARN - now have options; there are additional paths. Estimate: next couple of weeks.
  • OU (Horst):
    From Matt Runion:
    The fiber terminations are done.  We are still awaiting approval for a couple of connections within the 4PP datacenter.
    I've also begun coordination with the OSCER folks as to a date for installation and cutover for the new switch.  Unfortunately, with SC2013, cutover is unlikely until after Thanksgiving. 
    We're tentatively shooting for Wed the 4th or Wed the 11th for installation and cutover. (Wednesdays there is a built-in maintenance window for OSCER).
     Following that, some configuration/coordination with OneNet, and finally VLAN provisioning and router configuration.
     Realistically, factoring in holiday leave, end of semester, etc., I'm guessing it will be sometime in January before we have packets flowing in and out of LHCONE.
  • LU (Horst): Have to talk to OneNet and LU Networking folks.
  • BU (Saul): Nothing definitive, but met with people at Holyoke who manage it. Spoke with Leo Donnelly. Not yet ready to work technically. Michael - is the BU and BNL dedicated circuit still used? Perhaps use it to connect NET2 to MANLAN, and hook into the VRF.
  • HU (John): Same data center as BU. Just getting started with it.

SHOULD we ask for a dedicated meeting with experts?

  • Yes - Shawn will convene a phone/video meeting for the network experts.

this meeting (11/27/13)

  • UTA: the campus ordered Cisco 4500-X switches two weeks ago. Expect to complete LHCONE peering before the holidays. Will this include the two Z9000s? No - the Dell 4810.
  • OU: nothing new. Got info from Matt Runion for Shawn's document. Don't expect anything until after the new year - right after the beginning of the year, definitely. LU: will discuss following the new year.
  • BU: nothing new. Will have a meeting on December 5 with the Holyoke networking people. Next step for LHCONE? Expect nothing will happen until January.
  • Shall we convene a general Tier2-LHCONE meeting? Yes.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There might be lulls in production. There is also beyond-pledge production to do.
    • Michael: there was a request that could be filled by HPC resources. Massively parallel, long jobs. Could go to Argonne, Oak Ridge, and LBL.
  • this meeting:
    • Production sporadic. How long will it continue? The next big sample is mc13, "relatively early in 2014". A good time for downtimes.
    • Multi-core queues: how long will this happen?
      • SWT2 and NET2 will add MCORE queues next week. Good check of the system (a quick slot-availability sketch follows this list).
      • BNL - will be completely dynamic in the future using Condor.
      • Still have a large number of jobs waiting in analysis queues.
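
A quick way for a site to sanity-check MCORE capacity once a queue is added is sketched below: it asks condor_status for slots that could currently start a multi-core job. This is only an illustrative example, not an official ADC tool, and the 8-core/16 GB requirement is an assumed figure.

    #!/usr/bin/env python
    """Rough check of multi-core (MCORE) capacity in an HTCondor pool.

    Counts slots that could currently start an 8-core job.  The CPU and
    memory requirements below are assumed example values.
    """
    import subprocess

    REQ_CPUS = 8            # assumed cores per MCORE job
    REQ_MEMORY_MB = 16000   # assumed memory per MCORE job (MB)

    def mcore_capable_slots():
        """Return [(slot_name, cpus, memory_mb)] for slots that fit an MCORE job now."""
        out = subprocess.check_output([
            "condor_status",
            "-constraint", "Cpus >= %d && Memory >= %d" % (REQ_CPUS, REQ_MEMORY_MB),
            "-autoformat", "Name", "Cpus", "Memory",
        ])
        slots = []
        for line in out.decode().splitlines():
            name, cpus, mem = line.split()
            slots.append((name, int(cpus), int(mem)))
        return slots

    if __name__ == "__main__":
        slots = mcore_capable_slots()
        print("%d slots could start an %d-core job right now" % (len(slots), REQ_CPUS))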

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting (Armen):
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=283149
    
    1)  11/13: HU_ATLAS_Tier2 - jobs failing with "Failed to setup DBRelease properly." Problem was due to CVMFS issues on some WN's.  As of 11/18 HU upgraded to 
    CVMFS v2.1.15, which seems to have fixed the problem.  https://ggus.eu/ws/ticket_info.php?ticket=98860 closed, eLog 46971.
    2)  11/19 a.m.: AGLT2 - squid service at the site shown as down ('red') in the monitoring.  Bob reported that a disk filled on all three of the UM squids.  Space was cleaned and 
    the services restarted.  https://ggus.eu/ws/ticket_info.php?ticket=99008 was closed, but re-opened the next day when the squid service again was shown as down.  From Shawn: 
    The cache2.aglt2.org squid server's log filled the /var/log/squid partition - cleaned and squid restarted. ggus 99008 closed, eLog 47020.
    3)  11/19: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=283853
    
    Follow-ups from earlier reports:
    
    (i)  11/9: WISC - https://ggus.eu/ws/ticket_info.php?ticket=98365 was re-opened when file transfer failures reappeared ("Unable to connect to c091.chtc.wisc.edu:2811 globus_xio: 
    System error in connect: Connection timed out"). eLog 46835. Still see the errors on 11/13.
    Update 11/14: site reported the problem was fixed and closed ggus 98365.  However, errors still occurring, so ticket was again re-opened. eLog 46916.
    Update 11/19: site was requested to respond to the ticket.
    

  • this week: Operations summary:
    Summary from the weekly ADCoS meeting (Alexey Sedov):
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=283150
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/131126_AMODreport.pdf
    
    1)  11/21: OU_OSCER_ATLAS - jobs were failing with "staging input file failed" errors.  Not a site issue, but instead related to the maintenance downtime at OU_OCHEP_SWT2 
    around this time (OU_OSCER uses OCHEP for file storage). https://ggus.eu/ws/ticket_info.php?ticket=99051 was closed, eLog 47050.
    2)  11/22: MWT2 - lost files at the site. Issue tracked in https://savannah.cern.ch/support/index.php?140877.
    3)  11/26: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=285143
    
    Follow-ups from earlier reports:
    
    (i)  11/9: WISC - https://ggus.eu/ws/ticket_info.php?ticket=98365 was re-opened when file transfer failures reappeared ("Unable to connect to c091.chtc.wisc.edu:2811 globus_xio: 
    System error in connect: Connection timed out"). eLog 46835. Still see the errors on 11/13.
    Update 11/14: site reported the problem was fixed and closed ggus 98365.  However, errors still occurring, so ticket was again re-opened. eLog 46916.
    Update 11/19: site was requested to respond to the ticket.
    Update 11/25: site reported the problem was fixed, ggus 98365 was closed.
    

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • DATADISK discussion about primary data. The primary level should be around 50% rather than 80%.
    • Victor is not reporting correctly. Discrepancies with local availability and the report - following up.
    • Kaushik: need to stay on top of ADC - keep reminding them.
    • MWT2 DATADISK - can probably allocate more free space now that about 200 TB has been deleted.
  • previously:
    • Not much to report. There was a deletion problem with Lucille - understood now.
    • Request to DDM operations about reducing primary data at Tier 2s. There was some cleanup, but then filled again.
    • 500 TB at BNL that was related to a RAC request, "Extra" category. Armen will make a proposal.
    • Another 600 TB at BNL in "default" - status unknown, a difficult category to figure out.
    • USERDISK cleanup is scheduled for the end of next week.
    • Zero secondary datasets at BNL - meaning PD2P is shut down at BNL.
    • Is there any hope of using DATADISK more effectively, such that we could reduce usable capacity but replicate data by a factor of two? Kaushik and Michael will get in touch with Borut.
  • this meeting:
    • USERDISK cleanup is in progress.
    • There is a well-known issue associated with two users that submitted the same jobs twice. DESY sites affected, as well as US sites. There is some data which should be declared obsolete. Sarah provided a list of data files to be declared lost. DQ2-ops owns the issue; meanwhile Hiro's error checker continues to send notifications every hour.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Rucio re-naming progress. AGLT2 and SLAC are now renaming. MWT2 will start tomorrow. 10 day estimate to completion. SLAC: two weeks.
    • Rename FDR datasets - Hiro will send a script for sites to run (an illustrative sketch of such a rename pass appears after this list).
    • Working on BNL - there is an issue. Jobs are still writing non-rucio files. Has 50M files to rename.
    • Doug: users with issues should send email to DAST.
    • In case of a BNL shutdown, we may need to move FTS and LFC out of BNL. Michael: according to NYT a deal might have been reached. We need to have a contingency plan in place. Cloud solution contingency.
    • Cleanup issues - after the rename is complete, dark data should be simple to delete.
  • previous meeting (10/30/13):
    • Rucio re-naming: there is a problem with the script; following up with Hiro.
    • Ilija reporting on re-naming efforts; 5 Hz. Expect to complete re-naming in two days. It's about 2M files.
    • Saul: running now; expect to be finished in a few days.
    • We need to synchronize with Hiro. There was a problem at AGLT2 - problems no longer being found in the inventory. How do we validate?
    • UTA: finished UTA_SWT2 without problems; restarted at CPB. Seeing errors on about 1/3 of the renames.
    • OU: paused, waiting.
    • Wei believes there is a problem with the dump itself.
  • previous meeting (11/13/13):
    • BNL FTS and DDM site services - problem - the subscription fails. Happened three times in the past three weeks.
    • Rucio re-naming; has an okay from Tomas to change the AGIS entry in all the sites in the US to end with /rucio as the path. Will do today. Then, all new files will be created in the Rucio convention. All Tier 2 datadisks are set that way.
    • The re-naming script had two bugs. There was a bug in the dump file. This applies to all the sites.
    • Ilija and Sarah re-wrote the code, running at 10 Hz.
    • Can Hiro create a new dump file with the special paths needed for re-naming?

  • this meeting:
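
For reference, the sketch below illustrates the kind of rename pass discussed above: the target location is the deterministic Rucio-style path, derived from an md5 of "scope:name", with two hash-derived directory levels. The dump-file format, the /atlas/rucio storage prefix, and the dry-run flag are illustrative assumptions, not Hiro's actual script.

    #!/usr/bin/env python
    """Illustrative sketch of a Rucio-convention rename pass.

    Reads a dump file with one "scope name current_path" triple per line,
    computes the deterministic Rucio-style path (an md5 of "scope:name"
    selects two directory levels), and renames the file.  The dump format,
    storage prefix, and dry-run flag are assumptions for illustration only.
    """
    import hashlib
    import os

    STORAGE_PREFIX = "/atlas/rucio"   # assumed local prefix for Rucio-convention files
    DRY_RUN = True                    # print intended actions instead of renaming

    def rucio_path(scope, name, prefix=STORAGE_PREFIX):
        """Deterministic path: <prefix>/<scope>/<md5[0:2]>/<md5[2:4]>/<name>.

        NB: real Rucio also expands user/group scopes (e.g. user.jdoe) into an
        extra directory level; that detail is omitted in this sketch.
        """
        md5 = hashlib.md5(("%s:%s" % (scope, name)).encode()).hexdigest()
        return os.path.join(prefix, scope, md5[0:2], md5[2:4], name)

    def rename_from_dump(dump_file):
        for line in open(dump_file):
            if not line.strip():
                continue
            scope, name, old_path = line.split()
            new_path = rucio_path(scope, name)
            if old_path == new_path or not os.path.exists(old_path):
                continue
            if DRY_RUN:
                print("%s -> %s" % (old_path, new_path))
            else:
                new_dir = os.path.dirname(new_path)
                if not os.path.isdir(new_dir):
                    os.makedirs(new_dir)
                os.rename(old_path, new_path)

    if __name__ == "__main__":
        rename_from_dump("rename_dump.txt")   # hypothetical dump-file name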

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • N2N change needed for DPM has been checked in; also Lincoln has created rpms.
  • There is a bugfix needed in the US - the AGIS lookup is broken.
  • Wei will send out an email describing a needed N2N module fix for Rucio hash calculation. This will be for both Java and C++ versions.

previous week (10/30/13)

  • Wei - made a tag for the rpm to get into the WLCG repo. Just need a confirmation from Lincoln.
  • Ilija - N2N: tested at MWT2, all worked correctly. Tested at ECDF (a DPM SE), but caused the door to crash.
  • Deployment: SWT2 service is down. BNL - has an old version, cannot do Rucio N2N.
  • SWT2_CPB: Patrick - not sure what is going on; it won't stay running. Tried to put in the latest N2N last night. Running as a proxy server. Can Andy help with this?
  • BNL - running 3.3.1. Need to upgrade to 3.3.3 and the new Rucio N2N; would also like glrd updated.
  • The rest of the US facility is okay.
  • Working with Valeri on changing database schema to hold more historical data. Prognosis? Next few days. No big issues.
  • Episode at HU - 4600 jobs failed over; most recovered normally. Saul believes the local site mover was disabled for a period.
  • Two Polish sites joined - but which cloud? (Wahid coordinated.)
  • Horst: notices downstream redirection is failing - Ilija is investigating.

previous meeting (11/13/13)

  • Starting to push N2N at DE sites, and are finding a few minor bugs, will produce a new rpm.
  • New deployment document to review
  • Hiro has updated BNL - latest Xrootd 3.3.3 and N2N
  • Mail from Gerd - stat issue with dCache will be solved.
  • Ilija - noted there were a large number of FAX failovers that were pilot config related. Will be fixed.
  • Ilija - working with Valeri Fine to get better monitoring for FAX failovers.

this week (11/27/13)

  • Wei: still working with German sites, deploying Rucio N2N, a few minor issues to resolve.
  • Deployment document updated.
  • Ilija: stability issue - a dCache xrootd door stopped responding; still trying to understand the cause. Working with Asoka on a user script for finding the optimal redirector location (a toy illustration follows this list). Working with Valeri to get FAX failover monitoring improved; a few weeks away at the earliest.
  • UTA stability issues. Wei gave Patrick some suggestions; a week of stability since. A memory-allocation environment-variable setting (needed since RHEL6) and a change in the xrootd configuration. Stress test?
  • Wei: prefers a small stress test on Rucio-converted sites.
  • Ilija - will be stress-testing MWT2. Also, there will be a change in notification for FAX endpoint problems. A new test dataset has been distributed to all sites.
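
As a rough illustration of the "optimal redirector location" user script mentioned above, the toy sketch below times a plain TCP connect to the xrootd port of a few candidate redirectors and reports the fastest responder; the real script will presumably do a proper xrootd-level check. The hostnames are placeholders, not an official FAX endpoint list.

    #!/usr/bin/env python
    """Toy "closest FAX redirector" chooser.

    Times a plain TCP connect to the xrootd port of each candidate and picks
    the fastest responder.  Hostnames below are placeholders only.
    """
    import socket
    import time

    CANDIDATE_REDIRECTORS = [            # placeholder endpoints
        ("redirector1.example.org", 1094),
        ("redirector2.example.org", 1094),
    ]

    def connect_time(host, port, timeout=5.0):
        """Seconds to open a TCP connection to the xrootd port, or None on failure."""
        start = time.time()
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
        except (socket.error, socket.timeout):
            return None
        return time.time() - start

    if __name__ == "__main__":
        results = [(connect_time(h, p), "%s:%d" % (h, p)) for h, p in CANDIDATE_REDIRECTORS]
        results = [r for r in results if r[0] is not None]
        if results:
            best = min(results)
            print("closest-looking redirector: %s (%.3f s connect)" % (best[1], best[0]))
        else:
            print("no candidate redirector reachable")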

Site news and issues (all sites)

  • T1:
    • last meeting(s): Still evaluating alternative storage solutions: object storage technology, as well as GPFS. Have completed the first stage of moving to the 100g infrastructure; will demonstrate between CERN and BNL, November 12-14. Full production.
    • this meeting: 6400 max analysis slots at BNL. Working on next-generation storage deployment to replace 5-year-old equipment; 2.5 PB to replace. Needs to be done quickly (3-6 months). The Tier 1 will go down Dec 16-17; all services will be affected. The new Arista switch will be fully integrated, and a new spanning-tree algorithm deployed in the LAN. A day-long intervention. Will update dCache to version 2.6, primarily for SHA-2, and move to a new version of Condor.

  • AGLT2:
    • last meeting(s): HEPiX meeting a great success. Networking for SC2013. OpenFlow does not work on stacked F10 switches. dCache upgrade the week after Thanksgiving, plus an upgrade of Condor.
    • this meeting:

  • NET2:
    • last meeting(s): Planning to do CVMFS 2.1.15. No work on SHA-2. Will get going on LHCONE using existing BNL-BU link. SAM was broken for a while - having trouble getting both SAM and availability/reliability working correctly. There's a problem with OIM configuration? Updated to CVMFS 2.1.15. Still working on SHA-2 compatibility.
    • this week: Retired two racks of 1TB drives; capacity went down by 100 TB. SGE configuration work. Downtime on Dec 8 for a power upgrade. Next week will do all the SHA-2 upgrades.

  • MWT2:
    • last meeting(s): Major task was the dCache 2.6.15 upgrade. Upgraded CVMFS to 2.1.15. Went to Condor 8.0.4. Completely SHA-2 compliant. (Hiro: can you test?)
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): Tracking down an issue with the new Panda tracer functionality - a job is throwing an exception. Rucio re-naming. SHA-2: the only remaining item is the latest wn-client package. USERDISK is getting tight. Would like to get an LFC dump. How to test SHA-2? Michael: John Hover to run SHA-2-compatible certificate tests at all the sites, both CE and storage. Contacted Von Welch.
    • this meeting: Machine room issue at CPB - an AC issue today. Had a machine that was problematic - resolved now. SHA-2 compliance done at both sites. Would like to see John's tests (a minimal local certificate check is sketched after this list).

  • SWT2 (OU, OSCER):
    • last meeting(s): SHA-2 compliant except for Xrootd.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional and operational and active.
    • this meeting: perfSONAR timeline? A new perfSONAR host has been installed at LU, running both bandwidth and latency tests.

  • WT2:
    • last meeting(s): Upgraded rpm-based OSG services, now SHA-2 compliant. One machine is doing Gratia reporting. Still have GUMS and MyProxy services to do. Working with astrophysics groups to get ATLAS jobs running on their HPC cluster (so they need to co-exist).
    • this meeting: SHA-2 compliance. Some issues with the new GridFTP server - seeing performance problems, but without error messages. Will take a full-day outage on Dec 2 to remove the Thumpers.
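
For sites that want a quick local sanity check of SHA-2 readiness ahead of John's facility-wide certificate tests, the sketch below shells out to openssl and reports the signature algorithm of a host certificate; a SHA-1 algorithm would flag a certificate that still needs to be reissued. The certificate path shown is the typical grid location but may differ per site, and this is only a rough local complement to the real tests.

    #!/usr/bin/env python
    """Quick local check of a host certificate's signature algorithm (SHA-1 vs SHA-2).

    Shells out to openssl; the certificate path is the typical grid location
    and may need adjusting.  This is only a per-host sanity check.
    """
    import subprocess

    CERT_PATH = "/etc/grid-security/hostcert.pem"   # adjust per site

    def signature_algorithm(cert_path):
        """Return the Signature Algorithm string reported by openssl for cert_path."""
        text = subprocess.check_output(
            ["openssl", "x509", "-in", cert_path, "-noout", "-text"]).decode()
        for line in text.splitlines():
            line = line.strip()
            if line.startswith("Signature Algorithm:"):
                return line.split(":", 1)[1].strip()
        return "unknown"

    if __name__ == "__main__":
        algo = signature_algorithm(CERT_PATH)
        print("%s is signed with %s" % (CERT_PATH, algo))
        if "sha1" in algo.lower():
            print("WARNING: still SHA-1; this certificate likely needs to be reissued for SHA-2")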

AOB

last meeting

this meeting


-- RobertGardner - 25 Nov 2013
