
MinutesSep42013

Introduction

Minutes of the Facilities Integration Program meeting, Sep 4, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Bob, Rob, Wei, Fred, Armen, Dave, John, Mark, Sarah, Hiro, Saul
  • Apologies: Jason, Michael, Horst, Shawn
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • We'll need to cover the topics below
        • SL6 upgrade status
        • FY13 Procurements
        • Updates for Rucio support (Dave's page)
        • Distributed Tier 3 (T2 flocking), user scratch, Facilities working subgroup
      • Conversion of the existing storage inventory to the Rucio naming convention (Hiro); a path-convention sketch is included at the end of this section.
      • How much inventory is in active use? Can we purge before renaming? Hiro has been studying this, and a meeting is being set up to discuss it. Now a hot topic.
      • This is much more complex than it may appear; n.b. there is already a shortage of T1 space.
    • this week
      • Enabling DQ2 2.4+ clients, SL6 conversion
        • Global versus DQ2-specific settings
        • SLAC has a problem - PRODDISK files are written using the old naming convention.
      • SHA-2 compliance deadline, October 1, 2013: https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/SHA2Compliance (a quick certificate check is sketched at the end of this section)
        • SLAC - BeStMan upgrade didn't work without GUMS. Will install a GUMS server. Wei is skeptical.
        • BU - may need to upgrade BeStMan
        • UTA - ?
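
For the renaming discussion above, a minimal Python sketch of the deterministic Rucio path convention (md5 of "scope:name", two two-character directory levels). The storage prefix and the example scope/filename below are placeholders, not actual site endpoints or datasets.

    import hashlib

    def rucio_path(scope, name, prefix="/pnfs/example.org/atlasdatadisk/rucio"):
        """Deterministic Rucio path: <prefix>/<scope>/<md5[0:2]>/<md5[2:4]>/<name>.

        The md5 is taken over "scope:name".  For user/group scopes the dots in
        the scope become '/' in the path (e.g. user.jdoe -> user/jdoe).
        The prefix above is a placeholder, not a real endpoint.
        """
        h = hashlib.md5(("%s:%s" % (scope, name)).encode("utf-8")).hexdigest()
        if scope.startswith("user.") or scope.startswith("group."):
            scope = scope.replace(".", "/")
        return "%s/%s/%s/%s/%s" % (prefix, scope, h[0:2], h[2:4], name)

    # e.g. rucio_path("mc12_8TeV", "EVNT.01234567._000001.pool.root.1")  # illustrative names

Renaming the existing inventory then amounts to computing this target path for each registered file and relocating or relinking it from its old location.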

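Ahead of the SHA-2 deadline noted above, a minimal sketch for checking which hash algorithm signed a host certificate. The Python "cryptography" package and the /etc/grid-security/hostcert.pem location are assumptions for illustration, not part of the OSG instructions linked above.

    # Check the signature hash algorithm of a host certificate.
    # Library choice and certificate path are assumptions for illustration.
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    with open("/etc/grid-security/hostcert.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())

    algo = cert.signature_hash_algorithm.name  # e.g. "sha1" (needs reissue) or "sha256"
    print("signature hash algorithm: %s" % algo)
    print("SHA-2 family: %s" % (algo in ("sha224", "sha256", "sha384", "sha512")))
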
FY13 Procurements - Compute Server Subcommittee (Bob)

last time
  • AdHocComputeServerWG
  • Shawn: summarized the discussion about possible discounts; waiting for some sensitive discussions to converge.
  • Attempting to get a rational set of quotes for the M620 and R620; the quotes so far are higher than either BNL's or SLAC's. Also awaiting info from Andy L, Dell, etc.
  • Shawn - expect news, maybe even today.

this time:

  • Still waiting
  • Wei - starting the purchasing process, pooling with other groups to purchase a large batch of machines. Will not run HT; 4 GB/core and IB. Dell M620 blades. Happy with the pricing.

Integration program issues

Rucio support (Dave)

previous meeting

  • OU: has not done this, but will during the upgrade
  • NET2_HU: all nodes are SL6
  • NET2_BU: unclear
  • MWT2 DONE
  • AGLT2 DONE
  • BNL DONE
  • WT2 DONE

this meeting:

  • OU currently updating
  • SWT2:

The transition to SL6

MAIN REFERENCE

CURRENTLY REPORTED

last meeting(s)

  • All sites - deploy by end of May, June
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Main thing to consider is whether you upgrade all at once, or rolling.
  • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
  • Look at AGIS - changing Panda queues is much easier
  • Are the new queue names handled in reporting? They are, provided they are members of the same Resource Group.
  • What about $APP? It needs a separate grid3-locations file, but the new system doesn't use it any longer.
  • Schedule:
    • BNL DONE
    • June 10 - AGLT2 - will do rolling
    • MWT2 - still a problem with validations; could start next week
    • SLAC - week of June 10
    • NET2 - all at once. Week of June 17
    • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
    • OU - all at once. Rocks versus Puppet decision. After July 5.
  • Goal: Majority of sites supporting the new client by end of June. May need to negotiate continued support

  • BNL DONE
  • MWT2 DONE
  • AGLT2: 1/3 of worker nodes were converted; ran into a CVMFS cache size config issue (a minimal cache-config sketch follows this list), but otherwise things are going well. The OSG app area is owned by usatlas2, but validation jobs are now production jobs. Doing a rolling upgrade. They are using the newest CVMFS release; n.b. the change in cache location. Expect to be finished next week.
  • NET2: HU first, then BU. At HU - did a big-bang upgrade; ready for Alessandro to do validation. Ran into a problem with the host cert. CVMFS 2.1.11 is in production. One machine at BU. Hope to have this done in two weeks. BU team working on the HPC center at Holyoke.
  • SWT2 (UTA)
  • SWT2 (OU)
  • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
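
As referenced in the AGLT2 item above, a minimal sketch of the CVMFS client settings that control cache size and location (the 2.1.x series moved the default cache to /var/lib/cvmfs). All values below are illustrative placeholders, not a recommended site configuration.

    # /etc/cvmfs/default.local -- illustrative values only
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch
    CVMFS_CACHE_BASE=/var/lib/cvmfs          # cache location (2.1.x default)
    CVMFS_QUOTA_LIMIT=20000                  # cache size limit, in MB
    CVMFS_HTTP_PROXY="http://squid.example.edu:3128"   # placeholder local squid

After editing, "cvmfs_config probe" can be used to verify that the repositories still mount.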

previous meeting update:

  • BU: now after August 11, since Alessandro is on vacation. Will commit to doing it August 12.
  • OU: working on the final ROCKS configuration. Still validating the OSCER site - problems not understood.
  • CPB DONE. SWT2_UTA - also delayed because of Alessandro's absence.

previous meeting (8/14/2013)

  • Updates?
  • BU: a new set of Panda queues is getting validated right now. Did local testing; there were problems with Adler-32 checksumming (a checksum sketch follows this list). Expect completion within the next week.
  • OU: wants to wait until OSCER is validated before starting OCHEP. Saul will help.
  • UTA_SWT2 - still needs to convert. Were waiting on getting some equipment moved over to the FW data center.
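
Regarding the Adler-32 checksumming problem noted above, a minimal Python sketch of how a file's Adler-32 checksum is typically computed and formatted for comparison with catalog values (unsigned, eight lowercase hex digits). The example file path is a placeholder.

    import zlib

    def adler32(path, blocksize=1024 * 1024):
        """Adler-32 of a file, formatted as 8 lowercase hex digits (unsigned)."""
        value = 1  # Adler-32 initial value
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xFFFFFFFF)

    # e.g. adler32("/tmp/testfile.root")  # hypothetical path

A common source of mismatches is treating the 32-bit value as signed or dropping leading zeros when formatting.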

previous meeting (8/24/2013)

  • Updates?
  • BU: about 2/3 of the releases have been validated. The completion date is not known; the hope is about a week. It's been going rather slowly. Ran into a problem with CVMFS.
  • OU: currently down due to storage problems, which have not yet been fixed.
  • Both OU and BU have had problems getting attention from Alessandro.
  • Need a formal statement about a setup file in $APP.

this meeting

  • OU doing upgrade

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing) - a sketch follows this list; (2) a dedicated routing instance, i.e. a virtual router for the LHCONE subnets; (3) a physical router as the gateway for the LHCONE subnets.
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. The direct path from BNL to OU has been re-established. Will start on LHCONE next.
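
For the PBR option above, an illustrative Cisco IOS-style sketch of routing traffic bound for LHCONE-announced prefixes via a separate next hop. The interface, prefixes, and next-hop address are placeholders (RFC 5737 documentation ranges), not any site's actual configuration; other router platforms have equivalent mechanisms.

    ! Match traffic from the local Tier-2 subnet to an LHCONE-announced prefix
    ip access-list extended LHCONE-DEST
     permit ip 192.0.2.0 0.0.0.255 198.51.100.0 0.0.0.255
    !
    ! Send matched traffic to the LHCONE gateway instead of the default route
    route-map LHCONE-PBR permit 10
     match ip address LHCONE-DEST
     set ip next-hop 203.0.113.1
    !
    ! Apply the policy on the LAN-facing interface
    interface Vlan200
     ip policy route-map LHCONE-PBR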

previous meeting

  • NET2: status unclear - waiting on instructions from Mike O'Conner, unless there have been direct communications with Chuck. Will ramp things up.
  • OU: waiting for a large latency issue to be resolved by Internet2, then will re-establish the BNL link. Believes the throughput matrix has improved (a packet loss problem seems to be resolved). Timeline unknown. Will ping the existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus; the question is whether PBR can be implemented properly. Can provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of the network staff. A new manager is coming online. Will see about implementing PBR. Update next time.

previous meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. There were perfSONAR issues, since resolved. Expect to have either the Tier 2 or the OSCER site done within a few.
  • BU and Holyoke. Put the network engineers in touch. Still unknown when it will happen. Have not been able to extract a date to do it.
  • UTA - no progress.

this meeting (9/4/13)

  • Updates
  • UTA: a meeting with the new network director is scheduled for this Friday or next week. Back on the same page.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There has been a lack of production jobs lately
    • From ADC meeting - new software install system; Alastaire's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=2&confId=269291
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/adcos-summary-8_26_13.html
    
    1)  8/23: SLACXRD - file transfer failures with SRM errors. https://ggus.eu/ws/ticket_info.php?ticket=96831, eLog 45532.  As of 8/27 no recent errors like the ones in
    the ticket, so ggus 96831 was closed. eLog 45560.
    2)  8/27: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=268262
    3)  8/28: SLACXRD - file transfer failures and jobs stage-out errors.  Wei reported that some disk failures were being worked on.  eLog 45592.
    
    Follow-ups from earlier reports:
    
    (i)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the 
    network path between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, 
    since everything was then re-routed to the 'old' paths.  Problem under investigation.
    (ii)  7/7: SLACXRD - file transfer failures ("[USER_ERROR] source file doesn't exist").  https://ggus.eu/ws/ticket_info.php?ticket=95491 in-progress, eLog 44910.
    Update 7/16 - still see these errors.  https://ggus.eu/ws/ticket_info.php?ticket=95763 was opened, marked as a slave to master ticket ggus 95491.
    Update 7/24: ggus 95491 was closed, but then re-opened when the errors reappeared. 
    Update 8/21: Files were declared lost, and recent DDM efficiency is good.  ggus 95491 was closed, eLog 45509.
    (iii)  8/6: WISC HOTDISK transfer errors ("Unexpected Gatekeeper or Service Name globus_gsi_gssapi: Authorization denied") - https://ggus.eu/ws/ticket_info.php?ticket=96406 - 
    eLog 45289, 45461 .  Also affects LOCALGROUPDISK, site blacklisted in DDM.
    (iv)  8/13: AGLT2 - file transfers with SRM errors, still related to the dCache "out of memory" issue. Also transfers failing with the error "[SECURITY_ERROR] SRM Authentication 
    failed." From Shawn: We are working with the dCache developers to track down the root cause. During the last OOM we updated our dCache to an "instrumented" version which 
    we are now running. The dCache services are back up now but will take a while to run smoothly again. https://ggus.eu/ws/ticket_info.php?ticket=96614 in-progress, eLog 45403.
    Update 8/15: Ticket was closed/reopened when the errors came back.  Closed on 8/19, but again reopened on 8/21.  Ongoing issue.  eLog 45494.
    (v)  8/20: OU_OCHEP_SWT2: storage problem at the site.  Issue being worked on. 
    Update 8/23: hardware problem fixed, file transfers resumed.  Later that day https://ggus.eu/ws/ticket_info.php?ticket=96830 was opened due to file transfers failing (SRM errors).  
    Restarting BeStMan fixed this problem - ggus 96830 was closed, eLog 45533.
    
  • this week: Operations summary:
    Summary from the weekly ADCoS meeting (Pavol Strizenec):
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=270957
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/adcos-summary-9_2_13.html
    
    1)  8/29: Information about the migration of ATLAS Computing TWiki pages:
    http://www-hep.uta.edu/~sosebee/ADCoS/Migration-Atlas-Computing-Twiki.html
    2)  8/30 p.m.: AGLT2 - file transfer failures ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  From Shawn: Our postgresql instance was 
    restarted but not the dCache instances using it. I am restarting our SRM now and will restart any other affected services.  Issue resolved - https://ggus.eu/ws/ticket_info.php?ticket=96956 
    was closed, eLog 45640.
    3)  9/1: SWT2_CPB - file transfers failing with SRM errors.  Underlying problem was traced to problematic DNS in the cluster.  Issue resolved, DDM and panda queues back up as 
    of early afternoon 9/2. eLog 45675, http://savannah.cern.ch/support/?139562 (Savannah site exclusion).
    4)  9/3: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=269432
    
    Follow-ups from earlier reports:
    
    (i)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the network 
    path between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, since everything 
    was then re-routed to the 'old' paths.  Problem under investigation.
    (ii)  8/6: WISC HOTDISK transfer errors ("Unexpected Gatekeeper or Service Name globus_gsi_gssapi: Authorization denied") - https://ggus.eu/ws/ticket_info.php?ticket=96406 - 
    eLog 45289, 45461 .  Also affects LOCALGROUPDISK, site blacklisted in DDM.
    Update from the site admin, 9/3: It's fixed. The hostcert and hostkey were overridden by some other management program. ggus 96406 was closed, eLog 45690.
    (iii)  8/13: AGLT2 - file transfers with SRM errors, still related to the dCache "out of memory" issue. Also transfers failing with the error "[SECURITY_ERROR] SRM Authentication failed." 
    From Shawn: We are working with the dCache developers to track down the root cause. During the last OOM we updated our dCache to an "instrumented" version which we are 
    now running. The dCache services are back up now but will take a while to run smoothly again. https://ggus.eu/ws/ticket_info.php?ticket=96614 in-progress, eLog 45403.
    Update 8/15: Ticket was closed/reopened when the errors came back.  Closed on 8/19, but again reopened on 8/21.  Ongoing issue.  eLog 45494.
    Update 8/29 from Shawn: We have temporarily turned off sending data to our dCache billing DB. OOM errors have stopped. ggus 96614 was closed, eLog 45689.
    (iv)  8/28: SLACXRD - file transfer failures and jobs stage-out errors.  Wei reported that some disk failures were being worked on.  eLog 45592.
    Update 8/30: issue resolved - recent jobs and transfers are succeeding.
    

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Dark data discussions are on-going. Will provide a comparison for sites; preliminary report next week.
    • Localgroupdisk. Some deletions at UC, about 150 TB. Also contacting top users at SLAC.
    • Userdisk cleanup finished everywhere except BNL.
    • No report - Armen on vacation
  • this meeting:
    • Has been deleting data on behalf of a user - 220 TB at MWT2. Need to send another note to users. 80% of the data is not being used? What does the popularity data show?
    • Deleting data from DATADISK is limited by the primary copies.
    • Dark data - need more feedback. USERDISK was deleted before May 1; those files are eligible for CCC cleanup.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei - sent a note reminding sites to update Xrootd (with accompanying notes).
  • Ilija - working with Matez on monitoring - there was some mix-up of CMS and ATLAS data. A new release of the collector is coming.
  • Panda failover monitoring.
  • All sites agreed to update Xrootd: https://twiki.cern.ch/twiki/bin/view/Atlas/Xrootd333Upgrade.
this week
  • Waiting for several sites to install the Rucio-based N2N. Only SLAC and MWT2 have completed it. Wei will send an official note.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Hiro has started to do direct-access tests. Planning to move the ANALY short queue to direct access; anticipate moving more in the future. Working on new storage hardware - issued an order for 1.5 PB of storage at $120/TB. Will run with ZFS on top - start deploying as soon as possible. Hiro and others have made good progress on evaluating the Ceph block-device part; it looks attractive to us. Using old retired storage. The features are nice - you can slice and dice the storage. The 100 Gbps-capable Arista switch arrived; readying for a trans-Atlantic test in mid-September. IB evaluation: 56 Gbps interfaces into these machines.
    • this meeting:

  • AGLT2:
    • last meeting(s): OOM crashes on dCache headnodes. Ticket open. 2.2.12 version. Troubleshooting. Wants to move to 2.6; will need gplazma 2 natively. Still have issue with dCache OOM w/ the billing cell db. Isolated it to this. Instrumenting the monitoring. Getting ready to purchase networking switch, and compute nodes.
    • this meeting: More recently seeing larger numbers of CVMFS problems. HammerCloud offlined the site last night - most likely a CVMFS issue. The most recent CVMFS version, 2.1.14, is causing problems; MWT2 backed out to 2.0.19. Dave notes certain versions of the kernel have problems with the latest 2.1.14.

  • NET2:
    • last meeting(s): Still working on SL6. A new gatekeeper is being set up at HU. Will do the Xrootd update at BU. Working on X509.
    • this week: LHCONE progress; also expect 100G soon. HU has an issue with the gatekeeper on SL6 (lsf-gratia); Wei has a suggestion for disabling certain records. Has a ticket open. SL6 upgrade completed. Rucio upgrade.

  • MWT2:
    • last meeting(s): A DNS appliance failure at UC broke many things. Corrected, but it took a while to propagate to all servers. GPFS issues at Illinois - a lack of file descriptors.
    • this meeting: Updated to a SHA-2 compliant GUMS. dCache upgrade later in the month. There is a networking issue at IU. IU has received its upgrade equipment.

  • SWT2 (UTA):
    • last meeting(s): Working on SL6 upgrade. Problem with a storage server - but have work-around. Hope to be able to request software validation.
    • this meeting: Patrick up at OU

  • SWT2 (OU):
    • last meeting(s): Lustre problems - but IBM is working on the problem.
    • this meeting: Upgrading to SL6.

  • WT2:
    • last meeting(s):
    • this meeting: Now have outbound TCP/IP connections. There is a problem with the GRAM service - it needs a restart every few days; need to open a ticket with OSG. Release validation - two sites have been working fine; finding some validations fail with some of the new sites. (Note: two upgrades, of both GT5 and LSF.)

AOB

last meeting
  • Will have a meeting next week, to get back in phase.
  • Fall US ATLAS Facilities workshop. Tentative dates: December 11, 12. Location: University of Arizona (Tucson).
this meeting


-- RobertGardner - 04 Sep 2013
