
MinutesJul312013

Introduction

Minutes of the Facilities Integration Program meeting, July 31, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Alden, Rob, Dave, Saul, Patrick, Horst, Armen, Hiro
  • Apologies: Shawn, Bob, Jason, Mark, Michael, Wei
  • Guests:

Integration program update (Rob, Michael)

FY13 Procurements - Compute Server Subcommittee (Bob)

last time

this time:

  • The only item of note concerns the HS06 measurements under SL6 and the reference quotes we are seeking for M620/R620 configurations.

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2: DONE
  • NET2: DONE
  • SWT2_OU: 120 TB installed. DONE
  • MWT2: 3.7 PB now online DONE
  • SWT2_UTA: One of the systems is built and deployable; in good shape, a model for moving forward. Will need to take a downtime, but will have to consult with Kaushik. Should be ready for a downtime in two weeks. If SL6 becomes a harder requirement, the priority will have to be adjusted.
this meeting:
  • SWT2_UTA: DONE. 800 deployed, 200 retired, 2.5 PB, another 400 unpowered (note: < 50%).

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn: Mike O'Conner has been putting together a document with best practices. It will include examples of how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance, i.e. a virtual router for the LHCONE subnets; (3) a physical router acting as the gateway for the LHCONE subnets. (See the sketch below this list.)
  • NET2: has not been pushing it, but will get the ball rolling again; will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.
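
  • Note: as an illustration of configuration (1) above, a minimal policy-based-routing sketch in Cisco IOS-style syntax is given below. The subnet, next-hop address, and interface name are hypothetical placeholders; actual configurations should follow Mike O'Conner's best-practices document.

    ! Hypothetical PBR example: steer traffic sourced from the locally announced
    ! LHCONE subnet toward the LHCONE next-hop instead of the general campus path.
    ip access-list extended LHCONE-SRC
     permit ip 192.0.2.0 0.0.0.255 any        ! placeholder subnet announced on LHCONE
    !
    route-map LHCONE-PBR permit 10
     match ip address LHCONE-SRC
     set ip next-hop 198.51.100.1             ! placeholder LHCONE gateway
    !
    interface Vlan100                         ! placeholder interface toward the storage/compute subnet
     ip policy route-map LHCONE-PBR
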
this meeting
  • Updates?
  • NET2: status unclear; waiting on instructions from Mike O'Conner (unless there have been direct communications with Chuck). Will ramp things up.
  • OU: waiting for a large latency issue to be resolved by Internet2, then will reestablish the BNL link. Believes the throughput matrix has improved (a packet-loss problem seems to be resolved). Timeline unknown. Will ping the existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus; the question is whether PBR can be implemented properly. Can provide an update after the visit.

The transition to SL6

MAIN REFERENCE

CURRENTLY REPORTED

last meeting(s)

  • All sites - deploy by end of May, June
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Main thing to consider is whether you upgrade all at once, or rolling.
  • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
  • Look at AGIS; changing PanDA queues is much easier.
  • Are the new queue names handled in reporting? They are, provided they are members of the same Resource Group.
  • What about $APP? It needs a separate grid3-locations file, but the new system no longer uses it.
  • Schedule:
    • BNL DONE
    • June 10 - AGLT2 - will do rolling
    • MWT2 - still a problem with validations; could start next week
    • SLAC - week of June 10
    • NET2 - all at once. Week of June 17
    • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
    • OU - all at once. Rocks versus Puppet decision. After July 5.
  • Goal: majority of sites supporting the new client by the end of June. May need to negotiate continued support.

  • BNL DONE
  • MWT2 DONE
  • AGLT2: 1/3 of the worker nodes have been converted; ran into a CVMFS cache-size configuration issue, but otherwise things are going well. The OSG app area is owned by usatlas2, but validation jobs are now production jobs. Doing a rolling upgrade. They are using the newest CVMFS release; n.b. the change in cache location. Expect to be finished next week. (A CVMFS cache configuration sketch follows this list.)
  • NET2: HU first, then BU. At HU, did a big-bang upgrade; ready for Alessandro to do validation. Ran into a problem with a host cert. 2.1.11 is in production. One machine at BU. Hope to have this done in two weeks. The BU team is working on the HPC center at Holyoke.
  • SWT2 (UTA)
  • SWT2 (OU)
  • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
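
  • Note: regarding the AGLT2 CVMFS cache issue above, a minimal sketch of the relevant client settings in /etc/cvmfs/default.local is given below. The repository list, proxy, cache path, and quota are illustrative assumptions, not the actual AGLT2 values.

    # /etc/cvmfs/default.local -- illustrative values only, not the AGLT2 settings
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    CVMFS_HTTP_PROXY="http://squid.example.edu:3128"   # hypothetical local squid
    CVMFS_CACHE_BASE=/var/lib/cvmfs                    # cache location (note the changed default in newer releases)
    CVMFS_QUOTA_LIMIT=20000                            # cache size limit, in MB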

this meeting

  • Updates?
  • BU: now after August 11, since Alessandro is on vacation. Will commit to doing it August 12.
  • OU: working on the final ROCKS configuration. Still validating the OSCER site; the problems are not yet understood.
  • CPB is DONE. SWT2_UTA: also delayed because of Alessandro's absence.

Tier 3 - Tier 2 Flocking (Rob)

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There has been a lack of production jobs lately
    • From ADC meeting - new software install system; Alastaire's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting (Michal Svatos ):
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=262732
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-global-summary-7_22_13.html
    
    1)  7/17: AGLT2 - file transfer failures with the error "failed to contact on remote SRM." From Shawn: Another OutOfMemory error on our dCache instance. 
    We have a ticket up with dCache support but no resolution yet.  Additional errors on 7/18 ("[SECURITY_ERROR] SRM Authentication failed]").  Issue resolved 
    as of 7/19, and https://ggus.eu/ws/ticket_info.php?ticket=95711 was closed.  eLog 45048.
    2)  7/18 early a.m.: WISC file transfer failures ("First non-zero marker not received within 180 seconds"). https://ggus.eu/ws/ticket_info.php?ticket=95851 - eLog 45027.
    3)  7/19: SLACXRD: machine room cooling outage - eLog 45054.  Services mostly restored by early afternoon.
    4)  7/22: New pilot release from Paul (v58b) - details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58b.html
    5) 7/23 early a.m.: MWT2_SL6 - jobs failing with stage-in errors - https://ggus.eu/ws/ticket_info.php?ticket=96035.  Errors went away, so the ggus ticket was closed 
    after ~24 hours.  eLog  45125.
    6)  7/23 early a.m.: UPENN_LOCALGROUPDISK: file transfer failures with "[INVALID_PATH]...No such file or directory" - https://ggus.eu/ws/ticket_info.php?ticket=96037. 
    Site admin reported that the storage was almost full, and some space was freed up. Closed ggus 96037 - eLog 45118.
    7) 7/24: Frontier service was down at CERN (underlying issue was an ATLR database problem, now fixed) - https://ggus.eu/ws/ticket_info.php?ticket=96113 was closed.  
    Generated a large number of "TRF_UNKNOWN" prod job failures. eLog 45139.
    
    Follow-ups from earlier reports:
    
    (i)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    Update 6/6: Some progress getting the SRM configured correctly.  However, DDM transfers began failing with permission errors, so the site was blacklisted: 
    http://savannah.cern.ch/support/?138002.  eLog 44480.
    Update 7/24: ggus 93748 was closed.
    (ii)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in 
    the network path between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, 
    since everything was then re-routed to the 'old' paths.  Problem under investigation.
    (iii)  7/7: SLACXRD - file transfer failures ("[USER_ERROR] source file doesn't exist:").  https://ggus.eu/ws/ticket_info.php?ticket=95491 in-progress, eLog 44910.
    Update 7/16 - still see these errors.  https://ggus.eu/ws/ticket_info.php?ticket=95763 was opened, marked as a slave to master ticket ggus 95491.
    (iv)  7/12: SLAC - BDII errors from the site to multiple UK sites - https://ggus.eu/ws/ticket_info.php?ticket=95675 in-progress, eLog 44958.  ( https://ggus.eu/ws/ticket_info.php?ticket=95706 
    was opened the same day for a similar issue at OU_OCHEP_SWT2, but that ticket was closed with a request to reassign the issue to FTS3 developers.)
    Update 7/24: Issue apparently resolved (fixed by FTS developers) - no more errors.  ggus 95675 was closed - eLog 45113.
    

  • this week: Operations summary:
    Summary from the weekly ADCoS meeting (Michal Svatos ):
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=265765
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/adcos-summary-7_29_13.html
    
    1)  7/24: NERSC_LOCALGROUPDISK: transfer failure [Checksum mismatch] - https://ggus.eu/ws/ticket_info.php?ticket=96116 - eLog 45176. On 7/25 a maintenance outage 
    was declared to work on a filesystem problem. https://savannah.cern.ch/support/index.php?138943 (site blacklisting).
    2)  7/27: MWT2_SL6 : ~2.2k job failures with "lost heartbeat" and batch-system-killed errors. The Condor pool dumped a large number of jobs, possibly during a "minor" 
    reconfiguration of Condor. No more errors - https://ggus.eu/ws/ticket_info.php?ticket=96193 was closed, eLog 45181.
    3)  7/29: SLACXRD - jobs were failing heavily on WN fell0197.  Machine was removed from production for further off-line debugging.  eLog 45206.
    4)  7/30: New pilot release from Paul (v58d) - details:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58d.html
    5)  7/30: ADC weekly meeting:
    https://indico.cern.ch/event/265094
    
    Follow-ups from earlier reports:
    
    (i)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the 
    network path between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, 
    since everything was then re-routed to the 'old' paths.  Problem under investigation.
    (ii)  7/7: SLACXRD - file transfer failures ("[USER_ERROR] source file doesn't exist:").  https://ggus.eu/ws/ticket_info.php?ticket=95491 in-progress, eLog 44910.
    Update 7/16 - still see these errors.  https://ggus.eu/ws/ticket_info.php?ticket=95763 was opened, marked as a slave to master ticket ggus 95491.
    Update 7/24: ggus 95491 was closed, but then re-opened when the errors reappeared. 
    (iii)  7/18 early a.m.: WISC file transfer failures ("First non-zero marker not received within 180 seconds"). https://ggus.eu/ws/ticket_info.php?ticket=95851 - eLog 45027.
    Update 7/30: The xrootd redirector server was replaced with a new machine.  ggus 95851 was closed.
    

Data Management and Storage Validation (Armen)

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s) this week

  • See Johannes' email. Sites should follow up.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Final stage of completing the network upgrade. Then the DB backend for LFC and FTS moves to Oracle 11g, and dCache is upgraded to 2.2-10. Farm on SL6. DDM enabled. SL6 WNs are running HC jobs; soon to open up for regular production and analysis. PandaMover is being brought up so as to resume production at the Tier 2s; hopefully finished by 7 pm. Buying new compute nodes: delivery date of June 10, 90 worker nodes, to be in production a few days after (25-30 kHS06). In collaboration with ESnet on a 100G transatlantic demo; preparing a link between BNL and MANLAN in NYC. On the European end, following the TERENA conference, extend the Amsterdam to CERN link.
    • this meeting:

  • AGLT2:
    • last meeting(s): Working hard on getting ready for SL6. Test production jobs run just fine; user analysis jobs are failing, however, and it is unclear why. VMWare systems at MSU and UM: the new door machine at MSU is configured and running; pool servers will then be updated to SL6.
    • this meeting:

  • NET2:
    • last meeting(s): Release issue as mentioned; an unresolved problem. Unintentional update of CVMFS on the HU nodes. Michael: since HU and BU are in the same data center, why not unify? The only reason was to minimize changes in the move; might do this in the future. HC stress test?
    • this week: John is back from vacation and will make some networking changes, swap out the HU gatekeeper hardware, and install the new OSG version.

  • MWT2:
    • last meeting(s): Regional networking meeting next week, AGLT2+MWT2. Illinois CC GPFS issues last week were caused by a faulty Infiniband component. New compute nodes online with SL6. Puppet rules set up for IU and UC nodes. Networking issues at IU: on Sunday, changed the network to the backup 10G link and re-enabled the virtual router in the Brocade router for LHCONE. However, checksum errors returned: 191 out of ~600K transfers. People are trying to understand the source. (A checksum-verification sketch follows this item.)
    • this meeting: Tier 3 flocking project or OSG connect. Networking problem at IU seems to be resolved; opening up storage pools at IU.
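
    • Note: for context on the IU checksum errors above, a small Python sketch of the adler32 file checksum used to verify ATLAS transfers is given below. The file path is a placeholder; this only illustrates the check, not the site tooling.

      # Compute the adler32 checksum of a local file so it can be compared
      # against the catalog value for a transferred replica (illustrative sketch).
      import zlib

      def adler32_of_file(path, blocksize=1024 * 1024):
          value = 1  # adler32 seed value
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(blocksize)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xffffffff)

      if __name__ == "__main__":
          print(adler32_of_file("/tmp/example.data"))  # hypothetical file path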

  • SWT2 (UTA):
    • last meeting(s): Loading SL6 and ROCKS; has a solution for this, isolating ISOs. Will bring up the head node and start bringing configurations forward. Malformed URIs from the deletion service.
    • this meeting: The big news is that the upgrade is complete: new storage, new edge nodes, all GridFTP servers at 10G, SL6, and OSG components updated to the latest. All went well.

  • SWT2 (OU):
    • last meeting(s): Some problems with high-memory jobs, which have crashed compute nodes as a result. Condor is configured to kill jobs over 3.8 GB. These are production jobs. No swap? Very little. (See the configuration sketch after this item.)
    • this meeting:
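
    • Note: a minimal sketch of how a memory cap like the one described above can be expressed in an HTCondor configuration is given below. The 3.8 GB threshold matches the number quoted in the minutes, but the exact expression used at OU is an assumption.

      # Illustrative HTCondor configuration knob (an assumption, not necessarily the OU setting):
      # remove jobs whose resident set size exceeds ~3.8 GB.
      # ResidentSetSize in the job ClassAd is reported in KB.
      SYSTEM_PERIODIC_REMOVE = (ResidentSetSize > 3800000)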

  • WT2:
    • last meeting(s): Preparing for the SL6 migration. Have CVMFS running on an SL6 machine; running a test CE, and it is working with LSF 9.1.
    • this meeting: (1) SL6 migration is almost complete. Jobs are running OK; we just need to reinstall the batch nodes. (2) SLAC security approved our security plan for opening outbound TCP from the batch nodes; it is awaiting the CIO's signature. We will need to re-IP the batch nodes. We will likely combine 1 and 2 next week (if we get the CIO's signature, which I think is just paperwork).

AOB

last meeting this meeting


-- RobertGardner - 31 Jul 2013
