
MinutesJun262013

Introduction

Minutes of the Facilities Integration Program meeting, June 26, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode:
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Michael, Dave, Rob, Bob, Armen, Torre, Ilija, Saul, James (SMU), Wei, Shawn, Horst, Doug, Alden
  • Apologies: Kaushik, Mark, Patrick
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • CapacitySummary - please update v27 in google docs
      • Program this quarter: SL6 migration; FY13 procurement; perfsonar update; Xrootd update; OSG and wn-client updates; FAX
      • WAN performance problems, and general strategy
      • Introducing FAX into real production, supporting which use-cases.
      • v28 of the spreadsheet to be communicated to John Weigand - done
      • Phase 25 site certification matrix: SiteCertificationP25
      • SL6 validation and deployment (see below)
      • FY13 procurements - note we have been discussing with ATLAS computing management re: CPU/storage ratio. Guidance is to emphasize CPU in this round.
    • this week
      • Needed discussion on gLexec, see below.
      • OSG: There will be a workshop at Duke University, August 27-28 (c.f. http://campusgrids.org/), focusing on using Bosco and OSG Connect to harness campus and opportunistic resources. It will feature a module on Parrot + CVMFS, and on FAX.
        • In particular, set something up for Tier 3s.
      • Report from Bob on Adhoc Compute Server subcommittee activities (below)
      • LHCONE: NET2, SWT2 still to be connected
      • Clutter (stale queue entries) in the Panda monitor. Horst reports this appeared with the switch from scheddb to AGIS. How do we get rid of them? Alden: thought this was fixed; will follow up.

gLexec

  • Background: At the last WLCG Management Board meeting a decision was taken regarding gLExec; this was also taken up at the WLCG Operations Coordination meeting last Thursday. See https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes130620. The plan foresees making the gLExec SAM tests critical starting from October; experience shows that setting up gLExec and passing the tests is not difficult, is well documented, and does not risk affecting any production workflow.
  • https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeploymentTracking
  • At minimum we will need to get it passing the SAM tests, or else do a full deployment.
  • To what extent is it planned by ADC for use in Panda? Torre: integration work is being done by CERN IT, but there is no strong commitment from ADC.
  • CMS has already done this. Action on Rob to pull some instructions together (a minimal smoke-test sketch follows this list).
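  • For reference, a minimal smoke-test sketch of the pilot-style identity switch gLExec performs (this is not the SAM test itself). It assumes glexec is installed at /usr/sbin/glexec (the OSG RPM default) and that /tmp/payload_proxy.pem is a valid proxy for the payload user; both paths are illustrative placeholders, not site instructions.
    #!/usr/bin/env python
    # Minimal gLExec smoke test (illustrative only; not the SAM test).
    import os
    import subprocess

    payload_proxy = "/tmp/payload_proxy.pem"      # hypothetical payload proxy
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = payload_proxy     # identity gLExec switches to
    env["GLEXEC_SOURCE_PROXY"] = payload_proxy    # proxy handed to the payload

    # With the LCMAPS/GUMS mapping in place this prints the mapped account
    # rather than the pilot account; a non-zero return code points at the mapping.
    proc = subprocess.Popen(["/usr/sbin/glexec", "/usr/bin/id"], env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    out, err = proc.communicate()
    print(proc.returncode, out.strip(), err.strip())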

FY13 Procurements - Compute Server Subcommittee (Bob)

last time
  • Guidance is to emphasize CPU in this round
  • Share benchmarks
  • Let's have one expert assigned by each Tier 2 to collect relevant information on benchmarks. Michael notes significant performance differences depending on configuration.
  • Rob will send a note to each of the Tier 2 contacts to join a group.

this time:

  • AdHocComputeServerWG
  • Doug: considering virtualization?
  • Question about going above 2 GB per logical core - is there official guidance? ATLAS is trying hard to keep requirements within 2 GB, but expect to see more 3-4 GB jobs (see the sizing sketch after this list).
  • What about multi-core (MP) jobs?
  • Discussion of benchmarks - Doug has some I/O examples.
  • What about the NICs? 10G?
  • Next meeting this Friday.
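  • For reference, a back-of-the-envelope sketch for the RAM-per-logical-core question above; the node spec (dual 8-core CPUs with hyper-threading) is purely illustrative, not a subcommittee recommendation.
    # RAM sizing at different per-core allocations for an assumed node spec.
    def total_ram_gb(logical_cores, gb_per_core):
        """Total RAM (GB) needed at a given per-logical-core allocation."""
        return logical_cores * gb_per_core

    logical_cores = 2 * 8 * 2   # assumed: 2 sockets x 8 cores x hyper-threading
    for gb_per_core in (2, 3, 4):
        print("%d logical cores at %d GB/core -> %d GB RAM"
              % (logical_cores, gb_per_core,
                 total_ram_gb(logical_cores, gb_per_core)))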

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2: DONE
  • NET2: DONE
  • SWT2_OU: 120 TB installed. DONE
  • MWT2: 3.7 PB now online DONE
  • SWT2_UTA: One of the systems is built and deployable; good shape, a model for moving forward. Will need to take a downtime - but will have to consult with Kaushik. Should be ready for a downtime in two weeks. If SL6 is a harder requirement, will have to adjust priority.
this meeting:

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, MWT2 2/3 DONE (MWT2_IU needs action, see below)
  • SLAC DONE

notes:

  • Updates?
  • OU - status unknown.
  • UTA - conversations with LEARN, UTA, I2 are happening. There has been a meeting. They are aware of the June 1 milestone.
  • NET2 - new 10G link is set up; 2 x 10G to HU. Chuck is aware of the June 1 LHCONE milestone. Saul will follow up shortly and expects no problem by June 1.
  • IU - plan is to decide Friday whether we need to bypass the Brocade and access the Juniper directly to peer with LHCONE. Fred is working closely with the engineers.
this meeting
  • Updates?
  • Shawn - Mike O'Connor has been putting together a document with best practices. It will have examples of how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance, i.e. a virtual router for LHCONE subnets; (3) a physical router as the gateway for LHCONE subnets. (A path-check sketch follows this list.)
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Connor and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.
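  • For reference, a rough way to sanity-check which path traffic takes once a site's LHCONE subnets are routed via one of the configurations above: traceroute from a host on an announced subnet toward a remote LHCONE-connected site and inspect the hops. The target hostname in the sketch is a placeholder, not a real endpoint, and traceroute must be installed on the host.
    # Trace the path from this host toward a remote LHCONE-connected site and
    # compare the hops with the gateway/VRF expected for LHCONE traffic.
    import subprocess

    target = "remote-lhcone-host.example.org"   # placeholder, not a real host

    # -n suppresses DNS lookups so hop addresses can be compared directly
    # with the site's LHCONE gateway or virtual-router addresses.
    hops = subprocess.check_output(["traceroute", "-n", target],
                                   universal_newlines=True)
    print(hops)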

The transition to SL6

MAIN REFERENCE

CURRENTLY REPORTED

last meeting(s)

  • All sites - deploy by end of May, June
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Main thing to consider is whether you upgrade all at once, or rolling.
  • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
  • Look at AGIS - changing panda queues much easier
  • Are the new queue names handled in reporting? They are, if they are members of the same Resource Group.
  • What about $APP? Needs a separate grid3-locations file. But the new system doesn't use it any longer.
  • Schedule:
    • BNL DONE
    • June 10 - AGLT2 - will do rolling
    • MWT2 - still a problem with validations; could start next week
    • SLAC - week of June 10
    • NET2 - all at once. Week of June 17
    • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
    • OU - all at once. Rocks versus Puppet decision. After July 5.
  • Goal: Majority of sites supporting the new client by end of June. May need to negotiate continued support

this meeting

  • BNL DONE
  • MWT2 DONE
  • AGLT2: 1/3 of the worker nodes have been converted; ran into a CVMFS cache-size configuration issue, but otherwise things are going well (see the CVMFS config-check sketch after this list). The OSG app area is owned by usatlas2, but validation jobs now run as production jobs. Doing a rolling upgrade, using the newest CVMFS release; n.b. the change in cache location. Expect to be finished next week.
  • NET2: HU first, then BU. At HU - did a big-bang upgrade; ready for Alessandro to do validation. Ran into a problem with the host cert. 2.1.11 is in production. One machine at BU. Hope to have this done in two weeks. The BU team is working on the HPC center at Holyoke.
  • SWT2 (UTA)
  • SWT2 (OU)
  • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
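  • For reference, a minimal sketch for spotting the kind of cache-size / cache-location mismatch mentioned above. It assumes the standard cvmfs_config utility shipped with CVMFS is on the PATH and the usual atlas.cern.ch repository; it simply prints the cache quota and cache directory settings for inspection.
    # Print the CVMFS cache settings for the ATLAS repository; useful for
    # catching a quota or cache-location mismatch after an upgrade.
    import subprocess

    config = subprocess.check_output(
        ["cvmfs_config", "showconfig", "atlas.cern.ch"],
        universal_newlines=True)
    for line in config.splitlines():
        # CVMFS_QUOTA_LIMIT is the cache size (MB); CVMFS_CACHE_BASE is the
        # cache directory, whose default location changed in newer releases.
        if line.startswith(("CVMFS_QUOTA_LIMIT", "CVMFS_CACHE_BASE")):
            print(line)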

Updates from the Tier 3 taskforce?

last meeting
  • Report is due by July
  • Doing testing of Tier 3 scenarios using grid or cloud resources
  • Working with AGLT2 as a test queue.
  • Managed to get surveys from every Tier 3 site. Writing assignments will be set up for the final report.
  • Half the community does not have resources on their campus.
  • Solve the data handling problem for local resources, i.e. as a fully supported DDM endpoint; gridftp-only endpoints were never fully supported.
  • Survey report will be available in two weeks

this meeting

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There has been a lack of production jobs lately
    • From ADC meeting - new software install system; Alastaire's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    Not available this week.
    
    1)  6/19: SAAB tool ("Storage Area Automatic Blacklisting") put into production:
    It will act on ALL DDMEndpoints. The auto-blacklisting will act only on the 'upload' and 'write' DDM permissions. Feedback/suggestions/issues are important for us; please 
    see the SAAB twiki: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/StorageAreaAutomaticBlacklisting
    2)  6/19: AGLT2 - file transfer failures with SRM errors.  Issue was due to a cooling problem in the server room which caused temperatures to spike.  Problem fixed as of 
    early afternoon.  https://ggus.eu/ws/ticket_info.php?ticket=94974 closed, eLog 44631.
    
    Follow-ups from earlier reports:
    
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, 
    eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 
    (ii)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    Update 6/6: Some progress getting the SRM configured correctly.  However, DDM transfers began failing with permission errors, so the site was blacklisted: 
    http://savannah.cern.ch/support/?138002.  eLog 44480.
    (iii)  6/7: NET2 - file transfers failing with "Communication error on send."  Problem similar to one from early May that affected only UK cloud sites 
    (see: https://ggus.eu/ws/ticket_info.php?ticket=93660).  An unrelated problem with GPFS was solved on 6/8.  https://ggus.eu/ws/ticket_info.php?ticket=94677 in-progress, 
    eLog 44493.  https://savannah.cern.ch/support/index.php?138030 (Savannah DDM, for blacklisting).
    (iv)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the 
    network path between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, since 
    everything was then re-routed to the 'old' paths.  Problem under investigation.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    Not available this week.
    
    1)  6/24: At BNL a disk pool became unavailable on Thursday afternoon. ADC and DDM operations management were informed about the incident shortly afterwards, while experts 
    from Oracle investigated whether the data on disk was recoverable. On Friday night all experts (from Oracle and a company specializing in data recovery) concluded that 
    fixing the file system is impossible. A post-incident analysis by Oracle specialists is still in progress; once it is completed a detailed incident report will be posted. 
    eLog 44700.
    
    Follow-ups from earlier reports:
    
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 
    (ii)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    Update 6/6: Some progress getting the SRM configured correctly.  However, DDM transfers began failing with permission errors, so the site was blacklisted: 
    http://savannah.cern.ch/support/?138002.  eLog 44480.
    (iii)  6/7: NET2 - file transfers failing with "Communication error on send."  Problem similar to one from early May that affected only UK cloud sites 
    (see: https://ggus.eu/ws/ticket_info.php?ticket=93660).  An unrelated problem with GPFS was solved on 6/8.  https://ggus.eu/ws/ticket_info.php?ticket=94677 in-progress, eLog 44493.  
    https://savannah.cern.ch/support/index.php?138030 (Savannah DDM, for blacklisting).
    Update 6/23: Issues resolved.  From Saul: These errors were caused by our atlas.bu.edu SRM incorrectly not being in the OSG BDII. FTS3 then had the right host (presumably from AGIS) 
    but not the port number, causing transfers to fail with "gfal2"-related errors, and only for sonar tests. Thanks to FTS3/sonar, a workaround has been in place for days, and with 
    this fix the errors have stopped for several days.  GGUS and Savannah tickets were closed, eLog 44683.
    (iv)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the network path 
    between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, since everything was then 
    re-routed to the 'old' paths.  Problem under investigation.
    

Data Management and Storage Validation (Armen)

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • See notes from yesterday
    • Final release of perfSONAR 3.3 is likely within a week (currently at rc4). All sites will be asked to update; there is an option of preserving data or doing a clean install. A few items remain regarding OSG service registration.
    • Work continues on the dashboard.
    • A few things are being developed for alerts.
    • Inbound transfers to OU and UTA are doing better, but there are still problems on some incoming paths.
    • LHCONE - SWT2 had problems during the switch-over; asymmetric routes with CERN were created and the changes had to be backed out.
    • Michael: Some complaints about the dashboard, in particular distinguishing real problems from false "reds". Shawn: most of the red issues are known problems with version 3.2.
    • Info on client tools will be provided. Doug wants something specific for Tier 3s. Logical set is US ATLAS Tier 2s and Tier 1.
  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei: email sent to certain sites about upgrading GSI security. dCache sites already have it. Working with the Spanish Tier 1.
  • Ilija: enabled allow-fax true at MWT2 and AGLT2. No problems seen; will look at the statistics (a simple read-test sketch follows below).
  • Xrootd upgrade issues - packages are available from xrootd.org, OSG, and EPEL. Wei will discuss with Lucasz.
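  • For reference, a hedged sketch of the kind of read test used to confirm a site serves files through the federation after a change like enabling allow-fax. The redirector host and sample file path are placeholders, not actual FAX endpoints; it assumes the xrdcp client is installed.
    # Read a file through the federation redirector and check the exit code;
    # 0 means some participating site located and served the file.
    import subprocess

    redirector = "root://fax-redirector.example.org/"    # placeholder host
    sample_path = "/atlas/rucio/some/dataset/file.root"  # placeholder file

    rc = subprocess.call(["xrdcp", "-f", redirector + sample_path, "/dev/null"])
    print("FAX read test:", "OK" if rc == 0 else "FAILED (rc=%d)" % rc)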
this week

Site news and issues (all sites)

  • T1:
    • last meeting(s): Final stage of completing the network upgrade. Then the DB backend for LFC and FTS moves to Oracle 11g, and dCache is upgraded to 2.2.10. Farm on SL6. DDM enabled. SL6 worker nodes are running HC jobs - soon to open up for regular production and analysis. PandaMover is being brought up so as to resume production at the Tier 2s; hopefully finished by 7 pm. Buying new compute nodes - delivery date of June 10, 90 worker nodes, to be in production a few days after (25-30 kHS06). In collaboration with ESnet on a 100G transatlantic demo; preparing a link between BNL and MANLAN in NYC. On the European end, following the TERENA conference, extend the Amsterdam to CERN link.
    • this meeting:

  • AGLT2:
    • last meeting(s): Working hard getting ready for SL6. Production test jobs run just fine; user analysis jobs are failing, however, and it is unclear why. VMware systems at MSU and UM - the new door machine at MSU is configured and running - then will update the pool servers to SL6.
    • this meeting:

  • NET2:
    • last meeting(s): Release issue as mentioned - unresolved problem. Unintentional update of CVMFS on the HU nodes. Michael: since HU and BU are in the same data center, why not unify? The only reason was to minimize changes in the move; might do this in the future. HC stress test?
    • this week:

  • MWT2:
    • last meeting(s): Regional networking meeting next week, AGLT2+MWT2. Illinois CC GPFS issues last week were caused by a faulty Infiniband component. New compute nodes are online with SL6. Puppet rules set up for IU and UC nodes. Networking issues at IU: on Sunday the network was changed to the backup 10G link and the virtual router in the Brocade was re-enabled for LHCONE; however, checksum errors returned (191 out of 600K transfers). People are trying to understand the source.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): Loading SL6 and ROCKS - has a solution for this, isolating ISOs. Will bring up the head node and start bringing configurations forward. Malformed URIs from the deletion service.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Some problems with high-memory jobs - the result has been crashed compute nodes. Condor is configured to kill jobs over 3.8 GB. These are production jobs. No swap? Very little.
    • this meeting:

  • WT2:
    • last meeting(s): Preparing for the SL6 migration. Have CVMFS running on an SL6 machine; running a test CE, and it's working with LSF 9.1.
    • this meeting:

AOB

last meeting
this meeting


-- RobertGardner - 26 Jun 2013
