
MinutesMay152013

Introduction

Minutes of the Facilities Integration Program meeting, May 15, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode:
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Rob, Bob, Fred, Dave, Patrick, Joel, Armen, Mark, Saul, John,
  • Apologies: Horst, Kaushik, Ilija, Shawn, and other Tokyo travelers
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik): Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • CapacitySummary - please update v27 in google docs
      • Program this quarter: SL6 migration; FY13 procurement; perfsonar update; Xrootd update; OSG and wn-client updates; FAX
      • WAN performance problems, and general strategy
      • Introducing FAX into real production, supporting which use-cases.
    • this week
      • v28 of the spreadsheet can be communicated to John Weigand

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2: DONE
  • MWT2: 3.3 PB now online. 180 TB remaining to complete the 1 PB upgrade
  • NET2: DONE
  • SWT2_OU: 120 TB installed. DONE
  • SWT2_UTA: Progressing; R620 motherboard NICs needed to be replaced. Working on configuration of the storage modules. 3660i options under study. Then, how to bring the storage into the cluster, and whether a downtime is needed. There will be some additional network configuration to do (might be avoidable if an existing rack is re-used).
    • Michael: a large aggregation switch is needed in order to bring this storage into a usable state.
    • Rob: internal bottleneck if used with existing equipment?

this meeting:

  • Tier 1 DONE
  • AGLT2 DONE
  • WT2: DONE
  • NET2: DONE
  • SWT2_OU: 120 TB installed. DONE
  • MWT2: 3.7 PB now online DONE
  • SWT2_UTA: One of the systems is built and deployable; good shape, a model for moving forward. Will need to take a downtime - but will have to consult with Kaushik. Should be ready for a downtime in two weeks. If SL6 is a harder requirement, will have to adjust priority.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, MWT2 2 of 3 sites DONE (MWT2_IU needs action, see below)
  • SLAC DONE

notes:

  • Updates?
  • OU: Horst believes it will be straightforward, will follow-up.
  • UTA: Patrick: waiting to hear back from UTA networking staff to see the status. Will check today.
  • NET2: waiting on Holyoke move.
  • IU: Brocade and IU engineers are still engaged in working on the problem.

this meeting

  • Updates?
  • OU - status unknown.
  • UTA - conversations with LEARN, UTA, I2 are happening. There has been a meeting. They are aware of the June 1 milestone.
  • NET2 - new 10G link is set up; 2 x 10 Gb/s to HU. Chuck is aware of the June 1 LHCONE milestone. Saul will follow up shortly and expects no problem by June 1.
  • IU - plan is to decide Friday whether we need to bypass the Brocade and access the Juniper directly to peer with LHCONE. Fred is working closely with the engineers.

The transition to SL6

last meeting(s)
  • All sites - deploy by end of May
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • At MWT2 will use UIUC campus cluster nodes; will start on this tomorrow.
  • Need a twiki setup to capture details.
  • Doug - provide a link from the SIT page. Notes that prun does compilation.
  • Expect all sites to participate and convert to SL6 as old clients will be disabled in June.

  • This gets us most quickly to the latest dq2 client.
  • NET2: concerned about the timescale. Will have a problem meeting the deadline.
  • At BNL, the time required was about 4 hours.
  • WT2: have a new 3,000-node cluster coming up as RHEL6 that ATLAS may have access to. Timing issues for the shared resources. ATLAS-dedicated resources can be moved to SL6, though.
  • AGLT2: concern is ROCKS, glued to SL5. Simultaneous transition to Foreman and Puppet.
  • OU: transition from Platform to Puppet. OSCER cluster is RHEL6 (though there are job manager issues).
  • UTA: cannot do it before June 1. ROCKS, plus being heavily involved with storage and networking, limits the time.

this meeting

  • Bob - discussed last week; no way to be ready with Puppet and Foreman. Decided to go back to a ROCKS SL6 server. (Will transition to Puppet later this summer, more smoothly.)
  • UTA - no time in the past two weeks
  • Issue is ROCKS doesn't recognize SL6
  • NET2 - will try.

Evolving the ATLAS worker node environment

last meeting

this meeting

  • Dave is working the issue
  • At MWT2, a number of SL6 nodes are running (production)
  • OSG tarball is distributed in CVMFS. Minor fixes are reflected in today's 3.1.18
  • Production jobs are validated
  • User analysis jobs run into trouble under certain circumstances. The payload does LFC operations, which leads to an incompatible libstdc++ library with asetup. MWT2 had the issue, but BNL did not. The difference has to do with the copy tool.
  • Direct I/O invokes a different code path.
  • On SL5 it works since the libraries are compatible.
  • Can John try direct access on his queue?
  • Workaround is to set LD_PRELOAD to the correct libstdc++ (see the sketch after this list).
  • How are FAX accesses affected?
  • What about SL6 validation?
  • Need a set of procedures for the transition that each site can follow.
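  • A minimal sketch of the LD_PRELOAD workaround above, assuming the payload is launched through a small Python wrapper; the COMPAT_LIBSTDCXX path is a hypothetical placeholder and the real location depends on the release/compiler layout at each site.

    #!/usr/bin/env python
    # Sketch only: prepend a compatible libstdc++ via LD_PRELOAD before
    # launching the payload, so it wins over the SL6 system library.
    # The library path below is a hypothetical placeholder.
    import os
    import sys

    COMPAT_LIBSTDCXX = "/path/to/release/gcc/lib64/libstdc++.so.6"

    def run_with_preload(cmd):
        env = dict(os.environ)
        preload = env.get("LD_PRELOAD", "")
        env["LD_PRELOAD"] = COMPAT_LIBSTDCXX + ((" " + preload) if preload else "")
        os.execvpe(cmd[0], cmd, env)  # replace this process with the payload

    if __name__ == "__main__":
        if len(sys.argv) < 2:
            sys.exit("usage: preload_wrapper.py <payload command> [args...]")
        run_with_preload(sys.argv[1:])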

Updates from the Tier 3 taskforce?

last meeting
  • Michael: a survey has been circulated about the resources deployed, usage, LOCALGROUPDISK, etc. Published today.
  • Result of this will be a set of recommendations to the IB.

this meeting

  • Fred - Tier 3 institutes have been surveyed, about 1/2 have responded. In general, people are happy with local resources.
  • Report is due by July
  • Doing testing of Tier 3 scenarios using grid or cloud resources
  • Working with AGLT2 as a test queue.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Running smoothly
    • Pilot update from Paul
    • Large number of waiting jobs from the US - premature assignment?
    • Following up with PENN: email got no response. Paul Keener reported an auto-update to his storage; reverted back to the previous version (March 28). Transfers are now stable at the site, and the ticket has been closed.
    • Discussion about site storage blacklisting. It's essentially an automatic blacklisting. Discussed using atlas-support-cloud-us@cern.ch. The problem is what to do with the Tier 3s. Doug will make sure the Tier 3 sites have correct email addresses in AGIS.
  • this meeting:

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Will need another USERDISK cleanup. Hiro will send an email.
    • PRODDISK needs to be cleaned at NET2 and MWT2.
    • The SRM values are reported incorrectly at SLAC and SWT2.
    • The SRM values at SWT2 space tokens dropped, then came back, except for GROUPDISK. Note this is transient behavior.
    • GROUPDISK loss at MWT2 - site report below.
    • NET2 deletion issue - it is still slow. There is also a reporting issue here as well. This is reported every day in Savannah. There is an active discussion. Saul believes it is functional after lowering the chunk size to 10 (other sites have 80-120). The lcg-del command used by Tomas sometimes drops connections - about 1 out of 10 from BU. Next step? Try duplicating the problem on a new machine. Dropped packets?
  • this meeting:
    • No major issues
    • AGLT2 GROUPDISK needs space
    • USERDISK cleanup will happen this week
    • Wants to check with Saul about deletion rates, which are unchanged after the move. Will increase the chunk size parameter: 60-80 files is the default, while currently NET2 = 10 (see the sketch after this list). Armen believes there is a bottleneck.
    • Saul was able to reproduce the problem - dropped connections - which depends on location of client. Where does the deletion service client run? Armen: there are dedicated machines at CERN.
    • Dark data - mainly has been in USERDISK. Cleaned all datasets prior to 2013. Sites: please check for dark data.
    • Armen - will provide a dark data summary for next meeting.
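    • To illustrate the chunk size parameter above: a minimal sketch of chunked deletion (hypothetical helper names; the real deletion service issues bulk requests from dedicated machines at CERN).

      def delete_surls(surls):
          # Hypothetical placeholder for one bulk-delete request.
          print("deleting %d files in one request" % len(surls))

      def chunked_delete(surls, chunk_size=80):
          # Larger chunks mean fewer requests per deleted file; NET2's
          # current value of 10 is suspected to be the bottleneck compared
          # with the 60-80 default.
          for i in range(0, len(surls), chunk_size):
              delete_surls(surls[i:i + chunk_size])

      if __name__ == "__main__":
          files = ["srm://example.org/userdisk/file%04d" % n for n in range(250)]
          chunked_delete(files, chunk_size=10)   # NET2's current setting: 25 requests
          chunked_delete(files, chunk_size=80)   # closer to the default: 4 requests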

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=251040 (presented by Pavol Strizenec)
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_6_2013.html
    
    1)  5/2: NET2 - https://ggus.eu/ws/ticket_info.php?ticket=93660 was re-opened for FT transfer errors between the site and several UK sites.  Issue was eventually traced to a 
    configuration issue in BDII (a change had been made the same day - see more details in the ticket).  ggus 93660 was again closed on 5/3.  eLog 44079.
    2)  5/6: NET2 - site down for a maintenance outage (moving to a new location).  Expected to last until 5/11. eLog 44109.
    3)  5/7: Presentation on the status of the new ATLAS s/w installation system (Alessandro De Salvo) at the ADC Weekly meeting:
    https://indico.cern.ch/getFile.py/access?contribId=5&resId=1&materialId=slides&confId=250155
    4)  5/7: SWT2_CPB - DDM errors due to a problematic storage server.  One of the virtual drives is off-line - attempting to recover the partition. eLog 44136.
    
    Follow-ups from earlier reports:
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 
    (ii)  4/25: NERSC_LOCALGROUPDISK transfer failures with checksum errors.  Update from Iwona on 5/1: Automatic yum update killed our "adler helpers" - daemons handling 
    remote checksum calculations. We improved monitoring to catch it faster and plan to patch ssh startup scripts. 
    Issue seems to be resolved - o.k. to close https://ggus.eu/ws/ticket_info.php?ticket=93666?  eLog 43997.
    Update 5/2: ggus 93666 was closed. eLog 44059.
    (iii)  4/30: File transfers to SLACXRD (from TAIWAN-LCG2) were failing due to an expired host cert on the SLAC side.  https://ggus.eu/ws/ticket_info.php?ticket=93747 in progress, eLog 44043.
    Update 5/1: no recent errors - ggus 93747 was closed. eLog 44054.
    (iv)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (presented by Michal Svatos):
    
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=252457
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_13_2013.html
    
    1)  5/10: SWT2_CPB - user reported a problem while attempting to transfer data from the site.  Likely related to the user's certificate / CA (similar problem has been seen in the past for a 
    few CA's).  Under investigation - https://ggus.eu/ws/ticket_info.php?ticket=93976.
    2)  5/11: MWT2 - Sarah reported two issues affecting DDM at the site: (i) a scheduled network outage this morning had greater scope than we
    anticipated, and took the UC site off the network for 30-45 minutes. (ii) a disk shelf on one of the dCache pools went offline, and the files
    on that pool were unavailable. We've brought that pool back online and those files are now available.
    3)  5/13: No ADC weekly, but some notes from Alastair Dewhurst regarding WLCG Squid Monitoring:
    https://indico.cern.ch/getFile.py/access?contribId=2&resId=minutes&materialId=minutes&confId=250155
    4)  5/14: NET2 - file transfer failures ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]" & "[INTERNAL_ERROR] Checksum mismatch").  Issue understood - from Saul:  
    We had a couple of disk failures, but the errors were actually caused by an unrelated configuration mistake. FTS and DDM restarted.  https://ggus.eu/ws/ticket_info.php?ticket=94050 closed, 
    eLog 44210.
    5)  5/14: BU_ATLAS_Tier2: frontier squid is down (see: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/sitehistory?site=BU_ATLAS_Tier2#currentView=Frontier_Squid).  
    https://ggus.eu/ws/ticket_info.php?ticket=94054 in-progress, eLog 44197.
    
    Follow-ups from earlier reports:
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 
    (ii)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    (iii)  5/6: NET2 - site down for a maintenance outage (moving to a new location).  Expected to last until 5/11. eLog 44109.
    Update 5/12: outage completed - services being restarted.
    (iv)  5/7: SWT2_CPB - DDM errors due to a problematic storage server.  One of the virtual drives is off-line - attempting to recover the partition. eLog 44136.
    Update 5/9: RAID issue resolved.  A small number of files were suspect, these were reported to DDM ops to get them removed as replicas at SWT2_CPB.  eLog 44145.
    

  • Generally production has been running well
  • US sites drained a bit - due to lack of MC tasks. Only 400 assigned jobs for the entire cloud presently! <3000 activated!
  • Armen: there was a note from Wolfgang that there was a lack of tasks.
  • From ADC weekly meeting a week ago - see link above.
  • SMU tickets open, a long thread.

DDM Operations (Hiro)

  • this meeting:

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Release by March, with 10G. Goal to deploy across the facility by the end of March.
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2 - timeline?
    • Prepare with discussions at NET2, even if the setup will come in with the move to Holyoke; get organized. Move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing LHCONE.
    • rc2 for perfsonar; rc3 next week; sites should prepare to upgrade.
    • Network problems seen generally for westbound traffic.

    • UTA issue. Longstanding poor in-bound performance. ESnet ticket is open. Patrick and Sarah have been doing testing, observing checksum problems (see the checksum-comparison sketch at the end of this section). Also at BNL - but they've cleared up (not correlated with a specific event). Hiro ran a load test - 60 MB/s. The in-bound path seems to be taking the commodity path. (Patrick is following up this afternoon.) Outbound direct has been better, but there are now checksum problems. Hiro is updating the ESnet ticket with measurements.
    • Michael: we need to get our providers involved as soon as possible, and through official channels. Note - we are the customers of the service. Need to use the ticketing system.
    • Notes:
       Meeting Notes for NA Throughput Meeting --- April 30, 2013
      ===========================================
      
      Attending: Shawn, Dave, Mark,Andy, Jason, Patrick, Philippe, Rob, Horst, Tom, John, Saul, Sarah
      Excused: Garhan
      
      1) Meeting agenda:  Changes, edits or additions?  None needed.
      
      2) Status of perfSONAR-PS Install in USATLAS
      
            i) Mesh configuration.  Sites not yet deployed with mesh configuration as of last call needing a status update (see notes from last call below):
                  a) NET2  -  Did Augustine complete this?  Status?  Saul isn't sure and will find out the status. 
                  b) WT2  - Please provide an update? Email Wei and Yee for status
                  c) BNL -  Was BNL updated to use the mesh?  (Using RC2 or RC3?)  Had to fall-back to previous version because of problems  This is the only 1G host at BNL so we will wait on retrying until UTA issue is resolved or v3.3 is ready.
      
           ii) 10GE Bandwidth instance deployed?  (If not, when?  See last notes below for status at the end of November)
                  a) SWT2:  Both OU and UTA needed downtimes.  OU was scheduled April 8. OU done?  UTA timeline?   SWT2_OU had network upgraded.  Installed RC3 on 10G host for testing.  Once v3.3 is out Horst will rename as production.  SWT2_UTA is waiting on replacement 10G NICs this week.  Hopefully this week.
                  b) WT2:  Have 3 10G perfSONAR-PS hosts but not allowed to get results  of tests until security approves.  Any update on approval status?  Email Wei and Yee for status
      
      3) perfSONAR-PS issues AND testing v3.3 RC3
      
          i) New issues noted?  None reported. 
         ii) Toolkit update status: RC3 info (Andy).   Andy reported that a bunch of new issues noted.  RC4 will likely happen soon.  Then 1-2 more weeks before a final version. Important we have this release as reliable and robust as possible.  Horst asked about his problems getting services to start.  Logs sent to Aaron to determine if this is specific to OU host or an RC3 issue. 
         iii) Modular dashboard news and plans / GitHub status  (Tom, Ian, ?)   GUI updated with CRUD capabilities.   Tom will go back to datastore work.   Mesh-configuration reading or traceroute matrix creation.  Action item: Shawn will email Tom info on the mesh-configuration from Aaron.  
         iv) NTP clock issue at MWT2_UC? (from last meeting notes).  Dave found no problem BUT maybe LHCONE related load may be causing issues.  
         v) Status of asymmetric BWCTL tests at Nebraska? (Garhan) Garhan will update via email or at next meeting.
        vi) Test of "Update" from v3.2.2/SL5 to v3.3 RC3/SL6?  (Garhan).  Will be trying this during the next week 
      
      4) Throughput 
      
          i)  New issues to track? None.  
         ii)   UTA: Discussion on current status of the inbound poor throughput problem.  Please see perfSONAR-PS results on the old dashboard at http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USATLAS   Also we have some 1G testing from http://psum06.aglt2.org/toolkit  and http://psum05.aglt2.org/toolkit running since yesterday.    Discussion of PDF file  Shawn sent (attached).  Patrick mentioned checksum errors on "outbound" files to BNL (10%?).   No problems to AGLT2, OU, UC, NET2 (lcg-cp or srmcp was used; 200 tests in a row) but did see it on IU (gridftp destination); IU was very bad though.   Now testing with IU using wget/http swapping file back and forth.   Sarah reported results.   UC had 0 errors and IU had 1 in 1000 which had a byte-swap problem.  Sarah asked if the first two bytes of the ADLER32 checksum had a problem and the last two are OK.  Patrick: in general that was the case for the BNL  transfer.   IU->UTA is slower than UTA->IU.  Testing is continuing.  Action item: Patrick will contact LEARN/UTA networking to determine why inbound seems to be coming via commodity/GPN connection   Patrick will try to get inbound routing fixed and then repeat tests.   
         iii)  New developments?
         iv) Monitoring (Hiro?)    Hiro will report on this at the next meeting. (Ran out of time)
      
      5) Site Round-table and reports (any/all of USCMS, ATLAS Canada or US ATLAS)
         (Ran out of time)
      
      6 AOB and next meeting  
          Two weeks from today is the ATLAS TIM in Tokyo.  We have enough to discuss that we will change the schedule to meet NEXT week , May 7th.  Shawn will be at ANSE meeting in Caltech but will try to run the meeting from there. 
      
      Please send along any additions or corrections via email.  Thanks,
      
      Shawn
      

  • this meeting:
    • Sites should prepare for perfsonar rc3
    • 10G boxes are up and running at UTA; will run 3.2.2, which is stable.
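    • Related to the UTA/IU checksum debugging in the notes above: a minimal sketch, assuming local file access on both ends, of comparing ADLER32 checksums of a source file and a transferred copy. File names are hypothetical.

      import zlib

      def adler32_of(path, blocksize=1024 * 1024):
          # Compute the ADLER32 checksum of a file, reading in blocks.
          value = 1  # adler32 starting value
          with open(path, "rb") as f:
              for block in iter(lambda: f.read(blocksize), b""):
                  value = zlib.adler32(block, value)
          return value & 0xffffffff

      if __name__ == "__main__":
          src = adler32_of("source_copy.dat")       # checksum before transfer
          dst = adler32_of("transferred_copy.dat")  # checksum after transfer
          print("source      %08x" % src)
          print("destination %08x" % dst)
          if src != dst:
              # A byte-swap corruption like the one discussed would show up
              # as a mismatch confined to part of the hex string.
              print("MISMATCH")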

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei - Gerri packaged the VOMS security module into an RPM; it works at SLAC and with DPM sites. Once Gerri has a permanent location for it, he will create a set of detailed instructions.
  • Ilija has requested a central git repo at CERN for FAX. Can the WLCG task force provide a point of coordination?
  • Ilija - doing FDR testing. Also testing using the Slim Skim Service. Seeing nice results for file transfers to UC3. Now seeing 4 GB/s. How is the collector holding up?
this week

Site news and issues (all sites)

  • T1:
    • last meeting(s): looking at next-gen network switches; equipment for the eval - Cisco, Extreme, Arista; discovered issues with bandwidth per stream (not exceeding 10G, e.g.). Will get a unit in the June/July timeframe. CPU procurement: Dell won the bid (over HP, Oracle, IBM). Two Sandy Bridge, 2.3 GHz, 64 GB memory, 8 x 500 GB = 4 TB ($5k). Tested extensively.
    • this meeting:

  • AGLT2:
    • last meeting(s): 40G connection between T3 and T2 - working well. Using LR 40 Gbps optics from Color Chip. S4810.
    • this meeting: Running well. Working with ROCKS6, as noted above. Setting up T3 test queue this afternoon. 40 cores (5 PE1950s). There is a Twiki page.

  • NET2:
    • last meeting(s): Move begins Monday, May 6.
    • this week: Big move successful! DDM is up, BU is up; the only major problem was hardware for the HU gatekeeper - may need to switch. A few disks were damaged. T3 is working. HU will come back online very soon. Have a lot of funds for expansion; storage needs replacement; need to replace old IBM blades.

  • MWT2:
    • last meeting(s): Chicago: all storage online except for one server; the H810 controller on one was replaced. IU networking. UIUC - 16 new Dell machines now up (C8220s with sleds), 512 additional job slots, EL 6, osg-wn-client 3.1. Set BIOS settings to performance; HS benchmarks in line with AGLT2.
    • this meeting: uc2-s16 online, now at 3.7 PB capacity. Network activities at IU. UIUC - campus cluster down today for GPFS upgrade. Building SL6 nodes, puppet rules in place, nodes deployed at UIUC and UC.

  • SWT2 (UTA):
    • last meeting(s): A couple of storage issues possibly creating the dips in the accounting plots. New version of panda mover in the git repo that uses python 2.6. Use with caution. Will be busy getting new storage online.
    • this meeting: Been busy. perfSONAR 10G is up and running and performing well; still need to get monitoring straightened out. Identified an issue in the campus network related to a gateway switch dropping packets; it seems to be responsible for the throughput issues to UTA. Hiro's tests go up to 300 MB/s download speeds. Evaluating that switch and others to improve performance; looking at the F10 S4810. Testing today and tonight. Diagnosing issues with pulling data from IU - 1% checksum errors (not seen at UC): 0 errors out of 1200 at UC, 40 errors out of 1400 from IU. 3660 deployment coming along - moved back to the SL6.3 kernel (Dell problems with SL6.4).

  • SWT2 (OU):
    • last meeting(s): Downtime next week. 10g perfsonar node as well.
    • this meeting: All is fine. SL6 won't happen until Horst returns from Germany by July 1. Does expect to have SL6 jobs running on OSCER by June 1.

  • WT2:
    • last meeting(s): No burning issues; future batch system will be LSF 9.1. Testing.
    • this meeting:

AOB

last meeting

this meeting
  • None.


-- RobertGardner - 14 May 2013
