
MinutesAug142013

Introduction

Minutes of the Facilities Integration Program meeting, Aug 14, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Erich, Dave, Michael, Shawn, Fred, Saul, James, Mark, Horst, Wei, Patrick, Alden, Ilija, Doug
  • Apologies: Bob, Jason, Armen
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Out of band meeting this week. Briefly hear updates and collect site reports, any other operational issues.
    • this week
      • We'll need to cover the topics below
        • SL6 upgrade status
        • FY13 Procurements
        • Updates for Rucio support (Dave's page)
        • Distributed Tier 3 (T2 flocking), user scratch, Facilities working subgroup
      • Conversion of the existing storage inventory to the Rucio naming convention (Hiro).
      • How much inventory is in active use? Can we purge before renaming? Hiro has been studying this, and a meeting is being set up to discuss it. Now a hot topic.
      • This is much more complex than it might first appear; n.b. there is already a shortage of T1 space.
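      • As a point of reference for the conversion above, a minimal sketch of the Rucio deterministic naming convention, assuming the standard two-level md5 layout (the scope, filename, and site prefix below are illustrative examples, not actual inventory):
        # the two intermediate directory levels come from an md5 hash of "scope:filename"
        scope="data12_8TeV"
        lfn="NTUP_SMWZ.01007411._000021.root.1"
        md5=$(echo -n "${scope}:${lfn}" | md5sum | awk '{print $1}')
        echo "/atlas/rucio/${scope}/${md5:0:2}/${md5:2:2}/${lfn}"
        # prints something like /atlas/rucio/data12_8TeV/XX/YY/NTUP_SMWZ.01007411._000021.root.1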

FY13 Procurements - Compute Server Subcommittee (Bob)

last time
  • AdHocComputeServerWG
  • The only item of note is the HS06 measurements under SL6, and the reference quotes we are seeking for M620/R620 configurations.

this time:

  • Shawn: a summary of the discussion about possible discounts - you know about this; waiting for some sensitive discussions to converge.

Integration program issues

Rucio support (Dave)

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: (1) PBR (policy-based routing); (2) a dedicated routing instance, i.e. a virtual router for LHCONE subnets; (3) a physical router acting as gateway for LHCONE subnets. A minimal sketch of option (1) follows this list.
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.
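  • For illustration only, a minimal Linux-style sketch of the PBR idea in option (1) above; the subnets and next-hop address are placeholders, not any site's actual configuration, and production sites would implement the equivalent on their border routers:
      # create a separate routing table for LHCONE traffic
      echo "200 lhcone" >> /etc/iproute2/rt_tables
      # reach a subnet announced on LHCONE via the LHCONE next hop (placeholder addresses)
      ip route add 192.0.2.0/24 via 198.51.100.1 table lhcone
      # policy: traffic sourced from the local storage/worker-node subnet uses the lhcone table
      ip rule add from 203.0.113.0/24 table lhcone priority 100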

previous meeting

  • NET2: status unclear - waiting on instructions from Mike O'Conner, unless there have been direct communications with Chuck. Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then will re-establish the BNL link. Believes the throughput matrix has improved (a packet-loss problem seems to be resolved). Timeline unknown. Will ping the existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus; investigating whether PBR can be implemented properly. Will provide an update after the visit.

this meeting

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of network staff. A new manager is coming online. Will see about implementing PBR. Update next time.

The transition to SL6

MAIN REFERENCE

CURRENTLY REPORTED

last meeting(s)

  • All sites - deploy by end of May, June
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Main thing to consider is whether you upgrade all at once, or rolling.
  • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
  • Look at AGIS - changing panda queues much easier
  • Are the new queue names handled in reporting? They are, provided they are members of the same Resource Group.
  • What about $APP? Needs a separate grid3-locations file. But the new system doesn't use it any longer.
  • Schedule:
    • BNL DONE
    • June 10 - AGLT2 - will do rolling
    • MWT2 - still a problem with validations; could start next week
    • SLAC - week of June 10
    • NET2 - all at once. Week of June 17
    • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
    • OU - all at once. Rocks versus Puppet decision. After July 5.
  • Goal: Majority of sites supporting the new client by end of June. May need to negotiate continued support

  • BNL DONE
  • MWT2 DONE
  • AGLT2: 1/3 of worker nodes were converted; ran into a CVMFS cache-size configuration issue (see the config sketch after this list), but otherwise things are going well. The OSG app area is owned by usatlas2, but validation jobs are now production jobs. Doing a rolling upgrade, using the newest CVMFS release; n.b. the change in cache location. Expect to be finished next week.
  • NET2: HU first, then BU. At HU - did big bang upgrade; ready for Alessandro to do validation. Ran into problem with host cert. 2.1.11 is production. One machine at BU. Hope to have this done in two weeks. BU team working on HPC center at Holyoke.
  • SWT2 (UTA)
  • SWT2 (OU)
  • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
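  • On the AGLT2 CVMFS note above, a hedged sketch of the two cache settings usually involved (values are examples only; with the newer CVMFS releases the default cache location changed, so settings carried over from older installs are worth re-checking):
      # /etc/cvmfs/default.local (example values, adjust per site)
      CVMFS_CACHE_BASE=/var/lib/cvmfs    # cache location
      CVMFS_QUOTA_LIMIT=20000            # cache size in MB; too small a cache causes churn
      # then reload the configuration:
      #   cvmfs_config reload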

previous meeting update:

  • BU: now scheduled after August 11, since Alessandro is on vacation. Will commit to doing it August 12.
  • OU: working on final ROCKS configuration. Still validating the OSCER site - problems not understood.
  • CPB is done DONE. SWT2_UTA - also delayed because of Alessandro's absence.

this meeting

  • Updates?
  • BU: a new set of Panda queues is being validated right now. Local testing turned up problems with Adler32 checksumming (see the checksum sketch after this list). Expect completion within the next week.
  • OU: wants to wait until OSCER is validated before starting OCHEP. Saul will help.
  • UTA_SWT2 - still needs to convert. Waiting on getting some equipment moved over to the FW data center.
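  • On the BU Adler32 note above, one quick way to cross-check a local copy against the catalog value, assuming the xrootd client tools are installed (the file path is just an example):
      xrdadler32 /local/path/NTUP_SMWZ.01007411._000021.root.1
      # prints an 8-hex-digit adler32 plus the file name; compare with the checksum recorded in DDM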

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There has been a lack of production jobs lately
    • From ADC meeting - new software install system; Alastaire's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    not available this week
    
    1)  8/1: From Tomas Kouba: The default srmtimeout on all deletion service machines has been lowered from 1200 seconds to 300. eLog 45233.
    2)  8/1: IllinoisHEP - file transfer failures with SRM errors. Possibly a transient (network?) issue.  Errors went away, so https://ggus.eu/ws/ticket_info.php?ticket=96301 was closed, eLog 45245.
    3)  8/2: AGLT2 - file transfers heavily failing with "[SECURITY_ERROR] SRM Authentication failed" - from Bob at AGLT2: Our httpdDomain and dCacheDomain both were throwing OutOfMemory 
    java exceptions. They were restarted.  Issue resolved - https://ggus.eu/ws/ticket_info.php?ticket=95711 was re-opened, then closed on 8/6.  eLog 45251.
    4)  8/4 p.m. - SWT2_CPB & UTA_SWT2 - Frontier squid for the two sites was down. From Patrick: The partition holding the log files got filled and crashed the squid process.  A configuration problem 
    with the frontier-squid software has been addressed and this should not occur again.  https://ggus.eu/ws/ticket_info.php?ticket=96349, https://ggus.eu/ws/ticket_info.php?ticket=96350 were closed. 
    eLog 45272, 45273.
    5)  8/6 early a.m.: UPENN LOCALGROUPDISK file transfer errors (SRM).  Issue quickly resolved by the site admin (no details).  https://ggus.eu/ws/ticket_info.php?ticket=96393 was closed, eLog 45283.
    6)  8/6: WISC HOTDISK transfer errors ("Unexpected Gatekeeper or Service Name globus_gsi_gssapi: Authorization denied") - https://ggus.eu/ws/ticket_info.php?ticket=96406 - eLog 45289.  Also 
    affects LOCALGROUPDISK, site blacklisted in DDM.
    7)  8/6: SLACXRD - problematic WN (bullet 0084); all jobs failing on the host.  https://ggus.eu/ws/ticket_info.php?ticket=96419 assigned, eLog 45293.
    
    Follow-ups from earlier reports:
    
    (i)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the network path between OU & BNL.  
    As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, since everything was then re-routed to the 'old' paths.  Problem 
    under investigation.
    (ii)  7/7: SLACXRD - file transfer failures ("[USER_ERROR] source file doesn't exist").  https://ggus.eu/ws/ticket_info.php?ticket=95491 in-progress, eLog 44910.
    Update 7/16 - still see these errors.  https://ggus.eu/ws/ticket_info.php?ticket=95763 was opened, marked as a slave to master ticket ggus 95491.
    Update 7/24: ggus 95491 was closed, but then re-opened when the errors reappeared. 
    (iii)  7/24: NERSC_LOCALGROUPDISK: transfer failure [Checksum mismatch] - https://ggus.eu/ws/ticket_info.php?ticket=96116 - eLog 45176. On 7/25 a maintenance outage was declared to work on a 
    filesystem problem. https://savannah.cern.ch/support/index.php?138943 (site blacklisting).
    

  • this week: Operations summary:
    Summary from the weekly ADCoS meeting:
    not available this week (no meeting)
    
    1)  8/8: AGLT2 - file transfer failures with the error "SRM Authentication failed."  From Shawn: The dCache headnode had another Out of Memory issue. dCache services were restarted at 7:10 AM Eastern. 
    This should restore appropriate access to the SE.  https://ggus.eu/ws/ticket_info.php?ticket=96477 was closed, eLog 45338.
    2)  8/10: AGLT2 - file transfer failures ("failed to contact on remote SRM").  From Shawn: We have been working with dCache support trying to debug our Out of Memory issues. We had to restart dCache 
    services and downgrade from a test version.
    dCache should be operational again within the next 10-15 minutes (once the old requests clear).  https://ggus.eu/ws/ticket_info.php?ticket=96562 was closed, eLog 45370.
    3)  8/12: The BNL_CLOUD queue has been in test for several days, with test jobs failing at a rate of ~50%.  From John Hover: The worker nodes were exhibiting "too many open files" errors. A new VM image was 
    made with altered limits and an unneeded service disabled (see the limits sketch at the end of this section). That image will be rolled out today. The underlying cause for the error is still unclear. https://ggus.eu/ws/ticket_info.php?ticket=96584 in-progress, 
    eLog 45379.
    4)  8/13: AGLT2 - file transfers with SRM errors, still related to the dCache "out of memory" issue. Also transfers failing with the error "[SECURITY_ERROR] SRM Authentication failed." From Shawn: We are 
    working with the dCache developers to track down the root cause. During the last OOM we updated our dCache to an "instrumented" version which we are now running. The dCache services are back up 
    now but will take a while to run smoothly again. https://ggus.eu/ws/ticket_info.php?ticket=96614 in-progress, eLog 45403.
    5) ADC Weekly meeting agenda:
    https://indico.cern.ch/conferenceDisplay.py?confId=266771
    
    Follow-ups from earlier reports:
    
    (i)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the network path between OU & BNL.  
    As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, since everything was then re-routed to the 'old' paths.  
    Problem under investigation.
    (ii)  7/7: SLACXRD - file transfer failures ("[USER_ERROR] source file doesn't exist").  https://ggus.eu/ws/ticket_info.php?ticket=95491 in-progress, eLog 44910.
    Update 7/16 - still see these errors.  https://ggus.eu/ws/ticket_info.php?ticket=95763 was opened, marked as a slave to master ticket ggus 95491.
    Update 7/24: ggus 95491 was closed, but then re-opened when the errors reappeared. 
    (iii)  7/24: NERSC_LOCALGROUPDISK: transfer failure [Checksum mismatch] - https://ggus.eu/ws/ticket_info.php?ticket=96116 - eLog 45176. On 7/25 a maintenance outage was declared to work on a 
    filesystem problem. https://savannah.cern.ch/support/index.php?138943 (site blacklisting).
    (iv)  8/6: WISC HOTDISK transfer errors ("Unexpected Gatekeeper or Service Name globus_gsi_gssapi: Authorization denied") - https://ggus.eu/ws/ticket_info.php?ticket=96406 - eLog 45289.  Also affects 
    LOCALGROUPDISK, site blacklisted in DDM.
    (v)  8/6: SLACXRD - problematic WN (bullet 0084); all jobs failing on the host.  https://ggus.eu/ws/ticket_info.php?ticket=96419 assigned, eLog 45293.
    Update 8/9: No details from the site, but the problem went away.  ggus 96419 was closed - eLog 45344.
    

  • There are a couple of Tier 3 tickets that are not getting any attention, e.g. NERSC (no reply for two weeks) and one at Wisconsin. Are these causing operational issues for ADC? These sites are evidently not well maintained - should we have a mechanism to shut them off if there is no response by some expiration date?
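  • On the BNL_CLOUD "too many open files" item above, a generic sketch of the kind of limit change baked into a worker-node image (values are illustrative; the actual change made to the VM image is not detailed in the report):
      # /etc/security/limits.conf (example values)
      *   soft   nofile   65536
      *   hard   nofile   65536
      # verify from a fresh login on a node built from the image
      ulimit -n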

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Dark data discussions are on-going. Will provide a comparison for sites; preliminary report next week.
    • Localgroupdisk. Some deletions at UC, about 150 TB. Also contacting top users at SLAC.
    • Userdisk cleanup finished everywhere except BNL.
  • this meeting:
    • No report - Armen on vacation

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • See Johannes' email. Sites should follow-up.
this week
  • Migration to Xrootd 3.3.3. Changes to configuration. There is a twiki page with instructions - will add a table to mark progress.
  • Rucio N2N - to be tested in the US.
  • FAX failover is working in the US and some UK sites, and we have monitoring for this. There is a plugin in Pandamon providing a page. A new version of the monitor will be able to filter according to owner.
  • Working with Valerie. Will have historical views as well.
  • We are now reporting problems with sites - sent via the SSB.
  • f-stream now enabled. Rucio N2N testing at MWT2. Developments for measuring response times.
  • New xrootd version developed.
  • Discussion of moving the UDP collector at SLAC to the GOC.
  • Doug - working with the Australians. Tier 3 IB - Tier 3's may want to federate.

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting: Hiro has started direct access tests. Planning to move the ANALY short queue to direct access, and anticipate moving more in the future. Working on new storage hardware - issued an order for 1.5 PB of storage at $120/TB; will run ZFS on top and start deploying as soon as possible. Hiro and others have made good progress evaluating the Ceph block device (RBD) on old retired storage; it looks attractive to us - nice features for slicing and dicing the storage. The 100 Gbps-capable Arista switch arrived; readying for a trans-Atlantic test in mid-September. IB evaluation: 56 Gbps interfaces into these machines.

  • AGLT2:
    • last meeting(s): Working hard on getting ready for SL6. Test production jobs run just fine; user analysis jobs are failing, however - cause unclear. VMWare systems at MSU and UM - the new door machine at MSU is configured and running; next the pool servers will be updated to SL6.
    • this meeting: OOM crashes on the dCache headnodes. Ticket open. Running version 2.2.12; troubleshooting. Wants to move to 2.6, which will require gplazma 2 natively.

  • NET2:
    • last meeting(s): John back from vacation and will make some networking changes and swap out HU gatekeeper hardware; & install new OSG version.
    • this week: HU will be down for the entire week as they convert to their new 30k core cluster. Will ATLAS be able to take advantage of these? These resources could appear behind an existing

  • MWT2:
    • last meeting(s): Tier 3 flocking project or OSG connect. Networking problem at IU seems to be resolved; opening up storage pools at IU.
    • this meeting: UIUC - found an additional 12 nodes. Will be up to 1600 job slots at Illinois.

  • SWT2 (UTA):
    • last meeting(s): The big item: the upgrade is complete - new storage, new edge nodes, all gridftp servers at 10g, SL6, and OSG components updated to the latest. All went well.
    • this meeting: Writing a local site mover that is xrootd-based, relying only on the binary release of xrootd. There is some logic in the program; using xrd to just stat the file (want to look at xrdfs - see the sketch after the site list). Using an old xcp mover built into the pilot, which preloads the libraries before athena starts. Our xrootd FAX door does not seem stable; it runs in proxy mode. Running a newer version of torque and seeing pilots hammer it. Next: get equipment moved to the other lab, do the update, and add some new storage.

  • SWT2 (OU):
    • last meeting(s): Some problems with high-memory jobs have crashed compute nodes. Condor is configured to kill jobs over 3.8 GB. These are production jobs. Swap? Very little.
    • this meeting:

  • WT2:
    • last meeting(s): 1. SL6 migration is almost complete; jobs are running OK, just need to reinstall batch nodes. 2. SLAC security approved our security plan for opening outbound TCP from batch nodes; it is waiting for the CIO's signature. We will need to re-IP the batch nodes. We will likely combine 1 and 2 next week (if we get the CIO's signature, which I think is just paperwork).
    • this meeting: Security team is now going to permit outbound IP. RHEL6 upgrade nearly complete. Will combine those two things together. Hopes to have this in the next two weeks. Two quotes from Dell for R620's and M620's. Similar pricing! Thinking about adding IB. Other groups may add this as well.
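  • On the SWT2 (UTA) site-mover note above, the file stat can be done with either the older xrd client or the newer xrdfs tool (the host and path below are placeholders, not actual endpoints):
      # old-style client
      xrd xrootd.example.org stat /atlas/rucio/some/path/file.root
      # xrdfs, shipped with recent xrootd releases
      xrdfs xrootd.example.org:1094 stat /atlas/rucio/some/path/file.root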

Distributed Tier 3 issues (Rob)

  • Prototyping work at MWT2: http://twiki.mwt2.org/bin/view/Main/FlockingProject
  • We need to setup Facilities sub-working group to discuss Tier 3 --> Tier 2 flocking issues in detail.
  • Writeable "scratch" storage space at each Tier 2 "FAXbox"
  • Work by Lincoln at MWT2 (see also the RBD command sketch after the transfer test below):
    - Ceph clusters have 3 basic components :
    	1) Monitor service. (ceph-mon). Keeps track of cluster membership, configuration, and state. 
    		* We have 1 running, plan to deploy 3 total servers as per Ceph recommendation. (3 servers)
    	2) Object store daemons (ceph-osd). These are daemons that run on top of the disks.
    		* We have 60x750GB disks, evenly spread across 10 servers (6 per server). 
    	3) Metadata service (ceph-mds). Coordinates access to the object store daemons. 
    		* We have 1 of these (shares a server with one of the monitors)
    
    - Ceph can export the filesystem in one of two ways:
    	1) CephFS. A kernel module that presents a filesystem to a node. Behaves essentially like a 'door' that we use in the canonical storage element setting. 
    	2) RADOS Block Device (RBD). Presents an arbitrarily sized block device to the OS (e.g., /dev/rbd0) that can be formatted with any filesystem. Seems to be used in OpenStack environments.
    
    - CephFS is fully POSIX, so overlaying xrootd, GO, etc. is all easily possible. These things are equally possible with RBD (you could probably even stick a dCache pool on top of RBD if you wanted). RBD is considered more production-ready than CephFS, but seems slower in my limited testing.
    
    - Ceph performance can be greatly improved by also sticking an SSD into each storage node for journaling. Ceph behaves much better under parallel workloads. 
    
    - Ceph can set arbitrary replication levels, we have 2x replication currently. 
    
    - Our Ceph deployment has 40 TB raw, 20 TB usable after replication.
    
    - The newest version (Ceph 0.67 'Dumpling') was released today.

  • A simple test
    [root@faxbox ~]# xrdcp -v root://fax.mwt2.org:1094//atlas/dq2/user/flegger/MWT2/user.flegger.MWT2.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00/NTUP_SMWZ.01007411._000021.MWT2.root.1 /mnt/ceph/.
    [xrootd] Total 3703.14 MB       |====================| 100.00 % [528.9 MB/s]
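  • A hedged sketch of the RBD export mode described in Lincoln's notes (the pool and image names, size, and mount point are examples, not the actual MWT2 deployment):
      rbd create faxscratch --size 1048576    # 1 TB image; size is given in MB
      rbd map faxscratch                      # appears as a block device, e.g. /dev/rbd0
      mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt/faxscratch
      ceph osd pool set rbd size 2            # 2x replication, as in the deployment above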
     

AOB

last meeting

this meeting
  • Will have a meeting next week, to get back in phase.
  • Fall US ATLAS Facilities workshop. Tentative dates: December 11, 12. Location: University of Arizona (Tucson).


-- RobertGardner - 14 Aug 2013
