
MinutesAug212013

Introduction

Minutes of the Facilities Integration Program meeting, Aug 21, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Bob, Ilija, Dave, Torre, Sarah, Shawn, James, John Brunelle, Mark, Wei, Saul, Horst
  • Apologies: Jason, Mark
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • We'll need to cover the topics below
        • SL6 upgrade status
        • FY13 Procurements
        • Updates for Rucio support (Dave's page)
        • Distributed Tier 3 (T2 flocking), user scratch, Facilities working subgroup
      • Conversion of the existing storage inventory to the Rucio naming convention (Hiro).
      • How much inventory is in active use? Can we purge before renaming? Hiro has been studying this, and a meeting is being set up to discuss it. Now a hot topic.
      • This is much more complex than it sounds; n.b. there is already a shortage of T1 space. (A sketch of the naming convention follows after this list.)
    • this week
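
    A minimal sketch of the deterministic naming scheme behind the renaming discussion above, assuming the commonly described convention of hashing "scope:name" with md5 and using the first two byte-pairs as subdirectories (the scope and file name below are made up):

        scope="mc12_8TeV"; name="EVNT.01234567._000001.pool.root.1"     # hypothetical example
        md5=$(echo -n "${scope}:${name}" | md5sum | awk '{print $1}')
        # Rucio-style deterministic path: rucio/<scope>/<md5[0:2]>/<md5[2:4]>/<name>
        echo "rucio/${scope}/${md5:0:2}/${md5:2:2}/${name}"

    The attraction for the conversion campaign is that the target path is computable from the catalog entry alone, so existing replicas can be renamed without a per-file lookup service.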

FY13 Procurements - Compute Server Subcommittee (Bob)

last time
  • AdHocComputeServerWG
  • Shawn: discussion summary about possible discounts. You know about this. Waiting for some sensitive discussions to converge.

this time:

  • Attempting to get a rational set of quotes for the M620 and R620. Current quotes are higher than either BNL's or SLAC's. Also awaiting info from Andy L and Dell, etc.
  • Shawn - expect news, maybe even today.

Integration program issues

Rucio support (Dave)

last meeting

this meeting

  • OU: has not done this, but will during the upgrade
  • NET2_HU: all nodes are SL6
  • NET2_BU: unclear
  • MWT2 DONE
  • AGLT2 DONE
  • BNL DONE
  • WT2 DONE

The transition to SL6

MAIN REFERENCE

CURRENTLY REPORTED

last meeting(s)

  • All sites - deploy by end of May, June
  • Shuwei validating on SL6.4; believes it is ready to go. BNL_PROD will be modified - in the next few days 20 nodes will be converted, then the full set of nodes.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Main thing to consider is whether you upgrade all at once, or rolling.
  • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
  • Look at AGIS - changing panda queues much easier
  • Are the new queue names handled in reporting? Yes, if they are members of the same Resource Group.
  • What about $APP? Needs a separate grid3-locations file. But the new system doesn't use it any longer.
  • Schedule:
    • BNL DONE
    • June 10 - AGLT2 - will do rolling
    • MWT2 - still a problem with validations; could start next week
    • SLAC - week of June 10
    • NET2 - all at once. Week of June 17
    • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
    • OU - all at once. Rocks versus Puppet decision. After July 5.
  • Goal: Majority of sites supporting the new client by end of June. May need to negotiate continued support

  • BNL DONE
  • MWT2 DONE
  • AGLT2: 1/3 of worker nodes have been converted; ran into a CVMFS cache-size configuration issue (see the example settings after this list), but otherwise things are going well. The OSG app area is owned by usatlas2, but validation jobs now run as production jobs. Doing a rolling upgrade, using the newest CVMFS release; n.b. the change in cache location. Expect to be finished next week.
  • NET2: HU first, then BU. At HU - did a big-bang upgrade; ready for Alessandro to do validation. Ran into a problem with the host cert. 2.1.11 is the production version. One machine at BU. Hope to have this done in two weeks. The BU team is working on the HPC center at Holyoke.
  • SWT2 (UTA)
  • SWT2 (OU)
  • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
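
  A minimal sketch of the CVMFS client settings behind the cache issues noted for AGLT2 above; the values are illustrative only, not any site's actual configuration. In the 2.1 series the default cache base moved (commonly to /var/lib/cvmfs), which is the "change in cache location" referred to above.

      # /etc/cvmfs/default.local -- illustrative values only
      CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
      CVMFS_CACHE_BASE=/var/lib/cvmfs        # cache location (differs from the 2.0-era default)
      CVMFS_QUOTA_LIMIT=20000                # cache size in MB; too small a value causes cache thrashing
      CVMFS_HTTP_PROXY="http://squid.example.org:3128"   # hypothetical local squid

  On 2.1 clients, cvmfs_config reload applies the changes without unmounting the repositories.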

previous meeting update:

  • BU: now after August 11, since Alessandro is on vacation. Will commit to doing it August 12.
  • OU: working on final ROCKS configuration. Still validating the OSCER site - problems not understood.
  • CPB is DONE. SWT2_UTA is also delayed because of Alessandro's absence.

previous meeting (8/14/2013)

  • Updates?
  • BU: a new set of Panda queues is being validated right now. Did local testing; there were problems with Adler32 checksumming (a quick check is sketched after this list). Expect completion within the next week.
  • OU: wants to wait until OSCER is validated before starting OCHEP. Saul will help.
  • UTA_SWT2 - still needs to convert. Waiting on getting some equipment moved over to the FW data center.
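
  On the Adler32 checksum problems mentioned above, a quick way to cross-check a local file against the catalog value (the file path is hypothetical; xrdadler32 ships with the xrootd client packages):

      xrdadler32 /local/path/to/file.root       # prints the adler32 checksum of the file
      # pure-python fallback (reads the whole file into memory, fine for a spot check):
      python -c "import sys,zlib; print '%08x' % (zlib.adler32(open(sys.argv[1],'rb').read()) & 0xffffffff)" /local/path/to/file.root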

this meeting

  • Updates?
  • BU: about 2/3 of the releases have been validated. The completion date is not known; the hope is about a week. It's been going rather slowly. Ran into a problem with CVMFS.
  • OU: currently down due to storage problems, which have not yet been fixed.
  • Both OU and BU have had problems getting attention from Alessandro.
  • Need a formal statement about a setup file in $APP.

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. It will have examples of how to route specific subnets that are announced on LHCONE.
  • Three configurations: 1) PBR (policy-based routing); 2) a dedicated routing instance, i.e. a virtual router for LHCONE subnets; 3) a physical router as the gateway for LHCONE subnets. (An illustrative PBR sketch follows this list.)
  • NET2: have not been pushing it, but will get the ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN, which has been fixed, and the direct BNL-to-OU path has been restored. Will start on LHCONE next.
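
  For reference, a minimal Linux-level sketch of the PBR idea in option 1) above; the gateway and subnets are placeholders, and production sites would implement the equivalent on their border routers rather than on a host:

      # route traffic from the local storage subnet toward an LHCONE-announced prefix via the LHCONE gateway
      ip route add default via 192.0.2.1 dev eth1 table 100          # table 100 = LHCONE routing table
      ip rule add from 10.10.0.0/21 to 198.51.100.0/24 lookup 100    # local source subnet + remote LHCONE prefix
      ip route flush cache

  Each remote prefix announced on LHCONE gets a rule like the second line; all other traffic keeps following the default (general R&E / commodity) path.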

previous meeting

  • NET2: status unclear - either waiting on instructions from Mike O'Conner, or there have been direct communications with Chuck. Will ramp things up.
  • OU: waiting for a large-latency issue to be resolved by Internet2, then will re-establish the BNL link. Believes the throughput matrix has improved (a packet-loss problem seems to be resolved). Timeline unknown. Will ping existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus; the question is whether PBR can be implemented properly. Can provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of the network staff; a new manager is coming online. Will see about implementing PBR. Update at the next meeting.

this meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, and the direct link was turned back on. Then there were perfSONAR issues, now resolved. Expect to have either the Tier 2 or the OSCER site done shortly.
  • BU and Holyoke: the network engineers have been put in touch. Still unknown when it will happen; have not been able to extract a date.
  • UTA - no progress.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There has been a lack of production jobs lately
    • From ADC meeting - new software install system; Alastaire's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    
    

    • There are a couple of Tier 3 tickets that are not getting any attention, e.g. NERSC (no reply for two weeks) and one at Wisconsin. Are these causing operational issues for ADC? These sites are evidently not well maintained - should we have a mechanism to shut them off if there is no response beyond an expiration date?
  • this week: Operations summary:
    Summary from the weekly ADCoS meeting (Hiroshi Sakamoto):
    https://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=266038
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/adcos-summary-8_20_13.html
    
    1)  8/14: MWT2 - DNS issue at the site - from Dave: All reverse lookups for addresses that are part of the MWT2 at UChicago are failing. This in turn causes all X509, SSL, etc. 
    connections to fail. Since the dCache head nodes, including the SRM server, are located at this site, all DDM transfers are failing. (A quick reverse-lookup check is sketched after this summary.)
    2)  8/15: New pilot release from Paul (v58e).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58e.html
    3)  8/19: US Tier2 HOTDISK has been decommissioned. https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/45475.
    4)  8/20: ADC weekly meeting:
    https://indico.cern.ch/event/266976
    5)  8/20: OU_OCHEP_SWT2: storage problem at the site.  Issue being worked on. 
    6)  8/20: From Saul at NET2 - The NET2/BU gatekeeper atlas-net2.bu.edu ran out of memory suddenly and had to be rebooted at roughly 9:30 PM EST. I don't think 
    that we lost any jobs, but I expect a bundle of DDM errors.  All systems were back as of about 9:45 PM.
    
    Follow-ups from earlier reports:
    
    (i)  6/8 p.m.: OU_OCHEP_SWT2 file transfers failing with "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]."  An issue developed in the network 
    path between OU & BNL.  As of 6/10 early afternoon the direct AL2S path between OU and BNL was turned off, and that 'fixed' the network problems temporarily, since everything 
    was then re-routed to the 'old' paths.  Problem under investigation.
    (ii)  7/7: SLACXRD - file transfer failures ("[USER_ERROR] source file doesn't exist:).  https://ggus.eu/ws/ticket_info.php?ticket=95491 in-progress, eLog 44910.
    Update 7/16 - still see these errors.  https://ggus.eu/ws/ticket_info.php?ticket=95763 was opened, marked as a slave to master ticket ggus 95491.
    Update 7/24: ggus 95491 was closed, but then re-opened when the errors reappeared. 
    (iii)  7/24: NERSC_LOCALGROUPDISK: transfer failure [Checksum mismatch] - https://ggus.eu/ws/ticket_info.php?ticket=96116 - eLog 45176. On 7/25 a maintenance outage was 
    declared to work on a filesystem problem. https://savannah.cern.ch/support/index.php?138943 (site blacklisting).
    Update 8/15: site reported all issues had been resolved.  ggus 96116 was closed 8/16 after verifying recent DDM success. 
    (iv)  8/6: WISC HOTDISK transfer errors ("Unexpected Gatekeeper or Service Name globus_gsi_gssapi: Authorization denied") - https://ggus.eu/ws/ticket_info.php?ticket=96406 - 
    eLog 45289, 45461 .  Also affects LOCALGROUPDISK, site blacklisted in DDM.
    (v)  8/12: The BNL_CLOUD queue has been set to 'test' for several days, with test jobs failing at a rate of ~50%.  From John Hover: The worker nodes were exhibiting "too many 
    open files" errors. A new VM image was made with altered limits and an unneeded service disabled. That image will be rolled out today. The underlying cause for the error is still 
    unclear. https://ggus.eu/ws/ticket_info.php?ticket=96584 in-progress, eLog 45379.
    Update 8/18: From John Hover: The WN's were exhibiting "too many open files" errors. A new VM image was made with altered limits and an unneeded service disabled. That image 
    was rolled out. The problem is fixed.
    ggus 96584 was closed, eLog 45451.   
    (vi)  8/13: AGLT2 - file transfers with SRM errors, still related to the dCache "out of memory" issue. Also transfers failing with the error "[SECURITY_ERROR] SRM Authentication 
    failed." From Shawn: We are working with the dCache developers to track down the root cause. During the last OOM we updated our dCache to an "instrumented" version which 
    we are now running. The dCache services are back up now but will take a while to run smoothly again. https://ggus.eu/ws/ticket_info.php?ticket=96614 in-progress, eLog 45403.
    Update 8/15: Ticket was closed/reopened when the errors came back.  Closed on 8/19, but again reopened on 8/21.  Ongoing issue.  eLog 45494.
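
    Re item 1) above: GSI/X509 connections typically fail when reverse DNS breaks because host authorization maps the server's IP back to a hostname. A quick spot check from any client (addresses are placeholders):

        dig -x 192.0.2.10 +short       # should return the node's PTR record (its hostname)
        host 192.0.2.10                # same check via the resolver library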
    

  • On-going effort to upgrade to FTS-3; an on-going problem in the UK cloud.
  • From last week: the open ticket at NERSC was closed.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Dark data discussions are on-going. Will provide a comparison for sites; preliminary report next week.
    • Localgroupdisk. Some deletions at UC, about 150 TB. Also contacting top users at SLAC.
    • Userdisk cleanup finished everywhere except BNL.
    • No report - Armen on vacation
  • this meeting:
    • No report

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Close to having an updated release of the perfSONAR installations. yum updates are possible - Dave has tested this at MWT2 (see the sketch after this list). Getting ready to make a big push.
    • Revisit next week.
  • this meeting:
    • Release 3.3.1 should be out today. Each site should update.
    • MWT2, AGLT2, OU done already.
    • Goal is to start to utilize this information in various ways.
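
    A minimal sketch of the in-place update path mentioned above, assuming the perfSONAR toolkit's yum repositories are already enabled on the host (exact package names vary with the toolkit version):

        yum clean all
        yum update          # pulls the 3.3.1 toolkit packages from the enabled perfSONAR repos
        # reboot afterwards if the kernel or core toolkit services were updated

    The alternative remains a fresh install from the toolkit image; the yum path avoids a full reinstall.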

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Migration to Xrootd 3.3.3. Changes to configuration. There is a twiki page with instructions - will add a table to mark progress.
  • Rucio N2N - to be tested in the US.
  • FAX failover is working in the US and some UK sites, and we have monitoring for this. There is a plugin in Pandamon providing a page. A new version of the monitor will be able to filter according to owner.
  • Working with Valerie. Will have historical views as well.
  • We are now reporting on problems with sites - sent by SSB
  • f-stream now enabled. Rucio N2N testing at MWT2. Developments for measuring response times.
  • New xrootd developed.
  • Discussion of UDP collector at SLAC moving to GOC.
  • Doug - working with the Australians. Tier 3 IB - Tier 3s may want to federate.
this week
  • Wei - sent a note about the Xrootd update with notes; reminds sites to do the update.
  • Ilija - working with Matez on monitoring - there was some mix-up of CMS and ATLAS data. A new release of the collector is coming.
  • Panda failover monitoring.
  • All sites agreed to update Xrootd: https://twiki.cern.ch/twiki/bin/view/Atlas/Xrootd333Upgrade (a quick post-update check is sketched below).
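
  A couple of quick checks a site can run against a FAX redirector after updating; the redirector and file below are the ones used in the test later in these minutes and may not be readable from everywhere:

      xrdfs fax.mwt2.org:1094 stat /atlas/dq2/user/flegger/MWT2/user.flegger.MWT2.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00/NTUP_SMWZ.01007411._000021.MWT2.root.1
      xrdcp -f root://fax.mwt2.org:1094//atlas/dq2/user/flegger/MWT2/user.flegger.MWT2.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00/NTUP_SMWZ.01007411._000021.MWT2.root.1 /dev/null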

Site news and issues (all sites)

  • T1:
    • last meeting(s): Hiro has started to do direct-access tests; planning to move the ANALY short queue to direct access, and anticipating moving more in the future. Working on new storage hardware - issued an order for 1.5 PB of storage at ~$120/TB; will run ZFS on top and start deploying as soon as possible. Hiro and others have made good progress evaluating the Ceph block device, using old retired storage; it looks attractive to us - the features are nice, and it lets you slice and dice the storage. The 100 Gbps-capable Arista switch arrived; readying for a trans-Atlantic test in mid-September. IB evaluation: 56 Gbps interfaces into these machines.
    • this meeting:

  • AGLT2:
    • last meeting(s): OOM crashes on dCache headnodes. Ticket open. 2.2.12 version. Troubleshooting. Wants to move to 2.6; will need gplazma 2 natively.
    • this meeting: Still have the dCache OOM issue; it has been isolated to the billing cell DB. Instrumenting the monitoring. Getting ready to purchase a networking switch and compute nodes.

  • NET2:
    • last meeting(s): HU will be down for the entire week as they convert to their new 30k core cluster. Will ATLAS be able to take advantage of these? These resources could appear behind an existing
    • this week: Still working on SL6. New gatekeeper being set up at HU. Will do the Xrootd update at BU. Working on X509.

  • MWT2:
    • last meeting(s): UIUC - deployed an additional 12 nodes. Will be up to 1600 job slots at Illinois.
    • this meeting: A DNS appliance failure at UC broke many things; corrected, but it took a while to propagate to all servers. GPFS issues at Illinois - lack of file descriptors.

  • SWT2 (UTA):
    • last meeting(s): Writing a local site mover that is xrootd-based, relying only on the binary release of xrootd. There is some logic in the program: using xrd to just stat the file (want to look at xrdfs). Using an old xcp mover built into the pilot, which preloads the libraries before Athena runs. Our xrootd FAX door does not seem stable; it runs in proxy mode. Running a newer version of Torque and seeing pilots hammer it. Next: get equipment moved to the other lab, do the update, add some new storage.
    • this meeting: Working on the SL6 upgrade. Problem with a storage server - but have a work-around. Hope to be able to request software validation.

  • SWT2 (OU):
    • last meeting(s): Some problems with high-memory jobs - the result has been crashed compute nodes. Condor is configured to kill jobs over 3.8 GB. These are production jobs. No swap? Very little.
    • this meeting: Lustre problems - but IBM is working on the problem.

  • WT2:
    • last meeting(s): The security team is now going to permit outbound IP connectivity. RHEL6 upgrade nearly complete; will combine those two things. Hopes to have this in the next two weeks. Two quotes from Dell for R620s and M620s - similar pricing! Thinking about adding IB; other groups may add this as well.
    • this meeting: Running RHEL5 and RHEL6 queues.

Distributed Tier 3 issues (Rob)

last meeting:
  • Prototyping work at MWT2: http://twiki.mwt2.org/bin/view/Main/FlockingProject
  • We need to setup Facilities sub-working group to discuss Tier 3 --> Tier 2 flocking issues in detail.
  • Writeable "scratch" storage space at each Tier 2 "FAXbox"
  • Work by Lincoln at MWT2:
    - Ceph clusters have 3 basic components:
    	1) Monitor service (ceph-mon). Keeps track of cluster membership, configuration, and state.
    		* We have 1 running; plan to deploy 3 in total, per the Ceph recommendation.
    	2) Object store daemons (ceph-osd). These are daemons that run on top of the disks.
    		* We have 60x750GB disks, evenly spread across 10 servers (6 per server).
    	3) Metadata service (ceph-mds). Coordinates access to the object store daemons.
    		* We have 1 of these (it shares a server with one of the monitors).
    - Ceph can export the filesystem in one of two ways:
    	1) CephFS. A kernel module that presents a filesystem to a node. Behaves essentially like a 'door' that we use in the canonical storage element setting.
    	2) RADOS Block Device (RBD). Presents an arbitrarily sized block device to the OS (e.g., /dev/rbd0) that can be formatted with any filesystem. Seems to be used in OpenStack environments.
    - CephFS is fully POSIX, so overlaying xrootd, GO, etc. is all easily possible. These things are equally possible with RBD (you could probably even stick a dCache pool on top of RBD if you wanted). RBD is considered more production-ready than CephFS, but seems slower in my limited testing.
    - Ceph performance can be greatly improved by also putting an SSD into each storage node for journaling. Ceph behaves much better under parallel workloads.
    - Ceph can set arbitrary replication levels; we have 2x replication currently.
    - Our Ceph deployment has 40 TB raw, 20 TB usable after replication.
    - The newest version (Ceph 0.67 'Dumpling') was released today.
    (A few illustrative admin commands for the points above are sketched after the test below.)
  • A simple test
    [root@faxbox ~]# xrdcp -v root://fax.mwt2.org:1094//atlas/dq2/user/flegger/MWT2/user.flegger.MWT2.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00/NTUP_SMWZ.01007411._000021.MWT2.root.1 /mnt/ceph/.
    [xrootd] Total 3703.14 MB       |====================| 100.00 % [528.9 MB/s]
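
    For reference, a hedged sketch of the kinds of admin commands behind the notes above; pool and image names, hosts, and sizes are made up, and the syntax is that of the Ceph 0.6x-era tools:

        ceph -s                                  # overall cluster health: mon quorum, osd count, pg state
        ceph osd pool set data size 2            # set 2x replication on a pool, as described above
        # CephFS via the kernel client:
        mount -t ceph ceph-mon.example.org:6789:/ /mnt/ceph -o name=admin,secret=<key>
        # RBD alternative: create, map, and format a block device
        rbd create scratch/faxbox0 --size 1048576      # 1 TB image in a hypothetical 'scratch' pool
        rbd map scratch/faxbox0                        # typically appears as /dev/rbd0
        mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt/faxbox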
     

this meeting

AOB

last meeting
  • Will have a meeting next week, to get back in phase.
  • Fall US ATLAS Facilities workshop. Tentative dates: December 11, 12. Location: University of Arizona (Tucson).
this meeting


-- RobertGardner - 20 Aug 2013
