
MinutesSep182013

Introduction

Minutes of the Facilities Integration Program meeting, Sep 18, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
  • Your access code: 2913843

Attending

  • Meeting attendees: Rob, Fred, Dave, Saul, Michael, Patrick, Torre, Sarah, Armen, Wei, Alden, Kaushik, John Hover, Bob, Hiro, John Brunelle
  • Apologies: Shawn, Jason, Mark, Horst
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • We'll need to cover the topics below
        • SL6 upgrade status
        • FY13 Procurements
        • Updates for Rucio support (Dave's page)
        • Distributed Tier 3 (T2 flocking), user scratch, Facilities working subgroup
      • Conversion of the existing storage inventory to the Rucio naming convention (Hiro).
      • How much inventory is in active use? Can we purge before renaming? Hiro has been studying this - and there is a meeting being set up to discuss it. Now a hot topic.
      • Enabling DQ2 2.4+ clients, SL6 conversion
        • Global versus dq2 specific
        • SLAC has a problem - PRODDISK files are written with the old convention.
      • SHA2 compliance deadline, October 1, 2013: https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/SHA2Compliance
        • SLAC - Bestman upgrade didn't work without gums. Will install a gums server. Wei is skeptical.
        • BU - may need to upgrade Bestman
        • UTA - ?
    • this week
      • Procurement status
      • Rucio support, SL6 - all set, final checks
      • Tier 3 flocking, Campus group formation: a collection of services to utilize and leverage resources outside the institution. Formulate a plan to materialize the capabilities and functionality; need to work closely with Analysis Support.
      • SHA-2 transition
      • High level comments: the SL6 transition took way too long, to support the latest dq2 clients; we ended up 3 months late. We must prepare better and hold to a schedule next time. Special thanks to David Lesny for spearheading the technical issues. Equipment procurement is also taking too long. When can we expect final numbers?
      • Update for network program for next week.
      • Perfsonar registration in OIM. There are ambiguities to resolve.

SHA-2 (John Hover)

  • OSG CA will postpone issuing SHA-2 certs to December 1. (WLCG request)
  • All OSG software has been updated
  • John will send out links
  • Worker node client - 3.1.22 or higher is required
  • GUMS version
  • dCache - 2.2.17 or above is required. What about BNL? 2.2.12; AGLT2 2.2.15. (The major service to be upgraded is SRM)
  • What about DigiCert host certs (+ SHA2)? This has been dealt with already by OSG.
  • Have been testing pilot with digicert+sha2 end to end. So a question here.
  • What about xrootd with SHA2 certs? Wei will check.
  • Michael: need to set up a validation process - Hiro will set up a page to do this (a rough signature-algorithm check is sketched after this list).
  • Hiro: can acquire SHA2 cert from CILogon
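
As a rough illustration of the kind of check a validation page could perform, the sketch below reports the signature algorithm of PEM certificates using openssl. This is only a hedged sketch of one possible check, not the process Hiro will set up; the file paths and openssl invocation are assumptions.

    # check_sha2.py - hedged sketch: report whether PEM certs are SHA-2 signed.
    # Assumes openssl is in PATH and certificate paths are given on the command line.
    import subprocess
    import sys

    def signature_algorithm(cert_path):
        """Return the first 'Signature Algorithm' line printed by openssl."""
        text = subprocess.check_output(
            ["openssl", "x509", "-in", cert_path, "-noout", "-text"],
            universal_newlines=True)
        for line in text.splitlines():
            if "Signature Algorithm" in line:
                return line.strip()
        return "Signature Algorithm: unknown"

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            alg = signature_algorithm(path)
            sha2 = any(h in alg.lower() for h in ("sha256", "sha384", "sha512"))
            print("%-40s %s (%s)" % (path, alg, "SHA-2" if sha2 else "not SHA-2"))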

FY13 Procurements - Compute Server Subcommittee (Bob)

last time
  • AdHocComputeServerWG
  • Still waiting
  • Wei - starting purchasing process, pooling with other groups, to purchase a large group of machines. Will not run HT, 4GB/core, and IB. Dell. M620 blade. Happy with pricing.

this time:

  • SLAC purchase imminent. 2660.
  • MWT2 has quote for a plan B

Integration program issues

Rucio support (Dave)

Previous meeting

this meeting

  • NET2_HU: all nodes are SL6
  • NET2_BU: unclear
  • MWT2 DONE
  • AGLT2 DONE
  • BNL DONE
  • WT2 DONE
  • OU DONE
  • SWT2:
this meeting:
  • All sites are ready for dq2 2.4 client
  • Dave has emailed Thomas to find out when the "latest". Patrick: why not put this in the wrapper?
  • Local site mover changes in the pilot for the Rucio naming convention. Paul is updating that - moving to an lsm version. Now deployed at the two UTA sites and SLAC.
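
For reference, a minimal sketch of the deterministic Rucio naming convention that the lsm changes above need to produce: the two hashed path components come from the md5 of "scope:name". Any site-specific prefix in front of "rucio/" is an assumption here; the authoritative logic lives in the pilot/lsm and Rucio code.

    # rucio_path.py - sketch of the deterministic Rucio path for a (scope, name) pair.
    import hashlib

    def rucio_relative_path(scope, name):
        """Return rucio/<scope>/XX/YY/<name>, where XX and YY are the first two
        hex byte-pairs of md5('<scope>:<name>')."""
        digest = hashlib.md5(("%s:%s" % (scope, name)).encode("utf-8")).hexdigest()
        return "rucio/%s/%s/%s/%s" % (scope, digest[0:2], digest[2:4], name)

    if __name__ == "__main__":
        # Hypothetical file name, for illustration only.
        print(rucio_relative_path("mc12_8TeV", "EVNT.01234567._000001.pool.root.1"))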

The transition to SL6

MAIN REFERENCE

CURRENTLY REPORTED

last meeting(s)

  • All sites - deploy by end of May, June
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Main thing to consider is whether you upgrade all at once, or rolling.
  • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
  • Look at AGIS - changing panda queues much easier
  • Are the new queue names handled in reporting? If they are members of the same Resource Group.
  • What about $APP? Needs a separate grid3-locations file. But the new system doesn't use it any longer.
  • Schedule:
    • BNL DONE
    • June 10 - AGLT2 - will do rolling
    • MWT2 - still a problem with validations; could start next week
    • SLAC - week of June 10
    • NET2 - all at once. Week of June 17
    • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
    • OU - all at once. Rocks versus Puppet decision. After July 5.
  • Goal: Majority of sites supporting the new client by end of June. May need to negotiate continued support

  • BNL DONE
  • MWT2 DONE
  • AGLT2: 1/3 of worker nodes were converted; ran into a CVMFS cache size config issue, but otherwise things are going well (see the cache-settings sketch after this list). The OSG app is owned by usatlas2, but validation jobs are now production jobs. Doing rolling upgrade. They are using the newest CVMFS release; n.b. change in cache location. Expect to be finished next week.
  • NET2: HU first, then BU. At HU - did big bang upgrade; ready for Alessandro to do validation. Ran into problem with host cert. 2.1.11 is production. One machine at BU. Hope to have this done in two weeks. BU team working on HPC center at Holyoke.
  • SWT2 (UTA)
  • SWT2 (OU)
  • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
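
The cache-size and cache-location issues noted for AGLT2 above are governed by the CVMFS client settings in /etc/cvmfs/default.local (notably CVMFS_QUOTA_LIMIT and CVMFS_CACHE_BASE). The sketch below simply prints those settings; the example values in the comment are illustrative, not AGLT2's actual configuration.

    # cvmfs_cache_check.py - sketch: report cache-related CVMFS client settings.
    # A typical /etc/cvmfs/default.local (illustrative values only) might contain:
    #   CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    #   CVMFS_QUOTA_LIMIT=20000          # local cache size limit, in MB
    #   CVMFS_CACHE_BASE=/var/lib/cvmfs  # cache location; the 2.1.x default differs from 2.0
    import sys

    KEYS = ("CVMFS_REPOSITORIES", "CVMFS_QUOTA_LIMIT", "CVMFS_CACHE_BASE")

    def read_settings(path):
        """Parse simple KEY=VALUE lines, ignoring comments and blank lines."""
        settings = {}
        with open(path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if "=" in line:
                    key, _, value = line.partition("=")
                    settings[key.strip()] = value.strip().strip('"')
        return settings

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/etc/cvmfs/default.local"
        settings = read_settings(path)
        for key in KEYS:
            print("%s=%s" % (key, settings.get(key, "<unset; built-in default applies>")))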

previous meeting update:

  • BU: after August 11 now, since Alessandro is on vacation. Will commit to doing it August 12.
  • OU: working on final ROCKS configuration. Still validating the OSCER site - problems not understood.
  • CPB is DONE. SWT2_UTA - also delayed because of Alessandro's absence.

previous meeting (8/14/2013)

  • Updates?
  • BU: a new set of Panda queues is getting validated right now. Did local testing; there were problems with Adler checksumming (a checksum sketch follows this list). Within the next week.
  • OU: wants to wait until OSCER is validated before starting OCHEP. Saul will help.
  • UTA_SWT2 - still needs to convert. Were waiting on getting some equipment moved over to the FW data center.
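
The Adler checksumming mentioned for BU refers to the adler32 file checksums DDM compares after transfers. A minimal, self-contained sketch of computing one follows; the chunk size and zero-padded hex output are common conventions, not BU specifics.

    # adler32_file.py - sketch of the adler32 checksum used for transfer validation.
    import sys
    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Stream the file through zlib.adler32 and return an 8-character hex string."""
        value = 1  # adler32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            print("%s  %s" % (adler32_of_file(path), path))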

previous meeting (8/24/2013)

  • Updates?
  • BU: about 2/3 of the releases have been validated. Completion date is not known; hope is for a week. It's been going rather slowly. Ran into a problem with CVMFS.
  • OU: currently down due to storage problems, which have not yet been fixed.
  • Both OU and BU have had problems getting attention from Alessandro.
  • Need a formal statement about a setup file in $APP.

this meeting

  • All done

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, 2 sites from MWT2 DONE
  • SLAC DONE

notes:

  • Updates?
  • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
  • Three configurations: 1) PBR (policy-based routing); 2) a dedicated routing instance - a virtual router for LHCONE subnets; 3) physical routers as gateways for LHCONE subnets.
  • NET2: have not been pushing it, but will get ball rolling again - will contact Mike O'Conner and provide feedback.
  • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.

previous meeting

  • NET2: status unclear: probably waiting on instructions from Mike O'Conner (unless there have been direct communications with Chuck). Will ramp things up.
  • OU: status: waiting for a large latency issue to be resolved by Internet2, then will re-establish the BNL link. Believes the throughput matrix has improved (a packet loss problem seems to be resolved). Timeline unknown. Will ping existing tickets.
  • UTA: will need to talk with network staff this week. Attempting to advertise only a portion of the campus. Could PBR be implemented properly? Can provide an update after the visit.

previous meeting (8/14/13)

  • Updates?
  • Saul sent a note to Mike O'Conner - no answer. There are management changes at Holyoke. Would like a set of instructions to drive progress.
  • OU: will check the link.
  • UTA - still need to get hold of network staff. A new manager is coming online. Will see about implementing PBR. Update next meeting.

previous meeting (8/21/13)

  • Updates
  • OU - network problems were fixed, then the direct link was turned back on. Then perfsonar issues, since resolved. Expect to have either a Tier 2 or the OSCER site done within a few.
  • BU and Holyoke. Put the network engineers in touch. Still unknown when it will happen. Have not been able to extract a date to do it.
  • UTA - no progress.

previous meeting (9/4/13)

  • Updates?
  • UTA: meeting with the new network director scheduled for this Friday or next week. Back on the page.

this meeting (9/18/13)

  • Updates?
  • UTA - no update; getting on the new director's manager. Before the next meeting.
  • BU & HU - made some headway with Chuck and Mike O'Conner. NOX at Holyoke to be at 100g in 6 months. (Michael: from LHCONE operations call, NOX will extend to MANLAN, initially 10g link on short notice; sounded promising.)
  • OU - OU network folks think we can be on LHCONE by Oct 1

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • There has been a lack of production jobs lately
    • From the ADC meeting - new software install system; Alastair's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.
  • this meeting:
    • Two drains over the last week; Saturday's drain was caused by new certs installed on Friday.
    • A bad task got into the system; tens of thousands of failures. Missing scout jobs?

Shift Operations (Mark)

  • last week: Operations summary:
    Summary from the weekly ADCoS meeting:
    Not available this week
    
    1)  9/7: HU_ATLAS_Tier2 - production jobs failing with stage-out errors & "Transform received signal SIGSEGV."  Most of the problems were resolved later that day.  
    Still see a low level of the "SIGSEGV/root HOME" errors - under investigation. https://ggus.eu/ws/ticket_info.php?ticket=97114 was closed, eLog 45721.
    2)  9/8: SLACXRD - file transfer failures with "[GRIDFTP_ERROR]."  Most likely a transient problem, as the errors went away without intervention.  
    https://ggus.eu/ws/ticket_info.php?ticket=97125 was closed, eLog 45793.
    3)  9/10: ADC Weekly meeting:
    https://indico.cern.ch/event/269433
    4)  9/10: AGLT2: jobs failing heavily on two WN's (c-104-42.aglt2.org, c-110-24.aglt2.org). Nodes were off-lined for debugging.  eLog 45792.
    5)  9/11: Issue with pandamover in the US cloud resolved.  Sites weren't receiving input files for production jobs. (The condor configuration on a pandamover 
    host was fixed.)
    
    Follow-ups from earlier reports:
    
    None!
    
  • this week: Operations summary:
    Summary from the weekly ADCoS meeting (Pavol Strizenec, Michal Svatos):
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=270961
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/adcos-summary-9_16_13.html
    
    1)  9/12: File transfers to the US cloud were failing with "[INTERNAL_ERROR] error creating file for memmap /var/tmp/glite-url-copy-edguser/..."  Issue was a full log file area 
    on an FTS server.  Area cleaned, problem fixed.  https://ggus.eu/ws/ticket_info.php?ticket=97261 was closed, eLog 45812, 45814.
    2)  9/12: OU_OCHEP_SWT2 - file transfer failures ("failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]").  Power-cycling a network switch 
    fixed the problem. 
    3)  9/12: AGLT2 - jobs were failing a couple of WN's with the error "Missing installation: No such file or directory: /cvmfs/atlas.cern.ch/repo/sw/software/..."  The problematic 
    hosts were off-lined, issue resolved.  https://ggus.eu/ws/ticket_info.php?ticket=97273 was closed on 9/13, eLog 45841.
    4)  9/12: NET2/BU - jobs were failing heavily on WN abc-03b.  Node off-lined the next day, issue resolved. eLog 45858.
    5)  9/13-9/14: Large number of 'holding' production jobs in all clouds, especially US.  Issue was traced to how DaTRI was handling a new panda server robot certificate.  
    Problem fixed.  More details in: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/45882.
    6)  9/17: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=272134
    
    Follow-ups from earlier reports:
    
    None
    

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Has been deleting data on behalf of a user - 220 TB, at MWT2. Need to send another note to users. 80% are not being used? What is the popularity?
    • Deleting data from DATADISK limited by primary copies.
    • Dark data - need more feedback. USERDISK deleted before May 1. These files are eligible for CCC cleanup.
  • this meeting:
    • MinutesSep17DataManage2013
    • Estimating 5% of all data in US facility is dark, mostly in USERDISK
    • About 100 TB for primary to secondary
    • Kaushik's action items: dark data on DATADISK should not happen (subscription services); Saul and others will look into it. USERDISK: much is quite old - caused by glitching deletion services? Far more complex to figure out: set a date and delete everything prior to it (a minimal sketch of the cutoff-date idea follows this list).
    • Hiro: the Rucio File Catalog will be replicated to BNL
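
A minimal sketch of the cutoff-date idea from the action items above. In practice the cleanup would be driven from catalog/CCC output rather than a filesystem walk, so the walk, the path argument, and the May 1 cutoff below are illustrative assumptions only.

    # userdisk_cutoff.py - sketch: list files last modified before a cutoff date.
    import os
    import sys
    import time

    CUTOFF = time.mktime(time.strptime("2013-05-01", "%Y-%m-%d"))  # illustrative cutoff

    def files_older_than_cutoff(top):
        """Yield (mtime, path) for regular files with mtime before CUTOFF."""
        for dirpath, _dirs, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.path.getmtime(path)
                except OSError:
                    continue  # unreadable or vanished file; skip it
                if mtime < CUTOFF:
                    yield mtime, path

    if __name__ == "__main__":
        for mtime, path in files_older_than_cutoff(sys.argv[1]):
            print("%s  %s" % (time.strftime("%Y-%m-%d", time.localtime(mtime)), path))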

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Release 3.3.1 should be out today. Each site should update.
    • MWT2, AGLT2, OU done already.
    • Goal is to start to utilize this information in various ways.
  • this meeting:
    • Reports for program-funded network upgrades
    • Convergence on Perfsonar registration in OIM needed: all US sites need to be properly registered

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei - sent a note with instructions to update Xrootd, reminding sites to do so.
  • Ilija - working with Matez on monitoring - there was some mix-up of CMS and ATLAS data. New release of the collector coming.
  • Panda failover monitoring.
  • All sites agreed to update Xrootd: https://twiki.cern.ch/twiki/bin/view/Atlas/Xrootd333Upgrade.
  • Waiting for several sites to install the Rucio-based N2N; only SLAC and MW have completed. Wei will send an official reminder.
this week
  • Not much progress. There is a concern about rpm-upgrade of xrootd. Horst provided a solution.
  • BU completed all the steps and the config seems to be correct, but for some reason the service quits right away. Wei is debugging.
  • Question about FAX monitoring information - needs to be reported to the Rucio popularity database.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Hiro has started to do direct access tests. Planning to move the ANALY short queue to direct access. Anticipate moving more in the future. Working on new storage hardware - issued order for 1.5 PB of storage. $120/TB. Will run with ZFS on top - start deploying as soon as possible. Hiro and others have made good progress on evaluating the Ceph block device part. Looks attractive to us. Using old retired storage. Features are nice. Slice and dice the storage. 100 Gbps-capable Arista switch arrived; readying for a transatlantic test in mid-September. IB eval, 56 Gbps interfaces into these machines.
    • this meeting: Implementing a Science DMZ with a Juniper router; a prerequisite for the new 100g Arista router. Will be used to interface to the WAN circuit, for the transatlantic demonstration. Connection through a recognized network architecture. Moving to the next-gen OpenStack release, Grizzly - it reorganizes the way communication happens with the underlying network infrastructure; new virtual layer for control. Client SDN controller - to dynamically create VLANs. Will be doing this in the context of cloud activities. (Lab contributed the equipment.) Relation to DYNES? That is really a hardware setup - not the app layer.

  • AGLT2:
    • last meeting(s): More recently seeing larger numbers of CVMFS problems. HC offlined last night - most likely a CVMFS issue. CVMFS 2.1.14, the most recent version, is causing problems. MW backed out to 2.0.19. Dave notes certain versions of the kernel have problems with the latest 2.1.14.
    • this meeting: Lots of CVMFS issues - manifest with HC clouds as missing installations. Working with Dave Dykstra and Jakob Blomer on wn access. Too many files get pinned within the cache. Has a script which runs once a day randomly to do the reconfig. Updating wn-client and OSG releases; probably will do this well before. Will send CVMFS to the t2 list. 2.1.15 timescale? Unknown.

  • NET2:
    • last meeting(s): LHCONE progress; also expect 100g soon. HU has an issue with GK on SL6; lsf-gratia. (Wei has suggestion for disabling certain records). Has a ticket open. SL6 upgrade completed. Rucio upgrade.
    • this week: Appears that network performance to BNL goes way down, resulting in DDM errors. Perfsonar needs investigation.

  • MWT2:
    • last meeting(s): Updated SHA-2 compliant gums. dCache upgrade later in the month. There is a networking issue at IU. IU has received its upgrade equipment.
    • this meeting: Saw CVMFS problems as well. Seems to be related to OASIS - the catalog is so large (in 2.0.19 the caches were separate). Currently working on 2.1.14 - made sure the cache is large. No problem. SHA-2 compliant except for dCache; Sarah working on the plan. Registration will be fixed in OIM (a separate service group for each instance). Working with OSG folks on a Condor - Gratia bug in 7.8.8 - occasionally a job record claims 2B seconds used. An update to the Gratia Condor probe has a workaround.

  • SWT2 (UTA):
    • last meeting(s): Patrick up at OU.
    • this meeting: Have moved to the new dq2 client; implemented the new lsm to satisfy the new naming conventions. Dark data deletion, LFC orphans - working on a version of clean-pm to work on the output of CCC; will circulate. FAX upgraded at CPB, implemented the new Rucio N2N. Seems to be much better than before. Upgraded perfsonar to the latest version. Still need to do OIM registration. Possibly flaky hardware in the throughput node. Checked versions of software for SHA-2 - will need to update wn-client.

  • SWT2 (OU):
    • last meeting(s): Lustre problems - but IBM is working on the problem.
    • this meeting: Migration done and validation going well, almost complete, but currently working on hardware issues with the gatekeeper and IPMI network issues with the Lustre servers. Also still working on xrootd crashing the server; it seems to be triggering a Lustre bug.

  • WT2:
    • last meeting(s): Now have outbound TCP/IP connections. There is a problem with the GRAM service, needs a restart every few days. Need to open up a ticket with OSG. Release validation - two sites have been working fine. Finding some validations fail with some of the new sites. (Notes two upgrades of both GT5 and LSF).
    • this meeting: Will do the correct resource registration for perfsonar in OIM. lsm for SLAC - running for prod and analy, seems fine. Will convert the rest of the SLAC site. Procurement order went out. Stabilized GT5 and LSF (workaround in place; would hope for a fix from OSG).

AOB

last meeting
  • Will have a meeting next week, to get back in phase.
  • Fall US ATLAS Facilities workshop. Tentative dates: December 11, 12. Location: University of Arizona (Tucson).
this meeting


-- RobertGardner - 17 Sep 2013
