r4 - 25 Jun 2014 - 14:16:22 - RobertGardner

MinutesJun252014

Introduction

Minutes of the bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:

Attending

  • Meeting attendees: Michael, Dave, Alden, Bob, Rob, Shawn, John Brunelle, Mayuko, Sarah, Horst
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Readiness of the Facility is the primary focus for the next month, for DC14 obviously.
      • Rucio migration; issue of additional higher-level tools for consistency checking.
      • Enthusiastic about Connect technology to get new computing architectures into our computing environment.
      • Event service - high expectations and hopes; should be ideal for opportunistic resources.
      • Still working on improving networking - finalizing program funded upgrades.
      • Migration away from GRAM-based CE. Transition plan from OSG Operations; Saul and Wei will be helping with the validation.
      • TIM meeting will be planned for the Fall - likely no Fall Facilities meeting. There is also going to be a December meeting at CERN with a technical sites-focus. Would like to discuss DC14 results in advance.
      • ATLAS Analytics questionnaire and whitepaper.
    • this week
      • Join US ATLAS Software & Computing / Physics Support meeting at LBL, August 20-22, 2014
        • Focus is on discussions and project planning for US ATLAS computing, and participation is by invitation to minimize travel and focus discussions. Not a facilities meeting per se - topics are cross-cutting, over a broad range. Will follow US ATLAS workshop in Seattle, which will include Tier3 meeting.
        • https://indico.cern.ch/event/326831/
      • SiteCertificationP30
      • Next meeting will discuss status of opportunistic production from the OSG VO on US ATLAS sites (Bo Jayatilaka/Fermilab will join).
        • Instructions for enabling the OSG VO are here: SupportingOSG
      • Rucio migration is fully completed. For the US, it is also about moving from Pandamover to standard DDM. Previously there have been comments about Pandamover advantages; after migration, we've seen the standard DDM service work just fine. E.g. at the Tier1, have not found problems at
        • Kaushik: there are some issues being worked out by the Panda team, mainly with the deletion service. Risk of breaking DDM by running multi-cloud.
        • Wei notes there are problems that are central service related. Kaushik - most of the problems are deletion related. Might be available in the log files, but these are not accessible.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • meeting (5/28/14):
    • Not much to report on the production front. Not sure when to expect large volumes of tasks.
    • Last week saw historic lows in site issues - not much to report.
    • There was a minor update to the pilot from Paul
  • meeting (6/11/14):
    • Sporadically going up and down in terms of jobs. There is some MCORE testing going on. Important to keep a low rate of MCORE pilots running. Keep 1 or 2 slots always available. Allows for fast startup of samples.
  • meeting (6/25/14):
    • Problems with DC14. Keep a few slots active at each site. At least by one to two weeks.

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Wahid Bhimji):
    http://indico.cern.ch/event/322220/contribution/1/material/slides/1.pdf
    
    1)  6/12: NERSC - file transfers failing with "DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send, [SE][srmRm][] httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server ." 
    Bestman SRM needed to be upgraded to support SHA-2 certificates. Work completed as of 6/18 - issue resolved. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106150 was 
    closed, eLog 49865.
    2)  6/13: BNL - file transfer failures with the error "DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send." From Iris: The namespace server was down due to 
    hardware issue. The problem is resolved now. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106171 was closed, eLog 49777. Also the same day 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=106181 was opened for tape staging errors ("[StatusOfBringOnlineRequest][SRM_FILE_UNAVAILABLE]"). Iris reported that the 
    files were recovered. ggus 106181 was closed, eLog 49778.
    3)  During this week - large transferring jobs backlog. See: http://indico.cern.ch/event/322220/contribution/1/material/slides/2.pdf
    4)  6/17: ADC Weekly meeting:
    http://indico.cern.ch/event/322220/
    
    Follow-ups from earlier reports:
    
    None 
    

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week.
    
    1)  6/18: MWT2 & MWT2_UC - file transfer failures with the error "[DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send." Sarah reported the site was experiencing 
    problems with the SRM door. As of 6/20 no additional errors, so https://ggus.eu/index.php?mode=ticket_info&ticket_id=106329 was closed. eLog 49895. SRM errors returned on 6/20, 
    https://ggus.eu/?mode=ticket_info&ticket_id=106440. All issues apparently resolved as of 6/25, ticket closed. eLog 49965.
    2)  6/19: https://ggus.eu/?mode=ticket_info&ticket_id=106334 was opened for file transfer errors to SLACXRD with checksum errors. However, this problem is not a site issue but rather a 
    known, occasional problem with user data. See: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCoSKnownProblem (first item under "Long standing"). ggus ticket was closed, eLog 49962.
    3)  6/20:  File transfers failing heavily across all clouds with the error "Failed to create the gfal2 handle: INITUnable to open the /usr/lib64/gfal2-plugins//libgfal_plugin_http.so
    plugin specified in the plugin directory, failure : /usr/lib64/gfal2-plugins//libgfal_plugin_http.so: undefined symbol." Issue eventually understood and resolved. 
    See https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/49896 (and details in the ggus ticket within).
    4)  6/21:  Lucille_CE - file transfers failing with "communication error on send." Site is in downtime for maintenance / upgrades. https://ggus.eu/?mode=ticket_info&ticket_id=106391 
    was closed, eLog 49967.
    5)  6/23: WISC_LOCALGROUPDISK - destination transfer errors with "srm-ifce err: Communication error on send." https://ggus.eu/index.php?mode=ticket_info&ticket_id=106442 in-progress, 
    eLog 49940.
    6)  6/24: ADC Weekly meeting:
    http://indico.cern.ch/e/322221
    
    Follow-ups from earlier reports:
    
    None 
    

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (6/11/14)
    • USERDISK cleanup is now ongoing.
    • No plans for LOCALGROUPDISK cleanup. Rucio quota system might help with LOCALGROUPDISK policy.
    • LOCALGROUPDISK management service - Kaushik - there is work going on here, to improve monitoring, and database schema.
  • meeting (6/25/14)
    • Deletion of dark data - in progress.
    • Have not yet started the cleanup for BNL.
    • Deletion rate is quite low by a factor of 10.

DDM Operations (Hiro)

meeting (5/28/14):
  • Reminder that the query to find files is different.
  • Not sure if the Rucio dump created by the Rucio team was sufficient, or not.
  • Wei: notes there is a document, but it looks like it doesn't work. Does the basic functionality exist? The REST API commands - about half are not working. Will discuss next week.
  • Hiro: agrees documentation is in poor shape.
  • Will still need the equivalent of a CCC to find dark data. Or missing data?
  • Hiro will coordinate issues.

meeting (6/11/14)

  • You can no longer use the LFC dump.
  • Have not seen the new dump from Rucio yet - will need to get back to Vincent.
  • Should no longer have dark data - it can all be deleted.
  • There will be a centrally provided script for CCC.
  • CERN FTS has some issues today - which caused a backlog, specific problem unknown.

meeting (6/25/14)

Condor CE

last week
  • Saul - started with Tim Cartwright. Will set up a separate machine to do this. Wei - has not started yet.
this week
  • Wei: set up a machine; installed but got errors, sent email. Looked at the code for checking the job status; worried about scalability.
  • BU: no update (Saul)
  • Bob: will take up at AGLT2.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference last meeting (6/11/14)

this meeting (6/25/14)

  • CA cloud: Got Triumf working well. Working with McGill and Victoria.
  • Tokyo Tier2 is interested, will look for a downtime to install.

ATLAS Connect (Dave, Rob)

last week
  • Much progress on several resource targets, as Dave Lesny will discuss.
  • Spent the last two weeks working on bringing four clusters into the Connect project. Taking an alternative approach to CVMFS delivery: an NFS-based CVMFS server.
  • Stampede: admins are working on delivering CVMFS via an NFS server; to deploy via their Lustre filesystem.
  • Midway cluster at Chicago - deployed several nodes with an NFS CVMFS solution. Proving to be successful - running ATLAS jobs; we bring everything the jobs need. Working to deploy into a larger environment.
  • ICCC - working with an NFS CVMFS component.
  • Odyssey - already an ATLAS-ready cluster, used opportunistically. A very large cluster - max has been 550 jobs at a time; preemption enabled.
this week
  • Meeting this week, https://indico.cern.ch/event/326237/. In particular see notes on CVMFS options.
  • TACC/XSEDE
    • Original startup request: ECSS Workplan for Particle Physics Computing on XSEDE Cyberinfrastructure Grant #: TG-PHY140018. Need to finish this up with Mats Rynge (XSEDE consultant) (Rob, Michael)
    • New award:
      Your request:
       Enabling Large Hadron Collider Computing on XSEDE with ATLAS Connect
       PHY140033
      
      Has been awarded an allocation on the following resources:
      
      TACC Dell PowerEdge C8220 Cluster with Intel Xeon Phi coprocessors (Stampede): 266355
      XSEDE Extended Collaborative Support: 2
      TACC Long-term tape Archival Storage (Ranch): 500
      Indiana University Gateway/Web Service Hosting (Quarry): 1
      The value of these awarded resources is $9,344.128.
      
      The allocation of these resources for your research represents a considerable investment by the NSF in advanced computing infrastructure for U.S. open science research. The rough value of this provision of resources, based on NSF awards in the support of U.S. open science, is indicated above.
      
      Requests for ECSS(Extended Collaborative Support Service) time require further assessment by ECSS staff; however, if your full request is eventually awarded, this represents an additional $16.7K/FTE-month in the value of the award.
      
      Please understand that the value of your award is an approximation, but is representative of the value of the type of allocations of resources made in support of your research.
      
      Awards will begin on Jul 01, 2014 with an end date of Jun 30, 2015.
  • Succeeded in getting ATLAS Connect working on Harvard's Odyssey cluster. (Helps that it's already ATLAS-compliant.) Have run up to 1200 slots, opportunistically (evictable).
  • Added "health checks" to the glidein script.
  • Added Tier3 cluster UTexas Austin (84 slots).
  • Midway (UC) making progress - standing up an NFS-based solution; tested and working well.
  • Illinois campus cluster - still working with the campus cluster admins. Finding issues with kernel panics caused by cvmfs client.
  • Stampede: close to an NFS-type implementation using their Lustre filesystem. Worried that repos at an alternate mount point won't work with ALRB, and probably not with user scripts. Also in touch with Asoka to implement with alternative mount points.
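
The glidein "health checks" and the mount-point concern above could be sketched roughly as follows. This is only an illustration, not the actual check in the glidein script: the function name `cvmfs_repo_ok` is hypothetical, and the alternate mount point in the candidate list is an invented placeholder. The only real path assumed is the conventional `/cvmfs/atlas.cern.ch` location.

```python
import os


def cvmfs_repo_ok(repo_path):
    """Return True if repo_path is a directory that can be listed and is non-empty.

    A repository that is missing, unreadable, or mounted but empty is
    treated as unhealthy.
    """
    try:
        return os.path.isdir(repo_path) and len(os.listdir(repo_path)) > 0
    except OSError:
        # Covers permission problems and stale or hung mounts that raise on access.
        return False


# Candidate locations to probe: the conventional path first, then a
# hypothetical alternate mount point (placeholder name, an assumption).
CANDIDATES = ["/cvmfs/atlas.cern.ch", "/alternate/mount/atlas.cern.ch"]

if __name__ == "__main__":
    for path in CANDIDATES:
        status = "ok" if cvmfs_repo_ok(path) else "unavailable"
        print(f"{path}: {status}")
```

A check like this only confirms that a repository is visible at some path; it does not address the concern that tools expecting the conventional path may fail when the repository lives elsewhere.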

Site news and issues (all sites)

  • T1:
    • last meeting(s): Started initiative to phase out Oracle at the Tier 1, under discussion. WAN architecture changes. In process of setting up an object store, based on Ceph. Good opportunity to evaluate technology under quasi-production, and the event service. BNL, ESnet, Amazon discussion to waive egress fees. Invitation to set up a grant to make a long term study, to provide data to Amazon to assess and develop a business model for academic institutions. Worker node procurement, 120 server purchase, most likely based on AMD.
    • this meeting:

  • AGLT2:
    • last meeting(s): 338k dark files in Rucio. Trouble with gatekeeper around midnight last night - not sure what happened; memory and swap filled; haven't had a chance to clean up. Did update to latest OSG gatekeeper. EX9208 at MSU will be powered up and brought up to 40 Gbps. The EX9208 at UM is still running at 10G; problem with sub-interfaces on link aggregation of 40g. Working hard on networking. There was a bug on the 40g interface on the Juniper; forwarding table does not get updated. Going to CC-NIE. Working on connecting the two sites. Bringing up EX9208 caused routing tables to overflow on the Juniper 4200 (16k entries), which caused problems. 9208 holds 1M entries.
    • this meeting: Good news on networking upgrade. 40g across campus to 100g router. Changed VLANs. One of the 40g waves to MSU. Working on second 40g UM-MSU. 2x40g to the outside world when all is in place. Hit a number of problems with the Juniper optics and patch cables.

  • NET2:
    • last meeting(s): FTS3 poor performance issue addressed by deploying more gridftp servers. Gearing up to purchase storage. 700 usable TB with 4 TB drives. Half a rack. Need to update FAX doors. Have not yet purchased storage from Dell. Working on Condor CE.
    • this week: Nothing major. Great working with Dave on ATLAS Connect - very easy, works well. Trying to get WLCG SAM numbers in shape - looks like wrong queues are being probed.

  • MWT2:
    • last meeting(s): 100g testing to BNL. At UIUC will be updating GPFS 3.5.16 to 3.5.18 to address file corruption; a downtime of 16 hours.
    • this meeting:
      • SRM problems about a week ago. Would see SRM load spikes. Spoke with Dimitry - a bug in the CRL access module, causing authentications to fail. Upgraded on Friday, and since have not had load spikes. Happy with the fix, but why are we getting those errors now? Did we get a change in the access pattern?
      • Data consistency: have been working on CCC. Got it working, but it's returning that we have 200 TB of dark data in Rucio. Is this due to incomplete datasets? Will need Rucio dumps to go further. Found four datasets in a damaged state. Reported missing files as lost via the written procedure (submitted to Jira). (Armen: there is a large backlog of deletions. The SRM is not keeping up with the Rucio deletion.)
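
The consistency check discussed here amounts to comparing two file lists: what the storage element actually holds and what the catalog knows about. A minimal sketch, assuming both dumps have already been reduced to plain lists of paths; the function name `compare_dumps` and the sample paths are hypothetical, and this is not the CCC script itself:

```python
def compare_dumps(storage_paths, catalog_paths):
    """Compare a storage dump against a catalog dump.

    Returns (dark, lost):
      dark - files present on storage but unknown to the catalog
      lost - files the catalog expects but storage does not have
    """
    storage = set(storage_paths)
    catalog = set(catalog_paths)
    dark = sorted(storage - catalog)
    lost = sorted(catalog - storage)
    return dark, lost


# Invented sample data, for illustration only.
storage_dump = ["/pool/a/file1", "/pool/a/file2", "/pool/b/file3"]
catalog_dump = ["/pool/a/file1", "/pool/b/file3", "/pool/b/file4"]

dark, lost = compare_dumps(storage_dump, catalog_dump)
print("dark:", dark)
print("lost:", lost)
```

Note that a file awaiting deletion would still show up as present on storage, so with a large deletion backlog the timing of the two dumps matters when interpreting the "dark" list.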

  • SWT2 (UTA):
    • last meeting(s): All the network equipment is in, and we've started stacking it, setting up LAGs, getting through the configuration and setting a downtime. Close to scheduling downtime for networking upgrade. Prototype system in place. One more call to Dell needed for the sign-off on the configuration. Expect to do the upgrade after that. Close to doing a purchase for worker nodes. Not getting great pricing from Dell. Ivybridge. Michael: on-going evaluation of Intel processor, AMD6000, seem to be much more cost effective. Seems to be well-suited.
    • this meeting: Biggest thing is cleaning up Rucio/non-Rucio data. Some test queues are writing into non-Rucio areas. Access problems for users from the Dutch CA. Upgraded compute nodes to CVMFS 2.1.19.

  • SWT2 (OU, OSCER):
    • last meeting(s): LHCONE - working with John Bigrow to set this up.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional and operational and active.
    • this meeting:

  • WT2:
    • last meeting(s): Will have a short outage next week for power work.
    • this meeting:

AOB

last meeting
  • OSG VO request next meeting
  • Alden: Sched config updating service was running on three servers; the machines were decommissioned. But restored - report any
  • v31 spreadsheet please
  • Gratia-APEL reporting email backup.
this meeting


-- RobertGardner - 24 Jun 2014
