r4 - 09 Jul 2014 - 14:51:15 - RobertGardner



Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees: Dave, Bob, Rob, Torre, Michael,
  • Apologies: Wei
  • Guests: Bo Jayatilaka, Chander Seghal (FNAL/OSG), Mats Rynge

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • US ATLAS Software & Computing / Physics Support meeting at LBL, August 20-22, 2014
        • Focus is on discussions and project planning for US ATLAS computing, and participation is by invitation to minimize travel and focus discussions. Not a facilities meeting per se - topics are cross-cutting, over a broad range. Will follow US ATLAS workshop in Seattle, which will include Tier3 meeting.
        • https://indico.cern.ch/event/326831/
      • SiteCertificationP30
      • Next meeting will discuss status of opportunistic production from the OSG VO on US ATLAS sites (Bo Jayatilaka/Fermilab will join).
        • Instructions for enabling the OSG VO are here: SupportingOSG
      • The Rucio migration is fully completed. For the US, this also means moving from Pandamover to standard DDM. Previously there were comments about Pandamover's advantages, but since the migration the standard DDM service has worked just fine; e.g., at the Tier1 no problems have been found.
        • Kaushik: there are some issues being worked out by the Panda team, mainly with the deletion service. There is a risk of breaking DDM by running multi-cloud.
        • Wei notes there are problems related to central services. Kaushik: most of the problems are deletion related. Details might be available in the log files, but these are not accessible.
    • this week
      • Thanks for the updates to the SiteCertificationP30 table. Please update the facilities capacity spreadsheet (v31) before next week.
      • Guest presentation from Bo Jayatilaka (OSG/Fermilab) to discuss opening queues for opportunistic production.
      • In the site certification table, we'll have an exercise for DC14: time for specific tests, functional and at scale, verifying that expected service/performance levels are available. We will assemble a team of people to look at the different aspects.
      • In terms of timeline, a TIM is coming in October; in advance, the facility should have a significant amount of data on how sites performed.

Supporting the OSG VO on US ATLAS Sites (Bo Jayatilaka, Fermilab)

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • meeting (6/25/14):
    • Problems with DC14: keep a few MCORE slots active at each site, at least one to two weeks before DC14 ramps up.
  • meeting (7/9/14):
    • Sites are quite full; the first batch of MCORE jobs arrived last week, and a second batch came last night. Expect this behavior for the time being. ADC is happy with the capacity.
    • Still, we'll get single-core jobs in between.
    • A transfer backlog developed last week: an FTS3 problem, with Rucio call-back failures caused by thread-safety issues. On Friday, Rucio was allowed to do callbacks. The transfer backlog is still being addressed.
    • Hiro: There seems to be a reverse DNS issue at CERN.
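The reverse-DNS symptom Hiro mentions (and the CGSI-gSOAP "Could not convert the address information to a name or address" errors in the shift report) comes down to a failed PTR lookup for the peer's IP. A minimal sketch of the same lookup via the system resolver; the IP below is a placeholder, not an actual FTS or SE host:

```python
# Minimal reverse-DNS (PTR) check. The CGSI-gSOAP error "Could not convert
# the address information to a name or address" means a reverse lookup
# failed for the peer's IP; this mirrors that lookup with the system
# resolver. The IP here is a placeholder - substitute the address that
# appears in the transfer error.
import socket

def reverse_lookup(ip):
    try:
        name, _aliases, _addrs = socket.gethostbyaddr(ip)
        return name
    except socket.herror:
        return None  # no PTR record: the same condition CGSI-gSOAP trips on

print(reverse_lookup("127.0.0.1"))
```

If this returns None for the host named in the error, the problem is on the DNS side (at either end) rather than in the transfer service itself.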

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Alessandra Forti):
    1)  6/27: DDM dashboard was unavailable for ~six hours. Issue was a corrupted database connection. Service restored as of ~7:30 p.m. CST.
    2)  6/28: UPENN - file transfers failing with "could not open connection to srm.hep.upenn.edu." Site admin reported the problem was fixed later 
    that day (no details). Errors went away, so closed https://ggus.eu/?mode=ticket_info&ticket_id=106542. eLog 50002.
    3)  7/1: LUCILLE - after coming out of a maintenance downtime file transfers were failing with 
    "Communication error on send, err: [SE][srmRm][] httpg://lutse1.lunet.edu:8443/srm/v2/server: 
    CGSI-gSOAP running on fts113.cern.ch reports Could not convert the address information to a name or address." 
    So apparently a DNS issue on either the OU or CERN end. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106592 in-progress, eLog 50053.
    4)  7/1: ADC Weekly meeting:
    http://indico.cern.ch/event/322222/ (several relevant topics - Rucio migration, ProdSys2, DC14)
    Follow-ups from earlier reports:
    (i)  6/23: WISC_LOCALGROUPDISK - destination transfer errors with "srm-ifce err: Communication error on send." 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=106442 in-progress, eLog 49940.
    Update 6/30: site reported the problem was fixed (no details). Transfers are succeeding - ggus 106442 was closed. eLog 50020.

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    No meeting this week - ongoing WLCG workshop.
    1)  7/3 early a.m.: AGLT2 - file transfers failing with "gridftp internal operation timeout, operation canceled, operation timeout." Bob reported the site is 
    working to address some VMWare issues. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106620 in-progress, eLog 50121.
    2)  7/5: UTA_SWT2 - file transfers were failing due to an expired host certificate on a gridftp server. Certificate was updated - issue resolved. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=106661 was closed, eLog 50111.
    3)  7/8 early a.m.: AGLT2 - squid service shown as down in the monitor. Related to the VMWare issue in 1) above. Service was restored as of early afternoon. 
    https://ggus.eu/?mode=ticket_info&ticket_id=106691 was closed, eLog 50134.
    4)  7/8: ggus tickets were opened for two sites, MWT2 and SLACXRD, for file transfers failing with errors like "CGSI-gSOAP running on fts112.cern.ch reports 
    Error reading token data header: Connection closed." Sarah reported that she had noticed significant packet loss events in the perfsonar data around this time. 
    Also seeing transfer failures at SWT2_CPB with the same errors. https://ggus.eu/?mode=ticket_info&ticket_id=106708 / eLog 50118 (SLACXRD) & 
    https://ggus.eu/?mode=ticket_info&ticket_id=106709 / eLog 50119 (MWT2) in-progress.
    5)  7/8: SWT2_CPB - a storage server went off-line due to a problem with the NIC in the machine (cooling fan). Hardware was replaced, issue resolved. Leaving 
    https://ggus.eu/?mode=ticket_info&ticket_id=106732 open for now since transfers are still failing with the "connection closed" errors like 4) above. eLog 50133.
    6)  7/8: ADC Weekly meeting:
    No meeting this week - ongoing WLCG workshop.
    Follow-ups from earlier reports:
    (i)  7/1: LUCILLE - after coming out of a maintenance downtime file transfers were failing with "Communication error on send, err: [SE][srmRm][] 
    httpg://lutse1.lunet.edu:8443/srm/v2/server: CGSI-gSOAP running on fts113.cern.ch reports Could not convert the address information to a name or address." 
    So apparently a DNS issue on either the OU or CERN end. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106592 in-progress, eLog 50053. 
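The expired host certificate at UTA_SWT2 (item 2 above) is the kind of failure that can be caught in advance. A minimal sketch that shells out to openssl's -checkend; the certificate path shown is the conventional grid location and is illustrative only:

```python
# Sketch: warn when a host certificate is within some number of days of
# expiry, using openssl's -checkend (exit 0 if the cert is still valid at
# now + seconds, nonzero otherwise). The path in the example is the
# conventional location and is illustrative only.
import subprocess

def expires_within(cert_path, days):
    seconds = days * 24 * 3600
    result = subprocess.run(
        ["openssl", "x509", "-checkend", str(seconds), "-noout", "-in", cert_path],
        capture_output=True,
    )
    return result.returncode != 0  # nonzero -> will have expired by then

# Example usage (hypothetical path):
# if expires_within("/etc/grid-security/hostcert.pem", 14):
#     print("WARNING: host certificate expires within 14 days")
```

Run from cron against each gridftp/SRM host, this turns a ticketed outage into a routine renewal.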

  • Two separate storage server issues at UTA; a NIC was fixed, but DDM transfer errors are still seen. But it's happening at other sites too?

Data Management and Storage Validation (Armen)

DDM Operations (Hiro)

meeting (7/9/14)
  • There is dark data at NET2 to be cleaned up.
  • Empty directories can be cleaned up.
  • At Tier1, found many datasets from non-Rucio directories that need to be moved.
  • Deletion service: the claim is that it's solved, but the rate is still low. Urgent cleanup tasks are done for some sites; overall 5-6 Hz. Started a month ago, mostly USERDISK; 0.5M datasets. There is no plan to deal with this. We need a statement from ADC management.
  • How much storage is dark or awaiting deletion?
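The dark-data accounting discussed above reduces to a set comparison between a storage dump and a catalog (Rucio) dump: files on disk but not in the catalog are dark; catalogued files missing from disk are lost. A minimal sketch, with invented file names for illustration:

```python
# Sketch of a dark-data / lost-file comparison. "Dark" files exist on
# storage but not in the catalog; "lost" files are catalogued but missing
# from storage. Inputs here are plain lists of invented paths; in practice
# they would come from a storage namespace dump and a Rucio dump.
def compare_dumps(storage_files, catalog_files):
    storage, catalog = set(storage_files), set(catalog_files)
    dark = sorted(storage - catalog)   # candidates for cleanup
    lost = sorted(catalog - storage)   # candidates to declare lost
    return dark, lost

dark, lost = compare_dumps(
    ["/atlas/rucio/a", "/atlas/rucio/b", "/atlas/orphan"],
    ["/atlas/rucio/a", "/atlas/rucio/b", "/atlas/rucio/c"],
)
print(dark, lost)
```

Summing file sizes over the dark list would answer the "how much storage is dark" question directly.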

Condor CE validation (Saul, Wei, Bob)

this week
  • Wei: set up a machine; installed the software but got errors, sent email. Looked at the code for checking the job status; worried about scalability.
  • BU: no update (Saul)
  • Bob: will take up at AGLT2.
meeting (7/9/14)
  • At BNL, have an instance of HTCondor CE working on a production instance, in parallel with the GRAM CEs. Xin is working on the Gratia accounting information; otherwise operational.
  • AGLT2: working on a test gatekeeper, but the update to OSG 3.2.12 broke the RSV probes (GRAM auth failed). Might be a configuration issue.
  • NET2: identified the hardware to do this. Will put OSG on top of that. #1 priority.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference meeting (6/25/14)
  • CA cloud: Got Triumf working well. Working with McGill and Victoria.
  • Tokyo Tier2 is interested, will look for a downtime to install.
meeting (7/9/14)
  • Deployed over 91% of all ATLAS files; the goal is more than 95%. NDGF (a distributed "site") and Tokyo are the sites to add. NDGF will probably be difficult due to their networking costs.
  • Beijing is deploying.
  • Stability: want to improve this to better than 99%. Agreed that we must be able to debug remotely, using Xrootd 4.0. Deployed and being tested at MWT2 - remote debugging works. Will deploy more widely after a bit more testing.
  • FAX presented at software tutorial two weeks ago; new to most people.
  • ROOTIO workshop - important for xAOD read performance, WAN and local. Instrumenting the code to discover what is being read.
  • Monitoring - for failover and overflow.

ATLAS Connect (Dave, Rob)

meeting (7/9/2014)
  • Working with Notre Dame on Parrot, new test release. Progress is slow.
  • Installing on an NFS server, exporting to cluster - tested at UC Midway - working at small scale. Larger scale next week.
  • "Modules" solution.

Site news and issues (all sites)

  • WT2:
    • last meeting(s): Will have a short outage next week for power work.
    • this meeting:

  • T1:
    • last meeting(s): Started initiative to phase out Oracle at the Tier 1, under discussion. WAN architecture changes. In process of setting up an object store, based on Ceph. Good opportunity to evaluate technology under quasi-production, and the event service. BNL, ESnet, Amazon discussion to waive egress fees. Invitation to setup a grant to make a long term study, to provide data to Amazon to assess and develop a business model for academic institutions. Worker node procurement, 120 server purchase, most likely based on AMD.
    • this meeting: Preparing for a CPU upgrade, procurement underway. New storage replacements are being phased in. Older storage is being converted to an object store using Ceph. Interested in getting experience with other interfaces, such as S3. Useful for the event service. Still working on completing the WAN setup. Have 100g into the lab. Moving the 10g circuit infrastructure to 100g, circuit by circuit (LHCONE, e.g., has a 40g cap). On CVMFS - moving to the new client 2.1.19.

  • AGLT2:
    • last meeting(s): Good news on networking upgrade. 40g across campus to 100g router. Changed VLANS. One of the 40g waves to MSU. Working on second 40g UM-MSU. 2x40g to the outside world when all is in place. Hit a number of problems with the Juniper optics, and patch cables.
    • this meeting: Have Juniper switches working well. 80g between sites, and to elsewhere. Working on asymmetries.

  • NET2:
    • last meeting(s): Nothing major. Great working with Dave on ATLAS Connect - very easy, works well. Trying to get the WLCG SAM numbers in shape - it looks like the wrong queues are being probed.
    • this week: Upgraded CVMFS to new version at HU. New FAX node, and Condor CE testing. Storage and worker node purchase. LHCONE: there is a definite plan for a switch at Manlan to be used for LHCONE.

  • MWT2:
    • last meeting(s):
      • SRM problems about a week ago: we would see SRM load spikes. Spoke with Dimitry - a bug in the CRL access module was causing authentications to fail. Upgraded on Friday, and since then have not had load spikes. Happy with the fix, but why are we getting those errors now? Did the access pattern change?
      • Data consistency: have been working on CCC. Got it working, but it's reporting that we have 200 TB of dark data in Rucio. Is this due to incomplete datasets? Will need Rucio dumps to go further. Found four datasets in a damaged state. Reported missing files as lost via the written procedure (submitted to Jira). (Armen: there is a large backlog of deletions. The SRM is not keeping up with the Rucio deletion.)
    • this meeting: Ordered 16 additional nodes at Illinois. Still upgrading CVMFS.

  • SWT2 (UTA):
    • last meeting(s): Biggest thing is cleaning up Rucio/non-Rucio data. Some test queues are writing into non-Rucio areas. Access problems for users from the Dutch CA. Upgraded compute nodes to CVMFS 2.1.19.
    • this meeting: Throughput and storage at CPB - investigating an issue. Have cleaned up the dark data and empty directories. Will look into supporting the OSG VOs; shorter jobs would be easier to accommodate. Network upgrade - should have all the equipment needed; expected within the next two weeks.

  • SWT2 (OU, OSCER):
    • last meeting(s): LHCONE - working with John Bigrow to set this up.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting: Reverse DNS problem, which seems to be CERN related.



-- RobertGardner - 08 Jul 2014

  • graph_image.png:

