
Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees: Alden, Rob, Torre, Fred, Shawn, Saul, Armen, Dave, Patrick
  • Apologies: Michael, Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • US ATLAS Software & Computing / Physics Support meeting at LBL, August 20-22, 2014
      • Thanks for updates to the SiteCertificationP30 table. Please update the facilities capacity spreadsheet (v31) before next week.
      • Guest presentation from Bo Jayatilaka (OSG/Fermilab) to discuss opening queues for opportunistic production.
      • In the site certification table we will add an exercise for DC14: time for specific tests, functional and at scale, to confirm that the expected service/performance levels are available. We will assemble a team of people to look at the different aspects.
      • In terms of timeline, a TIM is coming in October; in advance of it, the facility should have a significant amount of data on how sites performed.
    • this week
      • Facility performance metrics. See URL in email.

Supporting the OSG VO on US ATLAS Sites

  • See SupportingOSG for specifications
  • Boston: reduced queue to 4 hours, OSG made an adjustment. All good.
  • UTA: have not gotten to it yet.
  • No other updates.

Operations overview: Production and Analysis (Kaushik)

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week.
    1)  7/10 p.m. - sites were draining due to an authentication issue (the certificate used to submit pilots not working). John H. at BNL had to re-sign the AUP in VOMRS, 
    and this fixed the problem. Also around this time there were substantial transferring backlogs due to issues with FTS3. Ongoing work by experts to address this issue.
    2)  7/11:  The US cloud was moved from CERNFTS3 to RALFTS3 - see:
    3)  7/15: SWT2_CPB: file transfers failing with the SRM error "open() fail." https://ggus.eu/index.php?mode=ticket_info&ticket_id=106879 in-progress.
    4)  7/15: ADC Weekly meeting:
    5)  7/15: ATLAS Weekly talk with DC14 status/plans:
    Follow-ups from earlier reports:
    (i)  7/1: LUCILLE - after coming out of a maintenance downtime file transfers were failing with "Communication error on send, err: [SE][srmRm][] 
    httpg://lutse1.lunet.edu:8443/srm/v2/server: CGSI-gSOAP running on fts113.cern.ch reports Could not convert the address information to a name or address." 
    So apparently a DNS issue on either the OU or CERN end. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106592 in-progress, eLog 50053.
    (ii)  7/3 early a.m.: AGLT2 - file transfers failing with "gridftp internal operation timeout, operation canceled, operation timeout." Bob reported the site is working to 
    address some VMWare issues. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106620 in-progress, eLog 50121.
    Update: as of 7/14 VMWare and networking issues resolved - ggus 106620 was closed.
    (iii)  7/8: ggus tickets were opened for two sites, MWT2 and SLACXRD, for file transfers failing with errors like "CGSI-gSOAP running on fts112.cern.ch reports 
    Error reading token data header: Connection closed." Sarah reported that she had noticed significant packet loss events in the perfsonar data around this time. 
    Also seeing transfer failures at SWT2_CPB with the same errors. https://ggus.eu/?mode=ticket_info&ticket_id=106708 / eLog 50118 (SLACXRD) & 
    https://ggus.eu/?mode=ticket_info&ticket_id=106709 / eLog 50119 (MWT2) in-progress.
    Update 7/10: Apparently a transient issue at SLAC, since the errors stopped.  ggus 106708 was closed. eLog 50150.
    Update: as of 7/14 no more errors between MWT2 and NL cloud - ggus 106709 was closed.
    (iv)  7/8: SWT2_CPB - a storage server went off-line due to a problem with the NIC in the machine (a failed cooling fan). Hardware was replaced, issue resolved. 
    Leaving https://ggus.eu/?mode=ticket_info&ticket_id=106732 open for now since transfers are still failing with the "connection closed" errors as in (iii) above. eLog 50133.
    Update: Some transfer errors were reported later in the day after the NIC fan was replaced, but these were unrelated to this particular hardware problem. 
    Rather they were reported in ggus 106879 on 7/15 (see 3) above). ggus 106732 was closed.

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Alessandro Di Girolamo, Kai Leffhalm):
    1)  7/16: BNL - https://ggus.eu/index.php?mode=ticket_info&ticket_id=106896 was opened due to production jobs failing with the error "Could not add files to DDM: 
    dq2.repository.DQRepositoryException.DQFrozenDatasetException ." Not a site issue, but rather either a task or panda problem since attempts were being made to 
    add files to frozen datasets. ggus 106896 was closed, eLog 50270.
    2)  7/18: Discussion from Alessandro Di Girolamo concerning the lack of production jobs on the grid:
    3)  7/21: FTS was upgraded. During this period many transfer errors were seen across many clouds with "INIT Unable to open the /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so plugin
    specified in the plugin directory." The issue was resolved once the upgraded hosts/services were restarted. See:
    4)  7/22: ADC Weekly meeting (Rucio migration, dark data):
    5)  7/22 p.m.: Many sites in all clouds were auto-excluded by HC testing. The job errors were a mix of "Get error: Encountered an empty SURL-GUID dictionary" and 
    "Could not add files to DDM." Most likely related to a reported Oracle problem which affected some site services hosts.
    Follow-ups from earlier reports:
    (i)  7/1: LUCILLE - after coming out of a maintenance downtime file transfers were failing with "Communication error on send, 
    err: [SE][srmRm][] httpg://lutse1.lunet.edu:8443/srm/v2/server: CGSI-gSOAP running on fts113.cern.ch reports Could not convert the address information to a name 
    or address." So apparently a DNS issue on either the OU or CERN end. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106592 in-progress, eLog 50053.
    (ii)  7/15: SWT2_CPB: file transfers failing with the SRM error "open() fail." https://ggus.eu/index.php?mode=ticket_info&ticket_id=106879 in-progress.
    Update 7/21: Issue understood. Owing to a problematic RAID array the host storage server had been set to read-only. This caused transfer errors for the sonar files 
    because FTS could not delete the old files. The storage server was remounted read-write so that deletions succeed, and the errors went away. ggus 106879 was closed. (A quick check for this condition is sketched below.)
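
Since a storage server left mounted read-only is exactly what broke deletions in ggus 106879, a quick host-side check can catch the condition early. Below is a minimal sketch in Python (Linux only, illustrative; scanning /proc/mounts is an assumption, not part of the site's actual procedure):

    # Scan /proc/mounts and flag any filesystem mounted read-only, which
    # would make deletions fail as in ggus 106879.
    def read_only_mounts(mounts_file="/proc/mounts"):
        flagged = []
        with open(mounts_file) as fh:
            for line in fh:
                device, mountpoint, fstype, options = line.split()[:4]
                if "ro" in options.split(","):
                    flagged.append((device, mountpoint, fstype))
        return flagged

    if __name__ == "__main__":
        for device, mountpoint, fstype in read_only_mounts():
            print("read-only: %s on %s (%s)" % (device, mountpoint, fstype))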

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (7/23/14)
    • Overall the US is deleting at roughly 20 Hz (estimate). Expect the backlog to be completed shortly.
    • There were a couple of issues with BNL.
    • Fred - so this is about 1.7M files per day (20 Hz × 86,400 s/day)? How fast is dark data being deleted? How many files do we have to delete in the US?
    • Armen - there should be no dark data being generated, only "temporarily" dark data. About 3M dark files exist due to Panda pilot issues (see Tomas' email).
    • Sarah: working with Tomas to remove files that are in the Rucio catalog but not in a dataset. Not sure if the file sizes are correct. MWT2, AGLT2, BNL definitely have these.
    • Fred: how are you confirming whether new dark data is being created? Armen: doesn't see anything in the bourricot plots. Suggests looking at Rucio dumps (see the sketch after this list).
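
To make the Rucio-dump suggestion concrete: a dark-data check amounts to comparing a dump of what the storage element actually holds against a dump of what Rucio believes it holds. A minimal sketch in Python, assuming two plain-text dumps with one file path per line (the dump file names here are illustrative):

    # Load each dump as a set of paths and take set differences.
    def load_paths(filename):
        with open(filename) as fh:
            return set(line.strip() for line in fh if line.strip())

    storage = load_paths("storage_dump.txt")   # what the SE actually holds
    catalog = load_paths("rucio_dump.txt")     # what Rucio thinks it holds

    dark = storage - catalog    # on disk but unknown to Rucio: dark data
    lost = catalog - storage    # known to Rucio but missing on disk

    print("dark files: %d, lost files: %d" % (len(dark), len(lost)))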

DDM Operations (Hiro)

meeting (7/9/14)
  • There is dark data at NET2 to be cleaned up.
  • Empty directories can be cleaned up.
  • At Tier1, found many datasets from non-Rucio directories that need to be moved.
  • Deletion service - the claim is that it is solved, but the rate is still low (overall 5-6 Hz). Urgent cleanup tasks are done for some sites. The cleanup started a month ago, mostly USERDISK; about 0.5M datasets remain. There is no plan to deal with this; we need a statement from ADC management.
  • Open question: how much storage is dark or awaiting deletion?
meeting (7/23/14)
  • No Hiro.
  • Mark: last night most sites were auto-excluded by HC. Two named errors were DDM related (could not update datasets, etc.). No eLogs or email. Perhaps the Oracle update affected site services.
  • Wei: a very large number of jobs are in the transferring state. Mark: experts are aware. Saul: notes these are waiting for Taiwan. Mark will follow up.

Condor CE validation (Saul, Wei, Bob)

this week
  • Wei: set up a machine; installed it but got errors, sent email. Looked at the code for checking job status; worried about scalability.
  • BU: no update (Saul)
  • Bob: will take up at AGLT2.
meeting (7/9/14)
  • At BNL, have an instance of HTCondor CE working on a production instance, in parallel with the GRAM CEs. Xin is working on the Gratia accounting information; otherwise operational.
  • AGLT2: working on a test gatekeeper, but the update to OSG 3.2.12 broke the RSV probes (GRAM auth failed). Might be a configuration issue.
  • NET2: identified the hardware to do this. Will put OSG on top of that. #1 priority.
meeting (7/23/14)
  • Bob: has it running at AGLT2. Discovering differences between the requirements and the documentation. Working on integrating the existing MCORE configuration and routing to the queues; an emergency OSG release had fixes for this. Interesting, and getting there. Providing feedback to Tim. Using the Job Router looks interesting. Running on the test gatekeeper. There are still problems; probably not quite ready for primetime.
  • NET2: installed and working. Augustine testing, in touch with Tim and Brian.
  • SLAC: have been working with OSG. Gave them an account; Brian is finding problems and configuration issues, including a conflict over logfile locations. Also have some other issues.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
meeting (7/9/14)
  • Deployed over 91% of all ATLAS files. Goal is to be more than 95%. NDGF (a distributed "site") and Tokyo are sites to add. Probably will be difficult for NDGF, due to their networking costs.
  • Beijing is deploying.
  • Stability: wants to improve this to better than 99%. Agreed that we must debug remotely, using Xrootd 4.0. Deployed and testing at MWT2 - remote debugging works. Will deploy after a bit more testing.
  • FAX presented at software tutorial two weeks ago; new to most people.
  • ROOTIO workshop - important for xAOD read performance, WAN and local. Instrumenting the code to discover what is being read.
  • Monitoring - for failover and overflow.
meeting (7/23/14)
  • Working with McGill University - a strange problem with GPFS setup and permissions. Unresolved.
  • Victoria also experiencing problems - looking at it today.
  • A number of UK sites were down; fixed.
  • Changed the WAN traffic limit for overflow to 10 Gbps for all sites. Parameters for the overflow algorithm will be used (a 25 MB/s limit from the cost matrix; see the sketch after this list). Will soon include the Canadian cloud in the overflow.
  • Fred's reported failures at AGLT2, BNL seemed to be one-offs.
  • New reports to the ROOT team about checking error codes.
  • New version of Xrootd on a test machine at SLAC for testing; client compatibility testing.
  • Version 4.0.1 was incompatible with the old client; the new version uses 4.0.2 (there are other reasons for moving to the new version as well).
  • Fred: seg faults? Ilija sent a bug report.
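
For illustration only, a minimal sketch of how the 25 MB/s cost-matrix gate could work: overflow between a pair of sites is allowed only when the measured cost-matrix rate exceeds the threshold. The function, data structure, and example rates are assumptions, not the actual FAX/PanDA overflow algorithm:

    # Gate overflow on the measured cost-matrix rate (MB/s) between sites.
    OVERFLOW_THRESHOLD_MBPS = 25.0

    def allow_overflow(source_site, destination_site, cost_matrix):
        """True if WAN reads from source_site at destination_site look fast enough."""
        rate = cost_matrix.get((source_site, destination_site), 0.0)
        return rate >= OVERFLOW_THRESHOLD_MBPS

    # Example with made-up rates:
    cost_matrix = {("MWT2", "AGLT2"): 310.0, ("MWT2", "TRIUMF"): 12.5}
    print(allow_overflow("MWT2", "AGLT2", cost_matrix))   # True
    print(allow_overflow("MWT2", "TRIUMF", cost_matrix))  # False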

ATLAS Connect (Dave, Rob)

meeting (7/23/2014)
  • On Midway - have deployed CVMFS on the cluster. Running production.
  • Illinois campus cluster: need one change to the PBS queue.
  • Stampede - CVMFS remains the issue; the admins are very cautious and want to rsync the repository instead. They request avoiding the hard-coded path /cvmfs. There is a discussion on this in ATLAS regarding relocatability (see the sketch below).
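
To illustrate the relocatability point: jobs can resolve the ATLAS software area from the environment instead of hard-coding /cvmfs, so a site that rsyncs the repository to a local path can simply point the variable elsewhere. A minimal sketch in Python; VO_ATLAS_SW_DIR is a commonly used variable in ATLAS grid environments, and the fallback path is an assumption for illustration:

    import os

    # Resolve the (possibly relocated) ATLAS software base from the environment.
    ATLAS_SW_BASE = os.environ.get("VO_ATLAS_SW_DIR",
                                   "/cvmfs/atlas.cern.ch/repo/sw")

    def sw_path(*parts):
        """Build a path under the ATLAS software area without hard-coding /cvmfs."""
        return os.path.join(ATLAS_SW_BASE, *parts)

    print(sw_path("software"))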

Site news and issues (all sites)

  • WT2:
    • last meeting(s): Will have a short outage next week for power work.
    • this meeting: Short outage August 7 for internal network upgrades.

  • T1:
    • last meeting(s): Preparing for a CPU upgrade, procurement underway. New storage replacements are being phased in. Older storage is being converted to an object store using Ceph. Interested in getting experience with other interfaces, such as S3; useful for the event service. Still working on completing the WAN setup. Have 100 Gbps into the lab. Moving the 10 Gbps circuit infrastructure to 100 Gbps circuit by circuit (LHCONE, e.g., has a 40 Gbps cap). On CVMFS - moving to the new client, 2.1.19.
    • this meeting:

  • AGLT2:
    • last meeting(s): Have the Juniper switches working well. 80 Gbps between sites, and to elsewhere. Working on asymmetries.
    • this meeting: Planning to do load testing tonight with Hiro. WebDAV door being set up. Ilija - can we do stress testing between UC and UM? (Need to make sure the UM servers are used.)

  • NET2:
    • last meeting(s): Upgraded CVMFS to new version at HU. New FAX node, and Condor CE testing. Storage and worker node purchase. LHCONE: there is a definite plan for a switch at Manlan to be used for LHCONE.
    • this week: CV

  • MWT2:
    • last meeting(s):
    • this meeting: Working on upgrades of dCache services. Cleaning up dead nodes. Cleaning up dark data.

  • SWT2 (UTA):
    • last meeting(s): Throughput and storage at CPB - investigating an issue. Have cleaned up the dark data and empty directories. Will look into supporting the OSG VOs; shorter jobs would be easier to accommodate. Network upgrade - should have all the equipment needed; expected in the next two weeks.
    • this meeting: Dutch CA issue tracked down to address resolution on the CRL server. Network upgrade within the next week.

  • SWT2 (OU, OSCER):
    • last meeting(s): LHCONE - working with John Bigrow to set this up.
    • this meeting: CVMFS upgraded to 2.1.19 on OCHEP cluster. OSCER still not upgraded.

  • SWT2 (LU):
    • last meeting(s): Reverse DNS problem, which seems to be CERN related.
    • this meeting: CVMFS already upgraded. Reverse DNS still not resolved; it seems to be flip-flopping back and forth (a consistency check is sketched below).
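
Because the problem appears intermittent, repeatedly checking forward and reverse resolution from both ends helps show which side flips. A minimal sketch in Python, using the SE host name from the SRM errors above (run it from the site and, if possible, from a CERN host):

    import socket

    def check_dns_consistency(hostname):
        """Forward-resolve the host, then reverse-resolve each address and
        report whether the PTR record points back at the same name."""
        addrs = socket.gethostbyname_ex(hostname)[2]
        for addr in addrs:
            try:
                ptr_name = socket.gethostbyaddr(addr)[0]
            except socket.herror as err:
                print("%s -> %s: reverse lookup failed (%s)" % (hostname, addr, err))
                continue
            status = "OK" if ptr_name.lower() == hostname.lower() else "MISMATCH"
            print("%s -> %s -> %s: %s" % (hostname, addr, ptr_name, status))

    if __name__ == "__main__":
        check_dns_consistency("lutse1.lunet.edu")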


last meeting / this meeting
  • Alden: DAST shifters are needed for the North American timezone.

-- RobertGardner - 23 Jul 2014
