


Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees:
  • Apologies: Shawn, Armen
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • US ATLAS Software & Computing / Physics Support meeting at LBL, August 20-22, 2014
      • Facility performance metrics meeting yesterday. See URL in email.
    • this week
      • Facility performance metrics
        • Google docs to describe plans and gather materials: http://bit.ly/V1T3Q8
        • Updates from Hiro's load testing
        • Need updates in all areas - okay to put URLs to external sources.
        • Michael: setting up an object store based on Ceph at BNL; getting the event service supported by the pilot is making good progress. Ben is working on integrating the S3 protocol, which is timely since we have a large-scale pilot project with ESnet and Amazon underway. With S3 support directly in the pilot, no tricks are required to access Amazon, and the same interface works with Ceph. Starting to evaluate the performance. (A short S3 access sketch follows this list.)
        • Saul: see the email summarizing the pilot environment. Common environment and configuration.
      • For the facilities part of the LBL workshop (https://indico.cern.ch/event/326831/) - about 3 hours. Introductory slides to kick off discussion.
        • Management of user data. How to access, manage, delete, where, allocation.
        • Dynamic configurations for multicore at sites: a path forward is needed; this is becoming a big issue, and things are now moving in this direction, dynamically. (Mark has noticed Panda dynamically reassigning work ahead of availability.)
        • Impact of derivation framework, and Jedi
        • Impact of new analysis models, pROOT, etc
        • Federated access - do we have an understanding of what will hit the sites after Jedi is activated? Michael: it would be good to summarize the extent to which sites should be prepared for WAN accesses.
        • Consolidation of queues.
        • How to prioritize access for US physicists
        • Tier3 and ATLAS Connect, and the new analysis chain. Need a plan, and people willing to help.
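
Regarding the S3 work above: a minimal sketch (not the actual pilot code) of what object access through an S3-compatible interface looks like, using the boto library. The same call pattern works against Amazon S3 and against a Ceph radosgw gateway; the endpoint host, credentials, and bucket/key names below are placeholders.

    # Minimal sketch: write and read an object through an S3-compatible endpoint.
    # Works the same way against Amazon S3 and a Ceph radosgw gateway.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',          # placeholder credential
        aws_secret_access_key='SECRET_KEY',      # placeholder credential
        host='ceph-gw.example.bnl.gov',          # hypothetical radosgw endpoint
        is_secure=True,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('eventservice-test')      # hypothetical bucket name
    key = bucket.new_key('run1/event-range-0001')
    key.set_contents_from_string('event payload goes here')
    print(key.get_contents_as_string())

Swapping the host between an Amazon endpoint and the Ceph gateway is the only change needed, which is the point about needing no special tricks in the pilot.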

Supporting the OSG VO on US ATLAS Sites

last meeting(s)

  • Boston: reduced queue to 4 hours, OSG made an adjustment. All good.
  • UTA: have not gotten to it yet.
  • No other updates.

this meeting:

  • I asked Bo for a follow-up from the OSG side of things. Here is his summary: Hi Rob,

The most striking difference we see is at NET2, where the policy towards the OSG VO shifted from a fixed quota to running fully opportunistically. While there are now periods of time where the OSG VO gets zero hours, we see much larger peaks during periods of lower ATLAS running and effectively get 8-10x more wall hours at the site:


Saul can presumably produce a longer-term version of this plot which shows how effectively OSG VO is filling in the gaps of ATLAS production:


Thanks, Bo

  • At UTA, will get back to this once the network upgrade work is complete.
  • Saul: my point is that the queue time can be short; 4 hours was good.

Operations overview: Production and Analysis (Kaushik)

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Armen):
    1)  7/24 early a.m.: BNL - file transfer failures with "SOURCE file size is 0." Hiro reported the problem was fixed. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=107202 was closed, eLog 50329.
    2)  7/26: MWT2 - file transfer failures with " SRM_NO_FREE_SPACE" errors. From Sarah: We had a couple of pools offline for upgrades, and our space 
    token auto-adjuster did not adjust correctly to the reduction in space. I've manually adjusted the tokens so that the auto-adjuster can work again. Issue resolved, 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=107263 was closed. eLog 50369.
    3)  7/27 early a.m.: UTA_SWT2 - file transfers were failing with the error "Communication error on send, err: [SE][srmRm][] 
    httpg://gk05.swt2.uta.edu:8443/srm/v2/server: java.lang.reflect.InvocationTargetException]." A partition filled up on the SRM host, affecting the service. 
    Space was freed up, issue resolved. https://ggus.eu/index.php?mode=ticket_info&ticket_id=107266 was closed, eLog 50387.
    4)  7/27 early a.m.: MWT2 - file transfers failing with errors like "Communication error on send, err: [SE][ReleaseFiles] httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2: 
    CGSI-gSOAP running on fts116.cern.ch reports Error reading token data header: Connection reset by peer." Sarah reported about a recurring problem over 
    the past month with high load events on the site SRM server. Problem under investigation. https://ggus.eu/index.php?mode=ticket_info&ticket_id=107267, eLog 50374.
    5)  7/28: Many/most sites auto-excluded by HC testing. Test jobs were failing with the error "Get error: Encountered an empty SURL-GUID dictionary." Rucio 
    developers reported about a problem with an authentication host. Sites eventually set back on-line after ~6-7 hours. More details in:
    6)  7/28: AGLT2 - file transfers failing with errors like "[SRM_AUTHENTICATION_FAILURE] httpg://head01.aglt2.org:8443/srm/managerv2: 
    srm://head01.aglt2.org:8443/srm/managerv2?SFN=/pnfs/aglt2.org/atlasdatadisk/rucio/step09/e9/47: SRM Authentication failed]." Bob reported that the problem 
    was due to an issue with GUMS on a server, and the issue was resolved. Successful transfers resumed, https://ggus.eu/?mode=ticket_info&ticket_id=107306 
    was closed, eLog 50400.
    7)  7/29: ADC Weekly meeting:
    Follow-ups from earlier reports:
    (i)  7/1: LUCILLE - after coming out of a maintenance downtime file transfers were failing with "Communication error on send, err: [SE][srmRm][] 
    httpg://lutse1.lunet.edu:8443/srm/v2/server: CGSI-gSOAP running on fts113.cern.ch reports Could not convert the address information to a name or address." 
    So apparently a DNS issue on either the OU or CERN end. https://ggus.eu/index.php?mode=ticket_info&ticket_id=106592 in-progress, eLog 50053.
    Update 7/28: Issue resolved. A combination of OneNet changes (OU side) and CERN firewall tweaks. ggus 106592 was closed, eLog 50414.

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week.
    1)  8/1: SLAC/WT2 - production jobs failing with "Copy command failed:Last server error 3011 ('No servers are available to stage the file.')." Update from 
    Wei (8/4): WT2 doesn’t have this file. There was also a network problem at BNL so FAX couldn’t rescue this job. The BNL network problem has been fixed. 
    Errors stopped, so https://ggus.eu/index.php?mode=ticket_info&ticket_id=107425 was closed. eLog 50510.
    2)  8/1: MWT2 - file transfers failing with the error "[TRANSFER globus_ftp_client: the server responded with an error 530 Login failed: Request to 
    [>gPlazma@local] timed out.]." Admins reported that the GUMS server went down due to kernel panic on the hypervisor host. Issue resolved, service restored. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=107428 was closed, eLog 50502.
    3)  8/5: ADC Weekly meeting:
    4)  8/6 early a.m. - SWT2_CPB - file transfers were failing with errors like " httpg://gk03.atlas-swt2.org:8443/srm/v2/server: CGSI-gSOAP running on 
    fts03.usatlas.bnl.gov reports Error reading token data header: Connection closed]," among others. The RAID card hosting the system drives in the xrootd 
    redirector machine failed. The service was moved to a different machine, and the storage system came back on-line. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=107511 was closed, eLog 50509.
    Follow-ups from earlier reports:
    (i)  7/27 early a.m.: MWT2 - file transfers failing with errors like "Communication error on send, err: [SE][ReleaseFiles] 
    httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2: CGSI-gSOAP running on fts116.cern.ch reports Error reading token data header: Connection reset by peer." 
    Sarah reported about a recurring problem over the past month with high load events on the site SRM server. Problem under investigation. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=107267, eLog 50374.
    Update 8/3: For several days transfer efficiency has been good (~99%), so the issue with high loads on the SRM has improved. ggus 107267 was closed, eLog 50474.

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (7/23/14)
    • Overall US is deleting at 20 Hz. (Guess). Expect backlog to be completed shortly.
    • There were a couple of issues with BNL
    • Fred - so this is about 1.7M files per day (20 Hz × 86,400 s)? How fast is dark data being deleted? How many files do we have to delete in the US?
    • Armen - there should be no dark data being generated, only "temporarily"; the 3M dark files are due to Panda pilot issues (see Tomas' email).
    • Sarah: working with Tomas to remove files that are in the Rucio catalog but not in a dataset. Not sure if the file sizes are correct. MWT2, AGLT2, and BNL definitely have these. (A consistency-check sketch follows below.)
    • Fred: how are you confirming whether new dark data is being created? Armen: doesn't see anything in the bourricot plots. Suggests looking at Rucio dumps.
  • meeting (8/6/14)
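
On the dark-data discussion above: a minimal sketch (under assumed, hypothetical dump formats) of the kind of site-versus-Rucio consistency check being described - compare a dump of what is actually on storage with a dump of what Rucio expects, and report the differences. The production checks use the official Rucio dumps; the file names here are placeholders.

    # Sketch: "dark" files are on storage but unknown to Rucio; "lost" files are
    # expected by Rucio but missing from storage. Inputs are hypothetical
    # one-path-per-line dumps.
    def load_paths(fname):
        with open(fname) as f:
            return set(line.strip() for line in f if line.strip())

    storage = load_paths('storage_dump.txt')   # from walking the SE namespace
    rucio = load_paths('rucio_dump.txt')       # from the Rucio replica dump

    dark = storage - rucio     # candidates for cleanup (dark data)
    lost = rucio - storage     # declared replicas that are missing

    print('dark files: %d' % len(dark))
    print('lost files: %d' % len(lost))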

DDM Operations (Hiro)

meeting (7/9/14)
  • There is dark data at NET2 to be cleaned up.
  • Empty directories can be cleaned up.
  • At Tier1, found many datasets from non-Rucio directories that need to be moved.
  • Deletion service - the claim is that it is solved, but the rate is still low. Urgent cleanup tasks are done for some sites. Overall 5-6 Hz. Started a month ago - mostly USERDISK, 0.5M datasets. There is no plan to deal with this; we need a statement from ADC management.
  • Open question: how much storage is dark or awaiting deletion?
meeting (7/23/14)
  • No Hiro.
  • Mark: last night most sites were auto-excluded by HC. Two named errors that were DDM-related (could not update datasets, etc.). No elogs or email. Perhaps an Oracle update affected site services.
  • Wei: a very large number of jobs are in the transferring state. Mark: experts are aware. Saul: notes these are waiting for Taiwan. Mark will follow up.
meeting (8/6/14)

Condor CE validation (Saul, Wei, Bob)

meeting (7/9/14)
  • At BNL, have an instance of HTCondor CE working on a production instance, in parallel with the GRAM CEs. Xin is working on the Gratia accounting information; otherwise operational.
  • AGLT2: working on a test gatekeeper, but the update to OSG 3.2.12 broke the RSV probes (GRAM auth failed). Might be a configuration issue.
  • NET2: identified the hardware to do this. Will put OSG on top of that. #1 priority.
meeting (7/23/14)
  • Bob: has it running at AGLT2. Discovering differences between the requirements and the documentation. Working on integrating the existing MCORE configuration and routing to the queues; an emergency OSG release had fixes for this. Providing feedback to Tim. Using the Job Router looks interesting. Running on a test gatekeeper. There are still problems; probably not quite ready for primetime.
  • NET2: installed and working. Augustine testing, in touch with Tim and Brian.
  • SLAC: have been working with OSG. Gave them an account - Brian is finding problems and configuration issues, e.g. a conflict with logfile locations. Also have some other issues.
meeting (8/6/14)
  • See Bob's instructions, experience at https://www.aglt2.org/wiki/bin/view/AGLT2/CondorCE
  • Saul: Augustine working with Tim Cartwright and Brian
  • Wei: working with Brian to get it working with LSF. Got a job submitted successfully, but with memory allocation errors.
  • Overall conclusion is that the software is just not there yet.
  • Xin - working on Condor CE at BNL

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
meeting (7/23/14)
  • Working with McGill University - a strange problem with GPFS setup and permissions. Unresolved.
  • Victoria also experiencing problems - looking at it today.
  • A number of UK sites were down; fixed.
  • Changed the WAN traffic limit for overflow to 10g for all sites. Parameters for the overflow algorithm will be used (25 MB/s limit from the cost matrix); see the sketch after this section. Will soon include the Canadian cloud in the overflow.
  • Fred's reported failures at AGLT2, BNL seemed to be one-offs.
  • New reports to ROOT team for checking error codes
  • New version of Xrootd on a test machine at SLAC, for client compatibility testing.
  • Version 4.0.1 was incompatible with the old client; the new version is 4.0.2 (there are other reasons for going to the new version as well).
  • Fred: seg faults? Ilija submitted a bug report.
meeting (8/6/14)
  • Quite stable - sites are mostly in green.
  • Problems at SLAC and at MWT2.
  • New version of Xrootd (4.0.3) at SLAC and MWT2, which allows remote debugging.
  • Tests of bandwidth to various places. MWT2 and AGLT2 - up to 4 GB/s.
  • Submitting jobs via Jedi in overflow mode to all US sites and TRIUMF. Tracking rates and success percentages.
  • Monitoring in overflow developments: added into the ADC dashboard: http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
  • 4.0.3 at SLAC on RHEL 5 and 6 as a proxy: crashed.
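
On the overflow parameters mentioned in the 7/23 notes: a minimal illustrative sketch of the rule as described there (only overflow a job when the cost matrix predicts at least 25 MB/s between source and destination). The site pairs and rates below are made up, and the real decision is made inside the PanDA/JEDI brokerage, not by code like this.

    # Illustrative sketch of the stated overflow rule: allow remote (FAX) reads
    # only if the cost matrix predicts >= 25 MB/s for the site pair.
    OVERFLOW_LIMIT_MBS = 25.0

    cost_matrix = {                      # hypothetical measured MB/s per site pair
        ('MWT2', 'AGLT2'): 110.0,
        ('MWT2', 'SLAC'): 18.0,
    }

    def allow_overflow(source, destination):
        rate = cost_matrix.get((source, destination), 0.0)
        return rate >= OVERFLOW_LIMIT_MBS

    print(allow_overflow('MWT2', 'AGLT2'))   # True  -> eligible for overflow
    print(allow_overflow('MWT2', 'SLAC'))    # False -> read locally instead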

ATLAS Connect (Dave, Rob)

meeting (7/23/2014)
  • On Midway - have deployed CVMFS on the cluster. Running production.
  • Illinois campus cluster: need one change of PBS queue.
  • Stampede - CVMFS remains the issue. The admins are very cautious and want to rsync the software instead. Request to avoid the hard-coded path /cvmfs. There is a discussion on this in ATLAS regarding relocatability (see the sketch after this list).
meeting (8/6/2014)
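
On the relocatability request above, a minimal sketch of the idea: resolve the software base from an environment variable instead of hard-coding /cvmfs, so a site such as Stampede can point jobs at an rsync'ed copy. The variable name ATLAS_SW_BASE is hypothetical, used only for illustration.

    # Sketch: let the site override the software base directory rather than
    # assuming the /cvmfs mount point. ATLAS_SW_BASE is a hypothetical name.
    import os

    sw_base = os.environ.get('ATLAS_SW_BASE', '/cvmfs')          # default: real CVMFS mount
    repo = os.path.join(sw_base, 'atlas.cern.ch', 'repo', 'sw')  # assumed ATLAS repo layout

    if not os.path.isdir(repo):
        raise RuntimeError('ATLAS software area not found at %s' % repo)
    print('using ATLAS software from %s' % repo)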

Site news and issues (all sites)

  • SWT2 (UTA):
    • last meeting(s): Throughput and storage at CPB - investigating an issue. Have cleaned up the dark data and empty directories. Will look into supporting the OSG VO; shorter jobs would be easier to accommodate. Network upgrade - should have all the equipment needed; next two weeks. The Dutch CA issue was tracked down to address resolution on the CRL server. Network upgrade within the next week.
    • this meeting: Downtime next Monday for the big network changeover. The Xrootd redirector RAID controller died; scrambled to move the service to a new machine.

  • SWT2 (OU, OSCER):
    • last meeting(s): CVMFS upgraded to 2.1.19 on OCHEP cluster. OSCER still not upgraded.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): CVMFS already upgraded. Reverse DNS still not resolved; seems to be flipping back and forth.
    • this meeting:

  • WT2:
    • last meeting(s): Short outage August 7 for internal network upgrades.
    • this meeting: Downtime tomorrow for the internal network upgrade: 10 am - 4 pm.

  • T1:
    • last meeting(s): Preparing for a CPU upgrade, procurement underway. New storage replacements are being phased in. Older storage is being converted to an object store using Ceph. Interested in getting experience with other interfaces, such as S3; useful for the event service. Still working on completing the WAN setup. Have 100g into the lab; moving the 10g circuit infrastructure to 100g circuit by circuit (LHCONE, e.g., has a 40g cap). On CVMFS - moving to the new client, 2.1.19.
    • this meeting:

  • AGLT2:
    • last meeting(s): Planning to do load testing tonight with Hiro. Webdav door being set up. Ilija - can we do stress testing between UC and UM? (Need to make sure UM servers are used.)
    • this meeting: all is well. Webdav door set up successfully, though not sure if it's being used. 1.5 GB/s across the two sites.

  • NET2:
    • last meeting(s): Upgraded CVMFS to new version at HU. New FAX node, and Condor CE testing. Storage and worker node purchase. LHCONE: there is a definite plan for a switch at Manlan to be used for LHCONE.
    • this week: Downtime coming up, the MGHPCC annual downtime day. Recent issue with the LSM for files at a remote site. Migrated files in LGD. Autopyfactory issue: once a week, a huge number of authz requests come from the factory, which jams the gatekeeper; and contact is lost with one of the sites.

  • MWT2:
    • last meeting(s): Working on upgrades of dCache services. Cleaning up dead nodes. Cleaning up dark data.
    • this meeting: Had problems with storage servers rebooting - investigating the load incidents. Closer to being consistent in CCC - 600k files are now dark, down from 5.7M files. Illinois - 14 nodes on order were damaged; getting a priority shipment. The perfSONAR box at Illinois is strangely capping at 6 Gbps - related to the Myricom driver. UC - delivery and installation of two 32x10g Juniper line cards; testing with a storage server at 20g LACP this week.


last meeting
  • Alden: DAST shifters needed for North American timezone.
this meeting

-- RobertGardner - 05 Aug 2014
