r6 - 19 Feb 2014 - 15:01:04 - RobertGardner



Minutes of the Facilities Integration Program meeting, February 19, 2014


  • Meeting attendees: Michael, Alden, Armen, David, Fred, Horst, Ilija, Bob, Rob, Myuko, Sarah, Saul, Shawn, Torre, Wei, Hiro
  • Apologies: Patrick
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Final updates to SiteCertificationP27 and v29 of the Google Docs spreadsheet in CapacitySummary due for the quarterly report.
      • Progress on multi-core jobs a concern.
        • SWT2 - been busy
        • BU - MCORE now up; still a problem with validation with HU
        • There is a backlog now.
        • Good to validate. Condor dynamic partitioning now working at AGLT2, with a cron job that adjusts slots based on high/low job levels.
        • BNL is running two multicore queues - one for MCORE jobs and one for high-memory jobs.
    • this week
      • MCORE status
      • Two presentations today on FAX and ATLAS Connect (see Indico presentations)
      • US Cloud support issues
        • Need a "dispatcher" to triage problems. Would like that person to come from Production coordination. Possible candidates - Mark, Armen, Myuko.
        • Mark will discuss it and come up with a plan.
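The AGLT2 dynamic-partitioning setup noted above can be sketched as an HTCondor configuration fragment. This is a minimal illustration with assumed values, not AGLT2's actual configuration (the cron-based adjustment is not shown):

```
# Hypothetical condor_config fragment: one partitionable slot per worker
# node, so 8-core MCORE jobs and single-core jobs draw from the same pool.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%,mem=100%,disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Dynamic slots are then carved off the partitionable slot on demand, which is why an external cron (as at AGLT2) may still be needed to steer the balance between single-core and multi-core jobs.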

MCORE status

  • MCORE status across all clouds is below. US: 488 out of 2078 running jobs (about 4000 cores). BNL, AGLT2, MWT2, SLAC have MCORE queues up and running. Issues with MCORE at OU - Horst will summarize.
  • OU: had a problem with the AGIS configuration - fixed.
  • UTA: got set up with a queue. Got some test jobs - an issue with getting a sufficient number of pilots. Then jobs failed. Close - should be fixed in a couple of days. Main problem is pilots.
  • NET2: queues are set up dynamically. Only problem is the particular release MCORE jobs want. Working on troubleshooting the validation jobs. Release 17.7.3 seems not to be found. (Same issue as Horst's.)

Cleaning up Schedconfig attributes (Horst)

  • Bitten by a leftover appdir setting, which should not be set - jobs were getting re-pointed to the wrong location. There are some old config parameters still set, e.g. "JDL".
  • Can we do a clean-up? Alden: need to know what affects Tadashi's or Paul's infrastructure.
  • Bob: will look for an email that describes the settings needed.
  • Horst will set something up.
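A cleanup pass like the one proposed here could start from a dump of each queue's schedconfig parameters. A minimal sketch follows; the attribute names in the stale list are taken from the discussion above ("appdir", "JDL") but the overall list and the dump format are assumptions, not the agreed settings:

```python
# Minimal sketch: flag schedconfig attributes that are known to be obsolete
# but still carry a value.  The stale list here is illustrative only.
STALE_ATTRIBUTES = {"appdir", "jdl"}

def find_stale(queue_config):
    """Return the obsolete attributes that still have a live value."""
    return sorted(
        name for name, value in queue_config.items()
        if name in STALE_ATTRIBUTES and value not in (None, "", "None")
    )

# Hypothetical queue dump: appdir is set, jdl is empty, maxtime is fine.
example = {"appdir": "/osg/app", "jdl": "", "maxtime": 5400}
print(find_stale(example))  # only 'appdir' still has a live value
```

Running such a report per queue would give Tadashi and Paul a concrete list to review before anything is actually removed.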

Managing LOCALGROUPDISK at all sites (Kaushik and/or Armen)

  • There has been discussion about tools; the policy has not yet been sent to the RAC
this meeting
  • Any update on process?
  • Had first meeting discussing technical requirements and features. Will meet in a week.
  • Will present list and schedule in two weeks.

Reports on program-funded network upgrade activities


last meeting(s) meeting (2/05/14)
  • Will have an outage this Friday.
  • Still waiting on 40g wave to show up. Will use 10g links until then. Will cut over to Juniper when 40g is available.
  • MSU - orders are going out.
this meeting, (2/19/14)


last meeting(s) meeting (2/05/14)
  • At UC, additional 6x10g links are being connected. Some technical fiber connection issues. Expect full 100g connectivity by the end of the week.
  • At Illinois - getting fiber from cluster to campus core.
this meeting, (2/19/14)
  • 8x10g connection at UIUC today
  • 8x10g in place at UC, and IU


last meeting(s) meeting, (2/05/14)
  • Started receiving equipment for the first round. Funds available for the second increment; expect it six weeks out.
this meeting, (2/19/14)
  • Half of the equipment has arrived - expect the purchase for the second increment shortly. Will need to take a significant downtime.
  • Timeframe? Depends on procurement. Delivery a couple of weeks.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting(s) meeting, (2/05/14)
  • AGLT2: 17 R620s in production; 2 are being used for testing.
  • MWT2: 45/48 in production
  • NET2: 42 new nodes in production
  • SWT2: coming shortly (next week or two). 35-45 compute nodes.
  • WT2: receiving machines; 8/60 installed. The remainder in a week or two. These are shared resources, what fraction is owned by ATLAS? 30%?
  • Tier1: Acquiring samples of latest models for evaluation. Received a 12-core machine. Asked 3 vendors (HP, Dell, Penguin) for samples.
this meeting, (2/19/14)
  • Updates?

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s) See notes from MinutesJan222014 meeting, (2/05/14)
  • AGLT2: With MWT2, requesting ESNet provide LHCONE VRF at CIC OmniPoP
  • MWT2: Requesting ESnet to provision direct 100 Gbps circuit between BNL and UC.
  • NET2: Slow progress at MGHPCC. Needs to get BU on board.
  • SWT2: Try exporting a different subnet with a couple of machines as a test. If it works, will change over the Tier 2 network. UTA campus network people have coordinated with LEARN. Just need to coordinate with I2.
  • WT2: Already on LHCONE
this meeting, (2/19/14)
  • NET2: making progress - had a meeting with the networking group; having phone meetings now with NOX and MGHPCC. Link at MANLAN. Will get together next week.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Lots of back pressure in the production system, many tasks.
    • MCORE not as much as expected; there are still some problems. The plan was to have MCORE up to 50% (ADC Coordination). Michael: there is a significant backlog in the queue. We want all the sites to have these available.
  • this meeting:

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    Not available this week
    1)  2/6: WISC - DDM deletion errors. https://ggus.eu/ws/ticket_info.php?ticket=101020 - eLog 47979.
    2)  2/8: SLACXRD - file transfer failures with the error "has trouble with canonical path - cannot access it." https://ggus.eu/ws/ticket_info.php?ticket=101073. 
    All hardware issues resolved as of 2/11 - ggus 101073 was closed - eLog 48013.
    3)  2/10: Lucille_CE - auto-excluded in panda and DDM. Site is experiencing a power problem and declared an unscheduled downtime.
    4)  2/11: ADC Weekly meeting:
    Follow-ups from earlier reports:
    (i)  2/5 early a.m. BNL - https://ggus.eu/ws/ticket_info.php?ticket=100984 was opened for file transfer failures with the error "First non-zero marker not received 
    within 180 seconds." Issue understood - from Hiro: This is due to the test gridftp doors we had for 100Gb/s test last week. The special circuit must have been 
    removed for these gridftp doors, causing traffic to fail. We disabled these gridftp doors - the failures should disappear. ggus ticket in-progress, no eLog ticket created.
    Update 2/12: Issue resolved - ggus 100984 was closed.
  • this week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    1)  2/13: HU_ATLAS_Tier2 - production jobs failing with errors related to accessing conditions data ("CORAL/Services/ConnectionService Warning Failure while 
    attempting to connect to 'sqlite_file:geomDB/geomDB_sqlite'"). Issue due to pilots not sourcing $OSG_APP/atlas_app/local/setup.sh, and hence not having a value 
    for FRONTIER_SERVER. Problem started following the move to a new gatekeeper. Resolved - https://ggus.eu/ws/ticket_info.php?ticket=101194 was closed, 
    eLog 48055.
    2)  2/13: MWT2 - file transfers failing with "GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 530 Login failed: TTL exceeded]." Issue 
    understood and resolved - from Sarah: We had a few problems - a dCache test node accidentally added to the pool of production doors, and our GUMS server had 
    gone unresponsive. https://ggus.eu/ws/ticket_info.php?ticket=101204 was closed, eLog 48043.
    3)  2/13: New pilot release from Paul (v58i). Details here:
    4)  2/15: BNL - destination file transfer failures with SRM errors, "requests failed in some way or another," etc. The problem was caused by the unresponsiveness of 
    the SE's (dCache) namespace manager (Chimera). The namespace management service was operational, but its database queries interfere with auto-vacuum 
    operations running at a very high level of I/O initiated by the postgres database. On 2/16 dCache was shut down
    to address the namespace db issue. Operations restored as of 10:00 p.m. EST that evening. https://ggus.eu/ws/ticket_info.php?ticket=101275 was closed, 
    eLog 48099, 48100.
    5)  2/18: ADC Weekly meeting:
    6)  2/19: New pilot release from Paul (v58j). Details here:
    Follow-ups from earlier reports:
    (i)  2/6: WISC - DDM deletion errors. https://ggus.eu/ws/ticket_info.php?ticket=101020 - eLog 48071.
    (ii)  2/10: Lucille_CE - auto-excluded in panda and DDM. Site is experiencing a power problem and declared an unscheduled downtime.
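The HU failure in item 1 of this week's summary comes down to a missing environment variable: without sourcing the site-local setup script, FRONTIER_SERVER is never set and conditions-data access fails. A job wrapper could fail fast on that condition; the sketch below is an assumption for illustration, not part of the actual pilot code:

```python
import os

# Minimal sketch: verify the site-local setup was sourced before the job
# tries to reach conditions data through Frontier (mirrors the HU incident).
def frontier_configured(env=None):
    """Return True if FRONTIER_SERVER is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("FRONTIER_SERVER"))

# Without sourcing $OSG_APP/atlas_app/local/setup.sh the check fails;
# the server URL below is a placeholder, not a real Frontier endpoint.
print(frontier_configured({}))
print(frontier_configured({"FRONTIER_SERVER": "(serverurl=http://squid.example.org:3128)"}))
```

Failing at job start with a clear message is much cheaper to triage than the downstream CORAL connection warnings seen in the ticket.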

Data Management and Storage Validation (Armen)

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference this meeting (2/19/14)
  • state of xrootd and plugin updates
  • added Milano T2; FR endpoints are coming back online.
  • running stress tests

Site news and issues (all sites)

  • WT2:
    • last meeting(s): 28% of SLAC cores (10k) are ATLAS-owned. Found a Gratia bug that under-reports WT2 CPU usage for all production jobs. Found that GridFTP 6.14 (the current one in OSG RPMs) doesn't work well with network settings suggested by ESnet (net.core.rmem, net.ipv4.tcp_rmem, ...wmem, etc.)
    • this meeting: Added 56 nodes (900 cores), 47 cores to go.
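The ESnet-suggested kernel settings mentioned above are typically applied via sysctl. The fragment below uses illustrative values of the kind ESnet's host-tuning guidance recommends for 10G hosts, not WT2's actual settings:

```
# /etc/sysctl.conf fragment (illustrative values only)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
```

If a GridFTP version misbehaves with large buffers, one mitigation while the bug is chased is to cap the max values rather than revert the tuning entirely.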

  • T1:
    • last meeting(s): Buying > 2PB of storage, replacing old; progressing. Replacing F10 equipment. Moving towards a high-capacity interlink fabric: 3 Arista 7500s (100g trunks between cores). Space Manager discovered problems with 2.6. (What settings did Hiro use to fix the problem?) Happens with a high file-transfer completion rate. Avoids growth.
    • this meeting: Problem over the weekend: the Chimera namespace manager became unresponsive (postgres autovacuum). Have about 2.7 PB on order, finally through procurement (this is replacement storage). Also will be
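The Chimera incident above can be seen from the postgres side using the standard statistics views. This query is a generic diagnostic sketch, not a dCache-specific procedure:

```sql
-- Check when autovacuum last ran and how much dead-tuple backlog exists
-- (standard postgres views; run against the Chimera database).
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A large dead-tuple count with a stale last_autovacuum timestamp is the signature of autovacuum falling behind or being starved by competing query I/O, as described in the shift report.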

  • AGLT2:
    • last meeting(s): Looking into OMD - Open Monitoring Distribution
    • this meeting: Continuing to work with OMD - monitoring all hardware and software status.

  • MWT2:
    • last meeting(s): David and Lincoln were able to get the 6248 connected to the Juniper - getting the new R620s online. Confirmation from UIUC of additional fiber inside ACB.
    • this meeting: ICC networking being upgraded, and extra disks being added to GPFS.

  • SWT2 (UTA):
    • last meeting(s): MCORE, LHCONE. Resurrecting some old UTA_SWT2 servers. CPB issues. Rucio.
    • this meeting: Info from DAST about getting data from UTA. There was a disk issue as files were being written - bad timing.

  • SWT2 (OU, OSCER):
    • last meeting(s):
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional and operational and active.
    • this meeting:


last meeting
  • Wei: new WLCG availability matrix is being calculated, see email sent to usatlas-t2-l. Awaiting clarification from Alessandro about including grid3-locations.txt in the availability test.
this meeting

-- RobertGardner - 19 Feb 2014

