MinutesMar052014 - 05 Mar 2014



Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees: Bob, Dave, Rob, John Hover, Michael, Torre, Shawn, Wei, Xin, Saul, Ilija, Horst, Kaushik, Alden, Mayu, Sarah
  • Apologies: Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • MCORE status
      • Two presentations today on FAX and ATLAS Connect (see Indico presentations)
      • US Cloud support issues
        • Need a "dispatcher" to triage problems. Would like that person to come from Production coordination. Possible candidates - Mark, Armen, Myuko.
        • Mark will discuss it and come up with a plan.
    • this week
      • US ATLAS Facilities Meeting at SLAC, https://indico.cern.ch/event/303799/
      • Rucio client testing (a hedged client-check sketch follows this list)
      • FY14 resources, as pledged, are to be provisioned by next month. We should make sure the capacities as published are actually provisioned. Action item for Rob: summarize the information.
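As part of the Rucio client testing noted above, a minimal read-only sanity check along the following lines could be run at a site. This is a hedged sketch, not a prescribed procedure: it assumes the standard Rucio Python client is installed and configured with a valid account, and the scope name is an illustrative placeholder.

    # A hedged sanity-check sketch for the Rucio client testing noted above.
    # Assumes the Rucio Python client is installed and configured (rucio.cfg,
    # valid account/proxy); the scope below is an illustrative placeholder.
    from itertools import islice

    from rucio.client import Client

    client = Client()

    # Confirm the client can authenticate against the Rucio server.
    print("Authenticated as:", client.whoami().get("account"))

    # Read-only functional test: list a few dataset DIDs from a scope.
    for did in islice(client.list_dids("user.someuser", {"name": "*"}, "dataset"), 5):
        print(did)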

Facility IT-wide Infrastructure

  • US ATLAS virtualization: site level, and across sites.
  • Guest presentation from John Hover on "Facility-wide Flexible IT Infrastructure", see today's Indico agenda: https://indico.cern.ch/event/297077/
  • Notes will be provided afterwards.
  • Current state of affairs with OpenStack: CERN (two sites) and the HLT farm (internal network) run it, but networking is an issue - custom network driver, cannot reach large scales; the problems are all about IPs, networks, and NAT.
  • You lose security if you give up internal networking - tenant isolation.
  • Current scale is 300-400 VMs; the limiting factor seems to be the networking component (CERN and HLT are not using OpenStack networking).
  • What is the path to being able to take advantage of what these tools offer?
  • Multi-host mode (run a router on each compute node). Icehouse is the next release; the issue might be solved there.
  • Could be used for Tier3 provisioning.

MCORE status

last meeting(s):
  • MCORE status across all clouds is below. US: 488 out of 2078 running jobs (about 4000 cores). BNL, AGLT2, MWT2, SLAC have MCORE queues up and running. Issues with MCORE at OU - Horst will summarize.
  • OU: had a problem with the AGIS configuration - fixed.
  • UTA: got set up with a queue. Received some test jobs, but there was an issue getting a sufficient number of pilots, and then the jobs failed. Close - should be fixed in a couple of days; the main problem is pilots.
  • NET2: queues are set up dynamically. The only problem is the particular release the MCORE jobs want. Working on troubleshooting the validation jobs; release 17.7.3 seems to be not found (same issue as Horst's).

this meeting:

  • AGLT2: working on configuration; 50% MCORE jobs running.
  • BNL: adjusted the share to 75% of production resources.
  • WT2: MCORE jobs make scheduling MPI jobs easier.
  • What are the relative priorities, serial vs. MCORE? Michael: hard for us to tell what the relative priorities are with regard to the overall workload; it is very manual.
  • Kaushik: JEDI will automate this. Ideas on dynamic queues.
  • MWT2: 200 slots.
  • NET2: 200 slots configured at BU, and also at HU. Segfault on a particular ...
  • SWT2: two queues; looks like things are working.
  • SWT2-OU: set up.

Cleaning up Schedconfig attributes (Horst)

last meeting:
  • Bitten by a leftover appdir setting, which should not be set - it was re-pointing jobs to the wrong location. There are other old config parameters still set, e.g. "JDL".
  • Can we do a cleanup? Alden: need to know what affects Tadashi's or Paul's infrastructure.
  • Bob: will look for an email that describes the settings needed.
  • Horst will set something up.

this meeting:

  • Last week drew up a set of variables which can be deleted, e.g. app_dir.
  • Will send an email to the US cloud support list.

Managing LOCALGROUPDISK at all sites (Kaushik and/or Armen)

last meeting(s):
  • There has been discussion about tools; the policy has not yet been sent to the RAC.
  • Any update on process?
  • Had first meeting discussing technical requirements and features. Will meet in a week.
  • Will present list and schedule in two weeks.
this meeting
  • See attached document.
  • Kaushik: it's a large software development effort. Can we see the plan and a timeline for implementation?
  • Present preliminary design at SLAC meeting? Not a trivial task!

Reports on program-funded network upgrade activities


last meeting (2/05/14)
  • Will have an outage this Friday.
  • Still waiting on 40g wave to show up. Will use 10g links until then. Will cut over to Juniper when 40g is available.
  • MSU - orders are going out.
this meeting (2/19/14, 3/05/14)
  • Don't have all the parts. Waiting on 40g blade for UM.
  • MSU - also waiting on parts.
  • April timeframe


last meeting (2/05/14)
  • At UC, additional 6x10g links are being connected; some technical fiber connection issues remain. Expect full 100g connectivity by the end of the week.
  • At Illinois - getting fiber from cluster to campus core.
this meeting (2/19/14)
  • 8x10g connection at UIUC today.
  • 8x10g in place at UC and IU.
this meeting (3/05/14)
  • Network upgrades complete. Working on WAN studies.


last meeting (2/05/14)
  • Started receiving equipment for the first round. Funds are available for the second increment; expect it six weeks out.
this meeting (2/19/14)
  • Half of the equipment has arrived; expect the purchase for the second increment shortly. Will need to take a significant downtime.
  • Timeframe? Depends on procurement; delivery in a couple of weeks.
this meeting (3/05/14)
  • Final quotes are in procurement for the second half.
  • Money still hasn't arrived. Completion date unsure.
  • Delayed.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting (2/05/14)
  • AGLT2: 17 R620s in production; 2 are being used for testing.
  • MWT2: 45/48 in production
  • NET2: 42 new nodes in production
  • SWT2: coming shortly (next week or two). 35-45 compute nodes.
  • WT2: receiving machines; 8 of 60 installed, the remainder in a week or two. These are shared resources - what fraction is owned by ATLAS? About 30%?
  • Tier1: Acquiring samples of latest models for evaluation. Received a 12-core machine. Asked 3 vendors (HP, Dell, Penguin) for samples.
this meeting (3/05/14)
  • Updates?
  • SWT2: order submitted - 40 R620 nodes. Hope to deploy them along with the network upgrade.
  • SLAC: all in production use. Will update spreadsheet and OIM.
  • Tier1: Will be getting two Arista 7500s, providing an aggregation layer. Awaiting the additional storage systems, in about 4 weeks.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s): see notes from MinutesJan222014; meeting (2/05/14):
  • AGLT2: with MWT2, requesting that ESnet provide an LHCONE VRF at CIC OmniPoP.
  • MWT2: Requesting ESnet to provision direct 100 Gbps circuit between BNL and UC.
  • NET2: Slow progress at MGHPCC. Needs to get BU on board.
  • SWT2: Try to export a different subnet with a couple of machines as a test. If it works, will change over the Tier 2 network. UTA campus network people have coordinated with LEARN; just need to coordinate with I2.
  • WT2: Already on LHCONE
this meeting (2/19/14)
  • NET2: making progress, had meeting with networking group, having phone meetings now. NOX and MGHPCC. Link at MANLAN. Will get together next week.
this meeting (3/05/14)
  • Major NRENs are bringing up perfSONAR test points within the LHCONE infrastructure.
  • AGLT2 - was disrupted, re-enabled.
  • NET2: nothing new
  • SWT2: Patrick needs a contact at I2. Had a question: do we have any changes? Is I2 providing the VRF? Dale Finkelson? Will re-send, with a cc to Shawn. May need another contact.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Lots of back pressure in the production system, many tasks.
    • MCORE usage is not as much as expected; there are still some problems. The plan was to bring MCORE up to 50% (ADC Coordination). Michael: there is a significant backlog in the queue; we want all the sites to have these queues available.
  • this meeting:
    • A large number of jobs requiring data from tape caused a drain of the US cloud. Having problems keeping sites full. Will recommend what each site should do - change the fair-share policy in AGIS?

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    No meeting due to software week at CERN
    1)  Not US-specific, but: announcement about moving the first site from LFC to the Rucio File Catalog (FR/LAPP). Details:
    2)  2/24: SLAC - file transfers were failing with the error "could not open connection to osgserv04.slac.stanford.edu." No ticket needed, 
    as the issue was quickly resolved by Wei.
    3)  2/25: ADC Weekly meeting:
    No meeting due to software week at CERN
    Follow-ups from earlier reports:
    (i)  2/6: WISC - DDM deletion errors. https://ggus.eu/ws/ticket_info.php?ticket=101020 - eLog 48071.
    Update 2/24: Site reported they are now in a maintenance downtime for system upgrades and reorganizing the computer room (through 2/28). 
    ggus 101020 was closed - eLog 48182.
    (ii)  2/10: Lucille_CE - auto-excluded in panda and DDM. Site is experiencing a power problem and declared an unscheduled downtime.
  • this week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    Not available this week
    1)  2/26: AGLT2 - jobs failing heavily on three WNs. They were removed from production - issue resolved.
    2)  2/26: File transfers via the FTS tool from BNL to French and Italian T2s were failing for large files (> 1 GB). Tickets for the issue:
    ESNET-20140224-001; https://savannah.cern.ch/bugs/?103990
    Issue traced to a problem in the LHCONE connection in STARLIGHT between ESnet and GEANT - resolved the next day. Tickets were closed.
    3)  3/3: FTS service at RAL down, which now affects all clouds following the migration of the service to a central location. Site admins reported there 
    was a problem with the virtualization infrastructure at RAL. Service was restored the next morning. Various issues under investigation. eLog 48273 
    plus various e-mail threads.
    4) 3/3 p.m.: SWT2_CPB - file transfers were failing with errors like "Communication error on send httpg://gk03.atlas-swt2.org:8443/srm/v2/server: 
    CGSI-gSOAP running on fts103.cern.ch reports Error reading token data header: Connection closed." xrootd had died on a storage server. 
    Restarting the process fixed the problem. https://ggus.eu/index.php?mode=ticket_info&ticket_id=101786 was closed, eLog 48293. (A hedged xrootd port-liveness sketch follows this report.)
    5)  3/4: ADC Weekly meeting:
    Follow-ups from earlier reports:
    (i)  2/10: Lucille_CE - auto-excluded in panda and DDM. Site is experiencing a power problem and declared an unscheduled downtime.
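Related to item 4) above (xrootd dying on an SWT2_CPB storage server): a simple TCP liveness probe along these lines can catch a dead xrootd daemon before transfers start failing. This is a hedged sketch, not the procedure actually used by the shifters; the hostname is a placeholder and 1094 is the conventional xrootd port.

    # Hedged sketch: a quick TCP liveness probe for an xrootd data server.
    # The hostname is a placeholder; 1094 is the conventional xrootd port.
    import socket

    def xrootd_port_open(host, port=1094, timeout=5):
        """Return True if the xrootd service port accepts TCP connections."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host = "storage01.example.org"  # placeholder storage-server name
        state = "up" if xrootd_port_open(host) else "DOWN - may need a restart"
        print("xrootd on %s: %s" % (host, state))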

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • NTUP_COMMON distribution
    • Need to do some cleanup at MW and WT2; the USERDISK cleanup didn't go through.
  • this meeting:
    • USERDISK cleanups? Hiro? MW still needs to do it; Armen will do it.
    • (Don't delete DDM_TEST)
  • this meeting (3/05/14):
    • Hiro: by accident, started deleting USERDISK data older than Feb 1.
    • Kaushik: suggests sending an email to DAST.
    • Armen - was deleting data from UC. There were a couple of 'jumbo' datasets (~800k files) that crashed the system.
    • Next data management meeting? Not sure.
    • Can files outside Rucio be removed?
    • CCC futures?

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

  • FAX status & reference
this meeting (2/19/14)
  • state of xrootd and plugin updates
  • added Milano T2; FR endpoints are coming back online.
  • running stress tests
this meeting (3/05/14)

Site news and issues (all sites)

  • SWT2 (UTA):
    • last meeting(s): MCORE, LHCONE. Resurrecting some old UTA_SWT2 servers. CPB issues. Rucio. Info from DAST about getting data from UTA. There was a disk issue as files were being written - bad timing.
    • this meeting:

  • SWT2 (OU, OSCER):
    • last meeting(s):
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting:

  • WT2:
    • last meeting(s): 28% of SLAC cores (10k) are ATLAS-owned. Found a Gratia bug that under-reports WT2 CPU usage for all production jobs. Found that GridFTP 6.14 (the current one in OSG RPMs) doesn't work well with the network settings suggested by ESnet (net.core.rmem, net.ipv4.tcp_rmem, ...wmem, etc.); a hedged sketch for inspecting these sysctls follows the site list. Added 56 nodes (900 cores), 47 cores to go.
    • this meeting:

  • T1:
    • last meeting(s): Problem over the weekend: the Chimera namespace manager became unresponsive (Postgres autovacuum). Have about 2.7 PB on order, finally through procurement (this is replacement storage).
    • this meeting:

  • AGLT2:
    • last meeting(s): Looking into OMD - Open Monitoring Distribution. Continuing to work with OMD - monitoring all hardware and software status.
    • this meeting: Downtime on Tuesday.

  • MWT2:
    • last meeting(s): David and Lincoln were able to get the 6248 connected to the Juniper - getting the new R620s online. Confirmation from UIUC of additional fiber inside ACB. ICC networking being upgraded, and extra disks being added to GPFS.
    • this meeting:
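Regarding the WT2 note above about GridFTP 6.14 and the ESnet-suggested kernel network settings, the following hedged sketch prints the sysctls named there on a transfer host. It is read-only and Linux-specific; the *_max parameter names are an expansion of the abbreviated "rmem"/"wmem" wording in the minutes, not a statement of what WT2 actually deployed.

    # Hedged sketch: print the TCP buffer sysctls referred to in the WT2 note
    # (ESnet-style tuning knobs). Read-only and Linux-specific; the *_max names
    # expand the abbreviated "rmem"/"wmem" wording in the minutes.
    import pathlib

    SYSCTLS = [
        "net/core/rmem_max",
        "net/core/wmem_max",
        "net/ipv4/tcp_rmem",
        "net/ipv4/tcp_wmem",
    ]

    for key in SYSCTLS:
        value = pathlib.Path("/proc/sys", key).read_text().strip()
        print("%s = %s" % (key.replace("/", "."), value))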


last meeting
  • Wei: the new WLCG availability matrix is being calculated; see the email sent to usatlas-t2-l. Awaiting clarification from Alessandro about including grid3-locations.txt in the availability test.
this meeting

-- RobertGardner - 04 Mar 2014



Attachment: USATLAS_LOCALGROUPDISK-Management_SoftwareInfrastructure_v1.0.pdf (68.1K) - ArmenVartapetian, 05 Mar 2014