


Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees:
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • MCORE status
      • US ATLAS Facilities Meeting at SLAC, https://indico.cern.ch/event/303799/
      • Rucio client testing
      • FY14 resources, as pledged, to be provisioned by next month. We should make sure capacities as published are provisioned. Action item on Rob: summarize info.
      • The agency meeting resulted in much discussion about computing; non-LHC panelists had general questions, for two hours. Reactions were constructive and positive, in particular regarding the Integration Program; reviewers were impressed by how we handle ongoing operations while innovating.
      • There are two important efforts going on facility-wide: the Rucio migration, and the evaluation of and migration to FTS3. Both must be in place prior to DC14.
    • this week

ATLAS Connect (Rob)

MCORE status

last meeting(s):
  • MCORE status across all clouds was reviewed. US: 488 out of 2078 running jobs are MCORE (about 4000 cores). BNL, AGLT2, MWT2, SLAC have MCORE queues up and running. Issues with MCORE at OU - Horst will summarize.
  • OU: had a problem with the AGIS configuration - fixed.
  • UTA: got set up with a queue and received some test jobs, but there was an issue getting a sufficient number of pilots, and the jobs then failed. Close - should be fixed in a couple of days. The main problem is pilots.
  • NET2: queues are set up dynamically. The only problem is the particular release the MCORE jobs want. Working on troubleshooting the validation jobs; release 17.7.3 seems to be not found. (Same as Horst.)

last meeting (3/5/14):

  • AGLT2 working on configuration. 50% MCORE jobs running.
  • BNL: adjusted share, 75% prod resources.
  • WT2 - MCORE jobs make scheduling MPI jobs easier.
  • What are the relative priorities - serial vs. MCORE? Michael: hard for us to tell what the relative priorities are with regard to the overall workload. Very manual.
  • Kaushik: JEDI will automate this. Ideas for dynamic queues.
  • MWT2 - 200 slots.
  • NET2 - 200 slots configured at BU. Also at HU. Segfault on a particular release.
  • SWT2 - two queues, looks like things are working.
  • SWT2-OU: setup.

last meeting (3/19/14):

  • NET2: working at BU; figured out the problem (affecting 12 sites worldwide). 1700 cores running at BU. All due to an AGIS setting: the maximum memory was set too low. The value is used by the validation jobs to kill themselves if they exceed it, so it was difficult to notice; it should be 32 GB. Bob: the guideline is 8x the single-core values for memory and disk (see the sketch after this list). HU - not running yet, but only because no validation jobs have run.
  • Kaushik: demand is met, for now. At UTA, using an auto-adjuster for MCORE slots; it seems to be working.
  • OU: there is an MCORE queue at OSCER. Will deploy one at Langston. Will deploy MCORE on OCHEP after the Condor upgrade. Timescale: 2 weeks.
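
As a quick check of the 8x guideline mentioned above (a sketch only; the per-core figure is simply derived from the 32 GB value in the notes, not stated independently):

    # Worked check of the "8x" guideline for 8-core MCORE queue limits.
    # The 32 GB maxmemory figure comes from the notes above; the per-core
    # value is just that number divided by the core count.
    CORES = 8
    mcore_maxmemory_gb = 32                    # AGIS maxmemory behind the NET2/12-site issue
    per_core_gb = mcore_maxmemory_gb / CORES   # implied single-core memory limit
    print(f"{mcore_maxmemory_gb} GB for {CORES} cores = {per_core_gb:.0f} GB per core "
          f"(i.e. 8x the single-core limit)")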

this meeting:

  • Updates?

Cleaning up Schedconfig attributes (Horst)

last meeting:
  • Got bitten by a leftover appdir setting, which should not be set; it was getting re-pointed to the wrong location. There are some other old config parameters still set, e.g. "JDL" (see the sketch after this list).
  • Can we do a clean-up? Alden: need to check what affects Tadashi's or Paul's infrastructure.
  • Bob: will look for an email that describes the settings needed.
  • Horst will set something up.
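
A minimal sketch of the kind of clean-up audit discussed above (illustrative only: it assumes a locally saved JSON dump of per-queue schedconfig parameters; the file name and the list of legacy keys are placeholders, not an official schema):

    # Illustrative audit: scan a JSON dump of per-queue schedconfig parameters
    # for legacy attributes (e.g. appdir, jdl) that should no longer be set.
    import json

    LEGACY_KEYS = ["appdir", "jdl"]           # attributes flagged in the discussion above

    with open("schedconfig_dump.json") as f:  # hypothetical export of queue parameters
        queues = json.load(f)                 # expected shape: {queue_name: {param: value}}

    for queue, params in sorted(queues.items()):
        stale = {k: params[k] for k in LEGACY_KEYS if params.get(k)}
        if stale:
            print(f"{queue}: leftover settings {stale}")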

previous meeting (3/5/14):

  • Last week identified a set of variables which can be deleted, e.g. app_dir.
  • Will send an email to the US cloud support list.

last meeting (3/5/14):

  • Concluded?
  • Alden: we did not iterate on this this week. Can arrange with Gancho and try again; will clear out values left over in those fields. /appdir is only set at CPB.

this meeting:

  • Concluded?

Managing LOCALGROUPDISK at all sites (Kaushik and/or Armen)

  • There has been discussion about tools; the policy has not been sent to the RAC.
  • Any update on process?
  • Had first meeting discussing technical requirements and features. Will meet in a week.
  • Will present list and schedule in two weeks.
previous meeting (3/5/14):
  • See attached document.
  • Kaushik: it's a large software development effort. Can we see the plan and timeline of implementation?
  • Present preliminary design at SLAC meeting? Not a trivial task!
this meeting:
  • Update?
  • Kaushik: Discussed at the Tier3 implementation committee meeting on Friday. Good discussion; liked by the committee. Increase the base quota from 3 to 5 TB? Requests up to 20 TB should need no significant RAC action, etc. Most users use less than 3 TB. General agreement this will be the policy (5 TB). 250 users expected; roughly 1 PB for beyond pledge (see the worked estimate after this list). Armen is checking this figure. (May have to ask for more resources.) If the Tier3 implementation committee feels this is needed, it should be stated in the recommendations.
  • Michael: currently 1.5 PB in LOCALGROUPDISK.
  • Armen: cleaning at the moment.
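
A rough back-of-the-envelope sizing for the proposed policy (a sketch only; the 5 TB quota and 250-user figures are from the discussion above, while the average-usage value is an illustrative assumption):

    # Rough LOCALGROUPDISK sizing sketch (illustrative only).
    base_quota_tb = 5       # proposed base quota per user (TB), from the discussion above
    expected_users = 250    # expected number of users, from the discussion above

    worst_case_pb = base_quota_tb * expected_users / 1000.0   # everyone at quota
    print(f"If every user filled the quota: {worst_case_pb:.2f} PB")

    # Most users reportedly use less than 3 TB, so the realistic footprint is lower
    # (the 3 TB average below is an assumption, not a measurement).
    avg_usage_tb = 3
    realistic_pb = avg_usage_tb * expected_users / 1000.0
    print(f"At ~{avg_usage_tb} TB average usage: {realistic_pb:.2f} PB")
    # Both figures are of the same order as the ~1 PB beyond-pledge allocation and
    # the ~1.5 PB currently in LOCALGROUPDISK, hence the note that more resources
    # may be needed.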

meeting (4/16/14):

Reports on program-funded network upgrade activities


AGLT2 - last meeting(s), (2/05/14):
  • Will have an outage this Friday.
  • Still waiting on 40g wave to show up. Will use 10g links until then. Will cut over to Juniper when 40g is available.
  • MSU - orders are going out.
this meeting, (2/19/14):
this meeting, (3/05/14):
  • Don't have all the parts. Waiting on 40g blade for UM.
  • MSU - also waiting on parts.
  • April timeframe
meeting (3/19/14):
  • UM assigned a network engineer to transition to the 100g wave April 1. Still waiting on the 40g blade (next week).
  • MSU: ordered, but have not received parts. Anticipate beginning of April.

meeting (4/16/14):


MWT2 - last meeting(s), (2/05/14):
  • At UC, additional 6x10g links are being connected; some technical fiber connection issues remain. Expect full 100g connectivity by the end of the week.
  • At Illinois - getting fiber from cluster to campus core.
this meeting, (2/19/14)
  • 8x10g connection at UIUC today
  • 8x10g in place at UC, and IU
this meeting, (3/05/14)
  • Network upgrades complete. Working on WAN studies.
meeting (3/19/14):
  • The ESnet to OmniPoP connection was fixed this morning. Can now do large-scale studies.
meeting (4/16/14):


SWT2 (UTA) - last meeting(s), (2/05/14):
  • Started receiving equipment for the first round. Funds are available for the second increment; expect it about six weeks out.
this meeting, (2/19/14)
  • Half of the equipment has arrived; expect the purchase for the second increment shortly. Will need to take a significant downtime.
  • Timeframe? Depends on procurement; delivery in a couple of weeks.
this meeting, (3/05/14)
  • Have final quotes in procurement, for the second half.
  • Money still hasn't arrived. Completion date unsure.
  • Delayed.
meeting (3/19/14):
  • Patrick: Got the money - in the process of quote refreshing. Will get into purchasing later this week or next week.
  • Kaushik: the second round will have S6000s and an S4810. Unfortunately the edge hardware on campus is only 10g, and we don't yet have leased lines to Dallas where I2 has 100g. Just submitted a proposal asking for Juniper 100g-capable hardware for the campus infrastructure.
meeting (4/16/14):

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting(s), (2/05/14):
  • AGLT2: 17 R620s in production; 2 are being used for testing.
  • MWT2: 45/48 in production
  • NET2: 42 new nodes in production
  • SWT2: coming shortly (next week or two). 35-45 compute nodes.
  • WT2: receiving machines; 8/60 installed. The remainder in a week or two. These are shared resources, what fraction is owned by ATLAS? 30%?
  • Tier1: Acquiring samples of latest models for evaluation. Received a 12-core machine. Asked 3 vendors (HP, Dell, Penguin) for samples.
this meeting, (3/05/14)
  • Updates?
  • SWT2: order submitted for 40 R620 nodes. Hope to deploy with the network update.
  • SLAC: all in production use. Will update spreadsheet and OIM.
  • Tier1: Will be getting two Arista 7500s, providing an aggregation layer. Awaiting the additional storage systems, in about 4 weeks.
meeting (3/19/14):
  • SWT2 - still looking at quotes, maybe 420s; aiming to maximize HEPSPEC.
meeting (4/16/14):

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s): see notes from MinutesJan222014. Meeting (2/05/14):
  • AGLT2: With MWT2, requesting ESNet provide LHCONE VRF at CIC OmniPoP
  • MWT2: Requesting ESnet to provision direct 100 Gbps circuit between BNL and UC.
  • NET2: Slow progress at MGHPCC. Needs to get BU on board.
  • SWT2: Try exporting a different subnet with a couple of machines as a test. If it works, will change over the Tier 2 network. UTA campus network people have coordinated with LEARN; just need to coordinate with I2.
  • WT2: Already on LHCONE
this meeting, (2/19/14)
  • NET2: making progress; had a meeting with the networking group and are now having phone meetings with NOX and MGHPCC. Link at MANLAN. Will get together next week.
this meeting, (3/05/14)
  • Major NRENs are bringing up perfSONAR test points within the LHCONE infrastructure.
  • AGLT2 - was disrupted, now re-enabled.
  • NET2: nothing new
  • SWT2: Patrick needs a contact at I2. Had a question: do we have any changes? Is I2 providing the VRF? Dale Finkelson? Will re-send, with a cc to Shawn. May need another contact.
meeting (3/19/14):
  • NET2 status? No news; waiting on MIT.
  • SWT2? UTA networking has done everything needed. Patrick - to send something to ESnet.
  • Getting monitoring in place for LHCONE.
meeting (4/16/14):

Operations overview: Production and Analysis (Kaushik)

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Pavol Strizenec):
    1)  Very quiet week for U.S. sites.
    2)  4/8: ADC Weekly meeting:
    Follow-ups from earlier reports:
    (i)  3/14: WISC_LOCALGROUPDISK - DDM deletion errors - eLog 48420, https://ggus.eu/?mode=ticket_info&ticket_id=102249 in-progress.
    Update 3/26: Overheating problem in the site machine room. Ticket set to 'on-hold' and a downtime was declared.
    Update 4/2: downtime extended to 4/18.
    (ii) 4/1: SWT2_CPB - file transfers failing with the error "got an error when removing the destination surl." 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=102880 in-progress, eLog 48681.
    Update 4/4: Issue was due to a problematic xfs partition that had been set to read-only prior to running a filesystem repair. Repair was completed, 
    and the partition set back to read/write. ggus 102880 was closed, eLog 48759. 

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week
    1)  4/10: HU_ATLAS_Tier2 - job failures with "lost heartbeat" errors. Issue with the gatekeeper resolved. Problem recurred on 4/12. Repairs were 
    made to the filesystem which hosts user home directories. Thanks to John & Saul for the info.
    2)  4/11: It was announced that all Savannah projects will be migrated to JIRA by the end of April.
    3)  4/12: NET2 - file transfer failures due to an SRM issue (" could not open connection to atlas.bu.edu"). 
    https://ggus.eu/?mode=ticket_info&ticket_id=103330 was closed on 4/14 after the site ran with good DDM efficiency for >24 hours. eLog 48873.
    4)  4/12-4/15: SWT2_CPB - problem with job submissions to the site following an upgrade of the OSG software on the gatekeeper. Obscure issue 
    related to the machine's hostname was eventually tracked down, and job submissions now back to normal as of late p.m. 4/15. 
    https://rt.racf.bnl.gov/rt/Ticket/Display.html?id=24323 was closed.
    5)  4/15: ADC Weekly meeting:
    Follow-ups from earlier reports:
    (i)  3/14: WISC_LOCALGROUPDISK - DDM deletion errors - eLog 48420, https://ggus.eu/?mode=ticket_info&ticket_id=102249 in-progress.
    Update 3/26: Overheating problem in the site machine room. Ticket set to 'on-hold' and a downtime was declared.
    Update 4/2: downtime extended to 4/18.

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (3/19/14):
    • LOCALGROUPDISK cleanups: MWT2 and SLAC, largest set of users. Getting lists of files. (35 DN's at MWT2)
      • Lots of effort, but resulting cleanup is very good.
    • USERDISK - some sites are finished. SWT2 is still going on; still submitting batches for MWT2 from the previous campaign. Going slowly, since some users had several hundred thousand files per dataset. Hiro sent out notifications for the next campaign (MWT2, BNL).
    • GROUPDISK issue: dataset subscriptions are not working for AGLT2 for the Higgs group. Pinged the Higgs representative since the group is over quota.

  • meeting (4/15/14)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Phasing out Pandamover. Kaushik: will have a follow-up phone meeting (Rucio and PanDA teams). Did decide to test on LUCILLE. Is there an open question of cleanup? The cleanup modules in Rucio are not quite ready. Smaller Tier 2s in Europe have immediate file removal; ours leave files around for re-use, since we have larger PRODDISK. Michael requests progress be made based on DQ2 with LUCILLE (should be quick).
  • meeting (3/19/14):
    • FTS3 issues. "Someone" is debugging at CERN. We'll have to wait for it to be resolved. Not sure what the underlying issue is.
    • Wei notes that there may be efficiency issues for the OSG GridFTP server; transfers were previously managed by the RAL FTS3 - did they modify TCP buffer settings? (See the sketch after this list.)
    • Hiro: we need to repeat throughput testing. Will wait for the developer to come back. Also installing FTS3 here.
    • Rucio deployment:
      • See David Cameron's slides from yesterday.
      • Old dq2 clients (BU and HU). Tier3 users?
      • Hiro: 2.5 or later.
      • Doug: looked at slides, seems to be from jobs. Saul: get it from CVMFS?
      • Saul - to follow-up with David, to find which hosts are responsible.
      • LUCILLE as a candidate to migrate this week. Nothing has happened.
      • Mark: will contact David Cameron again. There was email two weeks ago; there were problems. Will follow-up again.
      • Based on the experience - move forward with the rest of the cloud.
      • Mark - to schedule, after LUCILLE then SWT2_UTA; then can do a site per day.
      • Michael: set allowfax=true for all sites. Ilija created the file and sent it to Alessandro Di Girolamo.
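
      Regarding the TCP buffer question above, a minimal sketch for inspecting the current limits (assuming a Linux host; these are standard kernel sysctl paths, and what values a given GridFTP/FTS3 endpoint should use is a site tuning decision not asserted here):

        # Minimal sketch: print the kernel TCP buffer settings relevant to WAN
        # transfer throughput on a Linux host. The paths are standard Linux sysctls.
        from pathlib import Path

        SYSCTLS = [
            "/proc/sys/net/core/rmem_max",   # max receive socket buffer (bytes)
            "/proc/sys/net/core/wmem_max",   # max send socket buffer (bytes)
            "/proc/sys/net/ipv4/tcp_rmem",   # min / default / max TCP receive buffer
            "/proc/sys/net/ipv4/tcp_wmem",   # min / default / max TCP send buffer
        ]

        for path in SYSCTLS:
            p = Path(path)
            value = p.read_text().strip() if p.exists() else "not available"
            print(f"{path}: {value}")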

  • meeting (4/16/14):

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
this meeting (4/16/14):
  • Added SFU. Changing the FAX topology in North America.
  • Understanding the EOS "too many DDM endpoints" issue.
  • Simplified direct dCache federation.

Site news and issues (all sites)

  • MWT2:
    • last meeting(s): Work continues on Stampede; the full framework is in place. Implementing an ATLAS-ready node using Parrot (for CVMFS) and a fakechroot environment for compatibility libraries and missing binaries (see the sketch below).
    • this meeting:
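
    A minimal sketch of the Parrot part of the approach above (illustrative only; it assumes the cctools parrot_run binary is installed and configured so that /cvmfs paths resolve, and the payload command is a placeholder):

      # Illustrative wrapper: run a payload under Parrot so /cvmfs is visible on a
      # worker node without a local CVMFS mount. Assumes parrot_run (cctools) is on
      # PATH and configured for CVMFS; the repository path is a placeholder.
      import subprocess

      def run_under_parrot(command):
          """Run `command` (an argv list) inside a parrot_run session."""
          return subprocess.run(["parrot_run"] + command, check=True)

      if __name__ == "__main__":
          # Hypothetical payload: list the top level of the ATLAS CVMFS repository.
          run_under_parrot(["ls", "/cvmfs/atlas.cern.ch"])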

  • SWT2 (UTA):
    • last meeting(s): Biggest issue over the last two weeks was CPB getting offlined. Looks like an issue with CVMFS locking up on some compute nodes; a rebuild of the node fixes it. Rebuilding with the latest version of CVMFS, in the background. Will likely rebuild the squid caches too, though no problems have been seen there. An auto-adjust script is in place for MCORE jobs (see the sketch below).
    • this meeting:
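
    A minimal sketch of what such an auto-adjuster could look like (purely illustrative; the job counts, the 50% cap, and the apply step are hypothetical placeholders, not the actual UTA script):

      # Illustrative auto-adjuster: size the MCORE (8-core) slot pool from the
      # number of waiting/running MCORE jobs, capped at a fraction of the farm.
      # A real script would query the batch system or PanDA and then rewrite the
      # local scheduler configuration; that apply step is omitted here.
      CORES_PER_MCORE_JOB = 8

      def target_mcore_slots(queued_jobs, running_jobs, total_cores, max_fraction=0.5):
          """Return how many MCORE slots to configure, capped at max_fraction of the farm."""
          demand = queued_jobs + running_jobs
          cap = int(total_cores * max_fraction) // CORES_PER_MCORE_JOB
          return min(demand, cap)

      if __name__ == "__main__":
          # Hypothetical numbers for illustration only.
          slots = target_mcore_slots(queued_jobs=40, running_jobs=150, total_cores=4000)
          print(f"Configure {slots} MCORE slots ({slots * CORES_PER_MCORE_JOB} cores)")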

  • SWT2 (OU, OSCER):
    • last meeting(s): Had a few issues with the sites getting offlined; they fill quickly with analysis jobs, which pound Lustre.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting:

  • WT2:
    • last meeting(s): Michael: replacing the aging Thumpers? Answer: getting quotes, but funding is not yet in place. (Michael will talk with Chuck.) Outbound connectivity for worker nodes is available.
    • this meeting:

  • T1:
    • last meeting(s): Infrastructure issue: the FTS3 deployment model concentrates the FTS service at one site. Proposing a second service for North America; Michael will send a confirmation to the FTS3 list recommending this.
    • this meeting:

  • AGLT2:
    • last meeting(s): Downtime on Tuesday; updated to version 2.6.22. Big cleanup of PRODDISK; removing dark data older than March 13 on USERDISK. Updated SL 3.2.5, gatekeepers running 3.2.6 (3.2.7 emergency release next week).
    • this meeting:

  • NET2:
    • last meeting(s): Getting HU MCORE queue validated.
    • this week:


last meeting
  • Add ATLAS Connect as a standing agenda item, to define scope.
  • Kaushik: 2/3 of running MCORE jobs are from other clouds; the US is running a lot of MCORE jobs for other clouds, and other clouds have taken MCORE jobs from US queues. Eventually our jobs will arrive. Not sure if Rod can help.
this meeting

-- RobertGardner - 15 Apr 2014



Attachment: 2014.04.16-fax-status.pptx (2255.8 KB) - IlijaVukotic, 15 Apr 2014 - 22:13