
MinutesMar192014

Introduction

Minutes of the bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:

Attending

  • Meeting attendees: Bob, Michael, Doug, Shawn, Kaushik, Wei, David, Alden, Hiro, Horst, Ilija, John Brunelle, Mayu, Saul
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • MCORE status
      • Two presentations today on FAX and ATLAS Connect (see Indico presentations)
      • US Cloud support issues
        • Need a "dispatcher" to triage problems. Would like that person to come from Production coordination. Possible candidates - Mark, Armen, Myuko.
        • Mark will discuss it and come up with a plan.
      • US ATLAS Facilities Meeting at SLAC, https://indico.cern.ch/event/303799/
      • Rucio client testing
      • FY14 resources, as pledged, to be provisioned by next month. We should make sure capacities as published are provisioned. Action item on Rob: summarize info.
    • this week
      • The agency meeting resulted in much discussion about computing; non-LHC panelists had generic questions, for 2 hours! Constructive and positive reactions, in particular to the Integration Program; panelists were impressed by how we handle on-going operations while innovating.
      • There are two important facility-wide activities: the Rucio migration, and the evaluation of and migration to FTS3. Both must be in place prior to DC14.

Featured Topic: OMD - Shawn

  • See slides on Indico above
  • Q: Sarah: regarding the use of RRD files - what load does this create on the file system? Shawn: preconfigured ramdisks host the RRD graphs; in OMD a lot of this is pre-configured (see the sketch below).
  • Extensible, e.g. IOSTAT checks.
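
For reference, a minimal sketch (an assumption-laden illustration, not something shown in the talk) of how one might verify that an OMD site's RRD/performance-data directory really sits on a ramdisk; the /omd/sites/... path is a guess and should be adapted to the actual site layout.

    # Sketch: check whether an OMD site's RRD/perfdata directory lives on tmpfs (ramdisk).
    # The path below is an assumption; adjust to the real OMD site layout.
    import os

    RRD_DIR = "/omd/sites/atlas/var/pnp4nagios/perfdata"  # hypothetical OMD site path

    def mount_fstype(path):
        """Return (mount_point, fs_type) of the longest-matching mount for path."""
        path = os.path.realpath(path)
        best = ("", "unknown")
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _dev, mnt, fstype = line.split()[:3]
                if path.startswith(mnt) and len(mnt) > len(best[0]):
                    best = (mnt, fstype)
        return best

    if __name__ == "__main__":
        mnt, fstype = mount_fstype(RRD_DIR)
        print("%s is mounted on %s (%s)" % (RRD_DIR, mnt, fstype))
        print("ramdisk-backed" if fstype == "tmpfs" else "NOT on tmpfs")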

MCORE status

last meeting(s):
  • MCORE status across all clouds is below. US: 488 out of 2078 running jobs (about 4000 cores). BNL, AGLT2, MWT2, SLAC have MCORE queues up and running. Issues with MCORE at OU - Horst will summarize.
  • OU: had a problem with the AGIS configuration - fixed.
  • UTA: got set up with a queue. Got some test jobs - there was an issue with getting a sufficient number of pilots, then jobs failed. Close - should be fixed in a couple of days; the main problem is pilots.
  • NET2: queues are set up dynamically. Only problem is the particular release MCORE jobs want; working on troubleshooting the validation jobs. Release 17.7.3 seems to be not found. (Same as Horst.)

last meeting (3/5/14):

  • AGLT2 working on configuration. 50% MCORE jobs running.
  • BNL: adjusted share, 75% prod resources.
  • WT2 - MCORE jobs make scheduling MPI jobs easier.
  • What are the relative priorities - serial vs MCORE? Michael: hard for us to tell what the relative priorities are with regard to the overall workload. Very manual.
  • Kaushik: Jedi will automate this. Dynamic queue ideas.
  • MWT2 - 200 slots
  • NET2 - 200 slots configured at BU. Also at HU. Segfault on a particular release.
  • SWT2 - two queues, looks like things are working.
  • SWT2-OU: setup.

this meeting:

  • Updates?
  • NET2: working at BU; figured out the problem (it affected 12 sites worldwide). 1700 cores running at BU. It was all due to an AGIS setting: the max memory was set too low. The value is used by validation jobs to kill themselves, so the failures were difficult to notice; now set to 32G. Bob: the guideline is 8x the single-core values for memory and disk (see the sketch after this list). HU - not running yet, but only because no validation jobs have run.
  • Kaushik: demand is met, for now. At UTA, using an auto-adjuster; it seems to be working.
  • OU: there is an MCORE queue at OSCER. Will deploy one at Langston. Will deploy MCORE on OCHEP after the Condor upgrade. Timescale: 2 weeks.
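
To make Bob's "8x" guideline above concrete, a small sketch of scaling single-core limits to an 8-core MCORE queue; the per-core baselines and field names are illustrative assumptions, not the actual AGIS values.

    # Sketch: MCORE queue limits as ~8x the single-core values (Bob's guideline above).
    # The per-core baselines and field names are illustrative assumptions, not AGIS values.
    CORES = 8                    # cores per MCORE job slot
    MEM_PER_CORE_MB = 4000       # assumed single-core memory limit (MB)
    SCRATCH_PER_CORE_GB = 20     # assumed single-core scratch/workdir limit (GB)

    def mcore_limits(cores=CORES):
        """Scale single-core limits to an N-core queue."""
        return {
            "maxmemory_MB": cores * MEM_PER_CORE_MB,    # ~32 GB for 8 cores
            "maxwdir_GB": cores * SCRATCH_PER_CORE_GB,  # hypothetical field name
        }

    if __name__ == "__main__":
        for key, val in sorted(mcore_limits().items()):
            print("%s = %s" % (key, val))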

Cleaning up Schedconfig attributes (Horst)

last meeting:
  • Bitten by a leftover appdir setting - it should not be set, and was getting re-pointed to the wrong location. There are some old config parameters that are still set, e.g. "JDL".
  • Can we do a clean-up? Alden: need to know what affects Tadashi's or Paul's infrastructure.
  • Bob: will look for an email that describes the settings needed.
  • Horst will set something up.

previous meeting (3/5/14):

  • Last week put together a list of variables which can be deleted, e.g. app_dir.
  • Will send an email to US cloud support.

this meeting:

  • Concluded?
  • Alden: we did not iterate on this this week. Will arrange with Gancho and try again, and will clear out values left over in those fields. appdir is only set at CPB (see the sketch below).
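
As a possible aid for this cleanup, a sketch of scanning queue configurations for leftover values in deprecated fields; the JSON endpoint, its shape, and the field names are placeholders, not the real schedconfig/AGIS interface.

    # Sketch: scan PanDA queue configurations for leftover values in deprecated fields
    # (e.g. appdir, jdl). The URL, JSON shape and field names are placeholder assumptions.
    import json
    import urllib.request

    SCHEDCONFIG_URL = "https://example.cern.ch/schedconfig/all.json"  # hypothetical endpoint
    DEPRECATED_FIELDS = ["appdir", "jdl"]

    def leftover_settings(url=SCHEDCONFIG_URL):
        """Return {queue: {field: value}} for deprecated fields that are still set."""
        with urllib.request.urlopen(url) as resp:
            queues = json.load(resp)   # assumed shape: {queue_name: {field: value, ...}}
        found = {}
        for name, cfg in queues.items():
            bad = {f: cfg.get(f) for f in DEPRECATED_FIELDS if cfg.get(f)}
            if bad:
                found[name] = bad
        return found

    if __name__ == "__main__":
        for queue, fields in sorted(leftover_settings().items()):
            print(queue, fields)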

Managing LOCALGROUPDISK at all sites (Kaushik and/or Armen)

previously
  • There has been discussion about tools; the policy has not yet been sent to the RAC
  • Any update on process?
  • Had first meeting discussing technical requirements and features. Will meet in a week.
  • Will present list and schedule in two weeks.
previous meeting (3/5/14):
  • See attached document.
  • Kaushik: it's a large software development effort. Can we see the plan and the implementation timeline?
  • Present preliminary design at SLAC meeting? Not a trivial task!
this meeting
  • Update?
  • Kaushik: discussed at the Tier3 implementation committee meeting on Friday. Good discussion; liked by the committee. Increase the base quota from 3 TB to 5 TB? Requests up to 20 TB should require no significant RAC action, etc. Most users use less than 3 TB. General agreement that this will be the policy (5 TB). 250 users expected; 1 PB for beyond-pledge (see the sketch after this list). Armen is checking this figure. (May have to ask for more resources.) If the T3 implementation committee feels this is needed, it should be stated in the recommendations.
  • Michael: currently 1.5 PB in LGD.
  • Armen: cleaning at the moment.
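
A quick back-of-the-envelope check of the quota numbers quoted above (illustrative arithmetic only; real demand depends on actual per-user usage):

    # Back-of-the-envelope check of the LOCALGROUPDISK numbers discussed above.
    USERS = 250            # expected number of users
    BASE_QUOTA_TB = 5      # proposed base quota per user (TB)
    CURRENT_LGD_PB = 1.5   # Michael: currently ~1.5 PB in LOCALGROUPDISK

    worst_case_pb = USERS * BASE_QUOTA_TB / 1000.0   # if every user filled the base quota
    print("Worst case if all users fill their quota: %.2f PB" % worst_case_pb)   # 1.25 PB
    print("Currently deployed in LOCALGROUPDISK:     %.2f PB" % CURRENT_LGD_PB)
    # Most users use well under 3 TB, so actual demand should be far lower; the 1 PB
    # beyond-pledge figure is what Armen is cross-checking.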

Reports on program-funded network upgrade activities

AGLT2

last meeting(s) (2/05/14):
  • Will have an outage this Friday.
  • Still waiting on 40g wave to show up. Will use 10g links until then. Will cut over to Juniper when 40g is available.
  • MSU - orders are going out.
meetings (2/19/14) and (3/05/14):
  • Don't have all the parts. Waiting on 40g blade for UM.
  • MSU - also waiting on parts.
  • April timeframe
this meeting (3/19/14):
  • UM assigned a network engineer to transition to the 100g wave April 1. Still waiting on the 40g blade (next week).
  • MSU: ordered, but have not received parts. Anticipate beginning of April.

MWT2

last meeting(s) (2/05/14):
  • At UC, additional 6x10g links being connected. Some technical fiber connection issues. Expect full 100g connectivity by the end of the week.
  • At Illinois - getting fiber from cluster to campus core.
meeting (2/19/14):
  • 8x10g connection at UIUC today
  • 8x10g in place at UC, and IU
meeting (3/05/14):
  • Network upgrades complete. Working on WAN studies.
this meeting (3/19/14):
  • ESnet to OmniPoP link fixed this morning. Can now do large-scale studies.

SWT2-UTA

last meeting(s) (2/05/14):
  • Started receiving equipment for the first round. $ available for the second increment; expect it six weeks out.
meeting (2/19/14):
  • Half of the equipment has arrived - expect the purchase for the second increment shortly. Will need to take a significant downtime.
  • Timeframe? Depends on procurement. Delivery a couple of weeks.
meeting (3/05/14):
  • Have final quotes in procurement, for the second half.
  • Money still hasn't arrived. Completion date unsure.
  • Delayed.
this meeting (3/19/14):
  • Patrick: got the money - in the process of refreshing quotes. Will get into purchasing later this week or next week.
  • Kaushik: the second round will have S6000s and S4810s. Unfortunately the edge hardware on campus is only 10g. Don't yet have leased lines to Dallas, where I2 has 100g; just submitted a proposal asking for Juniper 100g-capable hardware for the campus infrastructure.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting(s) (2/05/14):
  • AGLT2: 17 R620s in production; 2 are being used for testing.
  • MWT2: 45/48 in production
  • NET2: 42 new nodes in production
  • SWT2: coming shortly (next week or two). 35-45 compute nodes.
  • WT2: receiving machines; 8/60 installed. The remainder in a week or two. These are shared resources, what fraction is owned by ATLAS? 30%?
  • Tier1: Acquiring samples of latest models for evaluation. Received a 12-core machine. Asked 3 vendors (HP, Dell, Penguin) for samples.
meeting (3/05/14):
  • Updates?
  • SWT2: submitted. R620s. 40 nodes. Hope to deploy with the network update.
  • SLAC: all in production use. Will update spreadsheet and OIM.
  • Tier1: Will be getting two Arista 7500s, providing an aggregation layer. Awaiting the additional storage systems, in about 4 weeks.
this meeting (3/19/14):
  • SWT2 - still looking at quotes, maybe 420s. Maximize the HEPSPEC.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s): see notes from MinutesJan222014; meeting (2/05/14):
  • AGLT2: With MWT2, requesting ESNet provide LHCONE VRF at CIC OmniPoP
  • MWT2: Requesting ESnet to provision direct 100 Gbps circuit between BNL and UC.
  • NET2: Slow progress at MGHPCC. Needs to get BU on board.
  • SWT2: Try to export a different subnet with a couple of machines as a test. If it works, will change over the Tier 2 network. UTA campus network people have coordinated with LEARN; just need to coordinate with I2.
  • WT2: Already on LHCONE
meeting (2/19/14):
  • NET2: making progress - had a meeting with the networking group, having phone meetings now (NOX and MGHPCC). Link at MANLAN. Will get together next week.
meeting (3/05/14):
  • Major NRENs are bringing up perfSONAR test points within the LHCONE infrastructure.
  • AGLT2 - was disrupted, re-enabled.
  • NET2: nothing new
  • SWT2: Patrick needs a contact for I2. Had a question: do we have any changes? Is I2 providing the VRF? Dale Finkelson? Will re-send, with cc to Shawn. May need another contact.
this meeting (3/19/14):
  • NET2 status? No news; waiting on MIT.
  • SWT2? UTA networking has done everything needed. Patrick - to send something to ESnet.
  • Getting monitoring in place for LHCONE.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Lots of back pressure in the production system, many tasks.
    • MCORE not as much as expected; there are still some problems. The plan was to have MCORE up to 50% (ADC Coordination). Michael: there is a significant backlog in the queue. We want all the sites to have these available.
    • Large number of jobs requiring data from tape, caused drain of US cloud. Having problems keeping sites full. Will recommend what each site should do - change fair share policy in AGIS?
this meeting (3/19/14):
  • Everything looks good; full of jobs again. The multi-core queue AGIS configuration caused lots of problems, which seem to be resolved.
  • Saul: Would be nice to have a steady flow of MCORE jobs. Kaushik: will check.
  • What about sites being tagged as Tier2D?

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    http://indico.cern.ch/event/306432/contribution/3/material/slides/0.pdf
    
    1)  3/7: SLACXRD - file transfer failures with "could not open connection to osgserv04.slac.stanford.edu." Issue was an expired host certificate. Once the cert was updated 
    transfers resumed. https://ggus.eu/index.php?mode=ticket_info&ticket_id=101877 was closed, eLog 48323.
    2)  3/11: SLACXRD - file transfer failures with "could not open connection to osgserv04.slac.stanford.edu." Problem due to ESnet/LHCONE. Experts working on the problem. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=101982 in-progress, eLog 48368.
    3)  3/12: DDM dashboard update - details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/DDM-dash-v2.2.html
    4)  3/11: ADC Weekly meeting:
    http://indico.cern.ch/event/306432/
    
    Follow-ups from earlier reports:
    
    (i)  2/10: Lucille_CE - auto-excluded in panda and DDM. Site is experiencing a power problem and declared an unscheduled downtime.
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/47997.
    3/6: Downtime over - site back on-line.
    

  • this week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    https://indico.cern.ch/event/308477/contribution/0/material/0/
    
    1)  3/12: MWT2 - jobs failing on golub0xx WN's. Dave reported a problem with the WN's at Illinois not coming back up cleanly following a power outage (meta rpm 
    HEP_OSlibs_SL6 was not installing correctly). Problem fixed - https://ggus.eu/index.php?mode=ticket_info&ticket_id=102017 was closed. eLog 48396.
    2)  3/14: HU_ATLAS_Tier2 - squid service not available. https://ggus.eu/index.php?mode=ticket_info&ticket_id=102065. Intermittent outages over the next couple 
    of days, but stable as of 3/16. ggus 102065 closed, eLog 48455.
    3)  3/14: WISC_LOCALGROUPDISK - DDM deletion errors - eLog 48420, https://ggus.eu/?mode=ticket_info&ticket_id=102249 in-progress.
    4)  3/18: Lucille - file transfer failures with "Communication error on send, err: [SE][Ls][] httpg://lutse1.lunet.edu:8443/srm/v2/server: CGSI-gSOAP running on 
    fts112.cern.ch reports Error reading token data header: Connection reset by peer]." Issue was due to expired CRL's on the SE not getting updated. Problem 
    fixed - waiting to verify successful transfers before closing https://ggus.eu/index.php?mode=ticket_info&ticket_id=102428. eLog 48458.
    5)  3/18: ADC Weekly meeting:
    http://indico.cern.ch/e/307170
    
    Follow-ups from earlier reports:
    
    (i)  3/11: SLACXRD - file transfer failures with "could not open connection to osgserv04.slac.stanford.edu." Problem due to ESnet/LHCONE. Experts working on 
    the problem. https://ggus.eu/index.php?mode=ticket_info&ticket_id=101982 in-progress, eLog 48368.
    Update 3/13: Issue understood and resolved. ggus 101982 was closed, eLog 48399.
    

  • Email sent to cloud support: dataset subscriptions from UTD to DaTRI. The site had an expired host certificate; once updated, the issue was fixed (thus demonstrating the cloud support mechanism; see the sketch below).
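
Several of the incidents above trace back to expired host certificates or stale CRLs; below is a minimal sketch of a remote certificate-expiry check (hostname and port are placeholders, and the cryptography package is assumed to be available).

    # Sketch: how many days a remote service's host certificate remains valid - the kind
    # of problem behind the expired-cert tickets above. Host and port are placeholders.
    import socket
    import ssl
    from datetime import datetime

    from cryptography import x509   # third-party: pip install cryptography

    HOST, PORT = "se.example.edu", 443   # placeholder endpoint

    def cert_days_left(host, port, timeout=10):
        """Return days until the presented server certificate expires."""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE   # we only inspect the cert, not validate the chain
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                der = tls.getpeercert(binary_form=True)
        cert = x509.load_der_x509_certificate(der)
        return (cert.not_valid_after - datetime.utcnow()).days

    if __name__ == "__main__":
        print("%s:%d certificate expires in %d days" % (HOST, PORT, cert_days_left(HOST, PORT)))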

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • NTUP_COMMON distribution
    • Need to do some cleanup at MW and WT2; USERDISK cleanup didn't go through.
  • previous meeting:
    • USERDISK cleanups? Hiro? MW still needs to do it; Armen will do it.
    • (Don't delete DDM_TEST)
  • meeting (3/05/14):
    • Hiro: by accident, started deleting USERDISK data older than Feb 1.
    • Kaushik: suggests sending an email to DAST.
    • Armen - was deleting data from UC. There were a couple of 'jumbo' datasets (~800k files) that crashed the system.
    • Next data management meeting? Not sure.
    • Can files outside Rucio be removed?
    • CCC futures?
  • meeting (3/19/14):
    • LOCALGROUPDISK cleanups: MWT2 and SLAC, largest set of users. Getting lists of files. (35 DN's at MWT2)
      • Lots of effort, but resulting cleanup is very good.
    • USERDISK - some sites are finished. SWT2 still going on; still submitting batches for MWT2 from the previous campaign. Going slowly, since some users had several hundred thousand files per dataset. Hiro sent out notifications for the next campaign (MWT2, BNL).
    • GROUPDISK issue: dataset subscriptions not working for AGLT2 for the Higgs group. Pinged the Higgs representative, since the group is over quota.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Phasing out Pandamover. Kaushik: will have a follow-up phone meeting (Rucio and Panda teams). Did decide to test on LUCILLE. Is there an open question of cleanup? The cleanup modules in Rucio are not quite ready. Smaller Tier2s in Europe have immediate file removal; we leave files around for re-use, since we have larger PRODDISK. Michael requests progress be made based on DQ2 with Lucille. (Should be quick.)
  • meeting (3/5/14):
    • Saul's plots point to a problem with FTS3
    • We're using CERN FTS - Hiro.
  • meeting (3/19/14):
    • FTS3 issues. "Someone" is debugging at CERN; we'll have to wait for it to be resolved. Not sure what the underlying issue is.
    • Wei notes there may be efficiency issues for the OSG gridftp server; these transfers were previously managed by the RAL FTS3 - did they modify TCP buffer settings?
    • Hiro: we need to repeat throughput testing (a transfer-submission sketch follows this list). Will wait for the developer to come back. Also installing FTS3 here.
    • Rucio deployment:
      • See David Cameron's slides from yesterday.
      • Old dq2 clients (BU and HU). Tier3 users?
      • Hiro: 2.5 or later.
      • Doug: looked at slides, seems to be from jobs. Saul: get it from CVMFS?
      • Saul - to follow-up with David, to find which hosts are responsible.
      • LUCILLE as a candidate to migrate this week. Nothing has happened.
      • Mark: will contact David Cameron again. There was email two weeks ago; there were problems. Will follow-up again.
      • Based on the experience - move forward with the rest of the cloud.
      • Mark - to schedule, after LUCILLE then SWT2_UTA; then can do a site per day.
      • Michael: allowfax=true for all sites. Ilija created the file and sent it to Alessandro Di Girolamo.
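
For the throughput re-testing mentioned above, a hedged sketch of driving a single test transfer through an FTS3 endpoint; it assumes the fts-transfer-submit / fts-transfer-status command-line clients are installed, and the endpoint URL and SURLs are placeholders.

    # Sketch: drive one FTS3 test transfer, assuming the fts-transfer-submit /
    # fts-transfer-status CLI clients are installed. Endpoint and SURLs are placeholders.
    import subprocess
    import time

    FTS_ENDPOINT = "https://fts3.example.org:8446"        # placeholder FTS3 endpoint
    SRC = "srm://source.example.edu/path/to/testfile"     # placeholder source SURL
    DST = "srm://dest.example.edu/path/to/testfile.copy"  # placeholder destination SURL

    def submit(src, dst):
        """Submit one transfer and return the job id printed by the client."""
        out = subprocess.check_output(
            ["fts-transfer-submit", "-s", FTS_ENDPOINT, src, dst], text=True)
        return out.strip()

    def wait(job_id, poll=30):
        """Poll the job until it leaves the active states."""
        while True:
            state = subprocess.check_output(
                ["fts-transfer-status", "-s", FTS_ENDPOINT, job_id], text=True).strip()
            print(job_id, state)
            if state not in ("SUBMITTED", "READY", "ACTIVE", "STAGING"):
                return state
            time.sleep(poll)

    if __name__ == "__main__":
        print("Final state:", wait(submit(SRC, DST)))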

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference; meeting (2/19/14):
  • state of xrootd and plugin updates
  • added Milano T2; FR endpoints are coming back online.
  • running stress tests
meeting (3/05/14):
  • Working on the overflow use case with Paul and Tadashi. There are two steps; changes will be required in the pilot.
  • Kaushik: turned on overflow running in Jedi a couple of weeks ago - but jobs were failing. The executable is built and used at the local site.
  • Also, the amount of bandwidth that can be used by Panda.
  • Preparing larger scale tests - Wahid in the UK; a small site will run as "diskless".
  • Working on documentation and expansion. Working with the Spanish Tier 1, and with Asoka to get TRIUMF involved.
  • Users: with the xAOD we will have the opportunity to control how users use FAX. A special library will be needed, so we'll have an optimization for TTreeCache (see the sketch after this list). Also, should think about adding cache monitoring.
  • ROOTIO meeting tomorrow, and meeting at ANL.
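
On the TTreeCache point above, a minimal PyROOT sketch of reading a tree over FAX with an explicit read cache; the redirector URL, file path and tree name are hypothetical placeholders.

    # Sketch: read a tree over the FAX federation with an explicit TTreeCache (PyROOT).
    # The redirector URL, file path and tree name are hypothetical placeholders.
    import ROOT

    URL = "root://fax-redirector.example.org//atlas/rucio/user:some.file.root"
    TREE = "CollectionTree"

    f = ROOT.TFile.Open(URL)
    tree = f.Get(TREE)

    tree.SetCacheSize(100 * 1024 * 1024)   # 100 MB read cache
    tree.AddBranchToCache("*", True)       # cache all branches (restrict to what you read)
    tree.SetCacheLearnEntries(10)          # let the cache learn the access pattern first

    for i in range(min(1000, tree.GetEntries())):
        tree.GetEntry(i)                   # remote reads now go through the TTreeCache

    print(f.GetBytesRead(), "bytes read in", f.GetReadCalls(), "read calls")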

Site news and issues (all sites)

  • SWT2 (UTA):
    • last meeting(s): MCORE, LHCONE. Resurrecting some old UTA_SWT2 servers. CPB issues. Rucio. Info from DAST about getting data from UTA. There was a disk issue as files were being written - bad timing.
    • this meeting: Biggest issue over the last two weeks was CPB getting offlined. Looks like an issue with CVMFS locking up on some compute nodes; rebuilding the node fixes it. Rebuilding with the latest version of CVMFS, in the background (see the probe sketch below). Will likely rebuild the squid caches too, though no problems seen there. An auto-adjust script is in place for MCORE jobs.
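
A small sketch of the kind of per-node CVMFS health check that could catch such lock-ups before the site gets offlined; it assumes the standard cvmfs_config utility is installed and that a hang should be treated as a failure.

    # Sketch: per-node CVMFS health check to catch lock-ups like those described above.
    # Assumes the standard cvmfs_config utility is installed on the worker node.
    import subprocess

    REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch"]   # repositories to probe

    def probe(repo, timeout=60):
        """Run 'cvmfs_config probe <repo>' with a timeout; a hang counts as a failure."""
        try:
            res = subprocess.run(["cvmfs_config", "probe", repo],
                                 capture_output=True, text=True, timeout=timeout)
            return res.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    if __name__ == "__main__":
        for repo in REPOS:
            print(repo, "OK" if probe(repo) else "FAILED/HUNG - consider cvmfs_config reload")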

  • SWT2 (OU, OSCER):
    • last meeting(s):
    • this meeting: Had a few issues with the sites getting offlined; they quickly fill with analysis jobs, which pound Lustre.

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting:

  • WT2:
    • last meeting(s): 28% of SLAC cores (10k) are ATLAS-owned. Found a Gratia bug that under-reports WT2 CPU usage for all production jobs. Found that GridFTP 6.14 (the current one in the OSG RPMs) doesn't work well with the network settings suggested by ESnet (net.core.rmem, net.ipv4.tcp_rmem, ...wmem, etc.; see the sketch below). Added 56 nodes (900 cores), 47 cores to go.
    • this meeting: Michael: replacing aging Thumpers? Answer: getting quotes, but funding is not yet in place (Michael will talk with Chuck). Outbound connectivity for worker nodes is available.
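
Related to the GridFTP/TCP-buffer observation above, a sketch that compares the host's live kernel TCP buffer settings with ESnet-style large-buffer values; the "suggested" numbers are illustrative assumptions, not an official recommendation.

    # Sketch: compare current kernel TCP buffer settings with ESnet-style large-buffer
    # values. The "suggested" numbers are illustrative assumptions, not a recommendation.
    SUGGESTED = {
        "net/core/rmem_max": "67108864",
        "net/core/wmem_max": "67108864",
        "net/ipv4/tcp_rmem": "4096 87380 33554432",
        "net/ipv4/tcp_wmem": "4096 65536 33554432",
    }

    def current(key):
        """Read the live value from /proc/sys (equivalent to 'sysctl <key>')."""
        with open("/proc/sys/" + key) as f:
            return " ".join(f.read().split())

    if __name__ == "__main__":
        for key, want in SUGGESTED.items():
            have = current(key)
            flag = "" if have == want else "   <-- differs"
            print("%-20s current: %-28s suggested: %s%s" % (key, have, want, flag))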

  • T1:
    • last meeting(s): Problem over the weekend: the Chimera namespace manager became unresponsive (Postgres autovacuum). Have about 2.7 PB on order, finally through procurement (this is replacement storage).
    • this meeting: Infrastructure issue: the FTS3 deployment model is to concentrate the FTS service at one site. Proposing a second service for North America; Michael will send a confirmation to the FTS3 list recommending this.

  • AGLT2:
    • last meeting(s): Looking into OMD - Open Monitoring Distribution. Continuing to work with OMD - monitoring all hardware and software status.
    • this meeting: Downtime on Tuesday; updated to version 2.6.22. Big cleanup of PRODDISK. Removing dark data older than March 13 on USERDISK. Updated SL 3.2.5, gatekeepers running 3.2.6 (3.2.7 emergency release next week).

  • MWT2:
    • last meeting(s): David and Lincoln were able to get the 6248 connected to the Juniper - getting the new R620s online. Confirmation from UIUC on additional fiber inside the ACB. ICC networking being upgraded, and extra disks being added to GPFS.
    • this meeting: Work continues on Stampede; the full framework is in place. Implementing an ATLAS-ready node: using Parrot (for CVMFS) and a fakechroot environment for compatibility libs and missing binaries (see the sketch below).
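
A minimal sketch of the Parrot approach mentioned above: running a CVMFS-dependent command on a node with no native CVMFS mount by interposing parrot_run (from cctools); the command and paths are placeholders, and the exact Parrot/CVMFS configuration may differ by version.

    # Sketch: run a command needing CVMFS on a node with no native CVMFS mount by
    # interposing Parrot (cctools), as in the "ATLAS-ready node" work described above.
    # The command below is a placeholder; Parrot/CVMFS configuration may differ by version.
    import subprocess

    CMD = ["ls", "/cvmfs/atlas.cern.ch/repo/sw"]   # placeholder: any command needing CVMFS

    def run_under_parrot(cmd):
        """Launch cmd under parrot_run so /cvmfs accesses are served by Parrot."""
        return subprocess.call(["parrot_run"] + cmd)

    if __name__ == "__main__":
        raise SystemExit(run_under_parrot(CMD))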

AOB

last meeting:
this meeting:
  • Add ATLAS Connect as a standing agenda item, to define scope.
  • Kaushik: 2/3 of the MCORE jobs running are from other clouds; the US is running a lot of MCORE jobs for other clouds, and other clouds have taken MCORE jobs from US queues. Eventually US jobs will arrive. Not sure if Rod can help.


-- RobertGardner - 19 Mar 2014
