
MinutesApr302014

Introduction

Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:

Attending

  • Meeting attendees: Rob, Michael, Armen, Saul, Mark, Mayuko, Wei, Ilija, Horst, Hito, John Brunelle, Sarah
  • Apologies: Alden
  • Guests: none

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • MCORE status
      • US ATLAS Facilities Meeting at SLAC, https://indico.cern.ch/event/303799/
      • Rucio client testing
      • FY14 resources, as pledged, to be provisioned by next month. We should make sure capacities as published are provisioned. Action item on Rob: summarize info.
      • The agency meeting resulted in much discussion about computing. Non-LHC panelists had generic questions - for two hours! Reactions were constructive and positive, in particular toward the Integration Program. The panel was impressed by how we handle on-going operations while innovating.
      • There are two important activities going on facility-wide: the Rucio migration, and the evaluation of and migration to FTS3. Both must be in place prior to DC14.
    • this week
      • A few integration program issues:
        • Rucio conversion: status and timeline. LUCILLE was successfully transitioned: Panda mover is no longer used and Rucio is used exclusively. AGLT2 was mentioned; the first preparatory steps (removal of Panda mover) were successful, and the transition should complete over the next few days. MWT2 will volunteer next. Other comments: the conversion seems to be working very well; entire clouds have been transitioned.
        • Panda-WAN testing: the Panda team is ready. Schedconfig parameters need adjusting; Ilija is in contact with Tadashi. Some other changes are also needed in the pilot.
        • Continued work on ATLAS Connect
        • xAOD: ready to test? The ROOTIO working group is looking at performance and at whether we have the right systems to monitor it (user-job monitoring). Ilija is asking for example user code.
      • WAN performance, even in LHCONE, can sometimes lead to surprises. This is why the PET has pursued a solution which minimizes the number of domains; it has been discussed this week. US LHC management are now in a position to let ESnet management know we can move forward, and the DOE Office of High Energy Physics is in agreement. This solution will be put in place in about 6 months, pending ESnet approval, and will give much better network connectivity with our European counterparts.
      • Yesterday's ADC meeting covered getting the ATLAS environment software set up; there is a dedicated ADC development meeting.

ATLAS Connect (Rob)

  • Testing by early adopters continues
  • Much of the team is up at Condor Week this week
  • Main technical roadblock for opportunistic production jobs (via Panda, e.g. on Stampede) is two-fold (but related):
    • Delivering ATLAS compatibility libraries (HEP_OSlibs_SL6)
    • A threading problem in Parrot that may or may not be related to using multiple CVMFS repos, or something else. The CCTools team is currently stumped but will continue to investigate (a minimal Parrot invocation sketch is given after this list).
  • Classic Tier3-style analysis batch usage continues to help improve the platform definition (e.g. installing analysis helper tools such as git-svn)
  • Usage past week: see below.
  • We'll have a meeting next Monday
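  • For illustration, here is a minimal sketch of how a job wrapper might expose CVMFS through Parrot on a host without a native CVMFS mount. It assumes CCTools' parrot_run is on the PATH; the repository path and command are illustrative, not the actual ATLAS Connect wrapper.
      #!/usr/bin/env python
      # Minimal sketch: run a command under Parrot so /cvmfs becomes visible
      # without a FUSE-based CVMFS mount. Assumes CCTools' parrot_run is on PATH.
      import subprocess
      import sys

      def run_under_parrot(cmd):
          """Wrap an arbitrary command with parrot_run (illustrative only)."""
          return subprocess.call(["parrot_run"] + cmd)

      if __name__ == "__main__":
          # Example: list the ATLAS CVMFS repository to verify it is reachable.
          sys.exit(run_under_parrot(["ls", "/cvmfs/atlas.cern.ch/"]))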

MCORE status

last meeting(s):
  • MCORE status across all clouds is below. US: 488 out of 2078 running jobs (about 4000 cores). BNL, AGLT2, MWT2, SLAC have MCORE queues up and running. Issues with MCORE at OU - Horst will summarize.
  • OU: had a problem with the AGIS configuration - fixed.
  • UTA: got setup with a queue. Got some test jobs - issue with getting sufficient # pilots. Then jobs failed. Close - should be fixed in a couple of days. Main problem is pilots.
  • NET2: queues are set up dynamically. The only problem is the particular release the MCORE jobs want. Working on troubleshooting the validation jobs; release 17.7.3 seems to be not found. (Same as Horst.)

last meeting (3/5/14):

  • AGLT2 working on configuration. 50% MCORE jobs running.
  • BNL: adjusted share, 75% prod resources.
  • WT2 - MCORE jobs make scheduling MPI jobs easier.
  • What are the relative priorities - serial vs mcore? Michael: Hard for us to tell what the relative priorities with regard to the overall workload. Very manual.
  • Kaushik: Jedi will automate this. Dynamic queue ideas.
  • MWT2 - 200 slots
  • NET2 - 200 slots configured at BU. Also at HU. Segfault on a particular release.
  • SWT2 - two queues, looks like things are working.
  • SWT2-OU: setup.

last meeting (3/19/14):

  • NET2: working at BU; figured out the problem (affecting 12 sites worldwide). 1700 cores running at BU. It was all due to an AGIS setting: the max memory was too low. The value is used by validation jobs to kill themselves if they exceed it, so it was difficult to notice. Now 32G. Bob: the guideline is 8x for memory and disk. HU - not running yet, but only because no validation jobs have run.
  • Kaushik: demand is met, for now. At UTA an auto-adjuster is used; it seems to be working regardless of what people submit.
  • OU: there is an MCORE queue at OSCER. Will deploy one at Langston, and will deploy MCORE on OCHEP after the Condor upgrade. Timescale: 2 weeks.

this meeting:

  • Updates?
  • Michael sent a note out last week - we are out of multi-core jobs for the next 3-4 weeks; only sequential jobs.
  • Mark: no major multi-core tasks in the pipeline.
  • Sites should back off the number of MCORE slots.
  • Michael: Will and others are going to set this up with Condor (a sketch of a possible auto-adjuster follows this list).
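  • As a rough illustration of the kind of MCORE auto-adjuster mentioned for UTA, the sketch below counts idle multi-core jobs in Condor and proposes a target number of MCORE slots. The 8-core assumption, the constraint, and the cap are hypothetical; any real adjuster is site-specific.
      #!/usr/bin/env python
      # Hypothetical sketch of an MCORE slot auto-adjuster: count idle jobs that
      # request >= 8 cores and propose how many MCORE slots to keep configured.
      import subprocess

      CORES_PER_MCORE_JOB = 8   # assumption: MCORE jobs request 8 cores
      MAX_MCORE_SLOTS = 200     # illustrative site cap

      def idle_mcore_jobs():
          """Count idle (JobStatus == 1) jobs requesting >= 8 CPUs via condor_q."""
          out = subprocess.check_output(
              ["condor_q", "-constraint",
               "RequestCpus >= %d && JobStatus == 1" % CORES_PER_MCORE_JOB,
               "-format", "%d\n", "ClusterId"])
          return len([line for line in out.splitlines() if line.strip()])

      if __name__ == "__main__":
          demand = idle_mcore_jobs()
          target = min(demand, MAX_MCORE_SLOTS)
          # A real adjuster would reconfigure the batch system; here we only report.
          print("Idle MCORE jobs: %d -> suggested MCORE slots: %d" % (demand, target))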

Cleaning up Schedconfig attributes (Horst)

last meeting:
  • Bitten by a leftover appdir setting, which should not be set - it was re-pointing to the wrong location. There are some other old config parameters that are still set, e.g. "JDL". (A sketch of scanning for such leftovers is given at the end of this topic.)
  • Can we do a clean-up? Alden: need to know what affects Tadashi's or Paul's infrastructure.
  • Bob: will look for an email that describes the settings needed.
  • Horst will set something up.

previous meeting (3/5/14):

  • Last week a set of variables which can be deleted was identified, e.g. appdir.
  • Will send an email to US cloud support.

last meeting (3/5/14):

  • Concluded?
  • Alden: We did not iterate on this this week. Can arrange with Gancho and try again; will clear out values left over in those fields. appdir is only set at CPB.

this meeting:

  • Concluded?
  • Horst will discuss with Ales
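  • To make the clean-up concrete, here is a sketch of scanning a schedconfig dump for leftover attributes such as appdir. The dump file name, its layout, and the attribute list are assumptions for illustration; the actual clean-up would be done through AGIS/schedconfig directly.
      #!/usr/bin/env python
      # Sketch: scan a JSON dump of schedconfig queue definitions for attributes
      # that should no longer be set (e.g. appdir, jdl). The file name, layout,
      # and attribute list are illustrative assumptions.
      import json

      STALE_ATTRIBUTES = ["appdir", "jdl"]

      def find_stale(dump_path="schedconfig_dump.json"):
          with open(dump_path) as f:
              queues = json.load(f)      # assumed layout: {queue_name: {attr: value}}
          for name, params in sorted(queues.items()):
              for attr in STALE_ATTRIBUTES:
                  value = params.get(attr)
                  if value:              # flag non-empty leftover settings
                      print("%s: %s = %r" % (name, attr, value))

      if __name__ == "__main__":
          find_stale()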

Managing LOCALGROUPDISK at all sites (Kaushik and/or Armen)

previously
  • There has been discussion about tools; the policy has not been sent to the RAC.
  • Any update on process?
  • Had first meeting discussing technical requirements and features. Will meet in a week.
  • Will present list and schedule in two weeks.
previous meeting (3/5/14):
  • See attached document.
  • Kaushik: it's a large software development effort. Can we see the plan and an implementation timeline?
  • Present preliminary design at SLAC meeting? Not a trivial task!
this meeting
  • Update?
  • Kaushik: Discussed at the Tier3 implementation committee meeting on Friday. Good discussion; the committee liked it. Increase the base quota from 3 to 5 TB? Requests up to 20 TB should require no significant RAC action, etc. Most users use less than 3 TB. There is general agreement this will be the policy (5 TB). 250 users are expected, roughly 1 PB for beyond pledge; Armen is checking this figure (see the back-of-envelope check after this list). We may have to ask for more resources. If the T3 implementation committee feels this is needed, it should be stated in the recommendations.
  • Michael: currently 1.5 PB in LGD.
  • Armen: cleaning at the moment.
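  • A back-of-envelope check of the numbers above, assuming 250 users at the proposed 5 TB default quota (the figures are taken from the discussion, not a new estimate):
      # Back-of-envelope check: 250 users at the proposed 5 TB default quota.
      users = 250
      default_quota_tb = 5
      total_pb = users * default_quota_tb / 1000.0
      print("Nominal LOCALGROUPDISK need: %.2f PB" % total_pb)   # ~1.25 PB
      # Comparable to the ~1 PB beyond-pledge figure and the ~1.5 PB currently in LGD.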

meeting (4/30/14):

  • Armen is working on cleaning up - 800 TB cleaned up.
  • Hiro: can you use the Rucio client commands to find the file status? Files are not showing up at Michigan (PRODDISK).
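  • As an illustration of the kind of Rucio client query this refers to, a sketch that lists the RSEs holding replicas of a file, so one can see whether it is registered at a given site. The scope and file name are placeholders, not real data.
      # Sketch of checking a file's replica locations with the Rucio Python client.
      # The scope and file name below are placeholders, not real data.
      from rucio.client import Client

      def replica_sites(scope, name):
          client = Client()
          for rep in client.list_replicas(dids=[{"scope": scope, "name": name}]):
              # 'rses' maps each RSE hosting the file to its PFN(s)
              for rse in rep.get("rses", {}):
                  print("%s:%s has a replica at %s" % (scope, name, rse))

      if __name__ == "__main__":
          replica_sites("user.someuser", "some.file.root")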

Reports on program-funded network upgrade activities

AGLT2

last meeting (2/05/14)
  • Will have an outage this Friday.
  • Still waiting on 40g wave to show up. Will use 10g links until then. Will cut over to Juniper when 40g is available.
  • MSU - orders are going out.
meetings (2/19/14) and (3/05/14)
  • Don't have all the parts. Waiting on 40g blade for UM.
  • MSU - also waiting on parts.
  • April timeframe
meeting (3/19/14):
  • UM assigned a network engineer to transition to the 100g wave April 1. Still waiting on the 40g blade (next week).
  • MSU: ordered, but have not received parts. Anticipate beginning of April.

meeting (4/30/14):

MWT2

last meeting (2/05/14)
  • At UC, additional 6x10g links are being connected. Some technical fiber-connection issues remain. Expect full 100g connectivity by the end of the week.
  • At Illinois - getting fiber from cluster to campus core.
this meeting, (2/19/14)
  • 8x10g connection at UIUC today
  • 8x10g in place at UC, and IU
this meeting, (3/05/14)
  • Network upgrades complete. Working on WAN studies.
meeting (3/19/14):
  • ESnet to OmniPoP link fixed this morning. Can now do large-scale studies.
meeting (4/30/14):

SWT2-UTA

last meeting (2/05/14)
  • Started receiving equipment for the first round. Funds are available for the second increment; expect it about six weeks out.
this meeting, (2/19/14)
  • Half of the equipment has arrived - expect the purchase for the second increment shortly. Will need to take a significant downtime.
  • Timeframe? Depends on procurement. Delivery a couple of weeks.
this meeting, (3/05/14)
  • Have final quotes in procurement, for the second half.
  • Money still hasn't arrived. Completion date unsure.
  • Delayed.
meeting (3/19/14):
  • Patrick: Got the money - in the process of quote refreshing. Will get into purchasing later this week or next week.
  • Kaushik: the second round will have S6000s and an S4810. Unfortunately the edge hardware on campus is only 10g, and we don't yet have leased lines to Dallas where I2 has 100g. Just submitted a proposal asking for Juniper 100g-capable hardware for the campus infrastructure.
meeting (4/30/14):
  • Internal networking: split into two pieces; second half arriving very shortly. Then will schedule a downtime. Equipment is showing up.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting (2/05/14)
  • AGLT2: 17 of R620s in production. 2 are being used for testing.
  • MWT2: 45/48 in production
  • NET2: 42 new nodes in production
  • SWT2: coming shortly (next week or two). 35-45 compute nodes.
  • WT2: receiving machines; 8/60 installed. The remainder in a week or two. These are shared resources, what fraction is owned by ATLAS? 30%?
  • Tier1: Acquiring samples of latest models for evaluation. Received a 12-core machine. Asked 3 vendors (HP, Dell, Penguin) for samples.
this meeting, (3/05/14)
  • Updates?
  • SWT2: submitted. R620s. 40 nodes. Hope to deploy with the network update.
  • SLAC: all in production use. Will update spreadsheet and OIM.
  • Tier1: Will be getting two Arista 7500s, providing an aggregation layer. Awaiting the additional storage systems, in about 4 weeks.
meeting (3/19/14):
  • SWT2 - still looking at quotes, maybe 420s. Aiming to maximize HEPSPEC.
meeting (4/16/14):

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meetings: see notes from MinutesJan222014; meeting (2/05/14)
  • AGLT2: With MWT2, requesting ESNet provide LHCONE VRF at CIC OmniPoP
  • MWT2: Requesting ESnet to provision direct 100 Gbps circuit between BNL and UC.
  • NET2: Slow progress at MGHPCC. Needs to get BU on board.
  • SWT2: Try to export a different subnet with a couple of machines as a test. If it works, will change over the Tier 2 network. UTA campus network people have coordinated with LEARN; just need to coordinate with I2.
  • WT2: Already on LHCONE
this meeting, (2/19/14)
  • NET2: making progress, had meeting with networking group, having phone meetings now. NOX and MGHPCC. Link at MANLAN. Will get together next week.
this meeting, (3/05/14)
  • Major NRENs are bringing up perfSONAR test points within the LHCONE infrastructure.
  • AGLT2 - was disrupted, re-enabled.
  • NET2: nothing new
  • SWT2: Patrick needs a contact at I2. Had a question: do we have any changes? Is I2 providing the VRF? Dale Finkelson? Will re-send, with cc to Shawn. May need another contact.
meeting (3/19/14):
  • NET2 status? No news; waiting on MIT.
  • SWT2? UTA networking has done everything needed. Patrick - to send something to ESnet.
  • Getting monitoring in place for LHCONE.
meeting (4/30/14):
  • NET2: lost the dedicated circuit to BNL. Brian Malo (MIT) said he would take care of it. Should have some word by the next meeting.
  • SWT2: have a dedicated machine on a test subnet. Still had some questions for LEARN; will report back on what is heard from LEARN.

Operations overview: Production and Analysis (Kaushik)

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week.
    
    1)  4/16: From Hiro: BNL dCache is experiencing an issue staging files from HPSS, resulting in a large number of failures in Panda mover. Issue resolved as of?
    2)  4/22: ADC Weekly meeting:
    http://indico.cern.ch/event/307175/
    3)  4/23 early a.m.: Following a rucio upgrade jobs began failing with errors like "Could not add files to DDM: Details: Problem validating dids : 
    u'3046f202-7e8f-4dd4-b75e-a48f60b45c1f' does not match." Site services were also affected. Issue understood and resolved.  
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/49010.
    
    Follow-ups from earlier reports:
    
    (i)  3/14: WISC_LOCALGROUPDISK - DDM deletion errors - eLog 48420, https://ggus.eu/?mode=ticket_info&ticket_id=102249 in-progress.
    Update 3/26: Overheating problem in the site machine room. Ticket set to 'on-hold' and a downtime was declared.
    Update 4/2: downtime extended to 4/18.
    

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    http://indico.cern.ch/event/307176/contribution/1/material/slides/0.pdf
    
    1)  4/29: Pilot update from Paul (v58k). Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58k.html
    2)  4/29: SLACXRD - file transfer errors ("SRM_AUTHORIZATION_FAILURE"). Wei reported that a CRL update earlier the same day had failed, resulting in the errors. 
    Issue resolved - https://ggus.eu/?mode=ticket_info&ticket_id=104986 was closed. eLog 49136. Additional comment from Hiro: CRL problem was observed at BNL yesterday.  
    I have heard from OSG (Xin) that it was caused by an upstream issue and that it was discussed in an OSG meeting.  If you see it, just re-run fetch-crl.
    3)  4/29: ADC Weekly meeting:
    http://indico.cern.ch/event/307176/
    
    Follow-ups from earlier reports:
    
    (i)  3/14: WISC_LOCALGROUPDISK - DDM deletion errors - eLog 48420, https://ggus.eu/?mode=ticket_info&ticket_id=102249 in-progress.
    Update 3/26: Overheating problem in the site machine room. Ticket set to 'on-hold' and a downtime was declared.
    Update 4/2: downtime extended to 4/18.
    Update 4/29: Site reported that the SRM service was moved to a different host, and steps were taken to avoid the original overheating issue in the machine. 
    They closed the ggus ticket, but there has been little recent DDM activity, so the situation cannot be fully assessed.
    (ii) 4/30: SMU - continuing source / destination file transfer errors, mostly "Unable to connect to smuosg1.hpc.smu.edu." 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=101975, eLog 49138. Site blacklisted.
    

  • A low level of US issues - nice.
  • Bug tracking is now in Jira
  • Pilot update from Paul
  • Tier3 issues - Wisconsin (cooling issues, but not much activity there); SMU transfers to/from the site. Michael: this might be an operational burden for ADC Ops. Most are storage related. Is this being discussed among shifters? Complaints? Mark: it's mostly a reporting issue.
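  • Regarding the fetch-crl advice in the operations summary above, a minimal sketch of re-running it and checking the result. It assumes the standard fetch-crl command is installed and on the PATH; in practice a site would simply run it by hand or via cron.
      #!/usr/bin/env python
      # Minimal sketch: re-run fetch-crl after a failed CRL update and report the result.
      # Assumes the standard fetch-crl command is installed and on PATH.
      import subprocess

      def refresh_crls():
          rc = subprocess.call(["fetch-crl"])  # non-zero exit means some CRLs failed to update
          if rc == 0:
              print("CRLs refreshed successfully")
          else:
              print("fetch-crl exited with status %d; check the logs for details" % rc)
          return rc

      if __name__ == "__main__":
          refresh_crls()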

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (3/19/14):
    • LOCALGROUPDISK cleanups: MWT2 and SLAC, largest set of users. Getting lists of files. (35 DN's at MWT2)
      • Lots of effort, but resulting cleanup is very good.
    • USERDISK - some sites are finished. SWT2 is still going on; still submitting batches for MWT2 from the previous campaign. Going slowly, since some users had several hundred thousand files per dataset. Hiro sent out notifications for the next campaign (MWT2, BNL).
    • GROUPDISK issue: dataset subscriptions not working for AGLT2 for the Higgs group; contacted the Higgs representative since the group is over quota.

  • meeting (4/30/14)
    • See above.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Phasing out Panda mover. Kaushik: will have a follow-up phone meeting (Rucio and Panda teams). Did decide to test on LUCILLE. Is there an open question of cleanup? The cleanup modules in Rucio are not quite ready. Smaller Tier2s in Europe do immediate file removal; we leave files around for re-use, since we have a larger PRODDISK. Michael requests progress be made based on DQ2 with LUCILLE (should be quick).
  • meeting (3/19/14):
    • FTS3 issues. "Someone" is debugging at CERN. We'll have to wait for it to be resolved. Not sure what the underlying issue is.
    • Wei notes that there may be efficiency issues for the OSG gridftp server for transfers previously managed by the RAL FTS3 - did they modify TCP buffer settings?
    • Hiro: we need to repeat throughput testing. Will wait for the developer to come back. Also installing FTS3 here.
    • Rucio deployment:
      • See David Cameron's slides from yesterday.
      • Old dq2 clients (BU and HU). Tier3 users?
      • Hiro: 2.5 or later.
      • Doug: looked at slides, seems to be from jobs. Saul: get it from CVMFS?
      • Saul - to follow-up with David, to find which hosts are responsible.
      • LUCILLE as a candidate to migrate this week. Nothing has happened.
      • Mark: will contact David Cameron again. There was email two weeks ago; there were problems. Will follow-up again.
      • Based on the experience - move forward with the rest of the cloud.
      • Mark - to schedule, after LUCILLE then SWT2_UTA; then can do a site per day.
      • Michael: allowfax=true for all sites. Ilija created the file and sent it to Alessandro Di Girolamo.

  • meeting (4/30/14):
    • Nothing to report

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference last meeting (4/16/14)
  • Added SFU. Changing the FAX topology in North America.
  • Understanding the EOS "too many DDM endpoints" issue.
  • Simplified direct dCache federating.
this meeting (4/30/14)
  • A reconfigured redirector network for the US is being set up.
  • IN2P3 is now online.
  • New dCache redirectors; will simplify deployments
  • Also the ROOTIO activity
  • Analyzing cost matrix data; using Google Engine to store and analyze it. About 15 MB/s for most links.
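  • As an illustration of the cost-matrix analysis mentioned above, a sketch that averages measured throughput per source-destination link from a list of records. The record format (CSV: source, destination, MB/s) and file name are assumptions for illustration, not the actual FAX measurement schema.
      #!/usr/bin/env python
      # Sketch: average FAX cost-matrix throughput per (source, destination) link.
      # The input format (CSV: source,destination,mb_per_s) is an illustrative
      # assumption, not the actual schema used for the FAX measurements.
      import csv
      from collections import defaultdict

      def average_throughput(path="cost_matrix.csv"):
          totals = defaultdict(float)
          counts = defaultdict(int)
          with open(path) as f:
              for source, destination, mb_per_s in csv.reader(f):
                  link = (source, destination)
                  totals[link] += float(mb_per_s)
                  counts[link] += 1
          return dict((link, totals[link] / counts[link]) for link in totals)

      if __name__ == "__main__":
          for (src, dst), avg in sorted(average_throughput().items()):
              print("%s -> %s: %.1f MB/s" % (src, dst, avg))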

Site news and issues (all sites)

  • MWT2:
    • last meeting(s): Work continues on Stampede. The full framework is in place. Implementing an ATLAS-ready node: using Parrot (for CVMFS) and a fakechroot environment for compatibility libraries and missing binaries.
    • this meeting: network issues; otherwise things are quiet. Did have a hardware failure on the management node.

  • SWT2 (UTA):
    • last meeting(s): The biggest issue in the last two weeks was CPB getting offlined. Looks like an issue with CVMFS locking up on some compute nodes; a rebuild of the node fixes it. Installing the latest version of CVMFS and rebuilding in the background. Will likely rebuild squid caches too, though no problems seen there. An auto-adjust script is in place for MCORE jobs.
    • this meeting: Working on MCORE and tuning. Patrick has implemented a local site kill-job mechanism. If a machine is swapping it sometimes causes HC jobs to fail and the site gets offlined; these are ATLAS user analysis jobs.

  • SWT2 (OU, OSCER):
    • last meeting(s): Had a few issues with the site getting offlined: quick fills with analysis jobs were pounding Lustre.
    • this meeting: The OSCER cluster is not getting a lot of throughput, as the cluster is being used for weather studies. Expect to get more soon as the opportunistic queue gets reassigned.

  • SWT2 (LU):
    • last meeting(s): Fully functional and operational and active.
    • this meeting:

  • WT2:
    • last meeting(s): Michael: replacing aging Thumpers? Answer: getting quotes, but funding is not yet in place? (Michael will talk with Chuck). Outbound connectivity for worker nodes is available.
    • this meeting: Network reconfiguration for the gridftp server. Observed a problem with the DigiCert CRL update, which caused problems with VOMS proxy init. There were also GUMS issues. The OSG Gratia service was down for more than one day - were job statistics lost?

  • T1:
    • last meeting(s): Infrastructure issue: the FTS3 deployment model concentrates the FTS service at one site. Proposing a second service for North America; Michael will send a confirmation to the FTS3 list recommending this.
    • this meeting: Maintenance next Tuesday. Will integrate the two Arista 7500s and replace obsolete network switches. Will then have full 100g inter-switch connectivity.

  • AGLT2:
    • last meeting(s): Downtime on Tuesday. Updated 2.6.22 version. Big cleanup of PRODDISK. Removing dark data - older than March 13 on USERDISK. Updated SL 3.2.5, gatekeepers running 3.2.6. (3.2.7 emergency release next week)
    • this meeting:

  • NET2:
    • last meeting(s): Getting HU MCORE queue validated.
    • this week: DDM problem - a symptom of an underlying congestion problem (timeouts). Upgrading the SRM host to SL6. The validation problem was tracked down to the MAXMEM setting in AGIS: it was 3000, which causes certain tasks to fail validation; it needs to be > 4000. Undocumented.

AOB

last meeting
  • Add ATLAS Connect as a standing agenda item, to define scope.
  • Kaushik: 2/3 of the MCORE jobs running are from other clouds; the US is running a lot of MCORE jobs for other clouds, and other clouds have taken MCORE jobs from US queues. Eventually jobs will arrive. Not sure if Rod can help.
this meeting


-- RobertGardner - 30 Apr 2014

  • daily_hours_by_user.png (attached): ATLAS Connect usage past week, daily hours by user.
  • hours_by_project_and_site.png (attached): ATLAS Connect usage past week, hours by project and site.
