r4 - 05 Feb 2014 - 15:39:41 - WeiYang

MinutesFeb052014

Introduction

Minutes of the Facilities Integration Program meeting, February 5, 2014

Attending

  • Meeting attendees: Bob, Ilija, Rob, Torre, Shawn, Michael, Saul, Armen, Mark, Hiro, Dave Lesny, Alden, Wei, Mayuko, Mark
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Final updates to SiteCertificationP27 and v29 of the Google Docs spreadsheet in CapacitySummary are due for the quarterly report.
      • Some activities for next quarter:
        • MCORE availability
        • DC14 readiness tests
          • Want to make it easy to deploy at other sites, taking out
          • Invite John to explain openstack environment at BNL
        • Facility-wide flexible infrastructure activity
      • Transatlantic FAX demonstrator. The goal is to demonstrate large-scale use of a 100 Gbps transatlantic test circuit using direct reads or copies of data via FAX from multiple sites in the U.S.
        • Writing this up. To be used by the network providers on both sides. Timeframe in early Feb. Adding 100g circuit later in the month. What performance can you get with direction.
      • ATLAS Connect
        • Starting some beta testing now. http://connect.usatlas.org
        • Currently, bosco-flocking with small quotas is in place to MWT2, AGLT2, and CSU-Fresno
        • http://connect.usatlas.org/accounting-summary/
        • Working on connecting to TACC-Stampede (first steps); later, add autopyfactory to support Panda jobs to these targets.
        • Full description is available in Tier3 implementation committee google docs
      • Last week: Condo-of-Condos meeting at NCSA, http://www.ncsa.illinois.edu/Conferences/ARCC/. Met with some of our institutional campus computing partners (in particular the Holyoke and OU directors), who expressed interest in ATLAS Connect and, more generally, http://ci-connect.net.
    • this week
      • Progress on multi-core jobs a concern.
        • SWT2 - been busy
        • BU - MCORE now up; still a problem with validation with HU
        • There is a backlog now.
        • Good to validate. Condor dynamic partitioning now working at AGLT2, but equipped with cron that adjusts based on high/low level jobs.
        • BNL is running two multicore queues - one for MCORE jobs, and one for a high memory queue

Managing LOCALGROUPDISK at all sites (Kaushik and/or Armen)

previously
  • See Indico for slides.
  • Q: (Doug) why 3 TB? 30TB total. A(Kaushik): 3x300 ~ 1PB; which would be un-pledged resources. At what scale do we go to RAC? Of course, if everyone used 30TB we'd be out of space. We're assuming a factor of 10 overcommitment.
  • Q: (Doug): week interval? Kaushik: three warnings, then go to RAC before deleting.
  • Q: (Doug): sample lifetime? Ask the user to set a lifetime.
  • Q: (Rob): what's the granularity? A: (Kaushik): will put it in at the dataset level. Allow wildcards. Will need a lot of software, monitoring, and web-based front-ends. We have a rough idea of what it will be like. Will be ready before the next run. We think Hiro, Mayuko and Armen will be doing the development.
  • Q: (Saul): is LGD for US users only? A: (Kaushik): yes, only U.S.
  • Q: (Doug): where does the policy go for approval? A: (Kaushik): reviewing within the facilities; then to analysis support (Jason & Erich); then to RAC. Once approved by the RAC, then give to US ATLAS management. Throughout the process, this will be reviewed. Doug: IB? A: good suggestion.
  • Q: (Doug): what about archival?
  • Michael: there is a natural opportunity to present at an IB meeting - there is a standard time slot, suggest we use it.
  • Michael: we reserve 20% for US physicists. We should be consistent. This is more like 2 PB. We should discuss where these go - presumably majority to LGD, but there may be others.
  • Kaushik - notes most users tend to stay at a single site. Entire quota can go at a site.
  • Rob: notes
this meeting
  • Any update on process?
  • There has been discussion about tools; the policy has not yet been sent to the RAC.
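
The quota arithmetic from the earlier Q&A (3 TB default, 30 TB total, "3x300 ~ 1PB", factor of 10 overcommitment) can be checked with a short sketch. Reading "3x300" as 3 TB times roughly 300 users is an assumption, and the variable names are illustrative only.

```python
# Figures quoted in the minutes. Interpreting "3x300" as
# 3 TB per user times about 300 users is an assumption.
DEFAULT_QUOTA_TB = 3    # per-user default quota
HARD_LIMIT_TB = 30      # per-user total allowed
ASSUMED_USERS = 300     # assumed user count

# Space needed if every user stays at the default quota.
expected_tb = DEFAULT_QUOTA_TB * ASSUMED_USERS

# Space needed if every user fills the full 30 TB.
worst_case_tb = HARD_LIMIT_TB * ASSUMED_USERS

# Overcommitment: how far the worst case exceeds the planned space.
overcommit = HARD_LIMIT_TB / DEFAULT_QUOTA_TB

print(f"expected:   {expected_tb / 1000:.1f} PB")
print(f"worst case: {worst_case_tb / 1000:.1f} PB")
print(f"overcommitment factor: {overcommit:.0f}")
```

Under these assumptions the planned space is about 0.9 PB (the "~ 1PB" quoted), while full use of every 30 TB allowance would need about 9 PB, which matches the factor of 10 overcommitment.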

Reports on program-funded network upgrade activities

AGLT2

last meeting(s)
this meeting (2/05/14)
  • Will have an outage this Friday.
  • Still waiting on 40g wave to show up. Will use 10g links until then. Will cut over to Juniper when 40g is available.
  • MSU - orders are going out.

MWT2

last meeting(s)
this meeting (2/05/14)
  • At UC, an additional 6x10g links are being connected. Some technical fiber connection issues. Expect full 100g connectivity by the end of the week.
  • At Illinois - getting fiber from cluster to campus core.

SWT2-UTA

last meeting(s)
this meeting (2/05/14)
  • Started receiving equipment for the first round. $ available for the second increment. Expect six weeks out.

FY13 Procurements - Compute Server Subcommittee (Bob)

last meeting(s)
this meeting (2/05/14)
  • AGLT2: 17 R620s in production. 2 are being used for testing.
  • MWT2: 45/48 in production
  • NET2: 42 new nodes in production
  • SWT2: coming shortly (next week or two). 35-45 compute nodes.
  • WT2: receiving machines; 8/60 installed. The remainder in a week or two. These are shared resources, what fraction is owned by ATLAS? 30%?
  • Tier1: Acquiring samples of latest models for evaluation. Received a 12-core machine. Asked 3 vendors (HP, Dell, Penguin) for samples.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s): See notes from MinutesJan222014
this meeting (2/05/14)
  • AGLT2: With MWT2, requesting ESnet provide LHCONE VRF at CIC OmniPoP
  • MWT2: Requesting ESnet to provision direct 100 Gbps circuit between BNL and UC.
  • NET2: Slow progress at MGHPCC. Needs to get BU on board.
  • SWT2: Try to export a different subnet with a couple of machines as a test. If it works, will change over the Tier 2 network. UTA campus network people have coordinated with LEARN. Just need to coordinate with I2.
  • WT2: Already on LHCONE

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Completely full of production jobs. Plenty of analysis.
    • Getting requests for US regional production to speed-up requests. 1000 tasks waiting to be assigned. Wolfgang - when we're full, we're really full.
  • this meeting:
    • Lots of back pressure in the production system, many tasks.
    • MCORE not as much as expected; there are still some problems. The plan was to have MCORE up to 50% (ADC Coordination). Michael: there is a significant backlog in the queue. We want all the sites to have these available.

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    
    AMOD/ADCoS report from the ADC Weekly meeting:
    https://indico.cern.ch/getFile.py/access?contribId=5&resId=0&materialId=slides&confId=297019
    
    1)  1/24: OU-LUCILLE PRODDISK failing as a source for file transfers ~60%. Issue seemed to be only a transient high load on a storage server. Ticket closed on 1/27 
    since the errors stopped.  https://ggus.eu/ws/ticket_info.php?ticket=100529, eLog 47850.
    2)  1/25: UTA_SWT2 - file transfers were failing with SRM errors. Issue was traced to a dead xrootd process on one of the storage servers.  Restarting xrootd solved 
    the problem, and transfers resumed. (Notified shifters, but no ticket issued since the problem was quickly resolved.)
    3)  1/26: SWT2_CPB - file transfers failing with SRM errors. A storage server went off-line when the NIC in the machine shut down. We've seen this issue many times - 
    the cooling fan on the NIC fails, causing it to shut down. Fan replaced 1/27 a.m., issue resolved. https://ggus.eu/ws/ticket_info.php?ticket=100545 was closed on 1/28 
    after confirming there were no additional transfer failures. eLog 47855.
    4)  1/28: New pilot release from Paul (v58h) - details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58h.html
    5)  1/28: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=297019
    
    Follow-ups from earlier reports:
    
    None
    

  • this week: Operations summary:
    AMOD/ADCoS report from the ADC Weekly meeting:
    https://indico.cern.ch/getFile.py/access?contribId=3&resId=0&materialId=slides&confId=299399
    
    1)  1/29: BNL - file transfers were failing with SRM errors.  From Michael: The problem was solved by switching to the backup SRM database. Experts are investigating 
    why an SSD device was removed from the RAID set by the system. eLog 47892.
    2)  1/30: From Bob at AGLT2: A condor reconfiguration gone bad accidentally killed our full load of jobs at AGLT2.  I don't know how these are going to show up over the 
    next few days time.  Our own monitoring is a bit screwy right now on the subject.
    3)  1/30: BNL - https://ggus.eu/ws/ticket_info.php?ticket=100871 was opened when it appeared that file transfers were failing, possibly due to missing files at the site. 
    Issue was really due to a configuration problem with FTS/Gfal2. See details in https://its.cern.ch/jira/browse/LCGUTIL-334, 335. ggus ticket was closed, eLog 47914.
    4)  1/31: MWT2 - site was down due to a network outage. Restored as of ~3:00 p.m. CST. Expect some job failures with "lost heartbeat," file transfer errors, etc.
    5)  2/3 p.m. SWT2_CPB: https://ggus.eu/ws/ticket_info.php?ticket=100936 due to file transfers failing with "failed to contact on remote SRM" errors. A storage server crashed, 
    causing these errors. Issue resolved by restarting the host - still investigating the cause of the crash. ggus 100936 was closed, eLog 47953.
    6)  2/5 early a.m. BNL - https://ggus.eu/ws/ticket_info.php?ticket=100984 was opened for file transfer failures with the error "First non-zero marker not received within 
    180 seconds." Issue understood - from Hiro: This is due to the test gridftp doors we had for 100Gb/s test last week. The special circuit must have been removed for these 
    gridftp doors, causing traffic to fail. We disabled these gridftp doors - the failures should disappear. ggus ticket in-progress, no eLog ticket created.
    7)  2/4: ADC Weekly meeting:
    https://indico.cern.ch/conferenceDisplay.py?confId=299399
    
    Follow-ups from earlier reports:
    
    None
    

Data Management and Storage Validation (Armen)

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference prior meeting (1/22/14)
  • Ilija - doing some stress testing. There were some test datasets missing - one remains to be restored at BNL.
  • LFC-free N2N, added stability for FAX
  • Please use new Rucio gLFNs
  • Wei - discovered a load condition with new N2N, awaiting Andy to troubleshoot, could be an Xrootd problem itself.
  • Rucio renaming status: UTA - should be done, but Patrick wants to check a few things.
this meeting (2/5/14)

Site news and issues (all sites)

  • WT2:
    • last meeting(s): converted 600 old cores to rhel6 for ATLAS jobs. Additional 1600 still running rhel5, unlikely we'll use these. Site-wide power issue, some ATLAS nodes were affected.
    • this meeting: 28% of SLAC cores (10k) are ATLAS-owned. Found a Gratia bug that under-reports WT2 CPU usage for all production jobs. Found that GridFTP 6.14 (the current one in OSG RPMs) doesn't work well with the network settings suggested by ESnet (net.core.rmem, net.ipv4.tcp_rmem, ...wmem, etc.)

  • T1:
    • last meeting(s): Buying > 2PB of storage, replacing old, progressing. Replacing F10 equipment. Moving towards a high capacity interlink fabric, 3 Arista 7500s (100g trunks between cores). Space manager discovered problems with 2.6. (What settings did Hiro use to fix the problem?) Happens with high file transfer completion rate. Avoids growth.
    • this meeting:

  • AGLT2:
    • last meeting(s): Looking into OMD - Open Monitoring Distribution
    • this meeting:

  • NET2:
    • last meeting(s): Working on WAN connectivity issues.
    • this meeting: Moved HU gatekeeper to new hardware. SHA compliance is complete.

  • MWT2:
    • last meeting(s): David and Lincoln were able to get the 6248 connected to the Juniper - getting the new R620s online. Confirmation from UIUC of additional fiber inside ACB.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): MCORE, LHCONE. Resurrecting some old UTA_SWT2 servers. CPB issues. Rucio.
    • this meeting:

  • SWT2 (OU, OSCER):
    • last meeting(s):
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional and operational and active.
    • this meeting:

AOB

last meeting
  • Wei: new WLCG availability matrix is being calculated, see email sent to usatlas-t2-l. Awaiting clarification from Alessandro about including grid3-locations.txt in the availability test.
this meeting
  • WLCG will no longer check grid3-locations.txt for ATLAS releases.


-- RobertGardner - 05 Feb 2014
