r5 - 01 Oct 2014 - 15:25:08 - RobertGardner

MinutesOct012014

Introduction

Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:

Attending

  • Meeting attendees:
  • Apologies: Jason, Kaushik, Mayuko
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • US ATLAS Software & Computing / Physics Support meeting at LBL, August 20-22, 2014
      • Facility performance metrics meeting yesterday. See URL in email.
        • Google docs to describe plans and gather materials: http://bit.ly/V1T3Q8
        • Updates from Hiro's load testing
        • Need updates in all areas - okay to put URLs to external sources.
        • Notes from Facilities session at Berkeley: http://bit.ly/1BbyoJY
        • Updates to SiteCertificationP30 for HTCondorCE and DC14 columns.
          • Gather list of "To-do" resulting from findings in http://bit.ly/V1T3Q8
          • For example, at MWT2 we see a noticeable recurring error with lsm-puts.
        • US ATLAS-MWT2_UC is hosting the ATLAS ADC TIM meeting in Chicago, October 27-29. These technical interchange meetings are usually focused on system development and planning, and are organized by ADC management. Website is http://tim2014.mwt2.org/ - will be soliciting help with local organization.
        • Michael
          • See follow-up notes in http://bit.ly/1BbyoJY
          • Ilija - working with Torre to create a priority boosting algorithm. Will check with Tadashi, will need to tune weights accordingly.
      • this week:

LOCALGROUPDISK space monitoring (Mayuko)

Supporting the OSG VO on US ATLAS Sites

last meeting(s)

  • UTA - it's enabled. Still need to set up preemption though.

this meeting

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • meeting (9/3/14):
    • Nothing much new... analysis is low. Has not looked at overflow jobs.
    • Otherwise all is well.
    • LOCALGROUPDISK - there is a new monitoring page, now visible from BNL. Used for tracking usage policy. Set up by Mayuko. Storing it in Hiro's database. Will have a report in 4 weeks (October 1).
  • meeting (10/1/14):
    • Mark reporting. Again a note about MCORE developments.
    • Wisconsin - a long-standing ticket. Unresponsive. Transfers are still failing.
    • Would be useful to have #active slots displayed. Request for bigpanda monitor.

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week.
    
    1)  9/20: WISC - file transfers failing with "DESTINATION MAKE_PARENT srm-ifce err: Invalid argument, err: [SE][Mkdir][SRM_FAILURE]." 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108665 in progress, eLog 51118.
    
    2)  9/20: BNL - source file transfer failures with errors like "[SRM_INVALID_PATH] No such file or directory." Iris reported the issue was inconsistencies 
    in the chimera database, and the problem is being worked on. https://ggus.eu/index.php?mode=ticket_info&ticket_id=108669 in progress, eLog 51127. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108689 was opened on 9/22, but is really a duplicate ticket. eLog 51163.
    
    3)  9/21: AGLT2 - source file transfer failures with "[Failed to get source file size: srm-ifce err: Invalid argument, err: [SE][Ls][SRM_AUTHENTICATION_FAILURE] 
    httpg://head01.aglt2.org:8443/srm/managerv2: SRM Authentication failed]." From Shawn: We had some intermittent network issues over the weekend which caused 
    gPlazma services to timeout for some SRM requests. Problem being worked on. https://ggus.eu/index.php?mode=ticket_info&ticket_id=108679 in progress, eLog 51148.
    
    4)  9/23: ADC Weekly meeting:
    http://indico.cern.ch/event/332963/ (several relevant talks)
    
    Follow-ups from earlier reports:
    
    (i)  9/17: SLACXRD - source and destination file transfer errors. https://ggus.eu/index.php?mode=ticket_info&ticket_id=108584 in progress, eLog 51092.
    Update 9/22: no recent errors, ggus 108584 was closed. (No details in the ticket.)
    

  • this week: Operations summary:
    
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week
    
    1)  9/25: UTA_SWT2 - file transfers were failing with "SRM_AUTHORIZATION_FAILURE" errors. Host certificate had expired on the cluster GUMS server. 
    Cert was updated, issue resolved. https://ggus.eu/?mode=ticket_info&ticket_id=108818 was closed, eLog 51220.
    
    2)  9/29: SLACXRD - destination file transfer failures - errors like "Transfer process died with..." & "DESTINATION OVERWRITE srm-ifce err: Communication error on send." 
    Issue resolved as of 9/30 (bad hard drive?), so https://ggus.eu/index.php?mode=ticket_info&ticket_id=108893 was closed. eLog 51286.
    
    3)  9/30: ADC Weekly meeting:
    https://indico.cern.ch/event/339990/
    
    Follow-ups from earlier reports:
    
    (i)  9/20: WISC - file transfers failing with "DESTINATION MAKE_PARENT srm-ifce err: Invalid argument, err: [SE][Mkdir][SRM_FAILURE]." 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108665 in progress, eLog 51118.
    
    (ii)  9/20: BNL - source file transfer failures with errors like "[SRM_INVALID_PATH] No such file or directory." Iris reported the issue was inconsistencies 
    in the chimera database, and the problem is being worked on. https://ggus.eu/index.php?mode=ticket_info&ticket_id=108669 in progress, eLog 51127. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108689 was opened on 9/22, but is really a duplicate ticket. eLog 51163.
    Update 9/26: work re-indexing the database tables ongoing. As of 9/30 all files in the dashboard links were accessible.
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108669 was closed.
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108689 (duplicate of 108669) was closed on 9/26.
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108820 is another duplicate ticket - will also be closed.
    
    (iii)  9/21: AGLT2 - source file transfer failures with "[Failed to get source file size: srm-ifce err: Invalid argument, err: [SE][Ls][SRM_AUTHENTICATION_FAILURE] 
    httpg://head01.aglt2.org:8443/srm/managerv2: SRM Authentication failed]." From Shawn: We had some intermittent network issues over the weekend which caused 
    gPlazma services to timeout for some SRM requests. Problem being worked on. https://ggus.eu/index.php?mode=ticket_info&ticket_id=108679 in progress, eLog 51148.
    Update 9/26: Issue understood and resolved (see details in the ggus ticket). ggus 108679 was closed, eLog 51308.
    

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (9/3/14)
    • Deletion: observing reasonable rates, 12-14 Hz. A week ago it was down, but only for the U.S.
    • Request to open up webdav to NET2, SLAC, SWT2. Patrick was investigating. Hiro: use Xrootd's webdav, or apache with mod-gridsite. Patrick will investigate.
    • There are no significant amounts of dark data. 20-30 TB in USERDISK, per Tier2. DATADISK - varies by Tier2.
    • Emphasizing BNL, to clean up the non-Rucio data. DDM_TEST - looking dark, sometimes this is several hundred TB. 300 TB? There is also some cleanup in DATADISK - Hiro.
  • meeting (10/1/14)
    • Regarding dark data, we've seen a reduction.
    • There are a number of unassociated files in the Rucio file catalog, but the Rucio team is working on it.
    • Would like a systematic dump of the storage for each site. Previously some sites were providing this.
    • Wei: it was the other way around - we took dumps from dq2 and compared to storage.
    • Discussion with Tomas: wants a file catalog dump from each site.
    • Armen will send an email message.
    • Is this for continuous testing/checking? Rucio provides a daily file catalog dump.
    • Is CCC up to date?
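A minimal sketch of the dump comparison discussed above, assuming each dump is a plain-text list with one file path per line. The actual formats of the Rucio catalog dump and of the site storage dumps may differ, and the function name and sample paths are hypothetical. Files on storage but not in the catalog are "dark"; files in the catalog but not on storage are "lost".

```python
def compare_dumps(storage_lines, catalog_lines):
    """Compare a storage dump with a file catalog dump.

    Both arguments are iterables of lines, one file path per line
    (an assumed format, not the real dump layout).
    Returns (dark, lost) as sorted lists of paths.
    """
    # Normalize: strip whitespace/newlines and skip empty lines.
    storage = {line.strip() for line in storage_lines if line.strip()}
    catalog = {line.strip() for line in catalog_lines if line.strip()}

    dark = sorted(storage - catalog)   # on disk, unknown to the catalog
    lost = sorted(catalog - storage)   # in the catalog, missing on disk
    return dark, lost


# Hypothetical sample data for illustration only.
storage_dump = ["/atlas/datadisk/f1\n", "/atlas/datadisk/f2\n", "\n"]
catalog_dump = ["/atlas/datadisk/f2\n", "/atlas/datadisk/f3\n"]

dark, lost = compare_dumps(storage_dump, catalog_dump)
print("dark:", dark)
print("lost:", lost)
```

In practice the two arguments could be open file handles for the site dump and the daily catalog dump, since both are iterated line by line.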

DDM Operations (Hiro)

meeting (9/3/14)
  • No report.
meeting (10/1/14)
  • There is an FTS issue at CE

Condor CE validation (Saul, Wei, Bob)

meeting (8/6/14)
  • See Bob's instructions, experience at https://www.aglt2.org/wiki/bin/view/AGLT2/CondorCE
  • Saul: Augustine working with Tim Cartwright and Brian
  • Wei: working with Brian to get it working with LSF. Got a job submitted successfully, but with memory allocation errors.
  • Overall conclusion is that the software is just not there yet.
  • Xin - working on Condor CE at BNL
meeting (9/3/14)
  • New page from Xin: HTCondorCE
  • WT2: Brian has got it running in a test mode. Still need to understand which package to install. Then, will need to run HC jobs. Believes it's not ready for production. Also need to enable C-groups.
  • AGLT2: running HC on the test gatekeeper, running fine. Need rpms out of the OSG testing repo, to get best results. Debugging the job router is inadequate. Condor v8 has differences - there were simple changes that caused.
  • NET2: got to the stage of SGE job submissions running. There's a new version. Need to run at scale.
meeting (10/1/14)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

meeting (9/3/14)

meeting (10/1/14)

  • EU privacy issue now fixed.

ATLAS Connect (Dave, Rob)

meeting (8/6/2014)
meeting (9/3/2014)
  • Still working with TACC to get a CVMFS solution, of some nature, working on Stampede. They are setting up test nodes with FUSE enabled.
  • Replicate data on their Lustre file system - using ReplicateCVMFS
meeting (10/1/2014)

Site news and issues (all sites)

  • SWT2 (UTA):
    • last meeting(s): Network upgrade is done. Looking at increasing # analysis jobs. Received 30 compute nodes.
    • this meeting: working on bringing 30 compute nodes online

  • SWT2 (OU, OSCER):
    • last meeting(s): Ran 1800 job slots on OSCER - 3600 between LUCILLE, OCHEP, OSCER, really good.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Everything running smoothly.
    • this meeting:

  • WT2:
    • last meeting(s): Short outage August 7 for internal network upgrades.
    • this meeting:

  • T1:
    • last meeting(s): WAN connectivity: BNL is building a fully redundant SciDMZ, deploying new Juniper equipment provided by the Lab. Expect to have 200 Gbps. Looking into adding one more link. Currently conducting a large-scale test with Amazon; ESnet is helping tremendously in setting up a VPN between VA, CA, OR to allow running ATLAS production jobs at VLS (50-100k) with no networking bottlenecks. Hiro is setting up an SE in Amazon and has asked for a PB of storage. Discussion with FTS3 about adding the S3 protocol. (No SRM needed! Timescale? November.) Hiro is addressing a critical item associated with FAX: when jobs are running at BNL, traffic goes through the firewall rather than the SciDMZ, which interferes with lab traffic. He is setting up a (client-side) proxy.
    • this meeting:

  • AGLT2:
    • last meeting(s): Some flakiness in one of the 40g LR interfaces, testing. Getting ready for purchases, updated quotes from Dell. Analysis users running multi-core jobs by mistake. The analysis is WW, checked into SVN. Implement C-groups? GLOW even sent some MC jobs. Alden: it was a one-off.
    • this meeting: 14 R620's received, 3 M620s, waiting on storage and switch. MSU will delay purchases.

  • NET2:
    • last meeting(s): Getting ready to make purchases, and discussing options with Dell.
    • this week: close to making a purchase on R630s. Haswell, there are benchmarks.

  • MWT2:
    • last meeting(s): Had problems with storage servers rebooting - investigating load incidents. Closer to being consistent in CCC - 600k files are now dark, down from 5.7M files. Illinois - 14 nodes on order, damaged, getting a priority shipment. Perfsonar box at Illinois, strangely capping at 6 Gbps - related to Myricom driver. UC - delivery and installation of two 32x10g Juniper line cards; testing with a storage server at 20g LACP this week.
    • this meeting: Increase in MCORE. 14 new workers at UIUC - have benchmarks on Bob's twiki.

AOB

last meeting

this meeting


-- RobertGardner - 30 Sep 2014

Attachments


Site-VOmatrix-Sep-ATLAS.xlsx.pdf (27.7K) | RobertGardner, 01 Oct 2014 - 12:49
 