
MinutesFeb29

Introduction

Minutes of the Facilities Integration Program meeting, Feb 29, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); please announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Rob, Michael, John Hover, Marco, Saul, Wei, Tom, Kaushik, Mark, Armen, Shawn, Horst, Hiro, Fred, John B, Dave, Mark, Xin, Alden, Bob
  • Apologies:
  • Guests: Dan Fraser, Greg Thain, Jason

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
      • OSG All Hands meeting: March 19-23, 2012 (University of Nebraska at Lincoln). Program being discussed. As last year, part of the meeting will include co-located US ATLAS sessions and a joint US CMS/OSG session.
  • For reference:
  • Program notes:
    • last week(s)
      • ConsolidatingLFCStudyGroupUS
      • OSG All Hands meeting, https://indico.fnal.gov/conferenceTimeTable.py?confId=5109#20120319
        • Monday morning: US ATLAS Facilities
        • Monday afternoon: Federated Xrootd w/ CMS
        • Tuesday morning: plenary sessions on Campus grid and cloud
        • Tuesday afternoon: Joint CMS session on WLCG summaries
        • US ATLAS Facilities at OSG AH (tentative): https://indico.cern.ch/conferenceDisplay.py?confId=178216
      • GuideToJIRA - will use to coordinate perfSONAR and OSG 3.0 deployments this quarter.
      • Funding agencies are requesting metrics from the facility; these are diverse and cover multiple things: capacity deployment against the deadline, analysis performance, the validation matrix we used previously, and the site certification matrix for the phase (and maintaining its results). The matrix: SiteCertificationP19.
      • Deployment of capacities to be followed up.
      • 10G perfsonar deployed by end of quarter
      • OSG 3.0 deployment by end of quarter
      • LHCONE - LBNL working meeting; BNL is directly connected and the ESnet VRF zone is implemented. Should all just work. Close to being able to use the infrastructure.
      • Planned interventions should be all complete by end of March
    • this week

Guest topic: HTPC from OSG (Dan Fraser)

64bit Python and LFC clients - recent problems at SLAC

  • SLAC and UTA use direct reading for analysis jobs, with a 32-bit worker node environment; the SFN is converted to a URL. This works for most analysis jobs, but a certain number of prun jobs require 64-bit Python. The pilot then runs under 64-bit Python and therefore cannot load the 32-bit LFC client bindings, so it fails (a sketch of the mismatch follows below).
  • Is it a problem with the version of the worker node client? Wei - running a relatively old version.
  • Marco describes what happens with the current install of wn-client.
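  • As a rough illustration of the 32/64-bit mismatch, a minimal diagnostic sketch in Python (this is not the pilot's actual logic; the module name "lfc" matches the standard LFC Python bindings, but the probe itself is only an assumption about how one might test the problem):

    # Report the interpreter's word size and try to load the LFC bindings.
    # A 64-bit Python cannot load a 32-bit compiled LFC extension, which is
    # the failure mode described above.
    import platform
    import sys

    def check_lfc_bindings():
        print("Python %s (%s) at %s" % (platform.python_version(),
                                        platform.architecture()[0],
                                        sys.executable))
        try:
            import lfc  # LFC client bindings (compiled extension module)
            print("LFC bindings imported OK")
            return True
        except ImportError as err:
            print("Cannot load LFC bindings: %s" % err)
            return False

    if __name__ == "__main__":
        check_lfc_bindings()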

Follow-up on CVMFS deployments & plans

last meeting:
  • OU - it's deployed - will be there when the site comes back up, today or tomorrow - will start the validation.
  • UTA - ran into an issue with the test queue. In place now. Need Xin and Alessandro. Getting queue setup correctly on test gatekeeper - looks good.
  • No other sites have deployed yet.

this meeting:

  • UTA - very close to converting the SWT2 cluster to CVMFS. All tests were passing successfully - got sidetracked by a network issue.

rpm-based OSG 3.0 CE install

last meeting
  • Mirror of EPEL repo - need to follow-up
this meeting
  • In production at BNL.
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: on near term agenda.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the installation of a new UPS - expecting delivery, which will also require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Drop off in production
    • mc12 - energy to be raised to 8 TeV - but validation continues.
    • Sites should take downtimes asap.

  • this meeting:
    • Running ~50%. Expect another week before mc12 starts, so still a very good time for downtimes.
    • The crunch will start after that.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Storage status looks good. More deletion errors - LFC errors, but these were caught and fixed. (AGLT2 and NET2)
    • Allocation of GROUPDISK at Tier2s at ~ 400 TB.
    • Discussion about setting quota - they are in ToA.
    • US ATLAS policy for data management - consistent guidance for sites. Use of the auto-adjuster - sites keep a minimum level. Shawn's notes are in the Savannah ticket. Action item for the DDM group.
  • this meeting:
    • Consistency checking cleaned up.
    • A number of GGUS tickets for central deletion closed.
    • Space token implementation policy - recommendation for space token sizes. Too much free space in GROUPDISK.

Shift Operations (Mark)

  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=1&confId=179341
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-2_20_2012.html
    
    1)  2/15: NET2 - Jobs from task 710174 were failing at HU with the error "libg2c.so.0: cannot open shared object file: No such file or directory."
    From John: Looks like a missing 32-bit compat-libf2c-34 on some new nodes we have, I'm installing now.  eLog 33828.
    2)  2/15: WISC DDM errors ("failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]").  As of 2/17 the issue with updating certificates was resolved.  
    Closed ggus 79265, eLog 33882.
    3)  2/16: UTD-HEP - jobs were failing with the error "CORAL/RelationalPlugins/sqlite Error SQLiteStatement::prepare 11 database disk image is malformed."  Two problematic 
    WN's (g7 & g10) were removed from production, site set back on-line.  eLog 33859, Savannah ticket http://savannah.cern.ch/support/?126432.
    4)  2/16: Job failures at NET2 ("The following ReadCatalogs: file:/cvmfs/atlas.cern.ch/repo/conditions/poolcond/PoolFileCatalog.xml could have not been started (probably they 
    are non-existing)."  Failures were occurring on two WN's.  They were drained, and CVMFS then restarted.  eLog 33860 (also see details from Saul: 
    http://egg.bu.edu/gpfs1/ATLAS/panda/history/2011/reports/studies/cvmfs/).
    5)  2/16: Job failures at MWT2 (WN uct2-c226.mwt2.org) with errors about an atlas s/w release, 17.1.3.  Node off-lined and drained, then CVMFS was re-started to 
    repair the 17.1.3 area. 
    6)  2/17: DDM deletion errors at WT2 - ggus 79335 in-progress.  See the ticket for more details about the ongoing investigations.  eLog 33931.
    7)  2/19: MWT2 - ggus 78999 was re-opened due to ~1.3k "lost heartbeat" job failures.  These errors are difficult to debug owing to lack of log files.  eLog 33911.
    8)  2/20 early a.m.: File transfer errors at SWT2_CPB ("has trouble with canonical path. cannot access it").  A re-start of the xrootdfs process on the SRM host solved this 
    problem.  Since the site was set off-line during this period it was necessary to validate it with test jobs prior to resetting on-line.  Investigating an issue with the test job outputs 
    not getting transferred back to BNL.  ggus 79355 / RT 21701 in-progress, eLog 33954.
    9)  2/21: System is being rolled out to allow for the auto-exclusion of problematic panda production sites, in the same way as is already done for analysis sites.  Not yet enabled 
    in the US cloud.  See eLog 33973, also https://indico.cern.ch/materialDisplay.py?contribId=12&materialId=slides&confId=178429
    10) 2/22: AGLT2 - file transfer failures with "locality is unavailable" errors.  From Shawn: All dCache pools are functioning now. The problem was related to heavy load and 
    memory pressures on the nodes. We have disabled "scatter-gather" on the 10GE NICs via 'ethtool -K eth2 sg off' to see if that helps reduce the problem in the future.  
    ggus 79477 closed, eLog 33984.
    
    Follow-ups from earlier reports:
    
    (i)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    (ii)  2/9: MWT2 file transfer errors (" Failed : All Ready slots are taken and Ready Thread Queue is full. Failing request]").  ggus 79080 in-progress, eLog 33683.
    Update 2/17: ggus 79080 marked as 'solved', but no details available. 
    (iii)  2/15: NET2 PHYS-TOP - file transfer errors ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  ggus 79252, eLog 33820.
    Update 2/18: Probably due to occasional slowness on the BU gatekeeper.  Improvements have been made to reduce the incidence of these kinds of errors - plus the host will 
    soon be upgraded to new hardware.  ggus 79252 closed.
    
    • Site state for Panda has changed - sites are put into "test" rather than "brokeroff".
    • Pilot releases from Paul
    • DDM robot certificate - the VOMS extension on the cert expired, causing errors everywhere
    • Analysis queues were auto-offlined everywhere; the problem was a cron job at CERN, and the issue is now resolved.
    • Alden: Schedconfig failure resulting from a race condition with subversion updates, causing folders to be deleted. This has been fixed.

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=180223
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-2_27_2012.html
    
    1)  2/22: UTD-HEP set off-line to replace a failed hard drive.  Issue resolved as of 2/25 - test jobs successful, set back on-line.  eLog 34046,
    https://savannah.cern.ch/support/index.php?126583 (Savannah site exclusion).
    2)  2/23: BU_ATLAS_Tier2 - job failures with errors like "Get error: and sys.argv[0].find('lsm-df') -1: log(lsm called: %s % cmd) File /atlasgrid/osg-wn-1.0.0/lsm/lsm-get, line 
    508, in log else: open(LOGFILE, 'a').write(msg+'\n') IOError: [Errno 13] Permission denied."  Saul reported the problem was fixed.  ggus 79544 closed, eLog 34017.
    3)  2/23: OU_OCHEP_SWT2 - cluster upgrade completed, test jobs finished successfully, site back on-line in production.  eLog 34019.
    4)  2/24: Shifter reported a problem with pilots at BNL (expired proxy).  Issue only affected test pilots for a different VO, so production not impacted.
    eLog 34033.
    5)  2/24: cloudconfig.tier1 database field changed from BNL_ATLAS_1 to BNL_CVMFS_1 to reflect migration of resources to the newer panda site.
    6)  2/25: HU_ATLAS_Tier2 - problem with CVMFS on two WN's caused production job failures ("Trf installation dir does not exist and could not be installed") and activated 
    jobs in the analysis site got stuck.  Issue resolved, eLog 34057.
    7)  2/25: AGLT2 - jobs were failing with stage-in errors.  Bob reported there was a problem with some dCache servers around this time, so presumably the cause.  
    No recent errors.  eLog 34053.
    8) 2/28: New pilot release from Paul (version number 51a).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_51a.html
    9)  2/28: File transfer errors to various US cloud sites with the error "Invalid SRM version [] for endpoint []."  From Hiro: There were typos in the FTS info system.  
    The service type must be capitalized as 'SRM', but for several storage endpoints it was entered in lower case, which caused the errors.  This has been resolved.  
    ggus 79703 closed, eLog 34102.
    
    Follow-ups from earlier reports:
    
    (i)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    (ii)  2/17: DDM deletion errors at WT2 - ggus 79335 in-progress.  See the ticket for more details about the ongoing investigations.  eLog 33931.
    (iii)  2/19: MWT2 - ggus 78999 was re-opened due to ~1.3k "lost heartbeat" job failures.  These errors are difficult to debug owing to lack of log files.  eLog 33911.
    (iv)  2/20 early a.m.: File transfer errors at SWT2_CPB ("has trouble with canonical path. cannot access it").  A re-start of the xrootdfs process on the SRM host solved this 
    problem.  Since the site was set off-line during this period it was necessary to validate it with test jobs prior to resetting on-line.  Investigating an issue with the test job 
    outputs not getting transferred back to BNL.  ggus 79355 / RT 21701 in-progress, eLog 33954.
    Update 2/27: Initial sets of test jobs were failing due to output files transfer timeouts.  Not sure what the issue was.  Test jobs submitted Friday/Saturday (2/24,2/25) finished 
    successfully.  Set the production site back on-line.  ggus 79355 / RT 21701 closed.  eLog 34092.
    (v)  2/21: System is being rolled out to allow for the auto-exclusion of problematic panda production sites, in the same way as is already done for analysis sites.  Not yet 
    enabled in the US cloud.  See eLog 33973/34118, also
    https://indico.cern.ch/materialDisplay.py?contribId=12&materialId=slides&confId=178429
    
    • New policy for auto exclusion for production problems
    • US not yet added to the system (see link above)

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • Sites need to be getting 10G perfSONAR hardware for their bandwidth nodes.
    • Traceroute tests - need to check configurations.
    • LHCONE baseline measured
  • this meeting:
    • Getting LHCONE sites set up for testing; some sites have been lagging.
    • Goal is to get a baseline measurement. Probably about a week or two away.
    • Tom has been working on a new version of the dashboard.

Federated Xrootd deployment in the US (Wei)

last week(s) this week:
  • Two monitoring packages are being evaluated. UCSD package - looks interesting. Can customize for our needs.

Tier 3 GS

last meeting:
  • Removal of DATADISK at sites like Wisconsin and other Tier 3s. Kaushik and Armen wanted to draft something to this effect. Armen believes this is limited to Wisconsin.
  • UTD - do they need DATADISK?
  • Make this a coherent facility policy.
this meeting:

Site news and issues (all sites)

  • T1:
    • last meeting(s): Allowing more analysis jobs. Working on a new Condor version - with the new group quota implementation. Whole-node scheduling with Condor. Post software week we will make plans for the facility; there are options (e.g. pilot).
    • this meeting: Intervention planned for next Tuesday - a border router change that would affect production in the US, primarily affecting the OPN connectivity. Moving to the Nexus generation.

  • AGLT2:
    • last meeting(s): Things running well - will be planning a downtime for network re-configs and new equipment. Continuing virtualization - all services virtualized and with site redundancy; MSU has hardware on order for this. Bob: working on cfengine to quickly reconfigure worker nodes; applying it to both sites makes changes more flexible and faster. Will run in failover mode first, then in a disaster recovery mode; a future look at resiliency, with hot failover, will come down the road.
    • this meeting: March 7 down day. Will move network switching around. Change back to single-rapid spanning tree. Reconfigured space tokens. Will be testing OSG wn rpm.

  • NET2:
    • last meeting(s): New workers being installed at BU and HU (~ 1800 worker nodes). Will be trying out OSG 3.0.
    • this meeting: Downtime soon - 900 TB racked and ready. New gatekeeper, file server. Seeing CVMFS failure every two or three days; annoying since you have to drain the nodes. New worker nodes up at HU and BU. Bestman2: brought it up, had trouble restarting so needs more work; on agenda

  • MWT2:
    • last meeting(s): Progress continues on deploying new hardware at all sites. Storage completed at UC. Working on compute nodes at IU and UC. Campus cluster meeting at UIUC focused on the first compute deployment, networking, storage, and head nodes (condor, squid). UIUC nodes are flocked to from the main MWT2 queues and integrated in sysview, http://www.mwt2.org/sys/view/. Have run significant OSG opportunistic jobs during the drainage.
    • this meeting: Continuing to bring new compute nodes online (60 R410), gatekeeper problems earlier this week, updating. Waiting on hardware to arrive at the Illinois CC. Using existing pilot hardware. CC folks did major reconfiguration to accommodate MWT2. High connectivity up to Chicago. UIUC integration proceeding: squid and condor head nodes setup. Adding CC nodes should be straightforward.

  • SWT2 (UTA):
    • last meeting(s): Focusing on CVMFS. Xin got the atlas-wn installed.
    • this meeting: CVMFS on UTA_SWT2 is done - waiting on completion of Alessandro's validation tests. Convert UTA_CPB by end of week. Have had an issue transferring files back to BNL - surprisingly. Also slow transfers coming in. Discussions with Hiro and campus networking folks. Perfsonar history does show problems about three weeks ago, and deteriorated last week. Still don't have a good answer as to what changed. Continue investigations.

  • SWT2 (OU):
    • last meeting(s): Upgrade from Hell. Two Dell servers were faulty - iDRAC issues and frequent crashes. Replaced motherboards, replaced memory, swapped CPUs. The Lustre upgrade went well.
    • this meeting: Came out of downtime last week. All looks good, pleased with storage performance: 1100 MB/s for gridftp I/O, 250-350 MB/s on average to BNL. Started filling up over the weekend. The Lustre servers work better as the cluster fills. Handling a large number of analysis jobs.

  • WT2:
    • last meeting(s): Discussions with Hiro about a gridftp-only door, bypassing SRM. Experimenting with EOS, an xrootd-based system.
    • this meeting: wn-client issue.

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Having problems with Engage, since they want gridftp from the worker node. Possible problems with the GLOW VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week

AOB

last week this week


-- RobertGardner - 28 Feb 2012



Attachments


pdf HTPC_2012.pptx.pdf (6449.7K) | RobertGardner, 29 Feb 2012 - 10:02 |
pptx HTPC_2012.pptx (223.3K) | RobertGardner, 29 Feb 2012 - 10:02 |
png screenshot_01.png (40.4K) | RobertGardner, 29 Feb 2012 - 12:58 |
 