
MinutesFeb17

Introduction

Minutes of the Facilities Integration Program meeting, Feb 17, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • further meetings this past week regarding T3 planning.
    • several working groups for T3's - distributed storage, DDM, PROOF; 3-month timeframe, intermediate report in 6 weeks.
    • output will be the recommended solutions
    • T3's in the US will be working in parallel. Expect funding soon.
    • Expect a communication from Massimo regarding a call for participation in the working groups.
  • this week:

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-%28Feb2-8-2010%29.html
    
    1)  2/3: MWT2_UC -- failed jobs with stage-in errors:
    2010-02-03T19:51:37| !!FAILED!!2999!! Error in copying (attempt 1): 1099 - lsm-get failed (28169):
    2010-02-03T19:51:37| !!FAILED!!2999!! Failed to transfer EVNT.106517._000459.pool.root.1: 1099 (Get error: Staging input file failed)
    From Charles:
    I deployed a new version of pcache on MWT2_UC and something seems to have gone wrong. If I can't get this resolved quickly I'll revert to
    the previous version. ==> Issue resolved.  eLog 9202.
    2)   2/4: Problem affecting test jobs submitted to sites which now have the value "cmtConfig = i686-slc5-gcc43-opt" in schedconfigdb has been resolved.  Test scripts need an explicit job.cmtConfig = 'i686-slc4-gcc34-opt' option.  
    (Jobs were otherwise failing with errors like "Required CMTCONFIG (i686-slc5-gcc43-opt) 
    incompatible with that of local system (i686-slc4-gcc34-opt).")
    3)  2/4 - 2/5: MWT2_IU -- job failures with errors like:
    2010-02-04T20:01:34| !!FAILED!!2999!! Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)
    Expired certificate updated -- issue resolved.
    4)  2/5: New pilot version from Paul (42b):
    The following change was just applied to the pilot: voatlas57 was added to the server list. Requested by Tadashi Maeno.
    5)  2/5: HU_ATLAS_Tier2 set 'on-line' after test jobs completed successfully.
    6)  2/5: UTD-HEP set to 'on-line' following a site maintenance outage.  Test jobs completed successfully.
    7)  2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/
    More to say about this later as we see how its use evolves.
    8)  2/8: AGLT2 -- gatekeeper gate01.aglt2.org crashed.  Site admins decided to use the opportunity to perform some s/w updates (SL5.4 and osg 1.2.6).  Maintenance completed, successful test jobs -- back to 'on-line'.  Savannah 62568, eLog 9369.
    9)  2/8: NET2 -- job failures with pilot error about missing file DBRelease-8.5.1.tar.gz.  Copy on disk had the name DBRelease-8.5.1.tar.gz__DQ2-1265117046.  (This can happen when there is a transfer problem -- a subsequent transfer names the file with the "__DQ2-..." extension.)  
    This becomes an issue for the pilot.  From Paul:
    A flaw in LocalSiteMover (the LFN is not used in the lsm-get command, only the path). Strange it was not noticed before. I will try to squeeze the fix into the pilot version to be released ASAP (pending an unrelated discussion).  RT 15341.  (A sketch of the LFN-resolution idea appears after this report's follow-ups.)
    10)  2/8: AGLT2 -- DDM transfer errors like:
    [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY] Source file [srm://head01.aglt2.org/pnfs/aglt2.org/atlasproddisk/mc09_7TeV/log/
    e511_s703/mc09_7TeV.105730.Pythia_direct_Jpsie3X.simul.log.e511_s703_tid106704_00/log.106704._000433.job.log.tgz.2]:
    locality is UNAVAILABLE]
    Problem was a networking issue at MSU -- resolved.  ggus 55348, eLog 9353.
    11)  2/8 - 2/10: Intermittent slowness when accessing the panda monitor.  Two potential fixes:
    (i) Restart of host voatlas21.cern.ch (ii) Some of the httpd server settings modified to allow more threads to run.
    12)  2/9: MWT2_UC -- From Sarah:
    MWT2_UC drained this morning due to an unresponsive DNS server.  Now that the server is back up the cluster is recovering, but I expect that there will be failed jobs associated with the event. ==> No significant job failures observed.
    13)  2/10: MWT2_UC and ANALY_MWT2 offline for system maintenance (upgrade storage element and associated software).
    14)  2/10: SLAC -- SLACXRD_PRODDISK transfers to BNL-OSG2_MCDISK failed with error:[INVALID_PATH] source file doesn't exist.
    From Wei: Thanks for reporting, this is fixed.  RT 15416, eLog 9397.
    
    Follow-ups from earlier reports:
    i)  Sites SWT2_CPB & ANALY_SWT2_CPB: maintenance outage is almost complete.  Expect to resume production sometime on 2/10.
    ii) Reminder: analysis jamboree at BNL 2/9 - 2/12.
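    
    Regarding item 9: the "__DQ2-..." suffix is added by DDM when a retried transfer finds an earlier copy on disk, so a mover that locates input files by path alone misses the renamed copy. A minimal sketch of the idea behind the fix, in Python with hypothetical helper names (not the actual pilot or LocalSiteMover code):
    
        import glob
        import os
        import shutil
        
        def resolve_local_copy(storage_dir, lfn):
            """Find a file by LFN, accepting DDM-renamed copies like <lfn>__DQ2-<timestamp>."""
            exact = os.path.join(storage_dir, lfn)
            if os.path.exists(exact):
                return exact
            # A retried transfer may have left the copy as <lfn>__DQ2-...; match on the LFN prefix.
            candidates = glob.glob(os.path.join(storage_dir, lfn + "__DQ2-*"))
            return candidates[0] if candidates else None
        
        def lsm_get(storage_dir, lfn, dest_dir):
            """Stage an input file into the job work directory, whatever its on-disk name."""
            src = resolve_local_copy(storage_dir, lfn)
            if src is None:
                raise IOError("lsm-get failed: no replica found for %s" % lfn)
            dest = os.path.join(dest_dir, lfn)  # always expose the plain LFN to the job
            shutil.copy(src, dest)
            return dest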
    
  • this meeting:
     Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=85286
    
    1)  2/10: MWT2_UC -- Transfer errors from MWT2_UC_PRODDISK to BNL-OSG2_MCDISK, with errors like "[FTS] FTS State [Failed] FTS Retries [7] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]."  Issue resolved.  RT 15424, eLog 9427.
    2)  2/11: SWT2_CPB -- maintenance outage complete, test jobs finished successfully, site set back to 'on-line'.
    3)  2/11: AGLT2 -- ~200 failed jobs with the error "Get error: dccp get was timed out after 18000 seconds ." From Bob:
    This was fixed around 9am EST today.  There was a missing routing table entry.  Normal operations have since resumed.  There may be more jobs reporting this failure until everything is caught up.  Only half of AGLT2 worker nodes were affected by this.
    4)  2/11: New pilot version from Paul (42c) --
    * "t1d0" added to LFC replica sorting algorithm to prevent such tape replicas from appearing before any disk residing replicas. Requested by Hironori Ito et al.
    * Pilot is now monitoring the size of individual output files. Max allowed size is 5 GB. Currently the check is performed once every ten minutes. New error code was introduced: 1124 (pilot error code), 'Output file too large', 441003/EXEPANDA_OUTPUTFILETOOLARGE (proddb error code). 
    Panda monitor, Bamboo and ProdDB were updated as well. Requested by Dario Barberis et al. 
    * The limit of the total size of all input files is now read from the schedconfig DB (maxinputsize). The default site value is set to 14336 MB (by Alden Stradling). Brokering has been updated as well (by Tadashi Maeno). Any problem reading the schedconfig value (not set/bad chars) 
    will lead to pilot setting its internal default as before (14336 MB). maxinputsize=0 means unlimited space. 
    Monitoring of individual input sizes will be added to a later pilot version.
    * Fixed problem with local site mover; making sure that all input file names are defined by LFN. Previously problems occurred with PFN's containing the *__DQ2-* part in the file name (copied into the local directory, leading to a "file not found" problem). Requested by John Brunelle.
    * Fixed issue with FileStager access mode not working properly for sites with direct access switched off. Discovered by Dan van der Ster.
    * Removed panda servers voatlas19-21 from internal server list. Added voatlas58-59 (voatlas57 was added in pilot v 42b). Requested by Tadashi Maeno.
    5)  2/11: IllinoisHEP -- Jobs were failing with the errors "EXEPANDA_GET_FAILEDTOGETLFCREPLICAS " & "EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED (261)."  It seems the Tier 3 LFC was unresponsive for a period of time, causing these errors.  Issue resolved by Hiro.  RT 15427, eLog 9468.
    6)  2/11: AGLT2, power problem -- from Bob:
    At 10:55am EST a power trip at the MSU site dropped 12 machines accounting for up to 144 job slots.  Not all of those slots were running T2 jobs, but many were.  Those jobs were lost.
    7)  2/12 - 2/13: Reprocessing validation jobs were failing at MWT2_IU due to missing ATLAS release 15.6.3.  The release was subsequently installed by Xin.
    8)  2/12: Reprocessing job submission begun with tag r1093.
    9)  2/13: BNL -- Issue with data transfers resolved by re-starting the pnfs and SRM services.  Problem appears to be related to high loads induced by reprocessing jobs.  Another instance of the problem on 2/15.  See details in eLog 9524, 9508.
    10)  2/16: Power outage at NET2.  From Saul:
    NET2 has had a power outage at the BU site.  All systems have been rebooted and brought back to normal.
    Test jobs completed successfully, site set back to 'on-line'.
    11)  2/16: Test jobs submitted to UCITB_EDGE7 to verify the latest OSG version (1.2.7, a minor update of version 1.2.6 -- contains mostly security fixes).  Jobs completed successfully.
    12)  2/17: Maintenance outage at AGLT2 beginning at 6:30 a.m. EST.  From Shawn:
    The dCache headnodes have been restored on new hardware running SL5.4/x86_64. We are now working on BIOS/FW updates and will then rebuild the storage(pool) nodes. Outage is scheduled to end at 5 PM Eastern but we hope to be back before then.
    13)  2/17: Note from Paul about impending pilot update:
    Currently the pilot only downloads a job if the WN has at least 5 GB of available space. As discussed in a separate thread, we need to increase this limit to guarantee (well..) that a job using a lot of input data can finish and not run out of local space. 
    It was suggested by Kaushik to use the limit 14 + 5 + 2 = 21 GB (14 GB for input files, 5 GB for output, 2 GB for log).  
    Please let me know as soon as possible if this needs to be discussed any further.  (A sketch of this space accounting appears after the follow-ups below.)
    
    Follow-ups from earlier reports:
    (i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/
    More to say about this later as we see how its use evolves.
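    
    A minimal sketch of the space accounting described in items 4 (pilot v42c) and 13 above, in Python; the function names and the schedconfig lookup are illustrative, not the actual pilot code. The input limit comes from schedconfig's maxinputsize (default 14336 MB, 0 meaning unlimited), and the proposed pre-download check adds a 5 GB output allowance and a 2 GB log allowance, which is where the 21 GB figure comes from:
    
        import os
        
        MB = 1024 * 1024
        DEFAULT_MAX_INPUT_MB = 14336    # pilot-internal default if the schedconfig value is missing/bad
        OUTPUT_ALLOWANCE_MB = 5 * 1024  # max allowed size of an individual output file
        LOG_ALLOWANCE_MB = 2 * 1024     # allowance for the job log
        
        def max_input_mb(schedconfig_value):
            """Return the input-size limit in MB; fall back to the default on a missing/bad value."""
            try:
                return int(schedconfig_value)   # 0 means unlimited
            except (TypeError, ValueError):
                return DEFAULT_MAX_INPUT_MB
        
        def enough_local_space(workdir, schedconfig_value):
            """Pre-download check: require input + output + log allowances of free local space."""
            limit = max_input_mb(schedconfig_value)
            if limit == 0:
                return True  # unlimited input size: skip the check (illustrative choice)
            required_mb = limit + OUTPUT_ALLOWANCE_MB + LOG_ALLOWANCE_MB  # 14336 + 5120 + 2048, ~21 GB
            stat = os.statvfs(workdir)
            free_mb = stat.f_bavail * stat.f_frsize // MB
            return free_mb >= required_mb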
    

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Recent progress on Panda job rebrokering:
      • removed the restriction of US physicists going to US queues (no DN brokering)
      • more information to users when a task is submitted
      • also pcache development (Charles, Tadashi) providing WN-level brokering (use local WN disk); agnostic about its source, which could be an NFS backend. Another goal is to integrate this into the pilot so as to remove site admin involvement. (See the sketch at the end of this section.)
    • BNL queues cannot keep up with the high pilot rate because of Condor-G problems; Xin is investigating.
    • DAST involvement in blacklisting problematic analysis sites. Starting this week DAST receives an email twice a day for sites failing GangaRobot jobs. A procedure is being set up to act on these failures.
  • this meeting:
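    (Sketch for the pcache item under "last meeting" above.) The WN-level idea is to keep a cache directory on the worker node's local disk and hard-link or copy a cached input into the job directory instead of re-fetching it; the fetch itself can come from any backend (dCache, NFS, ...). A minimal illustration in Python with hypothetical names, not the actual pcache code:
    
        import os
        import shutil
        
        def cached_get(cache_dir, lfn, dest_dir, fetch):
            """Serve an input file from the WN-local cache, fetching it only if absent.
            
            'fetch' is whatever callable stages the file from site storage
            (dccp, lsm-get, an NFS copy, ...) - the cache is agnostic about it.
            """
            cached = os.path.join(cache_dir, lfn)
            if not os.path.exists(cached):
                os.makedirs(cache_dir, exist_ok=True)
                fetch(lfn, cached)                 # a single real transfer per worker node
            dest = os.path.join(dest_dir, lfn)
            try:
                os.link(cached, dest)              # hard link: no extra disk space used
            except OSError:
                shutil.copy(cached, dest)          # fall back to a copy across filesystems
            return dest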

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Postpone discussion of proposal on renaming sites for consistency
      I am wondering if we can agree on a consistent site naming convention
      for the various services in the ATLAS production system used in the US.  There
      seems to be confusion among people/shifters outside the US in
      identifying the actual responsible site from the various names used in the US
      production services/queues.   In fact, some of them have openly
      commented on their frustration with this difficulty in the computing logs. 
      Hence, I am wondering if we can/should put in the effort to use
      consistent naming conventions for the site names used in the various
      systems.    Below, I have identified some of the systems where
      consistent naming would help users. 
      
      1.  PANDA site name
      2.  DDM site name
      3.  BDII site name
      
      Since these three names appear at the front of the major ATLAS
      computing monitoring systems, a good, consistent name for each site in
      these three separate systems should help ease problems encountered by
      others.   So, is it possible to change any of the names?  (I
      know some of them are a pain to change.   If needed, I can make a table of
      the names used for each site in these three systems.)
      
      Hiro 
    • FTS 2.2 coming soon - an update will be required 2 weeks after certification --> close to when we can consolidate site services
    • Be prepared ---> will consolidate DQ2 site services from the Tier 2s in the week following the FTS upgrade
    • proddisk-cleanse change from Charles - working w/ Hiro
    • Tier 3 in ToA will be subject to FT and blacklisting; under discussion
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Fred - previous performance tests were invalid - found the Squid was out of the loop; client machines must be added to the squid.conf file. New tests show very good performance.
    • So should T3 sites maintain their own Squid? Doug thinks every site should, and it's easy to do. CVMFS - the web file system gets better performance with a local Squid, which will speed up access for releases and conditions data.
    • There are two layers of security - source and destination; John: there are recommendations in the instructions.
    • There is a discussion about how feasible it is to install Squid at each Tier 3 - John worries about the load on the associated Tier 2s.
    • Can also use CVMFS. Testing at BU.
    • There was an issue of HC performance at BU relative to AGLT2 for conditions data access jobs. Fred will run his tests against BU.
  • this week

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • focus was on perfSONAR - new release to be out this Friday; Jason: no show-stoppers. All sites should upgrade next week.
    • Fixes bugs identified at our sites - hope this release is resilient enough for the T3 recommendations
    • Next meeting Feb 9
  • this week:
    • Minutes:
      USATLAS Throughput Meeting Notes - February 9, 2010
          =====================================================
      
      Attending:  Shawn, Aaron(UC), Dave, Karthik, Sarah, Jason, Andy, Zafar
      Excused: Horst
      
      1) perfSONAR and site reports.
           Release still on target for February 17th.   Testing looks good so far.  Will want 3-4 sites (OU, MSU, UM and IU) to try next release candidate prior to the official release on Feb 17th.   These sites should be ready to test things around the end of the week.  
      	WT2 - Yee has deployed tools on AFS.   Being tested on private network for now. Zafar is working on remastering his own version of an ISO.  Working with Aaron/Internet2 on this.  Shawn will send example configs based upon his node setup. 
      	SWT2_OU - perfSONAR mostly working well, but one-way latency results are empty.  Regular perfSONAR testing is stopped on the throughput node.  
      	NET2 - No report
      	MWT2 - UC perfSONAR: large SMU server packet loss (one way only), SNMP not running on either node.  Throughput service stopped.   UC dCache headnodes upgraded to 64-bit.  Increased threads for postgres and SRM.  IU still has a firewall issue; a request is in place to open the appropriate ports.   Regular perfSONAR testing is stopped on one service.  IU nodes upgraded to 64-bit and threads updated as at UC.  
      	Illinois - perfSONAR throughput service crashed.  Perhaps related to transitions from "active" to "not active"?   Perhaps related to problematic nodes being tested against?   Network asymmetry results have interested the local network folks.  Will be looking into this over  the next couple of weeks.
      	AGLT2 - Issues at both MSU and UM with throughput tests stopping and needing restart.   MSU reports 3.1.2 RC2 version for the latency node seems to have fixed problems found previously.   AGLT2 is undergoing a dCache upgrade (storage/headnodes migrating from SL4 to SL5.4, dCache 1.9.5-10->1.9.5-15) later this week or early next week.
      
      Topic 2)  Information presented on possible "transactional" tests to be added to the automated infrastructure testing that Hiro developed.  Current tests are for bandwidth and data-transfer testing: 10-20 fixed files transferred between sites (Tier-n), with results on successful transfers, time (min/max/avg) and bandwidth saved and graphed.  The plan is to add some kind of transaction testing focused on measuring the number of (small) files that can be transferred between sites in a fixed time window, emphasizing the overhead in such transactions.   Details postponed until Hiro can attend the call (next time).  (A sketch of such a test follows these minutes.)
      
      AOB - None
      
      Plan to meet again in two weeks at the usual time (Feb 23).   All sites should plan to upgrade perfSONAR once the release is ready on the 17th (within 1 week).   We may be able to get this deployed just prior to LHC physics running...
      
      Shawn
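      
      A minimal sketch of the transaction-style test described in topic 2, in Python: push a batch of small files one at a time until a fixed window closes and report how many complete, which emphasizes per-transfer overhead rather than bandwidth. The copy client and endpoints are placeholders, not Hiro's actual test code:
      
          import subprocess
          import time
          
          def transaction_test(src_urls, dest_prefix, window_seconds=300):
              """Copy small files one at a time until the window closes; count completions."""
              done, failed = 0, 0
              deadline = time.time() + window_seconds
              for url in src_urls:
                  if time.time() >= deadline:
                      break
                  dest = dest_prefix + url.rsplit("/", 1)[-1]
                  # Placeholder client; the real tests would use the site transfer tools.
                  rc = subprocess.call(["globus-url-copy", url, dest])
                  if rc == 0:
                      done += 1
                  else:
                      failed += 1
              return done, failed
          
          # Example (hypothetical endpoints):
          # done, failed = transaction_test(urls, "gsiftp://dest.example.org/testdata/", 300)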

Site news and issues (all sites)

  • T1:
    • last week(s): Ongoing issue with Condor-G - incremental progress has been made, but new effects have been observed, including a slow-down in job throughput. Working with the Condor team, some fixes were applied (new condor-q) which helped for a while; decided to add another submit host to the configuration. New HPSS data movers and network links.
    • this week:

  • AGLT2:
    • last week: Failover testing of WAN, all successful. On Tuesday will be upgrading dCache hosts to SL5.
    • this week:

  • NET2:
    • last week(s): Upgraded GK host to SL5 and OSG 1.2.6 at BU; HU in progress. LFC upgraded to latest version.
    • this week:

  • MWT2:
    • last week(s): Completed upgrade of MWT2 (both sites) to SL 5.3 and new Puppet/Cobbler configuration build system. Both gatekeepers at OSG 1.2.5, to be upgraded to 1.2.6 at next downtime. Delivery of 28 MD1000 shelves.
    • this week:

  • SWT2 (UTA):
    • last week: Extended downtime Thursday/Friday to add new storage hardware plus the usual SW upgrades; hope to be done by end of week.
    • this week:

  • SWT2 (OU):
    • last week: Started getting more equipment from the storage order; continue to wait for hardware.
    • this week:

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware.
    • this week:

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented and is just finishing the test cases [Pedro]

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Preliminary capacity report is now working:
      This is a report of pledged installed computing and storage capacity at sites.
      Report date:  2010-01-25
      --------------------------------------------------------------------------
       #       | Site                   |      KSI2K |       HS06 |         TB |
      --------------------------------------------------------------------------
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_CE_2             |        100 |        640 |          0 |
       3.      | AGLT2_SE               |          0 |          0 |      1,060 |
      --------------------------------------------------------------------------
       Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       4.      | BU_ATLAS_Tier2         |      1,910 |          0 |        400 |
      --------------------------------------------------------------------------
       Total:  | US-NET2                |      1,910 |          0 |        400 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |          0 |          0 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          1 |
       7.      | BNL_ATLAS_SE           |          0 |          0 |          0 |
      --------------------------------------------------------------------------
       Total:  | US-T1-BNL              |          0 |          0 |          1 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       8.      | MWT2_IU                |      3,276 |          0 |          0 |
       9.      | MWT2_IU_SE             |          0 |          0 |        179 |
       10.     | MWT2_UC                |      3,276 |          0 |          0 |
       11.     | MWT2_UC_SE             |          0 |          0 |        200 |
      --------------------------------------------------------------------------
       Total:  | US-MWT2                |      6,552 |          0 |        379 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       12.     | OU_OCHEP_SWT2          |        464 |          0 |         16 |
       13.     | SWT2_CPB               |      1,383 |          0 |        235 |
       14.     | UTA_SWT2               |        493 |          0 |         15 |
      --------------------------------------------------------------------------
       Total:  | US-SWT2                |      2,340 |          0 |        266 |
      --------------------------------------------------------------------------
      
       Total:  | All US ATLAS           |     12,472 |     11,040 |      2,106 |
      --------------------------------------------------------------------------
      
    • Debugging underway
  • this meeting

AOB

  • last week
  • this week
    • none


-- RobertGardner - 09 Feb 2010
