Minutes of the Facilities Integration Program meeting, Feb 10, 2010 -- cancelled
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • further meetings this past week with regard to T3 planning.
    • several working groups for T3s (distributed storage, DDM, PROOF); three-month timeframe, with an intermediate report in six weeks.
    • the output will be the recommended solutions
    • T3s in the US will be working in parallel. Funding is expected soon.
    • Expect a communication from Massimo regarding a call for participation in the working groups.
  • this week:

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/28: New pilot version from Paul (42a) --
    * CERNVM modifications allowing the pilot to communicate with the dispatcher and run jobs inside the virtual machine. Output file handling is done in a similar way to NG, using the mv site mover. Requested by CERNVM team.
    * Refactoring in various pilot codes to facilitate glExec integration (next couple of pilot versions) and multi-job handling.
    * Error handling added to curl class responsible for server communications.
    * The pilot now searches for orphan processes after a looping job is found and eliminates them. Requested by John Brunelle.
    * Changed 'source setup.sh' to 'source ./setup.sh' in payload setup. Requested by Rod Walker.
    * Fix for space token handling related to the issue of datasets being split between MCDISK and MCTAPE. Space tokens did not arrive at the pilot in the expected order. Requested by Stephane Jezequel et al.
    * Replaced slc5,gcc43 tags with the cmtconfig value in the setup used for SITEROOT extraction/dynamic trf installation. This was seen to have caused problems at TRIUMF with SL(C)5 jobs (fixed earlier by installing a patch release). Requested by Rod Walker.
    2)  1/28: Issues with the storage systems at MWT2_IU following some system upgrades resolved -- from Hiro:
    UC experienced problems with the storage earlier today.  Some of the hosts for the storage disk pools ran out of memory due to swap spaces that were too small after the recent system upgrade.  To correct the problem, the affected hosts were restarted, resulting in failures in jobs and transfers.
    Since the fix, all are reported to be working correctly.
    Also, from Sarah (1/29):
    MWT2_IU has completed the upgrade of our gatekeeper to OSG 1.2.5 and worker nodes to SL5 x86_64, and test jobs have completed successfully.  We are setting the site back online in OIM, Panda and FTS.
    3)  1/28 - 2-1: ddm upgrades at NET2.  Noticed some CA-related issues after re-starting -- resolved, system now back on-line.  RT 15282.
    4)  1/28: Xavier has set up a twitter page for ADCoS shift announcements / news: https://twitter.com/ADCShifts.
    5)  1/29 (early a.m. U.S. time): panda server unavailable due to typo in the ToA -- issue resolved.  eLog 9060.
    6)  Sites SWT2_CPB & ANALY_SWT2_CPB will be off-line 2/2 - 2/5 for system upgrades.  eLog 9098.
    7)  2/1: SLAC -- CE is off-line -- from Wei:
    CE at SLAC is down. We got error from the disk array hosting Grid home and Atlas releases. Investigating.
    Site is currently off-line in panda.
    8)  2/1: From Hiro, regarding one of the SAM tests with problems at U.S. sites:
    For the CE test, you can ignore the current failure results due to the lack of the "lcg-tags" command. The SAM test will be modified to accommodate
    the US/OSG site configuration.
    9)  Again this week, a very low level of production jobs in the U.S. cloud.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: 'curl.config'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    Update from Paul (1/28): Error handling added to curl class responsible for server communications (see pilot update above).
    ii)  1/11: From Paul, discussing possible changes to the pilot to deal with the issue of orphaned processes left behind following athena crashes (seen by John at BU and Horst at OU):
    Since the parent pid is set to 1 (a result of athena core dumping?), the pilot doesn't identify it as belonging to its chain of processes. I will see if the pilot can identify it by e.g. also looking at the pgid. [Added to the todo list].
    Update from Paul (1/28): The pilot now searches for orphan processes after a looping job is found and eliminates them. Requested by John Brunelle (see pilot update above; a sketch of this approach appears at the end of this report).
    iii) Reminder: analysis jamboree at BNL 2/9 - 2/12.
  • this meeting:
     Yuri's summary from the weekly ADCoS meeting:
    1)  2/3: MWT2_UC -- failed jobs with stage-in errors:
    2010-02-03T19:51:37| !!FAILED!!2999!! Error in copying (attempt 1): 1099 - lsm-get failed (28169):
    2010-02-03T19:51:37| !!FAILED!!2999!! Failed to transfer EVNT.106517._000459.pool.root.1: 1099 (Get error: Staging input file failed)
    From Charles:
    I deployed a new version of pcache on MWT2_UC and something seems to have gone wrong. If I can't get this resolved quickly I'll revert to
    the previous version. ==> Issue resolved.  eLog 9202.
    2)   2/4: Problem affecting test jobs submitted to sites which now have the value "cmtConfig = i686-slc5-gcc43-opt" in schedconfigdb has been resolved.  Test scripts need an explicit job.cmtConfig = 'i686-slc4-gcc34-opt' option.  
    (Jobs were otherwise failing with errors like "Required CMTCONFIG (i686-slc5-gcc43-opt) 
    incompatible with that of local system (i686-slc4-gcc34-opt).")
    3)  2/4 - 2/5: MWT2_IU -- job failures with errors like:
    2010-02-04T20:01:34| !!FAILED!!2999!! Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)
    Expired certificate updated -- issue resolved.
    4)  2/5: New pilot version from Paul (42b):
    The following change was just applied to the pilot: voatlas57 was added to the server list. Requested by Tadashi Maeno.
    5)  2/5: HU_ATLAS_Tier2 set 'on-line' after test jobs completed successfully.
    6)  2/5: UTD-HEP set to 'on-line' following a site maintenance outage.  Test jobs completed successfully.
    7)  2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/
    More to say about this later as we see how its use evolves.
    8)  2/8: AGLT2 -- gatekeeper gate01.aglt2.org crashed.  Site admins decided to use the opportunity to perform some s/w updates (SL5.4 and osg 1.2.6).  Maintenance completed, successful test jobs -- back to 'on-line'.  Savannah 62568, eLog 9369.
    9)  2/8: NET2 -- job failures with pilot error about missing file DBRelease-8.5.1.tar.gz.  Copy on disk had the name DBRelease-8.5.1.tar.gz__DQ2-1265117046.  (This can happen when there is a transfer problem -- a subsequent transfer names the file with the "__DQ2-..." extension.)  
    This becomes an issue for the pilot.  From Paul:
    A flaw in LocalSiteMover (the lfn is not used in the lsm-get command, only the path). Strange it was not noticed before. I will try to squeeze in the fix in the pilot version to be released asap (pending an unrelated discussion).  RT 15341.  (A sketch of the lfn-based fallback appears at the end of this report.)
    10)  2/8: AGLT2 -- DDM transfer errors like:
    [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY] Source file [srm://head01.aglt2.org/pnfs/aglt2.org/atlasproddisk/mc09_7TeV/log/
    locality is UNAVAILABLE]
    Problem was a networking issue at MSU -- resolved.  ggus 55348, eLog 9353.
    11)  2/8 - 2/10: Intermittent slowness when accessing the panda monitor.  Two potential fixes:
    (i) Restart of host voatlas21.cern.ch; (ii) some of the httpd server settings were modified to allow more threads to run.
    12)  2/9: MWT2_UC -- From Sarah:
    MWT2_UC drained this morning due to an unresponsive DNS server.  Now that the server is back up the cluster is recovering, but I expect that there will be failed jobs associated with the event. ==> No significant job failures observed.
    13)  2/10: MWT2_UC and ANALY_MWT2 offline for system maintenance (upgrade storage element and associated software).
    14)  2/10: SLAC -- SLACXRD_PRODDISK transfers to BNL-OSG2_MCDISK failed with error:[INVALID_PATH] source file doesn't exist.
    From Wei: Thanks for reporting, this is fixed.  RT 15416, eLog 9397.
    Follow-ups from earlier reports:
    i)  Sites SWT2_CPB & ANALY_SWT2_CPB: the maintenance outage is almost complete.  Expect to resume production sometime on 2/10.
    ii) Reminder: analysis jamboree at BNL 2/9 - 2/12.
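    A minimal sketch of the orphan-process cleanup described in the pilot updates above (identifying processes left behind after a looping job by their process-group id). This is illustrative only, not the actual pilot code; the helper names and the /proc parsing are assumptions.
      import os
      import signal

      def find_orphans(pgid):
          # Hypothetical helper: return PIDs that have been re-parented to init
          # (ppid == 1) but still belong to the job's process group -- the
          # signature of processes left behind after an athena crash.
          orphans = []
          for pid in [p for p in os.listdir('/proc') if p.isdigit()]:
              try:
                  with open('/proc/%s/stat' % pid) as f:
                      # fields after the ')' in /proc/<pid>/stat: state, ppid, pgrp, ...
                      rest = f.read().rsplit(')', 1)[1].split()
                  ppid, proc_pgid = int(rest[1]), int(rest[2])
              except (IOError, IndexError, ValueError):
                  continue  # the process exited while we were scanning
              if ppid == 1 and proc_pgid == pgid:
                  orphans.append(int(pid))
          return orphans

      def kill_orphans(pgid):
          # Terminate any orphans found for the job's process group.
          for pid in find_orphans(pgid):
              try:
                  os.kill(pid, signal.SIGTERM)
              except OSError:
                  pass  # already gone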
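    Similarly, a rough illustration of the LocalSiteMover flaw discussed in item 9) above: if the get operation is given the LFN rather than only a path, it can fall back to a copy that a retried transfer renamed with a "__DQ2-<timestamp>" suffix. The function below is a hypothetical stand-in, not the real lsm-get.
      import glob
      import os
      import shutil

      def lsm_get(storage_dir, lfn, workdir):
          # Hypothetical lfn-aware get: look the file up by LFN and, if the
          # exact name is missing, fall back to a '__DQ2-...' renamed copy.
          src = os.path.join(storage_dir, lfn)
          if not os.path.exists(src):
              candidates = glob.glob(src + '__DQ2-*')
              if not candidates:
                  raise IOError('no replica found for %s' % lfn)
              src = candidates[0]
          # Always deliver the file to the job under the expected LFN.
          shutil.copy(src, os.path.join(workdir, lfn))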

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Recent progress on Panda job rebrokering:
      • removed the restriction of US physicists going to US queues (no DN brokering)
      • more information to users when a task is submitted
      • also pcache development (Charles, Tadashi) providing WN-level brokering (using the local WN disk); it is agnostic about its source, which could be an NFS backend. Another goal is to integrate this into the pilot so as to remove site admin involvement. (A rough sketch of the caching idea appears after this list.)
    • BNL queues cannot keep up with the high pilot rate because of Condor-G problems; Xin is investigating.
    • DAST involvement in blacklisting problematic analysis sites. Starting this week DAST receives an email twice a day for the sites failing GangaRobot jobs. A procedure is being set up to act on these failures.
  • this meeting:
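    A toy sketch of the worker-node-level caching idea behind pcache mentioned above: check a local-disk cache before staging a file from the storage element. The cache path and the copy_from_se callable are placeholders; this is not the actual pcache implementation.
      import os

      CACHE_DIR = '/scratch/pcache'   # placeholder local worker-node cache area

      def get_input_file(lfn, workdir, copy_from_se):
          # copy_from_se(lfn, dest_path) stands in for the real site mover.
          # Assumes the cache and the job workdir sit on the same local disk.
          if not os.path.isdir(CACHE_DIR):
              os.makedirs(CACHE_DIR)
          cached = os.path.join(CACHE_DIR, lfn)
          target = os.path.join(workdir, lfn)
          if not os.path.exists(cached):
              copy_from_se(lfn, cached)   # cache miss: stage in once, then reuse
          os.link(cached, target)         # hard-link into the workdir, no extra copy
          return target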

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Postpone discussion of proposal on renaming sites for consistency
      I am wondering if we can agree on a consistent site naming convention
      for the various services in the ATLAS production system used in the US.
      There seems to be confusion among people/shifters outside of the US when
      trying to identify the actual responsible site from the various names
      used in the US production services/queues.  In fact, some of them are
      openly commenting in the computing log on the frustration this
      difficulty causes.  Hence, I am wondering if we can/should put in the
      effort to use consistent naming conventions for the site names used in
      the various systems.  Below, I have identified some of the systems where
      consistent naming would help users:
      1.  PANDA site name
      2.  DDM site name
      3.  BDII site name
      Since at least these three names appear at the front of the major ATLAS
      computing monitoring systems, good, consistent naming for each site in
      these three separate systems should help ease the problems encountered
      by others.  So, is it possible to change any of the names?  (I know some
      of them are a pain to change.  If needed, I can make a table of the
      names used for each site in these three systems.)
    • FTS 2.2 coming soon - an update will be demanded 2 weeks after certification --> close to when we can consolidate site services
    • Be prepared ---> consolidation of DQ2 site services from the Tier 2s will follow in the week after the FTS upgrade
    • proddisk-cleanse change from Charles - working w/ Hiro
    • Tier 3 sites in the ToA will be subject to FT (functional tests) and blacklisting; under discussion
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Fred - previous performance tests were invalid; he found that squid was out of the loop, since it requires adding the client machines to the squid.conf file. New tests show very good performance. (A minimal example of such a squid.conf addition appears after this list.)
    • So should T3 sites maintain their own squid? Doug thinks every site should, and it is easy to do. CVMFS (a web file system) gets better performance if you have a local squid, so it will speed up access for releases and conditions data.
    • There are two layers of security - source and destination; John: there are recommendations in the instructions.
    • There is a discussion about how feasible it is to install Squid at each Tier 3 - John worries about the load on the associated Tier 2s.
    • Can also use CVMFS. Testing at BU.
    • There was an issue of HC performance at BU relative to AGLT2 for conditions data access jobs. Fred will run his tests against BU.
  • this week
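    As a minimal illustration of the squid.conf change mentioned above (letting the site's client machines use the local squid), a fragment along these lines would be added; the subnet is a placeholder for a site's actual client machines.
      # Placeholder ACL for the site's client/worker machines
      acl atlas_clients src 192.168.0.0/16
      http_access allow atlas_clients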

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • focus was on perfSONAR - a new release is to be out this Friday; Jason: no show stoppers. All sites should upgrade next week.
    • Fixes bugs identified at our sites - the hope is that this release is resilient enough for the T3 recommendations
    • Next meeting Feb 9
  • this week:
    • Minutes:
      USATLAS Throughput Meeting Notes - February 9, 2010
      Attending:  Shawn, Aaron(UC), Dave, Karthik, Sarah, Jason, Andy, Zafar
      Excused: Horst
      1) perfSONAR and site reports.
           Release still on target for February 17th.   Testing looks good so far.  Will want 3-4 sites (OU, MSU, UM and IU) to try next release candidate prior to the official release on Feb 17th.   These sites should be ready to test things around the end of the week.  
      	WT2 - Yee has deployed tools on AFS.   Being tested on private network for now. Zafar is working on remastering his own version of an ISO.  Working with Aaron/Internet2 on this.  Shawn will send example configs based upon his node setup. 
      	SWT2_OU - perfSONAR mostly working well, but the one-way latency results are empty.  Regular perfSONAR testing on the throughput node has stopped.
      	NET2 - No report
      	MWT2 - UC perfSONAR: large SMU server packet loss (one way only); SNMP not running on either node.  Throughput service stopped.  UC dCache headnodes upgraded to 64-bit.  Increased threads for postgres and SRM.  IU still has a firewall issue; a request is in place to open the appropriate ports.  One perfSONAR service has regular testing stopped.  IU nodes upgraded to 64-bit and threads updated as at UC.
      	Illinois - perfSONAR throughput service crashed.  Perhaps related to transitions from "active" to "not active"?   Perhaps related to problematic nodes being tested against?   Network asymmetry results have interested the local network folks.  Will be looking into this over  the next couple of weeks.
      	AGLT2 - Issues at both MSU and UM with throughput tests stopping and needing restart.   MSU reports 3.1.2 RC2 version for the latency node seems to have fixed problems found previously.   AGLT2 is undergoing a dCache upgrade (storage/headnodes migrating from SL4 to SL5.4, dCache 1.9.5-10->1.9.5-15) later this week or early next week.
      2)  Information presented on possible "transactional" tests to be added to the automated infrastructure testing that Hiro developed.  Current tests are for bandwidth and data-transfer testing: 10-20 fixed files transferred between sites (Tier-n), with results on successful transfers, time (min/max/avg), and bandwidth saved and graphed.  The plan is to add some kind of transaction testing focused on measuring the number of (small) files that can be transferred between sites in a fixed time window, emphasizing the overhead in such transactions.  Details postponed until Hiro can attend the call (next time).  (A rough sketch of such a test appears after these notes.)
      AOB - None
      Plan to meet again in two weeks at the usual time (Feb 23).   All sites should plan to upgrade perfSONAR once the release is ready on the 17th (within 1 week).   We may be able to get this deployed just prior to LHC physics running...
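    A rough sketch of what the proposed "transactional" test could look like: count how many small files complete within a fixed time window, keeping per-file times so min/max/avg can be reported. The transfer_file callable and the window length are placeholders; this is not Hiro's implementation.
      import time

      def transaction_test(files, transfer_file, window_seconds=600):
          # transfer_file(f) stands in for an actual SRM/gridFTP copy and
          # returns True on success.
          start = time.time()
          times = []
          completed = 0
          for f in files:
              if time.time() - start > window_seconds:
                  break   # fixed window exhausted
              t0 = time.time()
              ok = transfer_file(f)
              times.append(time.time() - t0)
              if ok:
                  completed += 1
          return {'completed': completed,
                  'min': min(times) if times else None,
                  'max': max(times) if times else None,
                  'avg': sum(times) / len(times) if times else None}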

Site news and issues (all sites)

  • T1:
    • last week(s): Ongoing issue with condor-g - incremental progress has been made, but new effects have been observed, including a slow-down in job throughput. Working with the Condor team, some fixes were applied (new condor-q), which helped for a while; decided to add another submit host to the configuration. New HPSS data movers and network links.
    • this week:

  • AGLT2:
    • last week: Failover testing of WAN, all successful. On Tuesday they will be upgrading dCache hosts to SL5.
    • this week:

  • NET2:
    • last week(s): Upgraded the GK host at BU to SL5 and OSG 1.2.6; HU in progress. LFC upgraded to the latest version.
    • this week:

  • MWT2:
    • last week(s): Completed upgrade of MWT2 (both sites) to SL 5.3 and new Puppet/Cobbler configuration build system. Both gatekeepers at OSG 1.2.5, to be upgraded to 1.2.6 at next downtime. Delivery of 28 MD1000 shelves.
    • this week:

  • SWT2 (UTA):
    • last week: Extended downtime Thursday/Friday to add new storage hardware plus the usual software upgrades; hope to be done by the end of the week.
    • this week:

  • SWT2 (OU):
    • last week: Started getting more equipment from the storage order; continuing to wait for hardware.
    • this week:

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware.
    • this week:

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented and is just finishing implementing test cases [Pedro]

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, there are some validation jobs still going on and some problems to solve. If anyone wants to migrate, go ahead, but we are not pushing right now. We want plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed (a simple check is sketched after this list).
    • Consult UpgradeSL5
  • this week
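    A simple sketch of the compat-library check mentioned above: query rpm for the 32-bit compatibility packages an ATLAS release typically needs on SL5. The package list here is an example only; consult UpgradeSL5 for the authoritative set.
      import os
      import subprocess

      # Example set of compatibility packages; an assumption, not the official list.
      COMPAT_RPMS = ['compat-libstdc++-33', 'compat-gcc-34', 'compat-gcc-34-c++']

      def missing_compat_libs():
          # Return the packages that rpm does not report as installed.
          missing = []
          with open(os.devnull, 'w') as devnull:
              for pkg in COMPAT_RPMS:
                  rc = subprocess.call(['rpm', '-q', pkg],
                                       stdout=devnull, stderr=devnull)
                  if rc != 0:
                      missing.append(pkg)
          return missing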

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Preliminary capacity report is now working:
      This is a report of pledged installed computing and storage capacity at sites.
      Report date:  2010-01-25
       #       | Site                   |      KSI2K |       HS06 |         TB |
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_CE_2             |        100 |        640 |          0 |
       3.      | AGLT2_SE               |          0 |          0 |      1,060 |
       Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
               |                        |            |            |            |
       4.      | BU_ATLAS_Tier2         |      1,910 |          0 |        400 |
       Total:  | US-NET2                |      1,910 |          0 |        400 |
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |          0 |          0 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          1 |
       7.      | BNL_ATLAS_SE           |          0 |          0 |          0 |
       Total:  | US-T1-BNL              |          0 |          0 |          1 |
               |                        |            |            |            |
       8.      | MWT2_IU                |      3,276 |          0 |          0 |
       9.      | MWT2_IU_SE             |          0 |          0 |        179 |
       10.     | MWT2_UC                |      3,276 |          0 |          0 |
       11.     | MWT2_UC_SE             |          0 |          0 |        200 |
       Total:  | US-MWT2                |      6,552 |          0 |        379 |
               |                        |            |            |            |
       12.     | OU_OCHEP_SWT2          |        464 |          0 |         16 |
       13.     | SWT2_CPB               |      1,383 |          0 |        235 |
       14.     | UTA_SWT2               |        493 |          0 |         15 |
       Total:  | US-SWT2                |      2,340 |          0 |        266 |
       Total:  | All US ATLAS           |     12,472 |     11,040 |      2,106 |
    • Debugging underway
  • this meeting


  • last week
  • this week
    • none

-- RobertGardner - 09 Feb 2010
