
MinutesMar3

Introduction

Minutes of the Facilities Integration Program meeting, Mar 3, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees:
  • Apologies: Rob

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Hardware recommendations within 2 weeks.
    • Frontier-squid setup in an SL5 virtual machine at ANL ASC (push-button solution)
  • this week:

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=86064
    
    1)  2/17: NET2 -- following recovery from a power outage, test jobs completed successfully -- site set back to 'on-line'.
    2)  2/18: UTD-HEP set 'off-line' in preparation for OS and other s/w upgrades.
    3)  2/19: New pilot version from Paul (42e):
    * Began the process of tailing the brief pilot error diagnostics (not yet completed). Long error messages were previously cut off by the 256-char limit on the server side, which often led to the actual error not being displayed. Some (not yet all) error messages will now be tailed 
    (i.e. the tail of the error message will be shown rather than only the beginning of the string). Requested by I. Ueda. (A rough sketch of this tailing logic is included at the end of this shifters report.)
    * Now grabbing the number of events from non-standard athena stdout info strings (which are different for running with "good run list"). See discussion in Savannah ticket 62721.
    * Added dCache sub directory verification (which in turn is used to determine whether checksum test or file size test should be used on output files). Needed for sites that share dCache with other sites. Requested by Brian Bockelman et al.
    * Pilot queuedata downloads now use a new format for retrieving the queuedata from schedconfig. Not yet added to the autopilot wrapper. Requested by Graeme Stewart.
    * DQ2 tracing report now contains PanDA job id as well as hostname, ip and user dn (a DQ2 trace can now be traced back to the original PanDA job). Requested by Paul Nilsson(!)/Angelos Molfetas.
    * Size of user workdir is now allowed to be up to 5 GB (previously 3 GB). Discussed in separate thread. Requested by Graeme Stewart.
    4)  2/19 - 2/22: AGLT2 -- dCache issues discovered while the site was coming back on-line following a maintenance outage for s/w upgrades (SL5, etc.)  Issue eventually resolved.  Test jobs succeeded, back to 'on-line'.  ggus 55709, eLog 9750.
    5)  2/19: Oracle outage at CERN on 2/18 described here:
    https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/9644
    6)  2/19: SLAC -- DDM errors like "FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]."  Issue resolved -- from Wei:
    This is a name space problem. I will investigate. In the meantime, I switched to running xrootdfs (the pnfs equivalent) without using the name space. I hope DDM will retry.
    7)  2/19: Large numbers of tasks were killed in the US & DE clouds to allow high priority ones to run.  (The high priority tasks were needed for a min-bias paper in preparation.)  eLog 9645.
    8)  2/20: From John at Harvard -- HU_ATLAS_Tier2 set back to 'on-line' after test jobs completed successfully.
    9)  2/23: BNL -- FTS upgraded to v2.2.3.  From Hiro:
    This is just to inform you that the BNL FTS has been upgraded to the checksum-capable version. There will be some tests of this capability.  Also, as we have planned all along, the consolidation of DQ2 site services will happen after some tests in the coming weeks, after BNL DQ2 is upgraded next week.
    10)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    
    Follow-ups from earlier reports: 
    
    (i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/
    More to say about this later as we see how its use evolves.
    (ii) This past week:  What began with test jobs at UCITB_EDGE7 to verify the latest OSG release (1.2.7) led to the question of having another site available besides UCITB.  Long mail thread about this topic.  Conclusion (from Alden, 2/23):  
    Things are pretty much resolved. We'll need to create a new queue for ATLAS ITB activities, 
    and shift all ATLAS focus away from the BNL_ITB_Test1-condor queue. I'll get that started this afternoon.
    (iii) Issue about pilot space checking / minimum space requirements noted last week -- has there been a decision here?
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=86925
    
    1)  This past week -- ongoing work to set up a new BNL ITB test site.  Long mail thread about this topic.
    2)  2/24 - 2/26: AGLT2 -- transfer errors for AGLT2_DATADISK.  Most likely a network issue -- from Shawn:
    We have seen no indication of a problem with our AGLT2_DATADISK space-token area. We have seen intermittent network problems (as observed by the perfSONAR OWAMP testing between AGLT2 and BNL) where there are periods of large packet loss. 
    Our best guess is that these problems were correlated with some network issue along the path between AGLT2 and BNL. Almost all the time
    between this ticket and now the ARDA dashboard has been "Green" for the AGLT2 space-token areas.  ggus 55854, eLog 9786.
    3)  2/24 - 2/25: SWT2_CPB -- data transfers were failing due to a problem with one of the data servers.  Data from this machine was replicated elsewhere in the cluster, resolving this issue.  ggus 55895, eLog 9810, RT 15556
    4)  2/25: Transfer errors at BNL.  Issue resolved.  eLog 9849, ggus 55936, RT 15563.
    5)  2/25: New pilot version from Paul (42f):
    The pilot has been updated. The mini-release contains a few corrections to the DQ2 tracing reports (added DN for job owner, corrections for input file dataset, missing appdir variable for log transfers). Requested by Angelos Molfetas.
    6)  2/25: SLAC -- brief outage to upgrade OS of SRM host and upgrade kernel for LFC DB host -- completed.
    7)  2/26: MWT2_UC -- from Sarah:
    We've encountered a hardware fault on our LFC server, and are working to repair it. MWT2_UC and Analy_MWT2 will remain offline until it is back up.  2/27: update from Rob:
    The server hosting the LFC catalog failed and is being restored. In addition we are re-synching LFC, dCache and DQ2 central catalogs for data in the MWT2_UC_* space token areas. MWT2_UC and ANALY_MWT2 should be kept offline until both of these are complete.  
    As of 3/2 the issue was resolved, test jobs completed successfully, and all MWT2_* sites were set back to 'online'.
    eLog 9919, 9920, 9987.
    8)  2/26: NET2 -- Transfer errors like:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNEC
    TION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries].  Issue with certificate updates -- From John:
    Even though the cronjob is present, and this had been running regularly, something critical was updated so recently that we needed to --force an update.  RT 15580, eLog 9896.
    9)  2/28, 3/1: Additional space added to MWT2_UC_MCDISK after transfers failed with "id=381513 does not have enough space" errors.  eLog 9934.
    10)  3/1: BNL -- new gatekeeper is available.  From Xin:
    Just to let you know that there is a new gatekeeper, gridgk05.racf.bnl.gov, which can run production/analysis/pandamover jobs now at BNL.  Please feel free to make new queues, or direct existing queue pilots to it.
    11)  3/2: SLAC -- almost 100% job failures with the error "unable to safeguard against Oracle overload due to ORA-12170: TNS:Connect timeout occurred."  Problem understood -- from Wei:
    A new set of batch nodes that we are bringing online don't have the correct setup for tunneling (actually via iptables and xinetd) to BNL Oracle. I am sending an inquiry about this issue. Hopefully this is just an oversight on our part.
    12)  3/3 (early a.m.): BNL DQ2 site services upgraded.
    
    Follow-ups from earlier reports:
    
    (i) Issue about pilot space checking / minimum space requirements noted last week -- has there been a decision here?
    (ii)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
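
Item 3) of the last-meeting summary above notes that pilot 42e now reports the tail of long error messages rather than the head, since the 256-character server-side limit was hiding the actual error. As a rough illustration only (the function name, truncation marker, and limit below are assumptions, not the pilot's actual code), the logic amounts to keeping the end of the string when it exceeds the limit:

    def tail_error_message(message, limit=256, marker="[...] "):
        # Keep the tail of an over-long error message so it fits within `limit`
        # characters; the informative part of a long error string is usually at the end.
        if len(message) <= limit:
            return message
        return marker + message[-(limit - len(marker)):]

    if __name__ == "__main__":
        err = "Traceback: " + "x" * 300 + " IOError: input file not found on SE"
        print(tail_error_message(err))
        print(len(tail_error_message(err)))  # always <= 256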
    

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Panda user analysis profile is dynamic. Some days 25k jobs queued, 20k of them in US (most at BNL).
    • Looked at the queued jobs at BNL. Some users hardwired the BNL site name in their submission. Their jobs use input ESDs and run at BNL, not at other sites where replicas are available (CERN, NDGF, one UK site). Should be a site performance issue, as the brokerage checks this. Investigating. (A toy sketch of how a hardwired site bypasses brokerage follows this list.)
    • Saul looked at 5k random jobs at BNL for Feb. 19-20 to understand the job profiles better. A pdf file is attached here for statistics.
      • 56% of jobs running on ESDs, 21% on AODs, others on RAW, EVNT, DESD.
      • The two users submitting a large fraction of the jobs are non-US users.
      • US Tier 2s do not have as much input data located on-site as BNL; ESD data distribution is to the Tier 1s.
  • this meeting:
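
As noted in the last-meeting bullets above, some users hardwire the BNL site name in their submissions, which keeps their jobs at BNL even when input replicas exist elsewhere. The toy sketch below is not PanDA brokerage code; the site names, replica map, and queue-depth metric are invented for illustration. It simply shows why a hardwired site short-circuits any replica-aware choice:

    # Toy illustration (not PanDA brokerage code); all names and numbers are made up.
    replicas = {"data09_900GeV.ESD": {"BNL", "CERN", "NDGF", "UK-SITE"}}
    queued_jobs = {"BNL": 20000, "CERN": 3000, "NDGF": 500, "UK-SITE": 800}

    def broker(dataset, requested_site=None):
        # A hardwired site bypasses the choice; otherwise pick the least-loaded
        # site that actually holds a replica of the input dataset.
        sites_with_data = replicas[dataset]
        if requested_site:
            return requested_site
        return min(sites_with_data, key=lambda s: queued_jobs[s])

    print(broker("data09_900GeV.ESD"))         # least-loaded replica holder (NDGF here)
    print(broker("data09_900GeV.ESD", "BNL"))  # always BNL, regardless of the backlog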

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Postpone discussion of proposal on renaming sites for consistency
      I am wondering if we can agree on a consistent site naming convention
      for the various services in the ATLAS production system used in the US.
      There seems to be confusion among people/shifters outside of the US in
      identifying the actual responsible site from the various names used in
      the US production services/queues.  In fact, some of them are openly
      commenting on their frustration with this difficulty in the computing log.
      Hence, I am wondering if we can/should put in the effort to use
      consistent naming conventions for the site names used in the various
      systems.  Below, I have identified some of the systems where consistent
      naming would help users:
      
      1.  PANDA site name
      2.  DDM site name
      3.  BDII site name
      
      At least, since these three names come to the front of the major ATLAS
      computing monitoring systems, good, consistent naming for each site in
      these three separate systems should help ease problems encountered by
      others.  So, is it possible to change any of the names?  (I know some of
      them are a pain to change.  If needed, I can make a table of the names
      for each site used in these three systems.)
      
      Hiro 
    • FTS 2.2 coming soon - sites will be required to update 2 weeks after certification --> close to when we can consolidate site services (a short checksum sketch, related to the checksum-capable FTS noted in the shifters report, follows this list)
    • Be prepared ---> consolidation of DQ2 site services from the Tier 2s will follow, the week after the FTS upgrade
    • proddisk-cleanse change from Charles - working w/ Hiro
    • Tier 3 in ToA will be subject to FT and blacklisting; under discussion
  • this meeting:
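
The shifters report above notes that the BNL FTS was upgraded to a checksum-capable version; ATLAS transfer validation typically uses adler32 checksums. As a small illustrative sketch only (not the FTS or DQ2 implementation; the example path is hypothetical), computing the adler32 of a local replica for comparison against the catalogued value looks roughly like this:

    import zlib

    def adler32_of_file(path, blocksize=1024 * 1024):
        # Stream the file through zlib.adler32; the 0xffffffff mask keeps the
        # value positive and it is formatted as the usual 8-hex-digit string.
        value = 1  # adler32 starting value
        with open(path, "rb") as f:
            block = f.read(blocksize)
            while block:
                value = zlib.adler32(block, value)
                block = f.read(blocksize)
        return "%08x" % (value & 0xffffffff)

    # Example (hypothetical path): compare against the checksum stored in the catalog.
    # print(adler32_of_file("/pnfs/example/user.file.root"))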

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • focus was on perfSONAR - new release to be out this Friday; Jason: no show stoppers. All sites should upgrade next week.
    • Fixes bugs identified at our sites - hope this release is resilient enough for T3 recommendations
    • Next meeting Feb 9
  • this week:
    • Minutes:
      
      

Site news and issues (all sites)

  • T1:
    • last week(s): Ongoing issue with Condor-G - incremental progress has been made, but new effects have been observed, including a slow-down in job throughput. Working with the Condor team; some fixes were applied (new condor-q) which helped for a while; decided to add another submit host to the configuration. New HPSS data movers and network links.
    • this week:

  • AGLT2:
    • last week: Failover testing of the WAN, all successful. On Tuesday will be upgrading dCache hosts to SL5.
    • this week:

  • NET2:
    • last week(s): Upgraded the gatekeeper host at BU to SL5 and OSG 1.2.6; HU in progress. LFC upgraded to the latest version.
    • this week:

  • MWT2:
    • last week(s): Completed upgrade of MWT2 (both sites) to SL 5.3 and new Puppet/Cobbler configuration build system. Both gatekeepers at OSG 1.2.5, to be upgraded to 1.2.6 at next downtime. Delivery of 28 MD1000 shelves.
    • this week:

  • SWT2 (UTA):
    • last week: Extended downtime Thursday/Friday to add new storage hardware plus the usual s/w upgrades; hope to be done by end of week.
    • this week:

  • SWT2 (OU):
    • last week: Started receiving more equipment from the storage order; still waiting for the rest of the hardware.
    • this week:

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware.
    • this week:

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented and is just finishing the test cases [Pedro] (see the sketch below)
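
A minimal sketch only: the command name, argument order, and zero-on-success convention below are assumptions for illustration, and the actual contract is whatever the LocalSiteMover specification linked above defines. A pilot-side wrapper around a site-provided lsm-get might look roughly like:

    import subprocess

    def lsm_get(surl, local_path, command="lsm-get"):
        # Assumed interface for illustration only: the site-provided script is
        # called as "lsm-get <source> <destination>" and returns 0 on success.
        rc = subprocess.call([command, surl, local_path])
        if rc != 0:
            raise RuntimeError("lsm-get failed with exit code %d" % rc)
        return local_path

    # Example (hypothetical SURL and destination):
    # lsm_get("srm://se.example.org/atlas/some/file.root", "/scratch/job123/file.root")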

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however, there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but we're not pushing right now. Want to have plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed.
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Preliminary capacity report is now working:
      This is a report of pledged installed computing and storage capacity at sites.
      Report date:  2010-01-25
      --------------------------------------------------------------------------
       #       | Site                   |      KSI2K |       HS06 |         TB |
      --------------------------------------------------------------------------
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_CE_2             |        100 |        640 |          0 |
       3.      | AGLT2_SE               |          0 |          0 |      1,060 |
      --------------------------------------------------------------------------
       Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       4.      | BU_ATLAS_Tier2         |      1,910 |          0 |        400 |
      --------------------------------------------------------------------------
       Total:  | US-NET2                |      1,910 |          0 |        400 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |          0 |          0 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          1 |
       7.      | BNL_ATLAS_SE           |          0 |          0 |          0 |
      --------------------------------------------------------------------------
       Total:  | US-T1-BNL              |          0 |          0 |          1 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       8.      | MWT2_IU                |      3,276 |          0 |          0 |
       9.      | MWT2_IU_SE             |          0 |          0 |        179 |
       10.     | MWT2_UC                |      3,276 |          0 |          0 |
       11.     | MWT2_UC_SE             |          0 |          0 |        200 |
      --------------------------------------------------------------------------
       Total:  | US-MWT2                |      6,552 |          0 |        379 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       12.     | OU_OCHEP_SWT2          |        464 |          0 |         16 |
       13.     | SWT2_CPB               |      1,383 |          0 |        235 |
       14.     | UTA_SWT2               |        493 |          0 |         15 |
      --------------------------------------------------------------------------
       Total:  | US-SWT2                |      2,340 |          0 |        266 |
      --------------------------------------------------------------------------
      
       Total:  | All US ATLAS           |     12,472 |     11,040 |      2,106 |
      --------------------------------------------------------------------------
      
    • Debugging underway
  • this meeting
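
For reference, the per-federation totals in the report above are just sums of the per-resource rows. A toy sketch of that aggregation (the field names are invented; the numbers are copied from the AGLT2 rows above):

    # Toy aggregation of per-resource capacities into federation totals.
    resources = [
        {"name": "AGLT2",      "federation": "US-AGLT2", "ksi2k": 1570, "hs06": 10400, "tb": 0},
        {"name": "AGLT2_CE_2", "federation": "US-AGLT2", "ksi2k": 100,  "hs06": 640,   "tb": 0},
        {"name": "AGLT2_SE",   "federation": "US-AGLT2", "ksi2k": 0,    "hs06": 0,     "tb": 1060},
    ]

    totals = {}
    for r in resources:
        t = totals.setdefault(r["federation"], {"ksi2k": 0, "hs06": 0, "tb": 0})
        for key in ("ksi2k", "hs06", "tb"):
            t[key] += r[key]

    print(totals)  # {'US-AGLT2': {'ksi2k': 1670, 'hs06': 11040, 'tb': 1060}}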

AOB

  • last week
  • this week
    • none


-- RobertGardner - 03 Mar 2010
