
MinutesFeb24

Introduction

Minutes of the Facilities Integration Program meeting, Feb 24, 2010
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Further meetings this past week regarding T3 planning.
    • Several working groups for T3s: distributed storage, DDM, PROOF; three-month timeframe, with an intermediate report in six weeks.
    • The output will be the recommended solutions.
    • T3s in the US will be working in parallel. Funding is expected soon.
    • Expect a communication from Massimo with the call for participation in the working groups.
  • this week:
    • Hardware recommendations expected within two weeks.
    • Frontier-squid set up in an SL5 virtual machine at ANL ASC (push-button solution).

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=85286
    
    1)  2/10: MWT2_UC -- Transfer errors from MWT2_UC_PRODDISK to BNL-OSG2_MCDISK, with errors like "[FTS] FTS State [Failed] FTS Retries [7] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]."  Issue resolved.  RT 15424, eLog 9427.
    2)  2/11: SWT2_CPB -- maintenance outage complete, test jobs finished successfully, site set back to 'on-line'.
    3)  2/11: AGLT2 -- ~200 failed jobs with the error "Get error: dccp get was timed out after 18000 seconds ." From Bob:
    This was fixed around 9am EST today.  There was a missing routing table entry.  Normal operations have since resumed.  There may be more jobs reporting this failure until everything is caught up.  Only half of AGLT2 worker nodes were affected by this.
    4)  2/11: New pilot version from Paul (42c) --
    * "t1d0" added to LFC replica sorting algorithm to prevent such tape replicas from appearing before any disk residing replicas. Requested by Hironori Ito et al.
    * Pilot is now monitoring the size of individual output files. Max allowed size is 5 GB. Currently the check is performed once every ten minutes. New error code was introduced: 1124 (pilot error code), 'Output file too large', 441003/EXEPANDA_OUTPUTFILETOOLARGE (proddb error code). 
     Panda monitor, Bamboo and ProdDB were updated as well. Requested by Dario Barberis et al.
     * The limit on the total size of all input files is now read from the schedconfig DB (maxinputsize). The default site value is set to 14336 MB (by Alden Stradling). Brokering has been updated as well (by Tadashi Maeno). Any problem reading the schedconfig value (not set/bad chars) will lead to the pilot setting its internal default as before (14336 MB). maxinputsize=0 means unlimited space. Monitoring of individual input sizes will be added to a later pilot version.
     * Fixed a problem with the local site mover, making sure that all input file names are defined by LFN. Previously problems occurred with PFNs containing the *__DQ2-* part in the file name (copied into the local directory, leading to a file-not-found problem). Requested by John Brunelle.
    * Fixed issue with FileStager access mode not working properly for sites with direct access switched off. Discovered by Dan van der Ster.
    * Removed panda servers voatlas19-21 from internal server list. Added voatlas58-59 (voatlas57 was added in pilot v 42b). Requested by Tadashi Maeno.
     5)  2/11: IllinoisHEP -- Jobs were failing with the errors "EXEPANDA_GET_FAILEDTOGETLFCREPLICAS " & "EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED (261)."  It seems the Tier 3 LFC was unresponsive for a period of time, causing these errors.  Issue resolved by Hiro.  RT 15427, eLog 9468.
    6)  2/11: AGLT2, power problem -- from Bob:
    At 10:55am EST a power trip at the MSU site dropped 12 machines accounting for up to 144 job slots.  Not all of those slots were running T2 jobs, but many were.  Those jobs were lost.
     7)  2/12 - 2/13: Reprocessing validation jobs were failing at MWT2_IU due to the missing ATLAS release 15.6.3.  The release was subsequently installed by Xin.
    8)  2/12: Reprocessing job submission begun with tag r1093.
    9)  2/13: BNL -- Issue with data transfers resolved by re-starting the pnfs and SRM services.  Problem appears to be related to high loads induced by reprocessing jobs.  Another instance of the problem on 2/15.  See details in eLog 9524, 9508.
    10)  2/16: Power outage at NET2.  From Saul:
    NET2 has had a power outage at the BU site.  All systems have been rebooted and brought back to normal.
    Test jobs completed successfully, site set back to 'on-line'.
    11)  2/16: Test jobs submitted to UCITB_EDGE7 to verify the latest OSG version (1.2.7, a minor update of version 1.2.6 -- contains mostly security fixes).  Jobs completed successfully.
     12)  2/17: Maintenance outage at AGLT2 beginning at 6:30 a.m. EST.  From Shawn:
    The dCache headnodes have been restored on new hardware running SL5.4/x86_64. We are now working on BIOS/FW updates and will then rebuild the storage(pool) nodes. Outage is scheduled to end at 5 PM Eastern but we hope to be back before then.
     13)  2/17: Note from Paul about an impending pilot update (the disk-space limits are sketched at the end of this report):
     Currently the pilot only downloads a job if the WN has at least 5 GB of available space. As discussed in a separate thread, we need to increase this limit to guarantee (well..) that a job using a lot of input data can finish and not run out of local space.
     It was suggested by Kaushik to use the limit 14 + 5 + 2 = 21 GB (14 GB for input files, 5 GB for output, 2 GB for log).
     Please let me know as soon as possible if this needs to be discussed any further.
    
    Follow-ups from earlier reports:
    (i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/
    More to say about this later as we see how its use evolves.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=86064
    
    1)  2/17: NET2 -- following recovery from a power outage test jobs completed successfully -- site set back to 'on-line'.
    2)  2/18: UTD-HEP set 'off-line' in preparation for OS and other s/w upgrades.
    3)  2/19: New pilot version from Paul (42e):
     * Began the process of tailing the brief pilot error diagnostics (not completed). Long error messages were previously cut off by the 256 char limit on the server side, which often led to the actual error not being displayed. Some (not yet all) error messages will now be tailed
     (i.e. the tail of the error message will be shown rather than only the beginning of the string). Requested by I Ueda.
    * Now grabbing the number of events from non-standard athena stdout info strings (which are different for running with "good run list"). See discussion in Savannah ticket 62721.
    * Added dCache sub directory verification (which in turn is used to determine whether checksum test or file size test should be used on output files). Needed for sites that share dCache with other sites. Requested by Brian Bockelman et al.
    * Pilot queuedata downloads are now using new format for retrieving the queuedata from schedconfig. Not yet added to autopilot wrapper. Requested by Graeme Stewart.
    * DQ2 tracing report now contains PanDA job id as well as hostname, ip and user dn (a DQ2 trace can now be traced back to the original PanDA job). Requested by Paul Nilsson(!)/Angelos Molfetas.
    * Size of user workdir is now allowed to be up to 5 GB (previously 3 GB). Discussed in separate thread. Requested by Graeme Stewart.
     4)  2/19 - 2/22: AGLT2 -- dCache issues discovered as the site came back on-line following a maintenance outage for s/w upgrades (SL5, etc.).  Issue eventually resolved.  Test jobs succeeded, back to 'on-line'.  ggus 55709, eLog 9750.
    5)  2/19: Oracle outage at CERN on 2/18 described here:
    https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/9644
    6)  2/19: SLAC -- DDM errors like "FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]."  Issue resolved -- from Wei:
     This is a name space problem. I will investigate. In the meantime, I have switched to running xrootdfs (the pnfs equivalent) without using the name space. I hope DDM will retry.
    7)  2/19: Large numbers of tasks were killed in the US & DE clouds to allow high priority ones to run.  (The high priority tasks were needed for a min-bias paper in preparation.)  eLog 9645.
    8)  2/20: From John at Harvard -- HU_ATLAS_Tier2 set back to 'on-line' after test jobs completed successfully.
     9)  2/23: BNL -- FTS upgraded to v2.2.3.  From Hiro:
     This is just to inform you that the BNL FTS has been upgraded to the checksum-capable version. There will be some tests of this capability.  Also, as we have planned all along, the consolidation of the DQ2 site services will happen after some tests in the coming weeks, once BNL DQ2 is upgraded next week.
    10)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    
    Follow-ups from earlier reports: 
    
    (i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/
    More to say about this later as we see how its use evolves.
     (ii) This past week:  What began with test jobs at UCITB_EDGE7 to verify the latest OSG release (1.2.7) led to the issue of having another site available besides UCITB.  Long mail thread on this topic.  Conclusion (from Alden, 2/23):
     Things are pretty much resolved. We'll need to create a new queue for ATLAS ITB activities,
     and shift all ATLAS focus off the BNL_ITB_Test1-condor queue. I'll get that started this afternoon.
     (iii) Issue about pilot space checking / minimum space requirements noted last week -- has there been a decision here?  (The limits under discussion are sketched below.)
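     A minimal sketch (in Python; not the actual pilot code) of the disk-space and size limits discussed above: the 21 GB minimum local space required before a job is downloaded (item 13 of the last-meeting summary and follow-up (iii)), the 5 GB cap on individual output files and the 14336 MB maxinputsize fallback (item 4, pilot 42c), and the tailing of long error diagnostics to fit the 256-character server-side limit (item 3 of this meeting's summary, pilot 42e). Function and variable names here are illustrative assumptions, not the pilot's real interfaces.

         import os
         import shutil

         GB = 1024 ** 3
         MIN_LOCAL_SPACE = (14 + 5 + 2) * GB     # 14 GB input + 5 GB output + 2 GB log = 21 GB
         MAX_OUTPUT_FILE_SIZE = 5 * GB           # per-file cap; pilot error 1124 if exceeded
         DEFAULT_MAX_INPUT_SIZE_MB = 14336       # fallback when the schedconfig value is missing or bad

         def enough_local_space(workdir):
             """Only accept a job if the work directory has at least 21 GB free."""
             return shutil.disk_usage(workdir).free >= MIN_LOCAL_SPACE

         def max_input_size_mb(schedconfig_value):
             """Parse maxinputsize from schedconfig; 0 means unlimited, bad values fall back to the default."""
             try:
                 value = int(schedconfig_value)
             except (TypeError, ValueError):
                 return DEFAULT_MAX_INPUT_SIZE_MB
             return value if value >= 0 else DEFAULT_MAX_INPUT_SIZE_MB

         def oversized_outputs(output_files):
             """Return output files exceeding the 5 GB cap (the pilot performs this check every ten minutes)."""
             return [f for f in output_files
                     if os.path.exists(f) and os.path.getsize(f) > MAX_OUTPUT_FILE_SIZE]

         def tail_diagnostics(message, limit=256):
             """Keep the tail of a long error message so the informative end survives the server-side char limit."""
             return message if len(message) <= limit else "..." + message[-(limit - 3):]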
    

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Recent progress on Panda job rebrokering:
      • removed the restriction of US physicists going to US queues (no DN brokering)
      • more information to users when a task is submitted
      • also pcache development (Charles, Tadashi) providing WN-level brokering (using the local WN disk); it is agnostic about its source, which could be an NFS backend. Another goal is to integrate this into the pilot so as to remove site admin involvement. (A rough sketch of the caching idea appears at the end of this section.)
    • BNL queues cannot keep up with the high pilot rate because of Condor-G problems; Xin is investigating.
    • DAST involvement in blacklisting problematic analysis sites. Starting this week DAST receives an email twice a day for sites failing GangaRobot jobs. A procedure is being set up to act on these failures.
  • this meeting:
    • The Panda user analysis profile is dynamic: on some days 25k jobs are queued, 20k of them in the US (most at BNL).
    • Looked at the queued jobs at BNL. Some users have hardwired the BNL site name in their submissions. Jobs using input ESDs run at BNL and not at other sites where replicas are available (CERN, NDGF, one UK site). This should be a site performance issue, since the brokerage checks for replicas. Investigating.
    • Saul looked at 5k random jobs at BNL for Feb. 19-20 to understand the job profiles better. A PDF file with the statistics is attached.
      • 56% of jobs run on ESDs, 21% on AODs, the rest on RAW, EVNT, DESD.
      • The two users submitting a large fraction of the jobs are non-US users.
      • US Tier 2s do not have as much input data on-site as BNL; ESD data is distributed to the Tier 1s.
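    • A rough illustration of the WN-level caching idea mentioned under 'last meeting' (pcache): a minimal sketch, assuming a hypothetical cache directory and a plain copy command in place of a real site mover; this is not pcache's actual interface.

          import os
          import shutil
          import subprocess

          CACHE_DIR = "/scratch/pcache"        # hypothetical local WN cache directory

          def fetch_input(lfn, source_pfn, workdir):
              """Return a local path for an input file, copying from storage only on a cache miss.

              The cache is agnostic about where the file comes from: source_pfn could point at an
              NFS-backed area or any other source; a real implementation would call a site mover
              here instead of a plain copy.
              """
              cached = os.path.join(CACHE_DIR, lfn)
              if not os.path.exists(cached):
                  os.makedirs(CACHE_DIR, exist_ok=True)
                  subprocess.check_call(["cp", source_pfn, cached])   # placeholder copy command
              local_copy = os.path.join(workdir, lfn)
              shutil.copy(cached, local_copy)   # could be a hard link instead, to avoid a second copy
              return local_copy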

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Postponed discussion of the proposal on renaming sites for consistency:
      I am wondering if we can agree on a consistent site naming convention
      for the various services in the ATLAS production system used in the US.
      There seems to be confusion among people/shifters outside of the US in
      identifying the actual responsible site from the various names used in
      the US production services/queues.  In fact, some of them are openly
      commenting on this frustration in the computing log.
      Hence, I am wondering if we can/should put in the effort to use
      consistent naming conventions for the site names used in the various
      systems.  Below, I have identified some of the systems where consistent
      naming would help users.
      
      1.  PANDA site name
      2.  DDM site name
      3.  BDII site name
      
      At least, since these three names appear at the front of the major ATLAS
      computing monitoring systems, good, consistent naming for each site in
      these three separate systems should help ease the problems encountered by
      others.  So, is it possible to change any of the names?  (I know some of
      them are a pain to change.  If needed, I can make a table of the names
      for each site used in these three systems.)
      
      Hiro
    • FTS 2.2 coming soon - an update will be required within 2 weeks after certification --> close to when we can consolidate site services
    • Be prepared ---> consolidation of DQ2 site services from the Tier 2s will follow in the week after the FTS upgrade
    • proddisk-cleanse change from Charles - working w/ Hiro
    • Tier 3 in ToA will be subject to FT and blacklisting; under discussion
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Fred - previous performance tests were invalid - found that squid was out of the loop; this requires adding the client machines to the squid.conf file. New tests show very good performance. (A client-side check is sketched at the end of this section.)
    • So should T3 sites maintain their own squid? Doug thinks every site should, and it's easy to do. CVMFS (a web file system) gets better performance if you have a local squid, so it will speed up access to releases and conditions data.
    • There are two layers of security - source and destination; John: there are recommendations in the instructions.
    • There is a discussion about how feasible it is to install Squid at each Tier 3 - John worries about the load on the associated Tier 2s.
    • Can also use CVMFS. Testing at BU.
    • There was an issue of HC performance at BU relative to AGLT2 for conditions data access jobs. Fred will run his tests against BU.
  • this week
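  • Referring to the squid discussion under 'last week' above: a minimal client-side check (a sketch, not a supported tool) that a conditions/Frontier request actually goes through the local squid. The proxy and server URLs are placeholders for the site's real endpoints; an HTTP 403 here typically means the client host is missing from the squid.conf ACLs.

        import urllib.error
        import urllib.request

        # Placeholder endpoints; substitute the site's actual squid and Frontier/conditions server.
        SQUID_PROXY = "http://squid.mysite.example:3128"
        TEST_URL = "http://frontier.example.org:8000/Frontier"

        def check_squid(proxy=SQUID_PROXY, url=TEST_URL, timeout=10):
            """Fetch a URL through the local squid; the failure mode hints at ACL or connectivity problems."""
            opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy}))
            try:
                with opener.open(url, timeout=timeout) as response:
                    return response.status == 200
            except urllib.error.HTTPError as err:
                print("squid reachable, but the request was refused: HTTP %d" % err.code)
                return False
            except OSError as err:
                print("could not reach the squid proxy: %s" % err)
                return False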

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • Focus was on perfSONAR - a new release is due out this Friday; Jason: no show-stoppers. All sites should upgrade next week.
    • It fixes bugs identified at our sites - the hope is that this release is resilient enough for the T3 recommendations.
    • Next meeting Feb 9
  • this week:
    • Minutes:
      
      

Site news and issues (all sites)

  • T1:
    • last week(s): Ongoing issue with Condor-G - incremental progress has been made, but new effects have been observed, including a slow-down in job throughput. Working with the Condor team; some fixes were applied (new condor-q), which helped for a while; decided to add another submit host to the configuration. New HPSS data movers and network links.
    • this week:

  • AGLT2:
    • last week: Failover testing of the WAN, all successful. On Tuesday they will be upgrading the dCache hosts to SL5.
    • this week:

  • NET2:
    • last week(s): Upgraded GK host to SL5 OSG 1.2.6 at BU; HU in progress. LFC upgraded to latest version.
    • this week:

  • MWT2:
    • last week(s): Completed upgrade of MWT2 (both sites) to SL 5.3 and new Puppet/Cobbler configuration build system. Both gatekeepers at OSG 1.2.5, to be upgraded to 1.2.6 at next downtime. Delivery of 28 MD1000 shelves.
    • this week:

  • SWT2 (UTA):
    • last week: Extended downtime Thursday/Friday to add new storage hardware plus the usual s/w upgrades; hope to be done by the end of the week.
    • this week:

  • SWT2 (OU):
    • last week: Started receiving more equipment from the storage order; continuing to wait for hardware.
    • this week:

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware.
    • this week:

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented, and they are just finishing the test cases [Pedro]. (An invocation sketch follows below.)
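    • A minimal sketch of how a wrapper might invoke a site's lsm-get script, assuming the conventional 'lsm-get <source> <destination>' calling form and a zero exit code on success; consult the LocalSiteMover specification above for the authoritative interface and error codes.

          import subprocess

          LSM_GET = "/usr/local/bin/lsm-get"   # site-specific install path (assumption)

          def lsm_get(source, destination, timeout=3600):
              """Run the site's lsm-get script; treat a zero exit code as success (assumed convention)."""
              result = subprocess.run([LSM_GET, source, destination],
                                      capture_output=True, text=True, timeout=timeout)
              if result.returncode != 0:
                  print("lsm-get failed (rc=%d): %s" % (result.returncode, result.stderr.strip()))
                  return False
              return True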

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, some validation jobs are still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but we are not pushing right now. We want to have plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed.
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Preliminary capacity report is now working:
      This is a report of pledged installed computing and storage capacity at sites.
      Report date:  2010-01-25
      --------------------------------------------------------------------------
       #       | Site                   |      KSI2K |       HS06 |         TB |
      --------------------------------------------------------------------------
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_CE_2             |        100 |        640 |          0 |
       3.      | AGLT2_SE               |          0 |          0 |      1,060 |
      --------------------------------------------------------------------------
       Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       4.      | BU_ATLAS_Tier2         |      1,910 |          0 |        400 |
      --------------------------------------------------------------------------
       Total:  | US-NET2                |      1,910 |          0 |        400 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |          0 |          0 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          1 |
       7.      | BNL_ATLAS_SE           |          0 |          0 |          0 |
      --------------------------------------------------------------------------
       Total:  | US-T1-BNL              |          0 |          0 |          1 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       8.      | MWT2_IU                |      3,276 |          0 |          0 |
       9.      | MWT2_IU_SE             |          0 |          0 |        179 |
       10.     | MWT2_UC                |      3,276 |          0 |          0 |
       11.     | MWT2_UC_SE             |          0 |          0 |        200 |
      --------------------------------------------------------------------------
       Total:  | US-MWT2                |      6,552 |          0 |        379 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       12.     | OU_OCHEP_SWT2          |        464 |          0 |         16 |
       13.     | SWT2_CPB               |      1,383 |          0 |        235 |
       14.     | UTA_SWT2               |        493 |          0 |         15 |
      --------------------------------------------------------------------------
       Total:  | US-SWT2                |      2,340 |          0 |        266 |
      --------------------------------------------------------------------------
      
       Total:  | All US ATLAS           |     12,472 |     11,040 |      2,106 |
      --------------------------------------------------------------------------
      
    • Debugging underway
  • this meeting

AOB

  • last week
  • this week
    • none


-- RobertGardner - 24 Feb 2010


Attachments


pdf 5000jobs.pdf (391.9K) | NurcanOzturk, 24 Feb 2010 - 12:28 | BNL job profiles
 