
MinutesMar31

Introduction

Minutes of the Facilities Integration Program meeting, Mar 31, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Karthik, Sarah, Shawn, Nate, Aaron, Fred, Justin, Jason, Tom, Bob, Patrick, Wei, Mark, John B
  • Apologies: Saul, Horst, Michael, Charles

Integration program update (Rob, Michael)

  • SiteCertificationP12 - FY10Q2
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • LHC collisions at 7 TeV formally scheduled for March 30, starting the 18-24 month run (press release)
      • Two OSG documents for review:
      • Updated CapacitySummary
      • Hope for stable beam around March 30. However, the 3.5 TeV ramp at noon resulted in a cryo failure that will take a day to recover from.
      • The quarter is about to end - quarterly reporting is due, with a 9-day deadline.
      • glexec heads up: there was a WLCG management discussion about glexec yesterday - details to be spelled out, but having it installed will become a requirement and will require integration testing. A number of issues were raised previously, so there are many details to iron out. Will need to work with OSG on the glexec installation - may want to invite Jose to a meeting to describe the system. It shouldn't have much impact on users or sites; the basic requirement is traceability at the gatekeeper level.
    • this week
      • 7 TeV collisions!
      • Quarterly reports due (FY10Q2)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • The links to the ATLAS T3 working group Twikis are here
    • Draft users' guide to T3g is here
    • The model T3g is up at ANL and is open to users; it will be used starting at the end of next week.
    • Pathena submission to T3 still not working. Build job works, but not the full job.
    • Still need dq2-get, or transfer via an FTS channel.
    • Working groups making lots of progress
    • One issue is the CVMFS file system, which centralizes deployment of releases and conditions data.
    • OSG support issues
      • how to route tickets for T3g's? Should we route directly to the sites?
      • Questions about who is responsible for the tickets? What about the connection to the T3 support group? So there is a bigger question than just GOC and RT tickets.
      • Rik - would like to get US participation in the T3 support group. Support model not clear.
      • Non-grid T3's - should they go through DAST?
      • Should bring up at the L2/L3 management meeting.
      • Nothing new from Hiro
  • this week:
    • ANL workshop this week, no report.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Back to full capacity
    • Some requests for regional production - ready to be used for backfill
    • 200M events from central production will be defined as queue fillers
    • Next week: ADC computing meeting at BNL (Alexei organizing). Large ADC-ATLAS attendance expected. Will focus on production issues and planning for the next couple of years.
    • Distributed analysis tests at AGLT2 - full sequence of 250, 500, 750, 1000, .. Ganga robot job sets. Results looked great.
      • Will prepare a talk for next week.
      • Will re-run high occupancy test.
      • Deploy pcache3 at UM? After next set of tests.
    • (Aside: HC tests at SLAC last week reproduced the previous good results. These used an old version of ROOT with an older xrootd client.)
  • this week:
    • Mark: for the most part things have been relatively smooth.
    • A problem with a task was causing failing jobs at BNL.
    • HU pilot execution problem understood: the grid monitor job was not getting the correct certificates directory; once fixed, the site ramped up and newer problems appeared (see the shifters report below).
    • There are a couple of tasks failing this week - jobs are expiring - they are lower priority tasks that get bumped by higher priority jobs.
    • Distributed analysis tests at AGLT2, preliminary conclusions
      • HC_Test_Performance_at_AGLT2.pdf: HC Testing Analysis results
      • NFS access to VO home directories and ATLAS kits is unchanged (saturated?) for all HC job counts
        • Likely need a better service here than the current PE1950/MD1000 combination, e.g., a Sun NAS or SSDs
        • Should likely split kits and VO home directories
      • dCache performs well, but would be aided by better distribution over available server systems
      • Limiting number of Analysis jobs per worker node seems to be a good idea
        • Best number for a PE1950 TBD (testing this is ongoing today)
        • Using half the cores as a maximum seems to be a good rule of thumb
          • May want to decrease this even further, at cost of slots, to get cpu/walltime higher
      • Other things: pcache, increasing number of jobs on the server
    • Next week: Wei will discuss with Kaushik running tests at SLAC - but will need to reduce the number of production jobs.

Data Management & Storage Validation (Kaushik)

Release installation, validation (Xin, Kaushik)

The issue of the validation process, completeness of releases on sites, etc.
  • last meeting
    • Michael: John Hover will open a thread with Alessandro to begin deploying releases using his method, which is WMS-based installation.
    • John's email will start the process today.
    • There will be questions: which certificate to use and which account to map to.
    • Charles makes the point that it would be good for admins to have tools to test releases. This will be done in the context of Alessandro's framework.
  • this meeting:

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=88790
    
    1)  3/17 - 3/18: MWT2_UC, ANALY_MWT2 -- off-line for electrical work -- completed, back to on-line.  eLog 10526.
    2)  3/18: Issue with transfer failures at NET2 resolved.  RT 15745/46, ggus 56511.
    3)  3/19: Issue with installation jobs at IllinoisHEP resolved? 
    4)  3/19: From aaron at MWT2:
    Due to a power event and reboot of a number of our worker nodes, we lost a fair number of jobs. You should expect to see a number of jobs failing with a lost heartbeat.
    5)  3/21: LFC problem at AGLT2 understood -- from Shawn:
    Ours was just "slow"...we are working on the back-end iSCSI to get backups setup and the iSCSI appliance was really slow for a while.
    6)  3/22: Issue with installation jobs at HU_ATLAS_Tier2 understood (sort of, still some questions about jobs running when a site is in 'brokeroff' vs. 'test' vs. ....).  Test jobs succeeded, queue  HU_ATLAS_Tier2-lsf set to online.
    7)  3/22: Lack of pilots at several sites was due to a problem with the submit host gridui11.  Machine rebooted, pilots again flowing.
    8)  3/22: BNL -- FTS and LFC database maintenance completed successfully.
    9)  3/23: From Charles at MWT2:
    Due to a glitch while installing a new pcache package, a number of jobs have failed during stage-in at MWT2_IU and MWT2_UC, with the following error:
    23 Mar 20:46:13|LocalSiteMov| !!WARNING!!2995!! lsm-get failed (51456): 201 Copy command failed
    This was a brief transient problem and has been resolved. Please do not offline the sites. We are watching closely for any further job failures.
    10)  3/24: Pilot update from Paul (v43a):
     * Multi-jobs. Several jobs can now be executed sequentially by the same pilot until it runs out of time. The total run time a multi-job pilot is allowed to run is defined by schedconfig.timefloor (minutes) [currently unset for all sites, so the feature is not enabled anywhere as of today]. 
     The primary purpose is to reduce the pilot rate when a lot of short jobs are in the system; it can be used for both production and analysis jobs. Initial testing will use a suggested timefloor of 15-20 minutes. Requested by Michael Ernst. (A sketch of the multi-job loop is included at the end of this section.)
    * Tier 3 modifications. Minor changes to skip e.g. LFC file registrations on tier 3 sites. cp and mv site movers can be used to transfer input/output files. Currently pilot is writing output to ///. Input file specification done via file list. 
    Testing under way at ANALY_ANLASC.
    * Further improvements in (especially) get and put pilot error diagnostic messages. Requested by I Ueda.
    * Corrected problem with athena setup when option -f was used. Requested by Rod Walker.
    * Added pilot support for 5-digit releases. Requested by Tadashi Maeno et al.
    * Removed hardcoded slc5 software path from setup path since it is no longer needed. Requested by Alessandro Di Salvo.
    * Replaced hardcoded panda servers with pandaserver.cern.ch for queuedata download. Requested by Graeme Stewart.
    * Installation problem now recognized during CMTCONFIG verification (NotAvailable error). Requested by Rod Walker.
    * Job definition printout now contains special command setup for xrdcp (when available). Note: printout done twice, at the beginning of the job and when all setups are done. Special command setup will only be set in the second printout. 
    Requested by Johannes Elmsheuser et al.
    * Corrected undefined variable in local site mover. Requested by Pedro Salgado.
    * Minor change in queuedata read function needed for glExec integration to allow queuedata file to be found after identity switching. glExec/pilot integration done in parallel by Jose Caballero.
    
    Follow-ups from earlier reports:
    (i)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    (ii)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.  ==> Has this migration been completed?
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=89684
    
    1)  3/24: From Wei at SLAC --
     SLAC will take a one-hour outage for hardware maintenance from 2-3pm PDT (9-10pm UTC).
    Outage completed.
    2)  3/24:  Issue with gridui07 affecting autopilot submission to AGLT2 (and other sites?) resolved by Alden.
    3)  3/25: From Saul at NET2:
    Just so you're not confused, we're draining the queues at BU in order to reboot the worker nodes in the morning.  3/26 follow-up:  PBS is opened up and we're ready to start up again at BU_ATLAS_Tier2o and ANALY_NET2.  Test jobs succeeded, queues set back to on-line.
    4)  3/25: New DNS aliases for the DQ2 Central catalogs implemented.
    5)  3/26: Job failures at BU with the error "Put error: lfc_creatg failed with (1001, Host not known)|Log put error: lfc_creatg failed with (1001, Host not known)."  Solution:
    We had site issues affecting DNS service to compute nodes. We believe all these problems were fixed on Friday (3.26).  eLog 10748, ggus 56743, RT 15782.
    6)  3/29 - 3/30: SWT2_CPB -- transfer errors like "[Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]."  Site set off-line for storage maintenance (a directory in the space token area was too full, 
    almost 32k subdirectories, creating xroot errors).  
    Test jobs succeeded, back to on-line.
    7)  3/30 evening: From Shawn at AGLT2:
    We had one of our dCache pools on umfs09.aglt2.org go offline. I am working on recovering it but it required a reboot of umfs09.aglt2.org.  This may cause some transient errors in our dCache.
    8)  3/30 - 3/31: IllinoisHEP -- jobs were failing with errors like "globus_l_ftp_client_connection_error:4580: the server responded with an error 535 Authentication failed: GSSException: Defective credential detected."   Seems to have been a transient problem.  
    No recent errors of this type.  
    ggus 56898, RT 15838, eLog 10909.
     9)  Over the past week there was an ongoing issue of HU_ATLAS_Tier2 not filling with jobs -- instead it seemed to be somehow capped at 50.  As of last night John thinks this issue is understood (grid-monitor jobs not using the proper CA certs location -- details in the mail thread).  
     Once the site began filling up, overheating issues were discovered, 
     such that large numbers of jobs were failing with stage-in errors.  Site set off-line while this problem is being worked on.  RT 15839, ggus 56899, eLog 10891.
     
    Follow-ups from earlier reports:
    (i)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    (ii)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.  ==> Has this migration been completed?
    (iii)  3/19: Issue with installation jobs at IllinoisHEP resolved?  Yes, as of 3/24 production resumed at the site.
    (iv)  3/24: Question regarding archival / custodial bit in dq2 deletions -- understood and/or resolved?
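     
     For reference on the pilot v43a multi-job feature described above: a minimal sketch, assuming the pilot simply keeps requesting and running jobs until its elapsed time exceeds schedconfig.timefloor (the function names below are illustrative, not the actual pilot code):
     
       import time
       
       def run_multi_job_pilot(get_next_job, run_job, timefloor_minutes):
           # Hypothetical sketch of the multi-job loop: execute jobs sequentially
           # until the pilot has been running longer than timefloor (minutes).
           start = time.time()
           while True:
               job = get_next_job()          # ask the server for another job
               if job is None:
                   break                     # no work available, exit as usual
               run_job(job)                  # run the payload to completion
               elapsed_minutes = (time.time() - start) / 60.0
               if elapsed_minutes > timefloor_minutes:
                   break                     # time floor exceeded, stop fetching jobs
     
     With timefloor unset (the current default at all sites), the loop effectively runs a single job, matching today's behaviour.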
    
    

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • DQ2 logging has a new feature - errors are now reported. Request: to be able to search at the error level.
    • Will be adding link for FTS log viewing.
    • FTS channel configuration change for the data-flow time-out. The new FTS has an option for terminating stalled transfers; the default time-out for an entire transfer is 30 minutes, which wastes the channel on a failed transfer. With the new setting, if there is no progress in the first 3 minutes the transfer is terminated. Now active for all Tier 2 channels.
      • If there is no progress (bytes transferred) during a 180-second window, the transfer is cancelled. (A transfer marker is sent every 30 seconds.) A page with all the settings is being put together. (See the sketch at the end of this section.)
      • Have observed that some transfers being terminated.
      • BNL-IU problem - fails for small files when directly writing into pools. All sites with direct transfers to pools are affected - it's a gridftp2 issue.
      • Logfiles and root files - few hundred kilobyte sized files.
      • In the meantime BNL-IU is not using gridftp2
      • dCache developers are being consulted - may need a new dCache adapter
    • DQ2 SS consolidation done except for BU - problem with checksum issues.
    • Need to update Tier 3 DQ2. Note: Illinois is working.
    • BU site services now at BNL, so all sites are now running DQ2 SS at BNL DONE
    • FTS log, site level SS logs both available
  • this meeting:
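    A minimal sketch of the FTS no-progress time-out noted under "last meeting(s)" above, assuming the service reads a byte counter from the periodic transfer markers and cancels the transfer once it has not advanced for 180 seconds (the function names are illustrative, not the FTS implementation):
    
      import time
      
      MARKER_INTERVAL = 30       # seconds between transfer markers
      NO_PROGRESS_WINDOW = 180   # seconds without byte progress before cancelling
      
      def watch_transfer(get_bytes_transferred, cancel_transfer):
          # Poll the byte counter at each marker; cancel if it stalls for 180 s.
          last_bytes = 0
          last_progress = time.time()
          while True:
              time.sleep(MARKER_INTERVAL)
              current = get_bytes_transferred()
              if current > last_bytes:
                  last_bytes = current
                  last_progress = time.time()
              elif time.time() - last_progress >= NO_PROGRESS_WINDOW:
                  cancel_transfer()   # free the channel instead of waiting the full 30 minutes
                  return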

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Minutes:
      		USATLAS Throughput Meeting Notes - March 23, 2010
                 =================================================
      
      Attending: Aaron, Sarah, John, Saul, Charles, Jason, David, Andy, Hiro, Mark, Augustine, Horst, Karthik
      Excused: Karthik
      
      1) Jason updated us on the segfaulting issue: related to perl modules ending.  Shouldn't be causing any real problems.  Restarts are done daily of all services that should be "Running".  A future version of perfSONAR may have a "monitor" which watches processes that should be running AND will restart them if they stop.   Karthik reported that the DNS issue they were having (heavy DNS load from perfSONAR nodes) was resolved by putting 'nscd' in to cache DNS requests.  Andy reported that the next version will have 'nscd' to cache DNS.   Jason sent along instructions on how sites can put 'nscd' in place right now:   
      
        sudo apt-get install nscd
        sudo /etc/init.d/pSB_owp_master.sh restart
        sudo /etc/init.d/pSB_owp_collector.sh restart
      
      About 1 month before another release (April 23 or so). 
      
      2) Hiro reported that the dCache bug for small file transfers using GridFTP2 will prevent our "transaction" testing until it is resolved.  The transaction tests use lots of small files.  Hiro will try to work with WT2/Wei concerning transaction testing to a BestMan/Xrootd site in the interim.   Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).   Will add the appropriate links to the Throughput testing pages at: http://www.usatlas.bnl.gov/dq2/throughput so that we can search when there are failures.    
      
      3) Site reports
      	BNL -  John reported tests with Artur and Eduardo at CERN to explore the 10 minute "bursty" network results they were seeing at joint-techs.  Looks fine now.   LHCOPN 8.5+8.5 Gbps works fine.   perfSONAR issues as per other sites. 
      	BU - Augustine reported on local throughput problem; dual 10GE network NIC and 1GE destinations was having poor performance.  Fix was to  disable Linux "autotuning" via setting net.ipv4.tcp_moderate_rcvbuf = 0.   More details would be interesting.  BNL perfSONAR now configured for testing (after call).
           MWT2 - Small files problem at IU.  perfSONAR problems as mentioned.    Xrootd testing ongoing at IU.   Bonnie++ testing of XFS (default vs tuned) at UC.
      	Illinois - perfSONAR issues there.  Restarts working.
      	SWT2_OU/SWT2_UTA.  No additions from Karthik's note.  All working now.
      	AGLT2 -  NFS replacement via Lustre being explored.  perfSONAR issue with stopping service observed.  
      
      4) AOB?  None.
      
      Plan to meet again in two weeks.  Sites should prepare by looking at their perfSONAR measurement results and bring questions to the meeting.    Notify Shawn if there are other topics to add to the agenda.
      
      Please send along corrections and comments to the list.
      
      Shawn
    • perfsonar release schedule - about a month away - anticipate doing only bug fixes.
    • Transaction bottleneck tests - but there is a dcache bug for small files that must be solved first; use xrootd site.
    • Look at data in perfsonar - all sites
    • BU site now configured. SLAC - still not deployed, still under discussion.
  • this week:
    • No meeting this week
    • Still waiting to hear about when to begin testing perfsonar.

Site news and issues (all sites)

  • T1:
    • last week(s): Testing of new storage - dCache testing by Pedro. Will purchase 2000 cores - R410s rather than high-density units, in ~ six weeks. Another Force10 coming online (100 Gbps interconnect). Requested another 10G link out of BNL for the Tier 2s; hope ESnet will manage the bandwidth to sites well. Fast-track muon recon has been running for the last couple of days, with the majority at BNL (kudos). lsm by Pedro now supports put operations - tested on ITB. CREAM CE discussion w/ OSG (Alain) - have encouraged him to pursue this and make it available to US ATLAS as soon as possible.
    • this week:

  • AGLT2:
    • last week: Lustre in VM going well. v1.8.2
    • this week: Now have a Lustre deployment going here - looking to replace multiple NFS servers (home, releases, osg home, etc.). Getting experience; will start to move usage over to it and evaluate.

  • NET2:
    • last week(s): Filesystem problem turned out to be a local networking problem. HU nodes added - working on ramping up jobs. Top priority is acquiring more storage - will be Dell. DQ2 SS moved to BNL. Shawn helped tune up perfsonar machines. Moving data around - ATLASDATADISK seems too large. Also want to start using pcache.
    • this week: HU issues as discussed above. There was a missing production cache causing failures.

  • MWT2:
    • last week(s): Electrical work complete putting new storage systems behind UPS. New storage coming online: SL5.3 installed via Cobbler and Puppet on seven R710 systems. RAID configured for MD1000 shelves. 10G network to each system (6 into our core Tier 2 Cisco 6509, 1 into our Dell 6248 switch stack). dCache installed. Also working on WAN Xrootd testing (see ATLAS Tier 3 working group meeting yesterday). Python bindings for xrootd library - work continues - in advance of local site mover development for xrootd.
    • this week: Focus is on getting new storage online. Everything is installed and configured; running dCache test pools and load tests. xrootd testing continued.

  • SWT2 (UTA):
    • last week: SL5.4 w/ Rocks 5.3 complete. SS transitioned to BNL. Issues w/ transfers failing to BNL - there may be an issue w/ how checksums are being handled. 400 TB of storage being racked and stacked; otherwise all running fine. Continuing to look into procuring new compute nodes and storage.
    • this week: Ran into a problem with too many directories in PRODDISK - this is a problem when running the CNS service, and may be helped by using XFS rather than ext3. Cleared up with the proddisk cleanup script. Continued installation of storage.

  • SWT2 (OU):
    • last week: 23 servers arrived. Call scheduled w/ Dell regarding installation.
    • this week: Equipment in place - will probably take a big downtime the last week of April. Just now finding a problem with a low number of running jobs - none activated; maybe a job shortage. Mark will follow up.

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware. Otherwise all is well. Storage configuration changed - no longer using the xrootd namespace (CNS service).
    • this week: all is well.

Carryover issues (any updates?)

VDT Bestman, Bestman-Xrootd

Local Site Mover

  • Specification: LocalSiteMover
  • code
    • BNL has an lsm-get implemented and is just finishing implementing test cases [Pedro]. (A sketch of a possible lsm-get wrapper follows at the end of this section.)
  • this week if updates:
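  • A minimal sketch of an lsm-get wrapper, assuming the LocalSiteMover interface is roughly "lsm-get <source> <destination>" with exit code 0 on success (the transfer command and checks below are illustrative only, not the BNL implementation):
    
      #!/usr/bin/env python
      # Hypothetical lsm-get sketch: copy one file from site storage to the worker node.
      import os
      import subprocess
      import sys
      
      def main():
          if len(sys.argv) != 3:
              sys.stderr.write("usage: lsm-get <source> <destination>\n")
              return 1
          source, dest = sys.argv[1], sys.argv[2]
          # Delegate the actual copy to whatever transfer tool the site uses
          # (dccp is used here purely as an example for a dCache-backed site).
          rc = subprocess.call(["dccp", source, dest])
          if rc != 0 or not os.path.exists(dest):
              sys.stderr.write("lsm-get: copy failed (rc=%d)\n" % rc)
              return rc or 1
          return 0
      
      if __name__ == "__main__":
          sys.exit(main())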

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • The report is complete - there is an email every Tuesday.
    • AGLT2 is the only site that is compliant in terms of reporting HS06 correctly. OIM is likely out of date.
    • Once the sites have completed their updates, Karthik will check.
    • Karthik will send a reminder.
  • this meeting
    • This is a report of pledged installed computing and storage capacity at sites.
      Report date: Tue, Mar 30 2010
      
      --------------------------------------------------------------------------
       #       | Site                   |      KSI2K |       HS06 |         TB |
      --------------------------------------------------------------------------
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_SE               |          0 |          0 |      1,060 |
      --------------------------------------------------------------------------
       Total:  | US-AGLT2               |      1,570 |     10,400 |      1,060 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       3.      | BU_ATLAS_Tier2         |      1,910 |      5,520 |        200 |
       4.      | HU_ATLAS_Tier2         |      1,600 |      5,520 |        200 |
      --------------------------------------------------------------------------
       Total:  | US-NET2                |      3,510 |     11,040 |        400 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |      8,100 |     31,000 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          0 |
       7.      | BNL_ATLAS_5            |          0 |          0 |          0 |
       8.      | BNL_ATLAS_SE           |          0 |          0 |      4,500 |
      --------------------------------------------------------------------------
       Total:  | US-T1-BNL              |      8,100 |     31,000 |      4,500 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       9.      | MWT2_IU                |      3,276 |      5,520 |          0 |
       10.     | MWT2_IU_SE             |          0 |          0 |        179 |
       11.     | MWT2_UC                |      3,276 |      5,520 |          0 |
       12.     | MWT2_UC_SE             |          0 |          0 |        250 |
      --------------------------------------------------------------------------
       Total:  | US-MWT2                |      6,552 |     11,040 |        429 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       13.     | OU_OCHEP_SWT2          |        464 |      3,189 |        200 |
       14.     | SWT2_CPB               |      1,383 |      4,224 |        436 |
       15.     | UTA_SWT2               |        493 |      3,627 |         39 |
      --------------------------------------------------------------------------
       Total:  | US-SWT2                |      2,340 |     11,040 |        675 |
      --------------------------------------------------------------------------
               |                        |            |            |            |
       16.     | WT2                    |        820 |      9,057 |          0 |
       17.     | WT2_SE                 |          0 |          0 |        597 |
      --------------------------------------------------------------------------
       Total:  | US-WT2                 |        820 |      9,057 |        597 |
      --------------------------------------------------------------------------
      
       Total:  | All US ATLAS           |     22,892 |     83,577 |      7,661 |
      --------------------------------------------------------------------------
  • All sites reasonably up to date. Final report sent to Michael.
  • GOC will start running the report periodically

AOB

  • last week
  • this week


-- RobertGardner - 30 Mar 2010

Attachments


pdf HC_Test_Performance_at_AGLT2.pdf (365.6K) | RobertBall, 31 Mar 2010 - 12:56 | HC Testing Analysis results
 