
MinutesSep8

Introduction

Minutes of the Facilities Integration Program meeting, Sep 8, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Attending

  • Meeting attendees: Nate, Charles, Karthik, Dave, Michael, Booker, Jason, Sarah, Wei, Fred, Pat, Saul, Bob, John B, Horst, Shawn, Tom, Alden
  • Apologies: Kaushik & UTA team (flooding!)

Integration program update (Rob, Michael)

  • IntegrationPhase14 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • CVMFS evaluation at the Tier 2s - OU, UC, and AGLT2 are interested in evaluating it. See TestingCVMFS for further details.
      • Next facility workshop: October 12-13.
      • WLCG is asking for US ATLAS pledges for 2011, and preliminary figures for 2012. ~12K HS06.
      • Open ATLAS EB meeting on Tuesday. The next reprocessing campaign is shaping up: the 7 TeV runs will be used, ~1B events. October 20 is the deadline for building the dataset; November 26 is the reprocessing deadline, in time for the next major physics conferences (La Thuile, March 2011). 6-8 weeks of simulation are expected, mostly at the Tier 2s.
    • this week
      • For the OSG storage forum we've been asked to summarize Tier 2 storage experiences. I'll send out an email request.
      • Installed capacity reporting. Here is the current (weekly) report (a short consistency check of the totals appears after these notes):
        This is a report of pledged installed computing and storage capacity at sites.
        Report date: Tue, Sep 07 2010
        
        --------------------------------------------------------------------------
         #       | Site                   |      KSI2K |       HS06 |         TB |
        --------------------------------------------------------------------------
         1.      | AGLT2                  |      1,670 |     11,040 |          0 |
         2.      | AGLT2_SE               |          0 |          0 |      1,060 |
        --------------------------------------------------------------------------
         Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
        --------------------------------------------------------------------------
                 |                        |            |            |            |
         3.      | BU_ATLAS_Tier2         |      1,910 |      5,520 |        200 |
         4.      | HU_ATLAS_Tier2         |      1,600 |      5,520 |        200 |
        --------------------------------------------------------------------------
         Total:  | US-NET2                |      3,510 |     11,040 |        400 |
        --------------------------------------------------------------------------
                 |                        |            |            |            |
         5.      | BNL_ATLAS_1            |      8,100 |     31,000 |          0 |
         6.      | BNL_ATLAS_2            |          0 |          0 |          0 |
         7.      | BNL_ATLAS_3            |          0 |          0 |          0 |
         8.      | BNL_ATLAS_4            |          0 |          0 |          0 |
         9.      | BNL_ATLAS_5            |          0 |          0 |          0 |
         10.     | BNL_ATLAS_SE           |          0 |          0 |      4,500 |
        --------------------------------------------------------------------------
         Total:  | US-T1-BNL              |      8,100 |     31,000 |      4,500 |
        --------------------------------------------------------------------------
                 |                        |            |            |            |
         11.     | MWT2_IU                |      3,276 |      5,520 |          0 |
         12.     | MWT2_IU_SE             |          0 |          0 |        179 |
         13.     | MWT2_UC                |      3,276 |      5,520 |          0 |
         14.     | MWT2_UC_SE             |          0 |          0 |        250 |
        --------------------------------------------------------------------------
         Total:  | US-MWT2                |      6,552 |     11,040 |        429 |
        --------------------------------------------------------------------------
                 |                        |            |            |            |
         15.     | OU_OCHEP_SWT2          |        464 |      3,189 |        200 |
         16.     | SWT2_CPB               |      1,383 |      4,224 |        670 |
         17.     | UTA_SWT2               |        493 |      3,627 |         39 |
        --------------------------------------------------------------------------
         Total:  | US-SWT2                |      2,340 |     11,040 |        909 |
        --------------------------------------------------------------------------
                 |                        |            |            |            |
         18.     | WT2                    |        820 |      9,057 |          0 |
         19.     | WT2_SE                 |          0 |          0 |        597 |
        --------------------------------------------------------------------------
         Total:  | US-WT2                 |        820 |      9,057 |        597 |
        --------------------------------------------------------------------------
        
         Total:  | All US ATLAS           |     22,992 |     84,217 |      7,895 |
        --------------------------------------------------------------------------
      • Mid-October face-to-face meeting at SLAC. Make your reservations at the guest house soon. The agenda is being discussed and is slowly taking shape.
      • Data management meeting yesterday - production has ramped up, but we also have a very spiky analysis load. BNL has shifted 1,000 more cores into analysis. Please move more resources into analysis.
      • LHC - after completion of the technical stop, still in development mode. No beam expected before the weekend. Expect a ramp-up from 50 to 300 bunches.
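      • As a quick cross-check of the installed-capacity report above, here is a minimal Python sketch (illustrative only, not part of the reporting tool) that re-derives the federation and US-wide totals from the per-site rows; the all-zero BNL rows are omitted for brevity:
        # Per-site rows copied from the weekly report: (site, KSI2K, HS06, TB)
        sites = {
            "US-AGLT2":  [("AGLT2", 1670, 11040, 0), ("AGLT2_SE", 0, 0, 1060)],
            "US-NET2":   [("BU_ATLAS_Tier2", 1910, 5520, 200), ("HU_ATLAS_Tier2", 1600, 5520, 200)],
            "US-T1-BNL": [("BNL_ATLAS_1", 8100, 31000, 0), ("BNL_ATLAS_SE", 0, 0, 4500)],
            "US-MWT2":   [("MWT2_IU", 3276, 5520, 0), ("MWT2_IU_SE", 0, 0, 179),
                          ("MWT2_UC", 3276, 5520, 0), ("MWT2_UC_SE", 0, 0, 250)],
            "US-SWT2":   [("OU_OCHEP_SWT2", 464, 3189, 200), ("SWT2_CPB", 1383, 4224, 670),
                          ("UTA_SWT2", 493, 3627, 39)],
            "US-WT2":    [("WT2", 820, 9057, 0), ("WT2_SE", 0, 0, 597)],
        }
        grand = [0, 0, 0]
        for fed, rows in sites.items():
            totals = [sum(r[i] for r in rows) for i in (1, 2, 3)]   # KSI2K, HS06, TB
            grand = [g + t for g, t in zip(grand, totals)]
            print(f"{fed:10s}  KSI2K={totals[0]:7,d}  HS06={totals[1]:7,d}  TB={totals[2]:6,d}")
        print(f"All US ATLAS:  KSI2K={grand[0]:,}  HS06={grand[1]:,}  TB={grand[2]:,}")
        # Reproduces the report totals: 22,992 KSI2K, 84,217 HS06, 7,895 TB.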

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here
last week(s):
  • OSG storage meeting coming up - will get in contact with Tanya about a Tier 3 talk & requirements.
  • Working on plan to deal with local data management.
  • Most attention at the moment is on figuring out funding for the Tier 3s. Several sites have ordered their equipment.
  • Would like Panda monitoring to separate out the Tier 3 sites, so that they are not monitored.
this week:

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Armen: other clouds have the same problem. Armen and Wensheng are deleting data when space gets too full. The problem is the deletion service itself. USERDISK has seen the most deletions.
    • Small transfer requests to Tier 3s: discussed in the RAC. Stephane notes these cannot be controlled on a site-by-site basis; 1/2 TB is the limit. Requires ACLs to be set correctly in the Tier 3 LFC.
    • Same policy for LOCALGROUPDISK.
    • Investigating many requests stuck "awaiting subscriptions" - not an approval process issue.
    • Discussed the approval process - VOMS-based approval for a local site data manager. This is the long-term plan.
  • this week:

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting: not available this week (Yuri just back from vacation)
    1)  8/25: DDM failures on WISC_GROUP token - "cannot continue since no size has been returned after PrepareToGet or SrmStat."  Issue resolved.  ggus 61572 closed, eLog 16245.
    2)  8/26: ggus 60982 (transfer errors from SLACXRD_USERDISK to SLACXRD_LOCALGROUPDISK) was closed.  eLog 16268.
    3)   8/27: SWT2_CPB_USERDISK - several files were unavailable from this token.  Issue resolved - from Patrick:
    Sorry for the problems. The Xrootd system was inconsistent between the SRM and the disk contents. The issue has been resolved and the files are now available via the SRM.  (A 4 hour maintenance outage was taken 8/27 p.m. to fix this problem.)  
    ggus 61611 & RT 18043 closed, eLog 16349.
    4)  8/28: From Wei at SLAC: We will take WT2 down on Monday Aug 30, 10am to 5pm PDT (5pm - 12am UTC) for site maintenance and to bring additional storage online.  Maintenance completed on 8/30 - test jobs successful, 
    queues set back on-line as of ~1:30 p.m. CST.  eLog 16453.
    5)  8/29: BNL DDM errors - T0 exports failing to BNL-OSG2_DATADISK.  Initially seemed to be caused by a high load on pnfs01.  Problem recurred,
    from Jane: pnfs was just restarted to improve the situation.  ggus 61627 & goc #9155 both closed; eLog 16360/77.
    6)  8/30: BNL - upgrade of storage servers' BIOS - from Michael:
    This is to let shifters know that at BNL we are taking advantage of the LHC technical stop and conducting an upgrade of the BIOS of some of our storage servers. This maintenance is carried out in a "transparent" fashion meaning the servers
    will briefly go down for a reboot (takes ~10 minutes).  As files requested during that time are not available there may be a few transfer failures due to "locality is unavailable."  Maintenance completed - eLog 16419.
    7)  9/1: BNL - Major upgrade of the BNL network routing infrastructure that connects the US Atlas Tier 1 Facility at BNL to the Internet.  Duration: Wednesday, Sept 1, 2010 9:00AM EDT to 6:00PM EDT.  See eLog 16476.
    
    Follow-ups from earlier reports:
    (i)  8/18: WISC - DDM transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  
    Resolved as of 8/23, ggus 61283 closed, eLog 16167.  (Note: ggus 61352 may be a related ticket?)
    As of 8/31 ggus 61352 is still 'in progress'.
    (ii)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w 
    releases to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.
    (iii)  8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out."  Issue with a dCache pool - from Pedro (8/25):
    We still have a dcache pool offline.  It should be back by tomorrow noon.  In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt.  ggus 61338/355, eLog 16154/230.
    As of 8/31 ggus 61338 'solved', 61355 'in progress'.
    (iv)  8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK.  From Wei:
    Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish.  ggus 61537 in progress, eLog 16223/33.
    Update, 8/31: a bad hard drive was replaced - issue resolved.  ggus 61537 closed,  eLog 16327.
    
    • Quiet week.
    • Release brokering is nearly complete (Alden).
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=106608
    
    1)  9/2: SLACXRD_DATADISK - "NO_SPACE_LEFT" transfer errors.  Additional storage space added.  Savannah 116562 closed, eLog 16565.
    2)  9/2: WISC_DATADISK file transfer errors:
    Failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries.
    From Wen: It's fixed. We are working on it to make it stable. Sorry for the problem.  ggus 61751 closed, eLog 16511.
    3)  9/2: From Michael at BNL:
    Due to high load on the namespace manager some transfers are failing. Experts are investigating the issue.
    Later:
    The problem is solved (was caused by frequent SCSI errors we observed on one of the storage servers).
    4)  9/3: AGLT2 - from Bob: An NFS server went offline, and completely bolloxed the works here.  We had to kill all running jobs, both Production and Analysis and restart.  Now back on-line.
    5)  9/6: UTD-HEP - file transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server]. Givin' up after 3 tries].  
    Resolution: file server rebooted.  ggus 61831 closed, Savannah 116628 (site was briefly blacklisted), eLog 16672.
    6)  9/6: BNL - file transfer errors:
    Source file [srm://dcsrm.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT0D1/data10_7TeV/ESD/f282/data10_7TeV.00160975.physics_Egamma.recon.ESD.f282/data10_7TeV.00160975.physics_Egamma.recon.ESD.f282._lb0091._0002.1]: locality is UNAVAILABLE].  dCache pool was off-line - issue resolved.  ggus 61834 (closed), eLog 16633.
    7)  9/7: UTD-HEP - SRM access errors.  Site blacklisted pending resolution of the problem.  Savannah 116654, ggus 61895/77.
    
    Follow-ups from earlier reports:
    (i)  8/18: WISC - DDM transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  
    Resolved as of 8/23, ggus 61283 closed, eLog 16167.  (Note: ggus 61352 may be a related ticket?)
    As of 8/31 ggus 61352 is still 'in progress'.
    ggus 61352 'solved' as of 9/3 - no recent errors of the type described in the ticket.
    (ii)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  
    Still waiting for a complete set of ATLAS s/w releases to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    (iii)  8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out."  Issue with a dCache pool - from Pedro (8/25):
    We still have a dcache pool offline.  It should be back by tomorrow noon.  In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt.  ggus 61338/355, eLog 16154/230.
    As of 8/31 ggus 61338 'solved', 61355 'in progress'.
    ggus 61355 'solved' as of 9/3.
    (iv)  9/1: BNL - Major upgrade of the BNL network routing infrastructure that connects the US Atlas Tier 1 Facility at BNL to the Internet.  Duration: Wednesday, Sept 1, 2010 9:00AM EDT to 6:00PM EDT.  See eLog 16476.
    Upgrade completed as of ~5:30 EST, with all storage pools on-line as of ~10:00 p.m. EST.  eLog 16500.
    
    

DDM Operations (Hiro)

libdcap & direct access (Charles)

last meeting:
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.

this meeting:

  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap (a brief direct-access sketch follows this list).
  • A newline-handling issue was found.
  • A buffer overflow was found.
  • Chipping away at these on the job side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • The maximum number of active movers needs to be set high - 1000.
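  • A minimal sketch (not from the meeting) of what direct access through libdcap looks like from the client side, using the library's POSIX-like dc_open/dc_read/dc_close calls via Python ctypes; the door host and pnfs path are hypothetical placeholders, and the snippet assumes libdcap.so is available on the worker node:
    # Illustrative only: exercise libdcap's POSIX-like API from Python via ctypes.
    # The dcap door and file path below are hypothetical placeholders.
    import ctypes, os

    dcap = ctypes.CDLL("libdcap.so")           # assumes the dcap client library is installed
    dcap.dc_open.restype = ctypes.c_int
    dcap.dc_read.restype = ctypes.c_ssize_t

    url = b"dcap://door.example.edu:22125/pnfs/example.edu/atlas/userdisk/some_file.root"
    fd = dcap.dc_open(url, os.O_RDONLY)        # dc_open mirrors open(2)
    if fd < 0:
        raise OSError("dc_open failed")

    buf = ctypes.create_string_buffer(1024 * 1024)
    nread = dcap.dc_read(fd, buf, len(buf))    # read ~1 MB as a simple connectivity check
    print("read", nread, "bytes")
    dcap.dc_close(fd)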

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Phone meeting notes:
      
      
    • From Jason:
      • We are still actively trying to figure out what is causing the performance problems between the new release and the KOI hardware.
      • The final release will be pushed back until we can fix this; RC3 will be out this week.
      • Still not recommending that the US ATLAS Tier 2s adopt it yet. Working with BNL/MSU/UM on testing solutions.
  • this week:
    • Meeting scheduled for next week.
    • Jason: RC3 will be out this week; hold off on updates. Seeing a performance impact that is not yet understood.

HEPSpec 2006 (Bob)

Site news and issues (all sites)

  • T1:
    • last week(s): Currently a problem with one of the storage pools - a large one, ~100K files/day to re-create the metadata. local-site-mover: Pedro is making progress on the handling of exceptions (have seen a rate of 2-5%). Want data access problems to become invisible.
    • this week:

  • AGLT2:
    • last week: Putting in storage orders - R710 + MD1200 shelves. Also focused on network infrastructure upgrades and service nodes. Running with direct access for a while; have found some queuing of movers on the pools. (Note MWT2 is running dcap with 1000 movers; note there is a newer dcap version to test.)
    • this week: Still preparing for the next purchase. Storage, services, and network purchases are out; a decision on compute nodes is needed by mid-month. More virtualization of primary services: both gatekeepers and the Condor head node are virtualized. Started testing compute nodes that are about to go out of warranty; capacity is down a bit.

  • NET2:
    • last week(s): 250 TB rack is online, DATADISK is being migrated. Working on a second rack. Working on HU analysis queue.
    • this week: Running smoothly over the past week. The move to DATADISK is 95% complete. Wensheng and Armen have cleared up some space. Copies run at a nominal 250 MB/s. HU as an analysis queue - still working on this; need to put another server on the Harvard side. Another 250 TB has been delivered and is awaiting electrical work. Future: a multi-university green computing space - NET2 may move there.

  • MWT2:
    • last week(s): Chimera issues resolved - they were due to a schema change; no crashes since. PNFS relocated to newer hardware; good performance.
    • this week: Quiet week. Continuing purchase planning, with HEP-SPEC06 benchmarks for two servers.

  • SWT2 (UTA):
    • last week: 200 TB of storage added to the system - up and running without problems. Will be making modifications for multi-pilot.
    • this week: Last week was quiet. Had a problem with Torque - restarted it. Processing a backlog of analysis jobs. The scheddb changes (multi-pilot option) don't seem to be taking effect. Otherwise all okay.

  • SWT2 (OU):
    • last week: The Lustre upgrade seems to have stabilized things. Pilot update for the adler32 change. Getting OSCER back online. Found problems installing new releases via Alessandro's system.
    • this week: Put Squid online. Otherwise everything is smooth. Expect to turn on the analysis queue.

  • WT2:
    • last week(s): Disk failure yesterday - had to shut down FTS channels for a while; fixed. Installing additional storage next week.
    • this week: New Dell servers - three are in production; almost immediately got disk failures (cable, enclosure, ...); working on double-cabling. Lost some data; a 16-hour downtime was required. During the holiday we were running well - reached 2000 analysis jobs. Gradually adding more storage; some power work to do before turning on two servers. These were Dell MD1000 shelves with PERC6 controllers.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at the sites, etc.
  • last meeting(s)
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now a cron job will be run to update the PFC at the sites.
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
    • LHC accelerator outage next week, Monday-Thursday.
  • this week


-- RobertGardner - 07 Sep 2010
