
MinutesJune23

Introduction

Minutes of the Facilities Integration Program meeting, June 23, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Aaron, Dave, Torre, Charles, Nate, Michael, Mark, Sarah, Rik, Saul, Fred, Patrick, Booker, Kaushik, Wei, Hiro, John Brunelle, Karthik, Armen, Tom
  • Apologies: Shawn, Bob, Jason, John DeStefano

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Updates to SiteCertificationP13 for fabric upgrade plans and Squid updates - due
      • WLCG data: WLCGTier2May2010
      • WLCG storage workshop (Amsterdam, June 16-18): _General registration for the Jamboree on the Evolution of WLCG Data & Storage Management is now open. The event page can be found at http://indico.cern.ch/conferenceDisplay.py?confId=92416._
      • OSG Storage Forum, September 21-22, at the University of Chicago
      • ATLAS Tier 3 meeting at Argonne, June 8-9
      • LHC - intervention next week;
      • Space crunch .. already
    • this week
      • Quarterly reports are coming due - end of quarter - see site certification matrix
      • Production is quite low but analysis is going well: 56K jobs completed in the US in the last day, the largest fraction among the clouds.
      • Machine: the 10 days of beam commissioning should be finished, but everyone is awaiting stable beams. Over the weekend a decision was made to restart data exports, and a little data is starting to arrive.
      • Expect new data at any time.
      • Expect another reprocessing campaign in July - unknown scale
      • WLCG Jamboree on data management last week - a brainstorming meeting with lots of information presented: requirements, technology providers (e.g. NFS v4.1), ROOT performance issues. New ideas were presented about content delivery networks and utilizing P2P. Resilient data access - failover when files are not found in the storage system. Possible demonstrator projects - there will be a follow-up meeting on July 9 at Imperial College.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here

last week(s):

this week:
  • The meeting on June 8-9 was well attended; everyone who is funded sent a representative. The money has not appeared yet, so only a few places have started working. Everyone is up to speed on which services will be deployed.
  • Instructions for Tier 3 setup are in progress - hope to finish next week.
  • Still questions about how to distribute data to Tier 3s - will organize this in the next week or so.
  • The Bellamine group is funded for a large T3. They are working on a configuration with 64 nodes and 220 TB of disk, and have questions about the file system setup.

Operations overview: Production and Analysis (Kaushik)

  • this week:
    • Out of production jobs - no prospect of getting any anytime soon. Waiting for a new release.
    • User analysis looks good - job distribution is doing well.
    • The Panda data distribution service (PD2P) has been in production for the past week. Panda makes subscriptions to DATADISK and MCDISK; 300 subscriptions have been made. Everything is working well, with no increased latency, and the backlog is not too bad. Will do a deep data deletion at sites, since this mechanism handles things well and uses much less space. (A rough sketch of the idea follows this list.)
    • Still a need for re-brokerage.
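
    A rough sketch of the PD2P idea mentioned above (illustration only, not the actual PanDA code; the helper functions, dataset and site names below are hypothetical): subscribe a dataset to a Tier 2 space token the first time it is needed and no Tier 2 replica exists yet.
      # Hypothetical sketch of dynamic data placement: subscribe on first use.
      # None of these names are the real PanDA/DQ2 API.
      def maybe_subscribe(dataset, tier2_sites, replicas, subscribe):
          """Subscribe the dataset to one Tier 2 if no Tier 2 replica exists yet."""
          held_at = replicas.get(dataset, set())
          if any(site in tier2_sites for site in held_at):
              return None                                  # already at a Tier 2
          candidates = [s for s in tier2_sites if s not in held_at]
          if candidates:
              subscribe(dataset, candidates[0] + "_DATADISK")   # e.g. a DQ2 subscription
              return candidates[0]
          return None

      # Example with made-up data:
      replicas = {"data10_7TeV.example.dataset": {"BNL-OSG2"}}
      maybe_subscribe("data10_7TeV.example.dataset",
                      ["MWT2_UC", "AGLT2", "SWT2_CPB"], replicas,
                      subscribe=lambda ds, site: print("subscribe", ds, "->", site))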

Data fullness (Saul)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=98432
    
    1)  6/11: MWT2_UC DDM errors such as:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries]
    From Sarah:
    It looks like PNFS failed at 12:01 AM, coinciding with the start of a pnfsDump run. SRM failed at 4 AM, due to filling memory with PNFS requests. We're still investigating the causes of the PNFS failure.  We're looking at other options, like Chimera and better hardware.  ggus 58966 (closed), eLog 13676.
    2)  6/11: From Bob at AGLT2: Had an NFS problem.  Noted too late.  Nearly all running jobs lost.  Problem now repaired, should be OK shortly.  The next day (Saturday, 6/12) there were still some stale condor jobs on the BNL submit host -  some of these were removed, 
    others eventually cleared out of the system.  eLog 13718.
    3)  6/12: Job failures at BNL - example from the pilot log:
    12 Jun 09:00:55|Mover.py | !!FAILED!!2999!! Error in copying (attempt 1): 1099 - dccp failed with output: ec = 256,
    12 Jun 09:00:55|Mover.py | !!FAILED!!2999!! Failed to transfer log.144950._567122.job.log.tgz.1: 1099 (Get error: Staging input file failed).  Issue resolved, ggus 58986 (closed), eLog 13708.
    4)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
    The CS room network was still messed up by a power cut and the main servers are still not accessible. Now I drag out some other machines to start these services. I hope I can get the main server as soon as possible.  Savannah 115123 (open), eLog 13790.
    5)  6/13-14: Power problem at AGLT2 - from Shawn:
    The primary input circuit breaker on our APC 80kW UPS system had tripped at 4:10 AM. Apparently the last of the batteries ran out around 5:10 AM.  System breaker was reset around 1:15 PM. Still checking that all systems have come back up in an operational way.  
    Test jobs successful, site back to on-line.  
    ggus 59002 & RT 17217 (closed), eLog 13770.
    6)  6/14 - 6/16: SWT2_CPB - problem with the internal cluster switch stack.  Restored once early Monday morning, but the problem recurred after ~ 4 hours.  Working with Dell tech support to resolve the issue.  ggus 59006 & RT 17220 (open), eLog 13776.
    7)  6/15: DB Release 10.9.1 was corrupted, resulting in large numbers of failed jobs.  Updated version released.  Savannah 68831.
    8)  6/15: BNL FTS was upgraded to v2.2.4 (Hiro).
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    (ii)  6/3: AGLT2 - DDM errors, for example:
    [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].  From Shawn:
    The queue problem has not returned (this refers to ggus 58720). However we have an open ticket with OSG-storage about trying to resolve this type of problem: dCache: "[CONNECTION_ERROR]" from FTS to our dCac... ISSUE=8630 PROJ=71.  Once we hear more we can update this ticket.  
    ggus 58772 (in progress), eLog 13428.
    Update, 6/9: No recent errors of this type seen - ggus 58772 closed.
    (iii)  6/4: Hiro announced an auto-shutoff mechanism for FTS channels with high failure rates (a sketch of the counting rule follows at the end of this report).
    Update, 6/11 (from Hiro):
    The auto-shutoff has been modified to check the source error as well.  The threshold error rate is 150 errors per 10 minutes, and only the following three errors are counted in the check:
    1.  No space left error.
    2.  Destination error.
    3.  Failed to contact remote SRM.
    The shutoff should only happen when the SE is really not working, and it should not casually turn off a channel.
    (iv)  6/5: IllinoisHEP - jobs failing with the error (from the pilot log):
    |Mover.py | !!FAILED!!3000!! Exception caught: Get function can not be called for staging input files: \'module\' object has no attribute \'isTapeSite\'.  ggus 58813 (in progress), eLog 13468.
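
    A minimal sketch of the counting rule described in item (iii) above, using the stated threshold of 150 counted errors per 10 minutes (the function and data layout are assumptions, not the actual FTS/DQ2 shutoff code):
      # Sketch of the auto-shutoff rule: count only the three listed error
      # types over a 10-minute window and shut the channel off when the
      # count reaches 150.  Illustrative only, not the real implementation.
      import time

      COUNTED_ERRORS = ("No space left", "Destination error",
                        "failed to contact remote SRM")
      THRESHOLD = 150        # errors
      WINDOW = 10 * 60       # seconds

      def should_shutoff(channel_errors, now=None):
          """channel_errors: list of (timestamp, message) for one FTS channel."""
          now = time.time() if now is None else now
          recent = [msg for ts, msg in channel_errors if now - ts <= WINDOW]
          counted = sum(any(e in msg for e in COUNTED_ERRORS) for msg in recent)
          return counted >= THRESHOLD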
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=99202
    
    1)  6/16: OU_OCHEP_SWT2 DDM transfers failing with:
    AGENT error during ALLOCATION phase: [CONFIGURATION_ERROR].
    From Horst: Bestman had hung up for some reason - restarted.  ggus 59114 & RT 17254 (both closed), eLog 13819.
    2)  6/16 - 6/17: AGLT2 - bad disk in one of the RAID arrays causing DDM transfer errors.  From Bob:
    Same disk shelf as last night failed again, same disk.  Off line from 8am-11:40am EDT.  Disk removed from array and system rebooted.  srmwatch looks good since reboot.  Replacement disk for RAID-6 array due here tomorrow.
    3)  6/17: Job failures at NET2:
    Error details: pilot: Too little space left on local disk to run job: 1271922688 B (need > 2147483648 B).  Unknown transExitCode error code 137.  From Saul:
    This is a low-level problem that we know about.  It's caused by the local scratch space used by some production jobs gradually increasing over time, which creates a problem on some of our nodes with small scratch volumes.
    We're working on a solution and are watching for these in the meantime.  ggus 59145 (closed), eLog 13827.  (A sketch of this disk-space check follows item 6 below.)
    4)  6/18: NET2_DATADISK, NET2_MCDISK - DDM errors like:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From John & Saul:
    We had a 1 hour outage to upgrade one of our GPFS volumes. All systems are back and files are arriving now.  ggus 59203 (closed), eLog 13855.
    5)  6/19 - 6/20: NET2 - "No space left on device" errors at NET2_DATADISK & MCDISK.  From John & Saul:
    There has been a big burst of data arriving at NET2 and our DATADISK and MCDISK space tokens have run out of space. Armen and Wensheng 
    have been helping us with this today, but since we can write data very fast, our space tokens can fill up very quickly.  There is far more subscribed than free space, so we need some DDM help.  ggus 59220 (in progress), eLog 13888/90.
    6)  6/22: Upgrade of core network routers at BNL completed.  No impact on services.  eLog 13941.
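
    As an aside on item 3 above, the underlying check is just a comparison of free scratch space against a ~2 GiB minimum (the threshold is taken from the quoted error message; the function below is a sketch, not the actual pilot code):
      # Sketch of a pilot-style local disk check behind the error in item 3:
      # refuse to run the job if free scratch space is at or below ~2 GiB.
      import shutil

      MIN_FREE_BYTES = 2147483648   # 2 GiB, as quoted in the error message

      def check_scratch_space(path):
          free = shutil.disk_usage(path).free
          if free <= MIN_FREE_BYTES:
              raise RuntimeError("Too little space left on local disk to run job: "
                                 "%d B (need > %d B)" % (free, MIN_FREE_BYTES))
          return free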
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- 
    and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro).  As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
    (ii)  6/5: IllinoisHEP - jobs failing with the error (from the pilot log):
    |Mover.py | !!FAILED!!3000!! Exception caught: Get function can not be called for staging input files: \'module\' object has no attribute \'isTapeSite\'.  ggus 58813 (in progress), eLog 13468.
    Update, 6/21, from Dave at Illinois:
    I believe this problem has been solved.  The problem was due to the DQ2Clients package in the AtlasSW not being properly updated at my site.  Those problems have been resolved and the package is now current.  ggus ticket closed.
    (iii)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
    The CS room network was still messed up by a power cut and the main servers are still not accessible. Now I drag out some other machines to start these services. I hope I can get the main server as soon as possible.  Savannah 115123 (open), eLog 13790.
    (iv)  6/14 - 6/16: SWT2_CPB - problem with the internal cluster switch stack.  Restored once early Monday morning, but the problem recurred after ~ 4 hours.  Working with Dell tech support to resolve the issue.  ggus 59006 & RT 17220 (open), eLog 13776.
    Update, 6/17: one of the switches in the stack was replaced, and this appears to have solved the problem, as the stack has been stable since then.  ggus and RT tickets closed.
    
    
    • Not much production over the past week.
    • OU: need scheddb updates, getting closer.

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • NetworkMonitoring
    • Biweekly meeting yesterday. perfSONAR issues discussed - a new RH-based version is coming mid-August. Probably won't have another Knoppix release.
    • Patch for display issues available.
    • Transaction tests and plotting - to be covered next time; alerts as well.
    • New hardware platform for perfSONAR - 10G capable
    • Minutes from meeting:
      USATLAS Throughput Meeting Notes --- June 15, 2010
      ==================================================
      
      Attending: John, Philippe, Jason, Sarah, Andy, Dave, Aaron, Charles, Hiro, Karthik
      Excused:
      
      1) perfSONAR status:  a) New hardware install status report about possible future perfSONAR platform.  It is in place at AGLT2 and installed with current version.  Shawn will send out details to Internet2/ESnet developers later today.   b) Comments on speed and usability of current software on KOI systems:  very problematic because system slowness makes it painful to examine results.   Jason said some effort is being made to isolate the source of the problem.   Possibly non-optimal SQL queries are the problem.   Being investigated.
      2) Transaction tests:   a) Nothing new yet.  Once URL with results is ready Hiro will make it available.
      3) perfSONAR alerting:  a) Nothing new from Shawn or Sarah,  b) update on API...Jason, will see if API fixes will make it into the next version.   Will be important to have this for future "set and forget" capability.
      4) Site reports:
         a) MWT2:  Firmware upgrade on switch/router (Cisco 6509).   Looks like it worked. Unable to cause the problem using the same test (Iperf across the backplane using two sets of src/destinations).  
         b) BNL:  Problem with poor performance a few weeks ago (which was routed around) turned out to be a bad 10GE card.  Seeing overruns incrementing on the bad card. Possible "diags schedule" command pointed out by Aaron (UC). 
         c) OU:  Still problem with throughput service going to "Not Running".   Aaron will contact Karthik to see if DB related. 
         d) Internet2/ESnet: New CentOS based perfSONAR LiveCD version maybe ready for alpha testing by mid-July.   Possibility in the future to install to disk and use YUM to update system.  Sites can choose to still run from CD.   
         e) AGLT2:  Brief report on last week's Transatlantic Networking Workshop.  URL for it: http://indico.cern.ch/conferenceTimeTable.py?confId=88883#20100611.detailed  
      Planning to meet again in two weeks.  
      Please send along corrections or additions to the mailing list.
      Thanks,
      Shawn
    • Next perfSONAR release candidate expected mid-July
    • News from networking meeting - ATLAS most likely moving to a flatter data distribution model (more like CMS)
    • Meeting link: http://indico.cern.ch/conferenceTimeTable.py?confId=88883#20100611.detailed

  • this week:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • Fred has tested squids everywhere except NET2 where firewall issues are being worked out.
    • OU still needs setup and testing
    • Local setup files from each Tier 2 are being compared against ToA
    • Fail-overs should work, but Wei notes there may be a firewall issue when failing over from WT2 to SWT2.
    • Problem at BNL launch-pad due to tomcat/java issue
    • Fred will review configuration files everywhere
  • this week
    • testing
      Site        Squid Installed   Squid Works   Fail-Over Works
      AGLT2       Yes               Yes           Yes
      ANL T3      Test jobs on analy queue don't start
      BNL         Yes               Yes           No failover
      Duke        Missing 15.8.0
      Illinois    Test jobs on build failed / don't start
      MWT2 IU     Yes               Yes           Not tested
      MWT2 UC     Yes               Yes           Yes
      NET2 BU     Yes               Yes           Yes
      NET2 HU     Test jobs on analy queue don't start
      SWT2 CPB    Yes               Yes           No
      SWT2 UTA    Test job build failed
      SWT2 OU     No                Still upgrading hardware
      WT2         Yes               Yes           Yes
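
      A minimal sketch of the kind of fail-over check behind the last column (the URL and proxy address are placeholders; the real tests go through the Frontier client against each site's squid):
        # Sketch: fetch a Frontier/launchpad URL through the local squid and
        # fall back to a direct connection if the proxy fails.  The URL and
        # proxy address below are placeholders, not the actual test endpoints.
        import urllib.request

        def fetch_with_failover(url, proxy="http://squid.example.edu:3128"):
            try:
                via_proxy = urllib.request.build_opener(
                    urllib.request.ProxyHandler({"http": proxy}))
                return via_proxy.open(url, timeout=30).read(), "via squid"
            except OSError:
                # Proxy unreachable or timed out: retry bypassing any proxy.
                direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
                return direct.open(url, timeout=30).read(), "direct"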
      

Release installation, validation (Xin)

Issues around the validation process, completeness of releases at sites, etc.
  • last meeting
    • BNL available for testing. OU will also be available soon. Will keep bugging Alessandro.
    • Testing underway at OU and UTD, seems successful
    • Migration should be transparent to production: the new installation will have a different directory structure, with symlinks provided for compatibility (a rough sketch follows this list)
    • Will do this one site at a time, starting with BNL
    • Will start with the next release/cache.
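
    A rough sketch of the compatibility step mentioned above (both directory layouts are made-up examples, not the actual paths used by the installation system):
      # Hypothetical sketch: expose a release installed under the new layout
      # at its old path via a symlink, so existing jobs keep working.
      # OLD_BASE and NEW_BASE are assumed paths, not the real ATLAS layout.
      import os

      OLD_BASE = "/osg/app/atlas_app/atlas_rel"
      NEW_BASE = "/osg/app/atlas_app/new_layout"

      def link_release(release):
          old_path = os.path.join(OLD_BASE, release)
          new_path = os.path.join(NEW_BASE, release)
          if not os.path.lexists(old_path):
              os.symlink(new_path, old_path)   # old name -> new location
          return old_path

      # e.g. link_release("15.6.9") makes the old path point at the new install.
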
  • this meeting:
    • Fred discovered problems with installed releases during his tests; site admins weren't aware. Request to notify sites if there are failures, so installations can be retried or the site admin informed.
    • Make a special request to get an email if there are failures.

Site news and issues (all sites)

  • T1:
    • last week(s): Planning to upgrade Condor clients on worker nodes next week. Force 10 code update. Continue testing DDN.
    • this week:

  • AGLT2:
    • last week: Power issue over the weekend - the 80 kW Symmetra UPS input breaker tripped; down for several hours, then recovered; resolved on Monday. A 2-second interruption today was covered by the UPS.
    • this week: Everything at AGLT2 seems to be going well right now, knock on wood. We are running a full complement of analysis jobs, which seem to be doing fine, although typically once per day we get no auto-pilots for an hour or two at a time, enough that the running analysis job count takes a big dip. The auto-pilot dearth actually hits us several times per day, but only one or two of those episodes typically result in the dip in running jobs. In the long run, the cause may be worth investigating.

  • NET2:
    • last week(s): Will go offline for a firmware upgrade on GPFS servers. Racking and stacking storage.
    • this week: The fullness problem was solved last week by Kaushik, Armen and Wensheng (http://atlas.bu.edu/~youssef/NET2-fullness/). New storage is being tested while waiting for PDUs. BU-HU networking has been tuned up. Preparing to open the Harvard site to analysis jobs.

  • MWT2:
    • last week(s): Updating the 64 8-core nodes: adding disk and configuring RAID-0, for a 70% I/O improvement. The Cisco is all okay with heavy I/O during pool-to-pool transfers.
    • this week: Disk update for the 64 nodes complete. Xrootd testing (ANALY_MWT2_X), comparison to dCache (ANALY_MWT2). Looking at dCache headnode update.

  • SWT2 (UTA):
    • last week: A problem with the 6248 switch stack. AGLT2 had found a similar problem when switch stack module #7 was added (a known firmware problem). Sometimes the filesystem is lost, or the flash goes bad - sometimes updates fail. Shawn notes problems with version 3 of the firmware.
    • this week: The network stack problem was traced to a bad image on one of the switches in the stack; the switch was replaced and basic operations are fine. Panda+autopilot is creating a very high gatekeeper load, requiring clean-up of the GASS cache and GRAM state area. May need to bring this up with the Condor team - Kaushik notes it seems to happen after downtimes, and Patrick notes it may be a problem with the grid monitor tracking very old jobs. Could use the auto-adjuster.

  • SWT2 (OU):
    • last week: Ready for test jobs, Mark will send some this afternoon.
    • this week: Still waiting for scheddb to be updated, then on to the next step - getting very close to being finished.

  • WT2:
    • last week(s): Working on a storage configuration from Dell. 7 R710s each with 6 MD1000s. Two 8024F switches, a 10G switch - a new offering from Dell.
    • this week: Still having a problem with the NFS server holding ATLAS releases - a hard time installing releases. It's a shared server with another group, so it can't be fixed immediately; will wait.

Carryover issues ( any updates?)

AOB

  • last week
  • this week


-- RobertGardner - 22 Jun 2010
