
MinutesJune30

Introduction

Minutes of the Facilities Integration Program meeting, June 30, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Dave, Rob, Aaron, Charles, Nate, Shawn, Torre, Saul, Armen, Mark, Rik, Doug, Patrick, Sarah, Horst, Karthik, Booker, Hiro, Xin, Michael
  • Apologies: Wei, Fred, Kaushik

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Quarterly reports are coming due - end of quarter - see site certification matrix
      • Production is quite low, but analysis is going well: 56K jobs completed in the US in the last day, the largest fraction.
      • Machine: the 10 days of beam commissioning should be finished, but everyone is awaiting stable beams. Over the weekend a decision was made to restart data exports, and a little data is starting to arrive.
      • Expect new data at any time.
      • Expect another reprocessing campaign in July - unknown scale
      • WLCG Jamboree on data management last week - a brainstorming meeting with lots of information presented: requirements, technology providers (e.g. NFS v4.1), ROOT performance issues. New ideas were presented about content delivery networks and utilizing P2P. Resilient data access - failover when files are not found in the storage system. Possible demonstrator projects - there will be a follow-up meeting on July 9 at Imperial College.
    • this week
      • Site certification table reminder:
        screenshot_01.jpg
      • For the analysis queue benchmark we ran a round of HammerCloud (HC) stress tests across the sites:
      • Athena release: 15.6.6
      • Output DS: user.elmsheus.hc.10000267.*
      • Input DS Patterns: mc09*merge.AOD*.e*r12*
      • Ganga Job Template: /data/hammercloud/atlas/inputfiles/muon1566/muon1566_panda.tpl
      • Athena User Area: /data/hammercloud/atlas/inputfiles/muon1566/MuonTriggerAnalysis_1566.tar.gz
      • Athena Option file: /data/hammercloud/atlas/inputfiles/muon1566/MuonTriggerAnalysis_1566.py
      • ANALY_AGLT2, SWT2_CPB, SLAC, NET2: http://hammercloud.cern.ch/atlas/10000220/test/
      • ANALY_MWT2, MWT2_X, SWT2_CPB, SLAC, NET2, BNL_ATLAS_1: http://hammercloud.cern.ch/atlas/10000241/test/
      • Finding average wall-time efficiencies between 40% and 62%
      • SLAC and SWT2_CPB efficiencies were especially good (in the high 50s percent); a sketch of how these efficiencies can be computed follows below
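      • A minimal sketch of how the wall-time efficiencies above can be computed, assuming per-job CPU and wall-clock seconds are taken from the HammerCloud test pages linked above (the record layout and field names below are illustrative assumptions, not the actual HC export format):

          # Hedged sketch: average wall-time efficiency (CPU time / wall time)
          # per analysis site, aggregated over completed HammerCloud jobs.
          # The {"site", "cpu_sec", "wall_sec"} record layout is assumed for
          # illustration only.
          from collections import defaultdict

          def wall_time_efficiency(jobs):
              """Return {site: total CPU seconds / total wall seconds}."""
              cpu = defaultdict(float)
              wall = defaultdict(float)
              for job in jobs:
                  if job["wall_sec"] <= 0:   # skip jobs with no recorded wall time
                      continue
                  cpu[job["site"]] += job["cpu_sec"]
                  wall[job["site"]] += job["wall_sec"]
              return {site: cpu[site] / wall[site] for site in wall}

          # Made-up numbers: ratios of ~0.55-0.58 correspond to the "high 50s"
          # efficiencies reported above.
          jobs = [
              {"site": "ANALY_SLAC",     "cpu_sec": 5500.0, "wall_sec": 10000.0},
              {"site": "ANALY_SWT2_CPB", "cpu_sec": 2900.0, "wall_sec": 5000.0},
          ]
          print(wall_time_efficiency(jobs))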

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here

last week(s):

  • The meeting on June 8-9 was well attended; everyone who is funded sent a representative. The money has still not appeared, so only a few places have started working. Everyone is up to speed on which services will be deployed.
  • Instructions for Tier 3 setup are in progress - hope to finish next week.
  • Still questions about how to distribute data to Tier 3 sites - will organize this in the next week or so.
  • The Bellarmine group is funded for a large T3. They are working on a configuration with 64 nodes and 220 TB of disk, and have questions about the file system setup.
this week:
  • Busy working on procedures for Tier 3

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

dCache local site mover plans (Charles)

  • Meeting with Pedro and Shawn - to create a unified local site mover

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=99202
    
    1)  6/16: OU_OCHEP_SWT2 DDM transfers failing with:
    AGENT error during ALLOCATION phase: [CONFIGURATION_ERROR].
    From Horst: Bestman had hung up for some reason - restarted.  ggus 59114 & RT 17254 (both closed), eLog 13819.
    2)  6/16 - 6/17: AGLT2 - bad disk in one of the RAID arrays causing DDM transfer errors.  From Bob:
    Same disk shelf as last night failed again, same disk.  Off line from 8am-11:40am EDT.  Disk removed from array and system rebooted.  srmwatch looks good since reboot.  Replacement disk for RAID-6 array due here tomorrow.
    3)  6/17: Job failures at NET2:
    Error details: pilot: Too little space left on local disk to run job: 1271922688 B (need > 2147483648 B).  Unknown transExitCode error code 137.  From Saul:
    This is a low level problem that we know about.  It's caused because the local scratch space used by some production jobs has been gradually increasing over time causing a problem on some of our nodes with small scratch volumes. 
    We're working on a solution and are watching for these in the mean time.  ggus 59145 (closed), eLog 13827.
    4)  6/18: NET2_DATADISK, NET2_MCDISK - DDM errors like:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From John & Saul:
    We had a 1 hour outage to upgrade one of our GPFS volumes. All systems are back and files are arriving now.  ggus 59203 (closed), eLog 13855.
    5)  6/19 - 6/20: NET2 - "No space left on device" errors at NET2_DATADISK & MCDISK.  From John & Saul:
    There has been a big burst of data arriving at NET2 and our DATADISK and MCDISK space tokens have run out of space. Armen and Wensheng 
    have been helping us with this today, but since we can write data very fast, our space tokens can fill up very quickly.  There is far more subscribed than free space, so we need some DDM help.  ggus 59220 (in progress), eLog 13888/90.
    6)  6/22: Upgrade of core network routers at BNL completed.  No impact on services.  eLog 13941.
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- 
    and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro).  As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
    (ii)  6/5: IllinoisHEP - jobs failing with the error (from the pilot log):
    |Mover.py | !!FAILED!!3000!! Exception caught: Get function can not be called for staging input files: \'module\' object has no attribute \'isTapeSite\'.  ggus 58813 (in progress), eLog 13468.
    Update, 6/21, from Dave at Illinois:
    I believe this problem has been solved.  The problem was due to the DQ2Clients package in the AtlasSW not being properly updated at my site.  Those problems have been resolved and the package is now current.  ggus ticket closed.
    (iii)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
    The CS room network was still messed up by a power cut and the main servers are still not accessible. Now I drag out some other machines to start these services. I hope I can get the main server as soon as possible.  Savannah 115123 (open), eLog 13790.
    (iv)  6/14 - 6/16: SWT2_CPB - problem with the internal cluster switch stack.  Restored once early Monday morning, but the problem recurred after ~ 4 hours.  Working with Dell tech support to resolve the issue.  ggus 59006 & RT 17220 (open), eLog 13776.
    Update, 6/17: one of the switches in the stack was replaced, and this appears to have solved the problem, as the stack has been stable since then.  ggus and RT tickets closed.
    
    • Not much production over the past week.
    • OU: need scheddb updates, getting closer.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=99883
    
    1)  SWT2_CPB: beginning on 6/22, issues with the gatekeeper being heavily loaded, apparently due to activity from the Condor grid monitor agents.  Ongoing discussions with Xin and Jamie to diagnose the problem.
    2)  6/23: SLAC - ~half-day outage to patch and restart a NFS server hosting ATLAS releases.  Completed.  eLog 13990.
    3)  6/24: SLAC - DDM errors such as:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_
    ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries]
    From Wei:
    We have blacklisted SLACXRD_GROUPDISK because it ran out of disk space. However, the effect will not be immediate. Our SRM will likely be overrun once a while due to large number of srmGetSpaceTokens and srmGetSpaceMetadata in short period. This is because there are still large number of transfer 
    requests already in DQ2 SS queues for SLACXRD_GROUPDISK.  
    ggus 59384 (closed), eLog 13991.
    4)  6/25: Job failures at NET2 with errors like:
    25 Jun 14:24:09 | /atlasgrid/Grid3-app/atlas_app/atlas_rel/15.8.0/cmtsite/setup.sh: No such file or directory
    25 Jun 14:24:09 | runJob.py   | !!WARNING!!2999!! runJob setup failed: installPyJobTransforms failed
    From Saul:
    I suddenly realized that I probably caused this problem my own self.  Around the time of the errors, I was making an egg demo that would have been hitting the release file system pretty hard ( http://egg.bu.edu/15.8.0-egg/index.html ).
    We'll confirm, but there was probably nothing wrong otherwise.  No additional failed jobs observed.
    5)  6/25: BNL - file transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
    BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file name. (Users should not put all the metadata of a file in the filename itself.)
    I have contacted the DQ2 developers to limit the length.  
    Savannah 69217, eLog 14016.
    6)  6/26: BNL - jobs stuck in the "waiting" state due to missing file data10_7TeV.00155550.physics_MinBias.recon.ESD.f260._lb0156._0001.1 from dataset data10_7TeV.00155550.physics_MinBias.recon.ESD.f260.  Pavel subscribed BNL, and the files are now available.  Savannah 69223, eLog 14024.
    7)  6/28-29: OU_OCHEP_SWT2 - file transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3
    tries].  From Horst:
    [Our head node, tier2-01, had crashed with a kernel panic.  It's back up now, and a couple of SRM and LFC test commands succeeded fine.  Can you please try again, and close this ticket, if it works again now?]
    No recent errors of this type observed - ggus 59490 & RT 17324 (closed), eLog 14118.
    8)  6/28-29: NET2 - job failures with the error:
    pilot: installPyJobTransforms failed: sh: /atlasgrid/Grid3-app/atlas_app/atlas_rel/15.6.10/cmtsite/setup.sh: No such file or directory.  From Saul:
    One of our sysadmins mistakenly mounted an old copy of the releases area on some of our nodes, so it's not surprising that a recent production cache is missing. These nodes have been taken offline and the correct releases will be remounted. I'll keep the ticket open until these nodes are back with the right mounts.
    Issue resolved, ggus 59439 (closed), eLog 14078.
    9)  6/29: Proxies on all DDM VO boxes at CERN expired.  Proxies renewed as of ~8:30 a.m. CST.  eLog 14109.
    10)  6/29: MWT2_IU maintenance outage completed as of ~4:00 p.m. CST.  From Sarah:
    We will be applying software updates to the MWT2_IU gatekeeper, with the goal of performance and job scheduling improvements.
    11)  6/30: Apparently a recurrence of the "long filename" issue in 5) above.  From Michael:
    The problem is caused by an overload situation of the storage system's namespace component. The likely reason for the high load are user analysis jobs requesting to create files with very long file names. These operations fail and are retried at high rate.  Experts are looking into ways to fix the problem.  
    ggus 59567 (closed), eLog 14141/33.
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro).  As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
    (ii)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
    The CS room network was still messed up by a power cut and the main servers are still not accessible. Now I drag out some other machines to start these services. I hope I can get the main server as soon as possible.  Savannah 115123 (open), eLog 13790.
    (iii)  6/19 - 6/20: NET2 - "No space left on device" errors at NET2_DATADISK & MCDISK.  From John & Saul:
    There has been a big burst of data arriving at NET2 and our DATADISK and MCDISK space tokens have run out of space. Armen and Wensheng have been helping us with this today, but since we can write data very fast, our space tokens can fill up very quickly.  There is far more subscribed than free space, 
    so we need some DDM help.  ggus 59220 (in progress), eLog 13888/90.
    Update, 6/26: additional space made available, MCDISK and DATADISK now o.k.  ggus 59220 closed.
    
    

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • this week:
    • Network asymmetries - is ESnet involved? Dave (Illinois) is investigating; possible issue with a campus switch
    • May need to get path-providers involved
    • Hiro has FTS instrumented so we can get information on transfer times. Will track overhead. Will check concurrency.
    • Alerting - will be trying to get this to be part of the system. Check for deviations from expected behavior.
    • Notes from meeting:
      	USATLAS Throughput Meeting - June 29, 2010
      	==========================================
      
      Attending: Shawn, Andy, Dave, Sarah, Aaron, Saul, Karthik, Hiro, Philippe
      Excused: Jason
      
      0) Action items...reminder about the list from the agenda
      
      1) perfSONAR status:  New Dell R410 is online with current perfSONAR release.   Will be used for testing.  Status of work on improving responsiveness of perfSONAR.   Next release has some SQL improvements optimizing data access.  Currently the next release (V3.2) is under "internal" testing.   A few weeks from now USATLAS can start testing.
      
      Issues for discussion from perfSONAR results:  
      	a) Sarah noted UC has a longstanding issue with outbound larger than inbound (same as OU).   
      	b) Problem with services at OU stopping.  Possibly just a display issue?  That doesn't seem to be the case, since there are matching discontinuities in the data.  Log files have been sent; still awaiting resolution.
           c) Philippe noted that MSU-UC testing shows larger bandwidth "outbound" (900 Mbps MSU-UC vs. 600 Mbps UC-MSU), while for MSU-OU it is the opposite (400 Mbps MSU-OU vs. 800 Mbps OU-MSU).  Checking the UM and MSU perfSONAR nodes shows:
      	from MSU and UM
      	    AGL to UChicago > UChi to AGL
      	    AGL to OU       < OU   to AGL
      	looking from BNL, same thing
      	    BNL to UChicago > UChi to BNL
      	    BNL to OU       < OU   to BNL
      	looking from IU, same thing
      	    IU  to UChicago > UChi to IU
      	    IU  to OU       < OU   to IU
      
           d) David noted that the campus network is slow (perfSONAR found this issue) and the campus was notified; the problem was then identified as a missed firmware upgrade, to be fixed tomorrow.  There is a difference between the ICCN and ESnet paths: the ESnet path shows asymmetry.  Testing to ESnet (Chicago) from Illinois is good, but testing to the next "hop" (Cleveland) shows a big asymmetry (Ill->Cleveland 900 Mbps, but Cleveland->Ill 200-400 Mbps).
      
      2) Transaction rate capability:  Hiro has created new plots of FTS details; follow "Show Plots" in the list below.  A sample test has been done: transfer 1K files of 1 MB each.  The first plot is the
      histogram of transfer time per file, while the second is the histogram of transfer time plus time in the queue.  The second is therefore indicative of SRM performance, although it is mostly controlled
      by the FTS channel parameter for the number of concurrent transfers.
      But you can generally see how many files your site should be able to handle (if the files are small).  A sketch of the per-file overhead calculation appears after these notes.
      
      AGLT2
      http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=8b747104-83a9-11df-9d63-f0c52a5177e3 
      
      UC
      http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=8c56cffa-83ab-11df-9d63-f0c52a5177e3 
      
      SLAC
      http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=f69bca7f-83ad-11df-9d63-f0c52a5177e3 
      
      Other sites will follow (being done right now).  Hiro will set up the test to use the same concurrency at each site, and will also add a plot of the "overhead" time per transfer.
      
      3) Alerting progress:   Shawn or Sarah?     No progress yet.  Andy will send documentation links for using clients to access perfSONAR data (Hiro inquired about how to best access the existing data).
      
      4) Site reports:  Open forum for reports on throughput-related items.  Hiro reports that PanDA will now be subscribing data and may be getting datasets from sources outside the hierarchy, which makes better monitoring and debugging critical.  The default will be to rely on DQ2 to select the source for transfers; this will need watching.
      
      Please send along any corrections or additions to the mailing list.   Next meeting in 2 weeks (July 13).
      
      Shawn
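
    • The per-file "overhead" mentioned in item 2 of the notes can be estimated by comparing pure transfer time with the total time a file spends in the system. A minimal sketch, assuming per-file records with queue and transfer durations in seconds (the record layout is an assumption for illustration, not the FTS monitor's actual schema):

        # Hedged sketch of the two distributions described in item 2 above:
        #   (a) transfer time per file, and
        #   (b) transfer time plus time queued in the FTS channel.
        # The mean of (b) - (a) is the per-file "overhead".  Input format is assumed.
        def overhead_summary(files):
            """files: list of dicts with 'queue_sec' and 'transfer_sec' keys (assumed)."""
            if not files:
                raise ValueError("no transfer records")
            n = len(files)
            mean_transfer = sum(f["transfer_sec"] for f in files) / n
            mean_total = sum(f["transfer_sec"] + f["queue_sec"] for f in files) / n
            return {
                "files": n,
                "mean_transfer_sec": mean_transfer,
                "mean_total_sec": mean_total,
                "mean_overhead_sec": mean_total - mean_transfer,
            }

        # Example: 1000 files of 1 MB each, as in the test described above (made-up times).
        sample = [{"queue_sec": 20.0, "transfer_sec": 2.5} for _ in range(1000)]
        print(overhead_summary(sample))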

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting
    • Fred had discovered problems with installed releases during his tests; site admins weren't aware. Request to notify sites if there are failures, so they can retry or inform the site admin.
    • Make a special request to get an email if there are failures.
  • this meeting:
    • Email alerts for release installation problems (w/ Alessandro's system): https://atlas-install.roma1.infn.it/atlas_install/protected/subscribe.php
    • OU, Illinois-T3, BNL testing
    • OU - waiting for full panda validation
    • Illinois - problems with jobs sent via WMS - lag in status updates
    • BNL - Alessandro tested poolfilecatalog creation - there were problems with the environment; Xin provided a patch.
    • Waiting for feedback from Alessandro

Site news and issues (all sites)

  • T1:
    • last week(s): Planning to upgrade Condor clients on worker nodes next week. Force 10 code update. Continue testing DDN.
    • this week: New worker nodes have arrived and are installed: 247 machines, ~2000 cores. Lots of improvement in DDN performance over the past week; will go forward with this system. Tier 1: 5K jobs, 6 PB of storage. The filename-length issue caused job failures - file names over 199 characters, a PNFS-specific limit. Pedro has fixed this in the local-site-mover script, which now aborts such requests (see the sketch below).
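      • A minimal sketch of the kind of pre-flight check described above, assuming the stage-out wrapper sees the logical file name before contacting PNFS; this is an illustration of the idea, not Pedro's actual local-site-mover code:

          # Hedged sketch: abort a stage-out request whose logical file name
          # exceeds the PNFS limit (199 characters, as reported above) instead
          # of letting it fail and be retried.  Illustrative only.
          MAX_LFN_LENGTH = 199  # limit reported for the BNL dCache/PNFS namespace

          class FilenameTooLongError(Exception):
              pass

          def check_lfn(lfn):
              """Raise if the file name part of the LFN cannot be stored in PNFS."""
              name = lfn.rsplit("/", 1)[-1]   # assume the limit applies to the file name
              if len(name) > MAX_LFN_LENGTH:
                  raise FilenameTooLongError(
                      "file name is %d characters, limit is %d: %s"
                      % (len(name), MAX_LFN_LENGTH, name))

          # Hypothetical call site inside a stage-out path:
          try:
              check_lfn("user.someone." + "x" * 200 + ".pool.root.1")  # over the limit
          except FilenameTooLongError as err:
              print("aborting transfer: %s" % err)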

  • AGLT2:
    • last week: Everything at AGLT2 seems to be going well right now (knock on wood). We are running a full complement of analysis jobs, which seem to be doing fine, although there are times, typically once per day, when we get no auto-pilots for an hour or two, enough that the running analysis job count takes a big dip. The auto-pilot dearth actually hits us several times per day, but only one or two of those episodes typically result in a dip in running jobs. In the long run, the cause may be worth investigating.
    • this week: All going okay for the past week. Debugging the path-too-long issue.

  • NET2:
    • last week(s): The fullness problem was solved last week by Kaushik, Armen and Wensheng (http://atlas.bu.edu/~youssef/NET2-fullness/); new storage is being tested while waiting for PDUs; BU-HU networking has been tuned up; preparing to open the Harvard site to analysis jobs.
    • this week: A production issue last week with old releases getting mounted for a short while. Preparing the machine room for new storage - expect a one-day downtime next Thursday.

  • MWT2:
    • last week(s): Disk update for the 64 nodes complete. Xrootd testing (ANALY_MWT2_X), comparison to dCache (ANALY_MWT2). Looking at dCache headnode update.
    • this week: Testing a static script for the OSG setup, a new scheduling order for analysis jobs, and ANALY_MWT2_X against the xrootd storage system.

  • SWT2 (UTA):
    • last week: The network problem was traced to a bad image on one of the switch stacks; the switch was replaced and basic operations are fine. Panda + autopilot are creating very high gatekeeper load; cleaned up the gass cache and GRAM state area. May need to bring this issue up with the Condor team - Kaushik notes this seems to happen after downtimes, and Patrick notes it may be a problem with the grid monitor tracking very old jobs. Could use the auto-adjuster.
    • this week: Having an issue with the Grid Monitor not reporting correct status, which then gets Condor-G into trouble. Looking at an auto-adjuster for nqueue when there are no jobs (see the sketch below). Preparing to remove files that are not associated with the storage area, and to migrate to another LFC.
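      • A hedged sketch of the nqueue auto-adjustment idea mentioned above: keep only a trickle of pilots queued when a site has no activated jobs, so the gatekeeper and Grid Monitor are not loaded with idle pilots. The function name and thresholds are illustrative assumptions, not the actual autopilot implementation.

          # Hedged sketch of an "auto-adjuster for nqueue": names and thresholds
          # are illustrative, not the real autopilot code.
          def adjusted_nqueue(configured_nqueue, activated_jobs, idle_floor=1):
              """Return how many pilots to keep queued for a site."""
              if activated_jobs == 0:
                  return idle_floor                      # keep a heartbeat pilot only
              return min(configured_nqueue, activated_jobs)

          # Example: with nqueue=50 but no activated jobs, only one pilot is queued.
          print(adjusted_nqueue(50, 0))    # -> 1
          print(adjusted_nqueue(50, 200))  # -> 50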

  • SWT2 (OU):
    • last week: Still waiting for scheddb to be updated, then on to the next step - getting very close to being finished.
    • this week: Working on test jobs.

  • WT2:
    • last week(s): Still having problems with the NFS server holding the ATLAS releases - a hard time installing releases. It's a shared server with another group, so it can't be fixed immediately; will wait.
    • this week: Making progress on new disk acquisition.

Carryover issues ( any updates?)

AOB

  • last week
  • this week


-- RobertGardner - 29 Jun 2010

Attachments


  • screenshot_01.jpg (147.1 K) - attached by RobertGardner, 30 Jun 2010 - 12:43
 