r5 - 17 Sep 2010 - 14:08:27 - MarkSosebee



Minutes of the Facilities Integration Program meeting, Sep 15, 2010
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475


  • Meeting attendees: Rob, Aaron, Nate, Kaushik, Armen, Patrick, Saul, Dave, Karthik, Jim, Fred, Sarah, Michael, Shawn, Justin, John B, Tom, Wei, Doug, Alden, Wensheng, Hiro, Charles
  • Apologies: Bob (broken wrist), Jason, Mark, Horst

Integration program update (Rob, Michael)

  • IntegrationPhase14
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • CVMFS evaluation at the Tier 2 - OU, UC, AGLT2 interested in evaluation. See further TestingCVMFS.
      • October 12-13 next facility workshop.
      • WLCG asking for US ATLAS pledges for 2011, prelim for 2012. ~12K HS06.
      • Open ATLAS EB meeting on Tuesday. Next reprocessing campaign is shaping up. 7 TeV runs will be used. 1B events. October 20 deadline for building the dataset. Nov 26 repro deadline. For next major phys conferences (La Thuile - March 2011). 6-8 weeks for the simulation - mostly at the Tier 2s.
      • For the OSG storage forum we've been asked to summarize Tier 2 storage experiences. I'll send out an email request.
      • Installed capacity reporting. Here is the current (weekly) report:
        This is a report of pledged installed computing and storage capacity at sites.
        Report date: Tue, Sep 07 2010
         #       | Site                   |      KSI2K |       HS06 |         TB |
         1.      | AGLT2                  |      1,670 |     11,040 |          0 |
         2.      | AGLT2_SE               |          0 |          0 |      1,060 |
         Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
                 |                        |            |            |            |
         3.      | BU_ATLAS_Tier2         |      1,910 |      5,520 |        200 |
         4.      | HU_ATLAS_Tier2         |      1,600 |      5,520 |        200 |
         Total:  | US-NET2                |      3,510 |     11,040 |        400 |
                 |                        |            |            |            |
         5.      | BNL_ATLAS_1            |      8,100 |     31,000 |          0 |
         6.      | BNL_ATLAS_2            |          0 |          0 |          0 |
         7.      | BNL_ATLAS_3            |          0 |          0 |          0 |
         8.      | BNL_ATLAS_4            |          0 |          0 |          0 |
         9.      | BNL_ATLAS_5            |          0 |          0 |          0 |
         10.     | BNL_ATLAS_SE           |          0 |          0 |      4,500 |
         Total:  | US-T1-BNL              |      8,100 |     31,000 |      4,500 |
                 |                        |            |            |            |
         11.     | MWT2_IU                |      3,276 |      5,520 |          0 |
         12.     | MWT2_IU_SE             |          0 |          0 |        179 |
         13.     | MWT2_UC                |      3,276 |      5,520 |          0 |
         14.     | MWT2_UC_SE             |          0 |          0 |        250 |
         Total:  | US-MWT2                |      6,552 |     11,040 |        429 |
                 |                        |            |            |            |
         15.     | OU_OCHEP_SWT2          |        464 |      3,189 |        200 |
         16.     | SWT2_CPB               |      1,383 |      4,224 |        670 |
         17.     | UTA_SWT2               |        493 |      3,627 |         39 |
         Total:  | US-SWT2                |      2,340 |     11,040 |        909 |
                 |                        |            |            |            |
         18.     | WT2                    |        820 |      9,057 |          0 |
         19.     | WT2_SE                 |          0 |          0 |        597 |
         Total:  | US-WT2                 |        820 |      9,057 |        597 |
         Total:  | All US ATLAS           |     22,992 |     84,217 |      7,895 |
      • This week's capacity (significant bump):
        This is a report of pledged installed computing and storage capacity at sites.
        Report date: Tue, Sep 14 2010
         #       | Site                   |      KSI2K |       HS06 |         TB |
         1.      | AGLT2                  |      1,670 |     11,040 |          0 |
         2.      | AGLT2_SE               |          0 |          0 |      1,060 |
         Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
                 |                        |            |            |            |
         3.      | BU_ATLAS_Tier2         |      1,910 |      5,520 |        200 |
         4.      | HU_ATLAS_Tier2         |      1,600 |      5,520 |        200 |
         Total:  | US-NET2                |      3,510 |     11,040 |        400 |
                 |                        |            |            |            |
         5.      | BNL_ATLAS_1            |     16,052 |     58,000 |          0 |
         6.      | BNL_ATLAS_2            |          0 |          0 |          0 |
         7.      | BNL_ATLAS_3            |          0 |          0 |          0 |
         8.      | BNL_ATLAS_4            |          0 |          0 |          0 |
         9.      | BNL_ATLAS_5            |          0 |          0 |          0 |
         10.     | BNL_ATLAS_SE           |          0 |          0 |     10,100 |
         Total:  | US-T1-BNL              |     16,052 |     58,000 |     10,100 |
                 |                        |            |            |            |
         11.     | MWT2_IU                |      3,276 |      5,838 |          0 |
         12.     | MWT2_IU_SE             |          0 |          0 |        179 |
         13.     | MWT2_UC                |      3,276 |     10,410 |          0 |
         14.     | MWT2_UC_SE             |          0 |          0 |      1,140 |
         Total:  | US-MWT2                |      6,552 |     16,248 |      1,319 |
                 |                        |            |            |            |
         15.     | OU_OCHEP_SWT2          |      1,389 |      3,189 |        200 |
         16.     | SWT2_CPB               |      1,383 |      4,224 |        821 |
         17.     | UTA_SWT2               |        493 |      3,627 |         39 |
         Total:  | US-SWT2                |      3,265 |     11,040 |      1,060 |
                 |                        |            |            |            |
         18.     | WT2                    |        820 |      9,057 |          0 |
         19.     | WT2_SE                 |          0 |          0 |      1,400 |
         Total:  | US-WT2                 |        820 |      9,057 |      1,400 |
         Total:  | All US ATLAS           |     31,869 |    116,425 |     15,339 |
      • Mid-October face-to-face meeting at SLAC. Make your reservations at the guest house soon. Agenda is being discussed, slowly shaping up.
      • Data management meeting yesterday - note production has ramped up, but we also have a very spiky analysis load. At BNL 1,000 more cores have been shifted into analysis. Please move more resources into analysis.
      • LHC - after completion of technical stop still in development mode. No beam expected before the weekend. Expect ramp up 50 to 300 bunches.
    • this week
      • Rob and Kaushik met with Michael Barnett (ATLAS Outreach Coordinator) last Friday to discuss Tier 2 outreach - suggested creating brochure, perhaps similar to this circa-2008 computing brochure created by Dario, http://pdgusers.lbl.gov/~pschaffner/atlas_computing_brochure.html. Suggestions and contributions welcome.
      • OSG Storage Forum next week @ UC - will be circulating a US ATLAS T2 storage overview set of slides; talks from AGLT2 (Shawn), MWT2 (Aaron & Sarah), WT2 (Wei), OU (Horst), Tier 3 program (Rik), Illinois (Dave), SMU (Justin), data access performance (Charles); should also have time to discuss distributed xrootd services as a remote access infrastructure. Note: there is a proposal to cancel next week's usual facilities meeting.
      • Working on anticipated capacity over the next 5 years
      • Existing and new institutions wishing to create Tier 2's will want to know what the capacities are.
      • LHC technical stop - restarted w/ beam development, increasing the bunch train.
      • Science policy meeting had ATLAS results with full statistics. Our analysis contributions in the facility helped make this happen. Getting a handle on the full range of Higgs studies - comparison w/ Tevatron.
      • Discussion in progress - 2011 run may be extended to reach luminosity goals
      • Note - heavy ions will start on Nov 8, 3-4 weeks, leading into the Christmas shutdown (Dec 6).

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • OSG storage meeting coming up - will get into contact w/ Tanya for Tier 3 talk & requirements.
  • Working on plan to deal with local data management.
  • Most attention at the moment is figuring out funding to Tier 3s. Several sites have ordered their equipment.
  • Would like Panda monitoring to separate out Tier 3 sites, so that they are not monitored along with the Tier 1/2 sites.
this week:
  • Sites are starting to get money and acquire equipment
  • Meeting last week with physics conveners - how analysis will be done, getting analysis examples; validation
  • ADC monitoring meeting - looking at data monitoring in Tier 3. Meeting with ALICE to discuss re-using agents
  • Doug - tagged as ADC technical coordinator
  • Xrootd-VDT on native packaging. Global ATLAS namespace discussion/proposal.
  • Bringing existing Tier 3s into the functional tests. ADC will bring in a new person to improve monitoring for Tier 3 sites.
  • UTD not receiving functional tests for DDM. Want to avoid this - may need to write up a procedure for removing sites.
  • Alden: setting up a Tier 3, but needs more documentation. Doug: send errors to the RT system.
  • Setting up CVMFS server for configuration files. RAL- doing 800 job tests.
  • Node affinity to be used with pcache - Charles will work with Doug on this
  • dq2-FTS testing? None this past week.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Armen: other clouds have the same problem. Armen and Wensheng are deleting when space gets too full. The problem is the service itself. USERDISK has seen the most deletions.
    • Small transfer requests to Tier 3s. Discussed in RAC. Stephane notes cannot control on site by site basis. 1/2 TB is the limit. Requires ACLs to be set correctly in the LFC Tier 3.
    • Same policy for LOCALGROUPDISK.
    • Investigating many problems "awaiting subscriptions". Not an approval process issue.
    • Discussed approval process - VOMS-based approval for a local site data manager. This is the long-term plan.
  • this week:
    • deletion rate issues, but overall everything looks healthy
    • all in good shape
    • Michael - at the Tier 1, ADC notified us of a missing library, related to Oracle. A communication issue between the facility and ADC: ATLAS depends on underlying OS components, and the change was not documented. Would be good to have a page.
    • There used to be a webpage, but it's outdated; also there have been multiple lists.
    • Saul: there are many libraries required - more than 200 - not formally specified anywhere; they can be enumerated by running ldd on everything in the release.
    • Would like an official page for this - we need to address it.
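The ldd enumeration Saul describes can be scripted; a minimal sketch (the function names and release path here are hypothetical illustrations, not an official tool):

```python
"""Sketch: enumerate the shared-library dependencies of a release tree
by running ldd over every binary and .so file, as suggested above."""

import os
import subprocess

def parse_ldd_output(text):
    """Extract library names from ldd output lines such as
    'libssl.so.10 => /lib64/libssl.so.10 (0x...)'; lines without
    '=>' (vdso, dynamic loader) are skipped."""
    libs = set()
    for line in text.splitlines():
        line = line.strip()
        if "=>" in line:
            libs.add(line.split("=>")[0].strip())
    return libs

def release_dependencies(release_dir):
    """Walk a release tree and collect every library name ldd reports."""
    deps = set()
    for root, _dirs, files in os.walk(release_dir):
        for name in files:
            path = os.path.join(root, name)
            # only inspect shared objects and executables
            if not (name.endswith(".so") or os.access(path, os.X_OK)):
                continue
            proc = subprocess.run(["ldd", path],
                                  capture_output=True, text=True)
            if proc.returncode == 0:
                deps |= parse_ldd_output(proc.stdout)
    return deps
```

A sorted dump of `release_dependencies("/path/to/release")` would be a starting point for the official dependency page discussed above.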

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting: not available this week (Yuri just back from vacation)
    1)  8/25: DDM failures on WISC_GROUP token - "cannot continue since no size has been returned after PrepareToGet or SrmStat."  Issue resolved.  ggus 61572 closed, eLog 16245.
    2)  8/26: ggus 60982 (transfer errors from SLACXRD_USERDISK to SLACXRD_LOCALGROUPDISK) was closed.  eLog 16268.
    3)   8/27: SWT2_CPB_USERDISK - several files were unavailable from this token.  Issue resolved - from Patrick:
    Sorry for the problems. The Xrootd system was inconsistent between the SRM and the disk contents. The issue has been resolved and the files are now available via the SRM.  (A 4 hour maintenance outage was taken 8/27 p.m. to fix this problem.)  
    ggus 61611 & RT 18043 closed, eLog 16349.
    4)  8/28: From Wei at SLAC: We will take WT2 down on Monday Aug 30, 10am to 5pm PDT (5pm - 12am UTC) for site maintenance and to bring additional storage online.  Maintenance completed on 8/30 - test jobs successful, 
    queues set back on-line as of ~1:30 p.m. CST.  eLog 16453.
    5)  8/29: BNL DDM errors - T0 exports failing to BNL-OSG2_DATADISK.  Initially seemed to be caused by a high load on pnfs01.  Problem recurred,
    from Jane: pnfs was just restarted to improve the situation.  ggus 61627 & goc #9155 both closed; eLog 16360/77.
    6)  8/30: BNL - upgrade of storage servers' BIOS - from Michael:
    This is to let shifters know that at BNL we are taking advantage of the LHC technical stop and conducting an upgrade of the BIOS of some of our storage servers. This maintenance is carried out in a "transparent" fashion meaning the servers
    will briefly go down for a reboot (takes ~10 minutes).  As files requested during that time are not available there may be a few transfer failures due to "locality is unavailable."  Maintenance completed - eLog 16419.
    7)  9/1: BNL - Major upgrade of the BNL network routing infrastructure that connects the US Atlas Tier 1 Facility at BNL to the Internet.  Duration: Wednesday, Sept 1, 2010 9:00AM EDT to 6:00PM EDT.  See eLog 16476.
    Follow-ups from earlier reports:
    (i)  8/18: WISC - DDM transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  
    Resolved as of 8/23, ggus 61283 closed, eLog 16167.  (Note: ggus 61352 may be a related ticket?)
    As of 8/31 ggus 61352 is still 'in progress'.
    (ii)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w 
    releases to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.
    (iii)  8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out."  Issue with a dCache pool - from Pedro (8/25):
    We still have a dcache pool offline.  It should be back by tomorrow noon.  In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt.  ggus 61338/355, eLog 16154/230.
    As of 8/31 ggus 61338 'solved', 61355 'in progress'.
    (iv)  8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK.  From Wei:
    Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish.  ggus 61537 in progress, eLog 16223/33.
    Update, 8/31: a bad hard drive was replaced - issue resolved.  ggus 61537 closed,  eLog 16327.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  9/9 - 9/10: SLACXRD_DATADISK SRM access issues:
    failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries].
    Issue resolved - server outage, returned to service.  No additional transfer errors.  ggus 61955 closed, eLog 16784.
    2)  9/9: MWT2_X_MCDISK transfer errors from sources MWT2_IU_HOTDISK and MWT2_UC_HOTDISK.  Issue resolved - from Sarah:
    It looks like the sudo config on our bestman door was overwritten during config updates. I've restored the config, and these transfers should start to succeed.  ggus 61965 closed, eLog 16786.
    3)  9/10: Slow pilot submission to MWT2 - from Aaron:
    We've noticed that we're draining again at MWT2, and we don't seem to be receiving any new pilots to UC. I understand last time we brought this up, we had another autopilot submitter set up against ANALY_MWT2. 
    I would imagine that ought to be enough, but as of sometime at the end of thursday we've not seen more than a handful of analysis pilots sent to our site.  Resolved - from Xin:
    It took condor too long to submit a new pilot on gridui12. One site accumulated too many pilots, most of them are zombies, after some cleanup, the submitting time is back to normal now.
    Problem recurred on 9/14 - from Xin:
    After the cleanup, the submit time is cut to the normal level, the pilot flow should ramp up now.  I will talk with condor team tomorrow on further looking into the deep reason, and will also add a monitor 
    so that we can catch such cases earlier.
    4)  9/11: WISC - old ggus ticket 61692 (opened 8/31) related to transfer errors at the site resolved and verified.  Modifications to the site xrootd system seemed to be the fix. 
    5)  9/11: ggus 61877 (failed file transfers from BNL-OSG2 spacetokens to UPENN_LOCALGROUPDISK - opened 9/7) solved.  The discussion thread in the ticket somehow morphed into the SRM issue at UTD-HEP (and other topics).  
    UTD issue is covered in the follow-ups section below.
    6)  9/12: All running mc10 simulation tasks were aborted due to detector geometry problems.  eLog 16901.
    7)  9/13: IllinoisHEP - 7 jobs from urgent task 167289 were failing with the error:
    pilot: Get error: Remote and local checksums (of type adler32) do not match for...(filename).  Issue resolved - from Dave:
    A file was copied to a worker node with a checksum error.  Since we use pcache, all jobs on that worker node which
    used this file failed with a checksum error (8 jobs in total).  That file has been removed and all jobs seem to be working correctly.  ggus 62932 closed.
    8)  9/13: BNL - "lost heartbeat" errors.  Problem with a worker node - from Xin:
    The lost heartbeat jobs landed on a worker node which was giving bus errors.  This node is being taken out of the batch system now. Checking BNL_ATLAS_1 jobs for the last 12 hours, the failure rate is very low, so this 
    doesn't look like any site-wide problem.  ggus 62033 closed.
    9)  9/13: AGLT2 - large number of failed jobs from task 167289 with "lost heartbeat" errors.  From Bob:
    The osg home NFS volume was mis-behaving from approximately 14:30-19:00 EDT on Sunday, 9/12. A system power cycle at the latter time appears to have resolved this issue. As far as I can tell, 
    all lost heartbeat failures from this task were reported when the volume came back online and resumed normal behavior.  ggus 62035 in-progress.
    10)  9/14: AGLT2 - Jobs were failing with LFC errors like:
    14 Sep 14:43:54|SiteMover.py| !!FAILED!!2999!! lfc-mkdir failed: LFC_HOST=lfc.aglt2.org
    From Shawn: There was a certificate problem with the LFC service that was fixed about 25 minutes ago. The hostcert was
    updated back on September 8th but we missed the service cert. Should be OK now.  eLog 16940.
    11)  9/14: UTA_SWT2: job failures due to stage-in errors.  Issue resolved - from Patrick:
    These failures were coming from one particular compute node in the cluster. The node has been removed from the
    local batch system and is being diagnosed. The affected jobs should rerun without this error in the future.  ggus 62068, RT 18157 closed.  eLog 16923/48.
    12)  9/14: Very large number of "Setupper._setupSource()" (DDM) errors across most clouds, due to central catalogs issue during the early a.m. (US time).  Problem solved - eLog 16919.
    Follow-ups from earlier reports:
    (i)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  
    Still waiting for a complete set of ATLAS s/w releases to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    Update 9/13: Thanks to Rod Walker for necessary updates to ToA.  Analysis site should be almost ready for testing.
    (ii)  9/7: UTD-HEP - SRM access errors.  Site blacklisted pending resolution of the problem.  Savannah 116654, ggus 61895/77.
    Update 9/11: Re-start of SRM service resolved the transfer problem.  Test jobs successful, back to on-line.  ggus ticket 61895 closed, eLog 16831. 
    Update 9/12: Some high-priority production jobs submitted over the weekend were failing at UTD-HEP with "lost heartbeat" errors.  Site was temporarily blacklisted, then returned to production.  
    Lost heartbeats are an intermittent issue at the site.  Savannah 116714, eLog 16880/909, ggus 62059.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • All US sites, including T3s, are now in the CERN BDII. Also, there is an issue with deletion of production data at BNL (at least) by the central deletion service - this could be a really bad bug. It is under investigation by Vincent. Also, central deletion has completely halted for US sites (including USERDISK) for an unknown reason. Experts have been notified. In the near(?) future, the deletion service will move to BNL instead of being located at CERN.
    • Armen: other clouds have the same problem. Armen and Wensheng are deleting when space gets too full. The problem is the service itself. USERDISK has seen the most deletions.
    • On vacation - working on data deletion
  • this meeting:
    • Site services seem to be fine; a change made for Wisconsin, to fix site name.
    • BNL disappearing from LCG BDII - found the cause: duplicated entries in the BNL GUMS. A modified VO name in GUMS did not meet the spelling requirements.
    • Tier 3- make sure OIM entries are updated. Communicating these to Doug.
    • Is data deletion proceeding according to expectations? Armen: can it be sped up? It will probably finish towards the end of the week - another 30K by then. Hiro: should be able to double the rate.
    • Hiro believes there are problems - the service is leaving files behind.
    • Working on AGLT2 list.

libdcap & direct access (Charles)

last meeting:
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • newline
  • buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
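The mover limit mentioned above is set per pool from the dCache admin interface; a hedged sketch (host, port, and pool cell name are illustrative - consult the dCache admin guide for your version):

```
# connect to the dCache admin shell (host and port are site-specific)
ssh -p 22224 admin@dcache-head.example.org

(local) admin > cd pool_w01              # hypothetical pool cell name
(pool_w01) admin > mover set max active 1000
(pool_w01) admin > save                  # persist in the pool setup file
```

Without `save` the setting is lost on the next pool restart.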

this meeting:

  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion into the release
  • No new failure modes observed

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Phone meeting notes:
    • From Jason:
      • We are still actively trying to figure out what is causing the performance problems between the new release and the KOI hardware
      • The final release will be pushed back until we can fix this; RC3 will be out this week.
      • Still not recommending that the USATLAS Tier2s adopt it yet. Working with BNL/MSU/UM on testing solutions.
    • Meeting scheduled for next week.
    • Jason: RC3 will be out this week; hold off on updates; seeing a performance impact, not yet understood.
  • this week:
    • From Jason:
      Hi Rob;
      I will need to mail in an update today since I am currently traveling:
      - RC3 Is being tested at UM/MSU/BNL.  Some minor hiccups unrelated to the performance issue were found (permissions on directory that stores some owamp data).
      - This release will be looking at the driver for the NICs on the KOI boxes, there are two we can use (sky2 or sk98lin), and we are trying the latter right now.
      - Early results show that it doesn't solve the issue, but we will be checking to see what improvements it has over the sky2 driver.  It appears that for a long enough test (20+ seconds) we can get up to 940Mb/s, but the ramp-up is a lot longer than it used to be.
      - We will also be looking at the kernel, and trying different versions.  Specifically we want to see why the handling of soft interrupts may have changed between the prior releases and the current.
    • Notes from this week's throughput meeting:
      USATLAS Throughput Meeting Notes --- September 14th, 2010
      Attending: Shawn, John, Dave, Karthik, Andy, Sarah, Philippe, Tom, Aaron, Hiro
      Excused: Jason
      1) Discussion about problem status at:
      	a) OU - Karthik reported on status. Lots of test information sent to Chris Tracy/ESnet who looked it over.  No host issue is apparent. Possible issues with autotuning or kernel/stack/driver issues.  Try UDP to help isolate the issues (but it is disruptive).  Indication of issue at BNL.  
           b) BNL - John reported dCache machine links all transferred to Force10 (no more Cisco in that path).  Planning to move a VLAN associated with the WAN and run it directly to the ATLAS area, bypassing the BNL core (next few weeks?).  Queue settings are now set up according to ESnet recommendations (as of a couple of weeks ago).  Some drops were observed at the 40Gbps-to-10Gbps system; however, drops were not seen during testing.  Traceroute from BNL to OU through the ESnet cloud allowed John to determine test points along the path.  One example was BNL-Cleveland, which showed a factor of 10 asymmetry between in/out directions; however, testing to the next point (Chicago) showed a symmetric result!  This was repeatable.  The newmon 10GE testing box is directly connected to the border router 'mutt' (LHCnet interface), which interconnects to the other border router 'amon'.  Testing from US Tier-2s to newmon will traverse 'amon' and 'mutt' before getting to newmon.
           c) Illinois - Campus perfSONAR box is working and being used to help diagnose the problem.  Box was setup at campus border and the asymmetry is not observed, indicating the campus may have the problem.  Next steps will involve moving the campus perfSONAR box to the Tier-3 location and retesting.  Progress is being made.
      2) perfSONAR release update and information on improving existing system performance.  
           a) Results from "beta" tests at sites  (pSPT v3.2rc3 at BNL, AGLT2 UM, AGLT2 MSU)
           b) Status update on Dell R410 box for use as throughput/latency perfSONAR node (Jason).  
           c) Current release schedule
      From Jason: " RC3 Up at BNL/UM/MSU.  Philippe just reported a problem with his, we will look into it.
      Early results from UM show that the newer sk98lin driver didn't solve the issue, we will need to examine to see what positive effects it did have.  In some basic testing I was able to get a 30 second test up to 941Mb after about 20 seconds of ramp up - the cards are capable but something else either in the kernel or on the machine may be limiting it from getting there sooner.
      Next steps:
       - Watch the RC3 hosts for a couple of days
       - Prepare a 'new' (2.6.27 or newer) kernel to use with both drivers to see if this sidesteps the issue
       - Looking into ksoftirqd some more, and why it is active on the new hosts.
      RC4 will be in 2 or so weeks pending the results of the testing."
      Philippe reported on the issue with 3.2rc3 on the latency box.  Apparently a protection issue which must be manually handled.  Fix doesn't survive a reboot though.
      3) Monitoring and using perfSONAR in USATLAS. 
           a) Status of "Nagios" plugin (Andy) --- Developer working on it has produced an rpm and documentation.  Some minor issues that need resolving...maybe another day or two.  Tom is ready to test it once it is available.  
      	b) Discussion of useful monitoring strategies --- Nothing yet to discuss.  Will wait for 3.2 release.
      4) Site reports --- Aaron mentioned a network expansion at MWT2_UC going to 2x10GE between the 2 primary switches via Cisco's etherchannel. Shawn described updates at AGLT2_UM which will add 2 of the Dell 8024F switches (24 port SFP+) as a primary 10GE "backbone" for the local storage nodes and network.  New storage nodes and switches will uplink to both switches (active-backup mode likely) for resiliency.   John mentioned that BNL is using 100Gbps interswitch trunks now (10x10GE).  Some discussion about how trunking works in practice.
      AOB:  None.   We will plan to meet again in 2 weeks.  Please send along any corrections or additions to these notes by sending them to the mailing list.
    • Michael: still have problems with BNL to CNAF and NDGF; the current arrangement is not good. Shawn: has the monitoring been done at the OPN points of presence?
    • Believe service providers are not proactive and are not communicating across providers at the administrative level. The problem seems to be in the GEANT2 network and regional providers.
    • At BNL there are two paths - OPN reserved for CERN traffic; at least three providers to get from one T1 to the other.
    • Just need to apply pressure - ATLAS to WLCG.

HEPSpec 2006 (Bob)

last week:

this week:

Site news and issues (all sites)

  • T1:
    • last week(s): Currently a problem with one of the storage pools - a large one - 100K files/day to re-create the metadata. local-site-mover: Pedro making progress on the handling of exceptions (have seen a rate of 2-5%). Want data access problems to become invisible.
    • this week: not much to report; provisioning more disk capacity; adding Nexsan RAID (IBM in front, running Solaris+ZFS) - 1.3 PB of storage shortly. (Added to the 1.6 PB DDN from two months ago, which focuses on read applications; write performance is not as good on the smaller servers, where checksumming is done on the fly.)

  • AGLT2:
    • last week: Putting in storage orders - R710 + MD1200 shelves. Also focused on network infrastructure upgrades and service nodes. Running with direct access for a while; have found some queuing of movers on pools. (Note MWT2 is running dcap with 1000 movers; note there is a newer dcap version to test.) Still preparing for the next purchase. Storage, services, and network purchases are out. A decision on compute nodes is needed by mid-month. More virtualization of primary services: both gatekeepers are virtualized, as is the Condor head node. Started testing compute nodes that are about to go out of warranty. Capacity down a bit.
    • this week: Dell visit yesterday - Walker and Roger; have set up a test site; more interested in having representative analysis jobs. The goal is to allow engineers to observe behavior and benchmark their systems. How to go forward getting jobs? Pre-packaged analysis jobs. Equipment orders arriving. UM - blades. MSU - 1U servers. Nearline SAS, 7200 rpm, 2 TB drives.

  • NET2:
    • last week(s): The 250 TB rack is online and DATADISK is being migrated; working on a second rack. Working on the HU analysis queue. Running smoothly in the past week. The move to DATADISK is 95% complete. Wensheng and Armen have cleared up some space. 250 MB/s nominal copy rate. HU as an analysis queue - still working on this; need to put another server on the Harvard side. Another 250 TB has been delivered, awaiting electrical work. Future: a green computing space, multi-university; NET2 may move there.
    • this week: Updated storage capacity in OIM. 250 TB freed up. The rack for the next 250 TB is in progress. HU: a problem with Gratia (at 50K jobs/day it gets behind and dies). John is working with OSG on a fix, and on backdating statistics with WLCG.

  • MWT2:
    • last week(s): Chimera issues resolved - due to a schema change; no crashes since. PNFS relocated to newer hardware, good performance. Quiet week. Continuing purchase planning; HEPSpec benchmarks for two servers.
    • this week: Running very stable. Maui upgrade. Completed retirement of older worker-node dcache storage.

  • SWT2 (UTA):
    • last week: 200 TB of storage added to the system - up and running without problems. Will be making modifications for multi-pilot. Last week was quiet. Had a problem with Torque - restarted. Processing a backlog of analysis jobs. scheddb changes (the multi-pilot option) don't seem to be taking effect. Otherwise all okay.
    • this week: Not much to report this week - a GGUS ticket from a black-hole node. Analysis going very well in the past week.

  • SWT2 (OU):
    • last week: The Lustre upgrade seems to have stabilized things. Pilot updated for the adler32 checksum change. Getting OSCER back online. Found problems installing new releases using Alessandro's system. Put Squid online. Otherwise everything smooth. Expect to turn on the analysis queue.
    • this week: Production running smoothly. No longer getting DDM errors since the last upgrade. Frontier-Squid setup is completed; waiting on scheddb and ToA updates. Fred: is OU getting PoolFileCatalog file updates? Xin will check with Alessandro. Alden: will work with Horst to get the scheddb changes up to date. Fred: believes the ToA update is done.
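The adler32 pilot change above refers to checksum verification of transferred files. A minimal sketch of how such a checksum can be computed with Python's standard library (the function name and chunk size are illustrative, not the actual pilot code):

```python
import zlib

def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file, reading it in chunks."""
    value = 1  # adler32 is defined to start at 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits and format as the usual 8-digit hex string
    return "%08x" % (value & 0xFFFFFFFF)
```

Chunked reading keeps memory use constant for multi-GB data files, and the running `value` argument makes the incremental result identical to checksumming the whole file at once.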

  • WT2:
    • last week(s): Disk failure yesterday - had to shut down FTS for a while; fixed. Installing additional storage next week. New Dell servers - three in production; almost immediately got disk failures (cable, enclosure, ...); working on double-cabling. Lost some data; a 16-hour downtime was required. Over the holiday we were working well; reached 2000 analysis jobs. Gradually adding more storage; some power work to do before turning on two servers. These were Dell MD1000 shelves with PERC6 controllers.
    • this week: All is well. Replaced a number of disks and cables. One of the new storage servers is failing stress tests. One of the Thors seems to be dropping off intermittently, traced to a bug in Solaris.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting(s)
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now a cron job will be run to update the PFC on the sites.
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
  • this meeting:
    • Alessandro is running the final tests at OU now.
    • Expect to start migrating the BNL Tier 1 next week. The next new release kit will then use Alessandro's system. Expect this to take a couple of weeks.
    • Have asked for documentation for site administrators - there are options. Working on understanding installation jobs sent via the WLCG system.
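The interim cron-based PoolFileCatalog refresh mentioned above might look like the following crontab entry (the script path and log location are hypothetical placeholders, not the actual site configuration):

```shell
# Hypothetical: refresh the PoolFileCatalog nightly at 03:00 local time.
# Each site would point this at its own copy of Hiro's PFC update tooling.
0 3 * * * /opt/atlas/scripts/update_pfc.sh >> /var/log/pfc_update.log 2>&1
```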

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)


  • last week
    • LHC accelerator outage next week, Monday through Thursday.
  • this week

-- RobertGardner - 14 Sep 2010
