r4 - 27 Oct 2009 - 21:26:43 - MarkSosebee



Minutes of the Facilities Integration Program meeting, Oct 21, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Jim C, Michael, Fred, Nurcan, Doug B, Mark, Hiro, Wei, John B, Shawn, Jason, Bob, Charles, Patrick, Rik
  • Apologies: Rob

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week:
    • Will have a visit to Wisconsin in a couple of weeks
    • Test cluster will be set up at UW to examine instructions
    • Tier 3 workshop at ANL Oct 29-30; Plan strategy for deploying Tier 3s. Input from experts.
    • Tier 3 commissioning meeting - will be a separate meeting, will summarize and report here.
    • Doug working with Hiro on the SE - focus of the integration
    • Hiro will begin integrating Duke and ANL into site services and FTS at BNL, and FTS controls throughput as appropriate
    • Kaushik: modification in Panda complete for destination option. Works only for Tier 3 sites.
    • Tier 3 data transfer stress test against the Tier 2's; plan a Tier 3 data transfer test
    • What about cross-cloud transfers? Right now it's only a US solution.
  • this week:
    • Subscription test this week. Identified 10 sites that could have test storage elements brought up.
    • Tier 3 meeting at end of month.
    • Throughput tests from Tier 2 to Tier 3.

UAT program (Kaushik, Jim C)

  • last week(s):
    • ADC daily operations notes
    • See https://twiki.cern.ch/twiki/bin/view/Atlas/UserAnalysisTest
    • Status: 6 containers have been defined in DQ2. 72 TB.
    • Date: October 21-23.
    • 2 (pre-test) containers have already been distributed and tested. uat09. Will delete step09 containers.
    • 6 containers are mostly replicated to all the Tier 1s. Stephane has a table.
    • Containers have been assigned to Tier 2s - there is a table. 1-4 containers out of the six at each Tier 2. BNL has all 6.
    • Jim: contacted 18 users from step 09; 11 confirmed. Several other inquiries.
    • Follow the computing model - 50% of the resources for a Tier 2. There will probably be about 3000 slots, 2000 across the Tier 2s.
    • What about special users requiring special access?
  • this week:
    • Many worries about having sufficient disk space.
    • Large number of US sites participating.
    • Pretest tomorrow
    • Pretest data sets called step09.* are being cleaned.
    • Sites should set analysis slots to 50% of all slots for UAT.
    • Report from Jim C and Nurcan.

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  10/6-10/8 -- Intermittent job failures at various US sites due to lack of tape access during the HPSS upgrade at BNL.
    2)  Problems with the panda servers at CERN following a move to new hosts.  One machine was blocked by a firewall.  Issues seem to be resolved now.  eLog 6092, 6098.
    3)  10/9: MWT2_IU -- issue with access to library file ( libpopt.so.0) from some SL5 worker nodes -- they were taken offline to fix the problem.  ggus 52293.
    4)  10/9: Armen and Alden completed migration of user analysis areas to USERDISK at BNL.
    5)  10/9: BU -- gatekeeper reboot -- resulted in ~50 "lsm-get failed" errors.
    6)  Over this past weekend (10/10 - 10/12) -- large number of failed jobs at BNL - issue was a misconfiguration in schedconfigdb -- resolved.  See ggus 52281.
    7)  10/12: BU -- ~500 failed jobs due to a GPFS partition filling up -- resolved.  ggus 52283.
    8)  10/12: AGLT2 -- Jobs failing due to lack of free space in AGLT2_PRODDISK -- resolved.  ggus 52274.
    9)  10/13: dCache upgrade at BNL -- some residual issues following re-start, but everything seems to be resolved now.
    10)  10/13: UTA_SWT2 set 'offline' to investigate problems with the ibrix storage.
    11)  10/13-14: SLAC outage for OSG upgrade -- initially some issues sending test jobs to the site, owing to stale entries on the BNL submit host -- cleaned up by Xin 
    -- test jobs eventually succeeded, site set back to 'online'.
    12)  Large increase recently in the number of nagios alerts -- from Tomasz:
    Nagios seems to flip flop on gatekeeper tests. The problem started few days ago and we do not know the cause. It seems intermittent: I can run the bare test several 
    times by hand and it works and then suddenly it fails.  In addition to that we do see network interruptions which come and go.  Those two problems may or may not be related.  
    I will disable nagios e-mail alerts for gatekeeper tests in order to reduce noise.
    Last few days nagios was going nuts about gatekeeper tests: the probes were flipping up and down continuously.  We had some sort of connectivity problem: nagios could not 
    reach various hosts. The connections would intermittently fail. To make matters harder to debug the connection failures appeared completely random.  
    In the end I had to disable notifications from nagios gatekeeper probes until the underlying connectivity problem is resolved.  It seems that by now we have a partial
     understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications.  I still have one issue 
    which I need to discuss with administrators of sites which run osg 1.2 - I will contact you off line.
    Follow-ups from earlier reports:
    (iii) ATLAS User Analysis Test (UAT) scheduled for October 21-23.

  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  10/15: From Tomasz, regarding recent issues with nagios:
    It seems that by now we have a partial understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications.
    2)  10/15: From Wei, regarding Bestman / CA certs issue:
    The issue between Bestman and users with Canada's westgrid certificates (see my original email below) is addressed via a workaround provided by the LBL team.  The workaround is to replace 
    $VDT_LOCATION/bestman/lib/globus/cry*.jar in bestman or above by the .jars in bestman
    3)  10/16: Jobs were failing at MWT2_UC with the error "Input file DBRelease-7.5.1.tar.gz not found."  From Sarah:
    DBRelease file was transferred with __DQ2 extension, which is incompatible with panda/pilot/athena software. I've renamed the file and
    updated the lfc registration.
    4)  10/17: Storage upgrade at SLAC completed -- no major interruptions.
    5)  10/19: Job failures at MWT2_UC likely due to disk cleanup removing still-needed files -- from Charles:
    I think the production job failures at MWT2_UC may have been due to overzealous cleanup of MWT2_UC_PRODDISK triggered by our site almost running out of space over the weekend.  ggus 52475.
    6)  10/19:  Tadashi modified PandaMover to delete redundant files.  (Thanks Charles for the feedback.)
    7)  10/20: Large number of failed jobs at BNL, with the message "Get error: Too many/too large input files."  Ongoing discussions about how to deal with this issue (i.e., in the pilot, 
    split the jobs, etc.)
    8)  10/21: Jobs were failing at all U.S. sites due to them attempting to access the Oracle db at BNL.  Affected tasks were aborted.  (Number of failed jobs >25k.) 
    Follow-ups from earlier reports:
    (i) ATLAS User Analysis Test (UAT) re-scheduled for October 28-30.
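    Item 3 above describes a DBRelease file that arrived with a __DQ2 extension and had to be renamed. A minimal sketch of that name clean-up follows. It assumes the suffix is "__DQ2", optionally followed by "-" and a numeric stamp; the minutes do not give the exact form. The function name is hypothetical, and updating the LFC registration is not shown.

```python
import re

# Assumed suffix form: "__DQ2" optionally followed by "-<digits>".
# The minutes only say "__DQ2 extension"; the exact pattern is a guess.
_DQ2_SUFFIX = re.compile(r"__DQ2(-\d+)?$")


def strip_dq2_suffix(filename: str) -> str:
    """Return filename without a trailing __DQ2 suffix, if one is present."""
    return _DQ2_SUFFIX.sub("", filename)
```

    A name without the suffix is returned unchanged, so the function can be applied to every file in a directory listing without checking first.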

Analysis queues (Nurcan)

DDM Operations (Hiro)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Fred spent some time revising the SquidTier2 page
    • Checked squid version at all Tier 2 sites
    • Please send any documentation problems to John or Fred
    • Two environment variables need to be set at each site: the squid-frontier variable, and one for the XML file location for pool conditions files. The naming of these two variables still needs to be standardized. There is also the question of caching over-subscribed conditions files (not sure how serious this is). This is similar to the issue for the dbRelease file (though that file gets staged by the pilot). Athena is using remote I/O since the conditions file is never staged (thus pcache does not apply).
  • this week
    • All sites need to send their Squid URL to John DeStefano at BNL.
    • Fred, Xin, and Alessandro working to finalize the installation of PFC.

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    USATLAS Throughput  Meeting Notes – October 13, 2009
    Attending:  Shawn, David, Mike, Sarah,  Doug, Jeff,  Horst, Hiro
    Excused: Karthik, Jason
    Primary topic of discussion was last week’s perfSONAR installation/configuration for USATLAS. A survey during the call showed that OU’s instance has been working fine since configuration; MWT2_IU had some issues with services stopping, but a reconfiguration and reboot fixed it. AGLT2_UM had problems with the perfSONAR-BUOY services stopping as well as PingER stopping. The Wisconsin site is up but is not yet configured properly. Neng is looking into this. The AGLT2_UM issues are being debugged by the Internet2 developers. The AGLT2_MSU instances also seem to be running without an issue so far. Didn’t get reports from the other sites.
    Jeff Boote mentioned syslog configuration specifically on the AGLT2 boxes.  UM needs to look at it to try for a more rational syslog configuration that also sends data to the central syslog host UM uses.   
    Jeff also mentioned if perfSONAR  software changes are needed another ISO could be produced.      We will have to see what debugging the problems to-date turns up.
    Sarah provided our first perfSONAR measurement question for testing from IU to UTA.   Sarah is seeing a lot of packet loss to UTA SWT2 (70/600) during the OWAMP testing.   Even losing 1 OWAMP packet/600 could be significant so this is really a large loss that needs to be tracked down. The relevant traceroute is  here (both directions):
    [knoppix@Knoppix ~]$ traceroute netmon1.atlas-swt2.org
    traceroute to netmon1.atlas-swt2.org (, 30 hops max, 40 byte packets
     1 (  11.924 ms  0.301 ms  0.335 ms
     2  xe-0-2-0.2012.rtr.ictc.indiana.gigapop.net (  0.237 ms  0.265 ms  0.246 ms
     3  tge-0-1-0-0.2093.chic.layer3.nlr.net (  6.450 ms  5.447 ms  5.252 ms
     4  hous-chic-67.layer3.nlr.net (  31.837 ms  31.056 ms  30.961 ms
     5  hstn-hstn-nlr-ge-0-0-0-0-layer3.tx-learn.net (  30.759 ms  30.803 ms  30.726 ms
     6  dlls-hstn-nlr-ge-1-0-0-3002-layer3.tx-learn.net (  36.091 ms  36.169 ms  36.092 ms
     7 (  36.112 ms  36.139 ms  36.218 ms
     8  as16905_uta7206_m320_nlr.uta.edu (  37.548 ms  37.546 ms  37.494 ms
     9  netmon1.atlas-swt2.org (  37.664 ms  37.645 ms  37.669 ms
    Reverse traceroute to my laptop:
    Executing exec(traceroute, -m 30 -q 3 -f 3,, 140)
    traceroute to (, 30 hops max, 140 byte packets
     3 (  1.885 ms  1.772 ms  1.807 ms
     4  hstn-dlls-nlr-ge-3-0-0-3002-layer3.tx-learn.net (  7.169 ms  7.150 ms  7.099 ms
     5  hstn-hstn-nlr-layer3.tx-learn.net (  7.650 ms  7.598 ms  7.536 ms
     6  chic-hous-67.layer3.nlr.net (  33.475 ms  33.489 ms  33.393 ms
     7  xe-1-2-0.2093.rtr.ictc.indiana.gigapop.net (  37.669 ms  37.555 ms  37.657 ms
     8  tge-1-2.9.br.ul.net.uits.iu.edu (  37.695 ms  37.733 ms  37.751 ms
     9  tge-1-4.912.cr.ictc.net.uits.iu.edu (  38.877 ms  38.922 ms  40.268 ms
    10  149-166-143-177.dhcp-in.iupui.edu (  37.809 ms  37.905 ms  37.944 ms
    Testing from Tier-2 to Tier-3 enabled in Hiro’s throughput test (NET2 - Duke and MWT2_UC - Argonne). Moving 7 files from dataset. See Hiro’s update page at:
    Milestone for 1GB/sec for 1 hour was ALMOST completed from BNL to MWT2_UC.    Need to redo this during the next week.   Sites should contact Hiro to arrange a throughput test.  Need to get 1GB/sec for one hour from BNL -> (set of one or more Tier-2s).   Individual sites with 10GE should strive for 400MB/sec for > ½  hour.
    IU notices a slowdown via Hiro’s automated load-test starting between Sep 30 and October 1st 2009.   Sarah is looking into what changed.
    Future calls will regularly discuss perfSONAR measurement results once we start acquiring enough data from our testing configuration.
    Hiro will be contacting Jeff Boote (Internet2) to get information on the API for accessing perfSONAR measurement results for future integration into his plots.
    Please send along any corrections or additions to these minutes via email to the list.   We plan to meet again next week at the normal time.
    • Tier 3 is now added to Hiro's throughput test
    • Can now do any-to-any site testing.
  • this week:
    Meeting Notes from USATLAS Throughput Call
    Attending:  Shawn, Dave, Jason, Sarah, John, Horst, Karthik, Hiro, Doug
    Discussion about perfSONAR status.   
    AGLT2 (rc4 version?), BNL, MSU, MWT2_IU (previous version) all lost configurations (apparent disk-problems, read-only disk, rebooting loses the config and data on disk).  Possible issue for the KOI hardware?   Needs to be debugged.
    Issues with perfSONAR-BUOY Regular Testing (Throughput and/or One-Way Latency) changing from "Running" to "Not Running". Happening at AGLT2, BNL, MWT2, OU.
    Dave reported running perfSONAR on a box for a few weeks (non-KOI; Intel box) and has had no problems.  However tests only running a few days.
    Internet2 developers will be looking into these two problems...may be the same problem (I/O errors may cause services to stop and/or loss of data).   May request files and/or access to problematic perfSONAR instances.
    Next discussion was about using perfSONAR results to find issues:
    1) SWT2-UTA possibly an issue.  OU, MWT2_IU and AGLT2 show low throughput.  IU was showing lots of packet loss in latency tests.   However BNL->SWT2-UTA and MWT2_UC->SWT2-UTA seem OK.   Needs looking into.  Perhaps path differences may show where the problem lies.
    2) AGLT2-BNL showing asymmetry again.  AGLT2->BNL is good (>910Mbits/sec) but BNL->AGLT2 is poor (40-70 Mbits/sec) since October 15th.  Need to investigate.
    Discussion about Hiro's new testing.   Now Tier-2 to Tier-2 testing in place.  Only 10 files moved in tests and only using DATADISK.  Select Tier-2 site (_DATADISK version) from http://www.usatlas.bnl.gov/dq2/throughput and you can see results (scroll-down for graphs).
    Slowdown to IU Tier-2 at the end of September: reason was found. The gridftp2 settings on the FTS channel were lost, preventing door-to-door transfers.  Hiro fixed and throughput restored.
    Discussion about LHC network (LHCOPN CERN<->BNL).   BNL/John Bigrow wants to test new LHCOPN link to BNL for load-balancing. Needs to arrange high-throughput test from CERN <-> BNL (could use iperf).  Will work with Hiro/Eduardo/Simone on this sometime next week.
    Milestone: BNL->(set of Tier-2's at 1GB/sec for 1 hour).  Test this Friday (October 23 at either 10 or 11 AM).   Need to get feedback from others on which time is best.
    We will NOT be meeting next week unless someone wants to chair the meeting in my absence ( I will be on a plane to Shanghai).    Next meeting TBD.

    • Still having perfSONAR issues. Hangs followed by lost settings upon reboot.
    • Problems at UTA
    • Problem at UMICH where data transfers are slower in one direction than the other.
    • IU slowness fixed by updating FTS.
    • 1 GBps benchmark on Friday. All sites will participate.
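
    For scale, the throughput targets and the OWAMP loss figure quoted in these notes can be worked out with a few lines of arithmetic. This is only a sketch: it assumes "GB" and "MB" are decimal gigabytes and megabytes (the notes do not say), and the function names are illustrative.

```python
def volume_tb(rate_mb_per_s: float, seconds: float) -> float:
    """Data volume in TB moved at a sustained rate (decimal units: 1 TB = 1e6 MB)."""
    return rate_mb_per_s * seconds / 1_000_000


def loss_percent(lost: int, sent: int) -> float:
    """Packet loss as a percentage of packets sent."""
    return 100.0 * lost / sent


# Milestone: 1 GB/s from BNL to a set of Tier 2s, sustained for one hour.
milestone_tb = volume_tb(1000, 3600)
# Target for individual 10GE sites: 400 MB/s for at least half an hour.
site_tb = volume_tb(400, 1800)
# OWAMP loss reported from IU to UTA: 70 of 600 packets.
owamp_loss = loss_percent(70, 600)
```

    Under those assumptions the milestone corresponds to moving about 3.6 TB in the hour, the per-site target to about 0.72 TB in half an hour, and the IU to UTA loss to roughly 11.7% of OWAMP packets.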

Site news and issues (all sites)

  • T1:
    • last week(s): Saturday morning 1000's of jobs in holding - there was a schedd config problem. This issue needs attention. Upgrades yesterday: Xin upgraded OSG 1.2.3; Oracle maintenance for LFC backend. Chris made kernel updates on worker nodes for security; dcache upgrades. SRM load quite high, but understood. Gridftp doors are in between public and internal networks, and so traffic is going through firewalls, ticket submitted to dcache developers; computing room infrastructure developments - certification of 1 MW flywheel passed, all passed.
    • this week:
      • Increase analysis slots to 500.

  • AGLT2:
    • last week: Squid install resolved; next week a short outage at UM for hardware relocation; SL 5 update, rolling update. Dell compute and storage expected to arrive at the end of the month.
    • this week:
    • Rearranging machine room to accommodate new hardware.
    • Getting ready to upgrade to SL5 via rolling upgrades.
    • Already have 50% of slots for analysis: ~650 slots.

  • NET2:
    • last week(s): SE datadisk filled - recovered. Today there were SRM problems, fixed with a restart. Perfsonar, squid, SL5 updates.
    • this week:
    • Running normally.
    • Normally run 50% analysis: ~350 slots currently.
    • Still tuning perfSONAR.
    • Freeing 50 TB of space.

  • MWT2:
    • last week(s): Still working on SL5 upgrade of last week - test jobs are running on worker nodes, then will do a rolling update.
    • this week:
    • SRM problem showing more free space than was available.
    • 90 TB of dark files cleaned.
    • 2200 slots total / 800 slots analysis

  • SWT2 (UTA):
    • last week: Had an issue with the UTA_SWT2 storage system, hopefully it will last until the new Dell purchase arrives; packet loss issue.
    • this week:
    • 200 analysis slots.
    • Working on procurement.

  • SWT2 (OU):
    • last week: Finally in the process of getting a new quote from Dell and DDN. Will probably get more compute nodes.
    • this week:
    • Still waiting for new quote.

  • WT2:
    • last week(s): rhel5 migration continuing ~ 100 systems complete. New xrootd release with expected new features, will deploy later. Westgrid certificates causing problems for Bestman. OSG 1.2.3 updated.
    • this week:
    • Need to change Panda parameters for use with analysis jobs.
    • Down to last 15 TB of free space. Actively cleaning.
    • Expecting new storage soon.
    • Working on getting 50% of slots for analysis but non-trivial to do. Hope to get 500 slots for analysis.
    • New storage hardware installed.
    • Canadian grid certificates problem solved.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
  • this week:
    • BNL updated

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Tier 3 data transfers

  • last week
    • no change
  • this week

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksum to xrootd. Alex developing something to calculate this on the fly. Expects to release this very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of AGLT2 infrastructure to SL 5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystem) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check the accounting name doesn't get erased. There was a bug in OIM - should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP from the sites
    • Here is a snapshot of the most recent report for ATLAS sites:
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC        | LL      | UL      | %Diff   |
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in config ini file?
  • this meeting
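
    The report snapshot above does not state how %Diff is computed. A minimal sketch follows, assuming it is the signed gap between the calculated value and the violated limit, taken relative to the calculated value and truncated toward zero. That assumption reproduces rows such as AGLT2 (9 and 15) and the BU OSC value (-78,177), but not every row: for MWT2_IU ICC it gives -253 where the report shows -252. Treat it as a guess at the report's logic, not its actual code; the function name is hypothetical.

```python
def pct_diff(calculated: int, lower: int, upper: int) -> int:
    """Assumed %Diff: signed gap to the violated limit, relative to the calculated value."""
    if calculated > upper:
        gap = calculated - upper        # above the upper limit: positive
    elif calculated < lower:
        gap = calculated - lower        # below the lower limit: negative
    else:
        return 0                        # within [lower, upper]
    if calculated == 0:
        return -100                     # the report shows -100 when a site publishes zero
    return int(100 * gap / calculated)  # int() truncates toward zero
```

    Because the gap is divided by the calculated value rather than by the limit, a site that publishes a very small number against a large pledge produces a very large negative percentage, which is why the BU and SWT2_CPB storage rows look so extreme.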


  • last week
    • None
  • this week

-- RobertGardner - 21 Oct 2009
