


Minutes of the Facilities Integration Program meeting, Nov 4, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Jim C, Fred, Sarah, John DeStefano, Jason, Rob, John B, Aaron, Justin, Michael, Saul, Bob, Kaushik, Mark, Tom, Rik, Charles, Torre, Wei
  • Apologies:

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week:
    • Will have a visit to Wisconsin in a couple of weeks
    • Subscription test this week. Identified 10 sites that could have test storage elements brought up.
    • Throughput tests from Tier 2 to Tier 3.
  • this week:
    • Tier 3 meeting at ANL: https://atlaswww.hep.anl.gov/twiki/bin/view/Tier3Setup/29Oct09Meeting
    • Full report next week at the T2/T3 workshop
    • 50 people (1/2 in person, 1/2 over EVO)
    • Concentrating on T3g. Will have an SE plus an interactive and a batch cluster.
    • 12 people have indicated interest in developing T3g. Organizational phone meetings on Friday.
    • Complete description by the end of the year.
    • Working closely with OSG and the Condor team.

UAT program (Kaushik, Jim C)

  • last week(s):
  • this week:
    • 112 Ganga, 50 Panda users participating
    • Some load balancing among clouds
    • First two days for job submission (Wed, Thurs); data retrieval didn't seem to start until Monday
    • Failure rates were ~40%; this may have included pre-test jobs and jobs failing because trigger information was not included in the AODs. Does this include jobs killed by users?
    • Kaushik is preparing per-cloud user job efficiencies (see the sketch after this list).
    • MWT2 - there was a bad node causing failures. There were also some data movement / load-related issues (>1000 analysis jobs).
    • SLAC - container issues causing failures. Wei: the dominant reasons were jobs killed by users, a user using a version of pyutils that did not support xrootd (needs to be checked in Release 15.1.0), and then an old release, 14.4.0. A few failures occurred because the xrootd server failed. Also moving the release install area off the xrootd server.
    • Need to exclude cancelled jobs from statistics.
    • We need to go through the failures looking back into the database.
    • Over the weekend, there were a large number of release 14 jobs holding database connections open
    • We need to get a better handle on what to expect for users retrieving data - dq2 is tracking retrievals per site; retrievals peaked on Monday.
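    • The sketch below is a minimal illustration (not the actual accounting code) of computing per-cloud user efficiency while excluding user-cancelled jobs, as discussed above; the job records and status names are hypothetical.

      # Hypothetical job records: (cloud, status) pairs; statuses are illustrative.
      jobs = [
          ("US", "finished"), ("US", "failed"), ("US", "cancelled"),
          ("DE", "finished"), ("DE", "finished"), ("DE", "failed"),
      ]

      def efficiency_per_cloud(jobs):
          """Success rate per cloud, ignoring user-cancelled jobs."""
          stats = {}
          for cloud, status in jobs:
              if status == "cancelled":      # exclude jobs killed by users
                  continue
              done, total = stats.get(cloud, (0, 0))
              stats[cloud] = (done + (status == "finished"), total + 1)
          return {cloud: done / total for cloud, (done, total) in stats.items()}

      print(efficiency_per_cloud(jobs))   # e.g. {'US': 0.5, 'DE': 0.666...}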

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Formalize the procedure for cleaning USERDISK (the procedure just needs to be put in the twiki by Armen); this will be done centrally.
    • Checksum problems with dCache: the load was too high to compute checksums.
    • Time-out problems with xrootd. Wei suggests reducing the number of simultaneous transfers. The default is 200.
    • Call backs to Panda server timing out.
    • A new dq2 version is available.
  • this week:
    • MinutesDataManageNov3
    • LOCALGROUPDISK to be deployed at every T2. US-only usage; will be monitored by Hiro's system. Timeframe: within a week. There is a stability issue with xrootd.
    • Michael: issue of publishing SEs in the GIS (BDII). The reason is that we want to allow data replication between Tier 2s and other Tier 1s. We need to make sure our SEs get published into the OSG interoperability BDII (see the sketch after this list). Start within a week. Xin to follow up.
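    • A minimal sketch of how a site could verify its SE publication in a BDII over LDAP, using the python ldap3 package; the BDII host, port, and SE unique ID below are placeholders, not the actual OSG interoperability BDII endpoint.

      from ldap3 import Server, Connection, ALL

      # Placeholder BDII endpoint -- substitute the real interoperability BDII
      # host (BDIIs conventionally listen on port 2170).
      server = Server("ldap://bdii.example.org:2170", get_info=ALL)
      conn = Connection(server, auto_bind=True)

      # Look for the site's SE entry by its GLUE unique ID (placeholder value).
      conn.search(
          "o=grid",
          "(&(objectClass=GlueSE)(GlueSEUniqueID=se.mysite.example.org))",
          attributes=["GlueSEUniqueID", "GlueSEImplementationName"],
      )
      for entry in conn.entries:
          print(entry)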

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  10/15: From Tomasz, regarding recent issues with nagios:
    It seems that by now we have a partial understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications.
    2)  10/15: From Wei, regarding Bestman / CA certs issue:
    The issue between Bestman and users with Canada's westgrid certificates (see my original email below) is addressed via a workaround provided by the LBL team.  The workaround is to replace
    the $VDT_LOCATION/bestman/lib/globus/cry*.jar files in the affected Bestman versions with the corresponding .jar files from the Bestman release provided by the LBL team.
    3)  10/16: Jobs were failing at MWT2_UC with the error "Input file DBRelease-7.5.1.tar.gz not found."  From Sarah:
    The DBRelease file was transferred with a __DQ2 extension, which is incompatible with the panda/pilot/athena software. I've renamed the file and
    updated the LFC registration.
    4)  10/17: Storage upgrade at SLAC completed -- no major interruptions.
    5)  10/19: Job failures at MWT2_UC likely due to disk cleanup removing still-needed files -- from Charles:
    I think the production job failures at MWT2_UC may have been due to overzealous cleanup of MWT2_UC_PRODDISK triggered by our site almost running out of space over the weekend.  ggus 52475.
    6)  10/19:  Tadashi modified PandaMover to delete redundant files.  (Thanks Charles for the feedback.)
    7)  10/20: Large number of failed jobs at BNL, with the message "Get error: Too many/too large input files."  Ongoing discussions about how to deal with this issue (i.e., in the pilot, 
    split the jobs, etc.)
    8)  10/21: Jobs were failing at all U.S. sites due to them attempting to access the Oracle db at BNL.  Affected tasks were aborted.  (Number of failed jobs >25k.) 
    Follow-ups from earlier reports:
    (i) ATLAS User Analysis Test (UAT) re-scheduled for October 28-30.

  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  10/28 - 30: UAT -- a postmortem announcement to follow.
    2)  10/29: Jobs failing at MWT2_UC with "Get error: lsm-get failed (51456): 201 Copy command failed."  See eLog 6462 for details from Charles.  
    https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/6462.  Also RT 14475.
    3)  10/29: A new test instance of the RT server at BNL was announced by Jason (message to the usual mail lists).  Try it out at: https://rt.racf.bnl.gov/rt3/
    4)  10/30: BNL -- 32 TB of storage in MCDISK was offline for a period of time.  Resolved -- see: https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/6489
    5)  Over this past weekend a problem arose with FTS proxy delegation at BNL.  Hiro tracked it down to a clock skew, possibly related to the daylight savings time change.
    6)  11/3:  All-day outage at BNL for major core network upgrades -- completed as of ~8:00 p.m. CST.  See:
    7)  11/3: Maintenance outage at MWT2 sites for a dCache upgrade (1.9.5).  Completed, test jobs submitted by site admins, and the queues are back to 'online' as of this morning (11/4).
    8)  11/3: Short (~3 hour) maintenance outage at UTD-HEP.  Once it was over, test jobs were submitted by an EU shifter, but they used old releases (v12 & v13).  
    New jobs have been submitted this morning -- waiting for results.
    9)  11/4: SLAC set offline to investigate problem where jobs are failing with " Required CMTCONFIG
    (i686-slc4-gcc34-opt) incompatible with that of local system."  RT 14512, eLog 6610.
    Follow-ups from earlier reports:
    (i) Reminder -- the next tier 2/3 meeting will be held at UTA 11/10 - 11/12.  See:
    (ii) Shift summary from last week available at:

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • FTS 2.2 still not shown to be working well everywhere; checksum support shown to work; improper implementation of shares; lots of things still not understood
    • Thus site services consolidation schedule still up in the air
    • Note the logging service at BNL from the SS continues to work okay
    • Hiro is running scans on LFCs - be on the lookout for problems
  • this meeting:
    • There was a major problem with DQ2 proxy delegation, caused by clock skew (the NTP server was unreachable). Fixed. (See the clock-offset sketch after this list.)
    • Tier 3 sites to stop DQ2 site services:
       As we agreed, we must consolidate all US T3 DDM related services: DQ2 SS
      and LFC to BNL.   As the first step, I would like to bring all DQ2 SS to
      BNL tomorrow.  Basically, I need to ask you to turn off DQ2 since BNL's
      DQ2 SS will serve your sites.  If you run DQ2 SS serving the following
      sites, please stop your DQ2 (or remove them from your configuration):
      OUHEP (is this T3?)
      WISC XYZ
      UTD XYZ
      ILLINOIS XYZ  (done)
      DUKE XYZ (done)
      ANL XYZ (done)
      If you know any other sites, please let me know.
      Please keep your LFC's running.    That will be the second step.
      I would like to do this tomorrow at 12PM US Eastern time.
      If you have any questions, please let me know. 
    • FTS 2.2 news: seems to be stable, there will be throughput tests at CERN; Hiro will test Bestman sites with this deployment. Upgrade at Tier 1's not expected until March(?).
    • DQ2 update at Tier 2s then? Any problem with existing site services? None.
    • LFC updates - Hiro will send an email asking which specific version to install. Sites should update before next year.
    • New monitor for DQ2 site services at BNL, integrated into Panda page as well (see "transferring").
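    • Since clock skew broke proxy delegation both here and in the FTS incident in the shifters report, a quick check of the local clock offset can catch this early. A minimal sketch using the third-party ntplib package; the NTP server name and the 5-minute threshold are assumptions, not prescribed values.

      import ntplib  # third-party: pip install ntplib

      def clock_skew_ok(server="pool.ntp.org", max_offset=300.0):
          """Return True if the local clock is within max_offset seconds of the NTP server."""
          offset = ntplib.NTPClient().request(server, version=3, timeout=5).offset
          print("clock offset vs %s: %.3f s" % (server, offset))
          return abs(offset) <= max_offset

      if not clock_skew_ok():
          print("WARNING: clock skew may be large enough to break proxy delegation")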

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Fred spent some time revising the SquidTier2 page
    • Checked squid version at all Tier 2 sites
    • Please send any documentation problems to John or Fred
    • Two environment variables need to be set at each site: one for the squid/Frontier server, and one for the location of the XML file catalog for pool conditions files. These two variables still need to be standardized. There is also the issue of caching over-subscribed conditions files (not sure how serious this is); this is similar to the issue with the DBRelease file (though that gets staged by the pilot). Athena uses remote I/O since the conditions file is never staged locally (thus pcache does not apply).
    • All sites need to send their Squid URL to John DeStefano at BNL.
    • Fred, Xin, and Alessandro are working to finalize the installation of the PFC.
  • this week
    • Two working Frontier launch pads in North America w/ cache consistency checks enabled (BNL, Triumf).
    • New validated squid at BU - all looks okay.
    • Need to set up failover: if the local squid fails, or if the launch pad goes down, fail over to Triumf; ACLs need to be adjusted.
    • Discussion about which variables to use for enabling Frontier access at the site (see the sketch after this list). The ATLAS setup script presumably does this automatically. Fred will follow up with Alessandro and Xin.
    • Are the install instructions up to date with the new failover setup? Not yet.
    • Fred: attempting to get conditions pool files copied / updated at Tier 2s. Putting PFCs in place using Alessandro's script also needs to be tested. Xin: Rod Walker will put Frontier information for US sites into ToA.
    • Discussion of Athena access to conditions data via direct reading versus direct copy. To be taken up with Richard Hawkings.
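    • A minimal sketch of the kind of environment setup discussed above, using the standard Frontier client FRONTIER_SERVER syntax with primary and failover launch pads plus the local squid. All hostnames and paths are placeholders, and the name of the second (pool conditions catalog) variable is shown only as a placeholder pending the standardization discussed above.

      import os

      # Primary launch pad, failover launch pad, and the site's local squid
      # (all placeholder URLs -- substitute the real BNL/Triumf endpoints).
      os.environ["FRONTIER_SERVER"] = (
          "(serverurl=http://frontier.primary.example:8000/atlr)"
          "(serverurl=http://frontier.failover.example:8000/atlr)"
          "(proxyurl=http://squid.mysite.example:3128)"
      )

      # Second variable: location of the XML catalog for pool conditions files.
      # The variable name below is a placeholder, not a standardized name.
      os.environ["SITE_POOL_CONDITIONS_CATALOG"] = "/path/to/PoolFileCatalog.xml"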

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    Meeting Notes from USATLAS Throughput Call
    Attending:  Shawn, Dave, Jason, Sarah, John, Horst, Karthik, Hiro, Doug
    Discussion about perfSONAR status.   
    AGLT2 (rc4 version?), BNL, MSU, and MWT2_IU (previous version) all lost their configurations (apparent disk problems: read-only disk; rebooting loses the config and data on disk). Possibly an issue with the KOI hardware? Needs to be debugged.
    Issues with perfSONAR-BUOY Regular Testing (Throughput and/or One-Way Latency) changing from "Running" to "Not Running". Happening at AGLT2, BNL, MWT2, OU.
    Dave reported running perfSONAR on a non-KOI (Intel) box for a few weeks and has had no problems; however, the tests have only been running for a few days.
    Internet2 developers will be looking into these two problems...they may be the same problem (I/O errors may cause services to stop and/or lose data).   May request files and/or access to problematic perfSONAR instances.  
    Next discussion was about using perfSONAR results to find issues:
    1) SWT2-UTA possibly an issue.  OU, MWT2_IU and AGLT2 show low throughput to it.  IU was showing lots of packet loss in latency tests.   However BNL->SWT2-UTA and MWT2_UC->SWT2-UTA seem OK.   Needs looking into.  Perhaps path differences may show where the problem lies.
    2) AGLT2-BNL showing asymmetry again.  AGLT2->BNL is good (>910 Mbits/sec) but BNL->AGLT2 is poor (40-70 Mbits/sec) since October 15th.  Needs investigation.
    Discussion about Hiro's new testing.   Tier-2 to Tier-2 testing is now in place.  Only 10 files are moved in the tests, and only DATADISK is used.  Select a Tier-2 site (_DATADISK version) from http://www.usatlas.bnl.gov/dq2/throughput and you can see the results (scroll down for graphs).
    Slowdown to the IU Tier-2 at the end of October: the reason was found.  The gridftp2 setting on the FTS channel was lost, preventing door-to-door transfers.  Hiro fixed it and throughput was restored.
    Discussion about LHC network (LHCOPN CERN<->BNL).   BNL/John Bigrow wants to test new LHCOPN link to BNL for load-balancing. Needs to arrange high-throughput test from CERN <-> BNL (could use iperf).  Will work with Hiro/Eduardo/Simone on this sometime next week.
    Milestone: BNL -> (set of Tier-2's) at 1 GB/sec for 1 hour.  Test this Friday (October 23) at either 10 or 11 AM.   Need to get feedback from others on which time is best.
    We will NOT be meeting next week unless someone wants to chair the meeting in my absence ( I will be on a plane to Shanghai).    Next meeting TBD.
    • Still having perfSONAR issues. Hangs followed by lost settings upon reboot.
    • Problems at UTA
    • Problem at UMICH (AGLT2) where data transfers are slower in one direction than the other (see the sketch at the end of this section).
    • IU slowness fixed by updating FTS.
    • 1 GB/s benchmark on Friday. All sites will participate.
  • this week:
    • Throughput test for BNL a week ago: pushed 1.4 GB/s out of BNL to the Tier 2s MWT2, AGLT2, and WT2.
    • Passed milestones for BNL, UC, SLAC DONE
    • Need to test NET2, re-do AGLT2.
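    • A minimal sketch of the kind of asymmetry check discussed above (e.g. the AGLT2<->BNL case), flagging site pairs whose forward and reverse throughput differ by more than a chosen ratio; the input dictionary and the threshold are illustrative, not a real perfSONAR API.

      def flag_asymmetric_pairs(throughput_mbps, ratio_threshold=5.0):
          """throughput_mbps maps (src, dst) -> measured Mbits/sec."""
          flagged = []
          for (src, dst), fwd in throughput_mbps.items():
              rev = throughput_mbps.get((dst, src))
              if rev and max(fwd, rev) / min(fwd, rev) > ratio_threshold:
                  flagged.append((src, dst, fwd, rev))
          return flagged

      # Numbers quoted in the notes above: AGLT2->BNL good, BNL->AGLT2 poor.
      rates = {("AGLT2", "BNL"): 910.0, ("BNL", "AGLT2"): 55.0}
      print(flag_asymmetric_pairs(rates))   # [('AGLT2', 'BNL', 910.0, 55.0)]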

Site news and issues (all sites)

  • T1:
    • last week(s): Increased analysis slots to 500.
    • this week: Network upgrade yesterday, led by Shigeki Misawa - based on Force10 and Foundry Networks equipment. Restarting all services from zero to a fully working Tier 1 facility took ~3 hours. Some unexpected issues were identified (involving unintended package updates). Completely new network, lots of cabling! Five racks of Dell nodes, with Dell on-site. Networking is going into the new data center. Ordering ~1.5 PB of disk, Nexsan disk arrays (FC-connected to Thors; the Nexsan controller is powerful). Also evaluating a fully configured S2A9900 with 2 TB drives providing 2 PB of storage. Wei notes that at WT2, on Solaris 10 update 7, interrupts go to only one CPU; this is not seen at BNL, where the CPUs are load-balanced. Wei: should be partially solved with update 8. Now an additional 10G link to CERN (20 G total). Seeing 1.5 Gbps at times.

  • AGLT2:
    • last week: Rearranging the machine room to accommodate new hardware. Getting ready to upgrade to SL5 via rolling upgrades. Already 50% of slots are analysis (~650 slots).
    • this week: Updated the site certification tables. 18 MD1000 shelves, plus blade chassis. MSU will be provisioning new blade chassis as well. Rolling updates.

  • NET2:
    • last week(s): Running normally. Normally run 50% analysis: ~350 slots currently. Still tuning perfSONAR. Freeing 50 TB of space.
    • this week: Will be updating the site certification table as well. Found some bestman hangs in the past week or two. Will change the BU cluster to SL5. Post-mortem of the UAT tests. Installed Squid at HU for local users.

  • MWT2:
    • last week(s): SRM problem showing more free space than was available. 90 TB of dark files cleaned. 2200 slots total / 800 slots analysis
    • this week: LFC updated to 1.9.7-4 and dCache updated to 1.9.5-6. Both went smoothly; all back online today. Ran 14K jobs over the UAT with a peak of 1100 jobs. Some bugs in pcache were exposed under load.

  • SWT2 (UTA):
    • last week: 200 analysis slots. Working on procurement.
    • this week: Ran smoothly during UAT; will do LFC upgrade during production (the LFC clients will retry). Working on procurement.

  • SWT2 (OU):
    • last week: Finally in the process of getting a new quote from Dell and DDN. Will probably get more compute nodes.
    • this week: Looking at a network bandwidth asymmetry.

  • WT2:
    • last week(s): Need to change Panda parameters for use with analysis jobs. Down to last 15 TB of free space. Actively cleaning. Expecting new storage soon. Working on getting 50% of slots for analysis but non-trivial to do. Hope to get 500 slots for analysis. New storage hardware installed. Canadian grid certificates problem solved.
    • this week: Relocating the ATLAS release install area to a new server, off the xrootd server.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up.
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate the checksum on the fly and expects to release it very soon. Want to supply this to the gridftp server as well (see the sketch at the end of this section).
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
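  • A minimal sketch of an on-the-fly (incremental) Adler-32 calculation of the kind discussed above, using Python's zlib; this is an illustration, not Alex's actual implementation.

      import zlib

      def adler32_of_file(path, blocksize=1 << 20):
          """Compute the Adler-32 of a file incrementally, 1 MB at a time."""
          value = 1  # Adler-32 seed used by zlib
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(blocksize), b""):
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xFFFFFFFF)

      # Example: print the checksum in the 8-hex-digit form used by grid tools.
      # print(adler32_of_file("/path/to/file"))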

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, some validation jobs are still running and there are some problems to solve. If anyone wants to migrate, go ahead, but we are not pushing right now. We want to have plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2 and 7 weeks from now - for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2.
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they hold off? Probably okay to proceed - Michael. Would yield some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed.
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystems) - this requires OSG 1.0.4 or later. Note - not important for WLCG, as it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not yet seen a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP at the sites.
    • Here is a snapshot of the most recent report for ATLAS sites:
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reported correctly (see the %Diff sketch at the end of this section).
    • What about storage information in config ini file?
  • this meeting
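  • A minimal sketch of the %Diff calculation as defined in the report snapshot above. The exact formula (difference relative to the calculated value, truncated to an integer) is inferred from the table rows and should be treated as an assumption, not the official MyOSG code.

      def percent_diff(calculated, lower_limit, upper_limit):
          """Negative when the calculated value is below LL, positive when above UL, else 0."""
          if calculated == 0:
              return -100 if lower_limit > 0 else 0   # e.g. the MWT2_UC row
          if calculated < lower_limit:
              limit = lower_limit
          elif calculated > upper_limit:
              limit = upper_limit
          else:
              return 0
          return int(100.0 * (calculated - limit) / calculated)

      # AGLT2 ICC row from the table: calculated 5,150 with LL = UL = 4,677 -> 9
      print(percent_diff(5150, 4677, 4677))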


AOB

  • last week
    • None
  • this week
    • Wednesday, November 25 - we probably should still have a meeting that day (the day before Thanksgiving).

-- RobertGardner - 04 Nov 2009
