


Minutes of the Facilities Integration Program meeting, Nov 18, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Rob, Aaron, John DeStefano, Doug, Rik, Michael, Sarah, Charles, Bob, Armen, Kaushik, Mark, Horst, Wei, Saul, Torre
  • Apologies: none

Integration program update (Rob, Michael)

  • SiteCertificationP11 - FY10Q1
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
    • this week
      • Upcoming schedule for LHC operations. Increased activity by this Friday. Expect some beam splash events soon. Then circulating beams, with and without RF capture.
      • Day 13 could mean 900 GeV collisions.
      • Data before Christmas will be 'special', including RAW and ESD data. Expect the normal distribution policy to apply.
      • Will be a wide distribution, especially at the Tier 2s. Very important to have all Tier 2s stable and available.
      • We should refrain from upgrades and other extended downtimes.
      • End of this week to December 18 - refrain from any disturbance.
      • December 18 to January 10 - will be a better window for upgrades and scheduled interventions.
      • Are the space token upgrades really necessary before December 18?
      • Expect the adc-operations list will be used to communicate data replication, etc.; Michael will send summaries.
      • Active cleaning is going on right now to create space for the new data, though Kaushik reports some issues. He believes all Tier 2s are nonetheless in good shape.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week:
    • Tier 3 meeting at ANL: https://atlaswww.hep.anl.gov/twiki/bin/view/Tier3Setup/29Oct09Meeting
    • Concentrating on T3G. Will have an SE, an interactive and batch cluster.
    • 12 people have indicated interest in developing T3G. Organizational phone meetings on Friday.
    • Complete description by the end of the year.
    • Working closely with OSG and the Condor team.
  • this week:
    • Discussions w/ Torre about use of pathena + panda at the Tier 3. Will be testing this at Duke soon.
    • Follow-up phone meeting for the organizational meeting - Monday/Tuesday of next week.
    • Setting up VMs to test clusters (Marco helping)
    • Contact with Massimo Lamanna regarding Tier 3 issues ATLAS-wide, to be discussed at CERN at the end of the month; perhaps a meeting at the end of January.
    • CERN web filesystem meeting; will be testing the next-generation version. Can the US set up a mirrored CERN VM site?
    • Subscription tests to SEs have not yet begun
    • Kaushik believes we still need a manual approval process to ensure reasonable requests
    • Immediate need is to get test subscriptions for bringing up Tier 3 SEs.
    • Hiro: LFC migration to BNL-LFC; Illinois-hep now migrated. OU later today. UTD? Need to make a MySQL dump (a hedged mysqldump sketch follows this list). Wisconsin tomorrow. Then done.
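    • On the LFC migration item above: mysqldump is the usual way to capture a site LFC catalog before handing it to BNL. A minimal sketch, assuming the MySQL backend uses the common LFC database name cns_db; the host, account, and file names are hypothetical:
       # Dump the LFC catalog to a single file for transfer to BNL.
       # --single-transaction takes a consistent snapshot of InnoDB tables
       # without locking; for MyISAM tables use --lock-tables instead.
       mysqldump -h lfc-db.example.edu -u lfc_reader -p \
           --single-transaction cns_db > utd_lfc_dump.sql
       # Compress before shipping; BNL can restore with "mysql cns_db < utd_lfc_dump.sql".
       gzip utd_lfc_dump.sql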

UAT program (Kaushik, Jim C)

  • last week(s):
    • ADC daily operations notes
    • See https://twiki.cern.ch/twiki/bin/view/Atlas/UserAnalysisTest
    • 112 Ganga, 50 Panda users participating
    • Some load balancing among clouds
    • First two days for job submission (Wed, Thurs); data retrieval didn't seem to start until Monday
    • Failure rates ~40%, which might have included jobs failing because trigger information was not included in the AODs. Does this include jobs killed by users?
    • Kaushik is preparing efficiencies of users versus clouds.
    • MWT2 - there was a bad node causing failures. There were some data movement / load related issues (>1000 analysis jobs).
    • SLAC - container issues causing failures. Wei: the most dominant reasons were jobs killed by users, and a user using a version of pyutils that didn't support xrootd (needs to be checked in Release 15.1.0), and then an old release, 14.4.0. A few failures because the xrootd server failed. Also moving the release install area off the xrootd server.
    • Need to exclude cancelled jobs from statistics.
    • We need to go through the failures looking back into the database.
    • Over the weekend, there were a large number of release 14 jobs holding database connections open
    • We need to get a better handle on what to expect for users retrieving data - dq2 tracking of retrievals per site peaked on Monday.
  • this week:
    • Will be another UAT at some point. May try for next week.
    • Post mortem tomorrow, Nurcan presenting new results (from Saul).
    • Jim Cochran has asked for plans for future tests and lessons learned for a presentation in two weeks.
    • Fred raised the issue of the memory cut-off; results were mixed, should be discussed at tomorrow's meeting.

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Formalize the procedure for cleaning USERDISK (the procedure just needs to be put in the twiki by Armen); this will be done centrally.
    • Checksum problems with dCache: the load is too high to compute checksums.
    • Time-out problems with xrootd. Wei suggests reducing the number of simultaneous transfers; the default is 200.
    • Call backs to Panda server timing out.
    • A new dq2 version is available.
    • LOCALGROUPDISK to be deployed at every T2. US-only usage; will be monitored by Hiro's system. Timeframe: within a week. There is a stability issue with xrootd.
    • Michael: issue of publishing SEs in the GIS (BDII). The reason is to allow data replication between Tier 2s and other Tier 1s. We need to make sure our SEs get published into the OSG interoperability BDII. Start within a week. Xin to follow up (see the hedged ldapsearch sketch at the end of this section).
  • this week:
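    • On the BDII publication item under "last week(s)": a quick way to check whether an SE has made it into the interoperability BDII is an ldapsearch against the GLUE schema. A minimal sketch; the BDII host shown (is.grid.iu.edu:2170) and the SE name are examples/assumptions, not confirmed endpoints:
       # Query the BDII for the GlueSE entry of a given storage element.
       ldapsearch -x -LLL -H ldap://is.grid.iu.edu:2170 -b "o=grid" \
           "(&(objectClass=GlueSE)(GlueSEUniqueID=se.example.edu))" \
           GlueSEUniqueID GlueSEImplementationName GlueSEStatus
       # An empty result means the SE is not yet published; querying the site's
       # own GIP endpoint the same way helps narrow down where publication stops.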

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  11/5: UTD-HEP -- following the FTS upgrade at BNL file transfers at UTD-HEP were failing due to the older version of the SRM s/w at the site.  It was upgraded, and transfers are now succeeding.  
    (FTS error was "locality is NONE," fixed by adding [retention:REPLICA][latency:ONLINE] to each space token definition.)
    2)  11/6: SLAC migrated the atlas s/w releases area to a different NFS server, but initially there were problems with some of the s/w paths / links, etc.  Wei resolved the problems -- SLAC set back to 'online'.
    3)  11/6: BNL -- disk failures in one of the storage servers for MCDISK.  From Pedro:
    One of our storage servers was suffering from several disk failures.  This affected 33TB of data stored in the MCDISK space token.  The problematic disks have been replaced and the RAID set is being rebuilt.  All data is available again.
    4)   11/6: Jobs were failing at OU_OCHEP with the error "lfc_getreplicas(): SFN not set in LFC."  Understood -- file removed during a disk clean-up.  From Wensheng:
    The dq2-cache cleaner that ran yesterday removed the physical copy of the file DBRelease-7.5.1.tar.gz. A bit later pandamover brought it back, jobs have been running fine since then.
    5)  11/7 - 11/8: AGLT2 -- jobs were failing with input file staging errors -- Bob tracked down the problem to a hung gridftp server.  Issue resolved -- site set back to 'online'.  RT 14551.
    6)  11/8-11/9 -- a very large number of panda jobs failed with the error "No reply to sent job."  As Tadashi reported, to fix a problem with the CERN Oracle PandaDB the panda server was down 
    for 5-10 min at 15:00 CERN time on 11/9 in order to optimize a schema in the Oracle database.
    7)  11/11, early a.m. -- intermittent network problems at BNL -- issue resolved:
    The RHIC and US Atlas facilities were experiencing intermittent network connectivity problems. The source of the problem has
    been identified and steps were taken to correct the problem.  The underlying cause is under investigation and, in the event that the problem recurs, steps will be taken to resolve the problem quickly.
    Follow-ups from earlier reports:
    (i)  UAT -- a postmortem announcement to follow.
    (ii) A new test instance of the RT server at BNL was announced by Jason (message to the usual mail lists).  Try it out at: https://rt.racf.bnl.gov/rt3/
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  11/11-11/12: IU_OSG -- kernel upgrades completed, site set back to 'online'.
    2)  11/11: BNL -- ~150 jobs failed with stage-in errors -- issue was an off-line storage server -- resolved.  RT 14585.
    3)  11/12: BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until
    Monday, November 16th, and eventually to the 21st of December.
    4)  11/13: ~500 failed jobs at BU with local site mover errors.  The log extract included "no space left on device."  From Saul:
    We got short of disk space in the process of moving our DATADISK.  It should be fixed now.  eLog 6926.
    5)  11/13: At the beginning of the shift BNL and AGLT2 had no activated jobs, but plenty of assigned ones.  Issue with the BNL_ATLAS_DDM queue was eventually resolved.  See extensive mail thread for details.
    6)  11/14: Jobs at AGLT2 were gradually draining out.  From Bob:
    Running job count at aglt2 began to drop at 17:40pm.  I subsequently found a crashed "ypbind", and restarted it at 20:15.  All times EST.  Grid services are once again authenticating, however, we expect a number 
    of dead/crashed jobs to show up from this time period.
    7)  11/15: srm storage filled up at UTD-HEP.  Some issues running the "proddisk-cleanse.py" script.  Being worked on.  Site set 'off-line'.  RT 14708.
    8)  11/17 early a.m.: AGLT2 -- transfer errors, jobs were failing with "Put error: Copy command returned error code 256 and output: httpg://head01.aglt2.org:8443/srm/managerv2: CGSI-gSOAP: Could not open connection!"  Resolved -- from Shawn:
    The /var partition on the dCache headnode was full. This was apparently due to excessive logging into the postgres DB. Some space has been freed and both postgres and dcache services restarted on head01.aglt2.org.
    9)  11/17: LFC migration to BNL completed for tier 3 site IllinoisHEP.  Test jobs submitted, but they seem to still be using the old LFC information.  Wensheng updated the ToA, and the jobs have now finished successfully.  
    Will set the site to 'on-line' once the output file transfers complete.
    Follow-ups from earlier reports:
    (i)  UAT -- postmortem will be held November 19, 2:00pm CET.
    (ii) A new test instance of the RT server at BNL was announced by Jason (message to the usual mail lists).  Try it out at: https://rt.racf.bnl.gov/rt3/

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • There was a major problem with DQ2 proxy delegation, caused by clock skew (the NTP server was unreachable). Fixed. (A hedged clock-check sketch appears at the end of this section.)
    • Tier 3 sites to stop DQ2 site services:
       As we agree, we must consolidate all US T3 DDM related services: DQ2 SS
      and LFC to BNL.   As the first step, I would like to bring all DQ2 SS to
      BNL tomorrow.  Basically, I need to ask you to turn off DQ2 since BNL's
      DQ2 SS will serve your sites.  If you run DQ2 SS serving the following
      sites, please stop your DQ2 (or remove them from your configuration):
      OUHEP (is this T3?)
      WISC XYZ
      UTD XYZ
      ILLINOIS XYZ  (done)
      DUKE XYZ (done)
      ANL XYZ (done)
      If you know any other sites, please let me know.
      Please keep your LFC's running.    That will be the second step.
      I would like to do this tomorrow at 12PM US Eastern time.
      If you have any questions, please let me know. 
    • FTS 2.2 news: seems to be stable; there will be throughput tests at CERN. Hiro will test Bestman sites with this deployment. Upgrade at Tier 1s not expected until March(?).
    • DQ2 update at Tier 2s then? Any problem with existing site services? None.
    • LFC updates - Hiro will send an email asking which specific version is required. Sites should update before next year.
    • New monitor for DQ2 site services at BNL, integrated into Panda page as well (see "transferring").
  • this meeting:
    • Will be testing FTS 2.2 this week.
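    • On the delegation problem noted under "last meeting(s)": proxy delegation is sensitive to clock skew, so checking NTP synchronization is a cheap first diagnostic. A minimal sketch using standard tools; the reference server name is a placeholder:
       # Show configured NTP peers and current offsets; a large offset or no
       # reachable peer is the kind of skew that broke the DQ2 delegation.
       ntpq -p
       # Query (without setting) the local clock offset against a known-good server.
       ntpdate -q ntp.example.edu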

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Two working Frontier launch pads in North America with cache consistency checks enabled (BNL, TRIUMF).
    • New validated squid at BU - all looks okay.
    • Need to set up failover: if the local squid fails, or if the launch pad goes down, fail over to TRIUMF; ACLs need to be adjusted (see the hedged failover sketch at the end of this section).
    • Discussion about which variables to use for enabling Frontier access at a site. The ATLAS setup script presumably does this automatically. Fred will follow up with Alessandro and Xin.
    • Are the install instructions up to date with the new setup for failover? Not yet.
    • Fred: attempting to get conditions pool files copied / updated at Tier 2s. Also, the PFCs put in place using Alessandro's script need to be tested. Xin: Rod Walker will put Frontier information for US sites into ToA.
    • Discussion of Athena access to conditions data via direct reading versus direct copy. Take up with Richard Hawkings.
  • this week
    • Fred is attempting to stress-test the servers at BNL. Problems getting the POOL conditions files available in BNL's HOTDISK.
    • Entries are not being generated in the correct format. Might be related to using an old release of the dq2-client tools.
    • May need to postpone the test until after Christmas.
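    • On the failover item under "last week": in the frontier_client configuration, failover order is simply the order of the proxy and server entries in FRONTIER_SERVER. A minimal sketch of the intent only; the URLs are placeholders, not the actual BNL/TRIUMF endpoints:
       # Try the local squid first, then a fallback squid, then the launch pads
       # in the order listed (BNL, then TRIUMF).
       export FRONTIER_SERVER="(serverurl=http://frontier.bnl.example:8000/atlr)(serverurl=http://frontier.triumf.example:8000/atlr)(proxyurl=http://squid.localsite.example:3128)(proxyurl=http://squid.fallback.example:3128)"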

Throughput Initiative (Shawn)

Site news and issues (all sites)

  • T1:
    • last week(s): Network upgrade yesterday - Shigeki Misawa - based on Force10 and Foundry Networks. Restarting all services from zero to an entire Tier 1 facility took ~3 hours. Some unexpected issues identified (involving unintended package updates). Completely new network, lots of cabling! Five racks of Dell nodes, Dell on-site. Networking going to the new data center. Ordering 1.5 PB of disk, Nexsan disk arrays (PCI FC connected to Thors; the Nexsan controller is powerful). Fully configured S2A990(?) with 2 TB drives providing 2 PB of storage - evaluating. Wei notes that at WT2 (Solaris 10, update 7) interrupts go to only one CPU; not seen at BNL, where CPUs are load-balanced. Wei: should be partially solved with Update 8. Now an additional 10G to CERN (20 G total). Finding 1.5 Gbps capacity at times.
    • this week:

  • AGLT2:
    • last week: Updated site certification tables. 18 MD1000 shelves, and blade chassis. MSU will be provisioning new blade chassis as well. Rolling updates.
    • this week: Taken delivery of the full complement of compute nodes and disk servers at both UM and MSU. Plan to run with 12 jobs/machine; 24 GB memory plus hyperthreading. See https://hep.pa.msu.edu/twiki/bin/view/AGLT2/HepSpecResults.

  • NET2:
    • last week(s): Will be updating the site certification table as well. Found some Bestman hangs in the past week or two. Will change the BU cluster to SL5. UAT tests post-mortem. Installed a Squid at HU for local users.
    • this week: BDII problem currently - someone is trying to copy a dataset to CERN's scratch disk. Will need to publish SRM information correctly in OSG.

  • MWT2:
    • last week(s): LFC updated to 1.9.7-4, dCache updated to 1.9.5-6. Both went smoothly. All back online today. Ran 14K jobs over the UAT with a maximum of 1100 jobs. Some bugs in pcache were exposed under load.
    • this week: Main issue is stabilizing dCache - some issues with pool selection since the upgrade.

  • SWT2 (UTA):
    • last week: Ran smoothly during the UAT; will do the LFC upgrade during production (the LFC clients will retry). Working on procurement.
    • this week: Will do the LFC upgrade later this afternoon. Purchasing proceeding.

  • SWT2 (OU):
    • last week: Looking at a network bandwidth asymmetry.
    • this week: 80 TB being purchased; ~200 cores. 100 TB also on order.

  • WT2:
    • last week(s): relocating ATLAS release to new server; xrootd.
    • this week: Publishing BDII - working. Site maintenance tomorrow; Linux kernel for security patches. Might do LFC upgrade.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server (a hedged checksum sketch appears at the end of this section).
    • Need to communicate with CERN regarding how this will work with FTS.
  • this week
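  • A note on the on-the-fly checksum item above: Adler32 can be accumulated over fixed-size chunks, so a gridftp or storage plugin never needs the whole file in memory. A minimal Python sketch; the "ad:" prefix is just one common way of presenting the value, not necessarily what Alex's implementation will use:
      #!/usr/bin/env python
      # Compute the Adler32 checksum of a file in 1 MB chunks.
      import sys, zlib

      def adler32_of_file(path, blocksize=1024*1024):
          value = 1                       # Adler32 seed value
          with open(path, 'rb') as f:
              while True:
                  chunk = f.read(blocksize)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          return value & 0xffffffff       # force an unsigned 32-bit result

      if __name__ == '__main__':
          print("ad:%08x" % adler32_of_file(sys.argv[1]))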

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however, there are some validation jobs still going on and some problems to solve. If anyone wants to migrate, go ahead, but we are not pushing right now. Want plenty of time before data comes (which means the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP from the sites
    • Here is a snapshot of the most recent report for ATLAS sites:
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in the config.ini file?
  • this meeting


  • last week
    • Wednesday, November 25 - we probably should have a meeting on that day. (Day before Thanksgiving.)
  • this week
    • Fred points out that BNL has two complete and independent sets of software installed - one installed by Alex Undrus, one installed by Xin from the kits.

-- RobertGardner - 17 Nov 2009
