


Minutes of the Facilities Integration Program meeting, Dec 9, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees:
  • Apologies: none

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • no report
  • this week:

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  11/25: UTD-HEP -- Site set back to 'online' after test jobs finished successfully.
    2)  BNL: cyber-security port scans, originally scheduled for December 2/3, have been rescheduled for December 21/22.
    3)  11/27: ~600 failed jobs at MWT2 with the error "SFN not set in LFC for guid..."  Resolved -- from Sarah:
    The problem is that we ran proddisk_cleanse.py at our site to clean up proddisk, but some of the cleaned files were needed by activated jobs. The files have been re-transferred now.  It's possible we'll have a few more errors, 
    but it should be mostly clear now.  ggus 53690.
    4)  11/27 - 11/29: Transfer errors between SWT2_CPB and FZK -- problem resolved.  ggus 53707, RT 14813, eLog 7566.
    5)  11/30: New pilot version from Paul (41a):
    * Further modifications for SLC5/gcc43/i686-slc5-gcc43-opt jobs for installations with shared compiler and automatic selection of gcc43. AtlasLogin/AtlasSettings verification is now looking in $SITEROOT if the corresponding dirs exist there, if not, 
    it will fall back to swdir (constructed from appdir/$VO_ATLAS_SW_DIR) as before. If it is not needed, the pilot will not set up the compiler as it did/does at BNL/CERN. Successfully tested at INFN-ROMA1 using release 15.3.1 + 
    Changes are compatible with current tests at BNL.
    * User analysis jobs are now setup with forceConfig where available, as well as explicitly setting non-empty AtlasVersion/AtlasProject.
    * File stager corrections including release check (file stager not compatible with release < 15.4.0). Direct access is now switched off if copysetup does not contain proper setup.
    * Protection against down site in user analysis trf download, including re-trials and fallback to optional download site.
    * HOTDISK correction. Replica randomization now properly handles hotdisk exception (should not be randomized).
    * Storage paths for mover log files now more detailed (fully date based, previous version only created monthly subdirs).
    6) 11/30: UTD-HEP -- ~200 failed jobs due to missing db release v7.5.1 -- the site LFC still had an entry for the file, though it was no longer on disk.  Wensheng removed the LFC entry, PandaMover re-staged the file, and production resumed.  ggus 53722, eLog 7596.
    7)  12/1: Minor pilot update from Paul (41b):
    * Correction for install jobs. The previous pilot version had a problem with the internal handling of prodSourceLabel=software, now corrected.
    * After the getJob operation, the pilot now stores the dispatcher return code StatusCode in a file. Requested by Peter Love.
    8)  12/1: Power outage in the CERN computing center affected a large number of systems related to panda / production.  A very large number of "lost heartbeat" jobs was seen at most sites as a result.
    9)  12/2: Some sites (for example AGLT2) were draining due to a pilot problem.  The voms proxy on the submit host at BNL had a lifetime under 24 hours, causing pilots to fail with the error "Voms proxy certificate does not exist or is too short."  (A hedged sketch of a pre-submission proxy-lifetime check appears at the end of this list.)
    From Xin:
    The BNL voms server only issues a 24-hour voms proxy, and roughly 1 out of 4 voms-proxy-init calls so far has hit the BNL server.  I disabled the BNL voms server in the configuration file .glite/vomses.
    Follow-ups from earlier reports:
    (iii) BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.  
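    For illustration, a minimal sketch of the kind of pre-submission proxy-lifetime check suggested by item 9), assuming only that voms-proxy-info is available on the submit host (this is not the actual pilot or submit-host code):

      import subprocess
      import sys

      MIN_LIFETIME_SECONDS = 24 * 3600  # pilots in this incident needed >= 24 hours

      def proxy_seconds_left():
          """Remaining proxy lifetime in seconds, or 0 on any error."""
          try:
              out = subprocess.check_output(["voms-proxy-info", "--timeleft"])
              return int(out.decode().strip())
          except (OSError, subprocess.CalledProcessError, ValueError):
              return 0

      if __name__ == "__main__":
          left = proxy_seconds_left()
          if left < MIN_LIFETIME_SECONDS:
              print("Proxy lifetime %d s is below %d s -- renew before submitting pilots"
                    % (left, MIN_LIFETIME_SECONDS))
              sys.exit(1)
          print("Proxy OK: %d s remaining" % left)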
  • this meeting:
     Yuri's summary from the weekly ADCoS meeting:
    1)  12/2-3: Another instance of a db release file registered in the LFC for UTD-HEP but no longer present on disk.  Fixed by Wensheng (thanks!).  RT 14843.  (One more instance of this issue on 12/6 as well; a hedged cleanup sketch appears at the end of this list.)
    2)  12/3: From Charles at UC:
    We had an apparent power interruption at UC last night at around 2AM CST. Expect some "lost heartbeat" errors from jobs that were running at the time.
    3)  12/3: BNL: From Michael:
    Due to a configuration issue associated with the dccp client, some jobs at BNL failed (~4k failed jobs).  The problem has since been resolved.  eLog 7687.
    4)  12/3: IU_OSG -- Jobs were failing with the error "Put error: lfc-mkdir failed: LFC_HOST iut2-grid5.iu.edu cannot create....  Could not secure the connection |Log put error: lfc-mkdir failed."  From Aaron at MWT2:
    This has been resolved by a restart of proxies at IU_OSG.  RT 14849.
    5)  12/5-7: Power problems at AGLT2 -- from Bob:
    On Saturday night (~11:40pm EST) there was a power hit at Michigan State that took out a number of worker nodes.  It also apparently took out a central air conditioner.  On Sunday night (~11:20pm) the loss of that air conditioner caught up with a major switch room on the MSU campus,
    and took down the network switch equipment for 2 hours, completely isolating more than half of our dCache disk servers from the systems that remained up at the University of Michigan.  Three of these did not come back properly when network connectivity was
    re-established and were manually restarted early this morning; total down time for them was about 8 hours.  All jobs running at MSU at the time were lost.  We had further network instability this afternoon that may have killed our running job load,
    but things should now be back on track.  All of these lost jobs should eventually show up as "lost heartbeat".
    6)  12/7: SLAC -- ADCoS shifter reported T1-T2 transfer errors.  ggus 53942.  The issue was resolved by restarting the SRM service.
    7)  12/7: BNL DQ2 site services s/w upgraded to the newest production version (Hiro).
    8)  12/7: AGLT2_PRODDISK to BNL-OSG2_MCDISK transfer errors.  From Shawn:
    We have two storage nodes with dCache service problems. I believe a simple restart should fix it.  ggus 53915, eLog 7819.
    9)  12/8: Power outage at BNL completed:
    The partial power outage at the RACF that affected a portion of the Linux Farm cluster on Tuesday, Dec. 8 is now over. All affected systems (ATLAS, BRAHMS, LSST, PHENIX and STAR) have been restored and are available to the Condor batch system again.
    Follow-ups from earlier reports:
    (i) BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.
    (ii) BNL -- cyber-security port scans, originally scheduled for December 2/3, have been rescheduled for December 21/22.
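    For illustration, a hedged sketch of the stale-LFC-entry cleanup referred to in item 1), assuming the standard lcg_util clients (lcg-lr to list replicas, lcg-uf to unregister one) and a purely hypothetical surl_to_local_path() mapping (the real mapping is site specific):

      import os
      import subprocess

      VO = "atlas"

      def surl_to_local_path(surl):
          """Hypothetical SURL -> local path mapping; the real one is site specific."""
          return surl.split("SFN=")[-1]

      def clean_stale_replicas(guid):
          """Unregister any replica of this GUID whose file is gone from local disk."""
          surls = subprocess.check_output(
              ["lcg-lr", "--vo", VO, "guid:%s" % guid]).decode().split()
          for surl in surls:
              if not os.path.exists(surl_to_local_path(surl)):
                  print("Unregistering stale replica: %s" % surl)
                  subprocess.check_call(["lcg-uf", "--vo", VO, guid, surl])

      if __name__ == "__main__":
          clean_stale_replicas("placeholder-guid")  # placeholder, not a real GUID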

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • One user in US trying to run on first real data: data09_900GeV.00140541.physics_MinBias.merge.AOD.f175_m273. A job on a local file is successful.
    • UAT postmortem on Nov. 19th: http://indicobeta.cern.ch/conferenceDisplay.py?confId=74076
    • Analysis of errors from Saul on a subsample of 5000 failures: http://www-hep.uta.edu/~nurcan/UATcloseout/
      • Two main errors: Athena crash (43.6%) and staging input file failed (43.1%). Athena crashes mostly reflect user job problems (forgetting to set trigger info to false, release 15.1.x not supporting xrootd at SLAC and SWT2, etc.). Stage-in problems mostly occurred at BNL (storage server problem) and MWT2 (dCache bugs, file-locking bug in pcache).
    • "Ran out of memory" failures are from one user at BNL long queue and AGLT2 as seen in the subsample. I have contacted with user as to a possible memory leak in user's analysis code.
    • DAST team has started training new shifters this month; 3 people in NA time zone, 2 people in EU time zone. 2 more people will train starting in December.
  • this meeting:

DDM Operations (Hiro)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • https://www.racf.bnl.gov/docs/services/frontier/meetings/minutes/20091124/view
    • New squid server at BNL, separate from the launch pad: frontier03.usatlas.bnl.gov (a hedged client-configuration sketch appears at the end of this section).
    • TOA listings in place now
    • PFC generation now working: Xin has integrated Alessandro's script for use on OSG sites. Working at BNL; now working on Tier 2s, sending the installation job via Panda. AGLT2 needs to update the OSG wn-client. The xrootd sites don't have the same setup.
    • Presentations tomorrow and Monday
    • Frontier & Oracle testing -- hammering the servers by sending blocks of 3250 jobs. Very good results with Frontier: millions of hits on the squid with only 60K hits on oracle (which is thus protected). Repeating the tests with direct oracle access -- initially couldn't get above 300 jobs, now running about 650 simultaneous jobs. John and Carlos are looking at the dCache and oracle load. Impressive -- only one failed job. Each job takes 4 GB of raw data and makes histograms (a reasonable access pattern).
    • Yesterday saw 6 GB/s throughput from the dCache pools; today only 4 GB/s. Oracle is heavily loaded (80% CPU) but holding up well, with a 20 MB/s peak on the oracle nodes. Frontier protects oracle from data throughput as well as from queries: oracle utilization when using Frontier was maybe 5% on each node. Impressive.
  • this week
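  • For reference, a hedged sketch of pointing a job's conditions access at a local site squid, with the BNL squid above as a fallback proxy. The (serverurl=...)(proxyurl=...) form is the usual frontier_client convention and is assumed here; the launchpad URL, local squid host, and ports below are placeholders, not real endpoints:

      import os

      def frontier_server_string(launchpad_url, proxies):
          """Build a FRONTIER_SERVER value: one server URL plus ordered proxy fallbacks."""
          parts = ["(serverurl=%s)" % launchpad_url]
          parts += ["(proxyurl=%s)" % p for p in proxies]
          return "".join(parts)

      os.environ["FRONTIER_SERVER"] = frontier_server_string(
          "http://launchpad.example.bnl.gov:8000/frontier",   # placeholder launchpad URL
          ["http://squid.mysite.example:3128",                # local site squid (placeholder)
           "http://frontier03.usatlas.bnl.gov:3128"])         # BNL squid (port assumed)

      print(os.environ["FRONTIER_SERVER"])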

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
  • this week:
    Notes for USATLAS Throughput Meeting
    Attending: Shawn, Andy, Dave, Karthik, Jeff
    Excused: Horst
    1) Dave mentioned a problem attributed to the time change.  Karthik noted his throughput node was running OWAMP.  Andy will be in touch with Karthik about node details to debug what may have happened in this case.  Shawn noted the SNMP archive was running on the latency node at UM but was "Not Running" on the throughput node.  Originally (after install) it was running on both.  Not sure of the details, but this doesn't seem to be a problem.  Discussion about perfSONAR measurements; items below:
    a) UIUC measurements show an asymmetry:  Outbound from UIUC is typically good (>900 Mbps) while inbound is usually much less (~500 Mbps UM, ~130 Mbps BU).   Need path details.   Path from UIUC to UM throughput node is:
    Traceroute security issues.
    Executing exec(traceroute, -m 30 -q 3 -f 3,, 140)
    traceroute to (, 30 hops max, 140 byte packets
    3 (  0.634 ms  0.331 ms  0.292 ms
    4  * uiuc-vrfeo-dmzo-lnk.gw.uiuc.edu (  0.944 ms  0.585 ms
    5  ur1rtr-uiuc.ex.ui-iccn.org (  0.537 ms  0.626 ms  0.463 ms
    6  t-710rtr.ix.ui-iccn.org (  3.235 ms  3.320 ms  3.184 ms
    7  nlr-710rtr.ex.ui-iccn.org (  4.918 ms  4.197 ms  3.958 ms
    8 (  3.829 ms  3.716 ms  3.671 ms
    9 (  9.429 ms  9.409 ms  9.412 ms
    10  psum02.aglt2.org (  9.301 ms  9.372 ms  9.354 ms
    Path the other direction:
    Executing exec(traceroute, -m 30 -q 3 -f 3,, 140)
    traceroute to (, 30 hops max, 140 byte packets
    3  ge-2-1-0.348.rtr.chic.net.internet2.edu (  181.511 ms *  186.440 ms
    4  710rtr-internet2.ex.ui-iccn.org (  6.271 ms  6.273 ms  6.249 ms
    5  t-ur2rtr.ix.ui-iccn.org (  9.044 ms  9.195 ms  9.201 ms
    6  iccn-ur2rtr-uiuc2.gw.uiuc.edu (  9.236 ms  9.863 ms  10.244 ms
    7  t-dmzo.gw.uiuc.edu (  21.031 ms  9.177 ms  9.109 ms
    8 (  9.292 ms  9.147 ms  9.186 ms
    9 (  9.227 ms  9.139 ms  9.053 ms
    10 (  9.256 ms  9.208 ms  17.992 ms
    11  osgx4.hep.uiuc.edu (  9.266 ms  9.329 ms  9.242 ms
    b) Bandwidth tests to SWT2-UTA are very poor -- typically 20-30 Mbits/sec.  For comparison, tests to the Australian Tier-2 in Melbourne are 50-100 Mbits/sec.  Need to determine what issues there may be along the SWT2-UTA path.
    c) For next week we should have more sites join and review their ongoing tests.  Some things to check: i) find the best and worst paths (in the One-Way Latency graphs) for packet loss (tests with the least and most packet loss over a 24-hour period), ii) find any tests with large asymmetries in throughput (a small sketch of this check appears after these notes), iii) find throughput tests which are unusually bad compared to expectations.
    2) Milestones and benchmarking: Postponed topic until next week (need better attendance)
    3) Site reports:  No other issues reported from UIUC, OU or UM.
    4) AOB:  We will plan to meet again next week but this will be the last Throughput meeting of 2009.   In 2010 we will start bi-weekly meetings.   Please try to attend next week and review your perfSONAR tests before the call.  
    Please send along any corrections or additions via email.  Thanks,
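    For illustration, a small sketch of check (ii) above -- flagging throughput tests with large directional asymmetries.  The numbers are just the approximate figures quoted in item 1a); in practice they would come from the perfSONAR measurement archives:

      # (outbound, inbound) throughput in Mbps, approximate values from the notes above
      throughput_mbps = {
          ("UIUC", "UM"): (900, 500),
          ("UIUC", "BU"): (900, 130),
      }

      ASYMMETRY_RATIO = 2.0  # flag pairs whose two directions differ by more than 2x

      for (src, dst), (out_mbps, in_mbps) in sorted(throughput_mbps.items()):
          ratio = max(out_mbps, in_mbps) / float(min(out_mbps, in_mbps))
          flag = "ASYMMETRIC" if ratio >= ASYMMETRY_RATIO else "ok"
          print("%s <-> %s: out=%d Mbps, in=%d Mbps, ratio=%.1f %s"
                % (src, dst, out_mbps, in_mbps, ratio, flag))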

Site news and issues (all sites)

  • T1:
    • last week(s): 960 new cores now commissioned (currently being used for Fred's jobs). Production/analysis split to be discussed. Evaluating DDN storage, to be delivered as 2.4 PB raw (1.7 PB usable); also some small Nexan arrays to be added to the Thors. Commissioned 3 SL8500 tape libraries, to be shared with RHIC.
    • this week:

  • AGLT2:
    • last week: See issues discussed above. Still transitioning SL5, cleaning up configuration to clear up old history.
    • this week:

  • NET2:
    • last week(s): Errors from transfers to the CERN scratch space token required several fixes. Johannes' jobs were running out of storage on worker nodes; fixed. No other outstanding problems; observed users accessing the new data.
    • this week:

  • Aside: Kaushik notes that three metrics are being used by ATLAS to determine whether sites can participate in analysis: rate in Hz (10), efficiency, and total events processed per 24 hours. There is ongoing discussion in the ICB (Michael and Jim will be in this meeting). The main point is that ATLAS management should not be making decisions regarding data placement and use of resources.

  • MWT2:
    • last week(s): Consultation with dCache team regarding cost function calculation and load balancing among gridftp doors in latest dCache release. LFC ACL incident; fixed. Procurement proceeding.
    • this week:

  • SWT2 (UTA):
    • last week: LFC upgraded; fixed BDII issue; applied SS update; SRM restart to fix reporting bug; purchases coming in.
    • this week:

  • SWT2 (OU):
    • last week: Looking at a network bandwidth asymmetry. 80 TB being purchased; ~200 cores. Another 100 TB also on order.
    • this week:

  • WT2:
    • last week(s): completed LFC migration; 160 TB usable as of last Friday; 160 TB in January.
    • this week:

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server. (A minimal checksum sketch follows this section.)
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
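  • A minimal sketch of the checksum discussed above: a streaming Adler32 over a file using Python's standard zlib module. This only illustrates the algorithm and is not Alex's xrootd-side implementation:

      import sys
      import zlib

      def adler32_of_file(path, chunk_size=1024 * 1024):
          """Adler32 checksum of a file as an 8-digit hex string, computed in chunks."""
          value = 1  # zlib's Adler32 starts from 1
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(chunk_size), b""):
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xFFFFFFFF)

      if __name__ == "__main__":
          print(adler32_of_file(sys.argv[1]))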

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented and is just finishing the test cases [Pedro] (see the hypothetical skeleton below).
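    • A hypothetical skeleton of such an lsm-get wrapper, assuming only that it takes a source and a destination path and reports success through its exit code; the real interface (arguments, error codes, checksum handling) is defined by the LocalSiteMover specification above, not here:

      import shutil
      import sys

      def main(argv):
          if len(argv) != 3:
              sys.stderr.write("usage: lsm-get <source> <destination>\n")
              return 2
          source, destination = argv[1], argv[2]
          try:
              shutil.copy(source, destination)  # a real mover would call dccp/xrdcp/etc.
          except (IOError, OSError) as err:
              sys.stderr.write("lsm-get failed: %s\n" % err)
              return 1
          return 0

      if __name__ == "__main__":
          sys.exit(main(sys.argv))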

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, there are some validation jobs still running and some problems to solve. If anyone wants to migrate, go ahead, but we're not pushing right now. Want to have plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done -- anywhere between 2 and 7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 -- should they not do this? Probably okay (Michael); it would provide some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed.
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystems); this requires OSG 1.0.4 or later. Note: not important for WLCG, as it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not yet seen a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM; it should be fixed, but please check.
    • Reporting comes from two sources: OIM and the GIP at the sites.
    • Here is a snapshot of the most recent report for ATLAS sites:
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation, refer to the installed capacity document.
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
       * %Diff: % difference between the calculated value and the UL/LL (a sketch of this calculation appears after this section's notes)
              -ve %Diff value: Calculated value < Lower limit
              +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in config ini file?
  • this meeting
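  • A sketch of the %Diff column in the snapshot above, inferred from the table itself (the authoritative definition is in the installed capacity document): the difference between the calculated capacity and the violated OIM limit, expressed as a percentage of the calculated value:

      def pct_diff(calculated, lower_limit, upper_limit):
          """%Diff as it appears in the report: positive above UL, negative below LL."""
          if calculated > upper_limit:
              reference = upper_limit
          elif calculated < lower_limit:
              reference = lower_limit
          else:
              return 0
          if calculated == 0:
              return -100  # nothing reported at all
          return int(100.0 * (calculated - reference) / calculated)

      # Rows reproduced from the snapshot above:
      print(pct_diff(5150, 4677, 4677))      # AGLT2 ICC       ->   9
      print(pct_diff(1615, 1910, 1910))      # BU_ATLAS_Tier2  -> -18
      print(pct_diff(511, 400000, 400000))   # BU OSC          -> -78177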



-- RobertGardner - 08 Dec 2009
