r3 - 23 Dec 2009 - 12:49:18 - MarkSosebee



Minutes of the Facilities Integration Program meeting, Dec 23, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees:
  • Apologies: Rob, Horst, Bob, Shawn, Charles, Aaron, Fred, Sarah, Nate

Integration program update (Rob, Michael)

  • SiteCertificationP11 - FY10Q1
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Opportunistic storage for Dzero - from the OSG production call. They want it at more OSG-US ATLAS sites than use it today, and are asking for 0.5 to 1 TB. We're not sure what configuration this entails. The request is also coming from Brian Bockelman, but with few details. There are of course authorization and authentication issues that would need to be configured. Mark: UTA has given 10-15 slots on the old cluster with little impact. We need someone from US ATLAS leading the effort to support D0, working through the configuration issues, etc. Mark will follow up with Joel.
      • LHC shut down a few hours ago - no more operations until February.
      • Operations call this morning - reprocessing operations began December 22, but will not use the Tier 2s.
      • Interventions should be completed within the January timeframe.
    • this week

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Storage element subscription to Tier 3 completed. Hiro: it's working.
    • SE problem at ANL - runaway processes - will monitor
    • Panda submission to Tier 3's - Torre and Doug were going to work on this.
    • T3-OSG meeting - security issues discussed. Have some preliminary ideas.
    • Hiro: T3 cleanup - a program is being developed to ship a dump of the T3 LFC to each T3 - Charles will work on ccc.py to use this dump (an SQL database)
    • Justin: subscriptions working fine at SMU
  • this week:
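The T3-LFC dump approach mentioned above (ccc.py consuming an SQL dump of the catalog to drive cleanup) might look roughly like the sketch below. The table and column names are hypothetical - the real dump schema is not given in these minutes:

```python
import sqlite3

def find_dark_and_ghost(dump_path, storage_files):
    """Compare an LFC dump (assumed here to be a SQLite file with a
    hypothetical 'replicas(sfn)' table) against a local storage listing.
    Files on disk but not in the catalog are 'dark' (cleanup candidates);
    catalog entries with no file on disk are 'ghosts'."""
    conn = sqlite3.connect(dump_path)
    catalog = {row[0] for row in conn.execute("SELECT sfn FROM replicas")}
    conn.close()
    on_disk = set(storage_files)
    dark = on_disk - catalog    # on storage, unknown to the catalog
    ghosts = catalog - on_disk  # registered, but missing from storage
    return dark, ghosts
```

A site could feed this the shipped dump plus a `find`-style listing of its storage area and act on the two sets separately.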

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  12/9: Panda server modified to use new db accounts.  Temporarily created a problem with attempts to modify the status of sites via the usual 'curl' interface.  Fixed by Graeme.
    2)  12/9: Some sites noted an increase in the number of pilots waiting in their queues.  Possibly due to (from Torre):
     The autopilot setup on voatlas60 is the same as it's been for a couple of weeks, but condor there has been tuned up and it seems it is now more effective at getting pilots to the US queues. The motivation for CERN submissions is to have a centrally managed submit point for everyone that provides redundancy for regional submission, so I think we should adapt ourselves to pilots coming from CERN as well as BNL.
     Whatever the nqueue setting for a queue is, each submitter will maintain that nqueue independently, so two equally successful submitters will result in ~double the pilot flow. Hence I would suggest reducing nqueue (not necessarily by a factor of 2) such that pilot flow is reasonable again.
    3)  12/10: Job failures at MWT2_IU & IU_OSG with stage-in/out errors -- from Charles:
    dCache service at IU was interrupted for some maintenance which took longer than expected. We're back online now. If job recovery is enabled for MWT2_IU (which I believe is the case) these output files should be recoverable.  RT 14890, eLog 7892.
    4)  12/10 p.m. - 12/11 a.m.: A couple of storage server outages at BNL -- resolved.  eLog 7910.
    5)  12/15: Pilot update from Paul (v41c):
    * Local site mover is now using --guid option. Requested by Charles.
     * Correction for the appdir used by CERN-UNVALID, since the previous pilot version caused problems there (pilot v40b was used until now). $SITEROOT was used to build the path to the release instead of schedconfig.appdir. CERN-PROD and CERN-RELEASE were not affected since $SITEROOT and appdir both point to the .../release area. 
    * Pilot options -g  and -m  can now be used to specify locations and destinations of input and output files in combination with mv site mover (compatible with Nordugrid). Requested by Predrag Buncic for CERNVM project.
    * Empty copyprefix substrings replaced with dummy value. Initially caused problems at UTD-HEP due to misconfiguration in schedconfig.
    * STATUSCODE file now created in all getJob scenarios. Requested by Peter Love.
    * Value of ATLAS_POOLCOND_PATH dumped in pilot log. Requested by Rod.
    * The xrdcp site mover (written by Eric for use at ANALY_LYON) has been updated to also work at ANALY_CERN.
    * Note: There will be at least one more minor pilot release before Christmas.
    6)  12/14: From Bob at AGLT2:
    At approximately 4:50am EST today, cluster activity at AGLT2 began to ramp down.  We discovered processes were hung on dCache admin nodes and probably on a few disk servers as well.  At 10:35am cluster activity resumed to normal after services were restarted.  
    We expect this will throw errors in running jobs during this time period.
    7)  12/15: Jobs failures at OU with stage-in errors.  Coincided with a pilot update, which exposed some needed updates to schedconfigdb entries for the site.  Alden made the updates to schedconfigdb, Paul is working on a modification to the pilot which should be ready in the next day or so.  
    Site set to 'off-line'.  RT #14912.  
    12/16 a.m. -- problem now appears to be solved, OU set back to 'on-line'.
    Follow-ups from earlier reports:
    (i) BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.
    (ii) BNL -- cyber-security port scans, originally scheduled for December 2/3, have been rescheduled for December 21/22.
    • excessive pilots observed at some sites. there is a second pilot submitter instance. Look at nqueue setting, may need to be tweaked down.
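Torre's arithmetic above - k equally successful submitters each maintaining nqueue pilots gives roughly k times the pilot flow - can be sketched as a small helper. The function name and the exact reduction policy are illustrative only, not part of any actual autopilot tool:

```python
def suggested_nqueue(current_nqueue, n_submitters, target_flow=None):
    """With n_submitters independent autopilot instances each maintaining
    'nqueue' pilots, the effective flow is roughly n_submitters * nqueue.
    To restore roughly the old single-submitter flow, divide by the
    submitter count (Torre notes the reduction need not be a full
    factor of 2). Illustrative sketch only."""
    if target_flow is None:
        target_flow = current_nqueue  # keep the old single-submitter level
    return max(1, target_flow // n_submitters)

# With BNL and CERN both submitting, a queue tuned for nqueue=40
# would be cut back to 20 to keep the total pilot flow comparable.
```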
  • this meeting:
     Yuri's summary from the weekly ADCoS meeting:
    1)  12/16-17: AGLT2 -- file transfer errors -- "locality is UNAVAILABLE" -- resolution (from Shawn):
     UMFS07.AGLT2.ORG (hosting dCache pools for DATADISK, CALIBDISK and PRODDISK) again had problems.   This was traced to a combination of an old driver and newer firmware.   The driver and kernel were updated and the system was rebooted.   For the last 1.5 hours SRM shows no errors.  Things seem to be working, so I am closing this ticket.  eLog 8128, RT 14916.
    2)  12/17-18: AGLT2 maintenance outage -- initially an issue with dCache after re-starting -- from Shawn:
    We found that jobs using 'dccp' at our site were failing after coming back online from our upgrade. We checked the systems and the /pnfs area was seemingly mounted correctly on the affected nodes but the 'dccp' copy command would fail like:
    dccp /pnfs/aglt2.org/atlashotdisk/ddo/DBRelease/ddo.000001.Atlas.Ideal.DBRelease.v070302/DBRelease-7.3.2.tar.gz /tmp/test.db
    Failed to open config file /pnfs/aglt2.org/atlashotdisk/ddo/DBRelease/ddo.000001.Atlas.Ideal.DBRelease.v070302/.(config)(dCache)/dcache.conf
    Failed to create a control line
    Failed open file in the dCache.
    Can't open source file : Can not open config file System error: No such file or directory
    Other nodes would work correctly with the same command.
    To fix the issue we found we had to 'umount /pnfs' and then 'mount /pnfs' to restore proper functioning. Note that prior to the remount the /pnfs mount seemed to be OK (you could do 'ls /pnfs/aglt2.org' for example) but 'dccp' would fail as above.
    This must have had something to do with the reboot of the /pnfs headnode creating some kind of "stale" mount during our upgrade today. The unusual thing is that the /pnfs headnode was rebooted before we rebuilt/upgraded our new worker nodes. eLog 8166.
     3)  12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VO's -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error "Pilot has decided to kill looping job."  
     Jobs were seemingly "stuck" on the WN's -- problem still under investigation.
    4)  12/19: MWT2_IU -- ~65 failed jobs with stage-in/out errors -- quickly resolved -- from Sarah:
     We had one missing file, an 'LFC ghost'. I've manually fetched it from BNL. We should see these errors stop.
     5)  12/19: Discussions about the best way to submit / track change requests for schedconfigdb (Alden, others).  New e-mail address: schedconfig@gmail.com
    6)  12/20-21: UTD-HEP -- scheduled power outage -- site took this opportunity to upgrade their bestman s/w -- test jobs completed successfully, back to 'online'.
    7)  12/21: BNL -- US ATLAS conditions oracle cluster db maintenance:
     RAM in the cluster nodes will be upgraded from 16GB to 32GB.  The intervention will be done in a rolling fashion (one node at a time); no database service interruption is expected during this maintenance.
    8)  12/21-22: BNL -- cyber-security port scanning.  Comment from Hiro:
     As noticed by several people this morning, many jobs failed due to an error caused by the LFC (at 11:09 AM to be exact). This seems to have been caused by the scheduled Nessus security scan. Although the outage was very brief (about 30 seconds), the persistent connections from clients to the LFC services were lost during that time, so jobs that happened to have connections (or were trying to make connections) to the LFC at that moment lost their connection to the BNL LFC and failed. The LFC itself is working fine after this brief outage.
     9)  12/22: UTA_SWT2 -- Maintenance outage (SL5, many other s/w upgrades) is completed.  ATLAS s/w releases are being re-installed by Xin (this was necessary since the old storage was replaced).  Test jobs have finished successfully -- production will resume once the ATLAS releases are ready.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • User activity has already slowed down this week. The jump expected with real data arriving didn't materialize. The next three weeks should be clear for upgrades.
    • No major problems with data access (yet). Sometimes the release isn't installed at the site - why was the job scheduled there? Should we put release matching in?
    • Problems accessing the conditions database. Rod has been responding to some of them. Recent releases seem to solve the problems.
    • User support during the break - mostly one shifter on duty. Next week all north american timezone shifts. Following week it will be a EU-zone person mainly with a NA-zone person only for Thursday-Friday.
  • this meeting:

DDM Operations (Hiro)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • Each T2 must test against all other T2s
    • Check spreadsheet for correctness
    • Asymmetries between certain pairs - need more data
    • Will start a transaction-type test (large number of small files; checksumming needed)
  • this week:
    • Jan 12 next meeting - will start bi-weekly.

Site news and issues (all sites)

  • T1:
    • last week(s): One of the production Panda sites is being used for high-luminosity pileup, high-memory jobs (3 GB/core). Stability issues with Thor/Thumpers - some problems with high packet rates on link-aggregated NICs. 2 PB disk purchase on-going. Another Force10 switch with 60 GB inter-switch links.
    • this week:

  • AGLT2:
    • last week: Running well - an issue where dccp copies seemed to hang - had to reboot the dCache headnode. Would like to do some upgrades of storage nodes Friday. Trying out a Rocks 5 build for updating nodes.
    • this week: I have edited the SiteCertificationP11 table and set our SL5 column green. We are idling the last of our SLC4.8 worker nodes overnight and will rebuild them into SL5 tomorrow, completing the transition. The bulk of our purchase this quarter is now up and running as well, but I've left the FabricUpgrade indicator blue as we still have one full blade chassis we cannot bring up until January, when we complete our IP address migration. The FabricUpgrade information was updated to reflect the additions.

  • NET2:
    • last week(s): working with local users so they can access pool conditions data at HU. Separate install queue for software kits.
    • this week:

  • MWT2:
    • last week(s): Updating Myricom drivers to troubleshoot.
    • this week: Upgraded MWT2_UC to dCache 1.9.5-11. Order for ~1 PB usable storage placed with Dell. IBM would like to discuss the possibility of a pricing matrix with US ATLAS on January 7.

  • SWT2 (UTA):
    • last week: Major upgrade at UTA_SWT2 - replaced the storage system, added new compute nodes - all in place. Reinstalling OSG. SRM and xrootd are up. Hopefully back online in a day or two.
    • this week:

  • SWT2 (OU):
    • last week: Equipment for storage is being delivered. Will be taking a big downtime to do upgrades, OSG 1.2, SL 5, etc.
    • this week:

  • WT2:
    • last week(s): All is well. Some SL5 migration still going on. Suddenly a number of older machines from BaBar have become available. Working on the xrootd-Solaris bug.
    • this week:

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding Adler32 checksumming to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
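The "on the fly" Adler32 calculation discussed above can be illustrated with a minimal Python sketch using `zlib.adler32`'s running-checksum argument: the checksum is updated chunk by chunk as data streams through, with no need to buffer the whole file. This is only an illustration of the streaming idea, not Alex's actual xrootd code:

```python
import zlib

def adler32_stream(chunks):
    """Compute an Adler32 checksum incrementally over a stream of byte
    chunks -- the same checksum a one-shot pass over the whole file
    would give, but suitable for computing while data flows through
    (e.g. inside a transfer server). Illustrative sketch only."""
    checksum = 1  # Adler32's defined initial value
    for chunk in chunks:
        checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF  # force an unsigned 32-bit result
```

Streaming over the chunks yields the same value as checksumming the concatenated bytes in one call, which is what makes the on-the-fly approach safe to hand to the gridftp server.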

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented and they're just finishing implementing test cases [Pedro]
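The minutes don't reproduce the lsm-get interface itself. As a rough illustration of the Local Site Mover idea (copy a file from site storage to the worker node and report success or failure via a return code), here is a hypothetical minimal sketch - the real BNL implementation, its arguments, and its error codes will differ (see the LocalSiteMover specification):

```python
import os
import shutil

def lsm_get(source, destination):
    """Hypothetical minimal lsm-get: copy a file to the worker node and
    verify its size, returning 0 on success and a nonzero code on
    failure. The actual interface and error codes are site-defined."""
    try:
        shutil.copyfile(source, destination)
    except OSError:
        return 1  # transfer failed (source missing, permissions, ...)
    if os.path.getsize(source) != os.path.getsize(destination):
        return 2  # size mismatch after copy
    return 0      # success
```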

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay (Michael). Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for POSIX-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP from the sites
    • Here is a snapshot of the most recent report for ATLAS sites:
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in config ini file?
  • this meeting
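The %Diff column appears to be the percent difference, relative to the calculated value, against whichever limit is violated. The sketch below is an inference from the table rows above (it reproduces most of them exactly), not the report's actual code:

```python
def pct_diff(calculated, lower, upper):
    """Reconstruction of the report's %Diff column, inferred from the
    table values: negative when the calculated capacity falls below the
    lower limit, positive when it exceeds the upper limit, zero when it
    lies within [lower, upper]. Percentages are relative to the
    calculated value and truncated toward zero."""
    if calculated == 0:
        # Matches rows like MWT2_UC (ICC 0 vs non-zero limits -> -100)
        return -100 if lower > 0 else 0
    if calculated < lower:
        limit = lower   # -ve: calculated value below the lower limit
    elif calculated > upper:
        limit = upper   # +ve: calculated value above the upper limit
    else:
        return 0
    return int(100 * (calculated - limit) / calculated)
```

For example, AGLT2's ICC row gives pct_diff(5150, 4677, 4677) == 9 and BU's OSC row gives pct_diff(511, 400000, 400000) == -78177, matching the table.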



-- RobertGardner - 23 Dec 2009
