r6 - 06 Jan 2010 - 14:20:38 - RobertGardner



Minutes of the Facilities Integration Program meeting, January 6, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Michael, Justin, Rob, Shawn, Charles, Aaron, Jason, Jim, Mark, Nurcan, Wei, Patrick, Rik, Doug, John DeStefano, Horst, Karthik, Xin, John B, Saul, Wensheng
  • Apologies: Kaushik

Integration program update (Rob, Michael)

  • SiteCertificationP12 - FY10Q2
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Opportunistic storage for Dzero - from the OSG production call. They want it at more OSG-US ATLAS sites than are used today, asking for 0.5 to 1 TB. We're not sure what configuring this entails. A request is also coming from Brian Bockelman, but with few details. There are, of course, authorization and authentication issues that would need to be configured. Mark: UTA has given 10-15 slots on the old cluster with little impact. We need someone from US ATLAS leading the effort to support D0, working through the configuration issues, etc. Mark will follow up with Joel.
      • LHC shutdown a few hours ago - no more operations until February.
      • Operations call this morning - reprocessing operations from December 22, but they will not be using the Tier 2s.
      • Interventions should be completed within the January timeframe.
    • this week
      • Quarterly reports due!
      • The machine shutdown ends in mid-February
      • Our readiness - need site reports below; site upgrades (SL5 and OSG 1.2) must be completed by the end of the month
      • Another round of reprocessing coming up, analysis has been ramping up

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Storage element subscription to Tier 3 completed. Hiro: it's working.
    • SE problem at ANL - runaway processes - will monitor
    • Panda submission to Tier 3's - Torre and Doug were going to work on this.
    • T3-OSG meeting - security issues discussed. Have some preliminary ideas.
    • Hiro: T3 cleanup - there is a program being developed to ship a dump of T3-LFC to each T3 - Charles will work on ccc.py to use this dump (sql database)
    • Justin: subscriptions working fine at SMU
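The dump-based cleanup Charles is adapting ccc.py for amounts to a set comparison between catalog entries and files actually on storage. The sketch below is an illustration only: the `replicas` table, the `sfn` column, and the file paths are assumptions made for the demo, not the actual LFC dump schema or the ccc.py code.

```python
import sqlite3

def find_inconsistencies(dump_path, storage_files):
    """Compare an LFC dump (an sqlite file with an assumed 'replicas'
    table) against a listing of files actually on storage.

    Returns (dark, missing): dark data = on disk but not in the catalog;
    missing = in the catalog but not on disk."""
    conn = sqlite3.connect(dump_path)
    catalog = {row[0] for row in conn.execute("SELECT sfn FROM replicas")}
    conn.close()
    on_disk = set(storage_files)
    return sorted(on_disk - catalog), sorted(catalog - on_disk)

# Small demo with a toy dump (paths are illustrative only).
demo = "/tmp/t3_lfc_dump.db"
conn = sqlite3.connect(demo)
conn.execute("CREATE TABLE IF NOT EXISTS replicas (sfn TEXT PRIMARY KEY)")
conn.execute("DELETE FROM replicas")
conn.executemany("INSERT INTO replicas VALUES (?)",
                 [("/data/f1.root",), ("/data/f2.root",)])
conn.commit()
conn.close()

dark, missing = find_inconsistencies(demo, ["/data/f1.root", "/data/f3.root"])
print(dark, missing)  # f3 is dark (uncataloged); f2 is missing from disk
```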
  • this week:
    • there is a VM cluster at ANL which is in a good configuration to use as a reference pattern for small sites. Now testing user functionality - dq2, pathena, condor submits; need to test the instructions
    • More hardware arrived at ANL - will be setting up a real Tier 3 model cluster
    • Progress on Panda; Pacballs for SL5
    • Panda - Torre working on it w/ Tadashi and others
    • Pacballs - new releases in native SLC5; old releases are still in SLC4, older version of gcc - will still need compat libs.
    • Tier3 data deletion - new program from Hiro
    • Still no definitive answer for gridftp alone versus srm+gridftp (Hiro will check)
    • Frontier+Squid - is there anything Tier 3 specific? Yes - SE and HOTDISK

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Run reprocessing at BNL only - not at Tier 2s - concern over reduced holiday coverage
    • Validation going on.
    • 900 GeV production going on. Some 7 TeV has started.
    • High memory tests going on.
    • Hopefully sufficient work to carry through the holiday
    • Shifters are off for Christmas and New Year's, eve and day
    • There will be another reprocessing campaign in mid-January; note these campaigns will be short ~ 1 week.
    • More queue filler tasks need to be defined.
  • this week:
    • reprocessing completed 12/31/09 - see summary from Alexei from today's shift summary
    • NET2 job failures due to missing input files - they were being centrally removed from PRODDISK(!); missing files replaced, production resumed.
    • SLAC job failures - pilot failing a curl command - update from Paul: not sure why, will put in error handling.
    • Large number of jobs submitted by Pavel - stuck in waiting but not sure why. Are there dependencies - an input dataset missing?

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting -- off for the holiday!
    1)  12/23:   ~2k 'holding' jobs at BNL were killed rather than wait for the usual 48-hour period required for them to become lost heartbeat jobs.  (This was done since the stuck jobs were from the reprocessing tasks.)
    2)  12/24: Reprocessing is almost completed -- from Alexei:
    100% of ESD production is done. TAG_COMM merging is practically done and merged datasets are subscribed to Tier-1 centers.
    Bulk ESD datasets (+ TAG_COMM) distribution within clouds will be started this week.
    DESD and AOD replication within clouds is going smoothly.
    3)  12/24: Job failures at AGLT2 due to missing file DBRelease-7.9.1.  File was re-subscribed -- issue resolved.
    4)  12/25: Job failures at AGLT2 due to missing cache -- Xin installed the s/w, issue resolved.  RT 14957.
    5)  12/25-29: Job failures at UTD-HEP due to missing releases 15.5.4 and 15.6.1.  Updates were needed in schedconfigdb (thanks Alden) to fix some issues with the s/w installation jobs.   Test jobs succeeded, site back to 'online'.  RT 14959.
    6)   12/27-28: Job failures at OC_OCHEP_SWT2 with the error:
    27 Dec 2009 20:42:46| !!WARNING!!2999!! Could not create dir: /ibrix/data/dq2-cache/mc09_900GeV.108314.pythia_sdiff_Perugia0.merge.HITS.e504_s655_s657_tid104092_00_sub04733071, [Errno 28] No space left on device.
    (Number of sub-directories reached an ibrix limit.)  Disk clean-up done, issue resolved.  RT 14963.
    7)  12/28:  BU_ATLAS_Tier2o -- Job failures with the error "No space left on device."  Disk clean-up performed -- issue resolved.  eLog 8317.
    8)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: \'curl.config\'
    Under investigation.
    Follow-ups from earlier reports:
    i)  12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VO's -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error " Pilot has decided to kill looping job."  
    Jobs were seemingly "stuck" on the WN's -- problem still under investigation.
    ii)  12/19: Discussions about the best way to submit / track change requests for schedconfigdb (Alden, others).  New e-mail address: schedconfig@gmail.com
    iii)  12/21: BNL -- US ATLAS conditions oracle cluster db maintenance:
    RAM in the cluster nodes will be upgraded from 16GB to 32GB.  The intervention will be done in rolling fashion (one node at a time); no database service interruption is expected during this maintenance.  Note: now scheduled for 1/06/10.
    iv)  12/22: UTA_SWT2 -- Maintenance outage (SL5, many other s/w upgrades) is completed.  atlas s/w releases are being re-installed by Xin (this was necessary since the old storage was replaced).  Test jobs have finished successfully -- will resume production once the atlas releases are ready.
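The ibrix failure in item 6 above came from the number of subdirectories in the dq2 cache hitting a filesystem limit, surfacing as "No space left on device". A quick way to watch for this is to count the immediate subdirectories of the cache directory; the path below is a demo stand-in, not the actual ibrix mount.

```shell
# Count immediate subdirectories of a dq2 cache directory.
# /tmp/dq2-cache-demo is a stand-in path for illustration only.
cache=/tmp/dq2-cache-demo
mkdir -p "$cache"/sub_{1..5}
count=$(find "$cache" -mindepth 1 -maxdepth 1 -type d | wc -l)
echo "subdirectories: $count"
```

Running this periodically (e.g. from cron) and alerting when the count approaches the filesystem's per-directory limit would catch the condition before jobs start failing.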

  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  12/31: Final summary of the reprocessing project (from Alexei):
    All steps of December09 reprocessing campaign are finished today.
    ESD production was finished a few days ago. AOD and DESD production,
    Histogram and Ntuple merging were run successfully afterward.
    A corrected version provided by the software team yesterday made it
    possible to merge the TAG_COMM files in due time.
    ALL outputs are subscribed to Tiers of ATLAS according
    to the requested pattern. (More than 95% of data are already replicated,
    it is close to 100% for Tier-1s.)
    We are happy to announce that reprocessing was fully done before end
    of 2009, as it was requested by the Reprocessing Coordinator.
    2)  1/4: From Sarah at MWT2, in response to transfer error alerts:
    One of our pools was down, due to running low on memory. I've boosted the memory allocation and the pool is back up. We should see these files transfer successfully soon.
    Follow-up comment from Shawn:
    We have seen this "locality is UNAVAILABLE" at AGLT2 as well. Seems to be new in the sense that before running 1.9.5 I don't recall having these messages.
    3)  1/4:  NET2 -- jobs were failing with errors about missing input files.  This was tracked down to the fact that central deletions were removing files from the PRODDISK space token.  Hiro requested that this action be stopped.  Missing files were replaced, site back to 'online'.  See RT 14984, eLog 8422.
    4)  1/4:  schedconfigdb misconfiguration for the test site ANALY_MWT2_X was fixed by Alden.  (This resolved an issue where the site was not receiving pilots.)
    5)   1/5:   MWT2 -- maintenance outage -- from Sarah:
    We're performing network maintenance this morning on the switch that supports MWT2_IU.  The maintenance will occur between 8am and 12pm, and is expected to last 30 minutes.
    6)  1/6:  Maintenance outage at SLAC to inspect a failed fan on a storage box.  13:00-17:00 PST / 21:00-1:00 UTC.
    7)  1/6: BNL -- Conditions db maintenance completed:
    The BNL US ATLAS Conditions Database maintenance has been successfully done. OS and Database memory configuration in the cluster nodes have been adjusted to the new memory available.  No service interruption observed during this intervention.
    8)  1/6:  AGLT2 -- transfer errors like:
    From CERN-PROD_DATATAPE to AGLT2_CALIBDISK is failing at high rate: Fail(88.0)/Success(0.0)
    number of errors with following message: 88
    Error message from FTS: [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]
    Resolved -- from Shawn:
    I found the dCache srm service stopped on head01.  There was a 'pg_dump' which had been running for 55 cpu minutes.  I restarted postgres and all dCache services on head01. SRM is again operational.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: \'curl.config\'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    ii)  12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VO's -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error " Pilot has decided to kill looping job."  
    Jobs were seemingly "stuck" on the WN's -- problem still under investigation.
    iii)  12/19: Discussions about the best way to submit / track change requests for schedconfigdb (Alden, others).  New e-mail address: schedconfig@gmail.com
    iv)  12/22: UTA_SWT2 -- Maintenance outage (SL5, many other s/w upgrades) is completed.  atlas s/w releases are being re-installed by Xin (this was necessary since the old storage was replaced).  Test jobs have finished successfully -- will resume production once the atlas releases are ready.  
    1/6:  Investigating one final issue with the s/w installation pilots.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • User activity has already slowed down this week. The expected jump from real data arriving didn't materialize. Next three weeks should be clear for upgrades.
    • No major problems with data access (yet). Sometimes a release isn't installed at the site - why was the job scheduled there? Should we put release matching in?
    • Problems accessing the conditions database. Rod has been responding to some of them. Recent releases seem to solve the problems.
    • User support during the break - mostly one shifter on duty. Next week all shifts are in the North American timezone. The following week it will be mainly an EU-zone person, with a NA-zone person only for Thursday-Friday.
  • this meeting:
    • There has been some real activity during the holiday break compared to last year.
    • There was one person on shift which was sufficient. Now things are ramping back up.
    • No major issues

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • looks okay overall - very efficient for the last week.
    • there was a bug in the pilot code that would register the file incorrectly in the LFC - expect to be fixed in an update. More critical for T3s.
    • discussing w/ ddm developers speeding up call-backs
    • FTS checksum checking - testing version 2.2 at BNL; needs version 2.2.2, not 2.2.0. Still waiting for the production version to arrive. Will postpone the throughput test for checksums until that is done.
    • Should start monitoring SAM tests. We need to get a hold of the ATLAS availability calculation.
    • There is a package required in the OSG software - Michael is working with Alessandro De G to
    • Should we upgrade DQ2 site services for Tier 2s? NE, MW, SW. Hiro will update the DQ2 twiki.
  • this meeting:
    • The new tool from Hiro can potentially delete data at T3s
    • We need to formulate a sensible dataset deletion policy for T3s with regards to obsolete datasets
    • Would this create a mess in the central catalogs?
    • Action item: Rik, Doug, Hiro, Rob, Charles, (Michael & Jim as observers) to formulate a position for US ATLAS
    • Central PRODDISK deletion, as noted above, has been stopped. Resolved.
    • GROUPDISK needs a tweak in each DQ2 SS. Hiro will send out an email.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • Each T2 must test against all other T2s
    • Check spreadsheet for correctness
    • Asymmetries between certain pairs - need more data
    • Will start a transaction-type test (large number of small files; checksumming needed)
  • this week:
    • Jan 12 next meeting - will start bi-weekly.

Site news and issues (all sites)

  • T1:
    • last week(s): One of the production panda sites is being used for high-luminosity pileup, high-memory jobs (3 GB/core). Stability issues with Thor/Thumpers - some problems with high packet rates on link-aggregated NICs. 2 PB disk purchase on-going. Another Force10 switch with 60 Gb/s inter-switch links.
    • this week: Maintenance: network upgrades and installation of new 700 TB NEXAN disk behind a Thor server (FC connected); will have 4.5 PB (usable).

  • AGLT2:
    • last week: Running well - an issue where dccp copies seemed to hang - had to reboot the dCache head node. Would like to do some storage node upgrades Friday. Trying out a Rocks 5 build for updating nodes.
    • this week: all WNs running SL5; have been running OSG 1.2 for a while. Transition to new network address space. Will install a new blade chassis at UM. dCache upgraded to 1.9.5-11. Running well.

  • NET2:
    • last week(s): working with local users so they can access POOL conditions data at HU. Separate install queue for software kits.
    • this week: Recovered from the data deletion problem. Shutdown in the near future - in a week or so. LFC to be upgraded; all WNs already at RH5. GK and interactive node to be upgraded to RH5 and new OSG 1.2. Facility upgrade: getting two racks of Dell Nehalems at HU Odyssey, one rack of storage at BU.

  • MWT2:
    • last week(s): Updated Myricom drivers to troubleshoot.
    • this week: Downtime tomorrow for SL5 on dCache head nodes; next week there will be more downtimes.

  • SWT2 (UTA):
    • last week: Major upgrade at UTA_SWT2 - replaced storage system, new compute nodes - all in place. Reinstalling OSG. SRM, xroot all up. Hopefully up in a day or two.
    • this week: still working on the topics above. Note: Dell R610 servers are sometimes dropping the NIC - a kernel driver problem in the default SL 5.3 (Broadcom NIC, bnx2 driver); Justin has a solution. Next: will do the CPB upgrade. 200 TB of storage waiting to be installed. Ordered 400 TB of storage, hopefully arriving mid-February.

  • SWT2 (OU):
    • last week: Equipment for storage is being delivered. Will be taking a big downtime to do upgrades, OSG 1.2, SL 5, etc.
    • this week: Equipment delivered. 80 TB SATA drive order for DDN. Final quote for additional nodes expected in the next week.

  • WT2:
    • last week(s): All is well. Some SL5 migration still going on. Suddenly a number of older machines from BaBar have become available. Working on an xrootd-Solaris bug.
    • this week: RH5 done. OSG 1.2 finished. WLCG-client installation and validation. Two outages: this PM, a failing fan on a storage node; probably next week as well, for storage system rearrangements. Also will make scratch and localuser groups available. Will probably delete some old releases - can we delete release 13? Will put in a new NFS server for these releases.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
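On-the-fly checksumming of the kind Alex is developing can be sketched in Python: the Adler32 value is updated incrementally as each chunk streams through, so no second pass over the file is needed before reporting the checksum to FTS. This is an illustration of the technique, not the xrootd implementation; the function name is made up.

```python
import zlib

def adler32_stream(chunks):
    """Compute an Adler32 checksum incrementally over data chunks,
    as a gridftp/xrootd server could while a file streams through."""
    checksum = 1  # Adler32 seed value
    for chunk in chunks:
        checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

data = b"example payload" * 1000
chunked = [data[i:i + 4096] for i in range(0, len(data), 4096)]
# Streaming over chunks matches a one-shot checksum of the whole buffer.
assert adler32_stream(chunked) == (zlib.adler32(data) & 0xFFFFFFFF)
print(f"{adler32_stream(chunked):08x}")
```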
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has a lsm-get implemented and they're just finishing implementing test cases [Pedro]

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, some validation jobs are still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but we're not pushing right now. Want to have plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for POSIX-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP from the sites
    • Here is a snapshot of the most recent report for ATLAS sites:
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in config ini file?
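The %Diff column in the snapshot above appears to be the gap between the calculated value and the violated limit, expressed as a percentage of the calculated value and truncated toward zero. The formula below is inferred from the table rows, not taken from the installed-capacity document, so treat it as a reading of the report rather than its definition.

```python
def pct_diff(calc, lower, upper):
    """%Diff as it appears in the report: the gap between the
    calculated value and the violated limit, as a percent of the
    calculated value, truncated toward zero. Zero when within limits."""
    if calc < lower:
        # Degenerate case seen in the table: calculated value of 0
        # against a nonzero lower limit is reported as -100.
        return int((calc - lower) / calc * 100) if calc else -100
    if calc > upper:
        return int((calc - upper) / calc * 100)
    return 0

# Rows from the snapshot above:
print(pct_diff(5150, 4677, 4677))     # AGLT2 ICC -> 9
print(pct_diff(1615, 1910, 1910))     # BU ICC    -> -18
print(pct_diff(511, 400000, 400000))  # BU OSC    -> -78177
```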
  • this meeting


  • last week
  • this week

-- RobertGardner - 05 Jan 2010
