r3 - 27 Oct 2010 - 13:55:10 - RobertGardner



Minutes of the Facilities Integration Program meeting, Oct 27, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg


  • Meeting attendees: Nate, Charles, Michael, Torre, Fred, Tom, Dave, Alden, Mark, Kaushik, Sarah, John B, Saul, Bob, Shawn, Patrick, Karthik, Hiro, Horst, Doug, Armen, Wensheng
  • Apologies: none

Integration program update (Rob, Michael)

  • IntegrationPhase15 (new)
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Security contacts in OIM - this week's incident is a good reason for Tier 3 sites to have them. All sites need to apply the workaround asap. Another vulnerability came out yesterday as well - the rds module (CVE-2010-3904) - which needs to be blacklisted (does not require a reboot). It didn't come through OSG security, but was already known in EGI.
      • Thanks to Wei for hosting the face-to-face meeting last week.
      • Unforeseen technical LHC stop starting yesterday - to address injection issues.
      • December 7 is the planned shutdown date for the holiday break; a 9- or 11-week shutdown is under discussion. This would be a good time for intrusive maintenance activity.
      • Next Monday the next re-processing campaign will begin. Data needed for the winter conferences.
      • Attempt to coordinate downtimes so that multiple Tier 2's are not down at once. Use the tier2 mailing list.
    • this week
      • CVE-2010-3856, patch is available - email Sarah if you like a script to
      • rds module - new kernel module also available
      • Any last updates to facilities spreadsheet?
      • Michael - machine status update - still anticipate providing high intensity beams overnight. Tomorrow will mark the end of the p-p run. Next week it will switch to heavy ion.
      • Reprocessing has been delayed, tomorrow at the latest. Second phase to cover data in October, ADC can proceed immediately.
      • Some discussion about ESD and merged ESDs as input for AOD production. Seeing ~40 TB arriving overnight - majority are ESDs. (First pass reconstruction from recent runs.)
      • MC reprocessing will run everywhere, data reprocessing at the Tier 1.
      • 500M events was the original goal for MC - about 50% complete
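
For the rds workaround mentioned in the program notes above, a site could check whether the module is loaded and produce the blacklist entry. This is only a minimal Python sketch and not the script offered in the notes: the function names are hypothetical, and it assumes the common `modprobe.d` approach of mapping the module's install command to `/bin/true`.

```python
def rds_loaded(proc_modules_text):
    """Return True if a module named 'rds' appears in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] == "rds":
            return True
    return False


def blacklist_entry(module="rds"):
    """Return a modprobe.d line that turns any attempt to load the module into a no-op."""
    return "install %s /bin/true\n" % module
```

On a Linux worker node, `rds_loaded(open("/proc/modules").read())` reports the current state, and the string from `blacklist_entry()` could be written (as root) to a file under `/etc/modprobe.d/`; the exact file name is a site choice.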

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Brandeis installation progressing.
  • CVMFS support at CERN - question - we now have an important statement from ATLAS supporting this.
  • The statement supports R&D for CernVM, and CVMFS for Tier 3 and Tier 2.
  • CVMFS - infrastructure is in place for conditions data as well - close to being able to test.
  • Data management at Tier 3g - phone call coming up this afternoon w/ Wei and Andy.
  • More Tier 3's are coming online - Santa Cruz acquiring equipment.
  • Q: inconsistency between Tier 3 LFC and physical storage. And some T3 acting as sources.
  • Hiro has a program for LFC-local storage consistency.
  • dq2-get-FTS plugin - should be ready for next distribution.
  • dq2-client developer discussion - option into dq2-get to allow DDM convention for global namespace. Similar modification for dq2-ls.
  • Doug: can Hiro provide instructions to use plugin with current
this week:
  • Sites are receiving funds and placing orders with Dell
  • Brandeis is up - there are problems with Panda pilots; working out last bits of xrootd; Stony Brook is coming online.
  • T3 and DYNES project - there are requests due in about a month; are the T3's paying attention, or are they working on purchases, etc.
  • Work is proceeding on the federated xrootd
  • T3 documentation moved to twiki at CERN, public svn repo
  • Columbia T3?
  • Trip planned to work on xrootd and puppet configuration management

OSG extension proposal

  • officially submitted today
  • working meeting on larger proposal Nov 11,12

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Armen: no major issues - following some cleanup issues.
    • See table of group disk space requirements - next year it will double.
    • Content in the generic, legacy ATLASGROUPDISK - will be retired in the next 6 months. There are physics subdirectories within the space token. Note difference between DQ2 endpoints versus the space tokens created.
    • The accounting is done in DQ2.
  • this week:
    • Some clean-up of sites, have about 30% available
    • Waiting on central ops cleanup before doing more; expect campaign to last a month
    • Proddisk cleanup issue - Panda mover files are unregistered and sit at sites as dark data; can central ops handle this? Meanwhile Charles provided a script to recognize these and clean them up.
    • Hiro is running userdisk cleanup at sites, in progress
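
The dark-data problem above comes down to comparing what the catalog has registered against what is physically on disk. The following Python sketch illustrates that comparison only; it is not Charles's script or Hiro's consistency program, and the function name and the idea of feeding it two plain lists of paths are assumptions made for illustration.

```python
def compare_catalog_to_storage(catalog_paths, storage_paths):
    """Compare registered files with files found on disk.

    Returns (dark, lost):
      dark - on storage but not registered in the catalog (dark data)
      lost - registered in the catalog but missing from storage
    """
    catalog = set(catalog_paths)
    storage = set(storage_paths)
    dark = sorted(storage - catalog)
    lost = sorted(catalog - storage)
    return dark, lost
```

A cleanup would act only on the `dark` list, and ideally only on files older than some grace period, since a file that is still being written or registered can look dark for a short time.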

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting (this week from Alessandra Forti):
    1)  10/14 - 10/19: Savannah 74004 submitted due to backlog of jobs in the transferring state at SLAC.  Issue was SRM access problems.   All but two of the job outputs now transferred, and the missing ones re-ran elsewhere.  
    Savannah ticket closed, eLog 18448.
    2)  10/15: BNL - file transfer errors ("failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2").  From Michael: The problem was caused by an operator error. It's already solved.  ggus 63164 closed, eLog 18333.
    3)  10/16: BNL - file transfer errors.  Issue understood (high loads on database hosts).  ggus tickets 63175/77 closed, eLog 18368.
    4)  10/16 - 10/18: Very large number of failed jobs from various merge tasks with the error "createHolder: holder can't be done, no predefined storage found."  Tasks eventually aborted.  Savannah 74075.
    5)  10/17: Jobs from task 177589 were failing at HU_ATLAS_Tier2 & UTA_SWT2 with PFC errors like "poolToObject: caught error: FID "540D53BB-0DCE-DF11-B2D0-000423D2B9E8" is not existing in the catalog."  
    Problem at HU likely due to the timing of the PFC sync between BU and HU.  Issue of PFC creation at UTA_SWT2 being worked on - should be fixed this week (see (ii) below).
    6)  10/18: From Bob at AGLT2: The nfs server containing atlas releases (but not atlas home directories) crashed around 12:30pm.  Just power cycled around 1:55pm.  Waiting now for fallout from the crash.
    No spike in errors observed.
    Follow-ups from earlier reports:
    (i)  Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 10/14:  Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
    Update 10/20: Still trying to understand the brokerage problem.
    (ii)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients.
    (iii)  9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error."  ggus 62642 in-progress, eLog 17662. 
    (iv)  10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62750 in-progress, eLog 17803.
    (v)  10/13:   WISC - file transfer errors, apparently due to an expired host cert:
    /DC=org/DC=doegrids/OU=Services /CN=atlas03.cs.wisc.edu has expired.  ggus 63038 in-progress, eLog 18215.
    Update 10/14: host certificate updated - issue resolved.  ggus ticket closed.
    • Open question of job brokerage at OU_OSCER - perhaps need Tadashi's intervention
    • Many open issues have been closed in the past week
    • Large spike in job failures over the weekend - merge tasks failing badly
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS production voms proxy.  Hiro set the site off-line in DDM until the problem is resolved.  ggus 63319 in-progress, eLog 18494.
    2)  10/21: HU_ATLAS_Tier2 - job failures due to missing/incomplete ATLAS release 16.0.2.  Missing s/w installed, issue resolved.  https://savannah.cern.ch/bugs/index.php?74275 closed, eLog 18551.
    3)  10/21 - 22: SLAC maintenance outage - completed, back on-line as of ~4:45 p.m. CST Friday.  ggus tickets 63369 & 63372 were opened during this period, both subsequently closed; eLog 18550.  From Wei:
    It took much longer than we expect. WT2 is now back online to the status before the outage (with one failed disk). But at least we produced more error/warning logs in the effect to satisfy Oracle's disk warranty requirement.
    4)  10/23- 10/25: SWT2_CPB went off-line on Saturday due to a problem with the building generator-backed power feed to the cluster UPS.  Power was restored, but it was decided to use this outage to make a 
    planned change to the xrootd system.  Back on-line as of 11:00 p.m. on Monday.  eLog 18640.
    5)  10/24: MWT2_DATADISK - file transfer errors with "source file doesn't exist."  Issue understood - from Wensheng:
    This is a kind of race condition that happened. The dataset replica removal at MWT2_DATADISK was triggered for space purpose. There are multiple replicas available elsewhere.  Savannah 74358 closed, eLog 18618.
    6)  10/27: HU_ATLAS_Tier2 - job failures with lsm errors:
    "27 Oct 04:18:14|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed: time out after 5400 seconds."  ggus 63486 in-progress, eLog 18670.
    7)  10/27 early a.m.:  RT # 18441 was generated for SWT2_CPB_SE due to one or more RSV tests failing for a short period of time.  Issue understood - from Patrick: The addition of new storage to the SE required a restart of the SRM.   
    This seems to have occurred during the RSV tests, as later tests are passing.  Ticket closed.
    8)  10/27: Job failures at several U.S. sites due to missing atlas s/w release 16.0.2.  Issue understood - from Xin:
    SIT released a new version of the pacball for release 16.0.2, so I had to delete the existing 16.0.2 and re-install them. So far the base kit 16.0.2 has been re-installed, and cache is also available at most sites, 
    I just start the re-installation of cache, which should be done in a couple of hours.  ggus 63503 in-progress (but can probably be closed at this point), eLog 18678.
    Follow-ups from earlier reports:
    (i)  Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 10/14:  Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
    Update 10/20: Still trying to understand the brokerage problem.
    (ii)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients.
    Update from Patrick, 10/21: This issue has been resolved.  The POOLFileCatalog.xml file is now being generated correctly for the cluster and we have configured a squid instance to support Frontier access, when needed.  
    ggus & RT tickets closed.
    (iii)  9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error."  ggus 62642 in-progress, eLog 17662. 
    Update: as of 10/20 issue resolved, and the ggus ticket was closed.
    (iv)  10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62750 in-progress, eLog 17803.
    Update 10/22: ggus ticket closed by Doug B.  eLog 18539.
    • A quiet week in the US cloud
    • All carryover issues from last week resolved - good
    • Two new shifters added in US timezone this week
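
The round-robin DNS idea in Patrick's 10/4 update can be illustrated with a small simulation. This is a sketch under stated assumptions, not the UTA configuration: the class, the function, and the addresses are hypothetical, and it models a name server that rotates its address list on every query while each client simply uses the first address returned.

```python
from collections import Counter, deque


class RoundRobinName:
    """Toy model of one DNS name backed by several addresses."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def resolve(self):
        """Return the current address list, then rotate it for the next query."""
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


def simulate(name, n_clients):
    """Count how many clients land on each address when each takes the first answer."""
    return Counter(name.resolve()[0] for _ in range(n_clients))
```

The balancing is coarse, as the update says: it spreads lookups evenly, not load, and any caching of the answer on the client side skews the distribution further.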

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Meeting notes:
    • October 27 is the deployment date: all sites to be updated with the new release
    • Expect this to be out on Friday
    • Hiro is testing between Tier 1's
  • this week:
    • Meeting yesterday
      USATLAS Throughput Meeting - October 26, 2010
      Attending:  Shawn, Dave, Andy, Philippe, Sarah, Karthik, Hiro, John, Horst, Tom, Doug 
      Excused: Jason
      1) No updates for retesting OU - BNL path.   Karthik reported that as of the USATLAS facility meeting there was still poor throughput.   John will re-run BNL tests to various ESnet locations.    Dave reported that not too much progress.  The perfSONAR box was moved and then broke.  Need to get back to.  Karthik reports on tests during the call: OU->Kansas, OU->ESnet(BNL) gets 3 Mbps.  Unable to run reverse direction. Fixed reverse direction (problem in config at OU) (More details later in notes)
      2) perfSONAR status.   BNL and OU have CDs burned. Plan to install/upgrade soon.  Illinois has success in using the LiveCD.  Attempt to upgrade using net-install option not completely successful.  The perfSONAR MA won't start in this case.  Followed Jason's instructions (twice)...need to work with Jason on debugging.  MSU updated both nodes to v3.2 using LiveCD method; latency is OK, bandwidth has a service not running.  Philippe has email out about this problem.
      3) Monitoring - Nagios monitoring discussed. Tom gave an overview of the current situation and is willing to work with our group in defining further monitoring capabilities for the dashboard. The dashboard for perfSONAR seems very useful and should meet our needs for monitoring perfSONAR instances. Currently have SLAC, BU and OU instances down. Discussed possible extensions for Tom's Nagios dashboard. Tom can also add additional email notifications. If sites want to add additional responsible perfSONAR people they can send Tom the address(es).
      Hiro is working on gathering the perfSONAR data and augmenting it with additional tracking of the traceroute (forward and reverse) between sites.    
      Further testing info.   John gets ~1 Gbps BNL-Chicago while only 16-200 Mbps BNL-Kansas City.  Traceroute shows the path to OU includes both Chicago and Kansas City.  Karthik's tests from bnl-pt1.es.net to Kansas City got 4.5 Gbps and 4.2 Gbps Kansas to bnl-pt1.es.net.  John's succeeding tests show BNL-Kansas City close to 1 Gbps.    Could be real congestion is complicating the testing.   Situation seems to be that there is a problem between OU and Kansas City but this could also be real traffic congesting the links.  Needs further work.
      Hiro mentioned Tier-2 to Tier-2 tests (worldwide) are underway.   Important to have network data to help support this work longer term.
      Doug mentioned alerting ATLAS sites to the DYNES process and the need to make sure we have a large number of ATLAS institutions participating.  Note: deadline for DYNES site submissions is the end of November!  USATLAS related sites should be strongly encouraged to participate.  See http://www.internet2.edu/ion/dynes.html for more information (and pass the word).
      We plan to meet again in 2 weeks at our regular time.  Please send along corrections or additions to the list.
    • John and BNL made measurements between OU and BNL to understand connectivity. Finding problems between Chicago and OU - Kansas City, but precise location not determined.
    • 3.2 release of perfSONAR - want to use this version to collect infrastructure data - need to get sites updated.
    • Nagios monitoring at BNL - nice way to keep track of services, which are up, etc?
    • Eg. UDP packet transmission test (10 out of a group of 600)
    • Hiro is gathering the data from instances, making it available for plots from his site
    • DYNES information for sites has been available - to be finalized by November 1. Encourage all sites on the call to be part of DYNES. Need all Tier 2's to be a part of it.
    • http://www.internet2.edu/ion/dynes.html
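
The OU - BNL debugging above follows a simple pattern: test from one end to successive points along the traceroute path and look for the first hop where throughput collapses. The Python sketch below shows that reasoning only. The function name and the `drop_ratio` threshold are hypothetical, and the numbers in any call are whatever the testers measured, not values this sketch supplies.

```python
def first_degraded_hop(measurements, drop_ratio=0.5):
    """Locate the first hop where throughput drops sharply.

    measurements: ordered list of (endpoint, mbps), each measured from the
    same source to successive endpoints along the path.
    Returns (previous_endpoint, endpoint) for the first pair where throughput
    falls below drop_ratio times the previous value, or None if no such drop.
    """
    for (prev_name, prev_mbps), (name, mbps) in zip(measurements, measurements[1:]):
        if mbps < drop_ratio * prev_mbps:
            return (prev_name, name)
    return None
```

As the notes point out, real traffic on a link can produce the same signature as a fault, so a single pass is not conclusive; repeating the tests at different times helps separate congestion from a persistent problem.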

Site news and issues (all sites)

  • T1:
    • last week(s): will be adding 1300 TB of disk; installation is ready and will be handed over to dcache group to integrate into the dcache system by next week. CREAM investigations are continuing. LCG made an announcement to the user community that we'll deprecate the existing CE by the end of the year. Urging sites to convert. Have discussed with OSG on behalf of US ATLAS and US CMS - Alain Roy is working on this, will be ready soon. Submission and Condor batch backend sites will need to be tested. Preliminary results looked good to a single site, but CMS found problems with submission to multiple sites. Plan is to submit 10K jobs from BNL submit host to 1000 slots at UW, to validate readiness (Xin). Note: no end of support for the current OSG gatekeeper, GT2-based.
    • this week:

  • AGLT2:
    • last week: Have taken delivery of all of the disk trays, now under test. Coordinate turning on shelves between the sites. Looking at XFS su & sw sizes as a tool to optimize MD1200 performance. At Michigan, two dcache headnodes, 3 each at a site. Expect a shutdown in December. Major network changes, deploy 8024F's. Performance issue with H800 and 3rd MD1200 shelf.
    • this week:

  • NET2:
    • last week(s): ANALY queue at HU - available to add more analysis jobs. Expect to stay up during the break.
    • this week:

  • MWT2:
    • last week(s): Security mitigation complete. Pool with a bad DIMM and needs a bios update. Running stably.
    • this week:

  • SWT2 (UTA):
    • last week: Working to get conditions access set up correctly for the UTA_SWT2 cluster since it's being converted to analysis; a second squid server at UTA, using the same DNS name.
    • this week:

  • SWT2 (OU):
    • last week: Everything running smoothly. Only issue getting OSCER production jobs.
    • this week:

  • WT2:
    • last week(s): Deployed pcache, working fine. 4 hour shutdown to update kernels(?) Two disks failed last week, need to produce more logging.
    • this week:
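
On AGLT2's note above about XFS su and sw sizes: `su` is the stripe unit (the RAID chunk size) and `sw` is the stripe width in units of `su`, which for a parity RAID set equals the number of data disks. The Python sketch below derives the option string for `mkfs.xfs -d`. The function name is hypothetical, and the RAID level and chunk size in the usage note are assumptions for illustration, not AGLT2's actual settings.

```python
def xfs_stripe_options(total_disks, parity_disks, chunk_kib):
    """Build the 'su=...,sw=...' value for mkfs.xfs -d from a RAID layout.

    total_disks  - disks in the RAID set
    parity_disks - disks' worth of parity (1 for RAID-5, 2 for RAID-6)
    chunk_kib    - RAID chunk (stripe unit) size in KiB
    """
    data_disks = total_disks - parity_disks
    if data_disks < 1 or chunk_kib < 1:
        raise ValueError("need at least one data disk and a positive chunk size")
    return "su=%dk,sw=%d" % (chunk_kib, data_disks)
```

For example, a hypothetical 12-disk RAID-6 set with a 64 KiB chunk gives `su=64k,sw=10`, which would be passed as `mkfs.xfs -d su=64k,sw=10 <device>`.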

Carryover issues ( any updates?)

HEPSpec 2006 (Bob)

last week:

  • HepSpecBenchmarks
  • MWT2 results were run in 64 bit mode by mistake; Nate is re-running.
  • Assembling all results in a single table.
  • Please send any results to Bob - important for running the benchmark.
  • Duplicate results don't hurt.

this week:

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last report
    • Testing at BNL - 16.0.1 installed using Alessandro's system, into the production area. Next up is to test DDM and poolfilecatalog creation.
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into official dcap sources.. libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • newline causing buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion into the release
  • No new failure modes observed

this meeting:


  • last week
    • Please check out Bob's HS06 benchmark page and send him any contributions.
  • this week

-- RobertGardner - 26 Oct 2010
