
MinutesOct20

Introduction

Minutes of the Facilities Integration Program meeting, Oct 20, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Taiwan: 0809092282 (Toll Free)
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Dave, Michael, Karthik, Aaron, Nate, Charles, Patrick, Jason, Fred, Sarah, Justin, John B, Alden, John De Stefano, Wei, Armen, Mark, Tom, Hiro
  • Apologies: Horst

Integration program update (Rob, Michael)

  • IntegrationPhase15 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Finalizing SLAC meeting agenda
      • Facility capacity updates
      • Autumn reprocessing campaign for Winter conferences:
      • Machine has made lots of progress - successful stores are leading to large integrated luminosities.
      • Several presentations at this week's ATLAS week
      • Will Tier 2's be involved? Most likely not.
      • Oct 19 - a sign-off meeting before the bulk of production jobs is released on Oct 25.
      • Q: what about ntuple creation - part of reprocessing? Probably yes.
    • this week
      • Security contacts in OIM - this week's incident is a good reason for Tier 3 sites to register theirs; the workaround needs to be applied asap. Another vulnerability came out yesterday as well - the rds kernel module needs to be blacklisted (this does not require a reboot); CVE-2010-3904 (it didn't come through OSG security, and was already known in EGI). All sites need to apply the workaround asap (see the mitigation sketch after this list).
      • Thanks to Wei for hosting the face-to-face meeting last week.
      • Unforeseen technical LHC stop starting yesterday - to address injection issues.
      • December 7 is the planned shutdown for the holiday break; whether it will be 9 or 11 weeks is under discussion. This would be a good time for intrusive maintenance activity.
      • Next Monday the next re-processing campaign will begin. Data needed for the winter conferences.
      • Attempt to coordinate downtimes so that multiple Tier 2's are not down at once. Use the tier2 mailing list.
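
    Below is a minimal sketch of the kind of rds-module mitigation mentioned above; it is not an official OSG or EGI procedure. It assumes root access and a modprobe.d-style configuration, and the file path is a hypothetical choice - sites should follow the actual advisory.

      #!/usr/bin/env python
      # Sketch of the rds-module workaround discussed above (not an official procedure).
      # Assumptions: root access, a modprobe.d-style config directory; the path is hypothetical.
      import os
      import subprocess

      BLACKLIST_FILE = "/etc/modprobe.d/disable-rds.conf"  # hypothetical location

      def rds_loaded():
          # /proc/modules lists currently loaded kernel modules, one per line.
          with open("/proc/modules") as f:
              return any(line.split()[0] == "rds" for line in f)

      def main():
          # Prevent future loading: map the module to /bin/true so modprobe becomes a no-op.
          if not os.path.exists(BLACKLIST_FILE):
              with open(BLACKLIST_FILE, "w") as f:
                  f.write("install rds /bin/true\n")
          # Unload it now if it is already loaded; no reboot required.
          if rds_loaded():
              subprocess.check_call(["rmmod", "rds"])

      if __name__ == "__main__":
          main()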

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Brandeis installation progressing.
  • CVMFS support at CERN - this had been an open question; we now have an important statement from ATLAS supporting it.
  • The statement covers support for CernVM R&D, and for CVMFS at Tier 3 and Tier 2 sites.
  • CVMFS - infrastructure is in place for conditions data as well - close to being able to test.
  • Data management at Tier 3g - phone call coming up this afternoon w/ Wei and Andy.
  • More Tier 3's are coming online - Santa Cruz acquiring equipment.
  • Q: what about inconsistencies between a Tier 3 LFC and the physical storage, with some Tier 3s acting as transfer sources?
  • Hiro has a program for checking LFC-local storage consistency (a generic sketch of such a check follows this list).
  • dq2-get-FTS plugin - should be ready for next distribution.
  • dq2-client developer discussion - an option in dq2-get to allow the DDM convention for the global namespace; a similar modification for dq2-ls.
  • Doug: can Hiro provide instructions to use plugin with current
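
    As a follow-up to the consistency question above, here is a generic sketch of the kind of check involved (this is not Hiro's actual program). It assumes two flat listings are available, one entry per line: a dump of the Tier 3 LFC replicas and a walk of the local storage namespace; both file names are hypothetical.

      # Generic LFC-vs-storage consistency sketch (not Hiro's actual program).
      # Assumes two flat text files, one entry per line; file names are hypothetical.

      def load(path):
          with open(path) as f:
              return set(line.strip() for line in f if line.strip())

      lfc = load("lfc_dump.txt")          # replicas registered in the Tier 3 LFC
      storage = load("storage_dump.txt")  # files actually present on local storage

      missing_on_disk = lfc - storage     # registered in the LFC but gone from storage
      dark_data = storage - lfc           # on storage but unknown to the LFC

      print("LFC entries with no physical file: %d" % len(missing_on_disk))
      print("Dark data files (no LFC entry):    %d" % len(dark_data))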
this week:

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Not too many issues to report - most sites at or near full capacity
    • Issue yesterday with 15.6.9 - failures, CMT error. Problem with the release install - looks like it's resolved now. Xin: there were two problems: for 15.6.9 a special cache was installed at most of the US sites and broke the requirements file, causing HC job failures; this was caused by the install script mishandling the special cache. A separate problem was caused by Alessandro's system. Both have been fixed.
    • It seems that in the past week there has been more aggressive blacklisting of sites, not sure why - seems premature? Sometimes it is for Athena crashes rather than site problems. Potentially new shifters who are not properly trained.
    • Sometimes shifters are too aggressive in creating tickets - do they get guidance on recognizing transient spikes?
    • Michael: please raise this issue again with shift crew.
  • this week:
    • Lots of production tasks available, shouldn't see any drains
    • Analysis is spiky - correlated with the LHC going into a study mode.
    • Reprocessing is expected to submit a significant load; expect significant MC reprocessing there. Make sure PRODDISK is cleared out (or is central OPS doing this? yes). No estimate of the input data volume. About 540M events; 300M Monte Carlo; 100M pileup. B events for each.
    • Armen working on cleaning up space at sites - goal to get to 40-50% at each site.
    • There is dark data on sites - that needs to be cleaned up.
    • BU is running proddisk-cleanse daily, clearing about 200 GB per day (see the cleanup sketch after this list).
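
    A rough sketch of the kind of accounting a daily PRODDISK cleanup does (this is not the proddisk-cleanse tool itself): it walks a storage mount and totals the data older than a cutoff. The path and the cutoff are hypothetical.

      # Rough sketch (not proddisk-cleanse itself): total the PRODDISK data older
      # than a cutoff, i.e. roughly what a daily cleanup pass would consider.
      import os
      import time

      PRODDISK = "/pnfs/example.edu/atlasproddisk"  # hypothetical mount point
      CUTOFF_DAYS = 30                              # hypothetical retention window

      cutoff = time.time() - CUTOFF_DAYS * 86400
      old_bytes = 0
      for dirpath, dirnames, filenames in os.walk(PRODDISK):
          for name in filenames:
              full = os.path.join(dirpath, name)
              try:
                  st = os.stat(full)
              except OSError:
                  continue
              if st.st_mtime < cutoff:
                  old_bytes += st.st_size

      print("Data older than %d days: %.1f GB" % (CUTOFF_DAYS, old_bytes / 1e9))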

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=110437
    
    1)  10/7: OU_OCHEP_SWT2: after an outage to apply security patches test jobs were successful, so site set back to on-line.  eLog 17928.
    2)  10/7 - 10/8: BNL, file transfer errors - "failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries]."  Issue resolved (high PNFS load - restarted).  ggus 62869 closed, eLog 17927.
    3)  10/8: OU_OCHEP_SWT2_PRODDISK file transfer errors.  Issue was a DNS server failure - switched to a back-up, problem solved.  eLog 17968.
    4)  10/8: ILLINOISHEP_DATADISK file transfer errors.  Issue with the dCache postgres db resolved.  ggus 62901 closed, eLog 17979.
    5)  10/9: BNL-OSG2 file transfer errors ("failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2").  Issue understood and resolved (a failed "pin manager" process, which was restarted).  ggus 62905 closed, eLog 17986. 
    6)  10/9 - 10/12:  ANL_LOCALGROUPDISK file transfer errors ("source file doesn't exist] Source Host [atlas11.hep.anl.gov]").  Bad datasets were removed from the site - issue resolved.  ggus 62909 closed, eLog 18145. 
    7)  10/10: WISC, file transfer errors.  Problem fixed - ggus 62918 closed, eLog 18057.
    8)  10/11: BNL-OSG2_DATADISK file transfer errors.  Issue resolved (pnfs db - see details in eLog 18100).  ggus 62957 closed. 
    9)  10/12: BNL dCache maintenance: SRM database hardware upgrade, SRM server hardware upgrade, pnfs server hardware upgrade.  Completed as of ~3:00 p.m. EST.  eLog 18173.
    10)  10/12: SWT2_CPB_USERDISK file transfer errors.  Problem was due to ~32k directories in an ext3 file system (ext3 allows at most ~32,000 subdirectories per directory - see the directory-count sketch below).  Cleanup performed, issue resolved.  (During an SRM restart RT 18366 was opened / subsequently closed.)  eLog 18251.
    11)  10/13:   WISC - file transfer errors, apparently due to an expired host cert:
    /DC=org/DC=doegrids/OU=Services /CN=atlas03.cs.wisc.edu has expired.  ggus 63038 in-progress, eLog 18215.
    
    Follow-ups from earlier reports:
    (i)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases 
    to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    Update 9/13: Thanks to Rod Walker for necessary updates to ToA.  Analysis site should be almost ready for testing.
    Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 9/21: Some modifications to the schedconfig db for the analysis queue were implemented.
    Update 9/29: On-going testing of the analysis cluster (Fred, Alessandro, et al).
    Update 10/7: remaining configuration issues for ANALY_OU_OCHEP_SWT2 apparently resolved - jobs now running successfully in the queue.  Also, we're trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
    (ii)  9/20: Transfers to UTD_LOCALGROUPDISK failed with the error "System error in open: No space left on device."  ggus 62248 in-progress, eLog 17144. 
    Update 9/29: ggus 62248 still shown "in-progress."  Need to follow-up on this issue.
    Update 10/8: Clean-up created needed free space in the storage - site set back to on-line.
    (iii)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients.
    (iv)  9/26: WISC_LOCALGROUPDISK DDM transfers failing:
    [INVALID_PATH] globus_ftp_client: the server responded with an error550 550-Command failed : globus_gridftp_server_posix.c:globus_l_g
    fs_posix_stat:358:550-System error in stat: No such file or directory550-A system call failed: No such file or directory550 End.]  ggus 62427 in-progress, eLog 17463.
    Update: solved as of 10/8.  (Same issue as reported in (viii) below.)
    (v)  9/26: SLAC - job failures with the error "COOL exception caught: Connection on "oracle://ATLAS_COOLPROD/ATLAS_COOLOFL_RPC" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" 
    from "CORAL/Services/ConnectionService" )."  ggus 62434 in-progress, eLog 17508.
    Update 10/6: not a site issue - ggus ticket closed.
    (vi)  9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error."  ggus 62642 in-progress, eLog 17662. 
    (vii)  10/2: WISC_DATADISK transfer errors:
    failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  ggus 62694, eLog 17713.  Site was blacklisted.  (Comment from the site admin, 10/3: Most of the data servers are back. 
    But there are still some failing servers. We are working on it.)
    Update: solved as of 10/8.  (Same issue as reported in (viii) below.)
    (viii)  10/4: WISC_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62748 in-progress, eLog 17802.
    Update to both of these tickets, from Wen: I think this problem is caused by some data servers which failed after last power cut. Now all dataservers are back. So We can close it.  ggus 62694/62748 both closed.
    (ix)  10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62750 in-progress, eLog 17803.
    (x)  10/4: New site ANALY_SLAC_LMEM is being tested (Wei's request).
    Update 10/10: test jobs completed successfully, site set to "brokeroff."
    
    
    • Over the weekend a large number of tasks were failing - e.g. SUSY evgen jobs.
    • DDM central catalog services updated again, no indication of problems this time.
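
    Regarding item 10 above (the ~32,000-subdirectory limit of ext3), a quick sketch of how a site could watch for directories approaching that limit; the top-level path is hypothetical.

      # Sketch: flag directories approaching ext3's ~32,000 subdirectory limit.
      # The top-level path is hypothetical.
      import os

      TOP = "/data/atlasuserdisk"   # hypothetical storage area
      LIMIT = 32000                 # ext3 allows at most ~32,000 subdirectories

      for entry in sorted(os.listdir(TOP)):
          full = os.path.join(TOP, entry)
          if not os.path.isdir(full):
              continue
          nsub = sum(1 for e in os.listdir(full)
                     if os.path.isdir(os.path.join(full, e)))
          if nsub > 0.9 * LIMIT:
              print("WARNING: %s has %d subdirectories (limit ~%d)" % (full, nsub, LIMIT))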
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting (this week from Alessandra Forti):
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_18_10.html
    
    1)  10/14 - 10/19: Savannah 74004 submitted due to a backlog of jobs in the transferring state at SLAC.  Issue was SRM access problems.   All but two of the job outputs have now transferred, and the missing ones were re-run elsewhere.  
    Savannah ticket closed, eLog 18448.
    2)  10/15: BNL - file transfer errors ("failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2").  From Michael: The problem was caused by an operator error. It's already solved.  ggus 63164 closed, eLog 18333.
    3)  10/16: BNL - file transfer errors.  Issue understood (high loads on database hosts).  ggus tickets 63175/77 closed, eLog 18368.
    4)  10/16 - 10/18: Very large number of failed jobs from various merge tasks with the error "createHolder: holder can't be done, no predefined storage found."  Tasks eventually aborted.  Savannah 74075.
    5)  10/17: Jobs from task 177589 were failing at HU_ATLAS_Tier2 & UTA_SWT2 with PFC errors like "poolToObject: caught error: FID "540D53BB-0DCE-DF11-B2D0-000423D2B9E8" is not existing in the catalog."  
    Problem at HU likely due to the timing of the PFC sync between BU and HU.  Issue of PFC creation at UTA_SWT2 being worked on - should be fixed this week (see (ii) below).
    6)  10/18: From Bob at AGLT2: The nfs server containing atlas releases (but not atlas home directories) crashed around 12:30pm.  Just power cycled around 1:55pm.  Waiting now for fallout from the crash.
    No spike in errors observed.
    
    Follow-ups from earlier reports:
    (i)  Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 10/14:  Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
    Update 10/20: Still trying to understand the brokerage problem.
    (ii)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients.
    (iii)  9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error."  ggus 62642 in-progress, eLog 17662. 
    (iv)  10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62750 in-progress, eLog 17803.
    (v)  10/13:   WISC - file transfer errors, apparently due to an expired host cert:
    /DC=org/DC=doegrids/OU=Services /CN=atlas03.cs.wisc.edu has expired.  ggus 63038 in-progress, eLog 18215.
    Update 10/14: host certificate updated - issue resolved.  ggus ticket closed.
    
    • Open question of job brokerage at OU_OSCER - perhaps need Tadashi's intervention
    • Many open issues have been closed in the past week
    • Large spike in job failures over the weekend - merge tasks failing badly

DDM Operations (Hiro)

Throughput Initiative (Shawn)

Site news and issues (all sites)

  • T1:
    • last week(s): will be adding 1300 TB of disk; the installation is ready and will be handed over to the dCache group to integrate into the dCache system by next week. CREAM investigations are continuing. LCG made an announcement to the user community that the existing CE will be deprecated by the end of the year, urging sites to convert. Have discussed with OSG on behalf of US ATLAS and US CMS - Alain Roy is working on this, will be ready soon. Submission and Condor batch backend sites will need to be tested. Preliminary results looked good for submission to a single site, but CMS found problems with submission to multiple sites. The plan is to submit 10K jobs from a BNL submit host to 1000 slots at UW to validate readiness (Xin). Note: there is no end of support yet for the current OSG gatekeeper, which is GT2-based.
    • this week:

  • AGLT2:
    • last week: Delivery of disks at both sites. Awaiting headnodes at the UM site. Will be 1.9PB when operational. One more PO to go out. SFP+ switches won't be delivered till Nov 1. Looking at locality of clients and pools between the two sites - are there dCache reconfigurations or locality options? The local site mover may be helpful in this context. Tom: all parts arrived at MSU. Outages are expected when the new switches and hardware arrive (full-day-scale outages).
    • this week: Taken delivery of all of the disk trays, now under test. Will coordinate turning on shelves between the sites. Looking at XFS su & sw sizes as a tool to optimize MD1200 performance (see the su/sw sketch below). At Michigan, two dCache headnodes; 3 each at a site. Expect a shutdown in December. Major network changes, deploying 8024F's. Performance issue with the H800 and the 3rd MD1200 shelf.
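
    A small sketch of the su/sw arithmetic being looked at for the MD1200 shelves; the RAID geometry below is hypothetical and should be replaced with the controller's actual chunk size and disk count.

      # Sketch of the XFS su/sw arithmetic for aligning to a RAID volume
      # (e.g. an MD1200 shelf). The geometry values are hypothetical.

      chunk_kib = 64        # RAID chunk (stripe element) size reported by the controller
      disks = 12            # disks in the virtual disk
      parity = 2            # RAID6 -> two parity disks per stripe

      data_disks = disks - parity
      su_kib = chunk_kib    # XFS stripe unit = RAID chunk size
      sw = data_disks       # XFS stripe width = number of data-bearing disks

      print("mkfs.xfs alignment: -d su=%dk,sw=%d" % (su_kib, sw))
      print("full stripe size: %d KiB" % (su_kib * sw))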

  • NET2:
    • last week(s): All running smoothly - full capacity, mainly production. Westmere nodes at HU have been in use (432 cores). John has the new HU ANALY queue up and working; there are some errors at the moment - Athena crashes or a release installation issue. Continuing to move space tokens around into the new storage. Still migrating GROUPDISK - mostly done - expect large tracts of data to free up later today. John: on-going problem with Gratia - the cron job can't catch up; the "suppress-local" option was ineffective, but Wei's patch works great, so this is fixed now. Condor-G problems keeping the site filled; found a bug in the globus perl module interfacing with LSF; Jaimie provided a patch, fixed.
    • this week: ANALY queue at HU - available to add more analysis jobs. Expect to stay up during the break.

  • MWT2:
    • last week(s): Quiet week, focusing on stability. Slowly migrating compute nodes to SL 5.5. Moving the database for the SRM door onto a higher-end server, reducing the failure rate. Emails sent to users regarding this.
    • this week: Security mitigation complete. One pool node has a bad DIMM and needs a BIOS update. Running stably.

  • SWT2 (UTA):
    • last week: Security updates completed on both clusters - all went fine. LFC version update announcement - we'll need to update the facility LFCs at some point.
    • this week: Working to get conditions data access set up correctly for the UTA_SWT2 cluster, since it is being converted to analysis; adding a second squid server at UTA using the same DNS name (see the DNS check sketch below).
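
    A quick sketch for checking that a round-robin DNS alias for the squid servers actually returns multiple addresses; the alias name and port are hypothetical.

      # Sketch: list the addresses behind a round-robin DNS alias for the squids.
      # The alias and port below are hypothetical.
      import socket

      ALIAS = "squid.example-swt2.org"   # hypothetical round-robin alias
      PORT = 3128                        # typical squid port (assumption)

      addrs = sorted(set(info[4][0] for info in
                         socket.getaddrinfo(ALIAS, PORT, socket.AF_INET)))
      print("%s resolves to %d address(es):" % (ALIAS, len(addrs)))
      for a in addrs:
          print("  %s" % a)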

  • SWT2 (OU):
    • last week: Site name available but the queue name isn't visible for ANALY queue. Will consult Alden.
    • this week: Everything running smoothly. The only issue is getting production jobs brokered to OSCER.

  • WT2:
    • last week(s):
    • this week: Deployed pcache, working fine. A 4-hour shutdown to update kernels(?). Two disks failed last week; need to produce more logging.

Carryover issues ( any updates?)

HEPSpec 2006 (Bob)

last week:

  • HepSpecBenchmarks
  • MWT2 results were run in 64-bit mode by mistake; Nate is re-running.
  • Assembling all results in a single table (see the roll-up sketch below).
  • Please send any results to Bob - it's important that sites run the benchmark.
  • Duplicate results don't hurt.
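
    A minimal sketch of the kind of roll-up being assembled: combine per-node HS06 numbers and node counts into per-type and site totals. All numbers below are made-up placeholders.

      # Sketch: roll per-node HS06 results up into a site total for the summary table.
      # Node types, counts, and scores below are made-up placeholders.

      nodes = [
          # (node type, count, HS06 per node)
          ("example_8core",  50,  90.0),
          ("example_12core", 20, 130.0),
      ]

      total = 0.0
      for name, count, hs06 in nodes:
          subtotal = count * hs06
          total += subtotal
          print("%-16s x%3d  %7.1f HS06/node  %9.1f HS06" % (name, count, hs06, subtotal))
      print("Site total: %.1f HS06" % total)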

this week:

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last report
    • Testing at BNL - 16.0.1 installed using Alessandro's system, into the production area. Next up is to test DDM and poolfilecatalog creation.
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and the newer stock libdcap could be merged at some point.
  • The maximum number of active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed

this meeting:

AOB

  • last week
  • this week
    • Please check out HS06 benchmark page and send Bob any contributions.


-- RobertGardner - 20 Oct 2010
