r3 - 21 Apr 2009 - 14:05:20 - KaushikDe

MinutesDataManageApr21

Introduction

Minutes of the US ATLAS Data Management meeting, Apr 21, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Tuesdays, Noon Central
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Shawn, Pedro, Bob, Hiro, Saul, Armen, Patrick, Wensheng, Michael, Rob, John
  • Apologies: Wei
  • Guests: None

Topics for this week

  • Reprocessing status - Kaushik
    • Only 2000 jobs left (plus ~6000 already in queue)
    • Should finish by tomorrow
  • HPSS/dcache status
    • pnfs load got very high on Friday: ~150 merge tasks were aborted, pandamover was stopped, and dcache was restarted
      • Everything looks calm since then
      • A post-mortem report will follow soon
    • Still lots of jobs needing small files from tape - sent another list of ~200 tasks to Borut to be aborted
    • HPSS migration buffer (dcache <-> HPSS) full today, many small files (300k 25MB files)
      • Staging from dcache to HPSS stopped for a while (~2 hours ago), so that migration can clean up buffer cache
      • It may take a while (~24 hours) to clear backlog, but will turn staging back on later today (in fact 1:30pm)
      • Buffer size 13 TB, very expensive fiber channel disk, plan to add 20 TB more this year
      • Write pool is only 10% full, will continue to pile up, but will catch up soon (no loss of data)
      • All 10 drives allocated to write - so no drives for reading till later today
    • Access logging now on srm server, as well as SRM watcher
    • David working on publicly available HPSS monitoring, 2-3 weeks
    • John has a tool to capture dashboard errors and create alarms - could be useful for Tier 2
  • Storage cleanup - All
    • Decision - all aborted and obsolete datasets in US will now be cleaned by central operations (ADC DDM)
    • _PRODDISK and _SCRATCHDISK will be cleaned locally (by site admins)
    • Keep an eye on your site this week, in case there are unexpected problems
    • Many issues being discussed by email
    • For example, link: http://atlddm02.cern.ch/dq2/accounting/site_view/AGLT2_MCDISK/30/
    • Charles will work this week on a script to clean up old aborted datasets (erased from DQ2)
  • DQ2 adler32 plugin - Hiro
    • BNLPANDA and BNLDISK in active mode, starting today
    • Dashboard will show these errors as 'failed to validate'
    • Frequency low, few files in past couple of days
    • AGLT2 will try this next
  • SCRATCHDISK deployment
    • BNL - done
    • AGLT2 - 9 TB (45 TB total, done)
    • MWT2 - 10 TB (done)
    • NET2 - flexible TB (done)
    • SLACT2, SWT2 - later, after cleanup
  • BNL migration to space tokens - Armen, Pedro, Hiro
    • Hiro is changing directory permissions now, probably done by tomorrow
    • Start allocation of new Thors this afternoon
    • Check with Panda team if they are ready
  • Hot issues
    • Completing datasets at T2 - consistency checking
    • Shawn - tried DQ2 consistency checker, appears to change dataset status to incomplete
    • Next step: re-subscribe all incomplete datasets, will try with the sources option (all US MCDISK and DATADISK)
    • All sites should try consistency checker - Shawn will share script
    • LFC connection errors yesterday - problem solved today, Hiro's configuration change probably fixed it
  • AOB
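
Note on the DQ2 adler32 plugin item above: for sites that want to spot-check a file by hand, the kind of comparison involved can be sketched as below. This is only an illustrative sketch, not the plugin's code. The function names `adler32_of_file` and `validate` are hypothetical, and the 8-digit lowercase hex form of the checksum is an assumption about how the catalog value is written.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks so that large files do not have to fit
    in memory. The hex formatting is an assumption, not a DQ2 rule.
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in to continue the checksum
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)


def validate(path, catalog_checksum):
    """Hypothetical helper: True if the file matches the catalog value."""
    return adler32_of_file(path) == catalog_checksum.strip().lower()
```

A mismatch from a check like this would correspond to what the dashboard reports as 'failed to validate'.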


-- KaushikDe - 21 Apr 2009
