
MinutesUSComputingApr09

Introduction

Minutes of US ATLAS Computing meeting, Apr 13-15, 2009
Location: BNL 2-84

Attending

  • Meeting attendees: Michael Ernst, Jim Shank, Torre Wenaus, Srini Rajagopalan, Alexei Klimentov, Kaushik De, Pedro Salgado, Hiro Ito, Armen Vartapetian, Yuri Smirnov, Xin Zhao, Dantong Yu, Maxim Potekhin, John Hover, Jose Caballero, Sergey Panitkin, Tadashi Maeno

Agenda: Monday, Apr 13th:

  • 11 am - Reprocessing working group (daily mtg group)
  • 1 pm - Reprocessing post-mortem
    • Current campaign
      • US Status
        Number of jobs | Status
        146            | ABORTED
        151            | PREPARED
        940            | RUNNING
        5661           | PENDING
        32163          | TOBEDONE
        65902          | DONE
        • Hiro - with RAW files there are many files per tape; the single-tape staging rate is 30 MB/s, which translates to ~50 files/hr assuming 2 GB per file
        • Kaushik - with 5 drives reading RAW data, the maximum rate is on average 250 files/hr, i.e. 6000 files/day (see the worked sketch at the end of this day's agenda)
        • Alexei - why so much variation, 2k-8k successful jobs per day?
        • Kaushik - maybe error rate, but that was not the case April 11th
        • Yuri - maybe delay in updating prodDB by Bamboo, currently ~9k discrepancy between number of jobs finished in PandaDB vs prodDB, need to discuss with Tadashi (action item)
        • Michael - 10 more drives on order, which should double rate to 12k files/day
        • Jim - also need to solve small file problem, quickly
        • Hiro/Michael - reading could be a factor of 25-50 slower for small files
        • Action item - collect all the data for future post-mortem
        • Srini - should we plan on block pre-staging ahead of reprocessing? Under certain conditions.
      • Error rate for IDCosmic tasks
        Number of jobs | Error Code
        12103          | ALLOK
        10321          | EXEPANDA_DQ2PUT_LOCAL-OUTPUT-FILE-MISSING
        1631           | EXEPANDA_DQ2_SERVERERROR
        1476           | None
        802            | TRFERROR
        185            | EXEPANDA_DQ2GET_INFILE
        131            | EXEPANDA_DQ2PUT_FILECOPYERROR
        96             | EXEPANDA_JOBDISPATCHER_HEARTBEAT
        28             | EXEPANDA_DQ2_STAGEIN
        27             | EXEPANDA_GET_ADLER32MISMATCH
        8              | EXEPANDA_JOBKILL_BYPILOT
        7              | EXEPANDA_DQ2PUT_MKDIR
        6              | EXEPANDA_ATHENA_RAN-OUT-OF-MEMORY
        5              | EXEPANDA_UNKNOWNERROR_JOBWRAPPERCRASH
        3              | EXEPANDA_GET_NOSUCHDBRELEASEFILE
        3              | EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED
        2              | EXEPANDA_GET_FAILEDTOGETLFCREPLICAS
        1              | EXEPANDA_GET_NOSUCHFILE
        1              | EXEPANDA_JOBDISPATCHER_NOREPLYTOSENTJOB
        1              | EXEPANDA_DQ2PUT_FILESIZEONSE
        1              | EXEPANDA_RUNJOBEXCEPTION
        1              | EXEPANDA_DQ2PUT_FILECOPYTIMEOUT
      • Error rate for non-IDCosmic tasks
        Number of jobs | Error Code
        53759          | ALLOK
        12900          | EXEPANDA_LOSTJOB_NOTFINISHED
        5168           | None
        1765           | EXEPANDA_DQ2_SERVERERROR
        1372           | TRFERROR
        1118           | EXEPANDA_GET_FAILEDTOGETLFCREPLICA
        761            | EXEPANDA_DQ2PUT_FILECOPYERROR
        650            | EXEPANDA_DQ2GET_INFILE
        501            | EXEPANDA_DQ2_STAGEIN
        403            | EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED
        327            | EXEPANDA_ATHENA_RAN-OUT-OF-MEMORY
        315            | EXEPANDA_JOBDISPATCHER_HEARTBEAT
        244            | EXEPANDA_GET_FAILEDTOGETLFCREPLICAS
        219            | EXEPANDA_PUT_LFCMODULEIMPORT
        93             | EXEPANDA_JOBKILL_BYPILOT
        76             | EXEPANDA_DQ2PUT_FILECOPYTIMEOUT
        41             | EXEPANDA_DQ2PUT_MKDIR
        37             | EXEPANDA_UNKNOWNERROR_JOBWRAPPERCRASH
        11             | EXEPANDA_GET_NOSUCHDBRELEASEFILE
        6              | EXEPANDA_TRF_INSTALL-DIR-NOT-FOUND
        5              | EXEPANDA_GET_NOSUCHFILE
        2              | EXEPANDA_RUNJOBEXCEPTION
      • Action item - request ~220k pileup jobs already defined for US (using small HITS files) to be aborted, and redefined after merging has completed (get numbers from Borut)
      • Action item - need a dedicated US study group to prepare a report on US storage needs, taking into account data, MC and user needs, so that disk/tape and T1/T2 purchases can be planned (consider all data types and how long they are kept on disk...)
    • STEP09 plans
  • 3 pm - Storage status and plans
    • Hot topics (tape priority, pnfsid...)
      • Tape priority (Pedro) - coding done, waiting for the VM to be deployed
      • Testing - for two weeks, starting Monday, Apr 20th
      • Make priority 0 the default for production; test with small manual datasets at higher priority
      • Using pnfsid (Hiro) - PandaMover is already using pnfsid; Hiro is filling the LFC
      • Time scale - a few million entries filled so far; it will take ~1 month to catch up with the ~40M files
      • Updates will be done either with an Oracle trigger or asynchronously with a cron job
      • Torre - how to make pnfs catalogue scalable in the long run?
      • Pedro - will evaluate Chimera in May; it is needed for the locking problem anyway, but Chimera may not solve the performance problem
      • Kaushik - new problem: multiple files with the same GUID in the same storage (Wensheng - please send a bug report to Alexei)
    • BNL and Tier-2 MCDISK and DATADISK
      • 090413_Storage.pdf: BNL Space Token status - Armen
      • Action item - ADC operations will provide list of old data on DATADISK at BNL - and will start deleting after agreement from US operations
      • Action item - 'obsolete' data in the US should also be deleted automatically by central DDM operations
      • Action item - check if pathena is setting correct owner for libDS
    • USERDISK
      • Users should be encouraged to move precious data to GROUPDISK
    • GROUPDISK
    • SCRATCHDISK
    • LOCALGROUPDISK
    • BNL migration from BNLPANDA to space tokens
      • 864 TB of new Thor storage servers available
      • Current rate of writing into BNLPANDA - 20 TB/week
      • Next 1 PB available - Sep. 2009
      • Action - move production to space tokens as soon as ownership is fixed on all space tokens
      • Action item (Hiro) - start changing file ownership to usatlas1 (will take ~1 week)
    • Tape and disk needs for rest of year
      • Put a procurement process in place this month for the next 1 PB
      • Another 1 PB in Dec. 2009
    • Data placement, management, deletions
    • Optimization of data placement on tape (rather than "data placement" in general, we need a plan for which data should be stored together as a group/file family, ideally workflow-oriented rather than just grouped by directory)
    • Staging - PandaMover vs. srmBringOnline vs. ... (including the question of file-based vs. dataset-based staging)
    • Missing files, corrupted files
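
A worked version of the tape-staging arithmetic quoted in the reprocessing post-mortem above (30 MB/s per drive, ~2 GB per RAW file, 5 drives reading). The input figures are those stated in the minutes; the Python sketch below is purely illustrative and is not a facility tool.

    # Staging-rate arithmetic sketch, using the figures quoted in the minutes:
    # ~30 MB/s sustained per tape drive and ~2 GB per RAW file.

    def files_per_hour(drive_rate_mb_s, file_size_gb, n_drives):
        """Approximate sustained staging rate in files/hour for n_drives drives."""
        gb_per_hour_per_drive = drive_rate_mb_s * 3600 / 1000.0
        return n_drives * gb_per_hour_per_drive / file_size_gb

    for drives in (1, 5, 10):
        per_hr = files_per_hour(30, 2.0, drives)
        print(f"{drives:2d} drives: ~{per_hr:4.0f} files/hr, ~{per_hr * 24:6.0f} files/day")

    # Roughly: 1 drive gives ~54 files/hr (the ~50/hr quoted), 5 drives ~270 files/hr
    # (~6500/day, close to the quoted 250/hr and 6000/day), and ~10 effective drives
    # would double that to ~13000/day, in line with the 12k files/day estimate.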

Agenda: Tuesday, Apr 14th:

  • 10 am - Panda discussions (migration to Oracle, migration to CERN, personnel...)
  • 1 pm - Data management phone meeting
  • 2 pm - L1/L2/L3 manager's meeting
  • 3 pm - CondorG/Condor/Grid issues
    • CondorG has not been able to saturate the BNL queues over the past few weeks
    • The GAHP server is killed every 3 hours to close connections - but this slows down the submission rate whenever it is restarted
    • If communication with the grid monitor is lost, there is a 1-hour wait during which no status is updated, so new jobs are not sent
    • Why is communication with the grid monitor getting lost?
    • Jaime will provide a new binary today, to reduce/stop the GAHP server kills
    • Temporarily stop using -forcex from the pilot scheduler, to study why condor_rm is not removing most jobs
    • The GASS cache is cleaned up daily (files older than 30 days are removed)
    • Similar problem seen at SWT2
    • Re-evaluate status after new binary is tried
    • Follow-up on scalability testing (John)
    • Need to reduce the scheduling time - bring it down to ~1 min (not the ~10 min we currently see)
  • 4 pm - User analysis issues
    • US Mega-jamboree - 300M events is realistic assuming 1 min/event; 45 TB data sample, assuming 150 kB/event (see the arithmetic sketch at the end of this day's agenda)
    • HC (HammerCloud) - start ASAP (action item), run FT continuously, request that the stress part be under US control, provide US job types
    • World wide analysis tests - http://indico.cern.ch/conferenceDisplay.py?confId=52942
      • Attempt to align with mega-jamboree
    • ATLAS wide test of generic analysis pilots on all clouds
    • Broader testing of ROOT analysis
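
The mega-jamboree sample size quoted under the 4 pm user-analysis item follows from simple arithmetic: 300M events at ~150 kB/event is ~45 TB, and at ~1 min/event the processing adds up to roughly 2e5 single-core days. The event count, event size, and per-event time are the numbers from the minutes; the core-days figure is derived here only as an illustration.

    # Back-of-the-envelope check of the mega-jamboree numbers quoted above.
    n_events = 300e6             # events (from the minutes)
    event_size_kb = 150          # kB per event (from the minutes)
    minutes_per_event = 1.0      # processing time per event (from the minutes)

    sample_tb = n_events * event_size_kb * 1e3 / 1e12      # bytes -> TB (decimal)
    core_days = n_events * minutes_per_event / (60 * 24)   # single-core days

    print(f"sample size ~{sample_tb:.0f} TB")           # ~45 TB
    print(f"processing  ~{core_days:.1e} core-days")    # ~2.1e5 core-days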

Agenda: Wednesday, Apr 15th:

  • 9 am - Remaining issues, spill overs, followups
    • Missing tools/functionalities
      • Need comprehensive monitoring of services (tape, storage, network, batch...) for debugging
      • Need better monitoring of data replication to US - need high level view, like dashboard (action item)
      • Need to make services robust, with automatic error discovery and recovery - cannot wait for people to find problems
      • Need flag from data preparation to indicate which runs/streams should be exported
      • Better monitoring of pandaMover
    • Frontier discussion
    • Important tests for next 2-3 months
    • Improving communications between operations and sites
  • 1 pm - US facilities meeting

-- KaushikDe - 13 Apr 2009


Attachments


20090413_1230_HPSS_Batch_View.tiff (267.1K) | Main.psalgado, 13 Apr 2009 - 12:34 | HPSS Batch View (20090413 1230)
20090413_1230_Tape_Drive_Usage.tiff (167.9K) | Main.psalgado, 13 Apr 2009 - 12:35 | Tape Drive Usage (20090413 1230)
090413_Storage.pdf (54.7K) | KaushikDe, 13 Apr 2009 - 12:41 | BNL Space Token status - Armen