MinutesUSComputingApr09
Introduction
Minutes of US ATLAS Computing meeting, Apr 13-15, 2009
Location: BNL 2-84
Attending
- Meeting attendees: Michael Ernst, Jim Shank, Torre Wenaus, Srini Rajagopalan, Alexei Klimentov, Kaushik De, Pedro Salgado, Hiro Ito, Armen Vartapetian, Yuri Smirnov, Xin Zhao, Dantong Yu, Maxim Potekhin, John Hover, Jose Caballero, Sergey Panitkin, Tadashi Maeno
Agenda: Monday, Apr 13th:
- 11 am - Reprocessing working group (daily mtg group)
- 1 pm - Reprocessing post-mortem
- Current campaign
- US Status
| Number of jobs | Status |
| 146 | ABORTED |
| 151 | PREPARED |
| 940 | RUNNING |
| 5661 | PENDING |
| 32163 | TOBEDONE |
| 65902 | DONE |
- Hiro - with RAW files (many files per tape), the single-tape staging rate is 30 MB/s, which translates to roughly 50 files/hr assuming 2 GB per file
- Kaushik - with 5 drives reading RAW data, on average, maximum rate is 250 files/hr, 6000 files/day
- Alexei - why so much variation, 2k-8k successful jobs per day?
- Kaushik - maybe error rate, but that was not the case April 11th
- Yuri - maybe delay in updating prodDB by Bamboo, currently ~9k discrepancy between number of jobs finished in PandaDB vs prodDB, need to discuss with Tadashi (action item)
- Michael - 10 more drives on order, which should double rate to 12k files/day
- Jim - also need to solve small file problem, quickly
- Hiro/Michael - reading small files could be a factor of 25-50 slower
- Action item - collect all the data for future post-mortem
- Srini - should we plan on block pre-staging ahead of reprocessing, under certain conditions?
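The staging rates quoted by Hiro and Kaushik above can be checked with a short calculation. This is only a sketch: the function name is illustrative, and it assumes each drive streams at a constant 30 MB/s with decimal units (1 GB = 1000 MB). Under those assumptions the ideal figures come out slightly above the rounded numbers in the minutes (54 rather than 50 files/hr per drive).

```python
def files_per_hour(rate_mb_per_s, file_size_gb, drives=1):
    """Files staged per hour, assuming every drive streams at a constant rate.

    Assumption: decimal units, 1 GB = 1000 MB.
    """
    mb_per_hour = rate_mb_per_s * 3600
    file_size_mb = file_size_gb * 1000
    return drives * mb_per_hour / file_size_mb


single = files_per_hour(30, 2)            # one drive, 2 GB RAW files
five = files_per_hour(30, 2, drives=5)    # five drives reading in parallel

print(f"1 drive:  {single:.0f} files/hr")
print(f"5 drives: {five:.0f} files/hr, {five * 24:.0f} files/day")
```

Seek time, tape mounts and small files are not modeled here; these would all lower the real rate.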
- Error rate for IDCosmic tasks
| Number of jobs | Error Code |
| 12103 | ALLOK |
| 10321 | EXEPANDA_DQ2PUT_LOCAL-OUTPUT-FILE-MISSING |
| 1631 | EXEPANDA_DQ2_SERVERERROR |
| 1476 | None |
| 802 | TRFERROR |
| 185 | EXEPANDA_DQ2GET_INFILE |
| 131 | EXEPANDA_DQ2PUT_FILECOPYERROR |
| 96 | EXEPANDA_JOBDISPATCHER_HEARTBEAT |
| 28 | EXEPANDA_DQ2_STAGEIN |
| 27 | EXEPANDA_GET_ADLER32MISMATCH |
| 8 | EXEPANDA_JOBKILL_BYPILOT |
| 7 | EXEPANDA_DQ2PUT_MKDIR |
| 6 | EXEPANDA_ATHENA_RAN-OUT-OF-MEMORY |
| 5 | EXEPANDA_UNKNOWNERROR_JOBWRAPPERCRASH |
| 3 | EXEPANDA_GET_NOSUCHDBRELEASEFILE |
| 3 | EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED |
| 2 | EXEPANDA_GET_FAILEDTOGETLFCREPLICAS |
| 1 | EXEPANDA_GET_NOSUCHFILE |
| 1 | EXEPANDA_JOBDISPATCHER_NOREPLYTOSENTJOB |
| 1 | EXEPANDA_DQ2PUT_FILESIZEONSE |
| 1 | EXEPANDA_RUNJOBEXCEPTION |
| 1 | EXEPANDA_DQ2PUT_FILECOPYTIMEOUT |
- Error rate for non-IDCosmic tasks
| Number of jobs | Error Code |
| 53759 | ALLOK |
| 12900 | EXEPANDA_LOSTJOB_NOTFINISHED |
| 5168 | None |
| 1765 | EXEPANDA_DQ2_SERVERERROR |
| 1372 | TRFERROR |
| 1118 | EXEPANDA_GET_FAILEDTOGETLFCREPLICA |
| 761 | EXEPANDA_DQ2PUT_FILECOPYERROR |
| 650 | EXEPANDA_DQ2GET_INFILE |
| 501 | EXEPANDA_DQ2_STAGEIN |
| 403 | EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED |
| 327 | EXEPANDA_ATHENA_RAN-OUT-OF-MEMORY |
| 315 | EXEPANDA_JOBDISPATCHER_HEARTBEAT |
| 244 | EXEPANDA_GET_FAILEDTOGETLFCREPLICAS |
| 219 | EXEPANDA_PUT_LFCMODULEIMPORT |
| 93 | EXEPANDA_JOBKILL_BYPILOT |
| 76 | EXEPANDA_DQ2PUT_FILECOPYTIMEOUT |
| 41 | EXEPANDA_DQ2PUT_MKDIR |
| 37 | EXEPANDA_UNKNOWNERROR_JOBWRAPPERCRASH |
| 11 | EXEPANDA_GET_NOSUCHDBRELEASEFILE |
| 6 | EXEPANDA_TRF_INSTALL-DIR-NOT-FOUND |
| 5 | EXEPANDA_GET_NOSUCHFILE |
| 2 | EXEPANDA_RUNJOBEXCEPTION |
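For the post-mortem it may help to turn the two error tables above into rates. The sketch below is one possible way to do that; the function and variable names are illustrative. It assumes that every row, including the "None" row (whose meaning the minutes do not state), counts toward the total, and that only ALLOK counts as success.

```python
# Job counts per error code, transcribed from the two tables above.
IDCOSMIC = {
    "ALLOK": 12103,
    "EXEPANDA_DQ2PUT_LOCAL-OUTPUT-FILE-MISSING": 10321,
    "EXEPANDA_DQ2_SERVERERROR": 1631,
    "None": 1476,
    "TRFERROR": 802,
    "EXEPANDA_DQ2GET_INFILE": 185,
    "EXEPANDA_DQ2PUT_FILECOPYERROR": 131,
    "EXEPANDA_JOBDISPATCHER_HEARTBEAT": 96,
    "EXEPANDA_DQ2_STAGEIN": 28,
    "EXEPANDA_GET_ADLER32MISMATCH": 27,
    "EXEPANDA_JOBKILL_BYPILOT": 8,
    "EXEPANDA_DQ2PUT_MKDIR": 7,
    "EXEPANDA_ATHENA_RAN-OUT-OF-MEMORY": 6,
    "EXEPANDA_UNKNOWNERROR_JOBWRAPPERCRASH": 5,
    "EXEPANDA_GET_NOSUCHDBRELEASEFILE": 3,
    "EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED": 3,
    "EXEPANDA_GET_FAILEDTOGETLFCREPLICAS": 2,
    "EXEPANDA_GET_NOSUCHFILE": 1,
    "EXEPANDA_JOBDISPATCHER_NOREPLYTOSENTJOB": 1,
    "EXEPANDA_DQ2PUT_FILESIZEONSE": 1,
    "EXEPANDA_RUNJOBEXCEPTION": 1,
    "EXEPANDA_DQ2PUT_FILECOPYTIMEOUT": 1,
}

NON_IDCOSMIC = {
    "ALLOK": 53759,
    "EXEPANDA_LOSTJOB_NOTFINISHED": 12900,
    "None": 5168,
    "EXEPANDA_DQ2_SERVERERROR": 1765,
    "TRFERROR": 1372,
    "EXEPANDA_GET_FAILEDTOGETLFCREPLICA": 1118,
    "EXEPANDA_DQ2PUT_FILECOPYERROR": 761,
    "EXEPANDA_DQ2GET_INFILE": 650,
    "EXEPANDA_DQ2_STAGEIN": 501,
    "EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED": 403,
    "EXEPANDA_ATHENA_RAN-OUT-OF-MEMORY": 327,
    "EXEPANDA_JOBDISPATCHER_HEARTBEAT": 315,
    "EXEPANDA_GET_FAILEDTOGETLFCREPLICAS": 244,
    "EXEPANDA_PUT_LFCMODULEIMPORT": 219,
    "EXEPANDA_JOBKILL_BYPILOT": 93,
    "EXEPANDA_DQ2PUT_FILECOPYTIMEOUT": 76,
    "EXEPANDA_DQ2PUT_MKDIR": 41,
    "EXEPANDA_UNKNOWNERROR_JOBWRAPPERCRASH": 37,
    "EXEPANDA_GET_NOSUCHDBRELEASEFILE": 11,
    "EXEPANDA_TRF_INSTALL-DIR-NOT-FOUND": 6,
    "EXEPANDA_GET_NOSUCHFILE": 5,
    "EXEPANDA_RUNJOBEXCEPTION": 2,
}


def summarize(counts):
    """Return total jobs, ALLOK fraction, and the most frequent non-ALLOK code."""
    total = sum(counts.values())
    # Assumption: anything that is not ALLOK (including "None") is not a success.
    errors = {code: n for code, n in counts.items() if code != "ALLOK"}
    top_code = max(errors, key=errors.get)
    return {
        "total": total,
        "ok_fraction": counts.get("ALLOK", 0) / total,
        "top_error": top_code,
        "top_error_fraction": errors[top_code] / total,
    }


for name, table in (("IDCosmic", IDCOSMIC), ("non-IDCosmic", NON_IDCOSMIC)):
    s = summarize(table)
    print(f"{name}: {s['total']} jobs, {s['ok_fraction']:.1%} ALLOK, "
          f"top error {s['top_error']} ({s['top_error_fraction']:.1%})")
```

With these assumptions, a single error code accounts for most of the failures in each table, so the summary points directly at what to investigate first.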
- Action item - request ~220k pileup jobs already defined for US (using small HITS files) to be aborted, and redefined after merging has completed (get numbers from Borut)
- Action item - need dedicated US study group to prepare report on storage needs for US, taking into account data, MC and user needs, so that disk/tape and T1/T2 purchase can be planned (take into account all data types, how long they are kept on disk...)
- STEP09 plans
- 3 pm - Storage status and plans
- Hot topics (tape priority, pnfsid...)
- Tape priority (Pedro) - coding done, waiting for VM to be deployed
- Testing - for two weeks, starting Monday, Apr 20th
- Make default 0 priority for production, test with small manual datasets at higher priority
- Using pnfsid (Hiro) - pandamover is already using pnfsid, Hiro filling LFC
- Time scale - a few million entries filled so far; will take about a month to catch up on ~40M files
- Updates will be done either with Oracle trigger, or asynchronously with cron
- Torre - how to make pnfs catalogue scalable in the long run?
- Pedro - will evaluate Chimera in May, need this for locking problem anyway, but Chimera may not solve performance problem
- Kaushik - new problem, multiple files with same GUID in same storage (Wensheng - please send bug report to Alexei)
- BNL, Tier 2s MCDISK and DATADISK
- 090413_Storage.pdf: BNL Space Token status - Armen
- Action item - ADC operations will provide list of old data on DATADISK at BNL - and will start deleting after agreement from US operations
- Action item - also delete 'obsolete' data from US automatically by central DDM operations
- Action item - check if pathena is setting correct owner for libDS
- USERDISK
- Users should be encouraged to move precious data to GROUPDISK
- GROUPDISK
- SCRATCHDISK
- LOCALGROUPDISK
- BNL migration from BNLPANDA to space tokens
- 864 TB of new Thors available
- Current rate of writing into BNLPANDA - 20 TB/week
- Next 1PB available - Sep. 2009
- Action item - move production to space tokens as soon as ownership is fixed on all space tokens
- Action item (Hiro) - start changing file ownership to usatlas1 (will take about a week)
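The capacity and write-rate figures above give a rough runway for the new disk. The sketch below assumes the write rate stays constant at the current BNLPANDA figure and that all of it lands on the new 864 TB; both are simplifications, and the function names and the choice of the meeting date as the starting point are illustrative.

```python
from datetime import date, timedelta


def weeks_until_full(capacity_tb, write_rate_tb_per_week):
    """Weeks until the given capacity is consumed at a constant write rate."""
    if write_rate_tb_per_week <= 0:
        raise ValueError("write rate must be positive")
    return capacity_tb / write_rate_tb_per_week


def projected_full_date(start, capacity_tb, write_rate_tb_per_week):
    """Calendar date on which the capacity would run out, counted from start."""
    weeks = weeks_until_full(capacity_tb, write_rate_tb_per_week)
    return start + timedelta(weeks=weeks)


weeks = weeks_until_full(864, 20)
print(f"Runway: {weeks:.1f} weeks")
# Assumption: count from the meeting date, 13 Apr 2009.
print("Projected full:", projected_full_date(date(2009, 4, 13), 864, 20))
```

Any growth in the write rate, for example from reprocessing or user analysis output, shortens this runway, which is one input to the procurement schedule.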
- Tape and disk needs for rest of year
- Put procurement process in place this month for next 1PB
- Another 1 PB Dec. 2009
- Data placement, management, deletions
- Optimization of data placement on tape: rather than "data placement" in general, we need a plan for which data should be stored together as a group/file family, ideally workflow-oriented rather than just grouped by directory
- staging - PanDAMover vs SRMBringOnline vs ... (incl. the question of file-based vs dataset-based staging)
- Missing files, corrupted files
Agenda: Tuesday, Apr 14th:
- 10 am - Panda discussions (migration to Oracle, migration to CERN, personnel...)
- 1 pm - Data management phone meeting
- 2 pm - L1/L2/L3 managers' meeting
- 3 pm - CondorG/Condor/Grid issues
- CondorG has not been able to saturate BNL queues for the past few weeks
- Killing GAHP server every 3 hours to close connections - but this slows down the submission rate whenever it is restarted
- If communication with the grid monitor is lost, there is a 1 hour wait during which no status is updated, so new jobs are not sent
- Why is communication getting lost with grid monitor?
- Jaime will provide new binary today, to reduce/stop GAHP server kills
- Stop -forcex from pilot scheduler temporarily, to study why condor_rm is not removing most jobs
- Gass cache is cleaned up daily (older than 30 days removed)
- Similar problem seen at SWT2
- Re-evaluate status after new binary is tried
- Follow-up on scalability testing (John)
- Need to reduce scheduling time - bring it down to about a minute (from the ~10 min we see currently)
- 4 pm - User analysis issues
- US Mega-jamboree - 300M events realistic, assuming 1 min/ev; 45 TB data sample, assuming 150 KB/ev
- HC - start ASAP (action item), FT continuously, request stress part under US control, provide US job types
- World wide analysis tests - http://indico.cern.ch/conferenceDisplay.py?confId=52942
- Attempt to align with mega-jamboree
- ATLAS wide test of generic analysis pilots on all clouds
- Broader testing of root analysis
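The mega-jamboree sizing above can be reproduced with a few lines. This is a sketch with illustrative function names, assuming decimal units (1 TB = 1e9 KB). The 45 TB figure follows directly from 300M events at 150 KB/ev; the CPU-hours figure is derived here from the 1 min/ev assumption and is not a number quoted in the minutes.

```python
def sample_size_tb(n_events, kb_per_event):
    """Total sample size in TB, using decimal units (1 TB = 1e9 KB)."""
    return n_events * kb_per_event / 1e9


def cpu_hours(n_events, minutes_per_event):
    """Total single-core CPU time in hours to process all events once."""
    return n_events * minutes_per_event / 60


n_events = 300_000_000
print(f"Sample size: {sample_size_tb(n_events, 150):.0f} TB")
print(f"CPU time:    {cpu_hours(n_events, 1):,.0f} CPU-hours")
```

Dividing the CPU-hours by the number of available analysis slots gives a first estimate of the wall-clock duration, ignoring queueing and I/O limits.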
Agenda: Wednesday, Apr 15th:
- 9 am - Remaining issues, spill overs, followups
- Missing tools/functionalities
- Need comprehensive monitoring of services (tape, storage, network, batch...) for debugging
- Need better monitoring of data replication to US - need high level view, like dashboard (action item)
- Need to make services robust, with automatic error discovery and recovery - cannot wait for people to find problems
- Need flag from data preparation to indicate which runs/streams should be exported
- Better monitoring of pandaMover
- Frontier discussion
- Important tests for next 2-3 months
- Improving communications between operations and sites
- 1 pm - US facilities meeting
--
KaushikDe - 13 Apr 2009
Attachments
20090413_1230_HPSS_Batch_View.tiff (267.1K) | Main.psalgado, 13 Apr 2009 - 12:34 | HPSS Batch View (20090413 1230)
20090413_1230_Tape_Drive_Usage.tiff (167.9K) | Main.psalgado, 13 Apr 2009 - 12:35 | Tape Drive Usage (20090413 1230)
090413_Storage.pdf (54.7K) | KaushikDe, 13 Apr 2009 - 12:41 | BNL Space Token status - Armen