MinutesBNLdcacheApr29
Introduction
Minutes of the BNL dCache/HPSS Optimization meeting, Apr 29, 2009
- Coordinates: Building 510 (Physics), Rm 2-160 at BNL, 9:30 am EDT
- (309) 946-5300, Access code: 280250; Dial *6 to mute/un-mute.
Attending
- Meeting attendees: Pedro, Ofer, Rob, Armen, Charles, Michael, Wensheng, Xin, Iris, Jane, David, John, Torre, Tadashi, Dantong, Shigeki
- Apologies:
- Guests: None
dCache overview & Plans - Pedro
- current dCache issues
- plan for the next 3 months
- Discussion on PostgreSQL:
- Database now ~160 GB
- Move to 64-bit (4 GB -> 48 GB cache) ~May 11th
- Deploy SSDs ~July
- Possible solution - move to Oracle with Chimera ~summer
- How do we test:
- Chaotic user analysis with large IO (number of files) from disk
- Test harness 1: Jason's test
- Test harness 2: ~1000 pilots, each reading ~100 files, no processing, read from disk (Xin, Wensheng)
- Production reading from tape (large number of dccp -p)
- Test harness 1: Jason's test
- Test harness 2: real merge jobs
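The read-only harness described above (many pilots, each reading a list of files with no processing) can be sketched as follows. This is a minimal local illustration, not the harness used at BNL: the function names, the thread pool, and the use of plain file reads in place of dccp or pilot jobs are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_and_discard(path, chunk_size=1 << 20):
    """Read one file to the end without processing it; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def run_worker(paths):
    """One simulated pilot (hypothetical): read every file in its list."""
    return sum(read_and_discard(p) for p in paths)


def run_harness(file_lists, max_workers=32):
    """Run one worker per file list; return (total_bytes, elapsed_seconds)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        total = sum(pool.map(run_worker, file_lists))
    return total, time.monotonic() - start
```

With ~1000 lists of ~100 paths each, this issues on the order of 100,000 file opens, which is the load pattern the harness is meant to put on the namespace and disk pools.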
Scalability of pnfs server - Hiro, Pedro, Shigeki, Michael
- Maximum connections, load...
- HPSS:
- David: sometimes sees duplicate requests from dCache, up to 6 (but not too harmful)
- Current queue depth is 30k - it would be good to limit clients to this number
- PNFS server:
- How many dccp -p commands per minute can be supported? (Pedro)
- pnfs load plots
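The two HPSS points above (duplicate requests and the 30k queue depth) suggest a client-side filter before submission. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, requests are represented as plain strings, and nothing here reflects the actual dCache or HPSS interfaces.

```python
HPSS_QUEUE_DEPTH = 30_000  # queue depth quoted in the minutes


def dedupe_requests(requests):
    """Drop repeated requests, keeping the first occurrence and the order."""
    return list(dict.fromkeys(requests))


def submit_batch_size(outstanding, pending, queue_depth=HPSS_QUEUE_DEPTH):
    """Number of pending requests that fit without exceeding the queue depth."""
    if outstanding < 0 or pending < 0:
        raise ValueError("counts must be non-negative")
    free_slots = max(queue_depth - outstanding, 0)
    return min(free_slots, pending)
```

For example, with 29,000 requests outstanding and 5,000 pending, `submit_batch_size(29_000, 5_000)` allows 1,000 more; the remainder would wait on the client side.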
Storing pnfsid in LFC - Hiro
- Reasoning, overview and current status
- Maintaining data integrity by using pnfsid stored in LFC
- Plans/procedure for keeping cache updated in LFC
Pandamover status & plans
- Tuning it - wait for pnfs metrics from Pedro
- Switch to DQ2? Try for 7 days, starting May 6/7th.
- Improved monitoring
- New table in PandaDB?
- Monitoring based on table (Alexei's team)
- Handling error conditions (right now if pnfsid is missing, retry using filename) - continue for now
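The error-handling rule above (use the pnfsid when it is present, otherwise retry using the filename) can be sketched as a small lookup function. The entry layout and the two lookup callables are hypothetical stand-ins for illustration, not the Pandamover or LFC API.

```python
def locate_file(entry, by_pnfsid, by_filename):
    """Return (location, method) for a catalog entry.

    entry       -- dict with a 'filename' key and an optional 'pnfsid' key
    by_pnfsid   -- callable mapping a pnfsid to a location (assumed)
    by_filename -- callable mapping a filename to a location (assumed)
    """
    pnfsid = entry.get("pnfsid")
    if pnfsid:
        return by_pnfsid(pnfsid), "pnfsid"
    # pnfsid is missing: fall back to the filename, as the minutes describe
    return by_filename(entry["filename"]), "filename"
```

Returning the method alongside the location makes it easy to count how often the filename fallback is used, which is the kind of figure the improved monitoring could report.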
Plans for Panda pilot
- Sites should run local movers - best way to optimize local site performance
HPSS/dCache monitoring - Pedro, David, Shigeki
- Available tools
- First responders
- storage management group responsibilities
- [staging] ensure that there are enough stage requests on HPSS (30k) before queueing on dCache
- [staging] warn and follow up on failures of copy operations from HPSS disk cache to read pools
- [migration] warn and follow up on failures of copy operations from write pools into HPSS disk cache
- hpss team responsibilities
- Second responders - shift team procedures
- Additional alarms, emails...
Test and development plan for next 6 months
Follow up meeting ~May 28th
-- PedroSalgado - 29 Apr 2009
-- PedroSalgado - 28 Apr 2009
- added current storage management group responsibilities regarding staging
- added links for dCache monitoring
-- KaushikDe - 27 Apr 2009
Attachments
20090429_dCache_Issues.pdf (164.5K) | PedroSalgado, 29 Apr 2009 - 09:40 | Overview of the current dCache issues.
20090429_Storage_Group_Plans.pdf (183.6K) | Main.psalgado, 29 Apr 2009 - 08:48 | Storage management group plans for the next 3 months.
20090429_Staging_priority.jpg (43.5K) | PedroSalgado, 29 Apr 2009 - 09:35 | Stage priority web service
BNL_dCache_performance_meeting_2009_04_28.pdf (63.8K) | HironoriIto, 29 Apr 2009 - 11:17
lfc_with_dcache.pdf (1887.6K) | HironoriIto, 29 Apr 2009 - 11:18