PandaDDMOct07 - Panda/DDM Workshop Oct 3-5 2007, BNL
Emphasis is on open technical and planning discussions, not long presentations. The lead person on a topic is invited to come with enough transparencies to cover the essentials and seed the discussion. I'll take notes through the workshop, others are welcome to as well.
Morning sessions start at 9am, afternoon sessions at 2pm.
The Orange Room is at the front of the physics building, near the main entranceway, behind the stairs, to the left of the small seminar room.
Slides are attached; see the bottom of the page.
== Wed Oct 3
Morning (Orange Room)
- DDM and DQ2
- DQ2 development status and priorities - Miguel
- ATLAS and US ATLAS ops experience with DDM - Alexei, Kaushik, Rod, Rob, ...
- Data movement policy issues, computing model adherence - Alexei, ...
- Storage system and file mover issues - Hiro, Wensheng, ...
- DDM and Panda
- Panda data mover (PDM) - Tadashi (few slides to seed discussion)
- PDM role relative to subscriptions - discussion
- PDM/DQ2 convergence, code reuse? joint project? - discussion
- DQ2 0.4 installation at BNL and UTA
Working session among Miguel, Hiro, Patrick, ...
- Panda technical/strategy discussion - Torre, Kaushik, Tadashi, ...
In parallel with the DQ2 working session
== Thu Oct 4
Morning (Orange Room)
- Panda operations, expansion, near term issues and priorities
- Panda production status and issues - Kaushik, Tadashi
- Panda production on LCG - Rod
- Multi-cloud operation and task assignment - Tadashi, Rod, Kaushik
- AutoPilot - Torre
- Panda monitoring evolution - Torre
Multi-cloud, PandaMover, ...
1pm - High-Availability MySQL DB based on DRBD-Heartbeat, Ming Yue (Orange Room)
Afternoon (Orange Room)
- Future/ongoing developments in Panda
- OSG workload management program and Panda - Torre
- Condor schedd glide-in and the Panda pilot factory - Barnett
- Panda and the FNAL/CMS GlideinWMS - Maxim
- Panda DB scaling studies - Yuri, Aaron
- Security issues - Torre
- Future/ongoing developments in DDM
- Hierarchical datasets: ideas and plans - Miguel, ...
- Optimizing data transfer: datablocks, zipfiles, site service improvements... - Miguel, Alexei, ...
- File catalogs: Pedro's LRC, LFC, ... - Wensheng, Miguel, Alexei ...
== Fri Oct 5
Morning (Orange Room)
- Re-discussions as needed
- Discussion summaries
Afternoon (Orange Room)
- conclusions, action items and next steps
Miguel - DQ2 status
- 0.4.0 affects site services only and is an important upgrade. Client tools and central catalogs unchanged.
- Sophisticated brokering is still at the DQ2 level, no near term prospect of moving more of the brokering responsibility to FTS
- Improvements in DB schema, extensive use of new MySQL features (version 5.1) including triggers and stored procedures
- Main problems:
- Small files. But BNL/Panda OK because small files are in small datasets which are closed.
- Many open datasets.
- Prodsys needs to help! Make better use of datasets: closed datasets (especially not huge open datasets), datablocks.
- Location catalog extension. Keeps missing file guids for sites in central catalog. Concerns over costs and consistency in maintaining this info centrally. Maintained by callbacks.
- 0.4 is on BNLDISK, BNLTAPE, and LCG sites. Was used for functional tests. Next deployment will be Great Lakes. Want to use fair share features.
- 0.5: location catalog schema, site services improvements
- 1.0.0 will have a new central catalog schema with dataset IDs that incorporate a timestamp
- will permit time-based partitioning of DB tables
- recognition that dataset hierarchy (at least 1 level) needed. Will allow smaller datasets, sized to be optimal datablocks for file transfer, to be aggregated in collections. DQ2 deals with the blocks, not the collections.
- under discussion for 1.0. One level deep. Enough, and simpler.
- Miguel needs single point of contact from US for central catalog issues.
- Asks that we clarify DQ2 role on OSG and signal possible contributions.
- big datasets are biggest problem. Panda dispatch, destination blocks are examples of good dataset usage.
- in functional tests, some datasets arrive in hours, some in days. Same source (CERN), same dest (eg. RAL). Miguel: consequence of moving files to separate queues if not found promptly. Know how to reshuffle queues to address this.
- Miguel: overlapping datasets make implementation more difficult. Alexei: not needed for the next year.
- Miguel notes we are moving more files than ever before.
- Alexei priorities: hierarchy, and subscription lifetime. Subscription cancellation policy should be implemented.
- cf. document on subscription policy. We reviewed it.
- disable t2-t2 transfers between clouds
- proposed that specification of sources and wait for sources should only be allowed for production
- Alexei: follow policy as documented for 6 months, then review policy on restricting specification of sources, wait-for-sources.
- physics datasets should not be subscribed to. DQ2 responsibility is movement of data blocks. Aggregation into physics datasets is higher level responsibility.
- Alexei: should clean tid datasets to solve problem of duplicate files with different retry numbers. This means an extra scan. Alexei will do this. Cleaned up tid will appear as a new dataset version.
- Kaushik: dq2 attach in Eowyn still missing for Panda. Tadashi will help Karthik implement this so final dq2 attach happens in Eowyn and not panda. With this, duplicate file problem should largely disappear. Have to investigate this.
- Feb 2008: new DQ2 implementation supporting 1 level hierarchy.
- Alexei: should be fewer versions, patches deployed. Only when a critical bug needs fixing. Miguel: doesn't disagree. Would be good to have more manpower to deal with moving code changes from dev to ops most effectively.
- DQ2 core development manpower is Miguel (site services), Pedro (central catalogues) and Vincent (new fellow, working on location catalogue and consistency tracker). Significantly less than the 3 FTEs dedicated for development as they are also running Tier-0 export exercises, M4/5 data preparations, helping CERN ATLAS facility, operating site services for LCG, helping users, etc
- Patrick: need better communication on DQ2 releases. eg. there was no announcement to go to 0.3.2
- Also Miguel noted later in the meeting that 0.3.2 was in fact an important bug-fix release. Wasn't aware that OSG had not fully deployed it.
- Miguel will add monitoring of releases in the field, so outdated versions can be detected.
- Rob: in integration effort, working on specific integration effort for DDM, testing new releases in dedicated testbeds. Miguel: main interactions have been with Hiro as the interface. Will publish releases to wiki and to standard mailing lists to ensure proper communication.
- Need more people on ops! Both centrally and at sites
- Sites would like more access to their VO boxes. eg. French people and Rod.
- ARDA: several times it was urgently needed but not available. People are not expected to edit scripts or write code, but a recovery checklist should be available.
- Most important for ops: hierarchical datasets and use of datablocks for transfer
- Agreed action: prepare checklist on DDM operations procedures so that more people, including remote people, can address problems round the clock
- Good example: a long message from Pedro on dealing with an ARDA mon problem, detailed procedural instructions on diagnosing and addressing the problem
- Clearly define things to be checked, address all possible cases
- Kaushik: main operational issues:
- Stable services. A problem particularly since 0.3 deployment. Strongly affects OSG because we run so many site services. Usually one site service crashing/down each day; hard to sustain production.
- Miguel: partly dev issue, partly deployment. Support would be simpler if site services were centralized.
- Kaushik, Alexei: scalability a concern if we centralize services. US T2s ask for a lot of data. eg. 100% of AODs, unlike other T2s.
- If recommendation is to go to 1 site service, scalability of that needs to be demonstrated first
- Expectation is that when new service release is deployed, it is stable and ready for production.
- Hiro: hard to do adequate testing in prod environment. And in isolated tests, coverage is incomplete. After production deployment, unforeseen problems grind things to a halt.
- All agree longer testing and less patching is needed for deployed releases.
- Run an extra VO box at BNL, eg. to serve T3s, and use to thoroughly test new release candidates in realistic environment
- Rob finds it an interesting idea, for testing at realistic scale. It may be possible to add this to the integration environment. Address at SLAC workshop.
- Data transfer shares is another concern for Kaushik. Big ops problem. When eg. AOD data movement is added to Panda production data movement, problems start.
- Miguel: addressed in 0.4.0. Shares are working much better.
- Later in the meeting: noted by Alexei that DB release dataset distribution happened very quickly, taking priority over thousands of other queued transfers.
- Alexei: as soon as 0.4 deployed in US, resume AOD transfers with default share, and use production share for production. Look at performance.
- Hiro: will run tests with UTA running 0.4. Many subscriptions to multiple shares to examine fair share performance.
- Miguel stresses need to use good hardware. Heavy CPU usage, from GSI alone.
- DO NOT run site services MySQL DB and site services themselves on same machine. Combines CPU-intense services with I/O intense DB.
- Miguel: saw huge improvement when these are separated.
- Rob: an important point. Implies new hardware, we have not done this before.
- Miguel currently using midrange server, not particularly new.
- Dantong: can you be more specific on service hardware specifications. Miguel will provide info.
- Will have a 0.4 release on timescale of a release. (Need to move away from deploying 'release candidates'.)
- Will then do AOD subscriptions for 1 or 2 sites.
- Also Great Lakes has asked for 3 different types of data; can test fair share with different shares for these.
- Miguel: first 80% of the problem is easy. Last 20% is where the work is, and kills you with lost files if you're not extremely careful, eg. with tricky race conditions between checking catalog and starting transfer.
- Kaushik: recognize this is not a full scale system. Works for dispatch blocks. Solves a big problem with prestaging. Very stable. Slower than DQ2. DQ2 can move more files faster. Queue issues, need optimization of job flow through the system.
- Hiro: why not use dccp -P to prestage in bulk, earlier in the process. Allow HPSS more opportunity to optimize retrieval. Have done this for 20-30k files. Not instantly, but at 'Condor pace' would be OK.
- Rod also interested in prestaging this way.
- Kaushik - role we see for the data mover
- On for all dispatch blocks from BNL to T2s. Many issues we're working through. Major motivation for using it is staging issue. Have improved that part of the code in this meeting, much faster. It is killing dCache.
- Miguel sees problems with balancing the systems together. Doing coherent throttling is important. DQ2 and DataMover in competition is not going to work. Just kills underlying storage.
- Kaushik: don't plan to try this for anything other than dispatch blocks, where it is only mechanism for prestaging.
- Miguel: can try other things but must test very carefully. Keep in mind that the purpose of DQ2 is to serve not one site, not one channel. Could tune up one site's performance, but does not solve DM problem to serve many sites.
- Stability was the first motivation for this. PDM doesn't suffer from this. Other issue was stagein. We will stick with this usage.
- Miguel: would like to know how you do the stagein. This would be a useful contribution to the larger DDM effort. Not addressed in DQ2 to this point.
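Returning to Alexei's point in the DQ2 status notes about cleaning tid datasets of duplicate files with different retry numbers: the extra scan could look roughly like the sketch below. The record layout (`FileEntry` with `lfn`, `retry`, `guid`) and the rule of keeping the highest retry number are assumptions made for illustration; the notes do not say how retry numbers are stored or which copy is kept.

```python
from collections import namedtuple

# Hypothetical record layout; the notes do not specify how retries are stored.
FileEntry = namedtuple("FileEntry", ["lfn", "retry", "guid"])


def clean_duplicates(entries):
    """Keep one entry per logical file name: the one with the highest retry number.

    Assumption: the latest retry is the copy to keep.
    """
    best = {}
    for entry in entries:
        current = best.get(entry.lfn)
        if current is None or entry.retry > current.retry:
            best[entry.lfn] = entry
    # Sort by lfn so the cleaned dataset version has a stable order.
    return sorted(best.values(), key=lambda e: e.lfn)


# Made-up example content for a tid dataset.
entries = [
    FileEntry("evgen._00001.pool.root", 1, "guid-a"),
    FileEntry("evgen._00001.pool.root", 2, "guid-b"),
    FileEntry("evgen._00002.pool.root", 1, "guid-c"),
]
cleaned = clean_duplicates(entries)
```

In this picture `cleaned` is what would be registered as the new dataset version, leaving the original version untouched.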
Wed pm discussions
- add to monitoring capability to list jobs associated with a task, using new taskid field in job record
- brokerage weight based on GB that would actually have to be moved. Use size of missing files. Number of GB rather than number of files.
- approach to improving prestaging in PandaMover was identified
- matchmaking on releases.
- Release checking should be based on reasonably reliable check that the release is actually there. eg. presence of setup file
- Release presence info is in the schedconfig DB. Presently refreshed by every autopilot job. Change to periodic refresh by dedicated autopilot job type.
- pulling jobs for clouds from proddb. The chosen cloud needs to be fed back to Eowyn. Extif now has the info from the Panda job record, needs to send it back in a status update.
- Extif needs to be smarter in getting job assignments in proportion to cloud structure.
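The brokerage weight item above (GB that would actually have to be moved, rather than number of files) can be sketched as follows. This is a minimal illustration, not Panda's brokerage code: the function names, the dictionary inputs, and the site names are all hypothetical.

```python
def gb_to_move(input_files, files_at_site):
    """Return the GB that must be transferred to a site.

    input_files: dict mapping file name -> size in bytes (hypothetical layout)
    files_at_site: set of file names already present at the site
    """
    missing_bytes = sum(
        size for name, size in input_files.items() if name not in files_at_site
    )
    return missing_bytes / 1e9


def rank_sites(input_files, site_contents):
    """Order candidate sites by GB to move, least data movement first."""
    return sorted(
        site_contents,
        key=lambda site: gb_to_move(input_files, site_contents[site]),
    )


# Made-up job inputs: one large file and two small ones.
input_files = {"a": 2 * 10**9, "b": 5 * 10**8, "c": 5 * 10**8}
site_contents = {
    "SITE_X": {"a"},       # misses two files, but only 1 GB in total
    "SITE_Y": {"b", "c"},  # misses one file, but it is 2 GB
}
ranking = rank_sites(input_files, site_contents)
```

Counting files would favour SITE_Y (one missing file); weighting by size favours SITE_X, which is the behaviour the item asks for.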
Panda production status, issues Thursday morning
- Update Panda shift wiki with updated instructions. Include analysis. Shift people should look at analysis savannah bugs.
- Rod: foresees providing shift team member from Canada. Kaushik: good, not immediately necessary.
- Need a shift training meeting soon. Should cover analysis.
- Significant increase in analysis support load will come. Need better coverage for user bug reports, support.
- Train shifters on pathena basics? Discuss in a shift training meeting.
- Torre will look at PAS/Omega possibilities for increased pathena support
- Canadian shifter - possible role is technical user support for analysis, including analysis/athena knowledge
- Good FAQ will help for user support
Miscellaneous points from talks
- Rod - LCG production
- should use ATLAS production role everywhere for LFC accesses
- would like to be able to get list of files Panda requires on a site. Together with modify time, can be basis for cleanup that protects files Panda actually needs. Any files not on this list are eligible for deletion.
- want support for dccp in, lcg-cr out in pilot
- working out statistics for a cloud: how many jobs to request for a cloud. Make it more concrete how to get this information from Panda, eg. GetJobStatistics giving activated and assigned job counts for a cloud, for use in the executor. Kaushik: this comes from a table; the table will have to be extended.
- Patrick: use flag on LRC http interface to indicate production data location, so a production-specific storage path can be returned, distinct from the default taken from ToA. Put it in a local config file on the machine (not in ToA).
- Torre: task assignments to clouds will be added to cloud page on monitor
- Add to queuedata DB whether a queue supports job retrying
- Ming - DRBD
- Sounds promising. Needs more testing to stress test to the point where any performance degradation from the replication on every disk write can be seen.
- Good point from Miguel: can configure InnoDB to flush to disk only every few seconds.
- Paul - pilot developments
- we don't understand why stderr can't be written to a file. Torre, Rod will discuss with Paul
- we need a job recovery DB.
- need systematic use of timestamps in pilot log files.
- want support for dccp in, lcg-cr out in pilot
- Alexei: DDM priority input from this meeting to sw week discussions? Miguel: will include this input.
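Rod's request above, a list of the files Panda requires on a site combined with modify time as the basis for cleanup, could be used along the lines of this sketch. The function name, the input layout, and in particular the grace period that spares recently modified files are assumptions for illustration; the notes only say that files not on the list are eligible for deletion.

```python
import time


def deletion_candidates(site_files, required_by_panda, min_age_seconds, now=None):
    """Return files that are safe to delete under the assumed policy.

    site_files: dict mapping path -> modify time in epoch seconds (hypothetical)
    required_by_panda: set of paths Panda reports as still needed
    min_age_seconds: assumed grace period; newer files are never deleted
    """
    now = time.time() if now is None else now
    return sorted(
        path
        for path, mtime in site_files.items()
        if path not in required_by_panda and now - mtime >= min_age_seconds
    )


# Made-up site listing, with a fixed 'now' so the example is reproducible.
site_files = {"/data/a": 100, "/data/b": 100, "/data/c": 990}
required = {"/data/a"}
candidates = deletion_candidates(site_files, required, min_age_seconds=60, now=1000)
```

Here `/data/a` is protected because Panda needs it, `/data/c` is protected because it was modified too recently, and only `/data/b` is offered for deletion.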
Attached slides
- The FNAL/CMS GlideinWMS: experience at BNL, Maxim Potekhin (203.5K, TorreWenaus, 04 Oct 2007 - 09:19)
- MySQL mirroring with DRBD, Ming Yue (423.5K, TorreWenaus, 04 Oct 2007 - 13:52)
- The Panda Simulator, Aaron Thor (93.5K, TorreWenaus, 04 Oct 2007 - 11:32)
- Pilot plans and todo list, Paul Nilsson (36.0K, TorreWenaus, 04 Oct 2007 - 11:44)
- (no description) (30.5K, TorreWenaus, 03 Oct 2007 - 10:01)
- Introduction to Pilot Factory using Schedd Glidein (160.0K, PohsiangChiu, 04 Oct 2007 - 01:13)
- slides for PandaMover (44.0K, TadashiMaeno, 03 Oct 2007 - 09:22)
- DQ2 new features, Miguel Branco (234.0K, TorreWenaus, 05 Oct 2007 - 16:09)
- DQ2 status, Miguel Branco (272.5K, TorreWenaus, 05 Oct 2007 - 16:08)
- Panda security issues (183.6K, TorreWenaus, 04 Oct 2007 - 14:17)
- Panda production on LCG, Rod Walker (16.5K, TorreWenaus, 04 Oct 2007 - 09:44)