Open Science Grid (OSG) middleware extensions activity at BNL
Program overview
The Open Science Grid (OSG) is a consortium recently awarded funding
as a follow-on to the Particle Physics Data Grid (PPDG) and other
US grid projects, with the
broadened objectives of establishing a US computing grid
infrastructure serving LHC and other application science needs, and
integrating this US infrastructure with international grids such as
the LHC Computing Grid (LCG). The OSG work program
has two principal areas: first, providing an open distributed computing
facility through computing center collaboration and a common
middleware foundation; and second, extending OSG capability through
targeted, science-driven software tools that augment the
foundation middleware and are required by LHC and other demanding
science applications. BNL is involved in OSG through ATLAS and STAR
and participates in both OSG activity areas. BNL's focus topics in the facility
area are security systems and mass storage interfaces. In the
software extensions area we focus principally on workflow management and
distributed data management and storage issues. The high-level BNL tasks are:
- Provide distributed production and data management services at BNL and among US ATLAS collaborating institutions, coherently with international ATLAS via integration with CERN and LCG services.
- Provide distributed data management services for STAR between BNL and LBNL.
- Augment these production-oriented job and data management services with support for ATLAS and STAR distributed analysis.
The purpose of the software component of this effort is to:
- work with OSG participants and collaborating projects (particularly Condor) in developing a workflow management system supporting US ATLAS distributed production/analysis and wider OSG use, based primarily on integration of existing Condor and experiment software
- support the integration, deployment and operation of this system for ATLAS
- provide support and maintenance of this system for OSG users
- participate in the leadership of the extensions area of OSG
Manpower
OSG support for extensions work at BNL is three FTEs. The ATLAS contribution as of Oct 2006 is about 1.4 FTEs.
Planned activities for the three OSG-supported FTEs are as follows. The work will involve collaboration with the OSG community, particularly the Condor project (M. Livny et al.), CMS (F. Wuerthwein et al.), and STAR (J. Lauret et al.).
- 0.5 FTE: Deputy extensions co-coordinator, supporting BNL's leadership role in OSG extensions work (technical management, OSG user liaison).
- 1.0 FTE: Contribute to just-in-time workload management extensions. Adapt/extend experiment distributed production/analysis systems for a generic workload management service, in collaboration with middleware providers (principally Condor).
- 1.0 FTE: Middleware evaluation, integration for workload management. Work with middleware providers (principally Condor) to identify and/or specify middleware components useful to just-in-time workload management; test and evaluate performance and functionality of select middleware components; and integrate select middleware into the generic workload management service and associated toolkit.
- 0.5 FTE: Contribute to the storage management extensions area, particularly in evaluation of storage management tools and their integration into overall distributed processing workflow.
Timeline
The high-level timeline is as follows. See PandaExtensions and StorageExtensions for more detail and milestones.
First 6 months (from Sep 2006):
- Deepen Panda integration with grid middleware
- Evolve Panda into a generic workflow management system deployed for general use on OSG
- Define and begin joint work with Condor on Panda-Condor integration to leverage more Condor middleware and increase the functionality while decreasing the 'footprint' of Panda
- Evaluate and possibly integrate select promising middleware components in workload and storage management
- Explore with STAR at BNL and others possible collaborative development of distributed production, analysis and data management services
- Participate in development of data transfer test suite (initially SRM/dCache)
6-12 months (to fall 2007):
- Full deployment of a generic workflow management system to OSG, together with collaboratively developed tools defined in the first six months
- Completion of the first phase of Panda/Condor integration
Year 2
- Define and execute the second phase of Panda/Condor integration
- Iterate on the generic workflow management implementation and feature set in light of production deployment experience
- Continue with select middleware evaluation, feedback to developers, possible integration
Year 3+
- Iterative development with lessons from production deployment, OSG community feedback
- Ongoing middleware evaluation and feedback
- Integration of newly matured and validated middleware
- Scaling in support of scale-up by science communities
- Shift emphasis from integration/development to hardening and support
OSG Year 2 US ATLAS science milestones
- Successful support for M3-M6 cosmic sample processing - Jan 5, 2008. Data distribution and processing support utilizing US ATLAS and opportunistic resources for the M3-M6 cosmic samples (cosmic running through Nov/Dec). Commitment to 100% availability of these samples with 48hr latency, at T1 and requesting T2s. Commitment to 100% availability of select analysis (AOD etc) samples, with processing support, at T1 and at least 2 T2s. Demonstrated capability to utilize these samples from T3s and opportunistic OSG resources.
- Support for data distribution for T1, T2 at 95% average data availability with 48hr latency - March 1, 2008. At least 95% average data availability with 48hr availability latency for full T1, T2 data samples serving US ATLAS community (AODs, DPDs, ESD samples, conditions), at T1 and all T2s.
- 100% utilization of full T1 & T2 processing capacity - July 1, 2008. Quantitative metric for job counts based on US ATLAS facility capacity. 100% utilization of full 2007 capacity.
- Use of 10 non-ATLAS opportunistic sites - July 1, 2008. Quantitative commitment to number of opportunistic non-US ATLAS resources (10 sites) that are production-capable.
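The availability commitments above can be made concrete with a small calculation. The sketch below assumes one plausible reading of the metric, which the milestones themselves do not define: a sample counts as available at a site if its copy completed within 48 hours of the sample being registered, and availability is the fraction of (sample, site) pairs that meet this bound. The record layout, the names `SampleDelivery` and `availability`, and the example samples and sites are hypothetical illustrations, not part of any Panda or ATLAS data management interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

LATENCY_LIMIT = timedelta(hours=48)  # latency bound quoted in the milestones


@dataclass(frozen=True)
class SampleDelivery:
    """One (sample, site) pair; a hypothetical record layout for illustration."""
    sample: str
    site: str
    registered: datetime         # when the sample became available for distribution
    arrived: Optional[datetime]  # when the site copy completed, or None if it never did


def availability(deliveries: Iterable[SampleDelivery],
                 limit: timedelta = LATENCY_LIMIT) -> float:
    """Return the fraction of deliveries completed within `limit` of registration."""
    records = list(deliveries)
    if not records:
        raise ValueError("no deliveries to evaluate")
    on_time = sum(
        1 for d in records
        if d.arrived is not None and d.arrived - d.registered <= limit
    )
    return on_time / len(records)


if __name__ == "__main__":
    # Made-up sample and site names, purely for demonstration.
    t0 = datetime(2008, 2, 1, 0, 0)
    example = [
        SampleDelivery("AOD.0001", "T1", t0, t0 + timedelta(hours=6)),
        SampleDelivery("AOD.0001", "T2_A", t0, t0 + timedelta(hours=30)),
        SampleDelivery("AOD.0001", "T2_B", t0, t0 + timedelta(hours=50)),  # late
        SampleDelivery("ESD.0001", "T1", t0, None),                        # missing
    ]
    frac = availability(example)
    print(f"availability within 48h: {frac:.0%}")
    print("meets 95% target:", frac >= 0.95)
```

Under this reading, the 95% milestone is met when `availability(...)` returns at least 0.95 over all T1 and T2 deliveries in the reporting period; a per-site average would be an equally defensible definition and would need a grouping step.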
High level milestones
| Date | Milestone |
| Mar 28, 2007 | Complete deployment of dCache at US ATLAS Tier 2s |
| Jun 1, 2007 | Production deployment of experiment-neutral Panda as supported general OSG service |
| Jun 15, 2007 | ATLAS validation of OSG infrastructure and extensions in full-chain production challenge |
| Sep 1, 2007 | Deliver Panda/Condor integration phase 1 |
| Nov 30, 2007 | Workload and data management extension performance baseline documentation completed |
| Dec 1, 2007 | Panda/Condor integration phase 1 deployed and in production on OSG |
| Sep 1, 2008 | Deliver Panda/Condor integration phase 2 |
| Dec 1, 2008 | Panda/Condor integration phase 2 deployed and in production on OSG |
Feb 7, 2007 - Progress since Sep 2006
- Defined initial program in Panda/Condor/CMS OSG extensions collaboration, on pilot factories (Sep)
- Defined initial program in generic VO-neutral Panda development (TestPilot subsystem design) (Sep)
- First dedicated manpower on BNL OSG program (UT Arlington student, 50%) (Sep)
- Established BNL Condor testbed for Panda/Condor program (Sep/Oct)
- Demonstrated Panda pilot operation and job processing on generic OSG, LCG sites (Oct)
- Demonstrated Panda job processing for non-ATLAS VO (CHARMM) on OSG (Nov)
- Reached operating scale of 250 Panda queues at almost 200 CEs (gatekeepers) across OSG and LCG (Dec)
- Defined collaborative program with CMS as well as Condor, using CMS glide-in factory as pilot factory basis (Dec)
- Established second Panda instance (CERN) to test multi-instance operation (Dec)
- Investigated and tested various Condor components for use in pilot factory and Panda integration (Oct-Dec)
- Wrote pilot factory design/implementation plan http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/PilotFactoryPlan (Dec)
- Resolved Panda DB performance issues, improving scalability of current Panda configuration by ~5-10x (Jan)
- Adapted generic VO-neutral Panda subsystem (TestPilot) to support ATLAS production (Jan)
- Proceeding with BNL OSG extensions hires (after hold due to budget issues) (Feb)
GUMS development
OSG also provides support for a GUMS development program at BNL.
Program details
See PandaExtensions for details on the OSG extensions program in 'just-in-time workload management'.
See StorageExtensions for notes on the OSG-wide program (not just BNL) in storage systems.
The evolving complete OSG WBS
Major updates:
-- TorreWenaus - 25 Sep 2006