r9 - 21 Aug 2007 - 17:02:43 - TorreWenaus

Open Science Grid (OSG) middleware extensions activity at BNL


Program overview

The Open Science Grid (OSG) is a consortium recently awarded funding as a follow-on to the Particle Physics Data Grid (PPDG) and other US grid projects. Its broadened objectives are to establish a US computing grid infrastructure serving LHC and other application science needs, and to integrate this US infrastructure with international grids such as the LHC Computing Grid (LCG). The OSG work program has two principal areas: first, providing an open distributed computing facility through computing center collaboration and a common middleware foundation; second, extending OSG capability through targeted, science-driven software tools that augment the foundation middleware and are required by LHC and other demanding science applications.

BNL is involved in OSG through ATLAS and STAR and participates in both OSG activity areas. In the facility area, BNL focuses on security systems and mass storage interfaces. In the software extensions area, it focuses principally on workflow management and on distributed data management and storage. The high-level BNL tasks are:

  1. Provide distributed production and data management services at BNL and among US ATLAS collaborating institutions, coherently with international ATLAS via integration with CERN and LCG services.
  2. Provide distributed data management services for STAR between BNL and LBNL.
  3. Augment these production-oriented job and data management services with support for ATLAS and STAR distributed analysis.

The purpose of the software component of this effort is to

  • work with OSG participants and collaborating projects (particularly Condor) in developing a workflow management system supporting US ATLAS distributed production/analysis and wider OSG use, based primarily on integration of existing Condor and experiment software
  • support the integration, deployment and operation of this system for ATLAS
  • provide support and maintenance of this system for OSG users
  • participate in the leadership of the extensions area of OSG

Manpower

OSG support for extensions work at BNL is three FTEs. The ATLAS contribution as of Oct 2006 is about 1.4 FTEs.

Planned activities for the three OSG-supported FTEs are as follows. The work will involve collaboration with the OSG community, particularly the Condor project (M. Livny et al.), CMS (F. Wuerthwein et al.), and STAR (J. Lauret et al.).

  • 0.5 FTE: Deputy extensions co-coordinator, supporting BNL's leadership role in OSG extensions work (technical management, OSG user liaison).
  • 1.0 FTE: Contribute to just-in-time workload management extensions. Adapt/extend experiment distributed production/analysis systems for a generic workload management service, in collaboration with middleware providers (principally Condor).
  • 1.0 FTE: Middleware evaluation, integration for workload management. Work with middleware providers (principally Condor) to identify and/or specify middleware components useful to just-in-time workload management; test and evaluate performance and functionality of select middleware components; and integrate select middleware into the generic workload management service and associated toolkit.
  • 0.5 FTE: Contribute to the storage management extensions area, particularly in evaluation of storage management tools and their integration into overall distributed processing workflow.

Timeline

The high-level timeline is as follows. See PandaExtensions and StorageExtensions for more detail and milestones.

First 6 months (from Sep 2006):

  • Deepen Panda integration with grid middleware
  • Evolve Panda into a generic workflow management system deployed for general use on OSG
  • Define and begin joint work with Condor on Panda-Condor integration to leverage more Condor middleware and increase the functionality while decreasing the 'footprint' of Panda
  • Evaluate and possibly integrate select promising middleware components in workload and storage management
  • Explore with STAR at BNL and others possible collaborative development of distributed production, analysis and data management services
  • Participate in development of data transfer test suite (initially SRM/dCache)

6-12 months (to fall 2007):

  • Full deployment of a generic workflow management system to OSG, together with collaboratively developed tools defined in the first six months
  • Completion of the first phase of Panda/Condor integration

Year 2:

  • Define and execute the second phase of Panda/Condor integration
  • Iterate on the generic workflow management implementation and feature set in light of production deployment experience
  • Continue with select middleware evaluation, feedback to developers, possible integration

Year 3+:

  • Iterative development with lessons from production deployment, OSG community feedback
  • Ongoing middleware evaluation and feedback
  • Integration of newly matured and validated middleware
  • Scaling to support growth in use by science communities
  • Shift emphasis from integration/development to hardening and support

OSG Year 2 US ATLAS science milestones

  1. Successful support for M3-M6 cosmic sample processing - Jan 5, 2008. Data distribution and processing support utilizing US ATLAS and opportunistic resources for the M3-M6 cosmic samples (cosmic running through Nov/Dec). Commitment to 100% availability of these samples with 48hr latency, at T1 and requesting T2s. Commitment to 100% availability of select analysis (AOD etc.) samples, with processing support, at T1 and at least 2 T2s. Demonstrated capability to utilize these samples from T3s and opportunistic OSG resources.
  2. Support for data distribution for T1, T2 at 95% average data availability with 48hr latency - March 1, 2008. At least 95% average data availability with 48hr latency for the full T1, T2 data samples serving the US ATLAS community (AODs, DPDs, ESD samples, conditions), at T1 and all T2s.
  3. 100% utilization of full T1 & T2 processing capacity - July 1, 2008. Quantitative metric for job counts based on US ATLAS facility capacity. 100% utilization of full 2007 capacity.
  4. Use of 10 non-ATLAS opportunistic sites - July 1, 2008. Quantitative commitment to number of opportunistic non-US ATLAS resources (10 sites) that are production-capable.
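
The availability milestones above do not spell out how "average data availability with 48hr latency" is computed. The following is a minimal sketch of one possible reading: the fraction of (dataset, site) pairs whose replica completed within 48 hours of dataset registration. The function name, the input layout, and the dataset and site names are all hypothetical and are not taken from Panda or from any ATLAS data management tool.

```python
from datetime import datetime, timedelta

LATENCY_LIMIT = timedelta(hours=48)  # latency window named in the milestones


def availability(registered, arrivals, sites, now, limit=LATENCY_LIMIT):
    """Fraction of due (dataset, site) pairs that arrived within `limit`.

    registered: dict dataset -> datetime the dataset was registered
    arrivals:   dict (dataset, site) -> datetime the replica completed
    sites:      list of site names expected to hold every dataset
    now:        evaluation time; datasets still inside the window are skipped
    """
    due = on_time = 0
    for dataset, t0 in registered.items():
        deadline = t0 + limit
        if deadline > now:
            continue  # window still open, so this dataset is not yet counted
        for site in sites:
            due += 1
            arrived = arrivals.get((dataset, site))
            if arrived is not None and arrived <= deadline:
                on_time += 1
    return on_time / due if due else 1.0


# Hypothetical example data (assumed names, not real samples or sites).
t0 = datetime(2008, 1, 1)
registered = {
    "aod.001": t0,
    "aod.002": t0,
    "esd.001": t0 + timedelta(hours=60),  # deadline falls after `now`
}
arrivals = {
    ("aod.001", "T1"): t0 + timedelta(hours=5),
    ("aod.001", "T2_A"): t0 + timedelta(hours=40),
    ("aod.002", "T1"): t0 + timedelta(hours=10),
    ("aod.002", "T2_A"): t0 + timedelta(hours=50),  # late by 2 hours
}
result = availability(registered, arrivals, ["T1", "T2_A"],
                      now=t0 + timedelta(hours=72))
meets_target = result >= 0.95
print(f"availability = {result:.2f}, meets 95% target: {meets_target}")
```

In this example three of the four due replicas arrive on time, so the result falls short of a 95% target. Other readings of the milestone (for example, weighting by data volume rather than counting datasets) would need a different aggregation.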

High level milestones

Date            Milestone
Mar 28, 2007    Complete deployment of dCache at US ATLAS Tier 2s
Jun 1, 2007     Production deployment of experiment-neutral Panda as supported general OSG service
Jun 15, 2007    ATLAS validation of OSG infrastructure and extensions in full-chain production challenge
Sep 1, 2007     Deliver Panda/Condor integration phase 1
Nov 30, 2007    Workload and data management extension performance baseline documentation completed
Dec 1, 2007     Panda/Condor integration phase 1 deployed and in production on OSG
Sep 1, 2008     Deliver Panda/Condor integration phase 2
Dec 1, 2008     Panda/Condor integration phase 2 deployed and in production on OSG

Feb 7, 2007 - Progress since Sep 2006

  • Defined initial program in Panda/Condor/CMS OSG extensions collaboration, on pilot factories (Sep)
  • Defined initial program in generic VO-neutral Panda development (TestPilot subsystem design) (Sep)
  • First dedicated manpower on BNL OSG program (UT Arlington student, 50%) (Sep)
  • Established BNL Condor testbed for Panda/Condor program (Sep/Oct)
  • Demonstrated Panda pilot operation and job processing on generic OSG, LCG sites (Oct)
  • Demonstrated Panda job processing for non-ATLAS VO (CHARMM) on OSG (Nov)
  • Reached operating scale of 250 Panda queues at almost 200 CEs (gatekeepers) across OSG and LCG (Dec)
  • Defined collaborative program with CMS as well as Condor, using CMS glide-in factory as pilot factory basis (Dec)
  • Established second Panda instance (CERN) to test multi-instance operation (Dec)
  • Investigated and tested various Condor components for use in pilot factory and Panda integration (Oct-Dec)
  • Wrote pilot factory design/implementation plan http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/PilotFactoryPlan (Dec)
  • Resolved Panda DB performance issues, improving scalability of current Panda configuration by ~5-10x (Jan)
  • Adapted generic VO-neutral Panda subsystem (TestPilot) to support ATLAS production (Jan)
  • Proceeding with BNL OSG extensions hires (after hold due to budget issues) (Feb)

GUMS development

OSG also provides support for a GUMS (Grid User Management System) development program at BNL.

Program details

See PandaExtensions for details on the OSG extensions program in 'just-in-time workload management'.

See StorageExtensions for notes on the OSG-wide program (not just BNL) in storage systems.

The evolving complete OSG WBS


Major updates:
-- TorreWenaus - 25 Sep 2006
