r15 - 23 Feb 2007 - 13:27:11 - TorreWenaus

OSG Workload Management -- Extending Panda for the OSG


Introduction

The Applications Area of the Open Science Grid (OSG) organizes and manages a small number of science-driven projects that extend the existing grid middleware to support higher-level services and offer them to the OSG community. One of these projects is in workload management: developing a 'just-in-time' workload management system and tool set based on the 'pilot job' approach to job submission and management.

For some background see this excerpt of an OSG proposal in this area. For more information on the OSG extensions program at BNL see OSGAtBNL.

A major part of the project is generalizing the US ATLAS Panda system into a VO-neutral workload management system supported on OSG for general use. Central to this generalization is an integration program with Condor, to be defined and carried out through this project, with the objective of moving functionality out of Panda proper and into a Condor layer. This will slim down the Panda layer while leveraging the functionality, generic nature, and grid-universality of Condor to deliver a powerful and broadly usable just-in-time workload management system. The community can use either the full system or subcomponents, such as the Condor extensions for highly scalable just-in-time workload management that we expect to emerge from this project.
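As a rough illustration of the 'pilot job' approach named in this introduction, the sketch below shows late binding: a job stays in a central queue and is assigned only when a pilot already running on a worker node asks for work. The names JobQueue, get_job and run_pilot are hypothetical and chosen for illustration; they are not Panda or Condor interfaces.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Job:
    job_id: int
    command: str


class JobQueue:
    """Hypothetical central queue; jobs wait here until a pilot asks for one."""

    def __init__(self) -> None:
        self._jobs: Deque[Job] = deque()

    def submit(self, job: Job) -> None:
        self._jobs.append(job)

    def get_job(self, site: str) -> Optional[Job]:
        # Late binding: a job is handed out only when a pilot that is
        # already running at some site requests work. The site argument is
        # unused here; a real system could use it to pick a suitable job.
        return self._jobs.popleft() if self._jobs else None


def run_pilot(queue: JobQueue, site: str) -> Optional[str]:
    """A pilot starts on a worker node, then pulls a job if one is waiting."""
    job = queue.get_job(site)
    if job is None:
        return None  # no work available, so the pilot simply exits
    return f"site {site} ran job {job.job_id}: {job.command}"


queue = JobQueue()
queue.submit(Job(1, "simulate"))
first = run_pilot(queue, "SITE_A")   # picks up job 1
second = run_pilot(queue, "SITE_B")  # queue is empty, returns None
```

The point of the design is that a pilot landing on a broken or slow resource costs only an empty pilot, not a lost user job.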

The project combines ATLAS in-house work funded by US ATLAS, primarily on generalizing the existing Panda into a generic system, with middleware extension and integration work funded by OSG, carried out primarily at BNL in collaboration with the Condor project (Miron Livny/UWisconsin et al) and CMS (Frank Wuerthwein/UCSD et al). Collaboration with STAR (Jerome Lauret/BNL et al) on some aspects is also anticipated.

Project participants

  • Torre Wenaus, BNL (ATLAS)
  • Barnett Chiu, UT Arlington/BNL (ATLAS)
  • Sudhamsh Reddy, UT Arlington/BNL (ATLAS)
  • Miron Livny et al, U Wisconsin Madison (Condor)

Collaborating partners:

  • Frank Wuerthwein, UC San Diego (CMS)

Current activities

The project is just starting up (Sep 2006). The initial highest-priority activities are discussed in the program outline below.

Initial activities:

Program outline

Four principal, concurrent activities:

  • Generalization of existing Panda to an experiment-neutral just-in-time workload manager
    • Remove ATLAS specificity and make it a generic, modular system usable by any VO via standard interfaces and VO-specific customization via plugins
    • Usable by, and supported for, any OSG VO
    • Supporting VO-defined back end job submission tools and data management tools
  • Selective middleware technology studies and functionality/performance evaluations
    • Select technologies for integration with generic Panda
  • Integration of select middleware components -- particularly Condor components -- into generic Panda
    • Collaborating with Condor et al on needed middleware extensions
    • Objective to 'slim down' Panda itself while increasing its functionality and generic support for any VO and a wide range of grid resources
    • Program divided into two phases, integration phase 1 (IP1) and IP2
  • Inter-experiment collaboration
    • Identifying and integrating high level components from other experiments which are/could be common tools
    • Particular focus on CMS and STAR
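The first activity in the outline above, a generic core with VO-specific customization via plugins, might be structured along the following lines. This is only a sketch under assumed names: VOPlugin, register_plugin, plugin_for and the "example-submit" and "example-fetch" tool names are placeholders, not part of Panda.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class VOPlugin(ABC):
    """Hypothetical interface a VO implements to customize the generic core."""

    @abstractmethod
    def submit_command(self, site: str) -> List[str]:
        """Return the VO's back-end command for submitting work to a site."""

    @abstractmethod
    def stage_in_command(self, dataset: str) -> List[str]:
        """Return the VO's data management command for fetching input data."""


class ExampleVOPlugin(VOPlugin):
    # The tool names below are placeholders, not real commands.
    def submit_command(self, site: str) -> List[str]:
        return ["example-submit", "--site", site]

    def stage_in_command(self, dataset: str) -> List[str]:
        return ["example-fetch", dataset]


_PLUGINS: Dict[str, VOPlugin] = {}


def register_plugin(vo_name: str, plugin: VOPlugin) -> None:
    _PLUGINS[vo_name] = plugin


def plugin_for(vo_name: str) -> VOPlugin:
    # The generic core looks up the plugin by VO name and never needs
    # to know which submission or data tools the VO actually uses.
    try:
        return _PLUGINS[vo_name]
    except KeyError:
        raise LookupError(f"no plugin registered for VO {vo_name!r}") from None


register_plugin("example_vo", ExampleVOPlugin())
submit = plugin_for("example_vo").submit_command("SITE_A")
```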

Descriptions of these activities follow.

Generalizing Panda

Issues in making Panda generic

  • Standardized capabilities/requirements descriptions -- ClassAds?
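To make the capabilities/requirements idea concrete, here is a minimal sketch of two-sided matching in the spirit of ClassAds: each side publishes attributes plus a requirements predicate, and a match needs both predicates to accept the other side. It uses plain Python dictionaries and lambdas as a stand-in; it is not the ClassAd language or any Condor API, and the attribute names are assumptions.

```python
from typing import Any, Callable, Dict

Ad = Dict[str, Any]


def requirements_met(ad: Ad, other: Ad) -> bool:
    """Evaluate ad's 'requirements' predicate against the other ad."""
    # An ad without a 'requirements' entry accepts anything.
    requirement: Callable[[Ad], bool] = ad.get("requirements", lambda _other: True)
    return bool(requirement(other))


def match(job: Ad, resource: Ad) -> bool:
    """Two ads match only if each side's requirements accept the other."""
    return requirements_met(job, resource) and requirements_met(resource, job)


job_ad: Ad = {
    "vo": "atlas",
    "memory_mb": 2048,
    "requirements": lambda r: r["free_memory_mb"] >= 2048 and r["os"] == "linux",
}
resource_ad: Ad = {
    "os": "linux",
    "free_memory_mb": 4096,
    "requirements": lambda j: j["vo"] in {"atlas", "charmm"},
}
# Same resource with too little free memory for the job.
small_resource_ad: Ad = dict(resource_ad, free_memory_mb=1024)

good = match(job_ad, resource_ad)
bad = match(job_ad, small_resource_ad)
```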

Selective middleware technology studies and functionality/performance evaluations

A-List tools and technologies to take up, validate, integrate

  • CMS/ARDA Dashboard together with client tools (MonALISA, http (?)) as basis for monitoring

'Yes' list -- things worth looking at

  • Standard testing infrastructure (test harness, testing strategy) for workload management components
  • Condor-C Native
  • Condor-C bulk scheduling
  • "schedd-on-the-side"? New. CMS reports good results. 10k jobs/day with excellent reliability in CMS T2s, keeping queues full. Not much scalability testing done.
  • glide-in
    • glide-in with GCB with multiple schedds. Running in CDF. Production ready.

'Maybe' list -- too little known to say at this point. Worth learning more

'No' list -- not worth investigating at this stage

  • Condor-C gLite
  • pre-WS GRAM
  • WS GRAM
  • Resource broker with classic SE
  • Resource broker with gLite SE

Open issues in tools and technologies

Condor (et al) middleware integration

Integration Phase 1

  • Site-level pilot factory based on Condor
    • schedd, glide-in based
    • evaluate Condor-C
  • Security: pilot authentication with user identity
    • glexec integration
  • Application of glide-ins
    • in the pilot factory
    • as basis for pilots, if sufficiently scalable?
  • ClassAds for describing/matching capability/requirements
  • Pilot generalization as gateway interface to grid computing resource (eg Condor pool)
  • Condor submission into Panda?
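One way to picture the site-level pilot factory listed above is as a loop that tops up each site's queue of waiting pilots. The sketch below computes how many pilots to submit per cycle; the SiteState fields, the target and the per-cycle cap are assumptions made for illustration, and a real factory would obtain queue state from Condor rather than from hard-coded numbers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SiteState:
    name: str
    idle_pilots: int    # pilots queued at the site, not yet running
    target_idle: int    # hypothetical setting: how many to keep queued
    max_per_cycle: int  # hypothetical cap on submissions per cycle


def pilots_to_submit(site: SiteState) -> int:
    """Submit just enough pilots to reach the target, within the cap."""
    deficit = site.target_idle - site.idle_pilots
    return max(0, min(deficit, site.max_per_cycle))


def factory_cycle(sites: Iterable[SiteState]) -> Dict[str, int]:
    """One factory pass: number of pilots to submit at each site."""
    return {site.name: pilots_to_submit(site) for site in sites}


plan = factory_cycle([
    SiteState("SITE_A", idle_pilots=2, target_idle=10, max_per_cycle=5),
    SiteState("SITE_B", idle_pilots=12, target_idle=10, max_per_cycle=5),
    SiteState("SITE_C", idle_pilots=8, target_idle=10, max_per_cycle=5),
])
```

Here SITE_A is capped at 5 submissions, SITE_B already exceeds its target and gets none, and SITE_C gets the 2 it is short.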

Integration Phase 2

  • Phase 1 carry-over, iteration after feedback, program adjustments
  • Generalized data-CPU co-location mechanism
  • Scalability enhancements
  • Matchmaking?
  • Workflow logic
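The data-CPU co-location item above is not specified further on this page. As a hedged illustration of the general idea only, the sketch below sends a job to a site that already holds a replica of its input dataset and has free slots. The replica catalogue and slot counts are plain dictionaries invented for the example, and choose_site is a hypothetical name.

```python
from typing import Dict, Optional, Set


def choose_site(
    dataset: str,
    replicas: Dict[str, Set[str]],
    free_slots: Dict[str, int],
) -> Optional[str]:
    """Pick the site holding the dataset that has the most free slots."""
    candidates = [
        site for site in sorted(replicas.get(dataset, set()))
        if free_slots.get(site, 0) > 0
    ]
    if not candidates:
        return None  # no co-located CPU; the caller must decide what to do
    # max() keeps the first of equal candidates, so ties break alphabetically.
    return max(candidates, key=lambda site: free_slots[site])


replicas = {"dataset_1": {"SITE_A", "SITE_B"}, "dataset_2": {"SITE_C"}}
free_slots = {"SITE_A": 3, "SITE_B": 7, "SITE_C": 0}

site_for_1 = choose_site("dataset_1", replicas, free_slots)
site_for_2 = choose_site("dataset_2", replicas, free_slots)
```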

Inter-experiment collaboration

Milestones and effort

Major milestones are in bold.

| Delivery date | Description | Effort funded by | Status |
| Sep 1 2006 | First planning meeting for OSG just-in-time workload management extensions (at FNAL); define initial program (Glide-in factory for Condor-based site-local pilot submission with centralized grid-wide pilot management (factory submission to sites via the grid)) |  | Done |
| Sep 1 2006 | Define initial BNL manpower for pilot factory effort: Barnett Chiu assisted by Torre Wenaus | ATLAS | Done |
| Sep 7 2006 | Define and initiate prototyping program for Panda generalization to support non-ATLAS usage: TestPilot | ATLAS | Done |
| Sep 15 2006 | Establish Condor testbed at BNL for pilot factory work | ATLAS | Done |
| Oct 16 2006 | Demonstrate Panda pilot operation and job processing on generic OSG, LCG sites (no ATLAS specificity) | ATLAS | Done |
| Oct 16 2006 | Demonstrate ATLAS analysis (pathena) operation on generic OSG, LCG sites | ATLAS |  |
| Oct 16 2006 | Provide framework for Panda-based execution of non-ATLAS VO workload on generic OSG | ATLAS | Done |
| Oct 30 2006 | Demonstrate Panda-based execution of non-ATLAS VO workload on generic OSG | ATLAS |  |
| Nov 1 2006 | Draft Panda/Condor integration phase 1 (IP1) plan in place (Glide-in based pilot factory) | ATLAS, OSG, Condor | Done |
| Dec 7 2006 | Second planning meeting for OSG workload management extensions (at UT Arlington); finalize IP1 plan |  | Done |
| Dec 14 2006 | Final IP1 plan and milestones -- cf Arlington | ATLAS, OSG, Condor | Done |
| Jan 1 2007 | Dedicated BNL hire(s) for OSG extensions program | OSG |  |
| Feb 1 2007 | Functionality and performance assessments for Condor-G and Condor glide-in done, with requests fed back to Condor | ATLAS, OSG |  |
| Mar 1 2007 | Deployment of prototype experiment-neutral Panda as prototype OSG service. In use by CHARMM | ATLAS, OSG | Done |
| Mar 1 2007 | OSG authentication infrastructure integration based on glexec -- depends on availability | OSG |  |
| May 1 2007 | Support deployment of OSG just-in-time workload management for ATLAS production, analysis | ATLAS, OSG |  |
| Jun 15 2007 | ATLAS validation of OSG extensions in full-chain production challenge | ATLAS |  |
| Jul 1 2007 | Production deployment of experiment-neutral Panda as supported general OSG service | OSG, ATLAS |  |
| Sep 1 2007 | Deliver Panda/Condor integration phase 1 (IP1) | OSG, ATLAS |  |
| Nov 1 2007 | IP1 validated for deployment | OSG, ATLAS |  |
| Nov 30 2007 | Workload and data management extension performance baseline documentation completed |  |  |
| Dec 1 2007 | IP1 deployed and in production on OSG | OSG, ATLAS |  |
| Feb 1 2008 | IP2 plan, milestones, deliverables defined |  |  |
| Sep 1 2008 | Deliver Panda/Condor integration phase 2 (IP2) | OSG, ATLAS |  |
| Nov 1 2008 | IP2 validated for deployment | OSG, ATLAS |  |
| Dec 1 2008 | IP2 deployed and in production on OSG | OSG, ATLAS |  |

Program for 2009+:

  • Iterative development with lessons from production deployment, OSG community feedback
  • Ongoing middleware evaluation and feedback
  • Integration of newly matured and validated middleware
  • Scaling in support of scale-up by science communities
  • Shift emphasis to hardening and support from integration/development


Major updates:
-- TorreWenaus - 19 Sep 2006
