The main Panda web pages
are in the ATLAS wiki at CERN and take precedence over the material presented below. Pages here are supplementary to those, and largely deprecated.
The Panda Production ANd Distributed Analysis system has been developed by
ATLAS since summer 2005 to meet ATLAS requirements for a data-driven
workload management system for
production and distributed analysis processing capable of operating at
LHC data processing scale.
ATLAS processing and analysis places challenging requirements on throughput, scalability, robustness, efficient resource utilization,
minimal operations manpower, and tight integration of data management
with processing workflow.
Panda was initially developed for US based ATLAS production and analysis, and assumed that role in late 2005.
Since September 2006 Panda has
also been a principal component of the US Open Science Grid (OSG) program
in just-in-time (pilot-based) workload management. In October 2007 Panda was
adopted by the ATLAS Collaboration
as the sole system for distributed processing production
across the Collaboration. At time of writing (May 2008) the transition to
Panda based production across ATLAS is almost complete, and Panda project
activity is expanding to include participants outside the US.
Panda has processed close to 12 million jobs as of May 2008, currently at a typical rate of about 50k jobs/day and 14k CPU wall-time hours/day for production at 100 sites around the world, and 3-5k jobs/day for analysis. The Panda analysis user community numbers over 500, about 100 of whom are heavy/frequent users at any given time.
As ATLAS datataking ramps up over the next few years, job counts are estimated to reach on the order of 500k jobs/day, with the greatest increase coming from analysis.
Design and implementation
Principal design features
The principal features of Panda's design as driven primarily by ATLAS operational requirements (also by considerations of more general OSG use) are as follows.
Architecture and workflow
- Support for both managed production and individual users (analysis) so as to benefit from a common WMS infrastructure and to allow analysis to leverage production operations support, thereby minimizing overall operations workload.
- A coherent, homogeneous processing system layered over diverse and heterogeneous processing resources. This helps insulate production operators and analysis users from the complexity of the underlying processing infrastructure. It also maximizes the amount of Panda systems code that is independent of the underlying middleware and facilities actually used for processing in any given environment.
- Extensive direct use of Condor (particularly CondorG), as a job submission infrastructure of proven capability and reliability. While other job submission approaches are also supported such as local batch and GLite, CondorG is the backbone submission system of Panda and has been very successful as such.
- Use of pilot jobs for acquisition of processing resources. Workload jobs are assigned to successfully activated and validated pilots based on Panda-managed brokerage criteria. This 'late binding' of workload jobs to processing slots prevents latencies and failure modes in slot acquisition from impacting the jobs, and maximizes the flexibility of job allocation to resources based on the dynamic status of processing facilities and job priorities. The pilot is also a principal 'insulation layer' for Panda, encapsulating the complex heterogeneous environments and interfaces of the grids and facilities on which Panda operates.
- Coherent and comprehensible system view afforded to users, and to Panda's own job brokerage system, through a system-wide job database that records comprehensive static and dynamic information on all jobs in the system. To users and to Panda itself, the job database appears essentially as a single attribute-rich queue feeding a worldwide processing resource.
- System-wide site/queue information database recording static and dynamic information used throughout Panda to configure and control system behavior from the 'cloud' (region) level down to the individual queue level. The database is an information cache, gathering data from grid information systems, site experts, the data management system and other sources. It supports information access and limited remote control functions via an http interface. It is used by pilots to configure themselves appropriately for the queue they land on; by Panda brokerage for decisions based on cloud and site attributes and status; and by the pilot scheduler to configure pilot job submission appropriately for the target queue.
- Integrated data management based on dataset (file collection) based organization of data files, with datasets consisting of input or output files for associated job sets. Movement of datasets as required for processing and archiving is integrated directly into Panda's workflow. This design matches the data-driven, dataset-based nature of the overall ATLAS computing organization and workflow.
- Automated pre-staging of input data (either to a processing site or out of mass storage, or both) and immediate return of outputs, all asynchronously, minimizing data transport latencies and delivering (for analysis) the earliest possible first results. Pre-staging of input data prior to initiation of jobs is an essential efficiency and robustness measure, such that jobs themselves are not subject to data staging/transfer latencies and failure modes, improving efficiencies for job completion and resource utilization.
- Support for running arbitrary user code (job script), as in conventional batch submission. There is no ATLAS specificity or workload restrictions in the design.
- Easy integration of local resources. Minimum site requirements are a grid computing element or local batch queue to receive pilots, outbound http support, and remote data copy support using grid data movement tools.
- Simple client interface allows easy integration with diverse front ends for job submission to Panda.
- Authentication and authorization is based on grid certificates, with the job submitter required to hold a grid proxy and VOMS role that is authorized for Panda usage. User identity (DN) is recorded at the job level and used to track and control usage in Panda's monitoring, accounting and brokerage systems. The user proxy itself can optionally be recorded in MyProxy for use by pilots processing the user job, when pilot identity switching/logging (via gLExec) is in use.
- Support for usage regulation at user and group levels based on quota allocations, job priorities, usage history, and user-level rights and restrictions.
- A comprehensive monitoring system supporting production and analysis operations; user analysis interfaces with personalized 'My Panda' views; detailed drill-down into job, site and data management information for problem diagnostics; usage and quota accounting; and health and performance monitoring of Panda subsystems and the computing facilities being utilized.
- Security in Panda employs standard grid security mechanisms -- see PandaSecurity
Jobs are submitted to PanDA via a simple python client interface by which users define job sets, their associated datasets and the input/output files within them.
Job specifications are transmitted to the PanDA server via secure http (authenticated via a grid certificate proxy), with submission information returned to the client. This client interface has been used
to implement PanDA front ends for ATLAS production (PandaExecutorInterface
), distributed analysis (pathena? ), and US regional production.
The PanDA server receives work from these front ends into a global job queue, upon which a brokerage module operates to prioritize and assign work on the basis of job type, priority, input data and its locality, available CPU resources and other brokerage criteria. Allocation of job sets to sites is followed by the dispatch of corresponding input datasets to those sites, handled by a data service interacting with the ATLAS distributed data management system. Data pre-placement is a strict precondition for job execution: jobs are not released for processing until their input data arrives at the processing site. When data dispatch completes, jobs are made available to a job dispatcher.
An independent subsystem manages the delivery of pilot jobs to worker nodes via a number of scheduling systems. A pilot once launched on a worker node contacts the dispatcher and receives an available job appropriate to the site. If no appropriate job is available, the pilot may immediately exit or may pause and ask again later, depending on its configuration (standard behavior is for it to exit). An important attribute of this scheme for interactive analysis, where minimal latency from job submission to launch is important, is that the pilot dispatch mechanism bypasses any latencies in the scheduling system for submitting and launching the pilot itself. The pilot job mechanism isolates workload jobs from grid and batch system failure modes (a workload job is assigned only once the pilot successfully launches on a worker node). The pilot also isolates the Panda system proper from grid heterogeneities, which are encapsulated in the pilot, so that at the Panda level the grid(s) used by Panda appears homogeneous. Pilots generally carry a generic 'production' grid proxy, with an additional VOMS attribute 'pilot' indicating a pilot job. Analysis pilots may use glexec to switch their identity on the worker node to that of the job submitter (see PandaSecurity
The overall PanDA architecture is shown below (click on the image to get a pdf file).
Job flow in PanDA is shown here.
The implementation of PanDA is shown below. Supported front ends are the ATLAS production system, pathena (a distributed analysis interface to the ATLAS offline software framework, Athena), and a generic submission tool used by non-ATLAS VOs. The PanDA server containing the central components of the system is implemented in python (as are all components of PanDA, and the DDM system DQ2) and runs under Apache as a web service (in the REST
sense; communication is based on HTTP GET/POST with the messaging contained in the URL and optionally a message payload of various formats). Relational databases (MySQL in initial implementation, later migrated to Oracle) implement the job queue and all metadata and monitoring repositories. Condor-G and PBS are the implemented schedulers in the pilot scheduling (resource harvesting) subsystem. A monitoring server works with the databases, including a logging DB populated by system components recording incidents via a simple web service behind the standard python logging module, to provide web browser based monitoring and browsing
of the system and its jobs and data.
- PandaServer - central PanDA hub composed of several components that make up the core of PanDA. Implemented as a stateless REST web service over Apache mod_python and with a RDBMS back end
- PandaTaskBuffer - the PanDA job queue manager, keeps track of all active jobs in the system
- PandaBrokerage - matches job attributes with site and pilot attributes. Manages the dispatch of input data to processing sites, and implements PanDA's data pre-placement requirement
- PandaJobDispatcher - receives requests for jobs from pilots and dispatches job payloads. Jobs are assigned which match the capabilities of the site and worker node (data availability, disk space, memory etc.) Manages heartbeat and other status information coming from pilots.
- PandaDataService - data management services required by the PanDA server for dataset management, data dispatch to and retrieval from sites, etc. Implemented with the ATLAS DDM system? .
- PandaDB - database for PanDA
- PandaClient - PanDA job submission and interaction client
- AutoPilot - Pilot submission, management and monitoring system. Supersedes first generation PandaJobScheduler?
- PandaPilot - the lightweight execution environment for PanDA jobs. Pilots request and receive job payloads from the dispatcher, perform setup and cleanup work surrounding the job, and run the jobs themselves, regularly reporting status to PanDA during execution. Pilot development history is maintained in the pilot blog.
- PandaSchedConfig - Db table to configure resources - used by Panda server and AutoPilot
- PandaMonitor - web based monitoring and browsing system that provides an interface to PanDA for operators and users
- PandaLogger - logging system allowing PanDA components to log incidents in a database via the Python logging module and http
- Bamboo? - interface between PanDA and the ATLAS production database. Supersedes the PandaExecutorInterface
- PanDA extensions - Extending PanDA for general use (Open Science Grid WMS program)
Security in Panda employs security mechanisms that are standard in web and grid circles, augmented by built-in capabilities to monitor, track and control usage. See PandaSecurity? for information.
Notes and action items
- 12 Jan 2013