r31 - 27 Feb 2008 - 02:17:37 - TorreWenaus

AutoPilot - Generic pilot and scheduler package for Panda


Introduction

The AutoPilot package (formerly TestPilot) provides generic implementations of Panda pilot and scheduler for use in more varied environments than the ATLAS-specific production pilots and schedulers.

The objectives of AutoPilot are to

  1. provide a pilot implementation that contains no US ATLAS or ATLAS specific content, such that it can be used in a wide range of contexts: within ATLAS but outside OSG, within OSG but outside US ATLAS, from an 'off-grid' laptop or workstation or batch queue, etc.
  2. use the HTTP GET/POST mechanisms supported and widely used by Panda, together with a modular code organization of context-specific plugins (supporting plugin implementations for particular VOs, regions, working groups, or individuals) to support easy customization of the pilot and scheduler
  3. support modular integration into the pilot of context-dependent data management tools via plugins, such that data handling specific to a region, grid, VO etc. can be implemented easily
  4. provide DQ2-based data handling plugins, based on existing tools, to support ATLAS data handling in LCG and OSG
  5. provide an interactive, command-line pilot such that Panda jobs can be run from a terminal window for debugging purposes
  6. provide a modular, generic scheduler implementation that can support an arbitrary assortment of job submission back ends, selectable via a registry of configured back ends
  7. support remote control of the scheduler such that its behaviour can be automatically and dynamically controlled by an external service (in particular, the Panda server, such that the server can regulate pilot submission rates depending on Panda job queue content and other server-resident info)
  8. provide scheduler and pilot databases acting as
    • a repository for scheduler configuration and state information
    • a registry of the job submission back ends the scheduler is able to use
    • a log for scheduler/pilot activity such that pilot submission status can be monitored remotely and automatically, in a way independent (as much as possible) from the back end job submitter used by the scheduler
  9. support scheduler usage on an individual or group basis as well as centrally-run, with the monitoring and control mechanisms to make this practical
  10. support pathena (and similar) well
  11. bring Panda (including pathena) into production on LCG resources
  12. make Panda usable by VOs other than ATLAS, in the OSG in particular, as part of the OSG extensions program in workload management
  13. provide the platform for Panda usage of the PilotFactory under development as part of the OSG extensions program
  14. support take-up and try-out of Panda by independent interested users/institutes by making the threshold to install and integrate very low
  15. evaluate Subversion as the basis of the AutoPilot code repository, the principal motivators being the ease with which Subversion repositories can be centrally managed while supporting easy and dynamic web-based retrieval of current production code
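
Objective 6 describes a scheduler that selects among job submission back ends through a registry. As a rough illustration of that idea only, a registry could be organized as below. None of these names come from the AutoPilot code: `BACKENDS`, `register_backend`, `submit_pilot`, and the two example back ends are hypothetical, and a plain dictionary stands in for the database registry described in objective 8.

```python
from typing import Callable, Dict

# Hypothetical registry: back-end name -> function that submits one pilot.
# A dict stands in here for the database-backed registry of objective 8.
BACKENDS: Dict[str, Callable[[str], str]] = {}


def register_backend(name: str):
    """Decorator that records a submit function under a back-end name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        BACKENDS[name] = func
        return func
    return decorator


@register_backend("condor")
def submit_via_condor(queue: str) -> str:
    # A real plugin would hand the pilot wrapper to the batch system here.
    return f"condor:{queue}"


@register_backend("local")
def submit_locally(queue: str) -> str:
    # Stand-in for running the pilot on the local machine.
    return f"local:{queue}"


def submit_pilot(backend: str, queue: str) -> str:
    """Look up the selected back end and submit one pilot to the queue."""
    try:
        submit = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown back end: {backend!r}") from None
    return submit(queue)
```

Adding a back end in this sketch means registering one more function; the scheduler code that calls `submit_pilot` does not change.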

Architecture

[Figure: AutoPilot schematic]

Implementation and status

The code is under the panda area of the BNL Subversion repository:
  • AutoPilot - source code for the AutoPilot generic pilot (atlasProdWrapper.py, prodPilot.py, etc.), pilot scheduler/monitor (pilotScheduler.py), and configuration DB (pilotController.py).
  • pilot3 - ATLAS-specific pilot code invoked by AutoPilot for ATLAS analysis and production jobs

A browser/monitor for this system is at http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?tp=main

The thin pilot wrapper that is actually submitted to the batch queue is atlasProdWrapper.sh (or variants).

The wrapper downloads the actual pilot code, which asks Panda for a job. The job contains a 'transformation' field that is a URL to the script to be run as the job.

The transformation script is an application- or VO-specific wrapper that does the preparatory and cleanup work surrounding the job (environment, data management, error processing). An example is http://www.usatlas.bnl.gov/svn/panda/autopilot/trunk/trans-atlasprod.sh
The transformation invokes the job-specific script+joboptions to do the real work of the job.
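
The step from job to transformation can be sketched roughly as follows. This is not the pilot's actual code: only the 'transformation' field comes from the description above, while `fetch_url`, `run_transformation`, and the `interpreter` parameter are hypothetical names chosen for this illustration, and the job is assumed to arrive as a Python dictionary.

```python
import os
import stat
import subprocess
import tempfile
import urllib.request
from typing import Callable, Sequence


def fetch_url(url: str) -> bytes:
    """Download the transformation script over HTTP."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def run_transformation(job: dict,
                       fetch: Callable[[str], bytes] = fetch_url,
                       interpreter: Sequence[str] = ("sh",)) -> int:
    """Fetch the script named by job['transformation'], run it, and
    return its exit code."""
    script = fetch(job["transformation"])
    # Write the script to a private temporary file so it can be executed.
    fd, path = tempfile.mkstemp(suffix=".sh")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(script)
        os.chmod(path, stat.S_IRWXU)
        # Assumption: the script is run through an interpreter (sh by default).
        return subprocess.call(list(interpreter) + [path])
    finally:
        os.remove(path)
```

Passing the download function in as `fetch` keeps the sketch testable without network access; a caller that wants the real download simply omits that argument.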

Development

To check out a read-only copy of the code, which works with a standard svn client, do

    svn co http://www.usatlas.bnl.gov/svn/panda/autopilot/trunk autopilot
    svn co http://www.usatlas.bnl.gov/svn/panda/pilot3
    

To check out with write access so you can commit code changes, do for example

    svn co  https://svn.usatlas.bnl.gov/svn/panda/autopilot/trunk autopilot
    svn co  https://svn.usatlas.bnl.gov/svn/panda/pilot3
    

For write access you need to use an svn client capable of handling grid certificate based authentication. At BNL you will have to add the following path to your PATH environment variable: /direct/usatlas+u/umesh/bin

If you need to set up a Subversion client to communicate with the BNL code repository from a server at a location other than BNL, this document would be very useful. The repository uses authentication by grid proxy for secure https checkouts, so the standard version of svn does not work; you need a version that supports X509 certificates (but see the note below if you only need to do read-only checkouts).

The BNL svn server is configured so that every time you access it, it asks for the location of your .p12 certificate file and for the passphrase. To avoid entering these every time you interact with the server, you will have to make a few entries in the ~/.subversion/servers file. This document on BNL Subversion usage will help you make those entries.
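
The BNL document itself is not reproduced here. As a hedged sketch, assuming the entries in question are the standard Subversion client options `ssl-client-cert-file` and `ssl-client-cert-password`, the text to put in ~/.subversion/servers could be generated as below. The group name `bnl`, the function name `servers_entries`, and the certificate path are placeholders invented for this example.

```python
import configparser
import io
from typing import Optional


def servers_entries(host: str, cert_file: str,
                    passphrase: Optional[str] = None) -> str:
    """Return text for ~/.subversion/servers that points svn at a .p12 file."""
    # Interpolation is disabled so a '%' in the passphrase is kept literally.
    config = configparser.ConfigParser(interpolation=None)
    # [groups] maps a group name (the arbitrary name 'bnl') to a host pattern.
    config["groups"] = {"bnl": host}
    config["bnl"] = {"ssl-client-cert-file": cert_file}
    if passphrase is not None:
        # Stored in plain text; leave it out to keep being prompted instead.
        config["bnl"]["ssl-client-cert-password"] = passphrase
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder certificate path; substitute the location of your own .p12 file.
print(servers_entries("svn.usatlas.bnl.gov", "/path/to/usercert.p12"))
```

Writing the passphrase into the file trades convenience for security, since it is stored unencrypted; omitting it means svn still prompts for the passphrase but no longer for the file location.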

Note that the code uses a dbaccess module which is not in the repository. It must be obtained from an existing AutoPilot source area or from T. Wenaus.

Important note: If you only require read access to the code, you can check out using http rather than https (and note the different hostname), with a standard version of svn:

    svn co  http://www.usatlas.bnl.gov/svn/panda/autopilot/trunk
    svn co  http://www.usatlas.bnl.gov/svn/panda/pilot3
    

Scheduling and monitoring pilots

The procedure to run the pilot scheduler and associated monitor is as follows:

  • the server you're going to run the scheduler on must have Condor installed and must have a basic Apache installation, with URLs accessible from the WAN (for log file access from the AutoPilot browser)
  • contact Torre Wenaus to
    • register your machine as a scheduler server
      • you will need to provide the machine name, the name of a directory where the scheduler can write log files, and a URL pointing to that directory
    • arrange for Panda queues, sites and tags to be defined as needed, to send pilots for your own use to your own resources (if applicable)
  • install Subversion (svn) client if necessary (see Development section above)
  • create a work area and cd there
  • check out the AutoPilot code from the Subversion repository: svn co https://svn.usatlas.bnl.gov/svn/panda/autopilot/trunk; mv trunk autopilot
  • cd to autopilot area
  • edit setup.sh appropriately, put it at $HOME/panda_setup.sh and source it
    • you may want to source this file in your login
  • create a file mycron containing something like 0 3,9,15,21 * * * /...path.../pilotCron.sh --tag=YOURTAG --pandasite=YOURSITE
    • if you're submitting to a single site, you can use --queue= rather than --tag= (a tag is a collection of queues marked with the tag)
    • you can add several such lines if you want to run pilots for several tags or sites
  • add also one instance of the scheduler monitor to your cron: 0 3,9,15,21 * * * /...path.../pilotCron.sh --monitor
  • register this as your cron: crontab mycron
  • do an initial launch of the scheduler(s) and monitor; the cron will take care of it thereafter. Copy the command parts of the cron entries and run them in the background, e.g.: ./pilotCron.sh --tag=YOURTAG --pandasite=YOURSITE &
  • You will probably want to adapt cleanSpace.py and add it to your cron to clean up your log areas
That should be it. If you want to stop a scheduler or monitor, use ./tellService.py NNN stop, where NNN is the service ID from the AutoPilot monitor (below). To restart it, run the appropriate pilotCron.sh instance from your mycron.
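
When pilots are run for several tags or sites, the mycron file can be generated rather than typed by hand. The following is a minimal sketch that reuses the schedule and the pilotCron.sh options from the procedure above; the function name `build_crontab` and the idea of passing (tag, site) pairs are assumptions made for this example, and the path to pilotCron.sh is left for the caller to supply.

```python
from typing import Iterable, List, Tuple

# Schedule taken from the procedure: 03:00, 09:00, 15:00 and 21:00 every day.
SCHEDULE = "0 3,9,15,21 * * *"


def build_crontab(pilot_cron: str,
                  targets: Iterable[Tuple[str, str]]) -> str:
    """Return crontab text with one scheduler line per (tag, pandasite)
    pair, followed by a single monitor line."""
    lines: List[str] = [
        f"{SCHEDULE} {pilot_cron} --tag={tag} --pandasite={site}"
        for tag, site in targets
    ]
    # Exactly one instance of the scheduler monitor, as the procedure asks.
    lines.append(f"{SCHEDULE} {pilot_cron} --monitor")
    return "\n".join(lines) + "\n"
```

The returned text can be written to mycron and registered with crontab mycron as described in the procedure.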

Scheduler instances and pilots can be monitored via the Panda AutoPilot monitor http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?tp=main

Managing the AutoPilot databases

AutoPilot uses a number of databases to manage information on pilots, queues, submit hosts and so on. One such DB is the schedconfig table, which records information on the configuration and status of all the queues known to AutoPilot. Loading and updating of these DBs is handled by the pilotController.py script. If you need to use this script, these are the steps to follow.

  • check out AutoPilot (see development section above)
  • if you are working with US sites that still rely on the old Panda site information in the Panda repository, in panda/jobscheduler/siteinfo.py, check out Panda from ATLAS CVS next to your autopilot directory. $PANDA_HOME should be set to the panda directory.
  • always before you make changes to pilotController.py, do svn update
  • if it is schedconfig you need to change, for an explanation of the schedconfig DB see this twiki
  • always promptly commit your changes, so that other pilotController.py users pick up your changes
  • always before you run pilotController.py, do svn update
  • if you are working with a panda checkout and getting siteinfo.py from it, you must ensure it also is up to date when you run pilotController.py. This is best done by using a script to do the update, with the script also updating panda CVS. See updateQueues in the autopilot svn for such a script.
  • you should have the DQ2 client environment set up when you run pilotController.py (it uses TiersOfAtlas)
  • you should have the LCG UI environment set up if you are configuring LCG sites
  • if you are configuring LCG sites, you should run lcgLoad.py from the autopilot directory before running pilotController.py (to build an updated set of queue configurations from BDII info), and use the --loadlcg option when you run pilotController.py (see the updateQueues.sh script)
  • in the unlikely event that you are working with pilot3 and need to load storage info from pilot3/storage_access_info.py, do an svn checkout of pilot3 next to autopilot and use the --loadstorage option when you run pilotController.py
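
The ordering of the steps above (update first, run lcgLoad.py before pilotController.py for LCG sites, add options as needed) can be summarized in a short sketch. It is not the content of the updateQueues script in the repository; the function name `update_commands` is hypothetical, and invoking the scripts through `python` from the autopilot directory is an assumption.

```python
from typing import List


def update_commands(lcg: bool = False, storage: bool = False) -> List[List[str]]:
    """Return the commands to run, in order, from the autopilot directory."""
    # Always refresh the checkout before running pilotController.py.
    commands = [["svn", "update"]]
    # Assumption: the scripts are started with the 'python' interpreter.
    controller = ["python", "pilotController.py"]
    if lcg:
        # Rebuild queue configurations from BDII info before loading them.
        commands.append(["python", "lcgLoad.py"])
        controller.append("--loadlcg")
    if storage:
        # Needs a pilot3 checkout next to the autopilot directory.
        controller.append("--loadstorage")
    commands.append(controller)
    return commands
```

Each returned command is an argument list, so it could be passed directly to `subprocess.check_call` once the DQ2 client and, for LCG sites, the LCG UI environments are set up.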

The schemas of the AutoPilot DBs are defined in tables.sql in the autopilot svn directory.

Address questions to Torre Wenaus (has the blame for the implementation) or Rod Walker (suffers the consequences of the implementation).

Applications

Current applications include:


Major updates:
-- TorreWenaus - 05 Nov 2007
-- TorreWenaus - 20 Sep 2006

Attachments


pdf AutoPilot.pdf (32.0K) | TorreWenaus, 25 Nov 2007 - 14:35 | AutoPilot schematic (PDF)
jpg AutoPilot.jpg (105.6K) | TorreWenaus, 25 Nov 2007 - 14:35 | AutoPilot schematic (JPG)
 