r37 - 10 Jun 2010 - 10:43:08 - TorreWenaus

AutoPilot - Generic pilot and scheduler package for Panda


Introduction

The AutoPilot package (formerly TestPilot) provides generic implementations of Panda pilot and scheduler for use in more varied environments than the ATLAS-specific production pilots and schedulers.

The objectives of AutoPilot are to

  1. provide a pilot implementation that contains no US ATLAS or ATLAS specific content, such that it can be used in a wide range of contexts: within ATLAS but outside OSG, within OSG but outside US ATLAS, from an 'off-grid' laptop or workstation or batch queue, etc.
  2. use the HTTP GET/POST mechanisms supported and widely used by Panda, together with a modular code organization of context-specific plugins (supporting plugin implementations for particular VOs, regions, working groups, or individuals) to support easy customization of the pilot and scheduler
  3. support modular integration into the pilot of context-dependent data management tools via plugins, such that data handling specific to a region, grid, VO etc. can be implemented easily
  4. provide DQ2-based data handling plugins, based on existing tools, to support ATLAS data handling in LCG and OSG
  5. provide an interactive, command-line pilot such that Panda jobs can be run from a terminal window for debugging purposes
  6. provide a modular, generic scheduler implementation that can support an arbitrary assortment of job submission back ends, selectable via a registry of configured back ends
  7. support remote control of the scheduler such that its behaviour can be automatically and dynamically controlled by an external service (in particular, the Panda server, such that the server can regulate pilot submission rates depending on Panda job queue content and other server-resident info)
  8. provide scheduler and pilot databases acting as
    • a repository for scheduler configuration and state information
    • a registry of the job submission back ends the scheduler is able to use
    • a log for scheduler/pilot activity such that pilot submission status can be monitored remotely and automatically, in a way independent (as much as possible) from the back end job submitter used by the scheduler
  9. support scheduler usage on an individual or group basis as well as centrally-run, with the monitoring and control mechanisms to make this practical
  10. support pathena (and similar) well
  11. bring Panda (including pathena) into production on LCG resources
  12. make Panda usable by VOs other than ATLAS, in the OSG in particular, as part of the OSG extensions program in workload management
  13. provide the platform for Panda usage of the PilotFactory under development as part of the OSG extensions program
  14. support take-up and try-out of Panda by independent interested users/institutes by making the threshold to install and integrate very low
  15. evaluate Subversion as the basis of the AutoPilot code repository, the principal motivators being the ease with which Subversion repositories can be centrally managed while supporting easy and dynamic web-based retrieval of current production code
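
Objectives 6 and 8 describe a scheduler that selects a job submission back end from a registry of configured back ends. The idea can be illustrated with a minimal sketch. Everything named here (`BackEndRegistry`, `submit_condor`, `submit_local`, the returned strings) is hypothetical and is not taken from the AutoPilot code, where the registry is kept in a database rather than in memory.

```python
from typing import Callable, Dict, List


class BackEndRegistry:
    """Hypothetical registry mapping a back-end name to a submit function."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, submit: Callable[[str], str]) -> None:
        # Refuse duplicates so a back end cannot be silently replaced.
        if name in self._backends:
            raise ValueError(f"back end already registered: {name}")
        self._backends[name] = submit

    def names(self) -> List[str]:
        # Sorted list of the back ends the scheduler could select.
        return sorted(self._backends)

    def submit(self, name: str, queue: str) -> str:
        # Look up the configured back end and delegate the submission to it.
        try:
            backend = self._backends[name]
        except KeyError:
            raise KeyError(f"unknown back end: {name}") from None
        return backend(queue)


def submit_condor(queue: str) -> str:
    # Placeholder: a real back end would hand the pilot wrapper to Condor.
    return f"condor:{queue}"


def submit_local(queue: str) -> str:
    # Placeholder: a real back end would start the pilot on the local machine.
    return f"local:{queue}"


registry = BackEndRegistry()
registry.register("condor", submit_condor)
registry.register("local", submit_local)
```

With this arrangement the scheduler core only needs a back-end name and a queue, so adding a new submission mechanism means registering one more function.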

Architecture

[Diagram: AutoPilot schematic]

Implementation and status

The code is under the panda area of the CERN Subversion repository:
  • AutoPilot - source code for the AutoPilot generic pilot (atlasProdWrapper.py, prodPilot.py, etc.), pilot scheduler/monitor (pilotScheduler.py), and configuration DB (pilotController.py).
  • pilot3 - ATLAS-specific pilot code invoked by AutoPilot for ATLAS analysis and production jobs

A browser/monitor for this system is at http://panda.cern.ch/?tp=main

The thin pilot wrapper that is actually submitted to the batch queue is atlasProdWrapper.sh (or variants).

The wrapper downloads the actual pilot code, which asks Panda for a job. The job contains a 'transformation' field that is a URL to the script to be run as the job.

The transformation script is an application- or VO-specific wrapper that does the preparatory and cleanup work surrounding the job (environment, data management, error processing). An example is trans-atlasprod.sh. It invokes the job-specific script and job options to do the real work of the job.
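
The hand-off from pilot to transformation script can be sketched as follows. This is only an illustration: the job is modelled as a plain dictionary, the example URL is invented, and `transformation_script` is a hypothetical helper. Only the 'transformation' field name and the trans-atlasprod.sh script name come from the text above.

```python
import posixpath
from urllib.parse import urlparse


def transformation_script(job: dict) -> tuple:
    """Return (url, local_filename) for the job's transformation script."""
    url = job.get("transformation", "")
    parsed = urlparse(url)
    # The field is described as a URL, so insist on something fetchable.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"transformation is not an http(s) URL: {url!r}")
    # Use the last path component as the name of the downloaded script.
    filename = posixpath.basename(parsed.path)
    if not filename:
        raise ValueError(f"transformation URL has no script name: {url!r}")
    return url, filename


# Hypothetical job record; the URL below is a placeholder, not a real location.
job = {"transformation": "http://example.org/trans/trans-atlasprod.sh"}
url, script = transformation_script(job)
```

A pilot following this pattern would then download `url` to `script` and execute it. That step is left out here because it depends on the site environment.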

Development

To check out a read-only copy of the code, run

    svn co http://svnweb.cern.ch/guest/panda/autopilot/trunk autopilot
    svn co http://svnweb.cern.ch/guest/panda/pilot3
    

To check out with write access so that you can commit code changes, run for example

    svn co svn+ssh://svn.cern.ch/reps/panda/autopilot/trunk autopilot
    svn co svn+ssh://svn.cern.ch/reps/panda/pilot3
    

Note that the code uses a dbaccess module which is not in the repository. It must be obtained from an existing AutoPilot source area or from T. Wenaus.

Scheduling and monitoring pilots

The procedure to run the pilot scheduler and associated monitor is as follows:

  • the server you're going to run the scheduler on must have Condor installed and must have a basic Apache installation, with URLs accessible from the WAN (for log file access from the AutoPilot browser)
  • contact Torre Wenaus to
    • register your machine as a scheduler server
      • you will need to provide the machine name, the name of a directory where the scheduler can write log files, and a URL pointing to that directory
      • the list of registered machines can be checked at http://panda.cern.ch:25980/server/pandamon/query?tp=main#hosts
      • the required information typically looks like this:
    'name' : 'pandadev01.usatlas.bnl.gov'
    'nickname' : 'pandadev01'
    'host' : 'pandadev01.usatlas.bnl.gov'
    'system' : ' osg lcg-cg condor glidein '
    'rundir' : '/usatlas/prodjob/share/schedlogs'
    'runurl' : 'http://pandadev02.usatlas.bnl.gov:25880/glidein_schedlogs/'
    
    • arrange for Panda queues, sites and tags to be defined as needed, to send pilots for your own use to your own resources (if applicable)
  • install a Subversion (svn) client if necessary (see the Development section above)
  • create a work area and cd there
  • check out the AutoPilot code from the Subversion repository (see the Development section above)
  • cd to the autopilot area
  • edit setup.sh appropriately, put it at $HOME/panda_setup.sh and source it
    • you may want to source this file at login
    • your panda_setup.sh should look something like this:
    export PANDA_URL=http://panda.cern.ch:25080/server/panda
    export PANDA_URL_SSL=https://panda.cern.ch:25443/server/panda
    export PANDA_HOME=$HOME
    export SCHEDULER_LOGS=/home/username/logs/scheduler
    export CRON_LOGS=/home/username/logs/cron
    export PYTHONPATH=$HOME/panda/monitor:$HOME/panda/panda-server/current/pandaserver:$PYTHONPATH 
    
  • create a file mycron containing something like 0 3,9,15,21 * * * /...path.../pilotCron.sh --tag=YOURTAG --pandasite=YOURSITE
    • if you're submitting to a single site, you can use --queue= rather than --tag= (a tag is a collection of queues marked with the tag)
    • you can add several such lines if you want to run pilots for several tags or sites
  • also add one instance of the scheduler monitor to your cron: 0 3,9,15,21 * * * /...path.../pilotCron.sh --monitor
  • register this as your cron: crontab mycron
  • Do an initial launch of the scheduler(s) and monitor; the cron will take care of it thereafter. Snip out the command parts of the cron entries and run them in the background, e.g.: ./pilotCron.sh --tag=YOURTAG --pandasite=YOURSITE &
  • You will probably want to adapt cleanSpace.py and add it to your cron to clean up your log areas
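
The cron entries described in the procedure above can be generated rather than typed by hand. The sketch below is an illustration under stated assumptions: `cron_lines` is a hypothetical helper, the script path passed in is a placeholder you must replace, and passing --pandasite together with --queue is an assumption (the text only says that --queue can replace --tag).

```python
from typing import Iterable, List

SCHEDULE = "0 3,9,15,21 * * *"  # schedule used in the example entries above


def cron_lines(script: str, pandasite: str, tags: Iterable[str] = (),
               queues: Iterable[str] = (), schedule: str = SCHEDULE) -> List[str]:
    """Build one scheduler entry per tag or queue, plus one monitor entry."""
    lines = []
    for tag in tags:
        # A tag selects every queue marked with that tag.
        lines.append(f"{schedule} {script} --tag={tag} --pandasite={pandasite}")
    for queue in queues:
        # A queue selects a single site (assumed to still take --pandasite).
        lines.append(f"{schedule} {script} --queue={queue} --pandasite={pandasite}")
    # Exactly one monitor instance, whatever the number of schedulers.
    lines.append(f"{schedule} {script} --monitor")
    return lines


# Placeholder path: substitute the real location of pilotCron.sh.
mycron = "\n".join(cron_lines("/path/to/pilotCron.sh", "YOURSITE",
                              tags=["YOURTAG"])) + "\n"
```

Writing `mycron` to a file named mycron and running crontab mycron would then register the entries as in the manual procedure.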

Note that to submit pilots you must have a valid grid proxy with a VOMS role accepted by Panda. The current list of valid roles is:

    /atlas/usatlas/Role=production
    /atlas/usatlas/Role=pilot
    /atlas/Role=production
    /atlas/Role=pilot
    
    /osg/Role=pilot
    
If you want a new role to be included in this list, please contact the Panda team and request it.
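
A quick local check of a proxy's attributes against the role list can be sketched as below. It assumes the proxy's FQANs are available as plain strings, optionally ending in a /Capability=... component. `strip_capability` and `accepted_fqans` are hypothetical helper names, and the role tuple simply copies the list above, so it must be updated whenever that list changes.

```python
ACCEPTED_ROLES = (
    "/atlas/usatlas/Role=production",
    "/atlas/usatlas/Role=pilot",
    "/atlas/Role=production",
    "/atlas/Role=pilot",
    "/osg/Role=pilot",
)


def strip_capability(fqan: str) -> str:
    """Drop a trailing '/Capability=...' component, if present."""
    head, _sep, _tail = fqan.strip().partition("/Capability=")
    return head


def accepted_fqans(fqans):
    """Return the FQANs whose group and role match an accepted entry exactly."""
    return [f for f in fqans if strip_capability(f) in ACCEPTED_ROLES]


# Example input; these strings are illustrative, not read from a real proxy.
proxy_fqans = [
    "/atlas/Role=pilot/Capability=NULL",
    "/atlas/Role=NULL/Capability=NULL",
]
usable = accepted_fqans(proxy_fqans)
```

An empty result means that none of the proxy's attributes would be accepted, so pilot submission should not be attempted with that proxy.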

That should be it. If you want to stop a scheduler or monitor, use ./tellService.py NNN stop, where NNN is the service ID from the AutoPilot monitor (below). To restart it, run the appropriate pilotCron.sh instance from your mycron.

Scheduler instances and pilots can be monitored via the Panda AutoPilot monitor http://panda.cern.ch/?tp=main

Managing the AutoPilot databases

AutoPilot uses a number of databases to manage information on pilots, queues, submit hosts and so on. One such DB is the schedconfig table, which records information on the configuration and status of all the queues known to AutoPilot. Loading and updating of these DBs is handled by the pilotController.py script. If you need to use this script, these are the steps to follow.

  • check out AutoPilot (see the Development section above)
  • if you are working with US sites that still rely on the old Panda site information in the Panda repository, in panda/jobscheduler/siteinfo.py, check out Panda from ATLAS CVS next to your autopilot directory. $PANDA_HOME should be set to the panda directory.
  • always do svn update before you make changes to pilotController.py
  • if it is schedconfig you need to change, see this twiki for an explanation of the schedconfig DB
  • always commit your changes promptly, so that other pilotController.py users pick them up
  • always do svn update before you run pilotController.py
  • if you are working with a panda checkout and getting siteinfo.py from it, you must ensure that it is also up to date when you run pilotController.py. This is best done by using a script to do the update, with the script also updating panda CVS. See updateQueues in the autopilot svn for such a script.
  • you should have the DQ2 client environment set up when you run pilotController.py (it uses TiersOfAtlas)
  • you should have the LCG UI environment set up if you are configuring LCG sites
  • if you are configuring LCG sites, you should run lcgLoad.py from the autopilot directory before running pilotController.py (to build an updated set of queue configurations from BDII info), and use the --loadlcg option when you run pilotController.py (see the updateQueues.sh script)
  • in the unlikely event that you are working with pilot3 and need to load storage info from pilot3/storage_access_info.py, do an svn checkout of pilot3 next to autopilot and use the --loadstorage option when you run pilotController.py
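
The ordering constraints in the steps above (update first, run lcgLoad.py before pilotController.py for LCG sites, add the matching options) can be captured in a small planning helper. This is a sketch only: `controller_commands` is a hypothetical function, invoking the scripts as ./lcgLoad.py and ./pilotController.py from the autopilot directory is an assumption, and the helper merely returns the commands without running them.

```python
from typing import List


def controller_commands(lcg: bool = False, storage: bool = False) -> List[List[str]]:
    """Return the commands to run, in order, from the autopilot directory."""
    # Always bring the working copy up to date before anything else.
    commands = [["svn", "update"]]
    controller = ["./pilotController.py"]
    if lcg:
        # Rebuild the queue configurations from BDII info before loading them.
        commands.append(["./lcgLoad.py"])
        controller.append("--loadlcg")
    if storage:
        # Only needed when loading storage info from a pilot3 checkout.
        controller.append("--loadstorage")
    commands.append(controller)
    return commands


plan = controller_commands(lcg=True)
```

Each entry of `plan` could be passed to `subprocess.run(cmd, check=True)` once the DQ2 client and, for LCG sites, the LCG UI environments are set up.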

The schema of the AutoPilot DBs is defined in tables.sql in the autopilot svn directory.

Address questions to Torre Wenaus (has the blame for the implementation) or Rod Walker (suffers the consequences of the implementation).

Applications

Current applications include:


Major updates:
-- TorreWenaus - 05 Nov 2007
-- TorreWenaus - 20 Sep 2006



Attachments


pdf AutoPilot.pdf (32.0K) | TorreWenaus, 25 Nov 2007 - 14:35 | AutoPilot schematic (PDF)
jpg AutoPilot.jpg (105.6K) | TorreWenaus, 25 Nov 2007 - 14:35 | AutoPilot schematic (JPG)