#######################################################################
#
# File: README.txt
# Author: Barnett Chiu
# Date: 08/10/2009
#
# Descriptions of PilotFactory and its related
# components and usage.
#
#######################################################################

[Intro]

1. Introduction

Pilot Factory (PF) is an application that performs pilot submissions
through Condor's glidein mechanism. The goal of Pilot Factory is to
provide an alternative way of distributing pilots that places a minimal
workload on a cluster headnode. In contrast to Condor-G-based pilot
submissions, PF removes a significant amount of GRAM traffic from the
pilot stream by dynamically deploying a Condor schedd (glidein) on the
target headnode as a portal that redirects pilots to the native
scheduler. Once the glidein is properly installed and running, pilots
are submitted from the local cluster (e.g. Grid site A) to the target
remote cluster (Grid site B) using Condor-C.

Recall that the Condor-C mechanism involves two schedds: i) a local
schedd (client), and ii) a remote schedd (server). The client schedd is
the job queue to which jobs are submitted from the local cluster
(submitter site), while the server schedd (usually running remotely) is
ultimately the job queue that communicates with the target cluster for
resource allocations (with respect to the given user jobs). In the
context of PF, the schedd glidein plays the role of the client whereas
the native schedd on the remote cluster head (e.g. a Globus-controlled
gatekeeper machine) acts as the server. Hence, glideins in PF act as
intermediaries that relay pilot jobs, replacing the redundant and
repetitious monitoring activities that Globus GRAM performs in
conventional Condor-G-based pilot submissions.
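To make the two-schedd arrangement more concrete, the Python sketch
below renders a minimal Condor-C submit description. It is only an
illustration and is not taken from the PF sources: the helper name
condor_c_submit_description, the host names and the output file names
are all hypothetical. The submit commands "universe = grid" and
"grid_resource = condor <schedd name> <collector name>" are the usual
Condor-C commands; a real submit file normally needs further settings.

```python
def condor_c_submit_description(executable, server_schedd,
                                server_collector, arguments=""):
    """Return the text of a minimal Condor-C submit description.

    Hypothetical helper for illustration; not part of Pilot Factory.
    """
    lines = [
        "universe = grid",
        # Condor-C identifies the server schedd and the collector that
        # the server schedd reports to.
        "grid_resource = condor %s %s" % (server_schedd, server_collector),
        "executable = %s" % executable,
    ]
    if arguments:
        lines.append("arguments = %s" % arguments)
    # Output file names below are placeholders.
    lines += [
        "output = pilot.out",
        "error = pilot.err",
        "log = pilot.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host names are made up for the example.
    print(condor_c_submit_description(
        "testPilot.py",
        "headnode.example.org",
        "collector.example.org",
        "--site=TEST",
    ))
```

The client schedd is whichever schedd receives this file through
condor_submit; the server schedd is the one named in grid_resource.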
Sending individual pilots through GRAM is considered an overload because
pilots are designed to be light-weight user jobs that simply perform
(customized) preprocessing steps, such as validation of the computing
environment, before pulling real payloads from job servers (e.g. the
PanDA system). All pilots are identical in terms of functionality and
submission mechanism(*1). Consequently, it is often redundant to
apportion computational resources to each pilot for monitoring,
security-driven tasks and job status checks, as is done for regular user
jobs.

In the current PF implementation, glideins are installed as regular user
jobs using Condor-G. Ideally, glideins can run on the cluster headnode
for as long as the site's administrative policy allows. Therefore,
Condor-G is only required for occasional glidein deployments; strictly
speaking, however, glideins are deployed on demand and are subject to
removal when necessary. As an example, when a cluster X is overloaded
with job streams, say jobs from sites A, B and C of the same VO domain,
site B may decide to remove its glidein temporarily in order to reduce
the job traffic coming into cluster X; subsequently, site B redirects
its remaining workload to cluster Y, which has more free job slots, by
deploying a new glidein on cluster Y's headnode. Exactly how to achieve
workload balancing at the cluster level is outside the scope of this
documentation; however, the flexibility of glidein deployments makes it
easier to implement such an idea.

2. PF Components

Conceptually, PF consists of three components:

a) Glidein Launcher: Dynamically distributes Condor schedd (glidein) to
   target sites.

b) Pilot Generator: Periodically submits pilots to the glidein, which
   forwards them to the native scheduler. Pilots are forwarded from the
   glidein to the native Condor scheduler via Condor-C. Specifically, it
   is the GAHP server process that interfaces with the site's native
   batch system.
   In addition to Condor-C, Condor currently supports GAHP servers that
   interface with PBS and LSF.

c) Glidein Monitor: Continuously monitors the active glidein(s) and
   invokes the glidein launcher in the event of glidein failures.

3. Tasks in a PF Cycle

Briefly speaking, PF distributes pilots by accomplishing the following
tasks:

1) Install glidein on the target headnode.
   One first decides on which remote clusters (sites) to run jobs (from
   a set of available grid sites); then one executes the glidein
   launcher to deploy the schedd glidein to the target headnodes of the
   chosen clusters. Similarly, when it is desired to remove certain
   clusters from production, their glideins can be "uninstalled" on
   demand. Details on schedd glidein can be found here:

   http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/ScheddGlidein

2) Start Glidein Monitor.
   Since a glidein is deployed as a user job, it usually has a limited
   life span (i.e. a maximum wall time limit). Glidein Monitor
   periodically queries the active glidein and detects any failures
   should they occur. If the glidein is removed for some reason, the
   monitor attempts to reinstall it.

3) Start Pilot Generator.
   Pilot Generator can be any application that performs periodic job
   submissions. The Pilot Factory module also comes with simple job
   submitters. AutoPilot, used in ATLAS production, is another option
   for this task. For details regarding AutoPilot, please consult the
   following websites:

   3.1) An Intro to AutoPilot
        http://twiki.grid.iu.edu/bin/view/VirtualOrganizations/VOInfo/AutoPilot
        (See also [AutoPilot] section below)

   3.2) Implementation of AutoPilot
        http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/TestPilot

The current implementation integrates 1) and 2) above; that is,
glidein(s) are both installed and renewed using the same module,
GlideinMonitor, which consists of a wrapper in the form of a bash shell
script (.sh) and a core program in Python (.py).
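The combined install-and-renew behaviour described above can be
pictured with the following Python sketch. It is a simplified
illustration under stated assumptions, not the GlideinMonitor code: the
function name monitor_glidein and the two callables passed to it are
hypothetical, and the explicit loop stands in for the periodic
invocation that PF obtains from cron.

```python
import time


def monitor_glidein(is_alive, launch, checks, interval_seconds=0.0):
    """Check a glidein `checks` times; relaunch it whenever it is missing.

    `is_alive` and `launch` are caller-supplied callables (hypothetical
    hooks standing in for a glidein query and a glidein deployment).
    Returns the number of relaunch attempts that were made.
    """
    relaunches = 0
    for attempt in range(checks):
        if not is_alive():
            # Glidein not found: (re)install it.
            launch()
            relaunches += 1
        if attempt < checks - 1:
            time.sleep(interval_seconds)
    return relaunches


if __name__ == "__main__":
    # Fake glidein state used only to exercise the loop.
    state = {"alive": False}

    def fake_is_alive():
        return state["alive"]

    def fake_launch():
        state["alive"] = True

    print(monitor_glidein(fake_is_alive, fake_launch, checks=3))
```

Because the first check also covers the case where no glidein exists
yet, the same routine serves both the initial installation and later
renewals.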
See [Quick Start] section for more information.

Note that Pilot Generator (i.e. 3) above) works independently of the
glidein and its monitor; therefore, there is no strict order to follow
for 3), which can very well execute prior to 1) and 2). Nonetheless, the
practical use case is to install glidein(s) prior to sending pilots with
a Pilot Generator (which is the purpose of using Pilot Factory!).

Note again that PF still requires Globus software when deploying
glideins, since they are distributed in the form of Condor-G jobs.

[Availability]

1. Code

The code is located under the panda/pilotfac area of the BNL Subversion
repository:

    http://www.usatlas.bnl.gov/svn/panda/pilotfac/

1.1 Modules to work with AutoPilot

The PF project was initially developed to work with AutoPilot; thus, for
convenience, the related PF modules are also available through the
AutoPilot download link (underneath glidein_mon):

    http://www.usatlas.bnl.gov/svn/panda/autopilot/trunk/glidein_mon/

2. Related Files

Please ensure that all of the following files are downloaded and kept in
the same directory:

(1) Script that implements schedd glidein
    * condor_glidein

(2) Pilot Generator to be used with condor_glidein (Optional)
    * pilotSubmitter.py
    * Generator.py
    * SubmitFileMaker.py
    * generator.config         # configuration file

    AutoPilot is an alternative and is widely used in ATLAS production.
    AutoPilot is available via the following link:

    http://www.usatlas.bnl.gov/svn/panda/autopilot/trunk/

(3) Glidein monitoring tools:
    * GlideinMonitor.sh
    * GlideinMonitor.py
    * GlideinFactory.py
    * CondorClient.py
    * CondorGlidein.py
    * ConfigParser.py
    * Utility.py
    * pattern.py
    * schedd_glidein.config    # configuration file

(4) Example pilot:
    * testPilot.py (Optional)

[Component Summary]

1) Glidein core (condor_glidein)

Most of the time, the user is not expected to use the core glidein
script directly.
GlideinMonitor.py (a wrapper program) invokes the proper set of modules
to take care of everything related to glidein deployment. However, for
completeness, some details on the core glidein script, condor_glidein,
are given below.

IMPORTANT NOTE [1] (easily neglected, yet important)
All related options are ALSO available via schedd_glidein.config. This
configuration file exists to simplify option specification. All options
available from condor_glidein can also be specified as variable
assignments within the configuration file. A small number of other
variables are provided to customize the wrapper code GlideinFactory.py.
See [Quick Start] section for more details.

Since a schedd glidein is essentially a Condor schedd deployed remotely,
there is inherently no difference between a glidein schedd and a regular
schedd. Hence, a glidein schedd has all the features of a regular
schedd, including a large number of configuration variables (or, in
Condor's terminology, macros). Consequently, there is no easy way to
configure a glidein merely through user options. The important
configurable variables (options) are therefore included in
*schedd_glidein.config* for ease of use. Please refer to the comments in
schedd_glidein.config for option descriptions.

OK. Having said that ...

The schedd-glidein script mentioned in the previous section is based on
Condor's existing glidein script, condor_glidein, and hence many of the
use cases (in terms of options) remain identical. A few more options
have been added for the purpose of the schedd-glidein mechanism. Below
is a summary of these options:

REMINDER [1]: Options can also be configured via schedd_glidein.config

a. -type: select the type of glidein (e.g. startd, schedd)

b. -dyn|-dynamic: use dynamic spool and log directories. Dynamic
   directories are not used by default. With dynamic directories, each
   glidein has its own unique spool and log directory.
   When a new glidein is deployed, a new set of spool and log
   directories is created.

c. -schedd, -collector: specify the submission target for glidein
   requests; in other words, glidein requests can be directed not only
   toward the local schedd (which updates itself to a local collector)
   but also toward a remote schedd (which may update itself to either
   the local collector or a non-local collector). For example, you can
   submit a startd-glidein request to a glidein schedd!

d. -setup_keyword, -startup_keyword: send messages to communicate with
   an external or wrapper program built on top of the condor_glidein
   script (in this way, condor_glidein becomes a reusable component that
   other programs can extend)

e. -forcelink: specify the URL for retrieving glidein binaries; the
   value of this option takes precedence over the value of the
   SCHEDD_GLIDEIN_SERVER_URLS macro defined in condor_config. However,
   this option accepts only one address.

f. -remotedir: set the path for the remote_initialdir command in the
   setup submit file (the remote initial directory where glidein-related
   scripts are executed)

g. -tcp: explicitly tell glidein daemons to use TCP when updating their
   information to the local collector. As with -forcelink, the option
   -tcp supersedes the macro definitions in condor_config.

h. -noproxy: disable the proxy test. The proxy certificate is not
   checked when this option is included.

i. -clean: clean up old dynamic log and spool directories generated by
   glideins

j. -lowport, -highport, ...: these options are for debugging purposes.
   Please use -help for more information.

2) Pilot Generator (pilotSubmitter.py)

This is the application that functions as a pilot submitter. The pilot
generator in the PF module is built on a pilotSubmitter object that
provides all of the operations (as class methods) needed for pilot
submissions.
The following are the dependent modules for pilotSubmitter:

2.1 Generator: contains core functions for job submission tasks; the
    majority of the features in pilotSubmitter are inherited from
    Generator

2.2 SubmitFileMaker: provides submit file templates in the format of
    a. Condor-C, to be used with glidein
    b. Condor-G, for conventional remote job submissions
    c. Vanilla Universe, for local submissions

2.3 generator.config: configures pilotSubmitter.

3) Glidein Monitor (GlideinMonitor.py)

GlideinMonitor deploys and monitors glideins. The following are the
dependent modules for GlideinMonitor:

3.1 CondorGlidein: implements a wrapper of condor_glidein plus a few
    glidein-related methods

3.2 CondorClient: contains simple Python APIs for Condor client commands
    and parsers that handle command outputs

3.3 GlideinFactory: provides the methods needed to monitor glidein
    activities and maintain a fixed number of glidein instances per site

Currently, GlideinFactory supports maintaining only one instance of
schedd glidein per site because this is the most practical use case for
schedd glidein. Monitoring and maintaining more than one schedd glidein
is possible in principle, but it is still under development.

[Quick Start]

Before we delve into the usage and design details, let us go through the
essential steps of applying Pilot Factory to submit pilot jobs. This
should provide an overall picture immediately, without distractions from
the gory details of the individual pieces. Note, however, that if you
wish to reuse PF components (e.g. CondorClient.py, a Condor client tool
interface) for other purposes or customize PF behavior at a finer level,
please also refer, whenever necessary, to the [Design and
Implementation] section below.

Since distributing pilots through a schedd glidein is the main scheme of
Pilot Factory, this section first walks through the steps concerning
schedd glidein deployment.
Once the glidein is installed, you can use either the pilotSubmitter
module or another application such as AutoPilot for pilot dissemination,
AS LONG AS the chosen submitter application supports Condor-C.

Assuming that all PF-specific files mentioned earlier are located under
/home/pilotfac, follow the steps below to configure PF:

Step 0.
This is a preliminary step. Oftentimes, you need to configure a
special-purpose Condor environment in order to use glideins, since a
Condor system that uses the global administrative policy may or may not
fit the requirements of the distributed-across-WAN nature of glideins.
Recall that a glidein (see also the Grid Computing chapter in the Condor
manual) works almost the same way as regular Condor daemons, except that
it is often installed remotely at the time of need. A schedd glidein is
by implication a remotely deployed Condor schedd daemon. In order for
the remote daemon to communicate with the local collector, a proper
networking environment has to be provided.

0.1 If TCP is required for network communication, include the following
    variables in your condor_config file:

        UPDATE_COLLECTOR_WITH_TCP = True
        COLLECTOR_SOCKET_CACHE_SIZE = 128

0.2 It is likely that the outbound connection of a given cluster is
    restricted to a certain port range. For instance, in a typical OSG
    environment, the port range is often defined by the environment
    variable GLOBUS_TCP_PORT_RANGE.
    Assuming that GLOBUS_TCP_PORT_RANGE takes on a value of
    "20000, 30000" (quoting is just for clarity here), include the
    following in your condor_config (or condor_config.local):

        HIGHPORT = 30000
        LOWPORT = 20000

IMPORTANT NOTE [2]
0.3 Tell Condor where to look for glidein binaries:

        SCHEDD_GLIDEIN_SERVER_URLS = \
          http://www.usatlas.bnl.gov/~pleiades/glidein/schedd_based

    Recall that GLIDEIN_SERVER_URLS points to the location(s) from which
    startd-based glidein binaries can be obtained (see the Grid
    Computing chapter, section Glidein, for more information);
    similarly, SCHEDD_GLIDEIN_SERVER_URLS refers to the location(s)
    where schedd-based glidein binaries can be obtained. If security is
    of great concern, you may want to configure a secure glidein tarball
    server and use secure links instead (e.g. gsiftp, https, etc). Use a
    backslash (\) to separate multiple links.

0.4 Since pilots are routed through a glidein (as a schedd) to the
    native schedd on the headnode of a remote cluster, your Condor needs
    to support Condor-C. Include the following macro definitions in your
    condor_config.local to use Condor-C (please also refer to the Grid
    Computing chapter in the Condor manual for more details on
    Condor-C):

        CONDOR_GAHP = $(SBIN)/condor_c-gahp
        CONDOR_GAHP_WORKER = $(SBIN)/condor_c-gahp_worker_thread
        C_GAHP_LOG = $(LOG)/CGAHPLOG.$(USERNAME)
        C_GAHP_WORKER_THREAD_LOG = $(LOG)/CGAHPWorkerLog.$(USERNAME)

    It is assumed here that all the dependent macros (e.g. SBIN, LOG,
    etc) have legitimate values defined.

Note that it is highly recommended that these variables (or macros) be
included in condor_config.local instead of condor_config.
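As an illustration of step 0.2 above, the Python sketch below converts
a port range string into the corresponding LOWPORT and HIGHPORT lines.
It is not part of PF; the function name port_range_macros is
hypothetical, and the "low,high" input format is assumed from the
example value quoted in step 0.2.

```python
import os


def port_range_macros(port_range):
    """Turn a 'low,high' string into LOWPORT/HIGHPORT config lines.

    Hypothetical helper for illustration; whitespace around the two
    numbers is tolerated.
    """
    low_text, high_text = port_range.split(",")
    low, high = int(low_text), int(high_text)
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range: %r" % port_range)
    return "LOWPORT = %d\nHIGHPORT = %d\n" % (low, high)


if __name__ == "__main__":
    # Fall back to the example value from step 0.2 when the environment
    # variable is not set.
    value = os.environ.get("GLOBUS_TCP_PORT_RANGE", "20000,30000")
    print(port_range_macros(value))
```

The printed lines can be pasted into condor_config.local by hand; the
sketch deliberately does not edit any Condor configuration file itself.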
If you are using multiple Condor systems and each of them requires
different Grid credentials, it is a good idea to create a shell script
that points the CONDOR_CONFIG environment variable to the desired
condor_config file (the one that includes all of the settings mentioned
above) prior to the subsequent glidein setup (described shortly); this
script should also include instructions for obtaining proper Grid
credentials (e.g. Grid proxy certificate, VO-specific credential, etc).
grid-proxy-init and voms-proxy-init (VO-specific) are among the most
popular client tools for this purpose.

Step 1.
Launch schedd glidein(s). As mentioned earlier, customizing a Condor
schedd for optimal behavior can involve several configuration macros.
Therefore, using the "raw condor_glidein script" is NOT recommended.
Instead, the PF module comes with GlideinMonitor.py and
GlideinMonitor.sh (a wrapper of its .py counterpart) to perform glidein
deployments. All you have to do is register GlideinMonitor.sh as a cron
job, which subsequently invokes the main monitor program,
GlideinMonitor.py, using the configuration specified in
schedd_glidein.config.

Depending on the average load of the local Condor system, it can take
from a few minutes to possibly much longer to establish a glidein (which
essentially involves two Condor-G jobs, one being the setup job and the
other the startup job). Empirically, running the GlideinMonitor cron job
at an interval of two hours is sufficient. This assumes that the glidein
can be established on the target headnode within two hours, which is
highly probable unless the Condor queue is overcrowded. Here is an
example of such a cron job:

    # glidein factory monitor (lines separated only for description
    # purpose)
    10 */2 * * * source /home/pilotfac/GlideinMonitor.sh
        --config-file=/home/myconfig/schedd_glidein.config
        > /usatlas/u/sm/pilotfac/glidein_mon/mon.log 2>&1       ...
Cron(1)

This example cron job launches GlideinMonitor.sh every 2 hours (at 10
minutes past the hour), referencing its configuration file under the
directory /home/myconfig. Note that by default the name of the
GlideinMonitor configuration file is schedd_glidein.config; however, you
are free to choose a different name. Note also that if you use the
default file name and keep the file underneath the PF work directory
(/home/pilotfac in this example), then the option --config-file can be
omitted. Redirect the output of the shell script to a separate file
(e.g. mon.log) when you want to keep debugging messages.

Two key files are involved in this cron job:

    i)  GlideinMonitor.sh
    ii) schedd_glidein.config

1.1 GlideinMonitor.sh

All the configurable variables in the shell script are documented in the
form of comments. The primary variables are as follows:

(1) globus_location: the path to Globus client commands; this should be
    the same as the GLOBUS_LOCATION environment variable in most cases.
    Consult your system admin for this location if it is not (properly)
    defined.

IMPORTANT NOTE [3]
(2) condor_config: the path to the Condor configuration file

(3) condor_bin_path: the path to Condor binaries; this should include
    both bin and sbin, delimited by a colon (bash PATH syntax)

(4) work_dir: the directory of glidein log files (the condor_glidein
    core script produces logs in the event of glidein installation
    failures)

IMPORTANT NOTE [4] (Esp. for AutoPilot users)
(5) autopilot_config: the path to "panda_setup.sh"; this is a tricky
    one. One could very well be perplexed by the existence of
    panda_setup.sh and its purpose. PF was originally designed to work
    with the AutoPilot system, in which panda_setup.sh is, by
    convention, used to configure the key environment variables shared
    by AutoPilot-related modules. In particular, the source code for
    AutoPilot is by convention stored underneath $PANDA_HOME/autopilot,
    where PANDA_HOME is usually the home directory.
    The PF-specific extension files for AutoPilot are available
    underneath $PANDA_HOME/autopilot/glidein_mon for the convenience of
    AutoPilot users. For instance, specify the following in
    GlideinMonitor.sh if your panda_setup.sh is kept at
    $HOME/panda_setup.sh:

        autopilot_config=$HOME/panda_setup.sh

1.2 schedd_glidein.config

The schedd_glidein.config file contains variables that customize the
glidein behavior. This configuration file replaces the potentially
verbose and numerous command-line options required to properly invoke
the glidein core, condor_glidein, for schedd glidein deployments. The
majority of the variables are documented in the configuration file, and
most of the time one should not need variables beyond those in the
schedd_glidein.config template included in the PF module. Below is a
list of the most essential variables:

(1) path: the path to condor_glidein

(2) condor_config: the path to condor_config; this ensures that the
    glidein requests are sent to the desired Condor job queue (schedd).

(3) glidein_host: host name of the remote cluster headnode; a short host
    name is OK as long as it does not lead to confusion.

(4) type: the type of the glidein; for schedd-based glidein, simply
    specify 'schedd' (quoting optional).

IMPORTANT NOTE [5]
(5) count: the number of glidein instances; this is usually set to 1.
    Note that running glidein with more than one instance is untested in
    the current release.

(6) rsl_append: RSL string for the Globus software used on the glidein
    host. It is very likely that the target headnode requires a special
    RSL string to schedule Condor-G jobs, or that you have a preference
    for a particular job queue for the glidein requests (recall that the
    condor_glidein core uses the Condor-G mechanism to install
    glideins), in which case rsl_append should be assigned the desired
    RSL string.
    For example, to schedule the glidein job on a queue named "short"
    and run the job as a single process, you need the following RSL:

        (jobtype=single)(queue=short)

    This RSL is then specified as:

        rsl_append = \(jobtype=single\)\(queue=short\)

    The escape character '\' is REQUIRED for the current release of
    Pilot Factory. However, this unnatural limitation will be removed in
    a newer version.

Now, with the glidein up and running, we are ready to use
pilotSubmitter ...

IMPORTANT NOTE [6]
AutoPilot users: skip steps 2 ~ 5 and go straight to the section
[AutoPilot User] below.

1.3 An extra cron job to periodically renew the glidein (Optional)

In some versions of Condor, a minor memory-leak issue may occur with the
condor_schedd and condor_c-gahp processes. Should this be the case, add
an extra cron job to periodically remove the old glidein and establish a
new one. Suppose that you use the monitor cron job mentioned earlier
(i.e. Cron(1)); then add the following cron job:

    5 */2 * * * source /home/pilotfac/GlideinMonitor.sh --release-all
        > /dev/null 2>&1                                        ... Cron(2)

The option --release-all simply tells GlideinMonitor to release the
currently running glidein(s) at the target host (which should have been
specified in schedd_glidein.config). Observe that Cron(2) runs on the
same two-hour schedule as Cron(1) but 5 minutes earlier. This small time
period allows the old glidein to exit gracefully before a new glidein is
deployed.

Step 2.
Create a running directory (denoted rundir) to keep your pilot
script(s). For example, if you decide to place your files for pilot jobs
underneath 'factory', then create a directory named 'factory' directly
underneath /home/pilotfac. In this case, your 'rundir' will be
/home/pilotfac/factory; this is where GlideinFactory locates your pilot
script (or general job script).

Step 3.
Prepare your files for pilot jobs (e.g. pilot scripts) and keep them
under rundir:

    /home/pilotfac/factory

Note that the PF module comes with an optional toy pilot: testPilot.py

Step 4.
Configure the pilot generator, pilotSubmitter, using its configuration
file (see also section [Component Summary] above). As with
GlideinMonitor, parameters are documented in the default configuration
file template which, in the case of pilotSubmitter, is generator.config.
Again, users are free to choose their own configuration file names; in
that case, the option --config is required to explicitly tell
pilotSubmitter where to look for the configuration file.

Major parameters include the following (refer also to the documentation
inside generator.config):

    - glidein_host
    - server_schedd_name
    - server_collector_name
    - executable
    - arguments
    - rundir

Go back to /home/pilotfac and open the 'generator.config' file; you will
see that all the parameters are commented out. First uncomment the
parameter glidein_host, which has the same meaning as the one in the
GlideinMonitor module; it refers to the target host where glidein(s) are
distributed. Next, uncomment 'server_schedd_name' and
'server_collector_name' and assign them proper values ...

"What are the 'proper values' for the two parameters mentioned above?"

These two parameters correspond to the schedd and collector names for
the grid_resource command in a Condor-C submit file. They represent the
schedd and collector running on a cluster headnode (e.g. gatekeeper
machine). The name of the schedd is usually the 'hostname' of the
headnode, and the name of the collector can be queried with the
following Globus client command:

    globus-job-run /jobmanager-fork \
        -s `which condor_config_val` collector_host

If this command does not give you the answer, either you do not have a
legitimate proxy certificate or a batch system other than Condor is
employed on the remote headnode(*1)! Please consult your system admin in
these cases.

Now, assuming you have all the values ready, assign them to
server_schedd_name and server_collector_name.

Next, configure executable and arguments.
As the variable names suggest, assign the (file) name of the pilot
script and its arguments to these two parameters respectively. For
example, if your pilot script file is testPilot.py(*2) and it takes the
command-line argument --site=TEST, then assign values as follows:

    executable = testPilot.py
    arguments = '--site=TEST'    # quoting is not necessary
                                 # when there is no space

Finally, the other parameter to configure is 'rundir'; for this example,
assign /home/pilotfac/factory to rundir. Again, this is the directory
where you keep pilot job files (recall as well that /home/pilotfac is
the PF working directory for this example).

[note]
    1. Pilot Factory aims to support other batch systems as well. In
       fact, it is the Condor schedd piece, the main daemon process in
       schedd glidein, that supports other batch system types. This
       example assumes the remote headnode uses Condor as the job
       scheduler.

    2. testPilot.py simply gathers the basic hardware configuration of
       the worker node and performs simple CPU-bound operations.

Now, we are ready to submit pilots ...

Step 5.
Assume that the desired queue depth is 5. Issue the following command to
run pilotSubmitter.py, which submits pilots continuously in accordance
with the desired queue depth.

    # redirect stdout and stderr to separate files to facilitate
    # debugging
    ./pilotSubmitter.py --nqueue=5 --glidein-host=galaxy.usatlas.bnl.gov
        1> submitter.out 2> submitter.err &

Note that the options --glidein-host and --nqueue are not required as
long as they are specified in the configuration file (i.e.
generator.config). However, specifying "key" options such as
--glidein-host in the crontab or in the shell has the benefit of
distinguishing different instances of pilotSubmitter ...

TIP [1]
Suppose that you need multiple instances of pilotSubmitter running
concurrently, each of them submitting pilots to a different host; then
you need a separate configuration file for each of these submitter
instances.
In particular, glidein_host very likely takes different values in the
different configuration files. In this case, the path to the individual,
non-default configuration file needs to be specified using the option
--config. For instance, if one starts another instance of pilotSubmitter
referencing the configuration file

    /home/geneator_config/machine_A.config

then one issues the following command with an extra --config option:

    # lines separated only for illustrative purpose
    ./pilotSubmitter.py --nqueue=5
        --glidein-host=darkmatter.usatlas.bnl.gov
        --config=/home/geneator_config/machine_A.config
        1> submitter.out 2> submitter.err &

Again, --nqueue and --glidein-host serve no functional purpose except
for distinguishing submitter instances AS LONG AS they are already
specified in machine_A.config. Nonetheless, a user option takes
precedence over the configuration file; be careful NOT to use
inconsistent values (unless it makes sense to you to be inconsistent
with the respective configuration file).

After this command, you should be able to see 5 idle pilot jobs sitting
in the job queue. Use condor_q to verify this. For example,

    condor_q -name galaxy.usatlas.bnl.gov

where 'galaxy.usatlas.bnl.gov' is the name of the glidein host.

You will also see job directories being created underneath
/home/pilotfac/factory

TIP [2]
The pilotSubmitter module also supports pilot submission in Condor-G and
Vanilla Universe. Please see the [Design and Implementation] section
below for usage details.

That's it! To sum up, Pilot Factory conceptually involves only the
following two logical steps: (1) set up the schedd glidein, and (2)
submit pilots using a designated pilot submitter. Make sure you do not
miss any notes marked with IMPORTANT NOTE, since neglecting them may
cause PF to fail. Please see the sections below for more design-specific
details.
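The precedence rule mentioned in TIP [1] (a user option overrides the
value in the configuration file) can be sketched in Python as follows.
This is an illustration only, not the pilotSubmitter code: the function
name resolve_options is hypothetical, and it assumes that an option not
given on the command line is represented by None.

```python
def resolve_options(config_values, cli_values):
    """Merge settings so that command-line values win over config values.

    Hypothetical helper for illustration. A command-line value of None
    means "not given", in which case the configuration file value is
    kept.
    """
    merged = dict(config_values)
    for name, value in cli_values.items():
        if value is not None:
            merged[name] = value
    return merged


if __name__ == "__main__":
    # Values as they might appear in a configuration file.
    from_config = {"nqueue": 5, "glidein_host": "galaxy.usatlas.bnl.gov"}
    # Only --nqueue was given on the command line in this example.
    from_cli = {"nqueue": 3, "glidein_host": None}
    print(resolve_options(from_config, from_cli))
```

With this rule, passing the same values on the command line as in the
configuration file changes nothing, which is why such options can be
used purely to tell submitter instances apart.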
[AutoPilot User]

As mentioned earlier, any Pilot Generator application is interoperable
with PF provided that it supports the Condor-C pilot submission method.
AutoPilot is one such application and is widely used in ATLAS. The usage
of AutoPilot in PF mode is mostly the same as in its regular mode. In
addition to the regular user options (e.g. --monitor in AutoPilot's
monitoring mode, --queue in submitter mode, etc), the name of the target
glidein host is required to direct pilots to the desired glidein
instance; this is achieved by using --glidein-host and assigning it the
name of the host. A short host name is OK so long as it does not cause
confusion (theoretically, it is possible that two hosts share the same
name up to the first dot but differ in the rest of the name).

An example of applying AutoPilot to PF, assuming that glidein(s) are
already available at the target host:

    # (1) glidein pilot monitoring (line separations only for
    # illustrative purpose)
    1,31 */3 * * * /home/autopilot/glidein-pilot-cron.sh
        --monitor
        --glidein-host=gridgk01
        > /dev/null 2>&1                                        ... Cron(3)

    # (2) glidein pilot submission
    5,35 */3 * * * /home/autopilot/glidein-pilot-cron.sh
        --queue=BNL_GLIDE_1-condor
        --nqueue=2
        --pandasite=ANALY_BNL_test3
        --pilot=atlasProd
        --glidein-host=gridgk01
        > /dev/null 2>&1                                        ... Cron(4)

These cron jobs are self-explanatory, especially for users familiar with
AutoPilot. Cron(3) represents the pilot monitor while Cron(4) represents
its corresponding pilot submitter. Notice that the queue
BNL_GLIDE_1-condor in the submitter cron job is defined via
pilotController.py, pointing to the glidein instance(s) running on the
glidein host specified by the option --glidein-host=gridgk01 in Cron(3).
In other words, the host referred to by the queue (using --queue) of a
submitter cron job needs to be consistent with that of the monitoring
cron job. Note that, for historical implementation reasons,
--glidein-host is also required for the submitter cron job (i.e.
Cron(4)), even though this information should already be available
via the queue. Please refer to the AutoPilot documentation for more
details.

[Design and Implementation]

Detailed descriptions of the files that implement PilotFactory are
given below. If you are not concerned about the implementation
specifics, you can go through only the "condor_glidein" and
"pilotSubmitter" sections and read their usage examples ...

1. condor_glidein

Usage of the schedd glidein command is almost identical to the
existing condor_glidein command, except that extra options are
included to choose between startd and schedd glidein, select the batch
system to support, choose between TCP and UDP updates, and customize
the messages sent to external programs that use condor_glidein as a
building block. You need a valid grid certificate to use this script.
Below are a few usage examples:

Eg1. Generate the submit files involved in a glidein request that
     deploys an instance of schedd glidein on the host
     galaxy.far.away.gov using the Globus jobmanager: jobmanager-fork

     condor_glidein -count=1 -arch 6.8.1-i686-pc-Linux-2.4 \
                    -setup_jobmanager=jobmanager-fork \
                    -type=schedd \
                    -gensubmit \
                    galaxy.far.away.gov/jobmanager-fork

     -gensubmit says: generate submit file only

Eg2. Request an instance of schedd glidein on galaxy.far.away.gov (as
     opposed to merely generating submit files in Eg1)

     condor_glidein -count=1 -arch 6.8.1-i686-pc-Linux-2.4 \
                    -setup_jobmanager=jobmanager-fork \
                    -type=schedd \
                    galaxy.far.away.gov/jobmanager-fork

Eg3. Request 3 instances (*1) of schedd glidein, again on
     galaxy.far.away.gov, and forcefully "reinstall" the glidein
     binaries and configuration file.

     condor_glidein -count=3 -arch 6.8.1-i686-pc-Linux-2.4 \
                    -setup_jobmanager=jobmanager-fork \
                    -type=schedd \
                    -forcesetup \
                    galaxy.far.away.gov/jobmanager-fork

     -type=schedd -forcesetup says: install new binaries and
     configuration file even if old ones already exist!

[note] 1.
Most likely you would not want to actually install more than one
schedd glidein on a headnode.

More information and examples can be found here:
http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/ScheddGlidein.html

2. Generator.py

* Generator.py holds the definition and implementation of the
  Generator class.

* The Generator class provides utilities for
    (1) configuring submit files,
    (2) submitting jobs, and
    (3) monitoring jobs.

  As an overall summary, Generator defines the following methods that
  handle job-submission-related tasks (not an exhaustive list):

  - submitJob, a generic method for job submissions (any jobs)
  - getJobSpec, for producing the submit description file
  - getJobProfile, which obtains a status summary of the job(s)
    currently sitting on the queue (i.e. those visible from condor_q)
  - countJob, which gives the number of jobs currently in a particular
    state such as idle, running, etc. For example, you can count the
    number of idle jobs on the queue.
  - listJob, which lists all jobs currently in a particular state. For
    example, you can list all the completed jobs still on the queue.
  - queryJob, which queries a particular job attribute and its value.

* Generator provides three operation modes:

  - job submissions to glidein
  - job submissions in Condor-G
  - local job submissions

  They are specified by the following variables respectively:
  glidein_mode, globus and local.

  "How to specify the operation mode?" Some examples:

  Eg. Instantiate a Generator object, g, operating in Condor-G
      submission mode

      g = Generator(globus=1, ...)   # where '...' stands for other
                                     # parameters to be introduced
                                     # shortly

  Eg. Instantiate a Generator object operating in glidein submission
      mode

      g = Generator(glidein_mode=1, ...)

* Using Generator methods to customize the desired behaviors for job
  submissions:

  Basically, Generator needs to know the information below in order to
  properly handle job submissions:

  a.
Initial run directory (defined by the 'rundir' variable):

Before submitting jobs, Generator first searches for job files under
'rundir'. Job files include the executable, the Condor submit file and
input files (if needed). If any job file is missing, Generator then
searches the current working directory, where Generator.py and its
dependent files are located(*1). If Generator still fails to locate
the missing job files, it tries to download them from a remote server.

Eg. Specify `rundir' as "factory" in Generator instantiation

    g = Generator(rundir='factory', ...)

This sets the initial run directory to "factory" under the current
working directory. You can also specify an absolute path for `rundir'.

[note] 1. Ideally, the user could configure the search directories to
          include directories other than the current working
          directory. However, this search logic has not been
          implemented.

b. Parameters for completing a Condor submit file

For convenience, the parameters defined in the Generator class share
the same names as the commands in a regular Condor submit file. For
example, the parameters include: executable, grid_resource, arguments,
transfer_input_files, and so on. Please check the definition of
`jdl_commands' in the Generator class for a complete list of such
parameters.

NOTICE that it may not be practical for Generator to define parameters
that cover all possible commands of a submit file. Only frequently
adjusted commands are available for the user to customize through
Generator (why? see the following section). You can provide your own
submit file template if you need to configure other commands not
included in Generator.

"How to use Generator to complete a submit description file?"

Users can either provide their own submit file or configure a desired
submit file via user options or a configuration file.
Parameters to configure include the executable name, arguments, input
files, and grid resource string, among others, which map exactly to
the commands in a Condor submit description file.

Eg1. Create a Generator object that submits Condor-G jobs to a remote
     host: galaxy.far.away.gov with executable: teleport.py, and
     arguments: --speed=6c --dimension=5

     g = Generator(globus=1,
                   remote_host='galaxy.far.away.gov',
                   executable='teleport.py',
                   arguments='--speed=6c --dimension=5')

Eg2. Create a Generator object that submits jobs to the glidein on the
     host: galaxy.far.away.gov, which has a Condor system whose schedd
     is: galaxy.schedd.gov, and collector:
     galaxy.collector.gov:10000(*2). All the other parameters remain
     the same as in Eg1.

     g = Generator(glidein_mode=1,
                   glidein_host='galaxy',
                   server_schedd_name='galaxy.schedd.gov',
                   server_collector_name='galaxy.collector.gov:10000',
                   executable='teleport.py',
                   arguments='--speed=6c --dimension=5')

[note] 2. Schedd glidein uses Condor-C to "redirect" jobs to the
          schedd running on the remote host (usually the headnode of a
          site). In a Condor-C submit file, the grid_resource command
          consists of three parameters: (1) grid type, (2) schedd
          name, and (3) collector name. The grid type in Condor-C is
          always 'condor'. The schedd and collector given in this
          example are used to formulate the following Condor-C
          resource string:

              condor galaxy.schedd.gov galaxy.collector.gov:10000

          Notice that the remote schedd in the context of Condor-C or
          glidein is also called a "server schedd", as opposed to the
          "client schedd" to which the job is submitted from a submit
          host (where your Generator.py would be running). Jobs are
          forwarded from the client schedd to the server schedd on the
          remote host, where they are eventually assigned to the
          worker nodes.

"What if I need to submit a job in which a lot of other commands are
involved and need to be configured? Wouldn't it look hideous in the
instantiation of such an object?"
You can provide a configuration file or configure these parameters in
the default configuration file: generator.config. Consider Eg2 above:
since the Condor-C grid_resource value is unlikely to change, you can
specify the following parameters in a configuration file:

    server_schedd_name = galaxy.schedd.gov
    server_collector_name = galaxy.collector.gov:10000

Then, you would instantiate a Generator object as below instead:

    g = Generator(glidein_mode=1,
                  glidein_host='galaxy',
                  executable='teleport.py',
                  arguments='--speed=6c --dimension=5')

Or, you can even put EVERYTHING in the configuration file, including
"executable", "arguments" and "glidein_host", and simply instantiate a
Generator object like:

    g = Generator()

It is suggested, though, to retain the operation-mode parameter such
as "glidein_mode" in the constructor for clarity.

A configuration file is meant to hold parameters that do not change
over time or across different runs. The "grid_resource" value in a
Condor-G or Condor-C submit file is an example of such a parameter.

NOTICE that Generator internally reads "generator.config" by default.
If you instead want to place all parameters in a different
configuration file, say "myfavorite.config", then you need to
explicitly specify it in the constructor using `config_file':

    g = Generator(config_file='myfavorite.config')

This is particularly useful when you need to submit jobs to multiple
sites simultaneously, with each Generator object (in a program)
referencing a different configuration file (holding different site
parameters).

"Can I configure all submit file commands only via Generator?"

The Generator class provides only a limited but essential set of
tunable parameters for creating a job submit file, which can then be
submitted via a class method (please refer to the submitJob() method
for more details).
However, if you need to submit a job that requires a more
comprehensive configuration in the submit file, please provide your
own submit file and assign that file to the `user_submit_file'
variable.

E.g. Provide my own submit file, named myfile.cg, for Generator to
     submit.

     g = Generator(user_submit_file='myfile.cg', ...)

Notice that since every submit-file-related parameter is expected to
be available in the user-provided submit file itself, parameters such
as executable, arguments, etc. are omitted in the constructor.
Specifically, if myfile.cg is a Condor-G submit file, then you
instantiate a Generator object as follows:

     g = Generator(globus=1, user_submit_file='myfile.cg')

And that's it! And yes, you still need to provide the operation mode
(i.e. globus=1) in the constructor, since Generator does not inspect
the syntax of a submit file that it did not produce itself.

"Configuration File. What does it look like?"

Please refer to the default configuration file, generator.config, that
comes with the package.

3. pilotSubmitter.py

In practice, the pilot submitter (pilotSubmitter.py) supports not only
job submissions to schedd glidein but also Condor-G and local pilot
submissions. In terms of design, pilotSubmitter is a direct derivative
of Generator, which is a general job submitter.

pilotSubmitter.py defines a pilotSubmitter class, derived from
Generator, that contains extra methods tailored for pilot jobs.
pilotSubmitter.py also comes with an entry point (i.e. the main()
function), in which a pilotSubmitter object is created in a way
similar to the Generator examples, except that setting parameters is
conveniently delegated to user options. In other words, user options
are made available to properly "configure" the pilotSubmitter object
(via its constructor call) to obtain the desired behaviors for pilot
submissions.

The logic that uses the pilotSubmitter object is included in main(),
which performs the following tasks:

a.
Submit pilots continuously and periodically with an adjustable queue
depth.

b. Appropriately handle completed, held, and idle jobs. For example,
   it is usually desirable to transfer all the outputs produced by the
   jobs upon their completion.

First, let us see how to construct a desired pilotSubmitter object.
Similar to Generator, pilotSubmitter sets its internal parameters
either through the arguments of the constructor or through a
configuration file. The following examples assume that submit file
settings such as the executable and arguments for pilot jobs are
already provided in the default configuration file, generator.config.
So in generator.config, you should include:

    # generator.config defines the following two common parameters
    executable = testPilot.py            # your pilot script
    arguments = '--site=my_test_site'    # pilot's arguments

An example:

Eg1. Create a pilotSubmitter object that can submit pilot jobs in
     Condor-G to the remote host: galaxy.far.away.gov (with executable
     and arguments as they are in the default configuration file)

     s = pilotSubmitter(globus=1, remote_host='galaxy.far.away.gov')

"What if I want to submit pilots locally?"

You can create an object that turns local mode on:

     s = pilotSubmitter(local=1)

That's it! It reads the same configuration from generator.config; the
only difference is that this object submits pilots to the local
schedd.

"How to use a pilotSubmitter object to achieve desirable pilot
submissions?"

For this purpose, a program named pilotSubmitter.py is implemented
using such a pilotSubmitter object(*1).

[note] *1. For the sake of convenience, a main program entry function,
           main(), that uses the pilotSubmitter object is placed in
           the same file where the pilotSubmitter class is defined.
           You can, certainly, write your own main() function that
           uses the pilotSubmitter object in a different manner.

pilotSubmitter.py provides user options.
These user options are made available not only to configure the
pilotSubmitter object but also to instruct how pilots should be
submitted. See the examples below (assume again that the executable
and its arguments are already available in the configuration file):

Eg2. Submit pilots in Condor-G to the host: galaxy.far.away.gov with a
     queue depth equal to 3 (i.e. at most 3 idle pilots on condor_q)

     pilotSubmitter.py --globus --remote-host=galaxy.far.away.gov \
                       --nqueue=3                       ... (cmd 1)

     Now, you can also start another process reading the same
     configuration file(*2) but submitting pilots instead to
     blackhole.far.away.gov:

     pilotSubmitter.py --globus --remote-host=blackhole.far.away.gov \
                       --nqueue=5  # notice that queue depth = 5 now
                                                        ... (cmd 2)

[note] 2. "cmd 1" and "cmd 2" above share the same executable,
          arguments and inputs (if any) BUT they target different
          sites (i.e. the same job is submitted to different sites).

Eg3. Submit pilots to the glidein running on the host:
     galaxy.far.away.gov, with a queue depth equal to 3 and two input
     files: ufo_model.in, the_grey.in, and also append a job
     attribute: Cepheid_Group = "short", to the submit file.

     pilotSubmitter.py --nqueue=3 --glidein-host=galaxy.far.away.gov \
                       --infile='ufo_model.in, the_grey.in' \
                       --append='Cepheid_Group = "short"'
                                                        ... (cmd 3)

Eg4. Submit pilots locally with a queue depth equal to 3

     pilotSubmitter.py --local --nqueue=3               ... (cmd 4)

Notice that cmd 1, 2 and 4 all contain an option carrying
operation-mode information (i.e. Condor-G mode with --globus as in
cmd 1 and 2, and local mode with --local as in cmd 4), but "cmd 3"
does not have a --glidein option. Omitting the --glidein option is
okay because pilotSubmitter.py operates in glidein mode by default.

You may be wondering why the program was not designed to "figure out"
the operation mode automatically from the other command-line options.
Wouldn't a pilotSubmitter process with a --glidein-host option imply
glidein mode?
The operation mode is specified explicitly in order to facilitate
using a configuration file to hold commonly used parameters. Imagine
that, for convenience, both remote_host and glidein_host appear in the
configuration file: how would the pilotSubmitter processes reading the
same configuration file know which type of job to submit (Condor-G, or
Condor-C to a schedd glidein)? If the operation mode is specified, the
confusion is eliminated.

"Why don't we specify all the parameters using only user options?"

Of course, you can specify the executable, arguments and other
submit-file-related parameters (if any) solely through user options,
but this would make the command line long and hard to read. This is a
trade-off.

[Older Documentations]

Here's some older documentation on PF (subject to updates):
http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/PilotFactory.html