
PilotFactory - A Condor-Based System for Scalable Pilot Generation across the Grid


Introduction

Note: this page is mainly of historical interest as the project has been discontinued

The first project in the OSG workload management program at BNL is to develop a Condor-based 'pilot factory' that can be used by Panda or other just-in-time workload management systems to deliver pilots to worker nodes across the grid in a scalable way. The current Panda job schedulers deliver pilots to grid nodes via Condor-G, so that a centrally operated scheduler can send pilots to participating sites across the grid. However, this approach has encountered scaling limitations when sending pilots into a site through the Globus headnode. Pilots are intrinsically no different from ordinary user jobs except that they are much more lightweight: they validate the computing resources and perform generic preprocessing steps for user jobs, and are therefore designed to minimize resource consumption. However, as the demand from production and user jobs grows, more pilots must be submitted to service and prepare these jobs, which in turn leads to heavier GRAM traffic. In the Condor-G scenario, each running pilot job requires a jobmanager process on the headnode, so a large number of pilot jobs implies substantial memory consumption on the headnode in addition to the brokering activity underlying GRAM.

The objective of this project is to develop a pilot factory that submits pilots in a way that reduces or, better yet, circumvents excessive communication over the GRAM protocol. One possibility is to deploy a pilot submission system on the headnode and disseminate pilots locally to worker nodes within the site perimeter, generating no GRAM traffic. In this way, the computing demand on the headnode is much reduced: instead of handling constant pilot-level traffic, the Globus jobmanager, acting as an intermediate broker, is only responsible for the occasional submission of a long-lived "pilot generator" [*1]. In other words, the pilot submission system is resubmitted only in response to occasional downtimes caused by site maintenance, networking issues, or a walltime limit imposed by site policy.

The pilot factory suggested above would consist of a component that remotely installs and executes a pilot generator, plus a monitor that ensures persistent pilot submission. As attractive as this sounds, however, pilot submission in this approach would require surrendering job control to the site, making it difficult for the submit host to access the output and log files produced by pilot jobs. This is because the submit host, the "generator launch pad" of such a pilot submission system, does not usually share a file system with remote sites. The contents of these pilot outputs would then have to be transferred over an external link such as HTTPS or GSIFTP, adding complexity and network traffic.

To gain the benefits of local submission while facilitating pilot job management, an alternative to the Condor-G based pilot factory is a mechanism that not only achieves local pilot submission on the headnode but also lets the factory user easily obtain and monitor pilot outputs. The submission approach that satisfies these requirements was inspired by Condor-C, an alternative job submission mechanism that allows jobs to be "forwarded" from one Condor schedd (i.e. a job request agent) to another. To use Condor-C, however, we still need a communication gateway on the headnode itself to relay jobs directly to the native Condor system. Furthermore, this gateway needs to be flexible enough to interface with other batch systems at sites that do not employ Condor for job management. The schedd-based glidein was conceived to support these features. With the schedd glidein running as a client with respect to the site's native batch system, the pilot factory also benefits from Condor's client tools for managing pilot jobs and from the security features inherent in Condor's schedd. [See ScheddGlidein section for more details]

A pilot factory based on schedd glideins further removes the need to install a pilot submission system on the headnode, leaving only a thin layer, the glidein schedd, to redirect pilots to the native batch system. Once a glidein schedd is installed and running, it can be used exactly like a local schedd; pilots submitted via a glidein schedd are therefore nearly equivalent to jobs submitted within a local Condor pool, resembling Condor's vanilla universe jobs. [See ScheddGlidein section for more details]
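
To make the forwarding concrete, the snippet below is a minimal sketch of a Condor-C style submit file that routes a pilot job through a glidein schedd. The schedd and collector names are taken from the examples later on this page; the file names (pilot.sh, pilot.sub) are illustrative, not part of the factory code:

# pilot.sub -- illustrative Condor-C style submit file (not from the factory code)
universe      = grid
# forward the job to the glidein (client) schedd, which relays it to the
# site's batch system; arguments are <schedd name> <collector/pool>
grid_resource = condor myschedd01@gridgk10.racf.bnl.gov gridui03.usatlas.bnl.gov:24000
executable    = pilot.sh
# attributes prefixed with remote_ apply to the forwarded job;
# universe 5 means vanilla on the remote side
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
output        = pilot.out
error         = pilot.err
log           = pilot.log
queue

Running condor_submit pilot.sub places the job in the local schedd's queue, from which it is forwarded to the glidein schedd.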

In addition, the factory should allow dynamic reconfiguration by an external controller. This feature can be used, for example, to let the Panda server regulate the pilot submission rate at a site based on the arrival at the Panda server of analysis jobs destined for that site.

Related Condor Experience

1. Development history

This section summarizes technical issues encountered during the development and their solutions.

See http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/CondorExperience

2. Other background materials and work summary

Visit http://www.usatlas.bnl.gov/twiki/bin/view/AtlasSoftware/CondorWorkSummary

Pilot Factory Design and Implementation

Proposed Solutions

Please refer to PilotFactoryPlan page.

Factory Components

Pilot Factory consists of three major components:

  1. Schedd-Based Glidein
  2. Pilot Generator (using AutoPilot)
  3. Factory Starter

As described above, the glidein schedd serves as a uniform submission portal to sites that have agreed to join the local resource pool. In the current scheme, a schedd glidein is requested with a single command, an extension of Condor's existing client tool condor_glidein. Usage examples can be found in the Schedd-Glidein Usage and Examples section.

The Pilot Generator continuously submits pilots to sites whenever worker nodes become available (strictly speaking, whenever there are free job slots) and interacts with the databases that maintain a consistent state for these pilot jobs. Since the pilot generator proposed in PilotFactoryPlan functionally overlaps with a subset of the programs in AutoPilot, its implementation reuses existing AutoPilot code and extends it with glidein-specific functionality such as pilot submission to glideins and a glidein monitor. Please refer to AutoPilot for implementation details and usage examples.

Specifically, a Pilot Generator is composed of the following modules:

  1. pilotScheduler.py (pilot submitter/monitor)
  2. ServiceRegistry.py (action controller with heart beat)
  3. SchedulerUtils.py (communicator with Panda DBs)
  4. constants.py (container of error codes)

pilotScheduler is the main program playing the role of the Pilot Generator described in the PilotFactoryPlan page. It pulls the submit description file "template" for a pilot job from the corresponding database, fills in the missing values determined at runtime, retrieves the pilot script from a server, and finally submits the pilot job.
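
As a rough illustration of this cycle, the shell sketch below shows the same steps; the placeholder keys, the pilot server URL, and the file names are hypothetical, not the actual template keys or paths used by pilotScheduler:

# fill the runtime values into the submit-file template pulled from the DB
sed -e "s|@SCHEDD@|myschedd01@gridgk10.racf.bnl.gov|" \
    -e "s|@PILOT_SCRIPT@|myPilot.py|" \
    pilot_template.sub > pilot.sub
# fetch the pilot script from a server, then submit the pilot job
wget -q http://pandaserver.example.org/pilots/myPilot.py
condor_submit pilot.sub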

The Factory Starter is essentially a program starter, logically similar to the glidein script in that it also dynamically deploys a set of target programs. In the Pilot Factory scenario, the Factory Starter first installs the Pilot Generator, either (1) locally on your disk or (2) remotely on the site's headnode, with a Condor job, and then sends another job to start pilotScheduler for pilot submission. In the original design, the Factory Starter adopted the second scheme: Condor-G is used to set up the pilot submitter remotely on the site's headnode, so that the Factory Starter can later submit pilots "locally" to the schedd glideins running there. The point of deploying the Pilot Generator on the headnode is to further reduce network traffic between the pilots and the resident glidein, since all pilot submissions then occur within the headnode itself. Submitting pilots in this manner could yield a performance gain under heavy pilot flow driven by high user-job demand. However, since Condor-C and "Condor-to-non-Condor" job submission (see ScheddGlidein) are not limited by the location of the client schedd [*2], it functionally does not matter where the pilot submission program is installed: with the schedd glidein on the headnode, these grid jobs can use the glidein as a "client schedd" to redirect pilot jobs to the headnode's batch system. For experimental purposes, both remote and local deployment are implemented. In the case of local deployment, where the Generator is installed directly on the local host, vanilla universe jobs handle both the setup and startup procedures. The types of Condor jobs involved in the setup and startup procedures are summarized below:

1. Remote Generator deployment

  • Use Condor-G to set up Pilot Generator on the remote headnode
  • Use Condor-C to start Generator with appropriate options inserted including the name of the glidein instance (See Using Pilot Factory section below)

Note that Condor-G and Condor-C can in fact be used interchangeably for the Generator setup and startup procedures, but it is more intuitive to use Condor-C to start the Generator, which will then itself submit pilots as Condor-C or Condor-to-non-Condor grid jobs (analogous to Condor-C).

2. Local Generator deployment

  • Use the vanilla universe for both the setup and the startup of the Pilot Generator (a sketch of such a setup job follows below).
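
The following is a minimal sketch of what the vanilla-universe setup job might look like; the executable name, arguments, and file names are illustrative assumptions, not the actual submit file generated by the Factory Starter:

# gensetup.sub -- hypothetical vanilla-universe job that unpacks and
# installs the Pilot Generator on the local host
universe   = vanilla
executable = GenInstall.sh
arguments  = --basedir $ENV(PWD)/Generator
output     = gensetup.out
error      = gensetup.err
log        = gensetup.log
queue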

Setting up and starting the Generator could well have been done manually and separately with simpler scripts, without Condor involved; however, the factory user is better served if all the deployment steps are presented coherently. The Factory Starter is designed to connect the setup and startup steps into one single logical operation, using Condor's monitoring abilities to facilitate Generator resubmission. As mentioned in the Introduction, occasional downtimes of the headnode can occur, during which the schedd glidein processes would be killed; a similar scenario can also occur on the local host. Using the existing monitoring facilities in Condor for these events therefore significantly simplifies the implementation of the Factory's monitoring functionality.

Schematic

Fig. 1 below shows a Pilot Generator submitting pilots locally on the headnode to the native batch system. The generator is not required to live on the headnode.

  • Fig 1:
    PF_schematic2.JPG

Code

The code is located under the panda/pilotfac area of the BNL Subversion repository; look there for the following modules.

  • Schedd Glidein
    • condor_glidein
  • Pilot Generator
    • pilotScheduler.py (main)
    • ServiceRegistry.py
    • SchedulerUtils.py
    • constants.py
  • Factory Starter
    • PLauncher.py (main)
    • launcherController.py
    • GenSetup.sh

Using Pilot Factory

Pilot submission using the Pilot Factory proposed above involves the following procedures:

1. Deploy Schedd Glideins

1.1 Submit schedd glideins to the site. For example, the following command requests two schedd glidein instances on gridgk10.racf.bnl.gov (note that this is a made-up host name; replace it with the desired host):

condor_glidein -count 2 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk10.racf.bnl.gov/jobmanager-fork -type schedd

1.2 If you need to customize the setup or startup submit files and submit them manually, add -gensubmit to generate these submit files; in this case the glidein job itself will not be submitted:

condor_glidein -count 2 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk10.racf.bnl.gov/jobmanager-fork -type schedd -gensubmit

1.3 List the available glideins once they are up and running:

condor_status -schedd -c 'is_glidein=?=true'

1.4 Check the schedd glidein's queue to see any jobs running on the glidein:

condor_q -name myschedd01@gridgk10.racf.bnl.gov

1.5 See the ScheddGlidein page for more examples.

2. Run Pilot Generator with Factory Starter

Since the Factory Starter essentially installs and executes pilotScheduler, the discussion starts with the glidein-specific usage of pilotScheduler, followed by instructions on using PLauncher to start this program with the desired behavior.

2.1 Use pilotScheduler as Pilot Generator

In the following examples, assume that a schedd glidein instance, myschedd01@gridgk10.racf.bnl.gov, is at our disposal. The goal is to use pilotScheduler to monitor this glidein instance and submit pilots to it.

Submitting pilots to a glidein using pilotScheduler involves two steps:

1. Start Pilot Generator as a glidein monitor.

[example]

"Run pilotScheduler as a glidein monitor for the glidein instance: myschedd01@gridgk10.racf.bnl.gov"

[command]

./pilotScheduler.py --glidein-monitor --client-schedd=myschedd01@gridgk10.racf.bnl.gov --client-collector=gridui03.usatlas.bnl.gov:24000 > pf_monitor.out 2> pf_monitor.err &

[option description]

  1. --glidein-monitor puts pilotScheduler.py in glidein monitoring mode
  2. --client-schedd tells the monitor which glidein instance to examine. If pilotScheduler is started by the Factory Starter, this option may not be needed, since the Factory Starter obtains this information automatically (more below)
  3. --client-collector tells the monitor which collector host the glidein instance advertises its ClassAds to. This is optional, since pilotScheduler can obtain this value automatically by querying the glidein instance
  4. The monitor outputs are redirected to separate files, as shown in the command, to prevent them from clobbering the terminal

[discussion]

For reasons mentioned above, a simpler and equivalent command would be:

./pilotScheduler.py --glidein-monitor --client-schedd=myschedd01@gridgk10.racf.bnl.gov > pf_monitor.out 2> pf_monitor.err &

The commands above update the states of the pilot jobs and transfer the pilot job outputs, together with the results of the corresponding user jobs, to a predetermined pilot directory (part of the path name is determined at runtime). The overall results of the pilot jobs can be viewed on the AutoPilot monitoring page.

2. Start Pilot Generator as a pilot submitter.

[example]

"Start pilotScheduler as a pilot submitter that submits pilots to the glidein instance: myschedd01@gridgk10.racf.bnl.gov whose headnode contains the local schedd, gridgk10.racf.bnl.gov and the collector gridgk10.racf.bnl.gov:9660"

[command]

./pilotScheduler.py --queue=bnl-glidein-cc --nqueue=5 --pilot=myPilot --pandasite=mySite --client-schedd=myschedd01@gridgk10.racf.bnl.gov --server-schedd=gridgk10.racf.bnl.gov --server-collector=gridgk10.racf.bnl.gov:9660 > pf_pilot_submit.out 2> pf_pilot_submit.err &

[option description]

  1. --client-schedd: as for the glidein monitor, this option takes the name of the glidein schedd. Naturally, its value needs to be consistent with the monitor started earlier
  2. --server-schedd: specifies the schedd instance on the headnode
  3. --server-collector: specifies the collector instance on the headnode

The following are regular command-line options for pilotScheduler; see the AutoPilot page for more information.

  1. --queue specifies the name of the job queue to which pilots are submitted (see explanations here)
  2. --nqueue sets the maximum number of pilots allowed to sit in the job queue
  3. --pandasite determines the type of user job that pilots will retrieve from the Panda server
  4. --pilot selects the type of pilot script; each type of user job is associated with a particular type of pilot.

[discussion]

This submits pilots to a schedd glidein running on a site. The settings are slightly more involved than for the monitor, since parameters associated with the site's native batch system need to be passed to pilotScheduler. In this example, assume that the target site uses Condor and that the schedd and collector on the headnode are named gridgk10.racf.bnl.gov and gridgk10.racf.bnl.gov:9660 respectively. These two values are the only information the factory user needs to gather beforehand. The name of the schedd is usually the host name of the headnode; getting the name of the collector takes a bit more work. One way is to run globus-job-run against the site with the condor_config_val command [*3]. If globus-job-run is not available, sending a Condor-G job to the site with condor_config_val as the executable achieves the same result (see the sketch below). Click on the queue name (bnl-glidein-cc in this example) to inspect all the pilots sent to the glidein [*4].
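
For the Condor-G variant, a minimal sketch of such a query job might look as follows; the submit file and the remote path to condor_config_val are illustrative assumptions, not part of the factory code:

# query_collector.sub -- hypothetical Condor-G job that runs the site's own
# condor_config_val to report the COLLECTOR_HOST macro
universe      = grid
grid_resource = gt2 gridgk10.racf.bnl.gov/jobmanager-fork
# use the executable already installed on the headnode (path may differ per site)
executable    = /usr/bin/condor_config_val
transfer_executable = False
arguments     = COLLECTOR_HOST
output        = collector_host.out
log           = query_collector.log
queue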

2.2 Use Factory Starter to Initiate Pilot Submissions

The Factory Starter can dynamically install and execute the Pilot Generator, similar in logic to the dynamic deployment of glideins. The Factory Starter submits the Pilot Generator as a user job, with two options available: (1) deploy the Pilot Generator locally as vanilla universe jobs, or (2) install and execute the Pilot Generator remotely as Condor-G/Condor-C jobs, if it is desirable to run the submitter directly on the headnode to further reduce network traffic. The example commands below perform a local installation.

1. Perform environment variable setup (Optional)

[command]

source GenSetup.sh

[discussion]

This script exports environment variables carrying values for common PLauncher user options, such as the URL for retrieving the Pilot Generator tarball (--inurl), the base directory (--basedir), the environment setup file for the Pilot Generator (--env), and the contact string (--contactstr), among others. The script can be extended to provide values for literally every option needed to run PLauncher. While it is not necessary to provide default values via GenSetup.sh, the factory user may want to configure tedious option values beforehand; for example, the value of --inurl can be very long if there are multiple URLs available for retrieving the tarball. An XML version of this configuration file will be supported later on.
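
As a rough illustration, an excerpt of such a setup script might look like the following; the variable names and the URL are hypothetical, not the ones actually used by GenSetup.sh:

# hypothetical excerpt of GenSetup.sh: export defaults for common
# PLauncher options so they can be omitted on the command line
export PF_INURL="http://www.example.org/pilotfac/Generator.tar.gz"   # --inurl
export PF_BASEDIR="$HOME/Generator"                                  # --basedir
export PF_ENV="setupBNL.sh"                                          # --env
export PF_CONTACTSTR="gridgk10.racf.bnl.gov/jobmanager-condor"       # --contactstr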

2. Start Pilot Generator in glidein monitoring mode with Factory Starter

[example]

"Locally install and execute Pilot Generator as a glidein monitor for the glidein instance, myschedd01@gridgk10.racf.bnl.gov"

[command]

./PLauncher.py --contactstr=gridgk10.racf.bnl.gov/jobmanager-condor --env=setupBNL.sh --local --client-schedd=myschedd01@gridgk10.racf.bnl.gov

[option description]

  1. --contactstr takes a contact string (or resource string), which determines the host where the glideins are located and the jobmanager type; in this example, PLauncher will dynamically install and execute the Pilot Generator in monitoring mode for the glidein instance on gridgk10.racf.bnl.gov. The option --contactstr can be replaced by --glidein-host in combination with --jobmanager; if --jobmanager is not specified, jobmanager-condor is the default.
  2. --environ or --env specifies the environment setup file to be executed in order to run pilotScheduler; the environment variables in the file are exported and evaluated in the shell, and the values are inserted into the Environment command of the startup submit file. While inserting values for these variables is not necessary if the Pilot Generator is installed locally [*5], it is important for a remote installation, where environment variables do not necessarily evaluate to the expected values.
  3. --local installs the Pilot Generator on the local host, with the default base directory being Generator under the current working directory (i.e. `pwd`/Generator); if not specified, PLauncher deploys the Pilot Generator on the remote host given by --contactstr or --glidein-host.
  4. --client-schedd: the value of this option is passed to pilotScheduler to determine the pilot submission target (the glidein); it is semantically the same as pilotScheduler's --client-schedd option. If not specified, PLauncher automatically chooses the best glidein on the given host (determined by --contactstr) using a glidein selection algorithm; see the comments for the select_glidein() function for details.

[discussion]

Since --contactstr is essentially a combination of --glidein-host and --jobmanager, the same command can be rewritten as:

./PLauncher.py --glidein-host=gridgk10.racf.bnl.gov --env=setupBNL.sh --local --client-schedd=myschedd01@gridgk10.racf.bnl.gov

You have probably noticed that --jobmanager=jobmanager-fork is missing. Since the example above is a local deployment (because of --local), the value of --jobmanager is irrelevant and would be ignored even if specified; do add the --jobmanager option, though, for a remote deployment of the Pilot Generator, especially when a jobmanager type other than jobmanager-fork is desired (e.g. jobmanager-condor). The option --jobmanager defaults to jobmanager-fork when not specified.

If you would like to examine or modify the submit files for the setup or startup jobs, include --gen-submit in the command:

./PLauncher.py --glidein-host=gridgk10.racf.bnl.gov --env=setupBNL.sh --local --client-schedd=myschedd01@gridgk10.racf.bnl.gov --gen-submit

In this case the setup and startup jobs for the Pilot Generator will NOT be submitted; only the submit files are generated.

The commands above obviously assume that the values of the options, including the contact string and the glidein schedd, are not given in GenSetup.sh; otherwise they could all have been omitted. Notice also that nothing in the command-line options says to run the Pilot Generator as a glidein monitor; this is because PLauncher automatically inserts "--glidein-monitor" as a job parameter to the Pilot Generator. The default is to run the Pilot Generator as a glidein monitor.

3. Start Pilot Generator in pilot submission mode with Factory Starter

[example]

"Locally install and execute Pilot Generator that submit pilots to myschedd01@gridgk10.racf.bnl.gov whose headnode contains the local schedd, gridgk10.racf.bnl.gov and the collector, gridgk10.racf.bnl.gov:9660"

[command]

./PLauncher.py --contactstr=gridgk10.racf.bnl.gov/jobmanager-condor --env=setupBNL.sh --local --client-schedd=myschedd01@gridgk10.racf.bnl.gov --server-schedd=gridgk10.racf.bnl.gov --server-collector=gridgk10.racf.bnl.gov:9660 --jobopts='--queue=bnl-glidein-cc --nqueue=5 --pilot=myPilot --pandasite=mySite'

[option description]

  1. --server-schedd specifies the name of the schedd on the site's headnode
  2. --server-collector specifies the name of the collector on the site's headnode
  3. --jobopts takes a string that is passed to pilotScheduler as its command-line options. As you can see, the options enclosed in --jobopts are almost identical to the earlier pilotScheduler example, except that --server-schedd and --server-collector are left out: PLauncher automatically inserts these as job parameters to pilotScheduler, so the glidein-specific options become optional within --jobopts.

[discussion]

You can also include --glidein-debug to see more debugging messages. Drop the --local option to perform a remote Pilot Generator setup.

Factory Implementation: Second Phase

The first phase of Pilot Factory development produced what is essentially a factory for generic pilots. In response to security concerns, it is desirable to include gLExec with the pilots. Since it is technically difficult to trace and identify the actual user job retrieved by a pilot on a worker node, malicious activity becomes possible when pilots run under the same UID as user jobs. For example, pilot submitters may have higher file permissions than the users who own the jobs; sharing the pilot's UID would in effect "promote" a job user's access to security-sensitive files. gLExec intervenes in this scenario to isolate a pilot from its user job by assigning a separate, legitimate UID to the user job.

The Condor startd is a highly desirable candidate for integrating pilots with gLExec, since it already has an interface to gLExec. The startd also extracts useful system information about the computing element, giving pilots a more comprehensive survey of the computing environment. Moreover, the startd can partition a CPU (or multiple CPUs on MPI machines) into several logical job slots (virtual machines), allowing pilots to assign jobs of different purposes, characterized as background and foreground jobs, to the same machine. Please refer to the Startd Pilot and SPF sections in PilotFactoryPlan for more details.
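
As an illustration of the configuration involved, the following sketch shows how a startd might be told to invoke gLExec and split a CPU into two slots. The macro values are hypothetical, and the slot syntax shown is that of later HTCondor releases; the exact knobs depend on the Condor version:

# hypothetical condor_config excerpt for a gLExec-enabled startd
GLEXEC_JOB = True                    # run user jobs through gLExec
GLEXEC = /opt/glite/sbin/glexec      # site-specific path to gLExec
# split one CPU into two logical slots, e.g. for a background and a
# foreground job managed by the same pilot
NUM_SLOT_TYPES = 2
SLOT_TYPE_1 = cpus=50%, memory=50%
SLOT_TYPE_2 = cpus=50%, memory=50%
NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS_TYPE_2 = 1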

Historical Notes and Alternative Plans

1. Multiple pilot submissions with workflow control

  • Evaluate whether a DAG running in a schedd and submitting chained jobs (use pilot output to control downstream steps) can be used to implement the factory
  • Will it require modifications to dagman? Possibly

2. Submission Target

  • Head node? Not the Globus gatekeeper. We don't want to add gatekeeper load, and we don't want to be susceptible to gatekeeper load. Dedicated machine?
  • Condor head node?
  • Dedicated machine? Won't always be possible
  • Edge services facility / VO box? Ideal?
  • Worker node? Sites unlikely to like the idea of a persistent pilot factory on worker node. Negotiate for this on opportunistic sites without a dedicated machine or VO box?

3. Implementation Issues:

  • Condor-G vs Condor-C for pilot factory submission
  • Condor-C avoids Globus gatekeeper and GRAM
  • Pilot factory submission mechanism. Use glide-in to launch
  • schedd has same scalability limits as GRAM, circa 1 Hz submission rate. Implications

Other suggestions

  • A self-contained startd on the worker node that can fetch jobs (Miron)

Meetings and discussions

Notes

[1] "Pilot generator" is synonymous with "pilot submitter" and "pilot submission system". These terms are self-explanatory and were used in various contexts throughout the development; they are used interchangeably here as context dictates.

[2] Two schedds need to be distinguished in Condor-C: the client schedd and the server schedd. The client schedd is the one to which the user submits jobs, and it is from the client schedd that jobs are forwarded to the server schedd, where they are eventually scheduled as job requests. The server schedd acts as the job queue from which jobs are matched against free machines. The schedd glidein is deployed on the headnode to function as the client schedd, with the headnode's local schedd as the server. If some other type of batch system is used on the headnode, that batch system as a whole functions as the server for the glidein schedd.

[3] For example, the following command uses condor_config_val to query the value of COLLECTOR_HOST macro from the collector on the host gridgk10.racf.bnl.gov:

globus-job-run gridgk10.racf.bnl.gov/jobmanager-fork -s `which condor_config_val` collector_host

Alternatively, find the path to the Condor configuration file, either by asking the site administrator or with the globus-job-run command. The headnode's configuration file is usually located at /etc/condor/condor_config, and the name of the collector is the value of the COLLECTOR_HOST macro in that file. The command to query the location of the Condor configuration file is as follows:

globus-job-run gridgk10.racf.bnl.gov/jobmanager-fork -s `which condor_config_val` -config

[4] A job queue is a logical unit that defines attributes such as the system it supports (e.g. Condor, OSG), the submit command used to send jobs to the queue, the working directory, the data directory, the name of its gatekeeper, and the associated submit file template, among others. The profile of a job queue is maintained in the schedconfig table of AutoPilot's database. A glidein job queue is like a site's regular job queues except that its job template is a Condor-C or Condor-to-non-Condor submit file. See the queue bnl-glidein-cc for an example.

[5] Specifying --env=myEnv.sh mainly performs two operations: (1) export all environment variables defined in the file (e.g. myEnv.sh); (2) insert echo statements for the values of these variables into myEnv.sh so that they are later written to the job output file of the setup job (Condor redirects standard output and error to job files). The values captured in the output file are parsed by PLauncher and inserted into the Environment command of the startup job's submit file; when the startup job runs, they are exported as the shell environment for the Pilot Generator. Naturally, a local installation does not need the second operation, since all the variables will already have been exported into the shell environment by the first; for a remote installation, however, the second step is essential so that the Pilot Generator on the remote machine has an environment to refer to.
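
A rough sketch of these two operations in shell terms; the variable names are hypothetical, not those actually handled by PLauncher:

# (1) export the variables defined in the env file locally
source myEnv.sh
# (2) append echo statements so the setup job prints the values to its
#     stdout, which Condor returns in the job output file
cat >> myEnv.sh <<'EOF'
echo "PANDA_URL=$PANDA_URL"
echo "PILOT_HTTP_PROXY=$PILOT_HTTP_PROXY"
EOF
# PLauncher then parses the output file and writes something like
#   environment = PANDA_URL=...;PILOT_HTTP_PROXY=...
# into the startup job's submit file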

The testbed setup and how-to

General setup

In the following, we are referring to the production account and the directories under its home directory. This is the crontab on the primary test machine pandadev01 (line breaks inserted for readability):

0 0,12 * * * /usatlas/u/sm/prx.sh > /dev/null 2>&1
0 1 * * * /usatlas/u/sm/panda_nightly.sh > /dev/null 2>&1

##### Pilot Factory #####
# all machine-dependent shell scripts and config file are kept in *pilotfac* directory i.e. /usatlas/u/sm/pilotfac
# These files include: (a) GlideinMonitor.sh (b) glidein-pilot-cron.sh (c) schedd_glidein.config

# glidein factory monitor
0,10,15,20,25,30,40,45,50,56 */1 * * *\
source /usatlas/u/sm/pilotfac/pandadev01.d/GlideinMonitor.sh\
 --config-file=/usatlas/u/sm/pilotfac/pandadev01.d/schedd_glidein.config > /usatlas/u/sm/pilotfac/glidein_mon/mon.log 2>&1

# glidein pilot monitoring
1,31 */3 * * *    /usatlas/u/sm/pilotfac/pandadev01.d/glidein-pilot-cron.sh\
 --monitor --glidein-host=gridgk01 > /dev/null 2>&1
# glidein pilot submission
5,35 */3 * * *\
/usatlas/u/sm/pilotfac/pandadev01.d/glidein-pilot-cron.sh\
 --queue=BNL_GLIDE_1-condor --nqueue=2 --pandasite=ANALY_BNL_test3 --pilot=atlasProd --glidein-host=gridgk01 > /dev/null 2>&1

There are essentially three cron jobs: one for the glidein and the other two for AutoPilot. Basically the factory consists of the schedd glidein and AutoPilot. The schedd glidein is automatically launched by GlideinMonitor.sh (which sets up and installs a schedd glidein on a given remote host whenever no running glidein exists on that host). The other two cron jobs are AutoPilot's monitor and pilot submitter.

Condor configuration

Before setting up a factory, you need to ensure that the Condor environment is compatible with the glidein's requirements. The global condor_config on the production machines does not work with glideins, mainly because it uses low port numbers for outbound TCP connections, which are then blocked by the firewall.

For an example of a working Condor configuration file, log on to the testbed machine as the production account and source mycondor (mycondor/bin/mycondor), which assigns CONDOR_CONFIG to a file that works with the glidein and prints the path to that condor_config. You will need a valid grid certificate to run this script properly.
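
The effect is roughly the following; the config path shown is an illustrative assumption, not the actual location used by mycondor:

# point the Condor tools at a glidein-compatible configuration
export CONDOR_CONFIG=$HOME/mycondor/etc/condor_config   # hypothetical path
echo $CONDOR_CONFIG                                     # show which config is in use
# verify that a firewall-friendly (high) outbound port range is configured
condor_config_val LOWPORT HIGHPORT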

Latest README.txt (in SVN)

The latest documentation checked in with the code can be found in the README.txt file.


Major updates:
-- MaximPotekhin - 29 Oct 2008
-- BarnettChiu - 10 Dec 2007
-- TorreWenaus - 21 Sep 2006



Attachments


pdf PF_schematic.pdf (26.6K) | PohsiangChiu, 11 Dec 2007 - 00:48 | PF Schematic
jpg PF_schematic.jpg (28.7K) | PohsiangChiu, 11 Dec 2007 - 02:00 |
bmp PF_schematic.bmp (631.6K) | PohsiangChiu, 11 Dec 2007 - 02:08 | PF Schematic 2
jpg PF_schematic2.JPG (22.1K) | PohsiangChiu, 11 Dec 2007 - 02:20 | PF Schematic 2