"What is Condor and what does it do?"
Here is a big picture of a Condor pool, a set of computers managed by Condor. There are four essential components in Condor:
- schedd: schedules and sends user jobs to available computing resources; represents the user's jobs.
- startd: publishes the characteristics (e.g. memory, load average, etc.) of a computing resource, sets policies for resource usage (e.g. maximum walltime), and runs user jobs.
- collector: receives info from all the other daemons; acts as a (memory-resident) database that holds info on the Condor pool.
- negotiator: matches a user job with an available computing resource; acts as the "brain" of the pool, with match-making algorithms built in.
Here is how they connect to each other:
(1) The user submits a job to the scheduler (schedd) with a job description that says what kind of resource she needs (e.g. type of OS, machine architecture, memory requirement, etc.).
(2) The startd gathers info on the machine's characteristics and the machine's preferences over user jobs.
(3) In the meanwhile, the collector gathers information from both schedd and startd.
(4) The collector sends the necessary info to the negotiator.
(5) The negotiator selects the job with the highest priority, runs the match-making algorithm(s), and determines the best fit for this job.
(6) The negotiator notifies the schedd (which owns that matched job) that there is a machine ready to rock.
(7) The schedd then initiates communication with the startd on the matched machine, saying "I'll send a job to you; please reserve the resource for me."
(8) The schedd then sends the job to the startd.
(9) The startd runs the job.
Of course, there are other details, but these steps are the keys.
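The negotiation cycle in steps (4)-(6) can be sketched as a toy match-making loop. This is only a simplified illustration, not Condor's actual algorithm: real Condor evaluates full ClassAd expressions and priorities, while here machine and job "ads" are plain dicts with made-up fields and values.

```python
# Toy sketch of the negotiator's match-making cycle.
# All names and numbers below are hypothetical.

def matchmake(jobs, machines):
    """Pair each job (highest priority first) with the first free
    machine that satisfies the job's requirements."""
    matches = []
    free = list(machines)
    for job in sorted(jobs, key=lambda j: j["priority"], reverse=True):
        for m in free:
            if (m["memory"] >= job["req_memory"]
                    and m["arch"] == job["req_arch"]):
                matches.append((job["id"], m["name"]))
                free.remove(m)   # the machine is now claimed
                break
    return matches

jobs = [
    {"id": "job1", "priority": 5, "req_memory": 2048, "req_arch": "X86_64"},
    {"id": "job2", "priority": 9, "req_memory": 1024, "req_arch": "X86_64"},
]
machines = [
    {"name": "node1", "memory": 1024, "arch": "X86_64"},
    {"name": "node2", "memory": 4096, "arch": "X86_64"},
]
print(matchmake(jobs, machines))
```

Note that job2 is matched first despite being listed second, because the negotiator works through jobs in priority order.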
To quickly get familiar with a Condor system, a good way to start is to read the general descriptions of the aforementioned components and visualize how they interact with each other. Below are a few links to look at:
- A description of a user job (aka submit description file, SDF) has its own syntax. Writing test jobs is a good way to get a feel for Condor's abilities (and limitations) and to interact directly with schedd and startd. Please refer to Ch 2 in the manual to see how to write a job description and how to submit a job; specifically, here is the link to a few examples of job descriptions. An SDF is essentially a collection of "commands"; once finalized, it is submitted via the condor_submit command. To see the full capabilities of SDFs and condor_submit, please check the condor_submit page.
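As a minimal illustration of the SDF syntax (the executable and file names below are hypothetical; see the condor_submit page for the authoritative list of commands), a simple vanilla-universe submit file might look like:

```
# hypothetical submit description file -- submit with: condor_submit myjob.sdf
universe     = vanilla
executable   = my_analysis.sh
arguments    = input.dat
output       = job.out
error        = job.err
log          = job.log
requirements = (OpSys == "LINUX" && Arch == "X86_64")
queue 1
```

The requirements line is the user's half of the match-making: the negotiator pairs it against the machine attributes that each startd publishes to the collector.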
- For general descriptions of Condor daemons, please visit here. Specifically sec 3.1.2.
- The capabilities of each daemon can be adjusted/fine-tuned by a variety of configuration macros (or variables). For this information, please refer to Condor configuration files. Specifically, the detailed description of the daemons' configuration variables starts at sec 3.3.9.
- A job is categorized by different universes. The user knows where (local or remote) and how (sequential or parallel) a job is to be processed simply by identifying its universe. The Vanilla and Grid universes are the most frequently used so far at BNL. Please refer to sec 2.4.1 for general descriptions of all the universes and sec 5.3 for the different types of Grid-universe jobs.
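For comparison with a vanilla-universe job, a Grid-universe (Condor-G) job names the remote gatekeeper it should be forwarded to. This is a hedged sketch: the gatekeeper hostname below is hypothetical, and the grid type (gt2, i.e. pre-WS GRAM) is just one of the types described in sec 5.3.

```
# hypothetical Condor-G submit file: the job is forwarded via GRAM
universe      = grid
grid_resource = gt2 gridgk01.example.org/jobmanager-condor
executable    = my_analysis.sh
output        = job.out
log           = job.log
queue
```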
- For presentations on Condor, please click here.
"There are so many jobs with so few computer resources at hand. What do we do?"
Condor people introduced a concept called glidein, thereby enabling users to turn remote resources into their own. This is essentially achieved by remotely installing a daemon, startd, that publishes the characteristics and user-defined usage policies of the remote machine to the collector. The user will then be able to submit jobs via schedd to this newly acquired computing resource. Please refer to sec 5.4 for more information.
The glidein mechanism mentioned above is based on the Condor startd; in other words, it is a startd-based glidein. While the idea is good, there are a few problems to think about: (1) How many glideins should a user request at a time? Requesting too many could waste computing resources, but requesting too few when more could have been offered may not optimize resource usage. Where is the balance? (2) Acquired resources have a life span that depends on the policy set by the remote site; once the "lease" on these resources expires, the user will have to make a new glidein request. (3) Jobs will still need to go through the network to reach these glidein resources, since the submit hosts and the glideins are most probably not on the same file system. (4) The gatekeeper would still experience some GRAM traffic while these glidein requests are in progress, since the requests are essentially Condor-G jobs.
In particular, (4) is a major issue when a relatively huge number of Condor-G jobs comes into the gatekeeper machines, causing heavy GRAM traffic. To maximize the job submission rate and achieve local submission without going through GRAM, the idea of scheduler glidein was conceived. Once the glidein scheduler is set up, it serves as a conduit between user jobs and the remote compute resources. In other words, users submit jobs to the glidein scheduler just the way they do to the local scheduler. The glidein scheduler then matches a job, just like the local scheduler, with the best resource satisfying both the user's demands and the resource's usage policies. As a result, users are able to tap into remote resources and use them as if they were in their local pool. In one way or another, the end result of scheduler glidein is similar to that of Condor's original startd glidein.
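Conceptually, once a glidein schedd is up, submission looks the same as local submission; the user only points condor_submit at a different schedd. The schedd name and host below are hypothetical, and this is only a sketch of the usage pattern, not a tested recipe:

```
# local submission: goes to the local schedd
condor_submit myjob.sdf

# submission to a (hypothetical) glidein schedd by name;
# -remote also spools the job's input files to that schedd
condor_submit -remote glidein-schedd@gateway.example.org myjob.sdf
```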
The implementation of Condor scheduler glidein (also referred to as schedd-based glidein) is based on the existing Condor startd-based glidein. Please visit here for the implementation details.
Please visit the PilotFactory page for general descriptions of Pilot Factory and the PilotFactoryPlan page for descriptions of how Pilot Factory is to be implemented. Phase 1, the Generic Pilot Factory, is near the end of implementation, and the source code will soon be available for browsing.
Please visit GlideinWMS. Also see the link to the last JIT meeting (next section below), where you can find nice graphical illustrations of GlideinWMS.
Just-in-time Workload Management Project
is the web page from the last meeting, containing slides on Panda-Condor integration, GlideinWMS, and Pilot Factory, among others.
- 24 May 2013