
PilotFactoryPlan - Pilot Factory Development Plans


Proposed Solution

To make the best use of the Condor features available, the PilotFactory component can be built from a schedd-based glidein and an enhanced version of the Condor startd. The schedd-glidein mechanism facilitates local pilot submission by exporting a Condor schedd to the remote front-end machine and using that schedd as a gateway to disseminate pilots to the backend nodes. The new version of the startd operates in conjunction with the generic pilot [*1] to further increase the efficiency of pilot submission [*2] and to allow for job preemption (if desired). Based on these ideas, the PilotFactory would be developed in three phases, as described in the following sections:

1. Pilot Factory with Generic Pilot (Generic Pilot Factory, GPF)

The factory consists of a Condor schedd and a Generator application. The Condor schedd is set up and activated on the gatekeeper or on a dedicated machine at the remote site via a schedd glidein. For ease of explanation, let us call the machine that the local Condor pool glides into with a schedd the head node. The Generator is then submitted as a Condor-C job from the local submit machine and runs under the remote schedd as a scheduler universe job [*3]. Once the Generator is up and running, it submits pilot jobs to the Condor schedd, which in turn submits those pilots to the native batch system [*4]. An advantage of using the schedd is that the Generator does not need to know how to talk to batch systems other than Condor, since the schedd can send jobs to different types of batch systems via GAHP [*5]. The policy for the rate at which pilots are handed to the local batch system can be adjusted via the Generator.
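
To make the Generator's role concrete, here is a minimal sketch of such a submission loop in Python. It assumes a generic-pilot wrapper script (pilot.sh), a schedd that forwards grid-universe jobs to a PBS-managed cluster, and illustrative rate parameters; none of these names or values come from an existing implementation.

    import os
    import subprocess
    import tempfile
    import time

    PILOT_SCRIPT = "pilot.sh"     # hypothetical generic-pilot wrapper script
    SUBMIT_INTERVAL = 60          # seconds between submission cycles (policy knob)
    PILOTS_PER_CYCLE = 5          # pilots handed to the schedd per cycle

    def submit_description(script, count):
        """Submit description handed to the glided-in schedd; the
        grid_resource value depends on the site's native batch system."""
        return "\n".join([
            "universe      = grid",
            "grid_resource = pbs",
            "executable    = %s" % script,
            "output        = pilot_$(Cluster).$(Process).out",
            "error         = pilot_$(Cluster).$(Process).err",
            "log           = pilot_factory.log",
            "queue %d" % count,
        ]) + "\n"

    def submit_pilots(count):
        """Write a submit description to a file and run condor_submit on it."""
        with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
            f.write(submit_description(PILOT_SCRIPT, count))
            subfile = f.name
        try:
            subprocess.check_call(["condor_submit", subfile])
        finally:
            os.unlink(subfile)

    if __name__ == "__main__":
        # The Generator itself runs under the remote schedd as a scheduler
        # universe job and loops forever at a configurable rate.
        while True:
            submit_pilots(PILOTS_PER_CYCLE)
            time.sleep(SUBMIT_INTERVAL)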

This scheme would require no modification of the Condor daemons or the pilot scripts; however, it would require determining the proper Condor configuration for a schedd-based glidein as well as writing a Generator application. An application in TestPilot has requirements and functionality similar to the Generator's. Please see the related information here.

As alluded to before, an open question is how and where the pilot factory runs. One option is to start the factory remotely via GRAM on the gatekeeper node. Another is to run the factory on a dedicated submit node distinct from the gatekeeper. Since policies for resource usage differ from site to site, both options should be supported.

2. Startd Pilot

One potential problem of running the Generator on the head node (i.e. a gatekeeper or dedicated machine) is that a tremendous amount of pilot traffic could still occur between the head node and the worker nodes [*6], since all the pilot jobs are submitted through the head node; in addition, user analysis jobs as well as production jobs are expected to increase dramatically once the LHC at CERN is activated and the high-energy physics experiments begin next year. While heavy wide-area network traffic is avoided through local pilot submission, a substantial overhead due to these job submissions still exists. This is what triggers the idea of the startd pilot: instead of generating pilots solely on the head node itself, what if the worker nodes could also share some of the load? In other words, if worker nodes could repeatedly run the pilot script themselves and then strategically run the jobs obtained via those pilots, the head node would not have to perform as many job submissions to the native batch system. To achieve this, instead of submitting generic pilots, the Generator submits startd pilots. See this justification for more details on how the startd pilot can help improve pilot submission efficiency.

The startd pilot, as the name suggests, is a cooperation between the Condor startd and the generic pilot. A message exchange interface therefore has to be built so that the startd can communicate with the pilot. The pilot needs to pass job information (e.g. job description, job priority, etc.) to the startd so that the startd knows how to run the jobs [*7]. Conversely, the startd needs to inform the pilot of the current state of its job slots (or VMs), so that if none of the job slots is available for executing a new job, the pilot simply exits without doing anything further. Depending on the policy for running jobs, the startd may also need to explicitly request a certain type of job from the pilot.

Consider, for example, the following scenario: a startd pilot is running on a backend node, and the startd is configured to manage two job slots, one of which is currently occupied by a production job. While the production job runs, the startd continues running the pilot, which, if not told to fetch a specific type of job, has a good chance of acquiring another production job. There is nothing wrong with running only production jobs on the same worker node; however, allowing this to happen implies that user analysis jobs would have relatively less chance, or in the worst case no chance, to run on the available worker nodes at a given snapshot of the site cluster, leaving analysis jobs piled up on the Panda server. Since the volume of analysis jobs is expected to be huge, a better practice is to allocate at least one of the job slots on any worker node to user analysis jobs. If the pilot, having no clue about the desired job type, fetches another production job, then either the newly arriving job has to wait in a separate/external job queue to allow for an analysis job on the next try (which, in some sense, defeats the purpose of the job queue already in the Panda server), or the currently running production job has to be preempted. Production jobs, however, are the primary data source for user analysis jobs, so preempting them is probably not a smart choice. In any case, if the pilot can wisely select the right type of job (i.e. an analysis job in the above case), the external job queue could on average hold fewer pending jobs or may not be necessary at all [*8]. As for the implementation, the Condor startd needs a slight enhancement to support communication with an external component such as a pilot; a minimal sketch of such a message exchange is given below, followed by a possible scheme for constructing a Startd Pilot.
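
The message format between the startd and the pilot has not been specified; purely as an illustration, the sketch below assumes a simple line-oriented JSON exchange over a local Unix socket and shows the pilot side of the conversation. The socket path, the field names (free_slots, want_job_type), and the dummy job retrieval are all hypothetical.

    import json
    import socket
    import sys

    STARTD_SOCKET = "/tmp/startd_pilot.sock"   # hypothetical rendezvous point

    def fetch_panda_job(job_type):
        """Placeholder for the pilot's usual job retrieval from the Panda
        JobDispatcher; here it simply returns a dummy job description."""
        return {"PandaID": 12345, "jobType": job_type, "priority": 100}

    def main():
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(STARTD_SOCKET)
        channel = sock.makefile("rw")

        # 1. The startd reports the state of its job slots and, optionally,
        #    the type of job it wants next.
        status = json.loads(channel.readline())
        if not status.get("free_slots"):
            sys.exit(0)              # no slot available: exit without fetching

        # 2. The pilot fetches a job of the requested type and passes the
        #    job description back to the startd.
        job = fetch_panda_job(status.get("want_job_type", "analysis"))
        channel.write(json.dumps(job) + "\n")
        channel.flush()

    if __name__ == "__main__":
        main()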

With a message exchange interface (or communication channel) of this kind in place, the new version of the startd daemon sends a message to the generic pilot to obtain a job, interprets the job description (i.e. converts the Panda job description into ClassAd format), and runs it on specially configured VMs (i.e. Condor job slots sharing the same CPU).
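
The conversion step could, very roughly, look like the sketch below. The Panda job description is treated here as a plain dictionary with hypothetical field names, and the resulting ClassAd carries only a minimal set of attributes; the real formats on both sides may differ.

    def panda_to_classad(job):
        """Rough sketch: turn a Panda job description (a dict here) into a
        small job ClassAd string that the startd/starter could act on."""
        def literal(value):
            if isinstance(value, str):
                return '"%s"' % value
            if isinstance(value, bool):
                return "True" if value else "False"
            return str(value)

        attrs = [
            ("Cmd",        job.get("transformation", "run_panda_job.sh")),
            ("Arguments",  job.get("jobPars", "")),
            ("JobPrio",    job.get("priority", 0)),
            ("PandaID",    job.get("PandaID", 0)),
            ("IsAnalysis", job.get("jobType") == "analysis"),
        ]
        return "[ " + "; ".join("%s = %s" % (k, literal(v)) for k, v in attrs) + " ]"

    # Example:
    #   panda_to_classad({"PandaID": 12345, "jobType": "analysis",
    #                     "transformation": "runAthena.sh", "priority": 500})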

Each VM could run either a background job or a foreground job. Background jobs are those that typically have lower priority but are expected to stay in their job slots until completion; production jobs would be classified as background jobs. Foreground jobs have high priority and can be subject to job preemption when necessary; user analysis jobs are usually run as foreground jobs. The precise use of VM allocation and how it can be applied to realize various job preemption strategies can be complex and will be discussed in other threads. Nonetheless, here is an example scenario to put these ideas together:

Example of Using Startd Pilot

On a worker node, a startd pilot runs in a job slot assigned by the native batch system; the startd component then subdivides the given job slot into two smaller pieces, say VM1 and VM2, for Panda jobs. The startd runs the pilot, which fetches a production job from the Panda server; the startd then runs the production job on VM1 as a background job. The startd keeps running the pilot script, which fetches an analysis job this time; the startd then suspends the production job and runs the analysis job on VM2 as a foreground job. The startd again periodically runs the pilot; this time, however, it passes a message to the pilot that neither of the VMs is available, so the pilot simply exits. After a short while, the analysis job finishes and exits. The startd then resumes the production job on VM1 until it finishes or the next analysis job comes along. In the meantime, the startd component keeps running pilots to fetch new jobs until the native batch system decides to run another user job (and therefore has to remove the Startd Pilot). This process goes on and on, with some details changing depending on the policies enforced [*9].
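
To make the sequence above concrete, here is a toy Python model of the two-VM policy it describes (production as background on VM1, analysis as foreground on VM2, with suspend/resume). It is illustrative pseudologic only, not the actual startd implementation.

    class StartdSlots(object):
        """Toy model of the two-VM policy in the example above."""

        def __init__(self):
            self.vm1 = None   # background slot (production)
            self.vm2 = None   # foreground slot (analysis)

        def wants_pilot(self):
            """Tell the pilot whether any slot could take a new job."""
            return self.vm1 is None or self.vm2 is None

        def start_job(self, job):
            if job["jobType"] == "production":
                self.vm1 = dict(job, state="running")      # background on VM1
            else:
                if self.vm1 is not None:
                    self.vm1["state"] = "suspended"        # suspend production
                self.vm2 = dict(job, state="running")      # foreground on VM2

        def foreground_finished(self):
            self.vm2 = None
            if self.vm1 is not None:
                self.vm1["state"] = "running"              # resume production

    # Walking through the scenario:
    slots = StartdSlots()
    slots.start_job({"jobType": "production"})   # pilot fetched a production job
    slots.start_job({"jobType": "analysis"})     # production suspended, analysis runs
    assert not slots.wants_pilot()               # pilot would simply exit now
    slots.foreground_finished()                  # analysis done, production resumes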

3. Startd-based Pilot Factory (SPF)

Combining the Generic Pilot Factory (GPF) with the Startd Pilot gives the Startd-based Pilot Factory (SPF). Instead of producing generic pilots as the GPF does, the Startd-based Pilot Factory produces startd pilots. Observe that a Startd Pilot itself acts like a pilot factory with a limited life span, because the Condor startd keeps running the generic pilot at a fixed rate on a worker node; however, the Startd Pilot process will eventually be preempted just like other user jobs, say within a few hours or even less, depending on the policy enforced by the native batch system. Hence, the Generator keeps submitting a certain number of Startd Pilot jobs to the schedd while other Startd Pilots are still running, so that there are always standby Startd Pilot jobs awaiting vacancies in the pool. An open question, however, is the appropriate number of standby Startd Pilot jobs for the Generator to submit. In other words, with an optimal scheduling policy, the Generator would submit "just enough" Startd Pilot jobs to the schedd, incurring minimal overhead on the head node. Yet this optimal scheduling policy for pilot jobs could be influenced by numerous factors, such as the total number of available nodes in the pool, the policy enforced by the native batch system, OS-enforced restrictions on CPU usage, the priority of the Startd Pilot job itself, networking properties at the site, and the overall traffic of analysis and production jobs to the site, among others. As a result, an optimal scheduling policy may be hard to obtain; therefore, a sophisticated learning algorithm may be necessary to "learn" the overall characteristics of the site and dynamically determine the best approximations to the optimal policy.
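
Before any learning algorithm is attempted, a first approximation to the standby question could be a simple feedback rule of the following kind; the thresholds and the pilot-counting stub are assumptions, and the submission step would reuse something like the Generator loop sketched in the GPF section.

    TARGET_STANDBY_FRACTION = 0.2   # keep ~20% of running startd pilots as standby
    MIN_STANDBY = 2                 # never fall below this many idle startd pilots

    def count_startd_pilots():
        """Placeholder: query the schedd (e.g. via condor_q) and return the
        numbers of (idle, running) Startd Pilot jobs; stub values here."""
        return (0, 0)

    def standby_deficit():
        """How many additional Startd Pilot jobs the Generator should submit."""
        idle, running = count_startd_pilots()
        wanted = max(MIN_STANDBY, int(TARGET_STANDBY_FRACTION * running))
        return max(0, wanted - idle)

    # The Generator's main loop would call standby_deficit() periodically and
    # submit that many Startd Pilot jobs, keeping head-node overhead roughly
    # constant while vacancies in the pool are filled as they appear.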

More Information

Schedd Glidein

Similar to the Condor startd glidein (i.e. condor_glidein), a schedd-based glidein also consists of two Condor-G job submissions. The first Condor-G job sets up all schedd-related daemons, while the second runs the Condor master, which then spawns a process running the schedd. The schedd glidein mechanism is currently not available in Condor v6.9.0, so a slight modification to the existing code is required. One possibility is to modify the condor_glidein script to embed the glidein setup script, the startup script, glidein-specific configuration, the submit description files that run these scripts, and so on. I would suggest that condor_glidein be extended with an option for selecting a startd or schedd glidein, with the default set to startd glidein as in the current version. For more details, please check the Schedd-based Glidein page.
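
For concreteness, the second Condor-G job might be described along the following lines, mirroring what condor_glidein generates for the startd case. Everything here is an assumption for illustration: the gatekeeper contact, the startup wrapper name, and the idea of shipping a configuration whose DAEMON_LIST starts only the master and the schedd; the -type option shown at the end is the proposed (not yet existing) extension to condor_glidein.

    # Sketch of how a schedd glidein's second Condor-G job could be built;
    # names and values are illustrative, not a tested recipe.

    GLIDEIN_CONFIG_LINES = [
        "DAEMON_LIST = MASTER, SCHEDD",     # start only the master and the schedd
    ]

    def glidein_submit_description(gatekeeper):
        """Return a Condor-G submit description that runs the glidein startup
        wrapper (which in turn execs condor_master) on the head node."""
        return "\n".join([
            "universe      = grid",
            "grid_resource = gt2 %s/jobmanager-fork" % gatekeeper,
            "executable    = glidein_startup.sh",          # hypothetical wrapper
            "transfer_input_files = glidein_condor_config",
            "output        = schedd_glidein.out",
            "error         = schedd_glidein.err",
            "log           = schedd_glidein.log",
            "queue",
        ]) + "\n"

    # With the suggested extension, the whole setup might be driven as, e.g.:
    #   condor_glidein -type schedd headnode.example.org/jobmanager-fork
    # where "-type" is the hypothetical option discussed above.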

Generator

Similar to pilotScheduler.py in TestPilot, the Generator would provide services such as single or multiple pilot submission to a specific site or to all available sites, submitting pilots to a specific queue on a node, monitoring pilot jobs, listing all the queues available for running pilots, and so on. One thing worth noticing is that, just like pilotScheduler, the Generator application also needs to support a heartbeat, so that every few tens of seconds (configurable) the Generator keeps updating the statistics and current states of the pilot jobs associated with the particular service that launched them. As long as these pilots are still running, the heartbeat persists and the statistical information is updated until the user issues a "stop" command, or upon failure or keyboard interrupt. The pilot statistics and state information are maintained in the servicelist table. Consider the pilotScheduler application, for example: the service ./pilotScheduler.py --queue=IU_ATLAS_Tier2-pbs sends pilots to the queue at IU managed by the PBS batch system. The statistics and states of the pilot jobs run through this service are reflected in the servicelist table at the interval defined by the heartbeat. This is essential for coordinating pilot jobs. The same rationale applies to the Generator application, except that the Generator may need to provide two separate heartbeats, one for generic pilots and the other for startd pilots.
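
In outline, such a heartbeat could look like the sketch below; the servicelist access, the state-collection step, and the 30-second default interval are placeholders for illustration rather than the actual TestPilot code.

    import time

    HEARTBEAT_INTERVAL = 30   # seconds; configurable, as in pilotScheduler

    def collect_pilot_states(service_id):
        """Placeholder: gather current statistics for the pilots launched by
        this service (e.g. by querying the schedd and the Panda server)."""
        return {"nQueued": 0, "nRunning": 0, "nFinished": 0, "nFailed": 0}

    def update_servicelist(service_id, stats):
        """Placeholder: write the statistics into the servicelist table."""
        pass

    def heartbeat(service_id, stop_requested):
        """Keep the servicelist entry fresh until asked to stop."""
        while not stop_requested():
            update_servicelist(service_id, collect_pilot_states(service_id))
            time.sleep(HEARTBEAT_INTERVAL)

    # A Generator would run two such loops (or one loop with two service ids):
    # one tracking its generic pilots and one tracking its startd pilots.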

Pilot Submission Efficiency

In the GPF scheme, the Generator submits generic pilots to the schedd, which then sends them to the native batch system; depending on the scheduling policy of the native batch system, pilot jobs may not be scheduled in a timely fashion. Assume the generic pilot submission rate from the Generator is Gp pilots per minute. Suppose also that the average time it takes for a pilot job to reach a worker node from the schedd is Tsw minutes; hence, the average rate at which pilots depart from the schedd to worker nodes is 1/Tsw pilots per minute. Gp is in general larger (if not much larger) than 1/Tsw, since the pilot submission time from the Generator to the schedd is negligible compared to the scheduling delay from the schedd to a worker node via the native batch system. In the idealized case where the scheduling delay approaches zero (which occurs when the native batch system is extremely nice to pilot jobs), Gp ~ 1/Tsw (where "~" reads as "approximately equal to"). Now, suppose the small amount of time a worker node needs to run the pilot script is tp, the job retrieval time via the pilot is tpw (i.e. the time it takes to get a job from Panda to the worker node), and the average time to execute the job is Tj. Since Tj is in general much larger than tp and tpw, we can approximate (Tj + tp + tpw) by Tj. Therefore, a generic pilot job takes, on average, Tsw + Tj minutes from start (departing from the schedd) to completion. Assume now that the native batch system allocates x worker nodes for running pilot jobs. On average, the site cluster is then able to finish approximately x jobs in parallel within a period of Tsw + Tj minutes. Note that a naive assumption is made here with respect to the average job processing time: because the processing times of a user analysis job and a production job can be very different, the average may have limited meaning.
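
Spelled out with purely illustrative numbers (not measurements), the GPF accounting above reduces to a single expression: one wave of x jobs costs roughly Tsw + Tj minutes.

    # Illustrative GPF numbers; all values are made up for the example.
    Tsw = 20.0    # minutes: schedd -> worker node, including batch scheduling delay
    Tj  = 120.0   # minutes: average job execution time (>> tp, tpw, so those drop out)
    x   = 50      # worker nodes allocated to pilot jobs

    wave_time_gpf  = Tsw + Tj              # time to complete one wave of x jobs
    throughput_gpf = x / wave_time_gpf     # ~0.36 jobs per minute in this example
    print("GPF: %.0f min per wave, ~%.2f jobs/min" % (wave_time_gpf, throughput_gpf))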

Now consider the SPF scheme, where the Generator submits startd pilots instead of generic pilots. Assume the submission rate is Gsp pilots per minute. Let Tsw' be the average time for a startd pilot to travel from the schedd to a worker node; Tsw' approximately equals Tsw, assuming for simplicity that the native batch system treats startd pilots the same as generic pilot jobs (e.g. equal in priority). Similarly, (Tj + tp' + tpw') ~ Tj, where tp' is the time to run the startd pilot and tpw' is the time for the (generic) pilot to pull a job from Panda down to the worker node. Here comes the difference between GPF and SPF. Since the startd part of a startd pilot can repeatedly run the generic pilot script, the effect is almost the same as submitting pilot jobs locally from the worker node, except that the scheduling delay is effectively zero. In GPF, before a pilot submitted from the Generator can get a job from the Panda server, it has to wait out the scheduling delay imposed by the native batch system, and sometimes this delay can be significant; as a result, Tsw can be large. Startd pilot jobs can experience similar delays (i.e. Tsw' ~ Tsw); however, once startd pilots "settle down" on some worker nodes, they can repeatedly run pilots and fetch jobs from there on, saving the trip from the schedd to the worker node (i.e. the Tsw term can be dropped). Under the same conditions assumed in the GPF case, the native batch system now schedules x worker nodes for startd pilot jobs, resulting in x startd pilots running in parallel at the cluster. In the beginning, similar to GPF, the SPF also takes Tsw' + Tj to finish x jobs from Panda; however, since those worker nodes can then "launch pilots" directly (by running the pilot script under the startd), the time to get the next x jobs done reduces to (tp' + tpw' + Tj), which is usually far smaller than (Tsw' + Tj). In addition, since the startd can manage several job slots at the same time in the SPF scheme, it is possible to have multiple jobs running simultaneously on the same worker node. Therefore, for the same amount of time, more jobs can be finished.
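
Continuing the same illustrative numbers, the SPF case differs only after the first wave, when the Tsw' term disappears and, optionally, each startd pilot drives more than one job slot.

    # Continuing the GPF example; values remain illustrative only.
    Tsw, Tj, x = 20.0, 120.0, 50
    tp, tpw    = 0.1, 0.5      # minutes: run the pilot script, pull a job from Panda
    slots      = 2             # job slots managed by each startd pilot

    first_wave_spf  = Tsw + Tj             # same as GPF for the first x jobs
    steady_wave_spf = tp + tpw + Tj        # later waves skip the schedd -> worker trip
    throughput_spf  = (x * slots) / steady_wave_spf
    print("SPF steady state: %.1f min per wave, ~%.2f jobs/min"
          % (steady_wave_spf, throughput_spf))   # ~0.83 jobs/min vs ~0.36 for GPF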

Notes

  1. The term "generic pilot" is used to make a distinction from other types of pilot jobs with similar yet different functionalities, such as the startd pilot and the multitasking pilot. It is generic because the functionalities supported are minimal yet self-contained: the generic pilot checks the environment settings of the remote machine, retrieves analysis or production jobs from the Panda server (or, more specifically, from the JobDispatcher), and reports job status back to the Panda server upon completion (or failure); a minimal sketch of this control flow is given after these notes. The generic pilot has the advantage of being flexible enough to fit into different problem domains and computationally cheap, and it can be extended to fit different needs. For example, if the input datasets of a particular analysis job are not present at the site the job is destined for, the generic pilot can be extended to support site DDM access, with DQ2-specific commands to retrieve these datasets. ...Back
  2. The startd runs the pilot script at a fixed rate to send job requests to Panda. This is effectively equivalent to generic pilot job submissions from the Generator (see GPF). With the startd able to run pilots from the backend nodes themselves, far fewer job submissions are required from the Generator: the Generator then only has to submit startd pilots and leaves the responsibility of running pilots (or, equivalently, pilot submission) to these startd pilot jobs. Please check Startd Pilot for more details. ...Back
  3. If a job runs in the scheduler universe, it is managed by the schedd rather than by a startd. ...Back
  4. The Condor schedd first spawns a process running the gridmanager. The gridmanager then spawns a process running a GAHP server program. GAHP contains libraries that support communication with various batch systems, so that, with GAHP as a medium, the schedd can submit jobs to nodes managed by other batch systems (e.g. PBS, LSF, etc.). GAHP also supports condor_submit-related functions, so that the local Condor schedd can submit jobs to another Condor schedd at a remote site; this is effectively how Condor-C works. Moreover, GAHP also wraps the GRAM protocol for communicating with a gatekeeper running the Globus software, so the schedd can submit Condor-G jobs. The gridmanager converts ClassAds to Globus RSL for a Condor-G job, and ClassAds to the submission languages of other batch systems when the schedd needs to submit jobs to them directly. This is essential for schedd-based glidein, since there is a good chance that the backend nodes behind the gatekeeper that one glides into with a schedd are not managed by Condor. In summary, the following is the sequence of processes on a submit machine, in the order of their creation:

    master -> schedd -> gridmanager -> GAHP: {GRAM, condor_submit, PBS, LSF ...}

    ...Back
  5. schedd submits jobs to other batch systems via GAHP, see [*4] ...Back
  6. The terms worker node and backend node are used interchangeably. ...Back
  7. ......Back
  8. Whether it is necessary to have external job queue(s) working alongside the startd pilot is really contingent on the job preemption policy. A "smart" job preemption policy would result in an empty queue or a queue with minimal pending jobs. The load on this external queue also depends on the number of job slots that the startd controls and on how jobs flow into those slots. For example, a production-analysis-production-analysis sequence, or any other proper combination of production and analysis jobs (which is influenced by the time it takes to process them), could result in minimal load on the external queue; however, for the time being, it is not entirely clear whether the external job queue is necessary or how it would be implemented. ...Back
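
As referenced in note 1, a minimal sketch of a generic pilot's control flow is given below. The JobDispatcher access and the job description fields are placeholders, not the actual Panda protocol.

    import subprocess

    def check_environment():
        """Placeholder: verify required software, disk space, grid proxy, etc."""
        return True

    def get_job_from_dispatcher():
        """Placeholder for the call to the Panda JobDispatcher; returns a job
        description, or None when no job is available."""
        return None

    def report_status(panda_id, state):
        """Placeholder: report the final job state back to the Panda server."""
        pass

    def run_generic_pilot():
        if not check_environment():
            return
        job = get_job_from_dispatcher()
        if job is None:
            return                              # nothing to run; exit quietly
        rc = subprocess.call([job["transformation"]] + job.get("args", []))
        report_status(job["PandaID"], "finished" if rc == 0 else "failed")

    if __name__ == "__main__":
        run_generic_pilot()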

Q&A

Reference


Major updates:
-- BarnettChiu - 26 Nov 2006
