r52 - 30 Jan 2009 - 13:40:14 - PohsiangChiu

ScheddGlidein - Schedd-based Glidein Implementation


Introduction

Since the idea of schedd glidein is closely related to Condor-C, it helps to start with a brief description of how Condor-C works. Condor-C stands for Condor to Condor and, as the name suggests, is designed to facilitate job submission and management across the boundaries of local Condor pools. Through Condor-C, jobs are transferred from the local client schedd to the remote server schedd, where they are matched against available machines. In this manner, multiple Condor pools are virtually joined together to provide a larger set of computing resources. This is particularly useful when handling large numbers of user jobs.

Inspired by Condor-C, schedd glidein also aims to expand local resources, but goes further by interconnecting sets of computing resources across various sites that may be managed by different batch systems. A glidein schedd works similarly to the Globus jobmanager in the sense that it also serves as a bridge between local submit hosts and remote worker nodes. Nonetheless, the two function in fundamentally different ways: the glidein schedd operates purely as a gateway, or funnel, to the native batch system on the site's headnode, without the brokering and job-managing activities performed by the Globus jobmanager. In other words, the glidein schedd simply mirrors the remote batch system on its headnode, so that remote job submission reduces to something similar to local job submission within the site boundary.

Schedd-based glidein is specifically tailored for light-weight jobs such as pilots. Because pilots are designed to perform only simple preprocessing steps before pulling real job payloads from a remote server (e.g. the Panda server), there is no reason for them to impose excessive overhead on the headnode. At the same time, the pilot submission rate will become a major concern as sites cope with rapidly increasing numbers of analysis and production jobs. With Condor-G, increasing the pilot submission rate naturally leads to heavier GRAM traffic towards site headnodes. Schedd-based glidein is intended to alleviate this overhead on the headnode.

In the application of Pilot Factory, schedd glideins are first submitted to the sites that offer to join the local resource pool. Once the glideins are set up and running, the pilot submission system submits pilots directly to them. If the remote site uses Condor as its job management system, pilots are submitted via Condor-C. For Condor-C pilot submission with schedd glideins, the pilot's submit file needs to be configured so that the server schedd on the site's headnode submits pilots as vanilla universe jobs [1]. Pilot jobs in the vanilla universe with Condor-C (denoted as Condor-C vanilla for simplicity later on) are functionally almost identical to those submitted locally as regular vanilla universe jobs. The only difference is that these pilots need to be forwarded from a client schedd to the server schedd with Condor-C before being scheduled as vanilla universe jobs. The glidein schedd plays the role of this client schedd in the overall picture.

Since Condor's schedd daemon also supports interfacing with other batch systems such as PBS when properly configured, a glidein schedd can mirror various batch systems, hiding the heterogeneity of sites and presenting the pilot submission system with a uniform submission portal. For a non-Condor batch system such as PBS, pilots are submitted to the glidein schedd as grid universe jobs. The grid universe job here is not to be confused with Condor-G [2]. Since the details of such grid jobs are tied to how the Condor schedd functions, further discussion is deferred to the Applications section below.

Implementation

Condor's existing startd-based glidein is essentially a dynamic mini-Condor pool that presents a remote computing resource to the local Condor pool as an add-on resource. In other words, the remote worker node joins the local pool primarily through the functionality of the Condor startd daemon that lives on the node. Similarly, schedd-based glidein is achieved by dynamically installing and executing a set of schedd-related daemons such as master, schedd, gridmanager, and GAHP wrappers. However, unlike the startd glidein that lives on the worker node, schedd glidein is targeted to run on the site's headnode in order to interface with the site's resources as a whole through the batch system on the headnode. Note that a headnode can be either a Globus-controlled gatekeeper or a dedicated machine with Globus software installed.

One way to perform the remote daemon setup is to submit two consecutive Condor-G jobs. The first Condor-G job delivers and executes a setup script, which primarily performs the following tasks: (1) downloading a glidein tarball with schedd-related binaries from a remote server via one of several protocols such as HTTP, HTTPS and GSIFTP; (2) generating a proper glidein configuration file; (3) generating a startup script that spawns Condor processes by starting the Condor master, which then spawns the Condor schedd. With all essential files present on the remote headnode, the second Condor-G job is then submitted to execute the startup script generated during the first Condor-G job. A mini-Condor pool that represents a job submission portal is now up and running. Schedd glidein is thus analogous to startd glidein in that both are mini-Condor pools running on a remote machine; glideins are created for specialized uses with features extracted from a complete Condor system.
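The three setup tasks can be sketched as a small shell script. This is a simplified illustration, not the actual glidein_remote_setup script used by condor_glidein: the BASE_DIR and TARBALL_URL variables, the tarball file name (<arch>.tar.gz) and the minimal configuration contents are assumptions made for the example. By default it works in a temporary directory and skips the download, so it can be tried without touching $HOME or the network.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the real glidein_remote_setup script.
set -eu

ARCH="6.8.1-i686-pc-Linux-2.4"
BASE_DIR="${BASE_DIR:-$(mktemp -d)}"   # stands in for $HOME/Condor_glidein
ARCH_DIR="$BASE_DIR/$ARCH"             # binaries and startup script live here
LOCAL_DIR="$BASE_DIR/local"
TARBALL_URL="${TARBALL_URL:-}"         # assumption: empty means "skip download"

mkdir -p "$ARCH_DIR" "$LOCAL_DIR"

# (1) Download and unpack the tarball if a URL is given and binaries are missing.
#     The "<arch>.tar.gz" file name is an assumption made for this sketch.
if [ -n "$TARBALL_URL" ] && [ ! -x "$ARCH_DIR/condor_master" ]; then
    curl -fsS "$TARBALL_URL/$ARCH.tar.gz" | tar -xzf - -C "$ARCH_DIR"
fi

# (2) Generate a minimal glidein configuration file.
cat > "$BASE_DIR/glidein_condor_config" <<EOF
DAEMON_LIST = MASTER, SCHEDD
LOCAL_DIR = $LOCAL_DIR
SBIN = $ARCH_DIR
EOF

# (3) Generate the startup script; it starts condor_master, which then
#     spawns the schedd named in DAEMON_LIST.
cat > "$ARCH_DIR/glidein_startup" <<EOF
#!/bin/sh
CONDOR_CONFIG="$BASE_DIR/glidein_condor_config"
export CONDOR_CONFIG
exec "$ARCH_DIR/condor_master" "\$@"
EOF
chmod +x "$ARCH_DIR/glidein_startup"

echo "glidein files written under $BASE_DIR"
```

The second Condor-G job would then only need to run the generated glidein_startup with the desired condor_master arguments.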

Schematic

schedd_glidein_schematic.jpg

Usage and Examples

The code is available under the panda/pilotfac area of the BNL Subversion repository; look for condor_glidein there.

Usage of the schedd glidein command is mostly identical to that of the existing condor_glidein command, with added options for choosing between startd and schedd glidein, for using TCP, and for communicating with external programs when the glidein script is used as a wrapper.

Two preliminary steps are needed before using the glidein script:

(1) You need a valid grid certificate (X.509 certificate), since glidein requests are made in the form of Condor-G jobs.

(2) Configure SCHEDD_GLIDEIN_SERVER_URLS. Assign valid URLs (HTTP-based or GSIFTP) to the SCHEDD_GLIDEIN_SERVER_URLS macro in your condor_config file. For example, if the schedd-related binaries are available at https://www.glideinexperience.com/schedd_glidein, you would configure this macro as follows:

SCHEDD_GLIDEIN_SERVER_URLS = https://www.glideinexperience.com/schedd_glidein

Step (2) mirrors the startd-glidein scenario, in which you would configure GLIDEIN_SERVER_URLS and point it to valid URLs from which the startd-related binaries (usually in the form of a tarball) can be retrieved.

Editing the condor_config file is not the only way to tell condor_glidein where to retrieve the binaries; if modifying condor_config is undesirable, step (2) can also be achieved with the -forcelink option followed by the URL given above.

Here is an example link for SCHEDD_GLIDEIN_SERVER_URLS to get you started quickly: http://www.usatlas.bnl.gov/~pleiades/glidein/schedd_based

If you are more security conscious and do not wish to use an HTTP link, you can first download the binaries from the example link and then set up a secure server that uses HTTPS or GSIFTP before getting started with condor_glidein. If you are interested in customizing your own tarballs for glidein jobs, here is a list of what a schedd glidein tarball can contain:

  1. condor_master and condor_schedd: basic binaries for Condor schedd to function
  2. condor_gridmanager, gahp_server, grid_monitor.sh: consider these binaries if submitting Condor-G jobs is desirable
  3. condor_c-gahp, condor_c-gahp_worker_thread: consider these if Condor-C submission is desirable
  4. binaries for supporting other batch systems such as PBS

Compress the desired binaries (condor_master and condor_schedd are mandatory) into a tarball and upload it to your server. With the binaries ready, set SCHEDD_GLIDEIN_SERVER_URLS to the URL of your tarball server.
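As a rough sketch of the packaging step, the script below stages the mandatory binaries plus the Condor-C helpers and packs them into a tarball. The SRC_DIR variable and the tarball name (<arch>.tar.gz) are assumptions made for the example; when SRC_DIR is unset the script creates empty placeholder files so that the flow can be tried without a Condor installation.

```shell
#!/bin/sh
# Sketch: stage schedd glidein binaries and pack them into a tarball.
set -eu

SRC_DIR="${SRC_DIR:-}"      # assumption: directory holding real Condor binaries
WORK="$(mktemp -d)"
STAGE="$WORK/stage"
mkdir -p "$STAGE"

REQUIRED="condor_master condor_schedd"
OPTIONAL="condor_c-gahp condor_c-gahp_worker_thread"   # for Condor-C submission

for bin in $REQUIRED $OPTIONAL; do
    if [ -z "$SRC_DIR" ]; then
        : > "$STAGE/$bin"                    # demo placeholder, not a real binary
    elif [ -f "$SRC_DIR/$bin" ]; then
        cp "$SRC_DIR/$bin" "$STAGE/$bin"
    fi
done

# condor_master and condor_schedd are mandatory; stop if either is missing.
for bin in $REQUIRED; do
    [ -f "$STAGE/$bin" ] || { echo "missing required binary: $bin" >&2; exit 1; }
done

# The tarball name (<arch>.tar.gz) is an assumption made for this sketch.
TARBALL="$WORK/6.8.1-i686-pc-Linux-2.4.tar.gz"
tar -czf "$TARBALL" -C "$STAGE" .
tar -tzf "$TARBALL"
```

Run it with SRC_DIR pointing at a directory of real binaries to produce a usable tarball; the final listing shows what was packed.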

Now we are ready for a few usage examples:

1. Generate submit files involved in a glidein request procedure where 3 instances of schedd glidein are to be installed on the host gridgk10.racf.bnl.gov using Globus jobmanager-fork

condor_glidein -count 3 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk10.racf.bnl.gov/jobmanager-fork -type schedd -gensubmit

  1. -gensubmit produces a glidein setup script and both submit files. With these files generated, you can fine-tune the submit files for your needs, for example by modifying the base directory where binaries are stored.
  2. -type is either startd or schedd; when set to schedd, the command sets up a schedd-based glidein; when -type is not specified, it defaults to startd glidein.
  3. The contact string is usually the host name followed by the jobmanager: remote-host/jobmanager:[port]
  4. -setup_jobmanager selects the type of jobmanager; since schedd glidein is meant to run on the headnode, jobmanager-fork is used here.
  5. -count specifies the number of glidein instances to set up; when not specified, the default is one instance, which is what Pilot Factory is intended to use.
  6. Note that you can also choose to install schedd glideins on worker nodes, as long as the headnode shares at least some directories with the worker nodes and supports remote file system authentication; however, setting up schedd glidein in this manner is beyond the scope of this discussion.

2. Request an instance of schedd glidein on gridgk10.racf.bnl.gov

condor_glidein -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk10.racf.bnl.gov/jobmanager-fork -type schedd

3. Request two instances of schedd glidein on gridgk10.racf.bnl.gov and forcefully download new binaries and configuration file.

condor_glidein -count 2 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk10.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup

  1. -forcesetup tells the command to install a new set of daemons plus a new configuration file

If network connectivity via UDP is an issue, you can choose TCP for the glidein to send its status updates to the local collector. To achieve this, set two configuration macros in your local Condor configuration file:

UPDATE_COLLECTOR_WITH_TCP = True
# If COLLECTOR_SOCKET_CACHE_SIZE is not set, the collector will refuse to receive TCP packets
COLLECTOR_SOCKET_CACHE_SIZE = 128

You can also use the -tcp option to force the command to use TCP. The site administrator will most likely have set a port range for outbound TCP connections, which can be found in the GLOBUS_TCP_PORT_RANGE environment variable. The glidein command automatically looks up this variable to determine the port range through which glideins send ClassAd updates to the local collector. If GLOBUS_TCP_PORT_RANGE is not defined on the remote headnode, you will need to set the low and high port values manually through the glidein command-line options -lowport and -highport.
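When the port range has to be supplied by hand, a small helper can turn a "low,high" range string (the comma-separated form commonly used for the Globus port range variable) into the two options. This is only a sketch: the port_range_options function is hypothetical, and passing the values as "-lowport N -highport M" with a space is an assumption about the option syntax.

```shell
#!/bin/sh
# Sketch: turn a "low,high" port range into -lowport/-highport options.
# Prints the options on success; returns non-zero for a missing or malformed range.
port_range_options() {
    range="${1:-}"
    case "$range" in
        *,*) ;;                        # the range must contain a comma
        *) return 1 ;;
    esac
    low="${range%%,*}"
    high="${range##*,}"
    for value in "$low" "$high"; do
        case "$value" in
            ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric parts
        esac
    done
    [ "$low" -le "$high" ] || return 1
    echo "-lowport $low -highport $high"
}

# Demo value; a real range would come from the site administrator.
port_range_options "20000,25000"
```

The printed options could then be appended to a condor_glidein command line together with -tcp.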

4. After the glideins are running, use the following commands as examples to query them

Use the Condor client tools to check the status of glidein jobs and their specifics:

4.1 Check the schedd glidein queue to see any jobs running on the glidein. A glidein schedd name consists of two parts: (1) the schedd name and (2) the host name. The name myschedd01 below is fictitious and used for illustration only.

condor_q -name myschedd01@gridgk10.racf.bnl.gov

4.2 Check a list of available glideins

condor_status -schedd -c 'is_glidein=?=true'

4.3 Check the detailed ClassAd of a particular glidein instance; this lists all the attributes associated with the glidein

condor_status -schedd -l -c 'is_glidein=?=true' -c 'Name == "myschedd01@gridgk10.racf.bnl.gov"'

4.4 For the same glidein instance, query a specific attribute, ScheddIpAddr

condor_status -schedd -l -c 'is_glidein=?=true' -c 'Name == "myschedd01@gridgk10.racf.bnl.gov"' -format "%s\n" ScheddIpAddr

Code Analysis

Schedd-based glidein can be implemented on top of the existing condor_glidein command. As mentioned before, condor_glidein submits two Condor-G jobs to the Globus-controlled gatekeeper, one being the setup job and the other the startup job. Sample submit description files are given in the Submit Files Example section below (Setup Job and Startup Job).

The first Condor-G job (CG1), by default named glidein_remote.submit, includes a setup script, glidein_remote_setup, which when run generates the following two files and saves them to the proper directories: glidein_condor_config and glidein_startup. glidein_remote_setup also checks whether the necessary binaries are already present on the headnode's file system. If they are not, GridFTP is used to download the binaries suited to the machine architecture and OS of the gatekeeper. For security reasons, glidein_remote_setup also offers options for installing a gridmap file and trusted CAs and for checking against the proxy certificate.

After the setup step is completed, the second Condor-G job is initiated in response to a key message contained in the setup output file. By default named glidein_run.submit, CG2 runs the startup script, i.e. the glidein_startup generated and installed by the setup script in CG1. The startup script checks the X509 user proxy, uses it as the Condor daemon proxy, and then runs the Condor master. The master daemon then spawns the schedd as specified by the DAEMON_LIST parameter in the glidein_condor_config file.

By default, the glidein configuration file is saved under $HOME/Condor_glidein/; all the glidein-specific daemons and the startup script are placed under the architecture directory named after the Condor version and the machine architecture of the headnode (e.g. $HOME/Condor_glidein/6.8.1-i686-pc-Linux-2.4/).

Submit Files Example

1. Setup Job (first CG)

In the arguments macro, you can modify the fifth argument, which is the download link for the glidein tarball; simply replace it with the URL of your own server. Since this value is obtained from GLIDEIN_SERVER_URLS as defined in the Condor configuration file for your Condor pool, you could also change that macro instead.

universe = grid
grid_resource = gt2 gridgk10.racf.bnl.gov/jobmanager-fork

executable = glidein_remote_setup.5091

# Manually quote the URL in case it has characters meaningful to RSL
arguments = $(DOLLAR)(HOME)/Condor_glidein $(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4 \
6.8.1-i686-pc-Linux-2.4 $(DOLLAR)(HOME)/Condor_glidein/local 'http://gridui01.usatlas.bnl.gov:25880/glidein/binaries/schedd_based' 0

#avoid trouble with scratch directory creation
remote_initialdir = /tmp

output = glidein_setup.output.5091
error = glidein_setup.error.5091
log = glidein_setup.log.5091
queue

2. Startup Job (second CG)

As you can see from the expressions of the form _condor_[Macro], this file exports values for certain macros through the environment, such as GLIDEIN_HOST (the URLs for retrieving glidein binaries), LOCAL_DIR, and SBIN, among others, for the Condor daemons to reference when they are started. These expressions, along with those in glidein_condor_config, are read and parsed by the relevant Condor daemons. This submit description file instantiates 3 schedd glideins on the headnode, as specified by the GlobusRSL attribute; Condor simply passes the GlobusRSL information on to the Globus jobmanager.

universe = grid
grid_resource = gt2 gridgk10.racf.bnl.gov/jobmanager-fork 
executable = $(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4/glidein_startup
arguments = -dyn -f

environment = CONDOR_CONFIG=$(DOLLAR)(HOME)/Condor_glidein/schedd_glidein_condor_config; \
_condor_CONDOR_HOST=gridui01.usatlas.bnl.gov;_condor_GLIDEIN_HOST=gridui01.usatlas.bnl.gov; \
_condor_LOCAL_DIR=$(DOLLAR)(HOME)/Condor_glidein/local; \
_condor_SBIN=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4; \
_condor_LIB=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4; \
_condor_LIBEXEC=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4; \
_condor_CONDOR_ADMIN=pleiades@bnl.gov;_condor_NUM_CPUS=1;_condor_UID_DOMAIN=racf.bnl.gov; \
_condor_FILESYSTEM_DOMAIN=racf.bnl.gov;_condor_MAIL=/bin/mail; \
_condor_STARTD_NOCLAIM_SHUTDOWN=1200;_condor_START_owner=pleiades; \
_condor_UPDATE_COLLECTOR_WITH_TCP=True

transfer_Executable = False

GlobusRSL = (count=3)(jobtype=single)
Notification = Never
Queue
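Condor treats environment variables prefixed with _condor_ as configuration overrides, which is why the startup job passes its settings this way. The helper below is a minimal sketch of how such an environment value could be assembled from NAME=value pairs; the condor_env function is hypothetical and is not part of condor_glidein.

```shell
#!/bin/sh
# Sketch: build the environment value from NAME=value pairs by prefixing
# each macro name with _condor_ and joining the entries with semicolons.
condor_env() {
    result=""
    for pair in "$@"; do
        result="${result:+$result;}_condor_$pair"
    done
    echo "$result"
}

condor_env CONDOR_HOST=gridui01.usatlas.bnl.gov NUM_CPUS=1 UPDATE_COLLECTOR_WITH_TCP=True
```

The printed string has the same shape as the _condor_ entries on the environment line of the startup job above.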

Applications

1. Pilot Factory

As described in Pilot Factory Twiki page, a Pilot Factory consists of three major components:

  1. Schedd-based glidein
  2. Pilot submitter
  3. Factory Starter

In the application of Pilot Factory, schedd glideins are dynamically installed on demand at remote sites, thereby providing the pilot submission system with a uniform submission portal. Two types of pilot submission come into play:

Schedd Glidein with Condor System

As mentioned before, the glidein schedd serves as the client schedd by "redirecting" user jobs to the native batch system, which ultimately serves as the job queue from which these jobs are matched against worker nodes. If the batch system on the remote site is Condor, the pilots are submitted as Condor-C jobs, as described earlier.
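A Condor-C pilot submit file for this case might look like the one written by the sketch below. The server schedd and collector names, the pilot.sh executable and the output file names are hypothetical placeholders; the grid_resource = condor form and +remote_jobuniverse = 5 (vanilla universe, see note [1]) are the parts that matter. Handing the file to the glidein schedd with condor_submit -name is shown only as a comment and is an assumption about how the pilot submitter targets the glidein.

```shell
#!/bin/sh
# Sketch: write a Condor-C pilot submit file for a Condor site.
set -eu

SUBMIT_FILE="${SUBMIT_FILE:-$(mktemp)}"

# Hypothetical names: replace with the site's server schedd and collector.
SERVER_SCHEDD="schedd.site.example.org"
SERVER_POOL="collector.site.example.org"

cat > "$SUBMIT_FILE" <<EOF
universe      = grid
grid_resource = condor $SERVER_SCHEDD $SERVER_POOL
executable    = pilot.sh
# 5 = vanilla universe on the server schedd
+remote_jobuniverse = 5
output        = pilot.out
error         = pilot.err
log           = pilot.log
queue
EOF

cat "$SUBMIT_FILE"
# The file would then be handed to the glidein schedd, for example:
#   condor_submit -name myschedd01@gridgk10.racf.bnl.gov "$SUBMIT_FILE"
```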

Schedd Glidein with Non-Condor Systems

In the case where the site's batch system is not Condor, the schedd spawns a gridmanager process, which in turn spawns another process representing a GAHP wrapper. The GAHP wrapper is essentially a script that provides an interface for issuing commands to the target batch system (e.g. qsub for submitting jobs in PBS). The wrapper is used in combination with corresponding configuration files that tell Condor where to find the batch system's binaries and the spool directory where the final job outputs are kept. The pilot submission system submits such pilots in the grid universe, with grid_resource set to the type of the remote batch system. For example, for Condor to PBS pilot submission, grid_resource is set to the string pbs; similarly, for LSF it is set to lsf. Please also refer to the Grid Universe section of the Condor manual for more details (though it contains little description so far).
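For a non-Condor site, the submit file selects the batch system through the grid_resource submit command. The sketch below writes such a file for PBS or LSF; the write_pilot_submit function, the pilot.sh executable and the output file names are hypothetical, and the GAHP wrapper configuration on the headnode is assumed to be in place already.

```shell
#!/bin/sh
# Sketch: write a grid universe pilot submit file for a non-Condor batch system.
set -eu

# write_pilot_submit TYPE FILE, where TYPE is pbs or lsf.
write_pilot_submit() {
    case "$1" in
        pbs|lsf) ;;
        *) echo "unsupported batch system: $1" >&2; return 1 ;;
    esac
    cat > "$2" <<EOF
universe      = grid
grid_resource = $1
executable    = pilot.sh
output        = pilot.out
error         = pilot.err
log           = pilot.log
queue
EOF
}

PBS_SUBMIT="$(mktemp)"
write_pilot_submit pbs "$PBS_SUBMIT"
cat "$PBS_SUBMIT"
```

Unlike the Condor-C case, no remote schedd or pool name is given: the grid type alone tells the gridmanager which GAHP wrapper to use.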

2. Startd Glidein and Startd Pilot

Schedd glidein can be used as a portal to submit startd glideins in order to expand local resources and similarly, to submit startd pilots.

3. Integration with glideinWMS

glideinWMS is an ongoing project developed at Fermilab by Igor Sfiligoi. Please refer to the glideinWMS homepage for specifics.

A possible integration between schedd glidein and glideinWMS is the following:

  1. Use Glidein Factory to select desired and eligible sites for submitting schedd glideins.
  2. Dynamically install schedd glideins on these sites
  3. Use VO Frontend to match jobs against these glidein schedds

Note that matching against glidein schedds could reduce the amount of matchmaking at the local pool compared to the startd glidein scenario, because a glidein schedd represents and mirrors a whole site rather than individual worker nodes.

Notes

[1] This is achieved by adding +remote_jobuniverse = 5 to the Condor-C submit file, where the number 5 stands for the vanilla universe.

[2] There are three types of grid universe jobs in the Condor system: (1) Condor-G grid jobs, (2) Condor-C grid jobs, and (3) Condor to other batch systems. The grid jobs in (2) and (3) do not interact with Globus GRAM.


Major updates:
-- TWikiAdminGroup - 20 Oct 2018

Attachments


jpg schedd_glidein_schematic.jpg (75.2K) | PohsiangChiu, 11 Dec 2007 - 02:12 | Schedd glidein schematic
pdf schedd_glidein_schematic.pdf (73.4K) | PohsiangChiu, 10 Dec 2007 - 16:07 | Schedd Glidein Illustration
 