CondorExperience - Functionality and Performance Assessments for Condor-G and Glidein
Condor-G Performance
The performance of Condor-G depends on both the nature of the jobs and the policy of the gatekeeper. Also, since achieving
high throughput is the main goal of Condor, performance needs to be evaluated over a substantial number
of jobs as a whole. There are therefore two major performance metrics to consider: queuing time and running time.
Queuing time is the average amount of time it takes for submitted jobs to be launched on the worker nodes; in other words,
the time it takes for jobs to go from submit machines to execute machines. Running time is the average amount
of time for jobs to run from start to completion. These metrics are reduced to their simplest forms here, since job
preemption and checkpointing are not yet being considered. The queuing time is largely influenced by the policy
enforced by the gatekeeper and by the requirements and preferences set on the worker nodes. The running time analysis
could be more complex; however, without considering job preemption and checkpointing (or even failures), it
is largely determined by the contents of the jobs. These jobs may be intrinsically different in that their executables
and amounts of I/O are most likely different.
The way to obtain a steady queuing time is to fix the requirements and rank attributes of both jobs and machines and to perform all
the tests against the same gatekeeper before targeting the next one. However, it would be an overly idealized case
if all the jobs and machines had exactly the same requirements. A Gaussian distribution ranging from looser
to stricter requirements of jobs and machines would be one possible and reasonable test case. To fix the
rank attribute, we can assume that all the tests are done with jobs from the same user. The running time can be
made steady with a similar methodology, by assuming that the CPU time consumed by the jobs (the executables in particular) follows a Gaussian distribution
and so does the I/O time. CPU time and I/O time have to be treated separately since they are intrinsically
different, meaning that the running times of CPU-intensive and I/O-intensive jobs are of different orders.
Test Cases
The test cases can be categorized by the goal of testing. If queuing time is more important than running time, then
the factors influencing the running time will have to be fixed when the Condor-G test is performed. Again, we can consider two
cases for fixing the running time of jobs: Gaussian distributions (or other distributions, depending on the overall nature of
the physics experiments) of CPU time and of I/O time among the tested jobs. For example, if there are 1600 jobs to be tested,
200 jobs would be the least CPU-intensive, 800 jobs would be in the medium range of CPU usage, and the other 600 jobs would be the
most CPU-intensive (1,4,6,4,1)...
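The job counts in the example above can be reproduced from the (1,4,6,4,1) weights. The following is only a sketch of one possible reading: it treats the weights as binomial coefficients over five bins and then pairs the bins that are equally far from the centre, which gives the 200/800/600 split. The function names are made up for this illustration.

```python
from math import comb


def binomial_bins(total_jobs, n_bins=5):
    """Split total_jobs over n_bins using binomial weights (1,4,6,4,1 for 5 bins)."""
    weights = [comb(n_bins - 1, k) for k in range(n_bins)]
    scale = sum(weights)
    counts = [total_jobs * w // scale for w in weights]
    # Any jobs lost to integer division go to the central bin.
    counts[n_bins // 2] += total_jobs - sum(counts)
    return counts


def fold_symmetric(counts):
    """Add up bins that are equally far from the centre, outermost pair first."""
    n = len(counts)
    folded = [counts[i] + counts[n - 1 - i] for i in range(n // 2)]
    if n % 2:
        folded.append(counts[n // 2])
    return folded


bins = binomial_bins(1600)
classes = fold_symmetric(bins)
print(bins)     # five bins
print(classes)  # three classes
```

Other distributions can be tried by replacing the weights; the rest of the split stays the same.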
Test Methods
1. Between Condor and Gatekeepers
Use Grid Exerciser to do functionality tests with respect to gatekeepers.
2. Condor-G performance metrics
In the absence of benchmarks for general Condor-G performance testing (to my knowledge), here is a simple idea for
general testing:
- Create a program (in Perl or shell script, for example) that automatically generates submit description files, with options for specifying the executable, requirements, input/output/log, grid_resource, directories, and so on.
- In the case of testing Condor-G, a wrapper program on top of the submit-file-generation program can be developed to automate the submit file generation even further, based on the observation that submit files defined in the same universe (i.e. grid universe for Condor-G jobs) usually differ in only a few commands such as the executable and input/output. The grid_resource command does not change if the test is performed against the same gatekeepers running the same version of GRAM (i.e. the test is performed within a site such as BNL). As a result, this wrapper program can, for example, select from a list of available test programs and substitute the executable in the submit file to generate a new job.
- Similarly, the wrapper program can alter the requirements of jobs.
- Set cron jobs that generate these submit files and then submit them to the gatekeeper that Condor-G communicates with.
Shell script versions of such programs are under way. As an example,
condor_gen_cg -exec /bin/ps \
    -out $HOME/condor_test/out.txt \
    -in $HOME/condor_test/in.txt
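As a rough illustration of the generator and wrapper idea above, here is a minimal Python sketch. It is not the condor_gen_cg script mentioned in the text; the function names and the example file names are assumptions, while the submit commands themselves (universe, grid_resource, executable, input, output, log, requirements, queue) are standard ones.

```python
def make_submit_file(executable, grid_resource, output, log,
                     input_file=None, requirements=None):
    """Return the text of a grid-universe submit description file."""
    lines = [
        "universe = grid",
        f"grid_resource = {grid_resource}",
        f"executable = {executable}",
    ]
    if input_file is not None:
        lines.append(f"input = {input_file}")
    lines.append(f"output = {output}")
    lines.append(f"log = {log}")
    if requirements is not None:
        lines.append(f"requirements = {requirements}")
    lines.append("queue")
    return "\n".join(lines) + "\n"


def vary_executable(executables, **fixed):
    """Wrapper: one submit file per test program, everything else kept fixed."""
    for exe in executables:
        yield make_submit_file(executable=exe, **fixed)


# Hypothetical settings; grid_resource stays the same within one site.
fixed = dict(grid_resource="gt2 gridgk01.racf.bnl.gov/jobmanager-condor",
             output="out.txt", log="job.log")
files = list(vary_executable(["/bin/ps", "/bin/date"], **fixed))
print(files[0])
```

Each generated text can be written to a file and passed to condor_submit, for instance from a cron job as described in the last step.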
Glidein
This section refers to the startd-based glidein, which is already supported by the current condor_glidein command. Below is
a list of questions that arose during testing and their solutions.
Estimation of the Life Span of a Glidein
[Q1]
Is there any way for the user on the submit side to estimate the lifetime of a glidein resource?
[A1]
There's no way for the glidein to figure out how long the batch system is willing to let it run.
If you use the -runtime option to condor_glidein, the glidein will shut down after that much time
and advertise the shutdown time as DaemonStopTime in the startd ad.
Startup Script Runs Forever
[Q2]
Sometimes, when no jobs were submitted to glidein nodes, the startup script kept running when it should have
terminated after a certain period of time. Why does this happen?
[A2]
With a condor_status -l command, if UpdateSequenceNumber keeps increasing, the glidein startd is
still running; the value increases each time the startd re-advertises itself (once every 5 minutes by default).
If the glidein shuts down, the job should leave the queue. If the glidein is indeed not shutting down when it should,
then the glidein's log files (i.e. MasterLog, StartdLog) would be most useful in figuring out why. However, the
reason that the glidein startup script kept running after the glideins were gone was not investigated further, since
the problem did not recur.
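The UpdateSequenceNumber check can be automated by comparing two snapshots of condor_status -l output taken more than one advertising interval apart. The sketch below assumes the usual "Attribute = value" line format of a long-form ad; the sample ads and the machine name in them are made up for illustration.

```python
def parse_classad(text):
    """Parse the 'Attr = value' lines of one ad into a dict of strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        value = value.strip()
        # Drop the surrounding quotes of plain string values.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        ad[name.strip()] = value
    return ad


def startd_still_advertising(earlier, later):
    """True if UpdateSequenceNumber grew between the two snapshots."""
    return int(later["UpdateSequenceNumber"]) > int(earlier["UpdateSequenceNumber"])


# Invented sample ads, trimmed to the attributes used here.
first = parse_classad('Name = "glidein-node.example.org"\nUpdateSequenceNumber = 41\n')
second = parse_classad('Name = "glidein-node.example.org"\nUpdateSequenceNumber = 43\n')
print(startd_still_advertising(first, second))
```

If the number stops growing while the startup script is still in the queue, MasterLog and StartdLog are the next place to look.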
Schedd-based Glidein
The Internals
For implementation details, please visit the schedd-based glidein page.
Service Extensions
The current version of schedd-based glidein includes support for Condor-G, Condor-C, and scheduler universe jobs.
Proper configuration can be added to support gt4 GRAM as well as other batch systems such as PBS and LSF.
As an example,
1. Include the following configurations for gt4
GT4_GAHP = $(SBIN)/gt4_gahp
GT4_LOCATION = $(LIB)/gt4
GRIDFTP_SERVER = $(LIBEXEC)/globus-gridftp-server
GRIDFTP_SERVER_WRAPPER = $(LIBEXEC)/gridftp_wrapper.sh
2. Include the following for PBS and LSF
GLITE_LOCATION = $(LIB)/glite
PBS_GAHP = $(GLITE_LOCATION)/bin/batch_gahp
LSF_GAHP = $(GLITE_LOCATION)/bin/batch_gahp
UNICORE_GAHP = $(SBIN)/unicore_gahp
NORDUGRID_GAHP = $(SBIN)/nordugrid_gahp
These service extensions with schedd-based glidein can be selected via options of the command condor_glidein.
Glidein Binaries
Schedd-based glidein binaries are available here.
Problems Experienced and Their Solutions
This section is better illustrated in the form of questions and answers:
Glidein Binaries
[Q1]
Which daemons are necessary for schedd-based glidein?
[A1]
a. Always needed: {condor_master, condor_schedd}
b. Grid universe jobs: condor_gridmanager, {gt2: gahp_server, gt4: gt4_gahp}
c. Condor-C jobs: {condor_c-gahp, condor_c-gahp_worker_thread}
d. PBS/LSF jobs: everything under lib/glite except batch_gahp_daemon
Authentication
[Q2]
The glidein schedd was able to show up with condor_status -schedd, but when a job was submitted to this
schedd, the following errors were returned:
condor_submit cgtest1 -name agrd0926@acas0011.usatlas.bnl.gov
Submitting job(s)
ERROR: Failed to connect to queue manager agrd0926@acas0011.usatlas.bnl.gov
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5004:Failed to get authorization from server. Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
[A2]
When a user connects to a condor_schedd to submit or modify jobs, the schedd will request user authentication.
Multiple authentication methods are available. The most usual method is FS. In this method, the schedd
picks a filename that does not currently exist and then asks condor_submit to create that file. The schedd then
checks the owner of the file. This method works fine when the user running condor_submit and the schedd are
on the same machine, but not when they're on different machines. As a result, you need to configure another
authentication method. Ultimately, the user will want to set up one of the strong authentication methods such as GSI
or Kerberos. For simplicity in the testing phase, CLAIMTOBE is chosen for authentication, which means
the schedd trusts whoever runs condor_submit to be who they claim to be.
Example:
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, CLAIMTOBE
[Q3]
In the current configuration settings, SEC_DEFAULT_AUTHENTICATION_METHODS is commented out and is therefore not
defined in the production machine's condor_config file. Does this attribute also have a default value?
[A3]
Existence of Schedd
[Q4]
What would be the best way to query whether the remote schedd is still running (so that, if not, the schedd glidein can be relaunched)?
1. condor_status -schedd
2. condor_config_val -schedd (this doesn't seem to be implemented yet)
3. simple script:
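# Look up the Name attribute of the glidein schedd advertised in the collector
sname=$(condor_status -schedd -l -constraint 'is_glidein =?= true' |
        grep '^Name' | sed -e 's/^Name[ ]*=[ ]*"\(.*\)"/\1/')
# condor_q succeeds only if the schedd answers right now
if [ -n "$sname" ] && condor_q -name "$sname"; then
    echo "cool, schedd is still alive"
else
    echo "schedd's gone :("
fi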
[A4]
When the schedd exits, it unadvertises itself. But if the batch system kills it with a KILL signal,
then it doesn't get a chance to unadvertise itself and its ad will remain in the collector for up to 15 minutes.
As a result, if the schedd shows up in condor_status -schedd, then you know it was running sometime in the past
15 minutes, but may not be now. To know if the schedd is running now, you can use condor_config_val
or condor_q as you outlined. You'll need to use the -name option to condor_config_val.
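The same liveness check can be wrapped in a small function for use in a relaunch script. This is a sketch that assumes, as the script in the question does, that condor_q -name exits with a non-zero status when it cannot reach the schedd. The function name and the injectable run parameter are choices made for this example so that the logic can be tried without a Condor installation.

```python
import subprocess


def schedd_alive(name, run=subprocess.run):
    """True if `condor_q -name <name>` succeeds, i.e. the schedd answers now.

    A schedd ad that is still listed by condor_status -schedd is not enough,
    because a killed schedd may stay in the collector for up to 15 minutes.
    """
    result = run(["condor_q", "-name", name],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling schedd_alive with the schedd name from the earlier example, agrd0926@acas0011.usatlas.bnl.gov, would run the real condor_q; if condor_q is not on the PATH, subprocess.run raises FileNotFoundError.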
Command Caveat
[Q5]
In trying schedd glidein, suppose the user accidentally uses condor as the jobmanager on the gatekeeper:
condor_glidein -count 3 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork \
    nostos.cs.wisc.edu/jobmanager-condor -type schedd -forcesetup
What would happen? We can see a fixed number of glidein schedds on the list, but the individual schedd names
(which imply where the schedds are running) seem to vary.
[A5]
1. These glideins were sent to the condor jobmanager, rather than the fork jobmanager.
This means they were submitted to Condor on nostos and ran on execute machines in the Condor pool.
Condor will keep a job running as long as it does not exit. If the machines the glideins are running
on become unavailable, Condor will kill the glideins (i.e. master and schedd) and then restart them
on other machines that are available. Since the master daemon normally does not exit, the glideins will run forever,
occasionally being killed and restarted on different machines. On PBS or LSF, jobs can only run for
a limited amount of time. If they run longer, they are killed and removed from the queue.
For fork jobs, Globus doesn't attempt to restart them if the machine reboots, but will otherwise let them run forever.
2. The schedd does not have an attribute similar to the startd's STARTD_NOCLAIM_SHUTDOWN.
Shared Libraries
[Q6]
A scheduler universe job, with the executable set to ps, was submitted to one of the glidein schedds from nostos
but produced no result in the _condor_stdout file under the spool directory:
... /Condor_glidein/local/spool.128.105.149.101-6216/cluster2.proc0.subproc0
I then checked the job_queue.log file and saw ExitStatus 127. Running the copy of ps in the spool directory
failed as expected (with exit code 127). The error message is as follows:
./ps: error while loading shared libraries: libproc.so.2.0.17: cannot open shared object file: No such file or directory
[A6]
Shared libraries are the bane of Linux users. For standard system tools, you can run the version installed on
the execution machine by setting the following in the submit file:
executable = /bin/ps
transfer_executable = false
Different File System between Submit and Execute Machines
[Q7]
When glidein was applied to nodes at WISC, both globus-url-copy and wget failed. Glidein failed at the stage of downloading
the binaries.
[A7]
This user needs to be mapped to a UID that has both read and write access to the glidein-related directories (e.g.
Condor_glidein/). For AFS, the user needs to have read/write permissions enabled in the ACLs of all the glidein-related
directories.
Life Span of Schedd
[Q8]
How long will the schedd run on the gatekeeper? Is there any schedd counterpart to STARTD_NOCLAIM_SHUTDOWN?
[A8]
No similar attribute exists for the schedd. The glidein schedd will run for as long as the gatekeeper is running,
that is, until the gatekeeper is shut down and restarted. The master will make no attempt to restart the
schedd after the gatekeeper restarts.
One Master but More than One Schedd?
[Q9]
I noticed that, for now, in one of the BNL Condor pools, there is only one instance of master but two schedds.
How does that happen? Is there any way to configure the number of schedd instances that a master can spawn?
[A9]
It is possible, but very tricky, to have one master spawn two schedds. One of the following is more likely.
The first is that the schedd spawned a child process to handle file transfers and a couple of other tasks;
the child process is also named 'condor_schedd' and is usually short-lived. The second is that someone started
a schedd directly without a master watching over it. You can rule out the first case if both
condor_schedd processes have been alive for more than a few minutes.
Site-specific Issues
Below is a summary of technical issues specific to site policies or site attributes:
- Within BNL, inbound connections are limited to TCP and to a certain port range. So, in order for the glidein to work, the following configurations need to be added to the condor_config file:
UPDATE_COLLECTOR_WITH_TCP = True
COLLECTOR_SOCKET_CACHE_SIZE = 128
However, the option of using TCP is also programmed into condor_glidein, so the user can force glideins to be submitted with -tcp option...
- The port that the collector listens on needs to be adjusted to the port range open for inbound/outbound connections. For example, the production grid servers at BNL all have a port range open through the firewall, defined by GLOBUS_TCP_PORT_RANGE. So, LOWPORT and HIGHPORT in the Condor configuration file can be set to the lower bound and upper bound of GLOBUS_TCP_PORT_RANGE respectively. The port for the collector can be set to fall between LOWPORT and HIGHPORT...
- Since the glidein schedd is expected to be submitted to gatekeepers (or dedicated machines) on a site, if the target site does not share a file system with the submit machine, the submitter needs to be mapped to a user (UID) that has both read and write access to the glidein-related directories (the default top-level directory for glidein is usually Condor_glidein/). For example, if the gatekeeper behind nostos (a node at WISC), which is in AFS, were to run the glidein schedd, the submitter would need to obtain AFS tokens first, since Globus does not know how to get the necessary AFS tokens. If the read/write permissions are not already activated, the user needs to run fs setacl rl to add read/write permissions...
- Some sites may not agree to have a Condor schedd constantly resident on the gatekeeper. In this case, one possible solution would be to use jobmanager-[pbs|lsf|condor] to let the batch system sitting behind the gatekeeper deploy the glidein (i.e. master and schedd). This would work fine with sites that use Condor. This is based on the observation that the master daemon normally does not exit by itself; as a result, even if the schedd glideins are killed, the glidein still gets to run (restart) on any machine that becomes available. This behavior may not apply as well to other batch systems such as LSF and PBS, since they usually only allow jobs to run for a certain time, after which the jobs become subject to removal...
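For the port range item above, the Condor settings can be derived mechanically from the value of GLOBUS_TCP_PORT_RANGE. The sketch below assumes the range is given as "low,high" (or "low high"); the range 20000,25000 is a made-up example, the function name is an invention of this sketch, and 24000 is the collector port mentioned later on this page.

```python
def port_settings(globus_tcp_port_range, collector_host, collector_port=None):
    """Derive LOWPORT, HIGHPORT and COLLECTOR_HOST lines from a port range."""
    low, high = (int(p) for p in globus_tcp_port_range.replace(",", " ").split())
    if low > high:
        raise ValueError("lower bound exceeds upper bound")
    if collector_port is None:
        collector_port = low  # arbitrary default chosen for this sketch
    if not low <= collector_port <= high:
        raise ValueError("collector port is outside the open range")
    return [
        f"LOWPORT = {low}",
        f"HIGHPORT = {high}",
        f"COLLECTOR_HOST = {collector_host}:{collector_port}",
    ]


# Hypothetical range; in practice read it from the GLOBUS_TCP_PORT_RANGE variable.
for line in port_settings("20000,25000", "gridui01.usatlas.bnl.gov", 24000):
    print(line)
```

The check on the collector port catches the case where the collector would end up listening outside the range opened in the firewall.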
Notes for Debugging and Testing
Connection between Remote Daemons and Collector
For example, if the glidein is launched from the submit host gridui01 at BNL, running the following command on the submit machine:
lsof -i -U | grep 'gridgk01'
will show which daemons are connected to the collector. For example, for a collector
that listens on port 24000 (unconventional but necessary at BNL):
$> /usr/sbin/lsof -i -U | grep 'gridgk01'
condor_co 3785 pleiades 11u IPv4 1405125787 TCP gridui01.usatlas.bnl.gov:24000->gridgk01.racf.bnl.gov:40553 (ESTABLISHED)
condor_co 3785 pleiades 12u IPv4 1405126124 TCP gridui01.usatlas.bnl.gov:24000->gridgk01.racf.bnl.gov:40557 (ESTABLISHED)
condor_ne 3788 pleiades 10u IPv4 1405204489 TCP gridui01.usatlas.bnl.gov:20219->gridgk01.racf.bnl.gov:40551 (ESTABLISHED)
The lsof output above shows that the collector (condor_co, listening on port 24000) has established connections with whichever daemons communicate via ports 40553 and 40557
on the remote site. To see which daemon it is, globus-job-run comes in handy. As an example,
globus-job-run gridgk01.racf.bnl.gov/jobmanager-fork /usr/sbin/lsof -i -U -u <your user name> |
grep 40557 # check which process is listening to port 40557 then lookup its pid
Then do:
globus-job-run gridgk01.racf.bnl.gov/jobmanager-fork /bin/ps -p 14513
globus-job-run gridgk01.racf.bnl.gov/jobmanager-fork /bin/cat /proc/14513/status
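When there are many connections, the remote ports can be pulled out of the lsof output with a short script instead of being read by eye. This sketch only handles lines shaped like the three shown above (numeric ports, state ESTABLISHED); the function name is made up, and the sample text is copied from that output.

```python
import re

# Matches "TCP local_host:port->remote_host:port (ESTABLISHED)" at the end of a line.
_CONN = re.compile(
    r"TCP\s+(?P<lhost>[^:\s]+):(?P<lport>\d+)->"
    r"(?P<rhost>[^:\s]+):(?P<rport>\d+)\s+\(ESTABLISHED\)"
)


def established_connections(lsof_output, remote_substring):
    """List established TCP connections whose remote host contains remote_substring."""
    conns = []
    for line in lsof_output.splitlines():
        match = _CONN.search(line)
        if match and remote_substring in match.group("rhost"):
            fields = line.split()
            conns.append({
                "command": fields[0],
                "pid": int(fields[1]),
                "local_port": int(match.group("lport")),
                "remote_port": int(match.group("rport")),
            })
    return conns


sample = """\
condor_co 3785 pleiades 11u IPv4 1405125787 TCP gridui01.usatlas.bnl.gov:24000->gridgk01.racf.bnl.gov:40553 (ESTABLISHED)
condor_co 3785 pleiades 12u IPv4 1405126124 TCP gridui01.usatlas.bnl.gov:24000->gridgk01.racf.bnl.gov:40557 (ESTABLISHED)
condor_ne 3788 pleiades 10u IPv4 1405204489 TCP gridui01.usatlas.bnl.gov:20219->gridgk01.racf.bnl.gov:40551 (ESTABLISHED)
"""
for conn in established_connections(sample, "gridgk01"):
    print(conn["command"], conn["local_port"], "<-", conn["remote_port"])
```

Each remote port found this way can then be looked up on the gatekeeper with the globus-job-run commands above.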
Problems on Rare Occasions
1. Failure to uncompress the glidein tarballs during the setup job due to incomplete file transfer
2. condor_write() errors in MasterLog:
Error messages:
condor_write(): Socket closed when trying to write buffer, fd is 8, errno=104
Buf::write(): condor_write() failed
Got SIGTERM. Performing graceful shutdown.
Sent SIGTERM to STARTD (pid 4722)
The STARTD (pid 4722) exited with status 0
All daemons are gone. Exiting.
This is probably the master failing to connect to the collector to advertise itself. The
startd will also run into the same problem, so the machine will not show up in condor_status. CollectorLog would be the place to look for the cause of the failure and for error messages.
Comparisons
Glidein is achieved through two successive Condor-G jobs, one being the setup job and the other the startup job.
Condor-G converts the commands in the submit description file and then appends the expressions that are not covered by submit commands,
given through Globus_RSL, such as (count=5)(jobtype=single), which starts 5 instances of the executable, each as a single process.
Below is an example, taken from the case of the startup script, that shows how the submit file is converted to the RSL that GRAM
understands:
Submit file for startup script
Universe = Globus
Executable = $(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4/glidein_startup
Arguments = -dyn -f
Environment = CONDOR_CONFIG=$(DOLLAR)(HOME)/Condor_glidein/schedd_glidein_condor_config;
_condor_CONDOR_HOST=gridui01.usatlas.bnl.gov;
_condor_GLIDEIN_HOST=gridui01.usatlas.bnl.gov;
_condor_LOCAL_DIR=$(DOLLAR)(HOME)/Condor_glidein/local;
_condor_SBIN=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4;
_condor_LIB=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4;
_condor_LIBEXEC=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4;
_condor_CONDOR_ADMIN=pleiades@bnl.gov;_condor_NUM_CPUS=1;
_condor_UID_DOMAIN=racf.bnl.gov;
_condor_FILESYSTEM_DOMAIN=racf.bnl.gov;
_condor_MAIL=/bin/mail;
_condor_START_owner=pleiades;
_condor_UPDATE_COLLECTOR_WITH_TCP=True
Transfer_Executable = False
GlobusRSL = (count=3)(jobtype=single)
GlobusScheduler = gridgk01.racf.bnl.gov/jobmanager-condor
Notification = Never
Queue
RSL
&(rsl_substitution=(GRIDMANAGER_GASS_URL https://gridui01.racf.bnl.gov:20001))
(executable=$(HOME)#'/Condor_glidein/6.8.1-i686-pc-Linux-2.4/glidein_startup')
(scratchdir='')
(directory=$(SCRATCH_DIRECTORY))
(arguments=-dyn -f)
(environment=(CONDOR_CONFIG $(HOME)#'/Condor_glidein/schedd_glidein_condor_config')
(_condor_CONDOR_HOST 'grid10.racf.bnl.gov')
(_condor_GLIDEIN_HOST 'grid10.racf.bnl.gov')
(_condor_LOCAL_DIR $(HOME)#'/Condor_glidein/local')
(_condor_SBIN $(HOME)#'/Condor_glidein/6.8.1-i686-pc-Linux-2.4')
(_condor_CONDOR_ADMIN 'pleiades@bnl.gov')
(_condor_NUM_CPUS '1')
(_condor_UID_DOMAIN 'racf.bnl.gov')
(_condor_FILESYSTEM_DOMAIN 'racf.bnl.gov')
(_condor_MAIL '/bin/mail')
(_condor_STARTD_NOCLAIM_SHUTDOWN '1200')
(_condor_START_owner 'pleiades')
(_condor_UPDATE_COLLECTOR_WITH_TCP 'True'))
(proxy_timeout=240)
(save_state=yes)
(two_phase=600)
(remote_io_url=$(GRIDMANAGER_GASS_URL))
(count=3)
(jobtype=single)
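To make the mapping between the Environment command and the (environment=...) clause more concrete, here is a small sketch of the conversion for plain literal values. It is not Condor-G's actual code: the function name is made up, it does not reproduce the $(HOME)#'...' substitution form seen above, and it does not escape quotes inside values.

```python
def environment_to_rsl(environment):
    """Turn 'NAME=value;NAME=value' into an RSL (environment=...) clause."""
    pairs = []
    for item in environment.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in environment entry: {item!r}")
        pairs.append(f"({name.strip()} '{value.strip()}')")
    return "(environment=" + "".join(pairs) + ")"


print(environment_to_rsl("_condor_NUM_CPUS=1;_condor_MAIL=/bin/mail;"))
```

The remaining clauses, such as (count=3)(jobtype=single), come from the GlobusRSL line of the submit file.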
Major updates:
-- TWikiAdminGroup - 18 Jun 2013