
CondorExperience - Functionality and Performance Assessments for Condor-G and Glidein


Condor-G Performance

The performance of Condor-G depends on both the nature of the jobs and the policy of the gatekeeper. Also, since achieving high throughput is the main goal of Condor, the performance evaluation needs to be done over a substantial number of jobs as a whole. There are therefore two major performance metrics to consider: queuing time and running time. Queuing time is the average time it takes for submitted jobs to be launched on the worker nodes; in other words, the time it takes for jobs to go from the submit machines to the execute machines. Running time is the average time it takes for jobs to run from start to completion. These metrics are reduced to their simplest forms here, since job preemption and checkpointing are not yet being considered. The queuing time is largely influenced by the policy enforced by the gatekeeper and by the requirements and preferences set on the worker nodes. The running time analysis could be more complex; however, without considering job preemption and checkpointing (or even failures), it is largely determined by the contents of the jobs, since the executables and the amount of I/O will most likely differ from job to job.
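
As a rough illustration (not part of the original test setup), both metrics can be estimated for completed jobs with condor_history, assuming the standard job ClassAd attributes QDate, JobStartDate and CompletionDate are available (this may vary with the Condor version):

      condor_history -constraint 'JobStartDate > 0 && CompletionDate > 0' \
                     -format "%d "  QDate \
                     -format "%d "  JobStartDate \
                     -format "%d\n" CompletionDate |
      awk '{ q += $2 - $1; r += $3 - $2; n++ }
           END { if (n) printf "avg queuing time: %ds  avg running time: %ds  (%d jobs)\n", q/n, r/n, n }'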

The way to obtain a steady queuing time is to fix the requirements and rank attributes of both jobs and machines and to perform all tests against the same gatekeeper before targeting the next one. However, it would be an overly idealized case for all jobs and machines to have exactly the same requirements; a Gaussian distribution from looser to stricter requirements for jobs and machines would be one possible and reasonable test case. To fix the rank attribute, we can assume that all tests are done with jobs from the same user. The running time can be made steady using a similar methodology, by assuming that the CPU time consumed by the jobs (the executable in particular) follows a Gaussian distribution, and likewise for the I/O time. The cases of CPU time and I/O time have to be treated separately since they are intrinsically different, meaning that the running times of CPU-intensive and I/O-intensive jobs are of different orders.

Test Cases

The test cases can be categorized by the goal of the testing. If queuing time is more important than running time, then the factors influencing the running time have to be fixed when the Condor-G test is performed. Again, we can consider two cases for fixing the running time of jobs: Gaussian distributions (or another distribution, depending on the overall nature of the physics experiments) of CPU time and of I/O time among the tested jobs. For example, if there are 1600 jobs to be tested, 200 jobs would be the least CPU-intensive, 800 jobs would be in the medium range of CPU usage, and the remaining 600 jobs would be the most CPU-intensive (1,4,6,4,1)...
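
As a rough sketch (a hypothetical helper, not an existing tool), CPU-time tiers following the (1,4,6,4,1)/16 weights could be generated along these lines, emitting one target CPU time per job for a submit-file generator to consume; the per-tier CPU seconds are placeholder values:

      #!/bin/sh
      # split TOTAL jobs into 5 CPU-time tiers with (1,4,6,4,1)/16 weights
      TOTAL=1600
      WEIGHTS="1 4 6 4 1"
      TIMES="60 300 900 2700 5400"   # hypothetical CPU seconds per tier
      set -- $TIMES
      for w in $WEIGHTS; do
        n=$(( TOTAL * w / 16 ))
        t=$1; shift
        i=0
        while [ $i -lt $n ]; do
          echo $t                    # one line per job: its target CPU time
          i=$(( i + 1 ))
        done
      done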

Test Methods

1. Between Condor and Gatekeepers
Use the Grid Exerciser to do functionality tests against the gatekeepers.

2. Condor-G performance metrics
For lack of benchmarks for general Condor-G performance testing (to my knowledge), here is a simple idea for the general testing:

  1. Create a program (in Perl or shell script, for example) that automatically generates submit description files, with options available for specifying the executable, requirements, the input/output/log directories, grid_resource, and so on.
  2. In the case of testing Condor-G, a wrapper program on top of the submit-file-generation program can be developed to automate the submit file generation even further, based on the observation that submit files defined in the same universe (i.e. the grid universe for Condor-G jobs) usually differ in only a few commands such as the executable and input/output, among others. The grid_resource command does not change if the test is performed against the same gatekeepers running the same version of GRAM (i.e. the test is performed within a site such as BNL). As a result, it is possible to have this wrapper program select among a list of available test programs and then substitute the executable in the submit file (to generate a new job), as an example.
  3. Similarly, the wrapper program can alter the requirements of the jobs.
  4. Set up cron jobs that generate these submit files and then submit them to the gatekeeper that Condor-G communicates with.

A shell-script version of such a program is under way. As an example,

 
      condor_gen_cg -exec /bin/ps \
                    -out $HOME/condor_test/out.txt \
                    -in  $HOME/condor_test/in.txt
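
The kind of grid universe submit description file such a generator might emit is sketched below; the gatekeeper contact string and file paths are placeholders, not actual generated output:

      universe            = grid
      grid_resource       = gt2 gridgk01.racf.bnl.gov/jobmanager-fork
      executable          = /bin/ps
      transfer_executable = false
      input               = $ENV(HOME)/condor_test/in.txt
      output              = $ENV(HOME)/condor_test/out.txt
      error               = $ENV(HOME)/condor_test/err.txt
      log                 = $ENV(HOME)/condor_test/job.log
      notification        = Never
      queue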

Glidein

This section refers to the startd-based glidein, which is already supported by the current condor_glidein command. Below is a list of questions that arose during the tests, along with their solutions.

Estimation of the Life Span of a Glidein

[Q1]
Is there any way that the user on the submit side can estimate the duration of a glidein resource?
[A1]
There's no way for the glidein to figure out how long the batch system is willing to let it run. If you use the -runtime option to condor_glidein, the glidein will shut down after that much time and advertise the shutdown time as DaemonStopTime in the startd ad.
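
For example (a hedged sketch; the argument ordering follows the condor_glidein example later on this page, and the units expected by -runtime should be checked against condor_glidein -help):

      condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -runtime 120 \
          gridgk01.racf.bnl.gov/jobmanager-fork
      # the advertised shutdown time then shows up in the glidein startd ad
      condor_status -l | grep DaemonStopTime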

Startup Script Runs Forever

[Q2]
Sometimes, when no jobs were submitted to the glidein nodes, the startup script kept running when it should have been terminated after a certain period of time. Why?
[A2]
With a condor_status -l command, if UpdateSequenceNumber keeps increasing, that means the glidein startd is still running; the value increases each time the startd re-advertises itself (once every 5 minutes by default). If the glidein shuts down, the job should leave the queue. If the glidein is indeed not shutting down when it should, then the glidein's log files (i.e. MasterLog, StartdLog) would be most useful in figuring out why. However, the reason that the glidein startup script kept running after the glideins were gone was not investigated further, since this problem did not occur later on.
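
For example, running the following a few minutes apart shows whether the glidein startd is still re-advertising itself:

      condor_status -l | grep -E 'Name|UpdateSequenceNumber'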

Schedd-based Glidein

The Internals

For implementation details, please visit the schedd-based glidein page.

Service Extensions

The current version of schedd-based glidein includes support for Condor-G, Condor-C, and scheduler universe jobs. Proper configuration can be added to support GT4 GRAM as well as other batch systems such as PBS and LSF. As an example,

1. Include the following configuration for GT4 GRAM:

      GT4_GAHP = $(SBIN)/gt4_gahp
      GT4_LOCATION = $(LIB)/gt4
      GRIDFTP_SERVER = $(LIBEXEC)/globus-gridftp-server
      GRIDFTP_SERVER_WRAPPER = $(LIBEXEC)/gridftp_wrapper.sh

2. Include the following for PBS and LSF:

      GLITE_LOCATION = $(LIB)/glite
      PBS_GAHP = $(GLITE_LOCATION)/bin/batch_gahp
      LSF_GAHP = $(GLITE_LOCATION)/bin/batch_gahp
      UNICORE_GAHP = $(SBIN)/unicore_gahp
      NORDUGRID_GAHP = $(SBIN)/nordugrid_gahp

These service extensions to the schedd-based glidein can be selected via options to the condor_glidein command.

Glidein Binaries

Schedd-based glidein binaries are available here

Problems Experienced and Their Solutions

This section is better illustrated in the form of questions and answers:

Glidein Binaries

[Q1]
Which daemons are necessary for schedd-based glidein?
[A1]
a. Always needed: {condor_master, condor_schedd}
b. Grid universe jobs: condor_gridmanager, {gt2: gahp_server, gt4: gt4_gahp}
c. Condor-C jobs: {condor_c-gahp, condor_c-gahp_worker_thread}
d. PBS/LSF jobs: everything under lib/glite except batch_gahp_daemon

Authentication

[Q2]
The glidein schedd showed up with condor_status -schedd, but when a job was submitted to this schedd, the following errors were returned:
condor_submit cgtest1 -name agrd0926@acas0011.usatlas.bnl.gov

Submitting job(s) 
ERROR: Failed to connect to queue manager agrd0926@acas0011.usatlas.bnl.gov 
AUTHENTICATE:1003:Failed to authenticate with any method 
AUTHENTICATE:1004:Failed to authenticate using GSI 
GSI:5004:Failed to get authorization from server.  Either the server does not trust your certificate, or you are not in the server's authorization file (grid-mapfile)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
[A2]
When a user connects to a condor_schedd to submit or modify jobs, the schedd will request user authentication. Multiple authentication methods are available. The most usual method is FS. In this method, the schedd picks a filename that does not currently exist and then asks condor_submit to create that file. The schedd then checks the owner of the file. This method works fine when the user who runs condor_submit and the schedd are on the same machine, but not when they are on different machines. As a result, you need to configure another authentication method. Ultimately, the user will want to set up one of the strong authentication methods such as GSI or Kerberos. For simplicity in the testing phase, CLAIMTOBE is chosen for the authentication, which means the schedd trusts whoever runs condor_submit to be who they say they are.
Example:
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, CLAIMTOBE

[Q3]
In the current configuration settings, SEC_DEFAULT_AUTHENTICATION_METHODS is commented out and therefore is not defined in the production machine's condor_config file. Does this attribute also have a default value?
[A3]

Existence of Schedd
[Q4]
What would be the best way to check whether the remote schedd is still running (so that, if it is not, the schedd glidein can be relaunched)?
1. condor_status -schedd
2. condor_config_val -schedd (this doesn't seem to be implemented yet)
3. simple script:
    # extract the glidein schedd's name from its ClassAd
    sname=$(condor_status -schedd -l -c "is_glidein=?=true" | grep Name |
            sed -e 's/Name[ ]*=[ ]*"\(.*\)"/\1/g')

    if condor_q -name "$sname"; then
      echo "cool, schedd is still alive"
    else
      echo "schedd's gone :("
    fi
   
[A4]
When the schedd exits, it unadvertises itself. But if the batch system kills it with a KILL signal, then it doesn't get a chance to unadvertise itself and its ad will remain in the collector for up to 15 minutes. As a result, if the schedd shows up in condor_status -schedd, then you know it was running sometime in the past 15 minutes, but it may not be running now. To know if the schedd is running now, you can use condor_config_val or condor_q as you outlined. You'll need to use the -name option to condor_config_val.
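
For example (a hedged sketch building on the script in the question above, with $sname the glidein schedd name extracted there; the exact condor_config_val syntax may vary with the Condor version):

      # querying any configuration variable succeeds only if the schedd answers
      condor_config_val -schedd -name "$sname" SPOOL
      # or simply try to talk to it with condor_q
      condor_q -name "$sname" > /dev/null 2>&1 && echo "schedd is alive now" || echo "schedd is gone"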

Command Caveat
[Q5]
In trying the schedd glidein, if the user accidentally uses condor as the jobmanager on the gatekeeper:

condor_glidein -count 3 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork \
    nostos.cs.wisc.edu/jobmanager-condor -type schedd -forcesetup

What would happen? We can see a fixed number of glidein schedds on the list, but the individual schedd names (which imply where the schedds are running) seem to vary.
[A5]
1. These glideins were sent to the condor jobmanager, rather than the fork jobmanager. This means they were submitted to Condor on nostos and ran on execute machines in the condor pool. Condor will keep a job running as long as it does not exit. If the machines the glideins are running on become unavailable, condor will kill the glideins (i.e. master and schedd) and then restart them on other machines that are available. Since the master daemon normally does not exit, the glideins will run forever, occasionally being killed and restarted on different machines. On PBS or LSF, jobs can only run for a limited amount of time. If they run longer, then they are killed and removed from the queue. For fork jobs, Globus doesn't attempt to restart them if the machine reboots, but will otherwise let them run forever.
2. The schedd does not have an attribute similar to the startd's STARTD_NOCLAIM_SHUTDOWN.

Shared Libraries
[Q6]
A scheduler universe job, with the executable set to ps, was submitted to one of the glidein schedds from nostos, but no result appeared in the _condor_stdout file under the spool directory:

... /Condor_glidein/local/spool.128.105.149.101-6216/cluster2.proc0.subproc0

and then I checked the job_queue.log file and saw ExitStatus 127. Running the ps copy in the spool directory resulted in failure as expected (with exit code 127). Error messages are as follows:

./ps: error while loading shared libraries: libproc.so.2.0.17: cannot open shared object file: No such file or directory

[A6]
Shared libraries are the bane of Linux users. For standard system tools, you can run the version installed on the execution machine with executable = /bin/ps and transfer_executable = false.
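
In the submit description file this corresponds to:

      executable          = /bin/ps
      transfer_executable = false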

Different File System between Submit and Execute Machines
[Q7]
When the glidein was applied to nodes at WISC, both globus-url-copy and wget failed. The glidein failed at the stage of downloading the binaries.
[A7]
The user needs to be mapped to a UID that has both read and write access to the glidein-related directories (e.g. Condor_glidein/). For AFS, the user needs to have read/write permissions enabled in the ACLs of all the glidein-related directories.
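
For example, with the standard AFS ACL command (the user name and directory are placeholders; "write" is the AFS shorthand for full read/write rights):

      fs setacl -dir $HOME/Condor_glidein -acl <afs user> write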

Life Span of Schedd
[Q8]
How long will schedd run on the gatekeeper? Is there any schedd counterpart to STARTD_NOCLAIM_SHUTDOWN?
[A8]
A similar attribute does not exist for the schedd. The glidein schedd will run for as long as the gatekeeper is running, until it is shut down and restarted; the master will make no attempt to restart the schedd after the gatekeeper restarts.

One Master but More than One Schedd?
[Q9]
I noticed that, for now, in one of the BNL Condor pools there is only one instance of the master but two schedds. How does that happen? Is there any way to configure the number of schedd instances that a master can spawn?
[A9]
It is possible, but very tricky, to have one master spawn two schedds. One of the following is more likely: one is that the schedd spawns a child process to handle file transfers and a couple of other tasks; the child process is also named 'condor_schedd' and is usually short-lived. The other is that someone started a schedd directly, without a master watching over it. You can rule out the first case if both condor_schedd processes have been alive for more than a few minutes.

Site-specific Issues

Below is a summary of technical issues specific to site policies or site attributes:


  1. Within BNL, inbound connections are limited to TCP and to a certain port range. So, in order for the glidein to work, the following configurations need to be added to the condor_config file:
    UPDATE_COLLECTOR_WITH_TCP = True
    COLLECTOR_SOCKET_CACHE_SIZE = 128
    However, the option of using TCP is also programmed into condor_glidein, so the user can force glideins to be submitted with the -tcp option...
  2. The port that the collector listens on needs to be adjusted with respect to the port range open for inbound/outbound connections. For example, the production grid servers at BNL all have a port range open through the firewall, defined by GLOBUS_TCP_PORT_RANGE. So, LOWPORT and HIGHPORT in the Condor configuration file can be set to the lower and upper bounds of GLOBUS_TCP_PORT_RANGE respectively. The port for the collector can then be set to fall between LOWPORT and HIGHPORT (see the configuration sketch after this list)...
  3. Since the glidein schedd is expected to be submitted to gatekeepers (or dedicated machines) at a site, if the target site does not share the same file system as the submit machine, the submitter needs to be mapped to a user (UID) that has both read and write access to the glidein-related directories (the default top-level directory for glidein is usually Condor_glidein/). For example, if the gatekeeper behind nostos (a node at WISC), which is on AFS, were to run the glidein schedd, the submitter needs to obtain AFS tokens first, since Globus does not know how to get the necessary AFS tokens. If the read/write permissions are not already activated, the user needs to run fs setacl to grant the required read/write permissions...
  4. Some sites may not agree to have a Condor schedd permanently resident on the gatekeeper. In this case, one possible solution would be to use jobmanager-[pbs|lsf|condor] to let the batch system sitting behind the gatekeeper deploy the glidein (i.e. master and schedd). This would work fine for sites that use Condor. This is based on the observation that the master daemon normally does not exit on its own; as a result, even if the schedd glidein is killed, the glidein still gets to run (restart) on any machine that becomes available. This behavior may not carry over to other batch systems such as LSF and PBS, since they usually only allow jobs to run for a certain time, after which they become subject to removal...
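
As a configuration sketch for item 2 above (the port range values are hypothetical; the collector port of 24000 is the one used at BNL, as mentioned in the debugging notes below):

      # assume GLOBUS_TCP_PORT_RANGE=20000,25000 on the gatekeeper
      LOWPORT  = 20000
      HIGHPORT = 25000
      # the collector port must fall inside that range
      COLLECTOR_HOST = $(CONDOR_HOST):24000
      # TCP updates through the firewall (item 1 above)
      UPDATE_COLLECTOR_WITH_TCP = True
      COLLECTOR_SOCKET_CACHE_SIZE = 128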

Notes for Debugging and Testing

Connection between Remote Daemons and Collector
For example, if the glidein is launched from the submit host gridui01 at BNL, running the command lsof -i -U | grep 'gridgk01' on the submit machine will show which daemons are connecting to the collector. For example, for a collector that listens on port 24000 (unconventional but necessary at BNL):
 
$> /usr/sbin/lsof -i -U | grep 'gridgk01'
condor_co  3785 pleiades   11u  IPv4 1405125787       TCP gridui01.usatlas.bnl.gov:24000->gridgk01.racf.bnl.gov:40553 (ESTABLISHED)
condor_co  3785 pleiades   12u  IPv4 1405126124       TCP gridui01.usatlas.bnl.gov:24000->gridgk01.racf.bnl.gov:40557 (ESTABLISHED)
condor_ne  3788 pleiades   10u  IPv4 1405204489       TCP gridui01.usatlas.bnl.gov:20219->gridgk01.racf.bnl.gov:40551 (ESTABLISHED)
By issuing lsof as above, we see that the collector is talking to remote daemons, for instance one communicating via port 40553 on the remote site. To see which daemon it is, globus-job-run comes in handy. As an example,
 
globus-job-run gridgk01.racf.bnl.gov/jobmanager-fork /usr/sbin/lsof -i -U -u <your user name> | 
     grep 40557 # check which process is listening to port 40557 then lookup its pid
Then do:
 
globus-job-run gridgk01.racf.bnl.gov/jobmanager-fork /bin/ps -p 14513
globus-job-run gridgk01.racf.bnl.gov/jobmanager-fork /bin/cat /proc/14513/status

Problems on Rare Occasions

1. Failure to decompress the glidein tarballs during the setup job due to incomplete file transfer

2. condor_write() errors in MasterLog:
Error messages: condor_write(): Socket closed when trying to write buffer, fd is 8, errno=104
Buf::write(): condor_write() failed
Got SIGTERM. Performing graceful shutdown.
Sent SIGTERM to STARTD (pid 4722)
The STARTD (pid 4722) exited with status 0
All daemons are gone. Exiting.

This probably means the master is failing to connect to the collector to advertise itself. The startd will also run into the same problem, so the machine will not show up in condor_status. The CollectorLog would be the place to trace the reason for the failure and the error messages.
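
For example, on the machine hosting the collector:

      # LOG is the collector host's log directory as reported by its own configuration
      grep -i error "$(condor_config_val LOG)"/CollectorLog | tail -n 20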

Comparisons

Glidein is achieved through two successive Condor-G jobs, one being the setup job and the other the startup job. Condor-G converts the commands in the submit description file into RSL and appends expressions not derived from the submit file commands, such as the GlobusRSL string (count=5)(jobtype=single), which starts 5 instances of the executable, each running as a single process. Below is an example, taken from the case of the startup script, that shows how the submit file is converted into RSL that GRAM understands:

Submit file for startup script

Universe = Globus
Executable = $(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4/glidein_startup
Arguments = -dyn -f
Environment = CONDOR_CONFIG=$(DOLLAR)(HOME)/Condor_glidein/schedd_glidein_condor_config; 
              _condor_CONDOR_HOST=gridui01.usatlas.bnl.gov; 
              _condor_GLIDEIN_HOST=gridui01.usatlas.bnl.gov; 
              _condor_LOCAL_DIR=$(DOLLAR)(HOME)/Condor_glidein/local;
              _condor_SBIN=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4;
              _condor_LIB=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4;
              _condor_LIBEXEC=$(DOLLAR)(HOME)/Condor_glidein/6.8.1-i686-pc-Linux-2.4;
              _condor_CONDOR_ADMIN=pleiades@bnl.gov;_condor_NUM_CPUS=1;
              _condor_UID_DOMAIN=racf.bnl.gov;
              _condor_FILESYSTEM_DOMAIN=racf.bnl.gov;
              _condor_MAIL=/bin/mail;
              _condor_START_owner=pleiades;
              _condor_UPDATE_COLLECTOR_WITH_TCP=True
Transfer_Executable = False
GlobusRSL = (count=3)(jobtype=single)
GlobusScheduler = gridgk01.racf.bnl.gov/jobmanager-condor
Notification = Never
Queue

RSL
&(rsl_substitution=(GRIDMANAGER_GASS_URL https://gridui01.racf.bnl.gov:20001))
 (executable=$(HOME)#'/Condor_glidein/6.8.1-i686-pc-Linux-2.4/glidein_startup')
 (scratchdir='')
 (directory=$(SCRATCH_DIRECTORY))
 (arguments=-dyn -f)
 (environment=(CONDOR_CONFIG $(HOME)#'/Condor_glidein/schedd_glidein_condor_config')
              (_condor_CONDOR_HOST 'grid10.racf.bnl.gov')
              (_condor_GLIDEIN_HOST 'grid10.racf.bnl.gov')
              (_condor_LOCAL_DIR $(HOME)#'/Condor_glidein/local')
              (_condor_SBIN $(HOME)#'/Condor_glidein/6.8.1-i686-pc-Linux-2.4')
              (_condor_CONDOR_ADMIN 'pleiades@bnl.gov')
              (_condor_NUM_CPUS '1')
              (_condor_UID_DOMAIN 'racf.bnl.gov')
              (_condor_FILESYSTEM_DOMAIN 'racf.bnl.gov')
              (_condor_MAIL '/bin/mail')
              (_condor_STARTD_NOCLAIM_SHUTDOWN '1200')
              (_condor_START_owner 'pleiades')
              (_condor_UPDATE_COLLECTOR_WITH_TCP 'True'))
 (proxy_timeout=240)
 (save_state=yes)
 (two_phase=600)
 (remote_io_url=$(GRIDMANAGER_GASS_URL))
 (count=3)
 (jobtype=single)

