
CRS User Manual

Job Definition File

The job is described by the user using the same format as in the old CRS. I will describe it here using an example.

Let us suppose that a user wants to write a job which takes two input files, one stored on an NFS disk and one in HPSS tape storage. The job uses an executable which takes a number of input parameters and then writes three output files, two of which are to be written to HPSS and one to an NFS area. Moreover, one of the output files is considered mandatory, which means that if it is not present when the user's executable completes, the job is considered a failure. The other two files are optional, which means that if the job does not produce them, it can still be considered a success.

To be considered successful, the job should produce all of the declared mandatory output files and the user's executable should return an exit code which corresponds to a successful exit (defined in the input cards).

The job definition file would then look more or less like this (lines which start with # are comments):

# first of all, user should specify the executable
executable=/u0b/tgtreco/crs/bin/run_reco.csh
# and its arguments
executableargs=0,none,none,0,run3dAu_v03_pro44_crstest,run03_production_01

# Then he/she should specify an e-mail address, so that the job knows where to send e-mail notifications
notify=my_email@bnl.gov
# Then he/she should include the magic line:
mergefactor=1

# How many input files are there?
inputnumstreams=2

# What are the input files?

# The first file is stored in HPSS
inputstreamtype[0]=HPSS
# this is its directory
inputdir[0]=/home/rcfreco/
# and this is its file name
inputfile[0]=test_file.tar

# The second file is stored in NFS area on a UNIX disk
inputstreamtype[1]=UNIX
# this is its directory
inputdir[1]=/some/directory
# and this is its file name
inputfile[1]=some_file_name.dat

# now output files

# How many output files will the job produce?
outputnumstreams=3

# What are the output files?

# The first one will be an hpss file
outputstreamtype[0]=HPSS
# here is its target directory
outputdir[0]=/home/rcfreco/
# and its name
outputfile[0]=junk_file_delete_it_A.dat
# the file is mandatory; if it is missing at the end of the job, the job has failed.
outputmandatory[0]=yes

# the second HPSS file

outputstreamtype[1]=HPSS
outputdir[1]=/junk/data
outputfile[1]=second_junk_file.dat
# the file is optional, if it is missing, no harm is done.
outputmandatory[1]=no

# the third file is UNIX
outputstreamtype[2]=UNIX
outputdir[2]=/some_directory/somwhere
outputfile[2]=junk_file_name.dat
outputmandatory[2]=no

# where should CRS put the standard output and standard error of the user's executable at the end of the job?

# standard output should go to directory
stdoutdir=/star/rcf/prodlog/P05ic/log/daq
# to file
stdout=rcf_big_temp.out

# standard error should go to
stderrdir=/star/rcf/prodlog/P05ic/log/daq
stderr=rcf_big_temp.err

# and finally, what is the correct exit code of the user's executable?
# This line is optional; if it is not present in the job definition file,
# the correct exit code is assumed to be 0.
# But just for the fun of it, let us declare that a successful user's job should end with status 7 (why not?)
executableexitcode=7

Relation Between Job Definition and Job Environment Variables.

The user executable should assume that all input and output files are located on the local disk in the directory where the binary is executed. (There is an exception to this rule for STAR; more about it below.) The CRS system will define the following environment variables:

  1. INPUTn (n=0,..., number of input files-1): base name of the input file n.
  2. ACTUAL_INPUTn: full (directory included) name of input file n.
  3. INPUT_TYPEn : UNIX or HPSS, depending on whether input file n is a Unix or HPSS file.
  4. OUTPUTn : base name of output file n.
  5. ACTUAL_OUTPUTn : full name of output file n.
  6. OUTPUT_TYPEn : file type (UNIX or HPSS) of output file n.
  7. CRS_STDOUT , CRS_STDOUT_DIR : name and directory of the standard output file.
  8. CRS_STDERR , CRS_STDERR_DIR : name and directory of the standard error file.
  9. OUTPUTNUMSTREAMS, INPUTNUMSTREAMS : number of output and input files.
  10. MANDATORYOUTPUTn : yes or no, depending on whether the output has been declared as mandatory by the user.

Star exception:

For the STAR experiment, for UNIX files both variables ACTUAL_INPUTn and INPUTn denote the full name of the input file (including the directory).
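
As an illustration, here is a minimal Python sketch (not part of CRS) of a user executable reading these variables; the processing itself is omitted:

#!/usr/bin/env python
# Minimal sketch of a user executable reading the CRS environment variables.
import os

n_in = int(os.environ["INPUTNUMSTREAMS"])
n_out = int(os.environ["OUTPUTNUMSTREAMS"])

# input files are staged into the current working directory by CRS
for i in range(n_in):
    name = os.environ["INPUT%d" % i]          # base name (full name for STAR UNIX files)
    ftype = os.environ["INPUT_TYPE%d" % i]    # UNIX or HPSS
    print("input %d: %s (%s)" % (i, name, ftype))

# output files must be written locally under their declared base names
for i in range(n_out):
    name = os.environ["OUTPUT%d" % i]
    mandatory = os.environ["MANDATORYOUTPUT%d" % i]   # yes or no
    print("will produce %s (mandatory=%s)" % (name, mandatory))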

Job Queues

There are 5 queues to which jobs can be submitted (there are 6 queues for STAR). Queue 5 corresponds to the fastest machines, queue 1 to the slowest. An additional queue, 0, corresponds to CAS machines.

How to Create a Job.

  1. Each time you log on to a submit machine, execute the command: setup_crs.
  2. Prepare a job definition file. Store it in some directory on the submit machine.
  3. Execute the command: crs_job -create job_definition_file [options]
  4. The job will be created. It will be given a name which consists of the job definition file name with a time stamp appended to it.

The possible options are:

* [-qn] : the job should be submitted to queue n

* [-pn] : the job should be given priority n (0<n<20)

* [-drop] : if all machines in the queue n are occupied, then the job can be executed in a slower queue.

Example:


 crs_job -create ~/newcrs/production/rcrsuser2_success.jdf  -q4 -p5 -drop

This means: create a job using the job definition file ~/newcrs/production/rcrsuser2_success.jdf, submit it to queue 4, give it priority 5, and if the queue is full, execute it in a slower queue. Once you have created a job it is registered in the CRS system and has status CREATED.

The order of options is not important; however, the options should come after the job definition file.

That's all. You can now see the job in the system using the crs_panel or crs_job -stat commands.

How to Submit a Job.

There are two ways you can submit a job:

  1. Do nothing. Once your job is created, it will be submitted for execution by the loader daemon when its time is due.

  2. Submit it manually. To do this open crs_panel, select the job and click "submit". You should use this option only if you want to push one or two jobs "ahead of the line"; you should never use it to submit large numbers of jobs.

If for some reason the user does not want a particular job in the CREATED state to be submitted for execution by the loader, he should block it.

To block a job select it from the main panel and click "block". To unblock it click "unblock".

To block/unblock jobs from line mode use the crs_job -block job_name / crs_job -unblock job_name commands.

Only jobs in CREATED state can be blocked. Only jobs in BLOCKED state can be unblocked.

Jobs in the BLOCKED state are ignored by the loader and will not be submitted to condor until they are unblocked by the user.

Job Flow.

  1. Once the job is created it is registered as CREATED in the system. This means that CRS knows about this particular job and has its information in its databases. However, the job is not (yet) known to condor and has not (yet) been submitted to condor.
  2. Once the job is submitted to condor, either by the user using the "submit" button on crs_panel or by the loader, it will change status to SUBMITTED. This means that the job is known to condor, but is not (yet) running and sits in condor in the idle state.
  3. Once condor starts the job it changes status to STARTED. A job in this stage copies input files from NFS disks to the local execution directory (if the job requires NFS input) and submits stage requests to ORBS. After the requests are accepted by HPSS the job waits until they either fail, time out or complete.
  4. Once the HPSS requests are completed, the job starts to import the data files from the HPSS cache. First of all it changes status to MAIN-INIT. In this state it does some internal maintenance work (it is very fast, you rarely see jobs in this state).
  5. Then it changes status to MAIN-IMPORT-WAITING. In this state it waits for an open pftp slot to import the data.
  6. When there is an open slot it changes status to MAIN-IMPORT and imports the data.
  7. Once all input data is local, the job checks how many other CRS jobs are running data reconstruction on this particular node. If there are 2 other jobs running reconstruction, the job enters the MAIN-SLEEP state and waits until at least one of the other jobs completes (a minimal sketch of this check is shown after this list).
  8. If there are fewer than 2 reconstruction jobs running, the job starts the main module and changes status to MAIN-EXEC.
  9. When the data reconstruction is done, the job checks if the executable exit code was correct and if all mandatory output files are present. If everything is OK it starts exporting data; if not it goes to the ERROR state.
  10. First of all the job exports the UNIX data files; while it does so it is in the MAIN-EXPORT-UNIX state.
  11. When it is done with the UNIX files it tries to export the HPSS files. It enters MAIN-EXPORT-WAITING and waits until there are pftp slots available.
  12. When there are, it changes status to MAIN-EXPORT-HPSS and starts to export data to HPSS.
  13. When everything is OK the job ends in status DONE.
  14. If something failed at any stage the job will end in one of the three possible final states:

* SUBMIT_FAILED - it means that the job failed to be submitted to condor.

* ERROR - it means that the job failed, but the problem seems to be of a temporary nature (network breakdown, hpss failure, ...) and it is likely that if the job is reset it will run correctly. Users are encouraged to reset all jobs in the ERROR state and give them a second chance.

* FATAL - the job failed and the failure is likely to be serious and irreversible (for example: a bad exit code from the user's executable). Most likely resetting the job will not help, but the user should always investigate the cause since CRS is not foolproof at determining the cause of failures.
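
The MAIN-SLEEP check from step 7 is essentially a polling loop. Below is a minimal sketch of the idea; count_other_reconstruction_jobs() and job.set_status() are hypothetical helpers, not the actual CRS code:

import time

MAX_OTHER_RECO_JOBS = 2   # step 7: wait while 2 other jobs run reconstruction on the node
POLL_INTERVAL = 60        # seconds between checks (assumed value)

def count_other_reconstruction_jobs():
    """Hypothetical helper: number of other CRS jobs in MAIN-EXEC on this node."""
    raise NotImplementedError

def wait_for_reconstruction_slot(job):
    # MAIN-SLEEP: wait until fewer than 2 other reconstruction jobs are running
    while count_other_reconstruction_jobs() >= MAX_OTHER_RECO_JOBS:
        job.set_status("MAIN-SLEEP")
        time.sleep(POLL_INTERVAL)
    # a slot is free: switch to MAIN-EXEC and start the main module
    job.set_status("MAIN-EXEC")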

Semaphores and HPSS flags.

As the job moves through the various stages of execution it can be temporarily stopped by the user (or the HPSS crew) using a set of flags. There are six flags which can be set by users and one which is controlled by the HPSS crew.

The semaphores controlled by users can be changed from the "Semaphores" subpanel of the main CRS panel.

When a job reaches a particular stage of execution it will check the status of the corresponding flag. If the flag says "go" it will continue execution. If it says "stop" it will wait until it is allowed to continue (a minimal polling sketch is shown after the list below).

  1. The ORBS (Oak Ridge Batch Software) flag. It is checked before the job is about to contact the HPSS interface machines. If the HPSS interface is down, users should close the corresponding semaphore.
  2. Unix get semaphore - it indicates to CRS whether the NFS disks are OK or not. It is checked before importing data from UNIX disks.
  3. Pftp get semaphore - it is checked before importing data from the HPSS cache by pftp.
  4. Job execution semaphore - it is checked before starting execution of the user's executable. It should be set to "stop" if there are AFS problems.
  5. Unix export semaphore - it tells CRS if the NFS disks for exporting data are available.
  6. PFTP export semaphore - it tells CRS if it is OK to export data to the HPSS DST cache.
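
The wait behaviour described above amounts to a simple polling loop. Here is a minimal sketch, assuming a hypothetical helper read_semaphore(name) that returns "go" or "stop"; it is not the actual CRS implementation:

import time

POLL_INTERVAL = 60   # seconds between semaphore checks (assumed value)

def read_semaphore(name):
    """Hypothetical helper: return "go" or "stop" for the named semaphore."""
    raise NotImplementedError

def wait_for_semaphore(name):
    # block until the semaphore allows the job to proceed to the next stage
    while read_semaphore(name) == "stop":
        time.sleep(POLL_INTERVAL)

# example: check the pftp get semaphore before importing data from the HPSS cache
wait_for_semaphore("pftp_get")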

In addition to the user semaphores there is an HPSS status flag which is set by the HPSS crew.

How to Reset a Job.

Any job at any stage of execution can be reset and sent back to CREATED state.

To reset jobs from the panel: select the jobs you want to reset, then click "reset".

There are several ways you can reset jobs using line mode commands.

  1. If you want to reset a couple of jobs: crs_job -reset job_name_1 job_name_2 ....
  2. If you want to reset all jobs in a particular status: crs_job -reset_status status (for example: crs_job -reset_status ERROR - will reset all jobs in ERROR state).
  3. If you want to reset a particular list of jobs, you can create a list of their names in a text file, one name per line (# at the beginning of a line denotes a comment), and then run: crs_job -reset_from_file file_name

The last command is useful in conjunction with crs_job -stat. You can redirect the crs_job -stat output to a temporary file, then edit it leaving only the jobs you want to reset, save the file and then run crs_job -reset_from_file.
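
The same selection can be scripted. Here is a minimal sketch (not part of CRS) that builds such a file; it assumes the job name is the first whitespace-separated field of each crs_job -stat line, and the status and name filter are hypothetical examples:

#!/usr/bin/env python
# Build a reset list from crs_job -stat output (the output format is assumed).
import subprocess

WANTED_STATUS = "ERROR"    # pick jobs reported in this state...
NAME_CONTAINS = "run03"    # ...whose names contain this (hypothetical) string

stat = subprocess.check_output(["crs_job", "-stat"]).decode()

with open("to_reset.txt", "w") as out:
    for line in stat.splitlines():
        fields = line.split()
        if fields and WANTED_STATUS in line and NAME_CONTAINS in fields[0]:
            out.write(fields[0] + "\n")   # one job name per line, as -reset_from_file expects

# afterwards run:  crs_job -reset_from_file to_reset.txt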

How to Check Job Status.

Once the job is created it will change its status as it goes through various stages of execution. In order to learn about the status of a particular job do:


crs_job -stat | grep job_name

or, using the crs_panel, click "refresh" and find the name of the job to read its status. If the crs_panel shows thousands of jobs, finding a particular name might be hard. You can use the "sort" buttons to sort the jobs by name, timestamp or status (and invert the sorting order). If that does not help, you can type the job name (or part thereof) into the entry field in the lower left corner of the panel and then click "select jobs". The jobs whose names contain the string you have typed will be highlighted.

To check the progress of a particular job you can select it from the panel and then look at its crs logfile using the "crs logfile" button. The log is human readable and allows you to see what happened to the job recently.

You can also select a running job and click "spy execdir" to see the contents of its execution directory. You can then select the job files and peek at their content. The "get top" and "get ps" buttons allow you to execute the top and ps commands on the machine on which the selected job is running.

How to Get Fast Status of the CRS Production.

From line mode: run the farmstat command.

From main panel: click "production status".

You will get information about how many jobs are in each state.

CRS Line Mode Commands.

CRS line mode commands are invoked using the script

crs_job [command] [options]

A listing of available commands can be obtained by

crs_job -help

The available options are:

-stat           : get information about known jobs
-stat_show_machines : show status of each job, but instead of status time show machine on which this job runs
-stat_show_problem  : show status of each job, and, if the job has a problem, short description of the problem.
-submit jobname : submit job.
-submit_all     : submit all jobs in CREATED status
-create_and_submit  job_description_file : create and start a job. (The job_description_file can include wildcards). This command is obsolete and should not be used.
-block  job_name      : block a CREATED job, so that submit daemon ignores it
-unblock  job_name    : unblock a previously BLOCKED job
-block_created        : block all jobs in CREATED state
-unblock_blocked      : unblock all jobs in BLOCKED state
-crs_logfile job_name : print content of crs log file of job job_name
-spy_execdir job_name : show content of job work directory (job must be in MAIN* state)
-cat_stdio   job_name : show content of stdio file
-tail_stdio  job_name : tail content of stdio file
-cat_stderr  job_name : show content of stderr file
-tail_stderr job_name : tail content of stderr file
-kill jobname   : kill job
-kill_status  status  : kill all jobs with given status
-archive jobname: archive job jobname
-archive_done   : archive all jobs in DONE status
-reset jobname  : reset job, bring it to CREATED state
-reset_status status  : reset jobs in given status
-reset_from_file  fn  : reset jobs from file fn
-kill_from_file  fn   : kill  jobs from file fn
-create job_description_file [-qn] [-pn] [-drop]: create a job from the job description
         -qn=submit job to queue n; -pn=give job priority n; -drop=allow drop queue
         (the job_description_file can include wildcards)
-save_for_debug job_name : save a copy of the job to temporary storage
                        so that it can be debugged later
-show_machines : show the status of the farm machines, as seen by condor
-show_queues   : show the status of the farm queues, as seen by condor
-show_crs_jobs_per_machine : show number of CRS jobs per machine
-show_crs_jobs_per_queue   : show number of CRS jobs per queue
-get_pftp_link_limits : print the maximum number of pftp links
-change_number_of_input_links : change the max allowed number of input links. This will adjust the allowed number of output links as well
-recent_errors : show list of errored jobs, status time, problem description; order jobs by status time
-get_jobs_with_missing_jobdir : print jobs which have a missing job directory

CRS Panel - Main Panel.

The CRS panel serves as a GUI for job control. To start it, execute the crs_panel command.

crs_panel -help

will give you a list of options.

The main panel consists of a listbox which shows the list of jobs known to CRS and their statuses. Buttons in the "Job commands" column allow the user to execute commands which relate to the selected jobs. Buttons in the "System commands" column allow the user to inspect the status of the system.

Buttons at the bottom of the panel allow the user to control the flow of the production.

CRS Panel - How to Select Jobs.

Jobs can be selected using mouse (left button click).

To select a range of jobs use left mouse button+shift.

To select individual jobs use left mouse button+ctrl button.

If you would like to select jobs which contain a particular string in the name, go to the entry field in the lower left corner of the panel. Type (or paste) the string into that field. Click "select jobs". Jobs with names that contain the selected string will be highlighted.

CRS Panel - Job Commands.

The commands in this column relate to individual jobs. Going from the top down, you will see buttons which sort jobs according to their name, time, status (ordered according to the logical job flow), execution host and queue. The "inverse sort" button inverts the current sorting order.

The "crs logfile" button displays the crs logfile of the selected job. "list job files" lists the contents of the job directory on the submit machine; once this option is selected the user can peek at the contents of individual job files.

"submit job" submits a CREATED job to condor and should not normally be used. "reset job" resets the selected jobs. "Archive job" deletes a completed job from CRS but stores some of its log files in an archive directory. (The archive directory should be purged from time to time, or it will fill up the disk.)

"kill job" kills a job and deletes it from CRS.

"show job details" shows some information about the job status.

"spy execdir" allows user to look at the content of execution directory on the machine on which the job runs.

"get top" and "get ps" buttons execute top and ps commands on the host on which the selected job runs.

CRS Panel - System Commands.

* "show machines" displays a panel with information about CRS machines.

* "show archive" - shows list of jobs which were done using CRS system in the past, and their result.

* "spy hpss server" - allows user to peek into the machine which serves as interface to HPSS. It opens a subpanel which lists hpss requests known to CRS and their statuse (INPUT/WORKING/OUTPUT). Requests for which the parent job has been deleted are listed as ORPHANED. User can delete the orphaned requests by clicking "delete orphaned". The panel also allows user to send "ping" signals to HPSS daemons to check if they are alive.

* "Show I/O files" - shows the files belonging to the selected jobs and their status.

* "Show HPSS requests" - show status of HPSS requests for selected jobs.

* "Show PFTP links" - shows status of PFTP links for selected jobs.

* "Adjust PFTP links" - normally each experiment is assigned a quota of PFTP connections it is allowed to use at any given time. This is usually between 10 and 20. This quota can be shared between the "incoming" and "outgoing" connections in a way that is convenient for any experiment. This button opens a panel which allows user to change the number of "in" and "out" links. Changing the number of "in" and "out" links can be done by line mode command -change_number_of_input_links n as well.

* condor_q - execute the condor_q command

* condor analyze - execute the "condor analyze" command for the selected jobs.

CRS Panel - Production Commands.

These buttons are at the bottom of the main panel.

The first two buttons are meant to simplify navigating among the jobs:

* "select jobs" - this button is used to select jobs which contain a particular string in the name. Let us assume that you want to select all jobs which have the string "abc" in the name. Type "abc" in the input field next to the "select jobs" button. Then click "select jobs". All jobs which have "abc" as part of their name will become highlighted.

* "print selected" - opens a text window with the names of the highlighted jobs. The names can then be cut and pasted into any text file.

The "Loader options" buttons starts loader panel, which steers the behaviour of the job loader.

The buttons on the loader panel:

* Loader status/loader enable, loader disable - give status of the loader, start and stop it.

* Load by name/ load by creation time - chooses if loader should load jobs according to their names or creation times.

* Buttons below decide from which queues should the loader pic jobs. To stop loading from a particular queue depress its button.

The "project history" shows history of the code development. It helps the user to figure out which version of CRS he is now using.

The "production status" shows the snapshot of the production. It does the same thing as the "farmstat" command in the line mode.

"production history" gives a list of jobs which were executed by CRS and their statuses and times of completion.

"for experts button" opens a panel which gives the user some commands to check status of CRS daemons. From that panel you can start/stop loader daemon (this can be done from the loader panel as well), check the status and start/stop the logbook manager daemon and ping the hpss daemons.

"refresh" button refreshes the status of jobs shown in CRS panel.

"Semaphores" button opens the semaphores panel. From this panel users can open/close the production semaphores.

"Help" button gives listing of available help.

"Exit" button closes CRS panel.

CRS Administrator Manual

CRS Machines.

CRS Submit Machines.

There are currently 3 rcrs machines from which users can submit their jobs: rcrsuser2, rcrsuser3 and rcrsuser4. Jobs can be seen from any of the submit machines; however, they can be killed only from the machine from which they were submitted. For that reason experiments should stick to their assigned machines and not run production elsewhere.

In the future it might become possible to set up CRS in such a way that all experiments can use all submit machines without restrictions; however, at this point it is not possible due to condor/kerberos related issues.

CRS Execution machines.

The jobs are executed on the rcrs nodes (and the rcas nodes for STAR). Users are not supposed to log on to those machines.

Condor

CRS uses Condor as its job execution backend. Condor has some configuration options turned on to allow the CRS software to interface with it. The CRS software depends on the CPU_Speed and CPU_Type flags advertised by the machines. These flags allow the CRS software to divide jobs into groups that seek out all or a subset of the available machines. Condor also uses the CRS_Turn_Off flag in the start expression for CRS machines to determine if it should start any CRS jobs on a node. The CRS software can toggle this flag for one or more nodes in the pool. Note that this is different from Turn_Off, which applies to all jobs. In the case of STAR and PHENIX there is no difference, but BRAHMS and PHOBOS run both CRS and analysis jobs on their nodes.

The three submit machines have daemons running for all four experiments. The CRS software interfaces with Condor by using the CONDOR_CONFIG variable. The daemons can be controlled using the /home/condor/condorctl tool. For more information on Condor please see the administrator documentation for Condor.

HPSS Interface (ORBS machine).

The programs which run the interface between CRS and the HPSS tape storage are located on the rcfmon02 machine. Each experiment has an account on this node (bramreco, phobreco, phnxreco and starreco, as well as rcfreco).

Users are not supposed to logon to this machine.

The ORBS code is stored in the directory ~/Batch/bin. To learn how to start and reset it, see "How to reset the ORBS software?" below.

The HPSS requests are submitted, and their status is obtained, through four subdirectories of the directory ~/Batch: input, working, output and bad.

When a user wants to submit a request, he has to create a request file and put it in the input directory. The ORBS system will pick up the file and move it to the working directory (and will start processing the request). When the request is done (either the data is staged or the stage failed) it will move the file to the output directory. If the request file has a bad format it will be moved to the bad directory.

In normal circumstances the HPSS interface daemons listen to requests coming from jobs and, upon receiving them, they either create a request file and store it in the input directory (if the job wants to submit a request), check the current status of a given request (if the job wants to inquire about a particular request), or return the request status to the job and delete the request file (if the request has completed and is in the output directory).
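
The directory protocol itself can be illustrated with a minimal sketch (this is not the real ORBS interface code; the request file name and contents are hypothetical):

# Minimal sketch of the ORBS directory protocol (the request format is assumed).
import os
import time

BATCH = os.path.expanduser("~/Batch")

def submit_request(request_name, hpss_path):
    """Drop a request file into the input directory for ORBS to pick up."""
    with open(os.path.join(BATCH, "input", request_name), "w") as f:
        f.write(hpss_path + "\n")

def poll_request(request_name, interval=60):
    """Wait until ORBS moves the request to output (done) or bad (malformed)."""
    while True:
        if os.path.exists(os.path.join(BATCH, "output", request_name)):
            return "output"
        if os.path.exists(os.path.join(BATCH, "bad", request_name)):
            return "bad"
        time.sleep(interval)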

Sometimes it can happen that a CRS job has submitted an HPSS request and then was killed. In such a case the HPSS request is present in ORBS but its parent job is missing. Such an HPSS request is called "orphaned". To clean orphaned HPSS requests, see "How do I clean orphaned HPSS requests?" below.

Code Organisation

The CRS code is written in Python.

The code is stored on the submit machines in the directory ~/newcrs/bin/.

In Python, when the code consists of several files stored in several modules, the component modules are loaded using the "import" statement. This works fine if the code is stored on one node.

In CRS, however, the code is executed on the rcrs and rcas nodes. There is no guarantee that an "import" command would work on those nodes, since, a priori, there is no guarantee that the source code of a job is visible from the execution node. It is therefore necessary that a job which is sent for execution is "self-contained" - that is, it already contains all non-standard subroutines and does not rely on "import" statements to load them.

For that reason I have decided to use a home-grown pre-processor of the code. The idea is stolen from an ancient program called PATCHY (or YPATCHY in later versions) which was used together with FORTRAN 77 to build and maintain CERN libraries in prehistoric times.

Here is how it works:

The code consists of several modules, stored in files with the extension *.cra or *.car. (The extensions cra and car are there for historical reasons; those were the extensions used by the original PATCHY program.) Those files are the original files, and they contain Python code as well as preprocessor instructions, which look like this:

#!/usr/bin/env python

# import python modules
+INCLUDE ~/newcrs/bin/IMPORT.car

def python_function():
    # some python code

.......

The +INCLUDE statement means that the preprocessor should replace this particular line by the contents of the file ~/newcrs/bin/IMPORT.car. There are several INCLUDE statements spread throughout the code.

During code development the programmer is supposed to modify the text of the *.cra and *.car modules. Once this is done, one has to run the preprocessor.

Let us assume that the user's code is located in the file example.car, which contains the preprocessor instructions. To build the executable code the user has to run the preprocessor ypatchy.py, which is located in ~/newcrs/bin:

./ypatchy.py example.car

The preprocessor will now replace all INCLUDE statements by the relevant code and write the result to the file example.py, which can later be executed as regular Python code.

It is important that all code development is done on the car and cra files, as the files with the py extension are overwritten each time the preprocessor is invoked.
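
The core of such a preprocessor is very simple. The following is a minimal sketch of the +INCLUDE expansion, not the actual ypatchy.py code:

#!/usr/bin/env python
# Minimal sketch of a PATCHY-style +INCLUDE expander (not the real ypatchy.py).
import os
import sys

def expand(path, out):
    """Copy a .cra/.car file to out, replacing +INCLUDE lines by the included file."""
    with open(path) as src:
        for line in src:
            if line.startswith("+INCLUDE"):
                included = os.path.expanduser(line.split()[1])
                expand(included, out)        # included files may themselves include others
            else:
                out.write(line)

if __name__ == "__main__":
    source = sys.argv[1]                     # e.g. example.car
    target = os.path.splitext(source)[0] + ".py"
    with open(target, "w") as out:
        expand(source, out)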

The setup_crs command.

In order to make sure that the user always uses the most recent version of the code I have written a setup_crs script. This is a wrapper around the ypatchy.py preprocessor command, which re-creates the most recent version of the code from the current cra and car files and also recreates some important symbolic links. Users are instructed to execute the setup_crs command each time before they start to work.

Code Modules

The code is organized into several modules which are combined by the preprocessor before the execution. By convention the modules have extensions cra and car. Here are short descriptions of the most important ones:

  1. MAIN.cra Contains the skeleton code of the user's main job. Upon this module the CRS job is built.
  2. MYSQL.car Contains routines which interface the Python code to MySQL databases.
  3. SSH.car Contains packages for communication between computers - these include routines for executing remote ssh commands as well as wrappers around routines which communicate with remote processes via TCP/UDP sockets.
  4. PROCMAN.car Routines for managing parallel processes, used when it is necessary to fork a child process from the main thread.
  5. LOCK.car Package for creating and managing locks.
  6. LOGBOOK.car Package for managing the job's logbook.
  7. READ_CRS_CONF.car Package which reads the job description file and sets all environment variables.
  8. TCP.car Some TCP communication routines.
  9. CRS_JOBS.car Main engine which contains routines for communicating between the user and the jobs. Those routines are called either from the main panel (GUI) or from the line mode commands.
  10. crs_panel.cra The skeleton of the code of the main panel GUI.
  11. crs_job.cra The skeleton of the code used to invoke line mode commands.
  12. MAIL.car The code for sending mail messages to users and/or operators.

Code Management, Distribution and Backup.

The code is stored in the ~/newcrs/bin directory on each account on each submit machine. It is necessary to maintain the same code version for each experiment. To do this I have written a primitive, home-grown, CVS-like management system.

Each time the developer creates a new version of the code he should make it official by going to ~/newcrs/bin and invoking the command:

./update_code

The script will then ask you for your name; respond by typing your initials:

Type your name:TW

Then it will ask you for a comment; type a short description of the new version of the code:

comment? this is test version of the code

Once you hit enter, the new version of the code will be given a number, and the entire code in ~/newcrs/bin will be packed into a tar file and stored in a special directory, where it waits to be exported to the code repository.

The update_code command can take some parameters:

update_code -h : print help

This command publishes current version of code as the official one.
usage:  update_code  [arguments]
arguments:
-n This is new version, not just update. Give it a number rounded to then next integer
-v print code history
-h,-help print this message

Do update_code -v to see the history of the code. The version numbers are created by adding 0.01 to the last one. If you release a major new version of the code you may want to indicate it by using the -n option; this will bump the version numbering to the next integer. (If the previous version is 7.34 and you create a new version with the -n option, then the new version will be 8.00 and not 7.35.)
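
The numbering rule amounts to something like the following sketch (just an illustration, not the actual update_code code):

import math

def next_version(previous, major=False):
    """Return the next version number: +0.01 normally, the next integer with -n."""
    if major:
        return float(math.floor(previous) + 1)   # e.g. 7.34 -> 8.00
    return round(previous + 0.01, 2)             # e.g. 7.34 -> 7.35

print(next_version(7.34))              # 7.35
print(next_version(7.34, major=True))  # 8.0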

OK, so much for creating a new version of the code. Now, once you have created a new tar file with the new code, you may want to export it and distribute it among the experiments. To do this use the code_manager script. When you execute commands of this script you will be asked for a password. To learn what the password is, ask me.

Go to ~/newcrs/bin and type

code_manager.py -export

This will pick up the tar file with the most recent version of the code, move it to the backup directory and delete the tar file stored in the export directory. Now the code is in the official repository.

To see which versions of the code are in the repository, execute

code_manager.py -list

and you will get a list of archived tar files with the various code versions.

In order to import a particular version of the code and install it on the account you are currently on, do:

code_manager.py -import tar_file_name

This command is rarely used, since you are rarely interested in installing an old version of the code. More useful is the command

code_manager.py -distribute tar_file_name

This command takes tar_file_name and installs it in ~/newcrs/bin for all experiment accounts on all submit machines. Once this command is executed, users should run setup_crs to make sure that they get the most recent version of the code.

To summarize: if you have created a new version of the code and you want it to be installed on all user machines, do:

update_code (creates a new version of the code); type your name and a description of the new version.

code_manager.py -export (sends the code to the repository)

code_manager.py -list (to see the list of code versions currently in storage). Then pick the tar file with the most recent code version and do:

code_manager.py -distribute tar_file_name

to distribute it to all experiments.

Common problem: from time to time someone deletes the contents of the ~/newcrs/bin directory. To see how to fix the problem, see the "Common Problems" section below.

CRS Daemons

The CRS software consists of a number of daemons which perform various tasks. Here is their list.

  1. The Logbook daemon. As the jobs execute they need to write job logs. Since they execute on various machines, we would like to give the users the possibility to see their logs as they execute. This could be accomplished by writing to a file visible via NFS - which is not a good idea. So I took a different approach: when a job needs to write an entry to its job log, it writes it to a local logbook file (in the execution directory on the execution machine) and at the same time sends the log entry via TCP to a daemon which is listening on one of the ports on the submit machine (a minimal sketch of this pattern is shown after this list). The logbook daemon receives the log entry and writes it to the appropriate log file of the job. The CRS log of a job can be seen from the main panel using the "CRS logfile" button. The logbook daemon is started by a cron job "/home/starreco/newcrs/bin/logbook_server_002.py /home/starreco/newcrs/bin/crs.conf" (replace starreco by the relevant account for other experiments). The daemon is started by the cron job every minute; it then checks if another daemon is running. If yes, it quits. If no other logbook daemon is active, it takes over and sets the locks to notify other daemons that it is running. Common problem: sometimes the logbook daemon hangs. If this happens, job logbooks do not get updated, and occasionally the jobs stop moving. The log information is not lost, since it is kept in the copy of the logbook produced on the execution machine; however, the logbook server needs to be restarted. To learn how to restart it, see "How to restart logbook daemon?" below.
  2. The loader daemon. The CRS system can handle as many jobs in the CREATED state as you like. However, the ORBS system has a limited capability - if you submit too many jobs to it at one time, many of them fail. For that reason we do not want to overload the system by submitting too many jobs at once. The loading of CREATED jobs into condor is thus done by a daemon, called the loader, which wakes up once a minute on the submit machine and checks the number of jobs in the SUBMITTED+STARTED states. If the number drops below N (where N is a constant, which varies from experiment to experiment) then it submits enough jobs to bring the total number of jobs in those states back to N. Common problem: sometimes the lock which prevents two loader daemons from running at the same time is corrupted (this can happen if the submit machine crashed previously). To fix such a problem, see "How to clean a daemon lock?" below.
  3. The ORBS daemons. They run on the ORBS machine, under the accounts bramreco, phobreco, .... To learn how to start and reset them, see "How to reset the ORBS software?" below.
  4. The ORBS interface daemons. These daemons listen for connections coming from jobs, and then execute ORBS commands or return the status of HPSS requests. To learn more about how the ORBS requests are submitted, see the "HPSS Interface (ORBS machine)" section above. The ORBS interface daemons can hang when they receive too many connections at the same time. To avoid this, each experiment has two such daemons, which listen on two different ports. When a job wants to talk to ORBS it connects to the first of those daemons; if there is no response it tries the second one.
  5. The ORBS interface monitor. The ORBS interface daemons can sometimes hang. The monitoring daemon runs on the HPSS interface machine under the root account. Once a minute it sends a ping signal to each of the HPSS interface daemons (two daemons per account, 10 daemons in total). If a daemon does not reply, the monitor kills and restarts it.
  6. Spy servers. Each of the execution machines runs a spy server daemon. This is a daemon which listens for commands coming from the operator and executes them on the execution node. The spy server enables users to look into the contents of a job's execution directory, peek at the contents of files in the execution directory, and execute ps and top commands on the execution machines.
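
To illustrate the logbook pattern described in item 1, here is a minimal sketch of a log entry being written locally and also shipped to a listening daemon over TCP; the host name, port and message format are assumptions, not the real CRS protocol:

import socket

SUBMIT_HOST = "rcrsuser2.example"   # hypothetical submit machine name
LOGBOOK_PORT = 12345                # assumed port; the real daemon listens elsewhere

def write_log_entry(job_name, message):
    # keep a local copy in the execution directory
    with open(job_name + ".crslog", "a") as f:
        f.write(message + "\n")
    # also ship the entry to the logbook daemon on the submit machine
    try:
        s = socket.create_connection((SUBMIT_HOST, LOGBOOK_PORT), timeout=10)
        s.sendall(("%s %s\n" % (job_name, message)).encode())
        s.close()
    except socket.error:
        pass   # the local copy is not lost if the daemon is unreachable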

Common Problems.

Someone somehow deleted the contents of the ~/newcrs/bin directory.

The directory is empty, all code management scripts are gone, so I cannot even install new ones. What should I do?

Go to any other experiment account on any other submit machine which does have the ~/newcrs/bin content intact. Enter ~/newcrs/bin and execute

code_manager.py -list

then pick the most recent code version and do:

code_manager.py -distribute tar_file_name

It will reinstall the code everywhere. Then go back to the experiment which had its code deleted and run setup_crs to fix all links and recreate the *.py files using the preprocessor.

How to reset the ORBS software?

Log on to the ORBS machine as bramreco, phobreco etc. Execute the command:

ps -auxw | grep batch | grep phobreco (or bramreco, starreco or phnxreco)

You should see two processes corresponding to the scripts batch_monitor and hpss_batch. (If they are missing then ORBS is dead and has to be restarted anyway.) If the processes are present, kill them.

Then go to the directory ~/Batch/bin. If there are core.* files from core dumps, delete them all. Then execute the script

./StartIt

The ORBS should start.

I cannot create jobs.

A user complains: "I created a job description file, I execute the job create command, but jobs do not get created. This used to work in the past!"

Many explanations are possible, but it is worthwhile to check if the user's job description file is correct. For example, the user may specify the wrong number of output streams:

outputnumstreams=3
outputstreamtype[0]=HPSS
outputdir[0]=/home/rcfreco/
outputfile[0]=junk_file_delete_it_A.dat
outputmandatory[0]=yes

In this example the user declared that there should be 3 output files, but defined only 1. The job will fail to be created.

Jobs are not moving

There can be many reasons for this. Here are the two most common ones.

Logbook daemon is stuck

Sometimes, rarely, the logbook daemon hangs. If jobs are not moving and you can see that the job logfiles have not been updated for some time, then this is probably the case. You need to restart the logbook daemon. A description of how to do this is given in the section below.

Condor scheduler is stuck

The symptoms are: Many jobs are in SUBMITTED state, and they do not go to STARTED state.

First select the stuck jobs and click the "condor analyze" button. Condor will analyze those jobs. If the response is that there are machines available to run those jobs, but they do not run anyway, then this is a symptom of one of the condor daemons being stuck. Tell the condor experts about it; fixing this problem is beyond my area of expertise.

Jobs are inconsistent.

This happens most commonly to jobs in the STARTED state. The symptoms are: when you select the job and click "condor analyze" the system responds "job is unknown to condor".

Please read the section "What are the inconsistent jobs?"

How to restart logbook daemon?

Open crs_panel and click "For experts...". Click "Logman ping" to ping the logbook daemon and check if it is alive. To reset it click "logman kill". The logbook manager will be killed. It will restart automatically in one minute.

How to clean a daemon lock?

The daemon lock is a file which prevents two loader daemons from running at the same time. Sometimes the lock can be corrupted, for example if the entire machine crashed previously. In such a case, when a daemon wants to run, it tries to read the lock information and then fails. The logbook and loader daemons use locks.

To clean the lock, go to the ~/newcrs/lock directory. The lock name should contain the daemon name. (For example, the logbook daemon lock is named logbook_server.lock.) Delete the lock. The next time the daemon starts it should be able to run.

How do I clean orphaned HPSS requests?

Sometimes jobs are killed or reset by the user while the corresponding ORBS requests are being processed. When this happens the system is left with an ORBS request which has no corresponding CRS job. This is not a very big concern; however, over time the ORBS machine can accumulate a large number of orphaned requests. They are represented by files located in directories on the ORBS interface machine. It is good practice to remove the orphaned HPSS requests from time to time.

From the main panel click "Spy HPSS server". A subpanel with a list of known HPSS requests will appear. Some of them will be marked as "ORPHANED". Now click "delete orphaned". The system will delete the orphaned requests. Once it is done, click "refresh".

Job files overfilled the disk on the submit machine. What should I do?

Job files are stored in the ~/jobs/created directory. Enter this directory; each job has its own subdirectory there. Using the Linux du command, identify the culprits which overfill the disk. Delete the largest files.

What are "inconsistent" jobs?

Sometimes jobs can be killed by external factors (machine crash etc.). In such a case CRS still lists the job as "running" even though it is dead. To find such "inconsistent" jobs do: crs_job -get_inconsistent.

If you find inconsistent jobs you should reset them. You do this either by using the simple crs_job -reset command or by writing the crs_job -get_inconsistent output into a temporary file and then using the crs_job -reset_from_file command:

crs_job -get_inconsistent > temp.temp
crs_job -reset_from_file temp.temp

What are "orphaned" jobs?

Sometimes the user kills a running job, the job gets deleted from CRS, but somehow condor fails to kill it. (This is a condor bug, or a feature, depending on how you look at it.) Such an orphaned job will continue to run as a zombie.

You can detect orphaned jobs by

crs_job -get_orphaned

and you can kill them by

crs_job -delete_orphaned

Tomasz Wlodek


-- TomWlodek - 22 May 2006
