
HTCondorCE

What is HTCondor-CE ?

  • As support for the Globus middleware is discontinued, OSG is transitioning to a new implementation of the gatekeeper services: HTCondor-CE.
  • HTCondor-CE is a special configuration of HTCondor.
  • Grid users can still use Condor-G to submit grid jobs to HTCondor-CE. The grid job is first placed in the queue of an HTCondor schedd daemon running on the gatekeeper. Another daemon on the gatekeeper, the JobRouter, then transforms the grid job into a routed job.
    • For a site with an HTCondor batch system, the routed job is a vanilla local HTCondor job.
    • For sites using other batch systems, such as PBS, a third daemon, blahp, submits the routed job to the local batch system (see the sketch after this list).
  • More details about HTCondor-CE internals are explained in this post.
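
As a rough illustration of the blahp path, the entry below sketches what a route to a PBS batch system could look like. This is a minimal sketch, not a complete production configuration, and the route name is made up. For an HTCondor batch system, as in the BNL example later on this page, the route instead uses TargetUniverse = 5 (vanilla universe) so the routed job runs directly in the local pool.

# Minimal sketch of a route for a PBS site (hypothetical route name):
# the routed job becomes a grid universe job (TargetUniverse = 9) that
# blahp submits to the local PBS batch system via "batch pbs".
JOB_ROUTER_ENTRIES = \
   [ \
     GridResource = "batch pbs"; \
     TargetUniverse = 9; \
     name = "Local_PBS"; \
   ] \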

Transition Plan

  • USLHC sites are expected to migrate to HTCondor-CE by the end of 2014.
  • HTCondor-CE rpm packages are available in the OSG 3.2 series repository. As of release 3.2.14 the packages have matured considerably, and several sites have deployed them in production.
  • From the packaging point of view, the current OSG release meta rpm installs both GRAM CE and HTCondor-CE; site admins can choose which of the two to enable. After the transition is done, the GRAM CE subpackage will be dropped.
  • We recommend that, for the transition, sites deploy HTCondor-CE on a second gatekeeper and keep both GRAM CE and HTCondor-CE running concurrently.

Installation Instructions

  • The official installation instructions can be found here.
  • A new component in HTCondor-CE is the JobRouter, which we go into a little more deeply in the next section.

JobRouter Configuration

  • The JobRouter transforms grid jobs into local batch jobs. Its configuration files, located in /etc/condor-ce/config.d/, define the routing policies for incoming jobs.
  • The JobRouter gives site admins a single place to do local customization. This is an improvement over GRAM CE, where one has to hack the jobmanager scripts to add local attributes to the job ClassAd.
  • Below I will explain how we add local condor pool attributes to incoming grid jobs using the JobRouter.
    • At BNL, we classify incoming jobs into different "queues" in the local condor pool by inserting special attributes, such as "RACF_Group", "Job_Type" and "Experiment", into the job ClassAd.
    • In the Condor-G submit file, the user specifies which queue the job goes to via a "remote_queue" attribute. Below is a sample submit file:


 
        universe = grid
        grid_resource = condor gridgk07.racf.bnl.gov gridgk07.racf.bnl.gov:9619

        executable = test5.sh
        output = test5short.out.$(Process)
        error = test5short.err.$(Process)
        log = test5short.log.$(Process)

        ShouldTransferFiles = YES
        WhenToTransferOutput = ON_EXIT

        use_x509userproxy = true

        +remote_queue="analy.short"

        queue 1
 

    • This job is submitted to the short analysis queue. To transform this grid job into a local job with the right ClassAd attributes, we have the following entry in JOB_ROUTER_ENTRIES:



JOB_ROUTER_ENTRIES = \
   [ \
     GridResource = "condor localhost localhost"; \
     eval_set_GridResource = strcat("condor ", "$(FULL_HOSTNAME)", "$(FULL_HOSTNAME)"); \
     TargetUniverse = 5; \
     name = "BNL_Condor_Pool_short"; \
     Requirements = target.queue=="analy.short"; \
     eval_set_AccountingGroup = strcat("group_atlas.analysis.short.", Owner); \
     eval_set_RACF_Group = "short"; \
     set_Experiment = "atlas"; \
     set_requirements = ( ( Arch == "INTEL" || Arch == "X86_64" ) && ( CPU_Experiment == "atlas" ) ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer ); \
     set_Job_Type = "cas"; \
     set_JobLeaseDuration = 3600; \
     set_PeriodicHold = (NumJobStarts >= 1 && JobStatus == 1) || NumJobStarts > 1; \
     eval_set_VO = x509UserProxyVOName; \
   ] \


    • See the attachment 80-local.conf for the whole BNL JobRouter configuration file.
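
For illustration, assuming the route above matches, a grid job submitted with +remote_queue="analy.short" by a hypothetical user jdoe with an ATLAS proxy would show up in the local pool with ClassAd attributes along these lines (values are illustrative and can be checked with condor_q -l on the local schedd):

        RACF_Group = "short"
        Experiment = "atlas"
        Job_Type = "cas"
        AccountingGroup = "group_atlas.analysis.short.jdoe"
        VO = "atlas"
        JobLeaseDuration = 3600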

Experience and lessons learned

  • If an incoming job matches the criteria of several routing entries, the JobRouter selects among those routes on a round-robin basis.
    • This may not be what people want. To avoid it, the "Requirements" expressions of the route entries need to be mutually exclusive (see the sketch after this list).
  • The HTCondor manual mentions JOB_ROUTER_DEFAULTS, which can be used to define attributes that apply to all routing entries. This does not work with HTCondor-CE, because the defaults are overwritten by the script /usr/share/condor-ce/condor_ce_router_defaults. For now, one just has to repeat the common attributes in each routing entry.
  • If a job does not match any of the routing entries, it is put on hold after 30 minutes in the condor-ce queue on the gatekeeper and is eventually removed after 24 hours. The HoldReason is sent back to the user's job log.
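
For example, the sketch below keeps two routes from overlapping by giving them disjoint Requirements on the incoming queue attribute. The second route and its attribute values are made up for illustration, and the common attributes are simply repeated in both entries, since JOB_ROUTER_DEFAULTS cannot be used:

# Two routes with mutually exclusive Requirements, so round-robin route
# selection never comes into play.  Common attributes (Experiment,
# JobLeaseDuration) are repeated in each entry.
JOB_ROUTER_ENTRIES = \
   [ \
     name = "BNL_Condor_Pool_short"; \
     TargetUniverse = 5; \
     Requirements = target.queue == "analy.short"; \
     eval_set_RACF_Group = "short"; \
     set_Experiment = "atlas"; \
     set_JobLeaseDuration = 3600; \
   ] \
   [ \
     name = "BNL_Condor_Pool_long"; \
     TargetUniverse = 5; \
     Requirements = target.queue == "analy.long"; \
     eval_set_RACF_Group = "long"; \
     set_Experiment = "atlas"; \
     set_JobLeaseDuration = 3600; \
   ] \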

Deployment at T2 sites

Known issues

  • SAM tests do not work with HTCondor-CE at this point (mid August). The latest update from the SAM team is that the new release coming in late August should fix this. For now, LHC sites need to keep at least one GRAM CE running so that SAM test jobs continue to function.

-- Main.Xin Zhao - 14 Aug 2014

Attachments


80-local.conf (8.7K), attached by XinZhao, 15 Aug 2014 - 16:43
 