
AthenaMPFacilityConfiguration

Introduction

The goal is to set up local schedulers at sites to support AthenaMP jobs in production. See also AthenaMPFacilityTests.

Condor schedulers

At the RACF:

We currently set up some 24-core nodes to run 8 regular single-core jobs and two 8-core jobs. This is done through a very simple static configuration:

CPU_Type = ifThenElse(Cpus == 1, "global-quota", "mp8-quota")

MP_JOBS = 2
CORES_PER_MPJOB = 8

REGULAR_CORES = (($(DETECTED_CORES)) - ($(CORES_PER_MPJOB)*$(MP_JOBS)))

SLOT_TYPE_1 = cpus=$(CORES_PER_MPJOB), ram=1/3, swap=1/3, disk=1/3
SLOT_TYPE_2 = cpus=auto, ram=auto, swap=auto, disk=auto

NUM_SLOTS_TYPE_1 = $(MP_JOBS)
NUM_SLOTS_TYPE_2 = $(REGULAR_CORES)

START = ( TARGET.RACF_Group == "mp8" && Cpus == 8 ) ||          ( <start-expression for regular jobs> && Cpus == 1 )
        ...

This config creates two special slots per node with a different CPU_Type flag, and it only matches jobs with a special flag (RACF_Group) set. The jobs that run in this queue have the following configuration set:

Requirements = ( Cpus == 8 ) && ( CPU_Type == "mp8-quota" ) && ...
+RACF_Group = "mp8"

Everything else is set normally.
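
For reference, a minimal submit-file sketch combining these attributes might look like the following (the executable name is a placeholder and not part of the RACF setup; note that the custom attribute takes a single "=" in the submit file):

universe     = vanilla
# placeholder executable; the real job wrapper goes here
executable   = run_athenamp.sh
Requirements = ( Cpus == 8 ) && ( CPU_Type == "mp8-quota" )
+RACF_Group  = "mp8"
queue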

Note: If you are using group quotas, there appears to be a bug that prevents a group containing multicore jobs from being matched; see https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2958

At MWT2

Midwest Tier2 is currently using 50 R410 nodes with HT enabled (24 logical cores each) in the MWT2_MCORE configuration. Each node is configured such that 8 logical cores are used by the MCORE slot (slot1), with the remaining 16 logical cores reserved for SCORE jobs (slot[2-17]).

MWT2 is currently using Condor 7.8.1. The supplied default master configuration file, /etc/condor/condor_config, is modified by the addition of local modules in /etc/condor/config.d. To configure a node for MCORE work, the following module, 70-mp.conf, was added to /etc/condor/config.d:

### Start of Athena MP support ###

### For a job to invoke MP, the submitted jobs requires the following

### Requirements = ( Cpus == 8 )
### +MCORE = True
### +AccountingGroup = "group_atlasmcore.atlas1"

MP_JOBS = 1
CORES_PER_MPJOB = 8

REGULAR_CORES = (($(DETECTED_CORES)) - ($(CORES_PER_MPJOB)*$(MP_JOBS)))

SLOT_TYPE_1 = cpus=$(CORES_PER_MPJOB), ram=1/3, swap=1/3, disk=1/3
SLOT_TYPE_2 = cpus=auto, ram=auto, swap=auto, disk=auto

NUM_SLOTS_TYPE_1 = $(MP_JOBS)
NUM_SLOTS_TYPE_2 = $(REGULAR_CORES)

MP_START = ( (Cpus == 1) || ((Cpus == $(CORES_PER_MPJOB)) && (TARGET.MCORE == True)) )

# Start the job based on its single-core or multi-core requirements

START = ($(START)) && ($(MP_START))

### End of Athena MP support ###

In the above, MP_JOBS defines the number of MCORE slots on this node, whereas CORES_PER_MPJOB defines the number of logical cores per MCORE slot.

This is a "static" configuration. If there are no MCORE jobs running on the node, the 8 logical cores assigned to the MCORE slot will remain idle.
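
One way to confirm the slot layout on a configured node is to query the collector for 8-core slots (an illustrative check, not part of the original setup):

condor_status -constraint 'Cpus == 8' -format "%s " Name -format "%d " Cpus -format "%d\n" Memory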

The Panda queue MWT2_MCORE was created by cloning the existing MWT2-condor production queue into MWT2_MCORE-condor. In SchedConfig, beyond changing all references of MWT2 to MWT2_MCORE, the only parameter that needed to be changed for multi-core was "coreCount":

coreCount = 8

The queue was then added to APF with the request that the following globusrsl string be used:

globusrsl       = (condorsubmit=('Requirements' '(Cpus==8)')('+MCORE' TRUE)('+AccountingGroup' '\\\"group_atlasmcore.usatlas1\\\"'))

This results in the following parameters being passed with submitted pilots:

Requirements = ( Cpus == 8 )
+MCORE = True
+AccountingGroup = "group_atlasmcore.atlas1"
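
To verify that MCORE pilots are arriving with these attributes, one can query the schedd for jobs carrying the MCORE flag (an illustrative check, not part of the original setup):

condor_q -constraint 'MCORE =?= True' -format "%d." ClusterId -format "%d " ProcId -format "%s\n" Owner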

The Panda queue was added to HammerCloud functional testing, which is performed at the rate of 2 functional tests per day.

At AGLT2

Broad Outline

In broad outline, this is how we went about setting up the AGLT2_MCORE queue:

1. The job, via condor.pm or some other mechanism, states its requirements for a job slot (it must be 8-core) and what it provides (it is an MCORE job). The job slot advertises what it provides (the 8-core slot) and what it requires (the job says it is MCORE). So the first step is to decide how that will be specified by our system, and to get it set up (see the schematic sketch after this list).

2. Add and check in the SchedConfig setup for the AGLT2_MCORE queue. Our changes for this from the standard Production queue setup are shown below.

3. Contact Jose Caballero jcaballeroATbnlDOTgov to start autopyfactory submission to the queue. Tell him the jdl we want the pilot to provide (for us, this is queue=mp8, which we then interpret in our condor.pm and elsewhere), and how many such job slots we are providing. Once he has the factory set up, we set the queue to manual, then to "test" so that his pilots come in. Check that they are doing what we want them to do.

4. We get one job per day from HC for testing. Contact gianfrancoDOTsciaccaATlhepDOTunibeDOTch and/or atlas-adc-hammercloud-supportATcernDOTch to set this up.

5. Contact joseDOTenriqueDOTgarciaATcernDOTch to start getting real MCORE jobs sent to us. Before this is done, set the queue "online", as there is insufficient HC testing to do this automatically. I think only one site, maybe in the UK, actually lets HC turn the queue on/off the way our other Prod and Analy queues are handled.

6. Stand back and watch the fun.
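
Schematically, the two sides of the match look like this (illustrative attribute names only; the actual AGLT2 values appear in the Details section below):

# Job side (what the pilot provides and requires):
Requirements = ( Cpus == 8 )
+Slot_Type   = "mp8"

# Slot side (what the 8-core slot provides and requires):
SLOT_TYPE_2  = cpus=8, ...
START        = ... && ( TARGET.Slot_Type == "mp8" )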

Details

We have set up 10 24-core machines to each have a single, 8-core job slot. This is done in a manner similar to that at both BNL and MWT2. Under Condor 7.8.2, files in /etc/condor/config.d were modified as follows:

# Slot Types
#

# Usual T2 Slot 1 CPU, 2GB RAM, 2GB swap
ST_1_MEM = 4096
SLOT_TYPE_1 = cpus=1, memory=$(ST_1_MEM), swap=auto, disk=auto

# Athena MP slot 8 CPUs, 16GB RAM, 16GB Swap
CoresPerMP8Slot = 8
ST_2_MEM = 32768
SLOT_TYPE_2 = cpus=$(CoresPerMP8Slot), memory=$(ST_2_MEM), swap=1/3, disk=1/3

# CPU_Type for job matching
#
CPU_Type = ifThenElse(Cpus == 1, "mp1", "mp8")

# Capacities
#
NUM_SLOTS_TYPE_1 = 16
NUM_SLOTS_TYPE_2 = 1

MEMORY = ( ($(NUM_SLOTS_TYPE_1)*$(ST_1_MEM)) + ($(NUM_SLOTS_TYPE_2)*$(ST_2_MEM)) )

# Basic START rule
#
START = $(IS_T2_GROUP)

# Attributes
#
STARTD_ATTRS = $(STARTD_ATTRS), CPU_TYPE
IsMP8Job = ( TARGET.Slot_Type == "mp8" )

START = ....
         ((SlotID == 17) && ($(START)) && ($(IsMP8Job)) )

This configuration is static, so ONLY Athena MP8 jobs can run in slot 17. If there are no such jobs to run, these 8 cores will remain "Idle".
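
To confirm that slot 17 is advertised with 8 CPUs, one can query the collector (an illustrative check, not part of the original configuration):

condor_status -constraint 'SlotID == 17' -format "%s " Name -format "%d " Cpus -format "%s\n" State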

The AGLT2_MCORE queue was cloned in SchedDB from the AGLT2 definition, with the following changes.

[ball@umt3int01:GreatLakesT2]$ diff AGLT2-condor.py AGLT2_MCORE-condor.py
39c39
<     'corecount' : 'None', # Defined in All.py: GreatLakesT2 site
---
>     'corecount' : '8', # Defined in All.py: GreatLakesT2 site
59c59
<     'jdl' : 'AGLT2-condor', # Defined in Config
---
>     'jdl' : 'AGLT2_MCORE-condor', # Defined in Config
96c96
<     'siteid' : 'AGLT2', # Defined in Config
---
>     'siteid' : 'AGLT2_MCORE', # Defined in Config

The JDL is modified from that of the AGLT2 production queue, as follows:

[ball@umt3int01:JDLConfigs]$ diff AGLT2-condor.py AGLT2_MCORE-condor.py
31c31
< globusrsl       = (jobtype=single)(maxWallTime=4000)
---
> globusrsl       = (jobtype=single)(maxWallTime=4000)(queue=mp8)
37c37
< submit_event_user_notes = pool:AGLT2
---
> submit_event_user_notes = pool:AGLT2_MCORE

The condor.pm file at AGLT2 was modified to interpret the mp8 queue parameter and to add the following to the Condor submit file that is generated via Globus (a sketch of the resulting fragment follows the list).

   * Requirements is appended with " Cpus == 8 && CPU_TYPE =?= "mp8" "
   * +Slot_Type = "mp8"
   * +JobMemoryLimit = 33552000
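
Put together, the fragment appended to the generated submit file should look roughly like this (a sketch; "<original requirements>" stands for whatever the pilot already requested):

Requirements    = ( <original requirements> ) && Cpus == 8 && CPU_TYPE =?= "mp8"
+Slot_Type      = "mp8"
+JobMemoryLimit = 33552000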

BNL has added the AGLT2_MCORE queue to the autopyfactories. HammerCloud has been contacted to begin testing the queue.

PBS schedulers

SWT2_CPB / UTA_SWT2

Layout

Both clusters utilize Torque for resource management and Maui for scheduling. The first step was to segregate multi-core jobs into a separate queue within Torque, which eases pilot submission, and it is trivial to prioritize jobs based on queues within Maui. We do not dedicate particular nodes to running the multi-core jobs; rather, we allow the jobs to run on whichever node frees up the necessary eight processors. As the system matures, we may revisit this decision. Within Maui, the multi-core jobs have the highest priority. When a queued multi-core job exists in Torque and we are below the maximum number of running jobs, Maui will stop scheduling jobs on any node until a node with eight free processors becomes available. This is not a terrible burden under normal conditions, since once a suitable job slot has been opened, pilots will consume these slots even if there are no jobs defined. Lastly, we set an upper bound on the number of multi-core jobs that can run at a given time. During the initial setup of the queue the limit is being maintained at two jobs, but it will be increased as we start running production jobs.

Torque

The following is used to create the queue within our Torque system. There are two notable things within the definition: the queue is limited to use by the local usatlas1 user, and there is a default resource requirement of eight processors on one node.

#
#
# Create and define queue multi_core_q
#
create queue multi_core_q
set queue multi_core_q queue_type = Execution
set queue multi_core_q acl_user_enable = True
set queue multi_core_q acl_users = -
set queue multi_core_q acl_users += usatlas1
set queue multi_core_q resources_min.cput = 00:00:01
set queue multi_core_q resources_default.nodes = 1:ppn=8
set queue multi_core_q enabled = True
set queue multi_core_q started = True
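
A quick hand test of the queue (illustrative only, not part of the queue definition) is to submit a trivial job as usatlas1 and confirm that it picks up the nodes=1:ppn=8 default:

echo "sleep 300" | qsub -q multi_core_q -N mcore_test
qstat -f <jobid> | grep -E "queue|Resource_List"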


Maui

While not the whole Maui configuration, the lines below show how Torque queues map to Maui classes. The multi_core_q class has the highest priority, is limited to two running jobs, and uses the default partition (which allows any node with 8 free processors to run the job).

SYSCFG[base] PLIST=DEFAULT,cvmfstest&
CLASSCFG[atlas_analy_q] PRIORITY=10000 MAXJOB=800 PLIST=DEFAULT&
CLASSCFG[multi_core_q] PRIORITY=20000 MAXJOB=2 PLIST=DEFAULT&
CLASSCFG[atlas_prod_q] PRIORITY=9000 MAXJOB=2400 PLIST=DEFAULT&
CLASSCFG[osg_test_q] PRIORITY=1 MAXJOB=1 PLIST=DEFAULT&
CLASSCFG[default] PRIORITY=1 MAXJOB=10 PLIST=DEFAULT&
CLASSCFG[cvmfs_p_q] PRIORITY=10000 PLIST=cvmfstest&
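
Two illustrative checks (not from the original configuration) for verifying the class behavior:

showq -i            # idle jobs in priority order; queued multi_core_q jobs should sort first
checkjob <jobid>    # per-job view of its class, priority, and any reason it is not starting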


Globus JobManager

Fortunately, the default pbs-jobmanager supplied by Globus/OSG can be used without modification to submit jobs to the Torque queue. Originally it was thought that the resources_default assignment in the Torque configuration would provide the necessary linkage so that jobs would request eight processors. Alas, the pbs-jobmanager will always request simply one node, without specifying the number of processors, and this request overrides the default supplied by Torque. It was discovered that additional RSL is necessary for the jobmanager to provide the appropriate request to Torque. We asked Jose, at BNL, to use the following additional RSL when submitting jobs to our sites.

The specification asks that jobs have a wall time limit of 4500 minutes, be submitted to the multi_core_q queue defined within Torque, and request eight processors via xcount:

(maxWallTime=4500)(queue=multi_core_q)(xcount=8)
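
Roughly speaking, the jobmanager should turn this RSL into PBS directives along these lines (a sketch; the exact walltime formatting depends on the jobmanager version):

#PBS -q multi_core_q
#PBS -l walltime=75:00:00
#PBS -l nodes=1:ppn=8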

Status

As of this writing (Oct. 2012), pilots are being submitted properly with the correct resource requirements, and the pilots are being scheduled properly by Maui. We are awaiting HammerCloud tests to verify the full production chain.

References

  • See HTPC presentation and references therein on MinutesFeb29.


-- RobertGardner - 29 Mar 2012
