r6 - 06 Jun 2013 - 11:12:34 - JohnHover

SL6Migration

Under construction

Introduction

US ATLAS is scheduled to begin migration of all Tier 2 sites to SL6 starting June 1, with completion by June 30 (the WLCG schedule is August); see DeploySL6. After some initial testing, a plan has been developed for migrating a standard Tier 2. Initial testing has shown that it is not possible to reliably run a site in a mixed SL5/SL6 configuration. The procedure recommended by ATLAS at SLC6Readiness is to convert an entire site either in a "Big Bang" or by a "Rolling upgrade".

To start, here is the central ATLAS SL6 migration documentation at CERN: https://twiki.cern.ch/twiki/bin/view/LCG/SL6Migration. This discussion will assume a rolling migration, where SL5 and SL6 clusters are run simultaneously for a time. This allows SL6 end-to-end functionality to be tested before dedicating all site WNs to the SL6-based system. An all-at-once migration requires the same adjustments, but they need to be done to the existing elements rather than in parallel.

Rolling deployment one regional site at a time

The easiest and safest way to upgrade is via the rolling upgrade. Each regional site (UChicago, Indiana, Illinois) can be upgraded separately from the others. In this way only part of the site will ever be down for an extended period while the worker nodes, etc. are upgraded to SL6. Also, should a problem develop with the SL6 deployment, part of the site will remain with SL5 capabilities. Since each regional site has its own gatekeeper, Condor head node and Condor pool of worker nodes, upgrading each site individually is an easier, less stressful procedure. To perform a rolling upgrade, a site needs to take the following steps:

Announce Downtime

This downtime registration will be forwarded to WLCG

Upgrade Worker Nodes to SL6

OSG WN-Client 3.1

Create an SL6 Compute Element

  • New CE to migrate the workload to.
  • Ensure empty grid3-locations.txt. ATLAS releases are now handled differently than before, i.e. release tags will not be published via BDII from the CEs.
  • TO DISCUSS: GRAM interfaces across US ATLAS. E.g. queue=xyz in site JDL.
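The "empty grid3-locations.txt" requirement can be checked with a few lines of shell. This is only a minimal sketch: the function name is made up for illustration, and the default path ($OSG_APP/etc/grid3-locations.txt) is an assumption that should be adjusted to wherever the file lives on your CE.

```shell
# Report whether grid3-locations.txt has any content.
# ASSUMPTION: the default path below; pass the real path as $1 if it differs.
check_locations_empty() {
    file="${1:-$OSG_APP/etc/grid3-locations.txt}"
    # -s is true only if the file exists and has a size greater than zero,
    # so an absent file and a zero-length file both pass this check.
    if [ -s "$file" ]; then
        echo "NOT EMPTY: $file"
        return 1
    fi
    echo "OK: no entries in $file"
    return 0
}
```

Note that a file containing only blank lines or comments would still be reported as not empty by this check.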

Remove FLOCKing

  • Any batch system flocking between SL5 and SL6 subclusters should be disabled.
  • This gatekeeper will advertise to the BDII only SL6 validated releases.
  • The $APP (and grid3-locations) would be different from those on the SL5 nodes.
  • Pilots which glide into this node will then be run only on SL6 compute nodes.
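For a Condor batch system, one way to confirm that flocking is off is to look at the FLOCK_TO and FLOCK_FROM configuration values on the head nodes. The sketch below assumes those are the only flocking settings in use at the site, and that an undefined setting is reported by condor_config_val with a message starting "Not defined"; the helper name is hypothetical.

```shell
# Decide whether a FLOCK_TO / FLOCK_FROM value means flocking is off.
# $1: the text printed by `condor_config_val FLOCK_TO` (or FLOCK_FROM).
flocking_disabled() {
    case "$1" in
        ""|"Not defined"*) return 0 ;;   # empty or undefined: no flocking
        *) return 1 ;;                   # any host list: flocking still configured
    esac
}

# Intended use on a head node (not executed here):
#   for knob in FLOCK_TO FLOCK_FROM; do
#       val=$(condor_config_val "$knob" 2>&1)
#       flocking_disabled "$val" || echo "$knob still set: $val"
#   done
```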

Setup a new validation in LJSFi to the SL6 GK

  • The validations for SL6 releases would be sent to this GK and run on SL6 compute nodes.
  • Initially the BDII for the GK is empty and thus no jobs will be submitted to the two Panda queues.
  • But as the validations succeed and the BDII becomes populated, jobs will be submitted to the GK.

Gatekeeper submits jobs to SL6 nodes

  • The gatekeeper must participate in a Condor pool with only SL6 deployed nodes.
  • A Condor head node (collector/negotiator) separate from the SL5 pools is needed for this functionality
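Before pointing the gatekeeper at the new pool, it may be worth checking that every node in it really reports SL6. A rough sketch, assuming the worker nodes advertise the OpSysAndVer attribute and that its value on Scientific Linux 6 is "SL6" (both should be verified against your Condor version); the function name is illustrative only.

```shell
# List machines whose reported OS is not SL6.
# Input on stdin: one "machine opsys" pair per line.
non_sl6_nodes() {
    awk '$2 != "SL6" { print $1 }'
}

# Intended use on the SL6 Condor head node (not executed here):
#   condor_status -format '%s ' Machine -format '%s\n' OpSysAndVer \
#       | sort -u | non_sl6_nodes
```

An empty result would suggest the pool contains only SL6 nodes; any hostname printed needs a closer look.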

Panda Queue Creation and/or Configuration

  • New Panda queues.

This is an opportunity to clean up the Panda queue definitions that relate to pilot submission. The following parameters are no longer used, and can be set to None in AGIS:

  • cmd
  • datadir
  • environ
  • gatekeeper
  • jdladd
  • jdltxt
  • queue

Also, be aware that the grid job wrapper always sources $OSG_GRID/setup.sh (the OSG wn-client setup file), so it is not necessary to list this in envsetup[in] or copysetup[in]. If you need special environment variables set (e.g. a site-wide http proxy), please set these in $OSG_GRID/setup.sh.
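As an illustration of the kind of lines a site might append to $OSG_GRID/setup.sh for a site-wide http proxy, a minimal sketch follows. The proxy host and port are placeholders, not a real service.

```shell
# Site-wide HTTP proxy for grid jobs.
# ASSUMPTION: squid.example.org:3128 is a placeholder; use your site's proxy.
export http_proxy="http://squid.example.org:3128"
# Some tools read the upper-case name, so set both to the same value.
export HTTP_PROXY="$http_proxy"
```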

  • IDEA! TO DISCUSS: Current and desired functionality of envsetup and copysetup.
  • IDEA! TO DISCUSS: http proxy setup at SLAC in pilot?

Create new Panda Queues associated only with an SL6 GK

Clone your existing Panda queues as SL6 queues with distinct names (e.g. BNL_CVMFS_1 -> BNL_PROD). These clones would then associate only with the SL6 gatekeeper.

This procedure is done with AGIS. It involves three steps: create a PANDA resource, create a PANDA queue, and associate a CE (gatekeeper) with the PANDA queue. Lastly, the queues need to be changed to "manual" and "offline", and have APF enabled.

Create a PANDA Resource

Two new resources need to be created for the SL6 queues.

To create a new PANDA resource, select "Define PANDA resource" on the AGIS home page

  • In the "PANDA Site:" box, specify the site name. A popup of possibilities appears once you begin typing.
  • Enter the "Name of PANDA resource" in that box; e.g. ANALY_MWT2_SL6.
  • Select GRID as the "Resource type" from the pull-up list.
  • Click the "Check input data" button.
  • If all is well, a new "Save PANDA Resource" button will appear. Click it and the resource will be created.

Create the PANDA Queue

The two new queues can now be created. AGIS has a nice clone function. Since most of the values we want to use for the new SL6 queues will be the same as for the current SL5 queues, we can just clone the existing queues and then make the appropriate changes.

To create a new PANDA Queue, select "Define PANDA queue" on the AGIS home page

  • In the "Specify PANDA Queue" box, enter the name of the queue to clone (e.g., MWT2-condor or ANALY-MWT2-condor), then click on "Clone"
  • Change the PANDA Resource Name to the appropriate SL6 resource created above (e.g., MWT2_SL6 or ANALY_MWT2_SL6)
  • Change the PANDA Queue Name. Use the same name as the resource for consistency (e.g., MWT2_SL6 or ANALY_MWT2_SL6)
  • Specify the Type of the queue via the pull-down choices (production or analysis)
  • Click the "Save and continue" button at the bottom of the form

Associate a CE to the queue

The new queue has to be associated with a gatekeeper.

To find and modify a PANDA Queue, select "PANDA Queue" at the top of the AGIS home page

  • In the box above "Panda Site", enter "MidwestT2". This will filter the list to only MidwestT2 queues.
  • Other filter options would be "MWT2" in the "Atlas Site", "Panda Queue" or "Panda Resource" fields
  • Select the appropriate PANDA Queue to modify (e.g., MWT2_SL6 or ANALY_MWT2_SL6)
  • Click on the "Find and associate another CE/Queue"
  • In the "Search CE queues" box, type in the name of your gatekeeper (e.g., mwt2-gk) and click the "Search" button
  • Check the box with the "default" entry
  • Click on "Save". The gatekeeper/queue should now show up in the "Associated CE queues" section

Wait for the Queues to be created

The process of Queue creation by AGIS can take up to 20 minutes. If the new queues do not show up after 30 minutes in the Clouds, Production or Analysis pages, you will need to email Alden as the update system might be hung.

Modify the Panda "Status" settings

There are two settings on a Panda Queue that are not controlled by AGIS and need to be changed: "Status" and "Status Control".

Status Control should be "manual"

curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=MWT2_SL6'
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=ANALY_MWT2_SL6'

Status should be "offline" until we are ready to test

curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=MWT2_SL6'
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_MWT2_SL6'
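The four commands above differ only in the action and the queue name, so they can be wrapped in a small loop. The sketch below reuses the same URL and proxy certificate path; the function names and the DRY_RUN switch are not part of any Panda tooling and are introduced here purely for illustration.

```shell
# Build the Panda controller URL for one action and one queue.
# $1: tpmes action (setmanual or setoffline), $2: Panda queue name
panda_url() {
    printf 'https://panda.cern.ch:25943/server/controller/query?tpmes=%s&queue=%s\n' "$1" "$2"
}

# Apply "manual" and "offline" to each queue given as an argument.
# With DRY_RUN=1 (the default here) the curl commands are only printed.
set_queue_status() {
    for queue in "$@"; do
        for action in setmanual setoffline; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "curl -k --cert /tmp/x509up_u$(id -u) '$(panda_url "$action" "$queue")'"
            else
                curl -k --cert "/tmp/x509up_u$(id -u)" "$(panda_url "$action" "$queue")"
            fi
        done
    done
}
```

For example, `set_queue_status MWT2_SL6 ANALY_MWT2_SL6` prints the four commands for review, and `DRY_RUN=0 set_queue_status MWT2_SL6 ANALY_MWT2_SL6` would actually send them.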

ATLAS Release Validation

SL6 is an entirely new platform, and only a subset of ATLAS releases has been adjusted to work on it, so the release tags associated with SL6 Panda resources must be distinct.

Pilot submission

APF needs to be reconfigured so as to submit pilots for the new Panda Queues via the correct gatekeeper. Eventually we can disable pilot submission to the old gatekeeper/queues, although this is not critical (APF will stop submitting once some jobs are pending--if none run, they will just stay pending). Email Jose Caballero and/or John Hover to set up new queues.


-- JohnHover - 29 May 2013
