r17 - 02 Aug 2007 - 17:51:19 - MarcoMambelli

PilotCheckerP1

Goal

This task involves running one or more site-level environment scripts, which the local system administrator can execute, to check all the environment and service dependencies required for the Panda pilot script to function properly. A web frontend was designed to give site administrators an easy-to-use tool that hides the complexity of the PandaJobScheduler.

Note for Phase 1: the software is still at the prototype stage, so we suggest not installing it yourself but using the running server provided by Marco. See http://tier2-06.uchicago.edu:8800/pandajs/ and contact Marco (marco@hep.uchicago.edu) for support. If you really want your own installation, see the Installation section below.

Additional instructions are available on the official documentation page: http://twiki.mwt2.org/bin/view/DataServices/PandaSubmitHost.

How to validate a Computing Element

In this document, a Computing Element (CE), sometimes also referred to as a Site, is a computing resource (cluster queue) used to execute Panda pilots. To earn the green checkmark in the "Pilot checker" column of the table on SiteCertificationP1, a Site administrator has to perform the following procedure with the PilotChecker (installed at http://tier2-06.uchicago.edu:8800/pandajs/):
  • verify that the information about the CE is correct
  • successfully run test pilots
  • successfully run ATLAS pilots

To submit pilots, a user needs a username and password. If you don't have them, contact Marco (marco@hep.uchicago.edu).

Check your site information

The PilotChecker uses siteinfo.py, a file-based database that is used throughout Panda production and contains the description of all the CEs in use.

To do this:

  • Go to http://tier2-06.uchicago.edu:8800/pandajs/ce/ (or select "Computing Elements" from the menu)
  • Check that your CE is in the list (alphabetically ordered)
  • Click on your CE and check that the information is correct, especially:
    • the name of the CE
    • the gatekeeper and queue
    • the OSG directories (Osg app, Osg data, Osg wntmp, Osg grid)
    • the Python in "Python path" (the default Python if this field is blank), which must be at least version 2.3 (the minimum requirement for the pilot)
    • DQ2URL, which must point to the DQ2 site service used for that CE
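The Python-version and OSG-directory checks in the list above can also be verified directly on a worker node. The following is a minimal sketch, not part of the PilotChecker. It assumes that the standard OSG worker-node environment variables OSG_APP, OSG_DATA, OSG_WN_TMP and OSG_GRID correspond to the "Osg app", "Osg data", "Osg wntmp" and "Osg grid" fields; check_environment is a hypothetical helper name.

```python
import os
import sys

# Minimum Python version required by the pilot (from the list above).
MIN_PYTHON = (2, 3)

# Assumption: standard OSG worker-node variables matching the
# "Osg app", "Osg data", "Osg wntmp" and "Osg grid" fields in siteinfo.py.
OSG_VARS = ("OSG_APP", "OSG_DATA", "OSG_WN_TMP", "OSG_GRID")


def check_environment(env, version_info):
    """Return a list of human-readable problems (empty if all checks pass)."""
    problems = []
    # Compare only (major, minor) against the pilot's minimum requirement.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d is older than %d.%d"
                        % (version_info[0], version_info[1],
                           MIN_PYTHON[0], MIN_PYTHON[1]))
    # Each OSG variable must be set and point to an existing directory.
    for var in OSG_VARS:
        path = env.get(var)
        if not path:
            problems.append("%s is not set" % var)
        elif not os.path.isdir(path):
            problems.append("%s=%s is not a directory" % (var, path))
    return problems


if __name__ == "__main__":
    issues = check_environment(os.environ, sys.version_info)
    for issue in issues:
        print("PROBLEM: " + issue)
    if not issues:
        print("Environment looks OK")
```

Run it with the interpreter listed in "Python path" (for example as a small grid job) so that the version check refers to the Python the pilot will actually use.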

If your CE is missing or has wrong information, please send the corrections to Marco and/or Xin. Either the siteinfo.py on the PilotChecker server or the central siteinfo.py (stored in CVS) needs to be updated. After the update, repeat the check before proceeding.

Check test pilot execution

To test your CE and mark it green you should:
  • Go to the submit page: http://tier2-06.uchicago.edu:8800/pandajs/js/new/ ("Submit a job" or "new pilot submission" in the menus). The first time you use it in a session, you will be asked to enter the username and password. If you don't know them, please contact Marco (see above).
  • Keep the default submit host (pandajs1), check the box next to "Check to select a single CE:", select your CE from "Chosen CE:", select "Test job" from the "Pilot type:" list, enter 3 in "Number of pilots to send:", and click "Submit pilot jobs". This submits 3 test pilots to your CE. Wait a few seconds for the result page.
  • Check that the pilots are actually queued at your gatekeeper:
    • You can see the status of the submit host at http://tier2-06.uchicago.edu:8800/pandajs/sh/1/ ("Submit host" from the menu; pandajs1 is the first one in the list)
    • Check your local queue as well. The information should match (allowing for some delay)
  • Wait for the jobs to complete (until they are done and no longer listed in the submit host status).
  • Check the job information:
    • Check that the Condor-G log shows no errors and that stderr is empty
    • The test pilot verifies that the OSG environment is set correctly. You may check its output to see that all directories are OK
  • If you don't see any errors, you have passed this step; go on to the next one. If there are errors in the execution, fix the gatekeeper and the OSG environment and repeat this step.

Check ATLAS pilot execution

The PilotChecker is a pilot submit host very much like the one used for production jobs. You can send test or regular pilots to one or all CEs. Please refrain from submitting large numbers of pilots, to avoid causing problems at the target CE (especially if it is not under your control).

To complete the validation of your CE and mark it green, perform this last step:

  • Go to the submit page: http://tier2-06.uchicago.edu:8800/pandajs/js/new/ ("Submit a job" or "new pilot submission" in the menus). The first time you use it in a session, you will be asked to enter the username and password. If you don't know them, please contact Marco (see above).
  • Keep the default submit host (pandajs1), check the box next to "Check to select a single CE:", select your CE from "Chosen CE:", select "default Panda pilot" from the "Pilot type:" list, enter 5 in "Number of pilots to send:", and click "Submit pilot jobs". This submits 5 pilots to your CE. Wait a few seconds for the result page.
  • Check that the pilots are actually queued at your gatekeeper:
    • You can see the status of the submit host at http://tier2-06.uchicago.edu:8800/pandajs/sh/1/ ("Submit host" from the menu; pandajs1 is the first one in the list)
    • Check your local queue as well. The information should match (allowing for some delay)
  • Wait for the jobs to complete (until they are done and no longer listed in the submit host status).
  • Check the Panda monitoring dashboard at http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?dash=prod and look especially at the upper-right section with pilots ("Pilot job requests per hour, last 3 hours"). Your CE should appear there, probably with 5 job requests.
  • If you see the pilots (job requests), you can add the green checkmark in SiteCertificationP1. Congratulations!
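The "wait for job completion" steps in both procedures amount to polling the submit host until the submitted pilots disappear from its status. A minimal Python sketch, assuming you have some way to list the job IDs that are still active: the fetch_active_ids callable is hypothetical (it could, for instance, wrap condor_q on the submit host or read the "Submit host" page), and the default poll interval and timeout are arbitrary.

```python
import time


def wait_until_done(job_ids, fetch_active_ids, poll_seconds=60,
                    timeout_seconds=6 * 3600, sleep=time.sleep, clock=time.time):
    """Poll until none of job_ids is still listed as active.

    fetch_active_ids: hypothetical callable returning the IDs currently
    shown in the submit host status.
    Returns True when all jobs are gone, False if the timeout expires first.
    """
    pending = set(job_ids)
    deadline = clock() + timeout_seconds
    while True:
        # A job that has left the active list is considered done for good.
        pending &= set(fetch_active_ids())
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The sleep and clock parameters are injectable so that the polling logic can be exercised without actually waiting.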

Custom use of PilotChecker

The following section is for users who want to know more about the PilotChecker and do more than validate their CE. If your goal is only to validate the CE, you can stop after the previous section.

Usage and validation

The Panda Pilot Submitter provides:
  • a user interface to submit new pilots and view past submissions
  • status information about the submit hosts and information about the CEs, also available from the user interface
  • an administrative interface to add and configure new submit hosts

The server administrator can add one or more backends (submit hosts) from the administrative interface. The backend server must already be running, and the administrator must be allowed to use it. To add a server, its information has to be entered and the files siteinfo.py (containing CE information) and storageaccess_info.py (containing additional CE information) have to be uploaded.
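Before uploading siteinfo.py and storageaccess_info.py, it can be worth checking that both files are at least syntactically valid Python. The sketch below only compiles the source and does not validate its contents; check_python_syntax is a hypothetical helper. Run it with the same Python major version as the PilotChecker server, since older siteinfo.py files may use Python 2 syntax.

```python
def check_python_syntax(path):
    """Return None if the file compiles, else a "path:line: message" string."""
    with open(path) as f:
        source = f.read()
    try:
        # Compile only; nothing in the file is executed.
        compile(source, path, "exec")
    except SyntaxError as exc:
        return "%s:%s: %s" % (path, exc.lineno, exc.msg)
    return None
```

For example, check_python_syntax("siteinfo.py") returns None for a file that compiles and an error string pointing at the offending line otherwise.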

To submit Panda pilots or test jobs, go to the 'submit' page (e.g. http://tier2-06:8800/pandajs/js/new/), fill in the form, and click "Submit pilot jobs".
The first time you access this page, you will be asked to enter a username and password. Please contact the developer if you would like to submit jobs and don't know them.

In the form you can select one of the available submit hosts. You can then either send pilots to all the available CEs (those set to OK), by leaving "Check to select a single CE:" unchecked, or send all of them to a single CE, by checking "Check to select a single CE:" and selecting the CE in the "Chosen CE:" field.
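The target selection described above can be summarized as follows. This is an illustrative sketch only, not the PilotChecker's code: ce_status is a hypothetical mapping from CE name to status, select_target_ces is a hypothetical name, and the "OK" status value is taken from the text above.

```python
def select_target_ces(ce_status, single_ce=None):
    """Return the list of CEs that pilots would be sent to.

    ce_status: hypothetical mapping of CE name -> status string.
    single_ce: the CE picked in "Chosen CE:" when the single-CE box is
    checked, or None when the box is left unchecked.
    """
    if single_ce is not None:
        # Single-CE mode: all pilots go to the chosen CE.
        if single_ce not in ce_status:
            raise ValueError("unknown CE: %s" % single_ce)
        return [single_ce]
    # All-CE mode: only CEs whose status is OK, in alphabetical order.
    return sorted(name for name, status in ce_status.items() if status == "OK")
```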

The user can choose to send different executables (pilots):

  • default - same as pilot2
  • pilot2 - the current Panda pilot used for production
  • test - test jobs that check the site configuration
  • old Panda pilot - previous version of the panda pilot

Other operations available to regular users include viewing and selecting an available submit host, verifying that the information about their CE is correct (especially for Site administrators), and viewing information about job submissions.

The Condor-G submission of the pilots usually completes successfully, but that alone does not show that the CE works. To be sure that the CE is working correctly, site administrators can verify that pilots are arriving at their CE and check their status using the Panda monitoring, as explained in ProductionTroubleshooting.

Installation

The PilotChecker is currently a prototype. The installation involves:
  • installation of a working Panda Job Submitter
  • installation of the Django web framework
  • installation of the Panda Pilot Submitter

To install a PandaJobSubmitter you can use:

pacman -get GCL:PandaJS

The installation takes around 20 minutes (depending heavily on the network) and uses more than 600 MB of disk space.
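Given the space requirement quoted above, checking the free space on the target filesystem before running pacman can save a failed installation. A minimal sketch using Python's shutil.disk_usage (Python 3.3 or later); enough_space is a hypothetical helper, and the 200 MB of headroom is an arbitrary margin, not a documented requirement.

```python
import shutil

# "More than 600MB" is quoted above for the PandaJS install;
# HEADROOM_MB is an arbitrary safety margin (assumption).
REQUIRED_MB = 600
HEADROOM_MB = 200


def enough_space(path, required_mb=REQUIRED_MB + HEADROOM_MB):
    """True if the filesystem holding `path` has at least required_mb free."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= required_mb
```

Call it on the directory where you plan to run pacman, e.g. enough_space("/opt/pandajs") (path hypothetical).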

A more detailed description of the installation and configuration of a Panda Job Submitter is available here.

To install Django you can follow the clear instructions provided at the Django website.

At the moment, to install the Panda Pilot Submitter you need to send an email to marco@hep.uchicago.edu; you will receive a tarfile with the application, to be expanded in your Django installation directory.
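Expanding the tarfile is normally a single tar command; if you prefer to script it, the following Python sketch does the same with the standard tarfile module and refuses entries whose paths would land outside the target directory. The function name and paths are hypothetical, and the actual layout of the tarfile is not described here.

```python
import os
import tarfile


def install_tarball(tar_path, django_dir):
    """Expand the application tarball into the Django installation directory.

    Refuses members whose path would resolve outside django_dir (absolute
    paths or '..' escapes). Symlink targets inside the archive are not
    inspected.
    """
    root = os.path.realpath(django_dir)
    with tarfile.open(tar_path) as tar:  # compression is auto-detected
        members = tar.getmembers()
        for m in members:
            target = os.path.realpath(os.path.join(root, m.name))
            if target != root and not target.startswith(root + os.sep):
                raise ValueError("unsafe path in archive: %s" % m.name)
        tar.extractall(root, members=members)
        return [m.name for m in members]
```

For example, install_tarball("pandajs.tar.gz", "/path/to/django") (both names hypothetical) returns the list of extracted entries.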

More information

-- MarcoMambelli - 26 Jun 2007 -- RobertGardner - 05 Jun 2007

About This Site

Please note that this site is a content mirror of the BNL US ATLAS TWiki.