
ProofAtBNL - Using the BNL PROOF Farm


Introduction

What is PROOF?

The Parallel ROOT Facility, PROOF, is an extension of ROOT allowing transparent analysis of large sets of ROOT files in parallel on clusters of computers or multi-core machines. There is an official PROOF website and a PROOF forum in Root Talk, as well as an ATLAS hypernews forum.

A PROOF job starts instantly and runs interactively with live progress reporting. You can also run PROOF jobs in batch mode.
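
For batch mode, one option is to put the whole session into a macro and run it non-interactively, e.g. root -b -q runProof.C. Below is a minimal sketch; the macro name, tree name, and input path are hypothetical placeholders, and the farm name is one of the BNL farms described below:

// runProof.C -- a minimal sketch of a PROOF job suitable for batch mode.
void runProof() {
   TProof::Open("bnlt3a10");                 // connect to the PROOF farm
   TChain chain("physics");                  // placeholder tree name
   chain.Add("/path/to/your/input/*.root");  // placeholder input files
   chain.SetProof();                         // process the chain via PROOF
   chain.Draw("someVariable");               // placeholder query
}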

PROOF farms at BNL

There are 2 PROOF farms at BNL:

  • bnlt3a10: running ROOT-5.34.07, consisting of 96 workers (bnlt3a01-a09, bnlt3a11-a13).
  • bnlt3s01: running ROOT-5.34.07, consisting of 128 workers (bnlt3s01-s08).

Using PROOF farms at BNL

You should use a ROOT version that is the same as, or more recent than, the one running on the PROOF farm.

Setting up ROOT at BNL

You can use ROOT from either AFS or CVMFS (recommended):

  • AFS: source /afs/usatlas/scripts/root_set-slc5.sh 5.34.07
  • CVMFS: cvmfs-setupATLAS --quiet; localSetupROOT 5.34.07-x86_64-slc5-gcc43-opt

If you use tcsh, simply replace ".sh" above with ".csh".
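
To verify the setup, you can check the version from within ROOT; it should print the version you just set up (5.34/07 here):

acas% root -l
root [0] gROOT->GetVersion()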

Connection to PROOF farm

Before running your PROOF job, you are encouraged to make a simple test connection to the PROOF farm on which you intend to run. If the connection cannot be made, do not go ahead and run your jobs. A successful connection looks like this:

root [0] TProof::Open("bnlt3a10");
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: OK (96 workers)                 
Setting up worker servers: OK (96 workers)                 
PROOF set to parallel mode (96 workers)
root [1]

root [0] TProof::Open("bnlt3s01");
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: OK (128 workers)                 
Setting up worker servers: OK (128 workers)                 
PROOF set to parallel mode (128 workers)
root [1]

Overriding ROOT version on PROOF farm

You can override the ROOT version used to run your jobs on the PROOF farm by setting PROOF_INITCMD before opening the PROOF connection:

TProof::AddEnvVar("PROOF_INITCMD", "echo source AbsolutePath/yourScript");

That said, you are encouraged to use the same ROOT version on the PROOF farm as on your client machine, to ensure compatibility.
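
For example, to make the workers source the same AFS setup script shown earlier (a sketch; substitute the absolute path of your own setup script):

// Must be called before TProof::Open().
TProof::AddEnvVar("PROOF_INITCMD",
                  "echo source /afs/usatlas/scripts/root_set-slc5.sh 5.34.07");
TProof *proof = TProof::Open("bnlt3a10");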

Adding clist to a TChain or TDSet

For each dataset in xrootd, a clist file is created under the directories ~xrdadmin/xrd*_copied_dataset/. You can use the clist to add the list of files to your TChain or TDSet, for example:

// Build a file collection from the clist file (one ROOT file per line).
TFileCollection fc("fc", "list of input root files", "fileNameOfClist");
TChain* chain = new TChain("physics");
chain->AddFileInfoList(fc.GetList());

// or for TDSet ("TTree" is the object type, "physics" the tree name)
TDSet* dset = new TDSet("TTree", "physics");
dset->Add(fc.GetList());
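
Either object can then be processed on PROOF. A sketch for the TDSet, assuming an already-open session in the variable proof and a hypothetical selector MySel.C:

// 'proof' is the TProof* returned by TProof::Open();
// "+" compiles the selector with ACLiC on the workers.
proof->Process(dset, "MySel.C+");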

Run a Very Simple Example

In this example, you will open a PROOF session, add files to a TChain, and Draw() a simple quantity.

// Build the chain of input files (tree name FullRec0)
TChain *chain = new TChain("FullRec0");

chain->Add("/usatlas/workarea/yesw2000/root/Data/HPTV/user.TARRADEFabien.trig1_misal1_csc11.005145.PythiaZmumu.Athena_12.0.6.GroupArea_12.0.6.6.Jamboree_II-HightPtView-00-00-30.AAN.AANT3._000*.root");

// Make the workers use the same ROOT version, then connect
TProof::AddEnvVar("PROOF_INITCMD", "echo source /afs/usatlas/scripts/root_set-slc5.sh 5.34.07");
TProof *proof = TProof::Open("bnlt3a10");
chain->SetProof();

// Time the query
TStopwatch t;
t.Start();
chain->Draw("Jet_C4_p_T","Jet_C4_N>0");
t.Stop();
t.Print();

The output should look something like the attached screenshots:

  • Hist-SimpleExample.png: histogram plot of the simple example
  • PROOF-Progress-SimpleExample.png: PROOF Query Progress of the simple example

Run Analysis Job

To run an event-selection job on PROOF, you need a TSelector. A skeleton for a TSelector-derived class (say, TopSel) matching your ntuple can be generated via TTree::MakeSelector, as shown below:

root [0] TChain chain("physics");
root [1] chain.Add("/usatlas/groups/bnl_local/yesw2000/TOP_Data/data11_7TeV.00180164.physics_Egamma.merge.NTUP_TOPEL.r2603_p659_p694_p822_tid601859_00/NTUP_TOPEL.601859._000001.root.1");
root [2] chain.MakeSelector("TopSel");
Warning in <TClass::TClass>: no dictionary for class AttributeListLayout is available
Warning in <TClass::TClass>: no dictionary for class pair<string,string> is available
Info in <TTreePlayer::MakeClass>: Files: TopSel.h and TopSel.C generated from TTree: physics
root [3] .! ls -l TopSel*
-rw-r--r-- 1 yesw2000 usatlas   3209 May 22 09:23 TopSel.C
-rw-r--r-- 1 yesw2000 usatlas 995310 May 22 09:23 TopSel.h
root [4]

Before you start running your job on the PROOF farm, you should check:

  • whether your job runs successfully in non-PROOF mode.
  • whether your job runs successfully in PROOF-Lite mode (see the sketch after this list).
  • whether you have uploaded all required packages/files to the PROOF farm, or use absolute NFS paths for any files not shipped to the farm.
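
For the PROOF-Lite check, the same job can be run locally with workers on your client machine; a minimal sketch:

// PROOF-Lite: master and workers all run on the local machine,
// so no farm connection is needed; the rest of the job is unchanged.
TProof *p = TProof::Open("lite://");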

In addition, you should minimize the number of PROOF workers (say, 2) if your job has never been run on PROOF before. You can limit the number of workers as follows:

TProof::AddEnvVar("PROOF_NWORKERS","2");
TProof *proof = TProof::Open("bnlt3s01");
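
After the connection is open, you can run the generated selector over your chain; a minimal sketch, using the physics chain built earlier ("+" compiles TopSel with ACLiC, so it is also built on the workers):

chain.SetProof();            // route the chain through the open PROOF session
chain.Process("TopSel.C+");  // run the TopSel selector on all workers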

Using PoD (PROOF on Demand) at BNL

PoD (PROOF on Demand) lets you use a batch queue as a PROOF farm. The PoD setup is already provided in the CVMFS ATLAS distribution.

Setup of PoD env

On the acas machines at BNL, you can run the following to set up the PoD environment:

cvmfs-setupATLAS
localSetupPoD

The latter command, localSetupPoD, also creates a directory .PoD under $HOME:

acas% ls -1AF $HOME/.PoD
PoD.cfg
etc/

That is, a configuration file PoD.cfg and a subdirectory etc/ are created.

Condor configuration for PoD

There is a [condor_plugin] section in the configuration file:

acas% tail -5 $HOME/.PoD/PoD.cfg

[condor_plugin]
upload_job_log=no
options_file=$POD_LOCATION/etc/Job.condor.option
[slurm_plugin]
upload_job_log=no

You need to modify options_file to:

options_file=$HOME/.PoD/etc/Job.condor.option

Then put your Condor job description, specific to your Condor queue, into the file $HOME/.PoD/etc/Job.condor.option. As an example, here is my private Condor description:

acas% cat $HOME/.PoD/etc/Job.condor.option 

# Email address to send notification to.
Notify_user     = yesw@bnl.gov

# These are job flags which are non-Condor specific.
# The "Experiment" flag should be set to the user's experiment:
# star, phobos, phenix, brahms, atlas, etc.
+Experiment     = "atlas"

#group
+RACF_Group = "bnl-local"
+AccountingGroup = "group_bnllocal.yesw2000"

Start your PoD server

Now you can run

pod-server start
pod-submit -r condor -n 3

The output of the above commands looks like this:

acas% pod-server start

Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
select user defined environment script to be added to worker package...
selecting pre-compiled bins to be added to worker package...
PoD worker package: /direct/usatlas+u/yesw2000/.PoD/wrk/PoDWorker.sh
------------------------
XPROOFD [1197] port: 21001
PoD agent [1226] port: 22001
PROOF connection string: yesw2000@acas0250.usatlas.bnl.gov:21001
------------------------

acas% pod-submit -r condor -n 3
Job ID: 21066

Wait a few seconds, and you should see 3 PROOF workers available for you in the Condor queue:

acas% pod-info -l

worker yesw2000@acas0254.usatlas.bnl.gov:21001 (direct connection) startup: 5s
worker yesw2000@acas0266.usatlas.bnl.gov:21001 (direct connection) startup: 5s
worker yesw2000@acas0266.usatlas.bnl.gov:21002 (direct connection) startup: 5s

acas% pod-info -n
3

Connection to your PoD server in ROOT

Now you can connect to this PoD server and run jobs there, just as when connecting to a PROOF farm.

acas% root -l

root [0] TProof *p = TProof::Open("pod://");
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: OK (3 workers)                 
Setting up worker servers: OK (3 workers)                 
PROOF set to parallel mode (3 workers)

root [1] p->Exec(".!hostname");
acas0254.usatlas.bnl.gov
acas0266.usatlas.bnl.gov
acas0266.usatlas.bnl.gov

root [2] p->Exec(".!which root");
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/5.34.19-x86_64-slc6-gcc4.7/bin/root
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/5.34.19-x86_64-slc6-gcc4.7/bin/root
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/5.34.19-x86_64-slc6-gcc4.7/bin/root

root [3] p->Exec(".!pwd");
/home/condor/local/localt3/execute/dir_7594/proof/yesw2000/session-acas0250-1406655137-1548/worker-0.0-acas0254-1406655138-7873
/home/condor/local/localt3/execute/dir_19638/proof/yesw2000/session-acas0250-1406655137-1548/worker-0.2-acas0266-1406655138-20096
/home/condor/local/localt3/execute/dir_19636/proof/yesw2000/session-acas0250-1406655137-1548/worker-0.1-acas0266-1406655138-20095

Stop your PoD server

Please note that after an idle timeout of about 30 minutes, these queue slots are released automatically. Nevertheless, it is recommended to stop your PoD server once your job finishes, so that the resources are freed for the next user at your site:

pod-server stop

FAQ

How to Stop Interactive PROOF Session

You can stop an interactive PROOF job by entering Ctrl+C followed by the character "S", that is:

[...]
Looking up for exact location of files: OK (103 files)                 
Validating files: OK (103 files)                 
^C
Enter A/a to switch asynchronous, S/s to stop, Q/q to quit, any other key to cntinue: S
Info in <TSignalHandler::Notify>: Processing interrupt signal ... S
Info in <TMonitor::Select>: *** interrupt occured ***
Mst-0: merging output objects ... done                                     
[...]

As you can see, there are other options after entering Ctrl+C: you can also switch to asynchronous mode or quit the job.

How to Stop and Reset Your PROOF Session

Sometimes your PROOF session can become corrupted, so that TProof::Open() or your job hangs. Reset your session to restore functionality:

TProof::Reset("bnlt3a10");

or, for a hard reset:

TProof::Reset("bnlt3a10", true);

Please use the hard reset with caution: it is time-consuming and has significant side effects.

Note: please test a simple connection to the PROOF farm after a reset, before running any new PROOF job.

Log File of Your PROOF Session

You can view the logs of your PROOF session by clicking the "Show Logs" button in the PROOF Query Progress window, or run the following to extract the logs to a file:

    TProofLog *pl = proof->GetManager()->GetSessionLogs();
    pl->Save("*", "file_with_all_logs.txt");

How to Clean up Your PROOF Sandbox

If you leave large unwanted files in your PROOF sandbox, you can clean them up yourself. The PROOF manager class TProofMgr provides functionality for accessing the sandbox (see http://root.cern.ch/drupal/content/accessing-sandbox).

The example below shows how to list the sandbox on both the PROOF master and the workers, and how to clean it up.

root [0] mgr = TProofMgr::Create("bnlt3a10")
(class TProofMgr*)0x2d10900
root [1] mgr->Ls(".")
Node: bnlt3a10.usatlas.bnl.gov:1093
-----
cache                              session-bnlt3a10-1400684593-28356
data                               session-bnlt3a10-1400684683-29479
datasets                           session-bnlt3a10-1400684793-30602
last-master-session                session-bnlt3a10-1400684860-31724
packages                           session-bnlt3a10-1400685003-392
queries                            session-bnlt3a10-1400685036-1529
session-bnlt3a10-1398884007-18873  session-bnlt3a10-1400685075-2689
session-bnlt3a10-1398884735-19012  session-bnlt3a10-1400685809-4677
session-bnlt3a10-1400684556-27052  session-bnlt3a10-1400701585-12123
root [2] mgr->Ls(".","","bnlt3a01")
Node: bnlt3a01.usatlas.bnl.gov:1093
-----
cache                              session-bnlt3a01-1400616410-29518
data                               session-bnlt3a10-1398885130-19090
datasets                           session-bnlt3a10-1399905053-4912
last-master-session                session-bnlt3a10-1400264819-31080
last-worker-session                session-bnlt3a10-1400265035-31131
packages                           session-bnlt3a10-1400608711-24288
queries                            session-bnlt3a10-1400613627-26678
session-acas1010-1394217503-21586  session-bnlt3a10-1400619063-28422
session-acas1010-1394218797-22815  session-bnlt3a10-1400619618-29399
session-acas1010-1394219241-23721  session-bnlt3a10-1400701585-12123
root [3] mgr->Ls("queries/")
Node: bnlt3a10.usatlas.bnl.gov:1093
-----
session-bnlt3a10-1400685075-2689  session-bnlt3a10-1400685809-4677
root [4] mgr->Rm("queries/*","-rf")
(Int_t)0
root [5] mgr->Ls("queries/")
Node: bnlt3a10.usatlas.bnl.gov:1093
-----
root [6] mgr->Rm("session-bnlt3a10-139*","-rf")
(Int_t)0
root [7] mgr->Rm("session-bnlt3a01-*","-rf")
(Int_t)0
root [8] mgr->Ls(".")
Node: bnlt3a10.usatlas.bnl.gov:1093
-----
cache                              session-bnlt3a10-1400684683-29479
data                               session-bnlt3a10-1400684793-30602
datasets                           session-bnlt3a10-1400684860-31724
last-master-session                session-bnlt3a10-1400685003-392
packages                           session-bnlt3a10-1400685036-1529
queries                            session-bnlt3a10-1400685075-2689
session-bnlt3a10-1400684556-27052  session-bnlt3a10-1400685809-4677
session-bnlt3a10-1400684593-28356  session-bnlt3a10-1400701585-12123
root [9]





Attachments


Hist-SimpleExample.png (30.5K) | ShuweiYe, 20 May 2014 - 15:32 | Histogram for the simple example
PROOF-Progress-SimpleExample.png (59.0K) | ShuweiYe, 20 May 2014 - 15:33 | PROOF Query Progress of the simple example
 