This is a DRAFT

Welcome to the Tier3 (T3g, T3w) setup guides, "borrowed" from the active guide at ASC ANL.

ANL ASC Tier 3 setup guide

Introduction

The US ATLAS Tier 3 Task Force Report of Spring 2009 concludes that enhanced ATLAS analysis computing capabilities are needed at the home universities of US ATLAS members. Such capabilities are broadly called Tier3 computing. This site is being built as part of an effort to help US ATLAS institutes set up and maintain an effective Tier3 (T3).

By setting up a T3 as described in these pages, an institute gains the ability to process tens of TB of ATLAS data locally (corresponding to 10 fb-1 of AOD or D(1)PD data for most analyses) overnight. For a local analyzer, this means a very significant increase in productivity, since the overhead of dealing with a complex distributed system, the grid, is greatly reduced.

Please note that processing ATLAS data with enough parallelism to achieve meaningful rates cannot be done with a conventional cluster. The instructions for building the parallel processing farm are based on the concept of distributed data storage, which solves this problem. For a short explanation of distributed data storage, see this e-News item.
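
To make the bandwidth argument concrete, the short sketch below estimates the aggregate read rate a farm must sustain to get through a large dataset overnight. The dataset size, time window and single-server throughput used here are illustrative assumptions, not figures from the task force report.

  # Back-of-envelope estimate of the aggregate I/O rate needed to process
  # ~10 TB of data "overnight".  All numbers below are illustrative
  # assumptions, not figures from the Tier3 report.

  DATASET_TB = 10          # assumed dataset size in TB
  WALL_CLOCK_HOURS = 12    # assumed "overnight" processing window
  NFS_SERVER_MBPS = 110    # rough throughput of one GigE-attached file server (MB/s)

  dataset_mb = DATASET_TB * 1e6          # TB -> MB (decimal units)
  seconds = WALL_CLOCK_HOURS * 3600

  required_mbps = dataset_mb / seconds   # aggregate read rate the farm must sustain
  print(f"Required aggregate read rate: {required_mbps:.0f} MB/s")

  # A single NFS server on gigabit Ethernet tops out around ~110 MB/s, so a
  # central file server cannot feed the whole farm by itself; spreading the
  # data over the local disks of the worker nodes (distributed data storage)
  # removes this bottleneck.
  print(f"Equivalent number of single-GigE file servers: {required_mbps / NFS_SERVER_MBPS:.1f}")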

What is a Tier3g (T3g) or a Tier3w (T3w)?

There are four categories of Tier3 defined in the Report. This site only deals with T3g and T3w.

  • Tier3g (T3g): A T3g has consumer-only grid connectivity and ATLAS software capability. In addition, a T3g has enough processing and storage capacity to handle multiple-TB ATLAS datasets locally.
  • Tier3w (T3w): A T3w is an unclustered workstation with consumer-only grid connectivity and ATLAS software capability. The processing and storage capacities are limited to those of the workstation.
  • Tier3gs (T3gs): A T3gs is a large facility with much of the functionality of a Tier2, but with local control of resources (gs stands for Grid Services). Often these facilities are associated with a local Tier2.
  • Tier3af (T3af): This is an Analysis Facility envisioned to be used by multiple University groups. No complete prototype exists for this model thus far.

Unless you are already associated with a T3gs, it is very likely that a T3w is something you will need and a T3g is something to aim for at your University. Note that having a T3g or a T3w does not mean that you have any service (or other) obligations to the Grid.

| T3 type | Cores | Storage | Grid Access | 10 TB Processing | Cost | Setup Time | Maintenance |
| T3w | 1-8 | ~1 TB | Yes | No | $1-5k | few days | <0.1 FTE |
| T3g | >50 | >10 TB | Yes | Yes | $25-40k | 1 week | 0.1-0.5 FTE |

Why do I want a T3w or a T3g?

T3w

A T3w will let you do locally what you are probably doing at lxplus at CERN or acas at BNL, namely (a minimal command sketch follows the list):

  • Use DQ2 to bring small data samples to your local machine.
  • Test run athena interactively.
  • Submit your athena job to the Grid.
  • Retrieve your athena job results from the Grid.
  • Analyze the result interactively (e.g. with root).
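
The sketch below strings these steps together from Python (via subprocess) purely as an illustration of the workflow; the dataset, jobOptions and output file names are hypothetical placeholders, and it assumes the DQ2 client, Athena and pathena have already been set up in the environment through the usual ATLAS setup scripts.

  # A minimal sketch of the T3w workflow described above, driven from Python.
  # Dataset, jobOptions and output names are hypothetical placeholders.
  import subprocess

  def run(cmd):
      """Run a shell command and stop if it fails."""
      print("+", cmd)
      subprocess.run(cmd, shell=True, check=True)

  # 1. Bring a small data sample to the local machine with DQ2.
  run("dq2-get mc08.someSample.recon.AOD.e123_s456_r789/")        # placeholder dataset

  # 2. Test run athena interactively on the downloaded files.
  run("athena MyAnalysis_jobOptions.py")                          # placeholder jobOptions

  # 3. Submit the same job to the Grid with pathena.
  run("pathena --inDS mc08.someSample.recon.AOD.e123_s456_r789/ "
      "--outDS user.myname.someSample.test1 MyAnalysis_jobOptions.py")

  # 4. Retrieve the job output from the Grid, then analyze it
  #    interactively, e.g. with root.
  run("dq2-get user.myname.someSample.test1/")
  run("root -l MyAnalysisOutput.root")                            # placeholder output file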

Having this capability locally will give you more flexibility (adding or connecting to local storage, increasing processing power etc.) compared with relying on large installations at CERN or BNL.

However, perhaps the most compelling reason to set up a T3w is the possibility of expanding it to a T3g later.

If all you want is a single-user T3w: there is a well-documented Virtual Machine (VM) described at this site. This is probably the fastest way (one afternoon) to get a T3w; you will still need a machine with a decent CPU and at least 2 GB of memory for reasonable performance of ATLAS software. The VM installation, however, is not easily extendable to a T3g at a later date.

T3g

In addition to the capabilities of a T3w, a T3g adds a very significant processing capability under local control. It is very likely that analyzers will need to run over a relatively large but fixed set of data multiple times, particularly at the early stages of understanding the data. While such processing is possible on the Tier2 analysis queues, experience shows that having local control vastly improves the efficiency. The T3 task force report also points out the likelihood that Monte Carlo generation needs may overwhelm the current capacity of T2s, leaving little room for analysis jobs at the T2s.

In order to be effective, a T3g will have the ability to:

  • Process tens of millions of D(1)PD, AOD or ESD events overnight.
  • Equivalently, process tens of TB of data in that time.
Our tests have shown that, running over TB-sized AOD data, a T3g setup can be several times faster than running the same jobs at Tier2 analysis queues, even when a large number of job slots are available at the T2s. This is mainly due to the overhead at the T2 in retrieving the data and setting up for a run.
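
As a rough consistency check of the overnight target, the sketch below converts it into a per-core event rate; the event count, AOD event size, farm size and time window are illustrative assumptions rather than numbers from this page.

  # Rough consistency check of the "tens of millions of AOD events overnight"
  # target.  All numbers are illustrative assumptions.

  N_EVENTS = 50_000_000     # "tens of millions" of AOD/D(1)PD events
  EVENT_SIZE_KB = 200       # assumed average AOD event size
  N_CORES = 80              # assumed size of the PC farm
  WALL_CLOCK_HOURS = 12     # assumed overnight window

  dataset_tb = N_EVENTS * EVENT_SIZE_KB / 1e9     # total data volume in TB
  per_core_rate = N_EVENTS / (N_CORES * WALL_CLOCK_HOURS * 3600.0)

  print(f"Data volume: ~{dataset_tb:.0f} TB")                 # ~10 TB, matching the text
  print(f"Required rate per core: ~{per_core_rate:.0f} Hz")   # ~15 Hz, a rate an athena
                                                              # AOD analysis job can
                                                              # plausibly sustain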

As a byproduct of this capability, a T3g will also be able to:

  • Process ~100 million root ntuple (or D(3)PD) events in 1 hour.
  • Generate a significant number of MC events locally.

Of course this needs to be done keeping in mind the need for:

  • Lowest cost possible
  • Easy setup
  • Easy usage
  • Low maintenance
  • Scalability

Components of a T3g

A T3g is composed of two logically separate pieces.

  1. An interactive cluster, which handles:
    • Running short Athena test jobs interactively
    • Copying data to/from the Grid (DQ2)
    • Submitting jobs to the Grid (pathena)
    • Interactive root analysis
  2. A parallel processing farm which is capable of
    • Running over a large (~10 TB) amount of data in a relatively short time (overnight)

The interactive cluster is basically the same as a T3w--and T3w will not be discussed separately beyond this point.

The parallel processing farm has special requirements because of the large amount of data it has to handle. A low-cost solution described in ATL-COM-GEN-2009-016 is the basis of the instructions given at this site. For a short discussion of the technical requirements of the processing farm, look here.

How big a T3g will I need?

A T3g with an interactive cluster of 16 cores (2 boxes) and 5 TB of NFS storage, plus a PC farm with 80-100 cores and 20-30 TB of storage, is estimated to support 2 to 3 analysis teams (or separate streams of data your institute is interested in). Also note that some 0.1-0.5 FTE of effort is needed to maintain such a T3g.
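
As a quick sanity check of this sizing, the sketch below divides assumed mid-range values of the quoted hardware ranges among the teams; each team ends up with roughly the 10 TB share discussed earlier.

  # Rough per-team share of the recommended T3g size.  Mid-range values of the
  # quoted hardware ranges are assumed; the team count is taken as 2-3.

  FARM_CORES = 90        # mid-range of 80-100 cores
  FARM_STORAGE_TB = 25   # mid-range of 20-30 TB
  N_TEAMS = 2.5          # "2 to 3 analysis teams"

  print(f"Cores per team:   ~{FARM_CORES / N_TEAMS:.0f}")
  print(f"Storage per team: ~{FARM_STORAGE_TB / N_TEAMS:.0f} TB")   # ~10 TB per analysis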

Recent talks about T3g at ANL and benchmarking using Monte Carlo samples

A recent talk discussing the T3g PC farm setup at ANL can be found here: T3g PC farm talk at ANL Tier3 meeting

How to get started

The explicit instructions for setting up a T3g(w) follow.

-- DouglasBenjamin - 19 Aug 2009
