r2 - 10 May 2006 - 17:18:30 - KristyKallback

Jim Shank's Talk

What ATLAS needs within next ~year

DC2 - some items still require follow up

Releases - working through problems

Release 12 delayed

Schedule problem

Need someone outside developers to debug and test code. Need more effort here. Strategy needs to change if going to meet dates.

Up to release 14 prior to real data

Push from both ends = developers need more time and managers want it to go faster

New 'project' model

Disruptive system, Need to learn something new on top of trying to meet deadlines

Root 4 --> Root 5 cascaded to other areas

no trigger information until release 12, this is more important than doing 10^n events

if don't run production won't have systems ready, find problems every step of the way, so can't cut number back

if at turn-on can simulate, but can't handle real data --we're dead

as far as tier2, need continuous production -- until forever

tier2s can provide place for testing, subsystem defined tests, Jim prefers not

software release impact to overall timeline

LHC timeline and tests

full service op by Sep 06, LHC service commissioning Apr 07

--capacity and scale for SC4?

Will still have DQ server at each site, but easier to install

deployment issues

Dist Prod Targets

managed prod

1 million jobs by fall 07

but now jobs are longer (so lose wall time)

need higher throughput, can use hyperthreading if jobs don't fail,

is jobs per day the right metric -- with longer jobs

metric for apr 06 was 10K, but we're not sustaining 5K at this time
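A rough illustration of why jobs per day can mislead once jobs get longer: delivered wall-clock time is jobs per day times average job length. The 10K and 5K jobs/day figures come from the notes above; the job lengths of 1 and 4 hours are made-up numbers for illustration only.

```python
def wall_hours_per_day(jobs_per_day: int, avg_job_hours: float) -> float:
    """Wall-clock hours delivered per day for a given job rate and job length."""
    return jobs_per_day * avg_job_hours


# Job lengths below are hypothetical; only the job counts come from the notes.
target = wall_hours_per_day(10_000, 1.0)   # Apr 06 target, assumed short jobs
current = wall_hours_per_day(5_000, 4.0)   # current rate, assumed longer jobs
print(target, current)
```

Under these assumed lengths the lower job count delivers more wall time, so a wall-hours (or kSI2K-hours) metric may track throughput better than a raw job count.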

not pre-emptive software, can we suspend effectively? suspend user analyses?

need better merge at end of long runs, short jobs need more merging

probability of failure increases with length of job
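The link between job length and failure can be sketched with a simple model, assuming failures arrive at a constant rate per hour (an assumption of this sketch, not something measured in the notes). The rate of 0.01 failures per hour is a placeholder.

```python
import math


def failure_probability(job_hours: float, failures_per_hour: float) -> float:
    """Chance that a job hits at least one failure, assuming a constant failure rate."""
    return 1.0 - math.exp(-failures_per_hour * job_hours)


# Hypothetical rate; compare a short job with a long one.
rate = 0.01
for hours in (2, 20):
    print(hours, round(failure_probability(hours, rate), 3))
```

In this model the failure probability grows with job length, which is one argument for shorter jobs even though they need more merging at the end.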

Where is analysis model document? [make link] not just distributed model

LHC overall schedule

official CERN schedule will slip 3mos? 6 mos?

official sched now has first schedule in summer 07

Realistic first beam collision in 08?

implication to T2 procurement ramp?

Resource Allocation Committee (RAC)

recently formed

will decide what the priorities are for user analysis

method by which t2 implements decisions made? VOMS?

What can t2 expect?

cont. mgt of prod

user analysis -- what does that mean at a T2, eg we can support skim, can fast jobs run, etc.

calib align studies

Can we pull out specific targets?

in Fred's break out discuss further and also policy discussion

T2 should have some general user analysis capabilities

primary MC

also user based analyses -- fairly broad based, what does that really imply, what env needed, we need to have uniform response of what can be done at each tier (1,2,3)

end Jim's talk

begin Razvan's talk

we have resources - are we using them well?

functional & robust

support is effective? no metric to tell

integrated services functional?

do we have the right strategy? environment is changing

DQ2 - answers unclear, need better communication, enclaves of knowledge need better communication, both locally and across sites

this is typical software - need many iterations to find bugs, fix and re-test; need to rely less on new features

should expect milestones to slip - allow a contingency factor for this, or we need better milestones that include time for testing and debugging

if can't deliver milestone need another plan

add more people to critical items eg for DQ are working on it now where there was 3

increase feedback loop, make faster process to fix

idle resources

need thorough testing, give it importance, motivate people to test further

support structure

- uncorrelated among institutions

- works, but effective? stats from ticket system?

- how to evolve


how to evolve

money is the easy problem, effective use is harder; installation is easy, getting a return on the investment is hard

expertise in hand -- personnel, getting people with right experience

are we being effective and do we have monitoring to verify that?

what is access model?

resource allocation, priority scheme,

is monitoring/accounting avail to control use


bnl - provides hpss tertiary storage

dCache - cheap, but not in people hours

what does dCache get us? if not dCache, what's plan B?

single namespace

dCache gives storage mgt solution, performance, storage nodes in front of SATA arrays have cost,

- discuss in storage break out,

- CMS t2 centers are deploying dCache

- does each site have to give people to storage solution? for plan B?

- centralized storage + storage on node (dcache or pvfs2)

- should seriously entertain other storage ideas, but we don't have a lot of time to select a storage solution; should pool resources to find a solution for all t2 sites

- other options? just wait...? but not for long

Strategy for support, internalize knowledge to have BNL serve as central support provider then expand to local sites when they are ready

--we are blind when it comes to troubleshooting problems

--can afford more labor at T1 for now

--but need local expertise, or give remote access to BNL, are we ready to do that?

--email request for DQ2: IU and UC request to run script to fix dq2 problem got lost in email; need trouble ticket system, need to discuss further

production: if production needs change, how quickly can we adapt?


where do we stand with CBA

- we get good products without spending resource

- but we do have to spend cycles to contrib to community

If we have functional stack when should we upgrade? are we good osg citizens


panda req's

need future talk on dq2 deployment - now need full deployment of dq2, new release? when coming?

each computing cluster must have shared filesystem across compute nodes - but NFS is problematic ...

- can't run ATLAS if you don't meet this req; doesn't even have to be shared, can mimic shared model

up time requirements [link to actual docs]

up to 10 G for single release

option produce mini-tar? releases are too rapid?

at SLAC using AFS (with local caching for binaries) and NFS for BaBar with 4000 nodes

Person on shift can't solve all problems, local monitoring is still required, .25 FTE required for this, non-ATLAS monitoring?

CPU requirements should be in kSI2K rather than num CPUs
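One way to state capacity in kSI2K instead of CPU counts is to multiply each CPU count by its per-CPU SPECint2000 rating and divide by 1000. The sketch below assumes that definition; the ratings and site layouts are invented placeholders, not measured values for any site.

```python
def capacity_ksi2k(cpu_groups):
    """Total capacity in kSI2K for (cpu_count, si2k_per_cpu) pairs."""
    return sum(count * rating for count, rating in cpu_groups) / 1000.0


# Two hypothetical sites with the same CPU count but different hardware.
site_a = [(100, 1000)]             # 100 older CPUs
site_b = [(60, 1500), (40, 1500)]  # 100 newer CPUs in two batches
print(capacity_ksi2k(site_a), capacity_ksi2k(site_b))
```

Both sites have 100 CPUs, but the kSI2K figures differ, which is the reason for preferring kSI2K when stating requirements.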

CPU accounting

- CPU usage comparison to Monalisa? Numbers in Panda were wrong and corrected a few days ago.

- sites report non-Panda usage, expand upon Dan's system? PBS and Condor solution? Effort required: who will do it? OSG has not delivered a solution

=David's talk=

Tier 2 need 1/3 or full AOD - need to firm up?

for databases will be single instance of mysql or multiple?

- each one should understand how it's deployed via pacman

- packaging to avoid replicating effort at each site, so let someone become the expert and duplicate steps to package

In breakout, look at deployment model and how to deliver to T2s.

=Skubic's talk=

how to support users at swt2

- data needs differ between t3 and t2; if a job doesn't need a lot of input or create a lot of output, it is a good candidate for t3

- expectations of people at sw workshop - students currently do have access to OU t3,

- can setup to allow sub to only t3?

- user guidance, and get their feedback back to RAC

=Saul's talk=

grid computing via panda interface vs. interactive logging, different support model

--grid computing is the model, but interactive use will be needed, and it takes resources: they want 1TB data, home dirs, emacs, ROOT etc.

-- for local computing you segregate part of t2 cluster ? or point to t1? or separate cluster? does resource need to be local?

--when panda is ready, will meet need,

-- but still need place to compile, edit code, let's state it outright, but need to manage expectations

hypernews not to replace trouble ticket

-users can help each other

-open forum, i.e. globally viewable

will setup hypernews for each t2 site

relation of what xin is doing for atlas installs to pacman

-if it doesn't work via pacman

-mirror serves as backup for CERN

-mirror is not installation itself

- 6-7 GB for all releases

- not official atlas policy, should it be? bring up to usatlas

- confusion between Xin's install and other installs?

- consistency between mirrors?

going to loadleveler over pbs (torque) -- free at NE

=Patrick's talk=

ibrix - dist filesystem with client on each compute node

nlr connection in future

horst= dq2 cache is in ibrix data partition, not sure what adding dcache would add

scalability studies in future for ibrix

WEDNESDAY

Shawn's Talk

Need detailed networking map including equipment and IP numbers etc for each T2 site

Possibility of getting regional provider to get block of IPs to use across sites

Measuring Packet Loss

- devices along path can report errors

May want to install NDT (Network Diagnostic Test) at T2s

- CD available, could dedicate machine saving config to USB drives

- Shawn will provide instructions

Site responsibility for security

Should we have US ATLAS security officer

--how to respond to incidents?

--impact to sites

Government concerns and want to put restrictions in place

Is there an existing model we can follow

Marco's Talk

Are Dev Panda and Prod Panda the same?

--Should be the same, Tadashi can provide details, logs, etc.

--Need to communicate needs to OSG so Panda can work with it without major changes, need ATLAS contributions and testing

--Could run with 0.4.1 if we installed and configured using "old" model, but with less restrictive 0.4.1 configuration had to make a change

--As of today Panda job has not been validated on ITB

--What advantages to new version of OSG?

--Support will run out for older versions

How to proceed with non-usatlas sites without DQ2 servers?

--current model is that DQ2 would have to be deployed at site (limiting, only 4 sites now)

When should T2's go to OSG 0.4.1?

--no time better than any other?

--need to finish validation first

If 0.6 is a big change, should we just wait instead of investing time now

--could ride at 0.4.1, so should at least upgrade from 0.2.x

--need to be good citizens

--many unused resources for ATLAS in OSG because we require DQ2


--can CMS run on 040 or can they run on 041? Yes, need BDII

Work needs to be done to validate Panda in ITB

Need to use site with DQ2 for ITB validation for now

The OSG software will change, so we have to plan for this

If we don't have OSG as a source of middleware, what will we use instead?

Dan's Talk

Need to get accounting ducks in a row, we now have Monalisa and Dan's solution, OSG and other solutions coming

Can we customize this?

--refine views

--how much user used, production used, grid submit, other submit methods...

--Need to explore options, but could be slippery slope of big time investment into accounting

--Try to modify an existing tool like Monalisa?
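The refined views asked about above (user vs production, grid vs other submit methods) amount to grouping usage records by different keys. A minimal sketch, assuming each record is a dict with hypothetical fields `category`, `submit_method` and `cpu_hours`; this is not the record format of Monalisa, Dan's system or any OSG tool.

```python
from collections import defaultdict


def summarize_usage(records, key):
    """Sum cpu_hours over the records, grouped by the given field name."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["cpu_hours"]
    return dict(totals)


# Hypothetical records; field names and values are illustrative only.
records = [
    {"category": "production", "submit_method": "grid", "cpu_hours": 120.0},
    {"category": "user", "submit_method": "grid", "cpu_hours": 30.0},
    {"category": "user", "submit_method": "local", "cpu_hours": 10.0},
]
print(summarize_usage(records, "category"))
print(summarize_usage(records, "submit_method"))
```

The same records feed every view, so the investment is mostly in collecting consistent records from each batch system rather than in the reporting itself.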


Dantong's Talk

Starting June 19 need T2 for SC4, total storage required ~1TB

--for meaningful exercise need SRM or SRM dCache

--don't have to save the data

--need to replicate what we truly plan to do, so pick that SRM; it's a readiness test

--DQ2 ?

Dantong will provide details for T2s: what services, expected performance etc. for SC4

OU ITB will go to 041 for this test?

FTS supports gridftp, but we need SRM to provide queue in front of gridftp server

dCache new version will fix performance problems?

CMS involvement, Fermilab

Sites may need cheap storage and also some more high performance storage

Can we be smart about data sets, meaning not pull down same data set repeatedly because we had to make room for new datasets
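One possible reading of "being smart about data sets" is a least-recently-used eviction policy, so that datasets still in use are the last to be dropped when room is needed. The sketch below is only an illustration of that policy; the class name, sizes and dataset names are hypothetical and it is not how DQ2 or dCache manage space.

```python
from collections import OrderedDict


class DatasetCache:
    """Tracks cached datasets and evicts the least recently used when full."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = 0.0
        self._sizes = OrderedDict()  # dataset name -> size in GB, oldest first
        self.fetches = 0             # number of transfers actually performed

    def request(self, name, size_gb):
        """Return True on a cache hit, False if the dataset had to be fetched."""
        if name in self._sizes:
            self._sizes.move_to_end(name)  # mark as most recently used
            return True
        if size_gb > self.capacity_gb:
            raise ValueError("dataset is larger than the whole cache")
        while self.used_gb + size_gb > self.capacity_gb:
            _, evicted_gb = self._sizes.popitem(last=False)  # drop oldest
            self.used_gb -= evicted_gb
        self._sizes[name] = size_gb
        self.used_gb += size_gb
        self.fetches += 1
        return False


cache = DatasetCache(capacity_gb=100)
hits = [cache.request(n, s) for n, s in
        [("a", 60), ("b", 30), ("a", 60), ("c", 30), ("a", 60)]]
print(hits, cache.fetches)
```

Here dataset "a" stays cached through repeated use while the idle dataset "b" is the one evicted to make room for "c".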

Why move data to worker node if it's in dCache?

Tom's talk

Average user needs to know computing model

Panda is not there yet, so there are growing pains, not ready for analysis users yet, timeline this fall for analysis

How to treat analysis jobs differently from prod jobs

Building into Panda quota system for each user, and then can have policy
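To make the quota idea concrete, here is a minimal sketch of a per-user CPU-hour quota with a site-wide default that policy can override per user. Everything here (class name, methods, units) is hypothetical and is not Panda's actual design or API.

```python
class UserQuota:
    """Per-user CPU-hour budget with a default limit and per-user overrides."""

    def __init__(self, default_cpu_hours):
        self.default_cpu_hours = default_cpu_hours
        self.limits = {}  # user -> override limit set by policy
        self.used = {}    # user -> CPU hours charged so far

    def set_limit(self, user, cpu_hours):
        self.limits[user] = cpu_hours

    def remaining(self, user):
        limit = self.limits.get(user, self.default_cpu_hours)
        return limit - self.used.get(user, 0.0)

    def can_run(self, user, estimated_cpu_hours):
        return estimated_cpu_hours <= self.remaining(user)

    def charge(self, user, cpu_hours):
        self.used[user] = self.used.get(user, 0.0) + cpu_hours


quota = UserQuota(default_cpu_hours=100.0)
quota.set_limit("alice", 500.0)  # hypothetical policy decision, e.g. from the RAC
quota.charge("bob", 90.0)
print(quota.can_run("bob", 20.0), quota.can_run("alice", 20.0))
```

Separating the limits (policy) from the usage counters (accounting) lets a body such as the RAC change priorities without touching the bookkeeping.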


RT will be used by support center

-- KristyKallback - 10 May 2006
