TUESDAY
Jim Shank's Talk
What ATLAS needs within next ~year
DC2 - some items still require follow up
Releases - working through problems
Release 12 delayed
Schedule problem
Need someone other than the developers to debug and test code. Need more effort here. Strategy needs to change if we are going to meet dates.
Up to release 14 prior to real data
Push from both ends: developers need more time and managers want it to go faster
New 'project' model
Disruptive system: need to learn something new on top of trying to meet deadlines
ROOT 4 --> ROOT 5 change cascaded to other areas
no trigger information until release 12, this is more important than doing 10^n events
if we don't run production we won't have systems ready; we find problems every step of the way, so can't cut the number back
if at turn-on we can simulate but can't handle real data, we're dead
as far as tier2, need continuous production -- until forever
tier2s can provide place for testing, subsystem defined tests, Jim prefers not
software release impact to overall timeline
LHC timeline and tests
full service operation by Sep 06, LHC service commissioning Apr 07
--capacity and scale for SC4?
Will still have DQ server at each site, but easier to install
deployment issues
Dist Prod Targets
managed prod
end Jim's talk
begin Razvan's talk
we have resources - are we using them well
functional & robust
support is effective? no metric to tell
integrated services functional?
do we have the right strategy? environment is changing
DQ2 - answers unclear; enclaves of knowledge need better communication, both locally and across sites
this is typical software - need many iterations to find bugs, fix, and re-test
need not to rely on new features so much
should expect milestones to slip - build in a contingency factor for this, or set better milestones that include time for testing and debugging
if can't deliver milestone need another plan
add more people to critical items, e.g. for DQ, more are working on it now where there were 3
increase feedback loop, make faster process to fix
idle resources
need thorough testing, give it importance, motivate people to test further
support structure
- uncorrelated among institutions
- works, but effective? stats from ticket system?
- how to evolve
metrics/monitoring/accounting
how to evolve
money is the easy problem, effective use is harder; installation is easy, getting a return on the investment is hard
expertise in hand -- personnel, getting people with right experience
are we being effective and do we have monitoring to verify that
what is access model?
resource allocation, priority scheme
is monitoring/accounting available to control use
storage
BNL - provides HPSS tertiary storage
dCache - cheap, but not in people hours
what does dCache get us? if not dCache, what's plan B?
single namespace
dCache gives a storage management solution and performance; storage nodes in front of SATA arrays have a cost
- discuss in storage break out,
- CMS t2 centers are deploying dCache
- does each site have to commit people to the storage solution? to plan B?
- centralized storage + storage on node (dCache or PVFS2)
- should seriously entertain other storage ideas, but we don't have a lot of time to select a storage solution; should pool resources to find a solution for all T2 sites
- other options? just wait...? but not for long
Strategy for support: internalize knowledge to have BNL serve as central support provider, then expand to local sites when they are ready
--we are blind when it comes to troubleshooting problems
--can afford more labor at T1 for now
--but need local expertise, or give remote access to BNL, are we ready to do that?
--email requests for DQ2: the IU and UC request to run a script to fix a DQ2 problem got lost in email
need trouble ticket system, need to discuss further
production
if production needs change how quickly can we adapt
OSG
where do we stand with CBA
- we get good products without spending resource
- but we do have to spend cycles to contribute to the community
If we have functional stack when should we upgrade?
are we good OSG citizens?
=Kaushik=
Panda requirements
need future talk on DQ2 deployment - now need full deployment of DQ2; new release? when is it coming?
each computing cluster must have shared filesystem across compute nodes - but NFS is problematic ...
- can't run ATLAS if you don't meet this requirement; doesn't even have to be truly shared, can mimic the shared model
up time requirements [link to actual docs]
up to 10 GB for a single release
option: produce a mini-tar? releases are too rapid?
at SLAC using AFS (with local caching for binaries) and NFS for BaBar? with 4000 nodes
Person on shift can't solve all problems; local monitoring is still required, 0.25 FTE required for this; non-ATLAS monitoring?
CPU requirements should be in kSI2K rather than num CPUs
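To illustrate why a kSI2K figure says more than a CPU count, here is a minimal sketch. It assumes each node type has a known per-core SPECint2000 rating; the function name, node counts, and ratings below are made up for the example and are not site data.

```python
def cluster_ksi2k(nodes):
    """Total capacity in kSI2K.

    `nodes` is a list of (node_count, cores_per_node, si2k_per_core) tuples.
    All values used below are hypothetical.
    """
    total_si2k = sum(count * cores * rating for count, cores, rating in nodes)
    return total_si2k / 1000.0


# Two hypothetical clusters with the same number of CPUs (200 cores each)
# but very different capacity once the per-core rating is included.
old_cluster = [(100, 2, 800)]
new_cluster = [(100, 2, 1500)]

print(cluster_ksi2k(old_cluster))  # 160.0
print(cluster_ksi2k(new_cluster))  # 300.0
```

Both clusters would report "200 CPUs", while the kSI2K totals differ by almost a factor of two.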
CPU accounting
- CPU usage comparison to Monalisa? Numbers in Panda were wrong and corrected a few days ago.
- sites report non-Panda usage; expand upon Dan's system? PBS and Condor solution? Effort required: who will do it? OSG has not delivered a solution
=David's talk=
Tier 2s need 1/3 or full AOD - need to firm up?
for databases, will there be a single instance of MySQL or multiple?
- each one should understand how it's deployed via pacman
- packaging to avoid replicating effort at each site, so let someone become the expert and duplicate steps to package
In breakout, look at deployment model and how to deliver to T2s.
=Skubic's talk=
how to support users at SWT2
- data needs differ between T3 and T2; jobs that don't need a lot of input or create a lot of output are good candidates for T3
- expectations of people at sw workshop - students currently do have access to OU T3
- can we set up to allow submission to only the T3?
- user guidance, and get their feedback back to RAC
=Saul's talk=
grid computing via Panda interface vs. interactive login, different support model
--grid computing is the model, but interactive use will be needed, and it takes resources: they want 1TB data, home dirs, emacs, ROOT etc.
-- for local computing do you segregate part of the T2 cluster? or point to T1? or separate cluster?
does resource need to be local?
--when Panda is ready, it will meet the need
-- but still need a place to compile and edit code; let's state that outright, but need to manage expectations
hypernews not to replace trouble ticket
-users can help each other
-open forum, i.e. globally viewable
will set up hypernews for each T2 site
relation of what Xin is doing for ATLAS installs to pacman
-if it doesn't work via pacman
-mirror serves as backup for CERN
-mirror is not installation itself
- 6-7 GB for all releases
- not official atlas policy, should it be? bring up to usatlas
- confusion between Xin's install and other installs?
- consistency between mirrors?
going to LoadLeveler over PBS (Torque) -- free at NE
=Patrick's talk=
IBRIX - distributed filesystem with client on each compute node
NLR connection in future
=Horst=
DQ2 cache is in IBRIX data partition, not sure what adding dCache would add
scalability studies in future for IBRIX
WEDNESDAY
Shawn's Talk
Need detailed networking map including equipment and IP numbers etc for each T2 site
Possibility of getting regional provider to get block of IPs to use across sites
Measuring Packet Loss
- devices along path can report errors
May want to install NDT (Network Diagnostic Test) at T2s
- CD available, could dedicate machine saving config to USB drives
- Shawn will provide instructions
Site responsibility for security
Should we have a US ATLAS security officer?
--how to respond to incidents?
--impact to sites
Government has concerns and wants to put restrictions in place
Is there an existing model we can follow?
Marco's Talk
Are Dev Panda and Prod Panda the same?
--Should be the same, Tadashi can provide details, logs, etc.
--Need to communicate needs to OSG so Panda can work with it without major changes, need ATLAS contributions and testing
--Could run with 0.4.1 if we installed and configured using "old" model, but with less restrictive 0.4.1 configuration had to make a change
--As of today Panda job has not been validated on ITB
--What are the advantages of the new version of OSG?
--Support will run out for older versions
How to proceed with non-usatlas sites without DQ2 servers?
--current model is that DQ2 would have to be deployed at site (limiting, only 4 sites now)
When should T2's go to OSG 0.4.1?
--no time better than any other?
--need to finish validation first
If 0.6 is a big change, should we just wait instead of investing time now?
--could ride at 0.4.1, so should at least upgrade from 0.2.x
--need to be good citizens
--many unused resources for ATLAS in OSG because we require DQ2
--BDII
--can CMS run on 0.4.0 or can they run on 0.4.1? Yes, need BDII
Work needs to be done to validate Panda in ITB
Need to use site with DQ2 for ITB validation for now
The OSG software will change, so we have to plan for this
If we don't have OSG as a source of middleware, what will we use instead?
Dan's Talk
Need to get accounting ducks in a row, we now have Monalisa and Dan's solution, OSG and other solutions coming
Can we customize this?
--refine views
--how much user used, production used, grid submit, other submit methods...
--Need to explore options, but could be a slippery slope of big time investment into accounting
--Try to modify an existing tool like Monalisa?
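The refined views discussed above (how much each user used, production vs. other submit methods) amount to grouping job records by user and submission category. A minimal sketch follows, assuming job records are available as dictionaries with `user`, `category`, and `cpu_hours` fields. That record layout and the sample values are assumptions for illustration, not the format of Monalisa, Dan's system, or any OSG tool.

```python
from collections import defaultdict


def summarize_usage(records):
    """Sum CPU-hours per (user, category) pair and per category."""
    per_user = defaultdict(float)
    per_category = defaultdict(float)
    for rec in records:
        per_user[(rec["user"], rec["category"])] += rec["cpu_hours"]
        per_category[rec["category"]] += rec["cpu_hours"]
    return dict(per_user), dict(per_category)


# Hypothetical job records; real ones would come from the batch system logs.
records = [
    {"user": "prod", "category": "production", "cpu_hours": 120.0},
    {"user": "alice", "category": "grid", "cpu_hours": 10.5},
    {"user": "alice", "category": "local", "cpu_hours": 4.0},
    {"user": "prod", "category": "production", "cpu_hours": 80.0},
]

per_user, per_category = summarize_usage(records)
for category, hours in sorted(per_category.items()):
    print(category, hours)
```

Keeping the summary this small is one way to stay off the slippery slope: the same two tables can be fed from whichever collector ends up being used.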
Dantong's Talk
Starting June 19 need T2 for SC4, total storage required ~1TB
--for meaningful exercise need SRM or SRM dCache
--don't have to save the data
--need to replicate what we truly plan to do: pick that SRM, it's a readiness test
--DQ2 ?
Dantong will provide details for T2s: what services, expected performance etc. for SC4
OU ITB will go to 0.4.1 for this test?
FTS supports gridftp, but we need SRM to provide queue in front of gridftp server
dCache new version will fix performance problems?
CMS involvement, Fermilab
Sites may need cheap storage and also some more high performance storage
Can we be smart about data sets, meaning not pull down same data set repeatedly because we had to make room for new datasets
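One way to avoid pulling the same dataset repeatedly is to evict the least recently used datasets first when making room. The sketch below assumes dataset sizes are known up front and that a "transfer" is the only cost that matters. The `DatasetCache` class and the sizes are hypothetical and do not describe DQ2 or dCache behaviour.

```python
from collections import OrderedDict


class DatasetCache:
    """Site disk cache that evicts the least recently used dataset when full."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = 0.0
        self.transfers = 0
        self._datasets = OrderedDict()  # name -> size in GB, oldest first

    def request(self, name, size_gb):
        """Return True on a cache hit; otherwise fetch the dataset and return False."""
        if name in self._datasets:
            self._datasets.move_to_end(name)  # mark as most recently used
            return True
        if size_gb > self.capacity_gb:
            raise ValueError("dataset larger than cache")
        # Make room by dropping the least recently used datasets.
        while self.used_gb + size_gb > self.capacity_gb:
            _, evicted_size = self._datasets.popitem(last=False)
            self.used_gb -= evicted_size
        self._datasets[name] = size_gb
        self.used_gb += size_gb
        self.transfers += 1
        return False


cache = DatasetCache(capacity_gb=100)
hits = [
    cache.request("A", 60),
    cache.request("B", 30),
    cache.request("A", 60),  # hit: A is still on disk
    cache.request("C", 40),  # evicts B, the least recently used
    cache.request("A", 60),  # hit: A survived because it was used recently
]
print(hits, cache.transfers)
```

A frequently used dataset stays on disk while rarely used ones are the first to go, which is the behaviour the question above asks for.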
Why move data to worker node if it's in dCache?
Tom's talk
Average user needs to know computing model
Panda is not there yet, so there are growing pains, not ready for analysis users yet, timeline this fall for analysis
How to treat analysis jobs differently from prod jobs
Building a quota system for each user into Panda; then we can have policy on top of it
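To make the quota-plus-policy idea concrete, here is a minimal sketch of per-user quotas with policy overrides. It is not Panda's design: the `QuotaPolicy` class, the CPU-hour unit, and the limits are all assumptions chosen for illustration.

```python
class QuotaPolicy:
    """Per-user CPU-hour quotas with a default limit and per-user overrides."""

    def __init__(self, default_cpu_hours, overrides=None):
        self.default = default_cpu_hours
        self.overrides = dict(overrides or {})  # user -> limit set by policy
        self.used = {}  # user -> CPU-hours charged so far

    def limit_for(self, user):
        return self.overrides.get(user, self.default)

    def try_charge(self, user, cpu_hours):
        """Charge the job if it fits in the user's quota; return whether it was accepted."""
        used = self.used.get(user, 0.0)
        if used + cpu_hours > self.limit_for(user):
            return False
        self.used[user] = used + cpu_hours
        return True


# Hypothetical policy: analysis users get 100 CPU-hours, production gets far more.
policy = QuotaPolicy(default_cpu_hours=100, overrides={"prod": 10000})
results = [
    policy.try_charge("alice", 60),   # accepted
    policy.try_charge("alice", 50),   # rejected: would exceed 100
    policy.try_charge("alice", 40),   # accepted: exactly reaches 100
    policy.try_charge("prod", 5000),  # accepted under the override
]
print(results)
```

Separating the limits (policy) from the usage counters (accounting) is what lets analysis and production jobs be treated differently without changing the accounting itself.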
==
RT will be used by support center
-- KristyKallback - 10 May 2006