r15 - 23 Nov 2010 - 18:09:01 - TorreWenaus

Physics Support and Computing Materials for the May 2010 DOE/NSF Review of the U.S. ATLAS Operations Program


Overview

Physics Support and Computing materials are gathered here for the May 2010 DOE/NSF review of the U.S. ATLAS Operations Program. See also the attachments at the bottom of the page. The review website is here.

U.S. ATLAS Operations Program Manager's Review, April 1-2, 2010

In April a U.S. ATLAS Operations Program Manager's Review was held at BNL, chaired by Steve Vigdor. The agenda is here (password: usatlasmr). This review included detailed talks on Computing Facilities and Software that are not covered by dedicated talks in the agency review; reviewing them is recommended for more detailed information in these areas than the agency review talks provide. Highlights from these presentations are included in the supplementary slides of Torre's overview talk in the agency review.

Recommendations to PS&C

Here follow the reviewers' recommendations in the PS&C area and a response (in green).

Computing facilities and operations

  • Make sure that the distributed analysis support model, which is based on volunteers, is sustainable, e.g. by providing Cat-1 shift credit for participation in these efforts.

We reported to the reviewers that we were in a concerted campaign to identify additional distributed analysis support (DAST) shifters, particularly in our time zone region, where the number has been low (4, against a target of 10), and that this was proving difficult. We have continued the search, with broadened scope, since the review. ATLAS has also since changed its shift credit categorization, replacing Cat1/Cat2 with three Classes of shifts, with DAST shifts accruing full shift credit under this scheme. We have identified the following new shifters thus far: one in the U.S. physics community, one (project-supported) at BNL, *... any more?*. We remain below the target and so continue the search, both among volunteers in the physics community and among project-supported personnel.

  • Reduce the scope of the Tier-1 facility to providing the appropriate percentage (23%) of the total ATLAS resource requests, which will free significant funds for other uses.

Revisiting the U.S. pledge level with the possibility of a downward revision which would free resources for U.S. discretionary use is under discussion in U.S. ATLAS. (We note that 23% is the fraction we are currently providing, which requires pledging essentially all the Tier 1 resources.) Should we decide to do this we will proceed in consultation with international ATLAS, and will formally revise our pledge well before the next Resource Review Board meeting in October.

  • The proposal for providing enough disk space at BNL to store a complete ESD and AOD/dESD copy plus 0.5PB for specific RAW-based studies (at estimated costs of $880k/$630k in FY10/11) should be included in the priority list of U.S. ATLAS projects.

We have included this plan in our funding requests for FY10 and FY11, and it has been included in our target funding by management.

  • Clarify and define the role of the BNL facility as a physics analysis facility, in terms of needed resources, size, scope, and how it interacts with the U.S. community, the Tier-2 centers and the Tier-1 facility, so such a facility can be considered for re-inclusion in the baseline plan.

We are proceeding to implement this recommendation. We have included a request beyond target for the BNL analysis facility in FY11 and FY12, in conjunction with an RBT to supplement Tier 2 funding to strengthen their analysis role as well, based on an estimated budgetary shortfall at the Tier 2s in FY11 and FY12. Discussion and planning have begun on clarifying and defining the relative roles and needs of the Tier 1 and Tier 2s in analysis, as we learn more every day from real running experience and from the already dramatically increasing analysis activity.

Physics Support

  • In arguing for the needs of Physics Support and Computing, recast those arguments to emphasize the degree to which particular elements enable the leadership of U.S. ATLAS physicists.

We will take up this recommendation.

Software

  • A more defensible assessment of contribution of U.S. ATLAS Software (WBS 2.2) to the overall ATLAS effort should be developed.

A first step in this direction has been taken, with a full prioritization and justification of the effort across the software WBS (attached below). In each WBS area we describe the contribution and its importance to U.S. ATLAS and to ATLAS. We will address this further at the software managers' retreat we will hold in a few months.

U.S. ATLAS Software and Computing Review, May 18-20, 2009

In May 2009 U.S. ATLAS commissioned a review of the U.S. ATLAS Software and Computing program, held at BNL and chaired by Richard Mount, with a review committee of ATLAS and non-ATLAS HEP computing experts: Peter Elmer / Princeton, Alexei Klimentov / BNL, Richard Mount / SLAC (Chair), Ryszard Stroynowski / SMU, Craig Tull / LBNL, Vicky White / Fermilab. The review report went to U.S. ATLAS management.

The charge to the review panel was to analyze current activities and future plans of U.S. ATLAS software and computing in terms of:

  • Requirements arising from the agreed U.S. role in ATLAS and the WLCG
  • The importance of each activity in delivering high quality physics results
  • The cost-effectiveness of each activity
  • The technical quality of each activity
  • The effectiveness of the management processes involved in each activity – is the organization in place optimal or should it be changed to make it more effective?

Examples of specific topics covered in the charge were given:

  • Is US ATLAS Software and Computing doing its share, more than its share, or less than its share of ATLAS software and computing?
  • Is the Tier 1 appropriately staffed, in terms of staff numbers and skill sets?
  • What are the staffing needs of the Tier 2s and Tier 3s in the operational phase, including staff needs to support physics analysis? How do these needs compare with current staffing?
  • Is the Tier1/2/3 complex ready for physics analysis (in addition to production processing)? Is the complex flexible enough to deal with evolution of the computing model driven by experiences with real data?
  • Are the analysis support activities appropriately resourced and are they poised to deliver efficiently the support required as data arrive?
  • How should the US software effort make the transition from construction/development to operation? Which activities should shrink or move into maintenance mode? Are there software activities that need to grow? In the longer term, what are the likely needs for, and timing of, major software work, to prepare for the Phase 1 and even Phase 2 luminosity increases, or to re-engineer software components to meet long-term performance and maintainability requirements?
  • Are the developments required to stay cost effective in the evolving hardware and software environment receiving adequate attention and resources?
  • Comparisons with US CMS where possible (recognizing that the comparisons will require some effort from US CMS and that this effort cannot be insisted upon).

The reviewers were asked to make recommendations on improvements and optimizations, noting particularly the lowest priority activities that are in the current plan, and any higher priority activities that are not in the plan.

The reviewers' final report is attached below. Recommendations are listed here. Current comments (May 2010) are shown in green.

Recommendations

General recommendations

  1. Flexible, nimble, and at times decisive management will be required to meet both the challenge of real data and the challenge of working within the context of international ATLAS.
  2. Ensure that the WBS structure does not become a barrier to efficient execution of activities requiring cross-WBS coordination.

We believe we have heeded these recommendations. In supporting analysis on U.S. facilities, for example, we have responded flexibly to the real needs of analysis users in Tier 1/Tier 2 analysis roles and data placement policies, working collaboratively with international ATLAS though not always in complete alignment with its policies. The U.S. has demonstrated the best performing analysis services ATLAS-wide (as measured by the users themselves in the usage distribution), and on policies such as Tier 1 availability for individual analysis and placement of ESDs at Tier 2s, ATLAS policy is moving towards the U.S. approach.

Regarding cross-WBS coordination, in the expanded Tier 3 work of the last year, for example, we have effectively marshalled cross-WBS efforts to work with the Tier 3 team on developing the technical specifics for supporting Tier 3s as an integral (yet distinct) part of the distributed computing resources.

Ensemble of facilities

  1. Ensure that adequate management attention is focused on non-Tier 1 issues, for example:
    • how will university-based physicists have the resources and support to do analysis?
    • clarification of expectations on user-driven access to data and the priorities of the experiment with respect to use of resources.

Management has strongly supported the growth in technical and planning activity away from the Tier 1s, most notably in the Tier 3 ramp-up and the close coordination and collaboration between that effort and the rest of the project (and international ATLAS). The Tier 3 program (survey of all institutes last fall, technical development on the Tier 3 types identified in last year's Tier 3 report, Tier 3 prototyping, equipment recommendations and installation support, extensive documentation) has answered how university-based physicists will have the resources and support to do analysis on local facilities -- with the very great assistance of the ARRA funding as well.

Regarding Tier 2s, a great deal of focus during the year has been on expanding the role of the Tier 2s in analysis, using data placement policies, testing and support for distributed analysis services at Tier 2s, user encouragement, and so on. The results are evident in the growth in the analysis fraction performed at Tier 2s.

On data access and resource usage issues, the revised and re-energized Resource Allocation Committee is proving very effective for translating user needs to policy and practice, for both Tier 1 and Tier 2s, informed by the experience and constraints of facilities and operations.

Tier 1 and networking

  1. If there is a funding shortfall, consider cutting equipment rather than staff at the Tier 1 – don't (necessarily) meet the pledge to WLCG for resources. The disk and CPU requirements have huge error bars, and only after data comes will the real needs be known and new pledges, or a different computing model, be negotiated.
  2. Plan to work with others (e.g. a small number of key Tier 1s that are major dCache users) to evaluate options and move forward with a replacement for dCache in the longer term.

Together with the detailed comments in the report proper, summarized by the statement "the [Tier 1] staffing levels are not far off what they should be", we take recommendation 1 as a validation that Tier 1 staffing is at about the level required, and we should not look for substantial cuts there. The LHC schedule stretch reduced resource needs such that there is not at this point a funding shortfall under the nominal scenario, but with the increase in the management reserve, there is a shortfall under the low scenario. We would address this partly through prioritized effort reductions (across the full project, not limited to and not excluding the Tier 1), and partly through reduction/reoptimization of computing capacity (see the discussion on dealing with the low budget scenario below).

On dCache, a very promising recent development is that addressing the longer term needs of storage and data management for the LHC, in particular what new tools and technologies should be enlisted to better support analysis needs, has been taken up as an urgent issue at the WLCG level. A process to address this kicks off with a WLCG workshop in Amsterdam in June. ATLAS internationally and the U.S. will be well represented in the discussions and the activity. See the Jamboree on Evolution of WLCG Data and Storage Management site.

General software recommendations

  1. In general, crisp, clear, prioritized lists of activities and tasks will facilitate communication of Software's efforts to both global and US ATLAS management.
  2. Consider ways to improve communication and collaboration with ATLAS central – such as:
    • restructuring to better match roles in US ATLAS with roles in ATLAS
    • following negotiation with ATLAS management, making clear formal statements about what the US can and can't support and what the expectations are on both sides
    • formalizing decision making in US ATLAS and communicating those decisions and their rationale to International ATLAS
    • publishing weekly US ATLAS status bulletins

Regarding prioritized activities and tasks in software, following the review we organized a planning retreat for software area managers, held in Tucson in August, where the goals were 1) to discuss priorities in software and 2) to develop an improved management plan for the U.S. ATLAS software effort (WBS 2.2) -- goals driven specifically by this recommendation. The workshop proved very useful both for planning/prioritization and for improving the management plan. The priorities identified there have been reflected in our workplans since, and the WBS modifications agreed there are reflected in the current project WBS. The report from the meeting is attached below.

More recently we have done a full prioritization of the software WBS, evaluating WBS tasks against a set of metrics, with descriptions outlining the activity, its importance and its appropriateness as a project activity. This prioritization was an important input to the process of identifying cuts required to meet the PS&C budget target under the low scenario.

We plan to carry both these activities forward: another planning retreat will take place in the next several months, and the WBS prioritization will be maintained as a tool in arbitrating and communicating our priorities, and aiding in planning.

Regarding communication and collaboration with central ATLAS, today we observe no significantly contentious issues arising from communication problems. Contentious issues of the past (e.g. PanDA's ATLAS-wide role and support) have largely settled down, and the relatively minor issues outstanding are being addressed collegially. We certainly have our differences with central ATLAS, for example over what computing model best supports analysis effectively, but these issues too are addressed collegially. To be sure, a strong reason for these good relations is the strong U.S. representation in central ATLAS: Jim Shank as Computing Deputy, Alexei Klimentov as ADC Coordinator, Paolo Calafiura as Software Coordinator, and others.

Core services

  1. Address the technical and non-technical aspects of the memory crisis in the context of the PMB. Priority must be on robustness of production and timely resolution so that core effort can be refocused on languishing core developments.
  2. The PMB plan we were presented with should be diligently executed and communicated to ATLAS collaborators.
  3. Develop and/or adopt more detailed tools needed for monitoring of performance in the short and long term. Use tools to keep ahead of such issues.
  4. Develop a prioritized list of Core's highest priorities driven by physics priorities to define the coming year's work and make clear trade-offs when firefighting decisions are made.

These recommendations have been heeded, with the PMB (co-chaired by Paolo Calafiura) playing a central role in addressing memory usage (with considerable success in the past year -- memory is no longer the crisis issue it was). Performance in general, and the 'daily firefighting' of addressing robustness and performance issues, remains a principal focus of the core services effort in the U.S. and in ATLAS generally, as has investigating and adopting tools for performance monitoring. Recently, consultation with Intel engineers has been very fruitful in identifying performance issues in our code and possible improvements.

Core prioritization was done both in the summer planning retreat and continues in the software WBS prioritization. The priorities identified last summer -- memory, multicore support, virtual machine support, and 64-bit adaptation -- have guided the work program since, in collaboration with non-US core members.

Data management

  1. The role of TAGS and metadata in ATLAS data management must be clarified and validated with respect to physics priorities.
  2. At this late date, critical priorities for Data Management must be driven by the short term physics tasks and goals related to LHC startup. More explicit and trackable connection between physics tasks and data management features would help focus effort.

The roles of tags and metadata have been clarified over the year, as the reality of beam data has focused more attention on their use as an entry point to analysis. Beyond event-level tags, the area includes infrastructure for cross-section and luminosity calculation; quality information, Good Run Lists and the machinery that supports them; configuration information; job-level reporting; infrastructure for flagging, reporting, and dealing offline with errors in data-taking and in the offline processing itself; and more. In these areas metadata and tags are beginning to be used extensively and successfully. We are not yet seeing enough usage to say the role is fully clarified and validated with respect to physics priorities, but this is the trend, and larger data volumes (motivating greater use of and reliance on tags and metadata) should accelerate it.

Data management efforts have been very closely driven by physics needs in the past year, as addressing data formats, their roles and their performance has been a major focus of the Collaboration while ATLAS physicists sort out the analysis model they will employ. In distributed data management the effort is fully driven by physics needs through direct contact with physics community interfaces such as the CREM (computing resource needs panel; the RAC in the US), and with production coordinators for reprocessing, Monte Carlo and physics groups.

Distributed software

  1. Explicitly request Global ATLAS provide operational and development support for PanDA.
  2. Make sure that all commonalities between distributed production and distributed analysis are exploited to avoid unnecessary duplication of effort.

We did make this explicit request to Global ATLAS (as we have in the past). Results have been mixed but more positive than before. In important parts of PanDA, pilot submission and (possibly) also monitoring, we are working to offload U.S. effort to international ATLAS. The U.S. submission system, AutoPilot, will shortly be retired in favor of a replacement, 'autopyfactory', developed outside the US and currently being extended to add AutoPilot and other features we require. In monitoring, we have converged on a common development approach that opens the possibility of offloading monitoring elements to the ATLAS monitoring effort (based on the ARDA Dashboard), but this has to be shown in practice (we have had such agreements in the past, but they have not borne fruit). On the other hand, in operational support we are worse off than before: the CERN ATLAS person working on central (CERN) PanDA services support has largely left that role, and a U.S. person has (necessarily) picked it up. Resolving this is an open issue with international ATLAS.

On exploiting all possible commonalities between distributed production and analysis, progress has been slight but the signs are encouraging. The principal issue consuming manpower unnecessarily is the continuing support for both PanDA and gliteWMS as back ends for distributed analysis, despite the ATLAS policy to standardize on PanDA. As a result, redundant effort goes into gliteWMS support rather than enjoying common analysis/production support of the grid through one interface, PanDA and its pilot. The new ADC management shows encouraging signs of addressing and rectifying this by planning to focus all effort on the PanDA back end, starting sometime soon.

Analysis support

  1. Because this effort is both important and difficult to scope, constant attention to ensure that all aspects are covered is required. For instance, core developers' support responsibilities for analysis must be a recognized, critical contribution and may grow.
  2. Pay particular attention to this new effort as it may take some time to gel.

There are presently four developers supported in the analysis support WBS. In each case their analysis support responsibilities are explicit, as documented in this spreadsheet of software and analysis support prioritized activities. This effort is being watched, and for those developers who have been in place for some time, results have been good, with visible and productive contributions to analysis support (the analysis workbook, and analysis release support and debugging). In two cases the effort is new and more time is needed to assess, but work areas are well chosen for relevance and no problems are apparent thus far.

DOE/NSF Review of the U.S. ATLAS Operations Program, Feb 9-12, 2009

Recommendations for PS&C area

The review report makes three recommendations in the PS&C area. Recommendations with responses in green are below.

The report also comments:

"There is real leveraging of OSG support benefiting U.S. ATLAS, relieving U.S. ATLAS from providing some 5 FTEs. OSG Service Level Agreements are coming soon. We look forward to seeing them presented at the next review."

OSG service level agreements have been put in place; some are still in the process of being finalized. They are documented here.

  • Work with International ATLAS to improve the production from all T1 centers.
U.S. ATLAS did work with International ATLAS to improve the production performance of all Tier 1 centers during the past year. We contributed to this through our computing management roles, Jim Shank as ATLAS Distributed Computing (ADC) Coordinator and more recently as Deputy Computing Coordinator, Alexei Klimentov as ADC Operations Coordinator and more recently as ADC Coordinator, as well as through other managerial and technical roles in production operations and management. The principal avenue for this has been a strict Tier 1 production validation program coupled to regular functional tests of Tier 1 systems performed by ATLAS production operations.

Success has been substantial albeit not complete. In the December 2009 reprocessing exercise, which began days after the end of the first beam run, nine of the ten Tier 1s were validated and participated successfully in the campaign, which concluded on schedule at all sites. The exception was Taiwan, where storage, staffing and other issues have hampered performance. ATLAS management as well as WLCG management has been pressing all Tier 1 centers to keep to the MOU agreements and deal with problems in a timely manner. This message has gotten through, as can be seen in the report of Ian Bird to the recent Computing RRB meeting at CERN, where he shows a distinct improvement in site readiness since the LHC beam turned on this year. The subsequent February and April reprocessing campaigns have seen the same: all Tier 1s participating in successful campaigns apart from Taiwan. BNL performed the best in all campaigns, and hosted 100% of the input data; through the February campaign this was reflected in BNL doing a disproportionately large share (~50%). In the April campaign (and future campaigns) this was rectified: shares are controlled by allocating input data to the Tier 1s following the MOU shares, which drives the processing shares to match the MOU. This is what took place in April; the BNL share was ~21% (the current U.S. MOU share is 23%).

  • Work with International ATLAS to ensure that the top-mixing study becomes an effective tool to scope out a realistic test of analysis loads.
During the year the U.S. worked with International ATLAS to plan and execute a series of analysis scaling tests. The two principal tests are outlined here: STEP09 and UAT. The UAT in particular was designed to test realistic analysis loads produced by a large group of users performing their own analyses on a large data sample, which, from the discussions at the review, we believe is what the reviewers were looking for with this recommendation.

In June 2009 ATLAS conducted the largest such test to date, the STEP09 exercise, which also involved the other LHC experiments. In ATLAS an analysis challenge was part of the test. A set of analyses including SUSY validation, a single top study, and a graviton search were stress tested ATLAS-wide in distributed analysis using the automated testing system HammerCloud. As part of the preparations for this test, PanDA analysis queues were deployed throughout ATLAS (i.e. beyond the US), so the test represented an ATLAS-wide exercise in PanDA scalability as well as in the other ATLAS option, the Ganga analysis system (which supports gliteWMS as well as PanDA as the back end grid interface). The analysis tests, and the PanDA tests in particular, were highly successful, and the test contributed greatly to a subsequent growth in the PanDA analysis user community.

Subsequently, in October, a further test, an ATLAS-wide User Analysis Test (UAT), was conducted to enlist individual analysts running their own analysis codes in large-scale testing, rather than relying primarily on an automated system. A primary sample of 520M multijet events (about 50 inverse picobarns) containing appropriate amounts of tt, W/Z, and prompt γ was distributed to the US Tier 1, all US Tier 2s, and most non-US Tier 1s and Tier 2s. Experienced grid users (only) were encouraged to participate over the three days of the test. About 170 users submitted 200k jobs, in parallel with HammerCloud jobs, to reach full capacity on the distributed analysis grid resources. The test went very well; while problems were found, there were no serious scaling issues.

With the arrival of beam data in late 2009 the analysis infrastructure was put to the test for real, and performed well, initially at smaller scales than these tests. Shortly after the arrival of 7 TeV data in April, distributed analysis using PanDA (increasingly prevalent as the main ATLAS distributed analysis system) increased tremendously, with concurrent running analysis job counts equalling the peak levels of both STEP09 and UAT. As in those tests, system scaling issues have not appeared, and our analysis resources, particularly in the US, are close to full utilization.

  • Develop a plan to address storage infrastructure issues for the upgrade and pursue R&D on multi-core processing with OSG and CMS.
Recently (writing in May 2010) a promising LHC-wide initiative has taken shape to address the future evolution of data access and storage management for analysis. US ATLAS has strongly supported this and will be well represented among participants at a first workshop on the issue next month in Amsterdam, organized by the WLCG. We expect to use participation in the process launched by this workshop as our means of addressing storage and data management improvements for LHC analysis into the future. A short note summarizing a March 2010 discussion among the experiments that led to this initiative is attached below. The tentative timescale for large scale use of developments emerging from the program is 2013.

Regarding multi-core R&D, this has been pursued as recommended, with ATLAS, CMS and OSG participating in coupled R&D programs that have culminated, on the ATLAS side, in the midterm report on athenaMP studies "Status of the Work to Parallelize ATLAS Reconstruction Processing" attached below. We plan to continue this collaborative R&D into 2011, and a mini-workshop dedicated to discussing current status and planning a further 2011 program is taking place May 13 at Fermilab, with ATLAS, CMS and OSG participation. ATLAS will have about 7 participants.

Prioritization of PS&C activities

We present here information on the prioritizations done across the PS&C WBS.

Software (WBS 2.2)

In the software area we have recently performed a prioritization across all activities in the WBS. We present this as a spreadsheet in which, area by area at WBS level 4, we have assigned metrics to the activity in a set of key areas:

  • criticality to operations now, and to operations in 5 years (1=high criticality, 5=low)
  • level of commitment to the activity (1=a US responsibility and deliverable, 5=an activity easily moved elsewhere)
  • expertise (1=US has unique expertise and role, 5=expertise is readily available elsewhere)
  • particular value to US ATLAS (1=has specific value to US ATLAS, beyond general usefulness to ATLAS, 5=generally useful to ATLAS)
  • appropriate for project support (1=fully appropriate for project support, 5=appropriate for base program support)

Comments on the metrics: The metric values are dominated by 1's and 2's. This may reflect some timidity towards higher rankings but primarily it reflects the highly selected nature of our WBS. Were we to perform such an exercise across ATLAS Software and Computing, the dispersion would be far greater. We have a program specifically selected as one attuned to US ATLAS needs, leveraging US expertise, focusing on the most critical areas within our needs and expertise, and appropriate to project support. The program has been largely stable in its scope since its inception a decade ago, and within that scope has been continually tuned to our needs. So 1's and 2's aren't that surprising. As a consequence the gradations between the highest and lowest priorities are not large, and this too reflects reality; our lowest priorities represent carefully selected activities in a resource- and scope-constrained environment and so are not undertaken lightly. The spreadsheet includes comments for each area describing the nature and importance of the activity. The sheet includes an unweighted sum of the metrics for each area. While clearly very approximate, a survey of the sums suggests that the unweighted sum is about right to produce an overall prioritization metric for the area.
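As a rough illustration of the arithmetic behind the spreadsheet, the following sketch computes the unweighted sum used as the overall prioritization metric. The area names and metric values here are invented for illustration, not taken from the actual spreadsheet; lower sums indicate higher priority, since 1 is the high end of every scale.

```python
# Hypothetical sketch of the WBS prioritization arithmetic described above.
# Area names and metric values are illustrative only.

# Metrics per WBS level-4 area, each scored 1 (best case) to 5:
# (criticality now, criticality in 5 yrs, commitment, expertise,
#  US-specific value, appropriateness for project support)
areas = {
    "2.2.x Example area A": (1, 2, 1, 1, 2, 1),
    "2.2.y Example area B": (2, 2, 1, 2, 2, 1),
    "2.2.z Example area C": (1, 1, 2, 1, 1, 2),
}

# Unweighted sum: lower total = higher overall priority,
# since 1 is the high end of each metric scale.
ranked = sorted(areas.items(), key=lambda kv: sum(kv[1]))

for name, metrics in ranked:
    print(f"{name}: sum={sum(metrics)}")
```

As noted above, with a WBS dominated by 1's and 2's the sums cluster closely, so the gradation between highest and lowest priority areas is small.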

This metric-based prioritization has contributed to, but not strictly determined, the identification of lower priority items which would be cut if necessary under the low budget scenario, to reach the PS&C target funding level under that scenario. As described below, the cuts are to well-motivated areas whose loss or reduction would have a strongly negative impact.

Facilities and Distributed Computing (WBS 2.3)

The Facilities management maintains a prioritization across the full WBS 2.3 as a management tool. The staffing level and activity of the Tier 1 center were reviewed as part of the U.S. ATLAS Software and Computing Review in May 2009; the center was judged (by a panel including HEP computing facility experts from Fermilab and SLAC) to have a staff size close to what it should be, and the review panel recommended against cutting staff in the case of budget shortfalls. The other areas managed under WBS 2.3 cover distributed facilities and operations, and Tier 2 facilities. The bulk of distributed facilities and operations funding goes to the grid production team charged with operating the US ATLAS distributed facility as a production resource for ATLAS and US ATLAS. Tier 2 facilities funding is managed separately from the project as a fixed allocation to each, supplemented by local resources. With these considerations the scope for identifying low priority activities that can be targeted for reductions is small. It can be addressed using the management's working prioritization, however, and this has formed the basis of the cuts identified in this area for the low budget scenario.

Analysis Support (WBS 2.4)

Analysis support is also covered by this spreadsheet, which describes prioritized software and analysis support activities.

Addressing the low budget scenario

PS&C has been charged to provide a revised budget for the low budget scenario, in which PS&C target funding is reduced by $400k in FY11 and $1M in FY12. To address this we have drawn on the prioritizations across the project described above. We first sought to identify manpower reductions, given that computing resources are already overstretched and will be even more so through at least FY11. The reductions we identified meet our FY11 low-scenario budget target. Carrying them into FY12 would save close to $500k (an allowance is made in FY11 for practical delays in staff reduction). As described in the spreadsheet below, the reductions have a severe impact, and we will seek to restore most of them through requests beyond target (RBTs).

For FY12, where we face a further $500k reduction, coming in large measure from a $450k reduction in the NSF budget, we could not identify further manpower cuts that would have less than a drastic impact on meeting our deliverables and supporting the U.S. analysis community. Our plan to address this further cut is as follows. We propose to reduce the committed computing facility funds for delivering analysis capacity by $500k. The specific reductions within our Tier 1/Tier 2 facilities will be determined by a cost-benefit analysis across the facilities, targeting the reductions where cost-benefit is least favorable. At the same time we will submit a request beyond target (RBT) for funds sufficient to restore the analysis capacity lost to the cut, because we do not believe we can afford to lose that capacity. If the RBT is awarded we will apply it where cost-benefit is most favorable. Restoring the capacity should require less funding than the original $500k cut, benefiting from the more favorable cost-benefit where the funds are applied.

This plan presumes that non-negligible cost-benefit differences will exist across the Tier 1/Tier 2 facilities for analysis-targeted resources. We expect this to be the case, since the funding profiles and cost sharing across the facilities will be evolving over the next two years. The motivation and objective of the plan is that under the low budget scenario it becomes crucial to maximize cost-benefit for analysis processing, so that we use our resources most effectively. The cost-benefit assessment will be aided by a review and ongoing watchdog working group, shortly to be initiated, that will assess and monitor Tier 2 funding needs, cost-sharing plans, and resource provisioning expectations for the coming years.
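The figures in the low-budget-scenario plan above fit together as simple arithmetic; the following sketch (illustrative only, with all dollar amounts taken from the text) makes the bookkeeping explicit.

```python
# Sketch of the low-budget-scenario arithmetic (figures from the text above).

fy11_reduction_target = 400_000    # FY11 PS&C target reduced by $400k
fy12_reduction_target = 1_000_000  # FY12 PS&C target reduced by $1M

fy12_manpower_savings = 500_000    # FY11 manpower cuts carried into FY12 (~$500k)

# The remaining FY12 gap is met by the $500k analysis-capacity reduction.
fy12_gap = fy12_reduction_target - fy12_manpower_savings
print(f"Further FY12 cut needed: ${fy12_gap:,}")  # → Further FY12 cut needed: $500,000

# Of that further cut, $450k comes from the NSF budget reduction.
nsf_share = 450_000
other_share = fy12_gap - nsf_share
print(f"Non-NSF share: ${other_share:,}")  # → Non-NSF share: $50,000
```

This simply confirms that the FY11 manpower savings carried forward, plus the $500k facility cut, account for the full $1M FY12 reduction.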

These reductions, together with impact statements on the manpower reductions, are summarized on this spreadsheet.

U.S. Distributed Facility

Tier 1

See the attachment "U.S. ATLAS Facilities Overview" below for a document describing the Tier 1 and Distributed Facility.

Tier 2s

Current Tier 2 cost estimate for FY11, FY12

Here follows a cost estimate for FY11 and FY12 for a unified U.S. ATLAS Tier-2 center. The labor estimate is based on the assumption of 2 FTE per Tier-2 center supported from program funds.

Included in the cost figures are items that are already (at least partially) charged to the program in FY11 and will most likely have to be paid from program funds in FY12 and beyond, such as networking and computer room infrastructure. The situation in 2011 is particularly tight: sites have to ramp up their disk capacity by ~60% and CPU by ~23%. Even a high budget scenario (assuming +$100k) will not allow sites to replenish aging disk storage equipment. The situation improves somewhat in 2012. So in 2011 sites will reach the "planned to be pledged" capacities (these numbers assume that we make pledges according to our MoU share of ~22%; pledges for 2011 are expected to be made around September 2010) only with additional funds, or if sites continue to enjoy university contributions at their present level.
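To illustrate the scale of the 2011 ramp-up described above, the sketch below applies the ~60% disk and ~23% CPU growth factors to a hypothetical site; the base capacities (1,000 TB of disk, 10,000 HS06 of CPU) are assumptions for illustration, not figures from this page.

```python
# Illustrative only: 2011 capacity ramp applied to a hypothetical Tier-2 site.
# Base capacities below are assumed, not taken from this page.

disk_tb, cpu_hs06 = 1_000, 10_000

disk_2011 = disk_tb * 1.60   # disk capacity must grow by ~60%
cpu_2011 = cpu_hs06 * 1.23   # CPU capacity must grow by ~23%

print(f"2011 disk: {disk_2011:,.0f} TB, CPU: {cpu_2011:,.0f} HS06")
# → 2011 disk: 1,600 TB, CPU: 12,300 HS06
```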

Tier-2 cost in FY11

Tier-2 facility labor, 2 FTE @ $120k/ea. $240,000
Wide Area Networking (dedicated 10 Gbps wave) $50,000
Power, space, cooling $50,000
CPU - add 23%, replenish 25% (of pledged) $130,000
Disk - add 60% to meet 2011 pledge $240,000
Total $710,000

Tier-2 cost in FY12

Tier-2 facility labor, 2 FTE @ $123.5k/ea. $247,000
Wide Area Networking (dedicated 10 Gbps wave) $50,000
Power, space, cooling $50,000
CPU - add 20%, replenish 25% (of pledged) $130,000
Disk - add 10%, replenish 25% (of pledged) $210,000
Total $687,000
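The per-site cost tables above can be cross-checked by summing their line items; the sketch below (not official budget code, amounts copied from the tables) reproduces both totals.

```python
# Cross-check of the per-site Tier-2 cost tables above (amounts from the tables).

fy11 = {
    "Labor (2 FTE @ $120k)": 240_000,
    "WAN (dedicated 10 Gbps wave)": 50_000,
    "Power, space, cooling": 50_000,
    "CPU (add 23%, replenish 25% of pledged)": 130_000,
    "Disk (add 60% to meet 2011 pledge)": 240_000,
}
fy12 = {
    "Labor (2 FTE @ $123.5k)": 247_000,
    "WAN (dedicated 10 Gbps wave)": 50_000,
    "Power, space, cooling": 50_000,
    "CPU (add 20%, replenish 25% of pledged)": 130_000,
    "Disk (add 10%, replenish 25% of pledged)": 210_000,
}

for year, items in (("FY11", fy11), ("FY12", fy12)):
    print(f"{year} total: ${sum(items.values()):,}")
# → FY11 total: $710,000
# → FY12 total: $687,000
```

The FY12 labor figure reflects roughly 3% escalation over FY11 ($120k to $123.5k per FTE).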

Tier 3s

In March 2009 the U.S. ATLAS Tier 3 Task Force completed its final report, which has since guided our Tier 3 planning. The report is here.

These working pages of the ATLAS and US ATLAS Tier 3 effort are all under active development:

OSG

See the attachments below for

  • An assessment of core services provided to U.S. ATLAS and U.S. CMS by OSG (Feb 2010)
  • A requested clarification to this assessment, delineating which parts of the OSG are LHC-specific and which are community grid infrastructure that other OSG science communities, as well as the LHC, take advantage of

Background documents

  • Article on Frontier, the conditions database web proxy developed by US CMS and used by CMS and ATLAS


Major updates:
-- TorreWenaus - 06 May 2010



Attachments


pdf USATLAS-SC-Review-May09-Report-Final.pdf (222.4K) | TorreWenaus, 04 May 2010 - 11:41 | Final report of the May 2009 U.S. ATLAS S&C Review
pdf Software_planning_retreat_Minutes.pdf (134.5K) | TorreWenaus, 04 May 2010 - 14:13 | Minutes of the U.S. ATLAS Software Planning Retreat, Tucson, Aug 2009
pdf experiment_requirements_2010-2012.pdf (53.1K) | TorreWenaus, 05 May 2010 - 16:25 | Official experiment requirements for computing resources in 2010, 2011 and (tentative) 2012
pdf athenaMPSixMonthsV1.pdf (2982.8K) | TorreWenaus, 06 May 2010 - 15:00 | Status of the Work to Parallelize ATLAS Reconstruction Processing, March 2010
pdf StorageDataManagement.pdf (69.3K) | TorreWenaus, 06 May 2010 - 16:58 | First discussion addressing evolution of LHC data and storage management for analysis, March 2010
pdf OSG-USATLAS-USCMS-01-2010-v1.0.pdf (256.6K) | TorreWenaus, 07 May 2010 - 20:47 | Assessment of core services provided to US ATLAS, US CMS by OSG, Feb 2010
pdf OSG-USATLAS-USCMS-Clarification.pdf (139.2K) | TorreWenaus, 07 May 2010 - 20:56 | Clarification to: Assessment of core services provided to US ATLAS, US CMS by OSG, March 2010
pdf US-ATLAS-Facility-HL-Overview-update-5-2010.pdf (1666.6K) | TorreWenaus, 11 May 2010 - 09:38 | U.S. ATLAS Facilities Overview, Updated May 2010
pdf Report_on_Annual_DOE_NSF_Review_ANL_May_10-11_2010.pdf (4128.3K) | TorreWenaus, 23 Nov 2010 - 18:09 | Reviewers report
 