This report covers Phase 4 of the IntegrationProgram, spanning the period January 1 - March 31, 2008. Meetings during this period:
Summary of milestones achieved
- WBS 1.1 ATLAS releases: further progress on the Panda-based release installation method (releases installed via Panda jobs). A number of issues were addressed:
- Firewall conduits at BNL ITB opened up.
- Tests passing installation of release 12.5.0: the following sites passed this test: MWT2_IU, MWT2_UC, UTA_SWT2, UTA-DPCC, OU_OCHEP_SWT2, UC_ATLAS_MWT2, SLACXRD, AGLT2; the following sites failed, but for known causes: BNL, SLAC.
- All sites agreed to change the permissions on the DQ2 area to allow the usatlas2 account to write into it, by making the directory group writable. The complication with dCache is that it does not respect the umask or the sticky bit, so new subdirectories created by usatlas1 will give permission errors to usatlas2. The suggestion is to change the BNLdCacheSiteMover function in the pilot code so that whenever it creates a new subdirectory for storing a log file, it makes it group writable.
- Deploy production submit host for Panda release pilots: 2/1/08
- Validation on all sites: 2/15/08
- On-going ...ready for production operations: 3/1/08.
- Further deployment will be based on schedule of new releases of installation pilot/trf.
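The suggested site-mover fix above can be sketched as follows. This is a minimal illustration, not the actual pilot code; the helper name is hypothetical, and only the explicit group-permission chmod reflects the proposed change.

```python
import os
import stat

def make_group_writable_subdir(parent, name):
    """Create a log subdirectory and explicitly set group read/write/execute.

    Because dCache does not honor the umask or the parent directory's
    sticky/setgid semantics, a subdirectory created by usatlas1 would
    otherwise be unwritable by usatlas2; setting the group bits
    explicitly works around that.
    """
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP)
    return path
```

The key point is that the permissions are set with an explicit chmod after creation, rather than relying on the process umask, which dCache ignores.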
- WBS 1.2 DQ2 site services: Site services were upgraded to DQ2 0.5.2. There was a special upgrade to support Adler32 checksums and to fill LRC entries with Adler32 data. As of the end of Phase 4, DQ2 0.6.5 has been in use elsewhere in ATLAS but has not yet been declared a stable production release.
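For reference, an Adler32 file checksum of the kind stored in the LRC entries can be computed incrementally with the standard zlib routine; a minimal sketch (the 8-hex-digit string form is a common catalog convention, assumed here):

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler32 checksum of a file incrementally, reading in
    chunks so large files need not fit in memory, and return it as a
    zero-padded 8-digit hex string."""
    value = 1  # Adler32 initial value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xffffffff:08x}"
```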
- WBS 1.3 OSG services: completed deployment of OSG 0.8 on the last Tier2. During Phase 4 we started the detailed work of getting RSV data forwarded to SAM. Evaluation of dCache 1.8 from the OSG-VDT storage group, including functional testing on Integration Testbed sites (UC, BNL), was completed. There is still no WLCG accounting portal view for the US ATLAS Facility. A prototype of a site-level RSV --> Nagios demonstrator, which publishes the results of RSV probes into a local Nagios instance, is complete. Provisioning of OSG ITB 0.9 testbed sites at BNL, UC, OU has been delayed due to the OSG schedule delay; likewise, validation of ATLAS/Panda on ITB 0.9 in advance of the OSG 1.0 release (estimated May 1, 2008).
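The RSV --> Nagios forwarding mentioned above can be done by submitting probe results as Nagios passive service checks through the external-command interface; a minimal sketch, in which the command-file path, host, and service names are illustrative, not the demonstrator's actual configuration:

```python
import time

# Typical Nagios command pipe location; the actual path is site-specific.
NAGIOS_CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

def publish_passive_check(cmd_file, host, service, status, message):
    """Append a PROCESS_SERVICE_CHECK_RESULT external command for Nagios.

    status uses the Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL,
    3=UNKNOWN. Nagios reads commands of this form from its command
    file and records them as passive check results.
    """
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, status, message)
    with open(cmd_file, "a") as f:
        f.write(line)
    return line
```

A cron job could run the RSV probes, map each probe outcome to a Nagios status code, and call this helper once per probe.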
- WBS 1.4 Storage services: dCache 1.8 upgrade at AGLT2 and MWT2 providing SRM v2.2 capability for space token reservations. Work continued at SLAC to provide an equivalent SRM v2.2 capable storage element based on Bestman-Xrootd. This work has not yet been completed. SWT2, NET2 will likewise deploy Bestman-like services.
- WBS 1.5 Monitoring services: work continued on Nagios-based alarm infrastructure for the Facility; OSG RSV probes were installed on sites with OSG 0.8. At MWT2_IU, a demonstration of the integration of site-level Nagios and RSV probe output that will go into OSG 1.0. An external Nagios console collecting Nagios probe results for Facility-wide services was deployed on 2/28, providing a clean Nagios-alert system for grid services running at all sites in the US ATLAS facility.
- WBS 1.6 PROOF-Xrootd: A small group has formed and a number of informational meetings have been held to explore the deployment and use of PROOF at Tier2/Tier3 facilities. See ProofXrootd. At this time there are still no specific guidelines or recommendations, though a number of important issues have been discussed. Most notable are issues regarding data import, and management of analysis datasets in a PROOF farm context.
- WBS 1.7 Load tests: work continues at Tier2 (and some Tier3) sites to establish throughput benchmarks. Most sites have identified the weak points in their infrastructure and, at the end of Phase 4, were taking steps to remedy them (most involving hardware upgrades to gridftp and/or SRM doors).
- WBS 1.8 File Catalogs: The goal is to migrate from the LRC-based catalogs to the EGEE product LFC. Based on discussions with Kaushik, there is no need to test basic functionality, because Panda already uses LFC elsewhere. The concern is rather performance with a catalog of 20,000,000 entries, with more expected. Email was sent to Miguel Branco asking about the maximum LFC size and transaction rate in Europe; no reply yet. The plan is to dump the entire LRC to file and begin a long-term import into the existing test LFC instance (Hiro). This may take O(weeks), but once done, reloading this DB into another LFC instance will be quick. Final high-performance hardware for LFC at BNL is ordered and expected by April 15th. The idea is to test read and write performance with a fully populated catalog. If the tests look good, the next step would be a second (and third) LRC dump, SELECTing only entries created since the previous dump, and importing them into LFC. Once (nearly) all entries are migrated, briefly halt production, do a final dump and load, change the BNL catalog type to LFC, and restart production.
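The incremental dump-and-load strategy above can be illustrated with a toy catalog. This is a sketch only: the table and column names (guid, lfn, created) are assumptions for illustration, not the real LRC schema, and sqlite stands in for the production database.

```python
import sqlite3

def incremental_dump(conn, since):
    """Export catalog entries created after `since`. Each later pass
    picks up only rows newer than the previous dump, so the final
    production pause for the last dump-and-load is short."""
    cur = conn.execute(
        "SELECT guid, lfn, created FROM lrc_entries "
        "WHERE created > ? ORDER BY created", (since,))
    return cur.fetchall()

# Toy in-memory catalog standing in for the LRC database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lrc_entries (guid TEXT, lfn TEXT, created INTEGER)")
conn.executemany(
    "INSERT INTO lrc_entries VALUES (?, ?, ?)",
    [("g1", "lfn1", 100), ("g2", "lfn2", 200), ("g3", "lfn3", 300)])

full_dump = incremental_dump(conn, 0)    # initial long-running export
delta = incremental_dump(conn, 200)      # later pass: only new entries
```

In practice each dump would record the creation timestamp of its newest row, and the next pass would SELECT from that watermark.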
- WBS 1.9 Accounting: the accounting infrastructure, comprising OSG-provided components (Gratia, and a forwarding service to the EGEE APEL/web portal services), continues to be monitored. All sites in the facility are reporting correctly, though there is some on-going discussion of the accuracy of normalization factors in some cases.
- WBS 1.10 Analysis Queues: queues that were initially set up during Phase 3 were successfully brought into production during Phase 4. These queues have been, and continue to be, available to physicists doing analysis of FDR-1 data.
- WBS 1.11 Site certification Table: the organizational tool to track progress on tasks at the site-level for each WBS area.
- WBS 1.12 Summary Report: this report.
Normalized CPU delivered to WLCG by USATLAS Tier2 centers (Phase 4):
Procurement reports and capacity status
Procurements and capacities from Phase 3 were reported in SummaryReportP3. There is also the CapacitySummary, in which we compare pledged and deployed capacities for each phase of the integration program.
Procurements during Phase 4 (Jan 1 - Mar 31, 2008):
- T1: None
- AGLT2: None
- MWT2_IU: 40 dual-dual Opteron 2218
- MWT2_UC: 85 dual-dual Opteron 2218
- NET2: 176 cores (Intel E5430s and X7350s) + 84 TB (raw) storage
- SWT2-UTA: 540 cores (Opteron 2220) + 240 TB (raw) storage
- SWT2-OU: None
- WT2: 320 cores (X5355) + 230 TB (usable)
Capacity status (dedicated processing cores, usable storage) as of March 31, 2008:
- T1: 1952 cores, 1200 TB
- AGLT2: 900 cores, 400 TB plus 170 TB in dCache
- NET2: 570 cores, 170 TB
- MWT2_IU: 288 cores, 110 TB
- MWT2_UC: 476 cores, 102 TB
- SWT2-UTA: 520 cores, 81 TB
- SWT2-OU: 260 cores, 16 TB
- WT2: 312 cores (AMD 2218), 51 TB usable
CPU capacity vs WLCG pledge (period ending FY08Q2)
Usable disk capacity vs WLCG pledge (period ending FY08Q2)
Summary of failures and problem areas
- Storage: not all sites have successfully deployed an SRM v2.2 capable storage element with space tokens, as required by ATLAS.
- Throughput: sites have yet to establish routine transfers at a sustained 200 MB/s.
- Catalogs: we still rely on the old, unsupported LRC and its utility routines. We need to decide quickly during Phase 5 whether LFC will meet US ATLAS needs and migrate to its use, or re-dedicate development effort in support of the LRC.
- Site availability monitoring: little site reliability data is being published into the WLCG reporting systems at this time.
Carryover issues to next Phase
- The issues noted above in Storage, Throughput, File Catalogs, and Site Availability Monitoring are core infrastructure areas that still require significant attention and carry risk.
- Ramp up use of the Analysis Queues in support of FDR exercises.
- Continue increasing the processing capacity of the Facility, with stable operations of resources already deployed.
- 01 Apr 2008
Attachments (RobertGardner, April 2008):
- US ATLAS Facility Spreadsheet, v8 (Phase IV) (57.5K, 14 Apr 2008)
- Normalized CPU for Tier2 centers, Phase 4 (71.5K, 14 Apr 2008)
- CPU capacity vs WLCG pledge (period ending FY08Q2) (51.6K, 16 Apr 2008)
- Usable disk capacity vs WLCG pledge (period ending FY08Q2) (53.8K, 16 Apr 2008)