SummaryReportP2

This report covers Phase 2 of the IntegrationProgram, spanning August and September 2007.

Summary of milestones achieved

  • WBS 1.1 ATLAS releases, deployment method, tests: we continue to use Xin Zhao's framework for deploying ATLAS releases on OSG sites. Initial plans are in place to work with the Panda team to integrate and test Panda-based release-installation jobs; this work has been deferred by the Panda development team.
  • WBS 1.2 DQ2 site services: work included the definition of, and planning for, a DQ2 integration testbed at BNL and UTA. The work was delayed by slippage of the DQ2 0.4 release.
  • WBS 1.3 OSG services: deployment of the OSG Integration Testbed software stack ITB 0.7 on three US ATLAS sites (BNL, UC, OU), including the provisioning of an OSG storage element at BNL. BNL and UC are providing operational assistance to OSG VOs validating against the OSG software stack. ATLAS validation on ITB 0.7 (for the deployment release OSG 0.8) included full Panda tests running more than 20 complete production jobs over three days on the UC_ITB site.
  • WBS 1.4 Storage services: work during this period included wide-area gridftp memory-to-disk transfer load tests for globus-gridftp/NFS storage elements and dCache-gridftp storage elements (an illustrative test command is sketched after this list). Local disk throughput optimization (including dCache optimization at Tier2 sites) remains to be performed.
  • WBS 1.5 Monitoring services: work on the Nagios-based alarm infrastructure for the Facility continues, including initial integration work with the OSG "RSV" probes required for WLCG site availability monitoring (see the service-definition sketch after this list).
  • WBS 1.6 Logging services: Facility-wide syslog-ng forwarding of DQ2 site-services logfiles continues to be operated, as does development of the troubleshooting console (a sample forwarding configuration is sketched after this list). No effort was identified to implement a security layer for the infrastructure; that work has been deferred to the next phase.
  • WBS 1.7 Load tests: a control framework based on MonALISA has been implemented which provides regular, scheduled tests of data transfer operations of various types. Closely related to the load-testing effort, we launched in this phase an initiative to optimize various modes of throughput between BNL and the Tier2s, beginning with a systematic program of network optimization led by Shawn McKee (typical host-level tuning is sketched after this list). At the time of this report, three Tier2 sites (AGLT2, MWT2_IU, MWT2_UC) have been optimized to the level of Gigabit capacity (>950 Mbps ceilings) and gridftp throughput (>112 MB/s).
  • WBS 1.8 File Catalogs: an initial survey has been made of options for replacing the local replica catalogs used by the sites; a technology decision rests with the Panda development team.
  • WBS 1.9 Accounting: the accounting infrastructure, composed of OSG-provided components (Gratia and a forwarding service to the EGEE APEL/web portal services), has been checked on a site-by-site basis. Reporting irregularities (caused by site-level VO-mapping problems, OSG/EGEE registration problems, etc.) have been discovered, and steps to eliminate them are being pursued.
  • WBS 1.10 Site certification table: the organizational tool used to track progress on tasks at the site level for each WBS area. Included in these tasks is a program to set up analysis queues at each site (led by Bob Ball from Michigan and Mark Sosebee from UTA), with configurations for both PBS and Condor job managers (see the queue-creation sketch after this list). Analysis queues are now deployed at two sites: AGLT2 and UTA.
  • WBS 1.11 Summary Report: this report.
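
As a concrete illustration of the WBS 1.4 memory-to-disk tests, a transfer of this kind can be driven with globus-url-copy; the host, path, and parameter values below are placeholders for illustration, not the actual test configuration.

    # Push 1 GB of zeros from local memory to a remote gridftp storage
    # element, using 4 parallel streams and an 8 MB TCP buffer;
    # -vb prints throughput while the transfer runs.
    globus-url-copy -vb -p 4 -tcp-bs 8388608 -len 1073741824 \
        file:///dev/zero \
        gsiftp://se01.example.org:2811/scratch/loadtest/zeros.dat

Reading from /dev/zero takes local disk out of the measurement, isolating the network path and the storage element's write performance.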
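
On the WBS 1.5 side, attaching a probe to the Nagios alarm infrastructure amounts to a service definition; the sketch below is generic, and the template, host, command name, and check intervals are assumptions rather than the Facility's actual configuration.

    # Hypothetical Nagios service wrapping a site-availability probe.
    define service{
        use                     generic-service    ; site template (assumed)
        host_name               gk01.example.org   ; placeholder gatekeeper
        service_description     RSV gridftp probe
        check_command           check_rsv_gridftp  ; hypothetical wrapper script
        normal_check_interval   60
        retry_check_interval    10
        }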
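
For WBS 1.6, syslog-ng forwarding of a DQ2 logfile reduces to a source/destination pair such as the one below; the log path, collector host, and port are assumptions, not the deployed configuration.

    # Tail the local DQ2 site-services log (path is a placeholder).
    source s_dq2 { file("/var/log/dq2/site-services.log"); };
    # Forward over TCP to a central collector (host and port assumed).
    destination d_central { tcp("logs.example.bnl.gov" port(5140)); };
    log { source(s_dq2); destination(d_central); };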
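
The WBS 1.7 host tuning typically centers on TCP buffer limits, since default Linux settings of the period could not fill a high-latency Gigabit path with a single stream. The sysctl fragment below is a generic sketch with illustrative values, not the settings actually deployed at the Tier2s.

    # /etc/sysctl.conf fragment: raise TCP buffer ceilings to ~16 MB.
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # Apply without a reboot:  sysctl -p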
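
Finally, for the WBS 1.10 analysis queues, a PBS site would create a dedicated queue roughly as follows; the queue name and walltime limit are hypothetical, and Condor sites might accomplish the same with, e.g., a separate schedd or accounting-group quotas.

    # Create and enable a dedicated analysis queue (placeholders throughout).
    qmgr -c "create queue analy queue_type=execution"
    qmgr -c "set queue analy resources_max.walltime = 48:00:00"
    qmgr -c "set queue analy enabled = true"
    qmgr -c "set queue analy started = true"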

Procurement reports and capacity status

Procurements from Phase 1 were reported in SummaryReportP1.

Procurements during Phase 2 (Aug 15-Sep 30):

  • T1: Sun/STK SL8500 Library Expansion, 10 LTO4 Tape Drives, 20 TB HPSS Disk Cache
  • AGLT2: 350 cores, 200 TB plus 50 TB dCache
  • MWT2_IU: none
  • MWT2_UC: none
  • NET2: none
  • SWT2-UTA: none
  • SWT2-OU: none
  • WT2: none

Capacity status (dedicated processing cores, usable storage):

  • T1: 1612 cores, 1120 TB
  • AGLT2: 550 cores, 238 TB plus 89 TB dCache
  • NET2: 406 cores, 144 TB, 620,000 SpecInt2K (per the OSG table)
  • MWT2_IU: 128 cores, 110 TB
  • MWT2_UC: 136 cores, 102 TB
  • SWT2-UTA: 300 cores, 16 TB
  • SWT2-OU: 260 cores, 16 TB
  • WT2: 312 cores, 51 TB

Summary of failures and problem areas

  • DDM issues continue to pose the most significant challenge to operational stability. Instabilities with DQ2 0.3 persisted at all sites, requiring frequent manual restarts.
  • The delayed release of DQ2 0.4 prevented work on establishing a DDM testbed and on addressing the stability issues. By the end of this phase, BNL and UTA had completed the first installations of DQ2 0.4.
  • Initial load testing and network performance measurements indicate that much work remains to tune hosts at the Tier2 centers. Most sites are well below (by up to a factor of 5) their theoretical ceiling given the rated network capacities between their site and BNL; a fully utilized Gigabit path corresponds to roughly 125 MB/s, so a site seeing ~25 MB/s is running at a fifth of its nominal capacity.

Carryover issues to next Phase

  • Functional and scalability tests of DQ2 0.4 in the testbed, collection of known problems, and feedback to Miguel on documentation.
  • Full roll-out of DQ2 0.4 to all production services.
  • Continued development of load testing framework in terms of displays and test definitions. Weekly feedback during Wednesday meetings on load test performance metrics.
  • Complete network optimization at all Tier2 sites.
  • Begin focusing on storage throughput optimization, starting with dCache sites at Tier2 centers.
  • Continued procurements and capacity ramp-up.
  • Deployment of OSG 0.8.

-- RobertGardner - 02 Oct 2007
