r11 - 30 Jan 2008 - 12:24:26 - RobertGardner

SummaryReportP3

Introduction

This report covers Phase 3 of the IntegrationProgram, spanning October - December 2007. Meetings during this period:

Summary of milestones achieved

  • WBS 1.1 ATLAS releases: significant progress on the Panda-based release installation method (releases installed via Panda jobs). Feasibility was demonstrated with releases installed at SLAC (Tadashi, Wei). Initial steps were taken toward a production infrastructure to feed these jobs into Panda queues; Xin has taken on this responsibility.

  • WBS 1.2 DQ2 site services: three upgrades, up to DQ2 0.5.1. Upgrade procedures are vastly improved, as are communications with the DQ2 development team.

  • WBS 1.3 OSG services: deployment of OSG 0.8 at four of the five Tier2s and at the Tier1 facility.

  • WBS 1.4 Storage services: dCache 1.8 upgrade at BNL; additional gridftp doors deployed at MWT2; dCache at AGLT2 tested under heavy throughput; SRM-Bestman+xrootd deployed at WT2; GPFS evaluated with good results at BU.

  • WBS 1.5 Monitoring services: work continued on the Nagios-based alarm infrastructure for the Facility; OSG RSV probes were installed at sites running OSG 0.8; discussions were held with the OSG RSV development team regarding forwarding results to SAM and making site-level RSV probe results available to Nagios.

  • WBS 1.6 Logging services: Facility-wide syslog-ng forwarding of DQ2 site-services logfiles and the troubleshooting console continue to be operated, though their use is no longer critical: DQ2 improved significantly in releases 0.4 and 0.5, and Panda Mover is now in production use for input datasets at sites.

  • WBS 1.7 Load tests: all Tier2 sites have applied host-level TCP kernel tunings to achieve Gigabit capacity (>950 Mbps iperf ceilings) and gridftp throughput (>112 MB/s). A major focus has been disk-to-disk throughput, which has exposed a number of issues at both the BNL dataservers and those at the Tier2s.

  • WBS 1.8 File Catalogs: initial evaluations of FTS are underway; installation of an FTS instance and the gLite libraries at BNL is understood and working. We are ready to begin addressing the more difficult issues: Panda integration, client interactions, and deployment/update/migration from the current LRC at sites (next Phase).

  • WBS 1.9 Accounting: the accounting infrastructure, composed of OSG-provided components (Gratia, and a forwarding service to the EGEE APEL/web portal services), has been checked on a site-by-site basis. At the end of this phase all but one site (SWT2_UTA) are reporting to WLCG, and that site will begin reporting soon. Issues remain with BNL-Tier1 being displayed correctly in the portal.

  • WBS 1.10 Analysis Queues: queues have been set up at each site and exercised with simple validation evgen Pathena jobs.

  • WBS 1.12 Summary Report: this report.
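The host-level TCP kernel tunings referenced under WBS 1.7 can be sketched as follows. This is a minimal, illustrative sketch only: the buffer sizes shown are assumptions for a moderate-RTT Gigabit path, not the values actually deployed at the sites, and the hostname is hypothetical.

```shell
# Illustrative sysctl settings for saturating a 1 Gbps WAN path.
# Buffer sizes are assumptions sized near the bandwidth-delay product;
# actual site values may differ.
sysctl -w net.core.rmem_max=16777216                 # max receive socket buffer (16 MB)
sysctl -w net.core.wmem_max=16777216                 # max send socket buffer (16 MB)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"    # min/default/max TCP receive buffers
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"    # min/default/max TCP send buffers

# Verify the achievable ceiling with iperf against a remote server
# (run "iperf -s" on the far end; remote.example.org is a placeholder).
iperf -c remote.example.org -t 60 -P 4
```

A host tuned this way is what the >950 Mbps iperf ceilings quoted above were measured against; gridftp disk-to-disk rates then depend additionally on the storage backend.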

Procurement reports and capacity status

Procurements and capacities from Phase 2 were reported in SummaryReportP2.

Procurements during Phase 3 (Oct 1 - Dec 31):

  • T1: 44 dual-socket, quad-core Intel servers (352 cores)
  • AGLT2: none
  • MWT2_IU: none
  • MWT2_UC: 20 dual-socket, dual-core Opteron 2218 servers (80 cores)
  • NET2: none
  • SWT2_UTA: none
  • SWT2_OU: none
  • WT2: none

Capacity status: (dedicated processing cores, usable storage)

  • T1: 1952 cores, 1200 TB
  • AGLT2: 900 cores, 400 TB plus 170 TB in dCache
  • NET2: 401 cores, 108 TB
  • MWT2_IU: 156 cores, 110 TB
  • MWT2_UC: 220 cores, 102 TB
  • SWT2_UTA: 520 cores, 76 TB
  • SWT2_OU: 260 cores, 16 TB
  • WT2: 312 cores, 51 TB

FACILITY SPREADSHEET: Normalization-factors-USATLAS-v6.xls: v6 of the cores/SPECint capacity spreadsheet for the US ATLAS Facility.
USABLE STORAGE: T1: 1200 TB; Sum T2: 1033 TB
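As a quick cross-check, the Tier2 usable-storage total quoted above is the sum of the per-site figures in the capacity list, counting both AGLT2 entries (400 TB plus 170 TB in dCache):

```shell
# Tier2 usable storage (TB): AGLT2 400+170, NET2 108, MWT2_IU 110,
# MWT2_UC 102, SWT2_UTA 76, SWT2_OU 16, WT2 51
echo $((400 + 170 + 108 + 110 + 102 + 76 + 16 + 51))   # prints 1033
```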

Summary of failures and problem areas

  • Demonstrated disk-to-disk performance for T1-T2 data transfers remains far below expectations. Summary:
    • AGLT2 - good performance achieved; optimization of dCache parameters and configuration of write pools still needed.
    • MWT2 - similar: I/O performance is good, but more gridftp doors are needed, and write pools should be separated from worker nodes.
    • NET2 - ready to begin sustained tests; plans are to use a gridftp door and GPFS backend.
    • SWT2_OU - ready to start sustained throughput tests; a 10G upgrade is underway.
    • SWT2_UTA - regional network limitations are an obstacle.
    • WT2 - WAN being upgraded to 10G; good results achieved with SRM-Bestman and xrootd backend.
  • Failed to begin routine distributed (Pathena-based) analysis at the Tier2s.

Carryover issues to next Phase

  • Continued disk-to-disk throughput assessment and actions to address known shortcomings.
  • Analysis queues brought into routine use for Pathena jobs; this requires reliable deployment of AODs.
  • File catalog technology firmly decided, with integration into Panda and other ATLAS and site services as appropriate.
  • Evaluation of dCache 1.8 with focus on SRM v2.2 functionality testing.
  • Consideration/study of PROOF/xrootd for Tier2 centers.
  • Continued procurements and capacity ramp-up.

-- RobertGardner - 16 Jan 2008



Attachments


  • Normalization-factors-USATLAS-v6.xls (38.0K) | RobertGardner, 29 Jan 2008 - 06:12 | v6 of the cores/SPECint capacity spreadsheet for the US ATLAS Facility.
  • Normalization-factors-USATLAS-v5.xls (50.0K) | WeiYang, 18 Jan 2008 - 14:07 | v5 of the cores/SPECint capacity spreadsheet.
 