
MinutesJul2

Introduction

Minutes of the Facilities Integration Program meeting, July 2, 2008
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Wen (U Wisc), Charles, Rob, Justin, Marco, Nurcan, Mark, Hiro, Xin, Horst, Saul, John, Patrick, Kaushik, Armen, Dantong, Tom, Wei, Michael, Wensheng, Jay, Shawn, Torre
  • Apologies: none
  • Guests: David Chen, Janis Landry-Lane (IBM)

Integration program update (Rob, Michael)

  • IntegrationProgram for Phase 5 (April 1 - June 30, 2008: FY08Q3)
  • Overarching near term goals for Phase 5:
    • Full and effective participation in the FDR-2 exercises
    • Complete the benchmarks of 200 MB/s sustained disk-to-disk throughput to all Tier2s
    • SRM v2.2 functionality for all ATLAS sites
  • Upcoming meetings:
  • Milestones from the Ann Arbor meeting: AnnArborNotesMay2008:
    • FDR2: data replication and analysis queues
    • 200/400 MB/s T1-T2
    • OSG 1.0 deployed
    • LFC evaluation and deployment strategy complete
    • WLCG - SAM/RSV reliability and availability metrics for CE and SE reporting >80% for all sites.
    • Provisioning of capacities according to pledges on track for September 15 2008 deployment.
    • Network performance monitoring infrastructure deployed.
    • Revision to the Tier 3 white paper, and a reference Tier 3 facility defined.
    • Analysis benchmarks demonstrated at increasing scale (100/200/500/1000 simultaneous jobs) at all Tier 2 facilities.

Next procurements

IBM iDataPlex (David Chen, IBM)

  • See slides circulated via email.
  • Follow-on to benchmark presentation from Ann Arbor workshop.
  • iDataPlex - new packaging of familiar components, a competitive offering of compute nodes for comparison w/ other vendors.
  • Delivered as a rack. Presumably wired for network and power.
  • Questions
    • What about "X" processors (rather than low-powered "L")? Answer: components are built to order. All except AMD.
    • Cooling door - $5K plus educational discount; still need a water chiller and pump.
    • Power connections - to an existing APC?
    • Weight - x2; a concern?
    • There will be a single price for all ATLAS sites
    • $2700/node, with a 3-year warranty; everything included (PDUs and rack). Will check on the management appliance and switch.
  • Local comments
    • power comparison - more like a blade
    • why better than blades? better for local disk.

Follow-up issues

  • Storage capacity recommendations/guidance for the Facility (320 TB capacity, from Kaushik's model on MinutesJune11).
    • Kors' figure is much lower, due to minimal requirements for production.
    • No major objections to these numbers.
    • At the moment - we are really short on disk.
  • Revised WLCG pledges - need info by July 15.
  • Specifications from Internet2 for network monitoring hosts (Rich)
    • No update. Why are the Internet2 folks silent? Needed by next week; cc Eric Boyd. Whitepaper update.

Operations overview: Production (Kaushik)

  • Follow-up on space token description assignments (PRODDISK, MCDISK, GROUPDISK, etc)
  • No scheduled tasks coming in. Complete breakdown in the production effort in ATLAS.
  • Shall we redirect our focus to analysis at Tier 2s?
    • Sites are responsible for checking that the analysis queue is functional, SE services, etc.
    • Support beyond this has to be provided by analysis support.
    • Need to ramp up the analysis benchmarks.
  • Software availability issue - people are waiting on a new release for 10 TeV.

Shifts (Mark)

  • Production proxy issue with the upgrade to the VOMS server. Thought it had been resolved, but it appears to be re-appearing intermittently.
  • SLAC - local proxy and port for SSL traffic - solved by Wei.
  • SWT2_CPB - stopped autopilot submission - will be offline while integrating new hardware and storage.
  • File transfer backlog at AGLT2. Under investigation.
  • NFS outage at BNL, cleared.
  • Sporadic feed of jobs.

Analysis queues, FDR analysis (Nurcan)

  • Follow-up:
    • Regular exercising of analysis queues over data sets, especially when there are new releases. Nurcan is doing this.
      • Problems w/ FDR2 datasets; all sites successful except SLAC, which was surprising. A parameter in the queue definition needed to be changed (Paul and Wei) - need to set up communication w/ Nurcan for any analysis queue changes. Problem w/ the syntax of file URLs - adapted by Tadashi.
      • Points out need for continuous testing.
      • Two analysis jobs at MWT2 - a change in the pilot wrapper script for timing out pilot child processes, contributed by Charles.
      • AGLT2 - a user sends a job that makes TAG selections; it works fine at BNL, but not at Michigan. There is a libdcap.so patch at BNL. There is a dccp client inside the ATLAS release (for dcap linking). Need to mitigate with the SIT.
      • Reconstruction jobs will need to be tried at Tier2s. But there are other job types. Nurcan will be making a list - regular and advanced.
      • Will saturate analysis queues with jobs.
    • Mark and Nurcan will meet to discuss some systematic testing at the sites and will re-consider the analysis benchmarks.
    • Metric - define a standard for the time required to process a standard dataset (a rough illustration follows this list).
    • Consider a site availability monitor which indicates basic functionality and site readiness; this would help users distinguish "site" problems from "user-code" problems.
  • Panda monitor still probably the best place for users to analyze jobs.
  • There is a new version of elog that might be useful.
  • Request from a user for a link to the Panda monitor giving some status information for sites, indicating downtimes, etc. - well advertised to users.
  • Twiki page to collect problems w/ analysis queues.
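  • For illustration only - a minimal Python sketch of the "time to process a standard dataset" metric idea above; the reference dataset size and per-site timings are made-up placeholders, since the actual benchmark definition was left open.

    # Illustration only: compare sites by wall-clock time on a fixed reference dataset.
    # The dataset size and per-site timings below are placeholders.
    REFERENCE_EVENTS = 100000  # hypothetical size of the standard dataset

    wall_time_s = {"SITE_A": 5400.0, "SITE_B": 7200.0}  # made-up timings

    if __name__ == "__main__":
        for site, secs in sorted(wall_time_s.items()):
            rate = REFERENCE_EVENTS / secs  # events per second on the reference set
            print("%-8s %7.1f s  %6.1f events/s" % (site, secs, rate))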

Operations: DDM (Hiro)

LFC migration

  • SubCommitteeLFC
  • LFC sub-committee meeting yesterday, see: LFCMeetJune24
  • Dantong has set up two front-end nodes w/ a backend Oracle cluster.
  • Hiro will start slow migration today
  • Panda group will need to use the testbed machine.
  • There are problems with the wlcg-client having two versions of Globus (see the diagnostic sketch after this list).
  • LFC pieces are from a binary distribution - this solves the DQ2 utilities, but introduces problems with other client programs.
  • Possibility of adding an HTTP interface to LFC - it would provide a clear separation between the client and the service; request from Kaushik to bring this up with the LFC developers. Dantong will contact them.
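  • For illustration only - a minimal Python sketch of how one might check for conflicting Globus libraries on LD_LIBRARY_PATH, as reported for the wlcg-client above. The library name pattern and environment layout are assumptions, not a description of the actual wlcg-client installation.

    # Hypothetical check for duplicate Globus libraries on LD_LIBRARY_PATH.
    # The library name pattern below is an assumption for this sketch.
    import os
    import glob
    from collections import defaultdict

    def find_duplicate_libs(pattern="libglobus_common.so*"):
        hits = defaultdict(list)
        for d in filter(None, os.environ.get("LD_LIBRARY_PATH", "").split(":")):
            for path in glob.glob(os.path.join(d, pattern)):
                hits[os.path.basename(path)].append(path)
        # keep only library names that appear in more than one directory
        return {name: paths for name, paths in hits.items() if len(paths) > 1}

    if __name__ == "__main__":
        for name, paths in sorted(find_duplicate_libs().items()):
            print("Multiple copies of %s:" % name)
            for p in paths:
                print("  " + p)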

RSV SE & CE probe update status (Fred)

  • Follow-up from last week:
    • SRM probes needed for AGLT2, SWT2, NET2
    • AGLT2 - has 2.0 probes, just not enabled. Will run configure.
    • BU - has RSV 2.0 running, but not reporting. Saul will follow up and will install OSG 1.0 by next week.
    • SWT2 - needs SRM probes. Did the upgrade, but may not have enabled the SRM probe.
    • BNL - why not reporting? Xin says it's reporting fine locally. Are the results going into Gratia correctly? Fred will follow up with Xin.
  • See https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvSAMGridView for links to SAM and Gridview reporting consoles.

Scheduling maintenance downtimes with the GOC (Sarah)

WLCG accounting

OSG 1.0 (Rob)

  • OSG 1.0 deployment status, issues
  • See https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/WebHome
  • Deployed 1.0
    • AGLT2
    • MWT2_UC
    • MWT2_IU
    • UC_ATLAS_MWT2
    • PROD_SLAC
    • UTA_DPCC
    • BNL_ATLAS_1 - all upgraded now.
  • Status update:
    • SWT2_CPB: post-integration, will be 1.0.
    • OU - will be upgraded later this month.
    • NET2 - few days.
  • glexec - deployed in production at BNL; required at SLAC, especially for analysis jobs.
  • glexec needs outbound access.
  • Has been tested on the ITB at BNL.

Site news and issues (all sites)

  • T1 (Dantong): yesterday had a major NFS downtime, 4-5 hours. The autopilot submit host VOMS proxy certificate issue was resolved (Marco).
  • AGLT2: currently having issues getting files back to BNL - Wenjing is working w/ Hiro; some files are not available. Pools got filled by the resilient dCache mechanism. Urged to use the new production space.
  • NET2: all is well.
  • MWT2: all is well.
  • SWT2 (UTA): CPB going down for a major overhaul, addition of new machines.
  • SWT2 (OU): all is well. Three head nodes are now on 10 G networks. Ready for testing w/ BNL after local tests.
  • WT2: talking with the ATLAS database group to set up a conditions database at SLAC (compute nodes have no IP connectivity). Analysis job issue - DQ2 and Panda problems with the port number missing from file URLs; this hits only SLAC since they are doing direct reads from xrootd. Need a consistent convention - is a single convention used by the pilot? Look at what the LRC interface is providing to see what the pilot is using (storage default location).
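  • For illustration only - a minimal Python sketch of the missing-port normalization idea mentioned for WT2 above. The default port (1094) is the usual xrootd port, and the example URL is a placeholder; the convention actually adopted by the pilot and LRC is not specified here.

    # Illustration only: add a default port to root:// URLs when it is missing,
    # so direct xrootd reads do not depend on how the URL was registered.
    from urllib.parse import urlsplit, urlunsplit

    XROOTD_DEFAULT_PORT = 1094  # standard xrootd port; an assumption for this sketch

    def normalize_xrootd_url(url):
        parts = urlsplit(url)
        if parts.scheme == "root" and parts.port is None:
            parts = parts._replace(netloc="%s:%d" % (parts.hostname, XROOTD_DEFAULT_PORT))
        return urlunsplit(parts)

    if __name__ == "__main__":
        # placeholder URL, not an actual SLAC storage path
        print(normalize_xrootd_url("root://xrd.example.edu//atlas/mc/file.root"))
        # -> root://xrd.example.edu:1094//atlas/mc/file.root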

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • A bit of development to do. Carry-over
  • No results yet from tests at AGLT2.

Release installation via Pacballs (Xin)

  • Follow-up
    • Progress - discussed this morning. Fred is hoping this week to have the first set of pacballs installed in DQ2. Will test with some older releases on some test machines.
    • Need official naming scheme.
    • Releases get installed with a special Panda pilot job using the software role. Expect performance to improve.
    • Expect a couple of weeks of testing.
    • Goal to bring into production by end of the month (June).
  • There is a pacball release available which Xin has tested.
  • Saul: factorize into two problems - pacballs which define the release versus the delivery mechanism.

Throughput initiative - status (Shawn)

Nagios monitoring subcommittee (Dantong)

  • Available space reporting at all sites.
  • Tomasz was organizing a meeting to test globus-job-run (?)
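  • For context, a minimal Python sketch of the kind of basic globus-job-run test being discussed, assuming the standard gatekeeper contact-string syntax; the hostnames below are placeholders, not the actual US ATLAS gatekeepers.

    # Hypothetical sanity check: run a trivial job on each gatekeeper with
    # globus-job-run. The gatekeeper contact strings below are placeholders.
    import subprocess

    GATEKEEPERS = [
        "gk1.example.edu/jobmanager-fork",
        "gk2.example.edu/jobmanager-fork",
    ]

    def test_gatekeeper(contact, timeout=120):
        try:
            result = subprocess.run(
                ["globus-job-run", contact, "/bin/hostname"],
                capture_output=True, text=True, timeout=timeout,
            )
            return result.returncode == 0, (result.stdout or result.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as err:
            return False, str(err)

    if __name__ == "__main__":
        for gk in GATEKEEPERS:
            ok, msg = test_gatekeeper(gk)
            print("%-40s %s  %s" % (gk, "OK" if ok else "FAIL", msg))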

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

Site certification review

User LRC deletion (Charles)

  • Nurcan reports this is currently failing - Charles has addressed the reported bug. A new version is available for Nurcan to try; will follow up.

AOB

  • none


-- RobertGardner - 01 Jul 2008
