r2 - 30 Jan 2008 - 12:24:25 - RobertGardner



Minutes of the Facilities Integration Program meeting, January 23, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.


  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

  • Working on the Phase 3 summary report.
  • Overarching near term goals (previously December 15) are:
    • Full and effective participation in FDR exercises
    • Establish 200 MB/s sustained d2d throughput to all Tier2s
    • Analysis queues in routine production at all Tier2s
      • Analysis load generator / validation system
      • Replicate Rel 12 AODs to all Tier2, for routine pathena analysis
    • SRM v2.2 testing, pinning - make a connection to the OSG storage group.
  • Phase 4 plan outlined in IntegrationProgram
  • Upcoming meetings:
    • Jointly w/ OSG all-hands at RENCI / North Carolina, March 3-6, 2008
      • March 3 - OSG site administrator's workshop
      • March 4 - US ATLAS facility workshop
      • Website, agenda
    • US ATLAS Tier2/Tier3, last week of May 2008 - location: Ann Arbor

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Follow-up on Eowyn scalability problems.
  • Production shift report (Mark/Nurcan)

SRM v2.2 and pinning (Gabriele)

  • Follow-up on the bring-online functionality
    • Action item - report back on update from Miguel
  • Working with OSG storage and integration groups on SRM validation
    • Discussed at last week's OSG Integration meeting
    • Three OSG validation sites are installing dCache 1.8 for testing: BNL, UC, and LBNL, in addition to the Fermilab site. The timeline is to have these sites functional and ready for testing by Feb 1.

LFC (John)

  • Following up:
    • Set up Panda test site (Mark Sosebee)
    • Set up in autopilot (Torre)
    • Also need to check w/ Tadashi
    • Action item - John will organize meeting and will discuss with Mark
      • John has made contact with Kaushik and Torre; meeting still to be set up

Operations: DDM (Alexei/Kaushik/Hiro)

Analysis Queues (Bob, Mark)

  • See AnalysisQueues; updated DONE
  • Working everywhere except NET2, where work is ongoing: an environment variable issue with the location of the ATLAS releases. The environment variable OSG_APP is used, but the releases are elsewhere. It's a pilot3 issue. Follow-up
  • AOD/ESD-based analysis demonstrator across Tier2s - Suggestions:
    • Using either Release 12 or 13 datasets
    • Realistic physics analysis task - reading datasets resident at Tier2s, performing analysis, returning results back to the physicist
    • Metrics: #jobs, input dataset size (# files, #TB), output dataset size, #sites involved and distribution of jobs per site, processing time per job and total clock time start to finish, set of output histograms demonstrating results.
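The metrics suggested above could be collected in a small record per demonstrator run. The following is only a sketch: the class name, field names, and units are hypothetical and not part of any agreed format, and the output histograms are left out.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DemonstratorRun:
    """Hypothetical record of one analysis-demonstrator run (names are illustrative)."""
    input_files: int            # number of input files read
    input_tb: float             # input dataset size in TB
    output_gb: float            # output dataset size in GB
    wall_clock_hours: float     # total clock time, start to finish
    jobs_per_site: dict[str, int] = field(default_factory=dict)
    job_minutes: list[float] = field(default_factory=list)  # processing time per job

    @property
    def n_jobs(self) -> int:
        """Total number of jobs across all sites."""
        return sum(self.jobs_per_site.values())

    @property
    def n_sites(self) -> int:
        """Number of sites that ran at least one job."""
        return sum(1 for n in self.jobs_per_site.values() if n > 0)

    def site_shares(self) -> dict[str, float]:
        """Fraction of all jobs that ran at each site."""
        total = self.n_jobs
        if total == 0:
            return {site: 0.0 for site in self.jobs_per_site}
        return {site: n / total for site, n in self.jobs_per_site.items()}

    def mean_job_minutes(self) -> float:
        """Average processing time per job, or 0.0 if none were recorded."""
        return mean(self.job_minutes) if self.job_minutes else 0.0
```

A run would be filled in from the Panda monitor or job logs; how those values are extracted is not covered here.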

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • Follow-up - last week: SWT2_UTA (Patrick) - one step closer; still need to get registered in VORS; will be delayed since there is no operations meeting on Monday. Post 28th.
    • Follow-up - last week: BNL mappings (Xin) - very close; there is a plan to change the names for BNL in a couple of places.
    • US ATLAS Facility view (Rob) - post resolution of the BNL mapping issue. (still pending)

Throughput initiative - status (Shawn)

Throughput goals status and schedule (follow-up from last meeting)

  1. Each site 200MB/s? (or best value): Status: AGLT2 and MWT2-UC have reached this. SLAC has reached a best value of 110MB/s. Next up Wisconsin, then UTA, OU, MWT2-IU and NET2? Order could change but assume we can finish all sites in the next two weeks.
  2. 10GE sites 400MB/s?: Status: AGLT2 and MWT2-UC have reached this value. Still need to test MWT2-IU. There are no 10GE hosts at OU or NET2 but enough machines in aggregate should be able to reach this level. Schedule? Estimate the remaining sites could be completed as part of the testing in 1) above.
  3. Long-term (24+ hours) of 500MB/sec BNL->Tier-2s? Status: We demonstrated 500-600MB/sec for most of the weekend two weeks ago.
  4. Demonstration of BNL->ALL_Tier-2s at 200MB/s EACH (1GB/sec) for long period? Status: this will have to await new/upgraded doors at BNL and the completion of goals 1) and 2) above.
  5. Measurement of “maximum” burst mode bandwidth for each site (20-60 minute period?) Status: This could be started once we complete 1) and 2) above. The maximum "maximum" may be limited by BNL's current config at somewhere between 700-800MB/sec. This testing could be completed in 1 week (assuming each site is already debugged and meeting goals 1) and 2) if applicable).
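For reference, the arithmetic behind these targets can be sketched as below. This is illustrative only: the helper names are hypothetical, decimal units are assumed (1 GB = 1000 MB), protocol overhead is ignored, and the count of five Tier-2s is inferred from the 1GB/sec figure in goal 4.

```python
def link_capacity_mb_per_s(gigabits_per_s: float) -> float:
    """Raw line rate of a link in MB/s (8 bits per byte, no protocol overhead)."""
    return gigabits_per_s * 1000 / 8


def aggregate_mb_per_s(per_site_mb_per_s: float, n_sites: int) -> float:
    """Total rate the source must sustain when every site runs at the same rate."""
    return per_site_mb_per_s * n_sites


def terabytes_moved(rate_mb_per_s: float, hours: float) -> float:
    """Data volume in TB transferred at a constant rate over the given time."""
    return rate_mb_per_s * hours * 3600 / 1e6


if __name__ == "__main__":
    # A single 1GE host tops out at 125 MB/s, so 200 MB/s needs more than one
    # 1GE host or a 10GE host.
    print(link_capacity_mb_per_s(1))
    # Goal 4: 200 MB/s to each of five Tier-2s (assumed count) is 1000 MB/s.
    print(aggregate_mb_per_s(200, 5))
    # Goal 3: 24 hours at 500 MB/s.
    print(terabytes_moved(500, 24))
```

The same line-rate figure bounds what any single 1GE host can achieve disk-to-disk.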

  • Need from sites:
    • disk performance
    • optimal number of streams on each site
    • add these to the site certification table to check off

  • This coming week: follow-up
    • UTA - will start next week; Jay notes need iperf
    • BU - will still be limited to a single host with a 1G link; can they reach 120 MB/s d2d? Saul will send the path to Hiro and Jay.
    • SLAC - have demonstrated 110 MB/s already. Two gridftp doors with bestman SRM. Awaiting 10G upgrade for further tests.
    • Monday meeting - status update from all the sites

  • Shawn will create a table in the LoadTestsP3 task for path, local I/O performance.

Panda release installation jobs (Xin)

  • Next steps (from last time, follow-up)
    • more test jobs, real installation at SLAC
    • If this is good, will push to more sites
    • Change to Panda monitor to isolate release installation jobs? Xin will discuss with Torre.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Follow-up: Split of Nagios server into internal and external - work has now started. The server has been built. The external server will be moved to a new server.
  • RSV publishing to WLCG
    • Dantong - looking into US Facility reporting of SAM data; entries are not appearing. Will follow-up with Rob Q.
    • Local RSV to Nagios publishing.

Site news and issues (all sites)

  • T1:
  • AGLT2: Follow-up on job-rate problems
    • From last time: Is the input rate of jobs sufficient? Bob: 900+ slots - there was a problem with communication with the gatekeeper. Job rate is about 100/hour. Need more jobs per pilot cycle, and need to reduce the latency between the submitted and scheduled states. Would increasing the number of jobs in the submit state help? Is this limited by the Condor job manager taking too much time to negotiate the match? What is the problem with jobs in the "Tchk" state? Need to understand this. Gatekeeper looks fine. Consult Torre on the meaning of Tchk (a problem w/ communication back to the submit host); consult Jamie Frey on Condor-G --> Condor scheduling problems. Increase Q-depth.
  • NET2:
  • MWT2:
  • SWT2_UTA:
  • SWT2_OU: Follow-up: 10G upgrade, gridftp server crashes - host/motherboard problems.
  • WT2: Follow-up on new installation
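On the AGLT2 job-rate question above, a rough steady-state estimate (Little's law: occupied slots = arrival rate x mean job duration) shows why about 100 jobs/hour may not fill 900 slots. The sketch below uses hypothetical function names, and the mean job duration is an assumed input, since no figure was reported in the meeting.

```python
def hours_to_fill(slots: int, jobs_per_hour: float) -> float:
    """Time to occupy every slot from empty, assuming no job finishes meanwhile."""
    return slots / jobs_per_hour


def required_jobs_per_hour(slots: int, mean_job_hours: float) -> float:
    """Input rate needed to keep all slots busy in steady state."""
    return slots / mean_job_hours


def steady_state_occupancy(jobs_per_hour: float, mean_job_hours: float, slots: int) -> float:
    """Average number of busy slots (Little's law), capped at the slot count."""
    return min(jobs_per_hour * mean_job_hours, slots)


if __name__ == "__main__":
    # Figures from the minutes: 900 slots, about 100 jobs/hour.
    print(hours_to_fill(900, 100))
    # The 3-hour mean job duration is an assumption for illustration only.
    print(required_jobs_per_hour(900, 3.0))
    print(steady_state_occupancy(100, 3.0, 900))
```

With a measured mean job duration in place of the assumed one, the second function gives the input rate the pilot submission would need to reach.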

RT Queues and pending issues (Tomasz)

Carryover action items

New Action Items

  • See items in carry-overs and new in bold above.


  • none

-- RobertGardner - 22 Jan 2008
