Minutes of the Facilities Integration Program meeting, January 16, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.


  • Meeting attendees: Rob, Tom, Saul, Fred, Charles, Rich, Xin, Hiro, Wensheng, Patrick, Kaushik, Nurcan, Mark
  • Apologies: Michael, John, Wei

Integration program update (Rob, Michael)

  • Working on the Phase 3 summary report:
  • Overarching near-term goals (previously due December 15) are:
    • Full and effective participation in FDR exercises
    • Establish 200 MB/s sustained disk-to-disk (d2d) throughput to all Tier2s
    • Analysis queues in routine production at all Tier2s
      • Analysis load generator / validation system
      • Replicate Rel 12 AODs to all Tier2, for routine pathena analysis
    • SRM v2.2 testing, pinning - make a connection to the OSG storage group.
  • Phase 4 plan outlined in IntegrationProgram
  • Upcoming meetings:
    • Jointly w/ OSG all-hands at RENCI / North Carolina, March 3-6, 2008
      • March 3 - OSG site administrator's workshop
      • March 4 - US ATLAS facility workshop
      • Website, agenda
    • US ATLAS Tier2/Tier3, last week of May 2008 - location: Ann Arbor

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Eowyn problems again yesterday - could this be improved if it were deployed at CERN? There are four instances pulling jobs for Eowyn now; it turns out a single instance at CERN is not enough (yesterday we ran out of jobs again). A large number (10K) of high-priority M5 reprocessing jobs during the weekend were sent to BNL; this prevented production jobs from going to the Tier2s.
    • Request from Charles - would be nice to post these situations in eLog or on the list.
    • Pavel will send jobs in a controlled way in the future.
    • Kaushik has stepped down as ATLAS production manager; his replacement is Alex Read, who will be controlling the job flow to each cloud. Site admins should communicate via the Panda shift crew.
  • Production shift report (Wensheng)
    • Normal production - three sites are on scheduled maintenance.
    • BU turned back on.

SRM v2.2 and pinning (Gabriele)

  • Follow-up on the bring-online functionality
    • Action item - report back with an up-to-date status from Miguel
  • Working with OSG storage and integration groups on SRM validation

LFC (John)

  • Following up:
    • Setup panda test site (Mark Sosebee)
    • Setup in autopilot (Torre)
    • Also need to check w/ Tadashi
    • Action item - John will organize meeting and will discuss with Mark
      • John has made contact with Kaushik and Torre, still to setup meeting

Operations: DDM (Alexei)

Analysis Queues (Bob, Mark)

  • See AnalysisQueues; updated DONE
  • Working everywhere except NET2 - this is being worked on; there is an environment variable issue with the location of the ATLAS releases. The environment variable OSG_APP is used, but the releases are elsewhere. It's a pilot3 issue (a sketch of the kind of lookup involved follows this list).
  • Validated with a pathena submission test job (Evgen)
    • Mark: Analysis queues: Mark will send a summary list of issues for sites to complete the analysis queue deployment. DONE
    • Kaushik: Likely AODs for analysis: Release 13, working backwards in task definitions. (CSC notes are still using Rel 12)
    • Mark: Prototype analysis task, on a site-by-site basis
    • Nurcan: Will provide a standard SUSY plotting package.
      • Once AODs are at the Tier2s, create ntuples using susyview
      • Validation macros in place. Will use validation sample as a test.
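
For context on the OSG_APP issue above, a minimal, hypothetical sketch of the kind of release-directory lookup involved is below. The $OSG_APP/atlas_app/atlas_rel/<release> layout and the fallback path are illustrative assumptions, not the actual pilot3 logic.

    # Hypothetical sketch (not the actual pilot3 code): locate an ATLAS release
    # directory, checking $OSG_APP first and then an assumed site-local fallback.
    import os

    def find_release_dir(release, fallback="/share/atlas/software"):
        """Return the first existing candidate directory for a release, or None."""
        candidates = []
        osg_app = os.environ.get("OSG_APP")
        if osg_app:
            # assumed layout under OSG_APP, for illustration only
            candidates.append(os.path.join(osg_app, "atlas_app", "atlas_rel", release))
        candidates.append(os.path.join(fallback, release))  # assumed fallback location
        for path in candidates:
            if os.path.isdir(path):
                return path
        return None

    print(find_release_dir("12.0.6"))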

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • SWT2_UTA (Patrick) - one step closer; still need to get registered in VORS; this will be delayed since there is no operations meeting on Monday. Expected after the 28th.
    • BNL mappings (Xin) - very close; there is a plan to change the names for BNL in a couple of places.
    • US ATLAS Facility view (Rob) - post resolution of the BNL mapping issue.

Throughput initiative - status (Shawn)

Throughput goals status and schedule

  1. Each site 200MB/s? (or best value): Status: AGLT2 and MWT2-UC have reached this. SLAC has reached a best value of 110MB/s. Next up: Wisconsin, then UTA, OU, MWT2-IU and NET2? The order could change, but assume we can finish all sites in the next two weeks.
  2. 10GE sites 400MB/s?: Status: AGLT2 and MWT2-UC have reached this value. Still need to test MWT2-IU. There are no 10GE hosts at OU or NET2, but enough machines in aggregate should be able to reach this level. Schedule? Estimate the remaining sites could be completed as part of the testing in 1) above.
  3. Long-term (24+ hours) of 500MB/sec BNL->Tier-2s? Status: We demonstrated 500-600MB/sec for most of the weekend two weekends ago.
  4. Demonstration of BNL->ALL_Tier-2s at 200MB/s EACH (1GB/sec) for a long period? Status: this will have to await new/upgraded doors at BNL and the completion of goals 1) and 2) above (see the back-of-envelope check after this list).
  5. Measurement of "maximum" burst-mode bandwidth for each site (20-60 minute period?) Status: This could be started once we complete 1) and 2) above. The achievable maximum may be limited by BNL's current configuration to somewhere between 700-800MB/sec. This testing could be completed in 1 week (assuming each site is already debugged and meeting goals 1) and 2), if applicable).
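
As a back-of-envelope check of goals 4 and 5 (assuming five Tier-2 destinations, as implied by "200MB/s EACH (1GB/sec)"), the requested aggregate exceeds the current BNL door limit:

    # Back-of-envelope check of goals 4 and 5; the five-Tier-2 count is an assumption
    # implied by "200 MB/s EACH (1 GB/sec)".
    n_tier2 = 5
    per_site_mb_s = 200
    aggregate_mb_s = n_tier2 * per_site_mb_s        # 1000 MB/s = 1 GB/s (goal 4)
    bnl_cap_mb_s = (700, 800)                       # current BNL door limit (goal 5)
    print("goal 4 aggregate: %d MB/s" % aggregate_mb_s)
    print("shortfall vs. current BNL cap: %d-%d MB/s"
          % (aggregate_mb_s - bnl_cap_mb_s[1], aggregate_mb_s - bnl_cap_mb_s[0]))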

  • Need from sites:
    • disk performance (a rough measurement sketch follows this list)
    • optimal number of streams on each site
    • add these to the site certification table to check off
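
One possible way for a site to get a rough local disk-write number ahead of the d2d tests is sketched below. This is illustrative only, not a procedure agreed at the meeting; the file path is a placeholder.

    # Rough sequential-write check (illustrative only): write a 1 GB file in
    # 1 MB chunks, fsync, and report MB/s.
    import os, time

    def write_throughput(path, size_mb=1024, chunk_mb=1):
        chunk = b"\0" * (chunk_mb * 1024 * 1024)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start
        os.remove(path)
        return size_mb / elapsed

    print("%.0f MB/s" % write_throughput("/tmp/throughput_test.dat"))  # placeholder path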

  • This coming week:
    • UTA - will start next week; Jay notes they need iperf (see the iperf sketch after this list)
    • BU - will still be limited to a single host with a 1G link; can they reach 120 MB/s d2d (close to the ~125 MB/s theoretical limit of 1 Gb/s)? Saul will send the path to Hiro and Jay.
    • SLAC - has demonstrated 110 MB/s already. Two gridftp doors with a BeStMan SRM. Awaiting the 10G upgrade for further tests.
    • Monday meeting - status update from all the sites
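
A hedged example of how a site might sweep iperf parallel-stream counts is below; it assumes an iperf (v2) client is installed and a server is already listening at the far end, and the hostname is a placeholder.

    # Illustrative iperf stream-count sweep (assumes an iperf v2 client is installed
    # and an iperf server is already running on the far end; hostname is a placeholder).
    import subprocess

    server = "iperf.example.org"   # placeholder, not a real endpoint
    for streams in (1, 2, 4, 8):
        # -c: client mode, -P: parallel streams, -t: test length (s), -f M: report in MBytes/s
        result = subprocess.run(
            ["iperf", "-c", server, "-P", str(streams), "-t", "30", "-f", "M"],
            capture_output=True, text=True)
        print("streams=%d" % streams)
        print(result.stdout)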

  • Shawn will create a table in the LoadTestsP3 task for path and local I/O performance.

  • Current BNL limitation is about 700-800 MB/s; what are the upgrade plans? 8 doors presently.

Panda release installation jobs (Xin)

  • Couple of jobs completed at SLAC
  • Xin would like to remove a release and test its re-installation.
  • Status/progress on permissions problems:
    • submitting test jobs which scan directories (a sketch of such a scan follows below)
    • no problem on most Tier2 sites, but there is a problem with dCache at BNL. Jobs are failing at BU - perhaps related to the OSG_APP environment; Saul is changing it back.
  • Next steps:
    • more test jobs, real installation at SLAC
    • If this is good, will push to more sites
    • Change to Panda monitor to isolate release installation jobs? Xin will discuss with Torre.
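
For illustration, here is a sketch of the kind of directory scan such a permissions test job might run; the function and the fallback to /tmp are hypothetical, not Xin's actual test job.

    # Illustrative permissions scan (not Xin's actual test job): walk a release
    # area and report paths the job's account cannot read.
    import os

    def scan_permissions(top):
        problems = []
        for dirpath, dirnames, filenames in os.walk(
                top, onerror=lambda err: problems.append(str(err))):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                if not os.access(path, os.R_OK):
                    problems.append("unreadable: " + path)
        return problems

    release_area = os.environ.get("OSG_APP", "/tmp")  # fall back to /tmp for a dry run
    for issue in scan_permissions(release_area):
        print(issue)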

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • There are some persistent red flags: the OU gatekeeper is red, and so is the Wisconsin LRC.
  • Slowness of gatekeeper at UC - improved.
  • Split of the Nagios server into internal and external instances - still working on this. Work has now started and the server has been built; the external instance will be moved to a new server. Schedule unknown.
  • RSV publishing to WLCG
    • Dantong - looking into US Facility reporting of SAM data; entries are not appearing. Will follow up with Rob Q.
  • Local RSV to Nagios publishing.

Site news and issues (all sites)

  • T1: GUMS bug found and fixed.
  • AGLT2: Is the job input rate sufficient? Bob: 900+ slots - there was a problem with communication to the gatekeeper. The job rate is about 100/hour. Need more jobs per pilot cycle, and need to reduce the latency between jobs being submitted and scheduled. Would increasing the number of jobs in the submit state help? Is this limited by the Condor job manager taking too much time to negotiate the match? What is the problem with jobs in the "tchk" state? Need to understand this. The gatekeeper looks fine. Consult Torre on the meaning of tchk (a problem w/ communication back to the submit host); consult Jamie Frey on Condor-G --> Condor scheduling problems. Increase the queue depth.
  • NET2: ordering a new gatekeeper w/ fiber channel attached directly to backend storage.
  • MWT2: Follow-up on gatekeeper slowness: http://www.mwt2.org/sys/gatekeeper; purchase arrived: 105 dual-socket, dual-core Opteron 2218 nodes (65 at UC, 40 at IU); System View
  • SWT2_UTA: DPCC dead due to a bad internal switch, to be replaced. New cluster is up and running jobs (200 cores). UTA_SWT2 account.
  • SWT2_OU: All working well. OSCER upgraded to OSG 0.8. The OU Condor pool is being upgraded. Problems w/ the motherboard on the gridftp server (keeps crashing w/ dropped packets, not understood) - no update. Working on the 10G upgrade.
  • WT2: Follow-up on: 10G network, Ganglia monitoring (for external viewing), install of recent purchase: (34 machines - 272 cores). Wei:
    1. Working on the 10Gbit upgrade. The situation is better than we thought, but there is nothing to report at this time. Currently reaching 110 MB/s. Need to keep the peace with the rest of the Lab, so can't push beyond that for now.
    2. Ganglia monitoring is up
    3. CPU installation is in progress.

RT Queues and pending issues (Tomasz)

Carryover action items

New Action Items

  • See carryover items and new items in bold above.


  • none

-- RobertGardner - 15 Jan 2008
