r6 - 20 Aug 2008 - 08:46:25 - RobertGardner



Minutes of the Facilities Integration Program meeting, April 9, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Saul, Rob, Charles, Marco, Tony, Michael, John BU, Horst/Karthik, Fred, Nurcan, Kaushik, Rich Carlson, Hiro, Bob, Wei.
  • Apologies: none

Integration program update (Rob, Michael)

Next procurements

  • Standing agenda item, see CapacitySummary.
  • Multi_Core_CPU.ppt: Some recent benchmarks - Tony Chan/BNL
  • Bought own SPEC license.
  • Noticing significant differences in performance between 1-core and 8-core jobs for ATLAS benchmarks, more pronounced than with SI2K.
  • Multi-core disadvantages noted - licenses and Condor configuration.
  • Use of virtual domains to isolate users from applications, and the "domain 0" base OS
  • Intel offering more competitive solutions than AMD.
  • Intel will be making further improvements to its Harpertown processor.
  • Nehalem is coming; expect an announcement at SC08.
  • What does AMD have beyond Barcelona (which was a year late)?
  • Bandwidth: with 8 cores per server, 1G networking is needed.
  • No news about new very-low-power processors from Intel. Need to consider SI2K/W.

RSV --> SAM (Fred, Rob Q - GOC)

  • Please see this link: https://www.usatlas.bnl.gov/twiki/bin/view/Admins/OSGservices#Site_Availability_Monitoring_SAM
  • Relatively few sites are sending RSV data correctly to SAM.
  • Sites need to configure RSV
  • Sites that are configured, but not reporting correctly, need to check this.
  • Gridview does the availability calculations.
  • ATLAS will participate in SE testing.
  • AGLT2 - passing 3 tests, failing one. There are some expired CAs.
  • Service versus personal proxies: for "local" use, we can use a service certificate, with the proxy renewed by cron.
  • GOC is producing availability statistics, not reliability statistics (reliability discounts scheduled downtime).
  • Gridview is very difficult to use.
  • Use April to transition into regular use, and have the first complete report in May.
  • Integration is still needed for the storage element, including non-dCache SEs. SLAC and SWT2 are running bestman-xrootd SEs, which will need to be instrumented. The probe does srm-ping, srm-read, and srm-write.
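GOC's distinction between availability and reliability can be made concrete with a small calculation. The sketch below assumes the WLCG-style convention (availability = up time / total time; reliability = up time / (total time - scheduled downtime)); the `Interval` record and the hours in the example are hypothetical, not actual GOC or Gridview data.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One reporting interval for a site (hypothetical record)."""
    hours: float
    up: bool                      # all critical probes passed
    scheduled_down: bool = False  # declared (scheduled) downtime


def availability(intervals):
    """Fraction of all time the site was up."""
    total = sum(i.hours for i in intervals)
    up = sum(i.hours for i in intervals if i.up)
    return up / total if total else 0.0


def reliability(intervals):
    """Fraction of time the site was up, excluding scheduled downtime.

    Assumed convention: scheduled downtime is removed from the
    denominator, so a declared outage does not count against the site.
    """
    total = sum(i.hours for i in intervals)
    scheduled = sum(i.hours for i in intervals if i.scheduled_down)
    up = sum(i.hours for i in intervals if i.up and not i.scheduled_down)
    usable = total - scheduled
    return up / usable if usable else 0.0


if __name__ == "__main__":
    # Hypothetical 30-day month: 648 h up, 24 h scheduled, 48 h unscheduled down.
    month = [
        Interval(648, up=True),
        Interval(24, up=False, scheduled_down=True),
        Interval(48, up=False),
    ]
    print(f"availability = {availability(month):.3f}")  # 0.900
    print(f"reliability  = {reliability(month):.3f}")   # 0.931
```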

DQ2 0.6.5 upgrade status/plan (Hiro)

  • Follow-up:
    • Still not stable, recommends waiting.
    • April 21: a test DB with the new schema will be available; intensive testing April 21-24; shutdown on the 25th, coming back with 1.0.
    • DQ2 1.0 should work with old clients, but what about site services?
    • Big-bang migration?
    • Don't upgrade yet.
    • No news:
      • Release notes will be set up by the development team and distributed when patches are available.
      • A dedicated mailing list will be set up for this.

Analysis Queue Update (Nurcan)

  • Deleting user datasets produced by pathena: currently under intense discussion.
  • Hiro/Charles interface now exists, but there are some integration issues with ATLAS tools.
  • Hiro: LRC now has an option to delete. Will require sites to modify their LRCs. Requires a host cert (uses a grid proxy).
  • Not ready yet for casual users.
  • DQ2 dataset catalog, LRC file catalog, and SE namespace need to be in harmony. A deletion tool needs to do this in a systematic way.
  • Kaushik has surveyed users and is finding confusion. Judgement is that the tools presently leave messes. There are SRM v2 capable tools, but it is not clear they work with our LRC. (Hiro: they can't possibly work.)
  • Want a single tool, dq2_rm. Charles believes a tool like that can be made available very quickly.
  • There may be some ownership problems/issues to work out.
  • Why are there so few jobs at T2 analysis queues? Expect them to come back - June Vancouver workshop; ntuples.
  • What kind of datasets do we really need? There is a lot of activity w/ Rel 12 AODs w/ CSC notes. Rel. 13? Any way to get more clarity here? Release all of them. Kaushik believes we have a lot of Rel 13 AODs subscribed.
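Keeping the DQ2 dataset catalog, the LRC file catalog, and the SE namespace in harmony amounts to a set comparison. The following is a hypothetical sketch over in-memory sets of file names; it does not call DQ2, LRC, or SRM, and `find_inconsistencies` and `delete_everywhere` are illustrative names, not the interface of the proposed dq2_rm tool. In practice each removal is a separate remote operation that can fail part-way, which this sketch ignores.

```python
from typing import Dict, Iterable, List, Set


def find_inconsistencies(catalogs: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each file name to the catalogs that do not know about it.

    Files present in every catalog are consistent and are left out.
    """
    all_files: Set[str] = set().union(*catalogs.values())
    report: Dict[str, List[str]] = {}
    for name in sorted(all_files):
        missing = [cat for cat, files in catalogs.items() if name not in files]
        if missing:
            report[name] = missing
    return report


def delete_everywhere(catalogs: Dict[str, Set[str]], names: Iterable[str]) -> None:
    """Remove each file from every catalog so that none is left dangling."""
    for name in names:
        for files in catalogs.values():
            files.discard(name)


if __name__ == "__main__":
    # Hypothetical contents of the three namespaces for one user dataset.
    catalogs = {
        "DQ2": {"a.root", "b.root", "c.root"},
        "LRC": {"a.root", "b.root"},
        "SE": {"a.root", "b.root", "d.root"},
    }
    print(find_inconsistencies(catalogs))
    # {'c.root': ['LRC', 'SE'], 'd.root': ['DQ2', 'LRC']}
    delete_everywhere(catalogs, ["a.root", "b.root"])
    print(find_inconsistencies(catalogs))
    # same report: a.root and b.root were consistent and are now gone everywhere
```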

Operations: Production (Kaushik)

  • Production summary
    • Biggest issue - mysterious autopilot problem, Condor-G submission to AGLT2, also at SLAC, UTA. Submit host state is not the same as the queues at the site. Wisconsin engaged.
    • Pressure to get FDR2 production done: 1-2 weeks to go.
    • Planning stages of working out mixing jobs at BNL.
    • Follow-up on DBRelease inconsistencies and consistency check features.
  • Production shift report
    • No report. Mark

Operations: DDM

  • ATLAS functional tests and throughput latency
    • On-going.
    • All is going well.
    • There will need to be a cleanup after this.
  • Follow-up:

SRM v2.2 functionality for storage elements (ATLAS April 2 milestone)

  • Michael notes that sites are being required to provide ATLASDATADISK, ATLASMCDISK space tokens. (Optional ATLASUSERDISK). April 25 is the (new) deadline. This has entered into an emergency state. Need a few guidelines as well.
  • AGLT2 - have implemented v2.2 with space tokens. Switching main channel. Migrating all data into dCache, updating LRC.
  • MWT2 - two dCache sites. At IU - up and running. At UC - installed and passing the validation suite; public endpoint not yet changed. We are still having reliability problems with the gridftp doors at UC. No work on space tokens.
  • WT2 - Plan A: finally got all new software together and testing; finding some minor bugs with bestman-xrootd. Plan B: has requested change in endpoint to SRM, but still seeing gridftp traffic. Will be implementing space tokens by setting up separate DQ2 endpoints (given by the path).
  • NET2 - going to install bestman-xrootd, with a single gsiftp instance, with a Posix filesystem. Still waiting on new hardware.
  • SWT2 - will start setting up along same lines as Wei.
  • Q: Do we need to publish anything into the LCG information system? We will want to publish this through OSG (it's not a requirement at the moment).

Throughput initiative - status (Shawn)

  • Report from Monday's meeting, see LoadTestsP5
  • Jay has set up new graphs, which are running regularly. They show network, network history, gridftp disk-to-disk, and its history. Would like to have a uniform environment at all the sites. BU will be in touch.

Panda release installation issues (Xin)

  • Follow-up on:
    • New format from Alessandro in information system for attributes about architecture and compiler. Needs to update script for this.
    • Needs to inform Tadashi of these updates.
    • Progress on using Pacman pacballs to do installations. Saul, Fred, John B... discussing with Alessandro and Xin about having identical installs in the US and Europe. Archived, MD5-checksummed releases, distributed like data. Installation is now simplified.
    • Installation pilots will be run using the autopilot using the "software" role. John will help.
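Distributing archived releases like data relies on checking each pacball against its published MD5 checksum after transfer. A minimal Python sketch of that check, assuming the expected digest is available as a hex string; the function names and the temporary file in the example are hypothetical, not part of Pacman or the installation scripts.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large release archives need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(path, expected_md5):
    """True if the archived release matches its published checksum."""
    return md5sum(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Hypothetical stand-in for a transferred pacball.
    with tempfile.NamedTemporaryFile(suffix=".pacball", delete=False) as tmp:
        tmp.write(b"hello\n")
    try:
        print(verify_release(tmp.name, "b1946ac92492d2347c6235b4d2611184"))  # True
    finally:
        os.remove(tmp.name)
```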

Site news and issues (all sites)

  • Review SiteCertificationP4 table
  • T1: John Hover is working w/ UW on the Condor-G issue discussed above.
  • AGLT2: Autopilot submission problems; few pilots arriving, anxiously awaiting word from experts. gridui07 problems. No other known problems.
  • NET2: Still getting new hardware ready. 128 cores of the new Harpertowns.
  • MWT2: dual quad-core Opterons delivered; working this week on dCache and SRM issues.
  • SWT2 (UTA): No big issues. Push is on setting up SRM.
  • SWT2 (OU): No problems. Still waiting for 10G equipment.
  • WT2: No problems; still some gridftp traffic after the SRM endpoint change in ToA.

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.
    • Still no news from John Gordon or EGEE (I've given up)

New Action Items

  • See items in carry-overs and new in bold above.


  • None.

-- RobertGardner - 08 Apr 2008

This site is a content mirror of the BNL US ATLAS TWiki.


Attachment: Multi_Core_CPU.ppt (643.5K), Tony's talk; uploaded by RobertGardner, 09 Apr 2008 - 12:49
Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.