r3 - 06 Aug 2008 - 14:46:21 - RobertGardner



Minutes of the Facilities Integration Program meeting, Aug 6, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Wen, Rob, Fred, Tom, Charles, Nurcan, Mark, Sarah, Jim, Justin, Rich, Patrick, Hiro, Xin, Torre, Wei, Bob, Karthik, Horst
  • Apologies: Michael
  • Guests: none

Integration program update (Rob)

  • Overarching near term goals:
    • LFC migration
    • Complete the benchmarks of 200 MB/s sustained disk-to-disk throughput to all Tier2s
    • Analysis benchmarks demonstrated at increasing scale (100/200/500/1000 simultaneous jobs) at all Tier 2 facilities.
    • Storage upgrade: provisioning of capacities according to pledges on track for September 15 2008 deployment.
    • WLCG - SAM/RSV, reliability availability metrics for CE and SE reporting >80% for all sites.
    • Network performance monitoring infrastructure deployed.
  • Upcoming meetings
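
For a sense of scale, the 200 MB/s disk-to-disk benchmark listed above can be converted into a daily data volume. This is only an arithmetic sketch: decimal units (1 TB = 1,000,000 MB) are an assumption, and the function name is illustrative rather than taken from any facility tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def daily_volume_tb(rate_mb_per_s):
    """Data moved in one day at a sustained rate, in TB.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    # A Tier2 sustaining the benchmark rate for a full day
    print(round(daily_volume_tb(200), 2))  # 17.28
```

Under these assumptions, a site that holds the benchmark rate for 24 hours moves about 17 TB.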

LFC migration

  • SubCommitteeLFC
  • We need to put a structure with actions and timelines in place. Testing and drawing conclusions based on the results is becoming more and more urgent. Regarding data replication services, the US is becoming too special, with undesirable exceptions, manual interventions, etc.
  • Recent activity? Dantong: BNL hardware installation - smart switch - installed. Most data has been migrated into new LFC. There is also a delta upgrade.
  • Meeting: next Tuesday.

Next procurements

  • Standing agenda item, see CapacitySummary.
  • RFQs to be readied by week's end for each Tier2
  • We need a pricelist from Dell - another email sent. Some sites will have site agreements with Dell.

Internet2 monitoring host

  • UChicago_20080730.pdf: Quotation from Koi computers for perfsonar hosts - okay'd by Rich
  • Use consistent hardware.
  • Rich says a new release of Perfsonar will be available in 3-4 weeks.
  • No objections to using Koi as a supplier.

Follow-up issues

  • Storage capacity recommendations/guidance for the Facility (320 TB capacity, from Kaushik's model on MinutesJune11).
  • Revised WLCG pledges - need info by July 15. Action item for Rob (not done!)

Operations overview: Production (Mark)

  • Many more jobs into system in the past few days - have had 5-6K jobs.
  • Weekly shift meeting (Xavi, at CERN) - some items:
    • Autopilot submissions slow? Help from Condor - this has been difficult to troubleshoot since we've not had capacity. Has there been a scaling issue? Still an open question, will follow up.
    • Checksum errors - mismatches caused jobs to fail. Complicated - who is responsible for the checksum and data integrity (panda or dq2, FTS?). Adler32 versus md5, still not resolved.
    • Some random side issues - most have been addressed, no major outages.
    • There were a large number of validation job failures - resolved.
  • Kaushik's comment: dccp can use the Adler32, but we're moving to lcg-cp.
  • lcg-cp - progress from Paul - will switch to PRODDISK at Michigan tomorrow. Panda-dev server down for two days? pandadev02 is the server Tadashi uses - need this. From worker-node to the SE.
  • Want to see AGLT2 exercised for a week. Follow-up on this next week.
  • 20-40 TB needed for PRODDISK. Start with 20 TB. Follow-up each Tier2 next week.
  • wn-client from OSG 1.0 needed for lcg-cp.
  • Issue - high number of retries, at 25. Not necessary. This has been changed; it will be reduced.
  • Wei - reports there is a bug in lcg-cp with srm-bestman if the file does not exist. It's not yet packaged in glite. Wei requests that lcg-ls be used first to check file existence.
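
The Adler32 versus md5 question above comes down to which digest is computed over the file bytes and compared at each end of a transfer. The sketch below computes both in one pass using Python's standard zlib and hashlib modules. It is an illustration only: the function names and the 8-digit lowercase hex format for Adler32 are assumptions, not a description of how panda, dq2 or FTS handle checksums.

```python
import hashlib
import zlib


def checksums(chunks):
    """Return (adler32_hex, md5_hex) for an iterable of byte chunks."""
    adler = 1  # Adler-32 starting value for empty input
    md5 = hashlib.md5()
    for chunk in chunks:
        adler = zlib.adler32(chunk, adler)  # running Adler-32
        md5.update(chunk)
    # Mask to 32 bits and print as 8 hex digits (assumed format)
    return format(adler & 0xFFFFFFFF, "08x"), md5.hexdigest()


def read_chunks(path, size=1024 * 1024):
    """Yield the contents of a file in fixed-size chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    adler_hex, md5_hex = checksums([b"Wiki", b"pedia"])
    print(adler_hex)  # 11e60398
    print(md5_hex)
```

For a file on disk, `checksums(read_chunks(path))` gives the same pair without loading the whole file into memory. The two digests are not interchangeable, so a comparison only makes sense when both sides use the same algorithm.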

Shift report (Marco)

  • Downtime at BNL for Oracle database - responsible for backlog in file transfer
  • Esnet link problems at BNL? Probably not cause for backlog.

Analysis queues, FDR analysis (Nurcan)

  • A wiki page was set up to show the online/offline status of the US pathena analysis queues as well as their availability for various athena releases/packages, see the page at PathenaAnalysisQueues.
  • We have an analysis workshop at the end of August - there will be a user-support session in which Nurcan will present plans for the US. The plan is to provide combined support for pathena and ganglia.
  • Preparing for 3-site Jamboree in September.
  • Usage has been light this month, though we expect it to increase once users start again and update pathena on their hosts; the update has brokering to other sites in the cloud.
  • Kaushik comments that the brokering seems to be working.
  • Collecting information about Panda releases to be used during workshop.

RSV and WLCG SAM (Fred)

  • See https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvSAMGridView for links to SAM and Gridview reporting consoles.
  • For scheduling downtimes, the OIM system: https://oim.grid.iu.edu/
  • Things are looking good now.
  • SWT2_CPB:
    • There is a problem with one of the probes having jobs go into the un-submitted state. Patrick will increase timeout to see if this helps. It is an intermittent problem at various sites.
    • OIM registration - Mark and Patrick are addressing this.
  • NET2 - now working properly. There was an issue with a single host providing both CE and SE.
  • Kaushik - it's a lot of effort to install and support the software. Seems to be uncorrelated with what we need in ATLAS.
  • We still need to develop the standard for the frequency of running the probe.
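
As a reminder of what the >80% reliability and availability goal measures, here is a minimal sketch of the two ratios. It assumes the commonly cited definitions (availability is uptime over total time; reliability excludes scheduled downtime from the denominator). The function names and the example hours are hypothetical, and this is not the algorithm used by SAM or Gridview.

```python
def availability(up_hours, total_hours):
    """Fraction of the whole period during which the service was up."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Like availability, but scheduled downtime does not count against the site."""
    unscheduled_window = total_hours - scheduled_down_hours
    if unscheduled_window <= 0:
        raise ValueError("scheduled downtime covers the whole period")
    return up_hours / unscheduled_window


def meets_target(value, target=0.80):
    """True if a metric is above the target (80% in the goals above)."""
    return value > target


if __name__ == "__main__":
    # Hypothetical 30-day month: 720 hours total, 600 up, 48 scheduled down
    a = availability(600, 720)
    r = reliability(600, 720, 48)
    print(round(a, 3), round(r, 3), meets_target(a), meets_target(r))
```

Under these definitions, downtime scheduled in advance (for example through OIM) lowers availability but not reliability.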

Operations: DDM (Hiro and/or Alexei)

wlcg-client (Marco)

  • There is a new version which handles protocols correctly for SRM endpoints.
  • Running transfer tests from the sites - transferring files from each site. Will send around summary of tests.

OSG 1.0

Site news and issues (all sites)

  • T1: Upgrade of Oracle database for FTS - now finished.
  • AGLT2: file copy at AGLT2_UM; contact.
  • NET2: no report.
  • MWT2: no news. Will relocate ANALY_MWT2 this week. Improved throughput tests - re-did (just under 200 MB/s, maxed out BNL gridftp).
  • SWT2 (UTA): CPB cluster is now back online. Will schedule a time to upgrade SWT2_UTA - OS and Grid.
  • SWT2 (OU): all is well, have been in South Africa for a grid school. Will upgrade IBRIX in the next week or two.
  • WT2: upgraded xrootd to the latest release. Requires some changes in config and setup. Still setting up a machine for local conditions database.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • Update from Shawn. lcg-cp working; still need to work on registration. Carry-over

Release installation via Pacballs (Xin)

  • Need to follow up. Meeting this Friday.

Throughput initiative - status (Shawn)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

User LRC deletion (Charles)

  • Data deletion tool - completed. Testers needed. Updates at sites are needed.

WLCG accounting


  • There is a separate subcommittee formed to redefine the whitepaper. Placeholder to follow developments.


  • none

-- RobertGardner - 05 Aug 2008

pdf UChicago_20080730.pdf (43.0K) | RobertGardner, 05 Aug 2008 - 10:15 | Quotation from Koi computers for perfsonar hosts