r3 - 03 Dec 2008 - 14:41:50 - RobertGardner



Minutes of the Facilities Integration Program meeting, Dec 3, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Saul, Shawn, Charles, Marco, Rob, Fred, Patrick, Michael, Nurcan, Kaushik, Mark, John, Wei, Bob, Justin, Rich, Douglas, Dantong, Torre, Hiro, Jim
  • Apologies: Michael after 2pm.
  • Guests:

Integration program update (Rob, Michael)

  • IntegrationPhase7
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens DONE
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated DONE
    • Transition to /atlas/Role=Production proxy for production DONE
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed DONE
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Support upcoming Jamborees
  • ANL Analysis Jamboree, Dec 9-12
  • BNL Analysis Jamboree, Dec 15-18, 2008: agenda, BNLJamboreeDec2008
  • Next US ATLAS Facilities face-to-face meeting (past meetings):
    • Will be co-located with the OSG All-Hands meeting at the LIGO observatory in Livingston, LA, March 2-5, 2009 Agenda
    • US ATLAS: March 3, 9am-3pm - focus on readiness for data, and Tier 3 integration - organized by Maxim Potekhin
  • Tier 0/1/2/3 Jamboree - Jan 22, 2009
  • Tier 3 study group is making good progress - a draft document is available. Consolidating input regarding workflows, requirements.
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open usually): http://integrationcloud.campfirenow.com/1391f
  • Upcoming reviews:
    • Program managers' review in January
    • February agency review

WLCG Reporting (Fred)

  • US_LHC_Tier2_Activity_for_November_2008.pdf: November WLCG report (preliminary)
    • From Brian Bockelman: Attached is the draft of the November 2008 Tier 2 metrics, which include both availability and usage statistics. Because of the current RSV outage, we took the numbers from the WLCG; there does not appear to be any major problems, although we won't know for sure until the RSV data is restored. I don't have a good estimate of when the RSV restoration will happen, and the WLCG numbers are high enough that I'm willing to risk a draft..... It's a good idea to closely examine this report. In one week, if there are no major comments or problems, I'll double-check the numbers (switching to RSV, if available) and release the official report. Thanks, Brian
  • Reporting for the last month has looked pretty good. There has been one site having difficulty with RSV. Availability numbers look good.
  • There have been problems on the other end - WLCG SAM reporting.
  • Fred will follow up w/ OSG to check for any discrepancies.

Operations overview: Production (Kaushik)

Shifters report (Marco)

Holiday production

  • Will continue operations during the holiday
  • There will be a few days with no shifts: Christmas Day, New Year's Day, and possibly the eves.
  • No active shifters at CERN, but there will be a person on call on a best-effort basis.

Local Site Mover (Marco, Paul, John, Charles)

  • Specification: LocalSiteMover
  • code
  • last week:
    • Now testing.
    • Week from today expect to have the final version working. This will bring up the Harvard site.
  • Test jobs sent by Paul worked. Code is now ready.

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • Analysis shifts, http://indico.cern.ch/conferenceDisplay.py?confId=46430
  • Two queues set to test mode. Scheddb value overwritten. Waiting for Ganga jobs to complete.
  • LFC problem at SLAC solved by Paul, but there is still a problem. Paul and Wei looking into it. Would like more frequent Ganga robot submissions.
  • Pathena stress tests - Benjamin contacted for monitoring from ARDA dashboard. Waiting for response. Might do this as a joint activity w/ French cloud.
  • Analysis support during Christmas - most shifters will be away. Support will be on a best-effort basis. There may be a couple of days of shift for Christmas.
  • There was a request from the Canadian cloud - request offline/online be managed centrally.

Operations: DDM (Hiro)

  • All sites are back to normal.
  • BNL SRM update caused some problems - change in IP, DNS latency.
  • BNL switch to LFC - caused permissions problems - solved.
  • UTA, AGLT2 proxy problems fixed - delegation gets stuck for space token areas, Hiro has to intervene with the FTS server. "Could not load client credentials..". Fixed. Non-space token endpoints use myproxy. (Some sites, like UC, still use both). John notes it happened twice at BU over the last month.
  • Discussion about operational stability of FTS, requirement of restarting DQ2 site services.
  • There is an open ticket with the FTS developers.
  • This needs to be discussed in ADC operations. Michael will bring this up at tomorrow's ADC ops meeting.

LFC migration

  • SubCommitteeLFC
  • last week
    • BU - to locate at Harvard due to firewall issues this week.
    • OU - waiting on Pilot/Panda changes.
    • BNL - wait until week following Dec 1.
    • Stability issue: thread-safety issue with a Globus function - VDT is addressing this and will provide a patch to test shortly.
    • BNL - upgrade next week. Kaushik has given green light. Coordinate with Paul - requires changes in the pilot code. Will not use space tokens.
  • this week
    • There is a VDT fix coming soon for the problem with one of the libraries used by the LFC daemon.
    • BNL: roll back pilot changes related to space tokens - resulted in permissions issues. Long-term issue of using usatlas1 versus usatlas4 still needs to be worked out. Hiro has locked LRC for write operations. Looks like it's all functioning, waiting for test jobs to complete.
    • BU: starting to install at Harvard.

Site news and issues (all sites)

  • T1:
    • last week: Data replication is going on at high rate. 350 MB/s average over the last 7 days. Sites are appearing stable and handling rates up to 200 MB/s at a couple of sites. Very positive progress made here. There is a problem with cooling facilities at the Tier 1 (heat exchanger punctured), though no systems needed to be shut down. Rental of a 50 T chiller - arrives within 24 hours. $7000/month.
    • this week: Dantong: missing lib in workernode client, re-installed.
  • AGLT2:
    • this week: FTS proxy issue, otherwise all is well. In process of bringing up Lustre. Checking out Lustre+bestman. LFC inconsistency yesterday - still looking into fixing it. Using Charles' script, but it needs modification. DB contents were lost for 27 hours. Will need to scan and re-register. (Wei has a related question about how to retrieve the LFN given an observed orphan in the storage system.)
  • NET2:
    • this week: John has been working on LFC migration and local site mover - solving tricky HU firewall issues. BU running smoothly; HU down pending migrations. Contact with Tufts - might join in production. All storage hardware arrived, 100 cores of blades arrived. Still need to set up perfSONAR machine. Problem with analysis queue - related to mistake in panda pilot, which apparently was fixed.
  • MWT2:
    • last week: dCache gridftp door problems resolved. A second gridftp door w/ 10G nic. Ready to load test. Bringing up new storage nodes.
    • this week: Disk-catalog-DQ2 catalog consistency checking - recovering from some lost metadata, nearly done now. Down to 28 files missing. New Dell storage server - image installed, configuring raid, switch. Did some throughput IU-UC; 200 MB/s.
  • SWT2 (UTA):
    • last week: CPB running smoothly. SWT2 - offline for upgrades.
    • this week: NFS server failure last week. Brought back, replaced file system. Serves /home for grid user directories and ATLAS releases. Kernel panics. Replace xfs from ext3.
  • SWT2 (OU):
    • last week: all is well. Transfer timeouts expiring. Marco submitting test jobs. Hiro changed ToA.
    • this week: all is well. LFC daemon died once - put auto-restart script in place. Kartik putting up perfsonar hosts - disk problems not well-understood.
  • WT2: no report.
    • last week: completed migration to LFC, production running fine. ANALY queue test jobs are successful, but Ganga Robot jobs fail - those with input files. Perhaps because of direct reading from storage. Sent email to Paul. January 8 - there will be a power cooling upgrade that will require 5 days of downtime.
    • this week: production is working fine. Working w/ Paul on analysis queue. glexec is working correctly for remote users - test jobs successful. Frontier testing.
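The catalog consistency checks mentioned above under AGLT2 and MWT2 (orphans in storage, files missing relative to the catalog) come down to two set differences. The following is only a minimal sketch, assuming the storage listing and the catalog contents have already been dumped to plain collections of paths; the function name and sample paths are hypothetical, and this is not Charles' script.

```python
def compare_storage_and_catalog(storage_paths, catalog_paths):
    """Compare a storage listing against catalog entries.

    Returns (orphans, missing):
      orphans - paths present in storage but not registered in the catalog
      missing - paths registered in the catalog but absent from storage
    Both inputs are assumed to be iterables of path strings.
    """
    storage = set(storage_paths)
    catalog = set(catalog_paths)
    orphans = sorted(storage - catalog)
    missing = sorted(catalog - storage)
    return orphans, missing


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    storage_dump = ["/atlas/data/a.root", "/atlas/data/b.root", "/atlas/data/c.root"]
    catalog_dump = ["/atlas/data/b.root", "/atlas/data/c.root", "/atlas/data/d.root"]
    orphans, missing = compare_storage_and_catalog(storage_dump, catalog_dump)
    print("orphans:", orphans)
    print("missing:", missing)
```

Orphans are candidates for re-registration or cleanup; missing entries are candidates for re-transfer or removal from the catalog.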

Carryover issues (any updates?)

Release installation via Pacballs + DDM (Xin, Fred)

  • Torre is working on submission of installation jobs. Will be trying.
  • Pacballs are being submitted automatically to all Tier 1's.

Squids and Frontier

  • follow-up from last status: Frontier server at BNL - almost done; will run some internal tests. Ready for testing in a week. Shuwei is providing re-processing jobs as tests - not sure of latest news - expectation is it's almost ready for tests w/ SLAC.
  • Dantong meeting w/ folks responsible at BNL, Carlos, John Stefano. Testing. 87 seconds raw Oracle query (Hong Ma's test script) reduced to 2.4 seconds w/ Frontier. W/ caching, milliseconds. Will do more performance testing next week. Two squid servers in front of Frontier.
  • SLAC - has direct access to the BNL squid server.
  • Douglas: script that does conditions data access - setup and working yesterday from SLAC. Will learn a bit about the squid server at SLAC - will test various setups. Will also test against frontier at CERN.
  • Frontier client provides connection from job. CORAL. Frontier libraries. Shuwei figuring this out.
  • Issues need to be brought up in the official ADC operations meeting. There will be site-specific configurations needed.
  • How large should the squid server be, etc.
  • More next week...
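As a quick check on the timings quoted above (87 seconds for the raw Oracle query, 2.4 seconds through Frontier), the improvement is roughly a factor of 36. The arithmetic, with a hypothetical helper name:

```python
def speedup(baseline_seconds, improved_seconds):
    """Ratio of baseline time to improved time (values > 1 mean faster)."""
    if improved_seconds <= 0:
        raise ValueError("improved_seconds must be positive")
    return baseline_seconds / improved_seconds


if __name__ == "__main__":
    # Timings taken from the minutes: raw Oracle query vs. query via Frontier.
    print(f"Frontier speedup: {speedup(87.0, 2.4):.2f}x")
```

The cached case ("milliseconds") is not quantified in the minutes, so it is left out here.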

glexec @ SLAC

  • follow-up from last status: Doug - there was a test job - a configuration problem that needs to be fixed. reinstalled wn-client on a rhel4 platform, this has been successful. Available for test jobs now. Jose C at BNL is working on changes in the pilot.
  • covered above - tested and working. DONE


  • None

-- RobertGardner - 02 Dec 2008


pdf US_LHC_Tier2_Activity_for_November_2008.pdf (93.7K) | RobertGardner, 02 Dec 2008 - 13:29 | November WLCG report