r4 - 10 Sep 2008 - 14:06:11 - RobertGardner

MinutesSep10

Introduction

Minutes of the Facilities Integration Program meeting, Sep 10, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Mark, Kaushik, Justin, Saul, Patrick, Charles, Rob, Wensheng, Tom, Bob, Wei, John, Sarah, Karthik, Wen, Fred, Hiro, Xin
  • Apologies: Horst, Michael, Nurcan
  • Guests: none

Integration program update (Rob, Michael)

WLCG websites

Next procurements

  • Any follow-ups from last week's status:
    • AGLT2
      • Trying to get final prices for Dell; Intel processors. Sun Thor system will go out today. MSU and UM have received Koi systems.
    • SWT2 - no update - getting started.
    • MWT2 - Preliminary pricing from Dell in hand for storage servers and networking; needs a small iteration. Current planning is a 364 TB (usable) procurement.
    • NET2 - negotiations still in progress with IBM, combining in a large order. No news.
    • WT2 - not buying this round.

Internet2 monitoring hosts

Operations overview: Production (Kaushik)

  • MC production on/off.
  • Jamboree week.
  • DQ2 going slow.
  • Follow-up issues:
    • Job eviction problems - work still going on. Trying out Condor-G fixes.
    • Subversion server loading - was heavily loaded; Squid server deployed last week on Thursday, backed out. Back to default now (no Squid).
    • Checksum errors - a corrupt dataset was scrapped entirely. This was caused by replacing a file at CERN. There is a new proposal to handle this case. No developments on checking checksums in data transfers.
    • PRODDISK integration
      • Next step: Paul needs to be involved as we go through the sites. Yuri will supervise migrating the sites one by one with Paul and site admins. Start with AGLT2 - put it fully into production.
        • AGLT2 still in test mode
      • Fine at Michigan; UC and IU are being tested.
    • Space tokens
      • Would like to do an inventory of deployed space tokens site-by-site next week.
      • USERDISK and pathena analysis jobs.
      • Need official page for this.

Shift report

  • Production up and down, no major site issues to report at this time.
  • See Yuri's summary.

Analysis queues, FDR analysis (Nurcan)

Operations: DDM (Hiro)

  • Follow-up issues:
    • Checksums for the US - waiting for Paul to put Adler32 into the pilot (for output files, in the registration), but checking dCache checksum and the catalog. Not implemented in Bestman, but Wei believes they can implement it. Paul will work on this after the space tokens are complete.
    • Wei has discussed the issue with Hiro - a list of pros/cons, sent to Jean-Philippe Baud. Has started discussion with FTS and DQ2 folks. There needs to be a discussion on where these checks should be performed.
    • In xrootd itself, you can get any checksum you want; not sure what is needed for Bestman - will discuss w/ Hiro.
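
As an illustration of the Adler32 check discussed above, here is a minimal sketch in Python of computing a file's Adler-32 checksum and comparing it with a catalog value. It is not the pilot, dCache, or Bestman implementation. The function names are hypothetical, and the assumption that the catalog stores the checksum as an 8-digit hexadecimal string is ours, not something stated in the meeting.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as f:
        # Read in chunks so that large data files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def checksums_match(path, catalog_checksum):
    """Compare a file's checksum with a catalog entry.

    Assumption: the catalog value is a hex string, possibly with
    different letter case or missing leading zeros.
    """
    expected = catalog_checksum.strip().lower().zfill(8)
    return adler32_of_file(path) == expected
```

Wherever the check ends up being done (pilot, storage element, or transfer service), the comparison is the same; only the source of the expected value changes.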

LFC migration

  • High priority to migrate to LFC asap
  • SubCommitteeLFC, see meeting notes LFCMeetSep10
  • Need a step-by-step plan for the final steps. Announce dates for conversion.

RSV and WLCG SAM (Fred, Karthik)

  • See https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvSAMGridView for links to SAM and Gridview reporting consoles.
  • For scheduling downtimes, the OIM system: https://oim.grid.iu.edu/
  • Karthik notes a problem overnight.
  • Fred notes there have been problems at the CERN level (RSV is reporting correctly), but these should be fixed in the monthly summary report.
  • Will put up testing interval information.
  • SE from UTA (gk03) - all is fine. Mark in contact with GOC to rename the common name. Should show up in the reporting at some point.

Site news and issues (all sites)

  • T1:
    • last week: there was an issue with WAN connectivity last Friday and Saturday - primary link from CERN to BNL went down; failover didn't work. Policy-based routing was removed from the border router, but Panda services were broken after the change. Primary link came back up and the previous configuration was restored. Discussing how to fix this problem, which only happens at the BNL Tier 1 due to the firewall. High priority to find a solution. Considering moving resources closer to the interface of the OPN. Probably would require at least a day of downtime.
    • this week: Hiro: dcache gridftp doors almost ready, testing next week. New thumpers will be ready next week (all will be deployed). ~20 thumpers.
  • AGLT2:
    • last week: turned back on - waiting for pilots.
    • this week: been getting autopilots since last week, and analysis queues are working. DQ2 end user tools cannot fetch files from the site. Mario Lassnig aware, ticket open. Probably will require a new release.
  • NET2:
    • last week: no problems - still trying to get some new hardware up (Harvard).
    • this week: all systems go, only one analysis job. Still working on HU networking. Will probably need Panda help soon.
  • MWT2:
    • last week: autopilot adjuster has been disabled so as not to interfere with pilot eviction troubleshooting.
    • this week: no big news. Space token based tests going on.
  • SWT2 (UTA):
    • last week: no problems
    • this week: all is well.
  • SWT2 (OU):
    • last week: no report
    • this week: Nothing much to report for OU, all is well, but the old OSCER topdawg cluster will be decommissioned on Friday, so I just asked the pandashift people to turn off submission. We'll get the new grid gatekeeper for the new Sooner cluster up and running soon, so hopefully we can restart production again soon. Everything else is running fine. Thanks, Horst
  • WT2:
    • last week: still working on the conditions database access. AGLT2 confirmed similar latency issues for the database access. Will be taking the issue to the 3D meetings at CERN. Is the time required for access significant compared to the total job time? Exception for access to CERN. There is a lot of effort required to set up another stream to a site. Still working on the network monitoring equipment. There is still some concern about the Web100 kernel and the reliability of the hardware.
    • this week: still working on conditions database.

Carryover issues (any updates?)

Release installation via Pacballs + DDM (Xin, Fred)

  • Testing pacball downloads, working on getting releases transferred by DQ2. Getting timeouts today (has been working fine before). Will send Savannah report.
  • Xin: still waiting for Alessandro to finish the last version of the scripts. Will be working with Torre's group to set up the submission system. Expect by end of the week.

Throughput initiative - status (Shawn)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

User LRC deletion (Charles)

  • Updates to site services are necessary. See SiteCertificationP6.
  • Two issues - not sure that all sites have implemented the changes. Won't exercise at the analysis jamboree until working at all sites.
  • John: was an issue of getting the additional apache modules installed.
  • Wei: at SLAC, can't use web server on the LRC, requires forwarding.

WLCG accounting

OSG 1.0

  • Following development of Globus gatekeeper errors 17 and 43 at some sites in OSG 1.0

Tier3

  • There is a separate subcommittee formed to redefine the whitepaper (Oct 1). Placeholder to follow developments.

Revised WLCG pledges

  • Need the planned pledge amounts. Rob to send info to Michael and Jim. In progress!

AOB

  • none


-- RobertGardner - 09 Sep 2008
