r3 - 03 Sep 2008 - 15:06:01 - RobertGardner

MinutesSep3

Introduction

Minutes of the Facilities Integration Program meeting, Sep 3, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rich, Shawn, Rob, Charles, Torre, Tom, Kaushik, Nurcan, Mark, Patrick, Jim, Armen, Hiro, Xin, Michael, Fred, Horst, Karthik, Bob, Wen, Wensheng, Dantong
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

  • See IntegrationPhase6
  • Upcoming meetings
    • 3-site Jamboree: 9-12 September
    • Next Tier2/Tier3 meeting - revised dates: 22-23 September; @BNL, webpage - register as soon as possible. Tomorrow a set of rooms will be reserved at the BNL rate.

WLCG websites

  • Moving to a new website, with new appearance. Will not only reflect core services, but also partnering sites.
  • Each Tier 1 will provide a link with more information, including for its Tier 2's.
  • We should set up a top-level page for the Tier 2's, with links to the individual sites, to provide a comprehensive view of the facility.
  • Need to get this done by the next week - before the LHC startup-fest on Sept 12.

Next procurements

  • Any follow-ups from last week's status:
    • Dell, IBM, Sun pricing news
    • AGLT2
      • Iterated with Dell yesterday by phone - expect to get a quote today. There was a problem with the blade server pricing, and the storage pricing differed from what we expected. Goal is to order items as they become available - for example, the blades.
      • Will purchase one Sun system once costs come through.
    • SWT2 - getting independent quote from Dell.
    • MWT2 - still awaiting final pricing from Dell and Sun. Considering x4540 + J4400 expansion. Calls today.
    • NET2 - negotiations still in progress with IBM, combining in a large order.
    • WT2 - not buying this round.

Internet2 monitoring hosts

  • Any update?
    • Software available end of month, install as servers arrive. Rich: code freeze by end of the week. Delivery date by Sep 4. Glitch: problems contacting the European box.
    • Fully deployed infrastructure at all sites by Sep 30

Revised WLCG pledges

  • Need the planned pledge amounts. Rob to send info to Michael and Jim

Operations overview: Production (Kaushik)

  • Have jobs, running production again, and running into some issues. Condor-G - submit host struggling to keep sites filled with pilots.
  • Follow-up on job eviction problems - Torre is working on the autopilot logic so that it does not evict jobs that are actually running. The cause is stale information in the Condor monitoring. The job kill timeout was increased from 3 hours to 48 hours; this was put in today.
  • subversion: problems with pilot code retrieval - front-end apache cache was not being used. ACF team putting squid cache in front of the server.
  • checksum errors - there have been some corrupted files used, created by previous versions of dq2-put. We don't have end-to-end checking of checksums in DQ2 transfer processes, and failures are extremely difficult to resolve. Discussed at ADC development meeting, conversations between DQ2 and FTS developers.
  • Follow-up on PRODDISK integration - seems to be working fine now.
    • Charles notes there is a new version of dccp that supports space tokens. Not known whether it's available in VDT.
    • Next step: Paul needs to be involved as we go through the sites. Yuri will supervise migrating the sites one by one with Paul and site admins. Start with AGLT2 - put it fully into production.
  • Which space tokens do we need in the near term? Need to start preparing the 5 space tokens as defined previously.
  • A storage token only used by pathena or ganga. USERDISK would be wide open.

Calibration splitter problem at UM

  • Contacting BDII server at IU, sometimes failing.

Shift report

  • All covered above.

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/Atlas/PathenaAnalysisQueuesUScloud
  • 14.2.20 - has already been tested.
  • Using top physics tools from CVS, created secondary DPDs, will test tertiary DPDs.
  • Will be exchanging email with Jim and Akira.
  • User support - Nurcan and Daniel will be organizing shifts for analysis users. Organizing the group now. Running jobs on all ANALY queues. (not all active, contacting Cloud reps to see which they want supported in the Ganga Robot).
  • Daniel will set up a Hypernews for analysis queues; may retire Pathena and Ganga hypernews.
  • Expect to start the shifts in two weeks.

Operations: DDM (Hiro)

  • Arda dashboard backend is down, not receiving call-backs, so information not reliable.
  • Checksums for the US - waiting for Paul to put Adler32 into the pilot (for output files, in the registration), but checking dCache checksum and the catalog. Not implemented in Bestman, but Wei believes they can implement it. Paul will work on this after the space tokens are complete.
  • Wei will contact Hiro for more details.
  • Is there an ATLAS-wide specification? FTS, DQ2 developers are talking.
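
To make the checksum item above concrete, here is a minimal sketch of computing an Adler32 value for a file and comparing it with a catalog entry. It is not the pilot, DQ2, or dCache implementation; the function names and the 8-digit lowercase hex format are assumptions made for illustration only.

```python
import io
import zlib


def adler32_hex(stream, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a binary stream as 8 lowercase hex digits.

    The stream is read in chunks so that large files need not fit in memory.
    """
    checksum = 1  # Adler-32 is defined to start from 1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def matches_catalog(stream, catalog_value):
    """Compare a freshly computed checksum with a value stored in a catalog.

    The catalog format (plain hex string) is a hypothetical choice.
    """
    return adler32_hex(stream) == catalog_value.strip().lower()


if __name__ == "__main__":
    data = io.BytesIO(b"Wikipedia")
    print(adler32_hex(data))
```

Computing the value once at the source and re-checking it after each transfer is what would give end-to-end checking; where the reference value is stored is left open here, as it was in the discussion.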

Full SURL in Catalogs

  • Follow-up.
The major points from our point of view are:

- We favor fully specified URLs in the catalog itself so as to
maximize the probability of success for client tools, leading
we believe to more stability in the production system and
a better experience for the physicists using the
available client tools.

- We don't want to introduce a strong dependence on another
external service, such as BDII, for job execution or data transfers.
Apart from introducing another single point of failure,
there may additionally be (unknown) scalability and
firewall issues to contend with for registrations/lookups
from pilot jobs.

- We especially don't want to introduce a dependence
if we don't have direct control over this service.  This
would exclude for example the BDII at the OSG GOC,
which receives GIP information for CE's and SE's only
for the purpose of monitoring in our context.

- Client tools can be made to accommodate compact URL versions
with sensible defaults, indeed this seems to have been done
quite well with the nordugrid clients.   Wrapper scripts
have already been contributed to Mario for dq2-* clients,
and to Paul for Panda pilot code by Marco.

- Tools can be made available to site administrators to modify
registrations in the catalog if the service end-points change.
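
As an illustration of the "compact URL with sensible defaults" point above, the sketch below expands a compact SURL into a fully specified one. It is not the wrapper code contributed for the dq2-* clients or the pilot. The function name, the default port 8443, the default endpoint /srm/managerv2, and the host se.example.org are all assumptions chosen for the example.

```python
from urllib.parse import urlsplit

# Assumed defaults for illustration; a real site may use different values.
DEFAULT_PORT = 8443
DEFAULT_ENDPOINT = "/srm/managerv2"


def expand_surl(surl, port=DEFAULT_PORT, endpoint=DEFAULT_ENDPOINT):
    """Expand a compact SURL (srm://host/path) into a fully specified one.

    A SURL that already carries an explicit '?SFN=' part is returned unchanged.
    """
    if "?SFN=" in surl:
        return surl
    parts = urlsplit(surl)
    if parts.scheme != "srm" or not parts.hostname:
        raise ValueError("not an SRM URL: %r" % surl)
    # Keep an explicit port if the compact form has one, else use the default.
    use_port = parts.port or port
    return "srm://%s:%d%s?SFN=%s" % (parts.hostname, use_port, endpoint, parts.path)


if __name__ == "__main__":
    print(expand_surl("srm://se.example.org/pnfs/example.org/data/file.root"))
```

Storing the expanded form in the catalog means a client needs no lookup in an external service to reach the file, at the cost of having to rewrite registrations if the end-point changes.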

LFC migration

  • SubCommitteeLFC, see meeting notes LFCMeetSep3
  • Hiro believes conversion will be quick since LRCs at Tier2's all have less than 1M.

RSV and WLCG SAM (Fred)

Site news and issues (all sites)

  • T1: there was an issue with WAN connectivity last Friday and Saturday - the primary link from CERN to BNL went down and failover didn't work. Policy-based routing was removed from the border router, but Panda services were broken after the change. The primary link came back up and the previous configuration was restored. Discussing how to fix this problem, which only happens at the BNL Tier 1 due to the firewall. High priority to find a solution. Considering moving resources closer to the interface of the OPN. Probably would require at least a day of downtime.
  • AGLT2: turned back on - waiting for pilots.
  • NET2: no problems - still trying to get some hardware up.
  • MWT2: autopilot adjuster has been disabled not to interfere with any of the
  • SWT2 (UTA): no problems
  • SWT2 (OU): no report
  • WT2: still working on the conditions database access. AGLT2 confirmed similar latency issues for the database access. Will be taking the issue to the 3D meetings at CERN. Is the time required for access significant compared to the total job time? Exception for access to CERN. There is a lot of effort required to set up another stream to a site. Still working on the network monitoring equipment. There is still some concern about the Web100 kernel, and reliability of the hardware.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • Update from Shawn. lcg-cp working; still need to work on registration. Now working - remove from carryover! DONE

Release installation via Pacballs + DDM (Xin, Fred)

  • Fred - have downloaded, installed and validated.
  • Stan, Xin and Alessandro will be at ADC meeting tomorrow - where to store the Pacballs.
  • Xin - in contact with Alessandro to use his script - still waiting on one change. After that's done, will start converting to a Panda job. Expect changes this week. In contact with Tadashi - has a release installation job transformation that can be modified. Xin notes the panda server will be able to discover which releases are installed at a site via curl to an EGEE portal.

Throughput initiative - status (Shawn)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

User LRC deletion (Charles)

WLCG accounting

OSG 1.0

  • Following development of Globus gatekeeper errors 17 and 43 at some sites in OSG 1.0

Tier3

  • There is a separate subcommittee formed to redefine the whitepaper (Oct 1). Placeholder to follow developments.

AOB

  • none


-- RobertGardner - 02 Sep 2008
