MinutesOct1

Introduction

Minutes of the Facilities Integration Program meeting, Oct 1, 2008
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • old phone (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Kaushik, Patrick, Mark, Bob, Sarah, Shawn, Wei, Charles, Nurcan, John, Torre, Rich, Armen, Saul, Wen, Jim C, Ieng, Fred, Tom
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

WLCG websites

Next procurements

  • Follow-ups from known status:
    • AGLT2
      • All requisitions for the UM site are in the system as of today; expect the same at MSU today or tomorrow. Expect a short turn-around time from Dell. Hope to have equipment ready by October 15.
      • UCI has negotiated terms and conditions; some specifics are needed to identify the purchase within the program - contact needed w/ Ron Hubbard.
      • Expect the pricing matrix to be updated quarterly.
    • SWT2 - we started discussions w/ Dell. Will look.
    • MWT2 - Will soon have final quotes from UCI.
    • NET2 - have complete quotes in hand.
    • WT2 - not buying this round.

Internet2 monitoring hosts

Operations overview: Production (Kaushik)

  • In production mode again. Should be up for the next couple of days. Good opportunity to test some of the issues w/ Condor-G etc.
  • New Panda job state: "live". The pilot now sends a notification via the Panda monitor - a direct indication that the pilot is alive and running, so we don't have to rely on Condor. Very informative already: a huge number of jobs are in the live state, for two reasons: 1) failed Condor information; 2) the submission to Condor has already happened within the monitoring loop. This means lots of new information about pilots is coming in. These live jobs no longer show up as queued jobs, so queue clean-up can be done safely - hopefully today or tomorrow. (A heartbeat sketch follows this list.)
  • Had a meeting w/ the Condor team - will feed this kind of information back. They will try to incorporate it into a new version of Condor-G at some point, and will also address the accumulation of large numbers of jobs in the hold (H) state.
  • Follow-up issues:
    • Condor-G (see above)
    • PRODDISK integration - nearly finished; needs changes to scheddb. Kaushik will follow up. pilotcontroller.py: check out and commit, but don't run.
    • Space tokens - review of tokens by site
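A minimal sketch of the "live" heartbeat idea described above, assuming a hypothetical monitor endpoint and parameter names - this is not the actual Panda pilot/monitor protocol:

    import time
    import urllib.parse
    import urllib.request

    MONITOR_URL = "https://panda.example.org/updateJob"  # placeholder endpoint (assumption)

    def send_heartbeat(job_id, state="live"):
        # The pilot reports its state directly to the monitor, so liveness
        # need not be inferred from Condor job status.
        data = urllib.parse.urlencode({"jobId": job_id, "state": state}).encode()
        urllib.request.urlopen(MONITOR_URL, data=data, timeout=10)

    def pilot_loop(job_id, payload_done):
        # While the payload runs, periodically mark the job "live"; jobs that
        # stop heartbeating can then be cleaned from the queue safely.
        while not payload_done():
            send_heartbeat(job_id, "live")
            time.sleep(300)  # heartbeat interval: 5 minutes (arbitrary choice)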

Shifters (Mark)

  • AGLT2 - there might be an analysis queue issue; still being worked on today.
  • MWT2_IU - there was a network issue at BNL, resolved.
  • MWT2_UC - transitioning to space tokens
  • HU - jobs are not completing successfully; 1100 jobs in the holding state. Would like to troubleshoot using the analysis queue.

Cosmic data distribution

  • Discussions are ongoing. Would ask for specific streams, with the total not exceeding 25%. Estimated 2.5 TB raw / day, 1 TB ESD.
  • Have observed about 3 TB / day.
  • What about distribution to the Tier 2s? Jim C: perhaps people will be starting to look at the ESDs. Should we
  • Already getting 1/2 TB / day for AOD production. Capacity for both? Depends on the production requirements.
  • Why aren't we doing more (full) simulation? Jim will work on this.
  • Can each site cover this? 30 TB on DATADISK; 15-20 TB DATADISK. (A rough fill-time estimate follows this list.)
    • AGLT2 - DONE
    • NET2 - not quite.
    • MWT2 - DONE
    • SWT2 - DONE
    • WT2 - DONE
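A rough fill-time estimate from the figures above (3 TB / day observed, 1/2 TB / day of AOD, 30 TB of DATADISK); whether the AOD lands on the same token is an assumption here:

    # Back-of-the-envelope fill time using the figures quoted in the minutes.
    observed_per_day_tb = 3.0   # observed cosmic data rate
    aod_per_day_tb = 0.5        # AOD production already being received
    datadisk_tb = 30.0          # DATADISK capacity discussed above

    print("Days to fill at observed rate: %.0f" % (datadisk_tb / observed_per_day_tb))
    print("Days to fill if AOD shares the token: %.1f"
          % (datadisk_tb / (observed_per_day_tb + aod_per_day_tb)))
    # ~10 days at 3 TB/day, ~8.6 days including AOD - so retention and
    # clean-up policy matter at these rates.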

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/Atlas/PathenaAnalysisQueuesUScloud
  • Started analysis shifts - Nurcan on shift for the pathena side. Responding to hypernews messages.
  • New test planned for early next week: DPD-making jobs, TAG-selection jobs, and jobs which require contact with the conditions database.
  • How many slots should we keep available? Would like to adjust automatically as new activated jobs are waiting. (A slot-adjustment sketch follows this list.)
  • This can be controlled at a number of levels - Kaushik would prefer to have it controlled at the site level.
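A minimal sketch of site-level automatic slot adjustment, assuming two hypothetical helpers - count_activated() for the number of analysis jobs waiting and set_analysis_slots(n) for the site batch-system knob; neither is a real Panda or Condor API:

    MIN_SLOTS = 10    # always keep some analysis capacity warm
    MAX_SLOTS = 200   # cap so production is not starved (both numbers arbitrary)

    def adjust_slots(count_activated, set_analysis_slots, current):
        waiting = count_activated()
        if waiting > current:
            new = min(MAX_SLOTS, current + waiting // 2)  # grow toward demand
        elif waiting == 0:
            new = max(MIN_SLOTS, current - 10)            # shrink when idle
        else:
            new = current
        if new != current:
            set_analysis_slots(new)
        return new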

Operations: DDM (Hiro)

  • Minor issue overnight.
  • Follow-up issues:
    • Implementation of checksums - Adler32 support in the pilot and file catalog. Low on Paul's priority list at the moment. (A checksum sketch follows this list.)
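A short sketch of computing an Adler32 file checksum with Python's standard zlib module; the zero-padded 8-hex-digit formatting is a common grid convention, not something specified above:

    import zlib

    def adler32_of_file(path, chunk_size=1 << 20):
        value = 1  # Adler32 is seeded with 1, not 0
        with open(path, "rb") as f:
            chunk = f.read(chunk_size)
            while chunk:
                value = zlib.adler32(chunk, value)
                chunk = f.read(chunk_size)
        return "%08x" % (value & 0xFFFFFFFF)

Adler32 is much cheaper to compute than MD5, which is what makes it attractive for per-transfer verification.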

LFC migration

  • SubCommitteeLFC, see meeting notes LFCMeetOct1
  • Change to BNL MCDISK, rather than BNLPANDA - requires tasks to be properly defined with space tokens. Not yet done.

RSV and WLCG SAM (Fred, Karthik)

Site news and issues (all sites)

  • T1:
    • last week: Hiro: dCache gridftp doors almost ready, testing next week. New Thumpers will be ready next week (all will be deployed); ~20 Thumpers.
    • this week: The 5 new gridftp doors are in production, 13 doors total; load is being distributed well. Running a modified version of the pnfs code: CMS found that authorization checking through the Unix permissions system is expensive, and the modified version reduced load on the pnfs server significantly. The pnfs server was also migrated to new, more powerful CPUs.
  • AGLT2:
    • last week: been getting autopilots since last week, and analysis queues are working. DQ2 end-user tools could not fetch files from the site; Mario Lassnig aware, ticket open; probably requires a new release. Resolved in v21.
    • this week: also migrated to the fast pnfs server last week, and implemented the fast companion database as a MySQL HEAP table. Other tunings have been made; the database seems to be performing much better.
  • NET2:
    • last week: all systems go; only one analysis job. Still working on HU networking. Will probably need Panda help soon.
    • this week: there was one problem with LRC server access from BNL; it was intermittent and has been resolved.
  • MWT2:
    • last week: no big news. Space token based tests going on.
    • this week: PRODDISK space token migration - not all scheddb configurations were done. Looked at fast pnfs and other dCache optimizations.
  • SWT2 (UTA):
    • last week: all was well.
    • this week: still tracking down an issue with the AFS server - kernel panics.
  • SWT2 (OU):
    • last week: Nothing much to report for OU; all is well, but the old OSCER Topdawg cluster will be decommissioned on Friday, so I asked the pandashift people to turn off submission. We'll get the grid gatekeeper for the new Sooner cluster up and running soon, so hopefully we can restart production again shortly. Everything else is running fine. Thanks, Horst
    • this week: Not much new.
  • WT2:
    • last week: still working on conditions database access. AGLT2 confirmed similar latency issues for database access; will take the issue to the 3D meetings at CERN. Is the time required for access significant compared to the total job time? Exception for access to CERN. A lot of effort is required to set up another stream to a site. Still working on the network monitoring equipment; there is still some concern about the Web100 kernel and the reliability of the hardware.
    • this week: Nothing to report.

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • Xin: testing Alessandro's new script.

Throughput initiative - status (Shawn)

  • Will start testing the new doors. Hiro is doing some preliminary tests and will then go site-by-site.

SRM v2 and Space Tokens (Kaushik)

  • Follow-up on the issue of the atlas versus usatlas role.
  • The issue of dCache space-token-controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around the problem.
  • Update: will make this conversion w/ the LFC migration. (A role-check sketch follows this list.)
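A sketch of inspecting which VOMS role a proxy actually carries, relevant to the atlas-versus-usatlas question above. voms-proxy-info and its -fqan option are real; the wrapper and the exact FQAN strings are illustrative:

    import subprocess

    def proxy_fqans():
        # voms-proxy-info -fqan prints one FQAN per line for the current proxy.
        out = subprocess.run(["voms-proxy-info", "-fqan"],
                             capture_output=True, text=True, check=True)
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]

    def has_production_role(fqans):
        # Both forms discussed above: the US-specific group and the ATLAS-wide role.
        wanted = {"/atlas/usatlas/Role=production", "/atlas/Role=production"}
        return any(f.split("/Capability")[0] in wanted for f in fqans)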

User LRC deletion (Charles)

WLCG accounting

OSG 1.0

  • Following up on Globus gatekeeper errors 17 and 43 seen at some sites running OSG 1.0.
  • Request from Ruth regarding providing a test stand for Bestman and xrootd: 5 servers. BNL would be willing to host the servers and OS. Pre-release, pre-ITB.

Tier3

  • A separate subcommittee has been formed to redefine the whitepaper (Oct 1). Placeholder to follow developments.
  • There was a meeting this morning. Expect a new whitepaper by end of the month.
  • Chip is working on the pieces. Use cases really need to get pinned down.
  • Working on getting a better response rate to the survey of Tier 3s.
  • Should Tier 3 sites be enabled for Panda jobs? Need to preserve control for local availability.

Revised WLCG pledges

  • Need the planned pledge amounts. This has been completed; a correction for SLAC is needed.

AOB

  • Jim C: support model for Tier 3?


-- RobertGardner - 30 Sep 2008
