
MinutesOct15

Introduction

Minutes of the Facilities Integration Program meeting, Oct 15, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rich, Rob, Charles, Sarah, Kaushik, Jim C, Horst, Karthik, John, Wei, Wen, Nurcan, Armen, Fred, Tom, Marco, Wensheng, Yuri, Tomasz, Patrick
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

  • IntegrationPhase7 under construction
  • IntegrationPhase6 final updates including SiteCertificationP6 and CapacitySummary (figures effective Sep 30)
  • Quarterly reports - this is needed for the agency reporting.
  • High level goals in Integration Phase 7 (from BNL workshop):
    • Pilot integration with space tokens
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated
    • Transition to /atlas/Role=Production
    • Storage
      • Procurements - keep to schedule, AFAIK
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Upcoming Jamborees
    • Probably will hold another US ATLAS Tier2/Tier3 meeting
      • late Fall (likely early December)
      • Location: TBD
  • OSG site admins meeting coming up: https://twiki.grid.iu.edu/bin/view/SiteCoordination/SiteAdminsWorkshop2008

Next procurements

  • Follow-up from reported status last week:
    • AGLT2: UM PO has gone to Dell. The MSU PO has made it to purchasing: 50 compute nodes, 3 storage servers.
    • SWT2: OU: working with Dell - looking at the matrix pricing. UTA - no updates.
    • MWT2: 360 TB (6 x 60TB PE2950/MD100 systems), 40 PE1950 servers - ordered.
    • NET2: in purchasing. ~400 TB raw. Will also buy some compute nodes.
    • WT2 - not buying this round.
    • Tier1: ready to start ramping up w/ FY09 funds. 1-2 PB of storage, and 2 MSI2K? - to be done shortly.

Operations overview: Production (Kaushik)

  • Have run out of jobs again. No more production tasks left, apparently.
  • The "transferring" jobs issue turned out to be due to a change in the Panda DB - Tadashi fixed it.
  • SLAC has not received any jobs.

Shifters report (Marco)

  • Migration of AGLT2 to LFC looks promising - have had some production jobs complete successfully.
  • MWT2_UC now making the switch.
  • Have found some cases where dCache gets confused.

Cosmic data requirements (Jim C)

  • 100% of ESDs to the T1 and to 4 of the 5 T2s; 25% (random) of RAW - possibly some specific streams instead of a random selection (a selection sketch follows this list).
  • 4 of the 5 Tier 2s will get the full 100% ESD sample.
  • Plan is to have Panda analyze these. Nurcan will develop panda jobs that create ntuples out of ESDs, working with David Adams and Lashkar.
  • Of order 10 users will be involved.
  • Have the Tier 2s been subscribed? Does this need ADC approval? Since it's internal to the US cloud, the approval should be automatic.
  • US egamma group is planning its own production of ntuples, using the analysis queues.
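
If the random option is kept, the selection itself is simple; below is a minimal Python sketch of picking a random quarter of a RAW dataset list for subscription. The dataset names and the seed are made up for illustration; the real list would come from a DDM/AMI query.

    import random

    # Hypothetical RAW dataset names; the real list would come from a DDM/AMI query.
    raw_datasets = [
        "data08_cos.00090100.physics_RPCwBeam.RAW",
        "data08_cos.00090101.physics_RPCwBeam.RAW",
        "data08_cos.00090102.physics_IDCosmic.RAW",
        "data08_cos.00090103.physics_IDCosmic.RAW",
    ]

    def pick_random_fraction(datasets, fraction=0.25, seed=None):
        # Reproducible random subset covering roughly `fraction` of the input list.
        rng = random.Random(seed)
        n = max(1, int(round(len(datasets) * fraction)))
        return sorted(rng.sample(datasets, n))

    for ds in pick_random_fraction(raw_datasets, fraction=0.25, seed=2008):
        print("would subscribe: " + ds)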

PRODDISK migration (Yuri)

  • Follow-up w/ status at each site
  • AGLT2 - done
  • MWT2_UC, UC_ATLAS_MWT2 - done
  • MWT2_IU - done; but not yet IU_OSG
  • SWT2_CPB - done
    • Update from last time: how best to use xrootd's internal mover to avoid SRM for transfers between the compute nodes and the SE. Prefer to use it for both read and write. Waiting on Paul for a pilot change; right now using the SRM server.
  • SWT2_OU - need to install Bestman
  • SLAC
    • ready - would also like to use the internal xrootd mover. There is also a pilot problem that needs to be fixed.
  • BU
    • Bestman - POSIX w/ GPFS. Shouldn't be a problem. Ready; just need to change ToA and pilotcontroller.
  • Yuri notes that all configuration changes will now be done through pilotcontroller.py (in SVN); all changes should go through Paul, Torre, or Tadashi, and the information there should match ToA (see the consistency-check sketch below).
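
Since the same endpoint information has to be kept in both pilotcontroller.py and ToA, a simple cross-check is worth automating. The sketch below is illustrative only: the two dictionaries are hypothetical stand-ins for however the pilotcontroller and ToA entries are actually parsed, and the endpoints are invented.

    # Hypothetical stand-ins for values parsed from pilotcontroller.py (SVN) and ToA.
    pilotcontroller_se = {
        "AGLT2": "srm://head01.example.org:8443/srm/managerv2",
        "MWT2_UC": "srm://uct2-dc1.example.org:8443/srm/managerv2",
    }
    toa_se = {
        "AGLT2": "srm://head01.example.org:8443/srm/managerv2",
        "MWT2_UC": "srm://uct2-dc2.example.org:8443/srm/managerv2",  # deliberately different
    }

    def diff_endpoints(a, b):
        # Report sites where the two sources disagree or one side is missing.
        problems = []
        for site in sorted(set(a) | set(b)):
            if a.get(site) != b.get(site):
                problems.append((site, a.get(site), b.get(site)))
        return problems

    for site, pc, toa in diff_endpoints(pilotcontroller_se, toa_se):
        print("MISMATCH at %s: pilotcontroller=%s ToA=%s" % (site, pc, toa))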

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/Atlas/PathenaAnalysisQueuesUScloud
  • Discussing stress test details.
    • First phase - will submit 100 TAG-selection jobs to the sites. They read AODs from the site's SE.
    • Sent single jobs to each site yesterday. SLAC: problem with pilots retrieving the jobs. MWT2_UC: mid-migration to LFC. NET2: fine; OU: fine; BNL: fine. AGLT2: dCache problem yesterday. SWT2: issue with setting up the space token, plus the question of direct reading of AODs - the xrootd system has not been enabled to translate to root URLs, which will require a pilot code modification (see the sketch after this list).
    • Second phase - repeat 10K job submission as was done in the Jamboree.
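
The root-URL translation mentioned above amounts to a path rewrite from the SRM form to the xrootd redirector form. The sketch below is only a guess at the shape of that rewrite, not the pilot's actual logic; the hostnames, port, and storage path are invented, and the real prefixes would come from ToA.

    # Invented endpoint pair for an xrootd-based SE; real values would come from ToA.
    SRM_PREFIX   = "srm://srm.example.edu:8443/srm/v2/server?SFN=/xrootd/atlasproddisk"
    XROOT_PREFIX = "root://redirector.example.edu:1094//xrootd/atlasproddisk"

    def to_root_url(surl):
        # Rewrite an SRM SURL into a direct-access root:// URL for the same file.
        if not surl.startswith(SRM_PREFIX):
            raise ValueError("unexpected SURL: " + surl)
        return XROOT_PREFIX + surl[len(SRM_PREFIX):]

    surl = SRM_PREFIX + "/mc08/AOD/AOD.026350._00001.pool.root.1"
    print(to_root_url(surl))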

Operations: DDM (Hiro)

  • Several problems at different sites.
  • A few sites have lost a few files (a verification sketch follows this list).
  • AGLT2 - is there an SRM problem? Working on it.
  • Panda mover proxy - is AGLT2 getting the wrong proxy at the moment? Transfers are going from the sm2 account.
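
For the lost-file reports, one way a shifter might confirm which replicas are really gone is to walk the suspect SURLs with lcg-ls and record the failures. The SURLs below are placeholders; in practice they would come from the DDM error logs or an LFC dump for the affected datasets.

    import subprocess

    # Placeholder SURLs; real ones would come from DDM error logs or an LFC dump.
    suspect_surls = [
        "srm://se.example.edu:8443/srm/managerv2?SFN=/pnfs/example.edu/atlasdatadisk/file1.root",
        "srm://se.example.edu:8443/srm/managerv2?SFN=/pnfs/example.edu/atlasdatadisk/file2.root",
    ]

    missing = []
    for surl in suspect_surls:
        # lcg-ls exits non-zero if the SURL cannot be listed on the SE.
        rc = subprocess.call(["lcg-ls", "-b", "-D", "srmv2", surl])
        if rc != 0:
            missing.append(surl)

    print("%d of %d suspect files are missing on the SE" % (len(missing), len(suspect_surls)))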

LFC migration

  • SubCommitteeLFC, see meeting notes LFCMeetOct15
  • AGLT2 is finished, and has run 40 jobs successfully.
  • MWT2_UC starting today.
  • We need to revisit the long-form vs. short-form URL question - what does lcg-cr actually do? Marco will follow up with Paul. (A sketch of the two forms follows this list.)
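
For reference on the long-form vs. short-form question: the two forms name the same file, one carrying the full SRM endpoint (port and web-service path plus ?SFN=) and one carrying only host and path. A minimal sketch of the difference, with an invented endpoint and path:

    # Illustrative only: the same file written two ways.
    long_form  = ("srm://se.example.edu:8443/srm/managerv2?SFN="
                  "/pnfs/example.edu/atlasproddisk/mc08/EVNT.012345._00001.pool.root")
    short_form =  "srm://se.example.edu/pnfs/example.edu/atlasproddisk/mc08/EVNT.012345._00001.pool.root"

    def shorten(surl):
        # Strip the port and web-service path, keeping host plus the SFN path.
        host, rest = surl[len("srm://"):].split("/", 1)
        host = host.split(":")[0]
        sfn = rest.split("?SFN=", 1)[1] if "?SFN=" in rest else "/" + rest
        return "srm://" + host + sfn

    assert shorten(long_form) == short_form
    print(shorten(long_form))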

Site news and issues (all sites)

  • T1: had trouble with cosmic data replication over the weekend - subscription backlog. Now seeing 400-500 MB/s of data into BNL. Storage systems are performing well.
  • AGLT2: LFC transition nearly complete.
  • NET2: getting ready for the proddisk and LFC migration next week. Suvendra @ HU is upgrading the storage system. Yesterday got a bunch of jobs through the system successfully. Also set up a Panda analysis queue.
  • MWT2: in the middle of the LFC migration at UC. IU - making progress on the problem with duplicate reservations; for some reason srm-remove isn't removing the reservation.
  • SWT2 (UTA): nothing major to report. Might have an issue with GUMS config. Getting ready to migrate.
  • SWT2 (OU): nothing new - all working. Following week.
  • WT2:
    • last week: conditions database issue - Rod provided a package that uses Frontier with a Squid cache. Working; will look for a caching effect. Looks interesting. Will finish this and then begin working on LFC. Michael: database task force performance meeting (Sasha V); needs a launch pad in front of the conditions database.
    • this week: did further testing w/ Frontier connecting to CERN with and without Squid. Direct SQL to BNL: 1100 sec; direct SQL to CERN: 2300 sec; Frontier with an empty cache: 300 sec; with a full cache: < 40 sec. Testing LFC on a test host; seeing the same issues as other sites. Testing on MySQL 4; production will use MySQL 5. Proddisk is ready.

Carryover issues (any updates?)

Release installation via Pacballs + DDM (Xin, Fred)

  • status from last week
    • Xin has tested scripts; passed to Tadashi to convert to Panda job.
    • Fred - has requested that BNL receive pacball datasets, as a permanent request.
    • Stan will coordinate pacball generation, and they should appear at BNL shortly thereafter.
    • Need to set up the sm2 account.
    • Job definitions interface.
    • Xin will follow-up with Tadashi.
    • 14.2.23 - just released - Fred will monitor the transfers to BNL.
  • Fred - tracking down why data is not being subscribed to BNL. Pacballs are being created, but the automatic subscriptions to BNL aren't working. Consulting Alexei.
  • Xin - Tadashi has converted the scripts into the install job payload and created the job creation interface. Next step: try some test sites in Panda, using a temporary install area, then run test jobs using the usatlas2 role. (A rough sketch of the pacman step follows.)
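
For context on what the install payload ultimately has to do on a worker node, here is a rough sketch of the pacman step, assuming the pacball file has already been delivered locally via DDM. The file name, paths, and the exact pacman invocation are assumptions; the payload Tadashi built may well do this differently.

    import os
    import subprocess

    # Hypothetical local copy of a pacball delivered via DDM; real names differ.
    pacball = "/scratch/pacballs/AtlasProduction_14_2_23_i686_slc4_gcc34_opt.pacball"
    install_dir = "/scratch/atlas_releases/14.2.23"

    if not os.path.isdir(install_dir):
        os.makedirs(install_dir)

    # pacman installs into the working directory; trust-all-caches avoids an
    # interactive prompt about unknown caches on a batch worker node.
    rc = subprocess.call(["pacman", "-allow", "trust-all-caches", "-get", pacball],
                         cwd=install_dir)
    print("pacman exited with status %d" % rc)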

Throughput initiative - status (Shawn)

  • last week
    • Meeting held this week. Jay put up an iperf server at BNL on a 10G link (a test sketch follows this section).
    • Asking sites to review their configuration and tunings, site by site.
    • Perhaps re-tune the new nodes, given their larger memory.
    • Perfsonar v2 is available. Most sites have their boxes installed and ready.
    • Next step: configure scheduled mesh tests among the sites, to establish what normal looks like. Complementary to the high-throughput testing.
    • Rich - will suggest working with John Bigaro at BNL to set up a mesh, and then work through the Tier 2s.
  • no meeting this week
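
As a reference point for sites wanting to try the new BNL iperf server, a minimal single-host test sketch follows. The server hostname is a placeholder (use the host and port announced by the throughput group), and the duration and stream count are arbitrary choices.

    import subprocess

    # Placeholder for the BNL 10G iperf server announced by the throughput group.
    IPERF_SERVER = "iperf-test.example.org"

    # iperf 2.x client: -c server, -t seconds, -P parallel streams, -f m for Mbit/s.
    cmd = ["iperf", "-c", IPERF_SERVER, "-t", "30", "-P", "4", "-f", "m"]
    print("running: " + " ".join(cmd))
    rc = subprocess.call(cmd)
    if rc != 0:
        print("iperf failed - is the server reachable and the port open?")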

Tier3

  • A separate subcommittee has been formed to redefine the whitepaper (Oct 1). Placeholder to follow developments.

AOB

  • None.


-- RobertGardner - 14 Oct 2008
