r14 - 19 Sep 2007 - 15:11:04 - RobertGardner

MinutesSep19

Introduction

Minutes of the Facilities Integration Program meeting, Sep 19, 2007
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Wensheng, Michael, Rob, Gabriele, Jay, John, Shawn, Kaushik, Dantong, Joe, John Brunnelle, Hiro, Mark, Patrick, Alexei, Karthik
  • Apologies: none

Integration program update (Rob, Michael)

  • Phase 2 schedule
  • See updates for
    • RSV/Nagios/SAM deliverables
    • Throughput initiative to be led by Shawn (100 MB/s routinely to all Tier2s)
  • Add two points to the agenda:
    • Need to address the procedure to be followed for outages
    • Tier2/3 workshop at SLAC

Accounting

  • Accounting portal, http://www3.egee.cesga.es/gridsite/accounting/CESGA/osg_view.html and https://goc.gridops.org
  • Action item: follow-up with John W and WLCG on accounting issues (Rob)
  • Waiting to hear back from John.
  • Note - September is the first month that counts for accounting information. A trial run was done for August, with many sites missing. These figures are important - they are reviewed by the funding agencies. All sites need to be verified.
  • And what are the normalizations applied by Philip?
  • AGLT2 - is showing 0 hours, when there should be 77K hours.
  • BNL - is not showing up in the "OSG" view. Michael believes they are correct to the 10% level.
  • Set up a meeting w/ Shawn, Rob, Philip.

Outages

  • What should be our operational process? E.g., the HPSS upgrade - it was a lengthy outage, and it introduced all the usual problems. We need to do a better job communicating, so that Production can prepare, etc.
  • Proposal - to use this meeting to discuss/negotiate/announce outages.
  • Tier1 storage services - Gabriele Carcassi is assigned as the contact person, will describe and discuss the impact.
  • Note - this is a bi-directional path - production campaigns need to be advertised/announced with the relevant numbers, etc.
  • Gabriele, Kaushik, all agreed on the plan.
  • Note the ATLAS operations meeting, Tuesdays 9am Eastern - a suggested meeting to attend.

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Production summary - no transparencies today.
    • There is still a large reconstruction exercise to come - date uncertain, since the release is not usable. Expect a new release 13 cache this week, still to be validated. Waiting for 12.0.7.3.
    • Have asked Ian for a list of datasets to be processed through reconstruction - will forward to Gabriele/BNL so that those files will be pre-staged.
    • Panda-prestager - perhaps next week when Tadashi returns from vacation.
    • Expect about two weeks before beginning the reconstruction.
    • Last week: duplicate events issue - some samples will need to be redone - a large number of samples to become available this weekend.
    • Each recon pass will take 2-3 weeks, and there will be two passes.
    • Will provide figures for recon job profile
  • Shifters report and other production issues (Mark/Nurcan)
    • HPSS upgrade last week
    • iu-atlas-tier2 is being retired
    • teraport issues - still some site specific issues
    • swt2-uta : power upgrade complete, system coming back online today
    • dCache issue over the weekend
    • Power outage at SLAC
    • Pilot update, but BNL-specific
    • siteinfo update at submit host - there is a divergence between CVS and what's on the submit host

Operations: DDM (Alexei)

  • Proposal to stop subscriptions for AOD tasks because of a large backlog at all sites.
    • All LCG sites and Nordugrid have only done ~20% of the data
    • At BNL - we got ~80% (due to Hiro's and BNL's work!).
    • LCG deployment model is now completely centralized w/ VO boxes and all other services at CERN. Note that the DQ2 developers are updating LCG instances.
  • Status of DQ2 0.4
    • Tomorrow Miguel will announce that 0.4 is available for installation
    • Will create a testbed for 0.4 installation, see further DQ2SiteServicesP2.
    • Concentrate on just the DQ2 software
    • Several bug fixes, including fair-share
  • M4
    • md5sum was not computed for raw data
    • Hiro and Alexei did this, put in catalog
    • Staged all files from CASTOR at CERN. This has been done for BNL.
    • Several bookkeeping problems found.
  • BU improvements (Saul)
    • Have noticed that it's no longer necessary to restart DQ2 daily. Much improved efficiency.
    • Alexei - all M4 and AOD replication has stopped - perhaps the reason why.
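The missing md5sum computation for raw data noted above amounts to hashing each staged file and recording the result in the catalog. A minimal sketch of that step (the `checksum_report` helper and the `*.data` naming are hypothetical, not the actual DDM tooling):

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 checksum of a file, reading in 1 MB chunks
    so that large raw-data files never sit wholly in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum_report(staging_dir):
    """Hypothetical helper: map each *.data file under staging_dir
    to its md5, ready for insertion into the file catalog."""
    return {p.name: md5sum(p) for p in Path(staging_dir).glob("*.data")}
```

Chunked reading is the important design point: M4 raw files are far too large to hash with a single `read()`.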

Lost files at BNL (Wensheng, Hiro)

  • Generated a list of files from the BNL file catalog and from dCache deletion logs. Found files that don't exist in dCache.
  • 310 files that seem to be lost, 11 copied from LCG.
  • Most of these files were written to dcache last week.
  • Hiro: there were some files deleted in BNL dCache locally (doesn't show in server logs); the majority can be recovered from HPSS. Probably not deleted remotely.
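The comparison described above is essentially a set difference between what the catalog claims and what dCache actually holds. A toy sketch (the two file lists are made-up stand-ins for the catalog dump and the dCache/pnfs listing):

```python
def find_lost_files(catalog_entries, dcache_entries):
    """Return files the catalog knows about that are absent from
    dCache - the candidates for recovery from HPSS."""
    return sorted(set(catalog_entries) - set(dcache_entries))

# Hypothetical inputs standing in for the real catalog and namespace dumps.
catalog = ["evgen.0001.root", "evgen.0002.root", "raw.0003.data"]
dcache = ["evgen.0001.root"]
lost = find_lost_files(catalog, dcache)
```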

DQ2 testbed

  • Software, sites, schedule: please update DQ2SiteServicesP2
  • BNL
  • UTA
  • Second tier2 - perhaps BU?
  • Validation (Patrick)
    • Mechanics of getting from 0.3 to 0.4
    • Verify transfer of files
    • Expect 0.4 to be an improvement - really about installing and functioning correctly
    • Do not expect many integration issues w/ Panda.
  • There are client tools used by the pilot software. Wensheng installed the client over the weekend - there was mixed success. Are there any changes to the API? Nothing seemed to break.
  • Alexei notes there was a critical bug in the client tools that was not widely announced.
  • BNLDISK is the best place for server-side tests.
  • Once the validation is done, we will define a period in which we expect all sites to migrate.

Panda Mover (Wensheng, Hiro et al.)

  • This was triggered by the question about monitoring the Wisconsin LRC.
  • Panda mover is in the development phase. Alternate mechanism for moving files, being tested for replication of files to tier2s.
  • Uses a special DDM cluster at BNL to schedule transfers, in a single service. All managed with a single queue.
  • Has been tested at BU, UTA, and OU.
  • Will take more iterations before it's ready as a production service.
  • Does keep the current structure of LRCs at sites, so http interface is still needed.
  • Replaces fetcher at every site.

FTS 2.0 (Hiro)

  • Looking at the FTS migration - would like to do this soon. Would require about a half day of downtime. Could be mitigated by using Panda mover.
  • Hiro would prefer to do this next week - Tuesday/Wednesday. Note also using Oracle backend.
  • Dantong notes this is an LCG milestone in September.
  • Scheduled this for Wednesday.
  • All "logging" information now in the database - will make a web-interface.

LRC thoughts (Hiro)

  • Has looked at Pedro's code - will install. There is a schema change. Not sure about performance (should not matter much). Will run some scalability tests.
  • Prefer to add additional attributes for pnfs.

LFC Evaluation (John)

  • Has a box set up - almost has it running; ran into an SLC4 problem.
  • Will setup a twiki to capture information.

Options for installing iperf (Dantong)

Load testing update, issues (Jay)

  • See LoadTestsP2 for updates.
  • Needs iperf installed at sites, and a preferred port.
  • Working on the Monalisa monitoring application that parses results and graphs.
  • Many combinations, many middleware layers, repetition options, number of streams. (Use guidance from throughput optimization from Shawn)
  • Black hole - to do.
  • Need to fill in tcp parameters for sites. Shawn notes there is Monalisa module "Lisa" that publishes this information dynamically. Publishes relevant network parameters.
  • All files are deleted at sites.
  • Sites: please update the pages with the required information.
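The monitoring application above parses iperf results before graphing them. A sketch of the parsing step, assuming classic iperf text output (the regex and MB/s conversion are illustrative, not the actual Monalisa code):

```python
import re

# Matches the bandwidth field of a classic iperf summary line, e.g.
# "[  3]  0.0-10.0 sec  1.16 GBytes   998 Mbits/sec"
BW_RE = re.compile(r"([\d.]+)\s+([KMG])bits/sec")

def mbytes_per_sec(iperf_line):
    """Convert an iperf bandwidth report to MB/s; None if no
    bandwidth field is present on the line."""
    m = BW_RE.search(iperf_line)
    if m is None:
        return None
    value, unit = float(m.group(1)), m.group(2)
    mbits = value * {"K": 1e-3, "M": 1.0, "G": 1e3}[unit]
    return mbits / 8.0  # 8 bits per byte
```

Reporting in MB/s keeps the graphs directly comparable to the program's 100 MB/s throughput target.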

Network Performance and Throughput initiative (Shawn)

  • See plan at NetworkPerformanceP2
  • Focus first on two sites (BNL and AGLT2) and once 120 MB/s is achieved move onto the other Tier2s. DONE
    • See initial results at https://hep.pa.msu.edu/twiki/bin/view/AGLT2/NetworkTuning
    • Summary: Use of 1GE path achieved (umfs05.aglt2 <=> dct00.usatlas.bnl.gov). Issues found with dq2.aglt2.org. Other traffic interfering with umfs02.grid.umich.edu (different subnet leads to different paths)
    • Action items:
      • Update e1000 driver on dct00
      • Debug dq2.aglt2.org: different NIC, MSI settings, motherboard/bus issue, NDT testing
      • Create tunenic service on umfs05 to ensure appropriate settings are in place on reboot.
  • Dmitri and intern will begin work with other sites.
  • MWT2 will provide next site. BU as well, after 10G is plugged in.
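The TCP parameters behind the tuning work above are typically sized from the bandwidth-delay product: the send/receive buffers must hold at least one bandwidth-RTT's worth of data or the path can never be kept full. A quick sketch (the 1 Gb/s and 20 ms figures are illustrative, not measured values from these sites):

```python
def tcp_buffer_bytes(bandwidth_bits_per_sec, rtt_sec):
    """Bandwidth-delay product: the minimum TCP window, in bytes,
    needed to keep a path of the given bandwidth and round-trip
    time fully utilized."""
    return int(bandwidth_bits_per_sec * rtt_sec / 8)

# Illustrative: a 1 Gb/s path with a 20 ms RTT needs ~2.5 MB of
# buffer, so the kernel's max socket buffer should be at least that.
bdp = tcp_buffer_bytes(1e9, 0.020)
```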

Tier2 meeting at SLAC

  • First announcement within the next couple of days.
  • There is a webpage available, work to begin on agenda.
  • November 28-30.
  • Wei will provide recommendations for hotels.

Analysis Queues (Bob, Mark)

  • Quick report from Bob - analysis queue functional, running analysis test jobs.
  • See AnalysisQueueP2
  • Action item: Mark will provide similar instructions for PBS.
  • Action items moving forward (each site):
    • Set up analysis queues
    • Allocate a small number of CPUs at each site

Site performance jobs and metrics (Rob)

  • Carryover

RSV, Nagios, SAM (WLCG) site availability monitoring program (Dantong, Tomasz)

  • Carryover

OSG Integration (Marco)

  • Testing on ITB 0.7, Site Validation Table
  • Working on validation of Panda on OSG ITB 0.7 (UC_ITB). Pilots are running fine, waiting for assigned jobs. Need to complete Panda validation by September 20.
  • See also tier2-06.uchicago.edu:8800/pandajs/ for submitting pilots to your site.

Site news and issues (All Sites)

  • T1: HPSS upgrade completed on Friday, running successfully. More storage is being deployed - 6 Thumpers in the hands of the dCache team for the read pool. Two more racks to be integrated. Gabriele - has been uploading files produced last week into HPSS - should be finished tomorrow. FTS 2.0 upgrade next Wednesday.
  • AGLT2: Have analysis queue now running, test jobs running through it. Production rolling along well.
  • NET2: Production going at an excellent pace, no problems. Gratia installed, given high priority.
  • MWT2: Production going well on production clusters; iu_atlas_tier2 retired, "Quarry" coming to replace it.
  • SWT2_UTA: bringing back after power upgrade.
  • SWT2_OU: in process of upgrading the cluster.
  • WT2: Scheduled outage shutting down the gatekeeper.

RT Queues and pending issues (Dantong, Tomasz)

  • Summary URL for RT issues. We have been asked to provide a web page showing a summary of open RT tickets related to Tier2 centers. The page can be reached from the RT login page (link is in the lower part of the page):

https://rt-racf.bnl.gov/rt/ or directly https://rt-racf.bnl.gov/RSS/tier2.rss
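A summary page like the one requested can be built directly from the RSS feed. A sketch of pulling ticket titles out of such a feed with the standard library (the sample feed below is a made-up stand-in; the real tier2.rss may use a different RSS flavor or field layout):

```python
import xml.etree.ElementTree as ET

def ticket_titles(rss_text):
    """Extract item titles from an RSS 2.0 feed of open tickets."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]

# Made-up example feed in place of the live tier2.rss contents.
sample = """<rss version="2.0"><channel>
  <item><title>AGLT2: transfer failures</title></item>
  <item><title>NET2: gatekeeper load</title></item>
</channel></rss>"""
titles = ticket_titles(sample)
```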

Carryover action items

Syslog-ng

  • Encryption to syslog-ng: still to do, carryover.
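The outstanding encryption work would typically mean wrapping the syslog transport in TLS. A hedged sketch of what a client-side syslog-ng TLS destination might look like (the collector host, port, certificate paths, and source name are placeholders, and the exact option set depends on the syslog-ng version deployed):

```
# Hypothetical syslog-ng client destination sending logs over TLS.
destination d_bnl_tls {
    tcp("loghost.example.bnl.gov" port(6514)
        tls(ca_dir("/etc/syslog-ng/ca.d")
            key_file("/etc/syslog-ng/client.key")
            cert_file("/etc/syslog-ng/client.crt"))
    );
};
log { source(s_local); destination(d_bnl_tls); };
```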

Nagios monitoring

  • Tomasz is working on Nagios service groups and hierarchy. Will report next week.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • None

-- RobertGardner - 18 Sep 2007

About This Site

This site is a content mirror of the BNL US ATLAS TWiki.