r5 - 20 Aug 2008 - 08:46:25 - RobertGardner

MinutesFeb6

Introduction

Minutes of the Facilities Integration Program meeting, February 6, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Rob, Gabriele, Marco, John, Wensheng, Fred, Nurcan, Mark, Kaushik, Patrick, Hiro, Xin, Saul, Shawn, Horst, Alexei, Torre, Tom
  • Apologies: Michael

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • The past week was a good one.
    • UM mystery solved.
    • Ran out of jobs this morning - discussions w/ Alex Reed and Ian Hinchliffe - tasks moved from Canada. Proposal of a week shutdown. Feb 18? ... discussion. What about validation of 13.0.40?
    • Alexei: FDR to T1 tomorrow; T2 a day after. 1 TB. Feb 11 - CCRC starts. 80 TB. T2's need to define subscriptions to CCRC data. Do we need to do reprocessing? Should last 2 weeks.
      • SRM v2 endpoints need to be tested. AGLT2 and BNL.
      • Issue of md5sum - for DQ2 0.6 we would like adler32 (implemented in DQ2 0.5.2). This needs modifications to the Panda pilot and DQ2. Kaushik believes there is no compatibility problem. Hiro will work with Charles on testing the upgrade.
    • Follow-up on Eowyn scalability problems.
      • There are still issues with job status updates being made too late (3 days in some cases). 10 hours to restart Eowyn! Problem is that the amount of information in a job definition has grown very large.
      • Still an issue
      • Tadashi has written a server to pull jobs independently from the prod database. Can run in parallel with Eowyn. Can it be run to handle status updates? May not be able to since Eowyn is not completely stateless (Eowyn owns jobs).
      • Roll-out perhaps next week. Follow-up:
        • Ready today or tomorrow.
  • Production shift report (Mark/Nurcan)
    • Few problems over the past week, nothing major.
    • Job scheduler on submit host failed - affecting pilot2 sites - restarting.
    • Ongoing work to integrate US and EU operations.
    • Still moving jobs from Canada to US - and this will require movement of input files. Hiro notes there may be jobs failing.
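
For the md5sum item above, a minimal sketch of computing both checksums for one file, useful when comparing catalog entries during a changeover. It assumes the checksum wanted for DQ2 0.6 is Adler-32 as provided by Python's zlib module; `file_checksums` is a hypothetical helper and not part of DQ2 or the Panda pilot.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=1024 * 1024):
    """Return (md5_hex, adler32_hex) for the file at `path`, reading in chunks."""
    md5 = hashlib.md5()
    adler = 1  # Adler-32 starting value used by zlib
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
    # Mask to 32 bits and zero-pad to 8 hex digits (formatting is an assumption).
    return md5.hexdigest(), format(adler & 0xFFFFFFFF, "08x")
```

Reading in chunks keeps memory use flat for large files, and both digests come from a single pass over the data.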

LFC (John)

  • Following up:
    • Setting up a Panda test site (Mark Sosebee and Tadashi) on Friday.
    • Could be issues w/ authentication.
    • Next steps - installation and migration.
  • Mark setting up Panda test site. Has seen entries being created.
  • Performance test next week w/ Hiro. Will do .25M entry test.
  • When doing migration, will have to drain site, and then switch over, and then send modified pilots.
  • Will take changes in the data mover - Paul needs to be in the loop.
  • AGLT2 may be willing to try, when ready.

Operations: DDM (Kaushik/Hiro)

  • Status of AOD replication for analysis at Tier2s
  • Discussion from last week:
    • Will need to follow-up on AOD replication. Recovering files previously replicated without the archival bit.
    • Test samples for Nurcan have been distributed.
    • At SLAC meeting we discussed replicating Rel 12, 13 at the Tier2's. Abandon this goal?
    • Should we focus instead on FDR datasets?
    • Hiro - Tier2's deleting AODs - via cleanse.py. AODs not coming in with archival flag. Alexei will do this in the future.
      • Suggestion was to delete them all, and start over. Can we do this in a more selective fashion?
      • Note there is a -Panda flag to distinguish Panda files and AOD files at a site.
      • What should we be doing? Patrick and Hiro will discuss w/ Miguel. Charles interested in contributing.
    • How should we handle multiple datasets? Manual operation at the site - but we need information about the datasets.
    • CCRC are copies - they are replicas of FDR data. Archival flag should be off.
    • We need space for two versions of the AODs most likely.
    • SRM v2.2 sites - these can be centrally deleted. Do we want to do this? Certainly not for FDR, but for CCRC this would be okay.
    • Discussion about dataset reservation at sites - would be good to publish this.
    • Still will need a management oversight.
  • Today - Kaushik would prefer to focus only on FDR AOD.
  • Issue about cleaning up AOD datasets at sites that may have been deleted.
    • What about inconsistencies between the datasets deleted at sites and the central catalog?
    • How to remove the location in the central catalog? TID and type, you should be able to construct the reverse map.
    • There is a good method for deleting AOD datasets - written by Hiro. The problem is dealing w/ the production datasets. An issue if the location is not registered?
  • Another issue (from Marco) - deleting a dataset - is this possible? DQ2 does not delete dataset names - it's a feature since DQ2 0.3 - forever locked.
  • Slow subscription - check ARDA dashboard.

Analysis Queues (Bob, Mark, Nurcan)

  • See AnalysisQueues; updated DONE
  • Follow-up on required RPMs at sites for analysis jobs requiring compilation.
  • As of last Friday - succeeded at AGLT2.
  • Need to pin down the necessary rpms for all sites.
  • Fred will circulate an email once the twiki page describing this is cleaned up. Also - what about RHEL5? No validation plans at the moment.
  • SLC3 sites should figure on upgrading in the near term - e.g. next two months.
  • Would like to send instructions this week.

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • Follow-up - last week: SWT2_UTA (Patrick) - one step closer; still need to get registered in VORS; will be delayed since there is no operations meeting on Monday. Post 28th.
      • Will get into VORS. Still working on it.
    • US ATLAS Facility view (Rob) - post resolution of the BNL mapping issue.

Throughput initiative - status (Shawn)

  • Update from meeting this week
    • BU testing completed...limited by the capabilities of the current gatekeeper/gridftp door to around 45 MB/sec.
    • Wisconsin has reconfigured their infrastructure and is requesting follow-up testing
    • UTA plans to do testing in the coming week.
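
To put the BU figure in context, a rough sketch of how long a transfer would take at that rate. The 1 TB size is borrowed from the FDR figure mentioned under Production purely for illustration; decimal units and a fully sustained rate are assumptions, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at a sustained rate (decimal units)."""
    size_mb = size_tb * 1_000_000  # 1 TB = 10^6 MB in decimal units
    return size_mb / rate_mb_per_s / 3600


# 1 TB at the ~45 MB/sec ceiling reported for BU
print(f"{transfer_hours(1, 45):.1f} hours")  # prints "6.2 hours"
```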

  • Need from sites:
    • Network diagram(s). See NetworkDiagrams for what we have so far.
    • Disk performance. See LoadTestsP5 for current information.
    • Optimal number of streams on each site
    • Add these to the site certification table to check off.

  • Shawn will create a table in the LoadTestsP5 task for path, local I/O performance.
    • New table showing the site status for testing needs to be filled in.
    • Sarah at IU working on network diagrams.

Panda release installation jobs (Xin)

  • From last week (any update?)
    • Now have a production submit host. Need conduits opened up. Early next month. (still waiting)
    • Meantime using a temporary machine - testing is working okay on some sites. Basic functionality works.
    • Change to Panda monitor to isolate release installation jobs? Xin discussing with Torre.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Follow-up: Split of Nagios server into internal and external - work has now started. The server has been built. The external server will be moved to a new server.
    • Expect an update next Wednesday.
  • RSV publishing to WLCG
    • Dantong - looking into US Facility reporting of SAM data; entries are not appearing. Will follow-up with Rob Q.
    • There was an issue with storage element availability getting associated into the wrong group.
    • Still working out the bugs in basic RSV reporting.
  • Local RSV to Nagios publishing
    • Sent code for RSV->Nagios reporting.

Site news and issues (all sites)

  • T1: no major news. Most Panda services now migrated to new hardware.
  • AGLT2: all running well. cleanup script.
  • NET2: all is well
  • MWT2: all okay
  • SWT2_UTA: need analysis queue setup for CPB cluster
  • SWT2_OU: replaced mobo for gridftp server for the second time, still crashing. 10G switch to be installed in the next week or two. Waiting for 10G equipment from UTA.
  • WT2: no report.

RT Queues and pending issues (Tomasz)

Carryover action items

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 05 Feb 2008
