r5 - 26 Nov 2008 - 14:51:42 - RobertGardner

MinutesNov26

Introduction

Minutes of the Facilities Integration Program meeting, Nov 26, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Marco, Rob, Michael, Wei, Fred, Jim, Douglas, Hiro, Horst & Karthik, Nurcan, Kaushik, Mark, Sarah
  • Apologies: none
  • Guests:

Integration program update (Rob, Michael)

  • IntegrationPhase7
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens DONE
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated DONE
    • Transition to /atlas/Role=Production proxy for production DONE
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed DONE
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Support upcoming Jamborees
  • ANL Analysis Jamboree, Dec 9-12, twki-home
  • BNL Analysis Jamboree, Dec 15-18, 2008 agenda, BNLJamboreeDec2008
  • Comments from Jim C:
    • At ANL, will use their analysis cluster to run jobs over Cosmics
    • ESDs on Tier2's - would like to test this, since it's an important use case.
    • Perhaps using inner detector stream.
    • Expect to have more details within the next week to 10 days.
    • Are there any datasets known to be deprecated? Charles will send Jim a list of hosted datasets. Nurcan recommends that the container datasets made during the September Jamborees be kept.
    • Kaushik: we need some automated tools to manage datasets at sites. Wensheng, Armen, Hiro are working on a website that displays capacity.
  • Next US ATLAS Facilities face-to-face meeting (past meetings):
    • Will be co-located with the OSG All-Hands meeting at the LIGO observatory in Livingston, LA, March 2-5, 2009 Agenda
    • US ATLAS: March 3, 9am-3pm - focus on readiness for data, and Tier 3 integration - organized by Maxim Potekhin
  • Tier 0/1/2/3 Jamboree - Jan 22, 2009
  • Tier 3 study group is making good progress - a draft document is available. Consolidating input regarding workflows, requirements.

Facility downtimes

We need to make downtimes, and any ongoing problems, transparent and well known to the collaboration. We need to work out a structure for providing this information.
  • follow-up discussion from last couple of weeks
    • OIM https://oim.grid.iu.edu/ to announce downtimes
    • Where to "announce"? shifters, prodsys, oim, ....
    • Curl method for setting sites offline/online in Panda - make this automatically inform OIM?
    • Would like to do this in one place. With one click. Note OIM requires a grid certificate.
    • Kaushik will provide a summary and send it around; he will try for next week. There is a new curl command that puts an offline site into "test", an intermediate state before "online".
  • TicketExchangeGOCGGUS26Nov08
  • From Marco & Yuri: https://twiki.cern.ch/twiki/bin/view/Atlas/PandaShiftGuide#Procedure_to_set_a_site_online
  • Test jobs should be short, and the site admins should be able to submit them. Yuri and Marco will write down the procedure.
  • Same for ANALY queue.
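
The offline, "test", online progression discussed above can be pictured as a small state machine. The following is only a sketch under stated assumptions: the three state names come from the minutes, but `SiteState`, `ALLOWED`, `set_state`, and the transition policy itself (in particular, that a site may not jump straight from offline to online) are hypothetical and do not describe the actual Panda curl interface.

```python
from enum import Enum


class SiteState(Enum):
    """Site states named in the minutes."""
    OFFLINE = "offline"
    TEST = "test"
    ONLINE = "online"


# Assumed policy (not taken from Panda): an offline site must pass
# through "test" before it can be set "online".
ALLOWED = {
    SiteState.OFFLINE: {SiteState.TEST},
    SiteState.TEST: {SiteState.ONLINE, SiteState.OFFLINE},
    SiteState.ONLINE: {SiteState.OFFLINE},
}


def set_state(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


# Example: bring a site back after a downtime.
state = SiteState.OFFLINE
state = set_state(state, SiteState.TEST)    # short test jobs run here
state = set_state(state, SiteState.ONLINE)  # production resumes
```

A single place that owns the transition would also be a natural hook for notifying OIM or the shifters, which is the "one place, one click" goal raised in the discussion.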

Functional tests (Doug)

Discussion.

Next procurements

  • Cover in site-reports

Operations overview: Production (Kaushik)

  • last week
    • Problems connected to proddb at CERN being blocked over the weekend - Bamboo couldn't get jobs. Being discussed in ADC. Will set up a webpage with procedures for operators on shift.
    • Job definition system bug - Pavel is redefining tasks. Happening right now.
    • GGUS tickets --> US interface - is this working? Instructions do say to submit RT tickets.
    • There is a shifters workshop Jan 23, 2009, see agenda. We should review the workflow/information flow through the various systems - Rob Quick (OSG GOC) has offered to help to develop a common system. Fred will help facilitate the process. Revisit first week of December.
  • this week
    • going fine, 7K jobs
    • Panda mover - there were problems with old job directories on the panda mover cluster. Has been addressed in the pilot.

Shifters report (Yuri)

  • Reference Yuri's weekly summary in operations meetings.
  • This week's meeting, US shift report
  • Panda database problem? Yuri will investigate.
  • There were a large number of aborted tasks that were deleted.
  • New error - 'replica not found', though it doesn't affect US sites
  • Pilot problems at IU? Can experts comment? Yuri will check submit hosts.

PRODDISK migration (Yuri, Paul, others)

  • Follow-up w/ status at each site; last week:
    • SWT2_OU - waiting for new storage. Bestman-Gateway installed.
    • For Harvard, custom site mover tool will be needed (testing this week).
    • For BU, input datasets are going to non-space token area. Need to change scheddb.

Local Site Mover (Marco, Paul, John, Charles)

  • Specification: LocalSiteMover
  • code
  • Site movers: Posix, dCache, xrd, pcache
  • Now testing.
  • A week from today, expect to have the final version working. This will bring up the Harvard site.

Analysis queues, FDR analysis (Nurcan)

Operations: DDM (Hiro)

  • Test replications working at all sites (ccrc08 datasets).
  • Tell ADC to do central deletion.
  • New list for notification of deletion of user datasets

DQ2 upgrade for long URL's in the catalog

  • Problems with updates at sites.

LFC migration

  • SubCommitteeLFC
  • last week
    • AGLT2, MWT2_UC, MWT2_IU, SWT2_CPB, SLAC, WISC - completed.
    • BU - to locate at Harvard due to firewall issues this week.
    • OU - waiting on Pilot/Panda changes.
    • BNL - wait until week following Dec 1.
    • Stability issue: thread-safety problem with a Globus function - VDT is addressing this and will provide a patch to test shortly.
  • this week
    • BU - next week.
    • BNL - upgrade next week. Kaushik has given green light. Coordinate with Paul - requires changes in the pilot code. Will not use space tokens.

Site news and issues (all sites)

  • T1:
    • last week: Not much to report, things moving smoothly. Tracking the network connectivity issue between BNL and CNAF - not running well, under investigation. Making procurements - using UCI connection. Lots of data being replicated from Tier 1 to Tier 2's. Aggregate 400 MB/s over quite some time. MWT2 was blocked by cyber security at BNL; reset packets were detected and interpreted as a threat. Over the course of 3 days a number of packets were not accepted, blocked by internal firewalls of the destination host. Only 80 packets over 3 days, against millions of successful packets. Have requested that none of the systems be automatically blocked, and that cyber security communicate with us; brought to the attention of ITD management at BNL.
    • this week: Data replication is proceeding at a high rate: 350 MB/s average over the last 7 days. Sites appear stable, with a couple of sites handling rates up to 200 MB/s. Very positive progress made here. There is a problem with the cooling facilities at the Tier 1 (heat exchanger punctured), though no systems needed to be shut down. Renting a 50 T chiller, which arrives within 24 hours; $7000/month.
  • AGLT2:
    • last week: LFC issue four days ago, daemon crashing/locking up. Suggestion was to limit the number of jobs. Raised the number of threads from 20 to 60. Will check. MSU has received all their equipment - 400 or so cores. 1400 total cores in system, to be increased by this 400 in mid-December.
    • this week: no report.
  • NET2:
    • last week: BU running smoothly since last week; HU - Lustre causing corruption of files in releases, Lustre database machine would crash; re-installing releases. New storage has almost all arrived, 100 new cores. Main storage filling up - paused cosmic replication until new storage is online.
    • this week: no additional report.
  • MWT2:
    • last week: Accidental metadata deletion of DATADISK - datasets now being restored. ANALY queue testing.
    • this week: dCache gridftp door problems resolved. Added a second gridftp door w/ 10G NIC.
  • SWT2 (UTA):
    • last week: NFS file server problems - cleaned up; back in production. Rate problem with pilots - cleared up.
    • this week: CPB running smoothly. SWT2 - offline for upgrades.
  • SWT2 (OU):
    • last week: LFC work continuing. Network monitoring host setup.
    • this week: all is well. Transfer timeouts expiring. Marco submitting test jobs. Hiro changed ToA.
  • WT2:
    • last week: no report.
    • this week: completed migration to LFC, production running fine. ANALY queue test jobs are successful, but Ganga Robot jobs fail - those with input files. Perhaps because of direct reading from storage. Sent email to Paul. January 8 - there will be a power cooling upgrade that will require 5 days of downtime.
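
As a rough cross-check of the Tier 1 replication rate reported above (350 MB/s averaged over 7 days), the total volume moved can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a measurement: it assumes a constant rate and decimal units (1 TB = 1,000,000 MB), and the helper name `volume_tb` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def volume_tb(rate_mb_per_s, days):
    """Data volume in TB moved at a constant rate (decimal units assumed)."""
    return rate_mb_per_s * days * SECONDS_PER_DAY / 1_000_000


# 350 MB/s sustained for 7 days, as quoted in the T1 report.
print(f"{volume_tb(350, 7):.1f} TB")  # prints "211.7 TB"
```

By the same estimate, a single site sustaining 200 MB/s would receive about 17 TB per day.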

Carryover issues (any updates?)

Release installation via Pacballs + DDM (Xin, Fred)

  • status from last week
    • Test jobs cannot contact EGEE installation portal. Waiting on Torre to set up new sites in Panda for installation.
    • Automatic transfer of pacballs to Tier 1 - under discussion.
  • follow-up: Alexei has set up automatic subscriptions for pacballs to Tier 1's. Some haven't made it. Fred: no update this week.

Squids and Frontier

  • follow-up: Frontier server at BNL - almost done; will run some internal tests. Ready for testing in a week. Shuwei is providing re-processing jobs as tests - not sure of latest news - expectation is it's almost ready for tests w/ SLAC.

glexec @ SLAC

  • follow-up: Doug - there was a test job that exposed a configuration problem that needs to be fixed. Reinstalled wn-client on a rhel4 platform; this has been successful. Available for test jobs now. Jose C at BNL is working on changes in the pilot.

AOB

  • follow-up: Fred: retrying pathena analysis jobs at ANALY_MWT2. There seemed to be an exact 3 hour time delay before being restarted. atlas-dist-analysis-help list. Will email Tadashi. Create ticket in panda-savannah.
  • follow-up: OSG storage meeting w/ Bestman/xrootd team at last week's OSG site admins meeting. Need to summarize support understandings (Rob) DONE
  • Happy Thanksgiving!


-- RobertGardner - 25 Nov 2008
