
MinutesNov19

Introduction

Minutes of the Facilities Integration Program meeting, Nov 19, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Charles, Hiro, Xin, Fred, Sarah, Patrick, Michael, Saul, Nurcan, John, Karthik, Bob, Wen, Kaushik, Doug
  • Apologies: Mark, Horst, Wei, Paul, Shawn
  • Guests: none

Integration program update (Rob, Michael)

  • IntegrationPhase7
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated
    • Transition to /atlas/Role=Production proxy for production
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Upcoming Jamborees
  • BNL Analysis Jamboree, Dec 15-18, 2008 - agenda: BNLJamboreeDec2008
  • Next US ATLAS Facilities face-to-face meeting (past meetings):
    • Will be co-located with the OSG All-Hands meeting at the LIGO observatory in Livingston, LA, March 2-5, 2009 Agenda
    • US ATLAS: March 3, 9am-3pm - focus on readiness for data, and Tier 3 integration - organized by Maxim Potekhin
  • Tier 0/1/2/3 Jamboree - Jan 22, 2009
  • Tier 3 study group is making good progress - a draft document is available. Consolidating input regarding workflows, requirements.

Facility downtimes

We need to make downtimes, and whether there are problems at a site, transparent and well-known to the collaboration. We need to work on a structure to provide this information to the collaboration.
  • follow-up discussion from last week
    • OIM https://oim.grid.iu.edu/ to announce downtimes
    • Where to "announce"? shifters, prodsys, oim, ....
    • Curl method for off/online sites in Panda - make this automatically inform OIM?
    • Would like to do this in one place. With one click. Note OIM requires a grid certificate.
    • Kaushik will provide a summary and send it around, aiming for next week. There is a new curl command that puts an offline site into "test", an intermediate state before "online" (a sketch of such a call follows this list).
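A minimal sketch of what such a status change could look like, wrapped in Python around a curl call. The controller URL, the parameter names, and the proxy path are assumptions for illustration only, not the confirmed Panda interface; a valid grid proxy is required either way.

    # Minimal sketch: move a Panda queue into "test" status via an HTTP call,
    # mimicking the curl command mentioned above. The endpoint and parameter
    # names are assumptions for illustration only.
    import subprocess

    PANDA_CONTROLLER = "https://pandaserver.example.org:25443/server/controller/query"  # hypothetical
    USER_PROXY = "/tmp/x509up_u1000"  # grid proxy used for client authentication

    def set_site_status(queue, status):
        """Ask the (hypothetical) controller to change a queue's status, e.g. to 'test'."""
        url = "%s?tpmes=set%s&queue=%s" % (PANDA_CONTROLLER, status, queue)
        # Delegate to curl so the grid proxy is used for client authentication.
        return subprocess.call(
            ["curl", "--silent", "--show-error",
             "--cert", USER_PROXY, "--key", USER_PROXY,
             "--capath", "/etc/grid-security/certificates", url])

    if __name__ == "__main__":
        set_site_status("ANALY_EXAMPLE", "test")  # "test" is the intermediate state before "online"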

Next procurements

  • Cover in site-reports

Operations overview: Production (Kaushik)

  • Problems connected to the proddb at CERN being blocked over the weekend - Bamboo couldn't get jobs. Being discussed in ADC. Will set up a webpage with procedures for operators on shift to follow.
  • Job definition system bug - Pavel is redefining tasks. Happening right now.
  • GGUS tickets --> US interface - is this working? Instructions do say to submit RT tickets.
  • There is a shifters workshop Jan 23, 2009, see agenda. We should review the workflow/information flow through the various systems - Rob Quick (OSG GOC) has offered to help to develop a common system. Fred will help facilitate the process. Revisit first week of December.

Shifters report

PRODDISK migration (Yuri, Paul, others)

  • Follow-up w/ status at each site; last week:
    • SWT2_OU - waiting for new storage. Bestman-Gateway installed.
    • For Harvard, custom site mover tool will be needed (testing this week).

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • There was a problem with user jobs finding the available releases at sites. What is our policy regarding unannounced releases? (A minimal site-side release check is sketched after this list.)
  • Display comes with the new system.
  • USERDISK announcement about periodic cleaning - no changes yet for US sites.
  • All ANALY queues are online, except Wisc.
  • Functional testing - see recent summary
    • What tests, and at what frequency? - 200 jobs to a cloud.
    • Should Tier 3 facilities be included? (e.g. a big deal was made of Illinois failures. Why?)
    • Communication when there are problems?
    • Doug might have a script that could be canned for site administrators
    • Webpage - there is a page available in the ARDA dashboard.
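Since several of the failures above trace back to jobs not finding an installed release, a canned site-side check of the kind mentioned could be as simple as listing what is actually present in the ATLAS software area on a worker node. A minimal sketch, assuming releases are installed under $VO_ATLAS_SW_DIR/software (the exact layout is site-dependent):

    # Minimal sketch: list the ATLAS releases visible on a worker node so a site
    # admin can compare against what users expect. Assumes the common layout
    # $VO_ATLAS_SW_DIR/software/<release>; adjust for the local installation.
    import os
    import sys

    def installed_releases(sw_dir=None):
        sw_dir = sw_dir or os.environ.get("VO_ATLAS_SW_DIR", "")
        release_root = os.path.join(sw_dir, "software")
        if not os.path.isdir(release_root):
            return []
        return sorted(d for d in os.listdir(release_root)
                      if os.path.isdir(os.path.join(release_root, d)))

    if __name__ == "__main__":
        releases = installed_releases()
        if not releases:
            sys.exit("No releases found - check VO_ATLAS_SW_DIR on this node")
        print("\n".join(releases))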

Operations: DDM (Hiro)

  • There was a problem at Wisconsin - required a DQ2.
  • A problem with subscriptions at CERN.
  • All other problems are minor.

Operations: Site-level SE management tool (Charles)

Throughput initiative - status (Shawn)

  • No meeting this week - SC09.

LFC migration

  • SubCommitteeLFC
  • AGLT2, MWT2_UC, MWT2_IU, SWT2_CPB, SLAC, WISC - completed (a quick post-migration check is sketched after this list).
  • BU - LFC to be located at Harvard due to firewall issues; planned for this week.
  • OU - waiting on Pilot/Panda changes.
  • BNL - wait until week following Dec 1.
  • Stability issue: a thread-safety issue with a Globus function - VDT is addressing this and will provide a patch to test shortly.
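A quick sanity check after migrating a site is to point LFC_HOST at the new catalog and list a directory that is expected to exist. A minimal sketch wrapping the standard lfc-ls client; the host name and path below are placeholders, not the actual US LFC endpoints, and a valid grid proxy is assumed:

    # Minimal sketch: verify that a migrated LFC answers and that a known
    # directory is present. The LFC host and path are placeholders.
    import os
    import subprocess

    LFC_HOST = "lfc.example.org"    # placeholder for the site's new LFC
    TEST_PATH = "/grid/atlas/dq2"   # placeholder directory expected to exist

    def lfc_listing(host, path):
        env = dict(os.environ, LFC_HOST=host)
        # lfc-ls is the standard LFC command-line client; a valid grid proxy is needed.
        result = subprocess.run(["lfc-ls", "-l", path], env=env,
                                capture_output=True, text=True)
        return result.returncode, result.stdout, result.stderr

    if __name__ == "__main__":
        rc, out, err = lfc_listing(LFC_HOST, TEST_PATH)
        print(out if rc == 0 else "LFC check failed: %s" % err)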

Site news and issues (all sites)

  • T1: Not much to report, things moving smoothly. Tracking the network connectivity issue between BNL and CNAF - not running well, under investigation. Making procurements - using UCI connection. Lots of data being replicated from Tier 1 to Tier 2's - aggregate 400 MB/s sustained over quite some time. MWT2 was blocked by cyber security at BNL; what was detected (reset packets) was interpreted as a threat. Over the course of 3 days a number of packets were not accepted, blocked by internal firewalls of the destination host - only 80 packets over 3 days, against millions of successful packets. Have requested that none of the systems be automatically blocked and that there be communication between cyber security and us; brought to the attention of ITD management at BNL.
  • AGLT2: LFC issue four days ago - daemon crashing/locking up. Suggestion was to limit the number of jobs; raised the thread limit from 20 to 60. Will check. MSU has received all their equipment - 400 or so cores. 1400 total cores in the system, to be increased by this 400 in mid-December.
  • NET2: BU running smoothly since last week. HU - Lustre causing corruption of files in releases, and the Lustre database machine would crash; re-installing releases. New storage has almost all arrived, plus 100 new cores. Main storage filling up - paused cosmic data replication until the new storage is online.
  • MWT2: Accidental metadata deletion of DATADISK - datasets now being restored. ANALY queue testing.
  • SWT2 (UTA): NFS file server problems - cleaned up; back in production. Rate problem with pilots - cleared up.
  • SWT2 (OU): LFC work continuing. Network monitoring host setup.
  • WT2: no report.

Carryover issues (any updates?)

Long URL's in the catalog (Marco)

  • follow-up:
    • Our convention is to use full URL's in the catalog in the US (the full vs. short form is illustrated in the sketch after this list).
    • There are few changes implied for the pilot - Paul is aware.
    • What about dq2-put? Is the short URL used? Need to check w/ Mario.
  • savannah ticket submitted.
  • Update: patch now available - mail from Hiro
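For illustration, the difference between the two forms: a short SURL omits the SRM port and web-service path, while the full form written to the catalog spells them out. A minimal sketch of the expansion; the host, port, and service path are illustrative assumptions (the service path in particular differs between SRM implementations):

    # Minimal sketch: expand a short SURL into the full form used in the catalog.
    # Host, port, and service path below are illustrative; the service path
    # differs between SRM implementations (dCache, BeStMan, ...).
    SRM_PORT = 8443
    SERVICE_PATH = "/srm/managerv2"   # assumption; site-dependent

    def full_surl(short_surl, port=SRM_PORT, service=SERVICE_PATH):
        """srm://host/some/file -> srm://host:8443/srm/managerv2?SFN=/some/file"""
        assert short_surl.startswith("srm://")
        host, _, path = short_surl[len("srm://"):].partition("/")
        return "srm://%s:%d%s?SFN=/%s" % (host, port, service, path)

    if __name__ == "__main__":
        print(full_surl("srm://se.example.org/pnfs/example.org/atlas/file.root"))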

Release installation via Pacballs + DMM (Xin, Fred)

  • status from last week
    • Test jobs cannot contact the EGEE installation portal. Waiting on Torre to set up new sites in Panda for installation.
    • Automatic transfer of pacballs to Tier 1's - under discussion.
  • Alexei has set up automatic subscriptions for pacballs to Tier 1's. Some haven't made it; Fred will follow up.

Squids and Frontier

  • Frontier server at BNL - almost done; will run some internal tests. Ready for testing in a week.
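Once the server is opened up for testing, a simple smoke test from a remote site is to fetch a Frontier URL through the local squid and look at the proxy headers in the reply. A minimal sketch; the squid address and Frontier URL below are placeholders, not the actual BNL endpoints:

    # Minimal sketch: fetch a Frontier URL through a local squid proxy to check
    # that the proxy chain responds. Squid and Frontier addresses are placeholders.
    import urllib.request

    SQUID_PROXY = "http://squid.example.org:3128"                    # placeholder local squid
    FRONTIER_URL = "http://frontier.example.org:8000/atlr/Frontier"  # placeholder server

    def fetch_via_squid(url, proxy):
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy}))
        with opener.open(url, timeout=30) as response:
            # The Via header (if present) shows whether squid handled the reply.
            return response.status, dict(response.getheaders())

    if __name__ == "__main__":
        status, headers = fetch_via_squid(FRONTIER_URL, SQUID_PROXY)
        print(status, headers.get("Via", "no Via header"))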

glexec @ SLAC

  • Doug: a test job was run; it turned up a configuration problem that needs to be fixed.

AOB

  • Fred: retrying pathena analysis jobs at ANALY_MWT2 - there seemed to be an exact 3-hour delay before jobs were restarted. Raised on the atlas-dist-analysis-help list. Will email Tadashi and create a ticket in panda-savannah.
  • OSG storage meeting w/ Bestman/xrootd team at last week's OSG site admins meeting.
  • Next week: short meeting


-- RobertGardner - 18 Nov 2008
