r13 - 15 Aug 2007 - 14:48:16 - KaushikDe

MinutesAug15

Introduction

Minutes of the Facilities Integration Program meeting, Aug 15, 2007
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Charles, Horst, Karthik, Rich, Bob, Shawn, Alexei, Kaushik, Nurcan, Saul, Patrick
  • Apologies: none

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • No numbers this week due to backlog of transferring jobs.
    • Will update the plot for next week.
    • Things are slowly getting back to normal, pending DDM issues. 6000 jobs transferred overnight. 20,000 remain to be transferred.
    • We don't have enough jobs - again?
    • Patrick: "we kicked the box". The agents on the bnlpanda site service locked up. A restart was followed by another lock-up and then a second restart (without recreating the database). Is there a race condition among the agents? At MWT2_IU, Xin increased the number of FTS streams. The new selection algorithm (datasets-to-files) in DQ2 appears to pick files differently than before, so jobs take longer to reach "finished". Better FTS and DQ2 monitoring is needed. DQ2 also retries bad files repeatedly rather than moving good ones first.
  • Shifters report and other production issues (Mark)
    • The big issue has been the transferring problem. Before it, production had been doing well at ~3000 concurrent jobs.

Operations: DDM (Alexei)

Network Performance (Shawn, Dantong)

Site news and issues (All Sites)

Follow-up from last week's news:
  • T1: a major dCache upgrade is scheduled for tomorrow.
  • AGLT2: working on accounting - trying to get it working for a couple of reasons/projects in a standardized schema. Analysis queues: 4 set up, follow-up next week.
  • NET2: nothing new - have not run since Monday, waiting for jobs. No known problems at the site, just a job shortage. Problem with Eowyn: jobs are arriving more slowly than usual, and it seems to be struggling to update job status in the prod DB.
  • MWT2_IU: Looking at transfer issues with Xin, see plots.
  • MWT2_UC: disk firmware upgrade went well. UPS upgrade as well.
  • SWT2-UTA: New cluster being installed this week, Dell onsite. Power outage at SWT2 cluster pushed back to September. Hardware: Dell SC1435 nodes (200 cores, dual dual-core Opterons) and 75 TB raw storage in 10 Dell MD1000s with 500 GB drives.
  • SWT2-OU: All running okay. OSCER interruption this afternoon. Still waiting for the final date for the move (~Labor Day). Will be getting 37 500 GB drives and adding 23 quad nodes. The cluster will be reconfigured. 4 head nodes.
  • WT2: all working okay. Still trying to figure out if 30% AOD replication is complete. Power outage on Aug 27. US ATLAS analysis workshop next week, may not attend.
  • UC Teraport: coming back online after the 64-bit RHEL4 upgrade. Charles notes that setting LD_DEBUG=files makes the dynamic linker report the shared library files it loads.
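
The LD_DEBUG=files tip above can be tried from a small script. This is a minimal sketch, assuming a Linux host with the glibc dynamic linker, which writes its LD_DEBUG trace to standard error. The helper name run_with_ld_debug is hypothetical and was not discussed at the meeting.

```python
import os
import subprocess
import sys


def run_with_ld_debug(cmd, categories="files"):
    """Run cmd with LD_DEBUG set and return (returncode, loader_trace).

    Hypothetical helper: on glibc systems the trace appears on stderr.
    On other platforms the variable is ignored and the trace is empty.
    """
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["LD_DEBUG"] = categories    # e.g. "files" to trace shared library loading
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.DEVNULL,  # discard the program's normal output
        stderr=subprocess.PIPE,     # capture the loader trace
        text=True,
        errors="replace",           # tolerate non-UTF-8 bytes in the trace
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # Trace a trivial Python start-up and show the first few trace lines.
    code, trace = run_with_ld_debug([sys.executable, "-c", "pass"])
    for line in trace.splitlines()[:10]:
        print(line)
```

Any command list can be passed in place of the Python interpreter, for example a job wrapper that fails to find a library after the 64-bit upgrade.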

Carryover action items

  • Encryption to syslog-ng: still to do, carryover.
  • Install NDT at each site - put in site certification table. Follow-up next week.
  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.
    • Dantong has guidelines.
    • There are a couple of tickets that are not getting attention.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • None

-- RobertGardner - 14 Aug 2007
