r13 - 24 Oct 2007 - 14:39:09 - RobertGardner



Minutes of the Facilities Integration Program meeting, October 24, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.


  • Meeting attendees: Gabriele, Rob, Shawn, Nurcan, Wei, John/BU, Jay, John, Horst, Karthik, Wensheng, Dantong, Saul, Joe, Tom, Tomasz
  • Apologies: Patrick, Mark (others at Software Week)

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
  • Production shift report (Nurcan)
    • News from ATLAS software week - validation session - there will be a single executor?
    • Panda server down last night, resolved, but resulted in large number of 'lost heartbeat' jobs.
    • IU_OSG - still some transfer issues, perhaps related to firewall issues.
    • BNL FTS issue - there was an oracle backend problem, resolved.
    • LUT - will be down for two weeks.
    • Wensheng: Tape-stage-in issues on Monday, resulting in job starvation. These are evgen input files. Can they be pinned on disk? At least it would be good to have a plan in place. Has this been the main problem with lack of input files? Finding some panda-mover jobs still running after 5 days - is there a problem with the scheduling?

Operations: DDM (Alexei)

  • M4/M5 replication - see note.

DQ2 0.4 deployment (Hiro, Patrick, Shawn)

  • See further DQ2SiteServices to capture deployment experience, known issues.
  • Results from AGLT2
    • Seems to be running fine.
    • Installation went smoothly. Was running a recent version of MySQL.
    • Hiro had fixed some config problems, but they were minor.
    • No problems to report.
    • There was a fairshare test made by Alexei - all went well.
    • Shawn installed both agents and transfer queue on a dual quad core server - loads very low.
  • Next sites: BU, MWT2, WT2 - starting Monday next week.
  • One issue: subscription control. Have seen cases of users subscribing datasets w/o site admin's knowledge.

FTS monitoring (Hiro)

MySQL LRC (John)

  • Some progress over the past week.
  • Waiting on some repository information from BNL's OS group - need to mirror some CERN repositories. Has to pull libraries from BNL, not CERN, for security purposes.
  • Also waiting on a test dataset from Hiro.
  • Looking into the code base - deeply coupled w/ DQ2 code base. Uses DQ2 web services infrastructure. Looks like it cannot be separated from a DQ2 install.
  • Only coarse grain security available.

Accounting (Shawn, Rob)

Follow-up on (see AccountingP2):
  • There are still some accounting view-grouping problems; GOC contacted.
  • IU_OSG not reporting?
  • SWT2_UTA still being addressed, also as unregistered.
  • BU_ATLAS_Tier2o - Saul - still trying to track this down; believes correctly reporting to OSG, but info not forwarded to WLCG.

Network Performance and Throughput initiative (Shawn, Dantong)

  • See work in progress at NetworkPerformanceP2
  • BU deferred till next week
  • Defer on OU until hardware installation

Load testing update, issues (Jay)

The basic theme today is seeing the result with multiple streams - you can see that having 12 streams you can effectively fill the pipe.
  • FTD can be used for end-users.
  • Goal is to achieve sustained transfers.
  • 12streams.png:
  • 1stream.png:
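
A rough way to see why several parallel streams are needed to fill a link: a single TCP stream is limited to about one window of data per round trip, so the number of streams needed is roughly the link capacity divided by that per-stream rate. The sketch below shows only this arithmetic. The link speed, window size and round-trip time are hypothetical values chosen for illustration, not measurements from the load tests, and the function names are made up for this example.

```python
import math


def per_stream_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP stream: one window of data per round trip."""
    return window_bytes * 8 / rtt_seconds


def streams_to_fill(link_bps: float, window_bytes: int, rtt_seconds: float) -> int:
    """Smallest number of parallel streams whose combined bound reaches link_bps."""
    return math.ceil(link_bps / per_stream_throughput_bps(window_bytes, rtt_seconds))


# Hypothetical example values (assumptions, not taken from the load tests):
LINK_BPS = 1e9              # 1 Gbit/s link
WINDOW_BYTES = 256 * 1024   # 256 KiB TCP window per stream
RTT_SECONDS = 0.025         # 25 ms round-trip time

single = per_stream_throughput_bps(WINDOW_BYTES, RTT_SECONDS)
needed = streams_to_fill(LINK_BPS, WINDOW_BYTES, RTT_SECONDS)
print(f"one stream: about {single / 1e6:.1f} Mbit/s; streams needed: {needed}")
```

With other window sizes or round-trip times the stream count changes, so the plots remain the real reference for what was observed.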


RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Follow-up on: Tomasz will write a generic wrapper for Nagios to RSV probes - plan to release this Friday.
  • Has sent new wrappers to Arvind and Rob.
  • Round-up of RED Nagios issues:
    • gk01.swt2.uta.edu - work going on.
    • tier2-osg.uchicago.edu - working on it.
    • OU - LRC back up.
  • Will provide a second Nagios server for site admins to grant admin privs.
  • Please alert Tom about false positives.

Site news and issues (All Sites)

  • T1: Atlas panda database migrated last week; addressing firewall problems. Last night panda server crashed - out of memory. Oracle FTS problems recovered. Investigating Oracle redundancy.
  • AGLT2: Shawn had to leave.
  • NET2: No probs - could use more.
  • MWT2: production clusters okay - still working on uc-prototype.
  • SWT2_UTA: still working on Ibrix probs.
  • SWT2_OU: Installed OSG 0.6, basically ready. Copied LRC back from backup, ran cleanse.py. ipmi baseboard management of headnodes not working, preventing remote power cycles. Solved rocks-ganglia problems of last week, as well as Condor version probs. Expect to be online tomorrow.
  • WT2: Production going well. End of the month will turn on 10G network. Making progress running NTP server at SLAC - CD w/ flash stick.

RT Queues and pending issues (Dantong, Tomasz)

Carryover action items

Panda release installation jobs

  • Need to find a Facilities person to work with Tadashi

Analysis Queues (Bob, Mark)

  • See AnalysisQueueP2
  • Action item: Mark will provide similar instructions for PBS. -Mark still working on it.
  • Main problem is the set of AODs are not available.
  • Action items moving forward (each site):
    • We need to set up analysis queues
    • Allocate a small number of CPUs to this site


  • Encryption to syslog-ng: still to do, carryover.

Site performance jobs and metrics

  • Carryover

New Action Items

  • See items in carry-overs and new in bold above.


  • none

-- RobertGardner - 23 Oct 2007



png 12streams.png (38.5K) | JayPackard, 24 Oct 2007 - 13:54 |
png 1stream.png (39.5K) | JayPackard, 24 Oct 2007 - 13:48 |