r5 - 31 Oct 2007 - 14:54:50 - RobertGardner



Minutes of the Facilities Integration Program meeting, October 24, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.


  • Meeting attendees: Michael, Rob, Wei, John, Karthik, Nurcan, Kaushik, Jeff/SUNY-Albany, Horst, Fred, Hiro, Jay, John
  • Apologies: Bob - will join a little late

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Lots of difficulties in the last week. Firewall issue w/ Panda servers - resolved. Reconstruction jobs - requiring files on tape (situation improving, thanks to Hiro and Iris with manual stage-ins). Eowyn - problems w/ sufficient numbers of jobs, and jobs getting lost; Karthik setting up a second instance; some concerns about duplicate jobs; crashed for half a day as well. Panda server failure - RAID problem, down for 10 hours. Today: Panda Mover Condor queue not keeping up; under discussion with experts. Question: if Panda Mover is being used, pulling all the input files - but DQ2 remains in use for all other files. John notes that DQ2 seems to be used at BU. Are they rogue subscriptions? There should be no subscriptions for any fetcher except BNLPANDA. Kaushik will check the Panda Mover configuration.
    • Who is involved in resolving this: Xin, Torre, Tadashi.
    • eLog for Panda shifts for announcements of downtimes of central services?
    • There is also a Panda Nagios monitor.
  • Production shift report (Nurcan)
    • See Mark's report
    • Bringing up a new site at UW
  • Question is why input files are not readily available on disk? Michael notes there are 10 additional drives available - should we deploy? Kaushik: focus on getting the Panda Mover issue sorted out today, then look at the input file problem.
  • At SW workshop at CERN - important decision made - Panda will be used for all production. Will be adding 7 more Tier1s in ATLAS to Panda. The development server will be moved to CERN.
    • Dantong notes there is a BNL facility plan to shore up the infrastructure for Panda services. gridu02. John Hover is concerned that we shouldn't try to manage European sites with the existing database backend - scalability issues.

Operations: DDM (Alexei)

  • No report
  • Issues
    • Status of M5 processing and distribution of datasets to BNL.
    • Review of current subscriptions to Tier2 centers, progress on transfers.
    • One issue that came up last week: is there a subscription control mechanism in the DQ2/subscription request interface? We have seen cases of users subscribing datasets w/o the site admin's knowledge, which could at some point cause problems since storage elements currently have no quotas or management controls.
  • Michael notes that 250 MB/s is coming from CERN to BNL. Have accumulated almost 20 TB since yesterday - sent to tape and disk simultaneously. Available for further processing.
  • There is a dedicated dashboard for M4 - expect one for M5.
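
As a quick sanity check on the figures Michael quoted, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation, assuming decimal units (1 TB = 1,000,000 MB) and a perfectly sustained rate; the function names are illustrative and not part of any ATLAS tool.

```python
# Back-of-the-envelope check of the CERN-to-BNL figures quoted in the minutes.
RATE_MB_PER_S = 250   # sustained rate reported in the minutes
VOLUME_TB = 20        # accumulated volume reported in the minutes


def hours_to_transfer(volume_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move volume_tb at rate_mb_per_s (ASSUMES 1 TB = 1e6 MB)."""
    seconds = volume_tb * 1_000_000 / rate_mb_per_s
    return seconds / 3600


def tb_per_day(rate_mb_per_s: float) -> float:
    """Volume in TB moved in 24 hours at a perfectly sustained rate."""
    return rate_mb_per_s * 86_400 / 1_000_000


hours = hours_to_transfer(VOLUME_TB, RATE_MB_PER_S)
print(f"{VOLUME_TB} TB at {RATE_MB_PER_S} MB/s takes about {hours:.1f} hours")
print(f"{RATE_MB_PER_S} MB/s sustained is about {tb_per_day(RATE_MB_PER_S):.1f} TB/day")
```

Under these assumptions 20 TB corresponds to a little under a day of sustained transfer, which is consistent with "since yesterday".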

BNLPANDA status (Hiro)

  • Removed firewall to avoid proxy; increased the number of threads that process replicas from 4 to 20.
  • Is proxy the bottleneck?
  • 3-week conduit. Make this permanent? Negotiating with cybersecurity.
  • Happens when huge number of subscriptions ~ 30K.
  • Kaushik: Was taking 5 minutes to communicate w/ DQ2 central catalogs.
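
The effect of raising the thread count for latency-bound catalog lookups can be illustrated with a generic thread pool. This is a minimal sketch only: `lookup_replica` is a hypothetical stand-in that sleeps to simulate a network round trip, and nothing here reflects the actual DQ2 site services code or its configuration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def lookup_replica(guid: str) -> str:
    """Hypothetical stand-in for one catalog round trip (NOT a DQ2 call)."""
    time.sleep(0.001)  # ASSUMPTION: simulated network latency per lookup
    return f"replica-of-{guid}"


def process_replicas(guids, workers):
    """Run all lookups on a fixed-size thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup_replica, guids))


guids = [f"guid-{i}" for i in range(200)]
for workers in (4, 20):
    start = time.perf_counter()
    results = process_replicas(guids, workers)
    elapsed = time.perf_counter() - start
    print(f"{workers:2d} threads: {len(results)} lookups in {elapsed:.2f} s")
```

More threads help only while the work is dominated by waiting on the remote end; if the proxy or the central catalogs are the real bottleneck, adding threads on the site side will not shorten the 5-minute round trips.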

DQ2 0.4 deployment (Hiro, Patrick, Shawn)

  • See further DQ2SiteServices to capture deployment experience, known issues.
  • AGLT2 - done.
  • WT2 - installation mostly okay, but some small glitches with a few packages that can be worked around. Able to transfer data from Brookhaven to SLAC and register. Cannot tell if communication of dataset finished to central catalogs. Lots of error messages - source file agent lookup crash request. Trying to access bnldisk, france, etc. Will post messages. But otherwise done.
  • BU - John upgraded w/ test site okay, but now probs with production site upgrade - getting mysql started. Hiro: reminds that if you upgrade from older versions, note the schema change - there is a script to run. Starting from scratch.
  • SWT2: Updated production host yesterday - ran into a prob w/ version of libcurl on system - conflict with Ibrix. Install completed, started it up. Have a backlog of old subscriptions, so a lot to process. Large number of errors showing up quickly. Posted to ddm-l, Pedro responded - errors to console rather than logfile - known issue. Remaining issues are almost all probs w/ FTS communication. Is it a load issue? Seeing data come in to UTA, but at a low rate.
  • Next site: MWT2

FTS monitoring (Hiro)

  • FTS monitoring - now available: https://www.usatlas.bnl.gov/fts/
  • Great to have the page, no immediate comments from the group.
  • Hiro can add features - for example links to logs for failed transfers.

Analysis Queues (Bob, Mark)

  • See AnalysisQueueP2
  • Looking at sites with analysis queues - starting first with sites w/ Condor queues. Proceed w/ OU first.
  • Then move onto PBS queues.
    • Action item: Mark will provide similar instructions for PBS. -Mark still working on it.
  • Set up one machine, solely reserved for analysis.
  • Wei - running LSF - probably be running another queue - using fairshare - what to give it? Need to discuss this in-depth.
  • ball@umich.edu

Mysql LRC (John)

  • Follow-up on: Waiting on some repository information from BNL's OS group - need to mirror some CERN repositories. Has to pull libraries from BNL, not CERN, for security purposes. Also waiting on a test dataset from Hiro.
  • John: there was a decision last week to go with LFC, has set up SL3. Need to follow up on this decision w/ Panda development team.

Accounting (Shawn, Rob)

Follow-up on (see AccountingP2) issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up on:
    • IU_OSG not reporting? Working now, DONE
    • SWT2_UTA still being addressed, also as unregistered: getting back to this this week, machine
    • BU_ATLAS_Tier2o - Saul - still trying to track this down; believes correctly reporting to OSG, but info not forwarded to WLCG.

Network Performance and Throughput initiative (Dantong)

  • See work in progress at NetworkPerformanceP2
  • Follow-up on BU: DONE.
  • Worked with Augustine and Shawn - step-by-step; changed TCP buffer size - BNL-to-BU (950 Mbps). BU now at 10 Gbps.
  • Finding lower performance BU to BNL due to 2% packet loss, killing TCP performance.
  • Traced to a problem with a dirty fiber causing CRC errors at the NOX.
  • Next week - fix BU, do OU.
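
Why buffer size and packet loss matter so much here can be sketched with two standard rules of thumb: the bandwidth-delay product (the amount of data that must be in flight to fill a path) and the Mathis et al. approximation for loss-limited TCP throughput. The round-trip time and segment size below are assumptions for illustration; the minutes do not record the measured BU-BNL RTT.

```python
import math


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


def mathis_limit_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis et al. approximation: (MSS/RTT) * sqrt(3/2) / sqrt(loss), in bits/s."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)


RTT_S = 0.010      # ASSUMPTION: 10 ms round trip, not measured in the minutes
MSS_BYTES = 1460   # ASSUMPTION: typical Ethernet maximum segment size
LOSS = 0.02        # 2% packet loss, as reported in the minutes

buffer_mb = bdp_bytes(1e9, RTT_S) / 1e6
limit_mbps = mathis_limit_bps(MSS_BYTES, RTT_S, LOSS) / 1e6
print(f"Buffer to fill 1 Gbps at {RTT_S * 1000:.0f} ms RTT: {buffer_mb:.2f} MB")
print(f"Loss-limited single-stream throughput at {LOSS:.0%} loss: {limit_mbps:.1f} Mbps")
```

With these assumed values a single stream is held to roughly 10 Mbps at 2% loss, regardless of link capacity, which is why fixing the source of the CRC errors matters more than any further buffer tuning.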

Throughput initiative - overview (Shawn)

  • Small meeting this week: Shawn, Hiro, Jay, Rob
  • Hiro preparing the files.
  • Need to coordinate with M5 writing into BNL.
  • For Tier0-Tier1 monitoring

Load test displays, issues from the last week (Jay)

  • For transfer plots - hard to get data into ML.
  • Will consult Iosif.


RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Firewall problems at BNL - creating false alarms.
  • Problem with LRC at Wisconsin; who is responsible?
  • Plan to split this into two instances - one "internal" and one "external". Need some hardware for the second server. Target before SLAC meeting.

Site news and issues (All Sites)

  • T1: Panda server - there was a crash - recovered.
  • AGLT2: Up and running fine - starved for jobs. Recent purchases installed and running. Will have 500 job slots, and 600 at MSU (in about a month).
  • NET2: Nothing major - production working fine. DQ2 0.4 upgrade.
  • MWT2: 20 new nodes added last week
  • SWT2_UTA: back and ready to run. Next up is the new cluster.
  • SWT2_OU: basically done - except for Ibrix crashes (overly sensitive failovers, stuck rsync processes); load tests caused gatekeeper to fail. LRC up and working. Headnodes (2950) not accessible through IPMI.
  • WT2: not much news - continuing to prepare

RT Queues and pending issues (Dantong, Tomasz)

Carryover action items

Panda release installation jobs

  • Need to find a Facilities person to work with Tadashi.


  • Encryption to syslog-ng: still to do, carryover.

Site performance jobs and metrics

  • Carryover; some benchmarking work w/ quad core opterons.

New Action Items

  • See items in carry-overs and new in bold above.


  • none

-- RobertGardner - 30 Oct 2007
