r13 - 28 Sep 2007 - 09:49:31 - RobertGardner



Minutes of the Facilities Integration Program meeting, Sep 26, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.


  • Meeting attendees: Fred, Rob, John, Michael, Gabriele, Alexei, Bob, Kaushik, Torre, Patrick, Horst, John, Rich, Dantong
  • Apologies: Wensheng, Jay, Xin

Integration program update (Rob, Michael)

  • Phase 2 schedule
  • Update to SiteCertificationP2 table (reduced scope)
  • Is this realistic (for network performance tuning)? Shawn believes this is appropriate, and will make contacts.


  • Report on meeting w/ John Weigand, Chris Green, Shawn and Rob
  • See AccountingP2 and items therein.
  • Various types of problems discovered.
  • Note there is a related issue with Panda-specific reporting; Arron from UTA is looking into this.
  • BNL_PANDA vs BNL_OSG; BNL_OSG seems to be under-reporting.
  • Discussion of scaling factors - can we agree on a set of numbers for processor types?
  • In terms of reporting, there is a way of correcting data for usage that fell outside the accounting system.
  • Michael: proposes a small working group to agree on a path for determining these numbers. Facility personnel (Tony Chen) at BNL have worked on the scale for SI2K. Set up an initial phone conference.
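
A minimal sketch of how agreed per-processor scaling factors could be applied when normalizing reported CPU time. The processor names, SI2K ratings, and reference value below are placeholders chosen for illustration only; they are not numbers agreed at this meeting, and this is not how the accounting system itself applies them.

```python
# Hypothetical SI2K ratings per processor type (placeholder values, not agreed numbers).
SI2K_RATING = {
    "cpu_type_a": 1000.0,
    "cpu_type_b": 1500.0,
}

# Hypothetical reference rating that all usage is normalized to.
REFERENCE_SI2K = 1000.0


def normalized_cpu_hours(raw_cpu_hours, cpu_type):
    """Scale raw CPU hours by the processor's rating relative to the reference."""
    return raw_cpu_hours * SI2K_RATING[cpu_type] / REFERENCE_SI2K


# Ten raw hours on the faster placeholder processor count as more normalized hours.
print(normalized_cpu_hours(10.0, "cpu_type_b"))
```

With a single agreed table of this kind, usage reported from sites with different processor types could be compared on the same scale.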

Operations: Production (Kaushik)

  • Production summary (Kaushik). Reconstruction production: input files were sent to BNL, but bugs were found yesterday, so the massive reconstruction has been postponed for two weeks. Note that the pre-stager in Panda is working now, so datasets on disk can be accessed. Release? -- problems. 13.0.30? Probably continue with Xin's mechanism for release installation. This week was busy with panda mover and site info database issues. Also working on multi-cloud scheduling for tasks.
  • Have there been dcache problems at BNL since yesterday? Gabriele: there were heavy load problems on the pnfs server, but they were solved this morning.
  • Some users have submitted very large user analysis tasks which required thousands of RDO files. Tadashi has put in a fairshare mechanism to prevent users from doing this.
  • Site information is now available through Panda monitoring.
  • Shifters report and other production issues (Mark) - mostly covered above.

Operations: DDM (Alexei)

  • Kors asked to postpone functional tests for LCG sites - awaiting more info from Miguel. Believes next week may be a good time to revisit this.
  • Propose half-day to work together on this (Hiro, Patrick, Miguel) - next Wednesday.
  • There will be a large number of subscriptions cancelled on the LCG sites. A large number of small files is causing the problems by clogging the system.

DQ2 testbed and DQ2 0.4 deployment (Patrick)

  • DQ2SiteServicesP2
  • Miguel sent message that 0.4 branch has been moved into the apt unstable repository.
  • Has not heard back regarding precise commands to do the installation.
  • Hiro had started working on this for BNL.

Panda release installation jobs (Fred, Tadashi, Xin)

  • Quite far away. Revert to Xin's method for the time being.

Panda Mover (Kaushik, Hiro)

  • Problems w/ MWT2 with hangs - perhaps wrong information in ToA.
  • SLAC - need to resolve proxy issues.
  • All other sites have been added.

FTS 2.0 (Hiro)

  • Update on the FTS migration, which was scheduled for today (Wednesday).
  • Still working on the deployment as of the time of the meeting.

LRC evaluations (John)

  • See FileCatalogP2 for discussions.
  • Will be on agenda at next week's Panda/DDM workshop at BNL.

Load testing update, issues (Jay)

  • See LoadTestsP2 for updates.
  • From Jay: The monalisa developers helped me through my problems and there are now some graphs up on the monalisa client that you can assess. There is dummy test data for last Friday. I wanted to wait until the sites were tuned before starting the live data since a gigabyte is required to get statistics to some sites which takes a very long time to sites with a slow transfer rate. To view the graphs, open the monalisa client by clicking: http://monalisa.cacr.caltech.edu/ml_client/MonaLisa.jnlp. Once the monalisa client is open, click "Groups" on the left. Make sure usatlas is checked in the Groups menu at the top. Expand it in Farms. Click on BNL_ITB_Test1 within the "Farms" section. Expand Loadtest in Clusters. Click on a transfer such as "bnl->bu (MB/s)". Select all the parameters in the "Parameters" box. Then click "history plot" and then the "history 2d plot". Click "View->Plot interval->Relative" in the menu and then choose an end date/time of Friday at 00:00:00 and 1440 (24 hours) for number of minutes. You should see the comparison of network, gridftp_m2m, and gridftp_d2d over time. My biggest complaint with the graphs is that I can only see time in the axis and not date. There are also other plots besides the "History plot" that you can play with.
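
Jay's point about a gigabyte taking a long time to reach slow sites can be made concrete with a small calculation. This is an idealized sketch: it assumes a steady rate, no protocol overhead, and 1 GB = 1000 MB; the rates in the loop are illustrative, not measurements from the load tests.

```python
def transfer_seconds(size_gigabytes, rate_megabits_per_s):
    """Idealized time to move a payload at a steady rate (no protocol overhead)."""
    if rate_megabits_per_s <= 0:
        raise ValueError("rate must be positive")
    megabits = size_gigabytes * 8000.0  # assumes 1 GB = 1000 MB = 8000 megabits
    return megabits / rate_megabits_per_s


# Illustrative rates only; real transfers will be slower than these estimates.
for rate in (10.0, 100.0, 1000.0):
    print(f"{rate:7.1f} Mbit/s -> {transfer_seconds(1.0, rate):7.1f} s")
```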

Network Performance and Throughput initiative (Shawn)

  • See work in progress at NetworkPerformanceP2
  • MWT2 now done. Initially 40-50 Mbps; now a full Gbps.
  • What about the issue of jumbo frames, generally? There is concern about mixing client and server frame sizes and losing connectivity. Sometimes "path-MTU-discovery" does not work properly. More important for 10G systems.
  • Rich notes that host, router, switch and VLAN are all affected by MTU size.
  • Site admins need to investigate these details to reach our scalability targets.
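
Rich's note that host, router, switch and VLAN settings all matter can be sketched as a simple consistency check: the usable MTU on a path is limited by its smallest hop. The hop names and MTU values below are hypothetical, and the sketch only compares configured values; it does not probe the network or perform path MTU discovery.

```python
def effective_path_mtu(hop_mtus):
    """The largest frame that fits through every hop is the smallest configured MTU."""
    return min(hop_mtus.values())


def undersized_hops(hop_mtus, desired_mtu):
    """Names of hops whose configured MTU is below the desired frame size."""
    return sorted(name for name, mtu in hop_mtus.items() if mtu < desired_mtu)


# Hypothetical configuration: jumbo frames enabled everywhere except the router.
hops = {"host": 9000, "switch": 9000, "router": 1500, "vlan": 9000}
print(effective_path_mtu(hops))     # smallest MTU along the path
print(undersized_hops(hops, 9000))  # hops that would need reconfiguring
```

A check like this only helps when every device on the path is included, which is why it falls to site admins to gather the settings.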

Tier2 meeting at SLAC

Analysis Queues (Bob, Mark)

  • See AnalysisQueueP2
  • Condor instructions are set up for analysis jobs.
  • AGLT2: no jobs are getting assigned to the site; consulting Tadashi.
  • Action item: Mark will provide similar instructions for PBS.
  • Action items moving forward (each site):
    • We need to set up analysis queues.
    • Allocate a small number of CPUs to this site.

OSG Integration (Marco)

  • Testing on ITB 0.7, Site Validation Table
  • Update on validation of Panda on OSG ITB 0.7 (UC_ITB). Ran more than 20 full Panda jobs over three days last week (changed UC_ATLAS_MWT2 siteinfo in pilots to point to UC_ITB site). Info on jobs:

  • Agree we should check off green in the OSG VO validation table.

Nagios monitoring (Tomasz)

  • Nagios service groups and hierarchy: carryover.
  • Dantong reports progress wrapping RSV probes for Nagios.

Site news and issues (All Sites)

  • T1: FTS upgrade in progress; Thumpers being installed (now on rack #2). dcache issues have been addressed, with more work to follow on understanding the effect.
  • AGLT2: Expected ship dates for new cores: Oct 2-5 + three days. 300 UM/400 MSU in November. Expect to be operational a week afterwards.
  • NET2: no report
  • MWT2: new IU_OSG site progressing; panda mover issue; 10G NIC replacement; UC_ATLAS_MWT2 nearly rebuilt.
  • SWT2_UTA: debugging autopilot issues; setting up host for ; SWT2 gatekeeper not reporting.
  • SWT2_OU: expect Dell visit Oct 8 and rocks/osg/dq2 deployment thereafter
  • WT2: Tier2 meeting - November/December is busy at SLAC, so book hotels early. Addressing the panda mover issue at SLAC. Getting probation to run without gsi-authentication. Backport old job accounting information into Gratia.

RT Queues and pending issues (Dantong, Tomasz)

Carryover action items


  • Encryption to syslog-ng: still to do, carryover.

Site performance jobs and metrics (Rob)

  • Carryover

RSV, Nagios, SAM (WLCG) site availability monitoring program (Dantong, Tomasz)

New Action Items

  • See items in carry-overs and new in bold above.


  • None

-- RobertGardner - 25 Sep 2007
