Minutes of the Facilities Integration Program meeting, January 30, 2008
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Shawn, Wei, Patrick, Charles, Rob, Rich, John, Tom, Marco, Fred, Michael, Bob, Jay, Torre, Nurcan, Kaushik, Wensheng, Horst
  • Apologies: none

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Autopilot problems and remedies; submit host issues at BNL
    • Migration of the Panda server to new hardware. This will help resolve a number of issues, including network problems (database and server on the same subnet) and NFS filesystem problems. Smooth so far.
    • dCache upgrade in an hour - will stop submission of local pilots at BNL for a while.
    • There were problems with dq2_cr due to a new version of the gLite clients, affecting PandaMover; lots of new transfer failures right now. John Hover will revert the dq2_cr host to its previous configuration (with Tadashi).
    • Operations news:
      • Rolling shifts in effect between EU and US - 15 hours. "ADC Shifts". Use Hypernews forum - Kaushik will circulate.
      • How does the shift interface with the Facility? There was a strong feeling to use the GGUS system during operations workshop at CERN.
      • Expert on call. 3 shifters on call in Europe.
      • Plan on 24-hour coverage when data arrives.
    • Follow-up on Eowyn scalability problems.
      • There are still issues with job status updates being made too late (3 days in some cases). 10 hours to restart Eowyn! Problem is that the amount of information in a job definition has grown very large.
      • Still an issue
      • Tadashi has written a server to pull jobs independently from the prod database. Can run in parallel with Eowyn. Can it be run to handle status updates? May not be able to since Eowyn is not completely stateless (Eowyn owns jobs).
      • Roll-out perhaps next week.
  • Production shift report (Mark/Nurcan)
    • Wensheng on shift: submitted validation bug report. OU OSCER maintenance. UTA_DPCC in maintenance.
  • Autopilot issues of last week (Bob/Torre)
    • 9 am jobs being deleted at UM
    • This seems to have been resolved by rolling back changes in the Condor pilot scheduler.

SRM v2.2 and pinning (Gabriele)

  • Follow-up on the bring-online functionality
    • Action item - report back on update from Miguel; still no word.
  • Working with OSG storage and integration groups on SRM validation
    • Discussed at last week's OSG Integration meeting
    • Three OSG validation sites are installing dCache 1.8 for testing: BNL, UC, and LBNL, in addition to the Fermilab site. The timeline is to have these sites functional and ready for testing by Feb 1.
  • Shawn reminds us about the LBNL testing service.
  • Next step - FTS-type tests.

LFC (John)

  • Following up:
    • Setting up a Panda test site - Mark Sosebee and Tadashi, on Friday.
    • Could be issues w/ authentication
    • Next steps - installation and migration.

Operations: DDM (Alexei/Kaushik/Hiro)

  • Status of AOD replication for analysis at Tier2s
  • Will need to follow-up on AOD replication. Recovering files previously replicated without the archival bit.
  • Test samples for Nurcan have been distributed.
  • At the SLAC meeting we discussed replicating Rel 12, 13 at the Tier2s. Abandon this goal?
  • Should we focus instead on FDR datasets?
  • Hiro: Tier2s are deleting AODs via cleanse.py. AODs are not coming in with the archival flag; Alexei will do this in the future.
    • Suggestion was to delete them all, and start over. Can we do this in a more selective fashion?
    • Note there is a -Panda flag to distinguish Panda files and AOD files at a site.
    • What should we be doing? Patrick and Hiro will discuss w/ Miguel. Charles interested in contributing.
  • How should we handle multiple datasets? Manual operation at the site - but we need information about the datasets.
  • CCRC datasets are copies - replicas of FDR data - so the archival flag should be off.
  • We need space for two versions of the AODs most likely.
  • SRM v2.2 sites - these can be centrally deleted. Do we want to do this? Certainly not for FDR, but for CCRC this would be okay.
  • Discussion about dataset reservation at sites - it would be good to publish this.
  • Will still need management oversight.

Analysis Queues (Bob, Mark, Nurcan)

  • See AnalysisQueues; updated DONE
  • AOD/ESD-based analysis demonstrator across Tier2s
  • Report from Nurcan and Mark; suggestions:
    • requested a dataset to be replicated to all tier2's - finished.
    • have tested all analysis queues
    • first pathena job had hostname error - Tadashi fixed
    • problems querying lrc - fixed.
    • cannot compile the SUSY validation analysis package at tier2s. Works okay at BNL. Missing libs between SL3 and SL4. Xin investigating.
    • John may have a pacman package for missing libraries for development.

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • Follow-up - last week: SWT2_UTA (Patrick) - one step closer; still need to get registered in VORS; will be delayed since there is no operations meeting on Monday. Post 28th.
      • Will get into VORS
    • Follow-up - last week: BNL mappings (Xin) - very close; there is a plan to change the names for BNL in a couple of places.
      • Resolved.
      • Still discussing w/ OSG gratia group adding resource ownership into the accounting system.
    • US ATLAS Facility view (Rob) - after resolution of the BNL mapping issue.

Throughput initiative - status (Shawn)

  • Update from meeting this week
  • Need updates from sites on endpoints
  • IU testing - dCache discussions for pool-to-pool transfers
  • OU and NET2 still to be tested this week. Meeting again on Monday.
  • Transfers slow down when the file size is 2GB; not yet reproduced.

Throughput goals status and schedule ( follow-up from last meeting)

  1. Each site 200MB/s? (or best value): Status: AGLT2 and MWT2-UC have reached this. SLAC has reached a best value of 110MB/s. Next up Wisconsin, then UTA, OU, MWT2-IU and NET2? Order could change but assume we can finish all sites in the next two weeks.
  2. 10GE sites 400MB/s?: Status: AGLT2 and MWT2-UC have reached this value. Still need to test MWT2-IU. There are no 10GE hosts at OU or NET2, but enough machines in aggregate should be able to reach this level. Schedule? Estimate the remaining sites could be completed as part of the testing in 1) above.
  3. Long-term (24+ hours) of 500MB/sec BNL->Tier-2s? Status: We demonstrated 500-600MB/sec for most of the weekend, two weekends ago.
  4. Demonstration of BNL->ALL_Tier-2s at 200MB/s EACH (1GB/sec) for long period? Status: this will have to await new/upgraded doors at BNL and the completion of goals 1) and 2) above.
  5. Measurement of “maximum” burst mode bandwidth for each site (20-60 minute period?) Status: This could be started once we complete 1) and 2) above. The maximum "maximum" may be limited by BNL's current config at somewhere between 700-800MB/sec. This testing could be completed in 1 week (assuming each site is already debugged and meeting goals 1) and 2) if applicable).
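The arithmetic behind goals 4 and 5 above can be sketched in a few lines of Python. The list of five Tier-2 centers is an assumption based on the sites named in these minutes, and the rates are targets rather than measurements:

```python
# Aggregate-throughput arithmetic behind goals 4 and 5.
# The site list is an assumption (the Tier-2 centers named in these minutes);
# per-site rates are the stated targets, not measured values.
PER_SITE_MBPS = 200  # MB/s target per Tier-2 (goal 4)
tier2_sites = ["AGLT2", "MWT2", "NET2", "SWT2", "WT2"]  # assumed list

aggregate = PER_SITE_MBPS * len(tier2_sites)
print(f"Aggregate BNL -> all Tier-2s: {aggregate} MB/s")  # 1000 MB/s = 1 GB/s

# Goal 5 notes that BNL's current configuration caps bursts at roughly
# 700-800 MB/s, i.e. below the aggregate needed to hit goal 4 at all
# sites simultaneously.
BNL_BURST_CAP_MBPS = 800
print("Limited by BNL cap:", aggregate > BNL_BURST_CAP_MBPS)
```

This makes explicit why goal 4 "will have to await new/upgraded doors at BNL": the five-site aggregate target exceeds the current burst ceiling.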

  • Need from sites:
    • Network diagram(s). See [[https://www.usatlas.bnl.gov/twiki/bin/view/Admins/NetworkDiagrams][NetworkDiagrams]] for what we have so far.
    • Disk performance. See [[https://www.usatlas.bnl.gov/twiki/bin/view/Admins/LoadTestsP3][LoadTestsP3]] for current information.
    • Optimal number of streams on each site
    • add these to the site certification table to check off
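One common way to estimate the "optimal number of streams" requested above is the bandwidth-delay product divided by the TCP window each stream can keep in flight. This rule of thumb is not specified in the minutes, and the numbers below (RTT, window size) are purely illustrative:

```python
import math

def streams_needed(target_mb_per_s, rtt_ms, tcp_window_bytes):
    """Rule-of-thumb parallel-stream count: bandwidth-delay product
    divided by the bytes one TCP stream can keep in flight."""
    bdp_bytes = target_mb_per_s * 1e6 * (rtt_ms / 1000.0)  # bytes in flight needed
    return math.ceil(bdp_bytes / tcp_window_bytes)

# Illustrative numbers (assumptions, not site measurements):
# 200 MB/s target, 20 ms RTT BNL -> Tier-2, 64 KB TCP window per stream.
print(streams_needed(200, 20, 64 * 1024))       # ~62 streams with small windows
print(streams_needed(200, 20, 4 * 1024 * 1024)) # 1 stream with a 4 MB window
```

The two cases show why stream counts must be tuned per site: with properly enlarged TCP windows a single stream can fill the path, while default windows require dozens of parallel streams.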

  • This coming week: follow-up
    • UTA - will start next week; Jay notes need iperf
    • BU - will still be limited to a single host at 1G; can they reach 120 MB/s disk-to-disk? Saul will send the path to Hiro and Jay.
    • SLAC - have demonstrated 110 MB/s already. Two gridftp doors with BeStMan SRM. Awaiting the 10G upgrade for further tests.
    • Monday meeting - status update from all the sites

  • Shawn will create a table in the LoadTestsP5 task for path, local I/O performance.

Panda release installation jobs (Xin)

  • Now have a production submit host. Need conduits opened up. Early next month.
  • Meantime using a temporary machine - testing is working okay on some sites. Basic functionality works.
  • Change to the Panda monitor to isolate release installation jobs? Xin is discussing with Torre.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Follow-up: split of the Nagios server into internal and external instances - work has now started; the server has been built, and the external instance will be moved to a new machine.
    • Expect an update next Wednesday.
  • RSV publishing to WLCG
    • Dantong - looking into US Facility reporting of SAM data; entries are not appearing. Will follow-up with Rob Q.
    • Meeting this week to separate
    • Mid-Feb web interface to RSV collected data from sites.
    • Local RSV to Nagios publishing.
  • Tomasz: problems pinging hosts at UTA and MWT2.

Site news and issues (all sites)

  • T1: Relocating Panda server. dCache 1.8 upgraded. Jay: gathering hardware information for throughput testing.
  • AGLT2: waiting for activated jobs.
  • NET2: no news.
  • MWT2: brought up new hardware at IU and UC (65/40 servers)
  • SWT2_UTA: no news
  • SWT2_OU: 10G upgrade - expect about 2 months more of work. gridftp server being replaced
  • WT2: Still doing hardware installation - will take a while to bring the new nodes online. Doing some network testing for 10G with ESnet. Working with the Panda team to use the new SRM endpoint.

RT Queues and pending issues (Tomasz)

Carryover action items

New Action Items

  • See items in carry-overs and new in bold above.


  • none

-- RobertGardner - 29 Jan 2008
