r18 - 08 Aug 2007 - 14:54:48 - RobertGardner

MinutesAug8

Introduction

Minutes of the Facilities Integration Program meeting, Aug 8, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: John B, Alexei, Michael, Rob, Charles, Wei, Fred, Tom, Marco, Rich, Jay, Patrick, Horst, Karthik, Kaushik
  • Apologies: none

Integration program update (Rob, Michael)

  • Phase 1 complete, see SummaryReportP1
  • Starting up Phase 2
  • We need to work out the specifics: fine-grained dates and deadlines.
  • There may be particular boundary conditions, dependencies, etc. AI: develop these, RFQ (Rob).
  • Need to review as a group, bring concerns to the meeting.

Site procurements discussion (Michael)

  • See https://www.usatlas.bnl.gov/twiki/pub/Admins/MinutesAug8/T2_pledges.pdf
  • We have a plan for the next round of procurements. Funding plan becoming clear - funding in one piece, in November.
  • Start the planning process, according to the pledges made. Consider vendor quotes, physical environment, timing of procurements.
  • Aim to have equipment installed by mid-FY08: on the floor and available by the first of April.
  • We don't want to be too restrictive, nor do we want to leave it entirely up to each site.
  • Discuss plans, problems with physical infrastructure, etc. here, in this meeting.
  • Kaushik: the numbers are for calendar-year rather than fiscal year.
  • Question about the table - the numbers should reflect 20% less than the total, to reserve 20% for exclusive US access.
  • Define a starting point for joint planning.
  • Uncertainty - ATLAS AOD size and event size. What is the steady-state need? AI: Alexei to follow up on this, for guidance.
  • Procurement, delivery, operational
  • Generally two major time periods: this fall, to finish FY07 funds; late spring, using FY08 funds. Perhaps two new tables for capacity.
  • Agreement from the group (Shawn, Tom, Saul, Kaushik, Horst, Wei, Rob) to set this up in a spreadsheet/twiki.
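
As a worked illustration of the 20% question above, the arithmetic can be sketched as follows. This is only a sketch under the assumption that the reserve is a flat 20% of each site's total capacity; the function names are hypothetical and the numbers in the usage note are made up, not taken from T2_pledges.pdf.

```python
# Sketch only: assumes a flat 20% of total capacity is reserved for
# exclusive US access, as raised in the discussion. Names are hypothetical.
US_RESERVE_FRACTION = 0.20


def split_capacity(total):
    """Return (pledged, reserved) shares of a site's total capacity."""
    reserved = total * US_RESERVE_FRACTION
    pledged = total - reserved
    return pledged, reserved


def total_needed_for_pledge(pledge):
    """Return the total capacity a site must install to meet a given pledge."""
    return pledge / (1.0 - US_RESERVE_FRACTION)
```

For example, under this assumption a site with 1000 units of total capacity would show 800 in the pledge table, and a pledge of 800 implies installing about 1000 in total. The same helper applies to CPU or disk as long as both figures use the same unit.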

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • facilities-status-aug07_2.ppt: Production Summary Slides
    • Change of metrics - from number of successful jobs to wall-time consumed by successful jobs (CPU-days). Would like to move to kSI2K, but it is no longer being stored, and the conversion factors are suspect. CPU time, wall time, and type are currently being reported. Uniform benchmark job?
    • MWT2 issues - slow at IU, due to data transfer problems.
    • BU numbers up, SLAC numbers up.
    • Panda crossed 3K jobs concurrent!
    • Canadian site, UBC.
    • UTD Tier3
    • Request from Shawn to keep both metrics around for a while.
    • Q from Michael - backlog of 4500 jobs
      • Volume not known, and fluctuates
  • Shifters report and other production issues (Mark)
    • Success at UBC and UTD (~500 CPUs).
    • Ibrix issue at OU - but stable for 4 weeks.
    • Pilot upgrade from Paul
    • Pilot Checker update
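
To make the metric change in the production summary concrete, here is a minimal sketch that computes both metrics side by side, as Shawn requested. It is not the Panda accounting code: the JobRecord fields and function names are hypothetical, and wall time is assumed to be recorded in seconds.

```python
from collections import defaultdict
from dataclasses import dataclass

SECONDS_PER_DAY = 86400


@dataclass
class JobRecord:
    """Hypothetical job record; field names are assumptions, not Panda's schema."""
    site: str
    succeeded: bool
    wall_seconds: float


def successful_job_counts(jobs):
    """Old metric: number of successful jobs per site."""
    counts = defaultdict(int)
    for job in jobs:
        if job.succeeded:
            counts[job.site] += 1
    return dict(counts)


def successful_cpu_days(jobs):
    """New metric: wall time consumed by successful jobs per site, in CPU-days."""
    seconds = defaultdict(float)
    for job in jobs:
        if job.succeeded:
            seconds[job.site] += job.wall_seconds
    return {site: total / SECONDS_PER_DAY for site, total in seconds.items()}
```

Failed jobs are ignored by both functions, so the two metrics differ only in how they weight successful jobs: many short jobs raise the count, while a few long jobs raise the CPU-days.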

Operations: DDM (Alexei)

  • See Alexei's status notes here: AlexeiNotesAug1
  • Functional testing
  • AOD replication
  • Functional tests started yesterday - BNL is the only site getting test data.
  • DQ2 servers have crashed twice. EU, Taiwan, NG -- there are several sites. All LCG sites have issues with file catalogs, cleaning, etc.
  • Privilege and default queues are no different in DQ2. Fair share does not work in the presence of backlogs or other problems.
  • Last week's AOD replication issues are thought to trace to BNL, possibly dCache. Needs more investigation.
  • Plan for this week: continue functional tests; 4GB files near the end of the week. Will send data to US Tier2s, but in modest amounts.
  • New version (0.3_4?) - slightly better, but high load not yet solved.
  • Alexei describes an incident where Napoli requested data from UTA - holes in DQ2 logic. Alexei would like to rigorously enforce the hierarchy.
  • Michael's comments: for BNL, there is clearly a potential bottleneck in the write pools and associated hardware. There is a high fraction of I/O waits in the disk systems when concurrent streams are being served. The hardware needs to be replaced; this is close (next few days).

DDM Testbed (Rob)

  • Initial thoughts: DDMIntegrationTB
  • Timeframe - end of August for a new DQ2 release. UTA, BU have dedicated resources.
  • Timeframe for the new LRC code?
  • New version of FTS 2.0 is available. SRM v2 dependence?

New cleanse.py (Patrick)

  • Read email, send any problems.

Transfer behavior observations (Charles)

Network Performance (Shawn, Dantong)

Load testing development (Jay)

  • Help from MonALISA developers; module working for gridftp tests. It opens up an app-control, which runs g-u-c (globus-url-copy) in a scheduled manner.
  • Need a location, paths, for each of the sites. srm vs g-u-c. Consult ToA.
  • Next - will look into how/what to monitor.
  • Public certificate is needed to use app-control.

Site news and issues (All Sites)

  • T1: Tom/Jay: waiting for comments on Nagios services.
  • AGLT2: Initially tried to keep AOD replica production up to 100, but had to shut down. Lots of DQ2 fetcher problems. May put in new code. Also some DNS issues, but resolved.
  • NET2: Running well with production once AOD replication stopped. Last night, started to drain, traced to missing DQ2 subscriptions. Restarted, rebuilt DB, and ramping up again.
  • MWT2_IU: Looking at network monitoring plot, data transfer rate is low. Dantong - probably don't have all the subnets.
  • MWT2_UC: Down tomorrow for UPS install, disk firmware upgrade.
  • SWT2-UTA: New machine room in the physics building is ready; expecting to have a one-day outage with the current production cluster for electrical power distribution issues in the off-campus center.
  • SWT2-OU: Not much new - still waiting for final date for the move.
  • WT2: Two issues. First, outside access to the LRC web interface is currently blocked; testing a reverse proxy for the Apache server to provide read-only access to the outside world. Second, a question to Alexei: where to get information about the percentage of AOD.

Carryover action items from previous meetings

  • All sites: check off actions taken in SiteCertificationP1 DONE
  • Follow-up with Shawn and folks on first set of network I/O load tests. Need to install NDT on dedicated host at each site. Then tests on-demand can take place. Dantong driving this. DONE
  • Follow-up with OSG troubleshooting on AGLT2 gatekeeper problem. Shawn and Yuri to follow up. (deprecate?) There was a solution found for PBS at LSU - a required INCLUDE was not available. Will follow-up.
  • Encryption to syslog-ng: still to do, carryover.
  • Install NDT at each site - put in site certification table. Follow-up next week.
  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.
    • Dantong has guidelines.
    • There are a couple of tickets that are not getting attention.

New Action Items

  • See items marked AI above.

AOB

  • None

-- RobertGardner - 07 Aug 2007

Attachments


pdf T2_pledges.pdf (11.6K) | RobertGardner, 07 Aug 2007 - 19:05 | summary of LCG pledges from USATLAS
pdf transfers-iu.pdf (18.7K) | RobertGardner, 08 Aug 2007 - 11:45 |
pdf transfer.pdf (15.7K) | RobertGardner, 08 Aug 2007 - 10:51 | data transfers from IU
pdf transfers-uc.pdf (18.0K) | RobertGardner, 08 Aug 2007 - 11:45 |
ppt facilities-status-aug07_2.ppt (108.5K) | KaushikDe, 08 Aug 2007 - 11:53 | Production Summary Slides