r6 - 25 Jul 2007 - 15:28:51 - RobertGardner

MinutesJul25

Introduction

Minutes of the Facilities Integration Program meeting, July 25, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern, 1-510-665-5437 #24743

Attending

  • Meeting attendees: Shawn, Karthik (OU), Alexei, Charles, Joe, Wensheng, Kaushik, Fred, Xin, Mark (UTA), Patrick, Hiro, Rob, Michael, plus...
  • Apologies: Nurcan

Operations: Production

  • Production summary from Kaushik
    • facilities-status-jul07_2.ppt: Production Summary Slides
    • All sites doing very well. Problems with NET2.
    • Still limited by a lack of defined jobs. Who is responsible for supplying jobs? They are defined for the physics groups, coordinated by Ian, with assigned quotas. Note that physicists are also busy validating release 13.
  • Shifters report and other production issues (Mark)
    • There is a new version of the pilot code provided by Paul to do more error handling. This was a BNL specific issue; Paul asked only Xin to do this. What about the pilot checker? Marco will add this to the pilot checker.
    • Problems continue at UT-Dallas and UC teraport (GPFS).
    • Has there been progress with UT-Dallas on optimistic access? Kaushik notes they will use the SWT2 storage and DQ2. Mark will follow up.

Operations: DDM (Alexei)

  • DQ2 0.3 and other DDM issues
  • See DQ2SiteServicesP1 for latest status and issues on the DQ2 0.3 upgrade.
  • AOD replication to all Tier1 sites has resumed. Monday, July 30, should start replication to Tier2s according to the plan as of June 18.
    • Total data volume is 12 TB for all AODs.
    • GL, SLAC, MWT2: 100%; UTA_SWT2 10% (expect disks to come online over the next month); No NET2.
    • Hiro has scripts that parse DQ2 logfiles at BNL.
    • What about queues? Are the defaults sensible? Hiro will send info on this. Their shares: Tier0, production, default.
  • Functional tests in the second week of August. Two sets: 0.5 GB and 4 GB sized files (0.5 TB). Replicate to T1, and then T2. Hiro metrics: time from subscription to completion.
    • There was a problem with the conditions database being staged from Castor - would like to check this with well-known datasets.
  • Access to logfiles - need access to the central server logs. Need more than just ARDA monitoring. Alexei is pursuing public access for this.
  • M4 Cosmic run in Aug/Sep, about 10 TB. Agreement is to distribute this as well to Tier2s. In long term, will need to keep roughly two copies around (current and next).
  • Hiro reports Miguel has a DQ2 0.3 update ready. We had a general discussion about rolling out changes in DQ2. Need to set up a flexible testbed. Rob will develop a short plan for this.
  • Troubleshooting - Michael has dedicated effort for this at the facilities.

Analysis Queues

  • See AnalysisQueueP1
  • Follow-up on Condor config options: Mark and Bob
  • Implementation on sites - need to update the twiki. Shawn will ping Bob for the Condor section and Patrick for the PBS section.

Site certification review for Phase 1

OSG 0.6 deployment update

  • OU_OCHEP_SWT2 - still waiting for the move.
  • UTA_SWT2 - the headnode in use for Panda is already 0.6. Diagnosing hardware failures.
  • OSG site admins meeting at Fermilab - OU, UC, BNL will be representing ATLAS.

Network Performance (Shawn)

  • NetworkPerformanceP1
  • About half of the sites have installed NDT. There has been some work within Ultralight on advanced scheduled transfers for CMS - there is a document that Shawn will circulate. Perhaps US ATLAS sites would want to participate.
  • M-to-M iperf test to get achievable bandwidth between sites. Plugging into Jay's load testing framework is an option.
  • Dantong will run tests from BNL to each NDT server running at the Tier2s.
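
As a rough illustration of what an M-to-M test implies, the sketch below enumerates every ordered source/destination pair for a set of sites. The site labels are taken from these minutes, but the list itself, the function name mesh_pairs, and the idea of driving tests from such a list are assumptions for illustration only; this is not a description of Jay's load testing framework, and no actual iperf invocation is shown.

```python
from itertools import permutations


def mesh_pairs(sites):
    """Return every ordered (source, destination) pair of distinct sites."""
    return list(permutations(sites, 2))


# Illustrative site list (labels borrowed from the minutes; not an agreed test set).
SITES = ["BNL", "AGLT2", "MWT2", "SWT2", "NET2", "SLAC"]

if __name__ == "__main__":
    # Each direction is listed separately, since achievable bandwidth
    # can differ between A -> B and B -> A.
    for src, dst in mesh_pairs(SITES):
        print(f"{src} -> {dst}")
```

With N sites this gives N * (N - 1) directed tests, so six sites would mean 30 transfers per round.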

Pilot Checker (Marco)

  • See PilotCheckerP1
  • Checks site information, CE, and pilot submission. Several other checks are also available.

Syslog-ng (Rob)

  • See LoggingServicesP1
  • Still need to work on encryption with stunnel. See also note above about troubleshooting console.

Carryover action items from previous meetings

  • All sites: OSG 0.6 installations - see updates above.
  • All sites: check off actions taken in SiteCertificationP1; will set up a timeline.
  • Follow-up with Shawn and folks on first set of network I/O load tests. Need to install NDT on dedicated host at each site. Then tests on-demand can take place.
  • Follow-up with OSG troubleshooting on AGLT2 gatekeeper problem. Shawn and Yuri to follow up.
  • Encryption to syslog-ng - still to do.
  • Install NDT at each site - put in site certification table. In progress.
  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.

New Action Items

  • See bold items above.

-- RobertGardner - 24 Jul 2007


Attachments


ppt facilities-status-jul07_2.ppt (74.5K) | KaushikDe, 25 Jul 2007 - 12:06 | Production Summary Slides
 