r12 - 24 Oct 2007 - 13:00:13 - JayPackard

MinutesOct17

Introduction

Minutes of the Facilities Integration Program meeting, October 17, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Charles, Marco, Rob, Jay, John, Wei, Horst, Karthik, Alexei, Xin, Hiro, Rich, Shawn, Nurcan, Tom, Fred, Saul, Wensheng
  • Apologies: Patrick, Kaushik

Integration program update (Rob, Michael)

  • Phase 3 planning underway: here
  • Tier2 meeting at SLAC
    • Website: http://wt2.slac.stanford.edu/events.shtml
    • Agenda: in Indico at CERN (previous US ATLAS Tier2 meetings)
    • Current draft proposal - three parts:
      • Day 1: overview of LHC, ATLAS, Facility status; production and analysis at Tier2 and Tier3 (with specific examples).
      • Day 2: fabric hardware, fabric services (batch, storage, dCache, xrootd); installations; data transfer optimizations; OSG middleware (installations, clients, etc.); ATLAS software (DDM tools, LRC, Panda workflow management, Panda-driven ATLAS software installation); running the entire chain with Panda+Proof+Xrootd.
      • Day 3: site reports, summary discussion, and further planning.
    • Expect to get first draft out by end of the week.

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Production is going fine (except for the Ibrix problems at SWT2 and OU being down for the move). Some reprocessing has started - Panda is doing well. Full flood has not been opened yet; maybe this weekend. There was no impact on production from the HPSS outage yesterday, thanks to advance planning.
  • Shifters report and other production issues (Wensheng)
    • Everything looks normal
    • Several sites are down for scheduled maintenance
    • AGLT2 - lost a switch that serves headnodes for dCache
  • The number of analysis jobs is increasing rapidly, and we need to replicate the AODs to the Tier2s.

Operations: DDM (Alexei)

  • Situation with AODs: BNL is the only Tier1 with 80-90% of all AODs, and the only place where TAGs and ntuples are available. Replication to Tier2s was stopped due to interference. Now that Panda mover is in place, we can resume. Need requested fractions from sites. Approximately 20 TB.
  • M4 agreement was 100% of all ESDs to all Tier2s except UTA. There are several versions; at 13.0.26. This is 1.4 TB.
  • M5 will start next week. Data taking will begin Friday/Saturday morning. Estimates: RAW = 100 TB, ESD = 10 TB, combined ntuples = 10 TB.
  • Functional tests - update?
    • Request from management - need a way to conduct functional tests regularly. Other requests for performance tests as well.
    • Will probably take until November.
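
For planning purposes, the volumes quoted above translate into transfer times by simple arithmetic. The following is a rough sketch only: the sustained rate of 100 MB/s is an assumed example value (the notes do not state a replication rate), decimal units are assumed, and `transfer_days` is a hypothetical helper.

```python
TB = 1e12  # assumption: decimal terabytes
MB = 1e6   # assumption: decimal megabytes


def transfer_days(volume_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move volume_tb at a sustained rate, ignoring all overheads."""
    seconds = volume_tb * TB / (rate_mb_per_s * MB)
    return seconds / 86400


# Volumes are taken from the notes; the rate is an assumed example value.
VOLUMES_TB = {
    "AOD replication": 20,
    "M5 RAW": 100,
    "M5 ESD": 10,
    "M5 combined ntuples": 10,
}
ASSUMED_RATE_MB_S = 100

if __name__ == "__main__":
    for name, tb in VOLUMES_TB.items():
        days = transfer_days(tb, ASSUMED_RATE_MB_S)
        print(f"{name}: {tb} TB, about {days:.1f} days at {ASSUMED_RATE_MB_S} MB/s")
```

The estimate scales linearly, so substituting a measured rate for the assumed one gives a more realistic figure.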

DQ2 0.4 testing, deployment (Hiro, Patrick)

  • See DQ2SiteServices, which captures deployment experience and known issues.
  • DQ2 0.4 running stable at BNLPANDA.

FTS monitoring (Hiro)

  • FTS monitoring - update? Not sure it's any better than the DQ2 dashboard.
  • Now installed. Still setting up the reverse proxy. Will send location (login using Kerberos) - hopefully today.

MySQL LRC (John)

  • Evaluation progress - continuing to correspond with Pedro.

Accounting (Shawn, Rob)

  • From last week:
    • Follow-up on pending accounting issues, see AccountingP2 and items therein.
    • WT2 - reporting correctly at Gratia level
    • NET2 - should be reporting correctly from October onwards.
    • SWT2 - normalization factors missing. DPCC should be okay. For UTA_SWT2 - need to change the machine doing the Gratia reporting. Need to make sure data is being attributed to ATLAS correctly. Expect to be complete be
    • AGLT2 - reporting okay, just some historical issues.
  • Any updates and/or issues?
    • There are still some accounting view-grouping problems
    • IU-OSG not reporting?
    • SWT2_UTA still being addressed, also as unregistered.
    • BU_ATLAS_Tier2o - Saul will check

Network Performance and Throughput initiative (Shawn, Dantong)

  • See work in progress at NetworkPerformanceP2
  • Work w/ SLAC (Dmitri, Jay):
    • Before tuning, it was less than 50 Mbps. After tuning, up to 800 Mbps.
    • Some effects that look like competing traffic.
    • Multiple streams, no significant change. Ceiling established somewhere between 800 and 900 Mbps.
    • Does not appear to be directionally dependent
  • What is RTT? 80 msec.
  • Move buffer above 16 MB? Might be hitting this limit. Try 20-24 MB.
  • Note that their kernel does not auto-tune - worry about too many connections.
  • Next week: revisiting Michigan, start on BU
  • Defer on OU until hardware installation
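
The buffer discussion above follows from the bandwidth-delay product: a single TCP stream cannot move more than one window per round trip. A minimal sketch of that arithmetic, using the 80 msec RTT and the 16, 20 and 24 MB buffer sizes from the notes. The 1000 Mbps target rate and the decimal reading of MB are assumptions for illustration, and the function names are hypothetical.

```python
def max_throughput_mbps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


def required_window_bytes(rate_mbps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return rate_mbps * 1e6 * rtt_seconds / 8


RTT = 0.080     # 80 msec, from the notes
MB = 1_000_000  # assumption: decimal megabytes

if __name__ == "__main__":
    for size_mb in (16, 20, 24):
        bound = max_throughput_mbps(size_mb * MB, RTT)
        print(f"{size_mb} MB window: at most {bound:.0f} Mbps per stream")
    needed = required_window_bytes(1000, RTT) / MB
    print(f"window needed for 1000 Mbps: {needed:.1f} MB")
```

These are theoretical bounds only. The usable window can be smaller than the configured buffer, which is one reason a larger setting may still be worth trying.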

Load testing update, issues (Jay)

  • From last week:
    • iperf is there, but not at all the sites. Plan is to use bwctl at all sites.
    • There is a question of the accuracy of the plots.
    • Need an iperf endpoint for BU and Michigan.
    • BU will install iperf.
  • Updates this week

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Meeting this week w/ Arvind from OSG
  • Looking at ways to collect results from probes.
  • Will write a generic wrapper for Nagios to RSV probes - plan to release this Friday.
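
A generic wrapper of the kind described above could look roughly like the following. This is only a sketch: exit codes 0 to 3 are the standard Nagios plugin convention, but the assumption that a probe prints a `metricStatus:` line carrying OK/WARNING/CRITICAL/UNKNOWN, and the names `parse_metric_status` and `run_probe`, are illustrative and not taken from the actual RSV probes or the planned release.

```python
import subprocess
import sys
from typing import Sequence

# Standard Nagios plugin exit codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def parse_metric_status(probe_output: str) -> str:
    """Return the status on the first 'metricStatus:' line (assumed output format)."""
    for line in probe_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "metricStatus":
            status = value.strip().upper()
            return status if status in NAGIOS_CODES else "UNKNOWN"
    return "UNKNOWN"


def run_probe(command: Sequence[str], timeout: float = 60.0) -> int:
    """Run a probe command, print a one-line summary, return a Nagios exit code."""
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN - could not run probe: {exc}")
        return NAGIOS_CODES["UNKNOWN"]
    status = parse_metric_status(result.stdout)
    print(f"{status} - {' '.join(command)}")
    return NAGIOS_CODES[status]


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("UNKNOWN - usage: wrapper.py <probe> [args...]")
        sys.exit(NAGIOS_CODES["UNKNOWN"])
    sys.exit(run_probe(sys.argv[1:]))
```

Anything the wrapper cannot interpret maps to UNKNOWN rather than OK, so a broken or missing probe is never reported as healthy.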

Site news and issues (All Sites)

  • T1: two major upgrades this week. Panda database server will be swapped in tomorrow. Second server has been set up. HPSS and robot upgrade complete (6000 slots); fast and reliable. dCache was upgraded. Joined the Tier0 distribution test - 250 MB/s between CERN and BNL, 100% efficiency (no errors).
  • AGLT2: network switch replaced - recovering now. Expect that in an hour or two we'll be good to go again.
  • NET2: have installed iperf for Jay. Correctly accounting in Gratia, but not in WLCG.
  • MWT2: production sites okay; IU_OSG coming online, memory limitation probs (1G/core).
  • SWT2_UTA: bringing new cluster online - started having storage problems with the Ibrix segment servers. Kernel panics over the weekend - traced to some corrupt disks; the filesystem has not fully recovered. Trying to install a newer version.
  • SWT2_OU: need to install new version of Ibrix 3.0 on the new cluster. Working on a Condor issue (v 6.8). Will consult with UMich. Web server probs - huge SSL error logfiles filling system partition. Then OSG and LRC. Hopefully by next week.
  • WT2: Production going well after Panda mover switch. Tier2 meeting - a web team is working on the web site. Online credit card payment! Expect official webserver online. (Start at 9am until 1pm on Friday.)

RT Queues and pending issues (Dantong, Tomasz)

Carryover action items

Panda release installation jobs

  • Need to find a Facilities person to work with Tadashi

Analysis Queues (Bob, Mark)

  • See AnalysisQueueP2
  • Action item: Mark will provide similar instructions for PBS. Mark is still working on it.
  • Main problem is that the set of AODs is not available.
  • Action items moving forward (each site):
    • We need to set up analysis queues
    • Allocate a small number of CPUs to this site

Syslog-ng

  • Encryption for syslog-ng: still to do, carryover.

Site performance jobs and metrics

  • Carryover

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 16 Oct 2007

  • 1 stream:
    1stream.png

  • 12 streams:
    12streams.png

Attachments


png 1stream.png (28.2K) | JayPackard, 24 Oct 2007 - 12:59 | 1 stream
png 12streams.png (28.2K) | JayPackard, 24 Oct 2007 - 13:00 | 12 streams