r2 - 05 Dec 2007 - 14:49:54 - RobertGardner

MinutesDec5

Introduction

Minutes of the Facilities Integration Program meeting, December 5, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Rich Carlson, John, Wensheng, Charles, Sarah, Gabriele, Horst, Karthik, Alexei, Nurcan, Kaushik, John, Patrick
  • Apologies: Wei, Jay

Integration program update (Rob, Michael)

  • Phase 3 plan: here
  • Phase 3 SiteCertificationP3
  • Review of action items from Tier2 meeting at SLAC: NotesTier2Nov30. Overarching near term goals (December 15) are:
    • Establish 200 MB/s sustained throughput to all Tier2s
    • Establish analysis queues at all Tier2s
    • Replicate Rel 12 AODs to all Tier2s, for routine pathena analysis

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Running fine for the past week. Looks good, steady stream of jobs. Large-memory jobs: from release 13.0.40 on, Athena will print memory usage after each event. Panda IDs of such jobs were reported from some sites, and Savannah bug reports have been filed. Would like to get emails about jobs w/ excessive memory consumption.
    • Charles will send reports of Panda IDs with resident memory size exceeding 2 GB.
    • What is the source? There are known memory leaks w/ pileup jobs.
  • Production shift report (Nurcan/Mark)
    • Mark and Wensheng on shift: pilot update on the production submit hosts. 3 bug reports submitted to validation-savannah (these are heavy-ion simulations).
    • eLog December 15.
  • ADC Operations (convened by Alexei) - we are working on plans to combine US Panda shifts w/ LCG shifts. Will increase coverage in terms of overlap. Will present a plan to Alexei in a few days. Will be discussed at the Tier1 Jamboree tomorrow.
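
The report of Panda IDs with resident memory above 2 GB could be produced along the following lines. This is only a minimal sketch: the record layout (`JobRecord`), the site names, and the sample values are hypothetical and are not taken from Panda or from any site's logs.

```python
# Hypothetical job record: (panda_id, site, resident memory in bytes).
# The field layout is an assumption for illustration, not a Panda schema.
from typing import Iterable, List, NamedTuple

GB = 1024 ** 3
RSS_LIMIT_BYTES = 2 * GB  # the 2 GB threshold mentioned in the minutes


class JobRecord(NamedTuple):
    panda_id: int
    site: str
    rss_bytes: int


def jobs_over_limit(jobs: Iterable[JobRecord],
                    limit: int = RSS_LIMIT_BYTES) -> List[JobRecord]:
    """Return jobs whose resident memory strictly exceeds the limit, largest first."""
    flagged = [job for job in jobs if job.rss_bytes > limit]
    return sorted(flagged, key=lambda job: job.rss_bytes, reverse=True)


def format_report(jobs: Iterable[JobRecord]) -> str:
    """One line per job: Panda ID, site, resident memory in GB."""
    return "\n".join(
        f"{job.panda_id}  {job.site}  {job.rss_bytes / GB:.2f} GB" for job in jobs
    )


# Made-up sample data, for illustration only.
sample = [
    JobRecord(1001, "SITE_A", int(1.5 * GB)),
    JobRecord(1002, "SITE_B", int(2.5 * GB)),
    JobRecord(1003, "SITE_A", 3 * GB),
]
report = format_report(jobs_over_limit(sample))
print(report)
```

Sorting largest first puts the worst offenders at the top of the email; a job sitting exactly at 2 GB is not flagged, since the minutes say "exceeding".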

Operations: DDM (Alexei)

  • DQ2 0.5.0 schedule and plan
    • First ADC software meeting - tested and ready for the UK cloud.
    • Patrick and Hiro asked about testing at BNL: yes.
    • Deployment schedule - this week and next week on LCG by Miguel.
    • Which version to run during December?
    • DQ2 0.4.2 was the fix for duplicate registrations (there was no announcement). This adds load onto the call-back host.
    • Charles will upgrade MWT2 site services using apt-get and will circulate the recipe.
  • LRC upgrade
    • John will convene an LFC working group.
    • Priority is to bring up an instance of an LFC
    • Milestone of December 20
  • Status of M5 processing and distribution of datasets to the facility
    • See M5 dash
    • See M5 T1 dash
    • State of M5 distribution to Tier2's? More or less complete.
    • One last thing - to do the complete M6 chain at BNL. Replication of data from CERN to BNL. Conditions data, etc, need to be involved.
    • Agrees w/ Arvind that all data copied from the pit will go to castor and be registered in DQ2; any site can request it.
    • Muon runs going right now. Will notify.
  • AOD replication for analysis at Tier2s - will resume at all sites.
  • Roughly 10 TB of the AODs have been replicated to four Tier2s. Total volume is listed as 86 TB, but this is probably wrong and needs to be checked. Estimate 30-40 TB (Rel 12, 13).
  • How to handle the replication?
  • Dataset/site matchmaking is available in Panda.
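
As a rough cross-check of the AOD volumes against the 200 MB/s throughput goal, the transfer time per Tier2 can be estimated as below. This is a back-of-the-envelope sketch, not a measurement: it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and that the full sustained rate is available to AOD replication alone.

```python
# Decimal units are assumed: 1 MB = 10**6 bytes, 1 TB = 10**12 bytes.
MB = 10 ** 6
TB = 10 ** 12
SECONDS_PER_DAY = 86_400


def transfer_days(volume_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move volume_tb terabytes at a sustained rate in MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = (volume_tb * TB) / (rate_mb_per_s * MB)
    return seconds / SECONDS_PER_DAY


# Volumes quoted in the minutes: the 30-40 TB estimate and the unverified 86 TB.
for volume in (30, 40, 86):
    print(f"{volume} TB at 200 MB/s: {transfer_days(volume, 200):.1f} days")
```

Under these assumptions the 30-40 TB estimate corresponds to roughly 1.7 to 2.3 days per site, and the unverified 86 TB figure to about 5 days.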

Analysis Queues (Bob, Mark)

  • See AnalysisQueues
  • Email Bob, ball@umich.edu.
  • Four sites are in various states of implementation:
    • SLAC - Marco submitted jobs to SLAC, it's working.
    • MWT2 - just getting started - Charles will set up this week.
    • OU - Mark working with Horst. There were minor config problems using the submit host at UTA, but autopilot worked.
    • BU - Mark working w/ Saul. No progress since the meeting.
    • SWT2_UTA - dpcc has been running jobs w/ autopilot.
  • Can we agree that we have this milestone completed by December 15? Yes.
  • Can we run jobs on a regular basis, and collect information? Mark will automate submission of pathena test jobs.

Accounting (Shawn, Rob)

Follow-up on accounting issues (see Accounting).

Throughput initiative - overview (Shawn)

  • AGLT2:
    • Working on changes to local equipment - exploring XFS filesystem issues: creation, performance, mounting.
    • Single partitions reach 400-500 MB/s, so this looks good.
    • New storage coming online at MSU - to be used as test systems.
    • Exploring parameters, correlations
  • Dedicated throughput meeting, Monday, 2pm EST.
  • Include Rich Carlson
  • BU:
    • ok
  • SWT2:
    • ok
  • WT2:
    • not present
  • MWT2:
    • ok

Load test displays, issues from the last week (Jay)

OSG

  • OSG 0.8 released, deployment instructions: OSGservices. General OSG summary is: Why upgrade?
  • OSG site administrators meeting at Fermilab: Dec 12-13

Panda release installation jobs (Wei, Tadashi)

  • Kaushik: usatlas1 versus usatlas2. Separate pilots to be sent. Xin is handling this. In contact w/ Tadashi. Need to identify a machine. Concerns about contention - shouldn't be a problem since we have
  • Milestone - 2 weeks. (Xin)

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Internal discussion at BNL - a simpler solution is to put a web interface in front of his database. Need to convince him of this approach. Nagios will grab the results.
  • Need to follow-up w/ reporting of RSV data to WLCG.
  • Lost contact w/ Panda monitors that generated false alarms.

Site news and issues (all sites)

  • T1: two updates: BNL network to Canadian Tier1 at Layer 2. On Monday morning the local disk on the local submitter filled up; the hardware needs to be upgraded. Still working on the Panda infrastructure setup, close to delivery to Torre and Tadashi.
  • AGLT2: Shawn: there were issues with changes to Tiers of ATLAS, a firewall issue, and Panda mover was stuck. MSU resources coming online (Tom): room is set, power/cooling; network is up between two sites and Chicago. Working on provisioning compute nodes in Rocks. Working on Condor queue setup at Ann Arbor.
  • NET2: no major reports. Production going okay. Putting GPFS storage online.
  • MWT2: all okay - no news. IU_OSG resource down today for maintenance.
  • SWT2_UTA: not much since Friday. Bringing up new cluster. Issue w/ Gratia reporting going slow. Will follow-up w/ Gratia folks. Production running fine.
  • SWT2_OU: all running fine. Occasional crashes w/ gridftp server - something in the network. Draining right now.
    • Kaushik: seeing strange timeouts in Panda core components. Are there network probs at BNL?
  • WT2: out today

RT Queues and pending issues (Tomasz)

Carryover action items

Syslog-ng

  • Encryption for syslog-ng: still to do, carryover.

Site performance jobs and metrics

  • Carryover; some benchmarking work w/ quad core opterons.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 04 Dec 2007
