r5 - 19 Dec 2007 - 14:45:33 - RobertGardner

MinutesDec19

Introduction

Minutes of the Facilities Integration Program meeting, December 19, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Charles, Rich, John Hover, Patrick, Alexei, Karthik, Torre, John@BU, Jay, Dantong, Kaushik, Wei
  • Apologies: none

Integration program update (Rob, Michael)

  • Phase 3 plan: here
  • Phase 3 SiteCertificationP3
  • Review of action items from Tier2 meeting at SLAC: NotesTier2Nov30. Overarching near term goals (December 15) are:
    • Establish 200 MB/s sustained throughput to all Tier2s
    • Establish analysis queues at all Tier2s
    • Replicate Rel 12 AODs to all Tier2, for routine pathena analysis

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Running well, with plenty of jobs queued to run through the holidays.
    • Job types? Mostly simulation jobs, and reconstruction of jobs w/ RDOs on disk.
    • Dec 24, 25, 31, Jan 1 shifts are cancelled.
    • There is an agreement for aspects of Tier1 operations at BNL.
    • Tier2:
      • MWT2 - down on the above dates
      • NET2 - will be checking email and respond as needed
      • GLT2 - will respond to email, will maintain operations informally
      • MSU - up and online; need more pilots
      • SW - plans same as MW; will monitor emails and keep an eye on things
      • WT2 - will continue to check status during holiday; SLAC shutdown Dec 22-Jan 2
      • OU - will respond via email, and can fix things remotely
  • Production shift report (Nurcan/Mark)
    • Conversion to eElog now routine, available from Panda page. Comments/suggestions welcome.
  • Follow-up on ADC Operations plan to submit to Alexei. Kaushik will send to ATLAS management today. Note: a combined shift training meeting will be held January 21-22 at CERN (still on).

Operations: DDM (Alexei)

Analysis Queues (Bob, Mark)

  • See AnalysisQueues - updated
  • Three sites fully tested and ready to go: AGLT2, SWT2-OU, SWT2-UTA. Test jobs and pathena test jobs are successful.
  • Email Bob, ball@umich.edu.
  • Follow-up on sites
    • NET2 - something wrong w/ python. Saul notified.
    • MWT2 - pathena test job has been submitted, waiting for result; waiting for an autopilot
    • WT2 - power problems, need to debug why pilots are failing, working w/ Paul
    • SWT2_UTA
  • Can we agree that we have this milestone completed by December 15? Yes.
  • Very close, will push harder this week.

Accounting (Shawn, Rob)

Follow-up on accounting issues (see Accounting).
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • SWT2_UTA still being addressed. Need VORS registration - not yet. Hopefully this week, may take until January.
  • BNL accounting info was lost - Xin investigating. There was confusion on the WLCG APEL site - having to do with the change in the Gratia site name - they appear to have static mappings. Xin still investigating.
    • John W is still clarifying w/ EGEE people on naming convention. Xin will continue to push the issue w/ John Weigand. Michael will push this with Ruth.
  • Schedule a phone call w/ Sue to get the US Facility view available.

Throughput initiative - status (Shawn)

  • High throughput between T1 and T2 is the goal.
  • Determining the exact data path for the disk-to-disk tests, from the source pool on the Thumper to the Tier2. Low rates may be due to traffic being misdirected through the firewall rather than the door.
  • Dantong notes that a large fraction of traffic goes from pool nodes to the remote site; transfers are not proceeding as expected.
  • Hiro notes that FTS uses the doors. Dantong notes there are two steps used.
  • Hiro is testing the buffer size between BNL and UM.
  • glite-url-copy is going to be used.
  • Upgrade of BNL gridftp doors. Need to test the current doors to see whether they are a bottleneck. Do the doors need more memory? Or do we need to add more doors? Michael notes we may need to take out a door for structured tests.
  • Is there an issue w/ memory on the dcache pools? Depends on the node which is doing the long-haul transfer - could be the door.
  • Shawn notes there is an rpm called stress that is useful for testing. See yum repo.
  • Shawn is hopeful that we can meet the milestone of 200 MB/s to UM.
  • Can we arrange a test w/ multiple files simultaneously?
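
For scale, the 200 MB/s milestone can be turned into a few back-of-the-envelope numbers. The sketch below is illustrative only: it assumes decimal units (1 MB = 10^6 bytes), the function names are made up for this note, and the 20 ms round-trip time is a hypothetical figure, not a measured BNL-UM value.

```python
# Rough numbers for the 200 MB/s Tier1 -> Tier2 throughput goal.
# Assumptions: decimal units (1 MB = 10**6 bytes); the RTT used in the
# example at the bottom is hypothetical, not a measurement.

SECONDS_PER_DAY = 86_400


def daily_volume_tb(rate_mb_s):
    """Data moved in one day, in TB, at a sustained rate given in MB/s."""
    return rate_mb_s * SECONDS_PER_DAY / 1e6


def line_rate_gbit(rate_mb_s):
    """Equivalent network rate in Gbit/s for a rate given in MB/s."""
    return rate_mb_s * 8 / 1000


def bdp_bytes(rate_mb_s, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight (summed over
    all parallel streams) to sustain the rate at the given round-trip time."""
    # MB/s * 1e6 bytes/MB * ms * 1e-3 s/ms  ==  rate * rtt * 1000
    return rate_mb_s * rtt_ms * 1000


if __name__ == "__main__":
    rate = 200          # MB/s, the milestone from these minutes
    rtt = 20            # ms, hypothetical round-trip time
    print(daily_volume_tb(rate), "TB/day")
    print(line_rate_gbit(rate), "Gbit/s")
    print(bdp_bytes(rate, rtt), "bytes in flight")
```

The bandwidth-delay product is the total amount of unacknowledged data needed on the wire, so it bears on both the buffer-size tests and the question of running multiple files simultaneously: the requirement can be met by one large buffer or split across several streams.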

Panda release installation jobs (Xin)

  • Follow-up with Xin on the status of the dedicated submit host. Xin has been in contact w/ Tadashi and has made some recommendations for more features.
  • Preliminary version to try out w/ test machine. Installing Panda job scheduler on it today - there are some issues to be resolved.
  • Milestone - December 19. Will contact Wei for test installs before going to other sites.

OSG

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Split of Nagios server into internal and external - still working on this. End of the year. Hardware problem.
  • RSV proposal to Arvind. Tomasz, Dantong, John. Tentatively this Friday.
  • Will meet w/ MWT2 for Nagios-RSV

Site news and issues (all sites)

  • T1: Relocation of Panda servers to new hardware is going smoothly. Upgrading dCache tomorrow, 8-6pm. Will Panda mover have to stop? Hiro will follow up w/ Wensheng.
  • AGLT2: MSU online! All 53 new nodes filled w/ 8 jobs running each.
  • NET2: Going well. GPFS is being used in production and AOD replication and is working well.
  • MWT2: Optical fiber break to machine room yesterday caused disruption, otherwise running smoothly.
  • SWT2_UTA: new cluster ready to run, asked Xin to install releases. Will set up new DQ2 site, and will set up for production.
  • SWT2_OU: nothing new, running smoothly. Still have an issue w/ tier2-02 - updating BIOS. Going well.
  • WT2: recovered from power outage - ramping back up. In negotiation w/ Sun for Thumper purchase. Done some stress testing w/ Bestman and xrootd; attempting to overload the srm server; 5000 puts, gets. Will put into production as soon as new hardware arrives. Will first concentrate on DQ2 and analysis queue. Can see the load balance working but w/ "old" machines for gridftp doors.

RT Queues and pending issues (Tomasz)

Carryover action items

Syslog-ng

  • Encryption for syslog-ng: still to do, carryover.
  • Initial work starting in the OSG ITB.

Site performance jobs and metrics

  • Carryover; some benchmarking work w/ quad core opterons.
  • No news.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • Proposed Computing Operations/Integration Program holiday schedule:
    • December 19 - regular meeting
    • December 26 - no meeting
    • January 2 - no meeting
    • January 9, 2008 - resume w/ Phase IV
  • none

-- RobertGardner - 18 Dec 2007
