r4 - 20 Aug 2008 - 08:46:25 - RobertGardner

MinutesJune11

Introduction

Minutes of the Facilities Integration Program meeting, June 11, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Fred, Michael, Rob, Shawn, Justin, Nurcan, Kaushik, Mark, Horst, Karthik, Sarah, Charles, Wei, Marco, Bob, Patrick, Xin, Wensheng, John/BU, Torre, Saul
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

Site certification review

Operations overview: Production (Kaushik)

  • Most important issue - re-processing at Tier2's. There were problems related to pilot2 - handling AtlasPoint1 releases. Unhandled exceptions due to the filesize issue.
  • Some jobs are accessing the Oracle database at CERN from nodes behind private firewalls. .local is being mapped to BNL and TRIUMF. Future transformations will be better equipped to handle these requests, but it would be better to have a proxy handle them. How do we handle this? Sasha and Wei are currently discussing it. Since this is a global issue, wait to hear from Sasha and Richard Hawking, and to see what SLAC comes up with. Put on carry-over list.
  • Out of 5000 jobs, 1700 ran at BNL, 2800 at AGLT2, and none at any other Tier2. Most tasks have finished now. Will ask Rod for an M6 reprocessing task.
  • Worker nodes have FQDNs at AGLT2.
  • Panglia not reporting?

Shifts (Mark)

  • Follow-up: pilot3 updates
  • Conversion to pilot3 is almost completed at MWT2. Should be done sometime this week.
  • Large transfer backlog at MWT2 - there were

Analysis queues, FDR analysis (Nurcan)

  • Follow-up: SUSY validation and other jobs based on release 14.
  • Have now tested queues at all Tier2s. Jobs ran successfully at MWT2, OU and BU.
  • SWT2, AGLT2 and SLAC are failing. The cause is not completely understood; it may be a cmt problem.

Operations: DDM (Hiro)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • Shawn has a GUMS configuration in which an atlas user with the production role now maps to the correct account. There is still an issue with multiple roles in dCache.

Pilot upgrade for space tokens (Kaushik (Paul))

  • A bit of development to do. Carry-over
  • No results from tests at AGLT2.

RSV SE probe update status (Fred)

Metric Results Summary

--------------------------------------------------------------------------------------------
|           Site           | Daily  | Weekly | Monthly | gridftp | version | ce perm | crl |
|                          | Avail. | Avail. |  Avail. |         |         |         |     |
--------------------------------------------------------------------------------------------
|    gate02.grid.umich.edu |    100 |     91 |      90 |     100 |     100 |     100 | 100 |
|          iut2-dc1.iu.edu |    100 |     82 |      59 |     100 |      NT |      NT |  NT |
|                  MWT2_IU |    100 |     83 |      93 |      NT |     100 |     100 | 100 |
|                PROD_SLAC |    100 |    100 |      60 |     100 |     100 |     100 |  NT |
|                 UIUC-HEP |    100 |    100 |      89 |     100 |     100 |     100 | 100 |
|            OU_OCHEP_SWT2 |    100 |    100 |      99 |     100 |     100 |     100 | 100 |
|    uct2-dc1.uchicago.edu |    100 |     29 |       6 |     100 |      NT |      NT |  NT |
|                  MWT2_UC |    100 |     82 |      80 |      NT |     100 |     100 | 100 |
|                 SWT2_CPB |     75 |     94 |      89 |      75 |      75 |      75 |  75 |
|                 UTA_SWT2 |     75 |     75 |      73 |      79 |      76 |      75 |  75 |
|                    AGLT2 |     58 |     55 |      63 |     100 |      95 |      66 |  50 |
|              BNL_ATLAS_2 |     50 |     39 |       9 |     100 |      66 |      91 |  66 |
|                 UTA_DPCC |     25 |     60 |      58 |      75 |      63 |      75 |  75 |
|           BU_ATLAS_Tier2 |      0 |      0 |      40 |       0 |       0 |       0 |   0 |
|            UC_ATLAS_MWT2 |      0 |     53 |      70 |     100 |      93 |     100 | 100 |
--------------------------------------------------------------------------------------------

The facility average availability (defined to be the average availability of reporting sites) is 57% for the day, 53% for the week, and 51% for the month.
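The averaging described above can be sketched as follows. This is a minimal illustration only, assuming a plain unweighted mean over the rows of the table; the name `mean_availability` and the `ROWS` list are hypothetical and not part of RSV. The minutes do not say exactly which sites or weighting went into the quoted figures, so the numbers this produces for the listed rows need not match them.

```python
# Hypothetical sketch: unweighted mean of the availability columns.
# Values are copied from the Metric Results Summary table in these minutes.
ROWS = [
    # (site, daily, weekly, monthly) availability in percent
    ("gate02.grid.umich.edu", 100, 91, 90),
    ("iut2-dc1.iu.edu", 100, 82, 59),
    ("MWT2_IU", 100, 83, 93),
    ("PROD_SLAC", 100, 100, 60),
    ("UIUC-HEP", 100, 100, 89),
    ("OU_OCHEP_SWT2", 100, 100, 99),
    ("uct2-dc1.uchicago.edu", 100, 29, 6),
    ("MWT2_UC", 100, 82, 80),
    ("SWT2_CPB", 75, 94, 89),
    ("UTA_SWT2", 75, 75, 73),
    ("AGLT2", 58, 55, 63),
    ("BNL_ATLAS_2", 50, 39, 9),
    ("UTA_DPCC", 25, 60, 58),
    ("BU_ATLAS_Tier2", 0, 0, 40),
    ("UC_ATLAS_MWT2", 0, 53, 70),
]


def mean_availability(rows):
    """Return the (daily, weekly, monthly) unweighted means over all rows."""
    n = len(rows)
    daily = sum(r[1] for r in rows) / n
    weekly = sum(r[2] for r in rows) / n
    monthly = sum(r[3] for r in rows) / n
    return daily, weekly, monthly


if __name__ == "__main__":
    d, w, m = mean_availability(ROWS)
    print(f"daily {d:.1f}%  weekly {w:.1f}%  monthly {m:.1f}%")
```

For the 15 rows listed, this simple mean comes out higher than the quoted 57%, 53% and 51%, so the quoted figures may cover sites or use a weighting not shown in the table.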

OSG Resources not tested (count = 44): BNL_ATLAS_1, IU_OSG, LONI_OSG1, LTU_OSG, OUHEP_OSG, OU_OSCER_ATLAS, OU_OSCER_CONDOR, SMU_PHY, UC_Teraport, UNM_HPC, USCMS-FNAL-WC1-CE3, UVA-sunfire, UWMilwaukee, UmissHEP, cinvestav, gpnjayhawk, isuhep

Non-OSG Resources tested (count = 8): gate02.grid.umich.edu, iut2-dc1.iu.edu, uct2-dc1.uchicago.edu

WLCG accounting

Next procurements

  • Standing agenda item, see CapacitySummary.
  • Follow-up:
    • Kaushik should give guidance for production and analysis needs. DONE
    • t2storage-june08.ppt.pdf: Storage requirements - Kaushik
    • Jim has given requirements as well. Will start from those numbers. Deployment at sites by September 15; implies need to go out for bids in July.
    • Several concerns - even as the numbers are preliminary.
    • Question about how to manage the US 20% fraction.
  • Specifications from Internet2 for network monitoring hosts (Rich)
    • Almost done - may be recommending a single host, dual core. Bind each core to a separate NIC.
    • Two roles - latency and bandwidth.

OSG 1.0 (Rob)

  • VDT security update
  • OSG 1.0 deployment schedule

Site news and issues (all sites)

  • T1: Potential power cuts tomorrow during the 7-8am and 5-7pm periods. Wensheng will be on-site to restart panda services. There is some documentation for restarting the services. Tadashi and Torre will be watching the services.
  • AGLT2: Currently having a pnfs database problem. Using more than 105 TB in a single area - vacuuming postgres, etc. Expect to be back online in an hour. Notified panda shift, set an OSG maintenance window.
  • NET2: About a day of down time due to an AC failure, otherwise no issue.
  • MWT2: looks like we're filling up again. Moving ANLY_MWT2 queue.
  • SWT2 (UTA): All okay.
  • SWT2 (OU): Swapped tier2-02 yesterday; currently testing. About to start 10G testing.
  • WT2: All okay.

Carryover issues

LFC status (John)

  • We need to come to a formal decision about our own deployment model. If we stick to our existing model, we should prepare arguments for it, and likewise if we change it. Fault tolerance and scalability are the issues. Suggestion is to revisit.
  • Revisit in 2 weeks

Release installation via Pacballs (Xin)

  • Follow-up
    • Progress: met this morning to discuss this. Fred is hoping to have the first set of pacballs installed in DQ2 this week. Will test with some older releases on some test machines.
    • Need official naming scheme.
    • Pacballs get installed by a special Panda pilot job using the software role. Expect performance to improve.
    • Expect a couple of weeks of testing.
    • Goal to bring into production by end of the month (June).

Throughput initiative - status (Shawn)

Nagios monitoring subcommittee (Dantong)

  • Available space reporting at all sites.
  • Tomasz was organizing a meeting to test globus-job-run (?)

AOB

  • LFC migration questions. Kaushik notes that Nordugrid has migrated to LFC with a MySQL backend, apparently with success.


-- RobertGardner - 10 Jun 2008

Attachments


pdf t2storage-june08.ppt.pdf (3252.2K) | RobertGardner, 11 Jun 2008 - 12:32 | Storage requirements - Kaushik