r6 - 20 Aug 2008 - 08:46:25 - RobertGardner

MinutesJune18

Introduction

Minutes of the Facilities Integration Program meeting, June 18, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Charles, Sarah, Wei, Marco, Fred, Bob, Mark, Xin, Pat, Hiro, Saul, Tom
  • Apologies: Michael, Horst, Karthik, Nurcan, Kaushik
  • Guests: none

Integration program update (Rob, Michael)

Site certification review

Operations overview: Production (Mark)

  • Not many jobs; expect some single-particle tasks, but no recent news from Kaushik.
  • Migrating sites to autopilot: both MWT2 sites are being converted, starting with autopilots at UC_ATLAS_MWT2. Once MWT2 is converted, all Tier2 sites will be on autopilot.
  • A few Tier3s remain to be converted, e.g. UTD. Very close to having all sites converted to autopilot.
  • Large memory jobs - heavy-ion jobs requiring 3 GB memory.
    • What are the requirements? What should we do about it? Setup additional queues? Enlarge swap?
    • There are possibilities for handling these.
    • Need management guidance on how to prioritize setting up the facility for heavy-ion jobs.
    • Do we need to consider a swap space policy in the facility?

Shifts (Marco)

  • Database maintenance at CERN today.
  • An analysis tutorial today at Vancouver - heads up to site admins.

User LRC deletion (Charles)

  • Nurcan reports this is currently failing - Charles is addressing it.

Analysis queues, FDR analysis (Mark)

  • Follow-up: SUSY validation and other jobs based on release 14.
    • All analysis queues functional except AGLT2 - Bob in contact with Nurcan.
  • Regular exercising of analysis queues over data sets, especially when there are new releases. Nurcan is doing this.
  • Mark and Nurcan will meet to discuss some systematic testing at the sites and will re-consider the analysis benchmarks.
  • Metric - define a standard for time required to process a standard dataset
  • Consider a site availability monitor that reports basic functionality as an indicator of site readiness; this would help users distinguish "site" problems from "user-code" problems.

Operations: DDM (Hiro)

  • Hiro: all okay; srm v2 for UM calibration; otherwise quiet.
  • Not known if there are any FT
  • Space tokens: Hiro has a way of

LFC migration (John)

  • Formation of the LFC sub-committee, SubCommitteeLFC
  • Will hold first meeting next week.

RSV SE & CE probe update status (Fred)

  • Working with Brian Bockelman (U Nebraska Lincoln) to understand problems with the Gratia daily report for CEs. I have included the daily report from Monday, which seems to be incorrect in a couple of places. Please let me know if you think something is wrong.

RSV Daily report for the period 06/16/08 00:00:00 - 06/16/08 23:59:59.

All values in the table below are percentages; if no tests were run, the entry is marked NT.

Metric Results Summary for Resources of type CE
------------------------------------------------------------------------------------------------------------
|      Resource Name       |    Daily     | Change from  |    Daily    | gridftp | version | ce perm | crl |
|                          | Availability | Previous Day | Reliability |         |         |         |     |
------------------------------------------------------------------------------------------------------------
|               * UTA_SWT2 |          100 |            0 |         100 |     100 |     100 |     100 | 100 |
|               * SWT2_CPB |          100 |            0 |         100 |     100 |     100 |     100 | 100 |
|          * OU_OCHEP_SWT2 |          100 |            0 |         100 |     100 |     100 |     100 | 100 |
|                  * AGLT2 |          100 |            0 |         100 |      NT |     100 |      NT | 100 |
|                 * IU_OSG |          100 |            0 |         100 |      NT |      NT |      NT |  NT |
|                * MWT2_UC |          100 |            0 |         100 |      NT |     100 |     100 | 100 |
|    gate02.grid.umich.edu |          100 |            0 |         100 |      NT |     100 |      NT | 100 |
|          * UC_ATLAS_MWT2 |          100 |            0 |         100 |      NT |      NT |      NT |  NT |
|               * UTA_DPCC |           95 |           -2 |         100 |     100 |     100 |     100 | 100 |
|              * PROD_SLAC |            0 |            0 |         100 |     100 |     100 |     100 |  NT |
|      uct2-grid6.mwt2.org |            0 |            0 |           0 |      NT |      NT |      NT |  NT |
|              BNL_ATLAS_2 |            0 |            0 |           0 |     100 |      41 |      91 |   0 |
|            * BNL_ATLAS_1 |            0 |            0 |           0 |     100 |     100 |     100 |   0 |
|                * MWT2_IU |            0 |            0 |           0 |      NT |      90 |      90 |  90 |
|         * BU_ATLAS_Tier2 |            0 |            0 |           0 |       0 |       0 |       0 |   0 |
------------------------------------------------------------------------------------------------------------

  • All sites showing red on CA cert expiring probe.
  • SRM probes needed for AGLT2, SWT2, NET2
    • AGLT2 - has 2.0 probes, just not enabled. Will run configure.
    • BU - has RSV 2.0 running, but not reporting. Saul will follow up, and will install OSG 1.0 by next week.
    • SW - need SRM probes. Did upgrade, but may not have enabled SRM probe.
    • BNL - why not reporting? Xin says it is reporting fine locally. Are the results going into Gratia correctly? Fred will follow up with Xin.
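
One way to screen a report like the one above for rows worth a second look is sketched below. This is only an illustration: the helper names are hypothetical and not part of RSV or Gratia, the meaning of the leading "*" is not assumed (it is simply stripped), and the rule used (availability of 0 while every probe that ran reports 100) is an assumed heuristic, not a statement of which entries are actually wrong.

```python
# Hypothetical helpers (not part of RSV or Gratia) for screening a pasted
# daily report. The "suspicious" rule below is an assumption for illustration.

def parse_row(line):
    """Parse one '| name | v1 | v2 | ... |' row; 'NT' (no tests run) becomes None."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    name = cells[0].lstrip("* ")  # drop the leading '*' marker, if any
    values = [None if c == "NT" else int(c) for c in cells[1:]]
    availability, change, reliability = values[:3]
    return {
        "name": name,
        "availability": availability,
        "change": change,
        "reliability": reliability,
        "probes": values[3:],  # gridftp, version, ce perm, crl
    }

def looks_inconsistent(row):
    """True if availability is 0 although every probe that ran reports 100."""
    ran = [p for p in row["probes"] if p is not None]
    return bool(ran) and all(p == 100 for p in ran) and row["availability"] == 0

# A few rows copied from the report above.
rows = [
    "|               * UTA_DPCC |           95 |           -2 |         100 |     100 |     100 |     100 | 100 |",
    "|              * PROD_SLAC |            0 |            0 |         100 |     100 |     100 |     100 |  NT |",
    "|      uct2-grid6.mwt2.org |            0 |            0 |           0 |      NT |      NT |      NT |  NT |",
    "|            * BNL_ATLAS_1 |            0 |            0 |           0 |     100 |     100 |     100 |   0 |",
]

flagged = [r["name"] for r in map(parse_row, rows) if looks_inconsistent(r)]
print(flagged)  # ['PROD_SLAC']
```

Rows where no probe ran at all (all NT) are deliberately not flagged, since there is nothing to compare the summary columns against.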

Scheduling maintenance downtimes with the GOC (Sarah)

WLCG accounting

Next procurements

  • Standing agenda item, see CapacitySummary.
  • Follow-up issues:
    • Storage capacity recommendations/guidance for the Facility
    • Revised WLCG pledges

  • Specifications from Internet2 for network monitoring hosts (Rich)

OSG 1.0 (Rob)

  • OSG 1.0 now released
  • On-going testing with UC_ATLAS_MWT2 and ANALY_MWT2.
  • See https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/WebHome
  • Constraints
  • Action item: determine schedule for the Facility
    • AGLT2 - within the next week
    • NET2 - within the next week
    • UTA_SWT2 - perhaps next week or earlier
    • SWT2_CPB - probably 2 weeks away since there will be a major shutdown
    • SLAC - first needs to do xrootd storage upgrade; earliest end of this month
    • MWT2 -

Site news and issues (all sites)

  • T1: no report
  • AGLT2: swapped old GK hardware, which has gone well. Typical load of 4, down from 40. Otherwise all okay. dCache replica manager does not work well; dCache developers appear to be on vacation.
  • NET2: all is well. Will need to upgrade srm-bestman 2.2.0.8c3 for RSV probes. (sudo support for non-root)
  • MWT2: working on PBS job manager; analysis queue upgraded to 1.0.
  • SWT2 (UTA): integrating new hardware starting next week. Has srm-bestman c1; will upgrade to c3.
  • SWT2 (OU): all is well, 10G work.
  • WT2: no news. srm-xroot-bestman at a version which works with RSV.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • A bit of development to do. Carry-over
  • No results yet from tests at AGLT2.

Release installation via Pacballs (Xin)

  • Follow-up
    • Progress - this morning to discuss this. Fred - hoping this week to have first set of pacballs installed in DQ2. Will test with some older releases on some test machines.
    • Need official naming scheme.
    • Get installed with a special Panda pilot job using the software role. Expect performance to improve.
    • Expect a couple of weeks of testing.
    • Goal to bring into production by end of the month (June).

Throughput initiative - status (Shawn)

Nagios monitoring subcommittee (Dantong)

  • Available space reporting at all sites.
  • Tomasz was organizing a meeting to test globus-job-run (?)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

AOB

  • none


-- RobertGardner - 17 Jun 2008


Attachments


gif drstats_FDR_bycloud_all_TIER2S.gif (123.3K) | RobertGardner, 17 Jun 2008 - 11:48 | FDR2 monitor
 