r7 - 20 Aug 2008 - 08:46:25 - RobertGardner

MinutesJul9

Introduction

Minutes of the Facilities Integration Program meeting, July 9, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Fred, Rob, Charles, Shawn, Wen, Marco, Saul, Armen, Kaushik, Michael, Torre, Tom, Sarah, Wei, Bob, Wensheng, Mark, Nurcan, Rich
  • Apologies: none
  • Guests:

Integration program update (Rob, Michael)

  • IntegrationProgram for Phase 5 (April 1 - June 30, 2008: FY08Q3) Report in progress
  • Overarching near term goals for Phase 5:
    • Full and effective participation in FDR-2 exercises
    • Complete the benchmarks of 200 MB/s sustained disk-to-disk throughput to all Tier2s
    • SRM v2.2 functionality for all ATLAS sites
  • Upcoming meetings:
  • Milestones from the Ann Arbor meeting: AnnArborNotesMay2008:
    • FDR2: data replication and analysis queues
    • 200/400 MB/s T1-T2
    • OSG 1.0 deployed
    • LFC evaluation and deployment strategy complete
    • WLCG - SAM/RSV, reliability availability metrics for CE and SE reporting >80% for all sites.
    • Provisioning of capacities according to pledges on track for September 15 2008 deployment.
    • Network performance monitoring infrastructure deployed.
    • Revision to the Tier 3 white paper, and a reference Tier 3 facility defined.
    • Analysis benchmarks demonstrated at increasing scale (100/200/500/1000 simultaneous jobs) at all Tier 2 facilities.
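
For a rough sense of scale on the throughput benchmarks listed above, the sketch below converts a sustained rate into a daily data volume. It assumes decimal units (1 TB = 10^6 MB) and a full 24 hours at the stated rate; the function name is illustrative only.

```python
def daily_volume_tb(rate_mb_per_s, hours=24.0):
    """Data volume in TB moved at a sustained rate in MB/s (decimal units)."""
    seconds = hours * 3600.0
    return rate_mb_per_s * seconds / 1e6  # 1 TB = 1e6 MB


for rate in (200, 400):
    print(f"{rate} MB/s sustained -> {daily_volume_tb(rate):.2f} TB/day")
```

Under these assumptions, 200 MB/s sustained is 17.28 TB per day and 400 MB/s is 34.56 TB per day.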

RSV and WLCG SAM (Fred)

  • See https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvSAMGridView for links to SAM and Gridview reporting consoles.
  • For scheduling downtimes, the OIM system: https://oim.grid.iu.edu/
  • Need to fix site names in OIM (use the site name, not the host name).
  • Additional SEs need to register.
  • Will we be able to publish these at the end of the month?
    • MWT2_IU_SE - why is it not reporting correctly? Just sent an email to osg-storage. Looks like the service is working, but there are problems with the client. Sarah is investigating.
    • Fred will drive the process with sites and the GOC. Need responsiveness on the part of sites not reporting correctly.
  • We will be publishing officially for July

Below is a table derived from RSV reports generated by Brian Bockelman (U Nebraska Lincoln):

RSV Daily report for 07/08/08.

This report uses the LCG algorithm for computing availability and reliability.  This algorithm is documented at:

https://twiki.grid.iu.edu/pub/Operations/RSVPeriodicReporting/Gridview_Service_Availability_Computation-1.pdf

Metric Results Summary for Resources of type CE
-----------------------------------------------------------------------------------------------------------------------
|     Resource Name     |    Daily     | Change from  |    Daily    | version | ce perm | crl | ca cert | gram | ping |
|                       | Availability | Previous Day | Reliability |         |         |     |         |      |      |
-----------------------------------------------------------------------------------------------------------------------
|  * OU_OCHEP_SWT2      |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * MWT2_IU            |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * PROD_SLAC          |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
| BNL_ATLAS_2           |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * UTA_DPCC           |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * BNL_ATLAS_1        |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * BU_ATLAS_Tier2     |          100 |            0 |         100 |     100 |       0 | 100 |     100 |  100 |  100 |
|  * AGLT2              |          100 |            0 |         100 |     100 |      NT | 100 |     100 |  100 |  100 |
|  * UC_ATLAS_MWT2      |          100 |            0 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * MWT2_UC            |          100 |           77 |         100 |     100 |     100 | 100 |     100 |  100 |  100 |
| gate02.grid.umich.edu |           96 |           96 |          93 |     100 |     100 | 100 |     100 |  100 |  100 |
|  * IU_OSG             |            0 |            0 |           0 |       0 |       0 |   0 |       0 |  100 |  100 |
-----------------------------------------------------------------------------------------------------------------------

Metric Results Summary for Resources of type SE
--------------------------------------------------------------------------------------
|      Resource Name       |    Daily     | Change from  |    Daily    | srm | srmcp |
|                          | Availability | Previous Day | Reliability |     |       |
--------------------------------------------------------------------------------------
| head01.aglt2.org         |          100 |           86 |         100 | 100 |    91 |
|  * PROD_SLAC_SE          |           91 |           -9 |          91 | 100 |    91 |
| BNL_ATLAS_SE             |           90 |          -10 |         100 |  91 |    88 |
|  * MWT2_IU_SE            |            8 |           -4 |          36 | 100 |    33 |
| gk04.swt2.uta.edu:8446   |            0 |            0 |           0 | 100 |     0 |
| uct2-dc1.uchicago.edu    |            0 |            0 |           0 |  97 |    NT |
|  * UTA_SWT2              |            0 |            0 |           0 |   0 |     0 |
--------------------------------------------------------------------------------------

RSV versus WLCG values for WLCG sites
----------------------------------------------------------------------------------------------------------------------------
|        Site        | RSV Availability | WLCG Availability | Difference | RSV Reliability | WLCG Reliability | Difference |
----------------------------------------------------------------------------------------------------------------------------
| UTA_SWT2           |              100 |               100 |          0 |             100 |              100 |          0 |
| OU_OCHEP_SWT2      |              100 |               100 |          0 |             100 |              100 |          0 |
| UTA_DPCC           |              100 |               100 |          0 |             100 |              100 |          0 |
| BNL_ATLAS_1        |              100 |               100 |          0 |             100 |              100 |          0 |
| AGLT2              |              100 |                50 |         50 |             100 |              100 |          0 |
| MWT2_UC            |              100 |               100 |          0 |             100 |              100 |          0 |
| BU_ATLAS_Tier2     |              100 |               100 |          0 |             100 |              100 |          0 |
| UC_ATLAS_MWT2      |              100 |               100 |          0 |             100 |              100 |          0 |
| PROD_SLAC          |               92 |                92 |          0 |              92 |               92 |          0 |
| MWT2_IU            |                8 |                 8 |          0 |              32 |               32 |          0 |
| SWT2_CPB           |                0 |                 0 |          0 |         Unknown |          Unknown |          0 |
| IU_OSG             |                0 |                 0 |          0 |               0 |                0 |          0 |
| OUHEP_OSG          |                0 |                 0 |          0 |         Unknown |          Unknown |          0 |
| OU_OSCER_ATLAS     |                0 |                 0 |          0 |         Unknown |          Unknown |          0 |
----------------------------------------------------------------------------------------------------------------------------
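
To make the availability and reliability columns easier to read, here is a minimal sketch of how the two numbers differ. It assumes the usual Gridview-style definitions: availability is uptime over the time with known status, and reliability additionally excludes scheduled downtime from the denominator. This should be checked against the PDF linked above. The status labels and the hourly sampling are hypothetical and are not the actual RSV record format.

```python
from collections import Counter


def availability_reliability(samples):
    """Return (availability, reliability) in percent for equal-length samples.

    Each sample is one of 'UP', 'DOWN', 'SCHEDULED_DOWN', 'UNKNOWN'
    (hypothetical labels). Returns None for a value whose denominator is zero.
    """
    counts = Counter(samples)
    known = len(samples) - counts["UNKNOWN"]         # periods with a known status
    unscheduled = known - counts["SCHEDULED_DOWN"]   # also drop scheduled downtime
    availability = 100.0 * counts["UP"] / known if known else None
    reliability = 100.0 * counts["UP"] / unscheduled if unscheduled else None
    return availability, reliability


# 24 hourly samples: 12 up, 6 in scheduled downtime, 6 down unexpectedly
day = ["UP"] * 12 + ["SCHEDULED_DOWN"] * 6 + ["DOWN"] * 6
print(availability_reliability(day))
```

In this example the availability is 50 and the reliability is about 66.7, since scheduled downtime counts against availability but not against reliability.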

GLIBC_2.4 dCache Issue

See Savannah bug report 38467 "dCache client in ATLAS release fails because of GLIBC_2.4 requirement" which can be found at:

https://savannah.cern.ch/bugs/?func=detailitem&item_id=38467.

This issue affects 14.1.0 but not 14.2.0 and could reappear in future releases if action is not taken. We need to agree on one of two courses so David Quarrie can make the appropriate change:

Let me add back in the BNL folks to this report. I don't want to come up with one solution for one set of sites only for it not to work for others. You all need to agree on what you want before we can implement something. As I understand it, the alternatives are:

Possible solutions:
1. Approve and apply in all sites a patch similar to the one used by BNL
2. (better) Apply a patch to exclude the dcache client/library from the release. dCache would be part of the OS. Its libraries are already installed if it is used. Furthermore OSG wn-client includes a working version of dcache client libraries.

I think it might be possible in 14.1.0 to apply a patch that removes the dcache client library fragment from LD_LIBRARY_PATH. Should we be doing the same in general and just not using the LCG dcache_client package at all? I understood from BNL that they wanted it.

  • Consensus is to remove this from the release. Fred will take action on this.
  • Okay to fix local releases.
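
The idea of removing the dcache client library fragment from LD_LIBRARY_PATH can be sketched as below. This is only an illustration, not the patch itself: the directory names in the example are made up, and matching on the substring "dcache_client" is an assumption about how the release lays out that package.

```python
def strip_library_fragment(path_value, fragment, sep=":"):
    """Drop every entry of a search path that contains `fragment`.

    Empty entries are dropped as well, so the result has no stray separators.
    """
    entries = path_value.split(sep)
    kept = [entry for entry in entries if entry and fragment not in entry]
    return sep.join(kept)


# Hypothetical LD_LIBRARY_PATH value; real release paths will differ.
old = "/opt/atlas/lib:/opt/lcg/dcache_client/lib:/usr/lib"
print(strip_library_fragment(old, "dcache_client"))
```

With the release's copy out of the search path, the loader would fall back to whatever dCache client libraries the OS or the OSG wn-client provides, which is the intent of option 2 above.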

Next procurements

  • Standing agenda item, see CapacitySummary.
  • No specific news about numbers, etc.
  • New storage device from Dell, successor to MD1000, that looks interesting. Guidelines are being discussed. Need a deadline for UCI so we can move forward.
  • Our deadline for deployment is Sep 15 - so we need firm information from Dell within 2 weeks.
  • There will be an IBM visit to BNL - next week.

Follow-up issues

  • Storage capacity recommendations/guidance for the Facility (320 TB capacity, from Kaushik's model on MinutesJune11).
  • Revised WLCG pledges - need info by July 15. Action item for Rob.
  • Specifications from Internet2 for network monitoring hosts: Rich has sent something today.
  • Need a schedule to put the services in place in the facility.
  • Joint-techs meeting in Lincoln in two weeks. Dantong will be attending.

Operations overview: Production (Kaushik)

  • Follow-up on space token description assignments (PRODDISK, MCDISK, GROUPDISK, etc)
    • Kaushik will send reminder via email. Will re-work these numbers with latest from Kors.
  • Still without jobs mostly, some new 14 TeV samples coming through today. They look to be large di-jet data samples.
  • Site naming issues w/ space tokens. DDM team would like to use "alternate name" in ToA. This name is how they aggregate space tokens from a particular site. Two associated issues - they use this name to connect to WLCG BDII. Should match w/ alternate name in ToA. But we don't want to do this.
  • Downtimes are published via BDII. But we want to avoid this - it can be done via OIM to WLCG GOCDB.

Shift report (Marco)

  • Observed several FTS channels going red. Consulting Wensheng.
  • Several sites w/ errors, but they seem to be task-related.
  • There is a bug in the srmcp-fnal client that affects only dCache sites. It has been fixed, but not in the version currently in use: that is 2.0.3, and 2.0.8 is needed.
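
The srmcp version requirement in the last item (2.0.3 in use, 2.0.8 needed) could be checked per site with something like the sketch below. The function names are hypothetical, and how the installed version string is obtained from the client is deliberately left out.

```python
def parse_version(text):
    """Turn a dotted version such as '2.0.8' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, required="2.0.8"):
    """True if the installed version is older than the required one."""
    return parse_version(installed) < parse_version(required)


for version in ("2.0.3", "2.0.8", "2.0.10"):
    print(version, "needs upgrade:", needs_upgrade(version))
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.0.10" would sort before "2.0.8" and be flagged incorrectly.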

Analysis queues, FDR analysis (Nurcan)

  • TAG selection jobs failed at AGLT2 for the reason explained above (GLIBC_2.4 dCache Issue). Users were advised to use non-dCache sites (NET2, SLAC, SWT2, OU) until AGLT2 and MWT2 are patched.
  • Jamboree and ATLAS offline software tutorial next week. Which release should be used? 14.2.10 is available and already installed on most sites. If not this release, 14.2.0 will be used; it is already available.
  • Mark and Nurcan are discussing writing a script to flood the sites' analysis queues with jobs.
  • Open bugs in Savannah - many bugs have been resolved. Will be part of the new analysis support organization that Kaushik is organizing. Users are discouraged from submitting bugs when they see so many open.

DDM - FT

  • See http://atladcops.cern.ch:8000/drmon/ftmon_T1-T2_matrix_all.html
  • What are the issues with this round of FT?
    • AGLT2 - there are issues with disks on the worker nodes; these are known, waiting for space-token enabled spaces to be used. However, there were
    • MWT2 - was orange, turned green. There were problems over the weekend

Operations: DDM (Hiro - on vacation this week)

Carryover - file exists problem

  • Last week there was the "file exists" problem @ BNL
    • The problem is with a site service: files arrived at BNL but were not registered, and this was noticed only after the service was restarted. Under investigation.
    • Kaushik reports this problem has been solved by Miguel at ADC development meeting.
    • Need to follow-up next week. (Hiro to discuss with Miguel)

LFC migration (Rob)

  • SubCommitteeLFC
  • LFC sub-committee meeting yesterday but no significant news: LFCMeetJul8
  • Follow-up: possibility of adding an http interface to LFC - would provide clear separation between the client and service; request from Kaushik to bring up with LFC developers. Dantong will contact LFC developers. (Now has information from Hiro.)

wlcg-client (Marco)

  • There were problems with the wlcg-client having two versions of globus.
  • LFC pieces are from a binary distribution - this fixes the DQ2 utilities, but introduces problems with other client programs.
  • New release has workaround.

WLCG accounting

OSG 1.0 (Rob)

  • See https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/WebHome
  • OSG Sites meeting this week.
  • Deployed 1.0
    • AGLT2
    • MWT2_UC
    • MWT2_IU
    • UC_ATLAS_MWT2
    • PROD_SLAC
    • UTA_DPCC
    • BNL_ATLAS_1 - all BNL sites upgraded now.
  • Status update:
    • SWT2_CPB: post-integration, will be 1.0.
    • OU - will be upgraded later this month.
    • NET2 - downtime scheduled next Tuesday.
  • glexec deployed in production at BNL; required at SLAC, especially for analysis jobs. glexec needs outbound access. Has been tested on the ITB at BNL.

Site news and issues (all sites)

  • T1: bringing more thumper nodes online, to be put into production.
  • AGLT2: pursuing why we're getting DATADISK errors - investigating. Perhaps back to usatlas3 vs usatlas1?
  • NET2: no particular issues - except next Tuesday's downtime. There may be an issue with validation AODs not being on site. Nurcan will take a look.
  • MWT2: we've had some disk failures over the weekend, recovered.
  • SWT2 (UTA): bringing new hardware online - tweaking rocks configs and building torque; xroot; then OSG.
  • SWT2 (OU): trying to get a quote from Dell for 100 TB storage.
  • WT2: working on deploying glexec at SLAC. Will consult with Torre and Maxim. Looking at replicating the conditions db at SLAC - had a discussion this morning about what kind of hardware is required for this database.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • A bit of development to do. Carry-over

Release installation via Pacballs (Xin)

  • There was a meeting this week.

Throughput initiative - status (Shawn)

Nagios monitoring subcommittee (Dantong)

  • Tom still on vacation. Dantong covering alerts.

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

Site certification review

User LRC deletion (Charles)

  • Nurcan reports this is currently failing. Charles has addressed the reported bug, and a new version is available for Nurcan to try. Charles will email Nurcan today and follow up.

AOB

  • none


-- RobertGardner - 09 Jul 2008
