r12 - 18 Aug 2006 - 15:24:02 - FrederickLuehring

Tier 2 Operations and User Support Planning

Grid Middleware and Atlas Software Versions and Upgrades

  • Currently we are running a mixed set of OSG versions, even on our production systems. Thanks to the efforts of Xin, we do have consistent versions of the Athena software everywhere. We still need to firm up that DQ2 is distributed using Pacman. We also need to track that the correct GUMS/VOMS versions are installed on our Tier 1/2 sites. We do not include the queuing system in this list of software. Suggestions for the near future:
    • We should inventory what OSG, GUMS, VOMS, and Athena versions are on each Tier 1/2. Time frame: 1 week. Milestone: Inventory completed.
    • The production Tier 1 and Tier 2 sites should install the same software stack. This makes maintenance and support easier. Time frame: 6 months. Milestone: Sites have same software installed.
    • We should proactively validate proper operation of the current Athena version and new versions of the OSG middleware on the ITB. Time frame: Now.
    • Each Tier 1 & Tier 2 site should maintain an ITB site for testing & validation. Time frame: 3 months.
    • Each Tier 1/2 site should maintain a Pacman mirror of all the releases to provide a common environment and ease installation on user computers. Milestone: All sites have a mirror.
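
Once each site has reported its versions, the inventory and the common-stack goal above could be checked mechanically. The following is a minimal sketch under stated assumptions: the `inventory` dictionary, the site names, and the version strings are hypothetical placeholders, not real data, and "most common version" is only one possible choice of reference.

```python
from collections import Counter

# Hypothetical inventory: site name -> installed version per component.
# Site names and version strings are illustrative placeholders only.
inventory = {
    "site_a": {"OSG": "0.4.0", "GUMS": "1.0", "VOMS": "1.0", "Athena": "1.0"},
    "site_b": {"OSG": "0.4.0", "GUMS": "1.0", "VOMS": "1.0", "Athena": "1.0"},
    "site_c": {"OSG": "0.4.1", "GUMS": "1.0", "VOMS": "1.0", "Athena": "1.0"},
}


def find_mismatches(inventory):
    """Return {component: {site: version}} for every site whose version
    differs from the most common version of that component."""
    mismatches = {}
    components = sorted({c for versions in inventory.values() for c in versions})
    for component in components:
        # A site that does not report a component is counted as None.
        counts = Counter(v.get(component) for v in inventory.values())
        reference, _ = counts.most_common(1)[0]
        odd = {
            site: v.get(component)
            for site, v in inventory.items()
            if v.get(component) != reference
        }
        if odd:
            mismatches[component] = odd
    return mismatches


if __name__ == "__main__":
    for component, sites in find_mismatches(inventory).items():
        print(component, sites)
```

An empty result would correspond to the "Sites have same software installed" milestone.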

Security

  • Currently each site takes care of its own security. For example some sites have firewalls and some do not. There is no US ATLAS security document. Suggested tasks for the near future:
    • US ATLAS should write a security document. Time frame: 1 month. Milestone: Document written.
    • All sites should implement the security document and have their implementation audited by an independent authority. Time frame: 1 month after the security document is completed. Milestone: Completion of security audit.
    • Implement secure shell key pairs on all sites and eliminate the use of passwords. Alternatively use grid certificates to obtain interactive login to US ATLAS resources.

Ticketing and User Help Requests

  • Currently the Tier 1 has the CTS ticketing system (email-based) and is in the process of moving to the RT web-based ticketing system. Suggestions for the near future:
    • Move to using HyperNews as a front-end for screening user help requests and problem reports. Suggest using a moderator to decide when a discussion has reached the point where it is clear that a ticket should be opened. Of course users should be able to directly create a ticket. Suggested time frame: 2 months. High priority. Milestone: Moderator appointed.
    • Similarly, we should have a HyperNews forum for the Tier 1/2/3 administrators to discuss technical issues with the middleware, DQ2, etc. Suggested time frame: 2 months. High priority. Milestone: Forum completed.
    • Known problems should be reported to the ticketing system. Time frame: now.

Number of Interactive Users on a Tier 2 Center

  • Currently the Tier 2 centers have on the order of 10-20 interactive users. Suggestions for the near future:
    • We need to decide how many interactive users we should support. This could be a number as low as zero. Suggested time frame: now. High priority. Milestone: Number of users defined.
    • We should divert as many users as possible to the Tier 1 center and/or the Tier 3 centers and not try to support a large number of users on the Tier 2. The Tier 1 site is the place with funding for dedicated support personnel.
    • To the extent that we have interactive users on Tier 2 sites, we need to define a support model for them. Time frame: 6 months. Milestone: Support model defined.

FAQs and Instructional Information About the Tier 2 Sites.

  • Currently Horst supports the user web pages and there is no FAQ list or Knowledge Base. Suggested tasks for the near future:
    • Create and maintain a moderated FAQ topic on the TWiki.

Support for Tier3 sites

  • Currently Tier3 sites are self-supported. Suggested tasks:
    • Define guidelines for establishing infrastructure and a basic support model. Identify paths for escalation of issues. Time frame: 3 months. Milestone: Guidelines defined.
    • Tier2s should be able to provide some baseline setup to Tier3s.

-- FrederickLuehring - 10 May 2006

Grid Middleware and Atlas Software Versions and Upgrades

  • All sites have upgraded to 0.4.1.
  • Need to decide when and how to upgrade to 0.6.0.
  • Need a clear plan for upgrading (including validation).
    • Each site installs and tests on ITB.
    • Marco tests Panda against each ITB site.
    • Agreement to go is reached.
    • 1 day shutdown and upgrade for everyone.
    • Production resumes on all machines after upgrade.
    • Previous version of OSG is NOT deleted in case we need to revert.
    • Two weeks' warning should be given before the upgrade.
  • Need to decide how to treat SLC4/RHEL4 upgrade.
    • Need test system at each site.
    • CERN will be fully on SLC4 in October.
  • Xin/Tomasz/Yuri are installing releases for ATLAS.
    • Xin, Alex, and Saul should talk about mirrors.
    • Xin should use BNL mirrors.
    • Once installed onto grid sites no patches are possible.
    • How to synchronize over multiple mirrors in the US?
    • Need to set up a distribution hierarchy for mirrors at different tiers.
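
The upgrade plan listed above (every site tests on its ITB, Panda is tested against each ITB site, agreement is reached, two weeks' warning is given) could be expressed as a simple go/no-go check. This is only an illustrative sketch: the function names, the status fields, and the site names are hypothetical, and the real decision is made by agreement among people, not by a script.

```python
from datetime import date, timedelta

# "Two weeks' warning" from the plan above.
NOTICE_PERIOD = timedelta(weeks=2)


def ready_to_upgrade(sites):
    """True only if every site passed both its own ITB test and the
    Panda test against its ITB site."""
    return all(s["itb_tested"] and s["panda_tested"] for s in sites.values())


def earliest_upgrade_day(announced_on):
    """First day the one-day shutdown may be scheduled, given the day
    the upgrade was announced."""
    return announced_on + NOTICE_PERIOD


if __name__ == "__main__":
    # Hypothetical validation status; site names are placeholders.
    status = {
        "site_a": {"itb_tested": True, "panda_tested": True},
        "site_b": {"itb_tested": True, "panda_tested": False},
    }
    print("Go" if ready_to_upgrade(status) else "No go")
    print(earliest_upgrade_day(date(2006, 9, 1)))
```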

Security

  • CryptoCards will be required for users to gain interactive access to the BNL Tier 1. The schedule is to be decided. This applies to interactive use (NOT grid use).
  • We need to write several security documents including a risk assessment and user security manual.
  • The TWiki at BNL should use https instead of http when authenticating users.
  • We need an incident response plan soon.

Ticketing and User Help Requests

  • We probably need all three systems: HyperNews-like system for discussion, Savannah-like system for bug tracking, and an RT-like system for user help/support.
    • Cannot merge Savannah and RT functionality.
    • It makes more sense to use the CERN HyperNews system. We need to send a list of forums to the CERN HyperNews administrator to get it set up. Srini wants a BNL version of HyperNews. Kaushik will follow up with Srini on why.
  • Need to refactor the support model to be organized by local fabric service.
    • Should not have separate Tier 1 and Tier 2 DDM operations.
    • There is some work that is DDM only.
    • Support should have three categories: production support, facility and site support, and software support.
  • Panda shifters should decide disposition of user requests: site issue, production issue, software issue, infrastructure issue, etc.

Monitoring and Service Availability Evaluation

  • Facility and service monitoring (Ganglia, Nagios, OSG, EGEE).
  • Filtered information should be made available to general users and production managers.
  • Do we need a service level agreement? How long can a service be offline? How many sites are affected? How much can performance degrade?
  • Quantify service and facility uptime: what percentage of the year is each critical service available?
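
The uptime question above reduces to simple arithmetic: availability is 100% times (1 minus total downtime divided by the length of the reporting period). The following is a minimal sketch, assuming outages are recorded as non-overlapping (start, end) pairs; the function name and the record format are assumptions for illustration, not an existing monitoring interface.

```python
from datetime import datetime, timedelta


def availability_percent(outages, period_start, period_end):
    """Percentage of the reporting period during which a service was up.

    outages: list of (start, end) datetime pairs, assumed non-overlapping.
    """
    period = period_end - period_start
    down = timedelta()
    for start, end in outages:
        # Clip each outage to the reporting period before summing.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += clipped_end - clipped_start
    return 100.0 * (1 - down / period)


if __name__ == "__main__":
    year_start = datetime(2006, 1, 1)
    year_end = datetime(2007, 1, 1)
    # Hypothetical outage record: a single two-day outage.
    outages = [(datetime(2006, 3, 1), datetime(2006, 3, 3))]
    print(round(availability_percent(outages, year_start, year_end), 2))
```

The same figure could be computed per critical service and compared against whatever threshold a service level agreement ends up specifying.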

Number of Interactive Users on a Tier 2 Center

  • Still need to determine the level of interactive use at Tier 2 sites.
  • Users may prefer to work at Tier 2 sites rather than deal with CryptoCards at the Tier 1.

FAQs and Instructional Information About the Tier 2 Sites.

  • Want one uniform set of instructions, not 5 different sets of instructions.
    • Panda team needs to provide documentation for users.

Support for Tier3 sites

  • The support model for Tier 3 sites still needs to be worked out, but clearly the support level will be lower than at the Tier 1 and Tier 2 sites.
