r27 - 22 Nov 2006 - 14:53:19 - JohnDeStefano

Service Challenge Monitoring

ATLAS DDM Monitoring

Summarized Data Transfer Rate Collected by CERN

Site Functional Test and Service Availability Monitoring

Grid.View on CERN Traffic Monitoring

Ganglia Monitoring for dCache Door and Pool Nodes (Available to Public Network)

    • Overall data transfer rate during a one-hour period

    • CERN SRM to BNL dCache

Network Monitoring Snapshots

Network Monitoring (BNL Internal Links)

ATLAS DDM Data Management Monitoring

Monitoring and Problem Reporting Procedure for Operation Team

Read Emails from Email List "service-challenge-tech@cern.ch"

Skim through the email list. If there is a problem regarding BNL, such as "can not transfer data to BNL" or "BNL Storage Server Crashed", please call the on-call person. Please do not call between 11:00 PM and 9:00 AM.
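The calling window wraps past midnight, which is easy to get wrong when checking by hand or in a script. The following is a minimal sketch of that check in Python. The function name `may_call_on_call`, the use of local wall-clock time, and the treatment of the boundaries (11:00 PM itself is quiet, 9:00 AM itself is allowed) are assumptions, not part of the procedure.

```python
from datetime import datetime, time

# Quiet hours from the procedure: no calls between 11:00 PM and 9:00 AM.
QUIET_START = time(23, 0)
QUIET_END = time(9, 0)


def may_call_on_call(now: datetime) -> bool:
    """Return True if `now` falls outside the quiet hours.

    Assumption: `now` is already in the local time of the on-call person.
    """
    t = now.time()
    # The quiet window crosses midnight, so it is the union of
    # [23:00, 24:00) and [00:00, 09:00).
    in_quiet_hours = t >= QUIET_START or t < QUIET_END
    return not in_quiet_hours
```

For example, `may_call_on_call(datetime(2006, 11, 22, 23, 30))` is `False`, while the same call for 10:00 AM is `True`.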

Monitor and Find Problems

If this Ganglia plot shows continuous performance degradation (bandwidth below 100 MB/s) for half an hour, that indicates a problem. Please send email to atlas-t0-ops@cern.ch and atlas-t1-ddm-oper@cern.ch, and cc the service challenge mailing list, service-challenge-tech@cern.ch, to report the degradation.

If the degradation continues for one hour at less than 100 MB/s, you need to verify whether it is a CERN problem or a BNL problem by looking at the plots of CERN to ALL Tier 1 sites and CERN to BNL. Normally the plots are updated every hour, so the current transfer appears in the plot one hour later. For example, with the current GRIDVIEW configuration, if the degradation starts after 1:00 PM, the effect can only be seen in the plots after 2:15 PM. Please also be aware that the plots use GMT.

If the plot "CERN to ALL Tier 1 Site" shows low performance in that hour, it is a CERN problem and you do not need to call the BNL team. If only the plot "CERN to BNL" shows significantly low performance, it is a BNL problem and you need to reach the on-call person via our Help Desk.
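The escalation rules above can be summarized as a small decision function. This is only a sketch in Python of the steps as written: the 100 MB/s threshold and the 30 and 60 minute marks come from the procedure, while the function names, the sampling interval, the returned strings, and the "undetermined" fallback are hypothetical. Reading the rates off the Ganglia and GridView plots is still a manual step.

```python
from typing import Optional, Sequence

THRESHOLD_MB_PER_S = 100.0  # from the procedure: below this counts as degraded
REPORT_AFTER_MIN = 30       # report by email after half an hour
DIAGNOSE_AFTER_MIN = 60     # compare the GridView plots after one hour


def degraded_minutes(samples_mb_per_s: Sequence[float], interval_min: int) -> int:
    """Length in minutes of the most recent unbroken run below the threshold.

    Assumption: samples are ordered oldest to newest and evenly spaced.
    """
    run = 0
    for rate in reversed(samples_mb_per_s):
        if rate >= THRESHOLD_MB_PER_S:
            break
        run += 1
    return run * interval_min


def next_action(minutes_degraded: int,
                cern_to_all_low: Optional[bool] = None,
                cern_to_bnl_low: Optional[bool] = None) -> str:
    """Map the observed state to the next step of the procedure."""
    if minutes_degraded < REPORT_AFTER_MIN:
        return "keep watching"
    if minutes_degraded < DIAGNOSE_AFTER_MIN:
        return "send report email"
    if cern_to_all_low is None or cern_to_bnl_low is None:
        return "check GridView plots"
    if cern_to_all_low:
        return "CERN problem: do not call BNL"
    if cern_to_bnl_low:
        return "BNL problem: contact on-call via Help Desk"
    # Not covered by the procedure; hypothetical fallback.
    return "undetermined: keep watching"
```

With hypothetical 5-minute samples, `degraded_minutes([120, 90, 80, 70, 60, 95, 85], 5)` gives 30, and `next_action(30)` returns `"send report email"`. Remember that the GridView plots lag by about an hour and use GMT, so the plot flags should be read for the matching hour.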

The Nagios monitoring system for grid machines is here. To see it, log in with the user ID (nagios) and password (nagios). For the best view, select the option "Service detail". You will see a table of all machines and services that we monitor on the grid cluster. Green means the service is OK, yellow means it has a warning, and red means an error.

Nagios sends email notifications about the status of machines to the people responsible for each service. Operators will receive email alerts about the status of the machines related to dCache.

Sometimes Nagios sends false alarms due to intermittent network problems. Therefore, if you receive an email from Nagios saying that a particular machine or service is CRITICAL, wait 10-15 minutes. If you do not receive a message during this time saying that the machine or service has recovered, please contact the experts.
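The wait-then-escalate rule can be sketched as follows in Python. The function name `should_contact_experts` and its parameters are hypothetical, and the sketch assumes the upper end of the window (15 minutes) as the deadline. It does not parse Nagios emails; the timestamps are supplied by the operator.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: use the upper end of the 10-15 minute wait as the deadline.
WAIT_BEFORE_ESCALATING = timedelta(minutes=15)


def should_contact_experts(critical_at: datetime,
                           recovered_at: Optional[datetime],
                           now: datetime) -> bool:
    """Return True once the wait has elapsed without a recovery message.

    critical_at:  time of the CRITICAL notification
    recovered_at: time of the recovery notification, or None if none arrived
    now:          current time
    """
    deadline = critical_at + WAIT_BEFORE_ESCALATING
    if recovered_at is not None and recovered_at <= deadline:
        # Recovery arrived within the window: treat it as a false alarm.
        return False
    return now >= deadline
```

For a CRITICAL alert at 14:00 with no recovery message, the function returns `False` at 14:05 and `True` from 14:15 onward.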

How to check FTS and DQ2

The FTS server for Tier 1 data transfer is at CERN. To check the BNL-CERN channel information, run:

glite-transfer-channel-list -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/ChannelManagement

To check the DQ2 catalog, go to /home/atlassgm/config/BNL. There are two logs: subscriptions.log and progressFTS.log.
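One way to look at the most recent entries of both logs is sketched below in Python. The directory and file names are the ones given above; the helper names `last_lines` and `show_recent`, the default of 20 lines, and the idea that the newest entries are at the end of each file are assumptions. The sketch makes no assumption about the log line format and only prints raw lines.

```python
from collections import deque
from pathlib import Path
from typing import List

LOG_DIR = Path("/home/atlassgm/config/BNL")           # directory named in the text
LOG_NAMES = ("subscriptions.log", "progressFTS.log")  # log files named in the text


def last_lines(path: Path, count: int = 20) -> List[str]:
    """Return the last `count` lines of a text file without loading it all."""
    with path.open("r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def show_recent(log_dir: Path = LOG_DIR, count: int = 20) -> None:
    """Print the tail of each DQ2 log, noting any file that is missing."""
    for name in LOG_NAMES:
        path = log_dir / name
        print(f"== {path} ==")
        if not path.is_file():
            print("(not found)")
            continue
        for line in last_lines(path, count):
            print(line)
```

Calling `show_recent()` on the machine that holds the logs prints the tail of each file in turn.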

Who is on-Call?

If you have any problem regarding BNL SC4, please report it via our Help Desk. The Help Desk includes the Trouble Ticket System, the Operator, and Facility On-call personnel.

Please Log Your Critical Problems and Failures

Some Snapshots of SC4 Monitoring
