Service Challenge Monitoring
Grid.View on CERN Traffic Monitoring
Ganglia Monitoring for dCache Door and Pool Nodes (Available to Public Network)
- Overall data transfer rate during one hour period: Current time <07 Dec 2019 - 05:01>
Network Monitoring (BNL Internal Links)
ATLAS DDM Data Management Monitoring
Monitoring and Problem Reporting Procedure for Operation Team
Read Emails from Email List "service-challenge-tech@cern.ch"
Skim through the email list. If there is a problem regarding BNL, such as "cannot transfer data to BNL" or "BNL Storage Server Crashed", please call the on-call person. Please do not call between 11:00 PM and 9:00 AM.
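The check above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, the quiet hours are assumed to be BNL local time (the procedure does not state a time zone), and 9:00 AM itself is assumed to be an acceptable time to call.

```python
from datetime import time

# Quiet hours from the procedure: do not call between 11:00 PM and 9:00 AM.
QUIET_START = time(23, 0)
QUIET_END = time(9, 0)


def mentions_bnl(text):
    """Return True if an email subject or body mentions BNL (case-insensitive)."""
    return "bnl" in text.lower()


def may_call(local_time):
    """Return True if a phone call is allowed at the given local time.

    The quiet window wraps around midnight, so a time is inside it when it
    is at or after 23:00 OR before 09:00.
    """
    in_quiet_window = local_time >= QUIET_START or local_time < QUIET_END
    return not in_quiet_window
```

For example, may_call(time(23, 30)) is False, so a BNL problem seen at 11:30 PM would wait until 9:00 AM before a call is placed.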
Monitor and Find Problems
If this Ganglia plot shows a continuous performance degradation (bandwidth below 100 MB/s) for half an hour, that indicates a problem. Please send an email to atlas-t0-ops@cern.ch and atlas-t1-ddm-oper@cern.ch, with a cc to the service challenge mailing list service-challenge-tech@cern.ch, to report the degradation.
If the degradation continues for one hour at less than 100 MB/s, verify whether it is a CERN problem or a BNL problem by looking at the plots "CERN to ALL Tier 1 sites" and "CERN to BNL". Normally each plot is updated every hour, and the current transfer is shown in the plot one hour later. For example, with the current GRIDVIEW configuration, if the degradation starts after 1:00 PM, the effect can only be seen in the plots after 2:15 PM. Please also be aware that the plots use GMT.
If the plot "CERN to ALL Tier 1 sites" shows low performance in that hour, it is a CERN problem and you do not need to call the BNL team. If only the plot "CERN to BNL" shows significantly low performance, it is a BNL problem, and you need to reach the on-call person via our Help Desk.
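The decision steps above can be summarized in a short sketch. It is not an official tool: the sample format (minute, rate in MB/s), the function names, and the Action labels are assumptions made for illustration. The case where neither GRIDVIEW plot looks low is not covered by the procedure, so the sketch simply keeps reporting by email, since the plots lag by about an hour.

```python
from enum import Enum

THRESHOLD_MB_S = 100.0  # from the procedure: below 100 MB/s counts as degraded


class Action(Enum):
    NONE = "no action"
    EMAIL = "email the operations lists"
    CERN_PROBLEM = "CERN-side problem, no need to call BNL"
    CALL_BNL = "BNL-side problem, reach on-call via Help Desk"


def degraded_minutes(samples, threshold=THRESHOLD_MB_S):
    """Minutes covered by the trailing run of samples below the threshold.

    samples is a list of (minute, rate_mb_s) tuples, oldest first.
    """
    start = None
    for minute, rate in reversed(samples):
        if rate >= threshold:
            break  # the continuous degraded run ends here
        start = minute
    if start is None:
        return 0
    return samples[-1][0] - start


def triage(samples, all_tier1_low, cern_to_bnl_low):
    """Map the Ganglia samples and the two GRIDVIEW plots to an action."""
    duration = degraded_minutes(samples)
    if duration < 30:
        return Action.NONE
    if duration < 60:
        return Action.EMAIL
    if all_tier1_low:
        return Action.CERN_PROBLEM
    if cern_to_bnl_low:
        return Action.CALL_BNL
    # Assumption: plots not yet conclusive, keep reporting by email.
    return Action.EMAIL
```

Checking all_tier1_low first mirrors the rule that a drop across all Tier 1 sites points to CERN, even if the BNL plot is low as well.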
The Nagios monitoring system for grid machines is here. To view it, log in with the user ID "nagios" and the password "nagios". For the best view, select the option "Service detail". You will see a table of all machines and services that we monitor on the grid cluster. Green means the service is OK, yellow means it has a warning, and red means an error.
Nagios sends email notifications about the status of machines to the people responsible for each service. Operators will receive email alerts about the status of the machines related to dCache.
Sometimes Nagios sends false alarms due to intermittent network problems. Therefore, if you receive an email from Nagios saying that a particular machine or service is CRITICAL, wait 10 to 15 minutes. If you do not receive a message during this time saying that the machine or service has recovered, please contact the experts.
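The wait-then-escalate rule can be sketched as follows. The event tuples, the state labels "CRITICAL" and "RECOVERY", and the host/service names in the usage note are hypothetical; the sketch does not parse real Nagios emails. The grace period is set to 15 minutes, the upper end of the 10 to 15 minute wait.

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=15)  # upper end of the 10 to 15 minute wait


def needs_escalation(events, now, grace=GRACE):
    """Return the services that stayed CRITICAL for at least the grace period.

    events is a chronological list of (timestamp, service_name, state),
    where state is "CRITICAL" or "RECOVERY" (hypothetical labels).
    """
    open_alerts = {}
    for timestamp, name, state in events:
        if state == "CRITICAL":
            # Keep the time of the first unresolved alert for this service.
            open_alerts.setdefault(name, timestamp)
        elif state == "RECOVERY":
            # A recovery message cancels the pending alert.
            open_alerts.pop(name, None)
    return sorted(
        name for name, since in open_alerts.items() if now - since >= grace
    )
```

With a CRITICAL alert for "door01/gridftp" at 05:00 and no recovery message, the function returns that service once now reaches 05:15; a service that recovered in the meantime is not listed.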
How to check FTS and DQ2
The FTS server for Tier 1 data transfer is at CERN. To check the BNL-CERN channel information, run the following command on a single line:
glite-transfer-channel-list -s https://prod-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/ChannelManagement
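If the channel check is to be run from a script, a minimal wrapper could look like the sketch below. It only uses the command and the -s option given above; the function names and the 60 second timeout are assumptions, and the output is returned as raw text because its format is not described here.

```python
import subprocess

# Endpoint taken from the command above.
FTS_CHANNEL_ENDPOINT = (
    "https://prod-fts-ws.cern.ch:8443"
    "/glite-data-transfer-fts/services/ChannelManagement"
)


def channel_list_command(endpoint=FTS_CHANNEL_ENDPOINT):
    """Build the argument list for the channel listing command."""
    return ["glite-transfer-channel-list", "-s", endpoint]


def list_channels(endpoint=FTS_CHANNEL_ENDPOINT, timeout=60):
    """Run the command and return (exit code, stdout, stderr).

    Requires the gLite client tools on PATH; raises FileNotFoundError
    if the command is not installed.
    """
    result = subprocess.run(
        channel_list_command(endpoint),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # let the caller inspect a non-zero exit code
    )
    return result.returncode, result.stdout, result.stderr
```

Passing the arguments as a list avoids shell quoting problems with the long URL.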
To check the DQ2 catalog, go to /home/atlassgm/config/BNL. There are two logs: subscriptions.log and progressFTS.log.
Who is On-Call?
If you have any problem regarding BNL SC4, please report it via our Help Desk. The Help Desk includes the Trouble Ticket System, the Operator, and the Facility On-call personnel.