r1 - 31 Jan 2018 - 14:17:01 - AlexUndrus

DdmTransSEHealth - Coping with SE Health Issues (info from Vikas, 01/31/2018)


If an operation or deletion takes too long, look at the DataOperationRequestTasks table. Example:

| STATUS   | OperationType        | RetryNum | Priority | QueueTime           | TaskID | Error    | InitialUpdate       | DataOperationRequestLFNID | LastUpdate          | DEST_SE         | HeartBeatTime       | RequestID | SOURCE_SE       |
+----------+----------------------+----------+----------+---------------------+--------+----------+---------------------+---------------------------+---------------------+-----------------+---------------------+-----------+-----------------+
| Done     | DeepDelete           |        0 |        1 | 2018-01-30 02:17:27 |      7 |          | 2018-01-30 02:16:57 |                         7 | 2018-01-30 02:18:11 | BNL-TMP-SE      | 2018-01-30 02:18:11 |         0 | BNL-TMP-SE      |
| Done     | DeepDelete           |        0 |        1 | 2018-01-30 02:17:27 |      8 | ; NoAMGA | 2018-01-30 02:16:57 |                         7 | 2018-01-30 03:57:22 | KEK-DISK-TMP-SE | 2018-01-30 03:57:22 |         0 | KEK-DISK-TMP-SE |
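
How long each task took can be estimated as LastUpdate minus InitialUpdate. The following is a minimal Python sketch of that arithmetic using the two rows above; the function name task_duration is an assumption made for illustration, not part of DIRAC or DDM.

```python
from datetime import datetime

# Timestamp format used in the DataOperationRequestTasks output above.
FMT = "%Y-%m-%d %H:%M:%S"


def task_duration(initial_update, last_update):
    """Return LastUpdate - InitialUpdate as a timedelta (hypothetical helper)."""
    return datetime.strptime(last_update, FMT) - datetime.strptime(initial_update, FMT)


# (InitialUpdate, LastUpdate) copied from the two example rows.
rows = {
    "BNL-TMP-SE": ("2018-01-30 02:16:57", "2018-01-30 02:18:11"),
    "KEK-DISK-TMP-SE": ("2018-01-30 02:16:57", "2018-01-30 03:57:22"),
}

for se, (start, end) in rows.items():
    print(se, task_duration(start, end))
# BNL-TMP-SE 0:01:14
# KEK-DISK-TMP-SE 1:40:25
```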

The DeepDelete on KEK-DISK-TMP-SE took about 1 h 40 min, while the same operation on BNL-TMP-SE finished in about a minute. Next, look at log files such as

/opt/dirac//runit/DistributedDataManagement/StorageElementStatusAgent/log/current

Recommendation from Vikas to BNL: archive the DIRAC component logs on a more permanent basis, since DIRAC keeps logs for only a week or so. This setting can be changed in DIRAC, but it is still better to archive the logs in a more persistent location outside the DIRAC software area.
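
One possible way to follow this recommendation is a small script, run periodically, that copies a component log to a location outside the DIRAC area. The sketch below is only an illustration: the function archive_log and the archive directory in the usage comment are hypothetical, and it uses only the Python standard library rather than any DIRAC facility.

```python
import gzip
import shutil
from datetime import datetime
from pathlib import Path


def archive_log(log_file, archive_dir, now=None):
    """Copy one log file, gzip-compressed and timestamped, into archive_dir.

    Hypothetical helper: assumes a layout like <Component>/log/current,
    so the component name is taken from two directory levels up.
    """
    log_file = Path(log_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    stamp = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")
    component = log_file.parent.parent.name
    target = archive_dir / f"{component}-{stamp}.log.gz"

    # Stream the copy so large logs are not loaded into memory.
    with open(log_file, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Example call (the archive directory is an assumed location):
# archive_log(
#     "/opt/dirac/runit/DistributedDataManagement/StorageElementStatusAgent/log/current",
#     "/data/dirac-log-archive",
# )
```

Copying the active log on a schedule shorter than the retention period, rather than moving it, leaves the running component untouched.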

Then query StorageElementStatusDB, for example:

select * from StorageElementStatusAccounting WHERE StorageElement='KEK-DISK-TMP-SE' ORDER BY UpdateTime DESC

The returned rows should not contain zeroes. In this example:

|        1 |    1 |    100000000000000 |      1 |    1 |       1 | 14158 |    1 | 2018-01-30 03:56:12 | KEK-DISK-TMP-SE | BELLEDISK  |       1 |    1 |        1 | 10831554793136 |
|        1 |    1 |    100000000000000 |      1 |    0 |       1 | 14096 |    0 | 2018-01-30 02:40:23 | KEK-DISK-TMP-SE | BELLEDISK  |       1 |    0 |        1 | 10830844127620 |
|        1 |    1 |    100000000000000 |      1 |    0 |       1 | 14034 |    0 | 2018-01-30 01:44:44 | KEK-DISK-TMP-SE | BELLEDISK  |       1 |    0 |        1 | 10825084920789 |

The zeroes in the rows for 01:44:44 and 02:40:23 confirm that this SE was in bad health at those times on 2018-01-30; the row for 03:56:12 no longer shows any. If SE problems persist, one can create a JIRA ticket, e.g. https://agira.desy.de/browse/BIIDCO-724 .
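
The check for zeroes can also be done mechanically on the pasted query output. The sketch below is a hedged illustration: the column names are not shown in this note, so it simply flags any row in which some field is exactly 0 and reports that row's timestamp. The function name unhealthy_timestamps is an assumption.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# The three example rows from the query output above.
SAMPLE = """
|        1 |    1 |    100000000000000 |      1 |    1 |       1 | 14158 |    1 | 2018-01-30 03:56:12 | KEK-DISK-TMP-SE | BELLEDISK  |       1 |    1 |        1 | 10831554793136 |
|        1 |    1 |    100000000000000 |      1 |    0 |       1 | 14096 |    0 | 2018-01-30 02:40:23 | KEK-DISK-TMP-SE | BELLEDISK  |       1 |    0 |        1 | 10830844127620 |
|        1 |    1 |    100000000000000 |      1 |    0 |       1 | 14034 |    0 | 2018-01-30 01:44:44 | KEK-DISK-TMP-SE | BELLEDISK  |       1 |    0 |        1 | 10825084920789 |
"""


def is_timestamp(field):
    """True if the field parses as a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    try:
        datetime.strptime(field, FMT)
        return True
    except ValueError:
        return False


def unhealthy_timestamps(text):
    """Return the timestamps of rows that contain a field equal to '0'."""
    flagged = []
    for line in text.splitlines():
        if not line.strip().startswith("|"):
            continue  # skip blank lines and table borders
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if "0" in fields:
            flagged.extend(f for f in fields if is_timestamp(f))
    return flagged


print(unhealthy_timestamps(SAMPLE))
# ['2018-01-30 02:40:23', '2018-01-30 01:44:44']
```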

Important: once there are 5000 stuck tasks on any SE, DDM stops submitting results to RMS. This is an immediate alarm for shifters.
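
A shifter-side check against this limit could look like the sketch below. It rests on loudly stated assumptions: this note does not define what makes a task "stuck", so the code treats every task whose STATUS is not "Done" as stuck, and the status name "Waiting" in the usage example is invented for illustration. The input is a list of (STATUS, DEST_SE) pairs as they might be read from DataOperationRequestTasks.

```python
from collections import Counter

STUCK_TASK_LIMIT = 5000  # threshold quoted in this note


def ses_over_limit(tasks, limit=STUCK_TASK_LIMIT):
    """Return {SE: stuck task count} for every SE at or above the limit.

    ASSUMPTION: a task counts as stuck if its status is anything but "Done".
    """
    stuck = Counter(dest_se for status, dest_se in tasks if status != "Done")
    return {se: count for se, count in stuck.items() if count >= limit}


# Usage with made-up data ("Waiting" is a hypothetical status value).
tasks = (
    [("Waiting", "KEK-DISK-TMP-SE")] * 5000
    + [("Done", "KEK-DISK-TMP-SE")] * 10
    + [("Waiting", "BNL-TMP-SE")] * 4999
)
print(ses_over_limit(tasks))
# {'KEK-DISK-TMP-SE': 5000}
```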
