Yuri's summary from the weekly ADCoS meeting (this week provided by Jarka Schovancova):
http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=140582
1) In general several older Savannah DDM tickets were resolved and/or closed. Thanks.
2) 5/19: SMU_LOCALGROUPDISK - DDM failures with "error during TRANSFER_FINALIZATION file/user checksum mismatch." Justin at SMU thinks this
issue has been resolved. Awaiting confirmation so ggus 70737 can be closed. eLog 25537.
3) 5/19: WISC_LOCALGROUPDISK - DDM failures with "[GRID_FTP_ERROR] globus_ftp_client: the server responded with an error 500." Wen reported
the problem has been fixed (configured the BestMan server to obtain checksum values). Site was blacklisted in DDM while the issue was being addressed -
since removed (https://savannah.cern.ch/support/?121061). ggus 70734 closed, eLog 25664.
4) 5/23: New pilot release from Paul (version SULU 47d). Details here:
http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_SULU_47d.html
5) 5/23: AGLT2 network incident - from Shawn: We had a dCache storage node reorder its NICs, breaking the bonding configuration. This has been fixed
now. To prevent a recurrence we assigned HWADDR in the relevant /etc/sysconfig/network-scripts/ifcfg-ethX files. eLog 25692.
6) 5/24: NET2 - DDM transfer errors. Saul reported that the underlying issue was a networking problem that caused a gatekeeper to become overloaded.
Thinks the issue is now resolved. https://gus.fzk.de/ws/ticket_info.php?ticket=70844, eLog 25722.
Savannah site exclusion: https://savannah.cern.ch/support/?121125.
7) 5/24: Charles announced an updated version of the pandamover-cleanup.py script. See:
http://repo.mwt2.org/viewvc/admin-scripts/lfc/pandamover-cleanup.py, and the talk by Tadashi regarding updated procedures for pandamover cleaning:
https://indico.cern.ch/conferenceDisplay.py?confId=140214.
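For reference, the HWADDR fix described in item 5 can be sketched as below. This is only an illustrative fragment, assuming a Red Hat-style bonded interface. The device name, MAC address and bond name are placeholders and are not taken from the AGLT2 report.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0  (hypothetical example)
DEVICE=eth0
# Pin this config to one physical NIC so a reordering at boot
# cannot attach it to a different card (placeholder MAC address)
HWADDR=00:11:22:33:44:55
ONBOOT=yes
BOOTPROTO=none
# Bonding membership (bond name is an assumption)
MASTER=bond0
SLAVE=yes
```

With HWADDR set in each member's file, the interface name stays tied to the same hardware address, which is what keeps the bonding configuration intact after a reboot.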
Follow-ups from earlier reports:
(i) 4/8: NERSC - file transfer errors. See ggus 69526 (in-progress), eLog 24176.
Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
Update 5/17: long discussion thread in the ggus ticket - it was marked as 'solved' on this date. (Recent transfers had succeeded.)
(ii) 4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors. Site was set off-line 4/11 due to a spike in the failure rate. Discussed in:
https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
Update 5/16: No conclusive understanding of the segfault job failures. Decided to set the site back on-line (5/16) to see if the problem persists. Awaiting
new results (so far no jobs have run at the site).
Update 5/19: Initial set of jobs at OU_OSCER have completed successfully, so ggus 69558 & RT 19757 were closed, eLog 25555.
https://savannah.cern.ch/support/?120307 was closed. Will continue to monitor the status of new jobs.
(iii) 5/17: SWT2_CPB maintenance outage for cluster software updates, repositioning a couple of racks, etc. Expect to complete by late afternoon / early evening
5/18. eLog 25474, https://savannah.cern.ch/support/index.php?121013.
Update 5/18: Outage is over - test jobs completed successfully. Queues set back on-line. eLog 25553. http://savannah.cern.ch/support/?121013 closed.
(iv) 5/17: AGLT2_USERDISK to MAIGRID_LOCALGROUPDISK file transfer failures ("globus_ftp_client: Connection timed out"). Appears to be a network
routing problem between the sites. ggus 70671 in-progress, eLog 25480.
Update 5/24: NGI_DE helpdesk personnel are working on the problem. ggus ticket appended with additional info.
Yuri's summary from the weekly ADCoS meeting (this week provided by Jarka Schovancova):
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_16_2011.html
1) 5/26: NET2 - file transfer errors to DATADISK. (Issue related to the performance of checksum calculations, Bestman crashes, etc.) See discussion
thread in https://ggus.eu/ws/ticket_info.php?ticket=70973, eLog 25826.
2) 5/27: New pilot version from Paul (SULU 47e), produced to help with a production problem at LYON. This had the effect of generating thousands of
errors at two FR cloud sites (see for example https://ggus.eu/ws/ticket_info.php?ticket=71032). Problem under investigation.
3) 5/28: Job brokerage was broken in the US & IT clouds. Issue was a disk space check against an incorrect value. Problem resolved.
4) 5/29: MWT2_UC - job failures with transfer timeout errors. From Rob: Not a site problem - caused by low concurrency settings for FTS instances at FR,
CERN for transfers from MWT2 endpoints. ggus 71036 closed, eLog 25993.
5) 5/31: ADCR database maintenance (switch db services back to original hardware - see eLog 25529 and thread therein for original issue). Affected
services: ADCR_DQ2, ADCR_DQ2_LOCATION, ADCR_DQ2_TRACER, ADCR_PANDA, ADCR_PANDAMON, ADCR_PRODSYS, ADCR_AMI.
Duration ~one hour. Work completed as of ~4:00 a.m. CST. eLog 25949/50.
6) 5/30-5/31: OU_OCHEP_SWT2 file transfer failures (two issues: (i) incorrect checksums, (ii) zero-byte files). Horst reported that the issue
is resolved. https://rt.racf.bnl.gov/rt/Ticket/Display.html?id=20106 closed, eLog 25943.
7) 5/31: From Sarah at MWT2_IU: We have a storage pool off-line with disk issues at MWT2_IU. We have paused the scheduler to prevent new jobs from
starting while it is down, and are working to bring it back online. We may see some transfers fail for files on the pool while it is off-line.
8) 5/31: UTD-HEP set off-line at request of site admin (cleaning dark data from the storage). eLog 25944.
9) 6/1: Start of TAG reprocessing campaign (p-tag: p586). From Jonas Strandberg: This will be a light-weight campaign starting from the merged AODs
and producing just the TAG and the FASTMON as output which are both very small.
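Several items above (NET2 on 5/26, OU_OCHEP_SWT2 on 5/30-5/31) involve checksum problems and zero-byte files. A minimal sketch of how a site admin might spot-check a file locally follows. It assumes Adler-32 checksums, which the reports do not state, and the function names `adler32_hex` and `check_file` are hypothetical helpers, not part of any DDM or pilot tool.

```python
import os
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def check_file(path, expected_checksum):
    """Return a list of problems for one file (an empty list means OK)."""
    problems = []
    if os.path.getsize(path) == 0:
        problems.append("zero-byte file")
    actual = adler32_hex(path)
    if actual != expected_checksum.lower():
        problems.append(
            f"checksum mismatch: expected {expected_checksum}, got {actual}"
        )
    return problems
```

Calling `check_file(path, catalog_checksum)` with the checksum recorded in the catalog returns an empty list for a healthy file. Reading in chunks keeps memory use flat, which matters if checksum calculation is already a performance concern on the storage node.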
Follow-ups from earlier reports:
(i) 5/17: AGLT2_USERDISK to MAIGRID_LOCALGROUPDISK file transfer failures ("globus_ftp_client: Connection timed out"). Appears to be a network
routing problem between the sites. ggus 70671 in-progress, eLog 25480.
Update 5/24: NGI_DE helpdesk personnel are working on the problem. ggus ticket appended with additional info.
Update 5/31 from Shawn: I am marking this as resolved but the solution seems to be that the remote site only has commercial network peering and will
be unable to connect to AGLT2 and WestGrid because of this. Not sure if the systems involved have been configured to limit their interactions to reachable
sites. ggus 70671 closed, eLog 25905.
(ii) 5/19: SMU_LOCALGROUPDISK - DDM failures with "error during TRANSFER_FINALIZATION file/user checksum mismatch." Justin at SMU thinks this
issue has been resolved. Awaiting confirmation so ggus 70737 can be closed. eLog 25537.
Update 5/27: resolution of the problem confirmed - ggus 70737 closed.
(iii) 5/24: NET2 - DDM transfer errors. Saul reported that the underlying issue was a networking problem that caused a gatekeeper to become overloaded.
Thinks the issue is now resolved. https://gus.fzk.de/ws/ticket_info.php?ticket=70844, eLog 25722. Savannah site exclusion:
https://savannah.cern.ch/support/?121125.