Yuri's summary from the weekly ADCoS meeting:
https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=186845
or
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_16_2012.html
1) 4/11: WISC - DDM errors ("failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]"). Problem reported to be fixed
(no details). ggus 81162 closed, eLog 35113.
2) 4/12: UTD-HEP - site requested to be unblacklisted in DDM. However, file transfers started failing heavily, so the site had to be set off again.
See: https://savannah.cern.ch/support/?127808, eLog 35123/24/203.
3) 4/12: Network issue at CERN created various problems for a period of several hours. More details in eLog 35148/50.
4) 4/12: ggus 81213 was opened for what appeared to be SRM errors at BNL. Issue was actually the network link between TRIUMF and BNL.
See details in the ticket (now closed) & eLog 35182.
5) 4/13 early a.m.: power outage at SLAC. Power restored as of late afternoon 4/14. eLog 35168.
6) 4/13 early a.m.: AGLT2 - file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]"). Shawn reported
that the DB partition on the dCache headnode filled up. Issue resolved - ggus 81228 closed, eLog 35171.
7) 4/14: BNL - file transfer errors due to expired host certificate ("Credential with subject: /DC=org/DC=doegrids/OU=Services/CN=dcsrm.usatlas.bnl.gov
has expired"). Certificate quickly renewed - issue resolved. ggus 81281 closed, eLog 35214.
8) 4/17: SLAC - user reported problems attempting to download files from the site when using certificates signed by APACGrid. Wei reported
the certificate list was updated, and this fixed the problem (user transfers now succeeding). ggus 81351 closed.
Follow-ups from earlier reports:
(i) 2/29: UTD-HEP set off-line to replace a failed disk. Savannah site exclusion: https://savannah.cern.ch/support/index.php?126767. As of 3/7
site reported the problem was fixed. Test jobs are failing with the error "Put error: nt call last): File , line 10, in ? File /usr/lib/python2.4/site-packages/XrdPosix.py,
line 5, in ? import _XrdPosix ImportError: /usr/lib/python2.4/site-packages/_XrdPosixmodule.so: undefined symbol: XrdPosix_Truncate."
Under investigation. eLog 34259.
Update 3/12: ggus 80175 was opened for the site due to test jobs failing with the error shown above. Closed on 3/13 since this issue is being tracked
in the Savannah ticket.
Update 4/17: Savannah 126767 closed, as the latest site issues are being tracked in https://savannah.cern.ch/support/?127808.
(ii) 3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period). ggus 79827 in-progress, eLog 34150.
Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214. eLog 34336.
(iii) 3/10: Duke - DDM errors ("failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]"). Issue with a fileserver which hosts
gridftp & SRM services being investigated. ggus 80126, eLog 34315. ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.
Tickets cross-referenced. System marked as 'off-line' while the hardware problem is worked on. eLog 34343. DDM blacklist ticket:
https://savannah.cern.ch/support/index.php?127055
Update 4/5: Downtime extended until the end of April.
(iv) 4/7: UTD-HEP - after being un-blacklisted in DDM (see (iv) below), the site requested to be tested by HC/panda. However, pilots were not able
to find/access the atlas s/w release areas. RT 21898 opened.
(v) 4/8: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]"). ggus 81035, eLog 35046.
(vi) 4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").
ggus 81050 in-progress, eLog 81050.
(vii) 4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version." ggus tickets 81011, 81012 & 81110 all related to this issue.
Yuri's summary from the weekly ADCoS meeting:
https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=187835
or
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_23_2012.html
1) 4/18: From Hiro - In case you have noticed that BNL FTS has stopped working with gsiftp endpoint since the upgrade, I think I fixed it. Please let
me know if it is still not working.
2) 4/18: MWT2 - job failures ("Unspecified error, consult log file") due to some test nodes picking up production jobs. Nodes removed - problem solved.
eLog 35346.
3) 4/18: SWT2_CPB - file transfer errors - issue was an expired host certificate. Certificate updated, which solved the problem. eLog 35347.
4) 4/19: High number of panda analysis jobs in the 'holding' state. Issue was slowness in the LFC+DQ2 registration step. Issue seemed to clear up
after a few hours. eLog 35399, http://savannah.cern.ch/bugs/?93869.
5) 4/20: SWT2_CPB - file transfer errors ("[CONNECTION_ERROR] failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").
A dataserver was heavily loaded, which in turn impacted the SRM host. The SRM host was rebooted and, to reduce the load on the overloaded dataserver,
disk clean-ups were run in the background to free up some space on other hosts. An additional rack of storage should also help alleviate the space crunch.
ggus 81465 / RT 21941 closed, eLog 35397.
6) 4/21: WISC_LOCALGROUPDISK file transfer failures with "source file doesn't exist" errors. ggus 81474 in-progress, eLog 35501.
7) 4/22: MWT2 - job failures with "Get error: lsm-get failed." See details in ggus 81477 (in-progress) - eLog 35424.
8) 4/22: MWT2 - ggus 81487 opened due to jobs failing with the "lost heartbeat" error. Ticket in-progress, eLog 35433.
9) 4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA. (A similar problem occurred
last week at SLAC - see ggus 81351.) Under investigation - details in ggus 81495 / RT 21947.
10) 4/23: Attempts to create a proxy when accessing the BNL voms server (vo.racf.bnl.gov) were hanging. From John Hover: Service hung at 4:00AM.
A service monitoring tool detected the problem and attempted a restart. The restart failed because of a full log partition. Service is now restored, and we're
looking into why the partition status didn't generate any internal alerts. ggus 81505 closed, eLog 35452.
11) 4/24: SMU_LOCALGROUPDISK file transfer errors ("source file doesn't exist"). Update from Justin: These files have been deleted and an LFC update
has been requested. ggus 81526 in-progress, eLog 35463.
12) 4/24: John at NET2 reported that the HU_ATLAS site was draining for lack of production jobs. Pilots are unable to download files from the panda servers,
and immediately exit with the message "curl: (52) Empty reply from server /usr/bin/python: can't open file 'atlasProdPilot.py': [Errno 2] No such file or directory."
Problem under investigation - see details in e-mail thread. eLog 35477.
Follow-ups from earlier reports:
(i) 3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period). ggus 79827 in-progress, eLog 34150.
Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214. eLog 34336.
(ii) 3/10: Duke - DDM errors ("failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]"). Issue with a fileserver which hosts gridftp &
SRM services being investigated. ggus 80126, eLog 34315. ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK. Tickets
cross-referenced. System marked as 'off-line' while the hardware problem is worked on. eLog 34343. DDM blacklist ticket:
https://savannah.cern.ch/support/index.php?127055
Update 4/5: Downtime extended until the end of April.
(iii) 4/7: UTD-HEP - after being un-blacklisted in DDM (see (iv) below), the site requested to be tested by HC/panda. However, pilots were not able to find/access
the atlas s/w release areas. RT 21898 opened.
Update 4/24: RT ticket marked as 'solved' (no explanation).
(iv) 4/8: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]"). ggus 81035, eLog 35046.
(v) 4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").
ggus 81050 in-progress, eLog 81050.
(vi) 4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version." ggus tickets 81011, 81012 & 81110 all related to this issue.
Update: As of 4/19 this issue is being tracked in ggus 81012; ggus 81011 and 81110 closed.
(vii) 4/12: UTD-HEP - site requested to be unblacklisted in DDM. However, file transfers started failing heavily, so the site had to be set off again.
See: https://savannah.cern.ch/support/?127808, eLog 35123/24/203.