Yuri's summary from the weekly ADCoS meeting (this week presented by Tom Fifield):
http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=124091
Note: AGLT2 will take a maintenance outage on 2/1. See more details in the message from Bob to the usatlas-t2-l mailing list.
1) 1/19: UTD-HEP - job failures with missing input file errors - for example: "19 Jan 07:07:10|Mover.py | !!FAILED!!2999!! Failed to transfer HITS.170554._000123.pool.root.2: 1103 (No such file or directory)." ggus 66284, eLog 21346.
2) 1/19: BNL - file transfer failures with the error "FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [FILE_EXISTS] at Tue Jan 18 23:56:18 EST 2011 state Failed : file exists]." ggus 66280, eLog 21352. Issue resolved as of 1/21 - from Hiro: since DDM will retry with a different physical name, this is not an issue. The dark data will be taken care of later. ggus ticket closed.
3) 1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed." ggus 66298. From Hiro:
There is a known issue for users with the Israel CA having problems accessing BNL and MWT2. This is being actively investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other site (LOCALGROUPDISK area) for downloading.
4) 1/21: SLACXRD file transfer errors - "failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]." Issue was reported to be fixed by Wei, but the errors reappeared later the same day, so the ticket (ggus 66346) was re-opened. eLog 21409.
5) 1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]." https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
6) 1/21: HammerCloud auto-exclusion policy updated - see: https://twiki.cern.ch/twiki/bin/view/IT/HammerCloud#10_HammerCloud_Automatic_Site_Ex
7) 1/21: SWT2_CPB - user reported that the command 'lcg-ls' was hanging when attempting to communicate via the SRM interface. Possibly a temporary network glitch - subsequent tests are working correctly. RT 19309 / ggus 66386.
Update 1/26: user reports command now working, no recent instances of this error. RT & ggus tickets closed.
8) 1/22: MWT2_IU_DATADISK timeout problems - for example "[TRANSFER error during TRANSFER phase: [TRANSFER_MARKERS_TIMEOUT] No transfer markers received for more than 180 seconds] ACTIVITY: User Subscriptions." Issue understood - from Aaron:
We had a storage pool which was throwing errors and needed to be restarted. The behavior described here doesn't exactly match this failure, but it does match the timeframe. Please let us know if this is still occurring at any frequency, otherwise we can consider the issue resolved. ggus 66410 closed, eLog 21429.
9) 1/22: MWT2_IU_PRODDISK => BNL-OSG2_MCDISK file transfer failures with source errors. Resolved - from Aaron:
This was tracked down to a failing pool, which was restarted and is now delivering data as expected. We should see these transfers succeed, and this issue should now be cleared up. ggus 66415 closed, eLog 21439.
10) 1/22: BNL-OSG2_LOCALGROUPDISK file transfer errors (from NDGF-T1_PHYS-SUSY) like "[TRANSFER error during TRANSFER phase: [FIRST_MARKER_TIMEOUT] First non-zero marker not received within 180 seconds]." From Michael: The issue is no longer observed. The ticket can be closed. ggus 66416 closed, eLog 21441.
11) 1/24: AGLT2 job & file transfer errors - "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]." ggus 66450 in-progress, eLog 21488. Update from Bob at AGLT2:
Just restarted dcache services on head01. rsv srmcp-readwrite had been red. Hopefully that will clear the issue. Since the queues at the site (analy_, prod) had been set offline (ADC site exclusion ticket: https://savannah.cern.ch/support/?118828) test jobs were submitted, and they completed successfully (eLog 21497). Are we ready to close this ticket?
12) 1/25: SLACXRD_DATADISK file transfer errors - "[Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] No space found with at least 2895934054 bytes of unusedSize]." http://savannah.cern.ch/bugs/?77346.
13) 1/25: SLACXRD job failures with stage-in errors - for example "!!FAILED!!2999!! Failed to transfer ... 1099 (Get error: Staging input file failed)." From Wei:
I think this is fixed. Sorry for the noise. There are a lot of places to change when I change the config so I may still be missing something... let me know if that is still the case. ggus 66520 closed.
Follow-ups from earlier reports:
(i) 12/17, 12/20: ANALY_SWT2_CPB was auto-blacklisted twice. Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up. Once the transfer completed the test jobs began to complete successfully. Discussion underway about how to address this issue.
(ii) 12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]." ggus 65617 in-progress, eLog 20810.
(iii) 1/9: AGLT2 - low level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)." Site is investigating.
(iv) 1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist." ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036. Also https://savannah.cern.ch/bugs/index.php?77139.
1/25: Update from Shawn:
I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at http://bourricot.cern.ch/dq2/consistency/
Let me know if there are further issues.
Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_31_11.html
1) 1/26: data transfer errors from SLACXRD_USERDISK to MWT2_UC_LOCALGROUPDISK ("source file doesn't exist"). From Wei: I think you can close this ticket. There are only a few missing files and they do not exist at WT2. I don't know why FTS was asked to transfer them (maybe they were there when the request was submitted?). Repeated transfer requests created lots of failures simply because the files don't exist. ggus 66613 closed, eLog 21535.
2) 1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value." Consolidated into a single goc ticket, https://ticket.grid.iu.edu/goc/viewer?id=9871. Will be resolved in a new OSG release currently being tested in the ITB.
3) 1/27: from Bob at AGLT2 - At 1pm EST AGLT2 had a dCache issue. The number of available postgres connections had been dropped from 1000 to 300 during a pgtune a few days ago, and this was not noticed until the failure occurred. Unfortunately, this caused a LOT of job failures during the last 3 hours.
Later that evening / next morning:
We had some sort of "event" on our gate keeper around 11pm last night. Ultimately, condor was shot, and our load is lost. I have disabled auto-pilots this morning to both AGLT2 and ANALY_AGLT2 while we investigate the cause. Indications of hitting an open file limit on the system were found, and we need to understand the cause. Queues were set off-line. Later Friday afternoon, from Bob: We increased several sysctl parameters on gate01 dealing with total number of available file handles. Issues resolved, queues set back on-line. eLog 21583.
4) 1/30: AGLT2 - job (stage-out: "Internal name space timeout lcg_cp: Invalid argument") & file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]"). From Shawn: This morning around 8 AM Eastern time our postgresql server for the dCache namespace (Chimera) filled its partition with logging info (over 10 GB in the last 24 hours). This was traced to multiple attempts to re-register a few files over and over. We have cleaned up space on the partition and modified the logging to be "terse" so this won't happen as easily in the future. ggus 66794 in-progress, eLog 21616.
5) 2/1: Maintenance outage at AGLT2 - from Bob: The outage will include all of Condor, as well as a dCache outage and upgrade.
Update 2/1 late afternoon: outage extended in OIM to 10 p.m. EST. Later, early a.m. 2/2: work completed, test jobs were successful, queues set back on-line. eLog 21696.
6) 2/2: UTD-HEP set off-line at request of site admin. Rolling blackouts in the D-FW area (unfortunately). eLog 21702.
7) 2/2: WISC_DATADISK - failing functional tests with file transfer errors like " Can't mkdir: /atlas/xrootd/atlasdatadisk/step09]." ggus 66897 in-progress, eLog 21695.
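Regarding item 3 above (indications of hitting an open file limit on the gatekeeper): the following is a minimal sketch, not the procedure actually used at the site, of how system-wide file-handle usage could be checked on a Linux host. It assumes the standard three-field format of /proc/sys/fs/file-nr (allocated handles, allocated but unused handles, system-wide maximum). The function names and the 0.8 warning threshold are hypothetical.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three integers: allocated handles, allocated-but-unused
    handles, and the system-wide maximum (fs.file-max).
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def handle_usage_fraction(stats):
    """Fraction of the system-wide file-handle limit currently in use."""
    in_use = stats["allocated"] - stats["unused"]
    return in_use / stats["max"]


def check_host(path="/proc/sys/fs/file-nr", warn_at=0.8):
    """Return (fraction in use, warning flag) for the local Linux host.

    warn_at is an arbitrary, hypothetical threshold.
    """
    stats = parse_file_nr(Path(path).read_text())
    fraction = handle_usage_fraction(stats)
    return fraction, fraction >= warn_at


if __name__ == "__main__":
    # Sample text in the file-nr format; the values are made up.
    sample = "9600\t0\t12000\n"
    stats = parse_file_nr(sample)
    print(stats, handle_usage_fraction(stats))
```

A check like this only reports the system-wide limit; per-process limits (ulimit) are a separate setting and are not covered here.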
Follow-ups from earlier reports:
(i) 12/17, 12/20: ANALY_SWT2_CPB was auto-blacklisted twice. Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up. Once the transfer completed the test jobs began to complete successfully. Discussion underway about how to address this issue.
(ii) 12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]." ggus 65617 in-progress, eLog 20810.
Update 1/30 from a shifter: No more problems seen - closing this ticket (ggus 65617).
(iii) 1/9: AGLT2 - low level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)." Site is investigating.
(iv) 1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist." ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036. Also https://savannah.cern.ch/bugs/index.php?77139.
1/25: Update from Shawn:
I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at http://bourricot.cern.ch/dq2/consistency/
Let me know if there are further issues.
(v) 1/19: UTD-HEP - job failures with missing input file errors - for example: "19 Jan 07:07:10|Mover.py | !!FAILED!!2999!! Failed to transfer HITS.170554._000123.pool.root.2: 1103 (No such file or directory)." ggus 66284, eLog 21346.
Update 1/27: from the site admin: These errors seem to have been resolved by the LFC cleaning -- closing the ticket. eLog 21612.
(vi) 1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed." ggus 66298. From Hiro:
There is a known issue for users with the Israel CA having problems accessing BNL and MWT2. This is being actively investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other site (LOCALGROUPDISK area) for downloading.
(vii) 1/21: SLACXRD file transfer errors - "failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]." Issue was reported to be fixed by Wei, but the errors reappeared later the same day, so the ticket (ggus 66346) was re-opened. eLog 21409.
Update 1/30 from a shifter: No more errors in the last 12 hours, 400 successful transfers, maybe migration comes to an end. ggus 66346 closed, eLog 21611.
(viii) 1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]." https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
(ix) 1/24: AGLT2 job & file transfer errors - "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]." ggus 66450 in-progress, eLog 21488. Update from Bob at AGLT2:
Just restarted dcache services on head01. rsv srmcp-readwrite had been red. Hopefully that will clear the issue. Since the queues at the site (analy_, prod) had been set offline (ADC site exclusion ticket: https://savannah.cern.ch/support/?118828) test jobs were submitted, and they completed successfully (eLog 21497). Are we ready to close this ticket?
Update 1/26: The site team restarted dcache services on head01 (rsv srmcp-readwrite had been red). Test jobs completed OK. ggus 66450 closed, eLog 21526.
(x) 1/25: SLACXRD_DATADISK file transfer errors - "[Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] No space found with at least 2895934054 bytes of unusedSize]." http://savannah.cern.ch/bugs/?77346.
Update 1/26 from Wei: This can be ignored. I was moving data among storage nodes and was filling the quota fast.
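Many of the items above quote FTS failure reasons of the form "[SOURCE|DESTINATION|TRANSFER error during <PHASE> phase: [<CODE>] <detail>]". As an illustration only, here is a small sketch of how such strings could be split into fields and tallied by error code. The pattern is inferred solely from the messages quoted on this page and is not an official FTS format definition; the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred only from the error strings quoted on this page.
_REASON = re.compile(
    r"(?P<side>SOURCE|DESTINATION|TRANSFER) error during "
    r"(?P<phase>[A-Z_]+) phase:\s*"
    r"\[(?P<code>[A-Z_]+)\]\s*"
    r"(?P<detail>.*?)\]?$"
)


def parse_fts_reason(message):
    """Return side, phase, code and detail as a dict, or None if no match."""
    match = _REASON.search(message.strip())
    return match.groupdict() if match else None


def tally_codes(messages):
    """Count error codes over an iterable of messages, skipping non-matches."""
    parsed = (parse_fts_reason(m) for m in messages)
    return Counter(p["code"] for p in parsed if p is not None)


if __name__ == "__main__":
    # Two of the failure reasons quoted in the reports above.
    examples = [
        "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during "
        "TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]",
        "[TRANSFER error during TRANSFER phase: [FIRST_MARKER_TIMEOUT] "
        "First non-zero marker not received within 180 seconds]",
    ]
    for text in examples:
        print(parse_fts_reason(text))
    print(tally_codes(examples))
```

Messages that do not follow this shape (for example the bare "failed to contact on remote SRM" strings) return None and are left out of the tally.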
Please note that this site is a content mirror of the BNL US ATLAS TWiki.