Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_31_11.html
1) 1/26: data transfer errors from SLACXRD_USERDISK to MWT2_UC_LOCALGROUPDISK ("source file doesn't exist"). From Wei: I think you can close this ticket. There are only a few missing files and they do not exist at WT2. I don't know why FTS was asked to transfer them (maybe they were there when the request was submitted?). Repeated transfer requests created lots of failures simply because the files don't exist. ggus 66613 closed, eLog 21535.
2) 1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value." Consolidated into a single goc ticket, https://ticket.grid.iu.edu/goc/viewer?id=9871. Will be resolved in a new OSG release currently being tested in the ITB.
3) 1/27: from Bob at AGLT2 - At 1pm EST AGLT2 had a dCache issue. Available postgres connections had been dropped from 1000 to 300 during a pgtune a few days ago, and this went unnoticed until the failure occurred. Unfortunately, this caused a LOT of job failures during the last 3 hours.
Later that evening / next morning:
We had some sort of "event" on our gatekeeper around 11pm last night. Ultimately, condor was shot, and our load was lost. I have disabled auto-pilots this morning to both AGLT2 and ANALY_AGLT2 while we investigate the cause. Indications of hitting an open file limit on the system were found, and we need to understand the cause. Queues were set off-line. Later Friday afternoon, from Bob: We increased several sysctl parameters on gate01 dealing with total number of available file handles. Issues resolved, queues set back on-line. eLog 21583.
4) 1/30: AGLT2 - job (stage-out: "Internal name space timeout lcg_cp: Invalid argument") & file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]"). From Shawn: This morning around 8 AM Eastern time our postgresql server for the dCache namespace (Chimera) filled its partition with logging info (over 10 GB in the last 24 hours). This was traced to multiple attempts to re-register a few files over and over. We have cleaned up space on the partition and modified the logging to be "terse" so this won't happen as easily in the future. ggus 66794 in-progress, eLog 21616.
5) 2/1: Maintenance outage at AGLT2 - from Bob: The outage will include all of Condor, as well as a dCache outage and upgrade.
Update 2/1 late afternoon: outage extended in OIM to 10 p.m. EST. Later, early a.m. 2/2: work completed, test jobs were successful, queues set back on-line. eLog 21696.
6) 2/2: UTD-HEP set off-line at request of site admin. Rolling blackouts in the D-FW area (unfortunately). eLog 21702.
7) 2/2: WISC_DATADISK - failing functional tests with file transfer errors like "Can't mkdir: /atlas/xrootd/atlasdatadisk/step09]." ggus 66897 in-progress, eLog 21695.
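Item 3 above traces the gatekeeper failure to an open file limit. As a rough illustration of how that headroom could be watched, the sketch below parses the three counters that Linux exposes in /proc/sys/fs/file-nr (allocated handles, allocated-but-unused handles, and the system-wide maximum, i.e. the fs.file-max sysctl). The function names and the 0.9 alert threshold are assumptions made for illustration; the report does not say which sysctl parameters were changed on gate01 or how the site monitors them.

```python
from pathlib import Path

# Linux-only location of the kernel's file handle counters.
FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text):
    """Return (allocated, unused, maximum) from the contents of file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def near_limit(text, threshold=0.9):
    """True when usage is at or above the (hypothetical) alert threshold."""
    return handle_usage(text) >= threshold


if __name__ == "__main__":
    # Only meaningful on a Linux host; silently does nothing elsewhere.
    if FILE_NR.exists():
        contents = FILE_NR.read_text()
        print("file handle usage: {:.1%}".format(handle_usage(contents)))
        if near_limit(contents):
            print("WARNING: close to the system-wide file handle limit")
```

Run periodically (for example from cron), a check of this kind would flag the condition before services start failing; the per-process limit (ulimit -n) is a separate setting and is not covered here.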
Follow-ups from earlier reports:
(i) 12/17, 12/20: ANALY_SWT2_CPB was auto-blacklisted twice. Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up. Once the transfer completed the test jobs began to complete successfully. Discussion underway about how to address this issue.
(ii) 12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]." ggus 65617 in-progress, eLog 20810.
Update 1/30 from a shifter: No more problems seen - closing this ticket (ggus 65617).
(iii) 1/9: AGLT2 - low level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)." Site is investigating.
(iv) 1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist." ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036. Also https://savannah.cern.ch/bugs/index.php?77139.
1/25: Update from Shawn:
I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at http://bourricot.cern.ch/dq2/consistency/
Let me know if there are further issues.
(v) 1/19: UTD-HEP - job failures with missing input file errors - for example: "19 Jan 07:07:10|Mover.py | !!FAILED!!2999!! Failed to transfer HITS.170554._000123.pool.root.2: 1103 (No such file or directory)." ggus 66284, eLog 21346.
Update 1/27: from the site admin: These errors seem to have been resolved by the LFC cleaning -- closing the ticket. eLog 21612.
(vi) 1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed." ggus 66298. From Hiro:
There is a known issue for users with Israel CA having problems accessing BNL and MWT2. This is being actively investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other site (LOCALGROUPDISK area) for downloading.
(vii) 1/21: SLACXRD file transfer errors - "failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]." Issue was reported to be fixed by Wei, but the errors reappeared later the same day, so the ticket (ggus 66346) was re-opened. eLog 21409.
Update 1/30 from a shifter: No more errors in the last 12 hours, 400 successful transfers, maybe migration comes to an end. ggus 66346 closed, eLog 21611.
(viii) 1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]." https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
(ix) 1/24: AGLT2 job & file transfer errors - "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]." ggus 66450 in-progress, eLog 21488. Update from Bob at AGLT2:
Just restarted dcache services on head01. rsv srmcp-readwrite had been red. Hopefully that will clear the issue. Since the queues at the site (analy_, prod) had been set offline (ADC site exclusion ticket: https://savannah.cern.ch/support/?118828) test jobs were submitted, and they completed successfully (eLog 21497). Are we ready to close this ticket?
Update 1/26: The site team restarted dcache services on head01 (rsv srmcp-readwrite had been red). Test jobs completed OK. ggus 66450 closed, eLog 21526.
(x) 1/25: SLACXRD_DATADISK file transfer errors - "[Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] No space found with at least 2895934054 bytes of unusedSize]." http://savannah.cern.ch/bugs/?77346.
Update 1/26 from Wei: this can be ignored. I was moving data among storage nodes and was filling the quota fast.
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=125577
1) 2/3: AGLT2 - job failures (stage-out errors) & DDM transfer failures. From Shawn: Last night I was working on getting the head02 setup as similar as possible to old head02. I installed yum-autoupdate as part of the process. This morning it upgraded postgres90 from 9.0.2-1 to 9.0.2-2. The problem is the version on head02 is custom built. This caused postgresql to shut down around 6:50 AM. I reverted, put the exclude into /etc/yum.conf and got things running again.
Also, there was a brief network outage which resulted in many "lost heartbeat" errors. Everything resolved by ~noon CST. eLog 21718.
2) 2/3: MWT2_UC - job failures with lost heartbeat & stage-in errors. From Nate at MWT2: We had a network outage at IU which caused those lost heartbeats. The nodes are still down until someone there can replace the switch. eLog 21729.
3) 2/3: US sites HU_ATLAS_Tier2, UTA_SWT2, SWT2_CPB - job failures due to a problem with atlas release 16.6.0.1. Xin reinstalled the s/w, issue resolved. ggus 66992-94, RT 19389-91 tickets closed, eLog 21732-34.
4) 2/4-2/5: BNL-OSG2_DATADISK transfer errors such as "failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries]." Issue was due to excessive load on the dCache pnfs server - now resolved. ggus 67005 closed, eLog 21745.
5) 2/5: Number of running production jobs in the U.S. cloud temporarily decreased - from Michael: The reason for the reduced number of running jobs was that a file system on one of the Condor-G submit hosts filled up earlier today. An alarm was triggered and Xin started cleaning up the filesystem a couple of hours ago. You will see the US cloud at full capacity shortly. eLog 21797.
6) 2/5-2/6: SWT2-CPB-MCDISK file transfer failures. Issue understood and resolved - from Patrick: The SRM failed when the partition containing bestman filled up due to logging. The logs were removed and the srm restarted. ggus 67070 / RT 19394 closed, eLog 21902.
7) 2/6: MWT2_UC - job failures with the error "Can't find [AtlasProduction_16_0_3_6_i686_slc5_gcc43_opt]." Xin was eventually able to install this cache (initially had a problem accessing the CE due to a load spike) - issue resolved. ggus 67074 closed, eLog 21856.
8) 2/7: IllinoisHEP lost heartbeat job failures. From Dave at Illinois: These were caused by a problem on our NFS server early this morning. The problem was fixed, but only after the currently running jobs all failed. ggus 67121 closed, eLog 21907.
9) 2/8: NET2_DATADISK - failing functional tests with "failed to contact on remote SRM" errors. Issue resolved - from Saul: Fixed (bestman needed a restart when we updated our host certificate). ggus 67145 closed, eLog 21912.
10) 2/8: OU_OCHEP_SWT2_DATADISK failing functional tests with "failed to contact on remote SRM" errors. Horst couldn't find an issue on the OU end, and subsequent transfers were succeeding. ggus 67146 closed, eLog 21913.
11) 2/8: FTS errors for transfers to a couple of U.S. cloud sites. The messages indicated a full disk on the FTS host: "ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]." Issue resolved by Hiro. ggus 67132 closed, eLog 21905.
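Several of the incidents above (items 5, 6 and 11) came down to a partition filling up, in at least one case because of logging. A minimal sketch of a fill-level check built on Python's standard shutil.disk_usage follows. The watched path and the 90% threshold are hypothetical and would need to be adapted per host; nothing here reflects the alarms actually in place at the sites.

```python
import shutil


def fill_fraction(total, used):
    """Used space as a fraction of total space (0.0 for an empty device)."""
    if total == 0:
        return 0.0
    return used / total


def nearly_full(path, threshold=0.9):
    """True when the partition holding `path` is at or above `threshold`."""
    usage = shutil.disk_usage(path)
    return fill_fraction(usage.total, usage.used) >= threshold


def report(paths, threshold=0.9):
    """Return the subset of `paths` whose partition is nearly full."""
    return [p for p in paths if nearly_full(p, threshold)]


if __name__ == "__main__":
    # "/" is only an example; a real check would list the partitions that
    # hold service logs and databases on the host in question.
    for path in report(["/"]):
        print("WARNING: partition holding {} is nearly full".format(path))
```

A check like this only reports the symptom; limiting log volume or rotating logs remains the actual fix.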
Follow-ups from earlier reports:
(i) 12/17, 12/20: ANALY_SWT2_CPB was auto-blacklisted twice. Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up. Once the transfer completed the test jobs began to complete successfully. Discussion underway about how to address this issue.
(ii) 1/9: AGLT2 - low level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)." Site is investigating.
(iii) 1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist." ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036. Also https://savannah.cern.ch/bugs/index.php?77139.
1/25: Update from Shawn:
I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at http://bourricot.cern.ch/dq2/consistency/. Let me know if there are further issues.
Update 1/28: files were declared 'recovered' - Savannah 77036 closed. (77139 dealt with the same issue.) ggus 66150 in-progress.
(iv) 1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed." ggus 66298. From Hiro: There is a known issue for users with Israel CA having problems accessing BNL and MWT2. This is being actively investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other site (LOCALGROUPDISK area) for downloading.
(v) 1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]." https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
(vi) 1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value." Consolidated into a single goc ticket, https://ticket.grid.iu.edu/goc/viewer?id=9871. Will be resolved in a new OSG release currently being tested in the ITB.
(vii) 1/30: AGLT2 - job (stage-out: "Internal name space timeout lcg_cp: Invalid argument") & file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]"). From Shawn: This morning around 8 AM Eastern time our postgresql server for the dCache namespace (Chimera) filled its partition with logging info (over 10 GB in the last 24 hours). This was traced to multiple attempts to re-register a few files over and over. We have cleaned up space on the partition and modified the logging to be "terse" so this won't happen as easily in the future. ggus 66794 in-progress, eLog 21616.
Update 2/3: issue resolved by reducing the level of postgresql logging. ggus 66794 closed, eLog 21717.
(viii) 2/2: UTD-HEP set off-line at request of site admin. Rolling blackouts in the D-FW area (unfortunately). eLog 21702.
Update 2/8: site recovered from power issues - test jobs completed successfully - set back on-line. eLog 21901, https://savannah.cern.ch/support/index.php?119022.
(ix) 2/2: WISC_DATADISK - failing functional tests with file transfer errors like "Can't mkdir: /atlas/xrootd/atlasdatadisk/step09]." ggus 66897 in-progress, eLog 21695.
Update 2/4: Site admin reported issue was resolved. No more errors, ggus 66897 closed.