Yuri's summary from the weekly ADCoS meeting:
https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=163385
or
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_21_2011.html
1) 11/16 - 11/17: MWT2_UC - Sarah noticed the site was draining due to jobs not progressing from 'assigned' to 'activated'. The problem was traced to a configuration
issue on the panda servers following their migration to VMs - now fixed.
2) 11/16 - 11/17: Job failures at OU_OCHEP_SWT2 with the error "Required CMTCONFIG (x86_64-slc5-gcc43-opt) incompatible with that of local system
(local cmtconfig not set)." Some of the release 16.6.7 caches had not yet been re-installed - now completed, so the issue is apparently resolved.
http://savannah.cern.ch/bugs/?88904 closed, eLog 31548. (See item #10 from last week's summary.)
3) 11/16: SLAC - file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]"). Wei reported the problem
had been fixed. ggus 76524 closed, eLog 31539.
4) 11/17: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]"). Site set off-line for a maintenance
outage. http://savannah.cern.ch/support/?124772 (Savannah site exclusion).
Update 11/19: outage completed - test jobs successful, site set back on-line. ggus 76570 closed, eLog 31550/614.
5) 11/18: SLAC - file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]"). Wei reported that a dataserver
went off-line, now back up. ggus 76575 closed, eLog 31578.
6) 11/18: NET2 - file transfer failures at several sites with NET2 as the source (example: "...has trouble with canonical path. cannot access it"). From Saul: We're
having a file system problem this evening with GPFS which will cause jobs to fail with get/put errors. We've turned off FTS, ddm endpoints and put our PanDA
queues in brokeroff. ggus 76587. Later, a GPFS hardware problem was fixed - issue resolved, ggus ticket closed.
7) 11/18: AGLT2_CALIBDISK transfer errors ("user has no permission to create file /pnfs/aglt2.org/atlascalibdisk/..."). ggus 76576 in-progress, eLog 31563.
8) 11/19: Michael reported that an SRM issue at BNL which was causing file transfer failures had been resolved. eLog 31617.
9) 11/19: OU_OCHEP_SWT2 - file transfer errors ("failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]"). A restart of the SRM
service fixed the problem. ggus 76626 / RT 21254 closed, eLog 31625.
10) 11/21 - 11/22: AGLT2 - Job failures with pilot errors like "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)|
Log put error: lsm-put failed (201)." Issue resolved - test jobs successful, site set back on-line. ggus 76684 closed, eLog 31677/93/727. (Site was in a scheduled
downtime 21 November, 17:00 – 20:00, to clean up the dCache database.) Savannah site exclusion:
https://savannah.cern.ch/support/index.php?124827.
11) 11/21 - 11/22: MWT2_UC was draining due to a lack of activated jobs. Wensheng noticed a couple of issues affecting pandamover (not site specific), and
eventually the problem seemed to go away. More details in the associated e-mail thread.
12) 11/22 early a.m.: ADCR db's not accessible for ~30 minutes. (Affects, among other services, access to panda servers.) Issue possibly related to an intervention
on a network switch around the same time. eLog 31707.
13) 11/22: OU_OCHEP_SWT2 - jobs failing with the error "Rlease16.6.7 jobs failed with Required CMTCONFIG (x86_64-slc5-gcc43-opt) incompatible with that of
local system." (A similar release issue occurred around 11/12 - 11/15, see ggus 76278.) Apparently some of the cache re-installs for release 16.6.7 were still
needed - now completed. Test jobs to the site successful - set back on-line (finally...) on 11/27. ggus 76708 / RT 21264 closed,
https://savannah.cern.ch/support/?124858 (Savannah site exclusion), eLog 31732/859.
14) 11/22: MWT2 sites were set off-line due to a crashed NFS server. Once the service was migrated to a new server, test jobs were submitted and completed successfully,
so the queues were set back on-line. eLog 31739.
15) 11/23: SWT2_CPB - failed file transfers, due to a storage server going off-line. Cooling fan on the NIC died, now replaced. Server is available again, transfers
succeeding. http://savannah.cern.ch/support/?124876 (Savannah site exclusion), eLog 31768.
Follow-ups from earlier reports:
(i) Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3. Test jobs have run successfully at the site.
(ii) 10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF. This is the same type of issue
that has been observed at several US tier-2d's when attempting to copy job output files to other clouds. Working to understand this problem and decide how best
to handle these situations. Discussion in eLog 30170.
(iii) 10/14: AGLT2 file transfer errors ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds"). Probably not an issue on the AGLT2
side, but rather slowness on the remote end (in this case TRIUMF). There was a parallel issue causing job failures (a hung NFS server), which has since been
resolved. ggus 75302 closed, eLog 30415.
Update 10/15: ggus 75348 was opened for the site, again initially related to job failures due to slow output file transfer timeouts. What was probably a real site issue
(jobs failing with "ERROR 201 Copy command exited with status 256 | Log put error: lsm-put failed (201)") began appearing on 10/17. Most likely related to a dCache
service restart. Ticket in-progress, eLog 30443, https://savannah.cern.ch/support/?124073 (Savannah site exclusion).
Update 11/3: https://savannah.cern.ch/support/?124073 closed, but ggus 75348 is still 'in-progress'.
Update 11/17: ggus 75348 marked as 'solved'.
(iv) 11/11: UTD-HEP - job failures with errors like "Payload stdout file too big: 3889399268 B (larger than limit 2147483648 B)." Seems to be a site issue, since
jobs from the same tasks run successfully elsewhere. http://savannah.cern.ch/bugs/?88774, eLog 31306.
(v) 11/11: SLAC - job failures with a message like "Error accessing path/file for root file..." From Wei: I see many failed jobs. they were retries of several missing files.
I manually checked the file in the ticket, along with a few other files used by other failed jobs. They are not presented in our storage. The storage logs show that they
were deleted a few hours before the jobs. The log also shows that they were all deleted by the SRM host, all at 13:43 UTC (5:43 PST). ggus 76264 in-progress,
eLog 31307.
(vi) 11/12: NET2 - job failures with the error "!!WARNING!!3000!! Trf setup file does not exist at: /atlasgrid/Grid3-app/atlas_app/atlas_rel/16.6.8/AtlasProduction/
16.6.8.2/AtlasProductionRunTime/cmt/setup.sh." Site investigating - ggus 76271, eLog 31469.
Update 11/17: a later kit validation restored a missing link in the release area - issue resolved. ggus 76271 closed.
Yuri's summary from the weekly ADCoS meeting:
https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=164357
or
http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_28_2011.html
1) 11/23: ADCR DB's down. Issue was a disk hardware failure. Services restored as of early a.m. 11/24. eLog 31789/802.
2) 11/23: related to 1) above, central LFC service was unavailable due to being unable to contact the adcr_lfc Oracle database. This resulted in a very large number
of failed jobs at many sites. ggus 76769 was opened for job failures at AGLT2 during this time; this was not a site issue, but rather due to the LFC outage. ggus 76769
closed, 76770 (ticket for the LFC outage) also closed. https://savannah.cern.ch/bugs/?89216, eLog 31784.
3) 11/24 - 11/25: transfer of output datasets was taking a long time. Tadashi noticed a python2.5 problem on panda server machines, such that datasets were not
getting closed properly. Pandaserver was modified to use curl instead of pycurl, and this appears to have fixed the problem. More details in:
http://savannah.cern.ch/support/?124915, plus the associated e-mail thread.
4) 11/25: OU_OCHEP_SWT2_DATADISK file transfer errors (" [SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/
CN=531497/CN=Robot: ATLAS Data Management"). Problem solved by Horst by implementing a round robin adler32 checksum method. ggus 76830 / RT 21275
closed, eLog 31849. https://savannah.cern.ch/support/index.php?124934 (Savannah site exclusion).
5) 11/27 early a.m.: MWT2_UC file transfer failures ("failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]"). From Sarah: The dCache
headnode had run out of diskspace and was causing services to fail. I have freed up diskspace and restarted dCache. ggus 76840 in-progress, eLog 31854.
Update 11/29: Issue seems to be resolved - ggus 76840 closed. eLog 31924.
6) 11/27: BELLARMINE-T3_DATADISK - file transfer errors ("failed to contact on remote SRM
[httpg://tier3-atlas2.bellarmine.edu:8443/srm/v2/server]"). From Horst: It looks like this was a network problem which went away again, since now
my srm tests are working again. ggus 76846 in-progress, eLog 31869.
Follow-ups from earlier reports:
(i) Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3. Test jobs have run successfully at the site.
(ii) 10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF. This is the same type of issue
that has been observed at several US tier-2d's when attempting to copy job output files to other clouds. Working to understand this problem and decide how best
to handle these situations. Discussion in eLog 30170.
(iii) 11/11: UTD-HEP - job failures with errors like "Payload stdout file too big: 3889399268 B (larger than limit 2147483648 B)." Seems to be a site issue, since
jobs from the same tasks run successfully elsewhere. http://savannah.cern.ch/bugs/?88774, eLog 31306.
Update 11/29: It was pointed out in the Savannah ticket that these errors could be associated with a corrupt db release file (in cvmfs). Since all of the recent failed
jobs were occurring on two specific WN's the site admin ran a 'service cvmfs flush' on these hosts, and this appears (at least so far) to have fixed the problem.
During this period ggus 76757 was opened, and closed with the status 'not a site issue' - eLog 31886.
(iv) 11/11: SLAC - job failures with a message like "Error accessing path/file for root file..." From Wei: I see many failed jobs. they were retries of several missing files.
I manually checked the file in the ticket, along with a few other files used by other failed jobs. They are not presented in our storage. The storage logs show that
they were deleted a few hours before the jobs. The log also shows that they were all deleted by the SRM host, all at 13:43 UTC (5:43 PST). ggus 76264 in-progress,
eLog 31307.
Update 11/29 from Wei: I don't think we know the reason why we don't have the data files. and it is not happening. ggus 76264 closed.
(v) 11/18: AGLT2_CALIBDISK transfer errors ("user has no permission to create file /pnfs/aglt2.org/atlascalibdisk/..."). ggus 76576 in-progress, eLog 31563.
Update 11/29: No recent occurrences of this error - issue appears to be resolved. Closed ggus 76576 - eLog 31926.