Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=143330
1) 6/15: IllinoisHEP, from Dave: For some reason, many jobs in the IllinoisHEP production queue are failing. I am not sure why just yet, so I have put
this queue offline.
2) 6/19: DDM transfer errors to SLACXRD_PERF-JETS from multiple sources (" [DDM Site Services internal] Timelimit of 172800 seconds exceeded").
ggus 71675 in-progress, eLog 26572.
3) 6/20: DDM transfer errors to NET2_* tokens (" failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]"). From Saul & John at NET2:
Our srm was having a problem picking up new credentials this morning and was rejecting most requests and needed to be re-started. All seems to be
fine now. ggus 71701 closed, eLog 26603.
4) 6/21 early a.m.: SWT2_CPB - file transfer errors ("/bin/mkdir: cannot create directory..."). Issue was a failed disk in one of the RAIDs, which triggered
a re-build, but the controller hung up, which necessitated a reboot of the storage server. System back up as of early evening - test jobs successful,
prod & analy queues back to 'on-line'. ggus 71758 / RT 20237 closed, eLog 26680.
http://savannah.cern.ch/support/?121682.
5) 6/21: From Shawn at AGLT2: We have lost the current Condor job load at AGLT2. We had a problem with the iSCSI server that hosts the OSGHOME
and ATLAS release areas and a quick reboot turned into a much longer repair than anticipated. All running Condor jobs are lost and will show up (over
the next N hours) as lost-heartbeats, I assume. (Shifters were requested to ignore any associated lhb errors.)
6) 6/21: SLAC - job failures with errors like "Put error: lfc_creatg failed with (1015, Internal error)|Log put error: lfc_creatg failed with (1015, Internal error)."
Wei reported the issue was a failed disk, now fixed. ggus 71774 closed, eLog 26658, https://savannah.cern.ch/support/index.php?121698.
7) 6/21-22: NET2: jobs not being brokered to the site. Saul reported that tomcat died on the BU gatekeeper, causing the system to stop reporting to the
OSG bdii. Re-started, will monitor.
Follow-ups from earlier reports:
(i) 6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..." Not a site issue, but rather related to the problem seen recently with
transfers between US tier-2's and European destinations (under investigation). ggus 71177 closed, eLog 26032.
Update 6/7: still see large numbers of these kinds of job failures. ggus 71314, eLog 26202.
See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
(ii) 6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime. Outage completed as of early a.m. 6/13. However, jobs
are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?). Issue being tracked here:
https://ticket.grid.iu.edu/goc/viewer?id=10566.
(iii) 6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error,
consult log file." Wei solved the problem by disabling the multi-job pilots. Issue will be raised with panda / pilot developers. ggus 71475 closed,
eLog 26382.
(iv) 6/14: Job failures at HU_ATLAS_Tier2 with the error "lsm-get failed: time out after 5400 seconds." ggus 71539, eLog 26438.
Update 6/17 from Saul & John at NET2: Problem resolved by improving our LSM so that it can handle the whole Harvard site starting at once.
ggus 71539 closed.
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=2&confId=143581
1) 6/22: SLACXRD SRM errors (" failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]"). Later that day Wei
reported the problem had been fixed. ggus 71834 closed, eLog 26697.
2) 6/23 early a.m.: NET2 - low DDM transfer efficiency. From Saul: we saw a big burst of adler32 checksumming of small USERDISK files overnight
(I suspect that this is part of an ATLAS-wide burst of user activity). This caused our adler software to run out of I/O resources and eventually caused
bestman to stop. We added more I/O resources and re-started bestman about 1.5 hours ago. The adler backlog is down and we have been operating
normally since then. ggus 71843 closed, eLog 26712.
3) 6/23: (minor) pilot update from Paul (v47h): added debugging info in order to understand failures seen on dCache sites (TypeError: 'int' object is not
callable), related to Savannah ticket https://savannah.cern.ch/bugs/index.php?83380.
4) 6/23: IllinoisHEP - job failures with the error "SyntaxError: invalid syntax." ggus 71863, eLog 26723. Production queue set off-line.
Update 6/27-6/28: Dave reported that the issue was likely due to a problem with a squid server, which in turn impacted releases/cvmfs. Machine was
taken off-line - test jobs completed successfully, site back => on-line. (Following the re-start jobs were initially failing on one problematic WN, since removed.)
ggus 71863 closed, eLog 26886.
5) 6/23: BNL - SE maintenance intervention. Some file transfer / job errors, but went away once the work was completed. eLog 26722.
6) 6/24: Major issue with production across all clouds. Issue was traced to an overloaded host (atlascomputing.web.cern.ch) which was being hit with large
numbers of 'wget' requests to download MC job options files. (This system has been in place for several years, but over time the size of the job options .tgz
files has grown considerably.)
Many tasks were either paused or aborted to relieve the load on the server. Discussions underway about how to address this problem. Some info in
eLog 26744, 52, 54-56, more in an e-mail thread.
7) 6/25: ggus 71925 opened due to file transfer failures between IN2P3-CC & MWT2. Incorrectly assigned to MWT2 - actually an issue on the IN2P3 side.
Awaiting a response from IN2P3 personnel. ggus ticket closed, eLog 26781. (Also see related ggus ticket 71933.)
8) 6/25: BNL voms server was not accessible (a 'voms-proxy-init' against the server was hanging up). From John at BNL: I checked the server and although
the process was running, voms-proxy-init was indeed failing. A service restart has restored the functionality. ggus 71926 closed, eLog 26785.
9) 6/25-6/26: NET2 - DDM errors ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]"). Issue was due to heavy SRM activity. Saul
reported that changes were implemented to address the problem. No additional errors as of early 6/26. ggus 71923 closed, eLog 26778.
10) 6/27: SWT2_CPB - a user reported that his jobs were failing with the error "No input file available - check availability of
input dataset at site." Issue understood and resolved - from Patrick: The problem was traced to how the input files were registered in our LFC. The files
were registered in a compact form that causes problems for the run-athena transform because our system is configured to read ROOT files directly from
storage. The problematic LFC registrations were isolated to a week-long period in May when BNL began to run a new DQ2 Site Services version.
ggus 71935 / RT 20296 closed.
11) 6/28: Longstanding ggus ticket 69526 at NERSC closed (recent file transfer failures eventually succeeded on subsequent attempts). eLog 26876.
12) 6/28: AGLT2 - Bob reported that the site analysis queue was still set to 'brokeroff' after being auto-excluded by hammercloud testing on 6/25. For some
reason the 'HC.Test.Me' comment wasn't set for the site. This was corrected, but as of 6/29 a.m. ANALY_AGLT2 is still in the 'brokeroff' state?
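Item 2 above (6/23, NET2) attributes the bestman outage to a burst of adler32 checksumming of small files. As a rough illustration of why this is I/O-bound, here is a minimal sketch of a per-file adler32 calculation. It is not the NET2 software: the function name, the chunk size, and the zero-padded 8-character hex output are all assumptions made for this example.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as an 8-character hex string.

    Hypothetical helper for illustration only. The whole file has to be
    read once, so many small files mean many open/read/close cycles.
    """
    checksum = 1  # adler32 is defined to start from 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so chunks are combined correctly.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")
```

Every call reads the complete file from storage, so a burst of requests for many small files turns into a burst of small reads, which is consistent with the I/O exhaustion Saul describes.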
Follow-ups from earlier reports:
(i) 6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..." Not a site issue, but rather related to the problem seen recently with
transfers between US tier-2's and European destinations (under investigation). ggus 71177 closed, eLog 26032.
Update 6/7: still see large numbers of these kinds of job failures. ggus 71314, eLog 26202.
See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
(ii) 6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime. Outage completed as of early a.m. 6/13. However, jobs
are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?). Issue being tracked here:
https://ticket.grid.iu.edu/goc/viewer?id=10566.
(iii) 6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error,
consult log file." Wei solved the problem by disabling the multi-job pilots. Issue will be raised with panda / pilot developers. ggus 71475 closed, eLog 26382.
(iv) 6/19: DDM transfer errors to SLACXRD_PERF-JETS from multiple sources (" [DDM Site Services internal] Timelimit of 172800 seconds exceeded").
ggus 71675 in-progress, eLog 26572.
Update 6/27 from Wei: I will trace this one via GGUS ticket system. It is not a bug anywhere, and I made agreement with US ATLAS computing management
that this looks like a long term small project. ggus 71675 closed.
1. Follow-up on failed production jobs overnight at Illinois to understand cause (Dave et al)
2. Alessandro will modify the installation & validation code to check for the presence of local site overrides to setup files for either:
   a) traditional: use of pool file catalog file exported out of NFS plus conditions data in HOTDISK
   b) cvmfs: use PFC and conditions data from cvmfs
   This will provide the option to roll-back changes if there are problems with cvmfs, and to test performance and other issues associated with having conditions data served from cvmfs. Test both modes at Illinois. Note dbrelease files are still required in HOTDISK (even if unused) for Panda brokering purposes. The ATLAS worker node client will continue to be supported with the OSG worker node client for the time being; we discussed dependency issues and testing required in the case that dq2 clients may be drawn from CVMFS itself (involves worker node client, local site mover, pilot).
3. Prepare first pass of OSG-specific documentation in the ATLAS twiki, https://twiki.cern.ch/twiki/bin/view/Atlas/CernVMFS#Setup_Instructions_for_OSG_Grid
4. Broaden tests to include the following sites:
   MWT2 (new queue) - Sarah, starting next week
   SWT2_CPB - Patrick, starting in two weeks
   BNL_ITB - Xin, starting next week
5. Clearing of grid3-locations and re-validation and tagging of releases at sites from cvmfs. Note Panda brokering requires (sites)
6. Running of validation jobs over these sites - analysis, production, and HC (eg. this test as a template, http://hammercloud.cern.ch/atlas/10004919/test/) (sites)
7. Finalize any deployment instructions based on these tests.
cvmfs-talk proxy info will give you the information as to which squid is being used by a node, and it showed that UC was being used. So far, after rebooting and cleaning up the Illinois squid, I have not seen any problems with jobs and missing files in cvmfs, but since the problem is rare, it will take some time to know whether it is now gone.

There is a web site, not linked from the CVMFS twiki page as far as I know, that folks might like to know about: http://cernvm.cern.ch/portal. You can find the release notes for cvmfs, etc. on this site. Also, the writeup by Jakob Blomer is very useful: https://cernvm.cern.ch/project/trac/downloads/cernvm/cvmfstech-0.2.70-1.pdf.

One piece of information in this document is that it claims the servers for the repositories should have access to the local squid. It does not explain why, though. I doubt that my "missing files" problem is related to the fact that I did not have these servers in the allowed ACL, but just in case I have added them to my squid configuration file, which now looks like:

   acl our_networks src 192.17.18.32/28 192.168.207.0/24 128.174.118.0/24 127.0.0.1 cvmfs-stratum-one.cern.ch cernvmfs.gridpp.rl.ac.uk cvmfs.racf.bnl.gov
   http_access allow our_networks

If this really is a requirement of a squid for CVMFS, then I think it might be good to put this on the twiki as well. Perhaps Doug can get some clarification on this point.
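To sanity-check which client addresses the our_networks ACL quoted above would cover, something like the following sketch may help. It only mirrors the CIDR arithmetic using Python's standard ipaddress module and is not how squid itself evaluates ACLs. The helper names split_acl and ip_allowed are made up for this example, and the hostname entries are deliberately left unresolved (no DNS lookups).

```python
import ipaddress

# Entries copied from the "acl our_networks src ..." line quoted above.
ACL_ENTRIES = [
    "192.17.18.32/28",
    "192.168.207.0/24",
    "128.174.118.0/24",
    "127.0.0.1",
    "cvmfs-stratum-one.cern.ch",
    "cernvmfs.gridpp.rl.ac.uk",
    "cvmfs.racf.bnl.gov",
]


def split_acl(entries):
    """Split ACL entries into IP networks and (unresolved) hostnames."""
    networks, hostnames = [], []
    for entry in entries:
        try:
            # A bare address such as 127.0.0.1 becomes a /32 network.
            networks.append(ipaddress.ip_network(entry, strict=False))
        except ValueError:
            # Anything that is not an address or CIDR block is kept as a hostname.
            hostnames.append(entry)
    return networks, hostnames


def ip_allowed(ip, networks):
    """Return True if the address falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks, hostnames = split_acl(ACL_ENTRIES)
print(ip_allowed("192.17.18.40", networks))   # inside 192.17.18.32/28
print(ip_allowed("192.17.18.48", networks))   # just past the end of that /28
print(hostnames)                              # the three repository servers
```

The /28 block covers 192.17.18.32 through 192.17.18.47, so a worker node outside that range would need to fall within one of the /24 blocks to be allowed.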