scheddb modifications: SchedConfig modification
Yuri's summary from the weekly ADCoS meeting: he's on vacation this week. See summary from Alessandra Forti here: http://www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-%28Aug17-23-2010%29.txt
1) 8/18: WISC - DDM transfer errors: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries]. Resolved as of 8/23, ggus 61283 closed, eLog 16167. (Note: ggus 61352 may be a related ticket?)
2) 8/18 - 8/19: SWT2_CPB - Source file/user checksum mismatch errors affecting DDM transfers. Issue resolved - from Patrick: The RAID in question rebuilt because of a failed drive and the problem has disappeared. ggus 61249, RT 1791 both closed, eLog 15965.
3) 8/18 - 8/19: SWT2_CPB - fiber cut on a major AT&T segment in the D-FW area. Reduced network capacity during this time, so Hiro reduced FTS transfers to single file to help alleviate the load. Cut fiber repaired, system back to normal.
4) 8/19: OU_OCHEP transfer errors: [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. From Horst: This must've been another SE overload because of the Lustre timeout bug. It has resolved itself with an auto-restart of Bestman, therefore I turned the channel back on and am resolving this ticket. ggus 61289, RT 17997 both closed, eLog 16019.
5) 8/19: from Charles at UC: Our PNFS server at MWT2_UC is temporarily down due to a power problem. It will be back up ASAP. In the meanwhile, FTS channels for UC are paused. Later: Server is back up and channels are being reopened.
6) 8/19: SRM problem at BNL - issue resolved. From Michael: Most likely caused by a faulty DNS mapping file that was created by BNL's central IT services. Forensic investigations are still ongoing. Transfers resumed at good efficiency.
7) 8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system. 8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line. Still waiting for a complete set of ATLAS s/w releases to be installed at OU_OSCER_ATLAS. eLog 16119.
8) 8/19: SWT2_CPB off-line for ~6 hours to add storage to the cluster. Work completed, system back up as of late Thursday evening. eLog 16072.
9) 8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out." Issue with a dCache pool - from Pedro (8/25): We still have a dcache pool offline. It should be back by tomorrow noon. In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt. ggus 61338/355, eLog 16154/230.
10) 8/20: Job failures at IllinoisHEP with the error "lcg_cp: Communication error on send." Issue resolved - from Dave: I believe this was due to some networking issues on campus. ggus 61339 closed, eLog 16084.
11) 8/23: BNL - file transfer errors from BNL to several destinations: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries]. From Michael: The problem is solved. The dCache core server lost connection with all peripheral dCache components. Transfers resumed after restart of the server. ggus 61359 closed, eLog 16174.
12) 8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK. From Wei: Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish. ggus 61537 in progress, eLog 16223/33.
Follow-ups from earlier reports:
(i) 8/15-8/16: NERSC_HOTDISK request timeout transfer errors. ggus 61148 (in progress), eLog 15906. Update from Hiro, 8/23: NERSC has fixed the problem. The transfers are working normally. ggus 61148 closed.
(ii) 8/16-8/17: MWT2_UC maintenance outage - from Aaron: We are taking a downtime tomorrow, Tuesday August 17th all day in order to migrate our PNFS service to new hardware. Site back on-line as of ~4:30 p.m. CST 8/17. Some issues with pilots at the site - under investigation. Update, 8/24: Xin added a second submit host for ANALY_MWT2, as it appeared that a single one could not keep up, especially in situations where the analysis jobs are very short (<5 minutes, etc.) and cycle through the system at a high rate. Also, the schedconfig variable "timefloor = None" was set to 60 (i.e., pilots will run for at least 60 minutes if there are real jobs to pick up).
(iii) 8/17: From Wei at SLAC: Due to power work we will need to shutdown some of our batch nodes. This will likely result in reduced capacity at WT2 (8/18 - 8/20). We may also use the opportunity to reconfigure our storage (If that happens, we will send out an outage notice). 8/25: Presumably this outage is over?
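The effect of the timefloor setting mentioned in the MWT2 follow-up can be pictured with a small sketch. This is not the actual pilot code: `run_pilot`, `get_job`, and the injectable clock are hypothetical names, and treating `None` as "no floor" is an assumption. The only behaviour taken from the notes is that a pilot keeps picking up jobs until at least `timefloor` minutes have elapsed, provided there are real jobs to pick up.

```python
import time


def run_pilot(get_job, timefloor_minutes, clock=time.monotonic):
    """Run jobs until the timefloor has elapsed or no job is available.

    get_job: hypothetical callable returning a job (itself a callable) or None.
    timefloor_minutes: minimum pilot lifetime in minutes. None or 0 is
        treated here as "no floor" (an assumption, not confirmed by the notes).
    clock: callable returning seconds; injectable so the loop can be tested.
    Returns the number of jobs executed.
    """
    start = clock()
    floor_seconds = (timefloor_minutes or 0) * 60
    jobs_run = 0
    while True:
        job = get_job()
        if job is None:
            break  # nothing to pick up: exit even if the floor is not reached
        job()
        jobs_run += 1
        if clock() - start >= floor_seconds:
            break  # floor reached: do not start another job
    return jobs_run
```

With very short analysis jobs, a floor of this kind lets one pilot absorb many jobs in sequence instead of requiring a fresh pilot submission per job, which is the load the second submit host was also meant to relieve.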
Yuri's summary from the weekly ADCoS meeting: not available this week (Yuri just back from vacation)
1) 8/25: DDM failures on WISC_GROUP token - "cannot continue since no size has been returned after PrepareToGet or SrmStat." Issue resolved. ggus 61572 closed, eLog 16245.
2) 8/26: ggus 60982 (transfer errors from SLACXRD_USERDISK to SLACXRD_LOCALGROUPDISK) was closed. eLog 16268.
3) 8/27: SWT2_CPB_USERDISK - several files were unavailable from this token. Issue resolved - from Patrick: Sorry for the problems. The Xrootd system was inconsistent between the SRM and the disk contents. The issue has been resolved and the files are now available via the SRM. (A 4 hour maintenance outage was taken 8/27 p.m. to fix this problem.) ggus 61611 & RT 18043 closed, eLog 16349.
4) 8/28: From Wei at SLAC: We will take WT2 down on Monday Aug 30, 10am to 5pm PDT (5pm - 12am UTC) for site maintenance and to bring additional storage online. Maintenance completed on 8/30 - test jobs successful, queues set back on-line as of ~1:30 p.m. CST. eLog 16453.
5) 8/29: BNL DDM errors - T0 exports failing to BNL-OSG2_DATADISK. Initially seemed to be caused by a high load on pnfs01. Problem recurred, from Jane: pnfs was just restarted to improve the situation. ggus 61627 & goc #9155 both closed; eLog 16360/77.
6) 8/30: BNL - upgrade of storage servers' BIOS - from Michael: This is to let shifters know that at BNL we are taking advantage of the LHC technical stop and conducting an upgrade of the BIOS of some of our storage servers. This maintenance is carried out in a "transparent" fashion meaning the servers will briefly go down for a reboot (takes ~10 minutes). As files requested during that time are not available there may be a few transfer failures due to "locality is unavailable." Maintenance completed - eLog 16419.
7) 9/1: BNL - Major upgrade of the BNL network routing infrastructure that connects the US Atlas Tier 1 Facility at BNL to the Internet. Duration: Wednesday, Sept 1, 2010 9:00AM EDT to 6:00PM EDT. See eLog 16476.
Follow-ups from earlier reports:
(i) 8/18: WISC - DDM transfer errors: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries]. Resolved as of 8/23, ggus 61283 closed, eLog 16167. (Note: ggus 61352 may be a related ticket?) As of 8/31 ggus 61352 is still 'in progress'.
(ii) 8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system. 8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line. Still waiting for a complete set of ATLAS s/w releases to be installed at OU_OSCER_ATLAS. eLog 16119. As of 8/31 no updates about atlas s/w installs on OU_OSCER.
(iii) 8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out." Issue with a dCache pool - from Pedro (8/25): We still have a dcache pool offline. It should be back by tomorrow noon. In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt. ggus 61338/355, eLog 16154/230. As of 8/31 ggus 61338 'solved', 61355 'in progress'.
(iv) 8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK. From Wei: Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish. ggus 61537 in progress, eLog 16223/33. Update, 8/31: a bad hard drive was replaced - issue resolved. ggus 61537 closed, eLog 16327.
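The FTS failure strings quoted in these reports share a recognizable shape: a side (SOURCE or DESTINATION), a phase, a bracketed error class, and the SRM endpoint. When triaging many tickets it can help to pull those fields out automatically. The sketch below is only that: the regular expression is inferred from the examples quoted on this page, not from any FTS specification, and `parse_fts_reason` is a hypothetical helper name.

```python
import re

# Pattern inferred from the error strings quoted in these notes (an assumption,
# not an official FTS format description).
FTS_REASON = re.compile(
    r"\[(?P<side>SOURCE|DESTINATION) error during (?P<phase>\w+) phase: "
    r"\[(?P<error>\w+)\] .*?\[(?P<endpoint>httpg://[^\]]+)\]"
)


def parse_fts_reason(text):
    """Return a dict with side, phase, error and endpoint, or None if no match."""
    match = FTS_REASON.search(text)
    return match.groupdict() if match else None


# Example taken verbatim from the WISC report above.
sample = (
    "FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during "
    "TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM "
    "[httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries]"
)
fields = parse_fts_reason(sample)
```

Grouping parsed results by `endpoint` and `error` would make it easier to see when several tickets point at the same SRM, as with the two WISC tickets noted above.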
USATLAS Throughput Meeting Notes --- August 24, 2010
====================================================
Attending: Shawn, Jason, Dave, Andy, Sarah, Horst, Karthik, Philippe, Aaron, Tom, Hiro
1) perfSONAR identified issue resolution status:
a) OU: Karthik reported on OU testing. Relocating the test node improved things. Now using Tier-2 nodes directly, testing to BNL (BNL->OU is OK, OU->BNL is poor). Lots of tuning on OneNet and at BNL and OU. Today's results suggest some issue visible in tracepath, but we should perhaps stay focused on the current tools/tests for now. Horst and Karthik are enabling BWCTL on their 10GE hosts. Hiro tested with real file transfers: BNL->OU is OK (1 file, 100MB/sec, 5 streams); OU->BNL performs poorly (20MB/sec; same options; same file).
b) Illinois - Configuration change made to the 6500...no change in dropped packets from BNL to Illinois (problem is opposite direction from OU). Net admins see some possible issues but haven't had time to track things down yet and have a new testing node to help isolate the issue. Trying to enable jumbo frames (MTU=9000). (Bit of a discussion on jumbo frame issues seen at MWT2 and OU). Tracepath may be useful to debug UC-IU jumbo problems involving PBS. Some discussion about how NATs deal with jumbo frames on outside/normal frames on inside.
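When checking whether a path really carries jumbo frames (MTU=9000) end to end, a common test is a non-fragmenting ping whose payload exactly fills the MTU. The arithmetic is sketched below, assuming IPv4 without options (20-byte IP header) and an 8-byte ICMP header. The helper names and the target host are hypothetical, and the `-M do` / `-s` flags are those of the Linux iputils `ping`; other platforms use different options.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU too small for an IPv4 ICMP echo packet")
    return payload


def ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command that forbids fragmentation."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_icmp_payload(mtu)), host]


# Hypothetical target host; a real test would use the remote perfSONAR node.
jumbo_cmd = ping_command("remote-host.example.org", 9000)
```

If a ping built for MTU 9000 fails while the one for MTU 1500 succeeds, some hop on the path is not passing jumbo frames, which is the kind of problem tracepath can then help localize.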
2) perfSONAR --- RC3 out maybe next week. New version (CentOS/driver) seems to have changed the performance of the system at BNL (say 900 Mbps -> 700 Mbps). This is possibly a show-stopper. Trying a new driver at BNL to see if it resolves things. Jason has put out some new patches to extract timing info from old and current perfSONAR instances. Philippe reported on patching/fixing systems at AGLT2_UM. After 'myisamchk -er' on the DB things were much faster. Some nightly checks are not in place or perhaps not working? No progress yet on the "single perfSONAR instance" Dell R410 at UM; waiting for the release to test it. Very soon will have focused access for testing/deploying on the R410. Some discussion about install options for the next perfSONAR release. Jason is working on documenting the "to disk" install variant and it should be available when the release is ready. Will use a repo and YUM to maintain the on-disk version.
3) Monitoring perfSONAR --- Plugins with "liveness" and thresholds are being worked on. Work is in progress to package them as RPMs. They can be installed on the perfSONAR host and/or the Nagios server. The plugins are Perl scripts underneath. Tom thought it would be straightforward to implement on the BNL Nagios server and can even provide a "perfSONAR" focused page showing just the perfSONAR-related tests. Andy mentioned that documentation will be ready in advance of the plugins and he will let us know so we can start discussing options.
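As an illustration of what a threshold-style check looks like, here is a minimal sketch that follows the standard Nagios plugin convention for return codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It is written in Python for brevity, whereas the plugins discussed above are Perl. The function name, the thresholds, and the idea of passing in an already-measured throughput are assumptions for illustration, not the actual perfSONAR plugin interface.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_throughput(measured_mbps, warn_below, crit_below):
    """Map a throughput measurement onto a Nagios status and message.

    measured_mbps: latest measurement in Mbps, or None if no recent result
        exists (a crude stand-in for a "liveness" check; hypothetical).
    warn_below / crit_below: hypothetical thresholds in Mbps.
    """
    if measured_mbps is None:
        return UNKNOWN, "THROUGHPUT UNKNOWN - no recent measurement"
    if measured_mbps < crit_below:
        return CRITICAL, f"THROUGHPUT CRITICAL - {measured_mbps} Mbps"
    if measured_mbps < warn_below:
        return WARNING, f"THROUGHPUT WARNING - {measured_mbps} Mbps"
    return OK, f"THROUGHPUT OK - {measured_mbps} Mbps"


# Example with made-up thresholds: a drop from 900 to 700 Mbps would move
# the check from OK to WARNING.
status_before, message_before = check_throughput(900, warn_below=800, crit_below=500)
status_after, message_after = check_throughput(700, warn_below=800, crit_below=500)
```

A real plugin would print the message on one line and exit with the status code so that Nagios can pick both up.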
4) Round-table, site reports, other topics --- Hiro - Question: will perfSONAR be extended to each ATLAS site? Jason: perfSONAR MDM exists at Tier-1, Tier-0. Hiro points out problems with trying to debug Tier-1 issues and perfSONAR would be helpful. Also question about Nagios plugin working with MDM variant; Andy: may work but slight differences may cause problems. MWT2 has asymmetry involving UC, IU is OK.
Heads up: once the perfSONAR release is ready sites should be prepared to upgrade ASAP (within 1-2 weeks).
Plan to have our next meeting in 2 weeks. Please send along any additions or corrections to these notes via email.
Thanks,
Shawn