Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=95901

1) 5/19: AGLT2, from Bob: An AT&T fiber was cut around 6:30pm. This caused a partial disruption between the MSU and UM machines of AGLT2, with the ultimate effect that MSU workers have no afs access at all (OSGWN setup is in afs). I don't know what will happen with jobs actually running at MSU at this time. Jobs running at UM will run and complete fine, and dCache file servers at MSU are fine. I have therefore initiated a peaceful condor idle of all MSU worker nodes. This means we will run at reduced capacity until the fiber problem can be resolved.

2) 5/19-20: New pilot version from Paul (44a), and minor patch (44b). Details are here: http://www-hep.uta.edu/~sosebee/ADCoS/pilot-update-May19-20-44a_b.html

3) 5/20: Second half of the May 2010 reprocessing exercise has begun. Status: http://atladcops.cern.ch:8000/j_info/

4) 5/21: From Hiro: There was a change in the alias for LFC within BNL CE hosts, intended to solve a network issue that caused some jobs to fail under certain heavy traffic. However, although it worked in the test, this change caused the clients/jobs to fail with authentication errors. As a result, the alias was changed back to the original setting. In the meantime, you will notice some jobs failed with authentication errors.

5) 5/21: SWT2_CPB - file transfer failures like: FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914:500-open() fail500 End.] From Patrick: The data server that was trying to store the transfers was misconfigured. The server was reconfigured and xrootd was restarted. ggus RT 58418, RT 17002 (both closed), eLog 13137.

6) 5/21-22: SWT2_CPB - A/C water leak in the machine room forced a power shutdown. Once power was restored and the services were brought back on-line, test jobs succeeded - the site is now back up. eLog 13002.

7) 5/22: AGLT2 - low efficiency for file transfers. Issue was heavy load on an SRM server, now resolved. eLog 12997.

8) 5/23: MWT2_UC - file transfer failures: FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY] Source file .... locality is UNAVAILABLE]. From Sarah: Two of the pools at MWT2 went offline this morning due to memory issues. They're back online now, and these transfers should start to succeed.

9) 5/23: MWT2_DATADISK low on free space: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] at Sat May 22 19:17:57 CDT 2010 state Failed : space with id=2310037 does not have enough space]. As of 5/25, 47 TB of free space is available. Savannah 67809 (closed), eLog 13009.

10) 5/23-26: WISC_DATADISK - file transfer errors. From site admin: We have some power problem. Now all data servers are not available. I have already submitted an OIM unscheduled downtime. Sorry for the problem. We will make the service available as soon as possible when the power problem is solved. Later: The problem was solved. We will have a scheduled downtime tomorrow evening in the university to upgrade the power. On 5/25: After the power upgrade in the whole CS room, some of our servers failed to get an IP address. Now we are working on it. ggus 58444 (in progress), eLog 13110,13.

11) 5/24: From John at NET2: Since there's been so little demand for production grid jobs over the past few days (today we ramped down to zero), I'm going to set HU_ATLAS_Tier2 to brokeroff so that we can perform some i/o tests without grid jobs interfering or getting harmed. This should only be for about a day or so.

12) 5/25: From Wei at SLAC, regarding problems with the SE: A data server went down at midnight. I got it back. I think we also have some intermediate DNS issue due to a partial power outage today.

13) 5/25: From Bob at AGLT2: I have stopped auto-pilots to AGLT2 and to ANALY_AGLT2 while we update the OSGWN version at our site. I will let the remaining jobs here (63 at last count) complete to a great extent, update the distribution, then re-enable the pilots. Later: OSGWN version upgraded to 1.2.9 and tested. Restarted queues. Jobs are running cleanly.

Follow-ups from earlier reports:

(i) 4/11: Failed jobs at AGLT2 with errors like: 11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist. Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera. Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"... are there follow-on attempts or is this site-db configured? Paul added to the thread in case there is an issue on the pilot side. ggus 57186, RT 15953, eLog 11406. In progress.
Update, 4/16: Still see this error at a low level, intermittently. For example, ~80 failed jobs on this date. More discussion posted in the ggus ticket (#57186).
Update, 5/4: Additional information posted in the ggus ticket. Also, see comments from Paul.
Update, 5/10: Additional information posted in the ggus ticket.
Update, 5/17: Additional information posted in the ggus ticket.
Update, 5/21: Additional information posted in the ggus ticket.

(ii) 4/23: OU sites were set off-line in advance of major upgrades -- from Horst: We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning. So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week? I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade. eLog 11813.
Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.

Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=96677

1) 5/26: SE problem at SLAC, now resolved. From Wei: A large amount of group data was put on one data server, which I put in service as we were about to run out of space. Now we are paying the price for that. I will open FTS and SRM for now with a reduced transfer rate (# parallel transfers in fts) and will put a cap on the number of analysis jobs.

2) 5/27: AGLT2 - autopilot failures with the error "Failed to download/unpack pilotcode.tar.gz." Issue resolved - from Bob: A missing routing table entry within Ultralight was repaired around 6pm. All seems fine now.

3) 5/27-28: From Charles at MWT2_UC: There has been a sudden loss of connectivity to the MWT2_UC cluster; it looks like either a power or network disruption. We are investigating currently. Later: The network problem has been resolved and MWT2_UC is back online. eLog 13195.

4) 5/27-28: Transfer errors at AGLT2: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]. Issue resolved - from Bob: We have just doubled the RAM in our dCache head node, and modified the vm.swappiness parameter. Response and load look good right now. We will continue to monitor the system. It had been io bound (iostat maxed) since yesterday. eLog 13199.

5) 5/29: Transfer errors at MWT2_UC: DEST SURL: srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/atlasdatadisk/... FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries]. From Rob: The problem has been corrected and transfers can resume. ggus 58622 (closed), eLog 13252.

6) 5/30: From Michael at BNL: DDM reports some files being "unavailable" at BNL. We are investigating. Later: Investigations revealed that dCache took 2 pools offline because of conflicting information about available space. The pools are operational again and transfers out of these pools have resumed. eLog 13269.

7) 5/31: Pilot update from Paul: Two additional minor patches were released today related to user jobs and file stager. Using the optional switch --accessmode=filestager[/direct] now updates the copysetup field correctly. The pilot is now always setting the runAthena option --lfcHost (requested by Tadashi Maeno et al.). A left-over test code snippet in v 44c, released a few hours ago, forced file stager to be used, which caused problems for user jobs on sites not supporting file stager; this is corrected in v 44c2. Now running.

8) 5/31-6/1: Transfer errors at AGLT2: DEST SURL: srm://head01.aglt2.org:8443/srm/managerv2?SFN=/pnfs/aglt2.org/atlascalibdisk/... FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Mon May 31 19:24:36 EDT 2010 state Failed : Marking Space as Being Used failed =>ERROR: duplicate key value violates unique constraint "srmspacefile_pkey"]. Issue resolved - from Shawn: In trying to add some files to the srmspacefile table, an inconsistency was introduced such that dCache kept trying to use an existing record id to insert a new srmspacefile record. The srmspacemanagernextid table needed to be updated and dCache restarted to fix the problem. SRM transfers are again working at AGLT2 and this problem should be resolved. ggus 58671 (closed), eLog 13370.

9) 5/31-6/2: LHC outage. This period is available for site maintenance as needed.

10) 6/1: Maintenance outage at BNL - Major facility maintenance at the US ATLAS Tier 1 facility at BNL will result in all services hosted at BNL being unavailable for four hours. The maintenance involves multiple services at BNL. Services restored as of ~1:30 EST. eLog 13341.

11) 6/1: Maintenance outage at AGLT2 completed. Test jobs successful, site set back to on-line. eLog 13346.

12) 6/2: Maintenance outage at MWT2_UC - from Aaron: We are taking a downtime June 2nd in order to update our networking infrastructure as well as upgrade to a new kernel on our worker nodes. We will be down from 9AM - 5PM CST, and will send an announcement when we're back on-line. (In progress.)

Follow-ups from earlier reports:

(i) 4/11: Failed jobs at AGLT2 with errors like: 11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist. Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera. Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"... are there follow-on attempts or is this site-db configured? Paul added to the thread in case there is an issue on the pilot side. ggus 57186, RT 15953, eLog 11406. In progress.
Update, 4/16: Still see this error at a low level, intermittently. For example, ~80 failed jobs on this date. More discussion posted in the ggus ticket (#57186).
Update, 5/4: Additional information posted in the ggus ticket. Also, see comments from Paul.
Update, 5/10: Additional information posted in the ggus ticket.
Update, 5/17: Additional information posted in the ggus ticket.
Update, 5/21: Additional information posted in the ggus ticket.
Update, 5/26: A combination of recent pilot changes + updated WN client s/w at AGLT2 solved this problem. ggus ticket 57186 closed.

(ii) 4/23: OU sites were set off-line in advance of major upgrades -- from Horst: We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning. So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week? I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade. eLog 11813.
Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.

(iii) 5/23-26: WISC_DATADISK - file transfer errors. From site admin: We have some power problem. Now all data servers are not available. I have already submitted an OIM unscheduled downtime. Sorry for the problem. We will make the service available as soon as possible when the power problem is solved. Later: The problem was solved. We will have a scheduled downtime tomorrow evening in the university to upgrade the power. On 5/25: After the power upgrade in the whole CS room, some of our servers failed to get an IP address. Now we are working on it. ggus 58444 (in progress), eLog 13110,13.
Update, 5/26 p.m. - From Wen at WISC: The network problem is fixed. The SRM service is available now.

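The FTS failure strings quoted in the reports above all follow the same shape: "Reason [<side> error during <phase> phase: [<category>] ...". As a rough illustration only, the sketch below pulls those three fields out of such a string so that failures could be tallied by site or category. The names FTS_REASON, FtsFailure and parse_fts_reason are hypothetical helpers invented here; they are not part of FTS, the pilot, or any ATLAS tool, and the pattern is inferred solely from the examples quoted in these minutes.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the FTS messages quoted in the reports (an assumption,
# not an official FTS format specification), e.g.
#   Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] ...]
FTS_REASON = re.compile(
    r"Reason \[(?P<side>SOURCE|DESTINATION|TRANSFER) error during "
    r"(?P<phase>[A-Z_]+) phase: \[(?P<category>[A-Z_]+)\]"
)


class FtsFailure(NamedTuple):
    """Hypothetical container for the three fields of an FTS failure reason."""
    side: str      # SOURCE, DESTINATION or TRANSFER
    phase: str     # e.g. TRANSFER_PREPARATION
    category: str  # e.g. CONNECTION_ERROR, NO_SPACE_LEFT


def parse_fts_reason(text: str) -> Optional[FtsFailure]:
    """Return the side/phase/category of the first FTS reason in text, or None."""
    match = FTS_REASON.search(text)
    if match is None:
        return None
    return FtsFailure(match["side"], match["phase"], match["category"])


if __name__ == "__main__":
    sample = (
        "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during "
        "TRANSFER_PREPARATION phase: [LOCALITY] Source file .... "
        "locality is UNAVAILABLE]"
    )
    print(parse_fts_reason(sample))
```

Messages that do not match the assumed shape (for example the pilot warning in follow-up (i)) simply return None rather than raising.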
lcg-cp hangs, for up to 8 hours.