Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=111758

1) 10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS production voms proxy. Hiro set the site off-line in DDM until the problem is resolved. ggus 63319 in-progress, eLog 18494.
2) 10/21: HU_ATLAS_Tier2 - job failures due to missing/incomplete ATLAS release 16.0.2. Missing s/w installed, issue resolved. https://savannah.cern.ch/bugs/index.php?74275 closed, eLog 18551.
3) 10/21 - 22: SLAC maintenance outage - completed, back on-line as of ~4:45 p.m. CST Friday. ggus tickets 63369 & 63372 were opened during this period, both subsequently closed; eLog 18550. From Wei: It took much longer than we expect. WT2 is now back online to the status before the outage (with one failed disk). But at least we produced more error/warning logs in the effect to satisfy Oracle's disk warranty requirement.
4) 10/23 - 10/25: SWT2_CPB went off-line on Saturday due to a problem with the building generator-backed power feed to the cluster UPS. Power was restored, but it was decided to use this outage to make a planned change to the xrootd system. Back on-line as of 11:00 p.m. on Monday. eLog 18640.
5) 10/24: MWT2_DATADISK - file transfer errors with "source file doesn't exist." Issue understood - from Wensheng: This is a kind of race condition that happened. The dataset replica removal at MWT2_DATADISK was triggered for space purpose. There are multiple replicas available elsewhere. Savannah 74358 closed, eLog 18618.
6) 10/27: HU_ATLAS_Tier2 - job failures with lsm errors: "27 Oct 04:18:14|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed: time out after 5400 seconds." ggus 63486 in-progress, eLog 18670.
7) 10/27 early a.m.: RT # 18441 was generated for SWT2_CPB_SE due to one or more RSV tests failing for a short period of time. Issue understood - from Patrick: The addition of new storage to the SE required a restart of the SRM. This seems to have occurred during the RSV tests, as later tests are passing. Ticket closed.
8) 10/27: Job failures at several U.S. sites due to missing atlas s/w release 16.0.2. Issue understood - from Xin: SIT released a new version of the pacball for release 16.0.2, so I had to delete the existing 16.0.2 and re-install them. So far the base kit 16.0.2 has been re-installed, and 16.0.2.2 cache is also available at most sites, I just start the re-installation of 16.0.2.1 cache, which should be done in a couple of hours. ggus 63503 in-progress (but can probably be closed at this point), eLog 18678.

Follow-ups from earlier reports:

(i) Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line. Update 10/14: Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS. Update 10/20: Still trying to understand the brokerage problem.
(ii) 9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364." ggus 62428 / RT 18249 in-progress, eLog 17474. Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients. Update from Patrick, 10/21: This issue has been resolved. The POOLFileCatalog.xml file is now being generated correctly for the cluster and we have configured a squid instance to support Frontier access, when needed. ggus & RT tickets closed.
(iii) 9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error." ggus 62642 in-progress, eLog 17662. Update: as of 10/20 issue resolved, and the ggus ticket was closed.
(iv) 10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing. ggus 62750 in-progress, eLog 17803. Update 10/22: ggus ticket closed by Doug B. eLog 18539.

Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=112416

1) 10/27: WISC_DATADISK - possibly a missing file. ggus 63526 in-progress, eLog 18698.
2) 10/27: NET2_USERDISK transfer errors - " [SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]." From Saul: This doesn't appear to be a site issue (the files are indeed no longer listed for our site in DQ2, in our LFC, or on disk), but rather some sort of race condition between scheduling to use us as a source and deletion. ggus 63533 closed, eLog 18704.
3) 10/27: SMU_LOCALGROUPDISK - file transfer errors due to an expired host cert. New cert installed on 10/29, site un-blacklisted on 10/31. ggus 63535 closed, eLog 18810.
4) 10/27: SLACXRD - job failures with the error "FID "8E91164C-1E3C-DB11-8CAB-00132046AB63" is not existing in the catalog." Xin fixed the PFC - issue resolved. https://savannah.cern.ch/bugs/?74553, eLog 18760.
5) 10/28: UPENN - file transfer errors due to an expired host cert. New cert installed on 10/29, but transfer errors continued. Hiro helped the site to debug the problem. Issue seems to be resolved as of 11/2, so ggus 63574 closed, eLog 18951.
6) 10/28: Job failures at OU_OCHEP & OSCER with an error like "pilot: Get error: No such file or directory." Issue was an incorrect entry in schedconfig (seprodpath = storage/data/atlasproddisk). Updated, issue resolved.
7) 10/29: From Bob at AGLT2: 3 short, closely spaced power hits took down 7 WN, and the jobs that were running on them at the time. Perhaps 80 jobs were lost. WN are back up now.
8) 10/29: Disk failure problem at SLAC - from Wei: We have many disk failures in a storage box. I am shutting down everything to minimum data loss. Data from the affected storage was moved elsewhere - issue resolved.
9) 10/30: BNL-OSG2_DATADISK - file transfer errors due to timeouts. Issue resolved - from Michael: The load across pools was re-balanced. eLog 18794.
10) 10/30: AGLT2 - large number of failed jobs with "lost heartbeat" errors. From Tom at MSU: "Some cluster network work yesterday had more impact than foreseen resulting in removal of all running Atlas jobs." ggus 63629 / RT 18466 closed, eLog 18832.
11) 10/30: Job failures at OU_OSCER_ATLAS due to missing release 16.0.2. Alessandro fixed an issue with the s/w install system that was preventing a re-try after one or more earlier failed attempts. 16.0.2 now available at the site. ggus 63635 / RT 18467 closed, eLog 18812.
12) 10/30: ANL - file transfer tests failing with the error "failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]. Givin' up after 3 tries]." ggus 63633 in-progress, eLog 18807.
13) 10/30: New site SLACXRD_LMEM-lsf available, test jobs submitted. Initially an issue with getting pilots to run at the site - now resolved. Queue is currently set to 'brokeroff'.
14) 10/31: Job failures at SLACXRD with the error "Required CMTCONFIG (i686-slc5-gcc43-opt) incompatible with that of local system." From Xin: The installation at SLAC is corrupted, I am reinstalling there, will update the ticket after the re-install is done. ggus 63639 in-progress, eLog 18845.
15) 11/1: Maintenance outage at AGLT2. Queues back on-line as of ~10:00 p.m. EST.
16) 11/1: Power outage at BNL (switching back to utility power failed following completion of work on a primary electrical feed circuit). Issue resolved, all services restored as of ~11:00 p.m. EST. eLog 18896.
17) 11/1: OU_OCHEP_SWT2 file transfer errors. Issue understood - from Horst: SRM errors were caused by our Lustre servers crashing and rebooting. DDN fixed the problem, and they are investigating what happened. ggus 63644 & 63662 / RT 18473 & 18480 closed, eLog 18851.
18) 11/1: NET2 job failures understood - from John & Saul: Just to let you know that we're getting some LSM errors at our Harvard sites due to an overloaded gatekeeper at BU. We've taken some steps which should clear this up, but we're expecting a batch of failed jobs in the next couple of hours. eLog 18875.
19) 11/1: HU_ATLAS_Tier2 - large number of job failures with the error "lsm-get failed: time out after 5400 seconds." Issue understood - from Saul: This problem is gone now. It was caused by a sudden bunch of production jobs with huge 2.6 GB log files. Paul Nilsson has submitted a ticket about that. We've also made networking adjustments so that these kind of jobs wouldn't actually fail in the future. ggus 63665 / RT 18545 closed, eLog 18891. https://savannah.cern.ch/bugs/index.php?74720.
20) 11/2: AGLT2 - all jobs failing with errors indicating a possible file system problem. From Bob: We have determined that the problem is a corrupted NFS file system hosting OSG/DATA and OSG/APP. That is the bad news. The good news is that this is a copy to a new host from yesterday, so the original will be used to re-create it. ggus 63684 in-progress, eLog 18913. Queues set off-line.
21) 11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites. Alessandro checked the release installation, and this doesn't appear to be the issue. May need to run a job "by-hand" to get more detailed debugging information. In-progress.

Follow-ups from earlier reports:

(i) Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line. Update 10/14: Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS. Update 10/20: Still trying to understand the brokerage problem. Update 10/27: The field 'CMTCONFIG' in schedconfig for OSCER had an old value; now corrected, so jobs are getting brokered to the site.
(ii) 10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS production voms proxy. Hiro set the site off-line in DDM until the problem is resolved. ggus 63319 in-progress, eLog 18494.
(iii) 10/27: HU_ATLAS_Tier2 - job failures with lsm errors: "27 Oct 04:18:14|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed: time out after 5400 seconds." ggus 63486 in-progress, eLog 18670. Update 10/28, from Saul: We had a networking problem last night between about midnight and 6 a.m. EST. We don't yet understand exactly what happened, but during this period, our network throughput went way down and about 500 jobs failed due to LSM timeout. ggus 63486 closed.
(iv) 10/27: Job failures at several U.S. sites due to missing atlas s/w release 16.0.2. Issue understood - from Xin: SIT released a new version of the pacball for release 16.0.2, so I had to delete the existing 16.0.2 and re-install them. So far the base kit 16.0.2 has been re-installed, and 16.0.2.2 cache is also available at most sites, I just start the re-installation of 16.0.2.1 cache, which should be done in a couple of hours. ggus 63503 in-progress (but can probably be closed at this point), eLog 18678. Update 11/2: No additional errors - ggus ticket closed.
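
Two of the incidents above (SMU_LOCALGROUPDISK and UPENN) were caused by expired host certs. As a minimal sketch only, and not a tool any of the sites is known to use, the check below shows how a site could flag a cert that is close to expiry. It assumes the expiry time is available in the "notAfter" text form printed by `openssl x509 -enddate -noout`; the function names, the 14-day warning window, and the dates in the usage note are hypothetical.

```python
import ssl
import time

SECONDS_PER_DAY = 86400.0


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate's notAfter time.

    not_after is a string such as "Oct 29 00:00:00 2010 GMT".
    A negative result means the certificate has already expired.
    """
    # ssl.cert_time_to_seconds converts the notAfter string to epoch seconds.
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expiry - now) / SECONDS_PER_DAY


def needs_renewal(not_after, warn_days=14, now=None):
    """True if the certificate expires within warn_days (hypothetical window)."""
    return days_until_expiry(not_after, now) < warn_days
```

For example, with made-up dates, `days_until_expiry("Oct 29 00:00:00 2010 GMT", now=ssl.cert_time_to_seconds("Oct 27 00:00:00 2010 GMT"))` gives two days remaining, so `needs_renewal` would report that the cert should be replaced before transfers start failing.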

USATLAS Throughput Meeting - October 26, 2010
=============================================

Attending: Shawn, Dave, Andy, Philippe, Sarah, Karthik, Hiro, John, Horst, Tom, Doug
Excused: Jason

1) No updates for retesting OU - BNL path. Karthik reported that as of the USATLAS facility meeting there was still poor throughput. John will re-run BNL tests to various ESnet locations. Dave reported not much progress. The perfSONAR box was moved and then broke. Need to get back to it. Karthik reports on tests during the call: OU->Kansas, OU->ESnet(BNL) gets 3 Mbps. Unable to run reverse direction. Fixed reverse direction (problem in config at OU). (More details later in notes.)
2) perfSONAR status. BNL and OU have CDs burned. Plan to install/upgrade soon. Illinois has had success using the LiveCD. An attempt to upgrade using the net-install option was not completely successful. The perfSONAR MA won't start in this case. Followed Jason's instructions (twice)...need to work with Jason on debugging. MSU updated both nodes to v3.2 using the LiveCD method; the latency node is OK, but the bandwidth node has a service that is not running. Philippe has email out about this problem.
3) Monitoring - Nagios monitoring discussed. Tom gave an overview of the current situation and is willing to work with our group in defining further monitoring capabilities for the dashboard. Dashboard for perfSONAR seems very useful and should meet our needs for monitoring perfSONAR instances. Currently have SLAC, BU and OU instances down. Discussed possible extensions for Tom's Nagios dashboard. Tom can also add additional email notifications. If sites want to add additional responsible perfSONAR people they can send Tom the address(es). Hiro is working on gathering the perfSONAR data and augmenting it with additional tracking of the traceroute (forward and reverse) between sites.

Further testing info. John gets ~1 Gbps BNL-Chicago while only 16-200 Mbps BNL-Kansas City. Traceroute shows the path to OU includes both Chicago and Kansas City. Karthik's tests from bnl-pt1.es.net to Kansas City got 4.5 Gbps and 4.2 Gbps Kansas to bnl-pt1.es.net. John's subsequent tests show BNL-Kansas City close to 1 Gbps. Could be real congestion is complicating the testing. Situation seems to be that there is a problem between OU and Kansas City but this could also be real traffic congesting the links. Needs further work.

Hiro mentioned Tier-2 to Tier-2 tests (worldwide) are underway. Important to have network data to help support this work longer term.

Doug mentioned alerting ATLAS sites to the DYNES process and the need to make sure we have a large number of ATLAS institutions participating. Note: deadline for DYNES site submissions is the end of November! USATLAS related sites should be strongly encouraged to participate. See http://www.internet2.edu/ion/dynes.html for more information (and pass the word).

We plan to meet again in 2 weeks at our regular time. Please send along corrections or additions to the list.

Thanks,
Shawn

From: Tanya Levshina
Date: November 2, 2010 1:30:49 PM CDT
To: Wei Yang, Marco Mambelli, Alain Roy, Doug Benjamin, Charles G Waldman, Tim Cartwright
Cc: Doug Benjamin, Rob Gardner, Brian Bockelman, Rik Yoshida, Fabrizio Furano, Wilko Kroeger, OSG-T3-LIASON@OPENSCIENCEGRID.ORG
Subject: OSG-ATLAS-Xrootd meeting - November 2nd @11 am CST - minutes

Attendees: Alain, Tim, Marco, Andy, Charles, Wei and Tanya

Agenda:

1. Meeting time. Tanya will set up a doodle poll to decide on the time for the next meeting (ATLAS week at CERN during first week of December).

2. VDT progress with RPMS (Tim Cartwright)
a. xrootd rpm will be released within a couple of weeks
b. yum repository is set up in VDT
c. configuration will come later in a separate package ~ in a month
d. VDT will work on related packages xrootd-gridftp, xrootdfs, bestman after this is done. Xrootd has a higher priority than these other packages.
Tanya: Are these priorities coming from Atlas Tier-3?
Marco: I've talked to Doug B. and it looks like the xrootd rpm will be used outside US as well, so xrootd rpm release has higher priority than other components. Also, it is ok if rpms have to be installed as "root" but an authorized user should be able to change configuration.
Andy: a non-privileged authorized user should have access to configuration files and be able to start/stop services.
Alain: this could be done with sudo.

3. The xrootd release with all the patches provided by Brian should come within the next week. New xrootdFS will be included in the repository and can be built simultaneously with xrootd.
Tim: Please let me know when the new release is ready.
Andy: you should subscribe to xrootd-l@slac.stanford.edu to get notification.
Tanya: Should we change anything in VDT configuration for this release?
Andy: It is backward compatible and will remain so for the next two(?) years but you can drop a couple of unnecessary directives if you want.
Tanya: Should we worry about adding in configuration changes for demonstrator projects? Do Atlas Tier-3 sites need them right away?
Andy: I don't think that the regular Tier-3 site will use it now and the sites that are working with demonstrator projects know how to change the configuration. You should talk to Doug and Rik to understand their requirements.
Tanya: Does this new xrootdFS have fixes that take care of file deletion by unauthorized users?
Wei: Yes, it is fixed but xrootdFS now requires creation of the keys for xrootdfs and distribution of them to all data servers.
Tanya: We need to understand this better. Could we talk about it in detail?

Please feel free to add/modify my notes.

Thanks,
Tanya