Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-10_6-10_12.html

1) 10/6-10/8 -- Intermittent job failures at various US sites due to lack of tape access during the HPSS upgrade at BNL.
2) Problems with the panda servers at CERN following a move to new hosts. One machine was blocked by a firewall. Issues seem to be resolved now. eLog 6092, 6098.
3) 10/9: MWT2_IU -- issue with access to a library file (libpopt.so.0) from some SL5 worker nodes -- they were taken offline to fix the problem. ggus 52293.
4) 10/9: Armen and Alden completed migration of user analysis areas to USERDISK at BNL.
5) 10/9: BU -- gatekeeper reboot -- resulted in ~50 "lsm-get failed" errors.
6) Over this past weekend (10/10 - 10/12) -- large number of failed jobs at BNL -- issue was a misconfiguration in schedconfigdb -- resolved. See ggus 52281.
7) 10/12: BU -- ~500 failed jobs due to a GPFS partition filling up -- resolved. ggus 52283.
8) 10/12: AGLT2 -- Jobs failing due to lack of free space in AGLT2_PRODDISK -- resolved. ggus 52274.
9) 10/13: dCache upgrade at BNL -- some residual issues following re-start, but everything seems to be resolved now.
10) 10/13: UTA_SWT2 set 'offline' to investigate problems with the ibrix storage.
11) 10/13-14: SLAC outage for OSG upgrade -- initially some issues sending test jobs to the site, owing to stale entries on the BNL submit host -- cleaned up by Xin -- test jobs eventually succeeded, site set back to 'online'.
12) Large increase recently in the number of nagios alerts -- from Tomasz:
    Nagios seems to flip-flop on gatekeeper tests. The problem started a few days ago and we do not know the cause. It seems intermittent: I can run the bare test several times by hand and it works, and then suddenly it fails. In addition to that we do see network interruptions which come and go. Those two problems may or may not be related. I will disable nagios e-mail alerts for gatekeeper tests in order to reduce noise.
    Later: For the last few days nagios was going nuts about gatekeeper tests: the probes were flipping up and down continuously. We had some sort of connectivity problem: nagios could not reach various hosts. The connections would intermittently fail. To make matters harder to debug, the connection failures appeared completely random. In the end I had to disable notifications from nagios gatekeeper probes until the underlying connectivity problem is resolved. It seems that by now we have a partial understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications. I still have one issue which I need to discuss with administrators of sites which run osg 1.2 - I will contact you off line.

Follow-ups from earlier reports:
(iii) ATLAS User Analysis Test (UAT) scheduled for October 21-23.
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=71379

1) 10/15: From Tomasz, regarding recent issues with nagios: It seems that by now we have a partial understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications.
2) 10/15: From Wei, regarding Bestman / CA certs issue: The issue between Bestman and users with Canada's westgrid certificates (see my original email below) is addressed via a workaround provided by the LBL team. The workaround is to replace $VDT_LOCATION/bestman/lib/globus/cry*.jar in bestman 2.2.1.2.i5 or above by the .jars in bestman 2.2.1.2.i3.
3) 10/16: Jobs were failing at MWT2_UC with the error "Input file DBRelease-7.5.1.tar.gz not found." From Sarah: DBRelease file was transferred with __DQ2 extension, which is incompatible with panda/pilot/athena software. I've renamed the file and updated the lfc registration.
4) 10/17: Storage upgrade at SLAC completed -- no major interruptions.
5) 10/19: Job failures at MWT2_UC likely due to disk cleanup removing still-needed files -- from Charles: I think the production job failures at MWT2_UC may have been due to overzealous cleanup of MWT2_UC_PRODDISK triggered by our site almost running out of space over the weekend. ggus 52475.
6) 10/19: Tadashi modified PandaMover to delete redundant files. (Thanks Charles for the feedback.)
7) 10/20: Large number of failed jobs at BNL, with the message "Get error: Too many/too large input files." Ongoing discussions about how to deal with this issue (i.e., in the pilot, split the jobs, etc.)
8) 10/21: Jobs were failing at all U.S. sites because they attempted to access the Oracle db at BNL. Affected tasks were aborted. (Number of failed jobs >25k.)

Follow-ups from earlier reports:
(i) ATLAS User Analysis Test (UAT) re-scheduled for October 28-30.
USATLAS Throughput Meeting Notes – October 13, 2009
---------------------------------------------------
Attending: Shawn, David, Mike, Sarah, Doug, Jeff, Horst, Hiro
Excused: Karthik, Jason

Primary topic of discussion was last week’s perfSONAR installation/configuration for USATLAS. A survey during the call showed that OU’s instance has been working fine since configuration. MWT2_IU had some issues with services stopping, but a reconfiguration and reboot fixed it. AGLT2_UM had problems with the perfSONAR-BUOY services stopping, as well as PingER stopping. The Wisconsin site is up but is not yet configured properly; Neng is looking into this. The AGLT2_UM issues are being debugged by the Internet2 developers. The AGLT2_MSU instances also seem to be running without an issue so far. Didn’t get reports from the other sites.

Jeff Boote mentioned syslog configuration, specifically on the AGLT2 boxes. UM needs to look at it to try for a more rational syslog configuration that also sends data to the central syslog host UM uses. Jeff also mentioned that if perfSONAR software changes are needed, another ISO could be produced. We will have to see what debugging the problems to date turns up.

Sarah provided our first perfSONAR measurement question for testing from IU to UTA. Sarah is seeing a lot of packet loss to UTA SWT2 (70/600) during the OWAMP testing. Even losing 1 OWAMP packet/600 could be significant, so this is really a large loss that needs to be tracked down.
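To put the OWAMP numbers above in perspective, here is a minimal sketch that converts the quoted packet counts into loss percentages (roughly 11.7% for 70/600, against about 0.17% for a single lost packet). The function name is illustrative only and is not part of any perfSONAR tool.

```python
def loss_percent(lost: int, sent: int) -> float:
    """Return packet loss as a percentage of packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * lost / sent


# Counts quoted in the notes: 70 of 600 OWAMP packets lost from IU to UTA SWT2.
observed = loss_percent(70, 600)
# For comparison, a single lost packet in the same sample size.
single = loss_percent(1, 600)

print(f"observed loss:     {observed:.1f}%")
print(f"one packet in 600: {single:.2f}%")
```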
The relevant traceroute is here (both directions):

[knoppix@Knoppix ~]$ traceroute netmon1.atlas-swt2.org
traceroute to netmon1.atlas-swt2.org (129.107.255.26), 30 hops max, 40 byte packets
 1 149.165.225.254 (149.165.225.254) 11.924 ms 0.301 ms 0.335 ms
 2 xe-0-2-0.2012.rtr.ictc.indiana.gigapop.net (149.165.254.249) 0.237 ms 0.265 ms 0.246 ms
 3 tge-0-1-0-0.2093.chic.layer3.nlr.net (149.165.254.226) 6.450 ms 5.447 ms 5.252 ms
 4 hous-chic-67.layer3.nlr.net (216.24.186.24) 31.837 ms 31.056 ms 30.961 ms
 5 hstn-hstn-nlr-ge-0-0-0-0-layer3.tx-learn.net (74.200.188.34) 30.759 ms 30.803 ms 30.726 ms
 6 dlls-hstn-nlr-ge-1-0-0-3002-layer3.tx-learn.net (74.200.188.38) 36.091 ms 36.169 ms 36.092 ms
 7 74.200.188.42 (74.200.188.42) 36.112 ms 36.139 ms 36.218 ms
 8 as16905_uta7206_m320_nlr.uta.edu (129.107.35.114) 37.548 ms 37.546 ms 37.494 ms
 9 netmon1.atlas-swt2.org (129.107.255.26) 37.664 ms 37.645 ms 37.669 ms

Reverse traceroute to my laptop:

Executing exec(traceroute, -m 30 -q 3 -f 3, 149.166.143.177, 140)
traceroute to 149.166.143.177 (149.166.143.177), 30 hops max, 140 byte packets
 3 74.200.188.41 (74.200.188.41) 1.885 ms 1.772 ms 1.807 ms
 4 hstn-dlls-nlr-ge-3-0-0-3002-layer3.tx-learn.net (74.200.188.37) 7.169 ms 7.150 ms 7.099 ms
 5 hstn-hstn-nlr-layer3.tx-learn.net (74.200.188.33) 7.650 ms 7.598 ms 7.536 ms
 6 chic-hous-67.layer3.nlr.net (216.24.186.25) 33.475 ms 33.489 ms 33.393 ms
 7 xe-1-2-0.2093.rtr.ictc.indiana.gigapop.net (149.165.254.225) 37.669 ms 37.555 ms 37.657 ms
 8 tge-1-2.9.br.ul.net.uits.iu.edu (149.165.254.230) 37.695 ms 37.733 ms 37.751 ms
 9 tge-1-4.912.cr.ictc.net.uits.iu.edu (149.166.5.6) 38.877 ms 38.922 ms 40.268 ms
10 149-166-143-177.dhcp-in.iupui.edu (149.166.143.177) 37.809 ms 37.905 ms 37.944 ms

Testing from Tier-2 to Tier-3 is enabled in Hiro’s testing (NET2 - Duke and MWT2_UC - Argonne). Moving 7 files from dataset. See Hiro’s update page at: https://www.usatlas.bnl.gov/dq2/throughput

Milestone for 1GB/sec for 1 hour was ALMOST completed from BNL to MWT2_UC.
Need to redo this during the next week. Sites should contact Hiro to arrange a throughput test. Need to get 1GB/sec for one hour from BNL -> (set of one or more Tier-2s). Individual sites with 10GE should strive for 400MB/sec for > ½ hour.

IU notices a slowdown via Hiro’s automated load-test starting between Sep 30 and October 1st 2009. Sarah is looking into what changed.

Future calls will regularly discuss perfSONAR measurement results once we start acquiring enough data from our testing configuration. Hiro will be contacting Jeff Boote (Internet2) to get information on the API for accessing perfSONAR measurement results for future integration into his plots.

Please send along any corrections or additions to these minutes via email to the list. We plan to meet again next week at the normal time.

Shawn
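As a rough illustration of what the throughput targets above imply in data volume, the sketch below multiplies rate by duration. It assumes decimal units (1 GB = 1000 MB), which the notes do not specify, and the helper name is hypothetical.

```python
MB_PER_GB = 1000  # assumption: decimal units; the notes do not say which convention applies


def volume_gb(rate_mb_per_s: float, seconds: float) -> float:
    """Return the data volume in GB moved at a steady rate for a given duration."""
    return rate_mb_per_s * seconds / MB_PER_GB


# Milestone: 1 GB/sec sustained for one hour from BNL to a set of Tier-2s.
milestone_gb = volume_gb(1000, 3600)
# Per-site goal for 10GE sites: 400 MB/sec for at least half an hour.
site_gb = volume_gb(400, 1800)

print(f"milestone volume: {milestone_gb:.0f} GB")
print(f"per-site volume:  {site_gb:.0f} GB")
```

Under that assumption the milestone corresponds to 3600 GB (3.6 TB) of transferred data, and the per-site goal to at least 720 GB.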
Meeting Notes from USATLAS Throughput Call
------------------------------------------
Attending: Shawn, Dave, Jason, Sarah, John, Horst, Karthik, Hiro, Doug
Discussion about perfSONAR status.
AGLT2 (rc4 version?), BNL, MSU, MWT2_IU (previous version) all lost configurations (apparent disk-problems, read-only disk, rebooting loses the config and data on disk). Possible issue for the KOI hardware? Needs to be debugged.
Issues with perfSONAR-BUOY Regular Testing (Throughput and/or One-Way Latency) changing from "Running" to "Not Running". Happening at AGLT2, BNL, MWT2, OU.
Dave reported running perfSONAR on a box for a few weeks (non-KOI; Intel box) and has had no problems. However, tests have only been running for a few days.
Internet2 developers will be looking into these two problems; they may be the same problem (I/O errors may cause services to stop and/or data to be lost). They may request files and/or access to problematic perfSONAR instances.
Next discussion was about using perfSONAR results to find issues:
1) SWT2-UTA possibly an issue. OU, MWT2_IU and AGLT2 show low throughput. IU was showing lots of packet loss in latency tests. However, BNL->SWT2-UTA and MWT2_UC->SWT2-UTA seem OK. Needs looking into. Perhaps path differences may show where the problem lies.
2) AGLT2-BNL showing asymmetry again. AGLT2->BNL is good (>910Mbits/sec) but BNL->AGLT2 is poor (40-70 Mbits/sec) since October 15th. Need to investigate.
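The asymmetry in item 2 can be quantified with a short sketch like the one below. The 2x threshold used to flag a path is an arbitrary assumption for illustration, and the function names are hypothetical; they are not taken from perfSONAR.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert Mbits/sec to MBytes/sec (8 bits per byte)."""
    return mbit_per_s / 8


def asymmetry_ratio(forward: float, reverse: float) -> float:
    """Return the ratio of the faster direction to the slower one."""
    slow, fast = sorted((forward, reverse))
    if slow <= 0:
        raise ValueError("throughput values must be positive")
    return fast / slow


ASYMMETRY_THRESHOLD = 2.0  # assumption: flag paths that differ by more than 2x

# Figures quoted in the notes: AGLT2->BNL above 910 Mbits/sec, BNL->AGLT2 at 40-70 Mbits/sec.
best_case = asymmetry_ratio(910, 70)
worst_case = asymmetry_ratio(910, 40)

print(f"asymmetry between {best_case:.1f}x and {worst_case:.1f}x")
print(f"flagged: {best_case > ASYMMETRY_THRESHOLD}")
print(f"910 Mbits/sec is about {mbit_to_mbyte(910):.1f} MB/sec")
```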
Discussion about Hiro's new testing. Tier-2 to Tier-2 testing is now in place. Only 10 files are moved in tests, and only DATADISK is used. Select a Tier-2 site (_DATADISK version) from http://www.usatlas.bnl.gov/dq2/throughput and you can see results (scroll down for graphs).
Slowdown to IU Tier-2 at the end of September: reason was found. The gridftp2 setting on the FTS channel was lost, preventing door-to-door transfers. Hiro fixed it and throughput was restored.
Discussion about LHC network (LHCOPN CERN<->BNL). BNL/John Bigrow wants to test the new LHCOPN link to BNL for load-balancing. Need to arrange a high-throughput test from CERN <-> BNL (could use iperf). Will work with Hiro/Eduardo/Simone on this sometime next week.
Milestone: BNL->(set of Tier-2s) at 1GB/sec for 1 hour. Test this Friday (October 23 at either 10 or 11 AM). Need to get feedback from others on which time is best.
We will NOT be meeting next week unless someone wants to chair the meeting in my absence (I will be on a plane to Shanghai). Next meeting TBD.
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
-----------------------------------------------------------------------------------------------------------------------------
# | SITE | ICC | LL | UL | %Diff | OSC | LL | UL | %Diff |
-----------------------------------------------------------------------------------------------------------------------------
ATLAS sites
1 | AGLT2 | 5,150 | 4,677 | 4,677 | 9 | 645,022 | 542,000 | 542,000 | 15 |
2 | ~ AGLT2_CE_2 | 165 | 136 | 136 | 17 | 10,999 | 0 | 0 | 100 |
3 | ~ BNL_ATLAS_1 | 6,926 | 0 | 0 | 100 | 4,771,823 | 0 | 0 | 100 |
4 | ~ BNL_ATLAS_2 | 6,926 | 0 | 500 | 92 | 4,771,823 | 0 | 0 | 100 |
5 | ~ BU_ATLAS_Tier2 | 1,615 | 1,910 | 1,910 | -18 | 511 | 400,000 | 400,000 | -78,177 |
6 | ~ MWT2_IU | 928 | 3,276 | 3,276 | -252 | 0 | 179,000 | 179,000 | -100 |
7 | ~ MWT2_UC | 0 | 3,276 | 3,276 | -100 | 0 | 179,000 | 179,000 | -100 |
8 | ~ OU_OCHEP_SWT2 | 611 | 464 | 464 | 24 | 11,128 | 16,000 | 120,000 | -43 |
9 | ~ SWT2_CPB | 1,389 | 1,383 | 1,383 | 0 | 5,953 | 235,000 | 235,000 | -3,847 |
10 | ~ UTA_SWT2 | 493 | 493 | 493 | 0 | 13,752 | 15,000 | 15,000 | -9 |
11 | ~ WT2 | 1,377 | 820 | 1,202 | 12 | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------------------------------------------------------------------
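The report does not state how %Diff is computed. The sketch below is an inference from the table rows, not the report's actual script: it assumes the difference is taken against the violated limit, divided by the calculated value, and truncated toward zero, with -100 reported when the calculated value is zero but a lower limit exists. The function name is hypothetical.

```python
def pct_diff(calculated: int, lower: int, upper: int) -> int:
    """Percent difference of a calculated capacity from its LL/UL bounds.

    Positive when the value exceeds the upper limit, negative when it falls
    below the lower limit, zero when it lies within the limits.
    """
    if calculated > upper:
        return int(100 * (calculated - upper) / calculated)
    if calculated < lower:
        if calculated == 0:
            return -100  # assumed special case, matching the MWT2_UC row
        return int(100 * (calculated - lower) / calculated)
    return 0


# Rows taken from the table above: (site/column, calculated, LL, UL).
rows = [
    ("AGLT2 ICC", 5150, 4677, 4677),
    ("AGLT2 OSC", 645022, 542000, 542000),
    ("BU_ATLAS_Tier2 OSC", 511, 400000, 400000),
    ("SWT2_CPB OSC", 5953, 235000, 235000),
    ("WT2 ICC", 1377, 820, 1202),
]
for name, calc, ll, ul in rows:
    print(f"{name}: {pct_diff(calc, ll, ul)}")
```

Under these assumptions the rule reproduces most rows, including the large negative OSC values for BU_ATLAS_Tier2 and SWT2_CPB. The MWT2_IU ICC row comes out as -253 rather than the listed -252, so the real calculation presumably rounds differently or uses unrounded inputs.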