Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-9_29-10_5.html

[ ESD-ESD reprocessing exercise is done. Merging jobs completed on 10/2. Proposed dates for a postmortem: 10/14 or 10/15. ]

1) 9/30 p.m.: Power outage at SLAC ended. This was followed by a test of some recently installed RHEL 5 nodes on 10/2-10/3. Jobs finished successfully.
2) 10/1 a.m. -- from Saul: We had an air conditioning incident this morning at BU and had to turn off some of the blades while the room cooled down. A corresponding bunch of failed panda jobs will follow. Site services were not interrupted.
3) 10/2: From Tomasz, updates to nagios URLs:
   New nagios page: https://nagios.racf.bnl.gov/
   New dashboard locations:
   https://nagios.racf.bnl.gov/nagios/sla_array.html (BNL)
   https://nagios.racf.bnl.gov/nagios/tier2.html (Tier2 services)
4) 10/2: Issue with LFC db corruption at MWT2 resolved -- thanks Charles, Hiro.
5) 10/4: AGLT2, from Shawn:
   ==> The certificates updating at AGLT2 is not working because the AFS volume Certificates has some issues.
   ==> This issue should be resolved now. I fixed the AFS volume replication. There was a second problem on gate01.aglt2.org: the ‘rsync’ RPM was not installed. Once I restored it and re-ran the synchronize script things started working. This was done on gate01, gate02, head01 and head02.
6) 10/5: MWT2_IU -- several hundred job failures with the error "Get error: lsm-get failed." From Sarah:
   ==> One of our dCache pools is having memory issues. I've stopped our local job scheduler and dq2-site services until it is recovered.
   ==> The pool has recovered. I've restarted the local job scheduler & dq2-siteservices.
7) 10/6-10/7: Maintenance downtime at the MWT2 sites to apply security patches and work on SL5.3 migration. Test jobs submitted after the downtime completed successfully.
8) Various s/w upgrades announced for BNL:
   a) ATLAS dCache upgrade -- 13 Oct 2009 08h00 - 13 Oct 2009 17h00
   b) Deploy security patches and bug fixes in the OS and Oracle Cluster database underlying software -- 10/12/09 10:00 EDT - 10/12/09 13:00 EDT
   c) GUMS Database Reconfiguration -- Tuesday, October 13th, 2009 1100 - 1300 EST.
   d) HPSS software upgrade -- Tuesday Oct 6 - Thursday Oct 8

Follow-ups from earlier reports:
(i) 7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS. Significant progress, but still a few remaining issues.
(ii) SLC5 upgrades are ongoing at the sites during the month of September.
(iii) ATLAS User Analysis Test (UAT) scheduled for the second half of October.
Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-10_6-10_12.html

1) 10/6-10/8 -- Intermittent job failures at various US sites due to lack of tape access during the HPSS upgrade at BNL.
2) Problems with the panda servers at CERN following a move to new hosts. One machine was blocked by a firewall. Issues seem to be resolved now. eLog 6092, 6098.
3) 10/9: MWT2_IU -- issue with access to library file (libpopt.so.0) from some SL5 worker nodes -- they were taken offline to fix the problem. ggus 52293.
4) 10/9: Armen and Alden completed migration of user analysis areas to USERDISK at BNL.
5) 10/9: BU -- gatekeeper reboot -- resulted in ~50 "lsm-get failed" errors.
6) Over this past weekend (10/10 - 10/12) -- large number of failed jobs at BNL - issue was a misconfiguration in schedconfigdb -- resolved. See ggus 52281.
7) 10/12: BU -- ~500 failed jobs due to a GPFS partition filling up -- resolved. ggus 52283.
8) 10/12: AGLT2 -- Jobs failing due to lack of free space in AGLT2_PRODDISK -- resolved. ggus 52274.
9) 10/13: dCache upgrade at BNL -- some residual issues following re-start, but everything seems to be resolved now.
10) 10/13: UTA_SWT2 set 'offline' to investigate problems with the ibrix storage.
11) 10/13-14: SLAC outage for OSG upgrade -- initially some issues sending test jobs to the site, owing to stale entries on the BNL submit host -- cleaned up by Xin -- test jobs eventually succeeded, site set back to 'online'.
12) Large increase recently in the number of nagios alerts -- from Tomasz:
   Nagios seems to flip flop on gatekeeper tests. The problem started few days ago and we do not know the cause. It seems intermittent: I can run the bare test several times by hand and it works and then suddenly it fails. In addition to that we do see network interruptions which come and go. Those two problems may or may not be related. I will disable nagios e-mail alerts for gatekeeper tests in order to reduce noise.
   Later:
   Last few days nagios was going nuts about gatekeeper tests: the probes were flipping up and down continuously. We had some sort of connectivity problem: nagios could not reach various hosts. The connections would intermittently fail. To make matters harder to debug the connection failures appeared completely random. In the end I had to disable notifications from nagios gatekeeper probes until the underlying connectivity problem is resolved. It seems that by now we have a partial understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications. I still have one issue which I need to discuss with administrators of sites which run osg 1.2 - I will contact you off line.

Follow-ups from earlier reports:
(iii) ATLAS User Analysis Test (UAT) scheduled for October 21-23.
DDM/Throughput related issues.
1. The DDM dataset/throughput monitor service was moved to a different host (from my desktop). The link is unchanged.
2. The test of FTS 2.2 continues. The bug is still present and is being fixed by the developers.
3. The mass deletion of many old datasets (probably in MCDISK) by ADC is ongoing.
4. Next Tuesday, during the scheduled downtime for dCache maintenance, the BNL FTS and LFC as well as the PANDA mover will be shut down. Several operations will take place during that time:
   - Application of an ORACLE patch.
   - A change in the network routing between F5 and LFC.
   - Relocation of the PANDA mover hosts.
5. ANL_LOCALGROUPDISK has been added to the T3 throughput test.
USATLAS Throughput Meeting Notes – October 13, 2009

Attending: Shawn, David, Mike, Sarah, Doug, Jeff, Horst, Hiro
Excused: Karthik, Jason

The primary topic of discussion was last week’s perfSONAR installation/configuration for USATLAS. A survey during the call showed that OU’s instance has been working fine since configuration. MWT2_IU had some issues with services stopping, but a reconfiguration and reboot fixed it. AGLT2_UM had problems with the perfSONAR-BUOY services stopping as well as PingER stopping. The Wisconsin site is up but is not yet configured properly; Neng is looking into this. The AGLT2_UM issues are being debugged by the Internet2 developers. The AGLT2_MSU instances also seem to be running without an issue so far. We didn’t get reports from the other sites.

Jeff Boote mentioned syslog configuration, specifically on the AGLT2 boxes. UM needs to look at it to try for a more rational syslog configuration that also sends data to the central syslog host UM uses. Jeff also mentioned that if perfSONAR software changes are needed, another ISO could be produced. We will have to see what debugging the problems to date turns up.

Sarah provided our first perfSONAR measurement question for testing from IU to UTA. Sarah is seeing a lot of packet loss to UTA SWT2 (70/600) during the OWAMP testing. Even losing 1 OWAMP packet/600 could be significant, so this is really a large loss that needs to be tracked down.
The relevant traceroute is here (both directions):

[knoppix@Knoppix ~]$ traceroute netmon1.atlas-swt2.org
traceroute to netmon1.atlas-swt2.org (129.107.255.26), 30 hops max, 40 byte packets
 1 149.165.225.254 (149.165.225.254) 11.924 ms 0.301 ms 0.335 ms
 2 xe-0-2-0.2012.rtr.ictc.indiana.gigapop.net (149.165.254.249) 0.237 ms 0.265 ms 0.246 ms
 3 tge-0-1-0-0.2093.chic.layer3.nlr.net (149.165.254.226) 6.450 ms 5.447 ms 5.252 ms
 4 hous-chic-67.layer3.nlr.net (216.24.186.24) 31.837 ms 31.056 ms 30.961 ms
 5 hstn-hstn-nlr-ge-0-0-0-0-layer3.tx-learn.net (74.200.188.34) 30.759 ms 30.803 ms 30.726 ms
 6 dlls-hstn-nlr-ge-1-0-0-3002-layer3.tx-learn.net (74.200.188.38) 36.091 ms 36.169 ms 36.092 ms
 7 74.200.188.42 (74.200.188.42) 36.112 ms 36.139 ms 36.218 ms
 8 as16905_uta7206_m320_nlr.uta.edu (129.107.35.114) 37.548 ms 37.546 ms 37.494 ms
 9 netmon1.atlas-swt2.org (129.107.255.26) 37.664 ms 37.645 ms 37.669 ms

Reverse traceroute to my laptop:

Executing exec(traceroute, -m 30 -q 3 -f 3, 149.166.143.177, 140)
traceroute to 149.166.143.177 (149.166.143.177), 30 hops max, 140 byte packets
 3 74.200.188.41 (74.200.188.41) 1.885 ms 1.772 ms 1.807 ms
 4 hstn-dlls-nlr-ge-3-0-0-3002-layer3.tx-learn.net (74.200.188.37) 7.169 ms 7.150 ms 7.099 ms
 5 hstn-hstn-nlr-layer3.tx-learn.net (74.200.188.33) 7.650 ms 7.598 ms 7.536 ms
 6 chic-hous-67.layer3.nlr.net (216.24.186.25) 33.475 ms 33.489 ms 33.393 ms
 7 xe-1-2-0.2093.rtr.ictc.indiana.gigapop.net (149.165.254.225) 37.669 ms 37.555 ms 37.657 ms
 8 tge-1-2.9.br.ul.net.uits.iu.edu (149.165.254.230) 37.695 ms 37.733 ms 37.751 ms
 9 tge-1-4.912.cr.ictc.net.uits.iu.edu (149.166.5.6) 38.877 ms 38.922 ms 40.268 ms
10 149-166-143-177.dhcp-in.iupui.edu (149.166.143.177) 37.809 ms 37.905 ms 37.944 ms

Testing from Tier-2 to Tier-3 is enabled for Hiro’s tests (NET2 - Duke and MWT2_UC - Argonne), moving 7 files from a dataset. See Hiro’s update page at: https://www.usatlas.bnl.gov/dq2/throughput

The milestone of 1GB/sec for 1 hour was ALMOST completed from BNL to MWT2_UC. We need to redo this during the next week. Sites should contact Hiro to arrange a throughput test. We need to get 1GB/sec for one hour from BNL -> (set of one or more Tier-2s). Individual sites with 10GE should strive for 400MB/sec for > ½ hour.

IU noticed a slowdown via Hiro’s automated load-test starting between Sep 30 and October 1st 2009. Sarah is looking into what changed.

Future calls will regularly discuss perfSONAR measurement results once we start acquiring enough data from our testing configuration. Hiro will be contacting Jeff Boote (Internet2) to get information on the API for accessing perfSONAR measurement results for future integration into his plots.

Please send along any corrections or additions to these minutes via email to the list. We plan to meet again next week at the normal time.

Shawn
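For reference, the loss figure and the throughput targets quoted in the minutes above can be turned into plain numbers with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, and it assumes decimal units (1 GB = 1000 MB), which the minutes do not specify.

```python
def loss_percent(lost, sent):
    """Packet loss as a percentage of packets sent."""
    return 100.0 * lost / sent


def volume_gb(rate_mb_per_s, seconds):
    """Data moved at a sustained rate, in GB (assumes 1 GB = 1000 MB)."""
    return rate_mb_per_s * seconds / 1000.0


if __name__ == "__main__":
    # OWAMP loss reported from IU to UTA SWT2: 70 of 600 packets.
    print("reported loss: %.2f%%" % loss_percent(70, 600))
    # The "even 1 packet in 600" threshold mentioned in the notes.
    print("1-in-600 loss: %.2f%%" % loss_percent(1, 600))
    # Milestone: 1 GB/sec sustained for one hour from BNL.
    print("milestone volume: %.0f GB" % volume_gb(1000, 3600))
    # Per-site target for 10GE sites: 400 MB/sec for half an hour.
    print("10GE site target: %.0f GB" % volume_gb(400, 1800))
```

Under these assumptions the reported loss is a bit under 12%, and the BNL milestone corresponds to roughly 3.6 TB moved within the hour.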
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
-------------------------------------------------------------------------------------------------
 # | SITE             |   ICC |    LL |    UL | %Diff |       OSC |      LL |      UL |   %Diff |
-------------------------------------------------------------------------------------------------
ATLAS sites
 1 | AGLT2            | 5,150 | 4,677 | 4,677 |     9 |   645,022 | 542,000 | 542,000 |      15 |
 2 | ~ AGLT2_CE_2     |   165 |   136 |   136 |    17 |    10,999 |       0 |       0 |     100 |
 3 | ~ BNL_ATLAS_1    | 6,926 |     0 |     0 |   100 | 4,771,823 |       0 |       0 |     100 |
 4 | ~ BNL_ATLAS_2    | 6,926 |     0 |   500 |    92 | 4,771,823 |       0 |       0 |     100 |
 5 | ~ BU_ATLAS_Tier2 | 1,615 | 1,910 | 1,910 |   -18 |       511 | 400,000 | 400,000 | -78,177 |
 6 | ~ MWT2_IU        |   928 | 3,276 | 3,276 |  -252 |         0 | 179,000 | 179,000 |    -100 |
 7 | ~ MWT2_UC        |     0 | 3,276 | 3,276 |  -100 |         0 | 179,000 | 179,000 |    -100 |
 8 | ~ OU_OCHEP_SWT2  |   611 |   464 |   464 |    24 |    11,128 |  16,000 | 120,000 |     -43 |
 9 | ~ SWT2_CPB       | 1,389 | 1,383 | 1,383 |     0 |     5,953 | 235,000 | 235,000 |  -3,847 |
10 | ~ UTA_SWT2       |   493 |   493 |   493 |     0 |    13,752 |  15,000 |  15,000 |      -9 |
11 | ~ WT2            | 1,377 |   820 | 1,202 |    12 |         0 |       0 |       0 |       0 |
-------------------------------------------------------------------------------------------------
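
The report does not state how %Diff is computed. The sketch below is an inference from the table rows, not the report's documented method: it assumes the difference is taken relative to the calculated value, measured against whichever limit is violated, truncated toward zero, with a calculated value of 0 below the lower limit shown as -100. The function name pct_diff is hypothetical.

```python
def pct_diff(calculated, lower, upper):
    """Signed % difference from the violated limit (assumed convention).

    Positive: calculated value exceeds the upper limit.
    Negative: calculated value is below the lower limit.
    Zero:     calculated value lies within [lower, upper].
    """
    if calculated > upper:
        limit = upper
    elif calculated < lower:
        limit = lower
    else:
        return 0
    if calculated == 0:
        # Assumption: a zero calculated value below the limit is shown as -100.
        return -100
    # int() truncates toward zero, which matches most rows of the table.
    return int((calculated - limit) * 100 / calculated)


if __name__ == "__main__":
    # (site, calculated, LL, UL) taken from the ICC and OSC columns above.
    rows = [
        ("AGLT2 ICC", 5150, 4677, 4677),
        ("BNL_ATLAS_2 ICC", 6926, 0, 500),
        ("BU_ATLAS_Tier2 OSC", 511, 400000, 400000),
        ("SWT2_CPB OSC", 5953, 235000, 235000),
        ("MWT2_UC ICC", 0, 3276, 3276),
    ]
    for site, calc, ll, ul in rows:
        print(site, pct_diff(calc, ll, ul))
```

Under these assumptions the sketch reproduces most rows, for example 9 for AGLT2 ICC and -78,177 for BU_ATLAS_Tier2 OSC. It gives -253 rather than the listed -252 for MWT2_IU ICC, possibly because the displayed ICC value is rounded, so treat the formula as approximate.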