Yuri's summary from the weekly ADCoS meeting:
www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-(Nov24-30-2009).html
1) 11/25: UTD-HEP -- Site set back to 'online' after test jobs finished successfully.
2) BNL: cyber-security port scans, originally scheduled for December 2/3, have been rescheduled for December 21/22.
3) 11/27: ~600 failed jobs at MWT2 with the error "SFN not set in LFC for guid..." Resolved -- from Sarah: The problem is that we ran proddisk_cleanse.py at our site to clean up proddisk, but some of the cleaned files were needed by activated jobs. The files have been re-transferred now. It's possible we'll have a few more errors, but it should be mostly clear now. ggus 53690.
4) 11/27 - 11/29: Transfer errors between SWT2_CPB and FZK -- problem resolved. ggus 53707, RT 14813, eLog 7566.
5) 11/30: New pilot version from Paul (41a):
   * Further modifications for SLC5/gcc43/i686-slc5-gcc43-opt jobs for installations with shared compiler and automatic selection of gcc43. AtlasLogin/AtlasSettings verification is now looking in $SITEROOT if the corresponding dirs exist there, if not, it will fall back to swdir (constructed from appdir/$VO_ATLAS_SW_DIR) as before. If it is not needed, the pilot will not set up the compiler as it did/does at BNL/CERN. Successfully tested at INFN-ROMA1 using release 15.3.1 + 15.3.1.4. Changes are compatible with current tests at BNL.
   * User analysis jobs are now setup with forceConfig where available, as well as explicitly setting non-empty AtlasVersion/AtlasProject.
   * File stager corrections including release check (file stager not compatible with release < 15.4.0). Direct access is now switched off if copysetup does not contain proper setup.
   * Protection against down site in user analysis trf download, including re-trials and fallback to optional download site.
   * HOTDISK correction. Replica randomization now properly handles hotdisk exception (should not be randomized).
   * Storage paths for mover log files now more detailed (fully date based, previous version only created monthly subdirs).
6) 11/30: UTD-HEP -- ~200 failed jobs due to missing db release v7.5.1 -- site LFC still had an entry, though no longer on disk. Wensheng removed the LFC entry, PandaMover re-staged the file, production resumed. ggus 53722, eLog 7596.
7) 12/1: Minor pilot update from Paul (41b):
   * Correction for install jobs. The previous pilot version had a problem with the internal handling of prodSourceLabel=software, now corrected.
   * After the getJob operation, the pilot now stores the dispatcher return code StatusCode in a file. Requested by Peter Love.
8) 12/1: Power outage in the CERN computing center affected a large number of systems related to panda / production. Very large number of "lost heartbeat" jobs were seen at most sites as a result.
9) 12/2: Some sites (for example AGLT2) were draining due to a pilot problem. The voms proxy on the submit host at BNL had a lifetime under 24 hours, causing pilots to fail with the error "Voms proxy certificate does not exist or is too short." From Xin: The BNL voms server only gives 24hours voms proxy. And 1 out of 4 times so far voms-proxy-init hits BNL one. I disabled the BNL voms server in the configure file .glite/vomses.
Follow-ups from earlier reports:
(iii) BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.
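The release check mentioned under item 5 (file stager not compatible with release < 15.4.0) can be illustrated with a small sketch. This is not the pilot's code, which these notes do not show; the function names and the zero-padding rule for short release strings are assumptions made for illustration. Only the 15.4.0 threshold and the example release 15.3.1.4 come from the notes.

```python
from typing import Tuple

# Threshold quoted in the notes: the file stager needs release 15.4.0 or later.
MIN_FILE_STAGER_RELEASE = (15, 4, 0)


def parse_release(release: str) -> Tuple[int, ...]:
    """Turn a dotted release string such as '15.3.1.4' into a tuple of ints.

    Short strings are padded with zeros to three components (an assumption),
    so that '15.4' compares equal to '15.4.0'.
    """
    parts = tuple(int(part) for part in release.split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def file_stager_allowed(release: str) -> bool:
    """Hypothetical helper: True if the release meets the quoted threshold."""
    return parse_release(release) >= MIN_FILE_STAGER_RELEASE
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "15.10.0" sorting before "15.4.0".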
Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-(Dec1-7-2009).html
1) 12/2-3: Another instance of a db release file in the LFC for UTD-HEP, but no longer on disk. Fixed by Wensheng (thanks!). RT 14843. (One more instance of this issue on 12/6 as well.)
2) 12/3: From Charles at UC: We had an apparent power interruption at UC last night at around 2AM CST. Expect some "lost heartbeat" errors from jobs that were running at the time.
3) 12/3: BNL: From Michael: Due to a configuration issue associated with the dccp client some jobs at BNL failed. The problem was resolved in the meanwhile. (~4k failed jobs.) eLog 7687.
4) 12/3: IU_OSG -- Jobs were failing with the error "Put error: lfc-mkdir failed: LFC_HOST iut2-grid5.iu.edu cannot create.... Could not secure the connection |Log put error: lfc-mkdir failed." From Aaron at MWT2: This has been resolved by a restart of proxies at IU_OSG. RT 14849.
5) 12/5-7: Power problems at AGLT2 -- from Bob: On Saturday night (~11:40pm EST) there was a power hit at Michigan State that took out a number of worker nodes. It also apparently took out a central air conditioner. On Sunday night (~11:20pm) that central air caught up with a major switch room at the MSU campus, and took down the network switch equipment for 2 hours, completely isolating more than half of our dcache disk servers from the systems that remained up at University of Michigan. Three of these did not restore properly when the network connectivity was re-established and were manually restarted early this morning, total down time for them about 8 hours. All jobs running at the time at MSU were lost. We had other issues this afternoon with network instability, that may have blown our running job load, but should now be back on track. All of these blown jobs should eventually show up with lost heart beat.
6) 12/7: SLAC -- ADCoS shifter reported t1-t2 transfer errors. ggus 53942. This issue was resolved by restarting the SRM service.
7) 12/7: BNL DQ2 site services s/w upgraded to the newest production version (Hiro).
8) 12/7: AGLT2_PRODDISK to BNL-OSG2_MCDISK transfer errors. From Shawn: We have two storage nodes with dCache service problems. I believe a simple restart should fix it. ggus 53915, eLog 7819.
9) 12/8: Power outage at BNL completed: The partial power outage at the RACF that affected a portion of the Linux Farm cluster on Tuesday, Dec. 8 is now over. All affected systems (ATLAS, BRAHMS, LSST, PHENIX and STAR) have been restored and are available to the Condor batch system again.
Follow-ups from earlier reports:
(i) BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.
(ii) BNL -- cyber-security port scans, originally scheduled for December 2/3, have been rescheduled for December 21/22.
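The recurring UTD-HEP problem in item 1 above (a file still registered in the LFC but no longer on disk) is a catalog-versus-storage mismatch. The sketch below shows the idea only: a plain dictionary mapping GUID to local path stands in for the catalog, and the function name is hypothetical. It is not the procedure used to fix the site, and a real check would query the LFC and the storage element rather than a local filesystem.

```python
import os
from typing import Dict, List


def find_stale_catalog_entries(catalog: Dict[str, str]) -> List[str]:
    """Return the GUIDs whose registered path no longer exists on disk.

    `catalog` is a stand-in for the real file catalog (an assumption):
    keys are GUIDs, values are the paths the catalog claims are present.
    """
    return sorted(
        guid for guid, path in catalog.items() if not os.path.exists(path)
    )
```

Entries returned by such a check are the ones that would need to be removed from the catalog so that the file can be staged again.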
Notes for USATLAS Throughput Meeting
====================================
Attending: Shawn, Andy, Dave, Karthik, Jeff
Excused: Horst
1) Dave mentioned a problem attributed to the time change. Karthik noted his throughput node was running OWAMP. Andy will be in touch with Karthik about node details to debug what may have happened in this case. Shawn noted the SNMP archive was running on the latency node at UM but was "Not Running" on the throughput node. Originally (after install) it was running on both. The details are unclear, but this does not seem to be a problem.
Discussion about perfSONAR measurements. Items:
a) UIUC measurements show an asymmetry: Outbound from UIUC is typically good (>900 Mbps) while inbound is usually much less (~500 Mbps UM, ~130 Mbps BU). Need path details. Path from UIUC to UM throughput node is:
Executing exec(traceroute, -m 30 -q 3 -f 3, 192.41.230.20, 140)
traceroute to 192.41.230.20 (192.41.230.20), 30 hops max, 140 byte packets
3 192.17.17.42 (192.17.17.42) 0.634 ms 0.331 ms 0.292 ms
4 * uiuc-vrfeo-dmzo-lnk.gw.uiuc.edu (192.17.17.1) 0.944 ms 0.585 ms
5 ur1rtr-uiuc.ex.ui-iccn.org (72.36.127.1) 0.537 ms 0.626 ms 0.463 ms
6 t-710rtr.ix.ui-iccn.org (72.36.126.110) 3.235 ms 3.320 ms 3.184 ms
7 nlr-710rtr.ex.ui-iccn.org (72.36.127.170) 4.918 ms 4.197 ms 3.958 ms
8 216.24.184.34 (216.24.184.34) 3.829 ms 3.716 ms 3.671 ms
9 192.84.86.230 (192.84.86.230) 9.429 ms 9.409 ms 9.412 ms
10 psum02.aglt2.org (192.41.230.20) 9.301 ms 9.372 ms 9.354 ms
Path in the other direction:
Executing exec(traceroute, -m 30 -q 3 -f 3, 192.17.18.41, 140)
traceroute to 192.17.18.41 (192.17.18.41), 30 hops max, 140 byte packets
3 ge-2-1-0.348.rtr.chic.net.internet2.edu (198.32.11.45) 181.511 ms * 186.440 ms
4 710rtr-internet2.ex.ui-iccn.org (72.36.127.157) 6.271 ms 6.273 ms 6.249 ms
5 t-ur2rtr.ix.ui-iccn.org (72.36.126.66) 9.044 ms 9.195 ms 9.201 ms
6 iccn-ur2rtr-uiuc2.gw.uiuc.edu (72.36.127.6) 9.236 ms 9.863 ms 10.244 ms
7 t-dmzo.gw.uiuc.edu (130.126.0.81) 21.031 ms 9.177 ms 9.109 ms
8 192.17.17.2 (192.17.17.2) 9.292 ms 9.147 ms 9.186 ms
9 192.17.17.41 (192.17.17.41) 9.227 ms 9.139 ms 9.053 ms
10 192.17.17.22 (192.17.17.22) 9.256 ms 9.208 ms 17.992 ms
11 osgx4.hep.uiuc.edu (192.17.18.41) 9.266 ms 9.329 ms 9.242 ms
b) Bandwidth tests to SWT2-UTA are very poor...typically 20-30 Mbits/sec. Tests to the Australian Tier-2 in Melbourne are 50-100 Mbits/sec for comparison. Need to determine what issues there may be along the SWT2-UTA path.
c) For next week we should have more sites join and review their ongoing tests. Some things to check: i) Find the best and worst paths (in One-Way Latency graphs) for packet loss (tests with least and most packet losses for a 24 hour period), ii) Find any tests with large asymmetries in throughput, iii) Find throughput tests which are unusually bad compared to expectations.
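Check ii) above (large asymmetries in throughput) can be sketched as a small script. This is an illustration only, not a perfSONAR tool: the function name, the data layout and the ratio threshold are assumptions. The sample rates are the approximate figures quoted in item a), with 900 Mbps used as a stand-in for ">900 Mbps".

```python
from typing import Dict, List, Tuple

# Approximate rates from item a), keyed by (source, destination), in Mbps.
SAMPLE_RATES_MBPS: Dict[Tuple[str, str], float] = {
    ("UIUC", "UM"): 900.0,
    ("UM", "UIUC"): 500.0,
    ("UIUC", "BU"): 900.0,
    ("BU", "UIUC"): 130.0,
}


def find_asymmetric_pairs(
    rates_mbps: Dict[Tuple[str, str], float], min_ratio: float = 1.5
) -> List[Tuple[str, str, float]]:
    """Return (site_a, site_b, ratio) for pairs whose two directions differ.

    ratio is the faster direction divided by the slower one; pairs with a
    ratio below min_ratio (an arbitrary threshold) are not reported.
    """
    flagged = []
    for (src, dst), forward in rates_mbps.items():
        if src >= dst:
            continue  # visit each unordered pair once
        reverse = rates_mbps.get((dst, src))
        if reverse is None:
            continue  # only one direction was measured
        slow, fast = sorted((forward, reverse))
        ratio = float("inf") if slow == 0 else fast / slow
        if ratio >= min_ratio:
            flagged.append((src, dst, ratio))
    # Worst asymmetry first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    for site_a, site_b, ratio in find_asymmetric_pairs(SAMPLE_RATES_MBPS):
        print(f"{site_a} <-> {site_b}: ratio {ratio:.1f}")
```

With the sample figures, the BU/UIUC pair ranks ahead of the UIUC/UM pair, matching the observation that inbound traffic to UIUC from BU is the weakest.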
2) Milestones and benchmarking: Postponed topic until next week (need better attendance)
3) Site reports: No other issues reported from UIUC, OU or UM.
4) AOB: We will plan to meet again next week but this will be the last Throughput meeting of 2009. In 2010 we will start bi-weekly meetings. Please try to attend next week and review your perfSONAR tests before the call.
Please send along any corrections or additions via email. Thanks,
Shawn
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
-----------------------------------------------------------------------------------------------------------------------------
# | SITE | ICC | LL | UL | %Diff | OSC | LL | UL | %Diff |
-----------------------------------------------------------------------------------------------------------------------------
ATLAS sites
1 | AGLT2 | 5,150 | 4,677 | 4,677 | 9 | 645,022 | 542,000 | 542,000 | 15 |
2 | ~ AGLT2_CE_2 | 165 | 136 | 136 | 17 | 10,999 | 0 | 0 | 100 |
3 | ~ BNL_ATLAS_1 | 6,926 | 0 | 0 | 100 | 4,771,823 | 0 | 0 | 100 |
4 | ~ BNL_ATLAS_2 | 6,926 | 0 | 500 | 92 | 4,771,823 | 0 | 0 | 100 |
5 | ~ BU_ATLAS_Tier2 | 1,615 | 1,910 | 1,910 | -18 | 511 | 400,000 | 400,000 | -78,177 |
6 | ~ MWT2_IU | 928 | 3,276 | 3,276 | -252 | 0 | 179,000 | 179,000 | -100 |
7 | ~ MWT2_UC | 0 | 3,276 | 3,276 | -100 | 0 | 179,000 | 179,000 | -100 |
8 | ~ OU_OCHEP_SWT2 | 611 | 464 | 464 | 24 | 11,128 | 16,000 | 120,000 | -43 |
9 | ~ SWT2_CPB | 1,389 | 1,383 | 1,383 | 0 | 5,953 | 235,000 | 235,000 | -3,847 |
10 | ~ UTA_SWT2 | 493 | 493 | 493 | 0 | 13,752 | 15,000 | 15,000 | -9 |
11 | ~ WT2 | 1,377 | 820 | 1,202 | 12 | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------------------------------------------------------------------
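The %Diff column can be approximated with the sketch below. The formula is inferred from the table rows, not taken from the report's own code: the difference to the violated limit is expressed as a percentage of the calculated value and truncated toward zero, with -100 used when the calculated value is zero. The function name is hypothetical. This reproduces most rows, but not all: for MWT2_IU ICC it gives -253 where the table shows -252, possibly because the displayed values are rounded.

```python
def percent_diff(calculated: float, lower: float, upper: float) -> int:
    """Inferred %Diff: 0 inside [lower, upper], negative below, positive above."""
    if calculated > upper:
        limit = upper
    elif calculated < lower:
        limit = lower
    else:
        return 0
    if calculated == 0:
        # Zero capacity reported against a non-zero lower limit
        # (for example the MWT2_UC row).
        return -100
    # int() truncates toward zero, which matches rows such as AGLT2 OSC (15).
    return int((calculated - limit) / calculated * 100)


if __name__ == "__main__":
    print(percent_diff(5150, 4677, 4677))      # AGLT2 ICC
    print(percent_diff(511, 400000, 400000))   # BU_ATLAS_Tier2 OSC
```

Because the percentage is taken relative to the calculated value, a site reporting very little storage against a large lower limit produces a very large negative number, as in the BU_ATLAS_Tier2 and SWT2_CPB rows.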