dq2-get.
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=68600
[ ESD reprocessing -- restarted mid-week once the new s/w cache became available. Most (all?) jobs are running with the flag "--ignoreunknown accepted,"
which means errors like "Unknown Transform error" can be ignored. Primary error seen so far is "Athena ran out of memory." ]
See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
[Other production generally running very smoothly this past week -- most tasks ("queue fillers") have low error rates. ]
1) 9/17: upgrade of dCache pool nodes at MWT2_UC to SL5.3.
2) 9/17: From Xin, s/w patch for SLC4 ==> 5 migration:
http://www.usatlas.bnl.gov/~xinzhao/installSW/patch-slc4-to-slc5.tar.gz
Details:
The patch fixes problems encountered by analysis jobs that run on the SL5 platform and compile code as part of the job.
Other production jobs and SL4 sites do not need it, but installing it is harmless.
3) 9/20: Test jobs completed successfully at UCITB_EDGE7.
4) 9/21: Intermittent transfer errors at MWT2 sites likely due to ongoing testing -- from Charles:
We've been running some throughput/load tests from UC to IU, which are almost certainly the cause of these transfer failures.
I'll terminate the test now and the errors ought to clear up. https://gus.fzk.de/ws/ticket_info.php?ticket=51697
5) 9/23: UTD-HEP completed hardware maintenance (new RAID controller on fileserver) -- test jobs finished successfully, site set back to 'online'.
6) 9/23: All jobs were failing at AGLT2 with "Put error: lfc-mkdir failed." Hiro was able to fix a problem with an ACL -- site set back to 'online'.
https://gus.fzk.de/ws/ticket_info.php?ticket=51770
Follow-ups from earlier reports:
(i) 7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS. Significant progress,
but still a few remaining issues.
(ii) SLC5 upgrades are ongoing at the sites during the month of September.
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=69159
[ ESD reprocessing is essentially done. ]
See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
1) 9/24: Some files were lost in the MWT2 storage due to a dCache misconfiguration / cleanup operation. Not a major issue -- jobs should simply fail and get rerun. eLog 5731.
2) 9/25: Large number of failed jobs in the US cloud from task 78741 -- error was "could not add files to dataset." Remaining jobs were aborted. Issue discussed in Savannah 56127, RT 14134.
3) 9/25: Jobs failed at AGLT2 with the error "Put error: Error in copying the file from job workdir to localSE." Issue was expired host certs on several machines -- resolved.
4) 9/26: NET2 -- problematic WN atlas-c01.bu.edu taken offline. All pilots were failing on the machine with the error "Did not find a valid proxy, will now abort:".
5) 9/29 p.m. - 9/30 a.m.: NET2 sites offline due to a problem with the gatekeeper. Issue resolved, test jobs finished successfully, sites set back to 'online'. eLog 5831.
6) 9/30: Power outage at SLAC today -- from Wei:
SLAC will take a power outage on 9/30 to work on urgently needed maintenance of two transformers that supply power to machine rooms.
We will start setting things offline from 6pm 9/29 and eventually will shut down all ATLAS services. The outage is scheduled to complete at 6pm on 9/30.
7) 9/30: New pilot s/w from Paul, v39c:
A problem with job recovery was discovered, caused by the use of a wrong error code related to LFC registration.
When lfc-mkdir encountered an error, the wrong error code was set, which led the pilot to believe that the job could be recovered on sites that support job recovery.
The current job recovery version cannot handle these cases.
8) Grid certificate for special user 'sm' updated at BNL & UTA (thanks Nurcan).
9) Heads-up: ATLAS User Analysis Test (UAT) scheduled for the second half of October.
Follow-ups from earlier reports:
(i) 7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS. Significant progress,
but still a few remaining issues.
(ii) SLC5 upgrades are ongoing at the sites during the month of September.
Meeting Notes USATLAS Throughput Call – September 29, 2009
================================================
Attending: Sarah, Joe, Shawn, Hiro, Jason, Dave, Horst, Karthik, Doug
Excused: Saul, Neng
(We had some issues with noisy lines. ESnet is aware of the issue. There is a problematic line they are trying to track down. Only solution for now seems to be redialing till you get a good line.)
1) perfSONAR status. Release v3.1 is out. Already installed at OU and AGLT2. Possible issue with iptables: Shawn will send screen capture to Jason. MWT2, NET2 and Wisconsin have all confirmed they should be updated this week. Need to hear from WT2 and SWT2-UTA.
2) Update on Tier-3 testing. The info for the KOI perfSONAR box is (total cost is for 2 of them; note there is an additional charge for rails ~$30):
Item Number | Qty | Description                                    | Unit Cost | Total Amount
            | 2   | 1U Intel Pentium Dual-Core E2200 2.2GHz System | $598.00   | $1,196.00
Breakdown per System (Qty | Description):
1 | ASUS RS100-X5/P12 1U Chassis with 180W Single Power Supply. Intel 945GC/ICH7 Chipset Main Board. Onboard 2 x Marvell 8056 GbE LAN Controller, Intel Graphics Media Accelerator 950, 2 x SATA Ports.
1 | Intel BX80557E2200 Pentium DC E2200 2.2GHz 1MB 800MHz Processor
2 | Kingston KVR667D2N5/1G 1GB DDR2-5300 667MHz Non-ECC Unbuffered
1 | Seagate ST3160815AS 160GB SATA 16MB 7200RPM Hard Drive
1 | ASUS Slim DVD-ROM Drive
1 | Labor/Shipping
1 | Three Year Parts Repair/Replacement Warranty
TOTAL: $1,196.00
3) No updates. Still working on “3rd party” transfer capability for use in Tier-2 to Tier-3 testing. Will need to prestage long-term source files at Tier-2s for this. Tier-2s will need to set aside ~30GB of space for testing files.
4) Site reports
a. BNL – Nothing to report
b. AGLT2 – Still low throughput to debug. Issues with SRM hanging.
c. MWT2 – SL5 upgrade underway to fix TCP/Network issues.
d. NET2 – Working on perfSONAR updates.
e. SWT2 – perfSONAR installed and running.
f. WT2 – No report
g. Wisconsin - perfSONAR boxes should be upgraded this week.
5) AOB - Some review of perfSONAR milestones. The target is October 7th to have all USATLAS Tier-2/Tier-1 sites configured for mesh-testing.
a. Manual load-test to AGLT2 on Wednesday 9:30 AM Eastern
b. MWT2 will schedule a manual load-test sometime after their SL5 update
c. Analysis stress test coming up. May have implications for our preparations…
We plan to meet again next week at the usual time (3 PM Eastern on Tuesday). Send along any corrections or additions to these notes via email.
Thanks,
Shawn
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
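The report does not spell out how %Diff is computed. The sketch below is one reading that reproduces most rows of the table that follows: the difference is taken relative to the calculated value (not the limit) and truncated toward zero, with -100 reported when nothing is published but a nonzero lower limit is set. The function name percent_diff and this convention are assumptions inferred from the printed numbers, not taken from the report generator itself.

```python
def percent_diff(calculated, lower, upper):
    """Return %Diff of a calculated capacity against its LL/UL band.

    Assumed convention (inferred from the table, not from the tool's code):
    the difference is relative to the calculated value, truncated toward zero.
    """
    if lower <= calculated <= upper:
        return 0  # inside the band, nothing to flag
    if calculated == 0:
        return -100  # no capacity published while a nonzero LL is set
    limit = upper if calculated > upper else lower
    return int((calculated - limit) / calculated * 100)


# Rows taken from the table in this report: (calculated, LL, UL)
print(percent_diff(5150, 4677, 4677))      # AGLT2 ICC
print(percent_diff(511, 400000, 400000))   # BU_ATLAS_Tier2 OSC
print(percent_diff(11128, 16000, 120000))  # OU_OCHEP_SWT2 OSC
```

Because the difference is divided by the calculated value, a very small published number produces a very large negative %Diff, as in the BU_ATLAS_Tier2 storage entry. The table prints rounded capacities, so a row recomputed from the displayed values can differ from the printed %Diff by one (MWT2_IU ICC is such a case).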
-----------------------------------------------------------------------------------------------------------------------------
# | SITE | ICC | LL | UL | %Diff | OSC | LL | UL | %Diff |
-----------------------------------------------------------------------------------------------------------------------------
ATLAS sites
1 | AGLT2 | 5,150 | 4,677 | 4,677 | 9 | 645,022 | 542,000 | 542,000 | 15 |
2 | ~ AGLT2_CE_2 | 165 | 136 | 136 | 17 | 10,999 | 0 | 0 | 100 |
3 | ~ BNL_ATLAS_1 | 6,926 | 0 | 0 | 100 | 4,771,823 | 0 | 0 | 100 |
4 | ~ BNL_ATLAS_2 | 6,926 | 0 | 500 | 92 | 4,771,823 | 0 | 0 | 100 |
5 | ~ BU_ATLAS_Tier2 | 1,615 | 1,910 | 1,910 | -18 | 511 | 400,000 | 400,000 | -78,177 |
6 | ~ MWT2_IU | 928 | 3,276 | 3,276 | -252 | 0 | 179,000 | 179,000 | -100 |
7 | ~ MWT2_UC | 0 | 3,276 | 3,276 | -100 | 0 | 179,000 | 179,000 | -100 |
8 | ~ OU_OCHEP_SWT2 | 611 | 464 | 464 | 24 | 11,128 | 16,000 | 120,000 | -43 |
9 | ~ SWT2_CPB | 1,389 | 1,383 | 1,383 | 0 | 5,953 | 235,000 | 235,000 | -3,847 |
10 | ~ UTA_SWT2 | 493 | 493 | 493 | 0 | 13,752 | 15,000 | 15,000 | -9 |
11 | ~ WT2 | 1,377 | 820 | 1,202 | 12 | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------------------------------------------------------------------