dq2-get.
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=69159
[ ESD reprocessing is essentially done. ] See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
1) 9/24: Some files were lost in the MWT2 storage due to a dCache misconfiguration / cleanup operation. Not a major issue -- jobs should simply fail and get rerun. eLog 5731.
2) 9/25: Large number of failed jobs in the US cloud from task 78741 -- error was "could not add files to dataset." Remaining jobs were aborted. Issue discussed in Savannah 56127, RT 14134.
3) 9/25: Jobs failed at AGLT2 with the error "Put error: Error in copying the file from job workdir to localSE." Issue was expired host certs on several machines -- resolved.
4) 9/26: NET2 -- problematic WN atlas-c01.bu.edu taken offline. All pilots were failing on the machine with the error "Did not find a valid proxy, will now abort:"
5) 9/29 p.m. - 9/30 a.m.: NET2 sites offline due to a problem with the gatekeeper. Issue resolved, test jobs finished successfully, sites set back 'online'. eLog 5831.
6) 9/30: Power outage at SLAC today -- from Wei: SLAC will take a power outage on 9/30 to work on urgently needed maintenance of two transformers that supply power to machine rooms. We will start setting things offline from 6pm 9/29 and eventually will shut down all ATLAS services. The outage is scheduled to complete at 6pm on 9/30.
7) 9/30: New pilot s/w from Paul, v39c: A problem with job recovery was discovered, due to the use of a wrong error code related to LFC registration. When lfc-mkdir encountered an error, the wrong error code was set, which led the pilot to believe that the job could be recovered on sites that support job recovery. The current job recovery version cannot handle these cases.
8) Grid certificate for special user 'sm' updated at BNL & UTA (thanks Nurcan).
9) Heads-up: ATLAS User Analysis Test (UAT) scheduled for the second half of October.
Follow-ups from earlier reports: (i) 7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS. Significant progress, but still a few remaining issues. (ii) SLC5 upgrades are ongoing at the sites during the month of September.
Yuri's summary from the weekly ADCoS meeting: http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-9_29-10_5.html
[ ESD-ESD reprocessing exercise is done. Merging jobs completed on 10/2. Proposed dates for a postmortem: 10/14 or 10/15. ]
1) 9/30 p.m.: Power outage at SLAC ended. This was followed by a test of some recently installed RHEL 5 nodes on 10/2-10/3. Jobs finished successfully.
2) 10/1 a.m. -- from Saul: We had an air conditioning incident this morning at BU and had to turn off some of the blades while the room cooled down. A corresponding bunch of failed panda jobs will follow. Site services were not interrupted.
3) 10/2: From Tomasz, updates to nagios URLs:
New nagios page: https://nagios.racf.bnl.gov/
New dashboard locations:
https://nagios.racf.bnl.gov/nagios/sla_array.html (BNL)
https://nagios.racf.bnl.gov/nagios/tier2.html (Tier2 services)
4) 10/2: Issue with LFC db corruption at MWT2 resolved -- thanks Charles, Hiro.
5) 10/4: AGLT2, from Shawn:
==> The certificate updating at AGLT2 is not working because the AFS volume Certificates has some issues.
==> This issue should be resolved now. I fixed the AFS volume replication. There was a second problem on gate01.aglt2.org: the ‘rsync’ RPM was not installed. Once I restored it and re-ran the synchronize script things started working. This was done on gate01, gate02, head01 and head02.
6) 10/5: MWT2_IU -- several hundred job failures with the error "Get error: lsm-get failed." From Sarah:
==> One of our dCache pools is having memory issues. I've stopped our local job scheduler and dq2-site services until it is recovered.
==> The pool has recovered. I've restarted the local job scheduler & dq2-siteservices.
7) 10/6-10/7: Maintenance downtime at the MWT2 sites to apply security patches and work on SL5.3 migration. Test jobs submitted after the downtime completed successfully.
8) Various s/w upgrades announced for BNL:
a) ATLAS dCache upgrade -- 13 Oct 2009 08h00 - 13 Oct 2009 17h00
b) Deploy security patches and bug fixes in the OS and Oracle Cluster database underlying software -- 10/12/09 10:00 EDT - 10/12/09 13:00 EDT
c) GUMS Database Reconfiguration -- Tuesday, October 13th, 2009 1100 - 1300 EST.
d) HPSS software upgrade -- Tuesday Oct 6 - Thursday Oct 8
Follow-ups from earlier reports: (i) 7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS. Significant progress, but still a few remaining issues. (ii) SLC5 upgrades are ongoing at the sites during the month of September. (iii) ATLAS User Analysis Test (UAT) scheduled for the second half of October.
DDM/Throughput related issues:
1. The DDM dataset/throughput monitor service was moved to a different host (from my desktop). The link is unchanged.
2. Testing of FTS 2.2 continues. The bug is still present and is being fixed by the developers.
3. The mass deletion of many old datasets (probably in MCDISK) by ADC is ongoing.
4. Next Tuesday, during the scheduled downtime for dCache maintenance, the BNL FTS and LFC as well as the PANDA mover will be shut down. Several operations will take place during that time: application of an Oracle patch, a change in the network routing between F5 and LFC, and relocation of the PANDA mover hosts.
5. ANL_LOCALGROUPDISK has been added to the T3 throughput test.
Meeting Notes USATLAS Throughput Call – September 29, 2009
================================================
Attending: Sarah, Joe, Shawn, Hiro, Jason, Dave, Horst, Karthik, Doug
Excused: Saul, Neng
1) perfSONAR status. Release v3.1 is out. Already installed at OU and AGLT2. Possible issue with iptables: Shawn will send screen capture to Jason. MWT2, NET2 and Wisconsin have all confirmed they should be updated this week. Need to hear from WT2 and SWT2-UTA.
2) Update on Tier-3 testing. The info for the KOI perfSONAR box is (total cost is for 2 of them; note there is an additional charge for rails ~$30):
Qty | Description | Unit Cost | Total Amount
2 | 1U Intel Pentium Dual-Core E2200 2.2GHz System | $598.00 | $1,196.00
Breakdown per System:
1 | ASUS RS100-X5/P12 1U Chassis with 180W Single Power Supply. Intel 945GC/ICH7 Chipset Main Board. Onboard 2 x Marvell 8056 GbE LAN Controller, Intel Graphics Media Accelerator 950, 2 x SATA Ports.
1 | Intel BX80557E2200 Pentium DC E2200 2.2GHz 1MB 800MHz Processor
2 | Kingston KVR667D2N5/1G 1GB DDR2-5300 667MHz Non-ECC Unbuffered
1 | Seagate ST3160815AS 160GB SATA 16MB 7200RPM Hard Drive
1 | ASUS Slim DVD-ROM Drive
1 | Labor/Shipping
1 | Three Year Parts Repair/Replacement Warranty
TOTAL: $1,196.00
3) No updates. Still working on “3rd party” transfer capability for use in Tier-2 to Tier-3 testing. Will need to prestage long-term source files at Tier-2s for this. Tier-2s will need to set aside ~30GB of space for testing files.
4) Site reports
a. BNL – Nothing to report
b. AGLT2 – Still low throughput to debug. Issues with SRM hanging.
c. MWT2 – SL5 upgrade underway to fix TCP/Network issues.
d. NET2 – Working on perfSONAR updates.
e. SWT2 – perfSONAR installed and running.
f. WT2 – No report
g. Wisconsin - perfSONAR boxes should be upgraded this week.
5) AOB - Some review of perfSONAR milestones. October 7th to have all USATLAS Tier-2/Tier-1 sites config’ed for mesh-testing.
a. Manual load-test to AGLT2 on Wednesday 9:30 AM Eastern
b. MWT2 will schedule a manual load-test sometime after their SL5 update
c. Analysis stress test coming up. May have implications for our preparations…
We plan to meet again next week at the usual time (3 PM Eastern on Tuesday). Send along any corrections or additions to these notes via email.
Thanks,
Shawn
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
-----------------------------------------------------------------------------------------------------------------------------
# | SITE | ICC | LL | UL | %Diff | OSC | LL | UL | %Diff |
-----------------------------------------------------------------------------------------------------------------------------
ATLAS sites
1 | AGLT2 | 5,150 | 4,677 | 4,677 | 9 | 645,022 | 542,000 | 542,000 | 15 |
2 | ~ AGLT2_CE_2 | 165 | 136 | 136 | 17 | 10,999 | 0 | 0 | 100 |
3 | ~ BNL_ATLAS_1 | 6,926 | 0 | 0 | 100 | 4,771,823 | 0 | 0 | 100 |
4 | ~ BNL_ATLAS_2 | 6,926 | 0 | 500 | 92 | 4,771,823 | 0 | 0 | 100 |
5 | ~ BU_ATLAS_Tier2 | 1,615 | 1,910 | 1,910 | -18 | 511 | 400,000 | 400,000 | -78,177 |
6 | ~ MWT2_IU | 928 | 3,276 | 3,276 | -252 | 0 | 179,000 | 179,000 | -100 |
7 | ~ MWT2_UC | 0 | 3,276 | 3,276 | -100 | 0 | 179,000 | 179,000 | -100 |
8 | ~ OU_OCHEP_SWT2 | 611 | 464 | 464 | 24 | 11,128 | 16,000 | 120,000 | -43 |
9 | ~ SWT2_CPB | 1,389 | 1,383 | 1,383 | 0 | 5,953 | 235,000 | 235,000 | -3,847 |
10 | ~ UTA_SWT2 | 493 | 493 | 493 | 0 | 13,752 | 15,000 | 15,000 | -9 |
11 | ~ WT2 | 1,377 | 820 | 1,202 | 12 | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------------------------------------------------------------------
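The report does not spell out how %Diff is computed. The sketch below is an assumption inferred from the table values: the difference is taken relative to the calculated value (not the limit), truncated toward zero, with a calculated value of 0 below the lower limit reported as -100. The function name pct_diff is hypothetical. It reproduces most rows (for example AGLT2 and BU_ATLAS_Tier2), but gives -253 rather than the listed -252 for the MWT2_IU ICC column, possibly because the report works from unrounded inputs.

```python
def pct_diff(calculated: float, lower: float, upper: float) -> int:
    """Percent difference between a calculated capacity and its LL/UL.

    Assumed rule (inferred from the table, not from documentation):
      - 0 when the calculated value lies within [lower, upper]
      - positive when it exceeds the upper limit
      - negative when it falls below the lower limit
    The difference is expressed relative to the calculated value.
    """
    if lower <= calculated <= upper:
        return 0
    # Compare against whichever limit was violated.
    limit = upper if calculated > upper else lower
    if calculated == 0:
        # Nothing reported at all: the table lists this as -100.
        return -100
    # int() truncates toward zero, matching e.g. 9.18 -> 9 and -18.3 -> -18.
    return int(100 * (calculated - limit) / calculated)


# AGLT2 ICC: 5,150 calculated against LL = UL = 4,677
print(pct_diff(5150, 4677, 4677))
# BU_ATLAS_Tier2 OSC: 511 GB calculated against LL = UL = 400,000 GB
print(pct_diff(511, 400000, 400000))
```

Because the denominator is the calculated value, a site that publishes a very small number against a large limit (BU_ATLAS_Tier2 storage, SWT2_CPB storage) produces a very large negative percentage, which is why those rows stand out.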