pyutils that didn't support xrootd (needs to be checked in Release 15.1.0). And then an old release, 14.4.0. A few failures because the xrootd server failed. Also moving the release install area off the xrootd server.
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=73716
1) 11/5: UTD-HEP -- following the FTS upgrade at BNL file transfers at UTD-HEP were failing due to the older version of the SRM s/w at the site. It was upgraded, and transfers are now succeeding. (FTS error was "locality is NONE," fixed by adding [retention:REPLICA][latency:ONLINE] to each space token definition.)
2) 11/6: SLAC migrated the atlas s/w releases area to a different NFS server, but initially there were problems with some of the s/w paths / links, etc. Wei resolved the problems -- SLAC set back to 'online'.
3) 11/6: BNL -- disk failures in one of the storage servers for MCDISK. From Pedro: One of our storage servers was suffering from several disk failures. This affected 33TB of data stored in the MCDISK space token. The problematic disks have been replaced and the RAID set is being rebuilt. All data is available again.
4) 11/6: Jobs were failing at OU_OCHEP with the error "lfc_getreplicas(): SFN not set in LFC." Understood -- file removed during a disk clean-up. From Wensheng: The dq2-cache cleaner that ran yesterday removed the physical copy of the file DBRelease-7.5.1.tar.gz. A bit later pandamover brought it back, jobs have been running fine since then.
5) 11/7 - 11/8: AGLT2 -- jobs were failing with input file staging errors -- Bob tracked down the problem to a hung gridftp server. Issue resolved -- site set back to 'online'. RT 14551.
6) 11/8-11/9 -- very large number of panda jobs failed with the error "No reply to sent job." To fix the problem with ORACLE CERN PandaDB as Tadashi reported, the panda server was down for 5~10 min at 15:00 CERN time today (11/9) in order to optimize a schema in Oracle database.
7) 11/11, early a.m. -- intermittent network problems at BNL -- issue resolved: The RHIC and US Atlas facilities were experiencing intermittent network connectivity problems. The source of the problem has been identified and steps were taken to correct the problem. The underlying cause is under investigation and, in the event that the problem recurs, steps will be taken to resolve the problem quickly.
Follow-ups from earlier reports:
(i) UAT -- a postmortem announcement to follow.
(ii) A new test instance of the RT server at BNL was announced by Jason (message to the usual mail lists). Try it out at: https://rt.racf.bnl.gov/rt3/
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=74330
1) 11/11-11/12: IU_OSG -- kernel upgrades completed, site set back to 'online'.
2) 11/11: BNL -- ~150 jobs failed with stage-in errors -- issue was an off-line storage server -- resolved. RT 14585.
3) 11/12: BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.
4) 11/13: ~500 failed jobs at BU with local site mover errors. The log extract included "no space left on device." From Saul: We got short of disk space in the process of moving our DATADISK. It should be fixed now. eLog 6926.
5) 11/13: At the beginning of the shift BNL and AGLT2 had no activated jobs, but plenty of assigned ones. Issue with the BNL_ATLAS_DDM queue was eventually resolved. See extensive mail thread for details.
6) 11/14: Jobs at AGLT2 were gradually draining out. From Bob: Running job count at aglt2 began to drop at 17:40. I subsequently found a crashed "ypbind", and restarted it at 20:15. All times EST. Grid services are once again authenticating, however, we expect a number of dead/crashed jobs to show up from this time period.
7) 11/15: srm storage filled up at UTD-HEP. Some issues running the "proddisk-cleanse.py" script. Being worked on. Site set 'off-line'. RT 14708.
8) 11/17 early a.m.: AGLT2 -- transfer errors, jobs were failing with "Put error: Copy command returned error code 256 and output: httpg://head01.aglt2.org:8443/srm/managerv2: CGSI-gSOAP: Could not open connection!" Resolved -- from Shawn: The /var partition on the dCache headnode was full. This was apparently due to excessive logging into the postgres DB. Some space has been freed and both postgres and dcache services restarted on head01.aglt2.org.
9) 11/17: LFC migration to BNL completed for tier 3 site IllinoisHEP. Test jobs were submitted, but they initially seemed to still be using the old LFC information. Wensheng updated the ToA, and the jobs have now finished successfully. Will set the site to 'on-line' once the output file transfers complete.
Follow-ups from earlier reports:
(i) UAT -- postmortem will be held November 19, 2:00pm CET.
(ii) A new test instance of the RT server at BNL was announced by Jason (message to the usual mail lists). Try it out at: https://rt.racf.bnl.gov/rt3/
As we agreed, we must consolidate all US T3 DDM related services (DQ2 SS and LFC) to BNL. As the first step, I would like to bring all DQ2 SS to BNL tomorrow. Basically, I need to ask you to turn off DQ2 since BNL's DQ2 SS will serve your sites. If you run DQ2 SS serving the following sites, please stop your DQ2 (or remove them from your configuration dq2.cfg).
UCT3 OUHEP (is this T3?) WISC XYZ UTD XYZ ILLINOIS XYZ (done) DUKE XYZ (done) ANL XYZ (done)
If you know of any other sites, please let me know. Please keep your LFCs running. That will be the second step. I would like to do this tomorrow at 12PM US Eastern time. If you have any questions, please let me know.
Hiro
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
---------------------------------------------------------------------------------------------------
 # | SITE             |   ICC | ICC LL | ICC UL | %Diff |       OSC |  OSC LL |  OSC UL |   %Diff |
---------------------------------------------------------------------------------------------------
ATLAS sites
 1 |   AGLT2          | 5,150 |  4,677 |  4,677 |     9 |   645,022 | 542,000 | 542,000 |      15 |
 2 | ~ AGLT2_CE_2     |   165 |    136 |    136 |    17 |    10,999 |       0 |       0 |     100 |
 3 | ~ BNL_ATLAS_1    | 6,926 |      0 |      0 |   100 | 4,771,823 |       0 |       0 |     100 |
 4 | ~ BNL_ATLAS_2    | 6,926 |      0 |    500 |    92 | 4,771,823 |       0 |       0 |     100 |
 5 | ~ BU_ATLAS_Tier2 | 1,615 |  1,910 |  1,910 |   -18 |       511 | 400,000 | 400,000 | -78,177 |
 6 | ~ MWT2_IU        |   928 |  3,276 |  3,276 |  -252 |         0 | 179,000 | 179,000 |    -100 |
 7 | ~ MWT2_UC        |     0 |  3,276 |  3,276 |  -100 |         0 | 179,000 | 179,000 |    -100 |
 8 | ~ OU_OCHEP_SWT2  |   611 |    464 |    464 |    24 |    11,128 |  16,000 | 120,000 |     -43 |
 9 | ~ SWT2_CPB       | 1,389 |  1,383 |  1,383 |     0 |     5,953 | 235,000 | 235,000 |  -3,847 |
10 | ~ UTA_SWT2       |   493 |    493 |    493 |     0 |    13,752 |  15,000 |  15,000 |      -9 |
11 | ~ WT2            | 1,377 |    820 |  1,202 |    12 |         0 |       0 |       0 |       0 |
---------------------------------------------------------------------------------------------------
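
The %Diff columns can be cross-checked by hand. The sketch below is a guess at the calculation, inferred only from the rows of the table, not taken from the report generator: it assumes the difference is taken relative to the calculated value, measured against the limit that was violated, and truncated toward zero. The function name percent_diff and the special case for a calculated value of 0 are hypothetical. The formula reproduces most rows, but not all of them (for MWT2_IU ICC it gives -253 where the report shows -252), so the real calculation may differ in detail.

```python
import math


def percent_diff(calculated, lower, upper):
    """Hypothetical reconstruction of the report's %Diff column.

    Assumption (inferred from the table, not from the report's source):
    the difference is expressed as a percentage of the calculated value
    and truncated toward zero.
    """
    if calculated > upper:
        limit = upper          # +ve %Diff: calculated value > upper limit
    elif calculated < lower:
        limit = lower          # -ve %Diff: calculated value < lower limit
    else:
        return 0               # within [LL, UL]
    if calculated == 0:
        # Assumed convention: the table shows -100 when nothing is
        # published but a non-zero lower limit is registered.
        return -100
    return math.trunc((calculated - limit) * 100 / calculated)


if __name__ == "__main__":
    # (site, calculated, LL, UL) taken from rows of the table above
    rows = [
        ("AGLT2 ICC", 5150, 4677, 4677),
        ("BU_ATLAS_Tier2 OSC", 511, 400000, 400000),
        ("BNL_ATLAS_2 ICC", 6926, 0, 500),
        ("MWT2_UC ICC", 0, 3276, 3276),
    ]
    for site, calc, ll, ul in rows:
        print(site, percent_diff(calc, ll, ul))
```

Because the difference is divided by the calculated value rather than by the limit, a site publishing a very small number against a large limit (for example 511 GB against 400,000 GB) produces a very large negative percentage. That would explain entries such as -78,177 in the table.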