Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=86064

1) 2/17: NET2 -- following recovery from a power outage, test jobs completed successfully -- site set back to 'on-line'.
2) 2/18: UTD-HEP set 'off-line' in preparation for OS and other s/w upgrades.
3) 2/19: New pilot version from Paul (42e):
   * Began the process of tailing the brief pilot error diagnostics (not completed). Long error messages were previously cut off by the 256 char limit on the server side, which often led to the actual error not being displayed. Some (not yet all) error messages will now be tailed (i.e. the tail of the error message will be shown and not only the beginning of the string). Requested by I Ueda.
   * Now grabbing the number of events from non-standard athena stdout info strings (which are different when running with a "good run list"). See discussion in Savannah ticket 62721.
   * Added dCache sub directory verification (which in turn is used to determine whether the checksum test or the file size test should be used on output files). Needed for sites that share dCache with other sites. Requested by Brian Bockelman et al.
   * Pilot queuedata downloads now use the new format for retrieving the queuedata from schedconfig. Not yet added to the autopilot wrapper. Requested by Graeme Stewart.
   * DQ2 tracing report now contains the PanDA job id as well as hostname, ip and user dn (a DQ2 trace can now be traced back to the original PanDA job). Requested by Paul Nilsson(!)/Angelos Molfetas.
   * Size of the user workdir is now allowed to be up to 5 GB (previously 3 GB). Discussed in a separate thread. Requested by Graeme Stewart.
4) 2/19 - 2/22: AGLT2 -- dCache issues discovered as the site was coming back on-line following a maintenance outage for s/w upgrades (SL5, etc.). Issue eventually resolved. Test jobs succeeded, back to 'on-line'. ggus 55709, eLog 9750.
5) 2/19: Oracle outage at CERN on 2/18 described here: https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/9644
6) 2/19: SLAC -- DDM errors like "FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]." Issue resolved -- from Wei: This is a name space problem. I will investigate. In the meantime, I switched to running the xrootdfs (pnfs equivalent) without using name space. I hope DDM will retry.
7) 2/19: Large numbers of tasks were killed in the US & DE clouds to allow high priority ones to run. (The high priority tasks were needed for a min-bias paper in preparation.) eLog 9645.
8) 2/20: From John at Harvard -- HU_ATLAS_Tier2 set back to 'on-line' after test jobs completed successfully.
9) 2/23: BNL -- FTS upgraded to v2.2.3. From Hiro: This is just to inform you that BNL FTS has been upgraded to the checksum-capable version. There will be some tests of this capability. Also, as we have planned all along, the consolidation of the DQ2 site services will happen after some tests in the coming weeks, once BNL DQ2 is upgraded next week.
10) New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime

Follow-ups from earlier reports:
(i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/ More to say about this later as we see how its use evolves.
(ii) This past week: What began with test jobs at UCITB_EDGE7 to verify the latest OSG release (1.2.7) led to the issue of having another site available besides UCITB. Long mail thread about this topic. Conclusion (from Alden, 2/23): Things are pretty much resolved. We'll need to create a new queue for ATLAS ITB activities, and shift all ATLAS focus off the BNL_ITB_Test1-condor queue. I'll get that started this afternoon.
(iii) Issue about pilot space checking / minimum space requirements noted last week -- has there been a decision here?
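
The error-message tailing described in the 42e pilot notes above can be illustrated with a small sketch. This is not the pilot's actual code: the function names and the sample message are hypothetical, and the only detail taken from the notes is the 256 character limit on the server side.

```python
# Sketch only: head_truncate/tail_truncate are hypothetical names,
# not functions from the PanDA pilot.
SERVER_LIMIT = 256  # character limit on the server side, per the notes


def head_truncate(message: str, limit: int = SERVER_LIMIT) -> str:
    """Previous behaviour: keep only the beginning of the string."""
    if limit <= 0:
        return ""
    return message[:limit]


def tail_truncate(message: str, limit: int = SERVER_LIMIT) -> str:
    """Tailed behaviour: keep only the end of the string."""
    if limit <= 0:
        return ""
    if len(message) <= limit:
        return message
    return message[-limit:]


if __name__ == "__main__":
    # A long preamble followed by the part a shifter actually needs.
    sample = "setup output " * 40 + "ERROR: source file doesn't exist"
    print(head_truncate(sample)[-32:])  # still inside the preamble
    print(tail_truncate(sample)[-32:])  # ends with the actual error
```

With head truncation the real error at the end of a long message is lost, while tail truncation keeps it within the same limit.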
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=86925

1) This past week -- ongoing work to set up a new BNL ITB test site. Long mail thread about this topic.
2) 2/24 - 2/26: AGLT2 -- transfer errors for AGLT2_DATADISK. Most likely a network issue -- from Shawn: We have seen no indication of a problem with our AGLT2_DATADISK space-token area. We have seen intermittent network problems (as observed by the perfSONAR OWAMP testing between AGLT2 and BNL) where there are periods of large packet loss. Our best guess is that these problems were correlated with some network issue along the path between AGLT2 and BNL. Almost all the time between this ticket and now the ARDA dashboard has been "Green" for the AGLT2 space-token areas. ggus 55854, eLog 9786.
3) 2/24 - 2/25: SWT2_CPB -- data transfers were failing due to a problem with one of the data servers. Data from this machine was replicated elsewhere in the cluster, resolving this issue. ggus 55895, eLog 9810, RT 15556.
4) 2/25: Transfer errors at BNL. Issue resolved. eLog 9849, ggus 55936, RT 15563.
5) 2/25: New pilot version from Paul (42f): The pilot has been updated. The mini-release contains a few corrections to the DQ2 tracing reports (added DN for job owner, corrections for input file dataset, missing appdir variable for log transfers). Requested by Angelos Molfetas.
6) 2/25: SLAC -- brief outage to upgrade the OS of the SRM host and the kernel of the LFC DB host -- completed.
7) 2/26: MWT2_UC -- from Sarah: We've encountered a hardware fault on our LFC server, and are working to repair it. MWT2_UC and Analy_MWT2 will remain offline until it is back up. 2/27: update from Rob: The server hosting the LFC catalog failed and is being restored. In addition we are re-synching LFC, dCache and DQ2 central catalogs for data in the MWT2_UC_* space token areas. MWT2_UC and ANALY_MWT2 should be kept offline until both of these are complete. As of 3/2 issue resolved, test jobs completed successfully, and all MWT2_* sites set back to 'on-line'. eLog 9919, 9920, 9987.
8) 2/26: NET2 -- Transfer errors like: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries]. Issue with certificate updates -- from John: Even though the cronjob is present, and this had been running regularly, something critical was updated so recently that we needed to --force an update. RT 15580, eLog 9896.
9) 2/28, 3/1: Additional space added to MWT2_UC_MCDISK after transfers failed with "id=381513 does not have enough space" errors. eLog 9934.
10) 3/1: BNL -- new gatekeeper is available. From Xin: Just to let you know that there is a new gatekeeper, gridgk05.racf.bnl.gov, which can now run production/analysis/pandamover jobs at BNL. Please feel free to make new queues, or direct existing queue pilots to it.
11) 3/2: SLAC -- almost 100% job failures with the error "unable to safeguard against Oracle overload due to ORA-12170: TNS:Connect timeout occurred." Problem understood -- from Wei: A new set of batch nodes that we are bringing online don't have the correct setup for tunneling (actually via iptables and xinetd) to BNL Oracle. I am sending an inquiry about this issue. Hopefully this is just an oversight on our part.
12) 3/3 (early a.m.): BNL DQ2 site services upgraded.

Follow-ups from earlier reports:
(i) Issue about pilot space checking / minimum space requirements noted last week -- has there been a decision here?
(ii) New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
I am wondering if we can agree on a consistent site naming convention for the various services in the ATLAS production system used in the US. There seems to be confusion among people/shifters outside the US when they try to identify the responsible site from the various names used in the US production services/queues. In fact, some of them are openly expressing their frustration about this difficulty in the computing log. Hence, I am wondering if we can/should make the effort to use consistent naming conventions for the site names used in the various systems. Below, I have identified some of the systems where consistent naming would help users:

1. PanDA site name
2. DDM site name
3. BDII site name

Since these three names appear at the front of the major ATLAS computing monitoring systems, consistent naming for each site across these three separate systems should at least help ease the problems encountered by other people. So, is it possible to change any of the names? (I know some of them are a pain to change. If needed, I can make a table of the names used for each site in these three systems.)

Hiro
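
The table of names offered in the message above could take the form of a small lookup structure that maps any PanDA, DDM or BDII name back to the responsible site. The sketch below is only an illustration of that idea: every site and service name in it is a made-up placeholder, not a real entry from any of the three systems.

```python
from typing import Dict, Optional

# Hypothetical placeholder entries; real names would come from PanDA,
# DDM and BDII and are deliberately not guessed here.
SITE_NAMES: Dict[str, Dict[str, str]] = {
    "EXAMPLE_SITE_A": {
        "panda": "EXAMPLE_A_QUEUE",
        "ddm": "EXAMPLE_A_DATADISK",
        "bdii": "Example_A_Tier2",
    },
    "EXAMPLE_SITE_B": {
        "panda": "ANALY_EXAMPLE_B",
        "ddm": "EXAMPLE_B_MCDISK",
        "bdii": "Example_B_CE",
    },
}


def build_reverse_index(table: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Map every service-specific name (case-insensitive) to its site."""
    index: Dict[str, str] = {}
    for site, names in table.items():
        for name in names.values():
            index[name.lower()] = site
    return index


def responsible_site(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the responsible site for a service name, or None if unknown."""
    return index.get(name.lower())


if __name__ == "__main__":
    index = build_reverse_index(SITE_NAMES)
    print(responsible_site("example_b_mcdisk", index))
    print(responsible_site("UNKNOWN_QUEUE", index))
```

A shifter who sees an unfamiliar queue or space-token name in a monitoring page could then resolve it to one site name, which is the problem the message describes.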
This is a report of pledged installed computing and storage capacity at sites.
Report date: 2010-01-25
--------------------------------------------------------------------------
# | Site | KSI2K | HS06 | TB |
--------------------------------------------------------------------------
1. | AGLT2 | 1,570 | 10,400 | 0 |
2. | AGLT2_CE_2 | 100 | 640 | 0 |
3. | AGLT2_SE | 0 | 0 | 1,060 |
--------------------------------------------------------------------------
Total: | US-AGLT2 | 1,670 | 11,040 | 1,060 |
--------------------------------------------------------------------------
| | | | |
4. | BU_ATLAS_Tier2 | 1,910 | 0 | 400 |
--------------------------------------------------------------------------
Total: | US-NET2 | 1,910 | 0 | 400 |
--------------------------------------------------------------------------
| | | | |
5. | BNL_ATLAS_1 | 0 | 0 | 0 |
6. | BNL_ATLAS_2 | 0 | 0 | 1 |
7. | BNL_ATLAS_SE | 0 | 0 | 0 |
--------------------------------------------------------------------------
Total: | US-T1-BNL | 0 | 0 | 1 |
--------------------------------------------------------------------------
| | | | |
8. | MWT2_IU | 3,276 | 0 | 0 |
9. | MWT2_IU_SE | 0 | 0 | 179 |
10. | MWT2_UC | 3,276 | 0 | 0 |
11. | MWT2_UC_SE | 0 | 0 | 200 |
--------------------------------------------------------------------------
Total: | US-MWT2 | 6,552 | 0 | 379 |
--------------------------------------------------------------------------
| | | | |
12. | OU_OCHEP_SWT2 | 464 | 0 | 16 |
13. | SWT2_CPB | 1,383 | 0 | 235 |
14. | UTA_SWT2 | 493 | 0 | 15 |
--------------------------------------------------------------------------
Total: | US-SWT2 | 2,340 | 0 | 266 |
--------------------------------------------------------------------------
Total: | All US ATLAS | 12,472 | 11,040 | 2,106 |
--------------------------------------------------------------------------
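
The subtotals and the grand total in the report above can be cross-checked with a short script. This is a minimal sketch: the figures are copied from the table, while the data layout and function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, int, int]  # (federation, site, KSI2K, HS06, TB)

# Per-site figures copied from the report dated 2010-01-25.
ROWS: List[Row] = [
    ("US-AGLT2", "AGLT2", 1570, 10400, 0),
    ("US-AGLT2", "AGLT2_CE_2", 100, 640, 0),
    ("US-AGLT2", "AGLT2_SE", 0, 0, 1060),
    ("US-NET2", "BU_ATLAS_Tier2", 1910, 0, 400),
    ("US-T1-BNL", "BNL_ATLAS_1", 0, 0, 0),
    ("US-T1-BNL", "BNL_ATLAS_2", 0, 0, 1),
    ("US-T1-BNL", "BNL_ATLAS_SE", 0, 0, 0),
    ("US-MWT2", "MWT2_IU", 3276, 0, 0),
    ("US-MWT2", "MWT2_IU_SE", 0, 0, 179),
    ("US-MWT2", "MWT2_UC", 3276, 0, 0),
    ("US-MWT2", "MWT2_UC_SE", 0, 0, 200),
    ("US-SWT2", "OU_OCHEP_SWT2", 464, 0, 16),
    ("US-SWT2", "SWT2_CPB", 1383, 0, 235),
    ("US-SWT2", "UTA_SWT2", 493, 0, 15),
]

# "Total:" lines as printed in the report.
REPORTED_TOTALS: Dict[str, Tuple[int, int, int]] = {
    "US-AGLT2": (1670, 11040, 1060),
    "US-NET2": (1910, 0, 400),
    "US-T1-BNL": (0, 0, 1),
    "US-MWT2": (6552, 0, 379),
    "US-SWT2": (2340, 0, 266),
}
REPORTED_GRAND_TOTAL = (12472, 11040, 2106)


def federation_totals(rows: List[Row]) -> Dict[str, Tuple[int, int, int]]:
    """Sum KSI2K, HS06 and TB per federation."""
    totals: Dict[str, Tuple[int, int, int]] = {}
    for federation, _site, ksi2k, hs06, tb in rows:
        cur = totals.get(federation, (0, 0, 0))
        totals[federation] = (cur[0] + ksi2k, cur[1] + hs06, cur[2] + tb)
    return totals


def grand_total(totals: Dict[str, Tuple[int, int, int]]) -> Tuple[int, ...]:
    """Sum each column across all federations."""
    return tuple(sum(column) for column in zip(*totals.values()))


if __name__ == "__main__":
    computed = federation_totals(ROWS)
    print(computed == REPORTED_TOTALS)
    print(grand_total(computed) == REPORTED_GRAND_TOTAL)
```

Keeping the per-site rows and the printed totals separate makes any future mismatch between the two visible as soon as the report is regenerated.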