Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=85286

1) 2/10: MWT2_UC -- Transfer errors from MWT2_UC_PRODDISK to BNL-OSG2_MCDISK, with errors like "[FTS] FTS State [Failed] FTS Retries [7] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]." Issue resolved. RT 15424, eLog 9427.
2) 2/11: SWT2_CPB -- maintenance outage complete, test jobs finished successfully, site set back to 'on-line'.
3) 2/11: AGLT2 -- ~200 failed jobs with the error "Get error: dccp get was timed out after 18000 seconds." From Bob: This was fixed around 9 a.m. EST today. There was a missing routing table entry. Normal operations have since resumed. There may be more jobs reporting this failure until everything is caught up. Only half of the AGLT2 worker nodes were affected.
4) 2/11: New pilot version from Paul (42c):
* "t1d0" added to the LFC replica sorting algorithm to prevent such tape replicas from appearing before any disk-resident replicas. Requested by Hironori Ito et al.
* The pilot now monitors the size of individual output files. The maximum allowed size is 5 GB; currently the check is performed once every ten minutes. A new error code was introduced: 1124 (pilot error code), 'Output file too large', 441003/EXEPANDA_OUTPUTFILETOOLARGE (proddb error code). The Panda monitor, Bamboo and ProdDB were updated as well. Requested by Dario Barberis et al.
* The limit on the total size of all input files is now read from the schedconfig DB (maxinputsize). The default site value is set to 14336 MB (by Alden Stradling). Brokering has been updated as well (by Tadashi Maeno). Any problem reading the schedconfig value (not set / bad characters) will lead to the pilot using its internal default as before (14336 MB). maxinputsize=0 means unlimited space. Monitoring of individual input file sizes will be added in a later pilot version.
* Fixed a problem with the local site mover, making sure that all input file names are defined by LFN. Previously problems occurred with PFNs containing the *__DQ2-* part in the file name (copied into the local directory, leading to a file-not-found problem). Requested by John Brunelle.
* Fixed an issue with FileStager access mode not working properly for sites with direct access switched off. Discovered by Dan van der Ster.
* Removed panda servers voatlas19-21 from the internal server list. Added voatlas58-59 (voatlas57 was added in pilot v 42b). Requested by Tadashi Maeno.
5) 2/11: IllinoisHEP -- Jobs were failing with the errors "EXEPANDA_GET_FAILEDTOGETLFCREPLICAS" & "EXEPANDA_DQ2PUT_LFC-REGISTRATION-FAILED (261)." It seems the Tier 3 LFC was unresponsive for a period of time, causing these errors. Issue resolved by Hiro. RT 15427, eLog 9468.
6) 2/11: AGLT2, power problem -- from Bob: At 10:55 a.m. EST a power trip at the MSU site dropped 12 machines accounting for up to 144 job slots. Not all of those slots were running T2 jobs, but many were. Those jobs were lost.
7) 2/12 - 2/13: Reprocessing validation jobs were failing at MWT2_IU due to the missing ATLAS release 15.6.3. The release was subsequently installed by Xin.
8) 2/12: Reprocessing job submission began with tag r1093.
9) 2/13: BNL -- Issue with data transfers resolved by re-starting the pnfs and SRM services. The problem appears to be related to high loads induced by reprocessing jobs. Another instance of the problem occurred on 2/15. See details in eLog 9524, 9508.
10) 2/16: Power outage at NET2. From Saul: NET2 has had a power outage at the BU site.
All systems have been rebooted and brought back to normal. Test jobs completed successfully, site set back to 'on-line'.
11) 2/16: Test jobs submitted to UCITB_EDGE7 to verify the latest OSG version (1.2.7, a minor update of version 1.2.6 -- contains mostly security fixes). Jobs completed successfully.
12) 2/17: Maintenance outage at AGLT2 beginning at 6:30 a.m. EST. From Shawn: The dCache head nodes have been restored on new hardware running SL5.4/x86_64. We are now working on BIOS/FW updates and will then rebuild the storage (pool) nodes. The outage is scheduled to end at 5 p.m. Eastern, but we hope to be back before then.
13) 2/17: Note from Paul about an impending pilot update: Currently the pilot only downloads a job if the WN has at least 5 GB of available space. As discussed in a separate thread, we need to increase this limit to guarantee (well...) that a job using a lot of input data can finish and not run out of local space. It was suggested by Kaushik to use the limit 14 + 5 + 2 = 21 GB (14 GB for input files, 5 GB for output, 2 GB for the log). Please let me know as soon as possible if this needs to be discussed any further. (A sketch of these space checks follows this report.)
Follow-ups from earlier reports:
(i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/ More to say about this later as we see how its use evolves.
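Taken together, pilot item 4 above (maxinputsize read from schedconfig, default 14336 MB, 0 meaning unlimited) and item 13 (a minimum of 14 + 5 + 2 = 21 GB free space on the worker node before a job is downloaded) describe two simple space checks. The following is a minimal Python sketch of that logic, not the actual pilot code: the function names and the queuedata dictionary are illustrative assumptions; only the maxinputsize field, the 14336 MB default and the 21 GB figure come from the report above.

    import shutil

    MB = 1024 * 1024
    GB = 1024 * MB
    DEFAULT_MAXINPUTSIZE_MB = 14336      # internal default quoted in the 42c notes
    MIN_FREE_SPACE_GB = 14 + 5 + 2       # input + output + log, per Kaushik's suggestion

    def get_max_input_size_mb(queuedata):
        # Read maxinputsize from the schedconfig-derived queuedata; any problem
        # (not set / bad characters) falls back to the internal default.
        try:
            value = int(queuedata.get("maxinputsize", DEFAULT_MAXINPUTSIZE_MB))
        except (TypeError, ValueError):
            return DEFAULT_MAXINPUTSIZE_MB
        if value == 0:
            return None                  # 0 means unlimited
        return value if value > 0 else DEFAULT_MAXINPUTSIZE_MB

    def enough_local_space(workdir, min_gb=MIN_FREE_SPACE_GB):
        # Only download a job if the worker node has at least min_gb of free space.
        return shutil.disk_usage(workdir).free >= min_gb * GB

For example, get_max_input_size_mb({"maxinputsize": "0"}) would return None (unlimited), while enough_local_space(".") is the kind of gate applied before a job download.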
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=86064

1) 2/17: NET2 -- following recovery from a power outage, test jobs completed successfully -- site set back to 'on-line'.
2) 2/18: UTD-HEP set 'off-line' in preparation for OS and other s/w upgrades.
3) 2/19: New pilot version from Paul (42e):
* Began the process of tailing the brief pilot error diagnostics (not yet completed). Long error messages were previously cut off by the 256-character limit on the server side, which often led to the actual error not being displayed. Some (not yet all) error messages will now be tailed, i.e. the tail of the error message will be shown rather than only the beginning of the string. Requested by I Ueda. (See the sketch after this report's follow-ups.)
* Now grabbing the number of events from non-standard athena stdout info strings (which are different when running with a "good run list"). See the discussion in Savannah ticket 62721.
* Added dCache sub-directory verification (which in turn is used to determine whether a checksum test or a file-size test should be used on output files). Needed for sites that share dCache with other sites. Requested by Brian Bockelman et al.
* Pilot queuedata downloads now use the new format for retrieving the queuedata from schedconfig. Not yet added to the autopilot wrapper. Requested by Graeme Stewart.
* The DQ2 tracing report now contains the PanDA job id as well as hostname, IP and user DN (a DQ2 trace can now be traced back to the original PanDA job). Requested by Paul Nilsson(!)/Angelos Molfetas.
* The size of the user workdir is now allowed to be up to 5 GB (previously 3 GB). Discussed in a separate thread. Requested by Graeme Stewart.
4) 2/19 - 2/22: AGLT2 -- dCache issues discovered as the site was coming back on-line following a maintenance outage for s/w upgrades (SL5, etc.). Issue eventually resolved. Test jobs succeeded, back to 'on-line'. ggus 55709, eLog 9750.
5) 2/19: Oracle outage at CERN on 2/18 described here: https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/9644
6) 2/19: SLAC -- DDM errors like "FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] source file doesn't exist]." Issue resolved -- from Wei: This is a name space problem. I will investigate. In the meantime, I switched to running the xrootdfs (pnfs equivalent) without using the name space. I hope DDM will retry.
7) 2/19: Large numbers of tasks were killed in the US & DE clouds to allow high-priority ones to run. (The high-priority tasks were needed for a min-bias paper in preparation.) eLog 9645.
8) 2/20: From John at Harvard -- HU_ATLAS_Tier2 set back to 'on-line' after test jobs completed successfully.
9) 2/23: BNL -- FTS upgraded to v2.2.3. From Hiro: This is just to inform you that BNL FTS has been upgraded to the checksum-capable version. There will be some tests of this capability. Also, as we have planned all along, the consolidation of the DQ2 site services will happen after some tests in the coming weeks, once BNL DQ2 is upgraded next week.
10) New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
Follow-ups from earlier reports:
(i) 2/8: A new Savannah project is available for handling the status of sites: https://savannah.cern.ch/projects/adc-site-status/ More to say about this later as we see how its use evolves.
(ii) This past week: What began with test jobs at UCITB_EDGE7 to verify the latest OSG release (1.2.7) led to a discussion about having another ITB site available besides UCITB. Long mail thread about this topic. Conclusion (from Alden, 2/23): Things are pretty much resolved. We'll need to create a new queue for ATLAS ITB activities, and shift all ATLAS activity off the BNL_ITB_Test1-condor queue. I'll get that started this afternoon.
(iii) Issue about pilot space checking / minimum space requirements noted last week -- has there been a decision here?
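The 42e item on tailing error diagnostics amounts to keeping the end of a long message so the actual error survives the 256-character server-side limit. A minimal Python sketch of that idea, under the assumption of a simple truncation helper (the function name and the "..." prefix are illustrative; only the 256-character limit comes from the report):

    SERVER_FIELD_LIMIT = 256   # character limit on the server side, per the report

    def tail_error_diag(message, limit=SERVER_FIELD_LIMIT):
        # Keep the tail of an over-long error message rather than the beginning,
        # so the actual error text is not truncated away.
        if len(message) <= limit:
            return message
        return "..." + message[-(limit - 3):]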
I am wondering if we can agree on a consistent site naming convention for the various services in the ATLAS production system used in the US. There seems to be confusion among people/shifters outside the US when trying to identify the actual responsible site from the various names used in the US production services/queues. In fact, some of them have openly commented in the computing log on their frustration with this difficulty. Hence, I am wondering if we can/should put in the effort to use consistent naming conventions for the site name used in the various systems. Below, I have identified some of the systems where consistent naming would help users:

1. PanDA site name
2. DDM site name
3. BDII site name

Since these three names at least appear at the front of the major ATLAS computing monitoring systems, good, consistent naming for each site in these three separate systems should help ease the problems encountered by people outside the US. So, is it possible to change any of the names? (I know some of them are a pain to change. If needed, I can make a table of the names used for each site in these three systems.)

Hiro
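As a purely hypothetical illustration of the per-site table Hiro offers to make, the mapping could be kept as something like the structure below. The MWT2_UC and MWT2_UC_PRODDISK names are taken from the shift reports above; the BDII entry is a placeholder, since no BDII names are given in this thread.

    # Hypothetical sketch of a per-site name table (Python used for illustration only).
    # Real entries would come from the sites themselves.
    site_names = {
        "US-MWT2 (UC)": {
            "panda": "MWT2_UC",            # PanDA queue name seen in the shift reports
            "ddm":   "MWT2_UC_PRODDISK",   # DDM endpoint name seen in the shift reports
            "bdii":  None,                 # BDII resource name: not given in this thread
        },
    }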
This is a report of pledged installed computing and storage capacity at sites. Report date: 2010-01-25

--------------------------------------------------------------------------
   #   | Site            |  KSI2K |   HS06 |    TB |
--------------------------------------------------------------------------
   1.  | AGLT2           |  1,570 | 10,400 |     0 |
   2.  | AGLT2_CE_2      |    100 |    640 |     0 |
   3.  | AGLT2_SE        |      0 |      0 | 1,060 |
--------------------------------------------------------------------------
Total: | US-AGLT2        |  1,670 | 11,040 | 1,060 |
--------------------------------------------------------------------------
   4.  | BU_ATLAS_Tier2  |  1,910 |      0 |   400 |
--------------------------------------------------------------------------
Total: | US-NET2         |  1,910 |      0 |   400 |
--------------------------------------------------------------------------
   5.  | BNL_ATLAS_1     |      0 |      0 |     0 |
   6.  | BNL_ATLAS_2     |      0 |      0 |     1 |
   7.  | BNL_ATLAS_SE    |      0 |      0 |     0 |
--------------------------------------------------------------------------
Total: | US-T1-BNL       |      0 |      0 |     1 |
--------------------------------------------------------------------------
   8.  | MWT2_IU         |  3,276 |      0 |     0 |
   9.  | MWT2_IU_SE      |      0 |      0 |   179 |
  10.  | MWT2_UC         |  3,276 |      0 |     0 |
  11.  | MWT2_UC_SE      |      0 |      0 |   200 |
--------------------------------------------------------------------------
Total: | US-MWT2         |  6,552 |      0 |   379 |
--------------------------------------------------------------------------
  12.  | OU_OCHEP_SWT2   |    464 |      0 |    16 |
  13.  | SWT2_CPB        |  1,383 |      0 |   235 |
  14.  | UTA_SWT2        |    493 |      0 |    15 |
--------------------------------------------------------------------------
Total: | US-SWT2         |  2,340 |      0 |   266 |
--------------------------------------------------------------------------
Total: | All US ATLAS    | 12,472 | 11,040 | 2,106 |
--------------------------------------------------------------------------
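For readers checking the roll-up, the federation totals and the overall total are simple column sums of the per-resource rows. A minimal sketch of that arithmetic (Python used only for illustration; the tuples copy the KSI2K/HS06/TB values from the table above):

    # Per-resource (KSI2K, HS06, TB) values copied from the capacity table.
    resources = {
        "US-AGLT2":  [(1570, 10400, 0), (100, 640, 0), (0, 0, 1060)],
        "US-NET2":   [(1910, 0, 400)],
        "US-T1-BNL": [(0, 0, 0), (0, 0, 1), (0, 0, 0)],
        "US-MWT2":   [(3276, 0, 0), (0, 0, 179), (3276, 0, 0), (0, 0, 200)],
        "US-SWT2":   [(464, 0, 16), (1383, 0, 235), (493, 0, 15)],
    }
    # Column sums per federation, then across all federations.
    totals = {fed: tuple(map(sum, zip(*rows))) for fed, rows in resources.items()}
    grand = tuple(map(sum, zip(*totals.values())))
    print(totals["US-AGLT2"])   # (1670, 11040, 1060)
    print(grand)                # (12472, 11040, 2106)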