Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=76917

1) 12/9: Panda server modified to use new db accounts. This temporarily broke attempts to modify the status of sites via the usual 'curl' interface. Fixed by Graeme.

2) 12/9: Some sites noted an increase in the number of pilots waiting in their queues. Possibly due to (from Torre):
The autopilot setup on voatlas60 is the same as it's been for a couple of weeks, but condor there has been tuned up and it seems it is now more effective at getting pilots to the US queues. The motivation for CERN submissions is to have a centrally managed submit point for everyone that provides redundancy for regional submission, so I think we should adapt ourselves to pilots coming from CERN as well as BNL. Whatever the nqueue setting for a queue is, each submitter will maintain that nqueue independently, so two equally successful submitters will result in ~double the pilot flow. Hence I would suggest reducing nqueue (not necessarily by a factor 2) such that pilot flow is reasonable again.

3) 12/10: Job failures at MWT2_IU & IU_OSG with stage-in/out errors -- from Charles:
dCache service at IU was interrupted for some maintenance which took longer than expected. We're back online now. If job recovery is enabled for MWT2_IU (which I believe is the case) these output files should be recoverable. RT 14890, eLog 7892.

4) 12/10 p.m. - 12/11 a.m.: A couple of storage server outages at BNL -- resolved. eLog 7910.

5) 12/15: Pilot update from Paul (v41c):
* Local site mover is now using the --guid option. Requested by Charles.
* Correction for appdir used by CERN-UNVALID, since the previous pilot version caused problems there (pilot v40b used until now). $SITEROOT was used to build the path to the release instead of schedconfig.appdir. CERN-PROD and CERN-RELEASE were not affected since $SITEROOT and appdir both point to the .../release area.
* Pilot options -g and -m can now be used to specify locations and destinations of input and output files in combination with the mv site mover (compatible with Nordugrid). Requested by Predrag Buncic for the CERNVM project.
* Empty copyprefix substrings replaced with a dummy value. Initially caused problems at UTD-HEP due to a misconfiguration in schedconfig.
* STATUSCODE file now created in all getJob scenarios. Requested by Peter Love.
* Value of ATLAS_POOLCOND_PATH dumped in the pilot log. Requested by Rod.
* The xrdcp site mover (written by Eric for use at ANALY_LYON) has been updated to also work at ANALY_CERN.
* Note: There will be at least one more minor pilot release before Christmas.

6) 12/14: From Bob at AGLT2:
At approximately 4:50am EST today, cluster activity at AGLT2 began to ramp down. We discovered processes were hung on dCache admin nodes and probably on a few disk servers as well. At 10:35am cluster activity returned to normal after services were restarted. We expect this will have caused errors in jobs running during this time period.

7) 12/15: Job failures at OU with stage-in errors. These coincided with a pilot update, which exposed some needed updates to the schedconfigdb entries for the site. Alden made the updates to schedconfigdb; Paul is working on a modification to the pilot which should be ready in the next day or so. Site set to 'off-line'. RT #14912. 12/16 a.m. -- problem now appears to be solved, OU set back to 'on-line'.

Follow-ups from earlier reports:
(i) BNL -- US ATLAS conditions oracle cluster db maintenance, originally scheduled for 11/12/09, was postponed until Monday, November 16th, and eventually to the 21st of December.
(ii) BNL -- cyber-security port scans, originally scheduled for December 2/3, have been rescheduled for December 21/22.
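Torre's nqueue argument in item 2 can be sketched numerically. This is only a minimal sketch, assuming every submit host independently keeps nqueue pilots queued and that all hosts are equally effective; the function names and the example value of 30 are hypothetical and are not part of autopilot or Panda.

```python
def total_queued_pilots(nqueue: int, submitters: int) -> int:
    """Pilots waiting at a queue when each submitter maintains nqueue on its own."""
    if nqueue < 0 or submitters < 0:
        raise ValueError("nqueue and submitters must be non-negative")
    return nqueue * submitters


def nqueue_for_target(target_total: int, submitters: int) -> int:
    """Per-submitter nqueue that keeps the combined total at or below target_total."""
    if submitters <= 0:
        raise ValueError("need at least one submitter")
    return target_total // submitters


# Illustrative numbers only: one queue fed from both BNL and CERN.
print(total_queued_pilots(30, 2))
print(nqueue_for_target(30, 2))
```

Halving nqueue is the limiting case for two equally successful submitters; if one submit host is less effective in practice, a smaller reduction may be enough, which matches the "not necessarily by a factor 2" remark.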
Yuri's summary from the weekly ADCoS meeting:
http://www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-(Dec15-21-2009).html

1) 12/16-17: AGLT2 -- file transfer errors -- "locality is UNAVAILABLE" -- resolution (from Shawn):
UMFS07.AGLT2.ORG (hosting dCache pools for DATADISK, CALIBDISK and PRODDISK) again had problems. This was traced to a combination of an old driver and newer firmware. The driver and kernel were updated and the system was rebooted. For the last 1.5 hours SRM has shown no errors. Things seem to be working so I am closing this ticket. eLog 8128, RT 14916.

2) 12/17-18: AGLT2 maintenance outage -- initially an issue with dCache after re-starting -- from Shawn:
We found that jobs using 'dccp' at our site were failing after coming back online from our upgrade. We checked the systems and the /pnfs area was seemingly mounted correctly on the affected nodes, but the 'dccp' copy command would fail like:

    dccp /pnfs/aglt2.org/atlashotdisk/ddo/DBRelease/ddo.000001.Atlas.Ideal.DBRelease.v070302/DBRelease-7.3.2.tar.gz /tmp/test.db
    Failed to open config file /pnfs/aglt2.org/atlashotdisk/ddo/DBRelease/ddo.000001.Atlas.Ideal.DBRelease.v070302/.(config)(dCache)/dcache.conf
    Failed to create a control line
    Failed open file in the dCache.
    Can't open source file : Can not open config file
    System error: No such file or directory

Other nodes would work correctly with the same command. To fix the issue we found we had to 'umount /pnfs' and then 'mount /pnfs' to restore proper functioning. Note that prior to the remount the /pnfs mount seemed to be OK (you could do 'ls /pnfs/aglt2.org' for example) but 'dccp' would fail as above. This must have had something to do with the reboot of the /pnfs headnode creating some kind of "stale" mount during our upgrade today. The unusual thing is that the /pnfs headnode was rebooted before we rebuilt/upgraded our new worker nodes. eLog 8166.

3) 12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VOs -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error "Pilot has decided to kill looping job." Jobs were seemingly "stuck" on the WNs -- problem still under investigation.

4) 12/19: MWT2_IU -- ~65 failed jobs with stage-in/out errors -- quickly resolved -- from Sarah:
We had one missing file, an 'lfc ghost':
/pnfs/iu.edu/atlasproddisk/panda/dis/09/12/19/panda.EVNT.101177.12.19.b3b9be6e-ffc0-4296-9459-f61724e4d4b3_dis1036715239/EVNT.101137._000148.pool.root.1
I've manually fetched it from BNL. We should see these errors stop.

5) 12/19: Discussions about the best way to submit / track change requests for schedconfigdb (Alden, others). New e-mail address: schedconfig@gmail.com

6) 12/20-21: UTD-HEP -- scheduled power outage -- site took this opportunity to upgrade their bestman s/w -- test jobs completed successfully, back to 'online'.

7) 12/21: BNL -- US ATLAS conditions oracle cluster db maintenance: RAM in the cluster nodes will be upgraded from 16GB to 32GB. The intervention will be done in rolling fashion (one node at a time); no database service interruption is expected during this maintenance.

8) 12/21-22: BNL -- cyber-security port scanning. Comment from Hiro:
As several people noticed this morning, many jobs failed due to an LFC error (at 11:09 AM to be exact). This seems to have been caused by the scheduled Nessus security scan. Although the outage was very brief (about 30 seconds), persistent connections from clients to the LFC service seem to have been lost during that time, so jobs that had a connection (or were trying to connect) to the BNL LFC at that moment lost it and failed. LFC itself is working fine after this brief outage.

9) 12/22: UTA_SWT2 -- Maintenance outage (SL5, many other s/w upgrades) is completed. ATLAS s/w releases are being re-installed by Xin (this was necessary since the old storage was replaced). Test jobs have finished successfully -- will resume production once the ATLAS releases are ready.
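The stale-mount symptom in the AGLT2 item above (a directory listing works, but 'dccp' cannot open the dCache config file) suggests a simple per-node check. The following is a minimal sketch, assuming the '.(config)(dCache)/dcache.conf' path from the quoted error message is readable on a healthy mount; the function name is hypothetical and this is not part of dCache or of the site's actual procedure.

```python
import os


def pnfs_mount_looks_stale(pnfs_dir: str) -> bool:
    """Return True if pnfs_dir can be listed but its dCache config file cannot be opened."""
    # Path layout taken from the dccp error message; assumed to exist on a good mount.
    config_path = os.path.join(pnfs_dir, ".(config)(dCache)", "dcache.conf")
    try:
        os.listdir(pnfs_dir)
    except OSError:
        # Not even listable: a different failure than the "looks mounted" case.
        return False
    try:
        with open(config_path, "rb"):
            return False
    except OSError:
        return True
```

A node where this returns True for a directory such as /pnfs/aglt2.org/atlashotdisk would be a candidate for the 'umount /pnfs' then 'mount /pnfs' fix that Shawn reported.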
--------------------------------------------------------------------------------------------------------
This is a report of Installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
---------------------------------------------------------------------------------------------------------
#  | SITE             |   ICC | ICC LL | ICC UL | ICC %Diff |       OSC |  OSC LL |  OSC UL | OSC %Diff |
---------------------------------------------------------------------------------------------------------
ATLAS sites
1  |   AGLT2          | 5,150 |  4,677 |  4,677 |         9 |   645,022 | 542,000 | 542,000 |        15 |
2  | ~ AGLT2_CE_2     |   165 |    136 |    136 |        17 |    10,999 |       0 |       0 |       100 |
3  | ~ BNL_ATLAS_1    | 6,926 |      0 |      0 |       100 | 4,771,823 |       0 |       0 |       100 |
4  | ~ BNL_ATLAS_2    | 6,926 |      0 |    500 |        92 | 4,771,823 |       0 |       0 |       100 |
5  | ~ BU_ATLAS_Tier2 | 1,615 |  1,910 |  1,910 |       -18 |       511 | 400,000 | 400,000 |   -78,177 |
6  | ~ MWT2_IU        |   928 |  3,276 |  3,276 |      -252 |         0 | 179,000 | 179,000 |      -100 |
7  | ~ MWT2_UC        |     0 |  3,276 |  3,276 |      -100 |         0 | 179,000 | 179,000 |      -100 |
8  | ~ OU_OCHEP_SWT2  |   611 |    464 |    464 |        24 |    11,128 |  16,000 | 120,000 |       -43 |
9  | ~ SWT2_CPB       | 1,389 |  1,383 |  1,383 |         0 |     5,953 | 235,000 | 235,000 |    -3,847 |
10 | ~ UTA_SWT2       |   493 |    493 |    493 |         0 |    13,752 |  15,000 |  15,000 |        -9 |
11 | ~ WT2            | 1,377 |    820 |  1,202 |        12 |         0 |       0 |       0 |         0 |
---------------------------------------------------------------------------------------------------------
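The report does not state how %Diff is computed. The sketch below is inferred from the rows of the table, so treat it as an assumption rather than the report's actual code: the difference is taken relative to the calculated value and truncated toward zero, a calculated value of 0 below a nonzero lower limit gives -100, and a value within [LL, UL] gives 0. The function name is hypothetical.

```python
def percent_diff(calculated: float, lower: float, upper: float) -> int:
    """%Diff between a calculated capacity and its LL/UL (formula inferred from the table)."""
    if lower <= calculated <= upper:
        return 0  # within limits, nothing to report
    if calculated == 0:
        return -100  # only reachable when 0 < lower; avoids dividing by zero
    # Compare against whichever limit was violated.
    limit = upper if calculated > upper else lower
    # int() truncates toward zero, which matches the published values.
    return int((calculated - limit) / calculated * 100)


# AGLT2 row: ICC 5,150 against limits 4,677 / 4,677, OSC 645,022 against 542,000 / 542,000.
print(percent_diff(5150, 4677, 4677), percent_diff(645022, 542000, 542000))
```

With the displayed values this reproduces the table except for the MWT2_IU ICC entry, where it gives -253 rather than -252, possibly because the report works from unrounded inputs.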