Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=99202
1) 6/16: OU_OCHEP_SWT2 DDM transfers failing with: AGENT error during ALLOCATION phase: [CONFIGURATION_ERROR]. From Horst: Bestman had hung up for some reason - restarted. ggus 59114 & RT 17254 (both closed), eLog 13819.
2) 6/16 - 6/17: AGLT2 - bad disk in one of the RAID arrays causing DDM transfer errors. From Bob: Same disk shelf as last night failed again, same disk. Off line from 8am-11:40am EDT. Disk removed from array and system rebooted. srmwatch looks good since reboot. Replacement disk for RAID-6 array due here tomorrow.
3) 6/17: Job failures at NET2: Error details: pilot: Too little space left on local disk to run job: 1271922688 B (need > 2147483648 B). Unknown transExitCode error code 137. From Saul: This is a low level problem that we know about. It's caused because the local scratch space used by some production jobs has been gradually increasing over time causing a problem on some of our nodes with small scratch volumes. We're working on a solution and are watching for these in the mean time. ggus 59145 (closed), eLog 13827.
4) 6/18: NET2_DATADISK, NET2_MCDISK - DDM errors like: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries]. From John & Saul: We had a 1 hour outage to upgrade one of our GPFS volumes. All systems are back and files are arriving now. ggus 59203 (closed), eLog 13855.
5) 6/19 - 6/20: NET2 - "No space left on device" errors at NET2_DATADISK & MCDISK. From John & Saul: There has been a big burst of data arriving at NET2 and our DATADISK and MCDISK space tokens have run out of space. Armen and Wensheng have been helping us with this today, but since we can write data very fast, our space tokens can fill up very quickly. There is far more subscribed than free space, so we need some DDM help. ggus 59220 (in progress), eLog 13888/90.
6) 6/22: Upgrade of core network routers at BNL completed. No impact on services. eLog 13941.
Follow-ups from earlier reports:
(i) 4/23: OU sites were set off-line in advance of major upgrades -- from Horst: We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning. So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week? I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade. eLog 11813. Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon. Update 6/10, from Horst: We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system. Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro). As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
(ii) 6/5: IllinoisHEP - jobs failing with the error (from the pilot log): |Mover.py | !!FAILED!!3000!! Exception caught: Get function can not be called for staging input files: 'module' object has no attribute 'isTapeSite'. ggus 58813 (in progress), eLog 13468. Update, 6/21, from Dave at Illinois: I believe this problem has been solved. The problem was due to the DQ2Clients package in the AtlasSW not being properly updated at my site. Those problems have been resolved and the package is now current. ggus ticket closed.
(iii) 6/12: WISC - job failures with errors like: FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir. Update 6/15 from Wen: The CS room network was still messed up by a power cut and the main servers are still not accessible. Now I drag out some other machines to start these services. I hope I can get the main server as soon as possible. Savannah 115123 (open), eLog 13790.
(iv) 6/14 - 6/16: SWT2_CPB - problem with the internal cluster switch stack. Restored once early Monday morning, but the problem recurred after ~ 4 hours. Working with Dell tech support to resolve the issue. ggus 59006 & RT 17220 (open), eLog 13776. Update, 6/17: one of the switches in the stack was replaced, and this appears to have solved the problem, as the stack has been stable since then. ggus and RT tickets closed.
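The pilot message in item 3 (6/17, NET2) compares free local disk space against a 2147483648 B minimum. As a rough illustration only, a site could run a similar pre-check on its scratch volumes with a sketch like the one below. This is not the pilot's actual code: the function names and the default path are hypothetical, and only the threshold value comes from the error message.

```python
import shutil

# Threshold taken from the pilot error message quoted in item 3 (2 GiB).
MIN_FREE_BYTES = 2147483648


def has_enough_scratch(free_bytes, min_free_bytes=MIN_FREE_BYTES):
    """Return True when free space strictly exceeds the required minimum.

    The strict comparison mirrors the "need > 2147483648 B" wording.
    """
    return free_bytes > min_free_bytes


def check_scratch(path="."):
    """Hypothetical helper: report free bytes on the volume holding `path`."""
    free = shutil.disk_usage(path).free
    return has_enough_scratch(free), free


if __name__ == "__main__":
    ok, free = check_scratch(".")
    print(f"free={free} B, enough={ok}")
```

With the value from the failed job, `has_enough_scratch(1271922688)` is false, which matches the reported failure.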
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=99883
1) SWT2_CPB: beginning on 6/22 issues with gatekeeper being heavily loaded, apparently due to activity from the condor grid monitor agents. Ongoing discussions with Xin, Jamie to diagnose this problem.
2) 6/23: SLAC - ~half-day outage to patch and restart a NFS server hosting ATLAS releases. Completed. eLog 13990.
3) 6/24: SLAC - DDM errors such as: FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries] From Wei: We have blacklisted SLACXRD_GROUPDISK because it ran out of disk space. However, the effect will not be immediate. Our SRM will likely be overrun once in a while due to the large number of srmGetSpaceTokens and srmGetSpaceMetadata requests in a short period. This is because there is still a large number of transfer requests already in DQ2 SS queues for SLACXRD_GROUPDISK. ggus 59384 (closed), eLog 13991.
4) 6/25: Job failures at NET2 with errors like: 25 Jun 14:24:09 | /atlasgrid/Grid3-app/atlas_app/atlas_rel/15.8.0/cmtsite/setup.sh: No such file or directory 25 Jun 14:24:09 | runJob.py | !!WARNING!!2999!! runJob setup failed: installPyJobTransforms failed. From Saul: I suddenly realized that I probably caused this problem my own self. Around the time of the errors, I was making an egg demo that would have been hitting the release file system pretty hard ( http://egg.bu.edu/15.8.0-egg/index.html ). We'll confirm, but there was probably nothing wrong otherwise. No additional failed jobs observed.
5) 6/25: BNL - file transfer errors such as: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long]. From Hiro: BNL dCache does not allow the logical file name longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of file name. (Users should not put all metadata of files in the filename itself.) I have contacted the DQ2 developers to limit the length. Savannah 69217, eLog 14016.
6) 6/26: BNL - jobs stuck in the "waiting" state due to missing file data10_7TeV.00155550.physics_MinBias.recon.ESD.f260._lb0156._0001.1 from dataset data10_7TeV.00155550.physics_MinBias.recon.ESD.f260. Pavel subscribed BNL, and the files are now available. Savannah 69223, eLog 14024.
7) 6/28-29: OU_OCHEP_SWT2 - file transfer errors: FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries]. From Horst: [Our head node, tier2-01, had crashed with a kernel panic. It's back up now, and a couple of SRM and LFC test commands succeeded fine. Can you please try again, and close this ticket, if it works again now?] No recent errors of this type observed - ggus 59490 & RT 17324 (closed), eLog 14118.
8) 6/28-29: NET2 - job failures with the error: pilot: installPyJobTransforms failed: sh: /atlasgrid/Grid3-app/atlas_app/atlas_rel/15.6.10/cmtsite/setup.sh: No such file or directory. From Saul: One of our sysadmins mistakenly mounted an old copy of the releases area on some of our nodes, so it's not surprising that a recent production cache is missing. These nodes have been taken offline and the correct releases will be remounted. I'll keep the ticket open until these nodes are back with the right mounts. Issue resolved, ggus 59439 (closed), eLog 14078.
9) 6/29: Proxies on all DDM VO boxes at CERN expired. Proxies renewed as of ~8:30 a.m. CST. eLog 14109.
10) 6/29: MWT2_IU maintenance outage completed as of ~4:00 p.m. CST. From Sarah: We will be applying software updates to the MWT2_IU gatekeeper, with the goal of performance and job scheduling improvements.
11) 6/30: Apparently a recurrence of the "long filename" issue in 5) above. From Michael: The problem is caused by an overload situation of the storage system's namespace component. The likely reason for the high load are user analysis jobs requesting to create files with very long file names. These operations fail and are retried at high rate. Experts are looking into ways to fix the problem. ggus 59567 (closed), eLog 14141/33.
Follow-ups from earlier reports:
(i) 4/23: OU sites were set off-line in advance of major upgrades -- from Horst: We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning. So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week? I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade. eLog 11813. Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon. Update 6/10, from Horst: We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system. Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro). As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
(ii) 6/12: WISC - job failures with errors like: FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir. Update 6/15 from Wen: The CS room network was still messed up by a power cut and the main servers are still not accessible. Now I drag out some other machines to start these services. I hope I can get the main server as soon as possible. Savannah 115123 (open), eLog 13790.
(iii) 6/19 - 6/20: NET2 - "No space left on device" errors at NET2_DATADISK & MCDISK. From John & Saul: There has been a big burst of data arriving at NET2 and our DATADISK and MCDISK space tokens have run out of space. Armen and Wensheng have been helping us with this today, but since we can write data very fast, our space tokens can fill up very quickly. There is far more subscribed than free space, so we need some DDM help. ggus 59220 (in progress), eLog 13888/90. Update, 6/26: additional space made available, MCDISK and DATADISK now o.k. ggus 59220 closed.
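Items 5 and 11 (BNL, "File name is too long") turn on the 199-character limit on logical file names that Hiro quotes for BNL dCache. A user-side pre-check before submitting output could look like the sketch below. It is only an illustration: the helper names are hypothetical, and it assumes the limit is counted in characters of the logical file name, which the notes state but do not detail.

```python
# Limit quoted by Hiro for BNL dCache; names up to 199 characters are
# assumed to be accepted, anything longer rejected.
MAX_LFN_LENGTH = 199


def lfn_too_long(lfn, limit=MAX_LFN_LENGTH):
    """Return True if the logical file name exceeds the limit."""
    return len(lfn) > limit


def split_by_length(lfns, limit=MAX_LFN_LENGTH):
    """Hypothetical helper: separate acceptable names from over-long ones."""
    accepted = [name for name in lfns if not lfn_too_long(name, limit)]
    rejected = [name for name in lfns if lfn_too_long(name, limit)]
    return accepted, rejected


if __name__ == "__main__":
    # File name taken from item 6; well under the limit.
    example = "data10_7TeV.00155550.physics_MinBias.recon.ESD.f260._lb0156._0001.1"
    print(len(example), lfn_too_long(example))
```

Rejecting over-long names before the transfer is requested avoids the fail-and-retry load on the namespace component that Michael describes in item 11.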
USATLAS Throughput Meeting - June 29, 2010
==========================================
Attending: Shawn, Andy, Dave, Sarah, Aaron, Saul, Karthik, Hiro, Philippe
Excused: Jason
0) Action items...reminder about the list from the agenda
1) perfSONAR status: New Dell R410 is online with current perfSONAR release. Will be used for testing. Status of work on improving responsiveness of perfSONAR. Next release has some SQL improvements optimizing data access. Currently the next release (V3.2) is under "internal" testing. A few weeks from now USATLAS can start testing.
Issues for discussion from perfSONAR results:
a) Sarah noted UC has a longstanding issue with outbound larger than inbound (same as OU).
b) Problem with services at OU stopping. Possibly just a display issue? Doesn't seem to be the case since there are matching discontinuities in the data. Log files have been sent; still awaiting resolution.
c) Philippe noted that MSU-UC testing shows larger bandwidth "outbound" (900 MSU-UC vs. 600 UC-MSU) while for MSU-OU it is opposite (400 MSU-OU vs. 800 OU-MSU). Checking the UM and MSU perfSONAR nodes shows:
from MSU and UM
AGL to UChicago > UChi to AGL
AGL to OU < OU to AGL
looking from BNL, same thing
BNL to UChicago > UChi to BNL
BNL to OU < OU to BNL
looking from IU, same thing
IU to UChicago > UChi to IU
IU to OU < OU to IU
d) David noted that the campus network is slow (perfSONAR found this issue). Campus was notified, and the problem was then identified as a missed firmware upgrade. Will be fixed tomorrow. There is also an ICCN vs. ESnet difference: the path over ESnet shows asymmetry. Testing to ESnet (Chicago) is good from Illinois, but testing to the next "hop", Cleveland, shows a big asymmetry (Ill->Cleveland 900 Mbps, but Cleveland->Ill is 200-400 Mbps).
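The directional comparisons in c) and d) can be made mechanical. The sketch below takes the two rates quoted for MSU-UC and MSU-OU and reports the faster direction and the ratio between the two. The function names and the 1.25 threshold are hypothetical, and the rates in c) are assumed to be in Mbps (the notes give units only for the Illinois-Cleveland numbers).

```python
# (source, destination, source->destination rate, destination->source rate)
# Values from item c); units assumed to be Mbps.
PAIRS = [
    ("MSU", "UC", 900, 600),
    ("MSU", "OU", 400, 800),
]

# Hypothetical cut-off: ratios at or above this are reported as asymmetric.
ASYMMETRY_THRESHOLD = 1.25


def asymmetry_ratio(forward, reverse):
    """Ratio of the larger rate to the smaller one; 1.0 means symmetric."""
    low, high = sorted((forward, reverse))
    if low <= 0:
        raise ValueError("rates must be positive")
    return high / low


def flag_asymmetric(pairs, threshold=ASYMMETRY_THRESHOLD):
    """Return (faster direction, ratio) for each pair above the threshold."""
    flagged = []
    for src, dst, forward, reverse in pairs:
        ratio = asymmetry_ratio(forward, reverse)
        if ratio >= threshold:
            faster = f"{src}->{dst}" if forward > reverse else f"{dst}->{src}"
            flagged.append((faster, ratio))
    return flagged


if __name__ == "__main__":
    for direction, ratio in flag_asymmetric(PAIRS):
        print(f"{direction} is faster by a factor of {ratio:.2f}")
```

For the Illinois-Cleveland case in d), the same function gives a ratio between 2.25 and 4.5 depending on which end of the 200-400 Mbps range is used.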
2) Transaction rate capability: Hiro has created new plots of FTS details. Follow "Show Plots" in the following list. A sample test has been run: a transfer of 1K files of 1 MB each. The first plot is the histogram of transfer time per file, while the second plot is the histogram of transfer time plus time in the queue. As a result, the second one is indicative of SRM performance. However, it is mostly controlled by the FTS channel parameter for the number of concurrent transfers. Still, you can generally see how many files your site should be able to get (if the files are small).
AGLT2
http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=8b747104-83a9-11df-9d63-f0c52a5177e3
UC
http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=8c56cffa-83ab-11df-9d63-f0c52a5177e3
SLAC
http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=f69bca7f-83ad-11df-9d63-f0c52a5177e3
Other sites will follow (doing that right now). Hiro will set up the test to use the same concurrency at every site, and will also add a plot of "overhead" time per transfer.
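For sites that want to reproduce the two histograms from their own transfer records, a minimal sketch follows. It assumes each record carries submit, start and finish timestamps in seconds; those field names, the bin width, and the sample values are all made up for illustration and are not taken from the FTS monitor.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter()
    for value in values:
        counts[int(value // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


def timing_histograms(records, bin_width=5):
    """Return (transfer-time histogram, transfer-plus-queue-time histogram)."""
    transfer = [r["finish"] - r["start"] for r in records]
    total = [r["finish"] - r["submit"] for r in records]
    return histogram(transfer, bin_width), histogram(total, bin_width)


# Synthetic records, not measured data.
SAMPLE = [
    {"submit": 0, "start": 2, "finish": 5},    # transfer 3 s, total 5 s
    {"submit": 0, "start": 4, "finish": 8},    # transfer 4 s, total 8 s
    {"submit": 0, "start": 10, "finish": 17},  # transfer 7 s, total 17 s
]

if __name__ == "__main__":
    transfer_hist, total_hist = timing_histograms(SAMPLE)
    print("transfer time:", transfer_hist)
    print("transfer + queue time:", total_hist)
```

Because queue time depends on how many transfers the channel runs concurrently, comparing the second histogram across sites is only meaningful once the concurrency setting is the same everywhere.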
3) Alerting progress: Shawn or Sarah? No progress yet. Andy will send documentation links for using clients to access perfSONAR data (Hiro inquired about how to best access the existing data).
4) Site reports: Open forum for reports on throughput related items. Hiro reports that PANDA will now be subscribing data and will be potentially getting datasets from sources outside the hierarchy. This means that better monitoring and debugging is critical. Default will be to rely on DQ2 to select the source for transfers. Will need watching.
Please send along any corrections or additions to the mailing list. Next meeting in 2 weeks (July 13).
Shawn
| Site | Squid Installed | Squid Works | Fail-Over Works |
| AGLT2 | Yes | Yes | Yes |
| ANL T3 | Test jobs on analy queue don't start. | | |
| BNL | Yes | Yes | No failover |
| Duke | Missing 15.8.0 | | |
| Illinois | Test jobs on build failed / don't start | | |
| MWT2 IU | Yes | Yes | Not tested |
| UC | Yes | Yes | Yes |
| NET2 BU | Yes | Yes | Yes |
| HU | Test jobs on analy queue don't start | | |
| SWT2 CPB | Yes | Yes | No |
| UTA | Test job build failed | | |
| OU | No | Still upgrading hardware | |
| WT2 | Yes | Yes | Yes |