AnalysisStep09PostMortem
Here we collect the analysis job issues during STEP09 as well as tests/improvements after STEP09.
STEP09 results/failures
BNL
- Error details: pilot: Too little space left on local disk to run job.
From Michael:
There were/are two issues:
- there is a long history with evidence that PanDA (the pilot?) is sometimes not cleaning the temporary workspace
aggressively enough or the cleaning process fails for whatever reason after jobs completed
- as some of the older machines were configured with only a small temporary work area the latter fills up rather quickly.
Though there is much more disk space available on these nodes most of it is currently devoted to dCache pools. This
configuration will be changed in late July when we will remove those pools from the WNs. Chris is planning to re-partition
the disk space such that more temporary space will be available in the future. Still the issue regarding a sensible cleanup
policy and reliable implementation thereof is likely to remain.
From Paul: see at
https://savannah.cern.ch/bugs/?52194
- Error details: pilot: Get error: dccp failed with output: ec 65280, output dccp failed
SWT2
Three main issues with STEP09 jobs at SWT2_CPB:
- Many jobs failed with errors related to xrootd protocol with error messages like:
R__unzip: error in inflate (zlib)
Error in <TBasket::ReadBasketBuffers>: fNbytes = 719, fKeylen = 101, fObjlen = 2040, noutot = 0, nout=0,
nin=618, nbuf=2040 Error in <TBranchElement::GetBasket>: File: root://xrdb.local//xrd/mcdisk/mc08/
AOD/mc08.105009.J0_pythia_jetjet.merge.AOD.e344_s479_d150_r642_t53_tid058726/
AOD.058726._00007.pool.root.1 at byte:1384774962, branch:m_event_type.m_user_type, entry:6792,
badread=0
This error is related to a problem in the root release that can be potentially remedied with modifications to LD_LIBRARY_PATH.
- Some jobs are failing because the pilot consider them to be looping. This is still under investigation.
- Overall low performance on events/sec. The most telling feature is that the network links coming out of the data servers were saturated at 1Gbps. We are looking implementing 10Gbps links.
SLAC
Total failure rate 11% (1630 jobs)
- 13.3% (216) ZFS hung due to failed SATA disk (3-4 June).
- Problem resolved in later Solaris/ZFS.
- 15.0% (244) Thumpers over run by analysis jobs
- Directly reading and small reading blocks (200-300 bytes) may amplify the problem. (IN2P3? reported the small reading block and frequent (random) seeking issue. Report from AGLT2 (below) also seems to indicate this issue on ROOT files copied to batch nodes)
- Put in control on batch job submission to prevent jobs from piling up against a thumper.
- Tuning ZFS to improve performance
- Work with Xrootd team to develop long term plan to prevent clients from overrunning the underline storage hardware.
(the two issues above produce identical error messages "file not found". Believe also the cause of SRM timeout).
- 22.9% (374) DDM errors: Two types of DDM related errors. Not sure what do they mean:
- ddm: Adder._updateOutputs() XML is inconsistent with filesTable
- ddm: Setupper._setupDestination() could not register...
- 24.5% (400) LFC accessing timeout: Most due to cross-lab cyber security system falsely identifying LFC host for port scanning. Should have been addressed.
- 22.9% (373) Unclassified:
- Athena crash:
- "R__unzip: error in header, Error in : File: root://... at byte:0, branch:m_DataHeader, entry:0, badread=0, nerrors=0, basketnumber=1".
- Similar error was seen by another experiment. Believe to be a cache replacement bug in Xrootd client package (with some doubts). Addressed in Xrootd release 20090623. Also seen at US SWT2.
- "Not a ROOT file": ROOT files have correct Adler32 and can be opened by ROOT. Have seen this problem recently. Working with Athena developer.
- Both affect only sites doing direct data reading from Xrootd storage.
- 1.4% (23) Misc:
Other issues:
- DQ2 not registering transferred data files to LFC. Consulted DQ2 developers and will provide fluorescent evident next time it happens.
- Update: this may caused by MySQL? db corruption.
- Disappear/Broken DDM subscriptions, see at other sites as well.
- Nurcan reported the following, not sure which failure category it contributes to:
- SLAC failed test-436. It looks like all runAthena jobs failed since its associated build job failed, however that's not the case, build job is successful. Asked Tadashi, he reported that two build jobs at two different sites tried to build the same lib file, this kind of duplication is disallowed. Nurcan will follow up with Dan.
MWT2
- The step09 releated efficiency was 97% and mostly ran without disruptions.
- Stats:
| Site |
# Jobs |
# Successful |
# Failed |
Efficiency |
#files |
#events |
Hz |
CPU/Wall |
| ANALY_MWT2 |
35195 |
34076 |
1119 |
0.968 |
117075 |
1051568615 |
6.7 |
37.9 |
- Our main issue during this period was not from step09, but from a user of the ANALY_MWT2 queue which attempted to run 100 jobs, each reading (dccp-copying to worker node disk) 200 input files. 99% of these jobs fail reading file #170 in each job. The only job that succeeded had ~150 input files.
- We studied the time distribution for situations with large number of jobs reading lots of files.
- Have added additional mover queues to storage pools so that "slow" dcap pools can keep connections open over the length of a job while "fast" dccp mover queues can stage data off pools.
- leading failure mode were "dccp/pcache" copy related - these were probably stuck transfers that could have resulted from highly loaded pools.
- We had a backlog at the end of step 09 - # FTS channels was too low; after adjusting back up backlog finished quickly.
- Errors reported:
Errors at ANALY_MWT2
AGLT2
Failure rate was 4.6% and AGLT2 completed 61356 jobs (more than any site except GLASGOW). The failures were varied (see below). Some known issues during STEP09:
- Get error: Staging input file failed (same issue as below?)
- Error message: Failed to open config file/pnfs/aglt2.org/atlasuserdisk/user09.JohannesElmsheuser/user09.JohannesElmsheuser.ganga.voatlas49_1244215797.lib._000046/.(config)(dCache)/dcache.conf. Solution: dcache.conf was made world readable by Shawn.
- dq2 server errors. CERN DQ2 central catalog had a problem on June 3rd.
- Speculations on the analysis job load placed on compute nodes. https://www.usatlas.bnl.gov/HyperNews/racf/get/StressTest/22.html
- We saw that some of the job types created high random IO loads from the input root files. On an 8 core compute node, 2 running jobs saturated IOPS of SATA scratch disk. Based on strace output, job was performing forward and backwards seeks in input root file. It seems that the pattern of seeks and reads made the normal IO buffering ineffective. Job efficiency was then limited by the low random IOPS limits of the compute node instead of the decent sequential IO capacity.
- Minor problems (not related to STEP09) with our AFS system caused some of the errors below (especially 1099 errors)
- Errors reported:
Errors at ANALY_AGLT2
NET2
- Get errors caused by file corruption (adler32 checksum failure) due to a bad network card. This was fixed during STEP09.
- lfc_createg fails with "bad magic number." There are a small number of these. Cause under investigation.
- Two jobs fail with "stdout too large". Cause under investigation.
IO Pattern Studying
- IO pattern studying by IN2P3 provided by Eric Lancon and Jean-Yves Nief and comment from Eric:
- disk read : 240MB/s read
- network : very low 10MB/s
- Nb of IO operation / s : very high ~7.000
- block size read : peak around 200B
- read position on file vs time
- We have looking for mofiying ZFS (on Thors) setup : change of RAID mode; change of block size; suppresion of pre-fetch
- But there is suspicion for an ATLAS software implementation problem, as was suggested by contacts we had with Rene Brun. This later issue should be looked at carefully by software team.
- Note from Wei: according to Jean-Yves, the data on offset and read block size were obtained by running "truss (Solaris equivalent of strace)" of a particular file during STEP09
- IO pattern studying by SLAC on Thumpers and ZFS:
- Data obtained by monitoring Xrootd reading requests on a single file, sent by one user analysis job over a period of ~ 10-15 minutes, after STEP09. This particular set of user analysis jobs generate enough IO to make storage busy. Other analysis jobs may not. This user was the only one running Panda analysis jobs at SLAC at the time.
- Similar to the observation at IN2P3? , most reading block are smaller than a few hundred bytes. There is a mixture of sequential and random IO, based on changes of reading offset.
- One page 4 and 5, the bottom of each point is the offset of the read request, the height of the box represents the read block size, and height of the stick (from the bottom) represents 16kbytes.
Post-STEP09 Tests
SLAC
AGLT2
MWT2
SWT2
NET2
BNL
About This Site
Please note that this site is a content mirror of the BNL US ATLAS TWiki. To edit the content of this page, click the Edit this page button at the top of the page and log in with your US ATLAS computing account name and password.
Attachments
IO-studying-at-IN2P3.pdf (183.8K) |
WeiYang, 09 Jul 2009 - 18:05 | IO pattern studying by IN2P3?
IO-studying-at-SLAC.pdf (431.4K) |
WeiYang, 09 Jul 2009 - 18:05 | IO pattern studying by SLAC