
AnalysisStep09PostMortem

Here we collect the analysis job issues seen during STEP09, as well as tests and improvements made after STEP09.

STEP09 results/failures

BNL

  • Error details: pilot: Too little space left on local disk to run job.

From Michael:

There were/are two issues:
- There is a long history, with evidence, that PanDA (the pilot?) sometimes does not clean the temporary workspace 
aggressively enough, or the cleanup fails for whatever reason after jobs complete.
- Some of the older machines were configured with only a small temporary work area, so it fills up rather quickly. 
Though there is much more disk space available on these nodes, most of it is currently devoted to dCache pools. This 
configuration will be changed in late July, when we remove those pools from the WNs. Chris is planning to re-partition 
the disk space so that more temporary space will be available in the future. Still, the issue of a sensible cleanup 
policy, and a reliable implementation thereof, is likely to remain.
From Paul: see https://savannah.cern.ch/bugs/?52194
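
As a rough illustration of the space check and cleanup being discussed (this is not the actual pilot code; the scratch path, free-space threshold, and age cutoff are hypothetical), a minimal Python sketch:

  # Minimal sketch of a scratch-space check plus stale-workdir cleanup.
  # NOT the actual pilot implementation; path, threshold and age are hypothetical.
  import os, shutil, time

  SCRATCH = "/scratch"      # hypothetical local work area
  MIN_FREE_GB = 5           # refuse to start a job below this
  MAX_AGE_HOURS = 48        # remove leftover workdirs older than this

  def free_gb(path):
      st = os.statvfs(path)
      return st.f_bavail * st.f_frsize / 1024.0**3

  def clean_stale_workdirs(path, max_age_hours):
      cutoff = time.time() - max_age_hours * 3600
      for name in os.listdir(path):
          d = os.path.join(path, name)
          if os.path.isdir(d) and os.path.getmtime(d) < cutoff:
              shutil.rmtree(d, ignore_errors=True)

  if free_gb(SCRATCH) < MIN_FREE_GB:
      clean_stale_workdirs(SCRATCH, MAX_AGE_HOURS)
      if free_gb(SCRATCH) < MIN_FREE_GB:
          raise SystemExit("Too little space left on local disk to run job")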

  • Error details: pilot: Get error: dccp failed with output: ec 65280, output dccp failed

SWT2

Three main issues with STEP09 jobs at SWT2_CPB:
  • Many jobs failed with errors related to xrootd protocol with error messages like:
R__unzip: error in inflate (zlib)
Error in <TBasket::ReadBasketBuffers>: fNbytes = 719, fKeylen = 101, fObjlen = 2040, noutot = 0, nout=0, 
nin=618, nbuf=2040 Error in <TBranchElement::GetBasket>: File: root://xrdb.local//xrd/mcdisk/mc08/
AOD/mc08.105009.J0_pythia_jetjet.merge.AOD.e344_s479_d150_r642_t53_tid058726/
AOD.058726._00007.pool.root.1 at byte:1384774962, branch:m_event_type.m_user_type, entry:6792, 
badread=0
This error is related to a problem in the ROOT release and can potentially be remedied by modifying LD_LIBRARY_PATH (see the read-check sketch after this list).
  • Some jobs are failing because the pilot considers them to be looping. This is still under investigation.
  • Overall low performance in events/sec. The most telling feature is that the network links coming out of the data servers were saturated at 1 Gbps. We are looking into implementing 10 Gbps links.
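
A minimal sketch of how one might check readability of the affected file over xrootd, assuming PyROOT is available in the environment; the URL is the one from the error message above, and the tree name "CollectionTree" is an assumption:

  # Sketch: try to read an AOD file over xrootd to reproduce/verify the error.
  # Assumes PyROOT is installed; "CollectionTree" is an assumed tree name.
  import ROOT

  url = ("root://xrdb.local//xrd/mcdisk/mc08/AOD/"
         "mc08.105009.J0_pythia_jetjet.merge.AOD.e344_s479_d150_r642_t53_tid058726/"
         "AOD.058726._00007.pool.root.1")

  f = ROOT.TFile.Open(url)
  if not f or f.IsZombie():
      raise SystemExit("could not open %s" % url)

  tree = f.Get("CollectionTree")           # assumed tree name
  nbad = 0
  for i in range(int(tree.GetEntries())):
      if tree.GetEntry(i) <= 0:            # GetEntry returns bytes read; <= 0 means trouble
          nbad += 1
  print("entries: %d, failed reads: %d" % (tree.GetEntries(), nbad))
  f.Close()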

SLAC

Total failure rate 11% (1630 jobs)

  • 13.3% (216) ZFS hung due to failed SATA disk (3-4 June).
    • Problem resolved in later Solaris/ZFS.
  • 15.0% (244) Thumpers overrun by analysis jobs
    • Direct reading and small read blocks (200-300 bytes) may amplify the problem. (IN2P3 reported the small-read-block and frequent random-seek issue; the report from AGLT2 (below) also seems to indicate this issue on ROOT files copied to batch nodes.)
    • Put controls on batch job submission to prevent jobs from piling up against a single Thumper (see the throttling sketch after this failure list).
    • Tuning ZFS to improve performance.
    • Work with the Xrootd team to develop a long-term plan to prevent clients from overrunning the underlying storage hardware.

(The two issues above produce identical "file not found" error messages, and are also believed to be the cause of the SRM timeouts.)

  • 22.9% (374) DDM errors: two types of DDM-related errors; not sure what they mean:
    • ddm: Adder._updateOutputs() XML is inconsistent with filesTable
    • ddm: Setupper._setupDestination() could not register...
  • 24.5% (400) LFC access timeouts: most were due to the cross-lab cyber security system falsely flagging the LFC host for port scanning. This should now have been addressed.
  • 22.9% (373) Unclassified:
    • Athena crash:
      • "R__unzip: error in header, Error in : File: root://... at byte:0, branch:m_DataHeader, entry:0, badread=0, nerrors=0, basketnumber=1".
      • A similar error was seen by another experiment. Believed (with some doubt) to be a cache replacement bug in the Xrootd client package; addressed in Xrootd release 20090623. Also seen at SWT2.
    • "Not a ROOT file": the affected ROOT files have the correct Adler32 checksum and can be opened by ROOT. We have seen this problem recently and are working with an Athena developer.
    • Both affect only sites doing direct data reading from Xrootd storage.
  • 1.4% (23) Miscellaneous.
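
A purely illustrative sketch of the throttling idea mentioned above (deferring dispatch when too many running jobs read from the same data server); the per-server limit and helper structures are hypothetical, not the actual SLAC batch configuration:

  # Illustrative only: defer job dispatch when too many running jobs
  # already read from the same data server (Thumper).
  from collections import Counter

  MAX_JOBS_PER_SERVER = 50   # hypothetical per-Thumper limit

  def schedule(pending_jobs, running_jobs):
      """pending_jobs/running_jobs: lists of (job_id, data_server) tuples."""
      load = Counter(server for _, server in running_jobs)
      dispatched, deferred = [], []
      for job_id, server in pending_jobs:
          if load[server] < MAX_JOBS_PER_SERVER:
              load[server] += 1
              dispatched.append(job_id)
          else:
              deferred.append(job_id)   # keep queued until the server drains
      return dispatched, deferred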

Other issues:

  • DQ2 not registering transferred data files to LFC. Consulted DQ2 developers; we will provide evidence the next time it happens.
    • Update: this may be caused by MySQL db corruption.
  • Disappearing/broken DDM subscriptions, seen at other sites as well.
  • Nurcan reported the following, not sure which failure category it contributes to:
    • SLAC failed test-436. It looked as if all runAthena jobs failed because their associated build job failed, but that is not the case; the build job was successful. Tadashi reported that two build jobs at two different sites tried to build the same lib file, and this kind of duplication is disallowed. Nurcan will follow up with Dan.

MWT2

  • The STEP09-related efficiency was 97%, and jobs mostly ran without disruptions.
  • Stats:
| Site | # Jobs | # Successful | # Failed | Efficiency | # files | # events | Hz | CPU/Wall |
| ANALY_MWT2 | 35195 | 34076 | 1119 | 0.968 | 117075 | 1051568615 | 6.7 | 37.9 |
  • Our main issue during this period was not from STEP09, but from a user of the ANALY_MWT2 queue who attempted to run 100 jobs, each reading (dccp-copying to worker-node disk) 200 input files. 99% of these jobs failed while reading file #170; the only job that succeeded had ~150 input files.
  • We studied the time distribution for situations with large number of jobs reading lots of files.
  • Have added additional mover queues to storage pools so that "slow" dcap pools can keep connections open over the length of a job while "fast" dccp mover queues can stage data off pools.
  • The leading failure mode was "dccp/pcache" copy related; these were probably stuck transfers resulting from highly loaded pools (see the retry sketch after the error table below).
  • We had a transfer backlog at the end of STEP09 because the number of FTS channels was too low; after adjusting it back up, the backlog cleared quickly.
  • Errors reported:

    Errors at ANALY_MWT2

    | Site | Exit Code (or Pilot Code) | Failure Type | # | Ex. Job |
    | ANALY_MWT2 | 0 | | 55 | 1010284095 |
    | ANALY_MWT2 | 0 | Adder._updateOutputs | 273 | 1008819512 |
    | ANALY_MWT2 | 0 | Copy command returne | 38 | 1009886943 |
    | ANALY_MWT2 | 0 | lcg-cp get was timed | 43 | 1009727230 |
    | ANALY_MWT2 | 0 | Log put error: Copy | 68 | 1008881084 |
    | ANALY_MWT2 | 0 | No such file or dire | 26 | 1008761169 |
    | ANALY_MWT2 | 1099 | Get error: dccp fail | 532 | 1008805712 |
    | ANALY_MWT2 | 1099 | Get error: Unexpecte | 8 | 1010495576 |
    | ANALY_MWT2 | 1103 | Get error: No such f | 3 | 1010912553 |
    | ANALY_MWT2 | 1106 | Payload stdout file | 3 | 1010520150 |
    | ANALY_MWT2 | 1111 | Exception caught in | 12 | 1010829254 |
    | ANALY_MWT2 | 1151 | Get error: dccp get | 4 | 1008819480 |
    | ANALY_MWT2 | 1156 | Put error: Copy comm | 2 | 1009808416 |
    | ANALY_MWT2 | 1156 | Put error: lcg-cp ge | 1 | 1009799031 |
    | ANALY_MWT2 | 1164 | Get error: No such f | 13 | 1009234498 |
    | ANALY_MWT2 | 1169 | Put error: lfc_creat | 1 | 1010217329 |
    | ANALY_MWT2 | 1171 | Get error: Remote an | 36 | 1009260886 |
    | ANALY_MWT2 | 1175 | Get error: Remote an | 1 | 1009715719 |
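
Many of the 1099 failures above were stuck dccp transfers. As an illustration only (not the pilot's actual stage-in code; the timeout, retry count, and back-off are arbitrary, and only the basic "dccp source destination" usage is assumed), a retry wrapper of the kind one might put around the copy:

  # Illustrative retry wrapper around a dccp stage-in.
  import subprocess, time

  def dccp_with_retry(src, dst, retries=3, timeout=1800):
      for attempt in range(1, retries + 1):
          try:
              result = subprocess.run(["dccp", src, dst], timeout=timeout)
              if result.returncode == 0:
                  return True
              print("dccp failed (ec %d), attempt %d" % (result.returncode, attempt))
          except subprocess.TimeoutExpired:
              print("dccp timed out after %ds, attempt %d" % (timeout, attempt))
          time.sleep(60)    # back off before hitting a possibly loaded pool again
      return False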

AGLT2

Failure rate was 4.6% and AGLT2 completed 61356 jobs (more than any site except GLASGOW). The failures were varied (see below). Some known issues during STEP09:
  • Get error: Staging input file failed (same issue as below?)
  • Error message: Failed to open config file /pnfs/aglt2.org/atlasuserdisk/user09.JohannesElmsheuser/user09.JohannesElmsheuser.ganga.voatlas49_1244215797.lib._000046/.(config)(dCache)/dcache.conf. Solution: dcache.conf was made world-readable by Shawn.
  • dq2 server errors. CERN DQ2 central catalog had a problem on June 3rd.
  • Speculations on the analysis job load placed on compute nodes. https://www.usatlas.bnl.gov/HyperNews/racf/get/StressTest/22.html
  • We saw that some of the job types created high random IO loads from the input ROOT files. On an 8-core compute node, 2 running jobs saturated the IOPS of the SATA scratch disk. Based on strace output, the job was performing forward and backward seeks in the input ROOT file, and the pattern of seeks and reads apparently made normal IO buffering ineffective. Job efficiency was then limited by the low random-IOPS limit of the compute node rather than its decent sequential IO capacity (see the strace-summary sketch after the error table below).
    • Here is some info on the jobs:
      athena.py -s pre_[UUID].py HighPtViewNtuple_topOptions.py post_[UUID2].py   (PandaID 1010490282)
      Part of jobset 21851; all the jobs for this jobset have gone to AGLT2.

      athena.py -s pre_[UUID].py BUSTop.py post_[UUID2].py   (PandaID 1010507441)
      Part of jobset 21910; again, all jobs in the jobset have gone to AGLT2.
    • Because of the poor CPU usage of these jobs, we are very interested in understanding whether they are representative of user analysis and whether we need to plan Tier2 capacity for such jobs. We also wonder if there is something pathological about these jobs or their input files.
  • Minor problems (not related to STEP09) with our AFS system caused some of the errors below (especially the 1099 errors).
  • Errors reported:

    Errors at ANALY_AGLT2

    | Site | Exit Code (or Pilot Code) | Failure Type | # | Ex. Job |
    | ANALY_AGLT2 | 0 | | 136 | 1009430213 |
    | ANALY_AGLT2 | 0 | Adder._updateOutputs | 440 | 1009035255 |
    | ANALY_AGLT2 | 0 | Job failed: Non-zero | 13 | 1008847122 |
    | ANALY_AGLT2 | 0 | Setupper._setupDesti | 674 | 1009067889 |
    | ANALY_AGLT2 | 1099 | Get error: dccp fail | 1016 | 1008847103 |
    | ANALY_AGLT2 | 1099 | Get error: Unexpecte | 24 | 1008847143 |
    | ANALY_AGLT2 | 1103 | Put error: The pilot | 2 | 1009315782 |
    | ANALY_AGLT2 | 1113 | Get error: getLFCFil | 1 | 1009025283 |
    | ANALY_AGLT2 | 1137 | Copy command returne | 21 | 1009088178 |
    | ANALY_AGLT2 | 1137 | lfc-mkdir failed: LF | 1 | 1009022100 |
    | ANALY_AGLT2 | 1137 | No such file or dire | 3 | 1008977499 |
    | ANALY_AGLT2 | 1137 | Put error: Copy comm | 365 | 1008818780 |
    | ANALY_AGLT2 | 1137 | Put error: lfc-mkdir | 12 | 1008847087 |
    | ANALY_AGLT2 | 1137 | Put error: No such f | 2 | 1009274388 |
    | ANALY_AGLT2 | 1150 | Pilot has decided to | 147 | 1009506981 |
    | ANALY_AGLT2 | 1152 | lcg-cp get was timed | 6 | 1009036269 |
    | ANALY_AGLT2 | 1152 | Put error: lcg-cp ge | 8 | 1009105457 |
    | ANALY_AGLT2 | 1163 | Put error: Voms prox | 21 | 1008846910 |
    | ANALY_AGLT2 | 1165 | Expected output file | 1 | 1009017782 |
    | ANALY_AGLT2 | 1169 | lfc_addreplica faile | 1 | 1010431250 |
    | ANALY_AGLT2 | 1169 | Put error: lfc_creat | 6 | 1008789745 |
    | ANALY_AGLT2 | 1171 | Get error: Remote an | 26 | 1009004764 |
    | ANALY_AGLT2 | 1176 | None|Log put error: | 9 | 1009011486 |
    | ANALY_AGLT2 | 1179 | Get error: Failed to | 34 | 1009280525 |
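
A small sketch of how such an strace capture can be summarized, assuming a log produced with something like "strace -e trace=read,lseek -p <pid> -o trace.log" (the exact options used at AGLT2 are not recorded here, and only SEEK_SET seeks are counted):

  # Sketch: summarize read sizes and seek directions from an strace log.
  import re, sys
  from collections import Counter

  read_re = re.compile(r'^read\(\d+,.*\)\s*=\s*(\d+)')
  seek_re = re.compile(r'^lseek\(\d+,\s*(-?\d+),\s*SEEK_SET\)\s*=\s*(\d+)')

  sizes = Counter()
  forward = backward = 0
  last_offset = None

  for line in open(sys.argv[1]):
      m = read_re.match(line)
      if m:
          sizes[int(m.group(1))] += 1
          continue
      m = seek_re.match(line)
      if m:
          offset = int(m.group(2))
          if last_offset is not None:
              if offset >= last_offset:
                  forward += 1
              else:
                  backward += 1
          last_offset = offset

  small = sum(n for size, n in sizes.items() if size < 1024)
  print("reads: %d (under 1 kB: %d)" % (sum(sizes.values()), small))
  print("seeks: %d forward, %d backward" % (forward, backward))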

NET2

  • Get errors caused by file corruption (adler32 checksum failure) due to a bad network card. This was fixed during STEP09 (see the checksum sketch after this list).
  • lfc_createg fails with "bad magic number." There are a small number of these. Cause under investigation.
  • Two jobs fail with "stdout too large". Cause under investigation.
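
For reference, a minimal sketch of verifying a local copy's adler32 against the catalogued value (the expected checksum and file name below are placeholders):

  # Minimal adler32 check of a local copy against the catalogued checksum.
  import zlib

  def adler32_of(path, blocksize=1024 * 1024):
      value = 1                  # adler32 is seeded with 1
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(blocksize), b""):
              value = zlib.adler32(chunk, value)
      return "%08x" % (value & 0xffffffff)

  expected = "0a1b2c3d"                          # placeholder catalogue value
  path = "AOD.058726._00007.pool.root.1"         # example local copy
  if adler32_of(path) != expected:
      print("adler32 mismatch for %s" % path)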

IO Pattern Studies

  • IO pattern study by IN2P3, provided by Eric Lancon and Jean-Yves Nief, with comments from Eric:
    • disk read: 240 MB/s
    • network: very low, 10 MB/s
    • number of IO operations/s: very high, ~7000
    • read block size: peak around 200 B
    • read position in file vs. time
    • We have been looking at modifying the ZFS setup (on the Thors): changing the RAID mode, changing the block size, suppressing pre-fetch.
    • But there is suspicion of an ATLAS software implementation problem, as suggested by contacts we had with Rene Brun. This latter issue should be looked at carefully by the software team.
    • Note from Wei: according to Jean-Yves, the data on offset and read block size were obtained by running truss (the Solaris equivalent of strace) on reads of a particular file during STEP09.
  • IO pattern study by SLAC on Thumpers and ZFS:
    • Data were obtained by monitoring Xrootd read requests on a single file, sent by one user analysis job over a period of ~10-15 minutes, after STEP09. This particular set of user analysis jobs generates enough IO to make the storage busy; other analysis jobs may not. This user was the only one running Panda analysis jobs at SLAC at the time.
    • Similar to the observation at IN2P3, most read blocks are smaller than a few hundred bytes. There is a mixture of sequential and random IO, based on changes of the read offset (see the TTreeCache sketch after this list).
    • On pages 4 and 5, the bottom of each point is the offset of the read request, the height of the box represents the read block size, and the height of the stick (from the bottom) represents 16 kB.
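
A standard ROOT-level mitigation for this pattern of many small reads is TTreeCache, which gathers the baskets of the branches being read into fewer, larger requests. A minimal PyROOT sketch (the file URL and tree name are examples; whether this helps the specific STEP09 workloads would still need to be measured):

  # Sketch: enable TTreeCache so many small branch reads are gathered into
  # fewer, larger reads. Assumes PyROOT; file URL and tree name are examples.
  import ROOT

  f = ROOT.TFile.Open("root://xrdb.local//path/to/AOD.pool.root")  # example URL
  tree = f.Get("CollectionTree")                                   # assumed tree name

  tree.SetCacheSize(30 * 1024 * 1024)   # 30 MB read-ahead cache
  tree.AddBranchToCache("*", True)      # cache all branches that get read

  for i in range(int(tree.GetEntries())):
      tree.GetEntry(i)                  # reads now come from the cache in large chunks

  f.Close()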

Post-STEP09 Tests

SLAC

AGLT2

MWT2

SWT2

NET2

BNL




Attachments


pdf IO-studying-at-IN2P3.pdf (183.8K) | WeiYang, 09 Jul 2009 - 18:05 | IO pattern study by IN2P3
pdf IO-studying-at-SLAC.pdf (431.4K) | WeiYang, 09 Jul 2009 - 18:05 | IO pattern study by SLAC
 