r3 - 17 Apr 2009 - 10:51:53 - RobertGardner

AnalyQueueMetrics

Starting with the metrics established by the HammerCloud stress tests.

Overall Running Jobs vs time

  • Time since beginning of job (minutes)
  • Note: these represent only the jobs that have already completed or failed. This plot does not include the presently running jobs.
  • All sites within a cloud

Site Running Jobs vs time

  • Same, broken down by ANALY queue

Overall Efficiency

  • Job efficiency
  • Pie charts with legend: s = submitted, r = running, g = completing, c = completed, f = failed
  • All sites within a cloud
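As an illustration of the efficiency metric, here is a minimal sketch. The page does not define the formula, so this assumes efficiency means completed jobs divided by completed plus failed jobs, and it ignores jobs still in the submitted, running, or completing states. The function name job_efficiency and the dictionary layout keyed by the legend letters are hypothetical.

```python
def job_efficiency(counts):
    """Return completed / (completed + failed), or None if nothing has finished.

    ASSUMPTION: this definition is not stated in the source page.
    `counts` maps the pie-chart legend letters (s, r, g, c, f) to job counts.
    """
    completed = counts.get("c", 0)
    failed = counts.get("f", 0)
    finished = completed + failed
    if finished == 0:
        # No job has reached a final state yet, so efficiency is undefined.
        return None
    return completed / finished


# Hypothetical counts for one cloud.
example_counts = {"s": 5, "r": 10, "g": 2, "c": 90, "f": 10}
efficiency = job_efficiency(example_counts)
```

The same function can be applied per ANALY queue by passing that queue's counts.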

Site Efficiency

  • Same, broken down by ANALY queue

Overall CPU/Walltime

  • CPU/Walltime is the CPU Percent Utilization
  • Histogram, 0-100%
  • Sum for all sites within a cloud
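A minimal sketch of how the CPU/Walltime percentage and its 0-100% histogram could be computed. The function names, the choice of ten bins, and the input values are assumptions for illustration; both times are assumed to be in plain seconds (not normalized units such as kSI2k-seconds).

```python
def cpu_utilization_percent(cpu_seconds, wall_seconds):
    """Return CPU time as a percentage of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * cpu_seconds / wall_seconds


def histogram_0_100(values, n_bins=10):
    """Count percentage values into equal-width bins covering 0-100%.

    Values below 0 go into the first bin; values of 100 or more go into
    the last bin.
    """
    counts = [0] * n_bins
    width = 100.0 / n_bins
    for value in values:
        index = min(int(max(value, 0.0) // width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical (cpu_seconds, wall_seconds) pairs for three jobs.
jobs = [(50, 100), (900, 1000), (30, 120)]
percentages = [cpu_utilization_percent(c, w) for c, w in jobs]
bins = histogram_0_100(percentages)
```

Summing the per-queue bin counts gives the overall histogram for all sites within a cloud.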

Site CPU/Walltime

  • Same, broken down by ANALY queue

Overall Events/Second

  • This is the event rate for the athena execution only, i.e. the denominator is the time from athena start to finish.
  • Histogram, 0-30 Hz
  • Sum for all sites within a cloud
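The event rate described above can be sketched as follows. The function name and the timestamp arguments are hypothetical; the only thing taken from the page is that the denominator is the time from athena start to athena finish.

```python
def event_rate_hz(n_events, athena_start, athena_end):
    """Return events per second over the athena execution only.

    athena_start and athena_end are timestamps in seconds; setup, stage-in
    and stage-out time are deliberately excluded from the denominator.
    """
    duration = athena_end - athena_start
    if duration <= 0:
        raise ValueError("athena_end must be later than athena_start")
    return n_events / duration


# Hypothetical job: 3000 events processed in 300 seconds of athena running time.
rate = event_rate_hz(3000, 100.0, 400.0)
```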

Site Events/Second

  • Same, broken down by ANALY queue

Backend Failure Codes

  • Failure code statistics for the "backend"
  • Broken down by site

Application Failure Codes

  • Failure code statistics for Athena (presumably)
  • Broken down by site

Backend Failure Reasons

  • Similar to failure code statistics for the "backend" - w/ reason string
  • Broken down by site

Number of Events

  • Number of analysis events processed
  • Broken down by site

Number of Files

  • Processed: Number of input files processed by "completed" jobs
  • Expected: Number of input files that were expected to be processed by "completed" jobs

Job Timings

  • Mean Athena Software Setup Time
  • Mean Prepare Inputs Time
  • Mean Athena Running Time
  • Mean Athena Running Time, Normalized to Number of Events (Hz^-1)
  • Mean Output Storage Time
  • Mean Network RX KBps
  • Mean Network RX Bytes Total
  • Max Network RX Bytes Total
  • Min Network RX Bytes Total
  • Overall Software Setup Time (all sites within a cloud)
  • Site Software Setup Time (same, broken down by ANALY queue)
  • Overall Prepare Inputs Time (stage in time for input files to local compute node?)
  • Site Prepare Inputs Time (same, broken down by ANALY queue)
  • Overall Athena Running Time
  • Site Athena Running Time (same, broken down by ANALY queue)
  • Overall Output Storage Time (stage-out time to local storage element?)
  • Site Output Storage Time (same, broken down by ANALY queue)
  • Overall Mean Network Transfer Rates
    • These are mean received KB per second from the start of athena until it finishes.
  • Site Mean Network Transfer Rates (same, broken down by ANALY queue)

Stress test challenge metrics

Beyond the daily summary reports, we expect a number of additional metrics to be of interest for capturing the overall performance during stress testing periods. In addition, US-based stress testing will involve a heterogeneous mix of job types and users.

  • Selected metrics as defined above with time interval selections (day, week, month, year)
  • Running jobs by User (stacked bar charts)
  • Analysis output access metrics
    • GB downloaded vs Time
    • Download speeds MB/s vs Time

Stress tests monitor in Panda

Metrics data from the pilot

  • The pilot collects timing data for each job; for example, this job in the Panda monitor reports:
pilotTiming	1|123|15270|66

15 Apr 2009 12:19:52| ..Time report.................................................................................................
15 Apr 2009 12:19:52| . CPU consumption time      : 20397 kSI2kseconds
15 Apr 2009 12:19:52| . Payload execution time    : 15270 s
15 Apr 2009 12:19:52| . GetJob consumption time   : 1 s
15 Apr 2009 12:19:52| . Stage-in consumption time : 123 s
15 Apr 2009 12:19:52| . Stage-out consumption time: 66 s
15 Apr 2009 12:19:52| ..............................................................................................................

  • Time after creation until execution
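A minimal sketch of parsing the pilotTiming value shown above. The field order (GetJob, stage-in, payload execution, stage-out, all in seconds) is inferred by matching the numbers against the time report in the same log excerpt; it is an assumption, not a documented format. The names PilotTiming and parse_pilot_timing are hypothetical.

```python
from collections import namedtuple

# ASSUMED field order, inferred from the time report in the log excerpt.
PilotTiming = namedtuple(
    "PilotTiming", ["get_job", "stage_in", "payload", "stage_out"]
)


def parse_pilot_timing(value):
    """Split a 'a|b|c|d' pilotTiming string into four integer durations (s)."""
    parts = value.strip().split("|")
    if len(parts) != 4:
        raise ValueError("expected 4 '|'-separated fields, got %d" % len(parts))
    return PilotTiming(*(int(part) for part in parts))


timing = parse_pilot_timing("1|123|15270|66")
total_seconds = sum(timing)
```

Keeping the four durations as named fields makes it straightforward to fill the per-site stage-in, running, and stage-out timing plots listed under Job Timings.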

Job profile metrics

  • # input files for the job
  • # output files for the job
  • total GB staged-in
  • total GB staged-out


-- RobertGardner - 15 Apr 2009
