Title: Lecture 2: Performance Measurement
1. Lecture 2: Performance Measurement
2. Performance Evaluation
- The primary duty of software developers is to create functionally correct programs
- Performance evaluation is the part of software development that produces well-performing programs
3. Performance Analysis Cycle
- Have an optimization phase, just like the testing and debugging phases
- [Cycle] Code development produces a functionally complete and correct program, which then enters a Measure → Analyze → Modify / Tune loop; the result is a complete, correct and well-performing program, ready for usage
4. Goals of Performance Analysis
- The goal of performance analysis is to provide quantitative information about the performance of a computer system
5. Goals of Performance Analysis
- Compare alternatives
  - When purchasing a new computer system, to provide quantitative information
- Determine the impact of a feature
  - In designing a new system or upgrading, to provide before-and-after comparison
- System tuning
  - To find the parameters that produce the best overall performance
- Identify relative performance
  - To quantify the performance relative to previous generations
- Performance debugging
  - To identify and correct performance problems
- Set expectations
  - To determine the expected capabilities of the next generation
6. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
      - Simulation
      - Analytical Modeling
  - Analysis & Reporting
    - Performance metrics
7. Performance Measurement
- Interval Timers
  - Hardware Timers
  - Software Timers
8. Performance Measurement
- Hardware Timers
  - An n-bit counter is driven by a clock of period Tc; the counter value is read from a memory location over the processor-memory bus
  - Time is calculated from two counter readings x1 and x2 as (see the sketch below)

    Time = (x2 - x1) × Tc
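- Example: on x86, the time-stamp counter is such a hardware counter. A minimal sketch using the GCC/Clang __rdtsc() intrinsic; the 3 GHz clock period below is an assumed illustrative value, not one queried from the hardware:

    #include <stdio.h>
    #include <x86intrin.h>              /* __rdtsc() on x86 with GCC/Clang */

    int main(void)
    {
        /* Assumed counter clock of 3 GHz, i.e. Tc = 1/3e9 s. On a real
           machine, use the CPU's invariant-TSC frequency instead. */
        const double Tc = 1.0 / 3.0e9;

        unsigned long long x1 = __rdtsc();    /* first counter read  */
        volatile double s = 0.0;              /* event being timed   */
        for (int i = 0; i < 1000000; i++)
            s += i * 0.5;
        unsigned long long x2 = __rdtsc();    /* second counter read */

        printf("elapsed = (x2 - x1) * Tc = %.6f s\n",
               (double)(x2 - x1) * Tc);
        return 0;
    }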
9. Performance Measurement
- Software Timers
  - Interrupt-based: a clock of period Tc drives a prescaling counter whose output is wired to the processor interrupt input
  - When the interrupt occurs, the interrupt-service routine increments the timer value, which is read by a program
  - Time is calculated from two timer readings x1 and x2 as

    Time = (x2 - x1) × Tc
10. Performance Measurement
- Timer Rollover
  - Occurs when an n-bit counter undergoes a transition from its maximum value 2^n - 1 to zero
  - There is a trade-off between rollover time and accuracy

    Tc      32-bit      64-bit
    10 ns   42 s        5850 years
    1 µs    1.2 hours   0.5 million years
    1 ms    49 days     0.5 × 10^9 years
11. Timers
- Solution
  - Use a 64-bit integer (rolls over after more than half a million years)
  - Or have the timer return two values (see the sketch below)
    - One represents seconds
    - One represents microseconds since the last second
    - With 32-bit values, the rollover period is over 100 years
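- Example: POSIX gettimeofday() returns exactly such a pair of values; a minimal sketch combining them into one floating-point time stamp:

    #include <stdio.h>
    #include <sys/time.h>               /* gettimeofday(), struct timeval */

    int main(void)
    {
        struct timeval tv;              /* tv_sec: seconds; tv_usec:
                                           microseconds since the last second */
        gettimeofday(&tv, NULL);

        /* Combine the two values into a single time stamp in seconds. */
        double now = (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
        printf("now = %.6f s since the Unix epoch\n", now);
        return 0;
    }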
12. Performance Measurement
- Interval Timers
  - T0 ← read current time
  - Event being timed
  - T1 ← read current time
  - Time for the event is T1 - T0
13. Performance Measurement
- Timer Overhead
  - Timing sequence:
    - T1: initiate read_time until the current time is read
    - T2: timer call returns; event begins
    - T3: event runs until it ends; then read_time is initiated again
    - T4: second read_time call until the current time is read
  - Measured time
    - Tm = T2 + T3 + T4
  - Desired measurement
    - Te = T3 = Tm - (T2 + T4) ≈ Tm - (T1 + T2), since T1 ≈ T4
  - Timer overhead
    - Tovhd = T1 + T2
  - Te should be 100-1000 times greater than Tovhd (a sketch for estimating Tovhd follows)
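- A minimal sketch for estimating Tovhd, assuming a gettimeofday-based read_time(): two back-to-back reads enclose an empty event, so their smallest observed difference approximates T1 + T2:

    #include <stdio.h>
    #include <sys/time.h>

    /* Wall-clock read; stands in for the slides' read_time. */
    static double read_time(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    int main(void)
    {
        double best = 1.0e9;
        for (int i = 0; i < 100000; i++) {
            double a = read_time();
            double b = read_time();     /* nothing happens in between */
            if (b - a < best)
                best = b - a;           /* keep the minimum estimate  */
        }
        printf("estimated timer overhead: %.9f s\n", best);
        return 0;
    }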
14. Performance Measurement
- Timer Resolution
  - Resolution is the smallest change that can be detected by an interval timer
  - An event of duration Te with n·Tc < Te < (n+1)·Tc is reported as either n or n+1 clock ticks
  - If Tc is large relative to the event being measured, it may be impossible to measure the duration of the event
15. Performance Measurement
- Measuring Short Intervals
  - Problem: Te < Tc
  - A single measurement then reports either one count (a clock edge falls within the event) or zero counts (no clock edge falls within the event)
16. Performance Measurement
- Measuring Short Intervals
  - Solution: repeat the measurement n times
  - Counting approach: average execution time Te = (m × Tc) / n
    - m: number of 1s measured
  - Total-time approach (see the sketch below): average execution time Te = (Tt / n) - h
    - Tt: total execution time of n repetitions
    - h: repetition overhead
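- A minimal sketch of the total-time approach, assuming a gettimeofday-based timer and a dummy short_event(); the repetition overhead h is ignored here:

    #include <stdio.h>
    #include <sys/time.h>

    static double read_time(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    /* Stand-in for an event shorter than the timer resolution. */
    static volatile double sink;
    static void short_event(void) { sink += 1.0; }

    int main(void)
    {
        long n = 10000000;              /* number of repetitions       */
        double t0 = read_time();
        for (long i = 0; i < n; i++)
            short_event();
        double Tt = read_time() - t0;   /* total time of n repetitions */

        /* Te = Tt / n - h; subtract the loop overhead h for accuracy. */
        printf("average time per event: %.9f s\n", Tt / (double)n);
        return 0;
    }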
17. Performance Measurement
- Time
  - Elapsed time / wall-clock time / response time
    - Latency to complete a task, including disk access, memory access, I/O, operating system overhead, and everything else (includes time consumed by other programs in a time-sharing system)
  - CPU time
    - The time the CPU is computing, not including I/O time or waiting time
  - User time / user CPU time
    - CPU time spent in the program
  - System time / system CPU time
    - CPU time spent in the operating system performing tasks requested by the program
18. Performance Measurement
- UNIX time command
  - Example output: 90.7u 12.9s 2:39 65%
    - 90.7u: user time (90.7 s)
    - 12.9s: system time (12.9 s)
    - 2:39: elapsed time (159 s)
    - 65%: percentage of elapsed time spent on this job ((90.7 + 12.9) / 159 ≈ 65%)
  - Drawbacks
    - Resolution is in milliseconds
    - Different sections of the code cannot be timed
19. Timers
- A timer is a function, subroutine or program that can be used to return the amount of time spent in a section of code
- Two typical usage patterns:

    zero = 0.0;
    t0 = timer(zero);
    < code segment >
    t1 = timer(t0);       /* time = t1 */

    t0 = timer();
    < code segment >
    t1 = timer();         /* time = t1 - t0 */
20. Timers
- Read Wadleigh & Crawford, pp. 130-136, for time, clock, gettimeofday, etc.; a sketch of one possible timer() follows
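- One plausible way to implement such a timer() on top of gettimeofday() (an assumption for illustration; the book's actual code may differ):

    #include <stddef.h>                 /* NULL */
    #include <sys/time.h>

    /* Returns seconds elapsed since t_ref, so timer(0.0) returns the
       current time as a double. */
    double timer(double t_ref)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return ((double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec) - t_ref;
    }

- With this definition the first pattern above works directly: t0 = timer(0.0), run the code segment, and t1 = timer(t0) holds the elapsed seconds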
21. Timers
- Measuring Timer Resolution

    main() {
        ...
        zero = 0.0;
        t0 = timer(zero);
        t1 = 0.0;
        j = 0;
        while (t1 == 0.0) {            /* grow the workload until the   */
            j++;                       /* timer reports a nonzero time  */
            zero = 0.0;
            t0 = timer(zero);
            foo(j);
            t1 = timer(t0);
        }
        printf("It took %d iterations for a nonzero time\n", j);
        if (j == 1)
            printf("timer resolution <= %13.7f seconds\n", t1);
        else
            printf("timer resolution is %13.7f seconds\n", t1);
    }

    foo(n) {
        ...
        i = 0;
        for (j = 0; j < n; j++) i++;
        return(i);
    }
22. Timers
- Measuring Timer Resolution
  - Using clock():
      It took 682 iterations for a nonzero time
      timer resolution is 0.0200000 seconds
  - Using times():
      It took 720 iterations for a nonzero time
      timer resolution is 0.0200000 seconds
  - Using getrusage():
      It took 7374 iterations for a nonzero time
      timer resolution is 0.0002700 seconds
23. Timers
- Spin Loops
  - For codes that take less time to run than the resolution of the timer
  - The first call to a function may require an inordinate amount of time, so the minimum of all times may be desired

    main() {
        ...
        zero = 0.0;
        t2 = 100000.0;                 /* larger than any expected time */
        for (j = 0; j < n; j++) {
            t0 = timer(zero);
            foo(j);
            t1 = timer(t0);
            t2 = min(t2, t1);          /* keep the smallest measurement */
        }
        printf("Minimum time is %13.7f seconds\n", t2);
    }

    foo(n) {
        ...
        < code segment >
    }
24. Profilers
- A profiler automatically inserts timing calls into an application to collect execution-time data
- It is used to identify the portions of the program that consume the largest fraction of the total execution time
- It may also be used to find system-level bottlenecks in a multitasking system
- Profilers may alter the timing of a program's execution
25. Profilers
- Data collection techniques
  - Sampling-based (see the sketch below)
    - These profilers use a predefined clock; at every multiple of this clock tick the program is interrupted and the state information is recorded
    - They give a statistical profile of the program behavior
    - They may miss some important events
  - Event-based
    - Events are defined (e.g. entry into a subroutine) and data about these events are collected
    - The collected information shows the exact execution frequencies
    - It has a substantial amount of run-time overhead and memory requirement
- Information kept
  - Trace-based: the profiler keeps all information it collects
  - Reductionist: only statistical information is collected
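- A minimal sketch of the sampling idea on POSIX, using setitimer()/SIGPROF; a real sampling profiler would record the interrupted program counter at each tick rather than merely counting ticks:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* Sample counter incremented on every profiling tick. */
    static volatile sig_atomic_t samples = 0;

    static void on_sigprof(int sig)
    {
        (void)sig;
        samples++;          /* a real profiler would record the PC here */
    }

    int main(void)
    {
        /* Deliver SIGPROF every 10 ms of consumed CPU time. */
        struct itimerval it = { {0, 10000}, {0, 10000} };
        signal(SIGPROF, on_sigprof);
        setitimer(ITIMER_PROF, &it, NULL);

        volatile double s = 0.0;        /* workload being profiled */
        for (long i = 0; i < 300000000L; i++)
            s += (double)i;

        printf("collected %d samples\n", (int)samples);
        return 0;
    }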
26. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
  - Analysis & Reporting
    - Performance metrics
27. Predicting Performance
- Performance of simple kernels can be predicted to a high degree
- Theoretical performance and peak performance must be close
- It is preferred that the measured performance is over 80% of the theoretical peak performance
28. Performance Metrics
- Time
  - Elapsed time / wall-clock time / response time
  - CPU time
    - User time / user CPU time
    - System time / system CPU time
29. Performance Modeling
    CPU time = (instructions / program) × (cycles / instruction) × (time / cycle)

    CPI = cycles per instruction

    CPU time = instruction count × CPI × (1 / clock rate)
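- A quick numeric instance of the model (all values illustrative, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double instruction_count = 2.0e9;  /* instructions executed at runtime */
        double cpi               = 1.5;    /* average cycles per instruction   */
        double clock_rate        = 2.0e9;  /* 2 GHz => 0.5 ns cycle time       */

        double cpu_time = instruction_count * cpi / clock_rate;
        printf("CPU time = %.2f s\n", cpu_time);   /* 2e9 * 1.5 / 2e9 = 1.5 s */
        return 0;
    }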
30. Performance Modeling
- CPU Performance
  - CPI
    - is an average
    - depends on the design of the micro-architecture (hardwired/microprogrammed control, pipelining)
  - Number of instructions
    - is the number of instructions executed at runtime
    - depends on
      - instruction set architecture (ISA)
      - compiler

    CPU time = instruction count × CPI × (1 / clock rate)
31. Performance Modeling
- CPU Performance
  - Drawbacks
    - In modern computers, no program runs without some operating system running on the hardware
    - Comparing performance between machines with different operating systems would be unfair
32. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
      - Simulation
      - Analytical Modeling
        - Queuing Theory
  - Analysis & Reporting
    - Performance metrics
33. Performance Metrics
- Performance Comparison
  - Relative performance (a worked example follows)

    Performance_X = 1 / Execution time_X

    Performance ratio = Performance_X / Performance_Y = Execution time_Y / Execution time_X
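- A quick worked example of these formulas (the times are made up):

    #include <stdio.h>

    int main(void)
    {
        /* Machine X runs the workload in 10 s, machine Y in 15 s. */
        double time_x = 10.0, time_y = 15.0;

        double perf_x = 1.0 / time_x;   /* performance = 1 / time */
        double perf_y = 1.0 / time_y;

        /* Performance_X / Performance_Y = time_Y / time_X = 1.5 */
        printf("X is %.2f times faster than Y\n", perf_x / perf_y);
        return 0;
    }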
34. Performance Metrics
- Relative Performance
  - If the workload consists of more than one program, total execution time may be used
  - If there is more than one machine to be compared, one of them must be selected as a reference
35. Performance Metrics
- Throughput
  - Total amount of work done in a given time
  - Measured in tasks per time unit
  - Can be used for
    - Operating system performance
    - Pipeline performance
    - Multiprocessor performance
36. Performance Metrics
- Statistical Analysis
  - Used to compare performance when the workload consists of many programs
  - The appropriate statistic depends on the nature of the data as well as the distribution of the test results
37. Performance Metrics
- Statistical Analysis
  - Arithmetic mean
    - May be misleading if the data are skewed or scattered

    Arithmetic mean = (Σ xi) / n,  1 ≤ i ≤ n

    Execution times:
              MA      MB      MC
    Prog1     50      100     500
    Prog2     400     800     800
    Prog3     5550    5100    4700
    Average   2000    2000    2000
38. Performance Metrics
- Statistical Analysis
  - Weighted average
    - The weight is the frequency of each program in daily processing
    - Results may change with a different set of execution frequencies

    Weighted average = Σ wi · xi,  1 ≤ i ≤ n

              weight   MA      MB      MC
    Prog1     60%      50      100     500
    Prog2     30%      400     800     800
    Prog3     10%      5550    5100    4700
    Average            705     810     1010
39. Performance Metrics
- Statistical Analysis
  - Geometric mean
    - Results are stated in relation to the performance of a reference machine

    Geometric mean = (Π xi)^(1/n),  1 ≤ i ≤ n

             MA     norm. to MB | MB (ref.)  norm. to MB | MC     norm. to MB
    Prog1    50     2           | 100        1           | 500    0.2
    Prog2    400    2           | 800        1           | 800    1
    Prog3    5550   0.92        | 5100       1           | 4700   1.085
    Average         1.54        |            1           |        0.60

  - Results are consistent no matter which system is chosen as reference
40. Performance Metrics
- Statistical Analysis
  - Harmonic mean
    - Used to compare performance results that are expressed as a rate (e.g. operations per second, throughput, etc.)
    - The slowest rates have the greatest influence on the result
    - It identifies areas where performance can be improved

    Harmonic mean = n / (Σ 1/xi),  1 ≤ i ≤ n
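- To tie the four statistics together, a small C program that applies each formula to the MA column and the weights from the tables above; note that the slides apply the geometric mean to normalized ratios and the harmonic mean to rates, whereas this sketch simply evaluates each formula on the raw numbers:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Execution times of Prog1..Prog3 on machine MA (slide data)
           and the usage weights from the weighted-average slide. */
        double x[3] = { 50.0, 400.0, 5550.0 };
        double w[3] = { 0.60, 0.30, 0.10 };
        int n = 3;

        double sum = 0.0, wsum = 0.0, prod = 1.0, inv = 0.0;
        for (int i = 0; i < n; i++) {
            sum  += x[i];               /* for the arithmetic mean */
            wsum += w[i] * x[i];        /* weighted average        */
            prod *= x[i];               /* for the geometric mean  */
            inv  += 1.0 / x[i];         /* for the harmonic mean   */
        }
        printf("arithmetic mean: %.1f\n", sum / n);        /* 2000.0 */
        printf("weighted mean:   %.1f\n", wsum);           /* 705.0  */
        printf("geometric mean:  %.1f\n", pow(prod, 1.0 / n));
        printf("harmonic mean:   %.1f\n", n / inv);
        return 0;
    }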
41. Performance Metrics
- MIPS (Million instructions per second)
  - Includes both integer and floating-point performance
  - The number of instructions in a program varies between different computers
  - The number of instructions varies between different programs on the same computer

    MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)
42. Performance Metrics
- MFLOPS (Million floating-point operations per second)
  - Gives the performance of only floating-point operations
  - Different mixes of integer and floating-point operations may have different execution times
    - Integer and floating-point units work independently
    - Instruction and data caches provide instructions and data concurrently
43. Performance Metrics
- Utilization

    Utilization = Busy time / Total time

- Speciality ratio
  - ≈ 1 ⇒ general purpose

    Speciality ratio = Maximum performance / Minimum performance
44. Performance Metrics
- Asymptotic and Half performance
  - r∞: asymptotic performance
  - n1/2: half-performance length
  - For a linear timing model T = t0 + n·t (a numeric sketch follows):

    T = r∞^(-1) × (n + n1/2),  where  r∞ = 1/t  and  n1/2 = t0/t

  - [Figure] T versus n is a straight line with slope r∞^(-1), intercept t0 at n = 0, value 2t0 at n = n1/2, and extrapolated x-intercept at n = -n1/2
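- A small numeric sketch of the model with made-up t0 and t, checking that the achieved rate at n = n1/2 is half of r∞:

    #include <stdio.h>

    int main(void)
    {
        double t0 = 2.0e-6;             /* startup time, seconds (assumed) */
        double t  = 1.0e-8;             /* time per operation, seconds     */

        double r_inf  = 1.0 / t;        /* asymptotic performance (ops/s)  */
        double n_half = t0 / t;         /* length reaching r_inf / 2       */

        printf("r_inf  = %.3e ops/s\n", r_inf);       /* 1e8 ops/s */
        printf("n_half = %.0f operations\n", n_half); /* 200       */

        /* At n = n_half the achieved rate n/T equals r_inf / 2. */
        double n = n_half;
        printf("rate at n_half = %.3e ops/s\n", n / (t0 + n * t));
        return 0;
    }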
45. Performance Evaluation Methods
- Benchmarking
- Monitoring
- Analytical Modeling
  - Queuing Theory
46. Benchmarking
- A benchmark is a program that is run on a computer to measure its performance and compare it with other machines
- The best benchmark is the user's workload: the mixture of programs and operating system commands that users run on a machine
  - Not practical, however
- Standard benchmarks are used instead
47. Benchmarking
- Types of Benchmarks
  - Synthetic benchmarks
  - Toy benchmarks
  - Kernels
  - Real Applications
48. Benchmarking
- Synthetic benchmarks
  - Artificially created benchmark programs that represent the average frequency of operations of a large set of programs
  - Whetstone benchmark
  - Dhrystone benchmark
  - Rhealstone benchmark
49. Benchmarking
- Synthetic benchmarks
  - Whetstone benchmark
    - First written in Algol 60 in 1972; today Fortran, C/C++ and Java versions are available
    - Represents the workload of numerical applications
    - Measures floating-point arithmetic performance
    - Unit is millions of Whetstone instructions per second (MWIPS)
    - Shortcomings
      - Does not represent constructs in modern languages, such as pointers, etc.
      - Does not consider cache effects
50. Benchmarking
- Synthetic benchmarks
  - Dhrystone benchmark
    - First written in Ada in 1984; today a C version is available
    - Represents the workload of system software: statistics were collected on operating systems, compilers, editors and a few numerical programs
    - Measures integer and string performance; no floating-point operations
    - Unit is the number of program iteration completions per second
    - Shortcomings
      - Does not represent real-life programs
      - Compiler optimization overstates system performance
      - Small code that may fit in the instruction cache
51. Benchmarking
- Synthetic benchmarks
  - Rhealstone benchmark
    - For multi-tasking real-time systems
    - Factors are
      - Task switching time
      - Pre-emption time
      - Interrupt latency time
      - Semaphore shuffling time
      - Deadlock breaking time
      - Datagram throughput time
    - Metric is Rhealstones per second:

      Rhealstones/s = Σ (i = 1 to 6) wi · (1 / ti)
52. Benchmarking
- Toy benchmarks
  - 10-100 lines of code whose result is known before running the toy program
  - Quick sort
  - Sieve of Eratosthenes
    - Finds prime numbers (animation: http://upload.wikimedia.org/wikipedia/commons/8/8c/New_Animation_Sieve_of_Eratosthenes.gif)

    func sieve( var N )
        var PrimeArray as array of size N
        initialize PrimeArray to all true
        for i from 2 to N
            for each j from i+1 to N, where i divides j
                set PrimeArray( j ) = false
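- For reference, a runnable C version of the same sieve (the bound N = 50 is arbitrary):

    #include <stdbool.h>
    #include <stdio.h>

    #define N 50

    int main(void)
    {
        bool prime[N + 1];
        for (int i = 0; i <= N; i++)
            prime[i] = true;

        for (int i = 2; i <= N; i++)
            if (prime[i])                        /* i is prime */
                for (int j = 2 * i; j <= N; j += i)
                    prime[j] = false;            /* cross out multiples of i */

        for (int i = 2; i <= N; i++)
            if (prime[i])
                printf("%d ", i);
        printf("\n");
        return 0;
    }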
53. Benchmarking
- Kernels
  - Key pieces of code from real applications
  - LINPACK and BLAS
  - Livermore Loops
  - NAS
54. Benchmarking
- Kernels
  - LINPACK and BLAS Libraries
    - LINPACK: linear algebra package
    - Measures floating-point computing power
    - Solves a system of linear equations Ax = b with Gaussian elimination
    - Metric is MFLOP/s
    - DAXPY is the most time-consuming routine
    - Used as the measure for the TOP500 list
    - BLAS: Basic Linear Algebra Subprograms
    - LINPACK makes use of the BLAS library
55. Benchmarking
- Kernels
  - LINPACK and BLAS Libraries
    - SAXPY: Scalar Alpha X Plus Y
    - y = a·x + y, where x and y are vectors and a is a scalar
    - SAXPY for single and DAXPY for double precision
    - Generic implementation (a timed sketch follows):

      for (int i = m; i < n; i++)
          y[i] = a * x[i] + y[i];
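- Putting the kernel and the MFLOP/s metric together: a self-contained sketch that times the generic DAXPY loop with gettimeofday (vector length and values are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double wall(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    int main(void)
    {
        int n = 10000000;                    /* vector length (assumed) */
        double a = 2.5;
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        double t0 = wall();
        for (int i = 0; i < n; i++)          /* DAXPY: y = a*x + y */
            y[i] = a * x[i] + y[i];
        double t = wall() - t0;

        /* One multiply and one add per element: 2n floating-point ops. */
        printf("y[0] = %.1f, time = %.4f s, rate = %.1f MFLOP/s\n",
               y[0], t, 2.0 * n / t / 1.0e6);
        free(x);
        free(y);
        return 0;
    }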
56. Benchmarking
- Kernels
  - Livermore Loops
    - Developed at LLNL
    - Originally in Fortran, now also in C
    - 24 numerical application kernels, such as
      - hydrodynamics fragment,
      - incomplete Cholesky conjugate gradient,
      - inner product,
      - banded linear systems solution, tridiagonal linear systems solution,
      - general linear recurrence equations,
      - first sum, first difference,
      - 2-D particle in a cell, 1-D particle in a cell,
      - Monte Carlo search,
      - location of a first array minimum, etc.
    - Metrics are arithmetic, geometric and harmonic mean of CPU rate
57. Benchmarking
- Kernels
  - NAS Parallel Benchmarks
    - Developed at the NASA Advanced Supercomputing division
    - Paper-and-pencil benchmarks
    - 11 benchmarks, such as
      - Discrete Poisson equation,
      - Conjugate gradient,
      - Fast Fourier Transform,
      - Bucket sort,
      - Embarrassingly parallel,
      - Nonlinear PDE solution,
      - Data traffic, etc.
58. Benchmarking
- Real Applications
  - Programs that are run by many users
    - C compiler
    - Text processing software
    - Frequently used user applications
  - Modified scripts used to measure particular aspects of system performance, such as interactive behavior and multiuser behavior
59. Benchmarking
- Benchmark Suites
  - Desktop Benchmarks
    - SPEC benchmark suite
  - Server Benchmarks
    - SPEC benchmark suite
    - TPC
  - Embedded Benchmarks
    - EEMBC
60. Benchmarking
- SPEC Benchmark Suite
  - Desktop Benchmarks
    - CPU-intensive
      - SPEC CPU2000
        - 12 integer (CINT2000) and 14 floating-point (CFP2000) benchmarks
        - Real application programs
          - C compiler
          - Finite element modeling
          - Fluid dynamics, etc.
    - Graphics-intensive
      - SPECviewperf
        - Measures rendering performance using OpenGL
      - SPECapc
        - Pro/Engineer: 3D rendering with solid models
        - SolidWorks: 3D CAD/CAM design tool, CPU-intensive and I/O-intensive tests
        - Unigraphics: solid modeling for an aircraft design
  - Server Benchmarks
    - SPECWeb for web servers
    - SPECSFS for NFS performance, throughput-oriented
61. Benchmarking
- TPC Benchmark Suite
  - Server benchmarks for transaction processing (TP)
  - Based on real applications
    - TPC-C: simulates a complex query environment
    - TPC-H: ad hoc decision support
    - TPC-R: business decision support system where users run a standard set of queries
    - TPC-W: business-oriented transactional web server
  - Measures performance in transactions per second; throughput is measured only when the response time limit is met
  - Allows cost-performance comparisons
62. Benchmarking
- EEMBC Benchmarks
  - For embedded computing systems
  - 34 benchmarks from 5 different application classes
    - Automotive/industrial
    - Consumer
    - Networking
    - Office automation
    - Telecommunications
63. Timers
- Roll Over
  - Suppose a timer returns 32-bit integer data and measures microseconds
    - It rolls over after 2^32 microseconds (≈ 1.2 hours)
  - Timers that measure milliseconds and use 32-bit data roll over after 2^32 milliseconds (≈ 49 days)
  - There is a trade-off between rollover time and accuracy
64. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
  - Analysis & Reporting
    - Performance metrics