Title: Lecture 2: Performance Measurement
1. Lecture 2: Performance Measurement
2. Performance Evaluation
- The primary duty of software developers is to create functionally correct programs
- Performance evaluation is the part of software development that produces well-performing programs
3. Performance Analysis Cycle
- Have an optimization phase, just like the testing and debugging phases
- [Cycle] Code development produces a functionally complete and correct program, which then enters a Measure → Analyze → Modify / Tune loop; the result is a complete, correct and well-performing program, ready for usage
4. Goals of Performance Analysis
- The goal of performance analysis is to provide quantitative information about the performance of a computer system
5. Goals of Performance Analysis
- Compare alternatives
  - When purchasing a new computer system, to provide quantitative information
- Determine the impact of a feature
  - In designing a new system or upgrading, to provide before-and-after comparison
- System tuning
  - To find the parameters that produce the best overall performance
- Identify relative performance
  - To quantify the performance relative to previous generations
- Performance debugging
  - To identify and correct performance problems
- Set expectations
  - To determine the expected capabilities of the next generation
6. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
      - Simulation
      - Analytical Modeling
  - Analysis & Reporting
    - Performance metrics
7. Performance Measurement
- Interval Timers
  - Hardware Timers
  - Software Timers
8. Performance Measurement
- Hardware Timers
  - An n-bit counter is driven by a clock of period Tc; the counter value is read from a memory location over the processor-memory bus
  - Time is calculated from two counter readings x1 and x2 as (see the sketch below)

    Time = (x2 - x1) × Tc
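- Example: on x86, the time-stamp counter is such a hardware counter. A minimal sketch using the GCC/Clang __rdtsc() intrinsic; the 3 GHz clock period below is an assumed illustrative value, not one queried from the hardware:

    #include <stdio.h>
    #include <x86intrin.h>              /* __rdtsc() on x86 with GCC/Clang */

    int main(void)
    {
        /* Assumed counter clock of 3 GHz, i.e. Tc = 1/3e9 s. On a real
           machine, use the CPU's invariant-TSC frequency instead. */
        const double Tc = 1.0 / 3.0e9;

        unsigned long long x1 = __rdtsc();    /* first counter read  */
        volatile double s = 0.0;              /* event being timed   */
        for (int i = 0; i < 1000000; i++)
            s += i * 0.5;
        unsigned long long x2 = __rdtsc();    /* second counter read */

        printf("elapsed = (x2 - x1) * Tc = %.6f s\n",
               (double)(x2 - x1) * Tc);
        return 0;
    }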
9. Performance Measurement
- Software Timers
  - Interrupt-based: a clock of period Tc drives a prescaling counter whose output is wired to the processor interrupt input
  - When the interrupt occurs, the interrupt-service routine increments the timer value, which is read by a program
  - Time is calculated from two timer readings x1 and x2 as

    Time = (x2 - x1) × Tc
10. Performance Measurement
- Timer Rollover
  - Occurs when an n-bit counter undergoes a transition from its maximum value 2^n - 1 to zero
  - There is a trade-off between rollover time and accuracy

    Tc      32-bit      64-bit
    10 ns   42 s        5850 years
    1 µs    1.2 hours   0.5 million years
    1 ms    49 days     0.5 × 10^9 years
11. Timers
- Solution
  - Use a 64-bit integer (rolls over after more than half a million years)
  - Or have the timer return two values (see the sketch below)
    - One represents seconds
    - One represents microseconds since the last second
    - With 32-bit values, the rollover period is over 100 years
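- Example: POSIX gettimeofday() returns exactly such a pair of values; a minimal sketch combining them into one floating-point time stamp:

    #include <stdio.h>
    #include <sys/time.h>               /* gettimeofday(), struct timeval */

    int main(void)
    {
        struct timeval tv;              /* tv_sec: seconds; tv_usec:
                                           microseconds since the last second */
        gettimeofday(&tv, NULL);

        /* Combine the two values into a single time stamp in seconds. */
        double now = (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
        printf("now = %.6f s since the Unix epoch\n", now);
        return 0;
    }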
12. Performance Measurement
- Interval Timers
  - T0 ← read current time
  - Event being timed
  - T1 ← read current time
  - Time for the event is T1 - T0
13. Performance Measurement
- Timer Overhead
  - Timing sequence:
    - T1: initiate read_time until the current time is read
    - T2: timer call returns; event begins
    - T3: event runs until it ends; then read_time is initiated again
    - T4: second read_time call until the current time is read
  - Measured time
    - Tm = T2 + T3 + T4
  - Desired measurement
    - Te = T3 = Tm - (T2 + T4) ≈ Tm - (T1 + T2), since T1 ≈ T4
  - Timer overhead
    - Tovhd = T1 + T2
  - Te should be 100-1000 times greater than Tovhd (a sketch for estimating Tovhd follows)
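- A minimal sketch for estimating Tovhd, assuming a gettimeofday-based read_time(): two back-to-back reads enclose an empty event, so their smallest observed difference approximates T1 + T2:

    #include <stdio.h>
    #include <sys/time.h>

    /* Wall-clock read; stands in for the slides' read_time. */
    static double read_time(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    int main(void)
    {
        double best = 1.0e9;
        for (int i = 0; i < 100000; i++) {
            double a = read_time();
            double b = read_time();     /* nothing happens in between */
            if (b - a < best)
                best = b - a;           /* keep the minimum estimate  */
        }
        printf("estimated timer overhead: %.9f s\n", best);
        return 0;
    }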
14. Performance Measurement
- Timer Resolution
  - Resolution is the smallest change that can be detected by an interval timer
  - An event of duration Te with n·Tc < Te < (n+1)·Tc is reported as either n or n+1 clock ticks
  - If Tc is large relative to the event being measured, it may be impossible to measure the duration of the event
15. Performance Measurement
- Measuring Short Intervals
  - Problem: Te < Tc
  - A single measurement then reports either one count (a clock edge falls within the event) or zero counts (no clock edge falls within the event)
16. Performance Measurement
- Measuring Short Intervals
  - Solution: repeat the measurement n times
  - Counting approach: average execution time Te = (m × Tc) / n
    - m: number of 1s measured
  - Total-time approach (see the sketch below): average execution time Te = (Tt / n) - h
    - Tt: total execution time of n repetitions
    - h: repetition overhead
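- A minimal sketch of the total-time approach, assuming a gettimeofday-based timer and a dummy short_event(); the repetition overhead h is ignored here:

    #include <stdio.h>
    #include <sys/time.h>

    static double read_time(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    /* Stand-in for an event shorter than the timer resolution. */
    static volatile double sink;
    static void short_event(void) { sink += 1.0; }

    int main(void)
    {
        long n = 10000000;              /* number of repetitions       */
        double t0 = read_time();
        for (long i = 0; i < n; i++)
            short_event();
        double Tt = read_time() - t0;   /* total time of n repetitions */

        /* Te = Tt / n - h; subtract the loop overhead h for accuracy. */
        printf("average time per event: %.9f s\n", Tt / (double)n);
        return 0;
    }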
17. Performance Measurement
- Time
  - Elapsed time / wall-clock time / response time
    - Latency to complete a task, including disk access, memory access, I/O, operating system overhead, and everything else (includes time consumed by other programs in a time-sharing system)
  - CPU time
    - The time the CPU is computing, not including I/O time or waiting time
  - User time / user CPU time
    - CPU time spent in the program
  - System time / system CPU time
    - CPU time spent in the operating system performing tasks requested by the program
18. Performance Measurement
- UNIX time command
  - Example output: 90.7u 12.9s 2:39 65%
    - 90.7u: user time (90.7 s)
    - 12.9s: system time (12.9 s)
    - 2:39: elapsed time (159 s)
    - 65%: percentage of elapsed time spent on this job ((90.7 + 12.9) / 159 ≈ 65%)
  - Drawbacks
    - Resolution is in milliseconds
    - Different sections of the code cannot be timed
19. Timers
- A timer is a function, subroutine or program that can be used to return the amount of time spent in a section of code
- Two typical usage patterns:

    zero = 0.0;
    t0 = timer(zero);
    < code segment >
    t1 = timer(t0);       /* time = t1 */

    t0 = timer();
    < code segment >
    t1 = timer();         /* time = t1 - t0 */
20. Timers
- Read Wadleigh & Crawford, pp. 130-136, for time, clock, gettimeofday, etc.; a sketch of one possible timer() follows
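- One plausible way to implement such a timer() on top of gettimeofday() (an assumption for illustration; the book's actual code may differ):

    #include <stddef.h>                 /* NULL */
    #include <sys/time.h>

    /* Returns seconds elapsed since t_ref, so timer(0.0) returns the
       current time as a double. */
    double timer(double t_ref)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return ((double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec) - t_ref;
    }

- With this definition the first pattern above works directly: t0 = timer(0.0), run the code segment, and t1 = timer(t0) holds the elapsed seconds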
21. Timers
- Measuring Timer Resolution

    main() {
        ...
        zero = 0.0;
        t0 = timer(zero);
        t1 = 0.0;
        j = 0;
        while (t1 == 0.0) {            /* grow the workload until the   */
            j++;                       /* timer reports a nonzero time  */
            zero = 0.0;
            t0 = timer(zero);
            foo(j);
            t1 = timer(t0);
        }
        printf("It took %d iterations for a nonzero time\n", j);
        if (j == 1)
            printf("timer resolution <= %13.7f seconds\n", t1);
        else
            printf("timer resolution is %13.7f seconds\n", t1);
    }

    foo(n) {
        ...
        i = 0;
        for (j = 0; j < n; j++) i++;
        return(i);
    }
22. Timers
- Measuring Timer Resolution
  - Using clock():
      It took 682 iterations for a nonzero time
      timer resolution is 0.0200000 seconds
  - Using times():
      It took 720 iterations for a nonzero time
      timer resolution is 0.0200000 seconds
  - Using getrusage():
      It took 7374 iterations for a nonzero time
      timer resolution is 0.0002700 seconds
23. Timers
- Spin Loops
  - For codes that take less time to run than the resolution of the timer
  - The first call to a function may require an inordinate amount of time, so the minimum of all times may be desired

    main() {
        ...
        zero = 0.0;
        t2 = 100000.0;                 /* larger than any expected time */
        for (j = 0; j < n; j++) {
            t0 = timer(zero);
            foo(j);
            t1 = timer(t0);
            t2 = min(t2, t1);          /* keep the smallest measurement */
        }
        printf("Minimum time is %13.7f seconds\n", t2);
    }

    foo(n) {
        ...
        < code segment >
    }
24. Profilers
- A profiler automatically inserts timing calls into an application to collect execution-time data
- It is used to identify the portions of the program that consume the largest fraction of the total execution time
- It may also be used to find system-level bottlenecks in a multitasking system
- Profilers may alter the timing of a program's execution
25. Profilers
- Data collection techniques
  - Sampling-based (see the sketch below)
    - These profilers use a predefined clock; at every multiple of this clock tick the program is interrupted and the state information is recorded
    - They give a statistical profile of the program behavior
    - They may miss some important events
  - Event-based
    - Events are defined (e.g. entry into a subroutine) and data about these events are collected
    - The collected information shows the exact execution frequencies
    - It has a substantial amount of run-time overhead and memory requirement
- Information kept
  - Trace-based: the profiler keeps all information it collects
  - Reductionist: only statistical information is collected
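- A minimal sketch of the sampling idea on POSIX, using setitimer()/SIGPROF; a real sampling profiler would record the interrupted program counter at each tick rather than merely counting ticks:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* Sample counter incremented on every profiling tick. */
    static volatile sig_atomic_t samples = 0;

    static void on_sigprof(int sig)
    {
        (void)sig;
        samples++;          /* a real profiler would record the PC here */
    }

    int main(void)
    {
        /* Deliver SIGPROF every 10 ms of consumed CPU time. */
        struct itimerval it = { {0, 10000}, {0, 10000} };
        signal(SIGPROF, on_sigprof);
        setitimer(ITIMER_PROF, &it, NULL);

        volatile double s = 0.0;        /* workload being profiled */
        for (long i = 0; i < 300000000L; i++)
            s += (double)i;

        printf("collected %d samples\n", (int)samples);
        return 0;
    }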
26. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
  - Analysis & Reporting
    - Performance metrics
27. Predicting Performance
- Performance of simple kernels can be predicted to a high degree
- Theoretical performance and peak performance must be close
- It is preferred that the measured performance is over 80% of the theoretical peak performance
28. Performance Metrics
- Time
  - Elapsed time / wall-clock time / response time
  - CPU time
    - User time / user CPU time
    - System time / system CPU time
29. Performance Modeling
    CPU time = (instructions / program) × (cycles / instruction) × (time / cycle)

    CPI = cycles per instruction

    CPU time = instruction count × CPI × (1 / clock rate)
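- A quick numeric instance of the model (all values illustrative, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double instruction_count = 2.0e9;  /* instructions executed at runtime */
        double cpi               = 1.5;    /* average cycles per instruction   */
        double clock_rate        = 2.0e9;  /* 2 GHz => 0.5 ns cycle time       */

        double cpu_time = instruction_count * cpi / clock_rate;
        printf("CPU time = %.2f s\n", cpu_time);   /* 2e9 * 1.5 / 2e9 = 1.5 s */
        return 0;
    }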
30. Performance Modeling
- CPU Performance
  - CPI
    - is an average
    - depends on the design of the micro-architecture (hardwired/microprogrammed control, pipelining)
  - Number of instructions
    - is the number of instructions executed at runtime
    - depends on
      - instruction set architecture (ISA)
      - compiler

    CPU time = instruction count × CPI × (1 / clock rate)
31. Performance Modeling
- CPU Performance
  - Drawbacks
    - In modern computers, no program runs without some operating system running on the hardware
    - Comparing performance between machines with different operating systems would be unfair
32. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
      - Simulation
      - Analytical Modeling
        - Queuing Theory
  - Analysis & Reporting
    - Performance metrics
33. Performance Metrics
- Performance Comparison
  - Relative performance (a worked example follows)

    Performance_X = 1 / Execution time_X

    Performance ratio = Performance_X / Performance_Y = Execution time_Y / Execution time_X
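- A quick worked example of these formulas (the times are made up):

    #include <stdio.h>

    int main(void)
    {
        /* Machine X runs the workload in 10 s, machine Y in 15 s. */
        double time_x = 10.0, time_y = 15.0;

        double perf_x = 1.0 / time_x;   /* performance = 1 / time */
        double perf_y = 1.0 / time_y;

        /* Performance_X / Performance_Y = time_Y / time_X = 1.5 */
        printf("X is %.2f times faster than Y\n", perf_x / perf_y);
        return 0;
    }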
34. Performance Metrics
- Relative Performance
  - If the workload consists of more than one program, total execution time may be used
  - If there is more than one machine to be compared, one of them must be selected as a reference
35. Performance Metrics
- Throughput
  - Total amount of work done in a given time
  - Measured in tasks per time unit
  - Can be used for
    - Operating system performance
    - Pipeline performance
    - Multiprocessor performance
36. Performance Metrics
- Statistical Analysis
  - Used to compare performance when the workload consists of many programs
  - The appropriate statistic depends on the nature of the data as well as the distribution of the test results
37. Performance Metrics
- Statistical Analysis
  - Arithmetic mean
    - May be misleading if the data are skewed or scattered

    Arithmetic mean = (Σ xi) / n,  1 ≤ i ≤ n

    Execution times:
              MA      MB      MC
    Prog1     50      100     500
    Prog2     400     800     800
    Prog3     5550    5100    4700
    Average   2000    2000    2000
38. Performance Metrics
- Statistical Analysis
  - Weighted average
    - The weight is the frequency of each program in daily processing
    - Results may change with a different set of execution frequencies

    Weighted average = Σ wi · xi,  1 ≤ i ≤ n

              weight   MA      MB      MC
    Prog1     60%      50      100     500
    Prog2     30%      400     800     800
    Prog3     10%      5550    5100    4700
    Average            705     810     1010
39. Performance Metrics
- Statistical Analysis
  - Geometric mean
    - Results are stated in relation to the performance of a reference machine

    Geometric mean = (Π xi)^(1/n),  1 ≤ i ≤ n

             MA     norm. to MB | MB (ref.)  norm. to MB | MC     norm. to MB
    Prog1    50     2           | 100        1           | 500    0.2
    Prog2    400    2           | 800        1           | 800    1
    Prog3    5550   0.92        | 5100       1           | 4700   1.085
    Average         1.54        |            1           |        0.60

  - Results are consistent no matter which system is chosen as reference
40. Performance Metrics
- Statistical Analysis
  - Harmonic mean
    - Used to compare performance results that are expressed as a rate (e.g. operations per second, throughput, etc.)
    - The slowest rates have the greatest influence on the result
    - It identifies areas where performance can be improved

    Harmonic mean = n / (Σ 1/xi),  1 ≤ i ≤ n
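- To tie the four statistics together, a small C program that applies each formula to the MA column and the weights from the tables above; note that the slides apply the geometric mean to normalized ratios and the harmonic mean to rates, whereas this sketch simply evaluates each formula on the raw numbers:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Execution times of Prog1..Prog3 on machine MA (slide data)
           and the usage weights from the weighted-average slide. */
        double x[3] = { 50.0, 400.0, 5550.0 };
        double w[3] = { 0.60, 0.30, 0.10 };
        int n = 3;

        double sum = 0.0, wsum = 0.0, prod = 1.0, inv = 0.0;
        for (int i = 0; i < n; i++) {
            sum  += x[i];               /* for the arithmetic mean */
            wsum += w[i] * x[i];        /* weighted average        */
            prod *= x[i];               /* for the geometric mean  */
            inv  += 1.0 / x[i];         /* for the harmonic mean   */
        }
        printf("arithmetic mean: %.1f\n", sum / n);        /* 2000.0 */
        printf("weighted mean:   %.1f\n", wsum);           /* 705.0  */
        printf("geometric mean:  %.1f\n", pow(prod, 1.0 / n));
        printf("harmonic mean:   %.1f\n", n / inv);
        return 0;
    }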
41. Performance Metrics
- MIPS (Million instructions per second)
  - Includes both integer and floating-point performance
  - The number of instructions in a program varies between different computers
  - The number of instructions varies between different programs on the same computer

    MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)
42. Performance Metrics
- MFLOPS (Million floating-point operations per second)
  - Gives the performance of only floating-point operations
  - Different mixes of integer and floating-point operations may have different execution times
    - Integer and floating-point units work independently
    - Instruction and data caches provide instructions and data concurrently
43. Performance Metrics
- Utilization

    Utilization = Busy time / Total time

- Speciality ratio
  - ≈ 1 ⇒ general purpose

    Speciality ratio = Maximum performance / Minimum performance
44. Performance Metrics
- Asymptotic and Half performance
  - r∞: asymptotic performance
  - n1/2: half-performance length
  - For a linear timing model T = t0 + n·t (a numeric sketch follows):

    T = r∞^(-1) × (n + n1/2),  where  r∞ = 1/t  and  n1/2 = t0/t

  - [Figure] T versus n is a straight line with slope r∞^(-1), intercept t0 at n = 0, value 2t0 at n = n1/2, and extrapolated x-intercept at n = -n1/2
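- A small numeric sketch of the model with made-up t0 and t, checking that the achieved rate at n = n1/2 is half of r∞:

    #include <stdio.h>

    int main(void)
    {
        double t0 = 2.0e-6;             /* startup time, seconds (assumed) */
        double t  = 1.0e-8;             /* time per operation, seconds     */

        double r_inf  = 1.0 / t;        /* asymptotic performance (ops/s)  */
        double n_half = t0 / t;         /* length reaching r_inf / 2       */

        printf("r_inf  = %.3e ops/s\n", r_inf);       /* 1e8 ops/s */
        printf("n_half = %.0f operations\n", n_half); /* 200       */

        /* At n = n_half the achieved rate n/T equals r_inf / 2. */
        double n = n_half;
        printf("rate at n_half = %.3e ops/s\n", n / (t0 + n * t));
        return 0;
    }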
45. Performance Evaluation Methods
- Benchmarking
- Monitoring
- Analytical Modeling
  - Queuing Theory
46. Benchmarking
- A benchmark is a program that is run on a computer to measure its performance and compare it with other machines
- The best benchmark is the user's workload: the mixture of programs and operating system commands that users run on a machine
  - Not practical, however
- Standard benchmarks are used instead
47. Benchmarking
- Types of Benchmarks
  - Synthetic benchmarks
  - Toy benchmarks
  - Kernels
  - Real Applications
48. Benchmarking
- Synthetic benchmarks
  - Artificially created benchmark programs that represent the average frequency of operations of a large set of programs
  - Whetstone benchmark
  - Dhrystone benchmark
  - Rhealstone benchmark
49. Benchmarking
- Synthetic benchmarks
  - Whetstone benchmark
    - First written in Algol 60 in 1972; today Fortran, C/C++ and Java versions are available
    - Represents the workload of numerical applications
    - Measures floating-point arithmetic performance
    - Unit is millions of Whetstone instructions per second (MWIPS)
    - Shortcomings
      - Does not represent constructs in modern languages, such as pointers, etc.
      - Does not consider cache effects
50. Benchmarking
- Synthetic benchmarks
  - Dhrystone benchmark
    - First written in Ada in 1984; today a C version is available
    - Represents the workload of system software: statistics were collected on operating systems, compilers, editors and a few numerical programs
    - Measures integer and string performance; no floating-point operations
    - Unit is the number of program iteration completions per second
    - Shortcomings
      - Does not represent real-life programs
      - Compiler optimization overstates system performance
      - Small code that may fit in the instruction cache
51. Benchmarking
- Synthetic benchmarks
  - Rhealstone benchmark
    - For multi-tasking real-time systems
    - Factors are
      - Task switching time
      - Pre-emption time
      - Interrupt latency time
      - Semaphore shuffling time
      - Deadlock breaking time
      - Datagram throughput time
    - Metric is Rhealstones per second:

      Rhealstones/s = Σ (i = 1 to 6) wi · (1 / ti)
52. Benchmarking
- Toy benchmarks
  - 10-100 lines of code whose result is known before running the toy program
  - Quick sort
  - Sieve of Eratosthenes
    - Finds prime numbers (animation: http://upload.wikimedia.org/wikipedia/commons/8/8c/New_Animation_Sieve_of_Eratosthenes.gif)

    func sieve( var N )
        var PrimeArray as array of size N
        initialize PrimeArray to all true
        for i from 2 to N
            for each j from i+1 to N, where i divides j
                set PrimeArray( j ) = false
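- For reference, a runnable C version of the same sieve (the bound N = 50 is arbitrary):

    #include <stdbool.h>
    #include <stdio.h>

    #define N 50

    int main(void)
    {
        bool prime[N + 1];
        for (int i = 0; i <= N; i++)
            prime[i] = true;

        for (int i = 2; i <= N; i++)
            if (prime[i])                        /* i is prime */
                for (int j = 2 * i; j <= N; j += i)
                    prime[j] = false;            /* cross out multiples of i */

        for (int i = 2; i <= N; i++)
            if (prime[i])
                printf("%d ", i);
        printf("\n");
        return 0;
    }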
53. Benchmarking
- Kernels
  - Key pieces of code from real applications
  - LINPACK and BLAS
  - Livermore Loops
  - NAS
54. Benchmarking
- Kernels
  - LINPACK and BLAS Libraries
    - LINPACK: linear algebra package
    - Measures floating-point computing power
    - Solves a system of linear equations Ax = b with Gaussian elimination
    - Metric is MFLOP/s
    - DAXPY is the most time-consuming routine
    - Used as the measure for the TOP500 list
    - BLAS: Basic Linear Algebra Subprograms
    - LINPACK makes use of the BLAS library
55. Benchmarking
- Kernels
  - LINPACK and BLAS Libraries
    - SAXPY: Scalar Alpha X Plus Y
    - y = a·x + y, where x and y are vectors and a is a scalar
    - SAXPY for single and DAXPY for double precision
    - Generic implementation (a timed sketch follows):

      for (int i = m; i < n; i++)
          y[i] = a * x[i] + y[i];
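- Putting the kernel and the MFLOP/s metric together: a self-contained sketch that times the generic DAXPY loop with gettimeofday (vector length and values are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double wall(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }

    int main(void)
    {
        int n = 10000000;                    /* vector length (assumed) */
        double a = 2.5;
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        double t0 = wall();
        for (int i = 0; i < n; i++)          /* DAXPY: y = a*x + y */
            y[i] = a * x[i] + y[i];
        double t = wall() - t0;

        /* One multiply and one add per element: 2n floating-point ops. */
        printf("y[0] = %.1f, time = %.4f s, rate = %.1f MFLOP/s\n",
               y[0], t, 2.0 * n / t / 1.0e6);
        free(x);
        free(y);
        return 0;
    }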
56. Benchmarking
- Kernels
  - Livermore Loops
    - Developed at LLNL
    - Originally in Fortran, now also in C
    - 24 numerical application kernels, such as
      - hydrodynamics fragment,
      - incomplete Cholesky conjugate gradient,
      - inner product,
      - banded linear systems solution, tridiagonal linear systems solution,
      - general linear recurrence equations,
      - first sum, first difference,
      - 2-D particle in a cell, 1-D particle in a cell,
      - Monte Carlo search,
      - location of a first array minimum, etc.
    - Metrics are arithmetic, geometric and harmonic mean of CPU rate
57. Benchmarking
- Kernels
  - NAS Parallel Benchmarks
    - Developed at the NASA Advanced Supercomputing division
    - Paper-and-pencil benchmarks
    - 11 benchmarks, such as
      - Discrete Poisson equation,
      - Conjugate gradient,
      - Fast Fourier Transform,
      - Bucket sort,
      - Embarrassingly parallel,
      - Nonlinear PDE solution,
      - Data traffic, etc.
58. Benchmarking
- Real Applications
  - Programs that are run by many users
    - C compiler
    - Text processing software
    - Frequently used user applications
  - Modified scripts used to measure particular aspects of system performance, such as interactive behavior and multiuser behavior
59. Benchmarking
- Benchmark Suites
  - Desktop Benchmarks
    - SPEC benchmark suite
  - Server Benchmarks
    - SPEC benchmark suite
    - TPC
  - Embedded Benchmarks
    - EEMBC
60. Benchmarking
- SPEC Benchmark Suite
  - Desktop Benchmarks
    - CPU-intensive
      - SPEC CPU2000
        - 12 integer (CINT2000) and 14 floating-point (CFP2000) benchmarks
        - Real application programs
          - C compiler
          - Finite element modeling
          - Fluid dynamics, etc.
    - Graphics-intensive
      - SPECviewperf
        - Measures rendering performance using OpenGL
      - SPECapc
        - Pro/Engineer: 3D rendering with solid models
        - SolidWorks: 3D CAD/CAM design tool, CPU-intensive and I/O-intensive tests
        - Unigraphics: solid modeling for an aircraft design
  - Server Benchmarks
    - SPECWeb for web servers
    - SPECSFS for NFS performance, throughput-oriented
61. Benchmarking
- TPC Benchmark Suite
  - Server benchmarks for transaction processing (TP)
  - Based on real applications
    - TPC-C: simulates a complex query environment
    - TPC-H: ad hoc decision support
    - TPC-R: business decision support system where users run a standard set of queries
    - TPC-W: business-oriented transactional web server
  - Measures performance in transactions per second; throughput is measured only when the response time limit is met
  - Allows cost-performance comparisons
62. Benchmarking
- EEMBC Benchmarks
  - For embedded computing systems
  - 34 benchmarks from 5 different application classes
    - Automotive/industrial
    - Consumer
    - Networking
    - Office automation
    - Telecommunications
63. Timers
- Roll Over
  - Suppose a timer returns 32-bit integer data and measures microseconds
    - It rolls over after 2^32 microseconds (≈ 1.2 hours)
  - Timers that measure milliseconds and use 32-bit data roll over after 2^32 milliseconds (≈ 49 days)
  - There is a trade-off between rollover time and accuracy
64. Performance Evaluation
- Performance Evaluation steps
  - Measurement / Prediction
    - What to measure? How to measure?
    - Modeling for prediction
  - Analysis & Reporting
    - Performance metrics