CS 1104 Help Session II Performance Measures - PowerPoint PPT Presentation


Slides: 34

Provided by: polar


1
CS 1104 Help Session II: Performance Measures

Colin Tan
ctank_at_comp.nus.edu.sg
http://www.comp.nus.edu.sg/ctank

2
Basic Concepts: Instruction Execution Cycles

Processors execute instructions in several steps:
Instruction Fetch (IF)
Instructions are fetched from memory and placed
into an Instruction Register (IR).
Instruction Decode (ID)
The opcode portion of the instruction is sent to
a decoder, which generates control signals.
Control signals tell the Arithmetic Logic Unit
(ALU) what to do with the data: add, rotate the
bits, etc.
The operands portion may be sent to the
register-file to fetch register data, or sent
directly to the ALU to be operated on (for
constants).
Operand Fetch (OF)
Data required for the operation is taken from
memory or the register file and sent to the ALU
inputs

3
Basic Concepts: Instruction Execution Cycles

Execution steps (cont'd):
Instruction Execute (IE)
The ALU computes the results based on the data
fetched and the control signals generated.
Writeback (WB)
The results are written back to the destination
register or memory location.

4
Basic Concepts: The Need for Synchronization

How will the processor know:
When the instruction has been fetched and placed
into IR?
If the instruction is not yet in IR, neither the
opcodes nor operands will make sense!
Decoding nonsense and fetching invalid data leads
to incorrect execution.
When the instruction has been decoded?
If the instruction has not been decoded
completely, the ALU receives invalid control
signals.
When the operands have been fetched?
If the operands have not yet been fetched from
the registers or from memory, then the inputs to
the ALU are invalid, and the ALU will compute
invalid results!

5
Basic Concepts: Clock Cycles

The other steps (IE, OF, WB) also need to know
when to proceed in order to work correctly.
To coordinate each step, the processor relies on
a series of ticks called clock cycles (CC).
CC1: Perform IF
By the end of CC1, the instruction is definitely
sitting in IR, and the decoder can proceed to
interpret the opcode.
CC2: Perform ID
Decode the instruction in IR, and generate all
the control signals by the end of this clock
cycle.
CC3: Perform OF
Fetch the data from registers or from memory.
All the data must be ready and presented to the
ALU by the end of this clock cycle.

6
Basic Concepts: Clock Cycles

CC4: Perform IE
The ALU must operate (i.e. add, subtract etc.) on
the inputs and produce the results by the end of
this clock cycle.
CC5: Perform WB
The outputs of the ALU must be written back to
register or memory by the end of this clock
cycle.
CC6: Start IF of next instruction
If every step obeys the constraints laid out
here, then each step will know for sure that the
results of the previous step are already
available before starting, and execution will
proceed correctly.
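
The CC1 to CC6 walkthrough above can be sketched as a small timeline generator. This is only an illustration, assuming one stage per clock cycle and strictly sequential execution (the next IF starts after the previous WB); the function name `schedule` is made up for this sketch.

```python
# The five steps named in these notes, in the order they are performed.
STAGES = ["IF", "ID", "OF", "IE", "WB"]


def schedule(num_instructions):
    """Return (clock_cycle, instruction_number, stage) tuples.

    Assumption: one stage per clock cycle, and an instruction starts
    its IF only after the previous instruction has finished WB.
    """
    timeline = []
    cycle = 1
    for instr in range(1, num_instructions + 1):
        for stage in STAGES:
            timeline.append((cycle, instr, stage))
            cycle += 1
    return timeline


for cycle, instr, stage in schedule(2):
    print(f"CC{cycle}: instruction {instr} performs {stage}")
```

With two instructions, the sixth line printed is the IF of the second instruction, matching "CC6: Start IF of next instruction".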

7
Basic Concepts: Instruction Classes

A typical processor supports many instructions.
Typically instructions are divided into groups:
Arithmetic Instructions: add, sub, mul, div, mod
Bitwise Instructions: rol, ror, shl, shr, and,
or, not
Floating Point Instructions: fadd, fsub, fmul,
fdiv
Load/Store Instructions: lw, sw
Etc.

8
Basic Concepts: Class Cycles Per Instruction

We have seen how instructions take several clock
cycles to execute (in our example, each
instruction takes 5 clock cycles).
In practice, different instructions take different
numbers of clock cycles to execute, depending on
how complex the instruction is and how slow each
of its stages is.
E.g. floating-point adds are more complex than
integer adds, and require more clock cycles.
lw and sw access memory, which takes more clock
cycles to fetch an operand from than the
register file does.

9
Basic Concepts: Class Cycles Per Instruction

The Class Cycles Per Instruction (class CPI) is
the average number of clock cycles required by
instructions within a particular class.
E.g.
# of cycles for ADD = 2 cycles
# of cycles for SUB = 2 cycles
# of cycles for MUL = 4 cycles
# of cycles for DIV = 8 cycles
---------------
Total = 16 cycles
Average = 16/4 = 4 CPI.
So the class CPI for this class of instructions
is 4.
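
As a quick check of the averaging above, here is a minimal sketch that treats the class CPI as the plain (unweighted) mean of the per-instruction cycle counts. The function name `class_cpi` is an assumption for illustration.

```python
def class_cpi(cycles_per_instruction):
    """Unweighted average cycle count over the instructions in one class."""
    return sum(cycles_per_instruction.values()) / len(cycles_per_instruction)


# Cycle counts taken from the example on this slide.
arithmetic = {"ADD": 2, "SUB": 2, "MUL": 4, "DIV": 8}
print(class_cpi(arithmetic))  # 4.0
```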

10
Basic Concepts: Instruction Frequency

A program (e.g. Microsoft Word) is made up of
many instructions coming from each of the
different classes of instructions.
The share of a program's instructions that belongs
to a class is called the instruction frequency of
that class. It is usually expressed as a
percentage or as a fraction of the total
instruction count.
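
To make the idea concrete, the sketch below counts class frequencies in a short instruction trace. The mnemonics come from the instruction classes slide, but the trace itself and the `INSTRUCTION_CLASS` mapping are invented for illustration.

```python
from collections import Counter

# Hypothetical mapping from mnemonic to class (subset of the classes slide).
INSTRUCTION_CLASS = {
    "add": "arithmetic", "sub": "arithmetic", "mul": "arithmetic",
    "shl": "bitwise", "and": "bitwise",
    "lw": "load/store", "sw": "load/store",
}


def class_frequencies(trace):
    """Return each class's fraction of the instructions in the trace."""
    counts = Counter(INSTRUCTION_CLASS[op] for op in trace)
    total = len(trace)
    return {cls: n / total for cls, n in counts.items()}


# Made-up trace of eight executed instructions.
trace = ["lw", "lw", "add", "mul", "and", "sw", "add", "sub"]
print(class_frequencies(trace))
```

For this trace, half the instructions are arithmetic, three in eight are loads or stores, and one in eight is bitwise, so the fractions add up to 1.0.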

11
Basic Concepts: Overall Cycles Per Instruction

The class instruction frequencies and the class
CPIs can be used to compute the overall Cycles
Per Instruction, or overall CPI, of a particular
program.
Each type of instruction would take a different
number of clock cycles.
A program consists of several different types of
instructions.
The overall CPI is the average number of cycles
required to execute each instruction, across all
types of instructions.

12
Calculating Overall CPI

Find the overall CPI of a program running on a
processor with the class CPIs and instruction
frequencies shown here
Class   CPI   Instruction Frequency
A       3     0.40
B       2     0.25
C       4     0.15
D       5     0.20

13
Calculating Overall CPI

Let's assume that the total number of
instructions is IC. Then there are 0.4IC
instructions in class A, 0.25IC in class B,
0.15IC in class C and 0.2IC in class D.
Total number of clock cycles used by instructions
in class A is 0.4IC x 3, class B is 0.25IC x 2,
class C is 0.15IC x 4, class D is 0.2IC x 5.
Hence total number of clock cycles used by this
program is 0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 +
0.2IC x 5.
Number of instructions is IC. Hence average
number of cycles per instruction (average CPI) is
(0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x
5) / 1.0IC.
IC cancels off, leaving 0.4 x 3 + 0.25 x 2 + 0.15
x 4 + 0.2 x 5 = 1.2 + 0.5 + 0.6 + 1.0, the famous
Overall CPI. Final answer is 3.3.
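
The weighted sum derived above can be written as a few lines of code. This is a minimal sketch that assumes the frequencies add up to 1.0; the function name `overall_cpi` and the dictionary layout are choices made for this example.

```python
def overall_cpi(classes):
    """classes maps a class name to (class CPI, instruction frequency).

    Assumes the instruction frequencies sum to 1.0.
    """
    return sum(cpi * freq for cpi, freq in classes.values())


# Class CPIs and frequencies from the table on the previous slide.
table = {"A": (3, 0.40), "B": (2, 0.25), "C": (4, 0.15), "D": (5, 0.20)}
print(round(overall_cpi(table), 2))  # 3.3
```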

14
Calculating Overall CPI

Suppose the previous program was re-compiled with
a different compiler, and the CPI/instruction
frequency table is modified to the one below

Class   CPI   Instruction Frequency
A       3     0.20
B       2     0.35
C       4     0.15
D       5     0.20

15
Calculating Overall CPI

We take a short-cut and use the famous formula:
Overall CPI = 3 x 0.2 + 2 x 0.35 + 4 x 0.15 + 5
x 0.2
= 2.9
If we left the answer like this, it would be WRONG!
Reason: The instruction frequencies do not add up
to 1.0!
Returning to definitions, let's compute the
total number of clock cycles taken by this
program:
Total Clock Cycles = 0.2IC x 3 + 0.35IC x 2 +
0.15IC x 4 + 0.2IC x 5
Total number of instructions = 0.2IC + 0.35IC +
0.15IC + 0.2IC
= 0.9IC

16
Calculating Overall CPI

Finding the overall CPI:
(0.2IC x 3 + 0.35IC x 2 + 0.15IC x 4 + 0.2IC x 5)
/ (0.9IC)
Canceling out IC, we get
(0.2 x 3 + 0.35 x 2 + 0.15 x 4 + 0.2 x 5) / 0.9
= 2.9 / 0.9
Final answer is 3.22.
Moral: Always divide the weighted sum you get by
the total frequency. In the previous example, the
total frequency was 1.0, and we didn't have a
problem. Here this is not the case.
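
The moral above amounts to dividing the weighted sum by the total frequency. A minimal sketch, with the function name `overall_cpi_normalized` chosen for illustration:

```python
def overall_cpi_normalized(classes):
    """classes maps a class name to (class CPI, instruction frequency).

    Divides by the total frequency, so the result stays correct even
    when the frequencies do not add up to 1.0.
    """
    total_cycles = sum(cpi * freq for cpi, freq in classes.values())
    total_freq = sum(freq for _, freq in classes.values())
    return total_cycles / total_freq


# Table for the re-compiled program; the frequencies add up to 0.9 only.
recompiled = {"A": (3, 0.20), "B": (2, 0.35), "C": (4, 0.15), "D": (5, 0.20)}
print(round(overall_cpi_normalized(recompiled), 2))  # 3.22
```

When the frequencies do sum to 1.0, the division changes nothing, so this form is safe to use in both cases.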

17
Calculating Peak CPI

The peak overall CPI is obtained when every
instruction in a program is from the fastest
class. Using our previous example, we will have
peak performance if our instruction frequencies
are as shown.
This will give us a peak CPI of 0.0 x 3 + 1.0 x 2
+ 0.0 x 4 + 0.0 x 5 = 2.0

Class   CPI   Instruction Frequency
A       3     0.0
B       2     1.0
C       4     0.0
D       5     0.0

18
Calculating Peak CPI

In general, the peak overall CPI will be the CPI
of the fastest class.
It is not possible to modify the class CPIs
without modifying the hardware organization
itself.
However, by hacking the hardware, the peak class
CPI can be as low as 0!
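
The rule that the peak overall CPI equals the CPI of the fastest class can be checked with a short sketch. It assumes fixed class CPIs (no hardware changes); the function name `peak_cpi` is made up for this example.

```python
def peak_cpi(class_cpis):
    """Peak overall CPI: every instruction comes from the fastest class."""
    return min(class_cpis.values())


class_cpis = {"A": 3, "B": 2, "C": 4, "D": 5}
print(peak_cpi(class_cpis))  # 2

# Same result from the weighted sum with all frequency placed on class B.
frequencies = {"A": 0.0, "B": 1.0, "C": 0.0, "D": 0.0}
print(sum(class_cpis[c] * frequencies[c] for c in class_cpis))  # 2.0
```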

19
Basic Concepts: Clock Rate

We have seen how the processor coordinates the
various instruction execution stages using a
common tick, or clock cycle.
The number of ticks per second is called the
clock rate, or clock frequency.
Obviously the higher the clock rate, the faster
each stage has to complete, and therefore the
faster the processor completes an instruction
This implies that a higher clock rate will give
you faster processors.
However there is a limit to how fast each stage
can do something.
Cranking the clock rate beyond the capabilities
of the hardware will cause execution to fail.

20
Basic Concepts: Clock Rate

To overcome speed limitations, processor
designers often make compromises in the designs
for each stage
The compromises allow each stage to work faster
than before, allowing you to crank up the clock
rate faster than ever.
Such compromises give you faster execution rates
under ideal circumstances, but may give you worse
performance under normal circumstances.
This is because the compromises result in higher
class CPIs.
Hence a faster clock rate may actually result in
poorer performance.
This translates to longer execution times for a
program.
The length of a clock cycle measured in seconds
is called the clock cycle time or clock period.
It is equal to the reciprocal of the frequency
(i.e. cycle time = 1 / clock rate).
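
The trade-off described on this slide can be illustrated with numbers. The two designs below are hypothetical (they do not come from these notes); the sketch only shows that a higher clock rate combined with a higher CPI can mean more time per instruction.

```python
def cycle_time(clock_rate_hz):
    """Clock period in seconds: the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def time_per_instruction(cpi, clock_rate_hz):
    """Average time one instruction takes: CPI cycles of one period each."""
    return cpi * cycle_time(clock_rate_hz)


# Hypothetical designs, invented for illustration.
baseline = time_per_instruction(cpi=2.0, clock_rate_hz=500e6)    # about 4 ns
aggressive = time_per_instruction(cpi=3.0, clock_rate_hz=600e6)  # about 5 ns

# The 600 MHz design is slower per instruction despite the faster clock.
print(baseline < aggressive)  # True
```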

21
Execution Time

The execution time T of a program is the amount
of time a program takes to run to completion.
This will depend on the overall CPI, the total
number of instructions executed (IC), and the
clock rate (R) of the processor.
IC x CPI will give us the total number of clock
cycles used to execute all the instructions in
the program
(IC x CPI) / R will give us the execution time.
If my program takes 10,000 cycles, and if my
clock produces 100,000 cycles per second, then my
program would take 10,000 / 100,000 = 0.1 seconds
to execute.
Hence T = (IC x CPI) / R

22
Execution Time

From the previous example (overall CPI of 3.3),
suppose the program has a total of 15,000,000
instructions, and suppose that the clock rate of
the processor is 500 MHz. What is the total
execution time of the program?
T = (15 x 10^6) x 3.3 / (500 x 10^6) = 0.099
seconds.
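
The formula T = (IC x CPI) / R translates directly into code. In this sketch the function name `execution_time` is an assumption, and the overall CPI is recomputed from the class table used in the first overall CPI calculation (A: 3, 0.4; B: 2, 0.25; C: 4, 0.15; D: 5, 0.20) rather than typed in.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """T = (IC x CPI) / R, in seconds."""
    return instruction_count * cpi / clock_rate_hz


# The 10,000-cycle example: IC x CPI = 10,000 cycles at 100,000 cycles/s.
print(execution_time(10_000, 1, 100_000))  # 0.1

# Overall CPI recomputed from the class CPI / frequency table.
cpi = 3 * 0.40 + 2 * 0.25 + 4 * 0.15 + 5 * 0.20
t = execution_time(15e6, cpi, 500e6)
print(round(t, 3))  # 0.099
```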

23
Execution Time Issues

The execution time computed applies only to
this program. Other programs will have different
execution times.
Execution time is affected by:
Hardware Organization: This affects individual
class CPIs, and hence the overall CPI.
E.g. ADD instructions implemented using
carry-propagate adders will have much higher CPIs
than those implemented using carry-generate
adders.
Compiler Technology: This affects the individual
class frequencies.
A good compiler will select more instructions
from faster classes to accomplish the same
objective.

24
Execution Time Issues

Execution time is affected by (cont'd):
The program being run:
Different programs will have different
instruction distributions (i.e. different
instruction class frequencies), resulting in
different overall CPIs.
Different programs will have different
instruction counts (IC).
Instruction Set Architecture:
A richer ISA will give the compiler more choices
of instructions to use to minimize IC, CPI or
both.
All this will give you different execution time T.

25
Benchmarking

Benchmarks allow us to determine the performance
of a system, usually relative to another system.
A common benchmark that we use is execution time.
We take the same program and run it on two
machines, and compare their execution times.
We cannot use overall CPI or clock frequency alone
as a basis for comparison:
High clock frequency processors may make
compromises that dramatically increase individual
class CPIs, and hence overall CPI.
Instructions may have very low CPIs because clock
cycle times are very long.
Long clock cycle times mean that the processor
may be able to accomplish more than one step in
1 clock cycle, leading to lower cycle
requirements.
Unfortunately, due to low clock rates, performance
may still be poor.

26
Benchmarking: Execution Time Example

The processor in the previous example is
optimized, and the new class CPIs are shown
below. Clock frequencies and instruction counts
remain the same. How much faster is the new
machine over the old?

Class   CPI   Instruction Frequency
A       2     0.40
B       1     0.25
C       5     0.15
D       4     0.20

27
Benchmarking: Execution Time Example

Overall CPI = 2 x 0.4 + 1 x 0.25 + 5 x 0.15 + 4 x
0.2
= 2.6
Execution Time = 2.6 x (15 x 10^6) / (500 x
10^6)
= 0.078 s
Previous Execution Time = 3.3 x (15 x 10^6) /
(500 x 10^6) = 0.099 s
We can measure the speed-up by taking the old
execution time and dividing it by the new:
Speedup = 0.099 / 0.078 = 1.27
This figure of 1.27 means that the new design is
1.27 times faster than the old one.
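
The speed-up comparison can be reproduced by recomputing both execution times from the two class CPI tables (the original one and the optimized one), with the same instruction frequencies, instruction count and clock rate. The helper names below are assumptions for this sketch.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """T = (IC x CPI) / R, in seconds."""
    return instruction_count * cpi / clock_rate_hz


def speedup(old_time, new_time):
    """How many times faster the new design is than the old one."""
    return old_time / new_time


# Same frequencies (0.40, 0.25, 0.15, 0.20), different class CPIs.
old_cpi = 3 * 0.40 + 2 * 0.25 + 4 * 0.15 + 5 * 0.20  # original processor
new_cpi = 2 * 0.40 + 1 * 0.25 + 5 * 0.15 + 4 * 0.20  # optimized processor

t_old = execution_time(15e6, old_cpi, 500e6)
t_new = execution_time(15e6, new_cpi, 500e6)
print(round(t_old, 3), round(t_new, 3))  # 0.099 0.078
print(round(speedup(t_old, t_new), 2))   # 1.27
```

Because IC and the clock rate are unchanged, the speed-up is simply the ratio of the two overall CPIs.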

28
Benchmarking: Instruction Throughput

Measuring how fast a machine can execute a
particular program is just one way of determining
performance.
Another good measure is instruction throughput,
or how many instructions a processor can execute
per second.
The most common measure for throughput is MIPS,
which is short for Millions of Instructions Per
Second.
This is not to be confused with the MIPS in MIPS
R2000. In that case, MIPS is actually a company's
name.
So we have two meanings for MIPS:
Millions of Instructions Per Second
The company that makes the R2000.

29
Benchmarking: MIPS Example

Find the MIPS rating for both machines used in
these notes.
CPI for first machine = 3.3
This means that every instruction requires, on
average, 3.3 cycles.
The clock rate is 500 MHz, so each second there
are 500 x 10^6 cycles.
Therefore you can execute 500 x 10^6 / 3.3 =
151.5 x 10^6 instructions per second, or 151.5
MIPS.
CPI for second machine = 2.6
Clock rate remains the same at 500 x 10^6 Hz.
So throughput is 500 x 10^6 / 2.6 = 192.3 MIPS
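
The MIPS rating is just the clock rate divided by the overall CPI, scaled to millions. In this sketch the function name `mips` is an assumption, and both CPIs are recomputed from the class tables of the two machines.

```python
def mips(clock_rate_hz, cpi):
    """Millions of instructions executed per second."""
    return clock_rate_hz / cpi / 1e6


# Overall CPIs recomputed from the two class CPI / frequency tables.
first_cpi = 3 * 0.40 + 2 * 0.25 + 4 * 0.15 + 5 * 0.20
second_cpi = 2 * 0.40 + 1 * 0.25 + 5 * 0.15 + 4 * 0.20

print(round(mips(500e6, first_cpi), 1))   # 151.5
print(round(mips(500e6, second_cpi), 1))  # 192.3
```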

30
Types of Benchmarks

Micro-Benchmarks
These are very small benchmarks aimed primarily
at gauging the peak performance of a processor.
Kernel Benchmarks
These are very small benchmarks designed to
measure processor performance (e.g. benchmarks to
measure MIPS ratings).
Full Applications Benchmarks
These use actual applications (or simulations of
actual applications) to measure the performance
of CPU, memory and IO systems. Gives a good idea
of how system will perform running such
applications.
Target Workload
These use the actual programs that are going to
be run on the system to measure performance.

31
Amdahl's Law

Amdahl's Law basically states that:
Execution time depends on a number of factors,
such as the speeds of various classes of
instructions.
If you improved the performance of one factor by
X times, then the overall improvement in
execution time will always be less than X times.
If we were to improve the execution time of a
particular class of instructions, then the new
execution time is given by
New Ex Time = Ex Time of unaffected classes + (Ex
Time of affected class / speedup)

32
Amdahl's Law

Suppose a program runs in 100 seconds on a
machine, and multiplies account for 80 seconds of
this time. What improvement in execution time
will we have if we (i) improved multiplies by 5
times, (ii) improved the other instructions by 10
times?
i) New ex time = unaffected time + affected time
/ speedup
= 20 + 80/5 = 36 seconds
Improvement = 100/36 = 2.78 times faster.
ii) New ex time = 80 + 20/10 = 82 seconds
Improvement = 100/82 = 1.22 times faster.
Moral: Always improve the common case to get the
best increase in performance!
Here the common case is the multiply (80%).
Improving multiplies by 5 times gives far better
gains than improving the other instructions (20%)
by 10 times!
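
Both cases of the worked example can be checked with the Amdahl's Law formula for the new execution time. The function name `amdahl_new_time` is chosen for this sketch.

```python
def amdahl_new_time(unaffected_time, affected_time, speedup):
    """New execution time = unaffected time + affected time / speedup."""
    return unaffected_time + affected_time / speedup


old_time = 100  # seconds; 80 s in multiplies, 20 s in everything else

# (i) multiplies made 5 times faster
t1 = amdahl_new_time(unaffected_time=20, affected_time=80, speedup=5)
print(t1, round(old_time / t1, 2))  # 36.0 2.78

# (ii) the other instructions made 10 times faster
t2 = amdahl_new_time(unaffected_time=80, affected_time=20, speedup=10)
print(t2, round(old_time / t2, 2))  # 82.0 1.22
```

Even with an arbitrarily large speedup in case (ii), the new time cannot drop below the 80 seconds spent in multiplies, which is why the common case matters most.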

33
Summary

We looked at how instructions take several steps
to execute, and each step is synchronized with
the tick of a clock: a clock cycle.
Execution time is the only reliable way to tell
which machine is faster.
Machine performance may also be measured using
instruction throughput
How many instructions can this machine execute in
1 second?
Amdahl's Law allows us to see how much
improvement we need to make to a class of
instructions in order to achieve a desired level
of improvement in performance.