Title: CS 1104 Help Session II Performance Measures
1 CS 1104 Help Session II: Performance Measures
- Colin Tan
- ctank_at_comp.nus.edu.sg
- http://www.comp.nus.edu.sg/ctank
2 Basic Concepts: Instruction Execution Cycles
- Processors execute instructions in several steps:
  - Instruction Fetch (IF)
    - Instructions are fetched from memory and placed into an Instruction Register (IR).
  - Instruction Decode (ID)
    - The opcode portion of the instruction is sent to a decoder, which generates control signals.
    - Control signals tell the Arithmetic Logic Unit (ALU) what to do with the data: add, rotate the bits, etc.
    - The operands portion may be sent to the register file to fetch register data, or sent directly to the ALU to be operated on (for constants).
  - Operand Fetch (OF)
    - Data required for the operation is taken from memory or the register file and sent to the ALU inputs.
3 Basic Concepts: Instruction Execution Cycles
- Execution steps (cont'd):
  - Instruction Execute (IE)
    - The ALU computes the results based on the data fetched and the control signals generated.
  - Writeback (WB)
    - The results are written back to the destination register or memory location.
4 Basic Concepts: The Need for Synchronization
- How will the processor know:
  - When the instruction has been fetched and placed into IR?
    - If the instruction is not yet in IR, neither the opcode nor the operands will make sense!
    - Decoding nonsense and fetching invalid data leads to incorrect execution.
  - When the instruction has been decoded?
    - If the instruction has not been decoded completely, the ALU receives invalid control signals.
  - When the operands have been fetched?
    - If the operands have not yet been fetched from the registers or from memory, then the inputs to the ALU are invalid, and the ALU will compute invalid results!
5 Basic Concepts: Clock Cycles
- The other steps (OF, IE, WB) also need to know when to proceed in order to work correctly.
- To coordinate each step, the processor relies on a series of ticks called clock cycles (CC).
- CC1: Perform IF
  - By the end of CC1, the instruction is definitely sitting in IR, and the decoder can proceed to interpret the opcode.
- CC2: Perform ID
  - Decode the instruction in IR, and generate all the control signals by the end of this clock cycle.
- CC3: Perform OF
  - Fetch the data from registers or from memory. All the data must be ready and presented to the ALU by the end of this clock cycle.
6 Basic Concepts: Clock Cycles
- CC4: Perform IE
  - The ALU must operate (e.g. add, subtract, etc.) on the inputs and produce the results by the end of this clock cycle.
- CC5: Perform WB
  - The outputs of the ALU must be written back to register or memory by the end of this clock cycle.
- CC6: Start IF of next instruction
- If every step obeys the constraints laid out here, then each step will know for sure that the results of the previous step are already available before starting, and execution will proceed correctly.
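The clock-cycle schedule described above (CC1 to CC5 for one instruction, CC6 starting the next) can be sketched as a tiny simulation. This is only an illustrative model, assuming exactly one stage per clock cycle and no overlap between instructions; the function name `schedule` is hypothetical and not part of the slides.

```python
# Stage order as described in the slides: one stage per clock cycle.
STAGES = ["IF", "ID", "OF", "IE", "WB"]


def schedule(num_instructions):
    """Return (clock_cycle, instruction_index, stage) tuples.

    Assumption: instructions do not overlap, so instruction n+1 starts
    its IF only after instruction n has finished WB.
    """
    timeline = []
    cycle = 1
    for instr in range(num_instructions):
        for stage in STAGES:
            timeline.append((cycle, instr, stage))
            cycle += 1
    return timeline


timeline = schedule(2)
for cycle, instr, stage in timeline:
    print(f"CC{cycle}: instruction {instr} performs {stage}")
```

With two instructions, the sixth entry is the IF of the second instruction, matching the "CC6: Start IF of next instruction" bullet.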
7 Basic Concepts: Instruction Classes
- A typical processor supports many instructions.
- Typically instructions are divided into groups:
  - Arithmetic Instructions: add, sub, mul, div, mod
  - Bitwise Instructions: rol, ror, shl, shr, and, or, not
  - Floating Point Instructions: fadd, fsub, fmul, fdiv
  - Load/Store Instructions: lw, sw
  - Etc.
8 Basic Concepts: Class Cycles Per Instruction
- We have seen how instructions take several clock cycles to execute (in our example, each instruction takes 5 clock cycles).
- Different instructions actually take different numbers of clock cycles to execute, depending on how complex the instruction is, or how slow each stage of the instruction is.
- E.g.:
  - Floating point adds are more complex than integer adds, and require more clock cycles.
  - lw and sw access memory, which takes more clock cycles to fetch an operand from than registers do.
9 Basic Concepts: Class Cycles Per Instruction
- The Class Cycles Per Instruction (class CPI) is the average number of clock cycles required by instructions within a particular class.
- E.g.:
  - Number of cycles for ADD = 2 cycles
  - Number of cycles for SUB = 2 cycles
  - Number of cycles for MUL = 4 cycles
  - Number of cycles for DIV = 8 cycles
  - ---------------
  - Total = 16 cycles
  - Average = 16/4 = 4 CPI.
- So the class CPI for this class of instructions is 4.
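The averaging step above can be written as a short Python sketch. It follows the slide in taking a plain (unweighted) average over the instructions in the class; the function name `class_cpi` is a hypothetical helper.

```python
def class_cpi(cycles_by_instruction):
    """Plain (unweighted) average of the cycle counts in one class."""
    return sum(cycles_by_instruction.values()) / len(cycles_by_instruction)


# Cycle counts for the arithmetic class, taken from the example above.
arithmetic = {"ADD": 2, "SUB": 2, "MUL": 4, "DIV": 8}
print(class_cpi(arithmetic))  # 16 / 4, prints 4.0
```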
10 Basic Concepts: Instruction Frequency
- A program (e.g. Microsoft Word) is made up of many instructions coming from each of the different classes of instructions.
- The number of instructions in each class, relative to the total number of instructions, is called the instruction frequency of that class.
- This is often expressed as a percentage or as a fraction.
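As a rough illustration of instruction frequency expressed as a fraction, the sketch below converts per-class instruction counts into frequencies. The counts are invented for the example (a hypothetical 1,000-instruction program), and `instruction_frequencies` is a hypothetical helper name.

```python
def instruction_frequencies(counts):
    """Convert per-class instruction counts into fractions of the total."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical counts for a 1,000-instruction program (not from the slides).
counts = {"A": 400, "B": 250, "C": 150, "D": 200}
freqs = instruction_frequencies(counts)
for cls, freq in freqs.items():
    print(f"class {cls}: {freq:.2f} ({freq:.0%})")
```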
11 Basic Concepts: Overall Cycles Per Instruction
- The class instruction frequencies and the class CPIs can be used to compute the overall Cycles Per Instruction, or overall CPI, of a particular program.
  - Each type of instruction takes a different number of clock cycles.
  - A program consists of several different types of instructions.
  - The overall CPI is the average number of cycles required to execute each instruction, across all types of instructions.
12 Calculating Overall CPI
- Find the overall CPI of a program running on a processor with the class CPIs and instruction frequencies shown here:

Class  CPI  Instruction Frequency
A      3    0.4
B      2    0.25
C      4    0.15
D      5    0.20
13 Calculating Overall CPI
- Let's assume that the total number of instructions is IC. Then there are 0.4IC instructions in class A, 0.25IC in class B, 0.15IC in class C and 0.2IC in class D.
- The total number of clock cycles used by instructions in class A is 0.4IC x 3, class B is 0.25IC x 2, class C is 0.15IC x 4, and class D is 0.2IC x 5.
- Hence the total number of clock cycles used by this program is 0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x 5.
- The number of instructions is IC. Hence the average number of cycles per instruction (average CPI) is (0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x 5) / 1.0IC.
- IC cancels out, leaving 0.4 x 3 + 0.25 x 2 + 0.15 x 4 + 0.2 x 5 = 1.2 + 0.5 + 0.6 + 1.0, the famous Overall CPI. The final answer is 3.3.
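The weighted sum derived above can be checked with a few lines of Python. This is a minimal sketch that assumes the instruction frequencies add up to 1.0; the name `overall_cpi` and the dictionary layout are illustrative choices, not something defined in the slides.

```python
def overall_cpi(table):
    """Weighted sum of class CPIs; assumes the frequencies add up to 1.0."""
    return sum(cpi * freq for cpi, freq in table.values())


# class -> (class CPI, instruction frequency), from the class table above.
table = {"A": (3, 0.4), "B": (2, 0.25), "C": (4, 0.15), "D": (5, 0.20)}
print(round(overall_cpi(table), 2))  # 1.2 + 0.5 + 0.6 + 1.0
```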
14 Calculating Overall CPI
- Suppose the previous program was re-compiled with a different compiler, and the CPI/instruction frequency table is modified to the one below:

Class  CPI  Instruction Frequency
A      3    0.2
B      2    0.35
C      4    0.15
D      5    0.20
15 Calculating Overall CPI
- We take a short-cut and use the famous formula:
  - Overall CPI = 3 x 0.2 + 2 x 0.35 + 4 x 0.15 + 5 x 0.2 = 2.9
- If we left the answer like this, it would be WRONG!
  - Reason: the instruction frequencies do not add up to 1.0!
- Returning to the definitions, let's compute the total number of clock cycles taken by this program:
  - Total clock cycles = 0.2IC x 3 + 0.35IC x 2 + 0.15IC x 4 + 0.2IC x 5
  - Total number of instructions = 0.2IC + 0.35IC + 0.15IC + 0.2IC = 0.9IC
16 Calculating Overall CPI
- Finding the overall CPI:
  - (0.2IC x 3 + 0.35IC x 2 + 0.15IC x 4 + 0.2IC x 5) / (0.9IC)
- Canceling out IC, we get:
  - (0.2 x 3 + 0.35 x 2 + 0.15 x 4 + 0.2 x 5) / 0.9 = 2.9 / 0.9
- Final answer is 3.22.
- Moral: always divide the weighted sum you get by the total frequency. In the previous example, the total frequency was 1.0, so we didn't have a problem. Here this is not the case.
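The moral above (divide by the total frequency) can be folded into the calculation itself. The sketch below is one possible way to do it; `normalized_cpi` is a hypothetical name, and the table is the re-compiled one whose frequencies add up to 0.9.

```python
def normalized_cpi(table):
    """Total cycles divided by total frequency, so the result stays
    correct even when the frequencies do not add up to 1.0."""
    total_cycles = sum(cpi * freq for cpi, freq in table.values())
    total_freq = sum(freq for _, freq in table.values())
    return total_cycles / total_freq


# class -> (class CPI, instruction frequency) for the re-compiled program.
recompiled = {"A": (3, 0.2), "B": (2, 0.35), "C": (4, 0.15), "D": (5, 0.20)}
print(round(normalized_cpi(recompiled), 2))  # 2.9 / 0.9
```

When the frequencies do add up to 1.0, the division changes nothing, so this version can be used in both cases.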
17 Calculating Peak CPI
- The peak overall CPI is obtained when every instruction in a program is from the fastest class. Using our previous example, we will have peak performance if our instruction frequencies are as shown.
- This will give us a peak CPI of 0.0 x 3 + 1.0 x 2 + 0.0 x 4 + 0.0 x 5 = 2.0

Class  CPI  Instruction Frequency
A      3    0.0
B      2    1.0
C      4    0.0
D      5    0.0
18 Calculating Peak CPI
- In general, the peak overall CPI will be the CPI of the fastest class.
- It is not possible to modify the class CPIs without modifying the hardware organization itself.
- However, by hacking the hardware, the peak class CPI can be as low as 0!
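Since the peak overall CPI is simply the CPI of the fastest class, it can be found by taking a minimum. The following sketch uses the class CPIs from the running example; `peak_cpi` is a hypothetical helper name.

```python
def peak_cpi(class_cpis):
    """Best-case overall CPI: every instruction comes from the fastest class."""
    return min(class_cpis.values())


# Class CPIs from the running example.
class_cpis = {"A": 3, "B": 2, "C": 4, "D": 5}
fastest = min(class_cpis, key=class_cpis.get)
print(fastest, peak_cpi(class_cpis))  # prints: B 2
```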
19 Basic Concepts: Clock Rate
- We have seen how the processor coordinates the various instruction execution stages using a common tick, or clock cycle.
- The number of ticks per second is called the clock rate, or clock frequency.
- Obviously the higher the clock rate, the faster each stage has to complete, and therefore the faster the processor completes an instruction.
  - This implies that a higher clock rate will give you faster processors.
- However there is a limit to how fast each stage can do something.
  - Cranking the clock rate beyond the capabilities of the hardware will cause execution to fail.
20 Basic Concepts: Clock Rate
- To overcome speed limitations, processor designers often make compromises in the designs for each stage.
  - The compromises allow each stage to work faster than before, allowing you to crank up the clock rate faster than ever.
- Such compromises give you faster execution rates under ideal circumstances, but may give you worse performance under normal circumstances.
  - This is because the compromises result in higher class CPIs.
- Hence a faster clock rate may actually result in poorer performance.
  - This translates to longer execution times for a program.
- The length of a clock cycle measured in seconds is called the clock cycle time or clock period. It is equal to the reciprocal of the clock rate (i.e. cycle time = 1 / clock_rate).
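The trade-off described above can be made concrete with a small sketch: the average time per instruction is the CPI multiplied by the clock period. The two designs compared here are invented purely for illustration (they do not come from the slides), and the function names are hypothetical.

```python
def clock_period(clock_rate_hz):
    """Clock cycle time in seconds: the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def time_per_instruction(cpi, clock_rate_hz):
    """Average seconds per instruction = CPI x clock period."""
    return cpi * clock_period(clock_rate_hz)


# Hypothetical designs (numbers invented for illustration only).
baseline = time_per_instruction(cpi=2.0, clock_rate_hz=500e6)      # 500 MHz
faster_clock = time_per_instruction(cpi=3.0, clock_rate_hz=600e6)  # 600 MHz

print(f"500 MHz, CPI 2.0: {baseline * 1e9:.1f} ns per instruction")
print(f"600 MHz, CPI 3.0: {faster_clock * 1e9:.1f} ns per instruction")
```

In this made-up comparison the design with the higher clock rate is slower per instruction, because its higher CPI outweighs the shorter clock period.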
21 Execution Time
- The execution time T of a program is the amount of time a program takes to run to completion.
- This will depend on the overall CPI, the total number of instructions executed (IC), and the clock rate (R) of the processor.
  - IC x CPI will give us the total number of clock cycles used to execute all the instructions in the program.
  - (IC x CPI) / R will give us the execution time.
  - If my program takes 10,000 cycles, and if my clock produces 100,000 cycles per second, then my program would take 10,000 / 100,000 = 0.1 seconds to execute.
- Hence T = (IC x CPI) / R
22 Execution Time
- From the previous example, suppose the program has a total of 15,000,000 instructions, and suppose that the clock rate of the processor is 500 MHz. What is the total execution time of the program?
- The overall CPI from the first class table is 0.4 x 3 + 0.25 x 2 + 0.15 x 4 + 0.2 x 5 = 3.3.
- T = ((15 x 10^6) x 3.3) / (500 x 10^6) = 0.099 seconds.
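The formula T = (IC x CPI) / R translates directly into code. In the sketch below, `execution_time` is a hypothetical helper; the 10,000-cycle example is modeled as 10,000 instructions at a CPI of 1, which is an arbitrary split chosen only so that IC x CPI equals 10,000, and the overall CPI for the second call is computed from the first class table of the running example.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """T = (IC x CPI) / R, in seconds."""
    return instruction_count * cpi / clock_rate_hz


# 10,000 cycles on a clock producing 100,000 cycles per second.
# Assumption: modeled as IC = 10,000 and CPI = 1 so that IC x CPI = 10,000.
small = execution_time(10_000, 1.0, 100_000)
print(small)

# class -> (class CPI, instruction frequency), first table of the example.
table = {"A": (3, 0.4), "B": (2, 0.25), "C": (4, 0.15), "D": (5, 0.20)}
cpi = sum(c * f for c, f in table.values())

# 15,000,000 instructions on a 500 MHz processor.
t = execution_time(15_000_000, cpi, 500e6)
print(round(t, 4))
```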
23 Execution Time Issues
- The execution time computed is specific to this program. Other programs will have different execution times.
- Execution time is affected by:
  - Hardware Organization: this affects individual class CPIs, and hence the overall CPI.
    - E.g. ADD instructions implemented using carry-propagate adders will have much higher CPIs than those implemented using carry-generate adders.
  - Compiler Technology: this affects the individual class frequencies.
    - A good compiler will select more instructions from faster classes to accomplish the same objective.
24 Execution Time Issues
- Execution time is affected by (cont'd):
  - The program being run
    - Different programs will have different instruction distributions (i.e. different instruction class frequencies), resulting in different overall CPIs.
    - Different programs will have different instruction counts IC.
  - Instruction Set Architecture
    - A richer ISA will give the compiler more choices of instructions to use to minimize IC, CPI or both.
- All of these will give you a different execution time T.
25 Benchmarking
- Benchmarks allow us to determine the performance of a system, usually relative to another system.
- A common benchmark that we use is execution time. We take the same program and run it on two machines, and compare their execution times.
- We cannot use overall CPI or clock frequency as a basis for comparison:
  - High clock frequency processors may make compromises that dramatically increase individual class CPIs, and hence overall CPI.
  - Instructions may have very low CPIs because clock cycle times are very long.
    - Long clock cycle times mean that the processor may be able to accomplish more than one step in 1 clock cycle, leading to lower cycle requirements.
    - Unfortunately, due to low clock rates, performance may be poor.
26 Benchmarking: Execution Time Example
- The processor in the previous example is optimized, and the new class CPIs are shown below. Clock frequencies and instruction counts remain the same. How much faster is the new machine than the old?

Class  CPI  Instruction Frequency
A      2    0.4
B      1    0.25
C      5    0.15
D      4    0.20
27 Benchmarking: Execution Time Example
- Overall CPI = 2 x 0.4 + 1 x 0.25 + 5 x 0.15 + 4 x 0.2 = 2.6
- Execution Time = 2.6 x (15 x 10^6) / (500 x 10^6) = 0.078 s
- Previous Execution Time = 3.3 x (15 x 10^6) / (500 x 10^6) = 0.099 s, where 3.3 = 3 x 0.4 + 2 x 0.25 + 4 x 0.15 + 5 x 0.2 is the overall CPI with the old class CPIs.
- We can measure the speed-up by taking the old execution time and dividing it by the new:
  - Speedup = 0.099 / 0.078 = 1.27
- This figure of 1.27 means that the new design is 1.27 times faster than the old one.
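The speed-up comparison can be reproduced numerically as follows. This sketch recomputes both overall CPIs from the class CPIs and instruction frequencies given in the two tables of the running example, then divides the old execution time by the new one; `speedup` and the variable names are illustrative.

```python
def speedup(old_time, new_time):
    """Old execution time divided by new execution time."""
    return old_time / new_time


IC = 15_000_000  # instruction count (unchanged between machines)
R = 500e6        # clock rate in Hz (unchanged between machines)

# Overall CPIs from the old and new class CPIs with the same frequencies.
old_cpi = 3 * 0.4 + 2 * 0.25 + 4 * 0.15 + 5 * 0.20
new_cpi = 2 * 0.4 + 1 * 0.25 + 5 * 0.15 + 4 * 0.20

old_time = IC * old_cpi / R
new_time = IC * new_cpi / R
print(round(old_time, 4), round(new_time, 4))
print(round(speedup(old_time, new_time), 2))
```

Because IC and R are the same for both machines, the speed-up reduces to the ratio of the two overall CPIs.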
28 Benchmarking: Instruction Throughput
- Measuring how fast a machine can execute a particular program is just one way of determining performance.
- Another good measure is instruction throughput, or how many instructions a processor can execute per second.
- The most common measure for throughput is MIPS, which is short for Millions of Instructions Per Second.
  - This is not to be confused with the MIPS R2000. In that case, MIPS is actually a company's name.
- So we have two meanings for MIPS:
  - Millions of Instructions Per Second
  - The company that makes the R2000.
29 Benchmarking: MIPS Example
- Find the MIPS rating for both machines used in these notes.
- CPI for first machine = 3 x 0.4 + 2 x 0.25 + 4 x 0.15 + 5 x 0.2 = 3.3
  - This means that every instruction requires, on average, 3.3 cycles.
  - The clock rate is 500 MHz, so each second there are 500 x 10^6 cycles.
  - Therefore you can execute (500 x 10^6) / 3.3 = 151.5 x 10^6 instructions per second, or 151.5 MIPS.
- CPI for second machine = 2.6
  - Clock rate remains the same at 500 x 10^6 Hz.
  - So throughput is (500 x 10^6) / 2.6 = 192.3 x 10^6 instructions per second, or 192.3 MIPS.
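The MIPS calculation is the clock rate divided by the CPI, scaled to millions. In the sketch below, `mips_rating` is a hypothetical helper; the CPI values passed in are the weighted sums of the two class tables in these notes (3 x 0.4 + 2 x 0.25 + 4 x 0.15 + 5 x 0.2 for the first machine, 2 x 0.4 + 1 x 0.25 + 5 x 0.15 + 4 x 0.2 for the second).

```python
def mips_rating(clock_rate_hz, cpi):
    """Millions of instructions per second = R / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


first_cpi = 3 * 0.4 + 2 * 0.25 + 4 * 0.15 + 5 * 0.20   # first machine
second_cpi = 2 * 0.4 + 1 * 0.25 + 5 * 0.15 + 4 * 0.20  # second machine

first = mips_rating(500e6, first_cpi)
second = mips_rating(500e6, second_cpi)
print(round(first, 1), round(second, 1))
```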
30 Types of Benchmarks
- Micro-Benchmarks
  - These are very small benchmarks aimed primarily at gauging the peak performance of a processor.
- Kernel Benchmarks
  - These are very small benchmarks designed to measure processor performance (e.g. benchmarks to measure MIPS ratings).
- Full Application Benchmarks
  - These use actual applications (or simulations of actual applications) to measure the performance of CPU, memory and I/O systems. They give a good idea of how the system will perform running such applications.
- Target Workload
  - These use the actual programs that are going to be run on the system to measure performance.
31 Amdahl's Law
- Amdahl's Law basically states that:
  - Execution time depends on a number of factors, such as the speeds of various classes of instructions.
  - If you improve the performance of one factor by X times, then the overall improvement in execution time will always be less than X.
- If we were to improve the execution time of a particular class of instructions, then the new execution time is given by:
  - New Ex Time = Ex Time of unaffected classes + (Ex Time of affected class / speedup)
32 Amdahl's Law
- Suppose a program runs in 100 seconds on a machine, and multiplies account for 80 seconds of this time. What improvement in execution time will we have if we (i) improved multiplies by 5 times, (ii) improved the other instructions by 10 times?
- i) New ex time = unaffected time + affected time / speedup = 20 + 80/5 = 36 seconds
  - Improvement = 100 / 36 = 2.78 times faster.
- ii) New ex time = 80 + 20/10 = 82 seconds
  - Improvement = 100 / 82 = 1.22 times faster.
- Moral: always improve the common case to get the best increase in performance!
  - Here the common case is the multiply (80%). Improving multiplies by 5 times gives far better gains than improving the other instructions (20%) by 10 times!
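The two cases above can be worked through with a short sketch of the Amdahl's Law formula (new time = unaffected time + affected time / speedup). The function name `amdahl_new_time` is a hypothetical helper; the numbers are the ones from the example.

```python
def amdahl_new_time(unaffected_time, affected_time, speedup):
    """New execution time after speeding up only the affected part."""
    return unaffected_time + affected_time / speedup


total = 100.0  # original execution time in seconds

# (i) multiplies (80 s) made 5 times faster; the other 20 s are unaffected.
case_i = amdahl_new_time(20.0, 80.0, 5)
# (ii) other instructions (20 s) made 10 times faster; multiplies unaffected.
case_ii = amdahl_new_time(80.0, 20.0, 10)

print(case_i, round(total / case_i, 2))
print(case_ii, round(total / case_ii, 2))
```

Even though case (ii) applies the larger speedup factor, it touches only 20 of the 100 seconds, so the overall improvement is much smaller.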
33 Summary
- We looked at how instructions take several steps to execute, and each step is synchronized with the tick of a clock: a clock cycle.
- Execution time is the only reliable way to tell which machine is faster.
- Machine performance may also be measured using instruction throughput:
  - How many instructions can this machine execute in 1 second?
- Amdahl's Law allows us to see how much improvement we need to make to a class of instructions in order to achieve a desired level of improvement in performance.