Title: Instruction-Level Parallelism and Its Dynamic Exploitation
1. Instruction-Level Parallelism and Its Dynamic Exploitation
2. Outline
- Instruction-Level Parallelism: Concepts and Challenges
- Overcoming Data Hazards with Dynamic Scheduling
- Dynamic Scheduling: Examples and the Algorithm
- Reducing Branch Penalties with Dynamic Hardware Prediction
- High-Performance Instruction Delivery
- Taking Advantage of More ILP with Multiple Issue
- Hardware-Based Speculation
- Studies of the Limitations of ILP
- Limitations on ILP for Realizable Processors
3. Instruction-Level Parallelism: Concepts and Challenges
4. Introduction
- Instruction-Level Parallelism (ILP): the potential execution overlap among instructions
  - Instructions are executed in parallel
  - Pipelining supports a limited sense of ILP
- This chapter introduces techniques to increase the amount of parallelism exploited among instructions
  - How to reduce the impact of data and control hazards
  - How to increase the ability of the processor to exploit parallelism
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
5. Approaches to Exploiting ILP
- Hardware approach: the focus of this chapter
  - Dynamic: at run time
  - Dominates the desktop and server markets
  - Pentium III and IV, Athlon, MIPS R10000/R12000, Sun UltraSPARC III, PowerPC 603, G3, and G4, Alpha 21264
- Software approach: the focus of the next chapter
  - Static: at compile time
  - Relies on compilers
  - Broader adoption in the embedded market
  - But includes IA-64 and Intel's Itanium
6. ILP Methods
- A combination of HW and SW/compiler methods
7. ILP within a Basic Block
- Basic block: the instructions between branch instructions
  - Instructions in a basic block are executed in sequence
  - Real code is a bunch of basic blocks connected by branches
- Note: dynamic branch frequency is between 15% and 25%
  - So basic block size is between 6 and 7 instructions
  - These instructions may depend on each other (data dependence)
  - Therefore there is probably little parallelism within a block
- To obtain substantial performance enhancement, exploit ILP across multiple basic blocks
  - The easiest target is the loop
  - Exploit parallelism among iterations of a loop (loop-level parallelism)
8. Loop-Level Parallelism (LLP)
- Consider adding two 1000-element arrays
- There is no dependence between data values produced in any iteration j and those needed in iteration j+n, for any j and n
  - Truly independent iterations
  - Independence means no stalls due to data hazards
- Basic idea: convert LLP into ILP
  - Unroll the loop, either statically by the compiler (next chapter) or dynamically by the hardware (this chapter)
x[1] = x[1] + y[1]; x[2] = x[2] + y[2]; ...; x[1000] = x[1000] + y[1000]

for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];
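Unrolling is the transformation that exposes this independence. A minimal sketch in C (the function names and the unroll factor of 4 are illustrative choices, not from the slides):

```c
#include <assert.h>

#define N 1000

/* The original loop: every iteration is independent, so there are
   no loop-carried data hazards. */
static void add_arrays(double *x, const double *y) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4 (N assumed divisible by 4): each iteration now
   contains four independent adds that a pipeline can overlap,
   converting loop-level parallelism into instruction-level
   parallelism. */
static void add_arrays_unrolled(double *x, const double *y) {
    for (int i = 0; i < N; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
}
```

Both versions compute the same result; the unrolled one simply offers the scheduler more independent work per iteration.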
9. Data Dependences and Hazards
10. Introduction
- If two instructions are independent, then:
  - They can execute simultaneously (in parallel) in a pipeline without stalls, assuming no structural hazards
  - Their execution order can be swapped
- Dependent instructions must be executed in order, or only partially overlapped in the pipeline
- Why check dependence?
  - To determine how much parallelism exists, and how that parallelism can be exploited
- Types of dependences: data, name, and control dependence
11. Data Dependence Analysis
- Instruction i is data dependent on instruction j if i uses a result produced by j
  - OR i uses a result produced by k, and k depends on j (a dependence chain)
- A dependence indicates a potential RAW hazard
  - Whether it induces a hazard and a stall depends on the pipeline organization
  - The possibility limits performance
- Dependences dictate the order in which instructions must be executed
  - This sets a bound on how much parallelism can be exploited
- Overcoming a data dependence:
  - Maintain the dependence but avoid the hazard by scheduling the code (HW or SW)
  - Eliminate the dependence by transforming the code (by the compiler)
12. Data Dependence Example
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)
-       DADDUI R1, R1, #-8
-       BNE    R1, R2, Loop
If two instructions are data dependent, they
cannot execute simultaneously or be completely
overlapped.
13. Data Dependence through Memory Locations
- Dependences that flow through memory locations are more difficult to detect
- Addresses may refer to the same location but look different
  - 100(R4) and 20(R6) may be identical
- The effective address of a load or store may change from one execution of the instruction to another
  - Two executions of the same instruction L.D F0, 20(R4) may refer to different memory locations, because the value of R4 may change between the two executions
14. Name Dependence
- Occurs when two instructions use the same register name or memory location without a data dependence
- Let i precede j in program order:
  - i is antidependent on j when j writes a register that i reads
    - Indicates a potential WAR hazard
  - i is output dependent on j if they both write to the same register
    - Indicates a potential WAW hazard
- These are not true data dependences: no value is transmitted between the instructions
- The instructions can execute simultaneously or be reordered if the name used in the instructions is changed so that they do not conflict
15. Name Dependence Example
- L.D   F0, 0(R1)
- ADD.D F4, F0, F2
- S.D   F4, 0(R1)
- L.D   F0, -8(R1)    ; antidependence on F0 with the first ADD.D
- ADD.D F4, F0, F2    ; output dependence on F4 with the first ADD.D
- Register renaming removes these conflicts; renaming can be performed either by the compiler or by hardware
16. Register Renaming and WAW/WAR
- Before renaming:
  - DIV.D F0, F2, F4
  - ADD.D F6, F0, F8
  - S.D   F6, 0(R1)
  - SUB.D F8, F10, F14
  - MUL.D F6, F10, F8
- After renaming (S and T are new names):
  - DIV.D F0, F2, F4
  - ADD.D S, F0, F8
  - S.D   S, 0(R1)
  - SUB.D T, F10, F14
  - MUL.D F6, F10, T
- Hazards in the original code:
  - WAW: ADD.D/MUL.D (F6)
  - WAR: ADD.D/SUB.D (F8), S.D/MUL.D (F6)
  - RAW: DIV.D/ADD.D (F0), ADD.D/S.D (F6), SUB.D/MUL.D (F8)
- Renaming eliminates the WAW and WAR hazards; the RAW dependences remain
17. Control Dependence
- if (p1) { S1; }
  if (p2) { S2; }
- Since branches are conditional, some instructions will be executed and others will not
  - Instructions before the branch don't matter
  - The only possibility is between a branch and the instructions that follow it
- Two obvious constraints to maintain control dependence:
  - An instruction controlled by the branch cannot be moved before the branch (since it would then be uncontrolled)
  - An instruction not controlled by the branch cannot be moved after the branch (since it would then be controlled)
- Note:
  - Transitive control dependence is also a factor
  - In simple pipelines, order is preserved anyway, so this is no big deal
18. Control Dependence (Cont.)
- What's the big deal?
  - If there is no data dependence, move something before the branch and trash the result if the branch goes the wrong way
  - Note: this only works when the result goes to a register that becomes dead (the result is never used) if the wrong way is taken
- However, two important side effects affect correctness:
  - Exception behavior must remain intact
    - Sometimes this is relaxed, but it probably should not be
  - Branches effectively set up conditional data flow
    - Data flow is definitely real, so if we do the move, we had better make sure it does not change the data flow
- So it can be done, but care must be taken
  - Enter HW and SW speculation and conditional instructions
19. Control Dependence (Cont.)
- Control dependence is not the critical property that must be preserved
  - We may execute instructions that should not have been executed, thereby violating the control dependence, as long as the results remain correct
  - Example: a wrong guess in a delayed branch (filled from the target or the fall-through)
- Maintaining the control and data dependences can prevent raising new exceptions:
  - DADDU R2, R3, R4
  - BEQZ  R2, L1
  - LW    R1, 0(R2)
  - L1: ...
- No data dependence prevents us from interchanging BEQZ and LW; it is only the control dependence
- But interchanging BEQZ and LW may raise a memory protection exception (the load would execute even when R2 is 0)
20. Control Dependence (Cont.)
- By preserving the control dependence of the OR on the branch, we prevent an illegal change to the data flow:
  - DADDU R1, R2, R3
  - BEQZ  R4, L1
  - DSUBU R1, R5, R6
  - L1: ...
  - OR    R7, R1, R8
- The OR reads R1, whose value depends on whether the branch is taken
21. Control Dependence (Cont.)
- If R4 were unused (dead) after skipnext, and DSUBU could not generate an exception, we could move DSUBU before the branch, since the data flow cannot be affected
- If the branch is taken, DSUBU will execute but its result will be useless
  - DADDU R1, R2, R3
  - BEQZ  R12, skipnext
  - DSUBU R4, R5, R6
  - DADDU R5, R4, R9
  - skipnext: OR R7, R8, R9
22. Overcoming Data Hazards with Dynamic Scheduling
23. Introduction
- Approaches used to avoid data hazards in Appendix A and Chapter 4:
  - Forwarding or bypassing: keep a dependence from resulting in a hazard
  - Stalling: stall the instruction that uses the result, and all successive instructions
  - Compiler (pipeline) scheduling: static scheduling
- In-order instruction issue and execution:
  - Instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instructions can proceed
  - If there is a dependence between two closely spaced instructions in the pipeline, this leads to a hazard, and a stall results
24. Dynamic Scheduling vs. Static Scheduling
- Dynamic scheduling: avoid stalling when dependences are present
- Static scheduling: minimize stalls by separating dependent instructions so that they do not lead to hazards
25. Dynamic Scheduling Idea
- Dynamic scheduling: the HW rearranges the instruction execution to avoid stalling when dependences that could generate hazards are present
- Advantages:
  - Handles some dependences that are unknown at compile time
  - Simplifies the compiler
  - Code compiled for one machine runs well on another
- Approaches:
  - Scoreboarding (Appendix A)
  - Tomasulo's approach (the focus of this part)
- Assume multiple instructions can be in execution at the same time (requires multiple FUs, pipelined FUs, or both)
26. Dynamic Scheduling
- Dynamic instruction reordering
  - In-order issue, but out-of-order execution (and thus out-of-order completion)
- Consider:
  - DIV.D F0, F2, F4
  - ADD.D F10, F0, F8
  - SUB.D F12, F8, F14
- DIV.D has a long latency (say, 20 pipeline stages)
- ADD.D has a data dependence on F0; SUB.D does not
  - Stalling ADD.D will stall SUB.D too, so swap them
  - The compiler might have done this, but so could the HW
- Problem: might this raise new exceptions?
  - For now, let's ignore precise exceptions (Section 3.7 and Appendix A)
27. Dynamic Scheduling (Cont.)
- Key idea: allow instructions behind a stall to proceed
  - SUB.D can proceed even when ADD.D is stalled
- Out-of-order execution divides the ID stage:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until there are no data hazards, then read the operands
- All instructions pass through the issue stage in order
- But instructions can be stalled or bypass each other in the read-operands stage, and thus enter execution out of order
(Figure: the five-stage pipeline IF, ID, EX, MEM, WB, with the ID stage split into Issue and Read operands.)
28. WAR and WAW Hazards May Arise with Dynamic Scheduling
- A more interesting code fragment:
  - DIV.D F0, F2, F4
  - ADD.D F6, F0, F8
  - SUB.D F8, F10, F14
  - MUL.D F6, F10, F8
- Note the following:
  - ADD.D can't start until DIV.D completes (data dependence on F0)
  - SUB.D does not need to wait, but it can't post its result to F8 until ADD.D reads F8; otherwise a WAR hazard results (antidependence)
  - MUL.D does not need to wait, but it can't post its result to F6 until ADD.D writes F6; otherwise a WAW hazard results (output dependence)
- Both WAW and WAR hazards can be handled by scoreboarding (Appendix A) and by Tomasulo's approach
29. Tomasulo's Approach
- The original idea was developed for the IBM 360/91 to overcome:
  - Limited compiler scheduling (only 4 double-precision FP registers)
  - Long memory-access and FP delays
- Goal: high performance without special compilers
- Why study it? It led to the Alpha 21264, HP 8000, MIPS R10000, Pentium II, PowerPC 604, ...
- Key ideas:
  - Track data dependences to allow execution as soon as operands are available → minimize RAW hazards
  - Rename registers to avoid WAR and WAW hazards
30. Key Idea
- Pipelined or multiple functional units (FUs)
- Each FU has multiple reservation stations (RS)
- Issue to reservation stations is in order (in-order issue)
- An RS starts whenever it has collected its source operands from the real registers (RR): hence out-of-order execution
- Reservation stations act as virtual registers (VR) that remove WAW- and WAR-induced stalls
  - The RS fetches operands from the RR and stores them into the VR
- Since there can be more virtual registers than real registers, the technique can even eliminate hazards arising from name dependences that could not be eliminated by a compiler
31. Basic Structure of a Tomasulo-Based MIPS Processor
(Figure: the reservation stations serve as virtual registers.)
32. Reservation Station Duties
- Each RS holds an instruction that has been issued and is awaiting execution at an FU, together with either the operand values or the names of the RSs that will provide the operand values
- The RS fetches operands from the CDB as they appear
- When all operands are present, it enables the associated functional unit to execute
- Since values are not actually written to registers in the meantime, no WAW or WAR hazards are possible
33. Register Renaming in Tomasulo's Approach
- Register renaming is provided by the reservation stations (RS) and the instruction issue logic
- Each functional unit has several reservation stations
- An RS fetches and buffers an operand as soon as it is available
  - This eliminates the need to get the operand from a register → avoids WAR
- Pending instructions designate the RS that will provide their input
- When successive writes to a register overlap in execution, only the last one actually updates the register → avoids WAW
34. RS and Tomasulo's Approach
- Hazard detection and execution control are distributed
  - The information held in the RSs at each functional unit determines when an instruction can begin execution at that unit
- Results are passed directly to functional units rather than through the registers
  - Essentially similar to bypass logic
  - With broadcast capability, since results pass on the CDB (common data bus)
35. Instruction Steps
- Issue (in order, due to the queue structure)
  - Get the next instruction from the instruction queue
  - Issue if there is an empty RS or an available buffer (loads, stores)
  - If the operands are in registers, send them to the reservation station
  - Otherwise stall, due to the structural hazard
- Execute (may be out of order)
  - When all operands are available, execute
  - If not, monitor the CDB to grab each desired operand when it is produced
  - Effectively deals with RAW hazards
- Write result (also may be out of order)
  - When the result is available, write it to the CDB
  - From the CDB it goes to any waiting RS, to the registers, and to the store buffers
- Note: the renaming model prevents WAW and WAR hazards as a side effect
36. Basic Structure of a Tomasulo-Based MIPS Processor
(Figure: the reservation stations serve as virtual registers.)
37. Hazard Handling
- Structural hazards are checked at two points:
  - At dispatch: a free RS of the appropriate type must be available
  - When operands are ready: multiple RSs may compete for issue to the shared execution unit
    - Program order is used as the basis for the arbitration
- RAW, WAR, and WAW hazards are handled by the mechanisms above
- To preserve exception behavior, instructions should not be allowed to execute if a branch that is earlier in program order has not yet completed
  - Implemented by preventing any instruction from leaving the issue step if there is a pending branch already in the pipeline
38. Virtual Registers
- A tag field is associated with each datum
  - The tag field is a virtual register ID
  - It corresponds to the reservation station and load buffer names
- The motivation was the 360's register weakness
  - It had only 4 FP registers (as noted earlier)
  - The additional renamed virtual registers were a significant bonus
39. Tomasulo Structure
- Each reservation station has:
  - Op: the operation to perform
  - Qj, Qk: the RSs that will produce the corresponding operand
    - A value of 0 means the operand is already available or is not needed
  - Vj, Vk: the values of the operands
    - Only one of V or Q is valid for each operand (at most one valid Qj or Vj, and likewise for Qk or Vk)
  - Busy: this RS and its corresponding functional unit are occupied
  - A: information for the memory-address calculation of a load or store
    - Initially the immediate field; later the effective address
- The register file and store buffers have:
  - Qi: the RS that will produce the value to be stored in this register (or stored to memory)
- Load and store buffers each also require a Busy field
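These fields can be sketched as a plain struct, together with the two operations that use them: the readiness check and the CDB broadcast. This is an illustrative sketch (the names and types are mine, not from the slides):

```c
#include <assert.h>

/* Sketch of one Tomasulo reservation station, mirroring the fields
   on the slide. Qj/Qk hold the number of the RS that will produce
   an operand (0 = value already in Vj/Vk, or not needed), so RS
   numbers must start at 1. */
struct rs {
    int    busy;      /* this RS and its FU are occupied */
    int    op;        /* operation to perform */
    int    qj, qk;    /* producing RS for each source (0 = ready) */
    double vj, vk;    /* operand values, valid when qj/qk == 0 */
    long   a;         /* immediate, then effective address (ld/st) */
};

/* An instruction may begin execution only when both operands are
   available, i.e. neither source is still waiting on another RS. */
static int ready(const struct rs *s) {
    return s->busy && s->qj == 0 && s->qk == 0;
}

/* CDB broadcast: RS number r completed with value v; every waiting
   station grabs the value and clears its Q field. */
static void cdb_broadcast(struct rs *all, int n, int r, double v) {
    for (int i = 0; i < n; i++) {
        if (all[i].qj == r) { all[i].vj = v; all[i].qj = 0; }
        if (all[i].qk == r) { all[i].vk = v; all[i].qk = 0; }
    }
}
```

Note that no register file appears in `cdb_broadcast`'s wake-up path: consumers receive the value directly off the bus, which is exactly why WAR and WAW hazards cannot occur.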
40. Detailed Tomasulo Algorithm Control
(Table: the bookkeeping at the issue step. Recording in Qj/Qk/Qi that the result for a register will come from RS r is what avoids RAW hazards: a consumer waits on RS r instead of reading a stale register.)
41. Detailed Tomasulo Algorithm Control (Cont.)
(Table: the execute and write-result steps: calculate the effective address, write to the register, and broadcast the result to the RSs needing it.)
42. Tomasulo Example: Cycle 0
- Latencies (execution stage): LD is 1 CC, ADDD/SUBD is 2 CC, MULTD is 10 CC, and DIVD is 40 CC
43. Tomasulo Example: Cycle 1
44. Tomasulo Example: Cycle 2
45. Tomasulo Example: Cycle 3
- Note: register names are removed (renamed) in the reservation stations; MULTD is issued (unlike with the scoreboard)
- Load1 is completing: what is waiting for Load1?
46. Tomasulo Example: Cycle 4
- Load2 is completing: what is waiting for it?
47. Tomasulo Example: Cycle 5
48. Tomasulo Example: Cycle 6
49. Tomasulo Example: Cycle 7
- Add1 is completing: what is waiting for it?
50. Tomasulo Example: Cycle 8
51. Tomasulo Example: Cycle 9
52. Tomasulo Example: Cycle 10
53. Tomasulo Example: Cycle 11
54. Tomasulo Example: Cycle 12
- Note: all the quick instructions have already completed
55. Tomasulo Example: Cycle 13
56. Tomasulo Example: Cycle 14
57. Tomasulo Example: Cycle 15
- Mult1 is completing: what is waiting for it?
58. Tomasulo Example: Cycle 16
- Note: now just waiting for the divide
59. Tomasulo Example: Cycle 55
60. Tomasulo Example: Cycle 56
- Mult2 is completing: what is waiting for it?
61. Tomasulo Example: Cycle 57
- Again: in-order issue, out-of-order execution and completion
62. Advantages of Tomasulo
- Distribution of the hazard detection logic
  - Distributed RSs and the CDB
  - If multiple instructions are waiting on a single result, and each already has its other operand, then they can all be released simultaneously by the broadcast on the CDB
  - No waiting for the register bus of a centralized register file
- Elimination of stalls for WAW and WAR hazards
  - Registers are renamed using the RSs
  - Operands are stored into the RSs as soon as they are available
  - For a WAW hazard, the last write wins
    - In the issue stage: RegisterStat[rd].Qi ← r (the last issue wins)
63. Tomasulo Drawbacks
- Complexity
  - Contributed to delays of the 360/91, MIPS R10000, IBM 620?
  - Many associative stores (CDB matching) at high speed
- Performance is limited by the common data bus
  - Multiple CDBs → more FU logic for parallel associative stores
64. Tomasulo Loop Example
- Loop: LD    F0, 0(R1)
-       MULTD F4, F0, F2
-       SD    F4, 0(R1)
-       SUBI  R1, R1, #8
-       BNEZ  R1, Loop
- Assume multiply takes 4 clocks
- Assume the first load takes 8 clocks (cache miss?) and the second load takes 4 clocks (hit)
- To be clear, clocks will be shown for SUBI and BNEZ
- In reality, the integer instructions run ahead
65. Loop Example: Cycle 0
66. Loop Example: Cycle 1
67. Loop Example: Cycle 2
68. Loop Example: Cycle 3
- Note: MULT1 has no register names in its RS (they have been renamed)
69. Loop Example: Cycle 4
70. Loop Example: Cycle 5
71. Loop Example: Cycle 6
- Note: F0 never sees the Load1 result
72. Loop Example: Cycle 7
- Note: MULT2 has no register names in its RS
73. Loop Example: Cycle 8
74. Loop Example: Cycle 9
- Load1 is completing: what is waiting for it?
75. Loop Example: Cycle 10
- Load2 is completing: what is waiting for it?
76. Loop Example: Cycle 11
77. Loop Example: Cycle 12
78. Loop Example: Cycle 13
79. Loop Example: Cycle 14
- Mult1 is completing: what is waiting for it?
80. Loop Example: Cycle 15
- Mult2 is completing: what is waiting for it?
81. Loop Example: Cycle 16
82. Loop Example: Cycle 17
83. Loop Example: Cycle 18
84. Loop Example: Cycle 19
85. Loop Example: Cycle 20
86. Loop Example: Cycle 21
87. Tomasulo Summary
- Reservation stations: renaming onto a larger set of registers, plus buffering of source operands
  - Prevents the registers from becoming the bottleneck
  - Avoids the WAR and WAW hazards of the scoreboard
  - Allows loop unrolling in HW
- With one CDB, only one operation can broadcast its result in a single clock cycle
- Not limited to basic blocks (the integer unit gets ahead, beyond branches)
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants: Pentium II, PowerPC 604, MIPS R10000, HP PA-8000, Alpha 21264
88. Reducing Branch Penalties with Dynamic Hardware Prediction
89. Dynamic Control Hazard Avoidance
- Consider the effects of increasing the ILP:
  - Control dependences rapidly become the limiting factor
  - They tend not to be optimized away by the compiler, so higher branch frequencies result
  - Plus, multiple issue (more than one instruction per cycle) means more control instructions per second
  - Control stall penalties will go up as machines get faster: Amdahl's Law in action, again
- Branch prediction helps if it can be done at reasonable cost
  - Static: by the compiler (Appendix A)
  - Dynamic: by the HW (this section)
90. Dynamic Branch Prediction
- The processor attempts to resolve the outcome of a branch early, thus preventing control dependences from causing stalls
- BP performance = f(accuracy, cost of misprediction)
- Branch history table (BHT):
  - The lower bits of the PC address index a table of 1-bit values
  - There is no precise address check: entries just match on the lower bits
  - Each entry says whether or not the branch was taken the last time
91. BHT Prediction
- Useful only if the target address is known before the condition code is decided
- Two branch instructions with the same lower bits will share the same entry
92. Problems with the Simple BHT
- The clear benefit is that it is cheap and understandable
- Aliasing:
  - All branches with the same index (lower) bits reference the same BHT entry
  - Hence they mutually predict each other
  - There is no guarantee that a prediction is right, but it may not matter anyway
  - Avoidance: make the table bigger; this is OK since it is only a single bit-vector
    - This is a common cache improvement strategy as well; other cache strategies may also apply
- Consider how this works for loops:
  - The 1-bit scheme always mispredicts twice for every loop execution
    - Once is unavoidable, since the exit is always a surprise
    - But the previous exit also causes a mispredict on the first iteration of every new entry into the loop
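The twice-per-loop behavior is easy to reproduce. A small sketch (the function name and parameters are illustrative) that counts the mispredictions of a single 1-bit entry watching one loop's backward branch:

```c
#include <assert.h>

/* Counts the mispredictions a 1-bit predictor makes over `runs`
   executions of a loop whose backward branch is taken `iters - 1`
   times and then falls through once (the exit). */
static int mispredicts_1bit(int runs, int iters) {
    int pred = 0;   /* remembered last outcome: 0 = not taken */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < iters; i++) {
            int taken = (i < iters - 1);   /* exit on the last iteration */
            if (taken != pred) miss++;
            pred = taken;                  /* 1 bit: remember the last outcome */
        }
    }
    return miss;
}
```

Each pass through the loop costs two mispredictions: one at the exit, and one on the first iteration of the next entry, inherited from the previous exit.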
93. N-bit Predictors
- Idea: improve on the loop-entry problem
- Use an n-bit saturating counter
  - A 2-bit counter implies 4 states
  - Statistically, 2 bits gets most of the advantage
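A 2-bit saturating counter must be wrong twice before its prediction flips, which removes the loop-entry mispredict. A sketch of the counter running the same loop experiment (the names and the starting state are my choices):

```c
#include <assert.h>

/* 2-bit saturating counter: states 0..3; a value >= 2 predicts
   taken. Increment on taken, decrement on not taken, saturating
   at 0 and 3. */
static int mispredicts_2bit(int runs, int iters) {
    int ctr = 3;    /* start in "strongly taken" for illustration */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < iters; i++) {
            int taken = (i < iters - 1);        /* exit on the last iteration */
            if ((ctr >= 2) != taken) miss++;    /* predict taken when ctr >= 2 */
            if (taken) { if (ctr < 3) ctr++; }  /* saturating update */
            else       { if (ctr > 0) ctr--; }
        }
    }
    return miss;
}
```

A single exit only drops the counter from 3 to 2, so the next loop entry is still predicted taken: one mispredict per loop execution instead of two.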
94. BHT Accuracy
- A 4K-entry BPB with 2-bit entries: misprediction rates on SPEC89
- We mispredict because either:
  - We made the wrong guess for that branch, or
  - We got the branch history of the wrong branch when indexing the table (aliasing)
95. BHT Accuracy vs. BHT Size
- With a 4096-entry table, programs vary from 1% mispredictions (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- 4096 entries is about as good as an infinite table (measured on the Alpha 21164)
96. Improving the Prediction Strategy by Correlating Branches
- Consider the worst case for the 2-bit predictor:
  - if (aa == 2) then aa = 0
  - if (bb == 2) then bb = 0
  - if (aa != bb) then whatever
  - If the first two branches are untaken (both assignments execute), then aa == bb == 0 and the third branch will always be taken
  - Single-level predictors can never capture this case
- Correlating or 2-level predictors:
  - Correlation: what happened on the last branch
    - Note that the last correlator branch may not always be the same static branch
  - Predictor: which way to go
  - 4 possibilities: which way the last one went chooses the prediction
    - (last taken, last not taken) x (predict taken, predict not taken)
97. The Worst Case for the 2-bit Predictor
- Source code:
  - if (aa == 2)
  -     aa = 0;
  - if (bb == 2)
  -     bb = 0;
  - if (aa != bb) { ... }
- MIPS code (aa in R1, bb in R2):
  - DSUBUI R3, R1, #2
  - BNEZ   R3, L1        ; branch b1 (aa != 2)
  - DADD   R1, R0, R0    ; aa = 0
  - L1: DSUBUI R3, R2, #2
  - BNEZ   R3, L2        ; branch b2 (bb != 2)
  - DADD   R2, R0, R0    ; bb = 0
  - L2: DSUBU R3, R1, R2
  - BEQZ   R3, L3        ; branch b3 (aa == bb)
- If the first two branches are untaken, the third will always be taken
98. Correlating Branches
- Hypothesis: recently executed branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
- In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
- The old 2-bit BHT is then a (0,2) predictor
99. Example of Correlating Branch Predictors
- BNEZ   R1, L1         ; branch b1 (d != 0)
- DADDIU R1, R0, #1     ; d == 0, so d = 1
- L1: DADDIU R3, R1, #-1
- BNEZ   R3, L2         ; branch b2 (d != 1)
- ...
- L2: ...
100. Example of Correlating Branch Predictors (Cont.)
101. Example of Correlating Branch Predictors (Cont.)
102. In General: the (m,n) BHT (Prediction Buffer)
- p bits of the branch address index the buffer's 2^p rows
- The last m branch outcomes form the global branch history
- Each entry is an n-bit predictor
- Total bits for the (m,n) BHT prediction buffer: 2^m x 2^p x n
  - 2^m banks of memory are selected by the global branch history (which is just a shift register), e.g., as a column address
  - p bits of the branch address select the row
  - The n predictor bits in the selected entry make the decision
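The indexing and sizing above can be sketched directly. Here m = 2, p = 5, n = 2, matching the (2,2) configuration on the next slide (the helper names are mine):

```c
#include <assert.h>

/* Sketch of an (m,n) correlating predictor: a global history shift
   register of the last M outcomes selects one of 2^M banks, and P
   low bits of the branch PC select the row. */
#define M 2
#define P 5
#define NBITS 2

static unsigned char table[1 << M][1 << P];  /* n-bit counters, 0..3 */
static unsigned ghr;                          /* global history shift register */

static int predict(unsigned pc) {
    return table[ghr & ((1u << M) - 1)][pc & ((1u << P) - 1)]
           >= (1u << (NBITS - 1));            /* top half of range = taken */
}

static void update(unsigned pc, int taken) {
    unsigned char *c = &table[ghr & ((1u << M) - 1)][pc & ((1u << P) - 1)];
    if (taken) { if (*c < 3) (*c)++; }        /* saturating n-bit counter */
    else       { if (*c > 0) (*c)--; }
    ghr = (ghr << 1) | (taken ? 1u : 0u);     /* shift the outcome into history */
}

/* Total predictor bits: 2^m banks x 2^p rows x n bits. */
static int total_bits(void) { return (1 << M) * (1 << P) * NBITS; }
```

For this configuration the formula gives 4 x 32 x 2 = 256 bits of predictor state.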
103. (2,2) Predictor Implementation
- 4 banks, each with 32 2-bit predictor entries
- p = 5, m = 2, n = 2
- The 5 address bits select one of the 32 rows
104. Accuracy of Different Schemes
105. Tournament Predictors
- Adaptively combine local and global predictors
- Multiple predictors:
  - One based on global information: the results of the m most recently executed branches
  - One based on local information: the results of past executions of the current branch instruction
- A selector chooses which predictor to use:
  - A 2-bit saturating counter, incremented whenever the "predicted" predictor is correct and the other predictor is incorrect, and decremented in the reverse situation
- Advantage: the ability to select the right predictor for the right branch
- Example: the Alpha 21264 branch predictor (pp. 207-209)
106. State Transition Diagram for a Tournament Predictor
(Diagram: four states, two labeled "use predictor 1" and two labeled "use predictor 2". Edges are labeled with the correctness pair predictor1/predictor2: 1/0 moves toward the "use predictor 1" states, 0/1 moves toward the "use predictor 2" states, and 0/0 or 1/1 leaves the state unchanged.)
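The selector behaves like a 2-bit saturating counter indexed by the correctness pair. A sketch (the state encoding and direction convention here are my choices; real selectors, such as the 21264's, keep a table of such counters):

```c
#include <assert.h>

/* Tournament selector sketch: a 2-bit counter with states 0..3.
   States 0-1 mean "use predictor 1"; states 2-3 mean "use
   predictor 2". */
static int which_predictor(int sel) { return sel <= 1 ? 1 : 2; }

/* Move toward whichever predictor was right when exactly one was;
   0/0 and 1/1 outcomes leave the state unchanged. */
static int update_selector(int sel, int p1_correct, int p2_correct) {
    if (p1_correct && !p2_correct && sel > 0) sel--;   /* toward predictor 1 */
    if (!p1_correct && p2_correct && sel < 3) sel++;   /* toward predictor 2 */
    return sel;
}
```

Because two consecutive "loser wins" outcomes are needed to cross the midpoint, a single lucky guess by the other predictor does not switch the choice.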
107. Fraction of Predictions Coming from the Local Predictor (SPEC89)
108. Misprediction Rate Comparison
109. Branch Target Buffer/Cache
- Goal: reduce the branch penalty to 0
  - We need to know the target address by the end of IF
  - But the instruction is not even decoded yet, so use the instruction address rather than waiting for the decode
  - If the prediction works, the penalty goes to 0!
- BTB idea: a cache that stores taken branches (there is no need to store untaken ones)
  - The match tag is the instruction address, compared with the current PC
  - The data field is the predicted PC
- We may want to add a predictor field
  - To avoid the mispredict-twice-on-every-loop phenomenon
  - This adds complexity, since we now have to track untaken branches as well
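A direct-mapped version of the idea fits in a few lines (sizes and names are illustrative; real BTBs are typically set-associative and may carry a predictor field as well):

```c
#include <assert.h>

/* Sketch of a direct-mapped branch-target buffer: indexed by low PC
   bits, tagged with the full PC, storing the predicted target of a
   taken branch. Returns 0 (meaning "no prediction") on a miss. */
#define BTB_ENTRIES 16

struct btb_entry { unsigned tag; unsigned target; int valid; };
static struct btb_entry btb[BTB_ENTRIES];

/* Probed with the fetch PC during IF, before the instruction is
   even known to be a branch. */
static unsigned btb_lookup(unsigned pc) {
    const struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    return (e->valid && e->tag == pc) ? e->target : 0;
}

/* Inserted/updated when a taken branch resolves. */
static void btb_update(unsigned pc, unsigned target) {
    struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    e->tag = pc; e->target = target; e->valid = 1;
}
```

The full-PC tag is what distinguishes a BTB from a BHT: a hit identifies this exact branch, so the fetched target can be used immediately.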
110. Branch Target Buffer/Cache: Illustration
111. Changes in DLX to Incorporate the BTB
112. Penalties Using This Approach for MIPS/DLX
- Note:
  - Wrong prediction: 1 CC to update the BTB plus 1 CC to restart fetching
  - Not found and taken: 2 CC to update the BTB
- For more complex pipeline designs, the penalties may be higher
113. Branch Penalty CPI
- Prediction accuracy is 90% (so the incorrect-prediction rate is 10%)
- The hit rate in the buffer is 90%
- The taken-branch frequency is 60%
- Branch penalty = (buffer hit rate x incorrect prediction rate x 2) + ((1 - buffer hit rate) x taken-branch frequency x 2)
  = (0.9 x 0.1 x 2) + (0.1 x 0.6 x 2) = 0.18 + 0.12 = 0.30 cycles per branch
- The branch penalty for delayed branches is about 0.5 cycles per branch
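The arithmetic generalizes to a one-line helper (a sketch; the 2-cycle cost for both miss cases follows the penalty table above):

```c
#include <assert.h>
#include <math.h>

/* Expected branch penalty in cycles: a BTB hit with a wrong
   prediction and a taken branch that misses the buffer both cost
   `penalty_cycles`; other cases cost nothing. */
static double branch_penalty(double hit_rate, double mispredict_rate,
                             double taken_freq, double penalty_cycles) {
    return hit_rate * mispredict_rate * penalty_cycles
         + (1.0 - hit_rate) * taken_freq * penalty_cycles;
}
```

Plugging in the slide's numbers (0.9, 0.1, 0.6, 2) reproduces the 0.30-cycle result, noticeably better than the roughly 0.5 cycles quoted for delayed branches.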
114. Return Address Predictor
- Indirect jumps: jumps whose destination address varies at run time
  - Indirect procedure calls, select or case statements, procedure returns
  - In the SPEC89 benchmarks, 85% of indirect jumps are procedure returns
- The accuracy of a BTB for procedure returns is low
  - If the procedure is called from many places, and the calls from one place are not clustered in time
- Use a small buffer of return addresses operating as a stack
  - Cache the most recent return addresses
  - Push a return address at a call, and pop one off at a return
  - If the cache is sufficiently large (at least the maximum call depth), prediction is perfect
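The push/pop discipline is only a few lines. A sketch (the depth and names are illustrative; a real return-address stack typically wraps around rather than saturating when full):

```c
#include <assert.h>

/* Sketch of a return-address stack (RAS): push the fall-through PC
   on a call, pop a predicted target on a return. */
#define RAS_DEPTH 8

static unsigned ras[RAS_DEPTH];
static int ras_top;   /* number of valid entries */

static void ras_push(unsigned return_pc) {
    if (ras_top < RAS_DEPTH) ras[ras_top++] = return_pc;
}

static unsigned ras_pop(void) {
    return ras_top > 0 ? ras[--ras_top] : 0;   /* 0 = no prediction */
}
```

As long as call depth never exceeds the stack depth, every popped address matches the actual return target, regardless of how many call sites the procedure has.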
115. Dynamic Branch Prediction Summary
- Branch history table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
- Branch target buffer: include branch-address prediction
- The penalty can be reduced further by fetching instructions from both the predicted and unpredicted directions
  - This requires dual-ported memory or an interleaved cache → HW cost
  - Caching addresses or instructions from multiple paths in the BTB
116. Taking Advantage of More ILP with Multiple Issue (Section 3.6)
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
117. Getting CPI < 1: Issuing Multiple Instructions per Cycle
- Superscalar:
  - Issues varying numbers of instructions per clock
  - Constrained by hazard-style issue checks
  - Scheduling: static (by the compiler) or dynamic (hardware support for some form of Tomasulo)
- VLIW (very long instruction word):
  - Issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among the instructions explicitly indicated by the instruction
  - Also known as EPIC: explicitly parallel instruction computing
  - Scheduling is mostly static
- (Figure: example VLIW slots: Int/Br, Int/Ld-St, FP, FP mul/div.)
118. Five Approaches in Use for Multiple-Issue Processors
119. Statically Scheduled Superscalar Processors
- The HW might issue 0 to 8 instructions in a clock cycle
- Instructions issue in program order
- Pipeline hazards are checked at issue time:
  - Among the instructions being issued in a given clock cycle
  - Between the issuing instructions and all those still in execution
- If a data or structural hazard occurs, only the instructions preceding that one in the instruction sequence will be issued (dynamic issue)
- The issue stage is complex
  - It can be split and pipelined, but that results in higher branch penalties
- Instruction issue is likely to be one limitation on the clock rate of superscalar processors
120. A Superscalar 2-Issue MIPS
- Very similar to the HP 7100
- Requires fetching and decoding 64 bits of instructions per cycle
- Which instructions?
  - 1 integer load, store, branch, or integer ALU operation
  - 1 FP operation
- Why issue one integer and one FP operation?
  - It eliminates most hazard possibilities → simplifies the logic
  - The integer and FP register sets are different
  - The integer and FP FUs are different
  - The only difficulty is when the "integer" instruction is an FP load, store, or move
    - This needs an additional read/write port on the FP registers
    - And may create a RAW hazard between the pair
121. Superscalar 2-Issue MIPS (Cont.)
- Pipe stages (each pair issues together; successive pairs start one cycle later):
  - Int. instruction: IF ID EX MEM WB
  - FP instruction:   IF ID EX MEM WB
  - Int. instruction:    IF ID EX MEM WB
  - FP instruction:      IF ID EX MEM WB
  - Int. instruction:       IF ID EX MEM WB
  - FP instruction:         IF ID EX MEM WB
- Instruction placement is not restricted in modern processors
- The 1-cycle load delay expands to 3 instruction slots in the superscalar
  - The instruction in the right half of the pair cannot use the result, nor can the instructions in the next issue slot
- Must have pipelined FP FUs or multiple independent FP FUs
122. Consider Adding a Scalar to a Vector
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)     ; F0 = vector element
      ADD.D  F4, F0, F2    ; add scalar from F2
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, #-8   ; decrement pointer 8 bytes (DW)
      BNE    R1, R2, Loop  ; branch if R1 != R2

Assume 8(R2) is the last element to operate on.
123. Unscheduled Loop
124. Unrolled Loop that Minimizes Stalls (Scalar Pipeline)
- 1  Loop: L.D    F0, 0(R1)
- 2        L.D    F6, -8(R1)
- 3        L.D    F10, -16(R1)
- 4        L.D    F14, -24(R1)
- 5        ADD.D  F4, F0, F2
- 6        ADD.D  F8, F6, F2
- 7        ADD.D  F12, F10, F2
- 8        ADD.D  F16, F14, F2
- 9        S.D    F4, 0(R1)
- 10       S.D    F8, -8(R1)
- 11       DADDUI R1, R1, #-32
- 12       S.D    F12, 16(R1)   ; 16 = -16 + 32, since R1 is already decremented
- 13       BNE    R1, R2, Loop
- 14       S.D    F16, 8(R1)    ; 8 = -24 + 32
- 14 clock cycles, or 3.5 per iteration
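At the source level, the same transformation looks like the sketch below: a 4x-unrolled version of the slide-122 loop x[i] = x[i] + s (the function name and the assumption that the trip count is a multiple of 4 are mine):

```c
#include <assert.h>

#define N 1000

/* Source-level counterpart of the unrolled MIPS loop: four
   independent adds per iteration, one index decrement, one branch.
   N is assumed to be a multiple of 4. */
static void add_scalar_unrolled(double *x, double s) {
    for (int i = N - 1; i >= 3; i -= 4) {
        x[i]     += s;
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
    }
}
```

Unrolling amortizes the DADDUI and BNE overhead over four elements, which is how the schedule above reaches 3.5 cycles per element.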
125. Unrolled Loop for the Superscalar (Unrolled 5 Times)
1 Loop: L.D   F0, 0(R1)
2       L.D   F6, -8(R1)
3       L.D   F10, -16(R1)
4       ADD.D F4, F0, F2
5       L.D   F14, -24(R1)
6       ADD.D F8, F6, F2
7       L.D   F18, -32(R1)
...
126. Loop Unrolling in the Superscalar
- Unrolled 5 times to avoid delays
- Integer instruction / FP instruction / clock cycle:
  - Loop: L.D F0, 0(R1)       |                     | 1
  -       L.D F6, -8(R1)      |                     | 2
  -       L.D F10, -16(R1)    | ADD.D F4, F0, F2    | 3
  -       L.D F14, -24(R1)    | ADD.D F8, F6, F2    | 4
  -       L.D F18, -32(R1)    | ADD.D F12, F10, F2  | 5
  -       S.D F4, 0(R1)       | ADD.D F16, F14, F2  | 6
  -       S.D F8, -8(R1)      | ADD.D F20, F18, F2  | 7
  -       S.D F12, -16(R1)    |                     | 8
  -       S.D F16, -24(R1)    |                     | 9
  -       DADDUI R1, R1, #-40 |                     | 10
  -       BNE R1, R2, Loop    |                     | 11
  -       S.D F20, 8(R1)      |                     | 12   ; 8 = -32 + 40
- 12 clocks, or 2.4 clocks per iteration
127. Seems Simple?
- Registers:
  - Each pipe has its own set, due to the separation of the FP and GP registers
  - This also inherently separates the data dependences into 2 classes
  - The exception is FP loads/stores: the effective-address calculation is an integer operation, but the destination register is an FP register
- The FP pipe has longer latency
  - Exacerbated by operation latency differences (e.g., multiply 6 cycles, divide 24 cycles)
  - The result is that completion is out of order
  - This complicates hazard control within the FP execution pipe
  - Pipeline the FP ALU or use multiple FP ALUs
128. Problems So Far
- Look at the opcodes and see whether the pair is an appropriate issue pair
- Some integer operations are a problem:
  - FP register loads/stores, since the other instruction may be dependent
  - A stall will result. Options?
    - Force FP loads, stores, and moves to issue by themselves
      - Safe but suboptimal, since the other instruction may still be independent
    - Or add more ports to the FP register file, such as separate read and write ports
      - Still must stall the second instruction if it is dependent
129. Other Issues
- Hazard detection
  - Similar to the normal pipeline model, but needs a larger set of bypass paths (twice as many instructions in the pipeline)
- Load-use delay
  - Assume 1 cycle → it now covers 3 instruction slots
- Branch delay
  - Do branches have to be issued by themselves?
  - The 1-instruction branch delay now holds 3 instructions as well
- Instruction scheduling by the compiler
  - Mandatory for issuing independent operations in a superscalar
  - Increasingly important as the issue width goes up
130Dynamic Scheduling In SuperScalar
- Use Tomasulo Algorithm
- Issue two arbitrary instructions per clock and
let the RS sort it out - but still can't issue a dependent pair
- Two examples: pp. 221-224
- How to issue multiple arbitrary instructions per
clock? - Run the issue step in half a clock cycle (i.e.,
pipeline it) - Build the logic necessary to handle two
instructions at once, including any possible
dependences between the instructions - Modern SS processors that issue four or more
instructions per clock often include both
approaches
131Dynamic Scheduling in Superscalar (Cont.)
- Only FP loads might cause dependency between
integer and FP issue - Replace the load reservation station with a load
queue - Operands must be read in the order they are
fetched - Load checks addresses in Store Queue to avoid RAW
violation - Store checks addresses in Load Queue to avoid
WAR, WAW - Called decoupled architecture
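The address checks above can be sketched as follows. This is an illustrative model, not real decoupled-architecture hardware; the class and method names are assumptions.

```python
# Minimal sketch of memory disambiguation with a store queue and load queue,
# as described above: loads check the store queue (RAW), stores check the
# load queue (WAR/WAW). All names here are illustrative assumptions.
class MemQueues:
    def __init__(self):
        self.store_queue = []  # (address, value) of stores not yet committed
        self.load_queue = []   # addresses of loads in flight

    def issue_load(self, addr):
        # RAW check: a pending store to the same address must supply its
        # value (forwarding) before this load may read memory.
        for st_addr, st_val in self.store_queue:
            if st_addr == addr:
                return ("forward", st_val)
        self.load_queue.append(addr)
        return ("read_memory", None)

    def issue_store(self, addr, value):
        # WAR/WAW check: an earlier in-flight load to the same address
        # must finish before this store may write.
        if addr in self.load_queue:
            return "stall"
        self.store_queue.append((addr, value))
        return "queued"

q = MemQueues()
q.issue_store(0x100, 7)
print(q.issue_load(0x100))   # ('forward', 7): RAW satisfied by forwarding
print(q.issue_load(0x200))   # ('read_memory', None): no conflict
```

Note that operands are still consumed in fetch order: the queues only let independent accesses proceed while conflicting ones forward or stall.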
132Example
- Can issue two arbitrary operations per clock
- One integer FU for ALU operation and
EA-calculation - A separate pipelined FP FU
- One memory unit, 2 CDBs
- No delayed branch; perfect branch prediction
- Fetch and issue as if the branch predictions are
always correct - Latency between a source instruction and an
instruction consuming the result (given the
presence of the Write Result stage) - 1 CC for integer ALU operations
- 2 CC for loads
- 3 CC for FP add
133Note
- The WR stage does not apply to either stores or
branches - For L.D and S.D, the execution cycle is the EA
calculation - For branches, the execution cycle shows when the
branch condition can be evaluated and the
prediction checked - Any instruction following a branch cannot start
execution until after the branch condition has
been evaluated - If two instructions could use the same FU at the
same point (structural hazard), priority is given
to the older instruction
134Consider Adding a Scalar to a Vector
- for (i=1000; i > 0; i=i-1) x[i] = x[i] + s;
Loop: L.D F0,0(R1)      ;F0 = vector element
      ADD.D F4,F0,F2    ;add scalar in F2
      S.D F4, 0(R1)     ;store result
      DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
      BNE R1, R2, Loop  ;branch R1 != R2
135Execution Timing
136Execution Timing (Cont.)
137Example Result
- Result
- Issue IPC = 5/3 = 1.67; instruction execution
rate = 15/16 = 0.94 - Only one load, store, and integer ALU operation
can execute at a time - Load of the next iteration computes its memory
address before the store of the current iteration - Only a single CDB is actually required
- Integer operations become the bottleneck
- Many integer operations, but only one integer ALU
- One stall cycle each loop iteration due to a
branch hazard
138Another Example Execution Timing
Separate integer FU for EA calculation and ALU
operations
139Execution Timing (Cont.)
140Note
- Result
- Issue IPC = 5/3 = 1.67; instruction execution
rate = 15/11 = 1.36 - A second CDB is needed
- This example has a higher instruction execution
rate but lower efficiency as measured by the
utilization of FU
141Limitations on Multiple Issue
- How much ILP can be found in the application? A
fundamental problem - Requires deep unrolling - hence big focus on
loops - Compiler complexity goes way up
- Deep unrolling needs lots of registers
- Increased HW cost
- Increased ports for register files
- Cost of scoreboarding (e.g. Tomasulo data
structure) and forwarding paths - Memory bandwidth requirement goes up
- Most have gone with separate I and D ports
already - Newest approaches are to go for multiple D ports
as well - big time expense!! (PA- 8000) - Branch prediction by HW is an absolute must HW
Speculation (Sect. 3.7)
1423.7 Hardware-Based Speculation
143Overview
- Overcome control dependence by speculating on the
outcome of branches and executing the program as
if our guesses were correct - Fetch, issue, and execute instructions
- Need mechanisms to handle the situation when the
speculation is incorrect - Dynamic scheduling only fetch and issue such
instructions
144Key Ideas
- Dynamic branch prediction to choose which
instructions to execute - Speculation to allow the speculated blocks to
execution before the control dependences are
resolved - And undo the effects of an incorrectly speculated
sequence - Dynamic scheduling to deal with the scheduling of
different combinations of basic blocks (Tomasulo
style approach)
145HW Speculation Approach
- Issue → execute → write result → commit
- Commit is the point where the operation is no
longer speculative - Allow out of order execution
- Require in-order commit
- Prevent speculative instructions from performing
destructive state changes (e.g. memory write or
register write) - Collect pre-commit instructions in a reorder
buffer (ROB) - Holds completed but not committed instructions
- Effectively contains a set of virtual registers
to store the result of speculative instructions
until they are no longer speculative - Similar to reservation station ? And becomes a
bypass source
146The Speculative MIPS
The ROB replaces the store buffer
147The Speculative MIPS (Cont.)
- Need HW buffer for results of uncommitted
instructions reorder buffer (ROB) - 4 fields instruction type, destination field,
value field, ready field - ROB is a source of operands ? more registers like
RS - ROB supplies operands in the interval between
completion of instruction execution and
instruction commit - Use ROB number instead of RS to indicate the
source of operands when execution completes (but
not committed) - Once instruction commits, result is put into
register - As a result, its easy to undo speculated
instructions on mispredicted branches or on
exceptions
148ROB Fields
- Instruction type: branch, store, or register
operations - Destination field
- Unused for branches
- Memory address for stores
- Register number for load and ALU operations
(register operations) - Value: holds the value of the instruction result
until commit - Ready: indicates whether the instruction has
completed execution
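The four ROB fields above map directly onto a record type. This is an illustrative sketch; the class and field names are assumptions, not from the text.

```python
# Illustrative model of one reorder-buffer entry with the four fields
# listed above: instruction type, destination, value, and ready flag.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    itype: str                        # 'branch', 'store', or 'register'
    dest: Optional[Union[int, str]]   # None for branches; memory address for
                                      # stores; register number for load/ALU ops
    value: Optional[float] = None     # result, held here until commit
    ready: bool = False               # has the instruction completed execution?

entry = ROBEntry(itype="register", dest="F4")  # e.g., an ADD.D writing F4
entry.value, entry.ready = 3.5, True           # filled at Write Result
print(entry.ready)  # True: eligible to commit once it reaches the ROB head
```

Until commit, the `value` field is what makes the ROB act as a set of virtual registers and a bypass source.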
149Steps in Speculative Execution
- Issue (or dispatch)
- Get instruction from the instruction queue
- In-order issue if an RS AND a ROB slot are available;
otherwise, stall - Send operands to the RS if they are in a register or
ROB - Update Tomasulo DS and ROB
- The ROB no. allocated for the result is sent to
RS, so that the number can be used to tag the
result when it is placed on CDB - Execute
- RS waits, grabbing results off the CDB if necessary
- When all operands are there, execution happens
- Write Result
- Result posted to ROB via the CDB
- Waiting reservation stations can grab it as well
150Steps in Speculative Execution (Cont.)
- Commit (or graduate) instruction reaches the
ROB head - Normal commit when instruction reaches the ROB
head and its result is present in the buffer - Update the register and remove the instruction
from ROB - Store Update memory and remove the instruction
from ROB - Branch with incorrect prediction wrong
speculation - Flush ROB and the related FP OP queue (RS)
- Restart at the correct successor of the branch
- Remove the instruction from ROB
- Branch with correct prediction: finish the
branch - Remove the instruction from ROB
151Example
- The same example as Tomasulo without speculation.
Show the status tables when MUL.D is ready to go
to commit - L.D F6, 34(R2)
- L.D F2, 45(R3)
- MUL.D F0, F2, F4
- SUB.D F8, F6, F2
- DIV.D F10, F0, F6
- ADD.D F6, F8, F2
- Modified status tables
- Qj and Qk fields, and register status fields use
ROB (instead of RS) - Add Dest field to RS (ROB to put the operation
result)
152Figure 3.30
153Example Result
- Tomasulo without speculation
- SUB.D and ADD.D have completed (clock cycle 16,
slide 58) - Tomasulo with speculation
- No instruction after the earliest uncompleted
instruction (MUL.D) is allowed to complete - In-order commit
- Implication: a ROB with in-order instruction
commit provides precise exceptions - Precise exceptions: exceptions are handled in
program order
154Loop Example
- Loop L.D F0, 0(R1)
- MUL.D F4, F0, F2
- S.D F4, 0(R1)
- DADDUI R1,R1, -8
- BNE R1, R2, Loop
- Assume we have issued all the instructions in the
loop twice - Assume L.D and MUL.D from the first iteration
have committed and all others have completed
execution
155Figure 3.31
156Loop Example Observation
- Suppose the first BNE is not taken → flush the ROB
and begin fetching instructions from the other path
157Other Issues
- Performance is more sensitive to
branch prediction
- Prediction accuracy, mis-prediction detection,
and mis-prediction recovery increase in
importance - Precise exception
- Handled by not recognizing the exception until it
is ready to commit - If a speculative instruction raises an exception,
the exception is recorded in the ROB - Mispredicted branch → recorded exceptions are flushed as
well - If the instruction reaches the ROB head → take
the exception
158Figure 3.32
160Multiple Issue with Speculation
- Process multiple instructions per clock,
assigning RS and ROB entries to the instructions
instruction per cycle, must handle multiple
instruction commits per clock - Speculation helps significantly when a branch is
a key potential performance limitation - Speculation can be advantageous when there are
data-dependent branches, which otherwise would
limit performance - Depend on accurate branch prediction ? incorrect
speculation will typically harm performance
161Example
- Assume separate integer FUs for ALU operations,
effective address calculation, and branch
condition evaluation - Assume up to 2 instruction of any type can commit
per clock - Loop LD R2, 0(R1)
- DADDIU R2, R2, 1
- SD R2, 0(R1)
- DADDIU R1, R1, 4
- BNE R2, R3, LOOP
162No Speculation
Figures 3.33 and 3.34
163Speculation
164Example Result
- Without speculation
- L.D following BNE cannot start execution early
→ must wait until the branch outcome is determined - Completion rate falls behind the issue rate
rapidly; stalls occur when a few more iterations are
issued
- L.D following BNE can start execution early
because it is speculative
1653.8 Studies of The Limitations of ILP
166ILP Studies
- Perfect Hardware model - in the ideal infinite
cost case - Rename as much as you need
- Implies infinite virtual registers
- Hence - complete WAW or WAR insensitivity
- Branch prediction is perfect
- This will never happen in reality of course
- Jump prediction (even computed such as return)
are also perfect - Similarly unreal
- Perfect memory disambiguation
- Almost perfect is not too hard in practice
- Can issue an unlimited number of instructions at once;
no restriction on the types of instructions issued
→ unlimited FUs - One-cycle latency
167Let's Look at a Real Machine
- Alpha 21264: one of the most advanced
superscalar processors announced to date - Issues up to four instructions per clock, and
initiates execution on up to six - At most 2 memory references, among other
restrictions - Support a large set of renaming registers (41
integer and 41 FP) - Allow up to 80 instructions in execution
- Multicycle latencies
- Tournament-style branch predictor
168How to Measure
- A set of programs were compiled and optimized
with the standard MIPS optimizing compilers - Execute and produce a trace of the instruction
and data references - Perfect branch prediction and perfect alias
analysis are easy to do - Every instruction in the trace is then scheduled
as early as possible, limited only by the data
dependence - Including moving across branches
169What A Perfect Processor Must Do?
- Look arbitrarily far ahead to find a set of
instructions to issue, predicting all branches
perfectly - Rename all register uses to avoid WAW and WAR
hazards - Determine whether there are any dependences among
the instructions in the issue packet if so,
rename accordingly - Determine if any memory dependences exist among
the issuing instructions and handle them
appropriately - Provide enough replicated FUs to allow all the
ready instructions to issue
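The renaming task in the list above can be sketched concretely. This is an illustrative model with unlimited virtual registers (the perfect-machine assumption); the function and register names are assumptions for the example.

```python
# Sketch of register renaming: map each architectural destination to a
# fresh virtual register, so WAW and WAR hazards disappear and only true
# (RAW) dependences remain.
def rename(instrs):
    """instrs: list of (dest, [srcs]) over architectural registers."""
    mapping, counter, out = {}, 0, []
    for dest, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read current mapping
        counter += 1
        vreg = f"v{counter}"          # unlimited virtual registers
        mapping[dest] = vreg          # later reads of dest see the new name
        out.append((vreg, new_srcs))
    return out

# Two writes to r1 (a WAW hazard) get distinct names after renaming:
print(rename([("r1", ["r2"]), ("r1", ["r3"]), ("r4", ["r1"])]))
# [('v1', ['r2']), ('v2', ['r3']), ('v3', ['v2'])]
```

The same idea underlies the ROB's virtual registers: the tag an instruction carries is effectively its renamed destination.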
170ILP at the Limit
- How many instructions would issue on the perfect
machine every cycle? - gcc - 54.8
- espresso - 62.6
- li - 17.9
- fpppp - 75.2
- doduc - 118.7
- tomcatv - 150.1
- Limited only by the ILP inherent in the
benchmarks - Note
- Benchmarks are small codes
- More ILP tends to surface as the codes get bigger
- Huge amounts of loop parallelism in the SPECfp
codes
171Window Size
- The set of instructions that is examined for
simultaneous execution is called the window - The window size will be determined by the cost of
determining whether n issuing instructions have
any register dependences among them - In theory, this c