Title: Instruction-Level Parallelism and Its Dynamic Exploitation
1. Instruction-Level Parallelism and Its Dynamic Exploitation
2. Outline
- Instruction-Level Parallelism: Concepts and Challenges
- Overcoming Data Hazards with Dynamic Scheduling
- Dynamic Scheduling: Examples and the Algorithm
- Reducing Branch Penalties with Dynamic Hardware Prediction
- High-Performance Instruction Delivery
- Taking Advantage of More ILP with Multiple Issue
- Hardware-Based Speculation
- Studies of the Limitations of ILP
- Limitations on ILP for Realizable Processors
3. Instruction-Level Parallelism: Concepts and Challenges
4. Introduction
- Instruction-Level Parallelism (ILP): the potential execution overlap among instructions
  - Instructions are executed in parallel
  - Pipelining supports a limited sense of ILP
- This chapter introduces techniques to increase the amount of parallelism exploited among instructions
  - How to reduce the impact of data and control hazards
  - How to increase the ability of the processor to exploit parallelism
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
5. Approaches to Exploiting ILP
- Hardware approach: the focus of this chapter
  - Dynamic: at run time
  - Dominates the desktop and server markets
  - Pentium III and IV, Athlon, MIPS R10000/R12000, Sun UltraSPARC III, PowerPC 603, G3, and G4, Alpha 21264
- Software approach: the focus of the next chapter
  - Static: at compile time
  - Relies on compilers
  - Broader adoption in the embedded market
  - But includes IA-64 and Intel's Itanium
6. ILP Methods
- A combination of HW and SW/compiler methods
7. ILP within a Basic Block
- Basic block: the instructions between branch instructions
  - Instructions in a basic block are executed in sequence
  - Real code is a bunch of basic blocks connected by branches
- Note: dynamic branch frequency is between 15% and 25%
  - So basic block size is between 6 and 7 instructions
  - These instructions may depend on each other (data dependence)
  - Therefore there is probably little parallelism within a block
- To obtain substantial performance enhancement, exploit ILP across multiple basic blocks
  - The easiest target is the loop
  - Exploit parallelism among iterations of a loop (loop-level parallelism)
8. Loop-Level Parallelism (LLP)
- Consider adding two 1000-element arrays
- There is no dependence between data values produced in any iteration j and those needed in iteration j+n, for any j and n
  - Truly independent iterations
  - Independence means no stalls due to data hazards
- Basic idea: convert LLP into ILP
  - Unroll the loop, either statically by the compiler (next chapter) or dynamically by the hardware (this chapter)
x[1] = x[1] + y[1]; x[2] = x[2] + y[2]; ...; x[1000] = x[1000] + y[1000]

for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];
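Unrolling is the transformation that exposes this independence. A minimal sketch in C (the function names and the unroll factor of 4 are illustrative choices, not from the slides):

```c
#include <assert.h>

#define N 1000

/* The original loop: every iteration is independent, so there are
   no loop-carried data hazards. */
static void add_arrays(double *x, const double *y) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4 (N assumed divisible by 4): each iteration now
   contains four independent adds that a pipeline can overlap,
   converting loop-level parallelism into instruction-level
   parallelism. */
static void add_arrays_unrolled(double *x, const double *y) {
    for (int i = 0; i < N; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
}
```

Both versions compute the same result; the unrolled one simply offers the scheduler more independent work per iteration.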
9. Data Dependences and Hazards
10. Introduction
- If two instructions are independent, then:
  - They can execute simultaneously (in parallel) in a pipeline without stalls, assuming no structural hazards
  - Their execution order can be swapped
- Dependent instructions must be executed in order, or only partially overlapped in the pipeline
- Why check dependence?
  - To determine how much parallelism exists, and how that parallelism can be exploited
- Types of dependences: data, name, and control dependence
11. Data Dependence Analysis
- Instruction i is data dependent on instruction j if i uses a result produced by j
  - OR i uses a result produced by k, and k depends on j (a dependence chain)
- A dependence indicates a potential RAW hazard
  - Whether it induces a hazard and a stall depends on the pipeline organization
  - The possibility limits performance
- Dependences dictate the order in which instructions must be executed
  - This sets a bound on how much parallelism can be exploited
- Overcoming a data dependence:
  - Maintain the dependence but avoid the hazard by scheduling the code (HW or SW)
  - Eliminate the dependence by transforming the code (by the compiler)
12. Data Dependence Example
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)
-       DADDUI R1, R1, #-8
-       BNE    R1, R2, Loop
If two instructions are data dependent, they
cannot execute simultaneously or be completely
overlapped.
13. Data Dependence through Memory Locations
- Dependences that flow through memory locations are more difficult to detect
- Addresses may refer to the same location but look different
  - 100(R4) and 20(R6) may be identical
- The effective address of a load or store may change from one execution of the instruction to another
  - Two executions of the same instruction L.D F0, 20(R4) may refer to different memory locations, because the value of R4 may change between the two executions
14. Name Dependence
- Occurs when two instructions use the same register name or memory location without a data dependence
- Let i precede j in program order:
  - i is antidependent on j when j writes a register that i reads
    - Indicates a potential WAR hazard
  - i is output dependent on j if they both write to the same register
    - Indicates a potential WAW hazard
- These are not true data dependences: no value is transmitted between the instructions
- The instructions can execute simultaneously or be reordered if the name used in the instructions is changed so that they do not conflict
15. Name Dependence Example
- L.D   F0, 0(R1)
- ADD.D F4, F0, F2
- S.D   F4, 0(R1)
- L.D   F0, -8(R1)    ; antidependence on F0 with the first ADD.D
- ADD.D F4, F0, F2    ; output dependence on F4 with the first ADD.D
- Register renaming removes these conflicts; renaming can be performed either by the compiler or by hardware
16. Register Renaming and WAW/WAR
- Before renaming:
  - DIV.D F0, F2, F4
  - ADD.D F6, F0, F8
  - S.D   F6, 0(R1)
  - SUB.D F8, F10, F14
  - MUL.D F6, F10, F8
- After renaming (S and T are new names):
  - DIV.D F0, F2, F4
  - ADD.D S, F0, F8
  - S.D   S, 0(R1)
  - SUB.D T, F10, F14
  - MUL.D F6, F10, T
- Hazards in the original code:
  - WAW: ADD.D/MUL.D (F6)
  - WAR: ADD.D/SUB.D (F8), S.D/MUL.D (F6)
  - RAW: DIV.D/ADD.D (F0), ADD.D/S.D (F6), SUB.D/MUL.D (F8)
- Renaming eliminates the WAW and WAR hazards; the RAW dependences remain
17. Control Dependence
- if (p1) { S1; }
  if (p2) { S2; }
- Since branches are conditional, some instructions will be executed and others will not
  - Instructions before the branch don't matter
  - The only possibility is between a branch and the instructions that follow it
- Two obvious constraints to maintain control dependence:
  - An instruction controlled by the branch cannot be moved before the branch (since it would then be uncontrolled)
  - An instruction not controlled by the branch cannot be moved after the branch (since it would then be controlled)
- Note:
  - Transitive control dependence is also a factor
  - In simple pipelines, order is preserved anyway, so this is no big deal
18. Control Dependence (Cont.)
- What's the big deal?
  - If there is no data dependence, move something before the branch and trash the result if the branch goes the wrong way
  - Note: this only works when the result goes to a register that becomes dead (the result is never used) if the wrong way is taken
- However, two important side effects affect correctness:
  - Exception behavior must remain intact
    - Sometimes this is relaxed, but it probably should not be
  - Branches effectively set up conditional data flow
    - Data flow is definitely real, so if we do the move, we had better make sure it does not change the data flow
- So it can be done, but care must be taken
  - Enter HW and SW speculation and conditional instructions
19. Control Dependence (Cont.)
- Control dependence is not the critical property that must be preserved
  - We may execute instructions that should not have been executed, thereby violating the control dependence, as long as the results remain correct
  - Example: a wrong guess in a delayed branch (filled from the target or the fall-through)
- Maintaining the control and data dependences can prevent raising new exceptions:
  - DADDU R2, R3, R4
  - BEQZ  R2, L1
  - LW    R1, 0(R2)
  - L1: ...
- No data dependence prevents us from interchanging BEQZ and LW; it is only the control dependence
- But interchanging BEQZ and LW may raise a memory protection exception (the load would execute even when R2 is 0)
20. Control Dependence (Cont.)
- By preserving the control dependence of the OR on the branch, we prevent an illegal change to the data flow:
  - DADDU R1, R2, R3
  - BEQZ  R4, L1
  - DSUBU R1, R5, R6
  - L1: ...
  - OR    R7, R1, R8
- The OR reads R1, whose value depends on whether the branch is taken
21. Control Dependence (Cont.)
- If R4 were unused (dead) after skipnext, and DSUBU could not generate an exception, we could move DSUBU before the branch, since the data flow cannot be affected
- If the branch is taken, DSUBU will execute but its result will be useless
  - DADDU R1, R2, R3
  - BEQZ  R12, skipnext
  - DSUBU R4, R5, R6
  - DADDU R5, R4, R9
  - skipnext: OR R7, R8, R9
22. Overcoming Data Hazards with Dynamic Scheduling
23. Introduction
- Approaches used to avoid data hazards in Appendix A and Chapter 4:
  - Forwarding or bypassing: keep a dependence from resulting in a hazard
  - Stalling: stall the instruction that uses the result, and all successive instructions
  - Compiler (pipeline) scheduling: static scheduling
- In-order instruction issue and execution:
  - Instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instructions can proceed
  - If there is a dependence between two closely spaced instructions in the pipeline, this leads to a hazard, and a stall results
24. Dynamic Scheduling vs. Static Scheduling
- Dynamic scheduling: avoid stalling when dependences are present
- Static scheduling: minimize stalls by separating dependent instructions so that they do not lead to hazards
25. Dynamic Scheduling Idea
- Dynamic scheduling: the HW rearranges the instruction execution to avoid stalling when dependences that could generate hazards are present
- Advantages:
  - Handles some dependences that are unknown at compile time
  - Simplifies the compiler
  - Code compiled for one machine runs well on another
- Approaches:
  - Scoreboarding (Appendix A)
  - Tomasulo's approach (the focus of this part)
- Assume multiple instructions can be in execution at the same time (requires multiple FUs, pipelined FUs, or both)
26. Dynamic Scheduling
- Dynamic instruction reordering
  - In-order issue, but out-of-order execution (and thus out-of-order completion)
- Consider:
  - DIV.D F0, F2, F4
  - ADD.D F10, F0, F8
  - SUB.D F12, F8, F14
- DIV.D has a long latency (say, 20 pipeline stages)
- ADD.D has a data dependence on F0; SUB.D does not
  - Stalling ADD.D will stall SUB.D too, so swap them
  - The compiler might have done this, but so could the HW
- Problem: might this raise new exceptions?
  - For now, let's ignore precise exceptions (Section 3.7 and Appendix A)
27. Dynamic Scheduling (Cont.)
- Key idea: allow instructions behind a stall to proceed
  - SUB.D can proceed even when ADD.D is stalled
- Out-of-order execution divides the ID stage:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until there are no data hazards, then read the operands
- All instructions pass through the issue stage in order
- But instructions can be stalled or bypass each other in the read-operands stage, and thus enter execution out of order
(Figure: the five-stage pipeline IF, ID, EX, MEM, WB, with the ID stage split into Issue and Read operands.)
28. WAR and WAW Hazards May Arise with Dynamic Scheduling
- A more interesting code fragment:
  - DIV.D F0, F2, F4
  - ADD.D F6, F0, F8
  - SUB.D F8, F10, F14
  - MUL.D F6, F10, F8
- Note the following:
  - ADD.D can't start until DIV.D completes (data dependence on F0)
  - SUB.D does not need to wait, but it can't post its result to F8 until ADD.D reads F8; otherwise a WAR hazard results (antidependence)
  - MUL.D does not need to wait, but it can't post its result to F6 until ADD.D writes F6; otherwise a WAW hazard results (output dependence)
- Both WAW and WAR hazards can be handled by scoreboarding (Appendix A) and by Tomasulo's approach
29. Tomasulo's Approach
- The original idea was developed for the IBM 360/91 to overcome:
  - Limited compiler scheduling (only 4 double-precision FP registers)
  - Long memory-access and FP delays
- Goal: high performance without special compilers
- Why study it? It led to the Alpha 21264, HP 8000, MIPS R10000, Pentium II, PowerPC 604, ...
- Key ideas:
  - Track data dependences to allow execution as soon as operands are available → minimize RAW hazards
  - Rename registers to avoid WAR and WAW hazards
30. Key Idea
- Pipelined or multiple functional units (FUs)
- Each FU has multiple reservation stations (RS)
- Issue to reservation stations is in order (in-order issue)
- An RS starts whenever it has collected its source operands from the real registers (RR): hence out-of-order execution
- Reservation stations act as virtual registers (VR) that remove WAW- and WAR-induced stalls
  - The RS fetches operands from the RR and stores them into the VR
- Since there can be more virtual registers than real registers, the technique can even eliminate hazards arising from name dependences that could not be eliminated by a compiler
31. Basic Structure of a Tomasulo-Based MIPS Processor
(Figure: the reservation stations serve as virtual registers.)
32. Reservation Station Duties
- Each RS holds an instruction that has been issued and is awaiting execution at an FU, together with either the operand values or the names of the RSs that will provide the operand values
- The RS fetches operands from the CDB as they appear
- When all operands are present, it enables the associated functional unit to execute
- Since values are not actually written to registers in the meantime, no WAW or WAR hazards are possible
33. Register Renaming in Tomasulo's Approach
- Register renaming is provided by the reservation stations (RS) and the instruction issue logic
- Each functional unit has several reservation stations
- An RS fetches and buffers an operand as soon as it is available
  - This eliminates the need to get the operand from a register → avoids WAR
- Pending instructions designate the RS that will provide their input
- When successive writes to a register overlap in execution, only the last one actually updates the register → avoids WAW
34. RS and Tomasulo's Approach
- Hazard detection and execution control are distributed
  - The information held in the RSs at each functional unit determines when an instruction can begin execution at that unit
- Results are passed directly to functional units rather than through the registers
  - Essentially similar to bypass logic
  - With broadcast capability, since results pass on the CDB (common data bus)
35. Instruction Steps
- Issue (in order, due to the queue structure)
  - Get the next instruction from the instruction queue
  - Issue if there is an empty RS or an available buffer (loads, stores)
  - If the operands are in registers, send them to the reservation station
  - Otherwise stall, due to the structural hazard
- Execute (may be out of order)
  - When all operands are available, execute
  - If not, monitor the CDB to grab each desired operand when it is produced
  - Effectively deals with RAW hazards
- Write result (also may be out of order)
  - When the result is available, write it to the CDB
  - From the CDB it goes to any waiting RS, to the registers, and to the store buffers
- Note: the renaming model prevents WAW and WAR hazards as a side effect
36. Basic Structure of a Tomasulo-Based MIPS Processor
(Figure: the reservation stations serve as virtual registers.)
37. Hazard Handling
- Structural hazards are checked at two points:
  - At dispatch: a free RS of the appropriate type must be available
  - When operands are ready: multiple RSs may compete for issue to the shared execution unit
    - Program order is used as the basis for the arbitration
- RAW, WAR, and WAW hazards are handled by the mechanisms above
- To preserve exception behavior, instructions should not be allowed to execute if a branch that is earlier in program order has not yet completed
  - Implemented by preventing any instruction from leaving the issue step if there is a pending branch already in the pipeline
38. Virtual Registers
- A tag field is associated with each datum
  - The tag field is a virtual register ID
  - It corresponds to the reservation station and load buffer names
- The motivation was the 360's register weakness
  - It had only 4 FP registers (as noted earlier)
  - The additional renamed virtual registers were a significant bonus
39. Tomasulo Structure
- Each reservation station has:
  - Op: the operation to perform
  - Qj, Qk: the RSs that will produce the corresponding operand
    - A value of 0 means the operand is already available or is not needed
  - Vj, Vk: the values of the operands
    - Only one of V or Q is valid for each operand (at most one valid Qj or Vj, and likewise for Qk or Vk)
  - Busy: this RS and its corresponding functional unit are occupied
  - A: information for the memory-address calculation of a load or store
    - Initially the immediate field; later the effective address
- The register file and store buffers have:
  - Qi: the RS that will produce the value to be stored in this register (or stored to memory)
- Load and store buffers each also require a Busy field
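These fields can be sketched as a plain struct, together with the two operations that use them: the readiness check and the CDB broadcast. This is an illustrative sketch (the names and types are mine, not from the slides):

```c
#include <assert.h>

/* Sketch of one Tomasulo reservation station, mirroring the fields
   on the slide. Qj/Qk hold the number of the RS that will produce
   an operand (0 = value already in Vj/Vk, or not needed), so RS
   numbers must start at 1. */
struct rs {
    int    busy;      /* this RS and its FU are occupied */
    int    op;        /* operation to perform */
    int    qj, qk;    /* producing RS for each source (0 = ready) */
    double vj, vk;    /* operand values, valid when qj/qk == 0 */
    long   a;         /* immediate, then effective address (ld/st) */
};

/* An instruction may begin execution only when both operands are
   available, i.e. neither source is still waiting on another RS. */
static int ready(const struct rs *s) {
    return s->busy && s->qj == 0 && s->qk == 0;
}

/* CDB broadcast: RS number r completed with value v; every waiting
   station grabs the value and clears its Q field. */
static void cdb_broadcast(struct rs *all, int n, int r, double v) {
    for (int i = 0; i < n; i++) {
        if (all[i].qj == r) { all[i].vj = v; all[i].qj = 0; }
        if (all[i].qk == r) { all[i].vk = v; all[i].qk = 0; }
    }
}
```

Note that no register file appears in `cdb_broadcast`'s wake-up path: consumers receive the value directly off the bus, which is exactly why WAR and WAW hazards cannot occur.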
40. Detailed Tomasulo Algorithm Control
(Table: the bookkeeping at the issue step. Recording in Qj/Qk/Qi that the result for a register will come from RS r is what avoids RAW hazards: a consumer waits on RS r instead of reading a stale register.)
41. Detailed Tomasulo Algorithm Control (Cont.)
(Table: the execute and write-result steps: calculate the effective address, write to the register, and broadcast the result to the RSs needing it.)
42. Tomasulo Example: Cycle 0
- Latencies (execution stage): LD is 1 CC, ADDD/SUBD is 2 CC, MULTD is 10 CC, and DIVD is 40 CC
43. Tomasulo Example: Cycle 1
44. Tomasulo Example: Cycle 2
45. Tomasulo Example: Cycle 3
- Note: register names are removed (renamed) in the reservation stations; MULTD is issued (unlike with the scoreboard)
- Load1 is completing: what is waiting for Load1?
46. Tomasulo Example: Cycle 4
- Load2 is completing: what is waiting for it?
47. Tomasulo Example: Cycle 5
48. Tomasulo Example: Cycle 6
49. Tomasulo Example: Cycle 7
- Add1 is completing: what is waiting for it?
50. Tomasulo Example: Cycle 8
51. Tomasulo Example: Cycle 9
52. Tomasulo Example: Cycle 10
53. Tomasulo Example: Cycle 11
54. Tomasulo Example: Cycle 12
- Note: all the quick instructions have already completed
55. Tomasulo Example: Cycle 13
56. Tomasulo Example: Cycle 14
57. Tomasulo Example: Cycle 15
- Mult1 is completing: what is waiting for it?
58. Tomasulo Example: Cycle 16
- Note: now just waiting for the divide
59. Tomasulo Example: Cycle 55
60. Tomasulo Example: Cycle 56
- Mult2 is completing: what is waiting for it?
61. Tomasulo Example: Cycle 57
- Again: in-order issue, out-of-order execution and completion
62. Advantages of Tomasulo
- Distribution of the hazard detection logic
  - Distributed RSs and the CDB
  - If multiple instructions are waiting on a single result, and each already has its other operand, then they can all be released simultaneously by the broadcast on the CDB
  - No waiting for the register bus of a centralized register file
- Elimination of stalls for WAW and WAR hazards
  - Registers are renamed using the RSs
  - Operands are stored into the RSs as soon as they are available
  - For a WAW hazard, the last write wins
    - In the issue stage: RegisterStat[rd].Qi ← r (the last issue wins)
63. Tomasulo Drawbacks
- Complexity
  - Contributed to delays of the 360/91, MIPS R10000, IBM 620?
  - Many associative stores (CDB matching) at high speed
- Performance is limited by the common data bus
  - Multiple CDBs → more FU logic for parallel associative stores
64. Tomasulo Loop Example
- Loop: LD    F0, 0(R1)
-       MULTD F4, F0, F2
-       SD    F4, 0(R1)
-       SUBI  R1, R1, #8
-       BNEZ  R1, Loop
- Assume multiply takes 4 clocks
- Assume the first load takes 8 clocks (cache miss?) and the second load takes 4 clocks (hit)
- To be clear, clocks will be shown for SUBI and BNEZ
- In reality, the integer instructions run ahead
65. Loop Example: Cycle 0
66. Loop Example: Cycle 1
67. Loop Example: Cycle 2
68. Loop Example: Cycle 3
- Note: MULT1 has no register names in its RS (they have been renamed)
69. Loop Example: Cycle 4
70. Loop Example: Cycle 5
71. Loop Example: Cycle 6
- Note: F0 never sees the Load1 result
72. Loop Example: Cycle 7
- Note: MULT2 has no register names in its RS
73. Loop Example: Cycle 8
74. Loop Example: Cycle 9
- Load1 is completing: what is waiting for it?
75. Loop Example: Cycle 10
- Load2 is completing: what is waiting for it?
76. Loop Example: Cycle 11
77. Loop Example: Cycle 12
78. Loop Example: Cycle 13
79. Loop Example: Cycle 14
- Mult1 is completing: what is waiting for it?
80. Loop Example: Cycle 15
- Mult2 is completing: what is waiting for it?
81. Loop Example: Cycle 16
82. Loop Example: Cycle 17
83. Loop Example: Cycle 18
84. Loop Example: Cycle 19
85. Loop Example: Cycle 20
86. Loop Example: Cycle 21
87. Tomasulo Summary
- Reservation stations: renaming onto a larger set of registers, plus buffering of source operands
  - Prevents the registers from becoming the bottleneck
  - Avoids the WAR and WAW hazards of the scoreboard
  - Allows loop unrolling in HW
- With one CDB, only one operation can broadcast its result in a single clock cycle
- Not limited to basic blocks (the integer unit gets ahead, beyond branches)
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants: Pentium II, PowerPC 604, MIPS R10000, HP PA-8000, Alpha 21264
88. Reducing Branch Penalties with Dynamic Hardware Prediction
89. Dynamic Control Hazard Avoidance
- Consider the effects of increasing the ILP:
  - Control dependences rapidly become the limiting factor
  - They tend not to be optimized away by the compiler, so higher branch frequencies result
  - Plus, multiple issue (more than one instruction per cycle) means more control instructions per second
  - Control stall penalties will go up as machines get faster: Amdahl's Law in action, again
- Branch prediction helps if it can be done at reasonable cost
  - Static: by the compiler (Appendix A)
  - Dynamic: by the HW (this section)
90. Dynamic Branch Prediction
- The processor attempts to resolve the outcome of a branch early, thus preventing control dependences from causing stalls
- BP performance = f(accuracy, cost of misprediction)
- Branch history table (BHT):
  - The lower bits of the PC address index a table of 1-bit values
  - There is no precise address check: entries just match on the lower bits
  - Each entry says whether or not the branch was taken the last time
91. BHT Prediction
- Useful only if the target address is known before the condition code is decided
- Two branch instructions with the same lower bits will share the same entry
92. Problems with the Simple BHT
- The clear benefit is that it is cheap and understandable
- Aliasing:
  - All branches with the same index (lower) bits reference the same BHT entry
  - Hence they mutually predict each other
  - There is no guarantee that a prediction is right, but it may not matter anyway
  - Avoidance: make the table bigger; this is OK since it is only a single bit-vector
    - This is a common cache improvement strategy as well; other cache strategies may also apply
- Consider how this works for loops:
  - The 1-bit scheme always mispredicts twice for every loop execution
    - Once is unavoidable, since the exit is always a surprise
    - But the previous exit also causes a mispredict on the first iteration of every new entry into the loop
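The twice-per-loop behavior is easy to reproduce. A small sketch (the function name and parameters are illustrative) that counts the mispredictions of a single 1-bit entry watching one loop's backward branch:

```c
#include <assert.h>

/* Counts the mispredictions a 1-bit predictor makes over `runs`
   executions of a loop whose backward branch is taken `iters - 1`
   times and then falls through once (the exit). */
static int mispredicts_1bit(int runs, int iters) {
    int pred = 0;   /* remembered last outcome: 0 = not taken */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < iters; i++) {
            int taken = (i < iters - 1);   /* exit on the last iteration */
            if (taken != pred) miss++;
            pred = taken;                  /* 1 bit: remember the last outcome */
        }
    }
    return miss;
}
```

Each pass through the loop costs two mispredictions: one at the exit, and one on the first iteration of the next entry, inherited from the previous exit.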
93. N-bit Predictors
- Idea: improve on the loop-entry problem
- Use an n-bit saturating counter
  - A 2-bit counter implies 4 states
  - Statistically, 2 bits gets most of the advantage
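A 2-bit saturating counter must be wrong twice before its prediction flips, which removes the loop-entry mispredict. A sketch of the counter running the same loop experiment (the names and the starting state are my choices):

```c
#include <assert.h>

/* 2-bit saturating counter: states 0..3; a value >= 2 predicts
   taken. Increment on taken, decrement on not taken, saturating
   at 0 and 3. */
static int mispredicts_2bit(int runs, int iters) {
    int ctr = 3;    /* start in "strongly taken" for illustration */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < iters; i++) {
            int taken = (i < iters - 1);        /* exit on the last iteration */
            if ((ctr >= 2) != taken) miss++;    /* predict taken when ctr >= 2 */
            if (taken) { if (ctr < 3) ctr++; }  /* saturating update */
            else       { if (ctr > 0) ctr--; }
        }
    }
    return miss;
}
```

A single exit only drops the counter from 3 to 2, so the next loop entry is still predicted taken: one mispredict per loop execution instead of two.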
94. BHT Accuracy
- A 4K-entry BPB with 2-bit entries: misprediction rates on SPEC89
- We mispredict because either:
  - We made the wrong guess for that branch, or
  - We got the branch history of the wrong branch when indexing the table (aliasing)
95. BHT Accuracy vs. BHT Size
- With a 4096-entry table, programs vary from 1% mispredictions (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- 4096 entries is about as good as an infinite table (measured on the Alpha 21164)
96. Improving the Prediction Strategy by Correlating Branches
- Consider the worst case for the 2-bit predictor:
  - if (aa == 2) then aa = 0
  - if (bb == 2) then bb = 0
  - if (aa != bb) then whatever
  - If the first two branches are untaken (both assignments execute), then aa == bb == 0 and the third branch will always be taken
  - Single-level predictors can never capture this case
- Correlating or 2-level predictors:
  - Correlation: what happened on the last branch
    - Note that the last correlator branch may not always be the same static branch
  - Predictor: which way to go
  - 4 possibilities: which way the last one went chooses the prediction
    - (last taken, last not taken) x (predict taken, predict not taken)
97. The Worst Case for the 2-bit Predictor
- Source code:
  - if (aa == 2)
  -     aa = 0;
  - if (bb == 2)
  -     bb = 0;
  - if (aa != bb) { ... }
- MIPS code (aa in R1, bb in R2):
  - DSUBUI R3, R1, #2
  - BNEZ   R3, L1        ; branch b1 (aa != 2)
  - DADD   R1, R0, R0    ; aa = 0
  - L1: DSUBUI R3, R2, #2
  - BNEZ   R3, L2        ; branch b2 (bb != 2)
  - DADD   R2, R0, R0    ; bb = 0
  - L2: DSUBU R3, R1, R2
  - BEQZ   R3, L3        ; branch b3 (aa == bb)
- If the first two branches are untaken, the third will always be taken
98. Correlating Branches
- Hypothesis: recently executed branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
- In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
- The old 2-bit BHT is then a (0,2) predictor
99. Example of Correlating Branch Predictors
- BNEZ   R1, L1         ; branch b1 (d != 0)
- DADDIU R1, R0, #1     ; d == 0, so d = 1
- L1: DADDIU R3, R1, #-1
- BNEZ   R3, L2         ; branch b2 (d != 1)
- ...
- L2: ...
100. Example of Correlating Branch Predictors (Cont.)
101. Example of Correlating Branch Predictors (Cont.)
102. In General: the (m,n) BHT (Prediction Buffer)
- p bits of the branch address index the buffer's 2^p rows
- The last m branch outcomes form the global branch history
- Each entry is an n-bit predictor
- Total bits for the (m,n) BHT prediction buffer: 2^m x 2^p x n
  - 2^m banks of memory are selected by the global branch history (which is just a shift register), e.g., as a column address
  - p bits of the branch address select the row
  - The n predictor bits in the selected entry make the decision
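The indexing and sizing above can be sketched directly. Here m = 2, p = 5, n = 2, matching the (2,2) configuration on the next slide (the helper names are mine):

```c
#include <assert.h>

/* Sketch of an (m,n) correlating predictor: a global history shift
   register of the last M outcomes selects one of 2^M banks, and P
   low bits of the branch PC select the row. */
#define M 2
#define P 5
#define NBITS 2

static unsigned char table[1 << M][1 << P];  /* n-bit counters, 0..3 */
static unsigned ghr;                          /* global history shift register */

static int predict(unsigned pc) {
    return table[ghr & ((1u << M) - 1)][pc & ((1u << P) - 1)]
           >= (1u << (NBITS - 1));            /* top half of range = taken */
}

static void update(unsigned pc, int taken) {
    unsigned char *c = &table[ghr & ((1u << M) - 1)][pc & ((1u << P) - 1)];
    if (taken) { if (*c < 3) (*c)++; }        /* saturating n-bit counter */
    else       { if (*c > 0) (*c)--; }
    ghr = (ghr << 1) | (taken ? 1u : 0u);     /* shift the outcome into history */
}

/* Total predictor bits: 2^m banks x 2^p rows x n bits. */
static int total_bits(void) { return (1 << M) * (1 << P) * NBITS; }
```

For this configuration the formula gives 4 x 32 x 2 = 256 bits of predictor state.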
103. (2,2) Predictor Implementation
- 4 banks, each with 32 2-bit predictor entries
- p = 5, m = 2, n = 2
- The 5 address bits select one of the 32 rows
104. Accuracy of Different Schemes
105. Tournament Predictors
- Adaptively combine local and global predictors
- Multiple predictors:
  - One based on global information: the results of the m most recently executed branches
  - One based on local information: the results of past executions of the current branch instruction
- A selector chooses which predictor to use:
  - A 2-bit saturating counter, incremented whenever the "predicted" predictor is correct and the other predictor is incorrect, and decremented in the reverse situation
- Advantage: the ability to select the right predictor for the right branch
- Example: the Alpha 21264 branch predictor (pp. 207-209)
106. State Transition Diagram for a Tournament Predictor
(Diagram: four states, two labeled "use predictor 1" and two labeled "use predictor 2". Edges are labeled with the correctness pair predictor1/predictor2: 1/0 moves toward the "use predictor 1" states, 0/1 moves toward the "use predictor 2" states, and 0/0 or 1/1 leaves the state unchanged.)
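The selector behaves like a 2-bit saturating counter indexed by the correctness pair. A sketch (the state encoding and direction convention here are my choices; real selectors, such as the 21264's, keep a table of such counters):

```c
#include <assert.h>

/* Tournament selector sketch: a 2-bit counter with states 0..3.
   States 0-1 mean "use predictor 1"; states 2-3 mean "use
   predictor 2". */
static int which_predictor(int sel) { return sel <= 1 ? 1 : 2; }

/* Move toward whichever predictor was right when exactly one was;
   0/0 and 1/1 outcomes leave the state unchanged. */
static int update_selector(int sel, int p1_correct, int p2_correct) {
    if (p1_correct && !p2_correct && sel > 0) sel--;   /* toward predictor 1 */
    if (!p1_correct && p2_correct && sel < 3) sel++;   /* toward predictor 2 */
    return sel;
}
```

Because two consecutive "loser wins" outcomes are needed to cross the midpoint, a single lucky guess by the other predictor does not switch the choice.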
107. Fraction of Predictions Coming from the Local Predictor (SPEC89)
108. Misprediction Rate Comparison
109. Branch Target Buffer/Cache
- Goal: reduce the branch penalty to 0
  - We need to know the target address by the end of IF
  - But the instruction is not even decoded yet, so use the instruction address rather than waiting for the decode
  - If the prediction works, the penalty goes to 0!
- BTB idea: a cache that stores taken branches (there is no need to store untaken ones)
  - The match tag is the instruction address, compared with the current PC
  - The data field is the predicted PC
- We may want to add a predictor field
  - To avoid the mispredict-twice-on-every-loop phenomenon
  - This adds complexity, since we now have to track untaken branches as well
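A direct-mapped version of the idea fits in a few lines (sizes and names are illustrative; real BTBs are typically set-associative and may carry a predictor field as well):

```c
#include <assert.h>

/* Sketch of a direct-mapped branch-target buffer: indexed by low PC
   bits, tagged with the full PC, storing the predicted target of a
   taken branch. Returns 0 (meaning "no prediction") on a miss. */
#define BTB_ENTRIES 16

struct btb_entry { unsigned tag; unsigned target; int valid; };
static struct btb_entry btb[BTB_ENTRIES];

/* Probed with the fetch PC during IF, before the instruction is
   even known to be a branch. */
static unsigned btb_lookup(unsigned pc) {
    const struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    return (e->valid && e->tag == pc) ? e->target : 0;
}

/* Inserted/updated when a taken branch resolves. */
static void btb_update(unsigned pc, unsigned target) {
    struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    e->tag = pc; e->target = target; e->valid = 1;
}
```

The full-PC tag is what distinguishes a BTB from a BHT: a hit identifies this exact branch, so the fetched target can be used immediately.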
110. Branch Target Buffer/Cache: Illustration
111. Changes in DLX to Incorporate the BTB
112. Penalties Using This Approach for MIPS/DLX
- Note:
  - Wrong prediction: 1 CC to update the BTB plus 1 CC to restart fetching
  - Not found and taken: 2 CC to update the BTB
- For more complex pipeline designs, the penalties may be higher
113. Branch Penalty CPI
- Prediction accuracy is 90% (so the incorrect-prediction rate is 10%)
- The hit rate in the buffer is 90%
- The taken-branch frequency is 60%
- Branch penalty = (buffer hit rate x incorrect prediction rate x 2) + ((1 - buffer hit rate) x taken-branch frequency x 2)
  = (0.9 x 0.1 x 2) + (0.1 x 0.6 x 2) = 0.18 + 0.12 = 0.30 cycles per branch
- The branch penalty for delayed branches is about 0.5 cycles per branch
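The arithmetic generalizes to a one-line helper (a sketch; the 2-cycle cost for both miss cases follows the penalty table above):

```c
#include <assert.h>
#include <math.h>

/* Expected branch penalty in cycles: a BTB hit with a wrong
   prediction and a taken branch that misses the buffer both cost
   `penalty_cycles`; other cases cost nothing. */
static double branch_penalty(double hit_rate, double mispredict_rate,
                             double taken_freq, double penalty_cycles) {
    return hit_rate * mispredict_rate * penalty_cycles
         + (1.0 - hit_rate) * taken_freq * penalty_cycles;
}
```

Plugging in the slide's numbers (0.9, 0.1, 0.6, 2) reproduces the 0.30-cycle result, noticeably better than the roughly 0.5 cycles quoted for delayed branches.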
114. Return Address Predictor
- Indirect jumps: jumps whose destination address varies at run time
  - Indirect procedure calls, select or case statements, procedure returns
  - In the SPEC89 benchmarks, 85% of indirect jumps are procedure returns
- The accuracy of a BTB for procedure returns is low
  - If the procedure is called from many places, and the calls from one place are not clustered in time
- Use a small buffer of return addresses operating as a stack
  - Cache the most recent return addresses
  - Push a return address at a call, and pop one off at a return
  - If the cache is sufficiently large (at least the maximum call depth), prediction is perfect
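The push/pop discipline is only a few lines. A sketch (the depth and names are illustrative; a real return-address stack typically wraps around rather than saturating when full):

```c
#include <assert.h>

/* Sketch of a return-address stack (RAS): push the fall-through PC
   on a call, pop a predicted target on a return. */
#define RAS_DEPTH 8

static unsigned ras[RAS_DEPTH];
static int ras_top;   /* number of valid entries */

static void ras_push(unsigned return_pc) {
    if (ras_top < RAS_DEPTH) ras[ras_top++] = return_pc;
}

static unsigned ras_pop(void) {
    return ras_top > 0 ? ras[--ras_top] : 0;   /* 0 = no prediction */
}
```

As long as call depth never exceeds the stack depth, every popped address matches the actual return target, regardless of how many call sites the procedure has.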
115. Dynamic Branch Prediction Summary
- Branch history table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
- Branch target buffer: include branch-address prediction
- The penalty can be reduced further by fetching instructions from both the predicted and unpredicted directions
  - This requires dual-ported memory or an interleaved cache → HW cost
  - Caching addresses or instructions from multiple paths in the BTB
116. Taking Advantage of More ILP with Multiple Issue (Section 3.6)
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
117. Getting CPI < 1: Issuing Multiple Instructions per Cycle
- Superscalar:
  - Issues varying numbers of instructions per clock
  - Constrained by hazard-style issue checks
  - Scheduling: static (by the compiler) or dynamic (hardware support for some form of Tomasulo)
- VLIW (very long instruction word):
  - Issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among the instructions explicitly indicated by the instruction
  - Also known as EPIC: explicitly parallel instruction computing
  - Scheduling is mostly static
- (Figure: example VLIW slots: Int/Br, Int/Ld-St, FP, FP mul/div.)
118. Five Approaches in Use for Multiple-Issue Processors
119. Statically Scheduled Superscalar Processors
- The HW might issue 0 to 8 instructions in a clock cycle
- Instructions issue in program order
- Pipeline hazards are checked at issue time:
  - Among the instructions being issued in a given clock cycle
  - Between the issuing instructions and all those still in execution
- If a data or structural hazard occurs, only the instructions preceding that one in the instruction sequence will be issued (dynamic issue)
- The issue stage is complex
  - It can be split and pipelined, but that results in higher branch penalties
- Instruction issue is likely to be one limitation on the clock rate of superscalar processors
120. A Superscalar 2-Issue MIPS
- Very similar to the HP 7100
- Requires fetching and decoding 64 bits of instructions per cycle
- Which instructions?
  - 1 integer load, store, branch, or integer ALU operation
  - 1 FP operation
- Why issue one integer and one FP operation?
  - It eliminates most hazard possibilities → simplifies the logic
  - The integer and FP register sets are different
  - The integer and FP FUs are different
  - The only difficulty is when the "integer" instruction is an FP load, store, or move
    - This needs an additional read/write port on the FP registers
    - And may create a RAW hazard between the pair
121. Superscalar 2-Issue MIPS (Cont.)
- Pipe stages (each pair issues together; successive pairs start one cycle later):
  - Int. instruction: IF ID EX MEM WB
  - FP instruction:   IF ID EX MEM WB
  - Int. instruction:    IF ID EX MEM WB
  - FP instruction:      IF ID EX MEM WB
  - Int. instruction:       IF ID EX MEM WB
  - FP instruction:         IF ID EX MEM WB
- Instruction placement is not restricted in modern processors
- The 1-cycle load delay expands to 3 instruction slots in the superscalar
  - The instruction in the right half of the pair cannot use the result, nor can the instructions in the next issue slot
- Must have pipelined FP FUs or multiple independent FP FUs
122. Consider Adding a Scalar to a Vector
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)     ; F0 = vector element
      ADD.D  F4, F0, F2    ; add scalar from F2
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, #-8   ; decrement pointer 8 bytes (DW)
      BNE    R1, R2, Loop  ; branch if R1 != R2

Assume 8(R2) is the last element to operate on.
123. Unscheduled Loop
124. Unrolled Loop that Minimizes Stalls (Scalar Pipeline)
- 1  Loop: L.D    F0, 0(R1)
- 2        L.D    F6, -8(R1)
- 3        L.D    F10, -16(R1)
- 4        L.D    F14, -24(R1)
- 5        ADD.D  F4, F0, F2
- 6        ADD.D  F8, F6, F2
- 7        ADD.D  F12, F10, F2
- 8        ADD.D  F16, F14, F2
- 9        S.D    F4, 0(R1)
- 10       S.D    F8, -8(R1)
- 11       DADDUI R1, R1, #-32
- 12       S.D    F12, 16(R1)   ; 16 = -16 + 32, since R1 is already decremented
- 13       BNE    R1, R2, Loop
- 14       S.D    F16, 8(R1)    ; 8 = -24 + 32
- 14 clock cycles, or 3.5 per iteration
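At the source level, the same transformation looks like the sketch below: a 4x-unrolled version of the slide-122 loop x[i] = x[i] + s (the function name and the assumption that the trip count is a multiple of 4 are mine):

```c
#include <assert.h>

#define N 1000

/* Source-level counterpart of the unrolled MIPS loop: four
   independent adds per iteration, one index decrement, one branch.
   N is assumed to be a multiple of 4. */
static void add_scalar_unrolled(double *x, double s) {
    for (int i = N - 1; i >= 3; i -= 4) {
        x[i]     += s;
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
    }
}
```

Unrolling amortizes the DADDUI and BNE overhead over four elements, which is how the schedule above reaches 3.5 cycles per element.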
125. Unrolled Loop for the Superscalar (Unrolled 5 Times)
1 Loop: L.D   F0, 0(R1)
2       L.D   F6, -8(R1)
3       L.D   F10, -16(R1)
4       ADD.D F4, F0, F2
5       L.D   F14, -24(R1)
6       ADD.D F8, F6, F2
7       L.D   F18, -32(R1)
...
126. Loop Unrolling in the Superscalar
- Unrolled 5 times to avoid delays
- Integer instruction / FP instruction / clock cycle:
  - Loop: L.D F0, 0(R1)       |                     | 1
  -       L.D F6, -8(R1)      |                     | 2
  -       L.D F10, -16(R1)    | ADD.D F4, F0, F2    | 3
  -       L.D F14, -24(R1)    | ADD.D F8, F6, F2    | 4
  -       L.D F18, -32(R1)    | ADD.D F12, F10, F2  | 5
  -       S.D F4, 0(R1)       | ADD.D F16, F14, F2  | 6
  -       S.D F8, -8(R1)      | ADD.D F20, F18, F2  | 7
  -       S.D F12, -16(R1)    |                     | 8
  -       S.D F16, -24(R1)    |                     | 9
  -       DADDUI R1, R1, #-40 |                     | 10
  -       BNE R1, R2, Loop    |                     | 11
  -       S.D F20, 8(R1)      |                     | 12   ; 8 = -32 + 40
- 12 clocks, or 2.4 clocks per iteration
127. Seems Simple?
- Registers:
  - Each pipe has its own set, due to the separation of the FP and GP registers
  - This also inherently separates the data dependences into 2 classes
  - The exception is FP loads/stores: the effective-address calculation is an integer operation, but the destination register is an FP register
- The FP pipe has longer latency
  - Exacerbated by operation latency differences (e.g., multiply 6 cycles, divide 24 cycles)
  - The result is that completion is out of order
  - This complicates hazard control within the FP execution pipe
  - Pipeline the FP ALU or use multiple FP ALUs
128. Problems So Far
- Look at the opcodes and see whether the pair is an appropriate issue pair
- Some integer operations are a problem:
  - FP register loads/stores, since the other instruction may be dependent
  - A stall will result. Options?
    - Force FP loads, stores, and moves to issue by themselves
      - Safe but suboptimal, since the other instruction may still be independent
    - Or add more ports to the FP register file, such as separate read and write ports
      - Still must stall the second instruction if it is dependent
129. Other Issues
- Hazard detection
  - Similar to the normal pipeline model, but needs a larger set of bypass paths (twice as many instructions in the pipeline)
- Load-use delay
  - Assume 1 cycle → it now covers 3 instruction slots
- Branch delay
  - Do branches have to be issued by themselves?
  - The 1-instruction branch delay now holds 3 instructions as well
- Instruction scheduling by the compiler
  - Mandatory for issuing independent operations in a superscalar
  - Increasingly important as the issue width goes up
130Dynamic Scheduling In SuperScalar
- Use Tomasulo Algorithm
- Issue two arbitrary instructions per clock and
let the RS sort it out - but still can't issue a dependent pair
- Two examples: pp. 221-224
- How to issue multiple arbitrary instructions per
clock? - Run the issue step in half a clock cycle (i.e.,
pipeline it) - Build the logic necessary to handle two
instructions at once, including any possible
dependences between the instructions - Modern SS processors that issue four or more
instructions per clock often include both
approaches
131Dynamic Scheduling in Superscalar (Cont.)
- Only FP loads might cause dependency between
integer and FP issue - Replace the load reservation station with a load
queue - Operands must be read in the order they are
fetched - Load checks addresses in Store Queue to avoid RAW
violation - Store checks addresses in Load Queue to avoid
WAR, WAW - Called decoupled architecture
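The address checks above can be sketched as follows. This is an illustrative model, not real decoupled-architecture hardware; the class and method names are assumptions.

```python
# Minimal sketch of memory disambiguation with a store queue and load queue,
# as described above: loads check the store queue (RAW), stores check the
# load queue (WAR/WAW). All names here are illustrative assumptions.
class MemQueues:
    def __init__(self):
        self.store_queue = []  # (address, value) of stores not yet committed
        self.load_queue = []   # addresses of loads in flight

    def issue_load(self, addr):
        # RAW check: a pending store to the same address must supply its
        # value (forwarding) before this load may read memory.
        for st_addr, st_val in self.store_queue:
            if st_addr == addr:
                return ("forward", st_val)
        self.load_queue.append(addr)
        return ("read_memory", None)

    def issue_store(self, addr, value):
        # WAR/WAW check: an earlier in-flight load to the same address
        # must finish before this store may write.
        if addr in self.load_queue:
            return "stall"
        self.store_queue.append((addr, value))
        return "queued"

q = MemQueues()
q.issue_store(0x100, 7)
print(q.issue_load(0x100))   # ('forward', 7): RAW satisfied by forwarding
print(q.issue_load(0x200))   # ('read_memory', None): no conflict
```

Note that operands are still consumed in fetch order: the queues only let independent accesses proceed while conflicting ones forward or stall.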
132Example
- Can issue two arbitrary operations per clock
- One integer FU for ALU operation and
EA-calculation - A separate pipelined FP FU
- One memory unit, 2 CDBs
- No delayed branch; perfect branch prediction
- Fetch and issue as if the branch predictions are
always correct - Latency between a source instruction and an
instruction consuming the result (given the
presence of the Write Result stage) - 1 CC for integer ALU operations
- 2 CC for loads
- 3 CC for FP add
133Note
- The WR stage does not apply to either stores or
branches - For L.D and S.D, the execution cycle is the EA
calculation - For branches, the execution cycle shows when the
branch condition can be evaluated and the
prediction checked - Any instruction following a branch cannot start
execution until after the branch condition has
been evaluated - If two instructions could use the same FU at the
same point (structural hazard), priority is given
to the older instruction
134Consider Adding a Scalar to a Vector
- for (i=1000; i > 0; i=i-1) x[i] = x[i] + s;
Loop: L.D F0,0(R1)      ;F0 = vector element
      ADD.D F4,F0,F2    ;add scalar in F2
      S.D F4, 0(R1)     ;store result
      DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
      BNE R1, R2, Loop  ;branch R1 != R2
135Execution Timing
136Execution Timing (Cont.)
137Example Result
- Result
- Issue IPC = 5/3 = 1.67; instruction execution
rate = 15/16 = 0.94 - Only one load, store, and integer ALU operation
can execute at a time - Load of the next iteration computes its memory
address before the store of the current iteration - Only a single CDB is actually required
- Integer operations become the bottleneck
- Many integer operations, but only one integer ALU
- One stall cycle each loop iteration due to a
branch hazard
138Another Example Execution Timing
Separate integer FU for EA calculation and ALU
operations
139Execution Timing (Cont.)
140Note
- Result
- Issue IPC = 5/3 = 1.67; instruction execution
rate = 15/11 = 1.36 - A second CDB is needed
- This example has a higher instruction execution
rate but lower efficiency as measured by the
utilization of FU
141Limitations on Multiple Issue
- How much ILP can be found in the application? A
fundamental problem - Requires deep unrolling - hence big focus on
loops - Compiler complexity goes way up
- Deep unrolling needs lots of registers
- Increased HW cost
- Increased ports for register files
- Cost of scoreboarding (e.g. Tomasulo data
structure) and forwarding paths - Memory bandwidth requirement goes up
- Most have gone with separate I and D ports
already - Newest approaches are to go for multiple D ports
as well - big time expense!! (PA- 8000) - Branch prediction by HW is an absolute must HW
Speculation (Sect. 3.7)
1423.7 Hardware-Based Speculation
143Overview
- Overcome control dependence by speculating on the
outcome of branches and executing the program as
if our guesses were correct - Fetch, issue, and execute instructions
- Need mechanisms to handle the situation when the
speculation is incorrect - Dynamic scheduling only fetch and issue such
instructions
144Key Ideas
- Dynamic branch prediction to choose which
instructions to execute - Speculation to allow the speculated blocks to
execution before the control dependences are
resolved - And undo the effects of an incorrectly speculated
sequence - Dynamic scheduling to deal with the scheduling of
different combinations of basic blocks (Tomasulo
style approach)
145HW Speculation Approach
- Issue → execute → write result → commit
- Commit is the point where the operation is no
longer speculative - Allow out of order execution
- Require in-order commit
- Prevent speculative instructions from performing
destructive state changes (e.g. memory write or
register write) - Collect pre-commit instructions in a reorder
buffer (ROB) - Holds completed but not committed instructions
- Effectively contains a set of virtual registers
to store the result of speculative instructions
until they are no longer speculative - Similar to reservation station ? And becomes a
bypass source
146The Speculative MIPS
The ROB replaces the store buffer
147The Speculative MIPS (Cont.)
- Need HW buffer for results of uncommitted
instructions reorder buffer (ROB) - 4 fields instruction type, destination field,
value field, ready field - ROB is a source of operands ? more registers like
RS - ROB supplies operands in the interval between
completion of instruction execution and
instruction commit - Use ROB number instead of RS to indicate the
source of operands when execution completes (but
not committed) - Once instruction commits, result is put into
register - As a result, its easy to undo speculated
instructions on mispredicted branches or on
exceptions
148ROB Fields
- Instruction type: branch, store, or register
operations - Destination field
- Unused for branches
- Memory address for stores
- Register number for load and ALU operations
(register operations) - Value: holds the value of the instruction result
until commit - Ready: indicates whether the instruction has
completed execution
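The four ROB fields above map directly onto a record type. This is an illustrative sketch; the class and field names are assumptions, not from the text.

```python
# Illustrative model of one reorder-buffer entry with the four fields
# listed above: instruction type, destination, value, and ready flag.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    itype: str                        # 'branch', 'store', or 'register'
    dest: Optional[Union[int, str]]   # None for branches; memory address for
                                      # stores; register number for load/ALU ops
    value: Optional[float] = None     # result, held here until commit
    ready: bool = False               # has the instruction completed execution?

entry = ROBEntry(itype="register", dest="F4")  # e.g., an ADD.D writing F4
entry.value, entry.ready = 3.5, True           # filled at Write Result
print(entry.ready)  # True: eligible to commit once it reaches the ROB head
```

Until commit, the `value` field is what makes the ROB act as a set of virtual registers and a bypass source.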
149Steps in Speculative Execution
- Issue (or dispatch)
- Get instruction from the instruction queue
- In-order issue if an RS AND a ROB slot are available;
otherwise, stall - Send operands to the RS if they are in a register or
ROB - Update Tomasulo DS and ROB
- The ROB no. allocated for the result is sent to
RS, so that the number can be used to tag the
result when it is placed on CDB - Execute
- RS waits, grabbing results off the CDB if necessary
- When all operands are there, execution happens
- Write Result
- Result posted to ROB via the CDB
- Waiting reservation stations can grab it as well
150Steps in Speculative Execution (Cont.)
- Commit (or graduate) instruction reaches the
ROB head - Normal commit when instruction reaches the ROB
head and its result is present in the buffer - Update the register and remove the instruction
from ROB - Store Update memory and remove the instruction
from ROB - Branch with incorrect prediction wrong
speculation - Flush ROB and the related FP OP queue (RS)
- Restart at the correct successor of the branch
- Remove the instruction from ROB
- Branch with correct prediction: finish the
branch - Remove the instruction from ROB
151Example
- The same example as Tomasulo without speculation.
Show the status tables when MUL.D is ready to go
to commit - L.D F6, 34(R2)
- L.D F2, 45(R3)
- MUL.D F0, F2, F4
- SUB.D F8, F6, F2
- DIV.D F10, F0, F6
- ADD.D F6, F8, F2
- Modified status tables
- Qj and Qk fields, and register status fields use
ROB (instead of RS) - Add Dest field to RS (ROB to put the operation
result)
152Figure 3.30
153Example Result
- Tomasulo without speculation
- SUB.D and ADD.D have completed (clock cycle 16,
slide 58) - Tomasulo with speculation
- No instruction after the earliest uncompleted
instruction (MUL.D) is allowed to complete - In-order commit
- Implication: a ROB with in-order instruction
commit provides precise exceptions - Precise exceptions: exceptions are handled in
program order
154Loop Example
- Loop L.D F0, 0(R1)
- MUL.D F4, F0, F2
- S.D F4, 0(R1)
- DADDUI R1,R1, -8
- BNE R1, R2, Loop
- Assume we have issued all the instructions in the
loop twice - Assume L.D and MUL.D from the first iteration
have committed and all others have completed
execution
155Figure 3.31
156Loop Example Observation
- Suppose the first BNE is not taken → flush the ROB
and begin fetching instructions from the other path
157Other Issues
- Performance is more sensitive to
branch prediction
- Prediction accuracy, mis-prediction detection,
and mis-prediction recovery increase in
importance - Precise exception
- Handled by not recognizing the exception until it
is ready to commit - If a speculative instruction raises an exception,
the exception is recorded in the ROB - Mispredicted branch → recorded exceptions are flushed as
well - If the instruction reaches the ROB head → take
the exception
158Figure 3.32
160Multiple Issue with Speculation
- Process multiple instructions per clock,
assigning RS and ROB entries to the instructions
instruction per cycle, must handle multiple
instruction commits per clock - Speculation helps significantly when a branch is
a key potential performance limitation - Speculation can be advantageous when there are
data-dependent branches, which otherwise would
limit performance - Depend on accurate branch prediction ? incorrect
speculation will typically harm performance
161Example
- Assume separate integer FUs for ALU operations,
effective address calculation, and branch
condition evaluation - Assume up to 2 instruction of any type can commit
per clock - Loop LD R2, 0(R1)
- DADDIU R2, R2, 1
- SD R2, 0(R1)
- DADDIU R1, R1, 4
- BNE R2, R3, LOOP
162No Speculation
Figures 3.33 and 3.34
163Speculation
164Example Result
- Without speculation
- L.D following BNE cannot start execution early
→ must wait until the branch outcome is determined - Completion rate falls behind the issue rate
rapidly; stalls occur when a few more iterations are
issued
- L.D following BNE can start execution early
because it is speculative
1653.8 Studies of The Limitations of ILP
166ILP Studies
- Perfect Hardware model - in the ideal infinite
cost case - Rename as much as you need
- Implies infinite virtual registers
- Hence - complete WAW or WAR insensitivity
- Branch prediction is perfect
- This will never happen in reality of course
- Jump prediction (even computed such as return)
are also perfect - Similarly unreal
- Perfect memory disambiguation
- Almost perfect is not too hard in practice
- Can issue an unlimited number of instructions at once;
no restriction on the types of instructions issued
→ unlimited FUs - One-cycle latency
167Let's Look at a Real Machine
- Alpha 21264: one of the most advanced
superscalar processors announced to date - Issues up to four instructions per clock, and
initiates execution on up to six - At most 2 memory references, among other
restrictions - Support a large set of renaming registers (41
integer and 41 FP) - Allow up to 80 instructions in execution
- Multicycle latencies
- Tournament-style branch predictor
168How to Measure
- A set of programs were compiled and optimized
with the standard MIPS optimizing compilers - Execute and produce a trace of the instruction
and data references - Perfect branch prediction and perfect alias
analysis are easy to do - Every instruction in the trace is then scheduled
as early as possible, limited only by the data
dependence - Including moving across branches
169What A Perfect Processor Must Do?
- Look arbitrarily far ahead to find a set of
instructions to issue, predicting all branches
perfectly - Rename all register uses to avoid WAW and WAR
hazards - Determine whether there are any dependences among
the instructions in the issue packet if so,
rename accordingly - Determine if any memory dependences exist among
the issuing instructions and handle them
appropriately - Provide enough replicated FUs to allow all the
ready instructions to issue
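The renaming task in the list above can be sketched concretely. This is an illustrative model with unlimited virtual registers (the perfect-machine assumption); the function and register names are assumptions for the example.

```python
# Sketch of register renaming: map each architectural destination to a
# fresh virtual register, so WAW and WAR hazards disappear and only true
# (RAW) dependences remain.
def rename(instrs):
    """instrs: list of (dest, [srcs]) over architectural registers."""
    mapping, counter, out = {}, 0, []
    for dest, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read current mapping
        counter += 1
        vreg = f"v{counter}"          # unlimited virtual registers
        mapping[dest] = vreg          # later reads of dest see the new name
        out.append((vreg, new_srcs))
    return out

# Two writes to r1 (a WAW hazard) get distinct names after renaming:
print(rename([("r1", ["r2"]), ("r1", ["r3"]), ("r4", ["r1"])]))
# [('v1', ['r2']), ('v2', ['r3']), ('v3', ['v2'])]
```

The same idea underlies the ROB's virtual registers: the tag an instruction carries is effectively its renamed destination.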
170ILP at the Limit
- How many instructions would issue on the perfect
machine every cycle? - gcc - 54.8
- espresso - 62.6
- li - 17.9
- fpppp - 75.2
- doduc - 118.7
- tomcatv - 150.1
- Limited only by the ILP inherent in the
benchmarks - Note
- Benchmarks are small codes
- More ILP tends to surface as the codes get bigger
- Huge amounts of loop parallelism in the SPECfp
codes
171Window Size
- The set of instructions that is examined for
simultaneous execution is called the window - The window size will be determined by the cost of
determining whether n issuing instructions have
any register dependences among them - In theory, this c