Title: Branch Prediction
1 Branch Prediction
2 Branch Hazards
[Pipeline diagram, CC1 to CC8 (stages IM, Reg, DM):]
beq $2, $1, here
add ...
sub ...
lw ...
(The instructions fetched after the beq shouldn't be executed!)
here: lw ...
(Finally, the right instruction.)
3 Objectives
- Ways to deal with control hazards
- Stalling for branch hazards
- Reducing Branch delays
- Eliminating Branch Stalls
- Predicting Branches
- Static
- Dynamic
4 Stalling for Branch Hazards
[Pipeline diagram, CC1 to CC8 (stages IM, Reg, DM, Reg), showing the pipeline stalling after the branch:]
beq $4, $0, there
and $12, $2, $5
or ...
add ...
sw ...
5 Stalling for Branch Hazards
- Seems wasteful, particularly when the branch isn't taken.
- Makes all branches cost 4 cycles.
9 Eliminating the Branch Stall
- There's no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching?
- The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches.
- The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!
10 Branch Delay Slot
[Pipeline diagram, CC1 to CC8 (stages IM, Reg, DM, Reg):]
beq $4, $0, there
and $12, $2, $5
there: or ...
add ...
sw ...
The branch delay slot instruction (the next instruction after a branch) is executed even if the branch is taken.
11 Filling the branch delay slot
    add $5, $3, $7
    sub $6, $1, $4
    and $7, $8, $2
    beq $6, $7, there
    nop                /* branch delay slot */
    add $9, $1, $2
    sub $2, $9, $5
    ...
there:
    mult $2, $10, $11
12 Filling the branch delay slot
- The branch delay slot is only useful if you can find something to put there.
- If you can't find anything, you must put a nop to ensure correctness.
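Whether an earlier instruction can fill the slot comes down to register dependences. The following is a minimal Python sketch, assuming a simplified instruction model of (opcode, destination, sources) tuples that ignores memory and exceptions; the helper name can_move_to_delay_slot is hypothetical. Applied to the three instructions that precede the beq in the example listing, it finds none that can be moved, which is consistent with the nop shown there.

```python
# Each instruction is (opcode, destination register or None, source registers).
# This model is an illustrative assumption, not a full MIPS description.
code_before_branch = [
    ("add", "$5", ("$3", "$7")),
    ("sub", "$6", ("$1", "$4")),
    ("and", "$7", ("$8", "$2")),
]
branch = ("beq", None, ("$6", "$7"))


def can_move_to_delay_slot(index, block, branch):
    """Return True if block[index] can be moved to just after the branch."""
    _, dest, srcs = block[index]
    # Everything that currently runs after the candidate, branch included.
    later = block[index + 1:] + [branch]
    for _, later_dest, later_srcs in later:
        if dest is not None and dest in later_srcs:
            return False  # a later instruction needs the candidate's result
        if later_dest is not None and later_dest in srcs:
            return False  # a later instruction overwrites one of its inputs
        if dest is not None and dest == later_dest:
            return False  # moving it would change which write lands last
    return True


movable = [
    instr[0]
    for i, instr in enumerate(code_before_branch)
    if can_move_to_delay_slot(i, code_before_branch, branch)
]
print("instructions that could fill the slot:", movable)
```

A compiler could also look at the branch target or the fall-through path for a candidate; this sketch only checks the instructions ahead of the branch.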
13 Problems with branch delay slots
- Exposes pipeline structure to programmer (compatibility)
- Ability to find instructions for delay slots drops off rapidly
14 Assume Branch Not Taken
- works pretty well when you're right
[Pipeline diagram, CC1 to CC8 (stages IM, Reg, DM, Reg): the branch is not taken and the instructions after it proceed without stalling:]
beq $4, $0, there
and $12, $2, $5
or ...
add ...
sw ...
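15 Assume Branch Not Taken
- same performance as stalling when you're wrong
[Pipeline diagram, CC1 to CC8: the branch is taken, so the instructions fetched after it are flushed in their early stages (IM, Reg) and fetch resumes at the target:]
beq $4, $0, there
and $12, $2, $5
or ...
add ...
there: sub $12, $4, $2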
16 Assume Branch Not Taken
- Performance depends on the percentage of time you guess right.
- Flushing an instruction means preventing it from changing any permanent state (registers, memory, PC).
- sounds a lot like a bubble...
- But notice that we need to be able to insert those bubbles later in the pipeline
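The dependence on the guess rate can be put in numbers. This is a rough Python sketch, assuming (as the slides state) that a stalled or wrongly guessed branch costs 4 cycles, and further assuming that a correctly guessed not-taken branch costs 1 cycle; the taken fractions are made-up inputs, not measurements.

```python
STALL_COST = 4    # cycles per branch when stalling or when the guess is wrong
CORRECT_COST = 1  # assumed cycles per branch when "not taken" was right


def avg_branch_cost(fraction_taken):
    """Average cycles per branch under assume-not-taken."""
    fraction_not_taken = 1.0 - fraction_taken
    return fraction_not_taken * CORRECT_COST + fraction_taken * STALL_COST


for fraction_taken in (0.0, 0.5, 1.0):
    print(fraction_taken, avg_branch_cost(fraction_taken))
```

With these assumptions the scheme never does worse than always stalling, and the gain shrinks as more branches are taken.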
17 Problems With Predict-Taken
- Where do we get the address that the branch will jump to?
- Many branch instructions perform a computation to generate the branch address
- Cache issues when we predict incorrectly
- Starting down the taken path often fetches a new line of instructions into the cache, increasing pressure
- Predicting that all branches are not taken addresses these problems
- Can use compiler transformations to rearrange loops so that the conditional branch is at the top of the loop and make not-taken branches more common
18 Some static strategies
- Assume backwards branch is always taken, forward branch never is
- backwards = negative displacement field
- loops (which branch backwards) are usually executed multiple times.
- if-then-else often takes the then (no branch) clause.
- Compiler makes educated guess
- sets predict taken/not taken bit in instruction
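The backward-taken, forward-not-taken rule needs nothing more than the sign of the displacement. A minimal Python sketch, with hypothetical addresses and a hypothetical function name:

```python
def predict_taken(branch_pc, target_pc):
    """Backward taken, forward not taken: a negative displacement suggests a loop."""
    displacement = target_pc - branch_pc
    return displacement < 0


# Hypothetical addresses: a loop-closing branch and a forward if-then-else skip.
backward_guess = predict_taken(branch_pc=0x1040, target_pc=0x1000)
forward_guess = predict_taken(branch_pc=0x1040, target_pc=0x1080)
print("backward branch predicted taken:", backward_guess)
print("forward branch predicted taken:", forward_guess)
```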
19 Dynamic Branch Prediction
20 Why Dynamic Prediction
- Static prediction isn't good enough
- Predicting all branches as either taken or not taken doesn't give very good prediction rates
- Compilers aren't that good at predicting whether a given branch will be taken or not taken
- The best you can ever hope to do with static prediction is to figure out whether a given branch is taken or not taken more often
- Works well for some branches (loops)
- Doesn't work well for data-dependent branches (conditionals)
- Idea: Maybe we can do better if we have dynamic information about what the program is doing at a given time
21 Branch Prediction
[Diagram: the low bits of the program counter (0000, 0001, 0010, 0011, 0100, 0101, ...) index a branch history table with one bit per entry. Example code: the loop for (i = 0; i < 10; i++) { ... }, which ends with add i, i, 1 and beq i, 10, loop.]
This 1 bit means that the last time the program counter ended with 0100 and a beq instruction was seen, the branch was taken. The hardware guesses it will be taken again.
22 Branch Prediction
- Similar in structure to a direct-mapped cache
- Bits of address select exactly one entry that can hold a prediction for a given branch
- No tag field: we want a prediction for every branch, so use the value in the entry without checking the address
- Small, fast
- At one bit/entry, can afford a lot of entries
- Lookup process is very simple, can do in parallel with fetching each instruction
- If the instruction isn't a branch, just don't use the prediction.
23 Cons of the 1-bit Branch Prediction
- Aliasing between branches
- Branches that map onto the same entry cause conflicts, like conflict misses in caches
- Using only one bit of state mispredicts many branches
- Example: every time a loop is executed, it will mispredict at least twice
    for (i = 0; i < 100; i++)
        j = j + i;
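The two mispredictions per loop execution can be reproduced with a small simulation. This is a Python sketch, assuming a 16-entry table indexed directly by the low four PC bits, a loop-closing branch that is taken on every iteration except the last, and hypothetical branch addresses; the class and function names are illustrative.

```python
class OneBitBHT:
    """Branch history table: one bit per entry, indexed by low PC bits, no tag."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)  # False = predict not taken

    def index(self, pc):
        return pc & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        self.table[self.index(pc)] = taken  # remember only the last outcome


def loop_branch_outcomes(iterations):
    """Loop-closing branch: taken on every iteration except the last."""
    return [True] * (iterations - 1) + [False]


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


LOOP_BRANCH_PC = 0x24  # hypothetical address whose low bits are 0100
bht = OneBitBHT()
per_run = [
    count_mispredictions(bht, LOOP_BRANCH_PC, loop_branch_outcomes(100))
    for _ in range(3)
]
print("mispredictions per loop execution:", per_run)
```

Each execution misses once on the way in (the bit still says "not taken" from the previous exit) and once on the way out. Because there is no tag, a second branch at a hypothetical address such as 0x44 would share the same entry and disturb it.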
24 Two-bit predictors give better loop prediction
[State diagram of a 2-bit predictor, shown next to the loop for (i = 0; i < 10; i++) { ... }, which ends with add i, i, 1 and beq i, 10, loop.]
One state means the last two branches at this location were taken.
Another state means the last two branches at this location were not taken.
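One common way to realize a 2-bit predictor is a saturating counter. The Python sketch below assumes that encoding (states 0 and 1 predict not taken, 2 and 3 predict taken); the state machine on the slide may use different transitions. It replays the loop-closing branch of a 10-iteration loop three times.

```python
class TwoBitCounter:
    """Saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


def run_loop(counter, iterations):
    """Count mispredictions of the loop-closing branch for one loop execution."""
    outcomes = [True] * (iterations - 1) + [False]
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


counter = TwoBitCounter()
per_run = [run_loop(counter, 10) for _ in range(3)]
print("mispredictions per loop execution:", per_run)
```

After the first execution warms the counter up, only the loop exit is mispredicted, because a single not-taken outcome no longer flips the prediction.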
25 Branch History Table
- has limited size
- 2 bits by N (e.g. 4K)
- uses low bits of branch address to choose entry
- what happens when table too small?
[Diagram: the low bits of the branch address select one 2-bit entry (for example 00) in the table, labeled PHT.]
26 2-bit prediction accuracy
Is this good enough?
29 Can We Do Better?
- Can we get more information dynamically than just the history of this branch?
- We can look at patterns (2-level predictor) for a particular branch.
- If the last eight branches were 00100100, then it is a good guess that the next one is 1 (taken)
[Diagram: the branch address selects a history entry in the BHT (for example 001001), and that history pattern selects a 2-bit predictor.]
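The pattern idea can be sketched as a per-branch history register that indexes a table of 2-bit counters. The Python code below is an illustration under assumptions: 8 history bits, saturating counters that start weakly not taken, an unbounded dictionary standing in for the history table, and a hypothetical branch address. It feeds the predictor the repeating outcome pattern 0, 0, 1.

```python
class LocalTwoLevelPredictor:
    """Per-branch history selects a 2-bit counter in a shared pattern table."""

    def __init__(self, history_bits=8):
        self.history_mask = (1 << history_bits) - 1
        self.history = {}  # branch PC -> recent outcomes of that branch
        self.counters = [1] * (1 << history_bits)  # 1 = weakly not taken

    def predict(self, pc):
        pattern = self.history.get(pc, 0)
        return self.counters[pattern] >= 2

    def update(self, pc, taken):
        pattern = self.history.get(pc, 0)
        count = self.counters[pattern]
        self.counters[pattern] = min(count + 1, 3) if taken else max(count - 1, 0)
        # Shift the newest outcome into this branch's history register.
        self.history[pc] = ((pattern << 1) | int(taken)) & self.history_mask


BRANCH_PC = 0x24  # hypothetical branch address
predictor = LocalTwoLevelPredictor()
outcomes = [False, False, True] * 40  # repeating not taken, not taken, taken
miss_flags = []
for taken in outcomes:
    miss_flags.append(predictor.predict(BRANCH_PC) != taken)
    predictor.update(BRANCH_PC, taken)

print("total mispredictions:", sum(miss_flags))
print("mispredictions after warm-up:", sum(miss_flags[12:]))
```

Once the history register has filled, the pattern 00100100 always leads to a taken branch and the other two rotations always lead to a not-taken one, so every later prediction is correct. A single 2-bit counter would keep missing the taken outcome in this sequence.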
30 Can We Do Better?
- Correlating branch predictors also look at other branches for clues
    if (i == 0)
        ...
    if (i > 7)
        ...
- Typically use two indexes
- Global history register --> history of last m branches (e.g., 0100011)
- branch address
31 Correlating Branch Predictors
- The global history register is a shift register that records the last n branches (of any address) encountered by the processor.
[Diagram: the value in the GHR selects one entry in a table of 2-bit predictors.]
32 Two-level correlating branch predictors
- Can use both the PC address and the GHR
[Diagram: the PC and the GHR feed a combining function, whose output selects one entry in a table of 2-bit predictors.]
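The slides do not say which combining function is used. The Python sketch below assumes XOR of PC bits with the global history (the scheme usually called gshare), a 16-entry table, 2 history bits, and two hypothetical branches A and B, where B is taken exactly when A was. A plain 2-bit counter for B is included for comparison.

```python
import random


class GsharePredictor:
    """2-bit counters indexed by PC bits XOR global history (assumed function)."""

    def __init__(self, index_bits=4, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.ghr = 0  # global history register, newest outcome in bit 0
        self.counters = [0] * (1 << index_bits)

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.index_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)  # use the history as it was at prediction time
        count = self.counters[i]
        self.counters[i] = min(count + 1, 3) if taken else max(count - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.history_mask


PC_A, PC_B = 0x40, 0x50  # hypothetical branch addresses
rng = random.Random(0)   # fixed seed so the run is repeatable
gshare = GsharePredictor()
counter_b = 0            # history-free 2-bit counter for branch B
misses_gshare_b = 0
misses_counter_b = 0
for _ in range(1000):
    a_taken = rng.random() < 0.5  # branch A: data dependent
    b_taken = a_taken             # branch B: tests the same condition
    gshare.update(PC_A, a_taken)
    if gshare.predict(PC_B) != b_taken:
        misses_gshare_b += 1
    gshare.update(PC_B, b_taken)
    if (counter_b >= 2) != b_taken:
        misses_counter_b += 1
    counter_b = min(counter_b + 1, 3) if b_taken else max(counter_b - 1, 0)

print("branch B misses with global history:", misses_gshare_b)
print("branch B misses with a plain 2-bit counter:", misses_counter_b)
```

With the outcome of A sitting in the low history bit, each table entry used by B always sees the same outcome, so B is mispredicted only during warm-up. The plain counter has no such clue and does about as well as a coin flip on this input.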
33 Performance of 2-level Correlating Branch Prediction
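34 Can we do even better?
- Combining branch predictors use multiple schemes and a voter to decide which one typically does better for that branch.
[Diagram: the PC indexes two predictors, P1 and P2, and a selector table; the selected entry shown reads "use P2".]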
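The voter can itself be a table of 2-bit counters that drifts toward whichever predictor was right when the two disagree. The Python sketch below makes that assumption and uses two trivial stand-ins for P1 and P2 (always not taken, always taken); real component predictors would be trained as well, which is omitted here.

```python
def p1_always_not_taken(pc):
    return False  # stand-in for predictor P1


def p2_always_taken(pc):
    return True   # stand-in for predictor P2


class CombiningPredictor:
    """Chooses between two predictors with a 2-bit selector per entry."""

    def __init__(self, p1, p2, index_bits=4):
        self.p1, self.p2 = p1, p2
        self.mask = (1 << index_bits) - 1
        self.selector = [0] * (1 << index_bits)  # 0-1: use P1, 2-3: use P2

    def predict(self, pc):
        use_p2 = self.selector[pc & self.mask] >= 2
        return self.p2(pc) if use_p2 else self.p1(pc)

    def update(self, pc, taken):
        p1_right = self.p1(pc) == taken
        p2_right = self.p2(pc) == taken
        i = pc & self.mask
        # Move the selector only when exactly one predictor was right.
        if p2_right and not p1_right:
            self.selector[i] = min(self.selector[i] + 1, 3)
        elif p1_right and not p2_right:
            self.selector[i] = max(self.selector[i] - 1, 0)


BRANCH_PC = 0x24  # hypothetical branch that is always taken
combined = CombiningPredictor(p1_always_not_taken, p2_always_taken)
misses = 0
for _ in range(10):
    if combined.predict(BRANCH_PC) != True:
        misses += 1
    combined.update(BRANCH_PC, True)
print("mispredictions before the selector settles on P2:", misses)
```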
35 Limitations of Branch Prediction
- Not enough to predict whether or not branch is taken
- Need to know target of branch if taken
- Predicting return branches from subroutines is hard
- Can use separate buffer to record call history
36 Branch Target Buffers (BTB)
- predict the location of branches in the instruction stream
- predict the destination of branches
37 BTB Operation
- use PC (all bits) for lookup
- match implies this is a branch
- if match and predict bits => taken, set PC to predicted PC
- if branch predict wrong, must recover (same as branch hazards we've already seen)
- if decode indicates branch when no BTB match, two choices
- look up prediction now and act on it
- just predict not taken
- when branch resolved, update BTB (at least prediction bits, maybe more)
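The lookup and update steps above can be sketched in a few lines of Python. The sketch assumes an unbounded dictionary keyed by the full PC, one prediction bit per entry, 4-byte instructions, and the "just predict not taken" choice on a BTB miss; a real BTB has a fixed number of entries and a replacement policy, and the names here are hypothetical.

```python
class BranchTargetBuffer:
    """Maps a full branch PC to [predicted target, predict-taken bit]."""

    def __init__(self):
        self.entries = {}

    def next_fetch_pc(self, pc):
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]  # match and predicted taken: fetch the target
        return pc + 4        # no match, or predicted not taken

    def resolve(self, pc, taken, target):
        """Update the BTB once the branch outcome is known."""
        self.entries[pc] = [target, taken]


btb = BranchTargetBuffer()
BRANCH_PC, TARGET_PC = 0x100, 0x80  # hypothetical addresses

first = btb.next_fetch_pc(BRANCH_PC)   # BTB miss: fall through
btb.resolve(BRANCH_PC, True, TARGET_PC)
second = btb.next_fetch_pc(BRANCH_PC)  # hit, predicted taken
btb.resolve(BRANCH_PC, False, TARGET_PC)
third = btb.next_fetch_pc(BRANCH_PC)   # hit, predicted not taken
print(hex(first), hex(second), hex(third))
```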
38 BTB Performance
- Two things that can go wrong
- didn't predict the branch (misfetch)
- mispredicted a branch (mispredict)
- Suppose BTB hit rate of 85% and predict accuracy of 90%, misfetch penalty of 2 cycles and mispredict penalty of 10 cycles, what is average branch penalty?
- Can use both BTB and branch predictor
- have no prediction bits in BTB (why is that a good idea?)
- presence of PC in BTB indicates a lookup in branch predictor to predict whether the branch will go to destination address in BTB.
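One way to work the average-penalty question is shown below in Python. The model is an assumption, since the slide leaves it open: every BTB miss is charged the misfetch penalty, every BTB hit that is predicted wrongly is charged the mispredict penalty, and a correctly predicted hit costs nothing.

```python
btb_hit_rate = 0.85
prediction_accuracy = 0.90
misfetch_penalty = 2     # cycles
mispredict_penalty = 10  # cycles

# Assumed model: misses pay the misfetch penalty, wrong hits pay the
# mispredict penalty, correct hits pay nothing.
misfetch_cost = (1 - btb_hit_rate) * misfetch_penalty
mispredict_cost = btb_hit_rate * (1 - prediction_accuracy) * mispredict_penalty
average_penalty = misfetch_cost + mispredict_cost
print("average branch penalty in cycles:", round(average_penalty, 2))
```

Under this model the two terms are 0.15 x 2 = 0.30 and 0.85 x 0.10 x 10 = 0.85 cycles, so the mispredict term dominates even though mispredictions are the rarer event.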
39 What about indirect jumps/returns?
- Branch predictor does really well with conditional jumps
- BTB does really well with unconditional jumps (jump, jal, etc.)
- Indirect jumps often jump to different destinations, even from the same instruction. Indirect jumps are most often used for return instructions.
- Return easily handled by a stack.
- jal -> push PC+4
- return -> predict jump to address on top of stack, pop stack
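The push and pop rules translate directly into a small return-address stack. This Python sketch assumes a fixed depth of 8 entries with the oldest entry dropped on overflow, and returns None when the stack is empty; both choices and the addresses are illustrative.

```python
class ReturnAddressStack:
    """Predicts return targets from the call history."""

    def __init__(self, depth=8):
        self.depth = depth  # assumed fixed capacity
        self.stack = []

    def on_call(self, pc):
        """jal at address pc: push the return address PC+4."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: forget the oldest call
        self.stack.append(pc + 4)

    def on_return(self):
        """Pop and return the predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)  # hypothetical call sites
ras.on_call(0x200)
inner_return = ras.on_return()
outer_return = ras.on_return()
empty_return = ras.on_return()
print(hex(inner_return), hex(outer_return), empty_return)
```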
41 Branch Prediction
- The latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and soon possibly AI techniques.
- Presupposes what two pieces of information are available at fetch time?
-
-
- Branch Target Buffer supplies this information.
42 Control Hazards -- Key Points
- Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching.
- Control hazards are detected in hardware.
- We can reduce the impact of control hazards through:
- early detection of branch address and condition
- branch prediction
- branch delay slots
43 Modern Processors