Branch Prediction - PowerPoint PPT Presentation

Provided by: Sapn

Transcript and Presenter's Notes

Title: Branch Prediction

1
Branch Prediction

Subbaiah Venkata

2
Branch Hazards
[Pipeline diagram, clock cycles CC1-CC8; stages IM, Reg, DM]
beq $2, $1, here
add ...
sub ...
lw ...
These instructions shouldn't be executed!
here: lw ...
Finally, the right instruction
3
Objectives

Ways to deal with control hazards
Stalling for branch hazards
Reducing Branch delays
Eliminating Branch Stalls
Predicting Branches
Static
Dynamic

4
Stalling for Branch Hazards
[Pipeline diagram, clock cycles CC1-CC8; stages IM, Reg, DM, Reg]
beq $4, $0, there
and $12, $2, $5
or ...
add ...
sw ...
5
Stalling for Branch Hazards

Seems wasteful, particularly when the branch
isn't taken.
Makes all branches cost 4 cycles.

9
Eliminating the Branch Stall

There's no rule that says we have to see the
effect of the branch immediately. Why not wait
an extra instruction before branching?
The original SPARC and MIPS processors each used
a single branch delay slot to eliminate
single-cycle stalls after branches.
The instruction after a conditional branch is
always executed in those machines, regardless of
whether the branch is taken or not!

10
Branch Delay Slot
[Pipeline diagram, clock cycles CC1-CC8; stages IM, Reg, DM, Reg]
beq $4, $0, there
and $12, $2, $5
there: or ...
add ...
sw ...
Branch delay slot instruction (next instruction
after a branch) is executed even if the branch is
taken.
11
Filling the branch delay slot

add $5, $3, $7
sub $6, $1, $4
and $7, $8, $2
beq $6, $7, there
nop            /* branch delay slot */
add $9, $1, $2
sub $2, $9, $5
...
there:
mult $2, $10, $11

12
Filling the branch delay slot

The branch delay slot is only useful if you can
find something to put there.
If you can't find anything, you must put a nop there to
ensure correctness.

13
Problems with branch delay slots

Exposes pipeline structure to programmer
(compatibility)
Ability to find instructions for delay slots
drops off rapidly

14
Assume Branch Not Taken

Works pretty well when you're right

[Pipeline diagram, clock cycles CC1-CC8; stages IM, Reg, DM, Reg]
beq $4, $0, there
and $12, $2, $5
or ...
add ...
sw ...
15
Assume Branch Not Taken

Same performance as stalling when you're wrong

[Pipeline diagram, clock cycles CC1-CC8; stages IM, Reg]
beq $4, $0, there
and $12, $2, $5
or ...
add ...
there: sub $12, $4, $2
16
Assume Branch Not Taken

Performance depends on the percentage of time you
guess right.
Flushing an instruction means preventing it from
changing any permanent state (registers, memory,
PC).
Sounds a lot like a bubble...
But notice that we need to be able to insert
those bubbles later in the pipeline.
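
The dependence on guessing right can be put into numbers. The following is a minimal sketch, assuming the branch itself takes one cycle and a wrong guess adds three flushed cycles, which matches the four-cycle cost of a stalled branch quoted earlier. The function name and the default penalty are illustrative and do not come from the slides.

```python
def cycles_per_branch(accuracy, flush_penalty=3):
    """Average cost of a branch, in cycles, for a fixed guess.

    Assumptions (not from the slides): the branch itself takes 1 cycle and
    every wrong guess adds `flush_penalty` bubble cycles.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1 + (1 - accuracy) * flush_penalty


# A guess that is never right costs the same as always stalling.
print(cycles_per_branch(0.0))  # 4.0
# A guess that is always right hides the branch completely.
print(cycles_per_branch(1.0))  # 1.0
```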

17
Problems With Predict-Taken

Where do we get the address that the branch will
jump to?
Many branch instructions perform a computation to
generate the branch address
Cache issues when the prediction is incorrect
Starting down the taken path often fetches a new
line of instructions into the cache, increasing
cache pressure
Predicting that all branches are not taken
addresses these problems
Can use compiler transformations to rearrange
loops so that the conditional branch is at the
top of the loop and make not-taken branches more
common

18
Some static strategies

Assume a backward branch is always taken and a forward
branch never is
backward = negative displacement field
loops (which branch backward) are usually
executed multiple times
if-then-else often takes the then (no branch)
clause
Compiler makes an educated guess
sets a predict taken/not taken bit in the instruction
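
The backward-taken, forward-not-taken rule needs nothing but the sign of the displacement field. A minimal sketch, assuming the displacement is available as a signed integer; the function name is hypothetical.

```python
def predict_static(displacement):
    """Backward taken, forward not taken. True means 'predict taken'."""
    return displacement < 0


print(predict_static(-16))  # True: a loop-closing backward branch
print(predict_static(24))   # False: a forward branch, e.g. skipping an else clause
```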

19
Dynamic Branch Prediction
20
Why Dynamic Prediction

Static prediction isn't good enough
Predicting all branches as either taken or not taken
doesn't give very good prediction rates
Compilers aren't that good at predicting whether
a given branch will be taken or not taken
The best you can ever hope to do with static
prediction is to figure out whether a given
branch is taken or not taken more often
Works well for some branches (loops)
Doesn't work well for data-dependent branches
(conditionals)
Idea: maybe we can do better if we have dynamic
information about what the program is doing at a
given time

21
Branch Prediction
Branch history table
[Figure: the low bits of the program counter (0000, 0001, 0010, 0011, 0100, 0101, ...) index a table holding one prediction bit per entry]
for (i=0; i<10; i++) ...
...
add $i, $i, 1
beq $i, 10, loop
This 1 bit means: the last time the
program counter ended with 0100 and a beq
instruction was seen, the branch was taken.
Hardware guesses it will be taken again.
22
Branch Prediction

Similar in structure to a direct-mapped cache
Bits of address select exactly one entry that can
hold a prediction for a given branch
No tag field: we want a prediction for every branch,
so use the value in the entry without checking
the address
Small, fast
At one bit/entry, can afford a lot of entries
Lookup process is very simple, can do in parallel
with fetching each instruction
If the instruction isn't a branch, just don't use the
prediction.
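
The tagless table described above can be sketched in a few lines. This is an illustration under stated assumptions, not a hardware description: the class name, the 4-bit index, and dropping the two lowest PC bits (4-byte instructions) are all choices made here.

```python
class OneBitBHT:
    """Tagless, direct-mapped table of 1-bit predictions."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # 0 = not taken, 1 = taken

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the two lowest PC bits carry
        # no information and are dropped before indexing.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # No tag check: whatever sits in the entry is the prediction.
        return self.table[self._index(pc)] == 1

    def update(self, pc, taken):
        self.table[self._index(pc)] = 1 if taken else 0


bht = OneBitBHT()
bht.update(0x40, True)
print(bht.predict(0x40))  # True
print(bht.predict(0x80))  # True: 0x80 selects the same entry and is never checked
```

Because there is no tag, the second lookup silently reuses the first branch's bit.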

23
Cons of the 1-bit Branch Prediction

Aliasing between branches
Branches that map onto the same entry cause
conflicts, like conflict misses in caches
Using only one bit of state mispredicts many
branches
Example: every time a loop is executed, the predictor will
mispredict at least twice
for (i = 0; i < 100; i++)
j = j + i;
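
The two mispredictions per loop execution can be checked with a small simulation. A sketch, assuming the loop-closing branch is taken on every iteration except the last and that the single bit starts at "not taken"; both helper names are hypothetical.

```python
def loop_outcomes(iterations, runs):
    """Outcomes of a loop-closing branch: taken until the final iteration."""
    return ([True] * (iterations - 1) + [False]) * runs


def count_mispredicts_1bit(outcomes, initial=False):
    last = initial  # the single bit of state: the last outcome seen
    wrong = 0
    for taken in outcomes:
        if taken != last:
            wrong += 1
        last = taken
    return wrong


# Three executions of a 100-iteration loop: one miss on entry and one on
# exit, every time.
print(count_mispredicts_1bit(loop_outcomes(100, 3)))  # 6
```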

24
Two-bit predictors give better loop prediction
for (i=0; i<10; i++) ...
...
add $i, $i, 1
beq $i, 10, loop
[Figure: 2-bit predictor state diagram]
This state means the last two branches at
this location were taken.
This one means the last two branches at
this location were not taken.
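
One common way to realize two bits of state is a saturating counter; the slides' state diagram may differ in detail, so treat the following as a sketch under that assumption. The loop model (branch taken on every iteration except the last) and the helper names are also assumptions.

```python
def loop_outcomes(iterations, runs):
    """Outcomes of a loop-closing branch: taken until the final iteration."""
    return ([True] * (iterations - 1) + [False]) * runs


def count_mispredicts_2bit(outcomes, counter=0):
    """Saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""
    wrong = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            wrong += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return wrong


# Three executions of a 10-iteration loop, starting from "strongly not taken":
# two warm-up misses, then one miss per loop exit.
print(count_mispredicts_2bit(loop_outcomes(10, 3)))  # 5
```

After warm-up, a single not-taken outcome at loop exit no longer flips the prediction, so re-entering the loop is predicted correctly.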
25
Branch History Table

Has limited size
2 bits by N (e.g. 4K)
Uses low bits of branch address to choose entry
What happens when the table is too small?

[Figure: the branch address selects one 2-bit entry in the PHT]
26
2-bit prediction accuracy
Is this good enough?
29
Can We Do Better?

Can we get more information dynamically than just
the history of this branch?
We can look at patterns (2-level predictor) for a
particular branch.
If the last eight branches were 00100100, then it is a good
guess that the next one is 1 (taken)

[Figure: the branch address selects a per-branch history in the BHT (e.g., 000000, 111111, 001001), and that history selects a 2-bit predictor]
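
A per-branch pattern predictor can be sketched as a history register per branch that selects a 2-bit counter. This is an illustration only: the dictionaries stand in for finite hardware tables (so there is no aliasing), and the class name, the 6-bit history, and the initial counter value are assumptions.

```python
class LocalTwoLevel:
    """Per-branch history register selecting a 2-bit saturating counter."""

    def __init__(self, history_bits=6):
        self.history_mask = (1 << history_bits) - 1
        self.histories = {}  # pc -> recent outcomes packed as bits
        self.counters = {}   # (pc, history) -> 2-bit counter

    def predict(self, pc):
        h = self.histories.get(pc, 0)
        return self.counters.get((pc, h), 1) >= 2

    def update(self, pc, taken):
        h = self.histories.get(pc, 0)
        c = self.counters.get((pc, h), 1)
        self.counters[(pc, h)] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.histories[pc] = ((h << 1) | int(taken)) & self.history_mask


# A branch that repeats not taken, not taken, taken (0, 0, 1, 0, 0, 1, ...).
predictor = LocalTwoLevel()
pattern = [False, False, True] * 20
wrong_late = 0
for n, taken in enumerate(pattern):
    if n >= 30 and predictor.predict(0x40) != taken:
        wrong_late += 1
    predictor.update(0x40, taken)
print(wrong_late)  # 0: once trained, the repeating pattern is predicted exactly
```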
30
Can We Do Better?

Correlating branch predictors also look at other
branches for clues
if (i == 0)
...
if (i > 7)
...
Typically use two indexes
Global history register --> history of last m
branches (e.g., 0100011)
branch address

31
Correlating Branch Predictors

The global history register is a shift register
that records the last n branches (of any address)
encountered by the processor.

[Figure: the GHR selects one entry in a table of 2-bit predictors]
32
Two-level correlating branch predictors

Can use both the PC address and the GHR

[Figure: the PC and the GHR feed a combining function whose output selects one entry in a table of 2-bit predictors]
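
The slides do not say what the combining function is. One widely used choice is to XOR the PC with the global history (often called gshare), and the sketch below assumes that choice. The class name, the table size, and the dropped low PC bits are also assumptions.

```python
class GShare:
    """Global-history predictor that combines PC and GHR with XOR."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                             # global history shift register
        self.counters = [1] * (1 << index_bits)  # 2-bit counters, weakly not taken

    def _index(self, pc):
        # Assumption: XOR as the combining function, 4-byte instructions.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the newest outcome into the history, whatever branch it was.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


g = GShare()
for _ in range(20):
    g.update(0x40, True)
print(g.predict(0x40))  # True
```

Because the history is part of the index, the same branch can hold different predictions for different recent outcomes of other branches.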
33
Performance of 2-level Correlating Branch Prediction
34
Can we do even better?

Combining branch predictors: use multiple schemes
and a voter to decide which one typically does
better for that branch.

[Figure: the PC indexes predictors P1 and P2 and a chooser; in the example shown the chooser says "use P2"]
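
The voter can be sketched as a per-branch 2-bit counter that moves toward whichever component was right when the two disagree. The two component predictors below are deliberately trivial stand-ins, and every class name here is hypothetical.

```python
class LastOutcome:
    """Stand-in for P1: predicts whatever the branch did last time."""

    def __init__(self):
        self.last = {}

    def predict(self, pc):
        return self.last.get(pc, False)

    def update(self, pc, taken):
        self.last[pc] = taken


class AlwaysTaken:
    """Stand-in for P2: static predict-taken."""

    def predict(self, pc):
        return True

    def update(self, pc, taken):
        pass


class Tournament:
    def __init__(self, p1, p2):
        self.p1, self.p2 = p1, p2
        self.choice = {}  # pc -> 2-bit counter; 0, 1 favour P1; 2, 3 favour P2

    def predict(self, pc):
        use_p2 = self.choice.get(pc, 1) >= 2
        return (self.p2 if use_p2 else self.p1).predict(pc)

    def update(self, pc, taken):
        ok1 = self.p1.predict(pc) == taken
        ok2 = self.p2.predict(pc) == taken
        c = self.choice.get(pc, 1)
        if ok2 and not ok1:
            c = min(c + 1, 3)
        elif ok1 and not ok2:
            c = max(c - 1, 0)
        self.choice[pc] = c
        self.p1.update(pc, taken)
        self.p2.update(pc, taken)


t = Tournament(LastOutcome(), AlwaysTaken())
for taken in [True, False] * 4:  # alternating branch: P1 is always wrong
    t.update(0x40, taken)
print(t.predict(0x40))  # True: the chooser has settled on P2
```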
35
Limitations of Branch Prediction

Not enough to predict whether or not branch is
taken
Need to know target of branch if taken
Predicting return branches from subroutines is
hard
Can use separate buffer to record call history

36
Branch Target Buffers (BTB)

predict the location of branches in the
instruction stream
predict the destination of branches

37
BTB Operation

use PC (all bits) for lookup
match implies this is a branch
if match and predict bits => taken, set PC to
predicted PC
if branch predicted wrong, must recover (same as
branch hazards we've already seen)
if decode indicates branch when no BTB match, two
choices
look up prediction now and act on it
just predict not taken
when branch resolved, update BTB (at least
prediction bits, maybe more)
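
The lookup and update steps above can be sketched as follows. A dictionary keyed by the full PC stands in for the tag match. Allocating an entry only when a branch is first taken, the 2-bit prediction counter, and 4-byte instructions are assumptions of this sketch, not statements from the slides.

```python
class BTB:
    """Branch target buffer: full-PC match, predicted target, prediction bits."""

    def __init__(self):
        self.entries = {}  # pc -> [target, 2-bit counter]

    def next_pc(self, pc):
        entry = self.entries.get(pc)
        if entry is not None and entry[1] >= 2:  # match and predicted taken
            return entry[0]
        return pc + 4  # no match, or predicted not taken: fall through

    def update(self, pc, taken, target):
        entry = self.entries.get(pc)
        if entry is None:
            if taken:  # assumption: allocate on the first taken branch
                self.entries[pc] = [target, 2]
            return
        if taken:
            entry[0] = target
            entry[1] = min(entry[1] + 1, 3)
        else:
            entry[1] = max(entry[1] - 1, 0)


btb = BTB()
print(hex(btb.next_pc(0x100)))  # 0x104: nothing known yet
btb.update(0x100, True, 0x200)
print(hex(btb.next_pc(0x100)))  # 0x200: match and predicted taken
```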

38
BTB Performance

Two things that can go wrong
didn't predict the branch (misfetch)
mispredicted a branch (mispredict)
Suppose BTB hit rate of 85% and predict accuracy
of 90%, misfetch penalty of 2 cycles and
mispredict penalty of 10 cycles, what is average
branch penalty?
Can use both BTB and branch predictor
have no prediction bits in BTB (why is that a
good idea?)
presence of PC in BTB indicates a lookup in
branch predictor to predict whether the branch
will go to destination address in BTB.
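
One way to work the average-penalty question above is sketched here. It assumes that every BTB miss pays the misfetch penalty and that only BTB hits can be mispredicted; the slides do not spell out these assumptions, and other accounting choices would give a different number. The function name is hypothetical.

```python
def average_branch_penalty(hit_rate, accuracy, misfetch_cycles, mispredict_cycles):
    """Expected penalty per branch, in cycles, under the assumptions above."""
    misfetch = (1 - hit_rate) * misfetch_cycles
    mispredict = hit_rate * (1 - accuracy) * mispredict_cycles
    return misfetch + mispredict


# 0.15 * 2 + 0.85 * 0.10 * 10 = 0.30 + 0.85
print(round(average_branch_penalty(0.85, 0.90, 2, 10), 2))  # 1.15
```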

39
What about indirect jumps/returns?

Branch predictor does really well with
conditional jumps
BTB does really well with unconditional jumps
(jump, jal, etc.)
Indirect jumps often jump to different
destinations, even from the same instruction.
Indirect jumps most often used for return
instructions.
Return easily handled by a stack.
jal -> push PC+4
return -> predict jump to address on top of
stack, pop stack
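
The push and pop behaviour can be sketched as a small bounded stack. The depth of 16 and the policy of dropping the oldest entry on overflow are assumptions made for illustration; the class and method names are hypothetical.

```python
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, pc):
        """jal at address pc: push the return address PC+4."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # full: drop the oldest entry (one possible policy)
        self.stack.append(pc + 4)

    def predict_return(self):
        """return: predict the address on top of the stack and pop it."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)  # outer call
ras.on_call(0x200)  # nested call
print(hex(ras.predict_return()))  # 0x204: the inner return comes first
print(hex(ras.predict_return()))  # 0x104
```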

41
Branch Prediction

Latest branch predictors significantly more
sophisticated, using more advanced correlating
techniques, larger structures, and soon possibly
using AI techniques.
Presupposes what two pieces of information are
available at fetch time?
Branch Target Buffer supplies this information.

42
Control Hazards -- Key Points

Control (or branch) hazards arise because we must
fetch the next instruction before we know if we
are branching or where we are branching.
Control hazards are detected in hardware.
We can reduce the impact of control hazards
through
early detection of branch address and condition
branch prediction
branch delay slots

43
Modern Processors