Transcript and Presenter's Notes

Title: Branch Prediction


1
Branch Prediction
  • Subbaiah Venkata

2
Branch Hazards
[Pipeline diagram, clock cycles CC1-CC8: beq $2, $1, here enters the pipeline and is followed by add ..., sub ..., and lw ... These instructions shouldn't be executed if the branch is taken. Only once the branch resolves is the correct target instruction, here: lw ..., finally fetched.]
3
Objectives
  • Ways to deal with control hazards
  • Stalling for branch hazards
  • Reducing Branch delays
  • Eliminating Branch Stalls
  • Predicting Branches
  • Static
  • Dynamic

4
Stalling for Branch Hazards
[Pipeline diagram, clock cycles CC1-CC8: beq $4, $0, there flows through IM, Reg, DM, and Reg while the following instructions (and $12, $2, $5; or ...; add ...; sw ...) are held back, stalling the pipeline until the branch outcome is known.]
5
Stalling for Branch Hazards
  • Seems wasteful, particularly when the branch
    isn't taken.
  • Makes all branches cost 4 cycles.

6
Stalling for Branch Hazards
[Stall diagram repeated from the earlier slide (animation step): the instructions after beq $4, $0, there remain stalled until the branch resolves.]
7
Stalling for Branch Hazards
[Stall diagram repeated again (animation step), showing the stall cycles inserted behind the branch.]
8
Eliminating the Branch Stall
  • There's no rule that says we have to see the
    effect of the branch immediately. Why not wait
    an extra instruction before branching?

9
Eliminating the Branch Stall
  • There's no rule that says we have to see the
    effect of the branch immediately. Why not wait
    an extra instruction before branching?
  • The original SPARC and MIPS processors each used
    a single branch delay slot to eliminate
    single-cycle stalls after branches.
  • The instruction after a conditional branch is
    always executed in those machines, regardless of
    whether the branch is taken or not!

10
Branch Delay Slot
[Pipeline diagram, clock cycles CC1-CC8: beq $4, $0, there is followed by and $12, $2, $5 in the delay slot, and then execution continues at the target there: or ..., add ..., sw ...]
Branch delay slot instruction (next instruction
after a branch) is executed even if the branch is
taken.
11
Filling the branch delay slot
  • add $5, $3, $7
  • sub $6, $1, $4
  • and $7, $8, $2
  • beq $6, $7, there
  • nop    /* branch delay slot */
  • add $9, $1, $2
  • sub $2, $9, $5
  • ...
  • there:
  • mult $2, $10, $11

12
Filling the branch delay slot
  • The branch delay slot is only useful if you can
    find something to put there.
  • If you can't find anything, you must put a nop
    there to ensure correctness.

13
Problems with branch delay slots
  • Exposes pipeline structure to programmer
    (compatibility)
  • Ability to find instructions for delay slots
    drops off rapidly

14
Assume Branch Not Taken
  • works pretty well when you're right

[Pipeline diagram, clock cycles CC1-CC8: the branch beq $4, $0, there is not taken, so the fall-through instructions (and $12, $2, $5; or ...; add ...; sw ...) proceed through the pipeline with no cycles lost.]
15
Assume Branch Not Taken
  • same performance as stalling when you're wrong

[Pipeline diagram, clock cycles CC1-CC8: the branch beq $4, $0, there is taken, so the fall-through instructions (and $12, $2, $5; or ...; add ...) already in the pipeline are flushed, and the target instruction there: sub $12, $4, $2 is fetched once the branch resolves.]
16
Assume Branch Not Taken
  • Performance depends on percentage of time you
    guess right.
  • Flushing an instruction means to prevent it from
    changing any permanent state (registers, memory,
    PC).
  • sounds a lot like a bubble...
  • But notice that we need to be able to insert
    those bubbles later in the pipeline

17
Problems With Predict-Taken
  • Where do we get the address that the branch will
    jump to?
  • Many branch instructions perform a computation to
    generate the branch address
  • Cache issues when predict incorrectly
  • Starting down the taken path often fetches a new
    line of instructions into the cache, increasing
    pressure
  • Predicting that all branches are not taken
    addresses these problems
  • Can use compiler transformations to rearrange
    loops so that the conditional branch is at the
    top of the loop and make not-taken branches more
    common

18
Some static strategies
  • Assume a backwards branch is always taken and a
    forward branch never is (see the sketch after
    this list)
  • backwards = negative displacement field
  • loops (which branch backwards) are usually
    executed multiple times.
  • if-then-else often takes the then (no branch)
    clause.
  • Compiler makes educated guess
  • sets predict taken/not taken bit in instruction
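A one-function C sketch of the backward-taken, forward-not-taken heuristic from the first bullet (the function name is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* A branch whose target is at a lower address (negative displacement)
     * usually closes a loop, so predict it taken; predict forward branches
     * not taken. */
    bool static_predict(uint32_t branch_pc, uint32_t target_pc) {
        return target_pc < branch_pc;
    }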

19
Dynamic Branch Prediction
20
Why Dynamic Prediction
  • Static prediction isn't good enough
  • Predicting all branches as either taken or not
    taken doesn't give very good prediction rates
  • Compilers aren't that good at predicting whether
    a given branch will be taken or not taken
  • The best you can ever hope to do with static
    prediction is to figure out whether a given
    branch is taken or not taken more often
  • Works well for some branches (loops)
  • Doesn't work well for data-dependent branches
    (conditionals)
  • Idea: maybe we can do better if we have dynamic
    information about what the program is doing at a
    given time

21
Branch Prediction
Branch history table
[Figure: a branch history table indexed by the low-order bits of the program counter (0000, 0001, 0010, 0011, 0100, 0101, ...), with one prediction bit per entry. The example loop for (i = 0; i < 10; i++) { ... } ends in ... add i, i, 1; beq i, 10, loop.]
This 1 bit means: the last time the program counter ended with 0100 and a beq instruction was seen, the branch was taken. The hardware guesses it will be taken again.
22
Branch Prediction
  • Similar in structure to a direct-mapped cache
  • Bits of address select exactly one entry that can
    hold a prediction for a given branch
  • No tag field: we want a prediction for every
    branch, so use the value in the entry without
    checking the address
  • Small, fast
  • At one bit/entry, can afford a lot of entries
  • Lookup process is very simple, can do in parallel
    with fetching each instruction
  • If the instruction isn't a branch, just don't use
    the prediction.
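A minimal C sketch of such a tagless, direct-mapped one-bit table, assuming a 4096-entry array indexed by the low-order bits of a word-aligned PC (the size and the names bht, predict_1bit, and update_1bit are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096               /* assumed table size (power of two) */

    static uint8_t bht[BHT_ENTRIES];       /* one prediction bit per entry, no tag */

    /* Index with the low-order PC bits, dropping the two alignment bits. */
    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);
    }

    /* Lookup can run in parallel with fetch; the result is simply ignored
     * if the fetched instruction turns out not to be a branch. */
    bool predict_1bit(uint32_t pc) {
        return bht[bht_index(pc)] != 0;
    }

    /* Record the actual outcome once the branch resolves. */
    void update_1bit(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken ? 1 : 0;
    }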

23
Cons of the 1-bit Branch Prediction
  • Aliasing between branches
  • Branches that map onto the same entry cause
    conflicts, like conflict misses in caches
  • Using only one bit of state mispredicts many
    branches
  • Example: every time a loop is executed, it will
    mispredict at least twice (once on the final,
    not-taken exit iteration, and again on the first
    iteration of the next execution, because the exit
    flipped the prediction bit)
  • for (i = 0; i < 100; i++)
  •     j = j + i;

24
Two-bit predictors give better loop prediction
for (i = 0; i < 10; i++) { ... }
... add i, i, 1; beq i, 10, loop
[Figure: the 2-bit predictor's state-transition diagram. One state means the last two branches at this location were taken; another means the last two branches at this location were not taken.]
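A small C sketch of the 2-bit saturating counter that implements this state machine, assuming the usual encoding 0 = strongly not taken, 1 = weakly not taken, 2 = weakly taken, 3 = strongly taken (the function names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Counter values 2 and 3 predict taken, 0 and 1 predict not taken. */
    bool predict_2bit(uint8_t counter) {
        return counter >= 2;
    }

    /* Move one step toward the actual outcome, saturating at 0 and 3. */
    uint8_t update_2bit(uint8_t counter, bool taken) {
        if (taken)
            return counter < 3 ? counter + 1 : 3;
        else
            return counter > 0 ? counter - 1 : 0;
    }

With two bits, the single not-taken branch at loop exit only moves the counter from strongly taken to weakly taken, so the first iteration of the next loop execution is still predicted correctly.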
25
Branch History Table
  • has limited size
  • 2 bits × N entries (e.g., N = 4K)
  • uses low bits of branch address to choose entry
  • what happens when table too small?

[Figure: a pattern history table (PHT) of 2-bit entries indexed by the branch address.]
26
2-bit prediction accuracy
Is this good enough?
27
Can We Do Better?
  • Can we get more information dynamically than just
    the history of this branch?

28
Can We Do Better?
  • Can we get more information dynamically than just
    the history of this branch?
  • We can look at patterns (2-level predictor) for a
    particular branch.
  • if the last eight outcomes of this branch were
    00100100, then it is a good guess that the next
    one is 1 (taken)

29
Can We Do Better?
  • Can we get more information dynamically than just
    the history of this branch?
  • We can look at patterns (2-level predictor) for a
    particular branch.
  • if the last eight outcomes of this branch were
    00100100, then it is a good guess that the next
    one is 1 (taken)

[Figure: two-level predictor. The branch address selects a per-branch history register (patterns such as 000000, 111111, 001001), and that pattern in turn selects a 2-bit counter that supplies the prediction.]
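A compact C sketch of this two-level idea, assuming a table of 6-bit per-branch histories feeding a pattern history table of 2-bit counters; the sizes and the names local_hist and pht are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_ENTRIES 1024          /* assumed number of per-branch histories */
    #define HIST_BITS    6             /* pattern bits kept per branch */

    static uint8_t local_hist[HIST_ENTRIES];   /* last HIST_BITS outcomes per branch */
    static uint8_t pht[1 << HIST_BITS];        /* 2-bit counters indexed by the pattern */

    bool predict_twolevel(uint32_t pc) {
        uint8_t pattern = local_hist[(pc >> 2) % HIST_ENTRIES];
        return pht[pattern] >= 2;              /* 2 or 3 means predict taken */
    }

    void update_twolevel(uint32_t pc, bool taken) {
        unsigned i = (pc >> 2) % HIST_ENTRIES;
        uint8_t pattern = local_hist[i];
        uint8_t *ctr = &pht[pattern];
        if (taken)  { if (*ctr < 3) (*ctr)++; }
        else        { if (*ctr > 0) (*ctr)--; }
        /* Shift the new outcome into this branch's history pattern. */
        local_hist[i] = (uint8_t)(((pattern << 1) | (taken ? 1 : 0))
                                  & ((1 << HIST_BITS) - 1));
    }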
30
Can We Do Better?
  • Correlating Branch Predictors also look at other
    branches for clues
  • if (i == 0)
  •   ...
  • if (i > 7)
  •   ...
  • Typically use two indexes
  • Global history register --> history of the last m
    branches (e.g., 0100011)
  • branch address

31
Correlating Branch Predictors
  • The global history register is a shift register
    that records the last n branches (of any address)
    encountered by the processor.

[Figure: the global history register (ghr) selects an entry in a table of 2-bit predictors.]
32
Two-level correlating branch predictors
  • Can use both the PC address and the GHR

[Figure: the ghr and the PC feed a combining function, whose output indexes a table of 2-bit predictors.]
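One common choice of combining function is to XOR the global history with the PC bits (a gshare-style predictor). The slides only say the two are combined, so the XOR, the table size, and the names below are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define GHR_BITS 10
    #define PHT_SIZE (1 << GHR_BITS)

    static uint16_t ghr;               /* global history of the last GHR_BITS branches */
    static uint8_t  pht[PHT_SIZE];     /* 2-bit saturating counters */

    static unsigned gshare_index(uint32_t pc) {
        return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);   /* combining function: XOR */
    }

    bool predict_gshare(uint32_t pc) {
        return pht[gshare_index(pc)] >= 2;
    }

    void update_gshare(uint32_t pc, bool taken) {
        uint8_t *ctr = &pht[gshare_index(pc)];
        if (taken)  { if (*ctr < 3) (*ctr)++; }
        else        { if (*ctr > 0) (*ctr)--; }
        /* Shift the new outcome into the global history. */
        ghr = (uint16_t)(((ghr << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1));
    }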
33
Performance of 2-level Correlating Branch
Prediction
34
Can we do even better?
  • Combining branch predictors use multiple schemes
    and a voter to decide which one typically does
    better for that branch.

[Figure: two component predictors P1 and P2 run in parallel; a PC-indexed chooser records which one has done better for this branch (e.g., "use P2") and selects that prediction.]
35
Limitations of Branch Prediction
  • Not enough to predict whether or not branch is
    taken
  • Need to know target of branch if taken
  • Predicting return branches from subroutines is
    hard
  • Can use separate buffer to record call history

36
Branch Target Buffers (BTB)
  • predict the location of branches in the
    instruction stream
  • predict the destination of branches

37
BTB Operation
  • use PC (all bits) for lookup (see the sketch
    after this list)
  • a match implies this is a branch
  • if match and the prediction bits say taken, set
    PC to the predicted PC
  • if the branch prediction is wrong, must recover
    (same as the branch hazards we've already seen)
  • if decode indicates a branch when there is no BTB
    match, two choices
  • look up the prediction now and act on it
  • just predict not taken
  • when the branch resolves, update the BTB (at
    least the prediction bits, maybe more)
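A rough C sketch of this lookup-and-update flow, assuming a small direct-mapped BTB whose entries hold the full PC as a tag, a predicted target, and 2-bit prediction counters (the entry layout, sizes, and names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512

    struct btb_entry {
        bool     valid;
        uint32_t pc;        /* full PC as tag: a match implies this is a branch */
        uint32_t target;    /* predicted destination */
        uint8_t  pred;      /* 2-bit counter: >= 2 means predict taken */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* At fetch time: on a match with the counter saying taken, redirect fetch
     * to the predicted target; otherwise just fetch the next sequential PC. */
    uint32_t next_fetch_pc(uint32_t pc) {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->pc == pc && e->pred >= 2)
            return e->target;
        return pc + 4;
    }

    /* When the branch resolves: allocate the entry on first sight,
     * otherwise update the target and the prediction bits. */
    void btb_update(uint32_t pc, uint32_t target, bool taken) {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (!e->valid || e->pc != pc) {
            e->valid = true; e->pc = pc; e->target = target;
            e->pred = taken ? 2 : 1;
            return;
        }
        e->target = target;
        if (taken) { if (e->pred < 3) e->pred++; }
        else       { if (e->pred > 0) e->pred--; }
    }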

38
BTB Performance
  • Two things can go wrong
  • didn't predict the branch (misfetch)
  • mispredicted a branch (mispredict)
  • Suppose a BTB hit rate of 85%, prediction
    accuracy of 90%, a misfetch penalty of 2 cycles,
    and a mispredict penalty of 10 cycles: what is
    the average branch penalty? (a worked calculation
    follows this list)
  • Can use both a BTB and a branch predictor
  • have no prediction bits in the BTB (why is that a
    good idea?)
  • presence of the PC in the BTB triggers a lookup
    in the branch predictor, which decides whether
    the branch will go to the destination address in
    the BTB.
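One way to work the penalty question above, under the simplifying assumptions that every branch that misses in the BTB pays the 2-cycle misfetch penalty and that 10% of the branches that do hit are mispredicted:

    average branch penalty = 0.15 × 2 + 0.85 × 0.10 × 10
                           = 0.30 + 0.85
                           = 1.15 cycles per branch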

39
What about indirect jumps/returns?
  • Branch predictor does really well with
    conditional jumps
  • BTB does really well with unconditional jumps
    (jump, jal, etc.)
  • Indirect jumps often jump to different
    destinations, even from the same instruction.
    Indirect jumps most often used for return
    instructions.
  • Returns are easily handled by a stack (see the
    sketch after this list).
  • jal --> push PC+4
  • return --> predict a jump to the address on top
    of the stack, then pop the stack
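A tiny C sketch of such a return address stack, assuming a fixed-depth hardware stack that simply wraps around when it overflows (the depth and the names are illustrative):

    #include <stdint.h>

    #define RAS_DEPTH 16

    static uint32_t ras[RAS_DEPTH];
    static unsigned ras_top;                 /* index of the next free slot */

    /* On a call (jal): push the return address, PC + 4. */
    void ras_push_call(uint32_t pc) {
        ras[ras_top % RAS_DEPTH] = pc + 4;
        ras_top++;
    }

    /* On a return: predict a jump to the address on top of the stack, then pop. */
    uint32_t ras_predict_return(void) {
        if (ras_top > 0)
            ras_top--;
        return ras[ras_top % RAS_DEPTH];
    }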

40
Branch Prediction
  • Latest branch predictors significantly more
    sophisticated, using more advanced correlating
    techniques, larger structures, and soon possibly
    using AI techniques.

41
Branch Prediction
  • Latest branch predictors significantly more
    sophisticated, using more advanced correlating
    techniques, larger structures, and soon possibly
    using AI techniques.
  • Presupposes what two pieces of information are
    available at fetch time?
  • Branch Target Buffer supplies this information.

42
Control Hazards -- Key Points
  • Control (or branch) hazards arise because we must
    fetch the next instruction before we know if we
    are branching or where we are branching.
  • Control hazards are detected in hardware.
  • We can reduce the impact of control hazards
    through
  • early detection of branch address and condition
  • branch prediction
  • branch delay slots

43
Modern Processors
  • Happy Thanksgiving