Instruction-Level Parallelism and Its Dynamic Exploitation

Provided by: ccNct
1
Instruction-Level Parallelism and Its Dynamic
Exploitation
2
Outline
  • Instruction-Level Parallelism Concepts and
    Challenges
  • Overcoming Data Hazards with Dynamic Scheduling
  • Dynamic Scheduling Examples and the Algorithm
  • Reducing Branch Penalties with Dynamic Hardware
    Prediction
  • High-Performance Instruction Delivery
  • Taking Advantage of More ILP with Multiple Issue
  • Hardware-Based Speculation
  • Studies of the Limitations of ILP
  • Limitations on ILP for Realizable Processors

3
Instruction-Level Parallelism Concepts and
Challenges
4
Introduction
  • Instruction-Level Parallelism (ILP): the potential
    execution overlap among instructions
  • Instructions are executed in parallel
  • Pipelining supports a limited form of ILP
  • This chapter introduces techniques to increase
    the amount of parallelism exploited among
    instructions
  • How to reduce the impact of data and control
    hazards
  • How to increase the ability of the processor to
    exploit parallelism
  • Pipeline CPI = Ideal pipeline CPI + Structural
    stalls + RAW stalls + WAR stalls + WAW stalls +
    Control stalls

5
Approaches To Exploiting ILP
  • Hardware approach: focus of this chapter
  • Dynamic: decisions made at run time
  • Dominates the desktop and server markets
  • Pentium III and IV, Athlon, MIPS R10000/12000,
    Sun UltraSPARC III, PowerPC 603, G3, and G4,
    Alpha 21264
  • Software approach: focus of the next chapter
  • Static: decisions made at compile time
  • Relies on compilers
  • Broader adoption in the embedded market
  • But also includes IA-64 and Intel's Itanium

6
ILP Methods
A combo of HW and SW/Compiler methods
7
ILP within a Basic Block
  • Basic block: the instructions between branch
    instructions
  • Instructions in a basic block are executed in
    sequence
  • Real code is a set of basic blocks connected by
    branches
  • Note: dynamic branch frequency is between 15%
    and 25%
  • So a basic block holds only about 4 to 7
    instructions
  • These may depend on each other (data dependence)
  • Therefore, there is probably little parallelism
    within a basic block
  • To obtain substantial performance enhancement:
    exploit ILP across multiple basic blocks
  • Easiest target is the loop
  • Exploit parallelism among iterations of a loop
    (loop-level parallelism)

8
Loop Level Parallelism (LLP)
  • Consider adding two 1000-element arrays
  • There is no dependence between data values
    produced in any iteration j and those needed in
    iteration j+n, for any j and n
  • Truly independent iterations
  • Independence means no stalls due to data hazards
  • Basic idea: convert LLP into ILP
  • Unroll the loop either statically by the compiler
    (next chapter) or dynamically by the hardware
    (this chapter)

x[1] = x[1] + y[1]
x[2] = x[2] + y[2]
...
x[1000] = x[1000] + y[1000]
for (i = 1; i <= 1000; i = i + 1) x[i] = x[i] + y[i];
9
Data Dependences and Hazards
10
Introduction
  • If two instructions are independent, then
  • They can execute simultaneously (in parallel) in
    a pipeline without stalling
  • Assuming no structural hazards
  • Their execution order can be swapped
  • Dependent instructions must be executed in order,
    or only partially overlapped in the pipeline
  • Why check for dependences?
  • To determine how much parallelism exists, and how
    that parallelism can be exploited
  • Types of dependences: data, name, and control
    dependences

11
Data Dependence Analysis
  • i is data dependent on j if i uses a result
    produced by j
  • Or i uses a result produced by k, and k depends
    on j (chain)
  • A dependence indicates a potential RAW hazard
  • Whether it actually causes a hazard and a stall
    depends on the pipeline organization
  • The possibility limits the performance
  • It fixes the order in which instructions must be
    executed
  • It sets a bound on how much parallelism can be
    exploited
  • Overcoming a data dependence
  • Maintain the dependence but avoid a hazard:
    schedule the code (HW, SW)
  • Eliminate the dependence by transforming the code
    (by compiler)

12
Data Dependence Example
  • Loop: L.D F0, 0(R1)
  • ADD.D F4, F0, F2
  • S.D F4, 0(R1)
  • DADDUI R1, R1, -8
  • BNE R1, R2, Loop

If two instructions are data dependent, they
cannot execute simultaneously or be completely
overlapped.
13
Data Dependence through Memory Location
  • Dependences that flow through memory locations
    are more difficult to detect
  • Addresses may refer to the same location but look
    different
  • 100(R4) and 20(R6) may be identical
  • The effective address of a load or store may
    change from one execution of the instruction to
    another
  • Two executions of the same instruction L.D F0,
    20(R4) may refer to different memory locations
  • Because the value of R4 may change between the
    two executions

14
Name Dependence
  • Occurs when 2 instructions use the same register
    name or memory location without data dependence
  • Let i precede j in program order
  • i is antidependent on j when j writes a register
    that i reads
  • Indicates a potential WAR hazard
  • i is output dependent on j if they both write to
    the same register
  • Indicates a potential WAW hazard
  • Not true data dependences: no value is
    transmitted between the instructions
  • Can execute simultaneously or be reordered if the
    name used in the instructions is changed so the
    instructions do not conflict

15
Name Dependence Example
  • L.D F0, 0(R1)
  • ADD.D F4,F0,F2
  • S.D F4, 0(R1)
  • L.D F0,-8(R1)
  • ADD.D F4,F0,F2

Output dependence
Anti-dependence
Register renaming
Renaming can be performed either by the compiler or
by hardware
16
Register Renaming and WAW/WAR
  • DIV.D F0, F2, F4
  • ADD.D F6, F0, F8
  • S.D F6, 0(R1)
  • SUB.D F8, F10, F14
  • MUL.D F6, F10, F8
  • WAW: ADD.D/MUL.D
  • WAR: ADD.D/SUB.D, S.D/MUL.D
  • RAW: DIV.D/ADD.D, ADD.D/S.D, SUB.D/MUL.D

Renaming result
  • DIV.D F0, F2, F4
  • ADD.D S, F0, F8
  • S.D S, 0(R1)
  • SUB.D T, F10, F14
  • MUL.D F6, F10, T
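
The hazard lists on this slide can be checked mechanically. The following is a minimal sketch, assuming each instruction is written as a hypothetical tuple (opcode, destination or None, list of sources); the function name `hazards` is made up for this illustration. It compares every pair of instructions in program order and ignores intervening writes, which is enough for this five-instruction example.

```python
def hazards(program):
    """Return a set of (kind, i, j) for instruction pairs i < j that share a register."""
    found = set()
    for j, (_, dst_j, src_j) in enumerate(program):
        for i in range(j):
            _, dst_i, src_i = program[i]
            if dst_i is not None and dst_i in src_j:
                found.add(("RAW", i, j))   # j reads what i wrote
            if dst_j is not None and dst_j in src_i:
                found.add(("WAR", i, j))   # j overwrites what i reads
            if dst_i is not None and dst_i == dst_j:
                found.add(("WAW", i, j))   # both write the same register
    return found

# The slide's code before renaming (S.D has no register destination).
original = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

# The same code after renaming F6 -> S and F8 -> T.
renamed = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "S",  ["F0", "F8"]),
    ("S.D",   None, ["S", "R1"]),
    ("SUB.D", "T",  ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "T"]),
]

def pairs(program, kind):
    """Sorted (i, j) index pairs of one hazard kind."""
    return sorted((i, j) for k, i, j in hazards(program) if k == kind)

print("before:", pairs(original, "RAW"), pairs(original, "WAR"), pairs(original, "WAW"))
print("after: ", pairs(renamed, "RAW"), pairs(renamed, "WAR"), pairs(renamed, "WAW"))
```

Under these assumptions the renamed sequence keeps the three RAW pairs and loses every WAR and WAW pair, which is the point of the slide.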
17
Control Dependence
if p1 { S1; }  if p2 { S2; }
  • Since branches are conditional
  • Some instructions will be executed and others
    will not
  • Instructions before the branch don't matter
  • Only possibility is between a branch and
    instructions which follow it
  • 2 obvious constraints to maintain control
    dependence
  • Instructions controlled by the branch cannot be
    moved before the branch (since it would then be
    uncontrolled)
  • An instruction not controlled by the branch
    cannot be moved after the branch (since it would
    then be controlled)
  • Note
  • Transitive control dependence is also a factor
  • In simple pipelines - order is preserved anyway
    so no big deal

18
Control Dependence (Cont.)
  • What's the big deal?
  • No data dependence, so move something before the
    branch
  • Trash the result if the branch goes the wrong way
  • Note: this only works when the result goes to a
    register which becomes dead (result never used)
    if the wrong way is taken
  • However, 2 important properties affect
    correctness
  • Exception behavior must remain intact
  • Sometimes this is relaxed but it probably should
    not be
  • Branches effectively set up conditional data flow
  • Data flow is definitely real, so if we do the
    move we had better make sure it does not change
    the data flow
  • So it can be done, but care must be taken
  • Enter HW and SW speculation and conditional
    instructions

19
Control Dependence (Cont.)
  • Control dependence is not the critical property
    that must be preserved
  • May execute instructions that should not have
    been executed, thereby violating the control
    dependence → OK as long as program correctness
    is unaffected
  • Example: wrong guess in a delayed branch (from
    target/fall-through)
  • Maintaining control and data dependences
    prevents raising new exceptions
  • DADDU R2, R3, R4
  • BEQZ R2, L1
  • LW R1, 0(R2)
  • L1:
  • No data dependence prevents us from interchanging
    BEQZ and LW; it is only the control dependence

May raise memory protection exception if we
interchange BEQZ and LW
20
Control Dependence (Cont.)
  • By preserving the control dependence of the OR on
    the branch, we prevent an illegal change to the
    data flow
  • DADDU R1, R2, R3
  • BEQZ R4, L1
  • DSUBU R1, R5, R6
  • L1: ...
  • OR R7, R1, R8

21
Control Dependence (Cont.)
  • If R4 were unused (dead) after skipnext and DSUBU
    could not generate an exception, we could move
    DSUBU before the branch, since the data flow
    cannot be affected
  • If the branch is taken, DSUBU will still execute
    but its result will be useless
  • DADDU R1, R2, R3
  • BEQZ R12, skipnext
  • DSUBU R4, R5, R6
  • DADDU R5, R4, R9
  • skipnext: OR R7, R8, R9

22
Overcoming Data Hazards with Dynamic Scheduling
23
Introduction
  • Approaches used to avoid data hazards in Appendix
    A and Chapter 4
  • Forwarding or bypassing: keeps a dependence from
    resulting in a hazard
  • Stall: stall the instruction that uses the
    result, and all successive instructions
  • Compiler (pipeline) scheduling: static
    scheduling
  • In-order instruction issue and execution
  • Instructions are issued in program order, and if
    an instruction is stalled in the pipeline, no
    later instructions can proceed
  • If there is a dependence between two closely
    spaced instructions in the pipeline, this will
    lead to a hazard and a stall will result

24
Dynamic Scheduling VS. Static Scheduling
  • Dynamic scheduling: avoid stalling when
    dependences are present
  • Static scheduling: minimize stalls by separating
    dependent instructions so that they will not lead
    to hazards

25
Dynamic Scheduling Idea
  • Dynamic scheduling: HW rearranges the
    instruction execution to avoid stalling when
    dependences, which could generate hazards, are
    present
  • Advantages
  • Enables handling some dependences unknown at
    compile time
  • Simplifies the compiler
  • Code compiled for one machine runs well on
    another
  • Approaches
  • Scoreboard (Appendix A)
  • Tomasulo approach (focus of this part)
  • Assume multiple instructions can be in execution
    at the same time (requires multiple FUs,
    pipelined FUs, or both)

26
Dynamic Scheduling
  • Dynamic instruction reordering
  • In-order issue
  • But allow out-of-order execution (and thus
    out-of-order completion)
  • Consider
  • DIV.D F0, F2, F4
  • ADD.D F10, F0, F8
  • SUB.D F12, F8, F14
  • DIV.D has a long latency (20 pipeline stages)
  • ADD.D has a data dependence on F0, SUB.D does not
  • Stalling ADD.D will stall SUB.D too
  • So swap them - the compiler might have done this,
    but so could the HW
  • Problem: may this raise new exceptions?
  • For now, let's ignore precise exceptions (Section
    3.7 and Appendix A)

Hazard?
27
Dynamic Scheduling (Cont.)
  • Key idea: allow instructions behind a stall to
    proceed
  • SUB.D can proceed even when ADD.D is stalled
  • Out-of-order execution divides the ID stage
  • Issue: decode instructions, check for structural
    hazards
  • Read operands: wait until no data hazards, then
    read operands
  • All instructions pass through the issue stage in
    order
  • But instructions can be stalled or bypass each
    other in the read-operands stage, and thus enter
    execution out of order

[Figure: pipeline diagram with labels IF, ID, Issue, EX, MEM, WB, IM, DM]
28
WAR and WAW Hazards May Arise with Dynamic Scheduling
  • More interesting code fragment
  • DIV.D F0, F2, F4
  • ADD.D F6, F0, F8
  • SUB.D F8, F10, F14
  • MUL.D F6, F10, F8
  • Note the following
  • ADD.D can't start until DIV.D completes
  • SUB.D does not need to wait, but it can't post
    its result to F8 until ADD.D reads F8; otherwise
    there is a WAR hazard
  • MUL.D does not need to wait, but it can't post
    its result to F6 until ADD.D writes F6; otherwise
    there is a WAW hazard

Data dependence
Anti-dependence
Output-dependence
Both WAW and WAR hazards can be solved by
Scoreboard (Appendix A) and Tomasulo
29
Tomasulo's Approach
  • Originally designed for the IBM 360/91, to overcome
  • Limited compiler scheduling (only 4
    double-precision FP registers)
  • Long memory access and FP delays
  • Goal: high performance without special compilers
  • Why study it? It led to the Alpha 21264, HP 8000,
    MIPS 10000, Pentium II, PowerPC 604, ...
  • Key ideas
  • Track data dependences to allow execution as soon
    as operands are available → minimize RAW hazards
  • Rename registers to avoid WAR and WAW hazards

30
Key Idea
  • Pipelined or multiple function units (FU)
  • Each FU has multiple reservation stations (RS)
  • Issue to reservation stations is in order
    (in-order issue)
  • An RS starts executing as soon as it has collected
    its source operands from the real registers (RR)
    - hence out-of-order execution
  • Reservation stations contain virtual registers
    (VR) that remove WAW- and WAR-induced stalls
  • An RS fetches operands from RR and stores them
    into VR
  • Since there can be more virtual registers than
    real registers, the technique can even eliminate
    hazards arising from name dependences that could
    not be eliminated by a compiler

31
Basic Structure of A Tomasulo-Based MIPS Processor
Virtual registers
32
Reservation Station Duties
  • Each RS holds an instruction that has been issued
    and is awaiting execution at a FU, and either the
    operand values or the RS names that will provide
    the operand values
  • RS fetches operands from CDB when they appear
  • When all operands are present, enable the
    associated functional unit to execute
  • Since values are not really written to registers
  • No WAW or WAR hazards are possible

33
Register Renaming in Tomasulo's Approach
  • Register renaming is provided by reservation
    stations (RS) and instruction issue logic
  • Each function unit has several reservation
    stations
  • An RS fetches and buffers an operand as soon as
    it is available
  • Eliminates the need to get the operand from a
    register (avoids WAR)
  • Pending instructions designate the RS that will
    provide their input
  • When successive writes to a register overlap in
    execution, only the last one is actually used to
    update the register (avoids WAW)
34
RS and Tomasulo's Approach
  • Hazard detection and execution control are
    distributed
  • Information held in the RS at each functional
    unit determines when an instruction can begin
    execution at that unit
  • Results are passed directly to functional units
    rather than through the registers
  • Essentially similar to bypass logic
  • Broadcast capability since they pass on CDB
    (common data bus)

35
Instruction Steps
  • Issue (note: in order, due to the queue structure)
  • Get instruction from the instruction queue
  • Issue if there is an empty RS or available buffer
    (loads, stores)
  • If the operands are in registers send them to the
    reservation station
  • Stall otherwise due to the structural hazard
  • Execute (may be out of order)
  • When all operands are available then execute
  • If not, then monitor CDB to grab desired operand
    when it is produced
  • Effectively deals with RAW hazards
  • Write Result (also may be out of order)
  • When result available write it to the CDB
  • From CDB it will go to a waiting RS and to the
    registers and store buffer
  • Note: the renaming model prevents WAW and WAR
    hazards as a side effect

36
Basic Structure of A Tomasulo-Based MIPS Processor
Virtual registers
37
Hazards Handling
  • Structural hazards checked at 2 points
  • At dispatch - a free RS of the appropriate type
    must be available
  • When operands are ready - multiple RS may compete
    for issue to the shared execution unit
  • Program order used as basis for the arbitration
  • RAW, WAR, WAW
  • To preserve exception behavior, instructions
    should not be allowed to execute if a branch that
    is earlier in program has not yet completed
  • Implemented by preventing any instruction from
    leaving the issue step, if there is a pending
    branch already in the pipeline

38
Virtual Registers
  • Tag field associated with data
  • Tag field is a virtual register ID
  • Corresponds to
  • Reservation station and load buffer names
  • Motivation: the 360's register weakness
  • Had only 4 double-precision FP registers
  • The 9 renamed virtual registers were a
    significant bonus

39
Tomasulo Structure
  • Each Reservation Station
  • Op - the operation
  • Qj, Qk - the RS that will produce each source
    operand
  • 0 → value is already available, or the operand
    is not needed
  • Vj, Vk - the values of the source operands
  • Only one of V or Q is valid for each operand
  • Busy - the RS and its corresponding functional
    unit are occupied
  • A - information for the memory address
    calculation of a load or store
  • Immediate → effective address
  • Register file and store buffers
  • Qi - the RS that produces the value to be stored
    in this register
  • Load and store buffers each require a busy field
  • Note
  • At most one of Qj or Vj is valid
  • Same for Qk or Vk
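
The fields above can be sketched as a small data structure. This is an illustration under stated assumptions, not the 360/91 design: `None` stands in for the slide's 0 in the Q fields, the register status table is a plain dictionary, and the helper names `issue` and `broadcast` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                    # tag used on the CDB, e.g. "Add1"
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid when the Q field is None
    vk: Optional[float] = None
    qj: Optional[str] = None     # producing RS, or None if the value is ready
    qk: Optional[str] = None

    def ready(self):
        """An instruction may start executing once both operands are present."""
        return self.busy and self.qj is None and self.qk is None

def issue(rs, op, rd, src1, src2, regs, reg_stat):
    """Issue step: copy ready operands, record the producer tag for the others."""
    rs.busy, rs.op = True, op
    rs.qj, rs.vj = (reg_stat[src1], None) if reg_stat.get(src1) else (None, regs[src1])
    rs.qk, rs.vk = (reg_stat[src2], None) if reg_stat.get(src2) else (None, regs[src2])
    reg_stat[rd] = rs.name       # Qi of rd: the last issued writer wins

def broadcast(tag, value, stations, regs, reg_stat):
    """Write-result step: the CDB delivers (tag, value) to every waiting consumer."""
    for s in stations:
        if s.qj == tag:
            s.qj, s.vj = None, value
        if s.qk == tag:
            s.qk, s.vk = None, value
    for reg, producer in list(reg_stat.items()):
        if producer == tag:
            regs[reg] = value
            reg_stat[reg] = None

regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 0.0, "F8": 8.0}
reg_stat = {}
mult1, add1 = ReservationStation("Mult1"), ReservationStation("Add1")
issue(mult1, "DIV.D", "F0", "F2", "F4", regs, reg_stat)   # DIV.D F0, F2, F4
issue(add1, "ADD.D", "F6", "F0", "F8", regs, reg_stat)    # ADD.D F6, F0, F8
waiting_on = add1.qj                                      # ADD.D waits on the tag Mult1
broadcast("Mult1", 0.5, [mult1, add1], regs, reg_stat)    # DIV.D result appears on the CDB
```

In this sketch ADD.D captures F8 at issue and names Mult1 as the source of F0, so it never reads F0 from the register file; that is how the RS doubles as a renamed register.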

40
Detailed Tomasulo Algorithm Control
[Table: control steps of the algorithm, not recovered]
Annotations: avoid RAW; the result of register Qi will come from RS r
41
Detailed Tomasulo Algorithm Control (Cont.)
Calculate effective address
Write to register
Broadcast to RS needing result
42
Tomasulo Example Cycle 0
LD is 1 CC, ADDD/SUBD is 2 CC, MULT is 10 CC, and
DIVD is 40 CC (execution stage)
43
Tomasulo Example Cycle 1
44
Tomasulo Example Cycle 2
45
Tomasulo Example Cycle 3
  • Note: register names are removed (renamed) in the
    reservation stations; MULT issued vs. scoreboard
  • Load1 completing; what is waiting for Load1?

46
Tomasulo Example Cycle 4
  • Load2 completing; what is waiting for it?

47
Tomasulo Example Cycle 5
48
Tomasulo Example Cycle 6
49
Tomasulo Example Cycle 7
  • Add1 completing; what is waiting for it?

50
Tomasulo Example Cycle 8
51
Tomasulo Example Cycle 9
52
Tomasulo Example Cycle 10
53
Tomasulo Example Cycle 11
54
Tomasulo Example Cycle 12
  • Note: all quick instructions have already completed

55
Tomasulo Example Cycle 13
56
Tomasulo Example Cycle 14
57
Tomasulo Example Cycle 15
  • Mult1 completing; what is waiting for it?

58
Tomasulo Example Cycle 16
  • Note: just waiting for the divide

59
Tomasulo Example Cycle 55
60
Tomasulo Example Cycle 56
  • Mult2 completing; what is waiting for it?

61
Tomasulo Example Cycle 57
  • Again, in-order issue, out-of-order execution,
    completion

62
Advantages of Tomasulo
  • Distribution of the hazard detection logic
  • Distributed RS and CDB
  • If multiple instructions are waiting on a single
    result, and each already has its other operand,
    then the instructions can be released
    simultaneously by the broadcast on the CDB
  • No waiting for the register bus in a centralized
    register file
  • Elimination of stalls for WAW and WAR
  • Rename registers using RS
  • Store operands into RS as soon as they are
    available
  • For a WAW hazard, the last write wins
  • Issue stage: RegisterStat[rd].Qi ← r (the last
    wins)

63
Tomasulo Drawbacks
  • Complexity
  • delays of 360/91, MIPS 10000, IBM 620?
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
  • Multiple CDBs → more FU logic for parallel
    associative stores

64
Tomasulo Loop Example
  • Loop: LD F0, 0(R1)
  • MULTD F4, F0, F2
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • BNEZ R1, Loop
  • Assume multiply takes 4 clocks
  • Assume the first load takes 8 clocks (cache
    miss?), the second load takes 4 clocks (hit)
  • To be clear, will show clocks for SUBI, BNEZ
  • In reality, the integer instructions run ahead
65
Loop Example Cycle 0
66
Loop Example Cycle 1
67
Loop Example Cycle 2
68
Loop Example Cycle 3
  • Note: MULT1 has no register names in its RS

69
Loop Example Cycle 4
70
Loop Example Cycle 5
71
Loop Example Cycle 6
  • Note: F0 never sees the Load1 result

72
Loop Example Cycle 7
  • Note: MULT2 has no register names in its RS

73
Loop Example Cycle 8
74
Loop Example Cycle 9
  • Load1 completing; what is waiting for it?

75
Loop Example Cycle 10
  • Load2 completing; what is waiting for it?

76
Loop Example Cycle 11
77
Loop Example Cycle 12
78
Loop Example Cycle 13
79
Loop Example Cycle 14
  • Mult1 completing; what is waiting for it?

80
Loop Example Cycle 15
  • Mult2 completing; what is waiting for it?

81
Loop Example Cycle 16
82
Loop Example Cycle 17
83
Loop Example Cycle 18
84
Loop Example Cycle 19
85
Loop Example Cycle 20
86
Loop Example Cycle 21
87
Tomasulo Summary
  • Reservation stations: renaming to a larger set of
    registers, plus buffering of source operands
  • Prevents registers from becoming a bottleneck
  • Avoids the WAR and WAW hazards of the scoreboard
  • Allows loop unrolling in HW
  • With one CDB, only one operation can use it in a
    single clock cycle
  • Not limited to basic blocks (the integer unit
    gets ahead, beyond branches)
  • Lasting contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • 360/91 descendants are the Pentium II, PowerPC
    604, MIPS R10000, HP-PA 8000, and Alpha 21264

88
Reducing Branch Penalties with Dynamic Hardware
Prediction
89
Dynamic Control Hazard Avoidance
  • Consider Effects of Increasing the ILP
  • Control dependencies rapidly become the limiting
    factor
  • They tend to not get optimized by the compiler
  • Higher branch frequencies result
  • Plus multiple issue (more than one instruction
    per clock) → more control instructions per
    second
  • Control stall penalties will go up as machines go
    faster
  • Amdahl's Law in action - again
  • Branch prediction helps if it can be done for a
    reasonable cost
  • Static: by the compiler (Appendix A)
  • Dynamic: by HW (this section)

90
Dynamic Branch Prediction
  • Processor attempts to resolve the outcome of a
    branch early, thus preventing control dependences
    from causing stalls
  • BP_Performance = f(accuracy, cost of
    misprediction)
  • Branch History Table (BHT)
  • Lower bits of the PC address index a table of
    1-bit values
  • No precise address check; just match the lower
    bits
  • Says whether or not branch taken last time

91
BHT Prediction
Useful only if the target address is known
before the condition is decided
Two branch instructions with the same lower
bits share one entry
92
Problem with the Simple BHT
Clear benefit: it's cheap and
understandable
  • Aliasing
  • All branches with the same index (lower) bits
    reference same BHT entry
  • Hence they mutually predict each other
  • No guarantee that a prediction is right. But it
    may not matter anyway
  • Avoidance
  • Make the table bigger - OK since it's only a
    single bit-vector
  • This is a common cache improvement strategy as
    well
  • Other cache strategies may also apply
  • Consider how this works for loops
  • Always mispredict twice for every loop
  • Once is unavoidable since the exit is always a
    surprise
  • However previous exit will always cause a
    mis-predict on the first try of every new loop
    entry

93
N-bit Predictors
Idea: improve on the loop entry problem
  • Use an n-bit saturating counter
  • 2-bit counter implies 4 states
  • Statistically 2 bits gets most of the advantage
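
The "mispredict twice for every loop" effect and the 2-bit fix can be reproduced with a short simulation. This is a sketch under assumptions not on the slide: a single counter with no aliasing, an initial state of strongly taken, and a made-up trace of a loop branch taken 9 times and then not taken, executed 3 times in a row.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one n-bit saturating counter on a branch trace."""
    top = (1 << bits) - 1            # largest counter value
    threshold = 1 << (bits - 1)      # counter >= threshold predicts taken
    counter = top                    # assumed initial state: strongly taken
    wrong = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        # Saturating update: move toward the actual outcome.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong

# Hypothetical trace: loop branch taken 9 times, then exits; loop entered 3 times.
trace = ([True] * 9 + [False]) * 3

one_bit = mispredictions(trace, bits=1)
two_bit = mispredictions(trace, bits=2)
print(one_bit, two_bit)
```

With this trace the 1-bit scheme misses every exit and every re-entry after the first, while the 2-bit counter misses only the exits.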

94
BHT Accuracy
4K-entry BPB with 2-bit entries: misprediction rates
on SPEC89
  • Mispredict because either
  • Wrong guess for that branch
  • Got the branch history of the wrong branch when
    indexing the table

95
BHT Accuracy vs. BHT Size
  • 4096-entry table: programs vary from 1%
    misprediction (nasa7, tomcatv) to 18% (eqntott),
    with spice at 9% and gcc at 12%
  • 4096 entries are about as good as an infinite
    table (in the Alpha 21164)

96
Improve Prediction Strategy By Correlating
Branches
  • Consider the worst case for the 2-bit predictor
  • if (aa == 2) then aa = 0
  • if (bb == 2) then bb = 0
  • if (aa != bb) then whatever
  • Single-level predictors can never get this case
  • Correlating or 2-level predictors
  • Correlation: what happened on the last branch
  • Note that the last correlator branch may not
    always be the same
  • Predictor: which way to go
  • 4 possibilities: which way the last one went
    chooses the prediction
  • (Last-taken, last-not-taken) x (predict-taken,
    predict-not-taken)

If the first 2 branches are not taken, then the
3rd will always be taken
97
The worst case for the 2-bit predictor
  • if (aa == 2)
  • aa = 0
  • if (bb == 2)
  • bb = 0
  • if (aa != bb)
  • DSUBUI R3, R1, 2
  • BNEZ R3, L1
  • DADD R1, R0, R0
  • L1: DSUBUI R3, R2, 2
  • BNEZ R3, L2
  • DADD R2, R0, R0
  • L2: DSUBU R3, R1, R2
  • BEQZ R3, L3

If the first 2 branches are untaken, then the 3rd
will always be taken
98
Correlating Branches
  • Hypothesis: recently executed branches are
    correlated; that is, the behavior of recently
    executed branches affects the prediction of the
    current branch
  • Idea: record the m most recently executed
    branches as taken or not taken, and use that
    pattern to select the proper branch history table
  • In general, an (m,n) predictor records the last m
    branches to select between 2^m history tables,
    each with n-bit counters
  • Old 2-bit BHT is then a (0,2) predictor

99
Example of Correlating Branch Predictors
  • if (d == 0)
  • d = 1
  • if (d == 1)
  • BNEZ R1, L1 ; branch b1 (d != 0)
  • DADDIU R1, R0, 1 ; d == 0, so d = 1
  • L1: DADDIU R3, R1, -1
  • BNEZ R3, L2 ; branch b2 (d != 1)
  • L2:

100
Example of Correlating Branch Predictors (Cont.)
101
Example of Correlating Branch Predictors (Cont.)
102
In general (m,n) BHT (prediction buffer)
  • p bits of buffer index: 2^p-entry BHT
  • Use the last m branches: global branch history
  • Use an n-bit predictor
  • Total bits for the (m, n) BHT prediction buffer:
    2^m × n × 2^p
  • 2^m banks of memory selected by the global branch
    history (which is just a shift register) - e.g. a
    column address
  • Use p bits of the branch address to select row
  • Get the n predictor bits in the entry to make the
    decision
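
The (m,n) organization can be sketched in a few lines. This is a minimal model under assumptions not taken from the slides: counters start at 0 (strongly not taken), the row index is simply the PC modulo 2^p, and the class and function names are hypothetical. It also computes the buffer size as 2^m × n × 2^p.

```python
def predictor_bits(m, n, p):
    """Total storage of an (m, n) predictor with 2**p entries per bank."""
    return (2 ** m) * n * (2 ** p)

class CorrelatingPredictor:
    def __init__(self, m, n, p):
        self.m, self.n, self.p = m, n, p
        self.history = 0   # m-bit global history shift register
        # 2**m banks (selected by history), each with 2**p n-bit counters.
        self.table = [[0] * (2 ** p) for _ in range(2 ** m)]

    def predict(self, pc):
        counter = self.table[self.history][pc % (2 ** self.p)]
        return counter >= 2 ** (self.n - 1)   # upper half of the range: taken

    def update(self, pc, taken):
        row = pc % (2 ** self.p)
        top = 2 ** self.n - 1
        c = self.table[self.history][row]
        self.table[self.history][row] = min(top, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the global history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)

def count_wrong(predictor, pc, outcomes):
    """Run one branch through the predictor and count mispredictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

alternating = [True, False] * 5   # a branch that flips direction every time
print(predictor_bits(2, 2, 5))
print(count_wrong(CorrelatingPredictor(1, 1, 0), 0, alternating))
print(count_wrong(CorrelatingPredictor(0, 1, 0), 0, alternating))
```

On the alternating trace the (0,1) predictor is wrong every time, while one bit of history is enough for the (1,1) predictor to miss only the first branch.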

103
(2,2) Predictor Implementation
4 banks each with 32 2-bit predictor entries
p = 5, m = 2, n = 2
2^5 = 32
104
Accuracy of Different Schemes
105
Tournament Predictors
  • Adaptively combine local and global predictors
  • Multiple predictors
  • One based on global information Results of
    recently executed m branches
  • One based on local information Results of past
    executions of the current branch instruction
  • Selector to choose which predictors to use
  • 2-bit saturating counter, incremented whenever
    the selected predictor is correct and the
    other predictor is incorrect, and
    decremented in the reverse situation
  • Advantage
  • Ability to select the right predictor for the
    right branch
  • Alpha 21264 Branch Predictor (p. 207, p. 209)
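
The selector can be sketched as a single saturating counter. The encoding here is an assumption made for illustration and differs in direction from the wording above: values 0 and 1 select predictor 1, values 2 and 3 select predictor 2, and the counter moves toward whichever predictor was right when the two disagree. The function names are hypothetical.

```python
def update_selector(counter, p1_correct, p2_correct):
    """2-bit selector update; the counter changes only when exactly one predictor is right."""
    if p1_correct and not p2_correct:
        return max(0, counter - 1)   # move toward predictor 1
    if p2_correct and not p1_correct:
        return min(3, counter + 1)   # move toward predictor 2
    return counter                   # both right or both wrong: no change

def chosen(counter):
    """Which predictor the selector currently trusts."""
    return 2 if counter >= 2 else 1

state = 0                                     # strongly prefer predictor 1
state = update_selector(state, False, True)   # predictor 2 alone was right
first = chosen(state)                         # still predictor 1
state = update_selector(state, False, True)   # predictor 2 alone was right again
second = chosen(state)                        # now predictor 2
```

It takes two consecutive wins by the other predictor to flip the choice from a strong state, the same hysteresis a 2-bit branch counter provides.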

106
State Transition Diagram for A Tournament
Predictor
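[Figure: four-state selector diagram. Two states use
predictor 1 and two use predictor 2. Arcs are labeled
predictor 1/predictor 2 (0 = incorrect, 1 = correct):
0/0 and 1/1 leave the state unchanged, 1/0 moves
toward predictor 1, and 0/1 moves toward predictor 2]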
107
Fraction of Predictions Coming from the Local
Predictor (SPEC89)
108
Misprediction Rate Comparison
109
Branch Target Buffer/Cache
  • To reduce the branch penalty to 0
  • Need to know what the address is by the end of IF
  • But the instruction is not even decoded yet
  • So use the instruction address rather than wait
    for decode
  • If prediction works then penalty goes to 0!
  • BTB Idea -- Cache to store taken branches (no
    need to store untaken)
  • Match tag is the instruction address → compare
    with the current PC
  • Data field is the predicted PC
  • May want to add predictor field
  • To avoid the mispredict twice on every loop
    phenomenon
  • Adds complexity since we now have to track
    untaken branches as well
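
A branch target buffer of this kind can be sketched as a small tagged table. This is an illustration under stated assumptions: it is direct-mapped, indexed by the PC modulo the number of entries, stores taken branches only, and drops an entry when its branch turns out not taken. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, entries):
        self.entries = entries
        self.table = {}   # index -> (tag, predicted_pc)

    def lookup(self, pc):
        """During IF: return the predicted next PC, or None on a miss."""
        slot = self.table.get(pc % self.entries)
        if slot is not None and slot[0] == pc:   # full tag match, unlike a BHT
            return slot[1]
        return None

    def update(self, pc, taken, target):
        """After the branch resolves: keep taken branches, evict untaken ones."""
        index = pc % self.entries
        if taken:
            self.table[index] = (pc, target)
        elif self.table.get(index, (None, None))[0] == pc:
            del self.table[index]

btb = BranchTargetBuffer(16)
miss = btb.lookup(0x40)          # first encounter: not in the buffer
btb.update(0x40, True, 0x100)    # branch at 0x40 was taken to 0x100
hit = btb.lookup(0x40)           # next fetch of 0x40 predicts 0x100
alias = btb.lookup(0x50)         # same index, different tag: miss
btb.update(0x40, False, 0)       # branch falls through this time
after_evict = btb.lookup(0x40)
```

The tag comparison is what separates this from the BHT above: an aliasing branch gets a miss instead of another branch's target.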

110
Branch Target Buffer/Cache-- Illustration
111
Changes in DLX to incorporate BTB
112
Penalties Using this Approach for MIPS/DLX
  • Note
  • Predict wrong: 1 CC to update the BTB + 1 CC to
    restart fetching
  • Not found and taken: 2 CC to update the BTB
  • Note
  • For complex pipeline design, the penalties may be
    higher

113
Branch Penalty CPI
  • Prediction accuracy is 90%
  • Hit rate in the buffer is 90%
  • Taken branch frequency is 60%
  • Branch_penalty = buffer_hit_rate ×
    incorrect_prediction_rate × 2
    + (1 − buffer_hit_rate) × taken_branch_frequency × 2
    = (0.9 × 0.1 × 2) + (0.1 × 0.6 × 2)
    = 0.18 + 0.12 = 0.30 cycles per branch
  • Branch penalty for delayed branches is about 0.5
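
The arithmetic above is easy to re-run with other rates. A minimal sketch, assuming the 2-cycle penalty from the previous slide applies to both cases; the function and parameter names are made up for this example.

```python
def branch_penalty(hit_rate, accuracy, taken_freq, penalty_cycles=2):
    """Average penalty cycles per branch for a BTB with the given rates."""
    # Case 1: found in the buffer but predicted wrong.
    mispredicted_hit = hit_rate * (1 - accuracy) * penalty_cycles
    # Case 2: not in the buffer and the branch is taken.
    missed_taken = (1 - hit_rate) * taken_freq * penalty_cycles
    return mispredicted_hit + missed_taken

slide_case = branch_penalty(hit_rate=0.9, accuracy=0.9, taken_freq=0.6)
print(round(slide_case, 2))
```

With the slide's numbers the two terms are 0.18 and 0.12, so the buffer beats the roughly 0.5-cycle penalty quoted for delayed branches.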

114
Return Address Predictor
  • Indirect jumps: jumps whose destination address
    varies at run time
  • Indirect procedure call, select or case,
    procedure return
  • SPEC89 benchmarks: 85% of indirect jumps are
    procedure returns
  • Accuracy of the BTB for procedure returns is low
  • If a procedure is called from many places, and
    the calls from one place are not clustered in
    time
  • Use a small buffer of return addresses operating
    as a stack
  • Cache the most recent return addresses
  • Push a return address at a call, and pop one off
    at a return
  • If the cache is sufficiently large (max call
    depth) → perfect prediction
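
The push-at-call, pop-at-return behavior can be sketched as a bounded stack. This is an illustration only: dropping the oldest entry on overflow is an assumption (the slides do not specify an overflow policy), and the class and method names are hypothetical.

```python
class ReturnAddressStack:
    def __init__(self, size):
        self.size = size
        self.entries = []

    def call(self, return_addr):
        """Push the return address when a call is fetched."""
        if len(self.entries) == self.size:
            self.entries.pop(0)   # overflow: discard the oldest entry
        self.entries.append(return_addr)

    def predict_return(self):
        """Pop the predicted target when a return is fetched (None if empty)."""
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack(size=2)
for addr in (100, 200, 300):      # three nested calls into a 2-entry stack
    ras.call(addr)
predictions = [ras.predict_return() for _ in range(3)]
print(predictions)
```

The two innermost returns are predicted correctly and the outermost one is lost, which is why the buffer only predicts perfectly when it is at least as deep as the maximum call depth.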

115
Dynamic Branch Prediction Summary
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are
    correlated with the next branch
  • Branch Target Buffer: includes branch address
    prediction
  • Reduce the penalty further by fetching
    instructions from both the predicted and
    unpredicted directions
  • Requires dual-ported memory, interleaved cache →
    HW cost
  • Caching addresses or instructions from multiple
    paths in the BTB

116
3.6 Taking Advantage of More ILP with Multiple
Issue
  • Pipeline CPI = Ideal pipeline CPI + Structural
    stalls + RAW stalls + WAR stalls + WAW stalls +
    Control stalls

117
Getting CPI < 1: Issuing Multiple
Instructions/Cycle
  • Superscalar
  • Issue varying numbers of instructions per clock
  • Constrained by hazard style issues
  • Scheduling
  • Static - by the compiler
  • Dynamic - hardware support for some form of
    Tomasulo
  • VLIW (very long instruction word)
  • Issue a fixed number of instructions formatted
    as
  • One large instruction or
  • A fixed instruction packet with the parallelism
    among instructions explicitly indicated by
    instruction
  • Also known as EPIC explicitly parallel
    instruction computers
  • Scheduling mostly static

[Figure: instruction slots Int/Br, Int/Ld-St, FP +/-, FP mul/div]
118
Five Approaches in use for Multiple-Issue
Processors
119
Statically Scheduled Superscalar Processors
  • HW might issue 0 to 8 instructions in a clock
    cycle
  • Instructions issue in program order
  • Pipeline hazards are checked for at issue time
  • Among instructions being issued in a given clock
    cycle
  • Among the issuing instructions and all those
    still in execution
  • If a data or structural hazard occurs for one
    instruction, only the instructions preceding that
    one in the instruction sequence will be issued
    (dynamic issue)
  • Complex issue stage
  • Split and pipelined → but results in higher
    branch penalties
  • Instruction issue is likely to be one limitation
    on the clock rate of superscalar processors

120
Superscalar 2-issue MIPS
  • Very similar to the HP 7100
  • Require fetching and decoding 64 bits of
    instructions
  • Which instructions
  • 1 integer: load, store, branch, or integer ALU
    operation
  • 1 float: FP operation
  • Why issue one integer and one FP operation?
  • Eliminates most hazard possibilities → simplifies
    the logic
  • Integer and FP register sets are different
  • Integer and FP FUs are different
  • Only difficulty: when the integer instruction is
    an FP load, store, or move
  • Need an additional read/write port on the FP
    registers
  • May create RAW hazard

121
Superscalar 2-issue MIPS (Cont.)
  Type              Pipe stages
  Int. instruction  IF  ID  EX  MEM WB
  FP instruction    IF  ID  EX  MEM WB
  Int. instruction      IF  ID  EX  MEM WB
  FP instruction        IF  ID  EX  MEM WB
  Int. instruction          IF  ID  EX  MEM WB
  FP instruction            IF  ID  EX  MEM WB
  • Instruction placement is not restricted in modern
    processors
  • 1-cycle load delay expands to 3 instructions in
    SS
  • The instruction in the right half cannot use it,
    nor can the instructions in the next slot
  • Must have pipelined FP FUs or multiple
    independent FP FUs

122
Consider adding a scalar to a vector
  • for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)     ; F0 = vector element
      ADD.D  F4, F0, F2    ; add scalar from F2
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, -8    ; decrement pointer 8B (DW)
      BNE    R1, R2, Loop  ; branch R1 != R2
Assume 8(R2) is the last element to operate on
123
Unscheduled Loop
124
Unrolled Loop that Minimizes Stalls for Scalar
  • 1 Loop: L.D F0, 0(R1)
  • 2 L.D F6, -8(R1)
  • 3 L.D F10, -16(R1)
  • 4 L.D F14, -24(R1)
  • 5 ADD.D F4, F0, F2
  • 6 ADD.D F8, F6, F2
  • 7 ADD.D F12, F10, F2
  • 8 ADD.D F16, F14, F2
  • 9 S.D F4, 0(R1)
  • 10 S.D F8, -8(R1)
  • 11 DADDUI R1, R1, -32
  • 12 S.D F12, 16(R1)
  • 13 BNE R1, R2, Loop
  • 14 S.D F16, 8(R1) ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration
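
As a rough high-level analogue of the unrolled schedule, the Python sketch below performs four loads, then four adds, then four stores per loop trip. The function names are hypothetical, and the sketch assumes the vector length is a multiple of 4, as the unrolled MIPS code implicitly does.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element per iteration, walking downward."""
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s
        i -= 1
    return out


def add_scalar_unrolled4(x, s):
    """Loop unrolled 4 times; assumes len(x) is a multiple of 4."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        # Four independent loads (distinct temporaries, like F0/F6/F10/F14)
        a, b, c, d = out[i], out[i - 1], out[i - 2], out[i - 3]
        # Four independent adds (like F4/F8/F12/F16)
        a, b, c, d = a + s, b + s, c + s, d + s
        # Four stores, then a single pointer update and branch per trip
        out[i], out[i - 1], out[i - 2], out[i - 3] = a, b, c, d
        i -= 4
    return out
```

Using distinct temporaries for each copy of the loop body is what removes the name dependences and lets the loads, adds, and stores be grouped.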
125
Unrolled Loop for SuperScalar (5 times)
1 Loop: L.D F0, 0(R1)
2 L.D F6, -8(R1)
3 L.D F10, -16(R1)
4 ADD.D F4, F0, F2
5 L.D F14, -24(R1)
6 ADD.D F8, F6, F2
7 L.D F18, -32(R1)
...
126
Loop Unrolling in Superscalar
Unrolled 5 times to avoid delays
  Integer instruction       FP instruction       Clock cycle
  Loop: L.D F0, 0(R1)                            1
        L.D F6, -8(R1)                           2
        L.D F10, -16(R1)    ADD.D F4, F0, F2     3
        L.D F14, -24(R1)    ADD.D F8, F6, F2     4
        L.D F18, -32(R1)    ADD.D F12, F10, F2   5
        S.D F4, 0(R1)       ADD.D F16, F14, F2   6
        S.D F8, -8(R1)      ADD.D F20, F18, F2   7
        S.D F12, -16(R1)                         8
        S.D F16, -24(R1)                         9
        DADDUI R1, R1, -40                       10
        BNE R1, R2, Loop                         11
        S.D F20, 8(R1)                           12   ; 8 = -32 + 40, since R1 was already decremented

12 clocks, or 2.4 clocks per iteration
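
The per-iteration figures on the two schedules follow from dividing total clock cycles by the unroll factor. The short sketch below reproduces that arithmetic; the helper name `cycles_per_iteration` is made up for illustration.

```python
def cycles_per_iteration(total_cycles, unroll_factor):
    """Clock cycles spent per original loop iteration."""
    return total_cycles / unroll_factor


scalar = cycles_per_iteration(14, 4)        # scalar pipeline, unrolled 4 times
superscalar = cycles_per_iteration(12, 5)   # 2-issue superscalar, unrolled 5 times
speedup = scalar / superscalar

print(f"scalar: {scalar} cycles/iteration")
print(f"superscalar: {superscalar} cycles/iteration")
print(f"speedup: {speedup:.2f}x")
```

The ratio comes out a little under 1.5, well short of the factor of 2 that a 2-issue machine could reach in the ideal case.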
127
Seem Simple?
  • Registers
  • Each pipe has its own set
  • Due to separation of FP and GP registers
  • Also inherently separates data dependencies into
    2 classes
  • Exception is LDD or LDF
  • Effective-address (EA) calculation is an integer operation
  • Destination register, however, is an FP register
  • FP pipe has longer latency
  • Exacerbated by operation latency differences
  • mult 6 cycles, divide 24 cycles for example
  • Result is that completion is out of order
  • Complicates hazard control within the FP
    execution pipe
  • Pipeline FP ALU or use multiple FP ALUs

128
Problems So Far
  • Look at the opcodes
  • See if the pair is an appropriate issue pair
  • Some integer operations are a problem
  • FP register loads/stores - since other
    instruction may be dependent
  • A stall will result - options?
  • Force FP loads, stores or moves to issue by
    themselves
  • Safe but suboptimal since the other instruction
    may still be independent
  • OR add more ports to the FP register file
  • Such as separate read and write ports
  • Still must stall the 2nd instruction if it is
    dependent
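
A minimal sketch of the pairing check described above, assuming a 2-issue machine with one integer slot and one FP slot. The opcode set and the `can_dual_issue` function are illustrative assumptions, not taken from the slides, and only the FP-load RAW case is modeled.

```python
FP_OPS = {"ADD.D", "SUB.D", "MUL.D", "DIV.D"}   # assumed FP-pipe opcodes


def can_dual_issue(int_ins, fp_ins):
    """Decide whether two adjacent instructions may issue together.

    Each instruction is an (opcode, dest, sources) tuple. int_ins comes
    first in program order and goes to the integer slot; fp_ins goes to
    the FP slot.
    """
    op_i, dest_i, _ = int_ins
    op_f, _, srcs_f = fp_ins
    # Look at the opcodes: is this an appropriate issue pair?
    if op_i in FP_OPS or op_f not in FP_OPS:
        return False
    # An FP load writes an FP register; if the FP operation reads that
    # register, the pair has a RAW hazard and the second must stall.
    if op_i == "L.D" and dest_i in srcs_f:
        return False
    return True
```

This corresponds to the second option on the slide: FP loads may pair with an FP operation (extra register-file ports assumed), but a dependent second instruction is still held back.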

129
Other Issues
  • Hazard detection
  • Similar to the normal pipeline model, but needs a
    larger set of bypass paths (twice as many
    instructions in the pipeline)
  • Load-use delay
  • Assume 1 cycle → now covers 3 instruction slots
  • Branch delay
  • Should branches be issued by themselves?
  • The 1-instruction branch delay now holds 3
    instructions as well
  • Instruction scheduling by compiler
  • Mandatory for issuing independent operations in
    SS
  • Increasingly important as issue width goes up
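
The "3 instruction slots" figure can be generalized with a small formula. The function below is an assumption-labeled sketch (the name and the `position` parameter are hypothetical): the rest of the issuing packet plus every slot of the following delay cycles cannot use the result.

```python
def affected_slots(issue_width, delay_cycles, position=0):
    """Instruction slots that cannot use a delayed result.

    position is the 0-based slot of the producing instruction within
    its issue packet (0 = first slot).
    """
    rest_of_packet = issue_width - 1 - position
    following_packets = delay_cycles * issue_width
    return rest_of_packet + following_packets


print(affected_slots(issue_width=1, delay_cycles=1))  # scalar pipeline
print(affected_slots(issue_width=2, delay_cycles=1))  # 2-issue superscalar
```

With a load in the first slot of a 2-issue packet and a 1-cycle delay, the formula gives 1 + 2 = 3 slots, matching the slide.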

130
Dynamic Scheduling In SuperScalar
  • Use Tomasulo Algorithm
  • Two arbitrary instructions per clock issue and
    let RS sort it out
  • But still can't issue a dependent pair
  • Two examples: pp. 221-224
  • How to issue multiple arbitrary instructions per
    clock?
  • Run the issue step in half a clock cycle (e.g.,
    pipelined)
  • Build the logic necessary to handle two
    instructions at once, including any possible
    dependences between the instructions
  • Modern SS processors that issue four or more
    instructions per clock often include both
    approaches

131
Dynamic Scheduling in Superscalar (Cont.)
  • Only FP loads might cause dependency between
    integer and FP issue
  • Replace load reservation station with a load
    queue
  • Operands must be read in the order they are
    fetched
  • Load checks addresses in Store Queue to avoid RAW
    violation
  • Store checks addresses in Load Queue to avoid
    WAR, WAW
  • Called decoupled architecture
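
The address checks between the load and store queues can be sketched as follows. This is a simplified illustration: queue entries are hypothetical `(sequence_number, address)` pairs, all addresses are assumed to be already computed, and the store queue is also consulted for WAW, which the slide does not spell out.

```python
def load_may_access(load_seq, load_addr, store_queue):
    """A load must wait if an earlier, still-pending store writes the same
    address (RAW through memory)."""
    return not any(seq < load_seq and addr == load_addr
                   for seq, addr in store_queue)


def store_may_access(store_seq, store_addr, load_queue, store_queue):
    """A store must wait for earlier pending loads (WAR) and earlier pending
    stores (WAW) to the same address."""
    war = any(seq < store_seq and addr == store_addr
              for seq, addr in load_queue)
    waw = any(seq < store_seq and addr == store_addr
              for seq, addr in store_queue)
    return not (war or waw)
```

When the addresses differ, a later load is free to access memory ahead of an earlier store.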

132
Example
  • Can issue two arbitrary operations per clock
  • One integer FU for ALU operation and
    EA-calculation
  • A separate pipelined FP FU
  • One memory unit, 2 CDBs
  • No delayed branch, with perfect branch prediction
  • Fetch and issue as if the branch predictions are
    always correct
  • Latency between a source instruction and an
    instruction consuming the result, given the
    presence of the Write Result stage
  • 1 CC for integer ALU operations
  • 2 CC for loads
  • 3 CC for FP add

133
Note
  • The WR stage does not apply to either stores or
    branches
  • For L.D and S.D, the execution cycle is EA
    calculation
  • For branches, the execution cycle shows when the
    branch condition can be evaluated and the
    prediction checked
  • Any instruction following a branch cannot start
    execution until after the branch condition has
    been evaluated
  • If two instructions could use the same FU at the
    same point (structural hazard), priority is given
    to the older instruction

134
Consider adding a scalar to a vector
  • for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)     ; F0 = vector element
      ADD.D  F4, F0, F2    ; add scalar from F2
      S.D    F4, 0(R1)     ; store result
      DADDIU R1, R1, -8    ; decrement pointer 8B (DW)
      BNE    R1, R2, Loop  ; branch R1 != R2
135
Execution Timing
136
Execution Timing (Cont.)
137
Example Result
  • Result
  • IPC issued = 5/3 = 1.67; instruction execution
    rate = 15/16 = 0.94
  • Only one load, store, and integer ALU operation
    can execute
  • Load of the next iteration performs its memory
    access before the store of the current iteration
  • Only a single CDB is actually required
  • Integer operations become the bottleneck
  • Many integer operations, but only one integer ALU
  • One stall cycle each loop iteration due to a
    branch hazard

138
Another Example Execution Timing
Separate integer FU for EA calculation and ALU
operations
139
Execution Timing (Cont.)
140
Note
  • Result
  • IPC issued = 5/3 = 1.67; instruction execution
    rate = 15/11 = 1.36
  • A second CDB is needed
  • This example has a higher instruction execution
    rate but lower efficiency as measured by the
    utilization of FU
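
The rates quoted for the two configurations are plain instruction-per-cycle ratios. The sketch below reproduces them; the variable names are illustrative only.

```python
def per_cycle(instructions, cycles):
    """Instructions per clock cycle."""
    return instructions / cycles


issue_rate = per_cycle(5, 3)            # 5 instructions issued every 3 cycles
shared_fu_rate = per_cycle(15, 16)      # one integer FU for ALU and EA work
separate_fu_rate = per_cycle(15, 11)    # separate FUs for ALU and EA work

print(f"issue rate: {issue_rate:.2f}")
print(f"execution rate, shared integer FU: {shared_fu_rate:.2f}")
print(f"execution rate, separate FUs: {separate_fu_rate:.2f}")
print(f"gain from the extra FU: {separate_fu_rate / shared_fu_rate:.2f}x")
```

In both cases the execution rate stays below the 1.67 issue rate, so instructions are issued faster than they complete.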

141
Limitations on Multiple Issue
  • How much ILP can be found in the application:
    the fundamental problem
  • Requires deep unrolling - hence big focus on
    loops
  • Compiler complexity goes way up
  • Deep unrolling needs lots of registers
  • Increased HW cost
  • Increased ports for register files
  • Cost of scoreboarding (e.g. Tomasulo data
    structure) and forwarding paths
  • Memory bandwidth requirement goes up
  • Most have gone with separate I and D ports
    already
  • Newest approaches are to go for multiple D ports
    as well - big time expense!! (PA- 8000)
  • Branch prediction by HW is an absolute must → HW
    speculation (Sect. 3.7)

142
3.7 Hardware-Based Speculation
143
Overview
  • Overcome control dependence by speculating on the
    outcome of branches and executing the program as
    if our guesses were correct
  • Fetch, issue, and execute instructions
  • Need mechanisms to handle the situation when the
    speculation is incorrect
  • Dynamic scheduling alone only fetches and issues such
    instructions

144
Key Ideas
  • Dynamic branch prediction to choose which
    instructions to execute
  • Speculation to allow the speculated blocks to
    execute before the control dependences are
    resolved
  • And undo the effects of an incorrectly speculated
    sequence
  • Dynamic scheduling to deal with the scheduling of
    different combinations of basic blocks (Tomasulo
    style approach)

145
HW Speculation Approach
  • Issue → execute → write result → commit
  • Commit is the point where the operation is no
    longer speculative
  • Allow out of order execution
  • Require in-order commit
  • Prevent speculative instructions from performing
    destructive state changes (e.g. memory write or
    register write)
  • Collect pre-commit instructions in a reorder
    buffer (ROB)
  • Holds completed but not committed instructions
  • Effectively contains a set of virtual registers
    to store the result of speculative instructions
    until they are no longer speculative
  • Similar to a reservation station → and becomes a
    bypass source

146
The Speculative MIPS
Replace store buffer
147
The Speculative MIPS (Cont.)
  • Need HW buffer for results of uncommitted
    instructions: the reorder buffer (ROB)
  • 4 fields: instruction type, destination field,
    value field, ready field
  • ROB is a source of operands → more registers, like
    RS
  • ROB supplies operands in the interval between
    completion of instruction execution and
    instruction commit
  • Use ROB number instead of RS to indicate the
    source of operands when execution completes (but
    not committed)
  • Once instruction commits, result is put into
    register
  • As a result, it's easy to undo speculated
    instructions on mispredicted branches or on
    exceptions

148
ROB Fields
  • Instruction type: branch, store, or register
    operation
  • Destination field
  • Unused for branches
  • Memory address for stores
  • Register number for load and ALU operations
    (register operations)
  • Value: holds the value of the instruction result
    until commit
  • Ready: indicates whether the instruction has completed
    execution
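
The four ROB fields map naturally onto a small record type. The dataclass below is a sketch for illustration; the class and field names are assumptions, not the naming used in any particular processor or simulator.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ROBEntry:
    # "branch", "store", or "register" (load and ALU operations)
    instr_type: str
    # Register name for register operations, memory address for stores,
    # None for branches
    dest: Optional[Union[str, int]] = None
    # Result held here until the instruction commits
    value: Optional[float] = None
    # True once the instruction has completed execution
    ready: bool = False


entry = ROBEntry("register", dest="F0")
entry.value, entry.ready = 7.5, True   # filled in at Write Result
```

An entry is allocated at issue with `ready` false, filled in when the result is written, and released at commit.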

149
Steps in Speculative Execution
  • Issue (or dispatch)
  • Get instruction from the instruction queue
  • In-order issue if an RS and a ROB slot are
    available; otherwise, stall
  • Send operands to RS if they are in register or
    ROB
  • Update Tomasulo DS and ROB
  • The ROB no. allocated for the result is sent to
    RS, so that the number can be used to tag the
    result when it is placed on CDB
  • Execute
  • RS waits and grabs results off the CDB if necessary
  • When all operands are there, execution happens
  • Write Result
  • Result posted to ROB via the CDB
  • Waiting reservation stations can grab it as well

150
Steps in Speculative Execution (Cont.)
  • Commit (or graduate): instruction reaches the
    ROB head
  • Normal commit: when the instruction reaches the ROB
    head and its result is present in the buffer
  • Update the register and remove the instruction
    from ROB
  • Store: update memory and remove the instruction
    from ROB
  • Branch with incorrect prediction: wrong
    speculation
  • Flush ROB and the related FP OP queue (RS)
  • Restart at the correct successor of the branch
  • Remove the instruction from ROB
  • Branch with correct prediction: finish the
    branch
  • Remove the instruction from ROB
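
The commit cases above can be sketched with a queue standing in for the ROB. The entry layout (dictionaries with `type`, `dest`, `value`, `ready`, and an optional `mispredicted` flag) is an assumption made for this illustration.

```python
from collections import deque


def commit_one(rob, regs, memory):
    """Try to commit the instruction at the ROB head; return what happened."""
    if not rob or not rob[0]["ready"]:
        return "stall"                    # in-order commit: wait for the head
    head = rob.popleft()                  # remove the instruction from the ROB
    if head["type"] == "store":
        memory[head["dest"]] = head["value"]      # memory updated only now
        return "store"
    if head["type"] == "branch":
        if head.get("mispredicted", False):
            rob.clear()                   # discard all speculated instructions
            return "flush"                # fetch restarts at the correct successor
        return "branch"
    regs[head["dest"]] = head["value"]    # normal commit: update the register
    return "register"
```

Because registers and memory change only inside `commit_one`, flushing the queue is enough to undo everything that followed a mispredicted branch.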

151
Example
  • The same example as Tomasulo without speculation.
    Show the status tables when MUL.D is ready to go
    to commit
  • L.D F6, 34(R2)
  • L.D F2, 45(R3)
  • MUL.D F0, F2, F4
  • SUB.D F8, F6, F2
  • DIV.D F10, F0, F6
  • ADD.D F6, F8, F2
  • Modified status tables
  • Qj and Qk fields, and register status fields use
    ROB (instead of RS)
  • Add Dest field to RS (ROB to put the operation
    result)

152
Figure 3.30
153
Example Result
  • Tomasulo without speculation
  • SUB.D and ADD.D have completed (clock cycle 16,
    slide 58)
  • Tomasulo with speculation
  • No instruction after the earliest uncompleted
    instruction (MUL.D) is allowed to complete
  • In-order commit
  • Implication: ROB with in-order instruction
    commit provides precise exceptions
  • Precise exceptions: exceptions are handled in
    the instruction order

154
Loop Example
  • Loop L.D F0, 0(R1)
  • MUL.D F4, F0, F2
  • S.D F4, 0(R1)
  • DADDUI R1,R1, -8
  • BNE R1, R2, Loop
  • Assume we have issued all the instructions in the
    loop twice
  • Assume L.D and MUL.D from the first iteration
    have committed and all others have completed
    execution

155
Figure 3.31
156
Loop Example Observation
  • Suppose the first BNE is not taken → flush the ROB
    and begin fetching instructions from the other path

157
Other Issues
  • Performance is more sensitive to
    branch-prediction
  • Impact of a mis-prediction will be higher
  • Prediction accuracy, mis-prediction detection,
    and mis-prediction recovery increase in
    importance
  • Precise exception
  • Handled by not recognizing the exception until it
    is ready to commit
  • If a speculated instruction raises an exception,
    the exception is recorded in the ROB
  • Mispredicted branch → the exception is flushed as
    well
  • If the instruction reaches the ROB head → take
    the exception

158
Figure 3.32
159
(No Transcript)
160
Multiple Issue with Speculation
  • Process multiple instructions per clock,
    assigning RS and ROB to the instructions
  • To maintain throughput of greater than one
    instruction per cycle, must handle multiple
    instruction commits per clock
  • Speculation helps significantly when a branch is
    a key potential performance limitation
  • Speculation can be advantageous when there are
    data-dependent branches, which otherwise would
    limit performance
  • Depends on accurate branch prediction → incorrect
    speculation will typically harm performance

161
Example
  • Assume separate integer FUs for ALU operations,
    effective address calculation, and branch
    condition evaluation
  • Assume up to 2 instructions of any type can commit
    per clock
  • Loop LD R2, 0(R1)
  • DADDIU R2, R2, 1
  • SD R2, 0(R1)
  • DADDIU R1, R1, 4
  • BNE R2, R3, LOOP

162
No Speculation
Figures 3.33, 3.34
163
Speculation
164
Example Result
  • Without speculation
  • L.D following BNE cannot start execution earlier
    → must wait until branch outcome is determined
  • Completion rate is falling behind the issue rate
    rapidly, stall when a few more iterations are
    issued
  • With speculation
  • L.D following BNE can start execution early
    because it is speculative

165
3.8 Studies of The Limitations of ILP
166
ILP Studies
  • Perfect Hardware model - in the ideal infinite
    cost case
  • Rename as much as you need
  • Implies infinite virtual registers
  • Hence - complete WAW or WAR insensitivity
  • Branch prediction is perfect
  • This will never happen in reality of course
  • Jump prediction (even computed such as return)
    are also perfect
  • Similarly unreal
  • Perfect memory disambiguation
  • Almost perfect is not too hard in practice
  • Can issue an unlimited number of instructions at
    once; no restriction on types of instructions issued
    (unlimited FUs)
  • One-cycle latency

167
Let's Look at a Real Machine
  • Alpha 21264: one of the most advanced
    superscalar processors announced to date
  • Issues up to four instructions per clock, and
    initiates execution on up to six
  • At most 2 memory references, among other
    restrictions
  • Support a large set of renaming registers (41
    integer and 41 FP)
  • Allow up to 80 instructions in execution
  • Multicycle latencies
  • Tournament-style branch predictor

168
How to Measure
  • A set of programs were compiled and optimized
    with the standard MIPS optimizing compilers
  • Execute and produce a trace of the instruction
    and data references
  • Perfect branch prediction and perfect alias
    analysis are easy to do
  • Every instruction in the trace is then scheduled
    as early as possible, limited only by the data
    dependence
  • Including moving across branches
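
The measurement method can be approximated with a short trace scheduler. This is a sketch under the perfect-machine assumptions stated earlier (unlimited renaming, perfect prediction, one-cycle latency); the trace format `(dest, sources)` and the function name are hypothetical.

```python
def ideal_schedule(trace):
    """Schedule each traced instruction as early as its true data
    dependences allow. Returns (issue cycle per instruction, average ILP).

    trace: list of (dest, sources) in program order; dest may be None.
    """
    ready_at = {}   # register -> cycle at which its newest value is available
    cycles = []
    for dest, srcs in trace:
        # Only RAW dependences constrain the start cycle; with unlimited
        # renaming, WAW and WAR hazards impose no ordering.
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        cycles.append(start)
        if dest is not None:
            ready_at[dest] = start + 1   # one-cycle latency
    total = max(cycles) + 1 if cycles else 0
    ilp = len(trace) / total if total else 0.0
    return cycles, ilp


# One iteration of the vector-add loop: L.D, ADD.D, S.D, DADDUI, BNE
loop = [("F0", ("R1",)), ("F4", ("F0", "F2")), (None, ("F4", "R1")),
        ("R1", ("R1",)), (None, ("R1", "R2"))]
cycles, ilp = ideal_schedule(loop)
```

Because instructions are visited in program order, each source register automatically refers to its most recent earlier writer, so only true dependences delay an instruction, and instructions after the branch are free to move above it.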

169
What A Perfect Processor Must Do?
  • Look arbitrarily far ahead to find a set of
    instructions to issue, predicting all branches
    perfectly
  • Rename all register uses to avoid WAW and WAR
    hazards
  • Determine whether there are any dependences among
    the instructions in the issue packet; if so,
    rename accordingly
  • Determine if any memory dependences exist among
    the issuing instructions and handle them
    appropriately
  • Provide enough replicated FUs to allow all the
    ready instructions to issue

170
ILP at the Limit
  • How many instructions would issue on the perfect
    machine every cycle?
  • gcc - 54.8
  • espresso - 62.6
  • li - 17.9
  • fpppp - 75.2
  • doduc - 118.7
  • tomcatv - 150.1
  • Limited only by the ILP inherent in the
    benchmarks
  • Note
  • Benchmarks are small codes
  • More ILP tends to surface as the codes get bigger

Huge amounts of loop parallelism in the SPECfp
codes
171
Window Size
  • The set of instructions that is examined for
    simultaneous execution is called the window
  • The window size will be determined by the cost of
    determining whether n issuing instructions have
    any register dependences among them
  • In theory, This c