CPE 631: ILP, Dynamic Exploitation - PowerPoint PPT Presentation

Slides: 115
Provided by: Alek155
Learn more at: http://www.ece.uah.edu
Transcript and Presenter's Notes

1
CPE 631: ILP, Dynamic Exploitation
  • Electrical and Computer Engineering, University of
    Alabama in Huntsville
  • Aleksandar Milenkovic
  • milenka_at_ece.uah.edu
  • http://www.ece.uah.edu/milenka

2
Outline
  • Instruction Level Parallelism (ILP)
  • Recap: Data Dependencies
  • Extended MIPS Pipeline and Hazards
  • Dynamic scheduling with a scoreboard

3
ILP Concepts and Challenges
  • ILP (Instruction Level Parallelism): overlap
    execution of unrelated instructions
  • Techniques that increase the amount of parallelism
    exploited among instructions:
  • reduce the impact of data and control hazards
  • increase the processor's ability to exploit parallelism
  • Pipeline CPI = Ideal pipeline CPI + Structural
    stalls + RAW stalls + WAR stalls + WAW stalls +
    Control stalls
  • Reducing any of the terms on the right-hand side
    lowers CPI and thus increases instruction
    throughput
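
As a rough numerical illustration of the CPI equation above, the sketch below sums the stall terms for two hypothetical machines. The stall values are invented for illustration (they do not come from the course or the textbook), and `pipeline_cpi` is a name chosen here.

```python
def pipeline_cpi(ideal, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction, by cause."""
    return ideal + structural + raw + war + waw + control


# Hypothetical stall cycles per instruction (assumed values, illustration only).
baseline = pipeline_cpi(1.0, structural=0.05, raw=0.30, war=0.05, waw=0.05, control=0.20)

# The same machine if register renaming removed every WAR and WAW stall.
renamed = pipeline_cpi(1.0, structural=0.05, raw=0.30, control=0.20)

# At the same clock rate, a lower CPI means higher instruction throughput.
speedup = baseline / renamed
print(f"baseline CPI = {baseline:.2f}, renamed CPI = {renamed:.2f}, speedup = {speedup:.3f}")
```

Each technique in the deck attacks one or more of these terms; the sketch only shows how the terms add up, not how large they are on any real machine.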

4
Two approaches to exploit parallelism
  • Dynamic techniques
  • largely depend on hardware to locate the
    parallelism
  • Static techniques
  • rely on software

5
Techniques to exploit parallelism
Technique (section in the textbook) | Reduces
Forwarding and bypassing (A.2) | Data hazard (DH) stalls
Delayed branches (A.2) | Control hazard (CH) stalls
Basic dynamic scheduling (A.8) | DH stalls (RAW)
Dynamic scheduling with register renaming (3.2) | WAR and WAW stalls
Dynamic branch prediction (3.4) | CH stalls
Issuing multiple instructions per cycle (3.6) | Ideal CPI
Speculation (3.7) | Data and control stalls
Dynamic memory disambiguation (3.2, 3.7) | RAW stalls with memory
Loop unrolling (4.1) | CH stalls
Basic compiler pipeline scheduling (A.2, 4.1) | DH stalls
Compiler dependence analysis (4.4) | Ideal CPI, DH stalls
Software pipelining and trace scheduling (4.3) | Ideal CPI and DH stalls
Compiler speculation (4.4) | Ideal CPI, and D/CH stalls
6
Where to look for ILP?
  • Amount of parallelism available within a basic
    block
  • Basic block (BB): straight-line code sequence
    with no branches in except at the entry, and no
    branches out except at the exit
  • Example: gcc (GNU C Compiler): 17% control
    transfer
  • 5 or 6 instructions + 1 branch
  • Dependencies => amount of parallelism in a basic
    block is likely to be much less than 5 => look
    beyond a single block to get more instruction
    level parallelism
  • Simplest and most common way to increase the amount
    of parallelism among instructions is to exploit
    parallelism among iterations of a loop => Loop
    Level Parallelism
  • Vector Processing: see Appendix G

for (i = 1; i <= 1000; i++) x[i] = x[i] + s;
7
Definition Data Dependencies
  • Data dependence: instruction j is data dependent
    on instruction i if either of the following holds:
  • Instruction i produces a result used by
    instruction j, or
  • Instruction j is data dependent on instruction k,
    and instruction k is data dependent on
    instruction i
  • If dependent, cannot execute in parallel
  • Try to schedule to avoid hazards
  • Easy to determine for registers (fixed names)
  • Hard for memory (memory disambiguation):
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?

8
Examples of Data Dependencies
Loop: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar in F2
      S.D    0(R1), F4     ; store result
      DADDUI R1, R1, -8    ; decrement pointer
      BNE    R1, R2, Loop  ; branch if R1 != R2
9
Definition Name Dependencies
  • Two instructions use the same name (register or
    memory location) but don't exchange data
  • Antidependence (WAR if a hazard for HW):
    instruction j writes a register or memory
    location that instruction i reads, and
    instruction i is executed first
  • Output dependence (WAW if a hazard for HW):
    instruction i and instruction j write the
    same register or memory location; ordering
    between the instructions must be preserved. If
    dependent, can't execute in parallel
  • Renaming to remove name dependencies
  • Again, name dependencies are hard for memory
    accesses:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
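
The definitions of data, anti-, and output dependence can be sketched as set intersections over the names each instruction reads and writes. This is a minimal illustration for register names only: it ignores memory disambiguation and transitive dependences, and `Instr` and `dependences` are hypothetical names introduced here.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    text: str
    writes: FrozenSet[str]  # register names the instruction writes
    reads: FrozenSet[str]   # register names the instruction reads


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Classify how later instruction j depends on earlier instruction i."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")  # true data dependence: j uses a result of i
    if i.reads & j.writes:
        kinds.add("WAR")  # antidependence: j overwrites a name that i reads
    if i.writes & j.writes:
        kinds.add("WAW")  # output dependence: both write the same name
    return kinds


# Three instructions from two iterations of the add-scalar loop.
ld1 = Instr("L.D F0,0(R1)", frozenset({"F0"}), frozenset({"R1"}))
add1 = Instr("ADD.D F4,F0,F2", frozenset({"F4"}), frozenset({"F0", "F2"}))
ld2 = Instr("L.D F0,-8(R1)", frozenset({"F0"}), frozenset({"R1"}))

print(dependences(ld1, add1))  # data dependence through F0
print(dependences(add1, ld2))  # name dependence: F0 is overwritten after being read
print(dependences(ld1, ld2))   # name dependence: F0 is written twice
```

Only the first pair exchanges data; the other two pairs merely reuse the name F0, which is why renaming can remove them.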

10
Where are the name dependencies?
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4      ; drop DSUBUI & BNEZ
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4     ; drop DSUBUI & BNEZ
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4    ; drop DSUBUI & BNEZ
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      DSUBUI R1,R1,32      ; alter to 4*8
      BNEZ   R1,LOOP
      NOP
How can we remove them?
11
Where are the name dependencies?
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4      ; drop DSUBUI & BNEZ
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    -8(R1),F8     ; drop DSUBUI & BNEZ
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    -16(R1),F12   ; drop DSUBUI & BNEZ
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    -24(R1),F16
      DSUBUI R1,R1,32      ; alter to 4*8
      BNEZ   R1,LOOP
      NOP
The original "register renaming"
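
The renaming shown above can be mimicked mechanically: give every destination a fresh name and make later reads follow the newest mapping. The sketch below is one way to do that, not the compiler's or the hardware's actual algorithm; the `rename` function, the tuple layout, and the `P0..P3` register names are assumptions made for illustration, and the free list is assumed to be large enough.

```python
def rename(instrs, free_regs):
    """Rename each destination to a fresh register.

    instrs: list of (op, dest, srcs); dest is None for stores.
    free_regs: unused register names (assumed to be sufficient).
    """
    free = iter(free_regs)
    mapping = {}  # architectural name -> most recent renamed name
    renamed = []
    for op, dest, srcs in instrs:
        # Sources read the newest version of each name (unwritten names stay as-is).
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(free)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Two iterations of the unrolled loop body, with F0 and F4 reused.
body = [
    ("L.D", "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D", None, ("R1", "F4")),
    ("L.D", "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D", None, ("R1", "F4")),
]
result = rename(body, ["P0", "P1", "P2", "P3"])
for line in result:
    print(line)
```

After renaming, no two instructions write the same name, so the WAR and WAW dependences between iterations disappear while every true (RAW) dependence is preserved.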
12
Definition Control Dependencies
  • Example: if p1 {S1;}; if p2 {S2;}. S1 is control
    dependent on p1, and S2 is control dependent on
    p2 but not on p1
  • Two constraints on control dependences:
  • An instruction that is control dependent on a branch
    cannot be moved before the branch so that its
    execution is no longer controlled by the branch
  • An instruction that is not control dependent on a
    branch cannot be moved after the branch so
    that its execution is controlled by the branch

      DADDU R5, R6, R7
      ADD   R1, R2, R3
      BEQZ  R4, L
      SUB   R1, R5, R6
L:    OR    R7, R1, R8
13
Dynamically Scheduled Pipelines
14
Overcoming Data Hazards with Dynamic Scheduling
  • Why in HW at run time?
  • Works when real dependences can't be known at compile
    time
  • Simpler compiler
  • Code for one machine runs well on another
  • Example:
  • Key idea: allow instructions behind a stall to
    proceed

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F12
SUB.D cannot execute because the dependence of
ADD.D on DIV.D causes the pipeline to stall, yet
SUB.D is not data dependent on anything!
15
Overcoming Data Hazards with Dynamic Scheduling
(cont'd)
  • Enables out-of-order execution => out-of-order
    completion
  • Out-of-order execution divides the ID stage:
  • 1. Issue: decode instructions, check for
    structural hazards
  • 2. Read operands: wait until no data hazards,
    then read operands
  • Scoreboarding: technique for allowing
    instructions to execute out of order when there
    are sufficient resources and no data dependencies
    (CDC 6600, 1963)

16
Scoreboarding Implications
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR:
  • Queue both the operation and copies of its
    operands
  • Read registers only during the Read Operands stage
  • For WAW, must detect the hazard: stall until the other
    instruction completes
  • Need to have multiple instructions in the execution
    phase => multiple execution units or pipelined
    execution units
  • Scoreboard keeps track of dependencies and the state of
    operations
  • Scoreboard replaces ID, EX, WB with 4 stages

WAW example (ADD.D and SUB.D both write F10):
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F10,F8,F12
WAR example (SUB.D writes F8, which ADD.D reads):
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F8,F8,F12
17
Four Stages of Scoreboard Control
  • ID1, Issue: decode instructions and check for
    structural hazards
  • ID2, Read operands: wait until no data hazards,
    then read operands
  • EX, Execute: operate on operands; when the
    result is ready, the functional unit notifies the
    scoreboard that it has completed execution
  • WB, Write results: finish execution; the
    scoreboard checks for WAR hazards. If none, it
    writes results. If WAR, then it stalls the
    instruction

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F8,F8,F12
Scoreboarding stalls the SUB.D in its write
result stage until ADD.D reads its operands
18
Four Stages of Scoreboard Control
  • 1. Issue: decode instructions and check for
    structural hazards (ID1)
  • If a functional unit for the instruction is free
    and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure. If a
    structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.
  • 2. Read operands: wait until no data hazards, then
    read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit. When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

19
Four Stages of Scoreboard Control
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the functional
    unit has completed execution, the scoreboard
    checks for WAR hazards. If none, it writes
    results. If WAR, then it stalls the instruction.
  • Example:
  • CDC 6600 scoreboard would stall SUB.D until ADD.D
    reads operands

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F8,F8,F14
20
Three Parts of the Scoreboard
  • 1. Instruction status: which of the 4 steps the
    instruction is in (capacity = window size)
  • 2. Functional unit status: indicates the state of
    the functional unit (FU). 9 fields for each
    functional unit:
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or
    -)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing source
    registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
  • 3. Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instruction will
    write that register
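
The functional-unit status and register result status tables, together with the issue, read-operands, and write-result checks, can be sketched as below. This is a simplified model under stated assumptions: it has no timing or execution latency, the class and method names are hypothetical, and the unit names (`Div`, `Add1`, `Add2`, `Mult1`) are chosen for the example rather than taken from the CDC 6600.

```python
class Scoreboard:
    """Functional-unit status plus register result status (no timing model)."""

    def __init__(self, units):
        self.fu = {u: {"Busy": False} for u in units}
        self.result = {}  # register -> unit that will write it

    def can_issue(self, unit, dest):
        # Stall on a structural hazard (unit busy) or a WAW hazard (dest pending).
        return not self.fu[unit]["Busy"] and dest not in self.result

    def issue(self, unit, op, fi, fj, fk):
        assert self.can_issue(unit, fi)
        qj, qk = self.result.get(fj), self.result.get(fk)
        self.fu[unit] = {"Busy": True, "Op": op, "Fi": fi, "Fj": fj, "Fk": fk,
                         "Qj": qj, "Qk": qk, "Rj": qj is None, "Rk": qk is None}
        self.result[fi] = unit

    def can_read_operands(self, unit):
        # RAW hazards are resolved here: both sources must be ready.
        f = self.fu[unit]
        return f["Busy"] and f["Rj"] and f["Rk"]

    def read_operands(self, unit):
        assert self.can_read_operands(unit)
        f = self.fu[unit]
        f["Rj"] = f["Rk"] = False  # operands have been consumed
        f["Qj"] = f["Qk"] = None

    def can_write(self, unit):
        # WAR check: no other active unit may still need the old value of Fi.
        fi = self.fu[unit]["Fi"]
        for name, f in self.fu.items():
            if name != unit and f["Busy"]:
                if (f["Fj"] == fi and f["Rj"]) or (f["Fk"] == fi and f["Rk"]):
                    return False
        return True

    def write_result(self, unit):
        assert self.can_write(unit)
        for f in self.fu.values():
            if f["Busy"]:
                if f["Qj"] == unit:
                    f["Rj"] = True
                if f["Qk"] == unit:
                    f["Rk"] = True
        del self.result[self.fu[unit]["Fi"]]
        self.fu[unit] = {"Busy": False}


sb = Scoreboard(["Div", "Add1", "Add2", "Mult1"])
sb.issue("Div", "DIV.D", "F0", "F2", "F4")
sb.issue("Add1", "ADD.D", "F10", "F0", "F8")
waw_blocked = not sb.can_issue("Mult1", "F10")  # F10 already has a pending writer
sb.issue("Add2", "SUB.D", "F8", "F8", "F14")
raw_blocked = not sb.can_read_operands("Add1")  # ADD.D waits for F0 from DIV.D
sb.read_operands("Add2")
war_blocked = not sb.can_write("Add2")          # ADD.D has not read F8 yet
sb.read_operands("Div")
sb.write_result("Div")                          # F0 produced; ADD.D may proceed
sb.read_operands("Add1")
war_cleared = sb.can_write("Add2")
print(waw_blocked, raw_blocked, war_blocked, war_cleared)
```

The sequence replays the DIV.D / ADD.D / SUB.D example: SUB.D is held in its write-result stage until ADD.D has read F8.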

21
MIPS with a Scoreboard
[Figure: block diagram showing the registers, the FP Mult, FP Div, and Add units, and the control/status lines between them and the scoreboard]
22
Detailed Scoreboard Pipeline Control
23
Scoreboard Example
24
Scoreboard Example Cycle 1
Issue 1st L.D!
25
Scoreboard Example Cycle 2
Structural hazard! No further instructions will
issue!
Issue 2nd L.D?
26
Scoreboard Example Cycle 3
Issue MUL.D?
27
Scoreboard Example Cycle 4
Check for WAR hazards! If none, write result!
28
Scoreboard Example Cycle 5
Issue 2nd L.D!
29
Scoreboard Example Cycle 6
Issue MUL.D!
30
Scoreboard Example Cycle 7
Issue SUB.D!
31
Scoreboard Example Cycle 8
Issue DIV.D!
32
Scoreboard Example Cycle 9
Read operands for MUL.D and SUB.D! Assume we can
feed Mult1 and Add units in the same clock
cycle. Issue ADD.D? Structural hazard (unit is
busy)!
33
Scoreboard Example Cycle 11
Last cycle of SUB.D execution.
34
Scoreboard Example Cycle 12
Check WAR on F8. Write F8.
35
Scoreboard Example Cycle 13
Issue ADD.D!
36
Scoreboard Example Cycle 14
Read operands for ADD.D!
37
Scoreboard Example Cycle 15
38
Scoreboard Example Cycle 16
39
Scoreboard Example Cycle 17
Why cannot write F6?
40
Scoreboard Example Cycle 19
41
Scoreboard Example Cycle 20
42
Scoreboard Example Cycle 21
43
Scoreboard Example Cycle 22
Write F6?
44
Scoreboard Example Cycle 61
45
Scoreboard Example Cycle 62
46
Scoreboard Results
  • For the CDC 6600:
  • 70% improvement for Fortran
  • 150% improvement for hand-coded assembly language
  • cost was similar to one of the functional units
  • surprisingly low
  • bulk of cost was in the extra busses
  • Still, this was in ancient times
  • no caches, no main semiconductor memory
  • no software pipelining
  • compilers?
  • So, why is it coming back?
  • performance via ILP

47
Scoreboard Limitations
  • Amount of parallelism among instructions
  • can we find independent instructions to execute
  • Number of scoreboard entries
  • how far ahead the pipeline can look for
    independent instructions (we assume a window does
    not extend beyond a branch)
  • Number and types of functional units
  • avoid structural hazards
  • Presence of antidependences and output
    dependences
  • WAR and WAW stalls become more important

48
Things to Remember
  • Pipeline CPI = Ideal pipeline CPI + Structural
    stalls + RAW stalls + WAR stalls + WAW stalls +
    Control stalls
  • Data dependencies
  • Dynamic scheduling to minimise stalls
  • Dynamic scheduling with a scoreboard

50
Tomasulo's Algorithm
  • Used in the IBM 360/91 FPU (before caches)
  • Goal: high FP performance without special
    compilers
  • Conditions:
  • Small number of floating point registers (4 in
    360) prevented interesting compiler scheduling of
    operations
  • Long memory accesses and long FP delays
  • This led Tomasulo to try to figure out how to get
    more effective registers: renaming in hardware!
  • Why study a 1966 computer?
  • The descendants of this have flourished!
  • Alpha 21264, HP 8000, MIPS 10000, Pentium III,
    PowerPC 604, ...

51
Tomasulo's Algorithm (cont'd)
  • Control & buffers distributed with Function Units
    (FU)
  • FU buffers are called reservation stations =>
    they buffer the operands of instructions waiting to
    issue
  • Registers in instructions are replaced by values or
    pointers to reservation stations (RS) => register
    renaming
  • avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers can't
  • Results go to FUs from RSs, not through registers,
    over a Common Data Bus that broadcasts results to
    all FUs
  • Loads and stores are treated as FUs with RSs as well
  • Integer instructions can go past branches,
    allowing FP ops beyond the basic block in the FP queue

52
Tomasulo-based FPU for MIPS
[Figure: Tomasulo-based FPU. The FP op queue (from the instruction unit) and the FP registers feed the reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP multipliers); load buffers (Load1-Load6) take data from memory and store buffers (Store1-Store3) send data to memory; results travel on the Common Data Bus (CDB).]
53
Reservation Station Components
  • Op: operation to perform in the unit (e.g., + or
    -)
  • Vj, Vk: values of the source operands
  • Store buffers have a V field: the result to be stored
  • Qj, Qk: reservation stations producing the source
    registers (value to be written)
  • Note: Qj/Qk = 0 => source operand is already
    available in Vj/Vk
  • Store buffers only have Qi for the RS producing the
    result
  • Busy: indicates the reservation station or FU is
    busy
  • Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instruction
    will write that register.

54
Three Stages of Tomasulo Algorithm
  • 1. Issue: get instruction from FP Op Queue
  • If reservation station free (no structural
    hazard), control issues instr & sends operands
    (renames registers)
  • 2. Execute: operate on operands (EX)
  • When both operands are ready, then execute; if not
    ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write it on the Common Data Bus to all awaiting
    units; mark reservation station available
  • Normal data bus: data + destination ("go to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of Functional Unit
    source address
  • Write if matches expected Functional Unit
    (produces result)
  • Does the broadcast
  • Example speed: 2 clocks for Fl. pt. +,-; 10 for
    *; 40 clks for /
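
The three stages can be sketched with reservation stations that hold either a value (Vj/Vk) or the tag of the producing station (Qj/Qk). This is a minimal model, not the 360/91 design: latencies are ignored, `None` stands in for the "Qj/Qk = 0" convention, and the `Tomasulo` class, its methods, and the register values are assumptions made for illustration.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


class Tomasulo:
    def __init__(self, stations, regs):
        self.rs = {name: None for name in stations}  # None means the station is free
        self.regs = dict(regs)                       # register -> value
        self.reg_status = {}                         # register -> tag of pending writer

    def issue(self, tag, op, dest, src_j, src_k):
        if self.rs[tag] is not None:
            return False  # structural hazard: no free reservation station
        entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for v, q, src in (("Vj", "Qj", src_j), ("Vk", "Qk", src_k)):
            if src in self.reg_status:
                entry[q] = self.reg_status[src]  # rename: point at the producer
            else:
                entry[v] = self.regs[src]        # copy the value at issue time
        self.rs[tag] = entry
        self.reg_status[dest] = tag
        return True

    def ready(self, tag):
        e = self.rs[tag]
        return e is not None and e["Qj"] is None and e["Qk"] is None

    def execute(self, tag):
        e = self.rs[tag]
        return OPS[e["op"]](e["Vj"], e["Vk"])

    def broadcast(self, tag, value):
        # Common Data Bus: every waiting station matches on the source tag.
        for e in self.rs.values():
            if e is None:
                continue
            if e["Qj"] == tag:
                e["Vj"], e["Qj"] = value, None
            if e["Qk"] == tag:
                e["Vk"], e["Qk"] = value, None
        for reg, writer in list(self.reg_status.items()):
            if writer == tag:
                self.regs[reg] = value
                del self.reg_status[reg]
        self.rs[tag] = None  # mark the reservation station available


m = Tomasulo(["Add1", "Add2", "Mult1"],
             {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F8": 3.0, "F10": 0.0, "F14": 1.0})
m.issue("Mult1", "DIV", "F0", "F2", "F4")   # DIV.D F0,F2,F4
m.issue("Add1", "ADD", "F10", "F0", "F8")   # ADD.D F10,F0,F8 (waits on Mult1)
m.issue("Add2", "SUB", "F8", "F8", "F14")   # SUB.D F8,F8,F14
add_waiting = not m.ready("Add1")
m.broadcast("Add2", m.execute("Add2"))      # SUB.D finishes first, no WAR stall
m.broadcast("Mult1", m.execute("Mult1"))
m.broadcast("Add1", m.execute("Add1"))
print(m.regs)
```

Because ADD.D copied the old value of F8 into its reservation station at issue, SUB.D can write F8 early without corrupting ADD.D's operand.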

55
Tomasulo Example
56
Tomasulo Example Cycle 1
57
Tomasulo Example Cycle 2
Note: can have multiple loads outstanding
58
Tomasulo Example Cycle 3
  • Note: register names are removed (renamed) in
    Reservation Stations; MULT issued
  • Load1 completing; what is waiting for Load1?

59
Tomasulo Example Cycle 4
  • Load2 completing; what is waiting for Load2?

60
Tomasulo Example Cycle 5
  • Timer starts down for Add1, Mult1

61
Tomasulo Example Cycle 6
  • Issue ADDD here despite name dependency on F6?

62
Tomasulo Example Cycle 7
  • Add1 (SUBD) completing; what is waiting for it?

63
Tomasulo Example Cycle 8
64
Tomasulo Example Cycle 9
65
Tomasulo Example Cycle 10
  • Add2 (ADDD) completing; what is waiting for it?

66
Tomasulo Example Cycle 11
  • Write result of ADDD here?
  • All quick instructions complete in this cycle!

67
Tomasulo Example Cycle 12
68
Tomasulo Example Cycle 13
69
Tomasulo Example Cycle 14
70
Tomasulo Example Cycle 15
  • Mult1 (MULTD) completing; what is waiting for it?

71
Tomasulo Example Cycle 16
  • Just waiting for Mult2 (DIVD) to complete

72
Tomasulo Example Cycle 55
73
Tomasulo Example Cycle 56
  • Mult2 (DIVD) is completing; what is waiting for
    it?

74
Tomasulo Example Cycle 57
  • Once again: in-order issue, out-of-order
    execution, and out-of-order completion.

75
Tomasulo Drawbacks
  • Complexity
  • delays of 360/91, MIPS 10000, Alpha 21264, IBM
    PPC 620 in CAAQA 2/e, but not in silicon!
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
  • Each CDB must go to multiple functional units =>
    high capacitance, high wiring density
  • Number of functional units that can complete per
    cycle limited to one!
  • Multiple CDBs => more FU logic for parallel assoc
    stores
  • Non-precise interrupts!
  • We will address this later

76
Tomasulo Loop Example
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, 8
      BNEZ  R1, Loop
  • This time assume Multiply takes 4 clocks
  • Assume 1st load takes 8 clocks (L1 cache miss),
    2nd load takes 1 clock (hit)
  • To be clear, will show clocks for SUBI, BNEZ
  • Reality: integer instructions run ahead of Fl. Pt.
    instructions
  • Show 2 iterations

77
Loop Example
78
Loop Example Cycle 1
79
Loop Example Cycle 2
80
Loop Example Cycle 3
  • Implicit renaming sets up data flow graph

81
Loop Example Cycle 4
82
Loop Example Cycle 5
83
Loop Example Cycle 6
84
Loop Example Cycle 7
85
Loop Example Cycle 8
86
Loop Example Cycle 9
87
Loop Example Cycle 10
88
Loop Example Cycle 11
89
Loop Example Cycle 12
90
Loop Example Cycle 13
91
Loop Example Cycle 14
92
Loop Example Cycle 15
93
Loop Example Cycle 16
94
Loop Example Cycle 17
95
Loop Example Cycle 18
96
Loop Example Cycle 19
97
Loop Example Cycle 20
  • Once again: in-order issue, out-of-order
    execution, and out-of-order completion.

98
Why can Tomasulo overlap iterations of loops?
  • Register renaming
  • Multiple iterations use different physical
    destinations for registers (dynamic loop
    unrolling)
  • Reservation stations
  • Permit instruction issue to advance past integer
    control flow operations
  • Also buffer old values of registers - totally
    avoiding the WAR stall that we saw in the
    scoreboard
  • Other perspective: Tomasulo builds the data flow
    dependency graph on the fly

99
Tomasulo's scheme offers 2 major advantages
  • (1) the distribution of the hazard detection
    logic
  • distributed reservation stations and the CDB
  • If multiple instructions are waiting on a single
    result, and each already has its other operand,
    then the instructions can be released simultaneously
    by broadcast on the CDB
  • If a centralized register file were used, the
    units would have to read their results from the
    registers when register buses are available.
  • (2) the elimination of stalls for WAW and WAR
    hazards

100
Multiple Issue
  • Allow multiple instructions to issue in a single
    clock cycle (CPI < 1)
  • Two flavors:
  • Superscalar
  • Issue a varying number of instructions per clock
  • Can be statically (compiler tech.) or dynamically
    (Tomasulo) scheduled
  • VLIW (Very Long Instruction Word)
  • Issue a fixed number of instructions formatted as
    a single long instruction or as a fixed
    instruction packet

101
Multiple Issue with Dynamic Scheduling
[Figure: the Tomasulo-based FPU (FP op queue from the instruction unit, FP registers, load buffers Load1-Load6, store buffers Store1-Store3, reservation stations Add1-Add3 and Mult1-Mult2, FP adders and FP multipliers), annotated "Issue 2 instructions per clock cycle"]
102
Multiple Issue with Dynamic Scheduling An Example
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
Assumptions: 2-issue processor; can issue any
pair of instructions if reservation stations are
available
Resources: ALU (int + effective address),
a separate pipelined FP unit for each operation type,
branch prediction hardware, 1 CDB
Latencies: 2 cc for loads, 3 cc for FP add
Branches: single issue; branch prediction is perfect
103
Execution in Dual-issue Tomasulo Pipeline
Iter. | Inst. | Issue | Exe. (begins) | Mem. access | Write at CDB | Comment
1 | L.D F0,0(R1) | 1 | 2 | 3 | 4 | First issue
1 | ADD.D F4,F0,F2 | 1 | 5 | - | 8 | Wait for L.D
1 | S.D 0(R1),F4 | 2 | 3 | 9 | - | Wait for ADD.D
1 | DADDIU R1,R1,-8 | 2 | 4 | - | 5 | Wait for ALU
1 | BNE R1,R2,Loop | 3 | 6 | - | - | Wait for DADDIU
2 | L.D F0,0(R1) | 4 | 7 | 8 | 9 | Wait for BNE
2 | ADD.D F4,F0,F2 | 4 | 10 | - | 13 | Wait for L.D
2 | S.D 0(R1),F4 | 5 | 8 | 14 | - | Wait for ADD.D
2 | DADDIU R1,R1,-8 | 5 | 9 | - | 10 | Wait for ALU
2 | BNE R1,R2,Loop | 6 | 11 | - | - | Wait for DADDIU
3 | L.D F0,0(R1) | 7 | 12 | 13 | 14 | Wait for BNE
3 | ADD.D F4,F0,F2 | 7 | 15 | - | 18 | Wait for L.D
3 | S.D 0(R1),F4 | 8 | 13 | 19 | - | Wait for ADD.D
3 | DADDIU R1,R1,-8 | 8 | 14 | - | 15 | Wait for ALU
3 | BNE R1,R2,Loop | 9 | 16 | - | - | Wait for DADDIU
104
Multiple Issue with Dynamic Scheduling Resource
Usage
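Clock | Int ALU | FP ALU | Data cache | CDB
2 | 1/L.D | - | - | -
3 | 1/S.D | - | 1/L.D | -
4 | 1/DADDIU | - | - | 1/L.D
5 | - | 1/ADD.D | - | 1/DADDIU
6 | - | - | - | -
7 | 2/L.D | - | - | -
8 | 2/S.D | - | 2/L.D | 1/ADD.D
9 | 2/DADDIU | - | 1/S.D | 2/L.D
10 | - | 2/ADD.D | - | 2/DADDIU
11 | - | - | - | -
12 | 3/L.D | - | - | -
13 | 3/S.D | - | 3/L.D | 2/ADD.D
14 | 3/DADDIU | - | 2/S.D | 3/L.D
15 | - | 3/ADD.D | - | 3/DADDIU
16 | - | - | - | -
17 | - | - | - | -
18 | - | - | - | 3/ADD.D
19 | - | - | 3/S.D | -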
105
Multiple Issue with Dynamic Scheduling
  • DADDIU waits for the ALU used by S.D
  • Add one ALU dedicated to effective address
    calculation
  • Use 2 CDBs
  • Draw the table for the dual-issue version of
    Tomasulo's pipeline

106
Multiple Issue with Dynamic Scheduling
Iter. | Inst. | Issue | Exe. (begins) | Mem. access | Write at CDB | Comment
1 | L.D F0,0(R1) | 1 | 2 | 3 | 4 | First issue
1 | ADD.D F4,F0,F2 | 1 | 5 | - | 8 | Wait for L.D
1 | S.D 0(R1),F4 | 2 | 3 | 9 | - | Wait for ADD.D
1 | DADDIU R1,R1,-8 | 2 | 3 | - | 4 | Executes earlier
1 | BNE R1,R2,Loop | 3 | 5 | - | - | Wait for DADDIU
2 | L.D F0,0(R1) | 4 | 6 | 7 | 8 | Wait for BNE
2 | ADD.D F4,F0,F2 | 4 | 9 | - | 12 | Wait for L.D
2 | S.D 0(R1),F4 | 5 | 7 | 13 | - | Wait for ADD.D
2 | DADDIU R1,R1,-8 | 5 | 6 | - | 7 | Executes earlier
2 | BNE R1,R2,Loop | 6 | 8 | - | - |
3 | L.D F0,0(R1) | 7 | 9 | 10 | 11 | Wait for BNE
3 | ADD.D F4,F0,F2 | 7 | 12 | - | 15 |
3 | S.D 0(R1),F4 | 8 | 10 | 16 | - |
3 | DADDIU R1,R1,-8 | 8 | 9 | - | 10 |
3 | BNE R1,R2,Loop | 9 | 11 | - | - |
107
Multiple Issue with Dynamic Scheduling Resource
Usage
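Clock | Int ALU | Adr. adder | FP ALU | Data cache | CDB1 | CDB2
2 | - | 1/L.D | - | - | - | -
3 | 1/DADDIU | 1/S.D | - | 1/L.D | - | -
4 | - | - | - | - | 1/L.D | 1/DADDIU
5 | - | - | 1/ADD.D | - | - | -
6 | 2/DADDIU | 2/L.D | - | - | - | -
7 | - | 2/S.D | - | 2/L.D | 2/DADDIU | -
8 | - | - | - | - | 1/ADD.D | 2/L.D
9 | 3/DADDIU | 3/L.D | 2/ADD.D | 1/S.D | - | -
10 | - | 3/S.D | - | 3/L.D | 3/DADDIU | -
11 | - | - | - | - | 3/L.D | -
12 | - | - | 3/ADD.D | - | 2/ADD.D | -
13 | - | - | - | 2/S.D | - | -
14 | - | - | - | - | - | -
15 | - | - | - | - | 3/ADD.D | -
16 | - | - | - | 3/S.D | - | -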
108
What about Precise Interrupts?
  • Tomasulo had: in-order issue, out-of-order
    execution, and out-of-order completion
  • Need to "fix" the out-of-order completion aspect
    so that we can find a precise breakpoint in the
    instruction stream

109
Hardware-based Speculation
  • With wide-issue processors, control dependences
    become a burden, even with sophisticated branch
    predictors
  • Speculation: speculate on the outcome of branches
    and execute the program as if our guesses were
    correct => need a mechanism to handle situations
    when the speculations were incorrect

110
Relationship between precise interrupts and
speculation
  • Speculation is a form of guessing
  • Important for branch prediction
  • Need to take our best shot at predicting branch
    direction
  • If we speculate and are wrong, need to back up
    and restart execution to point at which we
    predicted incorrectly
  • This is exactly the same as precise exceptions!
  • Technique for both precise interrupts/exceptions
    and speculation: in-order completion or commit

111
HW support for precise interrupts
  • Need a HW buffer for results of uncommitted
    instructions: reorder buffer (ROB)
  • 4 fields: instr. type, destination, value, ready
  • Use reorder buffer number instead of reservation
    station when execution completes
  • Supplies operands between execution complete &
    commit
  • (Reorder buffer can be an operand source => more
    registers, like RS)
  • Instructions commit
  • Once instruction commits, result is put into
    register
  • As a result, easy to undo speculated
    instructions on mispredicted branches or
    exceptions

[Figure: datapath with the FP op queue, reorder buffer, FP registers, reservation stations, and FP adders]
112
Four Steps of Speculative Tomasulo Algorithm
  • 1. Issue: get instruction from FP Op Queue
  • If reservation station and reorder buffer slot
    free, issue instr & send operands & reorder
    buffer no. for destination (this stage is sometimes
    called "dispatch")
  • 2. Execution: operate on operands (EX)
  • When both operands are ready, then execute; if not
    ready, watch CDB for result; when both are in the
    reservation station, execute; checks RAW
    (sometimes called "issue")
  • 3. Write result: finish execution (WB)
  • Write on Common Data Bus to all awaiting FUs &
    reorder buffer; mark reservation station
    available.
  • 4. Commit: update register with reorder result
  • When instr. is at head of reorder buffer & result
    present, update register with result (or store to
    memory) and remove instr from reorder buffer.
    A mispredicted branch flushes the reorder buffer
    (sometimes called "graduation")
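
The commit step can be sketched as a queue that retires only from its head. The entries below carry the four fields named earlier (type, destination, value, ready) plus a `mispredicted` flag, which is an assumption added here to model the flush; the `ReorderBuffer` class and its method names are likewise hypothetical, and reservation stations and the CDB are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (head)
        self.regs = {}          # committed (architectural) register state

    def issue(self, kind, dest=None):
        entry = {"type": kind, "dest": dest, "value": None,
                 "ready": False, "mispredicted": False}
        self.entries.append(entry)  # allocated in program order
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        # Results may arrive out of order; they wait here until commit.
        entry["value"] = value
        entry["mispredicted"] = mispredicted
        entry["ready"] = True

    def commit(self):
        """Retire ready instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            head = self.entries.popleft()
            if head["type"] == "branch" and head["mispredicted"]:
                self.entries.clear()  # flush all younger, speculative instructions
                break
            if head["dest"] is not None:
                self.regs[head["dest"]] = head["value"]
                committed.append(head["dest"])
        return committed


rob = ReorderBuffer()
mul = rob.issue("alu", "F0")    # long-latency operation
add = rob.issue("alu", "F10")
br = rob.issue("branch")
sub = rob.issue("alu", "F8")    # speculative: follows the branch

rob.write_result(add, 7.0)      # finishes early, but is not at the head
first = rob.commit()            # nothing retires yet
rob.write_result(sub, 2.0)
rob.write_result(mul, 4.0)
rob.write_result(br, mispredicted=True)
second = rob.commit()           # F0 and F10 retire, then the flush discards SUB
print(first, second, rob.regs)
```

The speculative write to F8 never reaches the register state, which is the same mechanism that makes interrupts precise: architectural state changes only at commit, in program order.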

113
What are the hardware complexities with reorder
buffer (ROB)?
  • How do you find the latest version of a register?
  • (As specified by Smith paper) need associative
    comparison network
  • Could use future file or just use the register
    result status buffer to track which specific
    reorder buffer has received the value
  • Need as many ports on ROB as register file

114
Summary
  • Reservation stations: implicit register renaming
    to a larger set of registers + buffering source
    operands
  • Prevents registers from becoming a bottleneck
  • Avoids WAR, WAW hazards of Scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (integer unit gets
    ahead, beyond branches)
  • Today, helps cache misses as well
  • Don't stall for L1 data cache miss (insufficient
    ILP for L2 miss?)
  • Lasting contributions:
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • 360/91 descendants are Pentium III, PowerPC 604,
    MIPS R10000, HP-PA 8000, Alpha 21264