Instruction Level Parallelism - PowerPoint PPT Presentation

About This Presentation
Title:

Instruction Level Parallelism

Description:

Instruction Level Parallelism. Dr. Chinni. 2. Instruction ... Summary of ... Add Ra, Rb, Rc //stall. sw a, Ra. lw Re, e. lw Rf, f. sub Rd, Re, Rf ... – PowerPoint PPT presentation

Provided by: venkat3

Transcript and Presenter's Notes

Title: Instruction Level Parallelism


1
Instruction Level Parallelism
2
Instruction Level Parallelism
  • Concepts and Challenges
  • Dynamic Scheduling
  • Dynamic Hardware Prediction
  • Multiple Issue
  • Compiler Support
  • Hardware Support
  • Studies of ILP

3
Summary of Pipelining Basics
  • Hazards limit performance by preventing
    instructions from executing during their
    designated clock cycles
  • Structural Hazards need more HW resources
  • Data Hazards need forwarding, compiler
    scheduling
  • Control Hazards early evaluation PC, delayed
    branch, prediction
  • Increasing length of pipe increases impact of
    hazards
  • Pipelining helps instruction bandwidth, not
    latency
  • Interrupts, instruction set, and FP make pipelining
    harder
  • Compilers reduce cost of data and control hazards
  • Stalls increase CPI and decrease performance

4
What Is an ILP?
  • Principle: many instructions in the code do not
    depend on each other
  • Result: possible to execute them in parallel
  • ILP: potential overlap among instructions (so
    they can be evaluated in parallel)
  • Issues
  • Building compilers to analyze the code
  • Building special/smarter hardware to handle the
    code
  • ILP Increase the amount of parallelism
    exploited among instructions
  • Seeks Good Results out of Pipelining

5
Scheduling
  • Scheduling: re-arranging instructions to maximize
    performance
  • Requires knowledge about structure of processor
  • Static scheduling: done by compiler
  • Review: provides good analogies for hardware
    scheduling
  • Embedded market and IA-64 architecture and
    Intel's Itanium
  • Have already seen an example of this
  • Scheduling to eliminate MEM/ALU bubbles
  • Another example
  • for (i = 1000; i > 0; i--) x[i] = x[i] + s;
  • Dynamic scheduling: done by hardware
  • Dominates server and desktop markets (Pentium
    III, IV; MIPS R10000/12000; UltraSPARC III;
    PowerPC 603, etc.)

6
Pipeline Scheduling Previous Lecture Example
Compiler schedules (moves) instructions to reduce stalls.
Example code sequence: a = b + c; d = e - f
Before scheduling:
  • lw Rb, b
  • lw Rc, c
  • Add Ra, Rb, Rc //stall
  • sw a, Ra
  • lw Re, e
  • lw Rf, f
  • sub Rd, Re, Rf //stall
  • sw d, Rd
After scheduling:
  • lw Rb, b
  • lw Rc, c
  • lw Re, e
  • Add Ra, Rb, Rc
  • lw Rf, f
  • sw a, Ra
  • sub Rd, Re, Rf
  • sw d, Rd
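The effect of the reordering on this slide can be checked with a small sketch. It assumes a simplified model that is not stated in the slides beyond the //stall marks: an instruction that uses the result of the load immediately before it costs one stall cycle, and nothing else stalls. The tuple encoding and the function name are hypothetical.

```python
# Each instruction is (opcode, destination register or None, source registers).
# Assumed model: one stall when an instruction reads the register written by
# the load directly in front of it (load-use hazard); no other stalls.

def count_load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

before = [
    ("lw", "Rb", ()), ("lw", "Rc", ()),
    ("add", "Ra", ("Rb", "Rc")), ("sw", None, ("Ra",)),
    ("lw", "Re", ()), ("lw", "Rf", ()),
    ("sub", "Rd", ("Re", "Rf")), ("sw", None, ("Rd",)),
]

after = [
    ("lw", "Rb", ()), ("lw", "Rc", ()), ("lw", "Re", ()),
    ("add", "Ra", ("Rb", "Rc")), ("lw", "Rf", ()),
    ("sw", None, ("Ra",)), ("sub", "Rd", ("Re", "Rf")),
    ("sw", None, ("Rd",)),
]

print(count_load_use_stalls(before), count_load_use_stalls(after))
```

Under this model the original order has two load-use stalls and the scheduled order has none, which matches the two //stall marks on the slide.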
7
Basic Pipeline Scheduling
  • To avoid pipeline stall
  • A dependent instruction must be separated from
    the source instruction by a distance in clock
    cycles equal to the pipeline latency
  • Compiler's ability depends on
  • Amount of ILP available in the program
  • Latencies of the functional units in the pipeline
  • Pipeline CPI = Ideal pipeline CPI + Structural
    stalls + Data hazard stalls + Control stalls

8
Pipeline Scheduling Loop Unrolling
  • Basic Block
  • Set of instructions between entry points and
    between branches. A basic block has only one
    entry and one exit
  • Typically 4 to 7 instructions
  • Amount of overlap << 4 to 7 instructions
  • Obtain substantial performance enhancements
    Exploit ILP across multiple basic blocks
  • Loop Level Parallelism
  • Parallelism that exists within a loop Limited
    opportunity
  • Parallelism can cross loop iterations!
  • Techniques to convert loop-level parallelism to
    instruction-level parallelism
  • Loop unrolling: the compiler's or the hardware's
    ability to exploit the parallelism inherent in
    the loop
  • Vector instructions Operate on a sequence of
    data items

9
Assumptions
  • Five-stage integer pipeline
  • Branches have delay of one clock cycle
  • ID stage Comparisons done, decisions made and PC
    loaded
  • No structural hazards
  • Functional units are fully pipelined or
    replicated (as many times as the pipeline depth)
  • FP Latencies

Integer load latency: 1; integer ALU operation
latency: 0
10
Simple Loop Assembler Equivalent
  • for (i = 1000; i > 0; i--) x[i] = x[i] + s;
  • Loop: LD F0, 0(R1) ; F0 = array element
  • ADDD F4, F0, F2 ; add scalar in F2
  • SD F4, 0(R1) ; store result
  • SUBI R1, R1, 8 ; decrement pointer by 8 bytes (DW)
  • BNE R1, R2, Loop ; branch if R1 != R2
  • x[i] and s are double-precision floating-point values
  • R1 initially address of an array element with the
    highest address
  • F2 contains the scalar value s
  • Register R2 is pre-computed so that 8(R2) is the
    last element to operate on

11
Where are the stalls?
  • Unscheduled
  • Loop LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • stall
  • BNE R1, R2, Loop
  • stall
  • 10 clock cycles
  • Can we minimize?
  • Scheduled
  • Loop LD F0, 0(R1)
  • SUBI R1, R1, 8
  • ADDD F4, F0, F2
  • stall
  • BNE R1, R2, Loop
  • SD F4, 8(R1)
  • 6 clock cycles
  • 3 cycles actual work 3 cycles overhead
  • Can we minimize further?

Integer load latency: 1; integer ALU operation
latency: 0
12
Where are the stalls?
Note: two stall cycles are required because the
latency between an FP ALU operation and a store
double is 2 cycles for this architecture, as
specified in the table at the bottom of the slide.
The ADDD and SD instructions must therefore have
two cycles between them.
13
Loop Unrolling
Four copies of loop (first listing below)
Four iteration code (second listing below, starting at the Loop label)
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • Loop LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • ADDD F8, F6, F2
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • ADDD F12, F10, F2
  • SD F12, -16(R1)
  • LD F14, -24(R1)
  • ADDD F16, F14, F2
  • SD F16, -24(R1)
  • SUBI R1, R1, 32
  • BNE R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the
number of loop iterations is a multiple of 4
14
Loop Unroll Schedule
  • Loop LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • stall
  • ADDD F8, F6, F2
  • stall
  • stall
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • stall
  • ADDD F12, F10, F2
  • stall
  • stall
  • SD F12, -16(R1)
  • LD F14, -24(R1)

Unscheduled unrolled loop (listing above): 28 clock cycles, or 7 per
iteration. Can we minimize further?
Schedule
  • Loop: LD F0, 0(R1)
  • LD F6, -8(R1)
  • LD F10, -16(R1)
  • LD F14, -24(R1)
  • ADDD F4, F0, F2
  • ADDD F8, F6, F2
  • ADDD F12, F10, F2
  • ADDD F16, F14, F2
  • SD F4, 0(R1)
  • SD F8, -8(R1)
  • SUBI R1, R1, 32
  • SD F12, 16(R1) (see Note 3)
  • BNE R1, R2, Loop
  • SD F16, 8(R1)
No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize
further?
Note 3: To satisfy the latency requirement between the SUBI and BNE
instructions (one cycle of latency is needed, as explained in the note
on slide 12), one SD instruction was moved in between these
instructions.
15
Summary
  • Original iteration: 10 cycles
  • After scheduling: 6 cycles
  • After unrolling: 7 cycles per iteration
  • After unrolling and scheduling: 3.5 cycles per
    iteration (no stalls)
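The cycle counts quoted on these slides (10, 6, 28 and 14) can be reproduced with a small in-order issue model. This is only a sketch under stated assumptions: the stall latencies are read off the stall placement in the slides (one stall between LD and a dependent ADDD, two between ADDD and a dependent SD, one between SUBI and BNE), every other producer/consumer pair is assumed to need no stall, memory dependences are ignored, and an unfilled branch delay slot costs one extra cycle. The helper names are hypothetical.

```python
# Stall cycles between a producer and a dependent consumer (assumed values,
# taken from where the slides place their stalls).
LATENCY = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2, ("SUBI", "BNE"): 1}

def cycles(program):
    """Count clock cycles for one pass through `program` (in-order issue)."""
    producer = {}            # register -> (opcode that wrote it, issue cycle)
    cycle = 0
    for op, dest, srcs in program:
        issue = cycle + 1
        for reg in srcs:
            if reg in producer:
                p_op, p_cycle = producer[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_op, op), 0))
        cycle = issue
        if dest is not None:
            producer[dest] = (op, cycle)
    if program[-1][0] == "BNE":   # branch delay slot left empty
        cycle += 1
    return cycle

def LD(f):       return ("LD", f, ("R1",))
def ADDD(d, s):  return ("ADDD", d, (s, "F2"))
def SD(f):       return ("SD", None, (f, "R1"))
SUBI = ("SUBI", "R1", ("R1",))
BNE = ("BNE", None, ("R1", "R2"))

pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

unscheduled = [LD("F0"), ADDD("F4", "F0"), SD("F4"), SUBI, BNE]
scheduled = [LD("F0"), SUBI, ADDD("F4", "F0"), BNE, SD("F4")]
unrolled = [ins for l, a in pairs for ins in (LD(l), ADDD(a, l), SD(a))]
unrolled += [SUBI, BNE]
unrolled_scheduled = ([LD(l) for l, _ in pairs]
                      + [ADDD(a, l) for l, a in pairs]
                      + [SD("F4"), SD("F8"), SUBI, SD("F12"), BNE, SD("F16")])

for name, prog in [("unscheduled", unscheduled), ("scheduled", scheduled),
                   ("unrolled", unrolled),
                   ("unrolled + scheduled", unrolled_scheduled)]:
    print(name, cycles(prog))
```

The model only counts cycles. It may place an individual stall at a different position than the slide does (for example after the branch rather than before it in the scheduled loop), but the totals agree with the slides.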
16
Limits to Gains of Loop Unrolling
  • Decreasing benefit
  • A decrease in the amount of overhead amortized
    with each unroll
  • Example just considered
  • Unrolled loop 4 times, no stall cycles, in 14
    cycles 2 were loop overhead
  • If unrolled 8 times, the overhead is reduced from
    ½ cycle per iteration to 1/4
  • Code size limitations
  • Memory is at a premium
  • Larger code size changes the cache hit rate
  • Shortfall in registers (register pressure):
    increasing ILP increases the number of live
    values, and it may not be possible to allocate
    all of them to registers
  • Compiler limitations: significant increase in
    complexity

17
What if upper bound of the loop is unknown?
  • Suppose
  • Upper bound of the loop is n
  • Unroll the loop to make k copies of the body
  • Solution Generate pair of consecutive loops
  • First loop body same as original loop, execute
    (n mod k) times
  • Second loop unrolled body (k copies of
    original), iterate (n/k) times
  • For large values of n, most of the execution time
    is spent in the unrolled loop body
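The pair of consecutive loops described above can be sketched as follows for k = 4, using the x[i] = x[i] + s loop from the earlier slides as the body. The function name is hypothetical and the choice of k = 4 is an assumption made to keep the unrolled body explicit.

```python
def add_scalar_unrolled(x, s):
    """Add s to every element of x using a 4-way unrolled loop."""
    n = len(x)
    i = 0
    # First loop: original body, executed (n mod 4) times.
    for _ in range(n % 4):
        x[i] += s
        i += 1
    # Second loop: four copies of the body, iterated (n // 4) times.
    for _ in range(n // 4):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
    return x

print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 0.5))
```

Running the remainder loop first means the unrolled loop always sees a trip count that is a multiple of k, so it needs no extra tests inside its body.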

18
Summary Tricks of High Performance Processors
  • Out-of-order scheduling To tolerate RAW hazard
    latency
  • Determine that the loads and stores can be
    exchanged as loads and stores from different
    iterations are independent
  • This requires analyzing the memory addresses and
    finding that they do not refer to the same
    address
  • Find that it was ok to move the SD after the SUBI
    and BNE, and adjust the SD offset
  • Loop unrolling Increase scheduling scope for
    more latency tolerance
  • Find that loop unrolling is useful by finding
    that loop iterations are independent, except for
    the loop maintenance code
  • Eliminate extra tests and branches and adjust the
    loop maintenance code
  • Register renaming Remove WAR/WAW violations due
    to scheduling
  • Use different registers to avoid unnecessary
    constraints that would be forced by using same
    registers for different computations
  • Summary Schedule the code preserving any
    dependences needed

19
Compiler Perspective
  • Compiler concerned about dependencies in program.
  • Tries to schedule code to avoid hazards
    property of pipeline organization
  • Looks for Data dependencies
  • Instruction i produces a result used by
    instruction j, or
  • Instruction j is data dependent on instruction k,
    and instruction k is data dependent on
    instruction i (chain dependence)
  • If dependent, can't execute in parallel (or be
    completely overlapped)
  • Easy to determine for registers (fixed names)
  • Hard for memory
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?

20
Data Dependence
  • Data dependence
  • Indicates the possibility of a hazard
  • Determines the order in which results must be
    calculated
  • Sets upper bound on how much parallelism can be
    exploited
  • But whether a hazard actually occurs, and the
    length of any stall, is determined by the pipeline
  • Dependence avoidance
  • Maintain the dependence but avoid hazard
    Scheduling
  • Eliminate dependence by transforming the code

21
Data Dependencies
  • 1 Loop LD F0, 0(R1)
  • 2 ADDD F4, F0, F2
  • 3 SUBI R1, R1, 8
  • 4 BNE R1, R2, Loop ; delayed branch
  • 5 SD F4, 8(R1) ; offset altered when moved past SUBI

22
Name Dependencies
  • Two instructions use the same name (register or
    memory location) but don't exchange data
  • Anti-dependence (WAR if a hazard for HW)
  • Instruction j writes a register or memory
    location that instruction i reads from, and
    instruction i is executed first
  • Output dependence (WAW if a hazard for HW)
  • Instruction i and instruction j write the same
    register or memory location; the ordering between
    the instructions must be preserved
  • How to remove name dependencies?
  • They are not true dependencies
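The definitions of data, anti- and output dependence can be turned into a small register-only classifier. This is a sketch under stated assumptions: instruction i precedes instruction j, only register names are compared (memory locations are not modelled), and writes by instructions between i and j are ignored. The function name and tuple encoding are hypothetical.

```python
def classify(i, j):
    """Return the dependences from earlier instruction i to later instruction j.

    Each instruction is (destination register or None, source registers).
    """
    dest_i, srcs_i = i
    dest_j, srcs_j = j
    deps = set()
    if dest_i is not None and dest_i in srcs_j:
        deps.add("RAW")   # true data dependence: j reads what i writes
    if dest_j is not None and dest_j in srcs_i:
        deps.add("WAR")   # anti-dependence: j overwrites what i reads
    if dest_i is not None and dest_i == dest_j:
        deps.add("WAW")   # output dependence: both write the same name
    return deps

ld_1 = ("F0", ("R1",))           # LD   F0, 0(R1)
addd_1 = ("F4", ("F0", "F2"))    # ADDD F4, F0, F2
ld_2 = ("F0", ("R1",))           # LD   F0, -8(R1)
addd_2 = ("F4", ("F0", "F2"))    # ADDD F4, F0, F2

print(classify(ld_1, addd_1))    # data dependence through F0
print(classify(addd_1, ld_2))    # name dependence only
print(classify(addd_1, addd_2))  # name dependence only
```

Only the RAW result reflects data flowing between instructions; the WAR and WAW results come purely from reusing the names F0 and F4, which is why using different registers removes them.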

23
Register Renaming
Before renaming:
  • 1 Loop: LD F0, 0(R1)
  • 2 ADDD F4, F0, F2
  • 3 SD F4, 0(R1)
  • 4 LD F0, -8(R1)
  • 5 ADDD F4, F0, F2
  • 6 SD F4, -8(R1)
  • 7 LD F0, -16(R1)
  • 8 ADDD F4, F0, F2
  • 9 SD F4, -16(R1)
  • 10 LD F0, -24(R1)
  • 11 ADDD F4, F0, F2
  • 12 SD F4, -24(R1)
  • 13 SUBI R1, R1, 32
  • 14 BNE R1, R2, LOOP
After renaming:
  • 1 Loop: LD F0, 0(R1)
  • 2 ADDD F4, F0, F2
  • 3 SD F4, 0(R1)
  • 4 LD F6, -8(R1)
  • 5 ADDD F8, F6, F2
  • 6 SD F8, -8(R1)
  • 7 LD F10, -16(R1)
  • 8 ADDD F12, F10, F2
  • 9 SD F12, -16(R1)
  • 10 LD F14, -24(R1)
  • 11 ADDD F16, F14, F2
  • 12 SD F16, -24(R1)
  • 13 SUBI R1, R1, 32
  • 14 BNE R1, R2, LOOP
No data is passed in F0, but can't reuse F0 in
cycle 4.
  • Name dependencies are hard for memory accesses
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
  • Our example required the compiler to know that if
    R1 doesn't change then 0(R1) ≠ -8(R1) ≠
    -16(R1) ≠ -24(R1). There were no dependencies
    between some loads and stores, so they could be
    moved around

24
Control Dependencies
  • Example
  • if (p1) { S1; }
  • if (p2) { S2; }
  • S1 is control dependent on p1; S2 is control
    dependent on p2 but not on p1
  • Two constraints
  • An instruction that is control dependent on a
    branch cannot be moved before the branch so
    that its execution is no longer controlled by the
    branch.
  • An instruction that is not control dependent on a
    branch cannot be moved to after the branch so
    that its execution is controlled by the branch.
  • Control dependencies relaxed to get parallelism
  • Get the same effect if the order of exceptions
    (e.g., an address in a register checked by a branch
    before use) and the data flow (e.g., a value in a
    register that depends on a branch) are preserved
    (speculation, delayed branching, etc.).

25
Control Dependencies
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BE R1, R2, exit
  • LD F0, 0(R1) if executed before branch, may
    create exception
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BE R1, R2, Exit
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BE R1, R2, Exit
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8

26
When Safe to Unroll Loop?
  • Example-1 Where are the data dependences?
  • (A, B, C are distinct and non-overlapping arrays)
  • for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i]; /* S1 */
    B[i+1] = B[i] + A[i+1]; /* S2 */ }
  • S2 uses the value A[i+1], computed by S1 in the
    same iteration
  • S1 uses a value computed by S1 in an earlier
    iteration, since iteration i computes A[i+1],
    which is read in iteration i+1. The same is true
    of S2 for B[i] and B[i+1].
  • The second one is a loop-carried dependence between
    iterations
  • Iterations are dependent and can't be executed in
    parallel
  • Not the case for our prior example, where each
    iteration was distinct
  • (Possible loop-carried dependence that does not
    prevent parallelism)

27
When Safe to Unroll Loop?
  • Example-2 Where are the data dependences?
  • (A, B, C are distinct and non-overlapping arrays)
  • for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + B[i]; /* S1 */
    B[i+1] = C[i] + D[i]; /* S2 */ }
  • No dependence from S1 to S2. If there were, then
    there would be a cycle in the dependencies and
    the loop would not be parallel. Since this other
    dependence is absent, interchanging the two
    statements will not affect the execution of S2.
  • On the first iteration of the loop, statement S1
    depends on the value of B[1] computed prior to
    initiating the loop.
  • New code: the dependence between the two statements
    is no longer loop-carried
  • A[2] = A[1] + B[1];
  • for (i = 1; i < 100; i = i + 1) {
    B[i+1] = C[i] + D[i];
  • A[i+2] = A[i+1] + B[i+1]; } //check it out on
    computer/ use your logic
  • B[101] = C[100] + D[100];
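The slide invites checking the transformation on a computer. The sketch below does that under explicit assumptions: S1 is taken to be A[i+1] = A[i] + B[i] and S2 to be B[i+1] = C[i] + D[i], the original loop runs for i = 1..100, and the statement peeled off in front of the new loop is S1 for i = 1, that is A[2] = A[1] + B[1]. The function names and the random test data are hypothetical.

```python
import random

def original(A, B, C, D):
    A, B = A[:], B[:]
    for i in range(1, 101):
        A[i + 1] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]          # S2
    return A, B

def transformed(A, B, C, D):
    A, B = A[:], B[:]
    A[2] = A[1] + B[1]                  # S1 of the first iteration, peeled
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]          # S2 of iteration i
        A[i + 2] = A[i + 1] + B[i + 1]  # S1 of iteration i + 1
    B[101] = C[100] + D[100]            # S2 of the last iteration
    return A, B

rng = random.Random(0)
size = 103                               # indices 0..102, index 0 unused
A0, B0, C0, D0 = ([rng.randint(-9, 9) for _ in range(size)] for _ in range(4))

print(original(A0, B0, C0, D0) == transformed(A0, B0, C0, D0))
```

The two versions execute the same statements in the same order; the transformation only shifts which statements share a loop iteration, so the value of B that S1 needs is produced in the same iteration that consumes it.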

28
Tricks Can Be Done in Hardware..
  • Why build complicated hardware if we can do this
    in software?
  • Performance portability
  • Software assumes pipeline structure
  • Don't want to recompile for new machines
  • More information available to hardware
  • Data addresses, branch directions, and cache misses
    are statically unknown, but the compiler can look
    at more instructions
  • More resources available to hardware
  • May not have enough architectural registers to
    resolve WAR/WAW
  • Easier to use speculative execution in hardware
  • Easier to recover from mis-speculation
  • Solution do combination of both

29
Dynamic Scheduling
  • Dynamic Scheduling Hardware rearranges the order
    of instruction execution to reduce stalls
  • Disadvantages
  • Hardware much more complex
  • Key idea
  • Instructions execute in parallel (use all
    available execution units)
  • Allow instructions behind stall to proceed
  • Example
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Out-of-order execution => out-of-order completion

30
Overview
  • In-order pipeline
  • 5 interlocked stages IF, ID, EX, MEM, WB
  • Structural hazard maximum of 1 instruction per
    stage
  • Unless stage is replicated (FP integer EX) or
    idle (WB for stores)
  • Out-of-order pipeline
  • How does one instruction pass another without
    killing it?
  • Remember only one instruction per-stage
    per-cycle
  • Must buffer instructions

[Figure: pipeline stages IF, ID, EX, MEM, WB]
31
Instruction Buffer
  • Trick instruction buffer (many names for this
    buffer)
  • Accumulate decoded instructions in buffer
  • Buffer sends instructions down rest of pipe
    out-of-order

[Figure: pipeline stages IF, ID1, ID2, EX, MEM, WB, with an instruction buffer]
32
Scoreboard
State/Steps
[Figure: IF, ID, instruction buffer, IS, RO, EX, WB]
  • Confusion in community about which is which stage

Structure
[Figure: registers, data bus, three EX units, control/status lines, scoreboard]
33
Dynamic Scheduling Scoreboard
  • Out-of-order execution divides ID stage
  • 1. Issue: decode instructions, check for
    structural hazards
  • 2. Read operands: wait until no data hazards, then
    read operands
  • Scoreboards allow an instruction to execute whenever
    1 and 2 hold, not waiting for prior instructions.
  • A scoreboard is a data structure that provides
    the information necessary for all pieces of the
    processor to work together.
  • Centralized control scheme
  • No bypassing
  • No elimination of WAR/WAW hazards
  • We will use In order issue, out of order
    execution, out of order commit ( also called
    completion)
  • First used in CDC6600. Our example modified here
    for DLX.
  • CDC had 4 FP units, 5 memory reference units, 7
    integer units.
  • DLX has 2 FP multiply, 1 FP adder, 1 FP divider,
    1 integer.

34
Scoreboard Implications
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR
  • Queue both the operation and copies of its
    operands
  • Read registers only during Read Operands stage
  • Solution for WAW and structural hazards
  • Must detect the hazard and stall until the hazards
    are cleared
  • Need to have multiple instructions in execution
    phase
  • Multiple execution units or pipelined execution
    units
  • Scoreboard keeps track of dependencies and the
    state of operations
  • Scoreboard replaces ID, EX, WB with 4 stages

35
Stages of Scoreboard Control
  • Issue decode instructions check for structural
    hazards (ID1)
  • If a functional unit for the instruction is free
    and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure.
  • If a structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.

36
Stages of Scoreboard Control
  • Read Operands wait until no data hazards, then
    read operands from registers (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit.
  • When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution.
  • The scoreboard resolves RAW hazards dynamically
    in this step, and instructions may be sent into
    execution out of order.

37
Stages of Scoreboard Control
  • Execution operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • Write result finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes results. If WAR, then it stalls the
    instruction.
  • Example
  • DIVD F0, F2, F4
  • ADDD F10, F0, F8
  • SUBD F8, F8, F14
  • Scoreboard would stall SUBD until ADDD reads
    operands

38
Scoreboard Data Structures
  • Instruction status
  • Which of 4 steps the instruction is in
  • Functional unit status
  • Busy Whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or -)
  • Fi Destination register
  • Fj, Fk Source-register numbers
  • Qj, Qk Functional units producing source
    registers Fj, Fk
  • Rj, Rk ready bits for Fj, Fk
  • Register result status
  • Indicates which functional unit (if any) will
    write each register.
  • Blank when no pending instructions will write
    that register

39
Detailed Scoreboard Pipeline Control
Instruction status: Issue
  • Wait until: not Busy(FU) and not Result(D)
  • Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D;
    Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1);
    Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk;
    Result(D) ← FU
Instruction status: Read operands
  • Wait until: Rj and Rk
  • Bookkeeping: Rj ← No; Rk ← No
Instruction status: Execution complete
  • Wait until: functional unit done
Instruction status: Write result
  • Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and
    (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  • Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes);
    ∀f (if Qk(f) = FU then Rk(f) ← Yes);
    Result(Fi(FU)) ← 0; Busy(FU) ← No
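The issue and write-result conditions of the scoreboard can be sketched in a few lines. This is a simplified illustration, not the CDC 6600 design: the unit names Div, Add1 and Add2 are hypothetical, execution latency is not modelled, and an empty string stands for "no functional unit". The demo replays the DIVD / ADDD / SUBD example from the earlier slide, where SUBD must not write F8 until ADDD has read it.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    busy: bool = False
    op: str = ""
    fi: str = ""        # destination register
    fj: str = ""        # source registers
    fk: str = ""
    qj: str = ""        # units producing the sources ("" if none)
    qk: str = ""
    rj: bool = False    # source ready and not yet read
    rk: bool = False

def can_issue(units, result, fu, dest):
    # Structural hazard: unit busy. WAW hazard: dest already has a writer.
    return not units[fu].busy and not result.get(dest)

def issue(units, result, fu, op, dest, s1, s2):
    u = units[fu]
    u.busy, u.op, u.fi, u.fj, u.fk = True, op, dest, s1, s2
    u.qj, u.qk = result.get(s1, ""), result.get(s2, "")
    u.rj, u.rk = not u.qj, not u.qk
    result[dest] = fu

def read_operands(units, fu):
    u = units[fu]
    if u.rj and u.rk:           # RAW hazards resolved
        u.rj = u.rk = False
        return True
    return False

def can_write(units, fu):
    # WAR check: no other active unit still has to read the old value of Fi.
    fi = units[fu].fi
    return all((u.fj != fi or not u.rj) and (u.fk != fi or not u.rk)
               for name, u in units.items() if name != fu and u.busy)

def write_result(units, result, fu):
    for u in units.values():
        if u.qj == fu:
            u.rj = True
        if u.qk == fu:
            u.rk = True
    result[units[fu].fi] = ""
    units[fu].busy = False

def war_demo():
    units = {"Div": Unit(), "Add1": Unit(), "Add2": Unit()}
    result = {}
    issue(units, result, "Div", "DIVD", "F0", "F2", "F4")
    issue(units, result, "Add1", "ADDD", "F10", "F0", "F8")
    issue(units, result, "Add2", "SUBD", "F8", "F8", "F14")
    read_operands(units, "Add2")
    before = can_write(units, "Add2")   # ADDD has not read F8 yet
    write_result(units, result, "Div")  # F0 becomes available
    read_operands(units, "Add1")        # ADDD reads F0 and F8
    after = can_write(units, "Add2")
    return before, after

print(war_demo())
```

The check in `can_write` skips the writing unit itself, which is a simplification of the "for all f" condition; by the time a unit writes, it has already read its own operands.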
40
Scoreboard Example
  • LD F6, 34(R2)
  • LD F2, 45(R3)
  • MULT F0, F2, F4
  • SUBD F8, F6, F2
  • DIVD F10, F0, F6
  • ADDD F6, F8, F2
What are the hazards in this code?
Latencies (clock cycles): LD 1; MULT 10; DIVD 40;
ADDD, SUBD 2
41
Scoreboard Example
42
Scoreboard Example Cycle 1
Issue LD 1
Shows in which cycle the operation occurred.
43
Scoreboard Example Cycle 2
LD 2 can't issue since the integer unit is
busy. MULT can't issue because we require
in-order issue.
44
Scoreboard Example Cycle 3
45
Scoreboard Example Cycle 4
46
Scoreboard Example Cycle 5
Issue LD 2 since integer unit is now free
47
Scoreboard Example Cycle 6
Issue MULT
48
Scoreboard Example Cycle 7
MULT can't read its operands (F2) because LD 2
hasn't finished
49
Scoreboard Example Cycle 8a
DIVD issues. MULT and SUBD both waiting for F2
50
Scoreboard Example Cycle 8b
LD 2 writes F2
51
Scoreboard Example Cycle 9
Now MULT and SUBD can both read F2 How can both
instructions do this at the same time??
52
Scoreboard Example Cycle 11
ADDD can't start because the Add unit is busy
53
Scoreboard Example Cycle 12
SUBD finishes. DIVD waiting for F0
54
Scoreboard Example Cycle 13
ADDD issues
55
Scoreboard Example Cycle 14
56
Scoreboard Example Cycle 15
57
Scoreboard Example Cycle 16
58
Scoreboard Example Cycle 17
ADDD can't write because of the WAR hazard with DIVD!
59
Scoreboard Example Cycle 18
Nothing Happens!!
60
Scoreboard Example Cycle 19
MULT completes execution
61
Scoreboard Example Cycle 20
MULT writes
62
Scoreboard Example Cycle 21
DIVD loads operands
63
Scoreboard Example Cycle 22
Now ADDD can write since WAR removed
64
Scoreboard Example Cycle 61
DIVD completes execution
65
Scoreboard Example Cycle 62
DONE!!
66
Scoreboard
  • Operands for an instruction are read only when
    both operands are available in the register file
  • Scoreboard does not take advantage of forwarding
  • Instructions write to register file as soon as
    they are complete execution (assuming no WAR
    hazards) and do not wait for write slot
  • Reduced pipeline latency benefits of forwarding
  • One additional cycle of latency as write result
    and read operand stages cannot overlap
  • Bus structure
  • Limited number of buses to register file
    represent structural hazards

67
Scoreboard
  • Limitations
  • No forwarding (RAW dependence handled through
    registers)
  • In-order issue for WAW/structural hazards limit
    scheduling flexibility
  • WAR stalls limit dynamic loop unrolling (no
    register unrolling)
  • Performance
  • 1.7X for FORTRAN programs
  • 2.5X for hand-coded assembly
  • Hardware
  • Scoreboard is cheap
  • Busses are not

68
DS Method 2: Tomasulo's Algorithm
  • Developed for IBM 360/91 3 years after CDC 6600
    (1966)
  • Goal High Performance without special compilers
  • Differences between IBM 360 and CDC 6600 ISA
  • IBM has only 2 register specifiers per
    instruction vs. 3 in CDC 6600
  • IBM has 4 FP registers vs. 8 in CDC 6600
  • IBM has long memory access delays, long FP delays
  • Why study? Led to Alpha 21264, HP 8000, MIPS
    10000, Pentium II, PowerPC 604, ...

69
Tomasulos Algorithm
  • Avoid RAW Hazards
  • Execute an instruction only when its operands are
    available
  • Has a scheme to track when operands are available
  • Avoid WAR and WAW Hazards
  • Support Register renaming (even across branches)
  • Renames all destination registers Out-of-order
    write does not affect any instructions that
    depend on an earlier value of an operand
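  • Original code | Renamed code (S and T are temporary registers)
  • DIVD F0, F2, F4 | DIVD F0, F2, F4
  • ADDD F6, F0, F8 | ADDD S, F0, F8
  • SD F6, 0(R1) | SD S, 0(R1)
  • SUBD F8, F10, F14 | SUBD T, F10, F14
  • MULD F6, F10, F8 | MULD F6, F10, T
  • WAR in the original code: ADDD reads F8, which SUBD
    later writes
  • WAW in the original code: ADDD and MULD both write F6
  • Supports the overlapped execution of multiple
    iterations of a loop
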
70
Tomasulo Algorithm vs. Scoreboard
  • Control buffers distributed with Function Units
    (FU) vs. centralized in scoreboard (with
    bypassing)
  • FU buffers, called reservation stations, hold
    pending operands
  • Registers in instructions are replaced by values or
    pointers to reservation stations (RS); this is
    called register renaming
  • Avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers can't
  • Results to FU from RS, not through registers,
    over Common Data Bus that broadcasts results to
    all FUs
  • Load and Stores treated as FUs with reservation
    stations as well
  • Integer instructions can go past branches,
    allowing FP ops beyond basic block in FP queue

71
MIPS FP Unit Using Tomasulos Algorithm
[Figure: FP registers, FP op queue, load buffer, store buffer, FP add reservation stations, FP multiply reservation stations, common data bus]
72
Three Stages of Tomasulos Algorithm
  • Issue: get instruction from FP Op Queue
  • If a reservation station is free (no structural
    hazard), issue the instruction and send operand
    values (if they are in the registers).
  • If the reservation station is busy, the instruction
    stalls
  • If operands are not in the registers, rename
    registers (eliminate WAR, WAW hazards) and keep
    track of the functional units producing operands
  • Execution: operate on operands (EX)
  • If both operands are ready, then execute
  • If not ready, watch the Common Data Bus for the
    result (avoid RAW hazard)
  • Preserve exception behavior: no instruction
    executes unless all preceding branches have
    completed
  • Write result: finish execution (WB)
  • Write on Common Data Bus to all units; mark
    reservation station available
  • Normal data bus: data + destination ("go to" bus)
  • Common data bus: data + source ("come from" bus);
    broadcasts

Each stage can take different number of clock
cycles
73
Reservation Station Components
  • Op
  • Operation to perform in the unit (e.g., + or -)
  • Vj, Vk
  • Value of Source operands
  • Store buffers have V field with result to be
    stored
  • Qj, Qk
  • Reservation stations producing source operand
    (Qj, Qk = 0 => ready)
  • Busy
  • Indicates reservation station or FU is busy
  • Qi: Register result status
  • Indicates which functional unit (if exists) will
    write to the register.
  • 0 when no pending instructions to write to this
    register.

74
Tomasulo's Data Structures
75
Tomasulo's Example Cycle 0
  • Do-it-yourself exercise

76
Register Renaming
  • Register renaming
  • Change register names to eliminate WAR/WAW
    hazards
  • Hardware renaming most beautiful thing in
    architecture
  • Key think of architectural registers as names,
    not locations
  • Can have more locations than names
  • Dynamically map names to locations
  • Map table hardware structure holds current
    mappings
  • Writes allocate new location, note in map table
  • Reads find location of most recent write by
    looking at map table
  • Must de-allocate locations appropriately (slight
    detail)
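The map-table idea (writes allocate a new location, reads look up the most recent mapping) can be sketched as follows. This is a minimal illustration under stated assumptions: new locations are named P0, P1, ... and are never de-allocated, registers with no pending mapping keep their architectural name, and the instruction encoding and function name are hypothetical. The input is the DIVD / ADDD / SD / SUBD / MULD sequence used in the Tomasulo slides.

```python
from itertools import count

def rename(program):
    """Rename destinations so that no location is written twice."""
    map_table = {}                       # architectural name -> location
    fresh = (f"P{n}" for n in count())
    renamed = []
    for op, dest, srcs in program:
        # Reads: find the location of the most recent write.
        new_srcs = tuple(map_table.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            # Writes: allocate a new location and note it in the map table.
            new_dest = next(fresh)
            map_table[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F6", ("F0", "F8")),
    ("SD", None, ("F6", "R1")),
    ("SUBD", "F8", ("F10", "F14")),
    ("MULD", "F6", ("F10", "F8")),
]

for ins in rename(program):
    print(ins)
```

After renaming, SUBD no longer writes the F8 that ADDD reads (the WAR hazard is gone) and ADDD and MULD write different locations (the WAW hazard is gone), while every true data dependence still points at the right producer.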

77
Tomasulo Register Renaming
  • Locations register file, reservation station
    (RS)
  • Values can (and do) exist in both!
  • Value copies used to eliminate WAR hazards
  • Called value-based or copy-based renaming
  • Not pointer based renaming
  • Locations referred to internally by tags (4-bit
    specifiers)
  • Map table translates names to tags
  • After translation, names are discarded
  • CDB broadcasts values with tags attached
  • So RS knows what it's looking at

CDB Common Data Bus
78
Tomasulo Register Renaming
  • Creating operation maps destination register
  • On dispatch, register renamed to tag of allocated
    RS
  • Register table entry = RS number
  • On completion, register written
  • Register table entry = 0
  • Subsequent operation looks up sources in register
    table
  • Entry = 0 -> register has already been written
  • Copy register value to RS
  • Eliminates WAR hazards (private valid copy of
    register in RS)
  • Entry != 0 -> register value not ready, some RS will
    provide
  • Copy entry (RS tag) to RS, monitor CDB for that
    tag

CDB Common Data Bus
79
Tomasulos Algorithm A Loop Based Example
  • If we predict that branches are taken
  • Reservation stations allow multiple executions of
    the loop to proceed at once
  • Advantage without changing code
  • Loop unrolled dynamically: renaming at the
    reservation stations acts as additional registers
  • Load Store
  • Can be done in any order if they access different
    addresses
  • If they access the same address
  • Interchanging a load and a store leads to a WAR or
    RAW hazard; interchanging two stores leads to a WAW
    hazard
  • Detect Hazards
  • Compute effective data memory address and check
    for address conflict with memory address
    associated with earlier memory operation
  • Wait on a match
  • Need to keep the relative order between stores and
    loads; loads can be reordered freely

80
Comparison Tomasulo vs. Scoreboard
Distributed hazard detection
81
Review Tomasulo
  • Prevents Register as bottleneck
  • Avoids WAR, WAW hazards of Scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (provides branch
    prediction)
  • Lasting Contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • 360/91 descendants are PowerPC 604, 620; MIPS
    R10000; HP-PA 8000; Intel Pentium Pro

82
Dynamic Hardware Prediction
  • Dynamic Branch Prediction is the ability of the
    hardware to make an educated guess about which
    way a branch will go - will the branch be taken
    or not.
  • The hardware can look for clues based on the
    instructions, or it can use past history - we
    will discuss both of these directions.

83
Dynamic Branch Prediction
  • Performance (accuracy, cost of misprediction)
  • Branch History Table (BHT) or Branch Prediction
    Buffer
  • Lower bits of PC address used as index of 1-bit
    values
  • Says whether or not branch taken last time
  • Problem: in a loop, a 1-bit BHT will cause two
    mispredictions
  • End-of-loop case, when it exits instead of
    looping as before
  • First time through the loop on the next pass through
    the code, when it predicts exit instead of looping
  • Typical loop branches are not taken only on
    the last iteration
  • So the misprediction rate is twice the rate at which
    such branches are not taken
  • Prediction may be from another branch with same
    low order address bits

84
Branch Prediction Buffers
  • 2-bit scheme where change prediction only if get
    misprediction twice

[Figure: 2-bit prediction state diagram with states 11, 10, 01, 00]
Does not help the five-stage classic pipeline as
it finds branch direction and next PC by ID stage
(assuming no hazard in accessing the register)
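The difference between the 1-bit and 2-bit schemes on a loop branch can be seen in a short simulation. This is a sketch under stated assumptions: a single branch with its own table entry (no aliasing), a loop branch that is taken nine times and then not taken, executed ten times over, and a 2-bit saturating counter as the 2-bit scheme. The state diagram on the slide may differ in detail from a saturating counter, so treat that choice as an assumption.

```python
def one_bit_mispredictions(outcomes, last_taken=True):
    """1-bit scheme: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses

def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

# Loop branch: taken 9 times, then falls through; the loop is run 10 times.
outcomes = ([True] * 9 + [False]) * 10

print(one_bit_mispredictions(outcomes), two_bit_mispredictions(outcomes))
```

With these inputs the 1-bit scheme mispredicts both the loop exit and the first iteration of the next run, while the 2-bit counter mispredicts only the exit, because a single not-taken outcome is not enough to flip its prediction.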
85
Branch History Table (BHT) Accuracy
  • Mispredict because either
  • Wrong guess for that branch
  • Got branch history of wrong branch when indexing
    the table
  • 4096 entry table
  • Programs vary from 1% misprediction (nasa7,
    tomcatv) to 18% (eqntott), with spice at 9% and
    gcc at 12%
  • Misprediction rate for integer benchmarks (gcc,
    espresso, eqntott, etc.) is substantially higher
    (average 11%) than that for the FP programs
    (nasa7, matrix300, tomcatv, etc., average 4%)
  • 4096 entries (2 bits per entry) as good as
    infinite table
  • But 4096 is a lot of HW

86
Correlating Branch Predictors
  • Branch predictors that use the behavior of other
    branches to make prediction
  • Also called two-level predictors
  • Idea: the taken/not-taken outcome of recently
    executed branches is related to the behavior of the
    next branch (as well as the history of that
    branch's own behavior)
  • The behavior of recent branches then selects between,
    say, four predictions of the next branch, updating
    just that prediction
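
The selection step can be sketched as follows, assuming 2 bits of global history that pick one of four 2-bit counters per table entry. The class name, table size, indexing by `pc >> 2`, and initial counter values are all illustrative assumptions.

```python
class CorrelatingPredictor:
    """Sketch of a two-level predictor: global history selects a counter."""

    def __init__(self, entries=1024, history_bits=2):
        self.entries = entries
        self.history_bits = history_bits
        self.history = 0  # outcomes of the most recent branches
        # One 2-bit counter per (entry, history pattern); start weakly not taken.
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def _row(self, pc):
        return self.counters[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        c = row[self.history]
        # Update just the prediction that was selected by the history.
        row[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A branch that strictly alternates taken / not taken.
alternating = [i % 2 == 0 for i in range(100)]
misses = count_misses(CorrelatingPredictor(), 0x40, alternating)
```

Once the counters for each history pattern have warmed up, the alternating branch is predicted correctly. A single 2-bit counter per branch cannot learn this pattern.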

87
Accuracy of Different Schemes
[Figure: bar chart of frequency of mispredictions (0% to 18%) for three predictors: 4096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1024 entries with 2 bits of history and 2 bits per entry]
88
Branch Target Buffer (BTB)
  • Use address of branch as index to get prediction
    AND branch target address (if taken)
  • Note: must check for branch match now, since we
    can't use the wrong branch address
  • Done at IF stage: better than branch computation
    at ID stage in the 5-stage pipeline
  • Penalty
  • 2 clock cycles
  • (1 to update buffer,
  • 1 to fetch new)
  • Return instruction addresses predicted with stack

[Figure: BTB lookup producing the predicted PC and the branch prediction (taken or not taken)]
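
A BTB lookup with the branch-match check can be sketched as below. The class, its 16-entry size, and the policy of storing only taken branches are illustrative assumptions rather than the design of any specific processor.

```python
class BranchTargetBuffer:
    """Sketch of a direct-mapped BTB with a full-address tag check."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, target_pc)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # low-order bits of a word address

    def lookup(self, pc):
        """Return the predicted target, or None if this branch is not present."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:  # must match the branch itself
            return slot[1]
        return None

    def update(self, pc, taken, target):
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None  # predicted taken but was not: drop entry


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x1000)       # nothing stored yet
btb.update(0x1000, True, 0x2000)
hit_target = btb.lookup(0x1000)         # same branch: predicted target
alias_lookup = btb.lookup(0x1000 + 64)  # same index, different branch
```

The tag comparison in `lookup` is what keeps a branch from using the target of a different branch that maps to the same entry.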
89
Example
  • What is the total branch penalty for a BTB with
  • Prediction accuracy of 90%
  • Hit rate in the buffer of 90%
  • 60% of the branches are taken

Instruction in buffer   Prediction   Actual branch   Penalty cycles
Yes                     Taken        Taken           0
Yes                     Taken        Not taken       2
No                      -            Taken           2
No                      -            Not taken       0

Penalty cases:
  Predicted taken, but not taken (2 cycles)
  Branch taken but not found in the buffer (2 cycles)
Branch penalty = (buffer hit rate x fraction of incorrect predictions x 2)
               + ((1 - buffer hit rate) x fraction of taken branches x 2)
Branch penalty = (90% x 10% x 2) + (10% x 60% x 2) = 0.30 clock cycles
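
The same arithmetic as a small function, so other hit rates or accuracies can be tried. The function and parameter names are illustrative; the formula is the one from the example above.

```python
def btb_branch_penalty(hit_rate, accuracy, taken_fraction, penalty_cycles=2):
    """Average branch penalty in clock cycles per branch."""
    # Branch found in the buffer, but the prediction was wrong.
    mispredicted_hit = hit_rate * (1 - accuracy) * penalty_cycles
    # Branch not in the buffer and it turned out to be taken.
    taken_miss = (1 - hit_rate) * taken_fraction * penalty_cycles
    return mispredicted_hit + taken_miss


penalty = btb_branch_penalty(hit_rate=0.90, accuracy=0.90, taken_fraction=0.60)
print(f"{penalty:.2f} clock cycles per branch")
```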
90
Multiple Issue
  • Multiple Issue is the ability of the processor to
    start more than one instruction in a given cycle.
  • Superscalar processors
  • Very Long Instruction Word (VLIW) processors

91
1990s Superscalar Processors
  • Bottleneck: CPI >= 1
  • Limit on scalar performance (single instruction
    issue)
  • Hazards
  • Superpipelining? Diminishing returns (hazards +
    overhead)
  • How can we make the CPI 0.5?
  • Multiple instructions in every pipeline stage
    (super-scalar)

Cycle   1   2   3   4   5   6   7
Inst0   IF  ID  EX  MEM WB
Inst1   IF  ID  EX  MEM WB
Inst2       IF  ID  EX  MEM WB
Inst3       IF  ID  EX  MEM WB
Inst4           IF  ID  EX  MEM WB
Inst5           IF  ID  EX  MEM WB

92
Superscalar Processors
  • Pioneer: IBM (America => RIOS, RS/6000, Power-1)
  • Superscalar instruction combinations
  • 1 ALU or memory or branch + 1 FP (RS/6000)
  • Any 1 + 1 ALU (Pentium)
  • Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1
    branch (Pentium II)
  • Impact of superscalar
  • More opportunity for hazards (why?)
  • More performance loss due to hazards (why?)

93
Superscalar Processors
  • Issues varying number of instructions per clock
  • Scheduling: static (by the compiler) or
    dynamic (by the hardware)
  • Superscalar has a varying number of
    instructions/cycle (1 to 8), scheduled by
    compiler or by HW (Tomasulo).
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

94
Elements of Advanced Superscalars
  • High performance instruction fetching
  • Good dynamic branch and jump prediction
  • Multiple instructions per cycle, multiple
    branches per cycle?
  • Scheduling and hazard elimination
  • Dynamic scheduling
  • Not necessarily: Alpha 21064 and Pentium were
    statically scheduled
  • Register renaming to eliminate WAR and WAW
  • Parallel functional units, paths/buses/multiple
    register ports
  • High performance memory systems
  • Speculative execution
  • Precise interrupts

95
SS + DS + Speculation
  • Superscalar + Dynamic scheduling + Speculation
  • Three great tastes that taste great together
  • CPI >= 1?
  • Overcome with superscalar
  • Superscalar increases hazards
  • Overcome with dynamic scheduling
  • RAW dependences still a problem?
  • Overcome with a large window
  • Branches a problem for filling large window?
  • Overcome with speculation

96
Three Great Tastes That Taste Great Together II
  • Static ILP
  • VLIW (very long instruction word)
  • To get IPC > 1
  • Static scheduling (pipeline scheduling)
  • To overcome data hazards
  • Static scheduling/software speculation (loop
    unrolling)
  • More instructions for scheduling flexibility,
    overcome control hazards
  • Case for VLIW: compiler complexity doesn't impact
    clock!

97
VLIW
  • VLIW Very long instruction word
  • In-order pipe, but each instruction is N
    instructions (VLIW)
  • Typically slotted (i.e., 1st must be ALU, 2nd
    must be load, etc.)
  • VLIW travels down pipe as a unit
  • Compiler packs independent instructions into VLIW
  • Processor does not have logic to interlock
    instructions within a VLIW
  • Pure VLIW
  • Fixed instruction latencies; processor can't
    interlock between VLIWs

[Figure: VLIW pipeline diagram with IF and ID stages followed by parallel ALU, address (Ad)/MEM, and FP paths, each ending in WB]
98
Very Long Instruction Word
  • VLIW - issues a fixed number of instructions
    formatted either as one very large instruction or
    as a fixed packet of smaller instructions
  • Fixed number of instructions (4-16) scheduled by
    the compiler; put operators into wide templates
  • Started with microcode (horizontal microcode)
  • Joint HP/Intel agreement in 1999/2000
  • Intel Architecture-64 (IA-64): 64-bit address
    / Itanium
  • Explicitly Parallel Instruction Computer (EPIC)
  • Transmeta translates X86 to VLIW
  • Many embedded controllers (TI, Motorola) are VLIW

99
Superscalar Vs. VLIW
  • Religious debate, similar to RISC vs. CISC
  • Wisconsin, Michigan (superscalar) vs. Illinois
    (VLIW)
  • Q. Who can schedule code better, hardware or
    software?

100
Hardware Scheduling
  • High branch prediction accuracy
  • Dynamic information on latencies (cache misses)
  • Dynamic information on memory dependences
  • Easy to speculate (and recover from
    mis-speculation)
  • Works for generic, non-loop, irregular code
  • Ex: databases, desktop applications, compilers
  • -ves
  • Limited reorder buffer size limits lookahead
  • High cost/complexity
  • Slow clock

101
Software Scheduling
  • Large scheduling scope (full program), large
    lookahead
  • Can handle very long latencies
  • Simple hardware with fast clock
  • Only works well for regular codes (scientific,
    FORTRAN)
  • -ves
  • Low branch prediction accuracy
  • Can improve by profiling
  • No information on latencies like cache misses
  • Can improve by profiling
  • Pain to speculate and recover from
    mis-speculation
  • Can improve with hardware support

102
Profiling
  • Information from previous program run
  • Must use different input!
  • Software's answer to everything
  • Works OK, but only OK
  • Popular research topic
  • Gaining importance

103
Pure VLIW: What Does VLIW Mean?
  • All latencies fixed
  • All instructions in VLIW issue at once
  • No hardware interlocks at all
  • Compiler responsible for scheduling entire
    pipeline
  • Includes stall cycles
  • Possible if you know structure of pipeline and
    latencies exactly

104
Problems with Pure VLIW
  • Latencies are not fixed (e.g., caches)
  • Option I: don't use caches (forget it)
  • Option II: stall whole pipeline on a miss? (need
    interlocks)
  • Option III: stall instructions waiting for
    memory? (need out-of-order)
  • Different implementations
  • Different pipe depths, different latencies
  • New pipeline may produce wrong results (code
    stalls in wrong place)
  • Recompile for new implementations?
  • Code compatibility is very important, made Intel
    what it is

105
Tainted VLIW
  • EPIC (IA64, Itanium)
  • Less rigid than VLIW (Not really VLIW at all)
  • Architecture: variable-width instruction words
  • Implemented as bundles with dependence bits
  • Makes code compatible with different width
    machines
  • Implementation: interlocks
  • Makes code compatible with different pipelines
  • Enables stalls on cache misses
  • Actually enables out-of-order, too
  • Explicitly parallel RISC with support for
    software speculation

106
Key: Static Scheduling
  • VLIW relies on the fact that software can
    schedule code well
  • Three techniques
  • Loop unrolling (we have seen this one already)
  • Problems
  • Code growth
  • Poor scheduling along seams of unrolled copies
  • Doesn't handle carried dependences
    (inter-iteration dependences or recurrences)
  • Software pipelining (symbolic loop unrolling)
  • Trace scheduling

107
VLIW
  • 3 instructions in 128-bit groups; a field
    determines if instructions are dependent or
    independent
  • Smaller code size than old VLIW, larger than
    x86/RISC
  • Groups can be linked to show independence of > 3
    instr
  • 128 integer registers + 128 floating point
    registers
  • Not separate files per functional unit as in old
    VLIW
  • Hardware checks dependencies (interlocks =>
    binary compatibility over time)
  • Predicated execution (select 1 out of 64 1-bit
    flags) => 40% fewer mispredictions?
  • IA-64: name of instruction set architecture; EPIC
    is type
  • Merced is name of first implementation
    (1999/2000?): Itanium?

108
Superscalar Version of DLX
  • can handle 2 instructions/cycle
  • Floating Point
  • Anything Else
  • Fetch 64 bits/clock cycle; Int on left, FP on
    right
  • Can only issue 2nd instruction if 1st
    instruction issues
  • More ports for FP registers to do FP load and FP
    op in a pair

Type               Pipe stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB

  • 1 cycle load delay can cause delay to 3
    instructions in Superscalar
  • instruction in right half can't use it, nor
    instructions in next slot

109
Unrolled Loop Minimizes Stalls for Scalar
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   F4,0(R1)
10       SD   F8,-8(R1)
11       SD   F12,-16(R1)
12       SUBI R1,R1,32
13       BNE  R1,R2,LOOP
14       SD   F16,8(R1)
14 clock cycles, or 3.5 per iteration
Latencies: LD to ADDD 1 cycle; ADDD to SD 2 cycles
110
Loop Unrolling in Superscalar
        Integer instruction    FP instruction       Clock cycle
Loop:   LD   F0,0(R1)                                1
        LD   F6,-8(R1)                               2
        LD   F10,-16(R1)       ADDD F4,F0,F2         3
        LD   F14,-24(R1)       ADDD F8,F6,F2         4
        LD   F18,-32(R1)       ADDD F12,F10,F2       5
        SD   F4,0(R1)          ADDD F16,F14,F2       6
        SD   F8,-8(R1)         ADDD F20,F18,F2       7
        SD   F12,-16(R1)                             8
        SD   F16,-24(R1)                             9
        SUBI R1,R1,40                               10
        BNE  R1,R2,LOOP                             11
        SD   F20,8(R1)                              12

  • Unrolled 5 times to avoid delays (+1 due to SS)
  • 12 clocks, or 2.4 clocks per iteration

Static Scheduling
111
Dynamic Scheduling in Superscalar
  • Code compiled for scalar version will run poorly
    on Superscalar
  • May want code to vary depending on Superscalar
    Architecture
  • Simple approach: separate Tomasulo control with
    separate reservation stations for Integer FU/Reg
    and for FP FU/Reg

112
Dynamic Scheduling in Superscalar
  • How to do instruction issue with two instructions
    and keep in-order instruction issue for Tomasulo?
  • Issue at 2X clock rate, so that issue remains in
    order
  • Only FP loads might cause dependency between
    integer and FP issue
  • Replace load reservation station with a load
    queue; operands must be read in the order they
    are fetched
  • Load checks addresses in Store Queue to avoid RAW
    violation
  • Store checks addresses in Load Queue to avoid
    WAR, WAW

113
Performance of Dynamic Superscalar
Iteration  Instruction        Issues  Executes  Writes result
no.                           (clock-cycle number)
1          LD   F0,0(R1)      1       2         4
1          ADDD F4,F0,F2      1       5         8
1          SD   F4,0(R1)      2       9
1          SUBI R1,R1,8       3       4         5
1          BNEZ R1,LOOP       4       5
2          LD   F0,0(R1)      5       6         8
2          ADDD F4,F0,F2      5       9         12
2          SD   F4,0(R1)      6       13
2          SUBI R1,R1,8       7       8         9
2          BNEZ R1,LOOP       8       9

  • 4 clocks per iteration
  • Branches, decrements still take 1 clock cycle

114
Loop Unrolling in VLIW
Memory           Memory           FP                FP                Int. op/         Clock
reference 1      reference 2      operation 1       op. 2             branch
LD F0,0(R1)      LD F6,-8(R1)                                                          1
LD F10,-16(R1)   LD F14,-24(R1)                                                        2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2                      3
LD F26,-48(R1)                    ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                  ADDD F20,F18,F2   ADDD F24,F22,F2                    5
SD F4,0(R1)      SD F8,-8(R1)     ADDD F28,F26,F2                                      6
SD F12,-16(R1)   SD F16,-24(R1)                                                        7
SD F20,-32(R1)   SD F24,-40(R1)                                       SUBI R1,R1,48    8
SD F28,-0(R1)                                                         BNE R1,R2,LOOP   9

  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration
  • Need more registers to effectively use VLIW
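
The clocks-per-iteration figures quoted for the three schedules of this loop (scalar, 2-issue superscalar, VLIW) follow directly from total clocks divided by the unroll factor. A minimal sketch of that arithmetic, with dictionary keys chosen here only for illustration:

```python
# (total clock cycles, original iterations covered) for each unrolled schedule
schedules = {
    "scalar, unrolled 4x": (14, 4),
    "2-issue superscalar, unrolled 5x": (12, 5),
    "VLIW, unrolled 7x": (9, 7),
}


def clocks_per_iteration(clocks, iterations):
    return clocks / iterations


per_iteration = {
    name: clocks_per_iteration(clocks, iterations)
    for name, (clocks, iterations) in schedules.items()
}

for name, value in per_iteration.items():
    print(f"{name}: {value:.1f} clocks per iteration")
```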

115
Limits to Multi-Issue Machines
  • Inherent limitations of ILP
  • 1 branch in 5 instructions => how to keep a 5-way
    VLIW busy?
  • Latencies of units => many operations must be
    scheduled
  • Need about Pipeline Depth x No. Functional Units
    of independent operations to keep machines busy.
  • Difficulties in building HW
  • Duplicate Functional Units to get parallel
    execution
  • Increase ports to Register File (VLIW example
    needs 6 read and 3 write for Int. Reg. and 6 read
    and 4 write for FP Reg.)
  • Increase ports to memory
  • Decoding SS and impact on clock rate, pipeline
    depth

SS = Superscalar
116
Limits to Multi-Issue Machines
  • Limitations specific to either SS or VLIW
    implementation
  • Decode issue in SS
  • VLIW code size: unrolled loops + wasted fields in
    VLIW
  • VLIW lock step => 1 hazard and all instructions
    stall
  • VLIW binary compatibility

117
Multiple Issue Challenges
  • While Integer/FP split is simple for the HW, get
    CPI of 0.5 only for programs with
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at same time, greater
    difficulty of decode and issue
  • Even 2-scalar => examine 2 opcodes, 6 register
    specifiers, decide if 1 or 2 instructions can
    issue
  • VLIW: trade off instruction space for simple
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word are independent
    => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 Memory
    refs, 1 branch
  • 16 to 24 bits per field => 7 x 16 or 112 bits to
    7 x 24 or 168 bits wide
  • Need compiling technique that schedules across
    several branches

118
Compiler Support For ILP
  • How can compilers be smart?
  • Produce good scheduling of code.
  • Determine which loops might contain parallelism.
  • Eliminate name dependencies.
  • Compilers must be REALLY smart
  • Figure out aliases
  • Pointers in C are a real problem
  • Techniques lead to
  • Symbolic Loop Unrolling
  • Critical Path Scheduling

119
Symbolic Loop Unrolling
  • Observation
  • if iterations from loops are independent, then
    can get ILP by taking instructions from different
    iterations
  • Software pipelining
  • reorganizes loops so that each iteration is made
    from instructions chosen from different
    iterations of the original loop (Tomasulo in SW)

120
Software Pipelining
  • Software pipelining (symbolic loop unrolling)
  • Really is pipelining in software
  • One physical iteration
  • Contains instructions from multiple original
    iterations
  • Each instruction in different stage
  • Need prologue and epilogue to start and flush the
    pipeline

121
Symbolic Loop Unrolling SW Pipelining Example
  • Before: Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD F4,0(R1)
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD F8,-8(R1)
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD F12,-16(R1)
  • 10 SUBI R1,R1,24
  • 11 BNE R1, R2, LOOP

After: Software Pipelined
       LD   F0,0(R1)
       ADDD F4,F0,F2
       LD   F0,-8(R1)
  1    SD   F4,0(R1)      ; stores M[i]
  2    ADDD F4,F0,F2      ; adds to M[i-1]
  3    LD   F0,-16(R1)    ; loads M[i-2]
  4    SUBI R1,R1,8
  5    BNE  R1,R2,LOOP
       SD   F4,0(R1)
       ADDD F4,F0,F2
       SD   F4,-8(R1)
Note: Within a physical iteration, instructions are
unrelated. Perfect for VLIW!!

[Figure: overlapped pipeline stages (IF ID EX Mem WB) for SD, ADDD, LD, showing SD reading F4 and ADDD reading F0 before ADDD writes F4 and LD writes F0]
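
The prologue / steady-state / epilogue structure can be checked in a high-level sketch. The Python below mirrors the register roles (F0, F4) from the example, but walks the array upward instead of decrementing R1. The function names and the list-based "memory" are illustrative assumptions.

```python
def original_loop(m, s):
    """Each iteration: load M[i], add the scalar, store M[i]."""
    out = list(m)
    for i in range(len(out)):
        f0 = out[i]      # LD
        f4 = f0 + s      # ADDD
        out[i] = f4      # SD
    return out


def software_pipelined(m, s):
    """Same result; each loop body mixes three original iterations."""
    out = list(m)
    n = len(out)
    if n < 2:
        return original_loop(m, s)  # too short to pipeline

    # Prologue: start iterations 0 and 1.
    f0 = out[0]          # LD   for iteration 0
    f4 = f0 + s          # ADDD for iteration 0
    f0 = out[1]          # LD   for iteration 1

    # Steady state: store for i, add for i+1, load for i+2.
    for i in range(n - 2):
        out[i] = f4          # SD   (iteration i)
        f4 = f0 + s          # ADDD (iteration i+1)
        f0 = out[i + 2]      # LD   (iteration i+2)

    # Epilogue: finish the last two iterations.
    out[n - 2] = f4
    f4 = f0 + s
    out[n - 1] = f4
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
pipelined_result = software_pipelined(data, 10.0)
```

Within the steady-state body, the store reads F4 before the add overwrites it, and the add reads F0 before the load overwrites it. That ordering is why only two registers are needed.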
122
Symbolic Loop Unrolling
  • Less code space
  • Overhead paid only once, vs. on each iteration
    in loop unrolling

[Figure: overlap over time for software pipelining vs. loop unrolling; 100 iterations = 25 loops with 4 unrolled iterations each]
123
Software Pipelining
  • Doesn't increase code size (much)
  • Good scheduling at iteration seams
  • Can vary degree of pipelining to tolerate longer
    latencies
  • software superpipelining
  • One physical iteration: instructions from logical
    iterations i, i+2, i+4
  • -ves
  • Hard to do conditionals within loops
  • Tricky register allocation sometimes
  • Not everything is loops

124
Trace Scheduling
  • Trace scheduling
  • For general non-loop situations
  • Basic idea
  • Find common paths in program
  • Realign basic blocks to form straight-line trace
  • Basic block: single-entry, single-exit
    instruction sequence
  • Trace (aka superblock, hyperblock) fused basic