Title: Instruction Level Parallelism
1. Instruction Level Parallelism
2. Instruction Level Parallelism
- Concepts and Challenges
- Dynamic Scheduling
- Dynamic Hardware Prediction
- Multiple Issue
- Compiler Support
- Hardware Support
- Studies of ILP
3. Summary of Pipelining Basics
- Hazards limit performance by preventing instructions from executing during their designated clock cycles
- Structural hazards need more HW resources
- Data hazards need forwarding and compiler scheduling
- Control hazards need early evaluation of the PC, delayed branches, and prediction
- Increasing the length of the pipe increases the impact of hazards
- Pipelining helps instruction bandwidth, not latency
- Interrupts, the instruction set, and FP make pipelining harder
- Compilers reduce the cost of data and control hazards
- Stalls increase CPI and decrease performance
4. What Is ILP?
- Principle: many instructions in the code do not depend on each other
- Result: it is possible to execute them in parallel
- ILP: potential overlap among instructions (so they can be evaluated in parallel)
- Issues
  - Building compilers to analyze the code
  - Building special/smarter hardware to handle the code
- Goal: increase the amount of parallelism exploited among instructions
- Seeks good results out of pipelining
5. Scheduling
- Scheduling: re-arranging instructions to maximize performance
- Requires knowledge about the structure of the processor
- Static scheduling: done by the compiler
  - Provides good analogies for hardware scheduling
  - Used in the embedded market, and in the IA-64 architecture and Intel's Itanium
  - We have already seen an example of this: scheduling to eliminate MEM/ALU bubbles
  - Another example: for (i = 1000; i > 0; i--) x[i] = x[i] + s;
- Dynamic scheduling: done by hardware
  - Dominates the server and desktop markets (Pentium III and IV, MIPS R10000/12000, UltraSPARC III, PowerPC 603, etc.)
6. Pipeline Scheduling: Previous Lecture Example
The compiler schedules (moves) instructions to reduce stalls.
Example code sequence: a = b + c; d = e - f
Before scheduling:
  lw  Rb, b
  lw  Rc, c
  add Ra, Rb, Rc  // stall
  sw  a, Ra
  lw  Re, e
  lw  Rf, f
  sub Rd, Re, Rf  // stall
  sw  d, Rd
After scheduling:
  lw  Rb, b
  lw  Rc, c
  lw  Re, e
  add Ra, Rb, Rc
  lw  Rf, f
  sw  a, Ra
  sub Rd, Re, Rf
  sw  d, Rd
7. Basic Pipeline Scheduling
- To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency
- The compiler's ability depends on
  - Amount of ILP available in the program
  - Latencies of the functional units in the pipeline
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
8. Pipeline Scheduling: Loop Unrolling
- Basic block
  - Set of instructions between entry points and between branches; a basic block has only one entry and one exit
  - Typically 4 to 7 instructions
  - Amount of overlap << 4 to 7 instructions
  - To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Loop-level parallelism
  - Parallelism that exists within a loop: limited opportunity
  - Parallelism can cross loop iterations!
- Techniques to convert loop-level parallelism to instruction-level parallelism
  - Loop unrolling: the compiler's or the hardware's ability to exploit the parallelism inherent in the loop
  - Vector instructions: operate on a sequence of data items
9. Assumptions
- Five-stage integer pipeline
- Branches have a delay of one clock cycle
- ID stage: comparisons done, decisions made, and PC loaded
- No structural hazards
- Functional units are fully pipelined or replicated (as many times as the pipeline depth)
- FP latencies as given in the table; integer load latency = 1, integer ALU operation latency = 0
10. Simple Loop: Assembler Equivalent
for (i = 1000; i > 0; i--) x[i] = x[i] + s;

Loop: LD   F0, 0(R1)    ; F0 = array element
      ADDD F4, F0, F2   ; add scalar in F2
      SD   F4, 0(R1)    ; store result
      SUBI R1, R1, 8    ; decrement pointer by 8 bytes (DW)
      BNE  R1, R2, Loop ; branch if R1 != R2

- x[i] and s are double/floating-point values
- R1 initially holds the address of the array element with the highest address
- F2 contains the scalar value s
- R2 is pre-computed so that 8(R2) is the last element to operate on
11. Where Are the Stalls?
Unscheduled:
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        SUBI R1, R1, 8
        stall
        BNE  R1, R2, Loop
        stall
  10 clock cycles. Can we minimize?
Scheduled:
  Loop: LD   F0, 0(R1)
        SUBI R1, R1, 8
        ADDD F4, F0, F2
        stall
        BNE  R1, R2, Loop
        SD   F4, 8(R1)
  6 clock cycles: 3 cycles of actual work + 3 cycles of overhead. Can we minimize further?
(Integer load latency = 1; integer ALU operation latency = 0)
12. Where Are the Stalls?
Note 2: a stall is required because the latency between an FP ALU op and a store double is 2 cycles for this architecture, as specified in the table at the bottom of the slide. The ADDD instruction and the SD instruction must have two cycles of latency between them.
13. Loop Unrolling
Four iterations of the original code (each with its own SUBI and BNE overhead):
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BNE  R1, R2, Loop
  ... (the same five instructions repeated for each of the four iterations)
Unrolled four-iteration code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F0, F2
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        ADDD F8, F6, F2
        SD   F8, -8(R1)
        LD   F10, -16(R1)
        ADDD F12, F10, F2
        SD   F12, -16(R1)
        LD   F14, -24(R1)
        ADDD F16, F14, F2
        SD   F16, -24(R1)
        SUBI R1, R1, 32
        BNE  R1, R2, Loop
Assumption: R1 is initially a multiple of 32, i.e., the number of loop iterations is a multiple of 4.
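The unrolling transformation above can be sketched at the source level. This is a minimal Python sketch (the function names are my own, not from the slides): the unrolled version does the same work but pays the loop-maintenance cost (one index update, one loop test) once per four elements instead of once per element.

```python
def scale_add(x, s):
    # Original loop: for (i = 1000; i > 0; i--) x[i] = x[i] + s
    y = list(x)
    for i in range(len(y)):
        y[i] = y[i] + s
    return y

def scale_add_unrolled4(x, s):
    # Body unrolled four times; assumes len(x) is a multiple of 4,
    # mirroring the slide's assumption about R1.
    y = list(x)
    i = 0
    while i < len(y):
        y[i] = y[i] + s
        y[i + 1] = y[i + 1] + s
        y[i + 2] = y[i + 2] + s
        y[i + 3] = y[i + 3] + s
        i += 4  # one index update and one loop test per four elements
    return y
```

Both functions produce identical results; only the amount of loop overhead differs.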
14. Loop Unroll: Schedule
Unrolled, unscheduled (stalls shown):
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        stall
        ADDD F8, F6, F2
        stall
        stall
        SD   F8, -8(R1)
        ... (the same pattern repeats for the F10 and F14 groups)
  28 clock cycles, or 7 per iteration. Can we minimize further?
Unrolled and scheduled:
  Loop: LD   F0, 0(R1)
        LD   F6, -8(R1)
        LD   F10, -16(R1)
        LD   F14, -24(R1)
        ADDD F4, F0, F2
        ADDD F8, F6, F2
        ADDD F12, F10, F2
        ADDD F16, F14, F2
        SD   F4, 0(R1)
        SD   F8, -8(R1)
        SUBI R1, R1, 32   ; see Note 3
        SD   F12, 16(R1)
        BNE  R1, R2, Loop
        SD   F16, 8(R1)
  No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?
Note 3: to meet the one-cycle latency requirement between the SUBI and BNE instructions (see the note on slide 12), one SD instruction is moved in between them; the last two SD offsets become 16(R1) and 8(R1) because SUBI has already decremented R1 by 32.
15. Summary
- One iteration, unscheduled: 10 cycles
- One iteration, scheduled: 6 cycles
- Unrolled 4x, unscheduled: 7 cycles per iteration
- Unrolled 4x, scheduled: 3.5 cycles per iteration (no stalls)
16. Limits to the Gains of Loop Unrolling
- Decreasing benefit
  - Each additional unroll amortizes less overhead
  - In the example just considered, the loop was unrolled 4 times with no stall cycles; of the 14 cycles, 2 were loop overhead
  - If unrolled 8 times, the overhead is reduced only from 1/2 cycle per iteration to 1/4
- Code size limitations
  - Memory is at a premium
  - Larger code size changes the cache hit rate
- Shortfall in registers (register pressure)
  - Increasing ILP increases the number of live values; it may not be possible to allocate all the live values to registers
- Compiler limitations: significant increase in complexity
17. What If the Upper Bound of the Loop Is Unknown?
- Suppose
  - The upper bound of the loop is n
  - We unroll the loop to make k copies of the body
- Solution: generate a pair of consecutive loops
  - First loop: body same as the original loop; executes (n mod k) times
  - Second loop: unrolled body (k copies of the original); iterates (n/k) times
- For large values of n, most of the execution time is spent in the unrolled loop body
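The two-loop scheme above can be sketched directly. This is a hedged Python sketch (function and parameter names are my own): a cleanup loop runs the original body (n mod k) times, then the unrolled loop runs n // k passes of k copies each, covering all n iterations in order.

```python
def run_unrolled(n, k, body):
    """Execute body(i) for i = 0 .. n-1 using the two-loop scheme:
    a cleanup loop of (n mod k) original iterations, then an
    unrolled loop of n // k passes with k copies of the body."""
    i = 0
    for _ in range(n % k):        # first loop: original body, n mod k times
        body(i)
        i += 1
    for _ in range(n // k):       # second loop: unrolled body
        for _ in range(k):        # k copies of the body per pass
            body(i)
            i += 1
```

For large n, nearly all calls happen inside the second (unrolled) loop, which is the point of the transformation.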
18. Summary: Tricks of High-Performance Processors
- Out-of-order scheduling: to tolerate RAW hazard latency
  - Determine that the loads and stores can be exchanged, since loads and stores from different iterations are independent
  - This requires analyzing the memory addresses and finding that they do not refer to the same address
  - Find that it is OK to move the SD after the SUBI and BNE, and adjust the SD offset
- Loop unrolling: increase the scheduling scope for more latency tolerance
  - Find that loop unrolling is useful by finding that loop iterations are independent, except for the loop maintenance code
  - Eliminate the extra tests and branches and adjust the loop maintenance code
- Register renaming: remove WAR/WAW violations due to scheduling
  - Use different registers to avoid the unnecessary constraints that would be forced by using the same registers for different computations
- Summary: schedule the code while preserving any dependences that are needed
19. Compiler Perspective
- The compiler is concerned about dependencies in the program; whether a dependence causes a hazard is a property of the pipeline organization
- It tries to schedule code to avoid hazards
- It looks for data dependencies:
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences)
- If dependent, instructions can't execute in parallel (or be completely overlapped)
- Easy to determine for registers (fixed names)
- Hard for memory:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
20. Data Dependence
- Data dependence
  - Indicates the possibility of a hazard
  - Determines the order in which results must be calculated
  - Sets an upper bound on how much parallelism can be exploited
- But the actual hazard and the length of any stall are determined by the pipeline
- Dependence avoidance
  - Maintain the dependence but avoid the hazard: scheduling
  - Eliminate the dependence by transforming the code
21. Data Dependencies
  1 Loop: LD   F0, 0(R1)
  2       ADDD F4, F0, F2
  3       SUBI R1, R1, 8
  4       BNE  R1, R2, Loop ; delayed branch
  5       SD   F4, 8(R1)    ; offset altered when moved past SUBI
22. Name Dependencies
- Two instructions use the same name (register or memory location) but don't exchange data
- Anti-dependence (a WAR hazard for the hardware)
  - Instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
- Output dependence (a WAW hazard for the hardware)
  - Instructions i and j write the same register or memory location; the ordering between the instructions must be preserved
- How can we remove name dependencies?
  - They are not true dependencies, so renaming can remove them
23. Register Renaming
Unrolled loop reusing F0 and F4:
  1 Loop: LD   F0, 0(R1)
  2       ADDD F4, F0, F2
  3       SD   F4, 0(R1)
  4       LD   F0, -8(R1)
  5       ADDD F4, F0, F2
  6       SD   F4, -8(R1)
  7       LD   F0, -16(R1)
  8       ADDD F4, F0, F2
  9       SD   F4, -16(R1)
  10      LD   F0, -24(R1)
  11      ADDD F4, F0, F2
  12      SD   F4, -24(R1)
  13      SUBI R1, R1, 32
  14      BNE  R1, R2, LOOP
After register renaming:
  1 Loop: LD   F0, 0(R1)
  2       ADDD F4, F0, F2
  3       SD   F4, 0(R1)
  4       LD   F6, -8(R1)
  5       ADDD F8, F6, F2
  6       SD   F8, -8(R1)
  7       LD   F10, -16(R1)
  8       ADDD F12, F10, F2
  9       SD   F12, -16(R1)
  10      LD   F14, -24(R1)
  11      ADDD F16, F14, F2
  12      SD   F16, -24(R1)
  13      SUBI R1, R1, 32
  14      BNE  R1, R2, LOOP
No data is passed through F0, but without renaming we can't reuse F0 in instruction 4.
- Name dependencies are hard for memory accesses
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change, then 0(R1) != -8(R1) != -16(R1) != -24(R1); there were no dependencies between some loads and stores, so they could be moved around
24. Control Dependencies
- Example
  - if (p1) S1;
  - if (p2) S2;
  - S1 is control dependent on p1; S2 is control dependent on p2 but not on p1
- Two constraints
  - An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch
- Control dependencies are relaxed to get parallelism
  - We get the same effect if we preserve the order of exceptions (e.g., an address in a register is checked by a branch before use) and the data flow (e.g., a value in a register depends on the branch); this is done via speculation, delayed branching, etc.
25. Control Dependencies
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BE   R1, R2, Exit
  LD   F0, 0(R1)   ; if executed before the branch, may create an exception
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BE   R1, R2, Exit
  ... (the same block repeats twice more; the final copy ends at SUBI)
26. When Is It Safe to Unroll a Loop?
- Example 1: where are the data dependences? (A, B, C are distinct and non-overlapping arrays)

  for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
  }

- S2 uses the value A[i+1] computed by S1 in the same iteration
- S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1; the same is true of S2 for B[i] and B[i+1]
- The second one is a loop-carried dependence between iterations
- The iterations are dependent and can't be executed in parallel
- Note this was not the case in our prior example: each iteration was distinct
- (A loop-carried dependence does not always prevent parallelism, as the next example shows)
27. When Is It Safe to Unroll a Loop?
- Example 2: where are the data dependences? (A, B, C, D are distinct and non-overlapping arrays)

  for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + B[i];  /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
  }

- There is no dependence from S1 to S2. If there were, there would be a cycle in the dependences and the loop would not be parallelizable. Since this dependence is absent, interchanging the two statements will not affect the execution of S2.
- On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to entering the loop.
- New code, with the loop-carried dependence from S2 to S1 removed:

  A[2] = A[1] + B[1];
  for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+2] = A[i+1] + B[i+1];  /* check it out on a computer / use your logic */
  }
  B[101] = C[100] + D[100];
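The transformation above can be checked mechanically, as the slide suggests. This is a Python sketch (function names are my own) that runs both versions on the same input arrays and compares the resulting A and B; the peeled statements cover S1 of iteration 1 and S2 of iteration 100, and the interchanged loop body covers the rest.

```python
def original_loop(A, B, C, D):
    # for (i = 1; i <= 100; i++) { A[i+1] = A[i] + B[i]; B[i+1] = C[i] + D[i]; }
    A, B = list(A), list(B)
    for i in range(1, 101):
        A[i + 1] = A[i] + B[i]      # S1
        B[i + 1] = C[i] + D[i]      # S2
    return A, B

def transformed_loop(A, B, C, D):
    A, B = list(A), list(B)
    A[2] = A[1] + B[1]              # peeled S1 of iteration 1
    for i in range(1, 100):         # i = 1 .. 99
        B[i + 1] = C[i] + D[i]      # S2 of iteration i
        A[i + 2] = A[i + 1] + B[i + 1]  # S1 of iteration i + 1
    B[101] = C[100] + D[100]        # peeled S2 of iteration 100
    return A, B
```

Running both on the same data yields identical arrays, confirming the statement interchange plus peeling preserves the loop's results.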
28. These Tricks Can Be Done in Hardware
- Why build complicated hardware if we can do this in software?
- Performance portability
  - Software scheduling assumes a particular pipeline structure
  - We don't want to recompile for new machines
- More information is available to the hardware
  - Data addresses, branch directions, and cache misses are statically unknown (but the compiler can look at more instructions)
- More resources are available to the hardware
  - There may not be enough architectural registers to resolve WAR/WAW hazards in software
- Speculative execution is easier in hardware
  - Easier to recover from mis-speculation
- Solution: use a combination of both
29. Dynamic Scheduling
- Dynamic scheduling: hardware rearranges the order of instruction execution to reduce stalls
- Disadvantage: the hardware is much more complex
- Key ideas
  - Execute instructions in parallel (use all available execution units)
  - Allow instructions behind a stall to proceed
- Example
  - DIVD F0, F2, F4
  - ADDD F10, F0, F8
  - SUBD F12, F8, F14
  - SUBD depends on neither DIVD nor ADDD, so it need not wait for them
- Out-of-order execution => out-of-order completion
30. Overview
- In-order pipeline
  - 5 interlocked stages: IF, ID, EX, MEM, WB
  - Structural hazard: a maximum of 1 instruction per stage
  - Unless a stage is replicated (FP + integer EX) or idle (WB for stores)
- Out-of-order pipeline
  - How does one instruction pass another without "killing" it?
  - Remember: only one instruction per stage per cycle
  - We must buffer instructions
(Pipeline diagram: IF -> ID -> EX -> MEM -> WB)
31. Instruction Buffer
- Trick: an instruction buffer (this buffer goes by many names)
- Accumulate decoded instructions in the buffer
- The buffer sends instructions down the rest of the pipe out of order
(Pipeline diagram: IF -> ID1 -> instruction buffer -> ID2 -> EX -> MEM -> WB)
32. Scoreboard
State/steps: IF -> ID -> IS (issue) -> RO (read operands) -> EX -> WB, with decoded instructions held in an instruction buffer
- There is some confusion in the community about which stage is which
(Structure diagram: the scoreboard's control/status lines connect the registers and multiple EX units over a data bus)
33. Dynamic Scheduling: Scoreboard
- Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
- The scoreboard allows an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions
- A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together
- Centralized control scheme
  - No bypassing
  - No elimination of WAR/WAW hazards
- We will use in-order issue, out-of-order execution, out-of-order commit (also called completion)
- First used in the CDC 6600; our example is modified here for DLX
  - The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units
  - DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit
34. Scoreboard Implications
- Out-of-order completion => WAR and WAW hazards
- Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- Solution for WAW and structural hazards
  - Must detect the hazard and stall until it is cleared
- Need to have multiple instructions in the execution phase
  - Multiple execution units, or pipelined execution units
- The scoreboard keeps track of dependencies and the state of operations
- The scoreboard replaces ID, EX, WB with 4 stages
35. Stages of Scoreboard Control
- Issue: decode instructions and check for structural hazards (ID1)
  - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure
  - If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions issue until these hazards are cleared
36. Stages of Scoreboard Control
- Read operands: wait until no data hazards, then read operands from the registers (ID2)
  - A source operand is available if no earlier issued active instruction is going to write it; it is unavailable while an active functional unit is still going to write that register
  - When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution
  - The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order
37. Stages of Scoreboard Control
- Execution: operate on operands (EX)
  - The functional unit begins execution upon receiving operands; when the result is ready, it notifies the scoreboard that it has completed execution
- Write result: finish execution (WB)
  - Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards; if there are none, it writes the result; if there is a WAR hazard, it stalls the instruction
- Example
  - DIVD F0, F2, F4
  - ADDD F10, F0, F8
  - SUBD F8, F8, F14
  - The scoreboard would stall SUBD's write until ADDD reads its operands
38. Scoreboard Data Structures
- Instruction status
  - Which of the 4 steps the instruction is in
- Functional unit status
  - Busy: whether the unit is busy or not
  - Op: the operation to perform in the unit (e.g., + or -)
  - Fi: destination register
  - Fj, Fk: source register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: ready bits for Fj, Fk
- Register result status
  - Indicates which functional unit (if any) will write each register
  - Blank when no pending instruction will write that register
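The three tables above map naturally onto simple records. This is a hedged Python sketch (class and field names follow the slide, but the `can_issue` helper and unit names are my own) showing the functional-unit status row, the register result status, and the issue-stage check against structural and WAW hazards.

```python
from dataclasses import dataclass, field

@dataclass
class FUStatus:
    # One row of the scoreboard's functional-unit status table
    busy: bool = False
    op: str = ""       # operation to perform (e.g., "ADDD")
    Fi: str = ""       # destination register
    Fj: str = ""       # source registers
    Fk: str = ""
    Qj: str = ""       # FUs producing Fj / Fk ("" means none pending)
    Qk: str = ""
    Rj: bool = False   # ready bits for Fj / Fk
    Rk: bool = False

@dataclass
class Scoreboard:
    units: dict = field(default_factory=dict)            # unit name -> FUStatus
    register_result: dict = field(default_factory=dict)  # register -> unit name

    def can_issue(self, unit, dest):
        # Issue stage: the functional unit must be free (structural
        # hazard) and no active instruction may already be going to
        # write dest (WAW hazard)
        return (not self.units[unit].busy
                and dest not in self.register_result)
```

A busy unit or a pending write to the destination register blocks issue, exactly the two conditions from slide 35.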
39. Detailed Scoreboard Pipeline Control
- Issue
  - Wait until: the functional unit is not busy and no unit will write the destination (not Busy(FU) and not Result(D))
  - Bookkeeping: Busy(FU) := yes; Op(FU) := op; Fi(FU) := D; Fj(FU) := S1; Fk(FU) := S2; Qj := Result(S1); Qk := Result(S2); Rj := not Qj; Rk := not Qk; Result(D) := FU
- Read operands
  - Wait until: Rj and Rk
  - Bookkeeping: Rj := No; Rk := No
- Execution complete
  - Wait until: the functional unit is done
- Write result
  - Wait until: for all f, (Fj(f) != Fi(FU) or Rj(f) = No) and (Fk(f) != Fi(FU) or Rk(f) = No)
  - Bookkeeping: for all f, if Qj(f) = FU then Rj(f) := Yes; for all f, if Qk(f) = FU then Rk(f) := Yes; Result(Fi(FU)) := 0; Busy(FU) := No
40. Scoreboard Example
  LD   F6, 34(R2)
  LD   F2, 45(R3)
  MULT F0, F2, F4
  SUBD F8, F6, F2
  DIVD F10, F0, F6
  ADDD F6, F8, F2
What are the hazards in this code?
Latencies (clock cycles): LD 1; MULT 10; DIVD 40; ADDD, SUBD 2
41. Scoreboard Example
42. Scoreboard Example: Cycle 1
  Issue LD 1. (The table shows in which cycle each operation occurred.)
43. Scoreboard Example: Cycle 2
  LD 2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.
44. Scoreboard Example: Cycle 3
45. Scoreboard Example: Cycle 4
46. Scoreboard Example: Cycle 5
  Issue LD 2 since the integer unit is now free.
47. Scoreboard Example: Cycle 6
  Issue MULT.
48. Scoreboard Example: Cycle 7
  MULT can't read its operands (F2) because LD 2 hasn't finished.
49. Scoreboard Example: Cycle 8a
  DIVD issues. MULT and SUBD are both waiting for F2.
50. Scoreboard Example: Cycle 8b
  LD 2 writes F2.
51. Scoreboard Example: Cycle 9
  Now MULT and SUBD can both read F2. How can both instructions do this at the same time?
52. Scoreboard Example: Cycle 11
  ADDD can't issue because the add unit is busy.
53. Scoreboard Example: Cycle 12
  SUBD finishes. DIVD is waiting for F0.
54. Scoreboard Example: Cycle 13
  ADDD issues.
55. Scoreboard Example: Cycle 14
56. Scoreboard Example: Cycle 15
57. Scoreboard Example: Cycle 16
58. Scoreboard Example: Cycle 17
  ADDD can't write because of the WAR hazard with DIVD, which has not yet read F6!
59. Scoreboard Example: Cycle 18
  Nothing happens!
60. Scoreboard Example: Cycle 19
  MULT completes execution.
61. Scoreboard Example: Cycle 20
  MULT writes.
62. Scoreboard Example: Cycle 21
  DIVD loads its operands.
63. Scoreboard Example: Cycle 22
  Now ADDD can write since the WAR hazard is removed.
64. Scoreboard Example: Cycle 61
  DIVD completes execution.
65. Scoreboard Example: Cycle 62
  Done!
66. Scoreboard
- Operands for an instruction are read only when both operands are available in the register file
- The scoreboard does not take advantage of forwarding
- Instructions write to the register file as soon as they complete execution (assuming no WAR hazards) and do not wait for a write slot
  - This reduced pipeline latency recovers some of the benefit of forwarding
- There is still one additional cycle of latency, since the write-result and read-operand stages cannot overlap
- Bus structure
  - The limited number of buses to the register file represents a structural hazard
67. Scoreboard
- Limitations
  - No forwarding (RAW dependences are handled through the registers)
  - In-order issue for WAW/structural hazards limits scheduling flexibility
  - WAR stalls limit dynamic loop unrolling (no register renaming)
- Performance
  - 1.7x for FORTRAN programs
  - 2.5x for hand-coded assembly
- Hardware
  - The scoreboard itself is cheap
  - The buses are not
68. DS Method 2: Tomasulo's Algorithm
- Developed for the IBM 360/91, 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between the IBM 360 and CDC 6600 ISAs
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  - IBM has 4 FP registers vs. 8 in the CDC 6600
  - IBM has long memory access delays and long FP delays
- Why study it? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, ...
69. Tomasulo's Algorithm
- Avoid RAW hazards
  - Execute an instruction only when its operands are available
  - Has a scheme to track when operands are available
- Avoid WAR and WAW hazards
  - Supports register renaming (even across branches)
  - Renames all destination registers; an out-of-order write does not affect any instruction that depends on an earlier value of an operand
- Example (before -> after renaming; S and T are temporary registers):
    DIVD F0, F2, F4      DIVD F0, F2, F4
    ADDD F6, F0, F8      ADDD S, F0, F8
    SD   F6, 0(R1)       SD   S, 0(R1)
    SUBD F8, F10, F14    SUBD T, F10, F14
    MULD F6, F10, F8     MULD F6, F10, T
  (WAR: SUBD writes F8, which ADDD reads; WAW: ADDD and MULD both write F6)
- Supports the overlapped execution of multiple iterations of a loop
70. Tomasulo's Algorithm vs. Scoreboard
- Control and buffers are distributed with the functional units (FUs), vs. centralized in the scoreboard, and there is bypassing
- The FU buffers are called reservation stations; they hold pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming
  - It avoids WAR and WAW hazards
  - There are more reservation stations than registers, so the hardware can do optimizations compilers can't
- Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with reservation stations as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
71. MIPS FP Unit Using Tomasulo's Algorithm
(Diagram: the FP op queue and FP registers feed load buffers, store buffers, and the FP add and FP multiply reservation stations, all connected by the Common Data Bus)
72. Three Stages of Tomasulo's Algorithm
- Issue: get an instruction from the FP op queue
  - If a reservation station is free (no structural hazard), issue the instruction with its operand values (if they are in the registers)
  - If the reservation stations are busy, the instruction stalls
  - If the operands are not in the registers, rename the registers (eliminating WAR and WAW hazards) and keep track of the functional units that will produce the operands
- Execution: operate on operands (EX)
  - If both operands are ready, execute
  - If not ready, watch the Common Data Bus for the result (avoiding RAW hazards)
  - Preserve exception behavior: no instruction executes unless all preceding branches have completed
- Write result: finish execution (WB)
  - Write on the Common Data Bus to all units and mark the reservation station available
  - A normal data bus carries data + destination (a "go to" bus); the Common Data Bus carries data + source (a "come from" bus) and broadcasts
- Each stage can take a different number of clock cycles
73. Reservation Station Components
- Op: the operation to perform in the unit (e.g., + or -)
- Vj, Vk: the values of the source operands
  - Store buffers have a V field holding the result to be stored
- Qj, Qk: the reservation stations producing the source operands (Qj, Qk = 0 => ready)
- Busy: indicates that the reservation station or FU is busy
- Qi (register result status): indicates which functional unit (if any) will write the register
  - 0 when no pending instruction will write this register
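The station fields above can be sketched as a record. This is a hedged Python sketch (field names follow the slide; using the empty string for "Qj, Qk = 0" and the `ready_to_execute` helper are my own choices) showing the condition under which a station may begin execution.

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    busy: bool = False
    op: str = ""
    Vj: float = 0.0  # operand values (copied in at issue, removing WAR hazards)
    Vk: float = 0.0
    Qj: str = ""     # producing stations; "" plays the role of Qj = 0 (ready)
    Qk: str = ""

    def ready_to_execute(self):
        # Qj = Qk = 0 on the slide: both operands are actual values
        return self.busy and self.Qj == "" and self.Qk == ""
```

A station whose Qj or Qk names another station waits, watching the Common Data Bus for that tag; once both fields are cleared to values, execution may begin.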
74. Tomasulo's Data Structures
75. Tomasulo's Example: Cycle 0
76. Register Renaming
- Register renaming: change register names to eliminate WAR/WAW hazards
  - Hardware renaming: one of the most elegant ideas in architecture
- Key: think of architectural registers as names, not locations
  - There can be more locations than names
  - Dynamically map names to locations
  - Map table: a hardware structure that holds the current mappings
  - Writes allocate a new location and note it in the map table
  - Reads find the location of the most recent write by looking in the map table
  - Locations must be de-allocated appropriately (a slight detail)
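The map-table idea above can be sketched in a few lines. This is a toy Python sketch (class and method names are my own; de-allocation is omitted, the "slight detail" the slide mentions): writes allocate fresh locations, reads see the most recent mapping, so two writes to the same name never collide.

```python
class MapTable:
    """Toy map table: architectural names map to physical locations,
    and every write allocates a fresh location."""
    def __init__(self):
        self.mapping = {}   # architectural name -> current location
        self.next_loc = 0   # next free physical location

    def rename_write(self, reg):
        # A write allocates a new location and records it in the table
        loc = self.next_loc
        self.next_loc += 1
        self.mapping[reg] = loc
        return loc

    def rename_read(self, reg):
        # A read finds the location of the most recent write
        return self.mapping[reg]
```

Because the second write to a name gets its own location, the WAW (and, with readers holding their old locations, the WAR) ordering constraints disappear.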
77. Tomasulo: Register Renaming
- Locations: the register file and the reservation stations (RS)
  - Values can (and do) exist in both!
- Value copies are used to eliminate WAR hazards
  - Called value-based or copy-based renaming, not pointer-based renaming
- Locations are referred to internally by tags (4-bit specifiers)
  - The map table translates names to tags
  - After translation, names are discarded
  - The CDB broadcasts values with tags attached, so an RS knows what it is looking at
(CDB = Common Data Bus)
78. Tomasulo: Register Renaming
- An operation that creates a value maps its destination register
  - On dispatch, the register is renamed to the tag of the allocated RS
  - Register table entry := RS number
  - On completion, the register is written and the register table entry := 0
- A subsequent operation looks up its sources in the register table
  - Entry = 0 -> the register has already been written
    - Copy the register value to the RS
    - This eliminates WAR hazards (the RS holds a private, valid copy of the register)
  - Entry != 0 -> the register value is not ready; some RS will provide it
    - Copy the entry (an RS tag) to the RS and monitor the CDB for that tag
(CDB = Common Data Bus)
79. Tomasulo's Algorithm: A Loop-Based Example
- If we predict that branches are taken
  - Reservation stations allow multiple executions of the loop to proceed at once
  - Advantage gained without changing the code: the loop is unrolled dynamically, and renaming at the reservation stations acts as additional registers
- Loads and stores
  - Can be done in any order if they access different addresses
  - If they access the same address: interchanging a load and a store leads to WAR/RAW hazards; interchanging two stores leads to WAW
- Detecting hazards
  - Compute the effective data memory address and check for an address conflict with the memory address of any earlier memory operation; wait on a match
- Stores must keep their relative order with respect to loads and other stores; loads can be reordered freely among themselves
80. Comparison: Tomasulo vs. Scoreboard
- Key difference: Tomasulo uses distributed hazard detection, vs. the scoreboard's centralized control
81. Review: Tomasulo
- Prevents the register file from becoming a bottleneck
- Avoids the WAR and WAW hazards of the scoreboard
- Allows loop unrolling in hardware
- Not limited to basic blocks (when branch prediction is provided)
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants: PowerPC 604 and 620, MIPS R10000, HP PA-8000, Intel Pentium Pro
82. Dynamic Hardware Prediction
- Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not?
- The hardware can look for clues based on the instructions, or it can use past history; we will discuss both of these directions.
83. Dynamic Branch Prediction
- Performance = f(accuracy, cost of misprediction)
- Branch History Table (BHT), or branch prediction buffer
  - The lower bits of the PC address index a table of 1-bit values
  - Each entry says whether or not the branch was taken last time
- Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop
  - At the end of the loop, when it exits instead of looping as before
  - On the first time through the loop on the next pass, when it predicts exit instead of looping
  - A typical loop branch is not taken only on the last iteration, so this is twice the rate at which the branch itself is not taken
- The prediction may also come from another branch with the same low-order address bits
84. Branch Prediction Buffers
- 2-bit scheme: change the prediction only after two consecutive mispredictions
- The four states are the saturating-counter values 00, 01, 10, 11; the high bit gives the prediction
- This does not help the five-stage classic pipeline, which already finds the branch direction and the next PC in the ID stage (assuming no hazard in accessing the register)
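The 2-bit scheme above is a saturating counter. This is a minimal Python sketch (class and method names are my own): states 2 and 3 predict taken, each outcome moves the counter one step, and the prediction flips only after two consecutive mispredictions, which is what saves a loop branch from the 1-bit table's double misprediction.

```python
class TwoBitPredictor:
    # Saturating counter 0..3; states 2 and 3 predict taken.
    def __init__(self, state=3):
        self.state = state

    def predict_taken(self):
        return self.state >= 2

    def update(self, taken):
        # Step toward taken (up) or not taken (down), saturating at 0 and 3
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
```

For a loop branch, the single not-taken outcome at loop exit nudges the counter from 3 to 2, so the predictor still says taken when the loop is next entered: one misprediction per loop execution instead of two.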
85. Branch History Table (BHT) Accuracy
- We mispredict because either
  - The guess was wrong for that branch, or
  - We got the branch history of the wrong branch when indexing the table
- 4096-entry table
  - Misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  - The misprediction rate for integer benchmarks (gcc, espresso, eqntott, etc.; average 11%) is substantially higher than for FP programs (nasa7, matrix300, tomcatv, etc.; average 4%)
- 4096 entries (2 bits per entry) perform about as well as an infinite table
  - But 4096 entries is a lot of hardware
86. Correlating Branch Predictors
- Branch predictors that use the behavior of other branches to make a prediction
  - Also called two-level predictors
- Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch's own history)
- The behavior of recent branches then selects between, say, four predictions for the next branch, and only the selected prediction is updated
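The select-and-update idea above can be sketched as a (2,2) predictor. This is a hedged Python sketch (the class name, the 2 bits of global history, and the table sizes are my own assumptions, not from the slides): the last two branch outcomes pick one of four 2-bit counters per table entry, and only the selected counter is updated.

```python
class CorrelatingPredictor:
    """(2,2) predictor sketch: 2 bits of global history select one of
    four 2-bit saturating counters per branch-table entry."""
    def __init__(self, entries=1024):
        self.entries = entries
        self.history = 0                       # last two branch outcomes
        self.table = [[3] * 4 for _ in range(entries)]

    def predict(self, pc):
        # The global history selects which counter makes the prediction
        counter = self.table[pc % self.entries][self.history]
        return counter >= 2

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        h = self.history
        # Update only the counter the history selected
        row[h] = min(3, row[h] + 1) if taken else max(0, row[h] - 1)
        # Shift the outcome into the 2-bit global history
        self.history = ((self.history << 1) | int(taken)) & 0b11
```

On a strictly alternating taken/not-taken branch, a plain 2-bit counter keeps mispredicting, while this predictor quickly becomes perfect: each history pattern gets its own counter.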
87. Accuracy of Different Schemes
(Chart: frequency of mispredictions, 0% to 18%, compared across three schemes: 4096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; and 1024 entries with 2 bits of history and 2 bits per entry)
88. Branch Target Buffer (BTB)
- Use the address of the branch as an index to get the prediction AND the branch target address (if taken)
  - Note: we must now check that the entry matches this branch, since we can't use the wrong branch's address
- Done at the IF stage: better than computing the branch at the ID stage in the 5-stage pipeline
- Penalty: 2 clock cycles (1 to update the buffer + 1 to fetch the new target)
- Return instruction addresses are predicted with a stack
(BTB entry: branch PC -> predicted PC, plus a taken/not-taken prediction)
89. Example
- What is the total branch penalty for a BTB with
  - Prediction accuracy of 90%
  - Hit rate in the buffer of 90%
  - 60% of the branches taken

  In buffer?  Prediction  Actual     Penalty (cycles)
  Yes         Taken       Taken      0
  Yes         Taken       Not taken  2
  No          -           Taken      2
  No          -           Not taken  0

Penalty cases: predicted taken but not taken (2 cycles); branch taken but not found in the buffer (2 cycles).
Branch penalty = (buffer hit rate x fraction of incorrect predictions x 2) + ((1 - buffer hit rate) x fraction of taken branches x 2)
Branch penalty = (90% x 10% x 2) + (10% x 60% x 2) = 0.18 + 0.12 = 0.30 clock cycles
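The arithmetic above generalizes to any BTB parameters. This is a small Python sketch (the function and parameter names are my own) of the slide's penalty model, with the same 2-cycle costs for the two penalty cases.

```python
def btb_branch_penalty(hit_rate, accuracy, frac_taken):
    """Expected penalty cycles per branch for the slide's BTB model."""
    # Predicted taken (BTB hit) but actually not taken: 2 cycles
    mispredicted = hit_rate * (1 - accuracy) * 2
    # Taken branch that missed in the buffer: 2 cycles
    missed_taken = (1 - hit_rate) * frac_taken * 2
    return mispredicted + missed_taken
```

With the slide's numbers (90% hit rate, 90% accuracy, 60% taken), the model reproduces the 0.30-cycle result.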
90. Multiple Issue
- Multiple issue is the ability of the processor to start more than one instruction in a given cycle
- Superscalar processors
- Very Long Instruction Word (VLIW) processors
91. 1990s Superscalar Processors
- Bottleneck: CPI >= 1
  - Limit on scalar performance (single instruction issue)
  - Hazards
  - Superpipelining? Diminishing returns (hazards + overhead)
- How can we make the CPI 0.5?
  - Issue multiple instructions in every pipeline stage (superscalar):

           1   2   3   4   5   6   7
    Inst0  IF  ID  EX  MEM WB
    Inst1  IF  ID  EX  MEM WB
    Inst2      IF  ID  EX  MEM WB
    Inst3      IF  ID  EX  MEM WB
    Inst4          IF  ID  EX  MEM WB
    Inst5          IF  ID  EX  MEM WB
92. Superscalar Processors
- Pioneer: IBM (America -> RIOS, RS/6000, Power-1)
- Superscalar instruction combinations
  - 1 ALU, memory, or branch op + 1 FP op (RS/6000)
  - Any 1 op + 1 ALU op (Pentium)
  - Any 1 ALU or FP op + 1 ALU op + 1 load + 1 store + 1 branch (Pentium II)
- Impact of superscalar
  - More opportunity for hazards (why?)
  - More performance loss due to hazards (why?)
93. Superscalar Processors
- Issue a varying number of instructions per clock
- Scheduling: static (by the compiler) or dynamic (by the hardware)
- A superscalar processor issues a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo)
- Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP PA-8000
94Elements of Advanced Superscalars
- High performance instruction fetching
- Good dynamic branch and jump prediction
- Multiple instructions per cycle, multiple
branches per cycle? - Scheduling and hazard elimination
- Dynamic scheduling
- Not necessarily: Alpha 21064 and Pentium were
statically scheduled - Register renaming to eliminate WAR and WAW
- Parallel functional units, paths/buses/multiple
register ports - High performance memory systems
- Speculative execution
- Precise interrupts
95SS DS Speculation
- Superscalar Dynamic scheduling Speculation
- Three great tastes that taste great together
- CPI >= 1?
- Overcome with superscalar
- Superscalar increases hazards
- Overcome with dynamic scheduling
- RAW dependences still a problem?
- Overcome with a large window
- Branches a problem for filling large window?
- Overcome with speculation
963GTtTGT II (3 Great Tastes that Taste Great Together)
- Static ILP
- VLIW (very long instruction word)
- To get IPC > 1
- Static scheduling (pipeline scheduling)
- To overcome data hazards
- Static scheduling/software speculation (loop
unrolling) - More instructions for scheduling flexibility,
overcome control hazards - Case for VLIW compiler complexity doesnt impact
clock!
97VLIW
- VLIW Very long instruction word
- In-order pipe, but each instruction is N
instructions (VLIW) - Typically slotted (I.e., 1st must be ALU, 2nd
must be load,etc., ) - VLIW travels down pipe as a unit
- Compiler packs independent instructions into VLIW
- Processor does not have logic to interlock
instructions within a VLIW - Pure VLIW
- Fixed instruction latencies, processor can't
interlock between VLIWs
[VLIW pipeline diagram: shared IF and ID stages feeding parallel execution slots (ALU, ALU, address/MEM, FP, FP), each ending in its own WB]
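As a hedged sketch of what "compiler packs independent instructions into VLIW" means, the toy scheduler below greedily bundles instructions in order, breaking a bundle whenever a register dependence (RAW, WAR, or WAW) appears within it. The (dest, srcs) tuples are an invented toy IR, and real packers also respect slot types and latencies:

```python
# Greedy, in-order packing of independent instructions into
# fixed-width VLIW bundles (toy model, slot typing ignored).
def pack_vliw(instrs, width):
    bundles, current = [], []
    for dest, srcs in instrs:
        conflict = any(
            dest == d or dest in s or d in srcs  # WAW, WAR, RAW vs bundle
            for d, s in current
        )
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append((dest, srcs))
    if current:
        bundles.append(current)
    return bundles

loop = [("F0", ["R1"]), ("F6", ["R1"]),   # two independent loads
        ("F4", ["F0", "F2"]),             # RAW on F0: starts a new bundle
        ("F8", ["F6", "F2"])]
print(len(pack_vliw(loop, width=2)))  # 2
```

The RAW check is what forces the ADDD into a later bundle than the LD it depends on, mirroring the "no interlocks within a VLIW" rule above.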
98Very Long Instruction Word
- VLIW - issues a fixed number of instructions
formatted either as one very large instruction or
as a fixed packet of smaller instructions - Fixed number of instructions (4-16) scheduled by
the compiler put operators into wide templates - Started with microcode (horizontal microcode)
- Joint HP/Intel agreement in 1999/2000
- Intel Architecture-64 (IA-64) 64-bit address
/Itanium - Explicitly Parallel Instruction Computer (EPIC)
- Transmeta translates X86 to VLIW
- Many embedded controllers (TI, Motorola) are VLIW
99Superscalar Vs. VLIW
- Religious debate, similar to RISC vs. CISC
- Wisconsin + Michigan (superscalar) vs. Illinois
(VLIW) - Q. Who can schedule code better, hardware or
software?
100Hardware Scheduling
- High branch prediction accuracy
- Dynamic information on latencies (cache misses)
- Dynamic information on memory dependences
- Easy to speculate (and recover from
mis-speculation) - Works for generic, non-loop, irregular code
- Ex databases, desktop applications, compilers
- -ves
- Limited reorder buffer size limits lookahead
- High cost/complexity
- Slow clock
101Software Scheduling
- Large scheduling scope (full program), large
lookahead - Can handle very long latencies
- Simple hardware with fast clock
- Only works well for regular codes (scientific,
FORTRAN) - -ves
- Low branch prediction accuracy
- Can improve by profiling
- No information on latencies like cache misses
- Can improve by profiling
- Pain to speculate and recover from
mis-speculation - Can improve with hardware support
102Profiling
- Information from previous program run
- Must use different input!
- Softwares answer to everything
- Works OK, but only OK
- Popular research topic
- Gaining importance
103Pure VLIW What Does VLIW Mean?
- All latencies fixed
- All instructions in VLIW issue at once
- No hardware interlocks at all
- Compiler responsible for scheduling entire
pipeline - Includes stall cycles
- Possible if you know structure of pipeline and
latencies exactly
104Problems with Pure VLIW
- Latencies are not fixed (e.g., caches)
- Option I: don't use caches (forget it)
- Option II: stall whole pipeline on a miss? (need
interlocks) - Option III: stall instructions waiting for
memory? (need out-of-order) - Different implementations
- Different pipe depths, different latencies
- New pipeline may produce wrong results (code
stalls in wrong place) - Recompile for new implementations?
- Code compatibility is very important, made Intel
what it is
105Tainted VLIW
- EPIC (IA64, Itanium)
- Less rigid than VLIW (Not really VLIW at all)
- Architecture variable width instruction words
- Implemented as bundles with dependence bits
- Makes code compatible with different width
machines - Implementation interlocks
- Makes code compatible with different pipelines
- Enables stalls on cache misses
- Actually enables out-of-order, too
- Explicitly parallel RISC with support for
software speculation
106Key Static Scheduling
- VLIW relies on the fact that software can
schedule code well - Three techniques
- Loop unrolling (we have seen this one already)
- Problems
- Code growth
- Poor scheduling along seams of unrolled copies
- Doesn't handle carried dependences
(inter-iteration dependences or recurrences) - Software pipelining (symbolic loop unrolling)
- Trace scheduling
107VLIW
- 3 instructions in 128-bit groups; a template field
determines whether the instructions are dependent or
independent - Smaller code size than old VLIW, larger than
x86/RISC - Groups can be linked to show independence gt 3
instr - 64 integer registers 64 floating point
registers - Not separate files per functional unit as in old
VLIW - Hardware checks dependencies (interlocks gt
binary compatibility over time) - Predicated execution (select 1 out of 64 1-bit
flags) gt 40 fewer mispredictions? - IA-64 name of instruction set architecture EPIC
is type - Merced is name of first implementation
(1999/2000?) Itanium?
108Superscalar Version of DLX
- can handle 2 instructions/cycle
- Floating Point
- Anything Else
- Fetch 64 bits/clock cycle; Int on left, FP on
right - Can only issue 2nd instruction if 1st
instruction issues - More ports for FP registers to do FP load FP
op in a pair - Type Pipe Stages
- Int. instruction IF ID EX MEM WB
- FP instruction IF ID EX MEM WB
- Int. instruction IF ID EX MEM WB
- FP instruction IF ID EX MEM WB
- Int. instruction IF ID EX MEM WB
- FP instruction IF ID EX MEM WB
-
- 1 cycle load delay can cause delay to 3
instructions in Superscalar - instruction in right half cant use it, nor
instructions in next slot
109Unrolled Loop Minimizes Stalls for Scalar
1 Loop: LD F0, 0(R1)
2  LD F6, -8(R1)
3  LD F10, -16(R1)
4  LD F14, -24(R1)
5  ADDD F4, F0, F2
6  ADDD F8, F6, F2
7  ADDD F12, F10, F2
8  ADDD F16, F14, F2
9  SD F4, 0(R1)
10 SD F8, -8(R1)
11 SD F12, -16(R1)
12 SUBI R1, R1, 32
13 BNE R1, R2, LOOP
14 SD F16, 8(R1)
14 clock cycles, or 3.5 clocks per iteration
Latencies: LD to ADDD 1 cycle; ADDD to SD 2 cycles
110Loop Unrolling in Superscalar
- Integer instruction FP instruction Clock cycle
- Loop LD F0, 0(R1) 1
- LD F6, -8(R1) 2
- LD F10, -16(R1) ADDD F4, F0, F2 3
- LD F14, -24(R1) ADDD F8, F6, F2 4
- LD F18, -32(R1) ADDD F12, F10, F2 5
- SD F4, 0(R1) ADDD F16, F14, F2 6
- SD F8, -8(R1) ADDD F20, F18, F2 7
- SD F12, -16(R1) 8
- SD F16, -24(R1) 9
- SUBI R1,R1,40 10
- BNE R1, R2, LOOP 11
- SD F20, 8(R1) 12
- Unrolled 5 times to avoid delays (+1 due to SS)
- 12 clocks, or 2.4 clocks per iteration
Static Scheduling
111Dynamic Scheduling in Superscalar
- Code compiled for the scalar version will run poorly
on Superscalar - May want code to vary depending on Superscalar
Architecture - Simple approach Separate Tomasulo Control for
separate reservation stations for Integer FU/Reg
and for FP FU/Reg
112Dynamic Scheduling in Superscalar
- How to do instruction issue with two instructions
and keep in-order instruction issue for Tomasulo? - Issue 2X Clock Rate, so that issue remains in
order - Only FP loads might cause dependency between
integer and FP issue - Replace load reservation station with a load
queue operands must be read in the order they
are fetched - Load checks addresses in Store Queue to avoid RAW
violation - Store checks addresses in Load Queue to avoid
WAR, WAW
113Performance of Dynamic Superscalar
- Iteration no. / Instruction / Issues / Executes /
Writes result (clock-cycle numbers)
- 1 LD F0, 0(R1) 1 2 4
- 1 ADDD F4, F0, F2 1 5 8
- 1 SD F4, 0(R1) 2 9
- 1 SUBI R1, R1, 8 3 4 5
- 1 BNEZ R1, LOOP 4 5
- 2 LD F0, 0(R1) 5 6 8
- 2 ADDD F4, F0, F2 5 9 12
- 2 SD F4, 0(R1) 6 13
- 2 SUBI R1, R1, 8 7 8 9
- 2 BNE R1, R2, LOOP 8 9
- 4 clocks per iteration
- Branches, Decrements still take 1 clock cycle
114Loop Unrolling in VLIW
- Memory Memory FP FP Int. op/ Clockreference
1 reference 2 operation 1 op. 2 branch - LD F0,0(R1) LD F6,-8(R1) 1
- LD F10,-16(R1) LD F14,-24(R1) 2
- LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD
F8,F6,F2 3 - LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4
- ADDD F20,F18,F2 ADDD F24,F22,F2 5
- SD F4, 0(R1) SD F8, -8(R1) ADDD F28,F26,F2 6
- SD F12, -16(R1) SD F16, -24(R1) 7
- SD F20, -32(R1) SD F24, -40(R1) SUBI
R1,R1,48 8 - SD F28, -0(R1) BNE R1, R2, LOOP 9
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per
iteration - Need more registers to effectively use VLIW
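The per-iteration figures quoted in the last three examples follow directly from total clocks divided by the number of unrolled iterations:

```python
# Clocks per original iteration for the three schedules shown above:
# scalar unrolled x4, 2-issue superscalar unrolled x5, VLIW unrolled x7.
schedules = {
    "scalar, unrolled x4":     (14, 4),
    "2-issue SS, unrolled x5": (12, 5),
    "VLIW, unrolled x7":       (9, 7),
}
for name, (clocks, iters) in schedules.items():
    print(f"{name}: {clocks / iters:.1f} clocks/iteration")
# scalar 3.5, superscalar 2.4, VLIW 1.3
```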
115Limits to Multi-Issue Machines
- Inherent limitations of ILP
- 1 branch in 5 instructions => how to keep a 5-way
VLIW busy? - Latencies of units => many operations must be
scheduled - Need about Pipeline Depth x No. Functional Units
of independent operations to keep machines busy. - Difficulties in building HW
- Duplicate Functional Units to get parallel
execution - Increase ports to Register File (VLIW example
needs 6 read and 3 write for Int. Reg. 6 read
and 4 write for Reg.) - Increase ports to memory
- Decoding SS and impact on clock rate, pipeline
depth
SS = superscalar
116Limits to Multi-Issue Machines
- Limitations specific to either SS or VLIW
implementation - Decode issue in SS
- VLIW code size unroll loops wasted fields in
VLIW - VLIW lock step gt 1 hazard all instructions
stall - VLIW binary compatibility
117Multiple Issue Challenges
- While Integer/FP split is simple for the HW, get
CPI of 0.5 only for programs with - Exactly 50 FP operations
- No hazards
- If more instructions issue at same time, greater
difficulty of decode and issue - Even 2-scalar gt examine 2 opcodes, 6 register
specifiers, decide if 1 or 2 instructions can
issue - VLIW tradeoff instruction space for simple
decoding - The long instruction word has room for many
operations - By definition, all the operations the compiler
puts in the long instruction word are independent
=> execute in parallel - E.g., 2 integer operations, 2 FP ops, 2 Memory
refs, 1 branch - 16 to 24 bits per field => 7 x 16 = 112 bits to
7 x 24 = 168 bits wide - Need compiling technique that schedules across
several branches
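The width arithmetic works out as follows; the 7-field mix (2 integer, 2 FP, 2 memory, 1 branch) is the slide's example configuration:

```python
# VLIW word width: 7 operation fields at 16 to 24 bits each.
fields = 2 + 2 + 2 + 1       # int, FP, memory, branch slots
print(fields * 16, fields * 24)  # 112 168
```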
118Compiler Support For ILP
- How can compilers be smart?
- Produce good scheduling of code.
- Determine which loops might contain parallelism.
- Eliminate name dependencies.
- Compilers must be REALLY smart
- Figure out aliases
- Pointers in C are a real problem
- Techniques lead to
- Symbolic Loop Unrolling
- Critical Path Scheduling
119Symbolic Loop Unrolling
- Observation
- if iterations from loops are independent, then
can get ILP by taking instructions from different
iterations - Software pipelining
- reorganizes loops so that each iteration is made
from instructions chosen from different
iterations of the original loop (Tomasulo in SW)
120Software Pipelining
- Software pipelining (symbolic loop unrolling)
- Really is pipelining in software
- One physical iteration
- Contains instructions from multiple original
iterations - Each instruction in different stage
- Need prologue and epilogue to start and flush the
pipeline
121Symbolic Loop Unrolling SW Pipelining Example
- Before Unrolled 3 times
- 1 LD F0,0(R1)
- 2 ADDD F4,F0,F2
- 3 SD F4,0(R1)
- 4 LD F6,-8(R1)
- 5 ADDD F8,F6,F2
- 6 SD F8,-8(R1)
- 7 LD F10,-16(R1)
- 8 ADDD F12,F10,F2
- 9 SD F12,-16(R1)
- 10 SUBI R1,R1,24
- 11 BNE R1, R2, LOOP
After: Software Pipelined
Prologue: LD F0, 0(R1); ADDD F4, F0, F2; LD F0, -8(R1)
1 SD F4, 0(R1)    ; stores M[i]
2 ADDD F4, F0, F2 ; adds to M[i-1]
3 LD F0, -16(R1)  ; loads M[i-2]
4 SUBI R1, R1, 8
5 BNE R1, R2, LOOP
Epilogue: SD F4, 0(R1); ADDD F4, F0, F2; SD F4, -8(R1)
Note: Within a physical iteration, the instructions are
unrelated. Perfect for VLIW!!
[Pipeline diagram: SD, ADDD, and LD overlapped (staggered IF ID EX Mem WB); F4 is read by SD before ADDD writes it, and F0 is read by ADDD before LD writes it]
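A small sketch of the iteration renaming that software pipelining performs: stage s of the loop body executes on behalf of iteration i - (stages - 1 - s), so the kernel mixes the SD, ADDD, and LD of three consecutive original iterations. The stage list and helper are illustrative, not a real compiler pass:

```python
# Stages of x[i] = x[i] + s, earliest first.
body = ["LD", "ADDD", "SD"]

def kernel(body, i):
    # In kernel iteration i, the last stage (SD) belongs to iteration i,
    # the middle stage (ADDD) to i-1, and the first stage (LD) to i-2.
    n = len(body)
    return [(op, i - (n - 1 - s)) for s, op in reversed(list(enumerate(body)))]

print(kernel(body, i=10))  # [('SD', 10), ('ADDD', 9), ('LD', 8)]
```

This matches the annotated kernel above: the store finishes M[i] while the load already starts M[i-2].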
122Symbolic Loop Unrolling
- Less code space
- Overhead paid only once vs. each iteration
in loop unrolling
[Figure: software pipelining pays the prologue/epilogue overhead once, while loop unrolling of 100 iterations into 25 loops of 4 unrolled iterations pays loop overhead in every unrolled loop]
123Software Pipelining
- Doesnt increase code size (much)
- Good scheduling at iteration seams
- Can vary degree of pipelining to tolerate longer
latencies - software superpipelining
- One physical iteration = instructions from logical
iterations i, i+2, i+4 - -ves
- Hard to do conditionals within loops
- Tricky register allocation sometimes
- Not everything is loops
124Trace Scheduling
- Trace scheduling
- For general non-loop situations
- Basic idea
- Find common paths in program
- Realign basic blocks to form straight-line trace
- Basic block single-entry, single-exit
instruction sequence - Trace (aka superblock, hyperblock) fused basic