Title: Instruction Level Parallelism
1. Instruction Level Parallelism
2. Instruction Level Parallelism
- Concepts and Challenges
- Dynamic Scheduling
- Dynamic Hardware Prediction
- Multiple Issue
- Compiler Support
- Hardware Support
- Studies of ILP
3. Summary of Pipelining Basics
- Hazards limit performance by preventing instructions from executing during their designated clock cycles
- Structural hazards need more HW resources
- Data hazards need forwarding and compiler scheduling
- Control hazards need early evaluation of the PC, delayed branches, and prediction
- Increasing the length of the pipe increases the impact of hazards
- Pipelining helps instruction bandwidth, not latency
- Interrupts, the instruction set, and FP make pipelining harder
- Compilers reduce the cost of data and control hazards
- Stalls increase CPI and decrease performance
4. What Is ILP?
- Principle: many instructions in the code do not depend on each other
- Result: it is possible to execute them in parallel
- ILP: potential overlap among instructions (so they can be evaluated in parallel)
- Issues
  - Building compilers to analyze the code
  - Building special/smarter hardware to handle the code
- Goal: increase the amount of parallelism exploited among instructions
- Seeks good results out of pipelining
5. Scheduling
- Scheduling: re-arranging instructions to maximize performance
- Requires knowledge about the structure of the processor
- Static scheduling: done by the compiler
  - Provides good analogies for hardware scheduling
  - Used in the embedded market, and in the IA-64 architecture and Intel's Itanium
  - We have already seen an example of this: scheduling to eliminate MEM/ALU bubbles
  - Another example: for (i = 1000; i > 0; i--) x[i] = x[i] + s;
- Dynamic scheduling: done by hardware
  - Dominates the server and desktop markets (Pentium III and IV, MIPS R10000/12000, UltraSPARC III, PowerPC 603, etc.)
6. Pipeline Scheduling: Previous Lecture Example
The compiler schedules (moves) instructions to reduce stalls.
Example code sequence: a = b + c; d = e - f
Before scheduling:
  lw  Rb, b
  lw  Rc, c
  add Ra, Rb, Rc  // stall
  sw  a, Ra
  lw  Re, e
  lw  Rf, f
  sub Rd, Re, Rf  // stall
  sw  d, Rd
After scheduling:
  lw  Rb, b
  lw  Rc, c
  lw  Re, e
  add Ra, Rb, Rc
  lw  Rf, f
  sw  a, Ra
  sub Rd, Re, Rf
  sw  d, Rd
7. Basic Pipeline Scheduling
- To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency
- The compiler's ability depends on
  - Amount of ILP available in the program
  - Latencies of the functional units in the pipeline
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
8. Pipeline Scheduling: Loop Unrolling
- Basic block
  - Set of instructions between entry points and between branches; a basic block has only one entry and one exit
  - Typically 4 to 7 instructions
  - Amount of overlap << 4 to 7 instructions
  - To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Loop-level parallelism
  - Parallelism that exists within a loop: limited opportunity
  - Parallelism can cross loop iterations!
- Techniques to convert loop-level parallelism to instruction-level parallelism
  - Loop unrolling: the compiler's or the hardware's ability to exploit the parallelism inherent in the loop
  - Vector instructions: operate on a sequence of data items
9. Assumptions
- Five-stage integer pipeline
- Branches have a delay of one clock cycle
- ID stage: comparisons done, decisions made, and PC loaded
- No structural hazards
- Functional units are fully pipelined or replicated (as many times as the pipeline depth)
- FP latencies as given in the table; integer load latency = 1, integer ALU operation latency = 0
10. Simple Loop: Assembler Equivalent
for (i = 1000; i > 0; i--) x[i] = x[i] + s;

Loop: LD   F0, 0(R1)    ; F0 = array element
      ADDD F4, F0, F2   ; add scalar in F2
      SD   F4, 0(R1)    ; store result
      SUBI R1, R1, 8    ; decrement pointer by 8 bytes (DW)
      BNE  R1, R2, Loop ; branch if R1 != R2

- x[i] and s are double/floating-point values
- R1 initially holds the address of the array element with the highest address
- F2 contains the scalar value s
- R2 is pre-computed so that 8(R2) is the last element to operate on
11. Where Are the Stalls?
Unscheduled:
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        SUBI R1, R1, 8
        stall
        BNE  R1, R2, Loop
        stall
  10 clock cycles. Can we minimize?
Scheduled:
  Loop: LD   F0, 0(R1)
        SUBI R1, R1, 8
        ADDD F4, F0, F2
        stall
        BNE  R1, R2, Loop
        SD   F4, 8(R1)
  6 clock cycles: 3 cycles of actual work + 3 cycles of overhead. Can we minimize further?
(Integer load latency = 1; integer ALU operation latency = 0)
12. Where Are the Stalls?
Note 2: a stall is required because the latency between an FP ALU op and a store double is 2 cycles for this architecture, as specified in the table at the bottom of the slide. The ADDD instruction and the SD instruction must have two cycles of latency between them.
13. Loop Unrolling
Four iterations of the original code (each with its own SUBI and BNE overhead):
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BNE  R1, R2, Loop
  ... (the same five instructions repeated for each of the four iterations)
Unrolled four-iteration code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F0, F2
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        ADDD F8, F6, F2
        SD   F8, -8(R1)
        LD   F10, -16(R1)
        ADDD F12, F10, F2
        SD   F12, -16(R1)
        LD   F14, -24(R1)
        ADDD F16, F14, F2
        SD   F16, -24(R1)
        SUBI R1, R1, 32
        BNE  R1, R2, Loop
Assumption: R1 is initially a multiple of 32, i.e., the number of loop iterations is a multiple of 4.
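The unrolling transformation above can be sketched at the source level. This is a minimal Python sketch (the function names are my own, not from the slides): the unrolled version does the same work but pays the loop-maintenance cost (one index update, one loop test) once per four elements instead of once per element.

```python
def scale_add(x, s):
    # Original loop: for (i = 1000; i > 0; i--) x[i] = x[i] + s
    y = list(x)
    for i in range(len(y)):
        y[i] = y[i] + s
    return y

def scale_add_unrolled4(x, s):
    # Body unrolled four times; assumes len(x) is a multiple of 4,
    # mirroring the slide's assumption about R1.
    y = list(x)
    i = 0
    while i < len(y):
        y[i] = y[i] + s
        y[i + 1] = y[i + 1] + s
        y[i + 2] = y[i + 2] + s
        y[i + 3] = y[i + 3] + s
        i += 4  # one index update and one loop test per four elements
    return y
```

Both functions produce identical results; only the amount of loop overhead differs.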
14. Loop Unroll: Schedule
Unrolled, unscheduled (stalls shown):
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        stall
        ADDD F8, F6, F2
        stall
        stall
        SD   F8, -8(R1)
        ... (the same pattern repeats for the F10 and F14 groups)
  28 clock cycles, or 7 per iteration. Can we minimize further?
Unrolled and scheduled:
  Loop: LD   F0, 0(R1)
        LD   F6, -8(R1)
        LD   F10, -16(R1)
        LD   F14, -24(R1)
        ADDD F4, F0, F2
        ADDD F8, F6, F2
        ADDD F12, F10, F2
        ADDD F16, F14, F2
        SD   F4, 0(R1)
        SD   F8, -8(R1)
        SUBI R1, R1, 32   ; see Note 3
        SD   F12, 16(R1)
        BNE  R1, R2, Loop
        SD   F16, 8(R1)
  No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?
Note 3: to meet the one-cycle latency requirement between the SUBI and BNE instructions (see the note on slide 12), one SD instruction is moved in between them; the last two SD offsets become 16(R1) and 8(R1) because SUBI has already decremented R1 by 32.
15. Summary
- One iteration, unscheduled: 10 cycles
- One iteration, scheduled: 6 cycles
- Unrolled 4x, unscheduled: 7 cycles per iteration
- Unrolled 4x, scheduled: 3.5 cycles per iteration (no stalls)
16. Limits to the Gains of Loop Unrolling
- Decreasing benefit
  - Each additional unroll amortizes less overhead
  - In the example just considered, the loop was unrolled 4 times with no stall cycles; of the 14 cycles, 2 were loop overhead
  - If unrolled 8 times, the overhead is reduced only from 1/2 cycle per iteration to 1/4
- Code size limitations
  - Memory is at a premium
  - Larger code size changes the cache hit rate
- Shortfall in registers (register pressure)
  - Increasing ILP increases the number of live values; it may not be possible to allocate all the live values to registers
- Compiler limitations: significant increase in complexity
17. What If the Upper Bound of the Loop Is Unknown?
- Suppose
  - The upper bound of the loop is n
  - We unroll the loop to make k copies of the body
- Solution: generate a pair of consecutive loops
  - First loop: body same as the original loop; executes (n mod k) times
  - Second loop: unrolled body (k copies of the original); iterates (n/k) times
- For large values of n, most of the execution time is spent in the unrolled loop body
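The two-loop scheme above can be sketched directly. This is a hedged Python sketch (function and parameter names are my own): a cleanup loop runs the original body (n mod k) times, then the unrolled loop runs n // k passes of k copies each, covering all n iterations in order.

```python
def run_unrolled(n, k, body):
    """Execute body(i) for i = 0 .. n-1 using the two-loop scheme:
    a cleanup loop of (n mod k) original iterations, then an
    unrolled loop of n // k passes with k copies of the body."""
    i = 0
    for _ in range(n % k):        # first loop: original body, n mod k times
        body(i)
        i += 1
    for _ in range(n // k):       # second loop: unrolled body
        for _ in range(k):        # k copies of the body per pass
            body(i)
            i += 1
```

For large n, nearly all calls happen inside the second (unrolled) loop, which is the point of the transformation.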
18. Summary: Tricks of High-Performance Processors
- Out-of-order scheduling: to tolerate RAW hazard latency
  - Determine that the loads and stores can be exchanged, since loads and stores from different iterations are independent
  - This requires analyzing the memory addresses and finding that they do not refer to the same address
  - Find that it is OK to move the SD after the SUBI and BNE, and adjust the SD offset
- Loop unrolling: increase the scheduling scope for more latency tolerance
  - Find that loop unrolling is useful by finding that loop iterations are independent, except for the loop maintenance code
  - Eliminate the extra tests and branches and adjust the loop maintenance code
- Register renaming: remove WAR/WAW violations due to scheduling
  - Use different registers to avoid the unnecessary constraints that would be forced by using the same registers for different computations
- Summary: schedule the code while preserving any dependences that are needed
19. Compiler Perspective
- The compiler is concerned about dependencies in the program; whether a dependence causes a hazard is a property of the pipeline organization
- It tries to schedule code to avoid hazards
- It looks for data dependencies:
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences)
- If dependent, instructions can't execute in parallel (or be completely overlapped)
- Easy to determine for registers (fixed names)
- Hard for memory:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
20. Data Dependence
- Data dependence
  - Indicates the possibility of a hazard
  - Determines the order in which results must be calculated
  - Sets an upper bound on how much parallelism can be exploited
- But the actual hazard and the length of any stall are determined by the pipeline
- Dependence avoidance
  - Maintain the dependence but avoid the hazard: scheduling
  - Eliminate the dependence by transforming the code
21. Data Dependencies
  1 Loop: LD   F0, 0(R1)
  2       ADDD F4, F0, F2
  3       SUBI R1, R1, 8
  4       BNE  R1, R2, Loop ; delayed branch
  5       SD   F4, 8(R1)    ; offset altered when moved past SUBI
22. Name Dependencies
- Two instructions use the same name (register or memory location) but don't exchange data
- Anti-dependence (a WAR hazard for the hardware)
  - Instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
- Output dependence (a WAW hazard for the hardware)
  - Instructions i and j write the same register or memory location; the ordering between the instructions must be preserved
- How can we remove name dependencies?
  - They are not true dependencies, so renaming can remove them
23. Register Renaming
Unrolled loop reusing F0 and F4:
  1 Loop: LD   F0, 0(R1)
  2       ADDD F4, F0, F2
  3       SD   F4, 0(R1)
  4       LD   F0, -8(R1)
  5       ADDD F4, F0, F2
  6       SD   F4, -8(R1)
  7       LD   F0, -16(R1)
  8       ADDD F4, F0, F2
  9       SD   F4, -16(R1)
  10      LD   F0, -24(R1)
  11      ADDD F4, F0, F2
  12      SD   F4, -24(R1)
  13      SUBI R1, R1, 32
  14      BNE  R1, R2, LOOP
After register renaming:
  1 Loop: LD   F0, 0(R1)
  2       ADDD F4, F0, F2
  3       SD   F4, 0(R1)
  4       LD   F6, -8(R1)
  5       ADDD F8, F6, F2
  6       SD   F8, -8(R1)
  7       LD   F10, -16(R1)
  8       ADDD F12, F10, F2
  9       SD   F12, -16(R1)
  10      LD   F14, -24(R1)
  11      ADDD F16, F14, F2
  12      SD   F16, -24(R1)
  13      SUBI R1, R1, 32
  14      BNE  R1, R2, LOOP
No data is passed through F0, but without renaming we can't reuse F0 in instruction 4.
- Name dependencies are hard for memory accesses
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change, then 0(R1) != -8(R1) != -16(R1) != -24(R1); there were no dependencies between some loads and stores, so they could be moved around
24. Control Dependencies
- Example
  - if (p1) S1;
  - if (p2) S2;
  - S1 is control dependent on p1; S2 is control dependent on p2 but not on p1
- Two constraints
  - An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch
- Control dependencies are relaxed to get parallelism
  - We get the same effect if we preserve the order of exceptions (e.g., an address in a register is checked by a branch before use) and the data flow (e.g., a value in a register depends on the branch); this is done via speculation, delayed branching, etc.
25. Control Dependencies
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BE   R1, R2, Exit
  LD   F0, 0(R1)   ; if executed before the branch, may create an exception
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, 8
  BE   R1, R2, Exit
  ... (the same block repeats twice more; the final copy ends at SUBI)
26. When Is It Safe to Unroll a Loop?
- Example 1: where are the data dependences? (A, B, C are distinct and non-overlapping arrays)

  for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
  }

- S2 uses the value A[i+1] computed by S1 in the same iteration
- S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1; the same is true of S2 for B[i] and B[i+1]
- The second one is a loop-carried dependence between iterations
- The iterations are dependent and can't be executed in parallel
- Note this was not the case in our prior example: each iteration was distinct
- (A loop-carried dependence does not always prevent parallelism, as the next example shows)
27. When Is It Safe to Unroll a Loop?
- Example 2: where are the data dependences? (A, B, C, D are distinct and non-overlapping arrays)

  for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + B[i];  /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
  }

- There is no dependence from S1 to S2. If there were, there would be a cycle in the dependences and the loop would not be parallelizable. Since this dependence is absent, interchanging the two statements will not affect the execution of S2.
- On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to entering the loop.
- New code, with the loop-carried dependence from S2 to S1 removed:

  A[2] = A[1] + B[1];
  for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+2] = A[i+1] + B[i+1];  /* check it out on a computer / use your logic */
  }
  B[101] = C[100] + D[100];
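The transformation above can be checked mechanically, as the slide suggests. This is a Python sketch (function names are my own) that runs both versions on the same input arrays and compares the resulting A and B; the peeled statements cover S1 of iteration 1 and S2 of iteration 100, and the interchanged loop body covers the rest.

```python
def original_loop(A, B, C, D):
    # for (i = 1; i <= 100; i++) { A[i+1] = A[i] + B[i]; B[i+1] = C[i] + D[i]; }
    A, B = list(A), list(B)
    for i in range(1, 101):
        A[i + 1] = A[i] + B[i]      # S1
        B[i + 1] = C[i] + D[i]      # S2
    return A, B

def transformed_loop(A, B, C, D):
    A, B = list(A), list(B)
    A[2] = A[1] + B[1]              # peeled S1 of iteration 1
    for i in range(1, 100):         # i = 1 .. 99
        B[i + 1] = C[i] + D[i]      # S2 of iteration i
        A[i + 2] = A[i + 1] + B[i + 1]  # S1 of iteration i + 1
    B[101] = C[100] + D[100]        # peeled S2 of iteration 100
    return A, B
```

Running both on the same data yields identical arrays, confirming the statement interchange plus peeling preserves the loop's results.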
28. These Tricks Can Be Done in Hardware
- Why build complicated hardware if we can do this in software?
- Performance portability
  - Software scheduling assumes a particular pipeline structure
  - We don't want to recompile for new machines
- More information is available to the hardware
  - Data addresses, branch directions, and cache misses are statically unknown (but the compiler can look at more instructions)
- More resources are available to the hardware
  - There may not be enough architectural registers to resolve WAR/WAW hazards in software
- Speculative execution is easier in hardware
  - Easier to recover from mis-speculation
- Solution: use a combination of both
29. Dynamic Scheduling
- Dynamic scheduling: hardware rearranges the order of instruction execution to reduce stalls
- Disadvantage: the hardware is much more complex
- Key ideas
  - Execute instructions in parallel (use all available execution units)
  - Allow instructions behind a stall to proceed
- Example
  - DIVD F0, F2, F4
  - ADDD F10, F0, F8
  - SUBD F12, F8, F14
  - SUBD depends on neither DIVD nor ADDD, so it need not wait for them
- Out-of-order execution => out-of-order completion
30. Overview
- In-order pipeline
  - 5 interlocked stages: IF, ID, EX, MEM, WB
  - Structural hazard: a maximum of 1 instruction per stage
  - Unless a stage is replicated (FP + integer EX) or idle (WB for stores)
- Out-of-order pipeline
  - How does one instruction pass another without "killing" it?
  - Remember: only one instruction per stage per cycle
  - We must buffer instructions
(Pipeline diagram: IF -> ID -> EX -> MEM -> WB)
31. Instruction Buffer
- Trick: an instruction buffer (this buffer goes by many names)
- Accumulate decoded instructions in the buffer
- The buffer sends instructions down the rest of the pipe out of order
(Pipeline diagram: IF -> ID1 -> instruction buffer -> ID2 -> EX -> MEM -> WB)
32. Scoreboard
State/steps: IF -> ID -> IS (issue) -> RO (read operands) -> EX -> WB, with decoded instructions held in an instruction buffer
- There is some confusion in the community about which stage is which
(Structure diagram: the scoreboard's control/status lines connect the registers and multiple EX units over a data bus)
33. Dynamic Scheduling: Scoreboard
- Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
- The scoreboard allows an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions
- A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together
- Centralized control scheme
  - No bypassing
  - No elimination of WAR/WAW hazards
- We will use in-order issue, out-of-order execution, out-of-order commit (also called completion)
- First used in the CDC 6600; our example is modified here for DLX
  - The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units
  - DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit
34. Scoreboard Implications
- Out-of-order completion => WAR and WAW hazards
- Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- Solution for WAW and structural hazards
  - Must detect the hazard and stall until it is cleared
- Need to have multiple instructions in the execution phase
  - Multiple execution units, or pipelined execution units
- The scoreboard keeps track of dependencies and the state of operations
- The scoreboard replaces ID, EX, WB with 4 stages
35. Stages of Scoreboard Control
- Issue: decode instructions and check for structural hazards (ID1)
  - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure
  - If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions issue until these hazards are cleared
36. Stages of Scoreboard Control
- Read operands: wait until no data hazards, then read operands from the registers (ID2)
  - A source operand is available if no earlier issued active instruction is going to write it; it is unavailable while an active functional unit is still going to write that register
  - When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution
  - The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order
37. Stages of Scoreboard Control
- Execution: operate on operands (EX)
  - The functional unit begins execution upon receiving operands; when the result is ready, it notifies the scoreboard that it has completed execution
- Write result: finish execution (WB)
  - Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards; if there are none, it writes the result; if there is a WAR hazard, it stalls the instruction
- Example
  - DIVD F0, F2, F4
  - ADDD F10, F0, F8
  - SUBD F8, F8, F14
  - The scoreboard would stall SUBD's write until ADDD reads its operands
38. Scoreboard Data Structures
- Instruction status
  - Which of the 4 steps the instruction is in
- Functional unit status
  - Busy: whether the unit is busy or not
  - Op: the operation to perform in the unit (e.g., + or -)
  - Fi: destination register
  - Fj, Fk: source register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: ready bits for Fj, Fk
- Register result status
  - Indicates which functional unit (if any) will write each register
  - Blank when no pending instruction will write that register
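The three tables above map naturally onto simple records. This is a hedged Python sketch (class and field names follow the slide, but the `can_issue` helper and unit names are my own) showing the functional-unit status row, the register result status, and the issue-stage check against structural and WAW hazards.

```python
from dataclasses import dataclass, field

@dataclass
class FUStatus:
    # One row of the scoreboard's functional-unit status table
    busy: bool = False
    op: str = ""       # operation to perform (e.g., "ADDD")
    Fi: str = ""       # destination register
    Fj: str = ""       # source registers
    Fk: str = ""
    Qj: str = ""       # FUs producing Fj / Fk ("" means none pending)
    Qk: str = ""
    Rj: bool = False   # ready bits for Fj / Fk
    Rk: bool = False

@dataclass
class Scoreboard:
    units: dict = field(default_factory=dict)            # unit name -> FUStatus
    register_result: dict = field(default_factory=dict)  # register -> unit name

    def can_issue(self, unit, dest):
        # Issue stage: the functional unit must be free (structural
        # hazard) and no active instruction may already be going to
        # write dest (WAW hazard)
        return (not self.units[unit].busy
                and dest not in self.register_result)
```

A busy unit or a pending write to the destination register blocks issue, exactly the two conditions from slide 35.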
39. Detailed Scoreboard Pipeline Control
- Issue
  - Wait until: the functional unit is not busy and no unit will write the destination (not Busy(FU) and not Result(D))
  - Bookkeeping: Busy(FU) := yes; Op(FU) := op; Fi(FU) := D; Fj(FU) := S1; Fk(FU) := S2; Qj := Result(S1); Qk := Result(S2); Rj := not Qj; Rk := not Qk; Result(D) := FU
- Read operands
  - Wait until: Rj and Rk
  - Bookkeeping: Rj := No; Rk := No
- Execution complete
  - Wait until: the functional unit is done
- Write result
  - Wait until: for all f, (Fj(f) != Fi(FU) or Rj(f) = No) and (Fk(f) != Fi(FU) or Rk(f) = No)
  - Bookkeeping: for all f, if Qj(f) = FU then Rj(f) := Yes; for all f, if Qk(f) = FU then Rk(f) := Yes; Result(Fi(FU)) := 0; Busy(FU) := No
40. Scoreboard Example
  LD   F6, 34(R2)
  LD   F2, 45(R3)
  MULT F0, F2, F4
  SUBD F8, F6, F2
  DIVD F10, F0, F6
  ADDD F6, F8, F2
What are the hazards in this code?
Latencies (clock cycles): LD 1; MULT 10; DIVD 40; ADDD, SUBD 2
41. Scoreboard Example
42. Scoreboard Example: Cycle 1
  Issue LD 1. (The table shows in which cycle each operation occurred.)
43. Scoreboard Example: Cycle 2
  LD 2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.
44. Scoreboard Example: Cycle 3
45. Scoreboard Example: Cycle 4
46. Scoreboard Example: Cycle 5
  Issue LD 2 since the integer unit is now free.
47. Scoreboard Example: Cycle 6
  Issue MULT.
48. Scoreboard Example: Cycle 7
  MULT can't read its operands (F2) because LD 2 hasn't finished.
49. Scoreboard Example: Cycle 8a
  DIVD issues. MULT and SUBD are both waiting for F2.
50. Scoreboard Example: Cycle 8b
  LD 2 writes F2.
51. Scoreboard Example: Cycle 9
  Now MULT and SUBD can both read F2. How can both instructions do this at the same time?
52. Scoreboard Example: Cycle 11
  ADDD can't issue because the add unit is busy.
53. Scoreboard Example: Cycle 12
  SUBD finishes. DIVD is waiting for F0.
54. Scoreboard Example: Cycle 13
  ADDD issues.
55. Scoreboard Example: Cycle 14
56. Scoreboard Example: Cycle 15
57. Scoreboard Example: Cycle 16
58. Scoreboard Example: Cycle 17
  ADDD can't write because of the WAR hazard with DIVD, which has not yet read F6!
59. Scoreboard Example: Cycle 18
  Nothing happens!
60. Scoreboard Example: Cycle 19
  MULT completes execution.
61. Scoreboard Example: Cycle 20
  MULT writes.
62. Scoreboard Example: Cycle 21
  DIVD loads its operands.
63. Scoreboard Example: Cycle 22
  Now ADDD can write since the WAR hazard is removed.
64. Scoreboard Example: Cycle 61
  DIVD completes execution.
65. Scoreboard Example: Cycle 62
  Done!
66. Scoreboard
- Operands for an instruction are read only when both operands are available in the register file
- The scoreboard does not take advantage of forwarding
- Instructions write to the register file as soon as they complete execution (assuming no WAR hazards) and do not wait for a write slot
  - This reduced pipeline latency recovers some of the benefit of forwarding
- There is still one additional cycle of latency, since the write-result and read-operand stages cannot overlap
- Bus structure
  - The limited number of buses to the register file represents a structural hazard
67. Scoreboard
- Limitations
  - No forwarding (RAW dependences are handled through the registers)
  - In-order issue for WAW/structural hazards limits scheduling flexibility
  - WAR stalls limit dynamic loop unrolling (no register renaming)
- Performance
  - 1.7x for FORTRAN programs
  - 2.5x for hand-coded assembly
- Hardware
  - The scoreboard itself is cheap
  - The buses are not
68. DS Method 2: Tomasulo's Algorithm
- Developed for the IBM 360/91, 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between the IBM 360 and CDC 6600 ISAs
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  - IBM has 4 FP registers vs. 8 in the CDC 6600
  - IBM has long memory access delays and long FP delays
- Why study it? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, ...
69. Tomasulo's Algorithm
- Avoid RAW hazards
  - Execute an instruction only when its operands are available
  - Has a scheme to track when operands are available
- Avoid WAR and WAW hazards
  - Supports register renaming (even across branches)
  - Renames all destination registers; an out-of-order write does not affect any instruction that depends on an earlier value of an operand
- Example (before -> after renaming; S and T are temporary registers):
    DIVD F0, F2, F4      DIVD F0, F2, F4
    ADDD F6, F0, F8      ADDD S, F0, F8
    SD   F6, 0(R1)       SD   S, 0(R1)
    SUBD F8, F10, F14    SUBD T, F10, F14
    MULD F6, F10, F8     MULD F6, F10, T
  (WAR: SUBD writes F8, which ADDD reads; WAW: ADDD and MULD both write F6)
- Supports the overlapped execution of multiple iterations of a loop
70. Tomasulo's Algorithm vs. Scoreboard
- Control and buffers are distributed with the functional units (FUs), vs. centralized in the scoreboard, and there is bypassing
- The FU buffers are called reservation stations; they hold pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming
  - It avoids WAR and WAW hazards
  - There are more reservation stations than registers, so the hardware can do optimizations compilers can't
- Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with reservation stations as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
71. MIPS FP Unit Using Tomasulo's Algorithm
(Diagram: the FP op queue and FP registers feed load buffers, store buffers, and the FP add and FP multiply reservation stations, all connected by the Common Data Bus)
72. Three Stages of Tomasulo's Algorithm
- Issue: get an instruction from the FP op queue
  - If a reservation station is free (no structural hazard), issue the instruction with its operand values (if they are in the registers)
  - If the reservation stations are busy, the instruction stalls
  - If the operands are not in the registers, rename the registers (eliminating WAR and WAW hazards) and keep track of the functional units that will produce the operands
- Execution: operate on operands (EX)
  - If both operands are ready, execute
  - If not ready, watch the Common Data Bus for the result (avoiding RAW hazards)
  - Preserve exception behavior: no instruction executes unless all preceding branches have completed
- Write result: finish execution (WB)
  - Write on the Common Data Bus to all units and mark the reservation station available
  - A normal data bus carries data + destination (a "go to" bus); the Common Data Bus carries data + source (a "come from" bus) and broadcasts
- Each stage can take a different number of clock cycles
73. Reservation Station Components
- Op: the operation to perform in the unit (e.g., + or -)
- Vj, Vk: the values of the source operands
  - Store buffers have a V field holding the result to be stored
- Qj, Qk: the reservation stations producing the source operands (Qj, Qk = 0 => ready)
- Busy: indicates that the reservation station or FU is busy
- Qi (register result status): indicates which functional unit (if any) will write the register
  - 0 when no pending instruction will write this register
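The station fields above can be sketched as a record. This is a hedged Python sketch (field names follow the slide; using the empty string for "Qj, Qk = 0" and the `ready_to_execute` helper are my own choices) showing the condition under which a station may begin execution.

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    busy: bool = False
    op: str = ""
    Vj: float = 0.0  # operand values (copied in at issue, removing WAR hazards)
    Vk: float = 0.0
    Qj: str = ""     # producing stations; "" plays the role of Qj = 0 (ready)
    Qk: str = ""

    def ready_to_execute(self):
        # Qj = Qk = 0 on the slide: both operands are actual values
        return self.busy and self.Qj == "" and self.Qk == ""
```

A station whose Qj or Qk names another station waits, watching the Common Data Bus for that tag; once both fields are cleared to values, execution may begin.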
74. Tomasulo's Data Structures
75. Tomasulo's Example: Cycle 0
76. Register Renaming
- Register renaming: change register names to eliminate WAR/WAW hazards
  - Hardware renaming: one of the most elegant ideas in architecture
- Key: think of architectural registers as names, not locations
  - There can be more locations than names
  - Dynamically map names to locations
  - Map table: a hardware structure that holds the current mappings
  - Writes allocate a new location and note it in the map table
  - Reads find the location of the most recent write by looking in the map table
  - Locations must be de-allocated appropriately (a slight detail)
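The map-table idea above can be sketched in a few lines. This is a toy Python sketch (class and method names are my own; de-allocation is omitted, the "slight detail" the slide mentions): writes allocate fresh locations, reads see the most recent mapping, so two writes to the same name never collide.

```python
class MapTable:
    """Toy map table: architectural names map to physical locations,
    and every write allocates a fresh location."""
    def __init__(self):
        self.mapping = {}   # architectural name -> current location
        self.next_loc = 0   # next free physical location

    def rename_write(self, reg):
        # A write allocates a new location and records it in the table
        loc = self.next_loc
        self.next_loc += 1
        self.mapping[reg] = loc
        return loc

    def rename_read(self, reg):
        # A read finds the location of the most recent write
        return self.mapping[reg]
```

Because the second write to a name gets its own location, the WAW (and, with readers holding their old locations, the WAR) ordering constraints disappear.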
77. Tomasulo: Register Renaming
- Locations: the register file and the reservation stations (RS)
  - Values can (and do) exist in both!
- Value copies are used to eliminate WAR hazards
  - Called value-based or copy-based renaming, not pointer-based renaming
- Locations are referred to internally by tags (4-bit specifiers)
  - The map table translates names to tags
  - After translation, names are discarded
  - The CDB broadcasts values with tags attached, so an RS knows what it is looking at
(CDB = Common Data Bus)
78. Tomasulo: Register Renaming
- An operation that creates a value maps its destination register
  - On dispatch, the register is renamed to the tag of the allocated RS
  - Register table entry := RS number
  - On completion, the register is written and the register table entry := 0
- A subsequent operation looks up its sources in the register table
  - Entry = 0 -> the register has already been written
    - Copy the register value to the RS
    - This eliminates WAR hazards (the RS holds a private, valid copy of the register)
  - Entry != 0 -> the register value is not ready; some RS will provide it
    - Copy the entry (an RS tag) to the RS and monitor the CDB for that tag
(CDB = Common Data Bus)
79. Tomasulo's Algorithm: A Loop-Based Example
- If we predict that branches are taken
  - Reservation stations allow multiple executions of the loop to proceed at once
  - Advantage gained without changing the code: the loop is unrolled dynamically, and renaming at the reservation stations acts as additional registers
- Loads and stores
  - Can be done in any order if they access different addresses
  - If they access the same address: interchanging a load and a store leads to WAR/RAW hazards; interchanging two stores leads to WAW
- Detecting hazards
  - Compute the effective data memory address and check for an address conflict with the memory address of any earlier memory operation; wait on a match
- Stores must keep their relative order with respect to loads and other stores; loads can be reordered freely among themselves
80. Comparison: Tomasulo vs. Scoreboard
- Key difference: Tomasulo uses distributed hazard detection, vs. the scoreboard's centralized control
81. Review: Tomasulo
- Prevents the register file from becoming a bottleneck
- Avoids the WAR and WAW hazards of the scoreboard
- Allows loop unrolling in hardware
- Not limited to basic blocks (when branch prediction is provided)
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants: PowerPC 604 and 620, MIPS R10000, HP PA-8000, Intel Pentium Pro
82. Dynamic Hardware Prediction
- Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not?
- The hardware can look for clues based on the instructions, or it can use past history; we will discuss both of these directions.
83. Dynamic Branch Prediction
- Performance = f(accuracy, cost of misprediction)
- Branch History Table (BHT), or branch prediction buffer
  - The lower bits of the PC address index a table of 1-bit values
  - Each entry says whether or not the branch was taken last time
- Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop
  - At the end of the loop, when it exits instead of looping as before
  - On the first time through the loop on the next pass, when it predicts exit instead of looping
  - A typical loop branch is not taken only on the last iteration, so this is twice the rate at which the branch itself is not taken
- The prediction may also come from another branch with the same low-order address bits
84. Branch Prediction Buffers
- 2-bit scheme: change the prediction only after two consecutive mispredictions
- The four states are the saturating-counter values 00, 01, 10, 11; the high bit gives the prediction
- This does not help the five-stage classic pipeline, which already finds the branch direction and the next PC in the ID stage (assuming no hazard in accessing the register)
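The 2-bit scheme above is a saturating counter. This is a minimal Python sketch (class and method names are my own): states 2 and 3 predict taken, each outcome moves the counter one step, and the prediction flips only after two consecutive mispredictions, which is what saves a loop branch from the 1-bit table's double misprediction.

```python
class TwoBitPredictor:
    # Saturating counter 0..3; states 2 and 3 predict taken.
    def __init__(self, state=3):
        self.state = state

    def predict_taken(self):
        return self.state >= 2

    def update(self, taken):
        # Step toward taken (up) or not taken (down), saturating at 0 and 3
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
```

For a loop branch, the single not-taken outcome at loop exit nudges the counter from 3 to 2, so the predictor still says taken when the loop is next entered: one misprediction per loop execution instead of two.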
85. Branch History Table (BHT) Accuracy
- We mispredict because either
  - The guess was wrong for that branch, or
  - We got the branch history of the wrong branch when indexing the table
- 4096-entry table
  - Misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  - The misprediction rate for integer benchmarks (gcc, espresso, eqntott, etc.; average 11%) is substantially higher than for FP programs (nasa7, matrix300, tomcatv, etc.; average 4%)
- 4096 entries (2 bits per entry) perform about as well as an infinite table
  - But 4096 entries is a lot of hardware
86. Correlating Branch Predictors
- Branch predictors that use the behavior of other branches to make a prediction
  - Also called two-level predictors
- Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch's own history)
- The behavior of recent branches then selects between, say, four predictions for the next branch, and only the selected prediction is updated
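The select-and-update idea above can be sketched as a (2,2) predictor. This is a hedged Python sketch (the class name, the 2 bits of global history, and the table sizes are my own assumptions, not from the slides): the last two branch outcomes pick one of four 2-bit counters per table entry, and only the selected counter is updated.

```python
class CorrelatingPredictor:
    """(2,2) predictor sketch: 2 bits of global history select one of
    four 2-bit saturating counters per branch-table entry."""
    def __init__(self, entries=1024):
        self.entries = entries
        self.history = 0                       # last two branch outcomes
        self.table = [[3] * 4 for _ in range(entries)]

    def predict(self, pc):
        # The global history selects which counter makes the prediction
        counter = self.table[pc % self.entries][self.history]
        return counter >= 2

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        h = self.history
        # Update only the counter the history selected
        row[h] = min(3, row[h] + 1) if taken else max(0, row[h] - 1)
        # Shift the outcome into the 2-bit global history
        self.history = ((self.history << 1) | int(taken)) & 0b11
```

On a strictly alternating taken/not-taken branch, a plain 2-bit counter keeps mispredicting, while this predictor quickly becomes perfect: each history pattern gets its own counter.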
87. Accuracy of Different Schemes
(Chart: frequency of mispredictions, 0% to 18%, compared across three schemes: 4096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; and 1024 entries with 2 bits of history and 2 bits per entry)
88. Branch Target Buffer (BTB)
- Use the address of the branch as an index to get the prediction AND the branch target address (if taken)
  - Note: we must now check that the entry matches this branch, since we can't use the wrong branch's address
- Done at the IF stage: better than computing the branch at the ID stage in the 5-stage pipeline
- Penalty: 2 clock cycles (1 to update the buffer + 1 to fetch the new target)
- Return instruction addresses are predicted with a stack
(BTB entry: branch PC -> predicted PC, plus a taken/not-taken prediction)
89. Example
- What is the total branch penalty for a BTB with
  - Prediction accuracy of 90%
  - Hit rate in the buffer of 90%
  - 60% of the branches taken

  In buffer?  Prediction  Actual     Penalty (cycles)
  Yes         Taken       Taken      0
  Yes         Taken       Not taken  2
  No          -           Taken      2
  No          -           Not taken  0

Penalty cases: predicted taken but not taken (2 cycles); branch taken but not found in the buffer (2 cycles).
Branch penalty = (buffer hit rate x fraction of incorrect predictions x 2) + ((1 - buffer hit rate) x fraction of taken branches x 2)
Branch penalty = (90% x 10% x 2) + (10% x 60% x 2) = 0.18 + 0.12 = 0.30 clock cycles
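The arithmetic above generalizes to any BTB parameters. This is a small Python sketch (the function and parameter names are my own) of the slide's penalty model, with the same 2-cycle costs for the two penalty cases.

```python
def btb_branch_penalty(hit_rate, accuracy, frac_taken):
    """Expected penalty cycles per branch for the slide's BTB model."""
    # Predicted taken (BTB hit) but actually not taken: 2 cycles
    mispredicted = hit_rate * (1 - accuracy) * 2
    # Taken branch that missed in the buffer: 2 cycles
    missed_taken = (1 - hit_rate) * frac_taken * 2
    return mispredicted + missed_taken
```

With the slide's numbers (90% hit rate, 90% accuracy, 60% taken), the model reproduces the 0.30-cycle result.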
90. Multiple Issue
- Multiple issue is the ability of the processor to start more than one instruction in a given cycle
- Superscalar processors
- Very Long Instruction Word (VLIW) processors
91. 1990s Superscalar Processors
- Bottleneck: CPI >= 1
  - Limit on scalar performance (single instruction issue)
  - Hazards
  - Superpipelining? Diminishing returns (hazards + overhead)
- How can we make the CPI 0.5?
  - Issue multiple instructions in every pipeline stage (superscalar):

           1   2   3   4   5   6   7
    Inst0  IF  ID  EX  MEM WB
    Inst1  IF  ID  EX  MEM WB
    Inst2      IF  ID  EX  MEM WB
    Inst3      IF  ID  EX  MEM WB
    Inst4          IF  ID  EX  MEM WB
    Inst5          IF  ID  EX  MEM WB
92. Superscalar Processors
- Pioneer: IBM (America -> RIOS, RS/6000, Power-1)
- Superscalar instruction combinations
  - 1 ALU, memory, or branch op + 1 FP op (RS/6000)
  - Any 1 op + 1 ALU op (Pentium)
  - Any 1 ALU or FP op + 1 ALU op + 1 load + 1 store + 1 branch (Pentium II)
- Impact of superscalar
  - More opportunity for hazards (why?)
  - More performance loss due to hazards (why?)
93. Superscalar Processors
- Issue a varying number of instructions per clock
- Scheduling: static (by the compiler) or dynamic (by the hardware)
- A superscalar processor issues a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo)
- Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP PA-8000
94Elements of Advanced Superscalars
- High performance instruction fetching
- Good dynamic branch and jump prediction
- Multiple instructions per cycle, multiple
branches per cycle? - Scheduling and hazard elimination
- Dynamic scheduling
- Not necessarily: Alpha 21064 and Pentium were
statically scheduled - Register renaming to eliminate WAR and WAW
- Parallel functional units, paths/buses/multiple
register ports - High performance memory systems
- Speculative execution
- Precise interrupts
95SS DS Speculation
- Superscalar Dynamic scheduling Speculation
- Three great tastes that taste great together
- CPI >= 1?
- Overcome with superscalar
- Superscalar increases hazards
- Overcome with dynamic scheduling
- RAW dependences still a problem?
- Overcome with a large window
- Branches a problem for filling large window?
- Overcome with speculation
963GTtTGT II (3 Great Tastes that Taste Great Together)
- Static ILP
- VLIW (very long instruction word)
- To get IPC > 1
- Static scheduling (pipeline scheduling)
- To overcome data hazards
- Static scheduling/software speculation (loop
unrolling) - More instructions for scheduling flexibility,
overcome control hazards - Case for VLIW compiler complexity doesnt impact
clock!
97VLIW
- VLIW Very long instruction word
- In-order pipe, but each instruction is N
instructions (VLIW) - Typically slotted (I.e., 1st must be ALU, 2nd
must be load,etc., ) - VLIW travels down pipe as a unit
- Compiler packs independent instructions into VLIW
- Processor does not have logic to interlock
instructions within a VLIW - Pure VLIW
- Fixed instruction latencies, processor can't
interlock between VLIWs
[VLIW pipeline diagram: shared IF and ID stages feeding parallel execution slots (ALU, ALU, address/MEM, FP, FP), each ending in its own WB]
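As a hedged sketch of what "compiler packs independent instructions into VLIW" means, the toy scheduler below greedily bundles instructions in order, breaking a bundle whenever a register dependence (RAW, WAR, or WAW) appears within it. The (dest, srcs) tuples are an invented toy IR, and real packers also respect slot types and latencies:

```python
# Greedy, in-order packing of independent instructions into
# fixed-width VLIW bundles (toy model, slot typing ignored).
def pack_vliw(instrs, width):
    bundles, current = [], []
    for dest, srcs in instrs:
        conflict = any(
            dest == d or dest in s or d in srcs  # WAW, WAR, RAW vs bundle
            for d, s in current
        )
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append((dest, srcs))
    if current:
        bundles.append(current)
    return bundles

loop = [("F0", ["R1"]), ("F6", ["R1"]),   # two independent loads
        ("F4", ["F0", "F2"]),             # RAW on F0: starts a new bundle
        ("F8", ["F6", "F2"])]
print(len(pack_vliw(loop, width=2)))  # 2
```

The RAW check is what forces the ADDD into a later bundle than the LD it depends on, mirroring the "no interlocks within a VLIW" rule above.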
98Very Long Instruction Word
- VLIW - issues a fixed number of instructions
formatted either as one very large instruction or
as a fixed packet of smaller instructions - Fixed number of instructions (4-16) scheduled by
the compiler put operators into wide templates - Started with microcode (horizontal microcode)
- Joint HP/Intel agreement in 1999/2000
- Intel Architecture-64 (IA-64) 64-bit address
/Itanium - Explicitly Parallel Instruction Computer (EPIC)
- Transmeta translates X86 to VLIW
- Many embedded controllers (TI, Motorola) are VLIW
99Superscalar Vs. VLIW
- Religious debate, similar to RISC vs. CISC
- Wisconsin + Michigan (superscalar) vs. Illinois
(VLIW) - Q. Who can schedule code better, hardware or
software?
100Hardware Scheduling
- High branch prediction accuracy
- Dynamic information on latencies (cache misses)
- Dynamic information on memory dependences
- Easy to speculate (and recover from
mis-speculation) - Works for generic, non-loop, irregular code
- Ex databases, desktop applications, compilers
- -ves
- Limited reorder buffer size limits lookahead
- High cost/complexity
- Slow clock
101Software Scheduling
- Large scheduling scope (full program), large
lookahead - Can handle very long latencies
- Simple hardware with fast clock
- Only works well for regular codes (scientific,
FORTRAN) - -ves
- Low branch prediction accuracy
- Can improve by profiling
- No information on latencies like cache misses
- Can improve by profiling
- Pain to speculate and recover from
mis-speculation - Can improve with hardware support
102Profiling
- Information from previous program run
- Must use different input!
- Softwares answer to everything
- Works OK, but only OK
- Popular research topic
- Gaining importance
103Pure VLIW What Does VLIW Mean?
- All latencies fixed
- All instructions in VLIW issue at once
- No hardware interlocks at all
- Compiler responsible for scheduling entire
pipeline - Includes stall cycles
- Possible if you know structure of pipeline and
latencies exactly
104Problems with Pure VLIW
- Latencies are not fixed (e.g., caches)
- Option I: don't use caches (forget it)
- Option II: stall whole pipeline on a miss? (need
interlocks) - Option III: stall instructions waiting for
memory? (need out-of-order) - Different implementations
- Different pipe depths, different latencies
- New pipeline may produce wrong results (code
stalls in wrong place) - Recompile for new implementations?
- Code compatibility is very important, made Intel
what it is
105Tainted VLIW
- EPIC (IA64, Itanium)
- Less rigid than VLIW (Not really VLIW at all)
- Architecture variable width instruction words
- Implemented as bundles with dependence bits
- Makes code compatible with different width
machines - Implementation interlocks
- Makes code compatible with different pipelines
- Enables stalls on cache misses
- Actually enables out-of-order, too
- Explicitly parallel RISC with support for
software speculation
106Key Static Scheduling
- VLIW relies on the fact that software can
schedule code well - Three techniques
- Loop unrolling (we have seen this one already)
- Problems
- Code growth
- Poor scheduling along seams of unrolled copies
- Doesn't handle carried dependences
(inter-iteration dependences or recurrences) - Software pipelining (symbolic loop unrolling)
- Trace scheduling
107VLIW
- 3 instructions in 128-bit groups; a template field
determines whether the instructions are dependent or
independent - Smaller code size than old VLIW, larger than
x86/RISC - Groups can be linked to show independence gt 3
instr - 64 integer registers 64 floating point
registers - Not separate files per functional unit as in old
VLIW - Hardware checks dependencies (interlocks gt
binary compatibility over time) - Predicated execution (select 1 out of 64 1-bit
flags) gt 40 fewer mispredictions? - IA-64 name of instruction set architecture EPIC
is type - Merced is name of first implementation
(1999/2000?) Itanium?
108Superscalar Version of DLX
- can handle 2 instructions/cycle
- Floating Point
- Anything Else
- Fetch 64 bits/clock cycle; Int on left, FP on
right - Can only issue 2nd instruction if 1st
instruction issues - More ports for FP registers to do FP load FP
op in a pair - Type Pipe Stages
- Int. instruction IF ID EX MEM WB
- FP instruction IF ID EX MEM WB
- Int. instruction IF ID EX MEM WB
- FP instruction IF ID EX MEM WB
- Int. instruction IF ID EX MEM WB
- FP instruction IF ID EX MEM WB
-
- 1 cycle load delay can cause delay to 3
instructions in Superscalar - instruction in right half cant use it, nor
instructions in next slot
109Unrolled Loop Minimizes Stalls for Scalar
1 Loop: LD F0, 0(R1)
2  LD F6, -8(R1)
3  LD F10, -16(R1)
4  LD F14, -24(R1)
5  ADDD F4, F0, F2
6  ADDD F8, F6, F2
7  ADDD F12, F10, F2
8  ADDD F16, F14, F2
9  SD F4, 0(R1)
10 SD F8, -8(R1)
11 SD F12, -16(R1)
12 SUBI R1, R1, 32
13 BNE R1, R2, LOOP
14 SD F16, 8(R1)
14 clock cycles, or 3.5 clocks per iteration
Latencies: LD to ADDD 1 cycle; ADDD to SD 2 cycles
110Loop Unrolling in Superscalar
- Integer instruction FP instruction Clock cycle
- Loop LD F0, 0(R1) 1
- LD F6, -8(R1) 2
- LD F10, -16(R1) ADDD F4, F0, F2 3
- LD F14, -24(R1) ADDD F8, F6, F2 4
- LD F18, -32(R1) ADDD F12, F10, F2 5
- SD F4, 0(R1) ADDD F16, F14, F2 6
- SD F8, -8(R1) ADDD F20, F18, F2 7
- SD F12, -16(R1) 8
- SD F16, -24(R1) 9
- SUBI R1,R1,40 10
- BNE R1, R2, LOOP 11
- SD F20, 8(R1) 12
- Unrolled 5 times to avoid delays (+1 due to SS)
- 12 clocks, or 2.4 clocks per iteration
Static Scheduling
111Dynamic Scheduling in Superscalar
- Code compiled for the scalar version will run poorly
on Superscalar - May want code to vary depending on Superscalar
Architecture - Simple approach Separate Tomasulo Control for
separate reservation stations for Integer FU/Reg
and for FP FU/Reg
112Dynamic Scheduling in Superscalar
- How to do instruction issue with two instructions
and keep in-order instruction issue for Tomasulo? - Issue 2X Clock Rate, so that issue remains in
order - Only FP loads might cause dependency between
integer and FP issue - Replace load reservation station with a load
queue operands must be read in the order they
are fetched - Load checks addresses in Store Queue to avoid RAW
violation - Store checks addresses in Load Queue to avoid
WAR, WAW
113Performance of Dynamic Superscalar
- Iteration no. / Instruction / Issues / Executes /
Writes result (clock-cycle numbers)
- 1 LD F0, 0(R1) 1 2 4
- 1 ADDD F4, F0, F2 1 5 8
- 1 SD F4, 0(R1) 2 9
- 1 SUBI R1, R1, 8 3 4 5
- 1 BNEZ R1, LOOP 4 5
- 2 LD F0, 0(R1) 5 6 8
- 2 ADDD F4, F0, F2 5 9 12
- 2 SD F4, 0(R1) 6 13
- 2 SUBI R1, R1, 8 7 8 9
- 2 BNE R1, R2, LOOP 8 9
- 4 clocks per iteration
- Branches, Decrements still take 1 clock cycle
114Loop Unrolling in VLIW
- Memory Memory FP FP Int. op/ Clockreference
1 reference 2 operation 1 op. 2 branch - LD F0,0(R1) LD F6,-8(R1) 1
- LD F10,-16(R1) LD F14,-24(R1) 2
- LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD
F8,F6,F2 3 - LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4
- ADDD F20,F18,F2 ADDD F24,F22,F2 5
- SD F4, 0(R1) SD F8, -8(R1) ADDD F28,F26,F2 6
- SD F12, -16(R1) SD F16, -24(R1) 7
- SD F20, -32(R1) SD F24, -40(R1) SUBI
R1,R1,48 8 - SD F28, -0(R1) BNE R1, R2, LOOP 9
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per
iteration - Need more registers to effectively use VLIW
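The per-iteration figures quoted in the last three examples follow directly from total clocks divided by the number of unrolled iterations:

```python
# Clocks per original iteration for the three schedules shown above:
# scalar unrolled x4, 2-issue superscalar unrolled x5, VLIW unrolled x7.
schedules = {
    "scalar, unrolled x4":     (14, 4),
    "2-issue SS, unrolled x5": (12, 5),
    "VLIW, unrolled x7":       (9, 7),
}
for name, (clocks, iters) in schedules.items():
    print(f"{name}: {clocks / iters:.1f} clocks/iteration")
# scalar 3.5, superscalar 2.4, VLIW 1.3
```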
115Limits to Multi-Issue Machines
- Inherent limitations of ILP
- 1 branch in 5 instructions => how to keep a 5-way
VLIW busy? - Latencies of units => many operations must be
scheduled - Need about Pipeline Depth x No. Functional Units
of independent operations to keep machines busy. - Difficulties in building HW
- Duplicate Functional Units to get parallel
execution - Increase ports to Register File (VLIW example
needs 6 read and 3 write for Int. Reg. 6 read
and 4 write for Reg.) - Increase ports to memory
- Decoding SS and impact on clock rate, pipeline
depth
SS = superscalar
116Limits to Multi-Issue Machines
- Limitations specific to either SS or VLIW
implementation - Decode issue in SS
- VLIW code size unroll loops wasted fields in
VLIW - VLIW lock step gt 1 hazard all instructions
stall - VLIW binary compatibility
117Multiple Issue Challenges
- While Integer/FP split is simple for the HW, get
CPI of 0.5 only for programs with - Exactly 50 FP operations
- No hazards
- If more instructions issue at same time, greater
difficulty of decode and issue - Even 2-scalar gt examine 2 opcodes, 6 register
specifiers, decide if 1 or 2 instructions can
issue - VLIW tradeoff instruction space for simple
decoding - The long instruction word has room for many
operations - By definition, all the operations the compiler
puts in the long instruction word are independent
=> execute in parallel - E.g., 2 integer operations, 2 FP ops, 2 Memory
refs, 1 branch - 16 to 24 bits per field => 7 x 16 = 112 bits to
7 x 24 = 168 bits wide - Need compiling technique that schedules across
several branches
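The width arithmetic works out as follows; the 7-field mix (2 integer, 2 FP, 2 memory, 1 branch) is the slide's example configuration:

```python
# VLIW word width: 7 operation fields at 16 to 24 bits each.
fields = 2 + 2 + 2 + 1       # int, FP, memory, branch slots
print(fields * 16, fields * 24)  # 112 168
```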
118Compiler Support For ILP
- How can compilers be smart?
- Produce good scheduling of code.
- Determine which loops might contain parallelism.
- Eliminate name dependencies.
- Compilers must be REALLY smart
- Figure out aliases
- Pointers in C are a real problem
- Techniques lead to
- Symbolic Loop Unrolling
- Critical Path Scheduling
119Symbolic Loop Unrolling
- Observation
- if iterations from loops are independent, then
can get ILP by taking instructions from different
iterations - Software pipelining
- reorganizes loops so that each iteration is made
from instructions chosen from different
iterations of the original loop (Tomasulo in SW)
120Software Pipelining
- Software pipelining (symbolic loop unrolling)
- Really is pipelining in software
- One physical iteration
- Contains instructions from multiple original
iterations - Each instruction in different stage
- Need prologue and epilogue to start and flush the
pipeline
121Symbolic Loop Unrolling SW Pipelining Example
- Before Unrolled 3 times
- 1 LD F0,0(R1)
- 2 ADDD F4,F0,F2
- 3 SD F4,0(R1)
- 4 LD F6,-8(R1)
- 5 ADDD F8,F6,F2
- 6 SD F8,-8(R1)
- 7 LD F10,-16(R1)
- 8 ADDD F12,F10,F2
- 9 SD F12,-16(R1)
- 10 SUBI R1,R1,24
- 11 BNE R1, R2, LOOP
After: Software Pipelined
Prologue: LD F0, 0(R1); ADDD F4, F0, F2; LD F0, -8(R1)
1 SD F4, 0(R1)    ; stores M[i]
2 ADDD F4, F0, F2 ; adds to M[i-1]
3 LD F0, -16(R1)  ; loads M[i-2]
4 SUBI R1, R1, 8
5 BNE R1, R2, LOOP
Epilogue: SD F4, 0(R1); ADDD F4, F0, F2; SD F4, -8(R1)
Note: Within a physical iteration, the instructions are
unrelated. Perfect for VLIW!!
[Pipeline diagram: SD, ADDD, and LD overlapped (staggered IF ID EX Mem WB); F4 is read by SD before ADDD writes it, and F0 is read by ADDD before LD writes it]
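A small sketch of the iteration renaming that software pipelining performs: stage s of the loop body executes on behalf of iteration i - (stages - 1 - s), so the kernel mixes the SD, ADDD, and LD of three consecutive original iterations. The stage list and helper are illustrative, not a real compiler pass:

```python
# Stages of x[i] = x[i] + s, earliest first.
body = ["LD", "ADDD", "SD"]

def kernel(body, i):
    # In kernel iteration i, the last stage (SD) belongs to iteration i,
    # the middle stage (ADDD) to i-1, and the first stage (LD) to i-2.
    n = len(body)
    return [(op, i - (n - 1 - s)) for s, op in reversed(list(enumerate(body)))]

print(kernel(body, i=10))  # [('SD', 10), ('ADDD', 9), ('LD', 8)]
```

This matches the annotated kernel above: the store finishes M[i] while the load already starts M[i-2].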
122Symbolic Loop Unrolling
- Less code space
- Overhead paid only once vs. each iteration
in loop unrolling
[Figure: software pipelining pays the prologue/epilogue overhead once, while loop unrolling of 100 iterations into 25 loops of 4 unrolled iterations pays loop overhead in every unrolled loop]
123Software Pipelining
- Doesnt increase code size (much)
- Good scheduling at iteration seams
- Can vary degree of pipelining to tolerate longer
latencies - software superpipelining
- One physical iteration = instructions from logical
iterations i, i+2, i+4 - -ves
- Hard to do conditionals within loops
- Tricky register allocation sometimes
- Not everything is loops
124Trace Scheduling
- Trace scheduling
- For general non-loop situations
- Basic idea
- Find common paths in program
- Realign basic blocks to form straight-line trace
- Basic block single-entry, single-exit
instruction sequence - Trace (aka superblock, hyperblock) fused basic