CSE 7381

1
Lecture 7: More ILP with Multiple Issue and Speculation

Prof. Fatih Koçan
CSE 7/5381 Computer Architecture
Fall 2002

2
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
Multimedia instructions are being added to many processors
Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
(Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates
Intel Architecture-64 (IA-64): 64-bit address
Renamed Explicitly Parallel Instruction Computer (EPIC)
Anticipated success of multiple instruction issue led to Instructions Per Clock cycle (IPC) vs. CPI

3
Getting CPI < 1: Superscalar Processors

Issue a varying number of instructions per clock (1-8)
Statically scheduled by compiler
In-order execution
Dynamically scheduled by hardware
Use techniques based on Tomasulo's Algorithm
Out-of-order execution

4
VLIW: Very Long Instruction Word

Issue fixed number of instructions
Two Formats
One large instruction
A fixed instruction packet
The parallelism among instructions is explicitly indicated by the instruction
EPIC: explicitly parallel instruction computers
EPIC and VLIW processors are statically scheduled by the compiler

5
Multi-issue Approaches
6
Statically Scheduled Superscalar Processors

HW might issue 0-8 instructions/cycle
In-order issue
Arbitrary K-issue
any combination of K instructions in any order
Non arbitrary K-issue
e.g. K/2 integer, K/2 float instructions
All pipeline hazards are checked for at the issue stage
Check for the hazards
Among the instructions being issued (Is), and between Is and the instructions already in execution (IE)

7
The Process of Instruction Issue

K-issue, dynamically scheduled superscalar
processor

Issue Packet: 0 ≤ I ≤ K
Pipeline stages (from figure): IPreF, IF, IS1, IS2, EX

IPreF: Prefetches instructions for the superscalar
IF: Conceptually, IF examines each instruction in the Issue Packet for hazards in program order
IS1: Decides how many instructions from the packet can be issued simultaneously
IS2: Examines the instructions selected in IS1 against already issued instructions for hazards

8
ISSUE Stage

Complex, determines the pipeline cycle time
ISSUE stage is pipelined to issue instructions
every cycle
Many statically scheduled and all dynamically
scheduled superscalars have pipelined issue stage
Higher branch penalties
Increase issue rate → further pipeline the IS stage (not easy!)
Limitation on clock rate of superscalars

9
A Statically Scheduled Superscalar MIPS

Issue 2 instructions/cycle: 1 FP + 1 Anything
1 Anything = LD, LDD, SD, SDD, BR, Int ALU, FP Move
Fetch 64 bits/cycle
Can only issue 2nd instruction if 1st instruction issues → in-order issue
HP 7100, Desktops
Arbitrary Dual issue
Any combination of two instructions
Embedded processors

10
Statically Scheduled Superscalar MIPS

Superscalar MIPS: 2 instructions
Fetch 64 bits/clock cycle: <INT, FP> or <FP, INT>
More ports for FP registers to do FP load & FP op in a pair

Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  EX  EX  WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  EX  EX  WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  EX  EX  WB

1 cycle load delay expands to 3 instructions in SS
The instruction in the right half can't use it, nor can the instructions in the next slot

11
Different Issue Combinations

Type              Pipe Stages
FP instruction    IF  ID  EX  EX  EX  WB
Int. instruction  IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  EX  EX  WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  EX  EX  WB
Int. instruction          IF  ID  EX  MEM WB

Type              Pipe Stages
FP instruction    IF  ID  EX  EX  EX  WB
Int. instruction  IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  EX  EX  WB
FP instruction            IF  ID  EX  EX  EX  WB
Int. instruction          IF  ID  EX  MEM WB

12
Issue Process of 2-Issue MIPS

Fetch two instructions from the Prefetch unit or
from the cache
Determine how many instructions can be issued: 0, 1 or 2
Issue them to correct functional units
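
The issue decision for a fetched pair can be sketched in Python. This is only an illustration under stated assumptions: the `Instr` record, the "FP"/"INT" labels, and the `busy_regs` set (registers still being produced by already-issued instructions) are hypothetical names, and the pairing rule assumed is one FP operation plus one non-FP instruction per packet.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    kind: str                     # "FP" for an FP operation, "INT" for anything else (assumed labels)
    dest: Optional[str] = None    # destination register, if any
    srcs: Tuple[str, ...] = ()    # source registers

def issue_count(first, second, busy_regs=frozenset()):
    """Return how many of the two fetched instructions issue this cycle: 0, 1 or 2."""
    def blocked(i):
        # Hazard against already-issued instructions (RAW on a source, WAW on the dest).
        return any(r in busy_regs for r in i.srcs) or i.dest in busy_regs

    if blocked(first):
        return 0                  # in-order issue: nothing passes a stalled first instruction
    if blocked(second):
        return 1
    if first.kind == second.kind:
        return 1                  # assumed: only one FP slot and one non-FP slot
    if first.dest is not None and (first.dest in second.srcs or first.dest == second.dest):
        return 1                  # dependence inside the issue packet
    return 2

load = Instr("INT", "F0", ("R1",))
fadd = Instr("FP", "F4", ("F0", "F2"))
print(issue_count(load, fadd))    # the ADD.D needs the loaded F0, so only the load issues
```

Real issue logic performs these checks in parallel hardware; the sequential checks here only show which conditions are examined.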

13
Fetching Two Instructions from I-Cache

Easy to fetch I1, I2
How about I2 I3 ?
Most processors issue only I2
Use a prefetch unit
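
A small Python sketch of why the I2, I3 pair is harder than I1, I2. It assumes 4-byte instructions and a fetch that always returns one 64-bit block aligned to 8 bytes; the function name `fetch_block` is hypothetical.

```python
def fetch_block(pc):
    """Return the instruction addresses usable from one aligned 64-bit fetch starting at pc."""
    block_start = pc - (pc % 8)                  # fetch is aligned to an 8-byte boundary
    slots = (block_start, block_start + 4)       # two 4-byte instructions per block
    return [addr for addr in slots if addr >= pc]

print(fetch_block(0))   # aligned pair: both instructions are usable
print(fetch_block(4))   # misaligned start: only one instruction is usable this cycle
```

Under these assumptions a prefetch unit helps because it can buffer blocks ahead of time and hand the issue stage two instructions regardless of alignment.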

14
2-Issue MIPS Hazard Checking

Potential Issue Packets
<>, <INT>, <FP>, <INT, FP>, <FP, INT>
Most hazard possibilities are eliminated within an Issue Packet
<FP load/store/move, FP>: FP register port contention
RAW hazard
WAR, WAW hazards across issue packet boundaries

15
Additional Hardware for Superscalars

Enhanced hazard detection
Minimized hardware support to execute integer and
floating point ins. in parallel
Different set of FP registers
Different set of Int registers
One additional FP read/write port
A larger set of bypass paths

16
Maintaining Precise Exception
Issue packet

Let the FP pipeline drain
DIV.D causes an exception after SUB.D exception
No precise exception at the HW level
Why? ADD.D destroys one of its operands
Approaches
Ignore the problem and settle for imprecise
exceptions
Buffer the results of an operation until all the
operations that were issued earlier are complete
Let the trap-handling routine create a precise sequence for the exception
Allow the instruction issue to continue only if
all the instructions before this instruction will
complete w/o causing an exception

17
1. Settle for Imprecise Exceptions

Virtual memory and the IEEE FP standard → require precise exceptions
Two modes of execution
Imprecise mode (fast)
Precise mode
Entered via a mode switch or by insertion of explicit FP-exception test instructions
The amount of overlap and reordering is significantly restricted
DEC Alpha 21064 & 21164, IBM Power I & II, MIPS R8000

18
2. Buffering the Results of an Operation

When the difference in running times among operations is large, the number of results to buffer becomes large
The results from the queue must be bypassed to
all issuing and executing instructions
Large number of comparators and a very large
multiplexor

19
2. Buffering Variations

History File (CYBER 180/990, VAX)
Keeps track of the original values of registers
Upon an exception, roll back and load the original values from the file
Future File
Keeps the newer value of register
Update the main register file from the future
file after all earlier instructions complete
On an exception main reg file is intact!
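
The two buffering variations can be contrasted with a minimal Python sketch. The class and method names are hypothetical, and the model is deliberately simplified: results are assumed to be written and completed in program order, which real hardware does not guarantee.

```python
class HistoryFile:
    """Registers are updated immediately; old values are logged so they can be restored."""
    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []                       # (register, original value), oldest first

    def write(self, reg, value):
        self.log.append((reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self):
        while self.log:                     # undo newest writes first
            reg, old = self.log.pop()
            self.regs[reg] = old


class FutureFile:
    """Newest values live in the future file; the main file changes only on completion."""
    def __init__(self, regs):
        self.main = dict(regs)              # precise state
        self.future = dict(regs)            # newest values, used as the operand source
        self.pending = []                   # (register, value) not yet moved to main

    def write(self, reg, value):
        self.future[reg] = value
        self.pending.append((reg, value))

    def complete_oldest(self):
        reg, value = self.pending.pop(0)
        self.main[reg] = value

    def exception(self):
        self.pending.clear()
        self.future = dict(self.main)       # main file is intact, so just copy it back


h = HistoryFile({"F0": 1.0})
h.write("F0", 5.0)
h.rollback()
print(h.regs)                               # original value restored
```

The design difference is where the cost falls: the history file pays on an exception (rollback), while the future file pays on every completion (copy to the main file).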

20
3. Trap-handling routine to create a precise
sequence of exceptions

Know what operations in the pipeline and their
PCs
The software finishes any instructions that
precede the latest instruction completed
I1: long-running, causes an exception
I2 ... In-1: not completed
In: completed
SW: simulate I1 ... In → major difficulty
HW: restart at In+1

21
4. All Instructions Before the Issuing Complete
w/o Exception

Stall the CPU to maintain precise exceptions
FP-functional units must determine if an
exception is possible early in EX stage
In the first 3 clock cycles in MIPS pipeline
MIPS R2000/3000/4000, Intel Pentium

22
Precise Exception Handling in SS MIPS

Int. op finishes before FP op
Integer instruction completes before FP op
exception detection
Imprecise exception
Solutions
Detecting FP exceptions early
Using software mechanisms to restore a precise
exception state before resuming execution
Delaying instruction completion until an
exception is impossible

Issue packet
The speculation approach uses the third solution
23
Load & Branch Stalls in SS MIPS

Load result is not available
on the same cycle
on the next cycle
Branch delay for taken branch
2 instructions if the branch is the first in the
packet
3 instructions if the branch is the second in the
packet

Total 3 instructions
24
Multiple Issue Challenges

While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with
Exactly 50% FP operations AND no hazards
If more instructions issue at the same time, decode and issue become more difficult
Even 2-scalar => examine 2 opcodes, 6 register specifiers, decide if 1 or 2 instructions can issue (N-issue: O(N^2-N) comparisons)
Register file needs 2x reads and 1x writes/cycle
Rename logic must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue
add r1, r2, r3   →   add p11, p4, p7
sub r4, r1, r2   →   sub p22, p11, p4
lw  r1, 4(r4)    →   lw  p23, 4(p22)
add r5, r1, r2   →   add p12, p23, p4
Imagine doing this transformation in a single cycle!
Result buses: need to complete multiple instructions/cycle
So, need multiple buses with associated matching logic at every reservation station.
Or, need multiple forwarding paths
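
The renaming example above can be reproduced with a short Python sketch. The initial map (r2 → p4, r3 → p7) and the order of the free list are assumptions chosen so the output matches the physical register numbers on the slide; the `rename` function itself is hypothetical and processes instructions one at a time, whereas the hardware must produce the same result for all four in a single cycle.

```python
def rename(instrs, map_table, free_list):
    """Rename (op, dest, srcs) tuples in program order; return renamed list and final map."""
    table = dict(map_table)
    free = list(free_list)
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = tuple(table[s] for s in srcs)   # sources use the mapping before this instr
        new_dest = free.pop(0)                     # every destination gets a fresh register
        table[dest] = new_dest                     # later readers of dest see the new name
        renamed.append((op, new_dest, new_srcs))
    return renamed, table

packet = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r2")),
    ("lw",  "r1", ("r4",)),        # base register only; the offset 4 is not renamed
    ("add", "r5", ("r1", "r2")),
]
renamed, final_map = rename(packet, {"r2": "p4", "r3": "p7"}, ["p11", "p22", "p23", "p12"])
for instr in renamed:
    print(instr)
```

Note that r1 is renamed twice within the packet (to p11 and then p23), and the last add must pick up the second mapping. That intra-packet dependence is what makes single-cycle renaming hard.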

25
Multiple Instruction Issue with Dynamic Scheduling

Issue an instruction in half of a cycle
Two instructions are processed in one cycle
Build the necessary logic to handle two instructions at once
Including any possible dependences between the two instructions
Both approaches are used at the same time
Pipeline and widen the issue logic
Integrate dynamic branch prediction into a dynamically scheduled pipeline

26
A two-issue dynamic scheduled processor

Issue any pair of instructions if a reservation station is available
Extend Tomasulo's scheme to deal with both integer and FP functional units and registers
Issue and write result take 1 cycle each
There is dynamic branch prediction hardware, a branch condition evaluation unit, 1 int. ALU, and pipelined FP units
LOOP: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDIU R1, R1, -8     ; decrement pointer 8 bytes (per DW)
      BNE    R1, R2, LOOP   ; branch R1 != R2
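
For readers less used to MIPS, here is a rough Python equivalent of the loop. It is a sketch only: memory is modeled as a hypothetical dictionary from byte address to value, and the function and parameter names are not from the slides.

```python
def add_scalar(mem, r1, r2, f2):
    """Add f2 to each 8-byte element, walking r1 downward until it equals r2."""
    while True:
        f0 = mem[r1]          # L.D    F0, 0(R1)
        f4 = f0 + f2          # ADD.D  F4, F0, F2
        mem[r1] = f4          # S.D    F4, 0(R1)
        r1 -= 8               # DADDIU R1, R1, -8
        if r1 == r2:          # BNE falls through when R1 == R2
            break
    return mem

memory = {8: 1.0, 16: 2.0, 24: 3.0}
print(add_scalar(memory, r1=24, r2=0, f2=10.0))
```

Only the L.D, ADD.D and S.D do useful work per element; the DADDIU and BNE are loop overhead.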

27
Latencies

Producer   Consumer   Cycles
ALU op     ALU op     1
Load       FP op      2
Load       ALU op     2
FP Add     FP Add     3
Branch prediction is perfect
Two CDBs
No delayed branch

28
First 3 iterations
IPC = 5/3 = 1.67; Execution rate = 15/16 = 0.94
29
Resource Usage
Do we need second CDB?
30
2-issue w/ additional resources

extra adder for effective address calculation

IPC = 5/3 = 1.67; Execution rate = 15/12 = 1.25
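
The arithmetic behind these figures, as a short Python check. The interpretation is an assumption: 5/3 is read as five loop instructions issued every three cycles, and 15/16 and 15/12 as fifteen instructions (three iterations) executed in 16 cycles without, and 12 cycles with, the extra adder.

```python
def rate(instructions, cycles):
    """Instructions per clock cycle."""
    return instructions / cycles

issue_ipc = rate(5, 3)               # sustained issue rate
exec_rate_base = rate(15, 16)        # original 2-issue configuration
exec_rate_extra = rate(15, 12)       # with the extra effective-address adder

print(round(issue_ipc, 2), round(exec_rate_base, 2), round(exec_rate_extra, 2))
print(round(exec_rate_extra / exec_rate_base, 2))   # speedup from the extra adder
```

Even with the extra adder the execution rate stays below the issue rate, consistent with the functional units being the bottleneck.
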
31
Resource Usage
Lower efficiency as measured by the utilization
of the functional unit
32
What limits the performance of 2-issue
dynamically scheduled pipeline?

Imbalance between the functional unit structure
of the pipeline and the example loop
Impossible to fully use the FP units
Need fewer dependent integer operations/loop
Very high loop overhead (2/5)
Try to reduce this overhead in the next chapter
The control hazard: cannot start the next L.D before we know the outcome of the branch (addressed next)

33
Hardware-based Speculation

A wide-issue processor may need to execute a branch every cycle
Prediction alone is not sufficient to expose a high amount of ILP
Overcome control dependence by speculating on the outcome of branches
Execute the program as if our guesses were correct
Dynamic scheduling: Fetch, Issue (no execute)
Speculation: Fetch, Issue, Execute
Incorrect speculation → Undo

34
Hardware-based Speculation: Key Ideas

Dynamic branch prediction to choose which
instructions to execute
Speculation to allow the execution of
instructions before the control dependence is
resolved
Dynamic scheduling to deal with the scheduling of
different combinations of basic blocks
PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel
Pentium II/III/4, Alpha 21264, AMD K5/K6/Athlon

35
Speculative Tomasulo's Algorithm

Separate bypassing of speculative and non-speculative results
Undo possible
Once an instruction is no longer speculative, it updates the register file or memory
Instruction Commit Stage
Key idea: out-of-order execution, in-order commit
Use a reorder buffer (ROB) for in-order commit

36
Reorder Buffer (ROB)

Holds the results of instructions that executed
but not committed
Passes results among instructions that may be
speculated
Like Store buffer in Tomasulo

ROB is the source of operands between the time execution completes and the time the instruction commits
37
Reorder Buffer Structure
38
Tomasulo With Reorder buffer
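[Figure: Tomasulo with reorder buffer. Components shown: FP Op Queue; Reorder Buffer with entries ROB7 (newest) down to ROB1 (oldest), each with Dest and Done? fields, one entry holding LD F0,10(R2) with Dest F0 and Done? = N; Registers; paths to and from memory; Reservation Stations with Dest fields; FP adders; FP multipliers. Source: Prof. John Kubiatowicz's slide]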
39
Four Steps in Speculative Tomasulo

Issue (dispatch)
Get an instruction from the queue; issue it if there is an empty reservation station and ROB slot, otherwise stall. Send operands to the reservation station if they are available in the ROB or the registers. Send the ROB entry number to the reservation station; later, the RS puts the result and tag on the CDB.
Execute (issue)
Wait for operands that are not ready by watching the CDB, i.e. checks for RAW hazards
Loads take 2 steps: check whether the load is at the head of the load buffer, then read from memory.
Stores: effective address calculation only.
Write Result
Put the result with its ROB tag on the CDB; all waiting reservation stations and the ROB read from the CDB
STORE: if the value to store is available, write it to the ROB slot; if not, watch the CDB to update the value field of the ROB slot
Commit (completion, graduation)
BRANCH w/ incorrect prediction: when it reaches the head of the ROB, flush the ROB and restart execution at the correct successor of the branch
STORE: when the store reaches the head of the ROB and its result is available → normal commit, write to memory
Any other instruction: when it reaches the head of the ROB and its result is available → normal commit, write to a register
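
The commit rules above can be sketched in Python. This is an illustration under assumptions, not the hardware design: `ROBEntry`, its field names, and the "alu"/"store"/"branch" labels are hypothetical, and only one instruction commits per call.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ROBEntry:
    kind: str                  # "alu", "store" or "branch" (assumed labels)
    dest: object = None        # register name, or memory address for a store
    value: object = None
    ready: bool = False        # Write Result has happened
    mispredicted: bool = False

def commit(rob, regs, mem):
    """Try to commit the instruction at the head of the ROB.
    Returns "stall", "commit" or "flush"."""
    if not rob or not rob[0].ready:
        return "stall"                     # in-order commit: younger ready entries must wait
    head = rob.popleft()
    if head.kind == "branch":
        if head.mispredicted:
            rob.clear()                    # discard all speculative work after the branch
            return "flush"
        return "commit"
    if head.kind == "store":
        mem[head.dest] = head.value        # memory is updated only at commit
    else:
        regs[head.dest] = head.value       # register file is updated only at commit
    return "commit"

rob = deque([
    ROBEntry("alu", "F0", 1.5, ready=True),
    ROBEntry("store", 100, 2.5, ready=False),
    ROBEntry("alu", "F2", 9.0, ready=True),
])
regs, mem = {}, {}
print(commit(rob, regs, mem), commit(rob, regs, mem), regs)
```

In the usage example the second call stalls: the F2 result is ready, but it cannot commit ahead of the older store, which is what keeps exceptions precise.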

40
Speculative Example

L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Latencies: Add 2, Mult 10, Divide 40
41
Speculative Example
Reorder buffer (table contents not preserved in transcript)
FP register status
Field      F0  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10
Reorder #  3                       6       4       5
42
Speculative Example
Reservation Stations
43
Speculative Loop Example
LOOP: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDIU R1, R1, -8     ; decrement pointer 8 bytes (per DW)
      BNE    R1, R2, LOOP   ; branch R1 != R2
44
Speculative Loop Example
Reorder buffer (table contents not preserved in transcript)
FP register status
Field      F0  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10
Reorder #  6               7
45
Speculative Dynamic Scheduling Summary

Record speculative exceptions in the ROB
Check for exceptions when an instruction is ready to commit
More complicated control than non-speculative Tomasulo
A store updates memory when it reaches
the Write Results stage in Tomasulo
the head of the ROB in speculative Tomasulo
A store waits in the Write Results stage for its source operand
Move the value from the store's reservation station to the store's ROB entry
In reality, the sourcing instruction puts the value directly into the store's ROB entry by searching for waiting stores in the ROB
WAW and WAR memory hazards are eliminated
Actual memory updates occur in order
RAW memory hazards
The effective address of a load is computed in program order with respect to all earlier stores
A load cannot initiate its memory read (step 2) if any active ROB entry occupied by a store has a Destination field that matches the value of the load's Address field

46
Load/Store RAW Hazard

Question: Given a load that follows a store in program order, are the two related? (Alternatively: is there a RAW hazard between the store and the load?)
E.g.: st 0(R2),R5
ld R6,0(R3)
Can we go ahead and start the load early?
The store address could be delayed for a long time by some calculation that leads to R2 (divide?).
We might want to issue/begin execution of both operations in the same cycle.
Answer: we are not allowed to start the load until we know that address 0(R2) ≠ 0(R3)

47
Hardware Support for Memory Disambiguation

Need buffer to keep track of all outstanding
stores to memory, in program order.
Keep track of address (when becomes available)
and value (when becomes available)
FIFO ordering will retire stores from this
buffer in program order
When issuing a load, record current head of store
queue (know which stores are ahead of you).
When have address for load, check store queue
If any store prior to load is waiting for its
address, stall load.
If load address matches earlier store address
(associative lookup), then we have a
memory-induced RAW hazard
store value available → return value
store value not available → return ROB number of source
Otherwise, send out request to memory
Actual stores commit in order, so no worry about
WAR/WAW hazards through memory.
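
The load-side check described above can be sketched in Python. The names (`StoreEntry`, `lookup_load`, the result tags) are hypothetical, and one detail is an assumption not stated on the slide: when several earlier stores match the load address, the youngest one supplies the value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    rob: int                          # ROB number of the store itself
    addr: Optional[int] = None        # None until the address is available
    value: Optional[float] = None     # None until the value is available
    src_rob: Optional[int] = None     # ROB number of the instruction producing the value

def lookup_load(older_stores, load_addr):
    """older_stores: stores ahead of the load in program order, oldest first."""
    if any(s.addr is None for s in older_stores):
        return ("stall", None)                    # an earlier store address is unknown
    for s in reversed(older_stores):              # assumed: youngest matching store wins
        if s.addr == load_addr:                   # memory-induced RAW hazard
            if s.value is not None:
                return ("forward", s.value)       # store value available: return value
            return ("wait_rob", s.src_rob)        # otherwise: return ROB number of source
    return ("memory", load_addr)                  # no match: send the request to memory

queue = [StoreEntry(rob=1, addr=0x100, value=3.0),
         StoreEntry(rob=2, addr=0x200, value=4.0)]
print(lookup_load(queue, 0x100))
print(lookup_load(queue, 0x300))
```

In hardware the address comparison is an associative lookup over the whole store queue rather than a loop.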

48
Multiple Issue w/ Speculation

Assign multiple reservation stations and reorder
buffers to the instructions
Challenges
Instruction issue and monitoring the CDBs for instruction completion
Handle multiple instruction commits/cycle

49
Non-speculative vs. Speculative

Loop: LD     R2, 0(R1)
      DADDIU R2, R2, 1
      SD     R2, 0(R1)
      DADDIU R1, R1, 4
      BNE    R2, R3, Loop
Separate units for effective address calculation, for ALU operations, and for branch condition evaluation
Up to 2 instructions of any type can commit per clock
The branch is a key performance limitation

50
Design Considerations for Speculative Machines

Register renaming vs. Reorder buffers
A large set of registers (architectural vs.
physical registers)
How much to speculate
Handle only low-cost exceptional events in
speculative mode
1st-level cache miss vs. 2nd-level miss
Speculating through Multiple Branches
Very high branch frequency, significant
clustering of branches, long delays in FUs