Title: CSE 7381
1. Lecture 7: More ILP with Multiple Issue and Speculation
- Prof. Fatih Koçan
- CSE 7/5381 Computer Architecture
- Fall 2002
2. Getting CPI < 1: Issuing Multiple Instructions/Cycle
- Vector processing: explicit coding of independent loops as operations on large vectors of numbers
- Multimedia instructions are being added to many processors
- Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
- IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
- (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; ops are put into wide templates
- Intel Architecture-64 (IA-64): 64-bit address
- Renamed "Explicitly Parallel Instruction Computer (EPIC)"
- Anticipated success of multiple instruction issue led to Instructions Per Clock cycle (IPC) vs. CPI
3. Getting CPI < 1: Superscalar Processors
- Issue a varying number of instructions per clock (1-8)
- Statically scheduled by the compiler
- In-order execution
- Dynamically scheduled by hardware
- Use techniques based on Tomasulo's algorithm
- Out-of-order execution
4. VLIW: Very Long Instruction Word
- Issue a fixed number of instructions
- Two formats
- One large instruction
- A fixed instruction packet
- The parallelism among instructions is explicitly indicated by the instruction
- EPIC: explicitly parallel instruction computers
- EPIC and VLIW processors are statically scheduled by the compiler
5. Multi-issue Approaches
6. Statically Scheduled Superscalar Processors
- HW might issue 0 to 8 instructions/cycle
- In-order issue
- Arbitrary K-issue
- Any combination of K instructions in any order
- Non-arbitrary K-issue
- e.g. K/2 integer, K/2 float instructions
- All pipeline hazards are checked for at the issue stage
- Check for the hazards
- Among the instructions being issued (IS), and between the IS instructions and the instructions already in execution (IE)
7. The Process of Instruction Issue
- K-issue, dynamically scheduled superscalar processor
- Issue packet of I instructions: 0 ≤ I ≤ K
[Figure: pipeline stages IPreF, IF, IS1, IS2, EX]
- IPreF: prefetches instructions for the superscalar
- IF: conceptually, IF examines each instruction in the issue packet for hazards, in program order
- IS1: decides how many instructions from the packet can be issued simultaneously
- IS2: checks the instructions selected in IS1 against already-issued instructions for hazards
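The IS1/IS2 decision above can be sketched in Python. This is only an illustration under stated assumptions, not the lecture's hardware: `Instr`, `conflicts`, and `count_issuable` are hypothetical names, only register RAW and WAW hazards are checked, and structural hazards are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, a tuple of sources."""
    op: str
    dest: Optional[str]
    srcs: tuple = ()


def conflicts(later: Instr, earlier: Instr) -> bool:
    """True if `later` has a register RAW or WAW hazard on `earlier`."""
    if earlier.dest is None:
        return False
    raw = earlier.dest in later.srcs
    waw = later.dest == earlier.dest
    return raw or waw


def count_issuable(packet, in_flight, k):
    """Return how many leading instructions of `packet` can issue this cycle.

    IS1 role: compare each instruction with earlier ones in the same packet.
    IS2 role: compare each candidate with already-issued instructions.
    In-order issue: once one instruction is blocked, nothing after it issues.
    """
    issued = []
    for instr in packet[:k]:
        if any(conflicts(instr, e) for e in issued):      # hazard inside the packet
            break
        if any(conflicts(instr, e) for e in in_flight):   # hazard with issued instrs
            break
        issued.append(instr)
    return len(issued)


# Usage: ADD.D reads F0, which the L.D in the same packet produces,
# so only the first instruction of the packet can issue.
packet = [Instr("L.D", "F0", ("R1",)), Instr("ADD.D", "F4", ("F0", "F2"))]
n = count_issuable(packet, in_flight=[], k=2)
```

Stopping at the first blocked instruction is what keeps issue in order; an out-of-order design would keep scanning instead.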
8. ISSUE Stage
- Complex; determines the pipeline cycle time
- The ISSUE stage is pipelined so that instructions can issue every cycle
- Many statically scheduled and all dynamically scheduled superscalars have a pipelined issue stage
- Higher branch penalties
- Increasing the issue rate → further pipelining of the IS stage
- (not easy!)
- Limitation on the clock rate of superscalars
9. A Statically Scheduled Superscalar MIPS
- Issue 2 instructions/cycle: 1 FP and 1 anything
- 1 anything: LD, LDD, SD, SDD, BR, Int ALU, FP Move
- Fetch 64 bits/cycle
- Can only issue the 2nd instruction if the 1st instruction issues → in-order issue
- HP 7100, desktops
- Arbitrary dual issue
- Any combination of two instructions
- Embedded processors
10. Statically Scheduled Superscalar MIPS
- Superscalar MIPS: 2 instructions
- Fetch 64 bits/clock cycle: <INT, FP> or <FP, INT>
- More ports for FP registers to do an FP load and an FP op in a pair

Type              Pipe stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  EX  EX  WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  EX  EX  WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  EX  EX  WB

- The 1-cycle load delay expands to 3 instructions in SS
- The instruction in the right half can't use the load result, nor can the instructions in the next slot
11. Different Issue Combinations

Type              Pipe stages
FP instruction    IF  ID  EX  EX  EX  WB
Int. instruction  IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  EX  EX  WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  EX  EX  WB
Int. instruction          IF  ID  EX  MEM WB

Type              Pipe stages
FP instruction    IF  ID  EX  EX  EX  WB
Int. instruction  IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  EX  EX  WB
FP instruction            IF  ID  EX  EX  EX  WB
Int. instruction          IF  ID  EX  MEM WB
12. Issue Process of 2-Issue MIPS
- Fetch two instructions from the prefetch unit or from the cache
- Determine how many instructions can be issued: 0, 1 or 2
- Issue them to the correct functional units
13. Fetching Two Instructions from I-Cache
- Easy to fetch I1 and I2
- How about I2 and I3?
- Most processors issue only I2
- Use a prefetch unit
14. 2-Issue MIPS Hazard Checking
- Potential issue packets
- <>, <INT>, <FP>, <INT, FP>, <FP, INT>
- Most hazard possibilities are eliminated within an issue packet
- FP load/store/move paired with an FP op: FP register port contention
- RAW hazard
- WAR, WAW hazards across issue packet boundaries
15. Additional Hardware for Superscalars
- Enhanced hazard detection
- Minimal hardware support to execute integer and floating-point instructions in parallel
- Different set of FP registers
- Different set of Int registers
- One additional FP read/write port
- A larger set of bypass paths
16. Maintaining Precise Exceptions
[Figure: issue packet]
- Let the FP pipeline drain
- DIV.D causes an exception after the SUB.D exception
- No precise exception at the HW level
- Why? ADD.D destroys one of its operands
- Approaches
- Ignore the problem and settle for imprecise exceptions
- Buffer the results of an operation until all the operations that were issued earlier are complete
- Let the trap-handling routine create a precise sequence for the exception
- Allow instruction issue to continue only if all the instructions before the issuing instruction will complete w/o causing an exception
17. Approach 1: Settle for Imprecise Exceptions
- Virtual memory and the IEEE FP standard → require precise exceptions
- Two modes of execution
- Imprecise mode (fast)
- Precise mode
- Selected by a mode switch or by insertion of explicit FP-exception test instructions
- The amount of overlap and reordering is significantly restricted
- DEC Alpha 21064 and 21164, IBM Power I and II, MIPS R8000
18. Approach 2: Buffering the Results of an Operation
- When the difference in running times among operations is large
- The number of results to buffer becomes large
- The results from the queue must be bypassed to all issuing and executing instructions
- Large number of comparators and a very large multiplexor
19. Approach 2: Buffering Variations
- History file (CYBER 180/990, VAX)
- Keeps track of the original values of registers
- Upon an exception, roll back and reload the original values from the file
- Future file
- Keeps the newer value of each register
- Update the main register file from the future file after all earlier instructions complete
- On an exception, the main register file is intact!
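The two buffering variations can be contrasted with a small Python sketch. It is a simplified model under assumptions that are not in the slides: the class and method names are hypothetical, every instruction writes exactly one register, instructions are numbered consecutively from 0 in program order, and writes to the same register complete in program order.

```python
class HistoryFile:
    """Registers are updated as results complete; old values are logged for undo."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []  # (seq, reg, old_value), in completion order

    def write(self, seq, reg, value):
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self, faulting_seq):
        """Undo, newest first, every write made at or after the faulting instruction."""
        for seq, reg, old in reversed(self.log):
            if seq >= faulting_seq:
                self.regs[reg] = old
        self.log = [e for e in self.log if e[0] < faulting_seq]


class FutureFile:
    """New values go to `future`; `arch` is only updated in program order."""

    def __init__(self, regs):
        self.arch = dict(regs)
        self.future = dict(regs)
        self.pending = {}   # seq -> (reg, value): completed but not yet retired
        self.next_seq = 0   # oldest instruction that has not retired

    def write(self, seq, reg, value):
        self.future[reg] = value
        self.pending[seq] = (reg, value)
        # Retire to the main register file once all earlier instructions are done.
        while self.next_seq in self.pending:
            r, v = self.pending.pop(self.next_seq)
            self.arch[r] = v
            self.next_seq += 1

    def exception(self):
        """The main register file is intact, so just discard the future state."""
        self.future = dict(self.arch)
        self.pending.clear()


# Usage: instruction 1 completes before instruction 0, then instruction 0 faults.
h = HistoryFile({"F0": 1.0, "F2": 2.0})
h.write(1, "F2", 20.0)
h.rollback(0)          # F2 is restored from the history file

f = FutureFile({"F0": 1.0, "F2": 2.0})
f.write(1, "F2", 20.0)  # visible in f.future, not yet in f.arch
f.exception()           # f.arch never saw the early write
```

The history file pays on recovery (walking the log), while the future file pays on every retirement (copying into the main file).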
20. Approach 3: Trap-handling Routine Creates a Precise Sequence of Exceptions
- Know which operations are in the pipeline, and their PCs
- The software finishes any instructions that precede the latest instruction completed
- I1: long-running, causes an exception
- I2 ... In-1: not completed
- In: completed
- SW simulates I1 through In-1 → major difficulty
- HW restarts at In+1
21. Approach 4: All Instructions Before the Issuing Instruction Complete w/o Exception
- Stall the CPU to maintain precise exceptions
- FP functional units must determine whether an exception is possible early in the EX stage
- In the first 3 clock cycles in the MIPS pipeline
- MIPS R2000/3000/4000, Intel Pentium
22. Precise Exception Handling in SS MIPS
- Int. op finishes before FP op
- The integer instruction completes before the FP op's exception is detected
- Imprecise exception
- Solutions
- Detecting FP exceptions early
- Using software mechanisms to restore a precise exception state before resuming execution
- Delaying instruction completion until an exception is impossible
[Figure: issue packet]
- The speculation approach uses solution 3
23. Load and Branch Stalls in SS MIPS
- Load result is not available
- in the same cycle
- in the next cycle
- Branch delay for a taken branch
- 2 instructions if the branch is the first in the packet
- 3 instructions if the branch is the second in the packet
- Total: 3 instructions
24. Multiple Issue Challenges
- While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
- Exactly 50% FP operations AND no hazards
- If more instructions issue at the same time, decode and issue become more difficult
- Even 2-scalar => examine 2 opcodes and 6 register specifiers, and decide if 1 or 2 instructions can issue (N-issue: O(N^2 - N) comparisons)
- Register file: needs 2x reads and 1x writes/cycle
- Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

add r1, r2, r3        add p11, p4, p7
sub r4, r1, r2   →    sub p22, p11, p4
lw  r1, 4(r4)         lw  p23, 4(p22)
add r5, r1, r2        add p12, p23, p4

- Imagine doing this transformation in a single cycle!
- Result buses: need to complete multiple instructions/cycle
- So, need multiple buses with associated matching logic at every reservation station
- Or, need multiple forwarding paths
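The 4-way renaming example above can be reproduced with a short Python sketch. It is sequential software, so it shows only what the rename logic must compute, not how hardware does it in one cycle. The function name, the tuple encoding of instructions, the initial map (r2 → p4, r3 → p7) and the free-list order are assumptions chosen to match the slide's physical register names.

```python
def rename_group(instrs, rename_map, free_list):
    """Rename a group of (op, dest, *operands) tuples in program order.

    Operands that are not in the map (e.g. the immediate "4") pass through
    unchanged. Running out of free physical registers is not handled here.
    """
    rename_map = dict(rename_map)   # work on a copy of the caller's map
    free = iter(free_list)
    renamed = []
    for op, dest, *srcs in instrs:
        # Sources must see mappings created by EARLIER instructions in the
        # same group; this is the intra-group dependence that makes
        # single-cycle renaming hard.
        new_srcs = [rename_map.get(s, s) for s in srcs]
        new_dest = next(free)       # every destination gets a fresh register
        rename_map[dest] = new_dest
        renamed.append((op, new_dest, *new_srcs))
    return renamed


group = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r2"),
    ("lw",  "r1", "4", "r4"),    # lw r1, 4(r4)
    ("add", "r5", "r1", "r2"),
]
result = rename_group(group, {"r2": "p4", "r3": "p7"}, ["p11", "p22", "p23", "p12"])
```

Note that r1 is renamed twice in the group (to p11, then to p23), and each consumer picks up whichever mapping was current at its position in program order.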
25. Multiple Instruction Issue with Dynamic Scheduling
- Issue an instruction in half of a clock cycle
- Two instructions are processed in one cycle
- Build the logic necessary to handle two instructions at once
- Including any possible dependences between the two instructions
- Both approaches are used at the same time
- Pipeline and widen the issue logic
- Integrate dynamic branch prediction into a dynamically scheduled pipeline
26. A Two-Issue Dynamically Scheduled Processor
- Issue any pair of instructions if reservation stations are available
- Extend Tomasulo's scheme to deal with both integer and FP functional units and registers
- Issue and write result take 1 cycle each
- There is dynamic branch prediction hardware, a branch condition evaluation unit, 1 integer ALU, and pipelined FP units

LOOP: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar in F2
      S.D    F4, 0(R1)     ; store result
      DADDIU R1, R1, -8    ; decrement pointer 8 bytes (per DW)
      BNE    R1, R2, LOOP  ; branch if R1 != R2
27. Latencies

Producer   Consumer   Cycles
ALU op     ALU op     1
Load       FP op      2
Load       ALU op     2
FP Add     FP Add     3

- Branch prediction is perfect
- Two CDBs
- No delayed branch
28. First 3 Iterations
IPC = 5/3 = 1.67; execution rate = 15/16 = 0.94
29. Resource Usage
Do we need the second CDB?
30. 2-Issue w/ Additional Resources
- Extra adder for effective address calculation
IPC = 5/3 = 1.67; execution rate = 15/12 = 1.25
31. Resource Usage
Lower efficiency, as measured by the utilization of the functional units
32. What Limits the Performance of a 2-Issue Dynamically Scheduled Pipeline?
- Imbalance between the functional unit structure of the pipeline and the example loop
- Impossible to fully use the FP units
- Need fewer dependent integer operations per loop
- Very high loop overhead (2 of 5 instructions)
- Try to reduce this overhead in the next chapter
- The control hazard: could not start the next L.D before we know the outcome of the branch
33. Hardware-based Speculation
- Execute a branch every cycle
- Prediction alone is not sufficient to have a high amount of ILP
- Overcome control dependence by speculating on the outcome of branches
- Execute the program as if our guesses were correct
- Dynamic scheduling: fetch, issue (no execute)
- Speculation: fetch, issue, execute
- Incorrect speculation → undo
34. Hardware-based Speculation: Key Ideas
- Dynamic branch prediction to choose which instructions to execute
- Speculation to allow the execution of instructions before the control dependence is resolved
- Dynamic scheduling to deal with the scheduling of different combinations of basic blocks
- PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, AMD K5/K6/Athlon
35. Speculative Tomasulo's Algorithm
- Separate the bypassing of speculative and non-speculative results
- Undo is possible
- Once an instruction is no longer speculative, it updates the register file or memory
- Instruction commit stage
- Key idea: out-of-order execution, in-order commit
- Use a reorder buffer (ROB) for in-order commit
36. Reorder Buffer (ROB)
- Holds the results of instructions that have executed but not committed
- Passes results among instructions that may be speculated
- Like the store buffer in Tomasulo
[Figure: the ROB is the source of operands between the time execution completes and the time the instruction commits]
37. Reorder Buffer Structure
38. Tomasulo with Reorder Buffer
[Figure: Tomasulo datapath with a reorder buffer. FP op queue; reorder buffer entries ROB1 (oldest) through ROB7 (newest) with Dest and Done? fields (example entry: LD F0,10(R2), Dest F0, Done? N); registers; reservation stations feeding the FP adders and FP multipliers; paths to and from memory. Prof. John Kubiatowicz's slide.]
39. Four Steps in Speculative Tomasulo
- Issue (dispatch)
- Get an instruction from the queue; issue it if there is an empty reservation station and ROB slot, otherwise stall. Send the operands to the reservation station if they are available in the ROB or the registers. Send the ROB entry number to the reservation station; later, the RS puts the result and this tag on the CDB.
- Execute (issue)
- Wait for operands that are not ready by watching the CDB, i.e. check for RAW hazards
- Loads take 2 steps: check whether the load is at the head of the load buffer, then read from memory
- Stores: effective address calculation
- Write Result
- Put the result with its ROB tag on the CDB; all waiting reservation stations and the ROB read it from the CDB
- STORE: if the value is available, write it to the store's ROB slot; if not, watch the CDB to update the value field of the ROB slot
- Commit (completion, graduation)
- BRANCH with incorrect prediction: when it reaches the head of the ROB, flush the ROB and restart execution at the correct successor of the branch
- STORE: when the store reaches the head of the ROB and its result is available → normal commit: write to memory
- Any other instruction: when it reaches the head of the ROB and its result is available → normal commit: write to a register
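The commit rules above can be sketched as a small Python model of the ROB head. This is a simplification under assumptions that are not in the slides: `RobEntry` and `commit` are hypothetical names, the ROB is a plain queue, any number of entries may retire per call, and refetching after a flush is left to the caller.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    kind: str                   # "alu", "store", or "branch"
    dest: object = None         # register name, or memory address for a store
    value: object = None
    done: bool = False          # result has been written to the ROB
    mispredicted: bool = False  # only meaningful for branches


def commit(rob, regs, mem):
    """Retire finished entries from the head of the ROB, in program order.

    Returns True if a mispredicted branch flushed the ROB.
    """
    while rob and rob[0].done:
        entry = rob.popleft()
        if entry.kind == "branch":
            if entry.mispredicted:
                rob.clear()      # squash every younger, speculative entry
                return True      # caller restarts at the correct successor
        elif entry.kind == "store":
            mem[entry.dest] = entry.value    # memory is updated only at commit
        else:
            regs[entry.dest] = entry.value   # register is updated only at commit
    return False


# Usage: the store has finished, but it cannot commit past the unfinished
# ADD ahead of it, so only the first entry retires.
regs, mem = {}, {}
rob = deque([
    RobEntry("alu", "F0", 1.5, done=True),
    RobEntry("alu", "F4"),                  # still executing
    RobEntry("store", 100, 2.5, done=True),
])
flushed = commit(rob, regs, mem)
```

Because nothing younger than the head can touch `regs` or `mem`, flushing on a misprediction needs no undo of architectural state.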
40. Speculative Example
- L.D F6, 34(R2)
- L.D F2, 45(R3)
- MUL.D F0, F2, F4
- SUB.D F8, F6, F2
- DIV.D F10, F0, F6
- ADD.D F6, F8, F2
Latencies: Add 2, Mult 10, Divide 40
41. Speculative Example
[Figure: reorder buffer and FP register status table for F0-F10; surviving reorder numbers: 3, 6, 4, 5]
42. Speculative Example
[Figure: reservation stations]
43. Speculative Loop Example

LOOP: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar in F2
      S.D    F4, 0(R1)     ; store result
      DADDIU R1, R1, -8    ; decrement pointer 8 bytes (per DW)
      BNE    R1, R2, LOOP  ; branch if R1 != R2

44. Speculative Loop Example
[Figure: reorder buffer and FP register status table for F0-F10; surviving reorder numbers: 6, 7]
45. Speculative Dynamic Scheduling Summary
- Record a speculative exception in the ROB
- Check for the exception when the instruction is ready to commit
- More complicated control than non-speculative Tomasulo
- A store updates memory when it reaches
- the Write Result stage in Tomasulo
- the head of the ROB in speculative Tomasulo
- A store waits in the Write Result stage for its source operand
- Move the value from the store's reservation station to the store's ROB entry
- In reality, the sourcing instruction puts the value directly into the store's ROB entry by searching for waiting stores in the ROB
- WAW and WAR memory hazards are eliminated
- Actual memory updates occur in order
- RAW memory hazards
- The computation of the effective address of a load is ordered w.r.t. all earlier stores
- A load cannot initiate its read from memory (step 2) if any active ROB entry occupied by a store has a Destination field that matches the value of the Address field of the load
46. Load/Store RAW Hazard
- Question: given a load that follows a store in program order, are the two related?
- (Alternatively: is there a RAW hazard between the store and the load?) E.g.:

st 0(R2), R5
ld R6, 0(R3)

- Can we go ahead and start the load early?
- The store address could be delayed for a long time by some calculation that leads to R2 (divide?).
- We might want to issue/begin execution of both operations in the same cycle.
- Answer: we are not allowed to start the load until we know that address 0(R2) ≠ 0(R3)
47. Hardware Support for Memory Disambiguation
- Need a buffer to keep track of all outstanding stores to memory, in program order.
- Keep track of the address (when it becomes available) and the value (when it becomes available)
- FIFO ordering: stores retire from this buffer in program order
- When issuing a load, record the current head of the store queue (to know which stores are ahead of you).
- When the address for the load is available, check the store queue:
- If any store prior to the load is waiting for its address, stall the load.
- If the load address matches an earlier store address (associative lookup), then we have a memory-induced RAW hazard:
- store value available → return the value
- store value not available → return the ROB number of the source
- Otherwise, send out the request to memory
- Actual stores commit in order, so no worry about WAR/WAW hazards through memory.
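The store-queue check above can be sketched in Python. This is a minimal sketch under stated assumptions: `StoreEntry`, `load_lookup`, and the returned action names are hypothetical, addresses are compared exactly (no partial overlaps), and when several earlier stores match, the youngest one is chosen, which the slide does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    addr: Optional[int] = None     # None until the effective address is known
    value: Optional[float] = None  # None until the store data is available
    src_rob: Optional[int] = None  # ROB number of the instruction producing the data


def load_lookup(load_addr, earlier_stores):
    """Decide what a load with a known address should do.

    `earlier_stores` holds the stores ahead of the load in program order,
    oldest first. Returns one of:
      ("stall", None)     an earlier store address is still unknown
      ("forward", value)  matching store, data available
      ("wait", rob_no)    matching store, data not yet available
      ("memory", None)    no match, send the request to memory
    """
    if any(s.addr is None for s in earlier_stores):
        return ("stall", None)
    for s in reversed(earlier_stores):   # assumption: youngest match wins
        if s.addr == load_addr:
            if s.value is not None:
                return ("forward", s.value)
            return ("wait", s.src_rob)
    return ("memory", None)


# Usage: the store to address 64 has its data, so the load can take it
# directly from the queue instead of going to memory.
queue = [StoreEntry(addr=32, value=1.0), StoreEntry(addr=64, value=2.0)]
action = load_lookup(64, queue)
```

Stalling on any unknown earlier address is the conservative policy from the slide; a more aggressive design could speculate that the addresses differ and recover if they do not.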
48. Multiple Issue w/ Speculation
- Assign multiple reservation stations and reorder buffer entries to the instructions
- Challenges
- Instruction issue, and monitoring the CDBs for instruction completion
- Handle multiple instruction commits/cycle
49. Non-speculative vs. Speculative

Loop: LD     R2, 0(R1)
      DADDIU R2, R2, 1
      SD     R2, 0(R1)
      DADDIU R1, R1, 4
      BNE    R2, R3, Loop

- Separate units for effective address calculation, for ALU operations, and for branch condition evaluation
- Up to 2 instructions of any type can commit per clock
- The branch is a key performance limitation
50. Design Considerations for Speculative Machines
- Register renaming vs. reorder buffers
- A large set of registers (architectural vs. physical registers)
- How much to speculate
- Handle only low-cost exceptional events in speculative mode
- 1st-level cache miss vs. 2nd-level miss
- Speculating through multiple branches
- Very high branch frequency, significant clustering of branches, long delays in FUs