Title: Instruction-Level Parallelism
1 Instruction-Level Parallelism
- Review of Pipelining (the laundry analogy)
2 Instruction-Level Parallelism
- Review of Pipelining (Appendix A)
- MIPS pipeline: five stages
  - IF: instruction fetch
  - ID: instruction decode and operand fetch
  - EX: execution using the ALU, including effective-address and target-address computation
  - MEM: memory access for load/store instructions
  - WB: write the result back to the (destination) register
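The overlap of the five stages can be sketched in a few lines of Python. This is a minimal illustration that assumes an ideal pipeline with no stalls; the function names `pipeline_diagram` and `render` and the instruction labels are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return (name, cells) rows; instruction i enters IF in cycle i (0-based)."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # each stage is one cycle later than the previous one
        rows.append((name, cells))
    return rows

def render(rows):
    """Format the rows as fixed-width text, one instruction per line."""
    return "\n".join(
        f"{name:<6}" + "".join(f"{cell:<4}" for cell in cells).rstrip()
        for name, cells in rows
    )

if __name__ == "__main__":
    print(render(pipeline_diagram(["i1", "i2", "i3"])))
```

With n instructions and 5 stages the diagram spans n + 4 cycles, which is where the ideal one-instruction-per-cycle throughput comes from.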
5 Instruction-Level Parallelism
- The naïve MIPS pipeline: implementation
6 Instruction-Level Parallelism
- A series of datapaths shifted in time
7 Instruction-Level Parallelism
- A pipeline showing the pipeline registers between stages
8 The major hurdles of pipelining: pipeline hazards
- Structural hazards: resource conflicts, such as the bus, register file ports, memory ports, etc.
9 The major hurdles of pipelining: pipeline hazards
- Data hazards: data dependences (a producer-consumer relationship, or read after write). Some can be resolved by forwarding
10 The major hurdles of pipelining: pipeline hazards
- Data hazards: data hazard detection in the MIPS pipeline
11 The major hurdles of pipelining: pipeline hazards
- Data hazards: the logic for forwarding data in the MIPS pipeline
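The forwarding decision can be sketched as a priority check against the two pipeline registers that may hold a newer value. This is a minimal sketch, not the slide's actual logic: the `Latch` type, the `forward_source` function, and the string labels are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Latch:
    """Hypothetical view of a pipeline register (EX/MEM or MEM/WB)."""
    writes_reg: bool          # does the instruction held here write a register?
    dest: Optional[int]       # its destination register number, if any

def forward_source(src_reg: int, ex_mem: Latch, mem_wb: Latch) -> str:
    """Choose where the ALU input for `src_reg` should come from."""
    if src_reg == 0:
        return "REGFILE"      # R0 is hardwired to zero in MIPS, never forwarded
    if ex_mem.writes_reg and ex_mem.dest == src_reg:
        return "EX/MEM"       # the most recent producer wins
    if mem_wb.writes_reg and mem_wb.dest == src_reg:
        return "MEM/WB"
    return "REGFILE"          # no producer in flight: the register file is current
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the consumer must see the younger result.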
12 The major hurdles of pipelining: pipeline hazards
- Data hazards: the forwarding of data in the MIPS pipeline
13 The major hurdles of pipelining: pipeline hazards
- Data hazards: some cannot be resolved by forwarding and therefore require stalls
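The classic case that forwarding cannot fix is a load followed immediately by an instruction that uses the loaded value in EX: the data only exists after MEM. A minimal sketch of that interlock check follows; the `Instr` tuple and `needs_load_stall` are hypothetical names, and the consumer is assumed to need its sources at the start of EX.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str                   # mnemonic, e.g. "LW" or "SUB"
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers

def needs_load_stall(producer: Instr, consumer: Instr) -> bool:
    """True if `consumer` directly follows a load whose result it needs in EX."""
    return producer.op == "LW" and producer.dest in consumer.srcs

lw = Instr("LW", "R1", ("R2",))
sub = Instr("SUB", "R4", ("R1", "R5"))
print(needs_load_stall(lw, sub))  # True: one stall cycle is required
```

A compiler can often remove this stall by placing an independent instruction between the load and its consumer.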
14 The major hurdles of pipelining: pipeline hazards
- Data hazards: avoid non-forwardable data hazards through compiler scheduling
15 The major hurdles of pipelining: pipeline hazards
- Branch (control) hazards: can cause a greater performance loss (e.g., a 3-cycle loss in the naïve MIPS pipeline)
16 The major hurdles of pipelining: pipeline hazards
- Branch (control) hazards: improved MIPS pipeline with a one-cycle loss
17 The major hurdles of pipelining: pipeline hazards
- Reducing branch penalties
  - Freeze or flush
  - Predict-not-taken or predict-taken
  - Delayed branch
    - Branch instruction
    - Sequential successor
    - Branch target if taken
  - Canceling/nullifying branch if the prediction is incorrect
18 The major hurdles of pipelining: pipeline hazards
- Scheduling the branch delay slot
19 Performance of Pipelining
- Example 1
- Consider an unpipelined machine A and a pipelined machine B, where CCT_A = 10 ns, CPI(A)_ALU = CPI(A)_Br = 4, CPI(A)_l/s = 5, and CCT_B = 11 ns. Assuming an instruction mix of 40% ALU, 20% branches, and 40% l/s, what is the speedup of B over A under ideal conditions?
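The arithmetic for Example 1 can be checked with a short script. This is a sketch under the stated ideal conditions, which are taken here to mean CPI = 1 for the pipelined machine B; the variable names are illustrative.

```python
def average_cpi(mix_percent, cpi):
    """Weighted CPI for an instruction mix given in percent."""
    return sum(mix_percent[k] * cpi[k] for k in mix_percent) / 100

mix = {"alu": 40, "branch": 20, "load_store": 40}   # instruction mix, percent
cpi_a = {"alu": 4, "branch": 4, "load_store": 5}    # unpipelined machine A
CCT_A_NS = 10
CCT_B_NS = 11

time_a = average_cpi(mix, cpi_a) * CCT_A_NS   # 4.4 cycles * 10 ns = 44 ns per instruction
time_b = 1 * CCT_B_NS                         # ideal pipeline: one instruction per 11 ns cycle
speedup = time_a / time_b
print(f"{speedup:.2f}")  # 4.00
```

The speedup is 4 rather than 4.4 because the pipelined clock is 1 ns slower than the unpipelined one.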
20 Performance of Pipelining
- Impacts of pipeline hazards
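The effect of hazards on performance is usually summarized by the standard textbook relations below. They assume an ideal CPI of 1 and an unchanged clock cycle time, and are offered as a reminder rather than as the content of the original slide.

```latex
\begin{align*}
\text{CPI}_{\text{pipelined}} &= \text{Ideal CPI} + \text{Pipeline stall cycles per instruction}
  = 1 + \text{Pipeline stall cycles per instruction} \\[4pt]
\text{Speedup} &= \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\end{align*}
```

For branch hazards specifically, the stall term becomes branch frequency times branch penalty.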
21 Performance of Pipelining
- Performance of branch schemes
- Overall costs of a variety of branch schemes with the MIPS pipeline
22 Performance of Pipelining
- Example 2: For a deeper pipeline such as that in the MIPS R4000, it takes three pipeline stages before the target address is known and an additional stage before the condition is evaluated. This leads to the branch penalties for the three simplest branch schemes listed below.
- Find the effective addition to the CPI arising from branches for this pipeline, assuming that unconditional, untaken conditional, and taken conditional branches account for 4%, 6%, and 10% of instructions, respectively.
- Answer
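The calculation for Example 2 is frequency times penalty, summed over the three branch types. The penalty table did not survive in this text, so the values in `ASSUMED_PENALTIES` below are assumptions chosen to match the description (target known after three stages, condition one stage later); substitute the real table if it differs.

```python
# Branch frequencies from the example, in percent of all instructions.
BRANCH_MIX_PERCENT = {"unconditional": 4, "untaken": 6, "taken": 10}

# ASSUMED penalties in cycles per branch type (not present in the source text).
ASSUMED_PENALTIES = {
    "flush pipeline":  {"unconditional": 2, "untaken": 3, "taken": 3},
    "predict taken":   {"unconditional": 2, "untaken": 3, "taken": 2},
    "predict untaken": {"unconditional": 2, "untaken": 0, "taken": 3},
}

def cpi_addition(penalties):
    """Extra CPI from branches: sum of frequency * penalty over branch types."""
    return sum(BRANCH_MIX_PERCENT[k] * penalties[k] for k in BRANCH_MIX_PERCENT) / 100

for scheme, penalties in ASSUMED_PENALTIES.items():
    print(f"{scheme}: +{cpi_addition(penalties):.2f} CPI")
```

Under these assumed penalties the additions are 0.56, 0.46, and 0.38 CPI for flush, predict-taken, and predict-untaken respectively.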
23 What Makes Pipelining Hard to Implement?
- Exceptional conditions (e.g., interrupts) often change the order of instruction execution
24 What Makes Pipelining Hard to Implement?
- Actions needed for different types of exceptional conditions
25 What Makes Pipelining Hard to Implement?
- Stopping and restarting execution: two challenges
26 What Makes Pipelining Hard to Implement?
- Stopping and restarting execution: two challenges (cont'd)
27 What Makes Pipelining Hard to Implement?
- Precise exception handling in MIPS
Pipeline stage   Problem exceptions occurring
IF               Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory-protection violation
WB               None
28 What Makes Pipelining Hard to Implement?
- Precise exception handling in MIPS
29 Extending MIPS Pipeline to Handle Multicycle Operations
- Handling floating-point operations
  - Single cycle (CPI = 1) → very long CCT or a highly complex logic circuit
  - Multiple cycles → long latency, with the EX cycle repeated many times and/or with multiple FP functional units
- The MIPS pipeline with three additional unpipelined floating-point units
30 Extending MIPS Pipeline to Handle Multicycle Operations
- Pipelining FP functional units
  - Latency: number of intervening cycles between the producer and the consumer of an operand (0 for ALU and 1 for LW)
  - Initiation interval: minimum number of cycles between two issues of instructions using the same functional unit
Functional unit   Latency   Initiation interval
Integer ALU       0         1
Data memory       1         1
FP add            3         1
Multiply          6         1
Divide            24        25
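The two definitions above translate directly into stall and issue calculations. The sketch below encodes the table and assumes full forwarding with a consumer that needs its operand at the start of its execute stage; `UNITS`, `raw_stalls`, and `next_issue_cycle` are hypothetical names.

```python
# (latency, initiation interval) per functional unit, from the table above.
UNITS = {
    "int_alu":  (0, 1),
    "data_mem": (1, 1),
    "fp_add":   (3, 1),
    "fp_mul":   (6, 1),
    "fp_div":   (24, 25),
}

def raw_stalls(producer_unit: str, distance: int = 1) -> int:
    """Stall cycles for a dependent consumer issued `distance` instructions later.

    Latency is the number of intervening cycles needed, and every instruction
    placed between producer and consumer hides one of those cycles.
    """
    latency, _ = UNITS[producer_unit]
    return max(0, latency - (distance - 1))

def next_issue_cycle(unit: str, last_issue_cycle: int) -> int:
    """Earliest cycle at which the same unit can accept another instruction."""
    _, interval = UNITS[unit]
    return last_issue_cycle + interval

print(raw_stalls("fp_mul"))            # 6: consumer directly after a multiply
print(next_issue_cycle("fp_div", 10))  # 35: the divider is not pipelined
```

The divide row is the only one where the initiation interval exceeds 1, which is why back-to-back divides produce a structural hazard.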
31 Extending MIPS Pipeline to Handle Multicycle Operations
- Pipeline timing of a set of independent FP instructions
MUL.D   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
ADD.D   IF  ID  A1  A2  A3  A4  MEM WB
L.D     IF  ID  EX  MEM WB
S.D     IF  ID  EX  MEM WB
- A typical FP code sequence showing the stalls arising from RAW hazards
Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
L.D   F4,0(R2)    IF    ID    EX    MEM   WB
MUL.D F0,F4,F6          IF    ID    stall M1    M2    M3    M4    M5    M6    M7    MEM   WB
ADD.D F2,F0,F8                IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    MEM   WB
S.D   F2,0(R2)                      IF    stall stall stall stall stall stall ID    EX    stall stall stall MEM
- Three instructions want to perform a write-back to the FP register file simultaneously
Instruction       1     2     3     4     5     6     7     8     9     10    11
MUL.D F0,F4,F6    IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
...                     IF    ID    EX    MEM   WB
...                           IF    ID    EX    MEM   WB
ADD.D F2,F4,F6                      IF    ID    A1    A2    A3    A4    MEM   WB
...                                       IF    ID    EX    MEM   WB
...                                             IF    ID    EX    MEM   WB
L.D   F2,0(R2)                                        IF    ID    EX    MEM   WB
32 Extending MIPS Pipeline to Handle Multicycle Operations
- Difficulties in exploiting ILP: various hazards impose dependences among instructions (for an instruction i that precedes an instruction j)
  - RAW (read after write): j tries to read a source before i writes to it
  - WAW (write after write): j tries to write an operand before it is written by i
  - WAR (write after read): j tries to write a destination before it is read by i
- Implementing pipelining for FP: hazards and forwarding in longer-latency pipelines
  - Divide is not fully pipelined (structural hazard)
  - Multiple writes in a cycle, with instructions arriving at WB at varying times: WAW and structural hazards. Would there be WAR?
  - Out-of-order completion of instructions → more problems for exception handling
  - Higher RAW frequency and longer stalls due to longer latency
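The three hazard definitions reduce to comparisons between the register sets of an earlier instruction i and a later instruction j. The following sketch names the potential hazards for a pair; the tuple layout and the function name `classify_hazards` are assumptions, and whether a potential hazard causes a stall depends on the pipeline.

```python
def classify_hazards(i, j):
    """Return the potential hazards between i and a later instruction j.

    Each instruction is a (dest, srcs) pair: a destination register name
    (or None) and a tuple of source register names.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = []
    if i_dest is not None and i_dest in j_srcs:
        hazards.append("RAW")   # j reads what i writes
    if i_dest is not None and i_dest == j_dest:
        hazards.append("WAW")   # both write the same register
    if j_dest is not None and j_dest in i_srcs:
        hazards.append("WAR")   # j overwrites something i still has to read
    return hazards

mul = ("F0", ("F4", "F6"))      # MUL.D F0,F4,F6
add = ("F2", ("F0", "F8"))      # ADD.D F2,F0,F8
print(classify_hazards(mul, add))  # ['RAW']
```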
33 Extending MIPS Pipeline to Handle Multicycle Operations
- Introduce interlocks
  - Track the use of the write port at ID and stall issue if a conflict is detected
    - use a shift register to track issued instructions' use of the write port
  - Stall when entering MEM
    - can stall any of the contending instructions
    - no need to detect the conflict early, when it is harder to see
    - give priority to the unit with the longest latency
    - can cause bottleneck stalling
- WAW occurs if LD is issued one cycle earlier and has F2 as its destination (WAW with ADDD). Solutions:
  - delay issuing LD until ADDD enters MEM, or
  - stamp out the result of ADDD
- Hazard detection with the FP pipeline
  - check for structural hazards: (a) functional units, (b) write ports
  - check for RAW hazards: source reg. in ID = dest. reg. of an issued instruction
  - check for WAW hazards: dest. reg. in ID = dest. reg. of an issued instruction
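The shift-register approach to write-port tracking can be sketched as follows. This is an illustrative model, assuming a single FP write port; the class name, the `depth` parameter, and the method names are hypothetical.

```python
class WritePortReservation:
    """Shift register: bits[k] is True if the write port is taken k cycles from now."""

    def __init__(self, depth: int = 32):
        self.bits = [False] * depth

    def try_issue(self, cycles_until_wb: int) -> bool:
        """Reserve the port for an instruction in ID; return False if it must stall."""
        if self.bits[cycles_until_wb]:
            return False                  # another instruction writes back that cycle
        self.bits[cycles_until_wb] = True
        return True

    def tick(self) -> None:
        """Advance one clock cycle: every reservation moves one slot closer."""
        self.bits = self.bits[1:] + [False]

port = WritePortReservation()
print(port.try_issue(9))   # True: a multiply in ID writes back 9 cycles later
for _ in range(3):
    port.tick()
print(port.try_issue(6))   # False: an add issued 3 cycles later would collide
```

The example mirrors a multiply in ID at cycle 2 and an add in ID at cycle 5 that would both reach WB in cycle 11; stalling the add by one cycle makes `try_issue` succeed.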
34 Extending MIPS Pipeline to Handle Multicycle Operations
- Maintain precise exceptions
- Example of out-of-order completion
  DIVF F0, F2, F3
  ADDF F10, F10, F8
  SUBF F12, F12, F14
  An exception of SUBF at the end of ADDF causes an imprecise exception, which cannot be solved by HW/SW
- Solutions
  - Fast imprecise (tolerable in the '60s and '70s, but much less so now due to pipelined FP, virtual memory, and the IEEE standard) or slow precise
  - Buffering of results until all predecessors finish: the bigger the difference among instruction execution lengths, the more expensive it is to implement (e.g., a large number of comparators and MUXs and a large amount of buffer space)
    - history file: keeps track of register values
    - future file: keeps newer values of registers until all predecessors are completed
  - Quasi-precise exception: keep enough information for the trap-handling routine to create a precise sequence for the exception
    - operations in the pipeline and their PCs
    - software finishes all instructions issued prior to the latest completed instruction
  - Guarded issuing: issue only if it is certain that all prior instructions will complete without causing an exception
    - stalling to maintain precise exceptions
35 The MIPS R4000 Pipeline
- R4000 pipeline leads to a 2-cycle load delay
36 The MIPS R4000 Pipeline
- R4000 pipeline leads to a 3-cycle basic branch delay, since the condition evaluation is performed during the EX stage
37 Dynamic Scheduling with Scoreboard
- Dynamic scheduling: hardware rearranges the instruction execution order to reduce stalls
  - handles situations where dependences are unknown or difficult to detect at compile time, thus simplifying the compiler design
  - increases portability of the compiled code
  - solves problems associated with the so-called head-of-the-queue (HOTQ) blocking caused by in-order issue in earlier pipelines
- Example: MIPS, which is in-order issue, can be made to execute out of order (implying out-of-order completion) by splitting ID into two phases: (1) In-order issue: check for structural hazards. (2) Read operands: wait until there are no data hazards, then read operands (and then execute, possibly out of order!). The HOTQ problem above can be solved in this new MIPS!
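The read-operands phase can be sketched as a filter over already-issued instructions. This is a simplified sketch, not a full scoreboard: it checks only RAW conditions and ignores the WAR and WAW bookkeeping a real scoreboard performs. The instruction sequence and the name `ready_to_execute` are illustrative assumptions.

```python
def ready_to_execute(waiting, pending_dests):
    """Names of issued instructions whose source operands are all available.

    `waiting` holds (name, dest, srcs) tuples in program order;
    `pending_dests` is the set of registers still being computed.
    """
    return [
        name
        for name, _dest, srcs in waiting
        if not any(src in pending_dests for src in srcs)
    ]

# DIV.D F0,F2,F4 is executing, so F0 is pending.
waiting = [
    ("ADD.D", "F10", ("F0", "F8")),    # needs F0: must wait for the divide
    ("SUB.D", "F12", ("F8", "F14")),   # independent: may proceed
]
print(ready_to_execute(waiting, {"F0"}))  # ['SUB.D']
```

With a single combined ID stage, SUB.D would be stuck behind ADD.D; separating issue from operand read is what lets it go ahead.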
45 Unpipelined Processor (MIPS)
46 Pipelined Processor (MIPS)
47 The Eight-stage Pipeline of the R4000
48 A 2-cycle Load Delay of the R4000 Integer Pipeline