Title: Appendix A. Pipelining: Basic and Intermediate Concepts
1Appendix A. Pipelining: Basic and Intermediate Concepts
Rung-Bin Lin
- What is Pipelining?
- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
- Pipe stage (pipe segment)
- Throughput
- Machine cycle: the time required to move an instruction one step down the pipeline. This time is equal to the time required for the slowest pipe stage.
- In a computer, the machine cycle is usually one clock cycle.
- The pipeline designer's goal is to balance the length of each pipe stage.
- If the stages are perfectly balanced, the time per instruction on the pipelined processor equals the time per instruction on the unpipelined processor divided by the number of pipe stages.
2A Simple Implementation of A RISC ISA
- Five-cycle implementation
- Instruction fetch cycle (IF)
- Instruction decode/register fetch cycle (ID)
- Operand fetches
- Sign-extending the immediate field
- Decoding is done in parallel with reading registers. This technique is known as fixed-field decoding.
- Test the branch condition and compute the branch address; the branch is completed at the end of this cycle.
- Execution/effective address cycle (EX)
- Memory reference
- Register-Register ALU instruction
- Register-Immediate ALU instruction
- Memory access/branch completion cycle (MEM)
- Write-back cycle (WB)
- Register-Register ALU instruction
- Register-Immediate ALU instruction
- Load instruction
3Performance of the Five-Cycle Implementation
- CPI = 4.54
- Branch instructions (12%) take 2 cycles
- Store instructions (10%) require 4 cycles
- Others take 5 cycles
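The CPI figure on this slide can be checked as a weighted average, reading the parenthesized figures as percentages of the instruction mix. The sketch below assumes the remaining 78% of instructions (100 - 12 - 10, a derived value not stated on the slide) take 5 cycles; the names `MIX` and `average_cpi` are illustrative only.

```python
# Instruction mix and cycle counts as read from the slide: (percent, cycles).
# The 78% for "other" is derived as 100 - 12 - 10; it is not stated on the slide.
MIX = {
    "branch": (12, 2),
    "store": (10, 4),
    "other": (78, 5),
}


def average_cpi(mix):
    """Weighted average of cycles per instruction over the instruction mix."""
    total_percent = sum(percent for percent, _ in mix.values())
    weighted_cycles = sum(percent * cycles for percent, cycles in mix.values())
    return weighted_cycles / total_percent


print(f"CPI = {average_cpi(MIX):.2f}")  # prints "CPI = 4.54"
```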
4The Classic Five-Stage Pipeline for a RISC Processor
5The RISC Pipeline with Registers
6Instruction Issue
- The process of letting an instruction move from
the instruction decode stage (ID) into execution
stage (EX) of this pipeline.
7Basic Performance Issues in Pipelining
- Pipelining increases instruction execution throughput, but it does not reduce the execution time of an individual instruction, because of pipeline overhead.
- Register delay
- Clock skew
- The limitation of pipeline depth is due to
- Pipeline latency
- Pipe stage imbalance
- Pipeline overhead
- Example in A-10.
8The Major Hurdle of Pipelining - Pipelining
Hazards
- A hazard is a situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle.
- Three classes of hazards
- Structural hazard: arises from resource conflicts.
- Data hazard: arises when an instruction depends on the results of a previous instruction.
- Control hazard: arises from branches and other instructions that change the PC.
- A hazard can force the pipeline to stall. While an instruction is stalled to clear a hazard,
- Instructions issued later than the stalled instruction are also stalled.
- Instructions issued earlier than the stalled one must continue.
- Note that a cache miss stalls the whole pipeline.
9Performance of Pipeline with Stalls
- When pipelining is thought of as decreasing the
CPI,
10- When pipelining is thought of as improving the
clock cycle time,
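Both views lead to the same speedup expression. A sketch of the standard relations follows, assuming the pipelined processor has an ideal CPI of 1, the stages are perfectly balanced, and pipeline overhead is ignored. The second relation additionally assumes an unpipelined CPI of 1 with a longer clock cycle.

```latex
% Pipelining viewed as decreasing the CPI (same clock cycle on both machines)
\[
\text{Speedup}
  = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instruction}}
  = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]

% Pipelining viewed as improving the clock cycle time
\[
\text{Speedup}
  = \frac{1}{1 + \text{Pipeline stall cycles per instruction}}
    \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
  = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]
```

With no stalls, both expressions reduce to a speedup equal to the pipeline depth.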
11Structural Hazards
- Due to resource conflicts (Example in A-14)
- When some functional unit is not fully pipelined.
- When some resource has not been duplicated enough.
12Data Hazards
- A memory access depends on the results of unfinished instructions.
13Forwarding (Bypassing) ALU Results To Minimize
Hazards
14Forwarding (Bypassing) Results to Store
15Bypassing Results of LOAD
16Data Hazard Classification
- Consider two instructions i and j, with i occurring before j. The possible hazards are:
- RAW (read after write): j tries to read a source before i writes it.
- WAW (write after write): j tries to write an operand before it is written by i. For example,
- LW R1, 0(R2)       IF   ID   EX   MEM1 MEM2 WB
- DADD R1, R2, R3         IF   ID   EX   WB
- WAR (write after read): j tries to write a destination before it is read by i. For example, if the read is done in the second half of MEM2 and the write is done in the first half of WB,
- SW 0(R1), R2       IF   ID   EX   MEM1 MEM2 WB
- DADD R2, R3, R4         IF   ID   EX   WB
- RAR (read after read): not a hazard.
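The classification can be sketched as set intersections over the registers each instruction reads and writes. This is a minimal illustration: the `Instr` record and `possible_hazards` function are hypothetical names, and the code only reports which hazard classes are possible between i and j. Whether a stall actually occurs depends on the pipeline timing, as in the two examples on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: registers read and registers written."""
    name: str
    reads: frozenset
    writes: frozenset


def possible_hazards(i, j):
    """Return the hazard classes that can arise between i and a later j."""
    hazards = set()
    if i.writes & j.reads:
        hazards.add("RAW")  # j reads a register that i writes
    if i.writes & j.writes:
        hazards.add("WAW")  # i and j write the same register
    if i.reads & j.writes:
        hazards.add("WAR")  # j writes a register that i reads
    return hazards  # registers that are only read by both (RAR) add nothing


lw = Instr("LW R1, 0(R2)", frozenset({"R2"}), frozenset({"R1"}))
dadd1 = Instr("DADD R1, R2, R3", frozenset({"R2", "R3"}), frozenset({"R1"}))
sw = Instr("SW 0(R1), R2", frozenset({"R1", "R2"}), frozenset())
dadd2 = Instr("DADD R2, R3, R4", frozenset({"R3", "R4"}), frozenset({"R2"}))

print(possible_hazards(lw, dadd1))  # {'WAW'}
print(possible_hazards(sw, dadd2))  # {'WAR'}
```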
17Data Hazards Requiring Stalls
- Pipeline interlock
- A piece of hardware that detects a hazard and stalls the pipeline until the hazard is cleared.
- Load interlock
- Example (Fig. A.10 at A-21)
18Control Hazards
- Caused by the instructions that change PC.
- Some basics
- If a branch changes the PC to its target address, it is a taken branch. If it does not change the PC, it falls through, or is not taken.
- Recall that if an instruction i is a taken branch, the PC is normally not changed until the end of ID. A stall cycle is required.
- Branch instruction      IF   ID   EX   MEM  WB
- Branch successor             IF   IF   ID   EX   MEM  WB
- Branch successor + 1                   IF   ID   EX   MEM  WB
- Branch successor + 2                        IF   ID   EX   MEM  WB
19Branch Penalty
- Branch delay: the length of a control hazard.
- Branch penalty: the branch delay, unless it is dealt with, turns into a branch penalty.
- The deeper the pipeline, the worse the branch penalty.
- The number of branch stalls can be reduced in two steps
- Find out whether the branch is taken or not taken earlier in the pipeline.
- Compute the taken PC (i.e., the address of the branch target) earlier.
- Branch behavior in programs
- Average frequency of taken branches: 67%
- 60% of the forward branches are taken.
- 85% of the backward branches are taken.
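To see how the branch penalty feeds into performance, the sketch below computes an effective CPI as the ideal CPI plus branch stall cycles per instruction. The branch frequency of 12% and the one-cycle penalty are assumptions chosen for illustration, and the figure 67 on this slide is read as 67% of branches taken. The function name `cpi_with_branches` is hypothetical.

```python
def cpi_with_branches(branch_freq, penalized_fraction, penalty_cycles, ideal_cpi=1.0):
    """Effective CPI when a fraction of all branches pays a fixed stall penalty."""
    return ideal_cpi + branch_freq * penalized_fraction * penalty_cycles


BRANCH_FREQ = 0.12     # assumed share of branches in the instruction mix
TAKEN_FRACTION = 0.67  # average frequency of taken branches
PENALTY = 1            # assumed one-cycle branch delay

# Every branch stalls, versus only taken branches stalling.
every_branch_stalls = cpi_with_branches(BRANCH_FREQ, 1.0, PENALTY)
only_taken_stall = cpi_with_branches(BRANCH_FREQ, TAKEN_FRACTION, PENALTY)
print(f"{every_branch_stalls:.4f} {only_taken_stall:.4f}")  # prints "1.1200 1.0804"
```

A deeper pipeline raises `PENALTY`, which is why the penalty grows with pipeline depth.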
20Reducing Pipeline Branch Penalties
- Static branch prediction methods (compile-time guess).
- Freeze or flush the pipeline
- Holding or deleting any instructions after the branch until the branch destination is known.
- Predict-not-taken (untaken) (Fig. A.12 in A-23)
- Predict-taken
- Does it have any advantage? Answer: no.
- Delayed branch
- The execution cycle with a branch delay n is
- Branch instruction
- Sequential successor 1
- Sequential successor 2
- ...
- Sequential successor n (n = 1 for MIPS)
- Branch target if taken
21Scheduling the Branch Delay Slot
22Effectiveness of Scheduling Branch Delay Slots
- Requirements for being effective
- Scheduling from before Always
- Scheduling from target Taken
- Scheduling from fall through Not taken
- The limitation on delayed-branch scheduling arises from
- The restrictions on the instructions that are scheduled into the delay slots.
- The ability to predict at compile time whether a branch is likely to be taken or not.
- Using canceling or nullifying branches to relieve the limits
- In a canceling branch, the instruction includes the direction in which the branch was predicted. When the branch behaves as predicted, the instruction in the branch delay slot is simply executed. Otherwise, the instruction in the branch delay slot is turned into a no-op.
23How Is Pipelining Implemented?
- Unpipelined 5-cycle implementation
24Simple Pipelining Implementation for MIPS
25Implementing the Control for MIPS Pipeline
- Implementing the control focuses on detecting hazards and generating control signals for forwarding.
- Hazard detection
- All the data hazards can be checked and forwarding control signals can be set during the ID phase. If a data hazard exists, the instruction is stalled before it is issued.
- Or, alternatively, hazards and forwarding are checked at the beginning of a clock cycle that uses an operand (EX and MEM for the MIPS pipeline).
- Implementing the logic for hazard detection
- Hazard detection by comparing the destination and sources of adjacent instructions (fig. A.20 on page A-34).
- An example shows the detection of all load interlocks when the instruction using the load result is in the ID stage (fig. A.21 on page A-34).
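The load interlock check can be sketched as a comparison between the destination of a load in EX and the sources of the instruction in ID. This is a simplified illustration, not the logic of the referenced figures: the function name and its parameters are hypothetical, and special cases such as a hardwired zero register are ignored.

```python
def load_interlock(id_ex_is_load, id_ex_dest, if_id_sources):
    """Return True when the instruction in ID must stall for one cycle.

    id_ex_is_load: whether the instruction now in EX is a load
    id_ex_dest:    destination register of that instruction (or None)
    if_id_sources: source registers of the instruction now in ID
    """
    return bool(
        id_ex_is_load
        and id_ex_dest is not None
        and id_ex_dest in if_id_sources
    )


# LD R1, 45(R2) in EX followed by DADD R5, R1, R7 in ID: the load result is needed.
print(load_interlock(True, "R1", {"R1", "R7"}))  # True
# LD R1, 45(R2) in EX followed by DADD R5, R6, R7 in ID: no dependence.
print(load_interlock(True, "R1", {"R6", "R7"}))  # False
```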
26Implementing Forwarding Logic
- Forwarding sources: ALU or data memory output.
- Forwarding destinations: ALU input, data memory input, or zero detection unit (for BRANCH).
- The forwarding can be implemented by checking the following conditions
- EX/MEM.IR.destination = ID/EX.IR.source?
- MEM/WB.IR.destination = ID/EX.IR.source?
- MEM/WB.IR.destination = EX/MEM.IR.source?
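The first two comparisons, which feed an ALU input, can be sketched as a selection function. This is an illustrative model under stated assumptions: `forward_select` and its string return values are hypothetical, a destination of `None` stands for an instruction that writes no register, and the third comparison (forwarding into the data memory input for a store) is not modeled.

```python
def forward_select(source_reg, ex_mem_dest, mem_wb_dest):
    """Choose where an ALU input of the instruction in EX gets its value.

    source_reg:  source register named by the instruction in ID/EX
    ex_mem_dest: destination register held in EX/MEM (None if no write)
    mem_wb_dest: destination register held in MEM/WB (None if no write)
    """
    if source_reg is None:
        return "REGFILE"
    if source_reg == ex_mem_dest:
        return "EX/MEM"  # the most recent producer takes priority
    if source_reg == mem_wb_dest:
        return "MEM/WB"
    return "REGFILE"


print(forward_select("R1", "R1", "R4"))  # EX/MEM
print(forward_select("R1", "R2", "R1"))  # MEM/WB
print(forward_select("R1", "R2", "R3"))  # REGFILE
```

Checking EX/MEM before MEM/WB matters when both pipeline registers hold a result for the same register, since the younger result is the one the instruction should see.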
27Forwarding Data to the Two ALU Inputs
28Dealing with Branches in the Pipeline
29What Makes Pipelining Hard to Implement
- Exceptions (interrupts, faults) make pipelining difficult to implement.
- Instruction set complications
30Types of Exceptions
- Types
- I/O device request
- Invoking an OS service from a user program
- Tracing instruction execution
- Breakpoint
- Integer arithmetic overflow or underflow
- FP arithmetic anomaly
- Page fault
- Misaligned memory access
- Memory-protection violation
- Using an undefined instruction
- Hardware malfunction
- Power failure
- Exceptions for different architecture (fig. A.26
on page A-40).
31Classification of Exceptions
- Synchronous versus asynchronous
- If the event occurs at the same place every time that the program is executed with the same data and memory allocation, the event is called synchronous.
- User requested versus coerced
- User maskable versus nonmaskable
- Within versus between instructions
- Depends on whether the event prevents instruction completion by occurring in the middle of execution or whether it is recognized between instructions.
- Resume versus terminate (fig. 3.40 on page 182).
32Action Requirements for Different Exception Types
(Fig. A.27 on page A-42)
- Actions
- Resume
- Terminate
- The most difficult exceptions have two properties
- They occur within instructions (i.e., at the EX or MEM stages).
- They must be restartable (must save the PC of the instruction at which to restart).
33Exception Handling
- Stopping and restarting execution
- Force a trap instruction on the next IF
- Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline.
- After the exception-handling routine in the operating system receives control, it immediately saves the PC of the faulting instruction.
- IF   ID   EX   MEM  WB    <--- Faulting instruction
-      IF   ID   EX   MEM  WB
-           IF   ID   EX   MEM  WB
-                IF   ID   EX   MEM  WB
-                     IF   ID   EX   MEM
- Trap instruction ->      IF   ID   EX
- If delayed branch is used, we need to save and
restore as many PCs as the length of the branch
delay plus one.
34Precise Interrupt
- A pipeline has precise interrupts if it can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch.
- Supporting precise interrupts is a requirement in many systems.
- Exceptions in DLX
- With pipelining, multiple exceptions may occur in the same clock cycle (fig. A.28 on page A-44).
35Implementations of Precise Exceptions
- Principle
- The pipeline should be able to handle the exceptions caused by instruction i prior to the exceptions caused by instruction i + 1.
- Implementation
- Hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction.
- Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off.
- When an instruction enters WB, the exception status vector is checked. If any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.
- This guarantees that all exceptions will be seen on instruction i before any are seen on i + 1.
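The status-vector scheme can be sketched in software as follows. This is a toy model, not a hardware description: the class `InFlight`, the `STAGE_ORDER` list, and the exception strings are hypothetical. It shows an exception on instruction i + 1 being raised earlier in time than one on instruction i, yet being handled later because the vectors are examined in program order at WB.

```python
# Hypothetical stage order used to rank one instruction's exceptions
# in the order an unpipelined machine would encounter them.
STAGE_ORDER = ["IF", "ID", "EX", "MEM", "WB"]


class InFlight:
    """An instruction in the pipeline carrying its exception status vector."""

    def __init__(self, name):
        self.name = name
        self.exceptions = {}        # stage -> description of the exception
        self.writes_enabled = True

    def post(self, stage, cause):
        """Record an exception and disable any later state-changing write."""
        self.exceptions[stage] = cause
        self.writes_enabled = False

    def check_at_wb(self):
        """Return the earliest posted exception in unpipelined order, or None."""
        for stage in STAGE_ORDER:
            if stage in self.exceptions:
                return (self.name, stage, self.exceptions[stage])
        return None


ld, dadd = InFlight("i"), InFlight("i+1")
dadd.post("IF", "page fault on instruction fetch")  # raised first in time
ld.post("MEM", "page fault on data access")         # raised later in time

# Instructions reach WB in program order, so i is examined before i+1.
handled = [instr.check_at_wb() for instr in (ld, dadd)]
print(handled[0])  # ('i', 'MEM', 'page fault on data access')
```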
36Instruction Committed
- When an instruction is guaranteed to complete, it is called committed.
- In the MIPS pipeline, all instructions are committed when they reach the end of the MEM stage, and no instruction updates the state before that stage. Thus precise exceptions are straightforward.
37Instruction Set Complications
- Some machines have instructions that change the state in the middle of the instruction execution.
- VAX: autoincrement addressing mode.
- VAX or IBM 360: string copy.
- Implicitly set condition codes.
- Cause difficulties in scheduling any pipeline delays between setting the condition code and the branch.
- ADD XXX       <--- Sets condition code C.
-               <--- Cannot place instructions that change C here.
- BR C, YYY     <--- Uses C for the branch.
- In fact, the condition code must be treated as an operand that requires hazard detection for RAW hazards with branches, no matter whether the condition code is set implicitly or explicitly.
- Multicycle operations in VAX
38Extending the MIPS Pipeline to Handle Multi-Cycle
Operations
- Assuming four separate functional units in our MIPS implementation
- Integer unit
- Handles loads and stores, ALU operations, and branches.
- FP and integer multiplier
- FP adder
- FP and integer divider
- If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.
39MIPS Pipeline with Multi-cycle Functional Units
40Pipelining Multi-cycle Functional Units
41Latency and Initiation (Repeat) Interval
- Latency
- The number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
- Initiation (repeat) interval
- The number of cycles that must elapse between issuing two operations of a given type.
- Latency and initiation interval for pipelining multi-cycle functional units
- Functional unit          Latency   Initiation interval
- Integer ALU              0         1
- Data memory access       1         1
- FP add                   3         1
- FP (integer) multiply    6         1
- FP (integer) divide      24        25
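The two quantities translate into stall counts in a simple way. The sketch below is a simplified model using the values from the table: the dictionary keys and function names are hypothetical, and it assumes every consumer needs its operand at the start of EX (so, for instance, the shorter effective latency seen by a store is not modeled).

```python
LATENCY = {"int_alu": 0, "mem": 1, "fp_add": 3, "fp_mul": 6, "fp_div": 24}
INITIATION = {"int_alu": 1, "mem": 1, "fp_add": 1, "fp_mul": 1, "fp_div": 25}


def raw_stalls(producer_unit, independent_between=0):
    """Stall cycles seen by an instruction that uses the producer's result.

    independent_between: number of unrelated instructions issued between
    the producer and the consumer.
    """
    return max(0, LATENCY[producer_unit] - independent_between)


def structural_stalls(unit, cycles_since_last_issue):
    """Stall cycles before another operation can issue to the same unit."""
    return max(0, INITIATION[unit] - cycles_since_last_issue)


print(raw_stalls("fp_mul"))            # 6
print(raw_stalls("fp_add", 1))         # 2
print(structural_stalls("fp_div", 1))  # 24
print(structural_stalls("fp_add", 1))  # 0
```

An initiation interval of 1 means the unit is fully pipelined, so back-to-back operations never stall on the unit itself; only the divider, with an interval of 25, can cause such a stall here.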
42Hazards and Forwarding in Longer Latency Pipelines
- Hazard detection and forwarding for a pipeline as before.
- Structural hazards can occur because the divide unit is not fully pipelined.
- The number of register writes required in a cycle can be larger than 1 because the instructions have varying running times.
- WAW hazards are possible, but WAR hazards are not possible.
- Instructions can complete in a different order than they were issued, causing problems with exceptions.
- Stalls for RAW hazards will be more frequent because of longer latency.
- Assuming all hazard detection is done in ID, three checks must be done before issuing an instruction
- Check for structural hazards
- Check for a RAW data hazard
- Check for a WAW data hazard
43RAW Hazards Caused by Longer Pipeline
44Structural Hazards in Longer Pipeline
45Maintaining Precise Exceptions (1)
- Problems caused by out-of-order completion
- DIV.D F0, F2, F4
- ADD.D F10, F10, F8
- SUB.D F12, F12, F14
- Four possible approaches
- Ignore the problem and settle for imprecise exceptions.
- Buffer the results of an operation until all the operations that were issued earlier are completed.
- History file approach: buffer the original register values.
- Future file approach: keep the newer values of registers.
- Allow the exceptions to become somewhat imprecise, but keep enough information so that the trap-handling routines can create a precise sequence for exceptions. This means knowing what operations were in the pipeline and their PCs.
46Maintaining Precise Exceptions (2)
- Worst-case scenario
- Instruction 1: a long-running instruction that interrupts.
- Instruction 2: not completed.
- ...
- Instruction n-1: not completed.
- Instruction n: completed.   <--- The latest completed instruction.
- The software must simulate instruction 1 through instruction n-1 and restart the execution at instruction n + 1.
- Allow the instruction issue to continue only if it is certain that all the instructions before the issuing instruction will complete without causing an exception. This sometimes means stalling the machine to maintain precise exceptions.
47Number of Stalls per FP Operation
48Performance of a MIPS FP Pipeline
49Overview of The MIPS R4000 Pipeline
- An implementation of MIPS64
- Eight pipeline stages (superpipelining)
50Load Delay in MIPS R4000
51Branch Delay in MIPS R4000
52CPI of MIPS R4000
53Concluding Remarks
- We can spend a little money to buy a very
powerful computer today.