Title: Pipelining Issues Lecture
1 Pipelining Issues (Lecture 5)
2 Today's Agenda
- Pipelines Provide Significant Performance Enhancements
- Pipelines Also Introduce Special Problems
  - Hazards
    - Structural
    - Data
    - Control
- Hazards Challenge Ideal Speedups
  - They increase CPI
  - Reduction in theoretical speedup
- Common Techniques To Deal With Hazards
  - Structural hazards can be eliminated by more hardware
  - Data hazards can be minimized by forwarding
  - Control hazards can be minimized by delayed branching
3 For next time
- Homework: Chapter 3, problems 3.1, 3.3, 3.4
- Read Chapter 4
  - This will complete desktop CPU design
  - After the Chapter 4 material, we will delve into memory
4 Review: CPU Operates as a State Machine
- CPUs perform standard operations:
  - IF: Instruction Fetch
    - All instructions do the same
  - ID: Instruction Decode
    - All instructions are decoded
    - Some access registers
  - EX: Execute Instruction
    - Arithmetic operations
    - Address calculation
  - MEM: Memory Access
    - Ld/St move data to/from memory
  - WB: Write Back
    - Update the register file with the result
5 5 Steps of DLX Datapath (Figure 3.1, page 130)
[Datapath figure: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back; pipeline latches include IR and LMD]
6 Ideal Speedup for Pipelining
- Assume the clock cycle of the non-pipelined machine is k * t, where t is the pipelined clock cycle and k is the pipeline depth
- Assume we are executing N >> k instructions
- We get the first result out k clocks from the start
- We then complete one result every clock, so N - 1 clocks later we are done
- Therefore (see the derivation below):
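Putting the two times together, with t as the pipelined clock period, gives a sketch of the derivation the "Therefore" points to:

    \mathrm{Speedup} = \frac{T_{unpipelined}}{T_{pipelined}}
                     = \frac{N \cdot k \cdot t}{(k + N - 1) \cdot t}
                     = \frac{N k}{k + N - 1}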
7 Ideal Speedup for Pipelining
- As N goes to infinity, speedup -> k
- Interesting, eh?
8 It's Not That Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  - Control hazards: pipelining of branches and other instructions that change the PC
- The common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles in the pipeline
9 One Memory Port / Structural Hazards (Figure 3.6, page 142)
10 Example: Dual-Port vs. Single-Port
- Machine A: dual-ported memory
- Machine B: single-ported memory, but its pipelined implementation has a 1.05x faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed
- SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline Depth
- SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 1.4) x 1.05 = 0.75 x Pipeline Depth
- SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
- Machine A is 1.33 times faster
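A minimal sketch of the same arithmetic in Python, with a pipeline depth of 5 chosen only for illustration (the slide keeps the depth symbolic):

    # Speedup over the unpipelined machine = pipeline depth / (1 + stall cycles per instruction),
    # scaled by the clock-rate ratio of the pipelined implementation.
    pipeline_depth = 5            # illustrative value; the slide leaves this symbolic
    load_fraction = 0.40          # loads are 40% of instructions executed
    stall_per_load = 1            # Machine B stalls 1 cycle per load (single memory port)
    clock_boost_b = 1.05          # Machine B's clock is 1.05x faster

    speedup_a = pipeline_depth / (1 + 0)                                       # dual-ported: no structural stalls
    speedup_b = pipeline_depth / (1 + load_fraction * stall_per_load) * clock_boost_b

    print(f"SpeedupA = {speedup_a:.2f}, SpeedupB = {speedup_b:.2f}")           # 5.00 vs 3.75
    print(f"Machine A is {speedup_a / speedup_b:.2f}x faster")                 # ~1.33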
11 Data Hazards
- Data hazards are associated with accessing data from registers
12 Data Hazard
13 Software Fix: Stall
14 Stalling
- Depends on a good compiler
  - Must do register dependency analysis
- Code rearranging can help
  - However, limited due to the nature of programs
- A better solution is forwarding
15 Data Forwarding
- The result is already somewhere in the pipeline
- We can add additional data paths (and muxes) to make the result available; a sketch of the forwarding decision follows
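A minimal sketch of the forwarding-unit decision for a 5-stage pipeline; the function and field names are illustrative assumptions, not something defined on the slides:

    # Decide, for each source register of the instruction now in EX, where its value comes from:
    # the EX/MEM pipeline register (previous instruction), the MEM/WB register (two back),
    # or the register file (no hazard). R0 is hardwired to zero and never forwarded.
    def forward_sources(ex_src_regs, exmem_dest, exmem_writes, memwb_dest, memwb_writes):
        sources = {}
        for reg in ex_src_regs:
            if exmem_writes and exmem_dest == reg and reg != 0:
                sources[reg] = "EX/MEM"         # newest value wins
            elif memwb_writes and memwb_dest == reg and reg != 0:
                sources[reg] = "MEM/WB"
            else:
                sources[reg] = "register file"
        return sources

    # Example: ADD R1,R2,R3 followed by SUB R4,R1,R5 -> R1 is forwarded from EX/MEM
    print(forward_sources(ex_src_regs=[1, 5], exmem_dest=1, exmem_writes=True,
                          memwb_dest=0, memwb_writes=False))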
16 Loads
- Unfortunately, forwarding does not completely solve all problems
- Can't go backwards in time!
17 Loads require 1 stall
18 Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f, assuming a, b, c, d, e, and f are in memory (a stall count for both versions is sketched after the listings).
- Slow code:
  - LW Rb,b
  - LW Rc,c
  - ADD Ra,Rb,Rc
  - SW a,Ra
  - LW Re,e
  - LW Rf,f
  - SUB Rd,Re,Rf
  - SW d,Rd
- Fast code:
  - LW Rb,b
  - LW Rc,c
  - LW Re,e
  - ADD Ra,Rb,Rc
  - LW Rf,f
  - SW a,Ra
  - SUB Rd,Re,Rf
  - SW d,Rd
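A minimal sketch that counts load-use stalls in the two sequences above, assuming full forwarding so that only an instruction using a load result in the very next cycle stalls once (the tuple encoding is an illustrative assumption):

    # Each instruction is (opcode, destination register, list of source registers).
    def count_load_use_stalls(instrs):
        stalls = 0
        for prev, cur in zip(instrs, instrs[1:]):
            if prev[0] == "LW" and prev[1] in cur[2]:   # load immediately followed by a use
                stalls += 1
        return stalls

    slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
            ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
    fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []), ("ADD", "Ra", ["Rb", "Rc"]),
            ("LW", "Rf", []), ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

    print(count_load_use_stalls(slow))   # 2 stalls (after LW Rc and after LW Rf)
    print(count_load_use_stalls(fast))   # 0 stalls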
19 Three Generic Data Hazards
- InstrI followed by InstrJ
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
  - Reading from the register file (InstrJ)
  - Writing to the register file (InstrI)
- This is what we have been looking at with our forwarding:
  - Op r1, r2, r3   (r1 is written)
  - Op r4, r1, r5   (r1 is read)
20 Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it
  - InstrI gets the wrong operand
- Can't happen in the DLX 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
21 Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it
  - Leaves the wrong result (InstrI's, not InstrJ's)
- Can't happen in the DLX 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- We will see WAR and WAW in later, more complicated pipes
22 Data Hazards Summary
- Of the three types (WAW, WAR, RAW):
- WAW: the pipeline prevents this, as all writes occur in the same stage
- WAR: the pipeline prevents this, as reads occur in stage 2 and writes in stage 5
- RAW: can occur
  - Forwarding solves all but load-word hazards
  - Otherwise we need to place a NOP in between
  - A good compiler can eliminate the stall by code rearrangement
23 Control Hazards
- Control hazards occur when sequential instruction execution is violated
  - Unconditional jump/branch
  - Conditional jump/branch
- Let's see...
24 Original Design
- Branch conditions are not resolved until the EX stage
- The new address is not muxed in until the MEM (memory access) stage
- 2 cases:
  - Take the branch (unconditional, or condition met)
  - Don't take the conditional branch
25 Control Hazard on Branches: Three-Stage Stall
26 Branch Stall Impact
- If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, the new CPI = 1.9! (worked out below)
- Two-part solution:
  - Determine whether the branch is taken sooner, AND
  - Compute the taken-branch address earlier
- DLX branches test whether a register = 0 or != 0
- DLX solution:
  - Move the zero test to the ID/RF stage
  - Add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for a branch versus 3
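The slide's numbers follow from adding the stall cycles contributed by branches to the base CPI; a quick check (these worked lines are not printed on the slide):

    \mathrm{CPI}_{new} = \mathrm{CPI}_{base} + f_{branch} \times \mathrm{penalty}
    1 + 0.30 \times 3 = 1.9 \quad \text{(3-cycle stall)}
    1 + 0.30 \times 1 = 1.3 \quad \text{(DLX 1-cycle penalty)}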
27 Pipelined DLX Datapath (Figure 3.22, page 163)
[Datapath figure: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back, with the zero test and branch adder moved to ID/RF]
This is the correct 1-cycle-latency implementation!
28 Four Branch Hazard Alternatives
- #1: Stall until the branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - Squash instructions in the pipeline if the branch is actually taken
  - Advantage of late pipeline state update
  - 47% of DLX branches are not taken on average
  - PC+4 is already calculated, so use it to fetch the next instruction
- #3: Predict Branch Taken
  - 53% of DLX branches are taken on average
  - But the branch target address has not yet been calculated in DLX
    - DLX still incurs the 1-cycle branch penalty
    - On other machines, the branch target may be known before the branch outcome
29 Four Branch Hazard Alternatives
- #4: Delayed Branch
- Define the branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken
  (the sequential successors form a branch delay of length n)
- A 1-slot delay allows the branch decision and the branch target address to be ready in the 5-stage pipeline
- DLX uses this
30 Delayed Branch
- Where to get instructions to fill the branch delay slot?
  - From before the branch instruction
  - From the target address: only valuable when the branch is taken
  - From the fall-through: only valuable when the branch is not taken
  - Cancelling branches allow more slots to be filled
- Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful computation
  - So about 50% (60% x 80%) of slots are usefully filled
- Delayed branch downside: 7-8 stage pipelines and multiple instructions issued per clock (superscalar) make the slots harder to fill
31 Evaluating Branch Alternatives

  Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
  Stall pipeline             3          1.42            3.5                   1.0
  Predict taken              1          1.14            4.4                   1.26
  Predict not taken          1          1.09            4.5                   1.29
  Delayed branch             0.5        1.07            4.6                   1.31

- Conditional and unconditional branches are 14% of instructions; 65% of them change the PC (are taken)
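The CPI column is consistent with the same formula as before, using 14% branches and each scheme's effective penalty, with speedup vs. unpipelined roughly pipeline depth / CPI (these checks are implied by, not printed on, the slide):

    \text{Stall pipeline: } 1 + 0.14 \times 3 = 1.42
    \text{Predict taken: } 1 + 0.14 \times 1 = 1.14
    \text{Predict not taken: } 1 + 0.14 \times 0.65 \times 1 \approx 1.09
    \text{Delayed branch: } 1 + 0.14 \times 0.5 = 1.07
    \text{e.g., speedup vs. unpipelined for stall} = 5 / 1.42 \approx 3.5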