Title: Pipelining - Hazards
1. Pipelining - Hazards
2. Can Pipelining Get Us Into Trouble?
- Yes: pipeline hazards
- Structural hazards: attempt to use the same resource in two different ways at the same time
  - E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
- Control hazards: attempt to make a decision before the condition is evaluated
  - E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
  - Branch instructions
- Data hazards: attempt to use an item before it is ready
  - E.g., one sock of a pair is in the dryer and one is in the washer; can't fold until the sock from the washer gets through the dryer
  - An instruction depends on the result of a prior instruction still in the pipeline
3. Structural Hazard
- A relation between two instructions indicating that the two instructions may want to use the same hardware resource (functional unit, register file port, shared bus, cache port, etc.) at the same time
- The MIPS pipeline as designed so far does not have structural hazards
  - But we had to design it to avoid them
- Usually occurs when a functional unit is not fully pipelined (e.g., in a floating-point pipeline)
4. Single Memory Port / Structural Hazard
[Figure: pipeline timing diagram. Horizontal axis: time (clock cycles 1-7). Vertical axis: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4).]
5. Single Memory Port / Structural Hazard
[Figure: pipeline timing diagram. Horizontal axis: time (clock cycles 1-7). Vertical axis: instruction order (Load with its DMem access, Instr 1, Instr 2, Stall, Instr 3).]
How do you bubble the pipe?
6. Single Memory Port / Structural Hazard
- Instead of stalling the pipeline, other solutions:
  - Make the memory dual-ported
  - Physically separate the memory into instruction and data memories (Harvard architecture, from the Harvard Mark I project of IBM led by Dr. Howard Aiken)
- Another typical structural hazard
  - A functional unit is not fully pipelined due to cost/complexity
  - Pipeline interval > 1 pipe stage
7. Example: Cost of Structural Hazard
Suppose that 40% of the instruction mix are loads or stores, and that the ideal CPI of the pipelined machine is 1. Assume that the machine with the structural hazard has a clock rate that is 5% higher than the clock rate of the machine without the hazard. Which pipeline is faster, and by how much?
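One way to work this example, as a minimal sketch. It assumes that every load or store costs exactly one stall cycle on the machine with the hazard (the single-memory-port case from the earlier slides) and that a 5% higher clock rate means a cycle time shorter by a factor of 1.05. The function and variable names are illustrative only.

```python
def avg_instruction_time(ideal_cpi, stall_fraction, stall_cycles, cycle_time):
    """Average time per instruction = effective CPI x clock cycle time."""
    effective_cpi = ideal_cpi + stall_fraction * stall_cycles
    return effective_cpi * cycle_time

# Machine without the structural hazard: CPI = 1, cycle time normalized to 1.0.
t_no_hazard = avg_instruction_time(1.0, 0.0, 0, 1.0)

# Machine with the hazard: 40% of instructions stall one cycle (assumption),
# but the clock is 5% faster, so the cycle time is 1/1.05.
t_hazard = avg_instruction_time(1.0, 0.40, 1, 1.0 / 1.05)

speedup_without_hazard = t_hazard / t_no_hazard
print(f"no hazard: {t_no_hazard:.3f}, with hazard: {t_hazard:.3f}")
print(f"machine without the hazard is {speedup_without_hazard:.2f}x faster")
```

Under these assumptions the machine without the structural hazard is about 1.33 times faster, despite its slower clock.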
8. Data Hazards
9. Three Generic Data Hazards
- True (or Flow) Dependency (Read After Write, or RAW)
  - A later instruction tries to read an operand before an earlier instruction writes it
I: add r1,r2,r3
J: sub r4,r1,r3
10. RAW Hazards
- True (value, flow) dependence between instructions i and j means i produces a result value that j uses
- This is a producer-consumer relationship
- This is a dependence based on values, not on the names of the containers of the values
- Every true dependence is a RAW hazard
- Not every RAW hazard is a true dependence
  - Any RAW hazard that cannot be removed by renaming is a true dependence
Original program:
1: A = B + C
2: A = D + E
3: G = A + H
True dependence: (2,3). RAW hazards: (1,3), (2,3)
Renamed program:
1: X = B + C
2: A = D + E
3: G = A + H
True dependence: (2,3). RAW hazard: (2,3)
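The distinction between RAW hazards and true dependences in the original and renamed programs can be checked mechanically. The sketch below is illustrative only: it models each statement as a hypothetical `(dest, sources)` pair, treats a RAW hazard as "a later statement reads a name an earlier one writes", and treats a true dependence as a RAW pair with no intervening write to that name.

```python
def raw_hazards(program):
    """All pairs (i, j), i before j, where j reads a name that i writes."""
    pairs = []
    for j, (_, srcs_j) in enumerate(program):
        for i in range(j):
            if program[i][0] in srcs_j:
                pairs.append((i + 1, j + 1))  # 1-based statement numbers
    return pairs

def true_dependences(program):
    """RAW pairs where i is the most recent writer of the name before j."""
    pairs = []
    for i, j in raw_hazards(program):
        name = program[i - 1][0]
        # Statements strictly between i and j (1-based) are indices i .. j-2.
        overwritten = any(program[k][0] == name for k in range(i, j - 1))
        if not overwritten:
            pairs.append((i, j))
    return pairs

# (dest, sources) for statements 1..3; the operators do not matter here.
original = [("A", ("B", "C")), ("A", ("D", "E")), ("G", ("A", "H"))]
renamed = [("X", ("B", "C")), ("A", ("D", "E")), ("G", ("A", "H"))]

for label, prog in (("original", original), ("renamed", renamed)):
    print(label, "RAW:", raw_hazards(prog), "true:", true_dependences(prog))
```

Renaming the destination of statement 1 removes the (1,3) RAW hazard, while the (2,3) pair survives because statement 3 really consumes the value produced by statement 2.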
11. Three Generic Data Hazards
- Anti-Dependency (Write After Read, or WAR)
  - A later instruction tries to write an operand before an earlier instruction reads it
- This hazard results from reuse of the same register
- Can't happen in our simple 5-stage pipeline because
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
I: add r2,r1,r3
J: sub r1,r4,r3
12. Three Generic Data Hazards
- Output Dependency (Write After Write, or WAW)
  - A later instruction tries to write an operand before an earlier instruction writes it
- This hazard results from reuse of the same register
- Can't happen in our simple 5-stage pipeline because
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
I: add r1,r2,r3
J: sub r1,r4,r3
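The three I/J pairs used to illustrate RAW, WAR, and WAW can be classified with a few comparisons. This is a minimal sketch with a hypothetical `(dest, sources)` encoding of each instruction. It only reports the register-name relationships; whether a relationship becomes an actual hazard depends on the pipeline, as the bullets above note for the simple 5-stage case.

```python
def classify_hazards(first, second):
    """Return the hazard types between two instructions in program order.

    Each instruction is a (dest, sources) pair of register names.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    kinds = set()
    if dest_i in srcs_j:   # J reads what I writes
        kinds.add("RAW")
    if dest_j in srcs_i:   # J writes what I reads
        kinds.add("WAR")
    if dest_i == dest_j:   # J writes what I writes
        kinds.add("WAW")
    return kinds

# I: add r1,r2,r3   J: sub r4,r1,r3
raw_pair = classify_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3")))
# I: add r2,r1,r3   J: sub r1,r4,r3
war_pair = classify_hazards(("r2", ("r1", "r3")), ("r1", ("r4", "r3")))
# I: add r1,r2,r3   J: sub r1,r4,r3
waw_pair = classify_hazards(("r1", ("r2", "r3")), ("r1", ("r4", "r3")))

print(raw_pair, war_pair, waw_pair)
```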
13. More on WAR and WAW
- WAR and WAW hazards are name dependences
  - Two instructions happen to use the same register (name), although they don't have to
- Can often be eliminated by renaming, either in software or hardware
  - Implies the use of additional resources, hence additional cost
- Renaming is not always possible: implicit operands such as the accumulator, PC, or condition codes cannot be renamed
14. How to Break the Dependency
- Dependency reduces concurrency
- Can we break
  - True dependency (RAW)?
  - Name dependency or false dependency (WAR, WAW)?
15. Software Solution
- Have the compiler guarantee no hazards
- Where do we insert the nops?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
- Problem: this really slows us down!
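A sketch of how a compiler might place the nops for this sequence. It rests on assumptions the slide does not state: no forwarding hardware, and a register file that is written in the first half of a cycle and read in the second half, so a reader must sit at least three instruction slots after the writer of its source register. The `(op, dest, sources)` encoding and the `min_distance` parameter are hypothetical.

```python
NOP = ("nop", None, ())

def insert_nops(program, min_distance=3):
    """Pad with nops so each reader is at least min_distance slots after
    the most recent writer of every source register."""
    out = []
    last_write = {}  # register -> index in `out` of its latest writer
    for op, dest, srcs in program:
        # Earliest slot at which this instruction may be placed.
        earliest = max(
            (last_write[r] + min_distance for r in srcs if r in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append(NOP)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((op, dest, srcs))
    return out

program = [
    ("sub", "$2", ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("or", "$13", ("$6", "$2")),
    ("add", "$14", ("$2", "$2")),
    ("sw", None, ("$15", "$2")),  # a store writes no register
]

scheduled_ops = [op for op, _, _ in insert_nops(program)]
print(scheduled_ops)
```

Under these assumptions two nops go between `sub` and `and`; the later readers of `$2` are then already far enough from the `sub`.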
16. Hardware Solution: Forwarding
[Figure: pipeline timing diagram, time (clock cycles) vs. instruction order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
17. Forwarding (simplified)
[Figure: datapath with the ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, data memory, and MUX]
18. Forwarding Unit
1. Forwarding between ALUOut and ALUMuxA
sub $2, $1, $3
and $12, $2, $5
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 => Use EX/MEM.ALUOut instead of ID/EX.A
a. Some instructions do not write registers
b. Every use of $0 as an operand must yield an operand value of zero
If ( EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs) ) ForwardA = 01
19. Forwarding Unit
2. Forwarding between ALUOut and ALUMuxB
sub $2, $1, $3
and $12, $5, $2
EX/MEM.RegisterRd = ID/EX.RegisterRt = $2 => Use EX/MEM.ALUOut instead of ID/EX.B
If ( EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt) ) ForwardB = 01
20. Forwarding (from EX/MEM)
[Figure: datapath with the ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, data memory, and MUX]
21. Forwarding Unit
3. Forwarding between ALUOut and ALUMuxA
sub $2, $1, $3
and $12, $2, $5
or $13, $2, $6
MEM/WB.RegisterRd = ID/EX.RegisterRs = $2 => Use MEM/WB.ALUOut instead of ID/EX.A
If ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs) ) ForwardA = 10
22. Forwarding Unit
4. Forwarding between ALUOut and ALUMuxB
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
MEM/WB.RegisterRd = ID/EX.RegisterRt = $2 => Use MEM/WB.ALUOut instead of ID/EX.B
If ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt) ) ForwardB = 10
23. Forwarding (from MEM/WB)
[Figure: datapath with the ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, data memory, and MUX]
24. Forwarding (operand selection)
[Figure: datapath with the ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, data memory, MUX, and forwarding unit]
25. Forwarding (operand propagation)
[Figure: datapath with the ID/EX, EX/MEM, and MEM/WB pipeline registers, register file, ALU, data memory, MUX, and forwarding unit. Labeled signals: Rs, Rt, Rd, EX/MEM Rd, MEM/WB Rd.]
26. Forwarding
27. Datapath with Forwarding Unit
28. Forwarding Unit
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
If ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs) and (MEM/WB.RegisterRd = ID/EX.RegisterRs) ) ForwardA = 10
If ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt) and (MEM/WB.RegisterRd = ID/EX.RegisterRt) ) ForwardB = 10
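The forwarding conditions from these slides can be collected into one selection function per ALU operand. The sketch below transcribes the slide conditions literally; the function name, the argument names, and the use of `"00"` to mean "no forwarding, use the register file value" are assumptions for illustration and are not given in the slides.

```python
def forward_select(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_src):
    """Return the mux select for one ALU operand.

    '01' -> take EX/MEM.ALUOut, '10' -> take the MEM/WB value,
    '00' -> no forwarding (assumed encoding).
    id_ex_src is ID/EX.RegisterRs for ForwardA or ID/EX.RegisterRt for ForwardB.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return "01"
    if (mem_wb_regwrite and mem_wb_rd != 0
            and ex_mem_rd != id_ex_src and mem_wb_rd == id_ex_src):
        return "10"
    return "00"

# Third add of: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4
# EX/MEM holds the second add (Rd = 1), MEM/WB holds the first add (Rd = 1).
forward_a = forward_select(True, 1, True, 1, id_ex_src=1)  # Rs = $1
forward_b = forward_select(True, 1, True, 1, id_ex_src=4)  # Rt = $4

# or $13,$6,$2 after sub $2,$1,$3 ; and $12,$2,$5
# EX/MEM holds the and (Rd = 12), MEM/WB holds the sub (Rd = 2).
or_forward_b = forward_select(True, 12, True, 2, id_ex_src=2)  # Rt = $2

print(forward_a, forward_b, or_forward_b)
```

For the three back-to-back adds, operand A comes from EX/MEM rather than MEM/WB, so the most recent value of `$1` wins, which is the point of the extra `EX/MEM.RegisterRd ≠ ID/EX.RegisterRs` term.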
29. Some Other Data Dependencies
- lw $1, 0($2)   F D X M W
- sw $1, 0($7)     F D X M W
- sw $1, 0($8)       F D X M W
- sw $1, 0($9)         F D X M W
30. Can't Always Forward
- Load word can still cause a hazard
[Figure: pipeline timing diagram, time (clock cycles) vs. instruction order]
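31. Data Hazard Even with Forwarding
[Figure: pipeline timing diagram, time (clock cycles) vs. instruction order, with a bubble inserted after the load. Labels in the figure: NO ISSUE, Bubble, ALU, DMem.]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Thus, we need a hazard detection unit to stall the instruction that uses the loaded value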
32. Stalling
If ( ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) ) stall the pipeline
- When the pipeline is stalled
  - Do not fetch a new instruction: prevent the PC and IF/ID registers from changing
  - Create a bubble in the pipeline: set all control signals to 0 to create a do-nothing instruction
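The stall condition and the two actions listed above can be sketched as a small function. The output names (`pc_write`, `if_id_write`, `zero_controls`) are hypothetical labels for "let the PC change", "let IF/ID change", and "force the control signals to 0"; they are not taken from the slides.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use check: the instruction in EX is a load whose destination (Rt)
    is a source of the instruction currently in ID."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "stall": stall,
        "pc_write": not stall,      # hold the PC while stalled
        "if_id_write": not stall,   # hold IF/ID while stalled
        "zero_controls": stall,     # inject a do-nothing instruction
    }

# lw r1, 0(r2) is in EX (ID/EX.MemRead = 1, Rt = 1);
# sub r4, r1, r6 is in ID (Rs = 1, Rt = 6).
first_cycle = hazard_detection(True, 1, 1, 6)

# One cycle later the bubble is in EX (MemRead = 0), so sub may proceed.
next_cycle = hazard_detection(False, 0, 1, 6)

print(first_cycle["stall"], next_cycle["stall"])
```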
33. Hazard Detection Unit
34. Code Rescheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
Compiler optimizes for performance. Hardware checks for safety.
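To see why the rescheduled code is faster, the load-use stalls in each version can be counted. This sketch assumes full forwarding and a one-cycle penalty whenever the instruction immediately after a load reads the loaded register; the `(op, dest, sources)` encoding is hypothetical.

```python
def load_use_stalls(program):
    """Count instructions that directly follow a load and read its result."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

slow_stalls = load_use_stalls(slow)
fast_stalls = load_use_stalls(fast)
print(slow_stalls, fast_stalls)
```

Under this model the slow code stalls twice (ADD right after LW Rc, SUB right after LW Rf), while the fast code has an independent instruction after every load.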
35. Branch in the Pipelined Datapath
Computes branch target address
Computes branch outcome
Changes PC
36. Branch (Control) Hazards
- When we decide to branch, other instructions are in the pipeline!
37. Solving Branch Hazards
- Stall the pipeline until the branch is complete
  - Branch is detected in the ID stage
  - Pipeline is stalled
  - Pipeline is restarted in the IF stage with either
    - the next instruction, or
    - the branch target
- Three clock cycles will be lost for each branch!
38. Reducing Taken Branch Penalty
- Compute branch target address earlier
- Compute branch outcome earlier
39. Reducing Taken Branch Penalty
- Branch is completed in the ID stage
- If the branch is taken, flush the pipeline
- 1 cycle loss for a taken branch
Taken branch    F  D  X  M  W
Branch + 1         F  FL FL FL FL
Branch target         F  D  X  M  W
BT + 1                   F  D  X  M  W
40. Flushing the Instruction After Branch
41. Predict-Not-Taken (Predict-Untaken)
- Continue execution after the branch
- If the branch is not taken, no penalty
- If the branch is taken, flush the pipeline and lose 1 clock cycle
- What about Predict-Taken?
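The effect of these schemes on CPI can be estimated with a short calculation. The penalties come from the slides (3 cycles per branch when always stalling, 1 cycle per taken branch with predict-not-taken and the branch resolved in ID). The branch frequency (20%) and taken fraction (60%) are made-up illustrative values, not figures from the lecture.

```python
def cpi_with_branches(branch_fraction, penalty_cycles,
                      penalized_fraction=1.0, base_cpi=1.0):
    """Effective CPI = base CPI + average branch stall cycles per instruction."""
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles

BRANCH_FRACTION = 0.20  # hypothetical: 20% of instructions are branches
TAKEN_FRACTION = 0.60   # hypothetical: 60% of branches are taken

# Always stall: every branch loses 3 cycles.
cpi_stall = cpi_with_branches(BRANCH_FRACTION, 3)

# Predict-not-taken, branch resolved in ID: only taken branches lose 1 cycle.
cpi_predict_not_taken = cpi_with_branches(
    BRANCH_FRACTION, 1, penalized_fraction=TAKEN_FRACTION)

print(f"stall: {cpi_stall:.2f}, predict-not-taken: {cpi_predict_not_taken:.2f}")
```

With these illustrative numbers the CPI drops from about 1.60 to about 1.12; the real gain depends on the actual branch mix.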
42. Delayed Branches
- Execution cycle with a branch delay of length n:
  branch instruction
  sequential successor 1
  sequential successor 2
  ........
  sequential successor n
  branch target if taken
- Sequential successors 1 through n form the branch delay of length n
- Instructions in the branch delay slot are executed irrespective of the branch outcome
43. Delayed Branches on MIPS
- One branch delay slot on MIPS
- Taken and untaken branch behaviour are similar
- Compiler must fill in the branch delay slot with
useful instructions
44. Delayed Branches
- Question: what instruction do we put in the branch delay slot?
  - Fill with NOP (always possible)
  - Fill from before (not always possible)
  - Fill from target (not always possible)
  - Fill from fall-through (not always possible)
45. Filling Branch Delay Slot
Make sure R7 will not be used in the taken path before it is redefined
46. Filling Branch Delay Slot
47. Cancelling Branches
- Improves the ability of the compiler to fill in delay slots
- The instruction includes a bit showing its predicted direction
- When the branch behaves as predicted, the instruction in the delay slot is executed
- When the branch is incorrectly predicted, the instruction in the delay slot is turned into a NOP
48. Predict-Taken Cancelling Branch
49. Summary: Pipelining
- Reduce CPI by overlapping many instructions
  - Average throughput of approximately 1 CPI with a fast clock
- Utilize capabilities of the datapath
  - Start the next instruction while working on the current one
  - Limited by the length of the longest stage (plus fill/flush)
  - Detect and resolve hazards
- What makes it easy?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction