Title: EECS 470
1EECS 470
- Pipeline Hazards
- Lecture 4
- Coverage: Appendix A
2Basic Pipelining
- Data hazards
- What are they?
- How do you detect them?
- How do you deal with them?
- Micro-architectural changes
- Pipeline depth
- Pipeline width
- Forwarding ISA
3Fetch Decode Execute Memory WB
[Figure: five-stage pipelined datapath. Pipeline registers IF/ID, ID/EX, EX/Mem, and Mem/WB separate the stages; components shown are the PC, instruction memory, register file (R0-R7), ALU, data memory, and muxes. Instruction bits 0-2 and 16-18 feed the dest field and bits 22-24 feed the op field; dest and op are carried through the pipeline registers.]
4Fetch Decode Execute Memory WB
[Figure: five-stage pipelined datapath with pipeline registers IF/ID, ID/EX, EX/Mem, and Mem/WB; the dest and op fields are carried through ID/EX, EX/Mem, and Mem/WB.]
5Fetch Decode Execute Memory WB
[Figure: five-stage pipelined datapath with pipeline registers IF/ID, ID/EX, EX/Mem, and Mem/WB; the op field is carried through ID/EX, EX/Mem, and Mem/WB.]
6Pipeline function for ADD
- Fetch: read instruction from memory
- Decode: read source operands from register file
- Execute: calculate sum
- Memory: pass result to next stage
- Writeback: write sum into register file
7Data Hazards
add 1 2 3
nand 3 4 5
time ->
add:   fetch    decode   execute  memory   writeback
nand:           fetch    decode   execute  memory   writeback
If not careful, you will read the wrong value of R3
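The timing above can be checked with a few lines of Python. This is a minimal sketch, assuming one cycle per stage, no stalls, and add fetched in cycle 1; the names STAGES and stage_cycles are illustrative and not from the lecture.

```python
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]


def stage_cycles(fetch_cycle):
    """Map each stage to the cycle an instruction occupies it (no stalls)."""
    return {stage: fetch_cycle + i for i, stage in enumerate(STAGES)}


add = stage_cycles(1)   # add 1 2 3 is fetched in cycle 1
nand = stage_cycles(2)  # nand 3 4 5 is fetched one cycle later

# nand reads R3 during decode; add does not write R3 until writeback.
read_cycle = nand["decode"]
write_cycle = add["writeback"]
hazard = read_cycle < write_cycle

print(f"nand reads R3 in cycle {read_cycle}, add writes R3 in cycle {write_cycle}")
print("stale read" if hazard else "no hazard")
```

The read happens two cycles before the write, which is why nand sees the old R3 unless something intervenes.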
8Three approaches to handling data hazards
- Avoidance
- Make sure there are no hazards in the code
- Detect and Stall
- If hazards exist, stall the processor until they go away.
- Detect and Forward
- If hazards exist, fix up the pipeline to get the correct value (if possible)
9Handling data hazards: avoid all hazards
- Assume the programmer (or the compiler) knows about the processor implementation.
- Make sure no hazards exist.
- Put noops between any dependent instructions.
add 1 2 3     (write R3 in cycle 5)
noop
noop
nand 3 4 5    (read R3 in cycle 6)
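A compiler-side version of this rule might look like the sketch below. It assumes instructions are written as (op, regA, regB, dest) tuples in which regA and regB are sources, and that two noops between producer and consumer are enough, as in the example above. NOOPS_NEEDED, sources, destination, and insert_noops are hypothetical names.

```python
NOOP = ("noop", 0, 0, 0)
NOOPS_NEEDED = 2  # assumption: taken from the two-noop example on the slide


def sources(instr):
    """Registers an instruction reads (none for a noop)."""
    op, reg_a, reg_b, _dest = instr
    return set() if op == "noop" else {reg_a, reg_b}


def destination(instr):
    """Register an instruction writes, or None for a noop."""
    op, _reg_a, _reg_b, dest = instr
    return None if op == "noop" else dest


def insert_noops(program):
    """Pad the program so no instruction reads a register written
    by either of the NOOPS_NEEDED instructions just before it."""
    out = []
    for instr in program:
        while any(destination(prev) in sources(instr)
                  for prev in out[-NOOPS_NEEDED:]):
            out.append(NOOP)
        out.append(instr)
    return out


padded = insert_noops([("add", 1, 2, 3), ("nand", 3, 4, 5)])
print([instr[0] for instr in padded])
```

Because NOOPS_NEEDED depends on the pipeline depth, the padded program is tied to one implementation.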
10Problems with this solution
- Old programs (legacy code) may not run correctly on new implementations
- Longer pipelines need more noops
- Programs get larger as noops are included
- Especially a problem for machines that try to execute more than one instruction every cycle
- Intel EPIC: often 25% - 40% of instructions are noops
- Program execution is slower
- CPI is one, but some instructions are noops
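The last point can be made concrete with a quick calculation. This is an illustrative sketch: it assumes every instruction, noop or not, takes one cycle, and cycles_per_useful_instruction is a hypothetical helper name.

```python
def cycles_per_useful_instruction(noop_fraction, base_cpi=1.0):
    """Cycles spent per non-noop instruction when a given fraction
    of the instruction stream is noops."""
    if not 0.0 <= noop_fraction < 1.0:
        raise ValueError("noop_fraction must be in [0, 1)")
    return base_cpi / (1.0 - noop_fraction)


for fraction in (0.25, 0.40):
    cpi = cycles_per_useful_instruction(fraction)
    print(f"{fraction:.0%} noops: {cpi:.2f} cycles per useful instruction")
```

Under these assumptions, the 25% to 40% noop range quoted above corresponds to roughly 1.33 to 1.67 cycles per useful instruction.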
11Handling data hazards: detect and stall
- Detection
- Compare regA with previous DestRegs
- 3-bit operand fields
- Compare regB with previous DestRegs
- 3-bit operand fields
- Stall
- Keep current instructions in fetch and decode
- Pass a noop to execute
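The detection step can be sketched as a small function. Assumptions, none of which are stated on the slide: the "previous DestRegs" are the dest fields held in ID/EX and EX/Mem, a dest of None stands for a stage that holds no register-writing instruction, and the names field and detect_hazards are illustrative.

```python
def field(reg):
    """Keep only the low 3 bits, mirroring the 3-bit operand fields."""
    return reg & 0b111


def detect_hazards(reg_a, reg_b, id_ex_dest, ex_mem_dest):
    """Return the individual compare signals and the combined stall signal."""
    def match(src, dest):
        # None means the stage holds nothing that writes a register.
        return dest is not None and field(src) == field(dest)

    signals = {
        "regA==ID/EX.dest": match(reg_a, id_ex_dest),
        "regA==EX/Mem.dest": match(reg_a, ex_mem_dest),
        "regB==ID/EX.dest": match(reg_b, id_ex_dest),
        "regB==EX/Mem.dest": match(reg_b, ex_mem_dest),
    }
    stall = any(signals.values())  # for stalling, the signals are simply ORed
    return signals, stall


# nand 3 4 5 in decode while add 1 2 3 (dest 3) sits in ID/EX.
signals, stall = detect_hazards(3, 4, id_ex_dest=3, ex_mem_dest=None)
print(signals, stall)
```

When stall is true, fetch and decode keep their current instructions and a noop is passed to execute.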
12End of Cycle 1
[Figure: pipelined datapath at the end of cycle 1. add 1 2 3 has been fetched into IF/ID. Register file labels: R0 = 0, R1 = 14, R2 = 7, R3 = 10.]
13End of Cycle 2
[Figure: pipelined datapath at the end of cycle 2. nand 3 4 5 is in IF/ID; add is in ID/EX with operand values 14 and 7 and dest 3.]
14First half of cycle 3
[Figure: pipelined datapath in the first half of cycle 3. nand 3 4 5 is in decode and presents register 3 to the register file while add (dest 3) is still in ID/EX.]
15Hazard detected
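[Figure: hazard detection logic between IF/ID and ID/EX. A comparator checks the register fields of the instruction in decode (regA = 3) against the dest field held in ID/EX (3).]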
16Hazard detected
[Figure: comparator detail. The 3-bit regA field 0 1 1 (register 3) matches the 3-bit dest field 0 1 1, so the compare output is 1.]
17Handling data hazards: detect and stall the pipeline until ready
- Detection
- Compare regA with previous DestReg
- 3-bit operand fields
- Compare regB with previous DestReg
- 3-bit operand fields
- Stall
- Keep current instructions in fetch and decode
- Pass a noop to execute
18First half of cycle 3
[Figure: pipelined datapath in the first half of cycle 3, with hazard detection. nand 3 4 5 is in decode and add (dest 3) is in ID/EX; R3 still holds 10 and R4 holds 11.]
19Handling data hazards: detect and stall the pipeline until ready
- Detection
- Compare regA with previous DestReg
- 3-bit operand fields
- Compare regB with previous DestReg
- 3-bit operand fields
- Stall
- Keep current instructions in fetch and decode
- Pass a noop to execute
20End of cycle 3
[Figure: pipelined datapath at the end of cycle 3. nand 3 4 5 is held in IF/ID; add has advanced to EX/Mem with ALU result 21.]
21First half of cycle 4
[Figure: pipelined datapath in the first half of cycle 4. nand 3 4 5 is still in decode; a noop is in ID/EX and add is in EX/Mem with ALU result 21.]
22End of cycle 4
[Figure: pipelined datapath at the end of cycle 4. nand 3 4 5 is still held in IF/ID; noops are in ID/EX and EX/Mem, and add is in Mem/WB with result 21.]
23First half of cycle 5
[Figure: pipelined datapath in the first half of cycle 5. nand 3 4 5 is in decode; noops are in ID/EX and EX/Mem, and add is in Mem/WB with result 21 and dest 3.]
24End of cycle 5
[Figure: pipelined datapath at the end of cycle 5. R3 now holds 21. nand is in ID/EX with operand values 21 and 11 and dest 5; noops are in EX/Mem and Mem/WB; add 3 7 7 has been fetched into IF/ID.]
25No more hazard stalling
add 1 2 3
nand 3 4 5
time ->
add:   fetch    decode   execute  memory   writeback
nand:           fetch    decode   decode   decode   execute
                         hazard   hazard
We are careful to get the right value of R3
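The stalled timeline can be reproduced with a small scheduling sketch. It assumes (op, regA, regB, dest) tuples where regA and regB are both sources, one cycle per stage, and that a register written in writeback can be read by a decode in the same cycle (write in the first half of the cycle, read in the second). The function name schedule is hypothetical.

```python
def schedule(program):
    """Return (fetch_cycle, [stage per cycle]) for each instruction
    under detect-and-stall."""
    writeback_cycle = {}   # register -> cycle its latest producer writes it
    timeline = []
    fetch = 1
    prev_decode_end = 0
    for _op, reg_a, reg_b, dest in program:
        # Decode starts once the previous instruction has left decode.
        decode_start = max(fetch + 1, prev_decode_end + 1)
        # The last decode cycle must not precede the producers' writeback.
        ready = max(writeback_cycle.get(reg_a, 0), writeback_cycle.get(reg_b, 0))
        decode_end = max(decode_start, ready)
        stages = ["fetch"] * (decode_start - fetch)
        stages += ["decode"] * (decode_end - decode_start + 1)
        stages += ["execute", "memory", "writeback"]
        timeline.append((fetch, stages))
        writeback_cycle[dest] = decode_end + 3
        fetch = decode_start          # next fetch begins when this one decodes
        prev_decode_end = decode_end
    return timeline


timeline = schedule([("add", 1, 2, 3), ("nand", 3, 4, 5)])
for start, stages in timeline:
    print(f"starts in cycle {start}:", " ".join(stages))
```

For the add/nand pair this gives nand three decode cycles, matching the two hazard cycles in the diagram above.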
26Problems with detect and stall
- CPI increases every time a hazard is detected!
- Is that necessary? Not always!
- Re-route the result of the add to the nand
- nand no longer needs to read R3 from reg file
- It can get the data later (when it is ready)
- This lets us complete the decode this cycle
- But we need more control to remember that the data that we aren't getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.
27Handling data hazards: detect and forward
- Detection: same as detect and stall
- Except that all 4 hazards are treated differently
- i.e., you can't logical-OR the 4 hazard signals
- Forward
- New datapaths to route computed data to where it is needed
- New mux and control to pick the right data
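The mux control can be sketched as a priority choice between the possible sources of an operand. Assumptions: forwarding happens at the ALU input, the candidate sources are EX/Mem and Mem/WB, and each is given as a (dest, value) pair or None. The name select_operand is illustrative; the values 10, 21, and 11 are the ones that appear in the datapath figures of this lecture.

```python
def select_operand(src_reg, reg_file_value, ex_mem, mem_wb):
    """Pick the value for one ALU operand and report where it came from.

    ex_mem and mem_wb are (dest, value) pairs, or None when that stage
    holds no register-writing instruction. The younger producer wins.
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1], "EX/Mem"
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1], "Mem/WB"
    return reg_file_value, "ID/EX"


# nand 3 4 5 in execute: it read a stale R3 = 10, but add's result 21
# is sitting in EX/Mem with dest 3.
val_a, source = select_operand(3, 10, ex_mem=(3, 21), mem_wb=None)
val_b, _ = select_operand(4, 11, ex_mem=(3, 21), mem_wb=None)
print(val_a, source, ~(val_a & val_b))
```

Each match selects a different mux input, which is why the hazard signals cannot be ORed together as they are for stalling.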
28First half of cycle 3
[Figure: forwarding datapath in the first half of cycle 3. nand 3 4 5 is in decode and add is in ID/EX. Register file labels: R1 = 14, R2 = 7, R3 = 10, R4 = 11, R5 = 77, R6 = 1, R7 = 8.]
29End of cycle 3
[Figure: forwarding datapath at the end of cycle 3. add 6 3 7 has been fetched into IF/ID; nand is in ID/EX with the stale R3 value 10 and the R4 value 11; add 1 2 3 is in EX/Mem with ALU result 21.]
30First half of cycle 4
[Figure: forwarding datapath in the first half of cycle 4. A mux at the ALU input selects the forwarded value 21 from EX/Mem in place of the stale 10 for nand's regA operand.]
31End of cycle 4
[Figure: forwarding datapath at the end of cycle 4. lw 3 6 10 has been fetched into IF/ID; add 6 3 7 is in ID/EX; nand is in EX/Mem with ALU result -2; add 1 2 3 is in Mem/WB with result 21.]
32First half of cycle 5
[Figure: forwarding datapath in the first half of cycle 5, labeled "No Hazard". lw 3 6 10 is in decode; add 6 3 7 is in ID/EX, nand in EX/Mem, and add 1 2 3 in Mem/WB.]
33End of cycle 5
[Figure: forwarding datapath at the end of cycle 5. R3 now holds 21. sw 6 2 12 has been fetched into IF/ID; lw is in ID/EX with offset 10; add 6 3 7 is in EX/Mem with ALU result 22; nand is in Mem/WB with result -2.]
34First half of cycle 6
[Figure: forwarding datapath in the first half of cycle 6, labeled "Hazard". sw 6 2 12 is in decode and reads register 6 while lw (dest 6) is in ID/EX; add is in EX/Mem and nand in Mem/WB.]
35End of cycle 6
[Figure: forwarding datapath at the end of cycle 6. sw 6 2 12 is held in IF/ID; lw is in EX/Mem with computed address 31; add is in Mem/WB with result 22; R5 now holds -2.]
36First half of cycle 7
[Figure: forwarding datapath in the first half of cycle 7, labeled "Hazard". sw 6 2 12 is in decode; a noop is in ID/EX, lw in EX/Mem (address 31), and add in Mem/WB (result 22).]
37End of cycle 7
[Figure: forwarding datapath at the end of cycle 7. sw is in ID/EX with offset 12, the noop is in EX/Mem, and lw is in Mem/WB with loaded data 99.]
38First half of cycle 8
[Figure: forwarding datapath in the first half of cycle 8. sw is in ID/EX, the noop in EX/Mem, and lw in Mem/WB with loaded data 99.]
39End of cycle 8
[Figure: forwarding datapath at the end of cycle 8. R6 now holds 99. sw is in EX/Mem with computed address 111 and store data 7; the noop is in Mem/WB.]
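The walkthrough in cycles 6 to 8 shows one case that forwarding alone does not cover: sw 6 2 12 needs R6 while lw 3 6 10 has not yet read memory, so a noop is inserted between them. A minimal sketch of that check follows. It assumes sw reads both of its register fields and lw writes its regB field, as in the example; load_use_stall is a hypothetical name.

```python
def load_use_stall(decode_sources, id_ex_op, id_ex_dest):
    """True when the instruction in decode needs a register that the
    load currently in ID/EX has not yet fetched from memory."""
    return id_ex_op == "lw" and id_ex_dest in decode_sources


# sw 6 2 12 in decode reads R6 and R2; lw 3 6 10 in ID/EX will write R6.
print(load_use_stall({6, 2}, "lw", 6))      # True: hold sw, pass a noop

# add 6 3 7 in decode behind nand 3 4 5: an ALU result can be forwarded.
print(load_use_stall({6, 3}, "nand", 5))    # False
```

After the one-cycle bubble, the loaded value (99 in the figures) is available in the pipeline and can be forwarded like any other result.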
40FP pipeline support
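[Figure: after fetch and decode, execution splits into parallel units: an integer add unit ("I add"), an FP multiply pipeline (M1-M7), an FP adder pipeline (A1-A4), and a non-pipelined divide; the units rejoin at Mem and WB.]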
41Adding pipeline stages
- Pipeline frontend
- Fetch, Decode
- Pipeline middle
- Execute
- Pipeline backend
- Memory, Writeback
42Adding stages to fetch, decode
- Delays hazard detection
- No change in forwarding paths
- No performance penalty with respect to data hazards
43Adding stages to execute
- Check for structural hazards
- ALU not pipelined
- Multiple ALU ops completing at same time
- Data hazards may cause delays
- If multicycle op hasn't computed data before the dependent instruction is ready to execute
- Performance penalty for each stall
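The "multiple ops completing at same time" case can be illustrated with a short check. The latencies are read off the stage labels on the FP pipeline support slide (M1-M7, A1-A4) plus an assumed single-cycle integer unit; LATENCY and completion_conflicts are illustrative names, and "issue cycle" here means the cycle an op enters its first execute stage.

```python
from collections import Counter

# Execute-stage counts per unit (int latency is an assumption).
LATENCY = {"int": 1, "fpadd": 4, "fpmul": 7}


def completion_conflicts(issues):
    """issues: list of (issue_cycle, unit).
    Return the cycles in which more than one op finishes execute."""
    done = Counter(cycle + LATENCY[unit] - 1 for cycle, unit in issues)
    return sorted(cycle for cycle, count in done.items() if count > 1)


# An FP multiply issued in cycle 1 and an FP add issued in cycle 4
# both leave execute in cycle 7 and would contend for the next stage.
print(completion_conflicts([(1, "fpmul"), (4, "fpadd")]))
```

A hazard unit would have to delay one of the two ops, which is the stall penalty noted above.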
44Adding stages to memory, writeback
- Instructions ready to execute may need to wait longer for multi-cycle memory stage
- Adds more pipeline registers
- Thus more source registers to forward
- More complex hazard detection
- Wider muxes
- More control bits to manage muxes
45Wider pipelines
[Figure: two parallel five-stage pipelines, each with fetch, decode, execute, mem, and WB.]
- More complex hazard detection
- 2X pipeline registers to forward from
- 2X more instructions to check
- 2X more destinations (muxes)
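A rough count shows how these factors multiply. This is an illustrative estimate only: it assumes two source operands per instruction and two pipeline stages whose dest fields are checked, and it ignores dependences between instructions decoded in the same cycle. compare_count is a hypothetical name.

```python
def compare_count(width, checked_stages=2, sources_per_instr=2):
    """Number of source-vs-dest register comparisons per cycle."""
    sources = width * sources_per_instr      # operands being decoded
    destinations = width * checked_stages    # in-flight dest fields
    return sources * destinations


for width in (1, 2, 4):
    print(f"width {width}: {compare_count(width)} comparisons")
```

Under these assumptions, doubling the width roughly quadruples the comparison logic.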
46Making forwarding explicit
- add r1 ← r2, EX/Mem ALU result
- Include direct mux controls into the ISA
- Hazard detection is now a compiler task
- New micro-architecture leads to new ISA
- Can reduce some resources
- No longer need to build a heavily ported reg file
Ref: TTAs: Missing the ILP complexity wall