Title: Pipeline Control, Data Hazards and Branch Hazards
1. Pipeline Control, Data Hazards and Branch Hazards
EECS 322 Computer Architecture
Instructor: Francis G. Wolff, wolff_at_eecs.cwru.edu
Case Western Reserve University
2. Models
Single-cycle model (non-overlapping)
- Each instruction executes in a single clock cycle.
- The clock cycle must be stretched to fit the slowest instruction (p. 438).
Multi-cycle model (non-overlapping)
- Each instruction executes over multiple clock cycles.
- The clock cycle must be stretched to fit the slowest step.
- Functional units can be shared within the execution of a single instruction.
Pipeline model (overlapping, p. 522)
- Each instruction executes over multiple clock cycles.
- The clock cycle must be stretched to fit the slowest step.
- Throughput is nominally one clock cycle per instruction.
- Gains efficiency by overlapping the execution of multiple instructions, increasing hardware utilization (p. 377).
3. Recap: Can pipelining get us into trouble?
- Yes: pipeline hazards
- Structural hazards: an attempt to use the same resource in two different ways at the same time
  - e.g., multiple memory accesses, multiple register writes
  - solutions: multiple memories (separate instruction and data memory), stretch the pipeline
- Control hazards: an attempt to make a decision before the condition is evaluated
  - e.g., any conditional branch
  - solutions: prediction, delayed branch
- Data hazards: an attempt to use an item before it is ready
  - e.g., add r1,r2,r3 followed by sub r4,r1,r5; lw r6,0(r7) followed by or r8,r6,r9
  - solutions: forwarding/bypassing, stall/bubble
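The data-hazard examples in the recap can be checked mechanically. The sketch below is only an illustration, not part of the course material: the tuple encoding, the function name raw_hazards and the window parameter are assumptions made for this example. It reports every instruction that reads a register written by one of the instructions just ahead of it.

```python
# Each instruction is (text, destination register, source registers).
# The entries transcribe the two example pairs from the recap slide.
program = [
    ("add r1,r2,r3", "r1", ("r2", "r3")),
    ("sub r4,r1,r5", "r4", ("r1", "r5")),
    ("lw r6,0(r7)", "r6", ("r7",)),
    ("or r8,r6,r9", "r8", ("r6", "r9")),
]


def raw_hazards(instrs, window=2):
    """Return (producer, consumer, register) for every read-after-write
    dependency whose consumer follows within `window` instructions.
    The window size is an assumption of this sketch."""
    found = []
    for i, (producer, dest, _) in enumerate(instrs):
        for consumer, _, sources in instrs[i + 1 : i + 1 + window]:
            if dest in sources:
                found.append((producer, consumer, dest))
    return found


for producer, consumer, reg in raw_hazards(program):
    print(f"'{consumer}' reads {reg}, which '{producer}' has not written back yet")
```

Both pairs from the slide are reported; whether each one is resolved by forwarding or needs a stall depends on the producer (an ALU result can be forwarded, a loaded value arrives one stage later).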
4. Review: Single-Cycle Datapath
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, shift left 2, ALU, data memory, adders and muxes). Control signals shown: Branch, RegWrite, MemWrite, MemRead, ALUctl, RegDst, ALUSrc, MemtoReg.]
5. Review: Multi-cycle vs. Single-cycle Processor Datapath
Combine adders: add 1½ muxes and 3 temporary registers (A, B, ALUOut)
Combine memories: add 1 mux and 2 temporary registers (IR, MDR)
[Figure: multi-cycle datapath. Control signals shown: IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA. Registers shown: PC, instruction register, memory data register, A, B, ALUOut.]
Single-cycle: 1 ALU, 2 memories, 4 muxes, 2 adders, opcode decoders
Multi-cycle: 1 ALU, 1 memory, 5½ muxes, 5 registers (IR, A, B, MDR, ALUOut), FSM
6. Multi-cycle Processor Datapath
Single-cycle: 1 ALU, 2 memories, 4 muxes, 2 adders, opcode decoders
Multi-cycle: 1 ALU, 1 memory, 5½ muxes, 5 registers (IR, A, B, MDR, ALUOut), FSM
[Figure: multi-cycle datapath. Control signals shown: IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA. Registers shown: PC, instruction register, memory data register, A, B, ALUOut.]
5 x 32 = 160 additional flip-flops (FFs) for the multi-cycle processor over the single-cycle processor
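7. Figure 6.25
PC: 32 bits
Datapath registers (bits per field)
- IF/ID: PC 32, IR 32
- ID/EX: PC 32, A 32, B 32, Si 32, RT 5, RD 5
- EX/MEM: PC 32, Z 1, ALUOut 32, B 32, D 5
- MEM/WB: MDR 32, ALUOut 32, D 5
Control registers (bits per field)
- ID/EX: W 2, M 3, EX 4
- EX/MEM: W 2, M 3
- MEM/WB: W 2
Datapath: 373 FFs in the pipeline registers, of which 160 FFs match the multi-cycle registers, leaving 213 additional FFs
Control: 16 FFs
213 + 16 = 229 additional FFs for the pipeline over the multi-cycle processor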
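The flip-flop totals for Figure 6.25 can be re-added from the individual field widths. This is a rough sketch: the widths are the ones listed on the slide, while the assignment of each field to a particular pipeline register and all variable names are assumptions made for the example (the totals do not depend on that assignment).

```python
# Field widths in bits, as listed for Figure 6.25.
datapath = {
    "IF/ID": {"PC": 32, "IR": 32},
    "ID/EX": {"PC": 32, "A": 32, "B": 32, "Si": 32, "RT": 5, "RD": 5},
    "EX/MEM": {"PC": 32, "Z": 1, "ALUOut": 32, "B": 32, "D": 5},
    "MEM/WB": {"MDR": 32, "ALUOut": 32, "D": 5},
}
control = {
    "ID/EX": {"W": 2, "M": 3, "EX": 4},
    "EX/MEM": {"W": 2, "M": 3},
    "MEM/WB": {"W": 2},
}
MULTI_CYCLE_FFS = 5 * 32  # IR, A, B, MDR, ALUOut in the multi-cycle design


def total_bits(registers):
    """Sum the widths of every field in every pipeline register."""
    return sum(sum(fields.values()) for fields in registers.values())


datapath_ffs = total_bits(datapath)
control_ffs = total_bits(control)
extra_ffs = (datapath_ffs - MULTI_CYCLE_FFS) + control_ffs
print(datapath_ffs, control_ffs, extra_ffs)  # 373 16 229
```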
8. Overhead
Trade-off: chip area versus speed
Single-cycle model (non-overlapping): 8 ns clock (125 MHz)
- 1 ALU, 2 adders, 0 muxes
- 0 datapath register bits (flip-flops)
Multi-cycle model (non-overlapping): 2 ns clock (500 MHz)
- 1 ALU, controller, 5 muxes
- 160 datapath register bits (flip-flops)
Pipeline model (overlapping): 2 ns clock (500 MHz)
- 2 ALUs, controller, 4 muxes
- 373 datapath + 16 controlpath register bits (flip-flops)
9. Pipeline Control: Controlpath Register Bits
ID/EX: 9 control bits
EX/MEM: 5 control bits
MEM/WB: 2 control bits
Figure 6.29
10. Pipeline Control: Controlpath Table
Figure 5.20, single cycle

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

Figure 6.28, the same lines grouped by pipeline register

             ID/EX control lines             EX/MEM control lines       MEM/WB control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X
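The grouping in Figure 6.28 also explains the 9, 5 and 2 control bits carried by the pipeline registers: each register only has to hold the lines for the stages an instruction has not reached yet. A minimal sketch of that count follows; the dictionary and function names are assumptions made for this example.

```python
# Control lines grouped by the stage that consumes them (Figure 6.28 grouping).
STAGE_LINES = {
    "EX": ["RegDst", "ALUOp1", "ALUOp0", "ALUSrc"],
    "MEM": ["Branch", "MemRead", "MemWrite"],
    "WB": ["RegWrite", "MemtoReg"],
}

# Stages still ahead of an instruction once it sits in each pipeline register.
REMAINING_STAGES = {
    "ID/EX": ["EX", "MEM", "WB"],
    "EX/MEM": ["MEM", "WB"],
    "MEM/WB": ["WB"],
}


def control_bits(pipeline_register):
    """Number of control bits the given pipeline register must carry."""
    stages = REMAINING_STAGES[pipeline_register]
    return sum(len(STAGE_LINES[stage]) for stage in stages)


for name in REMAINING_STAGES:
    print(name, control_bits(name))
```

The three counts add up to the 16 controlpath flip-flops listed under Overhead.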
11. Pipeline Hazards
Pipeline hazards
- Solution 1 always works (for non-real-time applications): stall, delay, procrastinate!
Structural hazards (e.g., fetching from the same memory bank)
- Solution 2: partition the architecture
Control hazards (e.g., branching)
- Solution 1: stall! But this decreases throughput
- Solution 2: guess and back-track
- Solution 3: delayed decision (delayed branch, fill the slot)
Data hazards (e.g., register dependencies)
- Worst-case situation
- Solution 2: re-order instructions
- Solution 3: forwarding or bypassing, delayed load
12. Pipeline Datapath and Controlpath
Figure 6.30
13. Load instruction
Figure 6.30
14. Load instruction
Figure 6.30
15. Pipeline single stepping
Contents of registers: C1 = 3, C2 = 4, C3 = 4, C4 = 6, C5 = 7, C10 = 8; Memory[23] = 9
Formats: add $rd, $rs (A), $rt (B); lw $rt (B), @($rs (A))
Pipeline register fields:
  IF/ID  = <PC, IR>
  ID/EX  = <PC, A, B, S, Rt, Rd>
  EX/MEM = <PC, Z, ALU, B, R>
  MEM/WB = <MDR, ALU, R>

Clock  IF/ID                ID/EX                    EX/MEM                          MEM/WB
0      <0,?>                <?,?,?,?,?,?>            <?,?,?,?,?>                     <?,?,?>
1      <4,lw $10,20($1)>    <0,?,?,?,?,?>            <?,?,?,?,?>                     <?,?,?>
2      <8,sub $11,$2,$3>    <4,C1=3,C10=8,20,10,0>   <0,?,?,?,?>                     <?,?,?>
3      <12,and $12,$4,$5>   <8,C2=4,C3=4,X,3,11>     <4+(20<<2)=84,0,20+3=23,8,10>   <?,?,?>
4      <16,or $13,$6,$7>    <12,C4=6,C5=7,X,5,12>    <X,1,4-4=0,4,11>                <Mem[23]=9,23,10>
5      <20,add $14,$8,$9>   <16,C6,C7,X,7,13>        <X,0,1,7,12>                    <X,0,11>
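The EX/MEM entry written at clock 3 for the lw instruction can be recomputed from the ID/EX fields of clock 2. The following is a small worked sketch, with variable names chosen for this example only; the values are the ones given in the trace.

```python
# ID/EX fields for lw $10,20($1) at clock 2, taken from the trace.
pc = 4        # PC field carried along with the instruction
a = 3         # A = C1, contents of register 1
offset = 20   # S, the sign-extended immediate
memory = {23: 9}  # Memory[23] = 9

# The shift must be applied before the addition, hence the parentheses.
branch_target = pc + (offset << 2)  # adder output, computed for every instruction
address = offset + a                # ALU output: effective address of the load
zero = int(address == 0)            # Z flag
mdr = memory[address]               # value read in the MEM stage at clock 4

print(branch_target, zero, address, mdr)  # 84 0 23 9
```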
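16. Clock 1 (Figure 6.31a)
Figure annotations: PC = 4; PC = 0; IR = lw $10,20($1)
17. Clock 2 (Figure 6.31b)
Figure annotations: PC = 4; A = C1; B = X; PC = 4; S = 20; T = 10; D = 0
18. Clock 3 (Figure 6.32a)
Figure annotations: PC = 4 + (20<<2); PC = 8; ALU = 20 + C1; D = 10
19. Clock 4 (Figure 6.32b)
Figure annotations: PC = 20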
20. Data Dependencies that can be resolved by forwarding
Figure 6.36
21. Data Hazards: arithmetic
Figure 6.37
22. Data Dependencies: no forwarding
sub $2,$1,$3
and $12,$2,$5
Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks, so CPI = 3
MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS
23. Data Dependencies: no forwarding
A dependent instruction takes 1 + 2 stalls = 3 clocks
An independent instruction takes 1 + 0 stalls = 1 clock
Suppose the instructions are dependent 10% of the time?
Average instruction time = 10% * 3 + 90% * 1 = 0.10 * 3 + 0.90 * 1 = 1.2 clocks
MIPS = Clock / CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency)
MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS (100% dependency)
MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS (0% dependency)
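The three MIPS figures follow from one formula with a different dependency fraction plugged in. A minimal sketch, assuming a fixed penalty of 2 stall clocks per dependent instruction as on the slide; the function and parameter names are made up for this example.

```python
def average_cpi(dependent_fraction, stall_clocks=2):
    """Average clocks per instruction when a dependent instruction pays
    `stall_clocks` extra clocks and an independent one pays none."""
    dependent = dependent_fraction * (1 + stall_clocks)
    independent = (1 - dependent_fraction) * 1
    return dependent + independent


def mips(clock_mhz, cpi):
    """Millions of instructions per second = clock rate in MHz / CPI."""
    return clock_mhz / cpi


for fraction in (0.0, 0.10, 1.0):
    cpi = average_cpi(fraction)
    print(f"{fraction:.0%} dependent: CPI = {cpi:.1f}, {mips(500, cpi):.0f} MIPS")
```

Setting stall_clocks to 0 models the forwarding case, where every instruction takes 1 clock regardless of the dependency fraction.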
24. Data Dependencies: with forwarding
sub $2,$1,$3
and $12,$2,$5
Detected data hazard 1a: ID/EX.rs = EX/MEM.rd
Suppose every instruction is dependent: 1 + 0 stalls = 1 clock, so CPI = 1
MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS
25. Data Dependencies: Hazard Conditions
A data hazard condition occurs whenever a source operand needs a result that is not yet available because an earlier instruction has not yet written its destination.
Data hazard detection always compares a destination with a source.
26. Data Dependencies: Hazard Conditions
27. Data Dependencies: Worst case
Data hazard:
sub $2, $1, $3     (sub rd, rs, rt)
and $12, $2, $2    (and rd, rs, rt)
or  $13, $2, $2    (or rd, rs, rt)
28. Data Dependencies: Hazard Conditions
Hazard type  Source    Destination
1a.          ID/EX.rs  EX/MEM.rdest
1b.          ID/EX.rt  EX/MEM.rdest

Pipeline registers
ID/EX: rs, rt, rd
EX/MEM: rd
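The two comparisons can be written out as a small function. This is a sketch under stated assumptions, not the course's forwarding unit: pipeline registers are modelled as plain dictionaries, the function name is made up, and the RegWrite and rd != 0 guards are taken from the usual textbook formulation rather than from the slide, which lists only the comparisons.

```python
def forwarding_hazards(id_ex, ex_mem):
    """Return the hazard types ('1a', '1b') detected between the
    instruction in ID/EX (source) and the one in EX/MEM (destination)."""
    hazards = []
    # Assumed guards: the earlier instruction really writes a register,
    # and that register is not register 0.
    writes_result = ex_mem.get("RegWrite", True) and ex_mem["rd"] != 0
    if writes_result and ex_mem["rd"] == id_ex["rs"]:
        hazards.append("1a")
    if writes_result and ex_mem["rd"] == id_ex["rt"]:
        hazards.append("1b")
    return hazards


# sub $2,$1,$3 has reached EX/MEM while and $12,$2,$5 sits in ID/EX.
ex_mem = {"rd": 2, "RegWrite": True}
id_ex = {"rs": 2, "rt": 5, "rd": 12}
print(forwarding_hazards(id_ex, ex_mem))  # ['1a']
```

For the worst-case instruction and $12,$2,$2 both source fields match, so the same call reports both 1a and 1b.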
29. Figure 6.38
30. Data Hazards: Loads
Figure 6.44
31. Data Hazards: load stalling
Figure 6.45
32. Data Hazards: Hazard detection unit (page 490)
Stall condition
Source    Destination
IF/ID.rs  ID/EX.rt and ID/EX.MemRead = 1
IF/ID.rt  ID/EX.rt and ID/EX.MemRead = 1
No-stall example (only need to look at the next instruction):
lw  $2, 20($1)    (lw rt, addr(rs))
and $4, $1, $5    (and rd, rs, rt)
or  $8, $2, $6    (or rd, rs, rt)
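The stall condition translates directly into a boolean test. A minimal sketch, assuming pipeline registers are modelled as dictionaries; the function name must_stall is hypothetical, and the second call shows a reordered case that is not on the slide.

```python
def must_stall(if_id, id_ex):
    """True when the instruction being decoded reads the register that a
    load still in the EX stage is about to fetch (load-use hazard)."""
    return bool(id_ex["MemRead"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])


lw_in_ex = {"MemRead": 1, "rt": 2}   # lw $2, 20($1)
and_in_id = {"rs": 1, "rt": 5}       # and $4, $1, $5
or_in_id = {"rs": 2, "rt": 6}        # or $8, $2, $6

# The slide's order: the and follows the load and does not read $2.
print(must_stall(and_in_id, lw_in_ex))  # False

# Hypothetical reordering: if the or came right after the load, it would stall.
print(must_stall(or_in_id, lw_in_ex))   # True
```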
33. Data Hazards: Hazard detection unit (page 490)
No-stall example (only need to look at the next instruction):
lw  $2, 20($1)    (lw rt, addr(rs))
and $4, $1, $5    (and rd, rs, rt)
or  $8, $2, $6    (or rd, rs, rt)
Example (load): assume half of the load instructions are immediately followed by an instruction that uses the loaded value.
What is the average number of clocks for the load?
load instruction time = 50% * (1 clock) + 50% * (2 clocks) = 1.5 clocks
34. Hazard Detection Unit: when to stall
Figure 6.46
35. Data Dependency Units
36. Data Dependency Units
[Figure labels: pipeline registers IF/ID and ID/EX, each with fields rs, rt, rd; forwarding comparisons; stalling comparisons.]
37. Branch Hazards: Solution 1, stall until the decision is made (Fig. 6.4)
Decision made in the ID stage: do the load
Stall
38. Branch Hazards: Solution 2, predict until the decision is made
Clock           1   2   3   4   5   6   7   8
beq $1,$3,7     IF  ID  EX  M   WB
and $12,$2,$5       IF  ID  EX  M   WB
lw $4,50($7)            IF  ID  EX  M   WB
Predict that the branch is false (not taken)
Decision made in the ID stage: discard the and $12,$2,$5 instruction, then branch
39. Branch Hazards: Solution 3, delayed decision
Clock           1   2   3   4   5   6   7   8
beq $1,$3,7     IF  ID  EX  M   WB
add $4,$6,$6        IF  ID  EX  M   WB
lw $4,50($7)            IF  ID  EX  M   WB
Move an instruction from before the branch into the slot after it
Do not need to discard the instruction
Decision made in the ID stage: branch
40. Branch Hazards: Solution 3, delayed decision
Clock           1   2   3   4   5   6   7   8
beq $1,$3,7     IF  ID  EX  M   WB
and $12,$2,$5       IF  ID  EX  M   WB
lw $4,50($7)            IF  ID  EX  M   WB
Decision made in the ID stage: do the branch
41. Branch Hazards: decision made in the ID stage (Figure 6.4)
Clock           1   2   3   4   5   6   7   8
beq $1,$3,7     IF  ID  EX  M   WB
nop                 IF  ID  EX  M   WB
lw $4,50($7)            IF  ID  EX  M   WB
No decision yet: insert a nop
Decision: do the load
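The timing of these branch charts, with one instruction entering IF per clock and each passing through IF, ID, EX, M and WB, can be generated rather than drawn by hand. The helper below is an illustrative sketch only; the name pipeline_chart and its layout choices are assumptions, and it does not model stalls or flushes.

```python
STAGES = ["IF", "ID", "EX", "M", "WB"]


def pipeline_chart(instructions, clocks=8):
    """Return text rows for an ideal pipeline: instruction k enters IF at
    clock k + 1 and advances one stage per clock."""
    width = max(len(text) for text in instructions) + 2
    header = "Clock".ljust(width) + "".join(f"{c:<4}" for c in range(1, clocks + 1))
    rows = [header]
    for start, text in enumerate(instructions):
        cells = [""] * clocks
        for offset, stage in enumerate(STAGES):
            if start + offset < clocks:
                cells[start + offset] = stage
        rows.append(text.ljust(width) + "".join(f"{cell:<4}" for cell in cells))
    return [row.rstrip() for row in rows]


chart = pipeline_chart(["beq $1,$3,7", "nop", "lw $4, 50($7)"])
for row in chart:
    print(row)
```

Replacing the nop with and $12,$2,$5 or add $4,$6,$6 gives the layout of the prediction and delayed-branch cases; only the annotation of what happens to the second instruction differs.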
42. Branch Hazards: Solution 2, predict until the decision is made
Branch decision made in the MEM stage: discard values when the prediction is wrong
Predict that the branch is false (not taken)
Same effect as 3 stalls
Figure 6.50
43. Figure 6.51
Early branch comparison
Flush if the prediction is wrong; add nops
44. Performance
Load: assume half of the load instructions are immediately followed by an instruction that uses the loaded value (i.e., a data dependency).
  load instruction time = 50% * (1 clock) + 50% * (2 clocks) = 1.5 clocks
Jump: assume that jumps always pay 1 full clock cycle of delay (stall).
  jump instruction time = 2 clocks
Branch: the branch delay on a misprediction is 1 clock cycle, and 25% of the branches are mispredicted.
  branch time = 75% * (1 clock) + 25% * (2 clocks) = 1.25 clocks
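All three instruction times follow the same pattern: a base of 1 clock plus a 1-clock penalty paid by some fraction of the instructions. A minimal sketch of that calculation, with a function name chosen for this example:

```python
def expected_clocks(base_clocks, penalty_clocks, penalty_fraction):
    """Average clocks when a fraction of the instructions pays a penalty."""
    return base_clocks + penalty_fraction * penalty_clocks


load_time = expected_clocks(1, 1, 0.50)    # half of the loads stall one clock
jump_time = expected_clocks(1, 1, 1.00)    # every jump stalls one clock
branch_time = expected_clocks(1, 1, 0.25)  # a quarter of the branches mispredict

print(load_time, jump_time, branch_time)  # 1.5 2.0 1.25
```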
45. Performance, page 504
Figure notes: "Pipeline throughput"; "Also known as the instruction latency within a pipeline"

Instruction  Pipeline cycles         Instruction mix  Single-cycle  Multi-cycle clocks
loads        1.5 (50% dependency)    23%              1             5
stores       1                       13%              1             4
arithmetic   1                       43%              1             4
branches     1.25 (25% mispredicted) 19%              1             3
jumps        2                       2%               1             3

                              Pipeline        Single-cycle    Multi-cycle
Clock speed                   500 MHz (2 ns)  125 MHz (8 ns)  500 MHz (2 ns)
CPI = sum of (Cycles * Mix)   1.18            1               4.02
MIPS = Clock / CPI            424 MIPS        125 MIPS        124 MIPS

load instruction time = 50% * (1 clock) + 50% * (2 clocks) = 1.5 clocks
branch time = 75% * (1 clock) + 25% * (2 clocks) = 1.25 clocks
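The CPI row is a weighted sum of the cycle counts by the instruction mix, and the MIPS row divides the clock rate by that CPI. The sketch below recomputes both from the table; the dictionary layout and names are assumptions made for the example. Note that the pipeline MIPS value depends on rounding: 424 corresponds to the CPI rounded to 1.18, while the unrounded 1.1825 gives a value closer to 423.

```python
# instruction class: (mix fraction, pipeline cycles, multi-cycle clocks)
MIX = {
    "loads": (0.23, 1.5, 5),
    "stores": (0.13, 1.0, 4),
    "arithmetic": (0.43, 1.0, 4),
    "branches": (0.19, 1.25, 3),
    "jumps": (0.02, 2.0, 3),
}
PIPELINE, MULTI_CYCLE = 1, 2  # column indexes into the tuples above


def weighted_cpi(column):
    """CPI as the sum over instruction classes of mix * cycles."""
    return sum(row[0] * row[column] for row in MIX.values())


pipeline_cpi = weighted_cpi(PIPELINE)
multi_cycle_cpi = weighted_cpi(MULTI_CYCLE)

print(f"pipeline:     CPI = {pipeline_cpi:.4f}, {500 / pipeline_cpi:.1f} MIPS")
print(f"multi-cycle:  CPI = {multi_cycle_cpi:.2f}, {500 / multi_cycle_cpi:.1f} MIPS")
print(f"single-cycle: CPI = 1, {125 / 1:.1f} MIPS")
```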