Title: Pipelining Datapath
1Pipelining Datapath
- Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley) and Hank Walker (TAMU)
2Pipelining is Natural!
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
3Sequential Laundry
[Figure: timeline from 6 PM to midnight in task order; the four loads run back to back, each taking 30, 40, and 20 minutes]
- Sequential laundry takes 6 hours for 4 loads
4Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight in task order; the four loads overlap]
- Pipelined laundry takes 3.5 hours for 4 loads
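The two totals can be checked with a short sketch. It assumes all four loads are ready at 6 PM and that a load may wait between machines; the function names are illustrative and not part of the original slides.

```python
# Stage durations in minutes, taken from the laundry example.
STAGES = [30, 40, 20]  # washer, dryer, folder

def sequential_time(stages, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_time(stages, loads):
    """Each stage handles one load at a time; a load enters a stage as
    soon as both the load and the stage are free."""
    stage_free = [0] * len(stages)  # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(t, stage_free[s])
            t = start + duration
            stage_free[s] = t
        finish = t
    return finish

print(sequential_time(STAGES, 4) / 60)  # 6.0 hours
print(pipelined_time(STAGES, 4) / 60)   # 3.5 hours
```

The pipelined total works out to 30 + 4 x 40 + 20 = 210 minutes, because the 40-minute dryer sets the pace once the pipeline is full.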
5Pipelining Lessons
- Latency vs. Throughput
- Question
- What is the latency in both cases?
- What is the throughput in both cases?
Pipelining doesn't help latency of single task, it helps throughput of entire workload
6Pipelining Lessons contd
- Question
- What is the fastest operation in the example?
- What is the slowest operation in the example?
Pipeline rate limited by slowest pipeline stage
7Pipelining Lessons contd
Multiple tasks operating simultaneously using
different resources
8Pipelining Lessons contd
- Question
- Would the speedup increase if we had more steps?
Potential speedup = number of pipe stages
9Pipelining Lessons contd
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
- Question
- Would it affect the speedup if the Folder also took 40 minutes?
Unbalanced lengths of pipe stages reduce speedup
10Pipelining Lessons contd
Time to fill pipeline and time to drain it
reduces speedup
11Five Stages of an Instruction
[Figure: a Load instruction occupying Cycle 1 through Cycle 5, one stage per cycle]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
12Conventional Pipelined Execution Representation
[Figure: overlapped instructions, with time on the horizontal axis and program flow on the vertical axis]
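A minimal sketch of how such a time versus program-flow table can be generated, assuming one instruction enters the pipeline per cycle and no stalls occur. The stage names follow the five-stage slide; the instruction names and helper functions are hypothetical.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def pipeline_diagram(instructions, stages=STAGES):
    """Return one (name, row) pair per instruction. Each row has one cell
    per clock cycle; instruction k (0-based) enters the pipeline in cycle k + 1."""
    total_cycles = len(instructions) + len(stages) - 1
    rows = []
    for k, name in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(stages):
            row[k + s] = stage
        rows.append((name, row))
    return rows

def render(rows, width=8):
    """Format the rows as a fixed-width text table with a cycle header."""
    n_cycles = len(rows[0][1])
    header = " " * 10 + "".join(f"C{c + 1}".ljust(width) for c in range(n_cycles))
    lines = [header]
    for name, row in rows:
        lines.append(name.ljust(10) + "".join(cell.ljust(width) for cell in row))
    return "\n".join(lines)

rows = pipeline_diagram(["lw", "add", "sub"])
print(render(rows))
```

Three instructions through five stages fill 3 + 5 - 1 = 7 cycles, which is where the fill and drain overhead comes from.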
13Example
14Example contd
- $\text{Time}_{\text{pipeline}} = \text{Time}_{\text{non-pipeline}} / \text{Pipe stages}$
- Assumptions
- Stages are perfectly balanced
- Ideal conditions
- Ideally, time per instruction is $8/5 = 1.6$ ns, i.e. a speedup of 5
- Most cases are not ideal!
15Example contd
- Speedup in this case $= 24/14 \approx 1.7$
- Let's add 1000 more instructions
- Time (non-pipelined) $= 1000 \times 8 + 24$ ns $= 8024$ ns
- Time (pipelined) $= 1000 \times 2 + 14$ ns $= 2014$ ns
- Speedup $= 8024 / 2014 \approx 3.98 \approx 4 = 8/2$
Instruction throughput is the important metric (as opposed to individual instruction latency), because real programs execute billions of instructions
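The arithmetic can be reproduced with a small sketch. It assumes 8 ns per non-pipelined instruction, a 2 ns pipelined clock, and five stages, which is consistent with the 24 ns and 14 ns figures for three instructions; the function names are illustrative. Counting the original three instructions as well, the non-pipelined total is 1000 x 8 + 24 = 8024 ns, which is what gives the 3.98 ratio.

```python
def nonpipelined_ns(n, instr_ns=8):
    """Every instruction runs start to finish before the next begins."""
    return n * instr_ns

def pipelined_ns(n, stages=5, cycle_ns=2):
    """The first instruction needs `stages` cycles; each further
    instruction completes one cycle after its predecessor."""
    return (stages + n - 1) * cycle_ns

for n in (3, 1003):
    speedup = nonpipelined_ns(n) / pipelined_ns(n)
    print(n, nonpipelined_ns(n), pipelined_ns(n), round(speedup, 2))
```

As the instruction count grows, the fill and drain cycles matter less and the speedup approaches the ratio of the two per-instruction times, 8/2 = 4.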
16Pipeline Hazards
Program Flow
17Pipeline Hazard contd
- Control Hazard
- Example
- add $4, $5, $6
- beq $1, $2, 40
- lw $3, 300($0)
18Pipeline Hazard contd
- Data Hazards
- Example
- add $s0, $t0, $t1
- sub $t2, $s0, $t3
19Summary: Pipelining Lessons
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Pipeline rate limited by slowest pipeline stage
- Multiple tasks operating simultaneously using different resources
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill pipeline and time to drain it reduce speedup
- Stall for dependences
[Figure: pipelined laundry timeline from 6 PM to 9 PM, in task order]
20 Summary of Pipeline Hazards
- Structural Hazards
- Hardware design
- Control Hazard
- Decision based on results
- Data Hazard
- Data Dependency
21Control Signals for existing Datapath
The Right to Left Control can lead to hazards
22Place registers between each step
23Example
10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, 15
24Start Fetch 10
[Figure: pipelined datapath snapshot; the instruction at address 10 is being fetched and all later stages are empty]
25Fetch 14, Decode 10
[Figure: pipelined datapath snapshot; IR holds lw r1, r2(35), and the remaining stages are empty]
26Fetch 20, Decode 14, Exec 10
[Figure: pipelined datapath snapshot; IR holds addI r2, r2, 3; lw r1 is in Exec with operands r2 and 35]
27Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: pipelined datapath snapshot; IR holds sub r3, r4, r5; addI r2, r2, 3 is in Exec; lw r1 is in Mem with address r2+35]
28Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: pipelined datapath snapshot; IR holds beq r6, r7, 100; sub r3 is in Exec with operand r4; addI r2 is in Mem with result r2+3; lw r1 is in WB with M[r2+35]]
29Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: pipelined datapath snapshot; IR holds ori r8, r9, 17; beq is in Exec with operand r6 and target 100; sub r3 is in Mem with result r4-r5; addI r2 is in WB with result r2+3; r1 = M[r2+35] has been written]
30Pipelining Load Instruction
[Figure: three successive lw instructions overlapped across Cycle 1 through Cycle 7 of the clock]
- The five independent functional units in the pipeline datapath are:
- Instruction Memory for the Ifetch stage
- Register File's Read ports (bus A and bus B) for the Reg/Dec stage
- ALU for the Exec stage
- Data Memory for the Mem stage
- Register File's Write port (bus W) for the Wr stage
31Pipelining the R Instruction
[Figure: an R-type instruction occupying Cycle 1 through Cycle 4, one stage per cycle]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec
- ALU operates on the two register operands
- Update PC
- Wr: Write the ALU output back to the register file
32Pipelining Both Load and R-type
[Figure: R-type, R-type, Load, R-type, R-type overlapped across Cycle 1 through Cycle 9. Oops! We have a problem!]
- We have a pipeline conflict or structural hazard
- Two instructions try to write to the register file at the same time!
- Only one write port
33Important Observations
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions
- Load uses Register File's Write Port during its 5th stage
- R-type uses Register File's Write Port during its 4th stage
34Solution
- Delay R-type's register write by one cycle
- Now R-type instructions also use Reg File's write port at Stage 5
- Mem stage is a NOOP stage: nothing is being done.
[Figure: R-type now takes five stages (1 to 5), with a Mem stage after Exec; R-type, R-type, Load, R-type, R-type overlap across Cycle 1 through Cycle 9]
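The write-port conflict and its fix can be checked with a small sketch. It assumes one instruction is fetched per cycle, so instruction k (0-based) reaches stage s in cycle k + s; the function names and the dictionary of write stages are illustrative.

```python
def write_cycles(sequence, write_stage):
    """Map each clock cycle to the instructions that use the register
    file's write port in that cycle."""
    usage = {}
    for k, kind in enumerate(sequence):
        cycle = k + write_stage[kind]  # fetched in cycle k + 1, stage s in cycle k + s
        usage.setdefault(cycle, []).append((k, kind))
    return usage

def conflicts(sequence, write_stage):
    """Keep only the cycles in which more than one instruction writes."""
    return {cycle: users
            for cycle, users in write_cycles(sequence, write_stage).items()
            if len(users) > 1}

sequence = ["R-type", "R-type", "Load", "R-type", "R-type"]

before = conflicts(sequence, {"R-type": 4, "Load": 5})  # R-type writes in stage 4
after = conflicts(sequence, {"R-type": 5, "Load": 5})   # R-type write delayed to stage 5
print(before)  # {7: [(2, 'Load'), (3, 'R-type')]}
print(after)   # {}
```

With the four-stage R-type, the Load and the R-type behind it both reach the write port in cycle 7; once every instruction writes in stage 5, the writes fall in cycles 5 through 9, one per cycle.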
35Datapath (Without Pipeline)
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
If Cond: PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
R[rd] <- S
R[rd] <- M
R[rt] <- S
[Figure: datapath with PC, Next PC, Inst. Mem, IR, Reg. File, Exec, Equal, Mem Access, Data Mem]
36Datapath (With Pipeline)
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
If Cond: PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
M <- S
M <- S
R[rd] <- M
R[rd] <- M
R[rt] <- M
[Figure: pipelined datapath with PC, Next PC, Inst. Mem, IR, Reg. File, Exec, Equal, S, Mem Access, Data Mem]
37Structural Hazard and Solution
[Figure: pipeline diagram over time (clock cycles) in instruction order: Load, Instr 1, Instr 2, Instr 3, Instr 4, showing Mem and Reg usage in each stage]
38Control Hazard - 1 Stall
- Stall: wait until decision is clear
- Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
39Control Hazard - 2 Predict
- Predict: guess one direction, then back up if wrong
- Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time)
- More dynamic scheme: history of 1 branch
40Control Hazard - 3 Delayed Branch
- Delayed Branch: redefine branch behavior (takes place after next instruction)
- Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the slot (about 50% of time)
41Data Hazards (RAW)
- Dependencies backwards in time are hazards
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB, for the instruction sequence below]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
42Data Hazards contd
- Forward result from one stage to another
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB, showing forwarding for the instruction sequence below]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
43Data Hazards contd
- Dependencies backwards in time are hazards
- Can't solve with forwarding
- Must delay/stall instruction dependent on loads
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB; a stall is inserted between the two instructions below]
lw r1,0(r2)
sub r4,r1,r3
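A minimal sketch of load-use detection, assuming a five-stage pipeline with forwarding in which a load's data is only available after its Mem stage, so an instruction that immediately follows the load and reads its destination needs one bubble. The `Instr` record and the function name are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")

def load_use_stalls(program):
    """Count one stall for every instruction that reads the register
    written by the load immediately before it."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in curr.srcs:
            stalls += 1
    return stalls

program = [
    Instr("lw", "r1", ("r2",)),        # lw  r1,0(r2)
    Instr("sub", "r4", ("r1", "r3")),  # sub r4,r1,r3
]
print(load_use_stalls(program))  # 1
```

If an independent instruction can be scheduled between the load and its consumer, the same check reports no stall.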
44Hazard Detection
[Figure: stage timelines of overlapping instructions illustrating each hazard type]
- Structural Hazard
I-Fetch DCD MemOpFetch OpFetch Exec Store
IFetch DCD
- Control Hazard
I-Fetch DCD OpFetch Jump
IFetch DCD
- RAW (read after write) Data Hazard
IF DCD EX Mem WB
IF DCD EX Mem WB
- WAW (write after write) Data Hazard
IF DCD EX Mem WB
IF DCD OF Ex Mem
- WAR (write after read) Data Hazard
IF DCD OF Ex RS
45Hazard Detection
- Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
- A RAW hazard exists on register $\rho$ if $\rho \in \text{Rregs}(i) \cap \text{Wregs}(j)$
- A WAW hazard exists on register $\rho$ if $\rho \in \text{Wregs}(i) \cap \text{Wregs}(j)$
- A WAR hazard exists on register $\rho$ if $\rho \in \text{Wregs}(i) \cap \text{Rregs}(j)$
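These three conditions are plain set intersections over the registers each instruction reads (Rregs) and writes (Wregs). A minimal sketch, with illustrative function and parameter names:

```python
def hazards(i_reads, i_writes, j_reads, j_writes):
    """Classify hazards between instruction i (about to issue) and an
    earlier instruction j still in the pipeline, using register sets."""
    return {
        "RAW": set(i_reads) & set(j_writes),   # i reads what j writes
        "WAW": set(i_writes) & set(j_writes),  # both write the same register
        "WAR": set(i_writes) & set(j_reads),   # i writes what j reads
    }

# j: add r1,r2,r3   (reads r2, r3; writes r1)
# i: sub r4,r1,r3   (reads r1, r3; writes r4)
result = hazards(i_reads={"r1", "r3"}, i_writes={"r4"},
                 j_reads={"r2", "r3"}, j_writes={"r1"})
print(result)
```

For this pair only the RAW set is non-empty (it contains r1), matching the add/sub example from the data hazard slides.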
46Computing CPI
- Start with Base CPI
- Add stalls
- Suppose
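- $\text{CPI}_{\text{base}} = 1$
- $\text{freq}_{\text{branch}} = 20\%$, $\text{freq}_{\text{load}} = 30\%$
- Suppose branches always cause 1 cycle stall
- Loads cause a 2 cycle stall
- Then $\text{CPI} = 1 + (1 \times 0.20) + (2 \times 0.30) = 1.8$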
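The same calculation as a small sketch: base CPI plus, for each stall source, its frequency times its stall cycles. The function name and the list-of-pairs input format are illustrative.

```python
def effective_cpi(base_cpi, stall_sources):
    """stall_sources: iterable of (frequency, stall_cycles) pairs."""
    return base_cpi + sum(freq * cycles for freq, cycles in stall_sources)

# Branches: 20% of instructions, 1 stall cycle each.
# Loads:    30% of instructions, 2 stall cycles each.
cpi = effective_cpi(1.0, [(0.20, 1), (0.30, 2)])
print(round(cpi, 2))  # 1.8
```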
47Summary
- Control Signals need to be propagated
- Insert Registers between every stage to remember and propagate values
- Solutions to Control Hazard are Stall, Predict and Delayed Branch
- Solution to Data Hazard is Forwarding
- Effective $\text{CPI} = \text{CPI}_{\text{ideal}} + \text{CPI}_{\text{stall}}$