Title: Ch 6: Pipelining (modified from Dave Patterson's notes)
1 Ch 6: Pipelining (modified from Dave Patterson's notes)
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 30 minutes
- Folder takes 30 minutes
- Stasher takes 30 minutes to put clothes into drawers
2 Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to 2 AM; tasks A-D run one after another, sixteen 30-minute steps]
- Sequential laundry takes 8 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
3 Pipelined Laundry: Start work ASAP
[Figure: pipelined laundry timeline from 6 PM to 9:30 PM; tasks A-D overlap in different stages]
- Pipelined laundry takes 3.5 hours for 4 loads!
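The laundry timings can be checked with a short calculation. The sketch below assumes n loads and k equally long stages: sequential time is n x k x 30 minutes, while pipelined time is (k + n - 1) x 30 minutes, since after the pipeline fills, one load finishes every stage time.

```python
# Sequential vs. pipelined laundry: 4 loads, 4 stages of 30 minutes each.
STAGE_MIN = 30  # wash, dry, fold, stash each take 30 minutes
STAGES = 4
LOADS = 4

sequential = LOADS * STAGES * STAGE_MIN        # 480 minutes = 8 hours
pipelined = (STAGES + LOADS - 1) * STAGE_MIN   # 210 minutes = 3.5 hours

print(sequential / 60, "hours sequential")  # 8.0
print(pipelined / 60, "hours pipelined")    # 3.5
```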
4 Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Stall for dependences
5 Pipelining
- Improve performance by increasing instruction throughput.
- To increase throughput, minimize the duration of each individual stage.
- One natural way to minimize stage duration is to split an instruction into more stages.
- One disadvantage of more stages is branching, because it alters the otherwise sequential instruction flow.
- What do we need to add to actually split the datapath into stages?
- Answer: add storage devices (pipeline registers) between stages.
6 The Five Stages of Load
[Figure: Load occupying cycles 1-5, one stage per cycle]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
7 Conventional Pipelined Execution Representation
[Figure: instructions staggered one stage apart; time runs left to right, program flow top to bottom]
8 Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock diagrams comparing the single-cycle implementation (one long cycle sized for the slowest instruction, so shorter ones like Store waste time), the multiple-cycle implementation (Load, Store, R-type spread across cycles 1-10), and the pipelined implementation (Load, Store, R-type overlapped)]
9 Why Pipeline?
- Suppose we execute 100 instructions
- Single Cycle Machine
- 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle Machine
- 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
- 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
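The three totals follow from cycle time x total cycles; a quick check:

```python
# Execution time for 100 instructions on each implementation.
n = 100

single_cycle = 45 * 1 * n          # 45 ns/cycle, CPI = 1   -> 4500 ns
multicycle = 10 * 4.6 * n          # 10 ns/cycle, CPI = 4.6 -> 4600 ns
ideal_pipeline = 10 * (1 * n + 4)  # 10 ns/cycle, 1 CPI plus 4 drain cycles -> 1040 ns

print(single_cycle, multicycle, ideal_pipeline)
```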
10 Why Pipeline? Because the resources are there!
[Figure: Inst 0-4 overlapped across clock cycles; every stage's resources are busy every cycle]
11 Can pipelining get us into trouble?
- Yes: pipeline hazards
- Structural hazards: attempt to use the same resource two different ways at the same time
- E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
- Data hazards: attempt to use an item before it is ready
- E.g., one sock of a pair is in the dryer and one in the washer; we can't fold until the sock gets from the washer through the dryer
- An instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: attempt to make a decision before the condition is evaluated
- E.g., washing football uniforms and we need the proper detergent level; we need to see the result after the dryer before the next load goes in
- Branch instructions
- We can always resolve hazards by waiting
- Pipeline control must detect the hazard
- and take action (or delay action) to resolve the hazard
12 Single Memory is a Structural Hazard
[Figure: Load and Instrs 1-4 in the pipeline; the Load's Mem access and a later instruction's fetch hit the single memory in the same cycle]
Detection is easy in this case! (right half
highlight means read, left half write)
13 Structural Hazards limit performance
- Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
- average CPI is at least 1.3
- otherwise the memory resource is more than 100% utilized
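The bound follows from utilization: with 1.3 memory accesses per instruction and one memory port serving one access per cycle, no schedule can average fewer than 1.3 cycles per instruction.

```python
# A single shared memory port bounds CPI from below.
mem_accesses_per_inst = 1.3
accesses_per_cycle = 1  # one memory port

min_cpi = mem_accesses_per_inst / accesses_per_cycle
print(min_cpi)  # a lower CPI would need >100% utilization of the port
```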
14 Control Hazard Solutions
- Stall: wait until the decision is clear
- It's possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read
- Impact: 2 clock cycles per branch instruction, i.e. slow
[Figure: Add, Beq, Load in the pipeline; Load's fetch is stalled until the Beq resolves]
15 Control Hazard Solutions
- Predict: guess one direction, then back up if wrong
- Predict not taken
- Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ~50% of the time)
- More dynamic scheme: history of 1 branch (~90% right)
[Figure: Add, Beq, Load in the pipeline; Load is fetched speculatively right after the Beq]
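The per-branch figures above can be folded into an expected cost. A small helper, assuming 1 cycle on a correct prediction and 2 on a misprediction as on the slide:

```python
def expected_branch_cycles(accuracy, right=1, wrong=2):
    """Average clock cycles per branch for a given prediction accuracy."""
    return accuracy * right + (1 - accuracy) * wrong

# 50%-accurate "predict not taken" vs. a ~90%-accurate 1-branch history scheme
print(expected_branch_cycles(0.5))  # 1.5 cycles per branch
print(expected_branch_cycles(0.9))  # about 1.1 cycles per branch
```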
16 Control Hazard Solutions
- Redefine branch behavior (the branch takes place after the next instruction): delayed branch
- Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the slot (~50% of the time)
- As we launch more instructions per clock cycle, this becomes less useful
[Figure: Add, Beq, Misc (delay-slot instruction), Load in the pipeline]
17 Data Hazard on r1
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11
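These hazards can be found mechanically: any instruction that reads a register written by one of the two instructions immediately before it sees the stale value (with write-then-read register timing, a gap of three is safe, which is why the or and xor are fine). A sketch, with a made-up (dest, src1, src2) encoding:

```python
# Find RAW hazards: a read of a register whose writer is still in the pipeline.
# Tuples are (dest, src1, src2); this encoding is illustrative, not a real ISA.
program = [
    ("r1", "r2", "r3"),    # add r1, r2, r3
    ("r4", "r1", "r3"),    # sub r4, r1, r3
    ("r6", "r1", "r7"),    # and r6, r1, r7
    ("r8", "r1", "r9"),    # or  r8, r1, r9   (safe: reads r1 as add writes it back)
    ("r10", "r1", "r11"),  # xor r10, r1, r11 (safe)
]

hazards = []
for i, (dest, _, _) in enumerate(program):
    # only the next two instructions read the register file before write-back
    for j in range(i + 1, min(i + 3, len(program))):
        _, src1, src2 = program[j]
        if dest in (src1, src2):
            hazards.append((i, j, dest))

print(hazards)  # [(0, 1, 'r1'), (0, 2, 'r1')]: sub and and need r1 too early
```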
18 Data Hazard on r1
- Dependencies backwards in time are hazards
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for add r1,r2,r3 followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11; the dependent reads of r1 occur before add's write-back]
19 Data Hazard Solution
- Forward the result from one stage to another
- The "or r8, r1, r9" is OK anyway if register read/write within a cycle are defined properly (write first, then read)
[Figure: same pipeline diagram with add's result forwarded from its EX/MEM stages to the dependent instructions]
20 Forwarding (or Bypassing): What about Loads?
- Dependencies backwards in time are hazards
- Can't solve this one with forwarding alone
- Must delay/stall the instruction dependent on the load
[Figure: lw r1,0(r2) followed by sub r4,r1,r3; the loaded value is available only after MEM, one cycle too late for sub's EX]
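A load-use hazard can likewise be detected and patched with a bubble: if an instruction's source was written by the immediately preceding load, one stall cycle is inserted so forwarding from MEM can cover the rest. A sketch with an illustrative (op, dest, srcs) encoding:

```python
# Insert a one-cycle bubble after a load whose result is used immediately.
program = [
    ("lw", "r1", ["r2"]),         # lw  r1, 0(r2)
    ("sub", "r4", ["r1", "r3"]),  # sub r4, r1, r3  <- load-use hazard
]

scheduled = []
for k, (op, dest, srcs) in enumerate(program):
    prev = program[k - 1] if k > 0 else None
    if prev is not None and prev[0] == "lw" and prev[1] in srcs:
        scheduled.append(("bubble", None, []))  # stall one cycle, then forward
    scheduled.append((op, dest, srcs))

print([s[0] for s in scheduled])  # ['lw', 'bubble', 'sub']
```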
21 Designing a Pipelined Processor
- Go back and examine your datapath and control diagram
- Associate resources with states
- Ensure that flows do not conflict, or figure out how to resolve conflicts
- Assert control in the appropriate stage
22 Pipelined Processor (almost) for slides
- What happens if we start a new instruction every
cycle?
[Figure: pipelined datapath: Next PC/PC, Inst. Mem, Reg File, Exec, Mem Access/Data Mem, with pipeline instruction registers (IR, IRex, IRmem, IRwb), an Equal comparator, and per-stage control (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl)]
23 Control and Datapath
[Figure: the same pipelined datapath (Next PC/PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem) annotated with control signals]
24 Pipelining the Load Instruction
[Figure: three lw instructions entering the pipeline on successive cycles, occupying cycles 1-7]
- The five independent functional units in the pipeline datapath are:
- Instruction Memory for the Ifetch stage
- the Register File's read ports (bus A and bus B) for the Reg/Dec stage
- the ALU for the Exec stage
- Data Memory for the Mem stage
- the Register File's write port (bus W) for the Wr stage
25 The Four Stages of R-type
[Figure: R-type occupying cycles 1-4]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec:
- ALU operates on the two register operands
- Update PC
- Wr: Write the ALU output back to the register file
26 Pipelining the R-type and Load Instruction
[Figure: R-type, R-type, Load, R-type, R-type entering the pipeline on successive cycles, cycles 1-9]
Oops! We have a problem!
- We have a pipeline conflict, a structural hazard:
- Two instructions try to write to the register file at the same time!
- There is only one write port
27 Important Observation
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions
- Load uses the Register File's write port during its 5th stage
- R-type uses the Register File's write port during its 4th stage
- There are 2 ways to solve this pipeline hazard.
28 Solution 1: Insert a Bubble into the Pipeline
[Figure: Load, R-type, R-type, bubble, then R-type, R-type in the pipeline, cycles 1-9]
- Insert a bubble into the pipeline to prevent 2 writes in the same cycle
- The control logic can be complex.
- We lose an instruction fetch and issue opportunity.
- No instruction is started in Cycle 6!
29 Solution 2: Delay R-type's Write by One Cycle
- Delay the R-type's register write by one cycle:
- Now R-type instructions also use the Register File's write port at Stage 5
- The Mem stage becomes a NOOP stage: nothing is being done.
[Figure: R-type now has 5 stages (1-5) with Mem as a NOOP; R-type, R-type, Load, R-type, R-type all write back in stage 5 with no conflict, cycles 1-9]
30 The Four Stages of Store
[Figure: Store occupying cycles 1-4; the Wr stage is unused]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Write the data into the Data Memory
31 The Three Stages of Beq
[Figure: Beq occupying cycles 1-3; the Mem and Wr stages are unused]
- Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
- Reg/Dec:
- Registers Fetch and Instruction Decode
- Exec:
- compare the two register operands,
- select the correct branch target address,
- and latch it into the PC
32 Control Diagram
[Figure: pipelined control diagram: Next PC/PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem, with control fields]
33 Let's Try It Out
10  lw   r1, r2(35)
14  addI r2, r2, 3
20  sub  r3, r4, r5
24  beq  r6, r7, 100
30  ori  r8, r9, 17
34  add  r10, r11, r12
100 and  r13, r14, 15
(these addresses are octal)
34 Start: Fetch 10
[Figure: pipeline state: PC = 10, fetching the lw; all later stages empty]
35 Fetch 14, Decode 10
[Figure: pipeline state: PC = 14; lw r1, r2(35) in Decode]
36 Fetch 20, Decode 14, Exec 10
[Figure: pipeline state: PC = 20; addI r2, r2, 3 in Decode; lw r1 in Exec computing r2 + 35]
37 Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: pipeline state: PC = 24; sub r3, r4, r5 in Decode; addI in Exec computing r2 + 3; lw r1 in Mem accessing address r2 + 35]
38 Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: pipeline state: PC = 30; beq r6, r7, 100 in Decode; sub in Exec; addI in Mem with r2 + 3; lw writing back r1 = M[r2 + 35]]
39 Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: pipeline state: PC = 34; ori r8, r9, 17 in Decode; beq in Exec comparing r6 and r7; sub in Mem with r4 - r5; addI writing back r2 = r2 + 3]
40 Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Figure: pipeline state: the branch resolves and PC = 100; add r10, r11, r12 in Decode; ori in Exec; beq in Mem; sub writing back r3 = r4 - r5]
Oops, we should have only one delayed instruction!
41 Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Figure: pipeline state: PC = 104; and r13, r14, r15 in Decode; add r10 in Exec; ori in Mem with r9 | 17; beq completing with no register write]
Squash the extra instruction in the branch shadow!
42 Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30
[Figure: pipeline state: next PC = 110; and r13 in Exec; add r10 in Mem with r11 + r12; ori writing back r8 = r9 | 17]
Squash the extra instruction in the branch shadow!
43 Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Figure: pipeline state: PC = 114; squashed slot in Exec (NO WB, NO Ovflow); and r13 in Mem with r14 and r15; add r10 writing back r11 + r12]
Squash the extra instruction in the branch shadow!
44 Summary: Pipelining
- What makes it easy?
- All instructions are the same length
- Just a few instruction formats
- Memory operands appear only in loads and stores
- What makes it hard?
- Structural hazards: suppose we had only one memory
- Control hazards: need to worry about branch instructions
- Data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
- We'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.
45 Summary
- Pipelining is a fundamental concept
- Multiple steps using distinct resources
- Utilize the capabilities of the datapath by pipelined instruction processing
- Start the next instruction while working on the current one
- Limited by the length of the longest stage (plus fill/flush)
- Detect and resolve hazards
46 Ch 6 Supplementary
- Branch prediction
- Branch prediction is critical for superpipelined and superscalar computers
- The more instructions issued at the same time, the larger the penalty of hazards
- Statistically, 60% of conditional branches will branch
- Higher-level (and more powerful) instruction sets need fewer conditional branches, such as those supporting variable-length operands
- Conditional branching can be classified into two types:
- Program loops
- Random decision making
47 Static branch prediction
- Static and Dynamic Branch Prediction
- Static: the compiler determines how conditional branches are predicted
- Dynamic: predictions are generated at run-time (during execution)
- Static prediction is good for looping:
- Loop exit test at loop start: predict continue
- Loop exit test at loop end: predict branching
- Random decision: guess branch to be taken
48 Dynamic branch prediction
- Dynamic: data sensitive, at run-time (at execution)
- One-bit dynamic branch prediction
- Predict as the previous record shows
- Two-bit dynamic branch prediction
- If the previous two outcomes are the same, predict the same
- If the previous two outcomes alternate, predict alternation:
- Branch, branch: predict branch
- Not branch, not branch: predict not branch
- Branch, not branch: predict branch
- Not branch, branch: predict not branch
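The four cases above collapse to a simple rule: the prediction always equals the older of the two recorded outcomes. A sketch of the scheme exactly as stated (this is the slide's alternation-tracking variant, not the more common two-bit saturating counter):

```python
def predict(prev, last):
    """Two-bit prediction from the slide: prev and last are the two most
    recent outcomes (True = branch taken). If they agree, predict the same;
    if they alternate, predict that the alternation continues."""
    if prev == last:
        return last   # e.g. branch, branch -> predict branch
    return prev       # e.g. branch, not-branch -> predict branch (alternation)

# The four cases listed on the slide:
print(predict(True, True))    # branch, branch         -> True  (branch)
print(predict(False, False))  # not branch, not branch -> False (not branch)
print(predict(True, False))   # branch, not branch     -> True  (branch)
print(predict(False, True))   # not branch, branch     -> False (not branch)
```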
49 Branch prediction (cont.)
- Branch prediction cache
- A cache with entries for the instruction addresses that host branch instructions, plus bits recording whether each branch previously branched or not
[Figure: a cache entry holds the branch instruction address, the last outcome, and the previous-to-last outcome (B = Branch, NB = Not Branch); the prediction is used when the entry is Valid AND Matched]
50 Other schemes for minimizing the incorrect branch prediction penalty
- Speculative execution
- Execute first, with a way to roll back
- by doing the store-back on a shadow copy
- by keeping a backup copy for undo
- Conditional execution
- Minimizes conditional branching for some common actions such as clear, set-to-1, move, or add
- Delayed (conditional) branching
- Always execute the next instruction and then do the conditional branching
- Saves a cycle or more
51 Branching
- Call and Return are branching instructions that branch through the stack
- Software interrupts are implicit branches and are transparent to the program even though their actions are carried out. They are treated as parts of the program.
52 Superscalar
- Superscalar
- Fetch and execute two or more instructions concurrently
- To achieve CPI < 1
- Dynamic issue: the schedule for executing two or more instructions concurrently is decided at run-time
- Requires multiple copies of functional units such as instruction fetch, arithmetic execution, etc., and multi-port register files and caches (or cache buffers)
- There are more potential data dependency hazards, resource hazards, and control hazards
- The penalty for incorrect branch prediction is BIG
53 VLIW
- Very Long Instruction Word
- A VLIW instruction is machine (implementation) dependent
- A VLIW instruction consists of various fields
- Each field specifies the operation of a functional unit, such as Ifetch (instruction fetch), Idecode, Ofetch (operand fetch), EX (integer execute), FPA (floating-point add), FPMUL, and FPDIV
- Static issue: instructions are generated by compilers
- Multiple instructions are issued and executed at the same time
54 VLIW advantages
- Static code generation (by the compiler)
- Compilers can take a lot of time to pack the VLIW instructions; otherwise the packing is done dynamically by a hardware instruction scheduler (circuitry that analyzes and schedules the functional units)
- Easier to power down individual functional units when they are not used, and easier for compilers to deliberately arrange functional unit execution to minimize power consumption
- Can execute the instruction sets of different computer architectures on one machine through the respective compilers.
- However, the functional units must be constructed to support these instruction sets and architectures.
55 VLIW disadvantages
- Compilers are hard to build
- Machine dependent: must have different compilers for different machines of the same architecture
- Binary incompatible: must have different binary codes for different machines of the same architecture
- The compiler cannot see the input data when compiling, so it must prepare for all possible cases of input data
- Difficult to recover from compiler mistakes, and the time penalty can be BIG
- Difficult to debug
- Non-VLIW machines can also power down individual functional units when not used