Title: Chapter Six Enhancing Performance with Pipelining
1 Chapter Six: Enhancing Performance with Pipelining
2 Definition
- Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
- We'll use a laundry analogy for pipelining to explain the main concepts.
- There are four stages in doing the laundry:
  - put the dirty clothes in the washer (wash)
  - place the washed clothes in the dryer (dry)
  - place the dry load on the table and fold it (fold)
  - put the clothes away (store)
- What about a MIPS instruction?
3 Single-Cycle vs Pipelined Performance
- Look at lw, sw, add, sub, and, or, slt, and beq.
- Operation time for major functional components:
  - 200 ps for memory access
  - 200 ps for ALU operation
  - 100 ps for register file read or write
- The slowest instruction, lw, needs 200 + 100 + 200 + 200 + 100 = 800 ps, which sets the single-cycle clock; the slowest component (200 ps) sets the pipelined clock.
- Total execution time for 3 instructions:
  - 3 x 800 ps = 2.4 ns for a single-cycle, non-pipelined processor
  - 1.4 ns (see the figure on the next page) for a pipelined processor
- Total execution time for 1,003 instructions:
  - 1000 x 800 ps + 2400 ps = 802.4 ns for a single-cycle, non-pipelined processor
  - 1000 x 200 ps + 1400 ps = 201.4 ns for a pipelined processor
- The speedup (802.4 / 201.4, about 3.98) is less than the number of stages because
  - stages may be imperfectly balanced
  - there is overhead involved
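The totals on this slide can be reproduced with a short script. This is a minimal sketch, assuming every instruction passes through all five steps (as lw does), that the pipelined clock equals the slowest step, and that no stalls occur. The function names are illustrative and do not come from the slides.

```python
# Component latencies in picoseconds, taken from the slide.
MEMORY_PS = 200
ALU_PS = 200
REGISTER_PS = 100

# Steps used by lw: fetch, register read, ALU, data access, register write.
LW_STEPS_PS = [MEMORY_PS, REGISTER_PS, ALU_PS, MEMORY_PS, REGISTER_PS]


def single_cycle_time_ps(n_instructions):
    """Each instruction takes one long cycle sized for the slowest instruction."""
    return n_instructions * sum(LW_STEPS_PS)


def pipelined_time_ps(n_instructions):
    """The first instruction fills the pipeline; each later one adds one clock."""
    clock_ps = max(LW_STEPS_PS)
    return (len(LW_STEPS_PS) + n_instructions - 1) * clock_ps


for n in (3, 1003):
    single = single_cycle_time_ps(n)
    piped = pipelined_time_ps(n)
    print(f"{n} instructions: {single} ps vs {piped} ps, speedup {single / piped:.2f}")
```

As the instruction count grows, the speedup approaches 800 / 200 = 4 rather than 5, because the 100 ps register steps are padded out to the 200 ps clock.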
4 Pipelining
- Improves performance by increasing instruction throughput
  - Each instruction still takes the same time to execute
- The ideal speedup is the number of stages in the pipeline. Do we achieve this?
[Figure: pipelined execution, program execution order (in instructions). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg; intervals of 2 ns are marked.]
5 Pipelining in MIPS: What makes it easy?
- All instructions are the same length, so instruction fetch (1st pipeline stage) and decoding (2nd stage) are much easier.
- MIPS has just a few instruction formats, with the source register fields in the same location, so register file read and instruction decoding can be done at the same time.
- Memory operands appear only in loads and stores (as opposed to 80x86, where instructions can operate on operands in memory).
- Operands must be aligned in memory, so we need not worry about a single data transfer instruction requiring two data memory accesses.
6 Pipelining in MIPS: What makes it hard?
- Structural hazards: suppose we had only one memory
- Control hazards: need to worry about branch instructions
- Data hazards: an instruction depends on a previous instruction
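7 Structural Hazards
- What if we add a fourth instruction to the following figure?
- What happens between time 6 and 8 ns? (With a single memory, the first instruction's data access and the fourth instruction's fetch would both need it in that interval.)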
[Figure: pipelined execution, program execution order (in instructions). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg; intervals of 2 ns are marked.]
8 Control Hazards
- Possible solution:
  - stall: pause before continuing the pipeline; not efficient if we have a long pipeline
  - a pipeline stall is also known as a bubble
[Figure: program execution order (in instructions) against a time axis marked 2, 4, 6, 8, 10, 12, 14, 16.]
The above figure assumes that we have extra
hardware in place to resolve the branch in the
second stage. Otherwise the pause will be longer
than 4ns.
9 Control Hazards
[Figure: program execution order (in instructions) with a row of five bubbles inserted. Stage labels are Instruction fetch, Reg, ALU, Data access, Reg; the time axis shows 10, 12, 14, and intervals of 2 ns and 4 ns are marked.]
10 Control Hazards
[Figure: program execution order (in instructions) with one instruction labeled "(Delayed branch slot)"; a 2 ns interval is marked.]
11 Data Hazards
- Look at the following example:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
- We need the result $s0 from the add instruction to do the subtraction.
- Is the data ready?
- The compiler cannot handle this issue.
- Solution: forwarding or bypassing, i.e., getting the missing item early from the internal resources.
12 Graphical representation of the instruction pipeline
- IF: instruction fetch
- ID: instruction decode
- EX: execution
- MEM: memory access
- WB: write back
- Shaded: element used; white: element not used
- Right half shaded: read; left half shaded: write
[Figure: add $s0, $t0, $t1 drawn as the stages IF, ID, EX, MEM, WB against a time axis marked 2, 4, 6, 8, 10.]
13 Forwarding
- As soon as the ALU finishes the add, forward the result
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 in program execution order (in instructions), each drawn as IF, ID, EX, MEM, WB against a time axis marked 2, 4, 6, 8, 10.]
14 Forwarding with stall
- When an R-format instruction follows a load and tries to use the loaded data, a load-use data hazard occurs.
- We need to stall in this case.
[Figure: pipeline stall shown as bubbles.]
15 Reordering Code to Avoid Pipeline Stalls
- Original code (register $t1 has the address of v[k]):
    lw $t0, 0($t1)    # reg $t0 = v[k]
    lw $t2, 4($t1)    # reg $t2 = v[k+1]
    sw $t2, 0($t1)    # v[k] = reg $t2
    sw $t0, 4($t1)    # v[k+1] = reg $t0
- A data hazard occurs on register $t2 between the second lw and the first sw.
- Modified code removes the hazard (register $t1 has the address of v[k]):
    lw $t0, 0($t1)    # reg $t0 = v[k]
    lw $t2, 4($t1)    # reg $t2 = v[k+1]
    sw $t0, 4($t1)    # v[k+1] = reg $t0
    sw $t2, 0($t1)    # v[k] = reg $t2
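The effect of the reordering can be checked mechanically. The sketch below counts load-use stalls under one simplifying assumption: an instruction that reads a register written by the lw immediately before it costs one stall cycle, including a sw that reads it as store data. The `Instr` tuple and helper names are hypothetical, introduced only for this illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    writes: Tuple[str, ...]  # registers written by the instruction
    reads: Tuple[str, ...]   # registers read by the instruction


def lw(rt, offset, base):
    """Load: writes rt, reads the base register. The offset is kept for readability."""
    return Instr(f"lw {rt}, {offset}({base})", (rt,), (base,))


def sw(rt, offset, base):
    """Store: writes no register, reads the data register and the base register."""
    return Instr(f"sw {rt}, {offset}({base})", (), (rt, base))


def load_use_stalls(program):
    """Count adjacent pairs where a lw result is read by the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        is_load = prev.op.startswith("lw")
        if is_load and any(reg in cur.reads for reg in prev.writes):
            stalls += 1
    return stalls


original = [lw("$t0", 0, "$t1"), lw("$t2", 4, "$t1"),
            sw("$t2", 0, "$t1"), sw("$t0", 4, "$t1")]
reordered = [lw("$t0", 0, "$t1"), lw("$t2", 4, "$t1"),
             sw("$t0", 4, "$t1"), sw("$t2", 0, "$t1")]

print(load_use_stalls(original))   # 1: sw $t2 directly follows lw $t2
print(load_use_stalls(reordered))  # 0: one instruction now separates them
```

Swapping the two stores is safe here because they write different addresses and neither changes $t1.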
16 A Pipelined Datapath
- What do we need to add to actually split the datapath into stages?
[Figure: datapath divided into stages; surviving labels include Execute/address calculation, MEM: Memory access, and WB: Write back.]
17 Pipelined Datapath
- Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
[Figure: pipelined datapath. Surviving labels include the ID/EX pipeline register; Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2); ALU (Zero, ALU result); Data memory (Address, Write data, Read data); Mux; and a Sign extend unit with a 16-bit input.]
18 IF Stage
19 ID Stage
20 EX Stage
21 MEM Stage
22 WB Stage
23 Corrected Datapath
24 Portions of the Datapath used by a load instruction
25 Graphically Representing Pipelines
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
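Both questions on this slide can be answered from a text version of the diagram. The following is a minimal sketch, assuming an ideal five-stage pipeline with no stalls, in which instruction i (counting from 0) occupies stage s (counting from 0) during cycle i + s + 1. The three instructions are illustrative and chosen to be independent of one another.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(program):
    """Cycles needed to finish the program in an ideal, stall-free pipeline."""
    if not program:
        return 0
    return len(program) + len(STAGES) - 1


def stage_occupant(program, stage, cycle):
    """Return the instruction in `stage` during 1-based `cycle`, or None if idle."""
    index = (cycle - 1) - STAGES.index(stage)
    if 0 <= index < len(program):
        return program[index]
    return None


def diagram(program):
    """One row per instruction, one column per cycle, stage names in the cells."""
    rows = []
    for i, instr in enumerate(program):
        cells = [""] * i + STAGES
        rows.append(f"{instr:<20}" + "".join(f"{c:<4}" for c in cells).rstrip())
    return "\n".join(rows)


program = ["lw $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4"]
print(diagram(program))
print(total_cycles(program))               # 3 instructions finish in 7 cycles
print(stage_occupant(program, "EX", 4))    # the ALU is working on the sub
```

Reading down a column of the printed diagram shows what every stage is doing in that cycle; reading across a row follows one instruction through the pipeline.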
26 Pipeline Control
27 Pipeline Control
- We have 5 stages. What needs to be controlled in each stage?
  - Instruction Fetch and PC Increment
  - Instruction Decode / Register Fetch
  - Execution: RegDst, ALUOp, ALUSrc
  - Memory Stage: Branch, MemRead, MemWrite
  - Write Back: MemtoReg, RegWrite
- How would control be handled in an automobile plant?
  - a fancy control center telling everyone what to do?
  - should we use a finite state machine?
28 Pipeline Control
- Pass control signals along just like the data
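One way to picture passing control along with the data is to generate all signals during decode and let each pipeline register carry only what later stages still need. The sketch below groups the signals as in the per-stage list for slide 27. The signal values are the settings commonly given for this textbook datapath and should be treated as assumptions, since the slides list only the names; `decode` and `advance` are hypothetical helpers.

```python
EX_SIGNALS = ("RegDst", "ALUOp", "ALUSrc")
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS = ("MemtoReg", "RegWrite")

# Assumed control settings per instruction class; None marks a don't-care.
CONTROL = {
    "R-format": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0, "Branch": 0,
                 "MemRead": 0, "MemWrite": 0, "MemtoReg": 0, "RegWrite": 1},
    "lw": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1, "Branch": 0,
           "MemRead": 1, "MemWrite": 0, "MemtoReg": 1, "RegWrite": 1},
    "sw": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1, "Branch": 0,
           "MemRead": 0, "MemWrite": 1, "MemtoReg": None, "RegWrite": 0},
    "beq": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0, "Branch": 1,
            "MemRead": 0, "MemWrite": 0, "MemtoReg": None, "RegWrite": 0},
}


def decode(kind):
    """ID stage: produce every control signal; all of them enter ID/EX."""
    return dict(CONTROL[kind])


def advance(signals, consumed):
    """Drop the signals used by the stage just completed; the rest travel on."""
    return {name: value for name, value in signals.items() if name not in consumed}


id_ex = decode("lw")                    # EX, MEM, and WB signals
ex_mem = advance(id_ex, EX_SIGNALS)     # MEM and WB signals remain
mem_wb = advance(ex_mem, MEM_SIGNALS)   # only WB signals remain
print(mem_wb)
```

No central controller or finite state machine is needed in this arrangement: each stage reads its own signals from the pipeline register in front of it.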
29 Datapath with Control
30 Dependencies
- Problem with starting the next instruction before the first is finished
  - dependencies that go backward in time are data hazards
31 Hazard Conditions
- Type 1a: EX/MEM.RegisterRd = ID/EX.RegisterRs
- Type 1b: EX/MEM.RegisterRd = ID/EX.RegisterRt
- Type 2a: MEM/WB.RegisterRd = ID/EX.RegisterRs
- Type 2b: MEM/WB.RegisterRd = ID/EX.RegisterRt
- Classify the dependencies in the following sequence:
    sub $2, $1, $3      # Reg. $2 set by sub
    and $12, $2, $5     # 1st operand ($2)
    or  $13, $6, $2     # 2nd operand ($2)
    add $14, $2, $2
    sw  $15, 100($2)
  - sub-and: Type 1a hazard
  - sub-or: Type 2b hazard
  - sub-add: no hazard; sub-sw: no hazard
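The classification can be expressed as a small function. This sketch assumes that a producer one instruction ahead of the consumer has its result in EX/MEM while the consumer is in EX (types 1a/1b), that a producer two ahead has it in MEM/WB (types 2a/2b), and that anything farther apart reads a correct value from the register file. The `Instr` tuple and `classify` are hypothetical names used only for this illustration.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str
    rd: Optional[int]  # register written, or None if no register is written
    rs: int            # first source register
    rt: int            # second source register


PROGRAM = [
    Instr("sub $2, $1, $3", 2, 1, 3),
    Instr("and $12, $2, $5", 12, 2, 5),
    Instr("or $13, $6, $2", 13, 6, 2),
    Instr("add $14, $2, $2", 14, 2, 2),
    Instr("sw $15, 100($2)", None, 2, 15),  # sw reads $2 (base) and $15 (data)
]


def classify(program, producer, consumer):
    """Return the hazard types between two instructions given by index."""
    rd = program[producer].rd
    distance = consumer - producer
    if rd is None or distance not in (1, 2):
        return []
    prefix = "1" if distance == 1 else "2"  # EX/MEM vs MEM/WB
    hazards = []
    if rd == program[consumer].rs:
        hazards.append(prefix + "a")
    if rd == program[consumer].rt:
        hazards.append(prefix + "b")
    return hazards


for consumer in range(1, len(PROGRAM)):
    print(PROGRAM[consumer].text, classify(PROGRAM, 0, consumer))
```

A fuller check would also require that the producer actually writes a register and that the destination is not register 0; both are left out here to mirror the four conditions exactly as listed.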
32 Forwarding
- Use temporary results; don't wait for them to be written
  - register file forwarding to handle read/write to the same register
  - ALU forwarding
33 Forwarding
34 Can't always forward
- Load word can still cause a hazard:
  - an instruction tries to read a register following a load instruction that writes to the same register.
- Thus, we need a hazard detection unit to stall the instruction that follows the load
35 Stalling
- We can stall the pipeline by keeping an instruction in the same stage
36 Hazard Detection Unit
- Stall by letting an instruction that won't write anything go forward
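The test a hazard detection unit performs can be sketched in a few lines. The condition below is the usual textbook one: stall when the instruction in EX is a load and its destination (the rt field) matches either source register of the instruction in ID. The output names PCWrite and IF/IDWrite are assumptions, not taken from these slides.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must wait one cycle for a load."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,      # hold the PC so the same instruction is refetched
        "IF/IDWrite": not stall,   # hold IF/ID so the same instruction is decoded again
        "bubble": stall,           # zero the control signals entering ID/EX
    }


# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID: must stall.
print(hazard_detection(True, 2, 2, 5))

# sub $2, $1, $3 is in EX (not a load): forwarding suffices, no stall.
print(hazard_detection(False, 2, 2, 5))
```

Zeroing the control signals is what turns the slot into an instruction that writes nothing, which then moves forward as the bubble.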
37 Branch Hazards
- When we decide to branch, other instructions are in the pipeline!
- We are predicting branch not taken
  - need to add hardware for flushing instructions if we are wrong
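The cost of guessing wrong can be estimated with a one-line model. This is a rough sketch in which every number in the example call is an illustrative assumption, not data from the slides: a base CPI of 1, a share of instructions that are branches, the share of those that are taken (and so mispredicted under predict-not-taken), and the number of instructions flushed per misprediction.

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, flush_penalty):
    """Average cycles per instruction when taken branches flush the pipeline."""
    return base_cpi + branch_fraction * taken_fraction * flush_penalty


# Hypothetical workload: 20% branches, half of them taken, 1 flushed instruction each.
cpi_early = effective_cpi(1.0, 0.20, 0.5, 1)

# Same workload if the branch resolves later and 3 instructions must be flushed.
cpi_late = effective_cpi(1.0, 0.20, 0.5, 3)

print(f"{cpi_early:.2f} {cpi_late:.2f}")
```

The model makes the trade-off visible: resolving the branch earlier in the pipeline shrinks the flush penalty, which matters more as the share of taken branches grows.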
38 Flushing Instructions
39 Improving Performance
- Try to avoid stalls! E.g., reorder these instructions:
  - lw $t0, 0($t1)
  - lw $t2, 4($t1)
  - sw $t2, 0($t1)
  - sw $t0, 4($t1)
- Add a branch delay slot
  - the next instruction after a branch is always executed
  - rely on the compiler to fill the slot with something useful
40 More on improving performance
- Superpipelining: decompose the stages further (not always practical)
- Superscalar: start more than one instruction in the same cycle (extra coordination required)
  - CPI can be less than 1
  - IPC: instructions per clock cycle
- Dynamic pipelining
  - lw $t0, 20($s2)
  - addu $t1, $t0, $t2
  - sub $s4, $s4, $t3
  - slti $t5, $s4, 20
  - Combine extra hardware resources so later instructions can proceed in parallel.
  - More complicated pipeline control
  - More complicated instruction execution model
41 Superscalar MIPS
- Assume two instructions are issued per clock cycle, say one integer ALU operation or branch, and the other a load or store.
- Need to fetch and decode 64 bits of instruction
- Extra resources are required.
42 Dynamic Scheduling
- The hardware performs the scheduling?
  - hardware tries to find instructions to execute
  - out-of-order execution is possible
  - speculative execution and dynamic branch prediction
43 Real Stuff
- All modern processors are very complicated
  - DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
  - PowerPC and Pentium: branch history table
  - Compiler technology is important