Chapter Six Enhancing Performance with Pipelining - PowerPoint PPT Presentation

About This Presentation

Title:

Chapter Six Enhancing Performance with Pipelining

Description:

Enhancing Performance with Pipelining. Definition: Pipelining is an implementation technique in which multiple instructions are overlapped in execution.

Slides: 44

Provided by: TodA1

Transcript and Presenter's Notes

Title: Chapter Six Enhancing Performance with Pipelining

1
Chapter Six: Enhancing Performance with Pipelining
2
Definition

Pipelining is an implementation technique in which
multiple instructions are overlapped in
execution.
We'll use a laundry analogy for pipelining to
explain the main concepts.
There are four stages in doing the laundry:
put the dirty clothes in the washer (wash)
place the washed clothes in the dryer (dry)
place the dry load on the table and fold (fold)
put the clothes away (store)
What about a MIPS instruction?

3
Single-Cycle vs Pipelined Performance

Consider lw, sw, add, sub, and, or, slt, and beq.
Operation time for major functional components:
200 ps for memory access
200 ps for ALU operation
100 ps for register file read or write
Total execution time for 3 instructions:
3 x 800 ps = 2.4 ns for a single-cycle, non-pipelined
processor
1.4 ns (see the figure on the next slide) for a pipelined
processor
Total execution time for 1003 instructions:
1000 x 800 ps + 2400 ps = 802.4 ns for a
single-cycle, non-pipelined processor
1000 x 200 ps + 1400 ps = 201.4 ns for a pipelined
processor
Speedup is less than the number of stages
because:
stages may be imperfectly balanced
overhead is involved
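
The totals above can be checked with a short Python sketch. It assumes a five-stage pipeline whose clock is set by the slowest component (200 ps) and a single-cycle clock sized for lw, which uses every component (800 ps). The function and constant names are illustrative, not from the slides.

```python
# Component latencies in picoseconds, as given on the slide.
MEMORY_PS = 200
ALU_PS = 200
REGISTER_PS = 100

# lw uses every unit: fetch, register read, ALU, data access, register write.
SINGLE_CYCLE_PS = MEMORY_PS + REGISTER_PS + ALU_PS + MEMORY_PS + REGISTER_PS
# The pipelined clock must be long enough for the slowest stage.
PIPELINED_CYCLE_PS = max(MEMORY_PS, ALU_PS, REGISTER_PS)
NUM_STAGES = 5  # assumption: the classic five-stage MIPS pipeline


def single_cycle_time_ps(n):
    """Total time for n instructions when each takes one long cycle."""
    return n * SINGLE_CYCLE_PS


def pipelined_time_ps(n):
    """The first instruction needs NUM_STAGES cycles; each later one adds one."""
    return (NUM_STAGES + n - 1) * PIPELINED_CYCLE_PS


for n in (3, 1003):
    s = single_cycle_time_ps(n)
    p = pipelined_time_ps(n)
    print(f"{n} instructions: {s / 1000} ns vs {p / 1000} ns, speedup {s / p:.2f}")
```

The speedup approaches 800 / 200 = 4 rather than 5 because the stages are not perfectly balanced: the register stages need only 100 ps but are given a full 200 ps cycle.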

4
Pipelining

Improve performance by increasing instruction
throughput
Each instruction still takes the same
time to execute
Ideal speedup is number of stages in the
pipeline. Do we achieve this?

[Figure: three instructions in program execution order, each passing through instruction fetch, Reg, ALU, data access, and Reg; time axis marked in 2 ns steps]
5
Pipelining in MIPS: What makes it easy?

All instructions are the same length, so instruction
fetch (1st pipeline stage) and decoding (2nd
stage) are much easier.
MIPS has just a few instruction formats, with the source
register fields in the same location in each, so the register
file read and instruction decoding can be done at
the same time.
Memory operands appear only in loads and stores
(as opposed to 80x86, where we could operate on
operands in memory).
Operands must be aligned in memory, so we need not
worry about a single data transfer instruction
requiring two data memory accesses.

6
Pipelining in MIPS: What makes it hard?

Structural hazards: suppose we had only one
memory
Control hazards: need to worry about branch
instructions
Data hazards: an instruction depends on a
previous instruction

7
Structural Hazards

Suppose we add a fourth instruction to the following
figure.
What happens between time 6
and 8 ns?

[Figure: instructions in program execution order, each passing through instruction fetch, Reg, ALU, data access, and Reg; time axis marked in 2 ns steps]
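
A minimal sketch of the question above, assuming a single memory shared by instruction fetch and data access, 2 ns per stage as in the figure, no stalls, and a program of four lw instructions (the instruction mix is an assumption; the function name is hypothetical). It reports the cycles in which a fetch and a data access would need the memory at the same time.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
CYCLE_NS = 2  # each stage takes 2 ns in the figure


def memory_conflicts(program):
    """Return (cycle, accessing_index, fetching_index) for every collision.

    Cycles are 1-based. Instruction i (0-based) is in IF during cycle i + 1.
    Assumes one memory serves both IF and MEM and the pipeline never stalls.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op not in ("lw", "sw"):
            continue  # only loads and stores use memory in the MEM stage
        mem_cycle = i + 1 + STAGES.index("MEM")
        fetching = mem_cycle - 1  # index of the instruction in IF that cycle
        if fetching < len(program):
            conflicts.append((mem_cycle, i, fetching))
    return conflicts


for cycle, accessing, fetching in memory_conflicts(["lw"] * 4):
    start_ns = (cycle - 1) * CYCLE_NS
    print(f"{start_ns}-{start_ns + CYCLE_NS} ns: instruction {accessing + 1} reads data "
          f"while instruction {fetching + 1} is fetched")
```

With three instructions the list is empty. The fourth instruction is the first one whose fetch lines up with an earlier data access, which is why the slide asks about the interval from 6 to 8 ns.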
8
Control Hazards

Possible solution: stall
pause before continuing the pipeline;
not efficient if we have a long pipeline
a pipeline stall is also known as a bubble

[Figure: pipeline stalling on a branch; vertical axis: program execution order (in instructions); time axis: 2 to 16 ns]
The above figure assumes that we have extra
hardware in place to resolve the branch in the
second stage. Otherwise the pause will be longer
than 4 ns.
9
Control Hazards

Another solution: predict the branch outcome

[Figure: branch prediction in the pipeline, showing bubbles; vertical axis: program execution order (in instructions); stages: instruction fetch, Reg, ALU, data access, Reg; intervals of 2 ns and 4 ns marked]
10
Control Hazards

Delayed branch

[Figure: delayed branch; vertical axis: program execution order (in instructions); one instruction labeled "(Delayed branch slot)"; 2 ns intervals marked]
11
Data Hazards

Look at the following example:
add $s0, $t0, $t1
sub $t2, $s0, $t3
We need the result $s0 from the add instruction
to do the subtraction.
Is the data ready?
The compiler cannot handle this issue.
Solution: forwarding or bypassing, i.e., getting
the missing item early from the internal
resources.

12
Graphical representation of the instruction
pipeline

IF: instruction fetch
ID: instruction decode
EX: execution
MEM: memory access
WB: write back
Shaded: element used; white: element not used
Right half shaded: read; left half shaded: write

[Figure: add $s0, $t0, $t1 passing through IF, ID, EX, MEM, and WB; time axis: 2 to 10]
13
Forwarding

As soon as the ALU finishes the add, forward the
result

[Figure: forwarding from add $s0, $t0, $t1 to sub $t2, $s0, $t3; each instruction passes through IF, ID, EX, MEM, and WB; vertical axis: program execution order (in instructions); time axis: 2 to 10]
14
Forwarding with stall

When an R-format instruction immediately follows a load and
tries to use the loaded data, a load-use data hazard
occurs.
We need to stall in this case.

[Figure: pipeline with bubbles inserted for the stall]
15
Reordering Code to Avoid Pipeline Stalls

Original code:
# reg $t1 has the address of v[k]
lw $t0, 0($t1)    # reg $t0 = v[k]
lw $t2, 4($t1)    # reg $t2 = v[k+1]
sw $t2, 0($t1)    # v[k] = reg $t2
sw $t0, 4($t1)    # v[k+1] = reg $t0
A data hazard occurs on register $t2 between the
second lw and the first sw.
Modified code removes the hazard:
# reg $t1 has the address of v[k]
lw $t0, 0($t1)    # reg $t0 = v[k]
lw $t2, 4($t1)    # reg $t2 = v[k+1]
sw $t0, 4($t1)    # v[k+1] = reg $t0
sw $t2, 0($t1)    # v[k] = reg $t2
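
The effect of the reordering can be counted with a small sketch. It assumes the simple model used on this slide: an instruction that immediately follows a lw and reads the loaded register costs one stall cycle. The tuple layout and function name are hypothetical.

```python
# Each instruction is (opcode, destination register or None, source registers).
original = [
    ("lw", "$t0", ("$t1",)),
    ("lw", "$t2", ("$t1",)),
    ("sw", None, ("$t2", "$t1")),
    ("sw", None, ("$t0", "$t1")),
]
# Modified order: the two stores are swapped.
modified = [original[0], original[1], original[3], original[2]]


def load_use_stalls(program):
    """Count adjacent pairs where the second instruction reads a just-loaded register."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "lw" and dest in curr[2]:
            stalls += 1
    return stalls


print("original:", load_use_stalls(original), "stall(s)")
print("modified:", load_use_stalls(modified), "stall(s)")
```

The swap is safe because the two stores write different addresses, 0($t1) and 4($t1), so their order does not change the final memory contents.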

16
A Pipelined Datapath

What do we need to add to actually split the
datapath into stages?

[Figure: datapath divided into stages, including Execute/address calculation, MEM: Memory access, and WB: Write back]
17
Pipelined Datapath

Can you find a problem even if
there are no dependencies? What instructions
can we execute to manifest the problem?

[Figure: pipelined datapath with the ID/EX pipeline register, register file (Read register 1 and 2, Read data 1 and 2, Write register, Write data), ALU (Zero, ALU result), data memory (Address, Read data, Write data), multiplexers, and 16-bit sign extend]
18
IF Stage
19
ID Stage
20
EX Stage
21
MEM Stage
22
WB Stage
23
Corrected Datapath
24
Portions of the Datapath used by a load
instruction
25
Graphically Representing Pipelines

Can help with answering questions like:
how many cycles does it take to execute this
code?
what is the ALU doing during cycle 4?
Use this representation to help understand
datapaths

[Figure: pipeline diagram with the ALU stages labeled]
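
Both questions can be answered from the diagram's geometry alone. The sketch below assumes an ideal five-stage pipeline with no stalls, and the three-instruction program is a hypothetical example chosen to be free of dependencies.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a 1-based cycle."""
    k = cycle - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def total_cycles(n):
    """Cycles needed to run n instructions with no stalls."""
    return n + len(STAGES) - 1


def in_stage(program, stage, cycle):
    """Instruction text occupying the given stage in the given cycle, if any."""
    for i, text in enumerate(program):
        if stage_at(i, cycle) == stage:
            return text
    return None


# Hypothetical program with no dependencies between instructions.
program = ["lw $t0, 0($t1)", "add $t2, $t3, $t4", "sub $t5, $t6, $t7"]

print("cycles:", total_cycles(len(program)))
print("ALU (EX stage) in cycle 4:", in_stage(program, "EX", 4))
for i, text in enumerate(program):
    row = [stage_at(i, c) or "." for c in range(1, total_cycles(len(program)) + 1)]
    print(f"{text:20}", " ".join(f"{s:3}" for s in row))
```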
26
Pipeline Control
27
Pipeline control

We have 5 stages. What needs to be controlled in
each stage?
Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution: RegDst, ALUOp, ALUSrc
Memory Stage: Branch, MemRead, MemWrite
Write Back: MemtoReg, RegWrite
How would control be handled in an automobile
plant?
A fancy control center telling everyone what to
do?
Should we use a finite state machine?

28
Pipeline Control

Pass control signals along just like the data
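
As a rough illustration of this idea, the sketch below treats the control signals as a bundle that travels through the pipeline registers with the instruction. Each stage uses its own group and passes the rest on. The grouping follows the previous slide; the lw values are the standard textbook settings for a load and are not listed in these slides, and the function name is hypothetical.

```python
# Control signals grouped by the stage that uses them.
CONTROL_GROUPS = {
    "EX": ("RegDst", "ALUOp", "ALUSrc"),
    "MEM": ("Branch", "MemRead", "MemWrite"),
    "WB": ("MemtoReg", "RegWrite"),
}


def split_control(control, stage):
    """Return (signals used in this stage, signals passed to the next register)."""
    used = {name: control[name] for name in CONTROL_GROUPS[stage]}
    passed_on = {name: value for name, value in control.items() if name not in used}
    return used, passed_on


# Assumed control values for lw, generated once during instruction decode.
lw_control = {
    "RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1,
    "Branch": 0, "MemRead": 1, "MemWrite": 0,
    "MemtoReg": 1, "RegWrite": 1,
}

remaining = lw_control
for stage in ("EX", "MEM", "WB"):
    used, remaining = split_control(remaining, stage)
    print(stage, "uses", used, "and passes on", sorted(remaining))
```

Each pipeline register therefore needs to hold fewer control bits than the one before it.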

29
Datapath with Control
30
Dependencies

Problem with starting the next instruction before
the first is finished:
dependencies that go backward in time are data
hazards

31
Hazard Conditions

Type 1.a: EX/MEM.RegisterRd = ID/EX.RegisterRs
Type 1.b: EX/MEM.RegisterRd = ID/EX.RegisterRt
Type 2.a: MEM/WB.RegisterRd = ID/EX.RegisterRs
Type 2.b: MEM/WB.RegisterRd = ID/EX.RegisterRt
Classify the dependencies in the following
sequence:
sub $2, $1, $3      # Reg. $2 set by sub
and $12, $2, $5     # 1st operand ($2)
or  $13, $6, $2     # 2nd operand ($2)
add $14, $2, $2
sw  $15, 100($2)
sub-and: Type 1a hazard
sub-or: Type 2b hazard
sub-add: no hazard; sub-sw: no hazard
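
The classification can be reproduced with a short sketch. It assumes that a consumer one instruction behind the producer sees the value in EX/MEM (Type 1), one that is two behind sees it in MEM/WB (Type 2), and anything further back reads the register file after the write has happened. It also assumes register $0 never causes a hazard, since it is hard-wired to zero. The function name is hypothetical.

```python
def classify(producer_rd, consumer_rs, consumer_rt, distance):
    """Hazard types between a producer and a consumer `distance` instructions later."""
    if producer_rd == 0:
        return []  # assumption: $0 is constant, so writes to it never matter
    prefix = {1: "1", 2: "2"}.get(distance)
    if prefix is None:
        return []  # three or more instructions apart: value is already written
    kinds = []
    if producer_rd == consumer_rs:
        kinds.append(prefix + "a")
    if producer_rd == consumer_rt:
        kinds.append(prefix + "b")
    return kinds


# (name, rs, rt) for the instructions after sub $2, $1, $3.
# For sw $15, 100($2), rs is the base register $2 and rt is $15.
consumers = [("and", 2, 5), ("or", 6, 2), ("add", 2, 2), ("sw", 2, 15)]

for distance, (name, rs, rt) in enumerate(consumers, start=1):
    kinds = classify(2, rs, rt, distance)
    print(f"sub-{name}:", ", ".join("Type " + k for k in kinds) or "no hazard")
```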

32
Forwarding

Use temporary results; don't wait for them to be
written
register file forwarding to handle read/write to
the same register
ALU forwarding

33
Forwarding
34
Can't always forward

Load word can still cause a hazard:
an instruction tries to read a register immediately
after a load instruction that writes to the same
register.
Thus, we need a hazard detection unit to stall
the instruction that follows the load
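
A minimal sketch of the check such a unit performs, using the standard textbook load-use condition (it is not spelled out in these slides): stall when the instruction in EX is a load and its destination register, rt, matches either source register of the instruction being decoded. The parameter names mirror the pipeline register fields; the function name is hypothetical.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load still in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) followed by and $4, $2, $5: the and reads $2 too early.
print(must_stall(True, 2, 2, 5))
# The same pair with an add in EX instead of a load: forwarding is enough.
print(must_stall(False, 2, 2, 5))
```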

35
Stalling

We can stall the pipeline by keeping an
instruction in the same stage

36
Hazard Detection Unit

Stall by letting an instruction that won't write
anything go forward

37
Branch Hazards

When we decide to branch, other instructions are
in the pipeline!
We are predicting branch not taken
need to add hardware for flushing instructions if
we are wrong

38
Flushing Instructions

39
Improving Performance

Try to avoid stalls! E.g., reorder these
instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Add a branch delay slot
the next instruction after a branch is always
executed
rely on compiler to fill the slot with something
useful

40
More on improving performance

Superpipelining: decompose the stages further (not
always practical)
Superscalar: start more than one instruction in
the same cycle (extra coordination required)
CPI can be less than 1
IPC: instructions per clock cycle
Dynamic pipelining:
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Combine extra hardware resources so later
instructions can proceed in parallel.
More complicated pipeline control
More complicated instruction execution model

41
Superscalar MIPS

Assume two instructions are issued per clock
cycle, say one integer ALU operation or branch,
the other load or store.
Need to fetch and decode 64 bits of instructions
Extra resources are required.
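
The issue rule above can be sketched as a greedy, in-order pairing. This is only an illustration: it ignores data dependencies and hazards entirely, the opcode sets and function name are hypothetical, and the five-instruction stream is made up.

```python
ALU_OR_BRANCH = {"add", "addu", "sub", "and", "or", "slt", "slti", "beq"}
LOAD_OR_STORE = {"lw", "sw"}


def issue_packets(ops):
    """Group opcodes in order, pairing neighbors that use different issue slots."""
    packets = []
    i = 0
    while i < len(ops):
        packet = [ops[i]]
        if i + 1 < len(ops):
            a, b = ops[i], ops[i + 1]
            mixed = (a in ALU_OR_BRANCH and b in LOAD_OR_STORE) or (
                a in LOAD_OR_STORE and b in ALU_OR_BRANCH
            )
            if mixed:
                packet.append(b)  # one ALU/branch plus one load/store per cycle
        packets.append(packet)
        i += len(packet)
    return packets


ops = ["lw", "add", "sw", "sub", "add"]  # hypothetical instruction stream
packets = issue_packets(ops)
print(packets)
print("issue cycles per instruction:", len(packets) / len(ops))
```

When enough neighbors pair up, the issue cycles per instruction drop below 1, which is the sense in which a superscalar machine can have a CPI below 1.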

42
Dynamic Scheduling