Title: CS1104
1CS1104 Computer Organization
- PART 2 Computer Architecture
- Lecture 11
- Pipelining
2Topics
- Pipelining
- Pipelined datapath
- Pipelined control
- Hazards
- Structural
- Data
- Control
- Exceptions
- Performance improvements
- Scheduling
- Branch prediction
- Superscalar processors
3Pipelining
- Improve performance by increasing instruction throughput
4Pipelining
- Ideal speedup = number of stages
- Do we achieve this?
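A rough way to explore this question is to count cycles. The sketch below assumes every stage takes one cycle, there are no hazards or stalls, and the unpipelined machine needs one stage-time per stage for each instruction; the function names are illustrative and not from the lecture.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction occupies the whole datapath for n_stages stage-times."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to pass through the pipeline;
    every later instruction completes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


# Speedup approaches the number of stages (5 here) only for long instruction runs.
for n in (1, 5, 100, 1_000_000):
    print(n, round(speedup(n, 5), 3))
```

Under these assumptions the speedup only approaches the number of stages for long instruction streams; fill time, and later the hazards, keep a real pipeline below the ideal.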
5Pipelining
- What makes it easy
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
- What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
- We'll talk about modern processors and what really makes it hard
- exception handling
- trying to improve performance with out-of-order execution, etc.
6Basic Idea
- What do we need to add to actually split the
datapath into stages?
7Pipelined Datapath
- Can you find a problem even if there are no
dependencies? What instructions can we execute
to manifest the problem?
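[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers; the labels 64, 128, 97 and 64 give the register widths in bits]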
8lw
11sw
12sw
13Corrected Datapath (lw)
[Figure: corrected pipelined datapath for lw, showing the PC, instruction memory, adders, shift left 2, a mux and the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers]
14Datapath used in all the five stages of lw
15Graphically Representing Pipelines
- Can help with answering questions like
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
- use this representation to help understand datapaths
16Pipeline Control
17Pipeline control
- We have 5 stages. What needs to be controlled in each stage?
- Instruction Fetch and PC Increment
- Instruction Decode / Register Fetch
- Execution
- Memory Stage
- Write Back
- How would control be handled in an automobile plant?
- a fancy control center telling everyone what to do?
- should we use a finite state machine?
18Pipeline Control
- Pass control signals along just like the data
- No control signals for IF and ID, but only for
the remaining three stages
19Datapath with Control
20Hazards
21Hazards
- Hazards: problems due to pipelining
- Hazard types
- Structural
- same resource is needed multiple times in the same cycle
- Data
- data dependencies limit pipelining
- Control
- next executed instruction is not the next specified instruction
22Structural hazards
- Examples
- Two accesses to a single-ported memory
- Two operations need the same function unit at the same time
- Two operations need the same function unit in successive cycles, but the unit is not pipelined
- Solutions
- stalling
- add more hardware
23Structural hazards
- Simple pipelining diagram (not MIPS!)
- IF: instruction fetch
- ID: instruction decode
- OF: operand fetch
- EX: execute stage(s)
- WB: write back
[Figure: pipeline stalls due to lack of resources. Instruction stream against time, each instruction passing through IF ID OF EX WB; a load contends for the shared memory port, and an operation with three EX cycles (IF ID OF EX EX EX WB) occupies the single FU]
24Structural hazards
[Figure: structural hazard on the same non-pipelined FU. Instruction stream against time through IF ID OF EX WB; instructions that need two EX cycles cause a stall cycle for the instruction behind them]
25Structural hazards on MIPS
- Q: Do we have structural hazards on our simple MIPS pipeline?
26Data hazards
- Data dependencies
- RaW (read-after-write)
- WaW (write-after-write)
- WaR (write-after-read)
- Hardware solution
- Forwarding / Bypassing
- Detection logic
- Stalling
- Software solution: scheduling
27Data dependences
- Three types: RaW, WaR and WaW
- add r1, r2, 5 ; r1 := r2+5
- sub r4, r1, r3 ; RaW of r1
- add r1, r2, 5
- sub r2, r4, 1 ; WaR of r2
- add r1, r2, 5
- sub r1, r1, 1 ; WaW of r1
- st r1, 5(r2) ; M[r2+5] := r1
- ld r5, 0(r4) ; RaW if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom! Problems for your compiler and Pentium! Use register renaming to solve this!
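The three register dependence types can be checked mechanically. The sketch below is illustrative only: the `Instr` tuple and `dependences` function are hypothetical names, instructions are described by hand rather than parsed, and memory dependences (the st/ld case) are not covered.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify register dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RaW")    # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WaR")    # second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WaW")    # both write the same register
    return found


add = Instr("add r1, r2, 5", "r1", ("r2",))
sub_raw = Instr("sub r4, r1, r3", "r4", ("r1", "r3"))
sub_war = Instr("sub r2, r4, 1", "r2", ("r4",))
sub_waw = Instr("sub r1, r1, 1", "r1", ("r1",))

for second in (sub_raw, sub_war, sub_waw):
    print(second.text, sorted(dependences(add, second)))
```

Note that the third pair (sub r1, r1, 1) carries a RaW dependence on r1 in addition to the WaW the slide highlights, because the sub also reads r1.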
28RaW dependence
add r1, r2, 5 ; r1 := r2+5
sub r4, r1, r3 ; RaW of r1
[Figure: timing of add r1, r2, 5 and sub r4, r1, r3 through IF ID OF EX WB, without and with bypass circuitry; bypassing saves two cycles]
29RaW on MIPS pipeline
30Forwarding
- Use temporary results, don't wait for them to be written
- register file forwarding to handle read/write to same register
- ALU forwarding
31Hazard Conditions
- EX/MEM.RegisterRd = ID/EX.RegisterRs
- EX/MEM.RegisterRd = ID/EX.RegisterRt
- MEM/WB.RegisterRd = ID/EX.RegisterRs
- MEM/WB.RegisterRd = ID/EX.RegisterRt
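The four comparisons can be sketched as a small forwarding unit. This is a minimal model, not the lecture's circuit: the `PipeRegs` class and the returned labels are hypothetical, and the RegWrite and "Rd is not register 0" checks are assumptions taken from the usual textbook formulation rather than from this slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PipeRegs:
    # Fields named after the pipeline-register fields on the slide.
    id_ex_rs: int
    id_ex_rt: int
    ex_mem_rd: int
    mem_wb_rd: int
    # Assumed extras (textbook formulation, not shown on this slide).
    ex_mem_reg_write: bool = True
    mem_wb_reg_write: bool = True


def forward_select(src: int, r: PipeRegs) -> str:
    """Pick where one ALU operand comes from: 'EX/MEM', 'MEM/WB' or 'REGFILE'."""
    # The most recent result (EX/MEM) takes priority over the older one (MEM/WB).
    if r.ex_mem_reg_write and r.ex_mem_rd != 0 and r.ex_mem_rd == src:
        return "EX/MEM"
    if r.mem_wb_reg_write and r.mem_wb_rd != 0 and r.mem_wb_rd == src:
        return "MEM/WB"
    return "REGFILE"


def forwarding_unit(r: PipeRegs) -> Tuple[str, str]:
    """Return the selected source for the Rs operand and the Rt operand."""
    return forward_select(r.id_ex_rs, r), forward_select(r.id_ex_rt, r)


# An instruction in EX reads registers 2 and 5; the instruction ahead of it
# (now in MEM) writes register 2, so only the Rs operand is forwarded.
print(forwarding_unit(PipeRegs(id_ex_rs=2, id_ex_rt=5, ex_mem_rd=2, mem_wb_rd=0)))
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger result is the one the instruction in EX must see.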
32Forwarding hardware
[Figure: ALU forwarding circuitry principle. ALU operands come either from the register file or from result buffers (buf); results go to the register file]
33Forwarding
34Forwarding check
- Check for matching register-ids
- For each source-id of operation in the EX-stage
check if there is a matching pending dest-id
Q. How many comparators do we need?
35Forwarding Conditions
36Without and with forwarding
37Can't always forward
- Load word can still cause a hazard
- an instruction tries to read register r following a load to the same r
- Need a hazard detection unit to stall the instruction that follows the load
38Stalling
- We can stall the pipeline by keeping an
instruction in the same stage
39Stalling Condition
if (ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt)))
then stall the pipeline
- Line 1: Is the instruction a load (reads data memory)?
- Lines 2-3: Does the destination register field of the load in the EX stage match either of the source registers of the instruction in the ID stage?
- Stall: deassert control signals (create a do-nothing instruction) by changing the EX, MEM and WB control fields of the ID/EX pipeline register to 0
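The stall condition translates almost directly into code. In the sketch below the function names and the dictionary of control fields are hypothetical, and the control values in the usage lines are made-up placeholders; only the condition itself comes from the slide.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """Load-use hazard: the instruction in EX is a load whose destination (Rt)
    matches a source register field of the instruction in ID."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


def id_ex_control(decoded: dict, stall: bool) -> dict:
    """Control fields written into ID/EX; all zeros create a do-nothing bubble."""
    if stall:
        return {field: 0 for field in decoded}
    return dict(decoded)


# A load into register 2 is in EX while an instruction reading registers 2 and 5
# is in ID, so the pipeline must stall and a bubble is inserted.
stall = must_stall(id_ex_mem_read=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(stall, id_ex_control({"EX": 0b1100, "MEM": 0b010, "WB": 0b10}, stall))
```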
40Hazard Detection Unit
[Figure: pipelined datapath with hazard detection unit and forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and a mux that zeroes the control fields entering ID/EX; the forwarding unit takes EX/MEM.RegisterRd and MEM/WB.RegisterRd]
41Software only solution
- Have compiler guarantee that no hazards occur
- Example: where do we insert the NOPs?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
- Problem: this really slows us down!
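One way to reason about NOP placement is to enforce a minimum distance between the writer of a register and any later reader. The sketch below is an assumption-laden illustration: `insert_nops` is a hypothetical helper, instructions are given as hand-written (text, destination, sources) tuples in MIPS `$n` notation, and the default distance of 3 assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half.

```python
NOP = ("nop", None, ())


def insert_nops(program, min_distance=3):
    """Pad `program` so every reader is at least `min_distance`
    instructions behind the latest writer of each register it reads."""
    out = []
    last_write = {}                       # register -> position of its writer in `out`
    for text, dest, srcs in program:
        needed = 0
        for reg in srcs:
            if reg in last_write:
                gap = len(out) - last_write[reg]
                needed = max(needed, min_distance - gap)
        out.extend([NOP] * needed)        # no-op when needed == 0
        if dest is not None:
            last_write[dest] = len(out)
        out.append((text, dest, srcs))
    return out


program = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2",)),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]

for text, _, _ in insert_nops(program):
    print(text)
```

Under these assumptions two NOPs between the sub and the and are enough, since the later readers of $2 are then already far enough behind the sub.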
42Control hazards
- Control operations may change the sequential flow of instructions
- branch
- jump
- call (jump and link)
- return
- (exception)
43Branch
- Branch actions
- Compute new address
- Determine condition
- Perform the actual branch (if taken): PC := new address
- Squash pipeline
- When we decide to branch, other instructions are in the pipeline!
- We are predicting branch not taken
- need to add hardware for flushing instructions if we are wrong
44Branch with predict not taken
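[Figure: branch with predict not taken. Pipeline diagram (IF ID EX MEM WB against clock cycles) for the branch to L, the instructions fetched under the not-taken prediction, and the branch target L]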
45Branch example
46Branch speedup
- Earlier address computation
- Earlier condition calculation
- Put both in the ID pipeline stage
- adder (from MEM stage)
- comparator (from EX stage)
47Improved branching / flushing IF/ID
48Exception support
- Types of exceptions
- Overflow
- I/O device request
- Operating system call
- Undefined instruction
- Hardware malfunction
- Page fault
- Precise exception
- finish previous instructions (which are still in the pipeline)
- flush excepting and following instructions, redo them after handling the exception(s)
49Exceptions
- Changes needed for handling overflow exception of an operation in EX stage
- Extend PC input mux with extra entry with fixed address
- Add EPC register recording the ID/EX stage PC
- this is the address of the next instruction!
- Cause register recording exception type
- In case of overflow exception insert 3 bubbles; flush
- IF/ID stage
- ID/EX stage
- EX/MEM stage
50Performance improvements
51Performance improvements
- Scheduling
- avoiding data hazards
- avoiding control hazards
- Branches
- delay slot
- branch prediction
- Superscalar
52Scheduling, why?
- Let's look at the execution time
- Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle
- Scheduling may reduce Texecution
- Reduce CPI (cycles per instruction)
- early scheduling of long latency operations
- avoid pipeline stalls due to structural, data and control hazards
- allow Nissue > 1 and therefore CPI < 1
- Reduce Ninstructions
- compact many operations into each instruction (VLIW)
53Scheduling data hazards example 1
- Try and avoid RaW stalls (in this case load interlocks)!
- E.g., reorder these instructions
lw t0, 0(t1)
lw t2, 4(t1)
sw t0, 4(t1)
sw t2, 0(t1)

lw t0, 0(t1)
lw t2, 4(t1)
sw t2, 0(t1)
sw t0, 4(t1)
- The ordering in which sw t2 directly follows lw t2 suffers the load interlock
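The two orderings can be compared by counting load interlocks. The sketch below is a simplification with hypothetical names: each instruction is a hand-written (opcode, rt, rs) tuple, and a stall is counted whenever a lw is immediately followed by an instruction whose rt or rs field names the loaded register, which mirrors a hazard detection unit that compares register fields.

```python
def count_load_use_stalls(program):
    """program: list of (opcode, rt, rs); rt is the loaded or stored register,
    rs the base-address register."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_rt, _ = prev
        _, cur_rt, cur_rs = cur
        if prev_op == "lw" and prev_rt in (cur_rt, cur_rs):
            stalls += 1
    return stalls


# sw t0 comes between lw t2 and sw t2
separated = [("lw", "t0", "t1"), ("lw", "t2", "t1"),
             ("sw", "t0", "t1"), ("sw", "t2", "t1")]
# sw t2 directly follows lw t2
adjacent = [("lw", "t0", "t1"), ("lw", "t2", "t1"),
            ("sw", "t2", "t1"), ("sw", "t0", "t1")]

print(count_load_use_stalls(separated), count_load_use_stalls(adjacent))
```

Both sequences swap the two memory words; only the placement of the stores relative to the second load differs.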
54Scheduling data hazards example 2
Avoiding RaW stalls
Reordering instructions for following program (by you or the compiler)
Code: a = b + c; d = e - f
55Scheduling control hazards
- Texecution = Ninstructions x CPI x Tcycle
- CPI = CPIideal + fbranch x Pbranch
- Pbranch = Ndelayslots x miss_rate
- Modern processors tend to have large branch penalty, Pbranch, due to many pipeline stages
- Note that penalties have larger effect when CPIideal is low
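A short calculation shows why the penalty weighs more when CPIideal is low. It uses the relations CPI = CPIideal + fbranch x Pbranch and Pbranch = Ndelayslots x miss_rate; the function names and the numbers (20% branches, 3 delay slots, 50% miss rate) are made up for illustration.

```python
def effective_cpi(cpi_ideal, f_branch, n_delay_slots, miss_rate):
    """CPI including the average branch penalty."""
    p_branch = n_delay_slots * miss_rate      # average penalty per branch, in cycles
    return cpi_ideal + f_branch * p_branch


def execution_time(n_instructions, cpi, t_cycle):
    """Texecution = Ninstructions x CPI x Tcycle."""
    return n_instructions * cpi * t_cycle


for cpi_ideal in (1.0, 0.25):
    cpi = effective_cpi(cpi_ideal, f_branch=0.2, n_delay_slots=3, miss_rate=0.5)
    print(f"CPIideal={cpi_ideal}: CPI={cpi:.2f}, slowdown x{cpi / cpi_ideal:.2f}")
```

With these numbers the same 0.3 cycles of penalty per instruction cost about 30% at CPIideal = 1 but more than double the execution time at CPIideal = 0.25.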
56Scheduling control hazards
- What can we do about control hazards and CPI penalty?
- Keep penalty Pbranch low
- Early computation of new PC
- Early determination of condition
- Visible branch delay slots filled by compiler (MIPS)
- Branch prediction
- Reduce control dependencies (control height reduction)
- Remove branches: if-conversion
- Conditional instructions: CMOVE, cond skip next
- Guarding all instructions: TriMedia
57Branch delay slot
- Add a branch delay slot
- the next instruction after a branch is always executed
- rely on compiler to fill the slot with something useful
58Branch delay slot scheduling
Q. What to put in the delay slot?
     op 1
     beq r1, r2, L
     .............
     op 2            ('fall-through')
     .............
L:   op 3            (branch target)
     .............
60Branch prediction
- Predict (not)taken schemes use fixed prediction
- Can we remember (dynamically) branch directions?
- 1-bit scheme
- 2-bit schemes
- multi-level branch predictors
- hybrid schemes
611-bit prediction, using prediction buffer
[Figure: the lower K bits of the branch address index a prediction buffer of 2^K entries, each holding one prediction bit]
- Problems
- Aliasing: lower K bits of different branch instructions could be the same
- Solution: use tags; however, very expensive
- Loops are predicted wrong twice
- Solution: use n-bit saturating counter prediction
- taken if counter >= 2^(n-1)
- not-taken if counter < 2^(n-1)
- A 2-bit saturating counter predicts a loop wrong only once
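The difference between 1-bit and 2-bit prediction on a loop can be simulated. This is a sketch of a single n-bit saturating counter for one branch (no buffer, no aliasing); the class and function names are illustrative, and the loop of 9 taken iterations followed by one exit is a made-up workload.

```python
class SaturatingCounter:
    """n-bit saturating up/down counter; predicts taken if value >= 2^(n-1)."""

    def __init__(self, n_bits, initial=0):
        self.max = (1 << n_bits) - 1
        self.threshold = 1 << (n_bits - 1)
        self.value = initial

    def predict(self):
        return self.value >= self.threshold     # True means "taken"

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max)
        else:
            self.value = max(self.value - 1, 0)


def mispredictions(counter, outcomes):
    """Count wrong predictions over a sequence of actual branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# The loop branch is taken 9 times and then falls through; the loop runs 3 times.
outcomes = ([True] * 9 + [False]) * 3
one_bit = mispredictions(SaturatingCounter(1, initial=1), outcomes)
two_bit = mispredictions(SaturatingCounter(2, initial=3), outcomes)
print(one_bit, two_bit)
```

Starting both counters in their strongest "taken" state, the 1-bit predictor misses the exit of every run and the first iteration of each following run, while the 2-bit counter misses only the exits.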
62Using n-bit Saturating Counters
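[Figure: the branch address selects an n-bit saturating up/down counter that delivers the prediction. State diagram of the 2-bit saturating counter scheme with states 00/N, 01/N, 10/T and 11/T; a taken branch (T) moves towards 11, a not-taken branch (N) towards 00]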
63Superscalars
- issue (start) multiple instructions per cycle
- multiple function units (like ALU, LD-ST,..)
- extend forwarding circuitry, detection logic
- extend ID logic
- check for independent operations
- dynamic scheduling
64Multiple (2) instructions per cycle
[Figure: pipeline diagram of instructions against clock cycles, with two instructions issued per cycle]
65Multiple (2) instructions per cycle
- Q: How will the following code be executed on a 2-issue machine, with one extra ALU?
Loop: lw r0, 0(r1)
      addu r0, r0, r2
      sw r0, 0(r1)
      addi r1, r1, -4
      bne r1, r0, Loop
A: Check dependencies. Only instr. 3 and 4 can be executed in parallel. Use loop unrolling and scheduling to improve execution.
66Dynamic Scheduling
- The hardware performs the scheduling
- hardware tries to find instructions to execute
- out-of-order (o-o-o) execution is possible
- register renaming to avoid WaW and WaR stalls
- dynamic branch prediction
- speculative execution
67Dynamic scheduling architecture
[Figure: dynamic scheduling architecture: instruction fetch and decode feeding the FUs, with forwarding buses and the register file(s)]
68Dynamic scheduling
- Q: Check the same example, now for a dynamic scheduling 2-issue superscalar (a MIPS with one extra ALU).
Loop: lw r0, 0(r1)
      addu r0, r0, r2
      sw r0, 0(r1)
      addi r1, r1, -4
      bne r1, r0, Loop
69Superscalar processors
- All modern processors are extremely complicated
- Deep pipelines
- Many support o-o-o execution
- Multi-level branch prediction
- Speculative execution beyond 4 or more branches
- Multiple outstanding cache misses
- Multiple threads
- .......
- Compiler technology important
70Performance Increase
Processor         Year  Freq (MHz)  SPECint92  SPECfp92  Issue rate  Pipeline stages
Intel 8086        1978  5           0.2        0.1       1           -
Intel 286         1982  6           1.0        0.5       1           -
Intel 386         1986  16          3.1        1.6       1           -
Intel 486         1989  25          15         7         1           5
Intel Pent P5     1993  66          67.4       63.6      2           5
Intel Pent Pro    1995  150         245        220       3           14
Intel Pent III    1999  450         750        550       3           14
Intel Pent IV     2001  1400        1850       1950      -           20
M68040            1989  25          21         15        1           3
Sparc micro       1992  50          26         21        1           5
Sparc Ultra I     1995  167         275        305       4           9
Sparc Ultra III   1999  600         1400       2400      -           -
Mips 3000         1989  33          18         19        1           5
Mips 4000         1992  100         59         61        1           8
Mips 10K          1995  200         300        600       5           5
HP 7000           1990  66          48         75        1           5
HP 7200           1994  140         150        250       2           5
HP 8000           1996  180         400        600       4           7
Alpha 21064       1992  200         133        200       2           7
Alpha 21164       1994  300         300        500       4           7
Alpha 21264       1997  500         1100       1900      4           7
Alpha 21364       2001  1000        2800       4800      4           6
MPC 601a          1993  50          40         60        3           4
MPC 604           1994  100         160        165       4           4
MPC 620           1995  130         225        300       5           4
71Performance Increase
[Figure: SPECint and SPECfp ratings against year. SPECint92 and SPECfp92 data points with growth curves, rating on a logarithmic axis from 0.1 to 1000, years 1978 to 2002]
- Microprocessor SPEC Ratings
- 50% SPECint improvement / year
- 60% SPECfp improvement / year
72Summary
- Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1
- Hazards increase CPI
- Several measures are taken to avoid or reduce hazards
- Multi-issue further reduces CPI
- Branch prediction to avoid high branch penalties
- Dynamic scheduling
- In all cases a scheduling compiler is needed