Title: Computer Architecture and Parallel Computing, Lecture 2 - Pipelining
1Computer Architecture and Parallel Computing, Lecture 2 - Pipelining
- Peng Liu
- Dept. of Info. Sci. Elec. Engg.
- Zhejiang University
- liupeng_at_zju.edu.cn
May 12, 2014
2
- Microcoding became less attractive as the gap between RAM and ROM speeds reduced
- Complex instruction sets are difficult to pipeline, so it was difficult to increase performance as gate count grew
- Iron Law explains architecture design space
- Trade instructions/program, cycles/instruction, and time/cycle
- Load-Store RISC ISAs designed for efficient pipelined implementations
- Very similar to vertical microcode
- Inspired by earlier Cray machines (more on these later)
3An Ideal Pipeline
- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines, but instructions depend on each other!
4Pipelined MIPS
- To pipeline MIPS
- First build MIPS without pipelining with CPI = 1
- Next, add pipeline registers to reduce cycle time while maintaining CPI = 1
5Unpipelined Datapath for MIPS
[Figure: unpipelined MIPS datapath; labeled signals include PCSrc (pc+4 / br / rind / jabs), RegWrite, the two adders, and the constant 31]
6Hardwired Control Table
Opcode    ExtSel  BSrc  OpSel  MemW  RegW  WBSrc  RegDst  PCSrc
ALU               Reg   Func   no    yes   ALU    rd      pc+4
ALUi      sExt16  Imm   Op                        rt
ALUiu     uExt16                                          pc+4
LW        sExt16  Imm          no    yes   Mem    rt
SW
BEQZ z=0  sExt16        0?                                br
BEQZ z=1                                                  pc+4
J                                                         jabs
JAL                                  yes   PC     R31     jabs
JR                                                        rind
JALR
[Only the cell values that survived extraction are shown; blank cells were lost.]
BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
7Pipelined Datapath
[Figure: pipelined datapath with PC and 0x4 adder, Inst. Memory, GPRs, Imm Ext, ALU, and Data Memory]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles:
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$ ($= t_{DM}$ probably)
However, CPI will increase unless instructions are pipelined
8Iron Law of Processor Performance
$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$
- Instructions per program depends on source code, compiler technology, and ISA
- Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology
Microarchitecture          CPI   cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
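A minimal sketch of the Iron Law as arithmetic, in Python. The instruction count, CPI values and cycle times are hypothetical numbers picked only to mirror the qualitative table above (microcoded: CPI > 1, short cycle; single-cycle: CPI = 1, long cycle; pipelined: CPI = 1, short cycle); they are not measurements, and the `Design` class is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str
    instructions: int      # instructions per program
    cpi: float             # cycles per instruction
    cycle_time_ns: float   # time per cycle

    def time_ns(self) -> float:
        # Iron Law: time/program = insts/program * cycles/inst * time/cycle
        return self.instructions * self.cpi * self.cycle_time_ns


# Hypothetical values, for illustration only.
designs = [
    Design("microcoded", 1_000_000, 4.0, 1.0),
    Design("single-cycle unpipelined", 1_000_000, 1.0, 5.0),
    Design("pipelined", 1_000_000, 1.0, 1.0),
]

if __name__ == "__main__":
    for d in designs:
        print(f"{d.name:26s} {d.time_ns() / 1e6:.1f} ms")
```

Under these assumed numbers the pipelined design wins because it keeps the short cycle of the microcoded machine and the CPI of the single-cycle machine.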
9CPI Examples
[Figure: CPI examples drawn along a time axis]
10Technology Assumptions
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported register files (slower!)
Thus, the following timing assumption is reasonable:
$t_{IM} \approx t_{RF} \approx t_{ALU} \approx t_{DM} \approx t_{RW}$
A 5-stage pipeline will be the focus of our detailed design. Some commercial designs have over 30 pipeline stages to do an integer add!
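The effect of the timing assumption on clock period can be sketched as follows. The stage delays are hypothetical (chosen to be roughly equal, as the slide assumes), and `register_overhead_ns` is an illustrative parameter for pipeline-register delay that the slides do not quantify.

```python
# Hypothetical stage delays in ns: roughly equal, as assumed above.
stage_delay_ns = {"IM": 1.0, "RF": 0.9, "ALU": 1.0, "DM": 1.1, "RW": 0.9}


def unpipelined_cycle(delays):
    # Single-cycle design: one instruction crosses every stage in one clock.
    return sum(delays.values())


def pipelined_cycle(delays, register_overhead_ns=0.0):
    # Pipelined design: the clock must cover the slowest stage plus any
    # pipeline-register overhead.
    return max(delays.values()) + register_overhead_ns


if __name__ == "__main__":
    slow = unpipelined_cycle(stage_delay_ns)
    fast = pipelined_cycle(stage_delay_ns)
    print(f"unpipelined {slow:.1f} ns, pipelined {fast:.1f} ns, "
          f"ideal speedup {slow / fast:.2f}x at CPI = 1")
```

With balanced stages the ideal speedup approaches the number of stages; unbalanced stages or register overhead reduce it.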
11 5-Stage Pipelined Execution
time          t0   t1   t2   t3   t4   t5   t6   t7   ...
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
12 5-Stage Pipelined Execution: Resource Usage Diagram
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5
Resources (rows) vs. time (columns)
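The resource usage diagram for an ideal (hazard-free) pipeline follows a simple rule: instruction i occupies stage s in cycle i + s. A small sketch that generates it; `resource_usage` is a hypothetical helper name, and the five stage names are the ones used in these slides.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def resource_usage(n_instructions, stages=STAGES):
    """Return {stage: [occupant per cycle]} for an ideal pipeline, no hazards."""
    n_cycles = n_instructions + len(stages) - 1
    table = {}
    for depth, stage in enumerate(stages):
        row = []
        for t in range(n_cycles):
            i = t - depth  # index of the instruction in this stage at cycle t
            row.append(f"I{i + 1}" if 0 <= i < n_instructions else "")
        table[stage] = row
    return table


if __name__ == "__main__":
    for stage, row in resource_usage(5).items():
        print(stage, " ".join(f"{cell:>3}" for cell in row))
```

Five instructions finish in 5 + 4 = 9 cycles, so the fill time of the pipeline is amortized as the instruction count grows.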
13Pipelined Execution: ALU Instructions
Not quite correct! We need an Instruction Register (IR) for each stage
14Pipelined MIPS Datapath without jumps
[Figure: pipelined MIPS datapath with ALU, Data Memory, Imm Ext, and pipeline registers MD1 and MD2]
Control Points Need to Be Connected
15Instructions interact with each other in pipeline
- An instruction in the pipeline may need a resource being used by another instruction in the pipeline → structural hazard
- An instruction may depend on something produced by an earlier instruction
- Dependence may be for a data value → data hazard
- Dependence may be for the next instruction's address → control hazard (branches, exceptions)
16Resolving Structural Hazards
- Structural hazards occur when two instructions need the same hardware resource at the same time
- Can resolve in hardware by stalling newer instruction till older instruction finished with resource
- A structural hazard can always be avoided by adding more hardware to design
- E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory
- Our 5-stage pipe has no structural hazards by design
- Thanks to MIPS ISA, which was designed for pipelining
17Data Hazards
r1 ← ...
r4 ← r1 ...
...
r1 ← r0 + 10
r4 ← r1 + 17
...
r1 is stale. Oops!
18Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → interlocks
19Feedback to Resolve Hazards
- Later stages provide dependence information to earlier stages, which can stall (or kill) instructions
- Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)
20Interlocks to resolve Data Hazards
...
r1 ← r0 + 10
r4 ← r1 + 17
...
21Stalled Stages and Pipeline Bubbles
time                  t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5
nop ⇒ pipeline bubble
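The stall pattern in the diagram can be reproduced with a small timing model. This is a sketch under stated assumptions: no bypassing, in-order issue of one instruction per cycle, and a register value that becomes readable only in the cycle after the producer's WB (which is what the diagram shows: I1 writes back at t4 and I2 leaves ID at t5). The function name and the `(dest, sources)` encoding are hypothetical.

```python
def decode_cycles(program):
    """program: list of (dest, sources). Returns, for each instruction, the
    cycle in which it leaves ID, assuming interlocks and no bypassing."""
    leave_id = []
    ready = {}  # register -> first cycle in which a reader may leave ID
    for n, (dest, sources) in enumerate(program):
        # In-order pipeline: at best one cycle behind the previous instruction.
        earliest = 1 if n == 0 else leave_id[-1] + 1
        for r in sources:
            earliest = max(earliest, ready.get(r, 0))
        leave_id.append(earliest)
        if dest is not None and dest != "r0":
            # Producer needs EX, MA, WB after ID; value readable the cycle after.
            ready[dest] = earliest + 4
    return leave_id


if __name__ == "__main__":
    prog = [("r1", ["r0"]), ("r4", ["r1"]), ("r5", ["r0"])]
    print(decode_cycles(prog))
```

For the two-instruction example the dependent instruction leaves ID three cycles later than it would without the hazard, which matches the three bubbles in the resource usage diagram.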
22Interlock Control Logic
Compare the source registers of the instruction
in the decode stage with the destination register
of the uncommitted instructions.
23Interlock Control Logic: ignoring jumps & branches
Should we always stall if the rs field matches some rd?
not every instruction writes a register ⇒ we
not every instruction reads a register ⇒ re
24Source Destination Registers
Instruction  Semantics                           source(s)  destination
ALU          rd ← (rs) func (rt)                 rs, rt     rd
ALUi         rt ← (rs) op imm                    rs         rt
LW           rt ← M[(rs) + imm]                  rs         rt
SW           M[(rs) + imm] ← (rt)                rs, rt
BZ           cond (rs) true:  PC ← (PC) + imm    rs
             cond (rs) false: PC ← (PC) + 4      rs
J            PC ← (PC) + imm
JAL          r31 ← (PC), PC ← (PC) + imm                    31
JR           PC ← (rs)                           rs
JALR         r31 ← (PC), PC ← (rs)               rs         31
25Deriving the Stall Signal
C_dest:
ws = Case opcode
  ALU ⇒ rd
  ALUi, LW ⇒ rt
  JAL, JALR ⇒ R31
we = Case opcode
  ALU, ALUi, LW ⇒ (ws ≠ 0)
  JAL, JALR ⇒ on
  ... ⇒ off
C_re:
re1 = Case opcode
  ALU, ALUi, LW, SW, BZ, JR, JALR ⇒ on
  J, JAL ⇒ off
re2 = Case opcode
  ALU, SW ⇒ on
  ... ⇒ off
C_stall:
stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D
This is not the full story!
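The ws / we / re1 / re2 cases and the stall equation can be written out as executable logic. A minimal sketch: the `Inst` record, the use of `None` for a bubble, and the function names are illustrative choices, not part of the slides. As the slide says, this is not the full story (bypassing and branches come later).

```python
from collections import namedtuple

# Hypothetical instruction record; None stands for a bubble in a stage.
Inst = namedtuple("Inst", "opcode rs rt rd", defaults=(0, 0, 0))


def ws(i):
    """Destination register (write select)."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None


def we(i):
    """Write enable: writes to register 0 do not count."""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")


def re1(i):
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")


def re2(i):
    return i.opcode in ("ALU", "SW")


def stall(d, e, m, w):
    """d is the decode-stage instruction; e, m, w are the older ones."""
    older = [x for x in (e, m, w) if x is not None]
    rs_hit = any(we(x) and ws(x) == d.rs for x in older)
    rt_hit = any(we(x) and ws(x) == d.rt for x in older)
    return (rs_hit and re1(d)) or (rt_hit and re2(d))


if __name__ == "__main__":
    i1 = Inst("ALUi", rs=0, rt=1)   # r1 <- (r0) + 10
    i2 = Inst("ALUi", rs=1, rt=4)   # r4 <- (r1) + 17
    print(stall(i2, i1, None, None))
```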
26Hazards due to Loads & Stores
What if (r1)+7 = (r3)+5 ?
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
Is there any possible data hazard in this instruction sequence?
27Load & Store Hazards
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
(r1)+7 = (r3)+5 ⇒ data hazard
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.
28Resolving Data Hazards (2)
Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass
29Bypassing
Each stall or kill introduces a bubble in the pipeline ⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input
30Adding a Bypass
When does this bypass help?
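[Figure: datapath with an ALU-output bypass; the three example instruction sequences on the slide are answered yes, no, no]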
31The Bypass Signal: Deriving it from the Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
we = Case opcode
  ALU, ALUi, LW ⇒ (ws ≠ 0)
  JAL, JALR ⇒ on
  ... ⇒ off
ws = Case opcode
  ALU ⇒ rd
  ALUi, LW ⇒ rt
  JAL, JALR ⇒ R31
ASrc = (rsD = wsE).weE.re1D
Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass
Split weE into two components: we-bypass, we-stall
32Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
  ALU, ALUi ⇒ (ws ≠ 0)
  ... ⇒ off
we-stallE = Case opcodeE
  LW ⇒ (ws ≠ 0)
  JAL, JALR ⇒ on
  ... ⇒ off
ASrc = (rsD = wsE).we-bypassE.re1D
stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D
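A sketch of the split write-enable in executable form, for the datapath with a single bypass from the ALU output to the A input. The `Inst` record, the `match` helper and the use of `None` for a bubble are hypothetical; the case tables follow the equations above.

```python
from collections import namedtuple

# Hypothetical instruction record; None stands for a bubble in a stage.
Inst = namedtuple("Inst", "opcode rs rt rd", defaults=(0, 0, 0))


def ws(i):
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None


def we(i):
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")


def we_bypass(i):
    # Result is at the ALU output in E, so it can be forwarded.
    return i.opcode in ("ALU", "ALUi") and ws(i) != 0


def we_stall(i):
    # Result is not at the ALU output in E, so the reader must wait.
    return (i.opcode == "LW" and ws(i) != 0) or i.opcode in ("JAL", "JALR")


def re1(i):
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")


def re2(i):
    return i.opcode in ("ALU", "SW")


def match(reg, older, enable):
    return older is not None and enable(older) and ws(older) == reg


def asrc_bypass(d, e):
    return match(d.rs, e, we_bypass) and re1(d)


def stall(d, e, m, w):
    rs_hit = match(d.rs, e, we_stall) or match(d.rs, m, we) or match(d.rs, w, we)
    rt_hit = match(d.rt, e, we) or match(d.rt, m, we) or match(d.rt, w, we)
    return (rs_hit and re1(d)) or (rt_hit and re2(d))
```

With only the A-input bypass, a consumer that reads the value through rt still stalls, which is why the rt term keeps the unsplit weE.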
33Fully Bypassed Datapath
Is there still a need for the stall signal?
stall = (rsD = wsE).(opcodeE = LWE).(wsE ≠ 0).re1D
      + (rtD = wsE).(opcodeE = LWE).(wsE ≠ 0).re2D
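With full bypassing the only remaining data stall is the load-use case. A minimal sketch of that condition and of the CPI it produces on a short instruction sequence; the `Inst` record, the opcode sets and the `cpi` helper are illustrative, and `cpi` ignores pipeline fill and control hazards.

```python
from collections import namedtuple

# Hypothetical instruction record; None stands for a bubble in a stage.
Inst = namedtuple("Inst", "opcode rs rt rd", defaults=(0, 0, 0))

READS_RS = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}  # re1 = on
READS_RT = {"ALU", "SW"}                                     # re2 = on


def load_use_stall(d, e):
    """Stall for a fully bypassed pipeline: only a load in E can block D."""
    if e is None or e.opcode != "LW" or e.rt == 0:
        return False
    return ((d.rs == e.rt and d.opcode in READS_RS)
            or (d.rt == e.rt and d.opcode in READS_RT))


def cpi(program):
    """Average CPI with one bubble per adjacent load-use pair."""
    bubbles = sum(load_use_stall(cur, prev)
                  for prev, cur in zip(program, program[1:]))
    return (len(program) + bubbles) / len(program)


if __name__ == "__main__":
    prog = [
        Inst("LW", rs=2, rt=1),          # r1 <- M[(r2) + imm]
        Inst("ALU", rs=1, rt=3, rd=4),   # uses r1 right after the load
        Inst("ALU", rs=4, rt=3, rd=5),   # served by a bypass, no stall
    ]
    print(f"CPI = {cpi(prog):.2f}")
```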
34Resolving Data Hazards (3)
Strategy 3: Speculate on the dependence. Two cases:
Guessed correctly → do nothing
Guessed incorrectly → kill and restart
We'll later see examples of this approach in more complex processors.
35Control Hazards
- What do we need to calculate next PC?
- For Jumps
- Opcode, offset and PC
- For Jump Register
- Opcode and Register value
- For Conditional Branches
- Opcode, PC, Register (for condition), and offset
- For all other instructions
- Opcode and PC
- have to know it's not one of above!
36Opcode Decoding Bubble (assuming no branch delay slots for now)
time                  t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r3 ← (r2) + 17        IF2  IF2  ID2  EX2  MA2  WB2
(I3)                                 IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                           IF4  IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   nop  I2   nop  I3   nop  I4
ID         I1   nop  I2   nop  I3   nop  I4
EX              I1   nop  I2   nop  I3   nop  I4
MA                   I1   nop  I2   nop  I3   nop  I4
WB                        I1   nop  I2   nop  I3   nop  I4
nop ⇒ pipeline bubble
37Speculate next address is PC+4
[Figure: datapath, stall signal labeled]
I1 096 ADD
I2 100 J 304
I3 104 ADD
I4 304 ADD
A jump instruction kills (not stalls) the following instruction
How?
38Pipelining Jumps
[Figure: fetch/decode datapath with PC, 0x4 adder, Inst Memory, IR registers (including those for the E and M stages), Jump? detection, stall, and PCSrc (pc+4 / jabs / rind / br)]
To kill a fetched instruction, insert a mux before IR
Any interaction between stall and jump?
IRSrcD = Case opcodeD
  J, JAL ⇒ nop
  ... ⇒ IM
39Jump Pipeline Diagrams
time                t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100 J 304           IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                  IF3  nop  nop  nop  nop
(I4) 304 ADD                       IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5
nop ⇒ pipeline bubble
40Pipelining Conditional Branches
[Figure: datapath with BEQZ? detection]
Branch condition is not known until the execute stage: what action should be taken in the decode stage?
I1 096 ADD
I2 100 BEQZ r1 200
I3 104 ADD
   108
I4 304 ADD
41Pipelining Conditional Branches
[Figure: branch datapath with PC, 0x4 adder, Inst Memory, IR registers (including those for the E and M stages), stall, and PCSrc (pc+4 / jabs / rind / br)]
If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid
I1 096 ADD
I2 100 BEQZ r1 200
I3 104 ADD
   108
I4 304 ADD
42Pipelining Conditional Branches
[Figure: the branch datapath with BEQZ? and Jump? detection, stall, nop-selecting muxes IRSrcD and IRSrcE, and PCSrc (pc+4 / jabs / rind / br)]
43New Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
        . !((opcodeE = BEQZ).z + (opcodeE = BNEZ).!z)
Don't stall if the branch is taken. Why?
Instruction at the decode stage is invalid
44Control Equations for PC and IR Muxes
PCSrc = Case opcodeE
  BEQZ.z, BNEZ.!z ⇒ br
  ... ⇒ Case opcodeD
          J, JAL ⇒ jabs
          JR, JALR ⇒ rind
          ... ⇒ pc+4
Give priority to the older instruction, i.e., execute-stage instruction over decode-stage instruction
IRSrcD = Case opcodeE
  BEQZ.z, BNEZ.!z ⇒ nop
  ... ⇒ Case opcodeD
          J, JAL, JR, JALR ⇒ nop
          ... ⇒ IM
IRSrcE = Case opcodeE
  BEQZ.z, BNEZ.!z ⇒ nop
  ... ⇒ stall.nop + !stall.IRD
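The three mux control equations translate directly into nested conditionals, with the execute-stage branch checked first so the older instruction has priority. A minimal sketch; the function names and the string encoding of mux selections are illustrative.

```python
def branch_taken(opcode_e, z):
    # BEQZ.z or BNEZ.!z, where z is the zero-test result in the execute stage.
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)


def pc_src(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):        # older instruction has priority
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"


def ir_src_d(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):
        return "nop"                     # kill the instruction being fetched
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"                     # kill the instruction after a jump
    return "IM"


def ir_src_e(opcode_e, z, stall):
    if branch_taken(opcode_e, z):
        return "nop"                     # kill the instruction in decode
    return "nop" if stall else "IRD"


if __name__ == "__main__":
    # Taken BEQZ in E while a J sits in D: the branch wins, both are killed.
    print(pc_src("BEQZ", True, "J"),
          ir_src_d("BEQZ", True, "J"),
          ir_src_e("BEQZ", True, False))
```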
45Branch Pipeline Diagrams (resolved in execute stage)
time                t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100 BEQZ 200        IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                  IF3  ID3  nop  nop  nop
(I4) 108                           IF4  nop  nop  nop  nop
(I5) 304 ADD                            IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5
nop ⇒ pipeline bubble
46Reducing Branch Penalty (resolve in decode stage)
- One pipeline bubble can be removed if an extra comparator is used in the Decode stage
- But might elongate cycle time
[Figure: fetch/decode datapath with PC, 0x4 adder, Inst Memory, IR for the D stage with a nop input, and PCSrc (pc+4 / jabs / rind / br)]
Pipeline diagram now same as for jumps
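The benefit of resolving branches one stage earlier can be estimated with a simple CPI model: each taken branch costs a fixed number of bubbles (two when resolved in execute, one when resolved in decode). The branch frequency and taken rate below are hypothetical values for illustration, and the model ignores data hazards and any cycle-time increase from the extra comparator.

```python
def cpi_with_branches(branch_fraction, taken_fraction, bubbles_per_taken_branch):
    # Base CPI of 1 plus the average number of bubbles per instruction.
    return 1.0 + branch_fraction * taken_fraction * bubbles_per_taken_branch


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, half of them taken.
    in_execute = cpi_with_branches(0.20, 0.5, 2)
    in_decode = cpi_with_branches(0.20, 0.5, 1)
    print(f"resolve in execute: CPI = {in_execute:.2f}")
    print(f"resolve in decode:  CPI = {in_decode:.2f}")
```

Whether the lower CPI is a net win depends on how much the decode-stage comparator lengthens the cycle, since time per program is the product of both.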
47Branch Delay Slots (expose control hazard to software)
- Change the ISA semantics so that the instruction that follows a jump or branch is always executed
- gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.
I1 096 ADD
I2 100 BEQZ r1 200
I3 104 ADD
I4 304 ADD
Delay slot instruction executed regardless of branch outcome
- Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later
48Branch Pipeline Diagrams (branch delay slot)
time                t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100 BEQZ 200        IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                  IF3  ID3  EX3  MA3  WB3
(I4) 304 ADD                       IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4
ID         I1   I2   I3   I4
EX              I1   I2   I3   I4
MA                   I1   I2   I3   I4
WB                        I1   I2   I3   I4
49Why an Instruction may not be dispatched every cycle (CPI > 1)
- Full bypassing may be too expensive to implement
- typically all frequently used paths are provided
- some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
- Loads have two-cycle latency
- Instruction after load cannot use load result
- MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard).
- MIPS: "Microprocessor without Interlocked Pipeline Stages"
- Load delay slots removed in MIPS-II (pipeline interlocks added in hardware)
- Conditional branches may cause bubbles
- kill following instruction(s) if no delay slots
50Iron Law with Software-Visible NOPs
$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$
- If software has to insert NOP instructions for hazard avoidance, instructions/program increases
- average cycles/instruction decreases: doing nothing fast is easy!
- But performance (time/program) worse or same as if hardware instead uses interlocks to avoid hazard
- Hardware-generated interlocks (bubbles) don't change instructions/program, but only add to cycles/instruction
- Hardware interlocks don't take space in instruction cache
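The trade-off between software NOPs and hardware interlocks can be checked with the Iron Law. A minimal sketch with hypothetical counts (1000 useful instructions, 100 hazards, one lost cycle each); it assumes every instruction, NOPs included, takes one cycle and that the cycle time is the same in both designs.

```python
def time_per_program(instructions, cpi, cycle_time):
    return instructions * cpi * cycle_time


def with_software_nops(useful, hazards, cycle_time=1.0):
    # Each hazard costs one extra NOP instruction; every instruction takes a cycle.
    instructions = useful + hazards
    cpi = 1.0
    return instructions, cpi, time_per_program(instructions, cpi, cycle_time)


def with_interlocks(useful, hazards, cycle_time=1.0):
    # Each hazard costs one bubble; the instruction count is unchanged.
    cpi = (useful + hazards) / useful
    return useful, cpi, time_per_program(useful, cpi, cycle_time)


if __name__ == "__main__":
    for name, result in (("NOPs", with_software_nops(1000, 100)),
                         ("interlocks", with_interlocks(1000, 100))):
        n, cpi, t = result
        print(f"{name:10s} insts={n} CPI={cpi:.2f} time={t:.0f} cycles")
```

Under these assumptions both approaches take the same time; the NOP version reports a lower CPI only because it counts the NOPs as instructions.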
51Exceptions: altering the normal flow of control
[Figure: program instructions I(i-1), I(i), I(i+1) beside exception handler instructions HI1, HI2, ..., HIn]
An exception transfers control to special handler code run in privileged mode. Exceptions are usually unexpected or rare from program's point of view.
52Causes of Exceptions
Exception: an event that requests the attention of the processor
- Asynchronous: an external interrupt
- input/output device service request
- timer expiration
- power disruptions, hardware failure
- Synchronous: an internal exception (a.k.a. traps)
- undefined opcode, privileged instruction
- arithmetic overflow, FPU exception
- misaligned memory access
- virtual memory exceptions: page faults, TLB misses, protection violations
- software exceptions: system calls, e.g., jumps into kernel
53History of Exception Handling
- First system with exceptions was Univac-I, 1951
- Arithmetic overflow would either
- 1. trigger the execution of a two-instruction fix-up routine at address 0, or
- 2. at the programmer's option, cause the computer to stop
- Later Univac 1103, 1955, modified to add external interrupts
- Used to gather real-time wind tunnel data
- First system with I/O interrupts was DYSEAC, 1954
- Had two program counters, and I/O signal caused switch between two PCs
- Also, first system with DMA (direct memory access by I/O device)
Courtesy Mark Smotherman
54DYSEAC, first mobile computer!
Courtesy Mark Smotherman
55Asynchronous Interrupts: invoking the interrupt handler
- An I/O device requests attention by asserting one of the prioritized interrupt request lines
- When the processor decides to process the interrupt
- It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (a precise interrupt)
- It saves the PC of instruction Ii in a special register (EPC)
- It disables interrupts and transfers control to a designated interrupt handler running in the kernel mode
56MIPS Interrupt Handler Code
- Saves EPC before re-enabling interrupts to allow nested interrupts ⇒
- need an instruction to move EPC into GPRs
- need a way to mask further interrupts at least until EPC can be saved
- Needs to read a status register that indicates the cause of the interrupt
- Uses a special indirect jump instruction RFE (return-from-exception) to resume user code, which:
- enables interrupts
- restores the processor to the user mode
- restores hardware status and control state
57Synchronous Exception
- A synchronous exception is caused by a particular instruction
- In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
- requires undoing the effect of one or more partially executed instructions
- In the case of a system call trap, the instruction is considered to have been completed
- syscall is a special jump instruction involving a change to privileged kernel mode
- Handler resumes at instruction after system call
58Exception Handling 5-Stage Pipeline
- How to handle multiple simultaneous exceptions in different pipeline stages?
- How and where to handle external asynchronous interrupts?
59Exception Handling 5-Stage Pipeline
[Figure: 5-stage pipeline (Inst. Mem, Decode, Data Mem) annotated with exception sources: PC address exception, illegal opcode, overflow, data address exceptions, and asynchronous interrupts, feeding the Cause and EPC registers]
60Exception Handling 5-Stage Pipeline
- Hold exception flags in pipeline until commit point (M stage)
- Exceptions in earlier pipe stages override later exceptions for a given instruction
- Inject external interrupts at commit point (override others)
- If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
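The priority rules at the commit point can be sketched as a small selection function. This is an illustrative model only: the stage list, the cause strings and the `commit` function are hypothetical names, and the mapping of exception sources to stages is an assumption based on where each could be detected.

```python
# Pipeline order up to the commit point (M stage), earliest first.
STAGE_ORDER = ["IF", "ID", "EX", "MA"]


def commit(pc, flags, interrupt_pending=False):
    """flags: {stage: cause} collected as the instruction moved down the pipe.

    Returns None when the instruction commits normally, otherwise the
    (cause, epc) pair to write into the Cause and EPC registers.
    """
    if interrupt_pending:
        # External interrupts are injected at commit and override others.
        return ("interrupt", pc)
    for stage in STAGE_ORDER:
        # For a given instruction, the exception from the earlier stage wins.
        if stage in flags:
            return (flags[stage], pc)
    return None


if __name__ == "__main__":
    print(commit(96, {"EX": "overflow", "MA": "data address"}))
    print(commit(100, {}))
    print(commit(104, {"ID": "illegal opcode"}, interrupt_pending=True))
```

On a non-None result the rest of the slide's recipe applies: kill all stages and inject the handler PC into fetch.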
61Speculating on Exceptions
- Prediction mechanism
- Exceptions are rare, so simply predicting no exceptions is very accurate!
- Check prediction mechanism
- Exceptions detected at end of instruction execution pipeline, special hardware for various exception types
- Recovery mechanism
- Only write architectural state at commit point, so can throw away partially executed instructions after exception
- Launch exception handler after flushing pipeline
- Bypassing allows use of uncommitted instruction results by following instructions
62Exception Pipeline Diagram
time                    t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD            IF1  ID1  EX1  MA1  nop  overflow!
(I2) 100 XOR                 IF2  ID2  EX2  nop  nop
(I3) 104 SUB                      IF3  ID3  nop  nop  nop
(I4) 108 ADD                           IF4  nop  nop  nop  nop
(I5) Exc. Handler code                      IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   nop  nop  nop  I5
WB                        nop  nop  nop  nop  I5
63Acknowledgements
- UCB material derived from course CS152
- Harvard University material derived from course
CS246
64Readings
- Computer Architecture: A Quantitative Approach, 5th Edition (2012)
- D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th Edition, 2013.