1
Computer Architecture and Parallel Computing
Lecture 2 - Pipelining

Peng Liu
Dept. of Info. Sci. & Elec. Engg.
Zhejiang University
liupeng_at_zju.edu.cn

May 12, 2014
2

Microcoding became less attractive as gap between
RAM and ROM speeds reduced
Complex instruction sets difficult to pipeline,
so difficult to increase performance as gate
count grew
Iron Law explains architecture design space
Trade instructions/program, cycles/instruction,
and time/cycle
Load-Store RISC ISAs designed for efficient
pipelined implementations
Very similar to vertical microcode
Inspired by earlier Cray machines (more on these
later)

3
An Ideal Pipeline

All objects go through the same stages
No sharing of resources between any two stages
Propagation delay through all pipeline stages is
equal
The scheduling of an object entering the
pipeline
is not affected by the objects in other stages

These conditions generally hold for industrial
assembly lines, but instructions depend on each
other!
4
Pipelined MIPS

To pipeline MIPS
First build MIPS without pipelining with CPI = 1
Next, add pipeline registers to reduce cycle time
while maintaining CPI = 1

5
Unpipelined Datapath for MIPS
[Figure: unpipelined MIPS datapath; labels: PCSrc (pc+4 / br / rind / jabs), RegWrite, two adders, constant 31]
6
Hardwired Control Table
Columns: Opcode, ExtSel, BSrc, OpSel, MemW, RegW, WBSrc, RegDst, PCSrc
Rows: ALU, ALUi, ALUiu, LW, SW, BEQZ (z=0), BEQZ (z=1), J, JAL, JR, JALR
[Table body: the individual cell values did not survive in the transcript]
BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
7
Pipelined Datapath
[Figure: pipelined datapath; labels: 0x4, Add, Inst. Memory (addr, rdata), GPRs (we, rs1, rs2, ws, wd, rd1, rd2), Imm Ext, ALU, Data Memory (we, addr, wdata, rdata)]
Clock period can be reduced by dividing the
execution of an instruction into multiple
cycles:
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$ (= $t_{DM}$, probably)
However, CPI will increase
unless instructions are pipelined
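As a rough illustration of the clock-period bound, the sketch below compares a single-cycle design (one clock covers all stages in series) with a design that does one stage per clock. The stage delays are made-up values chosen only for illustration; they are not taken from the lecture.

```python
# Hypothetical stage delays in nanoseconds (illustrative values only).
stage_delay_ns = {"IM": 2.0, "RF": 1.0, "ALU": 2.0, "DM": 2.5, "RW": 1.0}


def single_cycle_period(delays):
    """Single-cycle design: the clock must cover every stage in series."""
    return sum(delays.values())


def per_stage_period_bound(delays):
    """One stage per clock: the clock must exceed the slowest stage."""
    return max(delays.values())


print("single-cycle period (ns):", single_cycle_period(stage_delay_ns))
print("per-stage lower bound (ns):", per_stage_period_bound(stage_delay_ns))
```

With these assumed delays the slowest stage is the data memory, which matches the slide's remark that the bound is probably set by tDM.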
8
Iron Law of Processor Performance

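$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$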

Instructions per program depends on source code,
compiler technology, and ISA
Cycles per instruction (CPI) depends upon the
ISA and the microarchitecture
Time per cycle depends upon the microarchitecture
and the base technology

Microarchitecture          CPI   cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
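The Iron Law can be evaluated directly. The sketch below plugs in hypothetical CPI and cycle-time values that merely follow the qualitative table (CPI > 1 with a short cycle, CPI = 1 with a long cycle, CPI = 1 with a short cycle); the numbers themselves are assumptions, not measurements.

```python
def time_per_program(instructions, cpi, cycle_time_ns):
    """Iron Law: time = instructions x cycles/instruction x time/cycle."""
    return instructions * cpi * cycle_time_ns


INSTRUCTIONS = 1_000_000  # assumed program length

# (CPI, cycle time in ns) per microarchitecture; illustrative values only.
designs = {
    "microcoded": (4.0, 2.0),
    "single-cycle unpipelined": (1.0, 10.0),
    "pipelined": (1.0, 2.0),
}

times = {
    name: time_per_program(INSTRUCTIONS, cpi, cycle)
    for name, (cpi, cycle) in designs.items()
}

for name, t in times.items():
    print(f"{name}: {t / 1e6:.1f} ms")
```

Under these assumptions the pipelined design wins because it keeps both factors small at once, which is the point of the table.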
9
CPI Examples
[Figure: example instruction timelines drawn along a time axis]
10
Technology Assumptions

A small amount of very fast memory (caches)
backed up by a large, slower memory
Fast ALU (at least for integers)
Multiported Register files (slower!)

Thus, the following timing assumption is
reasonable:
$t_{IM} \approx t_{RF} \approx t_{ALU} \approx t_{DM} \approx t_{RW}$
A 5-stage pipeline will be the focus of our
detailed design - some commercial designs have
over 30 pipeline stages to do an integer add!
11
5-Stage Pipelined Execution
time          t0   t1   t2   t3   t4   t5   t6   t7   ...
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
12
5-Stage Pipelined Execution: Resource Usage Diagram
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5
(rows: resources)
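The resource usage diagram for an ideal (stall-free) 5-stage pipeline can be generated mechanically: instruction i occupies stage s in cycle i + s. The sketch below is a minimal illustration; the function name `resource_usage` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def resource_usage(num_instructions, num_cycles):
    """Return {stage: [occupant per cycle]} for a pipeline with no stalls."""
    table = {}
    for s, stage in enumerate(STAGES):
        row = []
        for t in range(num_cycles):
            i = t - s  # 0-based index of the instruction in this stage at cycle t
            row.append(f"I{i + 1}" if 0 <= i < num_instructions else "")
        table[stage] = row
    return table


usage = resource_usage(num_instructions=5, num_cycles=9)
for stage, row in usage.items():
    print(f"{stage:3}" + "".join(f"{cell:>4}" for cell in row))
```

Each row is the previous row shifted right by one cycle, so once the pipeline is full every stage is busy in every cycle.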
13
Pipelined Execution: ALU Instructions
Not quite correct! We need an Instruction Reg
(IR) for each stage
14
Pipelined MIPS Datapath (without jumps)
[Figure: pipelined MIPS datapath; labels: ALU, Data Memory, Imm Ext, MD1, MD2]
Control Points Need to Be Connected
15
Instructions interact with each other in pipeline

An instruction in the pipeline may need a
resource being used by another instruction in the
pipeline ⇒ structural hazard
An instruction may depend on something produced
by an earlier instruction
Dependence may be for a data value ⇒ data
hazard
Dependence may be for the next instruction's
address ⇒ control hazard (branches, exceptions)

16
Resolving Structural Hazards

A structural hazard occurs when two instructions
need the same hardware resource at the same time
Can resolve in hardware by stalling the newer
instruction until the older instruction has finished
with the resource
A structural hazard can always be avoided by
adding more hardware to design
E.g., if two instructions both need a port to
memory at same time, could avoid hazard by adding
second port to memory
Our 5-stage pipe has no structural hazards by
design
Thanks to MIPS ISA, which was designed for
pipelining

17
Data Hazards
...
r1 ← r0 + 10
r4 ← r1 + 17
...
r1 is stale. Oops!
18
Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available
by freezing earlier pipeline stages ⇒ interlocks
19
Feedback to Resolve Hazards

Later stages provide dependence information to
earlier stages which can stall (or kill)
instructions

Controlling a pipeline in this manner works
provided the instruction at stage i+1 can
complete without any interference from
instructions in stages 1 to i
(otherwise deadlocks may occur)

20
Interlocks to resolve Data Hazards
...
r1 ← r0 + 10
r4 ← r1 + 17
...
21
Stalled Stages and Pipeline Bubbles
time                 t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← (r0) + 10  IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17       IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                           IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                               IF4  ID4  EX4  MA4  WB4
(I5)                                                    IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5
nop ⇒ pipeline bubble
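The stall pattern in this diagram can be reproduced with a small timing model. The sketch below assumes a 5-stage pipeline with no bypassing, in-order issue, and an interlock that holds a dependent instruction in decode while its producer is in the execute, memory, or write-back stage; the function name `decode_cycles` and the tuple encoding of instructions are illustrative assumptions.

```python
def decode_cycles(instrs):
    """Return the cycle in which each instruction completes decode.

    instrs: list of (dest, sources) with register names, dest may be None.
    Assumes instruction 0 is fetched in cycle 0 and decoded in cycle 1.
    """
    done = []         # cycle in which each instruction leaves decode
    last_writer = {}  # register name -> index of its most recent writer
    for i, (dest, sources) in enumerate(instrs):
        # In-order pipeline: at best one cycle after the previous instruction.
        cycle = 1 if i == 0 else done[-1] + 1
        for src in sources:
            if src in last_writer:
                # Producer spends 3 cycles in EX, MA, WB after its decode.
                cycle = max(cycle, done[last_writer[src]] + 4)
        done.append(cycle)
        if dest is not None and dest != "r0":  # r0 is never really written
            last_writer[dest] = i
    return done


program = [
    ("r1", ["r0"]),  # I1: r1 <- (r0) + 10
    ("r4", ["r1"]),  # I2: r4 <- (r1) + 17
    (None, []),      # I3..I5: assumed independent of earlier results
    (None, []),
    (None, []),
]
print(decode_cycles(program))
```

For this program the dependent instruction leaves decode three cycles later than it otherwise would, which corresponds to the three bubbles in the execute row of the resource usage diagram.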
22
Interlock Control Logic
Compare the source registers of the instruction
in the decode stage with the destination register
of the uncommitted instructions.
23
Interlock Control Logic (ignoring jumps & branches)
Should we always stall if the rs field matches
some rd?
not every instruction writes a register ⇒ we
not every instruction reads a register ⇒ re
24
Source & Destination Registers
Instruction                                      source(s)  destination
ALU    rd ← (rs) func (rt)                       rs, rt     rd
ALUi   rt ← (rs) op imm                          rs         rt
LW     rt ← M[(rs) + imm]                        rs         rt
SW     M[(rs) + imm] ← (rt)                      rs, rt
BZ     cond (rs)  true:  PC ← (PC) + imm         rs
                  false: PC ← (PC) + 4           rs
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                          31
JR     PC ← (rs)                                 rs
JALR   r31 ← (PC), PC ← (rs)                     rs         31
25
Deriving the Stall Signal
C_dest
ws = Case opcode
       ALU           ⇒ rd
       ALUi, LW      ⇒ rt
       JAL, JALR     ⇒ R31
we = Case opcode
       ALU, ALUi, LW ⇒ (ws ≠ 0)
       JAL, JALR     ⇒ on
       ...           ⇒ off
C_re
re1 = Case opcode
       ALU, ALUi, LW, SW, BZ, JR, JALR ⇒ on
       J, JAL                          ⇒ off
re2 = Case opcode
       ALU, SW ⇒ on
       ...     ⇒ off
C_stall
stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D
This is not the full story!
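The stall equation can be read as executable logic. The sketch below is one possible transcription, assuming a simplified `Instr` record (a hypothetical type, not part of the lecture) and using `None` for a pipeline bubble; like the slide, it ignores jumps and branches in flight.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    opcode: str
    rs: int = 0
    rt: int = 0
    rd: int = 0


def ws(i):
    """Destination register specifier, or None if nothing is written."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None


def we(i):
    """Write enable: writes to register 0 are treated as no write."""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")


def re1(i):
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")


def re2(i):
    return i.opcode in ("ALU", "SW")


def stall(d, e, m, w):
    """d is the decode-stage instruction; e, m, w are later stages (or None)."""
    def pending_write(reg):
        return any(x is not None and we(x) and ws(x) == reg for x in (e, m, w))

    return (re1(d) and pending_write(d.rs)) or (re2(d) and pending_write(d.rt))


i1 = Instr("ALUi", rs=0, rt=1)  # r1 <- (r0) + 10
i2 = Instr("ALUi", rs=1, rt=4)  # r4 <- (r1) + 17
print(stall(i2, i1, None, None))
```

Here `i2` stalls while `i1` is anywhere in the execute, memory, or write-back stage, and proceeds once no uncommitted instruction writes r1.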
26
Hazards due to Loads & Stores
What if (r1)+7 = (r3)+5 ?
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
Is there any possible data hazard in this
instruction sequence?
27
Load & Store Hazards
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
(r1)+7 = (r3)+5 ⇒ data hazard
However, the hazard is avoided because our memory
system completes writes in one cycle!
Load/Store hazards are sometimes resolved in
the pipeline and sometimes in the memory system
itself. More on this later in the course.
28
Resolving Data Hazards (2)
Strategy 2: Route data as soon as possible after
it is calculated to the earlier pipeline stage ⇒
bypass
29
Bypassing
Each stall or kill introduces a bubble in the
pipeline ⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data
from the output of the ALU to its input
30
Adding a Bypass
When does this bypass help?
[Figure: three example instruction sequences, marked yes, no, no]
31
The Bypass Signal: Deriving it from the Stall
Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
we = Case opcode
       ALU, ALUi, LW ⇒ (ws ≠ 0)
       JAL, JALR     ⇒ on
       ...           ⇒ off
ws = Case opcode
       ALU       ⇒ rd
       ALUi, LW  ⇒ rt
       JAL, JALR ⇒ R31
ASrc = (rsD = wsE).weE.re1D
Is this correct?
No, because only ALU and ALUi instructions can
benefit from this bypass
Split weE into two components: we-bypass, we-stall
32
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
               ALU, ALUi ⇒ (ws ≠ 0)
               ...       ⇒ off
we-stallE = Case opcodeE
               LW        ⇒ (ws ≠ 0)
               JAL, JALR ⇒ on
               ...       ⇒ off
ASrc = (rsD = wsE).we-bypassE . re1D
stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D
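The split of the execute-stage write enable can be sketched as two predicates plus the ASrc select. The function names and the string return values ("bypass", "regfile") below are illustrative assumptions; only the Case tables come from the slide.

```python
def we_bypass(opcode_e, ws_e):
    """The execute-stage result can be bypassed only for ALU and ALUi."""
    return opcode_e in ("ALU", "ALUi") and ws_e != 0


def we_stall(opcode_e, ws_e):
    """Execute-stage writers whose result is not available for this bypass."""
    if opcode_e == "LW":
        return ws_e != 0
    return opcode_e in ("JAL", "JALR")


def a_src(rs_d, re1_d, opcode_e, ws_e):
    """Select the ALU A input: bypassed ALU output or the register file."""
    if re1_d and rs_d == ws_e and we_bypass(opcode_e, ws_e):
        return "bypass"
    return "regfile"


# r1 <- (r0) + 10 in execute, r4 <- (r1) + 17 in decode
print(a_src(rs_d=1, re1_d=True, opcode_e="ALUi", ws_e=1))
# a load of r1 in execute cannot use this bypass
print(a_src(rs_d=1, re1_d=True, opcode_e="LW", ws_e=1))
```

For any execute-stage instruction at most one of the two predicates is true, so a matching source register leads either to a bypass or to a stall, never both.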
33
Fully Bypassed Datapath
Is there still a need for the stall signal ?
stall = (rsD = wsE).(opcodeE = LWE).(wsE ≠ 0).re1D
      + (rtD = wsE).(opcodeE = LWE).(wsE ≠ 0).re2D
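With full bypassing the only remaining stall is the load-use case. A minimal sketch of that condition follows; the function name and argument names are assumptions chosen to mirror the signal names in the equation.

```python
def load_use_stall(rs_d, rt_d, re1_d, re2_d, opcode_e, ws_e):
    """Stall only when decode reads a register that a load in execute writes."""
    load_in_execute = opcode_e == "LW" and ws_e != 0
    reads_loaded_reg = (re1_d and rs_d == ws_e) or (re2_d and rt_d == ws_e)
    return load_in_execute and reads_loaded_reg


# LW writes r1 in execute; the decode-stage instruction reads r1 via rs
print(load_use_stall(rs_d=1, rt_d=0, re1_d=True, re2_d=False,
                     opcode_e="LW", ws_e=1))
```

An ALU instruction in the execute stage never triggers this stall, because its result reaches the dependent instruction through the bypass.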
34
Resolving Data Hazards (3)
Strategy 3: Speculate on the dependence.
Two cases:
Guessed correctly ⇒ do nothing
Guessed incorrectly ⇒ kill and restart
We'll later see examples of this approach in more
complex processors.
35
Control Hazards

What do we need to calculate next PC?
For Jumps
Opcode, offset and PC
For Jump Register
Opcode and Register value
For Conditional Branches
Opcode, PC, Register (for condition), and offset
For all other instructions
Opcode and PC
have to know it's not one of above!

36
Opcode Decoding Bubble (assuming no branch delay
slots for now)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← (r0) + 10  IF1  ID1  EX1  MA1  WB1
(I2) r3 ← (r2) + 17       IF2  IF2  ID2  EX2  MA2  WB2
(I3)                                IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                          IF4  IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   nop  I2   nop  I3   nop  I4
ID         I1   nop  I2   nop  I3   nop  I4
EX              I1   nop  I2   nop  I3   nop  I4
MA                   I1   nop  I2   nop  I3   nop  I4
WB                        I1   nop  I2   nop  I3   nop  I4
nop ⇒ pipeline bubble
37
Speculate next address is PC+4
[Figure: fetch stage with stall signal]
I1  096  ADD
I2  100  J 304
I3  104  ADD
I4  304  ADD
A jump instruction kills (not stalls) the
following instruction
How?
38
Pipelining Jumps
PCSrc (pc+4 / jabs / rind / br)
[Figure: fetch stage; labels: stall, 0x4, Add, PC, addr, inst, Inst Memory, IR (decode, E and M stages), Jump?]
To kill a fetched instruction: insert a mux
before IR
Any interaction between stall and jump?
IRSrcD = Case opcodeD
           J, JAL ⇒ nop
           ...    ⇒ IM
39
Jump Pipeline Diagrams
time              t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD      IF1  ID1  EX1  MA1  WB1
(I2) 100 J 304         IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                IF3  nop  nop  nop  nop
(I4) 304 ADD                     IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5
nop ⇒ pipeline bubble
40
Pipelining Conditional Branches
Branch condition is not known until the execute
stage. What action should be taken in the decode
stage?
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108
I4  304  ADD
41
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br)
[Figure: fetch stage; labels: stall, 0x4, Add, PC, addr, inst, Inst Memory, IR (decode, E and M stages)]
If the branch is taken
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108
I4  304  ADD
42
Pipelining Conditional Branches
PCSrc (pc+4 / jabs / rind / br)
[Figure: fetch stage with kill muxes; labels: stall, BEQZ?, Jump?, IRSrcD, IRSrcE, nop, 0x4, Add, PC, addr, inst, Inst Memory, IR (decode, E and M stages)]
If the branch is taken
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108
I4  304  ADD
43
New Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
        . !((opcodeE = BEQZ).z + (opcodeE = BNEZ).!z)
Don't stall if the branch is taken. Why?
Instruction at the decode stage is invalid
44
Control Equations for PC and IR Muxes
PCSrc = Case opcodeE
          BEQZ.z, BNEZ.!z ⇒ br
          ...             ⇒ Case opcodeD
                               J, JAL   ⇒ jabs
                               JR, JALR ⇒ rind
                               ...      ⇒ pc+4
Give priority to the older instruction, i.e.,
execute-stage instruction over decode-stage
instruction
IRSrcD = Case opcodeE
           BEQZ.z, BNEZ.!z ⇒ nop
           ...             ⇒ Case opcodeD
                                J, JAL, JR, JALR ⇒ nop
                                ...              ⇒ IM
IRSrcE = Case opcodeE
           BEQZ.z, BNEZ.!z ⇒ nop
           ...             ⇒ stall.nop + !stall.IRD
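These mux control equations translate almost line for line into code. The sketch below is one such transcription; the function names and the string encodings of the mux selections are assumptions made for illustration.

```python
def branch_taken(opcode_e, z):
    """A branch in execute is taken for BEQZ with z set or BNEZ with z clear."""
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)


def pc_src(opcode_e, z, opcode_d):
    # The older (execute-stage) instruction has priority.
    if branch_taken(opcode_e, z):
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"


def ir_src_d(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):
        return "nop"
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"
    return "IM"


def ir_src_e(opcode_e, z, stall):
    if branch_taken(opcode_e, z):
        return "nop"
    return "nop" if stall else "IRD"


# Taken BEQZ in execute while a jump sits in decode: the branch wins.
print(pc_src("BEQZ", True, "J"), ir_src_d("BEQZ", True, "J"),
      ir_src_e("BEQZ", True, False))
```

A taken branch in execute turns both the fetched and the decoded instruction into nops, which is the two-instruction kill described for conditional branches.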
45
Branch Pipeline Diagrams (resolved in execute
stage)
time                t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100 BEQZ +200       IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                  IF3  ID3  nop  nop  nop
(I4) 108                           IF4  nop  nop  nop  nop
(I5) 304 ADD                            IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5
nop ⇒ pipeline bubble
46
Reducing Branch Penalty (resolve in decode stage)

One pipeline bubble can be removed if an extra
comparator is used in the Decode stage
But might elongate cycle time

PCSrc (pc+4 / jabs / rind / br)
[Figure: fetch and decode stages; labels: 0x4, Add, PC, addr, inst, Inst Memory, IR, nop, D]
Pipeline diagram now same as for jumps
47
Branch Delay Slots (expose control hazard to
software)

Change the ISA semantics so that the instruction
that follows a jump or branch is always executed
gives compiler the flexibility to put in a useful
instruction where normally a pipeline bubble
would have resulted.

I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD     (delay slot instruction, executed regardless of branch outcome)
I4  304  ADD

Other techniques include more advanced branch
prediction, which can dramatically reduce the
branch penalty... to come later

48
Branch Pipeline Diagrams (branch delay slot)
time                t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100 BEQZ +200       IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                  IF3  ID3  EX3  MA3  WB3
(I4) 304 ADD                       IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4
ID         I1   I2   I3   I4
EX              I1   I2   I3   I4
MA                   I1   I2   I3   I4
WB                        I1   I2   I3   I4
49
Why an Instruction may not be dispatched every
cycle (CPI > 1)

Full bypassing may be too expensive to implement
typically all frequently used paths are provided
some infrequently used bypass paths may increase
cycle time and counteract the benefit of reducing
CPI
Loads have two-cycle latency
Instruction after load cannot use load result
MIPS-I ISA defined load delay slots, a
software-visible pipeline hazard (compiler
schedules independent instruction or inserts NOP
to avoid hazard).
MIPS: Microprocessor without Interlocked Pipeline
Stages
Removed in MIPS-II (pipeline interlocks added in
hardware)
Conditional branches may cause bubbles
kill following instruction(s) if no delay slots
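The effect of these bubbles on CPI can be estimated by adding the expected number of bubbles per instruction to the ideal CPI of 1. The sketch below assumes one bubble per load-use stall and two per taken branch resolved in the execute stage; the instruction-mix fractions are invented for illustration.

```python
def average_cpi(events):
    """events: list of (fraction of instructions affected, bubbles each)."""
    return 1.0 + sum(fraction * bubbles for fraction, bubbles in events)


# Assumed mix: 25% loads, 40% of them followed by a dependent instruction;
# 15% conditional branches, 60% of them taken.
load_use = (0.25 * 0.40, 1)
taken_branch = (0.15 * 0.60, 2)

cpi = average_cpi([load_use, taken_branch])
print(f"estimated CPI: {cpi:.2f}")
```

Resolving branches in the decode stage would change the second entry to one bubble per taken branch, which shows how each technique from the earlier slides maps to a term in this sum.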

50
Iron Law with Software-Visible NOPs
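$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$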

If software has to insert NOP instructions for
hazard avoidance, instructions/program increases
average cycles/instruction decreases - doing
nothing fast is easy!
But performance (time/program) worse or same as
if hardware instead uses interlocks to avoid
hazard
Hardware-generated interlocks (bubbles) don't
change instructions/program, but only add to
cycles/instruction
Hardware interlocks don't take space in
instruction cache
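A small worked comparison makes the trade-off concrete. The sketch below assumes every hazard costs exactly one slot, whether it is filled by a software NOP or by a hardware bubble, and that the cycle time is the same in both designs; the counts are hypothetical.

```python
def software_nops(useful_instrs, hazard_slots, cycle_time_ns):
    """NOPs are real instructions: they raise the count and lower the CPI."""
    instrs = useful_instrs + hazard_slots
    cycles = instrs  # every instruction, NOP included, takes one cycle
    return instrs, cycles / instrs, cycles * cycle_time_ns


def hardware_interlocks(useful_instrs, hazard_slots, cycle_time_ns):
    """Bubbles are not instructions: the count stays, the CPI rises."""
    instrs = useful_instrs
    cycles = useful_instrs + hazard_slots
    return instrs, cycles / instrs, cycles * cycle_time_ns


sw = software_nops(1000, 200, 2.0)
hw = hardware_interlocks(1000, 200, 2.0)
print("software NOPs:       instrs=%d CPI=%.2f time=%.0f ns" % sw)
print("hardware interlocks: instrs=%d CPI=%.2f time=%.0f ns" % hw)
```

In this model the two designs take the same time; the software version only looks better on CPI, and its extra NOPs also occupy instruction cache space.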

51
Exceptions: altering the normal flow of control
[Figure: program instructions Ii-1, Ii, Ii+1; control passes from the program to the exception handler instructions HI1, HI2, ..., HIn]
An exception transfers control to special handler
code run in privileged mode. Exceptions are
usually unexpected or rare from the program's point
of view.
52
Causes of Exceptions
Exception: an event that requests the attention
of the processor

Asynchronous: an external interrupt
input/output device service request
timer expiration
power disruptions, hardware failure
Synchronous: an internal exception (a.k.a. traps)
undefined opcode, privileged instruction
arithmetic overflow, FPU exception
misaligned memory access
virtual memory exceptions: page faults,
TLB misses, protection violations
software exceptions: system calls, e.g., jumps
into kernel

53
History of Exception Handling

First system with exceptions was Univac-I, 1951
Arithmetic overflow would either
1. trigger the execution of a two-instruction fix-up
routine at address 0, or
2. at the programmer's option, cause the computer
to stop
Later Univac 1103, 1955, modified to add external
interrupts
Used to gather real-time wind tunnel data
First system with I/O interrupts was DYSEAC, 1954
Had two program counters, and I/O signal caused
switch between two PCs
Also, first system with DMA (direct memory access
by I/O device)

Courtesy Mark Smotherman
54
DYSEAC, first mobile computer!
Courtesy Mark Smotherman
55
Asynchronous Interrupts: invoking the interrupt
handler

An I/O device requests attention by asserting one
of the prioritized interrupt request lines
When the processor decides to process the
interrupt
It stops the current program at instruction Ii,
completing all the instructions up to Ii-1 (a
precise interrupt)
It saves the PC of instruction Ii in a special
register (EPC)
It disables interrupts and transfers control to a
designated interrupt handler running in the
kernel mode

56
MIPS Interrupt Handler Code

Saves EPC before re-enabling interrupts to allow
nested interrupts ⇒
need an instruction to move EPC into GPRs
need a way to mask further interrupts at least
until EPC can be saved
Needs to read a status register that indicates
the cause of the interrupt
Uses a special indirect jump instruction RFE
(return-from-exception) to resume user code;
this:
enables interrupts
restores the processor to the user mode
restores hardware status and control state

57
Synchronous Exception

A synchronous exception is caused by a particular
instruction
In general, the instruction cannot be completed
and needs to be restarted after the exception has
been handled
requires undoing the effect of one or more
partially executed instructions
In the case of a system call trap, the
instruction is considered to have been completed
syscall is a special jump instruction involving a
change to privileged kernel mode
Handler resumes at instruction after system call

58
Exception Handling 5-Stage Pipeline

How to handle multiple simultaneous exceptions in
different pipeline stages?
How and where to handle external asynchronous
interrupts?

59
Exception Handling 5-Stage Pipeline
[Figure: 5-stage pipeline (Inst. Mem, Decode, Data Mem) annotated with exception sources: PC address exception, illegal opcode, overflow, data address exceptions, asynchronous interrupts; Cause and EPC registers]
60
Exception Handling 5-Stage Pipeline

Hold exception flags in pipeline until commit
point (M stage)
Exceptions in earlier pipe stages override later
exceptions for a given instruction
Inject external interrupts at commit point
(override others)
If exception at commit: update Cause and EPC
registers, kill all stages, inject handler PC
into fetch stage
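The commit-point rules can be sketched as a single decision function. In the sketch below the stage names, the dictionary of exception flags, and the returned action record are all illustrative assumptions; only the priority rules (external interrupt first, then the earliest pipeline stage) come from the slide.

```python
# Pipeline order up to the commit point; earlier stages take priority.
STAGE_ORDER = ["fetch", "decode", "execute", "memory"]


def commit(pc, exception_flags, external_interrupt, handler_pc):
    """Decide what happens when an instruction reaches the commit point.

    exception_flags maps a stage name to the cause raised in that stage.
    Returns None for a normal commit, otherwise the actions to take.
    """
    cause = None
    if external_interrupt:
        cause = "interrupt"  # external interrupts override the others
    else:
        for stage in STAGE_ORDER:
            if stage in exception_flags:
                cause = exception_flags[stage]
                break
    if cause is None:
        return None
    return {
        "Cause": cause,
        "EPC": pc,
        "kill_all_stages": True,
        "next_fetch_pc": handler_pc,
    }


# An instruction that raised two exceptions: the earlier stage wins.
flags = {"memory": "data address", "decode": "illegal opcode"}
print(commit(pc=0x96, exception_flags=flags, external_interrupt=False,
             handler_pc=0x8000))
```

Carrying the flags to one commit point means exceptions are taken in program order even when a younger instruction detects its fault earlier in time.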

61
Speculating on Exceptions

Prediction mechanism
Exceptions are rare, so simply predicting no
exceptions is very accurate!
Check prediction mechanism
Exceptions detected at end of instruction
execution pipeline, special hardware for various
exception types
Recovery mechanism
Only write architectural state at commit point,
so can throw away partially executed instructions
after exception
Launch exception handler after flushing pipeline
Bypassing allows use of uncommitted instruction
results by following instructions

62
Exception Pipeline Diagram
time                    t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096 ADD            IF1  ID1  EX1  MA1  nop   overflow!
(I2) 100 XOR                 IF2  ID2  EX2  nop  nop
(I3) 104 SUB                      IF3  ID3  nop  nop  nop
(I4) 108 ADD                           IF4  nop  nop  nop  nop
(I5) Exc. Handler code                      IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   nop  nop  nop  I5
WB                        nop  nop  nop  nop  I5
63
Acknowledgements

UCB material derived from course CS152
Harvard University material derived from course
CS246

64
Readings

J. L. Hennessy and D. A. Patterson, Computer
Architecture: A Quantitative Approach,
5th Edition (2012)
D. A. Patterson and J. L. Hennessy, Computer
Organization and Design: The Hardware/Software
Interface, 4th Edition, 2013.
