Computer Architecture and Parallel Computing, Lecture 2 - Pipelining

1
Computer Architecture and Parallel Computing, Lecture 2 - Pipelining
  • Peng Liu
  • Dept. of Info. Sci. & Elec. Engg.
  • Zhejiang University
  • liupeng_at_zju.edu.cn

May 12, 2014
2
  • Microcoding became less attractive as gap between
    RAM and ROM speeds reduced
  • Complex instruction sets difficult to pipeline,
    so difficult to increase performance as gate
    count grew
  • Iron Law explains architecture design space
  • Trade instructions/program, cycles/instruction,
    and time/cycle
  • Load-Store RISC ISAs designed for efficient
    pipelined implementations
  • Very similar to vertical microcode
  • Inspired by earlier Cray machines (more on these
    later)

3
An Ideal Pipeline
  • All objects go through the same stages
  • No sharing of resources between any two stages
  • Propagation delay through all pipeline stages is
    equal
  • The scheduling of an object entering the
    pipeline is not affected by the objects in other
    stages

These conditions generally hold for industrial
assembly lines, but instructions depend on each
other!
4
Pipelined MIPS
  • To pipeline MIPS
  • First build MIPS without pipelining with CPI = 1
  • Next, add pipeline registers to reduce cycle time
    while maintaining CPI = 1

5
Unpipelined Datapath for MIPS
[Figure: unpipelined MIPS datapath. Visible labels: PCSrc (br, rind, jabs, pc+4), RegWrite, two adders, constant 31]
6
Hardwired Control Table
Columns: Opcode, ExtSel, BSrc, OpSel, MemW, RegW, WBSrc, RegDst, PCSrc
Rows: ALU, ALUi, ALUiu, LW, SW, BEQZ (z=0), BEQZ (z=1), J, JAL, JR, JALR
[Table: the per-opcode entries were flattened in this transcript and cannot be realigned reliably. Values that appear: ExtSel sExt16 / uExt16; OpSel Func / Op / 0?; MemW and RegW yes / no]
BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
7
Pipelined Datapath
[Figure: pipelined datapath. Visible labels: 0x4, Add, Inst. Memory (addr, rdata), GPRs (we, rs1, rs2, rd1, rd2, ws, wd), Imm Ext, ALU, Data Memory (we, addr, rdata, wdata)]
Clock period can be reduced by dividing the
execution of an instruction into multiple
cycles:
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions
are pipelined
8
Iron Law of Processor Performance
  • Time/Program = (Instructions/Program) × (Cycles/Instruction)
    × (Time/Cycle)
  • Instructions per program depends on source code,
    compiler technology, and ISA
  • Cycles per instruction (CPI) depends upon the
    ISA and the microarchitecture
  • Time per cycle depends upon the microarchitecture
    and the base technology

Microarchitecture          CPI   cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
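The Iron Law can be tried out numerically. The sketch below is only an illustration: the CPI and cycle-time numbers are hypothetical values picked to match the qualitative table above (">1 / short", "1 / long", "1 / short"), not measurements from the lecture.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical parameters, chosen only to mirror the table's qualitative entries.
DESIGNS = {
    "microcoded":               {"cpi": 4.0, "cycle_time_ns": 1.0},  # CPI > 1, short cycle
    "single-cycle unpipelined": {"cpi": 1.0, "cycle_time_ns": 5.0},  # CPI = 1, long cycle
    "pipelined (ideal)":        {"cpi": 1.0, "cycle_time_ns": 1.0},  # CPI = 1, short cycle
}

if __name__ == "__main__":
    n = 1_000_000  # assumed instruction count, same program on every design
    for name, d in DESIGNS.items():
        t_ns = execution_time(n, d["cpi"], d["cycle_time_ns"])
        print(f"{name:26s} {t_ns / 1e6:.1f} ms")
```

With the instruction count held fixed, only the pipelined design gets both a low CPI and a short cycle, which is the point of the table.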
9
CPI Examples
[Figure: CPI examples drawn along a time axis]
10
Technology Assumptions
  • A small amount of very fast memory (caches)
    backed up by a large, slower memory
  • Fast ALU (at least for integers)
  • Multiported Register files (slower!)

Thus, the following timing assumption is
reasonable:
tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW
A 5-stage pipeline will be the focus of our
detailed design - some commercial designs have
over 30 pipeline stages to do an integer add!
11
5-Stage Pipelined Execution
time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
12
5-Stage Pipelined Execution: Resource Usage Diagram
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5
(rows are resources, columns are cycles)
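A resource usage diagram like the one above can be generated mechanically for an ideal pipeline. The following is a minimal sketch assuming no stalls or kills; the function names `resource_usage` and `format_diagram` are made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def resource_usage(num_instructions, stages=STAGES):
    """Return {stage: [occupant per cycle]} for an ideal pipeline (no stalls)."""
    total_cycles = num_instructions + len(stages) - 1
    usage = {s: [""] * total_cycles for s in stages}
    for i in range(num_instructions):        # instruction i enters IF in cycle i
        for k, stage in enumerate(stages):   # and occupies stage k in cycle i + k
            usage[stage][i + k] = f"I{i + 1}"
    return usage


def format_diagram(usage):
    """Render the usage table with one row per stage and one column per cycle."""
    cycles = len(next(iter(usage.values())))
    header = "time " + "".join(f"t{c:<4d}" for c in range(cycles))
    rows = [f"{stage:<5s}" + "".join(f"{x:<5s}" for x in row)
            for stage, row in usage.items()]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    print(format_diagram(resource_usage(5)))
```

Each instruction simply shifts one column to the right per stage, so five instructions finish in 5 + 5 - 1 = 9 cycles.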
13
Pipelined Execution: ALU Instructions
Not quite correct! We need an Instruction Reg
(IR) for each stage
14
Pipelined MIPS Datapath (without jumps)
[Figure: pipelined MIPS datapath. Visible labels: ALU, Data Memory, Imm Ext, MD1, MD2]
Control Points Need to Be Connected
15
Instructions interact with each other in pipeline
  • An instruction in the pipeline may need a
    resource being used by another instruction in the
    pipeline → structural hazard
  • An instruction may depend on something produced
    by an earlier instruction
  • Dependence may be for a data value → data
    hazard
  • Dependence may be for the next instruction's
    address → control hazard (branches, exceptions)

16
Resolving Structural Hazards
  • A structural hazard occurs when two instructions
    need the same hardware resource at the same time
  • Can resolve in hardware by stalling newer
    instruction till older instruction finished with
    resource
  • A structural hazard can always be avoided by
    adding more hardware to design
  • E.g., if two instructions both need a port to
    memory at same time, could avoid hazard by adding
    second port to memory
  • Our 5-stage pipe has no structural hazards by
    design
  • Thanks to MIPS ISA, which was designed for
    pipelining

17
Data Hazards
r1 ← ...
r4 ← r1 ...
...
r1 ← r0 + 10
r4 ← r1 + 17
...
r1 is stale. Oops!
18
Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available
by freezing earlier pipeline stages → interlocks
19
Feedback to Resolve Hazards
  • Later stages provide dependence information to
    earlier stages which can stall (or kill)
    instructions
  • Controlling a pipeline in this manner works
    provided the instruction at stage i+1 can
    complete without any interference from
    instructions in stages 1 to i
    (otherwise deadlocks may occur)

20
Interlocks to resolve Data Hazards
...
r1 ← r0 + 10
r4 ← r1 + 17
...
21
Stalled Stages and Pipeline Bubbles
time                   t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10    IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                             IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                 IF4  ID4  EX4  MA4  WB4
(I5)                                                      IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5
nop → pipeline bubble
22
Interlock Control Logic
Compare the source registers of the instruction
in the decode stage with the destination register
of the uncommitted instructions.
23
Interlock Control Logic (ignoring jumps & branches)
Should we always stall if the rs field matches
some rd?
not every instruction writes a register → we
not every instruction reads a register → re
24
Source Destination Registers
                                            source(s)   destination
ALU    rd ← (rs) func (rt)                  rs, rt      rd
ALUi   rt ← (rs) op imm                     rs          rt
LW     rt ← M[(rs) + imm]                   rs          rt
SW     M[(rs) + imm] ← (rt)                 rs, rt
BZ     cond (rs)
         true:  PC ← (PC) + imm             rs
         false: PC ← (PC) + 4               rs
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                      31
JR     PC ← (rs)                            rs
JALR   r31 ← (PC), PC ← (rs)                rs          31
25
Deriving the Stall Signal
Cdest:
  ws = Case opcode
         ALU           → rd
         ALUi, LW      → rt
         JAL, JALR     → R31
  we = Case opcode
         ALU, ALUi, LW → (ws ≠ 0)
         JAL, JALR     → on
         ...           → off
Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, BZ, JR, JALR → on
          J, JAL                          → off
  re2 = Case opcode
          ALU, SW → on
          ...     → off
Cstall:
  stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D
  (here "." denotes AND and "+" denotes OR)
This is not the full story!
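The stall equation above can be written out as a small executable sketch. This is not the lecture's hardware description, just one way to model it in Python; the `Instr` record and the use of `None` for an empty pipeline stage are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    opcode: str   # "ALU", "ALUi", "LW", "SW", "BZ", "J", "JAL", "JR" or "JALR"
    rs: int = 0
    rt: int = 0
    rd: int = 0


def ws(i):
    """Destination register selected by the opcode (Cdest)."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return 0


def we(i):
    """Write enable: writes to r0 are ignored."""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")


def re1(i):
    """Does the instruction read its rs field?"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")


def re2(i):
    """Does the instruction read its rt field?"""
    return i.opcode in ("ALU", "SW")


def stall(d, e, m, w):
    """Stall the decode-stage instruction d if it reads a register that an
    uncommitted instruction in E, M or W will write (None = empty stage)."""
    older = [x for x in (e, m, w) if x is not None]
    rs_hazard = any(d.rs == ws(x) and we(x) for x in older)
    rt_hazard = any(d.rt == ws(x) and we(x) for x in older)
    return (rs_hazard and re1(d)) or (rt_hazard and re2(d))


if __name__ == "__main__":
    i1 = Instr("ALUi", rs=0, rt=1)   # r1 <- r0 + 10
    i2 = Instr("ALUi", rs=1, rt=4)   # r4 <- r1 + 17
    print(stall(i2, i1, None, None))
```

As the slide warns, this covers register dependences only; jumps, branches and memory dependences are not modelled here.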
26
Hazards due to Loads & Stores
What if (r1)+7 = (r3)+5 ?
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
Is there any possible data hazard in this
instruction sequence?
27
Load Store Hazards
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
(r1)+7 = (r3)+5 → data hazard
However, the hazard is avoided because our memory
system completes writes in one cycle!
Load/Store hazards are sometimes resolved in
the pipeline and sometimes in the memory system
itself. More on this later in the course.
28
Resolving Data Hazards (2)
Strategy 2: Route data as soon as possible after
it is calculated to the earlier pipeline stage →
bypass
29
Bypassing
Each stall or kill introduces a bubble in the
pipeline → CPI > 1
A new datapath, i.e., a bypass, can get the data
from the output of the ALU to its input
30
Adding a Bypass
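When does this bypass help?
[Figure: datapath with a bypass from the ALU output to the ALU input; the three example cases on the slide are marked yes, no, no]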
31
The Bypass Signal: Deriving it from the Stall
Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
we = Case opcode
       ALU, ALUi, LW → (ws ≠ 0)
       JAL, JALR     → on
       ...           → off
ws = Case opcode
       ALU       → rd
       ALUi, LW  → rt
       JAL, JALR → R31
ASrc = (rsD = wsE).weE.re1D
Is this correct?
No, because only ALU and ALUi instructions can
benefit from this bypass
Split weE into two components: we-bypass, we-stall
32
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
               ALU, ALUi → (ws ≠ 0)
               ...       → off
we-stallE = Case opcodeE
              LW        → (ws ≠ 0)
              JAL, JALR → on
              ...       → off
ASrc = (rsD = wsE).we-bypassE.re1D
stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D
33
Fully Bypassed Datapath
Is there still a need for the stall signal ?
stall = (rsD = wsE).(opcodeE = LWE).(wsE ≠ 0).re1D
      + (rtD = wsE).(opcodeE = LWE).(wsE ≠ 0).re2D
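The bypass select and the remaining load-use stall can be sketched as two small predicates. This is an illustrative model only: the function names and the flat argument lists are assumptions, and only the ALU-output-to-A-input bypass from the earlier slide is represented.

```python
def we_bypass(opcode_e, ws_e):
    """we-bypassE: only ALU and ALUi results are available at the ALU output."""
    return opcode_e in ("ALU", "ALUi") and ws_e != 0


def a_src_bypass(rs_d, re1_d, opcode_e, ws_e):
    """ASrc = (rsD = wsE).we-bypassE.re1D: take operand A from the bypass path."""
    return rs_d == ws_e and we_bypass(opcode_e, ws_e) and re1_d


def load_use_stall(rs_d, rt_d, re1_d, re2_d, opcode_e, ws_e):
    """Stall needed even with full bypassing: a load in E feeds the
    instruction in D, and the loaded value does not exist yet."""
    load_in_e = opcode_e == "LW" and ws_e != 0
    return ((rs_d == ws_e and load_in_e and re1_d)
            or (rt_d == ws_e and load_in_e and re2_d))


if __name__ == "__main__":
    # r1 <- r0 + 10 in E, r4 <- r1 + 17 in D: bypass, no stall.
    print(a_src_bypass(1, True, "ALUi", 1), load_use_stall(1, 4, True, False, "ALUi", 1))
    # r1 <- M[...] in E, r4 <- r1 + 17 in D: no bypass possible, must stall.
    print(a_src_bypass(1, True, "LW", 1), load_use_stall(1, 4, True, False, "LW", 1))
```

The two predicates are mutually exclusive for a given producer in E: an ALU result is bypassed, a load result forces a one-cycle stall.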
34
Resolving Data Hazards (3)
Strategy 3: Speculate on the dependence. Two
cases:
Guessed correctly → do nothing
Guessed incorrectly → kill and restart
We'll later see examples of this approach in more
complex processors.
35
Control Hazards
  • What do we need to calculate next PC?
  • For Jumps: opcode, offset and PC
  • For Jump Register: opcode and register value
  • For Conditional Branches: opcode, PC, register
    (for condition), and offset
  • For all other instructions: opcode and PC
    (have to know it's not one of the above!)

36
Opcode Decoding Bubble (assuming no branch delay
slots for now)
time                   t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10    IF1  ID1  EX1  MA1  WB1
(I2) r3 ← (r2) + 17         IF2  IF2  ID2  EX2  MA2  WB2
(I3)                                  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                            IF4  IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   nop  I2   nop  I3   nop  I4
ID         I1   nop  I2   nop  I3   nop  I4
EX              I1   nop  I2   nop  I3   nop  I4
MA                   I1   nop  I2   nop  I3   nop  I4
WB                        I1   nop  I2   nop  I3   nop  I4
nop → pipeline bubble
37
Speculate next address is PC+4
[Figure: fetch stage with stall signal]
I1  096  ADD
I2  100  J 304
I3  104  ADD
I4  304  ADD
A jump instruction kills (not stalls) the
following instruction
How?
38
Pipelining Jumps
[Figure: fetch and decode stages for jumps. Visible labels: PCSrc (pc+4 / jabs / rind / br), stall, Jump?, 0x4, Add, PC, addr, inst, Inst Memory, IR registers (decode, E, M)]
To kill a fetched instruction -- Insert a mux
before IR
Any interaction between stall and jump?
IRSrcD = Case opcodeD
           J, JAL → nop
           ...    → IM
39
Jump Pipeline Diagrams
time             t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096 ADD     IF1  ID1  EX1  MA1  WB1
(I2) 100 J 304        IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD               IF3  nop  nop  nop  nop
(I4) 304 ADD                    IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5
nop → pipeline bubble
40
Pipelining Conditional Branches
[Figure: pipeline with a BEQZ? test in the execute stage]
Branch condition is not known until the execute
stage. What action should be taken in the decode
stage?
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108
I4  304  ADD
41
Pipelining Conditional Branches
[Figure: fetch and decode stages for conditional branches. Visible labels: PCSrc (pc+4 / jabs / rind / br), stall, 0x4, Add, PC, addr, inst, Inst Memory, IR registers (decode, E, M)]
If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not
  valid → stall signal is not valid
42
Pipelining Conditional Branches
[Figure: the same datapath with kill muxes added. Visible labels: stall, PCSrc (pc+4 / jabs / rind / br), BEQZ?, Jump?, IRSrcD, IRSrcE, nop, 0x4, Add, PC, addr, inst, Inst Memory, IR registers (decode, E, M)]
43
New Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
        . !((opcodeE = BEQZ).z + (opcodeE = BNEZ).!z)
Don't stall if the branch is taken. Why?
Instruction at the decode stage is invalid
44
Control Equations for PC and IR Muxes
PCSrc = Case opcodeE
          BEQZ.z, BNEZ.!z → br
          ...             → Case opcodeD
                              J, JAL   → jabs
                              JR, JALR → rind
                              ...      → pc+4
Give priority to the older instruction, i.e.,
execute-stage instruction over decode-stage
instruction
IRSrcD = Case opcodeE
           BEQZ.z, BNEZ.!z → nop
           ...             → Case opcodeD
                               J, JAL, JR, JALR → nop
                               ...              → IM
IRSrcE = Case opcodeE
           BEQZ.z, BNEZ.!z → nop
           ...             → stall.nop + !stall.IRD
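The three mux-control equations above translate directly into nested conditionals. The sketch below is a hedged Python rendering; the function names and the string encodings of the mux selections ("br", "jabs", "rind", "pc+4", "nop", "IM", "IRD") are choices made for this example.

```python
def branch_taken(opcode_e, z):
    """BEQZ.z or BNEZ.!z for the instruction in the execute stage."""
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)


def pc_src(opcode_e, z, opcode_d):
    """Next-PC select: the older (execute-stage) instruction has priority."""
    if branch_taken(opcode_e, z):
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"


def ir_src_d(opcode_e, z, opcode_d):
    """What enters the decode-stage IR: a nop kills the fetched instruction."""
    if branch_taken(opcode_e, z):
        return "nop"
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"
    return "IM"


def ir_src_e(opcode_e, z, stall):
    """What enters the execute-stage IR: nop on a taken branch or a stall."""
    if branch_taken(opcode_e, z):
        return "nop"
    return "nop" if stall else "IRD"


if __name__ == "__main__":
    # Taken BEQZ in E while a J sits in D: the branch wins and both slots are killed.
    print(pc_src("BEQZ", True, "J"), ir_src_d("BEQZ", True, "J"), ir_src_e("BEQZ", True, False))
```

Checking the taken-branch case first in every function is what implements the "give priority to the older instruction" rule.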
45
Branch Pipeline Diagrams (resolved in execute
stage)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096 ADD         IF1  ID1  EX1  MA1  WB1
(I2) 100 BEQZ +200        IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                   IF3  ID3  nop  nop  nop
(I4) 108                            IF4  nop  nop  nop  nop
(I5) 304 ADD                             IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5
nop → pipeline bubble
46
Reducing Branch Penalty (resolve in decode stage)
  • One pipeline bubble can be removed if an extra
    comparator is used in the Decode stage
  • But might elongate cycle time

[Figure: fetch and decode stages with the branch comparison moved into decode. Visible labels: PCSrc (pc+4 / jabs / rind / br), 0x4, Add, nop, PC, addr, inst, Inst Memory, IR (D)]
Pipeline diagram now same as for jumps
47
Branch Delay Slots (expose control hazard to
software)
  • Change the ISA semantics so that the instruction
    that follows a jump or branch is always executed
  • gives compiler the flexibility to put in a useful
    instruction where normally a pipeline bubble
    would have resulted.

I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
I4  304  ADD
Delay slot instruction executed regardless of
branch outcome
  • Other techniques include more advanced branch
    prediction, which can dramatically reduce the
    branch penalty... to come later

48
Branch Pipeline Diagrams (branch delay slot)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096 ADD         IF1  ID1  EX1  MA1  WB1
(I2) 100 BEQZ +200        IF2  ID2  EX2  MA2  WB2
(I3) 104 ADD                   IF3  ID3  EX3  MA3  WB3
(I4) 304 ADD                        IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4
ID         I1   I2   I3   I4
EX              I1   I2   I3   I4
MA                   I1   I2   I3   I4
WB                        I1   I2   I3   I4
49
Why an Instruction may not be dispatched every
cycle (CPI > 1)
  • Full bypassing may be too expensive to implement
  • typically all frequently used paths are provided
  • some infrequently used bypass paths may increase
    cycle time and counteract the benefit of reducing
    CPI
  • Loads have two-cycle latency
  • Instruction after load cannot use load result
  • MIPS-I ISA defined load delay slots, a
    software-visible pipeline hazard (compiler
    schedules independent instruction or inserts NOP
    to avoid hazard).
  • MIPS: Microprocessor without Interlocked Pipeline
    Stages
  • Removed in MIPS-II (pipeline interlocks added in
    hardware)
  • Conditional branches may cause bubbles
  • kill following instruction(s) if no delay slots

50
Iron Law with Software-Visible NOPs
Time/Program = (Instructions/Program) × (Cycles/Instruction)
× (Time/Cycle)
  • If software has to insert NOP instructions for
    hazard avoidance, instructions/program increases
  • average cycles/instruction decreases - doing
    nothing fast is easy!
  • But performance (time/program) worse or same as
    if hardware instead uses interlocks to avoid
    hazard
  • Hardware-generated interlocks (bubbles) don't
    change instructions/program, but only add to
    cycles/instruction
  • Hardware interlocks don't take space in
    instruction cache

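The trade-off between software NOPs and hardware interlocks can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that every hazard costs exactly one cycle in both schemes and that cache effects are ignored; the function names and the example counts are hypothetical.

```python
def time_per_program(instructions, cpi, cycle_time):
    """Iron Law: instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time


def compare_nops_vs_interlocks(useful_instructions, hazard_slots, cycle_time=1.0):
    """Return (nops, interlocks) dicts for the same program.

    Software NOPs: each hazard slot becomes an extra instruction, CPI stays 1.
    Hardware interlocks: instruction count is unchanged, each slot is a bubble.
    """
    total_cycles = useful_instructions + hazard_slots
    nops = {"instructions": total_cycles, "cpi": 1.0}
    interlocks = {"instructions": useful_instructions,
                  "cpi": total_cycles / useful_instructions}
    for d in (nops, interlocks):
        d["time"] = time_per_program(d["instructions"], d["cpi"], cycle_time)
    return nops, interlocks


if __name__ == "__main__":
    nops, interlocks = compare_nops_vs_interlocks(1000, 200)
    print("NOPs:      ", nops)
    print("interlocks:", interlocks)
```

Under these assumptions the NOP version shows a lower CPI yet takes the same time; once the larger instruction footprint is counted, it can only be the same or worse.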
51
Exceptions: altering the normal flow of control
[Figure: a program (instructions Ii-1, Ii, Ii+1) and an exception handler (instructions HI1, HI2, ..., HIn)]
An exception transfers control to special handler
code run in privileged mode. Exceptions are
usually unexpected or rare from program's point
of view.
52
Causes of Exceptions
Exception: an event that requests the attention
of the processor
  • Asynchronous: an external interrupt
  • input/output device service request
  • timer expiration
  • power disruptions, hardware failure
  • Synchronous: an internal exception (a.k.a. traps)
  • undefined opcode, privileged instruction
  • arithmetic overflow, FPU exception
  • misaligned memory access
  • virtual memory exceptions: page faults,
    TLB misses, protection violations
  • software exceptions: system calls, e.g., jumps
    into kernel

53
History of Exception Handling
  • First system with exceptions was Univac-I, 1951
  • Arithmetic overflow would either
  • 1. trigger the execution of a two-instruction
    fix-up routine at address 0, or
  • 2. at the programmer's option, cause the computer
    to stop
  • Later Univac 1103, 1955, modified to add external
    interrupts
  • Used to gather real-time wind tunnel data
  • First system with I/O interrupts was DYSEAC, 1954
  • Had two program counters, and I/O signal caused
    switch between two PCs
  • Also, first system with DMA (direct memory access
    by I/O device)

Courtesy Mark Smotherman
54
DYSEAC, first mobile computer!
Courtesy Mark Smotherman
55
Asynchronous Interrupts: invoking the interrupt
handler
  • An I/O device requests attention by asserting one
    of the prioritized interrupt request lines
  • When the processor decides to process the
    interrupt
  • It stops the current program at instruction Ii,
    completing all the instructions up to Ii-1 (a
    precise interrupt)
  • It saves the PC of instruction Ii in a special
    register (EPC)
  • It disables interrupts and transfers control to a
    designated interrupt handler running in the
    kernel mode

56
MIPS Interrupt Handler Code
  • Saves EPC before re-enabling interrupts to allow
    nested interrupts →
  • need an instruction to move EPC into GPRs
  • need a way to mask further interrupts at least
    until EPC can be saved
  • Needs to read a status register that indicates
    the cause of the interrupt
  • Uses a special indirect jump instruction RFE
    (return-from-exception) to resume user code,
    which:
  • enables interrupts
  • restores the processor to the user mode
  • restores hardware status and control state

57
Synchronous Exception
  • A synchronous exception is caused by a particular
    instruction
  • In general, the instruction cannot be completed
    and needs to be restarted after the exception has
    been handled
  • requires undoing the effect of one or more
    partially executed instructions
  • In the case of a system call trap, the
    instruction is considered to have been completed
  • syscall is a special jump instruction involving a
    change to privileged kernel mode
  • Handler resumes at instruction after system call

58
Exception Handling 5-Stage Pipeline
  • How to handle multiple simultaneous exceptions in
    different pipeline stages?
  • How and where to handle external asynchronous
    interrupts?

59
Exception Handling 5-Stage Pipeline
[Figure: 5-stage pipeline (Inst. Mem, Decode, Data Mem). Visible labels: PC address Exception, Illegal Opcode, Overflow, Data address Exceptions, Asynchronous Interrupts, Cause, EPC]
60
Exception Handling 5-Stage Pipeline
  • Hold exception flags in pipeline until commit
    point (M stage)
  • Exceptions in earlier pipe stages override later
    exceptions for a given instruction
  • Inject external interrupts at commit point
    (override others)
  • If exception at commit: update Cause and EPC
    registers, kill all stages, inject handler PC
    into fetch stage

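The commit-point rules above can be illustrated with a small model. This is a sketch under assumptions: the `InFlight` record, the stage list, the cause strings and the `handler_pc` argument are all invented for the example and are not part of the lecture's design.

```python
from dataclasses import dataclass, field

# Stages in which an exception flag can be raised, earliest first.
STAGE_ORDER = ["IF", "ID", "EX", "MA"]


@dataclass
class InFlight:
    pc: int
    # Exception flags carried down the pipeline: stage name -> cause.
    exceptions: dict = field(default_factory=dict)


def first_exception(instr):
    """Exceptions from earlier pipe stages override later ones for one instruction."""
    for stage in STAGE_ORDER:
        if stage in instr.exceptions:
            return instr.exceptions[stage]
    return None


def commit(instr, external_interrupt, handler_pc):
    """Decide what happens at the commit point.

    Returns (cause, epc, next_fetch_pc, kill_all_stages).
    An external interrupt is injected here and overrides other causes.
    """
    cause = "interrupt" if external_interrupt else first_exception(instr)
    if cause is None:
        return None, None, None, False
    # Update Cause and EPC, kill all stages, redirect fetch to the handler.
    return cause, instr.pc, handler_pc, True


if __name__ == "__main__":
    i = InFlight(pc=0x60, exceptions={"EX": "overflow"})
    print(commit(i, external_interrupt=False, handler_pc=0x8000))
```

Carrying the flags with the instruction and acting only at commit is what keeps exceptions precise: nothing younger has changed architectural state when the handler starts.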
61
Speculating on Exceptions
  • Prediction mechanism
  • Exceptions are rare, so simply predicting no
    exceptions is very accurate!
  • Check prediction mechanism
  • Exceptions detected at end of instruction
    execution pipeline, special hardware for various
    exception types
  • Recovery mechanism
  • Only write architectural state at commit point,
    so can throw away partially executed instructions
    after exception
  • Launch exception handler after flushing pipeline
  • Bypassing allows use of uncommitted instruction
    results by following instructions

62
Exception Pipeline Diagram
time                         t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096 ADD                 IF1  ID1  EX1  MA1  nop                       overflow!
(I2) 100 XOR                      IF2  ID2  EX2  nop  nop
(I3) 104 SUB                           IF3  ID3  nop  nop  nop
(I4) 108 ADD                                IF4  nop  nop  nop  nop
(I5) Exc. Handler code                           IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   nop  nop  nop  I5
WB                        nop  nop  nop  nop  I5
63
Acknowledgements
  • UCB material derived from course CS152
  • Harvard University material derived from course
    CS246

64
Readings
  • J. L. Hennessy and D. A. Patterson, Computer
    Architecture: A Quantitative Approach,
    5th Edition (2012)
  • D. A. Patterson and J. L. Hennessy, Computer
    Organization and Design: The Hardware/Software
    Interface, 4th Edition, 2013.