1
Pipelining a CPU
or: How to Get a CPU to Seem to Go a Lot Faster Than the Underlying Circuits Are Actually Capable of
  • This PowerPoint animation illustrates the basic operation of a CPU before and after pipelining; the CPU is the MIPS CPU illustrated in Appendix A of our CS470 textbook (Hennessy and Patterson)
  • To control the animation, hit either the Enter key or the spacebar, or click the mouse, to advance to the next step; hit p or backspace to back up to the previous step

2
Overview
  • Overview of the MIPS CPU architecture
  • Instruction set architecture
  • Hardware architecture
  • MIPS CPU operations before pipelining
  • The MIPS CPU architecture and operations after
    pipelining
  • Complications: some pipeline hazards and their solutions
  • Summary and Conclusions

3
Relevant Features of the MIPS CPU's Instruction Set Architecture
  • It is a register-to-register architecture, a.k.a. load/store: only the 2 instructions Load and Store can access data memory
  • There are 32 general purpose, software addressable registers; we'll designate the i-th general purpose register as Ri
  • There are 64 basic instructions

4
MIPS Instruction Format
Since there are 32 general purpose registers, it
takes 5 bits to specify each register
  • All instructions are fixed length (4 bytes)
  • There are three instruction types (formats)
  • For a J-type, both the destination and one operand are implicit (the program counter)
  • The other operand is an immediate, with a larger set of bits than the immediate in an I-type instruction, which must devote 10 of its bits to the explicit specification of an operand register and a destination register as well

The 6 opcode bits specify not only what operation
is to be performed but the instruction type
(format) itself, which controls how the other bit
fields in the instruction are to be extracted,
decoded, and then used to control the source and
destination operands for the desired operation
  • 64 possible instructions require 6 bits per instruction to specify a unique operations code for each one (64 = 2^6, right?)
  • For the MIPS architecture, the opcode is stored
    in bit positions 0 through 5

For an R-type instruction like R3=R5-R9, both operands are general purpose registers, as is the destination that stores the result, so the R-type instruction format must encode 3 register designations (and let's just ignore the ALU function for now ;-)
For an I-type instruction, one operand is still a general purpose register, as is the destination, but the other operand is an immediate, a set of bits in the instruction itself
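The field layout just described can be made concrete with a small Python sketch. The bit numbering follows the slides (bit 0 is the leftmost, most significant bit), and the field names rs/rt/rd are assumptions used only for illustration:

    def bits(word, lo, hi):
        # Extract bit positions lo..hi of a 32-bit word, numbering bit 0
        # as the leftmost (most significant) bit, as the slides do.
        width = hi - lo + 1
        shift = 32 - (hi + 1)
        return (word >> shift) & ((1 << width) - 1)

    def decode(ir):
        # Split a 32-bit instruction word into the fields described above.
        return {
            "opcode": bits(ir, 0, 5),    # also selects the instruction format
            "rs":     bits(ir, 6, 10),   # first source register
            "rt":     bits(ir, 11, 15),  # second source (R-type) or destination (I-type)
            "rd":     bits(ir, 16, 20),  # destination register (R-type)
            "funct":  bits(ir, 21, 31),  # ALU function bits (R-type)
            "imm16":  bits(ir, 16, 31),  # 16-bit immediate (I-type)
            "imm26":  bits(ir, 6, 31),   # 26-bit immediate (J-type)
        }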
5
The CPU's Architecture Before Pipelining
[Datapath diagram: PC, NPC, IR, A, B, Imm, ALUoutput, cond, and LMD registers, with instruction memory, the general purpose registers, the ALU, and data memory (load, store, or no-op)]
Here's an Arithmetic-Logic Unit that performs the required calculations or manipulations
  • Because of the tremendous disparity in speed between slow but cheap main memories and a modern CPU, the MIPS CPU includes a smaller but higher speed (and hence more expensive) instruction cache capable of delivering an instruction word to the Instruction Register (IR) in 1 clock cycle
  • The CPU as shown could actually work without such a cache, but it would be much slower
  • The pipelined version really wouldn't work at all without the cache, so we'll show it here so as to minimize changes in the architecture diagram after we pipeline it

Here's the set of 32 software-addressable, general purpose registers
Since the ALU manipulates at most two operands, it can't use all of its potential data sources on every instruction, so we have to put them temporarily in special purpose registers like this one
and use multiplexers to select the correct source of data for any given instruction
  • The speed disparity between main memory and the
    CPU also dictates we cache data as well as
    instructions
  • Let's leave until later in this course our discussion of the reason for having two separate caches (one for data, one for instructions, also called a split cache or Harvard architecture) rather than a single or unified cache containing both

Here's a Program Counter to hold the address of the instruction to be executed
Here's an Instruction Register to hold the instruction itself
6
Operation of the CPU Before Pipelining
  • The CPU is synchronous; an instruction takes 5 cycles:
  • Instruction fetch
  • Instruction decode and register fetch
  • Execution or address calculation
  • Memory access
  • Write back
  • Let's look at the details of the control and data flow during each cycle

7
Cycle 1: Instruction Fetch
  • Meanwhile, since the instruction length is 4 bytes, most of the time we can use a very small, special purpose adder to prepare the address for the next instruction fetch by just adding 4 to the current PC
  • But occasionally the current instruction will call for a jump of some sort, so instead of sending our new PC+4 value directly back into the PC, we'll just gate it into a multiplexer where we can later select a different value (from ALUoutput) to be sent back to the PC instead, if need be

  • After the instruction fetch, the Instruction
    Register (IR) holds the current instruction
  • The various bit fields of the IR control all
    subsequent processing of this instruction by the
    CPU

At the start of the instruction fetch cycle, both the Program Counter (PC) and the NPC hold the address of the instruction we want to execute, so we'll go ahead and send that address to the instruction memory to start the retrieval (the NPC isn't used until later)
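A minimal Python sketch of this fetch cycle, assuming a dictionary keyed by byte address stands in for the instruction cache:

    def instruction_fetch(pc, instruction_memory):
        # Read the instruction at PC into the IR; the small dedicated adder
        # computes PC+4 into NPC in parallel (the jump/branch case is handled
        # later, by the multiplexer in front of the PC).
        ir = instruction_memory[pc]
        npc = pc + 4
        return ir, npc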
8
Cycle 2: Instruction Decode and Register Fetch
For register fetch, the IR[6..10] and IR[11..15] bits (containing the 5 and the 1 for R3=R5-R1) control which general purpose registers are gated into the A and B registers
Since the ALU operates on 32 bit operands, an
immediate argument from the IR must be extracted,
left justified, and then arithmetically right
shifted for sign extension to fill out to a full
32 bits wide before being sent to Imm as a
possible input to the ALU (depending on the
instruction type)
At the end of this cycle, all 4 of the possible
ALU operands (two general purpose registers, the
NPC containing the address of the current
instruction, and the immediate bits from the IR
itself) have been fetched into special purpose
registers, ready for 2 of them to actually be
selected by the ALUinput multiplexers on the next
cycle
At the start of the instruction decode and
register fetch cycle, the various bit fields
needed to control the functional units of the CPU
are extracted and decoded
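The sign extension of the 16-bit immediate described above can be sketched in Python; the mask-based version below is equivalent to the left-justify-then-arithmetic-right-shift trick the slide mentions:

    def sign_extend16(imm16):
        # Widen a 16-bit immediate to 32 bits, replicating its sign bit.
        if imm16 & 0x8000:                # sign bit set: a negative value
            return imm16 | 0xFFFF0000     # fill the upper 16 bits with ones
        return imm16                      # non-negative: upper bits stay zero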
9
Cycle 3: Execution or Address Calculation
  • IR[0..5] and IR[21..31] together specify the arithmetic or logical operation to be performed by the ALU
  • Note that only an R-type instruction (as determined by IR[0..5]) actually needs to look at IR[21..31]; for the simpler I and J types, the operation is specified by IR[0..5] alone

  • Depending on the type of the current instruction, ALUoutput could ultimately be
  • Stored in a general purpose register, e.g., R3=R5-R1
  • Used as the data memory address for a load or store operation, e.g., LW R1, 8(R2)
  • Sent to the PC to be used as the address of the next instruction, e.g., JUMP -3592
  • The ALUoutput is actually sent to all those places, although only one will actually use it, depending, of course, on the instruction type
  • There are two multiplexers that, under control of the opcode bits IR[0..5], select which of the various possible input sources are actually sent to the ALU (a short sketch of this selection follows at the end of this slide)
  • The upper ALUinput multiplexer selects either NPC or register A as one input, depending on whether or not the instruction is a branch, for which a target address must be calculated based on the value of the NPC (which contains the address of the current instruction being executed)
  • The lower ALUinput multiplexer controls whether register B or an immediate operand is sent to be the other ALU input, depending on whether or not the opcode IR[0..5] designates an R-type instruction

  • We get two outputs from the ALU
  • The result of the specified operation is placed in ALUoutput
  • The condition register is set to either true or false; it's used later, during write back, to control whether a conditional branch is actually taken, e.g., branch if non-zero, based on the comparison the ALU just performed

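A sketch of the ALUinput selection described above, with the decoded opcode type ('R', 'I', or 'branch') passed in as a plain string purely for illustration (the real control is combinational logic driven by IR[0..5]):

    def select_alu_inputs(opcode_type, A, B, NPC, Imm):
        # Upper mux: a branch needs NPC to compute its target address;
        # everything else uses register A.
        upper = NPC if opcode_type == "branch" else A
        # Lower mux: an R-type uses register B; otherwise the immediate.
        lower = B if opcode_type == "R" else Imm
        return upper, lower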
10
Cycle 4: Memory Access
  • During memory access, data memory will do one of three things, depending on IR[0..5], the opcode of the instruction (sketched in code at the end of this slide)
  • For a store instruction, it will store the contents of special purpose register B into the address specified by the ALUoutput
  • For a load instruction, it will read from the address specified by ALUoutput and place the contents in the Load Memory Data register (LMD)
  • For any other instruction, it does nothing

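The three cases above can be sketched as follows, with data_memory assumed to be a word-addressable dictionary:

    def memory_access(opcode_type, alu_output, B, data_memory):
        # One MEM cycle: store B, load into LMD, or do nothing.
        lmd = None
        if opcode_type == "store":
            data_memory[alu_output] = B       # address came from the ALU
        elif opcode_type == "load":
            lmd = data_memory[alu_output]     # value lands in the LMD register
        return lmd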
11
Cycle 5: Write Back
The specific register to be written to is designated by the destination register bits from the IR, e.g., the 3 in R3=R5-R1, which is found in IR[11..15] for an I-type instruction or IR[16..20] for an R-type, the instruction type being obtained from IR[0..5] (a sketch of this selection appears at the end of this slide)
  • For an ordinary instruction, the address of the next instruction will just be the current PC value + 4 at port p2, but if the instruction being completed was a jump or branch, the ALU calculated an address for the next instruction and we may need to select the ALUoutput at port p1 here
  • The condition code set by the ALU actually controls which result gets written back to the PC and NPC

  • During Write Back, the type of the instruction
    determines whether it is the LMD or the ALUoutput
    that is written into some general purpose
    register
  • A Load instruction selects the LMD
  • An R- or I-type instruction selects ALUoutput

After the write-back, instruction execution is
complete
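A sketch of the write-back selection just described; regs is the general purpose register file, and fields is assumed to be the dictionary produced by a decoder like the sketch on the instruction-format slide (rt = IR[11..15], rd = IR[16..20]):

    def write_back(opcode_type, fields, alu_output, lmd, regs):
        # Pick the value (LMD vs. ALUoutput) and the destination field.
        if opcode_type == "load":
            regs[fields["rt"]] = lmd
        elif opcode_type == "I":
            regs[fields["rt"]] = alu_output
        elif opcode_type == "R":
            regs[fields["rd"]] = alu_output
        # J-type: nothing is written to a general purpose register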
12
Summary of the CPU Processing Before Pipelining
[Diagram: instructions i through i+4 passing through the 5 CPU stages]
  • Each instruction takes 5 cycles to work through the 5 stages of the CPU
  • So if instr. i through instr. i+4 are the 5 sequential instructions in memory shown above, it will take 25 cycles to complete their processing

13
Pipelining
  • Pipelining exploits the fact that the various
    functional units of the CPU were actually idle
    most of the time, e.g., the ALU was only active
    during 1 of the 5 cycles
  • A pipelined CPU overlaps the execution of several instructions simultaneously: during the same cycle, one stage of the CPU can be working on one phase of one instruction while another stage can be working on a different phase of a different instruction
  • Needless to say, pipelining adds complexity to
    the CPU

14
Overview of the CPU Processing After Pipelining
  • Before the pipelining, the CPU completed an
    instruction every 5 cycles
  • With the overlap in processing, the pipelined CPU can eventually, once the pipeline is full, emit a result (complete an instruction) every cycle
  • So once the pipeline is full, the CPU appears to be 5 times faster, despite the fact that each instruction still takes the same 5 cycles to work through the 5 stages of the CPU (see the cycle-count sketch below)

[Diagram: 5 sequential instructions in memory, i through i+4, overlapping in the pipeline]
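The cycle counts behind this claim are easy to check; a small, illustrative calculation:

    def total_cycles(n_instructions, n_stages=5, stalls=0):
        # Every instruction still takes n_stages cycles, but once the pipeline
        # is full one instruction completes per cycle (plus any stall cycles).
        unpipelined = n_instructions * n_stages
        pipelined = n_stages + (n_instructions - 1) + stalls
        return unpipelined, pipelined

    print(total_cycles(5))   # (25, 9): 25 cycles unpipelined vs. 9 pipelined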
15
Let's Look at the Details
We'll examine the operation of the pipelined CPU during a single cycle where it is processing the 5 instructions (i through i+4) shown below
[Diagram: instructions i through i+6 in memory]
  • Here's the configuration of the CPU stages at the start of the cycle
  • And here's what they will look like at the end

instr. i has been in the pipeline the longest and is on its 5th and last cycle; it will complete this coming cycle. instr. i+4 will enter the CPU for the first time this coming cycle
All instructions will have advanced one stage to the right; instr. i will have been completed and the PC will have been set so that instr. i+5 can be fetched on the next cycle
16
The Pipelined CPU
[Diagram: the pipelined datapath, with IF/ID, ID/EX, EX/MEM, and MEM/WB stage latches between the stages]
  • We'll name the stage latches after the two stages of the pipeline they sit between
  • Here we see that the IF/ID.NPC and IF/ID.IR are the only two registers needed to forward data between the instruction fetch and instruction decode stages

  • The functional units of a given stage are controlled by the IR for that stage and process data that comes from the stage latches to their left
  • So, for example, the execution/address calculation stage functional units are controlled by the ID/EX.IR, which for this example contains instr. i+2

Note that there's a separate IR for each stage to control that stage's functional units so that the stages can work on separate instructions during the same cycle
This is the configuration of the CPU at the start
of the cycle
17
The Pipelined CPU
  • As before, multiplexers select the inputs for a functional unit
  • Here, for example, for the Write Back of the results from instruction i into some general purpose register, if MEM/WB.IR[0..5] (the opcode for instruction i) designates a load instruction, p1 (containing MEM/WB.LMD) will be selected for Write Back; otherwise the MEM/WB.ALUoutput at p2 will be selected
  • After the Write Back of prior results (but still within the same cycle), IF/ID.IR[6..10] and IF/ID.IR[11..15] identify the registers to be fetched for instruction i+3 (for use in its execution phase, during the next cycle)
  • Note that these bits come from the IF/ID.IR, not the MEM/WB.IR that controlled the write back for instruction i
  • The target register for the write back is determined by the type of the instruction, i.e., the opcode in MEM/WB.IR[0..5]
  • If it's an I-type instruction, the destination register is specified in MEM/WB.IR[11..15]
  • If it's R-type, the destination register is specified in MEM/WB.IR[16..20]
  • Otherwise (J-type), no write back to a general purpose register is required
  • At the end of the cycle
  • The CPU has totally completed its processing of instruction i
  • Instructions i+1 through i+4 have each advanced one stage to the right
  • The CPU is ready to start the next cycle, including the fetch of instruction i+5

To complete the cycle, all current results are latched into the appropriate pipeline registers to set the stage for the next cycle
At the start of each cycle, all the contents of all the latches are gated out onto data and control lines to set up all subsequent processing for that cycle
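The end-of-cycle latching just described can be sketched as one simultaneous update of all the stage latches; the dictionary names simply follow the slide's IF/ID, ID/EX, EX/MEM, and MEM/WB labels, and the contents listed in the comments are assumptions consistent with the diagram:

    def end_of_cycle_latch(stage_outputs, latches):
        # All latches are written together at the end of the cycle, so each
        # stage's results become the next stage's inputs on the next cycle.
        latches["IF/ID"]  = stage_outputs["IF"]    # IR, NPC
        latches["ID/EX"]  = stage_outputs["ID"]    # IR, NPC, A, B, Imm
        latches["EX/MEM"] = stage_outputs["EX"]    # IR, ALUoutput, cond, B
        latches["MEM/WB"] = stage_outputs["MEM"]   # IR, ALUoutput, LMD
        return latches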
18
One Cycle in the Life of the Pipelined CPU
  • Now let's see it again: one cycle of the pipelined CPU from start to finish, without pausing for the insightful, informative, lucid, and possibly even entertaining annotations that nonetheless interrupted us and hence distracted us from perceiving the overall flow of something that rather incredibly happens billions of times each second without error for years on end

19
One Cycle in the Life of the Pipelined CPU
  • At the end of the cycle
  • The CPU has totally completed its processing of instruction i
  • Instructions i+1 through i+4 have each moved one stage to the right
  • The CPU is ready to start the next cycle, including the fetch of instruction i+5

20
Summary of Pipelining So Far
  • There are two big sources of additional
    complexity
  • More special purpose registers (now called
    pipeline latches) are required than in the
    unpipelined CPU, since several of them must be
    replicated, some (like the IR) several times
  • The general purpose register set must now be
    twice as fast and its control sequencing more
    complicated since it must be able to do both a
    write-back and a fetch in the same cycle, one
    after the other (the general purpose register set
    is now like a 2-stage mini-pipeline in and of
    itself, in fact)
  • Although the pipelined CPU can now, once filled,
    complete an instruction every cycle, the cycle
    time itself may need to be increased slightly to
    accommodate delays through the extra stage latches

21
Complications
  • The design shown was deliberately over-simplified to show the basic concept of pipelined operations; it has several problems (a.k.a. hazards) typical of pipelined designs that we will discuss and fix over the next few weeks
  • The cost of these fixes, obviously, will be even further complexity in the form of more circuits to make the pipeline work efficiently, including
  • Forwarding logic to resolve hazards without introducing stalls
  • Interlocks for stall insertion for unavoidable hazards
  • Let's take a look at some of the hazards and the types of fixes the design will need

22
A RAW Hazard in the Pipeline
  • During its current register fetch cycle, instr. i+3 needs to fetch R3 and R5 into ID/EX.A and ID/EX.B so that it can send them to the ALU for multiplication on the next cycle, when it (instr. i+3) has advanced into execution
  • But the R3 value being fetched now is not correct: the R2-R7 value we want in R3 is still being computed by instr. i+2 in the execution stage and has not yet been written back into R3

[Diagram: the pipelined datapath during this cycle; write back for instr. i, memory access for instr. i+1, execution or address calc. for instr. i+2, instruction decode/register fetch for instr. i+3, instruction fetch for instr. i+4]
  • R3 here is involved in what is called a Read-After-Write (or RAW) data hazard, where the acronym reflects the order of operations desired but not obtained
  • More complicated pipelines can also be subject to WAR and WAW hazards
Suppose our instruction sequence includes the following instructions: R3=R2-R7 as instruction i+2, and R6=R3*R5 as instruction i+3
23
Shortcut Logic (a.k.a. Forwarding) Can Resolve
This RAW Hazard
The control circuits for the expanded, lower ALUinput multiplexer now need to allow it to identify the hazard and select the ALUoutput rather than special purpose register B when the hazard exists: if ID/EX.IR[0..5] encodes an I-type opcode, select p1; else if (ID/EX.IR[0..5] encodes an R-type opcode) and (ID/EX.IR[6..10] = EX/MEM.IR[16..20] or ID/EX.IR[11..15] = EX/MEM.IR[16..20]), select p3 (the hazard case); else select p2 (a sketch of this control appears at the end of this slide)
Note that at the start of the next cycle, after instr. i+3 has moved into the execution stage as shown below, the computation of the R2-R7 value it needs will already have been completed by the ALU for instr. i+2 during its execution phase on the previous cycle; that value just hasn't been stored where we need it yet
The solution to the hazard is to add an
additional port to the lower ALUinput multiplexer
and connect EX/MEM.ALUoutput to it for selection
in place of ID/EX.B whenever the hazard condition
occurs
Here's the R2-R7 value we want, although it's not yet even written back into R3, much less already fetched into ID/EX.B where instr. i+3 needs it to be now
  • The value here is not the R2-R7 value we need
  • If we use this value, our results will be incorrect; that's why this condition is called a hazard

[Diagram: the execution cycle of R6=R3*R5, with EX/MEM.ALUoutput forwarded to the new port p3 of the lower ALUinput multiplexer]
Now, when all the pipeline registers are gated out at the start of the cycle, EX/MEM.ALUoutput is forwarded to the new port p3 of the lower ALU multiplexer where it is available for selection by instr. i+3 during its execution cycle
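The mux-control pseudocode from this slide, transcribed into a Python sketch; id_ex and ex_mem are assumed to be dictionaries holding the already-decoded fields of ID/EX.IR and EX/MEM.IR (type, rs = IR[6..10], rt = IR[11..15], rd = IR[16..20]):

    def lower_alu_mux_select(id_ex, ex_mem):
        if id_ex["type"] == "I":
            return "p1"                      # immediate operand
        if id_ex["type"] == "R" and (
            id_ex["rs"] == ex_mem["rd"] or id_ex["rt"] == ex_mem["rd"]
        ):
            return "p3"                      # RAW hazard: forward EX/MEM.ALUoutput
        return "p2"                          # normal case: ID/EX.B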
24
Shortcuts Can Solve Many Problems But
  • The previous slide showed the complexity incurred
    by one shortcut to resolve one hazard
  • The multiplexer needs an extra port, which in
    turn requires a new data path to the new port
  • The control logic for the multiplexer gets more
    complicated, too, requiring yet more bits to be
    sent to it
  • Our simple pipeline has several other hazards
  • The good news is that many of them can be solved
    by shortcuts similar to the one we just saw
  • The bad news is twofold
  • We're starting to add quite a bit of additional circuitry to the CPU
  • We may have to increase our cycle time a bit more
  • Our manufacturing cost/chip is rising since our yield is dropping with the increased area of each chip
  • There are still some hazards that can't be resolved this way at all

25
For Irresolvable Hazards, Part of the Pipeline
Must Be Stalled
instr. i+2, the load instruction LW R1, 8(R2) that will ultimately load R1 with the correct value, is still in address calculation, using the ALU to calculate the address R2+8 to send to data memory during its memory access cycle (next CPU cycle)
  • During its decode and register fetch cycle, the R4=R1-R5 instruction needs to fetch R1 and gate it into ID/EX.A
  • But the correct content for R1 is not there yet; it's still in data memory

  • In contrast to the last hazard we saw, even after instr. i+3 moves into its execution phase on the next CPU cycle, the correct data value will still not be present anywhere in the CPU (it will be en route from data memory to the MEM/WB.LMD), so forwarding won't help
  • Inescapable conclusion: R4=R1-R5 must not be allowed to proceed; the front part of the pipeline must be stalled, its instructions prevented from advancing to the next stage to the right for the next cycle

[Diagram: write back for instr. i, memory access for instr. i+1, execution or address calc. for instr. i+2, instruction decode/register fetch for instr. i+3, instruction fetch for instr. i+4]
Suppose our instruction sequence includes the following instructions: LW R1, 8(R2) as instr. i+2 (load R1 from memory address R2+8), and R4=R1-R5 as instr. i+3
26
Overview of the Stall
[Diagram: instructions i through i+4 in the 5 CPU stages]
  • It is instr. i+3 that we wish to keep from advancing; but if we can't let instr. i+3 advance, we have to hold up instr. i+4 as well, since there will be no place for it to advance to
  • instr. i+2, instr. i+1, and instr. i must all be allowed to proceed normally, however, so that instr. i+2 will eventually read the correct data from data memory and instr. i+3 can then proceed
  • In general, whenever we stall an instruction that's not ready to move on, we must also stall all subsequent instructions while allowing all preceding instructions to progress normally, so the hazard will eventually be cleared and it will be safe to let the stalled instruction move on again
  • Each stage that is inactive during a given cycle can be viewed as a stall bubble proceeding through the pipeline in place of a real instruction
  • Although it looks like we need a two cycle stall (two bubbles) here, we can cut that back to one by simply adding a forwarding path from MEM/WB.LMD to the correct ALU input multiplexer so that we don't actually have to wait for the write back of the correct value into R1
  • But we do still have to stall for one cycle
  • Although we held R4=R1-R5 in place after the last cycle, we still can't let it advance after doing its register fetch on this current cycle, since it will still be fetching a hazardous value: LW R1, 8(R2) still hasn't finished loading the correct value into R1 yet; it's still loading the value we want from memory into the MEM/WB.LMD, and it won't be written back into R1 until the next cycle

Now we can let R4=R1-R5 fetch on this cycle and advance normally on the next, since the write back of MEM/WB.LMD into R1 will occur at the start of this current cycle just before R1 is fetched into ID/EX.A
  • We can't let R4=R1-R5 move into execution after this cycle, since the value it is about to fetch from R1 is erroneous: LW R1, 8(R2) is still computing the R2+8 address it will send to the data memory on the next cycle to start reading the value we want for R1; the actual read itself hasn't even started yet

27
Pipeline Interlocks Are Required to Insert the
Stall Bubble
  • We'll need four new multiplexers in the front-end stage latches themselves
  • To stall instructions i+4 and i+3, the PC, IF/ID.NPC, and IF/ID.IR registers must recycle their current contents in place to ensure their normal execution can resume properly once the hazard is cleared, when instr. i+2 completes its data memory access and writes the needed value into the MEM/WB.LMD
  • Additionally, the ID/EX.IR must be set to all zeroes (no-operation) to insert the stall bubble into the next cycle/stage, since the execution stage will have nothing to do on the cycle after instr. i+3 is stalled in place in register fetch
  • All four interlock multiplexers have the same control logic: if the ID/EX.IR indicates a load instruction whose destination register is the same as a source register for the instruction in the IF/ID.IR, select port p1 (stall); else select p2 (normal) (this condition is sketched in code below the diagram)

[Diagram: the four interlock multiplexers (ports p1/p2) in front of the PC, IF/ID.NPC, IF/ID.IR, and ID/EX.IR, with a zero (no-op) input available to the ID/EX.IR mux; EX/MEM.ALUoutput and EX/MEM.cond are also shown]
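The interlock condition from the bullet above, as a Python sketch; id_ex and if_id are assumed dictionaries of decoded IR fields, with a load's destination carried in rt:

    def must_stall(id_ex, if_id):
        # Stall (select p1 on all four interlock muxes) when the instruction
        # in ID/EX is a load whose destination matches a source register of
        # the instruction in IF/ID; otherwise select p2 (normal operation).
        return (
            id_ex["type"] == "load"
            and id_ex["rt"] in (if_id["rs"], if_id["rt"])
        )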
28
Stalling the Front End
  • The stall insertion is complete: PC, IF/ID.IR, and IF/ID.NPC have been recycled and ID/EX.IR is set to no-op
  • Since ID/EX.IR is now all zeroes, not only will the execution stage not do anything on the next cycle, but the zero bits sent from ID/EX.IR at the start of the next cycle to control the interlock multiplexers for that cycle will ensure that the control logic for those interlocks does not detect a hazard for the new cycle
  • The interlocks will then select their normal operations port, thus unlocking the pipeline and allowing instructions i+4 and i+3 to proceed normally, having stalled for only the one cycle necessary to resolve the hazard

At this point, the PC, IF/ID.IR, and IF/ID.NPC are set to recycle: the contents about to be gated in are the same as they were at the start of the cycle, so instr. i+4 and instr. i+3 will not advance in the pipeline
Here's how instr. i+3 and instr. i+4 get recycled and a stall bubble (no-op) inserted into the pipeline's execution stage in place of the stalled instr. i+3
Up to this point, everything has been proceeding normally, but since the opcode for instr. i+2 specifies a load and its destination register matches one of the source registers for instr. i+3, all the interlocks will now select port p1, which will stall the first 3 stages of the pipeline on the next cycle
We won't show the execution/address calculation, memory access, or write back stages' functional units this cycle since they're all operating normally, which we've already seen
Since the ID/EX.IR controls the execution stage and it's about to be set to all zeroes (the code for no operation) for the next cycle, it actually doesn't matter what's about to be gated into the other ID/EX registers at the end of this cycle; the no-op on the next cycle means that they won't be used anyway, so we won't bother to show them here
Since we don't care about any of the ID/EX registers except the IR, the instruction decode/register fetch stage functional units that feed those other ID/EX registers also don't matter in this animation; although they're not really stalled, their outputs are just going to be ignored anyway, so to keep the animation as simple as possible, we won't show their processing this cycle either
OK, enough caveats; here we go ;-)
Since instr. i+3 is being recycled and will not move into the execution stage, the execution stage will have nothing to do on the next cycle, so a no-op is about to be gated into the ID/EX.IR so that the execution stage functional units, controlled as they are by the ID/EX.IR, will in fact do nothing
29
A Stall Bubble Has Been Inserted
  • Here's the stall bubble
  • Normal operations of the CPU will eliminate it in 3 more cycles
[Diagram: pipeline snapshots showing the no-op stall bubble among instructions i+1 through i+8]
  • After three cycles, the stall bubble has been expelled, the pipeline is full, and operations continue normally
  • Note that although it took 3 cycles to clear the bubble, in only one of those 3 cycles (the last one) did the CPU not actually emit a result (i.e., complete an instruction)

30
Next Complication: The PC, the NPC, and Branch Hazards
  • Close examination of the normal advance of the PC into the NPC in the previous animation reveals that it is unfortunately incorrect
  • The reason actually has nothing to do with the interlocks but resulted from the earlier introduction of the extra pipeline registers for the NPC themselves (the NPC logic was correct in the un-pipelined design)
  • And things are really going to go to worms when we add jump and branch instructions rather than just the simple sequential execution we've been looking at
  • But I'll leave animating the fixes for those issues for another year; read the textbook ;-)

31
Summary
  • Despite the fact that the underlying circuitry is
    no faster than before, an n-stage pipeline
    emitting 1 result per cycle gives the appearance
    of an n-fold speedup
  • In actuality, the speedup is less than that for
    two reasons
  • The cycle time itself may need to increase to
    allow for extra propagation delays through the
    new circuits (e.g., stage latches)
  • Some cycles will not see an instruction completed
    and emitted
  • For a pipeline with n stages, n-1 cycles of pipeline latency are required to fill the pipeline before instructions start to complete; jumps and branches will cause this penalty to occur repeatedly
  • There will almost always be hazards that forwarding can't cure, so the front end of the pipeline will have to be stalled; each stall bubble inserted eventually leads to a cycle where no instruction completes (a rough speedup model is sketched below)
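A rough, purely illustrative model of those two losses (the stretched cycle time and the cycles that complete nothing); the numbers plugged in below are made up:

    def effective_speedup(n_stages, cycle_stretch=1.0, stall_fraction=0.0):
        # Ideal n-fold speedup, reduced by a longer cycle (cycle_stretch > 1)
        # and by the fraction of cycles in which no instruction completes.
        return n_stages * (1.0 - stall_fraction) / cycle_stretch

    print(effective_speedup(5, cycle_stretch=1.1, stall_fraction=0.15))  # ~3.9x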