1
Pipelining a CPU
or: How to Get a CPU to Seem to Go a Lot Faster Than the Underlying Circuits Are Actually Capable of
  • This PowerPoint animation illustrates the basic operation of a CPU before and after pipelining; the CPU is the MIPS CPU illustrated in Appendix A of our CS470 textbook (Hennessy and Patterson)
  • To control the animation, hit either the Enter key or the spacebar, or click the mouse, to advance to the next step; hit p or backspace to back up to the previous step

2
Overview
  • Overview of the MIPS CPU architecture
  • Instruction set architecture
  • Hardware architecture
  • MIPS CPU operations before pipelining
  • The MIPS CPU architecture and operations after
    pipelining
  • Complications: some pipeline hazards and their solutions
  • Summary and Conclusions

3
Relevant Features of the MIPS CPU's Instruction Set Architecture
  • It is a register-to-register architecture, a.k.a. load/store: only the 2 instructions Load and Store can access data memory
  • There are 32 general purpose, software addressable registers; we'll designate the i-th general purpose register as Ri
  • There are 64 basic instructions

4
MIPS Instruction Format
Since there are 32 general purpose registers, it
takes 5 bits to specify each register
  • All instructions are fixed length (4 bytes)
  • There are three instruction types (formats)
  • For a J-type, both the destination and one operand are implicit (the program counter)
  • The other operand is an immediate, with a larger set of bits than the immediate in an I-type instruction, which must devote 10 of its bits to the explicit specification of an operand register and a destination register as well

The 6 opcode bits specify not only what operation
is to be performed but the instruction type
(format) itself, which controls how the other bit
fields in the instruction are to be extracted,
decoded, and then used to control the source and
destination operands for the desired operation
  • 64 possible instructions require 6 bits per instruction to specify a unique operations code for each one (64 = 2^6, right?)
  • For the MIPS architecture, the opcode is stored
    in bit positions 0 through 5

For an R-type instruction like R3=R5-R9, both operands are general purpose registers, as is the destination that stores the result, so the R-type instruction format must encode 3 register designations (and let's just ignore the ALU function for now ;-)
For an I-type instruction, one operand is still a general purpose register, as is the destination, but the other operand is an immediate, a set of bits in the instruction itself
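The field layout just described can be made concrete with a small Python sketch. The bit numbering follows the slides (bit 0 is the leftmost, most significant bit), and the field names rs/rt/rd are assumptions used only for illustration:

    def bits(word, lo, hi):
        # Extract bit positions lo..hi of a 32-bit word, numbering bit 0
        # as the leftmost (most significant) bit, as the slides do.
        width = hi - lo + 1
        shift = 32 - (hi + 1)
        return (word >> shift) & ((1 << width) - 1)

    def decode(ir):
        # Split a 32-bit instruction word into the fields described above.
        return {
            "opcode": bits(ir, 0, 5),    # also selects the instruction format
            "rs":     bits(ir, 6, 10),   # first source register
            "rt":     bits(ir, 11, 15),  # second source (R-type) or destination (I-type)
            "rd":     bits(ir, 16, 20),  # destination register (R-type)
            "funct":  bits(ir, 21, 31),  # ALU function bits (R-type)
            "imm16":  bits(ir, 16, 31),  # 16-bit immediate (I-type)
            "imm26":  bits(ir, 6, 31),   # 26-bit immediate (J-type)
        }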
5
The CPU's Architecture Before Pipelining
[Datapath diagram: PC, NPC, IR, A, B, Imm, ALUoutput, cond, and LMD registers, with instruction memory, the general purpose registers, the ALU, and data memory (load, store, or no-op)]
Here's an Arithmetic-Logic Unit that performs the required calculations or manipulations
  • Because of the tremendous disparity in speed between slow but cheap main memories and a modern CPU, the MIPS CPU includes a smaller but higher speed (and hence more expensive) instruction cache capable of delivering an instruction word to the Instruction Register (IR) in 1 clock cycle
  • The CPU as shown could actually work without such a cache, but it would be much slower
  • The pipelined version really wouldn't work at all without the cache, so we'll show it here so as to minimize changes in the architecture diagram after we pipeline it

Here's the set of 32 software-addressable, general purpose registers
Since the ALU manipulates at most two operands, it can't use all of its potential data sources on every instruction, so we have to put them temporarily in special purpose registers like this one
and use multiplexers to select the correct source of data for any given instruction
  • The speed disparity between main memory and the
    CPU also dictates we cache data as well as
    instructions
  • Let's leave until later in this course our discussion of the reason for having two separate caches (one for data, one for instructions, also called a split cache or Harvard architecture) rather than a single or unified cache containing both

Here's a Program Counter to hold the address of the instruction to be executed
Here's an Instruction Register to hold the instruction itself
6
Operation of the CPU Before Pipelining
  • The CPU is synchronous; an instruction takes 5 cycles:
  • Instruction fetch
  • Instruction decode and register fetch
  • Execution or address calculation
  • Memory access
  • Write back
  • Let's look at the details of the control and data flow during each cycle

7
Cycle 1: Instruction Fetch
  • Meanwhile, since the instruction length is 4 bytes, most of the time we can use a very small, special purpose adder to prepare the address for the next instruction fetch by just adding 4 to the current PC
  • But occasionally the current instruction will call for a jump of some sort, so instead of sending our new PC+4 value directly back into the PC, we'll just gate it into a multiplexer where we can later select a different value (from ALUoutput) to be sent back to the PC instead, if need be

  • After the instruction fetch, the Instruction
    Register (IR) holds the current instruction
  • The various bit fields of the IR control all
    subsequent processing of this instruction by the
    CPU

At the start of the instruction fetch cycle, both the Program Counter (PC) and the NPC hold the address of the instruction we want to execute, so we'll go ahead and send that address to the instruction memory to start the retrieval (the NPC isn't used until later)
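A minimal Python sketch of this fetch cycle, assuming a dictionary keyed by byte address stands in for the instruction cache:

    def instruction_fetch(pc, instruction_memory):
        # Read the instruction at PC into the IR; the small dedicated adder
        # computes PC+4 into NPC in parallel (the jump/branch case is handled
        # later, by the multiplexer in front of the PC).
        ir = instruction_memory[pc]
        npc = pc + 4
        return ir, npc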
8
Cycle 2: Instruction Decode and Register Fetch
For register fetch, the IR[6..10] and IR[11..15] bits (containing the 5 and the 1 for R3=R5-R1) control which general purpose registers are gated into the A and B registers
Since the ALU operates on 32 bit operands, an
immediate argument from the IR must be extracted,
left justified, and then arithmetically right
shifted for sign extension to fill out to a full
32 bits wide before being sent to Imm as a
possible input to the ALU (depending on the
instruction type)
At the end of this cycle, all 4 of the possible
ALU operands (two general purpose registers, the
NPC containing the address of the current
instruction, and the immediate bits from the IR
itself) have been fetched into special purpose
registers, ready for 2 of them to actually be
selected by the ALUinput multiplexers on the next
cycle
At the start of the instruction decode and
register fetch cycle, the various bit fields
needed to control the functional units of the CPU
are extracted and decoded
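The sign extension of the 16-bit immediate described above can be sketched in Python; the mask-based version below is equivalent to the left-justify-then-arithmetic-right-shift trick the slide mentions:

    def sign_extend16(imm16):
        # Widen a 16-bit immediate to 32 bits, replicating its sign bit.
        if imm16 & 0x8000:                # sign bit set: a negative value
            return imm16 | 0xFFFF0000     # fill the upper 16 bits with ones
        return imm16                      # non-negative: upper bits stay zero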
9
Cycle 3: Execution or Address Calculation
  • IR[0..5] and IR[21..31] together specify the arithmetic or logical operation to be performed by the ALU
  • Note that only an R-type instruction (as determined by IR[0..5]) actually needs to look at IR[21..31]; for the simpler I and J types, the operation is specified by IR[0..5] alone

  • Depending on the type of the current instruction, ALUoutput could ultimately be
  • Stored in a general purpose register, e.g., R3=R5-R1
  • Used as the data memory address for a load or store operation, e.g., LW R1, 8(R2)
  • Sent to the PC to be used as the address of the next instruction, e.g., JUMP -3592
  • The ALUoutput is actually sent to all those places, although only one will actually use it, depending, of course, on the instruction type
  • There are two multiplexers that, under control of the opcode bits IR[0..5], select which of the various possible input sources are actually sent to the ALU (a short sketch of this selection follows at the end of this slide)
  • The upper ALUinput multiplexer selects either NPC or register A as one input, depending on whether or not the instruction is a branch, for which a target address must be calculated based on the value of the NPC (which contains the address of the current instruction being executed)
  • The lower ALUinput multiplexer controls whether register B or an immediate operand is sent to be the other ALU input, depending on whether or not the opcode IR[0..5] designates an R-type instruction

  • We get two outputs from the ALU
  • The result of the specified operation is placed in ALUoutput
  • The condition register is set to either true or false; it's used later, during write back, to control whether a conditional branch is actually taken, e.g., branch if non-zero, based on the comparison the ALU just performed

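A sketch of the ALUinput selection described above, with the decoded opcode type ('R', 'I', or 'branch') passed in as a plain string purely for illustration (the real control is combinational logic driven by IR[0..5]):

    def select_alu_inputs(opcode_type, A, B, NPC, Imm):
        # Upper mux: a branch needs NPC to compute its target address;
        # everything else uses register A.
        upper = NPC if opcode_type == "branch" else A
        # Lower mux: an R-type uses register B; otherwise the immediate.
        lower = B if opcode_type == "R" else Imm
        return upper, lower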
10
Cycle 4: Memory Access
  • During memory access, data memory will do one of three things, depending on IR[0..5], the opcode of the instruction (sketched in code at the end of this slide)
  • For a store instruction, it will store the contents of special purpose register B into the address specified by the ALUoutput
  • For a load instruction, it will read from the address specified by ALUoutput and place the contents in the Load Memory Data register (LMD)
  • For any other instruction, it does nothing

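The three cases above can be sketched as follows, with data_memory assumed to be a word-addressable dictionary:

    def memory_access(opcode_type, alu_output, B, data_memory):
        # One MEM cycle: store B, load into LMD, or do nothing.
        lmd = None
        if opcode_type == "store":
            data_memory[alu_output] = B       # address came from the ALU
        elif opcode_type == "load":
            lmd = data_memory[alu_output]     # value lands in the LMD register
        return lmd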
11
Cycle 5: Write Back
The specific register to be written to is designated by the destination register bits from the IR, e.g., the 3 in R3=R5-R1, which is found in IR[11..15] for an I-type instruction or IR[16..20] for an R-type, the instruction type being obtained from IR[0..5] (a sketch of this selection appears at the end of this slide)
  • For an ordinary instruction, the address of the next instruction will just be the current PC value + 4 at port p2, but if the instruction being completed was a jump or branch, the ALU calculated an address for the next instruction and we may need to select the ALUoutput at port p1 here
  • The condition code set by the ALU actually controls which result gets written back to the PC and NPC

  • During Write Back, the type of the instruction
    determines whether it is the LMD or the ALUoutput
    that is written into some general purpose
    register
  • A Load instruction selects the LMD
  • An R- or I-type instruction selects ALUoutput

After the write-back, instruction execution is
complete
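A sketch of the write-back selection just described; regs is the general purpose register file, and fields is assumed to be the dictionary produced by a decoder like the sketch on the instruction-format slide (rt = IR[11..15], rd = IR[16..20]):

    def write_back(opcode_type, fields, alu_output, lmd, regs):
        # Pick the value (LMD vs. ALUoutput) and the destination field.
        if opcode_type == "load":
            regs[fields["rt"]] = lmd
        elif opcode_type == "I":
            regs[fields["rt"]] = alu_output
        elif opcode_type == "R":
            regs[fields["rd"]] = alu_output
        # J-type: nothing is written to a general purpose register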
12
Summary of the CPU Processing Before Pipelining
[Diagram: instructions i through i+4 passing through the 5 CPU stages]
  • Each instruction takes 5 cycles to work through the 5 stages of the CPU
  • So if instr. i through instr. i+4 are the 5 sequential instructions in memory shown above, it will take 25 cycles to complete their processing

13
Pipelining
  • Pipelining exploits the fact that the various
    functional units of the CPU were actually idle
    most of the time, e.g., the ALU was only active
    during 1 of the 5 cycles
  • A pipelined CPU overlaps the execution of several instructions simultaneously: during the same cycle, one stage of the CPU can be working on one phase of one instruction while another stage can be working on a different phase of a different instruction
  • Needless to say, pipelining adds complexity to
    the CPU

14
Overview of the CPU Processing After Pipelining
  • Before the pipelining, the CPU completed an
    instruction every 5 cycles
  • With the overlap in processing, the pipelined CPU can eventually, once the pipeline is full, emit a result (complete an instruction) every cycle
  • So once the pipeline is full, the CPU appears to be 5 times faster, despite the fact that each instruction still takes the same 5 cycles to work through the 5 stages of the CPU (see the cycle-count sketch below)

[Diagram: 5 sequential instructions in memory, i through i+4, overlapping in the pipeline]
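The cycle counts behind this claim are easy to check; a small, illustrative calculation:

    def total_cycles(n_instructions, n_stages=5, stalls=0):
        # Every instruction still takes n_stages cycles, but once the pipeline
        # is full one instruction completes per cycle (plus any stall cycles).
        unpipelined = n_instructions * n_stages
        pipelined = n_stages + (n_instructions - 1) + stalls
        return unpipelined, pipelined

    print(total_cycles(5))   # (25, 9): 25 cycles unpipelined vs. 9 pipelined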
15
Let's Look at the Details
We'll examine the operation of the pipelined CPU during a single cycle where it is processing the 5 instructions (i through i+4) shown below
[Diagram: instructions i through i+6 in memory]
  • Here's the configuration of the CPU stages at the start of the cycle
  • And here's what they will look like at the end

instr. i has been in the pipeline the longest and is on its 5th and last cycle; it will complete this coming cycle. instr. i+4 will enter the CPU for the first time this coming cycle
All instructions will have advanced one stage to the right; instr. i will have been completed and the PC will have been set so that instr. i+5 can be fetched on the next cycle
16
The Pipelined CPU
[Diagram: the pipelined datapath, with IF/ID, ID/EX, EX/MEM, and MEM/WB stage latches between the stages]
  • We'll name the stage latches after the two stages of the pipeline they sit between
  • Here we see that the IF/ID.NPC and IF/ID.IR are the only two registers needed to forward data between the instruction fetch and instruction decode stages

  • The functional units of a given stage are controlled by the IR for that stage and process data that comes from the stage latches to their left
  • So, for example, the execution/address calculation stage functional units are controlled by the ID/EX.IR, which for this example contains instr. i+2

Note that there's a separate IR for each stage to control that stage's functional units so that the stages can work on separate instructions during the same cycle
This is the configuration of the CPU at the start
of the cycle
17
The Pipelined CPU
  • As before, multiplexers select the inputs for a functional unit
  • Here, for example, for the Write Back of the results from instruction i into some general purpose register, if MEM/WB.IR[0..5] (the opcode for instruction i) designates a load instruction, p1 (containing MEM/WB.LMD) will be selected for Write Back; otherwise the MEM/WB.ALUoutput at p2 will be selected
  • After the Write Back of prior results (but still within the same cycle), IF/ID.IR[6..10] and IF/ID.IR[11..15] identify the registers to be fetched for instruction i+3 (for use in its execution phase, during the next cycle)
  • Note that these bits come from the IF/ID.IR, not the MEM/WB.IR that controlled the write back for instruction i
  • The target register for the write back is determined by the type of the instruction, i.e., the opcode in MEM/WB.IR[0..5]
  • If it's an I-type instruction, the destination register is specified in MEM/WB.IR[11..15]
  • If it's R-type, the destination register is specified in MEM/WB.IR[16..20]
  • Otherwise (J-type), no write back to a general purpose register is required
  • At the end of the cycle
  • The CPU has totally completed its processing of instruction i
  • Instructions i+1 through i+4 have each advanced one stage to the right
  • The CPU is ready to start the next cycle, including the fetch of instruction i+5

To complete the cycle, all current results are latched into the appropriate pipeline registers to set the stage for the next cycle
At the start of each cycle, all the contents of all the latches are gated out onto data and control lines to set up all subsequent processing for that cycle
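The end-of-cycle latching just described can be sketched as one simultaneous update of all the stage latches; the dictionary names simply follow the slide's IF/ID, ID/EX, EX/MEM, and MEM/WB labels, and the contents listed in the comments are assumptions consistent with the diagram:

    def end_of_cycle_latch(stage_outputs, latches):
        # All latches are written together at the end of the cycle, so each
        # stage's results become the next stage's inputs on the next cycle.
        latches["IF/ID"]  = stage_outputs["IF"]    # IR, NPC
        latches["ID/EX"]  = stage_outputs["ID"]    # IR, NPC, A, B, Imm
        latches["EX/MEM"] = stage_outputs["EX"]    # IR, ALUoutput, cond, B
        latches["MEM/WB"] = stage_outputs["MEM"]   # IR, ALUoutput, LMD
        return latches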
18
One Cycle in the Life of the Pipelined CPU
  • Now let's see it again: one cycle of the pipelined CPU from start to finish, without pausing for the insightful, informative, lucid, and possibly even entertaining annotations that nonetheless interrupted us and hence distracted us from perceiving the overall flow of something that rather incredibly happens billions of times each second without error for years on end

19
One Cycle in the Life of the Pipelined CPU
  • At the end of the cycle
  • The CPU has totally completed its processing of instruction i
  • Instructions i+1 through i+4 have each moved one stage to the right
  • The CPU is ready to start the next cycle, including the fetch of instruction i+5

20
Summary of Pipelining So Far
  • There are two big sources of additional
    complexity
  • More special purpose registers (now called
    pipeline latches) are required than in the
    unpipelined CPU, since several of them must be
    replicated, some (like the IR) several times
  • The general purpose register set must now be
    twice as fast and its control sequencing more
    complicated since it must be able to do both a
    write-back and a fetch in the same cycle, one
    after the other (the general purpose register set
    is now like a 2-stage mini-pipeline in and of
    itself, in fact)
  • Although the pipelined CPU can now, once filled,
    complete an instruction every cycle, the cycle
    time itself may need to be increased slightly to
    accommodate delays through the extra stage latches

21
Complications
  • The design shown was deliberately over-simplified to show the basic concept of pipelined operations; it has several problems (a.k.a. hazards) typical of pipelined designs that we will discuss and fix over the next few weeks
  • The cost of these fixes, obviously, will be even further complexity in the form of more circuits to make the pipeline work efficiently, including
  • Forwarding logic to resolve hazards without introducing stalls
  • Interlocks for stall insertion for unavoidable hazards
  • Let's take a look at some of the hazards and the types of fixes the design will need

22
A RAW Hazard in the Pipeline
  • During its current register fetch cycle, instr. i+3 needs to fetch R3 and R5 into ID/EX.A and ID/EX.B so that it can send them to the ALU for multiplication on the next cycle, when it (instr. i+3) has advanced into execution
  • But the R3 value being fetched now is not correct: the R2-R7 value we want in R3 is still being computed by instr. i+2 in the execution stage and has not yet been written back into R3

[Diagram: the pipelined datapath during this cycle; write back for instr. i, memory access for instr. i+1, execution or address calc. for instr. i+2, instruction decode/register fetch for instr. i+3, instruction fetch for instr. i+4]
  • R3 here is involved in what is called a Read-After-Write (or RAW) data hazard, where the acronym reflects the order of operations desired but not obtained
  • More complicated pipelines can also be subject to WAR and WAW hazards
Suppose our instruction sequence includes the following instructions: R3=R2-R7 as instruction i+2, and R6=R3*R5 as instruction i+3
23
Shortcut Logic (a.k.a. Forwarding) Can Resolve
This RAW Hazard
The control circuits for the expanded, lower ALUinput multiplexer now need to allow it to identify the hazard and select the ALUoutput rather than special purpose register B when the hazard exists: if ID/EX.IR[0..5] encodes an I-type opcode, select p1; else if (ID/EX.IR[0..5] encodes an R-type opcode) and (ID/EX.IR[6..10] = EX/MEM.IR[16..20] or ID/EX.IR[11..15] = EX/MEM.IR[16..20]), select p3 (the hazard case); else select p2 (a sketch of this control appears at the end of this slide)
Note that at the start of the next cycle, after instr. i+3 has moved into the execution stage as shown below, the computation of the R2-R7 value it needs will already have been completed by the ALU for instr. i+2 during its execution phase on the previous cycle; that value just hasn't been stored where we need it yet
The solution to the hazard is to add an
additional port to the lower ALUinput multiplexer
and connect EX/MEM.ALUoutput to it for selection
in place of ID/EX.B whenever the hazard condition
occurs
Here's the R2-R7 value we want, although it's not yet even written back into R3, much less already fetched into ID/EX.B where instr. i+3 needs it to be now
  • The value here is not the R2-R7 value we need
  • If we use this value, our results will be incorrect; that's why this condition is called a hazard

[Diagram: the execution cycle of R6=R3*R5, with EX/MEM.ALUoutput forwarded to the new port p3 of the lower ALUinput multiplexer]
Now, when all the pipeline registers are gated out at the start of the cycle, EX/MEM.ALUoutput is forwarded to the new port p3 of the lower ALU multiplexer where it is available for selection by instr. i+3 during its execution cycle
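The mux-control pseudocode from this slide, transcribed into a Python sketch; id_ex and ex_mem are assumed to be dictionaries holding the already-decoded fields of ID/EX.IR and EX/MEM.IR (type, rs = IR[6..10], rt = IR[11..15], rd = IR[16..20]):

    def lower_alu_mux_select(id_ex, ex_mem):
        if id_ex["type"] == "I":
            return "p1"                      # immediate operand
        if id_ex["type"] == "R" and (
            id_ex["rs"] == ex_mem["rd"] or id_ex["rt"] == ex_mem["rd"]
        ):
            return "p3"                      # RAW hazard: forward EX/MEM.ALUoutput
        return "p2"                          # normal case: ID/EX.B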
24
Shortcuts Can Solve Many Problems But
  • The previous slide showed the complexity incurred
    by one shortcut to resolve one hazard
  • The multiplexer needs an extra port, which in
    turn requires a new data path to the new port
  • The control logic for the multiplexer gets more
    complicated, too, requiring yet more bits to be
    sent to it
  • Our simple pipeline has several other hazards
  • The good news is that many of them can be solved
    by shortcuts similar to the one we just saw
  • The bad news is twofold
  • We're starting to add quite a bit of additional circuitry to the CPU
  • We may have to increase our cycle time a bit more
  • Our manufacturing cost/chip is rising since our yield is dropping with the increased area of each chip
  • There are still some hazards that can't be resolved this way at all

25
For Irresolvable Hazards, Part of the Pipeline
Must Be Stalled
instr. i+2, the load instruction LW R1, 8(R2) that will ultimately load R1 with the correct value, is still in address calculation, using the ALU to calculate the address R2+8 to send to data memory during its memory access cycle (next CPU cycle)
  • During its decode and register fetch cycle, the R4=R1-R5 instruction needs to fetch R1 and gate it into ID/EX.A
  • But the correct content for R1 is not there yet; it's still in data memory

  • In contrast to the last hazard we saw, even after instr. i+3 moves into its execution phase on the next CPU cycle, the correct data value will still not be present anywhere in the CPU (it will be en route from data memory to the MEM/WB.LMD), so forwarding won't help
  • Inescapable conclusion: R4=R1-R5 must not be allowed to proceed; the front part of the pipeline must be stalled, its instructions prevented from advancing to the next stage to the right for the next cycle

[Diagram: write back for instr. i, memory access for instr. i+1, execution or address calc. for instr. i+2, instruction decode/register fetch for instr. i+3, instruction fetch for instr. i+4]
Suppose our instruction sequence includes the following instructions: LW R1, 8(R2) as instr. i+2 (load R1 from memory address R2+8), and R4=R1-R5 as instr. i+3
26
Overview of the Stall
[Diagram: instructions i through i+4 in the 5 CPU stages]
  • It is instr. i+3 that we wish to keep from advancing; but if we can't let instr. i+3 advance, we have to hold up instr. i+4 as well, since there will be no place for it to advance to
  • instr. i+2, instr. i+1, and instr. i must all be allowed to proceed normally, however, so that instr. i+2 will eventually read the correct data from data memory and instr. i+3 can then proceed
  • In general, whenever we stall an instruction that's not ready to move on, we must also stall all subsequent instructions while allowing all preceding instructions to progress normally, so the hazard will eventually be cleared and it will be safe to let the stalled instruction move on again
  • Each stage that is inactive during a given cycle can be viewed as a stall bubble proceeding through the pipeline in place of a real instruction
  • Although it looks like we need a two cycle stall (two bubbles) here, we can cut that back to one by simply adding a forwarding path from MEM/WB.LMD to the correct ALU input multiplexer so that we don't actually have to wait for the write back of the correct value into R1
  • But we do still have to stall for one cycle
  • Although we held R4=R1-R5 in place after the last cycle, we still can't let it advance after doing its register fetch on this current cycle, since it will still be fetching a hazardous value: LW R1, 8(R2) still hasn't finished loading the correct value into R1 yet; it's still loading the value we want from memory into the MEM/WB.LMD, and it won't be written back into R1 until the next cycle

Now we can let R4=R1-R5 fetch on this cycle and advance normally on the next, since the write back of MEM/WB.LMD into R1 will occur at the start of this current cycle just before R1 is fetched into ID/EX.A
  • We can't let R4=R1-R5 move into execution after this cycle, since the value it is about to fetch from R1 is erroneous: LW R1, 8(R2) is still computing the R2+8 address it will send to the data memory on the next cycle to start reading the value we want for R1; the actual read itself hasn't even started yet

27
Pipeline Interlocks Are Required to Insert the
Stall Bubble
  • We'll need four new multiplexers in the front-end stage latches themselves
  • To stall instructions i+4 and i+3, the PC, IF/ID.NPC, and IF/ID.IR registers must recycle their current contents in place to ensure their normal execution can resume properly once the hazard is cleared, when instr. i+2 completes its data memory access and writes the needed value into the MEM/WB.LMD
  • Additionally, the ID/EX.IR must be set to all zeroes (no-operation) to insert the stall bubble into the next cycle/stage, since the execution stage will have nothing to do on the cycle after instr. i+3 is stalled in place in register fetch
  • All four interlock multiplexers have the same control logic: if the ID/EX.IR indicates a load instruction whose destination register is the same as a source register for the instruction in the IF/ID.IR, select port p1 (stall); else select p2 (normal) (this condition is sketched in code below the diagram)

[Diagram: the four interlock multiplexers (ports p1/p2) in front of the PC, IF/ID.NPC, IF/ID.IR, and ID/EX.IR, with a zero (no-op) input available to the ID/EX.IR mux; EX/MEM.ALUoutput and EX/MEM.cond are also shown]
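The interlock condition from the bullet above, as a Python sketch; id_ex and if_id are assumed dictionaries of decoded IR fields, with a load's destination carried in rt:

    def must_stall(id_ex, if_id):
        # Stall (select p1 on all four interlock muxes) when the instruction
        # in ID/EX is a load whose destination matches a source register of
        # the instruction in IF/ID; otherwise select p2 (normal operation).
        return (
            id_ex["type"] == "load"
            and id_ex["rt"] in (if_id["rs"], if_id["rt"])
        )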
28
Stalling the Front End
  • The stall insertion is complete: PC, IF/ID.IR, and IF/ID.NPC have been recycled and ID/EX.IR is set to no-op
  • Since ID/EX.IR is now all zeroes, not only will the execution stage not do anything on the next cycle, but the zero bits sent from ID/EX.IR at the start of the next cycle to control the interlock multiplexers for that cycle will ensure that the control logic for those interlocks does not detect a hazard for the new cycle
  • The interlocks will then select their normal operations port, thus unlocking the pipeline and allowing instructions i+4 and i+3 to proceed normally, having stalled for only the one cycle necessary to resolve the hazard

At this point, the PC, IF/ID.IR, and IF/ID.NPC are set to recycle: the contents about to be gated in are the same as they were at the start of the cycle, so instr. i+4 and instr. i+3 will not advance in the pipeline
Here's how instr. i+3 and instr. i+4 get recycled and a stall bubble (no-op) inserted into the pipeline's execution stage in place of the stalled instr. i+3
Up to this point, everything has been proceeding normally, but since the opcode for instr. i+2 specifies a load and its destination register matches one of the source registers for instr. i+3, all the interlocks will now select port p1, which will stall the first 3 stages of the pipeline on the next cycle
We won't show the execution/address calculation, memory access, or write back stages' functional units this cycle since they're all operating normally, which we've already seen
Since the ID/EX.IR controls the execution stage and it's about to be set to all zeroes (the code for no operation) for the next cycle, it actually doesn't matter what's about to be gated into the other ID/EX registers at the end of this cycle; the no-op on the next cycle means that they won't be used anyway, so we won't bother to show them here
Since we don't care about any of the ID/EX registers except the IR, the instruction decode/register fetch stage functional units that feed those other ID/EX registers also don't matter in this animation; although they're not really stalled, their outputs are just going to be ignored anyway, so to keep the animation as simple as possible, we won't show their processing this cycle either
OK, enough caveats; here we go ;-)
Since instr. i+3 is being recycled and will not move into the execution stage, the execution stage will have nothing to do on the next cycle, so a no-op is about to be gated into the ID/EX.IR so that the execution stage functional units, controlled as they are by the ID/EX.IR, will in fact do nothing
29
A Stall Bubble Has Been Inserted
  • Here's the stall bubble
  • Normal operations of the CPU will eliminate it in 3 more cycles
[Diagram: pipeline snapshots showing the no-op stall bubble among instructions i+1 through i+8]
  • After three cycles, the stall bubble has been expelled, the pipeline is full, and operations continue normally
  • Note that although it took 3 cycles to clear the bubble, in only one of those 3 cycles (the last one) did the CPU not actually emit a result (i.e., complete an instruction)

30
Next Complication: The PC, the NPC, and Branch Hazards
  • Close examination of the normal advance of the PC into the NPC in the previous animation reveals that it is unfortunately incorrect
  • The reason actually has nothing to do with the interlocks but resulted from the earlier introduction of the extra pipeline registers for the NPC themselves (the NPC logic was correct in the un-pipelined design)
  • And things are really going to go to worms when we add jump and branch instructions rather than just the simple sequential execution we've been looking at
  • But I'll leave animating the fixes for those issues for another year; read the textbook ;-)

31
Summary
  • Despite the fact that the underlying circuitry is
    no faster than before, an n-stage pipeline
    emitting 1 result per cycle gives the appearance
    of an n-fold speedup
  • In actuality, the speedup is less than that for
    two reasons
  • The cycle time itself may need to increase to
    allow for extra propagation delays through the
    new circuits (e.g., stage latches)
  • Some cycles will not see an instruction completed
    and emitted
  • For a pipeline with n stages, n-1 cycles of pipeline latency are required to fill the pipeline before instructions start to complete; jumps and branches will cause this penalty to occur repeatedly
  • There will almost always be hazards that forwarding can't cure, so the front end of the pipeline will have to be stalled; each stall bubble inserted eventually leads to a cycle where no instruction completes (a rough speedup model is sketched below)
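A rough, purely illustrative model of those two losses (the stretched cycle time and the cycles that complete nothing); the numbers plugged in below are made up:

    def effective_speedup(n_stages, cycle_stretch=1.0, stall_fraction=0.0):
        # Ideal n-fold speedup, reduced by a longer cycle (cycle_stretch > 1)
        # and by the fraction of cycles in which no instruction completes.
        return n_stages * (1.0 - stall_fraction) / cycle_stretch

    print(effective_speedup(5, cycle_stretch=1.1, stall_fraction=0.15))  # ~3.9x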