Title: The Processor Data Path
1. The Processor: Data Path and Control, Chapter 5, Part 2 - Multi-Clock Cycle Design
2. A Multicycle Design
- As pointed out earlier, a single clock cycle design has a performance bottleneck: the instruction requiring the longest time determines the time for all other instructions, even simple instructions such as jump (j).
- The offending instruction is the load word memory instruction (lw). lw uses five functional units in series:
  - Instruction memory (fetch)
  - The register file (read)
  - The ALU (compute address)
  - Data memory (read)
  - The register file (write)
- Several instruction classes could fit into a shorter clock cycle, so overall performance is compromised: memory instructions may not be the most frequently used instructions, but they determine the timing for simpler instructions.
- In addition, a single cycle design increases hardware, because functional units cannot be multiplexed in time.
- See performance example on pp. 373-375.
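The bottleneck can be made concrete with a small calculation. The sketch below is illustrative only: the unit delays are hypothetical, and the lists of functional units used by instruction classes other than lw are assumptions made for the example, not figures from these slides.

```python
# Hypothetical functional unit delays in picoseconds (assumed for illustration).
UNIT_DELAY_PS = {
    "imem": 200,       # instruction memory (fetch)
    "reg_read": 50,    # register file (read)
    "alu": 100,        # ALU
    "dmem": 200,       # data memory
    "reg_write": 50,   # register file (write)
}

# Functional units used in series by each instruction class.
# The lw path is the one listed above; the other paths are assumptions.
PATHS = {
    "lw": ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "sw": ["imem", "reg_read", "alu", "dmem"],
    "r_type": ["imem", "reg_read", "alu", "reg_write"],
    "beq": ["imem", "reg_read", "alu"],
    "j": ["imem"],
}


def path_delay(instr):
    """Total delay of the units an instruction class uses in series."""
    return sum(UNIT_DELAY_PS[unit] for unit in PATHS[instr])


def single_cycle_period():
    """A single-cycle clock must be long enough for the slowest instruction."""
    return max(path_delay(instr) for instr in PATHS)


if __name__ == "__main__":
    for instr in PATHS:
        print(instr, path_delay(instr), "ps needed,", single_cycle_period(), "ps charged")
```

Under these assumed delays every instruction, including j, is charged the full lw path delay.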
3. Multiple Clock Cycle Design: Overview
- Allows a shorter clock cycle
  - The clock cycle is derived from the longest functional unit delay and not the longest total data path delay
- Multiple clock pulses per instruction
  - Use a clock pulse for each functional unit (memory, register file, ALU, ...): multiplex (share) in time, not in space (as with single clock)
- Instead of a single long clock pulse, use a sequence of short pulses. Average instruction time will be shorter: short instructions will not have to wait idle for the clock to time out.
- Another advantage: hardware is reduced.
  - Instructions and data are stored in the same memory
  - A single ALU does all the arithmetic
  - Time-shared functional units
- This approach is easily extended to pipelining, which allows multiple instructions to be executed at one time and which will further enhance performance (chapter 6).
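A rough way to estimate the average instruction time of the multicycle design is clock period times average cycles per instruction. The sketch below assumes the cycle counts implied by the phases described later in these notes (lw takes five phases, sw and R-type four, beq and j three); the unit delays and the instruction mix are hypothetical values chosen for illustration.

```python
# Clock cycles per instruction class, one cycle per phase.
CYCLES = {"lw": 5, "sw": 4, "r_type": 4, "beq": 3, "j": 3}

# Hypothetical delays (picoseconds) and instruction mix (percent): assumptions.
UNIT_DELAY_PS = {"memory": 200, "register_file": 150, "alu": 200}
MIX_PERCENT = {"lw": 25, "sw": 10, "r_type": 52, "beq": 11, "j": 2}


def multicycle_period(unit_delays_ps):
    """The multicycle clock is set by the slowest single functional unit."""
    return max(unit_delays_ps.values())


def average_cpi(mix_percent):
    """Average cycles per instruction, weighted by the instruction mix."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("instruction mix must sum to 100 percent")
    return sum(pct * CYCLES[instr] for instr, pct in mix_percent.items()) / 100


def average_instruction_time_ps(unit_delays_ps, mix_percent):
    return average_cpi(mix_percent) * multicycle_period(unit_delays_ps)


if __name__ == "__main__":
    print("clock period (ps):", multicycle_period(UNIT_DELAY_PS))
    print("average CPI:", average_cpi(MIX_PERCENT))
    print("average time (ps):", average_instruction_time_ps(UNIT_DELAY_PS, MIX_PERCENT))
```

The gain depends on how well balanced the unit delays are: the clock follows the slowest unit, so if one unit is much slower than the rest, faster units still wait for a full cycle and the advantage shrinks.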
4. Multiple Clock Cycle Design: Design Details
- We will use the single clock design as a starting point (fig. 5.29, p. 372).
- This design will be compressed into what you see in fig. 5.30, p. 378: a single memory and one ALU for all (see next).
  - Single memory unit for both instructions and data
  - Single ALU instead of an ALU and two adders
  - One or more buffer registers added after every major functional unit to hold its output until the next clock cycle
- Because functional units are now shared for various phases of execution, we must add MUXs and extend some existing MUXs. See fig. 5.31, page 380 (see later).
- A few additional hardware elements will also be added to resolve situations where stored data may change in one execution phase before a downstream phase gets to use it.
  - For now we add an instruction register (IR). It holds the instruction for later phases of execution, after the PC has been changed during the fetch stage and the memory has perhaps been reused during execution.
5. Multi-Clock Cycle Design: Design Details, High-Level View
Fig 5.30: Add register buffers between functional stages. A single memory holds instructions and data. One ALU does all the arithmetic.
6. Multi-Clock Cycle Design: Support for Basic Instructions
Fig 5.31
7. Multiple Clock Cycle Design: Design Details (continued)
- We will also need more control lines and must modify existing ones. Control is now sequential and dynamic:
  - Time and opcode dependent
  - Things start to get hairy when we design the controller ... but "Don't Panic!" as Douglas Adams would say.
- The first cut at the data path design with control lines shown is fig. 5.32, page 381. All storage elements will need a separate write signal, and a read signal is needed for memory. The old ALU controller from the single clock cycle design will be reused.
- See fig. 5.34, p. 384 for a complete summary of the control line functions.
8. Multi-Clock Cycle Design: Control Lines Shown
Fig 5.32: Add and extend MUXs for hardware sharing. Add control lines, which must be generated.
9. Multi-Clock Cycle Design: Control Units To Be Designed are Depicted
Fig 5.33: Shows the control units to be designed. Compare to fig. 5.29 for the single clock cycle design.
Figure annotations (comments are shown in blue on the slide):
- Write PC if (beq and ALU result is 0), or j inst, or PC+4
- PC+4
- op
- -> beq address
- rs, rt, rd
- For sw or lw
- -> for beq
- sw -> addr
- reg -> data write
- -> inst funct field
- -> memory addr. for sw data write, or register data write for R inst.
10. Multi-Clock Cycle Design: Control Lines Defined, Part 1
Fig 5.34 part 1. Notes:
- lw uses rt as the destination
- For de-asserting RegWrite: the register file is read by default
- MemtoReg chooses between memory or ALU as a source
- IorD chooses between instruction fetch or data access addresses
- Although IRWrite causes the memory output to go to the IR, it also benignly goes to the MDR
- PCWrite changes the PC only for PC+4 or j instructions; it is de-asserted for beq
- PCWriteCond is for the beq instruction
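The notes on PCWrite, PCWriteCond and IorD can be read as simple combinational logic. The sketch below is one possible reading, with hypothetical function names; signals are modeled as the integers 0 and 1.

```python
def pc_write_enable(pc_write, pc_write_cond, alu_zero):
    """The PC is written when PCWrite is asserted (PC+4 or j),
    or when PCWriteCond is asserted and the ALU Zero output is set (beq)."""
    return pc_write | (pc_write_cond & alu_zero)


def memory_address(i_or_d, pc, alu_out):
    """IorD = 0 selects the PC (instruction fetch);
    IorD = 1 selects ALUOut (data access)."""
    return alu_out if i_or_d else pc


if __name__ == "__main__":
    # beq with unequal registers: PCWriteCond asserted, Zero clear, PC not written.
    print(pc_write_enable(pc_write=0, pc_write_cond=1, alu_zero=0))
    # Data access: address comes from ALUOut rather than the PC.
    print(hex(memory_address(i_or_d=1, pc=0x400000, alu_out=0x10008000)))
```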
11. Multi-Clock Cycle Design: Control Lines Defined, Part 2
Fig 5.34 part 2 (figure annotations: beq, j inst)
12. Multi-Clock Cycle Design: Splitting The Instruction Execution Into Clock Cycle Phases
- Goal: break up execution into phases in such a way as to balance the amount of work done in each phase.
- Each phase will be one clock cycle and correspond to one of the functional units in the single clock cycle design.
- Restrict each phase to contain at most one ALU operation, or one register file (or other register) access, or one memory access.
- All operations for one phase occur in parallel within one clock cycle.
- This strategy will also put us in a good position for pipelining later.
- NOTE: In the control signal descriptions which follow, we assume that if a control signal is not mentioned or assigned a value, it is de-asserted by default.
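The "de-asserted by default" convention can be sketched as a table of default values that each phase overrides. The signal names are the ones used in these notes; the helper name control_word and the dictionary representation are assumptions made for illustration.

```python
# Every control signal starts de-asserted (0); a phase lists only what it sets.
DEFAULTS = {
    "PCWrite": 0, "PCWriteCond": 0, "IorD": 0, "MemRead": 0, "MemWrite": 0,
    "IRWrite": 0, "MemtoReg": 0, "PCSource": 0b00, "ALUOp": 0b00,
    "ALUSrcA": 0, "ALUSrcB": 0b00, "RegWrite": 0, "RegDst": 0,
}


def control_word(**settings):
    """Return the full set of control signals for one phase."""
    unknown = set(settings) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown control signal(s): {sorted(unknown)}")
    return {**DEFAULTS, **settings}


# Instruction fetch (phase 1) as described in these notes.
FETCH = control_word(MemRead=1, IRWrite=1, IorD=0, ALUSrcA=0,
                     ALUSrcB=0b01, ALUOp=0b00, PCSource=0b00, PCWrite=1)

if __name__ == "__main__":
    print(FETCH)
```

Signals that a phase does not mention, such as MemWrite and RegWrite during fetch, come out as 0 without being listed.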
13. Multi-Clock Cycle Design: Generic Phases, Phase 1
- Phase 1: Instruction Fetch
  - IR = Memory[PC]
  - PC = PC + 4
- Assert IRWrite and MemRead and set IorD to 0 (select PC), and bump PC by 4 => ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00 (add).
- Store the incremented address back to the PC by setting PCSource = 00 and PCWrite = 1. Note that the incremented PC is also stored in ALUOut, which is redundant and benign. Note: asserting PCSource = 00 is not explicitly mentioned in the book.
- Comment: the memory access based on the PC and the incrementing of the PC in the same cycle are allowed because of our edge triggering assumption; see examples in slides 3 and 4 of the 1st set of Chapter 5 PPT notes (single clock cycle). We assume that the PC value is captured by the memory unit before it gets updated, a consequence of edge triggering. This value is available at the very beginning of the clock cycle, and because of delays, the PC is updated a little later in the cycle.
- Comment: it is also assumed that the memory can be read and its output set into the IR during this same clock cycle. Assume a faster point-to-point memory to IR connection rather than a bus. See Elaboration on page 382.
- Comment: it appears that the MDR benignly also gets the instruction because it is unconditionally set. If the instruction is lw, this value in the MDR will get correctly overwritten in a later cycle. This is not mentioned in the book.
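The edge triggering comment can be illustrated with a small behavioral sketch. It assumes a dictionary-based model of the registers and a word-addressed dictionary for memory (both hypothetical); all right-hand sides are computed from the old register values, and the new values only become visible in the returned state.

```python
def fetch(state, memory):
    """Phase 1: IR = Memory[PC]; PC = PC + 4 (edge-triggered update)."""
    pc = state["PC"]                 # old PC, sampled at the start of the cycle
    word = memory[pc]                # MemRead with IorD = 0 (address from PC)
    incremented = (pc + 4) & 0xFFFFFFFF  # ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00

    nxt = dict(state)                # values latched at the end of the cycle
    nxt["IR"] = word                 # IRWrite asserted
    nxt["MDR"] = word                # MDR is loaded unconditionally (benign)
    nxt["ALUOut"] = incremented      # redundant and benign
    nxt["PC"] = incremented          # PCSource = 00, PCWrite = 1
    return nxt


if __name__ == "__main__":
    memory = {0x00400000: 0x8C890004}
    state = {"PC": 0x00400000, "IR": 0, "MDR": 0, "ALUOut": 0}
    print({k: hex(v) for k, v in fetch(state, memory).items()})
```

Because the old PC is read before anything is written, the memory access and the PC increment do not interfere with each other.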
14. Multi-Clock Cycle Design: Generic Phases, Phase 2
- Phase 2: Decode and Register Fetch
  - Optimistic (maybe premature) actions are done here: not all results may be used downstream, but doing them now is faster and causes no problem if they are not used.
  - A = Reg[IR[25-21]]  /* rs field */
  - B = Reg[IR[20-16]]  /* rt field */
  - ALUOut = Target_addr = PC + (sign_ext(IR[15-0]) << 2)  /* target_addr calc is optimistic */
- Target_addr may get used (if the instruction is a conditional branch); otherwise it gets harmlessly discarded. It is easier to calculate it early and throw it away if not needed than to have to recalculate it later.
- ALUSrcA is set to 0 (choose PC), ALUSrcB is set to 11 (choose the offset field, which is both sign extended and shifted left by 2 to give a byte offset), and ALUOp = 00 (add).
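The decode phase can be sketched as field extraction plus the optimistic target calculation. This is a behavioral illustration with hypothetical names (decode, sign_extend16); the PC used here is the value already incremented during fetch.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode(state, regs):
    """Phase 2: read rs and rt, and compute the branch target optimistically."""
    ir, pc = state["IR"], state["PC"]
    rs = (ir >> 21) & 0x1F                  # IR[25-21]
    rt = (ir >> 16) & 0x1F                  # IR[20-16]
    offset = sign_extend16(ir & 0xFFFF)     # IR[15-0]

    nxt = dict(state)
    nxt["A"] = regs[rs]
    nxt["B"] = regs[rt]
    # ALUSrcA = 0 (PC), ALUSrcB = 11 (shifted offset), ALUOp = 00 (add)
    nxt["ALUOut"] = (pc + (offset << 2)) & 0xFFFFFFFF
    return nxt


if __name__ == "__main__":
    regs = [0] * 32
    regs[1], regs[2] = 7, 9
    # beq $1, $2, -1 encoded as 0x1022FFFF; PC already holds PC + 4.
    out = decode({"PC": 0x104, "IR": 0x1022FFFF}, regs)
    print(out["A"], out["B"], hex(out["ALUOut"]))
```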
15. Multi-Clock Cycle Design: Instruction Content Dependent Phases, Phase 3
- Memory address computation or arithmetic R-type execution
- Memory reference (for data)
  - ALUOut = A + sign_extend(IR[15-0])  /* ALU operation */
  - ALUSrcA set to 1, ALUSrcB set to 10 (use sign extended value), ALUOp set to 00 (add)
- Arithmetic-logical R-type instruction
  - ALUOut = A op B
  - ALUSrcA set to 1, ALUSrcB set to 00, ALUOp set to 10: the funct field is used to determine the ALU control settings
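The ALUSrcA and ALUSrcB settings used in phases 1 to 3 amount to two multiplexers in front of the single shared ALU. The sketch below collects the encodings mentioned in these notes (ALUSrcA: 0 = PC, 1 = A; ALUSrcB: 00 = B, 01 = constant 4, 10 = sign extended offset, 11 = sign extended and shifted offset); the function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def alu_inputs(alu_src_a, alu_src_b, pc, a, b, imm16):
    """Select the two ALU operands from the ALUSrcA and ALUSrcB settings."""
    ext = sign_extend16(imm16)
    op1 = a if alu_src_a else pc
    op2 = {0b00: b, 0b01: 4, 0b10: ext, 0b11: ext << 2}[alu_src_b]
    return op1, op2


if __name__ == "__main__":
    # Memory address computation: ALUOut = A + sign_extend(IR[15-0])
    op1, op2 = alu_inputs(1, 0b10, pc=0x104, a=100, b=7, imm16=0xFFFC)
    print((op1 + op2) & 0xFFFFFFFF)
```

The same ALU serves the PC increment, the branch target, the memory address and the R-type operation; only the mux settings differ between phases.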
16. Multi-Clock Cycle Design: Instruction Content Dependent Phases, Phase 3 (continued)
- Branch completion
- Branch (beq)
  - if (A == B) PC = ALUOut
  - Equality compare of registers A and B: if they are equal, the Zero output of the ALU is set, which means the branch is successful.
  - ALUSrcA set to 1, ALUSrcB set to 00, ALUOp set to 01 (subtract), PCWriteCond asserted, PCSource set to 01  /* PC taken from ALUOut */
  - PCWrite de-asserted (by default) prevents the beq address from being set on an unsuccessful branch: the ALU Zero output must be set to use the beq address.
  - For successful beqs, we write the PC twice: once from the direct ALU output during instruction fetch (PC + 4), and once from ALUOut as above in this step. The last one is used.
- Jump (j)
  - PC = PC[31-28] || (IR[25-0] << 2)  /* concatenate the 26 bit offset, shifted to 28 bits, to the PC high 4 bits */
  - Set PCSource = 10  /* selects jump address */
  - PCWrite asserted to overwrite the current contents of the PC with the jump address.
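Branch and jump completion can be sketched as follows. The PCSource encoding assumed here is 00 = direct ALU result (PC + 4), 01 = ALUOut (branch target), 10 = jump address; the function names are hypothetical.

```python
def jump_address(pc, ir):
    """PC[31-28] concatenated with IR[25-0] shifted left by 2."""
    return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)


def select_pc_source(pc_source, alu_result, alu_out, jump_addr):
    """PCSource mux: choose the value offered to the PC."""
    return {0b00: alu_result, 0b01: alu_out, 0b10: jump_addr}[pc_source]


def branch_completion(pc, a, b, alu_out):
    """beq: the ALU subtracts A - B; the PC is overwritten only if Zero is set."""
    zero = int(((a - b) & 0xFFFFFFFF) == 0)
    pc_write_cond = 1
    if pc_write_cond & zero:
        return select_pc_source(0b01, alu_result=0, alu_out=alu_out, jump_addr=0)
    return pc  # unsuccessful branch: PC keeps the PC + 4 written during fetch


if __name__ == "__main__":
    print(hex(branch_completion(0x104, 5, 5, 0x200)))   # taken
    print(hex(branch_completion(0x104, 5, 6, 0x200)))   # not taken
    print(hex(jump_address(0x40000004, 0x08000010)))    # j with target field 0x10
```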
17. Multi-Clock Cycle Design: Instruction Content Dependent Phases, Phase 4
- Memory access
  - MDR = Memory[ALUOut]  /* for lw */
  - Memory[ALUOut] = B  /* for sw, source operand saved in B */
  - MemRead (for lw) or MemWrite (for sw) is asserted, and IorD is set to 1 to get the data address instead of the PC
- Arithmetic-logical R-type instruction completion
  - Reg[IR[15-11]] = ALUOut  /* set rd from ALUOut */
  - RegDst set to 1 to pick up rd and not rt, assert RegWrite, set MemtoReg to 0 to write from the ALU and not from memory
18. Multi-Clock Cycle Design: Instruction Content Dependent Phases, Phase 5
- Memory read completion step (for lw, the longest instruction)
  - Reg[IR[20-16]] = MDR  /* write back to register from memory for lw; writes to the rt register */
  - Set MemtoReg to 1 to write from memory and not the ALU, assert RegWrite to cause a write to the register file, set RegDst to 0 to choose the rt register and not rd
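The memory access and write-back steps of phases 4 and 5 can be sketched with two small functions. This is a behavioral illustration using hypothetical names and a dictionary for memory; the RegDst and MemtoReg encodings follow the settings given in these notes.

```python
def memory_access(op, state, memory):
    """Phase 4 for lw and sw: IorD = 1, so the address comes from ALUOut."""
    nxt_state, nxt_memory = dict(state), dict(memory)
    addr = state["ALUOut"]
    if op == "lw":
        nxt_state["MDR"] = memory[addr]     # MemRead asserted
    elif op == "sw":
        nxt_memory[addr] = state["B"]       # MemWrite asserted
    else:
        raise ValueError("memory access applies only to lw and sw")
    return nxt_state, nxt_memory


def register_write(regs, ir, reg_dst, mem_to_reg, alu_out, mdr):
    """Write-back with RegWrite asserted.
    RegDst: 1 = rd (IR[15-11]), 0 = rt (IR[20-16]).
    MemtoReg: 1 = data from MDR, 0 = data from ALUOut."""
    dest = (ir >> 11) & 0x1F if reg_dst else (ir >> 16) & 0x1F
    data = mdr if mem_to_reg else alu_out
    new_regs = list(regs)
    new_regs[dest] = data
    return new_regs


if __name__ == "__main__":
    regs = [0] * 32
    # R-type completion for add $3, $1, $2 (0x00221820): rd = 3, data from ALUOut.
    print(register_write(regs, 0x00221820, 1, 0, alu_out=16, mdr=0)[3])
    # lw completion for lw $9, 4($4) (0x8C890004): rt = 9, data from MDR.
    print(register_write(regs, 0x8C890004, 0, 1, alu_out=0, mdr=42)[9])
```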
19. Multi-Clock Cycle Design: Summary of steps in each phase
Fig. 5.35
20. High Level View of Finite State Machine Control
Fig. 5.36
21. Instruction Fetch and Decode
Fig. 5.37
22. Memory Reference Instructions
Fig. 5.38
23. R-type Instruction
Fig. 5.39
24. Branch and Jump Instruction
Branch (beq) instruction
jump instruction
Fig. 5.40
Fig. 5.41
25. Complete State Machine for Multi-cycle Controller (fig 5.33)
Fig. 5.42
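The sequencing implied by the phases in these notes can be sketched as a next-state function. The state names below are hypothetical labels, not the state numbers used in the figures, and the sketch covers only lw, sw, R-type, beq and j.

```python
# State that follows decode, chosen by the instruction class (opcode dependent).
NEXT_AFTER_DECODE = {
    "lw": "mem_addr", "sw": "mem_addr", "r_type": "execute",
    "beq": "branch_complete", "j": "jump_complete",
}


def next_state(state, op):
    """Return the next controller state for the current state and opcode class."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return NEXT_AFTER_DECODE[op]
    if state == "mem_addr":
        return "mem_read" if op == "lw" else "mem_write"
    if state == "mem_read":
        return "write_back"
    if state == "execute":
        return "r_complete"
    # mem_write, write_back, r_complete, branch_complete, jump_complete
    return "fetch"


def cycles(op):
    """Count the clock cycles from fetch until control returns to fetch."""
    state, count = "fetch", 0
    while True:
        count += 1
        state = next_state(state, op)
        if state == "fetch":
            return count


if __name__ == "__main__":
    for op in NEXT_AFTER_DECODE:
        print(op, cycles(op))
```

Each state corresponds to one clock cycle, so the cycle count per instruction class falls out of the walk through the states.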
26. A Possible Implementation of the Multi-cycle Control Unit
See appendix C for implementation details.
Fig. 5.43 (figure annotation: current state)