Title: Chapter 4 The Microarchitecture Level
1 Chapter 4 The Microarchitecture Level
- CS 271 Computer Architecture
- Indiana University Purdue University Fort Wayne
- Mark Temte
2 Microarchitecture level context
3 Microarchitecture level
- The ISA level instructions are also known as macroinstructions
- Familiar from assembly language
- ADD, LOAD, STORE, BRANCH, etc.
Java: c = a + b
Assembly language (R3 is register 3):
LOAD R3, a
ADD R3, b
STORE R3, c
BRANCH L4
4 Microarchitecture level
- The control unit within the CPU must generate signals to fetch and execute each of the ISA level macroinstructions
- How?
- Create a microcomputer within the control unit
- This microcomputer runs microprograms consisting of microinstructions that act on the data path
- There is one microprogram for each macroinstruction
- Execution of the microprogram interprets the corresponding macroinstruction
5 Data path
- The data path of the CPU consists of those parts exclusive of the control unit
- Consists of the ALU, registers, and internal buses
- Example
- The following slide shows the data path of a fictitious computer called IJVM
- Integer Java Virtual Machine
- 32-bit data path
- 32-bit registers
- 32 1-bit ALUs
6(No Transcript)
7 Data path
- The ALU has 6 control lines
- F0, F1: select AND, OR, COMP, or SUM
- ENA: gate A inputs into ALU
- ENB: gate B inputs into ALU
- INVA: complement A inputs
- INC: assert carry into low-order bit of ALU
- The shifter has 2 control lines with 3 actions
- Negating both lines causes no shift
- SLL8: Shift Left Logical 8 (shift left 1 byte with 0 fill)
- SRA1: Shift Right Arithmetic 1, not changing the leftmost bit
- This divides a two's complement number by 2
8 Example: dividing by 2
- Divide the two's complement representation of -14 by 2
Let n = 6 bits
Represent the magnitude: 14 (base 10) = 001110
Complement each bit: 110001
Add 1: 110001 + 1 = 110010
Result: -14 (base 10) = 110010
Apply SRA1 to -14 (base 10) = 110010
Obtain 111001
What is this?
Complement each bit of 111001: 000110
Add 1: 000110 + 1 = 000111
Obtain 7 (base 10) = 000111
Thus SRA1 produced 111001 = -7 (base 10)
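The worked example can be checked with a short sketch. The helper names `sra1` and `to_signed` are made up for illustration; they model a 6-bit register, not the 32-bit Mic-1 shifter.

```python
def sra1(pattern: int, bits: int = 6) -> int:
    """Shift right arithmetic by 1: the leftmost (sign) bit is kept."""
    mask = (1 << bits) - 1
    pattern &= mask
    sign = pattern & (1 << (bits - 1))
    return (pattern >> 1) | sign


def to_signed(pattern: int, bits: int = 6) -> int:
    """Interpret a bit pattern as a two's complement integer."""
    if pattern & (1 << (bits - 1)):
        return pattern - (1 << bits)
    return pattern


neg14 = -14 & 0b111111            # 6-bit pattern for -14
shifted = sra1(neg14)
print(format(neg14, "06b"), format(shifted, "06b"), to_signed(shifted))
# prints: 110010 111001 -7
```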
9 Recall . . .
10(No Transcript)
11 Example
- How could you increment the SP register?
- Look at the data path again
- Enable SP to the B bus
- Compute B + 1 as follows . . .
- Assert ENB
- Assert lines for SUM
- Assert INC
- No shift
- When shifter output has stabilized, write the C bus back into the SP
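A minimal sketch of how the six ALU control lines combine to compute B + 1. The function `alu` is hypothetical, and the F0/F1 encoding (00 = AND, 01 = OR, 10 = complement of B, 11 = SUM) is an assumption that should be checked against the textbook's ALU table.

```python
MASK32 = 0xFFFFFFFF


def alu(a: int, b: int, f0: int, f1: int,
        ena: int, enb: int, inva: int, inc: int) -> int:
    """Model a 32-bit ALU driven by the six control lines."""
    a = a if ena else 0          # ENA gates the A input
    b = b if enb else 0          # ENB gates the B input
    if inva:
        a = ~a & MASK32          # INVA complements the A input
    if (f0, f1) == (0, 0):
        return a & b
    if (f0, f1) == (0, 1):
        return a | b
    if (f0, f1) == (1, 0):
        return ~b & MASK32
    return (a + b + inc) & MASK32    # SUM, with INC as carry-in


sp = 0x8000
# ENA negated, ENB asserted, SUM selected, INC asserted: result is B + 1
sp = alu(a=0x1234, b=sp, f0=1, f1=1, ena=0, enb=1, inva=0, inc=1)
print(hex(sp))
```

With ENA negated, whatever happens to be in H is ignored, so only the B bus value and the carry-in reach the adder.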
12 Incrementing the SP register
- Precise timing of the write pulse to SP is
important
13 Memory operations
- There are two ports to memory
- MAR / MDR
- 32-bit data port
- Load the MAR with a 30-bit address
- The address (multiplied by 4) goes to memory
- To multiply, simply shift the address left by 2 bits
- The MDR receives data (READ) or provides data (WRITE)
- PC / MBR
- 8-bit data port
- Only for reading
- Load the PC with a 32-bit address
- The address goes to memory
- The MBR receives a byte
- Usually an ISA instruction code
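A small sketch of the difference between the two ports: the MAR holds a word address that is shifted left by 2 before going to memory, while the PC is used as a byte address directly. The function names are hypothetical.

```python
def mar_to_byte_address(mar: int) -> int:
    """Word address in MAR -> byte address on the memory bus (times 4)."""
    return (mar << 2) & 0xFFFFFFFF


def pc_to_byte_address(pc: int) -> int:
    """The PC already holds a byte address."""
    return pc & 0xFFFFFFFF


print(mar_to_byte_address(1), mar_to_byte_address(2), pc_to_byte_address(2))
# prints: 4 8 2
```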
14 Memory operations
- The MBR is gated onto the B bus in two ways
- Signed (with sign extension)
- Unsigned
- There is one control signal for each
- Only one of these signals may be asserted at a
time
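The two ways of gating the 8-bit MBR onto the 32-bit B bus can be sketched as follows. The function names are made up for illustration.

```python
def mbr_unsigned(mbr: int) -> int:
    """Zero-extend the byte: the upper 24 bits of the B bus are 0."""
    return mbr & 0xFF


def mbr_signed(mbr: int) -> int:
    """Sign-extend the byte: bit 7 is copied into the upper 24 bits."""
    byte = mbr & 0xFF
    return byte | 0xFFFFFF00 if byte & 0x80 else byte


print(hex(mbr_unsigned(0xFE)), hex(mbr_signed(0xFE)))
# prints: 0xfe 0xfffffffe
```

The signed form suits operands such as branch offsets, which may be negative; the unsigned form suits values such as op-codes and indices.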
15 Control signals for the data path
- There are 29 signals in all
- 9 - Selects a register to gate to the B-bus
- A 4-bit code is decoded 16 ways
- Only 9 ways are used
- Saves 5 bits
- 9 - Selects a register to load from the C-bus
- 8 - ALU and shifter operations
- 2 - read / write using MAR / MDR
- 1 - fetch using PC / MBR
- These control each data path cycle
- Falling edge of clock to next rising edge
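A sketch of the 4-to-16 decoder that drives the B bus. The register order in `B_BUS_REGISTERS` is an assumption for illustration (MBRU stands for the unsigned gating of MBR); what matters is that 4 encoded bits replace 9 separate select lines and that only one register can be selected at a time.

```python
# Assumed encoding of the 9 B-bus sources; codes 9..15 are unused.
B_BUS_REGISTERS = ["MDR", "PC", "MBR", "MBRU", "SP", "LV", "CPP", "TOS", "OPC"]


def decode_b(code: int) -> dict:
    """Decode a 4-bit field into one enable line per B-bus register."""
    if not 0 <= code < 16:
        raise ValueError("B field is 4 bits wide")
    lines = [int(i == code) for i in range(16)]   # 16 decoder outputs
    return dict(zip(B_BUS_REGISTERS, lines))      # only 9 are wired up


enables = decode_b(4)
print([name for name, on in enables.items() if on])
```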
16 Microinstruction format
- Each microinstruction sets up control signals on the data path for the next data path cycle
- Each microinstruction is 36 bits
- 24 control bits for the data path
- 24 = 29 - 5
- 9-bit address of the next microinstruction
- 3-bit condition code for branching
- Note: each microinstruction specifies its successor
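The 36-bit layout can be sketched as a pack/unpack pair. The field widths follow the slides (9 + 3 + 24 = 36, with the 24 data path bits split as 8 ALU/shifter, 9 C-bus, 3 memory, 4 B-bus); the left-to-right field order and the names in `FIELDS` are assumptions for illustration.

```python
# (name, width in bits), most significant field first (assumed order)
FIELDS = [("next_address", 9), ("jam", 3), ("alu", 8),
          ("c", 9), ("mem", 3), ("b", 4)]


def pack(**values: int) -> int:
    """Build a 36-bit microinstruction word from named fields."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a 36-bit word back into its fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


word = pack(next_address=0x060, jam=0b100, alu=0b00110101, b=4)
print(f"{word:036b}")
print(unpack(word))
```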
17 Microinstruction format
18 Mic-1 architecture
- Mic-1 is an example architecture we will study
- Consists of . . .
- Control store
- 512 x 36 bit memory for all the microprograms
- This is ROM
- MPC
- MicroProgram Counter
- MIR
- MicroInstruction Register
19(No Transcript)
20 Mic-1 fetch / execute cycle
- At the falling edge of the clock, the MIR is loaded
- The following components then operate and stabilize
- Decoder
- B-bus
- ALU
- Shifter
- C-bus
- Also, the N and Z outputs from the ALU go to flip-flops
- At the next rising edge of the clock
- Registers and N, Z flip-flops are loaded
- MDR and MBR are loaded from memory
- The address of the next microinstruction is calculated while the clock is high and the cycle repeats
21 Recall the timing of the data path cycle
22 MPC address calculation
- Address is just NEXT_ADDRESS when JAM = 000
- However . . .
- JAMN = 1 causes OR of N-bit with high-order MPC bit
- JAMZ = 1 causes OR of Z-bit with high-order MPC bit
- JMPC = 1 causes bitwise OR of MBR and 8 low-order bits of NEXT_ADDRESS
- Typically, NEXT_ADDRESS = 0 when JMPC = 1
- This permits a branch to the address in the MBR
- This address typically is identical to the ISA op-code
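The address calculation can be sketched as a small function. The name `next_mpc` and its argument order are hypothetical; the logic follows the bullets above for a 9-bit MPC.

```python
def next_mpc(next_address: int, jamn: int, jamz: int, jmpc: int,
             n: int, z: int, mbr: int) -> int:
    """Compute the 9-bit address of the next microinstruction."""
    high = (next_address >> 8) & 1
    high |= (jamn & n) | (jamz & z)      # conditional branch on N or Z
    low = next_address & 0xFF
    if jmpc:
        low |= mbr & 0xFF                # multiway branch on the op-code
    return (high << 8) | low


print(hex(next_mpc(0x092, 0, 0, 0, n=1, z=1, mbr=0x60)))  # JAM = 000
print(hex(next_mpc(0x092, 0, 1, 0, n=0, z=1, mbr=0x00)))  # JAMZ and Z set
print(hex(next_mpc(0x000, 0, 0, 1, n=0, z=0, mbr=0x60)))  # JMPC
# prints: 0x92, 0x192, 0x60 on separate lines
```

Because JAMN and JAMZ can only set the high-order bit, the two targets of a conditional microbranch differ by exactly 256.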
23 MPC address calculation
24 ISA (macroarchitecture) of the IJVM
- Memory model
- Format of methods
- The IJVM instruction set
- Local variable frames and the operand stack
- How a method call is implemented
25 The IJVM memory model
26 The IJVM memory model
- The constant pool
- Contains constants, strings, and pointers
- E.g., pointer to the base address of each method
- Loaded when the program is loaded
- Register CPP points to the base of the constant pool
- The constant pool is read-only
- The method area
- Contains method code
- Register PC points to the next instruction
- Organized as a byte array
- Operand stack
27 Method format
- The executable code in a method is preceded by . . .
- Two bytes giving the number of parameters
- Two bytes giving the size of the local variable area
- The local variable area size is needed to initialize the SP to the top of the local variable frame
number of parameters
size of LV area
PC
executable code
28 The IJVM instruction set
- The IJVM instruction set appears on the next slide
- There are 20 instructions altogether
- Many of the instructions require just a single byte
- These have no operands
- DUP, IADD, IAND, IOR, IRETURN, ISUB, NOP, POP, SWAP, WIDE
- Others have an additional single 1-byte operand
- BIPUSH, ILOAD, ISTORE
- Some have a single 2-byte operand
- GOTO, IFEQ, IFLT, IF_ICMPEQ, INVOKEVIRTUAL, LDC_W
- One has two 1-byte operands
- IINC
29(No Transcript)
30 Using the IADD instruction
- To add local variables j and k and save the sum in local variable i . . .
ILOAD j // push a copy of local variable j on the top of the stack
ILOAD k // push a copy of local variable k on the top of the stack
IADD // pop 2 words from the stack and push their sum back
ISTORE i // pop top word from stack and store in local variable i
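The four-instruction sequence can be traced with a toy stack machine. This is only a sketch: `run` and `local_vars` are made-up names, variables are looked up by name rather than by index, and only the three instructions used here are handled.

```python
def run(program, local_vars):
    """Interpret a tiny subset of IJVM on a Python list used as the stack."""
    stack = []
    for op, *args in program:
        if op == "ILOAD":
            stack.append(local_vars[args[0]])      # push a copy
        elif op == "ISTORE":
            local_vars[args[0]] = stack.pop()      # pop into the variable
        elif op == "IADD":
            b = stack.pop()
            a = stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)     # 32-bit wraparound
        else:
            raise ValueError(f"unsupported instruction {op}")
    return stack


local_vars = {"i": 0, "j": 3, "k": 4}
leftover = run([("ILOAD", "j"), ("ILOAD", "k"), ("IADD",), ("ISTORE", "i")],
               local_vars)
print(local_vars["i"], leftover)
# prints: 7 []
```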
31 Sample program fragment
Note: The branch instruction IF_ICMPEQ has a 16-bit signed offset that is added to the address of the current op-code to target L1
32 The local variable frame
- The local variable frame is where the local variables of a method are stored
- A new local variable frame is created whenever a method is called
- Each local variable frame is pushed onto a stack in memory called the operand stack
- The stack space occupied by a local variable frame is released when the associated method returns
33 Operand stack example
- Suppose method A calls method B, which calls method C
- The SP (Stack Pointer) register holds the index of the top of the stack
- The LV (Local Variable pointer) register holds the base address of the local variable frame
SP
frame for C
LV
frame for B
frame for A
34 Operand stack example
- Note how the stack space for B and C is recycled
35 Detailed local variable frame structure
- The local variable frame also . . .
- Holds all the parameters set up on the stack in advance by the caller
- Saves the LV and PC registers of the caller
- The saved PC value is the return address within the caller
36 Detailed local variable frame structure
37 Calling a method
- Call a method using instruction
- INVOKEVIRTUAL disp
- Parameter disp gives the position in the constant pool holding a pointer to the called method
- INVOKEVIRTUAL does the following
- Sets register LV to the value in SP - (number of parameters)
- Sets the value in the location pointed to by LV to the value in register SP + (number of local variables) + 1
- Increments register SP by (number of local variables)
- Pushes the caller's register PC (return address) on the stack
- Sets register PC to the 5th byte in the called method
- Pushes the caller's original LV value on the stack
38 Intermediate results
- The operand stack is used for storing method intermediate results
- These are pushed on the operand stack above the local variable frame
- The return result is the final intermediate result
- It is always left immediately above the local variable frame
- The other intermediate results have already been popped
- Look at Figure 4-9 again
- IRETURN reverses the steps of INVOKEVIRTUAL
39 Returning from a method
40 The Mic-1 microprogram for IJVM
- Recall that there is one Mic-1 microprogram for each of the IJVM macroinstructions
- There is also a microprogram for instruction fetch
- Altogether, these microprograms are referred to as the Mic-1 microprogram for the IJVM
- Microinstructions are described using a special notation
- 36 bits could be used instead for each microinstruction
- It is more readable to indicate how the bits should be set rather than what they are set to
- Caution: be sure that what is indicated by the notation is physically possible
41 Microinstruction notation
- Some examples
- Everything on a line is done in one clock cycle
- The desired result must be physically possible
- For example, MDR = SP + MDR is illegal, since addition needs one input from register H
PC = PC + 1; fetch; goto (MBR)
MAR = SP = SP - 1; rd
H = TOS
MDR = TOS = MDR + H; wr; goto Main1
42 Sequencing of instructions
- All instructions have an implicit or explicit goto
- Sequential instructions are not necessarily sequential in the control store
- The microinstruction sequence for a macroinstruction starts at the control store address that corresponds to the numerical value of the macroinstruction's op-code
- For example, the IADD op-code is 60 (base 16) and the microinstruction sequence starts at location 60 (base 16)
- The following microinstruction can be located anywhere in the control store
43 Microinstruction branching
- Example
- Pass TOS through the ALU and look at the Z bit
Z = TOS; if (Z) goto L1; else goto L2
- L1 and L2 must be exactly 256 locations apart
- Example
- Unconditional branch to instruction pointed to by the MBR
goto (MBR)
- Convention: At the start of any macroinstruction, register TOS always contains a copy of the value at the top of the operand stack
- Register OPC is a scratch register
- Often saves the op-code
44 The Mic-1 microprogram for IJVM
- There are 112 microinstructions in all
- Starts with the line labeled Main1
- Before the macroprogram runs . . .
- the PC contains the address just before the 1st macroinstruction
- the MBR contains 0 (the NOP op-code)
- Main1 fetches the next macroinstruction op-code and branches to the start of the microinstruction sequence for the current macroinstruction
- The last microinstruction in the sequence branches back to Main1
- On the following slides, focus on instructions
marked with
45 The Mic-1 microprogram for IJVM
46 The Mic-1 microprogram for IJVM
47 The Mic-1 microprogram for IJVM
48 The Mic-1 microprogram for IJVM
49 The Mic-1 microprogram for IJVM
50 Design issues
- We will modify the Mic-1 design in order to increase performance
- Changes involve . . .
- Eliminating decoding
- Reducing the path length
- The path length is the average number of microinstructions per macroinstruction
- The path length can be reduced by . . .
- Eliminating Main1
- Using a 3-bus architecture
- Adding an independent fetch unit
51 Eliminating decoding
- Decoding the B-bus slows the potential clock rate
- The decoding must be completed before anything else can happen
- Cost to eliminate decoding
- 5 bits in each microinstruction
- Altogether, 41 bits will be needed instead of 36
52 Eliminating Main1
- At Main1 there is a microinstruction to fetch the opcode of the next macroinstruction
- This microinstruction can be eliminated by merging its code onto the end of the microcode sequence of each macroinstruction
- Usually this can be done in parallel with other activity for a saving of 1 cycle
- This may not always be possible
Main1: PC = PC + 1; fetch; goto (MBR)
53 Eliminating Main1
- Microinstruction sequence for POP with Main1 code
merged onto the end
The original order of microinstruction execution
for POP
54 Three-bus architecture
- This change allows two registers to be added in just one clock cycle
- There is no need to waste a cycle moving one of the registers to the H register earlier
55 Adding an independent fetch unit
- This new specialized functional unit is called the IFU
- Instruction Fetch Unit
- It independently fetches macroinstruction opcodes and processes macroinstruction operands
- Operands like varnum, disp, offset, etc.
- This eliminates the Main1 microinstruction entirely
- No longer necessary to merge Main1 code onto the end of each microcode sequence
56 Adding an independent fetch unit
- The IFU gives a dramatic improvement in performance, but . . .
- The IFU is surprisingly complicated
- Due to branching and operand handling
- There are some necessary changes in the data path due to the IFU
- In addition to MBR, a new 2-byte register MBR2 is added to the data path for holding 2-byte operands
- This eliminates the need to combine two bytes in the data path to form an offset or disp
- The old MBR is renamed MBR1
57 The IFU
- The PC is now updated by the microprogram only when a branch occurs
- The IFU maintains its own copy of the PC in a private register called IMAR
- The IFU increments the IMAR independently of the data path
- The IFU reads 4 bytes at a time from the user program into a special shift register capable of holding 5 bytes
58 The IFU
59 Mic-2
- The revised microarchitecture is called Mic-2
- Mic-2 includes . . .
- 3-bus architecture
- Prefetching using the IFU
- Shorter microprogram
- 81 microinstructions instead of 112
- Major performance gain
60 Mic-2
61 The new microprogram for Mic-2
62 The new microprogram for Mic-2
63(No Transcript)
64 Additional modifications
- The clock cycle time can be reduced with a pipelined design
- We first add latch registers to the data path
65 Pipelined design
- This design latches . . .
- The A and B inputs to the ALU
- Output from the ALU
- The old clock cycle is broken into 3 microcycles
- The clock is adjusted to run approximately 3 times as fast
- Now parts of three microinstructions can be processed in parallel
- We need to add a cache memory so memory operations can keep up
- The ALU is active every cycle
- Not just in the middle of the old cycle
66 The pipeline in action
67 The SWAP instruction
SWAP with pipelining
68 The SWAP instruction
- With pipelining, note the need to stall the pipeline occasionally
- The third microinstruction caused the pipeline to stall for two cycles
- The SWAP now requires only 11 microcycles instead of 3 x (6 normal cycles) = 18 microcycles
69 Mic-3
- The revised microarchitecture is called Mic-3
- Mic-3 includes a 4-stage pipeline with stages . . .
- Fetch
- Latch A and B
- Calculate with the ALU
- Writeback
70 Additional modifications
- Mic-3 still has a problem
- Various microinstructions contain microbranches
- Conditional branch
- Branch with a target microinstruction not known in advance
- For example, the last microinstruction in a sequence always branches to a target not known in advance
- Consider the swap6 microinstruction
- The next microinstruction cannot be prefetched
- This could cause havoc with the microinstruction pipeline
- There is a separate MIR for each microinstruction in the pipeline
- The pipeline must stall until the next microinstruction is known
- The next microinstruction must be anticipated
- Add two more components to the design
- Decoding unit
- Queueing unit
71 Decoding unit
- The decoding unit knows which incoming bytes are opcodes and which are operands like varnum and disp
- The incoming opcode is an index into a ROM table within the decoding unit
- The indexed row gives . . .
- The number of bytes associated with the opcode
- This allows the decoding unit to know when it fetches the next opcode
- The address in the control store of the first microinstruction of the sequence associated with the opcode
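A sketch of the ROM lookup, showing how the length field lets the decoding unit tell opcodes from operands. The three opcodes and their lengths follow the IJVM instruction set (BIPUSH has a 1-byte operand, GOTO a 2-byte operand); the control store addresses in `DECODE_ROM` are placeholder values, not the real ones.

```python
# opcode -> (instruction length in bytes, control store start address)
# The start addresses are placeholders chosen for illustration.
DECODE_ROM = {
    0x10: (2, 0x010),   # BIPUSH: opcode + 1-byte operand
    0x60: (1, 0x060),   # IADD: opcode only
    0xA7: (3, 0x0A7),   # GOTO: opcode + 2-byte offset
}


def split_instructions(code: bytes):
    """Walk a byte stream, separating opcodes from their operands."""
    result = []
    position = 0
    while position < len(code):
        opcode = code[position]
        length, micro_address = DECODE_ROM[opcode]
        operands = code[position + 1:position + length]
        result.append((opcode, operands, micro_address))
        position += length            # the next opcode starts here
    return result


for opcode, operands, address in split_instructions(bytes([0x10, 5, 0x10, 7, 0x60])):
    print(hex(opcode), operands.hex(), hex(address))
```

Note that the operand byte 0x10 in a stream such as `10 10 60` would not be mistaken for a second BIPUSH, because the length field skips over it.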
72 Queueing unit
- The queueing unit contains . . .
- The old control store (ROM)
- The microinstructions in the control store for a given sequence are now consecutive rather than scattered
- No need for each microinstruction to designate its successor
- A hardware queue of microinstructions (RAM)
- The microinstruction queue holds the proper sequence of microinstructions across ISA macroinstruction boundaries
73 Queueing unit
- Microinstructions have a modified format
- No longer need the NEXT_ADDRESS field
- No longer have JAM bits
- Have added bits for selecting the A bus
- Also there are two new bits in each microinstruction
- Final bit
- Goto bit
74 Queueing unit
- The Final bit is set in the last microinstruction in each sequence
- It is used to indicate the end of the sequence for the current macroinstruction and reactivate the IFU
- The Goto bit marks microinstructions that have conditional branches (at the ISA level)
- These microinstructions have a different format from other microinstructions
- Have JAM bits
- Contain an index into the control store
75 Queueing unit operation (input side)
- Starting with the first microinstruction of a sequence, the queueing unit . . .
- Copies sequential instructions from the control store into the hardware queue of microinstructions
- Copying continues through the first microinstruction with the Final bit set
- If the Goto bit is not set, the queueing unit . . .
- Gets the index associated with the next opcode from the decoding unit
- Continues copying microinstructions from the sequence for the new opcode into the hardware queue of microinstructions
- Copying continues until a Goto bit is set or the queue of microinstructions is full
76 Queueing unit operation (input side)
- When the Goto bit is set (conditional branch)
- The queueing unit stops copying microinstructions from the control store into its hardware queue
- The unit stalls until the microbranch has been resolved
- The fetch queue in the IFU may have to be cleaned up also
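The input-side behavior on the last two slides can be sketched as a loop. Everything here is illustrative: `Micro`, `fill_queue`, and the status strings are invented names, the control store is modeled as a dictionary from address to microinstruction, and `start_addresses` stands in for the indices supplied by the decoding unit.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Micro:
    name: str
    final: bool = False   # last microinstruction of its sequence
    goto: bool = False    # contains a conditional microbranch


def fill_queue(control_store, start_addresses, capacity):
    """Copy microinstructions into the queue until it fills or must stall."""
    queue = deque()
    for address in start_addresses:          # one index per macroinstruction
        while True:
            if len(queue) == capacity:
                return queue, "full"
            micro = control_store[address]
            queue.append(micro)
            if micro.goto:
                return queue, "stalled on microbranch"
            if micro.final:
                break                         # move on to the next opcode
            address += 1                      # sequences are consecutive
    return queue, "waiting for next opcode"


store = {0: Micro("a1"), 1: Micro("a2", final=True),
         10: Micro("b1", goto=True), 11: Micro("b2", final=True)}
queue, status = fill_queue(store, [0, 10, 0], capacity=8)
print([m.name for m in queue], status)
```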
77 Queueing unit operation (output side)
- On the output side, the queueing unit
- Dequeues microinstructions from its queue
- Feeds them into a queue of four MIRs
- One MIR for each stage of the data path part of the pipeline
78(No Transcript)
79 Mic-4
- The revised microarchitecture is called Mic-4
- Mic-4 includes a 7-stage pipeline with stages . . .
- IFU
- Decoding unit
- Queueing unit
- Latch operands
- ALU
- Register writeback
- Memory
- See circled numbers on Figure 4-35
80 Cache memory
- The bottleneck in the Mic-4 design is with memory
- Memory latency is the delay for read and write
- Memory bandwidth is the number of bytes involved in each read or write
- For a given memory technology, an increase in bandwidth causes an increase in latency
- The fastest memory technology is not cost effective
- Cache memory is the cost effective alternative
CPU
cache memory
main memory
81 Cache memory terminology
- Spatial locality
- Nearby addresses are likely to be needed soon
- Bring in more bytes than needed from the vicinity of each reference for later use
- Temporal locality
- Recently used addresses are likely to be needed again
- Don't discard these right away
82 Cache memory terminology
- Cache line
- The block of bytes brought in when a cache miss occurs
- Typically 4, 8, 16, 32, or 64 consecutive bytes
- Unified cache
- Contains both data and instructions
- Split cache
- Separate caches for data and instructions
- Allows parallel access
- Effectively doubles bandwidth
- Instruction cache usually read-only from the CPU
83 Several levels of cache are common
84 Direct-mapped cache
- A direct-mapped cache is organized into rows
- Each row contains
- Valid bit
- Set whenever the row is loaded
- Bit is clear only when cache line is empty
- Tag
- Consists of the high-order address bits
- Cache line (the data)
- The next slide is an example of a direct-mapped cache
- with 2048 rows
- with a 32-byte cache line
85(No Transcript)
86 Direct-mapped cache
- The example cache responds to 32-bit addresses
- The 11-bit line field selects the row of the cache
- The 3-bit word field selects the word of data within the cache line
- The 2-bit byte field selects a byte within the word
- Each row of the cache is shared by all addresses with the same line field bits
- The 16 tag bits of the address are loaded into the 16-bit tag field when the cache line is loaded
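The four address fields can be extracted with shifts and masks. This sketch assumes the fields are laid out from high to low as tag (16), line (11), word (3), byte (2); `split_address` is a made-up helper name.

```python
def split_address(address: int):
    """Split a 32-bit address into (tag, line, word, byte) fields."""
    byte = address & 0x3              # 2 bits: byte within the word
    word = (address >> 2) & 0x7       # 3 bits: word within the cache line
    line = (address >> 5) & 0x7FF     # 11 bits: row of the cache
    tag = (address >> 16) & 0xFFFF    # 16 bits: compared on each reference
    return tag, line, word, byte


tag, line, word, byte = split_address(0x12345678)
print(hex(tag), hex(line), word, byte)
# prints: 0x1234 0x2b3 6 0
```

The widths are consistent with the example cache: 2048 rows need 11 line bits, and a 32-byte line holds 8 words of 4 bytes each.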
87 Direct-mapped cache
- When the cache is referenced . . .
- The tag bits of the address are compared with the bits in the tag field of the row selected by the line bits
- A cache hit occurs if the tag bits are the same
- A cache miss occurs if the tag bits are different
- Cache hit
- The needed word or byte of the cache line is read or written
- Cache miss
- The existing cache line must be written back to memory if it has been modified
- Replace the cache line with the new data from memory
- Update the tag field
- Read or write the needed word or byte
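A minimal sketch of the hit/miss procedure, tracking only valid bits, tags, and a modified flag per row (no actual data). The class name and the `writebacks` counter are invented for illustration, and the model assumes a modified line is written back only when it is replaced.

```python
class DirectMappedCache:
    def __init__(self, rows: int = 2048, line_bytes: int = 32):
        self.rows = rows
        self.line_bytes = line_bytes
        self.valid = [False] * rows
        self.modified = [False] * rows
        self.tag = [0] * rows
        self.writebacks = 0

    def access(self, address: int, write: bool = False) -> str:
        block = address // self.line_bytes
        row = block % self.rows           # line field selects the row
        tag = block // self.rows          # remaining high-order bits
        hit = self.valid[row] and self.tag[row] == tag
        if not hit:
            if self.valid[row] and self.modified[row]:
                self.writebacks += 1      # old line goes back to memory
            self.valid[row] = True        # new line loaded from memory
            self.tag[row] = tag
            self.modified[row] = False
        if write:
            self.modified[row] = True
        return "hit" if hit else "miss"


cache = DirectMappedCache()
results = [cache.access(0x00000, write=True),   # first touch of row 0
           cache.access(0x0001C),               # same cache line
           cache.access(0x10000)]               # same row, different tag
print(results, cache.writebacks)
# prints: ['miss', 'hit', 'miss'] 1
```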
88 Set-associative cache
- Usually 2 or 4 direct-mapped lines per row
- All tag fields are simultaneously compared
- On a cache miss, one of the lines must be discarded
- Which one?
- (LRU) Least Recently Used
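LRU replacement in a 2-way set-associative cache can be sketched with one ordered dictionary per row, where the oldest entry is the least recently used. The class name and the small default sizes are chosen only to keep the example short; real hardware compares all tags in a row in parallel rather than by lookup.

```python
from collections import OrderedDict


class SetAssociativeCache:
    def __init__(self, rows: int = 4, ways: int = 2, line_bytes: int = 32):
        self.rows = rows
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(rows)]

    def access(self, address: int) -> str:
        block = address // self.line_bytes
        row, tag = block % self.rows, block // self.rows
        entries = self.sets[row]
        if tag in entries:
            entries.move_to_end(tag)       # mark as most recently used
            return "hit"
        if len(entries) == self.ways:
            entries.popitem(last=False)    # discard the least recently used
        entries[tag] = True
        return "miss"


cache2 = SetAssociativeCache()
# Addresses 0, 128, and 256 all map to row 0 with tags 0, 1, and 2.
trace = [cache2.access(a) for a in (0, 128, 0, 256, 0, 128)]
print(trace)
# prints: ['miss', 'miss', 'hit', 'miss', 'hit', 'miss']
```

In a direct-mapped cache of the same size, addresses 0 and 128 would evict each other on every reference; with two lines per row both can stay resident.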
89 Writing to a cache
- When should the copy in main memory be updated?
- Write through
- Immediately update
- More memory traffic
- Write deferred or write back
- Wait until the cache line is replaced
- Write allocation
- For a cache miss on write, bring the line into the cache and write to it there
- This is in contrast to writing directly to memory
- Usually used with write deferred
90 Microarchitecture examples
- Three architectures are considered
- Pentium 4
- UltraSPARC-III
- Intel 8051
- First two are very similar
- Three-bus architecture
- Pipelines
- Split cache
Note: We will skip the following textbook sections
Section 4.5.2 Branch prediction
Section 4.5.3 Out-of-order execution and register renaming
Section 4.5.4 Speculative execution
91 Microarchitecture examples
- Pentium 4
- CISC architecture on the outside (at the ISA level)
- The way it appears to assembly language programmers
- Huge and unwieldy instruction set backward compatible with 8088
- Only 8 visible registers: EAX, EBX, ECX, EDX, etc.
- 32-bit architecture with 64-bit memory bus
- RISC architecture on the inside (at microarchitecture level)
- Microarchitecture named NetBurst
- Complete break from Pentium III and earlier microarchitectures
- Up to 126 microinstructions active at a time
- 120 scratch registers
- Two double-speed integer ALUs and two double-speed floating-point ALUs
- 12 billion integer operations possible each second at 3 GHz
- The Mic-4 resembles the Pentium 4 in many ways
- However, Pentium 4 has out-of-order execute capability
- Read on your own
- Textbook pages 312 - 317
92 Overview of the NetBurst Microarchitecture
93 Microarchitecture examples
- UltraSPARC-III Cu
- Cu indicates copper wiring on chip (not aluminum)
- No microarchitecture level
- True RISC architecture
- Needs special hardware for graphics and multimedia instructions
- 64-bit data path and registers
- 128-bit memory bus
- Microarchitecture much simpler than Pentium 4
- There is a simpler ISA level to implement
- 14-stage pipeline
- Read on your own
- Textbook pages 317 - 323
94 14-stage UltraSPARC-III pipeline
95 Microarchitecture examples
- Intel 8051
- Similar to Mic-1, but more RISC-like than CISC-like
- Only about 60,000 transistors
- Primary design goal: cheap, rather than fast
- No pipelining, no caching, and in-order issue, execute, and retirement
- Single main bus
- Registers ACC, B, and SP
- Similar to Intel 8088's AX, BX, and SP
- TMP1 and TMP2 are latches for ALU
- For embedded applications there are . . .
- Three 16-bit timers for real-time control
- Four 8-bit I/O ports
- Read on your own
- Textbook pages 323 - 325
96 Intel 8051