Chapter 4 The Microarchitecture Level

1
Chapter 4 The Microarchitecture Level

CS 271 Computer Architecture
Indiana University Purdue University Fort Wayne
Mark Temte

2
Microarchitecture level context
3
Microarchitecture level

The ISA level instructions are also known as
macroinstructions
Familiar from assembly language
ADD, LOAD, STORE, BRANCH, etc.

Java: c = a + b
Assembly language (R3 is register 3):
LOAD R3, a
ADD R3, b
STORE R3, c
BRANCH L4
4
Microarchitecture level

The control unit within the CPU must generate
signals to fetch and execute each of the ISA
level macroinstructions
How?
Create a microcomputer within the control unit
This microcomputer runs microprograms consisting
of microinstructions that act on the data path
There is one microprogram for each
macroinstruction
Execution of the microprogram interprets the
corresponding macroinstruction

5
Data path

The data path of the CPU consists of those parts
exclusive of the control unit
Consists of the ALU, registers, and internal
buses
Example
The following slide shows the data path of a
fictitious computer called IJVM
Integer Java Virtual Machine
32-bit data path
32-bit registers
32 1-bit ALUs

6
(No Transcript)
7
Data path

The ALU has 6 control lines
F0, F1: select the ALU function (AND, OR, COMP, SUM)
ENA: gate A inputs into ALU
ENB: gate B inputs into ALU
INVA: complement A inputs
INC: assert carry into low-order bit of ALU
The shifter has 2 control lines with 3 actions
Negating both lines causes no shift
SLL8: Shift Left Logical 8 (shift left 1 byte
with 0 fill)
SRA1: Shift Right Arithmetic 1 (shift right 1 bit,
not changing the leftmost bit)
This divides a two's complement number by 2

8
Example dividing by 2

Divide two's complement representation of -14 by 2

Let n = 6 bits
Represent magnitude 14 (decimal): 001110
Complement each bit: 110001
Add 1: + 1
Result: -14 (decimal) = 110010
Apply SRA1 to -14 = 110010
Obtain 111001
What is this? 111001
Complement each bit: 000110
Add 1: + 1
Obtain 7 (decimal) = 000111
Thus SRA1 produced 111001 = -7 (decimal)
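The SRA1 example above can be checked with a short sketch. The helper names `sra1` and `to_signed` are illustrative only and are not part of the slides; bit patterns are held as non-negative Python integers.

```python
def sra1(value: int, n_bits: int) -> int:
    """Shift an n-bit pattern right by one bit, keeping the leftmost bit."""
    sign = value & (1 << (n_bits - 1))  # isolate the leftmost (sign) bit
    return (value >> 1) | sign          # shift, then restore the sign bit


def to_signed(value: int, n_bits: int) -> int:
    """Interpret an n-bit pattern as a two's complement integer."""
    if value & (1 << (n_bits - 1)):
        return value - (1 << n_bits)
    return value


pattern = 0b110010                      # -14 in 6 bits
shifted = sra1(pattern, 6)
print(format(shifted, "06b"), to_signed(shifted, 6))  # 111001 -7
```

Note that for an odd negative number the result is rounded toward negative infinity: applying SRA1 to -7 gives -4, not -3.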
9
Recall . . .
10
(No Transcript)
11
Example

How could you increment the SP register?
Look at the data path again
Enable SP to the B bus
Compute B + 1 as follows . . .
Assert ENB
Assert lines for SUM
Assert INC
No shift
When shifter output has stabilized, write the C
bus back into the SP
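The control-line settings for incrementing SP can be tried out on a small software model of the ALU. This is only a sketch: the function `alu` is hypothetical, and the F0/F1 encoding used here (00 = AND, 01 = OR, 10 = COMP of B, 11 = SUM) is an assumption taken from the order in which the slide lists the functions.

```python
MASK = 0xFFFFFFFF  # 32-bit data path


def alu(a, b, f0, f1, ena, enb, inva, inc):
    """Model of the six ALU control lines (F0/F1 encoding is assumed)."""
    a_in = a if ena else 0          # ENA gates the A input
    if inva:
        a_in = ~a_in & MASK         # INVA complements the A input
    b_in = b if enb else 0          # ENB gates the B input
    if (f0, f1) == (0, 0):
        out = a_in & b_in           # AND
    elif (f0, f1) == (0, 1):
        out = a_in | b_in           # OR
    elif (f0, f1) == (1, 0):
        out = ~b_in & MASK          # COMP (complement of B)
    else:
        out = a_in + b_in + (1 if inc else 0)  # SUM, INC is the carry in
    return out & MASK


# Increment SP: SP on the B bus, ENB asserted, SUM selected, INC asserted.
sp = 0x1000
sp = alu(a=0, b=sp, f0=1, f1=1, ena=0, enb=1, inva=0, inc=1)
print(hex(sp))  # 0x1001
```

Because ENA is not asserted, the A input is forced to 0, so the sum is simply B plus the carry in.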

12
Incrementing the SP register

Precise timing of the write pulse to SP is
important

13
Memory operations

There are two ports to memory
MAR / MDR
32-bit data port
Load the MAR with a 30-bit address
The address (multiplied by 4) goes to memory
To multiply, simply shift the address left by 2
bits
The MDR receives data (READ) or provides data
(WRITE)
PC / MBR
8-bit data port
Only for reading
Load the PC with a 32-bit address
The address goes to memory
The MBR receives a byte
Usually an ISA instruction code
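As a small illustration of the MAR addressing rule, the word address in MAR becomes a byte address by shifting left 2 bits. The function name `mar_to_byte_address` is made up for this sketch.

```python
def mar_to_byte_address(mar: int) -> int:
    """Convert a word address in MAR to the byte address sent to memory."""
    return (mar << 2) & 0xFFFFFFFF  # shifting left by 2 multiplies by 4


for word_address in (0, 1, 2, 0x100):
    print(word_address, "->", mar_to_byte_address(word_address))
```

Word 1 therefore starts at byte 4, word 2 at byte 8, and so on. The PC, by contrast, is already a byte address and is used unchanged.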

14
Memory operations

The MBR is gated onto the B bus in two ways
Signed (with sign extension)
Unsigned
There is one control signal for each
Only one of these signals may be asserted at a
time
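The two ways of gating the 8-bit MBR onto the 32-bit B bus can be sketched as follows. The function names are illustrative, not taken from the slides.

```python
def mbr_unsigned(mbr: int) -> int:
    """Gate MBR onto the B bus with the upper 24 bits set to 0."""
    return mbr & 0xFF


def mbr_signed(mbr: int) -> int:
    """Gate MBR onto the B bus, copying bit 7 into the upper 24 bits."""
    byte = mbr & 0xFF
    if byte & 0x80:
        return byte | 0xFFFFFF00
    return byte


print(hex(mbr_unsigned(0xFE)))  # 0xfe
print(hex(mbr_signed(0xFE)))    # 0xfffffffe, i.e. -2 as a 32-bit value
```

The unsigned form suits values such as an index, while the signed form suits values such as a byte-sized constant that may be negative.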

15
Control signals for the data path

There are 29 signals in all
9 - Selects a register to gate to the B-bus
A 4-bit code is decoded 16 ways
Only 9 ways are used
Saves 5 bits
9 - Selects a register to load from the C-bus
8 - ALU and shifter operations
2 - read / write using MAR / MDR
1 - fetch using PC / MBR
These control each data path cycle
Falling edge of clock to next rising edge

16
Microinstruction format

Each microinstruction sets up control signals on
the data path for the next data path cycle
Each microinstruction is 36 bits
24 control bits for the data path
24 = 29 - 5
9-bit address of the next microinstruction
3-bit condition code for branching
Note each microinstruction specifies its
successor
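A sketch of how the 36 bits might be packed is shown here. The field widths follow the counts given on the slides (9 + 3 + 8 + 9 + 3 + 4 = 36); the field order and the names `FIELDS`, `pack`, and `unpack` are assumptions made for illustration.

```python
# (name, width) from most significant to least significant; order is assumed.
FIELDS = [
    ("next_address", 9),  # address of the next microinstruction
    ("jam", 3),           # condition code for branching
    ("alu", 8),           # ALU and shifter operations
    ("c", 9),             # registers to load from the C bus
    ("mem", 3),           # read, write, fetch
    ("b", 4),             # encoded B-bus register select
]


def pack(**values) -> int:
    """Build a 36-bit microinstruction word from named field values."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a 36-bit microinstruction word back into its fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


word = pack(next_address=0x060, jam=0b001, alu=0x3C, b=0b0100)
print(sum(width for _, width in FIELDS), unpack(word))
```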

17
Microinstruction format
18
Mic-1 architecture

Mic-1 is an example architecture we will study
Consists of . . .
Control store
512 x 36 bit memory for all the microprograms
This is ROM
MPC
MicroProgram Counter
MIR
MicroInstruction Register

19
(No Transcript)
20
Mic-1 fetch / execute cycle

At the falling edge of the clock, the MIR is
loaded
The following components then operate and
stabilize
Decoder
B-bus
ALU
Shifter
C-bus
Also, the N and Z outputs from the ALU go to
flip-flops
At the next rising edge of the clock
Registers and N, Z flip-flops are loaded
MDR and MBR are loaded from memory
The address of the next microinstruction is
calculated while the clock is high and the cycle
repeats

21
Recall the timing of the data path cycle
22
MPC address calculation

Address is just NEXT_ADDRESS when JAM = 000
However . . .
JAMN = 1 causes OR of N-bit with high-order MPC
bit
JAMZ = 1 causes OR of Z-bit with high-order MPC
bit
JMPC = 1 causes bitwise OR of MBR and 8 low-order
bits of NEXT_ADDRESS
Typically, NEXT_ADDRESS = 0 when JMPC = 1
This permits a branch to the address in the MBR
This address typically is identical to the ISA
op-code
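The address calculation can be written out as a short sketch. The function `next_mpc` and its parameter names are hypothetical; the logic follows the rules listed on the slide for a 9-bit MPC.

```python
def next_mpc(next_address, jamn=0, jamz=0, jmpc=0, n=0, z=0, mbr=0):
    """Compute the 9-bit MPC from NEXT_ADDRESS, the JAM bits, N, Z and MBR."""
    # High-order bit: NEXT_ADDRESS[8] ORed with N and/or Z when jammed.
    high = ((next_address >> 8) & 1) | (jamn & n) | (jamz & z)
    # Low-order 8 bits: NEXT_ADDRESS[7:0], ORed with MBR when JMPC = 1.
    low = next_address & 0xFF
    if jmpc:
        low |= mbr & 0xFF
    return (high << 8) | low


print(hex(next_mpc(0x092)))                    # 0x92  (JAM = 000)
print(hex(next_mpc(0x092, jamz=1, z=1)))       # 0x192 (branch taken)
print(hex(next_mpc(0x000, jmpc=1, mbr=0x60)))  # 0x60  (dispatch on op-code)
```

Since a conditional branch can only change the high-order bit, its two possible targets differ by exactly 256 (0x100).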

23
MPC address calculation
24
ISA (macroarchitecture) of the IJVM

Memory model
Format of methods
The IJVM instruction set
Local variable frames and the operand stack
How a method call is implemented

25
The IJVM memory model
26
The IJVM memory model

The constant pool
Contains constants, strings, and pointers
E.g., pointer to the base address of each method
Loaded when the program is loaded
Register CPP points to the base of the constant
pool
The constant pool is read-only
The method area
Contains method code
Register PC points to the next instruction
Organized as a byte array
Operand stack

27
Method format

The executable code in a method is preceded by .
. .
Two bytes giving the number of parameters
Two bytes giving the size of the local variable
area
The local variable area size is needed to
initialize the SP to the top of the local
variable frame

(Diagram: method layout, in order: number of parameters, size of LV area, executable code, with PC pointing into the executable code)
28
The IJVM instruction set

The IJVM instruction set appears on the next
slide
There are 20 instructions altogether
Many of the instructions require just a single
byte
These have no operands
DUP, IADD, IAND, IOR, IRETURN, ISUB, NOP, POP,
SWAP, WIDE
Others have an additional single 1-byte operand
BIPUSH, ILOAD, ISTORE
Some have a single 2-byte operand
GOTO, IFEQ, IFLT, IF_ICMPEQ, INVOKEVIRTUAL, LDC_W
One has two 1-byte operands
IINC

29
(No Transcript)
30
Using the IADD instruction

To add local variables j and k and save the sum
in local variable i . . .

ILOAD j   // push a copy of local variable j on the top of the stack
ILOAD k   // push a copy of local variable k on the top of the stack
IADD      // pop 2 words from the stack and push their sum back
ISTORE i  // pop top word from stack and store in local variable i
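The effect of these four instructions on the stack can be traced with a tiny interpreter. This is a sketch covering only ILOAD, IADD, and ISTORE; the function `run` and the use of a dictionary for local variables are simplifications made for illustration, not how IJVM stores them.

```python
def run(program, local_vars):
    """Interpret a list of (opcode, operand...) tuples on an operand stack."""
    stack = []
    for op, *args in program:
        if op == "ILOAD":
            stack.append(local_vars[args[0]])      # push a copy of the variable
        elif op == "IADD":
            b = stack.pop()
            a = stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)     # 32-bit wraparound sum
        elif op == "ISTORE":
            local_vars[args[0]] = stack.pop()      # pop into the variable
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return local_vars


program = [("ILOAD", "j"), ("ILOAD", "k"), ("IADD",), ("ISTORE", "i")]
print(run(program, {"i": 0, "j": 3, "k": 4}))  # {'i': 7, 'j': 3, 'k': 4}
```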
31
Sample program fragment
Note: The branch instruction IF_ICMPEQ has a
16-bit signed offset that is added to the address
of the current op-code to target L1
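The branch target arithmetic can be sketched as below. The function `branch_target` is hypothetical, and the assumption that the high-order offset byte comes first in the instruction stream is labeled as such in the code.

```python
def branch_target(opcode_address: int, offset_hi: int, offset_lo: int) -> int:
    """Add a signed 16-bit offset to the address of the branch op-code."""
    # Assumption: the high-order byte of the offset is stored first.
    offset = ((offset_hi & 0xFF) << 8) | (offset_lo & 0xFF)
    if offset & 0x8000:          # sign bit set: the offset is negative
        offset -= 0x10000
    return opcode_address + offset


print(branch_target(100, 0x00, 0x0A))  # 110 (forward branch)
print(branch_target(100, 0xFF, 0xFD))  # 97  (backward branch by 3)
```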
32
The local variable frame

The local variable frame is where the local
variables of a method are stored
A new local variable frame is created whenever a
method is called
Each local variable frame is pushed onto a stack
in memory called the operand stack
The stack space occupied by a local variable
frame is released when the associated method
returns

33
Operand stack example

Suppose method A calls method B, which calls
method C
The SP (Stack Pointer) register holds the index
of the top of the stack
The LV (Local Variable pointer) register holds
the base address of the local variable frame

(Diagram: operand stack from top to bottom: frame for C, with SP at its top and LV at its base, then frame for B, then frame for A)
34
Operand stack example

Note how the stack space for B and C is recycled

35
Detailed local variable frame structure

The local variable frame also . . .
Holds all the parameters set up on the stack in
advance by the caller
Saves the LV and PC registers of the caller
The saved PC value is the return address within
the caller

36
Detailed local variable frame structure
37
Calling a method

Call a method using instruction
INVOKEVIRTUAL disp
Parameter disp gives the position in the constant
pool holding a pointer to the called method
INVOKEVIRTUAL does the following
Sets register LV to the value in SP - (number of
parameters)
Sets the value in the location pointed to by LV to
the value in register SP + (number of local
variables) + 1
Increments register SP by (number of local variables)
Pushes the caller's register PC (return address) on
the stack
Sets register PC to the 5th byte in the called
method
Pushes the caller's original LV value on the stack

38
Intermediate results

The operand stack is used for storing method
intermediate results
These are pushed on the operand stack above the
local variable frame
The return result is the final intermediate
result
It is always left immediately above the local
variable frame
The other intermediate results have already been
popped
Look at Figure 4-9 again
IRETURN reverses the steps of INVOKEVIRTUAL

39
Returning from a method
40
The Mic-1 microprogram for IJVM

Recall that there is one Mic-1 microprogram for
each of the IJVM macroinstructions
There is also a microprogram for instruction
fetch
Altogether, these microprograms are referred to
as the Mic-1 microprogram for the IJVM
Microinstructions are described using a special
notation
36 bits could be used instead for each
microinstruction
It is more readable to indicate how the bits
should be set rather than what they are set to
Caution be sure that what is indicated by the
notation is physically possible

41
Microinstruction notation

Some examples
Everything on a line is done in one clock cycle
The desired result must be physically possible
For example, MDR = SP + MDR is illegal, since the
ALU needs one of its inputs to come from register H

PC = PC + 1; fetch; goto (MBR)
MAR = SP = SP - 1; rd
H = TOS
MDR = TOS = MDR + H; wr; goto Main1
42
Sequencing of instructions

All instructions have an implicit or explicit goto
Sequential instructions are not necessarily
sequential in the control store
The microinstruction sequence for a
macroinstruction starts at the control store
address that corresponds to the numerical value
of the macroinstruction's op-code
For example, the IADD op-code is 0x60 and the
microinstruction sequence starts at location 0x60
The following microinstruction can be located
anywhere in the control store

43
Microinstruction branching

Example
Pass TOS through the ALU and look at the Z bit
Z = TOS; if (Z) goto L1; else goto L2
L1 and L2 must be exactly 256 locations apart
Example
Unconditional branch to instruction pointed to by
the MBR
goto (MBR)
Convention: At the start of any
macroinstruction, register TOS always contains a
copy of the value at the top of the operand stack
Register OPC is a scratch register
Often saves the op-code
44
The Mic-1 microprogram for IJVM

There are 112 microinstructions in all
Starts with the line labeled Main1
Before the macroprogram runs . . .
the PC contains the address just before the 1st
macroinstruction
the MBR contains 0 (the NOP op-code)
Main1 fetches the next macroinstruction op-code
and branches to the start of the microinstruction
sequence for the current macroinstruction
The last microinstruction in the sequence
branches back to Main1
On the following slides, focus on instructions
marked with

45
The Mic-1 microprogram for IJVM
46
The Mic-1 microprogram for IJVM
47
The Mic-1 microprogram for IJVM
48
The Mic-1 microprogram for IJVM
49
The Mic-1 microprogram for IJVM
50
Design issues

We will modify the Mic-1 design in order to
increase performance
Changes involve . . .
Eliminating decoding
Reducing the path length
The path length is the average number of
microinstructions per macroinstruction
The path length can be reduced by . . .
Eliminating Main1
Using a 3-bus architecture
Adding an independent fetch unit

51
Eliminating decoding

Decoding the B-bus slows the potential clock rate
The decoding must be completed before anything
else can happen
Cost to eliminate decoding
5 bits in each microinstruction
Altogether, 41 bits will be needed instead of 36

52
Eliminating Main1

At Main1 there is a microinstruction to fetch the
opcode of the next macroinstruction
This microinstruction can be eliminated by
merging its code onto the end of the microcode
sequence of each macroinstruction
Usually this can be done in parallel with other
activity for a saving of 1 cycle
This may not always be possible

Main1: PC = PC + 1; fetch; goto (MBR)
53
Eliminating Main1

Microinstruction sequence for POP with Main1 code
merged onto the end

The original order of microinstruction execution
for POP
54
Three-bus architecture

This change allows two registers to be added in
just one clock cycle
There is no need to waste a cycle moving one of
the registers to the H register earlier

55
Adding an independent fetch unit

This new specialized functional unit is called
the IFU
Instruction Fetch Unit
It independently fetches macroinstruction
opcodes and processes macroinstruction operands
Operands like varnum, disp, offset , etc.
This eliminates the Main1 microinstruction
entirely
No longer necessary to merge Main1 code onto the
end of each microcode sequence

56
Adding an independent fetch unit

The IFU gives a dramatic improvement in
performance, but . . .
The IFU is surprisingly complicated
Due to branching and operand handling
There are some necessary changes in the data path
due to the IFU
In addition to MBR, a new 2-byte register MBR2 is
added to the data path for holding 2-byte
operands
This eliminates the need to combine two bytes in
the data path to form an offset or disp
The old MBR is renamed MBR1

57
The IFU

The PC is now updated by the microprogram only
when a branch occurs
The IFU maintains its own copy of the PC in a
private register called IMAR
The IFU increments the IMAR independently of the
data path
The IFU reads 4 bytes at a time from the user
program into a special shift register capable of
holding 5 bytes

58
The IFU
59
Mic-2

The revised microarchitecture is called Mic-2
Mic-2 includes . . .
3-bus architecture
Prefetching using the IFU
Shorter microprogram
81 microinstructions instead of 112
Major performance gain

60
Mic-2
61
The new microprogram for Mic-2
62
The new microprogram for Mic-2
63
(No Transcript)
64
Additional modifications

The clock cycle time can be reduced with a
pipelined design
We first add latch registers to the data path

65
Pipelined design

This design latches . . .
The A and B inputs to the ALU
Output from the ALU
The old clock cycle is broken into 3 microcycles
The clock is adjusted to run approximately 3
times as fast
Now parts of three microinstructions can be
processed in parallel
We need to add a cache memory so memory
operations can keep up
The ALU is active every cycle
Not just in the middle of the old cycle

66
The pipeline in action
67
The SWAP instruction
SWAP with pipelining
68
The SWAP instruction

With pipelining, note the need to stall the
pipeline occasionally
The third microinstruction caused the pipeline to
stall for two cycles
The SWAP now requires only 11 microcycles instead
of 3 x (6 normal cycles) = 18 microcycles

69
Mic-3

The revised microarchitecture is called Mic-3
Mic-3 includes a 4-stage pipeline with stages . .
.
Fetch
Latch A and B
Calculate with the ALU
Writeback

70
Additional modifications

Mic-3 still has a problem
Various microinstructions contain microbranches
Conditional branch
Branch with a target microinstruction not known
in advance
For example, the last microinstruction in a
sequence always branches to a target not known in
advance
Consider the swap6 microinstruction
The next microinstruction cannot be prefetched
This could cause havoc with the microinstruction
pipeline
There is a separate MIR for each microinstruction
in the pipeline
The pipeline must stall until the next
microinstruction is known
The next microinstruction must be anticipated
Add two more components to the design
Decoding unit
Queueing unit

71
Decoding unit

The decoding unit knows which incoming bytes are
opcodes and which are operands like varnum and
disp
The incoming opcode is an index into a ROM table
within the decoding unit
The indexed row gives . . .
The number of bytes associated with the
opcode
This allows the decoding unit to know when it
fetches the next opcode
The address in the control store of the first
microinstruction of the sequence associated with
the opcode

72
Queueing unit

The queueing unit contains . . .
The old control store (ROM)
The microinstructions in the control store for a
given sequence are now consecutive rather than
scattered
No need for each microinstruction to designate
its successor
A hardware queue of microinstructions (RAM)
The microinstruction queue holds the proper
sequence of microinstructions across ISA
macroinstruction boundaries

73
Queueing unit

Microinstructions have a modified format
No longer need the NEXT_ADDRESS field
No longer have JAM bits
Have added bits for selecting the A bus
Also there are two new bits in each
microinstruction
Final bit
Goto bit

74
Queueing unit

The Final bit is set in the last
microinstruction in each sequence
It is used to indicate the end of the sequence
for the current macroinstruction and reactivate
the IFU
The Goto bit marks microinstructions that have
conditional branches (at the ISA level)
These microinstructions have a different format
from other microinstructions
Have JAM bits
Contain an index into the control store

75
Queueing unit operation (input side)

Starting with the first microinstruction of a
sequence, the queueing unit . . .
Copies sequential instructions from the control
store into the hardware queue of
microinstructions
Copying continues through the first
microinstruction with the Final bit set
If the Goto bit is not set, the queueing unit . .
.
Gets the index associated with the next
opcode from the decoding unit
Continues copying microinstructions from the
sequence for the new opcode into the hardware
queue of microinstructions
Copying continues until a Goto bit is set or the
queue of microinstructions is full

76
Queueing unit operation (input side)

When the Goto bit is set (conditional branch)
The queueing unit stops copying microinstructions
from the control store into its hardware queue
The unit stalls until the microbranch has been
resolved
The fetch queue in the IFU may have to be cleaned
up also

77
Queueing unit operation (output side)

On the output side, the queueing unit
Dequeues microinstructions from its queue
Feeds them into a queue of four MIRs
One MIR for each stage of the data path part of
the pipeline

78
(No Transcript)
79
Mic-4

The revised microarchitecture is called Mic-4
Mic-4 includes a 7-stage pipeline with stages . .
.
IFU
Decoding unit
Queueing unit
Latch operands
ALU
Register writeback
Memory
See circled numbers on Figure 4-35

80
Cache memory

The bottleneck in the Mic-4 design is with memory
Memory latency is the delay for read and write
Memory bandwidth is the number of bytes involved
in each read or write
For a given memory technology, an increase in
bandwidth causes an increase in latency
The fastest memory technology is not cost
effective
Cache memory is the cost effective alternative

CPU
cache memory
main memory
81
Cache memory terminology

Spatial locality
Nearby addresses are likely to be needed soon
Bring in more bytes than needed from the vicinity
of each reference for later use
Temporal locality
Recently used addresses are likely to be needed
again
Don't discard these right away

82
Cache memory terminology

Cache line
The block of bytes brought in when a cache miss
occurs
Typically 4, 8, 16, 32, or 64 consecutive bytes
Unified cache
Contains both data and instructions
Split cache
Separate caches for data and instructions
Allows parallel access
Effectively doubles bandwidth
Instruction cache usually read-only from the CPU

83
Several levels of cache are common
84
Direct-mapped cache

A direct-mapped cache is organized into rows
Each row contains
Valid bit
Set whenever the row is loaded
Bit is clear only when cache line is empty
Tag
Consists of the high-order address bits
Cache line (the data)
The next slide is an example of a direct-mapped
cache
with 2048 rows
with a 32-byte cache line

85
(No Transcript)
86
Direct-mapped cache

The example cache responds to 32-bit addresses
The 11-bit line field selects the row of the
cache
The 3-bit word field selects the word of data
within the cache line
The 2-bit byte field selects a byte within the
word
Each row of the cache is shared by all addresses
with the same line field bits
The 16 tag bits of the address are loaded into
the 16-bit tag field when the cache line is loaded
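The field breakdown for the example cache (16-bit tag, 11-bit line, 3-bit word, 2-bit byte) can be sketched as follows. The function name `split_address` is illustrative only.

```python
def split_address(addr: int):
    """Split a 32-bit address into (tag, line, word, byte) fields."""
    byte = addr & 0x3             # bits 1..0: byte within the word
    word = (addr >> 2) & 0x7      # bits 4..2: word within the cache line
    line = (addr >> 5) & 0x7FF    # bits 15..5: row of the cache
    tag = (addr >> 16) & 0xFFFF   # bits 31..16: compared with the tag field
    return tag, line, word, byte


tag, line, word, byte = split_address(0x12345678)
print(hex(tag), hex(line), word, byte)  # 0x1234 0x2b3 6 0
```

Any two addresses that differ by a multiple of 65,536 (2048 rows x 32 bytes) have the same line field and therefore compete for the same row.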

87
Direct-mapped cache

When the cache is referenced . . .
The tag bits of the address are compared with the
bits in the tag field of the row selected by the
line bits
A cache hit occurs if the tag bits are the same
A cache miss occurs if the tag bits are different
Cache hit
The needed word or byte of the cache line is read
or written
Cache miss
The existing cache line must be written back to
memory if it has been modified
Replace the cache line with the new data from
memory
Update the tag field
Read or write the needed word or byte
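The hit and miss handling above can be modeled with a sketch that tracks only valid bits, tags, and a modified flag, not the data itself. The class `DirectMappedCache` and its counters are hypothetical; the model assumes modified lines are written back only when replaced and that a write miss brings the line into the cache.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache (no data is stored)."""

    def __init__(self, rows=2048, line_bytes=32):
        self.rows = rows
        self.line_bytes = line_bytes
        self.valid = [False] * rows
        self.tag = [0] * rows
        self.dirty = [False] * rows   # True if the line has been modified
        self.hits = self.misses = self.writebacks = 0

    def access(self, addr, write=False):
        """Reference one address; return True on a hit, False on a miss."""
        line = (addr // self.line_bytes) % self.rows
        tag = addr // (self.line_bytes * self.rows)
        hit = self.valid[line] and self.tag[line] == tag
        if hit:
            self.hits += 1
        else:
            self.misses += 1
            if self.valid[line] and self.dirty[line]:
                self.writebacks += 1  # modified line goes back to memory
            self.valid[line] = True   # replace the line and update the tag
            self.tag[line] = tag
            self.dirty[line] = False
        if write:
            self.dirty[line] = True
        return hit


cache = DirectMappedCache()
print(cache.access(0), cache.access(4), cache.access(0x10000, write=True))
```

Addresses 0 and 4 fall in the same cache line, so the second reference hits; address 0x10000 maps to the same row with a different tag, so it misses and replaces that line.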

88
Set-associative cache

Usually 2 or 4 direct-mapped lines per row
All tag fields are simultaneously compared
On a cache miss, one of the lines must be
discarded
Which one?
(LRU) Least Recently Used
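LRU replacement within one row can be sketched with an ordered dictionary per row. The class `SetAssociativeCache` is a hypothetical tag-only model, and the tiny default sizes are chosen just to keep the trace readable.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only model of a set-associative cache with LRU replacement."""

    def __init__(self, rows=4, ways=2, line_bytes=32):
        self.rows = rows
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered dict per row: least recently used tag comes first.
        self.sets = [OrderedDict() for _ in range(rows)]

    def access(self, addr):
        """Reference one address; return True on a hit, False on a miss."""
        block = addr // self.line_bytes
        row = block % self.rows
        tag = block // self.rows
        entries = self.sets[row]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            return True
        if len(entries) == self.ways:
            entries.popitem(last=False)   # discard the least recently used
        entries[tag] = None
        return False


cache = SetAssociativeCache(rows=1, ways=2)
trace = [0, 32, 0, 64, 0, 32]
print([cache.access(a) for a in trace])
# [False, False, True, False, True, False]
```

In the trace, referencing address 64 evicts the line for address 32 rather than address 0, because address 0 was used more recently.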

89
Writing to a cache

When should the copy in main memory be updated?
Write through
Immediately update
More memory traffic
Write deferred or write back
Wait until the cache line is replaced
Write allocation
For a cache miss on write, bring the line into
the cache and write to it there
This is in contrast to writing directly to memory
Usually used with write deferred

90
Microarchitecture examples

Three architectures are considered
Pentium 4
UltraSPARC-III
Intel 8051
First two are very similar
Three-bus architecture
Pipelines
Split cache

Note: We will skip the following textbook sections
Section 4.5.2 Branch prediction
Section 4.5.3 Out-of-order execution and register renaming
Section 4.5.4 Speculative execution
91
Microarchitecture examples

Pentium 4
CISC architecture on the outside (at the ISA
level)
The way it appears to assembly language
programmers
Huge and unwieldy instruction set backward
compatible with 8088
Only 8 visible registers EAX, EBX, ECX, EDX,
etc.
32-bit architecture with 64-bit memory bus
RISC architecture on the inside (at
microarchitecture level)
Microarchitecture named NetBurst
Complete break from Pentium III and earlier
microarchitectures
Up to 126 microinstructions active at a time
120 scratch registers
Two double-speed integer ALUs and two
double-speed floating-point ALUs
12 billion integer operations possible each
second at 3 GHz
The Mic-4 resembles the Pentium 4 in many ways
However, Pentium 4 has out-of-order execution
capability
Read on your own
Textbook pages 312 - 317

92
Overview of the NetBurst Microarchitecture
93
Microarchitecture examples

UltraSPARC-III Cu
Cu indicates copper wiring on chip (not aluminum)
No microarchitecture level
True RISC architecture
Needs special hardware for graphics and
multimedia instructions
64-bit data path and registers
128-bit memory bus
Microarchitecture much simpler than Pentium 4
There is a simpler ISA level to implement
14-stage pipeline
Read on your own
Textbook pages 317 - 323

94
14-stage UltraSPARC-III pipeline
95
Microarchitecture examples

Intel 8051
Similar to Mic-1, but more RISC-like than
CISC-like
Only about 60,000 transistors
Primary design goal: cheap, rather than fast
No pipelining, no caching, and in-order issue,
execute, and retirement
Single main bus
Registers ACC, B, and SP
Similar to Intel 8088's AX, BX, and SP
TMP1 and TMP2 are latches for ALU
For embedded applications there are . . .
Three 16-bit timers for real-time control
Four 8-bit I/O ports
Read on your own
Textbook pages 323 - 325

96
Intel 8051
