Title: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
1. OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
- Chris Gilmore <grimjack@cs.pdx.edu>
- Portland State University/OMSE
2. Today
- Caches
- DLX Assembly
- CPU Overview
3. Computer System (Idealized)
(Diagram: CPU, Memory, Disk Controller, and Disk.)
4. The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
- Next Topic
- Simple caching techniques
- Many ways to improve cache performance
5. Recap: Levels of the Memory Hierarchy
(Diagram: Processor <-> Cache <-> Memory <-> Disk <-> Tape. Instructions and operands move between processor and cache, blocks between cache and memory, pages between memory and disk, and files between disk and tape. Upper levels are faster; lower levels are larger.)
6. Recap: Exploit Locality to Achieve Fast Memory
- Two Different Types of Locality:
- Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense
- Good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense
- Good choice for providing the user FAST access time.
7. Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
(Diagram: the processor exchanges Block X with the upper-level memory, which exchanges Block Y with the lower-level memory.)
8. The Art of Memory System Design
Workload or benchmark programs drive the processor, which presents the memory system with a reference stream: <op,addr>, <op,addr>, <op,addr>, ... where op = i-fetch, read, or write.
Optimize the memory system organization to minimize the average memory access time for typical workloads.
9. Example: Fully Associative
- Fully Associative Cache
- No Cache Index
- Compare the Cache Tags of all cache entries in parallel
- Example: with 32 B blocks, we need N 27-bit comparators
- By definition: Conflict Miss = 0 for a fully associative cache
(Diagram: the 32-bit address splits into a 27-bit Cache Tag (bits 31..5) and a Byte Select (bits 4..0, e.g. 0x01). Each entry holds a valid bit, a cache tag, and 32 bytes of cache data (bytes 0..31, 32..63, ...); all tags are compared in parallel.)
10. Example: 1 KB Direct Mapped Cache with 32 B Blocks
- For a 2^N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (Block Size = 2^M)
(Diagram: the block address splits into a Cache Tag (bits 31..10, e.g. 0x50, stored as part of the cache state), a Cache Index (bits 9..5, e.g. 0x01), and a Byte Select (bits 4..0, e.g. 0x00). The cache holds 32 entries (indexes 0..31), each with a valid bit, a cache tag, and 32 bytes of data; entry 1 is shown holding tag 0x50 and bytes 32..63, and the last entry holds bytes 992..1023.)
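To make the address split concrete, here is a minimal C sketch of how a 1 KB direct mapped cache with 32 B blocks decomposes a 32-bit address. The field widths follow the slide (5 byte-select bits, 5 index bits, 22 tag bits); the function names are illustrative, not from the course.

  #include <stdint.h>
  #include <stdio.h>

  /* 1 KB direct mapped cache, 32 B blocks (per the slide). */
  #define BYTE_SELECT_BITS 5
  #define INDEX_BITS       5

  static uint32_t byte_select(uint32_t addr) { return addr & ((1u << BYTE_SELECT_BITS) - 1); }
  static uint32_t cache_index(uint32_t addr) { return (addr >> BYTE_SELECT_BITS) & ((1u << INDEX_BITS) - 1); }
  static uint32_t cache_tag(uint32_t addr)   { return addr >> (BYTE_SELECT_BITS + INDEX_BITS); }

  int main(void) {
      uint32_t addr = 0x00014020;  /* arbitrary example address */
      printf("tag=0x%x index=0x%x byte=0x%x\n",
             cache_tag(addr), cache_index(addr), byte_select(addr));
      return 0;
  }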
11. Set Associative Cache
- N-way set associative: N entries for each Cache Index
- N direct mapped caches operate in parallel
- Example: Two-way set associative cache
- Cache Index selects a set from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
(Diagram: the Cache Index selects one set; each way holds a valid bit, a cache tag, and cache data for Cache Block 0. Both tags are compared against the address tag, the compare results are ORed into a Hit signal, and Sel0/Sel1 drive a mux that picks the matching way's cache block.)
12. Disadvantage of Set Associative Cache
- N-way Set Associative Cache versus Direct Mapped Cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the Hit/Miss decision and set selection
- In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss
- Possible to assume a hit and continue; recover later if miss.
13. Block Size Tradeoff
- Larger block size takes advantage of spatial locality, BUT:
- Larger block size means larger miss penalty
- Takes longer to fill up the block
- If block size is too big relative to cache size, miss rate will go up
- Too few cache blocks
- In general: Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
(Plots vs. Block Size: Miss Rate first falls as spatial locality is exploited, then rises when too few blocks compromise temporal locality; Miss Penalty grows with block size; Average Access Time therefore has a minimum at an intermediate block size.)
14. A Summary on Sources of Cache Misses
- Compulsory (cold start or process migration, first reference): first access to a block
- Cold fact of life: not a whole lot you can do about it
- Note: if you are going to run billions of instructions, compulsory misses are insignificant
- Conflict (collision)
- Multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity
- Capacity
- Cache cannot contain all blocks accessed by the program
- Solution: increase cache size
- Coherence (Invalidation): other process (e.g., I/O) updates memory
15. Sources of Cache Misses: Quiz
Assume constant cost. Fill in each cell:

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Small, Medium, or Big?
Compulsory Miss   ?
Conflict Miss     ?
Capacity Miss     ?
Coherence Miss    ?

Choices: Zero, Low, Medium, High, Same
16. Sources of Cache Misses: Answer

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Big             Medium                  Small
Compulsory Miss   Same            Same                    Same
Conflict Miss     High            Medium                  Zero
Capacity Miss     Low             Medium                  High
Coherence Miss    Same            Same                    Same

Note: if you are going to run billions of instructions, compulsory misses are insignificant.
17. Recap: Four Questions for Caches and the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
18. Q1: Where can a block be placed in the upper level?
- Block 12 placed in an 8-block cache:
- Fully associative, direct mapped, or 2-way set associative
- S.A. mapping = block number modulo number of sets
- Fully associative: block 12 can go anywhere
(Diagram: block positions 0-7. Direct mapped: only block 4 (12 mod 8); 2-way set associative: anywhere in set 0 (12 mod 4); fully associative: anywhere.)
19. Q2: How is a block found if it is in the upper level?
(Address fields: tag | index (set select) | block offset (data select).)
- Direct indexing (using index and block offset), tag compares, or a combination
- Increasing associativity shrinks the index and expands the tag
20. Q3: Which block should be replaced on a miss?
- Easy for Direct Mapped
- Set Associative or Fully Associative:
- Random
- FIFO
- LRU (Least Recently Used) - a sketch of LRU bookkeeping follows below
- LFU (Least Frequently Used)
- Miss rates (%) for LRU vs. Random:

Associativity:   2-way          4-way          8-way
Size             LRU   Random   LRU   Random   LRU   Random
16 KB            5.2   5.7      4.7   5.3      4.4   5.0
64 KB            1.9   2.0      1.5   1.7      1.4   1.5
256 KB           1.15  1.17     1.13  1.13     1.12  1.12
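As a rough illustration of LRU bookkeeping, here is a minimal C sketch that picks a victim way within one set using per-way age counters. The structure, names, and 4-way width are all illustrative assumptions, not from the slides.

  #include <stdint.h>

  #define WAYS 4  /* 4-way set associative, for illustration */

  typedef struct {
      uint32_t tag[WAYS];
      int      valid[WAYS];
      uint32_t age[WAYS];   /* higher age = less recently used */
  } CacheSet;

  /* Pick the LRU victim: an invalid way if any, else the oldest way. */
  static int lru_victim(const CacheSet *set) {
      int victim = 0;
      for (int w = 0; w < WAYS; w++) {
          if (!set->valid[w]) return w;            /* free slot wins */
          if (set->age[w] > set->age[victim]) victim = w;
      }
      return victim;
  }

  /* On an access that hits way `hit`, mark it most recent; age the others. */
  static void lru_touch(CacheSet *set, int hit) {
      for (int w = 0; w < WAYS; w++) set->age[w]++;
      set->age[hit] = 0;
  }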
21. Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
- Is the block clean or dirty?
- Pros and cons of each?
- WT: read misses cannot result in writes; coherency is easier
- WB: no repeated writes of the same block to lower-level memory
- WT is always combined with write buffers so writes don't wait for lower-level memory
22. Write Buffer for Write Through
(Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM.)
- A Write Buffer is needed between the Cache and Memory
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory
- Write buffer is just a FIFO (see the sketch below)
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
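The slide's FIFO write buffer can be sketched in a few lines of C. The 4-entry depth matches the slide; everything else (names, types, the ring-buffer layout) is an illustrative assumption.

  #include <stdint.h>

  #define WB_ENTRIES 4  /* typical depth, per the slide */

  typedef struct { uint32_t addr, data; } WriteReq;

  typedef struct {
      WriteReq entry[WB_ENTRIES];
      int head, tail, count;     /* simple ring buffer */
  } WriteBuffer;

  /* Processor side: returns 0 on success, -1 if saturated (CPU must stall). */
  static int wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
      if (wb->count == WB_ENTRIES) return -1;   /* write buffer saturation */
      wb->entry[wb->tail] = (WriteReq){ addr, data };
      wb->tail = (wb->tail + 1) % WB_ENTRIES;
      wb->count++;
      return 0;
  }

  /* Memory controller side: drain one entry per DRAM write cycle. */
  static int wb_pop(WriteBuffer *wb, WriteReq *out) {
      if (wb->count == 0) return -1;
      *out = wb->entry[wb->head];
      wb->head = (wb->head + 1) % WB_ENTRIES;
      wb->count--;
      return 0;
  }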
23. Write Buffer Saturation
(Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM.)
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
- The store buffer will overflow no matter how big you make it
- Because the CPU Cycle Time < DRAM Write Cycle Time
- Solutions for write buffer saturation:
- Use a write back cache
- Install a second-level (L2) cache (does this always work?)
(Diagram: Processor <-> Cache <-> L2 Cache, with a Write Buffer between L2 and DRAM.)
24. Write-miss Policy: Write Allocate versus Not Allocate
- Assume a 16-bit write to memory location 0x0 causes a miss
- Do we read in the block?
- Yes: Write Allocate
- No: Write Not Allocate
(Diagram: the same 1 KB direct mapped cache as before; the address splits into Cache Tag (e.g. 0x00), Cache Index (e.g. 0x00), and Byte Select (e.g. 0x00), with entry 0 currently holding tag 0x50.)
25. Impact on Cycle Time
Cache Hit Time is directly tied to clock rate; it increases with cache size and with associativity.

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)
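These two formulas make a tiny calculator. Here is a minimal C sketch; the numbers in main (1-cycle hit, 5% miss rate, 50-cycle penalty, etc.) are invented purely for illustration.

  #include <stdio.h>

  /* AMAT = hit_time + miss_rate * miss_penalty (cycles) */
  static double amat(double hit_time, double miss_rate, double miss_penalty) {
      return hit_time + miss_rate * miss_penalty;
  }

  /* CPU time = IC * CT * (ideal CPI + memory stalls per instruction) */
  static double cpu_time(double ic, double ct, double ideal_cpi, double mem_stalls) {
      return ic * ct * (ideal_cpi + mem_stalls);
  }

  int main(void) {
      printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 */
      /* 1e9 instructions, 1 ns cycle, ideal CPI 1, 0.5 stall cycles/instr */
      printf("CPU time = %.3f s\n", cpu_time(1e9, 1e-9, 1.0, 0.5));
      return 0;
  }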
26. What happens on a Cache miss?
- For an in-order pipeline, 2 options:
- Freeze the pipeline in the Mem stage (popular early on: Sparc, R4000)
    IF ID EX Mem stall stall ... stall Mem Wr
       IF ID EX  stall stall ... stall stall Ex Wr
- Use Full/Empty bits in registers + an MSHR queue
- MSHR = Miss Status/Handler Registers (Kroft): each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
- Per cache line: keep info about the memory address.
- For each word: the register (if any) that is waiting for the result.
- Used to merge multiple requests to one memory line
- A new load creates an MSHR entry and sets the destination register to Empty. The load is released from the pipeline.
- Attempting to use the register before the result returns causes the instruction to block in the decode stage.
- Limited out-of-order execution with respect to loads. Popular with in-order superscalar architectures.
- Out-of-order pipelines already have this functionality built in (load queues, etc.).
27. Improving Cache Performance: 3 general options
Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
                           = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
28. Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
29. 3Cs: Absolute Miss Rate (SPEC92)
(Plot: miss rate vs. cache size, broken into conflict, capacity, and compulsory components; the compulsory component is vanishingly small.)
30. 2:1 Cache Rule
miss rate of a 1-way associative cache of size X
  ~= miss rate of a 2-way associative cache of size X/2
(Plot: the conflict component of the miss rate illustrates the rule.)
31. 3Cs: Relative Miss Rate
(Plot: the same data normalized; the conflict component dominates the variation.)
- Flaw: holds block size fixed. Good: insight => invention
32. 1. Reduce Misses via Larger Block Size
33. 2. Reduce Misses via Higher Associativity
- 2:1 Cache Rule:
- Miss Rate of a DM cache of size N ~= Miss Rate of a 2-way cache of size N/2
- Beware: execution time is the only final measure!
- Will clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
34. Example: Avg. Memory Access Time vs. Miss Rate
- Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

- (Green means A.M.A.T. not improved by more associativity)
- (AMAT = Average Memory Access Time)
35. 3. Reducing Misses via a Victim Cache
- How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a buffer to place data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
- Used in Alpha, HP machines
(Diagram: a direct mapped cache (TAGS/DATA) backed by a small fully associative victim cache of four lines, each with its own tag and comparator, in front of the next lower level in the hierarchy.)
36. 4. Reducing Misses by Hardware Prefetching
- E.g., Instruction Prefetching
- Alpha 21064 fetches 2 blocks on a miss
- The extra block is placed in a stream buffer
- On a miss, check the stream buffer
- Works with data blocks too
- Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
- Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
37. 5. Reducing Misses by Software Prefetching Data
- Data Prefetch
- Load data into register (HP PA-RISC loads)
- Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
- Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
- Is the cost of prefetch issues < the savings in reduced misses?
- Wider superscalar issue reduces the difficulty of issue bandwidth
38. 6. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
- Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)
- Data:
- Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
- Loop Interchange: change the nesting of loops to access data in the order stored in memory
- Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
- Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows (see the sketch after this list)
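As a concrete illustration of two of these transformations, here is a minimal C sketch of loop interchange and array merging; the array names and sizes are invented for the example.

  #define N 1024

  /* Loop interchange: C stores x[i][j] row-major, so the j-inside-i order
     walks memory sequentially (good spatial locality). */
  void interchange_good(double x[N][N]) {
      for (int i = 0; i < N; i++)          /* row */
          for (int j = 0; j < N; j++)      /* column: stride-1 accesses */
              x[i][j] = 2.0 * x[i][j];
  }

  /* The j-outer version strides N doubles per access: many more misses. */
  void interchange_bad(double x[N][N]) {
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)      /* stride-N accesses */
              x[i][j] = 2.0 * x[i][j];
  }

  /* Merging arrays: one array of compound elements keeps val[i] and key[i]
     in the same cache block, instead of two parallel arrays. */
  struct merged { int val; int key; };
  int find_key(struct merged m[N], int key) {
      for (int i = 0; i < N; i++)
          if (m[i].key == key) return m[i].val;
      return -1;
  }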
39. Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
40. 0. Reducing Penalty: Faster DRAM / Interface
- New DRAM Technologies
- RAMBUS: same initial latency, but much higher bandwidth
- Synchronous DRAM
- Better bus interfaces
- CRAY technique: only use SRAM
41. 1. Reducing Penalty: Read Priority over Write on Miss
(Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM.)
- A Write Buffer allows reads to bypass writes
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory
- Write buffer is just a FIFO
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
42. 1. Reducing Penalty: Read Priority over Write on Miss
- Write-Buffer Issues:
- Write through with write buffers creates RAW conflicts between main memory reads on cache misses and buffered writes
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
- Solution: check the write buffer contents before the read; if there are no conflicts, let the memory access continue
- Write Back?
- Read miss replacing a dirty block:
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less, since it can restart as soon as the read is done
43. 2. Reduce Penalty: Early Restart and Critical Word First
- Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Generally useful only with large blocks
- Spatial locality is a problem: we tend to want the next sequential word, so it is not clear whether early restart helps
44. 3. Reduce Penalty: Non-blocking Caches
- A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
- Requires F/E bits on registers or out-of-order execution
- Requires multi-bank memories
- "Hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise it cannot be supported)
- Pentium Pro allows 4 outstanding memory misses (a sketch of the per-miss bookkeeping follows below)
45. Value of Hit Under Miss for SPEC
(Plot: AMAT under "hit under n misses" for n = 0, 1, 2, 64.)
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB Data Cache, Direct Mapped, 32B blocks, 16-cycle miss
46. 4. Reduce Penalty: Second-Level Cache
(Diagram: Processor -> L1 Cache -> L2 Cache.)
- L2 Equations:
- AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
- Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
- So: AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
- Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
- The global miss rate is what matters
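Here is a minimal C sketch of these L2 equations, including the local vs. global miss rate distinction; all the numbers in main are invented for illustration.

  #include <stdio.h>

  /* AMAT for a two-level hierarchy, from the slide's L2 equations. */
  static double amat_l2(double hit_l1, double mr_l1,
                        double hit_l2, double mr_l2, double penalty_l2) {
      double miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2;
      return hit_l1 + mr_l1 * miss_penalty_l1;
  }

  int main(void) {
      double mr_l1 = 0.05;   /* L1 local miss rate */
      double mr_l2 = 0.20;   /* L2 local miss rate (of accesses reaching L2) */
      /* 1-cycle L1 hit, 10-cycle L2 hit, 100-cycle L2 miss penalty -> 2.50 */
      printf("AMAT = %.2f cycles\n", amat_l2(1.0, mr_l1, 10.0, mr_l2, 100.0));
      printf("global L2 miss rate = %.3f\n", mr_l1 * mr_l2);  /* 0.010 */
      return 0;
  }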
47. Reducing Misses: which apply to the L2 Cache?
- Reducing Miss Rate:
- 1. Reduce Misses via Larger Block Size
- 2. Reduce Conflict Misses via Higher Associativity
- 3. Reducing Conflict Misses via Victim Cache
- 4. Reducing Misses by HW Prefetching Instr, Data
- 5. Reducing Misses by SW Prefetching Data
- 6. Reducing Capacity/Conf. Misses by Compiler Optimizations
48. L2 cache block size and A.M.A.T.
- 32KB L1, 8 byte path to memory
49. Improving Cache Performance (Continued)
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache:
- Lower associativity (victim caching)?
- 2nd-level cache
- Careful virtual memory design
50. Summary 1/3
- The Principle of Locality:
- A program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space
- Three (+1) Major Categories of Cache Misses:
- Compulsory Misses: sad facts of life. Example: cold start misses.
- Conflict Misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect!
- Capacity Misses: increase cache size
- Coherence Misses: caused by external processors or I/O devices
- Cache Design Space:
- total size, block size, associativity
- replacement policy
- write-hit policy (write-through, write-back)
- write-miss policy
51. Summary 2/3: The Cache Design Space
- Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs write-back
- write allocation
- The optimal choice is a compromise:
- depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
- depends on technology / cost
- Simplicity often wins
(Diagram: axes for Cache Size, Associativity, and Block Size; a generic tradeoff curve of Factor A vs. Factor B running from Good to Bad as a parameter goes from Less to More.)
52. Summary 3/3: Cache Miss Optimization
- Lots of techniques people use to improve the miss rate of caches:

Technique                         MR   MP   HT   Complexity
Larger Block Size                 +    -         0
Higher Associativity              +         -    1
Victim Caches                     +              2
Pseudo-Associative Caches         +              2
HW Prefetching of Instr/Data      +              2
Compiler Controlled Prefetching   +              3
Compiler Reduce Misses            +              0

(MR = miss rate, MP = miss penalty, HT = hit time; + improves the factor, - hurts it.)
53. Onto Assembler!
- What is assembly language?
- A machine-specific programming language
- One-to-one correspondence between statements and native machine language instructions
- Matches the machine's instruction set and architecture
54. What is an assembler?
- A systems-level program
- Usually works in conjunction with the compiler
- Translates assembly language source code to machine language
- Object file: contains machine instructions, initial data, and information used when loading the program
- Listing file: contains a record of the translation process, line numbers, addresses, generated code and data, and a symbol table
55. Why learn assembly?
- Learn how a processor works
- Understand basic computer architecture
- Explore the internal representation of data and instructions
- Gain insight into hardware concepts
- Allows creation of small and efficient programs
- Allows programmers to bypass high-level language restrictions
- Might be necessary to accomplish certain operations
56. Machine Representation
- A language of numbers, called the processor's Instruction Set
- The set of basic operations a processor can perform
- Each instruction is coded as a number
- Instructions may be one or more bytes long
- Every number corresponds to an instruction
57. Assembly vs Machine
- Machine Language Programming:
- Writing a list of numbers representing the bytes of machine instructions to be executed and data constants to be used by the program
- Assembly Language Programming:
- Using symbolic instructions to represent the raw data that will form the machine language program and initial data constants
58. Assembly
- Mnemonics represent machine instructions
- Each mnemonic used represents a single machine instruction
- The assembler performs the translation
- Some mnemonics require operands
- Operands provide additional information
- register, constant, address, or variable
- Assembler directives
59. Instruction Set Architecture: a Critical Interface
software
  --- instruction set ---
hardware
The portion of the machine that is visible to the programmer or the compiler writer.
60. Good ISA
- Lasts through many implementations (portability, compatibility)
- Can be used for many different applications (generality)
- Provides convenient functionality to higher levels
- Permits an efficient implementation at lower levels
61. Von Neumann Machines
- Von Neumann invented the stored program computer in 1945
- Instead of program code being hardwired, the program code (instructions) is placed in memory along with data
(Diagram: Control and ALU connected to a single memory holding both program and data.)
62. Basic ISA Classes
- Memory to Memory Machines:
- Every instruction contains a full memory address for each operand
- Maybe the simplest ISA design
- However, memory is slow
- And memory is big (lots of address bits)
63. Memory-to-memory machine
- Assumptions:
- Two operands per operation; the first operand is also the destination
- Memory address: 16 bits (2 bytes)
- Operand size: 32 bits (4 bytes)
- Instruction code: 8 bits (1 byte)
- Example: A = B + C (hypothetical code)
- mov A, B   ; A <- B
- add A, C   ; A <- B + C
- 5 bytes per instruction (1 opcode byte + two 2-byte addresses)
- 4 bytes to fetch each operand
- 4 bytes to store the result
- add needs 17 bytes of traffic and mov needs 13 bytes
- Total: 30 bytes of memory traffic
64. Why CPU Storage?
- A small amount of storage in the CPU:
- To reduce memory traffic by keeping repeatedly used operands in the CPU
- Avoid re-referencing memory
- Avoid having to specify the full memory address of the operand
- This is a perfect example of "make the common case fast."
- Simplest case:
- A machine with 1 cell of CPU storage: the accumulator
65. Accumulator Machine
- Assumptions:
- Two operands per operation
- 1st operand in the accumulator
- 2nd operand in memory
- The accumulator is also the destination (except for store)
- Memory address: 16 bits (2 bytes)
- Operand size: 32 bits (4 bytes)
- Instruction code: 8 bits (1 byte)
- Example: A = B + C (hypothetical code)
- load B    ; acc <- B
- add C     ; acc <- B + C
- store A   ; A <- acc
- 3 bytes per instruction
- 4 bytes to load or store the second operand
- 7 bytes of traffic per instruction
- 21 bytes total memory traffic
66. Stack Machines
- Instruction sets are based on a stack model of execution
- Aimed for compact instruction encoding
- Most instructions manipulate the top few data items (mostly the top 2) of a pushdown stack
- The top few items of the stack are kept in the CPU
- Ideal for evaluating expressions (the stack holds intermediate results)
- Were thought to be a good match for high level languages
- Awkward:
- Become very slow if the stack grows beyond CPU local storage
- No simple way to get data from the middle of the stack
67. Stack Machines
- Binary arithmetic and logic operations:
- Operands: top 2 items on the stack
- Operands are removed from the stack
- Result is placed on top of the stack
- Unary arithmetic and logic operations:
- Operand: top item on the stack
- Operand is replaced by the result of the operation
- Data move operations:
- Push: place memory data on top of the stack
- Pop: move the top of the stack to memory (a sketch of the model follows below)
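To make the model concrete, here is a minimal C sketch of a stack machine evaluating A = B + C, the same running example used for the other ISA classes; the opcode names and memory layout are invented for illustration.

  #include <stdio.h>

  enum Op { PUSH, POP, ADD };            /* tiny illustrative opcode set */
  typedef struct { enum Op op; int addr; } Insn;

  static int mem[16] = { [1] = 7, [2] = 35 };  /* B at 1, C at 2, A at 0 */
  static int stack[8], sp;                     /* operand stack in the CPU */

  static void run(const Insn *prog, int n) {
      for (int i = 0; i < n; i++) {
          switch (prog[i].op) {
          case PUSH: stack[sp++] = mem[prog[i].addr]; break;  /* mem -> stack */
          case POP:  mem[prog[i].addr] = stack[--sp]; break;  /* stack -> mem */
          case ADD:  sp--; stack[sp - 1] += stack[sp]; break; /* top 2 -> 1 */
          }
      }
  }

  int main(void) {
      /* A = B + C:  push B; push C; add; pop A */
      Insn prog[] = { {PUSH, 1}, {PUSH, 2}, {ADD, 0}, {POP, 0} };
      run(prog, 4);
      printf("A = %d\n", mem[0]);   /* prints 42 */
      return 0;
  }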
68. General Purpose Register Machines
- With stack machines, only the top two elements of the stack are directly available to instructions. In general purpose register machines, the CPU storage is organized as a set of registers which are equally available to the instructions
- Frequently used operands are placed in registers (under program control)
- Reduces instruction size
- Reduces memory traffic
69. General Purpose Registers Dominate
- 1975-present: all machines use general purpose registers
- Advantages of registers:
- registers are faster than memory
- registers are easier for a compiler to use
- e.g., (A*B) - (C*D) - (E*F) can do the multiplies in any order
- registers can hold variables
- memory traffic is reduced, so the program is sped up (since registers are faster than memory)
- code density improves (since a register is named with fewer bits than a memory location)
70. Classifying General Purpose Register Machines
- General purpose register machines are sub-classified based on whether or not memory operands can be used by typical ALU instructions
- Register-memory machines: machines where some ALU instructions can specify at least one memory operand and one register operand
- Load-store machines: the only instructions that can access memory are the load and the store instructions
71. Comparing number of instructions
- Code sequence for A = B + C for five classes of instruction sets:

Register (register-memory):
  load  R1, B
  add   R1, C
  store A, R1

Register (load-store):       <- DLX/MIPS is one of these
  load  R1, B
  load  R2, C
  add   R1, R1, R2
  store A, R1

Stack:
  push B
  push C
  add
  pop A

Memory-to-memory:
  mov A, B
  add A, C

Accumulator:
  load B
  add C
  store A
72. Instruction Set Definition
- Objects: architectural entities (machine state)
- Registers
- General purpose
- Special purpose (e.g. program counter, condition code, stack pointer)
- Memory locations
- Linear address space: 0, 1, 2, ..., 2^s - 1
- Operations: instruction types
- Data operation
- Arithmetic
- Logical
- Data transfer
- Move (from register to register)
- Load (from memory location to register)
- Store (from register to memory location)
- Instruction sequencing
- Branch (conditional)
- Jump (unconditional)
73. Topic: DLX
- An instructional architecture
- Much nicer and easier to understand than x86 (barf)
- The plan: teach DLX, then move to x86/y86
- DLX: a RISC ISA, very similar to MIPS
- A great link to learn more about DLX:
- http://www.softpanorama.org/Hardware/architecture.shtml#DLX
74. DLX Architecture
- Based on observations about instruction set architecture
- Emphasizes:
- Simple load-store instruction set
- Design for pipeline efficiency
- Design for use as a compiler target
- DLX registers:
- 32 32-bit GPRs named R0, R1, ..., R31
- 32 32-bit FPRs named F0, F2, ..., F30
- Accessed independently for 32-bit data
- Accessed in pairs for 64-bit (double-precision) data
- Register R0 is hard-wired to zero
- Other status registers, e.g., the floating-point status register
- Byte addressable in big-endian with 32-bit addresses
- Arithmetic instructions: operands must be registers
75. MIPS Software Conventions for Registers
0      zero   constant 0
1      at     reserved for assembler
2-3    v0-v1  expression evaluation and function results
4-7    a0-a3  arguments
8-15   t0-t7  temporary: caller saves (callee can clobber)
16-23  s0-s7  callee saves (callee must save)
24-25  t8-t9  temporary (cont'd)
26-27  k0-k1  reserved for OS kernel
28     gp     pointer to global area
29     sp     stack pointer
30     fp     frame pointer
31     ra     return address (HW)
76. Addressing Modes
This table shows the most common modes.

Addressing Mode     Example Instruction   Meaning                            When Used
Register            Add R4, R3            R[R4] <- R[R4] + R[R3]             When a value is in a register.
Immediate           Add R4, #3            R[R4] <- R[R4] + 3                 For constants.
Displacement        Add R4, 100(R1)       R[R4] <- R[R4] + M[100 + R[R1]]    Accessing local variables.
Register Deferred   Add R4, (R1)          R[R4] <- R[R4] + M[R[R1]]          Using a pointer or a computed address.
Absolute            Add R4, (1001)        R[R4] <- R[R4] + M[1001]           Used for static data.
77. Memory Organization
- Viewed as a large, single-dimension array, with an address
- A memory address is an index into the array
- "Byte addressing" means that the index points to a byte of memory
(Diagram: addresses 0, 1, 2, ... each select 8 bits of data.)
78. Memory Addressing
- Bytes are nice, but most data items use larger "words"
- For DLX, a word is 32 bits or 4 bytes
- 2 questions for the design of an ISA:
- Since one could read a 32-bit word as four loads of bytes from sequential byte addresses or as one load word from a single byte address:
- How do byte addresses map to word addresses? (see the sketch below)
- Can a word be placed on any byte boundary?
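A minimal C sketch of the byte-to-word mapping for a 4-byte word, as in DLX; the function names are illustrative.

  #include <stdint.h>

  /* With 4-byte words, the word address is the byte address divided by 4,
     i.e. the byte address with its low 2 bits dropped. */
  static uint32_t word_addr(uint32_t byte_addr)  { return byte_addr >> 2; }

  /* A word access is aligned iff the byte address is a multiple of 4. */
  static int is_word_aligned(uint32_t byte_addr) { return (byte_addr & 3u) == 0; }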
79. Addressing Objects: Endianness and Alignment
- Big Endian: address of the most significant byte = word address (xx00 = Big End of word)
- IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
- Little Endian: address of the least significant byte = word address (xx00 = Little End of word)
- Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
(Diagram: within a word, little endian numbers the bytes 3 2 1 0 from msb to lsb, big endian numbers them 0 1 2 3; aligned and not-aligned placements are shown.)
- Alignment requires that objects fall on an address that is a multiple of their size.
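A common way to see which convention a machine uses is to examine the lowest-addressed byte of a known 32-bit word; here is a minimal C sketch.

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint32_t word = 0x01020304;
      uint8_t *first = (uint8_t *)&word;   /* byte at the word's address */

      /* Big endian stores the most significant byte (0x01) first;
         little endian stores the least significant byte (0x04) first. */
      printf("%s endian\n", (*first == 0x01) ? "big" : "little");
      return 0;
  }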
80. Assembly Language vs. Machine Language
- Assembly provides a convenient symbolic representation
- much easier than writing down numbers
- e.g., destination first
- Machine language is the underlying reality
- e.g., the destination is no longer first
- Assembly can provide 'pseudoinstructions'
- e.g., "move r10, r11" exists only in assembly
- would be implemented using "add r10, r11, r0"
- When considering performance you should count real instructions
81. Stored Program Concept
- Instructions are bits
- Programs are stored in memory, to be read or written just like data
- Fetch & Execute Cycle:
- Instructions are fetched and put into a special register
- Bits in the register "control" the subsequent actions
- Fetch the next instruction and continue
- Memory holds data, programs, compilers, editors, etc.
82. DLX arithmetic
- ALU instructions can have 3 operands
- add R1, R2, R3
- sub R1, R2, R3
- Operand order is fixed (destination first)
- Example:
  C code:   A = B + C
  DLX code: add r1, r2, r3   (registers associated with variables by the compiler)
83. DLX arithmetic
- Design Principle: simplicity favors regularity. Why?
- Of course this complicates some things...
  C code:   A = B + C + D;  E = F - A;
  MIPS code: add r1, r2, r3
             add r1, r1, r4
             sub r5, r6, r1
- Operands must be registers, and only 32 registers are provided
- Design Principle: smaller is faster. Why?
84. Executing assembly instructions
- The program counter holds the instruction address
- The CPU fetches the instruction from memory and puts it into the instruction register
- Control logic decodes the instruction and tells the register file, ALU, and other registers what to do
- For an ALU operation (e.g. add), data flows from the register file, through the ALU, and back to the register file
85. ALU Execution Example
86. ALU Execution Example
(Figures showing data flow through the ALU; not reproduced.)
87. Memory Instructions
- Load and store instructions
- lw r11, offset(r10)
- sw r11, offset(r10)
- Example:
  C code:   A[8] = h + A[8];   (assume h is in r2 and the base address of array A is in r3)
  DLX code: lw  r4, 32(r3)
            add r4, r2, r4
            sw  r4, 32(r3)
- Store word has the destination last
- Remember: arithmetic operands are registers, not memory!
88. Memory Operations - Loads
- Load data from memory:
- lw R6, 0(R5)   ; R6 <- mem[0x14], assuming R5 = 0x14
89. Memory Operations - Stores
- Storing data to memory works essentially the same way:
- sw R6, 0(R5)
- With R6 = 200 and, let's assume, R5 = 0x18:
- mem[0x18] <- 200
90. So far we've learned
- DLX: loading words but addressing bytes; arithmetic on registers only

Instruction        Meaning
add r1, r2, r3     r1 = r2 + r3
sub r1, r2, r3     r1 = r2 - r3
lw  r1, 100(r2)    r1 = Memory[r2 + 100]
sw  r1, 100(r2)    Memory[r2 + 100] = r1
91. Use of Registers
- Example:
- a = (b + c) - (d + e);   // C statement; a-e mapped to r1-r5
- add r10, r2, r3
- add r11, r4, r5
- sub r1, r10, r11
- a = b + A[4];   // add an array element to a var
- // r3 has the address of A
- lw  r4, 16(r3)
- add r1, r2, r4
92. Use of Registers: load and store
- Example:
- A[8] = a + A[6];   // A is in r3, a is in r2
- lw  r1, 24(r3)   ; r1 gets A[6]'s contents
- add r1, r2, r1   ; r1 gets the sum
- sw  r1, 32(r3)   ; the sum is put in A[8]
93. Load and store
- Example:
- a = b + A[i];   // A is in r3; a, b, i in r1, r2, r4
- add r11, r4, r4    ; r11 = 2 * i
- add r11, r11, r11  ; r11 = 4 * i
- add r11, r11, r3   ; r11 = address of A[i]  (r3 + (4 * i))
- lw  r10, 0(r11)    ; r10 = A[i]
- add r1, r2, r10    ; a = b + A[i]
94. Example: Swap
- Swapping words
- r2 has the base address of the array v

swap: lw r10, 0(r2)
      lw r11, 4(r2)
      sw r10, 4(r2)
      sw r11, 0(r2)

; temp = v[0]; v[0] = v[1]; v[1] = temp
95. DLX Instruction Format
- Instruction formats: I-type, R-type, J-type

I-type:  opcode(6)  rs1(5)  rd(5)  immediate(16)
R-type:  opcode(6)  rs1(5)  rs2(5)  rd(5)  func(11)
J-type:  opcode(6)  offset(26)
96. Machine Language
- Instructions, like registers and words of data, are also 32 bits long
- Example: add r10, r1, r2
- registers have numbers: 10, 1, 2
- Instruction format (R-type: opcode(6) rs(5) rt(5) rd(5) func(11)):
- 000000 00001 00010 01010 00000100000
97. Machine Language
- Consider the load-word and store-word instructions:
- What would the regularity principle have us do?
- New principle: good design demands a compromise
- Introduce a new type of instruction format:
- I-type, for data transfer instructions
- the other format was R-type, for register instructions
- Example: lw r10, 32(r2)

I-type (loads/stores): opcode(6) rs(5) rt(5) immediate(16)
100011 00010 01010 0000000000100000
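The field packing can be expressed directly in C. Here is a minimal sketch that assembles the R-type and I-type encodings above into 32-bit words (MIPS-style field layout, matching the slide); the function names are illustrative.

  #include <stdint.h>
  #include <stdio.h>

  /* Pack an R-type instruction: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6). */
  static uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                           uint32_t rd, uint32_t shamt, uint32_t funct) {
      return (op << 26) | (rs << 21) | (rt << 16) |
             (rd << 11) | (shamt << 6) | funct;
  }

  /* Pack an I-type instruction: op(6) rs(5) rt(5) imm(16). */
  static uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, uint32_t imm) {
      return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFFu);
  }

  int main(void) {
      /* add r10, r1, r2 : op=0, rs=1, rt=2, rd=10, funct=0x20 */
      printf("add: 0x%08x\n", encode_r(0, 1, 2, 10, 0, 0x20));
      /* lw r10, 32(r2)  : op=0x23, rs=2, rt=10, imm=32 */
      printf("lw:  0x%08x\n", encode_i(0x23, 2, 10, 32));
      return 0;
  }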
98. Machine Language
- Jump instructions
- Example: j .L1

J-type (jump, jump and link, trap, return from exception): opcode(6) offset(26)
000010 <26-bit offset to .L1>
99. DLX Instruction Format
- Instruction formats: I-type, R-type, J-type

I-type:  opcode(6)  rs1(5)  rd(5)  immediate(16)
R-type:  opcode(6)  rs1(5)  rs2(5)  rd(5)  func(11)
J-type:  opcode(6)  offset(26)
100. Instructions for Making Decisions
- beq reg1, reg2, L1
- Go to the statement labeled L1 if the value in reg1 equals the value in reg2
- bne reg1, reg2, L1
- Go to the statement labeled L1 if the value in reg1 does not equal the value in reg2
- j L1
- Unconditional jump
- jr r10
- Jump register: jump to the instruction specified in register r10
101. Making Decisions
- Example:
- if (a != b) goto L1;   // x,y,z,a,b mapped to r1-r5
- x = y + z;
- L1: x = x - a;

  bne r4, r5, L1     ; goto L1 if a != b
  add r1, r2, r3     ; x = y + z (skipped if a != b)
L1: sub r1, r1, r4   ; x = x - a (always executed)
102. if-then-else
- Example:
- if (a == b) x = y + z;
- else x = y - z;

  bne r4, r5, Else     ; goto Else if a != b
  add r1, r2, r3       ; x = y + z
  j Exit               ; goto Exit
Else: sub r1, r2, r3   ; x = y - z
Exit:
103. Example: Loop with array index
- Loop: g = g + A[i];
        i = i + j;
        if (i != h) goto Loop;
- r1, r2, r3, r4 = g, h, i, j; array base in r5

LOOP: add r11, r3, r3    ; r11 = 2 * i
      add r11, r11, r11  ; r11 = 4 * i
      add r11, r11, r5   ; r11 = address of A[i]
      lw  r10, 0(r11)    ; load A[i]
      add r1, r1, r10    ; g = g + A[i]
      add r3, r3, r4     ; i = i + j
      bne r3, r2, LOOP
104. Other decisions
- Set R1 on R2 less than R3: slt R1, R2, R3
- Compares two registers, R2 and R3
- R1 = 1 if R2 < R3; else R1 = 0 (if R2 >= R3)
- Example: slt r11, r1, r2
- Branch if less than:
- Example: if (A < B) goto LESS;
- slt r11, r1, r2     ; r11 = 1 if A < B
- bne r11, r0, LESS
105. Loops
- Example:
- while (A[i] == k)   // i, j, k in r3, r4, r5
-   i = i + j;        // A is in r6

Loop: sll r11, r3, 2     ; r11 = 4 * i
      add r11, r11, r6   ; r11 = address of A[i]
      lw  r10, 0(r11)    ; r10 = A[i]
      bne r10, r5, Exit  ; goto Exit if A[i] != k
      add r3, r3, r4     ; i = i + j
      j Loop             ; goto Loop
Exit:
106. Addresses in Branches and Jumps
- Instructions:
- bne r14, r15, Label   ; next instruction is at Label if r14 != r15
- beq r14, r15, Label   ; next instruction is at Label if r14 == r15
- j Label               ; next instruction is at Label
- Formats:
  I-type: op(6) rs(5) rt(5) 16-bit address
  J-type: op(6) 26-bit address
- Addresses are not 32 bits. How do we handle this with large programs?
- First idea: limit branch targets to the first 2^16 addresses
107. Addresses in Branches
- Instructions:
- bne r14, r15, Label   ; next instruction is at Label if r14 != r15
- beq r14, r15, Label   ; next instruction is at Label if r14 == r15
- Format: I-type: op(6) rs(5) rt(5) 16-bit address
- Treat the 16-bit number as an offset to the PC register: PC-relative addressing
- Word offset instead of byte offset. Why? Instructions are word-aligned, so a word offset reaches 4x as far
- Most branches are local (principle of locality)
- Jump instructions just use the high order bits of the PC: pseudodirect addressing
- 32-bit jump address = 4 most significant bits of the PC concatenated with the 26-bit word address (or 28-bit byte address)
- Address boundaries of 256 MB (see the sketch below)
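A minimal C sketch of the two target computations described here, PC-relative branches and pseudodirect jumps; MIPS-style semantics (offset relative to PC + 4) are assumed.

  #include <stdint.h>

  /* PC-relative branch: sign-extended 16-bit word offset, scaled by 4,
     added to the address of the following instruction (PC + 4). */
  static uint32_t branch_target(uint32_t pc, int16_t offset16) {
      return (pc + 4) + (uint32_t)((int32_t)offset16 * 4);
  }

  /* Pseudodirect jump: top 4 bits of PC + 4 concatenated with the
     26-bit word address shifted into a 28-bit byte address. */
  static uint32_t jump_target(uint32_t pc, uint32_t addr26) {
      return ((pc + 4) & 0xF0000000u) | ((addr26 & 0x03FFFFFFu) << 2);
  }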
108. Conditional Branch Distance
(Plot: distribution of branch distances; 65% of integer branches jump 2 to 4 instructions.)
109. Conditional Branch Addressing
- PC-relative, since most branches are relatively close to the current PC
- At least 8 bits suggested (+/-128 instructions)
- Compare Equal/Not Equal is most important for integer programs (86%)
110. PC-relative addressing
- For larger distances, jump register (jr) is required.
111. Example
LOOP: mult $9, $19, $10   ; R9 = R19 * R10
      lw   $8, 1000($9)   ; R8 = Mem[R9 + 1000]
      bne  $8, $21, EXIT
      add  $19, $19, $20  ; i = i + j
      j    LOOP
EXIT: ...
- Assume the address of LOOP is 0x8000
(Figure: the encoded j instruction, opcode 2 with target 0x8000.)
112. Procedure calls
- Procedures or subroutines
- Needed for structured programming
- Steps followed in executing a procedure call:
- Place parameters in a place where the procedure (callee) can access them
- Transfer control to the procedure
- Acquire the storage resources needed for the procedure
- Perform the desired task
- Place results in a place where the calling program (caller) can access them
- Return control to the point of origin
113. Resources Involved
- Registers used for procedure calling:
- a0-a3: four argument registers in which to pass parameters
- v0-v1: two value registers in which to return values
- r31: one return address register to return to the point of origin
- Transferring control to the callee:
- jal ProcedureAddress
- jump-and-link to the procedure address
- the return address (PC + 4) is saved in r31
- Example: jal 20000
- Returning control to the caller:
- jr r31
- the instruction following the jal is executed next
114. Memory Stacks
Useful for stacked environments/subroutine call & return, even if an operand stack is not part of the architecture.

Stacks that grow up vs. stacks that grow down:
(Diagram: a memory region holding entries a, b, c with the stack pointer SP at c; with low addresses ("0 Little") at one end and high addresses ("inf. Big") at the other, the next empty location is above SP if the stack grows up and below SP if it grows down.)
115. Calling conventions
int func(int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}
// g,h,i,j are in a0,a1,a2,a3; f is in r8

func: addi sp, sp, -12   ; make room in the stack for 3 words
      sw   r11, 8(sp)    ; save the regs we want to use
      sw   r10, 4(sp)
      sw   r8, 0(sp)
      add  r10, a0, a1   ; r10 = g + h
      add  r11, a2, a3   ; r11 = i + j
      sub  r8, r10, r11  ; r8 has the result
      add  v0, r8, r0    ; return reg v0 has f
116. Calling (cont.)
      lw   r8, 0(sp)     ; restore r8
      lw   r10, 4(sp)    ; restore r10
      lw   r11, 8(sp)    ; restore r11
      addi sp, sp, 12    ; restore sp
      jr   ra
- We did not have to restore r10-r19 (caller save)
- We do need to restore r1-r8 (must be preserved by the callee)
117. Nested Calls
Stacking of subroutine calls & returns and environments:
(Diagram: A calls B, B calls C, C returns, B returns; the stack holds A, then A/B, then A/B/C, then A/B, then A again.)
- Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM)
- Sometimes stacks are implemented via software convention
118. Compiling a String Copy Procedure
void strcpy(char x[], char y[])
{
    int i = 0;
    while ((x[i] = y[i]) != 0)
        i += 1;
}
// x and y base addresses are in a0 and a1

strcpy: addi sp, sp, -4     ; reserve 1 word of space in the stack
        sw   r8, 0(sp)      ; save r8
        add  r8, zero, zero ; i = 0
L1:     add  r11, a1, r8    ; address of y[i] in r11
        lb   r12, 0(r11)    ; r12 = y[i]
        add  r13, a0, r8    ; address of x[i] in r13
        sb   r12, 0(r13)    ; x[i] = y[i]
        beq  r12, zero, L2  ; if y[i] == 0 goto L2
        addi r8, r8, 1      ; i = i + 1
        j    L1             ; go to L1
L2:     lw   r8, 0(sp)      ; restore r8
        addi sp, sp, 4      ; restore sp
        jr   ra             ; return
119. IA-32
- 1978: The Intel 8086 is announced (16-bit architecture)
- 1980: The 8087 floating point coprocessor is added
- 1982: The 80286 increases the address space to 24 bits, adds instructions
- 1985: The 80386 extends to 32 bits, new addressing modes
- 1989-1995: The 80486, Pentium, and Pentium Pro add a few instructions (mostly designed for higher performance)
- 1997: 57 new MMX instructions are added; Pentium II
- 1999: The Pentium III adds another 70 instructions (SSE)
- 2001: Another 144 instructions (SSE2)
- 2003: AMD extends the architecture to increase the address space to 64 bits, widens all registers to 64 bits, and makes other changes (AMD64)
- 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions
- This history illustrates the impact of the "golden handcuffs" of compatibility: adding new features as someone might add clothing to a packed bag; an architecture that is difficult to explain and impossible to love
120. IA-32 Overview
- Complexity:
- Instructions from 1 to 17 bytes long
- One operand must act as both a source and destination
- One operand can come from memory
- Complex addressing modes, e.g., "base or scaled index with 8 or 32 bit displacement"
- Saving grace:
- The most frequently used instructions are not too difficult to build
- Compilers avoid the portions of the architecture that are slow
- "What the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective"
121. IA32 Registers
- Oversimplified architecture:
- Four 32-bit general purpose registers
- eax, ebx, ecx, edx
- "al" is a register name for the lower 8 bits of eax
- Stack pointer:
- esp
- Fun fact:
- Once upon a time, x86 was only a 16-bit CPU
- So, when they upgraded x86 to 32 bits...
- ...they added an "e" in front of every register name and called it "extended"
122. Intel 80x86 Integer Registers
GPR0   EAX   Accumulator
GPR1   ECX   Count register: string, loop
GPR2   EDX   Data register: multiply, divide
GPR3   EBX   Base address register
GPR4   ESP   Stack pointer
GPR5   EBP   Base pointer for base of stack segment
GPR6   ESI   Index register
GPR7   EDI   Index register
CS     Code segment pointer
SS     Stack segment pointer
DS     Data segment pointer
ES     Extra data segment pointer
FS     Data segment 2
GS     Data segment 3
PC     EIP   Instruction pointer
EFLAGS Condition codes
123. x86 Assembly
- mov <dest>, <src>
- Move the value from <src> into <dest>
- Used to set initial values
- add <dest>, <src>
- Add the value from <src> to <dest>
- sub <dest>, <src>
- Subtract the value from <src> from <dest>
124. x86 Assembly
- push <target>
- Push the value in <target> onto the stack
- Also decrements the stack pointer, ESP (remember: the stack grows from high to low)
- pop <target>
- Pops the value from the top of the stack, puts it in <target>
- Also increments the stack pointer, ESP
125. x86 Assembly
- jmp <address>
- Jump to an instruction (like goto); changes EIP to <address>
- call <address>
- A function call. Pushes the address of the next instruction (the return address) onto the stack, and jumps to <address>
126. x86 Assembly
- lea <dest>, <src>
- Load Effective Address of <src> into register <dest>. Used for pointer arithmetic (no actual memory reference)
- int <value>
- Interrupt: a hardware signal to the operating system kernel, with flag <value>; int 0x80 means a Linux system call
127. x86 Assembly
Condition codes:
- CF: Carry Flag - overflow detection (unsigned)
- ZF: Zero Flag
- SF: Sign Flag
- OF: Overflow Flag - overflow detection (signed)
Condition codes are usually accessed through conditional branches (not directly).
128. Interrupt convention
int 0x80: system call interrupt
- eax: system call number (e.g. 1 = exit, 2 = fork, 3 = read, 4 = write)
- ebx: argument 1
- ecx: argument 2
- edx: argument 3
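Putting slides 123-128 together, here is a minimal sketch of invoking the exit system call through int 0x80 from C, assuming a 32-bit x86 Linux target and GCC inline assembly; this is illustrative, not the course's own example.

  /* Build with: gcc -m32 exit_syscall.c (assumes 32-bit x86 Linux + GCC) */
  int main(void) {
      /* eax = 1 (exit), ebx = 42 (exit status), then trap into the kernel. */
      __asm__ volatile (
          "int $0x80"
          :
          : "a"(1), "b"(42)
      );
      return 0;  /* never reached: the kernel terminates the process */
  }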
129. CISC vs RISC
- RISC: Reduced Instruction Set Computer (DLX)
- CISC: Complex Instruction Set Computer (x86)
- Both have their advantages.
130. RISC
- Not very many instructions
- All instructions are about the same length, in both execution time and bit length
- Results in simpler CPUs (easier to optimize)
- Usually takes more instructions