Title: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
1. OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
- Chris Gilmore <grimjack@cs.pdx.edu>
- Portland State University/OMSE
2. Today
- Caches
- DLX Assembly
- CPU Overview
3. Computer System (Idealized)
(Diagram: CPU, Memory, Disk Controller, and Disk.)
4. The Big Picture: Where Are We Now?
- The Five Classic Components of a Computer
- Next Topic
- Simple caching techniques
- Many ways to improve cache performance
5. Recap: Levels of the Memory Hierarchy
(Diagram: Processor <-> Cache <-> Memory <-> Disk <-> Tape. Instructions and operands move between processor and cache, blocks between cache and memory, pages between memory and disk, and files between disk and tape. Upper levels are faster; lower levels are larger.)
6. Recap: Exploit Locality to Achieve Fast Memory
- Two Different Types of Locality:
- Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense
- Good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense
- Good choice for providing the user FAST access time.
7. Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
(Diagram: the processor exchanges Block X with the upper-level memory, which exchanges Block Y with the lower-level memory.)
8. The Art of Memory System Design
Workload or benchmark programs drive the processor, which presents the memory system with a reference stream: <op,addr>, <op,addr>, <op,addr>, ... where op = i-fetch, read, or write.
Optimize the memory system organization to minimize the average memory access time for typical workloads.
9. Example: Fully Associative
- Fully Associative Cache
- No Cache Index
- Compare the Cache Tags of all cache entries in parallel
- Example: with 32 B blocks, we need N 27-bit comparators
- By definition: Conflict Miss = 0 for a fully associative cache
(Diagram: the 32-bit address splits into a 27-bit Cache Tag (bits 31..5) and a Byte Select (bits 4..0, e.g. 0x01). Each entry holds a valid bit, a cache tag, and 32 bytes of cache data (bytes 0..31, 32..63, ...); all tags are compared in parallel.)
10. Example: 1 KB Direct Mapped Cache with 32 B Blocks
- For a 2^N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (Block Size = 2^M)
(Diagram: the block address splits into a Cache Tag (bits 31..10, e.g. 0x50, stored as part of the cache state), a Cache Index (bits 9..5, e.g. 0x01), and a Byte Select (bits 4..0, e.g. 0x00). The cache holds 32 entries (indexes 0..31), each with a valid bit, a cache tag, and 32 bytes of data; entry 1 is shown holding tag 0x50 and bytes 32..63, and the last entry holds bytes 992..1023.)
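To make the address split concrete, here is a minimal C sketch of how a 1 KB direct mapped cache with 32 B blocks decomposes a 32-bit address. The field widths follow the slide (5 byte-select bits, 5 index bits, 22 tag bits); the function names are illustrative, not from the course.

  #include <stdint.h>
  #include <stdio.h>

  /* 1 KB direct mapped cache, 32 B blocks (per the slide). */
  #define BYTE_SELECT_BITS 5
  #define INDEX_BITS       5

  static uint32_t byte_select(uint32_t addr) { return addr & ((1u << BYTE_SELECT_BITS) - 1); }
  static uint32_t cache_index(uint32_t addr) { return (addr >> BYTE_SELECT_BITS) & ((1u << INDEX_BITS) - 1); }
  static uint32_t cache_tag(uint32_t addr)   { return addr >> (BYTE_SELECT_BITS + INDEX_BITS); }

  int main(void) {
      uint32_t addr = 0x00014020;  /* arbitrary example address */
      printf("tag=0x%x index=0x%x byte=0x%x\n",
             cache_tag(addr), cache_index(addr), byte_select(addr));
      return 0;
  }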
11. Set Associative Cache
- N-way set associative: N entries for each Cache Index
- N direct mapped caches operate in parallel
- Example: Two-way set associative cache
- Cache Index selects a set from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
(Diagram: the Cache Index selects one set; each way holds a valid bit, a cache tag, and cache data for Cache Block 0. Both tags are compared against the address tag, the compare results are ORed into a Hit signal, and Sel0/Sel1 drive a mux that picks the matching way's cache block.)
12. Disadvantage of Set Associative Cache
- N-way Set Associative Cache versus Direct Mapped Cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the Hit/Miss decision and set selection
- In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss
- Possible to assume a hit and continue; recover later if miss.
13. Block Size Tradeoff
- Larger block size takes advantage of spatial locality, BUT:
- Larger block size means larger miss penalty
- Takes longer to fill up the block
- If block size is too big relative to cache size, miss rate will go up
- Too few cache blocks
- In general: Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
(Plots vs. Block Size: Miss Rate first falls as spatial locality is exploited, then rises when too few blocks compromise temporal locality; Miss Penalty grows with block size; Average Access Time therefore has a minimum at an intermediate block size.)
14. A Summary on Sources of Cache Misses
- Compulsory (cold start or process migration, first reference): first access to a block
- Cold fact of life: not a whole lot you can do about it
- Note: if you are going to run billions of instructions, compulsory misses are insignificant
- Conflict (collision)
- Multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity
- Capacity
- Cache cannot contain all blocks accessed by the program
- Solution: increase cache size
- Coherence (Invalidation): other process (e.g., I/O) updates memory
15. Sources of Cache Misses: Quiz
Assume constant cost. Fill in each cell:

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Small, Medium, or Big?
Compulsory Miss   ?
Conflict Miss     ?
Capacity Miss     ?
Coherence Miss    ?

Choices: Zero, Low, Medium, High, Same
16. Sources of Cache Misses: Answer

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Big             Medium                  Small
Compulsory Miss   Same            Same                    Same
Conflict Miss     High            Medium                  Zero
Capacity Miss     Low             Medium                  High
Coherence Miss    Same            Same                    Same

Note: if you are going to run billions of instructions, compulsory misses are insignificant.
17. Recap: Four Questions for Caches and the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
18. Q1: Where can a block be placed in the upper level?
- Block 12 placed in an 8-block cache:
- Fully associative, direct mapped, or 2-way set associative
- S.A. mapping = block number modulo number of sets
- Fully associative: block 12 can go anywhere
(Diagram: block positions 0-7. Direct mapped: only block 4 (12 mod 8); 2-way set associative: anywhere in set 0 (12 mod 4); fully associative: anywhere.)
19. Q2: How is a block found if it is in the upper level?
(Address fields: tag | index (set select) | block offset (data select).)
- Direct indexing (using index and block offset), tag compares, or a combination
- Increasing associativity shrinks the index and expands the tag
20. Q3: Which block should be replaced on a miss?
- Easy for Direct Mapped
- Set Associative or Fully Associative:
- Random
- FIFO
- LRU (Least Recently Used) - a sketch of LRU bookkeeping follows below
- LFU (Least Frequently Used)
- Miss rates (%) for LRU vs. Random:

Associativity:   2-way          4-way          8-way
Size             LRU   Random   LRU   Random   LRU   Random
16 KB            5.2   5.7      4.7   5.3      4.4   5.0
64 KB            1.9   2.0      1.5   1.7      1.4   1.5
256 KB           1.15  1.17     1.13  1.13     1.12  1.12
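As a rough illustration of LRU bookkeeping, here is a minimal C sketch that picks a victim way within one set using per-way age counters. The structure, names, and 4-way width are all illustrative assumptions, not from the slides.

  #include <stdint.h>

  #define WAYS 4  /* 4-way set associative, for illustration */

  typedef struct {
      uint32_t tag[WAYS];
      int      valid[WAYS];
      uint32_t age[WAYS];   /* higher age = less recently used */
  } CacheSet;

  /* Pick the LRU victim: an invalid way if any, else the oldest way. */
  static int lru_victim(const CacheSet *set) {
      int victim = 0;
      for (int w = 0; w < WAYS; w++) {
          if (!set->valid[w]) return w;            /* free slot wins */
          if (set->age[w] > set->age[victim]) victim = w;
      }
      return victim;
  }

  /* On an access that hits way `hit`, mark it most recent; age the others. */
  static void lru_touch(CacheSet *set, int hit) {
      for (int w = 0; w < WAYS; w++) set->age[w]++;
      set->age[hit] = 0;
  }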
21. Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
- Is the block clean or dirty?
- Pros and cons of each?
- WT: read misses cannot result in writes; coherency is easier
- WB: no repeated writes of the same block to lower-level memory
- WT is always combined with write buffers so writes don't wait for lower-level memory
22. Write Buffer for Write Through
(Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM.)
- A Write Buffer is needed between the Cache and Memory
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory
- Write buffer is just a FIFO (see the sketch below)
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
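The slide's FIFO write buffer can be sketched in a few lines of C. The 4-entry depth matches the slide; everything else (names, types, the ring-buffer layout) is an illustrative assumption.

  #include <stdint.h>

  #define WB_ENTRIES 4  /* typical depth, per the slide */

  typedef struct { uint32_t addr, data; } WriteReq;

  typedef struct {
      WriteReq entry[WB_ENTRIES];
      int head, tail, count;     /* simple ring buffer */
  } WriteBuffer;

  /* Processor side: returns 0 on success, -1 if saturated (CPU must stall). */
  static int wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
      if (wb->count == WB_ENTRIES) return -1;   /* write buffer saturation */
      wb->entry[wb->tail] = (WriteReq){ addr, data };
      wb->tail = (wb->tail + 1) % WB_ENTRIES;
      wb->count++;
      return 0;
  }

  /* Memory controller side: drain one entry per DRAM write cycle. */
  static int wb_pop(WriteBuffer *wb, WriteReq *out) {
      if (wb->count == 0) return -1;
      *out = wb->entry[wb->head];
      wb->head = (wb->head + 1) % WB_ENTRIES;
      wb->count--;
      return 0;
  }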
23. Write Buffer Saturation
(Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM.)
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
- The store buffer will overflow no matter how big you make it
- Because the CPU Cycle Time < DRAM Write Cycle Time
- Solutions for write buffer saturation:
- Use a write back cache
- Install a second-level (L2) cache (does this always work?)
(Diagram: Processor <-> Cache <-> L2 Cache, with a Write Buffer between L2 and DRAM.)
24. Write-miss Policy: Write Allocate versus Not Allocate
- Assume a 16-bit write to memory location 0x0 causes a miss
- Do we read in the block?
- Yes: Write Allocate
- No: Write Not Allocate
(Diagram: the same 1 KB direct mapped cache as before; the address splits into Cache Tag (e.g. 0x00), Cache Index (e.g. 0x00), and Byte Select (e.g. 0x00), with entry 0 currently holding tag 0x50.)
25. Impact on Cycle Time
Cache Hit Time is directly tied to clock rate; it increases with cache size and with associativity.

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)
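These two formulas make a tiny calculator. Here is a minimal C sketch; the numbers in main (1-cycle hit, 5% miss rate, 50-cycle penalty, etc.) are invented purely for illustration.

  #include <stdio.h>

  /* AMAT = hit_time + miss_rate * miss_penalty (cycles) */
  static double amat(double hit_time, double miss_rate, double miss_penalty) {
      return hit_time + miss_rate * miss_penalty;
  }

  /* CPU time = IC * CT * (ideal CPI + memory stalls per instruction) */
  static double cpu_time(double ic, double ct, double ideal_cpi, double mem_stalls) {
      return ic * ct * (ideal_cpi + mem_stalls);
  }

  int main(void) {
      printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 */
      /* 1e9 instructions, 1 ns cycle, ideal CPI 1, 0.5 stall cycles/instr */
      printf("CPU time = %.3f s\n", cpu_time(1e9, 1e-9, 1.0, 0.5));
      return 0;
  }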
26. What happens on a Cache miss?
- For an in-order pipeline, 2 options:
- Freeze the pipeline in the Mem stage (popular early on: Sparc, R4000)
    IF ID EX Mem stall stall ... stall Mem Wr
       IF ID EX  stall stall ... stall stall Ex Wr
- Use Full/Empty bits in registers + an MSHR queue
- MSHR = Miss Status/Handler Registers (Kroft): each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
- Per cache line: keep info about the memory address.
- For each word: the register (if any) that is waiting for the result.
- Used to merge multiple requests to one memory line
- A new load creates an MSHR entry and sets the destination register to Empty. The load is released from the pipeline.
- Attempting to use the register before the result returns causes the instruction to block in the decode stage.
- Limited out-of-order execution with respect to loads. Popular with in-order superscalar architectures.
- Out-of-order pipelines already have this functionality built in (load queues, etc.).
27. Improving Cache Performance: 3 general options
Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
                           = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
28. Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
29. 3Cs: Absolute Miss Rate (SPEC92)
(Plot: miss rate vs. cache size, broken into conflict, capacity, and compulsory components; the compulsory component is vanishingly small.)
30. 2:1 Cache Rule
miss rate of a 1-way associative cache of size X
  ~= miss rate of a 2-way associative cache of size X/2
(Plot: the conflict component of the miss rate illustrates the rule.)
31. 3Cs: Relative Miss Rate
(Plot: the same data normalized; the conflict component dominates the variation.)
- Flaw: holds block size fixed. Good: insight => invention
32. 1. Reduce Misses via Larger Block Size
33. 2. Reduce Misses via Higher Associativity
- 2:1 Cache Rule:
- Miss Rate of a DM cache of size N ~= Miss Rate of a 2-way cache of size N/2
- Beware: execution time is the only final measure!
- Will clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
34. Example: Avg. Memory Access Time vs. Miss Rate
- Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

- (Green means A.M.A.T. not improved by more associativity)
- (AMAT = Average Memory Access Time)
35. 3. Reducing Misses via a Victim Cache
- How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a buffer to place data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
- Used in Alpha, HP machines
(Diagram: a direct mapped cache (TAGS/DATA) backed by a small fully associative victim cache of four lines, each with its own tag and comparator, in front of the next lower level in the hierarchy.)
36. 4. Reducing Misses by Hardware Prefetching
- E.g., Instruction Prefetching
- Alpha 21064 fetches 2 blocks on a miss
- The extra block is placed in a stream buffer
- On a miss, check the stream buffer
- Works with data blocks too
- Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
- Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
37. 5. Reducing Misses by Software Prefetching Data
- Data Prefetch
- Load data into register (HP PA-RISC loads)
- Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
- Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
- Is the cost of prefetch issues < the savings in reduced misses?
- Wider superscalar issue reduces the difficulty of issue bandwidth
38. 6. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
- Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)
- Data:
- Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
- Loop Interchange: change the nesting of loops to access data in the order stored in memory
- Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
- Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows (see the sketch after this list)
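As a concrete illustration of two of these transformations, here is a minimal C sketch of loop interchange and array merging; the array names and sizes are invented for the example.

  #define N 1024

  /* Loop interchange: C stores x[i][j] row-major, so the j-inside-i order
     walks memory sequentially (good spatial locality). */
  void interchange_good(double x[N][N]) {
      for (int i = 0; i < N; i++)          /* row */
          for (int j = 0; j < N; j++)      /* column: stride-1 accesses */
              x[i][j] = 2.0 * x[i][j];
  }

  /* The j-outer version strides N doubles per access: many more misses. */
  void interchange_bad(double x[N][N]) {
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)      /* stride-N accesses */
              x[i][j] = 2.0 * x[i][j];
  }

  /* Merging arrays: one array of compound elements keeps val[i] and key[i]
     in the same cache block, instead of two parallel arrays. */
  struct merged { int val; int key; };
  int find_key(struct merged m[N], int key) {
      for (int i = 0; i < N; i++)
          if (m[i].key == key) return m[i].val;
      return -1;
  }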
39. Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
40. 0. Reducing Penalty: Faster DRAM / Interface
- New DRAM Technologies
- RAMBUS: same initial latency, but much higher bandwidth
- Synchronous DRAM
- Better bus interfaces
- CRAY technique: only use SRAM
41. 1. Reducing Penalty: Read Priority over Write on Miss
(Diagram: Processor <-> Cache, with a Write Buffer between them and DRAM.)
- A Write Buffer allows reads to bypass writes
- Processor: writes data into the cache and the write buffer
- Memory controller: writes contents of the buffer to memory
- Write buffer is just a FIFO
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
42. 1. Reducing Penalty: Read Priority over Write on Miss
- Write-Buffer Issues:
- Write through with write buffers creates RAW conflicts between main memory reads on cache misses and buffered writes
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
- Solution: check the write buffer contents before the read; if there are no conflicts, let the memory access continue
- Write Back?
- Read miss replacing a dirty block:
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less, since it can restart as soon as the read is done
43. 2. Reduce Penalty: Early Restart and Critical Word First
- Don't wait for the full block to be loaded before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Generally useful only with large blocks
- Spatial locality is a problem: we tend to want the next sequential word, so it is not clear whether early restart helps
44. 3. Reduce Penalty: Non-blocking Caches
- A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
- Requires F/E bits on registers or out-of-order execution
- Requires multi-bank memories
- "Hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise it cannot be supported)
- Pentium Pro allows 4 outstanding memory misses (a sketch of the per-miss bookkeeping follows below)
45. Value of Hit Under Miss for SPEC
(Plot: AMAT under "hit under n misses" for n = 0, 1, 2, 64.)
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB Data Cache, Direct Mapped, 32B blocks, 16-cycle miss
46. 4. Reduce Penalty: Second-Level Cache
(Diagram: Processor -> L1 Cache -> L2 Cache.)
- L2 Equations:
- AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
- Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
- So: AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
- Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
- The global miss rate is what matters
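Here is a minimal C sketch of these L2 equations, including the local vs. global miss rate distinction; all the numbers in main are invented for illustration.

  #include <stdio.h>

  /* AMAT for a two-level hierarchy, from the slide's L2 equations. */
  static double amat_l2(double hit_l1, double mr_l1,
                        double hit_l2, double mr_l2, double penalty_l2) {
      double miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2;
      return hit_l1 + mr_l1 * miss_penalty_l1;
  }

  int main(void) {
      double mr_l1 = 0.05;   /* L1 local miss rate */
      double mr_l2 = 0.20;   /* L2 local miss rate (of accesses reaching L2) */
      /* 1-cycle L1 hit, 10-cycle L2 hit, 100-cycle L2 miss penalty -> 2.50 */
      printf("AMAT = %.2f cycles\n", amat_l2(1.0, mr_l1, 10.0, mr_l2, 100.0));
      printf("global L2 miss rate = %.3f\n", mr_l1 * mr_l2);  /* 0.010 */
      return 0;
  }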
47. Reducing Misses: which apply to the L2 Cache?
- Reducing Miss Rate:
- 1. Reduce Misses via Larger Block Size
- 2. Reduce Conflict Misses via Higher Associativity
- 3. Reducing Conflict Misses via Victim Cache
- 4. Reducing Misses by HW Prefetching Instr, Data
- 5. Reducing Misses by SW Prefetching Data
- 6. Reducing Capacity/Conf. Misses by Compiler Optimizations
48. L2 cache block size and A.M.A.T.
- 32KB L1, 8 byte path to memory
49. Improving Cache Performance (Continued)
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache:
- Lower associativity (victim caching)?
- 2nd-level cache
- Careful virtual memory design
50. Summary 1/3
- The Principle of Locality:
- A program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space
- Three (+1) Major Categories of Cache Misses:
- Compulsory Misses: sad facts of life. Example: cold start misses.
- Conflict Misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect!
- Capacity Misses: increase cache size
- Coherence Misses: caused by external processors or I/O devices
- Cache Design Space:
- total size, block size, associativity
- replacement policy
- write-hit policy (write-through, write-back)
- write-miss policy
51. Summary 2/3: The Cache Design Space
- Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs write-back
- write allocation
- The optimal choice is a compromise:
- depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
- depends on technology / cost
- Simplicity often wins
(Diagram: axes for Cache Size, Associativity, and Block Size; a generic tradeoff curve of Factor A vs. Factor B running from Good to Bad as a parameter goes from Less to More.)
52. Summary 3/3: Cache Miss Optimization
- Lots of techniques people use to improve the miss rate of caches:

Technique                         MR   MP   HT   Complexity
Larger Block Size                 +    -         0
Higher Associativity              +         -    1
Victim Caches                     +              2
Pseudo-Associative Caches         +              2
HW Prefetching of Instr/Data      +              2
Compiler Controlled Prefetching   +              3
Compiler Reduce Misses            +              0

(MR = miss rate, MP = miss penalty, HT = hit time; + improves the factor, - hurts it.)
53. Onto Assembler!
- What is assembly language?
- A machine-specific programming language
- One-to-one correspondence between statements and native machine language instructions
- Matches the machine's instruction set and architecture
54. What is an assembler?
- A systems-level program
- Usually works in conjunction with the compiler
- Translates assembly language source code to machine language
- Object file: contains machine instructions, initial data, and information used when loading the program
- Listing file: contains a record of the translation process, line numbers, addresses, generated code and data, and a symbol table
55. Why learn assembly?
- Learn how a processor works
- Understand basic computer architecture
- Explore the internal representation of data and instructions
- Gain insight into hardware concepts
- Allows creation of small and efficient programs
- Allows programmers to bypass high-level language restrictions
- Might be necessary to accomplish certain operations
56. Machine Representation
- A language of numbers, called the processor's Instruction Set
- The set of basic operations a processor can perform
- Each instruction is coded as a number
- Instructions may be one or more bytes long
- Every number corresponds to an instruction
57. Assembly vs Machine
- Machine Language Programming:
- Writing a list of numbers representing the bytes of machine instructions to be executed and data constants to be used by the program
- Assembly Language Programming:
- Using symbolic instructions to represent the raw data that will form the machine language program and initial data constants
58. Assembly
- Mnemonics represent machine instructions
- Each mnemonic used represents a single machine instruction
- The assembler performs the translation
- Some mnemonics require operands
- Operands provide additional information
- register, constant, address, or variable
- Assembler directives
59. Instruction Set Architecture: a Critical Interface
software
  --- instruction set ---
hardware
The portion of the machine that is visible to the programmer or the compiler writer.
60. Good ISA
- Lasts through many implementations (portability, compatibility)
- Can be used for many different applications (generality)
- Provides convenient functionality to higher levels
- Permits an efficient implementation at lower levels
61. Von Neumann Machines
- Von Neumann invented the stored program computer in 1945
- Instead of program code being hardwired, the program code (instructions) is placed in memory along with data
(Diagram: Control and ALU connected to a single memory holding both program and data.)
62. Basic ISA Classes
- Memory to Memory Machines:
- Every instruction contains a full memory address for each operand
- Maybe the simplest ISA design
- However, memory is slow
- And memory is big (lots of address bits)
63. Memory-to-memory machine
- Assumptions:
- Two operands per operation; the first operand is also the destination
- Memory address: 16 bits (2 bytes)
- Operand size: 32 bits (4 bytes)
- Instruction code: 8 bits (1 byte)
- Example: A = B + C (hypothetical code)
- mov A, B   ; A <- B
- add A, C   ; A <- B + C
- 5 bytes per instruction (1 opcode byte + two 2-byte addresses)
- 4 bytes to fetch each operand
- 4 bytes to store the result
- add needs 17 bytes of traffic and mov needs 13 bytes
- Total: 30 bytes of memory traffic
64. Why CPU Storage?
- A small amount of storage in the CPU:
- To reduce memory traffic by keeping repeatedly used operands in the CPU
- Avoid re-referencing memory
- Avoid having to specify the full memory address of the operand
- This is a perfect example of "make the common case fast."
- Simplest case:
- A machine with 1 cell of CPU storage: the accumulator
65. Accumulator Machine
- Assumptions:
- Two operands per operation
- 1st operand in the accumulator
- 2nd operand in memory
- The accumulator is also the destination (except for store)
- Memory address: 16 bits (2 bytes)
- Operand size: 32 bits (4 bytes)
- Instruction code: 8 bits (1 byte)
- Example: A = B + C (hypothetical code)
- load B    ; acc <- B
- add C     ; acc <- B + C
- store A   ; A <- acc
- 3 bytes per instruction
- 4 bytes to load or store the second operand
- 7 bytes of traffic per instruction
- 21 bytes total memory traffic
66. Stack Machines
- Instruction sets are based on a stack model of execution
- Aimed for compact instruction encoding
- Most instructions manipulate the top few data items (mostly the top 2) of a pushdown stack
- The top few items of the stack are kept in the CPU
- Ideal for evaluating expressions (the stack holds intermediate results)
- Were thought to be a good match for high level languages
- Awkward:
- Become very slow if the stack grows beyond CPU local storage
- No simple way to get data from the middle of the stack
67. Stack Machines
- Binary arithmetic and logic operations:
- Operands: top 2 items on the stack
- Operands are removed from the stack
- Result is placed on top of the stack
- Unary arithmetic and logic operations:
- Operand: top item on the stack
- Operand is replaced by the result of the operation
- Data move operations:
- Push: place memory data on top of the stack
- Pop: move the top of the stack to memory (a sketch of the model follows below)
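To make the model concrete, here is a minimal C sketch of a stack machine evaluating A = B + C, the same running example used for the other ISA classes; the opcode names and memory layout are invented for illustration.

  #include <stdio.h>

  enum Op { PUSH, POP, ADD };            /* tiny illustrative opcode set */
  typedef struct { enum Op op; int addr; } Insn;

  static int mem[16] = { [1] = 7, [2] = 35 };  /* B at 1, C at 2, A at 0 */
  static int stack[8], sp;                     /* operand stack in the CPU */

  static void run(const Insn *prog, int n) {
      for (int i = 0; i < n; i++) {
          switch (prog[i].op) {
          case PUSH: stack[sp++] = mem[prog[i].addr]; break;  /* mem -> stack */
          case POP:  mem[prog[i].addr] = stack[--sp]; break;  /* stack -> mem */
          case ADD:  sp--; stack[sp - 1] += stack[sp]; break; /* top 2 -> 1 */
          }
      }
  }

  int main(void) {
      /* A = B + C:  push B; push C; add; pop A */
      Insn prog[] = { {PUSH, 1}, {PUSH, 2}, {ADD, 0}, {POP, 0} };
      run(prog, 4);
      printf("A = %d\n", mem[0]);   /* prints 42 */
      return 0;
  }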
68. General Purpose Register Machines
- With stack machines, only the top two elements of the stack are directly available to instructions. In general purpose register machines, the CPU storage is organized as a set of registers which are equally available to the instructions
- Frequently used operands are placed in registers (under program control)
- Reduces instruction size
- Reduces memory traffic
69. General Purpose Registers Dominate
- 1975-present: all machines use general purpose registers
- Advantages of registers:
- registers are faster than memory
- registers are easier for a compiler to use
- e.g., (A*B) - (C*D) - (E*F) can do the multiplies in any order
- registers can hold variables
- memory traffic is reduced, so the program is sped up (since registers are faster than memory)
- code density improves (since a register is named with fewer bits than a memory location)
70. Classifying General Purpose Register Machines
- General purpose register machines are sub-classified based on whether or not memory operands can be used by typical ALU instructions
- Register-memory machines: machines where some ALU instructions can specify at least one memory operand and one register operand
- Load-store machines: the only instructions that can access memory are the load and the store instructions
71. Comparing number of instructions
- Code sequence for A = B + C for five classes of instruction sets:

Register (register-memory):
  load  R1, B
  add   R1, C
  store A, R1

Register (load-store):       <- DLX/MIPS is one of these
  load  R1, B
  load  R2, C
  add   R1, R1, R2
  store A, R1

Stack:
  push B
  push C
  add
  pop A

Memory-to-memory:
  mov A, B
  add A, C

Accumulator:
  load B
  add C
  store A
72. Instruction Set Definition
- Objects: architectural entities (machine state)
- Registers
- General purpose
- Special purpose (e.g. program counter, condition code, stack pointer)
- Memory locations
- Linear address space: 0, 1, 2, ..., 2^s - 1
- Operations: instruction types
- Data operation
- Arithmetic
- Logical
- Data transfer
- Move (from register to register)
- Load (from memory location to register)
- Store (from register to memory location)
- Instruction sequencing
- Branch (conditional)
- Jump (unconditional)
73. Topic: DLX
- An instructional architecture
- Much nicer and easier to understand than x86 (barf)
- The plan: teach DLX, then move to x86/y86
- DLX: a RISC ISA, very similar to MIPS
- A great link to learn more about DLX:
- http://www.softpanorama.org/Hardware/architecture.shtml#DLX
74. DLX Architecture
- Based on observations about instruction set architecture
- Emphasizes:
- Simple load-store instruction set
- Design for pipeline efficiency
- Design for use as a compiler target
- DLX registers:
- 32 32-bit GPRs named R0, R1, ..., R31
- 32 32-bit FPRs named F0, F2, ..., F30
- Accessed independently for 32-bit data
- Accessed in pairs for 64-bit (double-precision) data
- Register R0 is hard-wired to zero
- Other status registers, e.g., the floating-point status register
- Byte addressable in big-endian with 32-bit addresses
- Arithmetic instructions: operands must be registers
75. MIPS Software Conventions for Registers
0      zero   constant 0
1      at     reserved for assembler
2-3    v0-v1  expression evaluation and function results
4-7    a0-a3  arguments
8-15   t0-t7  temporary: caller saves (callee can clobber)
16-23  s0-s7  callee saves (callee must save)
24-25  t8-t9  temporary (cont'd)
26-27  k0-k1  reserved for OS kernel
28     gp     pointer to global area
29     sp     stack pointer
30     fp     frame pointer
31     ra     return address (HW)
76. Addressing Modes
This table shows the most common modes.

Addressing Mode     Example Instruction   Meaning                            When Used
Register            Add R4, R3            R[R4] <- R[R4] + R[R3]             When a value is in a register.
Immediate           Add R4, #3            R[R4] <- R[R4] + 3                 For constants.
Displacement        Add R4, 100(R1)       R[R4] <- R[R4] + M[100 + R[R1]]    Accessing local variables.
Register Deferred   Add R4, (R1)          R[R4] <- R[R4] + M[R[R1]]          Using a pointer or a computed address.
Absolute            Add R4, (1001)        R[R4] <- R[R4] + M[1001]           Used for static data.
77. Memory Organization
- Viewed as a large, single-dimension array, with an address
- A memory address is an index into the array
- "Byte addressing" means that the index points to a byte of memory
(Diagram: addresses 0, 1, 2, ... each select 8 bits of data.)
78. Memory Addressing
- Bytes are nice, but most data items use larger "words"
- For DLX, a word is 32 bits or 4 bytes
- 2 questions for the design of an ISA:
- Since one could read a 32-bit word as four loads of bytes from sequential byte addresses or as one load word from a single byte address:
- How do byte addresses map to word addresses? (see the sketch below)
- Can a word be placed on any byte boundary?
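A minimal C sketch of the byte-to-word mapping for a 4-byte word, as in DLX; the function names are illustrative.

  #include <stdint.h>

  /* With 4-byte words, the word address is the byte address divided by 4,
     i.e. the byte address with its low 2 bits dropped. */
  static uint32_t word_addr(uint32_t byte_addr)  { return byte_addr >> 2; }

  /* A word access is aligned iff the byte address is a multiple of 4. */
  static int is_word_aligned(uint32_t byte_addr) { return (byte_addr & 3u) == 0; }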
79. Addressing Objects: Endianness and Alignment
- Big Endian: address of the most significant byte = word address (xx00 = Big End of word)
- IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
- Little Endian: address of the least significant byte = word address (xx00 = Little End of word)
- Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
(Diagram: within a word, little endian numbers the bytes 3 2 1 0 from msb to lsb, big endian numbers them 0 1 2 3; aligned and not-aligned placements are shown.)
- Alignment requires that objects fall on an address that is a multiple of their size.
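A common way to see which convention a machine uses is to examine the lowest-addressed byte of a known 32-bit word; here is a minimal C sketch.

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint32_t word = 0x01020304;
      uint8_t *first = (uint8_t *)&word;   /* byte at the word's address */

      /* Big endian stores the most significant byte (0x01) first;
         little endian stores the least significant byte (0x04) first. */
      printf("%s endian\n", (*first == 0x01) ? "big" : "little");
      return 0;
  }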
80. Assembly Language vs. Machine Language
- Assembly provides a convenient symbolic representation
- much easier than writing down numbers
- e.g., destination first
- Machine language is the underlying reality
- e.g., the destination is no longer first
- Assembly can provide 'pseudoinstructions'
- e.g., "move r10, r11" exists only in assembly
- would be implemented using "add r10, r11, r0"
- When considering performance you should count real instructions
81. Stored Program Concept
- Instructions are bits
- Programs are stored in memory, to be read or written just like data
- Fetch & Execute Cycle:
- Instructions are fetched and put into a special register
- Bits in the register "control" the subsequent actions
- Fetch the next instruction and continue
- Memory holds data, programs, compilers, editors, etc.
82. DLX arithmetic
- ALU instructions can have 3 operands
- add R1, R2, R3
- sub R1, R2, R3
- Operand order is fixed (destination first)
- Example:
  C code:   A = B + C
  DLX code: add r1, r2, r3   (registers associated with variables by the compiler)
83. DLX arithmetic
- Design Principle: simplicity favors regularity. Why?
- Of course this complicates some things...
  C code:   A = B + C + D;  E = F - A;
  MIPS code: add r1, r2, r3
             add r1, r1, r4
             sub r5, r6, r1
- Operands must be registers, and only 32 registers are provided
- Design Principle: smaller is faster. Why?
84. Executing assembly instructions
- The program counter holds the instruction address
- The CPU fetches the instruction from memory and puts it into the instruction register
- Control logic decodes the instruction and tells the register file, ALU, and other registers what to do
- For an ALU operation (e.g. add), data flows from the register file, through the ALU, and back to the register file
85. ALU Execution Example
86. ALU Execution Example
(Figures showing data flow through the ALU; not reproduced.)
87. Memory Instructions
- Load and store instructions
- lw r11, offset(r10)
- sw r11, offset(r10)
- Example:
  C code:   A[8] = h + A[8];   (assume h is in r2 and the base address of array A is in r3)
  DLX code: lw  r4, 32(r3)
            add r4, r2, r4
            sw  r4, 32(r3)
- Store word has the destination last
- Remember: arithmetic operands are registers, not memory!
88. Memory Operations - Loads
- Load data from memory:
- lw R6, 0(R5)   ; R6 <- mem[0x14], assuming R5 = 0x14
89. Memory Operations - Stores
- Storing data to memory works essentially the same way:
- sw R6, 0(R5)
- With R6 = 200 and, let's assume, R5 = 0x18:
- mem[0x18] <- 200
90. So far we've learned
- DLX: loading words but addressing bytes; arithmetic on registers only

Instruction        Meaning
add r1, r2, r3     r1 = r2 + r3
sub r1, r2, r3     r1 = r2 - r3
lw  r1, 100(r2)    r1 = Memory[r2 + 100]
sw  r1, 100(r2)    Memory[r2 + 100] = r1
91. Use of Registers
- Example:
- a = (b + c) - (d + e);   // C statement; a-e mapped to r1-r5
- add r10, r2, r3
- add r11, r4, r5
- sub r1, r10, r11
- a = b + A[4];   // add an array element to a var
- // r3 has the address of A
- lw  r4, 16(r3)
- add r1, r2, r4
92. Use of Registers: load and store
- Example:
- A[8] = a + A[6];   // A is in r3, a is in r2
- lw  r1, 24(r3)   ; r1 gets A[6]'s contents
- add r1, r2, r1   ; r1 gets the sum
- sw  r1, 32(r3)   ; the sum is put in A[8]
93. Load and store
- Example:
- a = b + A[i];   // A is in r3; a, b, i in r1, r2, r4
- add r11, r4, r4    ; r11 = 2 * i
- add r11, r11, r11  ; r11 = 4 * i
- add r11, r11, r3   ; r11 = address of A[i]  (r3 + (4 * i))
- lw  r10, 0(r11)    ; r10 = A[i]
- add r1, r2, r10    ; a = b + A[i]
94. Example: Swap
- Swapping words
- r2 has the base address of the array v

swap: lw r10, 0(r2)
      lw r11, 4(r2)
      sw r10, 4(r2)
      sw r11, 0(r2)

; temp = v[0]; v[0] = v[1]; v[1] = temp
95. DLX Instruction Format
- Instruction formats: I-type, R-type, J-type

I-type:  opcode(6)  rs1(5)  rd(5)  immediate(16)
R-type:  opcode(6)  rs1(5)  rs2(5)  rd(5)  func(11)
J-type:  opcode(6)  offset(26)
96. Machine Language
- Instructions, like registers and words of data, are also 32 bits long
- Example: add r10, r1, r2
- registers have numbers: 10, 1, 2
- Instruction format (R-type: opcode(6) rs(5) rt(5) rd(5) func(11)):
- 000000 00001 00010 01010 00000100000
97. Machine Language
- Consider the load-word and store-word instructions:
- What would the regularity principle have us do?
- New principle: good design demands a compromise
- Introduce a new type of instruction format:
- I-type, for data transfer instructions
- the other format was R-type, for register instructions
- Example: lw r10, 32(r2)

I-type (loads/stores): opcode(6) rs(5) rt(5) immediate(16)
100011 00010 01010 0000000000100000
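The field packing can be expressed directly in C. Here is a minimal sketch that assembles the R-type and I-type encodings above into 32-bit words (MIPS-style field layout, matching the slide); the function names are illustrative.

  #include <stdint.h>
  #include <stdio.h>

  /* Pack an R-type instruction: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6). */
  static uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                           uint32_t rd, uint32_t shamt, uint32_t funct) {
      return (op << 26) | (rs << 21) | (rt << 16) |
             (rd << 11) | (shamt << 6) | funct;
  }

  /* Pack an I-type instruction: op(6) rs(5) rt(5) imm(16). */
  static uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, uint32_t imm) {
      return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFFu);
  }

  int main(void) {
      /* add r10, r1, r2 : op=0, rs=1, rt=2, rd=10, funct=0x20 */
      printf("add: 0x%08x\n", encode_r(0, 1, 2, 10, 0, 0x20));
      /* lw r10, 32(r2)  : op=0x23, rs=2, rt=10, imm=32 */
      printf("lw:  0x%08x\n", encode_i(0x23, 2, 10, 32));
      return 0;
  }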
98. Machine Language
- Jump instructions
- Example: j .L1

J-type (jump, jump and link, trap, return from exception): opcode(6) offset(26)
000010 <26-bit offset to .L1>
99. DLX Instruction Format
- Instruction formats: I-type, R-type, J-type

I-type:  opcode(6)  rs1(5)  rd(5)  immediate(16)
R-type:  opcode(6)  rs1(5)  rs2(5)  rd(5)  func(11)
J-type:  opcode(6)  offset(26)
100. Instructions for Making Decisions
- beq reg1, reg2, L1
- Go to the statement labeled L1 if the value in reg1 equals the value in reg2
- bne reg1, reg2, L1
- Go to the statement labeled L1 if the value in reg1 does not equal the value in reg2
- j L1
- Unconditional jump
- jr r10
- Jump register: jump to the instruction specified in register r10
101. Making Decisions
- Example:
- if (a != b) goto L1;   // x,y,z,a,b mapped to r1-r5
- x = y + z;
- L1: x = x - a;

  bne r4, r5, L1     ; goto L1 if a != b
  add r1, r2, r3     ; x = y + z (skipped if a != b)
L1: sub r1, r1, r4   ; x = x - a (always executed)
102. if-then-else
- Example:
- if (a == b) x = y + z;
- else x = y - z;

  bne r4, r5, Else     ; goto Else if a != b
  add r1, r2, r3       ; x = y + z
  j Exit               ; goto Exit
Else: sub r1, r2, r3   ; x = y - z
Exit:
103. Example: Loop with array index
- Loop: g = g + A[i];
        i = i + j;
        if (i != h) goto Loop;
- r1, r2, r3, r4 = g, h, i, j; array base in r5

LOOP: add r11, r3, r3    ; r11 = 2 * i
      add r11, r11, r11  ; r11 = 4 * i
      add r11, r11, r5   ; r11 = address of A[i]
      lw  r10, 0(r11)    ; load A[i]
      add r1, r1, r10    ; g = g + A[i]
      add r3, r3, r4     ; i = i + j
      bne r3, r2, LOOP
104. Other decisions
- Set R1 on R2 less than R3: slt R1, R2, R3
- Compares two registers, R2 and R3
- R1 = 1 if R2 < R3; else R1 = 0 (if R2 >= R3)
- Example: slt r11, r1, r2
- Branch if less than:
- Example: if (A < B) goto LESS;
- slt r11, r1, r2     ; r11 = 1 if A < B
- bne r11, r0, LESS
105. Loops
- Example:
- while (A[i] == k)   // i, j, k in r3, r4, r5
-   i = i + j;        // A is in r6

Loop: sll r11, r3, 2     ; r11 = 4 * i
      add r11, r11, r6   ; r11 = address of A[i]
      lw  r10, 0(r11)    ; r10 = A[i]
      bne r10, r5, Exit  ; goto Exit if A[i] != k
      add r3, r3, r4     ; i = i + j
      j Loop             ; goto Loop
Exit:
106. Addresses in Branches and Jumps
- Instructions:
- bne r14, r15, Label   ; next instruction is at Label if r14 != r15
- beq r14, r15, Label   ; next instruction is at Label if r14 == r15
- j Label               ; next instruction is at Label
- Formats:
  I-type: op(6) rs(5) rt(5) 16-bit address
  J-type: op(6) 26-bit address
- Addresses are not 32 bits. How do we handle this with large programs?
- First idea: limit branch targets to the first 2^16 addresses
107. Addresses in Branches
- Instructions:
- bne r14, r15, Label   ; next instruction is at Label if r14 != r15
- beq r14, r15, Label   ; next instruction is at Label if r14 == r15
- Format: I-type: op(6) rs(5) rt(5) 16-bit address
- Treat the 16-bit number as an offset to the PC register: PC-relative addressing
- Word offset instead of byte offset. Why? Instructions are word-aligned, so a word offset reaches 4x as far
- Most branches are local (principle of locality)
- Jump instructions just use the high order bits of the PC: pseudodirect addressing
- 32-bit jump address = 4 most significant bits of the PC concatenated with the 26-bit word address (or 28-bit byte address)
- Address boundaries of 256 MB (see the sketch below)
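A minimal C sketch of the two target computations described here, PC-relative branches and pseudodirect jumps; MIPS-style semantics (offset relative to PC + 4) are assumed.

  #include <stdint.h>

  /* PC-relative branch: sign-extended 16-bit word offset, scaled by 4,
     added to the address of the following instruction (PC + 4). */
  static uint32_t branch_target(uint32_t pc, int16_t offset16) {
      return (pc + 4) + (uint32_t)((int32_t)offset16 * 4);
  }

  /* Pseudodirect jump: top 4 bits of PC + 4 concatenated with the
     26-bit word address shifted into a 28-bit byte address. */
  static uint32_t jump_target(uint32_t pc, uint32_t addr26) {
      return ((pc + 4) & 0xF0000000u) | ((addr26 & 0x03FFFFFFu) << 2);
  }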
108. Conditional Branch Distance
(Plot: distribution of branch distances; 65% of integer branches jump 2 to 4 instructions.)
109. Conditional Branch Addressing
- PC-relative, since most branches are relatively close to the current PC
- At least 8 bits suggested (+/-128 instructions)
- Compare Equal/Not Equal is most important for integer programs (86%)
110. PC-relative addressing
- For larger distances, jump register (jr) is required.
111. Example
LOOP: mult $9, $19, $10   ; R9 = R19 * R10
      lw   $8, 1000($9)   ; R8 = Mem[R9 + 1000]
      bne  $8, $21, EXIT
      add  $19, $19, $20  ; i = i + j
      j    LOOP
EXIT: ...
- Assume the address of LOOP is 0x8000
(Figure: the encoded j instruction, opcode 2 with target 0x8000.)
112. Procedure calls
- Procedures or subroutines
- Needed for structured programming
- Steps followed in executing a procedure call:
- Place parameters in a place where the procedure (callee) can access them
- Transfer control to the procedure
- Acquire the storage resources needed for the procedure
- Perform the desired task
- Place results in a place where the calling program (caller) can access them
- Return control to the point of origin
113. Resources Involved
- Registers used for procedure calling:
- a0-a3: four argument registers in which to pass parameters
- v0-v1: two value registers in which to return values
- r31: one return address register to return to the point of origin
- Transferring control to the callee:
- jal ProcedureAddress
- jump-and-link to the procedure address
- the return address (PC + 4) is saved in r31
- Example: jal 20000
- Returning control to the caller:
- jr r31
- the instruction following the jal is executed next
114. Memory Stacks
Useful for stacked environments/subroutine call & return, even if an operand stack is not part of the architecture.

Stacks that grow up vs. stacks that grow down:
(Diagram: a memory region holding entries a, b, c with the stack pointer SP at c; with low addresses ("0 Little") at one end and high addresses ("inf. Big") at the other, the next empty location is above SP if the stack grows up and below SP if it grows down.)
115. Calling conventions
int func(int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}
// g,h,i,j are in a0,a1,a2,a3; f is in r8

func: addi sp, sp, -12   ; make room in the stack for 3 words
      sw   r11, 8(sp)    ; save the regs we want to use
      sw   r10, 4(sp)
      sw   r8, 0(sp)
      add  r10, a0, a1   ; r10 = g + h
      add  r11, a2, a3   ; r11 = i + j
      sub  r8, r10, r11  ; r8 has the result
      add  v0, r8, r0    ; return reg v0 has f
116. Calling (cont.)
      lw   r8, 0(sp)     ; restore r8
      lw   r10, 4(sp)    ; restore r10
      lw   r11, 8(sp)    ; restore r11
      addi sp, sp, 12    ; restore sp
      jr   ra
- We did not have to restore r10-r19 (caller save)
- We do need to restore r1-r8 (must be preserved by the callee)
117. Nested Calls
Stacking of subroutine calls & returns and environments:
(Diagram: A calls B, B calls C, C returns, B returns; the stack holds A, then A/B, then A/B/C, then A/B, then A again.)
- Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM)
- Sometimes stacks are implemented via software convention
118. Compiling a String Copy Procedure
void strcpy(char x[], char y[])
{
    int i = 0;
    while ((x[i] = y[i]) != 0)
        i += 1;
}
// x and y base addresses are in a0 and a1

strcpy: addi sp, sp, -4     ; reserve 1 word of space in the stack
        sw   r8, 0(sp)      ; save r8
        add  r8, zero, zero ; i = 0
L1:     add  r11, a1, r8    ; address of y[i] in r11
        lb   r12, 0(r11)    ; r12 = y[i]
        add  r13, a0, r8    ; address of x[i] in r13
        sb   r12, 0(r13)    ; x[i] = y[i]
        beq  r12, zero, L2  ; if y[i] == 0 goto L2
        addi r8, r8, 1      ; i = i + 1
        j    L1             ; go to L1
L2:     lw   r8, 0(sp)      ; restore r8
        addi sp, sp, 4      ; restore sp
        jr   ra             ; return
119. IA-32
- 1978: The Intel 8086 is announced (16-bit architecture)
- 1980: The 8087 floating point coprocessor is added
- 1982: The 80286 increases the address space to 24 bits, adds instructions
- 1985: The 80386 extends to 32 bits, new addressing modes
- 1989-1995: The 80486, Pentium, and Pentium Pro add a few instructions (mostly designed for higher performance)
- 1997: 57 new MMX instructions are added; Pentium II
- 1999: The Pentium III adds another 70 instructions (SSE)
- 2001: Another 144 instructions (SSE2)
- 2003: AMD extends the architecture to increase the address space to 64 bits, widens all registers to 64 bits, and makes other changes (AMD64)
- 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions
- This history illustrates the impact of the "golden handcuffs" of compatibility: adding new features as someone might add clothing to a packed bag; an architecture that is difficult to explain and impossible to love
120. IA-32 Overview
- Complexity:
- Instructions from 1 to 17 bytes long
- One operand must act as both a source and destination
- One operand can come from memory
- Complex addressing modes, e.g., "base or scaled index with 8 or 32 bit displacement"
- Saving grace:
- The most frequently used instructions are not too difficult to build
- Compilers avoid the portions of the architecture that are slow
- "What the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective"
121. IA32 Registers
- Oversimplified architecture:
- Four 32-bit general purpose registers
- eax, ebx, ecx, edx
- "al" is a register name for the lower 8 bits of eax
- Stack pointer:
- esp
- Fun fact:
- Once upon a time, x86 was only a 16-bit CPU
- So, when they upgraded x86 to 32 bits...
- ...they added an "e" in front of every register name and called it "extended"
122. Intel 80x86 Integer Registers
GPR0   EAX   Accumulator
GPR1   ECX   Count register: string, loop
GPR2   EDX   Data register: multiply, divide
GPR3   EBX   Base address register
GPR4   ESP   Stack pointer
GPR5   EBP   Base pointer for base of stack segment
GPR6   ESI   Index register
GPR7   EDI   Index register
CS     Code segment pointer
SS     Stack segment pointer
DS     Data segment pointer
ES     Extra data segment pointer
FS     Data segment 2
GS     Data segment 3
PC     EIP   Instruction pointer
EFLAGS Condition codes
123. x86 Assembly
- mov <dest>, <src>
- Move the value from <src> into <dest>
- Used to set initial values
- add <dest>, <src>
- Add the value from <src> to <dest>
- sub <dest>, <src>
- Subtract the value from <src> from <dest>
124. x86 Assembly
- push <target>
- Push the value in <target> onto the stack
- Also decrements the stack pointer, ESP (remember: the stack grows from high to low)
- pop <target>
- Pops the value from the top of the stack, puts it in <target>
- Also increments the stack pointer, ESP
125. x86 Assembly
- jmp <address>
- Jump to an instruction (like goto); changes EIP to <address>
- call <address>
- A function call. Pushes the address of the next instruction (the return address) onto the stack, and jumps to <address>
126. x86 Assembly
- lea <dest>, <src>
- Load Effective Address of <src> into register <dest>. Used for pointer arithmetic (no actual memory reference)
- int <value>
- Interrupt: a hardware signal to the operating system kernel, with flag <value>; int 0x80 means a Linux system call
127. x86 Assembly
Condition codes:
- CF: Carry Flag - overflow detection (unsigned)
- ZF: Zero Flag
- SF: Sign Flag
- OF: Overflow Flag - overflow detection (signed)
Condition codes are usually accessed through conditional branches (not directly).
128. Interrupt convention
int 0x80: system call interrupt
- eax: system call number (e.g. 1 = exit, 2 = fork, 3 = read, 4 = write)
- ebx: argument 1
- ecx: argument 2
- edx: argument 3
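Putting slides 123-128 together, here is a minimal sketch of invoking the exit system call through int 0x80 from C, assuming a 32-bit x86 Linux target and GCC inline assembly; this is illustrative, not the course's own example.

  /* Build with: gcc -m32 exit_syscall.c (assumes 32-bit x86 Linux + GCC) */
  int main(void) {
      /* eax = 1 (exit), ebx = 42 (exit status), then trap into the kernel. */
      __asm__ volatile (
          "int $0x80"
          :
          : "a"(1), "b"(42)
      );
      return 0;  /* never reached: the kernel terminates the process */
  }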
129. CISC vs RISC
- RISC: Reduced Instruction Set Computer (DLX)
- CISC: Complex Instruction Set Computer (x86)
- Both have their advantages.
130. RISC
- Not very many instructions
- All instructions are about the same length, in both execution time and bit length
- Results in simpler CPUs (easier to optimize)
- Usually takes more instructions