Title: Memory and I/O Systems
1. EE 382N Superscalar Microprocessor Architecture, Chapter 3
- Memory and I/O Systems
- Prof. Lizy Kurian John
2. A Typical Computer System
3. Memory Hierarchy
4. Properties of an Ideal Memory System
- Infinite capacity
- Infinite bandwidth
- Instantaneous or zero latency
- Persistence or non-volatility
- Low implementation cost
5. Memory Hierarchy Components
6. Memory Hierarchy
As we move to deeper (lower) levels, the latency goes up and the price per bit goes down.
7. Memory Hierarchy
- A level closer to the processor must be
  - smaller
  - faster
  - a subset of the lower levels (it contains the most recently used data)
- The lowest level (usually disk) contains all available data
- Other levels?
8. Attributes of Memory Hierarchy Components
9. Why We Use Caches
[Figure: processor-memory performance gap, 1980-2000. CPU performance (per Moore's Law) grows about 60%/year, while DRAM performance grows about 7%/year, so the gap grows about 50%/year.]
- 1989: first Intel CPU with a cache on chip
- 1998: Pentium III has two levels of cache on chip
- 2007: many chips have 3 levels of cache
10. Memory Hierarchy Basis
- Disk contains everything.
- When the processor needs something, bring it into all higher levels of memory.
- Cache contains copies of the memory data that are being used.
- Memory contains copies of the disk data that are being used.
- The entire idea is based on locality.
11. Locality
- Temporal locality: if we use it now, we'll want to use it again soon (a Big Idea)
- Spatial locality: if we use something now, we'll want to use things near it very soon
- Caches contain the hardware mechanisms to capture the temporal and spatial locality in programs
12. Temporal and Spatial Locality
13. Capturing Locality
- Temporal locality: save what you bring in
- Spatial locality: bring in nearby items too, i.e., use large blocks
14. Cache Design
- How do we decide what to bring into the cache?
- How do we decide where to put it?
- How do we know which elements are in the cache?
- How do we quickly locate them?
- When we bring something in and there is no space, how do we make space for it?
15. Cache Design
- Mapping strategies
  - Direct mapped
  - Set associative
  - Fully associative
- Replacement strategies
  - LRU (least recently used)
  - Random, FIFO, OPTIMAL
16. Cache Organization Schemes
(a) Direct Mapped
(b) Fully Associative
(c) Set Associative
17. Cache Mapping Strategies
- Direct mapped: each memory address or block can go into only one specific location in the cache
- Set associative: a block can occupy any position within a set
- Fully associative: a block can be written into any position
18. Direct-Mapped Cache
- Cache location 0 can be occupied by data from
  - memory location 0, 4, 8, ...
  - with 4 blocks: any memory location that is a multiple of 4
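As a quick illustration of this mapping (a minimal sketch, not from the slides), the loop below prints which slot of a 4-block direct-mapped cache each memory block occupies; the block number modulo the number of slots picks the location:

```c
/* Sketch: direct-mapped placement for a 4-block cache.
 * Memory blocks 0, 4, 8, ... all land in cache slot 0. */
#include <stdio.h>

int main(void) {
    const unsigned num_slots = 4;
    for (unsigned block = 0; block <= 9; block++)
        printf("memory block %u -> cache slot %u\n", block, block % num_slots);
    return 0;
}
```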
19. Associative Cache Example
- Here's a simple 2-way set associative cache.
20. Fully Associative Cache
- Any cache location can be occupied by data from any block
21. Tag and Index Bits
- Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
22. Locating Stuff in Cache
- Index: specifies the cache index (which row of the cache we should look in)
- Offset: once we've found the correct block, specifies which byte within the block we want
- Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
23. Direct-Mapped Cache Example
- Index (index into an array of blocks)
  - need to specify the correct row in the cache
  - cache contains 16 KB = 2^14 bytes
  - block contains 2^4 bytes (4 words)
  - blocks/cache = (bytes/cache) / (bytes/block)
    = (2^14 bytes/cache) / (2^4 bytes/block)
    = 2^10 blocks/cache
  - need 10 bits to specify this many rows
24. Direct-Mapped Cache Example
- Tag: use the remaining bits as the tag
  - tag length = address length - offset - index = 32 - 4 - 10 = 18 bits
  - so the tag is the leftmost 18 bits of the memory address
- Why not use the full 32-bit address as the tag?
  - All bytes within a block share the same block address, so the 4 offset bits need not be stored
  - The index must be the same for every address within a block, so it is redundant in the tag check and can be left off to save memory (here, 10 bits)
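The field extraction above is a few shifts and masks in C. A minimal sketch under this slide's assumptions (32-bit addresses, 16-byte blocks, 1024 rows, so offset = 4 bits, index = 10, tag = 18), applied to the four addresses used on the next slide:

```c
/* Sketch: splitting a 32-bit address into tag/index/offset for a
 * direct-mapped 16 KB cache with 16-byte blocks. */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4   /* log2(16-byte block) */
#define INDEX_BITS  10  /* log2(1024 blocks)   */

int main(void) {
    uint32_t addrs[] = {0x00000014, 0x0000001C, 0x00000034, 0x00008014};
    for (int i = 0; i < 4; i++) {
        uint32_t a = addrs[i];
        uint32_t offset = a & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = a >> (OFFSET_BITS + INDEX_BITS);
        printf("0x%08X -> tag=0x%05X index=%u offset=%u\n",
               (unsigned)a, (unsigned)tag, (unsigned)index, (unsigned)offset);
    }
    return 0;
}
```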
25. Accessing Data in a Direct-Mapped Cache
- 4 addresses:
  - 0x00000014, 0x0000001C, 0x00000034, 0x00008014
- The 4 addresses divided (for convenience) into tag, index, and byte offset fields:

  Address       Tag (18 bits)         Index (10 bits)   Offset (4 bits)
  0x00000014    000000000000000000    0000000001        0100
  0x0000001C    000000000000000000    0000000001        1100
  0x00000034    000000000000000000    0000000011        0100
  0x00008014    000000000000000010    0000000001        0100
26. Fully Associative Cache
- Fully Associative Cache (e.g., 32 B block)
- compare tags in parallel
27. Fully Associative Cache (1/2)
- What does this mean?
  - no rows: any block can go anywhere in the cache
  - must compare with all tags in the entire cache to see if the data is there
- Memory address fields:
  - Tag: same as before
  - Offset: same as before
  - Index: non-existent
28. Fully Associative Cache (2/2)
- Benefit of a fully associative cache:
  - no conflict misses (since data can go anywhere)
- Drawbacks of a fully associative cache:
  - we need a hardware comparator for every single entry; if we have 64 KB of data in a cache with 4 B entries, we need 16K comparators: infeasible
29. Caching Terminology
- When we try to read memory, 3 things can happen:
  - cache hit: the cache block is valid and contains the proper address, so read the desired word
  - cache miss: nothing in the cache's appropriate block, so fetch from memory
  - cache miss, block replacement required: the data is not in the cache and some other data occupies the space; fetch the desired data from memory and replace
30. Block Replacement Policy (1/2)
- Direct mapped: the index completely specifies which position a block goes in on a miss
- N-way set associative: the index specifies a set, but the block can occupy any position within the set on a miss
- Fully associative: the block can be written into any position
- Question: if we have the choice, where should we write an incoming block?
31. Block Replacement Policy (2/2)
- If there are any locations with the valid bit off (empty), then usually write the new block into the first one.
- If all possible locations already have a valid block, we must pick a replacement policy: a rule by which we determine which block gets cached out on a miss.
32. Block Replacement Policy: LRU
- LRU (least recently used)
  - Idea: cache out the block which has been accessed (read or write) least recently
  - Pro: temporal locality → recent past use implies likely future use; in fact, this is a very effective policy
  - Con: with 2-way set associative, it is easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and much time to keep track of this
33. Block Replacement Example
- We have a 2-way set associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):
  0, 2, 0, 1, 4, 0, 2, 3, 5, 4
- How many hits and how many misses will there be with the LRU block replacement policy?
34. Block Replacement Example: LRU
- Addresses: 0, 2, 0, 1, 4, 0, ...
  - 0: miss, bring into set 0 (loc 0)
  - 2: miss, bring into set 0 (loc 1)
  - 0: hit
  - 1: miss, bring into set 1 (loc 0)
  - 4: miss, bring into set 0 (loc 1, replacing 2, the LRU block in set 0)
  - 0: hit
- Finishing the sequence (2, 3, 5, 4 all miss) gives 2 hits and 8 misses in total.
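A minimal simulation can check the count. The sketch below (not from the slides) models this cache with one LRU bit per set and replays the access stream:

```c
/* Sketch: 2-way set associative cache, two sets of one-word blocks,
 * one LRU bit per set. Replays the example's access stream. */
#include <stdio.h>

int main(void) {
    int tag[2][2] = {{-1, -1}, {-1, -1}};  /* [set][way], -1 = invalid   */
    int lru[2] = {0, 0};                   /* way to evict next, per set */
    int stream[] = {0, 2, 0, 1, 4, 0, 2, 3, 5, 4};
    int hits = 0, misses = 0;

    for (int i = 0; i < 10; i++) {
        int addr = stream[i];
        int set = addr % 2;    /* one-word blocks: low bit selects the set */
        int t = addr / 2;      /* remaining bits form the tag              */
        int way;
        if (tag[set][0] == t)      { way = 0; hits++; }
        else if (tag[set][1] == t) { way = 1; hits++; }
        else { way = lru[set]; tag[set][way] = t; misses++; }
        lru[set] = 1 - way;    /* the way just used is MRU; the other is LRU */
    }
    printf("hits=%d misses=%d\n", hits, misses);  /* prints hits=2 misses=8 */
    return 0;
}
```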
35. Block Size Tradeoff Conclusions
36. Block Size Tradeoff (1/3)
- Benefits of larger block size:
  - Spatial locality: if we access a given word, we're likely to access other nearby words soon
  - Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well
  - Works nicely in sequential array accesses too
37. Block Size Tradeoff (2/3)
- Drawbacks of larger block size:
  - A larger block size means a larger miss penalty
    - on a miss, it takes longer to load a new block from the next level
  - If the block size is too big relative to the cache size, then there are too few blocks
    - Result: the miss rate goes up
- In general, minimize Average Memory Access Time (AMAT)
  - AMAT = Hit Time + Miss Penalty × Miss Rate
38. Block Size Tradeoff (3/3)
- Hit Time: time to find and retrieve data from the current-level cache
- Miss Penalty: average time to retrieve data on a current-level miss (includes the possibility of misses on successive levels of the memory hierarchy)
- Hit Rate: % of requests that are found in the current-level cache
- Miss Rate = 1 - Hit Rate
39. Cache Design Parameters
40. What to Do on a Write Hit?
- Write-through
  - update the word in the cache block and the corresponding word in memory
- Write-back
  - update the word in the cache block
  - allow the memory word to be stale
  - write it back later
  - → add a dirty bit to each block, indicating that memory needs to be updated when the block is replaced
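A minimal sketch (the structures are hypothetical, not from the slides) contrasting the two write-hit policies:

```c
/* Sketch: write-through vs. write-back handling of a write hit. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t tag;
    int      valid;
    int      dirty;      /* meaningful only for write-back */
    uint8_t  data[16];
} CacheLine;

/* Write-through: update the cached word and memory together,
 * so memory is never stale. */
void write_hit_wt(CacheLine *line, int off, uint8_t v,
                  uint8_t *mem, uint32_t addr) {
    line->data[off] = v;
    mem[addr] = v;
}

/* Write-back: update only the cache and set the dirty bit; the block
 * is written to memory later, when it is replaced. */
void write_hit_wb(CacheLine *line, int off, uint8_t v) {
    line->data[off] = v;
    line->dirty = 1;     /* memory is stale until eviction */
}

int main(void) {
    static uint8_t mem[64];
    CacheLine line = { .tag = 0, .valid = 1, .dirty = 0, .data = {0} };
    write_hit_wt(&line, 0, 0xAB, mem, 16);
    write_hit_wb(&line, 1, 0xCD);
    printf("mem[16]=0x%02X dirty=%d\n", mem[16], line.dirty); /* 0xAB, 1 */
    return 0;
}
```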
41. Write-Allocate / No-Write-Allocate
- Under a write-through (WT) strategy, what happens on a write miss? Is the block brought into the cache on a write?
  - WTNWA (no-write-allocate): no
  - WTWA (write-allocate): yes
42. Types of Cache Misses (1/2)
- Three Cs model of misses
- 1st C: compulsory misses
  - occur when a program is first started
  - the cache does not contain any of that program's data yet, so misses are bound to occur
  - can't be avoided easily, so we won't focus on these in this course
43. Types of Cache Misses (2/2)
- 2nd C: conflict misses
  - a miss that occurs because two distinct memory addresses map to the same cache location
  - two blocks (which happen to map to the same location) can keep overwriting each other
  - a big problem in direct-mapped caches
  - how do we lessen the effect of these?
- Dealing with conflict misses
  - Solution 1: make the cache size bigger
    - fails at some point
  - Solution 2: let multiple distinct blocks fit in the same cache index?
44. Third Type of Cache Miss
- Capacity misses
  - a miss that occurs because the cache has a limited size
  - a miss that would not occur if we increased the size of the cache
  - a sketchy definition, so just get the general idea
- This is the primary type of miss for fully associative caches.
45. Average Memory Access Time (AMAT)
- AMAT = Hit Time + Miss Penalty × Miss Rate
- CPI = Ideal CPI (core CPI) + MCPI
- MCPI = memory CPI
46. Example
- Assume:
  - hit time = 1 cycle
  - miss rate = 5%
  - miss penalty = 20 cycles
- Calculate AMAT:
  - avg. memory access time = 1 + 0.05 × 20 = 1 + 1 = 2 cycles
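The same arithmetic in code, as a quick check of the example above:

```c
/* Checks the AMAT example: 1 + 0.05 * 20 = 2 cycles. */
#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 20.0;
    printf("AMAT = %.1f cycles\n", hit_time + miss_rate * miss_penalty);
    return 0;
}
```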
47. Cache Area Overhead
- A cache contains useful data plus the tag, valid bit, dirty bit, etc.
- If a cache is described as 16 KB, often 16 KB is the useful data capacity
  - the cache RAM is often 20 or 24 KB
- The amount of area spent on tags depends on the mapping strategy and the block size
  - fully associative means more tag area
  - a small block size means more tag area
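A back-of-the-envelope sketch of where the extra RAM goes, using the earlier 16 KB direct-mapped example (18-bit tags; assuming one valid and one dirty bit per block, which is my assumption, not from the slides):

```c
/* Sketch: total cache RAM for 16 KB of data, direct mapped,
 * 16-byte blocks, 32-bit addresses. */
#include <stdio.h>

int main(void) {
    const int blocks     = 16 * 1024 / 16;  /* 1024 blocks            */
    const int data_bits  = 16 * 8;          /* 128 data bits per block */
    const int tag_bits   = 32 - 10 - 4;     /* 18 tag bits per block   */
    const int state_bits = 2;               /* valid + dirty (assumed) */
    long total = (long)blocks * (data_bits + tag_bits + state_bits);
    printf("data: %d Kbit, total RAM: %ld Kbit (%.1f%% overhead)\n",
           blocks * data_bits / 1024, total / 1024,
           100.0 * (tag_bits + state_bits) / data_bits);
    return 0;
}
```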
48. A Typical Memory Hierarchy
49. A Typical Main Memory Organization
50. DRAM Chip Organization
51. Memory Module Organization
52. Virtual Memory System
53. Another View of the Memory Hierarchy
[Diagram: from the upper level (faster) to the lower level (larger): registers hold instruction operands; the cache and the L2 cache hold blocks; memory holds pages; disk holds files; tape sits below disk.]
54. Memory Hierarchy Requirements
- If the Principle of Locality allows caches to offer (close to) the speed of cache memory with the size of DRAM memory, then why not apply it recursively at the next level to give the speed of DRAM memory with the size of disk?
- While we're at it, what other things do we need from our memory system?
55. Virtual Memory
- Allows the OS to share memory and protect programs from each other
- Today, more important for protection than as just another level of the memory hierarchy
- Each process thinks it has all the memory to itself
- Historically, it predates caches
56. Comparing the Two Levels of Hierarchy

  Cache version                    Virtual memory version
  -------------                    ----------------------
  Block or line                    Page
  Miss                             Page fault
  Block size: 32-64 B              Page size: 4 KB - 8 KB
  Placement: direct mapped or      Placement: fully associative
    N-way set associative
  Replacement: LRU or random       Replacement: LRU
  Write-through or write-back      Write-back
57. Virtual-to-Physical Address Translation
[Diagram: a program operates in its virtual address space; an HW mapping translates each virtual address (instruction fetch, load, store) to a physical address (instruction fetch, load, store) in physical memory, including caches.]
- Each program operates in its own virtual address space, as if it were the only program running
- Each is protected from the others
- The OS can decide where each goes in memory
- Hardware (HW) provides the virtual → physical mapping
58. Mapping Virtual Memory to Physical Memory
- Divide memory into equal-sized chunks, or pages (about 4 KB - 8 KB)
- Any chunk of virtual memory can be assigned to any chunk of physical memory (a page)
[Diagram: a virtual address space, with the stack at the top, mapped chunk by chunk onto a 64 MB physical memory.]
59. Paging Organization (assume 1 KB pages)
The page is the unit of mapping. The page is also the unit of transfer from disk to physical memory.
60.
- Use a table lookup (the "Page Table") for mappings; the virtual page number is the index
- Physical Page Number = PageTable[Virtual Page Number]
- (The P.P.N. is also called the page frame number.)
61. Page Table
- A page table is an operating-system structure which contains the mapping of virtual addresses to physical locations
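The lookup on slide 60 is literally an array index. A minimal sketch, assuming 1 KB pages (as in the paging slide) and a hypothetical page_table array:

```c
/* Sketch: Physical Page Number = page_table[Virtual Page Number];
 * the page offset passes through unchanged. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 10                       /* 1 KB pages */

uint32_t translate(const uint32_t *page_table, uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;           /* virtual page number  */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    uint32_t ppn    = page_table[vpn];              /* physical page number */
    return (ppn << PAGE_BITS) | offset;
}

int main(void) {
    uint32_t pt[4] = {7, 2, 5, 0};                  /* hypothetical mappings */
    printf("0x%X -> 0x%X\n", 0x0412u, translate(pt, 0x0412u)); /* VPN 1 -> PPN 2 */
    return 0;
}
```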
62. Address Mapping: Page Table
The page table is located in physical memory.
63. Paging/Virtual Memory with Multiple Processes
[Diagram: User A's and User B's virtual address spaces, each with code, static, heap, and stack segments starting at address 0, map into a single 64 MB physical memory.]
64. Virtual Memory Problem 1
- Mapping every address requires one indirection through the page table in memory per virtual address, so 1 virtual memory access = 2 physical memory accesses: SLOW!
- Observation: since there is locality in the pages of data, there must be locality in the virtual address translations of those pages
- Since small is fast, why not use a small cache of virtual-to-physical address translations to make translation fast?
- For historical reasons, this cache is called a Translation Lookaside Buffer, or TLB
65. Translation Lookaside Buffers (TLBs)
- TLBs are usually small, typically 128-256 entries
- Like any other cache, the TLB can be direct mapped, set associative, or fully associative
[Diagram: the processor sends a VA to the TLB; on a hit, the resulting PA goes to the cache, which returns data on a hit; on a TLB miss, the translation comes from the page table; on a cache miss, the data comes from main memory.]
- On a TLB miss, get the page table entry from main memory
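A minimal sketch of the lookup flow in the diagram: probe a small fully associative TLB first, and fall back to the page table in memory on a miss. The TLB size and the FIFO replacement are illustrative assumptions, not from the slides:

```c
/* Sketch: fully associative TLB backed by a page table in memory. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS   10
#define TLB_ENTRIES 4

typedef struct { uint32_t vpn, ppn; int valid; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_table[64];     /* backing page table in memory */
static int next_victim = 0;         /* simple FIFO replacement      */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;
    uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)           /* compare all entries */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | off; /* TLB hit */
    /* TLB miss: get the page table entry from main memory and cache it. */
    uint32_t ppn = page_table[vpn];
    tlb[next_victim] = (TlbEntry){vpn, ppn, 1};
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return (ppn << PAGE_BITS) | off;
}

int main(void) {
    page_table[3] = 9;                              /* hypothetical mapping */
    printf("0x%X -> 0x%X (TLB miss)\n", 0xC04u, translate(0xC04u));
    printf("0x%X -> 0x%X (TLB hit)\n",  0xC08u, translate(0xC08u));
    return 0;
}
```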
66. What If It Is Not in the TLB?
- Option 1: hardware checks the page table and loads the new page table entry into the TLB
- Option 2: hardware traps to the OS; it is up to the OS to decide what to do
- MIPS follows Option 2: the hardware knows nothing about the page table
67. What If the Data Is on Disk?
- We load the page off the disk into a free block of memory, using a DMA (Direct Memory Access; very fast!) transfer
- Meanwhile we switch to some other process waiting to be run
- When the DMA is complete, we get an interrupt and update the process's page table
- So when we switch back to the task, the desired data will be in memory
68. What If We Don't Have Enough Memory?
- We choose some other page belonging to a program and transfer it onto the disk if it is dirty
  - if it is clean (the disk copy is up to date), we just overwrite that data in memory
  - we choose the page to evict based on a replacement policy (e.g., LRU)
- We update that program's page table to reflect the fact that its memory moved somewhere else
- Continuously swapping between disk and memory is called thrashing
69. Virtual Memory Overview (1/4)
- User program's view of memory:
  - contiguous
  - starts from some set address
  - infinitely large
  - the only running program
- Reality:
  - non-contiguous
  - starts wherever available memory is
  - finite size
  - many programs running at a time
70. Virtual Memory Overview (2/4)
- Virtual memory provides:
  - the illusion of contiguous memory
  - all programs starting at the same set address
  - the illusion of infinite memory (2^32 or 2^64 bytes)
  - protection
71. Virtual Memory Overview (3/4)
- Implementation:
  - divide memory into chunks (pages)
  - the operating system controls the page table that maps virtual addresses into physical addresses
- Think of memory as a cache for disk
- The TLB is a cache for the page table
72. Virtual Memory Overview (4/4)
- Let's say we're fetching some data:
  - Check the TLB (input: VPN, output: PPN)
    - hit: fetch the translation
    - miss: check the page table (in memory)
      - page table hit: fetch the translation
      - page table miss: page fault; fetch the page from disk into memory, return the translation to the TLB
  - Check the cache (input: PPN, output: data)
    - hit: return the value
    - miss: fetch the value from memory
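A compact end-to-end sketch of this fetch sequence (everything here is hypothetical and sized tiny so the control flow stays visible): TLB, then page table, then page fault for the translation; then cache, then memory for the data:

```c
/* Sketch: TLB -> page table -> page fault, then cache -> memory.
 * One-entry TLB and cache; the page-fault handler just pretends a
 * DMA transfer filled a free physical page. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 10

static uint32_t tlb_vpn, tlb_ppn;  static int tlb_valid;       /* 1-entry TLB  */
static uint32_t pt_ppn[16];        static int pt_present[16];  /* page table   */
static uint32_t c_addr, c_data;    static int c_valid;         /* 1-line cache */
static uint32_t mem[16 << PAGE_BITS];                          /* physical mem */
static uint32_t next_free_ppn = 1;

static uint32_t page_fault(uint32_t vpn) {     /* "fetch page from disk" */
    uint32_t ppn = next_free_ppn++;            /* pretend DMA filled it  */
    pt_ppn[vpn] = ppn; pt_present[vpn] = 1;
    return ppn;
}

uint32_t fetch(uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_BITS, ppn;
    if (tlb_valid && tlb_vpn == vpn)  ppn = tlb_ppn;     /* TLB hit        */
    else {
        ppn = pt_present[vpn] ? pt_ppn[vpn]              /* page table hit */
                              : page_fault(vpn);         /* page fault     */
        tlb_vpn = vpn; tlb_ppn = ppn; tlb_valid = 1;     /* fill the TLB   */
    }
    uint32_t paddr = (ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    if (c_valid && c_addr == paddr) return c_data;       /* cache hit      */
    c_addr = paddr; c_data = mem[paddr]; c_valid = 1;    /* miss: fill     */
    return c_data;
}

int main(void) {
    mem[(1u << PAGE_BITS) | 4] = 42;   /* value at the page the fault maps in */
    printf("%u\n", fetch(0x0004));     /* page fault + cache miss -> 42       */
    printf("%u\n", fetch(0x0004));     /* TLB hit + cache hit -> 42           */
    return 0;
}
```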
73. Overview of Address Translation
74. Virtual Memory System
75. Handling a Page Fault
76. A Typical Page Table Entry
77. Multilevel Forward Page Table
78. Hashed Page Table
79. Memory Hierarchy Implementation
80. Direct Mapped Cache
(a) Single word per block
(b) Multi-word per block
81. Fully Associative Cache
82. Set Associative Cache
83. Translation of a Virtual Word Address
84. Translation of a Virtual Page Address
85. Direct Mapped TLB
86. Other Configurations of TLB
(a) Set Associative TLB
(b) Fully Associative TLB
87. Interaction Between the TLB and the D-cache
88. Virtually Indexed D-cache
89. Input/Output Systems
90. Disk Drive Structures
91. Striping Data in Disk Arrays
92. Placement of Parity Blocks
93. Bus Design Parameters
94. Time Sharing the CPU