Title: Memory Hierarchy Design
1 Memory Hierarchy Design
2 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
3 Many Levels in Memory Hierarchy
- Pipeline registers
- Register file
  - Invisible only to high-level-language programmers
- 1st-level cache (on-chip)
- 2nd-level cache (on same MCM as CPU)
  - Caches are usually made invisible to the programmer (even assembly programmers)
  - There can also be a 3rd (or more) cache level here
  - Our focus in Chapter 5
- Physical memory (usually mounted on same board as CPU)
- Virtual memory (on hard disk, often in same enclosure as CPU)
- Disk files (on hard disk, often in same enclosure as CPU)
- Network-accessible disk files (often in the same building as the CPU)
- Tape backup/archive system (often in the same building as the CPU)
- Data warehouse: robotically-accessed room full of shelves of tapes (usually on the same planet as the CPU)
4 Simple Hierarchy Example
- Note the many orders of magnitude of change in characteristics between levels
5 CPU vs. Memory Performance Trends
- Relative performance (vs. 1980 performance) as a function of year
- CPU performance improved roughly 35%/year until 1986, then roughly 55%/year
- DRAM performance improved only about 7%/year
6 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
7 Cache Basics
- A cache is a (hardware-managed) storage, intermediate in size, speed, and cost-per-bit between the programmer-visible registers and main physical memory.
- The cache itself may be SRAM or fast DRAM.
- There may be > 1 level of caches.
- Basis for caches to work: the Principle of Locality
  - When a location is accessed, it and nearby locations are likely to be accessed again soon.
  - Temporal locality: the same location is likely to be accessed again soon.
  - Spatial locality: nearby locations are likely to be accessed soon.
8 Four Basic Questions
- Consider levels in a memory hierarchy.
  - Use a block as the unit of data transfer, to exploit the Principle of Locality.
  - Blocks are transferred between cache levels, and between cache and memory.
- The design of a level is described by four behaviors:
  - Block Placement: where can a new block be placed in the level?
  - Block Identification: how is a block found if it is in the level?
  - Block Replacement: which existing block should be replaced if necessary?
  - Write Strategy: how are writes to the block handled?
9 Block Placement Schemes
10 Direct-Mapped Placement
- A block can only go into one frame in the cache
  - Determined by the block's address (in memory space)
  - The frame number is usually given by some low-order bits of the block address
  - This can also be expressed as:
    - (Frame number) = (Block address) mod (Number of frames (sets) in cache)
- Note that in a direct-mapped cache, block placement and replacement are both completely determined by the address of the new block that is to be accessed, as in the sketch below.
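As a concrete illustration of the mod-based placement rule above, here is a minimal C sketch; the block size and frame count are illustrative assumptions, not values from this slide.

```c
#include <stdint.h>

/* Minimal sketch of direct-mapped placement/identification.
 * The geometry (64-byte blocks, 512 frames) is an illustrative
 * assumption, not a value taken from the slide.                 */
#define BLOCK_SIZE 64u    /* bytes per block          */
#define NUM_FRAMES 512u   /* frames (= sets) in cache */

/* Frame the block must occupy: (block address) mod (number of frames). */
static inline uint32_t frame_of(uint64_t addr) {
    uint64_t block_addr = addr / BLOCK_SIZE;     /* drop the block offset  */
    return (uint32_t)(block_addr % NUM_FRAMES);  /* low-order index bits   */
}

/* Tag stored with the frame: the block-address bits above the index. */
static inline uint64_t tag_of(uint64_t addr) {
    return (addr / BLOCK_SIZE) / NUM_FRAMES;
}
```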
11 Direct-Mapped Identification
- (Figure) The address is split into tag, frame number, and offset fields. The frame number is decoded to select one row of block frames; the tag stored with that frame is compared against the address tag, the offset drives the mux that selects the data word, and a tag match signals a hit.
12 Fully-Associative Placement
- One alternative to direct-mapped:
  - Allow a block to fill any empty frame in the cache.
- How do we then locate the block later?
  - Associate each stored block with a tag that identifies the block's location in the cache.
  - When the block is needed, treat the cache as an associative memory, using the tag to match all frames in parallel, to pull out the appropriate block.
- Another alternative to direct-mapped is placement under full program control.
  - A register file can be viewed as a small programmer-controlled cache (with 1-word blocks).
13 Fully-Associative Identification
- (Figure) The address is split only into a block address and an offset. The block address is compared in parallel against the tags of all block frames; the matching frame drives the mux that selects the data word, and a match signals a hit.
- Note that, compared to direct-mapped:
  - More address bits have to be stored with each block frame.
  - A comparator is needed for each frame, to do the parallel associative lookup.
14 Set-Associative Placement
- The block address determines not a single frame, but a frame set (several frames, grouped together).
  - (Frame set) = (Block address) mod (Number of frame sets)
- The block can be placed associatively anywhere within that frame set.
- If there are n frames in each frame set, the scheme is called n-way set-associative.
  - Direct mapped = 1-way set-associative.
  - Fully associative = there is only 1 frame set.
15 Set-Associative Identification
- (Figure) The address is split into tag, set index, and offset. The set index selects one of the sets (4 separate sets in the figure); the tags of the frames in that set are compared in parallel, the mux selects the data word, and a match signals a hit.
- Intermediate between direct-mapped and fully-associative in the number of tag bits that must be associated with cache frames.
- Still need a comparator for each frame (but only those in one set need be activated).
16 Cache Size Equation
- Simple equation for the size of a cache:
  - (Cache size) = (Block size) × (Number of sets) × (Set associativity)
- Can relate to the size of various address fields:
  - (Block size) = 2^(# of offset bits)
  - (Number of sets) = 2^(# of index bits)
  - (# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits)
- Memory address = [ tag | index | offset ], as worked through in the sketch below.
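A small C sketch of these relationships, using an arbitrary illustrative geometry (8KB cache, 32-byte blocks, 4-way, 32-bit addresses); the numbers are assumptions chosen only to make the arithmetic concrete.

```c
#include <stdio.h>

/* Sketch: derive offset/index/tag widths from the cache-size equation.
 * All parameters below are illustrative (any power-of-two geometry works). */
static unsigned log2u(unsigned x) {   /* x assumed to be a power of two */
    unsigned b = 0;
    while (x > 1) { x >>= 1; b++; }
    return b;
}

int main(void) {
    unsigned cache_size = 8 * 1024;   /* bytes                 */
    unsigned block_size = 32;         /* bytes per block       */
    unsigned assoc      = 4;          /* ways per set          */
    unsigned addr_bits  = 32;         /* memory address width  */

    unsigned num_sets    = cache_size / (block_size * assoc);
    unsigned offset_bits = log2u(block_size);
    unsigned index_bits  = log2u(num_sets);
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;

    /* Prints: sets=64 offset=5 index=6 tag=21 */
    printf("sets=%u offset=%u index=%u tag=%u\n",
           num_sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```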
17 Replacement Strategies
- Which block do we replace when a new block comes in (on a cache miss)?
- Direct-mapped: there's only one choice!
- Associative (fully- or set-):
  - If any frame in the set is empty, pick one of those.
  - Otherwise, there are many possible strategies:
    - Random: simple, fast, and fairly effective
    - Least-Recently Used (LRU), and approximations thereof
      - Requires bits to record replacement info; e.g., 4-way has 4! = 24 permutations, so 5 bits are needed to encode the full MRU-to-LRU ordering (a small bookkeeping sketch follows this list)
    - FIFO: replace the oldest block
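A minimal software sketch of exact-LRU bookkeeping for one 4-way set; this models the 24 possible orderings directly as an array, whereas real hardware would encode the same information in roughly 5 state bits per set.

```c
/* Sketch of exact-LRU bookkeeping for one 4-way set: keep way numbers
 * ordered from most recently used to least recently used.             */
#define WAYS 4

typedef struct {
    int order[WAYS];   /* order[0] = MRU way ... order[WAYS-1] = LRU way */
} lru_state;

/* Mark `way` as most recently used. */
static void lru_touch(lru_state *s, int way) {
    int i = 0;
    while (i < WAYS && s->order[i] != way)
        i++;
    if (i == WAYS)
        return;                        /* way not tracked; nothing to do */
    for (; i > 0; i--)                 /* shift more-recent entries down */
        s->order[i] = s->order[i - 1];
    s->order[0] = way;
}

/* On a miss with no empty frame, evict the least recently used way. */
static int lru_victim(const lru_state *s) {
    return s->order[WAYS - 1];
}
```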
18 Write Strategies
- Most accesses are reads, not writes
  - Especially if instruction reads are included
- Optimize for reads: read performance matters most
  - A direct-mapped cache can return the value before the tag/valid check completes
- Writes are more difficult
  - Can't write to the cache until we know it is the right block
  - The object written may have various sizes (1-8 bytes)
- When to synchronize cache with memory?
  - Write-through: write to cache and to memory
    - Prone to stalls due to high bandwidth requirements
  - Write-back: write to memory only upon replacement
    - Memory may be out of date
19 Another Write Strategy
- Maintain a FIFO queue (write buffer) of cache frames (e.g., can use a doubly-linked list)
- Meanwhile, take items from the head of the queue and write them to memory as fast as the bus can handle
  - Reads might take priority, or use a separate bus
- Advantage: write stalls are minimized, while keeping memory as up-to-date as possible
20 Write Miss Strategies
- What do we do on a write to a block that's not in the cache?
- Two main strategies (neither stops the processor):
  - Write-allocate (fetch on write): cache the block.
  - No-write-allocate (write around): just write to memory.
- Write-back caches tend to use write-allocate.
- Write-through caches tend to use no-write-allocate.
- In the write-back strategy, a dirty bit indicates whether a write-back is needed on replacement.
21 Example: Alpha 21264
- 64KB, 2-way, 64-byte blocks, 512 sets
- 44 physical address bits (field widths are worked out below)
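Plugging these numbers into the cache-size equation from slide 16 (a quick check, not part of the original slide):
- Number of sets = 64KB / (64 bytes × 2 ways) = 512, as stated
- Offset bits = log2(64) = 6; index bits = log2(512) = 9
- Tag bits = 44 − 9 − 6 = 29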
22 Instruction vs. Data Caches
- Instructions and data have different patterns of temporal and spatial locality
  - Also, instructions are generally read-only
- Can have separate instruction and data caches
- Advantages
  - Doubles the bandwidth between the CPU and the memory hierarchy
  - Each cache can be optimized for its own pattern of locality
- Disadvantages
  - Slightly more complex design
  - Can't dynamically adjust the cache space taken up by instructions vs. data
23 I/D Split and Unified Caches
- Misses per 1000 accesses:

| Size  | I-Cache | D-Cache | Unified Cache |
|-------|---------|---------|---------------|
| 8KB   | 8.16    | 44.0    | 63.0          |
| 16KB  | 3.82    | 40.9    | 51.0          |
| 32KB  | 1.36    | 38.4    | 43.3          |
| 64KB  | 0.61    | 36.9    | 39.4          |
| 128KB | 0.30    | 35.3    | 36.2          |
| 256KB | 0.02    | 32.6    | 32.9          |

- The instruction miss rate is much lower than the data miss rate
24 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
25 Cache Performance Equations
- Memory stalls per program (blocking cache)
- CPU time formula
- More cache performance equations will be given later! (The standard forms of the two formulas above are sketched below.)
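For reference, the standard blocking-cache forms of these two formulas are:
- Memory stall cycles = IC × (Memory accesses / Instruction) × (Miss rate) × (Miss penalty)
- CPU time = IC × (CPI_execution + Memory stall cycles per instruction) × (Clock cycle time)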
26 Cache Performance Example
- Ideal CPI = 2.0, memory references per instruction = 1.5, cache size = 64KB, miss penalty = 75ns, hit time = 1 clock cycle
- Compare the performance of two caches (worked through below):
  - Direct-mapped (1-way): cycle time = 1ns, miss rate = 1.4%
  - 2-way: cycle time = 1.25ns, miss rate = 1.0%
27 Out-Of-Order Processor
- Define a new miss penalty that accounts for overlap
  - Compute the total memory latency and the overlapped portion of that latency
- Example (from the previous slide):
  - Assume 30% of the 75ns penalty can be overlapped, but with a longer (1.25ns) cycle on the 1-way design due to OOO (a rough calculation follows)
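Under that assumption, the non-overlapped miss penalty is 75ns × (1 − 0.30) = 52.5ns, so, as a sketch:
- AMAT(1-way, OOO) = 1.25ns + 1.4% × 52.5ns ≈ 1.99ns
i.e., the out-of-order 1-way design roughly matches the 2-way cache's average access time from the previous slide.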
28 Cache Performance Improvement
- Consider the cache performance equation:
  - (Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
  - The (Miss rate) × (Miss penalty) term is the amortized miss penalty.
- It follows that there are three basic quantities to attack, addressed by four groups of techniques:
  - Reducing miss penalty (5.4)
  - Reducing miss rate (5.5)
  - Reducing miss penalty/rate via parallelism (5.6)
  - Reducing hit time (5.7)
- Note that by Amdahl's Law, there will be diminishing returns from reducing only the hit time or only the amortized miss penalty, rather than both together.
29 Cache Performance Improvement
- Reduce miss penalty:
  - Multilevel caches; critical word first and early restart; priority to read misses; merging write buffer; victim cache
- Reduce miss rate:
  - Larger block size; larger cache size; higher associativity; way prediction and pseudo-associative caches; compiler optimizations
- Reduce miss penalty/rate via parallelism:
  - Non-blocking caches; hardware prefetching; compiler-controlled prefetching
- Reduce hit time:
  - Small and simple caches; avoiding address translation when indexing the cache; pipelined cache access; trace caches
30 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
31 Multi-Level Caches
- Which is more important: faster caches or larger caches?
- Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
- Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
- Plugging the 2nd equation into the first:
  - Average memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
32 Multi-level Cache Terminology
- Local miss rate
  - The miss rate of one hierarchy level by itself
  - = (# of misses at that level) / (# of accesses to that level)
  - e.g., Miss rate(L1), Miss rate(L2)
- Global miss rate
  - The miss rate of a whole group of hierarchy levels
  - = (# of accesses going out of that group to lower levels) / (# of accesses into that group)
  - Generally this is the product of the local miss rates at each level in the group
  - Global L2 miss rate = Miss rate(L1) × Local miss rate(L2) (a small numeric example follows)
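For instance, with illustrative numbers (not from the slide): if Miss rate(L1) = 4% and Local miss rate(L2) = 50%, then Global L2 miss rate = 0.04 × 0.50 = 2%, i.e., only 2% of all CPU accesses go beyond the L2.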
33 Effect of 2-level Caching
- L2 size is usually much bigger than L1
  - Provides a reasonable hit rate
  - Decreases the miss penalty of the 1st-level cache
  - May increase the L2 miss penalty
- Multiple-level cache inclusion property
  - Inclusive cache: L1 is a subset of L2; simplifies the cache coherence mechanism; effective cache size = L2
  - Exclusive cache: L1 and L2 contents are exclusive; effective cache size = L1 + L2
  - Enforcing the inclusion property: backward invalidation on L2 replacement
34 L2 Cache Performance
- The global miss rate of a two-level system is similar to the miss rate of a single cache of the same (L2) size
- The local miss rate is not a good measure for evaluating secondary caches
35 Early Restart, Critical Word First
- Early restart
  - Don't wait for the entire block to fill
  - Resume the CPU as soon as the requested word is fetched
- Critical word first
  - a.k.a. wrapped fetch, requested word first
  - Fetch the requested word from memory first
  - Resume the CPU
  - Then transfer the rest of the cache block
- Most beneficial if the block size is large
- Commonly used in modern processors
36 Read Misses Take Priority
- The processor must wait on a read, not on a write
  - The miss penalty is higher for reads to begin with, so there is more benefit from reducing the read miss penalty
- A write buffer can queue values to be written
  - They are written out when the memory bus is not busy with reads
  - Careful about memory consistency issues:
    - What if we want to read a block that is in the write buffer?
    - Wait for the write, then read the block from memory
    - Better: read the block out of the write buffer
- Dirty block replacement when reading
  - Write old block, then read new block: delays the read
  - Old block to buffer, read new block, then write old block: better!
37 Sub-block Placement
- Larger blocks have smaller tags (match faster)
- Smaller blocks have a lower miss penalty
- Compromise solution:
  - Use a large block size for tagging purposes
  - Use a small block size (sub-block) for transfer purposes
  - How? Valid bits are associated with sub-blocks.
- (Figure: tag array with per-sub-block valid bits alongside the block array)
38 Merging Write Buffer
- A mechanism to help reduce write stalls
- On a write to memory, the block address and the data to be written are placed in a write buffer
- The CPU can continue immediately
  - Unless the write buffer is full
- Write merging
  - If the same block is written again before it has been flushed to memory, the old contents are replaced with the new contents
  - Care must be taken not to violate memory consistency and proper write ordering
39 Write Merging Example
40 Victim Cache
- A small extra cache
  - Holds blocks overflowing from the occasional over-full frame set
- Very effective for reducing conflict misses
- Can be checked in parallel with the main cache
  - Insignificant increase in hit time
41 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
42 Three Types of Misses
- Compulsory
  - During a program, the very first access to a block will not be in the cache (unless pre-fetched)
- Capacity
  - The working set of blocks accessed by the program is too large to fit in the cache
- Conflict
  - Unless the cache is fully associative, blocks may sometimes be evicted too early because too many frequently-accessed blocks map to the same limited set of frames
43 Misses by Type
- (Figure: misses by type vs. cache size, with the conflict-miss component highlighted)
- Conflict misses are significant in a direct-mapped cache
- Going from direct-mapped to 2-way helps about as much as doubling the cache size
- Going from direct-mapped to 4-way is better than doubling the cache size
44 As Fraction of Total Misses
- (Figure: the same miss data plotted as fractions of total misses)
45 Larger Block Size
- Keep cache size and associativity constant
- Reduces compulsory misses
  - Due to spatial locality: more accesses are to an already-fetched block
- Increases capacity misses
  - More unused locations pulled into the cache
- May increase conflict misses (slightly)
  - Fewer sets may mean more blocks mapping to each set
  - Depends on the pattern of addresses accessed
- Increases miss penalty: longer block transfers
46 Block Size Effect
- Miss rate actually goes up if the block is too large relative to the cache size
47 Larger Caches
- Keep block size, set size, etc. constant
- No effect on compulsory misses
  - The block still won't be there on its first access!
- Reduces capacity misses
  - More capacity!
- Reduces conflict misses (in general)
  - Working blocks are spread out over more frame sets
  - Fewer blocks map to a set on average
  - Less chance that the number of active blocks that map to a given set exceeds the set size
- But increases hit time! (And cost.)
48 Higher Associativity
- Keep cache size and block size constant
  - i.e., decrease the number of sets
- No effect on compulsory misses
- No effect on capacity misses
  - By definition, these are misses that would happen anyway in a fully-associative cache
- Decreases conflict misses
  - Blocks in an active set are less likely to be evicted early
  - (for set sizes smaller than capacity)
- Can increase hit time (slightly)
  - Direct-mapped is fastest
  - n-way associative lookup is a bit slower for larger n
49 Performance Comparison
- Assume:
  - 4KB cache; 1-way miss rate = 9.8%, 4-way miss rate = 7.1%
- (A worked comparison, under stated assumptions, follows.)
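One way to complete the comparison (a sketch; the 25-clock-cycle miss penalty and the 1.44x hit time for 4-way are assumptions chosen to be consistent with the table on the next slide, not values given on this slide):
- AMAT(1-way) = 1.00 + 0.098 × 25 ≈ 3.45 clock cycles
- AMAT(4-way) = 1.44 + 0.071 × 25 ≈ 3.22 clock cycles
so the 4-way cache wins at 4KB, matching the first row of the following table.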
50 Higher Set-Associativity
- Average memory access time (clock cycles):

| Cache Size | 1-way | 2-way | 4-way | 8-way |
|------------|-------|-------|-------|-------|
| 4KB   | 3.44 | 3.25 | 3.22 | 3.28 |
| 8KB   | 2.69 | 2.58 | 2.55 | 2.62 |
| 16KB  | 2.23 | 2.40 | 2.46 | 2.53 |
| 32KB  | 2.06 | 2.30 | 2.37 | 2.45 |
| 64KB  | 1.92 | 2.14 | 2.18 | 2.25 |
| 128KB | 1.52 | 1.84 | 1.92 | 2.00 |
| 256KB | 1.32 | 1.66 | 1.74 | 1.82 |
| 512KB | 1.20 | 1.55 | 1.59 | 1.66 |

- Higher associativity increases the cycle time
- The table shows the resulting average memory access time
- 1-way is better in most cases
51 Way Prediction
- Keep way-prediction information in each set to predict which block in the set will be accessed next
  - Only one tag is matched in the first cycle; on a misprediction, the other blocks are examined
- Beneficial in two aspects
  - Fast data access: access the data without waiting for the tag comparison results
  - Low power: only a single tag is matched, if the majority of predictions are correct
- Different systems use variations of this concept
52 Pseudo-Associative Caches
- Essentially 2-way set-associative, but with sequential (rather than parallel) lookups
- Fast hit time if the first frame checked is right
- An occasional slow hit if an earlier conflict had moved the block to its backup location
53 Pseudo-Associative Caches
- Placement
  - Place block b in frame (b mod n).
- Identification
  - Look for block b first in frame (b mod n), then in its secondary location ((b + n/2) mod n), i.e., flip the most-significant index bit. If found there, the primary and secondary blocks are swapped.
  - May maintain an MRU bit to reduce the search and for better replacement.
- Replacement
  - The block in frame (b mod n) is moved to the secondary location ((b + n/2) mod n). (The block there is flushed.)
- Write strategy
  - Any desired write strategy can be used. (A small sketch of the frame calculation follows.)
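A minimal C sketch of the primary/secondary frame calculation described above; n is an illustrative parameter assumed to be a power of two.

```c
#include <stdint.h>

/* Sketch of pseudo-associative frame selection for a cache with n frames. */
static inline uint32_t primary_frame(uint64_t block, uint32_t n) {
    return (uint32_t)(block % n);
}

static inline uint32_t secondary_frame(uint64_t block, uint32_t n) {
    /* (b + n/2) mod n: equivalent to flipping the most-significant index bit. */
    return (uint32_t)((block + n / 2) % n);
}
```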
54 Compiler Optimizations
- Reorganize code to improve its locality properties
- The hardware designer's favorite solution
  - Requires no new hardware!
- Various cache-aware techniques:
  - Merging arrays
  - Loop interchange
  - Loop fusion
  - Blocking (in multidimensional arrays)
  - Other source-to-source transformation techniques
55 Loop Blocking: Matrix Multiply
- Before and after versions of the loop nest (a sketch is given below)
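The slide's before/after code appeared as an image; the following C sketch reconstructs a typical pair of the kind discussed here. N and the blocking factor B are illustrative parameters (N assumed divisible by B), and x is assumed zero-initialized for the blocked variant.

```c
#define N 512   /* matrix dimension (illustrative)          */
#define B 32    /* blocking factor, tuned to cache capacity */

/* Before: straightforward i-j-k matrix multiply.
 * For large N, rows of y and columns of z are evicted before reuse. */
void matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* After: blocked (tiled) version. Each B x B tile of z is reused many
 * times while still resident in the cache, cutting capacity misses.
 * x must be zero-initialized, since partial sums are accumulated.    */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```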
56 Effect of Compiler Optimizations
57 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
- (Next section: reducing miss penalty / miss rate via parallelism)
58 Non-blocking Caches
- Also known as lockup-free caches; "hit under miss"
- While a miss is being processed,
  - allow other cache lookups to continue anyway
- Useful in dynamically scheduled CPUs
  - Other instructions may be waiting in the load queue
- Reduces effective miss penalty
  - Useful CPU work fills the miss-penalty delay slot
- "Hit under multiple miss", "miss under miss"
  - Extend the technique to allow multiple misses to be queued up, while still processing new hits
59 Non-blocking Caches
60 Hardware Prefetching
- When memory is idle, speculatively fetch some blocks before the CPU first asks for them!
- Simple heuristic: fetch 1 or more blocks that are consecutive to the last one(s) fetched
- Often, the extra blocks are placed in a special stream buffer so as not to conflict with actually active blocks in the cache; otherwise the prefetch may pollute the cache
- Prefetching can reduce misses considerably
- Speculative fetches should be low-priority
  - Use only otherwise-unused memory bandwidth
- Energy-inefficient (like all speculation)
61 Compiler-Controlled Prefetching
- Insert special instructions to load addresses from memory well before they are needed
- Design choices: register vs. cache prefetch, faulting vs. nonfaulting
  - Semantic invisibility, non-blocking behavior
- Can considerably reduce misses
- Can also cause extra conflict misses
  - Replacing a block before it is completely used
- Can also delay valid accesses (tying up the bus)
  - Prefetches are low-priority and can be pre-empted by a real access
- (A small sketch using a compiler prefetch hint follows.)
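As an illustration, GCC and Clang expose a nonfaulting prefetch hint through the __builtin_prefetch builtin; the look-ahead distance of 16 elements below is an illustrative tuning parameter, not a rule from the slide.

```c
/* Sketch of compiler-controlled (software) prefetching using the
 * GCC/Clang builtin __builtin_prefetch.                            */
double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            /* rw=0: prefetch for read; locality=1: modest temporal reuse */
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}
```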
62 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
63 Small and Simple Caches
- Make the cache smaller to improve hit time
  - Or (probably better), add a new, smaller "L0" cache between the existing L1 cache and the CPU
- Keep the L1 cache on the same chip as the CPU
  - Physically close to the functional units that access it
- Keep the L1 design simple, e.g., direct-mapped
  - Avoids multiple tag comparisons
  - The tag can be compared after the data is fetched from the cache
  - Reduces effective hit time
64 Access Time in a CMOS Cache
65 Avoid Address Translation
- In systems with virtual address spaces, virtual addresses must be mapped to physical addresses
- If cache blocks are indexed/tagged with physical addresses, we must do this translation before we can do the cache lookup: long hit time!
- Solution: access the cache using the virtual address ("virtual cache")
- Drawback: cache flush on context switch
  - Can fix by tagging blocks with Process IDs (PIDs)
- Another problem: aliasing, i.e., two virtual addresses mapped to the same physical address
  - Fix with anti-aliasing hardware or page coloring
66 Benefit of PID Tags in Virtual Cache
- (Figure: virtual-cache miss rate under three conditions: purging on context switches (no PIDs), using PID tags, and with no context switching at all)
67 Pipelined Cache Access
- Pipeline cache access so that
  - the effective latency of a first-level cache hit can be multiple clock cycles
  - giving a fast cycle time but slow hits
  - Hit time is 1 cycle for the Pentium, 2 for the Pentium III, and 4 for the Pentium 4
- Increases the number of pipeline stages
  - Higher penalty on mispredicted branches
  - More cycles from issue of a load to use of its data
- In reality, this increases the bandwidth of cache accesses rather than decreasing the actual latency of a cache hit
68 Trace Caches
- Goal: supply enough instructions per cycle without dependencies
  - Finding ILP beyond 4 instructions per cycle
- Don't limit the instructions in a static cache block to spatial locality
  - Instead, cache a dynamic sequence of instructions, including taken branches
- NetBurst (Pentium 4) uses a trace cache
- Addresses are no longer aligned
- The same instruction may be stored more than once
  - If it is part of multiple traces
69 Summary of Cache Optimizations
70 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations for Improving Performance
- Memory Technology
- Virtual Memory
- Conclusion
71 Wider Main Memory
72 Simple Interleaved Memory
- Adjacent words are found in different memory banks
  - Banks can be accessed in parallel
  - Overlaps the latencies of accessing each word
- Can use a narrow bus
  - Accessed words are returned sequentially over it
- Fits well with sequential access, e.g., to the words of a cache block (see the small sketch below)
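A minimal C sketch of low-order interleaving across banks; the bank count of 4 is an illustrative assumption.

```c
#include <stdint.h>

/* Sketch of simple (low-order) interleaving across NBANKS memory banks. */
#define NBANKS 4u

static inline uint32_t bank_of(uint64_t word_addr) {
    return (uint32_t)(word_addr % NBANKS);  /* adjacent words hit different banks */
}

static inline uint64_t row_within_bank(uint64_t word_addr) {
    return word_addr / NBANKS;              /* address presented to that bank */
}
```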
73 Independent Memory Banks
- Original motivation for memory banks:
  - higher bandwidth by interleaving sequential accesses
- Independent banks also allow multiple independent accesses
  - Each bank requires separate address/data lines
- Non-blocking caches allow the CPU to proceed beyond a cache miss
  - allowing multiple simultaneous cache misses
  - which is possible only with memory banks
74 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
75 Main Memory
- Bandwidth: bytes read or written per unit time
- Latency: described by
  - Access time: delay between initiation and completion
    - For reads: from presenting the address until the result is ready
  - Cycle time: minimum interval between separate requests to memory
- Address lines: a separate bus from CPU to memory to carry addresses
- RAS (Row Access Strobe)
  - First half of the address, sent first
- CAS (Column Access Strobe)
  - Second half of the address, sent second
76 RAS vs. CAS
- (Figure: DRAM bit-cell array)
  1. RAS selects a row
  2. Parallel readout of all row data
  3. CAS selects a column to read
  4. The selected bit is written to the memory bus
77 Typical DRAM Organization (256 Mbit)
- (Figure: the address is split into a low 14-bit half and a high 14-bit half)
78 Types of Memory
- DRAM (Dynamic Random Access Memory)
  - Cell design needs only 1 transistor per bit stored
  - Cell charges leak away and may dynamically (over time) drift from their initial levels
  - Requires periodic refreshing to correct drift
    - e.g., every 8 ms
    - Time spent refreshing is kept to < 5% of bandwidth
- SRAM (Static Random Access Memory)
  - Cell voltages are statically (unchangingly) tied to power-supply references. No drift, no refresh.
  - But needs 4-6 transistors per bit
- DRAM: 4-8x larger capacity, 8-16x slower, 8-16x cheaper per bit
79 Amdahl/Case Rule
- Memory size (and I/O bandwidth) should grow linearly with CPU speed
  - Typical: 1 MB of main memory and 1 Mbps of I/O bandwidth per 1 MIPS of CPU performance
  - Takes a fairly constant 8 seconds to scan the entire memory (if memory bandwidth = I/O bandwidth, 4 bytes/load, 1 load per 4 instructions, and no latency problem)
- Moore's Law
  - DRAM size doubles every 18 months (up ~60%/yr)
  - Tracks processor speed improvements
- Unfortunately, DRAM latency has only decreased ~7%/year. Latency is a big deal.
80 Some DRAM Trend Data
- Since 1998, the rate of increase in chip capacity has slowed to 2x per 2 years:
  - 128 Mb in 1998, 256 Mb in 2000, 512 Mb in 2002
81 ROM and Flash
- ROM (Read-Only Memory)
  - Nonvolatile; provides protection (contents cannot be overwritten)
- Flash
  - Nonvolatile RAM
  - NVRAMs require no power to maintain state
  - Reading flash is near DRAM speeds
  - Writing is 10-100x slower than DRAM
  - Frequently used for upgradeable embedded software
  - Used in embedded processors
82 DRAM Variations
- SDRAM: Synchronous DRAM
  - DRAM internal operation synchronized by a clock signal provided on the memory bus
  - Double Data Rate (DDR) uses both clock edges
- RDRAM: RAMBUS (Inc.) DRAM
  - Proprietary DRAM interface technology
    - on-chip interleaving / multi-bank technology
    - a high-speed packet-switched (split-transaction) bus interface
    - byte-wide interface, synchronous, dual-rate
  - Licensed to many chip and CPU makers
  - Higher bandwidth, but more costly than generic SDRAM
- DRDRAM: Direct RDRAM (2nd-edition spec.)
  - Separate row and column address/command buses
  - Higher bandwidth (18-bit data, more banks, faster clock)
83 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
84 Virtual Memory
- The addition of the virtual memory mechanism complicates cache access
85 Paging vs. Segmentation
- Paged segments: each segment consists of an integral number of pages, for easy replacement, while each segment can still be treated as a unit
86 Four Important Questions
- Where can a block be placed in main memory?
  - The operating system takes care of it
  - The miss (page fault) penalty is so large that placement is fully associative
- How is a block found in main memory?
  - A page table is used
  - The offset is concatenated when paging is used
  - The offset is added when segmentation is used
- Which block should be replaced when needed?
  - LRU (or an approximation) is used, to minimize page faults
- What happens on a write?
  - Magnetic disks take millions of cycles to access
  - Always write back (using a dirty bit)
87 Addressing Virtual Memories
88 Fast Address Calculation
- Page tables are very large
  - Kept in main memory
  - So a naive translation requires two memory accesses for one read or write
- Remember the last translation
  - Reuse it if the next address is on the same page
- Exploit the principle of locality
  - If accesses have locality, the address translations should also have locality
  - Keep the address translations in a cache: the translation lookaside buffer (TLB)
  - The tag part stores the virtual page number and the data part stores the physical page (frame) number (a small lookup sketch follows)
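A minimal software model of translation through a TLB; the sizes (4KB pages, a 64-entry direct-mapped TLB) are illustrative assumptions, and real TLBs are usually fully or highly associative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12u    /* 4KB pages (illustrative) */
#define TLB_ENTRIES 64u

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number (the tag)   */
    uint64_t ppn;   /* physical page number (the data) */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns false on a TLB miss
 * (the page table would then be walked and the TLB refilled). */
bool translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn   = vaddr >> PAGE_BITS;
    uint64_t index = vpn % TLB_ENTRIES;
    if (tlb[index].valid && tlb[index].vpn == vpn) {
        *paddr = (tlb[index].ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;   /* TLB hit */
    }
    return false;      /* TLB miss */
}
```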
89 TLB Example: Alpha 21264
- (Figure: the address space number field plays the same role as a PID tag)
90 A Memory Hierarchy Example
91 Protection of Virtual Memory
- Maintain two registers:
  - Base
  - Bound
- For each address, check:
  - base < address < bound
- Provide two modes:
  - User
  - OS (kernel, supervisor, executive)
- (A tiny sketch of the check appears below.)
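A tiny C sketch of the base-and-bound check as stated on the slide; some real designs use inclusive comparisons on one or both sides, so treat the strict inequalities as following the slide's wording.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t base;
    uint64_t bound;
} protection_regs;

/* Allow the access only if base < address < bound. */
static inline bool access_ok(const protection_regs *r, uint64_t address) {
    return r->base < address && address < r->bound;
}
```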
92 Alpha 21264 Virtual Address Mapping
- Supports both segmentation and paging
93 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
94 Design of Memory Hierarchies
- Superscalar CPUs: number of ports to the cache
- Speculative execution and the memory system
- Combining the instruction cache with fetch and decode
- Caches in embedded systems!
  - Real-time vs. power constraints
- I/O and consistency of cached data
  - The cache coherence problem
95 The Cache Coherency Problem
96 Alpha 21264 Memory Hierarchy