Title: Memory Hierarchy Design
1 Memory Hierarchy Design
2 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
3 Many Levels in Memory Hierarchy
- Pipeline registers
- Register file
  - Invisible only to high-level-language programmers
- 1st-level cache (on-chip)
- 2nd-level cache (on same MCM as CPU)
  - Caches are usually made invisible to the programmer (even assembly programmers)
  - There can also be a 3rd (or more) cache level here
  - Our focus in Chapter 5
- Physical memory (usually mounted on same board as CPU)
- Virtual memory (on hard disk, often in same enclosure as CPU)
- Disk files (on hard disk, often in same enclosure as CPU)
- Network-accessible disk files (often in the same building as the CPU)
- Tape backup/archive system (often in the same building as the CPU)
- Data warehouse: robotically-accessed room full of shelves of tapes (usually on the same planet as the CPU)
4 Simple Hierarchy Example
- Note the many orders of magnitude of change in characteristics between levels
5 CPU vs. Memory Performance Trends
- Relative performance (vs. 1980 performance) as a function of year
- CPU performance improved roughly 35%/year until 1986, then roughly 55%/year
- DRAM performance improved only about 7%/year
6 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
7 Cache Basics
- A cache is a (hardware-managed) storage, intermediate in size, speed, and cost-per-bit between the programmer-visible registers and main physical memory.
- The cache itself may be SRAM or fast DRAM.
- There may be > 1 level of caches.
- Basis for caches to work: the Principle of Locality
  - When a location is accessed, it and nearby locations are likely to be accessed again soon.
  - Temporal locality: the same location is likely to be accessed again soon.
  - Spatial locality: nearby locations are likely to be accessed soon.
8 Four Basic Questions
- Consider levels in a memory hierarchy.
  - Use a block as the unit of data transfer, to exploit the Principle of Locality.
  - Blocks are transferred between cache levels, and between cache and memory.
- The design of a level is described by four behaviors:
  - Block Placement: where can a new block be placed in the level?
  - Block Identification: how is a block found if it is in the level?
  - Block Replacement: which existing block should be replaced if necessary?
  - Write Strategy: how are writes to the block handled?
9 Block Placement Schemes
10 Direct-Mapped Placement
- A block can only go into one frame in the cache
  - Determined by the block's address (in memory space)
  - The frame number is usually given by some low-order bits of the block address
  - This can also be expressed as:
    - (Frame number) = (Block address) mod (Number of frames (sets) in cache)
- Note that in a direct-mapped cache, block placement and replacement are both completely determined by the address of the new block that is to be accessed, as in the sketch below.
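As a concrete illustration of the mod-based placement rule above, here is a minimal C sketch; the block size and frame count are illustrative assumptions, not values from this slide.

```c
#include <stdint.h>

/* Minimal sketch of direct-mapped placement/identification.
 * The geometry (64-byte blocks, 512 frames) is an illustrative
 * assumption, not a value taken from the slide.                 */
#define BLOCK_SIZE 64u    /* bytes per block          */
#define NUM_FRAMES 512u   /* frames (= sets) in cache */

/* Frame the block must occupy: (block address) mod (number of frames). */
static inline uint32_t frame_of(uint64_t addr) {
    uint64_t block_addr = addr / BLOCK_SIZE;     /* drop the block offset  */
    return (uint32_t)(block_addr % NUM_FRAMES);  /* low-order index bits   */
}

/* Tag stored with the frame: the block-address bits above the index. */
static inline uint64_t tag_of(uint64_t addr) {
    return (addr / BLOCK_SIZE) / NUM_FRAMES;
}
```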
11 Direct-Mapped Identification
- (Figure) The address is split into tag, frame number, and offset fields. The frame number is decoded to select one row of block frames; the tag stored with that frame is compared against the address tag, the offset drives the mux that selects the data word, and a tag match signals a hit.
12 Fully-Associative Placement
- One alternative to direct-mapped:
  - Allow a block to fill any empty frame in the cache.
- How do we then locate the block later?
  - Associate each stored block with a tag that identifies the block's location in the cache.
  - When the block is needed, treat the cache as an associative memory, using the tag to match all frames in parallel, to pull out the appropriate block.
- Another alternative to direct-mapped is placement under full program control.
  - A register file can be viewed as a small programmer-controlled cache (with 1-word blocks).
13 Fully-Associative Identification
- (Figure) The address is split only into a block address and an offset. The block address is compared in parallel against the tags of all block frames; the matching frame drives the mux that selects the data word, and a match signals a hit.
- Note that, compared to direct-mapped:
  - More address bits have to be stored with each block frame.
  - A comparator is needed for each frame, to do the parallel associative lookup.
14 Set-Associative Placement
- The block address determines not a single frame, but a frame set (several frames, grouped together).
  - (Frame set) = (Block address) mod (Number of frame sets)
- The block can be placed associatively anywhere within that frame set.
- If there are n frames in each frame set, the scheme is called n-way set-associative.
  - Direct mapped = 1-way set-associative.
  - Fully associative = there is only 1 frame set.
15 Set-Associative Identification
- (Figure) The address is split into tag, set index, and offset. The set index selects one of the sets (4 separate sets in the figure); the tags of the frames in that set are compared in parallel, the mux selects the data word, and a match signals a hit.
- Intermediate between direct-mapped and fully-associative in the number of tag bits that must be associated with cache frames.
- Still need a comparator for each frame (but only those in one set need be activated).
16 Cache Size Equation
- Simple equation for the size of a cache:
  - (Cache size) = (Block size) × (Number of sets) × (Set associativity)
- Can relate to the size of various address fields:
  - (Block size) = 2^(# of offset bits)
  - (Number of sets) = 2^(# of index bits)
  - (# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits)
- Memory address = [ tag | index | offset ], as worked through in the sketch below.
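A small C sketch of these relationships, using an arbitrary illustrative geometry (8KB cache, 32-byte blocks, 4-way, 32-bit addresses); the numbers are assumptions chosen only to make the arithmetic concrete.

```c
#include <stdio.h>

/* Sketch: derive offset/index/tag widths from the cache-size equation.
 * All parameters below are illustrative (any power-of-two geometry works). */
static unsigned log2u(unsigned x) {   /* x assumed to be a power of two */
    unsigned b = 0;
    while (x > 1) { x >>= 1; b++; }
    return b;
}

int main(void) {
    unsigned cache_size = 8 * 1024;   /* bytes                 */
    unsigned block_size = 32;         /* bytes per block       */
    unsigned assoc      = 4;          /* ways per set          */
    unsigned addr_bits  = 32;         /* memory address width  */

    unsigned num_sets    = cache_size / (block_size * assoc);
    unsigned offset_bits = log2u(block_size);
    unsigned index_bits  = log2u(num_sets);
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;

    /* Prints: sets=64 offset=5 index=6 tag=21 */
    printf("sets=%u offset=%u index=%u tag=%u\n",
           num_sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```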
17 Replacement Strategies
- Which block do we replace when a new block comes in (on a cache miss)?
- Direct-mapped: there's only one choice!
- Associative (fully- or set-):
  - If any frame in the set is empty, pick one of those.
  - Otherwise, there are many possible strategies:
    - Random: simple, fast, and fairly effective
    - Least-Recently Used (LRU), and approximations thereof
      - Requires bits to record replacement info; e.g., 4-way has 4! = 24 permutations, so 5 bits are needed to encode the full MRU-to-LRU ordering (a small bookkeeping sketch follows this list)
    - FIFO: replace the oldest block
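A minimal software sketch of exact-LRU bookkeeping for one 4-way set; this models the 24 possible orderings directly as an array, whereas real hardware would encode the same information in roughly 5 state bits per set.

```c
/* Sketch of exact-LRU bookkeeping for one 4-way set: keep way numbers
 * ordered from most recently used to least recently used.             */
#define WAYS 4

typedef struct {
    int order[WAYS];   /* order[0] = MRU way ... order[WAYS-1] = LRU way */
} lru_state;

/* Mark `way` as most recently used. */
static void lru_touch(lru_state *s, int way) {
    int i = 0;
    while (i < WAYS && s->order[i] != way)
        i++;
    if (i == WAYS)
        return;                        /* way not tracked; nothing to do */
    for (; i > 0; i--)                 /* shift more-recent entries down */
        s->order[i] = s->order[i - 1];
    s->order[0] = way;
}

/* On a miss with no empty frame, evict the least recently used way. */
static int lru_victim(const lru_state *s) {
    return s->order[WAYS - 1];
}
```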
18 Write Strategies
- Most accesses are reads, not writes
  - Especially if instruction reads are included
- Optimize for reads: read performance matters most
  - A direct-mapped cache can return the value before the tag/valid check completes
- Writes are more difficult
  - Can't write to the cache until we know it is the right block
  - The object written may have various sizes (1-8 bytes)
- When to synchronize cache with memory?
  - Write-through: write to cache and to memory
    - Prone to stalls due to high bandwidth requirements
  - Write-back: write to memory only upon replacement
    - Memory may be out of date
19 Another Write Strategy
- Maintain a FIFO queue (write buffer) of cache frames (e.g., can use a doubly-linked list)
- Meanwhile, take items from the head of the queue and write them to memory as fast as the bus can handle
  - Reads might take priority, or use a separate bus
- Advantage: write stalls are minimized, while keeping memory as up-to-date as possible
20 Write Miss Strategies
- What do we do on a write to a block that's not in the cache?
- Two main strategies (neither stops the processor):
  - Write-allocate (fetch on write): cache the block.
  - No-write-allocate (write around): just write to memory.
- Write-back caches tend to use write-allocate.
- Write-through caches tend to use no-write-allocate.
- In the write-back strategy, a dirty bit indicates whether a write-back is needed on replacement.
21 Example: Alpha 21264
- 64KB, 2-way, 64-byte blocks, 512 sets
- 44 physical address bits (field widths are worked out below)
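Plugging these numbers into the cache-size equation from slide 16 (a quick check, not part of the original slide):
- Number of sets = 64KB / (64 bytes × 2 ways) = 512, as stated
- Offset bits = log2(64) = 6; index bits = log2(512) = 9
- Tag bits = 44 − 9 − 6 = 29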
22 Instruction vs. Data Caches
- Instructions and data have different patterns of temporal and spatial locality
  - Also, instructions are generally read-only
- Can have separate instruction and data caches
- Advantages
  - Doubles the bandwidth between the CPU and the memory hierarchy
  - Each cache can be optimized for its own pattern of locality
- Disadvantages
  - Slightly more complex design
  - Can't dynamically adjust the cache space taken up by instructions vs. data
23 I/D Split and Unified Caches
- Misses per 1000 accesses:

| Size  | I-Cache | D-Cache | Unified Cache |
|-------|---------|---------|---------------|
| 8KB   | 8.16    | 44.0    | 63.0          |
| 16KB  | 3.82    | 40.9    | 51.0          |
| 32KB  | 1.36    | 38.4    | 43.3          |
| 64KB  | 0.61    | 36.9    | 39.4          |
| 128KB | 0.30    | 35.3    | 36.2          |
| 256KB | 0.02    | 32.6    | 32.9          |

- The instruction miss rate is much lower than the data miss rate
24 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
25 Cache Performance Equations
- Memory stalls per program (blocking cache)
- CPU time formula
- More cache performance equations will be given later! (The standard forms of the two formulas above are sketched below.)
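For reference, the standard blocking-cache forms of these two formulas are:
- Memory stall cycles = IC × (Memory accesses / Instruction) × (Miss rate) × (Miss penalty)
- CPU time = IC × (CPI_execution + Memory stall cycles per instruction) × (Clock cycle time)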
26 Cache Performance Example
- Ideal CPI = 2.0, memory references per instruction = 1.5, cache size = 64KB, miss penalty = 75ns, hit time = 1 clock cycle
- Compare the performance of two caches (worked through below):
  - Direct-mapped (1-way): cycle time = 1ns, miss rate = 1.4%
  - 2-way: cycle time = 1.25ns, miss rate = 1.0%
27 Out-Of-Order Processor
- Define a new miss penalty that accounts for overlap
  - Compute the total memory latency and the overlapped portion of that latency
- Example (from the previous slide):
  - Assume 30% of the 75ns penalty can be overlapped, but with a longer (1.25ns) cycle on the 1-way design due to OOO (a rough calculation follows)
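Under that assumption, the non-overlapped miss penalty is 75ns × (1 − 0.30) = 52.5ns, so, as a sketch:
- AMAT(1-way, OOO) = 1.25ns + 1.4% × 52.5ns ≈ 1.99ns
i.e., the out-of-order 1-way design roughly matches the 2-way cache's average access time from the previous slide.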
28 Cache Performance Improvement
- Consider the cache performance equation:
  - (Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
  - The (Miss rate) × (Miss penalty) term is the amortized miss penalty.
- It follows that there are three basic quantities to attack, addressed by four groups of techniques:
  - Reducing miss penalty (5.4)
  - Reducing miss rate (5.5)
  - Reducing miss penalty/rate via parallelism (5.6)
  - Reducing hit time (5.7)
- Note that by Amdahl's Law, there will be diminishing returns from reducing only the hit time or only the amortized miss penalty, rather than both together.
29 Cache Performance Improvement
- Reduce miss penalty:
  - Multilevel caches; critical word first and early restart; priority to read misses; merging write buffer; victim cache
- Reduce miss rate:
  - Larger block size; larger cache size; higher associativity; way prediction and pseudo-associative caches; compiler optimizations
- Reduce miss penalty/rate via parallelism:
  - Non-blocking caches; hardware prefetching; compiler-controlled prefetching
- Reduce hit time:
  - Small and simple caches; avoiding address translation when indexing the cache; pipelined cache access; trace caches
30 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
31 Multi-Level Caches
- Which is more important: faster caches or larger caches?
- Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
- Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
- Plugging the 2nd equation into the first:
  - Average memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
32 Multi-level Cache Terminology
- Local miss rate
  - The miss rate of one hierarchy level by itself
  - = (# of misses at that level) / (# of accesses to that level)
  - e.g., Miss rate(L1), Miss rate(L2)
- Global miss rate
  - The miss rate of a whole group of hierarchy levels
  - = (# of accesses going out of that group to lower levels) / (# of accesses into that group)
  - Generally this is the product of the local miss rates at each level in the group
  - Global L2 miss rate = Miss rate(L1) × Local miss rate(L2) (a small numeric example follows)
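For instance, with illustrative numbers (not from the slide): if Miss rate(L1) = 4% and Local miss rate(L2) = 50%, then Global L2 miss rate = 0.04 × 0.50 = 2%, i.e., only 2% of all CPU accesses go beyond the L2.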
33 Effect of 2-level Caching
- L2 size is usually much bigger than L1
  - Provides a reasonable hit rate
  - Decreases the miss penalty of the 1st-level cache
  - May increase the L2 miss penalty
- Multiple-level cache inclusion property
  - Inclusive cache: L1 is a subset of L2; simplifies the cache coherence mechanism; effective cache size = L2
  - Exclusive cache: L1 and L2 contents are exclusive; effective cache size = L1 + L2
  - Enforcing the inclusion property: backward invalidation on L2 replacement
34 L2 Cache Performance
- The global miss rate of a two-level system is similar to the miss rate of a single cache of the same (L2) size
- The local miss rate is not a good measure for evaluating secondary caches
35 Early Restart, Critical Word First
- Early restart
  - Don't wait for the entire block to fill
  - Resume the CPU as soon as the requested word is fetched
- Critical word first
  - a.k.a. wrapped fetch, requested word first
  - Fetch the requested word from memory first
  - Resume the CPU
  - Then transfer the rest of the cache block
- Most beneficial if the block size is large
- Commonly used in modern processors
36 Read Misses Take Priority
- The processor must wait on a read, not on a write
  - The miss penalty is higher for reads to begin with, so there is more benefit from reducing the read miss penalty
- A write buffer can queue values to be written
  - They are written out when the memory bus is not busy with reads
  - Careful about memory consistency issues:
    - What if we want to read a block that is in the write buffer?
    - Wait for the write, then read the block from memory
    - Better: read the block out of the write buffer
- Dirty block replacement when reading
  - Write old block, then read new block: delays the read
  - Old block to buffer, read new block, then write old block: better!
37 Sub-block Placement
- Larger blocks have smaller tags (match faster)
- Smaller blocks have a lower miss penalty
- Compromise solution:
  - Use a large block size for tagging purposes
  - Use a small block size (sub-block) for transfer purposes
  - How? Valid bits are associated with sub-blocks.
- (Figure: tag array with per-sub-block valid bits alongside the block array)
38 Merging Write Buffer
- A mechanism to help reduce write stalls
- On a write to memory, the block address and the data to be written are placed in a write buffer
- The CPU can continue immediately
  - Unless the write buffer is full
- Write merging
  - If the same block is written again before it has been flushed to memory, the old contents are replaced with the new contents
  - Care must be taken not to violate memory consistency and proper write ordering
39 Write Merging Example
40 Victim Cache
- A small extra cache
  - Holds blocks overflowing from the occasional over-full frame set
- Very effective for reducing conflict misses
- Can be checked in parallel with the main cache
  - Insignificant increase in hit time
41 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
42 Three Types of Misses
- Compulsory
  - During a program, the very first access to a block will not be in the cache (unless pre-fetched)
- Capacity
  - The working set of blocks accessed by the program is too large to fit in the cache
- Conflict
  - Unless the cache is fully associative, blocks may sometimes be evicted too early because too many frequently-accessed blocks map to the same limited set of frames
43 Misses by Type
- (Figure: misses by type vs. cache size, with the conflict-miss component highlighted)
- Conflict misses are significant in a direct-mapped cache
- Going from direct-mapped to 2-way helps about as much as doubling the cache size
- Going from direct-mapped to 4-way is better than doubling the cache size
44 As Fraction of Total Misses
- (Figure: the same miss data plotted as fractions of total misses)
45 Larger Block Size
- Keep cache size and associativity constant
- Reduces compulsory misses
  - Due to spatial locality: more accesses are to an already-fetched block
- Increases capacity misses
  - More unused locations pulled into the cache
- May increase conflict misses (slightly)
  - Fewer sets may mean more blocks mapping to each set
  - Depends on the pattern of addresses accessed
- Increases miss penalty: longer block transfers
46 Block Size Effect
- Miss rate actually goes up if the block is too large relative to the cache size
47 Larger Caches
- Keep block size, set size, etc. constant
- No effect on compulsory misses
  - The block still won't be there on its first access!
- Reduces capacity misses
  - More capacity!
- Reduces conflict misses (in general)
  - Working blocks are spread out over more frame sets
  - Fewer blocks map to a set on average
  - Less chance that the number of active blocks that map to a given set exceeds the set size
- But increases hit time! (And cost.)
48 Higher Associativity
- Keep cache size and block size constant
  - i.e., decrease the number of sets
- No effect on compulsory misses
- No effect on capacity misses
  - By definition, these are misses that would happen anyway in a fully-associative cache
- Decreases conflict misses
  - Blocks in an active set are less likely to be evicted early
  - (for set sizes smaller than capacity)
- Can increase hit time (slightly)
  - Direct-mapped is fastest
  - n-way associative lookup is a bit slower for larger n
49 Performance Comparison
- Assume:
  - 4KB cache; 1-way miss rate = 9.8%, 4-way miss rate = 7.1%
- (A worked comparison, under stated assumptions, follows.)
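One way to complete the comparison (a sketch; the 25-clock-cycle miss penalty and the 1.44x hit time for 4-way are assumptions chosen to be consistent with the table on the next slide, not values given on this slide):
- AMAT(1-way) = 1.00 + 0.098 × 25 ≈ 3.45 clock cycles
- AMAT(4-way) = 1.44 + 0.071 × 25 ≈ 3.22 clock cycles
so the 4-way cache wins at 4KB, matching the first row of the following table.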
50 Higher Set-Associativity
- Average memory access time (clock cycles):

| Cache Size | 1-way | 2-way | 4-way | 8-way |
|------------|-------|-------|-------|-------|
| 4KB   | 3.44 | 3.25 | 3.22 | 3.28 |
| 8KB   | 2.69 | 2.58 | 2.55 | 2.62 |
| 16KB  | 2.23 | 2.40 | 2.46 | 2.53 |
| 32KB  | 2.06 | 2.30 | 2.37 | 2.45 |
| 64KB  | 1.92 | 2.14 | 2.18 | 2.25 |
| 128KB | 1.52 | 1.84 | 1.92 | 2.00 |
| 256KB | 1.32 | 1.66 | 1.74 | 1.82 |
| 512KB | 1.20 | 1.55 | 1.59 | 1.66 |

- Higher associativity increases the cycle time
- The table shows the resulting average memory access time
- 1-way is better in most cases
51 Way Prediction
- Keep way-prediction information in each set to predict which block in the set will be accessed next
  - Only one tag is matched in the first cycle; on a misprediction, the other blocks are examined
- Beneficial in two aspects
  - Fast data access: access the data without waiting for the tag comparison results
  - Low power: only a single tag is matched, if the majority of predictions are correct
- Different systems use variations of this concept
52 Pseudo-Associative Caches
- Essentially 2-way set-associative, but with sequential (rather than parallel) lookups
- Fast hit time if the first frame checked is right
- An occasional slow hit if an earlier conflict had moved the block to its backup location
53 Pseudo-Associative Caches
- Placement
  - Place block b in frame (b mod n).
- Identification
  - Look for block b first in frame (b mod n), then in its secondary location ((b + n/2) mod n), i.e., flip the most-significant index bit. If found there, the primary and secondary blocks are swapped.
  - May maintain an MRU bit to reduce the search and for better replacement.
- Replacement
  - The block in frame (b mod n) is moved to the secondary location ((b + n/2) mod n). (The block there is flushed.)
- Write strategy
  - Any desired write strategy can be used. (A small sketch of the frame calculation follows.)
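A minimal C sketch of the primary/secondary frame calculation described above; n is an illustrative parameter assumed to be a power of two.

```c
#include <stdint.h>

/* Sketch of pseudo-associative frame selection for a cache with n frames. */
static inline uint32_t primary_frame(uint64_t block, uint32_t n) {
    return (uint32_t)(block % n);
}

static inline uint32_t secondary_frame(uint64_t block, uint32_t n) {
    /* (b + n/2) mod n: equivalent to flipping the most-significant index bit. */
    return (uint32_t)((block + n / 2) % n);
}
```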
54 Compiler Optimizations
- Reorganize code to improve its locality properties
- The hardware designer's favorite solution
  - Requires no new hardware!
- Various cache-aware techniques:
  - Merging arrays
  - Loop interchange
  - Loop fusion
  - Blocking (in multidimensional arrays)
  - Other source-to-source transformation techniques
55 Loop Blocking: Matrix Multiply
- Before and after versions of the loop nest (a sketch is given below)
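The slide's before/after code appeared as an image; the following C sketch reconstructs a typical pair of the kind discussed here. N and the blocking factor B are illustrative parameters (N assumed divisible by B), and x is assumed zero-initialized for the blocked variant.

```c
#define N 512   /* matrix dimension (illustrative)          */
#define B 32    /* blocking factor, tuned to cache capacity */

/* Before: straightforward i-j-k matrix multiply.
 * For large N, rows of y and columns of z are evicted before reuse. */
void matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* After: blocked (tiled) version. Each B x B tile of z is reused many
 * times while still resident in the cache, cutting capacity misses.
 * x must be zero-initialized, since partial sums are accumulated.    */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```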
56 Effect of Compiler Optimizations
57 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
- (Next section: reducing miss penalty / miss rate via parallelism)
58 Non-blocking Caches
- Also known as lockup-free caches; "hit under miss"
- While a miss is being processed,
  - allow other cache lookups to continue anyway
- Useful in dynamically scheduled CPUs
  - Other instructions may be waiting in the load queue
- Reduces effective miss penalty
  - Useful CPU work fills the miss-penalty delay slot
- "Hit under multiple miss", "miss under miss"
  - Extend the technique to allow multiple misses to be queued up, while still processing new hits
59 Non-blocking Caches
60 Hardware Prefetching
- When memory is idle, speculatively fetch some blocks before the CPU first asks for them!
- Simple heuristic: fetch 1 or more blocks that are consecutive to the last one(s) fetched
- Often, the extra blocks are placed in a special stream buffer so as not to conflict with actually active blocks in the cache; otherwise the prefetch may pollute the cache
- Prefetching can reduce misses considerably
- Speculative fetches should be low-priority
  - Use only otherwise-unused memory bandwidth
- Energy-inefficient (like all speculation)
61 Compiler-Controlled Prefetching
- Insert special instructions to load addresses from memory well before they are needed
- Design choices: register vs. cache prefetch, faulting vs. nonfaulting
  - Semantic invisibility, non-blocking behavior
- Can considerably reduce misses
- Can also cause extra conflict misses
  - Replacing a block before it is completely used
- Can also delay valid accesses (tying up the bus)
  - Prefetches are low-priority and can be pre-empted by a real access
- (A small sketch using a compiler prefetch hint follows.)
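As an illustration, GCC and Clang expose a nonfaulting prefetch hint through the __builtin_prefetch builtin; the look-ahead distance of 16 elements below is an illustrative tuning parameter, not a rule from the slide.

```c
/* Sketch of compiler-controlled (software) prefetching using the
 * GCC/Clang builtin __builtin_prefetch.                            */
double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            /* rw=0: prefetch for read; locality=1: modest temporal reuse */
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}
```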
62 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
63 Small and Simple Caches
- Make the cache smaller to improve hit time
  - Or (probably better), add a new, smaller "L0" cache between the existing L1 cache and the CPU
- Keep the L1 cache on the same chip as the CPU
  - Physically close to the functional units that access it
- Keep the L1 design simple, e.g., direct-mapped
  - Avoids multiple tag comparisons
  - The tag can be compared after the data is fetched from the cache
  - Reduces effective hit time
64 Access Time in a CMOS Cache
65 Avoid Address Translation
- In systems with virtual address spaces, virtual addresses must be mapped to physical addresses
- If cache blocks are indexed/tagged with physical addresses, we must do this translation before we can do the cache lookup: long hit time!
- Solution: access the cache using the virtual address ("virtual cache")
- Drawback: cache flush on context switch
  - Can fix by tagging blocks with Process IDs (PIDs)
- Another problem: aliasing, i.e., two virtual addresses mapped to the same physical address
  - Fix with anti-aliasing hardware or page coloring
66 Benefit of PID Tags in Virtual Cache
- (Figure: virtual-cache miss rate under three conditions: purging on context switches (no PIDs), using PID tags, and with no context switching at all)
67 Pipelined Cache Access
- Pipeline cache access so that
  - the effective latency of a first-level cache hit can be multiple clock cycles
  - giving a fast cycle time but slow hits
  - Hit time is 1 cycle for the Pentium, 2 for the Pentium III, and 4 for the Pentium 4
- Increases the number of pipeline stages
  - Higher penalty on mispredicted branches
  - More cycles from issue of a load to use of its data
- In reality, this increases the bandwidth of cache accesses rather than decreasing the actual latency of a cache hit
68 Trace Caches
- Goal: supply enough instructions per cycle without dependencies
  - Finding ILP beyond 4 instructions per cycle
- Don't limit the instructions in a static cache block to spatial locality
  - Instead, cache a dynamic sequence of instructions, including taken branches
- NetBurst (Pentium 4) uses a trace cache
- Addresses are no longer aligned
- The same instruction may be stored more than once
  - If it is part of multiple traces
69 Summary of Cache Optimizations
70 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations for Improving Performance
- Memory Technology
- Virtual Memory
- Conclusion
71 Wider Main Memory
72 Simple Interleaved Memory
- Adjacent words are found in different memory banks
  - Banks can be accessed in parallel
  - Overlaps the latencies of accessing each word
- Can use a narrow bus
  - Accessed words are returned sequentially over it
- Fits well with sequential access, e.g., to the words of a cache block (see the small sketch below)
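A minimal C sketch of low-order interleaving across banks; the bank count of 4 is an illustrative assumption.

```c
#include <stdint.h>

/* Sketch of simple (low-order) interleaving across NBANKS memory banks. */
#define NBANKS 4u

static inline uint32_t bank_of(uint64_t word_addr) {
    return (uint32_t)(word_addr % NBANKS);  /* adjacent words hit different banks */
}

static inline uint64_t row_within_bank(uint64_t word_addr) {
    return word_addr / NBANKS;              /* address presented to that bank */
}
```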
73 Independent Memory Banks
- Original motivation for memory banks:
  - higher bandwidth by interleaving sequential accesses
- Independent banks also allow multiple independent accesses
  - Each bank requires separate address/data lines
- Non-blocking caches allow the CPU to proceed beyond a cache miss
  - allowing multiple simultaneous cache misses
  - which is possible only with memory banks
74 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
75 Main Memory
- Bandwidth: bytes read or written per unit time
- Latency: described by
  - Access time: delay between initiation and completion
    - For reads: from presenting the address until the result is ready
  - Cycle time: minimum interval between separate requests to memory
- Address lines: a separate bus from CPU to memory to carry addresses
- RAS (Row Access Strobe)
  - First half of the address, sent first
- CAS (Column Access Strobe)
  - Second half of the address, sent second
76 RAS vs. CAS
- (Figure: DRAM bit-cell array)
  1. RAS selects a row
  2. Parallel readout of all row data
  3. CAS selects a column to read
  4. The selected bit is written to the memory bus
77 Typical DRAM Organization (256 Mbit)
- (Figure: the address is split into a low 14-bit half and a high 14-bit half)
78 Types of Memory
- DRAM (Dynamic Random Access Memory)
  - Cell design needs only 1 transistor per bit stored
  - Cell charges leak away and may dynamically (over time) drift from their initial levels
  - Requires periodic refreshing to correct drift
    - e.g., every 8 ms
    - Time spent refreshing is kept to < 5% of bandwidth
- SRAM (Static Random Access Memory)
  - Cell voltages are statically (unchangingly) tied to power-supply references. No drift, no refresh.
  - But needs 4-6 transistors per bit
- DRAM: 4-8x larger capacity, 8-16x slower, 8-16x cheaper per bit
79 Amdahl/Case Rule
- Memory size (and I/O bandwidth) should grow linearly with CPU speed
  - Typical: 1 MB of main memory and 1 Mbps of I/O bandwidth per 1 MIPS of CPU performance
  - Takes a fairly constant 8 seconds to scan the entire memory (if memory bandwidth = I/O bandwidth, 4 bytes/load, 1 load per 4 instructions, and no latency problem)
- Moore's Law
  - DRAM size doubles every 18 months (up ~60%/yr)
  - Tracks processor speed improvements
- Unfortunately, DRAM latency has only decreased ~7%/year. Latency is a big deal.
80 Some DRAM Trend Data
- Since 1998, the rate of increase in chip capacity has slowed to 2x per 2 years:
  - 128 Mb in 1998, 256 Mb in 2000, 512 Mb in 2002
81 ROM and Flash
- ROM (Read-Only Memory)
  - Nonvolatile; provides protection (contents cannot be overwritten)
- Flash
  - Nonvolatile RAM
  - NVRAMs require no power to maintain state
  - Reading flash is near DRAM speeds
  - Writing is 10-100x slower than DRAM
  - Frequently used for upgradeable embedded software
  - Used in embedded processors
82 DRAM Variations
- SDRAM: Synchronous DRAM
  - DRAM internal operation synchronized by a clock signal provided on the memory bus
  - Double Data Rate (DDR) uses both clock edges
- RDRAM: RAMBUS (Inc.) DRAM
  - Proprietary DRAM interface technology
    - on-chip interleaving / multi-bank technology
    - a high-speed packet-switched (split-transaction) bus interface
    - byte-wide interface, synchronous, dual-rate
  - Licensed to many chip and CPU makers
  - Higher bandwidth, but more costly than generic SDRAM
- DRDRAM: Direct RDRAM (2nd-edition spec.)
  - Separate row and column address/command buses
  - Higher bandwidth (18-bit data, more banks, faster clock)
83 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
84 Virtual Memory
- The addition of the virtual memory mechanism complicates cache access
85 Paging vs. Segmentation
- Paged segments: each segment consists of an integral number of pages, for easy replacement, while each segment can still be treated as a unit
86 Four Important Questions
- Where can a block be placed in main memory?
  - The operating system takes care of it
  - The miss (page fault) penalty is so large that placement is fully associative
- How is a block found in main memory?
  - A page table is used
  - The offset is concatenated when paging is used
  - The offset is added when segmentation is used
- Which block should be replaced when needed?
  - LRU (or an approximation) is used, to minimize page faults
- What happens on a write?
  - Magnetic disks take millions of cycles to access
  - Always write back (using a dirty bit)
87 Addressing Virtual Memories
88 Fast Address Calculation
- Page tables are very large
  - Kept in main memory
  - So a naive translation requires two memory accesses for one read or write
- Remember the last translation
  - Reuse it if the next address is on the same page
- Exploit the principle of locality
  - If accesses have locality, the address translations should also have locality
  - Keep the address translations in a cache: the translation lookaside buffer (TLB)
  - The tag part stores the virtual page number and the data part stores the physical page (frame) number (a small lookup sketch follows)
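A minimal software model of translation through a TLB; the sizes (4KB pages, a 64-entry direct-mapped TLB) are illustrative assumptions, and real TLBs are usually fully or highly associative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12u    /* 4KB pages (illustrative) */
#define TLB_ENTRIES 64u

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number (the tag)   */
    uint64_t ppn;   /* physical page number (the data) */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns false on a TLB miss
 * (the page table would then be walked and the TLB refilled). */
bool translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn   = vaddr >> PAGE_BITS;
    uint64_t index = vpn % TLB_ENTRIES;
    if (tlb[index].valid && tlb[index].vpn == vpn) {
        *paddr = (tlb[index].ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;   /* TLB hit */
    }
    return false;      /* TLB miss */
}
```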
89 TLB Example: Alpha 21264
- (Figure: the address space number field plays the same role as a PID tag)
90 A Memory Hierarchy Example
91 Protection of Virtual Memory
- Maintain two registers:
  - Base
  - Bound
- For each address, check:
  - base < address < bound
- Provide two modes:
  - User
  - OS (kernel, supervisor, executive)
- (A tiny sketch of the check appears below.)
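A tiny C sketch of the base-and-bound check as stated on the slide; some real designs use inclusive comparisons on one or both sides, so treat the strict inequalities as following the slide's wording.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t base;
    uint64_t bound;
} protection_regs;

/* Allow the access only if base < address < bound. */
static inline bool access_ok(const protection_regs *r, uint64_t address) {
    return r->base < address && address < r->bound;
}
```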
92 Alpha 21264 Virtual Address Mapping
- Supports both segmentation and paging
93 Outline
- Introduction
- Cache Basics
- Cache Performance
- Reducing Cache Miss Penalty
- Reducing Cache Miss Rate
- Reducing Hit Time
- Main Memory and Organizations
- Memory Technology
- Virtual Memory
- Conclusion
94 Design of Memory Hierarchies
- Superscalar CPUs: number of ports to the cache
- Speculative execution and the memory system
- Combining the instruction cache with fetch and decode
- Caches in embedded systems!
  - Real-time vs. power constraints
- I/O and consistency of cached data
  - The cache coherence problem
95 The Cache Coherency Problem
96 Alpha 21264 Memory Hierarchy