Title: Chapter 7: Large and Fast: Exploiting Memory Hierarchy
1 Chapter 7: Large and Fast: Exploiting Memory Hierarchy
2 Principle of Locality
- Programs access a relatively small portion of their address space at any given time.
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
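Both kinds of locality appear in even the smallest programs; the array traversal below (an illustrative sketch, not taken from the slides) exhibits both:

```python
# Illustrative sketch: a sequential array sum exhibits both kinds of locality.
data = list(range(1024))

total = 0
for i in range(len(data)):
    total += data[i]   # data[i], data[i+1], ... live at adjacent addresses -> spatial locality
                       # 'total' and 'i' are touched on every iteration      -> temporal locality
```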
3 Basic Structure
4 The Principle
- Combine two concepts: locality and hierarchy.
- Temporal locality → keep the most recently accessed data items closer to the processor.
- Spatial locality → move blocks consisting of multiple contiguous words to the upper levels of the hierarchy.
5 Memory Hierarchy (I)
6 Memory Hierarchy (II)
- Data is copied between adjacent levels.
- The minimum unit of information copied is a block.
- If the requested data appears in some block in the upper level, this is called a hit; otherwise it is a miss, and a block containing the requested data is copied from a lower level.
- The hit rate, or hit ratio, is the fraction of memory accesses found in the upper level. The miss rate (1.0 - hit rate) is the fraction not found in the upper level.
- Hit time: the time to access the upper level, including the time to determine whether the access is a hit or a miss.
- Miss penalty: the time to replace a block in the upper level.
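These definitions combine into the standard average-memory-access-time calculation; a minimal sketch, with the hit time, miss rate, and miss penalty assumed for illustration:

```python
# Hedged sketch: average memory access time (AMAT) from the definitions above.
# The three numbers are made-up assumptions, not from the slides.
hit_time = 1        # cycles to access the upper level (includes hit/miss check)
miss_rate = 0.05    # fraction of accesses not found in the upper level
miss_penalty = 100  # cycles to bring the block in from the lower level

amat = hit_time + miss_rate * miss_penalty
print(amat)  # 6.0
```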
7 Memory Hierarchy (II)
8 Moore's Law
9 Cache
- A safe place for hiding or storing things.
- The level of the memory hierarchy between the processor and main memory.
- Refers to any storage managed to take advantage of locality of access.
- Motivation:
  - high processor cycle speed
  - low memory cycle speed
  - fast access to recently used portions of a program's code and data
10 The Basic Cache Concept
1. The CPU requests data item Xn.
2. The request results in a miss.
3. The word Xn is brought from memory into the cache.
11 Direct-Mapped Cache
- Each memory location is mapped to exactly one location in the cache.
- Cache index = (address of the block) modulo (number of blocks in the cache).
- Two crucial questions must be answered:
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
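The modulo mapping can be sketched in a few lines (the 8-block cache size here is an assumption for illustration):

```python
# Minimal sketch of direct-mapped lookup arithmetic, in block addresses.
NUM_BLOCKS = 8  # assumed cache size in blocks; must be a power of two

def cache_index(block_address):
    # "address of the block modulo number of blocks in the cache"
    return block_address % NUM_BLOCKS

def tag(block_address):
    # The tag is the part of the address above the index bits; comparing it
    # against the stored tag answers "is the data item in the cache?"
    return block_address // NUM_BLOCKS

print(cache_index(12), tag(12))  # 4 1
```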
12 An Example of a Direct-Mapped Cache
13 (figure-only slide)
14 Cache Contents
- Tag
  - identifies whether a word in the cache corresponds to the requested word
- Valid bit
  - indicates whether an entry contains a valid address
- Data

For a direct-mapped cache with 2^n one-word blocks and 32-bit addresses:
Tag size = 32 - n - 2 (e.g., 32 - 10 - 2 = 20 bits for n = 10)
Total size = 2^index × (valid + tag + data) = 2^n × (1 + (32 - n - 2) + 32)
15 Direct-Mapped Example
How many total bits are required for a direct-mapped cache with:
- 16 KB of data
- 4-word blocks
- 32-bit addresses

16 KB = 4K words = 2^12 words; with 4-word blocks there are 2^10 blocks, so the index is n = 10 bits and the block offset is m = 2 bits.
Tag size = 32 - 10 - 2 - 2 = 18 bits (the final 2 bits are the byte offset).
Data per block = 4 × 4 × 8 = 128 bits.
Total bits = 2^10 × (1 + 18 + 128) = 2^10 × 147 = 147 Kbits.
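The arithmetic above can be checked mechanically (a sketch assuming the same 32-bit addresses and 4-byte words):

```python
# Hedged check of the slide-15 arithmetic (32-bit addresses assumed).
data_kb = 16                 # 16 KB of cache data
words_per_block = 4
word_bits = 32

num_words = data_kb * 1024 // 4             # 4 bytes per word -> 4096 words
num_blocks = num_words // words_per_block   # 1024 blocks
n = num_blocks.bit_length() - 1             # index bits: 10
m = words_per_block.bit_length() - 1        # block-offset bits: 2
tag_bits = 32 - n - m - 2                   # 18 (the last 2 bits are the byte offset)
total_bits = num_blocks * (1 + tag_bits + words_per_block * word_bits)
print(n, m, tag_bits, total_bits)           # 10 2 18 150528 (= 147 Kbits)
```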
16 Mapping an Address to a Cache Block
Source: http://www.faculty.uaf.edu/ffdr/EE443/
17 Block Size vs. Miss Rate
18 Handling Cache Misses
- Stall the entire pipeline while fetching the requested word.
- Steps to handle an instruction cache miss:
  1. Send the original PC value (current PC - 4) to the memory.
  2. Instruct main memory to perform a read and wait for the memory to complete its access.
  3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on.
  4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.
19 Write-Through
- A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two.
- Write buffer
  - a queue that holds data while the data are waiting to be written to memory.
20 Write-Back
- A scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.
- Pro: improves performance, especially when writes are frequent (and could not be absorbed by a write buffer).
- Con: more complex to implement.
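A toy comparison of the two policies (the single-block cache and the write pattern below are assumptions for illustration): repeated writes to one block cost a memory write each under write-through, but only one write-back on eviction:

```python
# Toy sketch contrasting memory-write counts for the two policies.
# Assumptions: a one-block cache, a write-only access stream, no write buffer.
writes = ["A"] * 10 + ["B"]   # 10 writes to block A, then one write to block B

# Write-through: every write goes to memory as well as to the cache.
write_through_mem_writes = len(writes)   # 11

# Write-back: memory is updated only when a dirty block is replaced.
write_back_mem_writes = 0
cached, dirty = None, False
for blk in writes:
    if blk != cached:
        if dirty:
            write_back_mem_writes += 1   # write the modified block back on eviction
        cached, dirty = blk, False
    dirty = True                         # the write marks the cached block dirty

print(write_through_mem_writes, write_back_mem_writes)  # 11 1
```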
21 Cache Performance
- CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
- Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
- Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty
- Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls
- Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
- Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty
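Plugging assumed numbers into the last formula (the instruction count, misses per instruction, miss penalty, base CPI, and cycle time below are all made up for illustration):

```python
# Hedged sketch applying the stall-cycle formulas above; all inputs are assumptions.
instructions = 1_000_000
misses_per_instruction = 0.02   # combined misses per instruction
miss_penalty = 100              # cycles per miss
base_cpi = 1.0                  # CPI with a perfect cache
cycle_time_ns = 0.5             # clock cycle time in ns

stall_cycles = instructions * misses_per_instruction * miss_penalty
cpu_time_ns = (instructions * base_cpi + stall_cycles) * cycle_time_ns
print(stall_cycles, cpu_time_ns)  # 2000000.0 1500000.0
```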
22 The Example
Source: http://www.faculty.uaf.edu/ffdr/EE443/
(memory-stall cycles per instruction = 1.38, base CPI = 2)
23 What If ...?
- What if the processor is made faster, but the memory system stays the same?
- Speed up the machine by improving the CPI from 2 to 1 without increasing the clock rate.
- The system with a perfect cache would be 2.38 / 1 = 2.38 times faster.
- The fraction of time spent on memory stalls rises from 1.38 / 3.38 = 41% to 1.38 / 2.38 = 58%.
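The arithmetic behind these figures, using the 1.38 memory-stall cycles per instruction from the example:

```python
# Verifying the "what if" arithmetic from the slide.
stalls = 1.38                  # memory-stall cycles per instruction (from the example)
cpi_old, cpi_new = 2.0, 1.0

total_old = cpi_old + stalls   # 3.38 cycles per instruction
total_new = cpi_new + stalls   # 2.38 cycles per instruction
speedup_perfect = total_new / cpi_new   # vs. a perfect cache at CPI 1
frac_old = stalls / total_old           # stall fraction before the CPI improvement
frac_new = stalls / total_new           # stall fraction after
print(round(speedup_perfect, 2), round(frac_old, 2), round(frac_new, 2))  # 2.38 0.41 0.58
```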
24 What If ...?
25 Our Observations
- Relative cache penalties increase as a processor becomes faster.
- The lower the CPI, the more pronounced the impact of stall cycles.
- If the main memory system stays the same, a higher CPU clock rate leads to a larger miss penalty (measured in clock cycles).
26 Decreasing the Miss Ratio with Associative Caches
- Direct-mapped cache: a cache structure in which each memory location is mapped to exactly one location in the cache.
- Set-associative cache: a cache that has a fixed number of locations (at least two) where each block can be placed.
- Fully associative cache: a cache structure in which a block can be placed in any location in the cache.
27 The Example
- Direct-mapped (8 blocks): block 12 maps to (12 mod 8) = 4.
- Two-way set associative (4 sets): block 12 maps to set (12 mod 4) = 0.
- Fully associative: block 12 can appear in any of the eight cache blocks.
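The three placements can be computed directly (the 8-block, two-way configuration is assumed to match the example):

```python
# Sketch: candidate slots for a block in an 8-block cache under each scheme.
NUM_BLOCKS, ASSOCIATIVITY = 8, 2
num_sets = NUM_BLOCKS // ASSOCIATIVITY   # 4 sets for the two-way case

def placements(block, scheme):
    if scheme == "direct":
        return [block % NUM_BLOCKS]          # exactly one possible slot
    if scheme == "set-assoc":
        s = block % num_sets                 # any way within one set
        return [s * ASSOCIATIVITY + w for w in range(ASSOCIATIVITY)]
    return list(range(NUM_BLOCKS))           # fully associative: any slot

print(placements(12, "direct"))     # [4]
print(placements(12, "set-assoc"))  # [0, 1]  (set 0)
print(placements(12, "fully"))      # all of 0..7
```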
28 One More Example: Direct-Mapped
5 misses
29 Two-Way Set-Associative Cache
Which block should be replaced? The commonly used scheme is LRU.
Least recently used (LRU): a replacement scheme in which the block replaced is the one that has been unused for the longest time.
4 misses
30 The Implementation of a 4-Way Set-Associative Cache
31 Fully Associative Cache
3 misses
Increasing the degree of associativity → a decrease in miss rate.
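The 5/4/3 miss counts on these slides can be reproduced with a small LRU simulator; the block-address sequence 0, 8, 0, 6, 8 on a 4-block cache is assumed here, matching the standard textbook example these slides appear to follow:

```python
# LRU cache simulator: count misses for one access sequence at three associativities.
def count_misses(addresses, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set holds blocks in LRU order
    misses = 0
    for a in addresses:
        s = sets[a % num_sets]
        if a in s:
            s.remove(a)                    # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used block
        s.append(a)
    return misses

seq = [0, 8, 0, 6, 8]                      # assumed block-address sequence
print(count_misses(seq, 4, 1))  # direct-mapped: 5 misses
print(count_misses(seq, 4, 2))  # two-way set associative (LRU): 4 misses
print(count_misses(seq, 4, 4))  # fully associative (LRU): 3 misses
```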
32 Performance of a Multilevel Cache
(performance ratio: 11 / 4 ≈ 2.8)
33 Designing the Memory System to Support Caches (I)
- Consider hypothetical memory system parameters:
  - 1 memory bus clock cycle to send the address
  - 15 memory bus clock cycles to initiate each DRAM access
  - 1 memory bus clock cycle to transfer a word of data
  - a cache block of 4 words
  - a 1-word-wide bank of DRAMs
- The miss penalty is 1 + 4 × 15 + 4 × 1 = 65 clock cycles.
- Number of bytes transferred per clock cycle per miss: (4 × 4) / 65 ≈ 0.25
34 Designing the Memory System to Support Caches (II)
35 Virtual Memory
- The technique in which main memory acts as a "cache" for secondary storage.
- Automatically manages main memory and secondary storage.
- Motivation:
  - allow efficient sharing of memory among multiple programs
  - remove the programming burdens of a small, limited amount of main memory
36 Basic Concepts of Virtual Memory
Source: http://www.faculty.uaf.edu/ffdr/EE443/
- Virtual memory allows each program to exceed the size of primary memory.
- It automatically manages two levels of the memory hierarchy:
  - main memory (physical memory)
  - secondary storage
- Same concepts as in caches, different terminology:
  - a virtual memory block is called a page
  - a virtual memory miss is called a page fault
- The CPU produces a virtual address, which is translated to a physical address used to access main memory. This process, accomplished by a combination of hardware and software, is called memory mapping or address translation.
37 Mapping from a Virtual to a Physical Address
- Virtual address space: 2^32 = 4 GB
- Physical address space: 2^30 = 1 GB
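The translation itself is simple arithmetic on the page number; a minimal sketch with 4-KB pages (the page-table contents below are made up for illustration):

```python
# Hedged sketch of virtual-to-physical translation with 4-KB pages.
PAGE_SIZE = 4096                 # 2^12 bytes per page
page_table = {0: 5, 1: 9}        # virtual page number -> physical page number (illustrative)

def translate(virtual_address):
    vpn = virtual_address // PAGE_SIZE       # virtual page number
    offset = virtual_address % PAGE_SIZE     # the page offset is unchanged by translation
    if vpn not in page_table:
        raise LookupError("page fault")      # the OS would fetch the page from disk
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # vpn 1 -> ppn 9: 0x9234
```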
38 The High Cost of a Miss
- A page fault takes millions of cycles to process.
  - e.g., main memory is roughly 100,000 times faster than disk
- This time is dominated by the time it takes to get the first word for a typical page size.
- Key decisions:
  - make pages large enough to amortize the high access time
  - pick organizations that reduce the page fault rate (e.g., fully associative placement of pages)
  - handle page faults in software (the overhead is small compared to disk access times) and use clever algorithms for page placement
  - use write-back
39 Page Table
- Contains the virtual-to-physical address translations in a virtual memory system.
- Resides in memory.
- Indexed with the page number from the virtual address.
- Contains the corresponding physical page number.
- Each program has its own page table.
- The hardware includes a register pointing to the start of the page table (the page table register).
40 Page Table Size
- For example, consider:
  - 32-bit virtual addresses
  - 4-KB page size
  - 4 bytes per page table entry
- Number of page table entries: 2^32 / 2^12 = 2^20
- Size of the page table: 2^20 × 4 B = 4 MB
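Checking the arithmetic:

```python
# Checking the page-table-size arithmetic from the slide.
virtual_address_bits = 32
page_size = 4 * 1024          # 4 KB = 2^12 bytes
entry_bytes = 4               # bytes per page table entry

num_entries = 2**virtual_address_bits // page_size   # 2^20 entries
table_bytes = num_entries * entry_bytes              # 4 MB
print(num_entries, table_bytes // (1024 * 1024))     # 1048576 4
```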
41 Page Faults
- A page fault occurs when the valid bit (V) of a page table entry is found to be 0.
- Control is transferred to the operating system (using the exception mechanism).
- The operating system must find the requested page in the next level of the hierarchy and decide where to place it in main memory.
- Where is the page on disk?
  - The information can be found either in the same page table or in a separate structure.
- The OS creates the space on disk for all the pages of a process at the time it creates the process.
  - At the same time, it also creates a data structure that records the location of each page.
42 The Translation-Lookaside Buffer (TLB)
- Each memory access by a program would otherwise require two memory accesses:
  - one to obtain the physical address (by referencing the page table)
  - one to get the data
- Because of the spatial and temporal locality within each page, a translation for a virtual page will likely be needed again in the near future.
- To speed this process up, include a special cache that keeps track of recently used translations.
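A minimal sketch of such a cache sitting in front of the page table (the TLB size, page-table contents, and LRU replacement here are illustrative assumptions):

```python
# Hedged sketch: a tiny LRU TLB in front of a page table; all numbers are assumed.
from collections import OrderedDict

PAGE_SIZE = 4096
page_table = {vpn: vpn + 100 for vpn in range(16)}   # fake vpn -> ppn mapping
TLB_ENTRIES = 4
tlb = OrderedDict()                                  # vpn -> ppn, kept in LRU order

def tlb_translate(va):
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn in tlb:
        tlb.move_to_end(vpn)             # TLB hit: refresh its LRU position
        ppn = tlb[vpn]
    else:
        ppn = page_table[vpn]            # TLB miss: walk the page table in memory
        if len(tlb) == TLB_ENTRIES:
            tlb.popitem(last=False)      # evict the least recently used translation
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(tlb_translate(8192))  # vpn 2 -> ppn 102: 102 * 4096 = 417792
```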
43 The Translation-Lookaside Buffer (TLB)
44 Processing Read/Write Requests
45 Where Can a Block Be Placed?
1. An increase in the degree of associativity usually decreases the miss rate.
2. The improvement in miss rate comes from reduced competition for the same location.
46 How Is a Block Found?
47 Which Block Is Replaced on a Miss?
- Which blocks are candidates for replacement?
  - In a fully associative cache: all blocks are candidates.
  - In a set-associative cache: all the blocks in the set.
  - In a direct-mapped cache: there is only one candidate.
- In set-associative and fully associative caches, use one of two strategies:
  1. Random (use hardware assistance to make it fast)
  2. LRU (least recently used), which is usually too complicated even for four-way associativity
48 How Are Writes Handled?
- There are two basic options:
  - Write-through: the information is written both to the block in the cache and to the block in the lower level of the memory hierarchy.
  - Write-back: the modified block is written to the lower level only when it is replaced.
- Advantages of write-through:
  - misses are cheaper and simpler
  - easier to implement (although it usually requires a write buffer)
- Advantages of write-back:
  - the CPU can write at the rate that the cache can accept
  - writes can be combined
  - effective use of bandwidth (writing the entire block)
- Virtual memory is a special case: only write-back is practical.
49 The Big Picture
- 1. Where can a block be placed?
  - one place (direct-mapped)
  - a few places (set-associative)
  - any place (fully associative)
- 2. How is a block found?
  - indexing (direct-mapped)
  - limited search (set-associative)
  - full search (fully associative)
  - separate lookup table (page table)
- 3. Which block should be replaced on a cache miss?
  - random
  - LRU
- 4. What happens on a write?
  - write-through
  - write-back
50 The 3 Cs
- Compulsory misses: caused by the first access to a block that has never been in the cache (cold-start misses).
  - Remedy: increase the block size (this increases the miss penalty).
- Capacity misses: caused when the cache cannot contain all the blocks needed by the program, so blocks are replaced and later retrieved again.
  - Remedy: increase the cache size (access time increases as well).
- Conflict misses: occur when multiple blocks compete for the same set (collision misses).
  - Remedy: increase associativity (this may slow down access time).
51 The Design Challenges