Title: Chapter 7: Large and Fast: Exploiting Memory Hierarchy
1 Chapter 7: Large and Fast: Exploiting Memory Hierarchy
2 Principle of Locality
- Programs access a relatively small portion of their address space at any given time.
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
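Both kinds of locality appear in even the smallest programs; the array traversal below (an illustrative sketch, not taken from the slides) exhibits both:

```python
# Illustrative sketch: a sequential array sum exhibits both kinds of locality.
data = list(range(1024))

total = 0
for i in range(len(data)):
    total += data[i]   # data[i], data[i+1], ... live at adjacent addresses -> spatial locality
                       # 'total' and 'i' are touched on every iteration      -> temporal locality
```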
3 Basic Structure
4 The Principle
- Combine two concepts: locality and hierarchy.
- Temporal locality → keep the most recently accessed data items closer to the processor.
- Spatial locality → move blocks consisting of multiple contiguous words to the upper levels of the hierarchy.
5 Memory Hierarchy (I)
6 Memory Hierarchy (II)
- Data is copied between adjacent levels.
- The minimum unit of information copied is a block.
- If the requested data appears in some block in the upper level, this is called a hit; otherwise it is a miss, and a block containing the requested data is copied from a lower level.
- The hit rate, or hit ratio, is the fraction of memory accesses found in the upper level. The miss rate (1.0 - hit rate) is the fraction not found in the upper level.
- Hit time: the time to access the upper level, including the time to determine whether the access is a hit or a miss.
- Miss penalty: the time to replace a block in the upper level.
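These definitions combine into the standard average-memory-access-time calculation; a minimal sketch, with the hit time, miss rate, and miss penalty assumed for illustration:

```python
# Hedged sketch: average memory access time (AMAT) from the definitions above.
# The three numbers are made-up assumptions, not from the slides.
hit_time = 1        # cycles to access the upper level (includes hit/miss check)
miss_rate = 0.05    # fraction of accesses not found in the upper level
miss_penalty = 100  # cycles to bring the block in from the lower level

amat = hit_time + miss_rate * miss_penalty
print(amat)  # 6.0
```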
7 Memory Hierarchy (II)
8 Moore's Law
9 Cache
- A safe place for hiding or storing things.
- The level of the memory hierarchy between the processor and main memory.
- Refers to any storage managed to take advantage of locality of access.
- Motivation:
  - high processor cycle speed
  - low memory cycle speed
  - fast access to recently used portions of a program's code and data
10 The Basic Cache Concept
1. The CPU requests data item Xn.
2. The request results in a miss.
3. The word Xn is brought from memory into the cache.
11 Direct-Mapped Cache
- Each memory location is mapped to exactly one location in the cache.
- Cache index = (address of the block) modulo (number of blocks in the cache).
- Two crucial questions must be answered:
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
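The modulo mapping can be sketched in a few lines (the 8-block cache size here is an assumption for illustration):

```python
# Minimal sketch of direct-mapped lookup arithmetic, in block addresses.
NUM_BLOCKS = 8  # assumed cache size in blocks; must be a power of two

def cache_index(block_address):
    # "address of the block modulo number of blocks in the cache"
    return block_address % NUM_BLOCKS

def tag(block_address):
    # The tag is the part of the address above the index bits; comparing it
    # against the stored tag answers "is the data item in the cache?"
    return block_address // NUM_BLOCKS

print(cache_index(12), tag(12))  # 4 1
```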
12 An Example of a Direct-Mapped Cache
13 (figure-only slide)
14 Cache Contents
- Tag
  - identifies whether a word in the cache corresponds to the requested word
- Valid bit
  - indicates whether an entry contains a valid address
- Data

For a direct-mapped cache with 2^n one-word blocks and 32-bit addresses:
Tag size = 32 - n - 2 (e.g., 32 - 10 - 2 = 20 bits for n = 10)
Total size = 2^index × (valid + tag + data) = 2^n × (1 + (32 - n - 2) + 32)
15 Direct-Mapped Example
How many total bits are required for a direct-mapped cache with:
- 16 KB of data
- 4-word blocks
- 32-bit addresses

16 KB = 4K words = 2^12 words; with 4-word blocks there are 2^10 blocks, so the index is n = 10 bits and the block offset is m = 2 bits.
Tag size = 32 - 10 - 2 - 2 = 18 bits (the final 2 bits are the byte offset).
Data per block = 4 × 4 × 8 = 128 bits.
Total bits = 2^10 × (1 + 18 + 128) = 2^10 × 147 = 147 Kbits.
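The arithmetic above can be checked mechanically (a sketch assuming the same 32-bit addresses and 4-byte words):

```python
# Hedged check of the slide-15 arithmetic (32-bit addresses assumed).
data_kb = 16                 # 16 KB of cache data
words_per_block = 4
word_bits = 32

num_words = data_kb * 1024 // 4             # 4 bytes per word -> 4096 words
num_blocks = num_words // words_per_block   # 1024 blocks
n = num_blocks.bit_length() - 1             # index bits: 10
m = words_per_block.bit_length() - 1        # block-offset bits: 2
tag_bits = 32 - n - m - 2                   # 18 (the last 2 bits are the byte offset)
total_bits = num_blocks * (1 + tag_bits + words_per_block * word_bits)
print(n, m, tag_bits, total_bits)           # 10 2 18 150528 (= 147 Kbits)
```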
16 Mapping an Address to a Cache Block
Source: http://www.faculty.uaf.edu/ffdr/EE443/
17 Block Size vs. Miss Rate
18 Handling Cache Misses
- Stall the entire pipeline while fetching the requested word.
- Steps to handle an instruction cache miss:
  1. Send the original PC value (current PC - 4) to the memory.
  2. Instruct main memory to perform a read and wait for the memory to complete its access.
  3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on.
  4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.
19 Write-Through
- A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two.
- Write buffer
  - a queue that holds data while the data are waiting to be written to memory.
20 Write-Back
- A scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.
- Pro: improves performance, especially when writes are frequent (and could not be absorbed by a write buffer).
- Con: more complex to implement.
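A toy comparison of the two policies (the single-block cache and the write pattern below are assumptions for illustration): repeated writes to one block cost a memory write each under write-through, but only one write-back on eviction:

```python
# Toy sketch contrasting memory-write counts for the two policies.
# Assumptions: a one-block cache, a write-only access stream, no write buffer.
writes = ["A"] * 10 + ["B"]   # 10 writes to block A, then one write to block B

# Write-through: every write goes to memory as well as to the cache.
write_through_mem_writes = len(writes)   # 11

# Write-back: memory is updated only when a dirty block is replaced.
write_back_mem_writes = 0
cached, dirty = None, False
for blk in writes:
    if blk != cached:
        if dirty:
            write_back_mem_writes += 1   # write the modified block back on eviction
        cached, dirty = blk, False
    dirty = True                         # the write marks the cached block dirty

print(write_through_mem_writes, write_back_mem_writes)  # 11 1
```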
21 Cache Performance
- CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
- Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
- Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty
- Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls
- Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
- Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty
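Plugging assumed numbers into the last formula (the instruction count, misses per instruction, miss penalty, base CPI, and cycle time below are all made up for illustration):

```python
# Hedged sketch applying the stall-cycle formulas above; all inputs are assumptions.
instructions = 1_000_000
misses_per_instruction = 0.02   # combined misses per instruction
miss_penalty = 100              # cycles per miss
base_cpi = 1.0                  # CPI with a perfect cache
cycle_time_ns = 0.5             # clock cycle time in ns

stall_cycles = instructions * misses_per_instruction * miss_penalty
cpu_time_ns = (instructions * base_cpi + stall_cycles) * cycle_time_ns
print(stall_cycles, cpu_time_ns)  # 2000000.0 1500000.0
```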
22 The Example
Source: http://www.faculty.uaf.edu/ffdr/EE443/
(memory-stall cycles per instruction = 1.38, base CPI = 2)
23 What If ...?
- What if the processor is made faster, but the memory system stays the same?
- Speed up the machine by improving the CPI from 2 to 1 without increasing the clock rate.
- The system with a perfect cache would be 2.38 / 1 = 2.38 times faster.
- The fraction of time spent on memory stalls rises from 1.38 / 3.38 = 41% to 1.38 / 2.38 = 58%.
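The arithmetic behind these figures, using the 1.38 memory-stall cycles per instruction from the example:

```python
# Verifying the "what if" arithmetic from the slide.
stalls = 1.38                  # memory-stall cycles per instruction (from the example)
cpi_old, cpi_new = 2.0, 1.0

total_old = cpi_old + stalls   # 3.38 cycles per instruction
total_new = cpi_new + stalls   # 2.38 cycles per instruction
speedup_perfect = total_new / cpi_new   # vs. a perfect cache at CPI 1
frac_old = stalls / total_old           # stall fraction before the CPI improvement
frac_new = stalls / total_new           # stall fraction after
print(round(speedup_perfect, 2), round(frac_old, 2), round(frac_new, 2))  # 2.38 0.41 0.58
```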
24 What If ...?
25 Our Observations
- Relative cache penalties increase as a processor becomes faster.
- The lower the CPI, the more pronounced the impact of stall cycles.
- If the main memory system stays the same, a higher CPU clock rate leads to a larger miss penalty (measured in clock cycles).
26 Decreasing the Miss Ratio with Associative Caches
- Direct-mapped cache: a cache structure in which each memory location is mapped to exactly one location in the cache.
- Set-associative cache: a cache that has a fixed number of locations (at least two) where each block can be placed.
- Fully associative cache: a cache structure in which a block can be placed in any location in the cache.
27 The Example
- Direct-mapped (8 blocks): block 12 maps to (12 mod 8) = 4.
- Two-way set associative (4 sets): block 12 maps to set (12 mod 4) = 0.
- Fully associative: block 12 can appear in any of the eight cache blocks.
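The three placements can be computed directly (the 8-block, two-way configuration is assumed to match the example):

```python
# Sketch: candidate slots for a block in an 8-block cache under each scheme.
NUM_BLOCKS, ASSOCIATIVITY = 8, 2
num_sets = NUM_BLOCKS // ASSOCIATIVITY   # 4 sets for the two-way case

def placements(block, scheme):
    if scheme == "direct":
        return [block % NUM_BLOCKS]          # exactly one possible slot
    if scheme == "set-assoc":
        s = block % num_sets                 # any way within one set
        return [s * ASSOCIATIVITY + w for w in range(ASSOCIATIVITY)]
    return list(range(NUM_BLOCKS))           # fully associative: any slot

print(placements(12, "direct"))     # [4]
print(placements(12, "set-assoc"))  # [0, 1]  (set 0)
print(placements(12, "fully"))      # all of 0..7
```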
28 One More Example: Direct-Mapped
5 misses
29 Two-Way Set-Associative Cache
Which block should be replaced? The commonly used scheme is LRU.
Least recently used (LRU): a replacement scheme in which the block replaced is the one that has been unused for the longest time.
4 misses
30 The Implementation of a 4-Way Set-Associative Cache
31 Fully Associative Cache
3 misses
Increasing the degree of associativity → a decrease in miss rate.
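The 5/4/3 miss counts on these slides can be reproduced with a small LRU simulator; the block-address sequence 0, 8, 0, 6, 8 on a 4-block cache is assumed here, matching the standard textbook example these slides appear to follow:

```python
# LRU cache simulator: count misses for one access sequence at three associativities.
def count_misses(addresses, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set holds blocks in LRU order
    misses = 0
    for a in addresses:
        s = sets[a % num_sets]
        if a in s:
            s.remove(a)                    # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used block
        s.append(a)
    return misses

seq = [0, 8, 0, 6, 8]                      # assumed block-address sequence
print(count_misses(seq, 4, 1))  # direct-mapped: 5 misses
print(count_misses(seq, 4, 2))  # two-way set associative (LRU): 4 misses
print(count_misses(seq, 4, 4))  # fully associative (LRU): 3 misses
```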
32 Performance of a Multilevel Cache
(performance ratio: 11 / 4 ≈ 2.8)
33 Designing the Memory System to Support Caches (I)
- Consider hypothetical memory system parameters:
  - 1 memory bus clock cycle to send the address
  - 15 memory bus clock cycles to initiate each DRAM access
  - 1 memory bus clock cycle to transfer a word of data
  - a cache block of 4 words
  - a 1-word-wide bank of DRAMs
- The miss penalty is 1 + 4 × 15 + 4 × 1 = 65 clock cycles.
- Number of bytes transferred per clock cycle per miss: (4 × 4) / 65 ≈ 0.25
34 Designing the Memory System to Support Caches (II)
35 Virtual Memory
- The technique in which main memory acts as a "cache" for secondary storage.
- Automatically manages main memory and secondary storage.
- Motivation:
  - allow efficient sharing of memory among multiple programs
  - remove the programming burdens of a small, limited amount of main memory
36 Basic Concepts of Virtual Memory
Source: http://www.faculty.uaf.edu/ffdr/EE443/
- Virtual memory allows each program to exceed the size of primary memory.
- It automatically manages two levels of the memory hierarchy:
  - main memory (physical memory)
  - secondary storage
- Same concepts as in caches, different terminology:
  - a virtual memory block is called a page
  - a virtual memory miss is called a page fault
- The CPU produces a virtual address, which is translated to a physical address used to access main memory. This process, accomplished by a combination of hardware and software, is called memory mapping or address translation.
37 Mapping from a Virtual to a Physical Address
- Virtual address space: 2^32 = 4 GB
- Physical address space: 2^30 = 1 GB
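The translation itself is simple arithmetic on the page number; a minimal sketch with 4-KB pages (the page-table contents below are made up for illustration):

```python
# Hedged sketch of virtual-to-physical translation with 4-KB pages.
PAGE_SIZE = 4096                 # 2^12 bytes per page
page_table = {0: 5, 1: 9}        # virtual page number -> physical page number (illustrative)

def translate(virtual_address):
    vpn = virtual_address // PAGE_SIZE       # virtual page number
    offset = virtual_address % PAGE_SIZE     # the page offset is unchanged by translation
    if vpn not in page_table:
        raise LookupError("page fault")      # the OS would fetch the page from disk
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # vpn 1 -> ppn 9: 0x9234
```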
38 The High Cost of a Miss
- A page fault takes millions of cycles to process.
  - e.g., main memory is roughly 100,000 times faster than disk
- This time is dominated by the time it takes to get the first word for a typical page size.
- Key decisions:
  - make pages large enough to amortize the high access time
  - pick organizations that reduce the page fault rate (e.g., fully associative placement of pages)
  - handle page faults in software (the overhead is small compared to disk access times) and use clever algorithms for page placement
  - use write-back
39 Page Table
- Contains the virtual-to-physical address translations in a virtual memory system.
- Resides in memory.
- Indexed with the page number from the virtual address.
- Contains the corresponding physical page number.
- Each program has its own page table.
- The hardware includes a register pointing to the start of the page table (the page table register).
40 Page Table Size
- For example, consider:
  - 32-bit virtual addresses
  - 4-KB page size
  - 4 bytes per page table entry
- Number of page table entries: 2^32 / 2^12 = 2^20
- Size of the page table: 2^20 × 4 B = 4 MB
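Checking the arithmetic:

```python
# Checking the page-table-size arithmetic from the slide.
virtual_address_bits = 32
page_size = 4 * 1024          # 4 KB = 2^12 bytes
entry_bytes = 4               # bytes per page table entry

num_entries = 2**virtual_address_bits // page_size   # 2^20 entries
table_bytes = num_entries * entry_bytes              # 4 MB
print(num_entries, table_bytes // (1024 * 1024))     # 1048576 4
```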
41 Page Faults
- A page fault occurs when the valid bit (V) of a page table entry is found to be 0.
- Control is transferred to the operating system (using the exception mechanism).
- The operating system must find the requested page in the next level of the hierarchy and decide where to place it in main memory.
- Where is the page on disk?
  - The information can be found either in the same page table or in a separate structure.
- The OS creates the space on disk for all the pages of a process at the time it creates the process.
  - At the same time, it also creates a data structure that records the location of each page.
42 The Translation-Lookaside Buffer (TLB)
- Each memory access by a program would otherwise require two memory accesses:
  - one to obtain the physical address (by referencing the page table)
  - one to get the data
- Because of the spatial and temporal locality within each page, a translation for a virtual page will likely be needed again in the near future.
- To speed this process up, include a special cache that keeps track of recently used translations.
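A minimal sketch of such a cache sitting in front of the page table (the TLB size, page-table contents, and LRU replacement here are illustrative assumptions):

```python
# Hedged sketch: a tiny LRU TLB in front of a page table; all numbers are assumed.
from collections import OrderedDict

PAGE_SIZE = 4096
page_table = {vpn: vpn + 100 for vpn in range(16)}   # fake vpn -> ppn mapping
TLB_ENTRIES = 4
tlb = OrderedDict()                                  # vpn -> ppn, kept in LRU order

def tlb_translate(va):
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn in tlb:
        tlb.move_to_end(vpn)             # TLB hit: refresh its LRU position
        ppn = tlb[vpn]
    else:
        ppn = page_table[vpn]            # TLB miss: walk the page table in memory
        if len(tlb) == TLB_ENTRIES:
            tlb.popitem(last=False)      # evict the least recently used translation
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(tlb_translate(8192))  # vpn 2 -> ppn 102: 102 * 4096 = 417792
```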
43 The Translation-Lookaside Buffer (TLB)
44 Processing Read/Write Requests
45 Where Can a Block Be Placed?
1. An increase in the degree of associativity usually decreases the miss rate.
2. The improvement in miss rate comes from reduced competition for the same location.
46 How Is a Block Found?
47 Which Block Is Replaced on a Miss?
- Which blocks are candidates for replacement?
  - In a fully associative cache: all blocks are candidates.
  - In a set-associative cache: all the blocks in the set.
  - In a direct-mapped cache: there is only one candidate.
- In set-associative and fully associative caches, use one of two strategies:
  1. Random (use hardware assistance to make it fast)
  2. LRU (least recently used), which is usually too complicated even for four-way associativity
48 How Are Writes Handled?
- There are two basic options:
  - Write-through: the information is written both to the block in the cache and to the block in the lower level of the memory hierarchy.
  - Write-back: the modified block is written to the lower level only when it is replaced.
- Advantages of write-through:
  - misses are cheaper and simpler
  - easier to implement (although it usually requires a write buffer)
- Advantages of write-back:
  - the CPU can write at the rate that the cache can accept
  - writes can be combined
  - effective use of bandwidth (writing the entire block)
- Virtual memory is a special case: only write-back is practical.
49 The Big Picture
- 1. Where can a block be placed?
  - one place (direct-mapped)
  - a few places (set-associative)
  - any place (fully associative)
- 2. How is a block found?
  - indexing (direct-mapped)
  - limited search (set-associative)
  - full search (fully associative)
  - separate lookup table (page table)
- 3. Which block should be replaced on a cache miss?
  - random
  - LRU
- 4. What happens on a write?
  - write-through
  - write-back
50 The 3 Cs
- Compulsory misses: caused by the first access to a block that has never been in the cache (cold-start misses).
  - Remedy: increase the block size (this increases the miss penalty).
- Capacity misses: caused when the cache cannot contain all the blocks needed by the program, so blocks are replaced and later retrieved again.
  - Remedy: increase the cache size (access time increases as well).
- Conflict misses: occur when multiple blocks compete for the same set (collision misses).
  - Remedy: increase associativity (this may slow down access time).
51 The Design Challenges