Title: Cache - Basics
1. Cache - Basics

2. Computer System
- Instructions and data are stored in memory
- Processors access memory for:
  - Instruction fetch → access memory almost every cycle
  - Data load/store (about 20% of instructions) → access memory every 5th cycle
3. Random-Access Memory
- Static RAM (SRAM)
  - Each cell stores a bit with a six-transistor circuit
  - Retains its value indefinitely, as long as it is kept powered
  - Faster and more expensive than DRAM
- Dynamic RAM (DRAM)
  - Each cell stores a bit with a capacitor and a transistor
  - Value must be refreshed every 10-100 ms
  - Slower and cheaper than SRAM
4. CPU-Memory Performance Gap
- Processor-memory performance gap
  - Grows about 50% per year
  - No cache before 1980; 2-level caches since 1995

5. CPU-Memory Performance Gap (contd)
- The gap between DRAM, disk, and CPU speeds keeps increasing
6. Memory Hierarchy
- Cache: small, fast storage
  - Improves average access time to slow memory
  - Exploits spatial and temporal locality
- Caches beyond the memory hierarchy
  - TLB: cache of page-table entries
  - Branch predictor: cache of prediction information
7Memory Hierarchies
- Some fundamental and enduring properties of
hardware and software - Fast storage technologies cost more per byte and
have less capacity - The gap between CPU and main memory speed is
widening - Well-written programs tend to exhibit good
locality - They suggest an approach for organizing memory
and storage systems known as a memory hierarchy
8. An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) toward the top; larger, slower, and cheaper (per byte) toward the bottom:
- L0: CPU registers (hold words retrieved from the L1 cache)
- L1: on-chip L1 cache (SRAM)
- L2: on-chip L2 cache (SRAM)
- L3: main memory (DRAM)
- L4: local secondary storage (local magnetic disks)
- L5: remote secondary storage (distributed file systems, Web servers)
9. Hierarchy Works
- How the hierarchy works
  - Place a copy of frequently accessed data at the higher levels of the hierarchy
  - CPUs search for the highest copy of the data to be accessed
- Principle of locality
  - A program accesses a small portion of the address space in any given time period
  - 90/10 rule: 90% of the accesses are to 10% of the memory locations
- Users want large and fast memories!

  Technology   Access time (ns)   Cost per GB (2004)
  SRAM         0.5-5              $4,000-$10,000
  DRAM         50-70              $100-$200
  Disk         5M-20M             $0.50-$2
10. Locality
- Principle of Locality
  - Temporal locality
    - Recently referenced items are likely to be referenced in the near future.
  - Spatial locality
    - Items with nearby addresses tend to be referenced close together in time.
- Locality Example
  - Data
    - Referencing array elements in succession: spatial locality
    - Referencing sum each iteration: temporal locality
  - Instructions
    - Referencing instructions in sequence: spatial locality
    - Cycling through the loop repeatedly: temporal locality

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
11. Cache Terms
- Our initial focus: two levels (upper → cache, lower → memory)
- Block (or line): minimum unit of data
- Hit: data requested is in the upper level
- Miss: data requested is not in the upper level
- Hit rate = (number of accesses found in the upper level (cache)) / (number of accesses)
- Hit time = SRAM access time + time to determine hit/miss
- Miss rate = 1 - hit rate
- Miss penalty = time to fetch a block from the lower level (memory)
- Performance
  - Average access time = hit time + miss rate × miss penalty
12. Caching in a Memory Hierarchy
- Access addresses of blocks 4 and 10
- (Figure: the lower level holds blocks 0-15; copies of blocks 4 and 10 are placed in the upper level)
13. General Caching Concepts
- A program needs an object, which is stored in some blocks, e.g., 14 and 12
- Cache hit
  - Found at level k (e.g., block 14)
- Cache miss
  - Not found at level k, so fetched from level k+1 (e.g., block 12)
  - If level k is full, a victim block must be replaced (evicted) (e.g., block 4)
  - If the victim is clean (not modified), just replace it
  - If the victim is dirty (modified and different from the copy in level k+1), update the lower level first
- (Figure: requests for blocks 14 and 12; level k holds a few blocks while level k+1 holds blocks 0-15; the request for 14 hits, and the request for 12 misses and evicts block 4)
14. Locating Data Items
- How do we know if a data item is in the cache?
- Direct mapped
  - A block can go in exactly one place in the cache
  - Cache index = (block address) modulo (number of blocks in the cache)
15. Matching Address
- How do we know if the data in the cache corresponds to a requested word?
- Tag matching
  - A set of tags is stored in the cache along with the data items
  - Some of the upper bits of the address are used as the tag
- Address fields: [ Tag | Index | Block offset ], where the tag and index together form the block address
16. Validating Data Items
- How do we know that a cache block holds a valid data item?
- Add a valid bit to each cache block entry
  - If the valid bit is 0, there is no match
  - (i.e., the information in the tag and data block is invalid)
17. Cache Example (1)
- 8-word direct-mapped cache

18. Cache Example (2)

19. Cache Example (3)

20. Cache Example (4)

21. Cache Example (5)

22. Cache Example (6)
- A block is replaced if a newly accessed block is mapped onto the same location
23. Actions on Write
- Write through
  - Data is written to both the block in the current level (cache) and the block in the lower-level memory
  - Needs write buffers so the processor does not wait for the lower-level write transaction to complete
  - May result in repeated writes to the same location
- Write back
  - Data is written only to the block in the current level (cache)
  - A modified cache block is written to the lower-level memory when it is replaced (needs a dirty bit per cache block)
  - May result in writes on read misses

24. Actions on Write Misses
- Write allocate (fetch on miss)
  - Allocate an entry in the cache and fetch the data for the write miss
- No allocate (write around)
  - Without allocating an entry, update the lower level of the memory hierarchy
  Step on a write miss      Write through +   Write through +   Write back +
                            write allocate    no allocate       write allocate
  1 Pick replacement        yes               -                 yes
  2 Write back if dirty     -                 -                 yes
  3 Fetch block             yes               -                 yes
  4 Write cache             yes               -                 yes
  5 Write lower level       yes               yes               -
25. Direct-Mapped Cache
- Block size: 4 bytes (1 word)
- Address
  - Block offset: 2 bits
  - Index: 10 bits
  - Tag: 20 bits
- Total size
  - 2^10 × (1 + 20 + 32) bits (valid + tag + data per entry)
  - = 53 Kbits
- Locality exploited?
26. Spatial Locality
- Block size needs to be more than one word

27. 4-Word-Long Block Size
28. Memory Bandwidth
- Bandwidth: amount of data transferred per unit time
- Access 4 words (16 bytes), assuming 1 cycle to send the address, 15 cycles per DRAM access, and 1 cycle per bus transfer:
  - (A) 1-word-wide memory bus: 1 + 4×15 + 4×1 = 65 cycles
  - (B) 2-word-wide memory bus: 1 + 2×15 + 2×1 = 33 cycles
  - (C) 4-word-wide memory bus: 1 + 15 + 1 = 17 cycles
  - (D) 1-word-wide bus with multiple banks: 1 + 15 + 4×1 = 20 cycles
- Data interleaving on multiple banks achieves a high-bandwidth memory system with a narrow bus
29. Summary
- The processor-memory speed gap keeps increasing
- The memory hierarchy works
- Cache
  - Small SRAM storage for fast access from the processor
  - Performance: average access time = hit time + miss rate × miss penalty
- Locality exploited
  - Keep recently accessed data (temporal locality)
  - Bring data in by the block, which is larger than a word (spatial locality)
- Cache mechanism
  - Block placement (mapping), tag matching, valid bit
  - Actions on writes and write misses