Title: Internal Memory
1Computer Organization and Architecture
Chapter 4 Cache Memory
2Topics
- Computer Memory System Overview
- Memory Hierarchy
- Cache Memory Principles
- Elements of Cache Design
- Pentium and PowerPC Cache
3Computer Memory System Overview
- Characteristics of memory systems
- Location
- Capacity
- Unit of transfer
- Access method
- Performance
- Physical type
- Physical characteristics
- Organization
4Location
- CPU registers, control memory
- Internal main memory
- External secondary memory
5Capacity
- In terms of words or bytes
- 1 byte = 8 bits
- Word size
- The natural unit of organization
- Sizes of 8, 16, and 32 bits are common, even 64 bits
6Unit of Transfer
- Number of data elements transferred at a time
- Internal
- Usually governed by data bus width
- External
- Usually a block which is much larger than a word
7Addressable Unit
- Smallest location which can be uniquely addressed
- Word or byte
- E.g., Motorola 68000
- word = 16 bits
- internal transfer unit = 16 bits
- addressable unit = 8 bits (byte addressable)
- Let $A$ = address length in bits
- $N$ = number of addressable units
- ⇒ $2^A = N$
8Access Methods (1)
- Sequential
- Data does not have a unique address
- Start at the beginning and read through in order
- Must read intermediate data items until the desired item is found
- Access time depends on location of data and previous location
- e.g. tape
9Access Methods (2)
- Direct
- Individual blocks have unique addresses
- Access is by jumping to vicinity plus sequential search
- Access time depends on location and previous location
- e.g. disk
10Access Methods (3)
- Random
- Individual addresses identify locations exactly
- Location can be selected randomly and addressed and accessed directly
- Access time is independent of location or previous access (i.e., constant)
- e.g. RAM
11Access Methods (4)
- Associative
- A variation of random access
- Data is located by a comparison with contents of a portion of the store
- All words are searched simultaneously
- Access time is independent of location or previous access
- e.g. cache
12Performance (1)
- Access time
- Time between presenting the address and getting the valid data
- For random access memory: time to address data unit and perform transfer
- For non-random access memory: time to position hardware mechanism at the desired position
- Memory Cycle time
- Primarily applied to random access memory
- Time may be required for the memory to recover before next access
- Cycle time = access time + recovery time
13Performance (2)
- Transfer rate $R$ (bps)
- Rate at which data can be transferred in/out of memory
- For random access memory, $R = 1 / (\text{memory cycle time})$
- For non-random access memory, $T_N = T_A + N/R$, where
- $T_N$ = average time to read/write $N$ bits
- $T_A$ = average access time
- $N$ = number of bits
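As a rough illustration of the two formulas on this slide, the Python sketch below computes a transfer rate from a cycle time and the time to read N bits from a non-random-access device. The function names and all sample numbers are assumptions chosen for illustration; none of them come from the slides.

```python
def transfer_rate_bps(cycle_time_s: float) -> float:
    """R = 1 / (memory cycle time), for random access memory."""
    return 1.0 / cycle_time_s


def time_to_transfer(access_time_s: float, n_bits: int, rate_bps: float) -> float:
    """T_N = T_A + N / R, for non-random access memory."""
    return access_time_s + n_bits / rate_bps


# Hypothetical numbers, chosen only for illustration.
rate = transfer_rate_bps(100e-9)                      # 100 ns cycle time
t_n = time_to_transfer(0.010, 8_000_000, 80_000_000)  # 10 ms access, 8 Mbit at 80 Mbps
print(f"R = {rate:.3g} bps, T_N = {t_n:.3f} s")
```

The second call shows why the access time term matters less as N grows: the N/R term dominates for large transfers.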
14Physical Types
- Semiconductor
- RAM
- Magnetic
- Disk & Tape
- Optical
- CD & DVD
15Physical Characteristics
- Decay
- Volatility
- Erasability
- Power consumption
16Organization
- Physical arrangement of bits into words
- Not always obvious
- e.g. interleaved
17The Bottom Line
- How much?
- Capacity
- How fast?
- Time is money
- How expensive?
- Cost/bit
18Memory Hierarchy (1)
- Major design objective of memory systems
- Provision of adequate storage capacity at
- an acceptable level of performance
- a reasonable cost
- Memory technologies
- Smaller access time ⇒ greater cost/bit
- Greater capacity ⇒ smaller cost/bit
- Greater capacity ⇒ greater access time
- ⇒ DILEMMA
- ⇒ Solution: MEMORY HIERARCHY
20Memory Hierarchy (2)
- If
- Memory organized according to A) - C)
- Data and instruction distributed according to D)
- then
- Overall cost reduced
- Level of performance maintained
- How can we validate D)?
21Locality of Reference (1)
- Basis for validity of D)
- During the course of the execution of a program, memory references tend to cluster
- Examples?
- Over a long period of time, the clusters in use migrate from one locality to another
- Over a short period of time, fixed clusters are used primarily
- Current locality kept in high speed memory
- ⇒ average access time reduced
22Locality of Reference (2)
- Spatial locality
- Tendency of execution to involve a number of memory locations that are clustered
- E.g., sequential instruction access, subroutines, arrays, tables
- Temporal locality
- Tendency to access memory locations that have been used recently
- E.g., iteration loops
23Typical Memory Hierarchy
- Registers
- In CPU
- Internal or Main memory
- May include one or more levels of cache
- RAM
- External memory
- Backing store
24Hierarchy List
- Registers
- L1 Cache
- L2 Cache
- Main memory
- Disk cache
- Disk
- Optical
- Tape
25Performance example (1)
- Assume 2-level memory system
- Level 1 access time $T_1$
- Level 2 access time $T_2$
- Hit ratio, $H$ = fraction of time a reference can be found in level 1
- Average access time, $T_{ave}$
- $T_{ave} = \text{prob(found in level 1)} \times T(\text{found in level 1}) + \text{prob(not found in level 1)} \times T(\text{not found in level 1})$
- $= H \times T_1 + (1 - H) \times (T_1 + T_2)$
- $= T_1 + (1 - H) T_2$
26Performance example (2)
- Assume 2-level memory system
- Level 1 access time $T_1 = 1\ \mu s$
- Level 2 access time $T_2 = 10\ \mu s$
- Hit ratio, $H = 95\%$
- Average access time,
- $T_{ave} = H \times T_1 + (1 - H) \times (T_1 + T_2) = 0.95 \times 1 + (1 - 0.95) \times (1 + 10) = 0.95 + 0.05 \times 11 = 1.5\ \mu s$
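The calculation above can be checked with a short Python sketch. The function name and the extra hit ratios in the loop are assumptions added for illustration; only the H = 0.95, T1 = 1 μs, T2 = 10 μs case comes from the slide.

```python
def average_access_time(hit_ratio: float, t1: float, t2: float) -> float:
    """T_ave = H * T1 + (1 - H) * (T1 + T2) for a 2-level memory system."""
    return hit_ratio * t1 + (1.0 - hit_ratio) * (t1 + t2)


# T1 = 1 us and T2 = 10 us as on the slide; the other hit ratios are illustrative.
for h in (0.50, 0.90, 0.95, 0.99):
    print(f"H = {h:.2f}  T_ave = {average_access_time(h, 1.0, 10.0):.2f} us")
```

Sweeping the hit ratio this way shows the average access time approaching T1 as H approaches 1.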
27Performance example (3)
Higher hit ratio ⇒ better performance
28So you want Speed?
- It is possible to build a computer which uses only static RAM (technique for cache)
- This would be very fast
- This would need no cache
- How can you cache cache?
- This would cost a very large amount
- Stick with memory hierarchy!
29Cache Memory Principles
- Objective
- High speed
- Large memory size
- Less expensive memory system
- Cache
- Small amount of fast memory
- Sits between normal main memory and CPU
- May be located on CPU chip or module
30Cache and Main Memory
- Cache contains a copy of portions of main memory
- Cache: smaller, faster; main memory: larger, slower
31Cache operation - overview
- Consider READ operation
- CPU requests contents of memory location
- Check cache for this data
- If present, get from cache (fast)
- If not present, read required block from main memory to cache
- Then deliver from cache to CPU
- Q: Why deliver a whole block into cache?
- Cache includes tags to identify which block of main memory is in each cache slot
32Typical Cache Organization
33Cache/Main-Memory Structure
- Memory
- $2^n$ addressable words
- each word has a unique $n$-bit address
- $M$ fixed length blocks of $K$ words each ⇒ $M = 2^n / K$
- Cache
- $C$ slots (lines) of $K$ words each
- $C \ll M$
34Cache/Main-Memory Structure
- At any time, some subset of blocks resides in lines
- As $C \ll M$, each line includes a tag indicating which block is being stored
- tag is a portion of an address
36Elements of Cache Design
- Size
- Mapping Function
- Replacement Algorithm
- Write Policy
- Block Size
- Number of Caches
37Size does matter
- Usually 1K - 512K
- Cost
- More cache is more expensive
- Speed
- More cache is faster (up to a point)
- Checking cache for data takes time
38Mapping Function
- Algorithms for mapping main memory blocks to cache lines
- Needed, as $C \ll M$
- Approaches
- Direct
- Associative
- Set associative
39Mapping Function Example
- Cache of 64KByte
- Cache block of 4 bytes
- i.e. cache is 16K ($2^{14}$) lines of 4 bytes (why?)
- 16MBytes main memory, byte addressable
- 24 bit address
- ($2^{24}$ = 16M)
- 4M blocks
- $C$ = 16K, $M$ = 4M, $C \ll M$
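The figures in this example follow directly from the three given sizes. A minimal Python sketch of the arithmetic (the constant and variable names are assumptions made for illustration):

```python
CACHE_BYTES = 64 * 1024            # 64 KByte cache
BLOCK_BYTES = 4                    # 4 byte cache block
MEMORY_BYTES = 16 * 1024 * 1024    # 16 MByte main memory, byte addressable

num_lines = CACHE_BYTES // BLOCK_BYTES          # C: lines in the cache
num_blocks = MEMORY_BYTES // BLOCK_BYTES        # M: blocks in main memory
address_bits = (MEMORY_BYTES - 1).bit_length()  # bits needed to address every byte

print(f"C = {num_lines} lines, M = {num_blocks} blocks, {address_bits} bit address")
```

Dividing the cache size by the block size is what answers the "why?" on the slide: 64K / 4 = 16K lines.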
40Direct Mapping (1)
- Each block of main memory maps to only one possible cache line
- i.e. if a block is in cache, it must be in one specific place
- Mapping
- $i = j \bmod m$, where
- $i$ = cache line number
- $j$ = memory block number
- $m$ = number of lines (i.e., $C$)
41Direct Mapping (2)
- Example of mapping 16 blocks, 4 lines
- Line: blocks
- 0: 0, 4, 8, 12
- 1: 1, 5, 9, 13
- 2: 2, 6, 10, 14
- 3: 3, 7, 11, 15
- Which block (in the line)?
- No two blocks in the same line have the same Tag field in address
- Check contents of cache by finding line and then checking Tag
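The line/block table above can be regenerated from the mapping rule i = j mod m. The following is a small Python sketch; the function and variable names are assumptions for illustration.

```python
def direct_mapped_line(block: int, num_lines: int) -> int:
    """Cache line i for memory block j under direct mapping: i = j mod m."""
    return block % num_lines


NUM_BLOCKS, NUM_LINES = 16, 4  # the 16 block, 4 line example from the slide

table = {line: [] for line in range(NUM_LINES)}
for block in range(NUM_BLOCKS):
    table[direct_mapped_line(block, NUM_LINES)].append(block)

for line, blocks in table.items():
    print(line, blocks)
```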
42Direct Mapping - Address Structure
- Address is in 3 fields
- Least significant $w$ bits identify unique word in a block (or line)
- Most significant $s$ bits specify one memory block
- The MSBs are split into
- cache line field of $r$ bits, where $m = 2^r$ (or $C = 2^r$)
- tag of $s - r$ (most significant) bits
43Direct Mapping Cache Line Table
- Cache line: main memory blocks held
- 0: 0, $m$, $2m$, …, $2^s - m$
- 1: 1, $m+1$, $2m+1$, …, $2^s - m + 1$
- …
- $m-1$: $m-1$, $2m-1$, $3m-1$, …, $2^s - 1$
44Direct Mapping Cache Organization
45Direct Mapping Example (1)
Address layout: Tag ($s - r$) = 8 bits | Line or slot ($r$) = 14 bits | Word ($w$) = 2 bits
- 24 bit address
- 2 bit word identifier (4 byte block)
- 22 bit block identifier
- 8 bit tag (= 22 - 14)
- 14 bit slot or line
- Again
- No two blocks in the same line have the same Tag field
- Check contents of cache by finding line and checking Tag
47Direct Mapping Example (2)
- Q1: Where in cache is the word from main memory location 16339D mapped?
- Address fields: Tag (8 bits) = 0001 0110 | Line (14 bits) = 0011 0011 1001 11 | Word (2 bits) = 01
- Ans: Line 0CE7, Tag 16, word offset 1
- Q2: Where in cache is the word from main memory location ABCDEF mapped?
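The field extraction in Q1 is a matter of shifting and masking. Below is a minimal Python sketch, assuming the 8/14/2 bit split of the 24 bit address used in this example; the function name and its default parameters are illustrative, not from the slides.

```python
def split_direct(address: int, tag_bits: int = 8, line_bits: int = 14, word_bits: int = 2):
    """Split an address into (tag, line, word) fields for a direct-mapped cache."""
    word = address & ((1 << word_bits) - 1)                  # least significant w bits
    line = (address >> word_bits) & ((1 << line_bits) - 1)   # next r bits
    tag = (address >> (word_bits + line_bits)) & ((1 << tag_bits) - 1)  # top s - r bits
    return tag, line, word


tag, line, word = split_direct(0x16339D)
print(f"Tag {tag:02X}, Line {line:04X}, word offset {word}")
```

The same call with a different address can be used to check an answer to Q2.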
48Direct Mapping Summary
- Address length = $(s + w)$ bits
- Number of addressable units = $2^{s+w}$ words or bytes
- Block size = line size = $2^w$ words or bytes
- Number of blocks in main memory = $2^{s+w} / 2^w = 2^s$
- Number of lines in cache = $m = 2^r$
- Size of tag = $(s - r)$ bits
49Direct Mapping pros & cons
- Advantages
- Simple
- Inexpensive to implement
- Disadvantage
- Fixed location for given block
- ⇒ If a program repeatedly accesses 2 blocks that map to the same line, the cache miss rate is very high
- ⇒ These blocks will be continually swapped in and out ⇒ hit ratio will be low
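The disadvantage above can be seen in a toy simulation. The Python sketch below models a direct-mapped cache as a list of block numbers and compares two hypothetical reference streams; the function name, the 4 line cache, and both streams are assumptions made for illustration.

```python
def direct_mapped_hit_ratio(block_refs, num_lines):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    lines = [None] * num_lines       # lines[i] holds the block currently in line i
    hits = 0
    for block in block_refs:
        index = block % num_lines    # i = j mod m
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block     # miss: the resident block is replaced
    return hits / len(block_refs)


conflicting = [0, 4] * 50   # blocks 0 and 4 both map to line 0 of a 4 line cache
friendly = [0, 1] * 50      # blocks 0 and 1 map to different lines

print(direct_mapped_hit_ratio(conflicting, 4))
print(direct_mapped_hit_ratio(friendly, 4))
```

In the conflicting stream every reference evicts the block the next reference needs, so nothing ever hits; in the friendly stream only the first reference to each block misses.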
50Associative Mapping
- A main memory block can load into any line of cache
- Memory address is interpreted as tag and word
- Tag uniquely identifies block of memory
- Every line's tag is examined for a match
- Cache searching gets expensive
- must simultaneously examine every line's tag for a match
51Fully Associative Cache Organization
53Associative Mapping Address Structure Example
Address layout: Tag = 22 bits | Word = 2 bits
- 24 bit address
- 22 bit tag stored with each 32 bit block of data
- Compare tag field with tag entry in cache to check for hit
- Least significant 2 bits of address identify which byte is required from 32 bit data block
54Associative Mapping Example
Address layout: Tag = 22 bits | Word = 2 bits
- Address | Tag | Cache line | Offset | Data
- FFFFFC | 3FFFFF | 3FFF | 00 | 24
- 16339D | 058CE7 | 0001 | 01 | DC
- ABCDEF | ? | ? | ? | ?
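A fully associative lookup can be sketched in Python by keying the cache on the 22 bit tag. This is only a software analogy: a dictionary lookup stands in for the hardware comparing every line's tag simultaneously. The names are illustrative, and the block bytes other than the two data values listed above are placeholders, not values from the slides.

```python
WORD_BITS = 2  # 4 byte blocks, so the low 2 address bits select the byte


def split_associative(address: int):
    """Split a 24 bit address into (22 bit tag, 2 bit byte offset)."""
    return address >> WORD_BITS, address & ((1 << WORD_BITS) - 1)


# tag -> 4 byte block. Only the bytes at the listed offsets (24 and DC) come
# from the example; the 0x00 bytes are placeholders.
cache = {
    0x3FFFFF: bytes([0x24, 0x00, 0x00, 0x00]),
    0x058CE7: bytes([0x00, 0xDC, 0x00, 0x00]),
}


def read_byte(cache, address: int):
    """Return the cached byte, or None on a miss."""
    tag, offset = split_associative(address)
    block = cache.get(tag)   # stands in for the parallel tag comparison
    return None if block is None else block[offset]


print(split_associative(0x16339D), read_byte(cache, 0x16339D))
```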
55Associative Mapping Summary
- Address length = $(s + w)$ bits
- Number of addressable units = $2^{s+w}$ words or bytes
- Block size = line size = $2^w$ words or bytes
- Number of blocks in main memory = $2^{s+w} / 2^w = 2^s$
- Number of lines in cache = undetermined
- Size of tag = $s$ bits
56Associative Mapping pros & cons
- Advantage
- Flexible
- Disadvantages
- Cost
- Complex circuit for simultaneous comparison
57Set Associative Mapping
- Compromise between the previous two
- Cache is divided into v sets of k lines each
- $m = v \times k$, where $m$ = number of lines
- $i = j \bmod v$, where
- $i$ = cache set number
- $j$ = memory block number
- A given block maps to any line in a given set
- $k$-way set associative cache
- 2-way and 4-way are common
58Set Associative Mapping Example
- $m$ = 16 lines, $v$ = 8 sets
- ⇒ $k$ = 2 lines/set, 2-way set associative mapping
- Assume 32 blocks in memory, $i = j \bmod v$
- Set: blocks
- 0: 0, 8, 16, 24
- 1: 1, 9, 17, 25
- …
- 7: 7, 15, 23, 31
- A given block can be in one of 2 lines in only one set
- e.g., block 17 can be assigned to either line 0 or line 1 in set 1
59Set Associative Mapping Address Structure
Address layout: Tag ($s - d$) bits | Set $d$ bits | Word $w$ bits
- $d$ bits: $v = 2^d$, specify one of $v$ sets
- $s$ bits specify one of $2^s$ blocks
- Use set field to determine cache set to look in
- Compare tag field simultaneously to see if we have a hit
60K Way Set Associative Cache Organization
61Set Associative Mapping Example
Address layout: Tag = 9 bits | Set = 13 bits | Word = 2 bits
- Same example, 2-way set associative
- $2^{14}$ lines, 2 lines/set ⇒ $2^{13}$ sets ⇒ $2^9$ blocks can be loaded to either of the two lines in a set
- Each block mapped into a set has a unique tag
- E.g., Address | Tag | Set | Offset | Data
- FFFFFF ⇒ 1FF 7FFF | 1FF | 1FFF | 11 | 68
- 16339D ⇒ 02C 339D | 02C | 0CE7 | 01 | DC
- ABCDEF | ? | ? | ? | ?
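The tag, set, and offset columns above can be reproduced with a shift-and-mask sketch in Python. The function name and its parameters are assumptions for illustration; the 13 bit set field and 2 bit word field are the widths used in this example.

```python
def split_set_associative(address: int, set_bits: int = 13, word_bits: int = 2):
    """Split an address into (tag, set, word) fields for a set associative cache."""
    word = address & ((1 << word_bits) - 1)
    set_index = (address >> word_bits) & ((1 << set_bits) - 1)
    tag = address >> (word_bits + set_bits)   # remaining most significant bits
    return tag, set_index, word


for addr in (0xFFFFFF, 0x16339D):
    tag, set_index, word = split_set_associative(addr)
    print(f"{addr:06X}: tag {tag:03X}, set {set_index:04X}, offset {word:02b}")
```

Varying `set_bits` moves between the other two schemes for the same 16K line cache: `set_bits=14` (one line per set) gives the direct-mapped fields, and `set_bits=0` (one big set) leaves the 22 bit associative tag.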
63Set Associative Mapping Summary
- Address length = $(s + w)$ bits
- Number of addressable units = $2^{s+w}$ words or bytes
- Block size = line size = $2^w$ words or bytes
- Number of blocks in main memory = $2^s$
- Number of lines in set = $k$
- Number of sets = $v = 2^d$
- Number of lines in cache = $kv = k \times 2^d$
- Size of tag = $(s - d)$ bits
64Remarks
- Why is the simultaneous comparison cheaper here, compared to associative mapping?
- Tag is much smaller
- Only $k$ tags within a set are compared
- Relationship between set associative mapping and the first two approaches, which are its two extreme cases
- $k = 1$ ⇒ $v = m$ ⇒ direct (1 line/set)
- $k = m$ ⇒ $v = 1$ ⇒ associative (one big set)
65Replacement Algorithms (1) Direct mapping
- Replacement algorithm
- When a new block is brought into cache, one of the existing blocks must be replaced
- Direct Mapping
- No choice
- Each block only maps to one line
- Replace that line
66Replacement Algorithms (2) Associative & Set Associative
- Hardware implemented algorithm (speed)
- Least recently used (LRU)
- e.g. in 2 way set associative
- Which of the 2 blocks is LRU?
- First in first out (FIFO)
- replace block that has been in cache longest
- Least frequently used
- replace block which has had fewest hits
- Random
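To make the LRU policy concrete, here is a small software model of a set associative cache in Python. It is a sketch under stated assumptions, not a description of how the hardware does it: real caches track recency with a few bits per line, while this model uses `collections.OrderedDict` and stores whole block numbers instead of tags. The class name and the reference stream are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy k-way set associative cache with LRU replacement."""

    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set: oldest (least recently used) entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block: int) -> bool:
        """Reference a memory block; return True on a hit, False on a miss."""
        lines = self.sets[block % self.num_sets]   # i = j mod v
        if block in lines:
            lines.move_to_end(block)               # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)              # evict the least recently used block
        lines[block] = True
        return False


cache = SetAssociativeCache(num_sets=8, ways=2)
results = [cache.access(b) for b in (1, 9, 1, 17, 9)]  # all five blocks map to set 1
print(results)
```

In this stream the re-reference to block 1 makes block 9 the least recently used, so block 17 evicts 9 and the final access to 9 misses. Under FIFO, block 1 (the oldest arrival) would have been evicted instead.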
67Write Policy
- Must not overwrite a cache block unless main memory is up to date
- Multiple CPUs may have individual caches
- I/O may address main memory directly
68Write through
- All writes go to main memory as well as cache
- Both copies always agree
- Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
- Disadvantage
- Lots of traffic ⇒ bottleneck
69Write back
- Updates initially made in cache only
- Update bit for cache slot is set when update occurs
- If block is to be replaced, write to main memory only if update bit is set, i.e., only if the cache line is dirty, i.e., only if at least one word in the cache line has been updated
- Other caches get out of sync
- I/O must access main memory through cache
- N.B. 15% of memory references are writes
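The traffic difference between the two write policies can be sketched by counting main-memory writes for the same reference stream. The Python model below makes several simplifying assumptions that are not in the slides: a direct-mapped cache, a block allocated on a write miss, no final flush of dirty lines, and a made-up stream of operations.

```python
def writes_write_through(ops) -> int:
    """Write through: every write also goes to main memory."""
    return sum(1 for op, _ in ops if op == "w")


def writes_write_back(ops, num_lines: int) -> int:
    """Write back: a block is written to memory only when a dirty line is replaced."""
    lines = [None] * num_lines     # block held by each line
    dirty = [False] * num_lines    # update (dirty) bit per line
    memory_writes = 0
    for op, block in ops:
        index = block % num_lines
        if lines[index] != block:                  # miss: replace the resident block
            if lines[index] is not None and dirty[index]:
                memory_writes += 1                 # write the dirty block back first
            lines[index] = block
            dirty[index] = False
        if op == "w":
            dirty[index] = True                    # update made in cache only
    return memory_writes


# Hypothetical stream: ten writes to block 0, then a read of block 4 (same line).
ops = [("w", 0)] * 10 + [("r", 4)]
print(writes_write_through(ops), writes_write_back(ops, num_lines=4))
```

Repeated writes to the same block are where write back saves traffic: the ten updates reach main memory as a single block write when the line is replaced.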
70Block Size
- Block size = line size
- As block size increases from very small
- ⇒ hit ratio increases because of
- the principle of locality
- As block size becomes very large
- ⇒ hit ratio decreases as
- Number of blocks decreases
- Probability of referencing all words in a block decreases
- 4 - 8 addressable units is reasonable
71Number of Caches
- Two aspects
- Number of levels
- Unified vs. split
72Multilevel Caches
- Modern CPU has on-chip cache (L1) that increases overall performance
- e.g., 80486: 8KB
- Pentium: 16KB
- PowerPC: up to 64KB
- Secondary, off-chip cache (L2) provides high speed access to main memory
- Generally 512KB or less
73Unified vs. Split
- Unified cache
- Stores data and instructions in one cache
- Flexible and can balance the load between data and instruction fetches
- ⇒ higher hit ratio
- Only one cache to design and implement
- Split cache
- Two caches, one for data and one for instructions
- Trend toward split cache
- Good for superscalar machines that support parallel execution, prefetch, and pipelining
- Overcome cache contention
74Pentium 4 Cache
- 80386: no on-chip cache
- 80486: single on-chip cache, 8k, using 16 byte lines and four way set associative organization
- Pentium (all versions): two on-chip L1 caches
- Data & instructions
- Pentium 4 L1 caches
- 8k bytes
- 64 byte lines
- four way set associative
- L2 cache
- Feeding both L1 caches
- 256k
- 128 byte lines
- 8 way set associative
75Pentium 4 Diagram (Simplified)
76Pentium 4 Core Processor
- Fetch/Decode Unit
- Fetches instructions from L2 cache
- Decode into micro-ops
- Store micro-ops in L1 cache
- Out of order execution logic
- Schedules micro-ops
- Based on data dependence and resources
- May speculatively execute
- Execution units
- Execute micro-ops
- Data from L1 cache
- Results in registers
- Memory subsystem
- L2 cache and systems bus
77Pentium 4 Design Reasoning
- Decodes instructions into RISC like micro-ops before L1 cache
- Micro-ops fixed length
- Superscalar pipelining and scheduling
- Pentium instructions long & complex
- Performance improved by separating decoding from scheduling & pipelining
- (More later, ch14)
- Data cache is write back
- Can be configured to write through
- L1 cache controlled by 2 bits in register
- CD = cache disable
- NW = not write through
- 2 instructions to invalidate (flush) cache and write back then invalidate
78Power PC Cache Organization
- 601: 1 x 32kb, 8-way set associative, 32b/line
- 603: 2 x 8kb, 2-way set associative, 32b/line
- 604: 2 x 16kb, 4-way set associative, 32b/line
- 620: 2 x 32kb, 8-way set associative, 64b/line
- G3 & G4
- 2 x 32kb L1 cache
- 8 way set associative
- G3 64b/line, G4 32b/line
- 256k, 512k or 1M L2 cache
- two way set associative
79PowerPC G4
80Comparison of Cache Sizes