Title: Lecture 16: Large Cache Innovations
Lecture 16: Large Cache Innovations
- Today: large cache design and other cache innovations
- Midterm scores:
  - 91-80: 17 students
  - 79-75: 14 students
  - 74-68: 8 students
  - 63-54: 5 students
- Q1 (HM), Q2 (bpred), Q3 (stalls), Q7 (loops): mostly correct
- Q4 (OOO): 50% correct; many didn't stall during renaming
- Q5 (multi-core): less than a handful got 8 points
- Q6 (memdep): less than a handful got part (b) right and correctly articulated the effect on power/energy
Shared vs. Private Caches in Multi-Core
- Advantages of a shared cache:
  - Space is dynamically allocated among cores
  - No wastage of space because of replication
  - Potentially faster cache coherence (and easier to locate data on a miss)
- Advantages of a private cache:
  - Small L2 → faster access time
  - Private bus to L2 → less contention
UCA and NUCA
- The small-sized caches so far have all been uniform cache access (UCA): the latency for any access is a constant, no matter where the data is found
- For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay; hence, non-uniform cache architecture (NUCA)
Large NUCA
- Issues to be addressed for Non-Uniform Cache Access:
  - Mapping
  - Migration
  - Search
  - Replication
(Figure: the CPU surrounded by an array of cache banks with non-uniform access latencies.)
Static and Dynamic NUCA
- Static NUCA (S-NUCA)
  - The address index bits determine where the block is placed (see the bank-mapping sketch after this list)
  - Page coloring can help here as well to improve locality
- Dynamic NUCA (D-NUCA)
  - Blocks are allowed to move between banks
  - The block can be anywhere: need some search mechanism
    - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
    - Every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
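As a concrete illustration of S-NUCA mapping, here is a minimal sketch in C. The line size, bank count, and function name are illustrative assumptions, not a specific design; here the bits just above the line offset pick the bank (line-granularity interleaving), though a design may take these bits from higher in the address.

  #include <stdint.h>

  #define LINE_BYTES 64  /* assumed line size */
  #define NUM_BANKS  16  /* assumed number of banks (power of two) */

  /* S-NUCA: the bank is a fixed function of the address, so no search
     is needed -- the requester always knows which bank to query. */
  static inline unsigned snuca_bank(uint64_t paddr)
  {
      return (paddr / LINE_BYTES) % NUM_BANKS;
  }

Because the mapping is static, there is no lookup machinery; the cost is that a block cannot migrate toward the core that uses it most.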
Example Organization
(Figure: example NUCA organization. Banks near the requester are reached in 13-17 cycles, while the farthest bank takes 65 cycles; data must be placed close to the center of gravity of requests.)
Examples: Frequency of Accesses
(Figure: per-bank access-frequency maps, where darker means more accesses, for OLTP (on-line transaction processing) and Ocean (a scientific code).)
(Figure: eight cores, Core 0 through Core 7, each with private L1 instruction and data caches, connected through a scalable non-broadcast interconnect to a shared L2 cache with directory state and an L2 cache controller.)
(Figure: eight cores, Core 0 through Core 7, each with private L1 instruction and data caches and a private L2 cache, connected through a scalable non-broadcast interconnect to replicated tags of all L2 and L1 caches and a controller that handles L2 misses and off-chip access.)
A single tile is composed of a core, L1 caches, and a bank (slice) of the shared L2 cache.
(Figure: eight such tiles, Core 0 through Core 7, each with L1 I/D caches and an L2 slice. The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations; a memory controller handles off-chip access.)
(Figure: a 3D-stacked organization. The bottom die holds the cores and L1 caches; the top die holds the L2 cache banks, so each core has low-latency access to one L2 bank. A memory controller handles off-chip access.)
The Best of S-NUCA and D-NUCA
- Employ S-NUCA (no search required) and use page coloring to influence the block's cache index bits, and hence the bank that the block gets placed in (a page-coloring sketch follows below)
- Page migration enables block movement, just as in D-NUCA
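A minimal sketch of the page-coloring idea, assuming 4 KB pages and 16 banks, and assuming the design selects the bank from bits of the physical page number (page-granularity interleaving); the names and the allocator hook are hypothetical.

  #include <stdint.h>

  #define PAGE_BYTES 4096  /* assumed page size */
  #define NUM_BANKS  16    /* assumed number of banks (power of two) */

  /* When the bank-index bits fall inside the physical page number,
     the OS controls bank placement via page allocation. */
  static inline unsigned page_color(uint64_t paddr)
  {
      return (paddr / PAGE_BYTES) % NUM_BANKS;
  }

  /* Hypothetical allocator policy: when mapping a page for a core,
     prefer a free physical page whose color equals the core's
     nearest ("home") L2 bank, so the page's blocks land close by. */
  int color_matches_home(uint64_t phys_page_addr, unsigned home_bank)
  {
      return page_color(phys_page_addr) == home_bank;
  }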
Prefetching
- Hardware prefetching can be employed for any of the cache levels
- It can introduce cache pollution; prefetched data is therefore often placed in a separate prefetch buffer to avoid pollution; this buffer must be looked up in parallel with the cache access
- Aggressive prefetching increases coverage, but leads to a reduction in accuracy → wasted memory bandwidth (both terms are defined below)
- Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)
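For reference, the standard definitions behind the coverage/accuracy trade-off, spelled out since the slide uses the terms without defining them: coverage is the fraction of baseline misses removed by prefetching, and accuracy is the fraction of issued prefetches that get used. A tiny C helper, with assumed counter names:

  /* useful   = prefetched lines referenced before eviction
     issued   = total prefetches issued
     misses0  = demand misses a no-prefetch baseline would incur */
  double coverage(long useful, long misses0) { return (double)useful / misses0; }
  double accuracy(long useful, long issued)  { return (double)useful / issued;  }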
Stream Buffers
- Simplest form of prefetch: on every miss, bring in multiple cache lines (a sketch follows below)
- When you read the top of the queue, bring in the next line
(Figure: an L1 cache backed by a stream buffer, a queue holding the next sequential lines.)
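A minimal C sketch of this mechanism, assuming a 4-entry FIFO of line addresses; the structure layout and the fetch_line hook are illustrative assumptions.

  #include <stdint.h>
  #include <stdbool.h>

  #define SB_DEPTH 4  /* assumed stream-buffer depth */

  typedef struct {
      uint64_t line[SB_DEPTH];  /* prefetched line addresses, FIFO order */
      int head, count;
  } stream_buf;

  void fetch_line(uint64_t line_addr);  /* hypothetical memory-side hook */

  /* L1 miss that also misses the buffer: restart the stream,
     prefetching the next SB_DEPTH sequential lines. */
  void sb_alloc(stream_buf *sb, uint64_t miss_line)
  {
      sb->head = 0;
      sb->count = SB_DEPTH;
      for (int i = 0; i < SB_DEPTH; i++) {
          sb->line[i] = miss_line + 1 + i;
          fetch_line(sb->line[i]);
      }
  }

  /* On an L1 miss, check the head of the buffer. On a hit, the line
     moves into L1, the head is popped, and one more line is fetched. */
  bool sb_lookup(stream_buf *sb, uint64_t miss_line)
  {
      if (sb->count == 0 || sb->line[sb->head] != miss_line)
          return false;
      /* reuse the popped head slot for the next line past the tail */
      uint64_t tail = sb->line[(sb->head + sb->count - 1) % SB_DEPTH];
      sb->line[sb->head] = tail + 1;
      fetch_line(tail + 1);
      sb->head = (sb->head + 1) % SB_DEPTH;
      return true;
  }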
Stride-Based Prefetching
- For each load, keep track of the last address accessed by the load and a possibly consistent stride
- An FSM detects a consistent stride and issues prefetches
(Figure: reference prediction table and its FSM. Each table entry, indexed by load PC, holds a tag, prev_addr, stride, and state. The states are init, trans, steady, and no-pred: correct predictions move init and trans to steady and no-pred to trans; an incorrect prediction moves steady back to init, while incorrect predictions that also update the stride move init to trans, trans to no-pred, and keep no-pred in no-pred. A sketch of this table follows below.)
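A minimal C sketch of the reference prediction table, following the four-state FSM reconstructed above (in the style of Chen and Baer's scheme); the table size, indexing, and prefetch hook are assumptions.

  #include <stdint.h>

  #define RPT_ENTRIES 64  /* assumed table size */

  typedef enum { INIT, TRANS, STEADY, NOPRED } rpt_state;

  typedef struct {
      uint64_t  tag;        /* load PC */
      uint64_t  prev_addr;  /* last address this load accessed */
      int64_t   stride;
      rpt_state state;
  } rpt_entry;

  static rpt_entry rpt[RPT_ENTRIES];

  void prefetch(uint64_t addr);  /* hypothetical memory-side hook */

  void rpt_access(uint64_t pc, uint64_t addr)
  {
      rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
      if (e->tag != pc) {  /* new load: (re)initialize the entry */
          e->tag = pc; e->prev_addr = addr; e->stride = 0; e->state = INIT;
          return;
      }
      int correct = ((int64_t)(addr - e->prev_addr) == e->stride);
      switch (e->state) {
      case INIT:   e->state = correct ? STEADY : TRANS;  break;
      case TRANS:  e->state = correct ? STEADY : NOPRED; break;
      case STEADY: e->state = correct ? STEADY : INIT;   break;
      case NOPRED: e->state = correct ? TRANS  : NOPRED; break;
      }
      if (!correct && e->state != INIT)     /* steady->init keeps the stride; */
          e->stride = addr - e->prev_addr;  /* other incorrect arcs update it */
      e->prev_addr = addr;
      if (e->state == STEADY)               /* predict the next address */
          prefetch(addr + e->stride);
  }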
Compiler Optimizations
- Loop interchange: loops can be re-ordered to exploit spatial locality

  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

  is converted to

  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];

  (with row-major storage, the interchanged version walks consecutive elements of x in the inner loop)
Exercise
- Re-organize data accesses so that a piece of data is used a number of times before moving on; in other words, artificially create temporal locality

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      r = 0;
      for (k = 0; k < N; k++)
        r = r + y[i][k] * z[k][j];
      x[i][j] = r;
    }

  is converted to the blocked version (block size B):

  for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
      for (i = 0; i < N; i++)
        for (j = jj; j < min(jj + B, N); j++) {
          r = 0;
          for (k = kk; k < min(kk + B, N); k++)
            r = r + y[i][k] * z[k][j];
          x[i][j] = x[i][j] + r;
        }

(Figure: access patterns of y, z, and x for the two versions.)
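For completeness, a self-contained, runnable version of the blocked nest above; the matrix type, sizes, and initialization are chosen only for demonstration.

  #include <stdio.h>

  #define N 64
  #define B 16
  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  static double x[N][N], y[N][N], z[N][N];

  int main(void)
  {
      /* initialize y and z; x starts zeroed (static storage) */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              y[i][j] = 1.0;
              z[i][j] = 1.0;
          }

      /* blocked matrix multiply: each (jj, kk) pass reuses a
         B x B tile of z while it is resident in the cache */
      for (int jj = 0; jj < N; jj += B)
          for (int kk = 0; kk < N; kk += B)
              for (int i = 0; i < N; i++)
                  for (int j = jj; j < MIN(jj + B, N); j++) {
                      double r = 0;
                      for (int k = kk; k < MIN(kk + B, N); k++)
                          r += y[i][k] * z[k][j];
                      x[i][j] += r;  /* accumulate partial dot products */
                  }

      printf("x[0][0] = %g (expect %d)\n", x[0][0], N);
      return 0;
  }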
Exercise (continued)
(Figure: the blocked code from the previous slide, repeated alongside access-footprint diagrams of y, z, and x across successive block iterations.)
Exercise (continued)
- Original code could have 2N³ + N² memory accesses, while the new version has 2N³/B + N²
(Both code versions are as shown above, alongside the access-footprint figure for y, z, and x.)
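To make the savings concrete, a worked example with illustrative numbers: for N = 512 and B = 64, the original nest makes roughly 2·512³ ≈ 2.7×10⁸ accesses, while the blocked version makes roughly 2·512³/64 ≈ 4.2×10⁶, plus N² ≈ 2.6×10⁵ in both cases; the dominant term shrinks by a factor of B = 64.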