Title: Lecture 16: Large Cache Innovations
Lecture 16: Large Cache Innovations
- Today: large cache design and other cache innovations
- Midterm scores:
  - 91-80: 17 students
  - 79-75: 14 students
  - 74-68: 8 students
  - 63-54: 5 students
- Q1 (HM), Q2 (bpred), Q3 (stalls), Q7 (loops): mostly correct
- Q4 (OOO): 50% correct; many didn't stall during renaming
- Q5 (multi-core): less than a handful got 8 points
- Q6 (memdep): less than a handful got part (b) right and correctly articulated the effect on power/energy
Shared vs. Private Caches in Multi-Core
- Advantages of a shared cache:
  - Space is dynamically allocated among cores
  - No wastage of space because of replication
  - Potentially faster cache coherence (and easier to locate data on a miss)
- Advantages of a private cache:
  - Small L2 → faster access time
  - Private bus to L2 → less contention
UCA and NUCA
- The small-sized caches so far have all been uniform cache access (UCA): the latency for any access is a constant, no matter where the data is found
- For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay; hence, non-uniform cache architecture (NUCA)
Large NUCA
- Issues to be addressed for Non-Uniform Cache Access:
  - Mapping
  - Migration
  - Search
  - Replication
(Figure: the CPU surrounded by an array of cache banks with non-uniform access latencies.)
Static and Dynamic NUCA
- Static NUCA (S-NUCA)
  - The address index bits determine where the block is placed (see the bank-mapping sketch after this list)
  - Page coloring can help here as well to improve locality
- Dynamic NUCA (D-NUCA)
  - Blocks are allowed to move between banks
  - The block can be anywhere: need some search mechanism
    - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
    - Every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
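As a concrete illustration of S-NUCA mapping, here is a minimal sketch in C. The line size, bank count, and function name are illustrative assumptions, not a specific design; here the bits just above the line offset pick the bank (line-granularity interleaving), though a design may take these bits from higher in the address.

  #include <stdint.h>

  #define LINE_BYTES 64  /* assumed line size */
  #define NUM_BANKS  16  /* assumed number of banks (power of two) */

  /* S-NUCA: the bank is a fixed function of the address, so no search
     is needed -- the requester always knows which bank to query. */
  static inline unsigned snuca_bank(uint64_t paddr)
  {
      return (paddr / LINE_BYTES) % NUM_BANKS;
  }

Because the mapping is static, there is no lookup machinery; the cost is that a block cannot migrate toward the core that uses it most.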
Example Organization
(Figure: example NUCA organization. Banks near the requester are reached in 13-17 cycles, while the farthest bank takes 65 cycles; data must be placed close to the center of gravity of requests.)
Examples: Frequency of Accesses
(Figure: per-bank access-frequency maps, where darker means more accesses, for OLTP (on-line transaction processing) and Ocean (a scientific code).)
(Figure: eight cores, Core 0 through Core 7, each with private L1 instruction and data caches, connected through a scalable non-broadcast interconnect to a shared L2 cache with directory state and an L2 cache controller.)
(Figure: eight cores, Core 0 through Core 7, each with private L1 instruction and data caches and a private L2 cache, connected through a scalable non-broadcast interconnect to replicated tags of all L2 and L1 caches and a controller that handles L2 misses and off-chip access.)
A single tile is composed of a core, L1 caches, and a bank (slice) of the shared L2 cache.
(Figure: eight such tiles, Core 0 through Core 7, each with L1 I/D caches and an L2 slice. The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations; a memory controller handles off-chip access.)
(Figure: a 3D-stacked organization. The bottom die holds the cores and L1 caches; the top die holds the L2 cache banks, so each core has low-latency access to one L2 bank. A memory controller handles off-chip access.)
The Best of S-NUCA and D-NUCA
- Employ S-NUCA (no search required) and use page coloring to influence the block's cache index bits, and hence the bank that the block gets placed in (a page-coloring sketch follows below)
- Page migration enables block movement, just as in D-NUCA
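A minimal sketch of the page-coloring idea, assuming 4 KB pages and 16 banks, and assuming the design selects the bank from bits of the physical page number (page-granularity interleaving); the names and the allocator hook are hypothetical.

  #include <stdint.h>

  #define PAGE_BYTES 4096  /* assumed page size */
  #define NUM_BANKS  16    /* assumed number of banks (power of two) */

  /* When the bank-index bits fall inside the physical page number,
     the OS controls bank placement via page allocation. */
  static inline unsigned page_color(uint64_t paddr)
  {
      return (paddr / PAGE_BYTES) % NUM_BANKS;
  }

  /* Hypothetical allocator policy: when mapping a page for a core,
     prefer a free physical page whose color equals the core's
     nearest ("home") L2 bank, so the page's blocks land close by. */
  int color_matches_home(uint64_t phys_page_addr, unsigned home_bank)
  {
      return page_color(phys_page_addr) == home_bank;
  }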
Prefetching
- Hardware prefetching can be employed for any of the cache levels
- It can introduce cache pollution; prefetched data is therefore often placed in a separate prefetch buffer to avoid pollution; this buffer must be looked up in parallel with the cache access
- Aggressive prefetching increases coverage, but leads to a reduction in accuracy → wasted memory bandwidth (both terms are defined below)
- Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)
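For reference, the standard definitions behind the coverage/accuracy trade-off, spelled out since the slide uses the terms without defining them: coverage is the fraction of baseline misses removed by prefetching, and accuracy is the fraction of issued prefetches that get used. A tiny C helper, with assumed counter names:

  /* useful   = prefetched lines referenced before eviction
     issued   = total prefetches issued
     misses0  = demand misses a no-prefetch baseline would incur */
  double coverage(long useful, long misses0) { return (double)useful / misses0; }
  double accuracy(long useful, long issued)  { return (double)useful / issued;  }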
Stream Buffers
- Simplest form of prefetch: on every miss, bring in multiple cache lines (a sketch follows below)
- When you read the top of the queue, bring in the next line
(Figure: an L1 cache backed by a stream buffer, a queue holding the next sequential lines.)
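A minimal C sketch of this mechanism, assuming a 4-entry FIFO of line addresses; the structure layout and the fetch_line hook are illustrative assumptions.

  #include <stdint.h>
  #include <stdbool.h>

  #define SB_DEPTH 4  /* assumed stream-buffer depth */

  typedef struct {
      uint64_t line[SB_DEPTH];  /* prefetched line addresses, FIFO order */
      int head, count;
  } stream_buf;

  void fetch_line(uint64_t line_addr);  /* hypothetical memory-side hook */

  /* L1 miss that also misses the buffer: restart the stream,
     prefetching the next SB_DEPTH sequential lines. */
  void sb_alloc(stream_buf *sb, uint64_t miss_line)
  {
      sb->head = 0;
      sb->count = SB_DEPTH;
      for (int i = 0; i < SB_DEPTH; i++) {
          sb->line[i] = miss_line + 1 + i;
          fetch_line(sb->line[i]);
      }
  }

  /* On an L1 miss, check the head of the buffer. On a hit, the line
     moves into L1, the head is popped, and one more line is fetched. */
  bool sb_lookup(stream_buf *sb, uint64_t miss_line)
  {
      if (sb->count == 0 || sb->line[sb->head] != miss_line)
          return false;
      /* reuse the popped head slot for the next line past the tail */
      uint64_t tail = sb->line[(sb->head + sb->count - 1) % SB_DEPTH];
      sb->line[sb->head] = tail + 1;
      fetch_line(tail + 1);
      sb->head = (sb->head + 1) % SB_DEPTH;
      return true;
  }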
Stride-Based Prefetching
- For each load, keep track of the last address accessed by the load and a possibly consistent stride
- An FSM detects a consistent stride and issues prefetches
(Figure: reference prediction table and its FSM. Each table entry, indexed by load PC, holds a tag, prev_addr, stride, and state. The states are init, trans, steady, and no-pred: correct predictions move init and trans to steady and no-pred to trans; an incorrect prediction moves steady back to init, while incorrect predictions that also update the stride move init to trans, trans to no-pred, and keep no-pred in no-pred. A sketch of this table follows below.)
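A minimal C sketch of the reference prediction table, following the four-state FSM reconstructed above (in the style of Chen and Baer's scheme); the table size, indexing, and prefetch hook are assumptions.

  #include <stdint.h>

  #define RPT_ENTRIES 64  /* assumed table size */

  typedef enum { INIT, TRANS, STEADY, NOPRED } rpt_state;

  typedef struct {
      uint64_t  tag;        /* load PC */
      uint64_t  prev_addr;  /* last address this load accessed */
      int64_t   stride;
      rpt_state state;
  } rpt_entry;

  static rpt_entry rpt[RPT_ENTRIES];

  void prefetch(uint64_t addr);  /* hypothetical memory-side hook */

  void rpt_access(uint64_t pc, uint64_t addr)
  {
      rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
      if (e->tag != pc) {  /* new load: (re)initialize the entry */
          e->tag = pc; e->prev_addr = addr; e->stride = 0; e->state = INIT;
          return;
      }
      int correct = ((int64_t)(addr - e->prev_addr) == e->stride);
      switch (e->state) {
      case INIT:   e->state = correct ? STEADY : TRANS;  break;
      case TRANS:  e->state = correct ? STEADY : NOPRED; break;
      case STEADY: e->state = correct ? STEADY : INIT;   break;
      case NOPRED: e->state = correct ? TRANS  : NOPRED; break;
      }
      if (!correct && e->state != INIT)     /* steady->init keeps the stride; */
          e->stride = addr - e->prev_addr;  /* other incorrect arcs update it */
      e->prev_addr = addr;
      if (e->state == STEADY)               /* predict the next address */
          prefetch(addr + e->stride);
  }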
Compiler Optimizations
- Loop interchange: loops can be re-ordered to exploit spatial locality

  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

  is converted to

  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];

  (with row-major storage, the interchanged version walks consecutive elements of x in the inner loop)
Exercise
- Re-organize data accesses so that a piece of data is used a number of times before moving on; in other words, artificially create temporal locality

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      r = 0;
      for (k = 0; k < N; k++)
        r = r + y[i][k] * z[k][j];
      x[i][j] = r;
    }

  is converted to the blocked version (block size B):

  for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
      for (i = 0; i < N; i++)
        for (j = jj; j < min(jj + B, N); j++) {
          r = 0;
          for (k = kk; k < min(kk + B, N); k++)
            r = r + y[i][k] * z[k][j];
          x[i][j] = x[i][j] + r;
        }

(Figure: access patterns of y, z, and x for the two versions.)
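For completeness, a self-contained, runnable version of the blocked nest above; the matrix type, sizes, and initialization are chosen only for demonstration.

  #include <stdio.h>

  #define N 64
  #define B 16
  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  static double x[N][N], y[N][N], z[N][N];

  int main(void)
  {
      /* initialize y and z; x starts zeroed (static storage) */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              y[i][j] = 1.0;
              z[i][j] = 1.0;
          }

      /* blocked matrix multiply: each (jj, kk) pass reuses a
         B x B tile of z while it is resident in the cache */
      for (int jj = 0; jj < N; jj += B)
          for (int kk = 0; kk < N; kk += B)
              for (int i = 0; i < N; i++)
                  for (int j = jj; j < MIN(jj + B, N); j++) {
                      double r = 0;
                      for (int k = kk; k < MIN(kk + B, N); k++)
                          r += y[i][k] * z[k][j];
                      x[i][j] += r;  /* accumulate partial dot products */
                  }

      printf("x[0][0] = %g (expect %d)\n", x[0][0], N);
      return 0;
  }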
Exercise (continued)
(Figure: the blocked code from the previous slide, repeated alongside access-footprint diagrams of y, z, and x across successive block iterations.)
Exercise (continued)
- Original code could have 2N³ + N² memory accesses, while the new version has 2N³/B + N²
(Both code versions are as shown above, alongside the access-footprint figure for y, z, and x.)
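To make the savings concrete, a worked example with illustrative numbers: for N = 512 and B = 64, the original nest makes roughly 2·512³ ≈ 2.7×10⁸ accesses, while the blocked version makes roughly 2·512³/64 ≈ 4.2×10⁶, plus N² ≈ 2.6×10⁵ in both cases; the dominant term shrinks by a factor of B = 64.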