Title: Adaptive Insertion Policies for Managing Shared Caches
Slide 1: Adaptive Insertion Policies for Managing Shared Caches
- Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon C. Steely Jr., Joel Emer
- International Conference on Parallel Architectures and Compilation Techniques (PACT)
Slide 2: Paper Motivation
- Shared caches are common, and even more so with the increasing number of cores
- Concurrent applications → contention for the shared cache
- High performance → the shared cache must be managed efficiently
Slide 3: Addressing Shared Cache Performance
- Conventional LRU policy allocates resources based on rate of demand
- Applications that do not benefit from the cache cause destructive cache interference
- Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
  - Requires HW for detecting benefit
  - Significant changes to cache structure
  - Not scalable to a large number of applications
Slide 4: Paper Contributions
- Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
- Goals: Design a hardware mechanism that
  - 1. Provides high performance by allocating cache on a benefit basis
  - 2. Is robust across different concurrently executing applications
  - 3. Scales to a large number of competing applications
  - 4. Requires low design overhead
- Solution: Thread-Aware Dynamic Insertion Policy (TADIP) that improves average throughput by 12-18% for 2, 4, 8, and 16-core systems with less than two bytes of storage per HW-thread
Slide 5: Review: Insertion Policies
- "Adaptive Insertion Policies for High-Performance Caching"
- Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer
- Appeared in ISCA'07
Slide 6: Cache Replacement 101 (ISCA'07)
- Two components of cache replacement:
  - Victim Selection: which line to replace for the incoming line? (e.g., LRU, Random)
  - Insertion Policy: with what priority is the new line placed in the replacement list? (e.g., insert the new line at the MRU position)
Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads
Slide 7: Static Insertion Policies (ISCA'07)
- Conventional (MRU Insertion) Policy
  - Choose victim, promote to MRU
- LRU Insertion Policy (LIP)
  - Choose victim, DO NOT promote to MRU
  - Unless reused, lines stay at the LRU position
- Bimodal Insertion Policy (BIP)
  - LIP does not age older lines
  - Infrequently insert some misses at MRU
  - Bimodal throttle parameter b; we used b = 1/32
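The three static policies above can be modeled with a recency stack per cache set. The following is a minimal Python sketch (illustrative only, not the paper's hardware; class and method names are my own):

```python
import random

EPSILON = 1 / 32  # bimodal throttle b from the ISCA'07 slide

class CacheSet:
    """One cache set as a recency stack: index 0 = MRU, last = LRU."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []  # tags, most-recently-used first

    def access(self, tag, policy="MRU"):
        """Return True on hit. policy is 'MRU', 'LIP', or 'BIP'."""
        if tag in self.stack:            # hit: promote to MRU
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways: # miss in a full set: evict LRU victim
            self.stack.pop()
        if policy == "MRU":
            self.stack.insert(0, tag)    # conventional: insert at MRU
        elif policy == "LIP":
            self.stack.append(tag)       # insert at LRU; stays there unless reused
        elif policy == "BIP":            # mostly LIP, rarely MRU (ages old lines)
            if random.random() < EPSILON:
                self.stack.insert(0, tag)
            else:
                self.stack.append(tag)
        return False
```

Under LIP/BIP, a line inserted at the LRU position is itself the next victim unless it is reused, which is what protects the rest of the set from a thrashing working set.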
Slide 8: Dynamic Insertion Policy (DIP) using Set-Dueling (ISCA'07)
- Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a pre-defined policy
- Divide the cache in three:
  - SDM-LRU: dedicated LRU sets
  - SDM-BIP: dedicated BIP sets
  - Follower sets
- PSEL: n-bit saturating counter
  - Misses to SDM-LRU: PSEL++
  - Misses to SDM-BIP: PSEL--
- Follower sets' insertion policy:
  - Use LRU if MSB(PSEL) = 0
  - Use BIP if MSB(PSEL) = 1
- Based on analytical and empirical studies:
  - 32 sets per SDM
  - 10-bit PSEL counter
HW required: 10 bits + combinational logic
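The set-dueling mechanism above can be sketched in a few lines of Python. This is an illustrative model, not the hardware: the static SDM-to-set mapping and the class name are my own (real designs typically hash the SDM sets across the cache):

```python
PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1

class DIPSelector:
    """Set-dueling bookkeeping for DIP: one PSEL picks LRU vs BIP."""

    def __init__(self, num_sets, sdm_sets=32):
        self.psel = PSEL_MAX // 2  # start the saturating counter at midpoint
        self.num_sets = num_sets
        self.sdm_sets = sdm_sets

    def set_kind(self, set_index):
        # Simple static mapping for illustration.
        if set_index < self.sdm_sets:
            return "SDM-LRU"
        if set_index < 2 * self.sdm_sets:
            return "SDM-BIP"
        return "follower"

    def on_miss(self, set_index):
        kind = self.set_kind(set_index)
        if kind == "SDM-LRU":                         # LRU sets missing:
            self.psel = min(self.psel + 1, PSEL_MAX)  # evidence against LRU
        elif kind == "SDM-BIP":                       # BIP sets missing:
            self.psel = max(self.psel - 1, 0)         # evidence against BIP

    def policy(self, set_index):
        kind = self.set_kind(set_index)
        if kind == "SDM-LRU":
            return "LRU"
        if kind == "SDM-BIP":
            return "BIP"
        # Follower sets adopt the policy of the winning SDM (MSB of PSEL).
        return "BIP" if self.psel >> (PSEL_BITS - 1) else "LRU"
```

The SDMs pay a small fixed cost (64 sets out of the whole cache) to continuously measure which policy misses less, and the follower majority inherits the winner.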
Slide 9: Extending DIP to Shared Caches
- DIP uses a single policy (LRU or BIP) for all applications competing for the cache
- DIP cannot distinguish between apps that benefit from the cache and those that do not
- Example: soplex + h264ref sharing a 2MB cache
  - DIP learns LRU for both apps
  - soplex causes destructive interference
  - Desirable that only h264ref follow LRU and soplex follow BIP
[Figure: misses per 1000 instructions under LRU for soplex and h264ref]
Need a Thread-Aware Dynamic Insertion Policy (TADIP)
Slide 10: Thread-Aware Dynamic Insertion Policy (TADIP)
- Assume N concurrent applications; what is the best insertion policy for each? (LRU = 0, BIP = 1)
- The insertion-policy decision can be thought of as an N-bit binary string:
  - < P0, P1, P2, ..., PN-1 >
  - If Px = 1, then application x uses BIP, else LRU
  - e.g., 0000 → always use conventional LRU; 1111 → always use BIP
- With an N-bit string there are 2^N possible combinations. How to find the best one?
  - Offline profiling: input-set and system dependent, impractical with large N
  - Brute-force search using SDMs: infeasible with large N
Need a PRACTICAL and SCALABLE implementation of TADIP
Slide 11: Using Set-Dueling as a Practical Approach to TADIP
- Unnecessary to exhaustively search all 2^N combinations
- Some bits of the best binary insertion string can be learned independently
  - Example: always use BIP for applications that do not benefit from the cache
- Exponential search space → linear search space
  - Learn the best policy (BIP or LRU) for each app in the presence of all other apps
Use per-application SDMs to determine: in the presence of other apps, should a given app use BIP or LRU?
Slide 12: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
- Each app gets a pair of SDMs; the diagram labels each set with its insertion-policy vector:
  - < 0, P1, P2, P3 > and < 1, P1, P2, P3 >: in the presence of the other apps, should APP0 do LRU or BIP?
  - < P0, 0, P2, P3 > and < P0, 1, P2, P3 >: the same question for APP1
  - < P0, P1, 0, P3 > and < P0, P1, 1, P3 >: the same question for APP2
  - < P0, P1, P2, 0 > and < P0, P1, P2, 1 >: the same question for APP3
  - < P0, P1, P2, P3 >: follower sets
[Figure: high-level and set-level views of the cache]
Slide 13: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
- LRU SDMs for each APP: < 0, P1, P2, P3 >, < P0, 0, P2, P3 >, < P0, P1, 0, P3 >, < P0, P1, P2, 0 >
- BIP SDMs for each APP: < 1, P1, P2, P3 >, < P0, 1, P2, P3 >, < P0, P1, 1, P3 >, < P0, P1, P2, 1 >
- Follower sets: < P0, P1, P2, P3 >
- Per-APP PSEL saturating counters
  - Misses to an app's LRU SDM: PSEL++
  - Misses to an app's BIP SDM: PSEL--
- Follower sets' insertion policy:
  - SDMs of one thread are follower sets of another thread
  - Let Px = MSB(PSELx)
  - Fill decision: < P0, P1, P2, P3 >
- Design parameters:
  - 32 sets per SDM
  - 10-bit PSEL; Pc = MSB(PSELc)
HW required: 10 × T bits + combinational logic (T = number of HW threads)
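The per-thread dueling above can be sketched as follows. This is a simplified illustrative model under my own assumptions (contiguous SDM mapping, PSEL trained only by the owning thread's misses in its own SDMs), not the paper's exact hardware:

```python
PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1

class TADIPSelector:
    """Per-thread set dueling for TADIP with T threads.

    Assumed mapping: sets [2t*sdm, (2t+1)*sdm) are thread t's LRU SDM,
    [(2t+1)*sdm, (2t+2)*sdm) its BIP SDM; all remaining sets are followers.
    """

    def __init__(self, num_threads, sdm_sets=32):
        self.T = num_threads
        self.sdm = sdm_sets
        self.psel = [PSEL_MAX // 2] * num_threads  # one counter per thread

    def owner(self, set_index):
        """Return (thread, 'LRU'|'BIP') if this set is an SDM, else None."""
        block = set_index // self.sdm
        if block < 2 * self.T:
            return block // 2, ("LRU" if block % 2 == 0 else "BIP")
        return None

    def on_miss(self, set_index, thread):
        o = self.owner(set_index)
        if o and o[0] == thread:  # only the owning thread trains its PSEL here
            t, kind = o
            if kind == "LRU":
                self.psel[t] = min(self.psel[t] + 1, PSEL_MAX)
            else:
                self.psel[t] = max(self.psel[t] - 1, 0)

    def fill_policy(self, set_index, thread):
        """Insertion policy when `thread` fills a line into `set_index`."""
        o = self.owner(set_index)
        if o and o[0] == thread:  # inside its own SDM the policy bit is fixed
            return o[1]
        # Everywhere else (including other threads' SDMs, which act as
        # followers for this thread): P_thread = MSB(PSEL_thread).
        return "BIP" if self.psel[thread] >> (PSEL_BITS - 1) else "LRU"
```

Note how the exponential 2^N search collapses to N independent duels: each thread's pair of SDMs fixes only its own bit while every other thread contributes its current follower bit.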
Slide 14: Experimental Setup
- Simulator and benchmarks:
  - CMP$im: a Pin-based multi-core performance simulator
  - 17 representative SPEC CPU2006 benchmarks
- Baseline study:
  - 4-core CMP with in-order cores
  - Three-level non-inclusive cache hierarchy: 32KB L1, 256KB L2, 4MB L3
  - 15 workload mixes of four different SPEC CPU2006 benchmarks
- Scalability study:
  - 2-core, 4-core, 8-core, and 16-core systems
  - 50 workload mixes of 2, 4, 8, and 16 different SPEC CPU2006 benchmarks
Slide 15: soplex + h264ref Sharing a 2MB Cache
[Figure: APKI (accesses per 1000 instructions) and MPKI (misses per 1000 instructions) for soplex and h264ref]
TADIP improves throughput by 27% over LRU and DIP
Slide 16: TADIP Results: Throughput
- No gains from DIP
- TADIP provides more than twice the performance gain of DIP
- TADIP improves performance over LRU by 18%
Slide 17: TADIP Compared to Offline Best Static Policy
- Static Best is almost always better, since it is optimized for best IPC while TADIP optimizes for fewer misses; having TADIP optimize for other metrics such as IPC could reduce the gap
- Where TADIP wins, it is due to phase adaptation
- TADIP is within 85% of the best offline-determined insertion policy
Slide 18: TADIP vs. Utility-Based Cache Partitioning (UCP)
- TADIP outperforms UCP without requiring any cache-partitioning hardware
- Unlike cache-partitioning schemes, TADIP does NOT reserve cache space; instead, TADIP does efficient CACHE MANAGEMENT by changing the insertion policy
Slide 19: TADIP Results: Sensitivity to Cache Size
TADIP provides performance equivalent to doubling the cache size
Slide 20: TADIP Results: Scalability
[Figure: throughput normalized to the baseline system]
TADIP scales to a large number of concurrently executing applications
Slide 21: Summary
- The problem: LRU causes cache thrashing when workloads with differing working sets compete for a shared cache
- Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
  - 1. Provides high performance by minimizing thrashing
    - Up to 94%, 64%, 26%, and 16% performance improvement on 2, 4, 8, and 16-core CMPs
  - 2. Is robust across different workload mixes
    - Does not significantly hurt performance when LRU works well
  - 3. Scales to a large number of competing applications
    - Evaluated up to 16 cores in our study
  - 4. Requires low design overhead
    - Less than 2 bytes of HW required per hardware thread in the system
Slide 22: Q&A
Slide 23: TADIP Results: Weighted Speedup
Slide 24: TADIP Results: Fairness Metric
Slide 25: TADIP in the Presence of Prefetching on a 4-core CMP
Slide 26: Cache Occupancy (16 Cores)
- Changing the fill policy directly controls the amount of cache resources provided to an application
- The figure shows only the fill policy for xalancbmk and sphinx3
- 28% performance improvement
Slide 27: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 2 applications: APP0 and APP1
- Each app gets a pair of SDMs; the diagram labels each set with its insertion-policy vector:
  - < 0, P1 > and < 1, P1 >: in the presence of the other app, should APP0 do LRU or BIP? (trains PSEL0)
  - < P0, 0 > and < P0, 1 >: in the presence of the other app, should APP1 do LRU or BIP? (trains PSEL1)
  - < P0, P1 >: follower sets
- 32 sets per SDM
- 9-bit PSEL
- Pc = MSB(PSELc)
[Figure: high-level and set-level views of the cache]
Slide 28: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 2 applications: APP0 and APP1
- LRU SDMs for each APP: < 0, P1 >, < P0, 0 >
- BIP SDMs for each APP: < 1, P1 >, < P0, 1 >
- Follower sets: < P0, P1 >
- PSEL0, PSEL1: per-APP PSEL counters
  - Misses to an app's LRU SDM: PSEL++
  - Misses to an app's BIP SDM: PSEL--
- Follower sets' insertion policy:
  - SDMs of one thread are follower sets of the other thread
  - Let Px = MSB(PSELx)
  - Fill decision: < P0, P1 >
- 32 sets per SDM
- 9-bit PSEL counter
- Pc = MSB(PSELc)
HW required: 9 × T bits + combinational logic