Title: Adaptive Insertion Policies for Managing Shared Caches
Slide 1: Adaptive Insertion Policies for Managing Shared Caches
- Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon C. Steely Jr., Joel Emer
- International Conference on Parallel Architectures and Compilation Techniques (PACT)
Slide 2: Paper Motivation
- Shared caches are common, and even more so with the increasing number of cores
- Concurrent applications → contention for the shared cache
- High performance → the shared cache must be managed efficiently
Slide 3: Addressing Shared Cache Performance
- Conventional LRU policy allocates resources based on rate of demand
- Applications that do not benefit from the cache cause destructive cache interference
- Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
  - Requires HW for detecting benefit
  - Significant changes to cache structure
  - Not scalable to a large number of applications
Slide 4: Paper Contributions
- Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
- Goals: Design a hardware mechanism that
  - 1. Provides high performance by allocating cache on a benefit basis
  - 2. Is robust across different concurrently executing applications
  - 3. Scales to a large number of competing applications
  - 4. Requires low design overhead
- Solution: Thread-Aware Dynamic Insertion Policy (TADIP) that improves average throughput by 12-18% for 2, 4, 8, and 16-core systems with less than two bytes of storage per HW-thread
Slide 5: Review: Insertion Policies
- "Adaptive Insertion Policies for High-Performance Caching"
- Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer
- Appeared in ISCA'07
Slide 6: Cache Replacement 101 (ISCA'07)
- Two components of cache replacement:
  - Victim Selection: which line to replace for the incoming line? (e.g., LRU, Random)
  - Insertion Policy: with what priority is the new line placed in the replacement list? (e.g., insert the new line at the MRU position)
Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads
Slide 7: Static Insertion Policies (ISCA'07)
- Conventional (MRU Insertion) Policy
  - Choose victim, promote to MRU
- LRU Insertion Policy (LIP)
  - Choose victim, DO NOT promote to MRU
  - Unless reused, lines stay at the LRU position
- Bimodal Insertion Policy (BIP)
  - LIP does not age older lines
  - Infrequently insert some misses at MRU
  - Bimodal throttle parameter b; we used b = 1/32
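The three static policies above can be modeled with a recency stack per cache set. The following is a minimal Python sketch (illustrative only, not the paper's hardware; class and method names are my own):

```python
import random

EPSILON = 1 / 32  # bimodal throttle b from the ISCA'07 slide

class CacheSet:
    """One cache set as a recency stack: index 0 = MRU, last = LRU."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []  # tags, most-recently-used first

    def access(self, tag, policy="MRU"):
        """Return True on hit. policy is 'MRU', 'LIP', or 'BIP'."""
        if tag in self.stack:            # hit: promote to MRU
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways: # miss in a full set: evict LRU victim
            self.stack.pop()
        if policy == "MRU":
            self.stack.insert(0, tag)    # conventional: insert at MRU
        elif policy == "LIP":
            self.stack.append(tag)       # insert at LRU; stays there unless reused
        elif policy == "BIP":            # mostly LIP, rarely MRU (ages old lines)
            if random.random() < EPSILON:
                self.stack.insert(0, tag)
            else:
                self.stack.append(tag)
        return False
```

Under LIP/BIP, a line inserted at the LRU position is itself the next victim unless it is reused, which is what protects the rest of the set from a thrashing working set.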
Slide 8: Dynamic Insertion Policy (DIP) using Set-Dueling (ISCA'07)
- Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a pre-defined policy
- Divide the cache in three:
  - SDM-LRU: dedicated LRU sets
  - SDM-BIP: dedicated BIP sets
  - Follower sets
- PSEL: n-bit saturating counter
  - Misses to SDM-LRU: PSEL++
  - Misses to SDM-BIP: PSEL--
- Follower sets' insertion policy:
  - Use LRU if MSB(PSEL) = 0
  - Use BIP if MSB(PSEL) = 1
- Based on analytical and empirical studies:
  - 32 sets per SDM
  - 10-bit PSEL counter
HW required: 10 bits + combinational logic
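The set-dueling mechanism above can be sketched in a few lines of Python. This is an illustrative model, not the hardware: the static SDM-to-set mapping and the class name are my own (real designs typically hash the SDM sets across the cache):

```python
PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1

class DIPSelector:
    """Set-dueling bookkeeping for DIP: one PSEL picks LRU vs BIP."""

    def __init__(self, num_sets, sdm_sets=32):
        self.psel = PSEL_MAX // 2  # start the saturating counter at midpoint
        self.num_sets = num_sets
        self.sdm_sets = sdm_sets

    def set_kind(self, set_index):
        # Simple static mapping for illustration.
        if set_index < self.sdm_sets:
            return "SDM-LRU"
        if set_index < 2 * self.sdm_sets:
            return "SDM-BIP"
        return "follower"

    def on_miss(self, set_index):
        kind = self.set_kind(set_index)
        if kind == "SDM-LRU":                         # LRU sets missing:
            self.psel = min(self.psel + 1, PSEL_MAX)  # evidence against LRU
        elif kind == "SDM-BIP":                       # BIP sets missing:
            self.psel = max(self.psel - 1, 0)         # evidence against BIP

    def policy(self, set_index):
        kind = self.set_kind(set_index)
        if kind == "SDM-LRU":
            return "LRU"
        if kind == "SDM-BIP":
            return "BIP"
        # Follower sets adopt the policy of the winning SDM (MSB of PSEL).
        return "BIP" if self.psel >> (PSEL_BITS - 1) else "LRU"
```

The SDMs pay a small fixed cost (64 sets out of the whole cache) to continuously measure which policy misses less, and the follower majority inherits the winner.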
Slide 9: Extending DIP to Shared Caches
- DIP uses a single policy (LRU or BIP) for all applications competing for the cache
- DIP cannot distinguish between apps that benefit from the cache and those that do not
- Example: soplex + h264ref sharing a 2MB cache
  - DIP learns LRU for both apps
  - soplex causes destructive interference
  - Desirable that only h264ref follow LRU and soplex follow BIP
[Figure: misses per 1000 instructions under LRU for soplex and h264ref]
Need a Thread-Aware Dynamic Insertion Policy (TADIP)
Slide 10: Thread-Aware Dynamic Insertion Policy (TADIP)
- Assume N concurrent applications; what is the best insertion policy for each? (LRU = 0, BIP = 1)
- The insertion-policy decision can be thought of as an N-bit binary string:
  - < P0, P1, P2, ..., PN-1 >
  - If Px = 1, then application x uses BIP, else LRU
  - e.g., 0000 → always use conventional LRU; 1111 → always use BIP
- With an N-bit string there are 2^N possible combinations. How to find the best one?
  - Offline profiling: input-set and system dependent, impractical with large N
  - Brute-force search using SDMs: infeasible with large N
Need a PRACTICAL and SCALABLE implementation of TADIP
Slide 11: Using Set-Dueling as a Practical Approach to TADIP
- Unnecessary to exhaustively search all 2^N combinations
- Some bits of the best binary insertion string can be learned independently
  - Example: always use BIP for applications that do not benefit from the cache
- Exponential search space → linear search space
  - Learn the best policy (BIP or LRU) for each app in the presence of all other apps
Use per-application SDMs to determine: in the presence of other apps, should a given app use BIP or LRU?
Slide 12: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
- Each app gets a pair of SDMs; the diagram labels each set with its insertion-policy vector:
  - < 0, P1, P2, P3 > and < 1, P1, P2, P3 >: in the presence of the other apps, should APP0 do LRU or BIP?
  - < P0, 0, P2, P3 > and < P0, 1, P2, P3 >: the same question for APP1
  - < P0, P1, 0, P3 > and < P0, P1, 1, P3 >: the same question for APP2
  - < P0, P1, P2, 0 > and < P0, P1, P2, 1 >: the same question for APP3
  - < P0, P1, P2, P3 >: follower sets
[Figure: high-level and set-level views of the cache]
Slide 13: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
- LRU SDMs for each APP: < 0, P1, P2, P3 >, < P0, 0, P2, P3 >, < P0, P1, 0, P3 >, < P0, P1, P2, 0 >
- BIP SDMs for each APP: < 1, P1, P2, P3 >, < P0, 1, P2, P3 >, < P0, P1, 1, P3 >, < P0, P1, P2, 1 >
- Follower sets: < P0, P1, P2, P3 >
- Per-APP PSEL saturating counters
  - Misses to an app's LRU SDM: PSEL++
  - Misses to an app's BIP SDM: PSEL--
- Follower sets' insertion policy:
  - SDMs of one thread are follower sets of another thread
  - Let Px = MSB(PSELx)
  - Fill decision: < P0, P1, P2, P3 >
- Design parameters:
  - 32 sets per SDM
  - 10-bit PSEL; Pc = MSB(PSELc)
HW required: 10 × T bits + combinational logic (T = number of HW threads)
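The per-thread dueling above can be sketched as follows. This is a simplified illustrative model under my own assumptions (contiguous SDM mapping, PSEL trained only by the owning thread's misses in its own SDMs), not the paper's exact hardware:

```python
PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1

class TADIPSelector:
    """Per-thread set dueling for TADIP with T threads.

    Assumed mapping: sets [2t*sdm, (2t+1)*sdm) are thread t's LRU SDM,
    [(2t+1)*sdm, (2t+2)*sdm) its BIP SDM; all remaining sets are followers.
    """

    def __init__(self, num_threads, sdm_sets=32):
        self.T = num_threads
        self.sdm = sdm_sets
        self.psel = [PSEL_MAX // 2] * num_threads  # one counter per thread

    def owner(self, set_index):
        """Return (thread, 'LRU'|'BIP') if this set is an SDM, else None."""
        block = set_index // self.sdm
        if block < 2 * self.T:
            return block // 2, ("LRU" if block % 2 == 0 else "BIP")
        return None

    def on_miss(self, set_index, thread):
        o = self.owner(set_index)
        if o and o[0] == thread:  # only the owning thread trains its PSEL here
            t, kind = o
            if kind == "LRU":
                self.psel[t] = min(self.psel[t] + 1, PSEL_MAX)
            else:
                self.psel[t] = max(self.psel[t] - 1, 0)

    def fill_policy(self, set_index, thread):
        """Insertion policy when `thread` fills a line into `set_index`."""
        o = self.owner(set_index)
        if o and o[0] == thread:  # inside its own SDM the policy bit is fixed
            return o[1]
        # Everywhere else (including other threads' SDMs, which act as
        # followers for this thread): P_thread = MSB(PSEL_thread).
        return "BIP" if self.psel[thread] >> (PSEL_BITS - 1) else "LRU"
```

Note how the exponential 2^N search collapses to N independent duels: each thread's pair of SDMs fixes only its own bit while every other thread contributes its current follower bit.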
Slide 14: Experimental Setup
- Simulator and benchmarks:
  - CMP$im: a Pin-based multi-core performance simulator
  - 17 representative SPEC CPU2006 benchmarks
- Baseline study:
  - 4-core CMP with in-order cores
  - Three-level non-inclusive cache hierarchy: 32KB L1, 256KB L2, 4MB L3
  - 15 workload mixes of four different SPEC CPU2006 benchmarks
- Scalability study:
  - 2-core, 4-core, 8-core, and 16-core systems
  - 50 workload mixes of 2, 4, 8, and 16 different SPEC CPU2006 benchmarks
Slide 15: soplex + h264ref Sharing a 2MB Cache
[Figure: APKI (accesses per 1000 instructions) and MPKI (misses per 1000 instructions) for soplex and h264ref]
TADIP improves throughput by 27% over LRU and DIP
Slide 16: TADIP Results: Throughput
- No gains from DIP
- TADIP provides more than twice the performance gain of DIP
- TADIP improves performance over LRU by 18%
Slide 17: TADIP Compared to Offline Best Static Policy
- Static Best is almost always better, since it is optimized for best IPC while TADIP optimizes for fewer misses; having TADIP optimize for other metrics such as IPC could reduce the gap
- Where TADIP wins, it is due to phase adaptation
- TADIP is within 85% of the best offline-determined insertion policy
Slide 18: TADIP vs. Utility-Based Cache Partitioning (UCP)
- TADIP outperforms UCP without requiring any cache-partitioning hardware
- Unlike cache-partitioning schemes, TADIP does NOT reserve cache space; instead, TADIP does efficient CACHE MANAGEMENT by changing the insertion policy
Slide 19: TADIP Results: Sensitivity to Cache Size
TADIP provides performance equivalent to doubling the cache size
Slide 20: TADIP Results: Scalability
[Figure: throughput normalized to the baseline system]
TADIP scales to a large number of concurrently executing applications
Slide 21: Summary
- The problem: LRU causes cache thrashing when workloads with differing working sets compete for a shared cache
- Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
  - 1. Provides high performance by minimizing thrashing
    - Up to 94%, 64%, 26%, and 16% performance improvement on 2, 4, 8, and 16-core CMPs
  - 2. Is robust across different workload mixes
    - Does not significantly hurt performance when LRU works well
  - 3. Scales to a large number of competing applications
    - Evaluated up to 16 cores in our study
  - 4. Requires low design overhead
    - Less than 2 bytes of HW required per hardware thread in the system
Slide 22: Q&A
Slide 23: TADIP Results: Weighted Speedup
Slide 24: TADIP Results: Fairness Metric
Slide 25: TADIP in the Presence of Prefetching on a 4-core CMP
Slide 26: Cache Occupancy (16 Cores)
- Changing the fill policy directly controls the amount of cache resources provided to an application
- The figure shows only the fill policy for xalancbmk and sphinx3
- 28% performance improvement
Slide 27: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 2 applications: APP0 and APP1
- Each app gets a pair of SDMs; the diagram labels each set with its insertion-policy vector:
  - < 0, P1 > and < 1, P1 >: in the presence of the other app, should APP0 do LRU or BIP? (trains PSEL0)
  - < P0, 0 > and < P0, 1 >: in the presence of the other app, should APP1 do LRU or BIP? (trains PSEL1)
  - < P0, P1 >: follower sets
- 32 sets per SDM
- 9-bit PSEL
- Pc = MSB(PSELc)
[Figure: high-level and set-level views of the cache]
Slide 28: TADIP Using Set-Dueling Monitors (SDMs)
- Assume a cache shared by 2 applications: APP0 and APP1
- LRU SDMs for each APP: < 0, P1 >, < P0, 0 >
- BIP SDMs for each APP: < 1, P1 >, < P0, 1 >
- Follower sets: < P0, P1 >
- PSEL0, PSEL1: per-APP PSEL counters
  - Misses to an app's LRU SDM: PSEL++
  - Misses to an app's BIP SDM: PSEL--
- Follower sets' insertion policy:
  - SDMs of one thread are follower sets of the other thread
  - Let Px = MSB(PSELx)
  - Fill decision: < P0, P1 >
- 32 sets per SDM
- 9-bit PSEL counter
- Pc = MSB(PSELc)
HW required: 9 × T bits + combinational logic