Title: Cooperative Caching for Chip Multiprocessors
Cooperative Caching for Chip Multiprocessors
- Jichuan Chang, Enric Herrero, Ramon Canal and Gurindar S. Sohi
- HP Labs
- Universitat Politècnica de Catalunya
- University of Wisconsin-Madison
- M. S. Obaidat and S. Misra (Editors), Chapter 13, Cooperative Networking (Wiley)
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Motivation - Background
- Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
- Critical for CMPs
- Processor/memory gap
- Limited pin-bandwidth
- Current designs
- Shared cache
- sharing can lead to contention
- Private caches
- isolation can waste resources
[Figure: 4-core CMP with a narrow, slow path to off-chip memory; a shared cache suffers capacity contention, while private caches waste capacity]
Motivation - Challenges
- Key challenges
- Growing on-chip wire delay
- Expensive off-chip accesses
- Destructive inter-thread interference
- Diverse workload characteristics
- Three important demands for CMP caching
- Capacity: reduce off-chip accesses
- Latency: reduce remote on-chip references
- Isolation: reduce inter-thread interference
- Need to combine the strengths of both private and shared cache designs
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
CMP Cooperative Caching
- Form an aggregate global cache via cooperative private caches
- Use private caches to attract data for fast reuse
- Share capacity through cooperative policies
- Throttle cooperation to find an optimal sharing point
- Inspired by cooperative file/web caches
- Similar latency tradeoff
- Similar algorithms
CMP Cooperative Caching
- Private L2 caches reduce access latency.
- A centralized directory with duplicated tags maintains on-chip coherence.
- Spilling: evicted blocks are forwarded to other caches for a more efficient use of cache space (N-chance forwarding mechanism; see the sketch below).
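A minimal sketch of N-chance spilling in Python; the names (CacheBlock, handle_eviction, insert, write_back) and the recirculation counter are illustrative assumptions rather than the chapter's implementation:

    import random

    class CacheBlock:
        """Minimal bookkeeping for the spilling sketch."""
        def __init__(self, tag, dirty=False):
            self.tag = tag
            self.dirty = dirty
            self.recirculation_count = 0   # times the block has been spilled without reuse

    def handle_eviction(block, local_id, cache_ids, caches, memory, n_chance=1):
        """Forward an evicted block to a randomly chosen peer cache unless it has
        already been spilled n_chance times, in which case it leaves the chip."""
        if block.recirculation_count >= n_chance:
            if block.dirty:
                memory.write_back(block)   # write back dirty data before dropping it
            return None
        block.recirculation_count += 1
        recipient = random.choice([c for c in cache_ids if c != local_id])
        caches[recipient].insert(block)    # peer cache accepts the spilled block
        return recipient

The recipient is chosen at random so that no single cache becomes a hot spot for spilled blocks.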
Distributed Cooperative Caching
- Objective: keep the benefits of Cooperative Caching while improving scalability and energy consumption.
- Distributed directory with a different tag allocation mechanism.
[Figure: CMP organization with per-core L1 and L2 caches connected through an interconnect to distributed coherence engines (DCEs) and, over a bus, to main memory]
Tag Structure Comparison
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Applications of CMP Cooperative Caching
- Several techniques have appeared that take advantage of Cooperative Caching for CMPs.
- For Latency Reduction
- Cooperation Throttling
- For Adaptive Repartitioning
- Elastic Cooperative Caching
- For Performance Isolation
- Cooperative Cache Partitioning
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Policies to Reduce Off-chip Accesses
- Cooperation policies for capacity sharing
- (1) Cache-to-cache transfers of clean data
- (2) Replication-aware replacement
- (3) Global replacement of inactive data
- Implemented by two unified techniques
- Policies enforced by cache replacement/placement
- Information/data exchange supported by modifying
the coherence protocol
Policy (1) - Make use of all on-chip data
- Don't go off-chip if on-chip (clean) data exist (see the sketch after this list)
- Beneficial and practical for CMPs
- Peer cache is much closer than next-level storage
- Affordable implementations of clean ownership
- Important for all workloads
- Multi-threaded: (mostly) read-only shared data
- Single-threaded: spill into peer caches for later reuse
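A minimal lookup-order sketch for Policy (1); directory.sharers(), caches[...].read(), and memory.read() are hypothetical interfaces used only for illustration:

    def read_miss(addr, local_id, caches, directory, memory):
        """On a local cache miss, fetch from a peer cache that holds the block,
        even if the copy is clean; go off-chip only when no on-chip copy exists."""
        sharers = [c for c in directory.sharers(addr) if c != local_id]
        if sharers:
            return caches[sharers[0]].read(addr)   # on-chip cache-to-cache transfer
        return memory.read(addr)                   # off-chip access as a last resort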
Policy (2) - Control replication
- Intuition: increase the amount of unique on-chip data
- Latency/capacity tradeoff
- Evict singlets only when no replicated blocks exist
- Modify the default cache replacement policy
- Spill an evicted singlet into a peer cache
- Can further reduce on-chip replication
- Randomly choose a recipient cache for spilling singlets (see the sketch below)
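A sketch of replication-aware victim selection and singlet spilling; lru_order() (oldest first), is_singlet, and insert() are illustrative assumptions:

    import random

    def choose_victim(cache_set):
        """Prefer to evict a replicated block (another on-chip copy exists);
        fall back to the LRU block only when the set holds no replica."""
        blocks = cache_set.lru_order()
        for block in blocks:
            if not block.is_singlet:
                return block        # evicting a replica loses no unique on-chip data
        return blocks[0]            # all singlets: plain LRU

    def spill_singlet(block, local_id, cache_ids, caches):
        """Spill an evicted singlet into a randomly chosen peer cache."""
        recipient = random.choice([c for c in cache_ids if c != local_id])
        caches[recipient].insert(block)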
Policy (3) - Global cache management
- Approximate global-LRU replacement
- Combine global spill/reuse history with local LRU
- Identify and replace globally inactive data (sketched below)
- First becomes the LRU entry in the local cache
- Set as MRU if spilled into a peer cache
- Later becomes the LRU entry again: evicted globally
- 1-chance forwarding (1-Fwd)
- Blocks can only be spilled once if not reused
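A sketch of the 1-Fwd lifecycle; the spilled_once / reused_since_spill flags and insert_mru() are illustrative assumptions:

    import random

    def evict_globally_or_spill(victim, local_id, cache_ids, caches, memory):
        """Approximate global LRU with 1-chance forwarding: a block reaching the
        local LRU position is spilled once and inserted at the MRU position of a
        peer cache; if it reaches an LRU position again without reuse, it is
        evicted from the chip."""
        if getattr(victim, "spilled_once", False) and not victim.reused_since_spill:
            memory.write_back(victim)            # globally inactive: leave the chip
            return None
        victim.spilled_once = True
        victim.reused_since_spill = False
        recipient = random.choice([c for c in cache_ids if c != local_id])
        caches[recipient].insert_mru(victim)     # set as MRU in the recipient cache
        return recipient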
Cooperation Throttling
- Why throttling?
- Further tradeoff between capacity/latency
- Two probabilities to help make decisions (see the sketch below)
- Cooperation probability: controls replication
- Spill probability: throttles spilling
[Figure: design spectrum from a shared cache to private caches, with cooperative caching (e.g., CC 100%) covering the points in between]
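A minimal sketch of probabilistic throttling; the function names are illustrative:

    import random

    def use_cooperative_replacement(cooperation_probability):
        """With the given probability, apply the replication-aware replacement of
        Policy (2); otherwise fall back to default LRU replacement."""
        return random.random() < cooperation_probability

    def should_spill(spill_probability):
        """With the given probability, spill an evicted singlet to a peer cache;
        otherwise drop it (writing it back if dirty)."""
        return random.random() < spill_probability

Setting both probabilities to 100% approximates the capacity sharing of a shared cache, while setting them to 0% degenerates to plain private caches.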
Performance Evaluation
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
CC for Adaptive Repartitioning
- Tradeoff between minimizing off-chip misses and avoiding inter-thread interference.
- Elastic Cooperative Caching adapts caches according to application requirements.
- High cache requirements -> big local private cache
- Low data reuse -> cache space reassigned to increase the size of the global shared cache for spilled blocks.
Elastic Cooperative Caching structure
- Local shared/private cache with an independent repartitioning unit.
- Distributed Coherence Engines maintain coherence.
- Shared partition: allocates evicted blocks from all private regions.
- Private partition: only the local core can allocate.
- Repartitioning unit: every N cycles, repartitions the cache based on LRU hits in the shared/private partitions.
- Spilled allocator: distributes evicted blocks from the private partition among nodes.
Repartitioning Unit Working Example
- If the hit counter > high threshold (HT), a way is reassigned to the private partition; if the counter < low threshold (LT), a way is reassigned to the shared partition (see the sketch below).
- Independent partitioning per node, distributed structure.
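A sketch of the threshold-based repartitioning decision; the names and the one-way-per-interval step are assumptions made for illustration:

    def repartition(hit_counter, private_ways, total_ways, high_threshold, low_threshold):
        """Every N cycles, compare the local hit counter against two thresholds and
        move one way between the private and shared partitions of this node."""
        if hit_counter > high_threshold and private_ways < total_ways:
            private_ways += 1        # local data is being reused: grow the private partition
        elif hit_counter < low_threshold and private_ways > 0:
            private_ways -= 1        # little local reuse: give a way to the shared partition
        return private_ways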
Spilled Allocator Working Example
- Only data from the private region can be spilled.
- No need for perfectly up-to-date information; the allocator is out of the critical path (see the sketch below).
[Figure: spilled-allocator example showing each node's private and shared ways and the broadcast used to select a recipient]
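An illustrative recipient-selection sketch; weighting recipients by their (possibly stale, periodically broadcast) number of shared ways is an assumption made for the example:

    import random

    def choose_spill_recipient(shared_ways_by_node, local_id):
        """Pick a recipient node for a spilled block, favoring nodes that currently
        expose more shared ways; the counts may be slightly out of date."""
        candidates = [(n, w) for n, w in shared_ways_by_node.items()
                      if n != local_id and w > 0]
        if not candidates:
            return None              # no node is sharing capacity: do not spill
        nodes, weights = zip(*candidates)
        return random.choices(nodes, weights=weights, k=1)[0]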
Performance and energy-efficiency evaluation
[Figure: evaluation results, 24% and 12% improvements over ASR]
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Time-share Based Partitioning
- Throughput-fairness dilemma
- Cooperation: taking turns to speed up
- Multiple time-sharing partitions (MTP)
- QoS guarantee
- Cooperatively shrink/expand across MTPs
- Bound average slowdown over the long term
[Figure: two partition schedules of P1-P4 over time, compared by metrics: IPC 0.52, WS 2.42, QoS -0.52, FS 1.22 versus IPC 0.52, WS 2.42, QoS 0, FS 1.97 (1.22 << 1.97)]
Fairness improvement and the QoS guarantee are reflected by a higher FS and bounded QoS values (the metrics are sketched below).
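A sketch of the metrics used above, assuming the common definitions (per-thread speedup relative to a baseline run, e.g., an equal share of the cache): weighted speedup (WS) is the sum of speedups, fair speedup (FS) is their harmonic mean, and QoS sums only the slowdown terms, so QoS = 0 means no thread is slowed below the baseline:

    def throughput_fairness_metrics(ipc_shared, ipc_baseline):
        """Compute WS, FS, and QoS from per-thread IPCs measured when sharing the
        cache (ipc_shared) and in the baseline configuration (ipc_baseline)."""
        speedups = [s / b for s, b in zip(ipc_shared, ipc_baseline)]
        ws = sum(speedups)                                   # weighted speedup
        fs = len(speedups) / sum(1.0 / x for x in speedups)  # harmonic mean of speedups
        qos = sum(min(0.0, x - 1.0) for x in speedups)       # only slowdowns count
        return ws, fs, qos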
MTP Benefits
- Better than a single spatial partition (SSP)
- MTP with long-term QoS performs almost the same as MTP with no QoS
[Figure: percentage of workloads achieving various fair speedup (FS) values]
- Offline analysis based on profile information, 210 workloads (job mixes)
Better than MTP
- MTP issues
- Not needed if LRU performs better (LRU is often near-optimal; Stone et al., IEEE TOC '92)
- Partitioning is more complex than SSP
- Cooperative Cache Partitioning (CCP)
- Integration with Cooperative Caching (CC)
- Exploits CC's latency and LRU-based sharing benefits
- Simplifies the partitioning algorithm
- Total execution time = Epochs(CC) + Epochs(MTP)
- Weighted by the number of threads benefiting from CC vs. MTP
Partitioning Heuristic
- When is MTP better than CC?
- QoS: Σ speedup > Σ slowdown (over N partitions)
- Speedup should be large
- CC already good at fine-grained tuning
- thrashing_test: Speedup > (N-1) x Slowdown (see the sketch below)
[Figure: normalized throughput vs. allocated cache ways (16-way total, 4-core), marking the baseline and C_shrink allocations]
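A direct transcription of the thrashing test; the parameter names are illustrative:

    def thrashing_test(speedup_expanded, slowdown_shrunk, n_partitions):
        """A thread merits an MTP expansion only if the speedup it gains while
        expanded in one partition outweighs the slowdowns it suffers while shrunk
        in the other N-1 partitions."""
        return speedup_expanded > (n_partitions - 1) * slowdown_shrunk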
Partitioning Algorithm
- S = all threads - supplier threads (e.g., gcc, swim)
- Allocate supplier threads their gPar (guaranteed partition, or the minimum capacity needed for QoS) [Yeh/Reinman, CASES '05]
- For threads in S, initialize their C_expand and C_shrink
- Run thrashing_test iteratively for each thread in S
- If a thread t fails, allocate t its gPar and remove t from S
- Update C_expand and C_shrink for the other threads in S
- Repeat until S is empty or all threads in S pass the test (see the sketch below)
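An illustrative sketch of this loop, reusing thrashing_test from the previous sketch; the thread attributes (is_supplier, gpar, c_expand, c_shrink) and helper methods are assumptions made for the example:

    def partition(threads, n_partitions):
        """Allocate gPar to supplier threads and to any thread that fails the
        thrashing test; threads that pass keep MTP expand/shrink allocations."""
        allocation = {}
        s = []
        for t in threads:
            if t.is_supplier:
                allocation[t.name] = t.gpar          # suppliers get their guaranteed partition
            else:
                t.init_expand_shrink()               # initialize C_expand and C_shrink
                s.append(t)
        while s:
            failed = [t for t in s
                      if not thrashing_test(t.speedup(), t.slowdown(), n_partitions)]
            if not failed:
                break                                # every remaining thread passes the test
            for t in failed:
                allocation[t.name] = t.gpar          # failing threads fall back to gPar
                s.remove(t)
            for t in s:
                t.update_expand_shrink()             # redistribute the freed capacity
        for t in s:
            allocation[t.name] = (t.c_expand, t.c_shrink)
        return allocation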
Fair Speedup Results
- Two groups of workloads
- PAR: MTP better than CC (partitioning helps), 67 out of 210 workloads
- LRU: CC better than MTP (partitioning hurts), 143 out of 210 workloads
[Figure: percentage of workloads achieving various fair speedup values, for the PAR and LRU groups]
Performance and QoS evaluation
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Conclusions
- The on-chip cache hierarchy plays an important role in CMPs and must provide fast and fair data accesses for multiple, competing processor cores.
- Cooperative Caching for CMPs is an effective framework to manage caches in such an environment.
- Cooperative sharing mechanisms, and the philosophy of using cooperation for conflict resolution, can be applied to many other resource management problems.