Title: Cooperative Caching for Chip Multiprocessors
Cooperative Caching for Chip Multiprocessors
- Jichuan Chang, Enric Herrero, Ramon Canal and Gurindar S. Sohi
- HP Labs
- Universitat Politècnica de Catalunya
- University of Wisconsin-Madison
- M. S. Obaidat and S. Misra (Editors), Chapter 13, Cooperative Networking (Wiley)
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Motivation - Background
- Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
- Critical for CMPs
- Processor/memory gap
- Limited pin-bandwidth
- Current designs
- Shared cache
- sharing can lead to contention
- Private caches
- isolation can waste resources
[Figure: 4-core CMP with a narrow, slow path to off-chip memory; a shared cache suffers capacity contention, while private caches waste capacity]
Motivation - Challenges
- Key challenges
- Growing on-chip wire delay
- Expensive off-chip accesses
- Destructive inter-thread interference
- Diverse workload characteristics
- Three important demands for CMP caching
- Capacity: reduce off-chip accesses
- Latency: reduce remote on-chip references
- Isolation: reduce inter-thread interference
- Need to combine the strengths of both private and shared cache designs
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
CMP Cooperative Caching
- Form an aggregate global cache via cooperative private caches
- Use private caches to attract data for fast reuse
- Share capacity through cooperative policies
- Throttle cooperation to find an optimal sharing point
- Inspired by cooperative file/web caches
- Similar latency tradeoff
- Similar algorithms
CMP Cooperative Caching
- Private L2 caches reduce access latency.
- A centralized directory with duplicated tags maintains on-chip coherence.
- Spilling: evicted blocks are forwarded to other caches for a more efficient use of cache space (N-chance forwarding mechanism; see the sketch below).
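A minimal sketch of N-chance spilling in Python; the names (CacheBlock, handle_eviction, insert, write_back) and the recirculation counter are illustrative assumptions rather than the chapter's implementation:

    import random

    class CacheBlock:
        """Minimal bookkeeping for the spilling sketch."""
        def __init__(self, tag, dirty=False):
            self.tag = tag
            self.dirty = dirty
            self.recirculation_count = 0   # times the block has been spilled without reuse

    def handle_eviction(block, local_id, cache_ids, caches, memory, n_chance=1):
        """Forward an evicted block to a randomly chosen peer cache unless it has
        already been spilled n_chance times, in which case it leaves the chip."""
        if block.recirculation_count >= n_chance:
            if block.dirty:
                memory.write_back(block)   # write back dirty data before dropping it
            return None
        block.recirculation_count += 1
        recipient = random.choice([c for c in cache_ids if c != local_id])
        caches[recipient].insert(block)    # peer cache accepts the spilled block
        return recipient

The recipient is chosen at random so that no single cache becomes a hot spot for spilled blocks.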
Distributed Cooperative Caching
- Objective: keep the benefits of Cooperative Caching while improving scalability and energy consumption.
- Distributed directory with a different tag allocation mechanism.
[Figure: CMP organization with per-core L1 and L2 caches connected through an interconnect to distributed coherence engines (DCEs) and, over a bus, to main memory]
Tag Structure Comparison
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Applications of CMP Cooperative Caching
- Several techniques have appeared that take advantage of Cooperative Caching for CMPs.
- For Latency Reduction
- Cooperation Throttling
- For Adaptive Repartitioning
- Elastic Cooperative Caching
- For Performance Isolation
- Cooperative Cache Partitioning
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Policies to Reduce Off-chip Accesses
- Cooperation policies for capacity sharing
- (1) Cache-to-cache transfers of clean data
- (2) Replication-aware replacement
- (3) Global replacement of inactive data
- Implemented by two unified techniques
- Policies enforced by cache replacement/placement
- Information/data exchange supported by modifying
the coherence protocol
Policy (1) - Make use of all on-chip data
- Don't go off-chip if on-chip (clean) data exist (see the sketch after this list)
- Beneficial and practical for CMPs
- Peer cache is much closer than next-level storage
- Affordable implementations of clean ownership
- Important for all workloads
- Multi-threaded: (mostly) read-only shared data
- Single-threaded: spill into peer caches for later reuse
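A minimal lookup-order sketch for Policy (1); directory.sharers(), caches[...].read(), and memory.read() are hypothetical interfaces used only for illustration:

    def read_miss(addr, local_id, caches, directory, memory):
        """On a local cache miss, fetch from a peer cache that holds the block,
        even if the copy is clean; go off-chip only when no on-chip copy exists."""
        sharers = [c for c in directory.sharers(addr) if c != local_id]
        if sharers:
            return caches[sharers[0]].read(addr)   # on-chip cache-to-cache transfer
        return memory.read(addr)                   # off-chip access as a last resort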
Policy (2) - Control replication
- Intuition: increase the amount of unique on-chip data
- Latency/capacity tradeoff
- Evict singlets only when no replicated blocks exist
- Modify the default cache replacement policy
- Spill an evicted singlet into a peer cache
- Can further reduce on-chip replication
- Randomly choose a recipient cache for spilling singlets (see the sketch below)
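A sketch of replication-aware victim selection and singlet spilling; lru_order() (oldest first), is_singlet, and insert() are illustrative assumptions:

    import random

    def choose_victim(cache_set):
        """Prefer to evict a replicated block (another on-chip copy exists);
        fall back to the LRU block only when the set holds no replica."""
        blocks = cache_set.lru_order()
        for block in blocks:
            if not block.is_singlet:
                return block        # evicting a replica loses no unique on-chip data
        return blocks[0]            # all singlets: plain LRU

    def spill_singlet(block, local_id, cache_ids, caches):
        """Spill an evicted singlet into a randomly chosen peer cache."""
        recipient = random.choice([c for c in cache_ids if c != local_id])
        caches[recipient].insert(block)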
Policy (3) - Global cache management
- Approximate global-LRU replacement
- Combine global spill/reuse history with local LRU
- Identify and replace globally inactive data (sketched below)
- First becomes the LRU entry in the local cache
- Set as MRU if spilled into a peer cache
- Later becomes the LRU entry again: evicted globally
- 1-chance forwarding (1-Fwd)
- Blocks can only be spilled once if not reused
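A sketch of the 1-Fwd lifecycle; the spilled_once / reused_since_spill flags and insert_mru() are illustrative assumptions:

    import random

    def evict_globally_or_spill(victim, local_id, cache_ids, caches, memory):
        """Approximate global LRU with 1-chance forwarding: a block reaching the
        local LRU position is spilled once and inserted at the MRU position of a
        peer cache; if it reaches an LRU position again without reuse, it is
        evicted from the chip."""
        if getattr(victim, "spilled_once", False) and not victim.reused_since_spill:
            memory.write_back(victim)            # globally inactive: leave the chip
            return None
        victim.spilled_once = True
        victim.reused_since_spill = False
        recipient = random.choice([c for c in cache_ids if c != local_id])
        caches[recipient].insert_mru(victim)     # set as MRU in the recipient cache
        return recipient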
Cooperation Throttling
- Why throttling?
- Further tradeoff between capacity/latency
- Two probabilities to help make decisions (see the sketch below)
- Cooperation probability: controls replication
- Spill probability: throttles spilling
[Figure: design spectrum from a shared cache to private caches, with cooperative caching (e.g., CC 100%) covering the points in between]
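A minimal sketch of probabilistic throttling; the function names are illustrative:

    import random

    def use_cooperative_replacement(cooperation_probability):
        """With the given probability, apply the replication-aware replacement of
        Policy (2); otherwise fall back to default LRU replacement."""
        return random.random() < cooperation_probability

    def should_spill(spill_probability):
        """With the given probability, spill an evicted singlet to a peer cache;
        otherwise drop it (writing it back if dirty)."""
        return random.random() < spill_probability

Setting both probabilities to 100% approximates the capacity sharing of a shared cache, while setting them to 0% degenerates to plain private caches.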
Performance Evaluation
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
CC for Adaptive Repartitioning
- Tradeoff between minimizing off-chip misses and avoiding inter-thread interference.
- Elastic Cooperative Caching adapts caches according to application requirements.
- High cache requirements -> big local private cache
- Low data reuse -> cache space reassigned to increase the size of the global shared cache for spilled blocks.
Elastic Cooperative Caching structure
- Local shared/private cache with an independent repartitioning unit.
- Distributed Coherence Engines maintain coherence.
- Shared partition: allocates evicted blocks from all private regions.
- Private partition: only the local core can allocate.
- Repartitioning unit: every N cycles, repartitions the cache based on LRU hits in the shared/private partitions.
- Spilled allocator: distributes evicted blocks from the private partition among nodes.
Repartitioning Unit Working Example
- If the hit counter > high threshold (HT), a way is reassigned to the private partition; if the counter < low threshold (LT), a way is reassigned to the shared partition (see the sketch below).
- Independent partitioning per node, distributed structure.
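A sketch of the threshold-based repartitioning decision; the names and the one-way-per-interval step are assumptions made for illustration:

    def repartition(hit_counter, private_ways, total_ways, high_threshold, low_threshold):
        """Every N cycles, compare the local hit counter against two thresholds and
        move one way between the private and shared partitions of this node."""
        if hit_counter > high_threshold and private_ways < total_ways:
            private_ways += 1        # local data is being reused: grow the private partition
        elif hit_counter < low_threshold and private_ways > 0:
            private_ways -= 1        # little local reuse: give a way to the shared partition
        return private_ways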
Spilled Allocator Working Example
- Only data from the private region can be spilled.
- No need for perfectly up-to-date information; the allocator is out of the critical path (see the sketch below).
[Figure: spilled-allocator example showing each node's private and shared ways and the broadcast used to select a recipient]
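An illustrative recipient-selection sketch; weighting recipients by their (possibly stale, periodically broadcast) number of shared ways is an assumption made for the example:

    import random

    def choose_spill_recipient(shared_ways_by_node, local_id):
        """Pick a recipient node for a spilled block, favoring nodes that currently
        expose more shared ways; the counts may be slightly out of date."""
        candidates = [(n, w) for n, w in shared_ways_by_node.items()
                      if n != local_id and w > 0]
        if not candidates:
            return None              # no node is sharing capacity: do not spill
        nodes, weights = zip(*candidates)
        return random.choices(nodes, weights=weights, k=1)[0]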
Performance and energy-efficiency evaluation
[Figure: evaluation results, 24% and 12% improvements over ASR]
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Time-share Based Partitioning
- Throughput-fairness dilemma
- Cooperation: taking turns to speed up
- Multiple time-sharing partitions (MTP)
- QoS guarantee
- Cooperatively shrink/expand across MTPs
- Bound average slowdown over the long term
[Figure: two partition schedules of P1-P4 over time, compared by metrics: IPC 0.52, WS 2.42, QoS -0.52, FS 1.22 versus IPC 0.52, WS 2.42, QoS 0, FS 1.97 (1.22 << 1.97)]
Fairness improvement and the QoS guarantee are reflected by a higher FS and bounded QoS values (the metrics are sketched below).
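A sketch of the metrics used above, assuming the common definitions (per-thread speedup relative to a baseline run, e.g., an equal share of the cache): weighted speedup (WS) is the sum of speedups, fair speedup (FS) is their harmonic mean, and QoS sums only the slowdown terms, so QoS = 0 means no thread is slowed below the baseline:

    def throughput_fairness_metrics(ipc_shared, ipc_baseline):
        """Compute WS, FS, and QoS from per-thread IPCs measured when sharing the
        cache (ipc_shared) and in the baseline configuration (ipc_baseline)."""
        speedups = [s / b for s, b in zip(ipc_shared, ipc_baseline)]
        ws = sum(speedups)                                   # weighted speedup
        fs = len(speedups) / sum(1.0 / x for x in speedups)  # harmonic mean of speedups
        qos = sum(min(0.0, x - 1.0) for x in speedups)       # only slowdowns count
        return ws, fs, qos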
MTP Benefits
- Better than a single spatial partition (SSP)
- MTP with long-term QoS performs almost the same as MTP with no QoS
[Figure: percentage of workloads achieving various fair speedup (FS) values]
- Offline analysis based on profile information, 210 workloads (job mixes)
Better than MTP
- MTP issues
- Not needed if LRU performs better (LRU is often near-optimal; Stone et al., IEEE TOC '92)
- Partitioning is more complex than SSP
- Cooperative Cache Partitioning (CCP)
- Integration with Cooperative Caching (CC)
- Exploits CC's latency and LRU-based sharing benefits
- Simplifies the partitioning algorithm
- Total execution time = Epochs(CC) + Epochs(MTP)
- Weighted by the number of threads benefiting from CC vs. MTP
Partitioning Heuristic
- When is MTP better than CC?
- QoS: Σ speedup > Σ slowdown (over N partitions)
- Speedup should be large
- CC already good at fine-grained tuning
- thrashing_test: Speedup > (N-1) x Slowdown (see the sketch below)
[Figure: normalized throughput vs. allocated cache ways (16-way total, 4-core), marking the baseline and C_shrink allocations]
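A direct transcription of the thrashing test; the parameter names are illustrative:

    def thrashing_test(speedup_expanded, slowdown_shrunk, n_partitions):
        """A thread merits an MTP expansion only if the speedup it gains while
        expanded in one partition outweighs the slowdowns it suffers while shrunk
        in the other N-1 partitions."""
        return speedup_expanded > (n_partitions - 1) * slowdown_shrunk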
Partitioning Algorithm
- S = all threads - supplier threads (e.g., gcc, swim)
- Allocate supplier threads their gPar (guaranteed partition, or the minimum capacity needed for QoS) [Yeh/Reinman, CASES '05]
- For threads in S, initialize their C_expand and C_shrink
- Run thrashing_test iteratively for each thread in S
- If a thread t fails, allocate t its gPar and remove t from S
- Update C_expand and C_shrink for the other threads in S
- Repeat until S is empty or all threads in S pass the test (see the sketch below)
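An illustrative sketch of this loop, reusing thrashing_test from the previous sketch; the thread attributes (is_supplier, gpar, c_expand, c_shrink) and helper methods are assumptions made for the example:

    def partition(threads, n_partitions):
        """Allocate gPar to supplier threads and to any thread that fails the
        thrashing test; threads that pass keep MTP expand/shrink allocations."""
        allocation = {}
        s = []
        for t in threads:
            if t.is_supplier:
                allocation[t.name] = t.gpar          # suppliers get their guaranteed partition
            else:
                t.init_expand_shrink()               # initialize C_expand and C_shrink
                s.append(t)
        while s:
            failed = [t for t in s
                      if not thrashing_test(t.speedup(), t.slowdown(), n_partitions)]
            if not failed:
                break                                # every remaining thread passes the test
            for t in failed:
                allocation[t.name] = t.gpar          # failing threads fall back to gPar
                s.remove(t)
            for t in s:
                t.update_expand_shrink()             # redistribute the freed capacity
        for t in s:
            allocation[t.name] = (t.c_expand, t.c_shrink)
        return allocation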
Fair Speedup Results
- Two groups of workloads
- PAR: MTP better than CC (partitioning helps), 67 out of 210 workloads
- LRU: CC better than MTP (partitioning hurts), 143 out of 210 workloads
[Figure: percentage of workloads achieving various fair speedup values, for the PAR and LRU groups]
Performance and QoS evaluation
Outline
- Motivation
- Cooperative Caching for CMPs
- Applications of CMP Cooperative Caching
- Latency Reduction
- Adaptive Repartitioning
- Performance Isolation
- Conclusions
Conclusions
- The on-chip cache hierarchy plays an important role in CMPs and must provide fast and fair data accesses for multiple, competing processor cores.
- Cooperative Caching for CMPs is an effective framework to manage caches in such an environment.
- Cooperative sharing mechanisms, and the philosophy of using cooperation for conflict resolution, can be applied to many other resource management problems.