Title: ASR: Adaptive Selective Replication for CMP Caches
1. ASR: Adaptive Selective Replication for CMP Caches
- Brad Beckmann, Mike Marty, and David Wood
- Multifacet Project
- University of Wisconsin-Madison
- 12/13/06
(currently at Microsoft)
2. Introduction: Shared Cache
[Figure: 8-core CMP; CPUs 0-7 each have private L1 I and D caches and share eight L2 cache banks]
3. Introduction: Private Caches
[Figure: 8-core CMP; CPUs 0-7 each have private L1 I and D caches backed by a private L2 cache]
Desire both: fast access and high capacity
4. Introduction
- Previous hybrid proposals
  - Victim Replication, CMP-NuRapid, Cooperative Caching
  - Achieve fast access and high capacity under certain workloads and system configurations
  - Utilize static rules
  - Non-adaptive
- Adaptive Selective Replication (ASR)
  - Dynamically monitors workload behavior
  - Adapts the L2 cache to workload demand
  - Up to 12% improvement vs. previous proposals
5. Outline
- Introduction
- Understanding L2 Replication
  - Benefit
  - Cost
  - Key observation
  - Solution
- ASR: Adaptive Selective Replication
- Evaluation
6. Understanding L2 Replication
- Three L2 block sharing types
  - Single requestor: all requests by a single processor
  - Shared read-only: read-only requests by multiple processors
  - Shared read-write: read and write requests by multiple processors
- Profile L2 blocks during their on-chip lifetime
  - 8-processor CMP
  - 16 MB shared L2 cache
  - 64-byte block size
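The three sharing types above can be expressed as a small classifier. This is a hypothetical sketch of the profiling idea, not the authors' tool; the access-trace format ((cpu, is_write) pairs) and the function name are my own assumptions.

```python
# Hypothetical sketch of the sharing-type classification: given the
# accesses an L2 block received during its on-chip lifetime as
# (cpu_id, is_write) pairs, bucket it into one of the three types.
def classify_block(accesses):
    cpus = {cpu for cpu, _ in accesses}
    any_write = any(is_write for _, is_write in accesses)
    if len(cpus) == 1:
        return "single requestor"    # all requests from one processor
    if any_write:
        return "shared read-write"   # reads and writes from multiple CPUs
    return "shared read-only"        # read-only requests from multiple CPUs

print(classify_block([(0, False), (0, True)]))   # single requestor
print(classify_block([(0, False), (3, False)]))  # shared read-only
print(classify_block([(0, False), (3, True)]))   # shared read-write
```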
7. Understanding L2 Replication
[Chart: Apache L2 block breakdown by sharing type: shared read-only, shared read-write, single requestor]
8. Understanding L2 Replication: Benefit
[Graph: L2 hit cycles vs. replication capacity]
9. Understanding L2 Replication: Cost
[Graph: L2 miss cycles vs. replication capacity]
10. Understanding L2 Replication: Key Observation
11. Understanding L2 Replication: Solution
- Property of workload-cache interaction
- Not fixed → must adapt
[Graph: total cycles vs. replication capacity]
12. Outline
- Introduction
- Understanding L2 Replication
- ASR: Adaptive Selective Replication
  - SPR: Selective Probabilistic Replication
  - Monitoring and adapting to workload behavior
- Evaluation
13. SPR: Selective Probabilistic Replication
- Mechanism for selective replication
- Relax L2 inclusion property
  - L2 evictions do not force L1 evictions
  - Non-exclusive cache hierarchy
- Ring writebacks
  - L1 writebacks passed clockwise between private L2 caches
  - Merge with other existing L2 copies
- Probabilistically choose between
  - Local writeback → allow replication
  - Ring writeback → disallow replication
- Replicates frequently requested blocks
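The probabilistic choice can be sketched as a biased coin flip per L1 writeback. The per-level probabilities come from slide 15; the function name and RNG plumbing are illustrative assumptions, not the authors' implementation.

```python
import random

# Replication probability for each of SPR's six levels (slide 15).
REPL_PROB = {0: 0.0, 1: 1 / 64, 2: 1 / 16, 3: 1 / 4, 4: 1 / 2, 5: 1.0}

def l1_writeback_action(level, rng=random.random):
    """On an L1 writeback, flip a biased coin.

    'local' -> write back into this core's private L2, creating a replica.
    'ring'  -> pass the block clockwise around the ring of private L2s,
               merging with an existing copy (no replication).
    """
    return "local" if rng() < REPL_PROB[level] else "ring"

print(l1_writeback_action(0))  # ring  (level 0 never replicates)
print(l1_writeback_action(5))  # local (level 5 always replicates)
```

Frequently requested blocks are written back more often, so they get more coin flips and are the most likely to end up replicated, which is the bias SPR wants.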
14. SPR: Selective Probabilistic Replication
[Figure: 8-core CMP with private L2 caches connected in a ring; L1 writebacks travel clockwise between the private L2s]
15. SPR: Selective Probabilistic Replication

Replication Level:    0    1     2     3    4    5
Prob. of Replication: 0    1/64  1/16  1/4  1/2  1

[Figure: replication levels 0-5 along the replication capacity axis, with the current level marked]
16. Monitoring and Adapting to Workload Behavior
[Graph: replication benefit curve; L2 hit cycles vs. replication capacity]
- Decrease in replication benefit
  - Bit marks replicas of the current, but not lower, level
- Increase in replication benefit
  - Store 8-bit partial tags of next-higher-level replications
17. Monitoring and Adapting to Workload Behavior
[Graph: replication cost curve; L2 miss cycles vs. replication capacity]
- 3. Decrease in replication cost
  - Stores 16-bit partial tags of recently evicted blocks
- 4. Increase in replication cost
  - Way and set counters track soon-to-be-evicted blocks
18. Outline
- Introduction
- Understanding L2 Replication
- ASR: Adaptive Selective Replication
- Evaluation
19. Methodology
- Full-system simulation
  - Simics
  - Wisconsin's GEMS timing simulator
    - Out-of-order processor
    - Memory system
- Workloads
  - Commercial: apache, jbb, oltp, zeus
  - Scientific (see paper)
    - SPEC OMP: apsi, art
    - SPLASH: barnes, ocean
20. System Parameters
8-core CMP, 45 nm technology

Memory System:
- L1 I & D caches: 64 KB, 4-way, 3 cycles
- Unified L2 cache: 16 MB, 16-way
- L1/L2 prefetching: unit and non-unit strided prefetcher (similar to Power4)
- Memory latency: 500 cycles
- Memory bandwidth: 50 GB/s
- Memory size: 4 GB of DRAM
- Outstanding memory requests per CPU: 16

Dynamically Scheduled Processor:
- Clock frequency: 5.0 GHz
- Reorder buffer / scheduler: 128 / 64 entries
- Pipeline width: 4-wide fetch and issue
- Pipeline stages: 30
- Direct branch predictor: 3.5 KB YAGS
- Return address stack: 64 entries
- Indirect branch predictor: 256 entries (cascaded)
21. Replication Benefit, Cost, Effectiveness Curves
[Graphs: replication benefit and cost curves]
22. Replication Benefit, Cost, Effectiveness Curves
[Graph: replication effectiveness curve]
23. Comparison of Replication Policies
- SPR → multiple possible policies
- Evaluated 4 shared read-only replication policies
  - VR: Victim Replication
    - Previously proposed [Zhang, ISCA '05]
    - Disallow replicas to evict shared owner blocks
  - NR: CMP-NuRapid
    - Previously proposed [Chishti, ISCA '05]
    - Replicate upon the second request
  - CC: Cooperative Caching
    - Previously proposed [Chang, ISCA '06]
    - Replace replicas first
    - Spill singlets to remote caches
    - Tunable parameter: 100%, 70%, 30%, 0%
  - ASR: Adaptive Selective Replication
    - Our proposal
    - Monitor and adjust to workload demand
24. ASR Performance
[Performance chart]
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
25. Conclusions
- CMP cache replication
  - No replication → conserves capacity
  - All replication → reduces on-chip latency
- Previous hybrid proposals
  - Work well for certain criteria
  - Non-adaptive
- Adaptive Selective Replication
  - Probabilistic policy favors frequently requested blocks
  - Dynamically monitors replication benefit and cost
  - Replicates when benefit > cost
  - Improves performance up to 12% vs. previous schemes
26. Backup Slides
27. ASR Memory Cycles
[Memory cycles chart]
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
28. L2 Cache Requests Breakdown
29. L2 Cache Requests Breakdown: User vs. OS
30. Shared Read-write Requests Breakdown
31. Shared Read-write Block Breakdown
32. ASR: Decrease-in-Replication Benefit
[Graph: L2 hit cycles vs. replication capacity]
33. ASR: Decrease-in-Replication Benefit
- Goal
  - Determine the replication benefit decrease of the next lower level
- Mechanism
  - Current Replica Bit
    - Per L2 cache block
    - Set for replications of the current level
    - Not set for replications of a lower level
  - Current-replica hits would be remote hits with the next lower level
- Overhead
  - 1 bit × 256K L2 blocks = 32 KB
34. ASR: Increase-in-Replication Benefit
[Graph: L2 hit cycles vs. replication capacity]
35. ASR: Increase-in-Replication Benefit
- Goal
  - Determine the replication benefit increase of the next higher level
- Mechanism
  - Next Level Hit Buffers (NLHBs)
    - 8-bit partial tag buffer
    - Store replicas of the next higher level
  - NLHB hits would be local L2 hits with the next higher level
- Overhead
  - 8 bits × 16K entries × 8 processors = 128 KB
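One way to picture an NLHB is as a direct-mapped buffer of 8-bit partial tags. Only the sizes (8-bit tags, 16K entries) come from the slide; the indexing scheme, class, and method names below are my own assumptions for illustration.

```python
class NextLevelHitBuffer:
    """Sketch of an NLHB: records partial tags of blocks that the next
    higher replication level would have replicated locally, so that later
    local misses matching the buffer can be counted as potential benefit."""

    def __init__(self, entries=16 * 1024):
        self.entries = entries
        self.tags = [None] * entries   # one 8-bit partial tag per entry

    def _slot(self, block_addr):
        index = block_addr % self.entries
        partial_tag = (block_addr // self.entries) & 0xFF  # keep 8 tag bits
        return index, partial_tag

    def allocate(self, block_addr):
        index, partial_tag = self._slot(block_addr)
        self.tags[index] = partial_tag

    def would_hit(self, block_addr):
        index, partial_tag = self._slot(block_addr)
        return self.tags[index] == partial_tag
```

Partial tags trade accuracy for storage: two addresses can share an 8-bit tag, so the resulting counts are estimates, which is acceptable because they only steer a heuristic.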
36. ASR: Decrease-in-Replication Cost
[Graph: L2 miss cycles vs. replication capacity]
37. ASR: Decrease-in-Replication Cost
- Goal
  - Determine the replication cost decrease of the next lower level
- Mechanism
  - Victim Tag Buffers (VTBs)
    - 16-bit partial tags
    - Store recently evicted blocks of the current replication level
  - VTB hits would be on-chip hits with the next lower level
- Overhead
  - 16 bits × 1K entries × 8 processors = 16 KB
38. ASR: Increase-in-Replication Cost
[Graph: L2 miss cycles vs. replication capacity]
39. ASR: Increase-in-Replication Cost
- Goal
  - Determine the replication cost increase of the next higher level
- Mechanism
  - Way and set counters [Suh et al., HPCA 2002]
    - Identify soon-to-be-evicted blocks
    - 16-way pseudo-LRU
    - 256 set groups
  - Count on-chip hits that would be off-chip with the next higher level
- Overhead
  - 255-bit pseudo-LRU tree × 8 processors = 255 B
- Overall storage overhead: 212 KB, or 1.2% of total storage
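The per-structure storage arithmetic on slides 33-39 can be checked quickly (the helper name is mine). Note the itemized structures sum to about 176 KB; the quoted 212 KB overall presumably also counts state not itemized with a size here, such as the set counters.

```python
def kib(bits):
    """Convert a bit count to binary kilobytes."""
    return bits / 8 / 1024

# Current replica bits: 1 bit per L2 block (slide 33).
replica_bits_kb = kib(1 * 256 * 1024)
# NLHBs: 8-bit partial tags, 16K entries, 8 processors (slide 35).
nlhb_kb = kib(8 * 16 * 1024 * 8)
# VTBs: 16-bit partial tags, 1K entries, 8 processors (slide 37).
vtb_kb = kib(16 * 1024 * 8)
# Pseudo-LRU trees: 255 bits per processor, 8 processors (slide 39).
plru_bytes = 255 * 8 / 8

print(replica_bits_kb, nlhb_kb, vtb_kb, plru_bytes)  # 32.0 128.0 16.0 255.0
```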
40. ASR: Triggering a Cost-Benefit Analysis
- Goal
  - Dynamically adapt to workload behavior
  - Avoid unnecessary replication level changes
- Mechanism
  - Evaluation trigger
    - Local replications or NLHB allocations exceed 1K
  - Replication change
    - Four consecutive evaluations in the same direction
41. ASR: Adaptive Algorithm

|                                   | Cost decrease > benefit increase    | Cost decrease < benefit increase |
| Benefit decrease > cost increase  | Go in direction with greater value  | Increase replication             |
| Benefit decrease < cost increase  | Decrease replication                | Do nothing                       |

(Rows compare the decrease in replication benefit against the increase in replication cost; columns compare the decrease in replication cost against the increase in replication benefit.)
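The decision table plus slide 40's four-in-a-row rule can be sketched as a small controller. The tie-break in the "go in direction with greater value" cell (comparing net cycle gains) and all names are my interpretation; the slides do not spell them out.

```python
def decide(dec_benefit, inc_benefit, dec_cost, inc_cost):
    """One cost-benefit evaluation following the slide-41 table."""
    if dec_benefit > inc_cost:            # top row of the table
        if dec_cost > inc_benefit:        # left column
            # "Go in direction with greater value": interpreted here as
            # the larger net gain in cycles (an assumption).
            down_gain = dec_cost - dec_benefit
            up_gain = inc_benefit - inc_cost
            return "decrease" if down_gain > up_gain else "increase"
        return "increase"
    if dec_cost > inc_benefit:
        return "decrease"
    return "do nothing"

class ReplicationLevelController:
    """Slide 40: change level only after four consecutive evaluations agree."""

    def __init__(self, level=2, levels=6, threshold=4):
        self.level = level
        self.levels = levels
        self.threshold = threshold
        self.last_move, self.streak = None, 0

    def evaluate(self, dec_benefit, inc_benefit, dec_cost, inc_cost):
        move = decide(dec_benefit, inc_benefit, dec_cost, inc_cost)
        self.streak = self.streak + 1 if move == self.last_move else 1
        self.last_move = move
        if self.streak >= self.threshold:
            if move == "increase" and self.level < self.levels - 1:
                self.level += 1
            elif move == "decrease" and self.level > 0:
                self.level -= 1
            self.last_move, self.streak = None, 0  # restart after a change
        return move
```

The streak counter is the hysteresis that keeps a single noisy evaluation from bouncing the replication level up and down.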
42. ASR: Adapting to Workload Behavior
[Chart: oltp, all CPUs]
43. ASR: Adapting to Workload Behavior
[Chart: apache, all CPUs]
44. ASR: Adapting to Workload Behavior
[Chart: apache, CPU 0]
45. ASR: Adapting to Workload Behavior
[Chart: apache, CPUs 1-7]
46. Replication Capacity
47. Replication Capacity (4 MB L2, 150-cycle memory latency, in-order processors)
48. Replication Benefit, Cost, Effectiveness Curves (4 MB L2, 150-cycle memory latency, in-order processors)
[Graphs: benefit and cost]
49. Replication Benefit, Cost, Effectiveness Curves (4 MB L2, 150-cycle memory latency, in-order processors)
[Graph: effectiveness]
50. Replication Benefit, Cost, Effectiveness Curves (16 MB L2, 500-cycle memory latency, in-order processors)
[Graphs: benefit and cost]
51. Replication Benefit, Cost, Effectiveness Curves (16 MB L2, 500-cycle memory latency, in-order processors)
[Graph: effectiveness]
52. Replication Analytic Model
- Utilizes workload characterization data
- Goal: intuition, not accuracy
- Optimal point of replication
  - Sensitive to cache size
  - Sensitive to memory latency
53. Replication Model: Selective Replication
54. ASR Memory Cycles (4 MB L2, 150-cycle memory latency, in-order processors)
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
55. ASR Performance (4 MB L2, 150-cycle memory latency, in-order processors)
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
56. ASR Memory Cycles (16 MB L2, 250-cycle memory latency, out-of-order processors)
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
57. ASR Performance (16 MB L2, 250-cycle memory latency, out-of-order processors)
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
58. ASR Memory Cycles (16 MB L2, 500-cycle memory latency, out-of-order processors)
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
59. ASR Performance (16 MB L2, 500-cycle memory latency, out-of-order processors)
- Legend: S = CMP-Shared, P = CMP-Private, V = SPR-VR, N = SPR-NR, C = SPR-CC, A = SPR-ASR
60. Token Coherence
- Proposed for SMPs [Martin '03] and CMPs [Marty '05]
- Provides a simple correctness substrate
  - One token to read
  - All tokens to write
- Advantages
  - Permits a broadcast protocol on an unordered network without acknowledgement messages
  - Supports multiple allocation policies
- Disadvantages
  - All blocks must be written back (cannot destroy tokens)
  - Token counts at memory
  - Persistent requests can be a performance bottleneck