Title: Two Ways to Exploit Multi-Megabyte Caches
1 Two Ways to Exploit Multi-Megabyte Caches
- AENAO Research Group, University of Toronto
- Kaveh Aasaraai
- Ioana Burcea
- Myrto Papadopoulou
- Elham Safi
- Jason Zebchuk
- Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
2 Future Caches: Just Larger?
[Diagram: CPU with split I/D L1 caches, an interconnect, a cache of 10s-100s of MB, and main memory]
- Big Picture Management
- Store Metadata
3 Conventional Block-Centric Cache
Fine-Grain View of Memory
[Diagram: individual memory blocks mapped into the L2 cache]
- Small Blocks
- Optimizes Bandwidth and Performance
- Large L2/L3 caches especially
Big Picture Lost
4 Big Picture View
Coarse-Grain View of Memory
[Diagram: memory regions mapped into the L2 cache]
- Region: a 2^n-sized, aligned area of memory
- Patterns and behavior exposed
- Spatial locality
- Exploit for performance/area/power
5 Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations: RegionScout, Stealth Prefetching, Circuit-Switched Coherence, Destination-Set Prediction, Coarse-Grain Coherence Tracking, Spatial Memory Streaming, Run-time Adaptive Cache Hierarchy Management via Reference Analysis
- Each adds new structures to track coarse-grain information
- Hard to justify for a commercial design
- Coarse-grain framework: embed coarse-grain information in the tag array
- Support many different optimizations with less area overhead
- An adaptable optimization FRAMEWORK
6 RegionTracker Solution
[Diagram: an L2 cache whose tag array is replaced by a RegionTracker next to the data array; the L1 caches issue block requests and region probes, and receive data blocks and region responses]
- Manage blocks, but also track and manage regions
7 RegionTracker Summary
- Replace conventional tag array
- 4-core CMP with 8MB shared L2 cache
- Within 1% of original performance
- Up to 20% less tag area
- Average 33% less energy consumption
- Optimization Framework
- Stealth Prefetching: same performance, 36% less area
- RegionScout: 2x more snoops avoided, no area overhead
8 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
9 Goals
- Conventional Tag Array Functionality
- Identify data block location and state
- Leave the data array unchanged
- Optimization Framework Functionality (sketched after this list)
- Is Region X cached?
- Which blocks of Region X are cached? Where?
- Evict or migrate Region X
- Easy to assign properties to each Region
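A minimal Python sketch of the framework-style queries listed above, assuming a hypothetical RegionTracker-like interface; the class and method names (CoarseGrainFramework, is_region_cached, ...) and the per-region property store are illustrative, not the authors' API.

# Hypothetical sketch of the optimization-framework queries listed above.
# 1KB regions (region_bits=10) and 64-byte blocks are assumed, as in the talk.
class CoarseGrainFramework:
    def __init__(self, region_bits=10):
        self.region_bits = region_bits
        self.regions = {}        # region number -> {block index in region: (way, state)}
        self.properties = {}     # region number -> optimization-specific metadata

    def region_of(self, addr):
        return addr >> self.region_bits

    def insert_block(self, addr, way, state='S'):
        block = (addr >> 6) & ((1 << (self.region_bits - 6)) - 1)   # block index within region
        self.regions.setdefault(self.region_of(addr), {})[block] = (way, state)

    def is_region_cached(self, addr):
        # "Is Region X cached?"
        return self.region_of(addr) in self.regions

    def cached_blocks(self, addr):
        # "Which blocks of Region X are cached? Where?"
        return dict(self.regions.get(self.region_of(addr), {}))

    def evict_region(self, addr):
        # "Evict or migrate Region X"
        region = self.region_of(addr)
        self.properties.pop(region, None)
        return self.regions.pop(region, {})

    def set_property(self, addr, key, value):
        # "Easy to assign properties to each Region", e.g. key='non-shared'
        self.properties.setdefault(self.region_of(addr), {})[key] = value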
10 Coarse-Grain Cache Designs
Large Block Size
[Diagram: tag array and data array, with Region X stored as one large block]
- Increased bandwidth, decreased hit rates
11 Sector Cache
[Diagram: tag array and data array, Region X stored as a sector]
12 Sector Pool Cache
[Diagram: tag array and data array, Region X]
- High associativity (2-4x)
13 Decoupled Sector Cache
[Diagram: tag array, data array, and status table, Region X]
- Region information not exposed
- Region replacement requires scanning multiple entries
14 Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
- (i.e., No scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array envelope
15 RegionTracker: A Tag Array Replacement
[Diagram: the L1 caches connect to an L2 whose data array is managed by three structures: the Region Vector Array, the Block Status Table, and the Evicted Region Buffer]
- 3 SRAM arrays, combined smaller than tag array
16 Basic Structures
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
[Diagram: a Region Vector Array (RVA) entry holds a region tag and status, plus per-block fields (valid bit and way pointer) for block0..block15; the Block Status Table (BST) holds per-block status]
- An address selects a specific RVA set and BST set
- One RVA entry maps to multiple, consecutive BST sets
- One BST entry maps to one of four RVA sets
17 Common Case Hit
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
[Diagram: the address splits into Region Tag (bits 49-21), RVA Index (bits 20-10), Region Offset (bits 9-6), and Block Offset (bits 5-0); bits 19-6 also form the data array / BST index. The RVA set is read, the region tag is matched, and the selected block's valid bit and way pointer steer the access to the data array (sketched below).]
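A minimal Python sketch of the common-case hit the slide illustrates, assuming the example geometry (8MB, 16-way, 64-byte blocks, 1KB regions) and the bit fields shown; RVAEntry, lookup_block, and the per-block field layout are illustrative, not the authors' RTL.

# Hypothetical sketch of a RegionTracker common-case hit lookup.
BLOCK_OFFSET_BITS  = 6    # 64-byte blocks
REGION_OFFSET_BITS = 4    # 1KB region / 64B blocks = 16 blocks per region
RVA_INDEX_BITS     = 11   # address bits 20..10, per the slide

class RVAEntry:
    def __init__(self):
        self.valid = False
        self.region_tag = 0
        # Per-block metadata for block0..block15: (valid, state, way pointer).
        self.blocks = [(False, 'I', 0) for _ in range(16)]

def lookup_block(rva_sets, address):
    """Return (hit, way) for the block at `address`; (False, None) otherwise."""
    block_in_region = (address >> BLOCK_OFFSET_BITS) & ((1 << REGION_OFFSET_BITS) - 1)
    rva_index = (address >> (BLOCK_OFFSET_BITS + REGION_OFFSET_BITS)) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = address >> (BLOCK_OFFSET_BITS + REGION_OFFSET_BITS + RVA_INDEX_BITS)
    for entry in rva_sets[rva_index]:              # associative search within one RVA set
        if entry.valid and entry.region_tag == region_tag:
            blk_valid, state, way = entry.blocks[block_in_region]
            if blk_valid:
                return True, way                   # the way pointer steers the data array access
            return False, None                     # region present, block missing
    return False, None                             # region miss: handled via the ERB (next slide)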
18 Worst Case (Rare) Region Miss
[Diagram: the address is decomposed as on the previous slide, but no RVA entry matches the region tag ("No Match!"); the Evicted Region Buffer (ERB), with a pointer back to the RVA, handles the displaced region (sketched below).]
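Continuing the sketch above, a hedged outline of the region-miss path: it assumes, consistent with the Ptr field shown and the "no multiple block evictions" requirement on slide 14, that the ERB simply parks the victim's RVA entry so its blocks can drain lazily. Names are illustrative.

# Hypothetical sketch of the rare region-miss path (uses RVAEntry from the sketch above).
def handle_region_miss(rva_sets, erb, rva_index, region_tag, choose_victim):
    """Allocate an RVA entry for a new region; park the displaced region in the ERB."""
    rva_set = rva_sets[rva_index]
    victim_slot = choose_victim(rva_set)       # e.g. index of the LRU entry in this set
    victim = rva_set[victim_slot]
    if victim.valid:
        # The ERB entry's pointer remembers which RVA set the victim came from,
        # while its blocks are evicted from the BST/data array in the background.
        erb.append((rva_index, victim))
    fresh = RVAEntry()
    fresh.valid = True
    fresh.region_tag = region_tag
    rva_set[victim_slot] = fresh
    return fresh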
19 Methodology
[Diagram: 4-core CMP; each core has split I/D L1 caches and connects through an interconnect to a shared L2]
- Flexus simulator from the CMU SimFlex group
- Based on the Simics full-system simulator
- 4-core CMP modeled after Piranha
- Private 32KB, 4-way set-associative L1 caches
- Shared 8MB, 16-way set-associative L2 cache
- 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and energy: timing simulation using the SMARTS sampling methodology
- Area and power: full-custom implementation in a 130nm commercial technology
- 9 commercial workloads
- WEB: SpecWEB on Apache and Zeus
- OLTP: TPC-C on DB2 and Oracle
- DSS: 5 TPC-H queries on DB2
20 Miss-Rates vs. Area
[Chart: relative miss-rate vs. relative tag array area, lower is better on both axes; the sector cache sits at (0.25, 1.26), with other points for 14-way, 15-way, 48-way, and 52-way configurations]
- Sector cache: 512KB sectors; SPC and RT: 1KB regions
- Trade-offs comparable to a conventional cache
21 Performance and Energy
[Charts: normalized execution time and reduction in tag energy; lower execution time and higher energy reduction are better]
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence intervals
- Performance within 1%, with 33% tag energy reduction
22 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
23 RegionTracker: An Optimization Framework
- Stealth Prefetching: average 20% performance improvement; drop-in RegionTracker for 36% less area overhead
- RegionScout: in-depth analysis
24 Snoop Coherence: Common Case
[Diagram: one CPU reads x; the snoop broadcast (Read x1 ... Read xn) misses in every other CPU's cache and the request is finally served by main memory]
- Many snoops are to non-shared regions
25 RegionScout
[Diagram: each CPU tracks its locally cached regions and a table of non-shared regions; when a read to x would miss in every other cache, the requester observes a global region miss and goes straight to main memory]
- Eliminate broadcasts for non-shared regions (sketched below)
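A simplified Python sketch of the filtering the slide depicts: plain sets stand in for RegionScout's locally-cached-region summary and non-shared-region table, and the names (Node, read_miss, ...) are illustrative.

# Hypothetical sketch of RegionScout-style snoop filtering.
REGION_BITS = 10                      # 1KB regions, as in the evaluation

def region_of(address):
    return address >> REGION_BITS

class Node:
    def __init__(self):
        self.cached_regions = set()   # summary of regions with any block cached locally
        self.non_shared = set()       # regions known to be cached by no other node

    def remote_region_miss(self, region):
        # Snoop-side signal: no block of this region is cached here.
        return region not in self.cached_regions

def read_miss(requester, nodes, address):
    region = region_of(address)
    if region in requester.non_shared:
        return "fetch from memory; snoop broadcast avoided"
    # Otherwise broadcast as usual; the responses also carry region-miss signals.
    others = [n for n in nodes if n is not requester]
    if all(n.remote_region_miss(region) for n in others):
        requester.non_shared.add(region)   # global region miss: skip future broadcasts
    return "snoop broadcast sent"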
26 RegionTracker Implementation
- Minimal overhead to support the RegionScout optimization
- Still uses less area than a conventional tag array
27 RegionTracker + RegionScout
- 4 processors, 512KB L2 caches
- 1KB regions
[Chart: reduction in snoop broadcasts, higher is better, compared against BlockScout (4KB)]
- Avoids 41% of snoop broadcasts with no area overhead compared to a conventional tag array
28 Result Summary
- Replace Conventional Tag Array
- 20% less tag area
- 33% less tag energy
- Within 1% of original performance
- Coarse-Grain Optimization Framework
- 36% reduction in area overhead for Stealth Prefetching
- Filter 41% of snoop broadcasts with no area overhead compared to a conventional cache
29 Predictor Virtualization
- Ioana Burcea
- Joint work with
- Stephen Somogyi
- Babak Falsafi
30 Optimization Engines: Predictors
Predictor Virtualization
[Diagram: a many-core CMP, each CPU with L1-I and L1-D caches, connected through an interconnect to a shared L2 and main memory]
31 Motivating Trends
- Dedicating resources to predictors is hard to justify
- Chip multiprocessors
- Space dedicated to predictors scales with the number of processors
- Larger predictor tables
- Increased performance
- Memory hierarchies offer the opportunity
- Increased capacity
- How many apps really use the space?
Use conventional memory hierarchies to store predictor information
32 PV Architecture
[Diagram: the optimization engine sends a request to a dedicated predictor table and receives a prediction]
33 PV Architecture (cont'd)
[Diagram: the dedicated predictor table is replaced by a Predictor Virtualization layer between the optimization engine and the memory hierarchy]
34 PV Architecture (cont'd)
[Diagram: the optimization engine's requests go to a PVProxy on the backside of the L1; the PVProxy holds a small PVCache with MSHRs and uses PVStart plus an index to locate entries of the full PVTable, which lives in the L2 and main memory (sketched below)]
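A minimal Python sketch of the PVProxy path the diagram shows: a small PVCache backs predictor entries, and a miss computes the entry's address from PVStart and the index, fetching it through the regular hierarchy (a dict stands in for L2/memory). Class and method names are illustrative, not the authors' interface.

# Hypothetical sketch of the PVProxy lookup path.
class PVProxy:
    def __init__(self, pv_start, entry_size, memory, capacity=64):
        self.pv_start = pv_start        # base address of the virtualized PVTable
        self.entry_size = entry_size    # bytes per predictor entry
        self.memory = memory            # stand-in for the L2/main-memory hierarchy
        self.capacity = capacity
        self.pvcache = {}               # index -> entry (tiny on-chip cache)

    def lookup(self, index):
        """Return the predictor entry for `index`, filling the PVCache on a miss."""
        if index in self.pvcache:
            return self.pvcache[index]          # common case: PVCache hit
        addr = self.pv_start + index * self.entry_size
        entry = self.memory.get(addr)           # miss: request goes out past the L1
        if len(self.pvcache) >= self.capacity:  # trivial replacement; a real design uses MSHRs
            self.pvcache.pop(next(iter(self.pvcache)))
        self.pvcache[index] = entry
        return entry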
35 To Virtualize Or Not to Virtualize?
Common case:
- Re-use
- Predictor info prefetching
36 To Virtualize or Not?
- Challenge
- Hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
- Intrinsic
- Easy to virtualize
- Non-intrinsic
- Must be engineered
- More so if the predictor needs to be fast to start with
37 Will There Be Reuse?
- Intrinsic
- Multiple predictions per entry
- We'll see an example
- Can be engineered
- Group temporally correlated entries together
[Diagram: temporally correlated entries packed into one cache block]
38 Spatial Memory Streaming
- Footprint: the blocks accessed per memory region
- Predict that the next time, the footprint will be the same
- Handle: PC + offset within the region (sketched below)
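A minimal Python sketch of the footprint idea described above, in the spirit of Spatial Memory Streaming: record which blocks of a region a spatial generation touches, keyed by (PC, offset of the trigger access). Names and sizes (FootprintDetector, 32 blocks per region) are illustrative.

# Hypothetical sketch of spatial-footprint tracking.
BLOCKS_PER_REGION = 32               # e.g. a 2KB spatial group of 64-byte blocks

class FootprintDetector:
    def __init__(self):
        self.active = {}             # region base -> (training key, footprint bit-vector)
        self.patterns = {}           # (pc, trigger offset) -> learned footprint

    def access(self, pc, region, offset):
        if region not in self.active:
            # The trigger access opens a new spatial generation for this region.
            self.active[region] = ((pc, offset), 0)
        key, bits = self.active[region]
        self.active[region] = (key, bits | (1 << offset))

    def end_generation(self, region):
        # Called when the region's blocks leave the cache: train the pattern table.
        key, bits = self.active.pop(region)
        self.patterns[key] = bits

    def predict(self, pc, offset):
        # On a later trigger access, prefetch the blocks recorded in the footprint.
        bits = self.patterns.get((pc, offset), 0)
        return [b for b in range(BLOCKS_PER_REGION) if (bits >> b) & 1]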
39 Spatial Generations
40 Virtualizing SMS
[Diagram: the detector observes trigger accesses and records footprints; it hands patterns to the predictor, whose pattern table is the structure being virtualized; predictions drive prefetches]
41 Virtualizing SMS
[Diagram: the 1K-set virtual pattern table is backed by a small PVCache; each 64-byte cache block packs 11 entries, each an 11-bit tag plus a 32-bit pattern, with the remaining bits unused]
42 Packing Entries in One Cache Block
- Index: PC + offset within the spatial group
- PC: 16 bits
- 32 blocks in a spatial group → 5-bit offset
- → 32-bit spatial pattern
- Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
- Cache block: 64 bytes → 11 entries per cache block
- → Pattern table: 1K sets, 11-way set-associative, 21-bit index
[Diagram: a 64-byte block packing 11 entries, each an 11-bit tag plus a 32-bit pattern, with the remaining bits unused]
43 Memory Address Calculation
[Diagram: the 16-bit PC and 5-bit block offset form a 21-bit index; its 10-bit set index, followed by six zero bits (one 64-byte block per set), is added to the PV start address to produce the memory address (sketched below)]
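A minimal Python sketch of the address calculation the slide shows, assuming the low 10 bits of the 21-bit (PC, offset) index select the set and each set occupies one 64-byte block starting at PVStart; function and constant names are mine.

# Hypothetical sketch of the virtualized pattern-table address calculation.
PC_BITS, OFFSET_BITS = 16, 5
SET_BITS, BLOCK_BYTES = 10, 64                 # 1K sets, one 64-byte block per set
TAG_BITS = PC_BITS + OFFSET_BITS - SET_BITS    # 11-bit tag stored with each entry

def pv_entry_address(pv_start, pc, block_offset):
    """Address of the cache block holding the pattern entry for (pc, block_offset)."""
    index = ((pc & ((1 << PC_BITS) - 1)) << OFFSET_BITS) | (block_offset & ((1 << OFFSET_BITS) - 1))
    set_index = index & ((1 << SET_BITS) - 1)  # low 10 bits pick the set (assumed)
    tag = index >> SET_BITS                    # remaining 11 bits are matched inside the block
    return pv_start + set_index * BLOCK_BYTES, tag

# Example with made-up values: base 0x80000000, PC 0x1a2b, trigger offset 7.
addr, tag = pv_entry_address(0x80000000, 0x1a2b, 7)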
44 Simulation Infrastructure
- SimFlex (CMU Impetus)
- Full-system simulator based on Simics
- Base processor configuration
- 8-wide OoO
- 256-entry ROB / 64-entry LSQ
- L1D/L1I: 64KB, 4-way set-associative
- UL2: 8MB, 16-way set-associative
- Commercial workloads
- TPC-C: DB2 and Oracle
- TPC-H: Query 1, Query 2, Query 16, Query 17
- Web: Apache and Zeus
45 SMS Performance Potential
[Chart: higher is better]
46 Virtualized Spatial Memory Streaming
[Chart: higher is better]
- Original prefetcher cost: 60KB
- Virtualized prefetcher cost: <1KB
- Nearly identical performance
47 Impact of Virtualization on L2 Misses
48 Impact of Virtualization on L2 Requests
49 Coarse-Grain Tracking