Title: Two Ways to Exploit Multi-Megabyte Caches
1 Two Ways to Exploit Multi-Megabyte Caches
- AENAO Research Group, University of Toronto
- Kaveh Aasaraai
- Ioana Burcea
- Myrto Papadopoulou
- Elham Safi
- Jason Zebchuk
- Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
2 Future Caches: Just Larger?
[Diagram: CPU with split I/D L1 caches, an interconnect, a cache of 10s-100s of MB, and main memory]
- Big Picture Management
- Store Metadata
3 Conventional Block-Centric Cache
Fine-Grain View of Memory
[Diagram: individual memory blocks mapped into the L2 cache]
- Small Blocks
- Optimizes Bandwidth and Performance
- Large L2/L3 caches especially
Big Picture Lost
4 Big Picture View
Coarse-Grain View of Memory
[Diagram: memory regions mapped into the L2 cache]
- Region: a 2^n-sized, aligned area of memory
- Patterns and behavior exposed
- Spatial locality
- Exploit for performance/area/power
5 Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations: RegionScout, Stealth Prefetching, Circuit-Switched Coherence, Destination-Set Prediction, Coarse-Grain Coherence Tracking, Spatial Memory Streaming, Run-time Adaptive Cache Hierarchy Management via Reference Analysis
- Each adds new structures to track coarse-grain information
- Hard to justify for a commercial design
- Coarse-grain framework: embed coarse-grain information in the tag array
- Support many different optimizations with less area overhead
- An adaptable optimization FRAMEWORK
6 RegionTracker Solution
[Diagram: an L2 cache whose tag array is replaced by a RegionTracker next to the data array; the L1 caches issue block requests and region probes, and receive data blocks and region responses]
- Manage blocks, but also track and manage regions
7 RegionTracker Summary
- Replace conventional tag array
- 4-core CMP with 8MB shared L2 cache
- Within 1% of original performance
- Up to 20% less tag area
- Average 33% less energy consumption
- Optimization Framework
- Stealth Prefetching: same performance, 36% less area
- RegionScout: 2x more snoops avoided, no area overhead
8 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
9 Goals
- Conventional Tag Array Functionality
- Identify data block location and state
- Leave the data array unchanged
- Optimization Framework Functionality (sketched after this list)
- Is Region X cached?
- Which blocks of Region X are cached? Where?
- Evict or migrate Region X
- Easy to assign properties to each Region
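A minimal Python sketch of the framework-style queries listed above, assuming a hypothetical RegionTracker-like interface; the class and method names (CoarseGrainFramework, is_region_cached, ...) and the per-region property store are illustrative, not the authors' API.

# Hypothetical sketch of the optimization-framework queries listed above.
# 1KB regions (region_bits=10) and 64-byte blocks are assumed, as in the talk.
class CoarseGrainFramework:
    def __init__(self, region_bits=10):
        self.region_bits = region_bits
        self.regions = {}        # region number -> {block index in region: (way, state)}
        self.properties = {}     # region number -> optimization-specific metadata

    def region_of(self, addr):
        return addr >> self.region_bits

    def insert_block(self, addr, way, state='S'):
        block = (addr >> 6) & ((1 << (self.region_bits - 6)) - 1)   # block index within region
        self.regions.setdefault(self.region_of(addr), {})[block] = (way, state)

    def is_region_cached(self, addr):
        # "Is Region X cached?"
        return self.region_of(addr) in self.regions

    def cached_blocks(self, addr):
        # "Which blocks of Region X are cached? Where?"
        return dict(self.regions.get(self.region_of(addr), {}))

    def evict_region(self, addr):
        # "Evict or migrate Region X"
        region = self.region_of(addr)
        self.properties.pop(region, None)
        return self.regions.pop(region, {})

    def set_property(self, addr, key, value):
        # "Easy to assign properties to each Region", e.g. key='non-shared'
        self.properties.setdefault(self.region_of(addr), {})[key] = value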
10 Coarse-Grain Cache Designs
Large Block Size
[Diagram: tag array and data array, with Region X stored as one large block]
- Increased bandwidth, decreased hit rates
11 Sector Cache
[Diagram: tag array and data array, Region X stored as a sector]
12 Sector Pool Cache
[Diagram: tag array and data array, Region X]
- High associativity (2-4x)
13 Decoupled Sector Cache
[Diagram: tag array, data array, and status table, Region X]
- Region information not exposed
- Region replacement requires scanning multiple entries
14 Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
- (i.e., No scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array envelope
15 RegionTracker: A Tag Array Replacement
[Diagram: the L1 caches connect to an L2 whose data array is managed by three structures: the Region Vector Array, the Block Status Table, and the Evicted Region Buffer]
- 3 SRAM arrays, combined smaller than tag array
16 Basic Structures
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
[Diagram: a Region Vector Array (RVA) entry holds a region tag and status, plus per-block fields (valid bit and way pointer) for block0..block15; the Block Status Table (BST) holds per-block status]
- An address selects a specific RVA set and BST set
- One RVA entry maps to multiple, consecutive BST sets
- One BST entry maps to one of four RVA sets
17 Common Case Hit
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
[Diagram: the address splits into Region Tag (bits 49-21), RVA Index (bits 20-10), Region Offset (bits 9-6), and Block Offset (bits 5-0); bits 19-6 also form the data array / BST index. The RVA set is read, the region tag is matched, and the selected block's valid bit and way pointer steer the access to the data array (sketched below).]
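A minimal Python sketch of the common-case hit the slide illustrates, assuming the example geometry (8MB, 16-way, 64-byte blocks, 1KB regions) and the bit fields shown; RVAEntry, lookup_block, and the per-block field layout are illustrative, not the authors' RTL.

# Hypothetical sketch of a RegionTracker common-case hit lookup.
BLOCK_OFFSET_BITS  = 6    # 64-byte blocks
REGION_OFFSET_BITS = 4    # 1KB region / 64B blocks = 16 blocks per region
RVA_INDEX_BITS     = 11   # address bits 20..10, per the slide

class RVAEntry:
    def __init__(self):
        self.valid = False
        self.region_tag = 0
        # Per-block metadata for block0..block15: (valid, state, way pointer).
        self.blocks = [(False, 'I', 0) for _ in range(16)]

def lookup_block(rva_sets, address):
    """Return (hit, way) for the block at `address`; (False, None) otherwise."""
    block_in_region = (address >> BLOCK_OFFSET_BITS) & ((1 << REGION_OFFSET_BITS) - 1)
    rva_index = (address >> (BLOCK_OFFSET_BITS + REGION_OFFSET_BITS)) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = address >> (BLOCK_OFFSET_BITS + REGION_OFFSET_BITS + RVA_INDEX_BITS)
    for entry in rva_sets[rva_index]:              # associative search within one RVA set
        if entry.valid and entry.region_tag == region_tag:
            blk_valid, state, way = entry.blocks[block_in_region]
            if blk_valid:
                return True, way                   # the way pointer steers the data array access
            return False, None                     # region present, block missing
    return False, None                             # region miss: handled via the ERB (next slide)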
18 Worst Case (Rare) Region Miss
[Diagram: the address is decomposed as on the previous slide, but no RVA entry matches the region tag ("No Match!"); the Evicted Region Buffer (ERB), with a pointer back to the RVA, handles the displaced region (sketched below).]
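Continuing the sketch above, a hedged outline of the region-miss path: it assumes, consistent with the Ptr field shown and the "no multiple block evictions" requirement on slide 14, that the ERB simply parks the victim's RVA entry so its blocks can drain lazily. Names are illustrative.

# Hypothetical sketch of the rare region-miss path (uses RVAEntry from the sketch above).
def handle_region_miss(rva_sets, erb, rva_index, region_tag, choose_victim):
    """Allocate an RVA entry for a new region; park the displaced region in the ERB."""
    rva_set = rva_sets[rva_index]
    victim_slot = choose_victim(rva_set)       # e.g. index of the LRU entry in this set
    victim = rva_set[victim_slot]
    if victim.valid:
        # The ERB entry's pointer remembers which RVA set the victim came from,
        # while its blocks are evicted from the BST/data array in the background.
        erb.append((rva_index, victim))
    fresh = RVAEntry()
    fresh.valid = True
    fresh.region_tag = region_tag
    rva_set[victim_slot] = fresh
    return fresh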
19 Methodology
[Diagram: 4-core CMP; each core has split I/D L1 caches and connects through an interconnect to a shared L2]
- Flexus simulator from the CMU SimFlex group
- Based on the Simics full-system simulator
- 4-core CMP modeled after Piranha
- Private 32KB, 4-way set-associative L1 caches
- Shared 8MB, 16-way set-associative L2 cache
- 64-byte blocks
- Miss-rates: functional simulation of 2 billion instructions per core
- Performance and energy: timing simulation using the SMARTS sampling methodology
- Area and power: full-custom implementation in a 130nm commercial technology
- 9 commercial workloads
- WEB: SpecWEB on Apache and Zeus
- OLTP: TPC-C on DB2 and Oracle
- DSS: 5 TPC-H queries on DB2
20 Miss-Rates vs. Area
[Chart: relative miss-rate vs. relative tag array area, lower is better on both axes; the sector cache sits at (0.25, 1.26), with other points for 14-way, 15-way, 48-way, and 52-way configurations]
- Sector cache: 512KB sectors; SPC and RT: 1KB regions
- Trade-offs comparable to a conventional cache
21 Performance and Energy
[Charts: normalized execution time and reduction in tag energy; lower execution time and higher energy reduction are better]
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence intervals
- Performance within 1%, with 33% tag energy reduction
22 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion
23 RegionTracker: An Optimization Framework
- Stealth Prefetching: average 20% performance improvement; drop-in RegionTracker for 36% less area overhead
- RegionScout: in-depth analysis
24 Snoop Coherence: Common Case
[Diagram: one CPU reads x; the snoop broadcast (Read x1 ... Read xn) misses in every other CPU's cache and the request is finally served by main memory]
- Many snoops are to non-shared regions
25 RegionScout
[Diagram: each CPU tracks its locally cached regions and a table of non-shared regions; when a read to x would miss in every other cache, the requester observes a global region miss and goes straight to main memory]
- Eliminate broadcasts for non-shared regions (sketched below)
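A simplified Python sketch of the filtering the slide depicts: plain sets stand in for RegionScout's locally-cached-region summary and non-shared-region table, and the names (Node, read_miss, ...) are illustrative.

# Hypothetical sketch of RegionScout-style snoop filtering.
REGION_BITS = 10                      # 1KB regions, as in the evaluation

def region_of(address):
    return address >> REGION_BITS

class Node:
    def __init__(self):
        self.cached_regions = set()   # summary of regions with any block cached locally
        self.non_shared = set()       # regions known to be cached by no other node

    def remote_region_miss(self, region):
        # Snoop-side signal: no block of this region is cached here.
        return region not in self.cached_regions

def read_miss(requester, nodes, address):
    region = region_of(address)
    if region in requester.non_shared:
        return "fetch from memory; snoop broadcast avoided"
    # Otherwise broadcast as usual; the responses also carry region-miss signals.
    others = [n for n in nodes if n is not requester]
    if all(n.remote_region_miss(region) for n in others):
        requester.non_shared.add(region)   # global region miss: skip future broadcasts
    return "snoop broadcast sent"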
26 RegionTracker Implementation
- Minimal overhead to support the RegionScout optimization
- Still uses less area than a conventional tag array
27 RegionTracker + RegionScout
- 4 processors, 512KB L2 caches
- 1KB regions
[Chart: reduction in snoop broadcasts, higher is better, compared against BlockScout (4KB)]
- Avoids 41% of snoop broadcasts with no area overhead compared to a conventional tag array
28 Result Summary
- Replace Conventional Tag Array
- 20% less tag area
- 33% less tag energy
- Within 1% of original performance
- Coarse-Grain Optimization Framework
- 36% reduction in area overhead for Stealth Prefetching
- Filter 41% of snoop broadcasts with no area overhead compared to a conventional cache
29 Predictor Virtualization
- Ioana Burcea
- Joint work with
- Stephen Somogyi
- Babak Falsafi
30 Optimization Engines: Predictors
Predictor Virtualization
[Diagram: a many-core CMP, each CPU with L1-I and L1-D caches, connected through an interconnect to a shared L2 and main memory]
31 Motivating Trends
- Dedicating resources to predictors is hard to justify
- Chip multiprocessors
- Space dedicated to predictors scales with the number of processors
- Larger predictor tables
- Increased performance
- Memory hierarchies offer the opportunity
- Increased capacity
- How many apps really use the space?
Use conventional memory hierarchies to store predictor information
32 PV Architecture
[Diagram: the optimization engine sends a request to a dedicated predictor table and receives a prediction]
33 PV Architecture (cont'd)
[Diagram: the dedicated predictor table is replaced by a Predictor Virtualization layer between the optimization engine and the memory hierarchy]
34 PV Architecture (cont'd)
[Diagram: the optimization engine's requests go to a PVProxy on the backside of the L1; the PVProxy holds a small PVCache with MSHRs and uses PVStart plus an index to locate entries of the full PVTable, which lives in the L2 and main memory (sketched below)]
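A minimal Python sketch of the PVProxy path the diagram shows: a small PVCache backs predictor entries, and a miss computes the entry's address from PVStart and the index, fetching it through the regular hierarchy (a dict stands in for L2/memory). Class and method names are illustrative, not the authors' interface.

# Hypothetical sketch of the PVProxy lookup path.
class PVProxy:
    def __init__(self, pv_start, entry_size, memory, capacity=64):
        self.pv_start = pv_start        # base address of the virtualized PVTable
        self.entry_size = entry_size    # bytes per predictor entry
        self.memory = memory            # stand-in for the L2/main-memory hierarchy
        self.capacity = capacity
        self.pvcache = {}               # index -> entry (tiny on-chip cache)

    def lookup(self, index):
        """Return the predictor entry for `index`, filling the PVCache on a miss."""
        if index in self.pvcache:
            return self.pvcache[index]          # common case: PVCache hit
        addr = self.pv_start + index * self.entry_size
        entry = self.memory.get(addr)           # miss: request goes out past the L1
        if len(self.pvcache) >= self.capacity:  # trivial replacement; a real design uses MSHRs
            self.pvcache.pop(next(iter(self.pvcache)))
        self.pvcache[index] = entry
        return entry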
35 To Virtualize Or Not to Virtualize?
Common case:
- Re-use
- Predictor info prefetching
36 To Virtualize or Not?
- Challenge
- Hit in the PVCache most of the time
- Will not work for all predictors out of the box
- Reuse is necessary
- Intrinsic
- Easy to virtualize
- Non-intrinsic
- Must be engineered
- More so if the predictor needs to be fast to start with
37 Will There Be Reuse?
- Intrinsic
- Multiple predictions per entry
- We'll see an example
- Can be engineered
- Group temporally correlated entries together
[Diagram: temporally correlated entries packed into one cache block]
38 Spatial Memory Streaming
- Footprint: the blocks accessed per memory region
- Predict that the next time, the footprint will be the same
- Handle: PC + offset within the region (sketched below)
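A minimal Python sketch of the footprint idea described above, in the spirit of Spatial Memory Streaming: record which blocks of a region a spatial generation touches, keyed by (PC, offset of the trigger access). Names and sizes (FootprintDetector, 32 blocks per region) are illustrative.

# Hypothetical sketch of spatial-footprint tracking.
BLOCKS_PER_REGION = 32               # e.g. a 2KB spatial group of 64-byte blocks

class FootprintDetector:
    def __init__(self):
        self.active = {}             # region base -> (training key, footprint bit-vector)
        self.patterns = {}           # (pc, trigger offset) -> learned footprint

    def access(self, pc, region, offset):
        if region not in self.active:
            # The trigger access opens a new spatial generation for this region.
            self.active[region] = ((pc, offset), 0)
        key, bits = self.active[region]
        self.active[region] = (key, bits | (1 << offset))

    def end_generation(self, region):
        # Called when the region's blocks leave the cache: train the pattern table.
        key, bits = self.active.pop(region)
        self.patterns[key] = bits

    def predict(self, pc, offset):
        # On a later trigger access, prefetch the blocks recorded in the footprint.
        bits = self.patterns.get((pc, offset), 0)
        return [b for b in range(BLOCKS_PER_REGION) if (bits >> b) & 1]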
39 Spatial Generations
40 Virtualizing SMS
[Diagram: the detector observes trigger accesses and records footprints; it hands patterns to the predictor, whose pattern table is the structure being virtualized; predictions drive prefetches]
41 Virtualizing SMS
[Diagram: the 1K-set virtual pattern table is backed by a small PVCache; each 64-byte cache block packs 11 entries, each an 11-bit tag plus a 32-bit pattern, with the remaining bits unused]
42 Packing Entries in One Cache Block
- Index: PC + offset within the spatial group
- PC: 16 bits
- 32 blocks in a spatial group → 5-bit offset
- → 32-bit spatial pattern
- Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
- Cache block: 64 bytes → 11 entries per cache block
- → Pattern table: 1K sets, 11-way set-associative, 21-bit index
[Diagram: a 64-byte block packing 11 entries, each an 11-bit tag plus a 32-bit pattern, with the remaining bits unused]
43 Memory Address Calculation
[Diagram: the 16-bit PC and 5-bit block offset form a 21-bit index; its 10-bit set index, followed by six zero bits (one 64-byte block per set), is added to the PV start address to produce the memory address (sketched below)]
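A minimal Python sketch of the address calculation the slide shows, assuming the low 10 bits of the 21-bit (PC, offset) index select the set and each set occupies one 64-byte block starting at PVStart; function and constant names are mine.

# Hypothetical sketch of the virtualized pattern-table address calculation.
PC_BITS, OFFSET_BITS = 16, 5
SET_BITS, BLOCK_BYTES = 10, 64                 # 1K sets, one 64-byte block per set
TAG_BITS = PC_BITS + OFFSET_BITS - SET_BITS    # 11-bit tag stored with each entry

def pv_entry_address(pv_start, pc, block_offset):
    """Address of the cache block holding the pattern entry for (pc, block_offset)."""
    index = ((pc & ((1 << PC_BITS) - 1)) << OFFSET_BITS) | (block_offset & ((1 << OFFSET_BITS) - 1))
    set_index = index & ((1 << SET_BITS) - 1)  # low 10 bits pick the set (assumed)
    tag = index >> SET_BITS                    # remaining 11 bits are matched inside the block
    return pv_start + set_index * BLOCK_BYTES, tag

# Example with made-up values: base 0x80000000, PC 0x1a2b, trigger offset 7.
addr, tag = pv_entry_address(0x80000000, 0x1a2b, 7)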
44 Simulation Infrastructure
- SimFlex (CMU Impetus)
- Full-system simulator based on Simics
- Base processor configuration
- 8-wide OoO
- 256-entry ROB / 64-entry LSQ
- L1D/L1I: 64KB, 4-way set-associative
- UL2: 8MB, 16-way set-associative
- Commercial workloads
- TPC-C: DB2 and Oracle
- TPC-H: Query 1, Query 2, Query 16, Query 17
- Web: Apache and Zeus
45 SMS Performance Potential
[Chart: higher is better]
46 Virtualized Spatial Memory Streaming
[Chart: higher is better]
- Original prefetcher cost: 60KB
- Virtualized prefetcher cost: <1KB
- Nearly identical performance
47 Impact of Virtualization on L2 Misses
48 Impact of Virtualization on L2 Requests
49 Coarse-Grain Tracking