Title: Region-Centric Memory Design
1Region-Centric Memory Design
- AENAO Research Group
- Patrick Akl, M.A.Sc.
- Ioana Burcea, Ph.D. C.
- Myrto Papadopoulou, M.A.Sc. C.
- Elham Safi, Ph.D. C.
- Jason Zebchuk, M.A.Sc. C.
- Andreas Moshovos
{pakl, ioana, myrto, elham, zebchuk, moshovos}_at_eecg.toronto.edu
2Future On-Chip Caches: Just Larger?
[Figure: CPU with I and D caches, an interconnect, a cache level labeled "10s - 100s of MB", and Main Memory]
- Observe and Exploit Memory Access Behavior at a Coarse Grain
3Conventional Block-Centric Memory Hierarchy
Conventional Fine-Grain Tracking
- Small Blocks
- Performance and Bandwidth
- Several optimizations exist
- Big picture is lost
4Big Picture View
Supplemental Coarse-Grain Tracking
- Region: 2^n-sized, aligned memory area
- Concept already in use: TLBs
- Patterns Emerge in Space / Time
- Exploit for performance and power
- Expose to software
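Because a region is a 2^n-sized, aligned area, the region an address belongs to can be found by dropping the low n address bits. The following is a minimal sketch of that mapping; the 1 KB region size and the helper names are assumptions made for illustration, not taken from the slides.

```python
REGION_SIZE = 1024                          # assumed for illustration; must be a power of two
REGION_BITS = REGION_SIZE.bit_length() - 1  # lg(REGION_SIZE)

def region_of(addr: int) -> int:
    """Region number: the address with its low lg(REGION_SIZE) bits dropped."""
    return addr >> REGION_BITS

def region_base(addr: int) -> int:
    """First byte address of the aligned region that contains addr."""
    return addr & ~(REGION_SIZE - 1)

def same_region(a: int, b: int) -> bool:
    """True when both addresses fall inside the same aligned region."""
    return region_of(a) == region_of(b)
```

Two blocks that differ only in the low REGION_BITS bits share a region, which is what lets a single coarse-grain record stand for many blocks.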
5This Presentation
- Examples of Coarse-Grain Optimizations
- Snoop Coherence
- Thread-level speculation disambiguation
- Region-Centric Memory Design
- RegionTracker Cache
- Snoop Coherence Revisited
- Current Activities
- Coherence Delegation
- Predictor Virtualization
6An Example: Snoop Coherence
[Figure: three CPUs, each with I and D caches, connected by an interconnect to Main Memory]
- Conventional Considerations
- Complexity and Correctness, NOT Power/Bandwidth
- Can we (1) Reduce Power/Bandwidth and (2) Leverage snoop coherence?
- Remains Attractive: Simple / Design Re-use
- Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping
7Coherence Basics
[Figure: three CPUs above Main Memory; labels: X, snoop, snoop, hit]
- Given request for memory block X (address)
- Detect where current value resides
8Conventional Coherence: not Power-Aware/Bandwidth-Effective
[Figure: three CPUs with L2 caches above Main Memory; labels: miss, miss]
- All L2 tags see all accesses
- Perf. and Complexity: Have L2 tags, why not use them
- Power: All L2 tags consume power on all accesses
- Bandwidth: broadcast all coherent requests
9RegionScout Motivation: Sharing is Coarse
[Figure: Typical Memory Space Snapshot, colored by owner(s); axis: addresses]
- Region: large contiguous memory area, power-of-2 size
- CPU X asks for data block in region R
- No one else has X
- No one else has any block in R
- RegionScout Exploits this Behavior
- Layered Extension over Snoop Coherence
10Optimization Opportunities
[Figure: nodes connected through a SWITCH to Memory]
- Power and Bandwidth
- Originating node: avoid asking others
- Remote node: avoid tag lookup
11Potential Region Miss Frequency
[Figure: Global Region Misses (% of all requests) vs. Region Size]
Even with a 16K Region, 45% of requests miss in all remote nodes
12RegionScout at Work: Non-Shared Region Discovery
[Figure: three CPUs above Main Memory; each node has structures labeled "Record Non-Shared Regions" and "Record Locally Cached Regions"; labels: Region Miss, Region Miss, Global Region Miss]
- First request detects a non-shared region
13RegionScout at Work: Avoiding Snoops
[Figure: three CPUs above Main Memory; each node has structures labeled "Record Non-Shared Regions" and "Record Locally Cached Regions"; label: Global Region Miss]
- Subsequent request avoids snoops
14RegionScout is Self-Correcting
[Figure: three CPUs above Main Memory; each node has structures labeled "Record Non-Shared Regions" and "Record Locally Cached Regions"]
- Request from another node invalidates non-shared record
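The three steps shown on the last slides (discovery, snoop avoidance, self-correction) can be sketched as a small simulation. This is an idealized model under stated assumptions: region size, class and function names are hypothetical, the per-node records are unbounded sets rather than small hardware tables, and evictions and data transfers are ignored.

```python
REGION_BITS = 10  # assumed 1 KB regions, for illustration only

class Node:
    def __init__(self):
        self.cached_blocks = set()  # block addresses held locally
        self.non_shared = set()     # regions discovered to be cached nowhere else

    def has_region(self, region):
        """Does this node cache any block of the region?"""
        return any(b >> REGION_BITS == region for b in self.cached_blocks)

def request(nodes, requester, addr):
    """Issue a request from nodes[requester]; return True if a broadcast was needed."""
    region = addr >> REGION_BITS
    me = nodes[requester]
    if region in me.non_shared:
        me.cached_blocks.add(addr)        # known non-shared region: skip the snoop
        return False
    global_region_miss = True
    for i, other in enumerate(nodes):
        if i == requester:
            continue
        other.non_shared.discard(region)  # self-correction: the region is no longer private
        if other.has_region(region):
            global_region_miss = False
    if global_region_miss:
        me.non_shared.add(region)         # first request discovers a non-shared region
    me.cached_blocks.add(addr)
    return True
```

In this sketch only the first request to a private region pays for a broadcast; later requests to the same region stay local until another node touches it.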
15Implementation Requirements
- Requesting Node provides address
- At Originating Node, from CPU:
- Have I discovered that this region is not shared?
- At Remote Nodes, from Interconnect:
- Do I have a block in the region?
[Figure: address, with a field of lg(Region Size) bits indicated]
16Remembering Non-Shared Regions
[Figure: address split into Region Tag and offset; the Region Tag is looked up in the Non-Shared Region Table, whose entries carry a valid bit]
Few entries: 16x4 in most experiments
- Records non-shared regions
- Lookup by Region portion prior to issuing a request
- Snoop requests and invalidate
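A small table of this kind might look like the sketch below. It assumes that "16x4" means 16 sets of 4 ways, uses LRU replacement within a set, and takes a 16 KB region size; all three choices, and every name in the code, are assumptions for illustration.

```python
from collections import OrderedDict

class NonSharedRegionTable:
    def __init__(self, sets=16, ways=4, region_bits=14):
        self.sets, self.ways, self.region_bits = sets, ways, region_bits
        # One OrderedDict per set: region tag -> True, kept in LRU order.
        self.table = [OrderedDict() for _ in range(sets)]

    def _locate(self, addr):
        region = addr >> self.region_bits
        return self.table[region % self.sets], region

    def record(self, addr):
        """Remember that the region containing addr was found to be non-shared."""
        entries, region = self._locate(addr)
        entries[region] = True
        entries.move_to_end(region)
        if len(entries) > self.ways:
            entries.popitem(last=False)  # drop the LRU entry; this only costs a future snoop

    def is_non_shared(self, addr):
        """Checked before issuing a request: can the broadcast be skipped?"""
        entries, region = self._locate(addr)
        return region in entries

    def snoop(self, addr):
        """A request from another node invalidates the matching record, if any."""
        entries, region = self._locate(addr)
        entries.pop(region, None)
```

Dropping an entry is always safe in this model, because a missing record only means the next request to that region is broadcast as usual.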
17What Regions are Locally Cached?
[Figure: address split into Region Tag and offset; the Region Tag selects a counter]
- If we had as many counters as regions:
- Block Allocation: counter[region]++
- Block Eviction: counter[region]--
- Region cached only if counter[region] non-zero
- Not Practical
- E.g., 16K Regions and 4G Memory → 256K counters
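The exact scheme and its cost can be written out directly. This sketch uses the slide's example sizes; the function names are hypothetical, and a dictionary stands in for the full counter array.

```python
from collections import defaultdict

REGION_SIZE = 16 * 1024                    # 16K regions, as in the example above
MEMORY_SIZE = 4 * 1024 ** 3                # 4G of memory
NUM_COUNTERS = MEMORY_SIZE // REGION_SIZE  # one counter per region

counter = defaultdict(int)                 # region number -> blocks currently cached

def region_of(addr):
    return addr // REGION_SIZE

def on_block_allocation(addr):
    counter[region_of(addr)] += 1

def on_block_eviction(addr):
    counter[region_of(addr)] -= 1

def region_cached(addr):
    """Exact answer: the region is cached only if its counter is non-zero."""
    return counter[region_of(addr)] != 0
```

NUM_COUNTERS evaluates to 262144, that is 2^32 / 2^14 = 2^18 = 256K counters, which is why one counter per region is not practical.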
18What Regions are Locally Cached?
[Figure: address split into Region Tag and offset; the Region Tag passes through hash() to select a counter]
- Imprecise
- Records a superset of locally cached Regions
- False positives: lost opportunity, correctness preserved
- Small: e.g., 256 entries for 1M cache
- Power-Optimized structures described in the paper
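Hashing many regions onto a few counters can be sketched as follows. The hash used here (low bits of the region number), the 16 KB region size, and the class and method names are assumptions; the slides do not specify them.

```python
class HashedRegionCounters:
    def __init__(self, entries=256, region_bits=14):
        self.counters = [0] * entries
        self.region_bits = region_bits

    def _index(self, addr):
        region = addr >> self.region_bits
        return region % len(self.counters)  # hypothetical hash: low region-number bits

    def allocate(self, addr):
        """Called when a block is brought into the local cache."""
        self.counters[self._index(addr)] += 1

    def evict(self, addr):
        """Called when a block leaves the local cache."""
        self.counters[self._index(addr)] -= 1

    def may_be_cached(self, addr):
        """False means definitely not cached; True may be a false positive."""
        return self.counters[self._index(addr)] != 0
```

Regions that alias to the same counter can make `may_be_cached` answer True for a region that is not cached, which only forfeits a filtering opportunity. A False answer is always exact, so correctness is preserved.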
19LFSR-Based Implementation
[Figure: address split into Region Tag and offset; hash() of the Region Tag selects an LFSR, followed by a Zero Detector]
- Linear-Feedback Shift Register Array
- Increment/Decrement/Is Zero?
- 130nm commercial technology
- ISLPED '06
- Faster: 1.6x to 3.7x
- More Energy Efficient: 1.4x to 2.3x
- But Area: 3.2x
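The three operations listed above can be illustrated with a single LFSR used as a counter. This is a sketch under assumptions: a 4-bit register, feedback taken from the two low bits, and state 0b0001 standing for a count of zero are all choices made here, not details from the slides.

```python
class LfsrCounter:
    ZERO = 0b0001  # state chosen to represent a count of zero (assumption)

    def __init__(self):
        self.state = self.ZERO

    def increment(self):
        """Step the register forward: shift right, feed back the XOR of bits 0 and 1."""
        s = self.state
        feedback = (s ^ (s >> 1)) & 1
        self.state = (s >> 1) | (feedback << 3)

    def decrement(self):
        """Step the register backward, undoing exactly one increment."""
        s = self.state
        low = ((s >> 3) ^ s) & 1
        self.state = ((s << 1) & 0xF) | low

    def is_zero(self):
        """Zero detection is a comparison against the designated zero state."""
        return self.state == self.ZERO
```

The register cycles through 15 distinct states, so this counter can represent counts 0 through 14 before wrapping. Each step needs only a shift and an XOR, with no carry chain as in a binary counter.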
20Filter Rates: SPLASH-II
[Figure: Identified Global Region Misses vs. CRH Size]
Jason Cantin at Wisconsin studied commercial workloads: 40% filter rate
21Region-Centric Disambiguation
- Joint work w/
- Greg Steffan and Mihai Burcea
- Patrick Akl
- Andreas Moshovos
22Speculative Parallelization Models
- Thread level speculation
- Transactional Memory
[Figure: Speculative Parallelization timelines (Original, Good Scenario, Bad Scenario) showing "read a", "read b" and "write a" ordered in time]
Need to Compare Addresses Across Code Pieces
23Ex. 2: Region-Centric Disambiguation
[Figure: Memory Space accessed by Task 1 and Task 2, shown for the Region-Centric and Conventional approaches]
- Send digest at region level
- On a region conflict, send block-level info
- Reduced bandwidth, potential for performance and power
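The two-step exchange described above can be sketched as a function that first compares region-level digests and only falls back to block addresses when regions overlap. The region size, the function names, and the use of "addresses sent" as a stand-in for traffic are assumptions for illustration.

```python
REGION_BITS = 10  # assumed 1 KB regions

def regions(blocks):
    """Region-level digest of a set of block addresses."""
    return {b >> REGION_BITS for b in blocks}

def find_conflicts(writes_task1, reads_task2):
    """Return (conflicting blocks, number of digest entries and addresses sent)."""
    digest = regions(writes_task1)
    sent = len(digest)                       # step 1: send the region digest only
    suspect = digest & regions(reads_task2)
    if not suspect:
        return set(), sent                   # no region conflict: nothing more to send
    # Step 2: region conflict, so send block-level info for the suspect regions.
    detail = {b for b in writes_task1 if (b >> REGION_BITS) in suspect}
    sent += len(detail)
    return detail & set(reads_task2), sent
```

When the tasks touch disjoint regions, three written blocks in one region cost a single digest entry instead of three addresses. When the regions overlap, the block-level step runs and the result is the same as a conventional block comparison.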
24How Much Traffic Can We Save?
- TLS benchmarks from STAMPEDE group (G. Steffan)
- Approximate timing model
Potential for traffic reduction by 38%
25Exploiting Region-Level Information
- Region Coherence Arrays
- Cantin, Lipasti and Smith
- RegionScout (our work)
- Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols
- Spatial Memory Prefetching
- Leverages spatial memory patterns for prefetching with commercial workloads
- Impetus Group at CMU
- Stealth Prefetching
- Cantin, Lipasti and Smith
26Coarse-Grain Techniques Today
[Figure: Conventional Cache (TAGS and DATA) with a separate Auxiliary Tracking structure]
- Overhead
- Storage: e.g., 60% of tags
- Functionality: Restrict placement, Region Evictions
- Loss of Information
- Hard to justify for a commercial design
27Rethinking Cache Design
[Figure: cache with Embedded Tracking: Dual-Grain TAGS and DATA]
- Can we provide a common substrate for all these optimizations?
- Redesign caches
- Regions: a first-class citizen
- RegionTracker Cache
28RegionTracker Cache
- Goals
- Expose region behavior
- Is region X cached?
- Which blocks are?
- Facilitate management at the region level
- Evict/migrate region X
- Do something with all blocks in X
- Constraints
- Data movement only at the block level
- No increase in area
- No decrease in performance
- Complexity
- Associativity
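The queries and operations named in the goals can be sketched as an interface. This models only the behavior a region-aware cache would expose, not the RegionTracker hardware organization; block and region sizes, the per-region bitmask, and all names are assumptions, and capacity limits are ignored.

```python
BLOCK_BITS = 6    # assumed 64-byte blocks
REGION_BITS = 10  # assumed 1 KB regions, i.e. 16 blocks per region
BLOCKS_PER_REGION = 1 << (REGION_BITS - BLOCK_BITS)

class RegionAwareCache:
    def __init__(self):
        self.present = {}  # region number -> bitmask of cached blocks

    def _split(self, addr):
        region = addr >> REGION_BITS
        block = (addr >> BLOCK_BITS) & (BLOCKS_PER_REGION - 1)
        return region, block

    def fill(self, addr):
        """Data still moves one block at a time."""
        region, block = self._split(addr)
        self.present[region] = self.present.get(region, 0) | (1 << block)

    def is_region_cached(self, addr):
        """Is region X cached?"""
        return self._split(addr)[0] in self.present

    def cached_blocks(self, addr):
        """Which blocks of the region are cached? Returns their base addresses."""
        region, _ = self._split(addr)
        mask = self.present.get(region, 0)
        base = region << REGION_BITS
        return [base + (i << BLOCK_BITS)
                for i in range(BLOCKS_PER_REGION) if (mask >> i) & 1]

    def evict_region(self, addr):
        """Evict region X: drop all of its blocks and report which ones they were."""
        victims = self.cached_blocks(addr)
        self.present.pop(self._split(addr)[0], None)
        return victims
```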
29Region-Based Caches
- Start with conventional 16-way cache and replace tag array
- Sector Caches
- Hit rate suffers: 20% loss
- Sector Pool Caches
- High Associativity: 48-way for matching a 16-way cache
- Decoupled-Sector Caches
- No coarse-grain info
- Replacements require searching
- No previous design is adequate
- RegionTracker
- Meets all requirements
- But does not save as much tag resources
30Sector Cache
[Figure: Sector Cache with D-way Region Tags and a D-way Data Array; label: RVA]
- Reduced Area and Power
- Increased miss-rates (2.5% - 96% for 1kB sectors)
- Replacement?
31Sector Pool Cache
[Figure: Sector Pool Cache with a D-way Data Array; labels: 1, DSR]
- M > D
- Requires highly associative cache to achieve same performance as RegionTracker (48-way)
32Decoupled-Sectored Cache
- Has multiple block evictions
- Requires scanning status array
- No simple mechanism to avoid this
- Does NOT expose region-level information
33RegionTracker
- In practice L < D
- Decouple Data and Lookup organizations
- Lower Associativity lookups with no hit-rate penalty
- RegionTracker provides complete solution
34RegionTracker Cache
Block and Region Lookups: Region Tag, Way Per Block
Evict Region Blocks Lazily
Simplify replacement and reduce area: Status per block, RVA set backpointer
Can be banked and partitioned
35Region-Aware Cache Performance vs. Area
- Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
- SimICS, SimFlex, Sampling, 2K Regions
36RegionTracker-RegionScout
[Figure: Reduction in Broadcasts; series include BlockScout]
- One bit per Region tag: Known to be not shared
- 1KB Regions, Commercial workloads
- 512KB L2 private caches
- Filter 41% of snoops at Zero Cost compared to conventional cache
37Directory Optimizations: Base Architecture
[Figure: base architecture; labels: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data DRAM]
38Coherence Delegation
[Figure: Requesting Node, Directory Lookup, and Remote L2 containing data, with the Ideal Path marked]
- Eliminate 3-hop overhead
- Attract directory tracking to nodes
39Optimization Engines Predictors
Predictor Virtualization
[Figure: multiple CPUs, each with L1-D and L1-I caches, connected by an Interconnect to an L2 and Main Memory]
40Motivating Trends
- Chip multiprocessors
- Space dedicated to predictors × processors
- Larger predictor table
- Increased performance
- Memory hierarchies
- Increased capacities
Use conventional memory hierarchies to store predictor information
41PV Architecture
[Figure: Optimization Engine and Predictor Table; labels: index, entry, prediction]
42PV Architecture
[Figure: Optimization Engine and Predictor Virtualization; labels: index, entry, prediction]
43PV Architecture
[Figure: Optimization Engine in front of a PVProxy with PVCache, MSHR and PVStart; the PVTable is held in the L2 and Main Memory; labels: index, entry, prediction]
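The arrangement in the figure can be sketched as a small cache of predictor entries backed by a table in ordinary memory. PVCache, PVTable and PVStart are the slide's names; everything else here (the dictionary standing in for the L2 and main memory, LRU replacement, the 8-byte entry size, the method names) is an assumption, and the MSHR is not modeled.

```python
from collections import OrderedDict

class VirtualizedPredictor:
    def __init__(self, pv_start, memory, cache_entries=8, entry_bytes=8):
        self.pv_start = pv_start      # PVStart: base address of the in-memory PVTable
        self.memory = memory          # dict standing in for the L2 / main memory
        self.cache_entries = cache_entries
        self.entry_bytes = entry_bytes
        self.pvcache = OrderedDict()  # PVCache: small on-chip store, in LRU order

    def _addr(self, index):
        """Memory address of PVTable entry `index`."""
        return self.pv_start + index * self.entry_bytes

    def _install(self, index, entry):
        self.pvcache[index] = entry
        self.pvcache.move_to_end(index)
        if len(self.pvcache) > self.cache_entries:
            victim, value = self.pvcache.popitem(last=False)
            if value is not None:
                self.memory[self._addr(victim)] = value  # write the evicted entry back

    def lookup(self, index):
        """Return the predictor entry, fetching it from memory on a PVCache miss."""
        if index in self.pvcache:
            self.pvcache.move_to_end(index)
            return self.pvcache[index]
        entry = self.memory.get(self._addr(index))
        self._install(index, entry)
        return entry

    def update(self, index, entry):
        """Record a new or changed predictor entry."""
        self._install(index, entry)
```

The optimization engine still presents an index and receives an entry; only the small PVCache needs dedicated on-chip storage, while the full table competes for space in the regular memory hierarchy.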
44Virtualized Spatial Memory Streaming
Original Prefetcher Cost: 80KB
Virtualized Prefetcher Cost: <1Kbyte
Nearly Identical Performance
46Summary
- Caches are getting larger
- Time to look at the big picture
- Region-Centric Memory Design
- Expose region-level info
- Allow management at the region-level
- RegionScout
- eliminate broadcasts for snoop coherence
- Region-Centric Disambiguation
- Reduce bandwidth for TLS or TM
- Region-Aware Memory
- Same area and performance as conventional, plus region info
- Predictor Virtualization