1
Region-Centric Memory Design

AENAO Research Group
Patrick Akl, M.A.Sc.
Ioana Burcea, Ph.D. C.
Myrto Papadopoulou, M.A.Sc. C.
Elham Safi, Ph.D. C.
Jason Zebchuk, M.A.Sc. C.
Andreas Moshovos

{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
2
Future On-Chip Caches: Just Larger?
[Figure: CPU with I and D caches, interconnect, 10s to 100s of MB, Main Memory]

Observe and Exploit Memory Access Behavior at a Coarse Grain

3
Conventional Block-Centric Memory Hierarchy
Conventional Fine-Grain Tracking

Small Blocks
Performance and Bandwidth
Several optimizations exist
Big picture is lost

4
Big Picture View
Supplemental Coarse-Grain Tracking

Region: 2^n-sized, aligned memory area
Concept already in use: TLBs
Patterns Emerge in Space / Time
Exploit for performance and power
Expose to software
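
A minimal sketch of what "2^n-sized, aligned" means in practice: the region number and the offset inside the region fall out of a shift and a mask. The function names and the 16 KB region size are assumptions made for illustration, not something the slides specify.

```python
def region_of(addr: int, region_bits: int) -> tuple[int, int]:
    """Split an address into (region number, offset within the region)."""
    region_size = 1 << region_bits            # regions are 2^n bytes
    return addr >> region_bits, addr & (region_size - 1)


def region_base(addr: int, region_bits: int) -> int:
    """Return the aligned base address of the region containing addr."""
    return addr & ~((1 << region_bits) - 1)


REGION_BITS = 14  # assumed 16 KB regions, for illustration only
tag, offset = region_of(0x12345678, REGION_BITS)   # tag == 0x48D1, offset == 0x1678
base = region_base(0x12345678, REGION_BITS)        # base == 0x12344000
```

Because regions are aligned powers of two, no division or table lookup is needed to find the region of an address, which is the same property TLBs rely on for pages.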

5
This Presentation

Examples of Coarse-Grain Optimizations
Snoop Coherence
Thread-level speculation disambiguation
Region-Centric Memory Design
RegionTracker Cache
Snoop Coherence Revisited
Current Activities
Coherence Delegation
Predictor Virtualization

6
An Example: Snoop Coherence
[Figure: three CPUs, each with I and D caches, interconnect, Main Memory]

Conventional considerations: complexity and correctness, NOT power/bandwidth
Snoop coherence remains attractive: simple, design re-use
Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence?
Yes: exploit program behavior to dynamically identify requests that do not need snooping

7
Coherence Basics
[Figure: three CPUs and Main Memory; labels: X, snoop, snoop, hit]

Given request for memory block X (address)
Detect where current value resides

8
Conventional Coherence: not Power-Aware/Bandwidth-Effective
[Figure: three CPUs with L2 caches and Main Memory; labels: miss, miss]

All L2 tags see all accesses
Perf. and complexity: have L2 tags, why not use them
Power: all L2 tags consume power on all accesses
Bandwidth: broadcast all coherent requests
9
RegionScout Motivation: Sharing is Coarse
[Figure: typical memory space snapshot, addresses colored by owner(s)]

Region: large contiguous memory area, power-of-2 size
CPU asks for data block X in region R
No one else has X
No one else has any block in R
RegionScout exploits this behavior
Layered extension over snoop coherence

10
Optimization Opportunities
[Figure: labels: SWITCH, Memory]

Power and Bandwidth
Originating node: avoid asking others
Remote node: avoid tag lookup

11
Potential: Region Miss Frequency
[Figure: global region misses as a percentage of all requests vs. region size]

Even with a 16K region, 45% of requests miss in all remote nodes
12
RegionScout at Work: Non-Shared Region Discovery
[Figure: three CPUs and Main Memory; labels: Region Miss, Region Miss, Global Region Miss, Record Non-Shared Regions, Record Locally Cached Regions]

First request detects a non-shared region

13
RegionScout at Work: Avoiding Snoops
[Figure: three CPUs and Main Memory; labels: Global Region Miss, Record Non-Shared Regions, Record Locally Cached Regions]

Subsequent request avoids snoops

14
RegionScout is Self-Correcting
[Figure: three CPUs and Main Memory; labels: Record Non-Shared Regions, Record Locally Cached Regions]

Request from another node invalidates non-shared record
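
The three "RegionScout at Work" steps (discovery, snoop avoidance, self-correction) can be sketched as a small simulation. This is an idealized model under stated assumptions: the `Node` and `System` classes are hypothetical, both per-node records are unbounded and exact Python sets, evictions are not modeled, and 16 KB regions are assumed.

```python
class Node:
    def __init__(self, region_bits):
        self.region_bits = region_bits
        self.blocks = set()       # block addresses cached locally
        self.non_shared = set()   # regions this node believes nobody else caches

    def region(self, addr):
        return addr >> self.region_bits

    def caches_region(self, region):
        return any(self.region(b) == region for b in self.blocks)


class System:
    def __init__(self, num_nodes, region_bits=14):
        self.nodes = [Node(region_bits) for _ in range(num_nodes)]
        self.broadcasts = 0

    def request(self, node_id, addr):
        """Service a miss for addr at node_id; return True if it was broadcast."""
        me = self.nodes[node_id]
        region = me.region(addr)
        if region in me.non_shared:
            me.blocks.add(addr)              # known non-shared: skip the snoop
            return False
        self.broadcasts += 1
        global_region_miss = True
        for i, other in enumerate(self.nodes):
            if i == node_id:
                continue
            other.non_shared.discard(region)  # self-correction at remote nodes
            if other.caches_region(region):
                global_region_miss = False
        if global_region_miss:
            me.non_shared.add(region)         # first request discovers the region
        me.blocks.add(addr)
        return True


s = System(num_nodes=3)
first = s.request(0, 0x4000)    # broadcast; every remote node reports a region miss
second = s.request(0, 0x4040)   # same region: no broadcast needed
third = s.request(1, 0x4080)    # node 1 broadcasts; node 0 drops its record
fourth = s.request(0, 0x40C0)   # node 0 must broadcast again
```

Correctness never depends on the non-shared record being complete: a missing record only costs an unnecessary broadcast.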

15
Implementation Requirements

Requesting node provides address
At originating node, from CPU:
Have I discovered that this region is not shared?
At remote nodes, from interconnect:
Do I have a block in the region?

[Figure: address with lg(Region Size) marked]
16
Remembering Non-Shared Regions
[Figure: address split into Region Tag and offset; Non-Shared Region Table with a valid bit per entry]
Few entries: 16x4 in most experiments

Records non-shared regions
Lookup by region portion of the address prior to issuing a request
Snoop requests and invalidate
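
A minimal sketch of a Non-Shared Region Table with the three operations listed above (record, lookup before a request, invalidate on a snoop). It reads "16x4" as 16 sets of 4 ways, which is an assumption; the indexing by low region bits, the oldest-first replacement, and the method names are likewise illustrative choices.

```python
class NonSharedRegionTable:
    def __init__(self, sets=16, ways=4, region_bits=14):
        self.sets, self.ways, self.region_bits = sets, ways, region_bits
        # Each set holds up to `ways` region tags, oldest first.
        self.table = [[] for _ in range(sets)]

    def _locate(self, addr):
        region = addr >> self.region_bits
        return self.table[region % self.sets], region

    def is_non_shared(self, addr):
        """Lookup by the region portion of the address before issuing a request."""
        entries, region = self._locate(addr)
        return region in entries

    def record(self, addr):
        """Remember that the region containing addr was found to be non-shared."""
        entries, region = self._locate(addr)
        if region in entries:
            entries.remove(region)
        elif len(entries) == self.ways:
            entries.pop(0)                 # assumed policy: drop the oldest entry
        entries.append(region)

    def snoop(self, addr):
        """A request from another node invalidates a matching record."""
        entries, region = self._locate(addr)
        if region in entries:
            entries.remove(region)


nsrt = NonSharedRegionTable()
nsrt.record(0x4000)
```

Losing an entry to replacement is harmless: the next request to that region is simply broadcast again and may rediscover the region.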

17
What Regions are Locally Cached?
[Figure: address split into Region Tag and offset, selecting a counter]

If we had as many counters as regions:
Block allocation: counter[region]++
Block eviction: counter[region]--
Region cached only if counter[region] is non-zero
Not practical
E.g., 16K regions and 4G memory → 256K counters
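
The counter count follows from dividing the memory size by the region size, assuming "4G" and "16K" denote 2^32 and 2^14 bytes:

```latex
\frac{4\,\mathrm{GB}}{16\,\mathrm{KB}} = \frac{2^{32}}{2^{14}} = 2^{18} = 256\mathrm{K}\ \text{counters}
```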

18
What Regions are Locally Cached?
[Figure: Region Tag passed through hash() to select a counter; offset not used]

Imprecise
Records a superset of locally cached regions
False positives: lost opportunity, correctness preserved
Small: e.g., 256 entries for 1M cache
Power-optimized structures described in the paper
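
A minimal sketch of the hashed counter array described above, showing why it records a superset of the locally cached regions. The class name, the hash (low bits of the region number), and the method names are assumptions for illustration; the slides only say that a hash() selects a counter.

```python
class CachedRegionHash:
    def __init__(self, entries=256, region_bits=14):
        self.entries = entries
        self.region_bits = region_bits
        self.counters = [0] * entries

    def _index(self, addr):
        # Assumed hash: low bits of the region number.
        return (addr >> self.region_bits) % self.entries

    def block_allocated(self, addr):
        self.counters[self._index(addr)] += 1

    def block_evicted(self, addr):
        i = self._index(addr)
        assert self.counters[i] > 0, "eviction without a matching allocation"
        self.counters[i] -= 1

    def may_cache_region(self, addr):
        """False means definitely not cached; True may be a false positive."""
        return self.counters[self._index(addr)] != 0


crh = CachedRegionHash(entries=4)
crh.block_allocated(1 << 14)                 # a block from region 1 is cached
alias_hit = crh.may_cache_region(5 << 14)    # region 5 shares the counter: True
clean_miss = crh.may_cache_region(2 << 14)   # a different counter: False
```

A zero counter proves that no block of any region mapping to it is cached, so a "no" answer is always safe. Aliasing can only turn a "no" into a "yes", which costs a filtering opportunity but never correctness.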

19
LFSR-Based Implementation
[Figure: Region Tag passed through hash() to select an LFSR, followed by a Zero Detector]

Linear-Feedback Shift Register array
Increment / Decrement / Is Zero?
130nm commercial technology (ISLPED 06)
Faster: 1.6x to 3.7x
More energy efficient: 1.4x to 2.3x
But area: 3.2x
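
To illustrate how a linear-feedback shift register can serve as an up/down counter with a zero test, here is a small sketch. The slides do not give the width, taps, or encoding of the ISLPED design, so the 4-bit width, the feedback from the two low bits, and the choice of state 0b0001 as "zero" are all assumptions.

```python
class LfsrCounter:
    WIDTH = 4
    MASK = (1 << WIDTH) - 1
    ZERO = 0b0001          # LFSR state chosen to represent a count of zero

    def __init__(self):
        self.state = self.ZERO

    def increment(self):
        """Step the LFSR forward: one shift plus one XOR, no adder."""
        s = self.state
        bit = (s ^ (s >> 1)) & 1                      # feedback from the two low bits
        self.state = (s >> 1) | (bit << (self.WIDTH - 1))

    def decrement(self):
        """Step the LFSR backward, undoing exactly one increment."""
        s = self.state
        low = ((s >> (self.WIDTH - 1)) ^ s) & 1       # recover the bit shifted out
        self.state = ((s << 1) & self.MASK) | low

    def is_zero(self):
        return self.state == self.ZERO


c = LfsrCounter()
c.increment()
c.increment()
c.decrement()
c.decrement()      # back at the zero state
```

The count is encoded in the position along the LFSR's state sequence rather than in binary, so only "is zero?" can be read cheaply, which is all the region-tracking structure needs. This 4-bit sequence visits 15 states, so the sketch wraps around after 15 increments.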

20
Filter Rates: SPLASH-II
[Figure: identified global region misses vs. CRH size]
Jason Cantin at Wisconsin studied commercial workloads: 40% filter rate
21
Region-Centric Disambiguation

Joint work w/
Greg Steffan and Mihai Burcea
Patrick Akl
Andreas Moshovos

22
Speculative Parallelization Models

Thread level speculation
Transactional Memory

[Figure: speculative parallelization timelines for the original code, a good scenario, and a bad scenario; labels: read a, read b, write a, write a, time]
Need to compare addresses across code pieces
23
Ex 2: Region-Centric Disambiguation
[Figure: memory space accessed by Task 1 and Task 2, conventional vs. region-centric]

Send digest at region level
On a region conflict: send block-level info
Reduced bandwidth, potential for performance and power
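
The two-step exchange above can be sketched as follows. This is only an illustration under assumptions: the digest is modeled as the plain set of touched region numbers, the check is one-directional (writes of one task against reads of the other), and the function names, the 1 KB region size, and the "items sent" cost measure are hypothetical.

```python
def regions(addrs, region_bits):
    """Region-level digest: the set of regions touched by a set of addresses."""
    return {a >> region_bits for a in addrs}


def disambiguate(writes_t1, reads_t2, region_bits=10):
    """Return (conflicting block addresses, number of items sent by task 1)."""
    digest = regions(writes_t1, region_bits)        # step 1: send the region digest
    sent = len(digest)
    conflict_regions = digest & regions(reads_t2, region_bits)
    # Step 2: send block-level info only for regions that conflict.
    detail = {a for a in writes_t1 if (a >> region_bits) in conflict_regions}
    sent += len(detail)
    return detail & set(reads_t2), sent


writes = set(range(0, 4096, 64))                     # 64 blocks spread over 4 regions
no_conflict = disambiguate(writes, {8192, 8256})     # (set(), 4)
one_conflict = disambiguate(writes, {64, 9000})      # ({64}, 20)
```

In the first call the region digest alone rules out any dependence, so 4 items are sent instead of 64 block addresses. In the second, only the 16 blocks of the single conflicting region follow the digest.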

24
How Much Traffic Can We Save?
[Figure: traffic comparison; axis label: Better]

TLS benchmarks from STAMPEDE group (G. Steffan)
Approximate timing model

Potential for traffic reduction by 38%
25
Exploiting Region-Level Information

Region Coherence Arrays
Cantin, Lipasti and Smith
RegionScout
Our work
Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols
Spatial Memory Prefetching
Leverages spatial memory patterns for prefetching with commercial workloads
Impetus Group at CMU
Stealth Prefetching
Cantin, Lipasti and Smith

26
Coarse-Grain Techniques Today
[Figure: conventional cache (TAGS, DATA) with auxiliary tracking]

Overhead
Storage: e.g., 60% of tags
Functionality: restrict placement, region evictions
Loss of information
Hard to justify for a commercial design

27
Rethinking Cache Design
[Figure: embedded tracking; cache with dual-grain TAGS and DATA]

Can we provide a common substrate for all these optimizations?
Redesign caches
Regions: a first-class citizen
RegionTracker Cache

28
RegionTracker Cache

Goals
Expose region behavior
Is region X cached?
Which blocks are?
Facilitate management at the region level
Evict/migrate region X
Do something with all blocks in X
Constraints
Data movement only at the block level
No increase in area
No decrease in performance
Complexity
Associativity
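
The goals above amount to a small set of region-level queries on top of an ordinary block-granularity cache. The sketch below models only that interface, not the hardware organization: the class, its method names, the dictionary-based bookkeeping, and the 1 KB region / 64 B block sizes are all hypothetical.

```python
from collections import defaultdict


class RegionAwareCache:
    def __init__(self, region_bits=10, block_bits=6):
        self.region_bits = region_bits
        self.block_bits = block_bits
        self.present = defaultdict(set)   # region number -> cached block indices

    def _split(self, addr):
        region = addr >> self.region_bits
        block = (addr & ((1 << self.region_bits) - 1)) >> self.block_bits
        return region, block

    def fill(self, addr):
        """Data still moves one block at a time."""
        region, block = self._split(addr)
        self.present[region].add(block)

    def is_region_cached(self, region):
        """Goal: 'Is region X cached?'"""
        return bool(self.present.get(region))

    def blocks_in_region(self, region):
        """Goal: 'Which blocks are?'"""
        return sorted(self.present.get(region, ()))

    def evict_region(self, region):
        """Goal: 'Evict region X'; returns the block indices that were evicted."""
        return sorted(self.present.pop(region, ()))


cache = RegionAwareCache()
for addr in (0x400, 0x440, 0x800):
    cache.fill(addr)
```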

29
Region-Based Caches

Start with conventional 16-way cache and replace tag array
Sector Caches
Hit rate suffers: 20% loss
Sector Pool Caches
High associativity: 48-way for matching a 16-way cache
Decoupled-Sector Caches
No coarse-grain info
Replacements require searching
No previous design is adequate
RegionTracker
Meets all requirements
But does not save as much tag resources

30
Sector Cache
[Figure: D-way Data, D-way Region Tags; labels: RVA, Data Array]

Reduced area and power
Increased miss rates (2.5% - 96% for 1kB sectors)
Replacement?

31
Sector Pool Cache
[Figure: D-way Data; labels: 1 DSR, Data Array]

M > D
Requires a highly associative cache (48-way) to achieve the same performance as RegionTracker

32
Decoupled-Sectored Cache

Has multiple block evictions
Requires scanning status array
No simple mechanism to avoid this
Does NOT expose region-level information

33
RegionTracker

In practice L < D
Decouple data and lookup organizations
Lower-associativity lookups with no hit-rate penalty
RegionTracker provides a complete solution

34
RegionTracker Cache
Block and region lookups: region tag, way per block
Evict region blocks lazily
Simplify replacement and reduce area: status per block, RVA set backpointer
Can be banked and partitioned
35
Region-Aware Cache: Performance vs. Area
[Figure: performance vs. area]

Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
SimICS, SimFlex, sampling, 2K regions

36
RegionTracker-RegionScout
[Figure: reduction in broadcasts; label: BlockScout]

One bit per region tag: known to be not shared
1KB regions, commercial workloads
512KB private L2 caches
Filters 41% of snoops at zero cost compared to a conventional cache

37
Directory Optimizations: Base Architecture
[Figure: base architecture; labels: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data DRAM]
38
Coherence Delegation
[Figure: labels: Ideal Path, Requesting Node, Directory Lookup, Remote L2 containing data]

Eliminate 3-hop overhead
Attract directory tracking to nodes

39
Optimization Engines: Predictors
Predictor Virtualization
[Figure: multiple CPUs, each with L1-I and L1-D caches, Interconnect, L2, Main Memory]
40
Motivating Trends

Chip multiprocessors
Space dedicated to predictors x number of processors
Larger predictor table
Increased performance
Memory hierarchies
Increased capacities

Use conventional memory hierarchies to store predictor information
41
PV Architecture
[Figure: Optimization Engine and Predictor Table; labels: index, entry, prediction]
42
PV Architecture
[Figure: Optimization Engine and Predictor Virtualization; labels: index, entry, prediction]
43
PV Architecture
[Figure: Optimization Engine, PVCache, MSHR, PVStart, index, PVProxy, L2, PVTable, Main Memory; labels: index, entry, prediction]
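
The idea of keeping the full predictor table in the memory hierarchy and only a small cache of it next to the optimization engine can be sketched as below. The names PVCache and PVTable come from the figure; everything else (the class, the LRU policy, write-back on eviction, the request counter) is an assumption, and PVProxy, PVStart and the MSHRs are not modeled.

```python
from collections import OrderedDict


class VirtualizedPredictor:
    def __init__(self, cache_entries=8):
        self.pv_cache = OrderedDict()   # small on-chip PVCache, kept in LRU order
        self.pv_table = {}              # full PVTable, modeled as living in memory
        self.cache_entries = cache_entries
        self.memory_requests = 0        # fetches sent to the memory hierarchy

    def _bring_in(self, index):
        if index in self.pv_cache:
            self.pv_cache.move_to_end(index)        # PVCache hit
            return
        self.memory_requests += 1                   # PVCache miss: fetch via L2
        if len(self.pv_cache) == self.cache_entries:
            victim, entry = self.pv_cache.popitem(last=False)
            self.pv_table[victim] = entry           # write the evicted entry back
        self.pv_cache[index] = self.pv_table.get(index)

    def lookup(self, index):
        """Return the predictor entry for index, or None if never trained."""
        self._bring_in(index)
        return self.pv_cache[index]

    def update(self, index, entry):
        self._bring_in(index)
        self.pv_cache[index] = entry


pv = VirtualizedPredictor(cache_entries=2)
pv.update(1, "a")
pv.update(2, "b")
pv.update(3, "c")        # PVCache is full: entry 1 is written back to the PVTable
restored = pv.lookup(1)  # fetched again from the PVTable
```

The optimization engine sees the same index-in, entry-out interface as with a dedicated table; only the latency of a PVCache miss differs.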
44
Virtualized Spatial Memory Streaming
Original prefetcher cost: 80KB
Virtualized prefetcher cost: < 1 KByte
Nearly identical performance
45
Region-Centric Memory Design