Title: Coarse-Grained Coherence
1. Coarse-Grained Coherence
- Mikko H. Lipasti
- Associate Professor
- Electrical and Computer Engineering
- University of Wisconsin Madison
- Joint work with Jason Cantin, IBM (Ph.D. '06)
- Natalie Enright Jerger
- Prof. Jim Smith
- Prof. Li-Shiuan Peh (Princeton)
http://www.ece.wisc.edu/pharm
2. Motivation
- Multiprocessors are commonplace
- Historically, glass house servers
- Now laptops, soon cell phones
- Most common multiprocessor
- Symmetric processors w/coherent caches
- Logical extension of time-shared uniprocessors
- Easy to program, reason about
- Not so easy to build
3. Coherence Granularity
- Track each individual word
- Too much overhead
- Track larger blocks
- 32B to 128B common
- Less overhead, exploit spatial locality
- Large blocks cause false sharing
- Solution: use multiple granularities
- Small blocks manage local read/write permissions
- Large blocks track global behavior
4. Coarse-Grained Coherence
- Initially
- Identify non-shared regions
- Decouple obtaining coherence permission from data transfer
- Filter snoops to reduce broadcast bandwidth
- Later
- Enable aggressive prefetching
- Optimize DRAM accesses
- Customize protocol, interconnect to match
5. Coarse-Grained Coherence
- Optimizations lead to
- Reduced memory miss latency
- Reduced cache-to-cache miss latency
- Reduced snoop bandwidth
- Fewer exposed cache misses
- Elimination of unnecessary DRAM reads
- Power savings on bus, interconnect, caches, and in DRAM
- World peace and an end to global warming
6. Coarse-Grained Coherence Tracking
- Memory is divided into coarse-grained regions
- Aligned, power-of-two multiple of cache line size
- Can range from two lines to a physical page
- A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
- Region Coherence Array (RCA)
7. Region Coherence Arrays
- Each entry has an address tag, state, and count of lines cached by the processor
- The region state indicates if the processor and/or other processors are sharing/modifying lines in the region
- Customize policy/protocol/interconnect to exploit the region state
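The RCA structure above can be sketched in a few lines of code. The region size, table geometry, and state names below are illustrative assumptions, not details from the talk:

```python
# Illustrative sketch of a Region Coherence Array (RCA): a cache-like,
# direct-mapped table of per-region entries. Region size, number of sets,
# and state names are assumptions made for this example.

REGION_BITS = 10  # 1KB regions; a power-of-two multiple of the line size

class RCAEntry:
    def __init__(self, tag):
        self.tag = tag           # region address tag
        self.state = "invalid"   # summarizes local/global sharing of the region
        self.line_count = 0      # lines from this region currently in the cache

class RegionCoherenceArray:
    def __init__(self, n_sets=256):
        self.n_sets = n_sets
        self.entries = [None] * n_sets  # direct-mapped for simplicity

    def region_of(self, addr):
        return addr >> REGION_BITS

    def lookup(self, addr):
        """Return the matching entry, or None on an RCA miss."""
        region = self.region_of(addr)
        entry = self.entries[region % self.n_sets]
        return entry if entry is not None and entry.tag == region else None

    def allocate(self, addr):
        region = self.region_of(addr)
        entry = RCAEntry(region)
        self.entries[region % self.n_sets] = entry
        return entry
```

On a cache miss the processor consults the RCA alongside the cache tags; a hit in a suitable region state lets the request skip the broadcast, as the following slides describe.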
8. Talk Outline
- Motivation
- Overview of Coarse-Grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
9. Unnecessary Broadcasts
10. Broadcast Snoop Reduction
- Identify requests that don't need a broadcast
- Send data requests directly to memory w/o broadcasting
- Reducing broadcast traffic
- Reducing memory latency
- Avoid sending non-data requests externally
Example
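The filtering decision can be sketched as follows; the region state names are illustrative assumptions, not the talk's exact protocol states:

```python
# Minimal sketch of broadcast snoop filtering: a miss whose region is known
# not to be cached (or not modified) elsewhere can skip the broadcast and be
# sent directly to memory. State names here are assumptions.

def needs_broadcast(region_state, is_write):
    """Return True if this cache miss must be broadcast for coherence."""
    if region_state is None:
        return True                 # region unknown: must broadcast
    if region_state == "externally_invalid":
        return False                # no other processor caches this region
    if region_state == "externally_clean" and not is_write:
        return False                # a read can be satisfied from memory
    return True                     # shared or externally-dirty: broadcast
```

Requests that return False go straight to memory, saving both broadcast bandwidth and the snoop-response wait.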
11. Simulator Evaluation
- PHARMsim: near-RTL, but written in C
- Execution-driven simulator built on top of SimOS-PPC
- Four 4-way superscalar out-of-order processors
- Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
- Separate address/data networks similar to Sun Fireplane
12. Workloads
- Scientific
- Ocean, Raytrace, Barnes
- Multiprogrammed
- SPECint2000_rate, SPECint95_rate
- Commercial (database, web)
- TPC-W, TPC-B, TPC-H
- SPECweb99, SPECjbb2000
13. Broadcasts Avoided
14. Execution Time
15. Summary
- Eliminates nearly all unnecessary broadcasts
- Reduces snoop activity by 65%
- Fewer broadcasts
- Fewer lookups
- Provides modest speedup
16. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
17. Prefetching in Multiprocessors
- Prefetching
- Anticipate future reference, fetch into cache
- Many prefetching heuristics possible
- Current systems: next-block, stride
- Proposed: skip pointer, content-based
- Some/many prefetched blocks are not used
- Multiprocessor complications
- Premature or unnecessary prefetches
- Permission thrashing if blocks are shared
- Separate study (ISPASS 2006)
18. Stealth Prefetching
- Lines from non-shared regions can be prefetched stealthily and efficiently
- Without disturbing other processors
- Without downgrades, invalidations
- Without preventing them from obtaining exclusive copies
- Without broadcasting prefetch requests
- Fetched from DRAM with low overhead
- Example
19. Stealth Prefetching
- After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
- These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
- After accessing the RCA, requests may obtain data from the buffer as they would from memory
- To access data, the region must be in a valid state and a broadcast unnecessary for coherent access
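The trigger logic can be sketched as below. The talk's threshold of 2 misses is used; the region size, per-region counter, and the cached-line bit mask (mentioned on a later backup slide) are filled in as illustrative assumptions:

```python
# Sketch of the stealth-prefetch trigger: after a threshold number of L2
# misses to a region, prefetch the region's remaining lines into the
# Stealth Data Prefetch Buffer, but only when the region is non-shared.

PREFETCH_THRESHOLD = 2
LINES_PER_REGION = 8   # illustrative; a region is a power-of-two group of lines

class StealthPrefetcher:
    def __init__(self):
        self.miss_count = {}   # per-region L2 miss counter

    def on_l2_miss(self, region, region_non_shared, cached_mask):
        """Return line indices to prefetch (bit set in cached_mask = already cached)."""
        n = self.miss_count.get(region, 0) + 1
        self.miss_count[region] = n
        if n >= PREFETCH_THRESHOLD and region_non_shared:
            return [i for i in range(LINES_PER_REGION)
                    if not (cached_mask >> i) & 1]
        return []              # below threshold, or region is shared: no prefetch
```

Because only non-shared regions qualify, the prefetch never broadcasts and never disturbs other processors' copies.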
20. L2 Misses Prefetched
21. Speedup
22. Summary
- Stealth Prefetching can prefetch data
- Stealthily
- Only non-shared data prefetched
- Prefetch requests not broadcast
- Aggressively
- Large regions prefetched at once, 80-90% timely
- Efficiently
- Piggybacked onto a demand request
- Fetched from DRAM in open-page mode
23. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
24. Power-Efficient DRAM Speculation
(Timeline: Broadcast Req → Snoop Tags → Send Resp, overlapped with the DRAM access)
- Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response
- Trading DRAM bandwidth for latency
- Wasting power
- Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily
25. DRAM Operations
26. Power-Efficient DRAM Speculation
- Direct memory requests are non-speculative
- Lines from externally-dirty regions are likely to be sourced from another processor's cache
- Region state can serve as a prediction
- Need not access DRAM speculatively
- Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors' caches
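The prediction reduces to a small decision function; the state names below are illustrative assumptions standing in for the RCA's actual region states:

```python
# Sketch of power-efficient DRAM speculation: use the region state to decide
# whether a read should access DRAM in parallel with the snoop, or wait for
# the snoop response. State names are assumptions for this example.

def speculate_dram_read(region_state):
    """Return True to start the DRAM read before the snoop response arrives."""
    if region_state == "externally_dirty":
        return False   # likely a cache-to-cache transfer: skip the wasted read
    if region_state == "non_shared":
        return True    # direct request: memory will supply the data anyway
    return True        # region unknown: speculate, accepting some wasted reads
```

Suppressing speculation only for externally-dirty regions targets exactly the reads most likely to be satisfied cache-to-cache, which is where the wasted DRAM power is concentrated.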
27. Useless DRAM Reads
28. Useful DRAM Reads
29. DRAM Reads Performed/Delayed
30. Summary
- Power-Efficient DRAM Speculation
- Can reduce DRAM reads by 20%, with less than 1% degradation in performance
- 7% slowdown with nonspeculative DRAM
- Nearly doubles interval between DRAM requests,
allowing modules to stay in low-power modes
longer
31. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
32. Chip Multiprocessor Interconnect
- Options
- Buses don't scale
- Crossbars too expensive
- Rings too slow
- Packet-switched mesh
- Attractive for all the same 1990s DSM reasons
- Scalable
- Low latency
- High link utilization
33. CMP Interconnection Networks
- But
- Cables/traces are now on-chip wires
- Fast, cheap, plentiful
- Short: 1 cycle per hop
- Router latency adds up
- 3-4 cycles per hop
- Store-and-forward
- Lots of activity/power
- Is this the right answer?
34. Circuit-Switched Interconnects
- Communication patterns
- Spatial locality to memory
- Pairwise communication
- Circuit-switched links
- Avoid switching/routing
- Reduce latency
- Save power?
- Poor utilization! Maybe OK
35. Router Design
- Switches consist of
- Configurable crossbar
- Configuration memory
- 4-stage router pipeline exposes only 1 cycle if circuit-switched
- Can also act as a packet-switched network
- Design details in CA Letters '07
36. Protocol Optimization
- Initial 3-hop miss establishes CS path
- Subsequent miss requests
- Sent directly on CS path to predicted owner
- Also in parallel to home node
- Predicted owner sources data early
- Directory acks update to sharing list
- Benefits
- Reduced 3-hop latency
- Less activity, less power
37. Hybrid Circuit Switching (1)
- Hybrid Circuit Switching improves performance by up to 7%
38. Hybrid Circuit Switching (2)
- Positive interaction in co-designed interconnect/protocol
- More circuit reuse → greater latency benefit
39. Summary
- Hybrid Circuit Switching
- Routing overhead eliminated
- Still enable high bandwidth when needed
- Co-designed protocol
- Optimize cache-to-cache transfers
- Substantial performance benefits
- To do: power analysis
40. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
41. Server Consolidation on CMPs
- CMP as consolidation platform
- Simplify system administration
- Save power, cost and physical infrastructure
- Study combinations of individual workloads in a full-system environment
- Micro-coded hypervisor schedules VMs
- See "An Evaluation of Server Consolidation Workloads for Multi-Core Designs" (IISWC 2007) for additional details
- Nugget: shared LLC is a big win
42. Virtual Proximity
- Interactions between VM scheduling, placement, and interconnect
- Goal: placement-agnostic scheduling
- Best workload balance
- Evaluate 3 scheduling policies
- Gang, Affinity and Load Balanced
- HCS provides virtual proximity
43. Scheduling Algorithms
- Gang Scheduling
- Co-schedules all threads of a VM
- No idle-cycle stealing
- Affinity Scheduling
- VMs assigned to neighboring cores
- Can steal idle cycles across VMs sharing core
- Load Balanced Scheduling
- Ready threads assigned to any core
- Any/all VMs can steal idle cycles
- Over time, VM fragments across chip
44.
- Load balancing wins with fast interconnect
- Affinity scheduling wins with slow interconnect
- HCS creates virtual proximity
45. Virtual Proximity Performance
- HCS able to provide virtual proximity
46.
- As physical distance (hop count) increases, HCS provides significantly lower latency
47. Summary
- Virtual Proximity (in submission)
- Enables placement-agnostic hypervisor scheduler
- Results
- Up to 17% better than affinity scheduling
- Idle cycle reduction: 84% over gang and 41% over affinity
- Low-latency interconnect mitigates the increase in L2 cache conflicts from load balancing
- L2 misses up by 10%, but execution time reduced by 11%
- A flexible, distributed address mapping combined with HCS outperforms a localized affinity-based memory mapping by an average of 7%
48. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
49. Circuit-Switched Snooping (1)
- Scalable, efficient broadcasting on an unordered network
- Remove the latency overhead of directory indirection
- Extend point-to-point circuit-switched links to trees
- Low-latency multicast via circuit-switched trees
- Help provide performance isolation, as requests do not share the same communication medium
50. Circuit-Switched Snooping (2)
- Extend Coarse Grain Coherence Tracking (CGCT)
- Remove unnecessary broadcasts
- Convert broadcasts to multicasts
- Effective in Server Consolidation Workloads
- Very few coherence requests to globally shared
data
51. Snooping Interconnect
- Switches consist of
- Configurable crossbar
- Configuration memory
- Circuits span two or more nodes, based on RCA
- Snooping occurs across circuits
- All sharers in region join circuit
- Each link can physically accommodate multiple
circuits
52. Circuit-Switched Snooping
- Use RCA to identify subsets of nodes that share data
- Create shared circuits among these nodes
- Design challenges
- Multi-drop, bidirectional circuits
- Memory ordering
- Results very much in progress
53. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
54. Research Group Overview
- Faculty: Mikko Lipasti, since 1999
- Current MS/PhD students
- Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
- Graduates, current employment:
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- IBM: Trey Cain, Jason Cantin, Brian Mestan
- AMD: Kevin Lepak
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
55. Current Focus Areas
- Multiprocessors
- Coherence protocol optimization
- Interconnection network design
- Fairness issues in hierarchical systems
- Microprocessor design
- Complexity-effective microarchitecture
- Scalable dynamic scheduling hardware
- Speculation reduction for power savings
- Transparent clock gating
- Domain-specific ISA extensions
- Software
- Java Virtual Machine run-time optimization
- Workload development and characterization
56. Funding
- National Science Foundation
- Intel Research Council
- IBM Faculty Partnership Awards
- IBM Shared University Research equipment
- Schneider ECE Faculty Fellowship
- UW Graduate School
57. Questions?
- http://www.ece.wisc.edu/pharm
58. Backup Slides
59. Region Coherence Arrays
- The regions are kept coherent with a protocol,
which summarizes the local and global state of
lines in the region
60. Region Coherence Arrays
- On cache misses, the region state is read to determine if a broadcast is necessary
- On external snoops, the region state is read to provide a region snoop response
- Piggybacked onto the conventional response
- Used to update other processors' region state
61. Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animation: P1 stores to 10000₂ and broadcasts an RFO; the snoop response "Owned, Region Owned" shows the region is not exclusive anymore, and the line and RCA states transition through Pending as P1 obtains a Modified copy and P0's copy is invalidated. Nodes shown: P0, P1, M0, M1.)
62. Overhead
- Storage for RCA
- Two bits in snoop response for region snoop response
- Region Externally Clean/Dirty
63. Overhead
- RCA maintains inclusion over caches
- RCA must respond correctly to external requests if lines are cached
- When regions are evicted from the RCA, their lines are evicted from the cache
- Replacement algorithm uses the line count to favor regions with no lines cached
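The replacement heuristic can be sketched as below; the (tag, line_count, lru_age) entry layout and the LRU tie-break are illustrative assumptions:

```python
# Sketch of the RCA replacement heuristic: prefer victim regions with no
# lines currently cached, since evicting them forces no line evictions
# for inclusion. Falls back to LRU when every candidate has cached lines.

def pick_victim(entries):
    """entries: list of (tag, line_count, lru_age); return the victim's tag."""
    empty = [e for e in entries if e[1] == 0]
    pool = empty if empty else entries       # prefer line-count-zero regions
    return max(pool, key=lambda e: e[2])[0]  # oldest (LRU) within the pool
```

Choosing a line-count-zero victim keeps the inclusion requirement cheap: no cached lines have to be flushed when that region's entry is replaced.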
64. Snoop Traffic Peak
65. Snoop Traffic Average
66. Snoop Traffic
- Peak snoop traffic is halved
- Average snoop traffic reduced by nearly two thirds
- The system is more scalable, and may effectively support more processors
67. Tag Lookups Filtered
- Coarse-Grain Coherence Tracking can be used to filter external snoops
- Send external requests to the RCA first
- If region valid and line-count nonzero, send external request to cache
- Reduces power consumption in the cache tag arrays
- Increases broadcast snoop latency
68. Tag Lookups Filtered
69. Line Evictions for Inclusion
70. L2 Miss Ratio Increase
71. Stealth Prefetching
- Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2)
- A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data
72. Stealth Prefetching
Prefetched lines are managed by a simple protocol
73. Prefetch Timeliness
74. Data Traffic
75. Period Between DRAM Requests
76. Switch design
77. Value-Aware Techniques
- Coherence misses in multiprocessors
- Store Value Locality (Lepak '03)
- Ensuring consistency
- Value-based checks (Cain '04)
- Reducing speculation
- Operand significance
- Create (nearly) nonspeculative execution schedule
- Java Virtual Machine runtime optimization (Su)
- Speculative optimizations (VEE '07)
78. Complexity-Effective Techniques
- Scalable dynamic scheduling hardware
- Half-price architecture (Kim '03)
- Macro-op scheduling (Kim '03)
- Operand significance (Gunadi)
- Scalable snoop-based coherence
- Coarse-grained coherence (Cantin '06)
- Circuit-switched coherence (Enright)
79. Power-Efficient Techniques
- Power-efficient techniques
- Reduced speculation (Gunadi)
- Clock gating (E. Hill)
- Transparent pipelines need fine-grained stalls
- Redistribute coarse-grained stall cycles
- Circuit-switched coherence (Enright)
- Reduce overhead of CMP cache coherence
- Improve latency, power
80. Cache Coherence Problem
(Diagram: two processors load A = 0 into their caches; one then executes Store A ← 1, leaving the other processor's cached copy stale.)
81. Cache Coherence Problem
(Diagram, continued: after Store A ← 1, a subsequent Load A on the other processor still sees the stale cached value A = 0 unless the copies are kept coherent.)
82. Snoopy Cache Coherence
- All cache misses broadcast on shared bus
- Processors and memory snoop and respond
- Cache block permissions enforced
- Multiple readers allowed (shared state)
- Only a single writer (exclusive state)
- Must upgrade block before writing to it
- Other copies invalidated
- Read/write-shared blocks bounce from cache to cache
- Migratory sharing
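The single-writer rule described above can be sketched in a few lines; modeling each cache as a plain dict is an illustrative simplification:

```python
# Minimal sketch of a snoopy write upgrade: the broadcast invalidates every
# other cached copy before the writer proceeds, so only a single writer
# holds the block. Caches are plain dicts {addr: state} for illustration.

def broadcast_write_upgrade(caches, writer, addr):
    """All caches snoop the upgrade; only the writer keeps a copy."""
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)        # other copies invalidated
    caches[writer][addr] = "exclusive"   # single writer gains write permission
```

Repeated upgrades by different processors on the same block produce exactly the cache-to-cache bouncing (migratory sharing) the slide describes.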
83. Example: Conventional Snooping
(Animation: P0 loads 10000₂, misses, and broadcasts a Read to P1 and memory; P1's tags snoop Invalid, memory supplies the data, and P0's line state goes Pending → Exclusive. Nodes shown: P0, P1, M0, M1.)
84. Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animation: P0's Read of 10000₂ is broadcast; the responses "Invalid, Region Not Shared" let P0's RCA record that P0 has exclusive access to the region while memory supplies the data.)
85. Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animation: P0 later loads 11000₂ from the same region; the RCA hits in the Exclusive region state, so a broadcast is unnecessary and the Read goes directly to memory.)
86. Impact on Execution Time
87. Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold
(Animation: P0's second miss in the region (Load 0x28) reaches the threshold; the demand Read carries a piggybacked Prefetch for the region's remaining lines (1100₂), which are returned to the Stealth Data Prefetch Buffer (SDPB).)
88. Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold
(Animation: P0's next access (Load 0x30) hits in the SDPB; the line is supplied from the buffer and moved into the cache with no external request.)
89. Communication Latencies

Latency (cycles)                  CC-NUMA          CMP
Local Cache Access                12               12
Remote Cache-to-Cache Transfer    12 + 21·H + 3    12 + 4·H + 3
Local Memory Access               150              150
Remote Memory Access              150 + 21·H + 2   150 + 4·H + 2
(H = hop count)
- Remote cache access is 2-5x faster in CMPs than NUMA machines
- Lower communication latencies allow for more flexible thread placement
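A quick check of the 2-5x claim, reading each remote-transfer row as "base + per-hop cost × H + overhead" cycles (this reading of the flattened table is an assumption):

```python
# Worked check of the remote cache-to-cache latencies above, interpreting
# each row as 12 + per_hop * H + 3 cycles.

def remote_c2c(hops, per_hop):
    return 12 + per_hop * hops + 3   # cache access + network traversal + transfer

H = 4                                # example hop count
numa = remote_c2c(H, per_hop=21)     # CC-NUMA: 12 + 21*4 + 3 = 99 cycles
cmp_ = remote_c2c(H, per_hop=4)      # CMP:     12 + 4*4  + 3 = 31 cycles
ratio = numa / cmp_                  # about 3.2x, inside the slide's 2-5x range
```

At small hop counts the fixed cache-access and transfer terms dominate and the ratio falls toward 2x; at large hop counts it approaches 21/4 ≈ 5x, matching the slide's range.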
90. Configuration

Simulation Parameters
Cores: 16 single-threaded, light-weight, in-order
Interconnect (baseline): 2-D packet-switched mesh, 3-cycle router pipeline
Interconnect: hybrid circuit-switched mesh, 4 circuits
L1 Cache: split I/D, 16KB each (2 cycles)
L2 Cache: private, 128KB (6 cycles)
L3 Cache: shared, 16MB (16 1MB banks), 12 cycles
Memory Latency: 150 cycles
Workload Mixes
Mix 1: TPC-W (4) + TPC-H (4)
Mix 2: TPC-W (4) + SPECjbb (4)
Mix 3: TPC-H (4) + SPECjbb (4)
91. Effect of Memory Placement
- Load balancing with HCS outperforms local placement
- Virtual proximity to memory home node