Title: Coarse-Grained Coherence
1. Coarse-Grained Coherence
- Mikko H. Lipasti
- Associate Professor
- Electrical and Computer Engineering
- University of Wisconsin Madison
- Joint work with Jason Cantin, IBM (Ph.D. '06)
- Natalie Enright Jerger
- Prof. Jim Smith
- Prof. Li-Shiuan Peh (Princeton)
http://www.ece.wisc.edu/pharm
2. Motivation
- Multiprocessors are commonplace
- Historically, glass house servers
- Now laptops, soon cell phones
- Most common multiprocessor
- Symmetric processors w/coherent caches
- Logical extension of time-shared uniprocessors
- Easy to program, reason about
- Not so easy to build
3. Coherence Granularity
- Track each individual word
- Too much overhead
- Track larger blocks
- 32B to 128B common
- Less overhead, exploit spatial locality
- Large blocks cause false sharing
- Solution: use multiple granularities
- Small blocks manage local read/write permissions
- Large blocks track global behavior
4. Coarse-Grained Coherence
- Initially
- Identify non-shared regions
- Decouple obtaining coherence permission from data transfer
- Filter snoops to reduce broadcast bandwidth
- Later
- Enable aggressive prefetching
- Optimize DRAM accesses
- Customize protocol, interconnect to match
5. Coarse-Grained Coherence
- Optimizations lead to
- Reduced memory miss latency
- Reduced cache-to-cache miss latency
- Reduced snoop bandwidth
- Fewer exposed cache misses
- Elimination of unnecessary DRAM reads
- Power savings on bus, interconnect, caches, and in DRAM
- World peace and an end to global warming
6. Coarse-Grained Coherence Tracking
- Memory is divided into coarse-grained regions
- Aligned, power-of-two multiple of cache line size
- Can range from two lines to a physical page
- A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
- Region Coherence Array (RCA)
7. Region Coherence Arrays
- Each entry has an address tag, state, and count of lines cached by the processor
- The region state indicates if the processor and/or other processors are sharing/modifying lines in the region
- Customize policy/protocol/interconnect to exploit the region state
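The RCA structure above can be sketched in a few lines of code. The region size, table geometry, and state names below are illustrative assumptions, not details from the talk:

```python
# Illustrative sketch of a Region Coherence Array (RCA): a cache-like,
# direct-mapped table of per-region entries. Region size, number of sets,
# and state names are assumptions made for this example.

REGION_BITS = 10  # 1KB regions; a power-of-two multiple of the line size

class RCAEntry:
    def __init__(self, tag):
        self.tag = tag           # region address tag
        self.state = "invalid"   # summarizes local/global sharing of the region
        self.line_count = 0      # lines from this region currently in the cache

class RegionCoherenceArray:
    def __init__(self, n_sets=256):
        self.n_sets = n_sets
        self.entries = [None] * n_sets  # direct-mapped for simplicity

    def region_of(self, addr):
        return addr >> REGION_BITS

    def lookup(self, addr):
        """Return the matching entry, or None on an RCA miss."""
        region = self.region_of(addr)
        entry = self.entries[region % self.n_sets]
        return entry if entry is not None and entry.tag == region else None

    def allocate(self, addr):
        region = self.region_of(addr)
        entry = RCAEntry(region)
        self.entries[region % self.n_sets] = entry
        return entry
```

On a cache miss the processor consults the RCA alongside the cache tags; a hit in a suitable region state lets the request skip the broadcast, as the following slides describe.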
8. Talk Outline
- Motivation
- Overview of Coarse-Grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
9. Unnecessary Broadcasts
10. Broadcast Snoop Reduction
- Identify requests that don't need a broadcast
- Send data requests directly to memory w/o broadcasting
- Reducing broadcast traffic
- Reducing memory latency
- Avoid sending non-data requests externally
Example
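The filtering decision can be sketched as follows; the region state names are illustrative assumptions, not the talk's exact protocol states:

```python
# Minimal sketch of broadcast snoop filtering: a miss whose region is known
# not to be cached (or not modified) elsewhere can skip the broadcast and be
# sent directly to memory. State names here are assumptions.

def needs_broadcast(region_state, is_write):
    """Return True if this cache miss must be broadcast for coherence."""
    if region_state is None:
        return True                 # region unknown: must broadcast
    if region_state == "externally_invalid":
        return False                # no other processor caches this region
    if region_state == "externally_clean" and not is_write:
        return False                # a read can be satisfied from memory
    return True                     # shared or externally-dirty: broadcast
```

Requests that return False go straight to memory, saving both broadcast bandwidth and the snoop-response wait.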
11. Simulator Evaluation
- PHARMsim: near-RTL, but written in C
- Execution-driven simulator built on top of SimOS-PPC
- Four 4-way superscalar out-of-order processors
- Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
- Separate address/data networks similar to Sun Fireplane
12. Workloads
- Scientific
- Ocean, Raytrace, Barnes
- Multiprogrammed
- SPECint2000_rate, SPECint95_rate
- Commercial (database, web)
- TPC-W, TPC-B, TPC-H
- SPECweb99, SPECjbb2000
13. Broadcasts Avoided
14. Execution Time
15. Summary
- Eliminates nearly all unnecessary broadcasts
- Reduces snoop activity by 65%
- Fewer broadcasts
- Fewer lookups
- Provides modest speedup
16. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
17. Prefetching in Multiprocessors
- Prefetching
- Anticipate future reference, fetch into cache
- Many prefetching heuristics possible
- Current systems: next-block, stride
- Proposed: skip pointer, content-based
- Some/many prefetched blocks are not used
- Multiprocessor complications
- Premature or unnecessary prefetches
- Permission thrashing if blocks are shared
- Separate study (ISPASS 2006)
18. Stealth Prefetching
- Lines from non-shared regions can be prefetched stealthily and efficiently
- Without disturbing other processors
- Without downgrades, invalidations
- Without preventing them from obtaining exclusive copies
- Without broadcasting prefetch requests
- Fetched from DRAM with low overhead
- Example
19. Stealth Prefetching
- After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
- These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
- After accessing the RCA, requests may obtain data from the buffer as they would from memory
- To access data, the region must be in a valid state and a broadcast unnecessary for coherent access
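The trigger logic can be sketched as below. The talk's threshold of 2 misses is used; the region size, per-region counter, and the cached-line bit mask (mentioned on a later backup slide) are filled in as illustrative assumptions:

```python
# Sketch of the stealth-prefetch trigger: after a threshold number of L2
# misses to a region, prefetch the region's remaining lines into the
# Stealth Data Prefetch Buffer, but only when the region is non-shared.

PREFETCH_THRESHOLD = 2
LINES_PER_REGION = 8   # illustrative; a region is a power-of-two group of lines

class StealthPrefetcher:
    def __init__(self):
        self.miss_count = {}   # per-region L2 miss counter

    def on_l2_miss(self, region, region_non_shared, cached_mask):
        """Return line indices to prefetch (bit set in cached_mask = already cached)."""
        n = self.miss_count.get(region, 0) + 1
        self.miss_count[region] = n
        if n >= PREFETCH_THRESHOLD and region_non_shared:
            return [i for i in range(LINES_PER_REGION)
                    if not (cached_mask >> i) & 1]
        return []              # below threshold, or region is shared: no prefetch
```

Because only non-shared regions qualify, the prefetch never broadcasts and never disturbs other processors' copies.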
20. L2 Misses Prefetched
21. Speedup
22. Summary
- Stealth Prefetching can prefetch data
- Stealthily
- Only non-shared data prefetched
- Prefetch requests not broadcast
- Aggressively
- Large regions prefetched at once, 80-90% timely
- Efficiently
- Piggybacked onto a demand request
- Fetched from DRAM in open-page mode
23. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
24. Power-Efficient DRAM Speculation
(Timeline: Broadcast Req → Snoop Tags → Send Resp, overlapped with the DRAM access)
- Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response
- Trading DRAM bandwidth for latency
- Wasting power
- Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily
25. DRAM Operations
26. Power-Efficient DRAM Speculation
- Direct memory requests are non-speculative
- Lines from externally-dirty regions are likely to be sourced from another processor's cache
- Region state can serve as a prediction
- Need not access DRAM speculatively
- Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors' caches
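The prediction reduces to a small decision function; the state names below are illustrative assumptions standing in for the RCA's actual region states:

```python
# Sketch of power-efficient DRAM speculation: use the region state to decide
# whether a read should access DRAM in parallel with the snoop, or wait for
# the snoop response. State names are assumptions for this example.

def speculate_dram_read(region_state):
    """Return True to start the DRAM read before the snoop response arrives."""
    if region_state == "externally_dirty":
        return False   # likely a cache-to-cache transfer: skip the wasted read
    if region_state == "non_shared":
        return True    # direct request: memory will supply the data anyway
    return True        # region unknown: speculate, accepting some wasted reads
```

Suppressing speculation only for externally-dirty regions targets exactly the reads most likely to be satisfied cache-to-cache, which is where the wasted DRAM power is concentrated.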
27. Useless DRAM Reads
28. Useful DRAM Reads
29. DRAM Reads Performed/Delayed
30. Summary
- Power-Efficient DRAM Speculation
- Can reduce DRAM reads by 20%, with less than 1% degradation in performance
- 7% slowdown with nonspeculative DRAM
- Nearly doubles interval between DRAM requests,
allowing modules to stay in low-power modes
longer
31. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
32. Chip Multiprocessor Interconnect
- Options
- Buses don't scale
- Crossbars too expensive
- Rings too slow
- Packet-switched mesh
- Attractive for all the same 1990s DSM reasons
- Scalable
- Low latency
- High link utilization
33. CMP Interconnection Networks
- But
- Cables/traces are now on-chip wires
- Fast, cheap, plentiful
- Short: 1 cycle per hop
- Router latency adds up
- 3-4 cycles per hop
- Store-and-forward
- Lots of activity/power
- Is this the right answer?
34. Circuit-Switched Interconnects
- Communication patterns
- Spatial locality to memory
- Pairwise communication
- Circuit-switched links
- Avoid switching/routing
- Reduce latency
- Save power?
- Poor utilization! Maybe OK
35. Router Design
- Switches consist of
- Configurable crossbar
- Configuration memory
- 4-stage router pipeline exposes only 1 cycle if circuit-switched
- Can also act as a packet-switched network
- Design details in CA Letters '07
36. Protocol Optimization
- Initial 3-hop miss establishes CS path
- Subsequent miss requests
- Sent directly on CS path to predicted owner
- Also in parallel to home node
- Predicted owner sources data early
- Directory acks update to sharing list
- Benefits
- Reduced 3-hop latency
- Less activity, less power
37. Hybrid Circuit Switching (1)
- Hybrid Circuit Switching improves performance by up to 7%
38. Hybrid Circuit Switching (2)
- Positive interaction in co-designed interconnect/protocol
- More circuit reuse → greater latency benefit
39. Summary
- Hybrid Circuit Switching
- Routing overhead eliminated
- Still enable high bandwidth when needed
- Co-designed protocol
- Optimize cache-to-cache transfers
- Substantial performance benefits
- To do: power analysis
40. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
41. Server Consolidation on CMPs
- CMP as consolidation platform
- Simplify system administration
- Save power, cost and physical infrastructure
- Study combinations of individual workloads in a full-system environment
- Micro-coded hypervisor schedules VMs
- See "An Evaluation of Server Consolidation Workloads for Multi-Core Designs" (IISWC 2007) for additional details
- Nugget: shared LLC is a big win
42. Virtual Proximity
- Interactions between VM scheduling, placement, and interconnect
- Goal: placement-agnostic scheduling
- Best workload balance
- Evaluate 3 scheduling policies
- Gang, Affinity and Load Balanced
- HCS provides virtual proximity
43. Scheduling Algorithms
- Gang Scheduling
- Co-schedules all threads of a VM
- No idle-cycle stealing
- Affinity Scheduling
- VMs assigned to neighboring cores
- Can steal idle cycles across VMs sharing core
- Load Balanced Scheduling
- Ready threads assigned to any core
- Any/all VMs can steal idle cycles
- Over time, VM fragments across chip
44.
- Load balancing wins with fast interconnect
- Affinity scheduling wins with slow interconnect
- HCS creates virtual proximity
45. Virtual Proximity Performance
- HCS able to provide virtual proximity
46.
- As physical distance (hop count) increases, HCS provides significantly lower latency
47. Summary
- Virtual Proximity (in submission)
- Enables placement-agnostic hypervisor scheduler
- Results
- Up to 17% better than affinity scheduling
- Idle cycle reduction: 84% over gang and 41% over affinity
- Low-latency interconnect mitigates the increase in L2 cache conflicts from load balancing
- L2 misses up by 10%, but execution time reduced by 11%
- A flexible, distributed address mapping combined with HCS outperforms a localized affinity-based memory mapping by an average of 7%
48. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
49. Circuit-Switched Snooping (1)
- Scalable, efficient broadcasting on an unordered network
- Remove the latency overhead of directory indirection
- Extend point-to-point circuit-switched links to trees
- Low-latency multicast via circuit-switched trees
- Help provide performance isolation, as requests do not share the same communication medium
50. Circuit-Switched Snooping (2)
- Extend Coarse Grain Coherence Tracking (CGCT)
- Remove unnecessary broadcasts
- Convert broadcasts to multicasts
- Effective in Server Consolidation Workloads
- Very few coherence requests to globally shared
data
51. Snooping Interconnect
- Switches consist of
- Configurable crossbar
- Configuration memory
- Circuits span two or more nodes, based on RCA
- Snooping occurs across circuits
- All sharers in region join circuit
- Each link can physically accommodate multiple
circuits
52. Circuit-Switched Snooping
- Use RCA to identify subsets of nodes that share data
- Create shared circuits among these nodes
- Design challenges
- Multi-drop, bidirectional circuits
- Memory ordering
- Results very much in progress
53. Talk Outline
- Motivation
- Overview of Coarse-grained Coherence
- Techniques
- Broadcast Snoop Reduction (ISCA 2005)
- Stealth Prefetching (ASPLOS 2006)
- Power-Efficient DRAM Speculation
- Hybrid Circuit Switching
- Virtual Proximity
- Circuit-switched snooping
- Research Group Overview
54. Research Group Overview
- Faculty: Mikko Lipasti, since 1999
- Current MS/PhD students
- Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
- Graduates, current employment:
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- IBM: Trey Cain, Jason Cantin, Brian Mestan
- AMD: Kevin Lepak
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
55. Current Focus Areas
- Multiprocessors
- Coherence protocol optimization
- Interconnection network design
- Fairness issues in hierarchical systems
- Microprocessor design
- Complexity-effective microarchitecture
- Scalable dynamic scheduling hardware
- Speculation reduction for power savings
- Transparent clock gating
- Domain-specific ISA extensions
- Software
- Java Virtual Machine run-time optimization
- Workload development and characterization
56. Funding
- National Science Foundation
- Intel Research Council
- IBM Faculty Partnership Awards
- IBM Shared University Research equipment
- Schneider ECE Faculty Fellowship
- UW Graduate School
57. Questions?
- http://www.ece.wisc.edu/pharm
58. Backup Slides
59. Region Coherence Arrays
- The regions are kept coherent with a protocol,
which summarizes the local and global state of
lines in the region
60. Region Coherence Arrays
- On cache misses, the region state is read to determine if a broadcast is necessary
- On external snoops, the region state is read to provide a region snoop response
- Piggybacked onto the conventional response
- Used to update other processors' region state
61. Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animation: P1 stores to 10000₂ and broadcasts an RFO; the snoop response "Owned, Region Owned" shows the region is not exclusive anymore, and the line and RCA states transition through Pending as P1 obtains a Modified copy and P0's copy is invalidated. Nodes shown: P0, P1, M0, M1.)
62. Overhead
- Storage for RCA
- Two bits in snoop response for region snoop response
- Region Externally Clean/Dirty
63. Overhead
- RCA maintains inclusion over caches
- RCA must respond correctly to external requests if lines are cached
- When regions are evicted from the RCA, their lines are evicted from the cache
- Replacement algorithm uses the line count to favor regions with no lines cached
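The replacement heuristic can be sketched as below; the (tag, line_count, lru_age) entry layout and the LRU tie-break are illustrative assumptions:

```python
# Sketch of the RCA replacement heuristic: prefer victim regions with no
# lines currently cached, since evicting them forces no line evictions
# for inclusion. Falls back to LRU when every candidate has cached lines.

def pick_victim(entries):
    """entries: list of (tag, line_count, lru_age); return the victim's tag."""
    empty = [e for e in entries if e[1] == 0]
    pool = empty if empty else entries       # prefer line-count-zero regions
    return max(pool, key=lambda e: e[2])[0]  # oldest (LRU) within the pool
```

Choosing a line-count-zero victim keeps the inclusion requirement cheap: no cached lines have to be flushed when that region's entry is replaced.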
64. Snoop Traffic Peak
65. Snoop Traffic Average
66. Snoop Traffic
- Peak snoop traffic is halved
- Average snoop traffic reduced by nearly two thirds
- The system is more scalable, and may effectively support more processors
67. Tag Lookups Filtered
- Coarse-Grain Coherence Tracking can be used to filter external snoops
- Send external requests to the RCA first
- If region valid and line-count nonzero, send external request to cache
- Reduces power consumption in the cache tag arrays
- Increases broadcast snoop latency
68. Tag Lookups Filtered
69. Line Evictions for Inclusion
70. L2 Miss Ratio Increase
71. Stealth Prefetching
- Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2)
- A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data
72. Stealth Prefetching
Prefetched lines are managed by a simple protocol
73. Prefetch Timeliness
74. Data Traffic
75. Period Between DRAM Requests
76. Switch design
77. Value-Aware Techniques
- Coherence misses in multiprocessors
- Store Value Locality (Lepak '03)
- Ensuring consistency
- Value-based checks (Cain '04)
- Reducing speculation
- Operand significance
- Create (nearly) nonspeculative execution schedule
- Java Virtual Machine runtime optimization (Su)
- Speculative optimizations (VEE '07)
78. Complexity-Effective Techniques
- Scalable dynamic scheduling hardware
- Half-price architecture (Kim '03)
- Macro-op scheduling (Kim '03)
- Operand significance (Gunadi)
- Scalable snoop-based coherence
- Coarse-grained coherence (Cantin '06)
- Circuit-switched coherence (Enright)
79. Power-Efficient Techniques
- Power-efficient techniques
- Reduced speculation (Gunadi)
- Clock gating (E. Hill)
- Transparent pipelines need fine-grained stalls
- Redistribute coarse-grained stall cycles
- Circuit-switched coherence (Enright)
- Reduce overhead of CMP cache coherence
- Improve latency, power
80. Cache Coherence Problem
(Diagram: two processors load A = 0 into their caches; one then executes Store A ← 1, leaving the other processor's cached copy stale.)
81. Cache Coherence Problem
(Diagram, continued: after Store A ← 1, a subsequent Load A on the other processor still sees the stale cached value A = 0 unless the copies are kept coherent.)
82. Snoopy Cache Coherence
- All cache misses broadcast on shared bus
- Processors and memory snoop and respond
- Cache block permissions enforced
- Multiple readers allowed (shared state)
- Only a single writer (exclusive state)
- Must upgrade block before writing to it
- Other copies invalidated
- Read/write-shared blocks bounce from cache to cache
- Migratory sharing
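The single-writer rule described above can be sketched in a few lines; modeling each cache as a plain dict is an illustrative simplification:

```python
# Minimal sketch of a snoopy write upgrade: the broadcast invalidates every
# other cached copy before the writer proceeds, so only a single writer
# holds the block. Caches are plain dicts {addr: state} for illustration.

def broadcast_write_upgrade(caches, writer, addr):
    """All caches snoop the upgrade; only the writer keeps a copy."""
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)        # other copies invalidated
    caches[writer][addr] = "exclusive"   # single writer gains write permission
```

Repeated upgrades by different processors on the same block produce exactly the cache-to-cache bouncing (migratory sharing) the slide describes.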
83. Example: Conventional Snooping
(Animation: P0 loads 10000₂, misses, and broadcasts a Read to P1 and memory; P1's tags snoop Invalid, memory supplies the data, and P0's line state goes Pending → Exclusive. Nodes shown: P0, P1, M0, M1.)
84. Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animation: P0's Read of 10000₂ is broadcast; the responses "Invalid, Region Not Shared" let P0's RCA record that P0 has exclusive access to the region while memory supplies the data.)
85. Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animation: P0 later loads 11000₂ from the same region; the RCA hits in the Exclusive region state, so a broadcast is unnecessary and the Read goes directly to memory.)
86. Impact on Execution Time
87. Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold
(Animation: P0's second miss in the region (Load 0x28) reaches the threshold; the demand Read carries a piggybacked Prefetch for the region's remaining lines (1100₂), which are returned to the Stealth Data Prefetch Buffer (SDPB).)
88. Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold
(Animation: P0's next access (Load 0x30) hits in the SDPB; the line is supplied from the buffer and moved into the cache with no external request.)
89. Communication Latencies

Latency (cycles)                  CC-NUMA          CMP
Local Cache Access                12               12
Remote Cache-to-Cache Transfer    12 + 21·H + 3    12 + 4·H + 3
Local Memory Access               150              150
Remote Memory Access              150 + 21·H + 2   150 + 4·H + 2
(H = hop count)
- Remote cache access is 2-5x faster in CMPs than NUMA machines
- Lower communication latencies allow for more flexible thread placement
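A quick check of the 2-5x claim, reading each remote-transfer row as "base + per-hop cost × H + overhead" cycles (this reading of the flattened table is an assumption):

```python
# Worked check of the remote cache-to-cache latencies above, interpreting
# each row as 12 + per_hop * H + 3 cycles.

def remote_c2c(hops, per_hop):
    return 12 + per_hop * hops + 3   # cache access + network traversal + transfer

H = 4                                # example hop count
numa = remote_c2c(H, per_hop=21)     # CC-NUMA: 12 + 21*4 + 3 = 99 cycles
cmp_ = remote_c2c(H, per_hop=4)      # CMP:     12 + 4*4  + 3 = 31 cycles
ratio = numa / cmp_                  # about 3.2x, inside the slide's 2-5x range
```

At small hop counts the fixed cache-access and transfer terms dominate and the ratio falls toward 2x; at large hop counts it approaches 21/4 ≈ 5x, matching the slide's range.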
90. Configuration

Simulation Parameters
Cores: 16 single-threaded, light-weight, in-order
Interconnect (baseline): 2-D packet-switched mesh, 3-cycle router pipeline
Interconnect: hybrid circuit-switched mesh, 4 circuits
L1 Cache: split I/D, 16KB each (2 cycles)
L2 Cache: private, 128KB (6 cycles)
L3 Cache: shared, 16MB (16 1MB banks), 12 cycles
Memory Latency: 150 cycles
Workload Mixes
Mix 1: TPC-W (4) + TPC-H (4)
Mix 2: TPC-W (4) + SPECjbb (4)
Mix 3: TPC-H (4) + SPECjbb (4)
91. Effect of Memory Placement
- Load balancing with HCS outperforms local placement
- Virtual proximity to memory home node