Coarse-Grained Coherence - PowerPoint PPT Presentation

1
Coarse-Grained Coherence
  • Mikko H. Lipasti
  • Associate Professor
  • Electrical and Computer Engineering
  • University of Wisconsin-Madison
  • Joint work with Jason Cantin, IBM (Ph.D. '06)
  • Natalie Enright Jerger
  • Prof. Jim Smith
  • Prof. Li-Shiuan Peh (Princeton)

http://www.ece.wisc.edu/pharm
2
Motivation
  • Multiprocessors are commonplace
  • Historically, glass house servers
  • Now laptops, soon cell phones
  • Most common multiprocessor:
  • Symmetric processors w/ coherent caches
  • Logical extension of time-shared uniprocessors
  • Easy to program, reason about
  • Not so easy to build

3
Coherence Granularity
  • Track each individual word
  • Too much overhead
  • Track larger blocks
  • 32B-128B common
  • Less overhead, exploit spatial locality
  • Large blocks cause false sharing
  • Solution: use multiple granularities
  • Small blocks: manage local read/write permissions
  • Large blocks: track global behavior
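As a concrete illustration of the two granularities, here is a minimal Python sketch of address decomposition, assuming 64B lines and 1KB regions (names and sizes are illustrative, not the evaluated configuration):

```python
# Hypothetical sketch: splitting an address into line- and region-granularity
# pieces. Regions are aligned, power-of-two multiples of the line size.
LINE_BITS = 6      # 64B cache line
REGION_BITS = 10   # 1KB region = 16 lines

def line_base(addr):
    """Base address of the cache line containing addr."""
    return addr & ~((1 << LINE_BITS) - 1)

def region_base(addr):
    """Base address of the coarse-grained region containing addr."""
    return addr & ~((1 << REGION_BITS) - 1)

def line_index_in_region(addr):
    """Which of the region's lines the address falls in."""
    return (addr & ((1 << REGION_BITS) - 1)) >> LINE_BITS
```

Per slide 6, any power-of-two region size from two lines up to a physical page decomposes the same way.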

4
Coarse-Grained Coherence
  • Initially
  • Identify non-shared regions
  • Decouple obtaining coherence permission from data
    transfer
  • Filter snoops to reduce broadcast bandwidth
  • Later
  • Enable aggressive prefetching
  • Optimize DRAM accesses
  • Customize protocol, interconnect to match

5
Coarse-Grained Coherence
  • Optimizations lead to
  • Reduced memory miss latency
  • Reduced cache-to-cache miss latency
  • Reduced snoop bandwidth
  • Fewer exposed cache misses
  • Elimination of unnecessary DRAM reads
  • Power savings on bus, interconnect, caches, and
    in DRAM
  • World peace and end to global warming

6
Coarse-Grained Coherence Tracking
  • Memory is divided into coarse-grained regions
  • Aligned, power-of-two multiple of cache line size
  • Can range from two lines to a physical page
  • A cache-like structure is added to each processor
    for monitoring coherence at the granularity of
    regions
  • Region Coherence Array (RCA)

7
Region Coherence Arrays
  • Each entry has an address tag, state, and count
    of lines cached by the processor
  • The region state indicates if the processor and /
    or other processors are sharing / modifying lines
    in the region
  • Customize policy/protocol/interconnect to exploit
    region state
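The entry structure above can be modeled with a small Python sketch; the field names, state strings, and fully-associative organization are simplifying assumptions, not the exact hardware design:

```python
# Hypothetical model of a Region Coherence Array: each entry holds an
# address tag, a region state, and a count of lines cached locally.
from dataclasses import dataclass

@dataclass
class RCAEntry:
    tag: int          # region address tag
    state: str        # e.g. 'exclusive', 'externally-dirty' (illustrative)
    line_count: int   # lines from this region in the local cache

class RegionCoherenceArray:
    def __init__(self, region_bits=10):
        self.region_bits = region_bits
        self.entries = {}                  # tag -> RCAEntry (fully associative)

    def lookup(self, addr):
        return self.entries.get(addr >> self.region_bits)

    def install(self, addr, state):
        tag = addr >> self.region_bits
        self.entries[tag] = RCAEntry(tag, state, 0)
        return self.entries[tag]
```

Any two addresses in the same 1KB region map to the same entry, so one entry summarizes many lines.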

8
Talk Outline
  • Motivation
  • Overview of Coarse-Grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

9
Unnecessary Broadcasts
10
Broadcast Snoop Reduction
  • Identify requests that don't need a broadcast
  • Send data requests directly to memory w/o
    broadcasting
  • Reducing broadcast traffic
  • Reducing memory latency
  • Avoid sending non-data requests externally
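The decision above can be sketched as a predicate over the region state; the state names here are illustrative placeholders, not the protocol's actual encoding:

```python
# Hypothetical sketch of broadcast snoop reduction: consult the Region
# Coherence Array and skip the broadcast when the region state proves
# no other processor could have a conflicting copy.
def needs_broadcast(region_state, is_write):
    """region_state is None when the region misses in the RCA."""
    if region_state is None:
        return True                    # unknown region: must snoop everyone
    if region_state == 'not-shared':
        return False                   # go directly to memory, no broadcast
    if region_state == 'shared-clean' and not is_write:
        return False                   # clean shared reads also skip it
    return True
```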

Example
11
Simulator Evaluation
  • PHARMsim: near-RTL detail, but written in C
  • Execution-driven simulator built on top of
    SimOS-PPC
  • Four 4-way superscalar out-of-order processors
  • Two-level hierarchy with split L1, unified 1MB L2
    caches, and 64B lines
  • Separate address / data networks similar to Sun
    Fireplane

12
Workloads
  • Scientific
  • Ocean, Raytrace, Barnes
  • Multiprogrammed
  • SPECint2000_rate, SPECint95_rate
  • Commercial (database, web)
  • TPC-W, TPC-B, TPC-H
  • SPECweb99, SPECjbb2000

13
Broadcasts Avoided
14
Execution Time
15
Summary
  • Eliminates nearly all unnecessary broadcasts
  • Reduces snoop activity by 65%
  • Fewer broadcasts
  • Fewer lookups
  • Provides modest speedup

16
Talk Outline
  • Motivation
  • Overview of Coarse-grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

17
Prefetching in Multiprocessors
  • Prefetching
  • Anticipate future reference, fetch into cache
  • Many prefetching heuristics possible
  • Current systems: next-block, stride
  • Proposed: skip pointer, content-based
  • Some/many prefetched blocks are not used
  • Multiprocessor complications
  • Premature or unnecessary prefetches
  • Permission thrashing if blocks are shared
  • Separate study [ISPASS 2006]

18
Stealth Prefetching
  • Lines from non-shared regions can be prefetched
    stealthily and efficiently
  • Without disturbing other processors
  • Without downgrades, invalidations
  • Without preventing them from obtaining exclusive
    copies
  • Without broadcasting prefetch requests
  • Fetched from DRAM with low overhead
  • Example

19
Stealth Prefetching
  • After a threshold number of L2 misses (2), the
    rest of the lines from a region are prefetched
  • These lines are buffered close to the processor
    for later use (Stealth Data Prefetch Buffer)
  • After accessing the RCA, requests may obtain data
    from the buffer as they would from memory
  • To access data, region must be in valid state and
    a broadcast unnecessary for coherent access
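The threshold trigger described above can be sketched as follows; the dictionary-based bookkeeping and region size are illustrative assumptions:

```python
# Hypothetical sketch of the stealth prefetch trigger: on the threshold
# (second) demand L2 miss to a non-shared region, fetch the region's
# remaining lines into the Stealth Data Prefetch Buffer.
PREFETCH_THRESHOLD = 2
LINES_PER_REGION = 16

def on_l2_miss(missed_lines, prefetch_buffer, region, line, region_not_shared):
    missed_lines.setdefault(region, set()).add(line)
    if region_not_shared and len(missed_lines[region]) == PREFETCH_THRESHOLD:
        # prefetch the rest of the region, skipping lines already demanded
        prefetch_buffer[region] = (set(range(LINES_PER_REGION))
                                   - missed_lines[region])
```

Because the region is known to be non-shared, the prefetch needs no broadcast and disturbs no other processor.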

20
L2 Misses Prefetched
21
Speedup
22
Summary
  • Stealth Prefetching can prefetch data
  • Stealthily
  • Only non-shared data prefetched
  • Prefetch requests not broadcast
  • Aggressively
  • Large regions prefetched at once, 80-90% timely
  • Efficiently
  • Piggybacked onto a demand request
  • Fetched from DRAM in open-page mode

23
Talk Outline
  • Motivation
  • Overview of Coarse-grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

24
Power-Efficient DRAM Speculation
(Timeline: broadcast request, snoop tags, send response, overlapped with the speculative DRAM access)
  • Modern systems overlap the DRAM access with the
    snoop, speculatively accessing DRAM before snoop
    response
  • Trading DRAM bandwidth for latency
  • Wasting power
  • Approximately 25% of DRAM requests are reads that
    speculatively access DRAM unnecessarily

25
DRAM Operations
26
Power-Efficient DRAM Speculation
  • Direct memory requests are non-speculative
  • Lines from externally-dirty regions are likely to be
    sourced from another processor's cache
  • Region state can serve as a prediction
  • Need not access DRAM speculatively
  • Initial requests to a region (state unknown) have
    a lower but significant probability of obtaining
    data from other processors' caches
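The prediction above reduces to a one-line predicate; the state name is an illustrative stand-in for the actual region-protocol state:

```python
# Hypothetical sketch of power-efficient DRAM speculation: use region
# state to decide whether to start the DRAM read before the snoop
# response arrives.
def access_dram_before_snoop(region_state):
    # Externally-dirty regions will likely be sourced cache-to-cache,
    # so an early DRAM read would be wasted power. Unknown regions
    # (None) still speculate, since most are memory-sourced.
    return region_state != 'externally-dirty'
```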

27
Useless DRAM Reads
28
Useful DRAM Reads
29
DRAM Reads Performed/Delayed
30
Summary
  • Power-Efficient DRAM Speculation
  • Can reduce DRAM reads by 20%, with less than 1%
    performance degradation
  • 7% slowdown with nonspeculative DRAM
  • Nearly doubles interval between DRAM requests,
    allowing modules to stay in low-power modes
    longer

31
Talk Outline
  • Motivation
  • Overview of Coarse-grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

32
Chip Multiprocessor Interconnect
  • Options
  • Buses don't scale
  • Crossbars too expensive
  • Rings too slow
  • Packet-switched mesh
  • Attractive for all the same 1990s DSM reasons
  • Scalable
  • Low latency
  • High link utilization

33
CMP Interconnection Networks
  • But
  • Cables/traces are now on-chip wires
  • Fast, cheap, plentiful
  • Short: 1 cycle per hop
  • Router latency adds up
  • 3-4 cycles per hop
  • Store-and-forward
  • Lots of activity/power
  • Is this the right answer?

34
Circuit-Switched Interconnects
  • Communication patterns
  • Spatial locality to memory
  • Pairwise communication
  • Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?
  • Poor utilization! Maybe OK

35
Router Design
  • Switches consist of
  • Configurable crossbar
  • Configuration memory
  • 4-stage router pipeline exposes only 1 cycle if
    circuit-switched
  • Can also act as packet-switched network
  • Design details in [CA Letters '07]

36
Protocol Optimization
  • Initial 3-hop miss establishes CS path
  • Subsequent miss requests
  • Sent directly on CS path to predicted owner
  • Also in parallel to home node
  • Predicted owner sources data early
  • Directory acks update to sharing list
  • Benefits
  • Reduced 3-hop latency
  • Less activity, less power

37
Hybrid Circuit Switching (1)
  • Hybrid Circuit Switching improves performance by
    up to 7%

38
Hybrid Circuit Switching (2)
  • Positive interaction in co-designed interconnect
    protocol
  • More circuit reuse → greater latency benefit

39
Summary
  • Hybrid Circuit Switching
  • Routing overhead eliminated
  • Still enable high bandwidth when needed
  • Co-designed protocol
  • Optimize cache-to-cache transfers
  • Substantial performance benefits
  • To do: power analysis

40
Talk Outline
  • Motivation
  • Overview of Coarse-grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

41
Server Consolidation on CMPs
  • CMP as consolidation platform
  • Simplify system administration
  • Save power, cost and physical infrastructure
  • Study combinations of individual workloads in
    full system environment
  • Micro-coded hypervisor schedules VMs
  • See "An Evaluation of Server Consolidation
    Workloads for Multi-Core Designs" [IISWC 2007]
    for additional details
  • Nugget: shared LLC is a big win

42
Virtual Proximity
  • Interactions between VM scheduling, placement,
    and interconnect
  • Goal: placement-agnostic scheduling
  • Best workload balance
  • Evaluate 3 scheduling policies
  • Gang, Affinity and Load Balanced
  • HCS provides virtual proximity

43
Scheduling Algorithms
  • Gang Scheduling
  • Co-schedules all threads of a VM
  • No idle-cycle stealing
  • Affinity Scheduling
  • VMs assigned to neighboring cores
  • Can steal idle cycles across VMs sharing core
  • Load Balanced Scheduling
  • Ready threads assigned to any core
  • Any/all VMs can steal idle cycles
  • Over time, VM fragments across chip
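The three policies above can be contrasted in a simplified one-quantum scheduler sketch; the VM/core representation and idle-cycle handling are deliberately simplified assumptions:

```python
# Hypothetical sketch of the three VM scheduling policies (one quantum).
# vms: list of (vm_id, ready_threads, home_cores); returns {core: vm_id}.
def schedule(policy, vms, num_cores):
    assignment = {}
    if policy == 'gang':
        # Co-schedule all of a VM's threads or none; idle cycles go unused.
        for vm_id, ready, home in vms:
            if ready == len(home):
                for c in home:
                    assignment[c] = vm_id
    elif policy == 'affinity':
        # VMs stay on neighboring home cores; partial runs allowed
        # (cross-VM idle-cycle stealing on a shared core is not modeled).
        for vm_id, ready, home in vms:
            for c in home[:ready]:
                assignment[c] = vm_id
    elif policy == 'load-balanced':
        # Any ready thread on any free core; VMs fragment across the chip.
        free = list(range(num_cores))
        for vm_id, ready, home in vms:
            for _ in range(min(ready, len(free))):
                assignment[free.pop(0)] = vm_id
    return assignment
```

Load balancing keeps every core busy but scatters a VM's threads, which is exactly where the low-latency HCS paths pay off.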

44
  • Load balancing wins with fast interconnect
  • Affinity scheduling wins with slow interconnect
  • HCS creates virtual proximity

45
Virtual Proximity Performance
  • HCS able to provide virtual proximity

46
  • As physical distance (hop count) increases, HCS
    provides significantly lower latency

47
Summary
  • Virtual Proximity (in submission)
  • Enables placement-agnostic hypervisor scheduler
  • Results
  • Up to 17% better than affinity scheduling
  • Idle-cycle reduction: 84% over gang and 41% over
    affinity
  • Low-latency interconnect mitigates increase in L2
    cache conflicts from load balancing
  • L2 misses up by 10%, but execution time reduced by
    11%
  • A flexible, distributed address mapping combined
    with HCS outperforms a localized affinity-based
    memory mapping by an average of 7%

48
Talk Outline
  • Motivation
  • Overview of Coarse-grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

49
Circuit Switched Snooping (1)
  • Scalable, efficient broadcasting on unordered
    network
  • Remove latency overhead of directory indirection
  • Extend point-to-point circuit-switched links to
    trees
  • Low latency multicast via circuit-switched tree
  • Help provide performance isolation, as requests do
    not share the same communication medium

50
Circuit-Switched Snooping (2)
  • Extend Coarse Grain Coherence Tracking (CGCT)
  • Remove unnecessary broadcasts
  • Convert broadcasts to multicasts
  • Effective in Server Consolidation Workloads
  • Very few coherence requests to globally shared
    data

51
Snooping Interconnect
  • Switches consist of
  • Configurable crossbar
  • Configuration memory
  • Circuits span two or more nodes, based on RCA
  • Snooping occurs across circuits
  • All sharers in region join circuit
  • Each link can physically accommodate multiple
    circuits

52
Circuit-Switched Snooping
  • Use RCA to identify subsets of nodes that share
    data
  • Create shared circuits among these nodes
  • Design challenges
  • Multi-drop, bidirectional circuits
  • Memory ordering
  • Results very much in progress

53
Talk Outline
  • Motivation
  • Overview of Coarse-grained Coherence
  • Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
  • Research Group Overview

54
Research Group Overview
  • Faculty: Mikko Lipasti, since 1999
  • Current MS/PhD students
  • Gordie Bell (also IBM), Natalie Enright Jerger,
    Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su,
    Dana Vantrease
  • Graduates, current employment
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha,
    Madhu Seshadri
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • AMD: Kevin Lepak
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan,
    Pranay Koka

55
Current Focus Areas
  • Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
  • Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
  • Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization

56
Funding
  • National Science Foundation
  • Intel Research Council
  • IBM Faculty Partnership Awards
  • IBM Shared University Research equipment
  • Schneider ECE Faculty Fellowship
  • UW Graduate School

57
Questions?
  • http://www.ece.wisc.edu/pharm

58
Backup Slides
59
Region Coherence Arrays
  • The regions are kept coherent with a protocol,
    which summarizes the local and global state of
    lines in the region

60
Region Coherence Arrays
  • On cache misses, the region state is read to
    determine if a broadcast is necessary
  • On external snoops, the region state is read to
    provide a region snoop response
  • Piggybacked onto the conventional response
  • Used to update other processors' region state

61
Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animated diagram: P1 stores to 10000₂ and misses; the RFO is broadcast, the snoop hits in P0's cache, a response is sent, and the data is transferred cache-to-cache. Both Region Coherence Arrays update to Owned / Region Owned: the region is not exclusive anymore.)
62
Overhead
  • Storage for RCA
  • Two bits in snoop response for region snoop
    response
  • Region Externally Clean/Dirty

63
Overhead
  • RCA maintains inclusion over caches
  • RCA must respond correctly to external requests
    if lines cached
  • When regions evicted from RCA, their lines are
    evicted from the cache
  • Replacement algorithm uses line count to favor
    regions with no lines cached
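The replacement heuristic above can be sketched directly from the line count; the tie-breaking fallback is an illustrative assumption:

```python
# Hypothetical sketch of RCA victim selection: favor regions with no
# lines cached, so evicting the region forces no cache-line evictions
# (inclusion is preserved for free). Entries are (tag, line_count).
def pick_victim(set_entries):
    empty = [e for e in set_entries if e[1] == 0]
    if empty:
        return empty[0]
    # fallback (illustrative): evict the region with the fewest cached
    # lines, minimizing inclusion-driven evictions
    return min(set_entries, key=lambda e: e[1])
```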

64
Snoop Traffic Peak
65
Snoop Traffic Average
66
Snoop Traffic
  • Peak snoop traffic is halved
  • Average snoop traffic reduced by nearly two
    thirds
  • The system is more scalable, and may effectively
    support more processors

67
Tag Lookups Filtered
  • Coarse-Grain Coherence Tracking can be used to
    filter external snoops
  • Send external requests to RCA first
  • If region valid and line-count nonzero, send
    external request to cache
  • Reduces power consumption in the cache tag arrays
  • Increases broadcast snoop latency

68
Tag Lookups Filtered
69
Line Evictions for Inclusion
70
L2 Miss Ratio Increase
71
Stealth Prefetching
  • Lines from a region may be prefetched again after
    a threshold number of L2 misses (currently 2).
  • A bit mask of the lines cached since the last
    prefetch is used to avoid prefetching useless
    data
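The bit mask above can be sketched as follows, assuming 16 lines per region (illustrative):

```python
# Hypothetical sketch of the re-prefetch mask: bit i is set if line i
# was cached since the last prefetch, so only previously useful lines
# are fetched again.
LINES_PER_REGION = 16

def lines_to_prefetch(cached_since_last_mask):
    return [i for i in range(LINES_PER_REGION)
            if cached_since_last_mask & (1 << i)]
```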

72
Stealth Prefetching
Prefetched lines are managed by a simple protocol
73
Prefetch Timeliness
74
Data Traffic
75
Period Between DRAM Requests
76
Switch design
77
Value-Aware Techniques
  • Coherence misses in multiprocessors
  • Store Value Locality [Lepak '03]
  • Ensuring consistency
  • Value-based checks [Cain '04]
  • Reducing speculation
  • Operand significance
  • Create (nearly) nonspeculative execution schedule
  • Java Virtual Machine run-time optimization [Su]
  • Speculative optimizations [VEE '07]

78
Complexity-Effective Techniques
  • Scalable dynamic scheduling hardware
  • Half-price architecture [Kim '03]
  • Macro-op scheduling [Kim '03]
  • Operand significance [Gunadi]
  • Scalable snoop-based coherence
  • Coarse-grained coherence [Cantin '06]
  • Circuit-switched coherence [Enright]

79
Power-Efficient Techniques
  • Power-efficient techniques
  • Reduced speculation [Gunadi]
  • Clock gating [E. Hill]
  • Transparent pipelines need fine-grained stalls
  • Redistribute coarse-grained stall cycles
  • Circuit-switched coherence [Enright]
  • Reduce overhead of CMP cache coherence
  • Improve latency, power

80
Cache Coherence Problem
(Diagram: two processors each load A = 0 from memory; one then executes Store A ← 1, updating its own copy while the other still holds the stale 0.)
81
Cache Coherence Problem
(Diagram, continued: a third processor loads A; copies of A with values 0 and 1 now coexist across caches and memory, illustrating the cache coherence problem.)
82
Snoopy Cache Coherence
  • All cache misses broadcast on shared bus
  • Processors and memory snoop and respond
  • Cache block permissions enforced
  • Multiple readers allowed (shared state)
  • Only a single writer (exclusive state)
  • Must upgrade block before writing to it
  • Other copies invalidated
  • Read/write-shared blocks bounce from cache to
    cache
  • Migratory sharing
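The permission rules above can be sketched as two tiny transition functions; the state names follow the slide's shared/exclusive terminology, simplified:

```python
# Hypothetical sketch of snoopy write permissions: a writer must first
# upgrade to the exclusive state, invalidating all other copies.
def on_local_write(state):
    # shared or invalid copies must broadcast an upgrade before writing
    return 'proceed' if state == 'exclusive' else 'broadcast-upgrade'

def on_remote_write_snoop(state):
    # another processor's upgrade invalidates our copy
    return 'invalid'
```

Repeated upgrades and invalidations on read/write-shared blocks are exactly the cache-to-cache bouncing (migratory sharing) described above.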

83
Example: Conventional Snooping
(Animated diagram: P0 loads 10000₂ and misses; the read is broadcast, all tags are snooped and miss, the response is sent, and memory M0 supplies the data. P0's line becomes Exclusive.)
84
Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animated diagram: P0 loads 10000₂ and misses; the broadcast snoop finds no cached copies, P1's region snoop response is "Invalid, Region Not Shared", and memory supplies the data. P0's RCA records the region as Exclusive: P0 now has exclusive access to the region.)
85
Coarse-Grain Coherence Tracking
Region Coherence Array added, two lines per region
(Animated diagram: P0 later loads 11000₂ from the same region; the cache misses but the region hits in the RCA in Exclusive state, so a broadcast is unnecessary and a direct request is sent to memory M0.)
86
Impact on Execution Time
87
Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold.
(Animated diagram: P0 loads 0x28; the cache misses but the RCA hits, reaching the 2-miss threshold. A direct request "Read P0, 0x28 + Prefetch 1100₂" goes to memory M0; the demand data returns, and the prefetched lines fill the Stealth Data Prefetch Buffer (SDPB).)
88
Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold.
(Animated diagram, continued: P0 later loads 0x30; the cache misses, but the line hits in the Stealth Data Prefetch Buffer and the data is returned with no external request.)
89
Communication Latencies
                                  CC-NUMA          CMP
Local Cache Access                12               12
Remote Cache-to-Cache Transfer    12 + 21H + 3     12 + 4H + 3    (H = hop count)
Local Memory Access               150              150
Remote Memory Access              150 + 21H × 2    150 + 4H × 2
  • Remote cache access is 2-5x faster in CMPs than
    NUMA machines
  • Lower communication latencies allow for more
    flexible thread placement
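As a worked example of the latencies in the table (assuming the per-hop constants read off it: 21 cycles/hop for CC-NUMA, 4 cycles/hop on-chip):

```python
# Sketch of the remote cache-to-cache latency formula from the table:
# 12-cycle cache access + per-hop network cost + 3 cycles.
def remote_c2c_latency(per_hop_cycles, hops):
    return 12 + per_hop_cycles * hops + 3

cmp_lat = remote_c2c_latency(4, 3)    # CMP, 3 hops: 12 + 12 + 3 = 27
numa_lat = remote_c2c_latency(21, 3)  # CC-NUMA, 3 hops: 12 + 63 + 3 = 78
```

At 3 hops this gives roughly a 2.9x advantage to the CMP, consistent with the 2-5x claim above.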

90
Configuration
Simulation Parameters
  Cores            16 single-threaded, light-weight, in-order
  Interconnect     2-D packet-switched mesh, 3-cycle router pipeline (baseline)
  Interconnect     Hybrid circuit-switched mesh, 4 circuits
  L1 Cache         Split I/D, 16KB each (2 cycles)
  L2 Cache         Private, 128KB (6 cycles)
  L3 Cache         Shared, 16MB (16 1MB banks), 12 cycles
  Memory Latency   150 cycles

Workload Mixes
  Mix 1   TPC-W (4) + TPC-H (4)
  Mix 2   TPC-W (4) + SPECjbb (4)
  Mix 3   TPC-H (4) + SPECjbb (4)
91
Effect of Memory Placement
  • Load Balancing with HCS outperforms local
    placement
  • Virtual proximity to memory home node