1. Wire Aware Architecture
Naveen Muralimanohar
Advisor: Rajeev Balasubramonian
University of Utah
2. Effect of Technology Scaling
- Power wall
- Temperature wall
- Reliability issues
- Process variation
- Soft errors
- Wire scaling
- Communication is expensive but computation is cheap
3. Wire Delay: A Compelling Opportunity
- Existing proposals are indirect
- Hide wire delay: pre-fetching, speculative coherence, run-ahead execution
- Reduce communication to save power
- Wire-level optimizations are still limited to circuit designers
4. Thesis Statement
- The growing cost of on-chip wire delay requires a thorough understanding of wires.
- The dissertation advocates exposing wire properties to architects and proposes microarchitectural wire management.
5. Wire Delay/Power
- The Pentium 4 (at 90nm) spent two cycles to send a signal across the chip
- Wire delays are costly for both performance and power
- Latencies of 60 cycles to reach the ends of a chip at 32nm (at 5 GHz)
- 50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
6. Large Caches
[Figure: Intel Montecito die, with its two large cache regions highlighted]
- Cache hierarchies will dominate chip area
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
7. On-Chip Cache Challenges
[Chart: cache access time, calculated using CACTI, for 4 MB, 16 MB, and 64 MB caches at the 130nm (1X), 65nm (1.5X), and 32nm (2X) process nodes]
8. Effect of L2 Hit Time
[Chart: speedups on an aggressive out-of-order processor when the L2 hit time drops from 30 to 15 cycles; average improvement 17%]
9. Coherence Traffic
- CMPs have already become ubiquitous
- They require coherence among multiple cores
- Coherence operations entail frequent communication
- Different coherence messages have different latency and bandwidth needs
[Diagram: three cores with private L1s and a shared L2; read-miss messages (Read Req, Fwd Read Req to owner, Latest copy) and write-miss messages (Ex Req, Inval Req, Inv Ack)]
10. L1 Accesses
- Highly latency critical in aggressive out-of-order processors (such as a clustered processor)
- The choice of inter-cluster communication fabric has a high impact on performance
11. On-chip Traffic
[Diagram: 16-cluster processor (P0-P15), each cluster with its own I and D caches; two controllers connect the clusters to an L2 bank. Three traffic classes are marked: cache reads and writes, coherence transactions, and L1 accesses]
12. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
13. Wire Characteristics
- Wire resistance and capacitance per unit length are set by wire width and spacing (a first-order model follows)
[Figure: wire cross-section relating width and spacing to resistance, capacitance, and bandwidth]
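As a first-order reminder of why width and spacing matter, here is a standard textbook approximation (my sketch, not taken from the slides; \(\rho\) is resistivity, \(W\) width, \(T\) thickness, \(S\) spacing to the neighboring wire, \(H\) the distance to the layers above/below):

\[
R_{wire} = \frac{\rho}{W\,T},
\qquad
C_{wire} \approx \varepsilon_0\left(\frac{2\,\varepsilon_{horiz}\,T}{S} + \frac{2\,\varepsilon_{vert}\,W}{H}\right)
\]

Widening a wire lowers resistance (and hence RC delay), and extra spacing lowers the side-wall component of capacitance, but both consume routing tracks, so bandwidth per unit area falls.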
14. Design Space Exploration
- Tuning wire width and spacing
- Base case: B wires
- Fast but low-bandwidth: L wires
- Width ↑ and Spacing ↑ ⇒ Delay ↓, Bandwidth ↓
15. Design Space Exploration
- Tuning repeater size and spacing
- Traditional wires: large repeaters, delay-optimal spacing
- Power-optimal wires: smaller repeaters, increased spacing
[Chart: delay vs. power for the two repeater configurations; a numeric sketch follows]
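To make the repeater trade-off concrete, here is a minimal numeric sketch, not the dissertation's model: it applies first-order Elmore delay and switching-energy estimates to two repeater configurations. Every constant (R0, C0, RW, CW, VDD, LENGTH) is an illustrative placeholder rather than a CACTI or ITRS value.

```python
# Sketch: delay/power trade-off for a repeated wire (first-order Elmore model).
# All constants are illustrative placeholders, not CACTI/ITRS values.

R0, C0 = 10e3, 0.1e-15   # min-size repeater: output resistance (ohm), input cap (F)
RW, CW = 1.0, 0.2e-15    # wire: resistance (ohm/um), capacitance (F/um)
VDD = 1.1                # supply voltage (V)
LENGTH = 5000.0          # wire length (um)

def delay_energy(size, n_rep, cw=CW):
    """Total Elmore delay (s) and switching energy (J) for the wire,
    split into n_rep segments driven by size-x repeaters."""
    seg = LENGTH / n_rep
    r_drv, c_in = R0 / size, C0 * size
    seg_delay = 0.69 * (r_drv * (c_in + cw * seg)
                        + RW * seg * (cw * seg / 2 + c_in))
    energy = (cw * LENGTH + n_rep * c_in) * VDD ** 2
    return n_rep * seg_delay, energy

# Traditional wires: large, closely spaced repeaters minimize delay.
# Power-optimal wires: smaller/sparser repeaters and wider spacing
# (modeled here as lower capacitance per um) cut energy at a delay cost.
for label, size, n, cw in [("traditional", 100, 25, CW),
                           ("power-optimal", 40, 10, 0.6 * CW)]:
    d, e = delay_energy(size, n, cw)
    print(f"{label:13s} delay={d*1e9:.2f} ns energy={e*1e12:.2f} pJ")
```

With numbers like these, the power-optimal configuration trades a modest delay increase for a sizable energy reduction, which is exactly the knob the PW wires exploit.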
16. ED Trade-off in a Repeated Wire
[Chart: energy-delay trade-off curve for a repeated wire]
17. Design Space Exploration

Wire type                            Latency  Power  Area
B wires (base case)                  1x       1x     1x
W wires (bandwidth optimized)        1.6x     0.9x   0.5x
PW wires (power and B/W optimized)   3.2x     0.3x   0.5x
L wires (fast, low bandwidth)        0.5x     0.5x   4x
18. Wire Model
[Figure: distributed RC model of a wire, including side-wall capacitance (Cside-wall) and coupling capacitance to adjacent wires (Cadj); ref. Banerjee et al.]

Wire Type     Relative Latency  Relative Area  Dynamic Power (per access)  Static Power
B-Wire (8x)   1x                1x             2.65α                       1x
B-Wire (4x)   1.6x              0.5x           2.9α                        1.13x
L-Wire (8x)   0.5x              4x             1.46α                       0.55x
PW-Wire (4x)  3.2x              0.5x           0.87α                       0.3x

65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X, and 8X planes
19. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
20. Cache Design Basics
[Diagram: cache read path; the input address drives the decoder, wordlines, and bitlines of the tag and data arrays; column muxes and sense amps feed comparators on the tag side (producing the valid output) and mux/output drivers on the data side (producing the data output)]
21. Existing Model: CACTI
[Diagram: cache models with 4 and 16 sub-arrays; access time = decoder delay + wordline/bitline delay + H-tree delay + logic delay]
22. CACTI Shortcomings
- Access delay is equal to the delay of the slowest sub-array
- Very high hit time for large caches
- Employs a separate bus for each cache bank in multi-banked caches
- Not scalable
Potential solution: NUCA. Extend CACTI to model NUCA, and exploit different wire types and network design choices to reduce access latency.
23. Non-Uniform Cache Access (NUCA)
- Large cache is broken into a number of small banks
- Employs an on-chip network for communication
- Access delay ∝ distance between bank and cache controller
[Diagram: CPU and L1 with a grid of cache banks at increasing distances]
(Kim et al., ASPLOS '02)
24. Extension to CACTI
- On-chip network
- Wire model based on ITRS 2005 parameters
- Grid network
- 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off (a sketch of the search loop follows)
- Iterate over different bank sizes
- Calculate the average network delay based on the number of banks and bank sizes
- Consider contention values for different cache configurations
- Similarly, consider the power consumed by each organization
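A minimal sketch of this search loop (my illustration: the bank-delay and hop-count models below are crude placeholders for what CACTI-6 actually computes, and the real tool additionally folds in contention and power for each organization):

```python
# Sketch of the CACTI-6-style NUCA search: for each candidate bank count,
# trade bank access time against average network delay.
import math

CACHE_MB = 32
ROUTER_CYCLES = 3   # 3-stage speculative router pipeline
LINK_CYCLES = 1     # per-hop link traversal

def bank_access_cycles(bank_mb):
    # Placeholder: larger banks take longer to access.
    return max(3, int(6 * math.log2(bank_mb * 8)))

def avg_network_cycles(n_banks):
    side = math.isqrt(n_banks)   # grid is side x side
    avg_hops = side              # avg Manhattan distance ~ grid side
    return avg_hops * (ROUTER_CYCLES + LINK_CYCLES)

candidates = [2 ** i for i in range(2, 10)]   # 4 .. 512 banks
best = min(candidates,
           key=lambda n: bank_access_cycles(CACHE_MB / n) + avg_network_cycles(n))
print(f"best bank count: {best}, avg access: "
      f"{bank_access_cycles(CACHE_MB / best) + avg_network_cycles(best)} cycles")
```

Even this toy model reproduces the qualitative result: very small banks drown in router hops, very large banks are slow internally, and a moderate bank count wins.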
25. Trade-off Analysis (32 MB Cache)
26. Effect of Core Count
27. Power-Centric Design (32 MB Cache)
28. Search Space of Old CACTI
- Design space with global wires optimized for delay
29. Search Space of CACTI-6
- Design space with various wire types
[Chart: search space annotated with the least-delay point, configurations within a 30% delay penalty, and low-swing wire designs]
30. Earlier NUCA Models
- Made simplified assumptions for network parameters
- Minimum bank access time
- Minimum network hop latency
- Single-cycle router pipeline
- Employed 512 banks for a 32 MB cache
- More bandwidth
- But 2.5X less efficient in terms of delay
31. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
32. Cache Look-Up
[Diagram: Core/L1 sends the address over the network to an L2 bank; the network routing logic consumes 4-6 address bits and the bank decoder 10-15 bits, after which the tag array and comparator select the data]
- The entire access happens in a sequential manner
33. Early Look-Up
[Diagram: the critical lower-order bits of the address are sent ahead, letting the L2 bank begin its tag and data array access before the rest of the address reaches the comparator]
- Breaks the sequential access
- Hides 70% of the bank access time
34. Aggressive Look-Up
[Diagram: 8 critical lower-order tag bits accompany the index on fast wires; the L2 bank performs a partial tag match with them, e.g., comparing the transmitted 11100010 against the low bits of the stored tag 11011101111100010 (a sketch of the partial match follows)]
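A minimal sketch of the partial tag match (my illustration; the 8-bit width and the bit strings follow the slide, the function and constant names are placeholders):

```python
# Sketch: aggressive look-up via partial tag match. The bank compares only the
# 8 low-order tag bits that arrived early on L-wires; a match is a strong hint
# (false matches are possible), so the full tag is verified at the controller.
PARTIAL_BITS = 8

def partial_match(stored_tag: int, early_bits: int) -> bool:
    """True if the stored tag's low-order bits equal the bits sent early."""
    return (stored_tag & ((1 << PARTIAL_BITS) - 1)) == early_bits

full_tag = 0b11011101111100010          # tag held in the L2 bank (from the slide)
early = full_tag & 0xFF                 # 0b11100010: low bits sent ahead
assert partial_match(full_tag, early)   # the owning way matches...
# ...but a different tag sharing the same low 8 bits would also match
# (< 1% of accesses per the talk), so forwarded data remains speculative.
```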
35. Aggressive Look-Up
- Reduction in link delay (for address transfer)
- Increase in traffic due to false matches: < 1%
- Marginal increase in link overhead
- Additional 8 bits
- More logic at the cache controller for the tag match
- Address transfer for writes happens on L-wires
36. Heterogeneous Network
- Routers introduce significant overhead (especially in the L-network)
- L-wires can transfer a signal across four banks in four cycles
- A router adds three cycles at each hop (a worked example follows)
- Modify the network topology to take advantage of wire properties
- Use different topologies for address and data transfers
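A worked example using the slide's numbers (the assumption that a routed path crosses a router at every bank is mine): crossing four banks through per-bank routers costs

\[
4 \times (1_{\text{link}} + 3_{\text{router}}) = 16 \text{ cycles},
\]

versus 4 cycles when the L-wires span all four banks between routers, which motivates an address network with fewer routers, such as the bus-based hybrid on the next slide.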
37. Hybrid Network
- Combination of point-to-point and bus
- Reduction in latency
- Reduction in power
- Efficient use of L-wires
- Drawback: low bandwidth
38. Experimental Setup
- Simplescalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
- 32 KB L1 I-cache and 32 KB L1 D-cache with a hit latency of 3 cycles
- Main memory latency: 300 cycles
39. CMP Setup
- Eight-core CMP (Simplescalar tool)
- 32 MB, 8-way set-associative L2 (SNUCA organization)
- Two cache controllers
- Main memory latency: 300 cycles
[Diagram: cores C1-C8 arranged around the shared L2 banks]
40. Network Model
- Virtual channel flow control
- Four virtual channels per physical channel
- Credit-based flow control (for backpressure)
- Adaptive routing: each hop must reduce the Manhattan distance between the source and the destination (a routing sketch follows)
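A minimal sketch of this adaptive routing rule (my illustration; the congestion metric and names are placeholders):

```python
# Sketch: minimal adaptive routing on a grid. A hop is legal only if it
# reduces the Manhattan distance to the destination; among legal hops we
# pick the less congested output (queue lengths stand in for congestion).
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

def legal_hops(cur: Coord, dst: Coord) -> List[Coord]:
    """Neighbors of `cur` that are closer (Manhattan) to `dst`."""
    x, y = cur
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    d = abs(dst[0] - x) + abs(dst[1] - y)
    return [c for c in cand
            if abs(dst[0] - c[0]) + abs(dst[1] - c[1]) < d]

def next_hop(cur: Coord, dst: Coord, queue_len: Dict[Coord, int]) -> Coord:
    """Adaptively pick the productive neighbor with the shortest queue."""
    return min(legal_hops(cur, dst), key=lambda c: queue_len.get(c, 0))

# Example: two productive directions exist; the router picks the idle one.
print(next_hop((1, 1), (3, 4), {(2, 1): 5, (1, 2): 0}))   # -> (1, 2)
```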
41. Cache Models

Model  Bank Access (cycles)  Bank Count  Network Link  Description
1      3                     512         B-wires       Based on prior work
2      17                    16          B-wires       CACTI-6
3      17                    16          B + L wires   Early lookup
4      17                    16          B + L wires   Aggressive lookup
5      17                    16          B + L wires   Hybrid network
6      17                    16          B-wires       Upper bound
42. Performance Results (Uniprocessor)
Model derived from CACTI: average improvement over the model assumed in prior work is 73% (114% for L2-sensitive applications)
43. Performance Results (Uniprocessor)
Early-lookup technique: average improvement over Model 2 is 6% (8% for L2-sensitive applications)
44. Performance Results (Uniprocessor)
Aggressive-lookup technique: average improvement over Model 2 is 8% (9% for L2-sensitive applications)
45. Performance Results (Uniprocessor)
Hybrid model: average improvement over Model 2 is 15% (20% for L2-sensitive applications)
46. Performance Results (CMP)
47. Performance Results (4X Wires)
- Wire-delay-constrained model
- Performance improvements are better
- Early lookup: 7%
- Aggressive model: 20%
- Hybrid model: 29%
48. NUCA Design
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes network overhead, performs 51% better than previous models
- Provides a methodology to compute an optimal baseline NUCA
49. NUCA Design II
- Wires can be tuned for different metrics
- Routers impose non-trivial overhead
- Address and data have different bandwidth needs
- We introduce heterogeneity at three levels:
- Different types of wires for address and data transfers
- Different topologies for the address and data networks
- Different architectures within the address network (point-to-point and bus)
- (Yields an additional performance improvement of 15% over the optimal baseline NUCA)
50. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
51. Directory-Based Protocol (Write-Invalidate)
- Map critical/small messages onto L-wires and non-critical messages onto PW-wires (a sketch of such a mapping follows)
- Read-exclusive request for a block in shared state
- Read request for a block in exclusive state
- Negative ack (NACK) messages
- Hop imbalance in messages
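A minimal sketch of such a mapping (my illustration: the table and enum names are placeholders; the critical/non-critical split follows the talk's examples on the next two slides):

```python
# Sketch: mapping coherence messages onto heterogeneous wires. Critical,
# small messages ride L-wires; bulky or off-critical-path messages ride
# PW-wires; everything else defaults to baseline B-wires.
from enum import Enum

class Wire(Enum):
    L = "low-latency, low-bandwidth"
    B = "baseline"
    PW = "power-optimized, slow"

WIRE_FOR_MESSAGE = {
    "InvalReq":  Wire.L,   # invalidate on a read-exclusive to a shared block
    "InvAck":    Wire.L,
    "NACK":      Wire.L,
    "ReadReq":   Wire.B,
    "DataReply": Wire.B,
    "SpecReply": Wire.PW,  # speculative reply overlapped with owner forward
    "WBData":    Wire.PW,  # writeback data, off the critical path
}

def send(msg_type: str, payload: bytes) -> None:
    wire = WIRE_FOR_MESSAGE.get(msg_type, Wire.B)
    print(f"{msg_type}: {len(payload)} bytes on {wire.name}-wires ({wire.value})")

send("InvAck", b"\x01")       # small + critical -> L-wires
send("WBData", bytes(64))     # 64B line, non-critical -> PW-wires
```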
52. Exclusive Request for a Shared Copy
1. Rd-Ex request from processor 1
2. Directory sends a clean copy to processor 1 (non-critical)
3. Directory sends an invalidate message to processor 2 (critical)
4. Cache 2 sends an acknowledgement back to processor 1 (critical)
[Diagram: processors 1 and 2 with caches 1 and 2 above the L2 directory; messages 1-4 annotated, with message 2 marked non-critical and messages 3-4 critical]
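The hop imbalance as arithmetic (uniform per-hop latency \(h\) is my notation): the clean copy completes after request plus reply, \(T_{data} = 2h\), while exclusive permission needs the request, invalidate, ack chain, \(T_{perm} = 3h\). Since the transaction completes only at \(T_{perm}\), message 2 can travel up to \(1.5\times\) slower (e.g., on PW-wires) without adding any latency.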
53. Read to an Exclusive Block
[Diagram: Proc 1 sends a Read Req to the L2 directory; the directory forwards the request to owner Proc 2 (critical) and issues a speculative reply to Proc 1 (non-critical); Proc 2 forwards the dirty copy to Proc 1 (critical) and writes the data back to L2 (non-critical), followed by an ACK]
54. Evaluation Platform / Simulation Methodology
- Virtutech Simics functional simulator
- Ruby timing model (GEMS)
- SPLASH suite
[Diagram: simulated processor and L2]
55. Heterogeneous Model
[Diagram: processor connected to the L2 over L-wire, B-wire, and PW-wire links]
- 11% performance improvement
- 22.5% power savings in wires
56. Summary
- Coherence messages have diverse needs
- Intelligent mapping of these messages to the wires of a heterogeneous network can improve both performance and power
- Low-bandwidth, high-speed links improve performance by 11% for the SPLASH benchmark suite
- Routing non-critical traffic on the power-optimized network decreases wire power by 22.5%
- Ref: Interconnect Aware Coherence Protocols (ISCA '06), in collaboration with Liqun Cheng
57. On-Core Communications
- L-wires
- Narrow bit-width operands
- Branch mispredict signal
- PW-wires
- Non-critical register values
- Ready registers
- Store data
- 11% improvement in ED²
58. Results Summary
[Diagram: the 16-cluster processor from slide 11 (clusters P0-P15 with I and D caches, two controllers, and an L2 bank), annotated with the results below]
- Cache reads and writes: 114% processor performance improvement, 50% power savings
- Coherence transactions: 11% performance improvement, 22.5% power savings in wires
- L1 accesses: 7% performance improvement, 11% ED² improvement
59. Conclusion
- The impact of interconnect choices in modern processors is significant
- Architectural-level wire management can improve both the power and performance of future communication-bound processors
- Architects have a lot to offer in the area of wire-aware design
60. Future Research
- Exploit upcoming technologies: low-swing wires, optical interconnects, RF, transmission lines, etc.
- Transactional memory
- A network to support register-register communication
- Dynamic adaptation
61. Acknowledgements
- Committee members: Rajeev, Al, John, Erik, and Shubu (Intel)
- External: Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer (Intel)
- CS front office staff
- Lab-mates: Karthik, Niti, Liqun, and other fellow grads
62. Avenues Explored
- Inter-core communication (ISCA 2006)
- Memory hierarchy (ISCA 2007)
- CACTI 6.0 publicly released (MICRO 2007; IEEE Micro Top Picks 2008)
- Out-of-order core (HPCA 2005, IEEE Micro 2006)
- Power- and temperature-aware architectures (ISPASS 2006)
- Current projects or under submission:
- Scalable and reliable transactional memory (PACT 2008)
- Rethinking fundamentals: route wires or packets?
- 3D reconfigurable caches