1. Wire Aware Architecture
Naveen Muralimanohar
Advisor: Rajeev Balasubramonian
University of Utah
2. Effect of Technology Scaling
- Power wall
- Temperature wall
- Reliability issues
- Process variation
- Soft errors
- Wire scaling
- Communication is expensive but computation is cheap
3. Wire Delay: A Compelling Opportunity
- Existing proposals are indirect
- Hide wire delay: pre-fetching, speculative coherence, run-ahead execution
- Reduce communication to save power
- Wire-level optimizations are still limited to circuit designers
4. Thesis Statement
- The growing cost of on-chip wire delay requires a thorough understanding of wires.
- The dissertation advocates exposing wire properties to architects and proposes microarchitectural wire management.
5. Wire Delay/Power
- The Pentium 4 (at 90nm) spent two cycles to send a signal across the chip
- Wire delays are costly for both performance and power
- Latencies of 60 cycles to reach the ends of a chip at 32nm (at 5 GHz)
- 50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
6. Large Caches
[Figure: Intel Montecito die, with its two large cache regions highlighted]
- Cache hierarchies will dominate chip area
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
7. On-Chip Cache Challenges
[Chart: cache access time, calculated using CACTI, for 4 MB, 16 MB, and 64 MB caches at the 130nm (1X), 65nm (1.5X), and 32nm (2X) process nodes]
8. Effect of L2 Hit Time
[Chart: speedups on an aggressive out-of-order processor when the L2 hit time drops from 30 to 15 cycles; average improvement 17%]
9. Coherence Traffic
- CMPs have already become ubiquitous
- They require coherence among multiple cores
- Coherence operations entail frequent communication
- Different coherence messages have different latency and bandwidth needs
[Diagram: three cores with private L1s and a shared L2; read-miss messages (Read Req, Fwd Read Req to owner, Latest copy) and write-miss messages (Ex Req, Inval Req, Inv Ack)]
10. L1 Accesses
- Highly latency critical in aggressive out-of-order processors (such as a clustered processor)
- The choice of inter-cluster communication fabric has a high impact on performance
11. On-chip Traffic
[Diagram: 16-cluster processor (P0-P15), each cluster with its own I and D caches; two controllers connect the clusters to an L2 bank. Three traffic classes are marked: cache reads and writes, coherence transactions, and L1 accesses]
12. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
13. Wire Characteristics
- Wire resistance and capacitance per unit length are set by wire width and spacing (a first-order model follows)
[Figure: wire cross-section relating width and spacing to resistance, capacitance, and bandwidth]
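As a first-order reminder of why width and spacing matter, here is a standard textbook approximation (my sketch, not taken from the slides; \(\rho\) is resistivity, \(W\) width, \(T\) thickness, \(S\) spacing to the neighboring wire, \(H\) the distance to the layers above/below):

\[
R_{wire} = \frac{\rho}{W\,T},
\qquad
C_{wire} \approx \varepsilon_0\left(\frac{2\,\varepsilon_{horiz}\,T}{S} + \frac{2\,\varepsilon_{vert}\,W}{H}\right)
\]

Widening a wire lowers resistance (and hence RC delay), and extra spacing lowers the side-wall component of capacitance, but both consume routing tracks, so bandwidth per unit area falls.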
14. Design Space Exploration
- Tuning wire width and spacing
- Base case: B wires
- Fast but low-bandwidth: L wires
- Width ↑ and Spacing ↑ ⇒ Delay ↓, Bandwidth ↓
15. Design Space Exploration
- Tuning repeater size and spacing
- Traditional wires: large repeaters, delay-optimal spacing
- Power-optimal wires: smaller repeaters, increased spacing
[Chart: delay vs. power for the two repeater configurations; a numeric sketch follows]
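To make the repeater trade-off concrete, here is a minimal numeric sketch, not the dissertation's model: it applies first-order Elmore delay and switching-energy estimates to two repeater configurations. Every constant (R0, C0, RW, CW, VDD, LENGTH) is an illustrative placeholder rather than a CACTI or ITRS value.

```python
# Sketch: delay/power trade-off for a repeated wire (first-order Elmore model).
# All constants are illustrative placeholders, not CACTI/ITRS values.

R0, C0 = 10e3, 0.1e-15   # min-size repeater: output resistance (ohm), input cap (F)
RW, CW = 1.0, 0.2e-15    # wire: resistance (ohm/um), capacitance (F/um)
VDD = 1.1                # supply voltage (V)
LENGTH = 5000.0          # wire length (um)

def delay_energy(size, n_rep, cw=CW):
    """Total Elmore delay (s) and switching energy (J) for the wire,
    split into n_rep segments driven by size-x repeaters."""
    seg = LENGTH / n_rep
    r_drv, c_in = R0 / size, C0 * size
    seg_delay = 0.69 * (r_drv * (c_in + cw * seg)
                        + RW * seg * (cw * seg / 2 + c_in))
    energy = (cw * LENGTH + n_rep * c_in) * VDD ** 2
    return n_rep * seg_delay, energy

# Traditional wires: large, closely spaced repeaters minimize delay.
# Power-optimal wires: smaller/sparser repeaters and wider spacing
# (modeled here as lower capacitance per um) cut energy at a delay cost.
for label, size, n, cw in [("traditional", 100, 25, CW),
                           ("power-optimal", 40, 10, 0.6 * CW)]:
    d, e = delay_energy(size, n, cw)
    print(f"{label:13s} delay={d*1e9:.2f} ns energy={e*1e12:.2f} pJ")
```

With numbers like these, the power-optimal configuration trades a modest delay increase for a sizable energy reduction, which is exactly the knob the PW wires exploit.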
16. ED Trade-off in a Repeated Wire
[Chart: energy-delay trade-off curve for a repeated wire]
17. Design Space Exploration

Wire type                            Latency  Power  Area
B wires (base case)                  1x       1x     1x
W wires (bandwidth optimized)        1.6x     0.9x   0.5x
PW wires (power and B/W optimized)   3.2x     0.3x   0.5x
L wires (fast, low bandwidth)        0.5x     0.5x   4x
18. Wire Model
[Figure: distributed RC model of a wire, including side-wall capacitance (Cside-wall) and coupling capacitance to adjacent wires (Cadj); ref. Banerjee et al.]

Wire Type     Relative Latency  Relative Area  Dynamic Power (per access)  Static Power
B-Wire (8x)   1x                1x             2.65α                       1x
B-Wire (4x)   1.6x              0.5x           2.9α                        1.13x
L-Wire (8x)   0.5x              4x             1.46α                       0.55x
PW-Wire (4x)  3.2x              0.5x           0.87α                       0.3x

65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X, and 8X planes
19. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
20. Cache Design Basics
[Diagram: cache read path; the input address drives the decoder, wordlines, and bitlines of the tag and data arrays; column muxes and sense amps feed comparators on the tag side (producing the valid output) and mux/output drivers on the data side (producing the data output)]
21. Existing Model: CACTI
[Diagram: cache models with 4 and 16 sub-arrays; access time = decoder delay + wordline/bitline delay + H-tree delay + logic delay]
22. CACTI Shortcomings
- Access delay is equal to the delay of the slowest sub-array
- Very high hit time for large caches
- Employs a separate bus for each cache bank in multi-banked caches
- Not scalable
Potential solution: NUCA. Extend CACTI to model NUCA, and exploit different wire types and network design choices to reduce access latency.
23. Non-Uniform Cache Access (NUCA)
- Large cache is broken into a number of small banks
- Employs an on-chip network for communication
- Access delay ∝ distance between bank and cache controller
[Diagram: CPU and L1 with a grid of cache banks at increasing distances]
(Kim et al., ASPLOS '02)
24. Extension to CACTI
- On-chip network
- Wire model based on ITRS 2005 parameters
- Grid network
- 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off (a sketch of the search loop follows)
- Iterate over different bank sizes
- Calculate the average network delay based on the number of banks and bank sizes
- Consider contention values for different cache configurations
- Similarly, consider the power consumed by each organization
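A minimal sketch of this search loop (my illustration: the bank-delay and hop-count models below are crude placeholders for what CACTI-6 actually computes, and the real tool additionally folds in contention and power for each organization):

```python
# Sketch of the CACTI-6-style NUCA search: for each candidate bank count,
# trade bank access time against average network delay.
import math

CACHE_MB = 32
ROUTER_CYCLES = 3   # 3-stage speculative router pipeline
LINK_CYCLES = 1     # per-hop link traversal

def bank_access_cycles(bank_mb):
    # Placeholder: larger banks take longer to access.
    return max(3, int(6 * math.log2(bank_mb * 8)))

def avg_network_cycles(n_banks):
    side = math.isqrt(n_banks)   # grid is side x side
    avg_hops = side              # avg Manhattan distance ~ grid side
    return avg_hops * (ROUTER_CYCLES + LINK_CYCLES)

candidates = [2 ** i for i in range(2, 10)]   # 4 .. 512 banks
best = min(candidates,
           key=lambda n: bank_access_cycles(CACHE_MB / n) + avg_network_cycles(n))
print(f"best bank count: {best}, avg access: "
      f"{bank_access_cycles(CACHE_MB / best) + avg_network_cycles(best)} cycles")
```

Even this toy model reproduces the qualitative result: very small banks drown in router hops, very large banks are slow internally, and a moderate bank count wins.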
25. Trade-off Analysis (32 MB Cache)
26. Effect of Core Count
27. Power-Centric Design (32 MB Cache)
28. Search Space of Old CACTI
- Design space with global wires optimized for delay
29. Search Space of CACTI-6
- Design space with various wire types
[Chart: search space annotated with the least-delay point, configurations within a 30% delay penalty, and low-swing wire designs]
30. Earlier NUCA Models
- Made simplified assumptions for network parameters
- Minimum bank access time
- Minimum network hop latency
- Single-cycle router pipeline
- Employed 512 banks for a 32 MB cache
- More bandwidth
- But 2.5X less efficient in terms of delay
31. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
32. Cache Look-Up
[Diagram: Core/L1 sends the address over the network to an L2 bank; the network routing logic consumes 4-6 address bits and the bank decoder 10-15 bits, after which the tag array and comparator select the data]
- The entire access happens in a sequential manner
33. Early Look-Up
[Diagram: the critical lower-order bits of the address are sent ahead, letting the L2 bank begin its tag and data array access before the rest of the address reaches the comparator]
- Breaks the sequential access
- Hides 70% of the bank access time
34. Aggressive Look-Up
[Diagram: 8 critical lower-order tag bits accompany the index on fast wires; the L2 bank performs a partial tag match with them, e.g., comparing the transmitted 11100010 against the low bits of the stored tag 11011101111100010 (a sketch of the partial match follows)]
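A minimal sketch of the partial tag match (my illustration; the 8-bit width and the bit strings follow the slide, the function and constant names are placeholders):

```python
# Sketch: aggressive look-up via partial tag match. The bank compares only the
# 8 low-order tag bits that arrived early on L-wires; a match is a strong hint
# (false matches are possible), so the full tag is verified at the controller.
PARTIAL_BITS = 8

def partial_match(stored_tag: int, early_bits: int) -> bool:
    """True if the stored tag's low-order bits equal the bits sent early."""
    return (stored_tag & ((1 << PARTIAL_BITS) - 1)) == early_bits

full_tag = 0b11011101111100010          # tag held in the L2 bank (from the slide)
early = full_tag & 0xFF                 # 0b11100010: low bits sent ahead
assert partial_match(full_tag, early)   # the owning way matches...
# ...but a different tag sharing the same low 8 bits would also match
# (< 1% of accesses per the talk), so forwarded data remains speculative.
```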
35. Aggressive Look-Up
- Reduction in link delay (for address transfer)
- Increase in traffic due to false matches: < 1%
- Marginal increase in link overhead
- Additional 8 bits
- More logic at the cache controller for the tag match
- Address transfer for writes happens on L-wires
36. Heterogeneous Network
- Routers introduce significant overhead (especially in the L-network)
- L-wires can transfer a signal across four banks in four cycles
- A router adds three cycles at each hop (a worked example follows)
- Modify the network topology to take advantage of wire properties
- Use different topologies for address and data transfers
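A worked example using the slide's numbers (the assumption that a routed path crosses a router at every bank is mine): crossing four banks through per-bank routers costs

\[
4 \times (1_{\text{link}} + 3_{\text{router}}) = 16 \text{ cycles},
\]

versus 4 cycles when the L-wires span all four banks between routers, which motivates an address network with fewer routers, such as the bus-based hybrid on the next slide.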
37. Hybrid Network
- Combination of point-to-point and bus
- Reduction in latency
- Reduction in power
- Efficient use of L-wires
- Drawback: low bandwidth
38. Experimental Setup
- Simplescalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
- 32 KB L1 I-cache and 32 KB L1 D-cache with a hit latency of 3 cycles
- Main memory latency: 300 cycles
39. CMP Setup
- Eight-core CMP (Simplescalar tool)
- 32 MB, 8-way set-associative L2 (SNUCA organization)
- Two cache controllers
- Main memory latency: 300 cycles
[Diagram: cores C1-C8 arranged around the shared L2 banks]
40. Network Model
- Virtual channel flow control
- Four virtual channels per physical channel
- Credit-based flow control (for backpressure)
- Adaptive routing: each hop must reduce the Manhattan distance between the source and the destination (a routing sketch follows)
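A minimal sketch of this adaptive routing rule (my illustration; the congestion metric and names are placeholders):

```python
# Sketch: minimal adaptive routing on a grid. A hop is legal only if it
# reduces the Manhattan distance to the destination; among legal hops we
# pick the less congested output (queue lengths stand in for congestion).
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

def legal_hops(cur: Coord, dst: Coord) -> List[Coord]:
    """Neighbors of `cur` that are closer (Manhattan) to `dst`."""
    x, y = cur
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    d = abs(dst[0] - x) + abs(dst[1] - y)
    return [c for c in cand
            if abs(dst[0] - c[0]) + abs(dst[1] - c[1]) < d]

def next_hop(cur: Coord, dst: Coord, queue_len: Dict[Coord, int]) -> Coord:
    """Adaptively pick the productive neighbor with the shortest queue."""
    return min(legal_hops(cur, dst), key=lambda c: queue_len.get(c, 0))

# Example: two productive directions exist; the router picks the idle one.
print(next_hop((1, 1), (3, 4), {(2, 1): 5, (1, 2): 0}))   # -> (1, 2)
```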
41. Cache Models

Model  Bank Access (cycles)  Bank Count  Network Link  Description
1      3                     512         B-wires       Based on prior work
2      17                    16          B-wires       CACTI-6
3      17                    16          B + L wires   Early lookup
4      17                    16          B + L wires   Aggressive lookup
5      17                    16          B + L wires   Hybrid network
6      17                    16          B-wires       Upper bound
42. Performance Results (Uniprocessor)
Model derived from CACTI: average improvement over the model assumed in prior work is 73% (114% for L2-sensitive applications)
43. Performance Results (Uniprocessor)
Early-lookup technique: average improvement over Model 2 is 6% (8% for L2-sensitive applications)
44. Performance Results (Uniprocessor)
Aggressive-lookup technique: average improvement over Model 2 is 8% (9% for L2-sensitive applications)
45. Performance Results (Uniprocessor)
Hybrid model: average improvement over Model 2 is 15% (20% for L2-sensitive applications)
46. Performance Results (CMP)
47. Performance Results (4X Wires)
- Wire-delay-constrained model
- Performance improvements are better
- Early lookup: 7%
- Aggressive model: 20%
- Hybrid model: 29%
48. NUCA Design
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes network overhead, performs 51% better than previous models
- Provides a methodology to compute an optimal baseline NUCA
49. NUCA Design II
- Wires can be tuned for different metrics
- Routers impose non-trivial overhead
- Address and data have different bandwidth needs
- We introduce heterogeneity at three levels:
- Different types of wires for address and data transfers
- Different topologies for the address and data networks
- Different architectures within the address network (point-to-point and bus)
- (Yields an additional performance improvement of 15% over the optimal baseline NUCA)
50. Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions
51. Directory-Based Protocol (Write-Invalidate)
- Map critical/small messages onto L-wires and non-critical messages onto PW-wires (a sketch of such a mapping follows)
- Read-exclusive request for a block in shared state
- Read request for a block in exclusive state
- Negative ack (NACK) messages
- Hop imbalance in messages
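A minimal sketch of such a mapping (my illustration: the table and enum names are placeholders; the critical/non-critical split follows the talk's examples on the next two slides):

```python
# Sketch: mapping coherence messages onto heterogeneous wires. Critical,
# small messages ride L-wires; bulky or off-critical-path messages ride
# PW-wires; everything else defaults to baseline B-wires.
from enum import Enum

class Wire(Enum):
    L = "low-latency, low-bandwidth"
    B = "baseline"
    PW = "power-optimized, slow"

WIRE_FOR_MESSAGE = {
    "InvalReq":  Wire.L,   # invalidate on a read-exclusive to a shared block
    "InvAck":    Wire.L,
    "NACK":      Wire.L,
    "ReadReq":   Wire.B,
    "DataReply": Wire.B,
    "SpecReply": Wire.PW,  # speculative reply overlapped with owner forward
    "WBData":    Wire.PW,  # writeback data, off the critical path
}

def send(msg_type: str, payload: bytes) -> None:
    wire = WIRE_FOR_MESSAGE.get(msg_type, Wire.B)
    print(f"{msg_type}: {len(payload)} bytes on {wire.name}-wires ({wire.value})")

send("InvAck", b"\x01")       # small + critical -> L-wires
send("WBData", bytes(64))     # 64B line, non-critical -> PW-wires
```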
52. Exclusive Request for a Shared Copy
1. Rd-Ex request from processor 1
2. Directory sends a clean copy to processor 1 (non-critical)
3. Directory sends an invalidate message to processor 2 (critical)
4. Cache 2 sends an acknowledgement back to processor 1 (critical)
[Diagram: processors 1 and 2 with caches 1 and 2 above the L2 directory; messages 1-4 annotated, with message 2 marked non-critical and messages 3-4 critical]
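The hop imbalance as arithmetic (uniform per-hop latency \(h\) is my notation): the clean copy completes after request plus reply, \(T_{data} = 2h\), while exclusive permission needs the request, invalidate, ack chain, \(T_{perm} = 3h\). Since the transaction completes only at \(T_{perm}\), message 2 can travel up to \(1.5\times\) slower (e.g., on PW-wires) without adding any latency.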
53. Read to an Exclusive Block
[Diagram: Proc 1 sends a Read Req to the L2 directory; the directory forwards the request to owner Proc 2 (critical) and issues a speculative reply to Proc 1 (non-critical); Proc 2 forwards the dirty copy to Proc 1 (critical) and writes the data back to L2 (non-critical), followed by an ACK]
54. Evaluation Platform / Simulation Methodology
- Virtutech Simics functional simulator
- Ruby timing model (GEMS)
- SPLASH suite
[Diagram: simulated processor and L2]
55. Heterogeneous Model
[Diagram: processor connected to the L2 over L-wire, B-wire, and PW-wire links]
- 11% performance improvement
- 22.5% power savings in wires
56. Summary
- Coherence messages have diverse needs
- Intelligent mapping of these messages to the wires of a heterogeneous network can improve both performance and power
- Low-bandwidth, high-speed links improve performance by 11% for the SPLASH benchmark suite
- Routing non-critical traffic on the power-optimized network decreases wire power by 22.5%
- Ref: Interconnect Aware Coherence Protocols (ISCA '06), in collaboration with Liqun Cheng
57. On-Core Communications
- L-wires
- Narrow bit-width operands
- Branch mispredict signal
- PW-wires
- Non-critical register values
- Ready registers
- Store data
- 11% improvement in ED²
58. Results Summary
[Diagram: the 16-cluster processor from slide 11 (clusters P0-P15 with I and D caches, two controllers, and an L2 bank), annotated with the results below]
- Cache reads and writes: 114% processor performance improvement, 50% power savings
- Coherence transactions: 11% performance improvement, 22.5% power savings in wires
- L1 accesses: 7% performance improvement, 11% ED² improvement
59. Conclusion
- The impact of interconnect choices in modern processors is significant
- Architectural-level wire management can improve both the power and performance of future communication-bound processors
- Architects have a lot to offer in the area of wire-aware design
60. Future Research
- Exploit upcoming technologies: low-swing wires, optical interconnects, RF, transmission lines, etc.
- Transactional memory
- A network to support register-register communication
- Dynamic adaptation
61. Acknowledgements
- Committee members: Rajeev, Al, John, Erik, and Shubu (Intel)
- External: Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer (Intel)
- CS front office staff
- Lab-mates: Karthik, Niti, Liqun, and other fellow grads
62. Avenues Explored
- Inter-core communication (ISCA 2006)
- Memory hierarchy (ISCA 2007)
- CACTI 6.0 publicly released (MICRO 2007; IEEE Micro Top Picks 2008)
- Out-of-order core (HPCA 2005, IEEE Micro 2006)
- Power- and temperature-aware architectures (ISPASS 2006)
- Current projects or under submission:
- Scalable and reliable transactional memory (PACT 2008)
- Rethinking fundamentals: route wires or packets?
- 3D reconfigurable caches