Wire Aware Architecture

Transcript and Presenter's Notes


1
Wire Aware Architecture
Naveen Muralimanohar
Advisor: Rajeev Balasubramonian
University of Utah
2
Effect of Technology Scaling
  • Power wall
  • Temperature wall
  • Reliability issues
  • Process variation
  • Soft errors
  • Wire scaling
  • Communication is expensive but computation is
    cheap

3
Wire Delay: A Compelling Opportunity
  • Existing proposals are indirect
  • Hide wire delay
  • Pre-fetching, Speculative coherence, Run-ahead
    execution
  • Reduce communication to save power
  • Wire-level optimizations are still left to
    circuit designers

4
Thesis Statement
  • The growing cost of on-chip wire delay requires
    a thorough understanding of wires.
  • This dissertation advocates exposing wire
    properties to architects and proposes
    microarchitectural wire management.

5
Wire Delay/Power
  • The Pentium 4 (at 90 nm) spent two cycles to send a
    signal across the chip
  • Wire delays are costly for performance and power
  • Latencies of 60 cycles to reach the ends of a chip
    at 32 nm (at 5 GHz)
  • 50% of dynamic power is in interconnect switching
    (Magen et al., SLIP 04)
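A back-of-the-envelope check of the 60-cycle figure, using only the
numbers quoted on this slide:

\[
t_{\text{cycle}} = \frac{1}{5\ \text{GHz}} = 0.2\ \text{ns},
\qquad
60\ \text{cycles} \times 0.2\ \text{ns} = 12\ \text{ns}
\]

so a cross-chip signal at 32 nm is budgeted roughly 12 ns of flight
time, far longer than any single pipeline stage.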

6
Large Caches
Intel Montecito
  • Cache hierarchies will dominate chip area
  • Montecito has two private 12 MB L3 caches (27 MB
    including L2)
  • Long global wires are required to transmit
    data/address

7
On-Chip Cache Challenges
[Plot: cache access time computed with CACTI for 4 MB, 16 MB, and
64 MB caches, with curves labeled 1X (130 nm), 1.5X (65 nm), and 2X
(32 nm) process generations]
8
Effect of L2 Hit Time
[Plot: performance of an aggressive out-of-order processor when the
L2 hit time drops from 30 to 15 cycles; average improvement 17%]
9
Coherence Traffic
  • CMPs have already become ubiquitous
  • They require coherence among multiple cores
  • Coherence operations entail frequent
    communication
  • Different coherence messages have different
    latency and bandwidth needs

[Figure: three cores with private L1s sharing an L2; arrows trace the
messages for a read miss (Read Req, Fwd Read Req to owner, Latest
copy) and for a write miss (Ex Req, Inval Req, Inv Ack)]
10
L1 Accesses
  • Highly latency critical in aggressive
    out-of-order processors (such as a clustered
    processor)
  • The choice of inter-cluster communication fabric
    has a high impact on performance

11
On-chip Traffic
[Figure: 16-cluster floorplan (P0-P15, each cluster with I- and
D-caches) with two controllers and an L2 bank, illustrating the three
traffic classes: cache reads and writes, coherence transactions, and
L1 accesses]
12
Outline
  • Overview
  • Wire Design Space
  • Methodology to Design Scalable Caches
  • Heterogeneous Wires for Large Caches
  • Heterogeneous Wires for Coherence Traffic
  • Conclusions

13
Wire Characteristics
  • Wire Resistance and capacitance per unit length

[Figure: wire cross-section — width and spacing determine resistance,
capacitance, and bandwidth]
14
Design Space Exploration
  • Tuning wire width and spacing

Base case: B-wires
Fast but low-bandwidth: L-wires
Increasing width and spacing reduces delay but also
reduces bandwidth
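This trade-off follows from the standard first-order wire model (a
textbook relation, not specific to this work):

\[
R_{\text{wire}} \propto \frac{1}{\text{width}},
\qquad
C_{\text{sidewall}} \propto \frac{1}{\text{spacing}},
\qquad
t_{\text{delay}} \propto R_{\text{wire}} C_{\text{wire}},
\]

so widening and spreading the wires lowers delay (L-wires), but fewer
such wires fit in a fixed metal-area budget, which is the bandwidth
loss.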
15
Design Space Exploration
  • Tuning Repeater size and spacing

[Plot: delay vs. power for repeated wires — traditional wires use
large repeaters at delay-optimal spacing; power-optimal wires use
smaller repeaters with increased spacing]
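For reference, classical delay-optimal repeater insertion (a textbook
result, e.g. Bakoglu; not taken from the slides) makes repeated-wire
delay linear in length:

\[
t_{\text{wire}} \;\propto\; \ell \sqrt{R_w C_w \, R_0 C_0}
\]

where \ell is the wire length, R_w and C_w are wire resistance and
capacitance per unit length, and R_0 and C_0 characterize the
repeaters. The power-optimal wires above deliberately back off from
this point: the last bit of delay reduction costs disproportionately
large, closely spaced repeaters.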
16
ED Trade-off in a Repeated Wire
17
Design Space Exploration
Base case B-wires:                  Latency 1x,   Power 1x,   Area 1x
Bandwidth-optimized W-wires:        Latency 1.6x, Power 0.9x, Area 0.5x
Power/bandwidth-optimized PW-wires: Latency 3.2x, Power 0.3x, Area 0.5x
Fast, low-bandwidth L-wires:        Latency 0.5x, Power 0.5x, Area 4x
18
Wire Model
[Figure: distributed wire RC model showing side-wall capacitance
(C_side-wall) and adjacent-wire coupling (C_adj); after Banerjee et al.]
Wire Type      Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire (8x)    1x                 1x              2.65α           1x
B-Wire (4x)    1.6x               0.5x            2.9α            1.13x
L-Wire (8x)    0.5x               4x              1.46α           0.55x
PW-Wire (4x)   3.2x               0.5x            0.87α           0.3x
65 nm process, 10 metal layers: 4 in the 1X plane and 2 in each of
the 2X, 4X, and 8X planes
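The dynamic-power entries of the form 2.65α read naturally as
per-wire coefficients of the usual switching-power expression, with α
the activity factor (an interpretation of the table, not stated on
the slide):

\[
P_{\text{dyn}} = \alpha\, C_{\text{wire}}\, V_{dd}^{2}\, f
\]

so a PW-wire (0.87α) dissipates roughly a third of the switching
power of a baseline 8x B-wire (2.65α).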
19
Outline
  • Overview
  • Wire Design Space
  • Methodology to Design Scalable Caches
  • Heterogeneous Wires for Large Caches
  • Heterogeneous Wires for Coherence Traffic
  • Conclusions

20
Cache Design Basics
[Figure: cache read path — the input address drives the decoder into
the tag and data arrays; wordlines and bitlines feed column muxes and
sense amps; the comparators drive the mux and output drivers that
produce the data output and valid signal]
21
Existing Model - CACTI
[Figure: cache models with 4 and 16 sub-arrays; access time is the
sum of decoder delay, wordline/bitline delay, H-tree delay, and logic
delay]
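Paraphrasing the figure, the modeled access time is the sum of the
stage delays; raising the sub-array count shrinks the
wordline/bitline term but grows the H-tree term:

\[
T_{\text{access}} \approx T_{\text{decoder}} + T_{\text{wordline}}
 + T_{\text{bitline}} + T_{\text{H-tree}} + T_{\text{logic}}
\]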
22
CACTI Shortcomings
  • Access delay is equal to the delay of the slowest
    sub-array
  • Very high hit time for large caches
  • Employs a separate bus for each cache bank for
    multi-banked caches
  • Not scalable

Potential solution: NUCA. Extend CACTI to model NUCA
and exploit different wire types and network design
choices to reduce access latency.
23
Non-Uniform Cache Access (NUCA)
  • Large cache is broken into a number of small
    banks
  • Employs on-chip network for communication
  • Access delay ∝ distance between bank and cache
    controller

[Figure: CPU and L1 beside an array of cache banks connected by an
on-chip network (Kim et al., ASPLOS 02)]
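Restating the bullet with the network terms made explicit, the
latency to a line resident in bank b is roughly

\[
T_{\text{NUCA}}(b) \;\approx\; T_{\text{bank}}
  + n_{\text{hops}}(b)\,\bigl(T_{\text{router}} + T_{\text{link}}\bigr),
\]

so the choice of bank count trades a smaller bank access time against
a larger average hop count.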
24
Extension to CACTI
  • On-chip network
  • Wire model based on ITRS 2005 parameters
  • Grid network
  • 3-stage speculative router pipeline
  • Network latency vs Bank access latency tradeoff
  • Iterate over different bank sizes
  • Calculate the average network delay based on the
    number of banks and bank sizes
  • Consider contention values for different cache
    configurations
  • Power consumed by each organization is modeled
    similarly (the sweep is sketched below)
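A minimal sketch of this sweep, in Python for concreteness. The
bank-delay model, hop-count estimate, and timing constants are
illustrative assumptions, not the released CACTI 6.0 code, and
contention and power are omitted here even though the real sweep
weighs them too.

# Sketch of the CACTI-6-style design-space sweep described above.
ROUTER_DELAY = 3.0   # 3-stage speculative router pipeline (cycles)
LINK_DELAY = 1.0     # per-hop link traversal (cycles), assumed

def bank_access_delay(bank_kb: float) -> float:
    """Placeholder for a CACTI-style analytical bank model (cycles)."""
    return 3.0 + 0.002 * bank_kb          # assumed: grows with bank size

def avg_hops(n_banks: int) -> float:
    """Rough average hop count on a sqrt(n) x sqrt(n) grid (assumed)."""
    return (2.0 / 3.0) * n_banks ** 0.5

def nuca_delay(cache_kb: float, n_banks: int) -> float:
    """Average unloaded access delay for one candidate organization."""
    bank_kb = cache_kb / n_banks
    return bank_access_delay(bank_kb) + avg_hops(n_banks) * (ROUTER_DELAY + LINK_DELAY)

def best_bank_count(cache_kb: float = 32 * 1024) -> int:
    """Iterate over bank counts and keep the delay-optimal partition."""
    return min((2 ** i for i in range(1, 10)),
               key=lambda n: nuca_delay(cache_kb, n))

if __name__ == "__main__":
    n = best_bank_count()
    print(f"best bank count: {n} "
          f"({nuca_delay(32 * 1024, n):.1f} cycles average)")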

25
Trade-off Analysis (32 MB Cache)
26
Effect of Core Count
27
Power Centric Design (32MB Cache)
28
Search Space of Old CACTI
  • Design space with global wires optimized for
    delay

29
Search Space of CACTI-6
[Plot: design space explored with various wire types, marking the
least-delay point, the 30% delay-penalty band, and low-swing wires]
30
Earlier NUCA Models
  • Made simplified assumptions for network
    parameters
  • Minimum bank access time
  • Minimum network hop latency
  • Single cycle router pipeline
  • Employed 512 banks for a 32 MB cache
  •  + More bandwidth
  •  - 2.5X less efficient in terms of delay

31
Outline
  • Overview
  • Wire Design Space
  • Methodology to Design Scalable Caches
  • Heterogeneous Wires for Large Caches
  • Heterogeneous Wires for Coherence Traffic
  • Conclusions

32
Cache Look-Up
[Figure: the core/L1 sends an address to an L2 bank — 4-6 bits drive
the network routing logic, 10-15 bits drive the bank decoder; the tag
and data arrays are read and a comparator selects the result]
  • The entire access happens in a sequential manner

33
Early Look-Up
[Figure: the critical lower-order address bits are sent ahead, so the
L2 bank's tag and data access overlaps the rest of the address
transfer from the core/L1]
  • Break the sequential access
  • Hides 70% of the bank access time

34
Aggressive Look-Up
[Figure: only the 8 critical lower-order bits (11100010 out of the
full tag 11011101111100010) travel ahead on L-wires for a partial tag
match at the L2 bank]
35
Aggressive Look-Up
  • Reduction in link delay (for address transfer)
  • Increase in traffic due to false matches: < 1%
  • Marginal increase in link overhead
  • Additional 8 bits
  • More logic at the cache controller for tag match
  • Address transfer for writes happens on L-wires
    (a sketch of the partial match follows)
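A toy model of the aggressive look-up, with the bit widths from the
slides; the code structure and helper names are illustrative
assumptions, not the dissertation's actual hardware.

# The 8 critical lower-order tag bits arrive early on L-wires and
# prune the candidate ways; the rare false match is filtered once the
# full tag arrives on the slower B-wires.
PARTIAL_BITS = 8
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1

def partial_match(stored_tags, partial_tag):
    """Ways whose low-order bits match the early partial tag."""
    return [w for w, t in enumerate(stored_tags)
            if (t & PARTIAL_MASK) == partial_tag]

def resolve(stored_tags, full_tag, candidates):
    """Final comparison when the full tag arrives; false matches drop out."""
    return [w for w in candidates if stored_tags[w] == full_tag]

# Example with the tag shown on the slide; way 1 is a false match.
tags = [0b11011101111100010, 0b11100010, 0b10101, 0b1111]
req = 0b11011101111100010
early = partial_match(tags, req & PARTIAL_MASK)  # data fetch can start now
final = resolve(tags, req, early)
print(early, final)                              # [0, 1] -> [0]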

36
Heterogeneous Network
  • Routers introduce significant overhead
    (especially in L-network)
  • L-wires can transfer signal across four banks in
    four cycles
  • Router adds three cycles for each hop
  • Modify the network topology to take advantage of
    wire properties
  • Different topology for address and data transfers

37
Hybrid Network
  • Combination of point-to-point and bus
  •  + Reduction in latency
  •  + Reduction in power
  •  + Efficient use of L-wires
  •  - Low bandwidth

38
Experimental Setup
  • Simplescalar with contention modeled in detail
  • Single core, 8-issue out-of-order processor
  • 32 MB, 8-way set-associative, on-chip L2 cache
    (SNUCA organization)
  • 32KB L1 I-cache and 32KB L1 D-cache with a hit
    latency of 3 cycles
  • Main memory latency 300 cycles

39
CMP Setup
  • Eight-core CMP (Simplescalar tool)
  • 32 MB, 8-way set-associative L2
    (SNUCA organization)
  • Two cache controllers
  • Main memory latency: 300 cycles

[Figure: eight cores (C1-C8) arranged around the shared L2 banks]
40
Network Model
  • Virtual channel flow control
  • Four virtual channels/physical channel
  • Credit based flow control (for backpressure)
  • Adaptive routing
  • Each hop must reduce the Manhattan distance
    between source and destination (sketched below)
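A sketch of that adaptive-routing constraint: a flit may leave a
router on any output that reduces the Manhattan distance to its
destination. The (x, y) grid coordinates and the selection policy are
assumptions for illustration.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def productive_hops(current, dest):
    """Neighboring routers that bring the packet closer to `dest`."""
    x, y = current
    neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [n for n in neighbors
            if manhattan(n, dest) < manhattan(current, dest)]

# Among the productive hops, the router can pick adaptively, e.g. the
# port whose virtual channels have the most free credits.
print(productive_hops((0, 0), (2, 3)))  # [(1, 0), (0, 1)]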

41
Cache Models
Model   Bank Access (cycles)   Bank Count   Network Link   Description
1       3                      512          B-wires        Based on prior work
2       17                     16           B-wires        CACTI-6
3       17                     16           B + L wires    Early lookup
4       17                     16           B + L wires    Aggressive lookup
5       17                     16           B + L wires    Hybrid network
6       17                     16           B-wires        Upper bound
42
Performance Results (Uniprocessor)
Model derived from CACTI: average improvement over the
model assumed in prior work is 73% (114% for
L2-sensitive benchmarks)
43
Performance Results (Uniprocessor)
Early lookup technique: average improvement over
Model 2 is 6% (8% for L2-sensitive benchmarks)
44
Performance Results (Uniprocessor)
Aggressive lookup technique: average improvement over
Model 2 is 8% (9% for L2-sensitive benchmarks)
45
Performance Results (Uniprocessor)
Hybrid model: average improvement over Model 2 is
15% (20% for L2-sensitive benchmarks)
46
Performance Results (CMP)
47
Performance Results (4X Wires)
  • Wire-delay-constrained model
  • Performance improvements are larger
  • Early lookup: 7%
  • Aggressive model: 20%
  • Hybrid model: 29%

48
NUCA Design
  • Network parameters play a significant role in the
    performance of large caches
  • The modified CACTI model, which includes network
    overhead, performs 51% better than previous
    models
  • Methodology to compute an optimal baseline NUCA

49
NUCA Design II
  • Wires can be tuned for different metrics
  • Routers impose non-trivial overhead
  • Address and data have different bandwidth needs
  • We introduce heterogeneity at three levels
  • Different types of wires for address and data
    transfers
  • Different topologies for address and data
    networks
  • Different architectures within address network
    (point-to-point and bus)
  • (Yields an additional performance improvement of
    15% over the optimal baseline NUCA)

50
Outline
  • Overview
  • Wire Design Space
  • Methodology to Design Scalable Caches
  • Heterogeneous Wires for Large Caches
  • Heterogeneous Wires for Coherence Traffic
  • Conclusions

51
Directory Based Protocol (Write-Invalidate)
  • Map critical/small messages on L-wires and
    non-critical messages on PW-wires (a mapping
    sketch follows)
  • Read-exclusive request for a block in shared state
  • Read request for a block in exclusive state
  • Negative ack (NACK) messages

Hop imbalance in messages
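A sketch of the message-to-wire mapping. The critical/non-critical
classification follows the slides; the message-type names are
illustrative placeholders, not the protocol's actual identifiers.

WIRE_FOR_MESSAGE = {
    # small, hop-imbalanced, critical messages -> fast L-wires
    "rd_ex_to_shared_block": "L",
    "rd_to_exclusive_block": "L",
    "nack": "L",
    # latency-tolerant traffic -> power-optimized PW-wires
    "writeback_data": "PW",
    "speculative_reply": "PW",
}

def wire_class(msg_type: str) -> str:
    """Everything unclassified stays on baseline B-wires."""
    return WIRE_FOR_MESSAGE.get(msg_type, "B")

print(wire_class("nack"), wire_class("writeback_data"),
      wire_class("data_reply"))  # -> L PW B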
52
Exclusive request for a shared copy
  1. Rd-Ex request from processor 1
  2. Directory sends clean copy to processor 1
  3. Directory sends invalidate message to processor 2
  4. Cache 2 sends acknowledgement back to processor 1

[Figure: processors 1 and 2 with their caches and the L2 directory;
messages 1-4 above are drawn as arrows, the clean-copy reply (2)
being off the critical path while the invalidate (3) and
acknowledgement (4) complete the critical chain]
53
Read to an Exclusive Block
[Figure: read request to an exclusively held block — the L2 directory
forwards the read request to the owner; the dirty copy forwarded from
the owner's L1 is critical, while the speculative reply from L2 and
the write-back data are non-critical]
54
Evaluation Platform and Simulation Methodology
  • Virtutech Simics Functional Simulator
  • Ruby Timing Model (GEMS)
  • SPLASH Suite

55
Heterogeneous Model
[Figure: processor connected to L2 through a heterogeneous link
carrying L-wires, B-wires, and PW-wires]
  • 11% performance improvement
  • 22.5% power savings in wires

56
Summary
  • Coherence messages have diverse needs
  • Intelligent mapping of these messages to wires in
    heterogeneous network can improve both
    performance and power
  • Low-bandwidth, high-speed links improve
    performance by 11% for the SPLASH benchmark suite
  • Non-critical traffic on the power-optimized
    network decreases wire power by 22.5%
  • Ref: Interconnect-Aware Coherence Protocols (ISCA
    06), in collaboration with Liqun Cheng

57
On-Core Communications
  • L-wires
  • Narrow bit-width operands
  • Branch mispredict signal
  • PW-wires
  • Non-critical register values
  • Ready registers
  • Store data
  • 11% improvement in ED²

58
Results Summary
[Figure: the 16-cluster floorplan from slide 11, annotated with the
results]
Cache reads and writes: 114% processor performance
improvement, 50% power savings
Coherence transactions: 11% performance improvement,
22.5% power savings in wires
L1 accesses: 7% performance improvement, 11% ED²
improvement
59
Conclusion
  • Impact of interconnect choices in modern
    processors is significant
  • Architectural level wire management can improve
    both power and performance of future
    communication bound processors
  • Architects have a lot to offer in the area of
    wire aware design

60
Future Research
  • Exploit upcoming technologies
  • Low-swing wires, optical interconnects, RF
    interconnects, transmission lines, etc.
  • Transactional Memory
  • Network to support register-register
    communication
  • Dynamic adaptation

61
Acknowledgements
  • Committee members
  • Rajeev, Al, John, Erik, and Shubu (Intel)
  • External
  • Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer
    (Intel)
  • CS front office staff
  • Lab-mates
  • Karthik, Niti, Liqun, and other fellow grads

62
Avenues Explored
  • Inter-core communication (ISCA 2006)
  • Memory hierarchy (ISCA 2007)
  • CACTI 6.0 publicly released (MICRO 2007), (IEEE
    Micro Top Picks 2008)
  • Out-of-order core (HPCA 2005, IEEE Micro 06)
  • Power and Temperature Aware Architectures
  • (ISPASS 2006)
  • Current projects / under submission
  • Scalable and Reliable Transactional Memory (PACT
    08)
  • Rethinking Fundamentals: Route Wires or Packets?
  • 3D Reconfigurable Caches