University of Utah - PowerPoint PPT Presentation

About This Presentation
Title:

University of Utah

Description:

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi

1
Optimizing NUCA Organizations and Wiring
Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
2
Large Caches
Intel Montecito
  • Cache hierarchies will dominate chip area
  • 3D stacked processors with an entire die for
    on-chip cache could be common
  • Montecito has two private 12 MB L3 caches (27 MB
    including L2)
  • Long global wires are required to transmit
    data/address

3
Wire Delay/Power
  • Wire delays are costly for performance and power
  • Latencies of 60 cycles to reach the ends of a chip
    at 32 nm (at 5 GHz)
  • 50% of dynamic power is in interconnect switching
    (Magen et al., SLIP 04)
  • CACTI (version 4) access time for a 24 MB cache is
    90 cycles at 5 GHz in 65 nm technology
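To put the cycle counts on this slide in absolute terms, latency in nanoseconds is the cycle count divided by the clock frequency in GHz. A minimal sketch of that arithmetic; the function name is ours for illustration and is not part of CACTI:

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    return cycles / clock_ghz


# Figures quoted on the slide, both at a 5 GHz clock.
cross_chip_ns = cycles_to_ns(60, 5.0)   # reaching the ends of a chip at 32 nm
cache_24mb_ns = cycles_to_ns(90, 5.0)   # 24 MB cache access at 65 nm

print(f"cross-chip wire latency: {cross_chip_ns} ns")
print(f"24 MB cache access time: {cache_24mb_ns} ns")
```

At 5 GHz these work out to 12 ns and 18 ns respectively.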
4
Contribution
  • Support for various interconnect models
  • Improved design space exploration
  • Support for modeling Non-Uniform Cache Access
    (NUCA)

5
Cache Design Basics
[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]
6
Existing Model - CACTI
[Figure: cache models with 4 sub-arrays and with 16 sub-arrays, each annotated with wordline and bitline delay and decoder delay]
Decoder delay = H-tree delay + logic delay
7
Power/Delay Overhead of Wires
  • H-tree delay increases with cache size
  • H-tree power continues to dominate
  • Bitlines are the other major contributor to total
    power

8
Motivation
  • The dominant role of interconnect is clear
  • The lack of a tool to model interconnect in detail
    can impede progress
  • Current solutions have limited wire options
  • Orion, CACTI
  • Weak wire model
  • No support for modeling multi-megabyte caches

9
CACTI 6.0 Enhancements
  • Incorporation of:
      • Different wire models
      • Different router models
      • Grid topology for NUCA
      • Shared bus for UCA
      • Contention values for various cache
        configurations
  • Methodology to compute the optimal NUCA organization
  • Improved interface that enables trade-off
    analysis
  • Validation analysis

10
Full-swing Wires
[Figure: full-swing wire diagram with labels X, Y, Z]
11
Full-swing Wires II
[Figure: three design points plotted against repeater size: 10% delay penalty, 20% delay penalty, 30% delay penalty]
  • Caveat: repeater sizing and spacing cannot always
    be controlled precisely

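One plausible reading of the three design points is: among candidate repeater configurations, pick the lowest-energy one whose delay stays within a given penalty of the fastest configuration. The sketch below illustrates that selection rule only. The candidate list, its delay and energy values, and all names are made up for illustration; they are not CACTI data or CACTI's API.

```python
from typing import NamedTuple, Sequence


class WirePoint(NamedTuple):
    label: str        # hypothetical repeater configuration name
    delay_ps: float   # wire delay for this configuration
    energy_fj: float  # switching energy for this configuration


def pick_design_point(points: Sequence[WirePoint], penalty: float) -> WirePoint:
    """Return the lowest-energy point within `penalty` of the minimum delay.

    penalty=0.10 means "at most 10% slower than the fastest candidate".
    """
    best_delay = min(p.delay_ps for p in points)
    budget = best_delay * (1.0 + penalty)
    feasible = [p for p in points if p.delay_ps <= budget]
    return min(feasible, key=lambda p: p.energy_fj)


# Illustrative numbers only: smaller or sparser repeaters are assumed to be
# slower but cheaper in energy.
candidates = [
    WirePoint("A", 100.0, 50.0),
    WirePoint("B", 108.0, 38.0),
    WirePoint("C", 118.0, 30.0),
    WirePoint("D", 128.0, 25.0),
    WirePoint("E", 140.0, 22.0),
]

for penalty in (0.10, 0.20, 0.30):
    chosen = pick_design_point(candidates, penalty)
    print(f"{int(penalty * 100)}% delay penalty -> {chosen.label}")
```

The caveat on the slide applies here too: a real layout may not be able to realize the exact repeater size and spacing that such a search selects.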
12
Full-Swing Wires
  • Fast and simple
  • Delay proportional to sqrt(RC) rather than RC
  • High bandwidth
  • Can be pipelined
  • Requires silicon area
  • High energy
      • Quadratic dependence on voltage
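The quadratic dependence on voltage follows from the standard expression for the energy stored on a capacitance, E = ½·C·V². A minimal sketch, assuming a lumped wire capacitance; the 1 pF value and the function name are illustrative assumptions, not figures from the slides:

```python
def switching_energy_j(cap_f: float, swing_v: float) -> float:
    """Energy stored on a lumped capacitance charged to swing_v: 0.5 * C * V^2."""
    return 0.5 * cap_f * swing_v ** 2


WIRE_CAP_F = 1e-12  # assumed 1 pF wire capacitance, for illustration only

full = switching_energy_j(WIRE_CAP_F, 1.0)   # 1.0 V swing
half = switching_energy_j(WIRE_CAP_F, 0.5)   # 0.5 V swing

# Halving the voltage swing cuts the energy to a quarter.
print(f"energy ratio (1.0 V vs 0.5 V): {full / half:.1f}")
```

This is why reducing the voltage swing on a wire is such an effective lever for interconnect energy.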

13
Low-swing wires
[Figure: differential wires. Labels: 50 mV rise, 400 mV (three times), 50 mV drop]
14
Differential Low-swing
  • Very low-power, can be routed over other
    modules
  • Relatively slow, low-bandwidth, high area
    requirement, requires special transmitter and
    receiver
  • Bitlines are a form of low-swing wire
      • Optimized for speed and area rather than power
      • Driver and pre-charger employ the full Vdd voltage

15
Delay Characteristics
[Figure: wire delay characteristics. Annotation: quadratic increase in delay]
16
Energy Characteristics
17
Search Space of CACTI-5
  • Design space with global wires optimized for
    delay

18
Search Space of CACTI-6
  • Design space with global and low-swing wires

[Figure legend: low-swing, 30% delay penalty, least delay]
19
CACTI: Another Limitation
  • Access delay is equal to the delay of the slowest
    sub-array
      • Very high hit time for large caches
  • Employs a separate bus for each cache bank for
    multi-banked caches
      • Not scalable

Potential solution: NUCA. Extend CACTI to model
NUCA.
Exploit different wire types and network design
choices to improve the search space.
20
Non-Uniform Cache Access (NUCA)
  • Large cache is broken into a number of small
    banks
  • Employs on-chip network for communication
  • Access delay ∝ (distance between bank and cache
    controller)
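The proportionality between access delay and distance can be sketched as a fixed bank access time plus a per-hop network cost. The example below assumes banks laid out on a 2D grid with Manhattan (X-Y) routing, one-way traversal, and hypothetical cycle counts; none of these values come from the slides.

```python
from typing import Tuple


def bank_access_cycles(
    bank_xy: Tuple[int, int],
    controller_xy: Tuple[int, int],
    hop_cycles: int,
    bank_cycles: int,
) -> int:
    """One-way latency to a bank: bank access time plus hops * per-hop cost.

    Hop count is the Manhattan distance between the bank and the cache
    controller on the grid.
    """
    hops = abs(bank_xy[0] - controller_xy[0]) + abs(bank_xy[1] - controller_xy[1])
    return bank_cycles + hops * hop_cycles


# Hypothetical parameters: 3-cycle bank access, 2 cycles per network hop,
# cache controller at grid position (0, 0).
near = bank_access_cycles((0, 0), (0, 0), hop_cycles=2, bank_cycles=3)
far = bank_access_cycles((3, 3), (0, 0), hop_cycles=2, bank_cycles=3)
print(f"nearest bank: {near} cycles, farthest bank in a 4x4 grid: {far} cycles")
```

Banks close to the controller see little more than the bank access time itself, while distant banks are dominated by the network.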

[Figure: CPU with L1 and the cache banks (Kim et al., ASPLOS 02)]
21
Extension to CACTI
  • On-chip network
      • Wire model based on ITRS 2005 parameters
      • Grid network
      • 3-stage speculative router pipeline
  • Network latency vs. bank access latency trade-off
      • Iterate over different bank sizes
      • Calculate the average network delay based on the
        number of banks and bank sizes
      • Consider contention values for different cache
        configurations
  • Similarly, we also consider the power consumed by
    each organization
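The iteration described above can be sketched as a small search: for each candidate grid of banks, add a bank access time, an average network delay, and a contention term, then keep the organization with the lowest total. The bank and link timing functions below are toy stand-ins invented for illustration (CACTI computes these from detailed circuit and wire models), the controller is assumed to sit at a grid corner, and only the 3-cycle router figure is taken from the slide.

```python
import math
from typing import Sequence, Tuple

ROUTER_CYCLES = 3  # the slide mentions a 3-stage router pipeline


def toy_bank_cycles(bank_kb: float) -> int:
    # Toy stand-in for a bank access-time model: grows with sqrt(capacity).
    return math.ceil(0.25 * math.sqrt(bank_kb))


def toy_link_cycles(bank_kb: float) -> int:
    # Toy stand-in: links that span larger banks are longer, hence slower.
    return max(1, math.ceil(0.03 * math.sqrt(bank_kb)))


def avg_hops(rows: int, cols: int) -> float:
    # Mean Manhattan distance from a controller at grid corner (0, 0),
    # averaged over all banks.
    return (rows - 1) / 2 + (cols - 1) / 2


def avg_access_cycles(total_kb: float, rows: int, cols: int,
                      contention_cycles: float = 0.0) -> float:
    """Average access latency for a rows x cols grid of equal-sized banks."""
    bank_kb = total_kb / (rows * cols)
    per_hop = ROUTER_CYCLES + toy_link_cycles(bank_kb)
    return toy_bank_cycles(bank_kb) + avg_hops(rows, cols) * per_hop + contention_cycles


def best_organization(total_kb: float,
                      grids: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Return the grid shape with the lowest average access latency."""
    return min(grids, key=lambda g: avg_access_cycles(total_kb, *g))


grids = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]
total_kb = 32 * 1024  # a 32 MB cache
for rows, cols in grids:
    print(f"{rows}x{cols}: {avg_access_cycles(total_kb, rows, cols):.1f} cycles")
print("best:", best_organization(total_kb, grids))
```

With these toy numbers the 4x4 grid wins: fewer, larger banks pay in bank access time, while many small banks pay in network hops. A fuller version would make the contention term depend on the configuration and repeat the search with power as the objective.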

22
Trade-off Analysis (32 MB Cache)
16-core CMP
23
Effect of Core Count
24
Power-Centric Design (32 MB Cache)
25
Validation
  • HSPICE tool
  • Predictive Technology Model (PTM), 65 nm technology
  • Analytical model that employs PTM parameters
    compared against HSPICE
  • Distributed wordlines, bitlines, low-swing
    transmitters, wires, receivers
  • Verified to be within 12% of HSPICE

26
Case Study: Heterogeneous D-NUCA
  • Dynamic-NUCA
      • Reduces access time by dynamic data movement
      • Nearby banks are accessed more frequently
  • Heterogeneous banks
      • Nearby banks are made smaller and hence faster
      • Accesses to nearby banks consume less power
      • Other banks can be made larger and more
        power-efficient

27
Access Frequency
  • % of requests satisfied by x KB of cache

28
A Few Heterogeneous Organizations Considered by CACTI
[Figures: Model 1, Model 2]
29
Other Applications
  • Exposing wire properties
  • Novel cache pipelining
  • Early lookup, Aggressive lookup (ISCA 07)
  • Flit-reservation flow control (Peh et al., HPCA 00)
  • Novel topologies
  • Hybrid network (ISCA 07)

30
Conclusion
  • Network parameters and contention play a critical
    role in deciding NUCA organization
  • Wire choices have significant impact on cache
    properties
  • CACTI 6.0 can identify models that reduce power
    by a factor of three for a delay penalty of 25%
  • http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
  • http://www.cs.utah.edu/~rajeev/cacti6/