On-Chip Interconnect Trend and Design Optimization - PowerPoint PPT Presentation

About This Presentation
Title:

On-Chip Interconnect Trend and Design Optimization

Description:

On-Chip Interconnect Trend and Design Optimization Chung-Kuan Cheng UC San Diego, La Jolla, CA – PowerPoint PPT presentation

Slides: 85
Provided by: kua79
Learn more at: https://cseweb.ucsd.edu
Transcript and Presenter's Notes

Title: On-Chip Interconnect Trend and Design Optimization


1
On-Chip Interconnect Trend and Design Optimization
  • Chung-Kuan Cheng
  • UC San Diego, La Jolla, CA

2
Outline
  • Global Interconnect Technologies
  • RC Trees and Transmission Lines
  • Prefix Adder Synthesis
  • Modeling
  • FPGA Interconnect Architecture
  • Modeling
  • Interconnect Architecture
  • Non-Manhattan Wire Arrangement

3
Interconnect Technologies
  • Introduction
  • On-Chip Global Interconnection
  • Global Wire Modeling
  • Performance Comparison

4
Introduction: Performance Impact
  • Interconnect delay determines the system
    performance [ITRS08]
  • 542ps for 1mm minimum-pitch Cu global wire w/o
    repeater at 45nm
  • 150ps for 10 levels of FO4 delay at 45nm

[Ho2001] Future of Wire
5
Introduction: Power Dissipation
  • Interconnects consume a significant portion of
    power
  • 1-2 orders of magnitude larger than gates
  • Half of the dynamic power is dissipated on
    repeaters to minimize latency [Zhang07]
  • Wires consume 50% of total dynamic power for a
    0.13um microprocessor [Magen04]
  • About 1/3 burned on the global wires.

6
Introduction: Technology Trend
  • On-Chip Interconnect Scaling
  • Dimension shrinks
  • Wire resistance increases -> RC delay
  • Increasing capacitive coupling -> delay, power,
    noise, etc.
  • Performance of global wires decreases w/
    technology scaling.

Table: Scaling trend of per-unit-length (PUL) wire
resistance and capacitance
Wire Category  Parameter      90nm    45nm    22nm
M1 Wire        Rw (kohm/mm)   1.914   8.860   34.827
M1 Wire        Cw (pF/mm)     0.183   0.157   0.129
Global Wire    Rw (kohm/mm)   0.532   2.970   11.000
Global Wire    Cw (pF/mm)     0.205   0.179   0.151

Figure: Copper resistivity versus wire width
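
As a rough cross-check of the scaling table above, the RC product of an unrepeated wire can be computed per node. This is only a sketch: the Rw and Cw values are copied from the table, while the 0.38 factor (a textbook approximation for the 50% delay of a distributed RC line) and the function names are assumptions that do not come from the slides.

```python
# Per-unit-length values copied from the table: (Rw in kohm/mm, Cw in pF/mm).
WIRES = {
    "M1": {"90nm": (1.914, 0.183), "45nm": (8.860, 0.157), "22nm": (34.827, 0.129)},
    "Global": {"90nm": (0.532, 0.205), "45nm": (2.970, 0.179), "22nm": (11.000, 0.151)},
}


def rc_product_ps(wire, node, length_mm=1.0):
    """R*C of an unrepeated wire in ps (kohm * pF = ns, hence the factor 1000)."""
    r_per_mm, c_per_mm = WIRES[wire][node]
    return r_per_mm * c_per_mm * length_mm ** 2 * 1000.0


def distributed_delay_ps(wire, node, length_mm=1.0):
    """Assumed textbook estimate: 50% delay of a distributed RC line ~ 0.38*R*C."""
    return 0.38 * rc_product_ps(wire, node, length_mm)


if __name__ == "__main__":
    for node in ("90nm", "45nm", "22nm"):
        print(node, round(rc_product_ps("Global", node), 1), "ps for 1mm (R*C)")
```

Because both R and C grow with length, the product grows quadratically with `length_mm`, which is the usual motivation for inserting repeaters.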
7
Organization of On-Chip Global Interconnections
8
Multi-Dimensional Design Consideration
  • Preliminary analysis results assuming 65nm CMOS
    process.
  • Application-oriented choice
  • Low Latency: T-TL or UT-TL (single-ended T-lines)
  • High Throughput: R-RC
  • Low Power: PE-TL or UE-TL (differential T-lines)
  • Low Noise: PE-TL or UE-TL (differential T-lines)
  • Low Area/Cost: R-RC

For each architecture, the more area the pentagon
covers, the better the overall performance.
9
On-Chip Global Interconnect Schemes (1)
  • R-RC structure
  • Repeater size/Length of segments
  • Adopt previous design methodology [Zhang07]
  • UT-TL structure
  • Full swing at wire-end
  • Tapered inverter chain as TX
  • T-TL structure
  • Optimize eye-height at wire-end
  • Non-Tapered inverter chain as TX

Repeated RC wires (R-RC)
Un-Terminated and Terminated T-Line (UT-TL and
T-TL)
10
On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL
and PE-TL)
  • Driver side: tapered differential driver
  • Receiver side: termination resistance,
    Sense-Amplifier (SA), inverter chain
  • Passive equalizer: parallel RC network
  • Design constraint: enough eye-opening (50mV)
    needed at the wire-end

11
Effects of driver impedance and termination
resistance on step response
  • Optimal Rload
  • Larger driver impedance leads to slower rise edge
    and lower saturation voltage
  • Larger termination resistance causes sharper rise
    edge but with larger reflection
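
The two trends above can be illustrated with first-order formulas: the settled far-end level is a resistive divider, and the far-end reflection grows as the termination departs from the line impedance. This is a sketch under assumptions: a lossless-line reflection formula, a hypothetical characteristic impedance `z0`, and illustrative function names. It does not model the rise-edge behavior.

```python
def saturation_voltage(vdd, r_driver, r_wire, r_load):
    """Settled far-end level: divider between (driver + wire) and termination."""
    return vdd * r_load / (r_driver + r_wire + r_load)


def reflection_coefficient(r_load, z0):
    """Far-end reflection for a resistive termination on a line of impedance z0."""
    return (r_load - z0) / (r_load + z0)


if __name__ == "__main__":
    # Hypothetical values: a larger driver impedance lowers the settled level.
    for r_driver in (20.0, 50.0, 100.0):
        print(r_driver, round(saturation_voltage(1.0, r_driver, 10.0, 100.0), 3))
    # A larger termination resistance (relative to z0) reflects more.
    for r_load in (50.0, 100.0, 200.0):
        print(r_load, round(reflection_coefficient(r_load, 50.0), 3))
```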

12
Bit rate = 50Gbps, Rs = 11.06ohm, Rd = 350ohm,
Cd = 0.38pF, RL = 107.69ohm
13
Global Wire Modeling: Single-Ended and
Differential On-Chip T-lines
  • Orthogonal layers replaced by ground planes -> 2D
    cap extraction, accurate when loading density is
    high.
  • Top-layer thick wires used -> dimensions are
    maintained as technology scales.
  • LC-mode behavior dominant

Determine the bit rate
  • Smallest wire dimensions that satisfy the eye
    constraint
  • Note: PE-TL needs a narrower wire -> equalization
    helps to increase density.

14
Global Wire Modeling: RC wires and T-lines
  • RC wire modeling
  • Distributed π model composed of wire resistance
    and capacitance
  • Closed-form equations [Sim03] to calculate 2D
    wire capacitance
  • T-line 2D-R(f)L(f)C parameter extraction
  • T-line Modeling
  • R(f)L(f)C Tabular model -> Transient simulation
    to estimate eye-height.
  • Synthesized compact circuit model [Kopcsay02] ->
    Study signal integrity issue.

2D-C Extraction Template
2D-R(f)L(f) Extraction Template
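
A distributed π model of an RC wire can be sketched as a ladder of identical π sections, with the Elmore delay to the far end used as a simple figure of merit. The segment count, the function name, and the use of Elmore delay are assumptions for illustration; the slides do not state which delay metric was used.

```python
def elmore_delay_pi_ladder(r_total, c_total, n_segments):
    """Elmore delay to the far end of a wire split into n identical pi sections.

    Each section is a series resistor r_total/n with half of its capacitance
    (c_total/n) lumped at each end, so interior nodes carry c_total/n.
    """
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    # Node 0 is the driver side, node n is the far end.
    node_caps = [c_seg / 2] + [c_seg] * (n_segments - 1) + [c_seg / 2]
    delay = 0.0
    for k in range(1, n_segments + 1):
        # Resistor k charges every capacitor located at nodes k..n.
        delay += r_seg * sum(node_caps[k:])
    return delay


if __name__ == "__main__":
    r, c = 2.970e3, 0.179e-12  # 1mm global wire at 45nm, from the scaling table
    for n in (1, 4, 16):
        print(n, elmore_delay_pi_ladder(r, c, n))
```

For this ladder the result equals R*C/2 regardless of the number of sections, which is one reason the π form is a convenient lumped approximation.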
15
Performance Analysis: Definitions
  • Normalized delay (unit: ps/mm)
  • Propagation delay includes wire delay and gate
    delay.
  • Normalized energy per bit (unit: pJ/m)
  • Bit rate is assumed to be the inverse of
    propagation delay for RC wires
  • Normalized throughput (unit: Gbps/um)
16
Performance Analysis: Latency
  • Variables: technology-defined parameters
  • Supply voltage Vdd (unit: V)
  • Dielectric constant
  • Min-sized inverter FO4 delay (unit: ps)
  • R-RC structure (min-d)
  • is roughly constant
  • FO4 delay scales w/ scaling factor S
  • T-line structures
  • Sum of wire delay and TX delay
  • Wire delay
  • TX delay improved w/ FO4 delay

Decreasing w/ technology scaling!
Increasing w/ technology scaling!
17
Performance Analysis: Energy per Bit
  • Same variables defined before

Constant !
  • R-RC structure (min-d)
  • Vdd reduces as technology scales
  • reduces as technology scales
  • T-line structures
  • Sum of power consumed on wire and TX.
  • Power of T-line
  • Power of TX circuit
  • FO4 delay reduces exponentially

Energy decreases w/ technology scaling!
Energy decreases w/ larger slope!!
18
Performance Analysis: Throughput
  • Same variables defined before
  • R-RC structure (min-d)
  • Assuming wire pitch
  • FO4 delay reduces exponentially
  • T-line structures
  • TX bandwidth
  • Neglect the minor change of wire pitch
  • K1 = 0 for UT-TL
  • FO4 delay reduces exponentially

Throughput increases by 20% per generation!
Throughput increases by 43% per generation!!
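
To see what these per-generation rates imply, they can be compounded over the four generations between the 90nm and 22nm nodes (90, 65, 45, 32, 22). This is plain arithmetic on the quoted rates, with an illustrative function name.

```python
def growth_over_generations(rate_per_generation, generations):
    """Compound a fractional per-generation increase over several generations."""
    return (1.0 + rate_per_generation) ** generations


if __name__ == "__main__":
    # Four generations separate the 90nm and 22nm nodes.
    print(round(growth_over_generations(0.20, 4), 2))
    print(round(growth_over_generations(0.43, 4), 2))
```

A 20% rate roughly doubles throughput over four generations, while a 43% rate roughly quadruples it.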
19
Design Framework for On-Chip T-line Schemes
  • Proposed framework can be applied to design
    UT-TL/T-TL/UE-TL/PE-TL by changing wire
    configuration and circuit structure.
  • Different optimization routines (LP/ILP/SQP, etc)
    can be adopted according to the problem
    formulation.

20
Experimental Settings
  • Design objective: min-d
  • Technology nodes: 90nm-22nm
  • Five different global interconnection structures
  • Wire length: 5mm
  • Parameter extraction
  • 2D field solver CZ2D from the EIP tool suite of
    IBM
  • Tabular model or synthesized model
  • Transistor models
  • Predictive transistor model from [Uemura06]
  • Synopsys level 3 MOSFET model tuned according to
    ITRS roadmap
  • Simulation
  • HSPICE 2005
  • Modeling and Optimization
  • Linear or non-linear regression/SQP routine
  • MATLAB 2007

21
Performance Metric: Normalized Delay Results
and Comparison
  • Technology trends
  • R-RC: delay increases
  • T-line schemes: delay decreases
  • T-line structures
  • Outperform R-RC beyond 90nm
  • Single-ended: lowest delay
  • At 22nm node
  • R-RC: 55ps/mm
  • T-lines: 8ps/mm (85% reduction)
  • Speed of light: 5ps/mm
  • Linear model
  • < 6% average error

22
Performance Metric: Normalized Energy per Bit
Results and Comparison
  • Technology trends
  • R-RC and T-lines: energy decreases
  • T-lines reduce more quickly
  • T-line structures
  • Outperform R-RC beyond 45nm
  • Differential: lowest energy.
  • Single-ended: similar to R-RC.
  • T-TL > UT-TL
  • At 22nm node
  • R-RC: 100pJ/m
  • Single-ended: 60% reduction
  • Differential: 96% reduction
  • Linear model
  • < 12% average error
  • Error for T-TL and PE-TL
  • RL and passive equalizers.

23
Performance Metric: Normalized Throughput
Results and Comparison
  • Technology trends
  • R-RC and T-lines: throughput increases
  • T-lines increase more quickly
  • T-line structures
  • Outperform R-RC beyond 32nm
  • Differential better than single-ended
  • At 22nm node
  • R-RC: 12Gbps/um
  • T-TL: 30% improvement
  • UE-TL: 75% improvement
  • PE-TL: 2X of R-RC
  • Linear model
  • < 7% average error

24
Signal Integrity: Single-Ended T-lines
Worst-case switching pattern for peak noise
simulation
Using w.c. pattern
Using single or multiple PRBS patterns
  • UT-TL structure
  • 380mV peak noise at 1V supply voltage w/ 7ps rise
    time
  • SI could be a big issue as supply voltage drops
  • T-TL less sensitive to noise
  • At the same rise time, 50% reduction of peak
    noise
  • Peak noise ? as technology scales

25
Signal Integrity: Differential T-lines
Worst-case switching pattern for peak noise
simulation
  • More reliable
  • Termination resistance
  • Common-mode noise reduction
  • Peak noise
  • Within 10mV range
  • Eye-Heights
  • UE-TL
  • Eye reduces as bit rate increases
  • Harder to meet constraint.
  • PE-TL
  • > 70mV eye even at 22nm node
  • Equalization does help!

26
Summary (cont)
Each item in the tables is score/value. The higher
the score, the better in terms of the given
metric; the max. score is 5. On the original
slide, the best structure in each column is marked
in red.

Low-Latency Application (ps/mm)
Scheme  90nm   65nm   45nm   32nm   22nm
R-RC    3/35   1/42   1/46   1/55   1/55
UT-TL   5/15   5/13   5/10   5/9    5/8
T-TL    5/15   5/13   5/10   5/9    5/8
UE-TL   1/37   3/25   3/16   3/12   5/8
PE-TL   1/37   3/25   3/16   3/12   5/8

Low-Energy Application (pJ/m)
Scheme  90nm   65nm   45nm   32nm   22nm
R-RC    2/150  2/140  1/130  1/100  1/100
UT-TL   3/140  3/110  3/70   3/50   2/40
T-TL    1/260  1/200  2/100  2/60   3/40
UE-TL   4/60   4/36   4/20   4/10   5/4
PE-TL   5/26   5/16   5/8    5/5    5/2

Low-Noise Application (score only)
Scheme  90nm   65nm   45nm   32nm   22nm
R-RC    1      1      1      1      1
UT-TL   1      1      1      1      1
T-TL    3      3      3      3      3
UE-TL   5      5      4      4      4
PE-TL   4      4      5      5      5

High-Throughput Application (Gbps/um)
Scheme  90nm   65nm   45nm   32nm   22nm
R-RC    5/5    5/6    3/8    3/10   2/12
UT-TL   2/3.3  1/3.3  1/3.3  1/3.3  1/3.3
T-TL    1/3    2/3.4  2/6    2/9    3/16
UE-TL   3/3    3/5    4/9    4/13   4/21
PE-TL   4/4    4/5.3  5/9    5/15   5/24
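
The summary values can be placed in a small lookup structure to pick the best scheme for a given node and to compare each scheme against R-RC. The numbers are the values (not the scores) from the summary tables; the dictionary layout and function names are illustrative assumptions.

```python
NODES = ["90nm", "65nm", "45nm", "32nm", "22nm"]

# Values copied from the summary tables, one entry per technology node.
DELAY_PS_PER_MM = {
    "R-RC": [35, 42, 46, 55, 55],
    "UT-TL": [15, 13, 10, 9, 8],
    "T-TL": [15, 13, 10, 9, 8],
    "UE-TL": [37, 25, 16, 12, 8],
    "PE-TL": [37, 25, 16, 12, 8],
}
ENERGY_PJ_PER_M = {
    "R-RC": [150, 140, 130, 100, 100],
    "UT-TL": [140, 110, 70, 50, 40],
    "T-TL": [260, 200, 100, 60, 40],
    "UE-TL": [60, 36, 20, 10, 4],
    "PE-TL": [26, 16, 8, 5, 2],
}
THROUGHPUT_GBPS_PER_UM = {
    "R-RC": [5, 6, 8, 10, 12],
    "UT-TL": [3.3, 3.3, 3.3, 3.3, 3.3],
    "T-TL": [3, 3.4, 6, 9, 16],
    "UE-TL": [3, 5, 9, 13, 21],
    "PE-TL": [4, 5.3, 9, 15, 24],
}


def best_schemes(table, node, lower_is_better):
    """Return every scheme that ties for the best value at the given node."""
    i = NODES.index(node)
    pick = min if lower_is_better else max
    target = pick(values[i] for values in table.values())
    return [name for name, values in table.items() if values[i] == target]


def percent_change_vs_rrc(table, scheme, node):
    """Signed percent change of a scheme relative to R-RC at the given node."""
    i = NODES.index(node)
    base = table["R-RC"][i]
    return 100.0 * (table[scheme][i] - base) / base


if __name__ == "__main__":
    print(best_schemes(ENERGY_PJ_PER_M, "22nm", lower_is_better=True))
    print(round(percent_change_vs_rrc(DELAY_PS_PER_MM, "T-TL", "22nm")))
```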
27
Summary of Global Interconnect
  • Compare five different global interconnections in
    terms of latency, energy per bit, throughput and
    signal integrity from 90nm to 22nm.
  • A simple linear model is provided to link
  • Architecture-level performance metrics
  • Technology-defined parameters
  • Some observations from experimental results
  • T-line structures have the potential to replace
    R-RC at future nodes
  • Differential T-lines are better than single-ended
  • Low-power/High-throughput/Low-noise
  • Equalization could be utilized for on-chip global
    interconnection
  • Higher throughput density, improve signal
    integrity
  • Even w/ lower energy dissipation (passive
    equalizations)

28
Prefix Adder Synthesis
  • Motivation
  • Prefix Adder Formulation
  • Area/Timing/Power Models
  • Mixed-Radix (2,3,4) Adders
  • ILP Formulation
  • Experimental Results

29
Motivation: Prefix Adder
  • Increasing impact of physical design
  • and concern of power.

Logical Levels
Fanouts
Wire Tracks
30
Prefix Adder Formulation
  • Input: two n-bit binary numbers a and b, and a
    one-bit carry-in
  • Output: n-bit sum s and a one-bit carry-out
  • Prefix addition: carry generation and
    propagation

31
Prefix Addition Formulation
Pre-processing
Prefix Computation
Post-processing
32
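Prefix Adder: Prefix Structure Graph
Figure: prefix structure graph with three stages.
  • Pre-processing: gp generator (inputs ai, bi;
    output gpi)
  • Prefix Computation: GP cell (inputs GP[i, j] and
    GP[j-1, k]; output GP[i, k])
  • Post-processing: sum generator (inputs G[i, 0],
    pi; output si)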
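
The three stages (pre-processing, prefix computation, post-processing) can be sketched in a few lines. This is a minimal illustration, not the synthesized structures from the talk: it assumes the usual definitions g = a AND b and p = a XOR b, and it uses a Kogge-Stone style prefix network as one possible choice of prefix graph. All function names are illustrative.

```python
def gp_combine(high, low):
    """GP cell: merge (G, P) of a higher bit range with the adjacent lower one."""
    g_high, p_high = high
    g_low, p_low = low
    return (g_high | (p_high & g_low), p_high & p_low)


def prefix_add(a, b, n, carry_in=0):
    """Add two n-bit numbers; returns (n-bit sum, carry-out)."""
    # Pre-processing: per-bit generate/propagate signals.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]

    # Prefix computation: after the loop, span[i] holds (G, P) for bits i..0.
    span = list(gp)
    distance = 1
    while distance < n:
        span = [
            gp_combine(span[i], span[i - distance]) if i >= distance else span[i]
            for i in range(n)
        ]
        distance *= 2

    # Post-processing: carry into bit i+1 is G[i:0] OR (P[i:0] AND carry_in).
    carries = [carry_in] + [g | (p & carry_in) for g, p in span]
    total = 0
    for i in range(n):
        total |= (gp[i][1] ^ carries[i]) << i
    return total, carries[n]


if __name__ == "__main__":
    print(prefix_add(11, 6, 4))  # 11 + 6 = 17 -> sum bits 1, carry-out 1
```

Other prefix graphs differ only in which pairs of ranges the GP cells combine at each level, which is the degree of freedom the synthesis explores.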
33
Area Model
  • Distinguish physical placement from logical
    structure, but keep the bit-slice structure.

Figure: logical view (bit position 1-8 vs. logical
level) and physical view (bit position 1-8 vs.
physical level) with compact placement.
34
Timing Model
  • Cell delay calculation

Cell Delay = Effort Delay + Intrinsic Delay
Effort Delay = Logical Effort x Electrical Effort
Electrical Effort = Cout/Cin
                  = (fanouts + wirelength) / size
Logical effort and intrinsic delay are intrinsic
properties of the cell.
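
The cell delay model can be sketched as a logical-effort style calculation. It assumes the delay is the sum of effort delay and intrinsic delay, that effort delay is logical effort times electrical effort, and that fanout and wire length are expressed in the same normalized capacitance units; the function names and example numbers are hypothetical.

```python
def electrical_effort(fanouts, wirelength, size):
    """Cout/Cin: load from fanout gates and wire, relative to the cell size."""
    return (fanouts + wirelength) / size


def cell_delay(logical_effort, intrinsic_delay, fanouts, wirelength, size):
    """Effort delay (logical effort x electrical effort) plus intrinsic delay."""
    return logical_effort * electrical_effort(fanouts, wirelength, size) + intrinsic_delay


if __name__ == "__main__":
    # Hypothetical cell: doubling the size halves the effort delay.
    print(cell_delay(1.0, 1.0, fanouts=2, wirelength=2, size=1))
    print(cell_delay(1.0, 1.0, fanouts=2, wirelength=2, size=2))
```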
35
Power Model
  • Total power consumption = Dynamic power +
    Static power
  • Static power: leakage current of device
  • Psta ?cells
  • Dynamic power: current switching capacitance
  • Pdyn ? ? Cload
  • ? is the switching probability
  • ? j (j is the logical level)

Vanichayobon S., et al., Power-speed Trade-off in
Parallel Prefix Circuits
36
Interval Adjacency Constraint
(column id, logic level)
37
Linearization for Interval Adjacency Constraint
Left interval bound equal to column index
Linearize
Pseudo Linear
38
ILP Formulation Overview
  • Structure variables
  • GP cells
  • Connections (wires)
  • Physical positions
  • Capacitance variables
  • Gate cap
  • Vertical wire cap
  • Horizontal wire cap
  • Timing variables
  • Input arrival time
  • Output arrival time

Flow: ILP (power objective) -> ILOG CPLEX ->
optimal solution
39
Experiments 16-bit Uniform Timing
40
Experiments 16-bit Uniform Timing
41
Min-Power Radix-2 Adder (delay = 22, power =
45.5 FO4)
Figure: adder structure over bit positions 1-16.
42
Min-Power Radix-2/4 Adder (delay = 18, power =
29.75 FO4)
Figure: adder structure over bit positions 1-16.
Legend: Radix-2 Cell, Radix-4 Cell
43
Min-Power Mixed-Radix Adder (delay = 20, power =
28.0 FO4)
Figure: adder structure over bit positions 1-16.
Legend: Radix-2 Cell, Radix-3 Cell, Radix-4 Cell
44
Experiments 64-bit Hierarchical Structure
(Mixed-Radix)
  • Handle high bit-width applications
  • 16x4 and 8x8

45
FPGA Global Routing Architecture
  • Synthesis Flow
  • Formulation
  • Experimental Results

46
Synthesis Flow
47
Formulation
48
FPGA Global Routing Architecture
49
Energy Model: Wires
  • 0.18um tech node, grid length 0.5mm
  • 4 types of wires: RC wires with 1x, 2x, and 4x
    spacing, and transmission lines

50
Energy and Area Model: Switch Box
  • Switch Area Model
  • Fs: number of switches connected to each wire
    entering a switch box
  • f: total flow entering a switch box
  • Ns: per-bit number of switches inside a switch
    box
  • Energy Model
  • Pu: energy of a single switch
  • Ps: per-bit switch energy

51
Topology Generation
  • Candidate topologies are required for MCF
    interconnection synthesis
  • MCF optimizes flow distribution, but not topology
  • Huge number of different topologies exists
  • A row of 10 cells has 2^C(10, 2) = 2^45 different
    connections
  • A 10x10 FPGA has (2^45)^20 = 2^900 different
    topologies!
  • Our assumptions
  • Each row and column has the same connection
  • Wire lengths are given (e.g. wire length 1, 2,
    4, 8)
  • A certain wire length repeats itself till the end
    of the chip
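
The size of the unrestricted topology space can be checked with a few lines of integer arithmetic: each of the C(n, 2) cell pairs in a row is either connected or not, and an n-by-n array has 2n rows and columns. The function names are illustrative.

```python
import math


def connections_per_row(cells):
    """Each of the C(cells, 2) cell pairs is either wired or not."""
    return 2 ** math.comb(cells, 2)


def topologies(cells_per_side):
    """Independent choices for every row and every column of a square array."""
    return connections_per_row(cells_per_side) ** (2 * cells_per_side)


if __name__ == "__main__":
    print(math.comb(10, 2))                # pairs of cells in a row of 10
    print(topologies(10).bit_length() - 1)  # exponent of 2 for a 10x10 FPGA
```

The assumptions listed above (identical rows and columns, a fixed set of wire lengths, periodic repetition) are what cut this space down to a number of candidates that can be enumerated.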

52
Representative Netlist Generation
  • Properties of Representative Netlist
  • Matches the size of the benchmark netlists
  • Geometry Distribution Function
  • The probability of a link between two pins
    decreases exponentially as their distance
    increases
  • k: distance between pins
  • p: probability of distance-1 links
  • P(k): probability of distance-k links
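
A representative netlist with exponentially decaying link distances could be sampled as sketched below. The exact formula for P(k) is not preserved in the transcript, so this assumes a geometric form, P(k) proportional to p(1-p)^(k-1), truncated at the grid size and renormalized; the function names, the seed, and the example parameters are hypothetical.

```python
import random


def link_probabilities(p, max_distance):
    """Assumed form: P(k) ~ p * (1 - p)**(k - 1), renormalized over 1..max_distance."""
    raw = [p * (1.0 - p) ** (k - 1) for k in range(1, max_distance + 1)]
    total = sum(raw)
    return [value / total for value in raw]


def sample_distances(p, max_distance, count, seed=0):
    """Draw pin-to-pin distances for `count` two-pin links."""
    rng = random.Random(seed)
    weights = link_probabilities(p, max_distance)
    return rng.choices(range(1, max_distance + 1), weights=weights, k=count)


if __name__ == "__main__":
    probs = link_probabilities(0.5, 10)
    print([round(x, 3) for x in probs[:4]])
    print(sample_distances(0.5, 10, 8))
```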

53
MCF Interconnection Synthesis
  • Integrate multiple wire styles to MCF formulation
  • Notations
  • Wire style parameter (Pe, Ae), Pe = Pw + Ps
  • Area Ar: routing area in the vertical and
    horizontal dimensions
  • dj: communication demand for net j, dj = 1
  • Flow f(t): flow amount on a Steiner tree t

54
MCF Formulation: Energy Optimization
Obj: Min Energy
Routability constr.
Routing Area constr.
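
The interplay of the three ingredients (energy objective, routability, routing-area limit) can be shown on a toy instance: a single net with unit demand whose flow may be split across wire styles. The full formulation over Steiner trees and many nets is not preserved in the transcript and would normally go to an LP solver; here the tiny LP is solved exactly by enumerating its vertices. The wire-style numbers and all names are hypothetical.

```python
def min_energy_split(styles, area_budget):
    """Minimize energy for one unit of flow split across wire styles.

    styles: {name: (energy_per_unit_flow, area_per_unit_flow)}
    Constraints: flows sum to 1 (routability), total area <= area_budget.
    Returns (energy, {style: flow}) or None when no style fits the budget.
    """
    best = None
    # Vertices with a single style carrying all of the flow.
    for name, (energy, area) in styles.items():
        if area <= area_budget and (best is None or energy < best[0]):
            best = (energy, {name: 1.0})
    # Vertices mixing two styles so that the area constraint is tight.
    for small, (e_small, a_small) in styles.items():
        for large, (e_large, a_large) in styles.items():
            if a_small < area_budget < a_large:
                x_large = (area_budget - a_small) / (a_large - a_small)
                energy = (1.0 - x_large) * e_small + x_large * e_large
                if best is None or energy < best[0]:
                    best = (energy, {small: 1.0 - x_large, large: x_large})
    return best


if __name__ == "__main__":
    # Hypothetical styles: the T-line saves energy but costs more area.
    styles = {"rc_1x": (2.0, 1.0), "t_line": (1.0, 3.0)}
    for budget in (1.0, 2.0, 3.0):
        print(budget, min_energy_split(styles, budget))
```

Loosening the area budget lets more flow move to the lower-energy, larger-area style, which mirrors the energy-versus-routing-area trade-off explored in the experiments.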
55
Experiment Settings
  • Seven of MCNC benchmark circuits
  • Technology mapped to 4-LUTs, each logic block
    contains 16 4-LUTs
  • Size of 10x10 to 11x11 switch boxes, 500-1000
    nets
  • Candidate topologies
  • Available segment length 1, 2, 4, 8
  • Total number of candidate topologies 93

Circuit    alu4   apex4  diffeq  dsip   ex5p   misex3  tseng
Size       11x11  10x10  11x11   11x11  10x10  11x11   10x10
# of nets  621    798    945     593    745    771     788
56
Energy Optimization: Optimized FPGA Routing
Architectures
Routing Area   Energy          Energy Impv.
1500 μm        6.46 x10^3 pJ   -
2500 μm        5.24 x10^3 pJ   19%
3500 μm        4.74 x10^3 pJ   27%
4500 μm        4.63 x10^3 pJ   28%
Wire styles: RC 1x, RC 2x, RC 4x, T-Line 10x
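
The quoted energy improvements are consistent with measuring each result against the smallest routing area. The short check below recomputes them from the four energy figures; the dictionary and function names are illustrative.

```python
# Total energy in pJ for each routing area, copied from the slide.
ENERGY_PJ = {1500: 6.46e3, 2500: 5.24e3, 3500: 4.74e3, 4500: 4.63e3}


def improvement_percent(area, baseline_area=1500):
    """Energy reduction relative to the baseline routing area, in percent."""
    base = ENERGY_PJ[baseline_area]
    return 100.0 * (base - ENERGY_PJ[area]) / base


if __name__ == "__main__":
    for area in (2500, 3500, 4500):
        print(area, round(improvement_percent(area)))
```

Most of the gain comes from the first increase in routing area; the step from 3500 to 4500 adds only about one percentage point.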
57
Energy Optimization Impact of Routing Area
  • Total energy of the 7 benchmarks with optimized
    FPGA routing architectures

58
Interconnect Architecture
  1. Wire Directions (M, Y, X, E)
  2. Layout Region (M, D, Y, X)
  3. Power Ground and Clock Distributions
  4. Layer Assignment
  5. Via Arrangement

Comparison
  1. Wire Length
  2. Throughput
  3. Grid vs No-grid

59
1. Wire Directions and Models
60
2. Layout Regions and Models
61
Length of 2-pin nets to extend an area
Length     Shape    Man.   Y-Arch  X-Arch  Euclidean
M          Diamond  1.250  1.118   1.066   1.016
Y          Hexagon  -      1.101   -       -
X          Octagon  -      -       1.055   -
E          Circle   1.273  1.103   1.055   1.000
E (worst)  -        1.414  1.155   1.082   1.000
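
The E (Circle) and E (worst) rows can be reproduced from elementary geometry. For a routing style with k evenly spaced wire orientations (k = 2 for Manhattan, 3 for Y, 4 for X), a unit Euclidean segment is routed along the two nearest orientations. The closed forms below are standard results for that setting rather than formulas stated on the slides, and the function names are illustrative.

```python
import math


def worst_case_ratio(k):
    """Largest routed/Euclidean length ratio with k evenly spaced orientations."""
    return 1.0 / math.cos(math.pi / (2 * k))


def average_ratio(k):
    """Routed/Euclidean ratio averaged over uniformly distributed directions."""
    return (2 * k / math.pi) * math.tan(math.pi / (2 * k))


if __name__ == "__main__":
    for name, k in (("Manhattan", 2), ("Y-Arch", 3), ("X-Arch", 4)):
        print(name, round(average_ratio(k), 3), round(worst_case_ratio(k), 3))
```

Both ratios approach 1 as k grows, which quantifies the diminishing wire-length benefit of adding further routing directions.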
62
Throughput: concurrent flow demand
Throughput  Shape    Manhattan  Y-Arch  X-Arch
M           Square   1.000      1.225   1.346
M (Bound)   -        -          1.241   1.356
M           Diamond  1.195      -       -
Y           Hexagon  -          1.315   -
X           Octagon  -          -       1.420
Note: the ratio of 0-90 planes and 45-135 planes is
not fixed.
63
Flow congestion map for uniform 90 Degree meshes
64
Congestion map of square chip using X-architecture
12 by 12
13 by 13
65
Congestion map of square chip using Y-architecture
12 by 12
13 by 13
66
Explanation For Throughput Increasing
Number of lines across the vertical center
cut-line d/D for 90 degree routing
for 45 degree routing
70
Global Grids (Power/Ground Mesh)
Y-Architecture
X-Architecture
(http://www.xinitiative.org/img/062102forum.pdf)
71
3. Clock Tree on Square Mesh
  • N-level clock tree
  • path distance
  • 21% less than H-tree
  • total wire length
  • 9% less than H-tree, 3% less than X-tree
  • No self-overlapping between parallel wire
    segments

72
4. Layer Assignment
Figure: different routing direction assignments
(I-IV) for Layers 1-4.
73
Normalized throughput of mixed 45-degree and
90-degree mesh with different routing layer
assignments
 
74
Why Does Interleaving Manhattan and Diagonal
Layers Improve Throughput?
Example: between (2,0) and (0,3), the wirelength is
5.0 with Manhattan routing and 3.82 with a
diagonal segment.
The shortest path between two points on the plane
is always a concatenation of a Manhattan line and a
Diagonal line.
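
The two wirelengths in this example follow from the statement that the shortest path is one Manhattan segment plus one diagonal segment. The sketch below computes both lengths for arbitrary points; the function names are illustrative, and the slide's 3.82 corresponds to 1 + 2*sqrt(2), which is about 3.83.

```python
import math


def manhattan_length(p, q):
    """Wirelength using only horizontal and vertical segments."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def octilinear_length(p, q):
    """Wirelength with one 45-degree segment plus one Manhattan segment."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    return abs(dx - dy) + math.sqrt(2) * min(dx, dy)


if __name__ == "__main__":
    print(manhattan_length((2, 0), (0, 3)))
    print(round(octilinear_length((2, 0), (0, 3)), 3))
```

Because a typical connection needs both segment types, placing a diagonal layer next to a Manhattan layer keeps the via between the two segments short, which is consistent with the throughput gain from interleaving.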
75
Observations
  • Routing Direction Assignment Strategies Can
    Affect the Communication Throughput.
  • Interleaving the Manhattan Routing Layers and
    Diagonal Routing Layers can produce better
    Throughput

76
5. Via Arrangement: Banks and Tunnels
  • Use tunnels to detour around vias
  • Use banks of tunnels to maximize the throughput
  • Use bottom k layers to perform intra-cell routing
  • Use top n-k layers to distribute signals to the
    banks

77
Via-Oriented Interconnect Planning
78
Via-Oriented Interconnect Planning
tunnel
79
Via-Oriented Interconnect Planning
Bank of tunnels
k^2 overhead
Full bandwidth
# vias = kL; overhead = k^2 vertical tracks; L =
dimension of the bank
80
Tunnel of Y Arch.
Blocking 5 tracks on the layer of 60-degree
direction
81
Tunnels of Y Arch.
82
3.2 Via-Oriented Interconnect Planning
vias c1kL
Bank of tunnels
Overhead kc2 tracks
83
Conclusion
  • Global Interconnect Technologies
  • EM waves Devices
  • Prefix Adder Synthesis
  • Formulation: ILP
  • FPGA Interconnect Architecture
  • Formulation: LP
  • Interconnect Architecture
  • Lambda Geometry Vias

84
  • Thank you!
  • Q & A