On-Chip Interconnect Trend and Design Optimization

On-Chip Interconnect Trend and Design Optimization. Chung-Kuan Cheng, UC San Diego, La Jolla, CA.

Slides: 85

Provided by: kua79

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
On-Chip Interconnect Trend and Design Optimization

Chung-Kuan Cheng
UC San Diego, La Jolla, CA

2
Outline

Global Interconnect Technologies
RC Trees and Transmission Lines
Prefix Adder Synthesis
Modeling
FPGA Interconnect Architecture
Modeling
Interconnect Architecture
Non-Manhattan Wire Arrangement

3
Interconnect Technologies

Introduction
On-Chip Global Interconnection
Global Wire Modeling
Performance Comparison

4
Introduction: Performance Impact

Interconnect delay determines the system performance [ITRS08]
542 ps for a 1 mm minimum-pitch Cu global wire without repeaters @ 45 nm
150 ps for 10 levels of FO4 delay @ 45 nm

[Ho2001] The Future of Wires
5
Introduction: Power Dissipation

Interconnects consume a significant portion of power
1-2 orders of magnitude larger than gates
Half of the dynamic power is dissipated in repeaters to minimize latency [Zhang07]
Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
About 1/3 is burned on the global wires.

6
Introduction: Technology Trend

On-Chip Interconnect Scaling
Dimension shrinks
Wire resistance increases -> RC delay
Increasing capacitive coupling -> delay, power, noise, etc.
Performance of global wires decreases w/ technology scaling.

Wire Category   Parameter       90nm     45nm     22nm
M1 Wire         Rw (kohm/mm)    1.914    8.860    34.827
M1 Wire         Cw (pF/mm)      0.183    0.157    0.129
Global Wire     Rw (kohm/mm)    0.532    2.970    11.000
Global Wire     Cw (pF/mm)      0.205    0.179    0.151
Table: Scaling trend of per-unit-length (PUL) wire resistance and capacitance
Figure: Copper resistivity versus wire width
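
To make the scaling trend concrete, the sketch below multiplies the per-unit-length values from the table into an RC product for an unrepeated wire. The function names are illustrative, and the 0.38 factor is the usual textbook approximation for the 50% delay of a distributed RC line, not something stated on the slide.

```python
# Per-unit-length (Rw in kohm/mm, Cw in pF/mm) values copied from the table.
WIRES = {
    "M1": {"90nm": (1.914, 0.183), "45nm": (8.860, 0.157), "22nm": (34.827, 0.129)},
    "Global": {"90nm": (0.532, 0.205), "45nm": (2.970, 0.179), "22nm": (11.000, 0.151)},
}


def rc_product_ps(rw_kohm_per_mm, cw_pf_per_mm, length_mm=1.0):
    """Total resistance times total capacitance of an unrepeated wire, in ps."""
    # kohm * pF = ns, so multiply by 1000 to get ps; both R and C grow with length.
    return rw_kohm_per_mm * cw_pf_per_mm * 1000.0 * length_mm ** 2


def distributed_delay_ps(rw_kohm_per_mm, cw_pf_per_mm, length_mm=1.0):
    """Approximate 50% step delay of a distributed RC line (assumed 0.38*R*C)."""
    return 0.38 * rc_product_ps(rw_kohm_per_mm, cw_pf_per_mm, length_mm)


if __name__ == "__main__":
    for category, nodes in WIRES.items():
        for node, (rw, cw) in nodes.items():
            print(category, node, round(rc_product_ps(rw, cw), 1), "ps for 1 mm")
```

For the 45nm global wire the 1 mm RC product comes out near 532 ps, the same order as the 542 ps quoted for an unrepeated 1 mm wire earlier in the deck, although that figure may have been derived differently.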
7
Organization of On-Chip Global Interconnections
8
Multi-Dimensional Design Consideration

Preliminary analysis results assuming a 65nm CMOS process.
Application-oriented choice
Low Latency
T-TL or UT-TL -> Single-Ended T-lines
High Throughput
R-RC
Low Power
PE-TL or UE-TL -> Differential T-lines
Low Noise
PE-TL or UE-TL -> Differential T-lines
Low Area/Cost
R-RC

For each architecture, the more area the pentagon covers, the better the overall performance.
9
On-Chip Global Interconnect Schemes (1)

R-RC structure
Repeater size / length of segments
Adopt previous design methodology [Zhang07]
UT-TL structure
Full swing at wire-end
Tapered inverter chain as TX
T-TL structure
Optimize eye-height at wire-end
Non-tapered inverter chain as TX

Figure: Repeated RC wires (R-RC)
Figure: Un-Terminated and Terminated T-Line (UT-TL and T-TL)
10
On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)

Driver side: tapered differential driver
Receiver side: termination resistance, Sense-Amplifier (SA), inverter chain
Passive equalizer: parallel RC network
Design constraint: enough eye-opening (50mV) needed at the wire-end

11
Effects of driver impedance and termination
resistance on step response

Optimal Rload

Larger driver impedance leads to a slower rising edge and lower saturation voltage
Larger termination resistance causes a sharper rising edge but with larger reflection
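
The two observations above can be illustrated with first-order formulas: the far-end reflection coefficient of a line, and the DC level set by the resistive divider between driver, wire, and termination. This is a simplified sketch that assumes a single characteristic impedance `z0` and a purely resistive DC path. The function names and the example values in the tests are hypothetical, not the slide's design values.

```python
import math


def reflection_coefficient(r_load, z0):
    """Voltage reflection coefficient at the far end of a line with impedance z0."""
    if math.isinf(r_load):
        return 1.0  # un-terminated (open) line reflects the full wave
    return (r_load - z0) / (r_load + z0)


def saturation_voltage(vdd, r_driver, r_wire, r_load):
    """Steady-state far-end voltage from the driver/wire/termination divider."""
    if math.isinf(r_load):
        return vdd  # no DC current flows, so the far end settles at full swing
    return vdd * r_load / (r_driver + r_wire + r_load)
```

A larger `r_load` raises both the settled voltage and the reflection, and a larger `r_driver` lowers the settled voltage, which is consistent with an optimal Rload lying between the two extremes.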

12
Bit rate: 50Gbps; Rs = 11.06 ohm, Rd = 350 ohm, Cd = 0.38pF, RL = 107.69 ohm
13
Global Wire Modeling: Single-Ended and Differential On-Chip T-lines

Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
Top-layer thick wires used -> dimensions are maintained as technology scales.
LC-mode behavior dominant

Determine the bit rate

Smallest wire dimensions that satisfy the eye constraint
Note: PE-TL needs narrower wires -> equalization helps to increase density.

14
Global Wire Modeling: RC wires and T-lines

RC wire modeling
Distributed π model composed of wire resistance and capacitance
Closed-form equations [Sim03] to calculate 2D wire capacitance
T-line 2D-R(f)L(f)C parameter extraction
T-line modeling
R(f)L(f)C tabular model -> transient simulation to estimate eye-height.
Synthesized compact circuit model [Kopcsay02] -> study signal integrity issues.

Figure: 2D-C Extraction Template
Figure: 2D-R(f)L(f) Extraction Template
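
As an illustration of the distributed π model for RC wires, the sketch below builds a ladder of π sections and evaluates its Elmore delay. The Elmore metric is used here only as a convenient first-order estimate; the deck itself relies on circuit simulation, and the helper names are illustrative.

```python
def pi_ladder(r_total, c_total, n_segments):
    """Split a wire into n pi sections: returns (series resistors, node capacitors)."""
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    # Each pi section puts half of its capacitance at each end, so interior
    # nodes collect two halves and the two end nodes collect one half each.
    caps = [c_seg / 2.0] + [c_seg] * (n_segments - 1) + [c_seg / 2.0]
    return [r_seg] * n_segments, caps


def elmore_delay(resistors, caps, r_driver=0.0, c_load=0.0):
    """Elmore delay at the far end of the ladder, including driver and load."""
    caps = list(caps)
    caps[-1] += c_load
    upstream = r_driver
    delay = upstream * caps[0]
    for r, c in zip(resistors, caps[1:]):
        upstream += r          # resistance shared between source and this node
        delay += upstream * c
    return delay
```

With no driver or load, the result is R*C/2 for any number of sections, which is why a small number of π sections is usually adequate for delay estimates.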
15
Performance Analysis: Definitions

Normalized delay (unit: ps/mm)
Propagation delay includes wire delay and gate delay.
Normalized energy per bit (unit: pJ/m)
Bit rate is assumed to be the inverse of propagation delay for RC wires
Normalized throughput (unit: Gbps/um)
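
The defining equations did not survive in this transcript, so the following is a sketch of how the three metrics could be computed from their stated units. The exact formulas (energy as power divided by bit rate and length, throughput as bit rate divided by wire pitch) are assumptions inferred from the units, and all function names are hypothetical.

```python
def normalized_delay_ps_per_mm(delay_ps, length_mm):
    """Propagation delay (wire plus gate) per unit wire length."""
    return delay_ps / length_mm


def rc_bit_rate_gbps(delay_ps):
    """Bit rate of an RC wire, taken as the inverse of its propagation delay."""
    return 1000.0 / delay_ps  # 1/ps equals 1000 Gbps


def normalized_energy_pj_per_m(power_mw, bit_rate_gbps, length_mm):
    """Energy per bit per unit length; mW divided by Gbps gives pJ per bit."""
    return (power_mw / bit_rate_gbps) / (length_mm * 1e-3)


def normalized_throughput_gbps_per_um(bit_rate_gbps, wire_pitch_um):
    """Bit rate per unit of routing width (assumed to be the wire pitch)."""
    return bit_rate_gbps / wire_pitch_um
```

For example, a 5 mm wire with 275 ps total delay has a normalized delay of 55 ps/mm.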

16
Performance Analysis: Latency

Variables: technology-defined parameters
Supply voltage Vdd (unit: V)
Dielectric constant
Min-sized inverter FO4 delay (unit: ps)

R-RC structure (min-d)
is roughly constant
FO4 delay scales w/ scaling factor S
Increasing w/ technology scaling!

T-line structures
Sum of wire delay and TX delay
Wire delay
TX delay improved w/ FO4 delay
Decreasing w/ technology scaling!
17
Performance Analysis: Energy per Bit

Same variables defined before

Constant !

R-RC structure (min-d)
Vdd reduces as technology scales
reduces as technology scales
Energy decreases w/ technology scaling!

T-line structures
Sum of power consumed on wire and TX.
Power of T-line
Power of TX circuit
FO4 delay reduces exponentially
Energy decreases w/ larger slope!!
18
Performance Analysis: Throughput

Same variables defined before

R-RC structure (min-d)
Assuming wire pitch
FO4 delay reduces exponentially
Throughput increases by 20% per generation!

T-line structures
TX bandwidth
Neglect the minor change of wire pitch
K1 = 0 for UT-TL
FO4 delay reduces exponentially
Throughput increases by 43% per generation!!
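
To see what the two per-generation rates imply over the four generations from 90nm to 22nm, a small compound-growth sketch follows. The helper name and the normalization to 1.0 at 90nm are illustrative assumptions.

```python
NODES = ["90nm", "65nm", "45nm", "32nm", "22nm"]


def project(value, growth_per_generation, generations):
    """Compound a per-generation growth rate over a number of generations."""
    return value * (1.0 + growth_per_generation) ** generations


if __name__ == "__main__":
    for label, rate in (("R-RC", 0.20), ("T-line", 0.43)):
        trend = [round(project(1.0, rate, g), 2) for g in range(len(NODES))]
        print(label, dict(zip(NODES, trend)))
```

Under these rates the R-RC throughput roughly doubles over four generations, while the T-line throughput grows by a bit more than 4x.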
19
Design Framework for On-Chip T-line Schemes

Proposed framework can be applied to design
UT-TL/T-TL/UE-TL/PE-TL by changing wire
configuration and circuit structure.
Different optimization routines (LP/ILP/SQP, etc)
can be adopted according to the problem
formulation.

20
Experimental Settings

Design objective: min-d
Technology nodes: 90nm-22nm
Five different global interconnection structures
Wire length: 5mm
Parameter extraction
2D field solver CZ2D from the EIP tool suite of IBM
Tabular model or synthesized model
Transistor models
Predictive transistor model from [Uemura06]
Synopsys level 3 MOSFET model tuned according to the ITRS roadmap
Simulation
HSPICE 2005
Modeling and Optimization
Linear or non-linear regression/SQP routine
MATLAB 2007

21
Performance Metric: Normalized Delay, Results and Comparison

Technology trends
R-RC: increasing
T-line schemes: decreasing
T-line structures
Outperform R-RC beyond 90nm
Single-ended: lowest delay
At 22nm node
R-RC: 55ps/mm
T-lines: 8ps/mm (85% reduction)
Speed of light: 5ps/mm
Linear model
< 6% average percent error
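
The speed-of-light bound and the quoted reduction can be checked with a few lines. The relative permittivity of 2.25 used below is a hypothetical value chosen because it reproduces about 5 ps/mm; the slide does not state which dielectric constant was assumed.

```python
import math

C_VACUUM_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm/ps


def time_of_flight_ps_per_mm(eps_r):
    """Lossless propagation delay per mm in a dielectric with permittivity eps_r."""
    return math.sqrt(eps_r) / C_VACUUM_MM_PER_PS


def percent_reduction(baseline, value):
    """Reduction of value relative to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    print(round(time_of_flight_ps_per_mm(2.25), 2), "ps/mm (assumed eps_r = 2.25)")
    print(round(percent_reduction(55.0, 8.0), 1), "% delay reduction at 22nm")
```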

22
Performance Metric: Normalized Energy per Bit, Results and Comparison

Technology trends
R-RC and T-lines: decreasing
T-lines reduce more quickly
T-line structures
Outperform R-RC beyond 45nm
Differential: lowest energy.
Single-ended: similar to R-RC.
T-TL > UT-TL
At 22nm node
R-RC: 100pJ/m
Single-ended: 60% reduction
Differential: 96% reduction
Linear model
< 12% average percent error
Error for T-TL and PE-TL
RL and passive equalizers.

23
Performance Metric: Normalized Throughput, Results and Comparison

Technology trends
R-RC and T-lines: increasing
T-lines increase more quickly
T-line structures
Outperform R-RC beyond 32nm
Differential better than single-ended
At 22nm node
R-RC: 12Gbps/um
T-TL: 30% improvement
UE-TL: 75% improvement
PE-TL: 2X of R-RC
Linear model
< 7% average percent error

24
Signal Integrity: single-ended T-lines
Worst-case switching pattern for peak noise simulation
Using w.c. pattern
Using single or multiple PRBS patterns

UT-TL structure
380mV peak noise at 1V supply voltage w/ 7ps rise time
SI could be a big issue as supply voltage drops
T-TL less sensitive to noise
At the same rise time, 50% reduction of peak noise
Peak noise ? as technology scales

25
Signal Integrity: differential T-lines
Worst-case switching pattern for peak noise simulation

More reliable
Termination resistance
Common-mode noise reduction
Peak noise
Within 10mV range
Eye-heights
UE-TL
Eye reduces as bit rate increases
Harder to meet constraint.
PE-TL
> 70mV eye even at 22nm node
Equalization does help!

26
Summary (cont)

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     3/35    1/42    1/46    1/55    1/55
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     2/150   2/140   1/130   1/100   1/100
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
UT-TL    2/3.3   1/3.3   1/3.3   1/3.3   1/3.3
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24

Low-Noise Application (score only)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     1       1       1       1       1
UT-TL    1       1       1       1       1
T-TL     3       3       3       3       3
UE-TL    5       5       4       4       4
PE-TL    4       4       5       5       5

Each item in the tables is score/value. Score: the higher, the better in terms of the given metric; the maximum score is 5. In the original slides the best structure in each column is marked in red.
27
Summary of Global Interconnect

Compared five different global interconnection schemes in terms of latency, energy per bit, throughput, and signal integrity from 90nm to 22nm.
A simple linear model is provided to link
Architecture-level performance metrics
Technology-defined parameters
Some observations from experimental results
T-line structures have the potential to replace R-RC at future nodes
Differential T-lines are better than single-ended
Low-power/High-throughput/Low-noise
Equalization could be utilized for on-chip global interconnection
Higher throughput density, improved signal integrity
Even w/ lower energy dissipation (passive equalization)

28
Prefix Adder Synthesis

Motivation
Prefix Adder Formulation
Area/Timing/Power Models
Mixed-Radix (2,3,4) Adders
ILP Formulation
Experimental Results

29
Motivation Prefix Adder

Increasing impact of physical design
and concern of power.

Logical Levels
Fanouts
Wire Tracks
30
Prefix Adder Formulation

Input: two n-bit binary numbers a and b, one-bit carry-in
Output: n-bit sum s and one-bit carry-out
Prefix addition: carry generation and propagation

31
Prefix Addition Formulation
Pre-processing
Prefix Computation
Post-processing
32
Prefix Adder Prefix Structure Graph
Pre-processing: gp generator (inputs ai, bi; output gpi)
Prefix computation: GP cell (combines GP[i, j] and GP[j-1, k] into GP[i, k])
Post-processing: sum generator (inputs pi and G[i:0]; output si)
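
The three stages can be sketched as a small bit-level model. The prefix network below uses a regular radix-2 Kogge-Stone-style arrangement purely for illustration; it is one of many valid prefix structures and not the ILP-optimized structures discussed in this deck. The function name and the way the carry-in is folded into bit 0 are choices made for this example.

```python
def prefix_add(a, b, n, carry_in=0):
    """Add two n-bit integers using a radix-2 parallel-prefix carry network."""
    # Pre-processing: per-bit generate (g) and propagate (p) signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Fold the carry-in into the generate signal of bit 0.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & carry_in)

    # Prefix computation with the GP cell:
    # (G, P)[i:k] = (G[i:j] | P[i:j] & G[j-1:k], P[i:j] & P[j-1:k])
    span = 1
    while span < n:
        new_G, new_P = G[:], P[:]
        for i in range(span, n):
            new_G[i] = G[i] | (P[i] & G[i - span])
            new_P[i] = P[i] & P[i - span]
        G, P = new_G, new_P
        span *= 2

    # Post-processing: each sum bit is p_i XOR the carry into bit i.
    carries = [carry_in] + G[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1]
```

`prefix_add(a, b, n)` returns the n-bit sum and the carry-out, so it can be compared directly against ordinary integer addition.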
33
Area Model

Distinguish physical placement from logical
structure, but keep the bit-slice structure.

(Figure: logical view and physical view of an 8-bit prefix structure, with bit positions 1-8 plotted against logical level and physical level respectively, illustrating compact placement)
34
Timing Model

Cell delay calculation

Cell Delay = Effort Delay + Intrinsic Delay
Effort Delay = Logical Effort x Electrical Effort
Electrical Effort = Cout/Cin = (fanouts + wirelength) / size
Logical effort and intrinsic delay are intrinsic properties of the cell
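
A minimal sketch of this logical-effort style cell delay follows. It assumes that fanout and wire length are both expressed in units of a unit-sized gate's input capacitance, which the slide does not spell out; the function and parameter names are illustrative.

```python
def electrical_effort(fanouts, wirelength, size):
    """Cout/Cin, with the load made of fanout gates plus wire capacitance."""
    return (fanouts + wirelength) / size


def cell_delay(logical_effort, intrinsic_delay, fanouts, wirelength, size):
    """Effort delay (logical effort times electrical effort) plus intrinsic delay."""
    return logical_effort * electrical_effort(fanouts, wirelength, size) + intrinsic_delay
```

With logical effort 1, intrinsic delay 1, four fanouts, and no wire, the model gives the familiar fanout-of-4 inverter delay of 5 normalized units; doubling the cell size halves the effort term.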
35
Power Model

Total power consumption = Dynamic power + Static power
Static power: leakage current of device
Psta ∝ number of cells
Dynamic power: current switching capacitance
Pdyn ∝ α Cload
α is the switching probability
α depends on j (j is the logical level)

Vanichayobon S., et al., "Power-Speed Trade-off in Parallel Prefix Circuits"
36
Interval Adjacency Constraint
(column id, logic level)
37
Linearization for Interval Adjacency Constraint
Left interval bound equal to column index
Linearize
Pseudo Linear
38
ILP Formulation Overview

Structure variables
GP cells
Connections (wires)
Physical positions

Capacitance variables
Gate cap
Vertical wire cap
Horizontal wire cap

ILP
Power Objective
ILOG CPLEX

Timing variables
Input arrival time
Output arrival time

Optimal Solution
39
Experiments 16-bit Uniform Timing
40
Experiments 16-bit Uniform Timing
41
Min-Power Radix-2 Adder (delay = 22, power = 45.5 FO4)
(Figure: 16-bit prefix graph over bit positions 1-16)
42
Min-Power Radix-2&4 Adder (delay = 18, power = 29.75 FO4)
(Figure: 16-bit prefix graph over bit positions 1-16; legend: Radix-2 Cell, Radix-4 Cell)
43
Min-Power Mixed-Radix Adder (delay = 20, power = 28.0 FO4)
(Figure: 16-bit prefix graph over bit positions 1-16; legend: Radix-2 Cell, Radix-3 Cell, Radix-4 Cell)
44
Experiments 64-bit Hierarchical Structure
(Mixed-Radix)

Handle high bit-width applications
16x4 and 8x8

45
FPGA Global Routing Architecture

Synthesis Flow
Formulation
Experimental Results

46
Synthesis Flow
47
Formulation
48
FPGA Global Routing Architecture
49
Energy Model: Wires

0.18um tech node, grid length 0.5mm
4 types of wires: RC wires with different spacing (1x, 2x, 4x) and transmission lines

50
Energy and Area Model: Switch Box

Switch Area Model
Fs: number of switches connected to each wire entering a switch box
f: total flow entering a switch box
Ns: per-bit number of switches inside a switch box
Energy Model
Pu: energy of a single switch
Ps: per-bit switch energy

51
Topology Generation

Candidate topologies are required for MCF
interconnection synthesis
MCF optimizes flow distribution, but not topology
Huge number of different topologies exists
A row of 10 cells has 2^C(10, 2) = 2^45 different connections
A 10x10 FPGA has (2^45)^20 = 2^900 different topologies!
Our assumptions
Each row and column has the same connection
Wire lengths are given (e.g. wire length 1, 2,
4, 8)
A certain wire length repeats itself till the end
of the chip
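
The counting argument can be reproduced directly. The sketch assumes that every pair of cells in a row or column is either connected or not, and reads the exponent 20 as 10 rows plus 10 columns; the function names are illustrative.

```python
import math


def row_connection_count(cells):
    """Number of ways to choose which of the C(cells, 2) cell pairs are wired."""
    return 2 ** math.comb(cells, 2)


def fpga_topology_count(rows, cols):
    """Independent choices for every row and every column of the array."""
    return row_connection_count(cols) ** rows * row_connection_count(rows) ** cols


if __name__ == "__main__":
    print(math.comb(10, 2))                            # pairs of cells in one row
    print(fpga_topology_count(10, 10) == 2 ** 900)     # matches the slide's count
```

A search space of this size is what motivates the restricting assumptions listed above.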

52
Representative Netlist Generation

Properties of Representative Netlist
Matches the size of the benchmark netlists
Geometric Distribution Function
The probability of the distance between two pins decreases exponentially when the distance increases
k: distance between pins
p: probability of distance-1 links
P(k): probability of distance-k links
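
The formula for P(k) is missing from this transcript. The sketch below assumes the standard geometric form P(k) = p(1-p)^(k-1), which is consistent with p being the probability of distance-1 links and with the exponential decay described. The truncation at `max_distance` and the function names are additional assumptions made for illustration.

```python
import random


def link_probability(k, p):
    """Assumed geometric form: probability that a link spans distance k."""
    return p * (1.0 - p) ** (k - 1)


def sample_distance(p, rng, max_distance):
    """Draw a pin-to-pin distance, truncated at the chip dimension."""
    k = 1
    while k < max_distance and rng.random() >= p:
        k += 1
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_distance(0.5, rng, 10) for _ in range(1000)]
    print(sum(1 for s in samples if s == 1) / len(samples))  # should be near p
```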

53
MCF Interconnection Synthesis

Integrate multiple wire styles into the MCF formulation
Notations
Wire style parameter (Pe, Ae), Pe = Pw + Ps
Area Ar: routing area on vertical and horizontal dimensions
dj: communication demand for net j, dj = 1
Flow f(t): flow amount on a Steiner tree t

54
MCF Formulation: Energy Optimization
Objective: minimize energy
Routability constraint
Routing area constraint
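
The constraint equations are not preserved in this transcript. As a rough illustration of the structure (minimize total energy, route every net, stay within the routing area), the sketch below solves a tiny integer version by brute force: each net picks exactly one candidate tree. This is a simplified stand-in, not the fractional multi-commodity flow formulation itself, and the data layout and names are hypothetical.

```python
from itertools import product


def best_routing(candidates, area_limit):
    """Pick one candidate tree per net to minimize energy within an area budget.

    candidates maps each net to a list of (energy, area) tuples, one per tree.
    Returns (total_energy, {net: tree_index}) or None if nothing fits.
    """
    nets = sorted(candidates)
    best = None
    for choice in product(*(range(len(candidates[n])) for n in nets)):
        energy = sum(candidates[n][t][0] for n, t in zip(nets, choice))
        area = sum(candidates[n][t][1] for n, t in zip(nets, choice))
        # Routability holds by construction (one tree per net, demand of 1);
        # the routing area constraint is checked explicitly.
        if area <= area_limit and (best is None or energy < best[0]):
            best = (energy, dict(zip(nets, choice)))
    return best
```

Relaxing the area limit lets more nets move to lower-energy, larger-area trees, which mirrors the energy versus routing area trade-off examined in the experiments.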
55
Experiment Settings

Seven MCNC benchmark circuits
Technology mapped to 4-LUTs, each logic block contains 16 4-LUTs
Size of 10x10 to 11x11 switch boxes, 500-1000 nets
Candidate topologies
Available segment lengths: 1, 2, 4, 8
Total number of candidate topologies: 93

Benchmark   alu4    apex4   diffeq  dsip    ex5p    misex3  tseng
Size        11x11   10x10   11x11   11x11   10x10   11x11   10x10
# of nets   621     798     945     593     745     771     788
56
Energy Optimization: Optimized FPGA Routing Architectures
Routing Area   Energy            Energy Improvement
1500 um        6.46 x 10^3 pJ    (baseline)
2500 um        5.24 x 10^3 pJ    19%
3500 um        4.74 x 10^3 pJ    27%
4500 um        4.63 x 10^3 pJ    28%
Wire types in the figure legend: RC 1x, RC 2x, RC 4x, T-Line 10x
57
Energy Optimization: Impact of Routing Area

Total energy of the 7 benchmarks with optimized
FPGA routing architectures

58
Interconnect Architecture

Wire Directions (M, Y, X, E)
Layout Region (M, D, Y, X)
Power Ground and Clock Distributions
Layer Assignment
Via Arrangement

Comparison

Wire Length
Throughput
Grid vs No-grid

59
1. Wire Directions and Models
60
2. Layout Regions and Models
61
Length of 2-pin nets to extend an area
Shape         Man.    Y-Arch  X-Arch  Euclidean
M Diamond     1.250   1.118   1.066   1.016
Y Hexagon     -       1.101   -       -
X Octagon     -       -       1.055   -
E Circle      1.273   1.103   1.055   1.000
E (worst)     1.414   1.155   1.082   1.000
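
The last two rows of the table follow from plane geometry. If routing is restricted to directions spaced by an angle theta (90 degrees for Manhattan, 60 for Y, 45 for X), a segment at angle phi between two allowed directions has length ratio (sin(phi) + sin(theta - phi)) / sin(theta) relative to the Euclidean distance. The sketch below evaluates the worst case and the average over uniformly distributed angles; the function names are illustrative.

```python
import math


def worst_detour(num_orientations):
    """Worst-case wire length ratio vs. Euclidean, at phi = theta / 2."""
    theta = math.pi / num_orientations
    return 1.0 / math.cos(theta / 2.0)


def average_detour(num_orientations):
    """Average ratio over uniformly distributed directions (closed form)."""
    theta = math.pi / num_orientations
    return (2.0 / theta) * math.tan(theta / 2.0)


if __name__ == "__main__":
    for name, k in (("Manhattan", 2), ("Y-Arch", 3), ("X-Arch", 4)):
        print(name, round(average_detour(k), 3), round(worst_detour(k), 3))
```

Here `num_orientations` is 2 for Manhattan (0/90), 3 for Y (0/60/120), and 4 for X (0/45/90/135).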
62
Throughput: concurrent flow demand
Shape         Manhattan  Y-Arch  X-Arch
M Square      1.000      1.225   1.346
M (Bound)     -          1.241   1.356
M Diamond     1.195      -       -
Y Hexagon     -          1.315   -
X Octagon     -          -       1.420
The ratio of 0-90 planes to 45-135 planes is not fixed.
63
Flow congestion map for uniform 90 Degree meshes
64
Congestion map of square chip using X-architecture
12 by 12
13 by 13
65
Congestion map of square chip using Y-architecture
12 by 12
13 by 13
66
Explanation For Throughput Increasing
Number of lines across the vertical center
cut-line d/D for 90 degree routing
for 45 degree routing
70
Global Grids (Power/Ground Mesh)
Y-Architecture
X-Architecture
(http://www.xinitiative.org/img/062102forum.pdf)
71
3. Clock Tree on Square Mesh

N-level clock tree
Path distance
21% less than H-tree
Total wire length
9% less than H-tree, 3% less than X-tree
No self-overlapping between parallel wire segments

72
4. Layer Assignment
(Figure: routing layers 1-4 with four alternative assignments, I-IV)
Different routing direction assignments
73
Normalized throughput of mixed 45-degree and
90-degree mesh with different routing layer
assignments

74
Why does interleaving Manhattan layers and diagonal layers improve throughput?
Example: from (2,0) to (0,3)
Wirelength 5.0 (Manhattan only)
Wirelength 3.82 (Manhattan plus diagonal)
The shortest path between two points on the plane is always a concatenation of a Manhattan line and a diagonal line.
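
The two wire lengths in the example can be reproduced with the usual distance formulas: a Manhattan-only path costs |dx| + |dy|, while a path that may also use 45-degree segments covers the smaller offset diagonally and the remainder along one axis. The function names are illustrative.

```python
import math


def manhattan_length(dx, dy):
    """Shortest path length using only horizontal and vertical segments."""
    return abs(dx) + abs(dy)


def octilinear_length(dx, dy):
    """Shortest path length using one diagonal run plus one Manhattan run."""
    lo, hi = sorted((abs(dx), abs(dy)))
    return (hi - lo) + math.sqrt(2.0) * lo


if __name__ == "__main__":
    dx, dy = 0 - 2, 3 - 0  # from (2,0) to (0,3)
    print(manhattan_length(dx, dy), round(octilinear_length(dx, dy), 3))
```

Because such a path needs both a Manhattan layer and a diagonal layer, placing those layers next to each other keeps the connecting via short, which is one way to read the benefit of interleaving.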
75
Observations

Routing Direction Assignment Strategies Can
Affect the Communication Throughput.
Interleaving the Manhattan Routing Layers and
Diagonal Routing Layers can produce better
Throughput

76
5. Via Arrangement Banks and Tunnels

Use tunnels to detour around vias
Use banks of tunnels to maximize the throughput
Use bottom k layers to perform intra-cell routing
Use top n-k layers to distribute signals to the
banks

77
Via-Oriented Interconnect Planning
78
Via-Oriented Interconnect Planning
tunnel
79
Via-Oriented Interconnect Planning
Bank of tunnels
k2 overhead
Full bandwidth
# vias: kL; Overhead: k2 vertical tracks; L: dimension of the bank
80
Tunnel of Y Arch.
Blocking 5 tracks on the layer of 60-degree
direction
81
Tunnels of Y Arch.
82
3.2 Via-Oriented Interconnect Planning
# vias: c1kL
Bank of tunnels
Overhead: kc2 tracks
83
Conclusion