Title: On-Chip Interconnect Trend and Design Optimization
1On-Chip Interconnect Trend and Design Optimization
- Chung-Kuan Cheng
- UC San Diego, La Jolla, CA
2Outlines
- Global Interconnect Technologies
- RC Trees and Transmission Lines
- Prefix Adder Synthesis
- Modeling
- FPGA Interconnect Architecture
- Modeling
- Interconnect Architecture
- Non-Manhattan Wire Arrangement
3Interconnect Technologies
- Introduction
- On-Chip Global Interconnection
- Global Wire Modeling
- Performance Comparison
4Introduction Performance Impact
- Interconnect delay determines the system performance [ITRS08]
- 542ps for a 1mm minimum-pitch Cu global wire w/o repeaters @ 45nm
- 150ps for a 10-level FO4 delay @ 45nm
[Ho2001] Future of Wires
5Introduction Power Dissipation
- Interconnects consume a significant portion of power
- 1-2 orders of magnitude larger than gates
- Half of the dynamic power is dissipated on repeaters to minimize latency [Zhang07]
- Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
- About 1/3 is burned on the global wires.
6Introduction Technology Trend
- On-Chip Interconnect Scaling
- Dimension shrinks
- Wire resistance increases -> RC delay
- Increasing capacitive coupling -> delay, power, noise, etc.
- Performance of global wires decreases w/ technology scaling.
Wire Category   Parameter      90nm    45nm    22nm
M1 Wire         Rw (kohm/mm)   1.914   8.860   34.827
M1 Wire         Cw (pF/mm)     0.183   0.157   0.129
Global Wire     Rw (kohm/mm)   0.532   2.970   11.000
Global Wire     Cw (pF/mm)     0.205   0.179   0.151
Scaling trend of per-unit-length (PUL) wire resistance and capacitance
Copper resistivity versus wire width
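As a rough illustration of the scaling trend, the per-unit-length values of the global wire in the table can be multiplied to get the R*C product of an unrepeated 1mm wire. This is a minimal sketch: the helper name `rc_product_ps` is hypothetical, and the plain R*C product is only a first-order figure of merit, not necessarily the delay model behind the numbers quoted in the talk.

```python
# Per-unit-length global-wire parameters copied from the table
# (resistance in kohm/mm, capacitance in pF/mm).
GLOBAL_WIRE = {
    "90nm": (0.532, 0.205),
    "45nm": (2.970, 0.179),
    "22nm": (11.000, 0.151),
}


def rc_product_ps(r_kohm_per_mm, c_pf_per_mm, length_mm=1.0):
    """Return R*C of an unrepeated wire in ps (kohm * pF = ns)."""
    r_total = r_kohm_per_mm * length_mm
    c_total = c_pf_per_mm * length_mm
    return r_total * c_total * 1000.0  # convert ns to ps


if __name__ == "__main__":
    for node, (r, c) in GLOBAL_WIRE.items():
        print(f"{node}: RC = {rc_product_ps(r, c):.0f} ps for 1 mm")
```

Because both total resistance and total capacitance grow with length, the product grows quadratically with wire length, which is why long global wires are the ones that need repeaters or T-lines.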
7Organization of On-Chip Global Interconnections
8Multi-Dimensional Design Consideration
- Preliminary analysis results assuming 65nm CMOS process.
- Application-oriented choice
- Low Latency
- T-TL or UT-TL -> Single-Ended T-lines
- High Throughput
- R-RC
- Low Power
- PE-TL or UE-TL -> Differential T-lines
- Low Noise
- PE-TL or UE-TL -> Differential T-lines
- Low Area/Cost
- R-RC
For each architecture, the more area the pentagon covers, the better the overall performance.
9On-Chip Global Interconnect Schemes (1)
- R-RC structure
- Repeater size/Length of segments
- Adopt previous design methodology [Zhang07]
- UT-TL structure
- Full swing at wire-end
- Tapered inverter chain as TX
- T-TL structure
- Optimize eye-height at wire-end
- Non-Tapered inverter chain as TX
Repeated RC wires (R-RC)
Un-Terminated and Terminated T-Line (UT-TL and
T-TL)
10On-Chip Global Interconnect Schemes (2)
Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
- Driver side: tapered differential driver
- Receiver side: termination resistance, Sense-Amplifier (SA) inverter chain
- Passive equalizer: parallel RC network
- Design constraint: enough eye-opening (50mV) needed at the wire-end
11Effects of driver impedance and termination resistance on step response
- Larger driver impedance leads to a slower rising edge and lower saturation voltage
- Larger termination resistance causes a sharper rising edge but with larger reflection
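Two of the effects listed here can be sketched with textbook formulas for an ideal line with a resistive termination. The characteristic impedance, the wire resistance and all function names below are assumptions for illustration; the slides do not give them, and the rise-time effect is not modeled.

```python
def reflection_coefficient(r_load, z0):
    """Voltage reflection coefficient at a resistive termination."""
    return (r_load - z0) / (r_load + z0)


def dc_level(vdd, r_driver, r_wire, r_load):
    """Final (saturation) voltage at the load, from the resistive divider."""
    return vdd * r_load / (r_driver + r_wire + r_load)


if __name__ == "__main__":
    z0 = 50.0  # assumed characteristic impedance, not taken from the slides
    for r_load in (50.0, 100.0, 200.0):
        print("RL =", r_load, "gamma =", round(reflection_coefficient(r_load, z0), 3))
    for r_driver in (10.0, 50.0, 100.0):
        # 20 ohm wire resistance and 100 ohm termination are hypothetical values
        print("Rd =", r_driver, "V_final =", round(dc_level(1.0, r_driver, 20.0, 100.0), 3))
```

A termination above the line impedance gives a positive reflection that grows with the termination resistance, while a larger driver resistance lowers the final voltage through the divider.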
12Bit-rate 50Gbps: Rs = 11.06ohm, Rd = 350ohm, Cd = 0.38pF, RL = 107.69ohm
13Global Wire Modeling Single-Ended and Differential On-Chip T-lines
- Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
- Top-layer thick wires used -> dimension is maintained as technology scales.
- LC-mode behavior dominant
Determine the bit rate
- Smallest wire dimensions that satisfy eye constraint
- Notice PE-TL needs narrower wire -> Equalization helps to increase density.
14Global Wire Modeling RC wires and T-lines
- RC wire modeling
- Distributed π model composed of wire resistance and capacitance
- Closed-form equations [Sim03] to calculate 2D wire capacitance
- T-line 2D-R(f)L(f)C parameter extraction
- T-line Modeling
- R(f)L(f)C tabular model -> transient simulation to estimate eye-height.
- Synthesized compact circuit model [Kopcsay02] -> study signal integrity issues.
2D-C Extraction Template
2D-R(f)L(f) Extraction Template
15Performance Analysis Definitions
- Normalized delay (unit ps/mm)
- Propagation delay includes wire delay and gate delay.
- Normalized energy per bit (unit pJ/m)
- Bit rate is assumed to be the inverse of propagation delay for RC wires
- Normalized throughput (unit Gbps/um)
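The three metrics can be written down as simple ratios. The sketch below is one plausible reading of the definitions, since the original formulas are not in the transcript: delay is divided by wire length, throughput is bit rate divided by wire pitch, and energy per bit is power divided by bit rate and then by length. All function names and the example numbers are hypothetical.

```python
def normalized_delay_ps_per_mm(delay_ps, length_mm):
    """Propagation delay (wire + gate) per unit wire length."""
    return delay_ps / length_mm


def rc_bit_rate_gbps(delay_ps):
    """Bit rate taken as the inverse of propagation delay (1/ps = 1000 Gbps)."""
    return 1000.0 / delay_ps


def normalized_throughput_gbps_per_um(bit_rate_gbps, pitch_um):
    """Bit rate per unit of wire pitch (assumed normalization)."""
    return bit_rate_gbps / pitch_um


def normalized_energy_pj_per_m(power_mw, bit_rate_gbps, length_mm):
    """Energy per bit (mW / Gbps = pJ) divided by the wire length in metres."""
    energy_pj = power_mw / bit_rate_gbps
    return energy_pj / (length_mm * 1e-3)


if __name__ == "__main__":
    # Hypothetical wire: 250 ps over 5 mm, 0.4 um pitch, 2 mW.
    delay_ps, length_mm, pitch_um, power_mw = 250.0, 5.0, 0.4, 2.0
    rate = rc_bit_rate_gbps(delay_ps)
    print(normalized_delay_ps_per_mm(delay_ps, length_mm), "ps/mm")
    print(normalized_throughput_gbps_per_um(rate, pitch_um), "Gbps/um")
    print(normalized_energy_pj_per_m(power_mw, rate, length_mm), "pJ/m")
```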
16Performance Analysis Latency
- Variables technology-defined parameters
- Supply voltage Vdd (unit V)
- Dielectric constant
- Min-sized inverter FO4 delay (unit ps)
- R-RC structure (min-d)
- is roughly constant
- FO4 delay scales w/ scaling factor S
- T-line structures
- Sum of wire delay and TX delay
- Wire delay
- TX delay improved w/ FO4 delay
Decreasing w/ technology scaling!
Increasing w/ technology scaling!
17Performance Analysis Energy per Bit
- Same variables defined before
Constant !
- R-RC structure (min-d)
- Vdd reduces as technology scales
- reduces as technology scales
- T-line structures
- Sum of power consumed on wire and TX.
- Power of T-line
- Power of TX circuit
- FO4 delay reduces exponentially
Energy decreases w/ technology scaling!
Energy decreases w/ larger slope!!
18Performance Analysis Throughput
- Same variables defined before
- R-RC structure (min-d)
- Assuming wire pitch
- FO4 delay reduces exponentially
- T-line structures
- TX bandwidth
- Neglect the minor change of wire pitch
- K1 = 0 for UT-TL
- FO4 delay reduces exponentially
Throughput increases by 20% per generation!
Throughput increases by 43% per generation!!
19Design Framework for On-Chip T-line Schemes
- Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure.
- Different optimization routines (LP/ILP/SQP, etc.) can be adopted according to the problem formulation.
20Experimental Settings
- Design objective: min-d
- Technology nodes: 90nm-22nm
- Five different global interconnection structures
- Wire length: 5mm
- Parameter extraction
- 2D field solver CZ2D from EIP tool suite of IBM
- Tabular model or synthesized model
- Transistor models
- Predictive transistor model from [Uemura06]
- Synopsys level 3 MOSFET model tuned according to ITRS roadmap
- Simulation
- HSPICE 2005
- Modeling and Optimization
- Linear or non-linear regression/SQP routine
- MATLAB 2007
21Performance Metric Normalized Delay Results and Comparison
- Technology trends
- R-RC ↑
- T-line schemes ↓
- T-line structures
- Outperform R-RC beyond 90nm
- Single-ended: lowest delay
- At 22nm node
- R-RC: 55ps/mm
- T-lines: 8ps/mm (85% reduction)
- Speed of light: 5ps/mm
- Linear model
- < 6% average error
22Performance Metric Normalized Energy per Bit Results and Comparison
- Technology trends
- R-RC and T-lines ↓
- T-lines reduce more quickly
- T-line structures
- Outperform R-RC beyond 45nm
- Differential: lowest energy.
- Single-ended: similar to R-RC.
- T-TL > UT-TL
- At 22nm node
- R-RC: 100pJ/m
- Single-ended: 60% reduction
- Differential: 96% reduction
- Linear model
- < 12% average error
- Error for T-TL and PE-TL
- RL and passive equalizers.
23Performance Metric Normalized Throughput Results and Comparison
- Technology trends
- R-RC and T-lines ↑
- T-lines increase more quickly
- T-line structures
- Outperform R-RC beyond 32nm
- Differential better than single-ended
- At 22nm node
- R-RC: 12Gbps/um
- T-TL: 30% improvement
- UE-TL: 75% improvement
- PE-TL: 2X of R-RC
- Linear model
- < 7% average error
24Signal Integrity single-ended T-lines
Worst-case switching pattern for peak noise simulation
Using w.c. pattern
Using single or multiple PRBS patterns
- UT-TL structure
- 380mV peak noise at 1V supply voltage w/ 7ps rise time
- SI could be a big issue as supply voltage drops
- T-TL less sensitive to noise
- At the same rise time, 50% reduction of peak noise
- Peak noise ? as technology scales
25Signal Integrity differential T-lines
Worst-case switching pattern for peak noise
simulation
- More reliable
- Termination resistance
- Common-mode noise reduction
- Peak noise
- Within 10mV range
- Eye-Heights
- UE-TL
- Eye reduces as bit rate ↑
- Harder to meet constraint.
- PE-TL
- > 70mV eye even at 22nm node
- Equalization does help!
26Summary (cont)
Item in the table: score/value. Score: the higher, the better in terms of the given metric; max. score is 5. The best structure in each column is marked in red on the slide.
Low-Latency Application (ps/mm)
Schemes  90nm   65nm   45nm   32nm   22nm
R-RC     3/35   1/42   1/46   1/55   1/55
UT-TL    5/15   5/13   5/10   5/9    5/8
T-TL     5/15   5/13   5/10   5/9    5/8
UE-TL    1/37   3/25   3/16   3/12   5/8
PE-TL    1/37   3/25   3/16   3/12   5/8
Low-Energy Application (pJ/m)
Schemes  90nm   65nm   45nm   32nm   22nm
R-RC     2/150  2/140  1/130  1/100  1/100
UT-TL    3/140  3/110  3/70   3/50   2/40
T-TL     1/260  1/200  2/100  2/60   3/40
UE-TL    4/60   4/36   4/20   4/10   5/4
PE-TL    5/26   5/16   5/8    5/5    5/2
Low-Noise Application (score only)
Schemes  90nm   65nm   45nm   32nm   22nm
R-RC     1      1      1      1      1
UT-TL    1      1      1      1      1
T-TL     3      3      3      3      3
UE-TL    5      5      4      4      4
PE-TL    4      4      5      5      5
High-Throughput Application (Gbps/um)
Schemes  90nm   65nm   45nm   32nm   22nm
R-RC     5/5    5/6    3/8    3/10   2/12
UT-TL    2/3.3  1/3.3  1/3.3  1/3.3  1/3.3
T-TL     1/3    2/3.4  2/6    2/9    3/16
UE-TL    3/3    3/5    4/9    4/13   4/21
PE-TL    4/4    4/5.3  5/9    5/15   5/24
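The 22nm values in these tables can be cross-checked against the percentages quoted on the results slides. The sketch below only redoes that arithmetic with the values copied from the tables; the dictionary and function names are hypothetical.

```python
# 22nm values copied from the summary tables (the value part of score/value).
DELAY_PS_PER_MM = {"R-RC": 55, "UT-TL": 8, "T-TL": 8, "UE-TL": 8, "PE-TL": 8}
ENERGY_PJ_PER_M = {"R-RC": 100, "UT-TL": 40, "T-TL": 40, "UE-TL": 4, "PE-TL": 2}
THROUGHPUT_GBPS_PER_UM = {"R-RC": 12, "UT-TL": 3.3, "T-TL": 16, "UE-TL": 21, "PE-TL": 24}


def reduction_pct(baseline, value):
    """Percent reduction of value relative to baseline."""
    return 100.0 * (baseline - value) / baseline


def improvement_pct(baseline, value):
    """Percent increase of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    for name in ("UT-TL", "T-TL", "UE-TL", "PE-TL"):
        d = reduction_pct(DELAY_PS_PER_MM["R-RC"], DELAY_PS_PER_MM[name])
        e = reduction_pct(ENERGY_PJ_PER_M["R-RC"], ENERGY_PJ_PER_M[name])
        t = improvement_pct(THROUGHPUT_GBPS_PER_UM["R-RC"], THROUGHPUT_GBPS_PER_UM[name])
        print(f"{name}: delay -{d:.0f}%, energy -{e:.0f}%, throughput {t:+.0f}%")
```

The delay and energy reductions reproduce the quoted 85%, 60% and 96%; the T-TL throughput gain comes out at about 33%, slightly above the rounded 30% on the throughput slide.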
27Summary of Global Interconnect
- Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
- A simple linear model provided to link
- Architecture-level performance metrics
- Technology-defined parameters
- Some observations from experimental results
- T-line structures have potential to replace R-RC at future nodes
- Differential T-lines are better than single-ended
- Low-power/High-throughput/Low-noise
- Equalization could be utilized for on-chip global interconnection
- Higher throughput density, improved signal integrity
- Even w/ lower energy dissipation (passive equalizations)
28Prefix Adder Synthesis
- Motivation
- Prefix Adder Formulation
- Area/Timing/Power Models
- Mixed-Radix (2,3,4) Adders
- ILP Formulation
- Experimental Results
29Motivation Prefix Adder
- Increasing impact of physical design and concern of power.
Logical Levels
Fanouts
Wire Tracks
30Prefix Adder Formulation
- Input: two n-bit binary numbers a and b, one-bit carry-in
- Output: n-bit sum s and one-bit carry-out
- Prefix addition: carry generation and propagation
31Prefix Addition Formulation
Pre-processing
Prefix Computation
Post-processing
32Prefix Adder Prefix Structure Graph
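Figure: prefix structure graph in three stages.
- Pre-processing: gp generator (inputs ai, bi; output gpi)
- Prefix computation: GP cell (inputs GP[i, j] and GP[j-1, k]; output GP[i, k])
- Post-processing: sum generator (inputs G[i, 0] and pi; output si)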
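The three stages can be sketched in a few lines of code. The version below uses a Kogge-Stone style radix-2 prefix network purely as a familiar example of a prefix structure; it is not one of the ILP-synthesized structures from the talk, and the function names are hypothetical.

```python
def gp_combine(high, low):
    """GP cell: merge (G, P) of a higher bit range with the adjacent lower one."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a, b, n, carry_in=0):
    """Add two n-bit numbers; returns (n-bit sum, carry-out)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    # Pre-processing: per-bit propagate and generate signals.
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    g = [x & y for x, y in zip(a_bits, b_bits)]

    # Fold the carry-in into bit 0 so that G over bits i..0 is the carry out of bit i.
    gp = [(g[i], p[i]) for i in range(n)]
    gp[0] = (g[0] | (p[0] & carry_in), 0)

    # Prefix computation: after the loop, gp[i] covers bits i..0.
    dist = 1
    while dist < n:
        gp = [gp_combine(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(n)]
        dist *= 2

    # Post-processing: sum bit i is p[i] XOR the carry into bit i.
    carries = [carry_in] + [gp[i][0] for i in range(n)]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[n]


if __name__ == "__main__":
    print(prefix_add(0xFFFF, 1, 16))
    print(prefix_add(1234, 4321, 16))
```

Other prefix structures differ only in which pairs of ranges the GP cells combine at each level, which is the degree of freedom the synthesis explores in terms of logical levels, fanouts and wire tracks.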
33Area Model
- Distinguish physical placement from logical
structure, but keep the bit-slice structure.
Figure: logical view (bit position 1-8 vs. logical level) and physical view (bit position 1-8 vs. physical level) with compact placement.
34Timing Model
- Delay = Effort Delay + Intrinsic Delay
- Effort Delay = Logical Effort x Electrical Effort
- Electrical Effort = Cout/Cin = (fanouts + wirelength) / size
- Logical Effort and Intrinsic Delay: intrinsic properties of the cell
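A minimal sketch of this timing model, assuming the usual logical-effort form (delay = logical effort x electrical effort + intrinsic delay) and treating fanout and wire length as capacitance contributions in the same unit. The function names and the example numbers are hypothetical.

```python
def electrical_effort(fanout_cap, wire_cap, size):
    """Cout/Cin, with Cout = fanout load + wire load and Cin proportional to size."""
    return (fanout_cap + wire_cap) / size


def cell_delay(logical_effort, intrinsic_delay, fanout_cap, wire_cap, size):
    """Delay = logical effort * electrical effort + intrinsic delay."""
    effort_delay = logical_effort * electrical_effort(fanout_cap, wire_cap, size)
    return effort_delay + intrinsic_delay


if __name__ == "__main__":
    # Hypothetical GP cell: logical effort 1.5, intrinsic delay 2.0,
    # fanout load 3, wire load 1, evaluated at two cell sizes.
    for size in (1.0, 2.0):
        print(size, cell_delay(1.5, 2.0, 3.0, 1.0, size))
```

In this form both fanout and wire length enter the delay linearly, which is what lets the arrival times be expressed as linear constraints.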
35Power Model
- Total power consumption = Dynamic power + Static power
- Static power: leakage current of device
- Psta: determined by the number of cells
- Dynamic power: current switching capacitance
- Pdyn: determined by the switching probability and Cload
- The switching probability depends on j (j is the logical level)
Vanichayobon S., et al., Power-speed Trade-off in Parallel Prefix Circuits
36Interval Adjacency Constraint
(column id, logic level)
37Linearization for Interval Adjacency Constraint
Left interval bound equal to column index
Linearize
Pseudo Linear
38ILP Formulation Overview
- Structure variables
- GP cells
- Connections (wires)
- Physical positions
- Capacitance variables
- Gate cap
- Vertical wire cap
- Horizontal wire cap
- Timing variables
- Input arrival time
- Output arrival time
Flow: ILP with power objective -> ILOG CPLEX -> optimal solution
39Experiments 16-bit Uniform Timing
40Experiments 16-bit Uniform Timing
41Min-Power Radix-2 Adder (delay 22, power 45.5FO4)
Figure: 16-bit prefix structure over bit positions 1-16.
42Min-Power Radix-2&4 Adder (delay 18, power 29.75FO4)
Figure: 16-bit prefix structure over bit positions 1-16. Legend: Radix-2 Cell, Radix-4 Cell.
43Min-Power Mixed-Radix Adder (delay 20, power 28.0FO4)
Figure: 16-bit prefix structure over bit positions 1-16. Legend: Radix-2 Cell, Radix-3 Cell, Radix-4 Cell.
44Experiments 64-bit Hierarchical Structure
(Mixed-Radix)
- Handle high bit-width applications
- 16x4 and 8x8
45FPGA Global Routing Architecture
- Synthesis Flow
- Formulation
- Experimental Results
46Synthesis Flow
47Formulation
48FPGA Global Routing Architecture
49Energy Model Wires
- 0.18um tech node, grid length 0.5mm
- 4 types of wires: RC wires with different spacing (1x, 2x, 4x) and transmission lines
50Energy and Area Model Switch Box
- Switch Area Model
- Fs: number of switches connected to each wire entering a switch box
- f: total flow entering a switch box
- Ns: per-bit number of switches inside a switch box
- Energy Model
- Pu: energy of a single switch
- Ps: per-bit switch energy
51Topology Generation
- Candidate topologies are required for MCF interconnection synthesis
- MCF optimizes flow distribution, but not topology
- Huge number of different topologies exists
- A row of 10 cells has 2^C(10, 2) = 2^45 different connections
- A 10x10 FPGA has (2^45)^20 = 2^900 different topologies!
- Our assumptions
- Each row and column has the same connection
- Wire lengths are given (e.g. wire length 1, 2, 4, 8)
- A certain wire length repeats itself till the end of the chip
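The topology count can be checked with exact integer arithmetic. This sketch assumes, as the counts on the slide suggest, that every one of the C(10, 2) cell pairs in a row may or may not be connected and that the 10 rows and 10 columns are chosen independently; the function names are hypothetical.

```python
from math import comb


def row_connection_sets(cells):
    """Number of subsets of the C(cells, 2) possible cell-to-cell connections."""
    return 2 ** comb(cells, 2)


def chip_topologies(rows, cols):
    """Independent choice for each of the rows and each of the columns."""
    return row_connection_sets(cols) ** rows * row_connection_sets(rows) ** cols


if __name__ == "__main__":
    print(comb(10, 2))                           # pairs of cells in one row
    print(row_connection_sets(10) == 2 ** 45)    # connections per row
    print(chip_topologies(10, 10) == 2 ** 900)   # 20 rows and columns in total
```

Restricting every row and column to the same connection pattern built from a few repeating wire lengths is what shrinks this space to a candidate set small enough to enumerate.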
52Representative Netlist Generation
- Properties of Representative Netlist
- Matches the size of the benchmark netlists
- Geometry Distribution Function
- The probability of a link between two pins decreases exponentially as their distance increases
- k: distance between pins
- p: probability of distance-1 links
- P(k): probability of distance-k links
53MCF Interconnection Synthesis
- Integrate multiple wire styles to MCF formulation
- Notations
- Wire style parameter: (Pe, Ae), Pe = Pw + Ps
- Area Ar: routing area on vertical and horizontal dimension
- dj: communication demand for net j, dj = 1
- Flow f(t): flow amount on a Steiner tree t
54MCF Formulation Energy Optimization
Obj Min Energy
Routability constr.
Routing Area constr.
55Experiment Settings
- Seven of the MCNC benchmark circuits
- Technology mapped to 4-LUTs, each logic block contains 16 4-LUTs
- Size of 10x10 to 11x11 switch boxes, 500-1000 nets
- Candidate topologies
- Available segment length: 1, 2, 4, 8
- Total number of candidate topologies: 93
Circuit    alu4   apex4  diffeq  dsip   ex5p   misex3  tseng
Size       11x11  10x10  11x11   11x11  10x10  11x11   10x10
# of nets  621    798    945     593    745    771     788
56Energy Optimization Optimized FPGA Routing Architectures
Wire types: RC 1x, RC 2x, RC 4x, T-Line 10x
Routing Area   Energy            Energy Impv.
1500 µm        6.46 x 10^3 pJ    -
2500 µm        5.24 x 10^3 pJ    19%
3500 µm        4.74 x 10^3 pJ    27%
4500 µm        4.63 x 10^3 pJ    28%
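The improvement figures can be reproduced from the listed energies. The sketch assumes the four energies pair with the four routing areas in the order listed and that each improvement is measured against the smallest routing area; the computed percentages match the quoted 19, 27 and 28, which supports that reading. Names are hypothetical.

```python
# Energy in pJ for each routing area, in the order listed on the slide.
ENERGY_PJ = {1500: 6.46e3, 2500: 5.24e3, 3500: 4.74e3, 4500: 4.63e3}


def improvement_over_smallest_area(energy_by_area):
    """Percent energy reduction of each area relative to the smallest area."""
    base_area = min(energy_by_area)
    base = energy_by_area[base_area]
    return {area: 100.0 * (base - energy) / base
            for area, energy in energy_by_area.items()
            if area != base_area}


if __name__ == "__main__":
    for area, pct in improvement_over_smallest_area(ENERGY_PJ).items():
        print(f"routing area {area}: {pct:.1f}% less energy")
```

The gain flattens out between the two largest areas, so most of the energy benefit of the extra routing area is already obtained at the intermediate budgets.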
57Energy Optimization Impact of Routing Area
- Total energy of the 7 benchmarks with optimized
FPGA routing architectures
58Interconnect Architecture
- Wire Directions (M, Y, X, E)
- Layout Region (M, D, Y, X)
- Power Ground and Clock Distributions
- Layer Assignment
- Via Arrangement
Comparison
- Wire Length
- Throughput
- Grid vs No-grid
59 1. Wire Directions and Models
60 2. Layout Regions and Models
61Length of 2 pin-nets to extend an area
Length Shape Man. Y-Arch X-Arch Euclidean
M Diamond 1.250 1.118 1.066 1.016
Y Hexagon 1.101
X Octagon 1.055
E Circle 1.273 1.103 1.055 1.000
E (worst) 1.414 1.155 1.082 1.000
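The "E Circle" and "E (worst)" rows are consistent with a simple geometric argument, sketched here under stated assumptions: pin-to-pin directions are uniformly distributed, each connection uses the two allowed routing directions adjacent to its own direction, and adjacent routing directions are 90, 60 and 45 degrees apart for the Manhattan, Y and X architectures. A segment at angle t from one routing direction then has routed length cos(phi/2 - t) / cos(phi/2) times its Euclidean length, where phi is the spacing. The function names are hypothetical.

```python
import math


def average_ratio(spacing_deg):
    """Mean routed/Euclidean length over uniformly distributed directions."""
    phi = math.radians(spacing_deg)
    return (2.0 / phi) * math.tan(phi / 2.0)


def worst_ratio(spacing_deg):
    """Worst case: the target lies halfway between two routing directions."""
    return 1.0 / math.cos(math.radians(spacing_deg) / 2.0)


ARCHS = {"Manhattan": 90.0, "Y-Arch": 60.0, "X-Arch": 45.0}

if __name__ == "__main__":
    for name, spacing in ARCHS.items():
        print(f"{name}: average {average_ratio(spacing):.3f}, "
              f"worst {worst_ratio(spacing):.3f}")
```

The average is the integral of the length ratio over one angular sector divided by the sector width, which evaluates to (2/phi) tan(phi/2).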
62Throughput concurrent flow demand
Throughput Shape Manhattan Y-Arch X-Arch
M Square 1.000 1.225 1.346
M (Bound) 1.241 1.356
M Diamond 1.195
Y Hexagon 1.315
X Octagon 1.420
Ratio of 0-90 planes and 45-135 planes is not fixed
63Flow congestion map for uniform 90 Degree meshes
64Congestion map of square chip using X-architecture
12 by 12
13 by 13
65Congestion map of square chip using Y-architecture
12 by 12
13 by 13
66Explanation For Throughput Increasing
Number of lines across the vertical center
cut-line d/D for 90 degree routing
for 45 degree routing
70Global Grids (Power/Ground Mesh)
Y-Architecture
X-Architecture
(http://www.xinitiative.org/img/062102forum.pdf)
71 3. Clock Tree on Square Mesh
- N-level clock tree
- path distance
- 21% less than H-tree
- total wire length
- 9% less than H-tree, 3% less than X-tree
- No self-overlapping between parallel wire segments
72 4. Layer Assignment
Figure: four routing layers (Layer 1 to Layer 4) and four assignments (I to IV) of different routing directions to the layers.
73Normalized throughput of mixed 45-degree and
90-degree mesh with different routing layer
assignments
74Why interleaving Manhattan Layer and Diagonal
Layer Improves Throughput?
(0,3)
Wirelength 3.82
Wirelength 5.0
(2,0)
The shortest path between two points on the plane is always a concatenation of a Manhattan line and a diagonal line.
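The two wirelengths in the figure follow directly from this observation. A minimal sketch, assuming 45-degree diagonals and the two labelled points (0,3) and (2,0) as endpoints; the function names are hypothetical.

```python
import math


def manhattan_length(p, q):
    """Wirelength using horizontal and vertical segments only."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    return dx + dy


def octilinear_length(p, q):
    """Wirelength using one 45-degree diagonal run plus one Manhattan run."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    diagonal = math.sqrt(2.0) * min(dx, dy)
    straight = max(dx, dy) - min(dx, dy)
    return diagonal + straight


if __name__ == "__main__":
    a, b = (0, 3), (2, 0)
    print(manhattan_length(a, b))             # Manhattan-only route
    print(round(octilinear_length(a, b), 3))  # diagonal + Manhattan route
```

For these two points the diagonal run covers the smaller of the two offsets and the Manhattan run covers the remainder, giving about 3.83 against 5.0 for the Manhattan-only route.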
75Observations
- Routing direction assignment strategies can affect the communication throughput.
- Interleaving the Manhattan routing layers and diagonal routing layers can produce better throughput.
76 5. Via Arrangement Banks and Tunnels
- Use tunnels to detour around vias
- Use banks of tunnels to maximize the throughput
- Use bottom k layers to perform intra-cell routing
- Use top n-k layers to distribute signals to the
banks
77Via-Oriented Interconnect Planning
78Via-Oriented Interconnect Planning
tunnel
79Via-Oriented Interconnect Planning
Bank of tunnels
k2 overhead
Full bandwidth
# of vias: kL; Overhead: k2 vertical tracks; L: dimension of the bank
80Tunnel of Y Arch.
Blocking 5 tracks on the layer of 60-degree
direction
81Tunnels of Y Arch.
82 3.2 Via-Oriented Interconnect Planning
vias c1kL
Bank of tunnels
Overhead kc2 tracks
83Conclusion
- Global Interconnect Technologies
- EM waves Devices
- Prefix Adder Synthesis
- Formulation ILP
- FPGA Interconnect Architecture
- Formulation LP
- Interconnect Architecture
- Lambda Geometry Vias