Title: An Efficient Surface-Based Low-Power Buffer Insertion Algorithm
1An Efficient Surface-Based Low-Power Buffer Insertion Algorithm
- Rajeev R. Rao, David Blaauw, Dennis Sylvester: Department of EECS, University of Michigan, Ann Arbor, MI
- Charles Alpert, Sani Nassif: IBM Austin Research Laboratory, Austin, TX
- {rrrao, blaauw, dennis}@eecs.umich.edu, {alpert, nassif}@us.ibm.com
2Interconnect Trends
- Interconnect power is a major issue
  - Huge power consumption in both global and local signal nets
- Repeater counts are increasing drastically
  - IBM: 50% of leakage is in inverters/buffers
- Assuming continuation of current design styles, projections for the 32nm technology node are dramatic
  - 70% of cell count will be repeaters
  - 65-80% of dynamic power will be due to interconnects
- Leakage is increasing exponentially
- Required: optimal repeater usage with the objective of total power minimization
3Outline
- Introduction
- Delay and Buffer models
- Previous Work
- Proposed Algorithm
- Library characterization
- Generation of different types of candidates
- Merging, Propagation, Snapping
- Results
- Conclusion
4Introduction
- Wire RC delay is a quadratic function of wire length
  - Segmenting wires decreases delay
- The same idea applies to interconnect tree structures
  - Buffers are inserted for delay management
- Additional benefit: buffers/inverters decouple large output loads
Figure: a driver, a repeater and a receiver connected by two wire segments of length 1 each. Wire length = 2; wire delay ∝ (1)^2 + (1)^2 = 2.
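The arithmetic in the figure can be checked with a small sketch. It assumes an idealized model in which wire delay is simply proportional to the square of the segment length and the repeater itself adds no delay; the function names and the unit resistance/capacitance values are hypothetical.

```python
def wire_delay(length, r_per_unit=1.0, c_per_unit=1.0):
    """Delay of one unbuffered wire, taken as proportional to length squared."""
    return r_per_unit * c_per_unit * length ** 2


def segmented_delay(length, num_segments, r_per_unit=1.0, c_per_unit=1.0):
    """Total wire delay when ideal (zero-delay) repeaters split the wire
    into equal segments. Repeater delay is ignored in this sketch."""
    segment = length / num_segments
    return num_segments * wire_delay(segment, r_per_unit, c_per_unit)


unsegmented = wire_delay(2.0)            # 2^2 = 4
with_repeater = segmented_delay(2.0, 2)  # 1^2 + 1^2 = 2
print(unsegmented, with_repeater)
```

In practice the repeater adds its own delay, so the benefit only appears once the wire is long enough for the quadratic term to dominate.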
5Elmore Delay model
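- Represent the interconnect tree with a lumped RC model
- Assume the binary tree topology is fixed by an initial Steiner tree estimation
  - n vertices (branch points) and (n-1) edges (i.e., wires)
- For a wire e connecting vertices (u, v), the Elmore delay in its standard form is D(e) = Re * (Ce/2 + C(T(v)))
  - Re and Ce are the resistance and capacitance of wire e
  - T(v) is the maximal subtree rooted at v that does not contain buffers, and C(T(v)) is its total capacitance
- The total delay from a vertex v to a sink node si is the sum of D(e) over all wires e on the path from v to si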
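A minimal sketch of the Elmore computation on an unbuffered tree, assuming the standard form in which each wire contributes R * (C/2 + downstream capacitance). The dictionary-based tree representation and all names are illustrative choices, not taken from the slides.

```python
def downstream_cap(node, children, wire, sink_cap):
    """Total capacitance of the unbuffered subtree rooted at `node`."""
    total = sink_cap.get(node, 0.0)
    for child in children.get(node, []):
        _, c_e = wire[(node, child)]
        total += c_e + downstream_cap(child, children, wire, sink_cap)
    return total


def elmore_delay(path, children, wire, sink_cap):
    """Sum of per-wire Elmore delays along `path` (a list of vertices)."""
    delay = 0.0
    for u, v in zip(path, path[1:]):
        r_e, c_e = wire[(u, v)]
        delay += r_e * (c_e / 2.0 + downstream_cap(v, children, wire, sink_cap))
    return delay


# Hypothetical three-wire tree: src -> a, then a -> s1 and a -> s2.
children = {"src": ["a"], "a": ["s1", "s2"]}
wire = {("src", "a"): (1.0, 2.0), ("a", "s1"): (2.0, 2.0), ("a", "s2"): (1.0, 4.0)}
sink_cap = {"s1": 1.0, "s2": 3.0}

print(elmore_delay(["src", "a", "s1"], children, wire, sink_cap))  # 15.0
print(elmore_delay(["src", "a", "s2"], children, wire, sink_cap))  # 16.0
```

The recursive capacitance lookup is recomputed for every wire here for clarity; a bottom-up pass would cache it per node.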
6Buffer model
- Linear gate delay model used for the buffers
  - Assumption: delay is a linear function of output capacitance
  - Dbuffer = Dintrinsic-delay + Rintrinsic-resistance * Coutput-load
- Isolation property: buffer devices decouple downstream output loads from the parent trees
  - Assumption: Miller effect (bootstrapping) due to Cgd is negligible
Figure: a buffer at node v driving the load Cload.
7Buffer Insertion Problem
Figure: buffer library BufLib = {b1, b2, b3}.
- Timing metrics
  - Required Arrival Time (RAT)
    - Each sink is given a RAT(si) value and the source is fixed at RAT(so) = 0
    - Delay minimization → maximize slack at source q(so)
  - Subtree Delay (SD)
    - SD(si) = RATmax(si) - RAT(si)
    - Delay minimization → minimize SD(so)
    - Advantage: unlike RAT, equations using SD are additive
- Our approach
  - Tradeoff surfaces in the 3D space of delay, capacitance and power
  - Continuously-sized buffer libraries
9Previous Work
- L. P. P. P. van Ginneken (VG), ISCAS '90
  - Two-phase dynamic programming algorithm
  - Backward traversal up the interconnect tree to compute load and delay values
  - Forward solution pass to reconstruct the best candidate
Function BOTTOM_UP(v)
  1. If v is a sink, return (Cv, SDv); else
  2. /* compute options for subtrees */
  3. BOTTOM_UP(left(v))
  4. BOTTOM_UP(right(v))
  5. Join pairs of subtrees by a merge operation
  6. Find best cnd among merged cnds to add a buffer
  7. Add parent wire to both types of cnds
  8. Prune inferior cnds from set of cnds
  9. Store cnd list for node v and return
Annotations:
- Steps 3-4: post-order DFS traversal
- Step 5: merge operation, Cparent = Cleft + Cright, SDparent = max(SDleft, SDright)
- Step 6: buffer candidate creation
- Step 8: pruning provably inferior candidates
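The merge step can be sketched as a linear two-pointer walk. This is an illustration under stated assumptions, not the original implementation: each candidate list is assumed to be already pruned, i.e. sorted by increasing load with strictly decreasing subtree delay, and the function name is hypothetical.

```python
def merge(left, right):
    """Merge two pruned candidate lists of (load, subtree_delay) pairs.

    Uses C_parent = C_left + C_right and SD_parent = max(SD_left, SD_right).
    Only the branch that currently determines the max is advanced, so the
    walk is linear in len(left) + len(right).
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        c_l, sd_l = left[i]
        c_r, sd_r = right[j]
        merged.append((c_l + c_r, max(sd_l, sd_r)))
        # Advancing the non-critical branch could only add load without
        # reducing the max, so advance the critical one (both on a tie).
        if sd_l >= sd_r:
            i += 1
        if sd_r >= sd_l:
            j += 1
    return merged


left = [(1.0, 10.0), (3.0, 6.0)]
right = [(2.0, 8.0), (5.0, 4.0)]
print(merge(left, right))  # [(3.0, 10.0), (5.0, 8.0), (8.0, 6.0)]
```

Of the four possible left/right pairings in the example, only (6.0, 10.0) is missing from the output, and it is dominated by (3.0, 10.0).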
10VG Algorithm
- Candidate format: 2-tuple (load, subtree delay) = (c, s)
- Recursive formulas for two possible cases
  - Only a wire is added at the root of the subtree: c1 = c0 + cwire, s1 = s0 + dwire
  - A buffer and a wire are added at the root of the subtree: c1 = cbuf + cwire (isolation property), s1 = s0 + dint + rbuf*c0 + dwire
- Pruning criterion: (c1, s1) is better than (c2, s2) if both load and subtree delay values are lower, i.e., c1 < c2 and s1 < s2
- Merge operation is linear
- Complexity: O(n^2), where n = number of buffer locations
- Additional objective: minimize buffer count → complexity is non-polynomial
Figure: in both cases, candidate (c0, s0) at the root of the subtree becomes candidate (c1, s1) above the added wire (or buffer and wire).
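The two recursive cases and the pruning rule can be sketched as follows. The slide does not spell out dwire, so this sketch assumes the standard Elmore form r_w * (c_w/2 + downstream load); the pruning below also uses non-strict dominance so that duplicate candidates are dropped. All function and parameter names are hypothetical.

```python
def wire_delay(r_w, c_w, c_down):
    """Elmore delay of a wire driving a downstream load (assumed form)."""
    return r_w * (c_w / 2.0 + c_down)


def add_wire(cand, r_w, c_w):
    """Case 1: only a wire is added above candidate (c0, s0)."""
    c0, s0 = cand
    return (c0 + c_w, s0 + wire_delay(r_w, c_w, c0))


def add_buffer_and_wire(cand, r_w, c_w, c_buf, d_int, r_buf):
    """Case 2: a buffer, then a wire. The buffer hides c0 from upstream."""
    c0, s0 = cand
    s_at_buffer_input = s0 + d_int + r_buf * c0
    return (c_buf + c_w, s_at_buffer_input + wire_delay(r_w, c_w, c_buf))


def prune(cands):
    """Keep candidates not dominated in both load and subtree delay."""
    kept = []
    for c, s in sorted(cands):
        if not kept or s < kept[-1][1]:
            kept.append((c, s))
    return kept


base = (4.0, 10.0)
unbuffered = add_wire(base, r_w=2.0, c_w=2.0)
buffered = add_buffer_and_wire(base, r_w=2.0, c_w=2.0,
                               c_buf=1.0, d_int=3.0, r_buf=0.5)
print(unbuffered, buffered)            # (6.0, 20.0) (3.0, 19.0)
print(prune([unbuffered, buffered]))   # [(3.0, 19.0)]
```

With these illustrative numbers the buffered candidate has both lower load and lower subtree delay, so the unbuffered one is pruned.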
11Previous Work
- Extensions to VG by Lillis et al., ICCAD '95, JSSC '96
  - A buffer library B can be used during buffer insertion → complexity O(n^2 B^2)
  - Simultaneous wire sizing and buffer insertion
  - Incorporate signal slew into the buffer delay model
- Dynamic power minimization subject to timing constraints
  - Candidate format: 3-tuple (load, subtree delay, power) = (c, s, p)
  - Equate power with effective total capacitance
  - Assumption: all capacitive values can be linearly mapped onto a polynomially-bounded integer domain (cmax = max cap value)
  - Sophisticated pruning mechanism using orthogonal range queries
  - Complexity: O(n^3 B cmax^2 log(n cmax)), based on this assumption
12Previous Work
- Several approaches in the literature target power minimization in conjunction with buffer insertion. Examples:
  - Quadratic programming: Chu et al., TCAD '99
  - Lagrangian relaxation: C.-P. Chen et al., TCAD '99
  - ClockTune: J.-L. Tsai et al., TCAD '04
- These associate total power with the effective capacitive area of wires and devices
  - Area minimization → power minimization
  - Ignores the contribution of static leakage power
  - Inclusion of this component results in non-polynomial complexity
- Addition of extra components in candidates generally leads to exponential complexity for dynamic programming
13Contributions of this paper
- Novel continuous buffer insertion algorithm with total power minimization
  - Inclusive of both dynamic and leakage power
- Generate tradeoff surfaces in the 3D DCP (Delay, Capacitance, Power) space
  - User is able to pick any desired point on this 3D surface
  - Easy to explore trade-offs between the 3 variables
- Ability to handle arbitrarily large buffer libraries
  - Continuously sized cell libraries with numerous buffer sizes
  - Capable of snapping to discrete buffer sizes if necessary
- Worst-case polynomial complexity: O(n^2)
  - Similar to the basic VG algorithm
15Library Characterization
- Buffer library with a set of continuously sized buffers
- Let S = sizing factor of the library. Express delay (db), capacitance (cb) and leakage (lb) in terms of S:
  - cb ∝ buffer area → cb = c0 + c1*S
  - lb ∝ device width → lb = l0 + l1*S
  - db follows the linear gate delay model → db = d0 + d1*(Cout/S)
- Determine the fitting constants c0, c1, l0, l1, d0, d1 empirically
- Equations fitted across the discrete buffer sizes approximate the ideal of continuous buffer sizing
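A minimal sketch of the empirical fitting step, using ordinary least squares written out in plain Python. The characterization data below are synthetic values generated from known coefficients purely to exercise the code; they are not measurements from any real library, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def characterize(sizes, caps, leaks, delays, c_out):
    """Fit cb = c0 + c1*S, lb = l0 + l1*S and db = d0 + d1*(Cout/S)."""
    c0, c1 = fit_line(sizes, caps)
    l0, l1 = fit_line(sizes, leaks)
    d0, d1 = fit_line([c_out / s for s in sizes], delays)
    return {"c0": c0, "c1": c1, "l0": l0, "l1": l1, "d0": d0, "d1": d1}


# Synthetic data (assumption): four discrete sizes, fixed output load of 8.
sizes = [1.0, 2.0, 4.0, 8.0]
caps = [0.5 + 0.25 * s for s in sizes]
leaks = [0.1 + 0.2 * s for s in sizes]
delays = [1.0 + 0.5 * (8.0 / s) for s in sizes]

lib = characterize(sizes, caps, leaks, delays, c_out=8.0)
print(lib)
```

Once fitted, the three expressions can be evaluated at any real-valued S, which is what allows the library to be treated as continuously sized.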
16Generation of candidates
Figure: interconnect tree with nodes o, u, v, t, wires lw1, lw2, lw3, and buffers b1 to b4.
- Point candidate
  - Candidate format: 3-tuple (Do, Co, Po)
  - A node has a point candidate when there are no buffers in the subtree rooted at that node
  - All sinks have point candidates
  - Write equations to determine the candidate at u
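The slide does not give the equations for the candidate at u, so the following is only a sketch under explicit assumptions: the wire delay uses the standard Elmore form, and power is modelled as a hypothetical constant k_dyn times the switched capacitance (no leakage, since an unbuffered subtree contains no devices). The class and function names are illustrative.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Point candidate (D, C, P) for a subtree with no buffers."""
    d: float
    c: float
    p: float


def sink_candidate(c_sink, sd_sink, k_dyn):
    """Point candidate at a sink: its load, its subtree delay, and the
    dynamic power attributed to switching that load (assumed model)."""
    return Point(sd_sink, c_sink, k_dyn * c_sink)


def propagate_wire(cand, r_w, c_w, k_dyn):
    """Move a point candidate up across one wire (o -> u)."""
    delay = cand.d + r_w * (c_w / 2.0 + cand.c)
    return Point(delay, cand.c + c_w, cand.p + k_dyn * c_w)


at_o = sink_candidate(c_sink=2.0, sd_sink=0.0, k_dyn=1.0)
at_u = propagate_wire(at_o, r_w=1.0, c_w=2.0, k_dyn=1.0)
print(at_o)  # Point(d=0.0, c=2.0, p=2.0)
print(at_u)  # Point(d=3.0, c=4.0, p=4.0)
```

A point stays a point under wire propagation because nothing in the subtree is free to vary; a degree of freedom only appears once a sized buffer is introduced.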
17Generation of candidates
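Figure: the same interconnect tree (nodes o, u, v, t, wires lw1, lw2, lw3, buffers b1 to b4) with point candidate (Do, Co, Po) and a buffer of variable size S.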
- Curve candidate
  - Candidate format: [Dumin, Dumax], (gi, ki), i = 0,2
  - A node has a curve candidate when there is exactly one buffer in the subtree rooted at that node
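To illustrate why a single variable-size buffer turns a point into a curve, the sketch below sweeps the sizing factor S and samples the resulting (D, C, P) triples. It assumes the linear library model cb = c0 + c1*S, lb = l0 + l1*S, db = d0 + d1*(Cout/S) and a hypothetical power model of k_dyn times capacitance plus leakage. The slide's curve format stores coefficients (gi, ki) over a delay range; sampling points, as done here, is a simplification.

```python
def buffer_at(point, s, lib, k_dyn):
    """Insert a buffer of size s above a point candidate (D, C, P)."""
    d_u, c_u, p_u = point
    c_b = lib["c0"] + lib["c1"] * s          # input capacitance seen upstream
    l_b = lib["l0"] + lib["l1"] * s          # leakage of the buffer
    d_b = lib["d0"] + lib["d1"] * (c_u / s)  # delay driving the old load
    return (d_u + d_b, c_b, p_u + k_dyn * c_b + l_b)


def sample_curve(point, sizes, lib, k_dyn):
    """Sample the curve candidate at a list of buffer sizes."""
    return [buffer_at(point, s, lib, k_dyn) for s in sizes]


# Illustrative (made-up) library constants and point candidate.
lib = {"c0": 0.5, "c1": 0.25, "l0": 0.0, "l1": 0.5, "d0": 1.0, "d1": 2.0}
curve = sample_curve((3.0, 4.0, 4.0), [2.0, 4.0], lib, k_dyn=1.0)
print(curve)  # [(8.0, 1.0, 6.0), (6.0, 1.5, 7.5)]
```

The two samples show the trade-off: the larger buffer lowers delay but raises both the upstream load and the power.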
18Generation of candidates
Figure: the same interconnect tree (nodes o, u, v, t, wires lw1, lw2, lw3, buffers b1 to b4) with candidate (Du, Cu, Pu) at u and a C-plane with discrete Cv.
- For a given S, Cv is fixed; Dv and Pv vary based on Du
- Surface candidate
  - C-plane format: Cv, [Dmin, Dmax], (ki), i = 0,2
  - Candidate format: vector<CPlane>
19Generation of candidates
Figure: the same interconnect tree (nodes o, u, v, t, wires lw1, lw2, lw3, buffers b1 to b4) with candidates (Du, Cu, Pu) at u and (Dv, Cv, Pv) at v.
- Similar equations can be written to determine the candidate at t
  - Ct is a function of S only, but Dt and Pt are functions of Cv, Dv and S
  - This yields a new set of C-planes
  - For each C-plane, the lower envelope gives the power-optimal solution
  - Surface candidate → surface candidate
20Design Choices
- Wire network is a binary tree
  - Zero-length wires, dummy nodes
- Ignore signal polarity on buffers
  - Pair of solution sets (similar to Lillis)
- Number of surface candidates per node = 2 (buffered/non-buffered)
  - Trade-off between more fine-grained solutions and efficiency
  - No impact on optimality or complexity
21Merging and Implicit Pruning
- First, merge left and right candidates
  - Compare equal-delay points by checking 4 combinations of left and right candidates
  - Create P/C curves and extract the lower envelope → pruning
  - Translate P/C curves with fixed D value into P/D curves with fixed C values → creation of C-planes for 4 different surface candidates
- Next, recombine these 4 surfaces into a single candidate
  - Map P/D curves from one C-plane to another using linear interpolation
  - For each (D, C) value, pick the lowest power value → pruning
  - Use the composite surface to create the buffered/non-buffered candidate
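The "pick the lowest power value" step can be illustrated for one fixed C value. This is a simplified sketch, not the algorithm's actual data structure: each P/D curve is assumed to be a sorted list of (delay, power) samples, evaluated by piecewise-linear interpolation along the delay axis, and the envelope is taken pointwise on a caller-supplied delay grid. All names are hypothetical.

```python
def interp(curve, d):
    """Piecewise-linear power at delay d, or None outside the curve's range.
    `curve` is a list of (delay, power) pairs sorted by delay."""
    if d < curve[0][0] or d > curve[-1][0]:
        return None
    for (d0, p0), (d1, p1) in zip(curve, curve[1:]):
        if d0 <= d <= d1:
            if d1 == d0:
                return min(p0, p1)
            t = (d - d0) / (d1 - d0)
            return p0 + t * (p1 - p0)
    return curve[0][1]  # single-sample curve evaluated at its own delay


def lower_envelope(curves, delays):
    """For each delay on the grid, keep the lowest power over all curves."""
    envelope = []
    for d in delays:
        values = [p for p in (interp(c, d) for c in curves) if p is not None]
        if values:
            envelope.append((d, min(values)))
    return envelope


curve_a = [(1.0, 10.0), (3.0, 6.0)]
curve_b = [(2.0, 5.0), (4.0, 3.0)]
print(lower_envelope([curve_a, curve_b], [1.0, 2.0, 3.0, 4.0]))
# [(1.0, 10.0), (2.0, 5.0), (3.0, 4.0), (4.0, 3.0)]
```

Taking the envelope is what makes the pruning implicit: any curve segment lying above another at the same (D, C) never appears in the composite surface.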
22Reconstruction and Snapping
- Pair of candidate solutions created for the source
  - Any trade-off point in the DCP surface can be picked
  - Forward solution pass to reconstruct the tree structure with buffer locations
- Snapping: if the required size is unavailable, the buffer with the nearest size value is chosen
  - Problem: discrepancies in D, C, P values → Solution: local refinements in the C-planes
  - Single pass through the RC tree
- Complexity: O(n^2), where n = number of possible buffer locations
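Snapping to the nearest available size is straightforward to sketch. The list of discrete sizes below is made up for illustration, and the function names are hypothetical; the local refinement of the C-planes mentioned above is not modelled.

```python
def snap_size(s, available):
    """Return the available buffer size closest to the continuous size s.
    On a tie, the size listed first in `available` wins."""
    return min(available, key=lambda a: abs(a - s))


def snap_solution(sizes, available):
    """Snap every continuous buffer size in a reconstructed solution."""
    return [snap_size(s, available) for s in sizes]


available = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
print(snap_solution([2.7, 5.2, 0.4], available))  # [3.0, 6.0, 1.0]
```

Because the snapped size differs from the optimized one, the D, C and P values of the affected candidates shift slightly, which is the discrepancy the refinement step is meant to absorb.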
24Results
- Benchmarks: C-tree nets
- TSMC 0.13um buffer library
  - Number of discrete buffer choices: 9
  - Multilinear fitting models using the GNU Scientific Library
- Example 3D surface
25Results: Snapping
26Results: Comparison
- Implementation of the Lillis algorithm with leakage included
  - Pruning is less effective
27Conclusion
- Buffer insertion algorithm with total power (Pdyn + Pstat) minimization as objective
- Generate 3D surfaces in Delay, Capacitance and Power space
  - Ability to explore different types of trade-offs
- Able to handle large buffer libraries with continuous sizes
- Worst-case polynomial complexity