Title: Large Scale Circuit Placement: Gap and Promise
1 Large Scale Circuit Placement: Gap and Promise
- Jason Cong
- UCLA VLSI CAD LAB
- Joint work with Chin-Chih Chang, Tim Kong,
Michail Romesis, Joseph R. Shinnerl, Min Xie and
Xin Yuan
2 Outline
- Introduction
- Gap Analysis of Existing Placement Algorithms
- Scalable Paradigm: Multilevel Placement
3 Why Still the Placement Problem?
- True, it has been studied for over 30 years, but
- We need good solutions more than ever
- One of the most important steps in the IC implementation flow
- Directly defines interconnects
- Difficult
- Problem size grows 2X every 18-24 months
- Moore's Law
- Cannot place hierarchically without quality degradation
4 Example of Logic Hierarchy in Final Layout
Courtesy of IBM (Tony Drumm)
5 Why Still Placement?
- True, it has been studied for over 30 years, but
- We need good solutions more than ever
- One of the most important steps in the IC implementation flow
- Directly defines interconnects
- Difficult
- Problem size grows 2X every 18-24 months
- Moore's Law
- Cannot place hierarchically without quality degradation
- We are not very good at it
6 Outline
- Introduction
- Gap Analysis of Existing Placement Algorithms
- Scalable Paradigm: Multilevel Placement
7 Motivation
- Lack of significant progress in wirelength reduction
- Rate of reduction is about 5-10% every 2-3 years
- Latest developments in placement differ mainly in runtime
- Most work compares only with known heuristics
- Use real design based benchmarks
- Use synthetic benchmarks
- Little understanding of the divergence from the optimal
8 Placement Examples with Known Optimal Wirelength (PEKO) [Chang et al., 2003]
- Given a (real) netlist N
- Construct netlist N' with known optimal wirelength (WL) that matches the net distribution of N
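The construction idea can be sketched as follows. This is a minimal illustration, not the authors' generator: it assumes unit-size cells sitting on an integer grid, half-perimeter wirelength (HPWL), and that every net is drawn from the most compact box that can hold its pins. The names `min_net_hpwl`, `net_hpwl`, `build_known_optimal_nets`, and the degree counts in the usage line are all hypothetical.

```python
import random
from math import isqrt

def min_net_hpwl(degree):
    """Smallest half-perimeter of a box holding `degree` unit cells on a grid."""
    width = isqrt(degree - 1) + 1      # ceil(sqrt(degree))
    height = -(-degree // width)       # ceil(degree / width)
    return (width - 1) + (height - 1)

def net_hpwl(pins):
    """Half-perimeter wirelength of one net given its (x, y) pin sites."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def build_known_optimal_nets(grid, degree_counts, seed=0):
    """Draw every net from a compact box on a grid x grid array of unit cells.

    degree_counts maps net degree -> number of nets of that degree
    (a stand-in for the net distribution of a real netlist).
    Returns (nets, optimal_wl); each net is a list of (x, y) cell sites.
    """
    rng = random.Random(seed)
    nets, optimal_wl = [], 0
    for degree, count in degree_counts.items():
        width = isqrt(degree - 1) + 1
        height = -(-degree // width)
        for _ in range(count):
            x0 = rng.randrange(grid - width + 1)
            y0 = rng.randrange(grid - height + 1)
            # Row-major fill: the first row is full and the last row is
            # non-empty, so the chosen cells span the whole box.
            box = [(x0 + dx, y0 + dy) for dy in range(height) for dx in range(width)]
            nets.append(box[:degree])
            optimal_wl += min_net_hpwl(degree)
    return nets, optimal_wl

# Hypothetical net distribution: five 2-pin, three 3-pin, two 4-pin nets.
nets, optimal_wl = build_known_optimal_nets(grid=8, degree_counts={2: 5, 3: 3, 4: 2})
print("known optimal wirelength:", optimal_wl)
```

Because each net individually meets its lower bound in the grid placement, no legal placement of the same cells can have a smaller total, which is what makes the optimum known by construction under these assumptions.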
9 Placement Examples with Known Upperbounds (PEKU) [Cong et al., 2003]
- Limitations of PEKO
- All the nets are local
- Wirelength contribution by global connections in real designs can be significant
10 Illustration: PEKU Example Construction
Input: t = 64, D = {d2 = 35, d3 = 21, d4 = 7, d5 = 4, d6 = 2, d7 = 1}, ? = 0.2
W = {w1..w3 = 0, w4 = 3, w5 = 3, w6 = 0, w7 = 2, w8 = 2, w9 = 1, w10 = 0, w11 = 1, w12 = 1}
Generate 28 2-pin optimally
Generate 16 3-pin optimally
Generate 5 3-pin randomly
Generate 6 4-pin optimally
Generate 1 4-pin randomly
Generate 4 5-pin optimally
Generate 2 6-pin optimally
Generate 1 7-pin optimally
Total WL 184
11 Studied Five State-of-the-Art Placers
- Capo [Caldwell et al., 2000]
- Based on multilevel partitioner
- Aims to enhance the routability
- Dragon [Wang et al., 2000]
- Uses hMetis for initial partition
- SA with bin-based swapping
- mPL [Chan et al., 2000]
- Nonlinear programming on the coarsest level
- Discrete relaxation at finer levels
- mPG [Chang et al., 2002]
- Uses FC clustering and hierarchical density control
- Incremental A-tree for routability
- Qplace [Cadence Inc.]
- Leading edge industrial placer
- Component of Silicon Ensemble
12 Experimental Results on PEKO
- Existing algorithms can be 59% to 140% away from the optimal on PEKO
- On examples with pads
- mPG and Qplace show improvement of 12% and 10%, respectively
- Dragon, mPL, and Capo do not benefit much from the additional information
- There is significant room for improvement in placement algorithms
13 Experimental Results on PEKO
- Capo, QPlace and mPL scale well in runtime
- Average solution quality of each tool deteriorates by an additional 9% to 17% when the problem size increases by a factor of 10
14 Experimental Results on PEKU
- The effectiveness of existing placers can vary significantly for circuits of similar size but different characteristics
- Comparing QRs helps to identify the technique that works best under each scenario
- QR (placed wirelength vs. upper bound): the upper bound may not be tight
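A small sketch of how such a comparison could be computed, assuming QR denotes the ratio of placed wirelength to the reference wirelength (the known optimum for PEKO, the upper bound for PEKU). The function names, placer names, and wirelength values are hypothetical; the two values are chosen only to reproduce the 59% and 140% gaps quoted for PEKO.

```python
def quality_ratio(placed_wl, reference_wl):
    """Placed wirelength divided by the reference (optimal or upper-bound) wirelength."""
    if reference_wl <= 0:
        raise ValueError("reference wirelength must be positive")
    return placed_wl / reference_wl

def percent_gap(placed_wl, reference_wl):
    """How far the placed wirelength is above the reference, in percent."""
    return 100.0 * (quality_ratio(placed_wl, reference_wl) - 1.0)

# Hypothetical wirelengths for one benchmark with reference wirelength 1000.
reference = 1000.0
placed = {"placer_a": 1590.0, "placer_b": 2400.0}
for name, wl in placed.items():
    print(name, round(quality_ratio(wl, reference), 2), round(percent_gap(wl, reference)))
```

When the reference is only an upper bound, a ratio above 1 still guarantees a real gap of at least that size, while a ratio near 1 does not prove the placement is near optimal.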
15 High Interest in the Community
16 Timing-driven Placement Examples with Known Optimal (TPEKO)
- Obtain a placement for the circuit from any available tool
- Perform timing analysis on the circuit
- Create an artificial combinational path with equal or larger delay than the longest path
- Guarantee the cells in the path are adjacent to each other
- Make necessary modifications
17 Evaluating Timing-Driven Placement Algorithms Using TPEKO
- Evaluating two state-of-the-art FPGA placement algorithms
- VPR [Marquardt et al., 2000]
- PATH [Kong, 2002]
- Can be far away from the optimal for difficult examples
- 35% on average
- 54% in the worst case
18 Observations from Gap Analysis
- Significant opportunity in placement
- Existing algorithms may produce solutions far away from the optimal
- The result quality of the same placer varies for circuits of similar size but different characteristics
- Scalability problem in runtime and solution quality
- Significant ROI
- Benefit equal to one to two generations of process scaling
- But without requiring multi-billion dollar investment (hopefully!)
19 Outline
- Introduction
- Gap Analysis of Existing Placement Algorithms
- Scalable Paradigm
- Timing Optimization
- Routability Optimization
- Concluding Remarks
- Application
- Multi-Million Gate FPGA Placement
20 Paradigm 2: Multilevel Placement
- Coarsening: build the hierarchy by recursive aggregation (generalized clustering)
- Relaxation: improve the placement at each level by localized optimization
- Interpolation: transfer coarse-level solution to adjacent, finer level (generalized declustering)
- Multilevel Flow: multiple traversals over multiple hierarchies (V-cycle variations)
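The four components above can be sketched as a single recursive V-cycle. This is a one-dimensional toy under loud assumptions, not the mPL or mPG implementation: pairing consecutive cells stands in for clustering, a few weighted-average sweeps stand in for relaxation, copying a cluster's position to its members stands in for interpolation, and the first and last cells are treated as pads fixed at 0 and 1. All function names are hypothetical.

```python
from collections import defaultdict

def coarsen(n, edges):
    """Pair cells (2k, 2k+1) into cluster k; merge parallel edges, drop internal ones."""
    merged = defaultdict(float)
    for i, j, w in edges:
        a, b = i // 2, j // 2
        if a != b:
            merged[(min(a, b), max(a, b))] += w
    return (n + 1) // 2, [(a, b, w) for (a, b), w in merged.items()]

def relax(n, edges, x, sweeps=3):
    """Move each interior cell to the weighted average of its neighbors."""
    nbrs = defaultdict(list)
    for i, j, w in edges:
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    for _sweep in range(sweeps):
        for i in range(1, n - 1):          # cells 0 and n-1 act as fixed pads
            total = sum(w for _, w in nbrs[i])
            if total > 0:
                x[i] = sum(w * x[j] for j, w in nbrs[i]) / total
    return x

def interpolate(n, coarse_x):
    """Place every cell at the position of the cluster that contains it."""
    return [coarse_x[i // 2] for i in range(n)]

def v_cycle(n, edges, coarsest=2):
    """One V-cycle: coarsen down, solve the coarsest level, interpolate and relax up."""
    if n < 2:
        raise ValueError("need at least two cells (the two pads)")
    if n <= coarsest:
        x = [0.0] * n
        x[-1] = 1.0
        return relax(n, edges, x)
    coarse_n, coarse_edges = coarsen(n, edges)
    coarse_x = v_cycle(coarse_n, coarse_edges, coarsest)
    x = interpolate(n, coarse_x)
    x[0], x[-1] = 0.0, 1.0
    return relax(n, edges, x)

# Toy netlist: 16 cells connected in a chain by unit-weight two-pin nets.
chain = [(i, i + 1, 1.0) for i in range(15)]
positions = v_cycle(16, chain)
print([round(p, 3) for p in positions])
```

The point of the sketch is the control flow: relaxation stays local and cheap at every level, and the global spreading of cells comes from the coarse levels.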
21 Multi-Level Optimization Framework
- Multilevel coarsening generates smaller problem sizes at coarser levels => faster optimization at coarser levels
- May explore different aspects of the solution space at different levels
- Gradual refinement on good solutions from coarser levels is very efficient
- Successful in many applications
- Originally developed for PDEs
- Recent success in VLSI CAD: partitioning, placement, routing
22 Multilevel Coarse Placement
23 Multilevel Methods: Coarsening by Recursive Aggregation
- Recursive aggregation defines the hierarchy.
- Different aggregation algorithms can be used on different levels and/or in different V-cycles.
- Clustering methods
- First-Choice Clustering (hMetis [Karypis, 1999]).
- AMG-based aggregation
- An aggregate need not be a cluster. A cell can be fractionally associated to more than one aggregate
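A simplified sketch of first-choice style clustering follows. It is an illustration under assumptions, not hMetis or mPL code: the affinity between two cells is taken to be the sum of 1/(|e| - 1) over the nets e they share, cells are visited in index order rather than random order, and cell areas are ignored. The function name `first_choice_clustering` is hypothetical.

```python
from collections import defaultdict

def first_choice_clustering(num_cells, nets):
    """Return a cluster id per cell; nets is a list of lists of cell indices."""
    # Assumed affinity: each shared net e contributes 1 / (|e| - 1).
    affinity = defaultdict(float)
    for net in nets:
        if len(net) < 2:
            continue
        weight = 1.0 / (len(net) - 1)
        for a in net:
            for b in net:
                if a != b:
                    affinity[(a, b)] += weight

    # Strongest neighbor of every cell (ties broken toward the larger index).
    best = {}
    for (a, b), r in affinity.items():
        if a not in best or (r, b) > best[a]:
            best[a] = (r, b)

    parent = list(range(num_cells))
    merged = [False] * num_cells

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v in range(num_cells):
        if merged[v] or v not in best:
            continue                      # already absorbed, or isolated
        u = best[v][1]
        # First choice: join the best neighbor even if it is already clustered.
        parent[find(v)] = find(u)
        merged[v] = merged[u] = True

    ids = {}
    return [ids.setdefault(find(v), len(ids)) for v in range(num_cells)]

# Cells 0-1 and 2-3 share two nets each; cells 1-2 share only one.
print(first_choice_clustering(4, [[0, 1], [0, 1], [1, 2], [2, 3], [2, 3]]))
```

Unlike strict pairwise matching, a cell may join a neighbor that is already clustered, so cluster sizes can vary; a production clusterer would usually add an area limit.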
24 Multilevel Methods: Relaxation (Intralevel Optimization)
- Iterative improvement at each level by fast, localized computation
- Discrete permutation enumerations, swapping
- Unconstrained quadratic wirelength minimization on subsets
- Network-flow based improvement on subsets (RDFL)
- Local relaxation is sufficient. Global improvement comes from the multilevel hierarchy.
- Relaxations at finer levels may be quite different, e.g., more discrete, than relaxations at coarser levels.
25 Relaxation on Local Subsets
Move the red cells to their optimal positions,
holding all other cells fixed and (perhaps)
ignoring overlap
Original Subnetlist with Subproblem
26 Example: Goto-based Discrete Relaxation
- Each cell's optimal location is readily calculated when all other cells are held fixed.
- Compute a chain A, B, C, D, E, where B is a randomly selected neighbor of A's optimal location, etc.
- Examine all permutations of the chain and take the best one.
- Problem: the chain is not closed (A is not necessarily near any other cell's optimal location).
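The first bullet can be made concrete for half-perimeter wirelength. A sketch under assumptions: with all other cells fixed, the cost of each net touching the cell grows linearly once the cell leaves the interval spanned by that net's other pins, so a median of all interval endpoints is an optimal coordinate. Only the x coordinate is shown, and the helper names `optimal_x` and `cost_at` are hypothetical.

```python
from statistics import median

def optimal_x(cell, nets, x):
    """Optimal x for `cell` under HPWL with every other cell held fixed."""
    endpoints = []
    for net in nets:
        if cell not in net:
            continue
        others = [x[c] for c in net if c != cell]
        if others:
            # The net costs nothing extra while the cell stays inside this interval.
            endpoints += [min(others), max(others)]
    return median(endpoints) if endpoints else x[cell]

def cost_at(cell, nets, x, pos):
    """Total x-span of the nets containing `cell` if it were moved to `pos`."""
    trial = dict(x)
    trial[cell] = pos
    total = 0
    for net in nets:
        if cell in net:
            coords = [trial[c] for c in net]
            total += max(coords) - min(coords)
    return total

# Cell "a" connects to fixed cells at x = 0, 4, and 10.
nets = [["a", "b"], ["a", "c"], ["a", "d"]]
x = {"a": 7, "b": 0, "c": 4, "d": 10}
best = optimal_x("a", nets, x)
print(best, cost_at("a", nets, x, best))
```

The y coordinate is handled the same way and independently, which is why the per-cell optimum is cheap enough to compute for every cell in a chain.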
27 Example: Quadratic Relaxation on Noncontiguous Subsets (QRS)
- Select a subset M of cells to move
- Identify other cells and pads, F, connected to M by nets in
- Decouple the horizontal and vertical problems.
- M is obtained as segments of length k along a DFS vertex traversal of the netlist
28 Solving the QRS Subproblem
- Problem formulation (horizontal case)
- Iteratively solve the weighted quadratic minimization problem, using the current solution to determine the weight (as in Gordian-L)
- May result in cell overlap!
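A minimal sketch of such a subproblem in the horizontal direction, under stated assumptions: connections are two-pin only (multi-pin nets would need a clique or star model), the quadratic objective is the sum of w_ij (x_i - x_j)^2 over connections touching a movable cell, each solve uses Gauss-Seidel sweeps instead of a sparse linear solver, and the reweighting w_ij = 1 / |x_i - x_j| follows the Gordian-L idea of approximating linear wirelength. The function name and parameters are hypothetical.

```python
def quadratic_relaxation(movable, connections, x, rounds=3, sweeps=200, eps=1e-6):
    """Relax a subset of cells in one dimension with all other cells fixed.

    movable: list of cell names allowed to move (the set M).
    connections: list of (cell_i, cell_j) two-pin connections.
    x: dict name -> coordinate, including fixed cells and pads (the set F).
    """
    x = dict(x)
    movable_set = set(movable)
    weights = [1.0] * len(connections)      # first round: plain quadratic objective
    for _round in range(rounds):
        nbrs = {m: [] for m in movable}
        for (i, j), w in zip(connections, weights):
            if i in movable_set:
                nbrs[i].append((j, w))
            if j in movable_set:
                nbrs[j].append((i, w))
        # Gauss-Seidel: each movable cell goes to the weighted mean of its neighbors,
        # which is the stationarity condition of the quadratic objective.
        for _sweep in range(sweeps):
            for m in movable:
                total = sum(w for _, w in nbrs[m])
                if total > 0:
                    x[m] = sum(w * x[n] for n, w in nbrs[m]) / total
        # Reweight by inverse length so the next solve approximates linear wirelength.
        weights = [1.0 / max(abs(x[i] - x[j]), eps) for i, j in connections]
    return x

# Two movable cells between pads fixed at 0 and 9.
start = {"p0": 0.0, "a": 1.0, "b": 2.0, "p1": 9.0}
result = quadratic_relaxation(["a", "b"], [("p0", "a"), ("a", "b"), ("b", "p1")], start)
print(round(result["a"], 6), round(result["b"], 6))
```

Nothing in this solve prevents two cells from landing on the same coordinate, which is the overlap the last bullet warns about.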
29 Ripple-move Legalization [Hur and Lillis, 2000]
Because many forms of subset relaxation ignore overlap, post-relaxation cell swaps may be needed to remove overlap.
30 Multilevel Methods: Interpolation (Generalized Declustering)
- Goal: transfer a partial solution from a coarser level to its adjacent finer level
- Simplest approach: place all components of a cluster at its center
- Better approach: place each component of an aggregate at the weighted average of the aggregates to which it is strongly connected.
- Optionally impose constraints, e.g., the average location of the components can be held fixed.
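Both approaches fit one weighted-average formula, sketched below under assumptions: each fine-level cell carries a list of (aggregate, weight) links, where the weights are hypothetical connection strengths, and a cell with a single link reproduces the simplest approach of placing it at its cluster's center. The function name and data layout are illustrative only.

```python
def interpolate(cell_links, aggregate_pos):
    """Place each cell at the weighted average of its linked aggregates.

    cell_links: dict cell -> list of (aggregate, weight) with positive weights.
    aggregate_pos: dict aggregate -> (x, y) from the coarser-level solution.
    """
    placement = {}
    for cell, links in cell_links.items():
        total = sum(w for _, w in links)
        px = sum(w * aggregate_pos[a][0] for a, w in links) / total
        py = sum(w * aggregate_pos[a][1] for a, w in links) / total
        placement[cell] = (px, py)
    return placement

coarse = {"A": (0.0, 0.0), "B": (4.0, 8.0)}
links = {
    "c1": [("A", 1.0)],                 # simplest case: sits at its cluster's center
    "c2": [("A", 1.0), ("B", 3.0)],     # pulled three times harder toward B
}
print(interpolate(links, coarse))
```

The optional constraint in the last bullet could be imposed afterwards by shifting the components of each aggregate so that their average matches the aggregate's position.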
31 Interpolation (Declustering)
- Use the same grid structure at each level
- Variable cluster size (may be bigger than a bin) handled by hierarchical area density control
- Multilevel SA engine: SA engine starts with a low temperature at each level except the coarsest level
32 AMG-style Linear Interpolation
33 AMG-based Linear Interpolation [A. Brandt, 1986]
34 Iterated Multilevel Flow
Make use of placement solution from 1st V-cycle
First Choice (FC) clustering
35 Iterated Multilevel Flow
Iterated V-Cycles
F-Cycle
Backtracking V-Cycle
36 Sample Impact of the Multilevel Components on mPL's Overall Quality
- First-Choice Clustering: 34 reduced WL
- QRS Relaxation: 56 reduced WL
- AMG Interpolation: 23 reduced WL
- Iterated V-cycles: 28 reduced WL
37 mPL 3.0 vs. mPL1.0 and Gordian-L
Uniform-Cell IBM/ISPD 98 Circuits (values are ratios relative to mPL 3.0)
          mPL1.0 Dom.             Gordian-L Dom.
Circuit   Wirelength   CPU time   Wirelength   CPU time
Ibm04     1.18         0.31       1.05         1.90
Ibm07     1.14         0.34       1.05         3.77
Ibm09     1.14         0.33       1.04         4.90
Ibm10     1.11         0.31       0.99         6.54
Ibm14     1.11         0.41       1.04         8.28
Ibm16     1.16         0.46       1.00         11.76
Ibm17     1.07         0.44       0.98         10.41
Ibm18     1.18         0.42       1.03         13.43
Average   1.14         0.38       1.02         7.62
(12% better than mPL1.0 with 2x longer runtime; 2% better than Gordian-L and 7x faster)
38 mPL 3.0 vs. Capo 8.5 and Dragon
Uniform-Cell IBM/ISPD 98 Circuits
          Capo 8.5 / mPL3.0       Dragon / mPL3.0
Circuit   Wirelength   CPU time   Wirelength   CPU time
Ibm04     1.12         0.53       0.97         3.03
Ibm07     1.12         0.60       0.95         3.33
Ibm09     1.12         0.67       1.01         5.40
Ibm10     1.10         0.55       0.99         4.70
Ibm14     1.08         0.57       0.95         3.02
Ibm16     1.06         0.54       0.90         6.83
Ibm17     1.10         0.43       0.98         6.82
Ibm18     1.10         0.43       0.96         6.10
Average   1.10         0.54       0.96         4.91
(10% better than Capo with 2x longer runtime; 4% worse than Dragon but 4x faster)
39 mPL3.0 vs. mPL1.0, Capo8.5, Dragon and Gordian-L
40 Extension: Multilevel Mixed-size Placement
- Simultaneously place big and small objects
- Gradually fix the locations of big objects and generate overlap-free placement for big objects during multilevel placement
41 Example: Final Placement of ibm02 by mPG-ms
42 Concluding Remarks
- There is significant opportunity to improve the placement technologies
- Multilevel placement is a promising scalable solution