Title: Accelerators for FPGA Placement
1Accelerators for FPGA Placement
- Pritha Banerjee
- Advanced Computing Microelectronics Unit
- Indian Statistical Institute, Kolkata
2Outline of this talk
- Introduction to FPGAs
- Problem Formulation
- Cone based FPGA Placement
  - Initial placement
  - Low temperature SA
- Placement by Space Filling Curve
  - Generation of linear placement
  - Initial placement by SFC curves
  - Refinement by Low temperature SA
- Concluding Remarks
3Island-style FPGA Architecture
[Figure: array based FPGA model. L = LUT based logic block/slice, C = connection block, S = switch block. The figure labels the programmable connection switches, the programmable routing switches, and the short and long wire segments.]
4Simplified FPGA Logic Block/Slice
- An FPGA slice has
- 2 LUTs with 4, 5 or 6 inputs
- 2 registers
- Carry logic for fast adders
- 4 outputs: 2 registered, 2 non-registered
[Figure: Slice 0, showing two registers with D, Q, CE, PRE and CLR pins.]
Courtesy Richard Sevcik, Xilinx
5A Decade of Progress
- 200x More Logic
- Plus memory, µP etc.
- 40x Faster
- 50x Lower Power
- 500x Lower Cost
[Figure: log-scale plot (1x to 1000x) of CLB capacity, speed, power per MHz and price against year, '91 to '04, with the device families XC4000, Spartan, Virtex, Virtex-E, Spartan-2, Virtex-II, Virtex-II Pro, Spartan-3 and Virtex-4 marked.]
Courtesy Richard Sevcik, Xilinx
6FPGA vs. ASIC Cost
- ASIC: high volumes needed to recover design cost
- ASIC cost/part is lower
- ASIC design cost is much higher (and increasing)!
[Figure: total cost against volume for FPGA and ASIC.]
Courtesy Richard Sevcik, Xilinx
7FPGA Design Flow
Circuit description (VHDL, schematic)
Synthesize (technology map) to logic blocks
Place logic blocks in FPGA
Route connections between logic blocks
FPGA programming file
8FPGA Placement Problem
- Input: A technology mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit.
- Output: The CLB netlist placed in a two dimensional array of slots such that total wirelength is minimized.
[Figure: a CLB netlist with primary inputs i1 to i4, blocks 1 to 10 and outputs f1 and f2, and its placement on the FPGA array.]
9Problem Formulation
- Given
  - Set of modules $M = \{m_1, m_2, \ldots, m_n\}$
  - Set of signals $S = \{s_1, s_2, \ldots, s_q\}$
  - Set of locations $L = \{l_1, l_2, \ldots, l_p\}$, $p \ge |M|$
  - $\forall m_i \in M$, there is a set of signals $S_{m_i}$
  - $\forall s_i \in S$, there is a set of modules $M_{s_i}$
  - $M_{s_i}$ is said to be a signal net
Goal: To assign each module $m_i \in M$ to a location $l_j \in L$ such that the chosen objective function is optimized.
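A minimal sketch of this formulation, assuming the objective is total wirelength estimated by the half-perimeter of each net's bounding box. The slide only says "chosen objective function", so that choice, and the names `hpwl` and `total_wirelength`, are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Location = Tuple[int, int]  # (column, row) of a slot in the 2D array


def hpwl(net: List[str], placement: Dict[str, Location]) -> int:
    """Half-perimeter of the bounding box enclosing all modules of one net."""
    xs = [placement[m][0] for m in net]
    ys = [placement[m][1] for m in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets: List[List[str]], placement: Dict[str, Location]) -> int:
    """Sum of the per-net estimates; rejects two modules on one location."""
    locations = list(placement.values())
    if len(set(locations)) != len(locations):
        raise ValueError("two modules share a location")
    return sum(hpwl(net, placement) for net in nets)


# Hypothetical 4-module example on a 2 x 2 array.
nets = [["m1", "m2", "m3"], ["m2", "m4"]]
placement = {"m1": (0, 0), "m2": (1, 0), "m3": (1, 1), "m4": (0, 1)}
print(total_wirelength(nets, placement))
```

Each net here is the set of modules sharing one signal, and the placement is the module-to-location assignment that the goal asks for.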
10Existing Approaches for FPGA Placement
- VPR (1997)
  - Uses Simulated Annealing (SA)
  - Adaptive Annealing Schedule
- Tabu Search Based Method (1999)
- TCO (2004)
  - Temperature schedule and probability of acceptance derived from laws of thermodynamics
- Force Directed Placement
- Genetic Algorithm Based Placement
- Partitioning and Clustering Based Techniques
11Accelerators for FPGA Placement
Initial placement quality does affect the speed
of convergence in iterative refinement methods!
- Cone based initial placement with iterative refinement by simulated annealing
  - Cost metrics
  - Low temperature Simulated Annealing
- Placement by Space Filling Curves followed by low temp. SA
12Part I: A Cone based Initial Placement for FPGAs
13Motivation of our work
- Salient features of previous approaches
  - initial placement done at random
  - improvement through iteration is time-consuming
- Our motivation: fast placement method to accelerate the iterative phase without sacrificing quality
- Approach
  - better initial placement using a constructive method
  - quicker convergence of the iterative phase
14Our workflow
Technology mapped netlist
Cone based Initial Placement (Accelerator)
Initial Placement
Low temperature simulated annealing
Final Placement
15Preliminaries
- For a given CLB netlist, a graph $G(V,E)$ is defined where
  - $V = \{v \mid v$ is a CLB / primary input (I) / primary output (O)$\}$
  - $E = \{\langle v_i, v_j \rangle \mid v_i \in fanin(v_j)$ and $v_j \in fanout(v_i)\}$
- $Cone(O_i) = \{u \mid \exists$ a simple directed path from $u$ to $O_i$ in $G\}$
- Bounding Box
  - A rectangular region containing all $b_j \in fanout(b_i)$
- $N_{nets}$: number of nets
- $bb_x(i)$, $bb_y(i)$: horizontal and vertical span of the bounding box
- $q(i)$: 1 for nets with 3 or fewer terminals, increasing to 2.79 for nets with 50 terminals
- $C_{av,x}(i)$, $C_{av,y}(i)$: average channel capacities in the x and y directions
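The symbols listed above are the ones that appear in the linear congestion bounding-box cost used by VPR, so the cost function on this slide presumably has the following form. This is a reconstruction under that assumption, not a formula recovered from the slide itself.

```latex
\mathrm{Cost} = \sum_{i=1}^{N_{nets}} q(i)
  \left[ \frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)} \right]
```

When every channel has the same capacity, the two denominators become one constant and the cost reduces to a $q(i)$-weighted half-perimeter wirelength.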
16Algorithm Overview
- Initial Placement
  - Generate an $n \times n$ array, $n = \sqrt{\text{number of CLBs}}$
  - Place each $O_i$ at the boundary of the array at random
  - Trace $cone(O_i)$ till all blocks $b_i$ are placed
- Trace Cone
  - Find one $b_i \in fanin(b_j)$ such that $b_i$ is not placed and $b_j$ is already placed
  - Place Block $b_i$
- Place Block
  - Find the smallest rectangle enclosing $bb_{fanin}(b_i) \cup bb_{fanout}(b_i)$
  - Find an optimal position (empty slot) within the bounding box
  - If there is no empty slot, extend the bounding box
  - Repeat the process until an empty slot is found
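The three steps above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: primary inputs, CLBs and primary outputs all share one grid, "optimal position" is taken to mean the free slot with the smallest total Manhattan distance to the already placed neighbours, and the `fanin` dictionary format and the function name `cone_placement` are assumptions.

```python
import math
import random
from collections import deque


def cone_placement(fanin, outputs, seed=0):
    """Place every block that lies in the cone of some primary output.

    fanin: dict block -> list of blocks driving it (hypothetical format).
    outputs: primary outputs, each the root of one cone.
    Returns dict block -> (x, y).
    """
    rng = random.Random(seed)
    blocks = set(fanin) | {u for ins in fanin.values() for u in ins} | set(outputs)
    fanout = {b: [] for b in blocks}
    for b, ins in fanin.items():
        for u in ins:
            fanout[u].append(b)

    n = max(2, math.ceil(math.sqrt(len(blocks))))
    while 4 * n - 4 < len(outputs):  # boundary must hold every output
        n += 1
    boundary = [(x, y) for x in range(n) for y in range(n)
                if x in (0, n - 1) or y in (0, n - 1)]
    pos = dict(zip(outputs, rng.sample(boundary, len(outputs))))
    used = set(pos.values())

    def place_block(b):
        # Bounding box of the already placed fan-in and fan-out blocks of b.
        pts = [pos[v] for v in fanin.get(b, []) + fanout[b] if v in pos]
        x0, x1 = min(p[0] for p in pts), max(p[0] for p in pts)
        y0, y1 = min(p[1] for p in pts), max(p[1] for p in pts)
        while True:
            free = [(x, y)
                    for x in range(max(x0, 0), min(x1, n - 1) + 1)
                    for y in range(max(y0, 0), min(y1, n - 1) + 1)
                    if (x, y) not in used]
            if free:
                dist = lambda c: sum(abs(c[0] - px) + abs(c[1] - py) for px, py in pts)
                best = min(free, key=dist)
                pos[b] = best
                used.add(best)
                return
            # No empty slot: extend the bounding box by one slot on each side.
            x0, x1, y0, y1 = x0 - 1, x1 + 1, y0 - 1, y1 + 1

    for out in outputs:  # trace each cone backwards from its output
        queue = deque([out])
        while queue:
            bj = queue.popleft()
            for bi in fanin.get(bj, []):
                if bi not in pos:
                    place_block(bi)
                    queue.append(bi)
    return pos


# Hypothetical netlist: inputs i1..i3, CLBs c1..c4, outputs f1 and f2.
fanin = {"f1": ["c3"], "f2": ["c4"], "c3": ["c1", "c2"], "c4": ["c2"],
         "c1": ["i1", "i2"], "c2": ["i2", "i3"]}
outputs = ["f1", "f2"]
pos = cone_placement(fanin, outputs)
```

Blocks that do not lie in any output cone are left unplaced by this sketch.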
17An Example
[Figure: a cone of the example netlist and the placement of that cone on the array, with the bounding box of net 9 marked.]
18A Running Example (contd.)
[Figure: the running example continued.]
19A Running Example (contd.)
[Figure: the running example continued with block 13. Fan-in and fan-out of 13: {0, 14, 15} and {6}.]
20Benchmark Circuit Details
Q (Quality) 100 . T (Time) 100 .
Algo VPR TCO
21Experimental Results Initial cost
22Remarks on Initial Temperature
- Existing Approach (VPR)
  - High initial temperature when a random placement is given
  - Initially all moves are accepted
  - $T_{init} = 20\sigma$, where $\sigma$ = Std. Dev. of cost over $N_{blocks}$ random moves, co-efficient derived empirically
  - range limit is set to maximum span of 2D array
- Our Approach: Low initial temperature
  - Need to generate low enough initial temp to match the better quality initial solution on the annealing curve
  - Not all moves are accepted
  - $T_{init} = 0.025\sigma$, where $\sigma$ = Std. Dev. of cost over $N_{blocks}$ random moves, co-efficient derived empirically by us
  - range limit is set to 1
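A minimal sketch of how the two initial temperatures could be computed, assuming a "random move" is a swap of two randomly chosen blocks and that the population standard deviation is meant. The function name `initial_temperature`, the toy cost function and the example data are all hypothetical.

```python
import random
import statistics


def initial_temperature(placement, cost_fn, coeff, seed=0):
    """Return coeff * std. dev. of the cost observed over Nblocks random swaps."""
    rng = random.Random(seed)
    blocks = list(placement)
    trial = dict(placement)
    costs = []
    for _ in range(len(blocks)):  # Nblocks random moves
        a, b = rng.sample(blocks, 2)
        trial[a], trial[b] = trial[b], trial[a]
        costs.append(cost_fn(trial))
    return coeff * statistics.pstdev(costs)


# Toy example: four blocks connected in a chain on a 2 x 2 array.
pairs = [("a", "b"), ("b", "c"), ("c", "d")]
placement = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}


def cost(p):
    """Sum of Manhattan distances between connected blocks."""
    return sum(abs(p[u][0] - p[v][0]) + abs(p[u][1] - p[v][1]) for u, v in pairs)


t_vpr = initial_temperature(placement, cost, coeff=20)     # random start
t_low = initial_temperature(placement, cost, coeff=0.025)  # good start
```

With the same seed both calls see the same sequence of costs, so the two temperatures differ only by the ratio of the coefficients, 20 / 0.025 = 800.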
23Experimental Results Accl LTSA vs. VPR
With the low initial temperature Tinit and the adaptive SA schedule of VPR, our initial placement converges very fast while maintaining quality.
24Experimental Results Accl LTSA vs. TCO
25Summary of Results for MCNC Benchmarks
26Accelerators for FPGA Placement
- Cone based initial placement with iterative refinement by simulated annealing
  - Cost metrics
  - Low temperature Simulated Annealing
- Initial placement by Space Filling Curves followed by Low Temp SA
27Part II: FPGA Placement by Space Filling Curves
28Motivation
- Observations from previous work
  - Stochastic methods like Simulated Annealing and Genetic Algorithms yield good quality solutions
  - SA based techniques take an enormous amount of time.
- Our motivation
  - Development of a much faster initial placement method to accelerate FPGA placement
  - Quality of the placement should be comparable to the SA based FPGA place and route tool VPR.
29Our Workflow
Technology mapped netlist
Find a linear order of netlist blocks
Linear order of blocks
Accelerator
Place the linearly ordered list using Space
Filling Curves (Snake, Hilbert or Z Curve)
Low temperature simulated annealing
Final Placement
30Step 1 Linear ordering of a netlist
- 2D placement problem mapped to 1D placement problem
- Requirement: the CLBs of the netlist are to be assigned to equally spaced slots on a line such that the total wirelength is minimized.
- Our Approach
  - Min-cut based partitions of at most two CLBs per partition.
  - Netlist hypergraph is bi-partitioned recursively to obtain a nearly linear order
  - A popular tool, hMetis, is used for hypergraph bi-partitioning.
Problem with current method: the linear order obtained here needs further refinement.
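The recursive bi-partitioning step can be sketched as below. hMetis is not called here: `bipartition` is a toy exhaustive min-cut splitter standing in for it, usable only on very small netlists, and both function names are illustrative assumptions.

```python
from itertools import combinations


def bipartition(blocks, nets):
    """Balanced split with the fewest cut nets (toy stand-in for hMetis)."""
    blocks = list(blocks)
    half = len(blocks) // 2
    best = None
    for left in combinations(blocks, half):
        left_set = set(left)
        right_set = set(blocks) - left_set
        # A net is cut when it has blocks on both sides of the split.
        cut = sum(1 for net in nets
                  if left_set & set(net) and right_set & set(net))
        if best is None or cut < best[0]:
            best = (cut, left_set)
    left_set = best[1]
    return ([b for b in blocks if b in left_set],
            [b for b in blocks if b not in left_set])


def linear_order(blocks, nets):
    """Recurse until each partition holds at most two CLBs, then concatenate."""
    if len(blocks) <= 2:
        return list(blocks)
    left, right = bipartition(blocks, nets)
    return linear_order(left, nets) + linear_order(right, nets)


# Hypothetical netlist: blocks 1-2 share a net, blocks 3-4 share a net.
order = linear_order([1, 3, 2, 4], [(1, 2), (3, 4)])
print(order)
```

The sketch ignores connections to blocks outside the current partition when it orders the two halves, which is one reason an order obtained this way is only nearly linear and needs further refinement.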
31Step 2 Space-filling Curves(SFC)
- Preliminaries
  - Provides a linear traversal or indexing of a multidimensional grid space
  - Commonly used to reduce a multi-dimensional problem to a one-dimensional problem.
- Our objective
  - Reverse mapping the linear order (1D) onto a 2D grid
  - Exploit the locality preserving property of SFC
32Step 2 Space-filling Curves (Contd..)
- A sequence of 2D SFCs of successive orders follows a recursive framework.
[Figure: Hilbert curve and Z curve.]
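A small sketch of the reverse mapping, listing the grid cells in curve order for the snake, Z and Hilbert curves on an n x n grid and assigning a linear order to them. The Hilbert and Z versions assume n is a power of two; the function names are illustrative, and the snake shown is the plain row-by-row variant.

```python
def snake(n):
    """Left-to-right on even rows, right-to-left on odd rows."""
    return [(x if y % 2 == 0 else n - 1 - x, y)
            for y in range(n) for x in range(n)]


def z_curve(n):
    """Z (Morton) order: de-interleave the index bits into x and y."""
    cells = []
    for d in range(n * n):
        x = y = 0
        for bit in range(n.bit_length()):
            x |= ((d >> (2 * bit)) & 1) << bit
            y |= ((d >> (2 * bit + 1)) & 1) << bit
        cells.append((x, y))
    return cells


def hilbert(n):
    """Hilbert order, built quadrant by quadrant from the index bits."""
    cells = []
    for d in range(n * n):
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:  # rotate or flip the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        cells.append((x, y))
    return cells


def place(order, curve):
    """Assign the k-th block of the linear order to the k-th cell of the curve."""
    return dict(zip(order, curve))


placement = place(range(1, 17), hilbert(4))  # 16 blocks on a 4 x 4 grid
```

Consecutive cells of the snake and Hilbert curves are always grid neighbours, while the Z curve makes longer jumps between some consecutive cells.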
33Our Method by An Example
Step 1: Generation of nearly linear ordered CLB netlist
Linear order: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Step 2: Placement of linear list using Space Filling Curve
[Figure: the linear order placed on the grid along the Hilbert, Z and Snake curves.]
34Step 3 Low Temp. SA
- With initial temperature Tinit and the adaptive SA schedule of VPR, our initial placement converges very fast, while maintaining quality.
- Our Approach: Low initial temperature
  - Need to generate low enough initial temp to match the better quality initial solution on the annealing curve
  - Not all moves are accepted
  - $T_{init} = \text{co-eff} \times \sigma$, where $\sigma$ = Std. Dev. of cost over $N_{blocks}$ random moves, co-efficient derived empirically by us
  - range limit is set to 1
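A minimal sketch of low temperature SA with range limit 1, under stated simplifications. A fixed geometric cooling factor `alpha` stands in for VPR's adaptive schedule, a move is a swap with the block in a directly neighbouring slot, and the function name, parameters and toy cost function are hypothetical.

```python
import math
import random


def low_temp_sa(placement, cost_fn, t_init, alpha=0.9, moves_per_temp=100,
                t_min=1e-3, seed=0):
    """Refine a placement by annealing from a low start temperature."""
    rng = random.Random(seed)
    current = dict(placement)
    at = {loc: b for b, loc in current.items()}  # slot -> block
    cost = cost_fn(current)
    best, best_cost = dict(current), cost
    blocks = list(current)
    t = t_init
    while t > t_min:
        for _ in range(moves_per_temp):
            a = rng.choice(blocks)
            ax, ay = current[a]
            dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            b = at.get((ax + dx, ay + dy))
            if b is None:
                continue  # range limit 1: only neighbouring occupied slots
            current[a], current[b] = current[b], current[a]
            new_cost = cost_fn(current)
            delta = new_cost - cost
            # Metropolis rule: always accept improvements, sometimes accept others.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                at[current[a]], at[current[b]] = a, b
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo
        t *= alpha
    return best, best_cost


# Toy example: four blocks connected in a chain on a 2 x 2 array.
pairs = [("a", "b"), ("b", "c"), ("c", "d")]
start = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}


def cost(p):
    """Sum of Manhattan distances between connected blocks."""
    return sum(abs(p[u][0] - p[v][0]) + abs(p[u][1] - p[v][1]) for u, v in pairs)


best, best_cost = low_temp_sa(start, cost, t_init=0.05)
```

Because the best placement seen so far is tracked separately, the returned cost is never worse than the cost of the starting placement.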
35Experimental Result Initial Cost
- K: height of snake curve
- Time to place ordered blocks using a space-filling curve is negligible
- This placement can be refined by low temp. SA to obtain near-optimal quality
36Experimental Result Final Cost after LTSA
37Experimental Result Speed up by our method
38Concluding Remarks
- Extending our placement methods for
  - advanced FPGA architectures (Xilinx Virtex) having preplaced blocks like RAM, microprocessors etc.
- Development of placement method for
  - 3D FPGA architecture
39Thank you