Title: Optimal Simultaneous Mapping and Clustering for FPGA Delay Optimization
1 Optimal Simultaneous Mapping and Clustering for FPGA Delay Optimization
- Joey Lin, Deming Chen, Jason Cong
2 Modern FPGA Architecture
- Hierarchical FPGA
- Interconnect delay domination
(Figure: hierarchical FPGA with logic clusters, from the Altera website)
3 Existing FPGA Optimization Flow
- RTL synthesis
- Tech independent decomposition
- Mapping
- Delay/depth optimal mapping
- Flow-map (Cong, TCAD94), network-flow based
- Minimize area while keeping delay optimal
- Praetor (Cong, FPGA99), DAO-map (Chen, ICCAD04), cut enumeration
- ABC (Mishchenko, IWLS05), AIG, cuts, area heuristics
- Clustering
- Delay
- Rajaraman, DAC93; Cong, DAC01: dynamic programming
- Vpack, T-Vpack
- Betz, Kluwer99: objective for connectivity, criticality
- R-pack
- Bozorgzadeh, ASPDAC01: objective for routability
- Placement and routing
4 Unit vs. General Delay Model
- Unit delay model
- Every gate has a constant interconnect delay
- Used in the traditional mapping step
- Problems
- Cannot reflect the different delays related to the cluster
- General delay model
- Murgai, ICCAD91
- Large inter-cluster delay
- Experimental value
- A few times the intra-cluster delay
- Optimization objective
- Minimum delay under the general delay model
- Parameters: Dnode, Dint-edge, Dext-edge
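The three parameters above can be made concrete with a small sketch. It assumes a path is given as a list of node names and that a hypothetical `cluster_of` dictionary records which cluster each LUT was packed into; the numeric values are illustrative, not taken from this slide.

```python
# Delay parameters of the general delay model (illustrative values).
D_NODE = 1      # delay of one LUT
D_INT_EDGE = 1  # edge between two LUTs in the same cluster
D_EXT_EDGE = 3  # edge that crosses a cluster boundary


def path_delay(path, cluster_of):
    """Delay of one path under the general delay model.

    path       -- list of node names, from a primary input to a sink
    cluster_of -- dict mapping each LUT to a cluster id; nodes missing
                  from it (primary inputs/outputs) count as external
    """
    delay = 0
    for src, dst in zip(path, path[1:]):
        same_cluster = (src in cluster_of and dst in cluster_of
                        and cluster_of[src] == cluster_of[dst])
        delay += D_INT_EDGE if same_cluster else D_EXT_EDGE
        if dst in cluster_of:          # only LUTs add node delay
            delay += D_NODE
    return delay


path = ["pi", "a", "b", "c", "po"]
split = path_delay(path, {"a": 0, "b": 0, "c": 1})    # c in its own cluster
packed = path_delay(path, {"a": 0, "b": 0, "c": 0})   # all LUTs together
print(split, packed)
```

Under a unit delay model both clusterings would look identical; here the second one is faster because one inter-cluster edge becomes an intra-cluster edge.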
5 Drawback of Separate Mapping and Clustering
- Setting
- 3-LUT FPGA (K = 3)
- Each cluster contains at most 3 LUTs (M = 3)
- Dnode = 1, Dint-edge = 1, Dext-edge = 3
- Traditional flow
- Best area/delay mapping followed by clustering
- Delay: 3 + 1 + 1 + 1 + 3 + 1 + 3 = 13
- Conclusion
- Optimal mapping + optimal clustering ≠ optimal mapping/clustering
6 Simultaneous Mapping and Clustering Flow - SMAC
- Optimal signal arrival time computation
- Processed in topological order
- For nodes at the primary inputs (PIs), the arrival time is 0
- For any node other than a PI, the arrival time is derived from its predecessors
- Clustering formation
- Processed in backward topological order
- Create the solutions that maintain the best signal arrival time
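The two traversal orders can be sketched as follows. The netlist, the delay values, and the helper names are hypothetical. The arrival-time rule shown is a deliberately pessimistic placeholder that treats every edge as inter-cluster; it only illustrates "PIs get 0, other nodes derive their time from predecessors" and is not the SMAC labeling itself.

```python
from graphlib import TopologicalSorter

D_NODE, D_EXT_EDGE = 1, 3   # illustrative values

# Hypothetical netlist: node -> list of fanins (empty list = primary input).
fanins = {
    "a": [], "b": [], "c": [],
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g3": ["g1", "g2"],
}


def processing_orders(fanins):
    """Return (forward, backward) topological orders of the netlist."""
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    forward = list(TopologicalSorter(fanins).static_order())
    return forward, forward[::-1]


def pessimistic_arrival(fanins, forward):
    """Arrival times when every edge is assumed to cross a cluster boundary."""
    arrival = {}
    for v in forward:
        if not fanins[v]:
            arrival[v] = 0                       # primary input
        else:
            worst = max(arrival[u] for u in fanins[v])
            arrival[v] = worst + D_EXT_EDGE + D_NODE
    return arrival


forward, backward = processing_orders(fanins)
arrival = pessimistic_arrival(fanins, forward)
print(forward, arrival["g3"])
```

The forward order serves the arrival time computation; the reversed list is the order in which the clustering formation phase would visit nodes, starting from the outputs.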
7 Optimal Arrival Time
- Intuition
- Regard a cluster as one big logic block
- Ignore the multiple-fanout capability of the cluster
- Find all cluster solutions of v
- Find all K-feasible cuts of v
- Choose the cluster solutions of the inputs of the K-feasible cut
- Add the solutions together and check whether the physical capacity constraint M is met
- Associate the delay with the solution
- Potential problem
- Too many clustering solutions
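The step "find all K-feasible cuts of v" is commonly done by merging the cuts of a node's fanins. The sketch below shows that standard cut-enumeration scheme on a hypothetical netlist; the slides do not say how SMAC enumerates cuts, so treat the function and its names as assumptions.

```python
from itertools import product


def enumerate_cuts(order, fanins, k):
    """Return cuts[v]: the set of K-feasible cuts of every node v.

    order  -- all nodes in topological order (fanins first)
    fanins -- node -> list of fanin nodes (empty for primary inputs)
    k      -- maximum number of nodes allowed in a cut
    """
    cuts = {}
    for v in order:
        trivial = frozenset([v])            # the cut consisting of v itself
        found = {trivial}
        if fanins[v]:
            # Pick one cut per fanin and merge them; keep merges of size <= k.
            for combo in product(*(cuts[u] for u in fanins[v])):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    found.add(merged)
        cuts[v] = found
    return cuts


# Hypothetical netlist with four primary inputs.
fanins = {"a": [], "b": [], "c": [], "d": [],
          "g1": ["a", "b"], "g2": ["c", "d"], "g3": ["g1", "g2"]}
order = ["a", "b", "c", "d", "g1", "g2", "g3"]
cuts = enumerate_cuts(order, fanins, k=3)
for cut in sorted(cuts["g3"], key=sorted):
    print(sorted(cut))
```

With K = 3 the cut {a, b, c, d} of g3 is rejected because it has four inputs, which is why g3 cannot be implemented as a single 3-LUT directly on the primary inputs.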
8 Dynamic Programming Approach
- Basic idea
- The physical constraint of the cluster relates only to area
- Area can be summed up from the inputs
- Resolve the runtime problem
- Do not keep all clustering solutions
- Keep the best-delay solution for each solution area
- Only keep a list of delay values
- area = a1 + a2 + a3 + 1; delay = worst delay from inputs
9 Illustration
- i represents area
- Arr(v,i) is the best delay for v when it is in a cluster of area i
- K = 3
- M = 4
- Dnode = 1
- Dint-edge = 2
- Dext-edge = 7
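A minimal sketch of how a table Arr(v,i) could be filled in, using the parameter values listed above. It is an interpretation of the slides, not the authors' implementation: each cut input either stays outside v's cluster (inter-cluster edge, contributes no area) or joins it with one of its own areas (intra-cluster edge), areas are simply summed, and edges from primary inputs are assumed to be inter-cluster. The three-LUT chain and all names are hypothetical.

```python
from itertools import product

# Values from the illustration slide; K only matters for cut enumeration.
D_NODE, D_INT_EDGE, D_EXT_EDGE = 1, 2, 7
M = 4                      # cluster capacity (LUTs per cluster)
INF = float("inf")


def label(order, cuts, primary_inputs):
    """Compute arr[v][i], the best arrival time of v in a cluster of area i.

    order          -- non-PI nodes in topological order
    cuts           -- cuts[v] is a list of cuts; each cut is a tuple of nodes
    primary_inputs -- set of PI names (arrival time 0, never inside a cluster)
    """
    arr = {}
    best = {pi: 0 for pi in primary_inputs}   # best arrival over all areas
    for v in order:
        table = [INF] * (M + 1)               # index = cluster area, 0 unused
        for cut in cuts[v]:
            per_input = []
            for u in cut:
                # Option 1: u sits outside v's cluster (inter-cluster edge).
                options = [(0, best[u] + D_EXT_EDGE)]
                # Option 2: u shares v's cluster and brings area a with it.
                if u not in primary_inputs:
                    options += [(a, arr[u][a] + D_INT_EDGE)
                                for a in range(1, M) if arr[u][a] < INF]
                per_input.append(options)
            for combo in product(*per_input):
                area = 1 + sum(a for a, _ in combo)      # a1 + a2 + ... + 1
                if area <= M:
                    delay = max(t for _, t in combo) + D_NODE
                    table[area] = min(table[area], delay)
        arr[v] = table
        best[v] = min(table)
    return arr, best


# Hypothetical three-LUT chain: pi -> x -> y -> z, one cut per node.
cuts = {"x": [("pi",)], "y": [("x",)], "z": [("y",)]}
arr, best = label(["x", "y", "z"], cuts, {"pi"})
print(arr["z"])   # arrival times of z for cluster areas 0..4
```

For the chain, z reaches its best arrival time only at area 3, when x, y and z share one cluster: 7 + 1 + 2 + 1 + 2 + 1 = 14. Only one delay value per area is stored, which is the "list of delay values" idea.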
10 Clustering Formation
- Processed in backward topological order
- Time and area constraints at the primary output
- Time indicates the expected signal arrival time
- Area indicates the cluster capacity left
- Find a solution and propagate the time and area constraints
- Reverse of the optimal arrival time process
- Solutions guaranteed
- By the dynamic programming process
(Largest arrival time, M)
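The backward pass can be sketched by recording, during labeling, which input areas produced each table entry and then walking those records from the outputs. This is a simplified reading of the slides: it restates a compact labeling pass so that it runs on its own, uses one cut per node (the node's fanins), allows duplication, and uses hypothetical names throughout. Parameter values follow the illustration slide.

```python
from itertools import product

D_NODE, D_INT_EDGE, D_EXT_EDGE, M = 1, 2, 7, 4
INF = float("inf")


def label(order, fanins):
    """arr[v][area] = best arrival; choice[v][area] = area taken from each fanin."""
    arr, choice = {}, {}
    for v in order:
        if not fanins[v]:                          # primary input
            arr[v], choice[v] = {0: 0}, {0: ()}
            continue
        arr[v], choice[v] = {}, {}
        options = []
        for u in fanins[v]:
            opts = [(0, min(arr[u].values()) + D_EXT_EDGE)]      # u outside
            opts += [(a, t + D_INT_EDGE) for a, t in arr[u].items() if a > 0]
            options.append(opts)
        for combo in product(*options):
            area = 1 + sum(a for a, _ in combo)
            delay = max(t for _, t in combo) + D_NODE
            if area <= M and delay < arr[v].get(area, INF):
                arr[v][area] = delay
                choice[v][area] = tuple(a for a, _ in combo)
    return arr, choice


def form_clusters(outputs, fanins, arr, choice):
    """Walk backward from the outputs and realize clusters from the labels."""
    clusters, done, roots = [], set(), list(outputs)
    while roots:
        root = roots.pop()
        if root in done or not fanins[root]:       # skip PIs and repeats
            continue
        done.add(root)
        area = min(arr[root], key=arr[root].get)   # area with best arrival
        members, stack = set(), [(root, area)]
        while stack:
            v, a = stack.pop()
            members.add(v)
            for u, a_u in zip(fanins[v], choice[v][a]):
                if a_u > 0:
                    stack.append((u, a_u))         # u joins this cluster
                else:
                    roots.append(u)                # u becomes a new cluster root
        clusters.append(sorted(members))
    return clusters


fanins = {"pi": [], "x": ["pi"], "y": ["x"], "z": ["y"]}
arr, choice = label(["pi", "x", "y", "z"], fanins)
clusters = form_clusters(["z"], fanins, arr, choice)
print(clusters)
```

Because every table entry was built from entries that exist at the fanins, the backward walk always finds a matching solution, which mirrors the "solutions guaranteed by the dynamic programming process" point.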
11 Complexity Analysis
- Complexity
- Runtime-consuming part: finding all the possible combinations
- O(M^K) for processing each cut (key point)
- M is the cluster capacity and K is the cut size
- Constant for fixed M and K
- Typically M = 10 and K = 4
- O(C·m·N) for the complexity of the whole algorithm
- For fixed M and K
- m is the number of K-feasible cuts of a node
- Typically less than 20 for K = 4
- N is the number of nodes
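To put numbers on these bounds, the sketch below reads the per-cut cost as M^K (each of the K cut inputs can contribute up to M area values) and multiplies it by the cuts per node and the node count. The circuit size of 1000 nodes is a hypothetical figure chosen only for the arithmetic.

```python
def combinations_per_cut(capacity, cut_size):
    """Upper bound on area combinations examined for one cut: M^K."""
    return capacity ** cut_size


def work_estimate(capacity, cut_size, cuts_per_node, num_nodes):
    """Upper bound on combinations examined over the whole netlist."""
    return combinations_per_cut(capacity, cut_size) * cuts_per_node * num_nodes


per_cut = combinations_per_cut(10, 4)       # typical M = 10, K = 4
total = work_estimate(10, 4, 20, 1000)      # m = 20 cuts, hypothetical N = 1000
print(per_cut, total)
```

The per-cut factor is large but fixed, so the total grows linearly with the number of nodes once M and K are chosen.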
12 Optimality Analysis
- Conservative during dynamic programming speed-up
- Combinational vs. sequential
- Assign inter-cluster delay to the edges out of sequential elements (FFs)
- The difference from the optimal value is less than Dext-edge - Dint-edge
13 Area Reduction Heuristics
- Area-aware mapping selection
- Area estimation
- Consider area cost in the mapping solution selection
- Online packing
- Pack the clusters during the cluster realization phase
- Predecessor packing
- Fanin-related packing
- Bin packing
- Post-processing to pack the clusters as much as possible
- Duplication control
- Preserve large-fanout nodes
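The slides do not say which bin-packing heuristic the post-processing step uses. As one plausible stand-in, the sketch below applies first-fit decreasing to merge partially filled clusters; the function name and the cluster sizes are hypothetical.

```python
def first_fit_decreasing(cluster_sizes, capacity):
    """Merge partially filled clusters into as few physical clusters as the
    first-fit decreasing heuristic finds.

    cluster_sizes -- number of LUTs used by each cluster after realization
    capacity      -- LUTs available per physical cluster (M)
    Returns a list of bins; each bin lists the sizes merged into it.
    """
    bins = []
    for size in sorted(cluster_sizes, reverse=True):
        if size > capacity:
            raise ValueError("cluster larger than capacity")
        for b in bins:
            if sum(b) + size <= capacity:   # first bin with enough room
                b.append(size)
                break
        else:
            bins.append([size])             # no bin fits: open a new one
    return bins


packed = first_fit_decreasing([3, 7, 2, 5, 3], capacity=10)
print(len(packed), packed)
```

Merging two clusters can only turn inter-cluster edges into intra-cluster ones, so a packing step of this kind reduces area without lengthening any path under the general delay model.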
14 Area Reduction Experiments
Average 1 1.04 1.19 1.64 1.70
15 Experiment Flow
- Reference: best academic FPGA flow
- Technology-independent synthesis: SIS (Berkeley)
- Decomposition: RASP (UCLA)
- Mapping: DAO-map (UCLA)
- Clustering: T-Vpack (U Toronto)
- Placement and routing: VPR (U Toronto)
- Standard benchmark set
- Toronto 20 large MCNC circuits
- FPGA architecture setting
- Cluster capacity: 10 (Altera Stratix)
(Figure: flow diagram with SIS, RASP, DAO-map, SMAC, T-Vpack, VPR)
16 Consistent Delay Reduction
- Reference flow result normalized to 1
- Estimated model (general delay model)
- Frequency improvement: 25%
- Area overhead: 23%
- Delay after placement and routing
- Frequency improvement: 12%
17 Potential Improvements
- Runtime improvement
- Status
- 100x slower than traditional mapping and clustering
- Still comparable to placement
- Runtime efficiency is an important issue for applications
- We have room for improvement
- Incremental update
18 Summary
- The first work that finds the optimal delay under the general delay model
- A dynamic programming algorithm
- Labeling phase
- Clustering realization
- Area reduction
- Experimental results
- 25% frequency improvement under the general delay model
- 23% area overhead and 12% delay reduction after placement and routing
- Potential improvements
19 Questions