Optimal Simultaneous Mapping and Clustering for FPGA Delay Optimization

1
Optimal Simultaneous Mapping and Clustering for
FPGA Delay Optimization
  • Joey Lin, Deming Chen, Jason Cong

2
Modern FPGA Architecture
  • Hierarchical FPGA
  • Interconnect delay domination

(Figure from Altera website, with one cluster labeled)
3
Existing FPGA Optimization Flow
  • RTL synthesis
  • Technology-independent decomposition
  • Mapping
  • Delay/depth optimal mapping
  • Flow-map [Cong TCAD94] (network-flow based)
  • Minimize area while keeping the delay optimal
  • Praetor [Cong FPGA99], DAO-map [Chen ICCAD04]
    (cut enumeration)
  • ABC [Mishchenko IWLS05] (AIG, cut, area heuristics)
  • Clustering
  • Delay
  • [Rajaraman DAC93], [Cong DAC01]: dynamic
    programming
  • Vpack, T-Vpack
  • [Betz Kluwer99]: objective for connectivity,
    criticality
  • R-pack
  • [Bozorgzadeh ASPDAC01]: objective for routability
  • Placement and routing

4
Unit vs. General Delay Model
  • Unit delay model
  • Every gate has a constant interconnect delay
  • Used in the traditional mapping step
  • Problems
  • Cannot reflect the difference between intra-cluster
    and inter-cluster delays
  • General delay model
  • [Murgai ICCAD91]
  • Large inter-cluster delay
  • Experimental value
  • A few times the intra-cluster delay
  • Optimization objective
  • Minimum delay under the general delay model
  • Parameters: Dnode, Dint-edge, Dext-edge

5
Drawback of Separate Mapping and Clustering
  • Setting
  • 3-LUT FPGA (K = 3)
  • Each cluster contains at most 3 LUTs (M = 3)
  • Dnode = 1, Dint-edge = 1, Dext-edge = 3
  • Traditional flow
  • Best area/delay mapping followed by clustering
  • 3 + 1 + 1 + 1 + 3 + 1 + 3 = 13
  • Best situation
  • 3 + 1 + 1 + 1 + 1 + 1 + 3 = 11
  • Conclusion
  • Optimal Mapping + Optimal Clustering
    ≠ Optimal Mapping/Clustering
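
The comparison on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration, not the authors' tool: it assumes a single path of three LUTs, assumes the edges from the primary input and to the primary output are inter-cluster edges, and uses a hypothetical helper name (chain_delay).

```python
# Delay parameters from the slide's setting.
D_NODE, D_INT, D_EXT = 1, 1, 3


def chain_delay(cluster_of):
    """Delay of a path: primary input -> chain of LUTs -> primary output.

    cluster_of[i] is the cluster id of the i-th LUT on the path.
    Assumption: the first and last edges of the path are inter-cluster.
    """
    delay = D_EXT  # primary input to first LUT
    for i, cluster in enumerate(cluster_of):
        delay += D_NODE
        if i + 1 < len(cluster_of):
            same = cluster_of[i + 1] == cluster
            delay += D_INT if same else D_EXT
    return delay + D_EXT  # last LUT to primary output


# Third LUT ends up in a different cluster than the first two.
separate = chain_delay(["A", "A", "B"])
# All three LUTs share one cluster (capacity M = 3 allows this).
together = chain_delay(["A", "A", "A"])
print(separate, together)
```

Under these assumptions, one extra inter-cluster edge on the path costs Dext-edge - Dint-edge = 2 units of delay.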

6
Simultaneous Mapping And Clustering Flow - SMAC
  • Optimal signal arrival time computation
  • Processed in topological order
  • For primary input (PI) nodes, the arrival
    time is 0
  • For every node other than a PI, the arrival time
    is derived from its predecessors
  • Clustering formation
  • Processed in backward topological order
  • Create the solutions that can maintain the best
    signal arrival time

7
Optimal Arrival Time
  • Intuition
  • Regard a cluster as one big logic block
  • Ignore the multiple-fanout capability of the
    cluster
  • Find all cluster solutions of v
  • Find all K-feasible cuts of v
  • Choose a cluster solution for each input of
    the K-feasible cut
  • Add the solutions together and check whether the
    physical capacity constraint M is met
  • Associate the delay with the solution
  • Potential problem
  • Too many clustering solutions
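
The "find all K-feasible cuts of v" step is commonly done bottom-up by merging the cuts of a node's fanins. The sketch below shows that standard technique only as an illustration; the slides do not say how SMAC enumerates cuts, and the function name and the dictionary-based netlist format are assumptions.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Return {node: set of cuts}; each cut is a frozenset of at most k leaves.

    fanins: dict mapping node -> list of fanin nodes ([] for primary inputs).
    order:  all nodes in topological order (fanins before fanouts).
    """
    cuts = {}
    for v in order:
        result = {frozenset([v])}  # trivial cut: the node itself
        if fanins[v]:
            # Pick one cut per fanin and merge the leaves.
            for combo in product(*(cuts[u] for u in fanins[v])):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    result.add(merged)
        cuts[v] = result
    return cuts


# Tiny hypothetical network: x = f(a, b), y = g(x, c).
fanins = {"a": [], "b": [], "c": [], "x": ["a", "b"], "y": ["x", "c"]}
order = ["a", "b", "c", "x", "y"]
cuts = enumerate_cuts(fanins, order, k=3)
print(sorted(sorted(c) for c in cuts["y"]))
```

The trivial cut {v} is kept only so that fanouts can merge it; a mapper would skip it when choosing a LUT for v itself.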

8
Dynamic Programming Approach
  • Basic idea
  • The physical constraint of the cluster relates
    only to area
  • Area can be summed up from the inputs
  • Resolve the runtime problem
  • Not keeping all clustering solutions
  • Keep only the best-delay solution for each
    solution area
  • Only keep a list of delay values

area = a1 + a2 + a3 + 1; delay = worst delay from inputs
9
Illustration
  • i represents area
  • Arr(v,i) is the best delay for v when it is in a
    cluster of area i
  • K = 3
  • M = 4
  • Dnode = 1
  • Dint-edge = 2
  • Dext-edge = 7
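
A minimal sketch of how a table Arr(v, i) could be filled in topological order, using the parameter values of this slide. It is an interpretation of the slides rather than the authors' implementation. Assumptions: the K-feasible cuts are given as input, primary inputs always connect through inter-cluster edges, and fanout sharing is ignored. The names label, cuts, and primary_inputs are hypothetical.

```python
from itertools import product

M = 4                             # cluster capacity from the slide
D_NODE, D_INT, D_EXT = 1, 2, 7    # delay parameters from the slide


def label(order, cuts, primary_inputs):
    """Return arr where arr[v][i] is the best arrival time at v
    when v sits in a cluster that already holds i LUTs (including v)."""
    arr = {}
    for v in order:
        if v in primary_inputs:
            arr[v] = {0: 0}  # arrival time 0, occupies no LUT
            continue
        best = {}
        for cut in cuts[v]:  # cuts are assumed to be K-feasible already
            options = []
            for u in cut:
                # Input in another cluster: uses no area in v's cluster.
                opts = [(0, min(arr[u].values()) + D_EXT)]
                if u not in primary_inputs:
                    # Input in the same cluster: its area is added to v's.
                    opts += [(a, d + D_INT) for a, d in arr[u].items()]
                options.append(opts)
            for combo in product(*options):
                area = 1 + sum(a for a, _ in combo)
                if area > M:
                    continue  # violates the capacity constraint
                delay = D_NODE + max(d for _, d in combo)
                if delay < best.get(area, float("inf")):
                    best[area] = delay  # keep one delay per area value
        arr[v] = best
    return arr


# Hypothetical network: x = LUT(a, b, c), y = LUT(x, d).
pis = {"a", "b", "c", "d"}
cuts = {"x": [("a", "b", "c")], "y": [("x", "d")]}
arr = label(["a", "b", "c", "d", "x", "y"], cuts, pis)
print(arr["x"], arr["y"])
```

In this example, placing x and y in the same cluster (area 2) gives y an arrival time of 11, while keeping y alone (area 1) gives 16.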

10
Clustering Formation
  • Processed in backward topological order
  • Time and area constraints at the primary output
  • Time indicates the signal arrival time expected
  • Area indicates the cluster capacity left
  • Find a solution and propagate the time and area
    constraint
  • Reverse of the optimal arrival time process
  • Solutions guaranteed
  • By the dynamic programming process

(Largest arrival time, M)
11
Complexity Analysis
  • Complexity
  • Runtime-consuming part: finding all the
    possible combinations
  • O(M^K) for processing each cut (key point)
  • M is the cluster capacity and K is the cut size
  • Is constant for fixed M and K
  • Typically M = 10 and K = 4
  • O(CmN) for the complexity of the whole algorithm
  • For fixed M and K
  • m is the number of K-feasible cuts per node
  • Typically less than 20 for K = 4
  • N is the number of nodes
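
To get a feel for the constants, the short calculation below counts the combinations per cut, assuming the per-cut bound is read as M^K (each of the K cut inputs picks one of M area values). The node count N is a hypothetical figure chosen only for illustration; m uses the upper estimate quoted on the slide.

```python
from itertools import product

M, K = 10, 4       # typical values quoted on the slide
m, N = 20, 1000    # m: cuts per node (slide's estimate); N: hypothetical

# Enumerate one area choice per cut input and count the combinations.
per_cut = sum(1 for _ in product(range(M), repeat=K))
total = per_cut * m * N
print(per_cut, total)
```

With M and K fixed, per_cut is the constant C, so the total grows linearly in m and N.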

12
Optimality Analysis
  • Conservative during dynamic programming speed up
  • Combinational vs. Sequential
  • Assign inter-cluster delay on the edges out of
    sequential element
  • The difference from the optimal value is less than
    Dext-edge - Dint-edge

13
Area Reduction Heuristics
  • Area aware mapping selection
  • Area estimation
  • Consider area cost into the mapping solution
    selection
  • Online packing
  • Pack the clusters during the cluster realization
    phase
  • Predecessor packing
  • Fanin related packing
  • Bin packing
  • Post-processing to pack the clusters as much as
    possible
  • Duplication Control
  • Preserve large fanout nodes

14
Area Reduction Experiments
Average 1 1.04 1.19 1.64 1.70
15
Experiment Flow
  • Reference: best academic FPGA flow
  • Technology-independent synthesis: SIS (Berkeley)
  • Decomposition: RASP (UCLA)
  • Mapping: DAO-map (UCLA)
  • Clustering: T-Vpack (U Toronto)
  • Placement and routing: VPR (U Toronto)
  • Standard benchmark set
  • Toronto 20 large MCNC circuits
  • FPGA architecture setting
  • Cluster capacity: 10 (Altera Stratix)

(Flow diagram: SIS, RASP, then DAO-map + T-Vpack or SMAC, then VPR)
16
Consistent Delay Reduction
Reference flow result normalized to 1
Estimated model (general delay model)
Frequency improvement: 25%
Area overhead: 23%
Delay after P&R
Frequency improvement: 12%
17
Potential Improvements
  • Runtime improvement
  • Status
  • 100x slower than traditional mapping and clustering
  • Still comparable to placement
  • Runtime efficiency is an important issue for
    practical use
  • There is room for improvement
  • Incremental update

18
Summary
  • The first work that finds the optimal delay under
    the general delay model
  • A dynamic programming algorithm
  • Labeling phase
  • Clustering realization
  • Area reduction
  • Experimental results
  • 25% frequency improvement under the general delay
    model
  • 23% area overhead, and 12% P&R delay reduction
  • Potential improvements

19
Questions