Title: Clustering of Large Designs for Channel-Width Constrained FPGAs
1 Clustering of Large Designs for Channel-Width Constrained FPGAs
Marvin Tom, Guy Lemieux
University of British Columbia
Department of Electrical and Computer Engineering
Vancouver, BC, Canada
2 Overview
- Introduction, Goals and Motivation
- Reduce channel width, lower cost, make circuits routable
- Reducing Channel Width by Depopulation
- Large Benchmark Circuits
- New Clustering Technique
- Selective Depopulation
- Conclusions and Future Work
3 Mesh-Based FPGA Architecture
- Channel width
- Number of routing tracks per channel
- Larger FPGA devices have more tiles
- Channel width is fixed
4 Motivation: Area of FPGA Devices
MCNC Circuits Mapped onto an FPGA
[Figure labels: Total Layout AREA, SIZE, Number]
5 Motivation: Channel Width Demand
MCNC Circuits Mapped onto an FPGA
Devices built for worst-case channel width (fixed width)
Interconnect cost dominates (> 70%)
6 Goal: Reduce Channel Width
But apex4, elliptic, frisc, ex1010, spla, and pdc are unroutable.
Can we make them routable in a constrained FPGA?
7 Possible Solution
- Trade off logic utilization for channel width
- User can always buy more logic (but not more wires)
- Trade off CLB count for channel width
[Figure: FPGA 1 vs. FPGA 2]
But can we achieve lower Total Area? (SIZE, CLB Count)
8 Logic Element: BLE and CLB
- Basic Logic Element (BLE)
- k-input LUT + FF
- Clustered Logic Block (CLB)
- N BLEs, N outputs
- I shared inputs
Note: I <= kN
[Figure: CLB containing BLE 1 to BLE 5, with I inputs and N outputs]
9 CLB Depopulation
- Normally CLBs are fully packed
- Reduces total number of CLBs needed for circuit
- CLB Depopulation (Tessier, DeHon)
- Do not use all BLEs → increase CLBs used → decrease channel width → decrease overall area
- Problem
- Increase in CLBs is high for large circuits
- Our work limits the CLB increase
[Figure: CLB containing BLE 1 to BLE 5, with I inputs and N outputs]
10 Uniform Depopulation
- Previous work
- Depopulate each CLB by equal amount
- But circuits have:
- regions of high routing demand
- regions of low routing demand
- Depopulating low-congestion areas causes an unnecessary increase in area
11 Non-Uniform Depopulation
- Our depopulation method
- Assume congestion is localized
- Depopulate only congested areas
- We show that non-uniform depopulation is:
- Effective method of channel width reduction
- Graceful tradeoff between channel width and area
- Makes unroutable circuits routable
12 Depopulation Methods to Reduce Channel Width
13 CLB Depopulation
- General Approach
- Use existing clustering tools
- Do not fill CLB while clustering
- Input-Limited
- E.g., maximum 67% input utilization per CLB
- Might use all BLEs
- BLE-Limited
- E.g., maximum 60% BLE utilization per CLB
- Might use all inputs
[Figure: CLB containing BLE 1 to BLE 5, with I inputs and N outputs]
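The two limits above can be illustrated with a small sketch. It assumes a toy netlist model and a greedy first-fit packer; the names `BLE`, `external_inputs`, and `pack` are hypothetical and this is not how T-VPACK or any existing clustering tool chooses BLEs. It only shows where a BLE limit or an input limit would stop a CLB from being filled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BLE:
    name: str
    inputs: frozenset  # nets feeding the LUT
    output: str        # net driven by this BLE

def external_inputs(bles):
    """Nets the cluster must receive from outside (not produced inside it)."""
    produced = {b.output for b in bles}
    used = set().union(*(b.inputs for b in bles))
    return used - produced

def pack(bles, n_arch, i_arch, ble_limit=None, input_limit=None):
    """Greedy first-fit packing in list order (illustrative assumption only).

    ble_limit / input_limit tighten the architectural limits n_arch / i_arch.
    A BLE that fits no existing cluster always opens a new one.
    """
    max_bles = n_arch if ble_limit is None else ble_limit
    max_inputs = i_arch if input_limit is None else input_limit
    clusters = []
    for ble in bles:
        for cluster in clusters:
            trial = cluster + [ble]
            if len(trial) <= max_bles and len(external_inputs(trial)) <= max_inputs:
                cluster.append(ble)
                break
        else:
            clusters.append([ble])
    return clusters

# Toy chain of 8 BLEs: BLE i reads primary input x_i and the previous output.
chain = [
    BLE(f"b{i}", frozenset({f"x{i}"} | ({f"n{i-1}"} if i else set())), f"n{i}")
    for i in range(8)
]
full = pack(chain, n_arch=4, i_arch=10)
ble_limited = pack(chain, n_arch=4, i_arch=10, ble_limit=2)
input_limited = pack(chain, n_arch=4, i_arch=10, input_limit=2)
print(len(full), len(ble_limited), len(input_limited))
```

In this toy case the BLE limit maps directly onto cluster occupancy, while the effect of the input limit depends on how nets happen to be shared.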
14 Reducing Channel Width: Results (max cluster size 16)
- Input-Limited
- No channel width control
- BLE-Limited
- (Almost) monotonically increasing → good channel width control
15 Benchmark Circuit Creation
- (We want BIG circuits!)
- (What do REALLY BIG circuits look like?)
16 Benchmarking Circuits: Some Observations
- Altera has bigger benchmarks than academics
- We noted similar characteristics
- Some LARGE circuits routable with NARROW routing channels
- Some SMALL circuits need WIDE routing channels
- What if each circuit is an IP block in a larger system?
17 Benchmark Creation: IP Blocks
- Mimic process of creating large designs
- IP Blocks <-> MCNC Circuits
- SoC <-> Randomly integrate/stitch together IP Blocks
- IP Blocks have varied interconnect needs
- Real-life large designs: System-on-Chip Methodology
- IP blocks (own, 3rd party)
- Re-use improves productivity
- Primarily integration and verification effort
18 Benchmark Creation: Large Designs
- Considered 3 stitching schemes
- Independent
- IP Blocks are not connected to each other
- Pipeline
- Outputs of one IP block connected to inputs of next IP block
- Clique
- Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
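A minimal sketch of the three stitching schemes, under stated assumptions: each IP block is reduced to a name and an output count, "uniformly distributed" is modeled as round-robin over the target blocks, and input-pin assignment is left out. The function name `stitch` and the edge format are hypothetical, not the authors' tool.

```python
def stitch(blocks, scheme):
    """Return inter-block connections as (src_block, output_index, dst_block).

    blocks: list of (name, n_outputs) tuples, in stitching order.
    scheme: "independent", "pipeline", or "clique".
    """
    names = [name for name, _ in blocks]
    edges = []
    for i, (name, n_out) in enumerate(blocks):
        if scheme == "independent":
            targets = []                          # blocks stay unconnected
        elif scheme == "pipeline":
            targets = names[i + 1:i + 2]          # next block only; last has none
        elif scheme == "clique":
            targets = names[:i] + names[i + 1:]   # every other block
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if not targets:
            continue
        for out_idx in range(n_out):
            # Round-robin spreads outputs evenly over the target blocks.
            edges.append((name, out_idx, targets[out_idx % len(targets)]))
    return edges

# Toy blocks with made-up output counts.
blocks = [("a", 4), ("b", 2), ("c", 3)]
for scheme in ("independent", "pipeline", "clique"):
    print(scheme, len(stitch(blocks, scheme)))
```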
19 MetaCircuit: Reducing Routed Channel Width?
- Observations
- IP blocks are tightly connected internally
- IP blocks have varied channel width needs
- Hypotheses
- Placement keeps each IP block together
- An IP block has large routed channel width → MetaCircuit has large routed channel width
20 Hypothesis Testing: MetaCircuit P&R Results
- Use VPR FPGA tools from University of Toronto
- Hypothesis 1
- VPR placer successfully groups IP blocks from random initial placement
- Hypothesis 2
- VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks: pdc, clma, ex1010
21 Consequences of Hypothesis 2
- Question
- Shrink channel width of a few IP blocks → shrink channel width of MetaCircuit?
- How to shrink channel widths?
- Selective CLB Depopulation!
- Depopulate hard-to-route IP blocks the most
- How much to depopulate?
- Channel width profiling of IP block
22 Meeting Channel Width Constraints: Selective Depopulation
- Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
- Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
23 IP Block Properties
- Cluster IP Blocks into N=16, k=6
- VPR determines minimum channel width for each IP Block
- Sort IP Blocks based on channel width
[Figure: IP blocks sorted by channel width, labeled Hard-to-Route Circuits and Easy-to-Route Circuits]
24 Channel Width Profiling of IP Block
- Cluster sizes
- NA = FPGA Architecture Cluster Size (fixed)
- NC = BLE-Limit Size (variable)
- Sweep NC for each IP block
25 Analysis with Constraint
- Given channel-width constraint of 60 tracks
- tseng is routable (easy)
- clma is routable for NC <= 10
- clma is not routable for NC > 10
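The analysis above amounts to a lookup on each IP block's profile: pick the largest BLE limit whose minimum routable channel width still meets the constraint. The sketch below assumes the profile is already available as a mapping from NC to channel width; the function name `max_ble_limit` is hypothetical and the profile numbers are made up for illustration (chosen only to agree with the clma and tseng example), not measured results.

```python
def max_ble_limit(profile, constraint):
    """Largest NC whose minimum routable channel width meets the constraint.

    profile: dict mapping NC (BLE limit) -> minimum channel width from a sweep.
    Returns None when no NC in the sweep is routable under the constraint.
    """
    feasible = [nc for nc, width in profile.items() if width <= constraint]
    return max(feasible) if feasible else None

# Made-up profiles for illustration only (not measured data).
profiles = {
    "tseng": {8: 30, 10: 33, 12: 35, 14: 38, 16: 40},
    "clma": {8: 52, 10: 60, 12: 66, 14: 71, 16: 77},
}
limits = {name: max_ble_limit(p, 60) for name, p in profiles.items()}
print(limits)  # {'tseng': 16, 'clma': 10}
```

Taking the maximum over all feasible NC values, rather than stopping at the first failure, tolerates profiles that are only almost monotonic.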
26 Our Technique: Selective Depopulation
- Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
- Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
27 Uniform Depopulation
- Minimum NC Cluster Size
- Depopulate all clusters equally
- E.g., use NC=10 for both IP Blocks
28 Non-Uniform Depopulation
- Maximal NC Cluster Size
- Depopulate each IP block according to its maximal cluster size
- E.g., clma NC=10, tseng NC=16
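To see why the choice matters for area, here is a rough sketch comparing CLB counts under the two policies. It assumes the CLB count of a block is simply ceil(BLEs / NC), which is a lower bound because input limits can prevent full packing. The BLE counts are hypothetical; only the NC values (clma 10, tseng 16) come from the example above.

```python
from math import ceil

def clb_count(ble_counts, ble_limits):
    """Lower-bound CLB count: each block packed to exactly its BLE limit."""
    return sum(ceil(ble_counts[b] / ble_limits[b]) for b in ble_counts)

ble_counts = {"clma": 8000, "tseng": 1000}  # hypothetical BLE counts
max_nc = {"clma": 10, "tseng": 16}          # largest routable NC per block

# Uniform: every block uses the smallest of the per-block limits.
uniform_nc = min(max_nc.values())
uniform = clb_count(ble_counts, {b: uniform_nc for b in ble_counts})

# Non-uniform: each block keeps its own largest routable limit.
non_uniform = clb_count(ble_counts, max_nc)

print(uniform, non_uniform)  # 900 863
```

The gap comes entirely from the easy block (tseng), which the uniform policy depopulates without need.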
29 Uniform vs. Non-Uniform
- Non-Uniform depopulation better than Uniform
- Lower CLB count
- Higher LUT utilization
[Figure: LUT utilization and total CLBs needed (x 1,000) vs. channel width constraint, Uniform vs. Non-Uniform]
30 MetaCircuit Clustering Results
- Depopulate the most-congested IP blocks
- BLE-limit of each IP block shown (max = 16)
- Some IP blocks are depopulated more than others
31 MetaCircuit P&R Results
- Clique MetaCircuit
- P&R channel width results closely match constraints
[Figure: channel width (Constraint vs. Routed) and normalized area, each plotted against channel width constraint]
- Shrink channel width by 20% (from 95 to 75): NO AREA INCREASE
- Shrink channel width by 50% (from 95 to 50): 1.7x area increase
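As a quick check, reading the two reductions relative to the original 95 tracks shows that the quoted figures are rounded:

```latex
\[
\frac{95 - 75}{95} \approx 21\%, \qquad \frac{95 - 50}{95} \approx 47\%
\]
```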
32 Other MetaCircuit Results
These latest results are better than those given in the paper
33 Critical Path Delay and Average Wirelength
- Expect critical path delay to increase under tighter constraints
- Delay noise due to instability of floorplan locations
- Average wirelength per net increases under tighter constraints
34 Conclusion
- System-level technique to map large System-on-Chip (SoC) designs to channel-width constrained FPGAs using fewer routing resources
- Depopulating CLBs is effective at reducing channel width
- Non-uniform depopulation is important to limit area inflation
- Channel width reduced
- by 0-20% with < 5% area increase
- by up to 50% with 3.3x area increase
- Effective solution to trade off CLBs for interconnect!
- UNROUTABLE circuits (channel width TOO LARGE) can be made ROUTABLE (reduced channel width) by buying an FPGA with MORE LOGIC!
35 End of Talk
36 Future Work
- Real-Life SoC Benchmark
- Licensed IP: Bluetooth baseband processor
- 325,000 ASIC gates
- Numerous IP blocks of varying complexity
- Needed to validate synthetic results
- Automated technique to find hard IP blocks
- Granularity is based on design hierarchy (?)
- Replaces time-consuming Step 1 of process
37 Motivation: Reduce Cost
- Observations
- Interconnect dominates: > 70% of layout area
- Fixed interconnect architecture
- Designed for near-worst-case demand
- Same interconnect architecture across entire family
- E.g., Altera Cyclone: 80 tracks per channel for all devices
- Choice for logic capacity (device selection)
- No choice for interconnect capacity
- Result
- Overcapacity in interconnect
- Interconnect dominates cost
- User has no way to reduce dominant cost
38 Fixed Channel-Width Constraints
- Real FPGA device: fixed channel width
- Some hard-to-route circuits (routing intensive) won't fit
- Problem
- Find way to make circuit fit
- Our Approach
- Divide circuit into large-sized chunks, e.g., IP Blocks
- Make hard-to-route IP Blocks easy-to-route by CLB depopulation
- This increases CLB usage
- Leave easy ones alone to limit CLB increase
39 Overview of Clustering Approach
- Two methods for choosing NC
- Uniform Depopulation: use fixed NC <= NA
- Non-Uniform Depopulation: use best NC <= NA
- As expected, Non-Uniform gives better results
- Cluster each IP block separately
- Compare with 2 clustering tools
- T-VPACK vs. iRAC replica
- Channel Width Prediction
- Largest channel width of IP blocks <= channel width of MetaCircuit