Title: NoC Physical Implementation
1. NoC Physical Implementation
- Federico Angiolini
- federico.angiolini_at_unibo.it
- DEIS Università di Bologna
2. Physical Implementation and NoCs
- NoCs and physical implementation flows are closely related topics
- On the one hand, NoCs are designed to alleviate back-end issues (structured wiring)
- On the other hand, back-end properties critically affect NoC behaviour and effectiveness
3. ASIC Synthesis Flow
4. A Typical ASIC Design Flow
Design Space Exploration → RTL Coding → Logic Synthesis → Placement → Routing
- Ideally, a one-shot linear flow
- In practice, iterations are needed to fix issues:
  - Validation failures
  - Bad quality of results
  - No timing closure
5. Basics of a Back-End Flow
- RTL code: circuit description
- Analysis → GTech: connected network of logic blocks
- Logic Synthesis (driven by Tech Libs) → Netlist: connected network of gates
- Placement → Placed Netlist: placed network of gates
- Routing → Layout: placed and routed network of gates
- Major vendors: Synopsys, Mentor, Magma, Cadence
6. Notes on Tech Libraries
- Encapsulate foundry capabilities
- Typical content: Boolean gates, flip-flops, simple gates
- But in lots of variations: fan-in, driving strength, speed, power...
- Describe function, delay, area, power, physical shape...
- Often many libraries per process: high-perf/low-power, best/worst case, varying VDD, varying VT
7. Analysis of the Hardware Description
- Normally a very swift step
- Input: Verilog/VHDL description
- Output: circuit description in terms of adders, muxes, registers, Boolean gates, etc. (GTech: Generic Technology)
- Output is not optimized by any metric
- Just translates the specification into an abstract circuit
8. Logic Synthesis
- Takes minutes to hours
- Input: GTech description
- Output: circuit description in terms of HSFFX4, LPNOR2X2, LLINVX32, etc. (i.e. specific gates of a specific tech library)
- Output is...
  - Complying with timing specs (e.g. at 500 MHz)
  - Optimized for area and power
9. ...How Does This Work?
- Based on the GTech description, paths are identified:
  - register-to-register
  - input-to-register
  - register-to-output
  - input-to-output
- Along each path, GTech blocks are replaced with actually available gates from a technology library
- The outcome is called a netlist
- Delay is analyzed first, and some paths are detected as critical (see the sketch below)
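To make the path analysis concrete, here is a minimal sketch in Python: a netlist modeled as a DAG, with the longest-path arrival time computed at each endpoint and compared against the clock constraint. All gate names, delays and the constraint are hypothetical, not taken from a real library.

```python
# Hypothetical mini-netlist: gate -> (delay_ns, fan-in gates).
from functools import lru_cache

netlist = {
    "in_a":  (0.00, []),          # primary input
    "in_b":  (0.00, []),
    "nand1": (0.12, ["in_a", "in_b"]),
    "xor1":  (0.18, ["nand1", "in_b"]),
    "reg_d": (0.00, ["xor1"]),    # register data pin: path endpoint
}

@lru_cache(maxsize=None)
def arrival(gate):
    """Longest-path arrival time (ns) at a gate's output."""
    delay, fanin = netlist[gate]
    return delay + max((arrival(g) for g in fanin), default=0.0)

CLOCK_PERIOD_NS = 0.25  # assumed timing constraint
slack = CLOCK_PERIOD_NS - arrival("reg_d")
print(f"arrival={arrival('reg_d'):.2f} ns, slack={slack:+.2f} ns")
if slack < 0:
    print("path is critical: must be optimized")
```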
10. Example: Critical Paths
(Figure from "Adventures in ASIC Digital Design")
- Based on the chosen library gates and the netlist, path 1 → 6 is the longest and violates constraints
11. Netlist Optimization
- The synthesis process optimizes critical paths until timing constraints are met (a toy sizing loop is sketched below), e.g.:
  - Use faster gates instead of lower-power ones
  - Play with driving strength (as in buffering)
  - Refactor combinational logic to minimize the number of gates to be traversed
- Once timing is met, non-critical paths are analyzed
  - Optimized for area and power, even if slower
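As a hedged illustration of the sizing part of this loop (not the actual algorithm of any commercial tool), the sketch below upsizes the slowest cells on a violating path until the constraint is met. The cell variants and their delay/leakage numbers are invented.

```python
# Invented drive-strength variants: (name, delay_ns, leakage_uW),
# from weakest to strongest drive.
VARIANTS = [("NAND2X1", 0.15, 2.0), ("NAND2X4", 0.09, 5.0), ("NAND2X8", 0.06, 9.0)]

def close_timing(path, period_ns):
    """path: one variant index per cell on the critical path."""
    while sum(VARIANTS[i][1] for i in path) > period_ns:
        upsizable = [k for k, i in enumerate(path) if i < len(VARIANTS) - 1]
        if not upsizable:
            raise RuntimeError("sizing alone fails: refactor the logic instead")
        # upsize the currently slowest upsizable cell first
        k = max(upsizable, key=lambda k: VARIANTS[path[k]][1])
        path[k] += 1
    return path

path = close_timing([0, 0, 0], period_ns=0.30)
print([VARIANTS[i][0] for i in path])   # ['NAND2X4', 'NAND2X4', 'NAND2X4']
```

A real synthesizer would then walk the non-critical paths in the opposite direction, swapping in smaller, lower-leakage variants while slack stays positive.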
12. Placement
- Step 1: Floorplanning
  - Place macro-blocks onto a rectangle (≈ the chip)
  - e.g. processors, memories...
- Step 2: Detailed placement
  - Align the single gates of macro-blocks into rows
  - Typically aiming at 85% row utilization
13. Example: xpipes Placement Approach
- Floorplan: a mix of
  - hard macros for IP cores
  - soft macros for NoC blocks
14. Routing
- Step 1: Clock tree insertion
  - Bring the clock to all flip-flops
- Step 2: Power network insertion
  - Bring VDD, GND nets across the chip
  - Typically over the top metal layers
  - Either as a ring (small designs) or a grid (bigger designs)
- Step 3: Logic routing
  - Actually connect gates to each other
  - Typically over the bottom metal layers
15. Example: Binary Clock Tree
(Figure courtesy of Shobha Vasudevan)
- Issue: minimizing skew (a toy model follows below)
  - Critical at high frequencies
  - Consumes a large amount of power
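To see why skew is hard to control, consider a toy model (all delay values invented): insertion delay accumulates one buffer-plus-wire delay per tree level, and small per-stage mismatches make the leaves diverge.

```python
# Toy balanced binary clock tree: each level adds a buffer + wire delay with
# random mismatch; skew is the spread of root-to-leaf insertion delays.
# All numbers (ns) are invented for illustration.
import random

def leaf_delays(levels, buf=0.05, wire=0.02, mismatch=0.005):
    delays = [0.0]
    for _ in range(levels):
        # each node forks into two children, each with its own mismatch draw
        delays = [d + buf + wire + random.uniform(-mismatch, mismatch)
                  for d in delays for _ in (0, 1)]
    return delays

leaves = leaf_delays(levels=6)                 # 2**6 = 64 flip-flops
skew_ps = (max(leaves) - min(leaves)) * 1000
print(f"{len(leaves)} leaves, skew = {skew_ps:.1f} ps")
```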
16. Issues with the Traditional Flow
- Major problem with the traditional flow...
- ...wiring is not considered during synthesis!!!
- Outdated assumption: wiring delay is negligible
- Partial fix: wireload models (sketched below)
  - Consider the fan-out of gates
  - If small, assume short wiring at the outputs, and a bit of extra delay
  - If large, assume long wiring at the outputs, and noticeable extra delay
- Still grossly inaccurate
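A wireload model is essentially a fan-out-indexed lookup table. The sketch below shows the idea; the breakpoints, guessed lengths and RC constants are invented, not taken from any real library.

```python
# Wireload-model sketch: guess wire length from fan-out, then estimate the
# extra delay with a lumped 0.5*R*C (Elmore) approximation.
import bisect

FANOUT_BREAKS = [1, 2, 4, 8, 16]         # fan-out classes
EST_LEN_UM    = [15, 30, 70, 160, 400]   # guessed wire length per class
R_PER_UM = 0.8       # ohm/um, hypothetical
C_PER_UM = 0.2e-15   # F/um, hypothetical

def wireload_delay_s(fanout):
    i = min(bisect.bisect_left(FANOUT_BREAKS, fanout), len(EST_LEN_UM) - 1)
    r, c = R_PER_UM * EST_LEN_UM[i], C_PER_UM * EST_LEN_UM[i]
    return 0.5 * r * c   # lumped Elmore approximation

for fo in (1, 4, 20):
    print(f"fan-out {fo:2d}: +{wireload_delay_s(fo) * 1e12:.2f} ps")
```

The table knows nothing about where the gates will actually be placed, which is exactly why the estimate can be grossly wrong.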
17. Physical Synthesis
- Currently envisioned solution: physical synthesis
  - Merge placement with logic synthesis
  - Initial, quick logic synthesis
  - Coarse-grained placement
  - Incremental synthesis and placement until convergence
- Drastically better results (more predictable)
- Still may not suffice... also integrate the routing step??
RTL → quick logic synthesis → initial netlist → quick placement → initial placed netlist → incremental synthesis and placement → final placed netlist
18. Advanced Back-End Flow
- RTL code: circuit description
- Analysis → GTech: connected network of logic blocks
- Physical Synthesis (driven by Tech Libs) → Placed Netlist: placed network of gates
- Routing → Layout: placed and routed network of gates
- Major vendors: Synopsys, Mentor, Magma, Cadence
19. Some Observations on the Physical Implementation of NoCs
20. Study 1: Cross-Benchmarking NoCs vs. Traditional Interconnects
- Study the performance, area and power of a NoC implementation as opposed to traditional bus interconnects:
  - Plain shared bus
  - Hierarchical bus
- 130 nm technology
- Note: based on an old, unoptimized version of the NoC architecture
21. AMBA AHB Shared Bus
- Baseline architecture (AMBA AHB)
- Ten ARM cores, five traffic generators, fifteen slaves (fully populated bus)
- ARM cores running a pipelined multimedia benchmark
- Traffic generators:
  - Streaming traffic towards a memory (DSP-like)
  - Periodically querying some slaves (IOCtrl-like)
22. AMBA AHB Multilayer
(Figure: 5x5 Multilayer configuration; masters M0-M9 and traffic generators T0-T4 spread over AHB Layers 0-4, connected through the AMBA AHB crossbar to private slaves P0-P9 and shared slaves S10-S14)
- Dramatically improves performance
- Intra-cluster traffic to private slaves (P0-P9) is bound within each layer, reducing congestion
- Shared slaves (S10-S14) can be accessed in parallel
- Representative 5x5 Multilayer configuration (up to 8x8 allowed)
23. xpipes (Quasi-)Mesh
(Figure: quasi-mesh layout in 130 nm with 1 mm² tiles; masters M0-M9, traffic generators T0-T4, private slaves P0-P9 and shared slaves S10-S14 attached to the mesh switches)
- Excellent bandwidth
- Balanced architecture, no max frequency bottlenecks
- Very regular topology: easy to floorplan
- Overhead in area/power due to many links and buffers
24. NoCs vs. Traditional Interconnects - Performance
- Time to complete the functional benchmark
- Shared buses totally collapse
- NoCs are 10-15% faster than hierarchical buses
Observation 1: NoCs are much more scalable and can provide better performance under severe load.
25. NoCs vs. Traditional Interconnects - Summary
Observation 2: NoCs are dramatically more predictable than traditional interconnects.
Observation 3: NoCs are better in performance and physical design, but be careful about area and power!
26. Bandwidth or Latency?
- NoC bandwidth is much higher (44 links at 1 GHz)
- But this is only an indirect clue of performance
- The NoC latency penalty/gain depends on the transaction type (compare the toy model below):
  - Penalty on short reads
  - Gain on posted writes
Observation 4: Latency matters more than raw bandwidth. NoCs have to be careful about some transaction types.
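A back-of-the-envelope model of Observation 4 (the cycle counts are assumptions, not measured data from the study):

```python
# Toy latency model: a packet pays the network interfaces plus a few cycles
# per switch hop. A short read crosses the NoC twice (request + response);
# a posted write crosses it once and the initiator does not wait.
def one_way_cycles(hops, per_hop=3, ni=4):
    return 2 * ni + hops * per_hop   # source NI + switches + destination NI

HOPS = 3
print(f"short read  : {2 * one_way_cycles(HOPS)} cycles (round trip)")
print(f"posted write: {one_way_cycles(HOPS)} cycles, off the critical path")
```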
27. Area, Power Budget Analysis
(Figure: a. area and b. power breakdowns of the 38-bit quasi-mesh)
Observation 5: Clock trees are negligible in area, but eat up almost half of the power budget.
28. Study 2: Implementation of NoCs in 90 and 65 nm
- Study the behaviour of NoCs as they are implemented in cutting-edge technologies
- Observe the behaviour of tech libraries, tools, architecture and links as they are scaled from one technology node to the next
29. Link Design Constraints
(Figure: power needed to drive a 38-bit (plus flow control) unidirectional link, in a 65 nm lowest-power vs. a 65 nm power/performance library)
Observation 6: Long links (unless custom designed) become either infeasible or too power-hungry. Keep them segmented.
30. Link Repeaters/Relay Stations
- Wire segmentation by topology design
  - Put more switches, closer together
  - Adds a lot of overhead
- Wire segmentation by repeater insertion
  - Flops/relay stations to break links (see the arithmetic sketch below)
  - Details are closely tied to flow control
Observation 7: Architectural provisions may be needed to tackle physical-level issues. These may impact performance, so they should be accounted for in advance.
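As a rough illustration of the repeater-insertion arithmetic (the 600 ps/mm wire delay and the 1 GHz clock are assumed values, not figures from the study):

```python
# Sketch: how many relay stations does a link need so that each wire segment
# fits in one clock cycle?
import math

def relay_stations(length_mm, f_hz, delay_per_mm_s):
    period = 1.0 / f_hz
    segments = math.ceil(length_mm * delay_per_mm_s / period)
    return max(segments - 1, 0)   # flops inserted between segments

for length in (0.8, 2.5, 5.0):
    n = relay_stations(length, f_hz=1e9, delay_per_mm_s=600e-12)
    print(f"{length:.1f} mm link @ 1 GHz: {n} relay station(s)")
```

Each inserted flop costs a cycle of latency, which is why the flow control must account for it up front.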
31. Wireload Models and 65 nm
- Wireload models used to guesstimate propagation delay during logic synthesis are inaccurate
- As seen for 130 nm: 6% to 23% off from the actually achievable post-placement timing
- In 65 nm, the problem is dramatically worse
  - No timing closure after placement (-50% frequency, huge runtimes...)
- Traditional logic synthesis tools (e.g. Synopsys Design Compiler) are insufficient
- Physical synthesis, however, works great
Observation 8: Physical synthesis is compulsory for next-generation nodes.
32. Placement in Soft Macros
- In our experiments, placement/routing is extremely sensitive to soft macro area
  - Fences too tight: the flow fails
  - Fences too wide: the tool produces bad results
- Solution: accurate component area models (a toy model is sketched below)
  - Involves work, since area depends on architectural parameters (cardinality, buffering...)
Observation 9: Thorough characterization of the components may be key to the convergence of the flow for a whole topology.
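The component area models meant here can be as simple as a fitted analytical formula. The sketch below shows the shape of such a model; the coefficients are placeholders, not the actual xpipes characterization.

```python
# Hypothetical analytical area model for a NoC switch, of the kind fitted
# ("back-annotated") from post-layout data. All coefficients are placeholders.
def switch_area_um2(n_in, n_out, flit_bits, buf_depth,
                    k_xbar=0.9, k_buf=6.5, k_fixed=2500.0):
    crossbar = k_xbar * n_in * n_out * flit_bits       # mux/wiring dominated
    buffers  = k_buf * n_out * buf_depth * flit_bits   # flip-flop dominated
    return k_fixed + crossbar + buffers

area = switch_area_um2(n_in=6, n_out=6, flit_bits=38, buf_depth=6)
fence = 1.15 * area   # ~15% margin: tight enough to guide, loose enough to route
print(f"estimated area {area:.0f} um2, soft-macro fence {fence:.0f} um2")
```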
33. 65 nm Degrees of Freedom
- LP and HP libraries differ in gate design, VT, VDD...
Observation 10: There is no such thing as "a 65 nm library". Power/performance degrees of freedom span one order of magnitude. It is the designer's (or the tools') responsibility to pick the right library.
34. Technology Scaling within Modules
(Figure: 6x6 switch, 38 bits, 6 buffers)
- Within modules, scaling looks great:
  - +25% frequency
  - -46% area
  - -52% power
35. Technology Scaling on Topologies
- Three designs, each tuned for max frequency:
  - 90 nm, 1 mm² cores
  - 65 nm, 1 mm² cores
  - 65 nm, 0.4 mm² cores
36. Mesh Scaling
- Links:
  - Always short (<1.2 mm) → non-pipelined
- However:
  - 90 nm, 1 mm² cores: 3.1 mW
  - 65 nm, 1 mm² cores: 3.6 mW (tightest fit → more buffering)
  - 65 nm, 0.4 mm² cores: 2.2 mW
- Power is shifting from switches/NIs to links (buffering)
37. High-Radix Switch Feasibility
- High-radix switches become too slow
- 10x10 is the maximum realistic size
- For sizes 26x26 and 30x30, place & route (P&R) is infeasible!
38. Clock Skew in High-Radix Switches
- A single switch is still a small entity
- Skew can be confined to <10%, typically <5%
39. A Complete NoC Synthesis Flow
40. Design of a NoC-Based System
- Software Services: mapping, QoS, middleware...
- Architecture: packeting, buffering, flow control...
- Physical Implementation: synchronization, wires, power...
- CAD Tools
- All these items are key opportunities and challenges
- Strict interaction/feedback is mandatory!
- CAD tools must guide designers to the best results
41. The Design Tool Dilemma
- Automatically find a topology and architectural parameters so that:
  - Design constraints are satisfied
  - Area, power and latency are minimized
- "A hypercube? A torus? Or do I want a custom topology?"
42. Custom Topology Mapping
- Objectives:
  - Design fully application-specific custom topologies
  - Generate deadlock-free networks
  - Optimize the architectural parameters of the NoC (frequency, flit size), tuning them upon application requirements
- Physical design awareness:
  - Leverage accurate analytical models for area and power, back-annotated from layouts
  - Integrated floorplanner to achieve design closure while also considering wiring complexity
43. The xpipes NoC Design Flow
(Flow diagram, roughly:)
- Inputs: system specs, application traffic task graph, IP core models, NoC component library, NoC area/power models, user objectives (power, hop delay), constraints (area, power, hop delay, wire length)
- SunFloor topology synthesis (includes a floorplanner) → floorplanning specifications
- Platform generation via the xpipesCompiler → SystemC code
- Validation: RTL/architectural simulation and FPGA emulation
- Back end: synthesis, placement and routing → to fab
- Area and power characterization feeds back into the NoC area/power models
44. Example: Task Graph
- Captures communication among the system cores (a minimal encoding is sketched below):
  - Source/destination pairs
  - Required bandwidth
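For illustration, such a task graph can be encoded as little more than a bandwidth-annotated edge list; the core names, bandwidth figures and routes below are made up.

```python
# Toy communication task graph: directed (source, destination) pairs
# annotated with required bandwidth in MB/s. All values are invented.
task_graph = {
    ("cpu0", "mem0"): 400,
    ("cpu0", "dsp0"): 120,
    ("dsp0", "mem1"): 250,
    ("io0",  "mem0"):  60,
}

def overloaded_links(routes, graph, link_capacity):
    """routes: flow -> list of NoC links it traverses. Returns saturated links."""
    load = {}
    for flow, bw in graph.items():
        for link in routes[flow]:
            load[link] = load.get(link, 0) + bw
    return {l: bw for l, bw in load.items() if bw > link_capacity}

# Hypothetical routing of each flow over named switch-to-switch links:
routes = {
    ("cpu0", "mem0"): ["sw0-sw1"],
    ("cpu0", "dsp0"): ["sw0-sw2"],
    ("dsp0", "mem1"): ["sw2-sw3"],
    ("io0",  "mem0"): ["sw4-sw0", "sw0-sw1"],
}
print(overloaded_links(routes, task_graph, link_capacity=400))
# -> {'sw0-sw1': 460}: this link would need a different route or a wider/faster NoC
```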
45. Measuring xpipes Performance
(Figure: measured performance for the given topology specs)
46. Example: Layout (65 nm design)
- The floorplan is automatically generated
- Black areas: IP cores
- Colored areas: NoC
- Over-the-cell routing is allowed in this example