Title: NoC Physical Implementation
1. NoC Physical Implementation
- Federico Angiolini
- federico.angiolini_at_unibo.it
- DEIS Università di Bologna
2. Physical Implementation and NoCs
- NoCs and physical implementation flows are closely related topics
- On the one hand, NoCs are designed to alleviate back-end issues (structured wiring)
- On the other hand, back-end properties critically affect NoC behaviour and effectiveness
3. ASIC Synthesis Flow
4. A Typical ASIC Design Flow
Design Space Exploration → RTL Coding → Logic Synthesis → Placement → Routing
- Ideally, a one-shot linear flow
- In practice, iterations are needed to fix issues:
  - Validation failures
  - Bad quality of results
  - No timing closure
5. Basics of a Back-End Flow
- RTL code: circuit description
- Analysis → GTech: connected network of logic blocks
- Logic Synthesis (driven by Tech Libs) → Netlist: connected network of gates
- Placement → Placed Netlist: placed network of gates
- Routing → Layout: placed and routed network of gates
- Major vendors: Synopsys, Mentor, Magma, Cadence
6. Notes on Tech Libraries
- Encapsulate foundry capabilities
- Typical content: Boolean gates, flip-flops, simple gates
- But in lots of variations: fan-in, driving strength, speed, power...
- Describe function, delay, area, power, physical shape...
- Often many libraries per process: high-perf/low-power, best/worst case, varying VDD, varying VT
7. Analysis of the Hardware Description
- Normally a very swift step
- Input: Verilog/VHDL description
- Output: circuit description in terms of adders, muxes, registers, Boolean gates, etc. (GTech: Generic Technology)
- Output is not optimized by any metric
- Just translates the specification into an abstract circuit
8. Logic Synthesis
- Takes minutes to hours
- Input: GTech description
- Output: circuit description in terms of HSFFX4, LPNOR2X2, LLINVX32, etc. (i.e. specific gates of a specific tech library)
- Output is...
  - Complying with timing specs (e.g. at 500 MHz)
  - Optimized for area and power
9. ...How Does This Work?
- Based on the GTech description, paths are identified:
  - register-to-register
  - input-to-register
  - register-to-output
  - input-to-output
- Along each path, GTech blocks are replaced with actually available gates from a technology library
- The outcome is called a netlist
- Delay is analyzed first, and some paths are detected as critical (see the sketch below)
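To make the path analysis concrete, here is a minimal sketch in Python: a netlist modeled as a DAG, with the longest-path arrival time computed at each endpoint and compared against the clock constraint. All gate names, delays and the constraint are hypothetical, not taken from a real library.

```python
# Hypothetical mini-netlist: gate -> (delay_ns, fan-in gates).
from functools import lru_cache

netlist = {
    "in_a":  (0.00, []),          # primary input
    "in_b":  (0.00, []),
    "nand1": (0.12, ["in_a", "in_b"]),
    "xor1":  (0.18, ["nand1", "in_b"]),
    "reg_d": (0.00, ["xor1"]),    # register data pin: path endpoint
}

@lru_cache(maxsize=None)
def arrival(gate):
    """Longest-path arrival time (ns) at a gate's output."""
    delay, fanin = netlist[gate]
    return delay + max((arrival(g) for g in fanin), default=0.0)

CLOCK_PERIOD_NS = 0.25  # assumed timing constraint
slack = CLOCK_PERIOD_NS - arrival("reg_d")
print(f"arrival={arrival('reg_d'):.2f} ns, slack={slack:+.2f} ns")
if slack < 0:
    print("path is critical: must be optimized")
```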
10. Example: Critical Paths
(Figure from "Adventures in ASIC Digital Design")
- Based on the chosen library gates and the netlist, path 1 → 6 is the longest and violates constraints
11. Netlist Optimization
- The synthesis process optimizes critical paths until timing constraints are met (a toy sizing loop is sketched below), e.g.:
  - Use faster gates instead of lower-power ones
  - Play with driving strength (as in buffering)
  - Refactor combinational logic to minimize the number of gates to be traversed
- Once timing is met, non-critical paths are analyzed
  - Optimized for area and power, even if slower
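As a hedged illustration of the sizing part of this loop (not the actual algorithm of any commercial tool), the sketch below upsizes the slowest cells on a violating path until the constraint is met. The cell variants and their delay/leakage numbers are invented.

```python
# Invented drive-strength variants: (name, delay_ns, leakage_uW),
# from weakest to strongest drive.
VARIANTS = [("NAND2X1", 0.15, 2.0), ("NAND2X4", 0.09, 5.0), ("NAND2X8", 0.06, 9.0)]

def close_timing(path, period_ns):
    """path: one variant index per cell on the critical path."""
    while sum(VARIANTS[i][1] for i in path) > period_ns:
        upsizable = [k for k, i in enumerate(path) if i < len(VARIANTS) - 1]
        if not upsizable:
            raise RuntimeError("sizing alone fails: refactor the logic instead")
        # upsize the currently slowest upsizable cell first
        k = max(upsizable, key=lambda k: VARIANTS[path[k]][1])
        path[k] += 1
    return path

path = close_timing([0, 0, 0], period_ns=0.30)
print([VARIANTS[i][0] for i in path])   # ['NAND2X4', 'NAND2X4', 'NAND2X4']
```

A real synthesizer would then walk the non-critical paths in the opposite direction, swapping in smaller, lower-leakage variants while slack stays positive.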
12. Placement
- Step 1: Floorplanning
  - Place macro-blocks onto a rectangle (≈ the chip)
  - e.g. processors, memories...
- Step 2: Detailed placement
  - Align the single gates of macro-blocks into rows
  - Typically aiming at 85% row utilization
13. Example: xpipes Placement Approach
- Floorplan: a mix of
  - hard macros for IP cores
  - soft macros for NoC blocks
14. Routing
- Step 1: Clock tree insertion
  - Bring the clock to all flip-flops
- Step 2: Power network insertion
  - Bring VDD, GND nets across the chip
  - Typically over the top metal layers
  - Either as a ring (small designs) or a grid (bigger designs)
- Step 3: Logic routing
  - Actually connect gates to each other
  - Typically over the bottom metal layers
15. Example: Binary Clock Tree
(Figure courtesy of Shobha Vasudevan)
- Issue: minimizing skew (a toy model follows below)
  - Critical at high frequencies
  - Consumes a large amount of power
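To see why skew is hard to control, consider a toy model (all delay values invented): insertion delay accumulates one buffer-plus-wire delay per tree level, and small per-stage mismatches make the leaves diverge.

```python
# Toy balanced binary clock tree: each level adds a buffer + wire delay with
# random mismatch; skew is the spread of root-to-leaf insertion delays.
# All numbers (ns) are invented for illustration.
import random

def leaf_delays(levels, buf=0.05, wire=0.02, mismatch=0.005):
    delays = [0.0]
    for _ in range(levels):
        # each node forks into two children, each with its own mismatch draw
        delays = [d + buf + wire + random.uniform(-mismatch, mismatch)
                  for d in delays for _ in (0, 1)]
    return delays

leaves = leaf_delays(levels=6)                 # 2**6 = 64 flip-flops
skew_ps = (max(leaves) - min(leaves)) * 1000
print(f"{len(leaves)} leaves, skew = {skew_ps:.1f} ps")
```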
16. Issues with the Traditional Flow
- Major problem with the traditional flow...
- ...wiring is not considered during synthesis!!!
- Outdated assumption: wiring delay is negligible
- Partial fix: wireload models (sketched below)
  - Consider the fan-out of gates
  - If small, assume short wiring at the outputs, and a bit of extra delay
  - If large, assume long wiring at the outputs, and noticeable extra delay
- Still grossly inaccurate
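A wireload model is essentially a fan-out-indexed lookup table. The sketch below shows the idea; the breakpoints, guessed lengths and RC constants are invented, not taken from any real library.

```python
# Wireload-model sketch: guess wire length from fan-out, then estimate the
# extra delay with a lumped 0.5*R*C (Elmore) approximation.
import bisect

FANOUT_BREAKS = [1, 2, 4, 8, 16]         # fan-out classes
EST_LEN_UM    = [15, 30, 70, 160, 400]   # guessed wire length per class
R_PER_UM = 0.8       # ohm/um, hypothetical
C_PER_UM = 0.2e-15   # F/um, hypothetical

def wireload_delay_s(fanout):
    i = min(bisect.bisect_left(FANOUT_BREAKS, fanout), len(EST_LEN_UM) - 1)
    r, c = R_PER_UM * EST_LEN_UM[i], C_PER_UM * EST_LEN_UM[i]
    return 0.5 * r * c   # lumped Elmore approximation

for fo in (1, 4, 20):
    print(f"fan-out {fo:2d}: +{wireload_delay_s(fo) * 1e12:.2f} ps")
```

The table knows nothing about where the gates will actually be placed, which is exactly why the estimate can be grossly wrong.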
17. Physical Synthesis
- Currently envisioned solution: physical synthesis
  - Merge placement with logic synthesis
  - Initial, quick logic synthesis
  - Coarse-grained placement
  - Incremental synthesis and placement until convergence
- Drastically better results (more predictable)
- Still may not suffice... also integrate the routing step??
RTL → quick logic synthesis → initial netlist → quick placement → initial placed netlist → incremental synthesis and placement → final placed netlist
18. Advanced Back-End Flow
- RTL code: circuit description
- Analysis → GTech: connected network of logic blocks
- Physical Synthesis (driven by Tech Libs) → Placed Netlist: placed network of gates
- Routing → Layout: placed and routed network of gates
- Major vendors: Synopsys, Mentor, Magma, Cadence
19. Some Observations on the Physical Implementation of NoCs
20. Study 1: Cross-Benchmarking NoCs vs. Traditional Interconnects
- Study the performance, area and power of a NoC implementation as opposed to traditional bus interconnects:
  - Plain shared bus
  - Hierarchical bus
- 130 nm technology
- Note: based on an old, unoptimized version of the NoC architecture
21. AMBA AHB Shared Bus
- Baseline architecture (AMBA AHB)
- Ten ARM cores, five traffic generators, fifteen slaves (fully populated bus)
- ARM cores running a pipelined multimedia benchmark
- Traffic generators:
  - Streaming traffic towards a memory (DSP-like)
  - Periodically querying some slaves (IOCtrl-like)
22. AMBA AHB Multilayer
(Figure: 5x5 Multilayer configuration; masters M0-M9 and traffic generators T0-T4 spread over AHB Layers 0-4, connected through the AMBA AHB crossbar to private slaves P0-P9 and shared slaves S10-S14)
- Dramatically improves performance
- Intra-cluster traffic to private slaves (P0-P9) is bound within each layer, reducing congestion
- Shared slaves (S10-S14) can be accessed in parallel
- Representative 5x5 Multilayer configuration (up to 8x8 allowed)
23. xpipes (Quasi-)Mesh
(Figure: quasi-mesh layout in 130 nm with 1 mm² tiles; masters M0-M9, traffic generators T0-T4, private slaves P0-P9 and shared slaves S10-S14 attached to the mesh switches)
- Excellent bandwidth
- Balanced architecture, no max frequency bottlenecks
- Very regular topology: easy to floorplan
- Overhead in area/power due to many links and buffers
24. NoCs vs. Traditional Interconnects - Performance
- Time to complete the functional benchmark
- Shared buses totally collapse
- NoCs are 10-15% faster than hierarchical buses
Observation 1: NoCs are much more scalable and can provide better performance under severe load.
25. NoCs vs. Traditional Interconnects - Summary
Observation 2: NoCs are dramatically more predictable than traditional interconnects.
Observation 3: NoCs are better in performance and physical design, but be careful about area and power!
26. Bandwidth or Latency?
- NoC bandwidth is much higher (44 links at 1 GHz)
- But this is only an indirect clue of performance
- The NoC latency penalty/gain depends on the transaction type (compare the toy model below):
  - Penalty on short reads
  - Gain on posted writes
Observation 4: Latency matters more than raw bandwidth. NoCs have to be careful about some transaction types.
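A back-of-the-envelope model of Observation 4 (the cycle counts are assumptions, not measured data from the study):

```python
# Toy latency model: a packet pays the network interfaces plus a few cycles
# per switch hop. A short read crosses the NoC twice (request + response);
# a posted write crosses it once and the initiator does not wait.
def one_way_cycles(hops, per_hop=3, ni=4):
    return 2 * ni + hops * per_hop   # source NI + switches + destination NI

HOPS = 3
print(f"short read  : {2 * one_way_cycles(HOPS)} cycles (round trip)")
print(f"posted write: {one_way_cycles(HOPS)} cycles, off the critical path")
```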
27. Area, Power Budget Analysis
(Figure: a. area and b. power breakdowns of the 38-bit quasi-mesh)
Observation 5: Clock trees are negligible in area, but eat up almost half of the power budget.
28. Study 2: Implementation of NoCs in 90 and 65 nm
- Study the behaviour of NoCs as they are implemented in cutting-edge technologies
- Observe the behaviour of tech libraries, tools, architecture and links as they are scaled from one technology node to the next
29. Link Design Constraints
(Figure: power needed to drive a 38-bit (plus flow control) unidirectional link, in a 65 nm lowest-power vs. a 65 nm power/performance library)
Observation 6: Long links (unless custom designed) become either infeasible or too power-hungry. Keep them segmented.
30. Link Repeaters/Relay Stations
- Wire segmentation by topology design
  - Put more switches, closer together
  - Adds a lot of overhead
- Wire segmentation by repeater insertion
  - Flops/relay stations to break links (see the arithmetic sketch below)
  - Details are closely tied to flow control
Observation 7: Architectural provisions may be needed to tackle physical-level issues. These may impact performance, so they should be accounted for in advance.
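As a rough illustration of the repeater-insertion arithmetic (the 600 ps/mm wire delay and the 1 GHz clock are assumed values, not figures from the study):

```python
# Sketch: how many relay stations does a link need so that each wire segment
# fits in one clock cycle?
import math

def relay_stations(length_mm, f_hz, delay_per_mm_s):
    period = 1.0 / f_hz
    segments = math.ceil(length_mm * delay_per_mm_s / period)
    return max(segments - 1, 0)   # flops inserted between segments

for length in (0.8, 2.5, 5.0):
    n = relay_stations(length, f_hz=1e9, delay_per_mm_s=600e-12)
    print(f"{length:.1f} mm link @ 1 GHz: {n} relay station(s)")
```

Each inserted flop costs a cycle of latency, which is why the flow control must account for it up front.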
31. Wireload Models and 65 nm
- Wireload models used to guesstimate propagation delay during logic synthesis are inaccurate
- As seen for 130 nm: 6% to 23% off from the actually achievable post-placement timing
- In 65 nm, the problem is dramatically worse
  - No timing closure after placement (-50% frequency, huge runtimes...)
- Traditional logic synthesis tools (e.g. Synopsys Design Compiler) are insufficient
- Physical synthesis, however, works great
Observation 8: Physical synthesis is compulsory for next-generation nodes.
32. Placement in Soft Macros
- In our experiments, placement/routing is extremely sensitive to soft macro area
  - Fences too tight: the flow fails
  - Fences too wide: the tool produces bad results
- Solution: accurate component area models (a toy model is sketched below)
  - Involves work, since area depends on architectural parameters (cardinality, buffering...)
Observation 9: Thorough characterization of the components may be key to the convergence of the flow for a whole topology.
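The component area models meant here can be as simple as a fitted analytical formula. The sketch below shows the shape of such a model; the coefficients are placeholders, not the actual xpipes characterization.

```python
# Hypothetical analytical area model for a NoC switch, of the kind fitted
# ("back-annotated") from post-layout data. All coefficients are placeholders.
def switch_area_um2(n_in, n_out, flit_bits, buf_depth,
                    k_xbar=0.9, k_buf=6.5, k_fixed=2500.0):
    crossbar = k_xbar * n_in * n_out * flit_bits       # mux/wiring dominated
    buffers  = k_buf * n_out * buf_depth * flit_bits   # flip-flop dominated
    return k_fixed + crossbar + buffers

area = switch_area_um2(n_in=6, n_out=6, flit_bits=38, buf_depth=6)
fence = 1.15 * area   # ~15% margin: tight enough to guide, loose enough to route
print(f"estimated area {area:.0f} um2, soft-macro fence {fence:.0f} um2")
```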
33. 65 nm Degrees of Freedom
- LP and HP libraries differ in gate design, VT, VDD...
Observation 10: There is no such thing as "a 65 nm library". Power/performance degrees of freedom span one order of magnitude. It is the designer's (or the tools') responsibility to pick the right library.
34. Technology Scaling within Modules
(Figure: 6x6 switch, 38 bits, 6 buffers)
- Within modules, scaling looks great:
  - +25% frequency
  - -46% area
  - -52% power
35. Technology Scaling on Topologies
- Three designs, each tuned for max frequency:
  - 90 nm, 1 mm² cores
  - 65 nm, 1 mm² cores
  - 65 nm, 0.4 mm² cores
36. Mesh Scaling
- Links:
  - Always short (<1.2 mm) → non-pipelined
- However:
  - 90 nm, 1 mm² cores: 3.1 mW
  - 65 nm, 1 mm² cores: 3.6 mW (tightest fit → more buffering)
  - 65 nm, 0.4 mm² cores: 2.2 mW
- Power is shifting from switches/NIs to links (buffering)
37. High-Radix Switch Feasibility
- High-radix switches become too slow
- 10x10 is the maximum realistic size
- For sizes 26x26 and 30x30, place & route (P&R) is infeasible!
38. Clock Skew in High-Radix Switches
- A single switch is still a small entity
- Skew can be confined to <10%, typically <5%
39. A Complete NoC Synthesis Flow
40. Design of a NoC-Based System
- Software Services: mapping, QoS, middleware...
- Architecture: packeting, buffering, flow control...
- Physical Implementation: synchronization, wires, power...
- CAD Tools
- All these items are key opportunities and challenges
- Strict interaction/feedback is mandatory!
- CAD tools must guide designers to the best results
41. The Design Tool Dilemma
- Automatically find a topology and architectural parameters so that:
  - Design constraints are satisfied
  - Area, power and latency are minimized
- "A hypercube? A torus? Or do I want a custom topology?"
42. Custom Topology Mapping
- Objectives:
  - Design fully application-specific custom topologies
  - Generate deadlock-free networks
  - Optimize the architectural parameters of the NoC (frequency, flit size), tuning them upon application requirements
- Physical design awareness:
  - Leverage accurate analytical models for area and power, back-annotated from layouts
  - Integrated floorplanner to achieve design closure while also considering wiring complexity
43. The xpipes NoC Design Flow
(Flow diagram, roughly:)
- Inputs: system specs, application traffic task graph, IP core models, NoC component library, NoC area/power models, user objectives (power, hop delay), constraints (area, power, hop delay, wire length)
- SunFloor topology synthesis (includes a floorplanner) → floorplanning specifications
- Platform generation via the xpipesCompiler → SystemC code
- Validation: RTL/architectural simulation and FPGA emulation
- Back end: synthesis, placement and routing → to fab
- Area and power characterization feeds back into the NoC area/power models
44. Example: Task Graph
- Captures communication among the system cores (a minimal encoding is sketched below):
  - Source/destination pairs
  - Required bandwidth
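For illustration, such a task graph can be encoded as little more than a bandwidth-annotated edge list; the core names, bandwidth figures and routes below are made up.

```python
# Toy communication task graph: directed (source, destination) pairs
# annotated with required bandwidth in MB/s. All values are invented.
task_graph = {
    ("cpu0", "mem0"): 400,
    ("cpu0", "dsp0"): 120,
    ("dsp0", "mem1"): 250,
    ("io0",  "mem0"):  60,
}

def overloaded_links(routes, graph, link_capacity):
    """routes: flow -> list of NoC links it traverses. Returns saturated links."""
    load = {}
    for flow, bw in graph.items():
        for link in routes[flow]:
            load[link] = load.get(link, 0) + bw
    return {l: bw for l, bw in load.items() if bw > link_capacity}

# Hypothetical routing of each flow over named switch-to-switch links:
routes = {
    ("cpu0", "mem0"): ["sw0-sw1"],
    ("cpu0", "dsp0"): ["sw0-sw2"],
    ("dsp0", "mem1"): ["sw2-sw3"],
    ("io0",  "mem0"): ["sw4-sw0", "sw0-sw1"],
}
print(overloaded_links(routes, task_graph, link_capacity=400))
# -> {'sw0-sw1': 460}: this link would need a different route or a wider/faster NoC
```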
45. Measuring xpipes Performance
(Figure: measured performance for the given topology specs)
46. Example: Layout (65 nm design)
- The floorplan is automatically generated
- Black areas: IP cores
- Colored areas: NoC
- Over-the-cell routing is allowed in this example