1
NoC Physical Implementation
  • Federico Angiolini
  • federico.angiolini@unibo.it
  • DEIS Università di Bologna

2
Physical Implementation and NoCs
  • NoCs and physical implementation flows are
    closely related topics
  • On the one hand, NoCs are designed to alleviate
    back-end issues (structured wiring)
  • On the other hand, back-end properties critically
    affect NoC behaviour and effectiveness

3
ASIC Synthesis Flow
4
A Typical ASIC Design Flow
Design Space Exploration → RTL Coding → Logic Synthesis → Placement → Routing
  • Ideally, one-shot linear flow
  • In practice, iterations are needed to fix issues:
  • Validation failures
  • Bad quality of results
  • No timing closure

5
Basics of a Back-End Flow
RTL code (circuit description)
→ Analysis → GTech (connected network of logic blocks)
→ Logic Synthesis [uses Tech Libs] → Netlist (connected network of gates)
→ Placement → Placed Netlist (placed network of gates)
→ Routing → Layout (placed and routed network of gates)
Major vendors: Synopsys, Mentor, Magma, Cadence
6
Notes on Tech Libraries
  • Encapsulate foundry capabilities
  • Typical content: boolean gates, flip-flops,
    simple gates
  • But in lots of variations: fan-in, driving
    strength, speed, power...
  • Describe function, delay, area, power, physical
    shape...
  • Often many libraries per process:
    high-perf/low-power, best/worst case, varying VDD,
    varying VT
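
To make the content of such a library concrete, the Python sketch below models a couple of hypothetical cell entries and how a tool might trade speed against area and leakage among variants of the same function. All names and numbers are invented, not taken from any real library.

from dataclasses import dataclass

@dataclass
class LibraryCell:
    """One (much simplified) technology-library entry."""
    name: str           # cell name, e.g. a NAND2 at a given driving strength
    function: str       # boolean function implemented
    delay_ns: float     # nominal propagation delay
    area_um2: float     # cell area
    leakage_uw: float   # static power

# Hypothetical variants of the same function at different driving strengths
LIB = [
    LibraryCell("NAND2_X1", "!(A & B)", delay_ns=0.08, area_um2=1.1, leakage_uw=0.02),
    LibraryCell("NAND2_X4", "!(A & B)", delay_ns=0.04, area_um2=2.9, leakage_uw=0.09),
]

# Synthesis picks among the variants: faster cells cost area and leakage
fastest  = min(LIB, key=lambda c: c.delay_ns)
smallest = min(LIB, key=lambda c: c.area_um2)
print("fastest:", fastest.name, "| smallest:", smallest.name)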

7
Analysis of the Hardware Description
  • Normally, a very swift step
  • Input: Verilog/VHDL description
  • Output: circuit description in terms of adders,
    muxes, registers, boolean gates, etc.
    (GTech = Generic Technology)
  • Output is not optimized by any metric
  • Just translates specifications into an abstract
    circuit

8
Logic Synthesis
  • Takes minutes to hours
  • Input: GTech description
  • Output: circuit description in terms of HSFFX4,
    LPNOR2X2, LLINVX32, etc. (i.e. specific
    gates of a specific tech library)
  • Output is...
  • Complying with timing specs (e.g. at 500 MHz)
  • Optimized for area and power

9
...How Does This Work?
  • Based on GTech, paths are identified
  • register-to-register
  • input-to-register
  • register-to-output
  • input-to-output
  • Along each path, GTech blocks are replaced with
    actually available gates from a technological
    library
  • The outcome is called a netlist
  • Delay is analyzed first and some paths are
    detected as critical (see the sketch below)
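
As a toy illustration of the delay analysis step, the Python sketch below computes the longest register-to-register path over a tiny netlist modelled as a DAG and flags it as critical when it exceeds the clock period. Gate names, delays and the clock period are invented; a real tool reads them from the technology library and the timing constraints.

# Toy static timing: longest-path search over a small netlist DAG.
import functools

gate_delay = {"U1": 0.10, "U2": 0.15, "U3": 0.07, "U4": 0.20}   # ns, invented
fanout = {                      # driver -> driven gates
    "REG_A": ["U1"], "U1": ["U2", "U3"], "U2": ["U4"],
    "U3": ["U4"], "U4": ["REG_B"],
}

@functools.lru_cache(maxsize=None)
def arrival(node):
    """Latest arrival time (ns) at the input of `node`."""
    drivers = [d for d, sinks in fanout.items() if node in sinks]
    if not drivers:                        # a launching register / primary input
        return 0.0
    return max(arrival(d) + gate_delay.get(d, 0.0) for d in drivers)

clock_period = 0.4  # ns, arbitrary target
delay = arrival("REG_B")
print(f"reg-to-reg delay {delay:.2f} ns ->",
      "CRITICAL" if delay > clock_period else "meets timing")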

10
Example Critical Paths
Source: Adventures in ASIC Digital Design
  • Based on the chosen library gates and netlist, path 1
    → 6 is the longest and violates constraints

11
Netlist Optimization
  • Synthesis process optimizes critical paths until
    timing constraints are met, e.g.
  • Use faster gates instead of lower-power ones
  • Play with driving strength (as in buffering)
  • Refactor combinational logic to minimize the number
    of gates traversed
  • Once timing is met, analyze non-critical paths
  • Optimize them for area and power, even if they become slower

12
Placement
  • Step 1: Floorplanning
  • Place macro-blocks onto a rectangle (= the chip)
  • e.g. processors, memories...
  • Step 2: Detailed placement
  • Align the single gates of the macro-blocks into rows
  • Typically aiming at 85% row utilization (see the sketch below)
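
The 85% figure is easiest to see with a little arithmetic; the Python sketch below (all numbers invented) estimates how many standard-cell rows are needed for a given total cell area at that target utilization.

# Rough row-count estimate for detailed placement.
# Numbers are invented, purely to show the 85%-utilization arithmetic.
import math

total_cell_area_um2 = 120_000        # sum of standard-cell areas
row_height_um       = 1.8            # library row height
row_width_um        = 400.0          # chosen core/row width
target_utilization  = 0.85           # leave ~15% free for routing/ECO

row_capacity_um2 = row_height_um * row_width_um * target_utilization
n_rows = math.ceil(total_cell_area_um2 / row_capacity_um2)
core_height_um = n_rows * row_height_um
print(f"{n_rows} rows, core ~{row_width_um:.0f} x {core_height_um:.0f} um")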

13
Example xpipes Placement Approach
  • Floorplan: mix of
  • hard macros for IP cores
  • soft macros for NoC blocks

14
Routing
  • Step 1: Clock tree insertion
  • Bring the clock to all flip-flops
  • Step 2: Power network insertion
  • Bring the VDD, GND nets across the chip
  • Typically over the top metal layers
  • Either as a ring (small designs) or a grid (bigger
    designs)
  • Step 3: Logic routing
  • Actually connect gates to each other
  • Typically over the bottom metal layers

15
Example Binary Clock Tree
Courtesy: Shobha Vasudevan
  • Issue: minimizing skew
  • Critical at high frequencies
  • Consumes a large amount of power (see the sketch below)
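
For a rough feel of the numbers, the Python sketch below (per-buffer figures invented) shows that a balanced binary tree over N flip-flops needs about log2(N) buffer levels, and that any per-level mismatch accumulates into skew, which is why skew control gets harder and more power-hungry at high frequency.

# Rough clock-tree arithmetic: depth of a balanced binary tree over N sinks
# and how per-level mismatch accumulates into skew. Numbers are invented.
import math

n_flops         = 50_000
buffer_delay_ps = 35.0   # nominal delay of one clock-buffer level
mismatch_ps     = 2.0    # worst-case per-level delay mismatch

levels  = math.ceil(math.log2(n_flops))
latency = levels * buffer_delay_ps       # insertion delay to the leaves
skew    = levels * mismatch_ps           # worst-case accumulated skew
print(f"{levels} levels, ~{latency:.0f} ps insertion delay, "
      f"up to ~{skew:.0f} ps skew")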

16
Issue with Traditional Flow
  • Major problem with traditional flow...
  • ...wiring is not considered during synthesis!!!
  • Outdated assumption: wiring delay is negligible
  • Partial fix: wireload models
  • Consider the fan-out of each gate
  • If small, assume short wiring at the outputs, and a
    bit of extra delay
  • If large, assume long wiring at the outputs, and a
    noticeable extra delay
  • Still grossly inaccurate (see the sketch below)
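
The Python sketch below mimics the fanout-based guess a wireload model makes during synthesis: small fanout maps to a short assumed wire, large fanout to a long one, with no placement information at all. The coefficients are invented; real wireload tables ship with the technology library.

# Fanout-based wireload estimate, in the spirit of classic wireload models.
def wireload_delay_ps(fanout: int) -> float:
    """Guess wire delay from fanout alone (no placement information)."""
    est_length_um = 60.0 * fanout        # assumed wire length grows with fanout
    r_ohm = 0.4 * est_length_um          # wire resistance, ohm/um (made up)
    c_ff  = 0.3 * est_length_um          # wire capacitance, fF/um (made up)
    # Elmore-style lumped-RC estimate; ohm * fF = femtoseconds, /1000 -> ps
    return 0.69 * r_ohm * c_ff / 1000.0

for f in (1, 4, 16):
    print(f"fanout {f:2d}: ~{wireload_delay_ps(f):6.1f} ps of assumed wire delay")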

17
Physical Synthesis
  • Currently envisioned solution: physical synthesis
  • Merge placement with logic synthesis
  • Initial, quick logic synthesis
  • Coarse-grained placement
  • Incremental synthesis + placement until
    convergence (see the loop sketch below)
  • Drastically better results (more predictable)
  • Still may not suffice... also integrate the routing
    step?

RTL → Quick logic synthesis → Initial Netlist → Quick placement → Initial Placed Netlist → Incremental synthesis + placement → Final Placed Netlist
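
The control structure behind physical synthesis is essentially a convergence loop. The toy Python sketch below just makes that iteration explicit: each pass stands in for an incremental synthesis + placement step that recovers part of the remaining timing violation (both the numbers and the recovery model are invented).

# Toy model of the physical-synthesis loop: iterate incremental
# synthesis + placement passes until the worst slack converges.
def refine(worst_slack_ps: float) -> float:
    """Stand-in for one incremental synthesis + placement pass."""
    # Each pass recovers half of the remaining violation (invented model).
    return worst_slack_ps * 0.5 if worst_slack_ps < 0 else worst_slack_ps

worst_slack_ps = -120.0          # after the initial quick synthesis + placement
for i in range(1, 11):
    worst_slack_ps = refine(worst_slack_ps)
    print(f"pass {i}: worst slack {worst_slack_ps:.1f} ps")
    if worst_slack_ps >= -1.0:   # close enough to timing closure
        break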
18
Advanced Back-End Flow
RTL code (circuit description)
→ Analysis → GTech (connected network of logic blocks)
→ Physical Synthesis [uses Tech Libs] → Placed Netlist (placed network of gates)
→ Routing → Layout (placed and routed network of gates)
Major vendors: Synopsys, Mentor, Magma, Cadence
19
Some Observations on the Physical Implementation
of NoCs
20
Study 1: Cross-Benchmarking NoCs vs. Traditional Interconnects
  • Study performance, area, power of a NoC
    implementation as opposed to traditional bus
    interconnects
  • Plain shared bus
  • Hierarchical bus
  • 130nm technology
  • Note: based on an old, unoptimized version of the NoC
    architecture

21
AMBA AHB Shared Bus
  • Baseline architecture
  • Ten ARM cores, five traffic generators, fifteen
    slaves (fully populated bus)
  • ARM cores running a pipelined multimedia
    benchmark
  • Traffic generators
  • Streaming traffic towards a memory (DSP-like)
  • Periodically querying some slaves (IOCtrl-like)

22
AMBA AHB Multilayer
[Figure: 5x5 AMBA AHB Multilayer block diagram — masters M0-M9 and traffic generators T0-T4 on AHB Layers 0-4, private slaves P0-P9 per layer, shared slaves S10-S14 reached through the AMBA AHB crossbar]
  • Dramatically improves performance
  • Intra-cluster traffic to private slaves (P0-P9)
    is bound within each layer, reducing congestion
  • Shared slaves (S10-S14) can be accessed in
    parallel
  • Representative 5x5 Multilayer configuration (up
    to 8x8 allowed)

23
xpipes (Quasi-)Mesh
[Figure: xpipes quasi-mesh floorplan in 130nm — masters M0-M9, traffic generators T0-T4, private slaves P0-P9 and shared slaves S10-S14 placed on roughly 1 mm² tiles]
  • Excellent bandwidth
  • Balanced architecture, no max frequency
    bottlenecks
  • Very regular topology: easy to floorplan
  • Area/power overhead due to many links and
    buffers

24
NoCs vs. Traditional Interconnects - Performance
  • Time to complete the functional benchmark
  • Shared buses collapse completely
  • NoCs are 10-15% faster than hierarchical buses

Observation 1: NoCs are much more scalable and
can provide better performance under severe load.
25
NoCs vs. Traditional Interconnects - Summary
Observation 2: NoCs are dramatically more
predictable than traditional interconnects.
Observation 3: NoCs are better in performance
and physical design, but be careful about area
and power!
26
Bandwidth or Latency?
  • NoC bandwidth is much higher (44 links, 1 GHz)
  • But this is only an indirect clue of performance
  • NoC latency penalty/gain depends on the transaction type
  • Penalty on short reads
  • Gain on posted writes

Observation 4: Latency matters more than raw
bandwidth. NoCs have to be careful about some
transaction types.
27
Area, Power Budget Analysis
[Figure: (a) area and (b) power budget breakdown of the 38-bit quasi-mesh]
Observation 5: Clock trees are negligible in
area, but eat up almost half of the power budget.
28
Study 2: Implementation of NoCs in 90 and 65nm
  • Study behaviour of NoCs as they are implemented
    in cutting-edge technologies
  • Observe behaviour of tech libraries, tools,
    architecture and links as they are scaled from
    one technology node to another

29
Link Design Constraints
[Plot: comparison of 65nm lowest-power vs. 65nm power/performance library variants]
  • Power to drive a 38-bit (plus flow control)
    unidirectional link

Observation 6: Long links (unless custom
designed) become either infeasible or too
power-hungry. Keep them segmented.
30
Link Repeaters/Relay Stations
  • Wire segmentation by topology design
  • Put in more switches, closer together
  • Adds a lot of overhead
  • Wire segmentation by repeater insertion
  • Flops / relay stations to break up links
  • Details are tightly coupled to flow control (see the sketch below)

Observation 7: Architectural provisions may be
needed to tackle physical-level issues. These may
impact performance, so they should be accounted
for in advance.
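
The reason segmentation works can be seen from the delay arithmetic: unrepeated wire delay grows roughly quadratically with length, while a segmented link adds a roughly constant flop/relay-station overhead per stage, and each shorter segment then also fits within one clock period. The Python sketch below illustrates this with invented RC and flop-overhead values.

# Why long links get segmented: lumped-RC delay grows ~quadratically with
# length, whereas N relay-station segments add ~constant overhead per stage.
# All coefficients are invented for illustration.
R_PER_MM = 2000.0        # ohm/mm
C_PER_MM = 200.0         # fF/mm
FLOP_OVERHEAD_PS = 60.0  # clk->q + setup of a repeater flop / relay station

def wire_delay_ps(length_mm: float) -> float:
    """Elmore-style delay of an unrepeated wire segment (ps)."""
    return 0.69 * (R_PER_MM * length_mm) * (C_PER_MM * length_mm) / 1000.0

def segmented_delay_ps(length_mm: float, n_segments: int) -> float:
    seg = length_mm / n_segments
    return n_segments * (wire_delay_ps(seg) + FLOP_OVERHEAD_PS)

for n in (1, 2, 4):
    print(f"{n} segment(s) over 4 mm: {segmented_delay_ps(4.0, n):7.1f} ps total")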
31
Wireload Models and 65nm
  • Wireload models to guesstimate propagation delay
    during logic synthesis are inaccurate
  • As seen, for 130nm, 6% to 23% off from the actually
    achievable post-placement timing
  • In 65nm, the problem is dramatically worse
  • No timing closure after placement (-50%
    frequency, huge runtimes...)
  • Traditional logic synthesis tools (e.g. Synopsys
    Design Compiler) are insufficient
  • Physical synthesis, however, works great

Observation 8: Physical synthesis is compulsory
for next-generation nodes.
32
Placement in Soft Macros
  • In our experiments, placement/routing is
    extremely sensitive to soft-macro area
  • Fences too tight: the flow fails
  • Fences too wide: the tool produces bad results
  • Solution: accurate component area models
  • This involves work, since area depends on architectural
    parameters (cardinality, buffering...)

Observation 9: Thorough characterization of the
components may be key to the convergence of the
flow for a whole topology.
33
65nm Degrees of Freedom
Observation 10: There is no such thing as "a
65nm library". Power/performance degrees of
freedom span one order of magnitude. It is
the designer's (or the tool's) responsibility to
pick the right library.
  • LP and HP libraries differ in gate design, VT,
    VDD...

34
Technology Scaling within Modules
6x6 switch, 38 bits, 6 buffers
  • Within modules, scaling looks great
  • +25% frequency
  • -46% area
  • -52% power

35
Technology Scaling on Topologies
  • Three designs for max frequency
  • 90 nm, 1 mm² cores
  • 65 nm, 1 mm² cores
  • 65 nm, 0.4 mm² cores
36
Mesh Scaling
  • Links:
  • Always short (<1.2 mm) → non-pipelined
  • However:
  • 90 nm, 1 mm² cores: 3.1 mW
  • 65 nm, 1 mm² cores: 3.6 mW (tightest fit → more
    buffering)
  • 65 nm, 0.4 mm² cores: 2.2 mW
  • Power shifts from switches/NIs to the links
    (buffering)

37
High-Radix Switch Feasibility
  • High-radix switches become too slow
  • 10x10 is the maximum realistic size
  • For sizes 26x26 and 30x30, P&R is infeasible!

38
Clock Skew in High-Radix Switches
  • A single switch is still a small entity
  • Skew can be confined to <10%, typically <5%

39
A Complete NoC Synthesis Flow
40
Design of a NoC-Based System
Software Services: Mapping, QoS, middleware...
Architecture: Packeting, buffering, flow control...
Physical Implementation: Synchronization, wires, power...
CAD Tools (spanning all layers)
  • All these items are key opportunities and
    challenges
  • Strict interaction/feedback is mandatory!
  • CAD tools must guide designers to the best results

41
The Design Tool Dilemma
  • Automatically find a topology and architectural
    parameters such that:
  • Design constraints are satisfied
  • Area, power, latency are minimized


A hypercube? A torus? Or, do I want a custom
topology?
42
Custom Topology Mapping
  • Objectives:
  • Design fully application-specific custom
    topologies
  • Generate deadlock-free networks
  • Optimize architectural parameters of the NoC
    (frequency, flit size), tuned to application
    requirements

Physical design awareness:
  • Leverage accurate analytical models for area and
    power, back-annotated from layouts
  • Integrated floorplanner to achieve design closure
    while also considering wiring complexity (see the sketch below)
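
In spirit, topology synthesis searches candidate topologies and scores them with the back-annotated analytical models mentioned above. The Python sketch below shows the kind of weighted cost function involved; the model coefficients, candidate topologies and weights are made up for illustration and are not the actual SunFloor algorithm.

# Hypothetical cost function for comparing candidate NoC topologies.
# Area/power coefficients are invented stand-ins for layout-back-annotated models.
def switch_area_mm2(radix: int, flit_bits: int) -> float:
    return 0.002 * radix * radix * flit_bits / 32       # crossbar-dominated model

def switch_power_mw(radix: int, flit_bits: int, f_mhz: float) -> float:
    return 0.01 * radix * flit_bits / 32 * f_mhz / 100

def topology_cost(switch_radices, flit_bits, f_mhz, avg_hops):
    area    = sum(switch_area_mm2(r, flit_bits) for r in switch_radices)
    power   = sum(switch_power_mw(r, flit_bits, f_mhz) for r in switch_radices)
    latency = avg_hops / f_mhz * 1e3                     # ns per header traversal
    return area + 0.05 * power + 0.5 * latency           # arbitrary weights

candidates = {
    "4x4 mesh (5x5 switches)":  ([5] * 16, 2.0),
    "custom, few big switches": ([10, 10, 8], 1.4),
}
for name, (radices, hops) in candidates.items():
    print(name, "-> cost", round(topology_cost(radices, 38, 800, hops), 3))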

43
The xpipes NoC Design Flow
Inputs: application traffic (task graph), user objectives (power, hop delay),
constraints (area, power, hop delay, wire length), IP core models,
NoC component library, NoC area/power models
→ Topology Synthesis (SunFloor, includes floorplanner and NoC router)
→ system specs + floorplanning specifications
→ Platform Generation (xpipesCompiler) → SystemC code
→ RTL/architectural simulation and FPGA emulation
→ Synthesis → Placement & Routing → to fab
Area/power characterization from the layouts is back-annotated into the NoC area/power models
44
Example Task Graph
  • Captures communication among system cores
  • Source/destination pairs
  • Required bandwidth
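
A minimal way to represent such a task graph, with invented core names and bandwidth figures, is a weighted edge list as in the Python sketch below; the per-core aggregate load is the kind of figure topology synthesis works from.

# A communication task graph as a weighted digraph:
# (source core, destination core) -> required bandwidth in MB/s.
# Core names and bandwidths are invented for illustration.
from collections import defaultdict

task_graph = {
    ("cpu0", "mem0"): 400,
    ("cpu0", "dsp0"): 120,
    ("dsp0", "mem1"): 800,
    ("dma0", "mem0"): 250,
}

# Aggregate bandwidth each core must source/sink -> drives flit width,
# link frequency and switch placement during topology synthesis.
load = defaultdict(int)
for (src, dst), bw in task_graph.items():
    load[src] += bw
    load[dst] += bw
print(dict(load))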

45
Measuring xpipes Performance
46
Example Layout
  • The floorplan is automatically generated
  • Black areas: IP cores
  • Colored areas: NoC
  • Over-the-cell routing is allowed in this example

65nm design