1 Networks-on-Chip
- Ben Abdallah Abderazek
- The University of Aizu
- Graduate School of Computer Science and Eng.
- Adaptive Systems Laboratory
- E-mail: benab_at_u-aizu.ac.jp
03/01/2010
2 Part I
- Application Requirements
- Network on Chip: A Paradigm Shift in VLSI
- Critical problems addressed by NoC
- Traffic abstractions
- Data abstraction
- Network delay modeling
3 Application Requirements
- Signal processing
- Hard real time
- Very regular load
- High quality
- Typically on DSPs
- Media processing
- Hard real time
- Irregular load
- High quality
- SoC/media processors
- Multimedia
- Soft real time
- Irregular load
- Limited quality
- PC/desktop
Very challenging!
4 What Does the Internet Need?
ASIC (large, expensive to develop, not flexible)
SoC, MCSoC?
A huge and increasing volume of packets: routing, packet classification, encryption, QoS, new applications and protocols, etc.
- High processing power
- Support wire speed
- Programmable
- Scalable
- Specially for network applications
General Purpose RISC (not capable enough)
5 Example - Network Processor (NP)
- 16 pico-processors and 1 PowerPC
- Each pico-processor
- Support 2 hardware threads
- 3 stage pipeline fetch/decode/execute
- Dyadic Processing Unit
- Two pico-processors
- 2KB Shared memory
- Tree search engine
- Focus is layers 2-4
- PowerPC 405 for control plane operations
- 16K I and D caches
- Target is OC-48
IBM PowerNP
6 Example - Network Processor (NP)
- NP can be applied in various network layers and applications
- Traditional apps: forwarding, classification
- Advanced apps: transcoding, URL-based switching, security, etc.
- New apps
7 Telecommunication Systems and NoC Paradigm
- The trend nowadays is to integrate telecommunication systems on complex multicore SoCs (MCSoC):
- Network processors,
- Multimedia hubs, and
- Base-band telecom circuits
- These applications have tight time-to-market and performance constraints
8 Telecommunication Systems and NoC Paradigm
- A telecommunication multicore SoC is composed of 4 kinds of components:
- Software tasks,
- Processors executing software,
- Specific hardware cores, and
- Global on-chip communication network
9 Telecommunication Systems and NoC Paradigm
- A telecommunication multicore SoC is composed of 4 kinds of components:
- Software tasks,
- Processors executing software,
- Specific hardware cores, and
- Global on-chip communication network
This is the most challenging part.
10 Technology and Architecture Trends
- Technology trends
- Vast transistor budgets
- Relatively poor interconnect scaling
- Need to manage complexity and power
- Build flexible designs (multi-/general-purpose)
- Architectural trends
- Go parallel !
- Keep core complexity constant or simplify
- The result is lots of modules (cores, memories, off-chip interfaces, specialized IP cores, etc.)
11 Wire Delay vs. Logic Delay
Operation | Delay (0.13 µm) | Delay (0.05 µm)
32-bit ALU operation | 650 ps | 250 ps
32-bit register read | 325 ps | 125 ps
Read 32 bits from 8KB RAM | 780 ps | 300 ps
Transfer 32 bits across chip (10 mm) | 1400 ps | 2300 ps
Transfer 32 bits across chip (200 mm) | 2800 ps | 4600 ps
2:1 global on-chip communication to operation delay; 9:1 in 2010
Ref: W.J. Dally, HPCA Panel presentation, 2002
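The communication-to-operation delay ratios quoted in the last row follow directly from the table; a quick check:

```python
# Ratio of cross-chip (10 mm) communication delay to a 32-bit ALU
# operation, using the figures from the table above (Dally, HPCA 2002).
delays = {
    "0.13um": {"alu": 650, "xchip_10mm": 1400},
    "0.05um": {"alu": 250, "xchip_10mm": 2300},
}

for node, d in delays.items():
    ratio = d["xchip_10mm"] / d["alu"]
    print(f"{node}: communication/operation delay = {ratio:.1f}x")
```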
12 Communication Reliability
- Information transfer is inherently unreliable at the electrical level, due to:
- Timing errors
- Cross-talk
- Electro-magnetic interference (EMI)
- Soft errors
- The problem will get increasingly worse as technology scales down
13 Evolution of on-chip communication
14 Traditional SoC nightmare
- Variety of dedicated interfaces
- Design and verification complexity
- Unpredictable performance
- Many underutilized wires
[Figure: CPU, DSP and DMA modules with control signals on a CPU bus, connected through a bridge to a peripheral bus with several IO blocks]
15 Network on Chip: A Paradigm Shift in VLSI
From: dedicated signal wires
To: shared network
[Figure: computing modules connected by point-to-point links through network switches]
16 NoC essentials
- Communication by packets of bits
- Routing of packets through several hops, via switches
- Efficient sharing of wires
- Parallelism
17 Characteristics of a paradigm shift
- Solves a critical problem
- Step-up in abstraction
- Design is affected
- Design becomes more restricted
- New tools
- The changes enable higher complexity and capacity
- Jump in design productivity
18 Characteristics of a paradigm shift
- Solves a critical problem
- Step-up in abstraction
- Design is affected
- Design becomes more restricted
- New tools
- The changes enable higher complexity and capacity
- Jump in design productivity
We will look at the problem addressed by NoC.
19 Origins of the NoC concept
- The idea was talked about in the 90s, but actual research came in the new millennium.
- Some well-known early publications:
- Guerrier and Greiner (2000): A generic architecture for on-chip packet-switched interconnections
- Hemani et al. (2000): Network on chip: An architecture for billion transistor era
- Dally and Towles (2001): Route packets, not wires: on-chip interconnection networks
- Wingard (2001): MicroNetwork-based integration of SoCs
- Rijpkema, Goossens and Wielage (2001): A router architecture for networks on silicon
- Kumar et al. (2002): A Network on chip architecture and design methodology
- De Micheli and Benini (2002): Networks on chip: A new paradigm for systems on chip design
20 Don't we already know how to design interconnection networks?
- Many network topologies, router designs, and much theory have already been developed for high-end supercomputers and telecom switches
- Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs!!
21 Critical problems addressed by NoC
1) Global interconnect design problem: delay, power, noise, scalability, reliability
2) System integration productivity problem
3) Chip Multi Processors (key to power-efficient computing)
22 1(a) NoC and global wire delay
- Long wire delay is dominated by resistance
- Add repeaters
- Repeaters become latches (with clock frequency scaling)
- Latches evolve into NoC routers
23 1(b) Wire design for NoC
- NoC links:
- Regular
- Point-to-point (no fanout tree)
- Can use transmission-line layout
- Well-defined current return path
- Can be optimized for noise / speed / power
- Low swing, current mode, ...
24 1(c) NoC scalability
- For the same performance, compare the wire area and power:
Architecture | Wire area | Power
Simple Bus | O(n³√n) | O(n√n)
NoC | O(n) | O(n)
Segmented Bus | O(n²√n) | O(n√n)
Point-to-Point | O(n²√n) | O(n√n)
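Under the asymptotic costs in the table above, relative growth with the module count n can be sketched as follows (constants and technology factors are ignored, so only the growth rates, not absolute values, are meaningful):

```python
import math

# Asymptotic wire-area growth from the table above, with constants
# dropped; compares how fast each architecture's wire area grows.
wire_area = {
    "simple bus":     lambda n: n**3 * math.sqrt(n),
    "segmented bus":  lambda n: n**2 * math.sqrt(n),
    "point-to-point": lambda n: n**2 * math.sqrt(n),
    "noc":            lambda n: n,
}

# Growth factor when the number of modules quadruples (16 -> 64)
for name, f in wire_area.items():
    print(f"{name}: x{f(64) / f(16):.0f}")
```

Quadrupling the module count multiplies bus wire area by two orders of magnitude, while NoC wire area only quadruples.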
25 1(d) NoC and communication reliability
- Fault tolerance: error correction
[Figure: two routers connected through UMODEM link interfaces; each interface includes an input buffer, error correction, synchronization, ISI reduction, parallel-to-serial conversion, and modulation over the interconnect]
A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, "Micro-modem: reliability solution for NoC communications", ICECS 2004
26 1(e) NoC and GALS
- Modules in a NoC system use different clocks
- May use different voltages
- NoC can take care of synchronization
- NoC design may be asynchronous
- No waste of power when the links and routers are idle
27 2) NoC and engineering productivity
- NoC eliminates ad-hoc global wire engineering
- NoC separates computation from communication
- NoC supports modularity and reuse of cores
- NoC is a platform for system integration,
debugging and testing
28 3) NoC and CMP
- Uniprocessors cannot provide power-efficient performance growth
- Interconnect dominates dynamic power
- Global wire delay doesn't scale
- Instruction-level parallelism is limited
- Power-efficiency requires many parallel local computations
- Chip Multi Processors (CMP)
- Thread-Level Parallelism (TLP)
[Charts: uniprocessor dynamic power breakdown (gate, interconnect, diffusion), Magen et al., SLIP 2004; uniprocessor performance vs. die area (or power)]
29 3) NoC and CMP
- Uniprocessors cannot provide power-efficient performance growth
- Interconnect dominates dynamic power
- Global wire delay doesn't scale
- Instruction-level parallelism is limited
- Power-efficiency requires many parallel local computations
- Chip Multi Processors (CMP)
- Thread-Level Parallelism (TLP)
- Network is a natural choice for CMP!
30 3) NoC and CMP
Network is a natural choice for CMP!
- Uniprocessors cannot provide power-efficient performance growth
- Interconnect dominates dynamic power
- Global wire delay doesn't scale
- Instruction-level parallelism is limited
- Power-efficiency requires many parallel local computations
- Chip Multi Processors (CMP)
- Thread-Level Parallelism (TLP)
31 Why is now the time for NoC?
Difficulty of DSM wire design
Productivity pressure
CMPs
32 Traffic abstractions
- Traffic models are generally captured from actual traces of functional simulation
- A statistical distribution is often assumed for messages
33 Data abstractions
34 Layers of abstraction in network modeling
- Software layers
- Application, OS
- Network and transport layers
- Network topology: e.g. crossbar, ring, mesh, torus, fat tree, ...
- Switching: circuit / packet switching (SAF, VCT), wormhole
- Addressing: logical/physical, source/destination, flow, transaction
- Routing: static/dynamic, distributed/source, deadlock avoidance
- Quality of Service: e.g. guaranteed-throughput, best-effort
- Congestion control, end-to-end flow control
- Data link layer
- Flow control (handshake)
- Handling of contention
- Correction of transmission errors
- Physical layer
- Wires, drivers, receivers, repeaters, signaling, circuits, ...
35 How to select an architecture?
- Architecture choices depend on system needs.
[Chart: flexibility vs. reconfiguration rate (at design time, at boot time, during run time), positioning ASIC, ASSP, FPGA, and CMP/multicore between single-application and general-purpose/embedded systems]
36 How to select an architecture?
- Architecture choices depend on system needs.
- A large range of solutions!
[Chart: flexibility vs. reconfiguration rate (at design time, at boot time, during run time), positioning ASIC, ASSP, FPGA, and CMP/multicore between single-application and general-purpose/embedded systems]
37 Example: OASIS
- ASIC assumed
- Traffic requirements are known a priori
- Features:
- Packet switching: wormhole
- Quality of service
- Mesh topology
K. Mori, A. Ben Abdallah, and K. Kuroda, "Design and Evaluation of a Complexity Effective Network-on-Chip Architecture on FPGA", The 19th Intelligent System Symposium (FAN 2009), pp. 318-321, Sep. 2009.
S. Miura, A. Ben Abdallah, and K. Kuroda, "PNoC: Design and Preliminary Evaluation of a Parameterizable NoC for MCSoC Generation and Design Space Exploration", The 19th Intelligent System Symposium (FAN 2009), pp. 314-317, Sep. 2009.
38 Perspective 1: NoC vs. Bus
NoC:
- Aggregate bandwidth grows
- Link speed unaffected by N
- Concurrent spatial reuse
- Pipelining is built-in
- Distributed arbitration
- Separate abstraction layers
- However:
- No performance guarantee
- Extra delay in routers
- Area and power overhead?
- Modules need NI
- Unfamiliar methodology
Bus:
- Bandwidth is limited, shared
- Speed goes down as N grows
- No concurrency
- Pipelining is tough
- Central arbitration
- No layers of abstraction (communication and computation are coupled)
- However:
- Fairly simple and familiar
39 Perspective 2: NoC vs. Off-chip Networks
Off-chip networks:
- Cost is in the links
- Latency is tolerable
- Traffic/applications unknown
- Changes at runtime
- Adherence to networking standards
NoC:
- Sensitive to cost
- area
- power
- Wires are relatively cheap
- Latency is critical
- Traffic may be known a priori
- Design-time specialization
- Custom NoCs are possible
40 VLSI CAD problems
- Application mapping
- Floorplanning / placement
- Routing
- Buffer sizing
- Timing closure
- Simulation
- Testing
41 VLSI CAD problems in NoC
- Application mapping (map tasks to cores)
- Floorplanning / placement (within the network)
- Routing (of messages)
- Buffer sizing (size of FIFO queues in the routers)
- Timing closure (link bandwidth capacity allocation)
- Simulation (network simulation, traffic/delay/power modeling)
- Other NoC design problems (topology synthesis, switching, virtual channels, arbitration, flow control, ...)
42 Typical NoC design flow
Place Modules
Determine routing and adjust link capacities
43 Timing closure in NoC
- Too low capacity results in poor QoS
- Too high capacity wastes area
- Uniform link capacities are a waste in an ASIP system
44 Network delay modeling
- Analysis of mean packet delay in a wormhole network
- Multiple virtual channels
- Different link capacities
- Different communication demands
45 NoC design requirements
- High-performance interconnect
- High throughput, low latency, low power, small area
- Complex functionality (performance again)
- Support for virtual channels
- QoS
- Synchronization
- Reliability, high throughput, low latency
46 ISO/OSI network protocol stack model
47 Part II
- NoC topologies
- Switching strategies
- Routing algorithms
- Flow control schemes
- Clocking schemes
- QoS
- Basic building blocks
- Status and open problems
48 NoC Topology
The connection map between PEs
- Adopted from large-scale networks and parallel computing
- Topology classifications:
- Direct topologies
- Indirect topologies
49 Direct topologies
- Each switch (SW) is connected to a single PE
- As the number of nodes in the system increases, the total bandwidth also increases
50 Direct topologies: Mesh
- 2D mesh is most popular
- All links have the same length
- Eases physical design
- Area grows linearly with the number of nodes
4x4 Mesh
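As a sketch of how packets traverse such a mesh, a deterministic dimension-order (XY) route can be computed; XY routing is a common choice for 2D meshes, though not specified on this slide. Hop count equals the Manhattan distance between source and destination.

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: travel along X to
    the destination column first, then along Y to the destination row."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                 # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# In a 4x4 mesh, (0,0) -> (3,2): 6 nodes on the path, i.e. 5 hops
print(xy_route((0, 0), (3, 2)))
```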
51 Direct topologies: Torus and Folded Torus
- Torus: similar to a regular mesh, with wrap-around links
- Excessive delay problem due to the long end-around connections
- Folded torus: overcomes the long-link limitation of a 2-D torus
- All links have the same size
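The effect of the wrap-around links on hop count can be sketched as follows; `torus_distance` is a hypothetical helper assuming minimal routing on a k x k torus:

```python
def torus_distance(src, dst, k=4):
    """Minimal hop count between two nodes of a k x k 2D torus:
    the wrap-around link lets traffic go either way around each ring."""
    def ring(a, b):
        d = abs(a - b)
        return min(d, k - d)       # shorter way around the ring
    return ring(src[0], dst[0]) + ring(src[1], dst[1])

# Corner to corner in a 4x4 torus: 2 hops, versus 6 hops in a 4x4 mesh
print(torus_distance((0, 0), (3, 3)))
```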
52 Direct topologies: Octagon
- Messages sent between any 2 nodes require at most two hops
- More octagons can be tiled together to accommodate larger designs
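The two-hop claim can be checked by breadth-first search; the structure assumed here is the common octagon form of an 8-node ring plus a link between each pair of opposite nodes:

```python
from collections import deque

# Octagon: 8 nodes on a ring, plus a link between opposite nodes
def neighbors(i):
    return {(i + 1) % 8, (i - 1) % 8, (i + 4) % 8}

def hops(src, dst):
    """Shortest hop count between two octagon nodes, by BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist[dst]

worst = max(hops(s, d) for s in range(8) for d in range(8))
print(worst)   # network diameter: any two nodes are at most 2 hops apart
```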
53 Indirect topologies
A set of PEs is connected to a switch (router).
- Fat tree topology
- Nodes are connected only to the leaves of the tree
- More links near the root, where bandwidth requirements are higher
54 Indirect topologies: k-ary n-fly butterfly network
- Blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
Example: 2-ary 3-fly butterfly network
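The sizing arithmetic for a k-ary n-fly can be sketched, assuming the usual construction (k^n terminals on each side, n stages of k x k crossbars):

```python
# Size of a k-ary n-fly butterfly: k**n terminals, n stages,
# each stage containing k**(n-1) k x k crossbar switches.
def kary_nfly(k, n):
    return {
        "terminals": k ** n,
        "stages": n,
        "switches_per_stage": k ** (n - 1),
        "total_switches": n * k ** (n - 1),
    }

# The 2-ary 3-fly of the example: 8 terminals, 3 stages of 4 switches
print(kary_nfly(2, 3))
```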
55 Indirect topologies: (m, n, r) symmetric Clos network
- 3-stage network in which each stage is made up of a number of crossbar switches
- m: number of middle-stage switches
- n: number of input/output nodes on each input/output switch
- r: number of input and output switches
- Example: (3, 3, 4) Clos network
- Non-blocking network
- Expensive (several full crossbars)
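The cost point can be made concrete by counting crosspoints, assuming each stage is built from full crossbars:

```python
# Crosspoint count of an (m, n, r) symmetric Clos network, versus a
# single full crossbar over the same r*n ports.
def clos_crosspoints(m, n, r):
    ingress = r * (n * m)    # r input switches, each an n x m crossbar
    middle  = m * (r * r)    # m middle switches, each an r x r crossbar
    egress  = r * (m * n)    # r output switches, each an m x n crossbar
    return ingress + middle + egress

m, n, r = 3, 3, 4            # the (3, 3, 4) example above: 12 ports
ports = n * r
print(clos_crosspoints(m, n, r), "vs full crossbar", ports * ports)
```

Even at this small size the Clos network needs fewer crosspoints than a monolithic crossbar, and the savings grow with the port count.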
56 Indirect topologies: Benes network
- Rearrangeable network in which paths may have to be rearranged to provide a connection, requiring an appropriate controller
- Clos topology composed of 2 x 2 switches
Example: (2, 2, 4) rearrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches.
57 Irregular topologies: Customized
- Customized for an application
- Usually a mix of shared bus, direct, and indirect network topologies
[Figures: Example 1: reduced mesh; Example 2: cluster-based hybrid topology, with PEs attached to switches and some clusters of PEs sharing a switch]
58 Example 1: Partially irregular 2D-mesh topology
- Contains oversized, rectangularly shaped PEs.
59 Example 2: Irregular Mesh
- This kind of chip does not limit the shape of the PEs or the placement of the routers. It may be considered a "custom" NoC.
60 How to Select a Topology?
- The application decides the topology type
- If the number of PEs is a few tens:
- Star, Mesh topologies are recommended
- If the number of PEs is 100 or more:
- Hierarchical Star, Mesh are recommended
- Some topologies are better for certain designs than others
- Most of the time, when one topology is better in performance, it is worse in power consumption!!
61 Part II
- NoC topologies
- NoC switching strategies
- Routing algorithms
- Flow control schemes
- Clocking schemes
- QoS
- Basic building blocks
- Status and open problems
62 NoC Switching Strategies
Switching determines how flits and packets flow through routers in the network
- There are two basic modes:
- Circuit switching
- Packet switching
63 Circuit Switching
- Network resources (channels) are reserved before a packet is sent
- The entire path must be reserved first
- The packets do not contain routing information, but rather data and information about the data.
- Circuit-switched networks require no overhead for packetisation, packet header processing, or packet buffering
64 Circuit Switching
[Timing diagram: a header traverses routers R1, R2, R3 to reserve the path (routing and switching delay at each router), an ACK returns to the source, then the data is transferred; total latency = setup time + transfer time]
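The timing above can be captured in a first-order latency model; this is a sketch only, and all parameter names and values below are illustrative assumptions, not figures from the slides:

```python
# First-order latency model for circuit switching over a path of
# `hops` routers (illustrative parameters, all times in cycles).
def circuit_latency(hops, t_router, t_link, packet_flits, flit_time):
    # Setup: the header incurs routing + switching delay at each router,
    # then the ACK returns over the already-reserved path (link delay only)
    setup = hops * (t_router + t_link) + hops * t_link
    # Transfer: flits stream over the reserved circuit with no per-hop
    # arbitration, after an initial pipeline fill of the path
    transfer = packet_flits * flit_time + hops * t_link
    return setup + transfer

print(circuit_latency(hops=3, t_router=2, t_link=1,
                      packet_flits=8, flit_time=1))
```

The model makes the trade-off on the next slide visible: the setup term is paid once per circuit, so it is amortized well by long streams but dominates for short packets.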
65 Circuit Switching
- Once the circuit is set up, router latency and control overheads are very low
- Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
- More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks
66 Packet Switching
- We can aim to make better use of channel resources by buffering packets. We then arbitrate for access to network resources dynamically.
- We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and the conditions that must be met for a packet to advance to the next node
67 Packet Switching
Packet-buffer flow control (L = packet length in flits):
- Store-and-forward (SaF): advance when the entire packet is buffered and L free flit buffers exist at the next node
- Cut-through: advance when L free flit buffers exist at the next node
Flit-buffer flow control:
- Wormhole: can advance when at least one flit buffer is available
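The three advance conditions can be stated as simple predicates over the state of the next-hop input buffer (a sketch; L is the packet length in flits):

```python
def saf_can_advance(flits_buffered_here, free_flit_slots, L):
    # Store-and-forward: the whole packet has been received at this
    # node AND the next node has room for all of it
    return flits_buffered_here == L and free_flit_slots >= L

def cut_through_can_advance(free_flit_slots, L):
    # Cut-through: room for the whole packet at the next node,
    # but forwarding may start before the tail arrives here
    return free_flit_slots >= L

def wormhole_can_advance(free_flit_slots):
    # Wormhole: a single free flit buffer at the next node is enough
    return free_flit_slots >= 1

L = 4
print(saf_can_advance(4, 4, L),
      cut_through_can_advance(2, L),
      wormhole_can_advance(2))
```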
68 Packet Switching: Store and Forward (SAF)
- A packet is sent from one router to the next only if the receiving router has buffer space for the entire packet
- Buffer size in the router is at least equal to the size of a packet
[Figure: SAF switching, packets forwarded packet by packet between switches, each switch holding a whole packet (header flit followed by data flits) in its buffer]
69 Packet switching: Wormhole (WH)
- A flit is forwarded to a router if space exists for that flit
- Parts of the packet can be distributed among two or more routers
- Buffer requirements are reduced to one flit, instead of an entire packet
[Figure: WH switching, flits forwarded flit by flit, with the flits of one packet (header flit followed by data flits) spread across several switches]
70 Packet switching: Virtual Channel (VC)
- Improves performance of WH routing; prevents a single packet from blocking a free channel
- e.g. if the green packet is blocked, the red packet may still make progress through the network
- We can interleave flits from different packets over the same channel
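A minimal sketch of the interleaving idea: round-robin arbitration between two virtual channels sharing one physical link, where a blocked VC does not stall the other (the VC names and queue contents are illustrative):

```python
from collections import deque

def link_schedule(vcs, blocked, cycles):
    """Round-robin flit scheduling over one physical link.
    vcs maps VC name -> queue of flits; blocked VCs cannot send."""
    names = list(vcs)
    sent, rr = [], 0
    for _ in range(cycles):
        for i in range(len(names)):              # round-robin scan
            name = names[(rr + i) % len(names)]
            if vcs[name] and name not in blocked:
                sent.append(vcs[name].popleft())
                rr = (names.index(name) + 1) % len(names)
                break
    return sent

vcs = {"green": deque(["g0", "g1"]), "red": deque(["r0", "r1"])}
# The green packet is blocked, yet red flits still cross the link
print(link_schedule(vcs, blocked={"green"}, cycles=3))
```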
71 Part II
- NoC topologies
- NoC switching strategies
- Routing algorithms
- Flow control schemes
- Clocking schemes
- QoS
- Basic building blocks
- Status and open problems