Title: Implementation Analysis of NoC: A MPSoC Trace-Driven Approach
- Sergio Tota¹, Mario R. Casu¹, Luca Macchiarulo²
¹ Politecnico di Torino
² University of Hawaii
Outline
- Motivations
- Network-on-Chip Paradigm
- Definitions and Terminology
- NoC Topologies and Routing Strategies
- Switch Design
- Trace-Based Emulation Experiments
- Results
- Conclusions
Motivations
- The number of processors on the same die increases at each technology node (MPSoC)
- This trend calls for a scalable communication infrastructure
- Shared buses are not a long-term solution
- On-chip micronetworks better suit the demand for scalability and performance
Network-on-Chip (NoC)
- On-chip networks inherit some of the features of computer networks
- New constraints emerge for the on-chip implementation (e.g. area, power)
- The NoC characteristics depend on the choice of topology and routing strategy
- Regular on-chip networks facilitate modular design and improve performance
Definitions and Terminology
- Flit: The elementary unit of information exchanged in the communication network in a clock cycle.
- Packet: An element of information that a processing element (PE) sends to another PE. A packet may consist of a variable number of flits.
- Switch: The component of the network that is in charge of flit routing.
Definitions and Terminology (cont'd)
- Flit Latency: The time needed for a flit to reach its target PE from its source PE.
- Packet Latency: The time needed for a packet to reach its target PE from its source PE.
- Packet Spread: The time from the reception of the first flit of a packet to the reception of the last one.
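The three metrics can be sketched from per-flit timestamps. This is a minimal illustration, not the measurement code used in the experiments: the function name `packet_metrics` and its arguments are hypothetical, times are in clock cycles, and packet latency is assumed to run from the injection of the first flit to the reception of the last one (the slides do not spell this out).

```python
def packet_metrics(inject_cycles, receive_cycles):
    """Return (flit latencies, packet latency, packet spread) in clock cycles."""
    if not inject_cycles or len(inject_cycles) != len(receive_cycles):
        raise ValueError("need one injection and one reception time per flit")
    # Flit latency: reception time minus injection time, per flit.
    flit_latencies = [rx - tx for tx, rx in zip(inject_cycles, receive_cycles)]
    # Packet latency (assumed reading): first injection to last reception.
    packet_latency = max(receive_cycles) - min(inject_cycles)
    # Packet spread: first reception to last reception.
    packet_spread = max(receive_cycles) - min(receive_cycles)
    return flit_latencies, packet_latency, packet_spread


# Three flits injected back to back; the last one is delayed in the network.
print(packet_metrics([0, 1, 2], [5, 6, 9]))
```

Using `max` and `min` on the reception times keeps the spread meaningful even when flits arrive out of order, which can happen with deflection routing.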
Network Topology
[Figure: 16-node mesh, nodes numbered 1 to 16, shown as logical topology ("Mesh") and as "Physical implementation"]
Network Topology (cont'd)
[Figure: 16-node torus, nodes numbered 1 to 16, shown as logical topology ("Torus") and as "Physical implementation"]
Routing
- What is important to us:
- Minimum latency is of paramount importance in MP-SoC (interprocess communication)
- Ideally 1 clock latency per switch (flit enters at time t and exits at t+1)
- Maximum switch clock frequency (technology and routing logic limits)
- Deadlock free
- No flits are ever lost: once a flit is injected in the NoC, it will eventually reach its destination
Routing Strategy: Wormhole
- In wormhole routing a header flit "digs the hole"
- Successive flits are routed in the same direction
- To handle blocking in a lossless NoC we need:
- Buffers
- A backpressure mechanism (unless you have infinite FIFOs)
- We assume X-Y static routing (first X then Y, proven deadlock free)
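The "first X then Y" rule can be sketched as a per-switch decision. This is an illustrative model only, not the RTL switch: the coordinate convention (x grows to the east, y grows to the north), the port names and the helper `xy_path` are assumptions made for the example.

```python
def xy_next_hop(cur, dst):
    """Output port chosen by X-Y routing at switch `cur` for a flit going to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # first correct the X coordinate
        return "E" if dx > cx else "W"
    if cy != dy:                       # then correct the Y coordinate
        return "N" if dy > cy else "S"
    return "LOCAL"                     # arrived: deliver to the attached PE


# Coordinate change produced by leaving through each port (assumed convention).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_path(src, dst):
    """List of ports taken from `src` to `dst`, ending with LOCAL."""
    cur, ports = src, []
    while True:
        port = xy_next_hop(cur, dst)
        ports.append(port)
        if port == "LOCAL":
            return ports
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])


print(xy_path((0, 0), (2, 1)))
```

Because the route depends only on the current and destination coordinates, every flit of a packet takes the same path, which is what lets the body flits follow the header.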
Worm-Hole
[Animation, 12 frames: the flits of one packet (HF, F2, F3, F4, TF) advance from Src to Dest]
Routing Strategy: Deflection Routing
- Every flit can be routed in a different direction (no packet notion at the switch level)
- If the optimal direction is blocked, the flit is deflected to another direction
- Switch latency of 1 clock cycle regardless of congestion
- Minimum buffer requirements
- A.K.A. Hot Potato, deadlock free by construction
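A deflection decision for one clock cycle can be sketched as follows. This is a simplified model under stated assumptions, not the switch described in the slides: every switch is assumed to have all four links, flits are served in list order (the real priority rule is not given), ejection to the local PE is assumed never to block, and productive directions are computed without wrap-around links for brevity. All names are hypothetical.

```python
PORTS = ["N", "E", "S", "W"]


def productive_ports(cur, dst):
    """Ports that move a flit closer to `dst` (X listed before Y)."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx != cx:
        ports.append("E" if dx > cx else "W")
    if dy != cy:
        ports.append("N" if dy > cy else "S")
    return ports


def deflect(cur, dests):
    """Assign one distinct output port per incoming flit, in list order."""
    if len(dests) > len(PORTS):
        raise ValueError("at most one incoming flit per port")
    free = list(PORTS)
    chosen = []
    for dst in dests:
        if dst == cur:
            chosen.append("LOCAL")     # assumed: ejection never blocks
            continue
        wanted = [p for p in productive_ports(cur, dst) if p in free]
        # Take a productive port if one is free, otherwise deflect.
        port = wanted[0] if wanted else free[0]
        free.remove(port)
        chosen.append(port)
    return chosen


# Two flits both want to go east: the second one is deflected.
print(deflect((1, 1), [(3, 1), (2, 1)]))
```

Since there are never more incoming flits than output ports, every flit leaves in the same cycle, so no flit has to be stored while waiting for a port.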
Hot-Potato
[Animation, 10 frames: the flits of one packet (HF, F2, F3, TF) travel from Src to Dest]
Routing Techniques
Wormhole
- No packet reordering
- Static routing
- Buffering (≥ 2 flits/port)
- Back pressure
- XY routing needs mesh
Hot-Potato
- Packet reordering
- Adaptive routing
- No buffering
- No back pressure
- Works with torus/mesh
Switch logic scheme
Physical Implementation (IBM CMOS 0.13 µm @ 500 MHz), 160 Gbit/s

                     Deflection-Routing   Wormhole 10²   Wormhole 2¹
Area @ 64 bits       0.038 mm²            0.273 mm²      0.086 mm²
Area @ 128 bits      0.07 mm²             0.502 mm²      0.140 mm²
Area @ 256 bits      0.14 mm²             0.910 mm²      0.234 mm²
Power @ 64 bits      29 µW/MHz            190 µW/MHz     54 µW/MHz
Power @ 128 bits     52 µW/MHz            380 µW/MHz     92 µW/MHz
Power @ 256 bits     102 µW/MHz           655 µW/MHz     171 µW/MHz

¹ Two buffers per port
² Ten buffers per port
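The per-MHz power figures can be converted to absolute power at the 500 MHz clock with a short calculation. The sketch below uses the 64-bit row only; the names are hypothetical. The `aggregate_gbit_s` helper is a guess at how the 160 Gbit/s figure might arise (five ports, one 64-bit flit per port per cycle); the slides do not state this derivation.

```python
CLOCK_MHZ = 500  # clock frequency from the slide title

# Power at 64 bits, in uW/MHz, copied from the table.
POWER_UW_PER_MHZ_64 = {"deflection": 29, "wormhole-10": 190, "wormhole-2": 54}


def power_mw(switch, clock_mhz=CLOCK_MHZ):
    """Switch power in mW at the given clock: (uW/MHz) * MHz / 1000."""
    return POWER_UW_PER_MHZ_64[switch] * clock_mhz / 1000.0


def aggregate_gbit_s(width_bits=64, ports=5, clock_mhz=CLOCK_MHZ):
    """Peak throughput if every port moves one flit per cycle (ports=5 is assumed)."""
    return width_bits * ports * clock_mhz / 1000.0


for name in POWER_UW_PER_MHZ_64:
    print(name, power_mw(name), "mW")
print(aggregate_gbit_s(), "Gbit/s")
```

At 64 bits this gives 14.5 mW for the deflection switch against 27 mW and 95 mW for the wormhole switches with two and ten buffers per port.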
Emulation Environment
- Statistically generated traffic is not realistic
- Stanford SPLASH-2 multiprocessor benchmark (radix, lu, fft, ocean, raytrace)
- RSIM cycle-accurate simulator (UIUC) to extract traffic traces between processors
- Simulation in VHDL using behavioural traffic generators and RTL NoC implementations
- 1 dual Opteron and 5 dual Xeon servers with 64-bit Linux, Modelsim, Synopsys and Encounter
Experiment parameters
- NoC size N x N, N = 2 and N = 4
- Packet size: 10 flits
- Wormhole buffer size: 2 flits/port (minimum allowed value)
- Standard simulation: PE computation time 1X
- Accelerated simulation: PE computation time 20X, same NoC clock frequency (worse traffic conditions)
- WH = wormhole, HP = hot potato
Results
- Mesh, uniform traffic, ideal latency: mean latency = (2/3)N, peak latency = 2N-2
- Torus, uniform traffic, ideal latency: mean latency = (1/2)N, peak latency = N
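These ideal figures can be checked by brute force, counting hops between every pair of nodes. The sketch below assumes one clock cycle per hop, uniform traffic over all ordered pairs of distinct nodes, and shortest-path routing; the function name `hop_stats` is hypothetical.

```python
from itertools import product


def hop_stats(n, torus=False):
    """Mean and peak hop distance over all ordered pairs of distinct nodes."""
    def axis_dist(a, b):
        d = abs(a - b)
        return min(d, n - d) if torus else d   # torus links wrap around

    nodes = list(product(range(n), repeat=2))
    dists = [axis_dist(sx, dx) + axis_dist(sy, dy)
             for (sx, sy), (dx, dy) in product(nodes, repeat=2)
             if (sx, sy) != (dx, dy)]
    return sum(dists) / len(dists), max(dists)


for n in (2, 4):
    print(n, "mesh", hop_stats(n), "torus", hop_stats(n, torus=True))
```

Under these assumptions the mesh mean works out to exactly 2N/3 and the peak to 2N-2, while the torus mean is close to N/2 with a peak of N for even N.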
Results
[Figure: ideal packet latency; packet size = 10]
Comments
- Peak flit latency occurs sporadically (peak >> average)
- WH and HP show comparable latency (both peak and average)
- HP packet mean latency ≈ 10: flits arrive in order (with few exceptions)
- No substantial differences between standard and accelerated simulations
- Benchmark execution time overhead due to NoC congestion is negligible
- Overall, WH is neither better nor worse than HP
Conclusions
- RTL NoC based on Worm-Hole and Hot Potato
- Switch design and synthesis on 0.13 µm CMOS
- 500 MHz clock frequency
- Real MP-SoC trace simulations
- WH and HP show similar performance, but HP needs less area and power
- The strength of HP is expected to emerge under higher load; benchmarks that generate higher traffic are needed
- Ongoing work:
- New RTL design working at 700 MHz
- Reconstruction interface for HP implemented