Implementation Analysis of NoC: A MPSoC Trace-Driven Approach


1
Implementation Analysis of NoC: A MPSoC Trace-Driven Approach

Sergio Tota¹, Mario R. Casu¹, Luca Macchiarulo²

¹ Politecnico di Torino
² University of Hawaii
2
Outline

Motivations
Network-on-Chip Paradigm
Definitions and Terminology
NoC Topologies and Routing Strategies
Switch Design
Trace-Based Emulation Experiments
Results
Conclusions

3
Motivations

The number of processors on the same die increases at each technology node (MPSoC)
This trend creates the need for a scalable communication infrastructure
Shared buses are not a long-term solution
On-chip micronetworks better suit the demand for scalability and performance

4
Network-on-Chip (NoC)

On-chip networks inherit some of the features of computer networks
New constraints emerge for the on-chip implementation (e.g. area, power)
The NoC characteristics depend on the choice of topology and routing strategy
Regular on-chip networks facilitate modular design and improve performance

5
Definitions and Terminology

Flit: the elementary unit of information exchanged in the communication network in one clock cycle.
Packet: an element of information that a processing element (PE) sends to another PE. A packet may consist of a variable number of flits.
Switch: the component of the network that is in charge of flit routing.

6
Definitions and Terminology (cont'd)

Flit Latency: the time needed for a flit to reach its target PE from its source PE.
Packet Latency: the time needed for a packet to reach its target PE from its source PE.
Packet Spread: the time from the reception of the first flit of a packet to the reception of the last one.
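
The three metrics above can be sketched in a few lines of Python. This is only an illustration, not the authors' measurement code: the `PacketTrace` record, its field names, and the choice to measure packet latency from the first injected flit to the last received flit are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PacketTrace:
    """Hypothetical per-packet record: one entry per flit, in clock cycles."""
    flit_inject: List[int]  # cycle at which each flit leaves the source PE
    flit_arrive: List[int]  # cycle at which each flit reaches the target PE


def flit_latencies(p: PacketTrace) -> List[int]:
    # Flit latency: arrival cycle minus injection cycle, per flit.
    return [a - i for i, a in zip(p.flit_inject, p.flit_arrive)]


def packet_latency(p: PacketTrace) -> int:
    # Assumption: measured from the first injected flit to the last received flit.
    return max(p.flit_arrive) - min(p.flit_inject)


def packet_spread(p: PacketTrace) -> int:
    # Packet spread: first received flit to last received flit.
    return max(p.flit_arrive) - min(p.flit_arrive)


trace = PacketTrace(flit_inject=[0, 1, 2, 3], flit_arrive=[5, 6, 9, 8])
print(flit_latencies(trace), packet_latency(trace), packet_spread(trace))
```

Taking `max` and `min` over arrival times, rather than the first and last list entries, keeps the spread meaningful even when flits arrive out of order.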

7
Network Topology
[Figure: 4 x 4 mesh, nodes numbered 1 to 16. Panels: Mesh, Physical implementation]
8
Network Topology (cont'd)
[Figure: 4 x 4 torus, nodes numbered 1 to 16. Panels: Torus, Physical implementation]
9
Routing

What is important to us:
Minimum latency is of paramount importance in MP-SoC (interprocess communication)
Ideally 1 clock cycle of latency per switch (a flit enters at time t and exits at t+1)
Maximum switch clock frequency (limited by technology and routing logic)
Deadlock free
No flits are ever lost: once a flit is injected into the NoC, it will eventually reach its destination

10
Routing Strategy: Wormhole

In wormhole routing a header flit digs the hole
Successive flits are routed in the same direction
If flits can block and the NoC is lossless, we need buffers and a backpressure mechanism (unless the FIFOs are infinite)
We assume X-Y static routing (first X then Y, proven deadlock free)
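
As a minimal sketch of the "first X then Y" rule, the Python below picks the next output direction for a header flit on a mesh. The coordinate convention, the port names (`E`, `W`, `N`, `S`, `LOCAL`), and the function names are assumptions for this example and do not come from the presented RTL design.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a switch in the mesh (assumed convention)

# Displacement associated with each output direction (assumed naming).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_hop(cur: Coord, dst: Coord) -> str:
    """Static X-Y routing: correct the X offset first, then the Y offset."""
    if cur[0] != dst[0]:
        return "E" if dst[0] > cur[0] else "W"
    if cur[1] != dst[1]:
        return "N" if dst[1] > cur[1] else "S"
    return "LOCAL"  # the flit has reached its destination switch


def xy_path(src: Coord, dst: Coord) -> List[Coord]:
    """List of switches visited by a header flit from src to dst."""
    path = [src]
    cur = src
    while cur != dst:
        dx, dy = STEP[xy_next_hop(cur, dst)]
        cur = (cur[0] + dx, cur[1] + dy)
        path.append(cur)
    return path


print(xy_path((0, 0), (2, 1)))
```

Because the route depends only on source and destination, every flit of a packet follows the same path, which is why wormhole with X-Y routing does not reorder packets.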

11
Worm-Hole
[Animation, slides 11 to 22: a packet of flits HF, F2, F3, F4, TF moves from Src to Dest]
23
Routing Strategy: Deflection Routing

Every flit can be routed in a different direction (no notion of packet at the switch level)
If the optimal direction is blocked, the flit is deflected in another direction
Switch latency of 1 clock cycle regardless of congestion
Minimum buffer requirements
A.K.A. Hot Potato, deadlock free by construction
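
The per-cycle port assignment behind deflection routing can be sketched as follows. This is a simplified model under stated assumptions, not the switch logic of the presentation: it considers an interior mesh switch with four ports, ignores local injection and ejection, and gives priority to flits in list order; all names are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]
PORTS = ["N", "E", "S", "W"]  # output ports of an interior switch (assumed naming)


def preferred_ports(cur: Coord, dst: Coord) -> List[str]:
    """Output ports that bring a flit closer to dst (empty if cur == dst)."""
    prefs = []
    if dst[0] > cur[0]:
        prefs.append("E")
    elif dst[0] < cur[0]:
        prefs.append("W")
    if dst[1] > cur[1]:
        prefs.append("N")
    elif dst[1] < cur[1]:
        prefs.append("S")
    return prefs


def assign_outputs(cur: Coord, flit_dsts: List[Coord]) -> List[str]:
    """Give every incoming flit a distinct output port in the same cycle.

    A flit gets a preferred port if one is still free; otherwise it is
    deflected to any free port. No flit is ever buffered or dropped.
    """
    assert len(flit_dsts) <= len(PORTS), "at most one flit per input port"
    free = list(PORTS)
    outputs = []
    for dst in flit_dsts:
        choice = next((p for p in preferred_ports(cur, dst) if p in free), None)
        if choice is None:
            choice = free[0]  # deflection: take whatever port is left
        free.remove(choice)
        outputs.append(choice)
    return outputs


# Two flits want to go east; the second one is deflected.
print(assign_outputs((1, 1), [(3, 1), (2, 1), (1, 0)]))
```

Since a switch has as many outputs as inputs, a free port always exists, which is why the switch latency stays at one cycle whatever the congestion.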

24
Hot-Potato
[Animation, slides 24 to 33: a packet of flits HF, F2, F3, TF moves from Src to Dest]
34
Routing Techniques

                    Wormhole                 Hot-Potato
Packet reordering   No                       Yes
Routing             Static                   Adaptive
Buffering           ≥ 2 flits/port           None
Back pressure       Yes                      No
Topology            XY routing needs mesh    Works with torus/mesh
35
Switch logic scheme
36
Physical Implementation (IBM CMOS 0.13 µm @ 500 MHz), 160 Gbit/s

                   Deflection-Routing   Wormhole 10²   Wormhole 2¹
Area @ 64 bits     0.038 mm²            0.273 mm²      0.086 mm²
Area @ 128 bits    0.07 mm²             0.502 mm²      0.140 mm²
Area @ 256 bits    0.14 mm²             0.910 mm²      0.234 mm²
Power @ 64 bits    29 µW/MHz            190 µW/MHz     54 µW/MHz
Power @ 128 bits   52 µW/MHz            380 µW/MHz     92 µW/MHz
Power @ 256 bits   102 µW/MHz           655 µW/MHz     171 µW/MHz

¹ Two buffers per port
² Ten buffers per port
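
To read the power figures at the stated 500 MHz clock, the short Python sketch below multiplies the per-MHz values by the frequency. It assumes that the three values listed for each row belong, in order, to Deflection-Routing, Wormhole with ten buffers per port, and Wormhole with two buffers per port; the dictionary layout and function name are illustrative only.

```python
CLOCK_MHZ = 500  # clock frequency stated on the slide

# Power in uW/MHz per flit width (bits); column assignment is an assumption.
POWER_UW_PER_MHZ = {
    "Deflection-Routing": {64: 29, 128: 52, 256: 102},
    "Wormhole, 10 buffers/port": {64: 190, 128: 380, 256: 655},
    "Wormhole, 2 buffers/port": {64: 54, 128: 92, 256: 171},
}


def power_mw(uw_per_mhz: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Convert a uW/MHz figure into mW at the given clock frequency."""
    return uw_per_mhz * clock_mhz / 1000


for design, by_width in POWER_UW_PER_MHZ.items():
    row = ", ".join(f"{w} bits: {power_mw(p)} mW" for w, p in by_width.items())
    print(f"{design}: {row}")
```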
37
Emulation environment
38
Emulation Environment

Statistically generated traffic is not realistic
Stanford SPLASH-2 multiprocessor benchmark (radix, lu, fft, ocean, raytrace)
RSIM cycle-accurate simulator (UIUC) to extract traffic traces between processors
Simulation in VHDL using behavioural traffic generators and RTL NoC implementations
1 dual-Opteron and 5 dual-Xeon servers with 64-bit Linux, Modelsim, Synopsys and Encounter

39
Experiment parameters

NoC size: N x N, N = 2 and N = 4
Packet size: 10 flits
Wormhole buffer size: 2 flits/port (minimum allowed value)
Standard simulation: PE computation time 1x
Accelerated simulation: PE computation time 20x, same NoC clock frequency (worse traffic conditions)
WH = wormhole, HP = hot potato

40
Results
Mesh, uniform traffic, ideal latency: mean lat. 2N/3, peak lat. 2N - 2
Torus, uniform traffic, ideal latency: mean lat. N/2, peak lat. N
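
The ideal figures above can be checked by brute force. The sketch below enumerates every source/destination pair on an N x N mesh or torus and reports the mean and peak hop count. It assumes one clock cycle per hop and includes pairs where source and destination coincide; the function name is made up for this example.

```python
from itertools import product
from typing import Tuple


def hop_stats(n: int, torus: bool) -> Tuple[float, int]:
    """Mean and peak hop distance over all (source, destination) pairs."""

    def dist_1d(a: int, b: int) -> int:
        d = abs(a - b)
        # On a torus the wrap-around link can shorten the trip.
        return min(d, n - d) if torus else d

    nodes = list(product(range(n), repeat=2))
    dists = [
        dist_1d(sx, dx) + dist_1d(sy, dy)
        for (sx, sy), (dx, dy) in product(nodes, nodes)
    ]
    return sum(dists) / len(dists), max(dists)


for n in (2, 4):
    print(n, "mesh", hop_stats(n, torus=False), "torus", hop_stats(n, torus=True))
```

For N = 4 this gives a mean of 2.5 and a peak of 6 on the mesh, and a mean of 2.0 and a peak of 4 on the torus. The peaks match 2N - 2 and N exactly; the mesh mean is 2(N² - 1)/(3N), which approaches 2N/3 as N grows.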
41
Results
Ideal packet latency, packet size 10
42
Comments

Peak flit latency occurs sporadically (peak >> average)
WH and HP show comparable latency (both peak and average)
HP packet mean latency ≈ 10: flits arrive in order (with few exceptions)
No substantial differences between standard and accelerated simulations
Benchmark execution time overhead due to NoC congestion is negligible
Overall, WH is neither better nor worse than HP

43
Conclusions

RTL NoC based on Worm-Hole and Hot Potato
Switch design and synthesis on 0.13 µm CMOS
500 MHz clock frequency
Real MP-SoC trace simulations
WH and HP show similar performance, but HP needs less area and power
The strength of HP is expected to emerge under higher load
Need for benchmarks that generate higher traffic
Ongoing work:
New RTL design working at 700 MHz
Reconstruction interface for HP implemented