Provided by: cephi

Title: MPhil/Master dissertation

1
HW-SW Co-Design Framework for Parallel
Distributed Computing on NoC-based MPSoC
architectures

MPhil/Master dissertation
presented by
Jaume Joven Murillo
and supervised by
Dr. Jordi Carrabina Bordoll

2
Presentation outline

Introduction
Basic concepts and state of the art in NoCs and
MPSoCs
Design framework and working methodology
HW-SW NoC-based MPSoC implementation
Experimental results
Conclusions and future work

3
1. Introduction

1.1 - Introduction and research project analysis
1.2 - Objectives of the research project

4
Introduction

The continuous evolution of technology
(Moore's law) means that every IC is able to
contain a large number of transistors (until 2020,
according to the SIA roadmap)
Productivity gap
Adopted solutions
Component reuse (IP cores)
Soft-cores processors
HW-SW co-design
Novel design methodologies
Communication centric
Novel on-chip paradigms
Networks-on-Chips (NoCs)
System-level languages
SystemC, UML, ...
Develop complex ICs with billions of transistors
in the near future
Multiprocessor-System-on-Chip (MPSoC) /
Multi-cores / Chip-multiprocessors (CMP)
Sea of tiles (IP cores) interconnected by a
Network-on-Chip

5
Objectives of the research project

Develop a HW-SW co-design framework for parallel
distributed computing on-chip applying
platform-based design concepts
Follows a co-evolution strategy with two concurrent
phases (HW and SW)
Hardware architecture
Scalable Distributed-Memory NoC-based MPSoC
(NUMA)
Software framework
Software drivers
embedded Message Passing Interface (eMPI)
Run benchmarks and test applications
Explore concurrency and parallelism in on-chip
environments

6
2. Basic Concepts and state of the art in NoCs
and MPSoCs

2.1 - On-chip communication schemes
2.2 - Basic concepts on NoCs
2.3 - NoC topologies
2.4 - Switching modes and routing schemes
2.5 - Flow control and micro-network stack
2.6 - State of the art in NoCs/MPSoCs

7
On-chip communication architectures

Point-to-point
Fixed dedicated wires
Not flexible, Not shared
Null reusability
Bus-based interconnection (OCB)
Shared communication infrastructure
Multi-level, hierarchical or segmented buses
Bus becomes a bottleneck
On-chip network (NoC)
Distributed nature
Maximum flexibility and scalability
Exploits reusability and parallel operations/transactions
Regular geometry
Predictable layout and performance
Best testability and verification time
Must guarantee a certain QoS

8
Basic concepts on NoCs

Tile
Computational nodes
Router/Switch
Communication nodes
Switching and routing strategy
Network adapter (NA, NI, NIC)
Decouple computation from communication
Adapts network and tile clock domains (GALS)
Links
Dedicated P2P communication channels
Flow control protocol (Handshake or credit-based)
NoC-based systems
High degree of composition and traffic diversity
Good floorplanning and minimal buffering are
desired
Conventional/Traditional networks
Homogeneous and coarse grained

9
NoC topologies

Typical of multiprocessor systems, but now on a
chip

Regular
Predictable in terms of
Power consumption,
Performance (bandwidth, latency)
Area usage
Good floorplanning
Non-regular
Mixing regular topologies
Mesh-Torus, Ring-Mesh, Ring-hypercube
Direct
At least one tile attached to each node
Indirect
A subset of nodes are not connected to any core
Topology selection is a trade-off between
Network complexity or on-chip area costs
Communication requirements or network performance

10
Switching modes and routing schemes

Circuit switching
Involves the establishment and releasing of a
circuit between source and destination
Buffer-less switching scheme
Packet switching
Forwards the data to next hop
Buffering is mandatory
Different packet switching modes
Store-and-forward
Stall at two nodes and the link between them
Wormhole
Combines packet switching and circuit switching
Reduce buffer size
Stall at all nodes and links spanned by the
packet
Virtual cut-through
Next hop must store the whole packet
Stall at local node

Buffering
Buffer size -> width, depth
Location in the router
Shared or distributed
Affects the power consumption and area usage

11
Switching modes and routing schemes

Routing schemes
Deterministic
Path determined by the source and destination
addresses
Easy to implement
Not optimal under congestion
Adaptive
Path decided on a per-hop basis
Complex in its implementation
Routing must be deadlock-free and livelock-free
Offers great benefits under congestion

12
Flow control and micro-network stack

Flow control protocol (ensures the correct
transport of packets)
Handshake
Request and acknowledge signals (req, ack/nAck)
Simpler and cheaper than credit-based
Credit-based
All network components keep counters for the
available buffer space
Data received -> counter--; data sent -> counter++
If counter == 0 -> buffer full
Network µstack layers
Transport
Network Adapter has to pack/unpack messages into
network layer packets
Network
Where and how a packet is transmitted
Data-link
Protocol to transmit a flit/phit
Physical
Number and length of wires
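
The credit-based scheme on this slide can be sketched in C as a counter of free buffer slots per link endpoint. This is only an illustration of the counting rule, not the hardware from the dissertation; the type `credit_link_t`, the function names and the buffer depth are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical model of one link endpoint.
 * counter = number of free slots in its input buffer (in flits). */
typedef struct {
    int counter;
} credit_link_t;

/* Start with an empty buffer of the given depth. */
void credit_init(credit_link_t *link, int depth) {
    link->counter = depth;
}

/* A flit arrives and takes one slot.
 * Returns false when counter == 0, i.e. the buffer is full. */
bool credit_on_receive(credit_link_t *link) {
    if (link->counter == 0) {
        return false;
    }
    link->counter--;
    return true;
}

/* A buffered flit is forwarded to the next hop, freeing one slot. */
void credit_on_send(credit_link_t *link) {
    link->counter++;
}
```

With a depth of 2, two flits are accepted, a third is refused until one buffered flit has been sent on.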

13
State of the art in NoCs/MPSoCs

NoCs have been an emerging hot topic in recent years
Research at all µstack levels
System/Application Level
Design methodologies, co-exploration
Programming models and OS support
Network Adapters
Network architecture
Link level
Research on MPSoC
HW-SW interfaces
Implementation of parallel programming models
Shared memory or message passing
ccNUMA MPSoC architecture using NIOS-II
MPSoC using segmented buses (HIBI)

14
3. Design framework and working methodology

3.1 - HW-SW Co-design flow
3.2 - Proposed NoC-based MPSoC architecture
3.3 - Prototyping platform

15
HW-SW Co-design flow

System specification
Architecture exploration
µP, VLIW, DSP
NoC routers, busses
NIC interfaces
Architecture design and HW-SW co-design
RTL architecture
IP core integration (Soft-cores)
Software design
Benchmarks/Applications
embedded MPI (eMPI)
NIC driver
Integration and system-verification
SystemC
On-chip co-debugging
Functional prototype

Quartus II SOPC
Microsoft Visual Studio, Eclipse IDE for NIOS-II
ModelSim, GTKWave, SignalTap
Synplify or Quartus II

16
Proposed NoC-based MPSoC architecture

Distributed-memory NoC-based MPSoC components
NoC communication architecture
Soft/Hard IP core processors (Pi)
Distributed memory subsystem (Mi)
Network Interface Controller (NICi)
Driver for Network Interface Controller (NIC
driver)
embedded Message Passing Interface (eMPI)

17
Proposed NoC-based MPSoC architecture

NoC topology
2D-Mesh (regular, predictable)
XY Routing
Deterministic, minimal and deadlock-free
Switching mode
Ephemeral Circuit switching
Store and forward
Flow control
4-phase handshake
Tile composition
NIOS-II Soft-core processor
On-chip RAM or SSRAM controller
NIC interface to NoC
Timer (IRQs, multi-threaded)
UART, JTAG, Performance Counter
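
The deterministic XY routing named on this slide can be sketched as a small C function: a packet first travels along X until the column matches, then along Y. The port names and the coordinate convention (x grows eastwards, y grows southwards) are assumptions chosen for the example, not taken from the router implementation.

```c
#include <stdlib.h>

/* Output ports of a mesh router; names are illustrative. */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* XY routing decision at router (cur_x, cur_y) for a packet
 * addressed to tile (dst_x, dst_y): resolve X first, then Y. */
port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_SOUTH;
    if (dst_y < cur_y) return PORT_NORTH;
    return PORT_LOCAL;  /* arrived: deliver to the attached tile */
}

/* Number of router-to-router hops on the minimal XY path. */
int xy_hops(int src_x, int src_y, int dst_x, int dst_y) {
    return abs(dst_x - src_x) + abs(dst_y - src_y);
}
```

Because every packet finishes its X movement before turning into Y, no cyclic channel dependency can form on a mesh, which is why the scheme is deadlock-free.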

18
Prototyping platform

Stratix EP1S25 DSP prototyping/development board
Altera FPGA Stratix EP1S25F780C5
Contains 25,660 LEs
Includes 1,944,576 bits of on-chip memory
224 - M512 RAM blocks (32x18b)
138 - M4K RAM blocks (128x36b)
2 - M-RAM blocks
6 PLLs
597 maximum user I/O pins
Off-chip memory
2 Mbytes of SSRAM configured as two independent
banks
32 Mbits of flash memory
Other I/O
LEDs, RS232, buttons, switches, 7-segment displays

19
4. HW-SW NoC-based MPSoC implementation

4.1 - NoC-based MPSoC block diagram
4.2 - Communication channel
4.3 - Design of the Network Interface Controller
4.4 - Router design
4.5 - Software design
4.6 - Applications and benchmarks

20
NoC-based MPSoC block diagram

Distributed-memory NoC-based MPSoC based on
NIOS-II soft-core processor
Each NIOS-II Avalon based tile is generated
effortlessly through Quartus II SOPC
Our custom HW design
Implementation of flow control in
each communication channel
Design of Network Interface Controller
Design of the router

21
Communication channel

Implements full-duplex 4-phase handshake protocol
Between NIC-Router or between routers
4-phase is not ambiguous
Two independent and synchronous FSMs have been
designed
Packet definition
The definition of each subfield
XY address, message id, message length, sequence
number, flags, priority
Size of each subfield
Fixes the router and NIC implementation
Our packet format for a 2D-Mesh
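
The 4-phase handshake between two synchronous FSMs can be sketched in C as follows. This is a software model under assumptions: the state names, the `channel_t` structure and the step order (sender evaluated before receiver in each simulated cycle) are invented for illustration and do not reproduce the actual RTL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wires of one direction of the channel. */
typedef struct {
    bool req, ack;
    uint32_t data;
} channel_t;

typedef enum { TX_IDLE, TX_WAIT_ACK, TX_WAIT_RELEASE } tx_state_t;
typedef enum { RX_IDLE, RX_WAIT_RELEASE } rx_state_t;

/* Sender FSM: returns true once all four phases have completed. */
bool tx_step(tx_state_t *s, channel_t *ch, uint32_t flit) {
    switch (*s) {
    case TX_IDLE:                      /* phase 1: drive data, raise req */
        if (!ch->ack) { ch->data = flit; ch->req = true; *s = TX_WAIT_ACK; }
        break;
    case TX_WAIT_ACK:                  /* phase 3: ack seen, lower req */
        if (ch->ack) { ch->req = false; *s = TX_WAIT_RELEASE; }
        break;
    case TX_WAIT_RELEASE:              /* after phase 4: ack released */
        if (!ch->ack) { *s = TX_IDLE; return true; }
        break;
    }
    return false;
}

/* Receiver FSM: latches the flit into *out when req is seen. */
void rx_step(rx_state_t *s, channel_t *ch, uint32_t *out) {
    switch (*s) {
    case RX_IDLE:                      /* phase 2: latch data, raise ack */
        if (ch->req) { *out = ch->data; ch->ack = true; *s = RX_WAIT_RELEASE; }
        break;
    case RX_WAIT_RELEASE:              /* phase 4: req released, lower ack */
        if (!ch->req) { ch->ack = false; *s = RX_IDLE; }
        break;
    }
}

/* Runs both FSMs until one flit is transferred; returns simulated cycles. */
int transfer_flit(channel_t *ch, uint32_t flit, uint32_t *out) {
    tx_state_t tx = TX_IDLE;
    rx_state_t rx = RX_IDLE;
    int cycles = 0;
    bool done = false;
    while (!done) {
        done = tx_step(&tx, ch, flit);
        rx_step(&rx, ch, out);
        cycles++;
    }
    return cycles;
}
```

Both wires return to zero before the next transfer starts, so a level on `req` or `ack` can never be mistaken for a new event, which is the sense in which the 4-phase protocol is unambiguous.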

22
Design of the Network Interface Controller

NIC interface between tiles and routers of our
NoC
Decouples tile computation from the NoC's
communication infrastructure
Key component for achieving a good packet injection rate
over the NoC
Build flits/packets
Bus peripheral (slave)
Polling or IRQs
Register and memory mappings
N1 bits of addressable bus space
Custom instruction (CI-based NIC)
Attached in the processor datapath
Is neither master nor slave

23
Router design

Circuit switching
Ephemeral circuit switching
Two latency cycles
One for XY routing
Another for PathSwitchMatrix

24
Router design

Packet switching
Store and forward
Full or shared/unified output queue
Now, the latency to traverse the router depends
on
FIFO capacity (depth)
Output queue policies

25
Software design

HW-SW platform stack view of our
distributed-memory NIOSII-based MPSOC with 2D
Mesh interconnection strategy
Software components
NIC driver: low-level communication API
eMPI: high-level communication API for message
passing
Optionally, an operating system (OS) might be
included between the HdS (drivers) and the
high-level communication APIs

26
Software design

The NIC software driver contains 3 basic
functions
Interact transparently with a given NIC component
exploiting all HW capabilities

volatile int *NIC    = (volatile int *)(NIC_BASE);
volatile int *NIC_TX = (volatile int *)(NIC_BASE + 0x4);

/* Status register masks:
 *   0x1 -> dataPending
 *   0x2 -> txBusy
 */

/* NIC driver: check NIC status */
int nicStatus(void) { return *(NIC + 0x0); }

/* NIC driver: blocking receive (waits until data is pending) */
int nicRecv(void) { while (!(nicStatus() & 0x1)) ; return *(NIC + 0x1); }

/* NIC driver: blocking send (waits while the transmitter is busy) */
void nicSend(int data, int address) { while (nicStatus() & 0x2) ; *(NIC_TX + address) = data; }

27
Software design

The eMPI software API will be our high-level
design language
Implements message passing over our on-chip
network
Steps to create our eMPI
Select a minimal working subset of standard MPI
functions
MPI_Init(), MPI_Finalize(), MPI_Comm_size(),
MPI_Comm_rank()
MPI_Send(), MPI_Recv()
Porting process from the de facto standard MPI to our
on-chip network
Lightweight message passing interface with low
memory overhead (15-20 KB)
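
As an illustration of how a message passing call can be layered on word-level NIC send and receive functions, here is a self-contained C sketch. Everything in it is hypothetical: the NIC is replaced by an in-memory loopback FIFO, and `empi_send`/`empi_recv` are invented names with a simplified signature, not the eMPI or MPI API. The message framing (one length word followed by the payload words) is also an assumption.

```c
#include <stddef.h>

/* Loopback stand-in for the NIC: a software FIFO of 32-bit words. */
#define FIFO_DEPTH 64
static int fifo[FIFO_DEPTH];
static size_t head, tail;   /* read and write positions, start at 0 */

static void sim_nic_send(int data, int address) {
    (void)address;          /* a real NIC would route on the address */
    fifo[tail++ % FIFO_DEPTH] = data;
}

static int sim_nic_recv(void) {
    return fifo[head++ % FIFO_DEPTH];
}

/* Hypothetical send: one header word with the payload length,
 * followed by the payload. Returns 0 on success, -1 if it does not fit. */
int empi_send(const int *buf, int count, int dest) {
    if (count < 0 || (size_t)count + 1 > FIFO_DEPTH - (tail - head)) {
        return -1;
    }
    sim_nic_send(count, dest);
    for (int i = 0; i < count; i++) {
        sim_nic_send(buf[i], dest);
    }
    return 0;
}

/* Hypothetical receive: reads one framed message into buf.
 * Returns the number of words stored, or -1 if nothing is pending
 * (a real blocking driver would wait instead). */
int empi_recv(int *buf, int max_count) {
    if (head == tail) {
        return -1;
    }
    int count = sim_nic_recv();
    for (int i = 0; i < count; i++) {
        int word = sim_nic_recv();
        if (i < max_count) {
            buf[i] = word;  /* words beyond max_count are dropped */
        }
    }
    return count < max_count ? count : max_count;
}
```

Sending three words and then receiving into a four-word buffer returns 3 and leaves the FIFO empty.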

28
Applications and benchmarks

The software framework lets us run parallel
applications over the hardware architecture
All applications and benchmarks have been written
using the NIC driver instead of the eMPI software API
COMMS1 and COMMS2
Ping-pong benchmarks

29
Applications and benchmarks

Parallelization of Mandelbrot set
Iterative loop using complex numbers
Complex numbers are C = a + bi (a, b are C/C++ float
or double)
Ideal to perform a message passing
parallelization
Mandelbrot set eMPI function calls
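
A minimal C sketch of the kernel and of one way to split the image among processors is given here. The escape-time loop is the standard Mandelbrot iteration; the row-block partition and the names `mandel_iter` and `rows_for_rank` are assumptions for illustration, since the slides do not show how the work was divided.

```c
/* Iterate z = z*z + c from z = 0 with c = a + bi.
 * Returns the number of iterations performed before |z| exceeds 2,
 * capped at max_iter. */
int mandel_iter(double a, double b, int max_iter) {
    double x = 0.0, y = 0.0;
    int n = 0;
    while (n < max_iter && x * x + y * y <= 4.0) {
        double xt = x * x - y * y + a;  /* real part of z*z + c */
        y = 2.0 * x * y + b;            /* imaginary part of z*z + c */
        x = xt;
        n++;
    }
    return n;
}

/* Block partition of image rows: rank r of size processors
 * computes rows [*first, *last). Remainder rows go to the lowest ranks. */
void rows_for_rank(int rank, int size, int height, int *first, int *last) {
    int base = height / size;
    int extra = height % size;
    *first = rank * base + (rank < extra ? rank : extra);
    *last = *first + base + (rank < extra ? 1 : 0);
}
```

Each pixel depends only on its own coordinates, so every processor can compute its rows independently and send only the results, which is what makes the problem well suited to message passing.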

30
5. Experimental results

5.1 - Hardware costs: area usage
5.2 - Hardware costs: area and power usage
5.3 - Software framework requirements
5.4 - On-chip network throughput and bandwidth
5.5 - Application results
5.6 - Comparative results

31
HW costs area usage

Router comparison between our Ephemeral Circuit
Switching and our Packet Switching with unified/shared
queue
On a 2D-Mesh each router has between 3 and 5
ports
Ephemeral Circuit Switching is between 2.5 and 3.8
times smaller than our Packet Switching with
unified/shared output queue
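
The range of 3 to 5 ports per router follows from the position of the router in the mesh, as this short C sketch shows. It assumes one local port towards the tile plus one port per neighbouring router; the function name is made up for the example.

```c
/* Ports of the router at column x, row y of an n x n 2D mesh:
 * one local port to the tile plus one per existing neighbour. */
int mesh_router_ports(int x, int y, int n) {
    int ports = 1;              /* local port (tile / NIC) */
    if (x > 0)     ports++;     /* west neighbour */
    if (x < n - 1) ports++;     /* east neighbour */
    if (y > 0)     ports++;     /* north neighbour */
    if (y < n - 1) ports++;     /* south neighbour */
    return ports;
}
```

In a 2x2 mesh every router is a corner and has 3 ports; in a 3x3 mesh the edge routers have 4 and the central router has 5.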

32
HW costs area usage

Evolution of NxN 2D-Mesh NoC-based MPSoC
Ratio of HW resources
CS: 20% comm. / 80% comp.
PS: 45% comm. / 55% comp.
Ephemeral circuit switching is a low cost
architecture
Area resources
On-chip memory requirements

[Figure: logic elements (LEs) versus NxN 2D-Mesh size, one plot for Ephemeral Circuit Switching and one for Packet Switching (store and forward)]

33
HW costs area and power usage

2x2 Mesh NoC-based MPSoC with Ephemeral Circuit
Switching
Does not use any on-chip memory
Communication infrastructure (15%) is extremely
small compared to the computational components
(85%)

HW resources distribution
Running at 20MHz we can achieve around 60 DMIPS
Overall system metrics
49.65 mW/MHz
3 DMIPS/MHz

Dynamic power usage
993.31 mW
Static 548.39 mW
Dynamic 442.92 mW
The NoC accounts for only 0.5%

34
Software framework requirements

Each processor needs its own RAM memory
Distributed-memory architecture
At least 64KB of RAM per processor
To load the software framework
Application data and algorithm
On-chip FPGA memory resources
High throughput (few cycles to access)
Low capacity (KB)
External SSRAM available on the prototyping board
Low throughput (many cycles to access)
Large capacity (MB)
Trade-off between capacity and throughput

35
On-chip network throughput and bandwidth

2x2 Mesh NoC-based MPSoC with Ephemeral Circuit
Switching
Maximum channel bandwidth is about 168.84Mbps at
63.24MHz
Bandwidth decreases with the number of hops
(end-to-end flow control)

COMMS1 metrics: 2x2 Mesh with Ephemeral Circuit Switching running at 20 MHz (flit size 32 bits)

Metric                     Bus slave NIC   CI-based NIC
Injection rate
Channel bandwidth          4 Mbps          16.2 Mbps
Aggregate bandwidth        12 Mbps         48.6 Mbps
Bisection bandwidth        8 Mbps          32.4 Mbps
Maximum network capacity   48 Mbps         192.2 Mbps

36
Application results

Test of the parallelization of Mandelbrot set in
several architectures
Sequential execution on Simple NIOS-II
monoprocessor
Parallel execution on a Dual-core NIOS-II
architecture
Parallel execution on a 2x2 Mesh NoC-based with
Ephemeral circuit switching

[Figure: measured speedups, labelled "Speedup 4x" and "Speedup 2x"]

37
Comparative results

On-chip multiprocessor architectures compared
  (A) Symmetric Multiprocessor, Hung05
  (B) HIBI-based MPSoC, Salminen05
  (C) Our NoC-based MPSoC

Architecture
  (A) ccNUMA (cache coherency); shared memory
  (B) Hierarchical bus/network scheme; both shared memory and message passing
  (C) Network-on-Chip; both shared memory and message passing

Interconnection capabilities
  (A) Avalon bus; Cache Coherency Module
  (B) Hierarchical bus; OCP compliant, DMA, and FIFO capabilities; HIBI wrappers; circuit switching and wormhole routing
  (C) Mesh-based NoC; Avalon NIC wrappers following VSIA; low cost Ephemeral circuit switching

Scalability
  (A) Scalability limited
  (B) Good scalability
  (C) Highly scalable

Processor core
  (A) NIOS-II soft-core processor
  (B) NIOS-II soft-core processor
  (C) NIOS-II soft-core processor

Hardware results (LEs / on-chip memory usage)
  (A) 4 CPUs: 11,708 LEs / 185,920 b; 8 CPUs: 24,302 LEs / 371,840 b
  (B) 4 CPUs: 24,207 LEs / 314.3 KB; 8 CPUs: 36,402 LEs / 2.911 Kb
  (C) 4 CPUs (2x2 Mesh): 7,528 LEs / none; 9 CPUs (3x3 Mesh): 17,780 LEs / none; 64 KB used for each processor

Software results
  (A) Functional verification tests
  (B) Parallel MPEG-4 Simple Profile encoder was tested
  (C) Includes a software API called eMPI; parallel generation of Mandelbrot set by using eMPI/NIC driver

Other relevant results
  (A) SMP 1-, 2-, 4- and 8-way; standard running frequency between 60-80 MHz
  (B) Theoretical maximum bandwidth of HIBI prototype is 328 MB/s at 82 MHz; standard running frequency 78 MHz
  (C) Around 1 Gbps of aggregated bandwidth for custom HW at 63.24 MHz, or 100 Mbps at 20 MHz from NIOS-II with message passing API using CI-based NIC

38
6. Conclusions and future work

6.1 Conclusions
6.2 Future work

39
Conclusions

I have proposed a complete HW-SW framework for
distributed-memory NoC-based MPSoC architecture
eMPI is a viable solution to on-chip parallelism
using message passing
The methodology has been formalized as a HW-SW
co-design flow
Complete system level design tool chain
Validity tested on a physical platform (FPGA)
Methodology is also valid for ASIC development
This research work lets us perform
distributed parallel computing on a chip effortlessly
Useful parallel on-chip platform for many
high-performance computing and low power
emerging applications
Multimedia applications
Smart cams
Software-defined radio
Lack of verification and support tools to create
complex MPSoC

40
Future work

Long term
Extend this architecture to implement
heterogeneous systems
Extend this architecture to a hybrid memory
model (shared and distributed memory system)
Large memory bank as a tile
Cache coherence
Mechanism to access the shared medium
Should be useful to get a complete SystemC
simulation model
Evolution of Ephemeral Circuit Switching
architecture
Build a wormhole packet switching
Include a NIC queue in our Ephemeral circuit
switching architecture
Change the fixed PriorityEncoder within
PathSwitchMatrix
Test our architecture with bus-slave NIC with
IRQs

41
Future work

Evolution of software framework
Improve the NIC software driver functions
Extend the eMPI SW API with other useful message
passing collective communication functions
broadcast, scatter, gather, scan, reduce,
allreduce, alltoall, reducescatter, barrier
synchronization, ...
Application-level
Take a real application
Coarse grain or fine grain parallelism
Run GALS scheme with multiple clock domains

42
The end
Thank you!

43
HW-SW design using UML and SystemC