Title: MPhil/Master dissertation
HW-SW Co-Design Framework for Parallel
Distributed Computing on NoC-based MPSoC
architectures
- MPhil/Master dissertation
- presented by
- Jaume Joven Murillo
- and supervised by
- Dr. Jordi Carrabina Bordoll
Presentation outline
- Introduction
- Basic concepts and state of the art in NoCs and MPSoCs
- Design framework and working methodology
- HW-SW NoC-based MPSoC implementation
- Experimental results
- Conclusions and future work
1. Introduction
- 1.1 - Introduction and research project analysis
- 1.2 - Objectives of the research project
Introduction
- The continuous evolution of technology (Moore's law) means that every IC is able to contain a large number of transistors (until 2020 according to the SIA roadmap)
- Productivity gap
- Adopted solutions
- Component reuse (IP cores)
- Soft-core processors
- HW-SW co-design
- Novel design methodologies
- Communication centric
- Novel on-chip paradigms
- Networks-on-Chip (NoCs)
- System-level languages
- SystemC, UML, ...
- Develop complex ICs with billions of transistors in the near future
- Multiprocessor System-on-Chip (MPSoC) / multi-cores / chip multiprocessors (CMP)
- Sea of tiles (IP cores) interconnected by a Network-on-Chip
Objectives of the research project
- Develop a HW-SW co-design framework for parallel distributed computing on-chip, applying platform-based design concepts
- Performs a co-evolution strategy of two concurrent phases (HW-SW)
- Hardware architecture
- Scalable distributed-memory NoC-based MPSoC (NUMA)
- Software framework
- Software drivers
- embedded Message Passing Interface (eMPI)
- Run benchmarks and test applications
- Explore concurrency and parallelism in on-chip environments
2. Basic concepts and state of the art in NoCs and MPSoCs
- 2.1 - On-chip communication schemes
- 2.2 - Basic concepts on NoCs
- 2.3 - NoC topologies
- 2.4 - Switching modes and routing schemes
- 2.5 - Flow control and micro-network stack
- 2.6 - State of the art in NoCs/MPSoCs
On-chip communication architectures
- Point-to-point
- Fixed dedicated wires
- Not flexible, not shared
- Null reusability
- Bus-based interconnection (OCB)
- Shared communication infrastructure
- Multi-level, hierarchical or segmented buses
- Bus becomes a bottleneck
- On-chip network (NoC)
- Distributed nature
- Maximum flexibility and scalability
- Exploits reusability and parallel operations/transactions
- Regular geometry
- Predictable layout and performance
- Best testability and verification time
- Must guarantee a certain QoS
Basic concepts on NoCs
- Tile
- Computational nodes
- Router/Switch
- Communication nodes
- Switching and routing strategy
- Network adapter (NA, NI, NIC)
- Decouples computation from communication
- Adapts network and tile clock domains (GALS)
- Links
- Dedicated P2P communication channels
- Flow control protocol (handshake or credit-based)
- NoC-based systems
- High degree of composition and traffic diversity
- Good floorplanning and minimal buffering are desirable
- Conventional/traditional networks
- Homogeneous and coarse grained
NoC topologies
- Typical of multiprocessor systems, but now on a chip
- Regular
- Predictable in terms of
- Power consumption,
- Performance (bandwidth, latency)
- Area usage
- Good floorplanning
- Non-regular
- Mixing regular topologies
- Mesh-Torus, Ring-Mesh, Ring-hypercube
- Direct
- At least one tile attached to each node
- Indirect
- A subset of nodes is not connected to any core
- Topology selection is a trade-off between
- Network complexity or on-chip area costs
- Communication requirements or network performance
Switching modes and routing schemes
- Circuit switching
- Involves the establishment and release of a circuit between source and destination
- Buffer-less switching scheme
- Packet switching
- Forwards the data to the next hop
- Buffering is mandatory
- Different packet switching modes
- Store-and-forward
- Stall at two nodes and the link between them
- Wormhole
- Combines packet switching and circuit switching
- Reduces buffer size
- Stall at all nodes and links spanned by the packet
- Virtual cut-through
- Next hop must store the whole packet
- Stall at local node
- Buffering
- Buffer size → width, depth
- Location in the router
- Shared or distributed
- Affects power consumption and area usage
Switching modes and routing schemes
- Routing schemes
- Deterministic
- Path determined by the source and destination addresses
- Easy to implement
- Not optimal under congestion
- Adaptive
- Path decided on a per-hop basis
- Complex to implement
- Routing must be deadlock- and livelock-free
- Offers great benefits under congestion
Flow control and micro-network stack
- Flow control protocol (ensures the correct transport of packets)
- Handshake
- Request and acknowledge signals (req, ack/nAck)
- Simpler and cheaper than credit-based
- Credit-based
- All network components keep counters for the available buffer space
- Data received → counter--; data sent → counter++
- If counter = 0 → buffer full
- Micro-network stack layers
- Transport
- Network Adapter has to pack/unpack messages into network layer packets
- Network
- Where and how a packet is transmitted
- Data-link
- Protocol to transmit a flit/phit
- Physical
- Number and length of wires
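The credit-based scheme can be illustrated with a minimal C sketch. It assumes the counter tracks free buffer slots, is decremented when a flit is received and incremented when a flit is forwarded onward; the type and function names (`credit_link_t`, `link_on_flit_received`, and so on) are hypothetical and not taken from the dissertation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-link state for credit-based flow control. */
typedef struct {
    int credits;   /* free buffer slots currently available */
    int capacity;  /* total buffer slots (assumed, set at init) */
} credit_link_t;

/* Start with one credit per buffer slot. */
static void link_init(credit_link_t *l, int capacity) {
    assert(capacity > 0);
    l->capacity = capacity;
    l->credits = capacity;
}

/* A flit may be accepted only while at least one credit remains. */
static bool link_can_send(const credit_link_t *l) {
    return l->credits > 0;
}

/* Data received: one slot is consumed (counter--).
 * Returns false when the counter is 0, i.e. the buffer is full. */
static bool link_on_flit_received(credit_link_t *l) {
    if (!link_can_send(l)) {
        return false;
    }
    l->credits--;
    return true;
}

/* Data sent onward: one slot is freed (counter++), never above capacity. */
static void link_on_flit_sent(credit_link_t *l) {
    if (l->credits < l->capacity) {
        l->credits++;
    }
}
```

With a capacity of 2, two received flits exhaust the credits and a third is refused until one flit has been forwarded.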
State of the art in NoCs/MPSoCs
- NoCs have been an emerging hot topic in recent years
- Research at all micro-network stack levels
- System/Application level
- Design methodologies, co-exploration
- Programming models and OS support
- Network Adapters
- Network architecture
- Link level
- Research on MPSoC
- HW-SW interfaces
- Implementation of parallel programming models
- Shared memory or message passing
- ccNUMA MPSoC architecture using NIOS-II
- MPSoC using segmented buses (HIBI)
3. Design framework and working methodology
- 3.1 - HW-SW Co-design flow
- 3.2 - Proposed NoC-based MPSoC architecture
- 3.3 - Prototyping platform
HW-SW Co-design flow
- System specification
- Architecture exploration
- µP, VLIW, DSP
- NoC routers, buses
- NIC interfaces
- Architecture design and HW-SW co-design
- RTL architecture
- IP core integration (Soft-cores)
- Software design
- Benchmarks/Applications
- embedded MPI (eMPI)
- NIC driver
- Integration and system-verification
- SystemC
- On-chip co-debugging
- Functional prototype
- Tools: Quartus II SOPC; Microsoft Visual Studio; Eclipse IDE for NIOS-II; ModelSim, GTKWave, SignalTap; Synplify or Quartus II
Proposed NoC-based MPSoC architecture
- Distributed-memory NoC-based MPSoC components
- NoC communication architecture
- Soft/Hard IP core processors (Pi)
- Distributed memory subsystem (Mi)
- Network Interface Controller (NICi)
- Driver for Network Interface Controller (NIC driver)
- embedded Message Passing Interface (eMPI)
Proposed NoC-based MPSoC architecture
- NoC topology
- 2D-Mesh (regular, predictable)
- XY Routing
- Deterministic, minimal, and deadlock-free
- Switching mode
- Ephemeral Circuit Switching
- Store-and-forward
- Flow control
- 4-phase handshake
- Tile composition
- NIOS-II Soft-core processor
- On-chip RAM or SSRAM controller
- NIC interface to NoC
- Timer (IRQs, multi-threaded)
- UART, JTAG, Performance Counter
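As an illustration of deterministic XY routing on a 2D-Mesh, the following C sketch picks the output port at each router by correcting the X coordinate first and then the Y coordinate. The port names and the coordinate orientation (X grows eastwards, Y grows southwards) are assumptions made for the example, not details of the actual router.

```c
#include <assert.h>

/* Illustrative output ports of a 2D-Mesh router (names are assumptions). */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Dimension-ordered XY routing: move along X until the column matches,
 * then along Y. Assumes X grows eastwards and Y grows southwards. */
static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    assert(cur_x >= 0 && cur_y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_SOUTH;
    if (dst_y < cur_y) return PORT_NORTH;
    return PORT_LOCAL;  /* packet has arrived at its tile */
}

/* XY routing is minimal: the hop count equals the Manhattan distance. */
static int xy_hops(int src_x, int src_y, int dst_x, int dst_y) {
    int hx = dst_x > src_x ? dst_x - src_x : src_x - dst_x;
    int hy = dst_y > src_y ? dst_y - src_y : src_y - dst_y;
    return hx + hy;
}
```

Because a packet never turns from the Y dimension back into the X dimension, the cyclic channel dependencies that cause deadlock cannot form on a mesh.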
Prototyping platform
- Stratix EP1S25 DSP prototyping/development board
- Altera FPGA Stratix EP1S25F780C5
- Contains 25,660 LEs
- Includes 1,944,576 bits of on-chip memory
- 224 - M512 RAM blocks (32x18b)
- 138 - M4K RAM blocks (128x36b)
- 2 - M-RAM blocks
- 6 PLLs
- 597 maximum user I/O pins
- Off-chip memory
- 2 Mbytes of SSRAM configured as two independent banks
- 32 Mbits of flash memory
- Other I/O
- LEDs, RS232, buttons, switches, 7-segment displays
4. HW-SW NoC-based MPSoC implementation
- 4.1 - NoC-based MPSoC block diagram
- 4.2 - Communication channel
- 4.3 - Design of the Network Interface Controller
- 4.4 - Router design
- 4.5 - Software design
- 4.6 - Applications and benchmarks
NoC-based MPSoC block diagram
- Distributed-memory NoC-based MPSoC based on the NIOS-II soft-core processor
- Each NIOS-II Avalon-based tile is generated effortlessly through Quartus II SOPC
- Our custom HW design
- Implementation of flow control in each communication channel
- Design of the Network Interface Controller
- Design of the router
Communication channel
- Implements a full-duplex 4-phase handshake protocol
- Between NIC and router, or between routers
- The 4-phase protocol is unambiguous
- Two independent and synchronous FSMs have been designed
- Packet definition
- The definition of each subfield
- XY address, message id, message length, sequence number, flags, priority
- Size of each subfield
- Fixes the router and NIC implementation
- Our packet format for a 2D-Mesh
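To make the packet definition concrete, here is a minimal C sketch that packs the listed subfields into one 32-bit header flit and unpacks them again. The field widths (4+4+8+8+4+2+2 bits) and their order are purely hypothetical, chosen only so that they fill 32 bits; the slides do not give the real sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Header subfields named on the slide; all widths below are assumptions. */
typedef struct {
    uint8_t dst_x;     /* 4 bits: X address */
    uint8_t dst_y;     /* 4 bits: Y address */
    uint8_t msg_id;    /* 8 bits: message id */
    uint8_t length;    /* 8 bits: message length */
    uint8_t seq;       /* 4 bits: sequence number */
    uint8_t flags;     /* 2 bits */
    uint8_t priority;  /* 2 bits */
} header_t;

/* Pack the subfields into a single 32-bit header flit. */
static uint32_t header_pack(const header_t *h) {
    assert(h->dst_x < 16 && h->dst_y < 16 && h->seq < 16);
    assert(h->flags < 4 && h->priority < 4);
    return ((uint32_t)h->dst_x << 28) | ((uint32_t)h->dst_y << 24) |
           ((uint32_t)h->msg_id << 16) | ((uint32_t)h->length << 8) |
           ((uint32_t)h->seq << 4) | ((uint32_t)h->flags << 2) |
           (uint32_t)h->priority;
}

/* Recover the subfields from a header flit. */
static header_t header_unpack(uint32_t flit) {
    header_t h;
    h.dst_x = (uint8_t)((flit >> 28) & 0xF);
    h.dst_y = (uint8_t)((flit >> 24) & 0xF);
    h.msg_id = (uint8_t)((flit >> 16) & 0xFF);
    h.length = (uint8_t)((flit >> 8) & 0xFF);
    h.seq = (uint8_t)((flit >> 4) & 0xF);
    h.flags = (uint8_t)((flit >> 2) & 0x3);
    h.priority = (uint8_t)(flit & 0x3);
    return h;
}
```

Since the subfield sizes fix the router and NIC implementation, changing any width in a layout like this one would require changing the shift and mask constants on both sides.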
Design of the Network Interface Controller
- NIC: interface between the tiles and the routers of our NoC
- Decouples tile computation from the NoC communication infrastructure
- Key component for achieving a good packet injection rate over the NoC
- Builds flits/packets
- Bus peripheral (slave)
- Polling or IRQs
- Register and memory mappings
- N+1 bits of addressable bus space
- Custom instruction (CI-based NIC)
- Attached to the processor datapath
- Neither bus master nor slave
Router design
- Circuit switching
- Ephemeral circuit switching
- Two latency cycles
- One for XY routing
- Another for PathSwitchMatrix
Router design
- Packet switching
- Store-and-forward
- Full or shared/unified output queue
- The latency to traverse the router now depends on
- FIFO capacity (depth)
- Output queue policies
Software design
- HW-SW platform stack view of our distributed-memory NIOS-II-based MPSoC with 2D-Mesh interconnection strategy
- Software components
- NIC driver: low-level communication API
- eMPI: high-level communication API for message passing
- Optionally, an operating system (OS) might be included between the HdS (drivers) and the high-level communication APIs
Software design
- The NIC software driver contains 3 basic functions
- Interact transparently with a given NIC component, exploiting all HW capabilities
- volatile int *NIC = (int *)(NIC_BASE);
- volatile int *NIC_TX = (int *)(NIC_BASE + 0x4);
- Status register masks
- 0x1 → dataPending
- 0x2 → txBusy
/* NIC driver: read the NIC status register */
int nicStatus() { return *(NIC + 0x0); }
/* NIC driver: blocking receive, waits until dataPending is set */
int nicRecv() { while (!(nicStatus() & 0x1)); return *(NIC + 0x1); }
/* NIC driver: blocking send, waits while txBusy is set */
void nicSend(int data, int address) { while (nicStatus() & 0x2); *(NIC_TX + address) = data; }
Software design
- The eMPI software API will be our high-level design language
- Implements message passing over our on-chip network
- Steps to create our eMPI
- Select a minimal working subset of standard MPI functions
- MPI_Init(), MPI_Finalize(), MPI_Comm_size(), MPI_Comm_rank()
- MPI_Send(), MPI_Recv()
- Porting process from de facto standard MPI to our on-chip network
- Lightweight message passing interface with low memory overhead (15-20 KB)
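One way a send/receive pair could be layered over a word-oriented NIC driver is sketched below in C. This is not the eMPI implementation: `empi_send` and `empi_recv` are hypothetical names with a simplified signature (no datatype, tag or communicator, unlike `MPI_Send`/`MPI_Recv`), the one-word length header is an assumed framing, and the NIC is replaced by a host-side loopback FIFO so the sketch can run without hardware.

```c
#include <assert.h>

#define MOCK_FIFO_WORDS 64  /* capacity of the host-side stand-in for the NIC */

/* Linear (non-wrapping) loopback FIFO; enough for a short demonstration. */
static int mock_fifo[MOCK_FIFO_WORDS];
static int mock_head, mock_tail;

/* Stand-in for a blocking NIC send; the loopback ignores the destination. */
static void mock_nic_send(int data, int address) {
    (void)address;
    assert(mock_tail < MOCK_FIFO_WORDS);
    mock_fifo[mock_tail++] = data;
}

/* Stand-in for a blocking NIC receive. */
static int mock_nic_recv(void) {
    assert(mock_head < mock_tail);
    return mock_fifo[mock_head++];
}

/* Hypothetical eMPI-like send: one length word, then the payload words. */
static int empi_send(const int *buf, int count, int dest) {
    assert(count >= 0);
    mock_nic_send(count, dest);
    for (int i = 0; i < count; i++) {
        mock_nic_send(buf[i], dest);
    }
    return 0;
}

/* Hypothetical eMPI-like receive: returns the number of words received,
 * or -1 if the message did not fit (extra words are read and dropped). */
static int empi_recv(int *buf, int max_count) {
    int count = mock_nic_recv();
    for (int i = 0; i < count; i++) {
        int word = mock_nic_recv();
        if (i < max_count) {
            buf[i] = word;
        }
    }
    return count > max_count ? -1 : count;
}
```

On the real platform the two `mock_nic_*` functions would be replaced by the NIC driver's blocking send and receive, which is where most of the porting effort from standard MPI would be concentrated.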
Applications and benchmarks
- The software framework lets us run parallel applications over the hardware architecture
- All applications and benchmarks have been implemented using the NIC driver instead of the eMPI software API
- COMMS1 and COMMS2
- Ping-pong benchmarks
Applications and benchmarks
- Parallelization of the Mandelbrot set
- Iterative loop using complex numbers
- Complex numbers are C = a + bi (a, b are C/C++ float or double)
- Ideal for a message passing parallelization
- Mandelbrot set eMPI function calls
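The Mandelbrot kernel and a possible work split can be sketched in C as follows. The escape-time loop is the standard iteration z = z² + c; the row-block partition by rank is only one plausible decomposition and is an assumption, as are the function names `mandel_iters` and `rows_for_rank`.

```c
#include <assert.h>

/* Escape-time iteration count for c = a + bi: iterate z = z*z + c from
 * z = 0 until |z| > 2 or max_iter iterations have been performed. */
static int mandel_iters(double a, double b, int max_iter) {
    assert(max_iter > 0);
    double x = 0.0, y = 0.0;
    int n = 0;
    while (n < max_iter && x * x + y * y <= 4.0) {
        double x_next = x * x - y * y + a;
        y = 2.0 * x * y + b;
        x = x_next;
        n++;
    }
    return n;
}

/* Assumed decomposition: split image rows into contiguous blocks, one per
 * rank. Rank r computes rows [*first, *last); the first (height % size)
 * ranks get one extra row. */
static void rows_for_rank(int rank, int size, int height,
                          int *first, int *last) {
    assert(size > 0 && rank >= 0 && rank < size && height >= 0);
    int base = height / size;
    int extra = height % size;
    *first = rank * base + (rank < extra ? rank : extra);
    *last = *first + base + (rank < extra ? 1 : 0);
}
```

Each rank needs only its row range and the image parameters, and sends back its computed rows, which is why the problem maps well onto message passing: pixels are independent, so no data is shared during the computation.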
5. Experimental results
- 5.1 - Hardware costs: area usage
- 5.2 - Hardware costs: area and power usage
- 5.3 - Software framework requirements
- 5.4 - On-chip network throughput and bandwidth
- 5.5 - Application results
- 5.6 - Comparative results
HW costs: area usage
- Router comparison between our Ephemeral Circuit Switching and our Packet Switching with unified/shared queue
- On a 2D-Mesh, the number of ports per router is between 3 and 5
- Ephemeral Circuit Switching is between 2.5 and 3.8 times smaller than our Packet Switching with unified/shared output queue
HW costs: area usage
- Evolution of NxN 2D-Mesh NoC-based MPSoC
- Ratio of HW resources
- CS: 20% comm. / 80% comp.
- PS: 45% comm. / 55% comp.
- Ephemeral Circuit Switching is a low-cost architecture
- Area resources
- On-chip memory requirements
[Figures: logic elements (LEs) versus NxN 2D-Mesh size, for Ephemeral Circuit Switching and for Packet Switching (store-and-forward)]
HW costs: area and power usage
- 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching
- Does not use any on-chip memory
- Communication infrastructure (15%) is extremely small compared to the computational components (85%)
- HW resources distribution
- Running at 20 MHz we can achieve around 60 DMIPS
- Overall system metrics
- 49.65 mW/MHz
- 3 DMIPS/MHz
- Dynamic power usage
- 993.31 mW
- Static 548.39 mW
- Dynamic 442.92 mW
- The NoC accounts for only 0.5%
Software framework requirements
- Each processor needs its own RAM
- Distributed-memory architecture
- At least 64KB of RAM per processor
- To load the software framework
- Application data and algorithm
- On-chip FPGA memory resources
- High throughput (few cycles to access)
- Low capacity (KB)
- External SSRAM available on the prototyping board
- Low throughput (many cycles to access)
- Large capacity (MB)
- Trade-off between capacity and throughput
On-chip network throughput and bandwidth
- 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching
- Maximum channel bandwidth is about 168.84 Mbps at 63.24 MHz
- Bandwidth decreases with the number of hops (end-to-end flow control)
Table: COMMS1 metrics on a 2x2 Mesh with Ephemeral Circuit Switching running at 20 MHz (flit size 32 bits)
Metric in COMMS1 | Bus slave NIC | CI-based NIC
Injection rate | |
Channel bandwidth | 4 Mbps | 16.2 Mbps
Aggregate bandwidth | 12 Mbps | 48.6 Mbps
Bisection bandwidth | 8 Mbps | 32.4 Mbps
Maximum network capacity | 48 Mbps | 192.2 Mbps
Application results
- Test of the parallelization of the Mandelbrot set on several architectures
- Sequential execution on a simple NIOS-II monoprocessor
- Parallel execution on a dual-core NIOS-II architecture
- Parallel execution on a 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching
Speedup 4x
Speedup 2x
Comparative results
On-chip multiprocessor architectures compared: Symmetric Multiprocessor [Hung05], HIBI-based MPSoC [Salminen05], our NoC-based MPSoC
Architecture
- Symmetric Multiprocessor [Hung05]: ccNUMA (cache coherency); shared memory
- HIBI-based MPSoC [Salminen05]: hierarchical bus/network scheme; both shared memory and message passing
- Our NoC-based MPSoC: Network-on-Chip; both shared memory and message passing
Interconnection capabilities
- Symmetric Multiprocessor [Hung05]: Avalon bus; Cache Coherency Module
- HIBI-based MPSoC [Salminen05]: hierarchical bus; OCP compliant, DMA, and FIFO capabilities; HIBI wrappers; circuit switching and wormhole routing
- Our NoC-based MPSoC: mesh-based NoC; Avalon NIC wrappers following VSIA; low-cost Ephemeral Circuit Switching
Scalability
- Symmetric Multiprocessor [Hung05]: limited scalability
- HIBI-based MPSoC [Salminen05]: good scalability
- Our NoC-based MPSoC: highly scalable
Processor core
- All three: NIOS-II soft-core processor
Hardware results (LEs / on-chip memory usage)
- Symmetric Multiprocessor [Hung05]: 4 CPUs, 11,708 LEs / 185,920 b; 8 CPUs, 24,302 LEs / 371,840 b
- HIBI-based MPSoC [Salminen05]: 4 CPUs, 24,207 LEs / 314.3 KB; 8 CPUs, 36,402 LEs / 2.911 Kb
- Our NoC-based MPSoC: 4 CPUs (2x2 Mesh), 7,528 LEs / none; 9 CPUs (3x3 Mesh), 17,780 LEs / none; 64 KB used for each processor
Software results
- Symmetric Multiprocessor [Hung05]: functional verification tests
- HIBI-based MPSoC [Salminen05]: parallel MPEG-4 Simple Profile encoder was tested
- Our NoC-based MPSoC: includes a software API called eMPI; parallel generation of the Mandelbrot set using eMPI/NIC driver
Other relevant results
- Symmetric Multiprocessor [Hung05]: SMP 1-, 2-, 4- and 8-way; standard running frequency between 60 and 80 MHz
- HIBI-based MPSoC [Salminen05]: theoretical maximum bandwidth of the HIBI prototype is 328 MB/s at 82 MHz; standard running frequency 78 MHz
- Our NoC-based MPSoC: around 1 Gbps of aggregated bandwidth for custom HW at 63.24 MHz, or 100 Mbps at 20 MHz from NIOS-II with the message passing API using the CI-based NIC
6. Conclusions and future work
- 6.1 Conclusions
- 6.2 Future work
Conclusions
- I have proposed a complete HW-SW framework for distributed-memory NoC-based MPSoC architectures
- eMPI is a viable solution for on-chip parallelism using message passing
- The methodology has been formalized as a HW-SW co-design flow
- Complete system-level design tool chain
- Validity tested on a physical platform (FPGA)
- Methodology is also valid for ASIC development
- This research work lets us perform distributed parallel computing on a chip effortlessly
- Useful parallel on-chip platform for many emerging high-performance computing and low-power applications
- Multimedia applications
- Smart cams
- Software-defined radio
- Lack of verification and support tools to create complex MPSoCs
Future work
- Long term
- Extend this architecture to implement heterogeneous systems
- Extend this architecture to a hybrid memory model (shared and distributed memory system)
- Large memory bank as a tile
- Cache coherence
- Mechanism to access the shared medium
- A complete SystemC simulation model would be useful
- Evolution of the Ephemeral Circuit Switching architecture
- Build wormhole packet switching
- Include a NIC queue in our Ephemeral Circuit Switching architecture
- Change the fixed PriorityEncoder within PathSwitchMatrix
- Test our architecture with a bus-slave NIC with IRQs
Future work
- Evolution of the software framework
- Improve the NIC software driver functions
- Extend the eMPI SW API with other useful message passing collective communication functions
- broadcast, scatter, gather, scan, reduce, allreduce, alltoall, reducescatter, barrier synchronization, ...
- Application level
- Take a real application
- Coarse-grain or fine-grain parallelism
- Run a GALS scheme with multiple clock domains
The end
Thank you!
HW-SW design using UML and SystemC
- Useful to verify complex MPSoC systems
- SystemC lets us model all HW-SW components
- UML lets us generate the SystemC code automatically
- Easy example: MeshXYRouting
- Comparator