An Efficient Architecture for the Implementation of Message Passing Programming Model on Massive Multiprocessor SoC
Ferid Gharsalli, Amer Baghdadi, Marius Bonaciu, Giedrius Majauskas, Wander Cesario, Ahmed A. Jerraya
TIMA laboratory, 46 av. Felix Viallet, 38031 Grenoble Cedex (France)
Tel (33) 476 574 759
ferid.gharsalli, amer.baghdadi, marius.bonaciu, giedrius.majauskas, wander.cesario, ahmed.jerraya_at_imag.fr
Outline
- Introduction
- Flexible and Scalable Architecture for Parallel Computations
- Parallel Programming Model
- Application: DivX Real Time Encoder
- Conclusions
Efficient MP-SoC Design Method
[Figure labels: Parallel Application; Parallel Programming Model; Massive MP-SoC; Architecture; Many specifications; Slow and inefficient correspondence at early stages of design; Flexible and Scalable Architecture for Parallel Computations]
Evolution of Embedded Applications
What is the main problem of today's Embedded Applications?
- GAP between design methods and actual architectures, mainly because of
  - Massive Computation
  - Massive Data Transfer
  - Cost and Power Constraints
  - Huge Design Time
Key Issues
How can the GAP be filled?
- Concurrency
  - to cope with Massive Computation
- Efficient Data Transfer Architecture
  - to cope with Massive Data Transfer
- Application Specific Communication/Computation
  - to cope with Cost and Power Constraints
- Higher Level Programming Model
  - to cope with Huge Design Time
Objective
What is our objective?
- Multicore Architecture
  - to achieve Concurrency
- Efficient Network on Chip
  - to achieve Efficient Data Transfer
- Heterogeneous Communication/Computation
  - Specific HW/SW Interfacing to achieve Application Specific Communication/Computation
- High Level Parallel Programming Model
  - Message Passing to achieve Higher Level Design
=> Highly flexible and scalable architectures
[Figure labels: SW Comp. (x4); Task1, Task2, Task3, ..., TaskX; MP-API; Parallel Programming Model; Design Flow; Task1, Task2, Task3, ..., TaskN; API; Specific OS; HAL; CPU SubSystem; HW Adapt; Specific HW/SW Intrf.; NoC]
Design Flow
[Figure labels: Application Specifications; Application; Algorithm Specifications; High Level SW Description; High Level MP-SoC Architecture]
Design Flow
[Figure labels: High Level MP-SoC Architecture; Abstract Module1 to Abstract Module4; Task1 to Task5; Module PPM (one per module); Abstract NoC; NoC PPM; Low Level MP-SoC Architecture]
Without Parallel Programming Model
[Figure labels: Task1 to Task5 spread over two processor stacks; per stack: API, Specific OS, HAL, CPU SubSystem (CPU1, CPU2), Specific HW/SW interf.; Communication Network (NoC)]
With Parallel Programming Model
[Figure labels: Task1 to Task5; per processor stack: Specific OS, HAL, CPU SubSystem (CPU1, CPU2), Specific HW/SW interf.; Communication Network (NoC)]
Parallel Programming Model 1/2
What is a Parallel Programming Model?
- HW/SW INTERFACE which
  - separates the high-level properties (SW) from the low-level ones (HW)
- ABSTRACT MACHINE which
  - provides certain operations to the programming level above (SW)
  - requires an implementation of each operation on the architectures below (HW)
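The abstract machine idea can be illustrated with a minimal sketch. Everything named here (ParallelProgrammingModel, SharedBufferBackend, QueueBackend, the channel numbering) is hypothetical and chosen only for illustration; it is not the API from the slides. The point is that the task code sees only the operations of the level above, while each lower level supplies its own implementation of them.

```python
from abc import ABC, abstractmethod
from collections import deque
import queue


class ParallelProgrammingModel(ABC):
    """Operations offered to the software level above (hypothetical names)."""

    @abstractmethod
    def send(self, channel: int, data: bytes) -> None:
        """Deliver data on a logical channel."""

    @abstractmethod
    def recv(self, channel: int) -> bytes:
        """Take the oldest pending item from a logical channel."""


class SharedBufferBackend(ParallelProgrammingModel):
    """One possible lower level: channels are buffers in a shared memory."""

    def __init__(self):
        self._buffers = {}

    def send(self, channel, data):
        self._buffers.setdefault(channel, deque()).append(data)

    def recv(self, channel):
        return self._buffers[channel].popleft()


class QueueBackend(ParallelProgrammingModel):
    """Another possible lower level: channels are FIFO links."""

    def __init__(self):
        self._links = {}

    def send(self, channel, data):
        self._links.setdefault(channel, queue.Queue()).put(data)

    def recv(self, channel):
        return self._links[channel].get_nowait()


def producer_consumer(ppm):
    """Application code, written once against the abstract operations."""
    ppm.send(0, b"block-0")
    ppm.send(0, b"block-1")
    return [ppm.recv(0), ppm.recv(0)]


# The same application code runs unchanged on both lower levels.
results = [producer_consumer(b) for b in (SharedBufferBackend(), QueueBackend())]
```

Swapping the backend changes nothing in producer_consumer, which is the separation of high-level and low-level properties described above.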
Parallel Programming Model 2/2
What are the properties of a Parallel Programming Model?
- EASY TO PROGRAM because it needs to conceal the
  - partitioning of a program into different modules
  - mapping of the tasks onto different types of modules (HW, SW)
  - communication type between the tasks
  - synchronization method between the tasks
- SOFTWARE DEVELOPMENT TECHNOLOGY
  - needs to allow the development of the application with a typical software design method
- ARCHITECTURE INDEPENDENT
  - the application can migrate from one architecture model to another without being redeveloped, and with at most trivial modifications
- EFFICIENTLY IMPLEMENTABLE
  - needs to offer efficient results over a wide variety of parallel architectures
- COST MEASURES
  - the ability to decide that Operation A is better than Operation B for a particular problem
Types of Parallel Programming Models
[Figure: EXPLICIT versus IMPLICIT models compared on two scales: Building parallel applications (Hard to Easy) and Efficient parallel applications (Low to High)]
Hard to Debug
!!! Hard to debug when it is designed to be application specific !!!
[Figure labels: Bugs, by layer: Application SW; Parallel Prog. Model; µ-Kernel/OS]
- Data dependent computation
- C library bug
- Incorrect FIFO counter value causes deadlock.
- Context switch does not work correctly.
- Booting is not synchronized among processors.
- Lost some interrupts
- Wrong interrupt priority levels
- Result of compressed video is not correct.
- Abnormal execution of a portion of C code
Source: DAC04, San Diego, CA. Mohamed Wassim YOUSSEF, Sungjoo YOO, Arif SASONGKO, Yanick PAVIOT, Ahmed JERRAYA, TIMA Laboratory, "Debugging HW/SW Interface for MP-SoC Video Encoder System Design Case"
Application/PPM Communication
[Figure labels: Parallel Application; Task1, Task2, Task3, Task4, Task5, Task6, ..., TaskN]
How does the Application interact with the Parallel Programming Model?
MP_Init(this,argc,argv)
MP_Finalize(this)
MP_ISend(this,buf,count,datatype,dest,tag,comm)
MP_IRecv(this,buf,count,datatype,source,tag,comm,status)
MP_IBSend(this,buf,count,datatype,dest,tag,comm)
MP_IBRecv(this,buf,count,datatype,source,tag,comm,status)
MP_ISSend(this,buf,count,datatype,dest,tag,comm)
MP_ISRecv(this,buf,count,datatype,source,tag,comm,status)
MPI_Wait(this,request,status)
MPI_Test(this,request,flag,status)
Shared memory, Message passing, RDMA
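The calls listed above follow the familiar non-blocking pattern: post a send or a receive, get back a request, then test or wait on it. The sketch below mimics that pattern in plain Python with threads and queues. It is only an in-process stand-in under stated assumptions: the Communicator and Request classes and their method names are hypothetical, the send completes immediately (closest in spirit to a buffered send), and nothing here reflects how the real MP-API is implemented on the SoC.

```python
import queue
import threading


class Request:
    """Handle for a pending non-blocking operation (hypothetical helper)."""

    def __init__(self):
        self._done = threading.Event()
        self.data = None

    def _complete(self, data=None):
        self.data = data
        self._done.set()

    def test(self):
        """Non-blocking completion check, in the spirit of the Test call."""
        return self._done.is_set()

    def wait(self, timeout=None):
        """Block until completion, in the spirit of the Wait call."""
        if not self._done.wait(timeout):
            raise TimeoutError("operation did not complete")
        return self.data


class Communicator:
    """In-process stand-in for the network: one FIFO per (dest, tag) pair."""

    def __init__(self):
        self._mailboxes = {}
        self._lock = threading.Lock()

    def _mailbox(self, dest, tag):
        with self._lock:
            return self._mailboxes.setdefault((dest, tag), queue.Queue())

    def isend(self, buf, dest, tag):
        req = Request()
        # Copy the payload so the caller may reuse buf right away.
        self._mailbox(dest, tag).put(bytes(buf))
        req._complete()
        return req

    def irecv(self, me, tag):
        req = Request()
        box = self._mailbox(me, tag)
        # A helper thread completes the request once a message arrives.
        threading.Thread(
            target=lambda: req._complete(box.get()), daemon=True
        ).start()
        return req


comm = Communicator()
recv_req = comm.irecv(me=1, tag=7)   # receive posted before any message exists
posted_early = recv_req.test()       # still pending at this point
send_req = comm.isend(b"macroblock", dest=1, tag=7)
send_req.wait()
payload = recv_req.wait(timeout=1.0)
```

Posting the receive first and overlapping other work before calling wait is what lets computation and communication proceed concurrently.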
DivX Encoder Description
DivX: a very popular implementation of the MPEG-4 standard (ISO/IEC 14496-2)
[Figure labels: YUV input; I and P frames; t; t-1; Motion Estimation; Motion vectors; DCT; Quant.; quanta; VLC; MPEG4/ISO Bitstream; DeQuant.; IDCT; Motion Comp.; Reference image]
DivX Parallelization
[Figure labels: VLC]
DivX MP-SoC Architecture Generation
[Figure labels, high-level model: Antenna (Video source); Splitter (Preprocessing); Main DivX1 ... Main DivXn; VLC1 ... VLCm; Combiner (Postprocessing); MPEG4 storage; Message Passing API; Parallel Programming Model]
[Figure labels, design flow inputs: Flexible and Scalable High Level Architecture Template Model; Parameters (Parallel., Part., Mapp., Comm., Sync., etc.); Design Flow]
[Figure labels, generated architecture: Antenna and Storage, each behind HW Adapt; Splitter, Main DivX1 ... Main DivXn, VLC1 ... VLCn, Combiner; per processor: MP-API, Specific OS, CPU Subsystem, Spec. HW/SW intrf.; HW Adapt; NoC]
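The splitter, parallel encoder nodes, and combiner named on this slide can be sketched as a small message-passing pipeline. This is a toy model under loud assumptions: the slides do not say how the video is partitioned, so the sketch hands out whole frames round-robin; zlib stands in for the real DivX processing; the separate VLC nodes are folded into the workers; and all function names are hypothetical.

```python
import queue
import threading
import zlib

NUM_WORKERS = 3  # plays the role of "n" in Main DivX1..n (arbitrary value)


def splitter(frames, inboxes):
    """Preprocessing: hand out frames round-robin, then signal end of stream."""
    for index, frame in enumerate(frames):
        inboxes[index % len(inboxes)].put((index, frame))
    for inbox in inboxes:
        inbox.put(None)


def worker(inbox, outbox):
    """Stand-in for an encoder node; zlib replaces the real DivX processing."""
    while (item := inbox.get()) is not None:
        index, frame = item
        outbox.put((index, zlib.compress(frame)))
    outbox.put(None)  # tell the combiner this worker is finished


def combiner(outbox, num_workers):
    """Postprocessing: collect results and restore the original frame order."""
    finished, results = 0, {}
    while finished < num_workers:
        item = outbox.get()
        if item is None:
            finished += 1
        else:
            results[item[0]] = item[1]
    return [results[i] for i in sorted(results)]


def run_pipeline(frames, num_workers=NUM_WORKERS):
    inboxes = [queue.Queue() for _ in range(num_workers)]
    outbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(q, outbox)) for q in inboxes
    ]
    for t in threads:
        t.start()
    splitter(frames, inboxes)
    encoded = combiner(outbox, num_workers)
    for t in threads:
        t.join()
    return encoded


frames = [bytes([i]) * 64 for i in range(10)]  # toy "frames", not real video
encoded = run_pipeline(frames)
```

Changing NUM_WORKERS is the only edit needed to scale the number of encoder nodes, which loosely mirrors the role of the parallelization parameter in the template model.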
Performance results 1/6
QCIF RESOLUTION (176 x 144), 25 frames/s
Performance results 2/6
Performance results 3/6
Performance results 4/6
CIF RESOLUTION (352 x 288), 25 frames/s
Performance results 5/6
Performance results 6/6
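To put the two test resolutions in perspective, the short calculation below works out the real-time workload they imply. It assumes 16 x 16 macroblocks, as used by MPEG-4 Part 2, and 8-bit YUV 4:2:0 input (the slides say YUV but do not give the sampling format). The function name is hypothetical, and the figures are derived from the frame sizes only; they are not measurements from the presentation.

```python
def frame_load(width, height, fps, macroblock=16):
    """Workload implied by a resolution and frame rate (illustrative only)."""
    mbs_per_frame = (width // macroblock) * (height // macroblock)
    # YUV 4:2:0, 8 bits per sample: full-resolution Y plus quarter-resolution
    # U and V planes, so 1.5 bytes per pixel.
    bytes_per_frame = width * height * 3 // 2
    return {
        "macroblocks_per_frame": mbs_per_frame,
        "macroblocks_per_second": mbs_per_frame * fps,
        "raw_bytes_per_second": bytes_per_frame * fps,
    }


qcif = frame_load(176, 144, 25)
cif = frame_load(352, 288, 25)
```

Under these assumptions QCIF gives 99 macroblocks per frame (2475 per second) and CIF gives 396 per frame (9900 per second), so moving from QCIF to CIF quadruples both the computation and the raw input data rate.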
Conclusions
- today's MP-SoC architectures require
  - Multicore Based Architectures
  - Efficient Data Transfer Architectures
  - Application Specific Communication/Computation
  - High Level Programming Models
- both SW design and HW design are crucial for obtaining efficient results
- linking the SW designers' work with the HW designers' work at early stages of design is difficult
- using Parallel Programming Models throughout the design flow is an efficient way to link the two
- an example on a Real Time DivX Encoder Application was presented
- experimental results show an efficient and scalable MP-SoC architecture obtained through a very efficient design flow
- Future work
  - testing this approach on other applications (e.g. MP3 Real Time Encoder)
  - fully automating the design flow
Thank you