Title: Complex multiprocessor architectures
1Chapter 9 Complex multiprocessor architectures
- Outline
- Introduction: limitations of simple multiprocessors
- Application domain characteristics: multi-window TV
- Top level architecture
- Architecture of the signal processing subsystem
2Example DVP platform
3Discussion
- architecture
- the limit of scalability of busses is reached
- communication via central memory doubles the required bandwidth
- programming (software) issues
- synchronisation via central CPU leads to coarse grain tasks
- scheduling of central resources (bus and memory) is difficult and time consuming, especially for real-time applications
- VLSI (hardware) issues
- interconnect delay dominates the gate delay, which leads to multi-hop communication
- clocking
4From busses to concentrators to ...
5Networks-on-Silicon
Hierarchical architecture with clusters of processors and memories, autonomously operating and cooperating via an on-chip network with routers, segmented interconnect and distributed memories
[Figure: clusters C1 ... C16 of embedded cores and embedded memories, connected to external SDRAM]
6Networks-on-Silicon
[Figure: clusters C1 ... C16]
7Design paradigm shift every 7-8 years
- Networks-on-Silicon
- Multi-processor architectures, platform based design, reuse of IP
- Embedded processors, HLS, SW compilers, ILP, VLIW
- FU design, logic and RT synthesis, parametrised libraries
8Outline
- Introduction: limitations of simple multiprocessors
- Application domain characteristics: multi-window TV
- Top level architecture
- Architecture of the periodic subsystem
9High-end TV architecture of 1998
- ad hoc architecture
- local optima
- globally not cost-effective
10Market analysis
New features: PC-like windows with variable sizes and shapes, e.g. PIP, TXT, OSD
13Application domain analysis
14Application domain analysis
- control: UI, menu, modem, cond. access, select tt page
- soft real time: tt generation
- hard real time: video
[Figure: application graph with nodes Video In1, Video In2, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix, Txt gen, RC and mem]
- nodes: limited number of well-known, weakly programmable tasks
- graph represents 1 application
- 50 ... 100 applications
- run-time switching between applications
15Application domain analysis
Application graph > subgraphs > tasks
set of closely coupled tasks
[Figure: application graph with nodes Video In1, Video In2, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix, Txt gen and mem]
- processing power: Gops/s
- bandwidth between tasks: GB/s
- 11 internal streams
- 20 external streams to mem
- 5 IO streams
- asynchronous
16Outline
- Introduction: limitations of simple multiprocessors
- Application domain characteristics: multi-window TV
- Top level architecture
- Architecture of the signal processing subsystem
17Top-level architecture
[Figure: top-level architecture with three subsystems: signal processing subsystem (processors P_1 ... P_6, network, periodic requests), memory subsystem (SDRAM interface), control subsystem (embedded CPU with I and D, random requests)]
18Hosseini-Khayat 95
[Figure: timing diagrams of the server over time, with a bus service cycle divided into time slots 1, 2, ..., N]
- Bus time slot (e.g. 18 cc)
- Bus service cycle = N time slots (e.g. N = 64)
- Q = number of bus time slots per service cycle reserved for periodic streams
Goal: minimize the latency for random requests while guaranteeing the throughput for periodic requests
19Algorithm
- N = number of time slots
- Q = number of time slots per service cycle for periodic requests
- n = remaining number of time slots (initially n = N)
- q = remaining number of time slots for periodic requests (initially q = Q)
1. If n > q, choose a random request if available
2. If n <= q, choose a periodic request if available
3. Decrement q if a periodic request was chosen
4. Decrement n. If n = 0, restart (n = N, q = Q)
Claim: the average delay of random requests in the presence of periodic requests approaches the delay when periodic requests are absent (if there is no overload situation).
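The four arbitration steps above can be sketched as a small slot-by-slot simulation of one service cycle. This is only an illustration under stated assumptions: the function name `arbitrate_cycle`, the use of queues, and the fallback to the other request class when the preferred class has nothing pending are not taken from the slides.

```python
from collections import deque

def arbitrate_cycle(N, Q, random_reqs, periodic_reqs):
    """Grant the N slots of one bus service cycle.

    Returns one entry per slot: "R" (random), "P" (periodic) or "-" (idle).
    Assumption: if the preferred class has no pending request, the slot is
    offered to the other class instead of being wasted.
    """
    n, q = N, Q                      # remaining slots, remaining periodic budget
    grants = []
    while n > 0:
        prefer_random = n > q        # step 1: slack left, random goes first
        if prefer_random and random_reqs:
            random_reqs.popleft()
            grants.append("R")
        elif periodic_reqs:          # step 2 (or fallback when no random request)
            periodic_reqs.popleft()
            grants.append("P")
            q = max(q - 1, 0)        # step 3: one reserved slot consumed
        elif random_reqs:            # fallback when no periodic request is pending
            random_reqs.popleft()
            grants.append("R")
        else:
            grants.append("-")
        n -= 1                       # step 4; the caller restarts with n = N, q = Q
    return grants

# Heavy random load: random requests win until only Q slots remain.
print("".join(arbitrate_cycle(8, 3, deque(range(10)), deque(range(3)))))
```

With N = 8 and Q = 3 the random requests are served first, and from the point where n equals q every slot goes to a periodic request, so the periodic throughput is still guaranteed.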
20From here always choose periodic request
21Outline
- Introduction: limitations of simple multiprocessors
- Application domain characteristics: multi-window TV
- Top level architecture
- Architecture of the signal processing subsystem
22Signal Processing Subsystem
[Figure: processors A, B, C and D connected in several different configurations]
(Re)configuration like in FPGAs, but at a coarse level
23Signal Processing Subsystem
Communication network
- different flowgraphs implemented via programmable switches
- separate processing <-> communication
- Fifo (signals)
- dynamic dataflow
- async. streams
- dyn. functions (VLD)
- simpler
- scalable
[Figure: fifos, processors P_1, P_2, ..., P_n, fifos; 1 to 1 mapping]
24Signal Processing Subsystem
Reconfigurable communication network
[Figure: fifos, processors P_1, P_2, ..., P_n, fifos]
Inverse communication network
25Signal Processing Subsystem
Resource sharing of processors
[Figure: flow graph with tasks A, B, C and streams 1 to 4]
Mapping (multiplexing):
- A/B: Proc P
- C: Proc Q
Data identification is needed -> header or tag
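Why a header or tag is needed can be illustrated with a minimal sketch: when tasks A and B share one processor, their tokens travel over the same path, so each token must carry an identifier that selects the task and the destination stream. The helper names (`multiplex`, `shared_processor`, `demultiplex`) and the two example task functions are hypothetical and not taken from the slides.

```python
def multiplex(streams):
    """Interleave several streams round-robin into one list of (tag, value) tokens."""
    iters = {tag: iter(values) for tag, values in streams.items()}
    tokens = []
    while iters:
        for tag in list(iters):
            try:
                tokens.append((tag, next(iters[tag])))
            except StopIteration:
                del iters[tag]       # this stream is exhausted
    return tokens

def shared_processor(tokens, tasks):
    """One processor runs several tasks; the tag selects which function is applied."""
    return [(tag, tasks[tag](value)) for tag, value in tokens]

def demultiplex(tokens):
    """Split the tagged output back into one stream per tag."""
    out = {}
    for tag, value in tokens:
        out.setdefault(tag, []).append(value)
    return out

# Hypothetical tasks A and B mapped onto the same processor P.
tasks = {"A": lambda v: v + 1, "B": lambda v: v * 2}
mixed = multiplex({"A": [1, 2, 3], "B": [10, 20]})
print(demultiplex(shared_processor(mixed, tasks)))
```

Without the tag, the processor could not tell whether a token belongs to A or to B once the two streams have been interleaved.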
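26
[Figure (a): process flow graph with processes A, B, C, D, E, P and Q, and tokens x and y]
Proper schedule: A B C P D E Q
Forbidden schedule: A B D C
[Figure (b): mapping, with P and Q sharing one processor (P/Q) that receives tokens x and y]
Deadlock! Data is present in the system but there is no progress.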
27Signal Processing Subsystem
Ways to avoid deadlock
1. Graph transformations
2. Extra memory processes (rearrange order of tokens)
[Figure: process flow graph with processes A, B, C, D, E, P and Q, and extra memory processes M]
28Signal Processing Subsystem
3. Extra control on the sequence of firing
[Figure: process flow graph with processes A, B, C, D, E, P and Q]
29Signal Processing Subsystem
4. Out of order execution
[Figure: shared processor P/Q]
Separate fifos for tokens with different colors: no sharing of fifos
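The effect of separate fifos per token color can be illustrated with a small simulation. The scenario is hypothetical and simpler than the flow graph in the slides: task P consumes x tokens, task Q consumes y tokens but may only fire after P has produced a result, and a y token happens to arrive before the x token. The function name `simulate` is an assumption as well.

```python
from collections import deque

def simulate(arrivals, shared):
    """Fire tasks P and Q on one processor until nothing can fire any more.

    Returns (list of fired tasks, number of tokens still waiting).
    """
    common = deque()
    fifo = {"x": common, "y": common} if shared else {"x": deque(), "y": deque()}
    for color in arrivals:
        fifo[color].append(color)

    p_results = 0                    # results of P that Q still has to consume
    fired = []
    progress = True
    while progress:
        progress = False
        if fifo["x"] and fifo["x"][0] == "x":
            fifo["x"].popleft()      # P fires on the x token at the head
            p_results += 1
            fired.append("P")
            progress = True
        elif fifo["y"] and fifo["y"][0] == "y" and p_results > 0:
            fifo["y"].popleft()      # Q fires on y plus one result of P
            p_results -= 1
            fired.append("Q")
            progress = True

    unique_fifos = {id(f): f for f in fifo.values()}.values()
    return fired, sum(len(f) for f in unique_fifos)

print(simulate(["y", "x"], shared=True))    # y blocks the head of the shared fifo
print(simulate(["y", "x"], shared=False))   # one fifo per color
```

In the shared case both tokens stay in the fifo and nothing fires, which is the "data is present but there is no progress" situation. With one fifo per color, P can take its x token out of order and Q follows.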
30Processor model
[Figure: processor model with input fifos fifo_1 ... fifo_4, clock generation, clock gating, shared logic, per-task state State_1 ... State_m, local control for task switching, debug, and output fifos fifo_1 ... fifo_4]
- 4 streams in parallel
- no fifo/state sharing
- zero overhead context switch
- blocking protocol via clock gating
- round robin scheduler
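The round robin scheduler combined with a blocking protocol can be sketched in software as follows. This is only a behavioural illustration: the class `Task`, the running-sum computation and the output fifo capacity are assumptions, and skipping a blocked task stands in for what the slides describe as clock gating.

```python
from collections import deque

class Task:
    """One stream with its own fifos and its own state (nothing is shared)."""
    def __init__(self, name, inputs, out_capacity):
        self.name = name
        self.fifo_in = deque(inputs)
        self.fifo_out = deque()
        self.out_capacity = out_capacity
        self.state = 0               # private per-task state

    def ready(self):
        # Blocking protocol: stall on an empty input fifo or a full output fifo.
        return bool(self.fifo_in) and len(self.fifo_out) < self.out_capacity

    def step(self):
        self.state += self.fifo_in.popleft()   # placeholder computation
        self.fifo_out.append(self.state)

def round_robin(tasks, cycles):
    """Each cycle, run the next ready task after the one that ran last."""
    trace, start = [], 0
    for _ in range(cycles):
        for offset in range(len(tasks)):
            idx = (start + offset) % len(tasks)
            if tasks[idx].ready():
                tasks[idx].step()
                trace.append(tasks[idx].name)
                start = (idx + 1) % len(tasks)
                break
        else:
            trace.append("idle")     # every task is blocked in this cycle
    return trace

tasks = [Task("T1", [1, 2], 4), Task("T2", [], 4),
         Task("T3", [5], 4), Task("T4", [7, 8], 1)]
print(round_robin(tasks, 6))
```

Because each task keeps its own state and fifos, switching to another task needs no save or restore, which is the software analogue of the zero overhead context switch.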
31
[Figure: ports a1 ... a4 and b1 ... b3]
32Space switch
Time switch
[Figure: space switch and time switch examples with ports a1 ... a4 and b1 ... b4, frames, and s1 and s2 along the time axis]
33TST interconnect network
[Figure: TST network built from a time switch, a space switch and a time switch, between the outputs from processors (x) and the inputs to processors (y), with phases 1 2 3 4]
- Communication ctrl: active task memory, cyclostatic
- Configuration memories
- Configuration ctrl: run-time reconfiguring (Appl. Graph 1, Appl. Graph 2)
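A time switch reorders the samples within a frame, and a space switch connects input lines to output lines per time slot. The sketch below chains them as time, space, time, under the assumption that TST stands for time-space-time. The table formats (`slot_map`, `crossbar`) are hypothetical stand-ins for the configuration memories and are replayed unchanged every frame, in the spirit of the cyclostatic control mentioned above.

```python
def time_switch(frame, slot_map):
    """Output slot t carries the sample that arrived in input slot slot_map[t]."""
    return [frame[src] for src in slot_map]

def space_switch(frames, crossbar):
    """In slot t, output line o carries the sample of input line crossbar[t][o]."""
    n_slots = len(frames[0])
    return [[frames[crossbar[t][o]][t] for t in range(n_slots)]
            for o in range(len(frames))]

def tst(frames, in_maps, crossbar, out_maps):
    """Time switch per input line, one space switch, time switch per output line."""
    stage1 = [time_switch(f, m) for f, m in zip(frames, in_maps)]
    stage2 = space_switch(stage1, crossbar)
    return [time_switch(f, m) for f, m in zip(stage2, out_maps)]

# Two lines with two slots per frame; all tables are illustrative values.
frames = [["a1", "a2"], ["b1", "b2"]]
in_maps = [[1, 0], [0, 1]]       # line 0 swaps its slots, line 1 is unchanged
crossbar = [[0, 1], [1, 0]]      # slot 0 straight through, slot 1 crossed
out_maps = [[0, 1], [1, 0]]      # line 1 swaps its slots on the way out
print(tst(frames, in_maps, crossbar, out_maps))
```

Switching to another application graph then amounts to loading a different set of tables.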
34Example Communication backbone
[Figure: timing diagram with Task 1 ... Task 6, Video stream 1, Video stream 2 and blanking intervals along the time axis]
Overlap old and new application graph
35Chip metrics
36Chip layout
[Figure: chip layout with blocks SE, S2MEM, MEM2S, JUGGLER, NR, OUT, RC, video, IN, CC, VS, SDRAM, HS, INT and CPU]
37The End
- Course goals
- understand the design space and the trade-offs between area, time and power
- understand the trends and the driving forces behind the different types of embedded cores (hardware and software)
- understand the role and the task of the system level architect
Be aware that learning, just like architecting, is an iterative and interactive process.
- iterative -> read and consume it again
- interactive -> contact me if you have questions