Title: Embedded Systems in Silicon TD5102 Introduction and overview
1Embedded Systems in Silicon TD5102 Introduction and overview
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2Contents
- Trends
- Platforms
- Application mapping
- Design flow
- Summary
3Observation 1: The 3 Cs
- Convergence of 3 Cs
- computers, communications and consumer electronics
- The computer enters the 3rd phase
- computing power -> networking -> intelligent processing
- The world is one network
- wherever, whenever, all information and communication available
We get a smart environment
4Observation 2: Current design practice
- Y-Chart (Gajski-Kuhn)
- Design flow is a path in the Y-chart
- Flow is largely manual until RT level
5Observation 3: Informal system specification
6Observation 4: Design productivity
- Yes, we can fabricate the ICs, but
- Can we design them?
- Can we program them?
7Observation 5: More dynamic applications
Video
P. Kuhn, G. Diebel, "Complexity Analysis of the MPEG-4 VM 8.0", ISO/IEC JTC1/SC29/WG11/MPEG97/m2862, Fribourg, October 1997
3D
8Observation 6: Memory problem
[Figure (Patterson): performance (log scale, 1 to 1000) versus time (1980 to 2000). µProc (CPU) performance grows 55%/year (Moore's Law), DRAM performance grows 7%/year.]
9What do we learn from these observations?
- We need
- Short Time-to-Market
- reuse
- short design time
- Flexible solution
- programmability
- reconfigurability
- Scalability
- Low power
- Low cost
- QoS control
At sufficient performance!
10Solution?
- Platforms
- HW and SW IP reuse
- Standardization (interfaces)
- QoS (quality of service) hooks
- Advanced Design Flow for Platforms
- Raise abstraction level
- Tool support
- Modeling of Power, Cost, Performance
- Predictable design
11Lecture 1 Introduction
- Trends
- Platforms
- Application mapping
- Design flow
- Summary
12What is a platform?
A platform is a generic, but domain-specific, information processing (sub-)system.
In the future available as a single chip (SoC) or package (SiP).
13What is a platform?
- HW properties
- One or more programmable processors
- Advanced memory organization
- Programmable communication network
- I/O (highly domain dependent)
- Possible extra HW features
- Reconfigurable logic
- Domain specific accelerators
14What is a platform?
- SW components
- Standardized RTOS
- Proper tooling for platform system design
- Compilers, Models, Exploration, Debugging, Simulation, ...
- Possible extra SW features
- Middleware layer on top of OS for features like
- QoS
- Domain specific protocols
- Domain specific SW interfaces
- Control reconfigurable logic
- Library components
- Distributed / Active network processing
- Billing
- Security
15Example Platform: Philips Nexperia
- Available in the Billion Transistor Era
- E.g. TI OMAP, Sony Cell, Philips Nexperia, TRIPS, Xilinx Virtex-4 Pro, ...
16Future platforms
- Example: Smart Networked Devices
[Figure: layered platform stack with active packets, Virtual Machine, Protocols / Multimedia (MPEG 21) / Network, OS, library, reconfig. hardware, accelerator hardware, programmable hardware, radio]
17Future platform architecture concept
[Figure: hierarchical platform with Level 0, Level 1, ..., Level N. Each level has a communication network; CPUs, accelerators, reconfigurable HW blocks, memory and I/O attach to the networks.]
18Future platforms
Network interface
On-chip Network
IP core
- IP isles
- 32 RISC microprocessor: 20 Kgates
- MPEG decoding: 100 Kgates
- Wavelet filtering: 40 Kgates
- SRAM
- DRAM
- FPGA block
19Lecture 1 Introduction
- Trends
- Platforms
- Application mapping
- Design flow
- Summary
20Platform and platform design
Applications
SDT: system design technology
Design technology
Platform
PDT: platform design technology
Enabling technologies
21What is the system designer's problem?
Specification
Implementation
Find, for an application (idea/specification), an efficient mapping/implementation on a given realization space, under given constraints (cost, P, E, T, ED, throughput, pins, ...)
22A (single) processor: how does it look inside?
23Mapping: placing operations in space and time
- d = a * b
- e = a + d
- f = 2 * b + d
- r = f - e
- x = z + y
24How to map these operations?
- Architecture 1
- One Function Unit
- All operations single cycle latency
[Figure: data-flow graph of the operations, with inputs a, b, 2, z, y and results d, e, f, r, x]
25How to map these operations?
- Architecture 2
- One Add-Sub and one Mul unit
- All operations single cycle latency
26How to map these operations?
- Architecture 3
- One Add-sub and one Mul unit
- Add/Sub 1 cycle, Mul 2 cycles
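The three architectures can be compared with a small greedy list scheduler. This is only a sketch under stated assumptions: the five statements are taken to be d = a * b, e = a + d, f = 2 * b + d, r = f - e, x = z + y (the operators are not legible in this transcript), 2 * b is counted as a separate multiply, units are not pipelined, and the names `Op`, `Unit`, `example_ops` and `schedule_length` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { AddSub, Mul };

struct Op {
    std::string name;
    Kind kind;
    std::vector<int> deps;  // indices of operations whose results are needed
};

struct Unit {
    bool does_addsub;
    bool does_mul;
    int add_latency;  // cycles for an add or sub on this unit
    int mul_latency;  // cycles for a mul on this unit
};

// Assumed operation list: m = 2*b is an intermediate of f = 2*b + d.
std::vector<Op> example_ops() {
    return {
        {"d", Kind::Mul, {}},         // 0: d = a * b
        {"m", Kind::Mul, {}},         // 1: m = 2 * b
        {"e", Kind::AddSub, {0}},     // 2: e = a + d
        {"f", Kind::AddSub, {1, 0}},  // 3: f = m + d
        {"r", Kind::AddSub, {3, 2}},  // 4: r = f - e
        {"x", Kind::AddSub, {}},      // 5: x = z + y
    };
}

// Greedy list scheduling: in each cycle every free unit takes the first
// ready operation (in list order) that it can execute.
// Precondition: every operation kind is supported by at least one unit.
// Returns the number of cycles until the last result is available.
int schedule_length(const std::vector<Op>& ops, const std::vector<Unit>& units) {
    const int n = static_cast<int>(ops.size());
    std::vector<int> finish(n, -1);               // last busy cycle; -1 = unscheduled
    std::vector<int> busy_until(units.size(), 0);
    int done = 0, cycle = 0, last = 0;
    while (done < n) {
        ++cycle;
        for (std::size_t u = 0; u < units.size(); ++u) {
            if (busy_until[u] >= cycle) continue;  // unit still executing
            for (int i = 0; i < n; ++i) {
                if (finish[i] != -1) continue;     // already scheduled
                const bool is_mul = ops[i].kind == Kind::Mul;
                if (is_mul ? !units[u].does_mul : !units[u].does_addsub) continue;
                bool ready = true;
                for (int d : ops[i].deps)
                    if (finish[d] == -1 || finish[d] >= cycle) ready = false;
                if (!ready) continue;
                const int lat = is_mul ? units[u].mul_latency : units[u].add_latency;
                finish[i] = cycle + lat - 1;
                busy_until[u] = finish[i];
                if (finish[i] > last) last = finish[i];
                ++done;
                break;                             // this unit is taken for now
            }
        }
    }
    return last;
}
```

Under these assumptions the scheduler needs 6 cycles on Architecture 1, 4 cycles on Architecture 2 and 6 cycles on Architecture 3; a different priority order or pipelined units would give other numbers.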
27There are many mapping solutions
Let $S$ be the solution space containing solutions $x = (x_i)$. Then $x$ is a Pareto point $\iff x \in S$ and $\forall y \in S,\ y \neq x:\ \exists i:\ x_i < y_i$ (lower values are better).
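The Pareto condition can be checked directly by comparing one solution against all others. The following is a minimal sketch, assuming every criterion is a cost to be minimized and that solutions are compared by position rather than value; `Solution` and `is_pareto` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

using Solution = std::vector<double>;  // one cost per criterion, lower is better

// x = S[index] is a Pareto point if, for every other solution y,
// there is at least one criterion i with x[i] < y[i].
bool is_pareto(const std::vector<Solution>& S, std::size_t index) {
    const Solution& x = S[index];
    for (std::size_t k = 0; k < S.size(); ++k) {
        if (k == index) continue;
        const Solution& y = S[k];
        assert(y.size() == x.size());
        bool x_better_somewhere = false;
        for (std::size_t i = 0; i < x.size(); ++i)
            if (x[i] < y[i]) x_better_somewhere = true;
        if (!x_better_somewhere) return false;  // y is at least as good everywhere
    }
    return true;
}
```

With this strict form of the condition, two identical solutions exclude each other, so duplicates are best removed before filtering.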
28Can we do better?
Yes !!
- Much better !!
- transforming the specification
- a different architecture
- a different mapping
- speculative execution
- be creative ..
29Transforming the specification (1)
Example: tree height reduction
Based on associativity of the operation: a + (b + c) = (a + b) + c
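A short sketch of what tree height reduction buys: the same sum evaluated as a left-to-right chain and as a balanced tree, with the depth of the adder tree tracked alongside the value. Integer addition is used because it is associative; for floating point the two orders may round differently. `Sum`, `chain_sum` and `tree_sum` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Sum {
    long value;
    int depth;  // number of adders on the longest path
};

// ((a + b) + c) + d ...: every addition waits for the previous one.
Sum chain_sum(const std::vector<long>& v) {
    Sum s{v.empty() ? 0 : v[0], 0};
    for (std::size_t i = 1; i < v.size(); ++i) {
        s.value += v[i];
        ++s.depth;
    }
    return s;
}

// Balanced tree over the half-open range [lo, hi): (a + b) + (c + d) ...
Sum tree_sum_range(const std::vector<long>& v, std::size_t lo, std::size_t hi) {
    if (hi == lo) return {0, 0};
    if (hi - lo == 1) return {v[lo], 0};
    const std::size_t mid = lo + (hi - lo) / 2;
    const Sum l = tree_sum_range(v, lo, mid);
    const Sum r = tree_sum_range(v, mid, hi);
    return {l.value + r.value, 1 + std::max(l.depth, r.depth)};
}

Sum tree_sum(const std::vector<long>& v) {
    return tree_sum_range(v, 0, v.size());
}
```

For eight operands the chain has depth 7 while the balanced tree has depth 3, provided enough adders are available to work in parallel.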
30Transforming the specification (2)
Transformed: r = f - e = (2b + d) - (a + d) = 2b - a; x = z + y
Original: d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y
31Changing the architecture: adding more complex units
4-input adder: why is this faster?
32Changing the architecture: adding more complex units
- In the extreme case put everything into one unit!
Spatial mapping: no control flow
33More complex control flow
Program part:
-a-
If cond Then
-b-
Else
-c-
-d-
34Mapping the CFG example: 3 options, what's the best?
Option 1:
-a- br c
-b- jmp d
-c-
-d-
Option 2:
-a- br b
-c- jmp d
-b-
-d-
Option 3:
-a- br c
-b-
-d-
-c- jmp d
35Why not remove the control flow?
36If conversion shortens the schedule
Before:
-a- br c
-b- jmp d
-c-
-d-
After if-conversion:
-a-
cond -> -b-
!cond -> -c-
-d-
Using guarded instructions like
r3 -> add r1,r2,r5
!r3 -> mul r4,r5,3
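The effect of if-conversion can be imitated in C++ by computing both sides unconditionally and letting the guard select which result is committed. This is a sketch, assuming the guarded instructions use destination-first operand order (r1 = r2 + r5 under r3, r4 = r5 * 3 under !r3); `Regs`, `with_branch`, `if_converted` and `same` are hypothetical names.

```cpp
#include <cassert>

struct Regs {
    int r1, r2, r3, r4, r5;
};

// Control-flow version: a branch decides which instruction executes.
void with_branch(Regs& r) {
    if (r.r3) {
        r.r1 = r.r2 + r.r5;  // -b-
    } else {
        r.r4 = r.r5 * 3;     // -c-
    }
}

// If-converted version: both operations are issued, the guard only
// decides whether the result is written back. No branch is needed.
void if_converted(Regs& r) {
    const int add_result = r.r2 + r.r5;
    const int mul_result = r.r5 * 3;
    r.r1 = r.r3 ? add_result : r.r1;  // guard r3
    r.r4 = r.r3 ? r.r4 : mul_result;  // guard !r3
}

bool same(const Regs& a, const Regs& b) {
    return a.r1 == b.r1 && a.r2 == b.r2 && a.r3 == b.r3 &&
           a.r4 == b.r4 && a.r5 == b.r5;
}
```

Both versions leave the registers in the same state; the converted one trades extra executed operations for a schedule without branches.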
37Speculative execution makes it even shorter!
Before:
-a- br c
-b- jmp d
-c-
-d-
With speculation:
-a-, -b- and -c- side by side
-d-
Why not execute -d- in parallel as well?
38However: real life is much more complex
E.g. MPEG-4 multimedia
Huge requirements
- > 10 GOP/s
- > 6 GB/s
- > 10 MB storage
Software specification
- more than 200 000 lines of C
- hundreds of files
- written by approx. 80 teams
39Can we handle this?
Today's implementations
- small images
- decoding only
- not real-time
- several W
- single task
- limited dynamism
Wanted features
- large images (HDTV)
- encoding and decoding
- real-time
- 100 mW (mobile)
- multiple tasks
- dealing with dynamism
40Lecture 1 Introduction
- Trends
- Platforms
- Application mapping
- Design flow
- Summary
41Embedded system design
How to map your application graph A(L,A,D) to hardware graph (L,N,C)
L: design level (e.g. architecture, implementation or realization level)
A: application components (e.g. tasks, operations, data structures)
D: dependences between application components
N: hardware components (e.g. processors, ASICs, FPGA, memories)
C: connections between hardware components
42Abstraction levels
[Figure: abstraction levels, their specification languages and the inter-level transformations]
Level 0 Requirements (idea): English
is modeled by
Level 1 Architecture: ES/RT-UML, Esterel, SDL, C, JAVA, ...
is implemented by
Level 2 Implementation: C, VHDL, SystemC
compiles into
Level 3 Realization: machine code, hardware modules
Exploration
search area
43Design space exploration
[Figure: exploration at level n. Labels: design point at level n-1, design transformation LT(n-1,n), exploration search area, realization space, cost, global optimum]
44Design space exploration framework: another Y-chart
45Design flow steps and constraints
idea
high abstraction level
Refinement steps
Architecture / Platform constraints
Transformation
low abstraction level
realization
46In which order should we perform the steps?
Decision trees
47Well-known phase ordering examples
- Concurrency versus Data management
- e.g. loop partitioning versus array partitioning for a multiprocessor
- Scheduling versus Register allocation
- Logic synthesis versus Placement and Routing
48Rule of thumb!
- Perform steps with biggest impact first
- Biggest impact
- depends on your interest (cost function)
- min. E, P, ED, D, Area, Npins, ...
49Phase ordering example: why fix data storage/transfer before concurrency management issues?
Recursive image processing algorithm on local neighborhoods:
for (i = 0 .. I-1)
  for (j = 0 .. J-1)
    img[i][j] = f(img[i][j-k], old_img[i][j])
50Why fix data storage/transfer before concurrency management issues?
- Unrolling outer loop (i) M times
- needs M J-word FIFOs (image lines)
- M data paths
51Why fix data storage/transfer before concurrency management issues?
Unrolling inner loop (j) (limited by k): M - 1 buffer registers
for (i = 0 .. I-1)
  for (j = 0 .. (J div 2)-1)
    img[i][2j-1] = f(img[i][2j-k-1], old_img[i][2j-1])
    img[i][2j] = f(img[i][2j-k], old_img[i][2j])
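The inner-loop unrolling can be sketched in C++ as follows. The pixel function `f`, the boundary handling (columns j < k are left untouched) and the names `filter` and `filter_unrolled2` are assumptions made for illustration; the slides only give the loop nest.

```cpp
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Hypothetical pixel function; the slides only call it f.
inline int f(int earlier_pixel, int old_pixel) {
    return (earlier_pixel + old_pixel) / 2;
}

// Reference loop nest: recursion along j with dependence distance k.
void filter(Image& img, const Image& old_img, std::size_t k) {
    for (std::size_t i = 0; i < img.size(); ++i)
        for (std::size_t j = k; j < img[i].size(); ++j)
            img[i][j] = f(img[i][j - k], old_img[i][j]);
}

// Inner loop unrolled twice. This is legal only because k >= 2: the
// iterations j and j+1 then do not depend on each other, so two data
// paths can compute p0 and p1 side by side.
void filter_unrolled2(Image& img, const Image& old_img, std::size_t k) {
    assert(k >= 2);
    for (std::size_t i = 0; i < img.size(); ++i) {
        std::size_t j = k;
        for (; j + 1 < img[i].size(); j += 2) {
            const int p0 = f(img[i][j - k], old_img[i][j]);
            const int p1 = f(img[i][j + 1 - k], old_img[i][j + 1]);
            img[i][j] = p0;
            img[i][j + 1] = p1;
        }
        if (j < img[i].size())  // leftover column when the count is odd
            img[i][j] = f(img[i][j - k], old_img[i][j]);
    }
}
```

The unroll factor is bounded by k, which is why the slide notes that inner-loop unrolling is "limited by k".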
52Proposed System Design Methodology
[Figure: proposed system design flow. System Specification -> System-Level Exploration and refinement -> Optimized algorithms (C/C++ specification) -> SW/HW Partitioning/Exploration -> (parallelizing) Compiler Steps -> Code per (parallel) proc., and HW Synthesis Steps -> Structural VHDL Code. The steps after partitioning carry the labels "Traditional" and "Architecture".]
53Design flow
54Remove OO overhead
55Object-based versus Object-oriented
Object-oriented
- calls through function pointer
- cannot be inlined
Object-based
- direct calls
- can be inlined
=> OO is good for specification, not for implementation
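The difference between the two styles shows up in how a call is bound. The sketch below uses hypothetical types (`Filter`, `Doubler`, `DoublerOB`) and is only meant to illustrate the contrast; a modern compiler may still devirtualize the first variant when it can see the whole program.

```cpp
#include <cassert>

// Object-oriented: apply() is called through the virtual table, i.e.
// through a function pointer that is resolved at run time.
struct Filter {
    virtual ~Filter() = default;
    virtual int apply(int x) const = 0;
};

struct Doubler : Filter {
    int apply(int x) const override { return 2 * x; }
};

int run_oo(const Filter& filter, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += filter.apply(i);
    return sum;
}

// Object-based: the concrete type is known at compile time, so the call
// is direct and the compiler is free to inline it into the loop.
struct DoublerOB {
    int apply(int x) const { return 2 * x; }
};

int run_ob(const DoublerOB& filter, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += filter.apply(i);
    return sum;
}
```

Both loops compute the same result; only the binding of `apply` differs.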
56Whole-system optimization techniques
- Aggressive use of traditional inter-procedural techniques
- in the embedded world you often know the whole application!
- OO specific optimization
- Data allocation optimization
57Example data inlining
- Eliminate
- dynamic allocation
- pointer de-reference
- polymorphic calls
class A {
  B *b;
  A() { b = new C; }
  ~A() { delete b; }
  void f() { b->g(); }
};
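A before/after sketch of data inlining, assuming whole-program knowledge that `b` always refers to a `C`. The bodies of `B::g` and `C::g`, the `int` return type of `f` (the slide's `f` returns nothing) and the class names `A_before` and `A_after` are assumptions added so the effect can be observed.

```cpp
#include <cassert>

struct B {
    virtual ~B() = default;
    virtual int g() { return 1; }
};

struct C : B {
    int g() override { return 2; }
};

// Before: heap allocation, pointer dereference and a polymorphic call.
class A_before {
    B* b;
public:
    A_before() : b(new C) {}
    ~A_before() { delete b; }
    A_before(const A_before&) = delete;             // owning pointer: no copies
    A_before& operator=(const A_before&) = delete;
    int f() { return b->g(); }
};

// After data inlining: the C object lives inside A, so the dynamic
// allocation and the pointer dereference are gone, and the call target
// is known statically.
class A_after {
    C b;
public:
    int f() { return b.g(); }
};
```

The transformation is only valid when the analysis can prove the dynamic type of the member, which is plausible in the embedded setting where the whole application is known.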
58Example dynamic allocation removal
- Eliminate dynamic allocation
- Re-use stack memory already needed for
other call tree branches
Before:
void teq(..., short size, ...) {
  float *Ryy;
  Ryy = new float[size];
  /* teq computation */
  delete[] Ryy;
}
teq(..., 64, ...); teq(..., 64, ...);
After:
void teq(..., ...) {
  float Ryy[64];
  /* teq computation */
}
teq(..., ...); teq(..., ...);
59ADSL result: footprint -33%
Total memory footprint (code + data), relative to ARM C optimized = 100 (axis ticks at 200kB and 400kB):
Unoptimized: 106
ARM C optimized (-O2 -Ospace): 100
Inlining, dead code, constant prop.: 83
Virtual call elimination: 82
Data alloc. optim.: 67
60Dynamic Memory Management
- Data type refinement
- Virtual memory management
61Data type refinement
ATM_cell *Data_In;
Association_Table *Routing_Table;
Routing_Table = new Association_Table();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
62Data type refinement: Array
ATM_cell *Data_In;
Array *Routing_Table;
Routing_Table = new Array();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
63Data type refinement: Linked List
ATM_cell *Data_In;
Linked_List *Routing_Table;
Routing_Table = new Linked_List();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
64Data type refinement: Binary Tree
ATM_cell *Data_In;
Binary_Tree *Routing_Table;
Routing_Table = new Binary_Tree();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
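The refinement step can be pictured as choosing one concrete container behind the same `Lookup` interface. The sketch below is an illustration only: the `id` field of `ATM_cell`, the `Insert` method, the `route` helper and the use of standard containers (with `std::set` standing in for the binary tree) are assumptions, not the implementation behind the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <set>
#include <vector>

// Hypothetical cell: only a connection identifier is modelled.
struct ATM_cell {
    int id;
};

// Abstract data type used by the specification code.
class Association_Table {
public:
    virtual ~Association_Table() = default;
    virtual void Insert(int id) = 0;
    virtual bool Lookup(const ATM_cell* cell) const = 0;
};

// Alternative 1: array with linear search, compact storage.
class Array : public Association_Table {
    std::vector<int> ids_;
public:
    void Insert(int id) override { ids_.push_back(id); }
    bool Lookup(const ATM_cell* cell) const override {
        return std::find(ids_.begin(), ids_.end(), cell->id) != ids_.end();
    }
};

// Alternative 2: linked list with linear search, cheap insertion.
class Linked_List : public Association_Table {
    std::list<int> ids_;
public:
    void Insert(int id) override { ids_.push_back(id); }
    bool Lookup(const ATM_cell* cell) const override {
        return std::find(ids_.begin(), ids_.end(), cell->id) != ids_.end();
    }
};

// Alternative 3: search tree with logarithmic lookup, more memory per entry.
class Binary_Tree : public Association_Table {
    std::set<int> ids_;
public:
    void Insert(int id) override { ids_.insert(id); }
    bool Lookup(const ATM_cell* cell) const override {
        return ids_.count(cell->id) != 0;
    }
};

// The client code is identical for every alternative.
bool route(const Association_Table* Routing_Table, const ATM_cell* Data_In) {
    return Routing_Table->Lookup(Data_In);
}
```

Each alternative trades lookup time against memory footprint, which is what makes the choice an exploration problem rather than a fixed decision.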
65Task Concurrency Management
Going from specification concurrency to
implementation concurrency
66Modelling MTG
67TCM transformations
- Why transformations?
- shift existing Pareto curves
- create new points on the Pareto curves
- improve available task level parallelism
68TCM Transformations
[Figure: schedules of tasks T1..T6 on processors P1, P2 and hardware block HW1 within the MA cycle budget, using a shared memory area. Captions: tasks freely assigned to 2 processors; tasks order constrained (partial order constraints) to reduce memory requirements (less memory); independent, dynamic tasks assigned to 1 processor; conflict]
69Static Memory Management
DTSE: data transfer and storage exploration
70Static data memory management (DMM)
3 Exploit memory hierarchy
5 Meet real-time constraints
6 Exploit limited life-time and data layout freedom
[Figure: memory hierarchy. On chip: processor data paths, local latches 1..N with banks 1..N, cache bank recombine, L1 cache, L2 cache. Off chip: SDRAM]
71DMM how to improve locality?
Before:
FOR i:=1 TO N DO B[i]:=f(A[i]);
FOR i:=1 TO N DO C[i]:=g(B[i]);
After:
FOR i:=1 TO N DO
  B[i]:=f(A[i]); C[i]:=g(B[i]);
[Figure: memory hierarchy. On chip: processor data paths, local latches 1..N with banks 1..N, cache bank recombine, L1 cache, L2 cache. Off chip: SDRAM]
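The locality improvement above is loop fusion: B[i] is consumed right after it is produced, while it is still in a register or a nearby memory layer. A minimal C++ sketch, with hypothetical bodies for `f` and `g` and hypothetical function names `separate_loops` and `fused_loop`:

```cpp
#include <cassert>
#include <vector>

// Hypothetical element functions; the slides only name them f and g.
inline int f(int a) { return a + 1; }
inline int g(int b) { return 2 * b; }

// Two loops: the whole of B is written before any element is read back,
// so for large N each B[i] has left the nearby memory by the time it is used.
void separate_loops(const std::vector<int>& A, std::vector<int>& B,
                    std::vector<int>& C) {
    B.resize(A.size());
    C.resize(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) B[i] = f(A[i]);
    for (std::size_t i = 0; i < A.size(); ++i) C[i] = g(B[i]);
}

// Fused loop: B[i] is consumed immediately after it is produced.
void fused_loop(const std::vector<int>& A, std::vector<int>& B,
                std::vector<int>& C) {
    B.resize(A.size());
    C.resize(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        B[i] = f(A[i]);
        C[i] = g(B[i]);
    }
}
```

If B is not needed after the loop, the fused form also allows the array to be replaced by a single scalar.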
72Exploiting Memory Hierarchy
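[Figure: processor data paths and reg. file connected to three memory layers M'', each annotated with its number of accesses A and its power per access P]
A = 100 at P = 0.01
A = 10 at P = 0.1
A = 1 at P = 1
$P_{\text{before}} = 100$
$P_{\text{after}} = 100 \cdot 0.01 + 10 \cdot 0.1 + 1 \cdot 1 = 3$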
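The power estimate is a sum of accesses times power per access over all layers. A small sketch of that arithmetic, with a hypothetical `Layer` record and `total_power` function, and assuming that "before" means all 100 accesses go to the layer with P = 1:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Layer {
    double accesses;          // A: number of accesses to this layer
    double power_per_access;  // P: relative power per access
};

// Total power = sum over all layers of A * P.
double total_power(const std::vector<Layer>& layers) {
    double total = 0.0;
    for (const Layer& layer : layers)
        total += layer.accesses * layer.power_per_access;
    return total;
}
```

With the numbers from the slide, a single layer `{100, 1}` gives 100, while the three-layer hierarchy `{100, 0.01}`, `{10, 0.1}`, `{1, 1}` gives 3 (up to floating-point rounding).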
73How to Avoid N-port Memories?
74Address Optimization
75Algebraic Transformations and Aggressive Code Hoisting for Expression Elimination
Initial
for (y=0..9; y++)
  for (x=0..99; x++)
    if (x>1) A[(y%3)*3 + (x-2)%3] ...
    if (x>4) ... A[(y%3)*3 + (x-5)%3]
76Modulo substitution for piece-wise linear addressing
Optimised-1st
for (y=0..9; y++)
  v_y = (y%3)*3
  for (x=0..99; x++)
    v_yx = (x-2)%3 + v_y
    if (x>1) A[v_yx]
    if (x>4) A[v_yx]
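The two address computations can be checked against each other. This sketch assumes the garbled index expressions read A[(y%3)*3 + (x-2)%3] and A[(y%3)*3 + (x-5)%3]; the key algebraic fact is that (x-5)%3 equals (x-2)%3 for x > 4, so one hoisted value serves both accesses. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Initial code: every access recomputes its full index expression.
std::vector<int> addresses_initial() {
    std::vector<int> out;
    for (int y = 0; y <= 9; ++y)
        for (int x = 0; x <= 99; ++x) {
            if (x > 1) out.push_back((y % 3) * 3 + (x - 2) % 3);
            if (x > 4) out.push_back((y % 3) * 3 + (x - 5) % 3);
        }
    return out;
}

// Optimised code: the y-dependent part is hoisted out of the inner loop
// and the shared x-dependent part is computed once per iteration.
std::vector<int> addresses_optimised() {
    std::vector<int> out;
    for (int y = 0; y <= 9; ++y) {
        const int v_y = (y % 3) * 3;
        for (int x = 0; x <= 99; ++x) {
            // For x < 2 this value is negative, but it is never used then.
            const int v_yx = (x - 2) % 3 + v_y;
            if (x > 1) out.push_back(v_yx);
            if (x > 4) out.push_back(v_yx);
        }
    }
    return out;
}
```

Both functions produce the same address stream, so the transformation removes modulo and multiply operations without changing which elements of A are touched.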
77What do we gain? Running example: cavity detection
- Application domain
- Computer Tomography in medical imaging
- Algorithm
- Cavity detection in CT-scans
- Detect dark regions in successive images
- Indicate cavity in brain
78Starting point
[Figure: processing blocks Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots, Max Value]
- Reference (conceptual) C code for the algorithm
- all functions: image_in[N x M](t-1) -> image_out[N x M](t)
- new value of pixel depends on its neighbors
- neighbor pixels read from background memory
- approximately 110 lines of C code (ignoring file I/O etc)
- experiments with N x M = 640 x 400 pixels
- straightforward implementation: 6 image buffers
79Cavity Detector Results
80Lecture 1 Introduction
- Trends
- Platforms
- Application mapping
- Design flow
- Summary
81Summary
- Billions of Embedded systems, everywhere!!!
- Multi-media applications become extremely complex and dynamic
- Time-to-Market pressure
- Solution
- Platforms as design target (raise abstraction level)
- Advanced emb. system design flow needed
82Traditional Design Methodology
[Figure: traditional design flow. System Specification -> SW/HW Partitioning/Exploration -> HW System Exploration and (SW System Exploration) -> Optimized HW spec (VHDL specification) and Optimized SW spec (C specification) -> HW Synthesis Steps -> Structural VHDL Code, and (parallelizing) Compiler Steps -> Code per (parallel) proc. The last steps carry the labels "Architecture" and "Traditional".]
83Proposed System Design Methodology
[Figure: proposed system design flow. System Specification -> System-Level Exploration and refinement (our main focus) -> Optimized algorithms (C/C++ specification) -> SW/HW Partitioning/Exploration -> (parallelizing) Compiler Steps -> Code per (parallel) proc., and HW Synthesis Steps -> Structural VHDL Code. The steps after partitioning carry the labels "Traditional" and "Architecture".]