Embedded Systems in Silicon TD5102: Introduction and overview


Provided by: henkcor2

1
Embedded Systems in Silicon TD5102: Introduction and overview
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2
Contents

Trends
Platforms
Application mapping
Design flow
Summary

3
Observation 1: The 3 Cs

Convergence of 3 Cs
computers, communications and consumer electronics
The computer enters the 3rd phase
computing power - networking - intelligent processing
The world is one network
wherever, whenever, all information and communication available

We get a smart environment
4
Observation 2: Current design practice

Y-Chart (Gajski-Kuhn)
Design flow is a path in the Y-chart
Up to RT level the flow is largely manual

5
Observation 3: Informal system specification
6
Observation 4: Design productivity

Yes, we can fabricate the ICs, but
Can we design them ?
Can we program them ?

7
Observation 5: More dynamic applications

Video
P. Kuhn, G. Diebel, Complexity Analysis of the MPEG-4 VM 8.0, ISO/IEC JTC1/SC29/WG11/MPEG97/m2862, Fribourg, October 1997

3D
8
Observation 6: Memory problem
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). Curves: CPU (µProc, 55%/year, Moore's Law) and DRAM (7%/year). Source: Patterson]
9
What do we learn from these observations?

We need
Short Time-to-Market
reuse
short design time
Flexible solution
programmability
reconfigurability
Scalability
Low power
Low cost
QoS control

At sufficient performance !
10
Solution ?

Platforms
HW and SW IP reuse
Standardization (interfaces)
QoS (quality of service) hooks
Advanced Design Flow for Platforms
Raise abstraction level
Tool support
Modeling of Power, Cost, Performance
Predictable design

11
Lecture 1 Introduction

Trends
Platforms
Application mapping
Design flow
Summary

12
What is a platform?
A platform is a generic, but domain specific information processing (sub-)system
In future available as single chip (SoC), or package (SiP)
13
What is a platform?

HW properties
One or more programmable processors
Advanced memory organization
Programmable communication network
I/O (highly domain dependent)
Possible extra HW features
Reconfigurable logic
Domain specific accelerators

14
What is a platform?

SW components
Standardized RTOS
Proper tooling for platform system design
Compilers, Models, Exploration, Debugging, Simulation, ...
Possible extra SW features
Middleware layer on top of OS for features like
QoS
Domain specific protocols
Domain specific SW interfaces
Control reconfigurable logic
Library components
Distributed / Active network processing
Billing
Security

15
Example platform: Philips Nexperia

Available in the Billion Transistor Era
E.g. TI OMAP, Sony Cell, Philips Nexperia, TRIPS, Xilinx Virtex-4 Pro, ...

16
Future platforms

Example: Smart Networked Devices

active packets
Virtual Machine
Protocols Multimedia (MPEG 21) Network
OS
library
reconfig. hardware
accelerator hardware
programmable hardware
radio
17
Future platform architecture concept
[Figure: hierarchical platform. Blocks shown: CPUs, accelerators, reconfigurable HW blocks, memory and I/O, connected by communication networks at Level 0, Level 1, ..., Level N]
18
Future platforms
Network interface
On-chip Network
IP core

IP - Isles
32 RISC microprocessor 20 Kgates
MPEG decoding 100 Kgates
Wavelet filtering 40 Kgates
SRAM
DRAM
FPGA block

19
Lecture 1 Introduction

Trends
Platforms
Application mapping
Design flow
Summary

20
Platform and platform design
Applications
SDT system design technology
Design technology
Platform
PDT platform design technology
Enabling technologies
21
What is the system designer's problem?

Idea

Specification
Implementation
Find for an application (idea/specification) an efficient mapping/implementation on a given realization space, under given constraints (cost, P, E, T, ED, Throughput, pins, ..)
22
A (single) processor: how does it look inside?
23
Mapping: placing operations in space and time

d = a * b
e = a + d
f = 2 * b + d
r = f - e
x = z + y

24
How to map these operations?

Architecture 1
One Function Unit
All operations single cycle latency

[Figure: data-flow graph with inputs a, b, 2, z, y and results d, e, f, r, x]
25
How to map these operations?

Architecture 2
One Add-Sub and one Mul unit
All operations single cycle latency

26
How to map these operations?

Architecture 3
One Add-sub and one Mul unit
Add/Sub 1 cycle, Mul 2 cycles
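
The three architectures can be compared with a small greedy list scheduler. This is only a sketch under assumptions that are not on the slides: the operations are taken to be d = a*b, e = a+d, f = 2*b+d, r = f-e, x = z+y (with `t` a helper name for 2*b), function units are not pipelined, and the names `Op`, `Unit`, `Arch` and `scheduleLength` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { AddSub, Mul };

struct Op {
    Kind kind;
    std::vector<int> deps;  // indices of operations whose results are consumed
};

struct Unit {
    bool addSub;  // unit can execute add/sub
    bool mul;     // unit can execute mul
};

struct Arch {
    std::vector<Unit> units;
    int addSubLatency;
    int mulLatency;
};

// Assumed operation set: d = a*b, t = 2*b, e = a+d, f = t+d, r = f-e, x = z+y.
std::vector<Op> exampleOps() {
    return {
        {Kind::Mul, {}},         // 0: d
        {Kind::Mul, {}},         // 1: t
        {Kind::AddSub, {0}},     // 2: e
        {Kind::AddSub, {1, 0}},  // 3: f
        {Kind::AddSub, {3, 2}},  // 4: r
        {Kind::AddSub, {}},      // 5: x
    };
}

Arch arch1() { return Arch{{Unit{true, true}}, 1, 1}; }
Arch arch2() { return Arch{{Unit{true, false}, Unit{false, true}}, 1, 1}; }
Arch arch3() { return Arch{{Unit{true, false}, Unit{false, true}}, 1, 2}; }

// Greedy list scheduling: in every cycle each free unit takes the first
// ready operation it can execute. Returns the schedule length in cycles.
// Assumes every operation kind is supported by at least one unit.
int scheduleLength(const std::vector<Op>& ops, const Arch& arch) {
    const int n = static_cast<int>(ops.size());
    std::vector<int> finish(n, -1);  // first cycle in which the result is usable
    std::vector<int> freeAt(arch.units.size(), 0);
    int scheduled = 0;
    int length = 0;
    for (int cycle = 0; scheduled < n; ++cycle) {
        for (std::size_t u = 0; u < arch.units.size(); ++u) {
            if (freeAt[u] > cycle) continue;  // unit still busy (not pipelined)
            for (int i = 0; i < n; ++i) {
                if (finish[i] != -1) continue;  // already scheduled
                const bool isMul = ops[i].kind == Kind::Mul;
                if (isMul ? !arch.units[u].mul : !arch.units[u].addSub) continue;
                bool ready = true;
                for (int d : ops[i].deps)
                    if (finish[d] == -1 || finish[d] > cycle) ready = false;
                if (!ready) continue;
                const int lat = isMul ? arch.mulLatency : arch.addSubLatency;
                finish[i] = cycle + lat;
                freeAt[u] = cycle + lat;
                length = std::max(length, finish[i]);
                ++scheduled;
                break;  // one operation per unit per cycle
            }
        }
    }
    return length;
}
```

Under these assumptions the single-unit architecture needs 6 cycles, the add-sub plus mul architecture 4 cycles, and the variant with a 2-cycle multiplier 6 cycles. A pipelined multiplier or a different priority rule would give other numbers.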

27
There are many mapping solutions
Let S be the solution space containing solutions x = (x_i); then
x is a Pareto point ⇔ x ∈ S, and ∀ y ∈ S (y ≠ x) ∃ i: x_i < y_i
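
A Pareto filter over a small solution set might look as follows. It is a minimal sketch that assumes every cost component is minimized and uses the usual dominance test (y dominates x if y is nowhere worse and somewhere strictly better), which differs from the definition above only for duplicate points. `Point`, `dominates` and `paretoPoints` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;  // cost vector (x_i), lower is better

// True if y is at least as good as x in every component
// and strictly better in at least one.
bool dominates(const Point& y, const Point& x) {
    bool strictlyBetter = false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (y[i] > x[i]) return false;
        if (y[i] < x[i]) strictlyBetter = true;
    }
    return strictlyBetter;
}

// Keep the solutions of S that no other solution dominates.
std::vector<Point> paretoPoints(const std::vector<Point>& S) {
    std::vector<Point> result;
    for (const Point& x : S) {
        bool dominated = false;
        for (const Point& y : S) {
            if (dominates(y, x)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) result.push_back(x);
    }
    return result;
}
```

With hypothetical (cycles, units) pairs such as (6,1), (4,2), (6,3) and (5,5), only (6,1) and (4,2) survive: the other two are both slower and larger than (4,2).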
28
Can we do better?
Yes !!

Much better !!
transforming the specification
a different architecture
a different mapping
speculative execution
be creative ..

29
Transforming the specification (1)
Example: tree height reduction
Based on associativity of operations: a + (b + c) = (a + b) + c
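
Tree height reduction can be illustrated with a four-input sum. The sketch below assumes integer addition without overflow; floating-point addition is not associative, so there the rewrite can change the result. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Chain: ((a + b) + c) + d. Three dependent additions, tree height 3.
std::int64_t sumChain(std::int64_t a, std::int64_t b,
                      std::int64_t c, std::int64_t d) {
    return ((a + b) + c) + d;
}

// Balanced: (a + b) + (c + d). The two inner additions are independent,
// so with two adders the tree height drops to 2.
std::int64_t sumTree(std::int64_t a, std::int64_t b,
                     std::int64_t c, std::int64_t d) {
    return (a + b) + (c + d);
}

// Height of an n-input chain of two-input operations.
int chainDepth(int n) { return n - 1; }

// Height of a balanced tree over n inputs: ceil(log2(n)).
int treeDepth(int n) {
    int depth = 0;
    for (int width = 1; width < n; width *= 2) ++depth;
    return depth;
}
```

For 8 inputs the chain has height 7 and the balanced tree height 3, provided enough adders are available to run the independent additions in parallel.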
30
Transforming the specification (2)
Original:
d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y
Transformed:
r = f - e = 2b + d - (a + d) = 2b - a; x = z + y
31
Changing the architecture: adding more complex units

4-input adder: why is this faster?
32
Changing the architecture: adding more complex units

In the extreme case put everything into one unit!

Spatial mapping - no control flow
33
More complex control flow
Program part:
-a-
If cond Then -b- Else -c-
-d-
34
Mapping the CFG example: 3 options, what's the best?
Option 1      Option 2      Option 3
-a- br c      -a- br b      -a- br c
-b- jmp d     -c- jmp d     -b-
-c-           -b-           -d-
-d-           -d-           -c- jmp d
35
Why not remove the control flow?
36
If conversion shortens the schedule
Before:         After:
-a- br c        -a-
-b- jmp d       cond -b-
-c-             !cond -c-
-d-             -d-
Using guarded instructions like
r3: add r1,r2,r5
!r3: mul r4,r5,3
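
At source level, if-conversion corresponds roughly to computing both sides and selecting with the predicate. The sketch below is an illustration only: the contents of blocks -a- to -d- are invented, and whether a compiler emits guarded or conditional-move instructions for the select depends on the target.

```cpp
#include <cassert>

// Control-flow version: -a- br c; -b- jmp d; -c-; -d-
int withBranch(int r1, int r2, int r5) {
    const bool cond = r1 < r2;  // -a- computes the condition
    int r;
    if (cond) {
        r = r1 + r2;            // -b-
    } else {
        r = r5 * 3;             // -c-
    }
    return r + 1;               // -d-
}

// If-converted version: -b- and -c- are both executed, each conceptually
// guarded by the predicate, and the branch disappears.
int ifConverted(int r1, int r2, int r5) {
    const bool cond = r1 < r2;     // -a-
    const int b = r1 + r2;         // guarded by cond
    const int c = r5 * 3;          // guarded by !cond
    const int r = cond ? b : c;    // select instead of branch
    return r + 1;                  // -d-
}
```

The transformation is only safe when the guarded blocks have no side effects that must not happen on the untaken path, such as stores or faulting loads.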
37
Speculative execution makes it even shorter!
Before:         After:
-a- br c        -a-  -b-  -c-
-b- jmp d       -d-
-c-
-d-
Why not execute -d- in parallel as well?
38
However: real life is much more complex
E.g. MPEG-4 multimedia
Huge requirements: > 10 GOP/s, > 6 GB/s, > 10 MB storage
Software specification: more than 200 000 lines of C, hundreds of files, written by approx. 80 teams
39
Can we handle this?
Today's implementations: small images, decoding only, not real-time, several W, single task, limited dynamism
Wanted features: large images (HDTV), encoding and decoding, real-time, 100 mW (mobile), multiple tasks, dealing with dynamism
40
Lecture 1 Introduction

Trends
Platforms
Application mapping
Design flow
Summary

41
Embedded system design
How to map your application graph A(L,A,D) to hardware graph (L,N,C)
L: design level (e.g. architecture, implementation or realization level)
A: application components (e.g. tasks, operations, data structures)
D: dependences between application components
N: hardware components (e.g. processors, ASICs, FPGA, memories)
C: connections between hardware components
42
Abstraction levels
[Figure: level specifications and inter-level transformations. Level 0 Requirements (idea) is modeled by Level 1 Architecture, which is implemented by Level 2 Implementation, which compiles into Level 3 Realization (machine code, hardware modules). System specification languages shown: English; ES/RT-UML, Esterel, SDL; C, JAVA; C, VHDL, SystemC. An exploration search area is marked.]
43
Design space exploration
[Figure: a design point at level n-1 is mapped by the design transformation LT(n-1,n) onto an exploration search area at level n. Exploration at level n searches these areas of the realization space for the global optimum of the cost.]
44
Design space exploration framework: another Y-chart
45
Design flow steps and constraints
idea
high abstraction level
Refinement steps
Architecture / Platform constraints
Transformation
low abstraction level
realization
46
In which order should we perform the steps?
Decision trees
47
Well-known phase ordering examples

Concurrency versus Data management
e.g. loop partitioning versus array partitioning
for a multiprocessor
Scheduling versus Register allocation
Logic synthesis versus Placement and Routing

48
Rule of thumb!

Perform steps with biggest impact first
Biggest impact
depends on your interest ( cost function)
min. E, P, ED, D, Area, Npins, ...

49
Phase ordering example: Why fix data storage/transfer before concurrency management issues?
Recursive image processing algorithm on local neighborhoods
for (i = 0 .. I-1)
  for (j = 0 .. J-1)
    img[i][j] = f(img[i][j-k], old_img[i][j])
50
Why fix data storage/transfer before concurrency management issues?

Unrolling outer loop (i) M times
needed: M J-word FIFOs (image lines)
M data paths

51
Why fix data storage/transfer before concurrency management issues?
Unrolling inner loop (j) (limited by k): M - 1 buffer registers
for (i = 0 .. I-1)
  for (j = 0 .. (J div 2)-1)
    img[i][2j-1] = f(img[i][2j-k-1], old_img[i][2j-1])
    img[i][2j] = f(img[i][2j-k], old_img[i][2j])
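
Unrolling the inner loop by two can be sketched on a single image line. The body of `f`, the boundary handling (starting at j = k) and the function names are assumptions made for the example, not taken from the slides.

```cpp
#include <cassert>
#include <vector>

using Row = std::vector<int>;

// Hypothetical neighborhood function; the slides do not define f.
int f(int left, int old) { return left / 2 + old; }

// Recursive filter along one image line: img[j] depends on img[j - k].
void rowOriginal(Row& img, const Row& old_img, int k) {
    const int J = static_cast<int>(img.size());
    for (int j = k; j < J; ++j)
        img[j] = f(img[j - k], old_img[j]);
}

// Inner loop unrolled by 2. For k >= 2 the two statements in the body
// do not depend on each other and could run on two data paths.
void rowUnrolled2(Row& img, const Row& old_img, int k) {
    const int J = static_cast<int>(img.size());
    int j = k;
    for (; j + 1 < J; j += 2) {
        img[j]     = f(img[j - k],     old_img[j]);
        img[j + 1] = f(img[j + 1 - k], old_img[j + 1]);
    }
    if (j < J) img[j] = f(img[j - k], old_img[j]);  // leftover iteration
}
```

The unroll factor is bounded by the dependence distance k: with a factor larger than k, a statement in the unrolled body would need a value produced earlier in the same iteration.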
52
Proposed System Design Methodology
System Specification
-> System-Level Exploration and refinement
-> Optimized algorithms (C/C++ specification)
-> SW/HW Partitioning/Exploration
-> Traditional (parallelizing) Compiler Steps -> Code per (parallel) proc.
-> Architecture HW Synthesis Steps -> Structural VHDL Code
53
Design flow
54
Remove OO overhead
55
Object-based versus Object-oriented
Object-oriented
calls through function pointer
cannot be inlined

Object-based
direct calls
can be inlined

=> OO is good for specification, not for implementation
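
The difference between the two styles can be sketched in C++ as follows. The class names are invented for the example, and modern compilers can sometimes devirtualize the first form when the concrete type is visible.

```cpp
#include <cassert>

// Object-oriented: the call goes through the virtual table (a function
// pointer), so in general the compiler cannot inline it.
struct Filter {
    virtual ~Filter() = default;
    virtual int apply(int x) const = 0;
};

struct Doubler : Filter {
    int apply(int x) const override { return 2 * x; }
};

int runVirtual(const Filter& f, int x) { return f.apply(x); }

// Object-based: the concrete type is known at compile time, so the call
// is direct and can be inlined.
struct DoublerDirect {
    int apply(int x) const { return 2 * x; }
};

template <typename F>
int runDirect(const F& f, int x) {
    return f.apply(x);
}
```

Both variants compute the same result; the object-based one trades run-time polymorphism for calls the optimizer can see through.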
56
Whole-system optimization techniques

Aggressive use of traditional inter-procedural
techniques
in the embedded world you often know the whole
application !
OO specific optimization
Data allocation optimization

57
Example: data inlining

Eliminate
dynamic allocation
pointer de-reference
polymorphic calls

class A {
  B *b;
  A() { b = new C; }
  ~A() { delete b; }
  void f() { b->g(); }
};
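
A possible result of data inlining is sketched below, assuming whole-program analysis has shown that `b` always points to a `C`. The bodies of `g()` and the class names `APointer` and `AInlined` are made up for the illustration.

```cpp
#include <cassert>

struct B {
    virtual ~B() = default;
    virtual int g() const { return 1; }
};

struct C : B {
    int g() const override { return 2; }
};

// Before: heap allocation, pointer dereference and a polymorphic call.
class APointer {
    B* b;
public:
    APointer() : b(new C) {}
    ~APointer() { delete b; }
    APointer(const APointer&) = delete;
    APointer& operator=(const APointer&) = delete;
    int f() const { return b->g(); }
};

// After: the C object is stored inside A. No allocation and no pointer;
// the dynamic type of c is known, so the call can be bound statically.
class AInlined {
    C c;
public:
    int f() const { return c.g(); }
};
```

The rewrite is only valid if no other subclass of `B` is ever assigned to the member, which is why it relies on knowing the whole application.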
58
Example: dynamic allocation removal

Eliminate dynamic allocation
Re-use stack memory already needed for other call tree branches

Before:
void teq(..., short size, ...) {
  float *Ryy;
  Ryy = new float[size];
  /* teq computation */
  delete[] Ryy;
}
teq(..., 64, ...); teq(..., 64, ...);

After:
void teq(..., ...) {
  float Ryy[64];
  /* teq computation */
}
teq(..., ...); teq(..., ...);
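
A compilable version of the same idea is sketched below. The computation inside the function is a placeholder (a sum of squares), and `teqDynamic`, `teqStatic` and `kSize` are illustrative names; the only point is that the buffer size is identical at every call site and can therefore become a fixed stack array.

```cpp
#include <cassert>

// Before: the buffer size is a parameter, so the buffer lives on the heap.
float teqDynamic(const float* in, short size) {
    float* Ryy = new float[size];
    float acc = 0.0f;
    for (short i = 0; i < size; ++i) {
        Ryy[i] = in[i] * in[i];  // placeholder computation
        acc += Ryy[i];
    }
    delete[] Ryy;
    return acc;
}

// After: every caller passes 64, so the size becomes a constant and the
// buffer a stack array.
constexpr int kSize = 64;

float teqStatic(const float* in) {
    float Ryy[kSize];
    float acc = 0.0f;
    for (int i = 0; i < kSize; ++i) {
        Ryy[i] = in[i] * in[i];  // same placeholder computation
        acc += Ryy[i];
    }
    return acc;
}
```

Besides removing allocator calls, the fixed size makes the worst-case stack usage of the call tree analyzable at compile time.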
59
ADSL result: footprint -33%
Total memory footprint (code + data):
Unoptimized: 106
ARM C optimized (-O2 -Ospace): 100
Inlining, dead code, constant prop.: 83
Virtual call elimination: 82
Data alloc. optim.: 67
[Figure: bar chart with axis marks at 200kB and 400kB]
60
Dynamic Memory Management

Data type refinement
Virtual memory management

61
Data type refinement
ATM_cell *Data_In;
Association_Table *Routing_Table;
Routing_Table = new Association_Table();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
62
Data type refinement: Array
ATM_cell *Data_In;
Array *Routing_Table;
Routing_Table = new Array();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
63
Data type refinement: Linked List
ATM_cell *Data_In;
Linked_List *Routing_Table;
Routing_Table = new Linked_List();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
64
Data type refinement: Binary Tree
ATM_cell *Data_In;
Binary_Tree *Routing_Table;
Routing_Table = new Binary_Tree();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
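
The refinement keeps the `Lookup` interface fixed while the underlying data structure changes. A minimal sketch with two of the alternatives follows; the key field `vpi_vci`, the `Insert` method and the class names are assumptions, and `std::map` stands in for the binary tree (it is typically implemented as a balanced tree).

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct ATM_cell {
    int vpi_vci;  // hypothetical routing key
};

// Alternative 1: unsorted array, linear search, minimal storage overhead.
class ArrayTable {
    std::vector<std::pair<int, int>> entries_;
public:
    void Insert(int key, int port) { entries_.push_back({key, port}); }
    bool Lookup(const ATM_cell& cell) const {
        for (const auto& e : entries_)
            if (e.first == cell.vpi_vci) return true;
        return false;
    }
};

// Alternative 2: ordered tree, logarithmic search, extra pointers per node.
class TreeTable {
    std::map<int, int> entries_;
public:
    void Insert(int key, int port) { entries_[key] = port; }
    bool Lookup(const ATM_cell& cell) const {
        return entries_.count(cell.vpi_vci) != 0;
    }
};

// Client code is written once against the Lookup interface.
template <typename Table>
bool route(const Table& table, const ATM_cell& cell) {
    return table.Lookup(cell);
}
```

Which alternative wins depends on table size and access pattern, which is why the choice is treated as an exploration step and not fixed in the specification.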
65
Task Concurrency Management
Going from specification concurrency to
implementation concurrency
66
Modelling MTG
67
TCM transformations

Why transformations?
shift existing Pareto curves
create new points on the Pareto curves
improve available task level parallelism

68
TCM Transformations
[Figure: tasks T1 to T6 and hardware block HW1 scheduled on processors P1 and P2 with a shared memory area, under an MA cycle budget. Alternatives shown: tasks freely assigned to 2 processors; task order constrained (partial order constraints) to reduce memory requirements, giving less memory; independent, dynamic tasks assigned to 1 processor. A conflict is marked.]
69
Static Memory Management
DTSE data transfer and storage exploration
70
Static data memory management (DMM)
3 Exploit memory hierarchy
6 Exploit limited life-time and data layout freedom
5 Meet real-time constraints
[Figure: memory hierarchy on chip, from the processor data paths through local latches and banks 1..N, L1 cache, L2 cache and cache bank recombine, to off-chip SDRAM]
71
DMM: how to improve locality?
FOR i=1 TO N DO B[i] = f(A[i])
FOR i=1 TO N DO C[i] = g(B[i])

FOR i=1 TO N DO
  B[i] = f(A[i])
  C[i] = g(B[i])
[Figure: the same memory hierarchy, from processor data paths to off-chip SDRAM]
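
The locality gain of merging the two loops can be sketched as below. The functions `f` and `g` are placeholders, and the fused version additionally assumes that B is not needed after the loop, so that the intermediate value can stay in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int f(int a) { return a + 1; }  // placeholder
int g(int b) { return 2 * b; }  // placeholder

// Two separate loops: all of B is written to memory, then read back.
std::vector<int> separateLoops(const std::vector<int>& A) {
    std::vector<int> B(A.size()), C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) B[i] = f(A[i]);
    for (std::size_t i = 0; i < A.size(); ++i) C[i] = g(B[i]);
    return C;
}

// Fused loop: each B value is consumed right after it is produced,
// so it never has to travel through the memory hierarchy.
std::vector<int> fusedLoop(const std::vector<int>& A) {
    std::vector<int> C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        const int b = f(A[i]);  // former B[i]
        C[i] = g(b);
    }
    return C;
}
```

For large N the separate version streams B through the caches twice; the fused version needs no storage for B at all under the stated assumption.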
72
Exploiting Memory Hierarchy
[Figure: processor data paths and register file backed by three memory levels M'', with A = 100 at P = 0.01, A = 10 at P = 0.1 and A = 1 at P = 1]
P (before) = 100
P (after) = 100×0.01 + 10×0.1 + 1×1 = 3
73
How to Avoid N-port Memories?
74
Address Optimization
75
Algebraic Transformations and Aggressive Code Hoisting for Expression Elimination
Initial
for (y = 0..9; y++)
  for (x = 0..99; x++)
    if (x > 1) A[(y+3)%3][(x-2)%3] ...
    if (x > 4) ... A[(y+3)%3][(x-5)%3]
76
Modulo substitution for piece-wise linear addressing
Optimised-1st
for (y = 0..9; y++)
  v_y = (y+3)%3
  for (x = 0..99; x++)
    v_yx = (x-2)%3 + v_y
    if (x > 1) A[v_yx]
    if (x > 4) A[v_yx]
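
The hoisting step can be sketched in C++ under the assumption that the index expressions are modulo-3 address computations on a 3 x 3 buffer; this reading of the slide, the summation and all names are assumptions made for the example. The y-dependent index is computed once per outer iteration, and because (x-5) and (x-2) differ by 3, both accesses share one x-dependent index.

```cpp
#include <array>
#include <cassert>

using Grid = std::array<std::array<int, 3>, 3>;

// Initial form: every access recomputes both modulo expressions.
int sumInitial(const Grid& A) {
    int sum = 0;
    for (int y = 0; y <= 9; ++y)
        for (int x = 0; x <= 99; ++x) {
            if (x > 1) sum += A[(y + 3) % 3][(x - 2) % 3];
            if (x > 4) sum += A[(y + 3) % 3][(x - 5) % 3];
        }
    return sum;
}

// After hoisting and expression elimination: one modulo per outer
// iteration for y and one per inner iteration for x.
int sumHoisted(const Grid& A) {
    int sum = 0;
    for (int y = 0; y <= 9; ++y) {
        const std::array<int, 3>& row = A[(y + 3) % 3];  // hoisted out of x loop
        for (int x = 0; x <= 99; ++x) {
            if (x > 1) {
                const int v = row[(x - 2) % 3];  // equals row[(x - 5) % 3] for x > 4
                sum += v;
                if (x > 4) sum += v;
            }
        }
    }
    return sum;
}
```

Both guards keep the operands non-negative, so the C++ remainder operator behaves like a mathematical modulo here.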
77
What do we gain? Running example: cavity detection

Application domain
Computer Tomography in medical imaging
Algorithm
Cavity detection in CT-scans
Detect dark regions in successive images
Indicate cavity in brain

78
Starting point
Max Value
Compute Edges
Gauss Blur x
Reverse
Detect Roots
Gauss Blur y

Reference (conceptual) C code for the algorithm
all functions: image_in[N x M][t-1] -> image_out[N x M][t]
new value of pixel depends on its neighbors
neighbor pixels read from background memory
approximately 110 lines of C code (ignoring file I/O etc)
experiments with N x M = 640 x 400 pixels
straightforward implementation: 6 image buffers

79
Cavity Detector Results
80
Lecture 1 Introduction

Trends
Platforms
Application mapping
Design flow
Summary

81
Summary

Billions of Embedded systems, everywhere!!!
Multi-media applications become extremely complex
and dynamic
Time-to-Market pressure
Solution
Platforms as design target (raise abstraction
level)
Advanced emb. system design flow needed

82
Traditional Design Methodology
System Specification
-> SW/HW Partitioning/Exploration
-> HW System Exploration -> Optimized HW spec (VHDL specification) -> Architecture HW Synthesis Steps -> Structural VHDL Code
-> (SW System Exploration) -> Optimized SW spec (C specification) -> Traditional (parallelizing) Compiler Steps -> Code per (parallel) proc.
83
Proposed System Design Methodology
System Specification
-> System-Level Exploration and refinement (our main focus)
-> Optimized algorithms (C/C++ specification)
-> SW/HW Partitioning/Exploration
-> Traditional (parallelizing) Compiler Steps -> Code per (parallel) proc.
-> Architecture HW Synthesis Steps -> Structural VHDL Code
