Embedded Systems in Silicon TD5102 Data Management (1) Overview

Title: Design Technology for future Multi-Media Systems. Author: Henk Corporaal. Last modified by: Medewerker. Created: 2/25/2002 12:06:54 PM.

Slides: 55

Provided by: henkcor1

1
Embedded Systems in Silicon TD5102
Data Management (1) Overview
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2
Data Management Overview

Motivation
Example application
Data Management (DM) steps
Results

Important note
We consider here only statically declared data structures
DM is also called
DTSE (Data Transfer and Storage Exploration), or
Physical Memory Management

3
Design flow
4
The underlying idea
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
5
Platform architecture model
[Figure: platform architecture model with four memory levels. On chip: CPUs with ICache and DCache, HW accelerators with local memories, on-chip busses, bus interface and bridge. Off chip: L2 cache, main memory, and disks on a SCSI bus.]
6
Platform example TriMedia
7
Data transfer and storage power
8
Positioning in the Y-chart
[Figure: Y-chart. Applications and an Architecture Instance are combined by Mapping; Performance Analysis then yields Performance Numbers.]
9
Current practice: Mapping, easy, but...
Idea

Given
reference C code for application, e.g. MPEG-4 Motion Estimation
platform SUPERDUPER-LX50
Task
map application on architecture
But wait a moment
me@work> CC -o2 mpeg4_me mpeg4_me.c
Thank you for running SUPERDUPER-LX50 compiler.
Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles

ab5d for (...) ..
10
Let's help the compiler ... DTSE: data transfer and storage exploration

DTSE is a methodology to explore data-transfer and data-storage in multi-media applications
Transforms C-code of the application
By focusing on multi-dimensional signals (arrays)
To better exploit platform capabilities
This overview covers the major steps to improve the power, area, performance trade-off

11
Data Management principles
Off-chip SDRAM
Exploit limited life-time
12
DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse and memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out
13
The DM steps

Preprocessing
  Rewrite code in 3 layers (parts)
  Selective inlining, Single Assignment form, ....
Data flow transformations
  Eliminate redundant transfers and storage
Loop and control flow transformations
  Improve regularity of accesses and data locality
Data re-use and memory hierarchy layer assignment
  Determine when to move which data between memories to meet the cycle budget of the application with low cost
  Determine in which layer to put the arrays (and copies)

14
The DM steps

Per memory layer
Cycle budget distribution
  determine memory access constraints for given cycle budget
Memory allocation and assignment
  which memories to use, and where to put the arrays
Data layout
  determine how to combine and put arrays into memories
Address optimization on the final C-code

15
Application example

Application domain
Computer Tomography in medical imaging
Algorithm
Cavity detection in CT-scans
Detect dark regions in successive images
Indicate cavity in brain

=> Bad news for owner of brain
16
Data enters Cavity Detector row-wise
[Figure: a serial scan delivers image_in row by row into a buffer that feeds the GaussBlur loop of the Cavity Detector.]
17
Application
[Figure: processing stages of the application: Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots, Max Value.]

Reference (conceptual) C code for the algorithm
all functions: image_in[N x M](t-1) -> image_out[N x M](t)
new value of pixel depends on its neighbors
neighbor pixels read from background memory
approximately 110 lines of C code (ignoring file I/O etc)
experiments with N x M = 640 x 400 pixels
straightforward implementation: 6 image buffers

18
Preprocessing: Dividing an application in the 3 layers
LAYER 1 (Module1a, Module1b, Module2, Module3)
- testbench call
- dynamic event behaviour
- mode selection
Synchronisation
LAYER 2
for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    if (i == 0)
      B[i][j] = 1;
    else
      B[i][j] = func1(A[i][j], A[i-1][j]);
LAYER 3
int func1(int a, int b)
{
  return a*b;
}

19
Layered code structure
main() { /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}
void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

20
Layered code structure
void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
  /* Makes code for data access */
  /* and data transfer explicit */
}
int foo(int arg1) { /* Layer 3 */
  /* arithmetic, data-dependent operations to be mapped to data-path, controller */
}
21
Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    gauss_x_image[x][y] = 0;
for (x=1; x<=N-2; x++)
  for (y=1; y<=M-2; y++) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; k++)
      gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }

# accesses: N * M + (N-2) * (M-2)
22
Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else
      gauss_x_image[x][y] = 0;
# accesses: N * M, gain is about 50%
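The access counts quoted on these two slides can be checked with a small sketch. The image size, the kernel weights in `Gauss` and the body of `foo` are illustrative assumptions, not values from the slides; the function names are hypothetical. Each function returns the number of writes to `gauss_x_image`.

```c
#include <stdlib.h>

enum { N = 8, M = 6 };                 /* illustrative image size (assumption) */
static const int Gauss[2] = { 2, 1 };  /* illustrative kernel weights (assumption) */

static int foo(int v) { return v / 4; } /* stand-in for the Layer 3 function */

/* Before: initialise the whole image, then overwrite the interior. */
long blur_two_pass(int image_in[N][M], int gauss_x_image[N][M]) {
    long writes = 0;
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++) {
            gauss_x_image[x][y] = 0;
            writes++;
        }
    for (int x = 1; x <= N - 2; x++)
        for (int y = 1; y <= M - 2; y++) {
            int gauss_x_tmp = 0;
            for (int k = -1; k <= 1; k++)
                gauss_x_tmp += image_in[x + k][y] * Gauss[abs(k)];
            gauss_x_image[x][y] = foo(gauss_x_tmp);
            writes++;
        }
    return writes;
}

/* After: every element is written exactly once, border or interior. */
long blur_one_pass(int image_in[N][M], int gauss_x_image[N][M]) {
    long writes = 0;
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++) {
            if (x >= 1 && x <= N - 2 && y >= 1 && y <= M - 2) {
                int gauss_x_tmp = 0;
                for (int k = -1; k <= 1; k++)
                    gauss_x_tmp += image_in[x + k][y] * Gauss[abs(k)];
                gauss_x_image[x][y] = foo(gauss_x_tmp);
            } else {
                gauss_x_image[x][y] = 0;
            }
            writes++;
        }
    return writes;
}
```

Under these assumptions the first version performs N*M + (N-2)*(M-2) writes and the second N*M, while both leave the same contents in `gauss_x_image`.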
23
Data-flow transformation

In total 5 types of data-flow transformations
advanced signal substitution and (copy)
propagation
algebraic transformations (associativity, etc.)
shifting delay lines
re-computation
transformations to eliminate bottlenecks for
subsequent loop transformations

24
Loop transformations

Loop transformations
improve regularity of accesses
improve temporal locality: production <-> consumption
Expected influence
reduce temporary storage and (anticipated)
background storage

25
Global loop transformation steps applied to cavity detection

Removal of data-flow bottleneck
  allows merging of loops
  done in global data-flow trafo step
Make all loop dimensions equal
Regularize loop traversal: Y and X loop interchange
  follow order of input stream
Y loop folding and global merging
X loop folding and global merging
  full, global scope regularity
  nearly complete locality for main signals

26
Loop trafo - cavity detection
[Figure: the scanner delivers the N x M image along X and Y.]
From double buffer to single buffer
27
Loop interchange (Y <-> X)
Before:
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    /* filtering code */
After:
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* filtering code */

Single assignment => always possible
For all loops, to maintain regularity

28
Loop trafo - cavity detection
[Figure: repeated fold and loop merge of Gauss Blur x, Gauss Blur y and Compute Edges, using offset arrays of 2GB+1 and 3 lines.]
From N x M to N x (2GB+1) buffer size
From N x M to N x (3) buffer size
29
Improve regularity and locality => Loop Merging
Before:
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 1st filtering code */
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 2nd filtering code */
After:
for (y=0; y<M; y++) {
  for (x=0; x<N; x++)
    /* 1st filtering code */
  for (x=0; x<N; x++)
    /* 2nd filtering code */
}

!! Impossible due to dependencies!

30
Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    gauss_x_image[x][y] = ...
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    for (k=-GB; k<=GB; k++)
      ... = gauss_x_image[x][y+k] ...
31
Enable merging with Loop Folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    gauss_x_image[x][y] = ...
for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    /* y-GB */
    for (k=-GB; k<=GB; k++)
      ... = gauss_x_image[x][y+k-GB] ...
32
Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      gauss_x_image[x][y] = ...
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        for (k=-GB; k<=GB; k++)
          ... = gauss_x_image[x][y-GB+k] ...
      else
        ...
}
33
Simplify conditions in merged loop nest
for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      gauss_x_image[x][y] = ...
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      for (k=-GB; k<=GB; k++)
        ... = gauss_x_image[x][y-GB+k] ...
    else if (y>=GB)
      ...
}
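The folding and merging steps above can be tried out on a reduced example. This is a minimal sketch, not the cavity detection code: `produce` is a hypothetical stand-in for the first filter, the consumer simply sums 2*GB+1 neighbouring rows, and the sizes are arbitrary. The consumer is shifted by GB iterations of y so that every row it reads has already been produced inside the same y loop.

```c
enum { N = 4, M = 7, GB = 1 };   /* illustrative sizes (assumption) */

static int produce(int v) { return 2 * v + 1; }  /* hypothetical 1st filter */

/* Two separate loop nests: producer first, consumer second. */
void separate_nests(int in[N][M], int out[N][M]) {
    int g[N][M];
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            g[x][y] = produce(in[x][y]);
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            out[x][y] = 0;
            if (y >= GB && y <= M - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    out[x][y] += g[x][y + k];
        }
}

/* One y loop: the consumer is folded (delayed) by GB rows, then merged. */
void folded_merged_nest(int in[N][M], int out[N][M]) {
    int g[N][M];
    for (int y = 0; y < M + GB; y++) {
        if (y < M)                          /* producer handles row y */
            for (int x = 0; x < N; x++)
                g[x][y] = produce(in[x][y]);
        if (y >= GB) {                      /* consumer handles row y-GB */
            int yc = y - GB;
            for (int x = 0; x < N; x++) {
                out[x][yc] = 0;
                if (yc >= GB && yc <= M - 1 - GB)
                    for (int k = -GB; k <= GB; k++)
                        out[x][yc + 0] += g[x][yc + k];
            }
        }
    }
}
```

In the merged version the consumer at iteration y only touches rows y-2*GB to y of `g`, which is what later allows the full N x M intermediate array to shrink to N x (2GB+1); the sketch keeps the full array for clarity.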
34
Global loop merging/folding steps

1. x <-> y loop interchange (done)
2. Global y-loop folding/merging 1st and 2nd nest (done)
3. Global y-loop folding/merging 1st/2nd and 3rd nest
4. Global y-loop folding/merging 1st/2nd/3rd and 4th nest
5. Global x-loop folding/merging 1st and 2nd nest
6. Global x-loop folding/merging 1st/2nd and 3rd nest
7. Global x-loop folding/merging 1st/2nd/3rd and 4th nest

35
End result of global loop trafo
for (y=0; y<M+GB+2; y++)
  for (x=0; x<N+2; x++) {
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; k++)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1] / tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
  }
36
Data re-use memory hierarchy
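[Figure: array A is accessed 100 times by the processor data paths. With intermediate memory layers down to the register file, the number of accesses per layer becomes 100, 10 and 1.]
P (original) = # accesses x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3

Introduce memory hierarchy
reduce number of reads from main memory
heavily accessed arrays stored in smaller memories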
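The power estimate on this slide is a weighted sum: for each memory layer, the number of accesses times the relative power per access. A minimal sketch of that calculation follows; the function name and array-based interface are assumptions made for illustration.

```c
/* Relative power of a memory hierarchy:
 * sum over layers of (number of accesses) x (relative power per access). */
double hierarchy_power(const double *accesses,
                       const double *power_per_access,
                       int layers) {
    double total = 0.0;
    for (int i = 0; i < layers; i++)
        total += accesses[i] * power_per_access[i];
    return total;
}
```

With a single layer of 100 accesses at unit cost this gives the slide's original figure of 100. With three layers of 100, 10 and 1 accesses at relative costs 0.01, 0.1 and 1 it gives 3.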

37
Data re-use

Data flow transformations to introduce extra copies of heavily accessed signals
Step 1: figure out data re-use possibilities
Step 2: calculate possible gain
Step 3: decide on data assignment to memory hierarchy

39
Data re-use tree
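[Figure: data re-use tree for image_in, gauss_x, gauss_xy/comp_edge and image_out, from background memory down to the CPU. Nodes are candidate copies with sizes such as N*M, M*3, 3*3, 3*1, 1*3 and 1*1; edges carry transfer counts such as N*M, N*M*3 and N*M*8.]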
40
Memory hierarchy assignment
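[Figure: memory hierarchy assignment of image_in, gauss_x, gauss_xy, comp_edge and image_out to three layers: 1MB SDRAM (N*M arrays), 16KB cache (M*3 line buffers) and 128 B register file (1*1, 3*1 and 3*3 windows), with transfer counts N*M, N*M*3 and N*M*8 between layers.]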
41
Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; y++)
  for (x=0; x<N+2; x++) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y] = 0;
    /* Other merged code omitted */
  }
42
Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; y++)
  for (x=0; x<N+2; x++) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<=1; k++)
        in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
    /* copy rest of in_pixel's in row */
    if (x>0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += in_pixels[(x+k)%3][y%1] * Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
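The effect of the reuse transformation can be measured on a one-row version of the blur. This is a minimal sketch under assumptions: the row length, the kernel weights and the helper `read_main` (which counts reads from the large array) are hypothetical, and the Layer 3 function is left out. The second function keeps a 3-pixel copy addressed modulo 3, as `in_pixels` does in the slide code.

```c
#include <stdlib.h>

enum { N = 10 };                        /* illustrative row length (assumption) */
static const int Gauss[2] = { 2, 1 };   /* illustrative kernel weights (assumption) */

static long main_reads;                 /* reads from the large background array */

static int read_main(const int *image_in, int i) {
    main_reads++;
    return image_in[i];
}

/* Before: every kernel tap is fetched from the large array. */
void blur_direct(const int *image_in, int *out) {
    for (int x = 1; x <= N - 2; x++) {
        int tmp = 0;
        for (int k = -1; k <= 1; k++)
            tmp += read_main(image_in, x + k) * Gauss[abs(k)];
        out[x] = tmp;
    }
}

/* After: each pixel is fetched once into a 3-entry copy and reused. */
void blur_reuse(const int *image_in, int *out) {
    int in_pixels[3];
    for (int x = 0; x <= N - 2; x++) {
        if (x == 0)                     /* first two pixels of the row */
            for (int k = 0; k <= 1; k++)
                in_pixels[(x + k) % 3] = read_main(image_in, x + k);
        if (x > 0)                      /* one new pixel per iteration */
            in_pixels[(x + 1) % 3] = read_main(image_in, x + 1);
        if (x >= 1) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++)
                tmp += in_pixels[(x + k) % 3] * Gauss[abs(k)];
            out[x] = tmp;
        }
    }
}
```

Under these assumptions the direct version reads the large array 3*(N-2) times and the reuse version N times, with identical results. The price is the modulo arithmetic in the indices, which the address optimization step later has to clean up.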
43
Data layout optimization

At this point multi-dimensional arrays are to be assigned to physical memories
Data layout optimization determines exactly where in each memory an array should be placed, to
  reduce memory size by in-placing arrays that do not overlap in time (disjoint lifetimes)
  avoid cache misses due to conflicts
  exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences

44
In-place mapping
[Figure: inter in-place and intra in-place mapping, drawn as occupied addresses versus time.]
45
In-place mapping

Implements all the anticipated memory size
savings obtained in previous steps
Modifies code to introduce one array per real
memory
Changes indices to addresses in mem. arrays

b8 A[100][100]; b6 B[20][20];
for (i,j,k,l) B[i][j] = f(B[j][i], A[i+k][j+l]);
46
In-place mapping

Input image is partly consumed by the time first
results for output image are ready

[Figure: two plots of index versus time, one for Image_in and one for Image_out, showing that their live ranges barely overlap.]
47
In-place - cavity detection code
Before:
for (y=0; y<M+3; y++)
  for (x=0; x<N+5; x++) {
    image_out[x-5][y-3] = ... /* code removed */ ... image_in[x+1][y];
  }
After:
for (y=0; y<M+3; y++)
  for (x=0; x<N+5; x++) {
    image[x-5][y-3] = ... /* code removed */ ... image[x+1][y];
  }
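The same idea can be shown on a one-dimensional array. This is a minimal sketch with a hypothetical 3-tap sum as the computation: the result for position p is written to address p-1, whose input value is no longer needed once the result for p is computed, so input and output can share one array.

```c
enum { N = 8 };   /* illustrative array length (assumption) */

/* Reference: separate input and output arrays, 2*N words of storage. */
void smooth_two_arrays(const int *image_in, int *image_out) {
    for (int p = 1; p <= N - 2; p++)
        image_out[p] = image_in[p - 1] + image_in[p] + image_in[p + 1];
}

/* In-place: one array of N words. The result for position p overwrites
 * address p-1; that input was last read in this same iteration. */
void smooth_in_place(int *image) {
    for (int p = 1; p <= N - 2; p++)
        image[p - 1] = image[p - 1] + image[p] + image[p + 1];
}
```

After `smooth_in_place`, `image[p-1]` holds what `image_out[p]` holds in the two-array version. The index offset plays the same role as the `x-5` and `y-3` offsets in the slide code.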
48
The last step: ADOPT (Address OPTimization)

Increased execution time introduced by DTSE
  Complicated address arithmetic (modulo!)
  Additional complex control flow
Multimedia platform not adapted to address calculations
Additional transformations needed to
  Simplify control flow
  Simplify address arithmetic: common sub-expression elimination, modulo expansion, ...
  Match remaining expressions on target machine

49
ADOPT principles
Behavioral description
Extract address expr. code, perform addr. expr. splitting
Apply transformations: loop invariant code motion, induction variable analysis, algebraic transformations
Processor specific algebraic transformations
Optimized behavioral descr. for target processor
Optimized behavioral descr.
Compile to target processor
Map to custom ACU
50
ADOPT principles
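Example: Full-search Motion Estimation
for (i=-8; i<8; i++)
  for (j=-4; j<=3; j++)
    for (k=-4; k<=3; k++) {
      ... A[((208+i)*257+8+j)*257+16+i+k] ... B[(8+j)*257+16+i+k] ...
      dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
    }
After address optimization:
cse1 = (33025*i+6869616)*2;
cse3 = 1040+i;
cse4 = j*257+1032;
cse5 = k+cse4;
... A[cse5+cse1] ... B[cse5+cse3] ...
dist += A[3096] - B[cse1];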
Algebraic transformations at word-level
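The common sub-expressions in the example can be checked against the original index expressions by brute force. This is a minimal sketch: the function names are hypothetical, the operators in the index expressions are a reconstruction, and the loop ranges are assumed. It also shows loop invariant code motion, since `cse1` and `cse3` depend only on i and `cse4` only on j.

```c
/* Original index expressions (operators reconstructed, an assumption). */
static long addr_a(long i, long j, long k) {
    return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
}
static long addr_b(long i, long j, long k) {
    return (8 + j) * 257 + 16 + i + k;
}
static long addr_b_ref(long i) {
    return ((208 + i) * 257 + 4) * 257 + 16 + i - 4;
}

/* Returns how many index values differ between the original expressions
 * and the common sub-expression form over the assumed loop ranges. */
int count_mismatches(void) {
    int bad = 0;
    for (long i = -8; i < 8; i++) {
        long cse1 = (33025 * i + 6869616) * 2;   /* invariant in j and k */
        long cse3 = 1040 + i;                    /* invariant in j and k */
        for (long j = -4; j <= 3; j++) {
            long cse4 = j * 257 + 1032;          /* invariant in k */
            for (long k = -4; k <= 3; k++) {
                long cse5 = k + cse4;
                if (cse5 + cse1 != addr_a(i, j, k)) bad++;
                if (cse5 + cse3 != addr_b(i, j, k)) bad++;
                if (cse1 != addr_b_ref(i)) bad++;
            }
        }
    }
    return bad;
}
```

The inner loop is left with one addition per index instead of several multiplications, which is the kind of saving this step is after.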
51
DMM results for cavity detection on ASIC
52
Cavity detection on Pentium-MMX
Main Memory Accesses
Local Memory Accesses
Execution Time (sec)
53
The Y-chart revisited
[Figure: Y-chart. Applications and an Architecture Instance are combined by Mapping; Performance Analysis then yields Performance Numbers.]
54
Fixing platform parameters

Assume configurable on-chip memory hierarchy
Trade-off power versus cycle-budget

[Figure: power (mW, axis from 5 to 25) versus storage cycle budget (axis from 50,000 to 150,000).]
55
Conclusion

In multi-media applications, exploring data transfer and storage issues should be done at system level
DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting
  Platform independent high-level transformations
  Platform dependent transformations exploit platform characteristics (optimal use of cache, ...)
Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...