Title: Embedded Systems in Silicon TD5102 Data Management (1) Overview
1 Embedded Systems in Silicon TD5102: Data Management (1) Overview
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore, 2005/2006
2 Data Management Overview
- Motivation
- Example application
- Data Management (DM) steps
- Results
- Important note
  - We consider here statically declared data structures only
- DM is also called
  - DTSE (Data Transfer and Storage Exploration), or
  - Physical Memory Management
3 Design flow
4 The underlying idea
    for (i=0; i<n; i++)
      for (j=0; j<3; j++)
        for (k=1; k<7; k++)
          B[j] = A[i*4+k];
5 Platform architecture model
[Figure: platform architecture with memory levels 1 to 4. Labels: chip with CPUs, HW accelerator, ICache, DCache, local memories, on-chip busses and bus interface; L2 cache; main memory; bridge; SCSI bus with disks.]
6 Platform example: TriMedia
7 Data transfer and storage power
8 Positioning in the Y-chart
[Figure: Y-chart. Labels: Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers.]
9 Current practice: mapping, easy, but ...
- Given
  - reference C code for the application, e.g. MPEG-4 Motion Estimation
  - platform: SUPERDUPER-LX50
- Task
  - map application on architecture
- But wait a moment
    me@work> CC -o2 mpeg4_me mpeg4_me.c
    Thank you for running SUPERDUPER-LX50 compiler.
    Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles
10 Let's help the compiler ... DTSE: data transfer and storage exploration
- DTSE is a methodology to explore data transfer and data storage in multimedia applications
- Transforms C code of the application
  - By focusing on multi-dimensional signals (arrays)
  - To better exploit platform capabilities
- This overview covers the major steps to improve the power, area and performance trade-off
11 Data Management principles
[Figure labels: Off-chip SDRAM; Exploit limited lifetime]
12 DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse & memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out
13 The DM steps
- Preprocessing
  - Rewrite code in 3 layers (parts)
  - Selective inlining, single-assignment form, ...
- Data-flow transformations
  - Eliminate redundant transfers and storage
- Loop and control-flow transformations
  - Improve regularity of accesses and data locality
- Data re-use and memory hierarchy layer assignment
  - Determine when to move which data between memories to meet the cycle budget of the application at low cost
  - Determine in which layer to put the arrays (and copies)
14 The DM steps
- Per memory layer
  - Cycle budget distribution
    - determine memory access constraints for the given cycle budget
  - Memory allocation and assignment
    - which memories to use, and where to put the arrays
  - Data layout
    - determine how to combine and put arrays into memories
- Address optimization on the final C code
15 Application example
- Application domain
  - Computer Tomography in medical imaging
- Algorithm
  - Cavity detection in CT scans
  - Detect dark regions in successive images
  - Indicate cavity in brain
    -> Bad news for owner of brain
16 Data enters Cavity Detector row-wise
[Figure: serial scan feeding the image_in buffer, which is read by the GaussBlur loop of the Cavity Detector]
17 Application
[Figure: processing blocks Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Max Value, Detect Roots]
- Reference (conceptual) C code for the algorithm
  - all functions: image_in[N x M](t-1) -> image_out[N x M](t)
  - new value of pixel depends on its neighbors
  - neighbor pixels read from background memory
  - approximately 110 lines of C code (ignoring file I/O etc.)
  - experiments with N x M = 640 x 400 pixels
  - straightforward implementation: 6 image buffers
18 Preprocessing: dividing an application into the 3 layers
LAYER 1 (Module1a, Module1b, Module2, Module3, synchronisation)
- testbench call
- dynamic event behaviour
- mode selection
LAYER 2
    for (i=0; i<N; i++)
      for (j=0; j<M; j++)
        if (i == 0)
          B[i][j] = 1;
        else
          B[i][j] = func1(A[i][j], A[i-1][j]);
LAYER 3
    int func1(int a, int b)
    {
      return a*b;
    }
19 Layered code structure
    main() { /* Layer 1 code */
      read_image(IN_NAME, image_in);
      cav_detect();
      write_image(image_out);
    }
    void cav_detect() { /* Layer 2 code */
      for (x=GB; x<=N-1-GB; x++) {
        for (y=GB; y<=M-1-GB; y++) {
          gauss_x_tmp = 0;
          for (k=-GB; k<=GB; k++) {
            gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
          }
          gauss_x_image[x][y] = foo(gauss_x_tmp);
        }
      }
    }
20 Layered code structure
    void cav_detect() { /* Layer 2 code */
      /* Makes code for data access */
      /* and data transfer explicit */
      for (x=GB; x<=N-1-GB; x++) {
        for (y=GB; y<=M-1-GB; y++) {
          gauss_x_tmp = 0;
          for (k=-GB; k<=GB; k++) {
            gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
          }
          gauss_x_image[x][y] = foo(gauss_x_tmp);
        }
      }
    }
    int foo(int arg1) { /* Layer 3 */
      /* arithmetic, data-dependent operations to be mapped to data-path, controller */
    }
21 Data-flow trafo - cavity detection
    for (x=0; x<N; x++)
      for (y=0; y<M; y++)
        gauss_x_image[x][y] = 0;
    for (x=1; x<=N-2; x++)
      for (y=1; y<=M-2; y++) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; k++)
          gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
        gauss_x_image[x][y] = foo(gauss_x_tmp);
      }
accesses: N * M + (N-2) * (M-2)
22 Data-flow trafo - cavity detection
    for (x=0; x<N; x++)
      for (y=0; y<M; y++)
        if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; k++)
            gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
          gauss_x_image[x][y] = foo(gauss_x_tmp);
        } else
          gauss_x_image[x][y] = 0;
accesses: N * M, gain is about 50%
23 Data-flow transformation
- In total 5 types of data-flow transformations
  - advanced signal substitution and (copy) propagation
  - algebraic transformations (associativity, etc.)
  - shifting delay lines
  - re-computation
  - transformations to eliminate bottlenecks for subsequent loop transformations
24 Loop transformations
- Loop transformations
  - improve regularity of accesses
  - improve temporal locality: production <-> consumption
- Expected influence
  - reduce temporary storage and (anticipated) background storage
25 Global loop transformation steps applied to cavity detection
- Removal of data-flow bottleneck
  - allows merging of loops
  - done in global data-flow trafo step
- Make all loop dimensions equal
- Regularize loop traversal: Y and X loop interchange
  - follow order of input stream
- Y loop folding and global merging, X loop folding and global merging
  - full, global scope regularity
  - nearly complete locality for main signals
26 Loop trafo - cavity detection
[Figure: N x M image delivered by the scanner, with X and Y axes. Caption: from double buffer to single buffer]
27 Loop interchange (Y <-> X)
    for (x=0; x<N; x++)
      for (y=0; y<M; y++)
        /* filtering code */
becomes
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        /* filtering code */
- Single assignment -> always possible
- For all loops, to maintain regularity
28 Loop trafo - cavity detection
[Figure: repeated fold and loop merge across Gauss Blur x, Gauss Blur y and Compute Edges. Captions: from N x M to N x (2GB+1) buffer size; from N x M to N x 3 buffer size (offset arrays)]
29 Improve regularity and locality -> Loop merging
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        /* 1st filtering code */
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        /* 2nd filtering code */
merged into
    for (y=0; y<M; y++) {
      for (x=0; x<N; x++)
        /* 1st filtering code */
      for (x=0; x<N; x++)
        /* 2nd filtering code */
    }
- !! Impossible due to dependencies!
30 Data dependencies between 1st and 2nd loop
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        ... gauss_x_image[x][y] ...
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        ... for (k=-GB; k<=GB; k++)
              ... gauss_x_image[x][y+k] ...
31 Enable merging with loop folding (bumping)
    for (y=0; y<M; y++)
      for (x=0; x<N; x++)
        ... gauss_x_image[x][y] ...
    for (y=0+GB; y<M+GB; y++)
      for (x=0; x<N; x++)
        ... y-GB ...
        for (k=-GB; k<=GB; k++)
          ... gauss_x_image[x][y+k-GB] ...
32 Y-loop merging on 1st and 2nd loop nest
    for (y=0; y<M+GB; y++) {
      if (y<M)
        for (x=0; x<N; x++)
          ... gauss_x_image[x][y] ...
      if (y>=GB)
        for (x=0; x<N; x++)
          if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
            for (k=-GB; k<=GB; k++)
              ... gauss_x_image[x][y-GB+k] ...
          else
            ...
    }
33 Simplify conditions in merged loop nest
    for (y=0; y<M+GB; y++) {
      for (x=0; x<N; x++)
        if (y<M)
          ... gauss_x_image[x][y] ...
      for (x=0; x<N; x++)
        if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
          for (k=-GB; k<=GB; k++)
            ... gauss_x_image[x][y-GB+k] ...
        else if (y>=GB)
          ...
    }
34 Global loop merging/folding steps
- 1: x <-> y loop interchange (done)
- 2: Global y-loop folding/merging: 1st and 2nd nest (done)
- 3: Global y-loop folding/merging: 1st/2nd and 3rd nest
- 4: Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
- 5: Global x-loop folding/merging: 1st and 2nd nest
- 6: Global x-loop folding/merging: 1st/2nd and 3rd nest
- 7: Global x-loop folding/merging: 1st/2nd/3rd and 4th nest
35 End result of global loop trafo
    for (y=0; y<M+GB+2; y++) {
      for (x=0; x<N+2; x++) {
        ...
        if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
          gauss_xy_compute[x][y-GB][0] = 0;
          for (k=-GB; k<=GB; k++)
            gauss_xy_compute[x][y-GB][GB+k+1] =
              gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
          gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1] / tot;
        } else if (x<N && (y-GB)>=0 && (y-GB)<M)
          gauss_xy_image[x][y-GB] = 0;
        ...
      }
    }
36 Data re-use & memory hierarchy
[Figure: array A in main memory, intermediate memories and the register file of the processor data paths; access counts 100, 10 and 1 per level]
P (original) = # accesses x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
- Introduce memory hierarchy
  - reduce number of reads from main memory
  - heavily accessed arrays stored in smaller memories
37 Data re-use
- Data-flow transformations to introduce extra copies of heavily accessed signals
  - Step 1: figure out data re-use possibilities
  - Step 2: calculate possible gain
  - Step 3: decide on data assignment to memory hierarchy
39 Data re-use tree
[Figure: data re-use trees for image_in, gauss_x, gauss_xy/comp_edge and image_out, each ending at the CPU. Node sizes: N*M, M*3, N*1, 3*3, 3*1, 1*3, 1*1. Edge annotations (access counts): N*M, N*M*3, N*M*8, 0.]
40 Memory hierarchy assignment
[Figure: copies of image_in, gauss_x, gauss_xy, comp_edge and image_out assigned to three layers: 1MB SDRAM, 16KB Cache and 128 B RegFile. Node sizes: N*M, M*3, 3*3, 3*1, 1*1. Edge annotations (access counts): N*M, N*M*3, N*M*8, 0.]
41 Data-reuse - cavity detection code
Code before reuse transformation
    for (y=0; y<M+3; y++) {
      for (x=0; x<N+2; x++) {
        if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; k++)
            gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
          gauss_x_lines[x][y] = foo(gauss_x_tmp);
        } else if (x<N && y<M)
          gauss_x_lines[x][y] = 0;
        /* Other merged code omitted */
      }
    }
42 Data-reuse - cavity detection code
Code after reuse transformation
    for (y=0; y<M+3; y++) {
      for (x=0; x<N+2; x++) {
        /* first in_pixel initialized */
        if (x==0 && y>=1 && y<=M-2)
          for (k=0; k<1; k++)
            in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
        /* copy rest of in_pixel's in row */
        if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
          in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
        if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; k++)
            gauss_x_tmp += in_pixels[(x+k)%3][y%1] * Gauss[abs(k)];
          gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
        } else if (x<N && y<M)
          gauss_x_lines[x][y%3] = 0;
      }
    }
43 Data layout optimization
- At this point multi-dimensional arrays are to be assigned to physical memories
- Data layout optimization determines exactly where in each memory an array should be placed, to
  - reduce memory size by in-placing arrays that do not overlap in time (disjoint lifetimes)
  - avoid cache misses due to conflicts
  - exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences
44 In-place mapping
[Figure: addresses versus time, showing inter in-place and intra in-place mapping]
45 In-place mapping
- Implements all the anticipated memory size savings obtained in previous steps
- Modifies code to introduce one array per real memory
- Changes indices to addresses in mem. arrays
    b8 A[100][100];
    b6 B[20][20];
    for (i, j, k, l; ...)
      B[i][j] = f(B[j][i], A[i+k][j+l]);
46 In-place mapping
- Input image is partly consumed by the time first results for output image are ready
[Figure: index versus time plots for Image_in and Image_out]
47 In-place - cavity detection code
    for (y=0; y<M+3; y++) {
      for (x=0; x<N+5; x++) {
        image_out[x-5][y-3] = ...;
        /* code removed */
        ... = image_in[x+1][y];
      }
    }
becomes
    for (y=0; y<M+3; y++) {
      for (x=0; x<N+5; x++) {
        image[x-5][y-3] = ...;
        /* code removed */
        ... = image[x+1][y];
      }
    }
48 The last step: ADOPT (Address OPTimization)
- Increased execution time introduced by DTSE
  - Complicated address arithmetic (modulo!)
  - Additional complex control flow
- Multimedia platform not adapted to address calculations
- Additional transformations needed to
  - Simplify control flow
  - Simplify address arithmetic: common sub-expression elimination, modulo expansion, ...
  - Match remaining expressions on target machine
49 ADOPT principles
[Flow diagram]
- Behavioral description
- Extract address expression code, perform address expression splitting
- Apply transformations
  - Loop invariant code motion
  - Induction variable analysis
  - Algebraic transformations
- Processor-specific algebraic transformations
- Optimized behavioral description for target processor
- Optimized behavioral description
- Compile to target processor
- Map to custom ACU
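50 ADOPT principles
Example: Full-search Motion Estimation
    for (i=-8; i<8; i++)
      for (j=-4; j<=3; j++)
        for (k=-4; k<=3; k++)
          ... A[((208+i)*257+8+j)*257 + 16+i+k] ...
          ... B[(8+j)*257 + 16+i+k] ...
    dist += A[3096] - B[((208+i)*257+4)*257 + 16+i-4];
after common sub-expression elimination:
    cse1 = (33025*i + 6869616)*2;
    cse3 = 1040*i;
    cse4 = j*257 + 1032;
    cse5 = k + cse4;
    ... [cse5+cse1] ... [cse5+cse3] ... [3096] ... [cse1] ...
Algebraic transformations at word-level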
51 DMM results for cavity detection on ASIC
52 Cavity detection on Pentium-MMX
[Charts: Main Memory Accesses, Local Memory Accesses, Execution Time (sec)]
53 The Y-chart revisited
[Figure: Y-chart. Labels: Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers.]
54 Fixing platform parameters
- Assume configurable on-chip memory hierarchy
- Trade-off power versus cycle budget
[Figure: power (mW, axis 5 to 25) versus storage cycle budget (axis 50,000 to 150,000)]
55 Conclusion
- In multimedia applications, exploring data transfer and storage issues should be done at system level
- DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting
  - Platform independent high-level transformations
  - Platform dependent transformations exploit platform characteristics (optimal use of cache, ...)
  - Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...