Embedded Systems in Silicon TD5102 Data Management (1) Overview - PowerPoint PPT Presentation

About This Presentation
Title:

Embedded Systems in Silicon TD5102 Data Management (1) Overview

Description:

Title: Design Technology for future Multi-Media Systems Author: henk corporaal Last modified by: Medewerker Created Date: 2/25/2002 12:06:54 PM Document presentation ... – PowerPoint PPT presentation

Slides: 55
Provided by: henkcor1

Transcript and Presenter's Notes

Title: Embedded Systems in Silicon TD5102 Data Management (1) Overview


1
Embedded Systems in Silicon TD5102
Data Management (1) Overview
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2
Data Management Overview
  • Motivation
  • Example application
  • Data Management (DM) steps
  • Results
  • Important note
      • We consider here statically declared data
        structures only
      • DM is also called
          • DTSE (Data Transfer and Storage Exploration), or
          • Physical Memory Management

3
Design flow
4
The underlying idea
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
5
Platform architecture model
[Figure: platform architecture with four memory levels (Level-1 to Level-4). Components shown: chip with CPUs, HW accelerators, ICache, DCache, local memories and on-chip busses; L2 cache; main memory; disks on a SCSI bus; bus interface and bridge.]
6
Platform example: TriMedia
7
Data transfer and storage power
8
Positioning in the Y-chart
[Figure: Y-chart. Applications and an Architecture Instance are combined by Mapping; Performance Analysis then yields Performance Numbers.]
9
Current practice: mapping, easy, but ...
Idea
  • Given
      • reference C code for application, e.g. MPEG-4
        Motion Estimation
      • platform SUPERDUPER-LX50
  • Task
      • map application on architecture
  • But wait a moment
      • me@work> CC o2 mpeg4_me mpeg4_me.c
        Thank you for running SUPERDUPER-LX50 compiler.
        Your program uses 257321886 bytes memory, 78 Watt,
        428798765291 clock cycles

ab5d for (...) ..
10
Let's help the compiler ...
DTSE: data transfer and storage exploration
  • DTSE is a methodology to explore data-transfer
    and data-storage in multi-media applications
  • Transforms C-code of the application
  • By focusing on multi-dimensional signals (arrays)
  • To better exploit platform capabilities
  • This overview covers the major steps to improve
    power, area, performance trade-off

11
Data Management principles
Off-chip SDRAM
Exploit limited life-time
12
DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse Memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out
13
The DM steps
  • Preprocessing
      • Rewrite code in 3 layers (parts)
      • Selective inlining, Single Assignment form, ....
  • Data flow transformations
      • Eliminate redundant transfers and storage
  • Loop and control flow transformations
      • Improve regularity of accesses and data locality
  • Data re-use and memory hierarchy layer assignment
      • Determine when to move which data between
        memories to meet the cycle budget of the
        application with low cost
      • Determine in which layer to put the arrays (and
        copies)

14
The DM steps
  • Per memory layer
      • Cycle budget distribution
          • determine memory access constraints for given
            cycle budget
      • Memory allocation and assignment
          • which memories to use, and where to put the
            arrays
      • Data layout
          • determine how to combine and put arrays into
            memories
  • Address optimization on the final C-code

15
Application example
  • Application domain
      • Computer Tomography in medical imaging
  • Algorithm
      • Cavity detection in CT-scans
      • Detect dark regions in successive images
      • Indicate cavity in brain

=> Bad news for owner of brain
16
Data enters Cavity Detector row-wise
[Figure: serial scan of image_in into a buffer that feeds the GaussBlur loop of the Cavity Detector]
17
Application
[Figure: processing blocks of the cavity detector: Gauss Blur x, Gauss Blur y, Compute Edges, Max Value, Reverse, Detect Roots]
  • Reference (conceptual) C code for the algorithm
      • all functions: image_in[N x M](t-1) -> image_out[N x M](t)
      • new value of pixel depends on its neighbors
      • neighbor pixels read from background memory
      • approximately 110 lines of C code (ignoring file
        I/O etc)
      • experiments with N x M = 640 x 400 pixels
      • straightforward implementation: 6 image buffers

18
Preprocessing: dividing an application into the 3 layers
LAYER1
  Module1a, Module1b, Module2, Module3, Synchronisation
  - testbench call
  - dynamic event behaviour
  - mode selection
LAYER2
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      if (i == 0)
        B[i][j] = 1;
      else
        B[i][j] = func1(A[i][j], A[i-1][j]);
LAYER3
  int func1(int a, int b)
  {
    return a*b;
  }

19
Layered code structure
main() { /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}
void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x)
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
}

20
Layered code structure
void cav_detect() { /* Layer 2 code */
  /* Makes code for data access   */
  /* and data transfer explicit   */
  for (x=GB; x<=N-1-GB; ++x)
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
}
int foo(int arg1) { /* Layer 3 */
  /* arithmetic, data-dependent operations to be
     mapped to data-path, controller */
}
21
Data-flow trafo - cavity detection
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    gauss_x_image[x][y] = 0;
for (x=1; x<=N-2; ++x)
  for (y=1; y<=M-2; ++y) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; ++k)
      gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }

# accesses: N * M + (N-2) * (M-2)
22
Data-flow trafo - cavity detection
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else
      gauss_x_image[x][y] = 0;

# accesses: N * M, gain is roughly 50%
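The access counts on the two data-flow slides can be checked with a small sketch. It counts only the writes to gauss_x_image and assumes the loop bounds shown on the slides (a full initialization pass plus an interior pass, versus one merged pass). The function names writes_before and writes_after are hypothetical.

```c
#include <assert.h>

/* Original version: one pass that zeroes every pixel,
 * then a second pass that overwrites the interior pixels. */
long writes_before(long n, long m)
{
    assert(n >= 2 && m >= 2);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < m; y++)
            count++;                      /* gauss_x_image[x][y] = 0 */
    for (long x = 1; x <= n - 2; x++)
        for (long y = 1; y <= m - 2; y++)
            count++;                      /* gauss_x_image[x][y] = foo(...) */
    return count;
}

/* Transformed version: each pixel is written exactly once,
 * either with the filtered value or with 0. */
long writes_after(long n, long m)
{
    assert(n >= 2 && m >= 2);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < m; y++) {
            if (x >= 1 && x <= n - 2 && y >= 1 && y <= m - 2)
                count++;                  /* filtered value */
            else
                count++;                  /* border: 0 */
        }
    return count;
}
```

For the 640 x 400 experiment size this gives 509924 writes before and 256000 after, which is the "roughly half" gain stated on the slide.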
23
Data-flow transformation
  • In total 5 types of data-flow transformations:
      • advanced signal substitution and (copy)
        propagation
      • algebraic transformations (associativity, etc.)
      • shifting delay lines
      • re-computation
      • transformations to eliminate bottlenecks for
        subsequent loop transformations

24
Loop transformations
  • Loop transformations
      • improve regularity of accesses
      • improve temporal locality between production
        and consumption
  • Expected influence
      • reduce temporary storage and (anticipated)
        background storage

25
Global loop transformation steps applied to
cavity detection
  • Removal of data-flow bottleneck
      • allows merging of loops
      • done in global data-flow trafo step
  • Make all loop dimensions equal
  • Regularize loop traversal: Y and X loop
    interchange
      • follow order of input stream
  • Y loop folding and global merging
  • X loop folding and global merging
      • full, global scope regularity
      • nearly complete locality for main signals

26
Loop trafo - cavity detection
[Figure: N x M image delivered by the scanner, with X and Y axes]
From double buffer to single buffer
27
Loop interchange (Y <-> X)
Before:
  for (x=0; x<N; x++)
    for (y=0; y<M; y++)
      /* filtering code */
After:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* filtering code */
  • Single assignment => always possible
  • For all loops, to maintain regularity
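Why the interchange helps can be seen by recording the order in which each loop nest visits the pixels. The sketch below assumes the scanner delivers the image row-wise, that is, all x positions of row y before row y+1, so pixel (x, y) arrives at stream position y*N + x. The tiny image size and the function names are hypothetical.

```c
#include <assert.h>

enum { N = 4, M = 3 };   /* hypothetical tiny image: N pixels per row, M rows */

/* Position of pixel (x, y) in a row-wise input stream (assumption). */
int stream_position(int x, int y)
{
    assert(x >= 0 && x < N && y >= 0 && y < M);
    return y * N + x;
}

/* Original nest: x outer, y inner. order[x][y] = visit number. */
void visit_x_outer(int order[N][M])
{
    int t = 0;
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            order[x][y] = t++;
}

/* Interchanged nest: y outer, x inner. */
void visit_y_outer(int order[N][M])
{
    int t = 0;
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            order[x][y] = t++;
}
```

With y as the outer loop the visit number of every pixel equals its stream position, so a pixel can be processed as soon as it arrives. With x outer, the second pixel visited is (0, 1), which only arrives after a full row has been received.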

28
Loop trafo - cavity detection
[Figure: Gauss Blur x, Gauss Blur y and Compute Edges after repeated fold and loop merge, with line buffers of N x (2GB+1) and N x 3 (offset arrays)]
From N x M to N x (2GB+1) buffer size
From N x M to N x 3 buffer size
29
Improve regularity and locality => Loop Merging
Before:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 1st filtering code */
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 2nd filtering code */
After:
  for (y=0; y<M; y++) {
    for (x=0; x<N; x++)
      /* 1st filtering code */
    for (x=0; x<N; x++)
      /* 2nd filtering code */
  }
  • !! Impossible due to dependencies!

30
Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    gauss_x_image[x][y] = ...;
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    for (k=-GB; k<=GB; ++k)
      ... gauss_x_image[x][y+k] ...;
31
Enable merging with Loop Folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    gauss_x_image[x][y] = ...;
for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    /* y replaced by y-GB */
    for (k=-GB; k<=GB; ++k)
      ... gauss_x_image[x][y+k-GB] ...;
32
Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      gauss_x_image[x][y] = ...;
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        for (k=-GB; k<=GB; ++k)
          ... gauss_x_image[x][y-GB+k] ...;
      else
        ...;
}
33
Simplify conditions in merged loop nest
for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      gauss_x_image[x][y] = ...;
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      for (k=-GB; k<=GB; ++k)
        ... gauss_x_image[x][y-GB+k] ...;
    else if (y>=GB)
      ...;
}
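The folding and merging steps can be tried out on a one-dimensional stand-in for the two loop nests. In this sketch the producer fills a[y] and the consumer sums a window of 2*GB+1 neighbours. The sizes, the produce() function and both function names are hypothetical; the loop structure is what mirrors the slides. Delaying the consumer by GB iterations makes every value it reads available by the time it runs, so both loops fit in one body.

```c
#include <assert.h>

enum { M = 16, GB = 2 };   /* hypothetical sizes */

/* Stand-in for the first filter. */
static int produce(int y) { return 3 * y + 1; }

/* Reference: two separate loops, as before merging. */
void separate_loops(int a[M], int b[M])
{
    for (int y = 0; y < M; y++)
        a[y] = produce(y);
    for (int y = 0; y < M; y++) {
        b[y] = 0;
        if (y >= GB && y <= M - 1 - GB)
            for (int k = -GB; k <= GB; k++)
                b[y] += a[y + k];
    }
}

/* Consumer folded (delayed) by GB iterations and merged with the producer.
 * In iteration y the consumer handles row y-GB and reads at most a[y],
 * which the producer has just written. */
void folded_and_merged(int a[M], int b[M])
{
    for (int y = 0; y < M + GB; y++) {
        if (y < M)
            a[y] = produce(y);
        if (y >= GB) {
            int yy = y - GB;
            b[yy] = 0;
            if (yy >= GB && yy <= M - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    b[yy] += a[yy + k];
        }
    }
}
```

Because the merged loop only ever looks back 2*GB rows, a buffer of 2*GB+1 lines would be enough to hold the live part of a, which is where the reduced buffer sizes on the slides come from.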
34
Global loop merging/folding steps
  • 1. x <-> y loop interchange (done)
  • 2. Global y-loop folding/merging 1st and 2nd nest
    (done)
  • 3. Global y-loop folding/merging 1st/2nd and 3rd
    nest
  • 4. Global y-loop folding/merging 1st/2nd/3rd and
    4th nest
  • 5. Global x-loop folding/merging 1st and 2nd nest
  • 6. Global x-loop folding/merging 1st/2nd and 3rd
    nest
  • 7. Global x-loop folding/merging 1st/2nd/3rd and
    4th nest

35
End result of global loop trafo
for (y=0; y<M+GB+2; ++y)
  for (x=0; x<N+2; ++x) {
    ...
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k]
          + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] =
        gauss_xy_compute[x][y-GB][(2*GB)+1] / tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
    ...
  }
36
Data re-use and memory hierarchy
[Figure: array A in main memory, intermediate copies, and the register file of the processor data paths, annotated with access counts 100, 10 and 1]
P (original) = # accesses x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
  • Introduce memory hierarchy
      • reduce number of reads from main memory
      • heavily accessed arrays stored in smaller memories
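The power estimate on this slide is a weighted sum: for each memory layer, the number of accesses times the relative power per access. A minimal sketch of that arithmetic follows; the function name hierarchy_power is hypothetical, and the numbers in the usage note are the ones from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Sum over all layers of (accesses to that layer) x (power per access). */
double hierarchy_power(const double accesses[],
                       const double power_per_access[],
                       size_t layers)
{
    assert(accesses != NULL && power_per_access != NULL);
    double total = 0.0;
    for (size_t i = 0; i < layers; i++)
        total += accesses[i] * power_per_access[i];
    return total;
}
```

With a single layer of 100 accesses at relative power 1 the result is 100. With three layers of 100, 10 and 1 accesses at relative power 0.01, 0.1 and 1 the result is 3, matching P (after).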

37
Data re-use
  • Data flow transformations to introduce
    extra copies of heavily accessed signals
  • Step 1: figure out data re-use possibilities
  • Step 2: calculate possible gain
  • Step 3: decide on data assignment to memory
    hierarchy

39
Data re-use tree
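[Figure: data re-use trees for image_in, gauss_x, gauss_xy/comp_edge and image_out. Each tree runs from the full N*M array down to the CPU through candidate copies (sizes such as M*3, 3*3, 3*1, 1*3, N*1, 1*1) and is annotated with access counts (N*M, N*M*3, N*M*8).]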
40
Memory hierarchy assignment
[Figure: memory hierarchy assignment of image_in, gauss_x, gauss_xy, comp_edge and image_out over three layers: 1MB SDRAM, 16KB Cache and 128 B RegFile, with copy sizes (N*M, M*3, 3*3, 3*1, 1*1) and access counts (N*M, N*M*3, N*M*8) per layer.]
41
Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; ++y)
  for (x=0; x<N+2; ++x) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_lines[x][y] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y] = 0;
    /* Other merged code omitted */
  }
42
Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; ++y)
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; ++k)
        in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
    /* copy rest of in_pixel's in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3][y%1] * Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
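The effect of the in_pixels copy can be measured on a single image row. The sketch below is a one-dimensional simplification, not the slide code itself: it applies a 3-tap filter once by reading image_in directly and once through a 3-entry circular buffer, and returns the number of reads from image_in in each case. The row length, the filter weights and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

enum { N = 8 };                          /* hypothetical row length */
static const int Gauss[2] = { 2, 1 };    /* hypothetical filter weights */

/* No re-use: every interior output reads image_in three times.
 * Returns the number of reads from image_in. */
long blur_row_direct(const int image_in[N], int out[N])
{
    long reads = 0;
    for (int x = 0; x < N; x++) {
        if (x >= 1 && x <= N - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++) {
                tmp += image_in[x + k] * Gauss[abs(k)];
                reads++;
            }
            out[x] = tmp;
        } else {
            out[x] = 0;
        }
    }
    return reads;
}

/* With re-use: each pixel is copied once into a 3-entry circular
 * buffer, and the filter reads only from that small buffer. */
long blur_row_reuse(const int image_in[N], int out[N])
{
    int in_pixels[3];
    long reads = 0;
    assert(N >= 3);
    for (int x = 0; x < N; x++) {
        if (x == 0) {                      /* first pixel of the row */
            in_pixels[0] = image_in[0];
            reads++;
        }
        if (x <= N - 2) {                  /* fetch the pixel one ahead */
            in_pixels[(x + 1) % 3] = image_in[x + 1];
            reads++;
        }
        if (x >= 1 && x <= N - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++)
                tmp += in_pixels[(x + k) % 3] * Gauss[abs(k)];
            out[x] = tmp;
        } else {
            out[x] = 0;
        }
    }
    return reads;
}
```

For N = 8 the direct version performs 18 reads from image_in and the re-use version 8, one per pixel, while both produce the same output row.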
43
Data layout optimization
  • At this point multi-dimensional arrays are to be
    assigned to physical memories
  • Data layout optimization determines exactly where
    in each memory an array should be placed, to
      • reduce memory size by in-placing arrays that do
        not overlap in time (disjoint lifetimes)
      • avoid cache misses due to conflicts
      • exploit spatial locality of the data in memory to
        improve performance of e.g. page-mode memory
        access sequences

44
In-place mapping
[Figure: array addresses plotted against time, illustrating inter-array (inter in-place) and intra-array (intra in-place) mapping]
45
In-place mapping
  • Implements all the anticipated memory size
    savings obtained in previous steps
  • Modifies code to introduce one array per real
    memory
  • Changes indices to addresses in mem. arrays

b8 A[100][100]; b6 B[20][20];
for (i,j,k,l)
  B[i][j] = f(B[j][i], A[i+k][j+l]);
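Turning indices into addresses in one memory array can be sketched as follows. The example assumes the simplest possible layout, with A and B stored row-major one after the other in a single linear memory and no overlap between them; the base offsets and the function names are hypothetical, and the differing element widths (b8, b6) are ignored.

```c
#include <assert.h>

enum { A_ROWS = 100, A_COLS = 100, B_ROWS = 20, B_COLS = 20 };
enum {
    A_BASE   = 0,                          /* A starts at address 0 */
    B_BASE   = A_BASE + A_ROWS * A_COLS,   /* B follows directly after A */
    MEM_SIZE = B_BASE + B_ROWS * B_COLS    /* size of the one memory array */
};

/* Address of A[i][j] in the single memory array. */
int addr_A(int i, int j)
{
    assert(i >= 0 && i < A_ROWS && j >= 0 && j < A_COLS);
    return A_BASE + i * A_COLS + j;
}

/* Address of B[i][j] in the single memory array. */
int addr_B(int i, int j)
{
    assert(i >= 0 && i < B_ROWS && j >= 0 && j < B_COLS);
    return B_BASE + i * B_COLS + j;
}
```

An access B[i][j] then becomes mem[addr_B(i, j)]. In-place mapping goes further than this sketch by letting address ranges overlap when the lifetimes of the data do not.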
46
In-place mapping
  • Input image is partly consumed by the time first
    results for output image are ready

[Figure: occupied index range versus time, shown for Image_in and for Image_out]
47
In-place - cavity detection code
Before (two arrays):
  for (y=0; y<M+3; ++y)
    for (x=0; x<N+5; ++x)
      image_out[x-5][y-3] = /* code removed */ image_in[x+1][y];
After (one shared array):
  for (y=0; y<M+3; ++y)
    for (x=0; x<N+5; ++x)
      image[x-5][y-3] = /* code removed */ image[x+1][y];
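The reason one array can replace two is that each output element is written only after the input elements stored at that location are no longer needed. A one-dimensional sketch of this idea, using a hypothetical difference filter instead of the omitted cavity detection code:

```c
#include <assert.h>

enum { N = 8 };   /* hypothetical row length */

/* Two buffers: output written to a separate array. */
void diff_two_buffers(const int image_in[N], int image_out[N])
{
    for (int x = 0; x < N; x++)
        image_out[x] = (x + 1 < N) ? image_in[x + 1] - image_in[x] : 0;
}

/* One buffer: element x is overwritten in iteration x, after its last
 * read (also in iteration x). Element x+1 is read before it is ever
 * written, because writes only happen at indices up to x. */
void diff_in_place(int image[N])
{
    assert(N >= 1);
    for (int x = 0; x < N; x++)
        image[x] = (x + 1 < N) ? image[x + 1] - image[x] : 0;
}
```

Both versions produce the same result, but the second needs half the storage. The write index trailing the read index plays the same role as the x-5 and y-3 offsets in the slide code.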
48
The last step ADOPT
(Address OPTimization)
  • Increased execution time introduced by DTSE
      • Complicated address arithmetic (modulo!)
      • Additional complex control flow
  • Multimedia platform not adapted to address
    calculations
  • Additional transformations needed to
      • Simplify control flow
      • Simplify address arithmetic: common
        sub-expression elimination, modulo expansion, ...
      • Match remaining expressions on target machine
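As an illustration of why modulo arithmetic is a target for address optimization, the sketch below replaces a per-iteration modulo by a wrapping counter. This is a generic strength reduction chosen for illustration; it is an assumption that it resembles what ADOPT does, and the function names are hypothetical.

```c
#include <assert.h>

/* Straightforward line-buffer index: one modulo per iteration. */
void line_index_modulo(int idx[], int m)
{
    assert(m >= 0);
    for (int y = 0; y < m; y++)
        idx[y] = y % 3;
}

/* Same index sequence without modulo: a counter that wraps at 3. */
void line_index_counter(int idx[], int m)
{
    assert(m >= 0);
    int line = 0;
    for (int y = 0; y < m; y++) {
        idx[y] = line;
        if (++line == 3)
            line = 0;
    }
}
```

Both functions produce 0, 1, 2, 0, 1, 2, ... but the second needs only an increment and a compare per iteration, which matters on processors without a fast divide.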

49
ADOPT principles
Behavioral description
Extract address expr. code
Perform addr. expr. splitting
Apply transformations
  - Loop invariant code motion
  - Induction variable analysis
  - Algebraic transformations
Processor specific algebraic transformations
Optimized behavioral descr. for target processor
Optimized behavioral descr.
Compile to target processor
Map to custom ACU
50
ADOPT principles
Example: Full-search Motion Estimation
Before:
  for (i=-8; i<=8; i++)
    for (j=-4; j<=3; j++)
      for (k=-4; k<=3; k++)
        A[((208+i)*257+8+j)*257+16+i+k]
        B[(8+j)*257+16+i+k]
        dist = A[3096] - B[((208+i)*257+4)*257+16+i-4]
After:
  cse1 = (33025*i+6869616)*2
  cse3 = 1040+i
  cse4 = j*257+1032
  cse5 = k+cse4
  A[cse5+cse1]
  B[cse5+cse3]
  dist = A[3096] - B[cse1]
Algebraic transformations at word-level
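That the common sub-expressions compute the same addresses as the original index expressions can be checked by brute force. The sketch below assumes the operators of the address expressions are as reconstructed from the slide (the extraction dropped them) and assumes inclusive loop ranges of -8..8 for i and -4..3 for j and k. The function names are hypothetical.

```c
#include <assert.h>

/* Original address expressions. */
long addr_a(long i, long j, long k) { return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k; }
long addr_b(long i, long j, long k) { return (8 + j) * 257 + 16 + i + k; }
long addr_b_dist(long i)            { return ((208 + i) * 257 + 4) * 257 + 16 + i - 4; }

/* Common sub-expressions after algebraic transformation. */
long cse1(long i)         { return (33025 * i + 6869616) * 2; }
long cse3(long i)         { return 1040 + i; }
long cse4(long j)         { return j * 257 + 1032; }
long cse5(long j, long k) { return k + cse4(j); }

/* Counts the (i, j, k) points where the two forms disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (long i = -8; i <= 8; i++)
        for (long j = -4; j <= 3; j++)
            for (long k = -4; k <= 3; k++) {
                if (addr_a(i, j, k) != cse5(j, k) + cse1(i)) mismatches++;
                if (addr_b(i, j, k) != cse5(j, k) + cse3(i)) mismatches++;
                if (addr_b_dist(i)  != cse1(i))              mismatches++;
            }
    return mismatches;
}
```

count_mismatches() returns 0: expanding the products gives 66050*i + 257*j + k + 13740264 for the A address in both forms, and the multiplications by 257 in the inner loops are reduced to one multiplication per i and one per j.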
51
DMM results for cavity detection on ASIC
52
Cavity detection on Pentium-MMX
[Charts: main memory accesses, local memory accesses and execution time (sec)]
53
The Y-chart revisited
[Figure: Y-chart. Applications and an Architecture Instance are combined by Mapping; Performance Analysis then yields Performance Numbers.]
54
Fixing platform parameters
  • Assume configurable on-chip memory hierarchy
  • Trade-off power versus cycle-budget

[Plot: power (mW, axis from 5 to 25) versus storage cycle budget (axis from 50,000 to 150,000)]
55
Conclusion
  • In multi-media applications exploring data
    transfer and storage issues should be done at
    system level
  • DTSE is a methodology for Data Transfer and
    Storage Exploration based on manual and/or
    tool-assisted code rewriting
  • Platform independent high-level transformations
  • Platform dependent transformations exploit
    platform characteristics (optimal use of cache,
    ...)
  • Substantial reduction in power and memory size
    demonstrated on MPEG-4, OFDM, H.263, ADSL, ...