Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions

1
Vectorization of the 2D Wavelet Lifting Transform
Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto,
F. Tirado
2
Index
  1. Motivation
  2. Experimental environment
  3. Lifting Transform
  4. Memory hierarchy exploitation
  5. SIMD optimization
  6. Conclusions
  7. Future work

3
Motivation
4
Motivation
  • Applications based on the Wavelet Transform
  • JPEG-2000, MPEG-4
  • Usage of the lifting scheme
  • Study based on a modern general-purpose
    microprocessor
  • Pentium 4
  • Objectives
  • Efficient exploitation of the memory hierarchy
  • Use of the SIMD ISA extensions

5
Experimental Environment
6
Experimental Environment
Platform          Intel Pentium 4 (2.4 GHz)
Motherboard       DFI WT70-EC
Cache             IL1: NA
                  DL1: 8 KB, 64 byte/line, write-through
                  L2: 512 KB, 128 byte/line
Memory            1 GB RDRAM (PC800)
Operating System  RedHat Distribution 7.2 (Enigma)
Compiler          Intel ICC compiler, GCC compiler
7
Lifting Transform
8
Lifting Transform
[Figure: the 1st and 2nd lifting steps applied to the original elements, producing interleaved approximation (A) and detail (D) coefficients.]
9
Lifting Transform
[Figure: one decomposition level: horizontal filtering (1D Lifting Transform) of the original elements followed by vertical filtering, yielding the approximation; repeated for N levels.]
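As a concrete sketch of the lifting steps shown on these slides, one 1D lifting level can be written as four in-place predict/update passes plus a scaling pass. The coefficient values below are the standard CDF 9/7 lifting constants and the symmetric boundary extension is one common convention; neither is stated on the slides, so treat both as assumptions:

```c
#include <assert.h>
#include <math.h>

/* Standard CDF 9/7 lifting constants (assumed; not listed on the slides). */
static const float alfa = -1.586134342f;
static const float beta = -0.052980118f;
static const float gama =  0.882911075f;
static const float delt =  0.443506852f;
static const float phi  =  1.149604398f;   /* scaling constant */

/* Symmetric boundary extension (illustrative choice). */
static float at(const float *x, int n, int i) {
    if (i < 0)  i = -i;
    if (i >= n) i = 2 * (n - 1) - i;
    return x[i];
}

/* One 1D lifting level, in place: even samples end up as approximation (A),
   odd samples as detail (D), matching the interleaved inplace scheme. */
void lifting_1d(float *x, int n) {
    int i;
    for (i = 1; i < n; i += 2)   /* 1st step: predict (odd samples) */
        x[i] += alfa * (at(x, n, i - 1) + at(x, n, i + 1));
    for (i = 0; i < n; i += 2)   /* 2nd step: update (even samples) */
        x[i] += beta * (at(x, n, i - 1) + at(x, n, i + 1));
    for (i = 1; i < n; i += 2)   /* 3rd step */
        x[i] += gama * (at(x, n, i - 1) + at(x, n, i + 1));
    for (i = 0; i < n; i += 2)   /* 4th step */
        x[i] += delt * (at(x, n, i - 1) + at(x, n, i + 1));
    for (i = 0; i < n; i++)      /* last step: scaling */
        x[i] = (i % 2 == 0) ? x[i] * phi : x[i] / phi;
}
```

A quick sanity check of the predict/update chain: for a constant input signal, the detail (odd) samples come out near zero.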
10
Lifting Transform
Horizontal Filtering
Vertical Filtering
11
Memory Hierarchy Exploitation
12
Memory Hierarchy Exploitation
  • Poor data locality of one component (canonical
    layouts)
  • E.g., column-major layout → processing image
    rows (Horizontal Filtering)
  • Aggregation (loop tiling)
  • Poor data locality of the whole transform
  • Other layouts

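As a rough sketch of the aggregation (loop tiling) idea above: instead of filtering one full row at a time, which streams one float per column (stride = rows) through the cache, a tile of consecutive rows is processed per column visit so each cached column block is reused. The tile size, the `row_op` placeholder and all names here are hypothetical, not the presentation's code:

```c
#include <assert.h>

#define TILE 64

/* Hypothetical stand-in for one point of the row-wise lifting filter. */
static void row_op(float *img, int rows, int i, int j) {
    img[(long)j * rows + i] *= 2.0f;   /* placeholder operation */
}

/* Tiled horizontal filtering over a column-major image. */
void horizontal_tiled(float *img, int rows, int cols) {
    for (int ii = 0; ii < rows; ii += TILE) {
        int iend = (ii + TILE < rows) ? ii + TILE : rows;
        for (int j = 0; j < cols; j++)        /* row direction: large stride */
            for (int i = ii; i < iend; i++)   /* contiguous within a column */
                row_op(img, rows, i, j);
    }
}
```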
13
Memory Hierarchy Exploitation
Horizontal Filtering
Vertical Filtering
14
Memory Hierarchy Exploitation
Aggregation
Horizontal Filtering
IMAGE
15
Memory Hierarchy Exploitation
  • INPLACE
  • Common implementation of the transform
  • Memory: only requires the original matrix
  • For most applications, needs post-processing
  • MALLAT
  • Memory: requires 2 matrices
  • Stores the image in the expected order
  • INPLACE-MALLAT
  • Memory: requires 2 matrices
  • Stores the image in the expected order

The different schemes studied
16
Memory Hierarchy Exploitation
[Figure: INPLACE scheme, logical vs. physical view; all coefficients stay interleaved in MATRIX 1.]
17
Memory Hierarchy Exploitation
[Figure: MALLAT scheme, logical vs. physical view; coefficients are written from MATRIX 1 to MATRIX 2 in subband order.]
18
Memory Hierarchy Exploitation
[Figure: INPLACE-MALLAT scheme, logical vs. physical view, using both MATRIX 1 and MATRIX 2.]
19
Memory Hierarchy Exploitation
  • Execution time breakdown for several sizes
    comparing both compilers.
  • I, IM and M denote inplace, inplace-mallat, and
    mallat strategies respectively.
  • Each bar shows the execution time of each level
    and the post-processing step.

20
Memory Hierarchy Exploitation
CONCLUSIONS
  • The Mallat and Inplace-Mallat approaches
    outperform the Inplace approach for levels 2 and
    above
  • These 2 approaches have a noticeable slowdown for
    the 1st level
  • Larger working set
  • More complex access pattern
  • The Inplace-Mallat version achieves the best
    execution time
  • ICC compiler outperforms GCC for Mallat and
    Inplace-Mallat, but not for the Inplace approach

21
SIMD Optimization
22
SIMD Optimization
  • Objective: extract the parallelism available in
    the Lifting Transform
  • Different strategies
  • Semi-automatic vectorization
  • Hand-coded vectorization
  • Only the horizontal filtering of the transform
    can be semi-automatically vectorized (when using
    a column-major layout)

23
SIMD Optimization
  • Automatic Vectorization (Intel C/C++ Compiler)
  • Inner loops
  • Simple array index manipulation
  • Iterate over contiguous memory locations
  • Global variables avoided
  • Pointer disambiguation if pointers are employed
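As an illustration (not from the slides), an inner loop satisfying these requirements, with simple indices, unit-stride contiguous accesses, no globals, and `restrict` qualifiers to disambiguate the pointers, might look like:

```c
#include <assert.h>

/* The function name and signature are illustrative only. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)   /* vectorizable: SIMD multiply-add */
        y[i] += a * x[i];
}
```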

24
SIMD Optimization
[Figure: the 1st and 2nd lifting steps over the original elements, producing A and D coefficients.]
25
SIMD Optimization
Vectorial Horizontal filtering
[Figure: scalar vs. vectorial horizontal filtering; with a column-major layout, one SIMD operation applies a lifting step to four adjacent rows at once.]
26
SIMD Optimization
Vectorial Vertical filtering
[Figure: scalar vs. vectorial vertical filtering on a column-major layout.]
27
SIMD Optimization
Horizontal Vectorial Filtering (semi-automatic)
    for (j = 2, k = 1; j < (columns - 4); j += 2, k++) {
        #pragma vector aligned
        for (i = 0; i < rows; i++) {
            /* 1st operation */
            col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
            /* 2nd operation */
            col2[i] = col2[i] + beta * (col3[i] + col1[i]);
            /* 3rd operation */
            col1[i] = col1[i] + gama * (col2[i] + col0[i]);
            /* 4th operation */
            col0[i] = col0[i] + delt * (col1[i] + col_1[i]);
            /* Last step */
            detail[i] = col1[i] * phi_inv;
            aprox[i]  = col0[i] * phi;
        }
    }
28
SIMD Optimization
  • Hand-coded Vectorization
  • SIMD parallelism has to be explicitly expressed
  • Intrinsics allow more flexibility
  • Possibility to also vectorize the vertical
    filtering

29
SIMD Optimization
Horizontal Vectorial Filtering (hand)
    /* 1st operation */
    t2 = _mm_load_ps(col2);
    t4 = _mm_load_ps(col4);
    t3 = _mm_load_ps(col3);
    coeff = _mm_set_ps1(alfa);
    t4 = _mm_add_ps(t2, t4);
    t4 = _mm_mul_ps(t4, coeff);
    t3 = _mm_add_ps(t4, t3);
    _mm_store_ps(col3, t3);
    /* 2nd operation */ ...
    /* 3rd operation */ ...
    /* 4th operation */ ...
    /* Last step */
    _mm_store_ps(detail, t1);
    _mm_store_ps(aprox, t0);

[Figure: registers t2, t3 and t4 each holding four adjacent column elements during the 1st operation.]
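A self-contained, compilable version of the 1st operation above might look like the sketch below. The function name, signature and the use of unaligned loads are assumptions made so the sketch runs on arbitrary buffers; the slides use aligned loads (`_mm_load_ps`) on 16-byte-aligned columns, and n is assumed to be a multiple of 4:

```c
#include <assert.h>
#include <xmmintrin.h>

/* 1st lifting operation, col3 += alfa * (col4 + col2),
   four floats per SSE operation. */
void lift_op1(float *col3, const float *col2, const float *col4,
              float alfa, int n) {
    __m128 coeff = _mm_set_ps1(alfa);       /* broadcast the coefficient */
    for (int i = 0; i < n; i += 4) {
        __m128 t2 = _mm_loadu_ps(col2 + i);
        __m128 t4 = _mm_loadu_ps(col4 + i);
        __m128 t3 = _mm_loadu_ps(col3 + i);
        t4 = _mm_add_ps(t2, t4);            /* col4 + col2 */
        t4 = _mm_mul_ps(t4, coeff);         /* alfa * (col4 + col2) */
        t3 = _mm_add_ps(t4, t3);            /* col3 + alfa * (...) */
        _mm_storeu_ps(col3 + i, t3);
    }
}
```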
30
SIMD Optimization
  • Execution time breakdown of the horizontal
    filtering (1024×1024-pixel image).
  • I, IM and M denote inplace, inplace-mallat and
    mallat approaches.
  • S, A and H denote scalar, automatic-vectorized
    and hand-coded-vectorized.

31
SIMD Optimization
CONCLUSIONS
  • Speedup between 4 and 6 depending on the
    strategy. Such a high improvement comes not only
    from the vectorial computations, but also from a
    considerable reduction in memory accesses.
  • The speedups achieved by the strategies with
    recursive layouts (i.e. inplace-mallat and
    mallat) are higher than the inplace version
    counterparts, since the computation on the latter
    can only be vectorized in the first level.
  • For ICC, both vectorization approaches (i.e.
    automatic and hand-tuned) produce similar
    speedups, which highlights the quality of the ICC
    vectorizer.

32
SIMD Optimization
  • Execution time breakdown of the whole transform
    (1024×1024-pixel image).
  • I, IM and M denote inplace, inplace-mallat and
    mallat approaches.
  • S, A and H denote scalar, automatic-vectorized
    and hand-coded-vectorized.

33
SIMD Optimization
CONCLUSIONS
  • Speedup between 1.5 and 2 depending on the
    strategy.
  • For ICC the shortest execution time is reached by
    the mallat version.
  • When using GCC both recursive-layout strategies
    obtain similar results.

34
SIMD Optimization
  • Speedup achieved by the different vectorial codes
    over the scalar inplace-mallat and inplace
    versions.
  • We show the hand-coded ICC, the automatic ICC,
    and the hand-coded GCC.

35
SIMD Optimization
CONCLUSIONS
  • The speedup grows with the image size.
  • On average, the speedup is about 1.8 over the
    inplace-mallat scheme, growing to about 2 when
    considering it over the inplace strategy.
  • Focusing on the compilers, ICC clearly
    outperforms GCC by a significant 20-25% for all
    the image sizes.

36
Conclusions
37
Conclusions
  • Scalar version: we have introduced a new scheme
    called Inplace-Mallat that outperforms both the
    Inplace implementation and the Mallat scheme.
  • SIMD exploitation: code modifications for the
    vectorial processing of the lifting algorithm.
    Two different methodologies with the ICC
    compiler: semi-automatic and intrinsic-based
    vectorization. Both provide similar results.
  • Speedup: horizontal filtering about 4-6
    (vectorization also reduces the pressure on the
    memory system).
  • Whole transform: around 2.
  • The vectorial Mallat approach outperforms the
    other schemes and exhibits better scalability.
  • Most of our insights are compiler independent.

38
Future work
39
Future work
  • 4D layout for a lifting-based scheme
  • Measurements on other platforms
  • Intel Itanium
  • Intel Pentium 4 with Hyper-Threading
  • Parallelization using OpenMP (SMT)

For additional information: http://www.dacya.ucm.es/dchaver