Title: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado
Index
- Motivation
- Experimental environment
- Lifting Transform
- Memory hierarchy exploitation
- SIMD optimization
- Conclusions
- Future work
Motivation
- Applications based on the Wavelet Transform: JPEG-2000, MPEG-4
- Usage of the lifting scheme
- Study based on a modern general-purpose microprocessor: Pentium 4
- Objectives:
  - Efficient exploitation of the memory hierarchy
  - Use of the SIMD ISA extensions
Experimental Environment
- Platform: Intel Pentium 4 (2.4 GHz)
- Motherboard: DFI WT70-EC
- Cache:
  - IL1: NA
  - DL1: 8 KB, 64-byte line, write-through
  - L2: 512 KB, 128-byte line
- Memory: 1 GB RDRAM (PC800)
- Operating System: RedHat Distribution 7.2 (Enigma)
- Compilers: Intel ICC and GCC
Lifting Transform
[Diagram: in the 1st step, the original elements are filtered into interleaved approximation (A) and detail (D) coefficients; a 2nd lifting step is then applied to the result.]
Lifting Transform
[Diagram: one decomposition level applies horizontal filtering (1D lifting transform) to the rows of the original elements and vertical filtering (1D lifting transform) to the columns; repeating the process on the approximation subband yields N levels.]
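The 1D lifting pass applied by both filterings can be sketched as follows. This is a minimal illustration, assuming the CDF 9/7 filter used by JPEG-2000 and symmetric border extension; the naming (`lift_1d`, `alfa`, `beta`, `gama`, `delt`, `phi`) is ours, not the authors' code.

```c
#include <stddef.h>

/* CDF 9/7 lifting coefficients (analysis side). */
static const float alfa = -1.586134342f;
static const float beta = -0.05298011854f;
static const float gama =  0.8829110762f;
static const float delt =  0.4435068522f;
static const float phi  =  1.149604398f;   /* scaling factor */

/* One 1-D lifting pass over a signal of even length n, in place:
 * even positions end up as approximations (A), odd positions as details (D). */
void lift_1d(float *x, size_t n)
{
    size_t i;
    /* 1st step: predict the odd samples from their even neighbours */
    for (i = 1; i + 1 < n; i += 2) x[i] += alfa * (x[i-1] + x[i+1]);
    x[n-1] += 2.0f * alfa * x[n-2];               /* symmetric extension */
    /* 2nd step: update the even samples */
    for (i = 2; i < n; i += 2) x[i] += beta * (x[i-1] + x[i+1]);
    x[0] += 2.0f * beta * x[1];
    /* 3rd step: second predict */
    for (i = 1; i + 1 < n; i += 2) x[i] += gama * (x[i-1] + x[i+1]);
    x[n-1] += 2.0f * gama * x[n-2];
    /* 4th step: second update */
    for (i = 2; i < n; i += 2) x[i] += delt * (x[i-1] + x[i+1]);
    x[0] += 2.0f * delt * x[1];
    /* Last step: scaling */
    for (i = 0; i + 1 < n; i += 2) { x[i] *= phi; x[i+1] /= phi; }
}
```

A convenient sanity check: on a constant signal the detail samples come out numerically zero, since the 9/7 analysis wavelet has vanishing moments.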
Lifting Transform
[Diagram: access patterns of the horizontal and vertical filtering.]
Memory Hierarchy Exploitation
- Poor data locality of one component with canonical layouts, e.g. a column-major layout when processing image rows (horizontal filtering)
- Aggregation (loop tiling)
- Poor data locality of the whole transform
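The aggregation idea can be sketched as follows. This is an illustrative fragment with hypothetical names, not the paper's code: with a column-major image, sweeping a full row touches a different cache line at every element, so rows are processed in blocks and each loaded line is reused before eviction.

```c
#include <stddef.h>

#define TILE 32   /* rows processed together; a tuning parameter */

/* Illustrative aggregation (loop tiling) for row-wise processing of a
 * column-major image: img[j*rows + i] is element (row i, column j).
 * The stand-in body just scales each element; the real code would apply
 * the lifting steps along each row. */
void process_rows_tiled(float *img, size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += TILE) {
        size_t imax = (ii + TILE < rows) ? ii + TILE : rows;
        for (size_t j = 0; j < cols; j++)       /* sweep the columns once */
            for (size_t i = ii; i < imax; i++)  /* ...for a block of rows */
                img[j * rows + i] *= 0.5f;
    }
}
```

Each cache line of a column holds several consecutive row indices, so processing a block of rows per column sweep converts the strided row walk into unit-stride reuse.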
Memory Hierarchy Exploitation
[Diagram: horizontal and vertical filtering access patterns on the cache.]
Memory Hierarchy Exploitation
[Diagram: aggregation — the horizontal filtering is applied to the image in blocks.]
Memory Hierarchy Exploitation
Different studied schemes:
- INPLACE
  - Common implementation of the transform
  - Memory: only requires the original matrix
  - For most applications, needs a post-processing step
- MALLAT
  - Memory: requires 2 matrices
  - Stores the image in the expected order
- INPLACE-MALLAT
  - Memory: requires 2 matrices
  - Stores the image in the expected order
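The difference between the layouts can be made concrete with a small index sketch (our illustration, not the paper's code): after l decomposition levels, the inplace scheme leaves the level-l approximation samples interleaved at every 2^l-th position of the original matrix, while the Mallat-style schemes copy each subband out contiguously into a second matrix.

```c
#include <stddef.h>

/* INPLACE: coefficients stay interleaved in the original matrix, so the
 * k-th level-l approximation sample sits at a strided position. */
size_t inplace_index(size_t k, unsigned level) { return k << level; }

/* MALLAT / INPLACE-MALLAT: the subband is written out contiguously, so
 * the k-th approximation sample is simply at offset k. */
size_t mallat_index(size_t k, unsigned level) { (void)level; return k; }
```

The strided inplace accesses are what hurt locality at deeper levels, which is why the copying schemes pay off from level 2 onwards.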
Memory Hierarchy Exploitation
[Diagram: INPLACE scheme — logical vs. physical view of MATRIX 1.]
Memory Hierarchy Exploitation
[Diagram: MALLAT scheme — logical vs. physical view of MATRIX 1 and MATRIX 2.]
Memory Hierarchy Exploitation
[Diagram: INPLACE-MALLAT scheme — logical vs. physical view of MATRIX 1 and MATRIX 2.]
Memory Hierarchy Exploitation
- Execution time breakdown for several sizes, comparing both compilers.
- I, IM and M denote the inplace, inplace-mallat and mallat strategies, respectively.
- Each bar shows the execution time of each level and of the post-processing step.
Memory Hierarchy Exploitation
CONCLUSIONS
- The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above.
- Both approaches suffer a noticeable slowdown at the 1st level:
  - Larger working set
  - More complex access pattern
- The Inplace-Mallat version achieves the best execution time.
- The ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach.
SIMD Optimization
- Objective: extract the parallelism available in the Lifting Transform
- Different strategies:
  - Semi-automatic vectorization
  - Hand-coded vectorization
- Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout)
SIMD Optimization
- Automatic vectorization (Intel C/C++ Compiler) requires:
  - Inner loops
  - Simple array index manipulation
  - Iteration over contiguous memory locations
  - Avoiding global variables
  - Pointer disambiguation if pointers are employed
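A loop meeting those conditions might look like this (a hypothetical example, not taken from the paper): unit-stride accesses, trivial indexing, and `restrict`-qualified pointers so the compiler can rule out aliasing and emit SSE code on its own.

```c
/* Simple inner loop an auto-vectorizer can handle: contiguous memory,
 * simple indices, no globals, and disambiguated (restrict) pointers. */
void scale_add(float *restrict dst, const float *restrict src,
               float coeff, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += coeff * src[i];   /* unit stride over both arrays */
}
```

Without `restrict` (or equivalent compiler analysis), the possible overlap of `dst` and `src` would force the compiler to keep the loop scalar.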
SIMD Optimization
[Diagram: the lifting steps on the elements to be packed — the 1st step produces approximation (A) and detail (D) coefficients, followed by the 2nd step.]
SIMD Optimization
[Diagram: scalar vs. vectorial horizontal filtering with a column-major layout.]
SIMD Optimization
[Diagram: scalar vs. vectorial vertical filtering with a column-major layout.]
SIMD Optimization
Horizontal Vectorial Filtering (semi-automatic):

    for (j = 2, k = 1; j < (columns - 4); j += 2, k++) {
      #pragma vector aligned
      for (i = 0; i < rows; i++) {
        /* 1st operation */
        col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
        /* 2nd operation */
        col2[i] = col2[i] + beta * (col3[i] + col1[i]);
        /* 3rd operation */
        col1[i] = col1[i] + gama * (col2[i] + col0[i]);
        /* 4th operation */
        col0[i] = col0[i] + delt * (col1[i] + col_1[i]);
        /* Last step */
        detail[i] = col1[i] * phi_inv;
        aprox[i]  = col0[i] * phi;
      }
    }
SIMD Optimization
- Hand-coded vectorization:
  - SIMD parallelism has to be explicitly expressed
  - Intrinsics allow more flexibility
  - Makes it possible to also vectorize the vertical filtering
SIMD Optimization
Horizontal Vectorial Filtering (hand-coded):

    /* 1st operation */
    t2    = _mm_load_ps(col2);
    t4    = _mm_load_ps(col4);
    t3    = _mm_load_ps(col3);
    coeff = _mm_set_ps1(alfa);
    t4    = _mm_add_ps(t2, t4);
    t4    = _mm_mul_ps(t4, coeff);
    t3    = _mm_add_ps(t4, t3);
    _mm_store_ps(col3, t3);
    /* 2nd operation ... */
    /* 3rd operation ... */
    /* 4th operation ... */
    /* Last step */
    _mm_store_ps(detail, t1);
    _mm_store_ps(aprox, t0);

[Diagram: SIMD registers t2, t3 and t4 each holding four consecutive column elements.]
SIMD Optimization
- Execution time breakdown of the horizontal filtering (1024x1024-pixel image).
- I, IM and M denote the inplace, inplace-mallat and mallat approaches.
- S, A and H denote the scalar, automatic-vectorized and hand-coded-vectorized versions.
SIMD Optimization
CONCLUSIONS
- Speedup between 4 and 6, depending on the strategy. Such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in memory accesses.
- The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than those of the inplace version, since computation in the latter can only be vectorized at the first level.
- For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.
SIMD Optimization
- Execution time breakdown of the whole transform (1024x1024-pixel image).
- I, IM and M denote the inplace, inplace-mallat and mallat approaches.
- S, A and H denote the scalar, automatic-vectorized and hand-coded-vectorized versions.
SIMD Optimization
CONCLUSIONS
- Speedup between 1.5 and 2, depending on the strategy.
- For ICC, the shortest execution time is reached by the mallat version.
- When using GCC, both recursive-layout strategies obtain similar results.
SIMD Optimization
- Speedup achieved by the different vectorial codes over the inplace-mallat and inplace schemes.
- We show the hand-coded ICC, the automatic ICC and the hand-coded GCC versions.
SIMD Optimization
CONCLUSIONS
- The speedup grows with the image size.
- On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 over the inplace strategy.
- Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all image sizes.
Conclusions
- Scalar version: we have introduced a new scheme, called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme.
- SIMD exploitation: code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler: semi-automatic and intrinsic-based vectorization. Both provide similar results.
- Speedup: about 4-6 for the horizontal filtering (vectorization also reduces the pressure on the memory system); around 2 for the whole transform.
- The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.
- Most of our insights are compiler-independent.
Future Work
- 4D layout for a lifting-based scheme
- Measurements on other platforms:
  - Intel Itanium
  - Intel Pentium 4 with Hyper-Threading
- Parallelization using OpenMP (SMT)

For additional information: http://www.dacya.ucm.es/dchaver