Title: Optimizing Data Permutations for SIMD Devices
1Optimizing Data Permutations for SIMD Devices
- Gang Ren, Peng Wu (1), David Padua
- University of Illinois at Urbana-Champaign
- (1) IBM T.J. Watson Research Center
2SIMD Is Everywhere
3SIMD Compilation
- Vectorization
- Instruction Packing
- If Conversion
- Data Permutation Optimization
- Idiom Recognition
- Execution Mapping
- Type Promotion Elimination
4Strict SIMD Architecture (1)
- Most SIMD devices only support memory accesses on
contiguous and aligned memory sections
... = ... a[0:3:1] ...
=> vr1 = vec_load(a)
[Figure: memory holding elements a0 to a7, the register file, and the ALU]
5Strict SIMD Architecture (2)
- Additional permutation instructions are needed
for non-contiguous and/or misaligned memory
references
... = ... a[0:6:2] ...
=> vr1 = vec_load(a)
   vr2 = vec_load(a + 4)
   vr4 = vperm(vr1, vr2, <0,2,4,6>)
[Figure: memory holding elements a0 to a7, the register file, and the ALU]
Strict SIMD devices: all data reorganization must be accomplished with permutation instructions.
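The stride-2 access on this slide can be simulated in a few lines of Python. This is only a sketch, assuming a 4-element vector register: `vec_load` and `vperm` are modeled as plain list operations named after the slide's pseudo-instructions, not real intrinsics, and the alignment check and `VLEN` constant are illustrative assumptions.

```python
VLEN = 4  # assumed register width in elements (hypothetical)

def vec_load(mem, addr):
    """Model a strict SIMD load: only aligned, contiguous VLEN-element blocks."""
    if addr % VLEN != 0:
        raise ValueError("misaligned vector load")
    return mem[addr:addr + VLEN]

def vperm(v1, v2, pattern):
    """Select elements from the concatenation of two registers."""
    both = v1 + v2
    return [both[i] for i in pattern]

a = [f"a{i}" for i in range(8)]

# a[0:6:2] cannot be loaded directly; load both aligned blocks, then permute.
vr1 = vec_load(a, 0)
vr2 = vec_load(a, 4)
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])
print(vr4)  # ['a0', 'a2', 'a4', 'a6']
```

The two loads are both legal on a strict device; the extra `vperm` is the cost the rest of the talk tries to reduce.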
6Overview of the Optimization Framework
7Example: An 8-point FFT Program
Generating native permutation instructions from
Permute operations
8Overview of the Optimization Framework
- Use generic Permute to represent
- Non-unit strides
- Misalignment
- Other reorganizations
9Data Permutations on Vectors
- Permute(Xn, Pn): Xn is a vector and Pn is a permutation matrix
- Use Permute to represent all data reorganizations explicitly
b[0:3] = Permute(a[0:3], <2,1,0,3>)
[Figure: elements of a[0:3] mapped to b[0:3]]
Two stride-2 accesses at right-hand side
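A small Python model of the Permute operation may help make the notation concrete. It is a sketch that assumes the gather reading of a pattern (element i of the result is element `pattern[i]` of the input); the helper names `permute` and `permutation_matrix` are hypothetical and not taken from the talk.

```python
def permute(x, pattern):
    """Gather form (assumed): result[i] = x[pattern[i]]."""
    if sorted(pattern) != list(range(len(x))):
        raise ValueError("pattern must be a permutation of the indices of x")
    return [x[p] for p in pattern]

def permutation_matrix(pattern):
    """Row i has a single 1 in column pattern[i], so P times x equals permute(x, pattern)."""
    n = len(pattern)
    return [[1 if j == pattern[i] else 0 for j in range(n)] for i in range(n)]

a = ["a0", "a1", "a2", "a3"]
b = permute(a, [2, 1, 0, 3])
print(b)  # ['a2', 'a1', 'a0', 'a3']

for row in permutation_matrix([2, 1, 0, 3]):
    print(row)
```

The pattern vector and the permutation matrix carry the same information; the vector form is simply more compact to write on a slide.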
10Overview of the Optimization Framework
- Minimize Permute ops in a basic block
- Based on two rules of Permute
- An NP-complete problem
- Propagation-based algorithm
11Two Important Rules on Permutations
- Composition Rule
Permute(Permute(a[0:3:1], <1,0,3,2>), <2,1,0,3>)
=> Permute(a[0:3:1], <3,0,1,2>)
- Distributive Rule
Permute(a[0:3:1], <1,0,3,2>) + Permute(b[0:3:1], <1,0,3,2>)
=> Permute(a[0:3:1] + b[0:3:1], <1,0,3,2>)
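Both rules can be checked numerically on the slide's own patterns. The sketch below assumes the gather reading of a pattern (result[i] = x[pattern[i]]); `permute`, `compose`, and `elementwise` are hypothetical helper names introduced only for this illustration.

```python
import operator

def permute(x, pattern):
    """Gather form (assumed): result[i] = x[pattern[i]]."""
    return [x[p] for p in pattern]

def compose(inner, outer):
    """Pattern q such that permute(permute(x, inner), outer) == permute(x, q)."""
    return [inner[i] for i in outer]

def elementwise(op, x, y):
    """Apply a binary operator lane by lane."""
    return [op(p, q) for p, q in zip(x, y)]

a = [10, 20, 30, 40]
b = [1, 2, 3, 4]
p1 = [1, 0, 3, 2]
p2 = [2, 1, 0, 3]

# Composition rule: two nested permutations collapse into one.
nested = permute(permute(a, p1), p2)
merged = permute(a, compose(p1, p2))
print(compose(p1, p2), nested == merged)  # [3, 0, 1, 2] True

# Distributive rule: one permutation after the add replaces two before it.
lhs = elementwise(operator.add, permute(a, p1), permute(b, p1))
rhs = permute(elementwise(operator.add, a, b), p1)
print(lhs == rhs)  # True
```

The distributive rule only applies when both operands use the same pattern, which is why the example permutes `a` and `b` with the same `p1`.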
12Propagation-Based Optimization Algorithm
- Overview: Propagating permutation to permutation
- Step 1: Pick an unvisited permutation statement
- Step 2: Propagate the permutation from the definition to the uses
- Step 3: If a use is a permutation, go to (a); otherwise go to (b)
- (a) Merge it with the propagated permutation pattern. Go to Step 1
- (b) Propagate the permutation from right-hand side to left-hand side. Go to Step 2
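1. v1[0:3] = x[0:3] + x[4:7]
2. v1[4:7] = x[0:3] - x[4:7]
3. t0[0:7] = Permute(v1[0:7], P1)
4. t1[0:7] = T8[0:7] * v1[0:7]
5. v2[0:7] = Permute(t1[0:7], P2)
6. u1[0:7] = Permute(v2[0:7], P3)
7. u2[0:3] = u1[0:3] + u1[4:7]
8. u2[4:7] = u1[0:3] - u1[4:7]
9. v3[0:7] = Permute(u2[0:7], P4)
10. t2[0:7] = Permute(v3[0:7], P5)
11. t3[0:7] = T4_2[0:7] * t2[0:7]
12. u3[0:7] = Permute(t3[0:7], P6)
13. u4[0:3] = u3[0:3] + u3[4:7]
14. u4[4:7] = u3[0:3] - u3[4:7]
15. v4[0:7] = Permute(u4[0:7], P7)
16. y[0:7] = Permute(v4[0:7], P8)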
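1. v1[0:3] = x[0:3] + x[4:7]
2. v1[4:7] = x[0:3] - x[4:7]
3. t0[0:7] = Permute(v1[0:7], P1)
4. t1[0:7] = T8[0:7] * v1[0:7]
5. v2[0:7] = Permute(t1[0:7], P2)
6. u1[0:7] = Permute(t1[0:7], P3)
7. u2[0:3] = u1[0:3] + u1[4:7]
8. u2[4:7] = u1[0:3] - u1[4:7]
9. v3[0:7] = Permute(u2[0:7], P4)
10. t2[0:7] = Permute(v3[0:7], P5)
11. t3[0:7] = T4_2[0:7] * t2[0:7]
12. u3[0:7] = Permute(t3[0:7], P6)
13. u4[0:3] = u3[0:3] + u3[4:7]
14. u4[4:7] = u3[0:3] - u3[4:7]
15. v4[0:7] = Permute(u4[0:7], P7)
16. y[0:7] = Permute(v4[0:7], P8)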
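1. v1[0:3] = x[0:3] + x[4:7]
2. v1[4:7] = x[0:3] - x[4:7]
3. t0[0:7] = Permute(v1[0:7], P1)
4. t1[0:7] = T8[0:7] * v1[0:7]
5. v2[0:7] = Permute(t1[0:7], P2)
6. u1[0:7] = Permute(t1[0:7], P3)
7. u2[0:3] = u1[0:3] + u1[4:7]
8. u2[4:7] = u1[0:3] - u1[4:7]
9. v3[0:7] = Permute(u2[0:7], P4)
10. t2[0:7] = Permute(v3[0:7], P5)
11. t3[0:7] = T4_2[0:7] * u2[0:7]
12. u3[0:7] = Permute(t3[0:7], P6)
13. y[0:3] = u3[0:3] + u3[4:7]
14. y[4:7] = u3[0:3] - u3[4:7]
15. v4[0:7] = Permute(u4[0:7], P7)
16. y[0:7] = Permute(v4[0:7], P8)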
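The merge case of the algorithm (a Permute whose operand is itself defined by a Permute) can be sketched in Python. This is a deliberately reduced illustration, not the algorithm from the talk: it assumes every name is defined once, treats all statements as whole-vector operations, uses the gather reading of patterns, and handles only the merge case plus removal of Permute statements that become unused. The tuple-based statement format and the function names are hypothetical.

```python
def compose(inner, outer):
    """Pattern q with Permute(Permute(x, inner), outer) == Permute(x, q)."""
    return [inner[i] for i in outer]

def merge_permutes(stmts, live_out):
    """stmts: list of (dst, op, args); for op == "permute", args = (src, pattern)."""
    defs = {}      # dst -> (src, pattern) for Permute statements seen so far
    merged = []
    for dst, op, args in stmts:
        if op == "permute":
            src, pattern = args
            if src in defs:  # the use is a Permute of a Permute: merge them
                base, inner = defs[src]
                src, pattern = base, compose(inner, pattern)
            defs[dst] = (src, pattern)
            args = (src, pattern)
        merged.append((dst, op, args))

    # Walk backwards and drop Permute statements whose result is never read.
    used = set(live_out)
    kept = []
    for dst, op, args in reversed(merged):
        if op != "permute" or dst in used:
            kept.append((dst, op, args))
            used.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept

program = [
    ("t", "permute", ("a", [1, 0, 3, 2])),
    ("u", "permute", ("t", [2, 1, 0, 3])),
    ("y", "add", ("u", "b")),
]
result = merge_permutes(program, live_out={"y"})
for stmt in result:
    print(stmt)
```

On this three-statement program the two Permute statements collapse into a single one with pattern [3, 0, 1, 2], and the first statement disappears because nothing reads `t` any more.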
13Propagating Permutations to Partial Uses
b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>)
c[0:3] = b[0:3] + b[4:7]
[Figure: permutation patterns labeled Q, P, and R]
Not all permutations can be partitioned and
propagated to partial uses
- Improvements over partial use boundary
- Permutation decomposition
- Register-wise decomposition
- Shuffle instruction decomposition
- Permutation reshaping
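The slide's example can be worked through in Python to show when a permutation can be pushed past a partial use. The sketch assumes the gather reading of patterns and a 4-element register; `split_blockwise` is a hypothetical helper that succeeds only when each register-sized block of the pattern reads exclusively from its own block.

```python
def permute(x, pattern):
    """Gather form (assumed): result[i] = x[pattern[i]]."""
    return [x[p] for p in pattern]

def split_blockwise(pattern, width):
    """Return one local pattern per block, or None if any block reads outside itself."""
    blocks = []
    for start in range(0, len(pattern), width):
        chunk = pattern[start:start + width]
        if any(not (start <= p < start + width) for p in chunk):
            return None
        blocks.append([p - start for p in chunk])
    return blocks

a = list(range(100, 108))
pattern = [3, 2, 1, 0, 7, 6, 5, 4]

# Original form: permute all eight elements, then add the two halves.
b = permute(a, pattern)
c_original = [b[i] + b[i + 4] for i in range(4)]

# Propagated form: add the halves first, then apply one 4-element permutation.
lo, hi = split_blockwise(pattern, 4)
assert lo == hi  # the distributive rule needs the same pattern on both operands
c_propagated = permute([a[i] + a[i + 4] for i in range(4)], lo)

print(c_original == c_propagated)                      # True
print(split_blockwise([0, 4, 1, 5, 2, 6, 3, 7], 4))    # None
```

The second call shows a pattern that mixes elements across the two halves, so it cannot be partitioned this way.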
14Optimization Permutation Reshaping
- For permutations used in commutative operations
15Overview of the Optimization Framework
- Strip-mine Permute to vperm inst.
- Map vperm to native permutation inst.
16Generating Permutation Instructions (1)
a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>)
[Figure: mapping of elements from b[0:15] to a[0:15]]
17Generating Permutation Instructions (2)
a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>)
[Figure: mapping of elements from b[0:15] to a[0:15]]
- Two Steps
- Maximize empty slots when generating vperm instructions
- Fill empty slots with data elements that go to the same target
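To make the starting point of instruction generation concrete, the sketch below strip-mines the 16-element pattern from this slide into register-sized pieces. It assumes 4 elements per register and the gather reading of the pattern, and it only reports which source register and lane feed each target lane; it does not implement the slot-maximizing or slot-filling steps listed above. The function name `strip_mine` is hypothetical.

```python
def strip_mine(pattern, width):
    """For each target register, list the (source register, source lane) of every lane."""
    targets = []
    for start in range(0, len(pattern), width):
        chunk = pattern[start:start + width]
        targets.append([divmod(p, width) for p in chunk])
    return targets

pattern = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
pieces = strip_mine(pattern, 4)

for t, lanes in enumerate(pieces):
    sources = sorted({reg for reg, _ in lanes})
    print(f"target register {t}: lanes {lanes}, source registers {sources}")
```

Every target register here draws from all four source registers, so no single two-input vperm can produce it; that is why the choice and ordering of intermediate permutations matters.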
18Experiment Setups
- Two SIMD devices: VMX (AltiVec) and SSE2
- Tested applications
- Group I: Applications with relatively simple permutation patterns
- C-Saxpy: Complex version of saxpy (y = alpha * x + y)
- R-Color, C-Dot, R-FIR,
- Group II: Applications with complicated permutation patterns
- FFT: Fast Fourier transform programs generated by the SPIRAL system
- WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
- Bitonic sorting: One of the fastest sorting networks
- Group III: Reorganization-only applications
- Matrix transpose
- Bit-reversal reordering
                   VMX                  SSE2
Processor          1.8 GHz PowerPC G5   2.0 GHz Pentium 4
Main Memory        2048 MB              1024 MB
Operating System   Mac OS v10.3         Linux v2.4
Compiler           xlc v6.0             icc v9.0
Compiler Options   -O3 -qaltivec        -fast (-O3)
19Static Evaluation of Permutation Inst.
                   VMX                        SSE2
Program     Size   Base   Opt   Reduced (%)   Base   Opt   Reduced (%)
fft.4       16     96     24    75.0          96     24    75.0
fft.5       32     208    48    76.9          208    48    76.9
fft.6       64     352    96    72.7          352    96    72.7
wht.4       16     48     12    75.0          48     12    75.0
wht.5       32     96     24    75.0          96     24    75.0
wht.6       64     192    48    75.0          192    48    75.0
bitonic.4   16     52     34    34.6          56     34    39.3
bitonic.5   32     136    92    32.3          144    92    36.1
bitonic.6   64     336    232   31.0          352    232   34.1
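The last column of each group can be reproduced from the two counts next to it. The snippet below is a small check, assuming the column reports the percentage of baseline permutation instructions that the optimization removes; the function name is hypothetical and the three rows are copied from the VMX columns of the table.

```python
def reduction_percent(base, opt):
    """Share of baseline permutation instructions eliminated, in percent."""
    return round(100.0 * (base - opt) / base, 1)

rows = {"fft.4": (96, 24), "wht.6": (192, 48), "bitonic.4": (52, 34)}
for name, (base, opt) in rows.items():
    print(name, reduction_percent(base, opt))
```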
20Run-time Performance of FFT and Bitonic Sorting
21Overall Speedups
22Related Work
- Optimizing permutation instructions introduced by misalignment
- A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI '04
- P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO '05
- Efficient permutation instruction generation
- A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES '05
- M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP '04
- D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI '06
- Similar idea, different applications
- A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI '05
- S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL '93
- G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP '95
23Conclusion
- Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation
- A unified framework is proposed to optimize data permutations
- Putting all forms of data permutations into a unified representation
- Propagating permutations across statements and merging them together
- Generating efficient permutation instructions natively supported by devices
- Experiments were conducted on different applications
- Up to 77% of permutation instructions are eliminated
- Average performance improves by 48% on VMX and 68% on SSE2
- Near-peak overall speedups are achieved on some applications
24Thank You!