Optimizing Data Permutations for SIMD Devices


About This Presentation

Title:

Optimizing Data Permutations for SIMD Devices

Description:

Optimizing Data Permutations for SIMD Devices. Gang Ren, Peng Wu, David Padua ... Shuffle instruction decomposition. Permutation reshaping. 14. PLDI 06 ...

Transcript and Presenter's Notes

Title: Optimizing Data Permutations for SIMD Devices

1
Optimizing Data Permutations for SIMD Devices

Gang Ren, Peng Wu (1), David Padua
University of Illinois at Urbana-Champaign
(1) IBM T.J. Watson Research Center

2
SIMD Is Everywhere
3
SIMD Compilation

Vectorization
Instruction Packing
If Conversion

Data Permutation Optimization
Idiom Recognition
Execution Mapping
Type Promotion Elimination

4
Strict SIMD Architecture (1)

Most SIMD devices only support memory accesses on
contiguous and aligned memory sections

... = ... a[0:3:1] ...
=> vr1 = vec_load(a)

[Figure: Memory holding a0 ... a7, Register File, ALU]
5
Strict SIMD Architecture (2)

Additional permutation instructions are needed
for non-contiguous and/or misaligned memory
references

... = ... a[0:6:2] ...
=> vr1 = vec_load(a)
   vr2 = vec_load(&a[4])
   vr4 = vperm(vr1, vr2, <0,2,4,6>)

[Figure: Memory holding a0 ... a7, Register File, ALU]

Strict SIMD devices: All data reorganization must
be accomplished with permutation instructions.
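
The strided access on this slide can be emulated in a few lines. This is a minimal sketch, assuming a hypothetical 4-lane device: `vec_load` and `vperm` follow the names on the slide, but their Python signatures here (a memory list plus an element address, and a pattern indexing the concatenation of the two inputs) are illustrative assumptions, not a real instruction set API.

```python
VL = 4  # assumed vector length, in elements


def vec_load(mem, addr):
    """Aligned, contiguous load: addr must be a multiple of VL."""
    if addr % VL != 0:
        raise ValueError("misaligned vector load")
    return mem[addr:addr + VL]


def vperm(x, y, pattern):
    """Two-input permute: pattern[i] indexes the concatenation of x and y."""
    src = x + y
    return [src[i] for i in pattern]


a = [f"a{i}" for i in range(8)]       # memory holding a0 .. a7
vr1 = vec_load(a, 0)                  # a0 a1 a2 a3
vr2 = vec_load(a, 4)                  # a4 a5 a6 a7
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])   # the stride-2 elements
print(vr4)                            # ['a0', 'a2', 'a4', 'a6']
```

Two aligned loads plus one permute replace the single strided load that the device cannot issue directly.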
6
Overview of the Optimization Framework
7
Example: An 8-point FFT Program

[Figure: 8-point FFT program, with parts labeled 0 to 3]
Generating native permutation instructions from
Permute operations
8
Overview of the Optimization Framework

Use generic Permute to represent
Non-unit strides
Misalignment
Other reorganizations

9
Data Permutations on Vectors

Permute(X_n, P_n): X_n is a vector and P_n is a
permutation matrix
Use Permute to represent all data reorganizations
explicitly

b[0:3] = Permute(a[0:3], <2,1,0,3>)
[Figure: a[0:3] and b[0:3]]
Two stride-2 accesses at right-hand side
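
A small sketch of what Permute computes, written two ways: as an index gather and as a multiplication by a permutation matrix. The gather convention (`result[i] = x[p[i]]`) is an assumption; it is the reading under which the composition example on the "Two Important Rules" slide works out. The helper names are hypothetical.

```python
def permute(x, p):
    """Gather semantics (assumed): result[i] = x[p[i]]."""
    return [x[i] for i in p]


def permutation_matrix(p):
    """Row i has a single 1 in column p[i]."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]


def matvec(m, x):
    """Plain matrix-vector product."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in m]


a = [10, 11, 12, 13]
p = [2, 1, 0, 3]
b = permute(a, p)
print(b)  # [12, 11, 10, 13]

# The matrix form gives the same vector.
assert matvec(permutation_matrix(p), a) == b

# A stride-2 access a[0:6:2] is the same gather with a shorter pattern.
a8 = list(range(100, 108))
assert permute(a8, [0, 2, 4, 6]) == a8[0:7:2]
```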
10
Overview of the Optimization Framework

Minimize Permute ops in a basic block
- Based on two rules of Permute
An NP-complete problem
Propagation-based algorithm

11
Two Important Rules on Permutations

Composition Rule
Permute(Permute(a[0:3:1], <1,0,3,2>), <2,1,0,3>)
= Permute(a[0:3:1], <3,0,1,2>)

Distributive Rule
Permute(a[0:3:1], <1,0,3,2>) + Permute(b[0:3:1], <1,0,3,2>)
= Permute(a[0:3:1] + b[0:3:1], <1,0,3,2>)
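
Both rules can be checked numerically. The sketch below assumes gather semantics for Permute (`result[i] = x[p[i]]`) and uses element-wise addition as the operation in the distributive rule; `compose` and `add` are hypothetical helper names.

```python
def permute(x, p):
    """Gather semantics (assumed): result[i] = x[p[i]]."""
    return [x[i] for i in p]


def compose(inner, outer):
    """Pattern r with permute(permute(x, inner), outer) == permute(x, r)."""
    return [inner[i] for i in outer]


def add(u, v):
    """Element-wise addition of two equal-length vectors."""
    return [s + t for s, t in zip(u, v)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Composition rule: two nested permutations collapse into one.
merged = compose([1, 0, 3, 2], [2, 1, 0, 3])
print(merged)  # [3, 0, 1, 2]
assert permute(permute(a, [1, 0, 3, 2]), [2, 1, 0, 3]) == permute(a, merged)

# Distributive rule: permuting both operands equals permuting the result.
p = [1, 0, 3, 2]
assert add(permute(a, p), permute(b, p)) == permute(add(a, b), p)
```

The composition rule removes one permutation outright; the distributive rule moves permutations past element-wise operations so that composition can apply further down.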
12
Propagation-Based Optimization Algorithm

Overview: Propagating permutation to permutation
Step 1: Pick an unvisited permutation statement
Step 2: Propagate the permutation from the
definition to the uses
Step 3: If a use is a permutation, go to (a);
otherwise go to (b)
(a) Merge it with the propagated permutation pattern.
Go to Step 1
(b) Propagate the permutation from the right-hand side to
the left-hand side. Go to Step 2

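1. v1[0:3] = x[0:3] + x[4:7]
2. v1[4:7] = x[0:3] - x[4:7]
3. t0[0:7] = Permute(v1[0:7], P1)
4. t1[0:7] = T8[0:7] * v1[0:7]
5. v2[0:7] = Permute(t1[0:7], P2)
6. u1[0:7] = Permute(v2[0:7], P3)
7. u2[0:3] = u1[0:3] + u1[4:7]
8. u2[4:7] = u1[0:3] - u1[4:7]
9. v3[0:7] = Permute(u2[0:7], P4)
10. t2[0:7] = Permute(v3[0:7], P5)
11. t3[0:7] = T4_2[0:7] * t2[0:7]
12. u3[0:7] = Permute(t3[0:7], P6)
13. u4[0:3] = u3[0:3] + u3[4:7]
14. u4[4:7] = u3[0:3] - u3[4:7]
15. v4[0:7] = Permute(u4[0:7], P7)
16. y[0:7] = Permute(v4[0:7], P8)

1. v1[0:3] = x[0:3] + x[4:7]
2. v1[4:7] = x[0:3] - x[4:7]
3. t0[0:7] = Permute(v1[0:7], P1)
4. t1[0:7] = T8[0:7] * v1[0:7]
5. v2[0:7] = Permute(t1[0:7], P2)
6. u1[0:7] = Permute(t1[0:7], P3)
7. u2[0:3] = u1[0:3] + u1[4:7]
8. u2[4:7] = u1[0:3] - u1[4:7]
9. v3[0:7] = Permute(u2[0:7], P4)
10. t2[0:7] = Permute(v3[0:7], P5)
11. t3[0:7] = T4_2[0:7] * t2[0:7]
12. u3[0:7] = Permute(t3[0:7], P6)
13. u4[0:3] = u3[0:3] + u3[4:7]
14. u4[4:7] = u3[0:3] - u3[4:7]
15. v4[0:7] = Permute(u4[0:7], P7)
16. y[0:7] = Permute(v4[0:7], P8)

1. v1[0:3] = x[0:3] + x[4:7]
2. v1[4:7] = x[0:3] - x[4:7]
3. t0[0:7] = Permute(v1[0:7], P1)
4. t1[0:7] = T8[0:7] * v1[0:7]
5. v2[0:7] = Permute(t1[0:7], P2)
6. u1[0:7] = Permute(t1[0:7], P3)
7. u2[0:3] = u1[0:3] + u1[4:7]
8. u2[4:7] = u1[0:3] - u1[4:7]
9. v3[0:7] = Permute(u2[0:7], P4)
10. t2[0:7] = Permute(v3[0:7], P5)
11. t3[0:7] = T4_2[0:7] * u2[0:7]
12. u3[0:7] = Permute(t3[0:7], P6)
13. y[0:3] = u3[0:3] + u3[4:7]
14. y[4:7] = u3[0:3] - u3[4:7]
15. v4[0:7] = Permute(u4[0:7], P7)
16. y[0:7] = Permute(v4[0:7], P8)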
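
To make the idea of propagating and merging concrete, here is a deliberately simplified sketch. It is not the algorithm from the talk: it handles only whole-vector operands in a straight-line block where every variable is assigned once, and it defers each permutation until a use forces it. The tuple-based statement format, `optimize`, and `run` are all hypothetical.

```python
def compose(inner, outer):
    """Pattern r with Permute(Permute(x, inner), outer) == Permute(x, r)."""
    return tuple(inner[i] for i in outer)


def optimize(block, live_out):
    """Defer permutations, merging them (composition rule) and pushing them
    through element-wise ops whose operands share a pattern (distributive rule)."""
    pending = {}  # var -> (base, pattern), meaning var == Permute(base, pattern)
    out = []

    def materialize(v):
        # Emit the deferred permutation for v, if there is one.
        if v in pending:
            base, pat = pending.pop(v)
            out.append((v, "permute", (base, pat)))
        return v

    for dest, op, args in block:
        if op == "permute":
            src, pat = args
            base, prev = pending.get(src, (src, tuple(range(len(pat)))))
            pending[dest] = (base, compose(prev, tuple(pat)))
        else:  # element-wise binary op: "add" or "sub"
            x, y = args
            px, py = pending.get(x), pending.get(y)
            if px and py and px[1] == py[1]:
                tmp = dest + "'"  # fresh name for the unpermuted result
                out.append((tmp, op, (px[0], py[0])))
                pending[dest] = (tmp, px[1])
            else:
                out.append((dest, op, (materialize(x), materialize(y))))
    for v in live_out:
        materialize(v)
    return out


def run(block, env):
    """Tiny interpreter used to check that both versions agree."""
    env = dict(env)
    for dest, op, args in block:
        if op == "permute":
            env[dest] = [env[args[0]][i] for i in args[1]]
        else:
            sign = 1 if op == "add" else -1
            env[dest] = [u + sign * v for u, v in zip(env[args[0]], env[args[1]])]
    return env


block = [
    ("t1", "permute", ("a", (1, 0, 3, 2))),
    ("t2", "permute", ("b", (1, 0, 3, 2))),
    ("s", "add", ("t1", "t2")),
    ("y", "permute", ("s", (2, 1, 0, 3))),
]
opt = optimize(block, ["y"])
inputs = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
assert run(block, inputs)["y"] == run(opt, inputs)["y"]
print(sum(s[1] == "permute" for s in block), "->", sum(s[1] == "permute" for s in opt))
```

On this toy block the three permutations collapse into one. Partial uses such as `u1[0:3] + u1[4:7]` are outside the scope of this sketch.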
13
Propagating Permutations to Partial Uses
b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>)
c[0:3] = b[0:3] + b[4:7]
Not all permutations can be partitioned and
propagated to partial uses

Improvements over partial use boundary
- Permutation decomposition
Register-wise decomposition
Shuffle instruction decomposition
Permutation reshaping
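
One way to test whether a permutation can be split at a partial-use boundary is to check that each register-sized chunk of the pattern draws from a single source register. The sketch below assumes 4-element registers and gather semantics; `split_registerwise` is a hypothetical helper, and this check is only an illustration of register-wise decomposition, not the talk's full method.

```python
def permute(x, p):
    """Gather semantics (assumed): result[i] = x[p[i]]."""
    return [x[i] for i in p]


def split_registerwise(pattern, vl):
    """Return one (source_register, local_pattern) pair per output register,
    or None if some output register mixes elements of several source registers."""
    parts = []
    for start in range(0, len(pattern), vl):
        chunk = pattern[start:start + vl]
        regs = {i // vl for i in chunk}
        if len(regs) != 1:
            return None
        parts.append((regs.pop(), [i % vl for i in chunk]))
    return parts


P = [3, 2, 1, 0, 7, 6, 5, 4]
print(split_registerwise(P, 4))  # [(0, [3, 2, 1, 0]), (1, [3, 2, 1, 0])]
print(split_registerwise([0, 2, 4, 6, 1, 3, 5, 7], 4))  # None

# Both halves use the same local pattern, so the permutation can move past
# the addition: c = b[0:3] + b[4:7] = Permute(a[0:3] + a[4:7], <3,2,1,0>).
a = list(range(8))
b = permute(a, P)
c = [u + v for u, v in zip(b[:4], b[4:])]
c_propagated = permute([u + v for u, v in zip(a[:4], a[4:])], [3, 2, 1, 0])
assert c == c_propagated
```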

14
Optimization: Permutation Reshaping

For permutations used in commutative operations

15
Overview of the Optimization Framework

Strip-mine Permute to vperm inst.
Map vperm to native permutation inst.

16
Generating Permutation Instructions (1)
a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>)
[Figure: b[0:15] and a[0:15]]
17
Generating Permutation Instructions (2)
a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>)
[Figure: b[0:15] and a[0:15]]

Two Steps
Maximize empty slots when generating vperm
instructions
Fill empty slots with data elements that go to
the same target
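
For reference, the 16-element Permute on this slide is a 4x4 transpose, and it can be carried out with eight two-input permutes on 4-element registers. The sketch below uses the well-known two-stage merge sequence; it is not claimed to be the sequence the talk's generator produces. `vperm` is modeled as a gather over the concatenation of its two inputs, and `MERGE_HI`, `MERGE_LO`, and `transpose_4x4` are hypothetical names.

```python
VL = 4  # assumed register width, in elements


def vperm(x, y, pattern):
    """Two-input permute: pattern[i] indexes the concatenation of x and y."""
    src = list(x) + list(y)
    return [src[i] for i in pattern]


MERGE_HI = [0, 4, 1, 5]  # interleave the first halves of x and y
MERGE_LO = [2, 6, 3, 7]  # interleave the second halves of x and y


def transpose_4x4(b):
    r0, r1, r2, r3 = (b[i:i + VL] for i in range(0, 16, VL))
    # Stage 1: every output slot is filled, none is wasted.
    m0 = vperm(r0, r2, MERGE_HI)  # b0  b8  b1  b9
    m1 = vperm(r0, r2, MERGE_LO)  # b2  b10 b3  b11
    m2 = vperm(r1, r3, MERGE_HI)  # b4  b12 b5  b13
    m3 = vperm(r1, r3, MERGE_LO)  # b6  b14 b7  b15
    # Stage 2: combine the intermediates into the target registers.
    a0 = vperm(m0, m2, MERGE_HI)  # b0  b4  b8  b12
    a1 = vperm(m0, m2, MERGE_LO)  # b1  b5  b9  b13
    a2 = vperm(m1, m3, MERGE_HI)  # b2  b6  b10 b14
    a3 = vperm(m1, m3, MERGE_LO)  # b3  b7  b11 b15
    return a0 + a1 + a2 + a3


pattern = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
b = list(range(16))
assert transpose_4x4(b) == [b[i] for i in pattern]
```

Each target register needs data from all four source registers, so a naive mapping would leave slots empty in every intermediate vperm; the merge sequence keeps every slot of every intermediate useful.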

18
Experiment Setups

Two SIMD devices: VMX (AltiVec) and SSE2
Tested applications
Group I: Applications with relatively simple
permutation patterns
C-Saxpy: Complex version of saxpy (y = alpha*x + y)
R-Color, C-Dot, R-FIR,
Group II: Applications with complicated
permutation patterns
FFT: Fast Fourier transform programs generated by
the SPIRAL system
WHT: Walsh-Hadamard transform routines generated
by the SPIRAL system
Bitonic sorting: One of the fastest sorting
networks
Group III: Reorganization-only applications
Matrix transpose
Bit-reversal reordering
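
The two Group III kernels are pure permutations, so each can be written as a single Permute pattern. A minimal sketch, assuming gather semantics and row-major storage; the function names are hypothetical.

```python
def transpose_pattern(rows, cols):
    """Pattern p such that out[i] = in[p[i]] transposes a row-major rows x cols matrix."""
    return [(i % rows) * cols + i // rows for i in range(rows * cols)]


def bit_reversal_pattern(log2n):
    """Pattern that maps index i to the index with its log2n bits reversed."""
    return [int(format(i, f"0{log2n}b")[::-1], 2) for i in range(1 << log2n)]


print(transpose_pattern(4, 4))
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
print(bit_reversal_pattern(3))
# [0, 4, 2, 6, 1, 5, 3, 7]
```

The 4x4 transpose pattern is the same one used in the "Generating Permutation Instructions" slides.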

                    VMX                   SSE2
Processor           1.8 GHz PowerPC G5    2.0 GHz Pentium 4
Main Memory         2048 MB               1024 MB
Operating System    Mac OS v10.3          Linux v2.4
Compiler            xlc v6.0              icc v9.0
Compiler Options    -O3 -qaltivec         -fast (-O3)
19
Static Evaluation of Permutation Inst.
                  VMX                        SSE2
Program    Size   Base  Opt  Reduced (%)     Base  Opt  Reduced (%)
fft.4      16     96    24   75.0            96    24   75.0
fft.5      32     208   48   76.9            208   48   76.9
fft.6      64     352   96   72.7            352   96   72.7
wht.4      16     48    12   75.0            48    12   75.0
wht.5      32     96    24   75.0            96    24   75.0
wht.6      64     192   48   75.0            192   48   75.0
bitonic.4  16     52    34   34.6            56    34   39.3
bitonic.5  32     136   92   32.3            144   92   36.1
bitonic.6  64     336   232  31.0            352   232  34.1
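
The Reduced columns appear to be the relative drop in permutation instruction count, (Base - Opt) / Base, in percent. That reading is an assumption, and the sketch below checks it on a few VMX rows; the dictionary and function names are hypothetical.

```python
# (Base, Opt) permutation instruction counts on VMX, copied from the table.
vmx_counts = {
    "fft.4": (96, 24),
    "fft.6": (352, 96),
    "wht.5": (96, 24),
    "bitonic.4": (52, 34),
    "bitonic.6": (336, 232),
}


def reduced_percent(base, opt):
    """Share of permutation instructions removed, in percent."""
    return 100.0 * (base - opt) / base


for name, (base, opt) in vmx_counts.items():
    print(f"{name:10s} {reduced_percent(base, opt):5.1f}")
```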
20
Run-time Performance of FFT and Bitonic Sorting
21
Overall Speedups
22
Related Work

Optimizing permutation instructions introduced by misalignment
A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI 04
P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO 05

Efficient permutation instruction generation
A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES 05
M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP 04
D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI 06

Similar idea, different applications
A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI 05
S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL 93
G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP 95

23
Conclusion

Reducing the overhead introduced by permutation
instructions is a performance-critical problem
for SIMD compilation
A unified framework is proposed to optimize data
permutations
Putting all forms of data permutations into a
unified representation
Propagating permutations across statements and
merging them together
Generating efficient permutation instructions
natively supported by devices
Experiments were conducted on different
applications
Up to 77% of permutation instructions are eliminated
Average performance improves by 48% on VMX and
68% on SSE2
Near-peak overall speedups are achieved on some
applications