Optimizing Data Permutations for SIMD Devices

Gang Ren, Peng Wu, David Padua. PLDI 06.

1
Optimizing Data Permutations for SIMD Devices
  • Gang Ren, Peng Wu¹, David Padua
  • University of Illinois at Urbana-Champaign
  • ¹ IBM T.J. Watson Research Center

2
SIMD Is Everywhere
3
SIMD Compilation
  • Vectorization
  • Instruction Packing
  • If Conversion
  • Data Permutation Optimization
  • Idiom Recognition
  • Execution Mapping
  • Type Promotion Elimination

4
Strict SIMD Architecture (1)
  • Most SIMD devices only support memory accesses on
    contiguous and aligned memory sections

Example: ... = ... a[0:3:1] ...
  => vr1 = vec_load(a)
[Figure: memory holding a0 to a7, the register file, and the ALU]
5
Strict SIMD Architecture (2)
  • Additional permutation instructions are needed
    for non-contiguous and/or misaligned memory
    references

Example: ... = ... a[0:6:2] ...
  => vr1 = vec_load(a); vr2 = vec_load(a+4)
     vr4 = vperm(vr1, vr2, <0,2,4,6>)
[Figure: memory holding a0 to a7, the register file, and the ALU]
Strict SIMD devices: all data reorganization must
be accomplished with permutation instructions.
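The two examples above can be modeled in a few lines of Python. This is only a sketch: the names vec_load and vperm come from the slide, but the 4-element register width, the list-based memory, and the exact selection semantics of vperm are assumptions made for illustration.

```python
VLEN = 4  # elements per vector register (assumed for illustration)

def vec_load(mem, addr):
    """Load VLEN contiguous elements; a strict SIMD device only accepts aligned addresses."""
    if addr % VLEN != 0:
        raise ValueError("misaligned vector load")
    return mem[addr:addr + VLEN]

def vperm(v1, v2, pattern):
    """Select elements from the concatenation of two registers (assumed semantics)."""
    both = v1 + v2
    return [both[i] for i in pattern]

a = [f"a{i}" for i in range(8)]   # memory holding a0 .. a7

vr1 = vec_load(a, 0)
vr2 = vec_load(a, 4)

# Stride-2 section a[0:6:2]: two aligned loads plus one permutation.
strided = vperm(vr1, vr2, [0, 2, 4, 6])

# Misaligned section a[1:4]: the same two loads with a different pattern.
misaligned = vperm(vr1, vr2, [1, 2, 3, 4])
```

Both the strided and the misaligned access reduce to aligned loads followed by a permutation, which is why every kind of reorganization ends up as a permutation instruction on such devices.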
6
Overview of the Optimization Framework
7
Example: An 8-Point FFT Program
[Figure: the 8-point FFT example, with panels labeled 0 to 3]
Generating native permutation instructions from
Permute operations
8
Overview of the Optimization Framework
  • Use generic Permute to represent:
    - Non-unit strides
    - Misalignment
    - Other reorganizations

9
Data Permutations on Vectors
  • Permute(X_n, P_n): X_n is a vector and P_n is a
    permutation matrix
  • Use Permute to represent all data reorganizations
    explicitly

[Figure: vectors a[0:3] and b[0:3]]
b[0:3] = Permute(a[0:3], <2,1,0,3>)
Two stride-2 accesses at right-hand side
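A small sketch of the Permute operation from this slide, in both index-vector and permutation-matrix form. The gather convention (result[i] = x[pattern[i]]) is an assumption; the slide does not state it, and for the self-inverse pattern <2,1,0,3> either convention gives the same result. The helper names are hypothetical.

```python
def permute(x, pattern):
    """Gather form (assumed convention): result[i] = x[pattern[i]]."""
    return [x[i] for i in pattern]

def pattern_to_matrix(pattern):
    """Build the 0/1 permutation matrix P with P[i][pattern[i]] = 1."""
    n = len(pattern)
    return [[1 if j == pattern[i] else 0 for j in range(n)] for i in range(n)]

def mat_vec(m, x):
    """Plain matrix-vector product, used to check the matrix form."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

a = [10, 20, 30, 40]
b = permute(a, [2, 1, 0, 3])          # b[0:3] = Permute(a[0:3], <2,1,0,3>)
P = pattern_to_matrix([2, 1, 0, 3])
b_from_matrix = mat_vec(P, a)         # same result via the matrix form
```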
10
Overview of the Optimization Framework
  • Minimize Permute ops in a basic block
    - Based on two rules of Permute
  • An NP-complete problem
  • Propagation-based algorithm

11
Two Important Rules on Permutations
  • Composition Rule
    Permute(Permute(a[0:3:1], <1,0,3,2>), <2,1,0,3>)
      => Permute(a[0:3:1], <3,0,1,2>)
  • Distributive Rule
    Permute(a[0:3:1], <1,0,3,2>) + Permute(b[0:3:1], <1,0,3,2>)
      => Permute(a[0:3:1] + b[0:3:1], <1,0,3,2>)
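Both rules can be checked on concrete data. The sketch below assumes the gather convention (result[i] = x[pattern[i]]), under which the two nested patterns on this slide compose to <3,0,1,2>; the function names are hypothetical, and element-wise addition stands in for any element-wise operation.

```python
def permute(x, pattern):
    """Gather form (assumed convention): result[i] = x[pattern[i]]."""
    return [x[i] for i in pattern]

def compose(inner, outer):
    """Pattern r such that permute(permute(x, inner), outer) == permute(x, r)."""
    return [inner[i] for i in outer]

def vadd(x, y):
    """Element-wise addition of two equal-length vectors."""
    return [s + t for s, t in zip(x, y)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
p, q = [1, 0, 3, 2], [2, 1, 0, 3]

# Composition rule: two nested permutations collapse into one.
merged = compose(p, q)
nested = permute(permute(a, p), q)
single = permute(a, merged)

# Distributive rule: one permutation after the add replaces two before it.
before = vadd(permute(a, p), permute(b, p))
after = permute(vadd(a, b), p)
```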
12
Propagation-Based Optimization Algorithm
  • Overview: propagate each permutation until it meets
    another permutation
  • Step 1: Pick up an unvisited permutation statement
  • Step 2: Propagate the permutation from the
    definition to the uses
  • Step 3: If a use is a permutation, go to (a),
    otherwise go to (b)
    (a) Merge it with the propagated permutation pattern.
        Go to Step 1
    (b) Propagate the permutation from the right-hand side
        to the left-hand side. Go to Step 2

Example (three successive snapshots of the same 16-statement program)

Snapshot 1:
   1. v1[0:3] = x[0:3] + x[4:7]
   2. v1[4:7] = x[0:3] - x[4:7]
   3. t0[0:7] = Permute(v1[0:7], P1)
   4. t1[0:7] = T8[0:7] * v1[0:7]
   5. v2[0:7] = Permute(t1[0:7], P2)
   6. u1[0:7] = Permute(v2[0:7], P3)
   7. u2[0:3] = u1[0:3] + u1[4:7]
   8. u2[4:7] = u1[0:3] - u1[4:7]
   9. v3[0:7] = Permute(u2[0:7], P4)
  10. t2[0:7] = Permute(v3[0:7], P5)
  11. t3[0:7] = T4_2[0:7] * t2[0:7]
  12. u3[0:7] = Permute(t3[0:7], P6)
  13. u4[0:3] = u3[0:3] + u3[4:7]
  14. u4[4:7] = u3[0:3] - u3[4:7]
  15. v4[0:7] = Permute(u4[0:7], P7)
  16. y[0:7]  = Permute(v4[0:7], P8)

Snapshot 2 (differs from snapshot 1 in statement 6):
   6. u1[0:7] = Permute(t1[0:7], P3)

Snapshot 3 (differs from snapshot 2 in statements 11, 13, and 14):
  11. t3[0:7] = T4_2[0:7] * u2[0:7]
  13. y[0:3]  = u3[0:3] + u3[4:7]
  14. y[4:7]  = u3[0:3] - u3[4:7]
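The propagation steps on this slide can be sketched on a tiny straight-line program. This is a simplified illustration, not the authors' algorithm: it makes a single forward pass instead of using a worklist, handles only the case where both operands of an element-wise statement carry the same pattern, and assumes the gather convention for patterns. The statement encoding, the "_raw" temporary naming, and all function names are hypothetical.

```python
def compose(inner, outer):
    """Pattern r with Permute(Permute(x, inner), outer) == Permute(x, r)."""
    return [inner[i] for i in outer]

def eliminate_dead(stmts, live_out):
    """Drop statements whose results are never needed."""
    needed, kept = set(live_out), []
    for dest, op, args in reversed(stmts):
        if dest in needed:
            kept.append((dest, op, args))
            needed.update(args[:1] if op == "permute" else args)
    return kept[::-1]

def propagate(stmts, live_out):
    """stmts: list of (dest, op, args); op is 'permute' with args (src, pattern),
    or 'add'/'sub' with args (lhs, rhs). Every dest is assigned once."""
    defs, out = {}, []

    def emit(dest, op, args):
        defs[dest] = (op, args)
        out.append((dest, op, args))

    for dest, op, args in stmts:
        if op == "permute":
            src, pat = args
            if src in defs and defs[src][0] == "permute":
                # Step 3(a): the use is a permutation, so merge the two patterns.
                inner_src, inner_pat = defs[src][1]
                src, pat = inner_src, compose(inner_pat, pat)
            emit(dest, "permute", (src, pat))
        else:
            da, db = defs.get(args[0]), defs.get(args[1])
            if (da and db and da[0] == "permute" and db[0] == "permute"
                    and da[1][1] == db[1][1]):
                # Step 3(b): move the shared pattern from the operands to the result.
                tmp = dest + "_raw"
                emit(tmp, op, (da[1][0], db[1][0]))
                emit(dest, "permute", (tmp, da[1][1]))
            else:
                emit(dest, op, args)
    return eliminate_dead(out, live_out)

def run(stmts, env):
    """Reference interpreter used to check that the rewrite preserves results."""
    env = dict(env)
    for dest, op, args in stmts:
        if op == "permute":
            env[dest] = [env[args[0]][i] for i in args[1]]
        else:
            sign = 1 if op == "add" else -1
            env[dest] = [x + sign * y for x, y in zip(env[args[0]], env[args[1]])]
    return env

prog = [
    ("pa", "permute", ("a", [1, 0, 3, 2])),
    ("pb", "permute", ("b", [1, 0, 3, 2])),
    ("s", "add", ("pa", "pb")),
    ("y", "permute", ("s", [2, 1, 0, 3])),
]
optimized = propagate(prog, live_out=["y"])
```

On this input the three Permute statements collapse into one: the shared pattern moves past the add, merges with the final permutation, and the now-unused statements are removed.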
13
Propagating Permutations to Partial Uses
b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>)
c[0:3] = b[0:3] + b[4:7]
[Figure: permutation patterns labeled P, Q, and R]
Not all permutations can be partitioned and
propagated to partial uses
  • Improvements over partial use boundary
    - Permutation decomposition
      Register-wise decomposition
      Shuffle instruction decomposition
    - Permutation reshaping
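The example on this slide works because its pattern never moves an element across the boundary between b[0:3] and b[4:7]. The sketch below tests that condition and checks the resulting rewrite on data. It is an illustration only: the helper split_blockwise is hypothetical, the gather convention is assumed, and the second pattern tested is a made-up counterexample rather than one from the slides.

```python
def permute(x, pattern):
    """Gather form (assumed convention): result[i] = x[pattern[i]]."""
    return [x[i] for i in pattern]

def split_blockwise(pattern, width):
    """Split a pattern into one local pattern per block of `width` elements.

    Returns a list of (source_block, local_pattern), or None when some block
    draws from more than one source block and so cannot be partitioned.
    """
    parts = []
    for start in range(0, len(pattern), width):
        block = pattern[start:start + width]
        sources = {i // width for i in block}
        if len(sources) != 1:
            return None
        parts.append((sources.pop(), [i % width for i in block]))
    return parts

parts = split_blockwise([3, 2, 1, 0, 7, 6, 5, 4], 4)
not_splittable = split_blockwise([0, 2, 4, 6, 1, 3, 5, 7], 4)

a = [1, 2, 3, 4, 10, 20, 30, 40]

# Original form: permute all of a, then add the two halves of b.
b = permute(a, [3, 2, 1, 0, 7, 6, 5, 4])
c_original = [x + y for x, y in zip(b[0:4], b[4:8])]

# Propagated form: add the halves of a first, then apply one 4-element permutation.
half_sum = [x + y for x, y in zip(a[0:4], a[4:8])]
c_propagated = permute(half_sum, parts[0][1])
```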

14
Optimization: Permutation Reshaping
  • For permutations used in commutative operations

15
Overview of the Optimization Framework
  • Strip-mine Permute to vperm inst.
  • Map vperm to native permutation inst.

16
Generating Permutation Instructions (1)
a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,
10,14,3,7,11,15>)
[Figure: element mapping from b[0:15] to a[0:15]]
17
Generating Permutation Instructions (2)
a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,
10,14,3,7,11,15>)
[Figure: element mapping from b[0:15] to a[0:15]]
  • Two Steps
    - Maximize empty slots when generating vperm
      instructions
    - Fill empty slots with data elements that go to
      the same target
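To make "empty slots" concrete, the sketch below strip-mines the 16-element pattern from this slide into two-input vperm operations in the most naive way, one output register at a time. It is a baseline for illustration, not the two-step method the slide describes. The 4-element register width, the register names (b0..b3, a0..a3, temporaries), and the use of None for an empty slot are all assumptions.

```python
W = 4  # elements per vector register (assumed)

def vperm(v1, v2, sel):
    """Two-input permute; None in sel marks an empty (don't-care) slot."""
    both = v1 + v2
    return [None if s is None else both[s] for s in sel]

def strip_mine(pattern):
    """Naively lower one wide Permute into (dest, src1, src2, sel) vperm ops."""
    ops = []
    for out in range(len(pattern) // W):
        want = pattern[out * W:(out + 1) * W]   # source element for each lane
        regs = sorted({e // W for e in want})
        first, rest = regs[0], regs[1:] or [regs[0]]
        acc, placed = f"b{first}", None         # None: acc is still a raw source
        for n, r in enumerate(rest):
            sel = []
            for lane, elem in enumerate(want):
                if elem // W == r:
                    sel.append(W + elem % W)    # take from the new source
                elif placed is None and elem // W == first:
                    sel.append(elem % W)        # take from the raw first source
                elif placed is not None and lane in placed:
                    sel.append(lane)            # already in place in acc
                else:
                    sel.append(None)            # empty slot
            dest = f"a{out}" if n == len(rest) - 1 else f"t{out}_{n}"
            ops.append((dest, acc, f"b{r}", sel))
            placed = {lane for lane, s in enumerate(sel) if s is not None}
            acc = dest
    return ops

def execute(ops, regs):
    """Run the vperm ops on a dictionary of named registers."""
    regs = dict(regs)
    for dest, s1, s2, sel in ops:
        regs[dest] = vperm(regs[s1], regs[s2], sel)
    return regs

pattern = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
ops = strip_mine(pattern)
empty_slots = sum(sel.count(None) for _, _, _, sel in ops)

b = list(range(16))
regs = execute(ops, {f"b{k}": b[k * W:(k + 1) * W] for k in range(4)})
a = regs["a0"] + regs["a1"] + regs["a2"] + regs["a3"]
```

The naive lowering needs 12 vperm operations for this pattern and leaves 12 slots empty along the way. A hand-written 4x4 transpose is commonly done with 8 two-input shuffles, which is the kind of saving that filling empty slots with useful elements makes possible.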

18
Experiment Setups
  • Two SIMD devices: VMX (AltiVec) and SSE2
  • Tested applications
    - Group I: applications with relatively simple
      permutation patterns
      C-Saxpy: complex version of saxpy (y = alpha*x + y)
      R-Color, C-Dot, R-FIR
    - Group II: applications with complicated
      permutation patterns
      FFT: fast Fourier transform programs generated by
      the SPIRAL system
      WHT: Walsh-Hadamard transform routines generated
      by the SPIRAL system
      Bitonic sorting: one of the fastest sorting
      networks
    - Group III: reorganization-only applications
      Matrix transpose
      Bit-reversal reordering

Processor          1.8 GHz PowerPC G5   2.0 GHz Pentium 4
Main Memory        2048 MB              1024 MB
Operating System   Mac OS v10.3         Linux v2.4
Compiler           xlc v6.0             icc v9.0
Compiler Options   -O3 -qaltivec        -fast (-O3)
19
Static Evaluation of Permutation Inst.
                        VMX                        SSE2
Program    Size   Base  Opt  Reduced (%)    Base  Opt  Reduced (%)
fft.4        16     96   24     75.0          96   24     75.0
fft.5        32    208   48     76.9         208   48     76.9
fft.6        64    352   96     72.7         352   96     72.7
wht.4        16     48   12     75.0          48   12     75.0
wht.5        32     96   24     75.0          96   24     75.0
wht.6        64    192   48     75.0         192   48     75.0
bitonic.4    16     52   34     34.6          56   34     39.3
bitonic.5    32    136   92     32.3         144   92     36.1
bitonic.6    64    336  232     31.0         352  232     34.1
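The Reduced column appears to be the percentage of permutation instructions removed, (Base - Opt) / Base. That formula is an assumption inferred from the numbers rather than something the slide states; the sketch below recomputes a few rows with it.

```python
def reduced_percent(base, opt):
    """Percentage of permutation instructions eliminated (assumed definition)."""
    return round(100.0 * (base - opt) / base, 1)

# (Base, Opt) pairs copied from the table above.
rows = {
    "fft.4": (96, 24),
    "fft.6": (352, 96),
    "wht.6": (192, 48),
    "bitonic.4 (VMX)": (52, 34),
    "bitonic.4 (SSE2)": (56, 34),
}
reduced = {name: reduced_percent(base, opt) for name, (base, opt) in rows.items()}
```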
20
Run-time Performance of FFT and Bitonic Sorting
21
Overall Speedups
22
Related Work
  • Optimizing permutation instructions introduced by
    misalignment
    - A. Eichenberger, P. Wu, K. O'Brien, Vectorization
      for SIMD architectures with alignment
      constraints, PLDI 04
    - P. Wu, A. Eichenberger, A. Wang, Efficient SIMD
      Code Generation for Runtime Alignment and Length
      Conversion, CGO 05
  • Efficient permutation instruction generation
    - A. Kudriavtsev, P. Kogge, Generation of
      permutations for SIMD processors, LCTES 05
    - M. Narayanan, K. Yelick, Generating permutation
      instructions from a high-level description, MSP
      04
    - D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization
      of interleaved data for SIMD, PLDI 06
  • Similar idea, different applications
    - A. Solar-Lezama, R. Rabbah, R. Bodik, K.
      Ebcioglu, Programming by sketching for
      bit-streaming programs, PLDI 05
    - S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng,
      Automatic array alignment in data-parallel
      programs, POPL 93
    - G. Hwang, J. K. Lee, D. Ju, An array operation
      synthesis scheme to optimize FORTRAN 90 programs,
      PPOPP 95

23
Conclusion
  • Reducing the overhead introduced by permutation
    instructions is a performance-critical problem
    for SIMD compilation
  • A unified framework is proposed to optimize data
    permutations
    - Putting all forms of data permutations into a
      unified representation
    - Propagating permutations across statements and
      merging them together
    - Generating efficient permutation instructions
      natively supported by devices
  • Experiments were conducted on different
    applications
    - Up to 77% of permutation instructions are eliminated
    - Average performance improves by 48% on VMX and 68%
      on SSE2
    - Near-peak overall speedups are achieved on some
      applications

24
Thank You!
  • June 2006