Title: Auto-Vectorization of Interleaved Data for SIMD
1. Auto-Vectorization of Interleaved Data for SIMD
- Dorit Nuzman, Ira Rosen, Ayal Zaks
- IBM Haifa Research Lab, HiPEAC member, Israel
- dorit, ira, zaks_at_il.ibm.com
2. Main Message
- Most SIMD targets support access to packed data in memory (SIMpD), but there are important applications which access non-consecutive data
- We show how a classic compiler loop-based auto-SIMDizing optimization was augmented to support accesses to strided, interleaved data
- This can serve as a first step to combine traditional loop-based vectorization with (if-converted) basic-block vectorization (SLP)
3. SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple packed Data
[Figure: vectorization replaces the scalar operations OP(a), OP(b), OP(c), OP(d) with one vector operation VOP(a, b, c, d) on vector register VR1. Data in memory: a b c d e f g h i j k l m n o p.]
4. SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple packed Data
[Figure: OP(a), OP(b), OP(c), OP(d) become VOP(a, b, c, d) on vector register VR1; the operands a, b, c, d are consecutive in memory (data in memory: a through p).]
5. SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple packed Data
[Figure: OP(a), OP(f), OP(k), OP(p) become VOP(a, f, k, p) on vector register VR5; the operands a, f, k, p are not consecutive in memory (data in memory: a through p).]
6. OP(a) OP(f) OP(k) OP(p)
mask ← ...
loop:
  (VR1,...,VR4) ← vload (mem)
  VR5 ← pack ((VR1,...,VR4), mask)
  VOP(VR5)
[Figure: OP(a), OP(f), OP(k), OP(p) become VOP(a, f, k, p) on vector register VR5; the operands a, f, k, p are picked out of the data in memory (a through p).]
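The load-then-pack sequence on this slide can be sketched in plain C. This is only an illustration, not the authors' code: `vreg`, `vload`, `pack` and `gather_afkp` are hypothetical names, a vector register is emulated by a small array with an assumed VF of 4, and the mask simply lists the positions of a, f, k, p among the 16 loaded elements.

```c
#include <assert.h>

enum { VF = 4 };                      /* elements per vector register (assumed) */
typedef struct { int e[VF]; } vreg;   /* stand-in for a hardware vector register */

/* Consecutive (packed) vector load: the only memory access the target offers. */
static vreg vload(const int *mem) {
    vreg r;
    for (int i = 0; i < VF; i++)
        r.e[i] = mem[i];
    return r;
}

/* Element j of the result is taken from position mask[j] of the
   concatenation regs[0] | regs[1] | ... | regs[n-1]. */
static vreg pack(const vreg *regs, int n, const int *mask) {
    vreg r;
    for (int j = 0; j < VF; j++) {
        assert(mask[j] >= 0 && mask[j] < n * VF);
        r.e[j] = regs[mask[j] / VF].e[mask[j] % VF];
    }
    return r;
}

/* Gather a, f, k, p (positions 0, 5, 10, 15) from 16 consecutive elements. */
static vreg gather_afkp(const int *mem) {
    static const int mask[VF] = {0, 5, 10, 15};
    vreg vr[4];                               /* plays the role of VR1..VR4 */
    for (int i = 0; i < 4; i++)
        vr[i] = vload(mem + i * VF);
    return pack(vr, 4, mask);                 /* plays the role of VR5 */
}
```

Calling `gather_afkp` on an array of 16 elements returns the elements at positions 0, 5, 10 and 15, at the cost of four packed loads plus the pack step.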
7. Application accessing non-consecutive data: Viterbi decoder (before)
[Figure: dataflow graph with memory accesses of stride 1, stride 2, stride 2 and stride 4, and the operations -, -, << 1, << 11, << 1, << 11, max, max, sel, sel.]
8. Application accessing non-consecutive data: Viterbi decoder (after)
[Figure: dataflow graph with memory accesses of stride 1, stride 2, stride 2 and stride 4, and the operations -, -, << 1, << 11, << 1, << 11, max, max, sel, sel.]
9. Application accessing non-consecutive data: Audio downmix (before)
[Figure: dataflow graph with memory accesses of stride 4 and stride 2, and four >> 1 operations.]
10. Application accessing non-consecutive data: Audio downmix (after)
[Figure: dataflow graph with memory accesses of stride 4 and stride 2, and four >> 1 operations.]
11. Basic unpacking and packing operations for strided access
- Use two pairs of inverse operations widely supported on SIMD platforms
  - extract_even, extract_odd
  - interleave_high, interleave_low
- Use them recursively to support strided accesses with power-of-2 strides
- Support several data types
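A minimal sketch of the four operations, and of their recursive use for a stride of 4, may help here. It emulates vector registers with arrays (VF = 4 is assumed), and the exact element order produced by the "high" and "low" variants is an assumption made for illustration; real targets differ in naming and endianness. `deinterleave4` is a hypothetical helper name.

```c
#include <assert.h>

enum { VF = 4 };                      /* assumed vectorization factor */
typedef struct { int e[VF]; } vreg;   /* emulated vector register */

/* Even-indexed elements of the concatenation a | b. */
static vreg extract_even(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[i] = a.e[2 * i];
        r.e[VF / 2 + i] = b.e[2 * i];
    }
    return r;
}

/* Odd-indexed elements of the concatenation a | b. */
static vreg extract_odd(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[i] = a.e[2 * i + 1];
        r.e[VF / 2 + i] = b.e[2 * i + 1];
    }
    return r;
}

/* Interleave the first halves of a and b: a0 b0 a1 b1. */
static vreg interleave_high(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[2 * i] = a.e[i];
        r.e[2 * i + 1] = b.e[i];
    }
    return r;
}

/* Interleave the second halves of a and b: a2 b2 a3 b3. */
static vreg interleave_low(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[2 * i] = a.e[VF / 2 + i];
        r.e[2 * i + 1] = b.e[VF / 2 + i];
    }
    return r;
}

/* Stride 4: two levels of extracts turn four loaded registers
   (m0..m15) into the four strided vectors, one per start offset. */
static void deinterleave4(const vreg v[4], vreg out[4]) {
    vreg e01 = extract_even(v[0], v[1]);   /* m0 m2 m4 m6     */
    vreg e23 = extract_even(v[2], v[3]);   /* m8 m10 m12 m14  */
    vreg o01 = extract_odd(v[0], v[1]);    /* m1 m3 m5 m7     */
    vreg o23 = extract_odd(v[2], v[3]);    /* m9 m11 m13 m15  */
    out[0] = extract_even(e01, e23);       /* m0 m4 m8 m12    */
    out[1] = extract_even(o01, o23);       /* m1 m5 m9 m13    */
    out[2] = extract_odd(e01, e23);        /* m2 m6 m10 m14   */
    out[3] = extract_odd(o01, o23);        /* m3 m7 m11 m15   */
}
```

Under these definitions each pair undoes the other: interleaving the even and odd extracts of two registers gives back the original two registers.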
12. Classic loop-based auto-vectorization
- vect_analyze_loop (loop)
  - if (!1_analyze_counted_single_bb_loop (loop)) FAIL
  - if (!2_determine_VF (loop)) FAIL
  - if (!3_analyze_memory_access_patterns (loop)) FAIL
  - if (!4_analyze_scalar_dependence_cycles (loop)) FAIL
  - if (!5_analyze_data_dependence_distances (loop)) FAIL
  - if (!6_analyze_consecutive_data_accesses (loop)) FAIL
  - if (!7_analyze_data_alignment (loop)) FAIL
  - if (!8_analyze_vops_exist_forall_ops (loop)) FAIL
  - SUCCEED
- vect_transform_loop (loop)
  - FOR_ALL_STMTS_IN_LOOP(loop, stmt)
    - replace_OP_by_VOP (stmt)
  - decrease_loop_bound_by_factor_VF (loop)
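To make the effect of the transform step concrete, here is a hand-written before/after pair for a unit-stride loop. It is a sketch only: the function names are made up, VF = 4 is assumed, and the inner loop stands in for a single vector operation.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };   /* assumed vectorization factor */

/* Original scalar loop: one OP per iteration, consecutive accesses. */
static void add_scalar(const int *b, const int *c, int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* After the transformation: each OP is replaced by a VOP on VF
   elements and the loop bound is decreased by a factor of VF.
   n is assumed to be a multiple of VF (no epilogue loop here). */
static void add_vectorized(const int *b, const int *c, int *a, size_t n) {
    assert(n % VF == 0);
    for (size_t iv = 0; iv < n / VF; iv++) {
        for (int k = 0; k < VF; k++)            /* stands for one VOP */
            a[iv * VF + k] = b[iv * VF + k] + c[iv * VF + k];
    }
}
```

Both functions produce the same result whenever n is a multiple of VF.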
13. Vectorizing non-unit stride access
- One VOP accessing data with stride d requires loading of d*VF elements
- Several, otherwise unrelated VOPs can share these loaded elements
  - If they all share the same stride d
  - If they all start close to each other
  - Up to d VOPs; if less, there are gaps
- Recognize this spatial reuse potential to eliminate redundant load and extract operations
- Better make the decision earlier than later; without such elimination
  - vectorizing the loop may be non-beneficial (for loads)
  - vectorizing the loop may be prohibited (for stores)
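The sharing described above can be illustrated with a stride-2 loop in which two unrelated operations read adjacent elements. The sketch below is hand-written under assumptions (VF = 4, made-up kernel names, arrays standing in for vector registers); it only shows that one chunk of d*VF loaded elements can feed both operations.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };   /* assumed vectorization factor */

/* Scalar loop: two unrelated operations read in[] with stride d = 2,
   starting at adjacent addresses (offsets 0 and 1). */
static void scalar_kernel(const int *in, int *x, int *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        x[i] = in[2 * i] + 1;
        y[i] = in[2 * i + 1] << 1;
    }
}

/* Hand-vectorized equivalent; n must be a multiple of VF.
   Each vector iteration loads d*VF = 8 consecutive elements once,
   and both operations take their operands from that shared chunk. */
static void shared_load_kernel(const int *in, int *x, int *y, size_t n) {
    assert(n % VF == 0);
    for (size_t i = 0; i < n; i += VF) {
        int chunk[2 * VF];
        for (int k = 0; k < 2 * VF; k++)      /* d = 2 packed vector loads */
            chunk[k] = in[2 * i + k];
        for (int k = 0; k < VF; k++) {
            x[i + k] = chunk[2 * k] + 1;      /* even elements, then VOP */
            y[i + k] = chunk[2 * k + 1] << 1; /* odd elements, then VOP  */
        }
    }
}
```

If the two operations were vectorized separately, each would load the same 8 elements on its own; grouping them halves the number of loads in this example.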
14. Augmenting the vectorizer, step 1/3: build spatial groups
- 5_analyze_data_dependence_distances already traversed all pairs of load/stores to analyze their dependence distance
    if (cross_iteration_dependence_distance < (VF-1)*stride)
      if (read,write) or (write,read) or (write,write)
        ok = dep_resolve()
      endif
    endif
- Augment this traversal to look for spatial reuse between pairs of independent loads and stores, building spatial groups
    if ok and (intra_iteration_address_distance < stride*u)
      if (read,read) or (write,write)
        ok = analyze_and_build_spatial_groups()
      endif
    endif
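The grouping condition can be sketched as a pairwise traversal in C. Everything here is hypothetical: the `access_t` record, its fields and the function names are invented for illustration, distances are counted in elements (that is, u is assumed to be the element size), and each access is compared only against group leaders, which is a simplification of what a real implementation would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical description of one strided memory access in the loop body. */
typedef struct {
    bool is_write;   /* store (true) or load (false)            */
    int  base;       /* id of the accessed array (illustrative) */
    int  stride;     /* in elements                             */
    int  offset;     /* element offset within one iteration     */
    int  group;      /* output: index of the group leader       */
} access_t;

/* (read,read) or (write,write), same array and stride, and an
   intra-iteration distance smaller than the stride. */
static bool can_share(const access_t *a, const access_t *b) {
    return a->is_write == b->is_write
        && a->base == b->base
        && a->stride == b->stride
        && abs(a->offset - b->offset) < a->stride;
}

/* Each access joins the first earlier leader it can share with,
   otherwise it starts a new (singleton) group. Returns the group count. */
static int build_spatial_groups(access_t *acc, int n) {
    int ngroups = 0;
    for (int i = 0; i < n; i++) {
        acc[i].group = i;
        for (int j = 0; j < i; j++) {
            if (acc[j].group == j && can_share(&acc[j], &acc[i])) {
                acc[i].group = j;
                break;
            }
        }
        if (acc[i].group == i)
            ngroups++;
    }
    return ngroups;
}
```

For example, loads of `a[4i]`, `a[4i+1]` and `a[4i+3]` end up in one group with a gap at position 2, while a load and a store never share a group.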
15. Augmenting the vectorizer, step 2/3: check spatial groups
- 6_analyze_consecutive_data_accesses already traversed each individual load/store to analyze its access pattern
- Augment this traversal by
  - Allowing non-consecutive accesses
  - Building singleton groups for strided ungrouped load/stores
  - Checking for gaps and profitability of spatial groups
16. Augmenting the vectorizer, step 3/3: transformation
- vect_transform_stmt generates vector code per scalar OP
- Augment this by considering
  - If OP is a load/store in first position of a spatial group
    - generate d load/stores
    - handle their alignment according to the starting address
    - generate d log d extract/interleaves
  - If OP belongs to a spatial group, connect it to the appropriate extract/interleave according to its position
- Unused extract/interleaves are discarded by subsequent DCE
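For the store side, the generated code for a full group with d = 4 might look like the sketch below: d log d = 8 interleaves followed by d = 4 packed stores. This is an emulation under assumptions (VF = 4, arrays as vector registers, an assumed element order for the "high" and "low" variants), and `store_group_stride4` is a hypothetical name.

```c
#include <assert.h>

enum { VF = 4 };                      /* assumed vectorization factor */
typedef struct { int e[VF]; } vreg;   /* emulated vector register */

/* Interleave the first halves of a and b: a0 b0 a1 b1. */
static vreg interleave_high(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[2 * i] = a.e[i];
        r.e[2 * i + 1] = b.e[i];
    }
    return r;
}

/* Interleave the second halves of a and b: a2 b2 a3 b3. */
static vreg interleave_low(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < VF / 2; i++) {
        r.e[2 * i] = a.e[VF / 2 + i];
        r.e[2 * i + 1] = b.e[VF / 2 + i];
    }
    return r;
}

/* Packed vector store. */
static void vstore(int *mem, vreg r) {
    for (int i = 0; i < VF; i++)
        mem[i] = r.e[i];
}

/* Vector form of the scalar group
     out[4*i] = p0[i]; out[4*i+1] = p1[i]; out[4*i+2] = p2[i]; out[4*i+3] = p3[i];
   for VF consecutive iterations: 8 interleaves, then 4 stores. */
static void store_group_stride4(int *out, const vreg p[4]) {
    vreg h02 = interleave_high(p[0], p[2]);
    vreg l02 = interleave_low(p[0], p[2]);
    vreg h13 = interleave_high(p[1], p[3]);
    vreg l13 = interleave_low(p[1], p[3]);
    vstore(out + 0 * VF, interleave_high(h02, h13));
    vstore(out + 1 * VF, interleave_low(h02, h13));
    vstore(out + 2 * VF, interleave_high(l02, l13));
    vstore(out + 3 * VF, interleave_low(l02, l13));
}
```

The four scalar stores of the group map onto the four inputs `p[0]` through `p[3]` according to their position in the group.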
17. Performance: qualitative, VF/(1 + log d)
d    VF=4  VF=8  VF=16
1    4     8     16
2    2     4     8
4    1.3   2.6   5.3
8    1     2     4
16   0.8   1.6   3.2
32   0.6   1.2   2.4
- Vectorized code has d load/stores and (d log d) extract/interleaves
- Scalar code has d*VF loads/stores
- Performance improvement factor in number of load/store/extract/interleave operations is
  - VF/(1 + log d)
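The ratio can be checked with a few lines of C that count operations for one full group, as a sketch only. The function names are made up, d is assumed to be a power of two, and log is taken base 2.

```c
#include <assert.h>

/* log2 of a power of two, computed without <math.h>. */
static int log2_pow2(int d) {
    assert(d > 0 && (d & (d - 1)) == 0);
    int k = 0;
    while (d > 1) {
        d >>= 1;
        k++;
    }
    return k;
}

/* Scalar code: d*VF loads/stores for VF iterations of a full group. */
static int scalar_ops(int vf, int d) {
    return d * vf;
}

/* Vector code: d load/stores plus d*log2(d) extract/interleaves. */
static int vector_ops(int d) {
    return d + d * log2_pow2(d);
}

/* Ratio of the two counts, equal to VF / (1 + log2 d). */
static double improvement_factor(int vf, int d) {
    return (double)scalar_ops(vf, d) / vector_ops(d);
}
```

For instance, VF = 8 and d = 2 give 16 scalar operations against 4 vector operations, a factor of 4; VF = 4 and d = 8 give 32 against 32, a factor of 1.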
18. Performance: empirical (on PowerPC 970 with Altivec)
- Stride of 2 always provides speedups
- Strides of 8, 16 suffer from increased code size, which turns off loop unrolling
- Stride of 32 suffers from high register pressure (d+1)
- If non-permute operations exist: speedups for all strides if VFm8
19. Performance: stride of 8 with gaps
- Position of gaps affects the number of extracts (interleaves) needed
- Improvement is observed even for a single strided access (VF=16 with arithmetic operations)
20. Performance: kernels
- 4 groups: VF=4, 8, 16, 16-with-gaps
- Strides prefix each kernel
- Slowdown when doing only memory operations at VF=4, d=8
21. Future direction: towards loop-aware SLP
- When building spatial groups, we consider distinct operations accessing adjacent/close addresses; this is the first step of building SLP chains
- SLP looks for VF fully interleaved accesses, without gaps; may require earlier loop unrolling
- Next step is to consider the operations that use a spatial group of loads; if they're isomorphic, try to postpone the extracts
- Analogous to handling alignment using zero-shift, lazy-shift, eager-shift policies
22. Conclusions
- Existing SIMD targets supporting SIMpD can provide improved performance for important power-of-2 strided applications; don't be afraid of d > 2
- Existing compiler loop-based auto-vectorization can be augmented efficiently to handle such strided accesses
- This can serve as a first step combining traditional loop-based vectorization with (if-converted) basic-block vectorization (SLP)
- This area of work is fertile
  - consider details (d, gaps, positions, VF, non-mem ops) for it not to be futile!
23. Questions?