Auto-Vectorization of Interleaved Data for SIMD

PLDI 2006. Auto-Vectorization of Interleaved Data for SIMD. Dorit Nuzman, Ira ... We show how a classic compiler loop-based auto-SIMDizing optimization was ...


1
Auto-Vectorization of Interleaved Data for SIMD

Dorit Nuzman, Ira Rosen, Ayal Zaks
IBM Haifa Research Lab (HiPEAC member), Israel
dorit, ira, zaks_at_il.ibm.com

2
Main Message

Most SIMD targets support access to packed data in memory (SIMpD), but important applications access non-consecutive data.
We show how a classic loop-based auto-SIMDizing compiler optimization was augmented to support accesses to strided, interleaved data.
This can serve as a first step toward combining traditional loop-based vectorization with (if-converted) basic-block vectorization (SLP).

3
SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple Packed Data
[Figure: vectorization replaces the scalar operations OP(a), OP(b), OP(c), OP(d) with one vector operation VOP(a, b, c, d) on vector register VR1. Data in memory: a b c d e f g h i j k l m n o p.]
4
SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple Packed Data
[Figure: the elements a, b, c, d are consecutive in memory (a b c d e f g h i j k l m n o p) and are loaded together into vector register VR1, so OP(a), OP(b), OP(c), OP(d) become VOP(a, b, c, d).]
5
SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple Packed Data
[Figure: the scalar operations OP(a), OP(f), OP(k), OP(p) call for VOP(a, f, k, p) on vector register VR5, but a, f, k, p are not consecutive in memory (a b c d e f g h i j k l m n o p).]
6
OP(a) OP(f) OP(k) OP(p) become VOP( a, f, k, p )
mask ← ...
loop:
  (VR1, ..., VR4) ← vload (mem)
  VR5 ← pack (VR1, ..., VR4), mask
  VOP(VR5)
[Figure: a, f, k, p are gathered from memory (a b c d e f g h i j k l m n o p) into vector register VR5.]
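The load-and-pack sequence on this slide can be modeled in a few lines. This is only an illustrative sketch: vectors are Python lists, and `vload`, `pack`, and the (register, lane) form of `mask` are hypothetical stand-ins for target instructions, not an API from the talk.

```python
# Illustrative model only: vload and pack are hypothetical stand-ins
# for target-specific load and permute instructions.
VF = 4  # vectorization factor: elements per vector register

def vload(mem, start):
    """Load VF consecutive elements starting at index `start`."""
    return mem[start:start + VF]

def pack(regs, mask):
    """Build one vector; mask gives a (register, lane) pair per output lane."""
    return [regs[r][lane] for r, lane in mask]

mem = list("abcdefghijklmnop")
# Output lane i takes lane i of register i, which selects a, f, k, p.
mask = [(i, i) for i in range(VF)]
regs = [vload(mem, i * VF) for i in range(VF)]  # plays the role of VR1..VR4
vr5 = pack(regs, mask)                          # plays the role of VR5
print(vr5)
```

Four packed loads plus a permute replace four scalar loads here, which is why sharing the loaded vectors among several accesses matters for profitability.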
7
Application accessing non-consecutive data: Viterbi decoder (before)
[Figure: data-flow graph with memory accesses of stride 1, stride 2, stride 2, and stride 4; operations shown: -, << 1, << 11, max, sel.]
8
Application accessing non-consecutive data: Viterbi decoder (after)
[Figure: data-flow graph with memory accesses of stride 1, stride 2, stride 2, and stride 4; operations shown: -, << 1, << 11, max, sel.]
9
Application accessing non-consecutive data: audio downmix (before)
[Figure: data-flow graph with memory accesses of stride 4 and stride 2; operations shown: four >> 1 shifts.]
10
Application accessing non-consecutive data: audio downmix (after)
[Figure: data-flow graph with memory accesses of stride 4 and stride 2; operations shown: four >> 1 shifts.]
11
Basic unpacking and packing operations for strided access

Use two pairs of inverse operations widely supported on SIMD platforms:
  extract_even, extract_odd
  interleave_high, interleave_low
Use them recursively to support strided accesses with power-of-2 strides.
Support several data types.
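A minimal sketch of the four operations on list-modeled vectors may make the "inverse pairs" point concrete. The exact lane conventions are assumptions made for illustration: here the extracts operate on the concatenation of two vectors, and "high" is taken to mean the first half of the lanes; a given target may define these differently.

```python
def extract_even(v1, v2):
    """Even-indexed elements of the concatenation v1 ++ v2."""
    return (v1 + v2)[0::2]

def extract_odd(v1, v2):
    """Odd-indexed elements of the concatenation v1 ++ v2."""
    return (v1 + v2)[1::2]

def interleave_high(v1, v2):
    """Interleave the first halves of v1 and v2 (assumed meaning of 'high')."""
    half = len(v1) // 2
    return [x for pair in zip(v1[:half], v2[:half]) for x in pair]

def interleave_low(v1, v2):
    """Interleave the second halves of v1 and v2 (assumed meaning of 'low')."""
    half = len(v1) // 2
    return [x for pair in zip(v1[half:], v2[half:]) for x in pair]

# Stride-2 example: interleaved (re, im) pairs held in two vectors of 4 lanes.
a = ["re0", "im0", "re1", "im1"]
b = ["re2", "im2", "re3", "im3"]
re = extract_even(a, b)   # unpack for loads
im = extract_odd(a, b)
# The interleaves undo the extracts, which is what stores need.
restored = (interleave_high(re, im), interleave_low(re, im))
print(re, im, restored == (a, b))
```

For a stride of 2, one extract pair per two loaded vectors is enough; larger power-of-2 strides apply the same pair repeatedly.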

12
Classic loop-based auto-vectorization

vect_analyze_loop (loop):
  if (!1_analyze_counted_single_bb_loop (loop)) FAIL
  if (!2_determine_VF (loop)) FAIL
  if (!3_analyze_memory_access_patterns (loop)) FAIL
  if (!4_analyze_scalar_dependence_cycles (loop)) FAIL
  if (!5_analyze_data_dependence_distances (loop)) FAIL
  if (!6_analyze_consecutive_data_accesses (loop)) FAIL
  if (!7_analyze_data_alignment (loop)) FAIL
  if (!8_analyze_vops_exist_forall_ops (loop)) FAIL
  SUCCEED

vect_transform_loop (loop):
  FOR_ALL_STMTS_IN_LOOP (loop, stmt)
    replace_OP_by_VOP (stmt)
  decrease_loop_bound_by_factor_VF (loop)
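The effect of the transformation phase on a simple unit-stride loop can be sketched as follows. The loop body (an element-wise add), the helper `vadd`, and the assumption that the trip count is a multiple of VF are all illustrative choices, not details taken from the slides.

```python
VF = 4  # vectorization factor

def scalar_loop(b, c):
    a = [0] * len(b)
    for i in range(len(b)):      # n iterations, one scalar OP each
        a[i] = b[i] + c[i]
    return a

def vadd(x, y):
    """Stand-in for a single vector add over VF lanes."""
    return [p + q for p, q in zip(x, y)]

def vectorized_loop(b, c):
    n = len(b)
    assert n % VF == 0           # assumption: no epilogue loop is needed
    a = [0] * n
    for j in range(n // VF):     # loop bound decreased by a factor of VF
        lo = j * VF
        a[lo:lo + VF] = vadd(b[lo:lo + VF], c[lo:lo + VF])  # OP replaced by VOP
    return a

b = list(range(8))
c = list(range(8, 16))
print(scalar_loop(b, c) == vectorized_loop(b, c))
```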

13
Vectorizing non-unit stride access

One VOP accessing data with stride d requires loading d*VF elements.
Several otherwise unrelated VOPs can share these loaded elements:
  if they all share the same stride d
  if they all start close to each other
  up to d VOPs; if fewer, there are gaps
Recognize this spatial reuse potential to eliminate redundant load and extract operations.
It is better to make this decision early rather than late; without such elimination:
  vectorizing the loop may not be beneficial (for loads)
  vectorizing the loop may be prohibited (for stores)

14
Augmenting the vectorizer, step 1/3: build spatial groups

5_analyze_data_dependence_distances already traversed all pairs of loads/stores to analyze their dependence distance:
if (cross_iteration_dependence_distance < (VF-1)*stride)
  if (read,write) or (write,read) or (write,write)
    ok = dep_resolve()
  endif
endif
Augment this traversal to look for spatial reuse between pairs of independent loads and stores, building spatial groups:
if ok and (intra_iteration_address_distance < stride*u)
  if (read,read) or (write,write)
    ok = analyze_and_build_spatial_groups()
  endif
endif
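The grouping idea can be sketched on a toy description of a loop body. This is a simplified illustration, not the vectorizer's data structures: the `Access` record, its fields (offsets in elements rather than bytes), and the sort-then-scan strategy are assumptions, and the dependence test from the first half of the slide is left out.

```python
from collections import namedtuple

# Hypothetical description of one memory access in the loop body:
# the access touches base[stride * i + offset] in iteration i.
Access = namedtuple("Access", "name kind base stride offset")

def build_spatial_groups(accesses):
    """Group accesses of the same kind, array, and stride whose start
    offsets lie within one stride of the first member of the group."""
    groups = []
    ordered = sorted(accesses, key=lambda a: (a.kind, a.base, a.stride, a.offset))
    for acc in ordered:
        g = groups[-1] if groups else None
        same_stream = g is not None and (
            (g[0].kind, g[0].base, g[0].stride) == (acc.kind, acc.base, acc.stride))
        if same_stream and acc.offset - g[0].offset < acc.stride:
            g.append(acc)          # spatial reuse: shares the loaded vectors
        else:
            groups.append([acc])   # start a new (possibly singleton) group
    return groups

body = [
    Access("in0", "load", "in", 4, 0),
    Access("in1", "load", "in", 4, 1),
    Access("in3", "load", "in", 4, 3),
    Access("out0", "store", "out", 2, 0),
    Access("out1", "store", "out", 2, 1),
]
groups = build_spatial_groups(body)
print([[a.name for a in g] for g in groups])
```

In this toy loop the three loads form one stride-4 group with a gap at position 2, and the two stores form a complete stride-2 group.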

15
Augmenting the vectorizer, step 2/3: check spatial groups

6_analyze_consecutive_data_accesses already traversed each individual load/store to analyze its access pattern.
Augment this traversal by:
  allowing non-consecutive accesses
  building singleton groups for strided, ungrouped loads/stores
  checking for gaps and profitability of spatial groups
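Gap checking amounts to asking which of the d positions within a stride no group member covers. A minimal sketch, assuming a group is summarized by its members' element offsets (the function name and representation are hypothetical):

```python
def group_gaps(offsets, stride):
    """Positions within one stride that no member of the group accesses."""
    first = min(offsets)
    used = {off - first for off in offsets}
    return [p for p in range(stride) if p not in used]

print(group_gaps([0, 1, 3], 4))   # group with one missing position
print(group_gaps([0, 1], 2))      # complete group
print(group_gaps([5], 8))         # singleton group of a stride-8 access
```

A singleton group is simply the extreme case: one used position and d-1 gaps.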

16
Augmenting the vectorizer, step 3/3: transformation

vect_transform_stmt generates vector code per scalar OP.
Augment this by considering:
If OP is a load/store in the first position of a spatial group:
  generate d loads/stores
  handle their alignment according to the starting address
  generate d*log2(d) extracts/interleaves
If OP belongs to a spatial group, connect it to the appropriate extract/interleave according to its position.
Unused extracts/interleaves are discarded by subsequent DCE.
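For loads, the d loads followed by d*log2(d) extracts can be modeled as log2(d) stages of even/odd extraction over pairs of adjacent vectors. The sketch below is one plausible arrangement of such a network on list-modeled vectors; the stage layout (even results in the first half of the next stage, odd results in the second half) is an assumption chosen so that the outputs come out in group-position order.

```python
def extract_even(v1, v2):
    return (v1 + v2)[0::2]

def extract_odd(v1, v2):
    return (v1 + v2)[1::2]

def deinterleave(chain):
    """chain: d vectors (d a power of 2) holding d*VF consecutive elements.
    Returns (result, n_extracts); result[j] holds elements j, j+d, j+2d, ..."""
    d = len(chain)
    assert d > 0 and d & (d - 1) == 0, "stride must be a power of 2"
    n_extracts = 0
    span = 1
    while span < d:                       # log2(d) stages
        nxt = [None] * d
        for k in range(d // 2):           # d extracts per stage
            first, second = chain[2 * k], chain[2 * k + 1]
            nxt[k] = extract_even(first, second)
            nxt[k + d // 2] = extract_odd(first, second)
            n_extracts += 2
        chain = nxt
        span *= 2
    return chain, n_extracts

VF, d = 4, 8
mem = list(range(d * VF))
loads = [mem[i * VF:(i + 1) * VF] for i in range(d)]   # the d vector loads
result, n = deinterleave(loads)
print(result[3], n)
```

Each member of a spatial group then reads the output vector that matches its position; outputs nobody reads leave dead extracts behind for DCE.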

17
Performance, qualitative: VF / (1 + log2 d)

d     VF=4   VF=8   VF=16
1     4      8      16
2     2      4      8
4     1.3    2.6    5.3
8     1      2      4
16    0.8    1.6    3.2
32    0.6    1.3    2.6

Vectorized code has d loads/stores and d*log2(d) extracts/interleaves.
Scalar code has d*VF loads/stores.
The improvement factor in the number of load/store/extract/interleave operations is VF / (1 + log2 d).
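The factor follows directly from the two operation counts on this slide, and a short script can evaluate it for any VF and d. The function name is illustrative; the counts are the ones stated above.

```python
from math import log2

def improvement(vf, d):
    """Ratio of scalar to vectorized memory-related operation counts."""
    scalar_ops = d * vf              # scalar loads/stores
    vector_ops = d + d * log2(d)     # vector loads/stores + extracts/interleaves
    return scalar_ops / vector_ops   # simplifies to vf / (1 + log2(d))

for d in (1, 2, 4, 8, 16, 32):
    print(d, [round(improvement(vf, d), 2) for vf in (4, 8, 16)])
```

The factor drops below 1 only for VF=4 with d of 16 or more, which is where this count predicts a slowdown for loops that do nothing but memory operations.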

18
Performance, empirical (on PowerPC 970 with AltiVec)

Stride of 2 always provides speedups.
Strides of 8 and 16 suffer from increased code size, which turns off loop unrolling.
Stride of 32 suffers from high register pressure (d+1).
If non-permute operations exist, there are speedups for all strides if VF >= 8.

19
Performance: stride of 8 with gaps

The position of gaps affects the number of extracts (interleaves) needed.
Improvement is observed even for a single strided access (VF=16 with arithmetic operations).
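Why gap position matters can be seen by counting which extracts stay live after DCE. The sketch below assumes a particular d*log2(d) network (each stage pairs adjacent vectors, writing the even result to the first half and the odd result to the second half of the next stage); under a different network layout the counts would differ, so the numbers are illustrative only.

```python
def live_extracts(d, used_positions):
    """Count extracts that survive DCE when only the group positions in
    used_positions are consumed (d is a power-of-2 stride)."""
    half = d // 2
    live = set(used_positions)       # live outputs of the last stage
    count = 0
    stages = d.bit_length() - 1      # log2(d) for a power-of-2 d
    for _ in range(stages):
        count += len(live)           # every live node in a stage is one extract
        # Output m of a stage reads inputs 2*(m % half) and 2*(m % half) + 1.
        live = {2 * (m % half) + side for m in live for side in (0, 1)}
    return count

print(live_extracts(8, range(8)))    # no gaps
print(live_extracts(8, [0]))         # single strided access
print(live_extracts(8, [0, 1]))      # two adjacent positions used
print(live_extracts(8, [0, 4]))      # two positions half a stride apart
```

In this model two used positions that are half a stride apart share most of their extract tree, while two adjacent positions share none of it, so the same number of gaps can cost a different number of extracts.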

20
Performance: kernels

4 groups: VF=4, 8, 16, and 16 with gaps.
Each kernel name is prefixed with its stride.
Slowdown when doing only memory operations at VF=4, d=8.

21
Future direction: towards loop-aware SLP

When building spatial groups, we consider distinct operations accessing adjacent/close addresses; this is the first step of building SLP chains.
SLP looks for VF fully interleaved accesses, without gaps; this may require earlier loop unrolling.
The next step is to consider the operations that use a spatial group of loads: if they're isomorphic, try to postpone the extracts.
This is analogous to handling alignment using zero-shift, lazy-shift, and eager-shift policies.

22
Conclusions

Existing SIMD targets supporting SIMpD can provide improved performance for important power-of-2 strided applications: don't be afraid of d > 2.
Existing compiler loop-based auto-vectorization can be augmented efficiently to handle such strided accesses.
This can serve as a first step toward combining traditional loop-based vectorization with (if-converted) basic-block vectorization (SLP).
This area of work is fertile; consider the details (d, gaps, positions, VF, non-mem ops) for it not to be futile!