Title: Superword-Level Parallelism in the Presence of Control Flow
1Superword-Level Parallelism in the Presence of
Control Flow
- Jaewook Shin
- Mary Hall
- Jacqueline Chame
CGO'05
March 22, 2005
2Multimedia Extension Architectures
- Multimedia applications are becoming increasingly important.
- Most microprocessors have multimedia extensions.
- SIMD parallelism
- Variable-sized data fields
3Superword-Level Parallelism (SLP)
- Fine grain SIMD parallelism in aggregate data
objects larger than a machine word
- Most compilers for multimedia extensions are
based on conventional vectorization techniques.
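4SLP Compiler (Larsen & Amarasinghe)
for (i = 0; i < 16; i++) a[i] = b[i] + c[i];
Unroll by 4
for (i = 0; i < 16; i += 4) {
  a[i+0] = b[i+0] + c[i+0];
  a[i+1] = b[i+1] + c[i+1];
  a[i+2] = b[i+2] + c[i+2];
  a[i+3] = b[i+3] + c[i+3];
}
Pack isomorphic statements
for (i = 0; i < 16; i += 4) a[i:i+3] = b[i:i+3] + c[i:i+3];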
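The unroll step above can be sketched in plain C. This is only an illustration under assumptions: the function names are hypothetical, the element type is taken to be int, and the packing into a superword instruction is left to the compiler, so the sketch only checks that the unrolled loop with four isomorphic statements computes the same result as the original loop.

```c
#include <assert.h>

#define N 16 /* trip count used on the slide */

/* Original scalar loop: one addition per iteration. */
void add_scalar(int a[N], const int b[N], const int c[N]) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Unrolled by 4: the body now holds four isomorphic statements on
   adjacent elements, the shape an SLP compiler can pack into a single
   4-wide superword addition. */
void add_unrolled(int a[N], const int b[N], const int c[N]) {
    for (int i = 0; i < N; i += 4) {
        a[i + 0] = b[i + 0] + c[i + 0];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
}

/* Checks that unrolling leaves the result unchanged. */
void slp_unroll_demo(void) {
    int b[N], c[N], a1[N], a2[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 10 * i; }
    add_scalar(a1, b, c);
    add_unrolled(a2, b, c);
    for (int i = 0; i < N; i++)
        assert(a1[i] == a2[i]);
}
```

Calling slp_unroll_demo() from a main function runs the comparison.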
5Control Flow and the SLP Compiler
for (i = 0; i < 16; i++) if (a[i] != 0) b[i]++;
Only parallelizes within a basic block!
6Our Approach
for (i = 0; i < 16; i++) if (a[i] != 0) b[i]++;
for (i = 0; i < 16; i += 4) {
  Vcond = a[i:i+3] != (0, 0, 0, 0);
  Vtemp = b[i:i+3] + (1, 1, 1, 1);
  b[i:i+3] = Combine b[i:i+3] and Vtemp according to Vcond
}
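The transformed loop can be emulated in scalar C to see why it is equivalent to the branching loop. This is a sketch under assumptions: a superword is modeled as four int lanes, the per-lane inner loops stand in for single superword instructions, and the function names are hypothetical.

```c
#include <assert.h>

#define N 16 /* trip count used on the slide */
#define W 4  /* assumed superword width, in elements */

/* Original loop with control flow in the body. */
void incr_scalar(const int a[N], int b[N]) {
    for (int i = 0; i < N; i++)
        if (a[i] != 0) b[i]++;
}

/* Branch-free form: compute the condition and the new value for all
   lanes, then combine old and new values lane by lane. */
void incr_superword(const int a[N], int b[N]) {
    for (int i = 0; i < N; i += W) {
        int vcond[W], vtemp[W];
        for (int k = 0; k < W; k++) vcond[k] = (a[i + k] != 0); /* Vcond */
        for (int k = 0; k < W; k++) vtemp[k] = b[i + k] + 1;    /* Vtemp */
        for (int k = 0; k < W; k++)                              /* combine */
            b[i + k] = vcond[k] ? vtemp[k] : b[i + k];
    }
}

/* Checks that both loops produce the same b. */
void approach_demo(void) {
    int a[N], b1[N], b2[N];
    for (int i = 0; i < N; i++) { a[i] = i % 3; b1[i] = b2[i] = i; }
    incr_scalar(a, b1);
    incr_superword(a, b2);
    for (int i = 0; i < N; i++)
        assert(b1[i] == b2[i]);
}
```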
7Key Concepts
- Borrow from optimizations for architectures supporting predicated execution
- Derive a large basic block of predicated instructions
- SELECT operations merge data values for different control flow paths
- Restore control flow
if-conversion → parallelize → remove superword predicates (SELECT) → remove scalar predicates (unpredicate)
8If-Conversion
if (a != 0) b = b + 1;
cond = a != 0
pT, pF = pset(cond)
b = b + 1 <pT>
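If-conversion can be modeled in C by treating a guarded instruction as one whose result is discarded when its predicate is false. The sketch below assumes that reading; the PredPair type and the C version of pset are hypothetical stand-ins for the predicate-define operation on the slide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of pset: defines a predicate and its complement. */
typedef struct { bool pT, pF; } PredPair;

PredPair pset(bool cond) {
    PredPair p = { cond, !cond };
    return p;
}

/* Original code: a branch guards the update. */
int with_branch(int a, int b) {
    if (a != 0) b = b + 1;
    return b;
}

/* If-converted code: no branch, the update is guarded by pT. */
int if_converted(int a, int b) {
    bool cond = (a != 0);
    PredPair p = pset(cond);
    b = p.pT ? b + 1 : b; /* b = b + 1 <pT>: no effect when pT is false */
    return b;
}

/* Checks both forms on a taken and a not-taken case. */
void if_conversion_demo(void) {
    assert(with_branch(0, 5) == if_converted(0, 5));
    assert(with_branch(3, 5) == if_converted(3, 5));
}
```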
9SELECT instruction
dst = SELECT(src1, src2, predicate)
[Figure: element-wise example of SELECT with operands dst, src1, src2, and predicate.]
Va = Vb + (1, 1, 1, 1) <Vp>
Vtemp = Vb + (1, 1, 1, 1)
Va = SELECT(Va, Vtemp, Vp)
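A lane-by-lane C model of SELECT makes the rewrite of the predicated superword instruction concrete. The operand convention is an assumption drawn from the slide's SELECT(Va, Vtemp, Vp): the second source is taken in lanes where the predicate is true, the first source elsewhere. The Vec4 type and function names are hypothetical.

```c
#include <assert.h>

#define W 4 /* assumed number of lanes per superword */

typedef struct { int lane[W]; } Vec4;

/* dst = SELECT(src1, src2, pred): per lane, src2 where pred is nonzero,
   src1 otherwise (assumed convention). */
Vec4 vselect(Vec4 src1, Vec4 src2, Vec4 pred) {
    Vec4 dst;
    for (int k = 0; k < W; k++)
        dst.lane[k] = pred.lane[k] ? src2.lane[k] : src1.lane[k];
    return dst;
}

/* Adds the same scalar to every lane. */
Vec4 vadd_scalar(Vec4 v, int s) {
    Vec4 r;
    for (int k = 0; k < W; k++)
        r.lane[k] = v.lane[k] + s;
    return r;
}

/* Va = Vb + (1, 1, 1, 1) <Vp>, rewritten without a superword predicate. */
Vec4 predicated_increment(Vec4 va, Vec4 vb, Vec4 vp) {
    Vec4 vtemp = vadd_scalar(vb, 1); /* computed unconditionally */
    return vselect(va, vtemp, vp);   /* merged according to Vp */
}

/* Checks one mixed-predicate case. */
void select_demo(void) {
    Vec4 va = {{10, 20, 30, 40}}, vb = {{1, 2, 3, 4}}, vp = {{1, 0, 1, 0}};
    Vec4 r = predicated_increment(va, vb, vp);
    assert(r.lane[0] == 2 && r.lane[1] == 20);
    assert(r.lane[2] == 4 && r.lane[3] == 40);
}
```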
10Unpredicate
Predicated code:
bred[i] = fred <p>
bred[i] = 100 <¬p>
bgre[i] = fgre <p>
bgre[i] = 100 <¬p>
bblu[i] = fblu <p>
bblu[i] = 100 <¬p>
After unpredicate:
if (p) { bred[i] = fred; bgre[i] = fgre; bblu[i] = fblu; }
else { bred[i] = 100; bgre[i] = 100; bblu[i] = 100; }
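The effect of unpredicate on this example can be checked with a small C sketch. It assumes the three "100" assignments are guarded by the complement of p, as the if/else form implies; the Pixel struct and function names are hypothetical and replace the slide's separate red, green, and blue variables.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three color values on the slide. */
typedef struct { int red, gre, blu; } Pixel;

/* Predicated form: every instruction carries its own guard. */
Pixel store_predicated(bool p, Pixel f) {
    Pixel b = {0, 0, 0};
    if (p)  b.red = f.red; /* <p>  */
    if (!p) b.red = 100;   /* <!p> */
    if (p)  b.gre = f.gre; /* <p>  */
    if (!p) b.gre = 100;   /* <!p> */
    if (p)  b.blu = f.blu; /* <p>  */
    if (!p) b.blu = 100;   /* <!p> */
    return b;
}

/* Unpredicated form: one conditional branch, two basic blocks. */
Pixel store_unpredicated(bool p, Pixel f) {
    Pixel b = {0, 0, 0};
    if (p) {
        b.red = f.red; b.gre = f.gre; b.blu = f.blu;
    } else {
        b.red = 100; b.gre = 100; b.blu = 100;
    }
    return b;
}

/* Checks that both forms agree for p true and p false. */
void unpredicate_demo(void) {
    Pixel f = {1, 2, 3};
    for (int p = 0; p <= 1; p++) {
        Pixel x = store_predicated(p != 0, f);
        Pixel y = store_unpredicated(p != 0, f);
        assert(x.red == y.red && x.gre == y.gre && x.blu == y.blu);
    }
}
```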
11Algorithm
- If-conversion
- Park and Schlansker's RK algorithm
- SELECT
- Insert the minimum number of SELECT instructions
- Use reaching definition based on predicate covering
- Unpredicate
- Try to reduce the number of conditional branches
- Use predicate covering
12Predicate Hierarchy Graph (PHG)
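- PHG represents relationships among predicates.
pT1, pF1 = pset(c1) <TRUE>
pT2, pF2 = pset(c2) <TRUE>
pT3, pF3 = pset(c3) <pT1>
[Figure: PHG rooted at TRUE. The conditions c1 and c2 are defined under TRUE, with T/F edges to pT1/pF1 and pT2/pF2; c3 is defined under pT1, with T/F edges to pT3/pF3.]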
13Predicate Covering
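- A predicate p is covered by a set of predicates G if $p = \mathrm{true} \Rightarrow \exists\, p_c \in G$ such that $p_c = \mathrm{true}$.
Q: What are the predicate covering predecessors of I3?
I1 <pF3>
I2 <pT3>
I3 <pT1>
[Figure: the PHG over TRUE, pT1/pF1, pT2/pF2, and pT3/pF3, with pT1 marked and the set G built up from the labels pT3, pF3, pT1.]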
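The covering definition can be tested directly by enumerating every outcome of the conditions. Note that this brute-force check is a substitute for the PHG-based analysis in the slides, not a rendering of it. It assumes the predicate definitions from the PHG slide (pT3 and pF3 are defined under pT1, so each is true only when pT1 is true); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { PT1, PF1, PT2, PF2, PT3, PF3, NPRED };

/* Evaluates every predicate for one outcome of the conditions c1, c2, c3
   (bits 0..2 of m). pT3/pF3 are guarded by pT1, as on the PHG slide. */
void eval_predicates(unsigned m, bool v[NPRED]) {
    bool c1 = (m & 1u) != 0, c2 = (m & 2u) != 0, c3 = (m & 4u) != 0;
    v[PT1] = c1;       v[PF1] = !c1;
    v[PT2] = c2;       v[PF2] = !c2;
    v[PT3] = c1 && c3; v[PF3] = c1 && !c3;
}

/* True if p is covered by G (a bitmask over predicate ids): whenever p is
   true, at least one predicate in G is true. */
bool covered(int p, unsigned G) {
    for (unsigned m = 0; m < 8; m++) {
        bool v[NPRED];
        eval_predicates(m, v);
        if (!v[p]) continue;
        bool any = false;
        for (int q = 0; q < NPRED; q++)
            if (((G >> q) & 1u) != 0 && v[q]) any = true;
        if (!any) return false;
    }
    return true;
}

/* Checks the slide's example: {pT3, pF3} covers pT1, {pT3} alone does not. */
void covering_demo(void) {
    assert(covered(PT1, (1u << PT3) | (1u << PF3)));
    assert(!covered(PT1, 1u << PT3));
}
```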
14SELECT Algorithm
- SELECT is not necessary for the first reaching definition
Before:
d1: Va = V1 <pF3>
d2: Va = V2 <pT3>
u3: Vc = Va <pT1>
After:
d1: Va = V1
d2: Va = SELECT(Va, V2, pT3)
u3: Vc = Va
[Figure: PHG with TRUE at the root, pT1/pF1 and pT2/pF2 below it, and pT3/pF3 below pT1.]
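A one-lane C model shows why the first reaching definition needs no SELECT in this example. The sketch assumes predicated semantics (an instruction has no effect when its guard is false) and compares the value of Va seen by the use u3 only when its guard pT1 is true, since u3 does not read Va otherwise. Function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of Va reaching u3 under the original predicated code. */
int va_predicated(int va0, int v1, int v2, bool c1, bool c3) {
    bool pT1 = c1, pT3 = pT1 && c3, pF3 = pT1 && !c3;
    int va = va0;
    if (pF3) va = v1; /* d1: Va = V1 <pF3> */
    if (pT3) va = v2; /* d2: Va = V2 <pT3> */
    return va;
}

/* Value of Va reaching u3 after the predicates are removed. */
int va_select(int va0, int v1, int v2, bool c1, bool c3) {
    bool pT3 = c1 && c3;
    (void)va0;          /* the incoming value is overwritten by d1 */
    int va = v1;        /* d1: first reaching definition, no SELECT */
    va = pT3 ? v2 : va; /* d2: Va = SELECT(Va, V2, pT3) */
    return va;
}

/* Checks agreement on every path where u3's guard pT1 is true. */
void select_algorithm_demo(void) {
    for (int c3 = 0; c3 <= 1; c3++)
        assert(va_predicated(9, 1, 2, true, c3 != 0) ==
               va_select(9, 1, 2, true, c3 != 0));
}
```

When pT1 is false the two versions may leave different values in Va, which is harmless here only because the use u3 is itself guarded by pT1.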
15Predicate CFG Generator
- Find the predicate covering predecessors for each predicated instruction.
(a) predicated scalar code
I1 <pF3>
I2 <pT3>
I3 <pT1>
(b) CFG
[Figure: CFG in which I1 and I2 are alternative predecessors of I3.]
(c) code generated
if (pT1) { if (pF3) I1; else I2; I3; }
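The generated code can be checked against the predicated code by recording which instructions execute. In this sketch each instruction sets one bit of a trace; the predicates are derived from two conditions c1 and c3 as on the PHG slide, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { I1 = 1u, I2 = 2u, I3 = 4u }; /* one trace bit per instruction */

/* (a) Predicated scalar code: each instruction tests its own guard. */
unsigned run_predicated(bool c1, bool c3) {
    bool pT1 = c1, pT3 = pT1 && c3, pF3 = pT1 && !c3;
    unsigned trace = 0;
    if (pF3) trace |= I1;
    if (pT3) trace |= I2;
    if (pT1) trace |= I3;
    return trace;
}

/* (c) Code generated from the CFG: nested branches instead of guards. */
unsigned run_generated(bool c1, bool c3) {
    bool pT1 = c1, pF3 = c1 && !c3;
    unsigned trace = 0;
    if (pT1) {
        if (pF3) trace |= I1;
        else     trace |= I2;
        trace |= I3;
    }
    return trace;
}

/* Checks that both forms execute the same instructions on every path. */
void cfg_generator_demo(void) {
    for (int c1 = 0; c1 <= 1; c1++)
        for (int c3 = 0; c3 <= 1; c3++)
            assert(run_predicated(c1 != 0, c3 != 0) ==
                   run_generated(c1 != 0, c3 != 0));
}
```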
16Unpredicate Algorithm
- Schedule each instruction within an existing basic block where it is safe.
- If no such basic block exists, use predicate CFG generator.
(a) predicated scalar code
I1 <p>
I2 <¬p>
I3 <TRUE>
I4 <p>
I5 <¬p>
(b) predicate CFG generator
[Figure: a branch on p selects I1 or I2, I3 follows under TRUE, and a second branch on p selects I4 or I5.]
(c) unpredicate
[Figure: a single branch on p selects the block {I1, I4} or the block {I2, I5}; I3 executes under TRUE.]
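The difference between panels (b) and (c) can be illustrated by counting conditional branches. This sketch rests on several assumptions: I1 and I4 are guarded by p and I2 and I5 by its complement, the instruction bodies are invented placeholders, and I4 and I5 do not depend on I3, which is what makes it safe to move them into the existing blocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical machine state plus a counter of executed branches. */
typedef struct { int x, y, z, branches; } State;

/* (b) Textual order preserved: I3 separates the guarded instructions,
   so two conditional branches are needed. */
State run_textual_order(bool p, State s) {
    s.branches++;
    if (p) s.x += 1; /* I1 <p>  */
    else   s.x -= 1; /* I2 <!p> */
    s.z *= 2;        /* I3 <TRUE> */
    s.branches++;
    if (p) s.y += 1; /* I4 <p>  */
    else   s.y -= 1; /* I5 <!p> */
    return s;
}

/* (c) Unpredicate: I4 and I5 are scheduled into the blocks that already
   hold I1 and I2, leaving a single conditional branch. */
State run_unpredicated(bool p, State s) {
    s.branches++;
    if (p) { s.x += 1; s.y += 1; } /* I1, I4 */
    else   { s.x -= 1; s.y -= 1; } /* I2, I5 */
    s.z *= 2;                      /* I3 */
    return s;
}

/* Checks equal results with fewer branches. */
void unpredicate_algorithm_demo(void) {
    State s0 = {0, 0, 3, 0};
    for (int p = 0; p <= 1; p++) {
        State a = run_textual_order(p != 0, s0);
        State b = run_unpredicated(p != 0, s0);
        assert(a.x == b.x && a.y == b.y && a.z == b.z);
        assert(b.branches < a.branches);
    }
}
```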
17Our Implementation
[Figure: compiler flow from original C code to output C code. Stages shown: unroll, if-conversion, parallelize, remove superword predicates (SELECT), remove scalar predicates (unpredicate), superword replacement, superword level locality, alignment analysis. Legend: our previous work; MIT SLP compiler; new for this paper.]
18Applications
- Kernels
- Chroma: Chroma keying of two images
- Sobel: Sobel edge detection
- TM: Template matching
- Max: Max value search
- transitive: Shortest path search
- Functions from UCLA MediaBench
- MPEG2-dist1: dist1 of MPEG2 encoder
- EPIC-unquantize: unquantize_image of unepic
- GSM-Calculation: Calculation_of_the_LTP_parameters of gsmencode
- Two data set sizes
- Large: Representative data set
- Small: Isolates parallelization effects
19Experimental Flow
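[Figure: experimental flow. The original C code yields three versions: Baseline, SLP (SLP compiler), and SLP-CF (SLP compiler with the control flow extension). The versions are compiled with AltiVec-extended GCC and run on a PowerPC G4.]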
20Overall Improvements: large data
21Overall Improvements: small data
22Related Research
- Vectorization techniques for conditionals
- Sreraman and Govindarajan (IJPP'00)
- Bik et al. (IJPP'02)
- Architectures with SELECT
- Multimedia Extension Architectures: AltiVec
- Processing-In-Memory: DIVA
- Vector machines: Smith et al. (ISCA'00)
- Phi-predication: Chuang et al. (CGO'03)
- Requires scalar SELECT
- Predicate CFG generator: Mahlke ('96)
23Conclusion
- Compiler system to exploit SLP in the presence of control flow
- Compiler algorithms to
- Insert the minimum number of SELECTs
- Restore efficient control flow
- In an experiment with 5 kernels and 3 benchmarks, the performance improved by 1.97X to 15.07X.
24Predicate Covering Predecessors
- A predicate p is covered by a set of predicates G if $p = \mathrm{true} \Rightarrow \exists\, p_c \in G$ such that $p_c = \mathrm{true}$.
- Given a sequence of predicated instructions, the predicate covering predecessors of an instruction I guarded by a predicate p are
- The first set S of textually preceding instructions of I such that
- The predicate set G of S covers the predicate p of I
- For each instruction Ic in S and its predicate pc in G,
- p is not covered by any subset of predicates guarding instructions between Ic and I
- pc and p are not mutually exclusive
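One possible reading of this definition is a backward walk from I that collects non-exclusive predecessors until their predicates cover p. The sketch below follows that reading and should be taken as an assumption, not as the algorithm from the paper. Covering and mutual exclusion are decided by brute-force enumeration of the conditions rather than by PHG analysis, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { PT1, PF1, PT2, PF2, PT3, PF3, NPRED };

/* Predicate values for one outcome of c1, c2, c3 (bits 0..2 of m);
   pT3/pF3 are defined under pT1, as on the PHG slide. */
void eval_predicates(unsigned m, bool v[NPRED]) {
    bool c1 = (m & 1u) != 0, c2 = (m & 2u) != 0, c3 = (m & 4u) != 0;
    v[PT1] = c1;       v[PF1] = !c1;
    v[PT2] = c2;       v[PF2] = !c2;
    v[PT3] = c1 && c3; v[PF3] = c1 && !c3;
}

/* True if, whenever p holds, some predicate in bitmask G holds. */
bool covered(int p, unsigned G) {
    for (unsigned m = 0; m < 8; m++) {
        bool v[NPRED], any = false;
        eval_predicates(m, v);
        if (!v[p]) continue;
        for (int q = 0; q < NPRED; q++)
            if (((G >> q) & 1u) != 0 && v[q]) any = true;
        if (!any) return false;
    }
    return true;
}

/* True if p and q are never true together. */
bool mutually_exclusive(int p, int q) {
    for (unsigned m = 0; m < 8; m++) {
        bool v[NPRED];
        eval_predicates(m, v);
        if (v[p] && v[q]) return false;
    }
    return true;
}

/* guards[j] is the predicate of instruction j. Returns a bitmask of the
   predicate covering predecessors of instruction i. If the start of the
   sequence is reached first, the partial set is returned. */
unsigned covering_predecessors(const int guards[], int i) {
    unsigned S = 0, G = 0;
    for (int j = i - 1; j >= 0 && !covered(guards[i], G); j--) {
        if (mutually_exclusive(guards[j], guards[i])) continue;
        S |= 1u << j;
        G |= 1u << guards[j];
    }
    return S;
}

/* Slide example: I1 <pF3>, I2 <pT3>, I3 <pT1> gives {I1, I2} for I3. */
void predecessors_demo(void) {
    int guards[] = { PF3, PT3, PT1 };
    assert(covering_predecessors(guards, 2) == 3u);
}
```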
25Is PHG inaccurate ?
- PHG is said to be inaccurate and limited in representation.
- We use a subset of predicate analyses based on PHG.
- Predicate deposit type: Conditional only
- No false positive for the following properties
- Mutually exclusive
- Predicate covering
- Leads to possibly conservative but always correct code
- Predicate analyses can be replaced with any other predicate analysis system
26Reverse If-Conversion (RIC) ?
- RIC assumes that no operations are inserted or removed during RIC.
- → Superword instructions are inserted and scalar instructions are removed in our case.
- RIC uses
- predicate define operations to preserve control dependences
- implicit predicate merge operations to preserve control anti-dependence
- → The original CFG can be partially or totally removed in our case.