Title: Vectorization for Modern Architectures II
1. Vectorization for Modern Architectures (II)
Wednesday, July 22, 2009
2. Want more?
- Control flow
  - SLP in the presence of control flow
  - Introducing control flow back
- Superword-Level Locality (SLL)
3. Control Flow and the SLP Compiler
  for (i = 0; i < …; i++)
    if (a[i] != 0) b[i] = b[i] + 1;
The SLP compiler only parallelizes within a basic block!
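4. One Approach
  for (i = 0; i < …; i++)
    if (a[i] != 0) b[i] = b[i] + 1;

  for (i = 0; i < …; i += 4) {
    Vcond = a[i:i+3] != (0, 0, 0, 0);
    Vtemp = b[i:i+3] + (1, 1, 1, 1);
    b[i:i+3] = Combine b[i:i+3] and Vtemp according to Vcond;
  }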
5. Key Concepts
- Borrow from optimizations for architectures supporting predicated execution
- Derive a large basic block of predicated instructions
- SELECT operations merge data values for different control flow paths
- Restore control flow
Compiler flow:
- if-conversion
- parallelize
- remove superword predicates (SELECT)
- remove scalar predicates (unpredicate)
6. If-Conversion
Before:
  if (a != 0) b = b + 1;
After:
  cond = (a != 0);
  pT, pF = pset(cond);
  b = b + 1;    (executed under predicate pT)
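As a rough illustration of the transformation above, the following C sketch emulates a predicated instruction in plain C. The `pset_t` struct and the function names `with_branch` and `if_converted` are assumptions made for this example; `pset` stands in for the predicate-setting instruction named on the slide and is not a real ISA intrinsic.

```c
#include <stdbool.h>

/* Hypothetical pair of complementary predicates, as produced by pset(cond). */
typedef struct {
    bool pT; /* true when cond holds */
    bool pF; /* true when cond does not hold */
} pset_t;

pset_t pset(bool cond) {
    pset_t p = { cond, !cond };
    return p;
}

/* Original form: the update is control dependent on the branch. */
int with_branch(int a, int b) {
    if (a != 0)
        b = b + 1;
    return b;
}

/* If-converted form: straight-line code in a single basic block.
 * The conditional expression emulates a predicated instruction that
 * commits its result only when its guard pT is true. */
int if_converted(int a, int b) {
    bool cond = (a != 0);
    pset_t p = pset(cond);
    b = p.pT ? b + 1 : b;
    return b;
}
```

Both functions return the same value for every input; the second has no branch in its body, which is what lets a basic-block parallelizer see the whole computation.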
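7. SELECT instruction
dst = SELECT(src1, src2, predicate)
Each field of dst is taken from src2 where the corresponding predicate field is 1, and from src1 otherwise.
[Figure: field-by-field SELECT example]

A superword instruction guarded by the superword predicate Vp:
  Va = Vb + (1, 1, 1, 1)
becomes:
  Vtemp = Vb + (1, 1, 1, 1)
  Va = SELECT(Va, Vtemp, Vp)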
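8. Unpredicate
Predicated scalar code:
  bred[i] = fred;  (p)      bred[i] = 100;  (!p)
  bgre[i] = fgre;  (p)      bgre[i] = 100;  (!p)
  bblu[i] = fblu;  (p)      bblu[i] = 100;  (!p)
After unpredicate:
  if (p) {
    bred[i] = fred; bgre[i] = fgre; bblu[i] = fblu;
  } else {
    bred[i] = 100; bgre[i] = 100; bblu[i] = 100;
  }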
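The effect of unpredication can be sketched in C as follows. The variable names follow the slide; the two function names and the pointer-based signature are assumptions made for this example. Each guarded assignment is emulated with its own `if`, which is roughly what a naive lowering of predicated code on a machine without predication would look like.

```c
/* Predicated form, emulated: every instruction carries its own guard,
 * so six tests are evaluated. */
void store_predicated(int p, int fred, int fgre, int fblu,
                      int *bred, int *bgre, int *bblu) {
    if (p)  *bred = fred;
    if (!p) *bred = 100;
    if (p)  *bgre = fgre;
    if (!p) *bgre = 100;
    if (p)  *bblu = fblu;
    if (!p) *bblu = 100;
}

/* Unpredicated form: instructions with the same guard are grouped
 * under a single branch, restoring the original control flow. */
void store_unpredicated(int p, int fred, int fgre, int fblu,
                        int *bred, int *bgre, int *bblu) {
    if (p) {
        *bred = fred;
        *bgre = fgre;
        *bblu = fblu;
    } else {
        *bred = 100;
        *bgre = 100;
        *bblu = 100;
    }
}
```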
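9. Extend for Control Flow
  for (i = 0; i < …; i++)
    if (a[i] != 0) b[i] = b[i] + 1;

  for (i = 0; i < …; i += 4) {
    pred = a[i:i+3] != (0, 0, 0, 0);
    old = b[i:i+3];
    new = old + (1, 1, 1, 1);
    b[i:i+3] = SELECT(old, new, pred);
  }
Overhead: both control flow paths are always executed!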
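The SELECT-based loop can be emulated in portable C to check that it computes the same result as the scalar loop. This is only a sketch: `VL`, `select4`, `increment_scalar` and `increment_select` are names invented for this example, superwords are modeled as 4-element `int` arrays rather than real SIMD registers, and the array length is assumed to be a multiple of `VL`.

```c
#include <stddef.h>

#define VL 4 /* fields per superword, matching the 4-wide examples */

/* Emulated SELECT: each field comes from src2 where the predicate field
 * is nonzero and from src1 otherwise. */
void select4(int dst[VL], const int src1[VL], const int src2[VL],
             const int pred[VL]) {
    for (int k = 0; k < VL; k++)
        dst[k] = pred[k] ? src2[k] : src1[k];
}

/* Scalar reference: if (a[i] != 0) b[i] = b[i] + 1; */
void increment_scalar(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != 0)
            b[i] = b[i] + 1;
}

/* Superword version: n is assumed to be a multiple of VL. */
void increment_select(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; i += VL) {
        int pred[VL], old[VL], upd[VL];
        for (int k = 0; k < VL; k++) {
            pred[k] = (a[i + k] != 0); /* pred = a[i:i+3] != (0,0,0,0) */
            old[k] = b[i + k];         /* old  = b[i:i+3]              */
            upd[k] = old[k] + 1;       /* new  = old + (1,1,1,1)       */
        }
        select4(&b[i], old, upd, pred); /* b[i:i+3] = SELECT(old, new, pred) */
    }
}
```

Note that `upd` is computed for every field, including those whose predicate is false; that is the overhead the slide points out.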
10. An Optimization: Branch-On-Superword-Condition-Code
  for (i = 0; i < …; i += 4) {
    pred = a[i:i+3] != (0, 0, 0, 0);
    branch-on-none(pred) L1;
    old = b[i:i+3];
    new = old + (1, 1, 1, 1);
    b[i:i+3] = SELECT(old, new, pred);
  L1:
  }
11. Branch-On-Superword-Condition-Code (BOSCC)
- branch-on-none(src): taken when every field of src is false
- branch-on-all(src): taken when every field of src is true
12. An Optimization: Branch-On-Superword-Condition-Code
- The branch-on-none(pred) bypasses the old / new / SELECT sequence and jumps to L1 when no field of pred is true.
- Overhead can be reduced for some input data sets, but BOSCC can also increase overhead because it introduces a branch instruction.
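A minimal C emulation of the bypass is sketched below. The names `VL`, `none_set` and `increment_boscc` are assumptions made for this example, and the `continue` statement plays the role of the jump to L1. The function returns how many superwords took the bypass so the effect on a given input can be observed.

```c
#include <stddef.h>

#define VL 4 /* fields per superword */

/* Emulated branch-on-none condition: true when every field is false. */
int none_set(const int pred[VL]) {
    for (int k = 0; k < VL; k++)
        if (pred[k])
            return 0;
    return 1;
}

/* Conditional increment with a BOSCC bypass; n is assumed to be a
 * multiple of VL. Returns the number of superwords that were bypassed. */
size_t increment_boscc(const int *a, int *b, size_t n) {
    size_t bypassed = 0;
    for (size_t i = 0; i < n; i += VL) {
        int pred[VL];
        for (int k = 0; k < VL; k++)
            pred[k] = (a[i + k] != 0);
        if (none_set(pred)) { /* branch-on-none(pred) L1 */
            bypassed++;
            continue;
        }
        for (int k = 0; k < VL; k++) {
            int old = b[i + k];
            int upd = old + 1;
            b[i + k] = pred[k] ? upd : old; /* SELECT(old, new, pred) */
        }
    }
    return bypassed;
}
```

With an input whose nonzero entries are clustered, most superwords skip the load, add, and SELECT; with scattered nonzero entries the branch is rarely taken and only adds cost.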
13. Understanding BOSCC profitability
- Consider the following kernel:
  for (i = 0; i < … (temp … B[i]) C[i] = temp … D[i]
14. Sensitivity to True Density
Vector length = 4; data fit in L1 cache.
15. Exponential Decrease in Taken BOSCC
- A branch-on-none is taken when all fields in a superword are false.
- Given a true density D,
  - the probability for one field to be false is 1-D.
  - The probability for four fields in a superword to be false at the same time is (1-D)^4, assuming the fields are independent.
- A branch cost is highest when the probability that it is taken is around 50%.
- With a vector length of 4, about 50% of BOSCCs are taken when the scalar true density is 16%.
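The arithmetic behind the 16% figure can be checked with a few lines of C. The function name `boscc_taken_probability` is an assumption made for this example, and the model assumes, as above, that the fields of a superword are false independently of one another.

```c
/* Probability that a branch-on-none is taken: all vl fields are false,
 * each independently with probability 1 - d, where d is the true density. */
double boscc_taken_probability(double d, int vl) {
    double p = 1.0;
    for (int k = 0; k < vl; k++)
        p *= (1.0 - d);
    return p;
}
```

For d = 0.16 and vl = 4 this evaluates to 0.84^4, roughly 0.498, which matches the 50% point quoted above. The same function shows how quickly the taken rate falls off: at d = 0.5 it is already down to 0.0625.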
16. An Observation: By Using BOSCCs ...
- Assume a simple model of parallelization where
  - one scalar instruction is mapped to one parallel instruction, and
  - the cost of BOSCCs is zero.
- We guarantee that each parallel instruction is executed once iff the corresponding scalar instruction is executed at least once in a group of vector-length iterations.
- This means that, under this model, parallelized code is always faster than the scalar original.
17. Superword-Level Locality (SLL)
- Definition: exploit data reuse in superword registers
- A large-capacity register file is used as a compiler-controlled cache.
- Differences from data reuse in caches
  - Eliminates memory access cycles completely
  - Storage has to be named explicitly
- Differences from data reuse in scalar registers
  - Spatial reuse in superword registers
[Figure: superword register files. AltiVec: 32 registers of 128 bits; DIVA: 32 registers of 256 bits.]
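18. Scalar vs. Superword Replacement
- Identifies array references to the same memory address
- Replaces array references with scalar/superword variables

Original loop nest:
  for (i = 1; i < …)
    A[i][j] = A[i-1][j] + B[j];
    A[i+1][j] = A[i][j] + B[j];

Superword-level parallelization:
  for (i = 1; i < …)
    A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
    A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];

Scalar replacement:
  for (i = 1; i < …)
    T1 = B[j];
    T2 = A[i-1][j] + T1;
    A[i+1][j] = T2 + T1;
    A[i][j] = T2;

Superword replacement:
  for (i = 1; i < …)
    SV1 = B[j:j+3];
    SV2 = A[i-1][j:j+3] + SV1;
    A[i+1][j:j+3] = SV2 + SV1;
    A[i][j:j+3] = SV2;

Speedup labels in the figure: 4X, 1.5X, 1.5X, 6X.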
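The replacement step can be tried out in C as sketched below. Several things here are assumptions made for the example and do not come from the slide: the array sizes `ROWS` and `COLS`, the loop bounds, the step of 2 on `i`, the use of `+` as the operator, and the emulation of a superword as a 4-element array (`SV1`, `SV2` become `sv1`, `sv2`).

```c
#define ROWS 6 /* hypothetical sizes, chosen only for the example */
#define COLS 8 /* must be a multiple of VL */
#define VL 4

/* Original body: A[i][j] is stored and then immediately reloaded,
 * and B[j] is loaded twice. */
void original(int A[ROWS][COLS], const int B[COLS]) {
    for (int i = 1; i + 1 < ROWS; i += 2)
        for (int j = 0; j < COLS; j++) {
            A[i][j] = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j];
        }
}

/* Scalar replacement: the reused values live in T1 and T2. */
void scalar_replaced(int A[ROWS][COLS], const int B[COLS]) {
    for (int i = 1; i + 1 < ROWS; i += 2)
        for (int j = 0; j < COLS; j++) {
            int t1 = B[j];
            int t2 = A[i - 1][j] + t1;
            A[i + 1][j] = t2 + t1;
            A[i][j] = t2;
        }
}

/* Superword replacement, emulated: sv1 and sv2 model superword registers
 * holding four adjacent elements of a row. */
void superword_replaced(int A[ROWS][COLS], const int B[COLS]) {
    for (int i = 1; i + 1 < ROWS; i += 2)
        for (int j = 0; j < COLS; j += VL) {
            int sv1[VL], sv2[VL];
            for (int k = 0; k < VL; k++) {
                sv1[k] = B[j + k];
                sv2[k] = A[i - 1][j + k] + sv1[k];
            }
            for (int k = 0; k < VL; k++) {
                A[i + 1][j + k] = sv2[k] + sv1[k];
                A[i][j + k] = sv2[k];
            }
        }
}
```

Counting memory operations per inner iteration, the original body performs four loads and two stores, while the replaced body performs two loads and two stores.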
19. Loop Skewing
- Reshape iteration space to uncover parallelism
-     DO I = 1, 8
-       DO J = 1, 4
- S       A(I,J) = A(I-1,J) + A(I,J-1)
-       ENDDO
-     ENDDO
- Dependence distance vectors on S: (1,0) and (0,1)
20. Loop Skewing
- I, J I-1,J I,J-1
- ---- ----- -----
- 1, 1 0, 1 1, 0
- 1, 2 0, 2 1, 1
- 1, 3 0, 3 1, 2
- 1, 4 0, 4 1, 3
- 2, 1 1, 1 2, 0
- 2, 2 1, 2 2, 1
- 2, 3 1, 3 2, 2
- 2, 4 1, 4 2, 3
- 3, 1 2, 1 3, 0
- 3, 2 2, 2 3, 1
- 3, 3 2, 3 3, 2
- 3, 4 2, 4 3, 3
- 4, 1 3, 1 4, 0
- 4, 2 3, 2 4, 1
- ...
Parallelism not apparent.
[Figure: iteration space with axes I and J]
21. Loop Skewing (j = I + J)
-     DO I = 1, N
-       DO j = I+1, I+N
- S       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
-       ENDDO
-     ENDDO
- Loop interchange to:
-     DO j = 2, N+N
-       DO I = max(1,j-N), min(N,j-1)
- S       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
-       ENDDO
-     ENDDO
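The skewed and interchanged nest can be compared against the original in C. In this sketch `N`, the helper names `imax` and `imin`, the function names, and the extra row and column 0 used for boundary values are all assumptions made for the example; the loop variable `jj` corresponds to the skewed index j = I + J.

```c
#define N 4 /* hypothetical problem size */

int imax(int x, int y) { return x > y ? x : y; }
int imin(int x, int y) { return x < y ? x : y; }

/* Original nest: A(I,J) = A(I-1,J) + A(I,J-1) for I, J in 1..N.
 * Row 0 and column 0 hold boundary values. */
void sweep_original(int A[N + 1][N + 1]) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            A[i][j] = A[i - 1][j] + A[i][j - 1];
}

/* Skewed and interchanged nest: the outer loop walks the diagonals
 * jj = I + J. Every element on a diagonal depends only on elements of
 * the previous diagonal, so the inner loop carries no dependence. */
void sweep_skewed(int A[N + 1][N + 1]) {
    for (int jj = 2; jj <= N + N; jj++) {
        int lo = imax(1, jj - N);
        int hi = imin(N, jj - 1);
        for (int i = lo; i <= hi; i++)
            A[i][jj - i] = A[i - 1][jj - i] + A[i][jj - i - 1];
    }
}
```

The inner trip count `hi - lo + 1` grows from 1 to N and shrinks back to 1, which is the varying vector length that makes skewing costly for small N.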
22. Loop Skewing
- The access pattern to A changes as follows.
- I, J I-1,J I,J-1 (I, j)
- ---- ----- ----- -----
- 1, 1 0, 1 1, 0 (1, 2)
- 1, 2 0, 2 1, 1 (1, 3)
- 2, 1 1, 1 2, 0 (2, 3)
- 1, 3 0, 3 1, 2 (1, 4)
- 2, 2 1, 2 2, 1 (2, 4)
- 3, 1 2, 1 3, 0 (3, 4)
- 1, 4 0, 4 1, 3 (1, 5)
- 2, 3 1, 3 2, 2 (2, 5)
- 3, 2 2, 2 3, 1 (3, 5)
- 4, 1 3, 1 4, 0 (4, 5)
- 2, 4 1, 4 2, 3 (2, 6)
- 3, 3 2, 3 3, 2 (3, 6)
- 4, 2 3, 2 4, 1 (4, 6)
- 5, 1 4, 1 5, 0 (5, 6)
- 3, 4 2, 4 3, 3 (3, 7)
- 4, 3 3, 3 4, 2 (4, 7)
- 5, 2 4, 2 5, 1 (5, 7)
[Figure: skewed iteration space, with I on one axis and j = 2, 3, 4, 5, 6, 7 on the other]
23. Loop Skewing
- Disadvantages
  - Varying vector length
    - Not profitable if Nj is small
  - If vector startup time is more than speedup time, this is not profitable
  - Vector loop bounds must be recomputed on each iteration of outer loop
- Apply loop skewing if everything else fails
24. References
- J. Shin, M. Hall and J. Chame, "Superword-Level Parallelism in the Presence of Control Flow," CGO 2005
- J. Shin, M. Hall and J. Chame, "Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures," PACT 2002
- J. Shin, J. Chame and M. Hall, "Evaluating Compiler Technology for Control-Flow Optimizations for Multimedia Extension," MSP6, 2004