1
Vectorization for Modern Architectures (II)
Wednesday, July 22, 2009

Jaewook Shin

2
Want more?

Control flow
  SLP in the presence of control flow
  Reintroducing control flow
Superword-Level Locality (SLL)

3
Control Flow and the SLP Compiler
for (i = 0; i < ...; i++)
  if (a[i] != 0) b[i] = b[i] + 1;
Only parallelizes within a basic block!
4
One Approach
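for (i = 0; i < ...; i++)
  if (a[i] != 0) b[i] = b[i] + 1;

for (i = 0; i < ...; i += 4) {
  Vcond = a[i:i+3] != (0, 0, 0, 0);
  Vtemp = b[i:i+3] + (1, 1, 1, 1);
  b[i:i+3] = Combine b[i:i+3] and Vtemp according to Vcond
}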
5
Key Concepts

Borrow from optimizations for architectures supporting predicated execution
Derive a large basic block of predicated instructions
SELECT operations merge data values for different control flow paths
Restore control flow

if-conversion
parallelize
remove superword predicates (SELECT)
remove scalar predicates (unpredicate)
6
If-Conversion
if (a != 0) b = b + 1;

cond = (a != 0);
pT, pF = pset(cond);
b = b + 1;    (guarded by pT)
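The if-conversion step above can be sketched in C. This is only an emulation under stated assumptions: C has no predicated instructions, so the guard pT is modeled with a conditional expression that commits the computed value only when pT holds. The function names `update_branchy` and `update_if_converted` are hypothetical and do not come from the slides.

```c
/* Original form: the update sits behind a branch. */
int update_branchy(int a, int b) {
    if (a != 0)
        b = b + 1;
    return b;
}

/* If-converted form (emulated): compute the condition, derive the
 * true/false predicates, and guard the update with pT. */
int update_if_converted(int a, int b) {
    int cond = (a != 0);
    int pT = cond;    /* predicate for the "then" path */
    int pF = !cond;   /* predicate for the (empty) "else" path */
    int t = b + 1;    /* computed unconditionally */
    b = pT ? t : b;   /* result is committed only under pT */
    (void)pF;         /* unused: the original has no else branch */
    return b;
}
```

Both functions return the same value for every input; the second one simply has no control flow around the update, which is what lets the later steps treat the body as one basic block.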
7
SELECT instruction
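dst = SELECT(src1, src2, predicate)
[Figure: field-by-field example of SELECT; each field of dst takes the src2 value where the predicate field is true and the src1 value otherwise.]

Va = Vb + (1, 1, 1, 1);    (guarded by Vp)

Vtemp = Vb + (1, 1, 1, 1);
Va = SELECT(Va, Vtemp, Vp);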
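As a rough illustration of how SELECT removes a superword predicate, the sketch below emulates a 4-field superword with a plain C array. The names `select4` and `guarded_increment` and the width `VL` are assumptions made for this example; a real multimedia extension would provide a single instruction for the merge.

```c
#include <stddef.h>

#define VL 4  /* assumed superword length, in fields */

/* dst[k] = pred[k] ? src2[k] : src1[k], field by field. */
void select4(int dst[VL], const int src1[VL], const int src2[VL],
             const int pred[VL]) {
    for (size_t k = 0; k < VL; k++)
        dst[k] = pred[k] ? src2[k] : src1[k];
}

/* Va = Vb + (1, 1, 1, 1) guarded by Vp, rewritten with SELECT. */
void guarded_increment(int va[VL], const int vb[VL], const int vp[VL]) {
    int vtemp[VL];
    for (size_t k = 0; k < VL; k++)
        vtemp[k] = vb[k] + 1;      /* Vtemp = Vb + (1, 1, 1, 1) */
    select4(va, va, vtemp, vp);    /* Va = SELECT(Va, Vtemp, Vp) */
}
```

Fields of `va` whose predicate is false keep their previous value, which matches the meaning of the guarded assignment.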
8
Unpredicate
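Predicated instructions:
  bred[i] = fred;  (guarded by p)     bred[i] = 100;  (guarded by !p)
  bgre[i] = fgre;  (guarded by p)     bgre[i] = 100;  (guarded by !p)
  bblu[i] = fblu;  (guarded by p)     bblu[i] = 100;  (guarded by !p)

After unpredicate:
  if (p) {
    bred[i] = fred; bgre[i] = fgre; bblu[i] = fblu;
  } else {
    bred[i] = 100; bgre[i] = 100; bblu[i] = 100;
  }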
9
Extend for Control Flow
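for (i = 0; i < ...; i++)
  if (a[i] != 0) b[i] = b[i] + 1;

for (i = 0; i < ...; i += 4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
}
Overhead: both control flow paths are always executed!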
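The parallelized loop above can be emulated in portable C to check that it computes the same result as the scalar loop. This is a sketch under assumptions: the slides do not give the loop bound, so the length `n` is a parameter here and is required to be a multiple of the assumed superword length `VL`; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* assumed superword length, in fields */

/* Scalar reference: the original loop with control flow. */
void increment_scalar(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != 0)
            b[i] = b[i] + 1;
}

/* Emulated superword version: no branch on the data; the add and
 * the merge run for every superword, whatever pred contains. */
void increment_select(const int *a, int *b, size_t n) {
    assert(n % VL == 0);  /* sketch handles whole superwords only */
    for (size_t i = 0; i < n; i += VL) {
        int pred[VL], old[VL], new_[VL];
        for (size_t k = 0; k < VL; k++) {
            pred[k] = (a[i + k] != 0);  /* pred = a[i:i+3] != (0,0,0,0) */
            old[k] = b[i + k];          /* old = b[i:i+3] */
            new_[k] = old[k] + 1;       /* new = old + (1,1,1,1) */
        }
        for (size_t k = 0; k < VL; k++) /* b[i:i+3] = SELECT(old,new,pred) */
            b[i + k] = pred[k] ? new_[k] : old[k];
    }
}
```

The second function does the addition even for superwords where every field of `pred` is false, which is the overhead the slide points out.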
10
An Optimization: Branch-On-Superword-Condition-Code
for (i = 0; i < ...; i += 4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  branch-on-none(pred) L1;
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
L1:
}
11
Branch-On-Superword-Condition-Code (BOSCC)
branch-on-none( src )
branch-on-all( src )
12
An Optimization: Branch-On-Superword-Condition-Code
for (i = 0; i < ...; i += 4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  branch-on-none(pred) L1;        (bypass)
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
L1:
}
BOSCC reduces overhead for some input data sets, but it can also increase overhead because it introduces a branch instruction.
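A minimal C emulation of the branch-on-none bypass is sketched below. The superword length `VL`, the length parameter `n`, and the name `increment_boscc` are assumptions for illustration; the returned count of bypassed superwords is added only so the effect of the branch can be observed.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* assumed superword length, in fields */

/* Returns the number of superwords whose update was bypassed. */
size_t increment_boscc(const int *a, int *b, size_t n) {
    assert(n % VL == 0);  /* sketch handles whole superwords only */
    size_t bypassed = 0;
    for (size_t i = 0; i < n; i += VL) {
        int pred[VL];
        int any = 0;
        for (size_t k = 0; k < VL; k++) {
            pred[k] = (a[i + k] != 0);
            any |= pred[k];
        }
        if (!any) {       /* branch-on-none(pred): jump to L1 */
            bypassed++;
            continue;
        }
        for (size_t k = 0; k < VL; k++) {
            int old = b[i + k];
            int new_ = old + 1;
            b[i + k] = pred[k] ? new_ : old;  /* SELECT(old, new, pred) */
        }
    }
    return bypassed;
}
```

When few superwords are entirely false, the branch rarely skips anything and its own cost is pure overhead, which is the trade-off noted above.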
13
Understanding BOSCC profitability

Consider the following kernel

for (i = 0; i ... (temp ... B[i]) C[i] = temp ... D[i]
14
Sensitivity to True Density
Vector length = 4; data fit in L1 cache.
15
Exponential Decrease in Taken BOSCC

A branch-on-none is taken when all fields in a superword are false.
Given a true density D, the probability for one field to be false is 1 - D.
The probability for four fields in a superword to be false at the same time is (1 - D)^4.
A branch is most costly when it is taken around 50% of the time.
With a vector length of 4, BOSCC is taken 50% of the time when the scalar true density is about 16%.
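The 16% figure follows from solving (1 - D)^4 = 0.5, which gives D = 1 - 0.5^(1/4), roughly 0.159. The sketch below reproduces that arithmetic, assuming each field is true independently with probability D (the function names are hypothetical). It uses repeated multiplication and bisection so that no math library is needed.

```c
#include <assert.h>

/* Probability that a branch-on-none is taken: all `vl` fields are
 * false, assuming each field is true independently with probability d. */
double boscc_taken_probability(double d, int vl) {
    assert(d >= 0.0 && d <= 1.0 && vl > 0);
    double p = 1.0;
    for (int k = 0; k < vl; k++)
        p *= 1.0 - d;
    return p;
}

/* True density at which the taken probability equals `target`,
 * found by bisection (the probability decreases as d grows). */
double density_for_taken_probability(double target, int vl) {
    double lo = 0.0, hi = 1.0;
    for (int iter = 0; iter < 60; iter++) {
        double mid = 0.5 * (lo + hi);
        if (boscc_taken_probability(mid, vl) > target)
            lo = mid;  /* still taken too often: density must be higher */
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For a vector length of 4, `boscc_taken_probability(0.16, 4)` is just under 0.5, and `density_for_taken_probability(0.5, 4)` lands near 0.159.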

16
An Observation By Using BOSCCs ...

Assume a simple model of parallelization where
  one scalar instruction is mapped to one parallel instruction, and
  the cost of BOSCCs is zero.
We guarantee that each parallel instruction is executed once iff the corresponding scalar instruction is executed at least once within the vector-length iterations it covers.
Under this model, parallelized code is never slower than the scalar original.

17
Superword-Level Locality (SLL)

Definition: exploit data reuse in superword registers
  A large-capacity register file is used as a compiler-controlled cache.
Differences from data reuse in caches
  Eliminates memory access cycles completely
  Storage has to be named explicitly
Differences from data reuse in scalar registers
  Spatial reuse in superword registers

[Figure: superword register files. AltiVec: 32 registers of 128 bits; DIVA: 32 registers of 256 bits.]
18
Scalar vs. Superword Replacement

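Identifies array references to the same memory address
Replaces array references with scalar/superword variables

Original loop nest
  for (i = 1; i ...
    A[i][j] = A[i-1][j] + B[j];
    A[i+1][j] = A[i][j] + B[j];

Superword-level parallelization (4X)
  for (i = 1; i ...
    A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
    A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];

Scalar replacement (1.5X)
  for (i = 1; i ...
    T1 = B[j];
    T2 = A[i-1][j] + T1;
    A[i+1][j] = T2 + T1;
    A[i][j] = T2;

Superword replacement (1.5X; 6X combined with superword-level parallelization)
  for (i = 1; i ...
    SV1 = B[j:j+3];
    SV2 = A[i-1][j:j+3] + SV1;
    A[i+1][j:j+3] = SV2 + SV1;
    A[i][j:j+3] = SV2;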
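The scalar replacement step can be checked with a small C sketch. Several details are assumptions made only to get a runnable example: the loop headers are lost in the transcript, so the bounds, the stride of 2 on `i`, the array sizes `ROWS` and `COLS`, and the `+` operator are all guesses, and the function names are hypothetical.

```c
#define ROWS 9  /* assumed sizes, for illustration only */
#define COLS 8

/* Original body: 4 array reads and 2 array writes per (i, j) as written. */
void kernel_original(int A[ROWS][COLS], const int B[COLS]) {
    for (int i = 1; i + 1 < ROWS; i += 2)
        for (int j = 0; j < COLS; j++) {
            A[i][j] = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j];
        }
}

/* Scalar replacement: B[j] and the new A[i][j] are held in T1 and T2,
 * leaving 2 array reads and 2 array writes per (i, j). */
void kernel_scalar_replaced(int A[ROWS][COLS], const int B[COLS]) {
    for (int i = 1; i + 1 < ROWS; i += 2)
        for (int j = 0; j < COLS; j++) {
            int T1 = B[j];
            int T2 = A[i - 1][j] + T1;
            A[i + 1][j] = T2 + T1;
            A[i][j] = T2;
        }
}
```

Superword replacement applies the same idea with superword temporaries in place of `T1` and `T2`, so each temporary covers four adjacent values of `j`.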
19
Loop Skewing

Reshape iteration space to uncover parallelism
DO I = 1, 8
  DO J = 1, 4
S   A(I,J) = A(I-1,J) + A(I,J-1)
  ENDDO
ENDDO
Dependence distance vectors: (1,0) from A(I-1,J) and (0,1) from A(I,J-1)

20
Loop Skewing
I, J I-1,J I,J-1
---- ----- -----
1, 1 0, 1 1, 0
1, 2 0, 2 1, 1
1, 3 0, 3 1, 2
1, 4 0, 4 1, 3
2, 1 1, 1 2, 0
2, 2 1, 2 2, 1
2, 3 1, 3 2, 2
2, 4 1, 4 2, 3
3, 1 2, 1 3, 0
3, 2 2, 2 3, 1
3, 3 2, 3 3, 2
3, 4 2, 4 3, 3
4, 1 3, 1 4, 0
4, 2 3, 2 4, 1
...
Parallelism not apparent.
[Figure: iteration space with axes I and J]
21
Loop Skewing: j = I + J

DO I = 1, N
  DO j = I+1, I+N
S   A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
  ENDDO
ENDDO
Loop interchange to..
DO j = 2, N+N
  DO I = max(1,j-N), min(N,j-1)
S   A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
  ENDDO
ENDDO
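To see that the skewed and interchanged nest computes the same values as the original, the two orders can be compared in C. This is a sketch under assumptions: the problem size `N`, the use of row 0 and column 0 as boundary values, and the function names are chosen for illustration only.

```c
#define N 6  /* assumed problem size, for illustration only */

/* Original nest: I outer, J inner. Row 0 and column 0 hold
 * boundary values that are read but never written. */
void sweep_original(long A[N + 1][N + 1]) {
    for (int I = 1; I <= N; I++)
        for (int J = 1; J <= N; J++)
            A[I][J] = A[I - 1][J] + A[I][J - 1];
}

/* Skewed (j = I + J) and interchanged nest: the outer loop walks the
 * anti-diagonals; each inner iteration reads only diagonal j - 1,
 * so the iterations of the inner loop are independent. */
void sweep_skewed(long A[N + 1][N + 1]) {
    for (int j = 2; j <= N + N; j++) {
        int lo = (j - N > 1) ? j - N : 1;  /* max(1, j-N) */
        int hi = (j - 1 < N) ? j - 1 : N;  /* min(N, j-1) */
        for (int I = lo; I <= hi; I++)
            A[I][j - I] = A[I - 1][j - I] + A[I][j - I - 1];
    }
}
```

The inner loop of `sweep_skewed` is the one a vectorizer could target; its trip count changes with `j`, which is the varying vector length that makes skewing costly.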

22
Loop Skewing

The access pattern to A changes as follows.

I, J I-1,J I,J-1 (I, j)
---- ----- ----- -----
1, 1 0, 1 1, 0 (1, 2)
1, 2 0, 2 1, 1 (1, 3)
2, 1 1, 1 2, 0 (2, 3)
1, 3 0, 3 1, 2 (1, 4)
2, 2 1, 2 2, 1 (2, 4)
3, 1 2, 1 3, 0 (3, 4)
1, 4 0, 4 1, 3 (1, 5)
2, 3 1, 3 2, 2 (2, 5)
3, 2 2, 2 3, 1 (3, 5)
4, 1 3, 1 4, 0 (4, 5)
2, 4 1, 4 2, 3 (2, 6)
3, 3 2, 3 3, 2 (3, 6)
4, 2 3, 2 4, 1 (4, 6)
5, 1 4, 1 5, 0 (5, 6)
3, 4 2, 4 3, 3 (3, 7)
4, 3 3, 3 4, 2 (4, 7)
5, 2 4, 2 5, 1 (5, 7)

[Figure: skewed iteration space with axis I; wavefronts j = 2, 3, 4, 5, 6, 7]
23
Loop Skewing

Disadvantages
  Varying vector length
  Not profitable if Nj is small
  If vector startup time is more than speedup time, this is not profitable
  Vector loop bounds must be recomputed on each iteration of the outer loop
Apply loop skewing only if everything else fails
