Vectorization for Modern Architectures II

1
Vectorization for Modern Architectures (II)
Wednesday, July 22, 2009
  • Jaewook Shin

2
Want more?
  • Control flow
  • SLP in the presence of control flow
  • Introducing control flow back
  • Superword-Level Locality (SLL)

3
Control Flow and the SLP Compiler
for (i = 0; i < ...
Only parallelizes within a basic block!
4
One Approach
for (i = 0; i < ...
for (i = 0; i < ... 0, 0, 0)
    Vtemp = b[i:i+3] + (1, 1, 1, 1)
    b[i:i+3] = Combine b[i:i+3] and Vtemp according to Vcond
5
Key Concepts
  • Borrow from optimizations for architectures
    supporting predicated execution
  • Derive a large basic block of predicated
    instructions
  • SELECT operations merge data values for different
    control flow paths
  • Restore control flow

Steps, in order:
  • if-conversion
  • parallelize
  • remove superword predicates (SELECT)
  • remove scalar predicates (unpredicate)
6
If-Conversion
if (a != 0) b = b + 1;
cond = (a != 0);  pT, pF = pset(cond);
b = b + 1;  (if pT)
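The effect of if-conversion on this example can be imitated in plain C. This is only an illustrative emulation, not compiler output: the function names `increment_branchy` and `increment_if_converted` are hypothetical, and the predicate pT is modelled as a 0/1 integer because C has no predicated instructions.

```c
/* Original form: the increment sits behind a branch. */
int increment_branchy(int a, int b) {
    if (a != 0)
        b = b + 1;
    return b;
}

/* If-converted form (emulated): the branch is gone and a predicate
 * guards the instruction instead. pT is 1 when the condition holds,
 * mirroring pT, pF = pset(cond); pF would be its complement and is
 * not needed because there is no else path. */
int increment_if_converted(int a, int b) {
    int cond = (a != 0);
    int pT = cond;
    b = b + pT;   /* b = b + 1 takes effect only when pT is set */
    return b;
}
```

Both functions return the same value for every input; the second has a single straight-line basic block, which is the form a basic-block parallelizer can work on.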
7
SELECT instruction
dst = SELECT(src1, src2, predicate)
[Figure: SELECT merges the fields of src1 and src2 into dst according to the predicate]
Va = Vb + (1, 1, 1, 1)  (if Vp)
Vtemp = Vb + (1, 1, 1, 1)
Va = SELECT(Va, Vtemp, Vp)
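A field-by-field emulation may make the SELECT rewrite easier to follow. The sketch below assumes a superword of 4 integer fields and assumes SELECT takes the field from its second source where the predicate is set and from its first source otherwise; `select4` and `predicated_add_one` are hypothetical names, not instructions of any real ISA.

```c
#define VL 4  /* assumed superword length: 4 fields */

/* Emulated SELECT: per field, take src2 where the predicate is set,
 * otherwise keep src1. */
void select4(int dst[VL], const int src1[VL], const int src2[VL],
             const int pred[VL]) {
    for (int k = 0; k < VL; k++)
        dst[k] = pred[k] ? src2[k] : src1[k];
}

/* Va = Vb + (1,1,1,1) under predicate Vp, rewritten with SELECT. */
void predicated_add_one(int Va[VL], const int Vb[VL], const int Vp[VL]) {
    int Vtemp[VL];
    for (int k = 0; k < VL; k++)
        Vtemp[k] = Vb[k] + 1;      /* computed for every field */
    select4(Va, Va, Vtemp, Vp);    /* merged only where Vp is set */
}
```

Fields whose predicate is clear keep the old value of Va, so the unpredicated add is harmless even though it runs for all four fields.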
8
Unpredicate
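if (p) { bred[i] = fred; bgre[i] = fgre; bblu[i] = fblu; }
else { bred[i] = 100; bgre[i] = 100; bblu[i] = 100; }

bred[i] = fred (if p)    bred[i] = 100 (if !p)
bgre[i] = fgre (if p)    bgre[i] = 100 (if !p)
bblu[i] = fblu (if p)    bblu[i] = 100 (if !p)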
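The two shapes on this slide can be compared in C. This is a rough sketch under the assumption that the predicated sequence interleaves instructions guarded by p and by its complement; the function names are hypothetical and each `if` on a single statement stands in for one predicated instruction.

```c
/* Predicated form (emulated): every instruction carries its own guard,
 * so the guard is evaluated six times. */
void store_predicated(int p, int i, int bred[], int bgre[], int bblu[],
                      int fred, int fgre, int fblu) {
    if (p)  bred[i] = fred;
    if (!p) bred[i] = 100;
    if (p)  bgre[i] = fgre;
    if (!p) bgre[i] = 100;
    if (p)  bblu[i] = fblu;
    if (!p) bblu[i] = 100;
}

/* Unpredicated form: instructions with the same guard are regrouped
 * under one branch, restoring ordinary control flow. */
void store_unpredicated(int p, int i, int bred[], int bgre[], int bblu[],
                        int fred, int fgre, int fblu) {
    if (p) {
        bred[i] = fred;
        bgre[i] = fgre;
        bblu[i] = fblu;
    } else {
        bred[i] = 100;
        bgre[i] = 100;
        bblu[i] = 100;
    }
}
```

The regrouping is legal here because no instruction guarded by p depends on one guarded by !p, so their relative order does not matter.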
9
Extend for Control Flow
for (i = 0; i < ...
for (i = 0; i < ... 0, 0, 0)
    old = b[i:i+3]
    new = old + (1, 1, 1, 1)
    b[i:i+3] = SELECT(old, new, pred)
Overhead: Both control flow paths are always executed!
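As a rough illustration, the parallelized loop can be emulated in C with arrays of 4 fields standing in for superwords. Several things here are assumptions rather than part of the slide: the scalar loop body is taken to be `if (a[i] != 0) b[i] = b[i] + 1`, the trip count `n` is a parameter that must be a multiple of 4, and the function names are hypothetical.

```c
#include <assert.h>

#define VL 4  /* assumed superword length */

/* Scalar reference: the increment is control dependent on a[i]. */
void increment_scalar(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] != 0)
            b[i] = b[i] + 1;
}

/* Emulated superword version: the add runs for all VL fields and a
 * per-field select decides which result is stored. */
void increment_select(const int *a, int *b, int n) {
    assert(n % VL == 0);
    for (int i = 0; i < n; i += VL) {
        int pred[VL], old[VL], newv[VL];
        for (int k = 0; k < VL; k++) pred[k] = (a[i + k] != 0);
        for (int k = 0; k < VL; k++) old[k] = b[i + k];
        for (int k = 0; k < VL; k++) newv[k] = old[k] + 1;
        for (int k = 0; k < VL; k++) b[i + k] = pred[k] ? newv[k] : old[k];
    }
}
```

The load, add, select, and store run on every iteration regardless of `pred`, which is the overhead the slide points out.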
10
An Optimization: Branch-On-Superword-Condition-Code
for (i = 0; i < ... 0, 0, 0)
    branch-on-none(pred) L1
    old = b[i:i+3]
    new = old + (1, 1, 1, 1)
    b[i:i+3] = SELECT(old, new, pred)
L1:
11
Branch-On-Superword-Condition-Code (BOSCC)
branch-on-none( src )
branch-on-all( src )
12
An Optimization: Branch-On-Superword-Condition-Code
for (i = 0; i < ... 0, 0, 0)
    branch-on-none(pred) L1    (bypass)
    old = b[i:i+3]
    new = old + (1, 1, 1, 1)
    b[i:i+3] = SELECT(old, new, pred)
L1:
Overhead can be reduced for some input data sets,
but BOSCC can also increase overhead because it
introduces a branch instruction.
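A small C emulation can show what the bypass buys. It is a sketch under assumptions: the predicate is taken to be `a[i] != 0` per field, the superword length is 4, `n` must be a multiple of 4, and `increment_boscc` is a hypothetical name. The `continue` plays the role of the branch to L1, and the return value counts how many superwords were bypassed.

```c
#include <assert.h>

#define VL 4  /* assumed superword length */

/* Emulated branch-on-none test: true when no field of pred is set. */
static int none_set(const int pred[VL]) {
    for (int k = 0; k < VL; k++)
        if (pred[k])
            return 0;
    return 1;
}

/* Predicated increment with a BOSCC guarding the superword body.
 * Returns the number of superwords whose body was skipped. */
int increment_boscc(const int *a, int *b, int n) {
    assert(n % VL == 0);
    int bypassed = 0;
    for (int i = 0; i < n; i += VL) {
        int pred[VL];
        for (int k = 0; k < VL; k++)
            pred[k] = (a[i + k] != 0);
        if (none_set(pred)) {   /* branch-on-none(pred) L1 */
            bypassed++;
            continue;           /* L1: nothing to update */
        }
        for (int k = 0; k < VL; k++) {
            int old = b[i + k];
            int newv = old + 1;
            b[i + k] = pred[k] ? newv : old;   /* SELECT(old, new, pred) */
        }
    }
    return bypassed;
}
```

When most superwords contain at least one true field, the count stays near zero and the extra test is pure cost, which matches the caveat on the slide.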
13
Understanding BOSCC profitability
  • Consider the following kernel

for (i = 0; i < ... (temp ... B[i]) C[i] = temp ... D[i]
14
Sensitivity to True Density
Vector length = 4; data fit in L1 cache.
15
Exponential Decrease in Taken BOSCC
  • A branch-on-none is taken when all fields in a
    superword are false.
  • Given a true density D, the probability for one
    field to be false is 1 - D.
  • The probability for four fields in a superword to
    be false at the same time is (1 - D)^4, assuming
    the fields are independent.
  • A branch cost is highest when its percentage to
    be taken is around 50%.
  • When the vector length is 4, BOSCCs are taken 50%
    of the time at a scalar true density of about 16%.
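The 16% figure can be checked numerically. The sketch below assumes, as the slide's product formula implicitly does, that the fields of a superword are independent; the function names are hypothetical, and the break-even density is found by bisection to avoid any library dependency.

```c
/* Probability that a branch-on-none is taken: all vl fields false. */
double boscc_taken_probability(double density, int vl) {
    double p = 1.0;
    for (int k = 0; k < vl; k++)
        p *= (1.0 - density);
    return p;
}

/* True density at which the branch is taken half the time. The taken
 * probability falls monotonically as density rises, so bisection on
 * [0, 1] converges. */
double density_for_half_taken(int vl) {
    double lo = 0.0, hi = 1.0;
    for (int it = 0; it < 60; it++) {
        double mid = 0.5 * (lo + hi);
        if (boscc_taken_probability(mid, vl) > 0.5)
            lo = mid;   /* still taken too often: density must be higher */
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For a vector length of 4, `density_for_half_taken(4)` comes out at roughly 0.159, that is, 1 - 0.5^(1/4), consistent with the 16% quoted above.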

16
An Observation: By Using BOSCCs ...
  • Assume a simple model of parallelization where
    one scalar instruction is mapped to one parallel
    instruction, and the cost of BOSCCs is zero.
  • We guarantee that each parallel instruction is
    executed once iff the corresponding scalar
    instruction is executed at least once within
    vector-length consecutive iterations.
  • Under this model, parallelized code is never
    slower than the scalar original.
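The counting argument behind this observation can be made concrete. The sketch below uses the slide's simplified model (one parallel instruction per scalar instruction, free BOSCCs) and assumes a vector length of 4; `pred[i]` says whether the guarded scalar instruction executes in iteration i, and both function names are hypothetical.

```c
#include <assert.h>

#define VL 4  /* assumed vector length */

/* Scalar code: the guarded instruction runs once per true iteration. */
int scalar_executions(const int *pred, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (pred[i])
            count++;
    return count;
}

/* Parallel code with zero-cost BOSCCs: the superword instruction runs
 * once per group of VL iterations that contains at least one true. */
int superword_executions(const int *pred, int n) {
    assert(n % VL == 0);
    int count = 0;
    for (int i = 0; i < n; i += VL) {
        int any = 0;
        for (int k = 0; k < VL; k++)
            any |= (pred[i + k] != 0);
        count += any;
    }
    return count;
}
```

Every group counted by the second function contains at least one iteration counted by the first, so the parallel count can never exceed the scalar count.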

17
Superword-Level Locality (SLL)
  • Definition: exploit data reuse in superword
    registers
  • Large capacity register file is used as a
    compiler controlled cache.
  • Differences from data reuse in caches:
    eliminates memory access cycles completely;
    storage has to be named explicitly
  • Differences from data reuse in scalar registers:
    spatial reuse in superword registers

[Figure: superword register files. AltiVec: 32 registers of 128 bits; DIVA: 32 registers of 256 bits]
18
Scalar vs. Superword Replacement
  • Identifies array references to the same memory
    address
  • Replaces array references with scalar/superword
    variables

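Original loop nest:
for (i = 1; i < ...
    A[i][j] = A[i-1][j] + B[j]
    A[i+1][j] = A[i][j] + B[j]

Superword-level parallelization:
for (i = 1; i < ...
    A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3]
    A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3]

Scalar replacement:
for (i = 1; i < ...
    T1 = B[j]
    T2 = A[i-1][j] + T1
    A[i+1][j] = T2 + T1
    A[i][j] = T2

Superword replacement:
for (i = 1; i < ...
    SV1 = B[j:j+3]
    SV2 = A[i-1][j:j+3] + SV1
    A[i+1][j:j+3] = SV2 + SV1
    A[i][j:j+3] = SV2

Speedup labels in the figure: 4X, 1.5X, 1.5X, 6X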
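Scalar replacement on this loop nest can be sketched in C. Much of the detail is assumed because the slide's loop headers and operators did not survive: the operator is taken to be `+`, the outer loop is assumed to step by 2 since each iteration writes rows i and i+1, and `COLS` and the function names are hypothetical.

```c
#define COLS 4  /* hypothetical row width for the sketch */

/* Original form: B[j] is loaded twice and A[i][j] is stored and then
 * loaded again within the same iteration. */
void update_original(int A[][COLS], const int B[COLS], int rows) {
    for (int i = 1; i + 1 < rows; i += 2)
        for (int j = 0; j < COLS; j++) {
            A[i][j] = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j];
        }
}

/* Scalar-replaced form: the repeated references are held in the
 * temporaries T1 and T2, so each memory location is touched once. */
void update_scalar_replaced(int A[][COLS], const int B[COLS], int rows) {
    for (int i = 1; i + 1 < rows; i += 2)
        for (int j = 0; j < COLS; j++) {
            int T1 = B[j];
            int T2 = A[i - 1][j] + T1;
            A[i + 1][j] = T2 + T1;
            A[i][j] = T2;
        }
}
```

Superword replacement applies the same idea with superword registers in place of T1 and T2, so four adjacent j values are reused at once.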
19
Loop Skewing
  • Reshape iteration space to uncover parallelism
DO I = 1, 8
  DO J = 1, 4
S:  A(I,J) = A(I-1,J) + A(I,J-1)
  ENDDO
ENDDO
Dependence distance vectors: (1,0) and (0,1)

20
Loop Skewing
Accesses to A in original iteration order:
I, J    I-1, J   I, J-1
1, 1    0, 1     1, 0
1, 2    0, 2     1, 1
1, 3    0, 3     1, 2
1, 4    0, 4     1, 3
2, 1    1, 1     2, 0
2, 2    1, 2     2, 1
2, 3    1, 3     2, 2
2, 4    1, 4     2, 3
3, 1    2, 1     3, 0
3, 2    2, 2     3, 1
3, 3    2, 3     3, 2
3, 4    2, 4     3, 3
4, 1    3, 1     4, 0
4, 2    3, 2     4, 1
...
Parallelism not apparent.
[Figure: iteration space with axes I and J]
21
Loop Skewing: j = I + J
DO I = 1, N
  DO j = I+1, I+N
S:  A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
  ENDDO
ENDDO
Loop interchange to:
DO j = 2, N+N
  DO I = max(1,j-N), min(N,j-1)
S:  A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
  ENDDO
ENDDO
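A C rendering of the two loop nests makes it easy to confirm that skewing plus interchange preserves the result. This is a minimal sketch: the problem size `N`, the `long` element type, and the function names are hypothetical, and row 0 and column 0 of the array serve as the boundary values that the recurrence reads.

```c
#define N 4  /* hypothetical problem size for the sketch */

static int max_int(int x, int y) { return x > y ? x : y; }
static int min_int(int x, int y) { return x < y ? x : y; }

/* Original nest: both loops carry a dependence. */
void sweep_original(long A[N + 1][N + 1]) {
    for (int I = 1; I <= N; I++)
        for (int J = 1; J <= N; J++)
            A[I][J] = A[I - 1][J] + A[I][J - 1];
}

/* Skewed (j = I + J) and interchanged nest: for a fixed j, every
 * iteration of the inner loop reads only values produced at j - 1,
 * so the inner iterations are independent of one another. */
void sweep_skewed(long A[N + 1][N + 1]) {
    for (int j = 2; j <= N + N; j++)
        for (int I = max_int(1, j - N); I <= min_int(N, j - 1); I++)
            A[I][j - I] = A[I - 1][j - I] + A[I][j - I - 1];
}
```

The inner trip count grows from 1 up to N and shrinks back to 1 as j advances, so the vector length varies from one outer iteration to the next.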

22
Loop Skewing
  • The access pattern to A changes as follows.

I, J    I-1, J   I, J-1   (I, j)
----    ------   ------   ------
1, 1    0, 1     1, 0     (1, 2)
1, 2    0, 2     1, 1     (1, 3)
2, 1    1, 1     2, 0     (2, 3)
1, 3    0, 3     1, 2     (1, 4)
2, 2    1, 2     2, 1     (2, 4)
3, 1    2, 1     3, 0     (3, 4)
1, 4    0, 4     1, 3     (1, 5)
2, 3    1, 3     2, 2     (2, 5)
3, 2    2, 2     3, 1     (3, 5)
4, 1    3, 1     4, 0     (4, 5)
2, 4    1, 4     2, 3     (2, 6)
3, 3    2, 3     3, 2     (3, 6)
4, 2    3, 2     4, 1     (4, 6)
5, 1    4, 1     5, 0     (5, 6)
3, 4    2, 4     3, 3     (3, 7)
4, 3    3, 3     4, 2     (4, 7)
5, 2    4, 2     5, 1     (5, 7)

[Figure: skewed iteration space with axes I and j = 2, 3, 4, 5, 6, 7]
23
Loop Skewing
  • Disadvantages:
  • Varying vector length
  • Not profitable if Nj is small
  • If vector startup time is more than speedup time,
    this is not profitable
  • Vector loop bounds must be recomputed on each
    iteration of outer loop
  • Apply loop skewing only if everything else fails

24
References
  • J. Shin, M. Hall and J. Chame, Superword-Level
    Parallelism in the Presence of Control Flow, CGO
    2005
  • J. Shin, M. Hall and J. Chame, Compiler-Controlled
    Caching in Superword Register Files for
    Multimedia Extension Architectures, PACT 2002
  • J. Shin, J. Chame and M. Hall, Evaluating
    Compiler Technology for Control-Flow
    Optimizations for Multimedia Extension, MSP6,
    2004