Title: Affine Partitioning for Parallelism
1 Affine Partitioning for Parallelism & Locality
- Amy Lim
- Stanford University
- http://suif.stanford.edu/
2 Useful Transforms for Parallelism & Locality
- INTERCHANGE
    FOR i                      FOR j
      FOR j                      FOR i
        A[i,j] =                   A[i,j] =
- REVERSAL
    FOR i = 1 to n             FOR i = n downto 1
      A[i] =                     A[i] =
- SKEWING
    FOR i = 1 TO n             FOR i = 1 TO n
      FOR j = 1 TO n             FOR k = i+1 to i+n
        A[i,j] =                   A[i,k-i] =
- FUSION/FISSION
    FOR i = 1 TO n             FOR i = 1 TO n
      A[i] =                     A[i] =
    FOR i = 1 TO n               B[i] =
      B[i] =
- REINDEXING
    FOR i = 1 to n             A[1] = B[0]
      A[i] = B[i-1]            FOR i = 1 to n-1
      C[i] = A[i]+1              A[i+1] = B[i]
                                 C[i] = A[i]+1
                               C[n] = A[n]+1
- Traditional approach: is it legal & desirable to apply one transform?
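As a small sanity check of the SKEWING and REVERSAL entries, the following Python sketch (not part of the slides; the function names are illustrative) enumerates the array elements each loop form touches. It assumes the skewed inner loop runs k = i+1 to i+n and accesses A[i,k-i], as reconstructed from the slide.

```python
def original_order(n):
    """Elements touched by: FOR i = 1 TO n / FOR j = 1 TO n / A[i,j]."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]


def skewed_order(n):
    """Elements touched by: FOR i = 1 TO n / FOR k = i+1 TO i+n / A[i,k-i]."""
    return [(i, k - i) for i in range(1, n + 1) for k in range(i + 1, i + n + 1)]


def forward_order(n):
    """Elements touched by: FOR i = 1 TO n / A[i]."""
    return list(range(1, n + 1))


def reversed_order(n):
    """Elements touched by: FOR i = n DOWNTO 1 / A[i]."""
    return list(range(n, 0, -1))


def same_elements(xs, ys):
    """True if both loop forms touch exactly the same set of elements."""
    return sorted(xs) == sorted(ys)
```

Skewing only renames the inner index, so the two forms visit the same elements in the same order. Reversal visits the same elements in the opposite order, which is why its legality depends on the dependences in the loop body.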
3 Question: How to combine the transformations?
- Affine mappings [Lim & Lam, POPL 97, ICS 99]
- Domain: arbitrary loop nesting, affine loop indices; each instruction optimized separately
- Unifies
- Permutation
- Skewing
- Reversal
- Fusion
- Fission
- Statement reordering
- Supports blocking across all (non-perfectly nested) loops
- Optimal: max. degree of parallelism & min. degree of synchronization
- Minimize communication by aligning the computation and pipelining
4 Loop Transforms: Cholesky factorization example
- DO 1 J = 0, N
-   I0 = MAX ( -M, -J )
-   DO 2 I = I0, -1
-     DO 3 JJ = I0 - I, -1
-       DO 3 L = 0, NMAT
- 3       A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
-     DO 2 L = 0, NMAT
- 2     A(L,I,J) = A(L,I,J) * A(L,0,I+J)
-   DO 4 L = 0, NMAT
- 4   EPSS(L) = EPS * A(L,0,J)
-   DO 5 JJ = I0, -1
-     DO 5 L = 0, NMAT
- 5     A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
-   DO 1 L = 0, NMAT
- 1   A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
- DO 6 I = 0, NRHS
-   DO 7 K = 0, N
-     DO 8 L = 0, NMAT
5 Results for Optimizing Perfect Nests
Speedup on a Digital TurboLaser with eight 300 MHz 21164 processors
6 Optimizing Arbitrary Loop Nesting Using Affine Partitions
- DO 1 J = 0, N
-   I0 = MAX ( -M, -J )
-   DO 2 I = I0, -1
-     DO 3 JJ = I0 - I, -1
-       DO 3 L = 0, NMAT
- 3       A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
-     DO 2 L = 0, NMAT
- 2     A(L,I,J) = A(L,I,J) * A(L,0,I+J)
-   DO 4 L = 0, NMAT
- 4   EPSS(L) = EPS * A(L,0,J)
-   DO 5 JJ = I0, -1
-     DO 5 L = 0, NMAT
- 5     A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
-   DO 1 L = 0, NMAT
- 1   A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
- DO 6 I = 0, NRHS
-   DO 7 K = 0, N
-     DO 8 L = 0, NMAT
[Figure labels: arrays A, B, and EPSS, each marked with index L]
7 Results with Affine Partitioning & Blocking
8 A Simple Example
- FOR i = 1 TO n DO
-   FOR j = 1 TO n DO
-     A[i,j] = A[i,j] + B[i-1,j]   (S1)
-     B[i,j] = A[i,j-1] * B[i,j]   (S2)
[Figure: instances of S1 and S2 in the (i, j) iteration space]
9 Best Parallelization Scheme
- SPMD code: Let p be the processor's ID number
- if (1-n <= p <= n) then
-   if (1 <= p) then
-     B[p,1] = A[p,0] * B[p,1]   (S2)
-   for i1 = max(1,1+p) to min(n,n-1+p) do
-     A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]   (S1)
-     B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1]   (S2)
-   if (p <= 0) then
-     A[n+p,n] = A[n+p,n] + B[n+p-1,n]   (S1)
- Solution can be expressed as affine partitions
- S1: Execute iteration (i, j) on processor i-j.
- S2: Execute iteration (i, j) on processor i-j+1.
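The SPMD scheme above can be checked by brute force. The sketch below is not from the slides. It assumes the loop body is S1: A[i,j] = A[i,j] + B[i-1,j] and S2: B[i,j] = A[i,j-1] * B[i,j], and the `+` and `*` operators are a reconstruction. It runs the sequential loop nest and the per-processor code on identical integer arrays and compares the results. The helper names and the initial array values are illustrative.

```python
def make_arrays(n):
    """Two (n+1) x (n+1) integer arrays; row 0 and column 0 act as boundaries."""
    A = [[i + 2 * j + 1 for j in range(n + 1)] for i in range(n + 1)]
    B = [[(i * j) % 3 + 1 for j in range(n + 1)] for i in range(n + 1)]
    return A, B


def sequential(n, A, B):
    """The original two-statement loop nest."""
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i][j] = A[i][j] + B[i - 1][j]          # S1
            B[i][j] = A[i][j - 1] * B[i][j]          # S2


def spmd(n, A, B, p):
    """Work done by processor p: S1 instances with i-j == p, S2 with i-j+1 == p."""
    if 1 - n <= p <= n:
        if 1 <= p:
            B[p][1] = A[p][0] * B[p][1]                              # S2 at (p, 1)
        for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
            A[i1][i1 - p] = A[i1][i1 - p] + B[i1 - 1][i1 - p]        # S1
            B[i1][i1 - p + 1] = A[i1][i1 - p] * B[i1][i1 - p + 1]    # S2
        if p <= 0:
            A[n + p][n] = A[n + p][n] + B[n + p - 1][n]              # S1 at (n+p, n)


def check(n):
    """True if running every processor, in an arbitrary order, matches the loop."""
    A1, B1 = make_arrays(n)
    sequential(n, A1, B1)
    A2, B2 = make_arrays(n)
    for p in reversed(range(1 - n, n + 1)):   # the order should not matter
        spmd(n, A2, B2, p)
    return A1 == A2 and B1 == B2
```

Processors are visited in reverse ID order on purpose: if the partition really is communication-free, no processor reads a value produced by another, so any order gives the sequential result.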
10 Maximum Parallelism & No Communication
- Let $F_{xj}$ be an access to array $x$ in statement $j$,
- $i_j$ be an iteration index for statement $j$,
- $B_j i_j \ge 0$ represent loop bound constraints for statement $j$,
- Find $C_j$ which maps an instance of statement $j$ to a processor:
- $\forall\, i_j, i_k:\; B_j i_j \ge 0,\; B_k i_k \ge 0,$
- $F_{xj}(i_j) = F_{xk}(i_k) \;\Rightarrow\; C_j(i_j) = C_k(i_k)$
- with the objective of maximizing the rank of $C_j$
11 Algorithm
- $\forall\, i_j, i_k:\; B_j i_j \ge 0,\; B_k i_k \ge 0,$
- $F_{xj}(i_j) = F_{xk}(i_k) \;\Rightarrow\; C_j(i_j) = C_k(i_k)$
- Rewrite partition constraints as systems of linear equations
- use affine form of Farkas' Lemma to rewrite constraints as systems of linear inequalities in $C$ and $\lambda$
- use Fourier-Motzkin algorithm to eliminate Farkas multipliers $\lambda$ and get systems of linear equations $AC = 0$
- Find solutions using linear algebra techniques
- the null space for matrix $A$ is a solution of $C$ with maximum rank.
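The last step can be illustrated on the two-statement example (S1 writes A[i,j] and reads B[i-1,j]; S2 writes B[i,j] and reads A[i,j-1]). The sketch below is a simplification, not the slides' algorithm: it skips the Farkas and Fourier-Motzkin steps by requiring the equalities to hold for all (i, j), which is enough for this small case. With unknowns C1(i,j) = c11*i + c12*j + c1 and C2(i,j) = c21*i + c22*j + c2, matching accesses to A gives C1(i,j) = C2(i,j+1) and matching accesses to B gives C1(i,j) = C2(i-1,j). The names `null_space` and `A_rows` are illustrative.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of {x : rows * x = 0}, via exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    ncols = len(m[0])
    pivots = []                      # pivot column of each reduced row
    r = 0
    for c in range(ncols):
        if r == len(m):
            break
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        piv = m[r][c]
        m[r] = [v / piv for v in m[r]]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                f = m[k][c]
                m[k] = [a - f * b for a, b in zip(m[k], m[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        x = [Fraction(0)] * ncols
        x[free] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            x[pc] = -m[row_idx][free]
        basis.append(x)
    return basis


# Unknown order: c11, c12, c1, c21, c22, c2
A_rows = [
    [1, 0, 0, -1, 0, 0],    # coefficient of i:  c11 = c21
    [0, 1, 0, 0, -1, 0],    # coefficient of j:  c12 = c22
    [0, 0, 1, 0, -1, -1],   # constants from A:  c1 = c22 + c2
    [0, 0, 1, 1, 0, -1],    # constants from B:  c1 = -c21 + c2
]
basis = null_space(A_rows)


def satisfies(x):
    """True if x solves every row of the constraint system."""
    return all(sum(a * b for a, b in zip(row, x)) == 0 for row in A_rows)
```

One direction of the null space is the trivial solution that shifts both mappings by the same constant. The remaining direction has a nonzero linear part and corresponds, up to sign and a constant shift, to S1 on processor i-j and S2 on processor i-j+1.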
12 Pipelining: Alternating Direction Integration Example
- Requires transposing data
- DO J = 1 to N (parallel)
-   DO I = 1 to N
-     A(I,J) = f(A(I,J),A(I-1,J))
- DO J = 1 to N
-   DO I = 1 to N (parallel)
-     A(I,J) = g(A(I,J),A(I,J-1))
- Moves only boundary data
- DO J = 1 to N (parallel)
-   DO I = 1 to N
-     A(I,J) = f(A(I,J),A(I-1,J))
- DO J = 1 to N (pipelined)
-   DO I = 1 to N
-     A(I,J) = g(A(I,J),A(I,J-1))
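The pipelined variant can be simulated sequentially to see why it needs only boundary data. In the sketch below (not from the slides), processor J keeps column J for both phases. In the second phase, at time step t, processor J updates row I = t - J, which needs only A(I,J-1) from its left neighbor. The functions `f` and `g` are hypothetical integer stand-ins for the unspecified update functions, and the time-step numbering is an assumption.

```python
def f(a, up):
    """Hypothetical stand-in for the first sweep's update."""
    return a + 2 * up


def g(a, left):
    """Hypothetical stand-in for the second sweep's update."""
    return 3 * a + left


def make_grid(n):
    """(n+1) x (n+1) grid; row 0 and column 0 hold boundary values."""
    return [[i * i + j + 1 for j in range(n + 1)] for i in range(n + 1)]


def sequential(A, n):
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            A[i][j] = f(A[i][j], A[i - 1][j])
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            A[i][j] = g(A[i][j], A[i][j - 1])


def pipelined(A, n):
    # Phase 1: columns are independent, so processor j sweeps column j.
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            A[i][j] = f(A[i][j], A[i - 1][j])
    # Phase 2: same data layout; work proceeds as a wavefront over time steps.
    for t in range(2, 2 * n + 1):
        for j in range(1, n + 1):        # conceptually, all processors at once
            i = t - j
            if 1 <= i <= n:
                # A[i][j-1] was produced by processor j-1 at step t-1.
                A[i][j] = g(A[i][j], A[i][j - 1])


def check(n):
    A1, A2 = make_grid(n), make_grid(n)
    sequential(A1, n)
    pipelined(A2, n)
    return A1 == A2
```

Each processor receives one element per time step from its neighbor, which is the boundary-only traffic the slide contrasts with transposing the whole array.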
13 Finding the Maximum Degree of Pipelining
[Figure: loop iterations mapped to array elements by F1(i1) and F2(i2), and to time stages by T1(i1) and T2(i2)]
- Let $F_{xj}$ be an access to array $x$ in statement $j$,
- $i_j$ be an iteration index for statement $j$,
- $B_j i_j \ge 0$ represent loop bound constraints for statement $j$,
- Find $T_j$ which maps an instance of statement $j$ to a time stage:
- $\forall\, i_j, i_k:\; B_j i_j \ge 0,\; B_k i_k \ge 0,$
- $(i_j \prec i_k) \wedge (F_{xj}(i_j) = F_{xk}(i_k)) \;\Rightarrow\; T_j(i_j) \le T_k(i_k)$, where $\prec$ means "lexicographically less than"
- with the objective of maximizing the rank of $T_j$
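The time-mapping constraint can be tested by brute force on the two-statement example (S1: A[i,j] from A[i,j] and B[i-1,j]; S2: B[i,j] from A[i,j-1] and B[i,j]). The sketch below is not from the slides. It enumerates statement instances in sequential order and checks that every pair touching the same array element is mapped to non-decreasing time stages. Treating a pair as conflicting only when at least one access is a write is an assumption of this sketch, and all names are illustrative.

```python
def instances(n):
    """Statement instances in sequential order: (stmt, i, j, reads, writes)."""
    out = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            out.append(("S1", i, j,
                        {("A", i, j), ("B", i - 1, j)}, {("A", i, j)}))
            out.append(("S2", i, j,
                        {("A", i, j - 1), ("B", i, j)}, {("B", i, j)}))
    return out


def is_legal(T, n):
    """True if T(earlier) <= T(later) for every pair of dependent instances."""
    inst = instances(n)
    for a in range(len(inst)):
        s1, i1, j1, r1, w1 = inst[a]
        for b in range(a + 1, len(inst)):
            s2, i2, j2, r2, w2 = inst[b]
            conflict = (w1 & (r2 | w2)) or (r1 & w2)
            if conflict and T(s1, i1, j1) > T(s2, i2, j2):
                return False
    return True


def time_by_i(stmt, i, j):
    return i


def time_by_j(stmt, i, j):
    return j


def time_by_minus_i(stmt, i, j):
    return -i
```

Under these assumptions, mapping by i and mapping by j both pass the check, while mapping by -i does not. The two legal mappings are linearly independent, so together they form a time mapping of rank 2 for this example.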
14 Key Insight
- Choice in time mapping => (pipelined) parallelism
- Degrees of parallelism = rank(T) - 1
15 Putting it All Together
- Find maximum outer-loop parallelism with minimum synchronization
- Divide into strongly connected components
- Apply processor mapping algorithm (no communication) to program
- If no parallelism found,
- Apply time mapping algorithm to find pipelining
- If no pipelining found (found outer sequential loop)
- Repeat process on inner loops
- Minimize communication
- Use a greedy method to order communicating pairs
- Try to find communication-free, or neighborhood-only, communication by solving similar equations
- Aggregate computations of consecutive data to improve spatial locality
16 Use of Affine Partitioning in Locality Opt.
- Promotes array contraction
- Finds independent threads and shortens the live ranges of variables
- Supports blocking of imperfectly nested loops
- Finds largest fully permutable loop nest via affine partitioning
- Fully permutable loop nest -> blockable