Affine Partitioning for Parallelism - PowerPoint PPT Presentation

About This Presentation

Title:

Affine Partitioning for Parallelism

Description:

Use affine form of Farkas' Lemma to rewrite constraints as ... Finds largest fully permutable loop nest via affine partitioning ...

Slides: 17

Provided by: Monic79

Learn more at: https://suif.stanford.edu

Transcript and Presenter's Notes

Title: Affine Partitioning for Parallelism

1
Affine Partitioning for Parallelism & Locality

Amy Lim
Stanford University
http://suif.stanford.edu/

2
Useful Transforms for Parallelism & Locality

INTERCHANGE
  FOR i                        FOR j
    FOR j                        FOR i
      A[i,j]                       A[i,j]

REVERSAL
  FOR i = 1 to n               FOR i = n downto 1
    A[i]                         A[i]

SKEWING
  FOR i = 1 TO n               FOR i = 1 TO n
    FOR j = 1 TO n               FOR k = i+1 to i+n
      A[i,j]                       A[i,k-i]

FUSION/FISSION
  FOR i = 1 TO n               FOR i = 1 TO n
    A[i]                         A[i]
  FOR i = 1 TO n                 B[i]
    B[i]

REINDEXING
  FOR i = 1 to n               A[1] = B[0]
    A[i] = B[i-1]              FOR i = 1 to n-1
    C[i] = A[i+1]                A[i+1] = B[i]
                                 C[i] = A[i+1]
                               C[n] = A[n+1]

Traditional approach: is it legal & desirable to apply one transform?
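The skewing transform above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides, that enumerates the array elements touched by the original nest and by the skewed nest (with k = i + j) and confirms they match. The function names are illustrative assumptions.

```python
def original_nest(n):
    """Elements A[i, j] touched by: FOR i = 1 TO n, FOR j = 1 TO n."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]


def skewed_nest(n):
    """Elements A[i, k-i] touched by: FOR i = 1 TO n, FOR k = i+1 TO i+n."""
    return [(i, k - i) for i in range(1, n + 1) for k in range(i + 1, i + n + 1)]


def interchanged_nest(n):
    """Elements A[i, j] touched by: FOR j = 1 TO n, FOR i = 1 TO n."""
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]


if __name__ == "__main__":
    n = 4
    # Skewing renames the inner index but visits the same elements in the same order.
    print(original_nest(n) == skewed_nest(n))
    # Interchange visits the same set of elements in a different order.
    print(sorted(original_nest(n)) == sorted(interchanged_nest(n)))
```

Skewing alone does not reorder iterations; it changes the shape of the iteration space so that a later interchange or blocking step becomes legal or profitable.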

3
Question: How to combine the transformations?

Affine mappings [Lim & Lam, POPL 97, ICS 99]
Domain: arbitrary loop nesting, affine loop indices; each instruction optimized separately
Unifies
  Permutation
  Skewing
  Reversal
  Fusion
  Fission
  Statement reordering
Supports blocking across all (non-perfectly nested) loops
Optimal: maximum degree of parallelism & minimum degree of synchronization
Minimize communication by aligning the computation and pipelining

4
Loop Transforms: Cholesky factorization example

      DO 1 J = 0, N
        I0 = MAX ( -M, -J )
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
            DO 3 L = 0, NMAT
    3         A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
    2       A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
    4     EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
          DO 5 L = 0, NMAT
    5       A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
    1     A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT

5
Results for Optimizing Perfect Nests
Speedup on a Digital Turbolaser with 8 300 MHz 21164 processors

6
Optimizing Arbitrary Loop Nesting Using Affine Partitions

      DO 1 J = 0, N
        I0 = MAX ( -M, -J )
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
            DO 3 L = 0, NMAT
    3         A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
    2       A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
    4     EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
          DO 5 L = 0, NMAT
    5       A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
    1     A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT

[Figure labels: arrays A, B, and EPSS, each marked with dimension L]

7
Results with Affine Partitioning + Blocking

8
A Simple Example

FOR i = 1 TO n DO
  FOR j = 1 TO n DO
    A[i,j] = A[i,j] + B[i-1,j]    (S1)
    B[i,j] = A[i,j-1] + B[i,j]    (S2)

[Figure: iteration space with axes i and j, showing instances of S1 and S2]

9
Best Parallelization Scheme

SPMD code: let p be the processor's ID number

if (1-n <= p <= n) then
  if (1 <= p) then
    B[p,1] = A[p,0] + B[p,1]                        (S2)
  for i1 = max(1,1+p) to min(n,n-1+p) do
    A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]          (S1)
    B[i1,i1-p+1] = A[i1,i1-p] + B[i1,i1-p+1]        (S2)
  if (p <= 0) then
    A[n+p,n] = A[n+p,n] + B[n+p-1,n]                (S1)

Solution can be expressed as affine partitions:
S1: Execute iteration (i, j) on processor i-j.
S2: Execute iteration (i, j) on processor i-j+1.
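The claim that these two partitions need no communication can be checked by brute force. The sketch below is an illustration, not the authors' implementation. It assumes S1 at iteration (i, j) touches A[i,j] and B[i-1,j], and S2 touches B[i,j] and A[i,j-1], then verifies that every array element is only ever touched by one processor. The helper names and the `s2_offset` parameter are hypothetical.

```python
def accesses(n, s2_offset=1):
    """Yield (processor, array, index) for every access in the n x n nest.

    S1 at (i, j) runs on processor i - j.
    S2 at (i, j) runs on processor i - j + s2_offset (1 in the partition above).
    """
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            p1 = i - j
            p2 = i - j + s2_offset
            yield p1, "A", (i, j)        # S1 reads and writes A[i,j]
            yield p1, "B", (i - 1, j)    # S1 reads B[i-1,j]
            yield p2, "B", (i, j)        # S2 reads and writes B[i,j]
            yield p2, "A", (i, j - 1)    # S2 reads A[i,j-1]


def communication_free(n, s2_offset=1):
    """True if no array element is touched by more than one processor."""
    owner = {}
    for p, array, index in accesses(n, s2_offset):
        if owner.setdefault((array, index), p) != p:
            return False
    return True


def processors_used(n, s2_offset=1):
    """Set of processor IDs that execute at least one statement instance."""
    return {p for p, _, _ in accesses(n, s2_offset)}


if __name__ == "__main__":
    print(communication_free(5))               # partition from the slide
    print(communication_free(5, s2_offset=0))  # S2 also on i - j: needs communication
    print(sorted(processors_used(5)))          # IDs from 1-n to n
```

Dropping the +1 offset on S2 breaks the property, because S2 at (i, j+1) would then read A[i,j] from a different processor than the one that wrote it.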

10
Maximum Parallelism & No Communication

Let F_xj be an access to array x in statement j,
i_j be an iteration index for statement j,
B_j i_j ≥ 0 represent loop bound constraints for statement j.
Find C_j, which maps an instance of statement j to a processor:
∀ i_j, i_k : B_j i_j ≥ 0, B_k i_k ≥ 0,
F_xj(i_j) = F_xk(i_k) ⇒ C_j(i_j) = C_k(i_k)
with the objective of maximizing the rank of C_j

11
Algorithm

∀ i_j, i_k : B_j i_j ≥ 0, B_k i_k ≥ 0,
F_xj(i_j) = F_xk(i_k) ⇒ C_j(i_j) = C_k(i_k)
Rewrite partition constraints as systems of linear equations
  Use the affine form of Farkas' Lemma to rewrite constraints as systems of linear inequalities in C and λ
  Use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers λ and get systems of linear equations AC = 0
Find solutions using linear algebra techniques
  The null space of matrix A is a solution for C with maximum rank
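The last step, solving AC = 0, can be illustrated on a small case. The sketch below is an assumption-laden illustration: it skips the Farkas and Fourier-Motzkin steps and writes the equations by hand for a two-statement nest in which S1 at (i, j) writes A[i,j] and reads B[i-1,j], and S2 at (i, j) writes B[i,j] and reads A[i,j-1]. With C1(i,j) = a1*i + b1*j + c1 and C2(i,j) = a2*i + b2*j + c2, requiring C1(i,j) = C2(i,j+1) and C2(i,j) = C1(i+1,j) for all i, j gives four linear equations. The `null_space` helper is a hypothetical exact Gauss-Jordan routine.

```python
from fractions import Fraction


def null_space(rows):
    """Return a basis of {x : A x = 0} using exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    ncols = len(m[0])
    pivots = []  # pivot column of each reduced row, in row order
    r = 0
    for c in range(ncols):
        pr = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if pr is None:
            continue
        m[r], m[pr] = m[pr], m[r]
        piv = m[r][c]
        m[r] = [v / piv for v in m[r]]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                factor = m[k][c]
                m[k] = [a - factor * b for a, b in zip(m[k], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        v = [Fraction(0)] * ncols
        v[free] = Fraction(1)
        for row, pc in zip(m, pivots):
            v[pc] = -row[free]
        basis.append(v)
    return basis


# Unknowns, in order: a1, b1, c1, a2, b2, c2.
A = [
    [1, 0, 0, -1, 0, 0],   # a1 = a2
    [0, 1, 0, 0, -1, 0],   # b1 = b2
    [0, 0, 1, 0, -1, -1],  # c1 = b2 + c2   (from C1(i,j) = C2(i,j+1))
    [1, 0, 1, 0, 0, -1],   # c2 = a1 + c1   (from C2(i,j) = C1(i+1,j))
]

if __name__ == "__main__":
    for vec in null_space(A):
        print([str(x) for x in vec])
```

The null space here has two basis vectors. One is the constant shift (c1 = c2), which maps everything to a single processor; the other has a nonzero linear part proportional to i - j, which matches the partition C1 = i - j, C2 = i - j + 1 up to sign and shift.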

12
Pipelining: Alternating Direction Integration Example

Requires transposing data
  DO J = 1 to N (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J),A(I-1,J))
  DO J = 1 to N
    DO I = 1 to N (parallel)
      A(I,J) = g(A(I,J),A(I,J-1))

Moves only boundary data
  DO J = 1 to N (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J),A(I-1,J))
  DO J = 1 to N (pipelined)
    DO I = 1 to N
      A(I,J) = g(A(I,J),A(I,J-1))
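One way to picture the pipelined second nest is as a wavefront: if each processor owns one column J, then at step t processor J can update row I = t - J, because the value it needs from column J-1 was produced one step earlier. The Python sketch below is a simulation under assumptions, not the authors' code: `f` and `g` are hypothetical stand-ins for the unspecified update functions, and row 0 and column 0 hold boundary values.

```python
def f(a, up):
    return a + 2 * up       # hypothetical stand-in for f


def g(a, left):
    return a + 3 * left     # hypothetical stand-in for g


def make_grid(n):
    """(n+1) x (n+1) grid of small integers; row 0 and column 0 are boundaries."""
    return [[(3 * i + 5 * j) % 7 for j in range(n + 1)] for i in range(n + 1)]


def first_sweep(a, n):
    for j in range(1, n + 1):            # columns are independent (parallel)
        for i in range(1, n + 1):
            a[i][j] = f(a[i][j], a[i - 1][j])


def sequential(grid):
    n = len(grid) - 1
    a = [row[:] for row in grid]
    first_sweep(a, n)
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            a[i][j] = g(a[i][j], a[i][j - 1])
    return a


def pipelined(grid):
    n = len(grid) - 1
    a = [row[:] for row in grid]
    first_sweep(a, n)
    for t in range(2, 2 * n + 1):        # pipeline steps
        for j in range(1, n + 1):        # processors, independent within a step
            i = t - j
            if 1 <= i <= n:
                a[i][j] = g(a[i][j], a[i][j - 1])
    return a


if __name__ == "__main__":
    grid = make_grid(6)
    print(sequential(grid) == pipelined(grid))
```

Within a step, each processor reads only the single neighbouring value produced in the previous step, which is why only boundary data moves and no transpose is needed.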

13
Finding the Maximum Degree of Pipelining
[Figure: loops map to arrays via accesses F1(i1), F2(i2) and to time stages via T1(i1), T2(i2)]

Let F_xj be an access to array x in statement j,
i_j be an iteration index for statement j,
B_j i_j ≥ 0 represent loop bound constraints for statement j.
Find T_j, which maps an instance of statement j to a time stage:
∀ i_j, i_k : B_j i_j ≥ 0, B_k i_k ≥ 0,
(i_j ≺ i_k) ∧ (F_xj(i_j) = F_xk(i_k)) ⇒ T_j(i_j) ≤ T_k(i_k) lexicographically
with the objective of maximizing the rank of Tj
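A candidate time mapping can be tested against these constraints by enumeration. The sketch below is illustrative only. It assumes a two-statement nest over 1 <= i, j <= n in which S1 at (i, j) writes A[i,j] and reads A[i,j] and B[i-1,j], and S2 at (i, j) writes B[i,j] and reads A[i,j-1] and B[i,j]. For every pair of instances that touch the same element (with at least one write), it checks that the earlier instance is not assigned a later time stage. Time stages are tuples so that Python's tuple comparison gives the lexicographic order. All names are hypothetical.

```python
def instances(n):
    """Statement instances in original execution order: (name, iteration, writes, reads)."""
    out = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            out.append(("S1", (i, j), {("A", i, j)}, {("A", i, j), ("B", i - 1, j)}))
            out.append(("S2", (i, j), {("B", i, j)}, {("A", i, j - 1), ("B", i, j)}))
    return out


def respects_dependences(T, n):
    """True if T_j(i_j) <= T_k(i_k) whenever i_j runs first and both touch the same data."""
    inst = instances(n)
    for a in range(len(inst)):
        name_a, iter_a, writes_a, reads_a = inst[a]
        for b in range(a + 1, len(inst)):
            name_b, iter_b, writes_b, reads_b = inst[b]
            dependent = writes_a & (writes_b | reads_b) or reads_a & writes_b
            if dependent and T[name_a](*iter_a) > T[name_b](*iter_b):
                return False
    return True


# Candidate one-dimensional time mappings (same mapping for both statements).
by_i = {"S1": lambda i, j: (i,), "S2": lambda i, j: (i,)}
by_j = {"S1": lambda i, j: (j,), "S2": lambda i, j: (j,)}
by_minus_i = {"S1": lambda i, j: (-i,), "S2": lambda i, j: (-i,)}

if __name__ == "__main__":
    print(respects_dependences(by_i, 4))
    print(respects_dependences(by_j, 4))
    print(respects_dependences(by_minus_i, 4))
```

Both T = i and T = j satisfy the constraints for this nest and are linearly independent, so the set of legal time mappings has rank 2; reversing the i loop does not.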

14
Key Insight

Choice in time mapping => (pipelined) parallelism
Degrees of parallelism = rank(T) - 1

15
Putting it All Together

Find maximum outer-loop parallelism with minimum synchronization
  Divide into strongly connected components
  Apply processor mapping algorithm (no communication) to program
  If no parallelism found,
    Apply time mapping algorithm to find pipelining
    If no pipelining found (found outer sequential loop)
      Repeat process on inner loops
Minimize communication
  Use a greedy method to order communicating pairs
  Try to find communication-free, or neighborhood-only, communication by solving similar equations
Aggregate computations of consecutive data to improve spatial locality
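The first step, dividing the program into strongly connected components, can be sketched on a statement-level dependence graph. The Python below is a minimal illustration using mutual reachability rather than any particular production algorithm. The example graph (S1 and S2 depending on each other, with a third statement S3 that only consumes S2's results) is a made-up assumption.

```python
def reachable(graph, start):
    """All nodes reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected_components(graph):
    """Group nodes that can reach each other; returns a list of sorted lists."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    reach = {u: reachable(graph, u) for u in nodes}
    components, done = [], set()
    for u in sorted(nodes):
        if u in done:
            continue
        component = {u} | {v for v in reach[u] if u in reach[v]}
        components.append(sorted(component))
        done |= component
    return components


# Hypothetical dependence graph: an edge u -> v means v depends on u.
dependences = {
    "S1": ["S2"],
    "S2": ["S1", "S3"],
    "S3": [],
}

if __name__ == "__main__":
    print(strongly_connected_components(dependences))
```

Statements in the same component must be partitioned together, while separate components can be handled one at a time. This quadratic approach is chosen for brevity; a linear-time algorithm such as Tarjan's would be the usual choice for large programs.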