Title: Dense Linear Algebra (Data Distributions)
1. Dense Linear Algebra (Data Distributions)
2. Gaussian Elimination - Review
- Version 1
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i, add a multiple of row i to row j)
      - for k = i to n
        - A(j, k) = A(j, k) - (A(j, i) / A(i, i)) * A(i, k)
[Figure: the matrix after eliminating columns 1..i-1 (zeros below the diagonal in those columns), with the pivot A(i,i), row i (entries A(i,k)), and a later row j highlighted]
3. Gaussian Elimination - Review
- Version 2: remove A(j, i) / A(i, i) from the inner loop
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i)
      - m = A(j, i) / A(i, i)
      - for k = i to n
        - A(j, k) = A(j, k) - m * A(i, k)
4. Gaussian Elimination - Review
- Version 3: don't compute what we already know (the entries A(j, i) below the diagonal are known to become zero, so the inner loop can start at k = i+1)
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i)
      - m = A(j, i) / A(i, i)
      - for k = i+1 to n
        - A(j, k) = A(j, k) - m * A(i, k)
5. Gaussian Elimination - Review
- Version 4: store the multipliers m below the diagonal (in the entries of A that are being zeroed out)
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i)
      - A(j, i) = A(j, i) / A(i, i)
      - for k = i+1 to n
        - A(j, k) = A(j, k) - A(j, i) * A(i, k)
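The four versions above amount to an in-place LU factorization without pivoting. A minimal NumPy sketch of Version 4, assuming 0-based indexing and nonzero pivots (the function name and the test matrix are mine, not from the slides):

    import numpy as np

    def ge_version4(A):
        """In-place GE (Version 4): store the multipliers below the diagonal.

        Afterwards the strict lower triangle of A holds L's multipliers and the
        upper triangle holds U. No pivoting, so every A[i, i] must be nonzero.
        """
        n = A.shape[0]
        for i in range(n - 1):              # for each column i
            for j in range(i + 1, n):       # for each row j below row i
                A[j, i] = A[j, i] / A[i, i]             # multiplier, stored in place
                for k in range(i + 1, n):
                    A[j, k] = A[j, k] - A[j, i] * A[i, k]
        return A

    # Quick check: rebuild the original matrix from the stored factors.
    A = np.array([[4.0, 3.0, 2.0], [8.0, 7.0, 9.0], [4.0, 6.0, 5.0]])
    LU = ge_version4(A.copy())
    L = np.tril(LU, -1) + np.eye(3)
    U = np.triu(LU)
    assert np.allclose(L @ U, A)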
6. GE - Runtime
- Divisions: 1 + 2 + 3 + ... + (n-1) ≈ n^2/2
- Multiplications / subtractions: 1^2 + 2^2 + 3^2 + ... + (n-1)^2 ≈ n^3/3 multiply-subtract pairs
- Total: ≈ 2n^3/3 flops
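As a sanity check on these estimates, the exact counts can be tallied in a few lines of Python (the function name is mine):

    def ge_flops(n):
        """Exact flop count of GE Version 4 on an n-by-n matrix."""
        divisions = sum(n - i for i in range(1, n))             # one per multiplier
        mults = subs = sum((n - i) ** 2 for i in range(1, n))   # inner update loop
        return divisions + mults + subs

    n = 1000
    print(ge_flops(n), 2 * n ** 3 / 3)   # 666166500 vs. 666666666.7, about 0.08% apart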
7. Parallel GE
- 1st step: 1-D block partitioning along columns, with the n columns split into contiguous blocks, one per each of the p processors
8. 1-D block partitioning - Steps
1. Divisions: ≈ n^2/2 (each column of multipliers is owned by a single processor, so the divisions are not parallelized)
2. Broadcast of the multipliers of each column to the processors holding later columns: total < n^2 log p
3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + ... + 1·1 ≈ n^3/p (approx.)
Runtime < n^2/2 + n^2·log p + n^3/p
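The three terms above can be packaged into a rough cost model; a minimal Python sketch, assuming one time unit per flop and per transferred word (the function name and those unit-cost assumptions are mine):

    from math import log2

    def ge_1d_block_time(n, p):
        """Rough cost model for GE with 1-D block-column partitioning."""
        divisions = n * n / 2          # multipliers of each column computed by one processor
        broadcast = n * n * log2(p)    # upper bound on the multiplier broadcasts
        updates = n ** 3 / p           # trailing-matrix multiplications and subtractions
        return divisions + broadcast + updates

    for p in (1, 4, 16, 64):
        print(p, ge_1d_block_time(2048, p))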
9. 2-D block
[Figure: the matrix distributed over a grid of processors, P across and Q down, each holding one contiguous block of the matrix]
10. 2-D block partitioning - Steps
1. Broadcast of the pivot down the Q processors of a processor column: log Q per step
2. Divisions: ≈ n^2/Q (each column of multipliers is split over Q processors)
3. Broadcast of multipliers across the P processor columns: ≈ (n^2/Q)·log P
4. Multiplications and subtractions: ≈ n^3/(P·Q)
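Under the same unit-cost assumptions as the 1-D sketch earlier, the 2-D bound looks like this (again an illustrative model of mine, not code from the slides):

    from math import log2

    def ge_2d_block_time(n, P, Q):
        """Rough cost model for GE with 2-D block partitioning on a P x Q grid."""
        pivot_bcast = n * log2(Q)             # pivot broadcast down a processor column
        divisions = n * n / Q                 # each column of multipliers split over Q processors
        mult_bcast = (n * n / Q) * log2(P)    # multipliers broadcast across P processor columns
        updates = n ** 3 / (P * Q)            # trailing-matrix updates
        return pivot_bcast + divisions + mult_bcast + updates

    # A square grid spreads the O(n^2) terms over Q, which a 1-D layout (Q = 1) cannot do.
    n = 2048
    print(ge_2d_block_time(n, 16, 1), ge_2d_block_time(n, 4, 4))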
11. Problem with block partitioning for GE
- Once a processor's block is fully factored, that processor remains idle for the rest of the execution.
- Solution?
12. Onto cyclic
- The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.
- Onto cyclic distributions:
[Figure: example column/block assignments to processors 0-3 for the cyclic, 1-D block-cyclic, and 2-D block-cyclic distributions]
- Cyclic: load balance
- 1-D block-cyclic: load balance and block operations, but the column factorization is a bottleneck
- 2-D block-cyclic: has everything
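To make the three distributions concrete, here is a small owner-computes helper for a 2-D block-cyclic layout; the function and argument names are mine (the convention mirrors what libraries such as ScaLAPACK use, but this is only an illustration):

    def block_cyclic_owner(i, j, b, P, Q):
        """Processor (row, column) owning entry (i, j) under a 2-D block-cyclic
        distribution with b x b blocks on a P x Q processor grid.

        P = 1 gives a 1-D block-cyclic column distribution; b = 1 gives the
        purely cyclic distribution.
        """
        return (i // b) % P, (j // b) % Q

    print(block_cyclic_owner(5, 7, b=2, P=2, Q=2))   # entry (5, 7) lives on processor (0, 1)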
13. Block cyclic
- Having blocks on a processor makes block-based operations possible (block matrix multiply, etc.).
- Block-based operations lead to high performance.
- Operations can be split into 3 categories based on the number of operations per memory reference, referred to as BLAS levels.
14. Basic Linear Algebra Subroutines (BLAS): 3 levels of operations
- The memory hierarchy is exploited more efficiently by the higher-level BLAS.

  BLAS level                 Example                          Memory refs   Flops    Flops / memory ref
  Level-1 (vector)           y = y + a*x                      3n            2n       2/3
  Level-2 (matrix-vector)    y = y + A*x,  A = A + a*x*y^T    n^2           2n^2     2
  Level-3 (matrix-matrix)    C = C + A*B                      4n^2          2n^3     n/2
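A quick NumPy rendering of the three levels (the array names and sizes are mine; NumPy dispatches these operations to BLAS-like kernels internally):

    import numpy as np

    n = 512
    a = 2.0
    x, y = np.random.rand(n), np.random.rand(n)
    A, B, C = np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n)

    y = y + a * x                  # Level-1 (AXPY): 2n flops over about 3n memory references
    y = y + A @ x                  # Level-2 (GEMV): 2n^2 flops over about n^2 memory references
    A = A + a * np.outer(x, y)     # Level-2 (rank-1 update)
    C = C + A @ B                  # Level-3 (GEMM): 2n^3 flops over about 4n^2 memory references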
15. Gaussian Elimination - Review
- Version 4: store the multipliers m below the diagonal
  - for i = 1 to n-1
    - for j = i+1 to n
      - A(j, i) = A(j, i) / A(i, i)
      - for k = i+1 to n
        - A(j, k) = A(j, k) - A(j, i) * A(i, k)
- What GE really computes: for i = 1 to n-1
  - A(i+1:n, i) = A(i+1:n, i) / A(i, i)                                    (a BLAS1 operation: scale the column of multipliers)
  - A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)          (a BLAS2 operation: rank-1 update of the trailing matrix)
[Figure: finished multipliers below the diagonal, pivot A(i,i), pivot row A(i, i+1:n), multiplier column A(i+1:n, i), and trailing matrix A(i+1:n, i+1:n)]
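The two array statements above translate almost directly into NumPy, giving the BLAS1/BLAS2 formulation that the next slides convert to BLAS3; a sketch assuming 0-based indexing and no pivoting (the function name is mine):

    import numpy as np

    def ge_blas2(A):
        """GE as column scalings (BLAS1) and rank-1 updates (BLAS2)."""
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # A(i+1:n, i) = A(i+1:n, i) / A(i, i)
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update of the trailing matrix
        return A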
16. Converting BLAS2 to BLAS3
- Use blocking to obtain optimized matrix multiplies (BLAS3).
- Turn the updates into matrix multiplies by delaying them:
  - Save several updates to the trailing matrix.
  - Apply the saved updates at once in the form of a matrix multiply.
17. Modified GE using BLAS3 (courtesy of Dr. Jack Dongarra)
- for ib = 1 to n-1 step b          /* process the matrix b columns at a time */
  - end = ib + b - 1
  - /* apply the BLAS2 version of GE to get A(ib:n, ib:end) factored;
       let LL denote the strictly lower triangular portion of A(ib:end, ib:end) plus the unit diagonal */
  - A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)          /* update the next b rows of U */
  - A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                                               /* apply the delayed updates with a single matrix multiply */
- end for
[Figure: completed parts of L and U, the current b-column panel A(ib:end, ib:end), the block row A(ib:end, end+1:n), the block column A(end+1:n, ib:end), and the trailing matrix A(end+1:n, end+1:n)]
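A minimal NumPy sketch of this blocked, right-looking factorization without pivoting, using scipy.linalg.solve_triangular for the LL^(-1) step (the function and variable names are mine):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, b=64):
        """Blocked GE: BLAS2 panel factorization plus BLAS3 trailing update."""
        n = A.shape[0]
        for ib in range(0, n, b):
            end = min(ib + b, n)
            # Factor the panel A[ib:n, ib:end] with the BLAS2 version of GE.
            for i in range(ib, end):
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                # Update the next b rows of U: solve LL * X = A[ib:end, end:].
                A[ib:end, end:] = solve_triangular(A[ib:end, ib:end], A[ib:end, end:],
                                                   lower=True, unit_diagonal=True)
                # Apply the delayed updates with a single matrix multiply (GEMM).
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A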
18. GE Miscellaneous: GE with Partial Pivoting
- 1-D block-column partitioning: which is better, column pivoting or row pivoting?
- 2-D block partitioning: the pivot search can be restricted to a limited number of columns.
- Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1).
  - The exchange information is passed to the other processes by piggybacking it on the multiplier broadcast.
- Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).
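For reference, a serial NumPy sketch of GE with partial pivoting (search the current column for the largest entry, swap rows, record the permutation); the parallel variants discussed above change only where that search and exchange happen. The function name is mine:

    import numpy as np

    def ge_partial_pivot(A):
        """In-place LU with partial pivoting; returns the row permutation."""
        n = A.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            p = i + np.argmax(np.abs(A[i:, i]))    # pivot search in column i
            if p != i:
                A[[i, p], :] = A[[p, i], :]        # row exchange
                piv[[i, p]] = piv[[p, i]]
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return piv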
19. Triangular Solve: Unit Upper Triangular Matrix
- Sequential complexity: O(n^2)
- Complexity of the parallel algorithm with 2-D block partitioning on a P^0.5 x P^0.5 grid: O(n^2 / P^0.5)
- Thus (parallel GE) / (parallel TS) < (sequential GE) / (sequential TS): the triangular solve speeds up less than the factorization, so it takes a relatively larger share of the parallel runtime.
- Overall (GE + TS): O(n^3 / P)
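The O(n^2) sequential baseline is plain back substitution; a sketch for the unit upper triangular case (the function name is mine):

    import numpy as np

    def unit_upper_solve(U, b):
        """Solve U x = b for unit upper triangular U (diagonal taken as 1)."""
        n = U.shape[0]
        x = b.astype(float).copy()
        for i in range(n - 1, -1, -1):       # back substitution: O(n^2) work in total
            x[i] -= U[i, i+1:] @ x[i+1:]     # no division needed on a unit diagonal
        return x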