Title: Dense Linear Algebra (Data Distributions)
1. Dense Linear Algebra (Data Distributions)
2. Gaussian Elimination - Review
- Version 1
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i, add a multiple of row i to row j)
      - for k = i to n
        - A(j, k) = A(j, k) - (A(j, i) / A(i, i)) * A(i, k)
[Figure: the matrix after eliminating columns 1..i-1 (zeros below the diagonal in those columns), with the pivot A(i,i), row i (entries A(i,k)), and a later row j highlighted]
3. Gaussian Elimination - Review
- Version 2: remove A(j, i) / A(i, i) from the inner loop
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i)
      - m = A(j, i) / A(i, i)
      - for k = i to n
        - A(j, k) = A(j, k) - m * A(i, k)
4. Gaussian Elimination - Review
- Version 3: don't compute what we already know (the entries A(j, i) below the diagonal are known to become zero, so the inner loop can start at k = i+1)
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i)
      - m = A(j, i) / A(i, i)
      - for k = i+1 to n
        - A(j, k) = A(j, k) - m * A(i, k)
5. Gaussian Elimination - Review
- Version 4: store the multipliers m below the diagonal (in the entries of A that are being zeroed out)
  - for i = 1 to n-1                 (for each column i, zero it out below the diagonal by adding multiples of row i to later rows)
    - for j = i+1 to n               (for each row j below row i)
      - A(j, i) = A(j, i) / A(i, i)
      - for k = i+1 to n
        - A(j, k) = A(j, k) - A(j, i) * A(i, k)
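The four versions above amount to an in-place LU factorization without pivoting. A minimal NumPy sketch of Version 4, assuming 0-based indexing and nonzero pivots (the function name and the test matrix are mine, not from the slides):

    import numpy as np

    def ge_version4(A):
        """In-place GE (Version 4): store the multipliers below the diagonal.

        Afterwards the strict lower triangle of A holds L's multipliers and the
        upper triangle holds U. No pivoting, so every A[i, i] must be nonzero.
        """
        n = A.shape[0]
        for i in range(n - 1):              # for each column i
            for j in range(i + 1, n):       # for each row j below row i
                A[j, i] = A[j, i] / A[i, i]             # multiplier, stored in place
                for k in range(i + 1, n):
                    A[j, k] = A[j, k] - A[j, i] * A[i, k]
        return A

    # Quick check: rebuild the original matrix from the stored factors.
    A = np.array([[4.0, 3.0, 2.0], [8.0, 7.0, 9.0], [4.0, 6.0, 5.0]])
    LU = ge_version4(A.copy())
    L = np.tril(LU, -1) + np.eye(3)
    U = np.triu(LU)
    assert np.allclose(L @ U, A)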
6. GE - Runtime
- Divisions: 1 + 2 + 3 + ... + (n-1) ≈ n^2/2
- Multiplications / subtractions: 1^2 + 2^2 + 3^2 + ... + (n-1)^2 ≈ n^3/3 multiply-subtract pairs
- Total: ≈ 2n^3/3 flops
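As a sanity check on these estimates, the exact counts can be tallied in a few lines of Python (the function name is mine):

    def ge_flops(n):
        """Exact flop count of GE Version 4 on an n-by-n matrix."""
        divisions = sum(n - i for i in range(1, n))             # one per multiplier
        mults = subs = sum((n - i) ** 2 for i in range(1, n))   # inner update loop
        return divisions + mults + subs

    n = 1000
    print(ge_flops(n), 2 * n ** 3 / 3)   # 666166500 vs. 666666666.7, about 0.08% apart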
7. Parallel GE
- 1st step: 1-D block partitioning along columns, with the n columns split into contiguous blocks, one per each of the p processors
8. 1-D block partitioning - Steps
1. Divisions: ≈ n^2/2 (each column of multipliers is owned by a single processor, so the divisions are not parallelized)
2. Broadcast of the multipliers of each column to the processors holding later columns: total < n^2 log p
3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + ... + 1·1 ≈ n^3/p (approx.)
Runtime < n^2/2 + n^2·log p + n^3/p
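The three terms above can be packaged into a rough cost model; a minimal Python sketch, assuming one time unit per flop and per transferred word (the function name and those unit-cost assumptions are mine):

    from math import log2

    def ge_1d_block_time(n, p):
        """Rough cost model for GE with 1-D block-column partitioning."""
        divisions = n * n / 2          # multipliers of each column computed by one processor
        broadcast = n * n * log2(p)    # upper bound on the multiplier broadcasts
        updates = n ** 3 / p           # trailing-matrix multiplications and subtractions
        return divisions + broadcast + updates

    for p in (1, 4, 16, 64):
        print(p, ge_1d_block_time(2048, p))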
9. 2-D block
[Figure: the matrix distributed over a grid of processors, P across and Q down, each holding one contiguous block of the matrix]
10. 2-D block partitioning - Steps
1. Broadcast of the pivot down the Q processors of a processor column: log Q per step
2. Divisions: ≈ n^2/Q (each column of multipliers is split over Q processors)
3. Broadcast of multipliers across the P processor columns: ≈ (n^2/Q)·log P
4. Multiplications and subtractions: ≈ n^3/(P·Q)
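Under the same unit-cost assumptions as the 1-D sketch earlier, the 2-D bound looks like this (again an illustrative model of mine, not code from the slides):

    from math import log2

    def ge_2d_block_time(n, P, Q):
        """Rough cost model for GE with 2-D block partitioning on a P x Q grid."""
        pivot_bcast = n * log2(Q)             # pivot broadcast down a processor column
        divisions = n * n / Q                 # each column of multipliers split over Q processors
        mult_bcast = (n * n / Q) * log2(P)    # multipliers broadcast across P processor columns
        updates = n ** 3 / (P * Q)            # trailing-matrix updates
        return pivot_bcast + divisions + mult_bcast + updates

    # A square grid spreads the O(n^2) terms over Q, which a 1-D layout (Q = 1) cannot do.
    n = 2048
    print(ge_2d_block_time(n, 16, 1), ge_2d_block_time(n, 4, 4))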
11. Problem with block partitioning for GE
- Once a processor's block is fully factored, that processor remains idle for the rest of the execution.
- Solution?
12. Onto cyclic
- The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.
- Onto cyclic distributions:
[Figure: example column/block assignments to processors 0-3 for the cyclic, 1-D block-cyclic, and 2-D block-cyclic distributions]
- Cyclic: load balance
- 1-D block-cyclic: load balance and block operations, but the column factorization is a bottleneck
- 2-D block-cyclic: has everything
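To make the three distributions concrete, here is a small owner-computes helper for a 2-D block-cyclic layout; the function and argument names are mine (the convention mirrors what libraries such as ScaLAPACK use, but this is only an illustration):

    def block_cyclic_owner(i, j, b, P, Q):
        """Processor (row, column) owning entry (i, j) under a 2-D block-cyclic
        distribution with b x b blocks on a P x Q processor grid.

        P = 1 gives a 1-D block-cyclic column distribution; b = 1 gives the
        purely cyclic distribution.
        """
        return (i // b) % P, (j // b) % Q

    print(block_cyclic_owner(5, 7, b=2, P=2, Q=2))   # entry (5, 7) lives on processor (0, 1)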
13. Block cyclic
- Having blocks on a processor makes block-based operations possible (block matrix multiply, etc.).
- Block-based operations lead to high performance.
- Operations can be split into 3 categories based on the number of operations per memory reference, referred to as BLAS levels.
14. Basic Linear Algebra Subroutines (BLAS): 3 levels of operations
- The memory hierarchy is exploited more efficiently by the higher-level BLAS.

  BLAS level                 Example                          Memory refs   Flops    Flops / memory ref
  Level-1 (vector)           y = y + a*x                      3n            2n       2/3
  Level-2 (matrix-vector)    y = y + A*x,  A = A + a*x*y^T    n^2           2n^2     2
  Level-3 (matrix-matrix)    C = C + A*B                      4n^2          2n^3     n/2
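A quick NumPy rendering of the three levels (the array names and sizes are mine; NumPy dispatches these operations to BLAS-like kernels internally):

    import numpy as np

    n = 512
    a = 2.0
    x, y = np.random.rand(n), np.random.rand(n)
    A, B, C = np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n)

    y = y + a * x                  # Level-1 (AXPY): 2n flops over about 3n memory references
    y = y + A @ x                  # Level-2 (GEMV): 2n^2 flops over about n^2 memory references
    A = A + a * np.outer(x, y)     # Level-2 (rank-1 update)
    C = C + A @ B                  # Level-3 (GEMM): 2n^3 flops over about 4n^2 memory references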
15. Gaussian Elimination - Review
- Version 4: store the multipliers m below the diagonal
  - for i = 1 to n-1
    - for j = i+1 to n
      - A(j, i) = A(j, i) / A(i, i)
      - for k = i+1 to n
        - A(j, k) = A(j, k) - A(j, i) * A(i, k)
- What GE really computes: for i = 1 to n-1
  - A(i+1:n, i) = A(i+1:n, i) / A(i, i)                                    (a BLAS1 operation: scale the column of multipliers)
  - A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)          (a BLAS2 operation: rank-1 update of the trailing matrix)
[Figure: finished multipliers below the diagonal, pivot A(i,i), pivot row A(i, i+1:n), multiplier column A(i+1:n, i), and trailing matrix A(i+1:n, i+1:n)]
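The two array statements above translate almost directly into NumPy, giving the BLAS1/BLAS2 formulation that the next slides convert to BLAS3; a sketch assuming 0-based indexing and no pivoting (the function name is mine):

    import numpy as np

    def ge_blas2(A):
        """GE as column scalings (BLAS1) and rank-1 updates (BLAS2)."""
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # A(i+1:n, i) = A(i+1:n, i) / A(i, i)
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update of the trailing matrix
        return A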
16. Converting BLAS2 to BLAS3
- Use blocking to obtain optimized matrix multiplies (BLAS3).
- Turn the updates into matrix multiplies by delaying them:
  - Save several updates to the trailing matrix.
  - Apply the saved updates at once in the form of a matrix multiply.
17. Modified GE using BLAS3 (courtesy of Dr. Jack Dongarra)
- for ib = 1 to n-1 step b          /* process the matrix b columns at a time */
  - end = ib + b - 1
  - /* apply the BLAS2 version of GE to get A(ib:n, ib:end) factored;
       let LL denote the strictly lower triangular portion of A(ib:end, ib:end) plus the unit diagonal */
  - A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)          /* update the next b rows of U */
  - A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                                               /* apply the delayed updates with a single matrix multiply */
- end for
[Figure: completed parts of L and U, the current b-column panel A(ib:end, ib:end), the block row A(ib:end, end+1:n), the block column A(end+1:n, ib:end), and the trailing matrix A(end+1:n, end+1:n)]
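A minimal NumPy sketch of this blocked, right-looking factorization without pivoting, using scipy.linalg.solve_triangular for the LL^(-1) step (the function and variable names are mine):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, b=64):
        """Blocked GE: BLAS2 panel factorization plus BLAS3 trailing update."""
        n = A.shape[0]
        for ib in range(0, n, b):
            end = min(ib + b, n)
            # Factor the panel A[ib:n, ib:end] with the BLAS2 version of GE.
            for i in range(ib, end):
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                # Update the next b rows of U: solve LL * X = A[ib:end, end:].
                A[ib:end, end:] = solve_triangular(A[ib:end, ib:end], A[ib:end, end:],
                                                   lower=True, unit_diagonal=True)
                # Apply the delayed updates with a single matrix multiply (GEMM).
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A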
18. GE Miscellaneous: GE with Partial Pivoting
- 1-D block-column partitioning: which is better, column pivoting or row pivoting?
- 2-D block partitioning: the pivot search can be restricted to a limited number of columns.
- Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1).
  - The exchange information is passed to the other processes by piggybacking it on the multiplier broadcast.
- Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).
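For reference, a serial NumPy sketch of GE with partial pivoting (search the current column for the largest entry, swap rows, record the permutation); the parallel variants discussed above change only where that search and exchange happen. The function name is mine:

    import numpy as np

    def ge_partial_pivot(A):
        """In-place LU with partial pivoting; returns the row permutation."""
        n = A.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            p = i + np.argmax(np.abs(A[i:, i]))    # pivot search in column i
            if p != i:
                A[[i, p], :] = A[[p, i], :]        # row exchange
                piv[[i, p]] = piv[[p, i]]
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return piv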
19. Triangular Solve: Unit Upper Triangular Matrix
- Sequential complexity: O(n^2)
- Complexity of the parallel algorithm with 2-D block partitioning on a P^0.5 x P^0.5 grid: O(n^2 / P^0.5)
- Thus (parallel GE) / (parallel TS) < (sequential GE) / (sequential TS): the triangular solve speeds up less than the factorization, so it takes a relatively larger share of the parallel runtime.
- Overall (GE + TS): O(n^3 / P)
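The O(n^2) sequential baseline is plain back substitution; a sketch for the unit upper triangular case (the function name is mine):

    import numpy as np

    def unit_upper_solve(U, b):
        """Solve U x = b for unit upper triangular U (diagonal taken as 1)."""
        n = U.shape[0]
        x = b.astype(float).copy()
        for i in range(n - 1, -1, -1):       # back substitution: O(n^2) work in total
            x[i] -= U[i, i+1:] @ x[i+1:]     # no division needed on a unit diagonal
        return x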