1 High-Performance Grid Computing and Research Networking
Algorithms on a Ring (II)
Presented by Yuming Zhang (Minton)
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
2 Acknowledgements
- The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova
  - Principles of High Performance Computing
  - http://navet.ics.hawaii.edu/~casanova
  - henric_at_hawaii.edu
3 Stencil Application
- We've talked about stencil applications in the context of shared-memory programs
- We found that we had to cut the matrix into small blocks
- On a ring the same basic idea applies, but let's do it step by step
  C(i,j)^(t+1) = F( C(i-1,j)^(t+1), C(i,j-1)^(t+1), C(i+1,j)^(t), C(i,j+1)^(t) )

[Figure: 7x7 grid annotated with the step at which each cell is computed; the step numbers grow along anti-diagonals (a "wave front"), with cell (i,j) computed at step i+j, i.e. rows of labels 0..6, 1..7, ..., 6..12.]
4 40x40 Example with p = 1
- Recap of the sequential code
- Example with n = 40 and p = 1
- Row after row
- Wave front
5 3x3 Example with p = 9
[Figure: 3x3 grid of cells (0,0) through (2,2), one processor per cell.]
6 3x3 Example with p = 9
[Figure: the same 3x3 grid, animated step by step starting at Step 0.]
7 Stencil Application
- Let us, for now, consider that the domain is of size n x n and that we have p = n processors
- Each processor is responsible for computing one row of the domain (at each iteration)
- One first simple idea is to have each processor send each cell value to its neighbor as soon as that cell value is computed
- Basic Principle 1: do communication as early as possible to get your neighbors started as early as possible
- Remember that one of the goals of a parallel program is to reduce idle time on the processors
- We call this algorithm the Greedy algorithm, and seek an evaluation of its performance
8 3x3 Example with p = 3
[Figure: the 3x3 grid of cells (0,0) through (2,2), now one processor per row.]
9 3x3 Example with p = 3
[Figure: the same 3x3 grid, animated step by step starting at Step 0.]
10 Greedy Algorithm with n = p

  float C[n/p][n]   // for now n/p = 1
  // code for only one iteration
  my_rank ← rank()
  p ← num_procs()
  for (j=0; j<n; j++) {
    if (my_rank > 0) RECV(tmp,1) else tmp = -1
    C[0][j] = update(j,tmp)   // implements the stencil
    if (my_rank < num_procs-1) SEND(C[0][j],1)
  }

- We made a few assumptions about the implementation of the update function
- At time i+j, processor Pi does three things:
  - it receives c(i-1,j) from Pi-1
  - it computes c(i,j)
  - it sends c(i,j) to Pi+1
- Not technically true for P0 and Pp-1, but we don't care for performance analysis (an MPI sketch of this loop follows below)
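To make the pseudo-code concrete, here is a minimal MPI sketch of the greedy algorithm for the p = n case, one row per processor. This is an illustration, not code from the lecture: update() is a stand-in for the stencil function F, and the tags and types are arbitrary choices.

  #include <mpi.h>
  #include <stdlib.h>

  /* Stand-in for the stencil function F; a real application would combine
     the neighbor values according to its own formula. */
  static double update(int j, double above, double left) {
      return 0.25 * (above + left) + j;   /* arbitrary placeholder */
  }

  /* Greedy algorithm, one row per processor: receive the cell above as soon
     as the predecessor computes it, compute our cell, forward it at once. */
  void greedy_row(int n) {
      int rank, p;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      double *C = malloc(n * sizeof(double));
      double left = -1.0;                      /* boundary value, as in the slides */

      for (int j = 0; j < n; j++) {
          double above = -1.0;
          if (rank > 0)
              MPI_Recv(&above, 1, MPI_DOUBLE, rank - 1, j,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          C[j] = update(j, above, left);
          left = C[j];
          if (rank < p - 1)                    /* Basic Principle 1: send early */
              MPI_Send(&C[j], 1, MPI_DOUBLE, rank + 1, j, MPI_COMM_WORLD);
      }
      free(C);
  }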
11 Greedy Algorithm with n ≫ p
- This is all well and good, but really, we almost always have more rows in the domain than processors
- Example with n = 40 and p = 10
[Figure: 40x40 domain split among processors P0 through P9.]
12 Greedy Algorithm with n ≫ p
- This is all well and good, but really, we almost always have more rows in the domain than processors
- Example with n = 40 and p = 10
- First algorithm:
  - P1 has to wait for 10 steps
  - P2 has to wait for 36 steps
  - ...
  - P9 has to wait for 666 steps
[Figure: the corresponding row distribution among P0 through P9.]
13 Greedy Algorithm with n ≫ p
- This is all well and good, but really, we almost always have more rows in the domain than processors
- Example with n = 40 and p = 10
- First algorithm:
  - P1 has to wait for 10 steps
  - P2 has to wait for 36 steps
  - ...
  - P9 has to wait for 666 steps
- Second algorithm:
  - P1 has to wait for 4 steps
  - P2 has to wait for 8 steps
  - ...
  - P9 has to wait for 36 steps
[Figure: the two row distributions among P0 through P9, side by side.]
14 Greedy Algorithm
- Let's assume that we have more rows in the domain than processors: a realistic assumption!
- The question is then: how do we allocate matrix rows to processors?
- Similarly to what we saw in the shared-memory case, we use a cyclic (i.e., interleaved) distribution
- Basic Principle 2: A cyclic distribution of data among processors is a good way to achieve good load balancing (a small mapping sketch follows below)
- Remember that in the mat-vec and mat-mat multiply we did not use a cyclic distribution
- There was no such need there because all computations were independent and we could just assign blocks of rows to processors and everybody could compute
- If we did this here, all processors would be idle while processor P0 computes its block of rows, and then all processors would be idle while P1 computes its block of rows, etc. It would be a sequential execution!
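As a concrete picture of the cyclic distribution, a tiny helper sketch in C (the helper names are mine, not the lecture's):

  /* Cyclic (interleaved) distribution of n rows over p processors. */
  static inline int owner_of_row(int i, int p) { return i % p; }  /* which processor holds row i */
  static inline int local_row(int i, int p)    { return i / p; }  /* its index on that processor */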
15 Cyclic Greedy Algorithm
- Assumption: n ≫ p
- n = 40 and p = 10
[Figure: the 40 rows dealt out cyclically to P0 through P9 (row i goes to P(i mod 10)).]
16 Cyclic Greedy Algorithm
- Assumption: n ≫ p
- n = 40 and p = 10
- With the cyclic distribution:
  - P1 waits 1 step
  - P2 waits 2 steps
  - ...
  - P9 waits 9 steps
- Pp-1 waits p-1 steps
- Each processor needs to compute n^2/p cells
- The execution finishes after p - 1 + n^2/p steps
[Figure: cyclic distribution of the 40 rows among P0 through P9.]
17 Cyclic Greedy Algorithm

  float C[n/p][n]   // n/p > 1 and is an integer!
  // code for only one iteration
  my_rank ← rank()
  p ← num_procs()
  for (i=0; i<n/p; i++)
    for (j=0; j<n; j++) {
      if (my_rank + i*p > 0) RECV(tmp,1) else tmp = -1
      C[i][j] = update(i,j,tmp)
      if (my_rank + i*p < n-1) SEND(C[i][j],1)
    }
18 Cyclic Greedy Algorithm
- Let us compute the execution time of this algorithm, T(n,p)
- Remember that n > p
- We can assume that sending a message is done in a non-blocking fashion (while receiving is blocking)
- Then when a processor sends a message at step k of the algorithm, it receives in parallel the message for step k+1
- This is a reasonable assumption because the message sizes are identical
- Remember that in performance analysis we make simplifying assumptions; otherwise the reasoning becomes overly complex
- Therefore, at each algorithm step the processors (i.e., at least one of them) do:
  - Call update on one cell, which takes time Ta
    - Ta = computation in Flops / machine speed in Flop/sec
  - Send/receive a cell value, which takes time b + Tc
    - b = communication start-up latency
    - Tc = cell value size (in bytes) / network bandwidth
- Each step lasts Ta + b + Tc
19 Cyclic Greedy Algorithm
- Each step takes time Ta + b + Tc
- How many steps are there?
- We've done this for the shared-memory version (sort of)
- It takes p-1 steps before processor Pp-1 can start computing
- Then it computes n^2/p cells
- Therefore there are p - 1 + n^2/p steps
- T(n,p) = (p - 1 + n^2/p) (Ta + Tc + b)
- This formula points to a big problem: a large component of the execution time is caused by the communication start-up time b
- In practice b can be as large as or larger than Ta + Tc!
- The reason is that we send many small messages
  - A bunch of SEND(...,1)!
- Therefore we can fix it by sending larger messages
- What we need to do is increase the granularity of the algorithm
- Basic Principle 3: Sending large messages reduces communication overhead
- Conflicts with Principle 1
20 Higher Granularity
- As opposed to sending a cell value as soon as it is computed, compute k cell values and send them all in bulk (a bulk-send sketch follows below)
- We assume that k divides n
- As opposed to having each processor hold n/p non-contiguous rows of the domain, have each processor hold blocks of r consecutive rows
- We assume that p*r divides n
21 Higher Granularity
- Assumption: n ≫ p
- n = 40 and p = 10
- r = 2 and k = 5
[Figure: blocks of r = 2 consecutive rows dealt out cyclically to P0 through P9; each message carries k = 5 cell values.]
22 Idle processors?
- In the previous picture, it may be that, after it finishes computing its first block row, processor P0 has to wait for data from Pp-1
- Processor P0 computes its first block row in n/k algorithm steps
- Processor Pp-1 computes the first sub-block of its first block row after p algorithm steps
- Therefore, P0 is not idle if n/k >= p, i.e., n >= kp
- If n < kp: idle time, which is not a good idea
- If n > kp: processors need to receive and store values for a while before being able to use them
23 Higher Granularity
- Assumption: n ≫ p
- n = 40 and p = 10
- r = 2 and k = 5
- n < kp: 40 < 5x10
[Figure: execution with these parameters on P0 through P9; P0 incurs idle time.]
24 Cyclic Greedy Algorithm
- Assumption: n ≫ p
- n = 40 and p = 10
- r = 1 and k = 1
- n > kp: 40 > 1x10
[Figure: the fine-grained cyclic execution on P0 through P9, for comparison.]
25 Higher Granularity
- Assumption: n ≫ p
- n = 40 and p = 10
- r = 2 and k = 4
- n = kp: 40 = 4x10
[Figure: execution with these parameters on P0 through P9; k labels the message width and r the block-row height.]
26 Higher Granularity
- Assumption: n ≫ p
- n = 40 and p = 10
- r = 4 and k = 5
- n < kp: 40 < 5x10
[Figure: execution with these parameters on P0 through P9; k labels the message width and r the block-row height.]
27 Performance Analysis
- Very similar to what we did before
- Each step takes time krTa + kTc + b
  - the time to compute k*r cells
  - plus the time to send k cells
- There are p - 1 + n^2/(pkr) steps
  - p-1 steps before processor Pp-1 can start any computation
  - Then Pp-1 has to compute (n^2/(kr))/p blocks
- T(n,p,k,r) = (p - 1 + n^2/(pkr)) (krTa + kTc + b)
- Compare to T(n,p) = (p - 1 + n^2/p) (Ta + Tc + b)
- We traded off Principle 1 for Principle 3, which is probably better in practice
- Values of k and r can be experimented with to find what works best in practice (a small search sketch follows below)
- Note that optimal values can be computed
28 Solving Linear Systems of Eq.
- Methods for solving linear systems
- The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]
- Gaussian Elimination is perhaps the most well-known method
  - based on the fact that the solution of a linear system is invariant under scaling and under row additions
  - One can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant
  - One can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side
- Idea: scale and add equations so as to transform matrix A into an upper triangular matrix
[Figure: A x = b transformed into an upper triangular system; equation n-i has i unknowns.]
29 Gaussian Elimination

  [ 1  1  1 ]       [ 0 ]
  [ 1 -2  2 ]  x =  [ 4 ]
  [ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

  [ 1  1  1 ]       [ 0 ]
  [ 0 -3  1 ]  x =  [ 4 ]
  [ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

  [ 1  1  1 ]       [ 0  ]
  [ 0 -3  1 ]  x =  [ 4  ]
  [ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order ("backsolving"):

  -5x3 = 10          =>  x3 = -2
  -3x2 + x3 = 4      =>  x2 = -2
  x1 + x2 + x3 = 0   =>  x1 = 4
30 Gaussian Elimination
- The algorithm goes through the matrix from the top-left corner to the bottom-right corner
- The i-th step eliminates the non-zero sub-diagonal elements in column i, subtracting the i-th row scaled by a_ji/a_ii from row j, for j = i+1, ..., n
[Figure: at step i, the values above the pivot row are already computed; the sub-diagonal entries of column i are to be zeroed; the values below and to the right are yet to be updated.]
31 Sequential Gaussian Elimination
- Simple sequential algorithm:

  // for each column i
  // zero it out below the diagonal by adding
  // multiples of row i to later rows
  for i = 1 to n-1
    // for each row j below row i
    for j = i+1 to n
      // add a multiple of row i to row j
      for k = i to n
        A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

- Several tricks do not change the spirit of the algorithm but make the implementation easier and/or more efficient:
  - The right-hand side is typically kept in column n+1 of the matrix, and one speaks of an augmented matrix
  - Compute the A(j,i)/A(i,i) term outside of the inner loop
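Here is the same algorithm as a small runnable C program (0-indexed, no pivoting) that applies both tricks: the right-hand side sits in the extra column of an augmented matrix, and the scaling factor is hoisted out of the inner loop. The matrix is the 3x3 example from the earlier slide.

  #include <stdio.h>

  #define N 3

  int main(void) {
      double A[N][N + 1] = {                 /* augmented matrix: RHS in column N */
          {1,  1,  1, 0},
          {1, -2,  2, 4},
          {1,  2, -1, 2},
      };
      for (int i = 0; i < N - 1; i++)        /* for each column i */
          for (int j = i + 1; j < N; j++) {  /* for each row j below row i */
              double m = A[j][i] / A[i][i];  /* scaling factor, hoisted out */
              for (int k = i; k <= N; k++)
                  A[j][k] -= m * A[i][k];
          }
      double x[N];                           /* back-substitution ("backsolving") */
      for (int i = N - 1; i >= 0; i--) {
          x[i] = A[i][N];
          for (int k = i + 1; k < N; k++) x[i] -= A[i][k] * x[k];
          x[i] /= A[i][i];
      }
      printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);  /* expect (4, -2, -2) */
      return 0;
  }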
32 Pivoting: Motivation
- A few pathological cases, e.g. the matrix [ 0 1 ; 1 1 ]
- Division by small numbers → round-off error in computer arithmetic
- Consider the following system:
  - 0.0001 x1 + x2 = 1.000
  - x1 + x2 = 2.000
- Exact solution: x1 = 1.00010 and x2 = 0.99990
- Say we round off after 3 digits after the decimal point
- Multiply the first equation by 10^4 and subtract it from the second equation:
  - (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4
- But, in finite precision with only 3 digits:
  - 1 - 10^4 = -0.9999 x 10^4 ≈ -0.999 x 10^4
  - 2 - 10^4 = -0.9998 x 10^4 ≈ -0.999 x 10^4
- Therefore, x2 = 1 and x1 = 0 (from the first equation)
- Very far from the real solution!
33 Partial Pivoting
- One can just swap rows:
  - x1 + x2 = 2.000
  - 0.0001 x1 + x2 = 1.000
- Multiplying the first equation by 0.0001 and subtracting it from the second equation gives:
  - (1 - 0.0001) x2 = 1 - 0.0001
  - 0.9999 x2 = 0.9999  =>  x2 = 1
  - and then x1 = 1
- The final solution is much closer to the real solution. (Magical!)
- Partial Pivoting:
  - For numerical stability, one doesn't go in order, but picks the next row among rows i to n that has the largest element in column i
  - This row is swapped with row i (along with the corresponding elements of the right-hand side) before the subtractions
  - the swap is not done in memory but rather one keeps an indirection array
- Total Pivoting:
  - Look for the greatest element ANYWHERE in the matrix
  - Swap columns
  - Swap rows
- Numerical stability is really a difficult field (a small demo follows below)
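The following small C program demonstrates the effect in single precision. It is analogous to, not identical to, the slides' 3-digit example: the tiny coefficient is shrunk to 1e-8 so that the round-off catastrophe shows up in float arithmetic.

  #include <stdio.h>

  /* Solve { a11 x1 + a12 x2 = b1 ; a21 x1 + a22 x2 = b2 }
     by naive elimination in single precision. */
  static void solve2(float a11, float a12, float b1,
                     float a21, float a22, float b2) {
      float m   = a21 / a11;           /* scaling factor */
      float u22 = a22 - m * a12;
      float c2  = b2  - m * b1;
      float x2  = c2 / u22;
      float x1  = (b1 - a12 * x2) / a11;
      printf("x1 = %g, x2 = %g\n", x1, x2);
  }

  int main(void) {
      /* the exact solution is very close to x1 = 1, x2 = 1 */
      puts("without pivoting (tiny pivot first):");
      solve2(1e-8f, 1.0f, 1.0f,  1.0f, 1.0f, 2.0f);   /* x1 comes out as 0 */
      puts("with partial pivoting (rows swapped):");
      solve2(1.0f, 1.0f, 2.0f,  1e-8f, 1.0f, 1.0f);   /* x1 and x2 close to 1 */
      return 0;
  }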
34 Parallel Gaussian Elimination?
- Assume that we have one processor per matrix element
[Figure: one parallel elimination step. A reduction over column i finds max |a_ji| (needed for pivoting); each processor then independently computes its scaling factor; broadcasts spread the scaling factors and the pivot-row elements, since every update needs both; the updates themselves are independent computations.]
35 LU Factorization
- Gaussian Elimination is simple, but:
  - What if we have to solve many Ax = b systems for different values of b?
  - This happens a LOT in real applications
- Another method is the LU Factorization
- Ax = b
- Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)
- Then Ax = b is written L U x = b
  - Solve L y = b: O(n^2)
  - Solve U x = y: O(n^2)
- Triangular system solves are easy
[Figure: the two triangular solves; in the lower triangular system L y = b, equation i has i unknowns; in the upper triangular system U x = y, equation n-i has i unknowns.]
36 LU Factorization: Principle
- It works just like Gaussian Elimination, but instead of zeroing out elements, one saves the scaling coefficients
- Magically, A = L x U!
- Should be done with pivoting as well

  Start:                 [ 1  2 -1 ]
                         [ 4  3  1 ]
                         [ 2  2  3 ]

  Gaussian elimination below the first pivot, saving the scaling factors in place:

                         [ 1  2 -1 ]
                         [ 4 -5  5 ]    (row 2 := row 2 - 4 x row 1, factor 4 saved)
                         [ 2 -2  5 ]    (row 3 := row 3 - 2 x row 1, factor 2 saved)

  Gaussian elimination below the second pivot, saving the scaling factor:

                         [ 1   2  -1 ]
                         [ 4  -5   5 ]
                         [ 2  2/5  3 ]  (row 3 := row 3 - 2/5 x row 2, factor 2/5 saved)

  Reading off the two factors:

      U = [ 1  2 -1 ]    L = [ 1   0   0 ]
          [ 0 -5  5 ]        [ 4   1   0 ]
          [ 0  0  3 ]        [ 2  2/5  1 ]
37 LU Factorization
- We're going to look at the simplest possible version
- No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

  LU-sequential(A,n) {
    for k = 0 to n-2 {
      // preparing column k: store the scaling factors
      for i = k+1 to n-1
        a_ik ← -a_ik / a_kk
      for j = k+1 to n-1
        // Task T(k,j): update of column j
        for i = k+1 to n-1
          a_ij ← a_ij + a_ik * a_kj
    }
  }
38 LU Factorization
- (Same code as the previous slide; a runnable C version follows below.)
[Figure: the update step; element a_ij of column j is updated using the saved factor a_ik and the pivot-row element a_kj.]
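A direct C rendering of this pseudo-code on the 3x3 example from the "principle" slide, runnable as-is. Note that it follows the slides' sign convention: the prepare step stores the negated scaling factor, so below the diagonal the array holds the negatives of L's entries, while the diagonal and above hold U.

  #include <stdio.h>

  #define N 3

  int main(void) {
      double a[N][N] = {{1, 2, -1},
                        {4, 3,  1},
                        {2, 2,  3}};        /* example matrix from slide 36 */
      for (int k = 0; k < N - 1; k++) {
          for (int i = k + 1; i < N; i++)   /* preparing column k */
              a[i][k] = -a[i][k] / a[k][k];
          for (int j = k + 1; j < N; j++)   /* task T(k,j): update column j */
              for (int i = k + 1; i < N; i++)
                  a[i][j] = a[i][j] + a[i][k] * a[k][j];
      }
      /* prints U on/above the diagonal and -L strictly below it:
         -4, -2, -0.4 are the negated factors 4, 2, 2/5 from slide 36 */
      for (int i = 0; i < N; i++)
          printf("%8.3f %8.3f %8.3f\n", a[i][0], a[i][1], a[i][2]);
      return 0;
  }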
39 Parallel LU on a ring
- Since the algorithm operates by columns from left to right, we should distribute columns to processors
- Principle of the algorithm:
  - At each step, the processor that owns column k does the prepare task and then broadcasts the bottom part of column k to all others
  - Annoying if the matrix is stored in row-major fashion
  - Remember that one is free to store the matrix any way one wants, as long as it's coherent and the right output is generated
  - After the broadcast, the other processors can then update their data
- Assume there is a function alloc(k) that returns the rank of the processor that owns column k
  - Basically so that we don't clutter our program with too many global-to-local index translations
- In fact, we will first write everything in terms of global indices, so as to avoid all the annoying index arithmetic
40 LU-broadcast algorithm

  LU-broadcast(A,n) {
    q ← rank()
    p ← numprocs()
    for k = 0 to n-2 {
      if (alloc(k) == q)
        // preparing column k
        for i = k+1 to n-1
          buffer[i-k-1] ← a_ik ← -a_ik / a_kk
      broadcast(alloc(k), buffer, n-k-1)
      for j = k+1 to n-1
        if (alloc(j) == q)
          // update of column j
          for i = k+1 to n-1
            a_ij ← a_ij + buffer[i-k-1] * a_kj
    }
  }
41 Dealing with local indices
- Assume that p divides n
- Each processor needs to store r = n/p columns, and its local indices go from 0 to r-1
- After step k, only columns with indices greater than k will be used
- Simple idea: use a local index, l, that everyone initializes to 0
- At step k, processor alloc(k) increases its local index so that next time it will point to its next local column (a closed-form equivalent is sketched below)
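For a cyclic column distribution the running counter has a closed form; a tiny sketch with helper names of my own:

  /* Cyclic distribution of n columns over p processors. */
  static inline int alloc(int k, int p)     { return k % p; }  /* owner of global column k */
  static inline int local_col(int k, int p) { return k / p; }  /* its local index on that owner */
  /* The slide's counter l equals local_col(k,p) + 1 right after
     processor alloc(k) has prepared column k. */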
42 LU-broadcast algorithm

  ...
  double a[n-1][r-1]
  q ← rank()
  p ← numprocs()
  l ← 0
  for k = 0 to n-2 {
    if (alloc(k) == q) {
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
      l ← l+1
    }
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
      for i = k+1 to n-1
        a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }
43 What about the alloc function?
- One thing we have left completely unspecified is how to write the alloc function: how are columns distributed among processors?
- There are two complications:
  - The amount of data to process varies throughout the algorithm's execution
    - At step k, columns k+1 to n-1 are updated
    - Fewer and fewer columns to update
  - The amount of computation varies among columns
    - e.g., column n-1 is updated more often than column 2
    - Holding the columns on the right of the matrix leads to much more work
- There is a strong need for load balancing
  - All processes should do the same amount of work
44 Bad load balancing
[Figure: block distribution of columns among P1 through P4; the left blocks are already done, and only the processor holding the current block is working on it.]
45 Good Load Balancing?
[Figure: cyclic distribution of columns; the already-done columns are spread across all processors, so everyone keeps working on the remaining ones.]
46 Proof that load balancing is good
- The computation consists of two types of operations:
  - column preparations
  - matrix element updates
- There are many more updates than preparations, so we really care about good balancing of the updates
- Consider column j
- Let's count the number of updates performed by the processor holding column j
- Column j is updated at steps k = 0, ..., j-1
- At step k, elements i = k+1, ..., n-1 are updated
  - indices start at 0
- Therefore, at step k, the update of column j entails n-k-1 updates
- The total number of updates for column j in the execution is the sum over these steps (written out below)
47 Proof that load balancing is good
- Consider processor Pi, which holds columns lp+i for l = 0, ..., n/p - 1
- Processor Pi needs to perform the corresponding number of updates, summed over the columns it holds (see the reconstruction below)
- It turns out this sum can be computed:
  - separate the terms
  - use the formulas for sums of integers and sums of squares
- What it all boils down to is roughly n^3/(3p) updates, up to lower-order terms
- This does not depend on i
- Therefore it is (asymptotically) the same for all processors Pi
- Therefore we have (asymptotically) perfect load balancing!
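In this copy the slide's formulas are images; here is a LaTeX reconstruction of the arithmetic, consistent with the n^3/(3p) update count used on slide 49:

  \underbrace{\sum_{k=0}^{j-1} (n-k-1)}_{\text{updates for column } j}
    \;=\; j(n-1) \;-\; \frac{j(j-1)}{2}

  \sum_{l=0}^{n/p-1} \left[ (lp+i)(n-1) - \frac{(lp+i)(lp+i-1)}{2} \right]
    \;=\; \frac{n^3}{3p} \;+\; O(n^2)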
48 Load-balanced program

  ...
  double a[n-1][r-1]
  q ← rank()
  p ← numprocs()
  l ← 0
  for k = 0 to n-2 {
    if (k == q mod p) {
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
      l ← l+1
    }
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
      for i = k+1 to n-1
        a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }
49 Performance Analysis
- How long does this code take to run?
- This is not an easy question because there are many tasks and many communications
- A little bit of analysis shows that the execution time is the sum of three terms:
  - n-1 communications: n*b + (n^2/2)*Tcomm + O(1)
  - n-1 column preparations: (n^2/2)*Tcomp + O(1)
  - column updates: (n^3/(3p))*Tcomp + O(n^2)
- Therefore, the execution time is ~ (n^3/(3p))*Tcomp
- Note that the sequential time is (n^3/3)*Tcomp
- Therefore, we have perfect asymptotic efficiency!
  - once again
- This is good, but it isn't always the best in practice
- How can we improve this algorithm?
50 Pipelining on the Ring
- So far the algorithm has used a simple broadcast
- Nothing was specific to being on a ring of processors, and it's portable
  - in fact you could just write raw MPI that looks like our pseudo-code and have a very limited (inefficient for small n) LU factorization that works only for some numbers of processors
- But it's not efficient
  - The n-1 communication steps are not overlapped with computations
  - Therefore: Amdahl's law, etc.
- It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
- It almost looks like inserting the source code of the broadcast we saw at the very beginning throughout the LU code
51 Previous program

  ...
  double a[n-1][r-1]
  q ← rank()
  p ← numprocs()
  l ← 0
  for k = 0 to n-2 {
    if (k == q mod p) {
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
      l ← l+1
    }
    broadcast(alloc(k), buffer, n-k-1)
    for j = l to r-1
      for i = k+1 to n-1
        a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }
52 LU-pipeline algorithm

  double a[n-1][r-1]
  q ← rank()
  p ← numprocs()
  l ← 0
  for k = 0 to n-2 {
    if (k == q mod p) {
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
      l ← l+1
      send(buffer, n-k-1)
    } else {
      recv(buffer, n-k-1)
      if (q ≠ k-1 mod p) send(buffer, n-k-1)
    }
    for j = l to r-1
      for i = k+1 to n-1
        a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }
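Here is a communication skeleton of this pipelined ring in MPI. It is a sketch under assumptions: the prepare and update arithmetic is elided (the buffer is zero-filled so the program still runs), and the tags and function name are choices of mine, not the lecture's.

  #include <mpi.h>
  #include <stdlib.h>

  /* Ring pipeline for LU: the owner of pivot column k prepares and sends it;
     everyone else receives from the predecessor and forwards to the successor,
     except the owner's predecessor, which would otherwise close the loop. */
  void lu_pipeline_skeleton(int n) {
      int q, p;
      MPI_Comm_rank(MPI_COMM_WORLD, &q);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      int succ = (q + 1) % p, pred = (q + p - 1) % p;

      double *buffer = malloc((size_t)n * sizeof(double));
      int l = 0;                               /* local index of the next pivot column */
      for (int k = 0; k <= n - 2; k++) {
          int len = n - k - 1;
          if (k % p == q) {                    /* I own column k */
              for (int i = 0; i < len; i++)
                  buffer[i] = 0.0;             /* stand-in for the prepare step */
              l = l + 1;
              MPI_Send(buffer, len, MPI_DOUBLE, succ, k, MPI_COMM_WORLD);
          } else {
              MPI_Recv(buffer, len, MPI_DOUBLE, pred, k,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              if (q != (k + p - 1) % p)        /* forward unless I precede the owner */
                  MPI_Send(buffer, len, MPI_DOUBLE, succ, k, MPI_COMM_WORLD);
          }
          /* stand-in for the update step on local columns l .. r-1 */
      }
      free(buffer);
  }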
53 Why is it better?
- During a broadcast, the root's successor just sits idle while the message goes along the ring
  - This is partially because of the way we have implemented the broadcast
  - With a better broadcast on a general topology the wait may be smaller
  - But there is still a wait
- What we have done is allow each processor to move on to other business after receiving and forwarding the message
- Possible by writing the code with just sends and receives
- More complicated but more efficient: the usual trade-off
- Let's look at an (idealized) time-line
54 Time-line (first four stages)
[Figure: idealized time-line of the pipelined algorithm on four processors, showing the Prep(k), Send(k), Recv(k), and Update(k,j) tasks for k = 0..3 and columns j up to 15. A processor sends out data as soon as it receives it, and some communication occurs in parallel with computation.]
55 Can we do better?
- In the previous algorithm, a processor does all its updates before doing a Prep() computation that then leads to a communication
- But in fact, some of these updates can be done later
- Idea: send out the pivot column as soon as possible
- Example:
  - In the previous algorithm:
    - P1: Receive(0), Send(0)
    - P1: Update(0,1), Update(0,5), Update(0,9), Update(0,13)
    - P1: Prep(1)
    - P1: Send(1)
    - ...
  - In the new algorithm:
    - P1: Receive(0), Send(0)
    - P1: Update(0,1)
    - P1: Prep(1)
    - P1: Send(1)
    - P1: Update(0,5), Update(0,9), Update(0,13)
    - ...
56 Time-line (first four stages)
[Figure: idealized time-line of the look-ahead variant; each Prep(k+1)/Send(k+1) is issued as soon as possible, ahead of the deferred Update(k,j) tasks, so many communications occur in parallel with computation.]
57 LU-look-ahead algorithm

  q ← rank()
  p ← numprocs()
  l ← 0
  for k = 0 to n-2 {
    if (k == q mod p) {
      Prep(k)
      Send(buffer, n-k-1)
      for all j ≡ k (mod p), j > k: Update(k-1, j)   // deferred updates from the previous step
      for all j ≡ k (mod p), j > k: Update(k, j)
    } else {
      Recv(buffer, n-k-1)
      if (q ≠ k-1 mod p) Send(buffer, n-k-1)
      if (q == k+1 mod p)
        Update(k, k+1)        // update only the next pivot column now; defer the rest
      else
        for all j ≡ q (mod p), j > k: Update(k, j)
    }
  }
58 Further improving performance
- One can use local overlap of communication and computation
  - multi-threading, a good non-blocking MPI implementation, etc.
- There is much more to be said about parallel LU factorization
  - Many research articles
  - Many libraries available
- It's a good example of an application for which one can think hard about operation orderings and try to find improved sequences
- The basic principle is always the same: send things as early as possible
- The modified principle: send things as early as required, but not earlier! You can avoid extra communication load by sending a smaller number of longer messages.