What will SDSC be well-known for ? - PowerPoint PPT Presentation

About This Presentation
Title:

What will SDSC be well-known for ?

Description:

High-Performance Grid Computing and Research Networking Algorithms on a Ring (II) Presented by Yuming Zhang (Minton) Instructor: S. Masoud Sadjadi – PowerPoint PPT presentation

Number of Views:52
Avg rating:3.0/5.0
Slides: 59
Provided by: Henri208
Learn more at: http://users.cis.fiu.edu
Category:

less

Transcript and Presenter's Notes

Title: What will SDSC be well-known for ?


1
High-Performance Grid Computing and Research
Networking
Algorithms on a Ring (II)
Presented by Yuming Zhang (Minton) Instructor
S. Masoud Sadjadi http//www.cs.fiu.edu/sadjadi/T
eaching/ sadjadi At cs Dot fiu Dot edu
2
Acknowledgements
  • The content of many of the slides in this lecture
    notes have been adopted from the online resources
    prepared previously by the people listed below.
    Many thanks!
  • Henri Casanova
  • Principles of High Performance Computing
  • http//navet.ics.hawaii.edu/casanova
  • henric_at_hawaii.edu

3
Stencil Application
  • Weve talked about stencil applications in the
    context of shared-memory programs
  • We found that we had to cut the matrix in small
    blocks
  • On a ring the same basic idea applies, but lets
    do it step-by-step

t1
t1
t
t
0
1
2
3
4
5
6
t
1
2
3
4
5
6
7
2
3
4
5
6
7
8
Cijt1 F(Ci-1jt1 Cij-1t1
Ci1jt Cij1t )
3
4
5
6
7
8
9
4
5
6
7
8
9
10
5
6
7
8
9
10
11
6
7
8
9
10
11
12
4
40x40 Example with p1
  • Recap sequential code
  • Example with n40 and p1
  • Row after row
  • Wave front

5
3x3 Example with p9
0,0
0,1
0,2
1,1
1,0
1,2
2,0
2,1
2,2
6
3x3 Example with p9
0,0
0,1
0,2
Step
1,1
1,0
1,2
0
2,0
2,1
2,2
7
Stencil Application
  • Let us, for now, consider that the domain is of
    size nxn and that we have pn processors
  • Each processor is responsible for computing one
    row of the domain (at each iteration)
  • One first simple idea is to have each processor
    send each cell value to its neighbor as soon as
    that cell value is computed
  • Basic principle 1 do communication as early as
    possible to get your neighbors started as early
    as possible
  • Remember that one of the goals of a parallel
    program is to reduce idle time on the processors
  • We call this algorithm the Greedy algorithm, and
    seek an evaluation of its performance

8
3x3 Example with p3
0,0
0,1
0,2
1,1
1,0
1,2
2,0
2,1
2,2
9
3x3 Example with p3
0,0
0,1
0,2
Step
1,1
1,0
1,2
0
2,0
2,1
2,2
10
Greedy Algorithm with np
  • float Cn/pn // for now n/p 1
  • // code for only one iteration
  • my_rank ? rank()
  • p ? num_procs()
  • for (j0 jltn j)
  • if (my_rank gt 0) RECV(tmp,1) else tmp-1
  • C0j update(j,tmp) // implements the
    stencil
  • if (my_rank lt num_procs-1)
    SEND((C0j),1)
  • We made a few assumptions about the
    implementation of the update function
  • At time ij, Processor Pi does three things
  • it receives c(i-1,j) from Pi-1
  • it computes c(i,j)
  • it send c(i,j) to Pi1
  • Not technically true for P0 and Pp-1, but we
    dont care for performance analysis

11
Greedy Algorithm with n m p
  • This is all well and good, but really, we almost
    always have more rows in the domain than
    processors
  • Example with n40 and p10

P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
12
Greedy Algorithm with n m p
  • This is all well and good, but really, we almost
    always have more rows in the domain than
    processors
  • Example with n40 and p10
  • First algorithm
  • P1 has to wait for 10 steps
  • P2 has to wait for 36 steps
  • P9 has to wait for 666 steps

P0
P1
P2
P3
P4
P5
13
Greedy Algorithm with n m p
  • This is all well and good, but really, we almost
    always have more rows in the domain than
    processors
  • Example with n40 and p10
  • First algorithm
  • P1 has to wait for 10 steps
  • P2 has to wait for 36 steps
  • P9 has to wait for 666 steps
  • Second algorithm
  • P1 has to wait for 4 steps
  • P2 has to wait for 8 steps
  • P9 has to wait for 36 steps

P0
P1
P2
P3
P4
P5
14
Greedy Algorithm
  • Lets assume that we have more rows in the domain
    than processors a realistic assumption!
  • The question is then how do we allocate matrix
    rows to processors?
  • Similarly to what we saw in the shared memory
    case, we use a cyclic (i.e., interleaved)
    distribution
  • Basic Principle 2 A cyclic distribution of data
    among processors is a good way to achieve good
    load balancing
  • Remember that in the mat-vec and mat-mat multiply
    we did not use a cyclic distribution
  • There was no such need there because all
    computations were independent and we could just
    assign blocks of rows to processors and everybody
    could compute
  • If we did this here, all processors would be idle
    while processor P0 computes its block of rows,
    and then all processors would be idle while P1
    computes its block of rows, etc. It would be a
    sequential execution!

15
Cyclic Greedy Algorithm
  • Assumption n m p
  • n40 and p10

P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
16
Cyclic Greedy Algorithm
  • Assumption n m p
  • n40 and p10
  • With Cyclic Distribution
  • P1 waits 1 step
  • P2 waits 2 steps
  • P9 waits 9 steps
  • Waits p-1 steps
  • Need to do n2/p
  • Finishes after p-1n2/p steps

P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
17
Cyclic Greedy Algorithm
  • float Cn/pn // n/p gt 1 ad is an integer!
  • // code for only one iteration
  • my_rank ? rank()
  • p ? num_procs()
  • for (i0iltn/pi)
  • for (j0 jltn j)
  • if (my_rankip gt 0) RECV(tmp,1) else tmp
    -1
  • Cij update(i,j,tmp)
  • if (my_rankip lt n-1) SEND((Cij),1)

18
Cyclic Greedy Algorithm
  • Let us compute the execution time for this
    algorithm, T(n,p)
  • Remember that n gt p
  • We can assume that sending a message is done in a
    non-blocking fashion (while receiving is
    blocking)
  • Then when a processor sends a message at step k
    of the algorithm in parallel it receives a
    message at step k1
  • This is a reasonable assumption because the
    message sizes are identical
  • Remember that in performance analysis we make
    simplifying assumptions otherwise reasoning
    becomes overly complex
  • Therefore, at each algorithm step processors
    (i.e., at least one) do
  • Call update on one cell takes time Ta
  • Ta computation in Flops / machine speed in
    Flop/sec
  • Send/receive a cell value takes time b Tc
  • b communication start-up latency
  • Tc cell value size (in byte) / network bandwidth
  • Each steps lasts Ta b Tc

19
Cyclic Greedy Algorithm
  • Each step takes time Ta b Tc
  • How many steps are there?
  • Weve done this for the shared-memory version
    (sort of)
  • It takes p-1 steps before processor Pp-1 can
    start computing
  • Then it computes n2/p cells
  • Therefore there are p-1n2/p steps
  • T(n,p) (p-1n2/p) (Ta Tc b)
  • This formula points to a big problem a large
    component of the execution time is caused by the
    communication start-up time b
  • In practice b can be as large or larger than Ta
    Tc!
  • The reason is we send many small messages
  • A bunch of SEND(...,1)!
  • Therefore we can fix it by sending larger
    messages
  • What we need to do is augment the granularity of
    the algorithm
  • Basic Principle 3 Sending large messages
    reduces communication overhead
  • Conflicts with principle 1

20
Higher Granularity
  • As opposed to sending a cell value as soon as it
    is computed, compute k cell values and send them
    all in bulk
  • We assume that k divides n
  • As opposed to having each processor hold n/p non
    contiguous rows of the domain, have each
    processor hold blocks of r consecutive rows
  • We assume that pr divides n

21
Higher Granularity
  • Assumption n m p
  • n40 and p10
  • R2 and k5

P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
22
Idle processors?
  • In the previous picture, it may be that, after it
    finishes computing its first block row, processor
    P0 has to wait for data from Pp-1
  • Processor P0 computes its first block row in n/k
    algorithm steps
  • Processor Pp-1 computes the first subblock of its
    first block row after p algorithm steps
  • Therefore, P0 is not idle if n/k gt p, or n gt kp
  • If n lt kp idle time, which is not a good idea
  • If n gt kp processors need to receive and store
    values for a while before being able to use them

23
Higher Granularity
  • Assumption n m p
  • n40 and p10
  • R2 and k5
  • nltkp
  • 40lt5x10

P0
P1
P2
P3
P4
P5
P6
P7
24
Cyclic Greedy Algorithm
  • Assumption n m p
  • n40 and p10
  • R1 and k1
  • nltkp
  • 40lt1x10

P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
25
Higher Granularity
  • Assumption n m p
  • n40 and p10
  • R2 and k4
  • nkp
  • 404x10

k
P0
P1
P2
P3
P4
P5
P6
P7
P8
r
P9
26
Higher Granularity
  • Assumption n m p
  • n40 and p10
  • R4 and k5
  • nkp
  • 404x10

k
P0
P1
P2
P3
P4
P5
r
P6
P7
P8
P9
27
Performance Analysis
  • Very similar to what we did before
  • Each step takes time krTa kTc b
  • time to compute kr cells
  • time to send k cells
  • There are p -1 n2 /(pkr) steps
  • p-1 steps before processor Pp-1 can start any
    computation
  • Then Pp-1 has to compute (n2/kr)/p blocks
  • T(n,p,k,r) (p-1n2/(pkr)) (krTa kTc b)
  • Compare to T(n,p) (p-1n2/p) (Ta Tc b)
  • We traded off Principle 1 for Principle 3,
    which is probably better in practice
  • Values of k and r can be experimented with to
    find what works best in practice
  • Note that optimal values can be computed

28
Solving Linear Systems of Eq.
  • Method for solving Linear Systems
  • The need to solve linear systems arises in an
    estimated 75 of all scientific computing
    problems Dahlquist 1974
  • Gaussian Elimination is perhaps the most
    well-known method
  • based on the fact that the solution of a linear
    system is invariant under scaling and under row
    additions
  • One can multiply a row of the matrix by a
    constant as long as one multiplies the
    corresponding element of the right-hand side by
    the same constant
  • One can add a row of the matrix to another one as
    long as one adds the corresponding elements of
    the right-hand side
  • Idea scale and add equations so as to transform
    matrix A in an upper triangular matrix

?
?
x

?
?
?
?
equation n-i has i unknowns, with
29
Gaussian Elimination
1 1 1
1 -2 2
1 2 -1
0
4
2
x
Substract row 1 from rows 2 and 3
1 1 1
0 -3 1
0 1 -2
0
4
2
x
Multiple row 3 by 3 and add row 2
1 1 1
0 -3 1
0 0 -5
0
4
10
x
-5x3 10 x3 -2 -3x2 x3 4
x2 -2 x1 x2 x3 0 x1 4
Solving equations in reverse order (backsolving)
30
Gaussian Elimination
  • The algorithm goes through the matrix from the
    top-left corner to the bottom-right corner
  • the ith step eliminates non-zero sub-diagonal
    elements in column i, substracting the ith row
    scaled by aji/aii from row j, for ji1,..,n.

values already computed
i
pivot row
0
to be zeroed
values yet to be updated
i
31
Sequential Gaussian Elimination
  • Simple sequential algorithm
  • // for each column i
  • // zero it out below the diagonal by adding
  • // multiples of row i to later rows
  • for i 1 to n-1
  • // for each row j below row i
  • for j i1 to n
  • // add a multiple of row i to row j
  • for k i to n
  • A(j,k) A(j,k) - (A(j,i)/A(i,i))
    A(i,k)
  • Several tricks that do not change the spirit of
    the algorithm but make implementation easier
    and/or more efficient
  • Right-hand side is typically kept in column n1
    of the matrix and one speaks of an augmented
    matrix
  • Compute the A(i,j)/A(i,i) term outside of the
    loop

32
Pivoting Motivation
0 1
1 1
  • A few pathological cases
  • Division by small numbers ? round-off error in
    computer arithmetics
  • Consider the following system
  • 0.0001x1 x2 1.000
  • x1 x2 2.000
  • exact solution x11.00010 and x2 0.99990
  • say we round off after 3 digits after the decimal
    point
  • Multiply the first equation by 104 and subtract
    it from the second equation
  • (1 - 1)x1 (1 - 104)x2 2 - 104
  • But, in finite precision with only 3 digits
  • 1 - 104 -0.9999 E4 -0.999 E4
  • 2 - 104 -0.9998 E4 -0.999 E4
  • Therefore, x2 1 and x1 0 (from the first
    equation)
  • Very far from the real solution!

33
Partial Pivoting
  • One can just swap rows
  • x1 x2 2.000
  • 0.0001x1 x2 1.000
  • Multiple the first equation by 0.0001 and
    subtract it from the second equation gives
  • (1 - 0.0001)x2 1 - 0.0001
  • 0.9999 x2 0.9999 gt x2 1
  • and then x1 1
  • Final solution is closer to the real solution.
    (Magical)
  • Partial Pivoting
  • For numerical stability, one doesnt go in order,
    but pick the next row in rows i to n that has the
    largest element in row i
  • This row is swapped with row i (along with
    elements of the right hand side) before the
    subtractions
  • the swap is not done in memory but rather one
    keeps an indirection array
  • Total Pivoting
  • Look for the greatest element ANYWHERE in the
    matrix
  • Swap columns
  • Swap rows
  • Numerical stability is really a difficult field

34
Parallel Gaussian Elimination?
  • Assume that we have one processor per matrix
    element

max aji needed to compute the scaling factor
to find the max aji
Independent computation of the scaling factor
Compute
Reduction
Broadcast
Every update needs the scaling factor and the
element from the pivot row
Independent computations
Broadcasts
Compute
35
LU Factorization
  • Gaussian Elimination is simple but
  • What if we have to solve many Ax b systems for
    different values of b?
  • This happens a LOT in real applications
  • Another method is the LU Factorization
  • Ax b
  • Say we could rewrite A L U, where L is a lower
    triangular matrix, and U is an upper triangular
    matrix O(n3)
  • Then Ax b is written L U x b
  • Solve L y b O(n2)
  • Solve U x y O(n2)

triangular system solves are easy
?
?
?
x
?
x


?
?
?
?
?
?
?
?
equation i has i unknowns
equation n-i has i unknowns
36
LU Factorization Principle
  • It works just like the Gaussian Elimination, but
    instead of zeroing out elements, one saves
    scaling coefficients.
  • Magically, A L x U !
  • Should be done with pivoting as well

1 2 -1
4 3 1
2 2 3
1 2 -1
0 -5 5
2 2 3
1 2 -1
4 -5 5
2 2 3
1 2 -1
4 -5 5
2 -2 5
gaussian elimination
save the scaling factor
gaussian elimination save the scaling factor
1 2 -1
4 -5 5
2 2/5 3
1 2 -1
0 -5 5
0 0 3
1 0 0
4 1 0
2 2/5 1
L
gaussian elimination save the scaling factor
U
37
LU Factorization
  • Were going to look at the simplest possible
    version
  • No pivotingjust creates a bunch of indirections
    that are easy but make the code look complicated
    without changing the overall principle

LU-sequential(A,n) for k 0 to n-2 //
preparing column k for i k1 to n-1
aik ? -aik / akk for j k1 to n-1
// Task Tkj update of column j for
ik1 to n-1 aij ? aij aik akj

stores the scaling factors
k
k
38
LU Factorization
  • Were going to look at the simplest possible
    version
  • No pivotingjust creates a bunch of indirections
    that are easy but make the code look complicated
    without changing the overall principle

LU-sequential(A,n) for k 0 to n-2 //
preparing column k for i k1 to n-1
aik ? -aik / akk for j k1 to n-1
// Task Tkj update of column j for
ik1 to n-1 aij ? aij aik akj

update
k
k
j
i
39
Parallel LU on a ring
  • Since the algorithm operates by columns from left
    to right, we should distribute columns to
    processors
  • Principle of the algorithm
  • At each step, the processor that owns column k
    does the prepare task and then broadcasts the
    bottom part of column k to all others
  • Annoying if the matrix is stored in row-major
    fashion
  • Remember that one is free to store the matrix in
    anyway one wants, as long as its coherent and
    that the right output is generated
  • After the broadcast, the other processors can
    then update their data.
  • Assume there is a function alloc(k) that returns
    the rank of the processor that owns column k
  • Basically so that we dont clutter our program
    with too many global-to-local index translations
  • In fact, we will first write everything in terms
    of global indices, as to avoid all annoying index
    arithmetic

40
LU-broadcast algorithm
  • LU-broadcast(A,n)
  • q ? rank()
  • p ? numprocs()
  • for k 0 to n-2
  • if (alloc(k) q)
  • // preparing column k
  • for i k1 to n-1
  • bufferi-k-1 ? aik ? -aik / akk
  • broadcast(alloc(k),buffer,n-k-1)
  • for j k1 to n-1
  • if (alloc(j) q)
  • // update of column j
  • for ik1 to n-1
  • aij ? aij bufferi-k-1 akj

41
Dealing with local indices
  • Assume that p divides n
  • Each processor needs to store rn/p columns and
    its local indices go from 0 to r-1
  • After step k, only columns with indices greater
    than k will be used
  • Simple idea use a local index, l, that everyone
    initializes to 0
  • At step k, processor alloc(k) increases its local
    index so that next time it will point to its next
    local column

42
LU-broadcast algorithm
  • ...
  • double an-1r-1
  • q ? rank()
  • p ? numprocs()
  • l ? 0
  • for k 0 to n-2
  • if (alloc(k) q)
  • for i k1 to n-1
  • bufferi-k-1 ? ai,k ? -ai,l /
    ak,l
  • l ? l1
  • broadcast(alloc(k),buffer,n-k-1)
  • for j l to r-1
  • for ik1 to n-1
  • ai,j ? ai,j bufferi-k-1
    ak,j

43
What about the Alloc function?
  • One thing we have left completely unspecified is
    how to write the alloc function how are columns
    distributed among processors
  • There are two complications
  • The amount of data to process varies throughout
    the algorithms execution
  • At step k, columns k1 to n-1 are updated
  • Fewer and fewer columns to update
  • The amount of computation varies among columns
  • e.g., column n-1 is updated more often than
    column 2
  • Holding columns on the right of the matrix leads
    to much more work
  • There is a strong need for load balancing
  • All processes should do the same amount of work

44
Bad load balancing
P1
P2
P3
P4
already done
already done
working on it
45
Good Load Balancing?
already done
already done
working on it
Cyclic distribution
46
Proof that load balancing is good
  • The computation consists of two types of
    operations
  • column preparations
  • matrix element updates
  • There are many more updates than preparations, so
    we really care about good balancing of the
    preparations
  • Consider column j
  • Lets count the number of updates performed by
    the processor holding column j
  • Column j is updated at steps k0, ..., j-1
  • At step k, elements ik1, ..., n-1 are updates
  • indices start at 0
  • Therefore, at step k, the update of column j
    entails n-k-1 updates
  • The total number of updates for column j in the
    execution is

47
Proof that load balancing is good
  • Consider processor Pi, which holds columns lpi
    for l0, ... , n/p -1
  • Processor Pi needs to perform this many updates
  • Turns out this can be computed
  • separate terms
  • use formulas for sums of integers and sums of
    squares
  • What it all boils down to is
  • This does not depend on i
  • Therefore it is (asymptotically) the same for all
    Pi processors
  • Therefore we have (asymptotically) perfect load
    balancing!

48
Load-balanced program
  • ...
  • double an-1r-1
  • q ? rank()
  • p ? numprocs()
  • l ? 0
  • for k 0 to n-2
  • if (k q mod p)
  • for i k1 to n-1
  • bufferi-k-1 ? ai,k ? -ai,l /
    ak,l
  • l ? l1
  • broadcast(alloc(k),buffer,n-k-1)
  • for j l to r-1
  • for ik1 to n-1
  • ai,j ? ai,j bufferi-k-1
    ak,j

49
Performance Analysis
  • How long does this code take to run?
  • This is not an easy question because there are
    many tasks and many communications
  • A little bit of analysis shows that the execution
    time is the sum of three terms
  • n-1 communications nxb (n2/2)Tcomm O(1)
  • n-1 column preparations (n2/2)Tcomp O(1)
  • column updates (n3/3p)Tcomp O(n2)
  • Therefore, the execution time is (n3/3p)Tcomp
  • Note that the sequential time is (n3 /3)Tcomp
  • Therefore, we have perfect asymptotic efficiency!
  • once again
  • This is good, but isnt always the best in
    practice
  • How can we improve this algorithm?

50
Pipelining on the Ring
  • So far, the algorithm weve used a simple
    broadcast
  • Nothing was specific to being on a ring of
    processors and its portable
  • in fact you could just write raw MPI that just
    looks like our pseudo-code and have a very
    limited, inefficient for small n, LU
    factorization that works only for some number of
    processors
  • But its not efficient
  • The n-1 communication steps are not overlapped
    with computations
  • Therefore Amdahls law, etc.
  • Turns out that on a ring, with a cyclic
    distribution of the columns, one can interleave
    pieces of the broadcast with the computation
  • It almost looks like inserting the source code
    from the broadcast code we saw at the very
    beginning throughout the LU code

51
Previous program
  • ...
  • double an-1r-1
  • q ? rank()
  • p ? numprocs()
  • l ? 0
  • for k 0 to n-2
  • if (k q mod p)
  • for i k1 to n-1
  • bufferi-k-1 ? ai,k ? -ai,l /
    ak,l
  • l ? l1
  • broadcast(alloc(k),buffer,n-k-1)
  • for j l to r-1
  • for ik1 to n-1
  • ai,j ? ai,j bufferi-k-1
    ak,j

52
LU-pipeline algorithm
  • double an-1r-1
  • q ? rank()
  • p ? numprocs()
  • l ? 0
  • for k 0 to n-2
  • if (k q mod p)
  • for i k1 to n-1
  • bufferi-k-1 ? ai,k ? -ai,l /
    ak,l
  • l ? l1
  • send(buffer,n-k-1)
  • else
  • recv(buffer,n-k-1)
  • if (q ? k-1 mod p) send(buffer, n-k-1)
  • for j l to r-1
  • for ik1 to n-1
  • ai,j ? ai,j bufferi-k-1
    ak,j

53
Why is it better?
  • During a broadcast the roots successor just sits
    idle while the message goes along the ring
  • This is because of the way we have implemented
    broadcast, partially
  • With a better broadcast on a general topology the
    wait may be smaller
  • But there is still a wait
  • What we have done is allow each processor to move
    on to other business after receiving and
    forwarding the message
  • Possible by writing the code with just sends and
    receive
  • More complicated, more efficient usual trade-off
  • Lets look at a (idealized) time-line

54
Prep(0)
Send(0)
Recv(0)
Update(0,4)
Send(0)
Recv(0)
Update(0,8)
Update(0,1)
Send(0)
Recv(0)
Update(0,12)
Update(0,5)
Update(0,2)
Update(0,3)
Update(0,9)
Update(0,6)
Update(0,7)
Update(0,13)
Update(0,10)
Update(0,11)
A processor sends out data as soon as it
receives it
Update(0,14)
Update(0,15)
Prep(1)
Send(1)
Recv(1)
Update(1,5)
Send(1)
Recv(1)
Update(1,9)
Update(1,2)
Send(1)
Recv(1)
Update(1,13)
Update(1,6)
Update(1,3)
Update(1,4)
Update(1,10)
Update(1,7)
Update(1,8)
First four stages
Update(1,14)
Update(1,11)
Update(1,12)
Update(1,15)
Prep(2)
Send(2)
Recv(2)
Update(2,6)
Send(2)
Recv(2)
Update(2,10)
Update(2,3)
Send(2)
Recv(2)
Some communication occurs in parallel with
computation
Update(2,14)
Update(2,7)
Update(2,4)
Update(2,5)
Update(2,11)
Update(2,8)
Update(2,9)
Update(2,15)
Update(2,12)
Update(2,13)
Prep(3)
Send(3)
Recv(3)
Update(3,7)
Send(3)
Recv(3)
Update(3,11)
Update(3,4)
Send(3)
Recv(3)
Update(3,15)
Update(3,8)
Update(3,5)
Update(3,6)
Update(3,12)
Update(3,9)
Update(3,10)
Update(3,13)
Update(3,14)
55
Can we do better?
  • In the previous algorithm, a processor does all
    its updates before doing a Prep() computation
    that then leads to a communication
  • But in fact, some of these updates can be done
    later
  • Idea Send out pivot as soon as possible
  • Example
  • In the previous algorithm
  • P1 Receive(0), Send(0)
  • P1 Update(0,1), Update(0,5), Update(0,9),
    Update(0,13)
  • P1 Prep(1)
  • P1 Send(1)
  • ...
  • In the new algorithm
  • P1 Receive(0), Send(0)
  • P1 Update(0,1)
  • P1 Prep(1)
  • P1 Send(1)
  • P1 Update(0,5), Update(0,9), Update(0,13)
  • ...

56
Prep(0)
Send(0)
Recv(0)
Update(0,4)
Send(0)
Recv(0)
Update(0,8)
Update(0,1)
Send(0)
Recv(0)
Update(0,12)
Update(0,2)
Update(0,3)
Prep(1)
Update(0,7)
Send(1)
Recv(1)
A processor sends out data as soon as it
receives it
Update(0,5)
Send(1)
Recv(1)
Update(0,9)
Update(1,2)
Send(1)
Recv(1)
Update(0,13)
Update(1,3)
Update(1,4)
Prep(2)
Update(1,5)
Update(1,8)
Send(2)
Recv(2)
Update(0,6)
Update(1,9)
Send(2)
Recv(2)
First four stages
Update(0,10)
Update(2,3)
Send(2)
Recv(2)
Update(0,14)
Update(1,13)
Update(1,12)
Prep(3)
Update(1,6)
Update(2,5)
Send(3)
Recv(3)
Update(0,11)
Update(1,10)
Send(3)
Recv(3)
Many communications occur in parallel with
computation
Update(0,15)
Update(2,4)
Send(3)
Recv(3)
Update(1,14)
Update(1,7)
Update(2,8)
Update(2,9)
Update(1,11)
Update(2,6)
Update(2,12)
Update(2,13)
Update(1,15)
Update(2,10)
Update(3,4)
Update(3,5)
Update(2,14)
Update(2,7)
Update(3,8)
Update(3,9)
Update(2,11)
Update(3,12)
Update(3,13)
Update(3,6)
Update(2,15)
Update(3,10)
Update(3,7)
Update(3,14)
Update(3,11)
Update(3,15)
57
LU-look-ahead algorithm
  • q ? rank()
  • p ? numprocs()
  • l ? 0
  • for k 0 to n-2
  • if (k q mod p)
  • Prep(k)
  • Send(buffer,n-k-1)
  • for all j k mod p, jgtk Update(k-1,j)
  • for all j k mod p, jgtk Update(k,j)
  • else
  • Recv(buffer,n-k-1)
  • if (q ? k - 1 mod p) then
    Send(buffer,n-k-1)
  • if (q ? k 1 mod p) then
  • Update(k,k1)
  • else
  • for all j k mod p, jgtk Update(k,j)

58
Further improving performance
  • One can use local overlap of communication and
    computation
  • multi-threading, good MPI non-blocking
    implementation, etc.
  • There is much more to be said about parallel LU
    factorization
  • Many research articles
  • Many libraries available
  • Its a good example of an application for which
    one can think hard about operation orderings and
    try to find improved sequences
  • The basic principle is always the same send
    things as early as possible
  • The modified principle send things as early as
    required, but not earlier! You can avoid extra
    communication load by sending less number of
    longer messages.
Write a Comment
User Comments (0)
About PowerShow.com