Title: Dense Matrix Algorithms
1. Dense Matrix Algorithms
- CS 524: High-Performance Computing
2. Definitions
- p = number of processors (numbered 0 to p-1)
- n = dimension of array/matrix (indexed 0 to n-1)
- q = number of blocks along one dimension (0 to q-1)
- tc = computation time for one flop
- ts = communication startup time
- tw = communication transfer time per word
- Interconnection network: crossbar switch with bi-directional links
3. Partitioning One Dimension (Striping)
4. Partitioning Both Dimensions (Checkerboard)
5. Matrix Transpose (MT)
- Aᵀ(i, j) = A(j, i) for all i and j
- Sequential algorithm:
    do i = 0, n-1
      do j = 0, n-1
        B(i, j) = A(j, i)
      end do
    end do
- The loop above performs n² assignments; transposing in place instead requires swapping only the (n² - n)/2 element pairs above the diagonal, so the run time is (n² - n)/2, or roughly n²/2 (see the C sketch below)
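- The swap count above corresponds to an in-place transpose. A minimal C sketch (the function name and row-major storage are illustrative choices, not from the slides):

    #include <stddef.h>

    /* In-place transpose of an n x n row-major matrix: swap each pair
     * (i, j) / (j, i) above the diagonal -- (n*n - n)/2 swaps in total. */
    void transpose(double *A, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                double tmp = A[i * n + j];
                A[i * n + j] = A[j * n + i];
                A[j * n + i] = tmp;
            }
        }
    }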
6. MT - Checkerboard Partitioning (1)
7. MT - Checkerboard Partitioning (2)
8. MT - Striped Partitioning
9. Matrix-Vector Multiplication (MVM)
- MVM: y = Ax
    do i = 0, n-1
      do j = 0, n-1
        y(i) = y(i) + A(i, j)*x(j)
      end do
    end do
- Sequential algorithm requires n² multiplications and additions
- Assuming one flop takes tc time, sequential run time is 2tc·n²
10. Row-wise Striping, p = n (1)
11. Row-wise Striping, p = n (2)
- Data partitioning: Pi has row i of A and element i of x
- Communication: Each processor broadcasts its element of x
- Computation: Each processor performs n additions and n multiplications
- Parallel run time: Tp = 2n·tc + p(ts + tw) = 2n·tc + n(ts + tw)
- Algorithm is cost-optimal as both parallel and serial costs are O(n²)
12. Row-wise Striping, p < n
- Data partitioning: Each processor has n/p rows of A and the corresponding n/p elements of x
- Communication: Each processor broadcasts its elements of x
- Computation: Each processor performs n²/p multiplications and additions
- Parallel run time: Tp = 2tc·n²/p + p(ts + (n/p)·tw) = 2tc·n²/p + p·ts + n·tw
- Algorithm is cost-optimal for p = O(n); an MPI sketch of this scheme follows below
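- A hedged MPI sketch of the row-wise striped MVM for p < n (it assumes p divides n and uses illustrative names such as mvm_striped): every process all-gathers x, then computes its own n/p entries of y.

    #include <mpi.h>
    #include <stdlib.h>

    /* Row-striped matrix-vector multiply, p <= n, assuming p divides n.
     * A_loc: n/p x n block of rows, x_loc: n/p local elements of x,
     * y_loc: n/p results. Illustrative sketch, not a tuned implementation. */
    void mvm_striped(const double *A_loc, const double *x_loc,
                     double *y_loc, int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int nloc = n / p;

        /* All-to-all broadcast of x: every process ends up with all of x. */
        double *x = malloc(n * sizeof(double));
        MPI_Allgather(x_loc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

        /* Each process computes its n/p entries of y = Ax:
         * n²/p multiply-adds in total. */
        for (int i = 0; i < nloc; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += A_loc[i * n + j] * x[j];
            y_loc[i] = s;
        }
        free(x);
    }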
13. Checkerboard Partitioning, p = n² (1)
14. Checkerboard Partitioning, p = n² (2)
- Data partitioning: Each processor has one element of A; only processors in the last column also have one element of x
- Communication:
  - One element of x from each last-column processor to the diagonal processor in its row
  - Broadcast from the diagonal processor to all processors in its column
  - Global sum of y from all processors in a row to the last processor
- Computation: one multiplication and addition
- Parallel run time: Tp = 2tc + 3(ts + tw)
- Algorithm is cost-optimal as serial and parallel costs are O(n²)
- For a bus network, communication time is 3n(ts + tw); the system is not cost-optimal as its cost is O(n³)
15. Checkerboard Partitioning, p < n²
- Data partitioning: Each processor has an n/√p × n/√p block of A; processors in the last column also have n/√p elements of x
- Communication:
  - n/√p elements of x from each last-column processor to the diagonal processor in its row
  - Broadcast from the diagonal processor to all processors in its column
  - Global sum of y from all processors in a row to the last processor
- Computation: n²/p multiplications and additions
- Parallel run time: Tp = 2tc·n²/p + 3(ts + tw·n/√p)
- Algorithm is cost-optimal only if p = O(n²) (see the MPI sketch below)
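- A possible MPI rendering of the checkerboard MVM (a sketch only; it assumes p is a perfect square, √p divides n, a row-major rank layout, and invented names such as mvm_checkerboard): x moves from the last column to the diagonal, down the columns, and the partial sums are reduced along rows.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Checkerboard MVM sketch. A_loc is the local n/√p x n/√p block;
     * x_loc holds n/√p elements of x (meaningful only on the last
     * column); y_loc receives n/√p results on the last column. */
    void mvm_checkerboard(const double *A_loc, const double *x_loc,
                          double *y_loc, int n, MPI_Comm comm)
    {
        int p, rank;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &rank);
        int q = (int)(sqrt((double)p) + 0.5);   /* process grid is q x q */
        int nb = n / q;                          /* chunk size n/√p       */
        int row = rank / q, col = rank % q;

        MPI_Comm rowc, colc;                     /* row / column comms    */
        MPI_Comm_split(comm, row, col, &rowc);
        MPI_Comm_split(comm, col, row, &colc);

        double *x = malloc(nb * sizeof(double));

        /* Step 1: last column sends its x chunk to the diagonal. */
        if (col == q - 1 && row != q - 1)
            MPI_Send(x_loc, nb, MPI_DOUBLE, row, 0, rowc);
        if (row == col && col != q - 1)
            MPI_Recv(x, nb, MPI_DOUBLE, q - 1, 0, rowc, MPI_STATUS_IGNORE);
        if (row == q - 1 && col == q - 1)
            memcpy(x, x_loc, nb * sizeof(double));

        /* Step 2: diagonal process broadcasts the chunk down its column
         * (within colc the diagonal's rank equals the column index). */
        MPI_Bcast(x, nb, MPI_DOUBLE, col, colc);

        /* Step 3: local partial products -- n²/p multiply-adds. */
        double *part = calloc(nb, sizeof(double));
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                part[i] += A_loc[i * nb + j] * x[j];

        /* Step 4: row-wise global sum of y into the last column. */
        MPI_Reduce(part, y_loc, nb, MPI_DOUBLE, MPI_SUM, q - 1, rowc);

        free(part);
        free(x);
        MPI_Comm_free(&rowc);
        MPI_Comm_free(&colc);
    }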
16. Matrix-Matrix Multiplication (MMM)
- C = A × B, where A, B, C are n × n square matrices
- Block matrix multiplication: algebraic operations on sub-matrices, or blocks, of the matrices. This view of MMM aids parallelization (see the C sketch below).
    do i = 0, q-1
      do j = 0, q-1
        do k = 0, q-1
          Ci,j = Ci,j + Ai,k × Bk,j
        end do
      end do
    end do
- Number of multiplications and additions: n³ each. Sequential run time: 2tc·n³
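- A sequential C sketch of the block MMM (row-major storage and the name block_mmm are illustrative; nb = n/q is the block dimension):

    #include <stddef.h>

    /* Sequential block MMM: C = C + A*B with n = q*nb, row-major storage.
     * The (bi, bj, bk) loops over blocks mirror the pseudocode above;
     * the inner triple loop multiplies one nb x nb block pair. */
    void block_mmm(const double *A, const double *B, double *C,
                   size_t n, size_t nb)
    {
        size_t q = n / nb;                 /* q blocks per dimension */
        for (size_t bi = 0; bi < q; bi++)
        for (size_t bj = 0; bj < q; bj++)
        for (size_t bk = 0; bk < q; bk++) {
            /* C[bi,bj] += A[bi,bk] * B[bk,bj] on nb x nb blocks */
            const double *Ab = A + (bi * n + bk) * nb;
            const double *Bb = B + (bk * n + bj) * nb;
            double       *Cb = C + (bi * n + bj) * nb;
            for (size_t i = 0; i < nb; i++)
                for (size_t k = 0; k < nb; k++) {
                    double a = Ab[i * n + k];
                    for (size_t j = 0; j < nb; j++)
                        Cb[i * n + j] += a * Bb[k * n + j];
                }
        }
    }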
17. Checkerboard Partitioning, q = √p
- Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension n/√p × n/√p
- Communication: Each processor broadcasts its submatrix Ai,j to all processors in its row; each processor broadcasts its submatrix Bi,j to all processors in its column
- Computation: Each processor performs n·(n/√p)·(n/√p) = n³/p multiplications and additions
- Parallel run time: Tp = 2tc·n³/p + 2√p(ts + (n²/p)·tw)
- Algorithm is cost-optimal only if p = O(n²)
18. Cannon's Algorithm (1)
- Memory-efficient version of the checkerboard-partitioned block MMM
- At any time, each processor holds one block of A and one block of B
- Blocks are cycled after each computation step in such a way that after √p computations the multiplication is done for Ci,j
- Initial distribution of matrices is the same as in checkerboard partitioning
- Communication:
  - Initially, block Ai,j is moved left by i steps (with wraparound) and block Bi,j is moved up by j steps (with wraparound)
  - In each of the subsequent √p - 1 steps, block Ai,j is moved left by one step and block Bi,j is moved up by one step (both with wraparound)
- After √p computation and communication steps, the multiplication is complete for Ci,j
19. Cannon's Algorithm (2)
20. Cannon's Algorithm (3)
21. Cannon's Algorithm (4)
- Communication:
  - √p point-to-point communications of size n²/p along rows
  - √p point-to-point communications of size n²/p along columns
- Computation: over the √p steps, each processor performs n³/p multiplications and additions
- Parallel run time: Tp = 2tc·n³/p + 2√p(ts + (n²/p)·tw)
- Algorithm is cost-optimal if p = O(n²); an MPI sketch follows below
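- One way Cannon's algorithm might look in MPI (a sketch under assumptions the slides leave open: p a perfect square, √p dividing n, C_loc zeroed by the caller; a periodic Cartesian communicator supplies the wraparound):

    #include <mpi.h>
    #include <math.h>

    /* Cannon's algorithm sketch. A_loc and B_loc hold one n/√p x n/√p
     * block each and are overwritten by the cyclic shifts; C_loc
     * accumulates the local result block. */
    void cannon(double *A_loc, double *B_loc, double *C_loc,
                int n, MPI_Comm comm)
    {
        int p, rank, src, dst, coords[2];
        MPI_Comm_size(comm, &p);
        int q = (int)(sqrt((double)p) + 0.5);   /* process grid is q x q */
        int nb = n / q, bsz = nb * nb;

        /* Periodic 2D grid so every shift wraps around; reorder = 0
         * keeps ranks (and hence the data layout) unchanged. */
        int dims[2] = {q, q}, periods[2] = {1, 1};
        MPI_Comm grid;
        MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* Initial skew: A(i,j) left by i steps, B(i,j) up by j steps. */
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A_loc, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B_loc, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);

        /* √p steps: multiply resident blocks, then shift A left and
         * B up by one (the √p point-to-point transfers counted above). */
        int a_src, a_dst, b_src, b_dst;
        MPI_Cart_shift(grid, 1, -1, &a_src, &a_dst);
        MPI_Cart_shift(grid, 0, -1, &b_src, &b_dst);
        for (int step = 0; step < q; step++) {
            for (int i = 0; i < nb; i++)
                for (int k = 0; k < nb; k++)
                    for (int j = 0; j < nb; j++)
                        C_loc[i*nb + j] += A_loc[i*nb + k] * B_loc[k*nb + j];
            if (step == q - 1) break;            /* final shift not needed */
            MPI_Sendrecv_replace(A_loc, bsz, MPI_DOUBLE, a_dst, 0, a_src, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B_loc, bsz, MPI_DOUBLE, b_dst, 0, b_src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
        MPI_Comm_free(&grid);
    }

- MPI_Cart_shift with a negative displacement returns the lower-coordinate neighbour as the destination, which is exactly what the left and up shifts need.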
22. Fox's Algorithm (1)
- Another memory-efficient version of the checkerboard-partitioned block MMM
- Initial distribution of matrices is the same as in checkerboard partitioning
- At any time, each processor holds one block of A and one block of B
- Steps (repeated √p times):
  1. Broadcast Ai,i to all processors in row i
  2. Multiply the block of A received with the resident block of B
  3. Send the block of B up one step (with wraparound)
  4. Select block Ai,(j+1) mod √p (where Ai,j is the block broadcast in the previous step), broadcast it to all processors in the row, and go to 2
23. Fox's Algorithm (2)
24. Fox's Algorithm (3)
- Communication:
  - √p broadcasts of size n²/p along rows
  - √p point-to-point communications of size n²/p along columns
- Computation: Each processor performs n³/p multiplications and additions
- Parallel run time: Tp = 2tc·n³/p + 2√p(ts + (n²/p)·tw)
- Algorithm is cost-optimal if p = O(n²); a sketch in the same MPI style follows below
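- For comparison, the same skeleton adapted to Fox's algorithm (same assumptions as the Cannon sketch; the row communicator and rank arithmetic assume a row-major process layout): a row broadcast of an A block replaces Cannon's A shift, while B still moves up one step per iteration, giving the √p broadcasts and √p point-to-point transfers counted above.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Fox's algorithm sketch. A_loc stays resident; B_loc is shifted
     * up each step; C_loc (zeroed by the caller) accumulates. */
    void fox(const double *A_loc, double *B_loc, double *C_loc,
             int n, MPI_Comm comm)
    {
        int p, rank;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &rank);
        int q = (int)(sqrt((double)p) + 0.5);
        int nb = n / q, bsz = nb * nb;
        int row = rank / q, col = rank % q;      /* row-major rank layout */

        MPI_Comm rowc;                            /* for the A broadcasts  */
        MPI_Comm_split(comm, row, col, &rowc);

        double *Abuf = malloc(bsz * sizeof(double));
        int up   = ((row - 1 + q) % q) * q + col; /* column neighbours     */
        int down = ((row + 1) % q) * q + col;     /* (with wraparound)     */

        for (int s = 0; s < q; s++) {
            /* 1. In row i the holder of A(i, (i+s) mod q) broadcasts it. */
            int root = (row + s) % q;
            if (col == root)
                memcpy(Abuf, A_loc, bsz * sizeof(double));
            MPI_Bcast(Abuf, bsz, MPI_DOUBLE, root, rowc);

            /* 2. Multiply the broadcast A block by the resident B block. */
            for (int i = 0; i < nb; i++)
                for (int k = 0; k < nb; k++)
                    for (int j = 0; j < nb; j++)
                        C_loc[i*nb + j] += Abuf[i*nb + k] * B_loc[k*nb + j];

            /* 3. Shift B up one step with wraparound (the last shift
             *    simply returns B to its starting position). */
            MPI_Sendrecv_replace(B_loc, bsz, MPI_DOUBLE, up, 0, down, 0,
                                 comm, MPI_STATUS_IGNORE);
        }
        free(Abuf);
        MPI_Comm_free(&rowc);
    }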
25. Solving a System of Linear Equations
- System of linear equations, Ax = b
  - A is a dense n × n matrix of coefficients
  - b is an n × 1 vector of RHS values
  - x is an n × 1 vector of unknowns
- Solving for x is usually done in two stages:
  - First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix: U(i, j) = 0 if i > j, otherwise U(i, j) ≠ 0, and U(i, i) = 1 for 0 ≤ i < n. This stage is called Gaussian elimination.
  - Second, the unknowns are solved in reverse order starting from x(n-1). This stage is called back-substitution.
26. Gaussian Elimination (1)
    do k = 0, n-1
      do j = k+1, n-1
        A(k, j) = A(k, j)/A(k, k)
      end do
      y(k) = b(k)/A(k, k)
      A(k, k) = 1
      do i = k+1, n-1
        do j = k+1, n-1
          A(i, j) = A(i, j) - A(i, k)*A(k, j)
        end do
        b(i) = b(i) - A(i, k)*y(k)
        A(i, k) = 0
      end do
    end do
27. Gaussian Elimination (2)
- Computations:
  - Approximately n²/2 divisions
  - Approximately n³/3 - n²/2 multiplications and subtractions
- Approximate sequential run time: Ts = 2tc·n³/3 (a C translation of the elimination loop follows below)
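- A direct C translation of the elimination pseudocode (no pivoting, matching the slides' implicit assumption that A(k, k) never becomes zero; names are illustrative):

    #include <stddef.h>

    /* Gaussian elimination to Ux = y, mirroring the pseudocode above:
     * divide row k by the pivot, then eliminate column k below it.
     * A is n x n row-major and becomes U; b is overwritten while y is
     * filled in. */
    void gaussian_eliminate(double *A, double *b, double *y, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            for (size_t j = k + 1; j < n; j++)      /* division step    */
                A[k * n + j] /= A[k * n + k];
            y[k] = b[k] / A[k * n + k];
            A[k * n + k] = 1.0;
            for (size_t i = k + 1; i < n; i++) {    /* elimination step */
                for (size_t j = k + 1; j < n; j++)
                    A[i * n + j] -= A[i * n + k] * A[k * n + j];
                b[i] -= A[i * n + k] * y[k];
                A[i * n + k] = 0.0;
            }
        }
    }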
28. Striped Partitioning, p = n (1)
- Data partitioning: Each processor has one row of matrix A
- Communication during iteration k (outermost loop):
  - Broadcast of the active part of the kth row (size n - k - 1) to processors k+1 to n-1
- Computation during iteration k (outermost loop):
  - n - k - 1 divisions at processor Pk
  - n - k - 1 multiplications and subtractions at each processor Pi (k < i < n)
- Parallel run time: Tp = (3/2)n(n-1)·tc + n·ts + 0.5n(n-1)·tw (summing 3(n-k-1) flops and a broadcast of n-k-1 words over the n iterations)
- Algorithm is cost-optimal since serial and parallel costs are both O(n³)
29. Striped Partitioning, p = n (2)
30. Striped Partitioning, p = n (3)
31. Pipelined Version (Striped Partitioning)
- In the non-pipelined or synchronous version, outer loop k is executed in order:
  - When Pk is performing the division step, all other processors are idle
  - When performing the elimination step, only processors k+1 to n-1 are active; the rest are idle
- In the pipelined version, the division step, communication, and elimination step are overlapped
- Each processor communicates if it has data to communicate, computes if it has computations to be done, or waits if neither can be done
- Cost-optimal for linear array, mesh, and hypercube interconnection networks that have directly-connected processors
32. Pipelined Version (2)
33. Pipelined Version (3)
34. Striped Partitioning, p < n (1)
35. Striped Partitioning, p < n (2)
36. Checkerboard Partitioning, p = n² (1)
37. Checkerboard Partitioning, p = n² (2)
- Data partitioning: Pi,j has element A(i, j) of matrix A
- Communication during iteration k (outermost loop):
  - Broadcast of A(k, k) to processors (k, k+1) to (k, n-1) in the kth row
  - Broadcast of the modified A(i, k) along the ith row, for k < i < n
  - Broadcast of the modified A(k, j) along the jth column, for k < j < n
- Computation during iteration k (outermost loop):
  - One division at Pk,k
  - One multiplication and subtraction at each processor Pi,j (k < i, j < n)
- Parallel run time: Tp = (3/2)n(n-1)·tc + n·ts + 0.5(n-1)·tw
- Algorithm is cost-optimal since serial and parallel costs are O(n³)
38. Back-Substitution
- Solution of Ux = y, where U is a unit upper triangular matrix
    do k = n-1, 0, -1
      x(k) = y(k)
      do i = k-1, 0, -1
        y(i) = y(i) - x(k)*U(i, k)
      end do
    end do
- Computation: approximately n²/2 multiplications and subtractions
- The parallel algorithm is similar to that for the Gaussian elimination stage (a sequential C version follows below)
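- A matching C version of the back-substitution loop (signed indices are used so the descending loops can reach zero; names illustrative):

    #include <stddef.h>

    /* Back-substitution for Ux = y with U unit upper triangular,
     * mirroring the pseudocode above. Solves into x, consuming y. */
    void back_substitute(const double *U, double *y, double *x, size_t n)
    {
        for (ptrdiff_t k = (ptrdiff_t)n - 1; k >= 0; k--) {
            x[k] = y[k];
            for (ptrdiff_t i = k - 1; i >= 0; i--)
                y[i] -= x[k] * U[i * n + k];
        }
    }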