Title: Numerical Linear Algebra: An Introduction
1 Numerical Linear Algebra: An Introduction
2 Levels of multiplication
- vector-vector: a(i)*b(i)
- matrix-vector: A(i,j)*b(j)
- matrix-matrix: A(i,k)*B(k,j)
3 Matrix-Matrix
- for (i = 0; i < n; i++)
-   for (j = 0; j < n; j++)
-   {
-     C[i][j] = 0.0;
-     for (k = 0; k < n; k++)
-       C[i][j] = C[i][j] + A[i][k] * B[k][j];
-   }
- Note: O(n^3) work (a callable C version follows)
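The same triple loop, packaged as a callable C function (a minimal sketch; the row-major flat-array storage and the function name matmul_naive are choices made for this example, not from the slides):

    /* Naive O(n^3) matrix multiply: C = A*B, all matrices n x n,
       stored row-major in flat arrays of length n*n. */
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }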
4 Block matrix
- Serial version: same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
- Let n be a power of two and generate a recursive algorithm; it terminates with an explicit formula for the elementary 2x2 multiplications. Allows for parallelism. Can get O(n^2.8) work (spelled out below)
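To spell out the recursion: splitting each matrix into four n/2 x n/2 blocks gives

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

which is 8 half-size multiplies, so T(n) = 8 T(n/2) + O(n^2) = O(n^3). Strassen's algorithm rearranges the block products so only 7 half-size multiplies are needed, giving T(n) = 7 T(n/2) + O(n^2) = O(n^(log2 7)) ≈ O(n^2.81), which is presumably the O(n^2.8) figure cited above.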
5 Pipeline method
- Pump data left to right and top to bottom (an MPI-style sketch follows this slide)
- recv(A, P(i,j-1))
- recv(B, P(i-1,j))
- C = C + A*B
- send(A, P(i,j+1))
- send(B, P(i+1,j))
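A hypothetical MPI rendering of one pipeline step for a processor P(i,j) on an N x N process grid. The rank = i*N + j mapping, the message tags, and the b x b block size are assumptions of this sketch; injection of blocks at the boundary processors (i == 0 or j == 0) is not shown:

    #include <mpi.h>

    /* One step for processor P(i,j): receive a block of A from the left
       and a block of B from above, accumulate C = C + A*B, then pass A
       to the right and B downward.  A, B, C are b x b row-major blocks. */
    void pipeline_step(int i, int j, int N, int b,
                       double *A, double *B, double *C)
    {
        MPI_Status st;
        if (j > 0) MPI_Recv(A, b*b, MPI_DOUBLE, i*N + (j-1), 0, MPI_COMM_WORLD, &st);
        if (i > 0) MPI_Recv(B, b*b, MPI_DOUBLE, (i-1)*N + j, 1, MPI_COMM_WORLD, &st);

        for (int ii = 0; ii < b; ii++)          /* C = C + A*B on the local blocks */
            for (int kk = 0; kk < b; kk++)
                for (int jj = 0; jj < b; jj++)
                    C[ii*b + jj] += A[ii*b + kk] * B[kk*b + jj];

        if (j < N-1) MPI_Send(A, b*b, MPI_DOUBLE, i*N + (j+1), 0, MPI_COMM_WORLD);
        if (i < N-1) MPI_Send(B, b*b, MPI_DOUBLE, (i+1)*N + j, 1, MPI_COMM_WORLD);
    }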
6 Pipeline method
- [Diagram: a 4x4 grid of processors, each holding one block C(i,j) for i,j = 0..3; the row blocks A(0,*) ... A(3,*) stream in from the left and the column blocks B(*,0) ... B(*,3) stream in from the top.]
7 Pipeline method
- A similar method works for matrix-vector multiplication, but you lose some of the cache reuse
8 A sense of speed: vector ops

    Loop   Flops per pass   Operation per pass                   Operation type
    1      2                v1(i) = v1(i) + a*v2(i)              update
    2      8                v1(i) = v1(i) + sum_k s_k*v_k(i)     4-fold vector update
    3      1                v1(i) = v1(i) / v2(i)                divide
    4      2                v1(i) = v1(i) + s*v2(ind(i))         update + gather
    5      2                v1(i) = v2(i) - v3(i)*v1(i-1)        bidiagonal
    6      2                s = s + v1(i)*v2(i)                  inner product
9 A sense of speed: vector ops

    Loop   J90, cft77 (100 nsec clock)      T90, 1 processor, cft77 (2.2 nsec clock)
           r_inf        n_1/2               r_inf        n_1/2
    1      97           19                  159
    2      163          29                  1245         50
    3      21           6                   219          43
    4      72           21                  780          83
    5      4.9          2                   25           9
    6      120          202                 474          164
10 Observations
- Simple do loops are not effective
- Cache and the memory hierarchy are the bottlenecks
- For better performance:
-   combine loops (see the sketch below)
-   minimize memory transfer
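A hypothetical illustration of "combine loops": an inner product and a saxpy update over the same vectors can be fused into a single pass, so x and y are streamed through memory once instead of twice (the variable names are made up for this sketch; x and y are length-n double arrays):

    double s = 0.0;

    /* two separate passes: x and y are each read from memory twice */
    for (int i = 0; i < n; i++) s += x[i] * y[i];   /* inner product */
    for (int i = 0; i < n; i++) y[i] += a * x[i];   /* saxpy update  */

    /* fused alternative: one pass, each element loaded once, same results */
    for (int i = 0; i < n; i++) {
        s += x[i] * y[i];
        y[i] += a * x[i];
    }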
11 LINPACK
- A library of subroutines to solve linear algebra problems
- Example: LU decomposition and the corresponding system solve (dgefa and dgesl, respectively)
- In turn, built on the BLAS
- See netlib.org
12 BLAS: Basic Linear Algebra Subprograms
- A library of subroutines designed to provide efficient computation of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
- The naming convention is not unlike other libraries: the first letter indicates the precision, and the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
- The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in level 3 (a short calling sketch follows).
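A minimal calling sketch of the three levels, assuming the CBLAS (C) interface to a BLAS installation; the square n x n shapes and row-major layout are choices made for this example:

    #include <cblas.h>

    /* One call from each BLAS level, using n x n matrices and length-n
       vectors throughout to keep the dimensions consistent. */
    void blas_levels_demo(int n, double alpha, double beta,
                          const double *x, double *y,
                          const double *A, const double *B, double *C)
    {
        /* Level 1 (vector-vector): y = alpha*x + y, then an inner product */
        cblas_daxpy(n, alpha, x, 1, y, 1);
        double dot = cblas_ddot(n, x, 1, y, 1);
        (void)dot;   /* silence unused-variable warnings in this sketch */

        /* Level 2 (matrix-vector): y = alpha*A*x + beta*y */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, alpha, A, n, x, 1, beta, y, 1);

        /* Level 3 (matrix-matrix): C = alpha*A*B + beta*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A, n, B, n, beta, C, n);
    }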
13 BLAS
14 BLAS
15 BLAS
16 How efficient is the BLAS?

    Level     Routine   Loads/stores      Float ops   Refs/ops
    1         SAXPY     3N                2N          3/2
    2         SGEMV     MN + N + 2M       2MN         ~1/2
    3         SGEMM     2MN + MK + KN     2MNK        ~2/N

- Only level 3 does O(N) flops per memory reference, so its ratio keeps improving as the matrices grow.
17 Matrix-vector
- read x(1:n) into fast memory
- read y(1:n) into fast memory
- for i = 1:n
-   read row i of A into fast memory
-   for j = 1:n
-     y(i) = y(i) + A(i,j)*x(j)
- write y(1:n) back to slow memory
- (a C version of this loop nest is sketched below)
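The same loop nest in C (a minimal sketch; row-major storage and the function name matvec are choices made for this example):

    /* y = y + A*x, with A stored row-major as a flat n*n array.
       Each row of A is read once; x and y stay in fast memory if they
       fit, so slow-memory traffic is about n^2 + 3n words. */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int j = 0; j < n; j++)
                yi += A[i*n + j] * x[j];
            y[i] = yi;
        }
    }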
18 Matrix-vector
- m = number of slow memory references = n^2 + 3n
- f = number of arithmetic ops = 2n^2
- q = f/m ≈ 2
- Matrix-vector multiply is limited by slow memory speed
19 Matrix-matrix
20 Matrix Multiply - unblocked
- for i = 1 to n
-   read row i of A into fast memory
-   for j = 1 to n
-     read C(i,j) into fast memory
-     read column j of B into fast memory
-     for k = 1 to n
-       C(i,j) = C(i,j) + A(i,k) * B(k,j)
-     write C(i,j) back to slow memory
21 Matrix Multiply - unblocked
- Number of slow memory references in unblocked matrix multiply:
-   m = n^3      read each column of B n times (once for each i)
-     + n^2      read each row of A once
-     + 2n^2     read and write each element of C once
-     = n^3 + 3n^2
- So q = f/m = (2n^3) / (n^3 + 3n^2)
-   ≈ 2 for large n: no improvement over matrix-vector multiply
22 Matrix Multiply - blocked
- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize (see the C sketch after this slide)
- for i = 1 to N
-   for j = 1 to N
-     read block C(i,j) into fast memory
-     for k = 1 to N
-       read block A(i,k) into fast memory
-       read block B(k,j) into fast memory
-       C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
-     write block C(i,j) back to slow memory
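The same blocked algorithm in C (a minimal sketch; row-major storage is assumed and, for simplicity, n is assumed to be a multiple of the blocksize b):

    /* Cache-blocked matrix multiply: C = C + A*B, all n x n, row-major.
       The three innermost loops multiply one b x b block of C by the
       corresponding blocks of A and B while they sit in fast memory. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i += b)
            for (int j = 0; j < n; j += b)
                for (int k = 0; k < n; k += b)
                    for (int ii = i; ii < i + b; ii++)
                        for (int kk = k; kk < k + b; kk++) {
                            double a = A[ii*n + kk];
                            for (int jj = j; jj < j + b; jj++)
                                C[ii*n + jj] += a * B[kk*n + jj];
                        }
    }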
23 Matrix Multiply - blocked
- Number of slow memory references in blocked matrix multiply:
-   m = N*n^2    read each block of B N^3 times (N^3 * (n/N) * (n/N) = N*n^2)
-     + N*n^2    read each block of A N^3 times
-     + 2n^2     read and write each block of C once
-     = (2N + 2) * n^2
- So q = f/m = 2n^3 / ((2N + 2) * n^2)
-   ≈ n/N = b for large n
24 Matrix Multiply - blocked
- So we can improve performance by increasing the blocksize b
- Can be much faster than matrix-vector multiply (q ≈ 2)
- Limit: all three b-by-b blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3b^2 < M, so q ≈ b < sqrt(M/3)
- For example, with a cache of M = 8192 doubles (64 KB), b can be at most about sqrt(8192/3) ≈ 52
25 More on BLAS
- Industry standard interface (evolving)
- Vendors and others supply optimized implementations
- History
-   BLAS1 (1970s)
-     vector operations: dot product, saxpy
-     m = 2n, f = 2n, q ≈ 1 or less
-   BLAS2 (mid 1980s)
-     matrix-vector operations
-     m = n^2, f = 2n^2, q ≈ 2, less overhead
-     somewhat faster than BLAS1
26 More on BLAS
-   BLAS3 (late 1980s)
-     matrix-matrix operations: matrix-matrix multiply, etc.
-     m > 4n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
- Good algorithms use BLAS3 when possible (LAPACK); a calling sketch follows below
- www.netlib.org/blas, www.netlib.org/lapack
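On the "use BLAS3 when possible" point, LAPACK's driver routines do this internally. A minimal sketch of solving A*x = b through the LAPACKE C interface (assuming LAPACKE is installed; dgesv and the row-major layout are choices made for this example, not something the slides prescribe):

    #include <lapacke.h>
    #include <stdlib.h>

    /* Solve A*x = b for one right-hand side: A is n x n row-major and is
       overwritten by its LU factors; b is overwritten by the solution. */
    int solve_system(int n, double *A, double *b)
    {
        lapack_int *ipiv = malloc(n * sizeof(lapack_int));   /* pivot indices */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);
        free(ipiv);
        return (int)info;   /* 0 on success */
    }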
27 BLAS on an IBM RS6000/590
- Peak speed: 266 Mflops
- [Plot: performance of BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors), with curves labeled Peak, BLAS 3, BLAS 2, BLAS 1.]