Title: Numerical Linear Algebra: An Introduction
1 Numerical Linear Algebra: An Introduction
2 Levels of multiplication
- vector-vector: a(i)*b(i)
- matrix-vector: A(i,j)*b(j)
- matrix-matrix: A(i,k)*B(k,j)
3 Matrix-Matrix
- for (i = 0; i < n; i++)
-   for (j = 0; j < n; j++)
-   {
-     C[i][j] = 0.0;
-     for (k = 0; k < n; k++)
-       C[i][j] = C[i][j] + A[i][k] * B[k][j];
-   }
- Note: O(n^3) work (a callable C version follows)
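The same triple loop, packaged as a callable C function (a minimal sketch; the row-major flat-array storage and the function name matmul_naive are choices made for this example, not from the slides):

    /* Naive O(n^3) matrix multiply: C = A*B, all matrices n x n,
       stored row-major in flat arrays of length n*n. */
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }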
4 Block matrix
- Serial version: same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
- Let n be a power of two and generate a recursive algorithm; it terminates with an explicit formula for the elementary 2x2 multiplications. Allows for parallelism. Can get O(n^2.8) work (spelled out below)
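To spell out the recursion: splitting each matrix into four n/2 x n/2 blocks gives

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

which is 8 half-size multiplies, so T(n) = 8 T(n/2) + O(n^2) = O(n^3). Strassen's algorithm rearranges the block products so only 7 half-size multiplies are needed, giving T(n) = 7 T(n/2) + O(n^2) = O(n^(log2 7)) ≈ O(n^2.81), which is presumably the O(n^2.8) figure cited above.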
5 Pipeline method
- Pump data left to right and top to bottom (an MPI-style sketch follows this slide)
- recv(A, P(i,j-1))
- recv(B, P(i-1,j))
- C = C + A*B
- send(A, P(i,j+1))
- send(B, P(i+1,j))
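A hypothetical MPI rendering of one pipeline step for a processor P(i,j) on an N x N process grid. The rank = i*N + j mapping, the message tags, and the b x b block size are assumptions of this sketch; injection of blocks at the boundary processors (i == 0 or j == 0) is not shown:

    #include <mpi.h>

    /* One step for processor P(i,j): receive a block of A from the left
       and a block of B from above, accumulate C = C + A*B, then pass A
       to the right and B downward.  A, B, C are b x b row-major blocks. */
    void pipeline_step(int i, int j, int N, int b,
                       double *A, double *B, double *C)
    {
        MPI_Status st;
        if (j > 0) MPI_Recv(A, b*b, MPI_DOUBLE, i*N + (j-1), 0, MPI_COMM_WORLD, &st);
        if (i > 0) MPI_Recv(B, b*b, MPI_DOUBLE, (i-1)*N + j, 1, MPI_COMM_WORLD, &st);

        for (int ii = 0; ii < b; ii++)          /* C = C + A*B on the local blocks */
            for (int kk = 0; kk < b; kk++)
                for (int jj = 0; jj < b; jj++)
                    C[ii*b + jj] += A[ii*b + kk] * B[kk*b + jj];

        if (j < N-1) MPI_Send(A, b*b, MPI_DOUBLE, i*N + (j+1), 0, MPI_COMM_WORLD);
        if (i < N-1) MPI_Send(B, b*b, MPI_DOUBLE, (i+1)*N + j, 1, MPI_COMM_WORLD);
    }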
6 Pipeline method
- [Diagram: a 4x4 grid of processors, each holding one block C(i,j) for i,j = 0..3; the row blocks A(0,*) ... A(3,*) stream in from the left and the column blocks B(*,0) ... B(*,3) stream in from the top.]
7 Pipeline method
- A similar method works for matrix-vector multiplication, but you lose some of the cache reuse
8 A sense of speed: vector ops

    Loop   Flops per pass   Operation per pass                   Operation type
    1      2                v1(i) = v1(i) + a*v2(i)              update
    2      8                v1(i) = v1(i) + sum_k s_k*v_k(i)     4-fold vector update
    3      1                v1(i) = v1(i) / v2(i)                divide
    4      2                v1(i) = v1(i) + s*v2(ind(i))         update + gather
    5      2                v1(i) = v2(i) - v3(i)*v1(i-1)        bidiagonal
    6      2                s = s + v1(i)*v2(i)                  inner product
9 A sense of speed: vector ops

    Loop   J90, cft77 (100 nsec clock)      T90, 1 processor, cft77 (2.2 nsec clock)
           r_inf        n_1/2               r_inf        n_1/2
    1      97           19                  159
    2      163          29                  1245         50
    3      21           6                   219          43
    4      72           21                  780          83
    5      4.9          2                   25           9
    6      120          202                 474          164
10 Observations
- Simple do loops are not effective
- Cache and the memory hierarchy are the bottlenecks
- For better performance:
-   combine loops (see the sketch below)
-   minimize memory transfer
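A hypothetical illustration of "combine loops": an inner product and a saxpy update over the same vectors can be fused into a single pass, so x and y are streamed through memory once instead of twice (the variable names are made up for this sketch; x and y are length-n double arrays):

    double s = 0.0;

    /* two separate passes: x and y are each read from memory twice */
    for (int i = 0; i < n; i++) s += x[i] * y[i];   /* inner product */
    for (int i = 0; i < n; i++) y[i] += a * x[i];   /* saxpy update  */

    /* fused alternative: one pass, each element loaded once, same results */
    for (int i = 0; i < n; i++) {
        s += x[i] * y[i];
        y[i] += a * x[i];
    }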
11 LINPACK
- A library of subroutines to solve linear algebra problems
- Example: LU decomposition and the corresponding system solve (dgefa and dgesl, respectively)
- In turn, built on the BLAS
- See netlib.org
12 BLAS: Basic Linear Algebra Subprograms
- A library of subroutines designed to provide efficient computation of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
- The naming convention is not unlike other libraries: the first letter indicates the precision, and the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
- The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in level 3 (a short calling sketch follows).
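A minimal calling sketch of the three levels, assuming the CBLAS (C) interface to a BLAS installation; the square n x n shapes and row-major layout are choices made for this example:

    #include <cblas.h>

    /* One call from each BLAS level, using n x n matrices and length-n
       vectors throughout to keep the dimensions consistent. */
    void blas_levels_demo(int n, double alpha, double beta,
                          const double *x, double *y,
                          const double *A, const double *B, double *C)
    {
        /* Level 1 (vector-vector): y = alpha*x + y, then an inner product */
        cblas_daxpy(n, alpha, x, 1, y, 1);
        double dot = cblas_ddot(n, x, 1, y, 1);
        (void)dot;   /* silence unused-variable warnings in this sketch */

        /* Level 2 (matrix-vector): y = alpha*A*x + beta*y */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, alpha, A, n, x, 1, beta, y, 1);

        /* Level 3 (matrix-matrix): C = alpha*A*B + beta*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A, n, B, n, beta, C, n);
    }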
13 BLAS
14 BLAS
15 BLAS
16 How efficient is the BLAS?

    Level     Routine   Loads/stores      Float ops   Refs/ops
    1         SAXPY     3N                2N          3/2
    2         SGEMV     MN + N + 2M       2MN         ~1/2
    3         SGEMM     2MN + MK + KN     2MNK        ~2/N

- Only level 3 does O(N) flops per memory reference, so its ratio keeps improving as the matrices grow.
17 Matrix-vector
- read x(1:n) into fast memory
- read y(1:n) into fast memory
- for i = 1:n
-   read row i of A into fast memory
-   for j = 1:n
-     y(i) = y(i) + A(i,j)*x(j)
- write y(1:n) back to slow memory
- (a C version of this loop nest is sketched below)
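The same loop nest in C (a minimal sketch; row-major storage and the function name matvec are choices made for this example):

    /* y = y + A*x, with A stored row-major as a flat n*n array.
       Each row of A is read once; x and y stay in fast memory if they
       fit, so slow-memory traffic is about n^2 + 3n words. */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int j = 0; j < n; j++)
                yi += A[i*n + j] * x[j];
            y[i] = yi;
        }
    }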
18 Matrix-vector
- m = number of slow memory references = n^2 + 3n
- f = number of arithmetic ops = 2n^2
- q = f/m ≈ 2
- Matrix-vector multiply is limited by slow memory speed
19 Matrix-matrix
20 Matrix Multiply - unblocked
- for i = 1 to n
-   read row i of A into fast memory
-   for j = 1 to n
-     read C(i,j) into fast memory
-     read column j of B into fast memory
-     for k = 1 to n
-       C(i,j) = C(i,j) + A(i,k) * B(k,j)
-     write C(i,j) back to slow memory
21 Matrix Multiply - unblocked
- Number of slow memory references in unblocked matrix multiply:
-   m = n^3      read each column of B n times (once for each i)
-     + n^2      read each row of A once
-     + 2n^2     read and write each element of C once
-     = n^3 + 3n^2
- So q = f/m = (2n^3) / (n^3 + 3n^2)
-   ≈ 2 for large n: no improvement over matrix-vector multiply
22 Matrix Multiply - blocked
- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize (see the C sketch after this slide)
- for i = 1 to N
-   for j = 1 to N
-     read block C(i,j) into fast memory
-     for k = 1 to N
-       read block A(i,k) into fast memory
-       read block B(k,j) into fast memory
-       C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
-     write block C(i,j) back to slow memory
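The same blocked algorithm in C (a minimal sketch; row-major storage is assumed and, for simplicity, n is assumed to be a multiple of the blocksize b):

    /* Cache-blocked matrix multiply: C = C + A*B, all n x n, row-major.
       The three innermost loops multiply one b x b block of C by the
       corresponding blocks of A and B while they sit in fast memory. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i += b)
            for (int j = 0; j < n; j += b)
                for (int k = 0; k < n; k += b)
                    for (int ii = i; ii < i + b; ii++)
                        for (int kk = k; kk < k + b; kk++) {
                            double a = A[ii*n + kk];
                            for (int jj = j; jj < j + b; jj++)
                                C[ii*n + jj] += a * B[kk*n + jj];
                        }
    }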
23 Matrix Multiply - blocked
- Number of slow memory references in blocked matrix multiply:
-   m = N*n^2    read each block of B N^3 times (N^3 * (n/N) * (n/N) = N*n^2)
-     + N*n^2    read each block of A N^3 times
-     + 2n^2     read and write each block of C once
-     = (2N + 2) * n^2
- So q = f/m = 2n^3 / ((2N + 2) * n^2)
-   ≈ n/N = b for large n
24 Matrix Multiply - blocked
- So we can improve performance by increasing the blocksize b
- Can be much faster than matrix-vector multiply (q ≈ 2)
- Limit: all three b-by-b blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3b^2 < M, so q ≈ b < sqrt(M/3)
- For example, with a cache of M = 8192 doubles (64 KB), b can be at most about sqrt(8192/3) ≈ 52
25 More on BLAS
- Industry standard interface (evolving)
- Vendors and others supply optimized implementations
- History
-   BLAS1 (1970s)
-     vector operations: dot product, saxpy
-     m = 2n, f = 2n, q ≈ 1 or less
-   BLAS2 (mid 1980s)
-     matrix-vector operations
-     m = n^2, f = 2n^2, q ≈ 2, less overhead
-     somewhat faster than BLAS1
26 More on BLAS
-   BLAS3 (late 1980s)
-     matrix-matrix operations: matrix-matrix multiply, etc.
-     m > 4n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
- Good algorithms use BLAS3 when possible (LAPACK); a calling sketch follows below
- www.netlib.org/blas, www.netlib.org/lapack
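On the "use BLAS3 when possible" point, LAPACK's driver routines do this internally. A minimal sketch of solving A*x = b through the LAPACKE C interface (assuming LAPACKE is installed; dgesv and the row-major layout are choices made for this example, not something the slides prescribe):

    #include <lapacke.h>
    #include <stdlib.h>

    /* Solve A*x = b for one right-hand side: A is n x n row-major and is
       overwritten by its LU factors; b is overwritten by the solution. */
    int solve_system(int n, double *A, double *b)
    {
        lapack_int *ipiv = malloc(n * sizeof(lapack_int));   /* pivot indices */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);
        free(ipiv);
        return (int)info;   /* 0 on success */
    }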
27 BLAS on an IBM RS6000/590
- Peak speed: 266 Mflops
- [Plot: performance of BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors), with curves labeled Peak, BLAS 3, BLAS 2, BLAS 1.]