Matrix Multiply Routine
1
  /* Matrix Multiply Routine */
  #include <malloc.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <math.h>
  #include <sys/timeb.h>   /* struct timeb: time_t time; unsigned short millitm;
                              short timezone; short dstflag; */

  double timingroutine(int n, struct timeb t1, struct timeb t2)
  {
    double d;
    d = n / (t2.time - t1.time + (t2.millitm - t1.millitm) / 1000.0);
    return d;
  }

2
  double **matalloc(int n1, int n2)
  {
    int i;
    double **mat, *temp_ptr;

    /* Allocate space for the array */
    temp_ptr = (double *) calloc(n1 * n2, sizeof(double));
    if ((void *) temp_ptr == NULL)    /* inform caller: allocation failed */
      return NULL;

    mat = (double **) calloc(n1, sizeof(double *));
    if ((void *) mat == NULL)         /* inform caller: allocation failed */
      return NULL;

    for (i = 0; i < n1; i++)
      mat[i] = temp_ptr + i * n2;
    return mat;
  }

3
  void free_mat(double **mat)
  {
    free(mat[0]);   /* the block of doubles allocated in matalloc */
    free(mat);      /* the array of row pointers */
  }

4
  int N = 500;

  int main(void)
  {
    double **x, **y, **z;
    int i, j, k, m;
    double flops;
    struct timeb t1, t2;

    m = N - 1;
    x = matalloc(N, N);
    y = matalloc(N, N);
    z = matalloc(N, N);

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        x[i][j] = sin(1.0 * (i + 1) * (j + 1));
        y[i][j] = cos(1.0 * (i + 1) * (j + 1));
        z[i][j] = 0.0;
      }

5
    ftime(&t1);

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          z[i][j] += x[i][k] * y[k][j];

    ftime(&t2);
    flops = timingroutine(N * N * N, t1, t2) / 1000000.0;

    printf("c(n,n) = %f\n", z[m][m]);
    printf("Flops = %.1f MFlops\n", flops);
    printf("Time Elapsed = %f sec\n",
           t2.time - t1.time + (t2.millitm - t1.millitm) / 1000.0);

    free_mat(x);
    free_mat(y);
    free_mat(z);
    return 0;
  }

6
Compiling
  cc -O3 matmul.c -lm
  ./a.out
  cc -O3 matmul.c -lm -o matmul
  ./matmul

7
Example Lab 2
  • Compile your matrix multiply code with the -apo
    option and run the 500 by 500 and 1000 by 1000
    cases on 1, 2, 4, 8, and 16 threads.
  • Graph threads vs. megaflops for each case.
  • What can you say about the efficiency of the
    code? What % of the code is parallel?

8
Amdahl's Law
  • T = S + P/(procs)
  • where T is the running time with procs
    processors,
  • S is the time spent in serial code,
  • P is the time spent in parallel code when run on
    one proc.
  • T1 = S + P/(procs1)
  • T2 = S + P/(procs2)
  • Solve for S and P.
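As a worked step (not on the original slide), solving the two timing equations above for P and S gives

  P = (T1 - T2) / (1/procs1 - 1/procs2),    S = T1 - P/procs1.

For example, with procs1 = 1 and procs2 = 2 this reduces to P = 2(T1 - T2) and S = 2T2 - T1; the parallel fraction asked about in Lab 2 is then P/T1.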

9
Block Matrix Multiply
  • Compute C = AB
  • A is m by n, B is n by r, C is m by r
  • Let p be the blocking factor
  • Assume that m, n, and r are all multiples of p

10
Example
      A11 A12 A13        B11 B12 B13 B14
      A21 A22 A23        B21 B22 B23 B24
                         B31 B32 B33 B34
  • A is 2p by 3p while B is 3p by 4p.

11
Fortran
  • mm(A, LDA, B, LDB, C, LDC, m, n, s) returns C = AB
  • For the blocks of the example above:

      C = 0
      do i = 1, 2
        do j = 1, 4
          do k = 1, 3
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
          enddo
        enddo
      enddo

12
Code for square MM
  subroutine mat_block(x, y, z, m)
  integer :: m, i, j, k, R, C, D, ii, jj, p, q
  real(kind=8), dimension(m,m) :: x, y, z
  p = 20
  do j = 1, m/p
    do i = 1, m/p
      R = (i-1)*p
      C = (j-1)*p
      do q = 1, m/p
        D = (q-1)*p
        do jj = 1, p
          do k = 1, p
            do ii = 1, p
              z(R+ii, C+jj) = z(R+ii, C+jj) + x(R+ii, D+k) * y(D+k, C+jj)
            end do
          end do
        end do
      end do
    end do
  end do
  end subroutine mat_block

13
MM Fragment
  p = 20
  do j = 1, m/p
    do i = 1, m/p
      R = (i-1)*p
      C = (j-1)*p
      do q = 1, m/p
        D = (q-1)*p
        do jj = 1, p
          do k = 1, p
            do ii = 1, p
              z(R+ii, C+jj) = z(R+ii, C+jj) + x(R+ii, D+k) * y(D+k, C+jj)
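For reference, a minimal C sketch of the same blocked loop nest (not on the original slides), assuming square m by m matrices stored as double ** as returned by matalloc above, with m a multiple of the block size p:

  void mat_block_c(double **x, double **y, double **z, int m)
  {
    int p = 20;                    /* blocking factor */
    int i, j, q, ii, jj, k, R, C, D;

    for (j = 0; j < m / p; j++)
      for (i = 0; i < m / p; i++) {
        R = i * p;                 /* row offset of the current z block    */
        C = j * p;                 /* column offset of the current z block */
        for (q = 0; q < m / p; q++) {
          D = q * p;               /* offset along the inner (k) dimension */
          for (jj = 0; jj < p; jj++)
            for (k = 0; k < p; k++)
              for (ii = 0; ii < p; ii++)
                z[R + ii][C + jj] += x[R + ii][D + k] * y[D + k][C + jj];
        }
      }
  }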

14
Matlab (Jake Kessinger)
  n = 100;
  h = 1/n;
  maindiag = ones(99,1)*(-2 - h^2);
  offdiag = ones(98,1);
  coefmat = diag(maindiag) + diag(offdiag,1) + diag(offdiag,-1);
  constvec = zeros(99,1);
  constvec(1) = -2;
  constvec(99) = -4;
  yhat = coefmat \ constvec;

15
Matlab Assignment
  • Write a .m file to solve a constant coefficient
    ODE BVP with arbitrary right-hand side by finite
    differences.
  • A y'' + B y' + C y = f,  y(left) = bvleft,  y(right) = bvright.
  • [T, Y] = Solve_ode(a, b, c, rhs, left, right,
    bvleft, bvright, numsub)

16
Matlab (2)
  • Discretizing yields
  • A(y_{j+1} - 2y_j + y_{j-1}) + hB(y_{j+1} - y_{j-1})/2 + Ch^2 y_j
    = h^2 f(t_j),  j = 1, ..., n-1,
  • with y_0 and y_n known.

17
Matlab (3)
  • Setting up the matrix yields
  • diag(A) = -2A + Ch^2,
  • supdiag(A) = A + hB/2,
  • subdiag(A) = A - hB/2,
  • and the rhs
  • ( -A y_0 + hB y_0/2 + h^2 f(t_1),  h^2 f(t_2), ...,  h^2 f(t_{n-2}),
    -A y_n - hB y_n/2 + h^2 f(t_{n-1}) )

18
Matlab (4)
  • Notice that if A, B, or C were functions that
    depend on t, then we would modify the matrix
    entries as follows.
  • Let T = ( t_0 + h, t_0 + 2h, ..., t_0 + (n-1)h )

19
Matlab (5)
  • diag(A) = (-2A + Ch^2)(T)
  • supdiag(A) = (A + hB/2)( T(1:n-2) ),
  • subdiag(A) = (A - hB/2)( T(2:n-1) ),
  • and the rhs
  • ( -A(t_1) y_0 + hB(t_1) y_0/2 + h^2 f(t_1),  h^2 f(t_2), ...,
    h^2 f(t_{n-2}),  -A(t_{n-1}) y_n - hB(t_{n-1}) y_n/2 + h^2 f(t_{n-1}) )

20
Matlab (6)
  • One of the interesting points is that once the
    code is written, it is just as easy to solve a
    variable coefficient ODE BVP numerically as it is
    to solve a constant coefficient one.

21
Big Picture Dom Decomp
  • Suppose we have our Matlab routine written and we
    want to solve
  • y'' + y = cos(t),  y(0) = 0,  y(1) = 0.
  • Domain decomposition: subdivide [0,1] into 2n
    intervals of spacing 1/(2n). Consider the two
    overlapping intervals [0, .5 + 1/(2n)] and [.5, 1],
    each made up of n+1 intervals.

22
Big Picture (2)
  • S(1): y'' + y = cos(t),  y(0) = 0,  y(.5 + 1/(2n)) = a
  • S(2): y'' + y = cos(t),  y(.5) = b,  y(1) = 0
  • Replace b with S(1)_n and a with S(2)_1 and
    iterate.

23
Pthreads
  • man pthreads
  • pthreads is a library of C functions to spawn
    light-weight processes.
  • Used for concurrent and parallel programming.
  • Uses a shared memory model of parallelism

24
pthread_create
  • Prototype
  int pthread_create(
        pthread_t            *thread,
        const pthread_attr_t *attr,
        void                *(*fcn)(void *),
        void                 *arg);

25
pthread_join
  • Prototype
  int pthread_join(
        pthread_t   thread,
        void      **value_ptr);

26
simple.c
  int r1 = 0, r2 = 0;

  extern int main(void)
  {
    do_one_thing(&r1);
    do_another_thing(&r2);
    do_wrap_up(r1, r2);
    return 0;
  }

27
simple_threads.c
  pthread_t thread1, thread2;

  pthread_create(&thread1,
                 NULL,
                 (void *(*)(void *)) do_one_thing,
                 (void *) &r1);
  pthread_create(&thread2,
                 NULL,
                 (void *(*)(void *)) do_another_thing,
                 (void *) &r2);
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
  do_wrap_up(r1, r2);
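The do_one_thing, do_another_thing, and do_wrap_up helpers are not shown on the slides; a minimal sketch of bodies that would make the example compile and run (the counting work is purely illustrative, and the functions are written here with the void *(void *) signature that pthread_create expects):

  #include <stdio.h>
  #include <pthread.h>

  int r1 = 0, r2 = 0;

  void *do_one_thing(void *arg)        /* hypothetical worker */
  {
    int *count = (int *) arg;
    int i;
    for (i = 0; i < 10000; i++)
      (*count)++;
    return NULL;
  }

  void *do_another_thing(void *arg)    /* hypothetical worker */
  {
    int *count = (int *) arg;
    int i;
    for (i = 0; i < 10000; i++)
      (*count)++;
    return NULL;
  }

  void do_wrap_up(int one, int another)
  {
    printf("one %d, another %d, total %d\n", one, another, one + another);
  }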

28
Pthreads Tutorials
  • http://www.uq.edu.au/~cmamuys/humbug/talks/pthreads/pthreads.html
  • http://www.cs.ucr.edu/~sshah/pthreads/tutorial.html

29
Example Codes
  ftp ftp.uu.net
  Name: anonymous
  Password: yourname@where_ever.com
  ftp> cd /published/oreilly/nutshell/pthreads
  ftp> binary
  ftp> get examples.tar.gz
  ftp> quit

30
To extract files
  gzcat examples.tar.gz | tar xvf -
  Or in System V:
  gzcat examples.tar.gz | tar xof -
  Or:
  gunzip examples.tar.gz
  tar xvf examples.tar

31
Lab 2
  • Scott Franklin
  • Angela Menke
  • Shawn Feng (please get me hard copy)
  • Jake Kessinger
  • Jeff Robinson
  • Samanmalee
  • NOTE: If you can't send me MS Word or PDF, then
    give me a hard copy.

32
Three models for threads
  • Boss/Worker
  • Peer
  • Pipeline

33
Boss/Worker
  • Boss thread accepts requests, creates threads to
    satisfy request.
  • Examples
  • Automated Teller
  • Database server
  • window managers

34
Peer
  • Peers do not require a boss to schedule work or
    identify type of work needed
  • Examples
  • Matrix Multiply
  • Game of life

35
Pipeline
  • A pipeline assumes a long input stream processed
    by a series of sub-operations. Note: all
    sub-operations must take about the same amount of
    time, or there will be a bottleneck.
  • Example
  • Manufacturing process
  • image processing

36
Thread Pool
  • One of the interesting features of pthreads is
    that a program can create as many threads as it
    wants. This can cause problems, as it is easy to
    overload the OS.
  • One solution is a thread pool: a fixed number of
    threads are created and put in an idle state,
    waiting to serve requests. A minimal sketch of
    such a pool follows.
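The sketch below is not from the slides; names such as pool_init, pool_submit, and NTHREADS are illustrative. Workers sleep on a condition variable and serve queued requests; error checking and shutdown handling are omitted for brevity.

  #include <stdlib.h>
  #include <pthread.h>

  #define NTHREADS 4

  typedef struct request {
    void (*work)(void *);         /* what to do            */
    void *arg;                    /* argument for the work */
    struct request *next;
  } request_t;

  static request_t *head = NULL, *tail = NULL;
  static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  notify = PTHREAD_COND_INITIALIZER;

  /* each pool thread loops forever: wait for a request, run it */
  static void *worker(void *unused)
  {
    (void) unused;
    for (;;) {
      request_t *r;
      pthread_mutex_lock(&lock);
      while (head == NULL)
        pthread_cond_wait(&notify, &lock);
      r = head;
      head = r->next;
      if (head == NULL)
        tail = NULL;
      pthread_mutex_unlock(&lock);
      r->work(r->arg);
      free(r);
    }
    return NULL;
  }

  /* create the fixed set of idle threads once, at startup */
  void pool_init(void)
  {
    pthread_t tid;
    int i;
    for (i = 0; i < NTHREADS; i++)
      pthread_create(&tid, NULL, worker, NULL);
  }

  /* hand a request to the pool instead of spawning a new thread */
  void pool_submit(void (*work)(void *), void *arg)
  {
    request_t *r = (request_t *) malloc(sizeof(request_t));
    r->work = work;
    r->arg  = arg;
    r->next = NULL;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = r; else head = r;
    tail = r;
    pthread_cond_signal(&notify);
    pthread_mutex_unlock(&lock);
  }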

37
Quadrature
  • Rectangle rule: I(f) ≈ f(0)h + f(h)h + ... + f(1-h)h
  • Let's assume that we have a C function with
    prototype double f(double).
  • Is this a candidate for parallelism?
  • How would we thread it?

38
quadrature routine
  double quad(double left, double right, int n)
  {
    double sum = 0., h;
    int k;
    h = (right - left) / n;
    for (k = 0; k < n; k++)
      sum += f(left + k * h);
    return h * sum;
  }

39
Example
  • Arguments for the quadrature routine:

      typedef struct {
        double left;
        double right;
        int n;
        double approx;
      } quad_arg_t;

40
peer_quad
  void peer_quad(quad_arg_t *qarg)
  {
    qarg->approx = quad(qarg->left, qarg->right, qarg->n);
  }

41
main
  quad_arg_t *q[10];
  pthread_t peerid[10];
  double sum = 0., L = 0., H = .1;
  int k;

  for (k = 0; k < 10; k++) {
    q[k] = (quad_arg_t *) malloc(sizeof(quad_arg_t));
    q[k]->left  = L + k * H;
    q[k]->right = q[k]->left + H;
    q[k]->n = 100;   /* subintervals per thread; not set on the original slide */
    pthread_create(&(peerid[k]), NULL,
                   (void *(*)(void *)) peer_quad, (void *) q[k]);
  }
  for (k = 0; k < 10; k++) {
    pthread_join(peerid[k], NULL);   /* wait before reading approx */
    sum += q[k]->approx;
    free(q[k]);
  }

42
Challenge
  • I will pay $20 to the first pure C matrix
    multiply that performs at 250 Mflops or better on
    the SGI O2K at Reese under the following
    conditions:
  • One processor (any optimization level)
  • Matrix size 1000 by 1000
  • You may use double ** or double * and index into
    the arrays yourself.
  • You may NOT compute AB^T
  • Our current best is around 190 Mflops.

43
Lab 3
  • Using both the -apo option (FORTRAN) and
    pthreads (C), produce two quadrature routines
    that run in parallel. Implement the rectangle
    rule.
  • Extra credit in the pthreads example if you pass
    the function to the quadrature routine. More
    extra credit if you implement another quadrature
    rule as well.
  • Discuss these two programming methodologies.
  • Get an approximation to ∫ (g*F) dt on [0, 1]
  • g(t) = Σ_{k=1..400} cos(2πkt)/k,  F(t) = cos(2πt)
  • Approximate (g*F)(t) by Σ_{j=0..N-1} g(t - jh) F(jh) h,
    where h = 1/N and N = 1000.

44
Convergence of the FD method
  • H1: The solution to the 2-point BVP A y'' + B y' + C y = f,
    y(a) = c, y(b) = d is C^4.
  • H2: A, B, and C are continuous with AC < 0.
  • C: There is a K so that the FD approximation y =
    y_h satisfies |y(j) - e(a + jh)| < K h^2 for
    j = 1, ..., n-1.

45
Outline of Proof
  • Step 1: Let e be the exact solution on the grid
    a + h, ..., a + (n-1)h. Then
  • A e = rhs + O(h^4)
  • Step 2: Let A = A(h); then there is a constant M
    so that ||A(h)^{-1}|| < M/h^2.
  • Step 3: e_h - y_h = A(h)^{-1}(O(h^4)),
  • hence ||e_h - y_h|| < M (O(h^4)) / h^2.

46
Norms
  • x in R^n. Then ||x|| = max |x_i|.
  • Let A be an n by n matrix. ||A|| = sup
    ||Ax||/||x|| = max { ||Au|| : ||u|| = 1 }
  • Note ||Ax|| <= ||A|| ||x||
  • ||A^{-1}|| = sup ||A^{-1}x||/||x|| = sup ||x||/
    ||Ax|| = 1/min { ||Au|| : ||u|| = 1 }

47
Estimates for A(h)-1
  • For ||u|| = 1 = |u_i|:  ||Au|| >= |(Au)(i)|
    >= |-2A(t_i) + C(t_i)h^2| - |A(t_i) + B(t_i)h/2| -
    |A(t_i) - B(t_i)h/2|
  • For small h we have (since A(t) is bounded away
    from zero): ||Au|| >= |C(t_i)| h^2.
  • Thus ||A^{-1}|| <= K/h^2, where
    K = 1/min |C(t)|.

48
Taylor Series
  • If f is C^4 we have
  • f(a+h) = f(a) + h f'(a) + h^2 f''(a)/2 +
    h^3 f'''(a)/6 + h^4 f''''(z)/4!,
    a < z < a+h
  • (f(a+h) - f(a-h))/2h = f'(a) + O(h^2)
  • (f(a+h) - 2f(a) + f(a-h))/h^2 = f''(a) + O(h^2)
  • Plugging the exact solution into the difference
    equation yields

49
Plugging in the Taylor series
  • A(e_{j+1} - 2e_j + e_{j-1}) + hB(e_{j+1} - e_{j-1})/2 + Ch^2 e_j
  • = h^2 f(t_j) + A(t_j) O(h^4) + B(t_j) O(h^4)
  • = h^2 f(t_j) + O(h^4),  j = 1, ..., n-1

50
PDEs
  • Δu + Cu = f,  u|_∂Ω = g.
  • Let Ω = [0,1] x [0,1]
  • u_jk ≈ u(jh, kh),  h = 1/n
  • u_{j+1,k} + u_{j-1,k} + u_{j,k+1} + u_{j,k-1} - 4u_{jk} + h^2 C u_{jk}
    = h^2 f_{jk}
  • for j,k = 1, ..., n-1
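A minimal sketch (not on the slides) of one Jacobi sweep for this 5-point discretization, assuming constant C, (n+1) by (n+1) arrays allocated as in matalloc above, and boundary values already stored in u; the name jacobi_sweep is illustrative. Solving the discretized equation for u_{jk} gives u_{jk} = (sum of the four neighbors - h^2 f_{jk}) / (4 - h^2 C):

  void jacobi_sweep(double **u, double **unew, double **f,
                    double C, int n)
  {
    int j, k;
    double h = 1.0 / n;

    /* update every interior point from its four neighbors */
    for (j = 1; j < n; j++)
      for (k = 1; k < n; k++)
        unew[j][k] = (u[j+1][k] + u[j-1][k] + u[j][k+1] + u[j][k-1]
                      - h * h * f[j][k]) / (4.0 - h * h * C);
  }

Repeated sweeps (copying unew back into u each time) converge when C is negative, since the system is then strictly diagonally dominant.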

51
Convergence proof is similar
  • If C is negative and the solution is C^4, then the
    FD approximation to the equation converges at the
    rate h^2.

52
Pthreads Matrix Multiply
  • Let's try to design a matrix multiply routine
    using pthreads.
  • What is a matrix? I.e., how is it represented in
    C?
  • What arguments do we want at the highest level?

53
What is a matrix?
  struct matrix {
    double *mat;
    int m;    /* number of rows    */
    int n;    /* number of columns */
  };

  struct matrix *A;
  /* with row-major storage, A(j,k) is (A->mat)[j*(A->n) + k] */

54
Allocating a matrix
  struct matrix *matalloc(int m, int n)
  {
    struct matrix *temp;
    temp = (struct matrix *) calloc(1, sizeof(struct matrix));
    temp->mat = (double *) calloc(m * n, sizeof(double));
    temp->m = m;
    temp->n = n;
    return temp;
    /* allocate space for the array, record the
       dimensions, and return the address */
  }
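A matching deallocation routine is not shown on the slides; a minimal sketch (the name matfree is illustrative):

  void matfree(struct matrix *a)
  {
    free(a->mat);   /* release the element storage   */
    free(a);        /* release the descriptor itself */
  }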

55
Pthreads interface
  • mat_mul_th( )
  • What is absolutely necessary to pass?
  • What would be nice to pass?
  • Are there other considerations?

56
Interface
  struct matrix *mat_mul_th(struct matrix *a,
                            struct matrix *b, int num_threads);

57
Algorithm Design
  • How do we slice and dice the matrices?
  • How do we want to allocate the threads?
  • Should we use a boss/worker model?
  • Or a peer model?
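One possible answer, as a sketch rather than the course's solution: a peer model in which each of num_threads peers computes a contiguous band of rows of the product, using the struct matrix, matalloc, and row-major indexing defined above (and assuming a->n == b->m). The helper names mm_arg_t and mm_peer are illustrative:

  #include <stdlib.h>
  #include <pthread.h>

  /* argument block for one peer: which rows of the product it owns */
  typedef struct {
    struct matrix *a, *b, *c;
    int row_start, row_end;       /* this peer computes rows [row_start, row_end) */
  } mm_arg_t;

  static void *mm_peer(void *arg)
  {
    mm_arg_t *p = (mm_arg_t *) arg;
    int i, j, k;
    int n = p->a->n;              /* inner dimension */
    for (i = p->row_start; i < p->row_end; i++)
      for (j = 0; j < p->b->n; j++) {
        double s = 0.0;
        for (k = 0; k < n; k++)
          s += p->a->mat[i * p->a->n + k] * p->b->mat[k * p->b->n + j];
        p->c->mat[i * p->c->n + j] = s;
      }
    return NULL;
  }

  struct matrix *mat_mul_th(struct matrix *a, struct matrix *b, int num_threads)
  {
    struct matrix *c = matalloc(a->m, b->n);
    pthread_t *tid  = (pthread_t *) malloc(num_threads * sizeof(pthread_t));
    mm_arg_t *args  = (mm_arg_t *) malloc(num_threads * sizeof(mm_arg_t));
    int t, rows = a->m / num_threads;

    for (t = 0; t < num_threads; t++) {
      args[t].a = a;  args[t].b = b;  args[t].c = c;
      args[t].row_start = t * rows;
      args[t].row_end   = (t == num_threads - 1) ? a->m : (t + 1) * rows;
      pthread_create(&tid[t], NULL, mm_peer, &args[t]);
    }
    for (t = 0; t < num_threads; t++)
      pthread_join(tid[t], NULL);

    free(args);
    free(tid);
    return c;
  }

Splitting by rows keeps the peers independent (no locking is needed, since each writes a disjoint part of c); the last peer simply absorbs any leftover rows when a->m is not a multiple of num_threads.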