Title: Introduction to Parallel Programming
1. Introduction to Parallel Programming
- Language notation: message passing
- Distributed-memory machine (e.g., workstations on a network)
- 5 parallel algorithms of increasing complexity
  - Matrix multiplication
  - Successive overrelaxation
  - All-pairs shortest paths
  - Linear equations
  - Search problem
2. Message Passing
- SEND(destination, message)
  - blocking: wait until the message has arrived (like a fax)
  - non-blocking: continue immediately (like a mailbox)
- RECEIVE(source, message)
- RECEIVE_FROM_ANY(message)
  - blocking: wait until a message is available
  - non-blocking: test if a message is available
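The slides use abstract SEND / RECEIVE / RECEIVE_FROM_ANY primitives. As a concrete illustration only (MPI is an assumption here, the slides do not prescribe any particular library), the same operations map naturally onto MPI point-to-point calls:

    /* Hypothetical mapping of the slides' primitives onto MPI; run with at least 2 processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* SEND(destination, message): blocking send to process 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* RECEIVE_FROM_ANY(message): block until a message from any sender arrives */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }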
3. Parallel Matrix Multiplication
- Given two N x N matrices A and B
- Compute C = A x B
- C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + ... + A[i,N]*B[N,j]
4. Sequential Matrix Multiplication
    for (i = 1; i <= N; i++)
      for (j = 1; j <= N; j++) {
        C[i,j] = 0;
        for (k = 1; k <= N; k++)
          C[i,j] += A[i,k] * B[k,j];
      }
- The order of the operations is over-specified
- Everything can be computed in parallel
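A directly compilable version of the same loop nest (0-based indexing, a fixed size, and the test data here are illustrative choices, not from the slides):

    #include <stdio.h>

    #define N 4

    /* Plain triple loop: C = A x B for N x N matrices. */
    void matmul(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;          /* arbitrary test data */
                B[i][j] = (i == j);       /* identity matrix     */
            }
        matmul(A, B, C);
        printf("C[1][2] = %g\n", C[1][2]); /* equals A[1][2] = 3, since B is the identity */
        return 0;
    }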
5. Parallel Algorithm 1
- Each processor computes 1 element of C
- Requires N^2 processors
- Each processor needs 1 row of A and 1 column of B as input
6. Structure
- Master distributes the work and receives the results
- Slaves get work and execute it
- How to start up master/slave processes depends on the Operating System (not discussed here)
(Figure: the master sends A[i,*] and B[*,j] to each of the N^2 slaves and receives the elements C[1,1] .. C[N,N] back.)
7. Parallel Algorithm 1
- Master (processor 0):
    for (i = 1; i <= N; i++)
      for (j = 1; j <= N; j++)
        SEND(p, A[i,*], B[*,j], i, j);   /* p = slave assigned to element (i,j) */
    for (x = 1; x <= N*N; x++) {
      RECEIVE_FROM_ANY(result, i, j);
      C[i,j] = result;
    }
- Slaves:
    int Aix[N], Bxj[N], Cij;
    RECEIVE(0, Aix, Bxj, i, j);
    Cij = 0;
    for (k = 1; k <= N; k++)
      Cij += Aix[k] * Bxj[k];
    SEND(0, Cij, i, j);
8. Efficiency
- Each processor needs O(N) communication to do O(N) computations
  - Communication: 2N + 1 integers per job (row of A, column of B, one result)
  - Computation per processor: N multiplications/additions
- Exact communication/computation costs depend on the network and CPU
- Still, this algorithm is inefficient for any existing machine
- Need to improve the communication/computation ratio
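As an illustration with made-up numbers: for N = 1000, each job ships about 2N + 1, roughly 2000, integers over the network yet performs only N = 1000 multiply-add steps, so unless transferring an integer is far cheaper than a multiply-add (on typical networks it is far more expensive), the processors spend most of their time communicating.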
9. Parallel Algorithm 2
- Each processor computes 1 row (N elements) of C
- Requires N processors
- Needs the entire B matrix and 1 row of A as input
10. Structure
(Figure: the master sends A[i,*] and the whole of B to each of the N slaves and receives the rows C[1,*] .. C[N,*] back.)
11. Parallel Algorithm 2
- Master (processor 0):
    for (i = 1; i <= N; i++)
      SEND(i, A[i,*], B[*,*], i);
    for (x = 1; x <= N; x++) {
      RECEIVE_FROM_ANY(result, i);
      C[i,*] = result;
    }
- Slaves:
    int Aix[N], B[N,N], C[N];
    RECEIVE(0, Aix, B, i);
    for (j = 1; j <= N; j++) {
      C[j] = 0;
      for (k = 1; k <= N; k++)
        C[j] += Aix[k] * B[k,j];
    }
    SEND(0, C, i);
12. Problem: need larger granularity
- Each processor now needs O(N^2) communication and O(N^2) computation
- Still inefficient
- Assumption: N >> P (i.e., we solve a large problem)
- Assign many rows to each processor
13. Parallel Algorithm 3
- Each processor computes N/P rows of C
- Needs the entire B matrix and N/P rows of A as input
- Each processor now needs O(N^2) communication and O(N^3/P) computation
14. Parallel Algorithm 3
- Master (processor 0):
    int result[N/nprocs, N];
    int inc = N/nprocs;                  /* number of rows per CPU */
    int lb = 1;
    for (i = 1; i <= nprocs; i++) {
      SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1);
      lb += inc;
    }
    for (x = 1; x <= nprocs; x++) {
      RECEIVE_FROM_ANY(result, lb);
      for (i = 1; i <= N/nprocs; i++)
        C[lb+i-1, *] = result[i, *];
    }
- Slaves:
    int A[N/nprocs, N], B[N,N], C[N/nprocs, N];
    RECEIVE(0, A, B, lb, ub);            /* my rows lb..ub are stored locally */
    for (i = lb; i <= ub; i++)
      for (j = 1; j <= N; j++) {
        C[i,j] = 0;
        for (k = 1; k <= N; k++)
          C[i,j] += A[i,k] * B[k,j];
      }
    SEND(0, C[*,*], lb);
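A compilable sketch of the same idea, under the assumption that MPI collectives are used instead of the slides' explicit SEND/RECEIVE loops (MPI_Scatter for the rows of A, MPI_Bcast for B, MPI_Gather for C); N is assumed to be divisible by the number of processes:

    #include <mpi.h>
    #include <stdlib.h>

    #define N 512   /* illustrative size; assumed divisible by the number of processes */

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int rows = N / nprocs;                       /* rows of A and C per process */
        double *A = NULL, *C = NULL;
        double *B     = malloc((size_t)N * N * sizeof(double));
        double *Apart = malloc((size_t)rows * N * sizeof(double));
        double *Cpart = malloc((size_t)rows * N * sizeof(double));

        if (rank == 0) {                             /* master owns the full matrices */
            A = malloc((size_t)N * N * sizeof(double));
            C = malloc((size_t)N * N * sizeof(double));
            for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 1.0; }  /* test data */
        }

        /* Distribute N/P rows of A to each process, and all of B to everyone. */
        MPI_Scatter(A, rows * N, MPI_DOUBLE, Apart, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Each process computes its own block of rows of C. */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += Apart[i * N + k] * B[k * N + j];
                Cpart[i * N + j] = sum;
            }

        /* Collect the row blocks of C back on the master. */
        MPI_Gather(Cpart, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        free(B); free(Apart); free(Cpart);
        if (rank == 0) { free(A); free(C); }
        MPI_Finalize();
        return 0;
    }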
15. Comparison

    Algorithm | Parallelism (jobs) | Communication per job | Computation per job | Ratio (comp/comm)
    ----------|--------------------|-----------------------|---------------------|------------------
        1     |        N^2         |       N + N + 1       |          N          |       O(1)
        2     |         N          |      N + N^2 + N      |         N^2         |       O(1)
        3     |         P          | N^2/P + N^2 + N^2/P   |        N^3/P        |      O(N/P)

- If N >> P, algorithm 3 will have low communication overhead
- Its grain size is high
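To put the ratio in perspective with illustrative numbers: for N = 1024 and P = 16, algorithm 3 gives a computation/communication ratio of about N/P = 64, i.e., each processor performs roughly 64 multiply-adds for every matrix element it sends or receives, which is why its grain size is said to be high.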
16. Example speedup graph (figure)
17. Discussion
- Matrix multiplication is trivial to parallelize
- Getting good performance is a problem
- Need the right grain size
- Need a large input problem
18. Successive Overrelaxation (SOR)
- Iterative method for solving Laplace equations
- Repeatedly updates the elements of a grid
    float G[1:N, 1:M], Gnew[1:N, 1:M];
    for (step = 0; step < NSTEPS; step++) {
      for (i = 2; i < N; i++)      /* update grid */
        for (j = 2; j < M; j++)
          Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
      G = Gnew;
    }
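The slides leave the update function f unspecified. A common concrete choice (an assumption here) is the 4-neighbour average used for Laplace's equation; with that choice the loop nest above compiles as:

    #include <string.h>

    #define N 64
    #define M 64
    #define NSTEPS 100

    /* One possible update function (assumption: Laplace equation, 4-point average;
       the centre value is unused in this particular variant). */
    static double f(double c, double up, double down, double left, double right) {
        (void)c;
        return 0.25 * (up + down + left + right);
    }

    void sor(double G[N + 1][M + 1]) {
        static double Gnew[N + 1][M + 1];
        for (int step = 0; step < NSTEPS; step++) {
            memcpy(Gnew, G, sizeof(Gnew));            /* keep boundary rows/columns */
            for (int i = 2; i < N; i++)               /* interior points only */
                for (int j = 2; j < M; j++)
                    Gnew[i][j] = f(G[i][j], G[i - 1][j], G[i + 1][j],
                                   G[i][j - 1], G[i][j + 1]);
            memcpy(G, Gnew, sizeof(Gnew));            /* G = Gnew */
        }
    }

    int main(void) {
        static double G[N + 1][M + 1] = {0};
        for (int j = 0; j <= M; j++) G[1][j] = 100.0; /* hot top boundary (test data) */
        sor(G);
        return 0;
    }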
19-20. SOR example (figures)
21. Parallelizing SOR
- Domain decomposition on the grid
- Each processor owns N/P rows
- Need communication between neighbors to exchange elements at processor boundaries
22-23. SOR example: partitioning (figures)
24. Communication scheme
- Each CPU communicates with its left and right neighbors (if they exist)
25. Parallel SOR
    float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
    for (step = 0; step < NSTEPS; step++) {
      SEND(cpuid-1, G[lb]);        /* send first row left  */
      SEND(cpuid+1, G[ub]);        /* send last row right  */
      RECEIVE(cpuid-1, G[lb-1]);   /* receive from left    */
      RECEIVE(cpuid+1, G[ub+1]);   /* receive from right   */
      for (i = lb; i <= ub; i++)   /* update my rows       */
        for (j = 2; j < M; j++)
          Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
      G = Gnew;
    }
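As a compilable illustration of the boundary exchange (again assuming MPI, which the slides do not mandate), MPI_Sendrecv combined with MPI_PROC_NULL handles the "if existing" neighbor check implicitly, because sends and receives addressed to MPI_PROC_NULL simply do nothing:

    #include <mpi.h>

    #define M 128   /* illustrative number of columns per row */

    /* Exchange boundary rows with the left and right neighbours.
       rows[0] and rows[nlocal+1] are ghost rows, rows[1..nlocal] are owned. */
    void exchange(double (*rows)[M], int nlocal, int rank, int nprocs) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my first owned row left, receive my right ghost row from the right. */
        MPI_Sendrecv(rows[1],          M, MPI_DOUBLE, left,  0,
                     rows[nlocal + 1], M, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Send my last owned row right, receive my left ghost row from the left. */
        MPI_Sendrecv(rows[nlocal],     M, MPI_DOUBLE, right, 1,
                     rows[0],          M, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Using the combined send/receive also avoids the deadlock that can occur if every CPU first does a blocking SEND to its neighbor and only then posts its RECEIVE.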
26. Performance of SOR
- Communication and computation during each iteration:
  - Each processor sends/receives 2 messages with M reals
  - Each processor computes N/P x M updates
- The algorithm will have good performance if
  - The problem size is large: N >> P
  - Message exchanges can be done in parallel
27. All-pairs Shortest Paths (ASP)
- Given a graph G with a distance table C:
  - C[i,j] = length of the direct path from node i to node j
- Compute the length of the shortest path between any two nodes in G
28. Floyd's Sequential Algorithm
- Basic step:
    for (k = 1; k <= N; k++)
      for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
          C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
- During iteration k, you can visit only intermediate nodes in the set {1 .. k}
- k = 0 => initial problem, no intermediate nodes
- k = N => final solution
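A self-contained version of the basic step (0-based indices and a large INF sentinel; the test graph values are made up for illustration):

    #include <stdio.h>

    #define N 4
    #define INF 1000000   /* "no direct edge"; large, but small enough that sums do not overflow */

    int main(void) {
        /* Illustrative distance table: C[i][j] = length of the direct path from i to j. */
        int C[N][N] = {
            {0,   5,   INF, 10 },
            {INF, 0,   3,   INF},
            {INF, INF, 0,   1  },
            {INF, INF, INF, 0  },
        };

        /* Floyd's algorithm: after iteration k, C[i][j] is the shortest path
           using only intermediate nodes from the set {0 .. k}. */
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (C[i][k] + C[k][j] < C[i][j])
                        C[i][j] = C[i][k] + C[k][j];

        printf("shortest 0 -> 3: %d\n", C[0][3]);   /* 0 -> 1 -> 2 -> 3 = 9 */
        return 0;
    }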
29. Parallelizing ASP
- Distribute the rows of C among the P processors
- During iteration k, each processor executes
  - C[i,j] = MIN(C[i,j], C[i,k] + C[k,j])
  - on its own rows i, so it needs these rows and row k
- Before iteration k, the processor owning row k sends it to all the others
(Figure: the matrix C partitioned row-wise over the processors; row k is broadcast, and each processor updates its own rows i using its column k and the received row k.)
32. Parallel ASP Algorithm
    int lb, ub;                /* lower/upper bound for this CPU */
    int rowK[N], C[lb:ub, N];  /* pivot row; my rows of the matrix */
    for (k = 1; k <= N; k++) {
      if (k >= lb && k <= ub) {            /* do I have row k? */
        rowK = C[k,*];
        for (p = 1; p <= nproc; p++)       /* broadcast row */
          if (p != myprocid) SEND(p, rowK);
      } else {
        RECEIVE_FROM_ANY(rowK);            /* receive row */
      }
      for (i = lb; i <= ub; i++)           /* update my rows */
        for (j = 1; j <= N; j++)
          C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
    }
35. Performance Analysis ASP
- Per iteration:
  - 1 CPU sends P-1 messages with N integers
  - Each CPU does N/P x N comparisons
- Communication/computation ratio is small if N >> P
36. ... but is the algorithm correct?
37. Parallel ASP Algorithm (revisited)
- Look again at the communication in the code of slide 32:
  - the owner of row k does:   if (p != myprocid) SEND(p, rowK);
  - every other CPU does:      RECEIVE_FROM_ANY(rowK);
- The receiver cannot tell which iteration a received row belongs to
38. Non-FIFO Message Ordering
- Row 2 may be received before row 1
39. FIFO Ordering
- Row 5 may be received before row 4 (the rows come from different senders)
40. Correctness
- Problems
  - Asynchronous non-FIFO SEND
  - Messages from different senders may overtake each other
- Solutions
  - Synchronous SEND (less efficient)
  - Barrier at the end of the outer loop (extra communication)
  - Order incoming messages (requires buffering)
  - RECEIVE(cpu, msg) (more complicated)
  - (A collective-broadcast variant is sketched below)
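One further way to sidestep the ordering problem, assuming an MPI-style library (which the slides do not prescribe), is a collective broadcast: every process calls MPI_Bcast once per iteration k, so the pivot row is matched by iteration rather than by arrival order. A minimal sketch of the loop structure, where owner_of(k) is an assumed helper returning the rank that stores global row k:

    #include <mpi.h>
    #include <stdlib.h>

    /* C holds this process's rows lb..ub-1 (row-major, N columns). */
    void asp(int *C, int lb, int ub, int N, int rank, int (*owner_of)(int)) {
        int *rowK = malloc((size_t)N * sizeof(int));

        for (int k = 0; k < N; k++) {
            int root = owner_of(k);
            if (rank == root)                       /* I own row k: copy it into the buffer */
                for (int j = 0; j < N; j++) rowK[j] = C[(k - lb) * N + j];

            /* All processes take part in the same broadcast for iteration k,
               so pivot rows can never be confused or overtake each other. */
            MPI_Bcast(rowK, N, MPI_INT, root, MPI_COMM_WORLD);

            for (int i = lb; i < ub; i++)           /* update my rows */
                for (int j = 0; j < N; j++) {
                    int via_k = C[(i - lb) * N + k] + rowK[j];
                    if (via_k < C[(i - lb) * N + j])
                        C[(i - lb) * N + j] = via_k;
                }
        }
        free(rowK);
    }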
46. Linear equations
- Linear equations:
  - a1,1 x1 + a1,2 x2 + ... + a1,n xn = b1
  - ...
  - an,1 x1 + an,2 x2 + ... + an,n xn = bn
- Matrix notation: Ax = b
- Problem: compute x, given A and b
- Linear equations have many important applications
- Practical applications need huge sets of equations
47. Solving a linear equation
- Two phases:
  - Upper-triangularization -> Ux = y
  - Back-substitution -> x
- Most computation time is in the upper-triangularization
- Upper-triangular matrix:
  - U[i,i] = 1
  - U[i,j] = 0 if i > j
1 . . . . . . .
0 1 . . . . . .
0 0 1 . . . . .
0 0 0 1 . . . .
0 0 0 0 1 . . .
0 0 0 0 0 1 . .
0 0 0 0 0 0 1 .
0 0 0 0 0 0 0 1
48. Sequential Gaussian elimination
    for (k = 1; k <= N; k++) {
      for (j = k+1; j <= N; j++)     /* normalize row k */
        A[k,j] = A[k,j] / A[k,k];
      y[k] = b[k] / A[k,k];
      A[k,k] = 1;
      for (i = k+1; i <= N; i++) {   /* subtract row k from the rows below it */
        for (j = k+1; j <= N; j++)
          A[i,j] = A[i,j] - A[i,k] * A[k,j];
        b[i] = b[i] - A[i,k] * y[k];
        A[i,k] = 0;
      }
    }
- Converts Ax = b into Ux = y
- The sequential algorithm uses 2/3 N^3 operations
(Figure: in iteration k, row k of A is normalized, then multiples of it are subtracted from the rows below, zeroing column k of A and updating y.)
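A compilable version of the same elimination, plus the back-substitution phase (an illustrative sketch: 0-based indexing, no pivoting, so it assumes the diagonal elements never become zero; the system is made-up test data):

    #include <stdio.h>

    #define N 3

    int main(void) {
        /* Illustrative system Ax = b. */
        double A[N][N] = { {2, 1, 1}, {4, 3, 3}, {8, 7, 9} };
        double b[N]    = { 5, 11, 27 };
        double y[N], x[N];

        /* Forward elimination: convert Ax = b into Ux = y with unit diagonal. */
        for (int k = 0; k < N; k++) {
            for (int j = k + 1; j < N; j++)          /* normalize row k */
                A[k][j] /= A[k][k];
            y[k] = b[k] / A[k][k];
            A[k][k] = 1.0;
            for (int i = k + 1; i < N; i++) {        /* eliminate column k below row k */
                for (int j = k + 1; j < N; j++)
                    A[i][j] -= A[i][k] * A[k][j];
                b[i] -= A[i][k] * y[k];
                A[i][k] = 0.0;
            }
        }

        /* Back-substitution: solve Ux = y. */
        for (int i = N - 1; i >= 0; i--) {
            x[i] = y[i];
            for (int j = i + 1; j < N; j++)
                x[i] -= A[i][j] * x[j];
        }
        printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);   /* expected: (2, -1, 2) */
        return 0;
    }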
49. Parallelizing Gaussian elimination
- Row-wise partitioning scheme
  - Each CPU gets one row (striping)
  - Execute one (outer-loop) iteration at a time
- Communication requirement:
  - During iteration k, CPUs Pk+1 .. Pn-1 need part of row k
  - This row is stored on CPU Pk
  - -> need partial broadcast (multicast)
50. Communication (figure)
51. Performance problems
- Communication overhead (multicast)
- Load imbalance
  - CPUs P0 .. Pk are idle during iteration k
  - Bad load balance means bad speedups, as some CPUs have too much work and others too little
- In general, the number of CPUs is less than n
  - Choice between block-striped and cyclic-striped distribution
  - Block-striped distribution has a high load imbalance
  - Cyclic-striped distribution has less load imbalance
52. Block-striped distribution
- CPU 0 gets the first N/2 rows
- CPU 1 gets the last N/2 rows
- CPU 0 has much less work to do
- CPU 1 becomes the bottleneck
53. Cyclic-striped distribution
- CPU 0 gets the odd rows (1, 3, ...)
- CPU 1 gets the even rows (2, 4, ...)
- CPU 0 and CPU 1 have more or less the same amount of work
54. A Search Problem
- Given an array A[1..N] and an item x, check if x is present in A
    int present = false;
    for (i = 1; !present && i <= N; i++)
      if (A[i] == x) present = true;
- We don't know in advance which data we need to access
55. Parallel Search on 2 CPUs
    int lb, ub;
    int A[lb..ub];
    for (i = lb; i <= ub; i++) {
      if (A[i] == x) {
        print("Found item");
        SEND(1-cpuid);                          /* send the other CPU an empty message */
        exit();
      }
      if (NONBLOCKING_RECEIVE(1-cpuid)) exit(); /* check for a message from the other CPU */
    }
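A compilable sketch of the same two-CPU scheme, assuming MPI (which the slides do not prescribe): the slides' NONBLOCKING_RECEIVE maps onto a pre-posted MPI_Irecv polled with MPI_Test, and each CPU sends exactly one result message so no message is left unreceived; the array contents, x, and the tag name are illustrative choices:

    #include <mpi.h>
    #include <stdio.h>

    #define N 100
    #define TAG_RESULT 1

    /* Run with exactly 2 processes. */
    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int A[N], x = 42;
        for (int i = 0; i < N; i++) A[i] = i;     /* made-up test data */

        int lb = (rank == 0) ? 0 : N / 2;         /* each CPU searches its own half */
        int ub = (rank == 0) ? N / 2 : N;
        int other = 1 - rank, found = 0, other_found = 0;

        /* Post the receive for the other CPU's single result message up front;
           testing it plays the role of the slides' NONBLOCKING_RECEIVE. */
        MPI_Request req;
        MPI_Irecv(&other_found, 1, MPI_INT, other, TAG_RESULT, MPI_COMM_WORLD, &req);

        for (int i = lb; i < ub && !found; i++) {
            if (A[i] == x) {
                found = 1;
                printf("CPU %d found the item at index %d\n", rank, i);
            } else {
                int flag;
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
                if (flag && other_found) break;   /* the other CPU found it: stop */
            }
        }

        /* Each CPU sends exactly one result message, so nothing stays unmatched. */
        MPI_Send(&found, 1, MPI_INT, other, TAG_RESULT, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }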
56. Performance Analysis
- How much faster is the parallel program than the sequential program for N = 100?
  1. if x not present => factor 2
  2. if x present in A[1 .. 50] => factor 1
  3. if A[51] = x => factor 51 (sequential checks 51 elements; CPU 1 finds it in its first step)
  4. if A[75] = x => factor 3 (sequential checks 75 elements; CPU 1 finds it after 25)
- In case 2 the parallel program does more work than the sequential program => search overhead
- In cases 3 and 4 the parallel program does less work => negative search overhead
63. Discussion
- Several kinds of performance overhead
  - Communication overhead: the communication/computation ratio must be low
  - Load imbalance: all processors must do the same amount of work
  - Search overhead: avoid useless (speculative) computations
- Making algorithms correct is nontrivial
  - Message ordering
69. Designing Parallel Algorithms
- Source: Designing and Building Parallel Programs (Ian Foster, 1995)
  - (available on-line at http://www.mcs.anl.gov/dbpp)
- Partitioning
- Communication
- Agglomeration
- Mapping
70. Figure 2.1 from Foster's book
71. Partitioning
- Domain decomposition
  - Partition the data
  - Partition computations on the data (owner-computes rule)
- Functional decomposition
  - Divide computations into subtasks
  - E.g., search algorithms
72. Communication
- Analyze data dependencies between partitions
- Use communication to transfer data
- Many forms of communication, e.g.
  - Local communication with neighbors (SOR)
  - Global communication with all processors (ASP)
  - Synchronous (blocking) communication
  - Asynchronous (non-blocking) communication
73. Agglomeration
- Reduce communication overhead by
  - increasing granularity
  - improving locality
74. Mapping
- On which processor should each subtask execute?
- Put concurrent tasks on different CPUs
- Put frequently communicating tasks on the same CPU?
- Avoid load imbalance
75. Summary
- Hardware and software models
- Example applications
  - Matrix multiplication - trivial parallelism (independent tasks)
  - Successive overrelaxation - neighbor communication
  - All-pairs shortest paths - broadcast communication
  - Linear equations - load-balancing problem
  - Search problem - search overhead
- Designing parallel algorithms