Title: Introduction to Parallel Programming
1Introduction to Parallel Programming
- Language notation: message passing
- Distributed-memory machine
  - (e.g., workstations on a network)
- 5 parallel algorithms of increasing complexity:
  - Matrix multiplication
  - Successive overrelaxation
  - All-pairs shortest paths
  - Linear equations
  - Search problem
2Message Passing
- SEND (destination, message)
  - blocking: wait until message has arrived (like a fax)
  - non-blocking: continue immediately (like a mailbox)
- RECEIVE (source, message)
- RECEIVE_FROM_ANY (message)
  - blocking: wait until message is available
  - non-blocking: test if message is available
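The difference between a blocking and a non-blocking receive can be tried out with in-process queues. This is only a minimal sketch: the `Mailbox` class and its method names are hypothetical helpers invented for illustration, and Python's `queue.Queue` stands in for the network.

```python
import queue

class Mailbox:
    """One incoming-message queue per processor (hypothetical helper)."""
    def __init__(self, nprocs):
        self.boxes = [queue.Queue() for _ in range(nprocs)]

    def send(self, dest, message):
        # Non-blocking SEND: deposit the message and continue immediately.
        self.boxes[dest].put(message)

    def receive_from_any(self, me, blocking=True):
        # Blocking: wait until a message is available.
        # Non-blocking: return None at once if nothing has arrived.
        try:
            return self.boxes[me].get(block=blocking)
        except queue.Empty:
            return None

net = Mailbox(nprocs=2)
empty_poll = net.receive_from_any(1, blocking=False)  # nothing sent yet
net.send(1, ("hello", 42))
received = net.receive_from_any(1)                     # message is waiting
```

The non-blocking poll returns `None` instead of waiting, which is the "test if message is available" behaviour listed above.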
3Syntax
- Use pseudo-code with C-like syntax
- Use indentation instead of { .. } to indicate block structure
- Arrays can have user-defined index ranges
  - Default: start at 1
  - int A[10:100] runs from 10 to 100
  - int A[N] runs from 1 to N
- Use array slices (sub-arrays)
  - A[i..j] = elements A[i] to A[j]
  - A[i, *] = elements A[i, 1] to A[i, N], i.e. row i of matrix A
  - A[*, k] = elements A[1, k] to A[N, k], i.e. column k of A
4Parallel Matrix Multiplication
- Given two N x N matrices A and B
- Compute C = A x B
- C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + .. + A[i,N]*B[N,j]
[Figure: matrices A, B and C]
5Sequential Matrix Multiplication
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            C[i,j] = 0;
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j];
- The order of the operations is overspecified
- Everything can be computed in parallel
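The sequential loop nest can be written as a small runnable sketch. Python lists are 0-based, so the indices run from 0 to N-1 rather than 1 to N; the function name `matmul` is chosen here for illustration.

```python
def matmul(A, B):
    """Multiply two N x N matrices stored as lists of rows."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Each C[i][j] depends only on row i of A and column j of B,
            # so the (i, j) iterations are independent of each other.
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Because no iteration of the two outer loops reads a value written by another, any execution order gives the same C, which is what makes the parallel versions possible.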
6Parallel Algorithm 1
- Each processor computes 1 element of C
- Requires N^2 processors
- Each processor needs 1 row of A and 1 column of B
7Structure
- Master distributes the work and receives the results
- Slaves get work and execute it
- Slaves are numbered consecutively from 1 to P
- How to start up master/slave processes depends on the operating system (not discussed here)
8Parallel Algorithm 1
Master (processor 0):
    int proc = 1;
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            SEND(proc, A[i,*], B[*,j], i, j);
            proc++;
    for (x = 1; x <= N*N; x++)
        RECEIVE_FROM_ANY(result, i, j);
        C[i,j] = result;

Slaves (processors 1 .. P):
    int Aix[N], Bxj[N], Cij;
    RECEIVE(0, Aix, Bxj, i, j);
    Cij = 0;
    for (k = 1; k <= N; k++)
        Cij += Aix[k] * Bxj[k];
    SEND(0, Cij, i, j);
9Efficiency (complexity analysis)
- Each processor needs O(N) communication to do O(N) computations
  - Communication: 2N+1 integers = O(N)
  - Computation per processor: N multiplications/additions = O(N)
- Exact communication/computation costs depend on network and CPU
- Still: this algorithm is inefficient for any existing machine
- Need to improve communication/computation ratio
10Parallel Algorithm 2
- Each processor computes 1 row (N elements) of C
- Requires N processors
- Need entire B matrix and 1 row of A as input
11Structure
[Figure: the master sends A[1,*] .. A[N,*] and B[*,*] to slaves 1 .. N and receives C[1,*] .. C[N,*]]
12Parallel Algorithm 2
Master (processor 0):
    for (i = 1; i <= N; i++)
        SEND(i, A[i,*], B[*,*], i);
    for (x = 1; x <= N; x++)
        RECEIVE_FROM_ANY(result, i);
        C[i,*] = result[*];

Slaves:
    int Aix[N], B[N,N], C[N];
    RECEIVE(0, Aix, B, i);
    for (j = 1; j <= N; j++)
        C[j] = 0;
        for (k = 1; k <= N; k++)
            C[j] += Aix[k] * B[k,j];
    SEND(0, C[*], i);
13Problem: need larger granularity
- Each processor now needs O(N^2) communication and O(N^2) computation -> still inefficient
- Assumption: N >> P (i.e. we solve a large problem)
- Assign many rows to each processor
14Parallel Algorithm 3
- Each processor computes N/P rows of C
- Need entire B matrix and N/P rows of A as input
- Each processor now needs O(N^2) communication and O(N^3 / P) computation
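To see how the ratio changes across the three algorithms, the counts can be put into a rough model. The sketch below is an assumption-laden estimate, not a measurement: it counts integers sent or received by one slave and multiply/add steps, ignores the few index values in each message, and assumes P divides N.

```python
def comm_comp_ratio(n, p, algorithm):
    """Integers communicated per arithmetic step for one slave (rough model)."""
    if algorithm == 1:      # one element of C per processor
        comm = 2 * n + 1              # a row of A, a column of B, one result
        comp = n                      # N multiply/add steps
    elif algorithm == 2:    # one row of C per processor
        comm = n * n + 2 * n          # all of B, a row of A, a row of C
        comp = n * n
    else:                   # N/P rows of C per processor
        rows = n // p
        comm = n * n + 2 * rows * n   # all of B, N/P rows of A and of C
        comp = rows * n * n
    return comm / comp

ratios = [comm_comp_ratio(1000, 10, a) for a in (1, 2, 3)]
```

Under this model the ratio stays near a constant for algorithms 1 and 2 and drops to roughly P/N for algorithm 3, which is why N >> P matters.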
15Parallel Algorithm 3 (master)
Master (processor 0):
    int result[N / P, N];
    int inc = N / P;    /* number of rows per cpu */
    int lb = 1;         /* lb = lower bound */
    for (i = 1; i <= P; i++)
        SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1);
        lb += inc;
    for (x = 1; x <= P; x++)
        RECEIVE_FROM_ANY(result, lb);
        for (i = 1; i <= N / P; i++)
            C[lb+i-1, *] = result[i, *];
16Parallel Algorithm 3 (slave)
Slaves:
    int A[N / P, N], B[N,N], C[N / P, N];
    RECEIVE(0, A, B, lb, ub);
    for (i = lb; i <= ub; i++)
        for (j = 1; j <= N; j++)
            C[i,j] = 0;
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j];
    SEND(0, C[*,*], lb);
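The master/slave structure of algorithm 3 can be imitated on one machine with threads and queues. This is a sketch of the communication pattern only: Python threads do not give real parallel speedup, the queues stand in for SEND/RECEIVE, indices are 0-based, and P is assumed to divide N. The name `parallel_matmul` is chosen here for illustration.

```python
import queue
import threading

def parallel_matmul(A, B, P):
    """Multiply N x N lists-of-lists A and B using P slave threads."""
    N = len(A)
    inc = N // P                       # rows per slave; assumes P divides N
    to_slave = [queue.Queue() for _ in range(P)]
    to_master = queue.Queue()          # shared queue models RECEIVE_FROM_ANY

    def slave(pid):
        rows, b, lb = to_slave[pid].get()            # RECEIVE(0, A, B, lb, ub)
        block = [[sum(r[k] * b[k][j] for k in range(N)) for j in range(N)]
                 for r in rows]
        to_master.put((block, lb))                   # SEND(0, C, lb)

    workers = [threading.Thread(target=slave, args=(p,)) for p in range(P)]
    for w in workers:
        w.start()

    lb = 0                                           # 0-based lower bound
    for p in range(P):                               # distribute the work
        to_slave[p].put((A[lb:lb + inc], B, lb))
        lb += inc

    C = [None] * N
    for _ in range(P):                               # collect in any order
        block, lb = to_master.get()
        C[lb:lb + inc] = block
    for w in workers:
        w.join()
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
I4 = [[int(i == j) for j in range(4)] for i in range(4)]
C = parallel_matmul(A, I4, P=2)
```

Each result message carries its lower bound `lb`, so the master can place blocks correctly even when slaves finish in a different order.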
17Comparison
- If N >> P, algorithm 3 will have low communication overhead
- Its grain size is high
18Example speedup graph
19Discussion
- Matrix multiplication is trivial to parallelize
- Getting good performance is a problem
- Need right grain size
- Need large input problem
20Successive Overrelaxation (SOR)
- Iterative method for solving Laplace equations
- Repeatedly updates elements of a grid
21Successive Overrelaxation (SOR)
    float G[1:N, 1:M], Gnew[1:N, 1:M];
    for (step = 0; step < NSTEPS; step++)
        for (i = 2; i < N; i++)    /* update grid */
            for (j = 2; j < M; j++)
                Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
        G = Gnew;
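A runnable version of this grid update might look as follows. The slides do not define f, so the sketch assumes the simplest choice, the average of the four neighbours; the function name `relax` and the 5 x 5 test grid are also illustrative assumptions.

```python
def f(center, up, down, left, right):
    # Assumed update rule: average of the four neighbours.
    return (up + down + left + right) / 4.0

def relax(G, nsteps):
    """Repeatedly update the interior of grid G; boundary values stay fixed."""
    n, m = len(G), len(G[0])
    for _ in range(nsteps):
        Gnew = [row[:] for row in G]           # copy keeps the boundary
        for i in range(1, n - 1):              # interior rows
            for j in range(1, m - 1):          # interior columns
                Gnew[i][j] = f(G[i][j], G[i-1][j], G[i+1][j],
                               G[i][j-1], G[i][j+1])
        G = Gnew
    return G

# 5 x 5 grid: top edge held at 100, everything else starts at 0.
grid = [[100.0] * 5] + [[0.0] * 5 for _ in range(4)]
result = relax(grid, nsteps=200)
```

With one edge held at 100 and the others at 0, the centre value settles at a quarter of the hot edge, which gives a quick sanity check on the update rule.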
22SOR example
23SOR example
24Parallelizing SOR
- Domain decomposition on the grid
- Each processor owns N/P rows
- Need communication between neighbors to exchange
elements at processor boundaries
25SOR example partitioning
26SOR example partitioning
27Communication scheme
- Each CPU communicates with its left and right neighbors (if they exist)
28Parallel SOR
    float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
    for (step = 0; step < NSTEPS; step++)
        SEND(cpuid-1, G[lb]);        /* send 1st row left */
        SEND(cpuid+1, G[ub]);        /* send last row right */
        RECEIVE(cpuid-1, G[lb-1]);   /* receive from left */
        RECEIVE(cpuid+1, G[ub+1]);   /* receive from right */
        for (i = lb; i <= ub; i++)   /* update my rows */
            for (j = 2; j < M; j++)
                Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
        G = Gnew;
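The row exchange can be checked on a single machine by simulating the P CPUs one after another. In this sketch each block holds a CPU's own rows plus one extra row above and below for the neighbours' data; list copies stand in for SEND/RECEIVE, f is assumed to be the neighbour average, and P is assumed to divide the number of interior rows. All names are illustrative.

```python
def f(center, up, down, left, right):
    # Assumed update rule (not given in the slides): neighbour average.
    return (up + down + left + right) / 4.0

def step_block(block):
    """Update rows 1..len-2 of a block; its first and last rows are read-only."""
    new = [row[:] for row in block]
    for i in range(1, len(block) - 1):
        for j in range(1, len(block[0]) - 1):
            new[i][j] = f(block[i][j], block[i-1][j], block[i+1][j],
                          block[i][j-1], block[i][j+1])
    return new

def parallel_sor(G, P, nsteps):
    rows = (len(G) - 2) // P            # interior rows per CPU
    # CPU p owns rows p*rows+1 .. (p+1)*rows, plus one extra row each side.
    blocks = [[r[:] for r in G[p*rows:(p+1)*rows + 2]] for p in range(P)]
    for _ in range(nsteps):
        for p in range(P):              # exchange phase
            if p > 0:
                blocks[p][0] = blocks[p-1][-2][:]    # from left neighbour
            if p < P - 1:
                blocks[p][-1] = blocks[p+1][1][:]    # from right neighbour
        blocks = [step_block(b) for b in blocks]     # compute phase
    out = [G[0][:]]
    for b in blocks:
        out.extend(b[1:-1])             # keep only the rows each CPU owns
    out.append(G[-1][:])
    return out

grid = [[100.0] * 6] + [[0.0] * 6 for _ in range(5)]   # 6 x 6, hot top edge
sequential = grid
for _ in range(20):
    sequential = step_block(sequential)   # whole grid treated as one block
parallel = parallel_sor(grid, P=2, nsteps=20)
```

Because every CPU refreshes its neighbour rows before each update, the partitioned run produces exactly the same grid as the sequential one.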
29Performance of SOR
- Communication and computation during each iteration:
  - Each CPU sends/receives 2 messages with M reals
  - Each CPU computes N/P * M updates
- The algorithm will have good performance if
  - Problem size is large: N >> P
  - Message exchanges can be done in parallel
30All-pairs Shortest Paths (ASP)
- Given a graph G with a distance table C:
  - C[i,j] = length of direct path from node i to node j
- Compute length of shortest path between any two nodes in G
31Floyd's Sequential Algorithm
- Basic step:
    for (k = 1; k <= N; k++)
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
- During iteration k, you can visit only intermediate nodes in the set {1 .. k}
  - k=0 => initial problem, no intermediate nodes
  - k=N => final solution
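Floyd's algorithm is short enough to run directly. The sketch below uses 0-based indices and represents a missing edge by infinity; the 4-node example graph is made up for illustration.

```python
INF = float("inf")

def floyd(C):
    """Return all-pairs shortest path lengths for distance table C."""
    n = len(C)
    D = [row[:] for row in C]          # leave the input table untouched
    for k in range(n):                 # allow node k as an intermediate
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

C = [[0,   4,   INF, 10],
     [INF, 0,   3,   INF],
     [INF, INF, 0,   2],
     [1,   INF, INF, 0]]
dist = floyd(C)
```

In this graph the direct edge from node 0 to node 3 has length 10, but the path through nodes 1 and 2 has length 4 + 3 + 2 = 9, so `dist[0][3]` ends up as 9.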
32Parallelizing ASP
- Distribute rows of C among the P processors
- During iteration k, each processor executes
      C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
  on its own rows i, so it needs these rows and row k
- Before iteration k, the processor owning row k sends it to all the others
33[Figure: matrix C with rows i and k and columns j and k marked]
36Parallel ASP Algorithm
    int lb, ub;                 /* lower/upper bound for this CPU */
    int rowK[N], C[lb:ub, N];   /* pivot row; matrix */
    for (k = 1; k <= N; k++)
        if (k >= lb && k <= ub)     /* do I have it? */
            rowK = C[k,*];
            for (proc = 1; proc <= P; proc++)   /* broadcast row */
                if (proc != myprocid) SEND(proc, rowK);
        else
            RECEIVE_FROM_ANY(rowK);     /* receive row */
        for (i = lb; i <= ub; i++)      /* update my rows */
            for (j = 1; j <= N; j++)
                C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
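The row distribution and per-iteration broadcast can be simulated in a single process. This sketch assumes that every broadcast row arrives in iteration order (the ideal case), that P divides N, and that all C[k,k] are 0; a plain copy of the row stands in for the SEND/RECEIVE pair. The example graph is made up for illustration.

```python
INF = float("inf")

def parallel_asp(C, P):
    """Simulate row-distributed Floyd on P CPUs (single process)."""
    n = len(C)
    rows = n // P                                  # assumes P divides n
    # CPU p owns global rows p*rows .. (p+1)*rows - 1.
    local = [[r[:] for r in C[p*rows:(p+1)*rows]] for p in range(P)]
    for k in range(n):
        owner = k // rows
        row_k = local[owner][k - owner*rows][:]    # owner broadcasts row k
        for p in range(P):                         # each CPU: own rows only
            for row in local[p]:
                for j in range(n):
                    row[j] = min(row[j], row[k] + row_k[j])
    return [row for block in local for row in block]

C = [[0,   4,   INF, 10],
     [INF, 0,   3,   INF],
     [INF, INF, 0,   2],
     [1,   INF, INF, 0]]
dist = parallel_asp(C, P=2)
```

Each simulated CPU touches only its own rows and the broadcast copy of row k, which mirrors the data each real CPU would hold.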
37Performance Analysis ASP
- Per iteration:
  - 1 CPU sends P-1 messages with N integers
  - Each CPU does N/P x N comparisons
- Communication/computation ratio is small if N >> P
38... but, is the algorithm correct?
40Non-FIFO Message Ordering
- Row 2 may be received before row 1
41FIFO Ordering
- Row 5 may be received before row 4 (when they come from different senders)
47Correctness
- Problems:
  - Asynchronous non-FIFO SEND
  - Messages from different senders may overtake each other
- Solutions:
  - Synchronous SEND (less efficient)
  - Barrier at the end of outer loop (extra communication)
  - Order incoming messages (requires buffering)
  - RECEIVE (cpu, msg) (more complicated)
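The "order incoming messages" option could be sketched as a small reordering buffer. This assumes each broadcast message is tagged with its iteration number k, which the pseudo-code in these slides does not do; the class `OrderedRows` and its methods are hypothetical names.

```python
class OrderedRows:
    """Deliver pivot rows in iteration order, buffering early arrivals."""
    def __init__(self):
        self.pending = {}      # k -> row, for rows that arrived too early
        self.next_k = 0        # iteration whose row is needed next

    def arrive(self, k, row):
        """Record an incoming message tagged with its iteration number k."""
        self.pending[k] = row

    def ready(self):
        """Return the rows that can now be consumed, in order of k."""
        out = []
        while self.next_k in self.pending:
            out.append(self.pending.pop(self.next_k))
            self.next_k += 1
        return out

inbox = OrderedRows()
inbox.arrive(1, "row 1")        # overtook row 0, so it must wait
first = inbox.ready()
inbox.arrive(0, "row 0")
second = inbox.ready()
```

The early row is held back until its predecessor arrives, at the cost of the buffer space noted in the list above.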
49Linear equations
- Linear equations:
    a[1,1]*x[1] + a[1,2]*x[2] + ... + a[1,n]*x[n] = b[1]
    ...
    a[n,1]*x[1] + a[n,2]*x[2] + ... + a[n,n]*x[n] = b[n]
- Matrix notation: Ax = b
- Problem: compute x, given A and b
- Linear equations have many important applications
- Practical applications need huge sets of
equations
50Solving a linear equation
- Two phases:
  - Upper-triangularization -> U x = y
  - Back-substitution -> x
- Most computation time is in upper-triangularization
- Upper-triangular matrix:
  - U[i,i] = 1
  - U[i,j] = 0 if i > j
1 . . . . . . .
0 1 . . . . . .
0 0 1 . . . . .
0 0 0 1 . . . .
0 0 0 0 1 . . .
0 0 0 0 0 1 . .
0 0 0 0 0 0 1 .
0 0 0 0 0 0 0 1
51Sequential Gaussian elimination
- Converts Ax = b into Ux = y
- Sequential algorithm uses 2/3 N^3 operations
    for (k = 1; k <= N; k++)
        for (j = k+1; j <= N; j++)
            A[k,j] = A[k,j] / A[k,k];
        y[k] = b[k] / A[k,k];
        A[k,k] = 1;
        for (i = k+1; i <= N; i++)
            for (j = k+1; j <= N; j++)
                A[i,j] = A[i,j] - A[i,k] * A[k,j];
            b[i] = b[i] - A[i,k] * y[k];
            A[i,k] = 0;
[Figure: matrix A and vector y; after iteration k = 1 the first column of A is 1 followed by zeros]
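Both phases fit in one short function. The sketch below follows the elimination loop above and then adds back-substitution, which the slides name but do not spell out; it uses 0-based indices, does no pivoting, and therefore assumes every diagonal element is non-zero when it is reached. The 3 x 3 system is a made-up example.

```python
def solve(A, b):
    """Solve Ax = b by upper-triangularization and back-substitution.

    No pivoting: assumes every A[k][k] is non-zero when it is reached.
    """
    n = len(A)
    A = [row[:] for row in A]              # work on copies
    b = b[:]
    y = [0.0] * n
    for k in range(n):                     # upper-triangularization
        pivot = A[k][k]
        for j in range(k + 1, n):
            A[k][j] /= pivot
        y[k] = b[k] / pivot
        A[k][k] = 1.0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # back-substitution on Ux = y
        x[i] = y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
    return x

x = solve([[2.0, 1.0, -1.0],
          [-3.0, -1.0, 2.0],
          [-2.0, 1.0, 2.0]], [8.0, -11.0, -3.0])
```

For this system the solution is x = (2, 3, -1), which can be confirmed by substituting back into the three equations.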
52Parallelizing Gaussian elimination
- Row-wise partitioning scheme
  - Each CPU gets one row (striping)
  - Execute one (outer-loop) iteration at a time
- Communication requirement:
  - During iteration k, CPUs P(k+1) .. P(n-1) need part of row k
  - This row is stored on CPU P(k)
  - -> need partial broadcast (multicast)
53Communication
54Performance problems
- Communication overhead (multicast)
- Load imbalance
  - CPUs P(0) .. P(k) are idle during iteration k
  - Bad load balance means bad speedups, as some CPUs have too much work
- In general, number of CPUs is less than n
  - Choice between block-striped and cyclic-striped distribution
  - Block-striped distribution has high load imbalance
  - Cyclic-striped distribution has less load imbalance
55Block-striped distribution
- CPU 0 gets first N/2 rows
- CPU 1 gets last N/2 rows
- CPU 0 has much less work to do
- CPU 1 becomes the bottleneck
56Cyclic-striped distribution
- CPU 0 gets odd rows
- CPU 1 gets even rows
- CPU 0 and 1 have more or less the same amount of
work
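The difference between the two distributions can be estimated by counting inner-loop updates per CPU. This is a rough model under stated assumptions: it counts only the A[i,j] updates of the elimination loop, ignores communication, uses 0-based row numbers (so CPU 0 gets rows 0, 2, 4, ..., the "odd" rows in 1-based terms), and assumes P divides N. The function name is illustrative.

```python
def work_per_cpu(n, p, cyclic):
    """Count inner-loop updates each CPU performs during elimination."""
    work = [0] * p
    for k in range(n):                      # outer-loop iteration
        for i in range(k + 1, n):           # rows still being updated
            owner = i % p if cyclic else i // (n // p)
            work[owner] += n - k - 1        # columns k+1 .. n-1 of row i
    return work

block = work_per_cpu(100, 2, cyclic=False)
cyclic = work_per_cpu(100, 2, cyclic=True)
```

In this model the total work is the same for both layouts, but the block layout gives the second CPU more than twice the first CPU's share, while the cyclic layout keeps the two within a few percent of each other.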
57A Search Problem
- Given an array A[1..N] and an item x, check if x is present in A
    int present = false;
    for (i = 1; !present && i <= N; i++)
        if (A[i] == x) present = true;
- Don't know in advance which data we need to access
58Parallel Search on 2 CPUs
    int lb, ub;
    int A[lb:ub];
    for (i = lb; i <= ub; i++)
        if (A[i] == x)
            print("Found item");
            SEND(1-cpuid);      /* send other CPU empty message */
            exit();
        /* check message from other CPU: */
        if (NONBLOCKING_RECEIVE(1-cpuid)) exit();
65Performance Analysis
- How much faster is the parallel program than the sequential program for N=100?
- 1. if x not present => factor 2
- 2. if x present in A[1 .. 50] => factor 1
- 3. if A[51] == x => factor 51
- 4. if A[75] == x => factor 3
- In case 2 the parallel program does more work than the sequential program => search overhead
- In cases 3 and 4 the parallel program does less work => negative search overhead
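The four factors can be reproduced by counting comparisons. The sketch assumes that time is proportional to the number of comparisons made by the CPU that finishes the search, that message cost is zero, and that CPU 0 holds the first half of A and CPU 1 the second half; `speedup` is an illustrative name.

```python
def speedup(n, pos):
    """Speedup of the 2-CPU search over the sequential one.

    pos is the 1-based index of x in A, or None if x is absent.
    Time is counted in comparisons; message cost is ignored (assumption).
    """
    half = n // 2
    if pos is None:
        return n / half                 # each CPU scans its whole half
    sequential = pos                    # comparisons until x is found
    parallel = pos if pos <= half else pos - half
    return sequential / parallel

factors = [speedup(100, None), speedup(100, 30),
           speedup(100, 51), speedup(100, 75)]
```

For A[51] the sequential program needs 51 comparisons while CPU 1 finds x on its first one, which is where the factor 51 comes from.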
66Discussion
- Several kinds of performance overhead
  - Communication overhead: communication/computation ratio must be low
  - Load imbalance: all processors must do same amount of work
  - Search overhead: avoid useless (speculative) computations
- Making algorithms correct is nontrivial
  - Message ordering
67Designing Parallel Algorithms
- Source: Designing and Building Parallel Programs (Ian Foster, 1995)
  - (available on-line at http://www.mcs.anl.gov/dbpp)
- Partitioning
- Communication
- Agglomeration
- Mapping
68Figure 2.1 from Foster's book
69Partitioning
- Domain decomposition
- Partition the data
- Partition computations on data
- owner-computes rule
- Functional decomposition
- Divide computations into subtasks
- E.g. search algorithms
70Communication
- Analyze data-dependencies between partitions
- Use communication to transfer data
- Many forms of communication, e.g.
- Local communication with neighbors (SOR)
- Global communication with all processors (ASP)
- Synchronous (blocking) communication
- Asynchronous (non-blocking) communication
71Agglomeration
- Reduce communication overhead by
- increasing granularity
- improving locality
72Mapping
- On which processor to execute each subtask?
- Put concurrent tasks on different CPUs
- Put frequently communicating tasks on same CPU?
- Avoid load imbalances
73Summary
- Hardware and software models
- Example applications
  - Matrix multiplication - Trivial parallelism (independent tasks)
  - Successive overrelaxation - Neighbor communication
  - All-pairs shortest paths - Broadcast communication
  - Linear equations - Load balancing problem
  - Search problem - Search overhead
- Designing parallel algorithms