Title: Introduction to Parallel Programming
1. Introduction to Parallel Programming
- Language notation: message passing
- Distributed-memory machine
- (e.g., workstations on a network)
- 5 parallel algorithms of increasing complexity
- Matrix multiplication
- Successive overrelaxation
- All-pairs shortest paths
- Linear equations
- Traveling Salesman problem
2. Message Passing
- SEND(destination, message)
  - blocking: wait until message has arrived (like a fax)
  - non-blocking: continue immediately (like a mailbox)
- RECEIVE(source, message)
- RECEIVE-FROM-ANY(message)
  - blocking: wait until a message is available
  - non-blocking: test if a message is available
3. Syntax
- Use pseudo-code with C-like syntax
- Use indentation instead of { .. } to indicate block structure
- Arrays can have user-defined index ranges
  - Default: start at 1
  - int A[10:100] runs from 10 to 100
  - int A[N] runs from 1 to N
- Use array slices (sub-arrays)
  - A[i..j] = elements A[i] to A[j]
  - A[i,*] = elements A[i,1] to A[i,N], i.e., row i of matrix A
  - A[*,k] = elements A[1,k] to A[N,k], i.e., column k of A
4. Parallel Matrix Multiplication
- Given two N x N matrices A and B
- Compute C = A x B
- C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + .. + A[i,N]*B[N,j]
(figure: matrices A, B, and C)
5. Sequential Matrix Multiplication
- for (i = 1; i <= N; i++)
  - for (j = 1; j <= N; j++)
    - C[i,j] = 0;
    - for (k = 1; k <= N; k++)
      - C[i,j] += A[i,k] * B[k,j];
- The order of the operations is overspecified
- Everything can be computed in parallel
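The triple loop above translates directly into executable form; a minimal Python sketch with 0-based indexing (the slides' pseudo-code is 1-based):

```python
def matmul(A, B):
    """Return C = A x B for square matrices given as lists of rows."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Since no iteration of the i/j loops reads what another iteration writes, each C[i][j] could indeed be computed independently.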
6. Parallel Algorithm 1
- Each processor computes 1 element of C
- Requires N^2 processors
- Each processor needs 1 row of A and 1 column of B
7. Structure
- Master distributes the work and receives the results
- Slaves get work and execute it
  - Slaves are numbered consecutively from 1 to P
- How to start up master/slave processes depends on the Operating System (not discussed here)
8Parallel Algorithm 1
Master (processor 0) int proc 1 for (i
1 i lt N i) for (j 1 j lt N
j) SEND(proc, Ai,, B,j, i, j)
proc for (x 1 x lt NN x) RECEIVE_FRO
M_ANY(result, i, j) Ci,j result
Slaves (processors 1 .. P) int AixN, BxjN,
Cij RECEIVE(0, Aix, Bxj, i, j) Cij
0 for (k 1 k lt N k) Cij Aixk
Bxjk SEND(0, Cij , i, j)
9. Efficiency (complexity analysis)
- Each processor needs O(N) communication to do O(N) computations
  - Communication: 2N+1 integers = O(N)
  - Computation per processor: N multiplications/additions = O(N)
- Exact communication/computation costs depend on network and CPU
- Still, this algorithm is inefficient for any existing machine
- Need to improve communication/computation ratio
10. Parallel Algorithm 2
- Each processor computes 1 row (N elements) of C
- Requires N processors
- Needs the entire B matrix and 1 row of A as input
11. Structure
(figure: master sends A[i,*] and B[*,*] to slave i, and receives C[i,*] back, for slaves 1 .. N)
12. Parallel Algorithm 2
- Master (processor 0):
  - for (i = 1; i <= N; i++)
    - SEND(i, A[i,*], B[*,*], i);
  - for (x = 1; x <= N; x++)
    - RECEIVE_FROM_ANY(result, i);
    - C[i,*] = result;
- Slaves:
  - int Aix[N], B[N,N], C[N];
  - RECEIVE(0, Aix, B, i);
  - for (j = 1; j <= N; j++)
    - C[j] = 0;
    - for (k = 1; k <= N; k++)
      - C[j] += Aix[k] * B[k,j];
  - SEND(0, C, i);
13. Problem: need larger granularity
- Each processor now needs O(N^2) communication and O(N^2) computation -> still inefficient
- Assumption: N >> P (i.e., we solve a large problem)
- Assign many rows to each processor
14. Parallel Algorithm 3
- Each processor computes N/P rows of C
- Needs the entire B matrix and N/P rows of A as input
- Each processor now needs O(N^2) communication and O(N^3 / P) computation
15. Parallel Algorithm 3 (master)
- Master (processor 0):
  - int result[N, N/P];
  - int inc = N/P;   /* number of rows per cpu */
  - int lb = 1;      /* lb = lower bound */
  - for (i = 1; i <= P; i++)
    - SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1);
    - lb += inc;
  - for (x = 1; x <= P; x++)
    - RECEIVE_FROM_ANY(result, lb);
    - for (i = 1; i <= N/P; i++)
      - C[lb+i-1, *] = result[i, *];
16. Parallel Algorithm 3 (slave)
Slaves:
    int A[N/P, N], B[N,N], C[N/P, N];
    RECEIVE(0, A, B, lb, ub);
    for (i = lb; i <= ub; i++)
        for (j = 1; j <= N; j++)
            C[i,j] = 0;
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j];
    SEND(0, C[*,*], lb);
17. Comparison

Algorithm | Parallelism (jobs) | Communication per job | Computation per job | Ratio comp/comm
1         | N^2                | N + N + 1             | N                   | O(1)
2         | N                  | N + N^2 + N           | N^2                 | O(1)
3         | P                  | N^2/P + N^2 + N^2/P   | N^3/P               | O(N/P)

- If N >> P, algorithm 3 will have low communication overhead
- Its grain size is high
18. Example speedup graph (figure)
19. Discussion
- Matrix multiplication is trivial to parallelize
- Getting good performance is a problem
- Need right grain size
- Need large input problem
20. Successive Overrelaxation (SOR)
- Iterative method for solving Laplace equations
- Repeatedly updates elements of a grid
21. Successive Overrelaxation (SOR)
- float G[1:N, 1:M], Gnew[1:N, 1:M];
- for (step = 0; step < NSTEPS; step++)
  - for (i = 2; i < N; i++)   /* update grid */
    - for (j = 2; j < M; j++)
      - Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
  - G = Gnew;
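The update loop above can be sketched in Python. The slides leave f unspecified, so the four-point average below is an assumed example stencil (real SOR would also mix in the old value with a relaxation factor):

```python
def sor_step(G, f):
    """One update sweep over the interior of grid G; boundary rows and
    columns stay fixed, matching the i = 2..N-1, j = 2..M-1 loops above."""
    n, m = len(G), len(G[0])
    Gnew = [row[:] for row in G]   # copy, so all reads use the old grid
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            Gnew[i][j] = f(G[i][j], G[i - 1][j], G[i + 1][j],
                           G[i][j - 1], G[i][j + 1])
    return Gnew

# Assumed example f: average of the four neighbors.
avg = lambda c, up, down, left, right: (up + down + left + right) / 4.0
```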
22. SOR example (figure)
23. SOR example (figure)
24. Parallelizing SOR
- Domain decomposition on the grid
- Each processor owns N/P rows
- Need communication between neighbors to exchange elements at processor boundaries
25. SOR example: partitioning (figure)
26. SOR example: partitioning (figure)
27. Communication scheme
- Each CPU communicates with its left and right neighbor (if existing)
28. Parallel SOR
- float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
- for (step = 0; step < NSTEPS; step++)
  - SEND(cpuid-1, G[lb]);       /* send 1st row left */
  - SEND(cpuid+1, G[ub]);       /* send last row right */
  - RECEIVE(cpuid-1, G[lb-1]);  /* receive from left */
  - RECEIVE(cpuid+1, G[ub+1]);  /* receive from right */
  - for (i = lb; i <= ub; i++)  /* update my rows */
    - for (j = 2; j < M; j++)
      - Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
  - G = Gnew;
29. Performance of SOR
- Communication and computation during each iteration
  - Each CPU sends/receives 2 messages with M reals
  - Each CPU computes N/P * M updates
- The algorithm will have good performance if
  - Problem size is large: N >> P
  - Message exchanges can be done in parallel
30. All-pairs Shortest Paths (ASP)
- Given a graph G with a distance table C
  - C[i,j] = length of direct path from node i to node j
- Compute length of shortest path between any two nodes in G
31. Floyd's Sequential Algorithm
- Basic step:
  - for (k = 1; k <= N; k++)
    - for (i = 1; i <= N; i++)
      - for (j = 1; j <= N; j++)
        - C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
- During iteration k, you can visit only intermediate nodes in the set {1 .. k}
  - k = 0 -> initial problem, no intermediate nodes
  - k = N -> final solution
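The basic step above is the whole algorithm; a runnable Python sketch with 0-based indices, using float('inf') for "no direct path":

```python
def floyd(C):
    """All-pairs shortest paths: update the n x n distance matrix C
    in place, allowing one more intermediate node per k-iteration."""
    n = len(C)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if C[i][k] + C[k][j] < C[i][j]:
                    C[i][j] = C[i][k] + C[k][j]
    return C
```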
32. Parallelizing ASP
- Distribute rows of C among the P processors
- During iteration k, each processor executes
  - C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
  - on its own rows i, so it needs these rows and row k
- Before iteration k, the processor owning row k sends it to all the others
33-35. (figures: during iteration k, each processor combines its own rows i with row k, which is broadcast to all processors)
36. Parallel ASP Algorithm
- int lb, ub;                /* lower/upper bound for this CPU */
- int rowK[N], C[lb:ub, N];  /* pivot row ; matrix */
- for (k = 1; k <= N; k++)
  - if (k >= lb && k <= ub)  /* do I have it? */
    - rowK = C[k,*];
    - for (proc = 1; proc <= P; proc++)  /* broadcast row */
      - if (proc != myprocid) SEND(proc, rowK);
  - else
    - RECEIVE_FROM_ANY(rowK);            /* receive row */
  - for (i = lb; i <= ub; i++)           /* update my rows */
    - for (j = 1; j <= N; j++)
      - C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
37. Performance Analysis ASP
- Per iteration:
  - 1 CPU sends P-1 messages with N integers
  - Each CPU does N/P x N comparisons
- Communication/computation ratio is small if N >> P
38. ... but is the Algorithm Correct?
39. Parallel ASP Algorithm
- int lb, ub;                /* lower/upper bound for this CPU */
- int rowK[N], C[lb:ub, N];  /* pivot row ; matrix */
- for (k = 1; k <= N; k++)
  - if (k >= lb && k <= ub)  /* do I have it? */
    - rowK = C[k,*];
    - for (proc = 1; proc <= P; proc++)  /* broadcast row */
      - if (proc != myprocid) SEND(proc, rowK);
  - else
    - RECEIVE_FROM_ANY(rowK);            /* receive row */
  - for (i = lb; i <= ub; i++)           /* update my rows */
    - for (j = 1; j <= N; j++)
      - C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
40. Non-FIFO Message Ordering
- Row 2 may be received before row 1
41. FIFO Ordering
- Row 5 may be received before row 4
42. Correctness
- Problems:
  - Asynchronous non-FIFO SEND
  - Messages from different senders may overtake each other
- Solution is to use a combination of:
  - Synchronous SEND (less efficient)
  - Barrier at the end of the outer loop (extra communication)
  - Order incoming messages (requires buffering)
  - RECEIVE(cpu, msg) (more complicated)
43. Introduction to Parallel Programming
- Language notation: message passing
- Distributed-memory machine
- (e.g., workstations on a network)
- 5 parallel algorithms of increasing complexity
- Matrix multiplication
- Successive overrelaxation
- All-pairs shortest paths
- Linear equations
- Traveling Salesman problem
44. Linear equations
- Linear equations:
  - a1,1x1 + a1,2x2 + ... + a1,nxn = b1
  - ...
  - an,1x1 + an,2x2 + ... + an,nxn = bn
- Matrix notation: Ax = b
- Problem: compute x, given A and b
- Linear equations have many important applications
- Practical applications need huge sets of equations
45. Solving a linear equation
- Two phases:
  - Upper-triangularization -> Ux = y
  - Back-substitution -> x
- Most computation time is in upper-triangularization
- Upper-triangular matrix:
  - U[i,i] = 1
  - U[i,j] = 0 if i > j
1 . . . . . . .
0 1 . . . . . .
0 0 1 . . . . .
0 0 0 1 . . . .
0 0 0 0 1 . . .
0 0 0 0 0 1 . .
0 0 0 0 0 0 1 .
0 0 0 0 0 0 0 1
46. Sequential Gaussian elimination
- Converts Ax = b into Ux = y
- Sequential algorithm uses 2/3 N^3 operations
- for (k = 1; k <= N; k++)
  - for (j = k+1; j <= N; j++)
    - A[k,j] = A[k,j] / A[k,k];
  - y[k] = b[k] / A[k,k];
  - A[k,k] = 1;
  - for (i = k+1; i <= N; i++)
    - for (j = k+1; j <= N; j++)
      - A[i,j] = A[i,j] - A[i,k] * A[k,j];
    - b[i] = b[i] - A[i,k] * y[k];
    - A[i,k] = 0;
(figure: state of A and y after the first elimination steps)
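Both phases can be sketched in Python, following the elimination loop above. Like the slides' pseudo-code, this sketch does no pivoting, so it assumes A[k][k] is never zero; it modifies A and b in place:

```python
def solve(A, b):
    """Gaussian elimination (no pivoting) into Ux = y, then
    back-substitution to recover x.  A and b are modified in place."""
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        pivot = A[k][k]                       # assumed nonzero (no pivoting)
        for j in range(k + 1, n):
            A[k][j] /= pivot
        y[k] = b[k] / pivot
        A[k][k] = 1.0
        for i in range(k + 1, n):             # eliminate column k below row k
            factor = A[i][k]
            for j in range(k + 1, n):
                A[i][j] -= factor * A[k][j]
            b[i] -= factor * y[k]
            A[i][k] = 0.0
    x = [0.0] * n                             # back-substitution on Ux = y
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
    return x
```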
47. Parallelizing Gaussian elimination
- Row-wise partitioning scheme
  - Each CPU gets one row (striping)
- Execute one (outer-loop) iteration at a time
- Communication requirement:
  - During iteration k, CPUs Pk+1 .. Pn-1 need part of row k
  - This row is stored on CPU Pk
  - -> need partial broadcast (multicast)
48. Communication (figure)
49. Performance problems
- Communication overhead (multicast)
- Load imbalance
  - CPUs P0 .. Pk are idle during iteration k
  - Bad load balance means bad speedups, as some CPUs have too much work
- In general, the number of CPUs is less than n
  - Choice between block-striped and cyclic-striped distribution
  - Block-striped distribution has high load imbalance
  - Cyclic-striped distribution has less load imbalance
50. Block-striped distribution
- CPU 0 gets first N/2 rows
- CPU 1 gets last N/2 rows
- CPU 0 has much less work to do
- CPU 1 becomes the bottleneck
51. Cyclic-striped distribution
- CPU 0 gets odd rows
- CPU 1 gets even rows
- CPU 0 and 1 have more or less the same amount of work
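The two distributions can be expressed as small index functions. These use 0-based row numbers (so the slides' "odd/even rows" become even/odd offsets) and assume P divides N evenly:

```python
def block_rows(n, p, cpu):
    """Rows owned by `cpu` under block-striped distribution:
    one contiguous chunk of n/p rows per CPU."""
    size = n // p
    return list(range(cpu * size, (cpu + 1) * size))

def cyclic_rows(n, p, cpu):
    """Rows owned by `cpu` under cyclic-striped distribution:
    every p-th row, starting at `cpu`."""
    return list(range(cpu, n, p))
```

With cyclic striping every CPU keeps owning rows near the bottom of the matrix, so all CPUs stay busy in the late iterations of the elimination.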
52. Traveling Salesman Problem (TSP)
- Find shortest route for salesman among given set of cities (NP-hard problem)
- Each city must be visited once, no return to initial city
(figure: example route among New York, Chicago, Saint Louis, and Miami, with distances on the edges)
53. Sequential branch-and-bound
- Structure the entire search space as a tree, sorted using nearest-city-first heuristic
54. Pruning the search tree
- Keep track of best solution found so far (the bound)
- Cut off partial routes >= bound
(figure: subtree that can be pruned)
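A sequential sketch of the branch-and-bound in Python. The fixed start city (city 0) and the distance-matrix format are assumptions for this sketch, and the nearest-city-first ordering of the loop is omitted for brevity:

```python
def tsp(dist):
    """Branch-and-bound over all open routes starting at city 0,
    pruning partial routes whose length already reaches the bound."""
    n = len(dist)
    best = [float('inf')]              # the bound: best complete route so far

    def search(path, length):
        if length >= best[0]:
            return                     # prune: cannot beat the bound
        if len(path) == n:
            best[0] = length           # new best complete route
            return
        last = path[-1]
        for city in range(n):
            if city not in path:
                search(path + [city], length + dist[last][city])

    search([0], 0)
    return best[0]
```

With a good first solution (which is what the nearest-city-first ordering tries to produce), most of the tree is pruned early.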
55. Parallelizing TSP
- Distribute the search tree over the CPUs
- Results in reasonably large-grain jobs
(figure: subtrees assigned to CPU 1, CPU 2, CPU 3)
56. Distribution of the tree
- Static distribution: each CPU gets a fixed part of the tree
- Load imbalance: subtrees take different amounts of time
- Impossible to predict load imbalance statically (as for Gaussian)
(figure: search tree with distances on the edges)
57. Dynamic load balancing: Replicated Workers Model
- Master process generates large number of jobs (subtrees) and repeatedly hands them out
- Worker processes repeatedly get work and execute it
- Runtime overhead for fetching jobs dynamically
- Efficient for TSP because the jobs are large
(figure: master handing out jobs to workers)
58. Real search spaces are huge
- NP-complete problem -> exponential search space
- Master searches MAXHOPS levels, then creates jobs
- E.g., for 20 cities, MAXHOPS = 4 -> 20*19*18*17 (> 100,000) jobs, each searching 16 remaining cities
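The job count above is just a falling product over the first MAXHOPS levels of the tree; as a quick sanity check:

```python
def job_count(ncities, maxhops):
    """Number of partial paths of length `maxhops` over `ncities`
    cities: ncities * (ncities-1) * ... down to maxhops factors."""
    total = 1
    for level in range(maxhops):
        total *= ncities - level
    return total
```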
59. Parallel TSP Algorithm (1/3)
- process master (CPU 0):
  - generate-jobs();                   /* generate all jobs, start with empty path */
  - for (proc = 1; proc <= P; proc++)  /* inform workers we're done */
    - RECEIVE(proc, worker-id);        /* get work request */
    - SEND(proc, []);                  /* return empty path */
- generate-jobs(List path):
  - if (size(path) == MAXHOPS)         /* if path has MAXHOPS cities */
    - RECEIVE-FROM-ANY(worker-id);     /* wait for work request */
    - SEND(worker-id, path);           /* send partial route to worker */
  - else
    - for (city = 1; city <= NRCITIES; city++)           /* (should be ordered) */
      - if (city not on path) generate-jobs(path + city); /* append city */
60. Parallel TSP Algorithm (2/3)
- process worker (CPUs 1 .. P):
  - int Minimum = maxint;       /* length of current best path (bound) */
  - List path;
  - for (;;)
    - SEND(0, myprocid);        /* send work request to master */
    - RECEIVE(0, path);         /* get next job from master */
    - if (path == []) exit();   /* we're done */
    - tsp(path, length(path));  /* compute all subsequent paths */
61. Parallel TSP Algorithm (3/3)
- tsp(List path, int length):
  - if (NONBLOCKING_RECEIVE_FROM_ANY(m))  /* is there an update message? */
    - if (m < Minimum) Minimum = m;       /* update global minimum */
  - if (length >= Minimum) return;        /* not a shorter route */
  - if (size(path) == NRCITIES)           /* complete route? */
    - Minimum = length;                   /* update global minimum */
    - for (proc = 1; proc <= P; proc++)
      - if (proc != myprocid) SEND(proc, length);  /* broadcast it */
  - else
    - last = last(path);                  /* last city on the path */
    - for (city = 1; city <= NRCITIES; city++)     /* should be ordered */
      - if (city not on path) tsp(path + city, length + distance[last, city]);
62. Search overhead
(figure: search tree divided over CPU 1, CPU 2, CPU 3)
63. Search overhead
- Path <n m s> is started (in parallel) before the outcome (6) of <n c s m> is known, so it cannot be pruned
- The parallel algorithm therefore does more work than the sequential algorithm
- This is called search overhead
- It can occur in algorithms that do speculative work, like parallel search algorithms
- Can also have negative search overhead, resulting in superlinear speedups!
64. Performance of TSP
- Communication overhead (small)
  - Distribution of jobs + updating the global bound
  - Small number of messages
- Load imbalances
  - Small: the replicated workers model does automatic (dynamic) load balancing
- Search overhead
  - Main performance problem
65. Discussion
- Several kinds of performance overhead
  - Communication overhead: communication/computation ratio must be low
  - Load imbalance: all processors must do the same amount of work
  - Search overhead: avoid useless (speculative) computations
- Making algorithms correct is nontrivial
  - Message ordering
66. Designing Parallel Algorithms
- Source: Designing and Building Parallel Programs (Ian Foster, 1995)
  - (available on-line at http://www.mcs.anl.gov/dbpp)
- Partitioning
- Communication
- Agglomeration
- Mapping
67. Figure 2.1 from Foster's book
68. Partitioning
- Domain decomposition
  - Partition the data
  - Partition computations on data
  - Owner-computes rule
- Functional decomposition
  - Divide computations into subtasks
  - E.g. search algorithms
69. Communication
- Analyze data dependencies between partitions
- Use communication to transfer data
- Many forms of communication, e.g.:
  - Local communication with neighbors (SOR)
  - Global communication with all processors (ASP)
  - Synchronous (blocking) communication
  - Asynchronous (non-blocking) communication
70. Agglomeration
- Reduce communication overhead by
- increasing granularity
- improving locality
71. Mapping
- On which processor to execute each subtask?
- Put concurrent tasks on different CPUs
- Put frequently communicating tasks on same CPU?
- Avoid load imbalances
72. Summary
- Hardware and software models
- Example applications
  - Matrix multiplication: trivial parallelism (independent tasks)
  - Successive overrelaxation: neighbor communication
  - All-pairs shortest paths: broadcast communication
  - Linear equations: load-balancing problem
  - Traveling Salesman problem: search overhead
- Designing parallel algorithms