Title: Introduction to Parallel Programming
1. Introduction to Parallel Programming
- Language notation: message passing
- Distributed-memory machine
- (e.g., workstations on a network)
- 5 parallel algorithms of increasing complexity
- Matrix multiplication
- Successive overrelaxation
- All-pairs shortest paths
- Linear equations
- Traveling Salesman problem
2. Message Passing
- SEND(destination, message)
  - blocking: wait until message has arrived (like a fax)
  - non-blocking: continue immediately (like a mailbox)
- RECEIVE(source, message)
- RECEIVE-FROM-ANY(message)
  - blocking: wait until a message is available
  - non-blocking: test if a message is available
3. Syntax
- Use pseudo-code with C-like syntax
- Use indentation instead of { .. } to indicate block structure
- Arrays can have user-defined index ranges
  - Default: start at 1
  - int A[10:100] runs from 10 to 100
  - int A[N] runs from 1 to N
- Use array slices (sub-arrays)
  - A[i..j] = elements A[i] to A[j]
  - A[i,*] = elements A[i,1] to A[i,N], i.e., row i of matrix A
  - A[*,k] = elements A[1,k] to A[N,k], i.e., column k of A
4. Parallel Matrix Multiplication
- Given two N x N matrices A and B
- Compute C = A x B
- C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + .. + A[i,N]*B[N,j]
(figure: matrices A, B, and C)
5. Sequential Matrix Multiplication
- for (i = 1; i <= N; i++)
  - for (j = 1; j <= N; j++)
    - C[i,j] = 0;
    - for (k = 1; k <= N; k++)
      - C[i,j] += A[i,k] * B[k,j];
- The order of the operations is overspecified
- Everything can be computed in parallel
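The triple loop above translates directly into executable form; a minimal Python sketch with 0-based indexing (the slides' pseudo-code is 1-based):

```python
def matmul(A, B):
    """Return C = A x B for square matrices given as lists of rows."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Since no iteration of the i/j loops reads what another iteration writes, each C[i][j] could indeed be computed independently.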
6. Parallel Algorithm 1
- Each processor computes 1 element of C
- Requires N^2 processors
- Each processor needs 1 row of A and 1 column of B
7. Structure
- Master distributes the work and receives the results
- Slaves get work and execute it
  - Slaves are numbered consecutively from 1 to P
- How to start up master/slave processes depends on the Operating System (not discussed here)
8Parallel Algorithm 1
Master (processor 0) int proc 1 for (i
1 i lt N i) for (j 1 j lt N
j) SEND(proc, Ai,, B,j, i, j)
proc for (x 1 x lt NN x) RECEIVE_FRO
M_ANY(result, i, j) Ci,j result
Slaves (processors 1 .. P) int AixN, BxjN,
Cij RECEIVE(0, Aix, Bxj, i, j) Cij
0 for (k 1 k lt N k) Cij Aixk
Bxjk SEND(0, Cij , i, j)
9. Efficiency (complexity analysis)
- Each processor needs O(N) communication to do O(N) computations
  - Communication: 2N+1 integers = O(N)
  - Computation per processor: N multiplications/additions = O(N)
- Exact communication/computation costs depend on network and CPU
- Still, this algorithm is inefficient for any existing machine
- Need to improve communication/computation ratio
10. Parallel Algorithm 2
- Each processor computes 1 row (N elements) of C
- Requires N processors
- Needs the entire B matrix and 1 row of A as input
11. Structure
(figure: master sends A[i,*] and B[*,*] to slave i, and receives C[i,*] back, for slaves 1 .. N)
12. Parallel Algorithm 2
- Master (processor 0):
  - for (i = 1; i <= N; i++)
    - SEND(i, A[i,*], B[*,*], i);
  - for (x = 1; x <= N; x++)
    - RECEIVE_FROM_ANY(result, i);
    - C[i,*] = result;
- Slaves:
  - int Aix[N], B[N,N], C[N];
  - RECEIVE(0, Aix, B, i);
  - for (j = 1; j <= N; j++)
    - C[j] = 0;
    - for (k = 1; k <= N; k++)
      - C[j] += Aix[k] * B[k,j];
  - SEND(0, C, i);
13. Problem: need larger granularity
- Each processor now needs O(N^2) communication and O(N^2) computation -> still inefficient
- Assumption: N >> P (i.e., we solve a large problem)
- Assign many rows to each processor
14. Parallel Algorithm 3
- Each processor computes N/P rows of C
- Needs the entire B matrix and N/P rows of A as input
- Each processor now needs O(N^2) communication and O(N^3 / P) computation
15. Parallel Algorithm 3 (master)
- Master (processor 0):
  - int result[N, N/P];
  - int inc = N/P;   /* number of rows per cpu */
  - int lb = 1;      /* lb = lower bound */
  - for (i = 1; i <= P; i++)
    - SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1);
    - lb += inc;
  - for (x = 1; x <= P; x++)
    - RECEIVE_FROM_ANY(result, lb);
    - for (i = 1; i <= N/P; i++)
      - C[lb+i-1, *] = result[i, *];
16. Parallel Algorithm 3 (slave)
Slaves:
    int A[N/P, N], B[N,N], C[N/P, N];
    RECEIVE(0, A, B, lb, ub);
    for (i = lb; i <= ub; i++)
        for (j = 1; j <= N; j++)
            C[i,j] = 0;
            for (k = 1; k <= N; k++)
                C[i,j] += A[i,k] * B[k,j];
    SEND(0, C[*,*], lb);
17. Comparison

Algorithm | Parallelism (jobs) | Communication per job | Computation per job | Ratio comp/comm
1         | N^2                | N + N + 1             | N                   | O(1)
2         | N                  | N + N^2 + N           | N^2                 | O(1)
3         | P                  | N^2/P + N^2 + N^2/P   | N^3/P               | O(N/P)

- If N >> P, algorithm 3 will have low communication overhead
- Its grain size is high
18. Example speedup graph (figure)
19. Discussion
- Matrix multiplication is trivial to parallelize
- Getting good performance is a problem
- Need right grain size
- Need large input problem
20. Successive Overrelaxation (SOR)
- Iterative method for solving Laplace equations
- Repeatedly updates elements of a grid
21. Successive Overrelaxation (SOR)
- float G[1:N, 1:M], Gnew[1:N, 1:M];
- for (step = 0; step < NSTEPS; step++)
  - for (i = 2; i < N; i++)   /* update grid */
    - for (j = 2; j < M; j++)
      - Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
  - G = Gnew;
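The update loop above can be sketched in Python. The slides leave f unspecified, so the four-point average below is an assumed example stencil (real SOR would also mix in the old value with a relaxation factor):

```python
def sor_step(G, f):
    """One update sweep over the interior of grid G; boundary rows and
    columns stay fixed, matching the i = 2..N-1, j = 2..M-1 loops above."""
    n, m = len(G), len(G[0])
    Gnew = [row[:] for row in G]   # copy, so all reads use the old grid
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            Gnew[i][j] = f(G[i][j], G[i - 1][j], G[i + 1][j],
                           G[i][j - 1], G[i][j + 1])
    return Gnew

# Assumed example f: average of the four neighbors.
avg = lambda c, up, down, left, right: (up + down + left + right) / 4.0
```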
22. SOR example (figure)
23. SOR example (figure)
24. Parallelizing SOR
- Domain decomposition on the grid
- Each processor owns N/P rows
- Need communication between neighbors to exchange elements at processor boundaries
25. SOR example: partitioning (figure)
26. SOR example: partitioning (figure)
27. Communication scheme
- Each CPU communicates with its left and right neighbor (if existing)
28. Parallel SOR
- float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
- for (step = 0; step < NSTEPS; step++)
  - SEND(cpuid-1, G[lb]);       /* send 1st row left */
  - SEND(cpuid+1, G[ub]);       /* send last row right */
  - RECEIVE(cpuid-1, G[lb-1]);  /* receive from left */
  - RECEIVE(cpuid+1, G[ub+1]);  /* receive from right */
  - for (i = lb; i <= ub; i++)  /* update my rows */
    - for (j = 2; j < M; j++)
      - Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
  - G = Gnew;
29. Performance of SOR
- Communication and computation during each iteration
  - Each CPU sends/receives 2 messages with M reals
  - Each CPU computes N/P * M updates
- The algorithm will have good performance if
  - Problem size is large: N >> P
  - Message exchanges can be done in parallel
30. All-pairs Shortest Paths (ASP)
- Given a graph G with a distance table C
  - C[i,j] = length of direct path from node i to node j
- Compute length of shortest path between any two nodes in G
31. Floyd's Sequential Algorithm
- Basic step:
  - for (k = 1; k <= N; k++)
    - for (i = 1; i <= N; i++)
      - for (j = 1; j <= N; j++)
        - C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
- During iteration k, you can visit only intermediate nodes in the set {1 .. k}
  - k = 0 -> initial problem, no intermediate nodes
  - k = N -> final solution
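The basic step above is the whole algorithm; a runnable Python sketch with 0-based indices, using float('inf') for "no direct path":

```python
def floyd(C):
    """All-pairs shortest paths: update the n x n distance matrix C
    in place, allowing one more intermediate node per k-iteration."""
    n = len(C)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if C[i][k] + C[k][j] < C[i][j]:
                    C[i][j] = C[i][k] + C[k][j]
    return C
```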
32. Parallelizing ASP
- Distribute rows of C among the P processors
- During iteration k, each processor executes
  - C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
  - on its own rows i, so it needs these rows and row k
- Before iteration k, the processor owning row k sends it to all the others
33-35. (figures: during iteration k, each processor combines its own rows i with row k, which is broadcast to all processors)
36. Parallel ASP Algorithm
- int lb, ub;                /* lower/upper bound for this CPU */
- int rowK[N], C[lb:ub, N];  /* pivot row ; matrix */
- for (k = 1; k <= N; k++)
  - if (k >= lb && k <= ub)  /* do I have it? */
    - rowK = C[k,*];
    - for (proc = 1; proc <= P; proc++)  /* broadcast row */
      - if (proc != myprocid) SEND(proc, rowK);
  - else
    - RECEIVE_FROM_ANY(rowK);            /* receive row */
  - for (i = lb; i <= ub; i++)           /* update my rows */
    - for (j = 1; j <= N; j++)
      - C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
37. Performance Analysis ASP
- Per iteration:
  - 1 CPU sends P-1 messages with N integers
  - Each CPU does N/P x N comparisons
- Communication/computation ratio is small if N >> P
38. ... but is the Algorithm Correct?
39. Parallel ASP Algorithm
- int lb, ub;                /* lower/upper bound for this CPU */
- int rowK[N], C[lb:ub, N];  /* pivot row ; matrix */
- for (k = 1; k <= N; k++)
  - if (k >= lb && k <= ub)  /* do I have it? */
    - rowK = C[k,*];
    - for (proc = 1; proc <= P; proc++)  /* broadcast row */
      - if (proc != myprocid) SEND(proc, rowK);
  - else
    - RECEIVE_FROM_ANY(rowK);            /* receive row */
  - for (i = lb; i <= ub; i++)           /* update my rows */
    - for (j = 1; j <= N; j++)
      - C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
40. Non-FIFO Message Ordering
- Row 2 may be received before row 1
41. FIFO Ordering
- Row 5 may be received before row 4
42. Correctness
- Problems:
  - Asynchronous non-FIFO SEND
  - Messages from different senders may overtake each other
- Solution is to use a combination of:
  - Synchronous SEND (less efficient)
  - Barrier at the end of the outer loop (extra communication)
  - Order incoming messages (requires buffering)
  - RECEIVE(cpu, msg) (more complicated)
43. Introduction to Parallel Programming
- Language notation: message passing
- Distributed-memory machine
- (e.g., workstations on a network)
- 5 parallel algorithms of increasing complexity
- Matrix multiplication
- Successive overrelaxation
- All-pairs shortest paths
- Linear equations
- Traveling Salesman problem
44. Linear equations
- Linear equations:
  - a1,1x1 + a1,2x2 + ... + a1,nxn = b1
  - ...
  - an,1x1 + an,2x2 + ... + an,nxn = bn
- Matrix notation: Ax = b
- Problem: compute x, given A and b
- Linear equations have many important applications
- Practical applications need huge sets of equations
45. Solving a linear equation
- Two phases:
  - Upper-triangularization -> Ux = y
  - Back-substitution -> x
- Most computation time is in upper-triangularization
- Upper-triangular matrix:
  - U[i,i] = 1
  - U[i,j] = 0 if i > j
1 . . . . . . .
0 1 . . . . . .
0 0 1 . . . . .
0 0 0 1 . . . .
0 0 0 0 1 . . .
0 0 0 0 0 1 . .
0 0 0 0 0 0 1 .
0 0 0 0 0 0 0 1
46. Sequential Gaussian elimination
- Converts Ax = b into Ux = y
- Sequential algorithm uses 2/3 N^3 operations
- for (k = 1; k <= N; k++)
  - for (j = k+1; j <= N; j++)
    - A[k,j] = A[k,j] / A[k,k];
  - y[k] = b[k] / A[k,k];
  - A[k,k] = 1;
  - for (i = k+1; i <= N; i++)
    - for (j = k+1; j <= N; j++)
      - A[i,j] = A[i,j] - A[i,k] * A[k,j];
    - b[i] = b[i] - A[i,k] * y[k];
    - A[i,k] = 0;
(figure: state of A and y after the first elimination steps)
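Both phases can be sketched in Python, following the elimination loop above. Like the slides' pseudo-code, this sketch does no pivoting, so it assumes A[k][k] is never zero; it modifies A and b in place:

```python
def solve(A, b):
    """Gaussian elimination (no pivoting) into Ux = y, then
    back-substitution to recover x.  A and b are modified in place."""
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        pivot = A[k][k]                       # assumed nonzero (no pivoting)
        for j in range(k + 1, n):
            A[k][j] /= pivot
        y[k] = b[k] / pivot
        A[k][k] = 1.0
        for i in range(k + 1, n):             # eliminate column k below row k
            factor = A[i][k]
            for j in range(k + 1, n):
                A[i][j] -= factor * A[k][j]
            b[i] -= factor * y[k]
            A[i][k] = 0.0
    x = [0.0] * n                             # back-substitution on Ux = y
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
    return x
```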
47. Parallelizing Gaussian elimination
- Row-wise partitioning scheme
  - Each CPU gets one row (striping)
- Execute one (outer-loop) iteration at a time
- Communication requirement:
  - During iteration k, CPUs Pk+1 .. Pn-1 need part of row k
  - This row is stored on CPU Pk
  - -> need partial broadcast (multicast)
48. Communication (figure)
49. Performance problems
- Communication overhead (multicast)
- Load imbalance
  - CPUs P0 .. Pk are idle during iteration k
  - Bad load balance means bad speedups, as some CPUs have too much work
- In general, the number of CPUs is less than n
  - Choice between block-striped and cyclic-striped distribution
  - Block-striped distribution has high load imbalance
  - Cyclic-striped distribution has less load imbalance
50. Block-striped distribution
- CPU 0 gets first N/2 rows
- CPU 1 gets last N/2 rows
- CPU 0 has much less work to do
- CPU 1 becomes the bottleneck
51. Cyclic-striped distribution
- CPU 0 gets odd rows
- CPU 1 gets even rows
- CPU 0 and 1 have more or less the same amount of work
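The two distributions can be expressed as small index functions. These use 0-based row numbers (so the slides' "odd/even rows" become even/odd offsets) and assume P divides N evenly:

```python
def block_rows(n, p, cpu):
    """Rows owned by `cpu` under block-striped distribution:
    one contiguous chunk of n/p rows per CPU."""
    size = n // p
    return list(range(cpu * size, (cpu + 1) * size))

def cyclic_rows(n, p, cpu):
    """Rows owned by `cpu` under cyclic-striped distribution:
    every p-th row, starting at `cpu`."""
    return list(range(cpu, n, p))
```

With cyclic striping every CPU keeps owning rows near the bottom of the matrix, so all CPUs stay busy in the late iterations of the elimination.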
52. Traveling Salesman Problem (TSP)
- Find shortest route for salesman among given set of cities (NP-hard problem)
- Each city must be visited once, no return to initial city
(figure: example route among New York, Chicago, Saint Louis, and Miami, with distances on the edges)
53. Sequential branch-and-bound
- Structure the entire search space as a tree, sorted using nearest-city-first heuristic
54. Pruning the search tree
- Keep track of best solution found so far (the bound)
- Cut off partial routes >= bound
(figure: subtree that can be pruned)
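A sequential sketch of the branch-and-bound in Python. The fixed start city (city 0) and the distance-matrix format are assumptions for this sketch, and the nearest-city-first ordering of the loop is omitted for brevity:

```python
def tsp(dist):
    """Branch-and-bound over all open routes starting at city 0,
    pruning partial routes whose length already reaches the bound."""
    n = len(dist)
    best = [float('inf')]              # the bound: best complete route so far

    def search(path, length):
        if length >= best[0]:
            return                     # prune: cannot beat the bound
        if len(path) == n:
            best[0] = length           # new best complete route
            return
        last = path[-1]
        for city in range(n):
            if city not in path:
                search(path + [city], length + dist[last][city])

    search([0], 0)
    return best[0]
```

With a good first solution (which is what the nearest-city-first ordering tries to produce), most of the tree is pruned early.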
55. Parallelizing TSP
- Distribute the search tree over the CPUs
- Results in reasonably large-grain jobs
(figure: subtrees assigned to CPU 1, CPU 2, CPU 3)
56. Distribution of the tree
- Static distribution: each CPU gets a fixed part of the tree
- Load imbalance: subtrees take different amounts of time
- Impossible to predict load imbalance statically (as for Gaussian)
(figure: search tree with distances on the edges)
57. Dynamic load balancing: Replicated Workers Model
- Master process generates large number of jobs (subtrees) and repeatedly hands them out
- Worker processes repeatedly get work and execute it
- Runtime overhead for fetching jobs dynamically
- Efficient for TSP because the jobs are large
(figure: master handing out jobs to workers)
58. Real search spaces are huge
- NP-complete problem -> exponential search space
- Master searches MAXHOPS levels, then creates jobs
- E.g., for 20 cities, MAXHOPS = 4 -> 20*19*18*17 (> 100,000) jobs, each searching 16 remaining cities
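The job count above is just a falling product over the first MAXHOPS levels of the tree; as a quick sanity check:

```python
def job_count(ncities, maxhops):
    """Number of partial paths of length `maxhops` over `ncities`
    cities: ncities * (ncities-1) * ... down to maxhops factors."""
    total = 1
    for level in range(maxhops):
        total *= ncities - level
    return total
```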
59. Parallel TSP Algorithm (1/3)
- process master (CPU 0):
  - generate-jobs();                   /* generate all jobs, start with empty path */
  - for (proc = 1; proc <= P; proc++)  /* inform workers we're done */
    - RECEIVE(proc, worker-id);        /* get work request */
    - SEND(proc, []);                  /* return empty path */
- generate-jobs(List path):
  - if (size(path) == MAXHOPS)         /* if path has MAXHOPS cities */
    - RECEIVE-FROM-ANY(worker-id);     /* wait for work request */
    - SEND(worker-id, path);           /* send partial route to worker */
  - else
    - for (city = 1; city <= NRCITIES; city++)           /* (should be ordered) */
      - if (city not on path) generate-jobs(path + city); /* append city */
60. Parallel TSP Algorithm (2/3)
- process worker (CPUs 1 .. P):
  - int Minimum = maxint;       /* length of current best path (bound) */
  - List path;
  - for (;;)
    - SEND(0, myprocid);        /* send work request to master */
    - RECEIVE(0, path);         /* get next job from master */
    - if (path == []) exit();   /* we're done */
    - tsp(path, length(path));  /* compute all subsequent paths */
61. Parallel TSP Algorithm (3/3)
- tsp(List path, int length):
  - if (NONBLOCKING_RECEIVE_FROM_ANY(m))  /* is there an update message? */
    - if (m < Minimum) Minimum = m;       /* update global minimum */
  - if (length >= Minimum) return;        /* not a shorter route */
  - if (size(path) == NRCITIES)           /* complete route? */
    - Minimum = length;                   /* update global minimum */
    - for (proc = 1; proc <= P; proc++)
      - if (proc != myprocid) SEND(proc, length);  /* broadcast it */
  - else
    - last = last(path);                  /* last city on the path */
    - for (city = 1; city <= NRCITIES; city++)     /* should be ordered */
      - if (city not on path) tsp(path + city, length + distance[last, city]);
62. Search overhead
(figure: search tree divided over CPU 1, CPU 2, CPU 3)
63. Search overhead
- Path <n m s> is started (in parallel) before the outcome (6) of <n c s m> is known, so it cannot be pruned
- The parallel algorithm therefore does more work than the sequential algorithm
- This is called search overhead
- It can occur in algorithms that do speculative work, like parallel search algorithms
- Can also have negative search overhead, resulting in superlinear speedups!
64. Performance of TSP
- Communication overhead (small)
  - Distribution of jobs + updating the global bound
  - Small number of messages
- Load imbalances
  - Small: the replicated workers model does automatic (dynamic) load balancing
- Search overhead
  - Main performance problem
65. Discussion
- Several kinds of performance overhead
  - Communication overhead: communication/computation ratio must be low
  - Load imbalance: all processors must do the same amount of work
  - Search overhead: avoid useless (speculative) computations
- Making algorithms correct is nontrivial
  - Message ordering
66. Designing Parallel Algorithms
- Source: Designing and Building Parallel Programs (Ian Foster, 1995)
  - (available on-line at http://www.mcs.anl.gov/dbpp)
- Partitioning
- Communication
- Agglomeration
- Mapping
67. Figure 2.1 from Foster's book
68. Partitioning
- Domain decomposition
  - Partition the data
  - Partition computations on data
  - Owner-computes rule
- Functional decomposition
  - Divide computations into subtasks
  - E.g. search algorithms
69. Communication
- Analyze data dependencies between partitions
- Use communication to transfer data
- Many forms of communication, e.g.:
  - Local communication with neighbors (SOR)
  - Global communication with all processors (ASP)
  - Synchronous (blocking) communication
  - Asynchronous (non-blocking) communication
70. Agglomeration
- Reduce communication overhead by
- increasing granularity
- improving locality
71. Mapping
- On which processor to execute each subtask?
- Put concurrent tasks on different CPUs
- Put frequently communicating tasks on same CPU?
- Avoid load imbalances
72. Summary
- Hardware and software models
- Example applications
  - Matrix multiplication: trivial parallelism (independent tasks)
  - Successive overrelaxation: neighbor communication
  - All-pairs shortest paths: broadcast communication
  - Linear equations: load-balancing problem
  - Traveling Salesman problem: search overhead
- Designing parallel algorithms