Introduction to Parallel Programming - PowerPoint PPT Presentation

Provided by: csVuNlba


1
Introduction to Parallel Programming
  • Language notation: message passing
  • Distributed-memory machine
    (e.g., workstations on a network)
  • 5 parallel algorithms of increasing complexity:
    • Matrix multiplication
    • Successive overrelaxation
    • All-pairs shortest paths
    • Linear equations
    • Traveling Salesman problem

2
Message Passing
  • SEND (destination, message)
    • blocking: wait until message has arrived (like a fax)
    • non-blocking: continue immediately (like a mailbox)
  • RECEIVE (source, message)
  • RECEIVE-FROM-ANY (message)
    • blocking: wait until message is available
    • non-blocking: test if message is available
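
The blocking and non-blocking RECEIVE variants can be imitated on a single machine with a thread-safe queue standing in for the network. This is only a sketch: the `Mailbox` class and its method names are hypothetical and are not part of the notation used on the slides.

```python
import queue


class Mailbox:
    """Hypothetical stand-in for one processor's incoming message buffer."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, message):
        # Non-blocking SEND: deposit the message and continue immediately.
        self._q.put(message)

    def receive(self):
        # Blocking RECEIVE: wait until a message is available.
        return self._q.get()

    def try_receive(self):
        # Non-blocking RECEIVE: only test whether a message is available.
        try:
            return True, self._q.get_nowait()
        except queue.Empty:
            return False, None


box = Mailbox()
empty_result = box.try_receive()   # nothing has been sent yet
box.send("row 1")
first = box.receive()              # returns at once, a message is waiting
```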

3
Syntax
  • Use pseudo-code with C-like syntax
  • Use indentation instead of { .. } to indicate
    block structure
  • Arrays can have user-defined index ranges
    • Default: start at 1
    • int A[10:100] runs from 10 to 100
    • int A[N] runs from 1 to N
  • Use array slices (sub-arrays)
    • A[i..j] = elements A[i] to A[j]
    • A[i, *] = elements A[i, 1] to A[i, N], i.e.
      row i of matrix A
    • A[*, k] = elements A[1, k] to A[N, k], i.e.
      column k of A

4
Parallel Matrix Multiplication
  • Given two N x N matrices A and B
  • Compute C = A x B
  • C[i,j] = A[i,1]*B[1,j] + A[i,2]*B[2,j] + .. + A[i,N]*B[N,j]

[Figure: matrices A, B and C]
5
Sequential Matrix Multiplication
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      C[i,j] = 0;
      for (k = 1; k <= N; k++)
        C[i,j] += A[i,k] * B[k,j];
  • The order of the operations is overspecified
  • Everything can be computed in parallel

6
Parallel Algorithm 1
  • Each processor computes 1 element of C
  • Requires N^2 processors
  • Each processor needs 1 row of A and 1 column of B

7
Structure
  • Master distributes the work and receives the
    results
  • Slaves get work and execute it
  • Slaves are numbered consecutively from 1 to P
  • How to start up master/slave processes depends on
    Operating System (not discussed here)

8
Parallel Algorithm 1
Master (processor 0):
  int proc = 1;
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      SEND(proc, A[i,*], B[*,j], i, j);
      proc++;
  for (x = 1; x <= N*N; x++)
    RECEIVE_FROM_ANY(result, i, j);
    C[i,j] = result;

Slaves (processors 1 .. P):
  int Aix[N], Bxj[N], Cij;
  RECEIVE(0, Aix, Bxj, i, j);
  Cij = 0;
  for (k = 1; k <= N; k++)
    Cij += Aix[k] * Bxj[k];
  SEND(0, Cij, i, j);
9
Efficiency (complexity analysis)
  • Each processor needs O(N) communication to do
    O(N) computations
  • Communication: 2N+1 integers = O(N)
  • Computation per processor: N multiplications/additions = O(N)
  • Exact communication/computation costs depend on
    network and CPU
  • Still: this algorithm is inefficient for any
    existing machine
  • Need to improve communication/computation ratio

10
Parallel Algorithm 2
  • Each processor computes 1 row (N elements) of C
  • Requires N processors
  • Need entire B matrix and 1 row of A as input

11
Structure
[Figure: the master sends A[i,*] and B[*,*] to slave i (i = 1 .. N) and receives C[i,*] back]
12
Parallel Algorithm 2
Master (processor 0):
  for (i = 1; i <= N; i++)
    SEND(i, A[i,*], B[*,*], i);
  for (x = 1; x <= N; x++)
    RECEIVE_FROM_ANY(result, i);
    C[i,*] = result[*];

Slaves:
  int Aix[N], B[N,N], C[N];
  RECEIVE(0, Aix, B, i);
  for (j = 1; j <= N; j++)
    C[j] = 0;
    for (k = 1; k <= N; k++)
      C[j] += Aix[k] * B[k,j];
  SEND(0, C[*], i);
13
Problem: need larger granularity
  • Each processor now needs O(N^2) communication and
    O(N^2) computation -> Still inefficient
  • Assumption: N >> P (i.e. we solve a large
    problem)
  • Assign many rows to each processor

14
Parallel Algorithm 3
  • Each processor computes N/P rows of C
  • Need entire B matrix and N/P rows of A as input
  • Each processor now needs O(N^2) communication and
    O(N^3 / P) computation

15
Parallel Algorithm 3 (master)
Master (processor 0):
  int result[N / P, N];
  int inc = N / P;    /* number of rows per cpu */
  int lb = 1;         /* lb = lower bound */
  for (i = 1; i <= P; i++)
    SEND(i, A[lb .. lb+inc-1, *], B[*,*], lb, lb+inc-1);
    lb += inc;
  for (x = 1; x <= P; x++)
    RECEIVE_FROM_ANY(result, lb);
    for (i = 1; i <= N / P; i++)
      C[lb+i-1, *] = result[i, *];

16
Parallel Algorithm 3 (slave)
Slaves:
  int A[N / P, N], B[N,N], C[N / P, N];
  RECEIVE(0, A, B, lb, ub);
  for (i = lb; i <= ub; i++)
    for (j = 1; j <= N; j++)
      C[i,j] = 0;
      for (k = 1; k <= N; k++)
        C[i,j] += A[i,k] * B[k,j];
  SEND(0, C[*,*], lb);
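
A minimal Python sketch of Algorithm 3, assuming P divides N and running the P slave jobs one after another in a single process rather than on separate machines. The names `slave_job` and `parallel_matmul_sim` are illustrative and do not appear on the slides; the plain function call stands in for the SEND/RECEIVE pair.

```python
def slave_job(a_rows, b):
    """Compute the rows of C that belong to the given rows of A."""
    n = len(b)
    return [[sum(row[k] * b[k][j] for k in range(n)) for j in range(n)]
            for row in a_rows]


def parallel_matmul_sim(a, b, p):
    """Simulate Algorithm 3: P jobs of N/P rows each, executed in turn."""
    n = len(a)
    assert n % p == 0, "this sketch assumes that P divides N"
    inc = n // p                       # number of rows per cpu
    c = [None] * n
    for proc in range(p):
        lb = proc * inc                # lower bound (0-based here)
        result = slave_job(a[lb:lb + inc], b)   # stands in for SEND + RECEIVE
        c[lb:lb + inc] = result        # master copies the rows into C
    return c


product = parallel_matmul_sim([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2)
```

Each job receives all of B but only N/P rows of A, which is why the amount of computation per transferred element grows with N/P.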
17
Comparison
Algorithm  Parallelism (jobs)  Communication per job  Computation per job  Ratio comp/comm
1          N^2                 N + N + 1              N                    O(1)
2          N                   N + N^2 + N            N^2                  O(1)
3          P                   N^2/P + N^2 + N^2/P    N^3/P                O(N/P)
  • If N >> P, algorithm 3 will have low
    communication overhead
  • Its grain size is high
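
The per-job costs of the three algorithms can be evaluated for concrete N and P. The sketch below only counts transferred elements and multiply-add steps; it ignores network latency and CPU speed, and it assumes P divides N. The function name `job_costs` is hypothetical.

```python
def job_costs(algorithm, n, p):
    """Return (communication, computation, computation/communication) per job."""
    if algorithm == 1:       # one element of C per job
        comm, comp = n + n + 1, n
    elif algorithm == 2:     # one row of C per job
        comm, comp = n + n * n + n, n * n
    elif algorithm == 3:     # N/P rows of C per job
        comm, comp = n * n // p + n * n + n * n // p, n ** 3 // p
    else:
        raise ValueError("algorithm must be 1, 2 or 3")
    return comm, comp, comp / comm


for alg in (1, 2, 3):
    print(alg, job_costs(alg, 1000, 10))
```

Only for algorithm 3 does the ratio grow with the problem size, in line with the O(N/P) entry.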

18
Example speedup graph
19
Discussion
  • Matrix multiplication is trivial to parallelize
  • Getting good performance is a problem
  • Need right grain size
  • Need large input problem

20
Successive Overrelaxation (SOR)
  • Iterative method for solving Laplace equations
  • Repeatedly updates elements of a grid

21
Successive Overrelaxation (SOR)
  float G[1:N, 1:M], Gnew[1:N, 1:M];
  for (step = 0; step < NSTEPS; step++)
    for (i = 2; i < N; i++)    /* update grid */
      for (j = 2; j < M; j++)
        Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;

22
SOR example
23
SOR example
24
Parallelizing SOR
  • Domain decomposition on the grid
  • Each processor owns N/P rows
  • Need communication between neighbors to exchange
    elements at processor boundaries

25
SOR example partitioning
26
SOR example partitioning
27
Communication scheme
  • Each CPU communicates with its left & right
    neighbor (if existing)

28
Parallel SOR
  float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
  for (step = 0; step < NSTEPS; step++)
    SEND(cpuid-1, G[lb]);         /* send 1st row left */
    SEND(cpuid+1, G[ub]);         /* send last row right */
    RECEIVE(cpuid-1, G[lb-1]);    /* receive from left */
    RECEIVE(cpuid+1, G[ub+1]);    /* receive from right */
    for (i = lb; i <= ub; i++)    /* update my rows */
      for (j = 2; j < M; j++)
        Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;
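
The row exchange can be checked on one machine by simulating the P partitions in turn and comparing against the sequential loop. Everything specific here is an assumption: the slides leave `f` open, so a plain four-neighbour average is used, P is assumed to divide the number of interior rows, and the names `sequential` and `partitioned` are hypothetical.

```python
def f(center, up, down, left, right):
    # Assumed update rule; the slides do not define f.
    return (up + down + left + right) / 4.0


def sequential(g, nsteps):
    n, m = len(g), len(g[0])
    for _ in range(nsteps):
        gnew = [row[:] for row in g]
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                gnew[i][j] = f(g[i][j], g[i-1][j], g[i+1][j], g[i][j-1], g[i][j+1])
        g = gnew
    return g


def partitioned(g, nsteps, p):
    n, m = len(g), len(g[0])
    inc = (n - 2) // p                 # interior rows per simulated CPU
    assert inc * p == n - 2, "sketch assumes P divides the interior rows"
    cpus = []
    for c in range(p):
        lb = 1 + c * inc
        ub = lb + inc - 1
        # Own rows lb..ub plus one extra boundary row on each side.
        cpus.append({"lb": lb, "ub": ub,
                     "g": {i: g[i][:] for i in range(lb - 1, ub + 2)}})
    for _ in range(nsteps):
        # Exchange phase: first row goes left, last row goes right.
        for c, cpu in enumerate(cpus):
            if c > 0:
                cpus[c - 1]["g"][cpu["lb"]] = cpu["g"][cpu["lb"]][:]
            if c < p - 1:
                cpus[c + 1]["g"][cpu["ub"]] = cpu["g"][cpu["ub"]][:]
        # Compute phase: every CPU updates only its own rows.
        for cpu in cpus:
            old = cpu["g"]
            new = {i: row[:] for i, row in old.items()}
            for i in range(cpu["lb"], cpu["ub"] + 1):
                for j in range(1, m - 1):
                    new[i][j] = f(old[i][j], old[i-1][j], old[i+1][j],
                                  old[i][j-1], old[i][j+1])
            cpu["g"] = new
    result = [row[:] for row in g]
    for cpu in cpus:
        for i in range(cpu["lb"], cpu["ub"] + 1):
            result[i] = cpu["g"][i]
    return result


grid = [[float(i * i + j) for j in range(4)] for i in range(6)]
same = partitioned(grid, 5, 2) == sequential(grid, 5)
```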

29
Performance of SOR
  • Communication and computation during each
    iteration
  • Each CPU sends/receives 2 messages with M reals
  • Each CPU computes N/P * M updates
  • The algorithm will have good performance if
    • Problem size is large: N >> P
  • Message exchanges can be done in parallel

30
All-pairs Shortest Paths (ASP)
  • Given a graph G with a distance table C
  • C[i,j] = length of direct path from node
    i to node j
  • Compute length of shortest path between any two
    nodes in G

31
Floyd's Sequential Algorithm
  • Basic step:
  for (k = 1; k <= N; k++)
    for (i = 1; i <= N; i++)
      for (j = 1; j <= N; j++)
        C[i,j] = MIN(C[i,j], C[i,k] + C[k,j]);
  • During iteration k, you can visit only
    intermediate nodes in the set {1 .. k}
  • k=0 => initial problem, no intermediate nodes
  • k=N => final solution
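
The basic step translates directly into Python. The sketch uses 0-based indices and `float("inf")` for a missing direct path; both choices, and the small example graph, are assumptions made for illustration.

```python
INF = float("inf")


def floyd(c):
    """Return the table of shortest path lengths for distance table c."""
    n = len(c)
    c = [row[:] for row in c]          # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] = min(c[i][j], c[i][k] + c[k][j])
    return c


# Directed example graph: 0 -> 1 (3), 1 -> 2 (1), 2 -> 0 (2)
dist = floyd([[0, 3, INF],
              [INF, 0, 1],
              [2, INF, 0]])
```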

32
Parallelizing ASP
  • Distribute rows of C among the P processors
  • During iteration k, each processor executes
      C[i,j] = MIN(C[i,j], C[i,k] + C[k,j])
    on its own rows i, so it needs these rows and
    row k
  • Before iteration k, the processor owning row k
    sends it to all the others

33-35
[Figures: matrix C with row i, column j and pivot row and column k marked; the dots show the elements of row i being updated with the help of row k]
36
Parallel ASP Algorithm
  int lb, ub;                   /* lower/upper bound for this CPU */
  int rowK[N], C[lb:ub, N];     /* pivot row; matrix */
  for (k = 1; k <= N; k++)
    if (k >= lb && k <= ub)     /* do I have it? */
      rowK = C[k,*];
      for (proc = 1; proc <= P; proc++)    /* broadcast row */
        if (proc != myprocid) SEND(proc, rowK);
    else
      RECEIVE_FROM_ANY(rowK);   /* receive row */
    for (i = lb; i <= ub; i++)  /* update my rows */
      for (j = 1; j <= N; j++)
        C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);
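
A single-process sketch of the row-distributed algorithm, assuming P divides N. Copying row k once per iteration plays the role of the broadcast, so messages can never arrive out of order here. The name `parallel_asp_sim` is hypothetical.

```python
INF = float("inf")


def parallel_asp_sim(c, p):
    """Simulate row-wise distributed Floyd with P CPUs in one process."""
    n = len(c)
    assert n % p == 0, "this sketch assumes that P divides N"
    inc = n // p
    # local[q] holds the rows owned by simulated CPU q.
    local = [[row[:] for row in c[q * inc:(q + 1) * inc]] for q in range(p)]
    for k in range(n):
        owner = k // inc
        row_k = local[owner][k - owner * inc][:]   # "broadcast" of the pivot row
        for q in range(p):                         # every CPU updates its own rows
            for row in local[q]:
                for j in range(n):
                    row[j] = min(row[j], row[k] + row_k[j])
    return [row for block in local for row in block]


table = parallel_asp_sim([[0, 3, INF],
                          [INF, 0, 1],
                          [2, INF, 0]], 3)
```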

37
Performance Analysis ASP
  • Per iteration
  • 1 CPU sends P -1 messages with N integers
  • Each CPU does N/P x N comparisons
  • Communication/computation ratio is small if N
    >> P

38
  • ... but, is the Algorithm Correct?

39
Parallel ASP Algorithm
  int lb, ub;                   /* lower/upper bound for this CPU */
  int rowK[N], C[lb:ub, N];     /* pivot row; matrix */
  for (k = 1; k <= N; k++)
    if (k >= lb && k <= ub)     /* do I have it? */
      rowK = C[k,*];
      for (proc = 1; proc <= P; proc++)    /* broadcast row */
        if (proc != myprocid) SEND(proc, rowK);
    else
      RECEIVE_FROM_ANY(rowK);   /* receive row */
    for (i = lb; i <= ub; i++)  /* update my rows */
      for (j = 1; j <= N; j++)
        C[i,j] = MIN(C[i,j], C[i,k] + rowK[j]);

40
Non-FIFO Message Ordering
  • Row 2 may be received before row 1

41
FIFO Ordering
  • Row 5 may be received before row 4

42
Correctness
  • Problems:
    • Asynchronous non-FIFO SEND
    • Messages from different senders may overtake each
      other
  • Solution is to use a combination of:
    • Synchronous SEND (less efficient)
    • Barrier at the end of outer loop (extra
      communication)
    • Order incoming messages (requires buffering)
    • RECEIVE (cpu, msg) (more complicated)
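
The "order incoming messages" option can be sketched as a small reorder buffer. It assumes that every broadcast message is tagged with its iteration number k, which the SEND calls on the slides do not show; `consume_in_order` is a hypothetical helper.

```python
def consume_in_order(arrivals):
    """Process (k, row) messages in order k = 1, 2, 3, ... regardless of arrival order."""
    pending = {}       # rows that arrived too early, keyed by iteration number
    expected = 1       # next iteration whose pivot row is needed
    processed = []
    for k, row in arrivals:
        pending[k] = row
        # Release every buffered row that is now in sequence.
        while expected in pending:
            processed.append((expected, pending.pop(expected)))
            expected += 1
    return processed


# Row 2 overtakes row 1 on the network, yet it is used second.
order = consume_in_order([(2, "row 2"), (1, "row 1"), (3, "row 3")])
```

The buffer costs memory for every early row, which is the price named on the slide.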


44
Linear equations
  • Linear equations:
      a1,1 x1 + a1,2 x2 + ... + a1,n xn = b1
      ...
      an,1 x1 + an,2 x2 + ... + an,n xn = bn
  • Matrix notation: Ax = b
  • Problem: compute x, given A and b
  • Linear equations have many important applications
  • Practical applications need huge sets of
    equations

45
Solving a linear equation
  • Two phases
  • Upper-triangularization -> U x = y
  • Back-substitution -> x
  • Most computation time is in upper-triangularization
  • Upper-triangular matrix:
    • U[i,i] = 1
    • U[i,j] = 0 if i > j

1 . . . . . . .
0 1 . . . . . .
0 0 1 . . . . .
0 0 0 1 . . . .
0 0 0 0 1 . . .
0 0 0 0 0 1 . .
0 0 0 0 0 0 1 .
0 0 0 0 0 0 0 1
46
Sequential Gaussian elimination
  • Converts Ax = b into Ux = y
  • Sequential algorithm uses 2/3 N^3 operations
  for (k = 1; k <= N; k++)
    for (j = k+1; j <= N; j++)
      A[k,j] = A[k,j] / A[k,k];
    y[k] = b[k] / A[k,k];
    A[k,k] = 1;
    for (i = k+1; i <= N; i++)
      for (j = k+1; j <= N; j++)
        A[i,j] = A[i,j] - A[i,k] * A[k,j];
      b[i] = b[i] - A[i,k] * y[k];
      A[i,k] = 0;

1 . . . . . . .
0 . . . . . . .
0 . . . . . . .
0 . . . . . . .
A
y
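
Both phases in Python, as a sketch. Like the pseudo-code, it does no pivoting, so it assumes that no zero appears on the diagonal during elimination. The back-substitution routine is not shown on the slides and is added here as an assumption about how the second phase would look for a unit-diagonal U.

```python
def upper_triangularize(a, b):
    """Convert Ax = b into Ux = y with a unit diagonal (no pivoting)."""
    n = len(a)
    a = [row[:] for row in a]          # work on copies
    b = b[:]
    y = [0.0] * n
    for k in range(n):
        for j in range(k + 1, n):
            a[k][j] = a[k][j] / a[k][k]
        y[k] = b[k] / a[k][k]
        a[k][k] = 1.0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] = a[i][j] - a[i][k] * a[k][j]
            b[i] = b[i] - a[i][k] * y[k]
            a[i][k] = 0.0
    return a, y


def back_substitute(u, y):
    """Solve Ux = y for upper-triangular U with ones on the diagonal."""
    n = len(u)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))
    return x


# 2x + y = 3 and x + 3y = 5
u, y = upper_triangularize([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
x = back_substitute(u, y)
```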
47
Parallelizing Gaussian elimination
  • Row-wise partitioning scheme
  • Each cpu gets one row (striping)
  • Execute one (outer-loop) iteration at a time
  • Communication requirement:
    • During iteration k, cpus P[k+1] .. P[n-1] need part
      of row k
    • This row is stored on CPU P[k]
    • -> need partial broadcast (multicast)

48
Communication
49
Performance problems
  • Communication overhead (multicast)
  • Load imbalance
  • CPUs P[0] .. P[k] are idle during iteration k
  • Bad load balance means bad speedups, as
    some CPUs have too much work
  • In general, number of CPUs is less than n
  • Choice between block-striped & cyclic-striped
    distribution
  • Block-striped distribution has high
    load-imbalance
  • Cyclic-striped distribution has less
    load-imbalance

50
Block-striped distribution
  • CPU 0 gets first N/2 rows
  • CPU 1 gets last N/2 rows
  • CPU 0 has much less work to do
  • CPU 1 becomes the bottleneck

51
Cyclic-striped distribution
  • CPU 0 gets odd rows
  • CPU 1 gets even rows
  • CPU 0 and 1 have more or less the same amount of
    work
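
The difference between the two distributions can be made concrete by counting element updates per CPU. The cost model is an assumption derived from the sequential loop: in iteration k every row i > k receives N - k element updates. The names `row_work` and `cpu_loads` are hypothetical.

```python
def row_work(i, n):
    """Element updates applied to row i (1-based) over all iterations k < i."""
    return sum(n - k for k in range(1, i))


def cpu_loads(n, p, scheme):
    """Total update work per CPU for a 'block' or 'cyclic' row distribution."""
    loads = [0] * p
    for i in range(1, n + 1):
        if scheme == "block":
            owner = (i - 1) * p // n   # consecutive rows go to the same CPU
        elif scheme == "cyclic":
            owner = (i - 1) % p        # rows are dealt out round-robin
        else:
            raise ValueError("scheme must be 'block' or 'cyclic'")
        loads[owner] += row_work(i, n)
    return loads


block = cpu_loads(8, 2, "block")
cyclic = cpu_loads(8, 2, "cyclic")
```

Under this model the CPU holding the last rows of a block-striped matrix carries most of the work, while the cyclic distribution spreads late rows over all CPUs.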

52
Traveling Salesman Problem (TSP)
  • Find shortest route for salesman among given set
    of cities (NP-hard problem)
  • Each city must be visited once, no return to
    initial city

[Figure: example graph with the cities New York, Chicago, Saint Louis and Miami and edge lengths between 1 and 4]
53
Sequential branch-and-bound
  • Structure the entire search space as a tree,
    sorted using nearest-city first heuristic

54
Pruning the search tree
  • Keep track of best solution found so far (the
    bound)
  • Cut off partial routes whose length is >= the bound

[Figure: search tree in which the marked subtrees can be pruned]
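
A sequential branch-and-bound sketch with nearest-city-first ordering and pruning against the bound. The distance matrix, the fixed start city and the function name are made up for illustration; as on the slides, the route does not return to the initial city.

```python
def tsp_branch_and_bound(dist, start=0):
    """Return (length, path) of the shortest route that visits every city once."""
    n = len(dist)
    best = {"length": float("inf"), "path": None}

    def search(path, length):
        if length >= best["length"]:
            return                     # prune: cannot beat the bound
        if len(path) == n:
            best["length"], best["path"] = length, path   # new bound
            return
        last = path[-1]
        # Nearest-city-first heuristic: try close cities before distant ones.
        candidates = sorted((c for c in range(n) if c not in path),
                            key=lambda c: dist[last][c])
        for city in candidates:
            search(path + [city], length + dist[last][city])

    search([start], 0)
    return best["length"], best["path"]


distances = [[0, 2, 9, 10],
             [2, 0, 6, 4],
             [9, 6, 0, 3],
             [10, 4, 3, 0]]
shortest = tsp_branch_and_bound(distances)
```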
55
Parallelizing TSP
  • Distribute the search tree over the CPUs
  • Results in reasonably large-grain jobs

[Figure: search tree with its subtrees divided over CPU 1, CPU 2 and CPU 3]
56
Distribution of the tree
  • Static distribution: each CPU gets a fixed part of
    the tree
  • Load imbalance: subtrees take different amounts
    of time
  • Impossible to predict the load imbalance statically
    (as could be done for Gaussian elimination)

57
Dynamic load balancing: Replicated Workers Model
  • Master process generates large number of jobs
    (subtrees) and repeatedly hands them out
  • Worker processes repeatedly get work and execute
    it
  • Runtime overhead for fetching jobs dynamically
  • Efficient for TSP because the jobs are large

[Figure: master handing out jobs to the workers]
58
Real search spaces are huge
  • NP-complete problem -> exponential search space
  • Master searches MAXHOPS levels, then creates jobs
  • E.g. for 20 cities & MAXHOPS=4 -> 20*19*18*17
    (>100,000) jobs, each searching 16 remaining
    cities
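
The job count is the number of ordered selections of MAXHOPS distinct cities. A short check, assuming Python 3.8 or later for `math.perm`; the function name `number_of_jobs` is hypothetical.

```python
import math


def number_of_jobs(nr_cities, maxhops):
    """Partial routes of MAXHOPS distinct cities, i.e. jobs created by the master."""
    return math.perm(nr_cities, maxhops)


jobs = number_of_jobs(20, 4)       # 20 * 19 * 18 * 17
remaining = 20 - 4                 # cities left for each worker to search
```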






59
Parallel TSP Algorithm (1/3)
process master (CPU 0):
  generate-jobs([]);                   /* generate all jobs, start with empty path */
  for (proc = 1; proc <= P; proc++)    /* inform workers we're done */
    RECEIVE(proc, worker-id);          /* get work request */
    SEND(proc, []);                    /* return empty path */

generate-jobs(List path)
  if (size(path) == MAXHOPS)           /* if path has MAXHOPS cities */
    RECEIVE-FROM-ANY(worker-id);       /* wait for work request */
    SEND(worker-id, path);             /* send partial route to worker */
  else
    for (city = 1; city <= NRCITIES; city++)    /* (should be ordered) */
      if (city not on path) generate-jobs(path||city);    /* append city */

60
Parallel TSP Algorithm (2/3)
process worker (CPUs 1..P):
  int Minimum = maxint;         /* Length of current best path (bound) */
  List path;
  for (;;)
    SEND(0, myprocid);          /* send work request to master */
    RECEIVE(0, path);           /* get next job from master */
    if (path == []) exit();     /* we're done */
    tsp(path, length(path));    /* compute all subsequent paths */

61
Parallel TSP Algorithm (3/3)
tsp(List path, int length)
  if (NONBLOCKING_RECEIVE_FROM_ANY(m))    /* is there an update message? */
    if (m < Minimum) Minimum = m;         /* update global minimum */
  if (length >= Minimum) return;          /* not a shorter route */
  if (size(path) == NRCITIES)             /* complete route? */
    Minimum = length;                     /* update global minimum */
    for (proc = 1; proc <= P; proc++)
      if (proc != myprocid) SEND(proc, length);    /* broadcast it */
  else
    last = last(path);                    /* last city on the path */
    for (city = 1; city <= NRCITIES; city++)       /* should be ordered */
      if (city not on path) tsp(path||city, length + distance[last, city]);
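
A thread-based sketch of the replicated workers model. It departs from the pseudo-code in two labeled ways: the master puts all jobs in a shared queue up front instead of answering individual work requests, and the bound lives in a shared variable instead of being broadcast. The function name, the `None` end marker and the distance matrix are assumptions.

```python
import queue
import threading


def parallel_tsp(dist, nworkers=3, maxhops=2):
    """Return the length of the shortest route visiting every city once."""
    n = len(dist)
    jobs = queue.Queue()
    lock = threading.Lock()
    best = {"length": float("inf")}    # shared bound

    def generate_jobs(path):
        if len(path) == maxhops:
            jobs.put(path)             # one job per partial route
        else:
            for city in range(n):
                if city not in path:
                    generate_jobs(path + [city])

    def path_length(path):
        return sum(dist[a][b] for a, b in zip(path, path[1:]))

    def tsp(path, length):
        if length >= best["length"]:   # a stale read only costs extra search
            return
        if len(path) == n:
            with lock:
                if length < best["length"]:
                    best["length"] = length
            return
        last = path[-1]
        for city in range(n):
            if city not in path:
                tsp(path + [city], length + dist[last][city])

    def worker():
        while True:
            path = jobs.get()          # get next job
            if path is None:           # end marker: we're done
                return
            tsp(path, path_length(path))

    generate_jobs([])
    for _ in range(nworkers):
        jobs.put(None)
    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return best["length"]


distances = [[0, 2, 9, 10],
             [2, 0, 6, 4],
             [9, 6, 0, 3],
             [10, 4, 3, 0]]
shortest_length = parallel_tsp(distances)
```

Because a worker may read an outdated bound, it can explore routes that a sequential search would have pruned; the result stays the same, only the amount of work differs.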

62
Search overhead
[Figure: search tree divided over CPU 1, CPU 2 and CPU 3]
63
Search overhead
  • Path <n m s> is started (in parallel) before
    the outcome (6) of <n c s m> is known, so
    it cannot be pruned
  • The parallel algorithm therefore does more work
    than the sequential algorithm
  • This is called search overhead
  • It can occur in algorithms that do speculative
    work, like parallel search algorithms
  • Can also have negative search overhead, resulting
    in superlinear speedups!

64
Performance of TSP
  • Communication overhead (small)
    • Distribution of jobs + updating the global bound
    • Small number of messages
  • Load imbalances
    • Small: does automatic (dynamic) load balancing
  • Search overhead
    • Main performance problem

65
Discussion
  • Several kinds of performance overhead
  • Communication overhead
  • communication/computation ratio must be low
  • Load imbalance
  • all processors must do same amount of work
  • Search overhead
  • avoid useless (speculative) computations
  • Making algorithms correct is nontrivial
  • Message ordering

66
Designing Parallel Algorithms
  • Source: Designing and Building Parallel Programs
    (Ian Foster, 1995)
  • (available on-line at http://www.mcs.anl.gov/dbpp)
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping

67
Figure 2.1 from Foster's book
68
Partitioning
  • Domain decomposition
  • Partition the data
  • Partition computations on data
  • owner-computes rule
  • Functional decomposition
  • Divide computations into subtasks
  • E.g. search algorithms

69
Communication
  • Analyze data-dependencies between partitions
  • Use communication to transfer data
  • Many forms of communication, e.g.
  • Local communication with neighbors (SOR)
  • Global communication with all processors (ASP)
  • Synchronous (blocking) communication
  • Asynchronous (non blocking) communication

70
Agglomeration
  • Reduce communication overhead by
  • increasing granularity
  • improving locality

71
Mapping
  • On which processor to execute each subtask?
  • Put concurrent tasks on different CPUs
  • Put frequently communicating tasks on same CPU?
  • Avoid load imbalances

72
Summary
  • Hardware and software models
  • Example applications
  • Matrix multiplication - Trivial parallelism
    (independent tasks)
  • Successive overrelaxation - Neighbor
    communication
  • All-pairs shortest paths - Broadcast
    communication
  • Linear equations - Load balancing problem
  • Traveling Salesman problem - Search overhead
  • Designing parallel algorithms