Title: ECE 1747H: Parallel Programming
1. ECE 1747H Parallel Programming
- Lecture 2-3: More on parallelism and dependences; synchronization
2. Synchronization
- All programming models give the user the ability to control the ordering of events on different processors.
- This facility is called synchronization.
3. Example 1
- f() { a = 1; b = 2; c = 3; }
- g() { d = 4; e = 5; f = 6; }
- main() { f(); g(); }
- No dependences between f and g.
- Thus, f and g can be run in parallel.
4. Example 2
- f() { a = 1; b = 2; c = 3; }
- g() { a = 4; b = 5; c = 6; }
- main() { f(); g(); }
- Dependences between f and g.
- Thus, f and g cannot be run in parallel.
5. Example 2 (continued)
- f() { a = 1; b = 2; c = 3; }
- g() { a = 4; b = 5; c = 6; }
- main() { f(); g(); }
- Dependences are between the assignments to a, the assignments to b, and the assignments to c.
- No other dependences.
- Therefore, we only need to enforce these dependences.
6. Synchronization Facility
- Suppose we had a set of primitives, signal(x) and wait(x).
- wait(x) blocks unless a signal(x) has occurred.
- signal(x) does not block, but causes a pending wait(x) to unblock, or causes a future wait(x) not to block.
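As a rough illustration, such signal/wait primitives can be built in C on a mutex and condition variable. The sketch below is one possible realization, not a standard library API; the names event_t, event_init, event_signal, and event_wait are hypothetical.

    #include <pthread.h>
    #include <stdbool.h>

    /* A hypothetical event object implementing signal(x)/wait(x). */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            signaled;   /* remembers that a signal() already happened */
    } event_t;

    void event_init(event_t *e) {
        pthread_mutex_init(&e->lock, NULL);
        pthread_cond_init(&e->cond, NULL);
        e->signaled = false;
    }

    /* signal(x): never blocks; wakes current waiters or satisfies a future wait. */
    void event_signal(event_t *e) {
        pthread_mutex_lock(&e->lock);
        e->signaled = true;
        pthread_cond_broadcast(&e->cond);
        pthread_mutex_unlock(&e->lock);
    }

    /* wait(x): blocks unless/until a signal(x) has occurred. */
    void event_wait(event_t *e) {
        pthread_mutex_lock(&e->lock);
        while (!e->signaled)
            pthread_cond_wait(&e->cond, &e->lock);
        pthread_mutex_unlock(&e->lock);
    }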
7. Example 2 (continued)
- f() { a = 1; b = 2; c = 3; }
- g() { a = 4; b = 5; c = 6; }
- main() { f(); g(); }
- f() { a = 1; signal(e_a); b = 2; signal(e_b); c = 3; signal(e_c); }
- g() { wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; }
- main() { f(); g(); }
8. Example 2 (continued)
- Interleaved execution (time flows downward; f on the left, g on the right):
-   a = 1          wait(e_a)
-   signal(e_a)
-   b = 2          a = 4
-   signal(e_b)    wait(e_b)
-   c = 3          b = 5
-   signal(e_c)    wait(e_c)
-                  c = 6
- Execution is (mostly) parallel and correct.
- Dependences are covered by synchronization.
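A minimal pthreads version of this example, reusing the hypothetical event_t sketch above; the slide's f and g become thread bodies (renamed f_thread and g_thread to fit the pthread signature), and the three events e_a, e_b, e_c cover the three dependences.

    #include <pthread.h>

    int a, b, c;
    event_t e_a, e_b, e_c;               /* hypothetical events from the sketch above */

    void *f_thread(void *arg) {
        a = 1; event_signal(&e_a);
        b = 2; event_signal(&e_b);
        c = 3; event_signal(&e_c);
        return NULL;
    }

    void *g_thread(void *arg) {
        event_wait(&e_a); a = 4;
        event_wait(&e_b); b = 5;
        event_wait(&e_c); c = 6;
        return NULL;
    }

    int main(void) {
        pthread_t tf, tg;
        event_init(&e_a); event_init(&e_b); event_init(&e_c);
        pthread_create(&tf, NULL, f_thread, NULL);
        pthread_create(&tg, NULL, g_thread, NULL);
        pthread_join(tf, NULL);
        pthread_join(tg, NULL);
        return 0;
    }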
9. About synchronization
- Synchronization is necessary to make some programs execute correctly in parallel.
- However, synchronization is expensive.
- Therefore, it needs to be reduced, or sometimes we need to give up on parallelism.
10. Example 3
- f() { a = 1; b = 2; c = 3; }
- g() { d = 4; e = 5; a = 6; }
- main() { f(); g(); }
- The only dependence is on a, so a single event suffices:
- f() { a = 1; signal(e_a); b = 2; c = 3; }
- g() { d = 4; e = 5; wait(e_a); a = 6; }
- main() { f(); g(); }
11. Example 4
- for( i = 1; i < 100; i++ ) {
-   a[i] = ...;
-   ...
-   ... = a[i-1];
-   ...
- }
- Loop-carried dependence, not parallelizable.
12. Example 4 (continued)
- for( i = ...; i < ...; i++ ) {
-   a[i] = ...;
-   signal(e_a[i]);
-   ...
-   wait(e_a[i-1]);
-   ... = a[i-1];
-   ...
- }
13. Example 4 (continued)
- Note that here it matters which iterations are assigned to which processor.
- It does not matter for correctness, but it matters for performance.
- Cyclic assignment is probably best.
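A minimal sketch of a cyclic assignment for Example 4, assuming P worker threads and the hypothetical event primitives from earlier; thread p takes iterations 1+p, 1+p+P, 1+p+2P, and so on. The placeholder computations stand in for the slide's "...".

    #include <pthread.h>

    #define N 100
    #define P 4                          /* assumed number of worker threads */

    double a[N];
    event_t e_a[N];                      /* one hypothetical event per element of a */

    /* Worker p runs iterations 1+p, 1+p+P, 1+p+2P, ... (cyclic assignment). */
    void *worker(void *arg) {
        int p = *(int *)arg;
        for (int i = 1 + p; i < N; i += P) {
            a[i] = (double)i;            /* placeholder for the write covered by signal(e_a[i]) */
            event_signal(&e_a[i]);
            event_wait(&e_a[i - 1]);     /* covers the read of a[i-1] below */
            /* ... = a[i-1]; */
        }
        return NULL;
    }

    /* Setup: a[0] is initialized before the loop starts, so event_signal(&e_a[0])
       is called once up front; then the P workers are created and joined.        */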
14. Example 5
- for( i = 0; i < 100; i++ ) a[i] = f(i);
- x = g(a);
- for( i = 0; i < 100; i++ ) b[i] = x + h( a[i] );
- First loop can be run in parallel.
- Middle statement is sequential.
- Second loop can be run in parallel.
15. Example 5 (continued)
- We will need to make parallel execution stop after the first loop and resume at the beginning of the second loop.
- Two (standard) ways of doing that:
- fork() / join()
- barrier synchronization
16. Fork-Join Synchronization
- fork() causes a number of processes to be created and to be run in parallel.
- join() causes all these processes to wait until all of them have executed a join().
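One way to realize this fork()/join() model in C is to spawn a team of worker threads and then join them all. The sketch below assumes a fixed worker count P; the helper names fork_workers and join_workers are hypothetical, not part of any standard library.

    #include <pthread.h>

    #define P 4                              /* assumed number of parallel workers */

    static pthread_t team[P];

    /* fork(): create P workers, each running body(&worker_id). */
    void fork_workers(void *(*body)(void *)) {
        static int ids[P];
        for (int p = 0; p < P; p++) {
            ids[p] = p;
            pthread_create(&team[p], NULL, body, &ids[p]);
        }
    }

    /* join(): wait until every worker has finished. */
    void join_workers(void) {
        for (int p = 0; p < P; p++)
            pthread_join(team[p], NULL);
    }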
17. Example 5 (continued)
- fork();
- for( i = ...; i < ...; i++ ) a[i] = f(i);
- join();
- x = g(a);
- fork();
- for( i = ...; i < ...; i++ ) b[i] = x + h( a[i] );
- join();
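For comparison, a sketch of Example 5 in OpenMP, where each parallel for performs the fork and the implicit barrier at its end performs the join. The prototypes for f, g, and h are assumptions (the slides leave them abstract), and the '+' combining x and h(a[i]) follows the reconstruction above.

    #include <omp.h>

    double f(int);                       /* assumed to be defined elsewhere */
    double g(double *);
    double h(double);

    double a[100], b[100], x;

    void example5(void) {
        /* fork: iterations split across threads; join: implicit barrier at loop end */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = f(i);

        x = g(a);                        /* sequential middle statement */

        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            b[i] = x + h(a[i]);
    }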
18. Example 6
- sum = 0.0;
- for( i = 0; i < 100; i++ ) sum += a[i];
- Iterations have a dependence on sum.
- Cannot be parallelized, but ...
19. Example 6 (continued)
- for( k = 0; k < ...; k++ ) sum[k] = 0.0;
- fork();
- for( j = ...; j < ...; j++ ) sum[k] += a[j];
- join();
- sum = 0.0;
- for( k = 0; k < ...; k++ ) sum += sum[k];
20. Reduction
- This pattern is very common.
- Many parallel programming systems have explicit support for it, called reduction.
- sum = reduce( +, a, 0, 100 );
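In OpenMP, for instance, this pattern is expressed with a reduction clause rather than hand-written partial sums; a minimal sketch:

    #include <omp.h>

    double parallel_sum(const double *a, int n) {
        double sum = 0.0;
        /* each thread keeps a private partial sum; OpenMP combines them with + */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }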
21. Final word on synchronization
- Many different synchronization constructs exist in different programming models.
- Dependences have to be covered by appropriate synchronization.
- Synchronization is often expensive.
22. ECE 1747H Parallel Programming
- Lecture 2-3: Data Parallelism
23. Previously
- Ordering of statements.
- Dependences.
- Parallelism.
- Synchronization.
24. Goal of next few lectures
- Standard patterns of parallel programs.
- Examples of each.
- Later, code examples in various programming models.
25. Flavors of Parallelism
- Data parallelism: all processors do the same thing on different data.
-   Regular
-   Irregular
- Task parallelism: processors do different tasks.
-   Task queue
-   Pipelines
26. Data Parallelism
- Essential idea: each processor works on a different part of the data (usually in one or more arrays).
- Regular or irregular data parallelism: using linear or non-linear indexing.
- Examples: MM (regular), SOR (regular), MD (irregular).
27. Matrix Multiplication
- Multiplication of two n by n matrices A and B into a third n by n matrix C.
28. Matrix Multiply
- for( i = 0; i < n; i++ )
-   for( j = 0; j < n; j++ )
-     c[i][j] = 0.0;
- for( i = 0; i < n; i++ )
-   for( j = 0; j < n; j++ )
-     for( k = 0; k < n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
29. Parallel Matrix Multiply
- No loop-carried dependences in i- or j-loop.
- Loop-carried dependence on k-loop.
- All i- and j-iterations can be run in parallel.
30. Parallel Matrix Multiply (contd.)
- If we have P processors, we can give n/P rows or columns to each processor.
- Or, we can divide the matrix into P squares, and give each processor one square.
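A minimal OpenMP sketch that parallelizes the i-loop, which corresponds to giving blocks of rows of C to each processor; the matrices are assumed to be passed in as arrays of row pointers.

    #include <omp.h>

    /* Parallel matrix multiply: each thread computes a block of rows of c. */
    void matmul(int n, double **a, double **b, double **c) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];   /* reduction along k stays sequential */
            }
        }
    }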
31-36. Data Distribution Examples
- (Figures: different ways of distributing the matrix across processors.)
- BLOCK DISTRIBUTION BY ROW
- BLOCK DISTRIBUTION BY COLUMN
- CYCLIC DISTRIBUTION BY COLUMN
37. SOR
- SOR implements a mathematical model for many natural phenomena, e.g., heat dissipation in a metal sheet.
- The model is a partial differential equation.
- The focus is on the algorithm, not on the derivation.
38. Problem Statement
- (Figure: a rectangle in the x-y plane; inside it F satisfies ∇²F(x,y) = 0, with boundary condition F = 1 on one edge and F = 0 on the other three edges.)
39. Discretization
- Represent F in the continuous rectangle by a 2-dimensional discrete grid (array).
- The boundary conditions on the rectangle are the boundary values of the array.
- The internal values are found by the relaxation algorithm.
40. Discretized Problem Statement
- (Figure: the discretized 2-D grid, with row index i and column index j.)
41. Relaxation Algorithm
- For some number of iterations:
-   for each internal grid point:
-     compute the average of its four neighbors
- Termination condition:
-   values at grid points change very little
-   (we will ignore this part in our example)
42. Discretized Problem Statement
- for some number of timesteps/iterations {
-   for( i = 1; i < n; i++ )
-     for( j = 1; j < n; j++ )
-       temp[i][j] = 0.25 *
-         ( grid[i-1][j] + grid[i+1][j] +
-           grid[i][j-1] + grid[i][j+1] );
-   for( i = 1; i < n; i++ )
-     for( j = 1; j < n; j++ )
-       grid[i][j] = temp[i][j];
- }
43. Parallel SOR
- No dependences between iterations of the first (i,j) loop nest.
- No dependences between iterations of the second (i,j) loop nest.
- Anti-dependence between the first and second loop nest within the same timestep.
- True dependence between the second loop nest and the first loop nest of the next timestep.
44. Parallel SOR (continued)
- First (i,j) loop nest can be parallelized.
- Second (i,j) loop nest can be parallelized.
- We must make processors wait at the end of each (i,j) loop nest.
- Natural synchronization: fork-join.
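A minimal OpenMP sketch of the SOR timestep loop, assuming grid, temp, the grid dimension n, and the step count are supplied by the caller; the implicit barrier at the end of each parallel for provides the fork-join synchronization between loop nests.

    #include <omp.h>

    /* Relaxation: one averaging pass plus one copy-back pass per timestep. */
    void sor(int n, int steps, double **grid, double **temp) {
        for (int t = 0; t < steps; t++) {
            #pragma omp parallel for
            for (int i = 1; i < n; i++)
                for (int j = 1; j < n; j++)
                    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
            /* implicit barrier here covers the anti-dependence */

            #pragma omp parallel for
            for (int i = 1; i < n; i++)
                for (int j = 1; j < n; j++)
                    grid[i][j] = temp[i][j];
            /* implicit barrier here covers the true dependence into the next timestep */
        }
    }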
45. Parallel SOR (continued)
- If we have P processors, we can give n/P rows or columns to each processor.
- Or, we can divide the array into P squares, and give each processor a square to compute.
46. Molecular Dynamics (MD)
- Simulation of a set of bodies under the influence of physical laws.
- Atoms, molecules, celestial bodies, ...
- All have the same basic structure.
47. Molecular Dynamics (Skeleton)
- for some number of timesteps {
-   for all molecules i
-     for all other molecules j
-       force[i] += f( loc[i], loc[j] );
-   for all molecules i
-     loc[i] = g( loc[i], force[i] );
- }
48. Molecular Dynamics (continued)
- To reduce the amount of computation, account for interactions only with nearby molecules.
49. Molecular Dynamics (continued)
- for some number of timesteps {
-   for all molecules i
-     for all nearby molecules j
-       force[i] += f( loc[i], loc[j] );
-   for all molecules i
-     loc[i] = g( loc[i], force[i] );
- }
50. Molecular Dynamics (continued)
- For each molecule i:
-   number of nearby molecules: count[i]
-   array of indices of nearby molecules: index[j]
-     ( 0 <= j < count[i] )
51. Molecular Dynamics (continued)
- for some number of timesteps {
-   for( i = 0; i < num_mol; i++ )
-     for( j = 0; j < count[i]; j++ )
-       force[i] += f( loc[i], loc[index[j]] );
-   for( i = 0; i < num_mol; i++ )
-     loc[i] = g( loc[i], force[i] );
- }
52. Molecular Dynamics (continued)
- No loop-carried dependence in first i-loop.
- Loop-carried dependence (reduction) in j-loop.
- No loop-carried dependence in second i-loop.
- True dependence between first and second i-loop.
53. Molecular Dynamics (continued)
- First i-loop can be parallelized.
- Second i-loop can be parallelized.
- Must make processors wait between loops.
- Natural synchronization: fork-join.
54. Molecular Dynamics (continued)
- for some number of timesteps {
-   for( i = 0; i < num_mol; i++ )
-     for( j = 0; j < count[i]; j++ )
-       force[i] += f( loc[i], loc[index[j]] );
-   for( i = 0; i < num_mol; i++ )
-     loc[i] = g( loc[i], force[i] );
- }
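A minimal OpenMP sketch of one MD timestep with the fork-join behavior described above. Several details are assumptions: locations and forces are simplified to one double per molecule, the neighbor list is stored as a per-molecule array index[i][j], and f and g are only prototyped here.

    #include <omp.h>

    double f(double, double);    /* assumed pairwise force contribution */
    double g(double, double);    /* assumed position update */

    /* One MD timestep: force computation, implicit barrier, then position update. */
    void md_step(int num_mol, const int *count, int **index,
                 double *loc, double *force) {
        #pragma omp parallel for
        for (int i = 0; i < num_mol; i++) {
            force[i] = 0.0;
            for (int j = 0; j < count[i]; j++)
                force[i] += f(loc[i], loc[index[i][j]]);   /* per-i reduction stays local */
        }
        /* implicit barrier: all forces are ready before any position moves */

        #pragma omp parallel for
        for (int i = 0; i < num_mol; i++)
            loc[i] = g(loc[i], force[i]);
    }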
55. Irregular vs. regular data parallel
- In SOR, all arrays are accessed through linear expressions of the loop indices, known at compile time: regular.
- In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime: irregular.
56. Irregular vs. regular data parallel
- No real differences in terms of parallelization (based on dependences).
- Will lead to fundamental differences in expressions of parallelism:
-   irregular is difficult for parallelism based on data distribution,
-   not difficult for parallelism based on iteration distribution.
57. Molecular Dynamics (continued)
- Parallelization of the first loop:
-   has a load balancing issue
-   some molecules have few neighbors, others have many
-   more sophisticated loop partitioning is necessary
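One common way to address this load imbalance, shown here only as a sketch, is to let the runtime hand out iterations in small chunks rather than fixed blocks, e.g., OpenMP's dynamic schedule; the chunk size 16 and the data-structure assumptions match the earlier MD sketch.

    #include <omp.h>

    double f(double, double);    /* assumed pairwise force contribution */

    /* Force loop with dynamic scheduling: threads grab chunks of 16 molecules at a
       time, so a thread that draws long neighbor lists does not hold the others up. */
    void compute_forces(int num_mol, const int *count, int **index,
                        const double *loc, double *force) {
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < num_mol; i++) {
            force[i] = 0.0;
            for (int j = 0; j < count[i]; j++)
                force[i] += f(loc[i], loc[index[i][j]]);
        }
    }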