Title: ECE 1747H: Parallel Programming
1. ECE 1747H Parallel Programming
- Lecture 2-3: More on parallelism and dependences; synchronization
2. Synchronization
- All programming models give the user the ability to control the ordering of events on different processors.
- This facility is called synchronization.
3. Example 1
- f() { a = 1; b = 2; c = 3; }
- g() { d = 4; e = 5; f = 6; }
- main() { f(); g(); }
- No dependences between f and g.
- Thus, f and g can be run in parallel.
4. Example 2
- f() { a = 1; b = 2; c = 3; }
- g() { a = 4; b = 5; c = 6; }
- main() { f(); g(); }
- Dependences between f and g.
- Thus, f and g cannot be run in parallel.
5. Example 2 (continued)
- f() { a = 1; b = 2; c = 3; }
- g() { a = 4; b = 5; c = 6; }
- main() { f(); g(); }
- Dependences are between the assignments to a, the assignments to b, and the assignments to c.
- No other dependences.
- Therefore, we only need to enforce these dependences.
6. Synchronization Facility
- Suppose we had a set of primitives, signal(x) and wait(x).
- wait(x) blocks unless a signal(x) has occurred.
- signal(x) does not block, but causes a pending wait(x) to unblock, or causes a future wait(x) not to block.
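As a rough illustration, such signal/wait primitives can be built in C on a mutex and condition variable. The sketch below is one possible realization, not a standard library API; the names event_t, event_init, event_signal, and event_wait are hypothetical.

    #include <pthread.h>
    #include <stdbool.h>

    /* A hypothetical event object implementing signal(x)/wait(x). */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            signaled;   /* remembers that a signal() already happened */
    } event_t;

    void event_init(event_t *e) {
        pthread_mutex_init(&e->lock, NULL);
        pthread_cond_init(&e->cond, NULL);
        e->signaled = false;
    }

    /* signal(x): never blocks; wakes current waiters or satisfies a future wait. */
    void event_signal(event_t *e) {
        pthread_mutex_lock(&e->lock);
        e->signaled = true;
        pthread_cond_broadcast(&e->cond);
        pthread_mutex_unlock(&e->lock);
    }

    /* wait(x): blocks unless/until a signal(x) has occurred. */
    void event_wait(event_t *e) {
        pthread_mutex_lock(&e->lock);
        while (!e->signaled)
            pthread_cond_wait(&e->cond, &e->lock);
        pthread_mutex_unlock(&e->lock);
    }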
7. Example 2 (continued)
- f() { a = 1; b = 2; c = 3; }
- g() { a = 4; b = 5; c = 6; }
- main() { f(); g(); }
- f() { a = 1; signal(e_a); b = 2; signal(e_b); c = 3; signal(e_c); }
- g() { wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; }
- main() { f(); g(); }
8. Example 2 (continued)
- Interleaved execution (time flows downward; f on the left, g on the right):
-   a = 1          wait(e_a)
-   signal(e_a)
-   b = 2          a = 4
-   signal(e_b)    wait(e_b)
-   c = 3          b = 5
-   signal(e_c)    wait(e_c)
-                  c = 6
- Execution is (mostly) parallel and correct.
- Dependences are covered by synchronization.
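A minimal pthreads version of this example, reusing the hypothetical event_t sketch above; the slide's f and g become thread bodies (renamed f_thread and g_thread to fit the pthread signature), and the three events e_a, e_b, e_c cover the three dependences.

    #include <pthread.h>

    int a, b, c;
    event_t e_a, e_b, e_c;               /* hypothetical events from the sketch above */

    void *f_thread(void *arg) {
        a = 1; event_signal(&e_a);
        b = 2; event_signal(&e_b);
        c = 3; event_signal(&e_c);
        return NULL;
    }

    void *g_thread(void *arg) {
        event_wait(&e_a); a = 4;
        event_wait(&e_b); b = 5;
        event_wait(&e_c); c = 6;
        return NULL;
    }

    int main(void) {
        pthread_t tf, tg;
        event_init(&e_a); event_init(&e_b); event_init(&e_c);
        pthread_create(&tf, NULL, f_thread, NULL);
        pthread_create(&tg, NULL, g_thread, NULL);
        pthread_join(tf, NULL);
        pthread_join(tg, NULL);
        return 0;
    }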
9. About synchronization
- Synchronization is necessary to make some programs execute correctly in parallel.
- However, synchronization is expensive.
- Therefore, it needs to be reduced, or sometimes we need to give up on parallelism.
10. Example 3
- f() { a = 1; b = 2; c = 3; }
- g() { d = 4; e = 5; a = 6; }
- main() { f(); g(); }
- The only dependence is on a, so a single event suffices:
- f() { a = 1; signal(e_a); b = 2; c = 3; }
- g() { d = 4; e = 5; wait(e_a); a = 6; }
- main() { f(); g(); }
11. Example 4
- for( i = 1; i < 100; i++ ) {
-   a[i] = ...;
-   ...
-   ... = a[i-1];
-   ...
- }
- Loop-carried dependence, not parallelizable.
12. Example 4 (continued)
- for( i = ...; i < ...; i++ ) {
-   a[i] = ...;
-   signal(e_a[i]);
-   ...
-   wait(e_a[i-1]);
-   ... = a[i-1];
-   ...
- }
13. Example 4 (continued)
- Note that here it matters which iterations are assigned to which processor.
- It does not matter for correctness, but it matters for performance.
- Cyclic assignment is probably best.
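A minimal sketch of a cyclic assignment for Example 4, assuming P worker threads and the hypothetical event primitives from earlier; thread p takes iterations 1+p, 1+p+P, 1+p+2P, and so on. The placeholder computations stand in for the slide's "...".

    #include <pthread.h>

    #define N 100
    #define P 4                          /* assumed number of worker threads */

    double a[N];
    event_t e_a[N];                      /* one hypothetical event per element of a */

    /* Worker p runs iterations 1+p, 1+p+P, 1+p+2P, ... (cyclic assignment). */
    void *worker(void *arg) {
        int p = *(int *)arg;
        for (int i = 1 + p; i < N; i += P) {
            a[i] = (double)i;            /* placeholder for the write covered by signal(e_a[i]) */
            event_signal(&e_a[i]);
            event_wait(&e_a[i - 1]);     /* covers the read of a[i-1] below */
            /* ... = a[i-1]; */
        }
        return NULL;
    }

    /* Setup: a[0] is initialized before the loop starts, so event_signal(&e_a[0])
       is called once up front; then the P workers are created and joined.        */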
14. Example 5
- for( i = 0; i < 100; i++ ) a[i] = f(i);
- x = g(a);
- for( i = 0; i < 100; i++ ) b[i] = x + h( a[i] );
- First loop can be run in parallel.
- Middle statement is sequential.
- Second loop can be run in parallel.
15. Example 5 (continued)
- We will need to make parallel execution stop after the first loop and resume at the beginning of the second loop.
- Two (standard) ways of doing that:
- fork() / join()
- barrier synchronization
16. Fork-Join Synchronization
- fork() causes a number of processes to be created and to be run in parallel.
- join() causes all these processes to wait until all of them have executed a join().
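One way to realize this fork()/join() model in C is to spawn a team of worker threads and then join them all. The sketch below assumes a fixed worker count P; the helper names fork_workers and join_workers are hypothetical, not part of any standard library.

    #include <pthread.h>

    #define P 4                              /* assumed number of parallel workers */

    static pthread_t team[P];

    /* fork(): create P workers, each running body(&worker_id). */
    void fork_workers(void *(*body)(void *)) {
        static int ids[P];
        for (int p = 0; p < P; p++) {
            ids[p] = p;
            pthread_create(&team[p], NULL, body, &ids[p]);
        }
    }

    /* join(): wait until every worker has finished. */
    void join_workers(void) {
        for (int p = 0; p < P; p++)
            pthread_join(team[p], NULL);
    }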
17. Example 5 (continued)
- fork();
- for( i = ...; i < ...; i++ ) a[i] = f(i);
- join();
- x = g(a);
- fork();
- for( i = ...; i < ...; i++ ) b[i] = x + h( a[i] );
- join();
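For comparison, a sketch of Example 5 in OpenMP, where each parallel for performs the fork and the implicit barrier at its end performs the join. The prototypes for f, g, and h are assumptions (the slides leave them abstract), and the '+' combining x and h(a[i]) follows the reconstruction above.

    #include <omp.h>

    double f(int);                       /* assumed to be defined elsewhere */
    double g(double *);
    double h(double);

    double a[100], b[100], x;

    void example5(void) {
        /* fork: iterations split across threads; join: implicit barrier at loop end */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = f(i);

        x = g(a);                        /* sequential middle statement */

        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            b[i] = x + h(a[i]);
    }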
18. Example 6
- sum = 0.0;
- for( i = 0; i < 100; i++ ) sum += a[i];
- Iterations have a dependence on sum.
- Cannot be parallelized, but ...
19. Example 6 (continued)
- for( k = 0; k < ...; k++ ) sum[k] = 0.0;
- fork();
- for( j = ...; j < ...; j++ ) sum[k] += a[j];
- join();
- sum = 0.0;
- for( k = 0; k < ...; k++ ) sum += sum[k];
20. Reduction
- This pattern is very common.
- Many parallel programming systems have explicit support for it, called reduction.
- sum = reduce( +, a, 0, 100 );
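In OpenMP, for instance, this pattern is expressed with a reduction clause rather than hand-written partial sums; a minimal sketch:

    #include <omp.h>

    double parallel_sum(const double *a, int n) {
        double sum = 0.0;
        /* each thread keeps a private partial sum; OpenMP combines them with + */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }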
21. Final word on synchronization
- Many different synchronization constructs exist in different programming models.
- Dependences have to be covered by appropriate synchronization.
- Synchronization is often expensive.
22. ECE 1747H Parallel Programming
- Lecture 2-3: Data Parallelism
23. Previously
- Ordering of statements.
- Dependences.
- Parallelism.
- Synchronization.
24. Goal of next few lectures
- Standard patterns of parallel programs.
- Examples of each.
- Later, code examples in various programming models.
25. Flavors of Parallelism
- Data parallelism: all processors do the same thing on different data.
-   Regular
-   Irregular
- Task parallelism: processors do different tasks.
-   Task queue
-   Pipelines
26. Data Parallelism
- Essential idea: each processor works on a different part of the data (usually in one or more arrays).
- Regular or irregular data parallelism: using linear or non-linear indexing.
- Examples: MM (regular), SOR (regular), MD (irregular).
27. Matrix Multiplication
- Multiplication of two n by n matrices A and B into a third n by n matrix C.
28. Matrix Multiply
- for( i = 0; i < n; i++ )
-   for( j = 0; j < n; j++ )
-     c[i][j] = 0.0;
- for( i = 0; i < n; i++ )
-   for( j = 0; j < n; j++ )
-     for( k = 0; k < n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
29. Parallel Matrix Multiply
- No loop-carried dependences in i- or j-loop.
- Loop-carried dependence on k-loop.
- All i- and j-iterations can be run in parallel.
30. Parallel Matrix Multiply (contd.)
- If we have P processors, we can give n/P rows or columns to each processor.
- Or, we can divide the matrix into P squares, and give each processor one square.
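A minimal OpenMP sketch that parallelizes the i-loop, which corresponds to giving blocks of rows of C to each processor; the matrices are assumed to be passed in as arrays of row pointers.

    #include <omp.h>

    /* Parallel matrix multiply: each thread computes a block of rows of c. */
    void matmul(int n, double **a, double **b, double **c) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];   /* reduction along k stays sequential */
            }
        }
    }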
31-36. Data Distribution Examples
- (Figures: different ways of distributing the matrix across processors.)
- BLOCK DISTRIBUTION BY ROW
- BLOCK DISTRIBUTION BY COLUMN
- CYCLIC DISTRIBUTION BY COLUMN
37. SOR
- SOR implements a mathematical model for many natural phenomena, e.g., heat dissipation in a metal sheet.
- The model is a partial differential equation.
- The focus is on the algorithm, not on the derivation.
38. Problem Statement
- (Figure: a rectangle in the x-y plane; inside it F satisfies ∇²F(x,y) = 0, with boundary condition F = 1 on one edge and F = 0 on the other three edges.)
39. Discretization
- Represent F in the continuous rectangle by a 2-dimensional discrete grid (array).
- The boundary conditions on the rectangle are the boundary values of the array.
- The internal values are found by the relaxation algorithm.
40. Discretized Problem Statement
- (Figure: the discretized 2-D grid, with row index i and column index j.)
41. Relaxation Algorithm
- For some number of iterations:
-   for each internal grid point:
-     compute the average of its four neighbors
- Termination condition:
-   values at grid points change very little
-   (we will ignore this part in our example)
42. Discretized Problem Statement
- for some number of timesteps/iterations {
-   for( i = 1; i < n; i++ )
-     for( j = 1; j < n; j++ )
-       temp[i][j] = 0.25 *
-         ( grid[i-1][j] + grid[i+1][j] +
-           grid[i][j-1] + grid[i][j+1] );
-   for( i = 1; i < n; i++ )
-     for( j = 1; j < n; j++ )
-       grid[i][j] = temp[i][j];
- }
43. Parallel SOR
- No dependences between iterations of the first (i,j) loop nest.
- No dependences between iterations of the second (i,j) loop nest.
- Anti-dependence between the first and second loop nest within the same timestep.
- True dependence between the second loop nest and the first loop nest of the next timestep.
44. Parallel SOR (continued)
- First (i,j) loop nest can be parallelized.
- Second (i,j) loop nest can be parallelized.
- We must make processors wait at the end of each (i,j) loop nest.
- Natural synchronization: fork-join.
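A minimal OpenMP sketch of the SOR timestep loop, assuming grid, temp, the grid dimension n, and the step count are supplied by the caller; the implicit barrier at the end of each parallel for provides the fork-join synchronization between loop nests.

    #include <omp.h>

    /* Relaxation: one averaging pass plus one copy-back pass per timestep. */
    void sor(int n, int steps, double **grid, double **temp) {
        for (int t = 0; t < steps; t++) {
            #pragma omp parallel for
            for (int i = 1; i < n; i++)
                for (int j = 1; j < n; j++)
                    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
            /* implicit barrier here covers the anti-dependence */

            #pragma omp parallel for
            for (int i = 1; i < n; i++)
                for (int j = 1; j < n; j++)
                    grid[i][j] = temp[i][j];
            /* implicit barrier here covers the true dependence into the next timestep */
        }
    }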
45. Parallel SOR (continued)
- If we have P processors, we can give n/P rows or columns to each processor.
- Or, we can divide the array into P squares, and give each processor a square to compute.
46. Molecular Dynamics (MD)
- Simulation of a set of bodies under the influence of physical laws.
- Atoms, molecules, celestial bodies, ...
- All have the same basic structure.
47. Molecular Dynamics (Skeleton)
- for some number of timesteps {
-   for all molecules i
-     for all other molecules j
-       force[i] += f( loc[i], loc[j] );
-   for all molecules i
-     loc[i] = g( loc[i], force[i] );
- }
48. Molecular Dynamics (continued)
- To reduce the amount of computation, account for interactions only with nearby molecules.
49. Molecular Dynamics (continued)
- for some number of timesteps {
-   for all molecules i
-     for all nearby molecules j
-       force[i] += f( loc[i], loc[j] );
-   for all molecules i
-     loc[i] = g( loc[i], force[i] );
- }
50. Molecular Dynamics (continued)
- For each molecule i:
-   number of nearby molecules: count[i]
-   array of indices of nearby molecules: index[j]
-     ( 0 <= j < count[i] )
51. Molecular Dynamics (continued)
- for some number of timesteps {
-   for( i = 0; i < num_mol; i++ )
-     for( j = 0; j < count[i]; j++ )
-       force[i] += f( loc[i], loc[index[j]] );
-   for( i = 0; i < num_mol; i++ )
-     loc[i] = g( loc[i], force[i] );
- }
52. Molecular Dynamics (continued)
- No loop-carried dependence in first i-loop.
- Loop-carried dependence (reduction) in j-loop.
- No loop-carried dependence in second i-loop.
- True dependence between first and second i-loop.
53. Molecular Dynamics (continued)
- First i-loop can be parallelized.
- Second i-loop can be parallelized.
- Must make processors wait between loops.
- Natural synchronization: fork-join.
54. Molecular Dynamics (continued)
- for some number of timesteps {
-   for( i = 0; i < num_mol; i++ )
-     for( j = 0; j < count[i]; j++ )
-       force[i] += f( loc[i], loc[index[j]] );
-   for( i = 0; i < num_mol; i++ )
-     loc[i] = g( loc[i], force[i] );
- }
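A minimal OpenMP sketch of one MD timestep with the fork-join behavior described above. Several details are assumptions: locations and forces are simplified to one double per molecule, the neighbor list is stored as a per-molecule array index[i][j], and f and g are only prototyped here.

    #include <omp.h>

    double f(double, double);    /* assumed pairwise force contribution */
    double g(double, double);    /* assumed position update */

    /* One MD timestep: force computation, implicit barrier, then position update. */
    void md_step(int num_mol, const int *count, int **index,
                 double *loc, double *force) {
        #pragma omp parallel for
        for (int i = 0; i < num_mol; i++) {
            force[i] = 0.0;
            for (int j = 0; j < count[i]; j++)
                force[i] += f(loc[i], loc[index[i][j]]);   /* per-i reduction stays local */
        }
        /* implicit barrier: all forces are ready before any position moves */

        #pragma omp parallel for
        for (int i = 0; i < num_mol; i++)
            loc[i] = g(loc[i], force[i]);
    }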
55. Irregular vs. regular data parallel
- In SOR, all arrays are accessed through linear expressions of the loop indices, known at compile time: regular.
- In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime: irregular.
56. Irregular vs. regular data parallel
- No real differences in terms of parallelization (based on dependences).
- Will lead to fundamental differences in expressions of parallelism:
-   irregular is difficult for parallelism based on data distribution,
-   not difficult for parallelism based on iteration distribution.
57. Molecular Dynamics (continued)
- Parallelization of the first loop:
-   has a load balancing issue
-   some molecules have few neighbors, others have many
-   more sophisticated loop partitioning is necessary
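One common way to address this load imbalance, shown here only as a sketch, is to let the runtime hand out iterations in small chunks rather than fixed blocks, e.g., OpenMP's dynamic schedule; the chunk size 16 and the data-structure assumptions match the earlier MD sketch.

    #include <omp.h>

    double f(double, double);    /* assumed pairwise force contribution */

    /* Force loop with dynamic scheduling: threads grab chunks of 16 molecules at a
       time, so a thread that draws long neighbor lists does not hold the others up. */
    void compute_forces(int num_mol, const int *count, int **index,
                        const double *loc, double *force) {
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < num_mol; i++) {
            force[i] = 0.0;
            for (int j = 0; j < count[i]; j++)
                force[i] += f(loc[i], loc[index[i][j]]);
        }
    }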