ECE 1747H: Parallel Programming

Transcript and Presenter's Notes


1
ECE 1747H Parallel Programming
  • Lecture 2-3: More on parallelism and dependences -- synchronization

2
Synchronization
  • All programming models give the user the ability
    to control the ordering of events on different
    processors.
  • This facility is called synchronization.

3
Example 1
  • f() { a = 1; b = 2; c = 3; }
  • g() { d = 4; e = 5; f = 6; }
  • main() { f(); g(); }
  • No dependences between f and g.
  • Thus, f and g can be run in parallel.

4
Example 2
  • f() { a = 1; b = 2; c = 3; }
  • g() { a = 4; b = 5; c = 6; }
  • main() { f(); g(); }
  • Dependences between f and g.
  • Thus, f and g cannot be run in parallel.

5
Example 2 (continued)
  • f() { a = 1; b = 2; c = 3; }
  • g() { a = 4; b = 5; c = 6; }
  • main() { f(); g(); }
  • Dependences are between assignments to a,
    assignments to b, assignments to c.
  • No other dependences.
  • Therefore, we only need to enforce these
    dependences.

6
Synchronization Facility
  • Suppose we had a set of primitives, signal(x) and
    wait(x).
  • wait(x) blocks unless a signal(x) has occurred.
  • signal(x) does not block, but causes a wait(x) to
    unblock, or causes a future wait(x) not to block.
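A minimal sketch of how such primitives could be realized with POSIX counting semaphores (the names event_t, event_init, signal_event, and wait_event are illustrative, not part of any standard API):

    #include <semaphore.h>

    /* one semaphore per event, initialized to 0 (no signal yet) */
    typedef sem_t event_t;

    void event_init(event_t *e)   { sem_init(e, 0, 0); }

    /* signal(x): never blocks; wakes a current waiter,
       or lets a future wait(x) pass immediately */
    void signal_event(event_t *e) { sem_post(e); }

    /* wait(x): blocks until a signal_event has occurred */
    void wait_event(event_t *e)   { sem_wait(e); }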

7
Example 2 (continued)
  • f() { a = 1; b = 2; c = 3; }
  • g() { a = 4; b = 5; c = 6; }
  • main() { f(); g(); }
  • With synchronization added:
  • f() { a = 1; signal(e_a); b = 2; signal(e_b); c = 3; signal(e_c); }
  • g() { wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; }
  • main() { f(); g(); }

8
Example 2 (continued)
  •   f()                    g()
  •   a = 1                  wait(e_a)
  •   signal(e_a)
  •   b = 2                  a = 4
  •   signal(e_b)            wait(e_b)
  •   c = 3                  b = 5
  •   signal(e_c)            wait(e_c)
  •                          c = 6
  • Execution is (mostly) parallel and correct.
  • Dependences are covered by synchronization.

9
About synchronization
  • Synchronization is necessary to make some
    programs execute correctly in parallel.
  • However, synchronization is expensive.
  • Therefore, it needs to be reduced, or sometimes parallelism
    has to be given up.

10
Example 3
  • f() { a = 1; b = 2; c = 3; }
  • g() { d = 4; e = 5; a = 6; }
  • main() { f(); g(); }
  • With synchronization added:
  • f() { a = 1; signal(e_a); b = 2; c = 3; }
  • g() { d = 4; e = 5; wait(e_a); a = 6; }
  • main() { f(); g(); }

11
Example 4
  • for( i=1; i<100; i++ ) {
  •   a[i] = ...;
  •   ... = a[i-1];
  • }
  • Loop-carried dependence, not parallelizable

12
Example 4 (continued)
  • for( i=...; i<...; i++ ) {
  •   a[i] = ...;
  •   signal(e_a[i]);
  •   wait(e_a[i-1]);
  •   ... = a[i-1];
  • }

13
Example 4 (continued)
  • Note that here it matters which iterations are
    assigned to which processor.
  • It does not matter for correctness, but it
    matters for performance.
  • Cyclic assignment is probably best.
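A sketch of the difference, assuming P processors, the 100-iteration loop of Example 4, and a hypothetical per-iteration routine do_iteration(i) (all names are illustrative, not from the slides):

    void do_iteration(int i);   /* hypothetical: body of one loop iteration */

    /* block assignment: processor my_id gets one contiguous chunk; bad here,
       because a chunk cannot start until the previous chunk's last a[i] exists */
    void block_worker(int my_id, int P) {
        int chunk = 99 / P;                       /* iterations per processor */
        for (int i = 1 + my_id * chunk; i < 1 + (my_id + 1) * chunk; i++)
            do_iteration(i);
    }

    /* cyclic assignment: processor my_id gets iterations my_id+1, my_id+1+P, ...;
       good here, because iteration i only waits on i-1, which another processor
       started roughly one step earlier */
    void cyclic_worker(int my_id, int P) {
        for (int i = 1 + my_id; i < 100; i += P)
            do_iteration(i);
    }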

14
Example 5
  • for( i=0; i<100; i++ ) a[i] = f(i);
  • x = g(a);
  • for( i=0; i<100; i++ ) b[i] = x * h( a[i] );
  • First loop can be run in parallel.
  • Middle statement is sequential.
  • Second loop can be run in parallel.

15
Example 5 (continued)
  • We will need to make parallel execution stop
    after first loop and resume at the beginning of
    the second loop.
  • Two (standard) ways of doing that:
  • fork() - join()
  • barrier synchronization

16
Fork-Join Synchronization
  • fork() causes a number of processes to be created
    and to be run in parallel.
  • join() causes all these processes to wait until
    all of them have executed a join().
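A minimal sketch of the fork-join idea with POSIX threads (NTHREADS and worker are illustrative names; the slides' fork()/join() are generic primitives, not the UNIX fork system call):

    #include <pthread.h>

    #define NTHREADS 4

    void *worker(void *arg) {
        long id = (long)arg;
        /* each thread executes its share of the parallel work here */
        (void)id;
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];

        /* "fork": create the parallel workers */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);

        /* "join": wait until all workers have finished */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }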

17
Example 5 (continued)
  • fork()
  • for( i=...; i<...; i++ ) a[i] = f(i);
  • join()
  • x = g(a);
  • fork()
  • for( i=...; i<...; i++ ) b[i] = x * h( a[i] );
  • join()

18
Example 6
  • sum = 0.0;
  • for( i=0; i<100; i++ ) sum += a[i];
  • Iterations have dependence on sum.
  • Cannot be parallelized, but ...

19
Example 6 (continued)
  • for( k=0; k<...; k++ ) sum[k] = 0.0;
  • fork()
  • for( j=...; j<...; j++ ) sum[k] += a[j];
  • join()
  • sum = 0.0;
  • for( k=0; k<...; k++ ) sum += sum[k];

20
Reduction
  • This pattern is very common.
  • Many parallel programming systems have explicit
    support for it, called reduction.
  • sum = reduce( +, a, 0, 100 );
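OpenMP, for example, exposes this pattern directly through a reduction clause; a minimal sketch (OpenMP is just one system with such support, not the model assumed by the slides):

    double parallel_sum(const double *a, int n) {
        double sum = 0.0;
        /* each thread accumulates a private partial sum;
           the partial sums are combined into sum at the end */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }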

21
Final word on synchronization
  • Many different synchronization constructs exist
    in different programming models.
  • Dependences have to be covered by appropriate
    synchronization.
  • Synchronization is often expensive.

22
ECE 1747H Parallel Programming
  • Lecture 2-3: Data Parallelism

23
Previously
  • Ordering of statements.
  • Dependences.
  • Parallelism.
  • Synchronization.

24
Goal of next few lectures
  • Standard patterns of parallel programs.
  • Examples of each.
  • Later, code examples in various programming
    models.

25
Flavors of Parallelism
  • Data parallelism: all processors do the same
    thing on different data.
  • Regular
  • Irregular
  • Task parallelism: processors do different tasks.
  • Task queue
  • Pipelines

26
Data Parallelism
  • Essential idea: each processor works on a
    different part of the data (usually in one or
    more arrays).
  • Regular or irregular data parallelism: using
    linear or non-linear indexing.
  • Examples: MM (regular), SOR (regular), MD
    (irregular).

27
Matrix Multiplication
  • Multiplication of two n by n matrices A and B
    into a third n by n matrix C

28
Matrix Multiply
  • for( i=0; i<n; i++ )
  •   for( j=0; j<n; j++ )
  •     c[i][j] = 0.0;
  • for( i=0; i<n; i++ )
  •   for( j=0; j<n; j++ )
  •     for( k=0; k<n; k++ )
  •       c[i][j] += a[i][k]*b[k][j];

29
Parallel Matrix Multiply
  • No loop-carried dependences in i- or j-loop.
  • Loop-carried dependence on k-loop.
  • All i- and j-iterations can be run in parallel.

30
Parallel Matrix Multiply (contd.)
  • If we have P processors, we can give n/P rows or
    columns to each processor.
  • Or, we can divide the matrix in P squares, and
    give each processor one square.
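A sketch of the row-block version, assuming each of the P workers knows its own id my_id and that n is divisible by P (all names are illustrative):

    /* worker my_id computes rows [lo, hi) of C; workers write disjoint
       rows of C, so the multiply itself needs no synchronization */
    void mm_worker(int my_id, int P, int n,
                   double **a, double **b, double **c) {
        int lo = my_id * (n / P);
        int hi = lo + (n / P);
        for (int i = lo; i < hi; i++)
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }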

31
Data Distribution Examples
  • BLOCK DISTRIBUTION

32
Data Distribution Examples
  • BLOCK DISTRIBUTION BY ROW

33
Data Distribution Examples
  • BLOCK DISTRIBUTION BY COLUMN

34
Data Distribution Examples
  • CYCLIC DISTRIBUTION BY COLUMN

35
Data Distribution Examples
  • BLOCK CYCLIC

36
Data Distribution Examples
  • COMBINATIONS

37
SOR
  • SOR (successive over-relaxation) implements a
    mathematical model for many natural phenomena,
    e.g., heat dissipation in a metal sheet.
  • Model is a partial differential equation.
  • Focus is on algorithm, not on derivation.

38
Problem Statement
  • (Figure: a rectangle in the x-y plane; F(x,y) satisfies the PDE in the
    interior, with boundary condition F = 1 on one edge and F = 0 on the
    other three edges.)
39
Discretization
  • Represent F in continuous rectangle by a
    2-dimensional discrete grid (array).
  • The boundary conditions on the rectangle are the
    boundary values of the array
  • The internal values are found by the relaxation
    algorithm.

40
Discretized Problem Statement
  • (Figure: the rectangle represented as a 2-D grid of points, indexed by
    row i and column j.)
41
Relaxation Algorithm
  • For some number of iterations:
  • for each internal grid point
  • compute the average of its four neighbors
  • Termination condition:
  • values at grid points change very little
  • (we will ignore this part in our example)

42
Discretized Problem Statement
  • for some number of timesteps/iterations {
  •   for( i=1; i<n; i++ )
  •     for( j=1; j<n; j++ )
  •       temp[i][j] = 0.25 *
  •         ( grid[i-1][j] + grid[i+1][j] +
  •           grid[i][j-1] + grid[i][j+1] );
  •   for( i=1; i<n; i++ )
  •     for( j=1; j<n; j++ )
  •       grid[i][j] = temp[i][j];
  • }

43
Parallel SOR
  • No dependences between iterations of first (i,j)
    loop nest.
  • No dependences between iterations of second (i,j)
    loop nest.
  • Anti-dependence between first and second loop
    nest in the same timestep.
  • True dependence between second loop nest and
    first loop nest of next timestep.

44
Parallel SOR (continued)
  • First (i,j) loop nest can be parallelized.
  • Second (i,j) loop nest can be parallelized.
  • We must make processors wait at the end of each
    (i,j) loop nest.
  • Natural synchronization: fork-join.

45
Parallel SOR (continued)
  • If we have P processors, we can give n/P rows or
    columns to each processor.
  • Or, we can divide the array in P squares, and
    give each processor a square to compute.
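One way to express this, sketched with OpenMP parallel loops standing in for fork-join (n, grid, and temp are assumed to be allocated and initialized elsewhere):

    /* one timestep of SOR; the implicit barrier at the end of each
       "parallel for" plays the role of join() between the two loop nests */
    void sor_step(int n, double **grid, double **temp) {
        #pragma omp parallel for
        for (int i = 1; i < n; i++)
            for (int j = 1; j < n; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);

        #pragma omp parallel for
        for (int i = 1; i < n; i++)
            for (int j = 1; j < n; j++)
                grid[i][j] = temp[i][j];
    }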

46
Molecular Dynamics (MD)
  • Simulation of a set of bodies under the influence
    of physical laws.
  • Atoms, molecules, celestial bodies, ...
  • Have same basic structure.

47
Molecular Dynamics (Skeleton)
  • for some number of timesteps {
  •   for all molecules i
  •     for all other molecules j
  •       force[i] += f( loc[i], loc[j] );
  •   for all molecules i
  •     loc[i] = g( loc[i], force[i] );
  • }

48
Molecular Dynamics (continued)
  • To reduce amount of computation, account for
    interaction only with nearby molecules.

49
Molecular Dynamics (continued)
  • for some number of timesteps {
  •   for all molecules i
  •     for all nearby molecules j
  •       force[i] += f( loc[i], loc[j] );
  •   for all molecules i
  •     loc[i] = g( loc[i], force[i] );
  • }

50
Molecular Dynamics (continued)
  • for each molecule i:
  • number of nearby molecules: count[i]
  • array of indices of nearby molecules: index[j]
  •   ( 0 <= j < count[i] )

51
Molecular Dynamics (continued)
  • for some number of timesteps {
  •   for( i=0; i<num_mol; i++ )
  •     for( j=0; j<count[i]; j++ )
  •       force[i] += f( loc[i], loc[index[j]] );
  •   for( i=0; i<num_mol; i++ )
  •     loc[i] = g( loc[i], force[i] );
  • }

52
Molecular Dynamics (continued)
  • No loop-carried dependence in first i-loop.
  • Loop-carried dependence (reduction) in j-loop.
  • No loop-carried dependence in second i-loop.
  • True dependence between first and second i-loop.

53
Molecular Dynamics (continued)
  • First i-loop can be parallelized.
  • Second i-loop can be parallelized.
  • Must make processors wait between loops.
  • Natural synchronization: fork-join.

54
Molecular Dynamics (continued)
  • for some number of timesteps {
  •   for( i=0; i<num_mol; i++ )
  •     for( j=0; j<count[i]; j++ )
  •       force[i] += f( loc[i], loc[index[j]] );
  •   for( i=0; i<num_mol; i++ )
  •     loc[i] = g( loc[i], force[i] );
  • }

55
Irregular vs. regular data parallel
  • In SOR, all arrays are accessed through linear
    expressions of the loop indices, known at compile
    time: regular.
  • In MD, some arrays are accessed through
    non-linear expressions of the loop indices, some
    known only at runtime: irregular.

56
Irregular vs. regular data parallel
  • No real differences in terms of parallelization
    (based on dependences).
  • Will lead to fundamental differences in how the
    parallelism is expressed:
  • irregular is difficult for parallelism based on data
    distribution,
  • but not difficult for parallelism based on iteration
    distribution.

57
Molecular Dynamics (continued)
  • Parallelization of the first loop
  • has a load balancing issue:
  • some molecules have few neighbors, others have many,
  • so more sophisticated loop partitioning is necessary
    (see the sketch below).
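One common remedy, not spelled out in the slides, is to hand out iterations dynamically from a shared counter; a sketch assuming C11 atomics and a hypothetical per-molecule routine compute_force(i):

    #include <stdatomic.h>

    void compute_force(int i);   /* hypothetical: force computation for molecule i */

    atomic_int next_i;           /* shared iteration counter, zero-initialized */

    /* each worker repeatedly grabs the next unprocessed molecule; workers
       that draw molecules with few neighbors simply come back for more,
       which balances the load automatically */
    void force_worker(int num_mol) {
        for (;;) {
            int i = atomic_fetch_add(&next_i, 1);
            if (i >= num_mol) break;
            compute_force(i);
        }
    }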