1. Parallel Programming Concepts
- Performance measures and related issues
- Parallelisation approaches
- Code organization
- Sources of parallelism
2. Performance Measures and Related Issues
- Speedup
- Amdahl's law
- Load Balancing
- Granularity
3Superlinear Speedup
- Should not be possible
- you can simulate the fast parallel algorithm on
a single processor to beat the best sequential
algorithm - Yet sometimes happens is practice
- more memory available in a parallel computer
- different search order in search problems
Sequential search 12 moves
Parallel search 2 moves
4. Additional terms
- Efficiency
  - speedup / p
- Cost
  - time x p
- Scalability
  - how efficiently the hardware/algorithm can use additional processors
- Gustafson's law (observation)
  - the situation is not as tragic as Amdahl's law suggests
  - the serial fraction usually stays (nearly) constant as the problem size increases
  - consequence: nearly linear speedup is possible if the problem size increases with the number of processors (illustrated below)
- Isoefficiency function: see the book by Kumar et al.
5. Exercises: Speedup, Efficiency, Cost
Example 1: Compute the sums of the columns of an upper triangular matrix.

Program 1 (for processor i, n processors):

    sum[i] = 0;
    for (j = 0; j <= i; j++)
        sum[i] += A[j][i];

Program 2 (for processor i, p processors):

    for (k = i*p; k <= (i+1)*p - 1; k++) {
        sum[k] = 0;
        for (j = 0; j <= k; j++)
            sum[k] += A[j][k];
    }

Program 3 (for processor i, p processors):

    for (k = i; k < n; k += p) {
        sum[k] = 0;
        for (j = 0; j <= k; j++)
            sum[k] += A[j][k];
    }
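A rough back-of-the-envelope check for Program 1, assuming one addition per matrix element: the slowest processor (i = n-1) performs about n additions, while a sequential program performs about n(n+1)/2, so the speedup is roughly n/2, the efficiency (speedup/n) is roughly 1/2, and the cost (time x p) is about n*n, roughly twice the sequential work.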
6. Exercises: Speedup, Efficiency, Cost
Can we do any better?

    int AddColumn(int col)
    {
        int sum = 0;
        for (j = 0; j <= col; j++)
            sum += A[j][col];
        return sum;
    }

Program 4 (for processor i, p processors):

    /* columns are handed out in pairs whose combined length is the same for every processor */
    col = -i;
    while (col < n) {
        col += 2*i;                    /* a relatively short column ... */
        sum[col] = AddColumn(col);
        col += 2*p - 2*i;              /* ... paired with a longer one */
        sum[col] = AddColumn(col);
    }
7. Exercises: Speedup, Efficiency, Cost
Example 2: Compute the sum of n numbers.

Program (for processor i, n processors):

    tmpSum = A[i];
    for (j = 2; j <= n; j *= 2) {
        if (i % j == 0) {
            receive(i + j/2, hisSum);
            tmpSum += hisSum;
        } else {
            send(i - j/2, tmpSum);
            break;
        }
    }
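The send/receive pattern above is a binary-tree reduction. For reference, here is a hedged MPI translation, assuming one integer per process, a power-of-two number of processes, and each rank's own id standing in for A[i]; in practice a single MPI_Reduce call would do the same job:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int i, n;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &i);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        int tmpSum = i;                 /* stands in for A[i] */
        for (int j = 2; j <= n; j *= 2) {
            if (i % j == 0) {           /* survive this round: receive and add */
                int hisSum;
                MPI_Recv(&hisSum, 1, MPI_INT, i + j/2, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                tmpSum += hisSum;
            } else {                    /* send my partial sum up the tree and stop */
                MPI_Send(&tmpSum, 1, MPI_INT, i - j/2, 0, MPI_COMM_WORLD);
                break;
            }
        }
        if (i == 0)
            printf("total = %d\n", tmpSum);
        MPI_Finalize();
        return 0;
    }

Rank 0 ends up with the total after log2(n) rounds.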
8. Sources of inefficiency
[Figure: execution-time chart for processors P0-P8, distinguishing computation from idle time]
9. Sources of inefficiency II
[Figure: execution-time chart for processors P0-P8, distinguishing computation, idle time, and communication]
10. Sources of inefficiency
[Figure: execution-time chart for processors P0-P8, distinguishing computation, idle time, communication, and additional or repeated computation]
11. Load Balancing
Efficiency is adversely affected by an uneven workload.
[Figure: execution-time chart for processors P0-P4, showing computation and idle (wasted) time]
12. Load Balancing (cont.)
- Load balancing: shifting work from heavily loaded processors to lightly loaded ones.
[Figure: execution-time chart for processors P0-P4, showing computation, idle (wasted) time, and work moved between processors]
- Static load balancing
  - before execution
- Dynamic load balancing
  - during execution (sketched below)
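A minimal sketch of dynamic load balancing with threads, reusing the column-sum exercise from the earlier slides (the matrix A, its size N, the thread count, and the result array sum are assumptions here): each thread claims the next unprocessed column from a shared counter, so a thread that finishes its short columns early automatically picks up more work.

    #include <pthread.h>
    #include <stdatomic.h>

    #define N 1024                    /* matrix size (assumed) */
    #define NTHREADS 4

    int A[N][N];                      /* upper triangular input matrix (assumed) */
    int sum[N];                       /* per-column results */
    atomic_int next_col;              /* shared work counter */

    void *worker(void *arg)
    {
        for (;;) {
            int col = atomic_fetch_add(&next_col, 1);   /* claim the next column */
            if (col >= N)
                break;
            int s = 0;
            for (int j = 0; j <= col; j++)              /* column col has col+1 entries */
                s += A[j][col];
            sum[col] = s;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Static load balancing would instead fix the assignment of columns to threads before execution, as in Programs 2 and 3.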
13. Granularity
- The size of the computation segments between communication.
- A spectrum from fine grained to coarse grained: instruction-level parallelism (ILP), loop parallelism, task parallelism.
- The most efficient granularity depends on the algorithm and on the hardware environment in which it runs.
- In most cases the overhead associated with communication and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity.
14. Fine Grain Parallelism
- All tasks execute a small number of instructions between communication cycles.
- Low computation-to-communication ratio.
- Facilitates load balancing.
- Implies high communication overhead and less opportunity for performance enhancement.
- If the granularity is too fine, the overhead required for communication and synchronization between tasks may take longer than the computation itself.
15Coarse Grain Parallelism
- Typified by long computations consisting of
large numbers of instructions between
communication synchronization points - High computation to communication ratio
- Lower communication overhead, more opportunity
for performance increase - Harder to load balance efficiently
P0 P1
computation P2
commmunication P3 P4
16. Granularity vs. Coupling
- Granularity: ranges from fine grained to coarse grained.
- Coupling: ranges from tightly coupled (SMP, ccNUMA, NUMA) to loosely coupled (MPP, Ethernet cluster).
- The looser the coupling, the coarser the granularity must be for the communication not to overwhelm the computation.
17. Parallel Programming Concepts
- Performance measures and related issues
- Parallelisation approaches
- Code organization
- Sources of parallelism
18. Parallelisation Approaches
- Parallelizing compiler
  - advantage: use your current code
  - disadvantage: very limited abilities
- Parallel domain-specific libraries
  - e.g. linear algebra, numerical libraries, quantum chemistry
  - usually a good choice; use when possible
- Communication libraries
  - message passing libraries (MPI, PVM)
  - shared memory libraries: declare and access shared memory variables (on MPP machines done by emulation)
  - advantage: use a standard compiler
  - disadvantage: low-level programming ("parallel assembler")
19. Parallelisation Approaches (cont.)
- New parallel languages
  - use a language with built-in explicit control of parallelism
  - no language is the best in every domain
  - needs a new compiler
  - fights against inertia
- Parallel features in existing languages
  - adding parallel features to an existing language, e.g. for expressing loop parallelism (pardo) and data placement
  - example: High Performance Fortran
- Additional possibilities in shared-memory systems
  - use threads
  - preprocessor compiler directives (OpenMP); see the sketch below
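As a minimal sketch of the directive-based approach, assuming a simple vector-scaling loop: the OpenMP pragma asks the compiler to split the loop iterations among threads (compile with an OpenMP-capable compiler, e.g. gcc -fopenmp).

    void scale(double *x, double a, int n)
    {
        /* iterations are independent, so they can be divided among threads */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = a * x[i];
    }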
20. Parallelisation Approaches: Our Focus
- Communication libraries: MPI, PVM
  - industry standard, available for every platform
  - very general, low-level approach
  - a perfect match for clusters
  - most likely to be useful for you
- Shared memory programming
  - also very important
  - likely to be useful in future generations of PCs
21. Parallel Programming Concepts
- Performance measures and related issues
- Parallelisation approaches
- Code organization
- Sources of parallelism
22. Code Organization - SPMD
- Single Program Multiple Data
- well suited for SIMD computers
- a popular choice even for MIMD, as it keeps everything in one place
- typical in MPI programs (see the MPI sketch after the example)
- static process creation
- may waste memory
- Example: a heap-like computation, the SPMD way

    main()
    {
        if (id == 0)
            rootNode();
        else if (id < p/2)
            innerNode();
        else
            leafNode();
    }
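For concreteness, a hedged sketch of how an SPMD program of this shape typically obtains id and p in MPI; the three role functions from the example are replaced here by prints:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int id, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* my rank: 0 .. p-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &p);    /* total number of processes */

        if (id == 0)
            printf("rank %d: root node\n", id);    /* rootNode() in the example */
        else if (id < p / 2)
            printf("rank %d: inner node\n", id);   /* innerNode() */
        else
            printf("rank %d: leaf node\n", id);    /* leafNode() */

        MPI_Finalize();
        return 0;
    }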
23. Code Organization - MPMD
- Multiple Programs Multiple Data
- allows dynamic process creation
- typically a master-slave approach
- more memory-efficient
- typical in PVM

    master.c:
    main()
    {
        for (i = 1; i <= p; i++)
            sid[i] = spawn(slave(i));
    }

    slave.c:
    main(int id)
    {
        // slave code here
    }
24. Parallel Programming Concepts
- Performance measures and related issues
- Parallelisation approaches
- Code organization
- Sources of parallelism
25. Sources of Parallelism
- Data Parallelism
- Task Parallelism
- Pipelining
26. Data Parallelism
- divide data up amongst processors.
- process different data segments in parallel
- communicate boundary information, if necessary
- includes loop parallelism
- well suited for SIMD machines
- communication is often implicit (HPF)
27. Task Parallelism
- decompose the algorithm into different sections
- assign sections to different processors
- often uses fork()/join()/spawn(); a minimal sketch follows below
- usually does not lend itself to a high degree of parallelism
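A minimal sketch of task parallelism with POSIX threads; the two task functions are hypothetical stand-ins for independent sections of an algorithm:

    #include <stdio.h>
    #include <pthread.h>

    /* two independent sections of a (hypothetical) algorithm */
    void *taskA(void *arg) { printf("task A running\n"); return NULL; }
    void *taskB(void *arg) { printf("task B running\n"); return NULL; }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, taskA, NULL);   /* "fork"/"spawn" the two tasks */
        pthread_create(&b, NULL, taskB, NULL);
        pthread_join(a, NULL);                   /* "join": wait for both */
        pthread_join(b, NULL);
        return 0;
    }

The two pthread_create calls play the role of fork()/spawn(), and pthread_join is the join.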
28. Pipelining
- a sequence of tasks whose execution can overlap
- a sequential processor must execute them one after another, without overlap
- a parallel computer can overlap the tasks, increasing throughput (but not decreasing latency)
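As a worked example (assuming a pipeline of k stages with equal stage time t): executing n tasks without overlap takes n*k*t, while the pipeline finishes them in (k + n - 1)*t, so for large n the throughput approaches one task per t, yet the latency of any individual task remains k*t.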
29. New Concepts and Terms - Summary
- speedup, efficiency, cost, scalability
- Amdahl's law, Gustafson's law
- Load balancing: static, dynamic
- Granularity: fine, coarse
- Tightly vs. loosely coupled systems
- SPMD, MPMD
- Data Parallelism, Task Parallelism, Pipelining