Title: Parallel distributed computing techniques
Parallel distributed computing techniques
Students: Lê Trọng Tín, Mai Văn Ninh, Phùng Quang Chánh, Nguyễn Đức Cảnh, Đặng Trung Tín
Motivation of Parallel Computing Techniques
- Demand for computational speed
- There is a continual demand for greater computational speed from a computer system than is currently possible.
- Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems.
- Computations must be completed within a reasonable time period.
Message-Passing Computing
- Basics of message-passing programming using user-level message-passing libraries
- Two primary mechanisms are needed:
- A method of creating separate processes for execution on different computers
- A method of sending and receiving messages
Message-Passing Computing
[Figure: basic MPI way - a single source file is compiled to suit each processor, producing executables that run on processors 0 through n-1.]
Message-Passing Computing
[Figure: PVM way - a process running on processor 1 calls spawn() to start execution of process 2 on processor 2; time runs downward.]
Message-Passing Computing
Method of sending and receiving messages?
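As a concrete illustration (not part of the original slides), a minimal MPI sketch of point-to-point send and receive between two processes might look like the following; the message value and tag are arbitrary choices for the example:

    /* minimal MPI send/receive sketch (illustrative only) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;                                   /* data to send */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", x);
        }
        MPI_Finalize();
        return 0;
    }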
Pipelined Computation
- The problem is divided into a series of tasks that have to be completed one after the other (the basis of sequential programming).
- Each task is executed by a separate process or processor.
Pipelined Computation
- Where pipelining can be used to good effect:
- 1. If more than one instance of the complete problem is to be executed
- 2. If a series of data items must be processed, each requiring multiple operations
- 3. If information to start the next process can be passed forward before the process has completed all its internal operations
Pipelined Computation
- Execution time is m + p - 1 cycles for a p-stage pipeline and m instances of the problem.
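To make the pipelined style concrete, here is a small illustrative MPI sketch (not from the slides) of a pipelined addition: each stage adds its own number to a partial sum received from the previous stage and passes the result forward. The per-stage values are assumptions for the example:

    /* pipelined addition sketch: stage i adds its own number to the running sum */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size, sum = 0, number;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        number = rank + 1;                 /* this stage's value (illustrative) */
        if (rank > 0)                      /* receive partial sum from the previous stage */
            MPI_Recv(&sum, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += number;                     /* this stage's operation */
        if (rank < size - 1)               /* pass the result on to the next stage */
            MPI_Send(&sum, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("final sum = %d\n", sum);
        MPI_Finalize();
        return 0;
    }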
Pipelined Computation
[Figures: examples of pipelined computations.]
Ideal Parallel Computation
- A computation that can obviously be divided into a number of completely independent parts
- Each part can be executed by a separate processor
- Each process can do its work without any interaction with the other processes
Ideal Parallel Computation
- Practical embarrassingly parallel computation with static process creation and a master-slave approach
Ideal Parallel Computation
- Practical embarrassingly parallel computation with dynamic process creation and a master-slave approach
Embarrassingly Parallel Examples
- Geometrical transformations of images
- Mandelbrot set
- Monte Carlo method
Geometrical Transformations of Images
- A transformation is performed on the coordinates of each pixel to move the position of the pixel without affecting its value.
- The transformation on each pixel is totally independent of the other pixels.
- Some geometrical operations:
- Shifting
- Scaling
- Rotation
- Clipping
Geometrical Transformations of Images
- Partitioning the image into regions for individual processes: a square region for each process, or a row region for each process.
[Figure: a 640 x 480 image partitioned either into 80 x 80 square regions or into 640 x 10 row regions, one region per process.]
Mandelbrot Set
- The set of points in a complex plane that are quasi-stable when computed by iterating the function
  z(k+1) = z(k)^2 + c
  where z(k+1) is the (k+1)th iteration of the complex number z = a + bi, and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.
- Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector given by
  z_length = sqrt(a^2 + b^2)
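A sketch (not taken verbatim from the slides) of computing the iteration count for one point c of the complex plane; the structure type, function name, and iteration limit are choices made for the example:

    /* illustrative sketch: iteration count for one point c of the complex plane */
    typedef struct { float real, imag; } complex_t;    /* assumed helper type */

    int cal_pixel(complex_t c) {
        int count = 0, max_iter = 256;                  /* arbitrary iteration limit */
        float z_real = 0.0f, z_imag = 0.0f;             /* initial value of z is zero */
        float lengthsq, temp;
        do {
            temp = z_real * z_real - z_imag * z_imag + c.real;   /* z = z*z + c */
            z_imag = 2.0f * z_real * z_imag + c.imag;
            z_real = temp;
            lengthsq = z_real * z_real + z_imag * z_imag;        /* |z|^2 */
            count++;
        } while (lengthsq < 4.0f && count < max_iter);           /* stop when |z| > 2 */
        return count;                                   /* used to colour the pixel */
    }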
[Figures: images of the Mandelbrot set.]
Mandelbrot Set
- Scaling each display point (x, y) to the complex plane:
  c.real = real_min + x * (real_max - real_min) / disp_width
  c.imag = imag_min + y * (imag_max - imag_min) / disp_height
- Static task assignment
- Simply divide the region into a fixed number of parts, each computed by a separate processor.
- Not very successful, because different regions require different numbers of iterations and different times.
- Dynamic task assignment
- Have processors request new regions after computing their previous regions.
Monte Carlo Method
- Another embarrassingly parallel computation
- Monte Carlo methods make use of random selections
- Example: calculating pi
- A circle of unit radius is formed within a square, so that the square has sides 2 x 2. The ratio of the area of the circle to the area of the square is given by
  (pi * 1^2) / (2 * 2) = pi / 4
Monte Carlo Method
- One quadrant of the construction can be described by the integral
  integral from 0 to 1 of sqrt(1 - x^2) dx = pi / 4
- Random pairs of numbers (xr, yr) are generated, each between 0 and 1. A pair is counted as inside the circle if sqrt(xr^2 + yr^2) <= 1, that is, if xr^2 + yr^2 <= 1.
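A small sequential sketch of this pi calculation (illustrative, not from the slides); the sample count N is an arbitrary choice:

    /* Monte Carlo estimate of pi from the fraction of random points inside the unit circle */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int N = 1000000;                       /* number of random samples (arbitrary) */
        int in_circle = 0;
        for (int i = 0; i < N; i++) {
            double xr = (double)rand() / RAND_MAX;   /* random point in the unit quadrant */
            double yr = (double)rand() / RAND_MAX;
            if (xr * xr + yr * yr <= 1.0)            /* inside the quarter circle? */
                in_circle++;
        }
        printf("pi is approximately %f\n", 4.0 * in_circle / N);
        return 0;
    }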
Monte Carlo Method
- Alternative method to compute an integral: use random values of x to compute f(x) and sum the values of f(x):
  area = (1/N) * sum of f(xr) * (x2 - x1)
  where the xr are randomly generated values of x between x1 and x2.
- The Monte Carlo method is very useful if the function cannot be integrated numerically (perhaps because it has a large number of variables).
Monte Carlo Method
- Example: computing an integral of f(x) over the interval [x1, x2]
- Sequential code (see the sketch below)
- Routine randv(x1, x2) returns a pseudorandom number between x1 and x2
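The slide's code did not survive extraction; the following is a minimal sketch of what such sequential Monte Carlo integration code might look like, assuming an illustrative integrand f(x) = x^2 - 3x and a simple implementation of the randv() routine described above:

    #include <stdlib.h>

    /* assumed implementation of the randv() routine described on the slide */
    double randv(double x1, double x2) {
        return x1 + (x2 - x1) * ((double)rand() / RAND_MAX);
    }

    /* sketch of sequential Monte Carlo integration of f over [x1, x2] */
    double monte_carlo_integral(double x1, double x2, int N) {
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            double xr = randv(x1, x2);          /* random sample point */
            sum += xr * xr - 3.0 * xr;          /* illustrative integrand f(x) = x^2 - 3x */
        }
        return (sum / N) * (x2 - x1);           /* average of f times interval width */
    }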
Monte Carlo Method
- Parallel Monte Carlo integration
[Figure: slaves request random numbers from a separate random-number process, compute partial sums, and return them to the master.]
Partitioning and Divide-and-Conquer Strategies
Partitioning
- Partitioning simply divides the problem into parts.
- It is the basis of all parallel programming.
- Partitioning can be applied to the program data (data partitioning or domain decomposition) and to the functions of a program (functional decomposition).
- It is much less common to find concurrent functions in a problem, so data partitioning is the main strategy for parallel programming.
Partitioning (cont)
A sequence of numbers, x0, ..., xn-1, is to be added.
n = number of items, p = number of processors
[Figure: partitioning a sequence of numbers into parts and adding them.]
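A minimal MPI sketch of this partitioning strategy (illustrative; the array size and values are assumptions), in which each process adds its own part and the partial sums are then combined:

    /* scatter the numbers, add each part locally, then reduce the partial sums */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024                                   /* assumed divisible by the number of processes */

    int main(int argc, char *argv[]) {
        int rank, p, x[N], part[N], i, part_sum = 0, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        if (rank == 0)
            for (i = 0; i < N; i++) x[i] = i;        /* numbers to add (illustrative) */
        MPI_Scatter(x, N / p, MPI_INT, part, N / p, MPI_INT, 0, MPI_COMM_WORLD);
        for (i = 0; i < N / p; i++)                  /* each process adds its own part */
            part_sum += part[i];
        MPI_Reduce(&part_sum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %d\n", sum);
        MPI_Finalize();
        return 0;
    }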
Divide and Conquer
- Characterized by dividing the problem into subproblems of the same form as the larger problem. Further division into still smaller subproblems is usually done by recursion.
- Recursive divide and conquer is amenable to parallelization because separate processes can be used for the divided parts. Also, the data is usually naturally localized.
Divide and Conquer (cont)
- A sequential recursive definition for adding a list of numbers is:

    int add(int *s)                       // add list of numbers, s
    {
        if (number(s) <= 2) return (n1 + n2);
        else {
            Divide(s, s1, s2);            // divide s into two parts, s1 and s2
            part_sum1 = add(s1);          // recursive calls to add sublists
            part_sum2 = add(s2);
            return (part_sum1 + part_sum2);
        }
    }
Divide and Conquer (cont)
[Figure: tree construction - the initial problem is divided repeatedly into subproblems, with the final tasks at the leaves of the tree.]
Divide and Conquer (cont)
[Figure: dividing an original list x0, ..., xn-1 among eight processes - P0 holds the initial problem, the list is split between P0 and P4, then among P0, P2, P4, P6, and finally P0 through P7 each hold one part as the final tasks.]
Partitioning/Divide-and-Conquer Examples
- Many possibilities:
- Operations on sequences of numbers, such as simply adding them together
- Sorting algorithms, several of which can be partitioned or constructed in a recursive fashion
- Numerical integration
- N-body problem
Bucket Sort
- The interval is divided into regions, and one bucket is assigned to hold the numbers that fall within each region.
- The numbers in each bucket are sorted using a sequential sorting algorithm.
- Sequential sorting time complexity: O(n log(n/m)).
- Works well if the original numbers are uniformly distributed across a known interval, say 0 to a - 1.
- n = number of items, m = number of buckets
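A short sequential sketch of the idea (illustrative; the interval bound, bucket layout, and use of qsort are choices for the example):

    /* sequential bucket sort over [0, a): place each number into one of m buckets, then sort each bucket */
    #include <stdlib.h>

    static int cmp_int(const void *p, const void *q) {
        return (*(const int *)p > *(const int *)q) - (*(const int *)p < *(const int *)q);
    }

    void bucket_sort(int *x, int n, int a, int m) {
        int *bucket = malloc((size_t)n * m * sizeof(int));   /* m buckets, worst case n entries each */
        int *count = calloc(m, sizeof(int));
        for (int i = 0; i < n; i++) {
            int b = (int)((long long)x[i] * m / a);          /* bucket index for x[i] in [0, a) */
            bucket[b * n + count[b]++] = x[i];
        }
        int k = 0;
        for (int b = 0; b < m; b++) {                        /* sort each bucket, then concatenate */
            qsort(&bucket[b * n], count[b], sizeof(int), cmp_int);
            for (int j = 0; j < count[b]; j++) x[k++] = bucket[b * n + j];
        }
        free(bucket); free(count);
    }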
Parallel Version of Bucket Sort
- Simple approach: assign one processor for each bucket.
Further Parallelization
- Partition the sequence into m regions, one region for each processor.
- Each processor maintains p small buckets and separates the numbers in its region into its own small buckets.
- The small buckets are then emptied into the p final buckets for sorting, which requires each processor to send one small bucket to each of the other processors (bucket i to processor i).
Another Parallel Version of Bucket Sort
- Introduces a new message-passing operation: the all-to-all broadcast.
All-to-All Broadcast Routine
- Sends data from each process to every other process.
All-to-All Broadcast Routine (cont)
- The all-to-all routine actually transfers the rows of an array to columns - it transposes a matrix.
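In MPI, a routine with this behaviour is MPI_Alltoall; a minimal sketch (illustrative values) in which each process sends one distinct integer to every other process:

    /* all-to-all exchange: each process sends one element to every other process */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int *sendbuf = malloc(p * sizeof(int));
        int *recvbuf = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++)
            sendbuf[i] = rank * 100 + i;          /* element destined for process i (illustrative) */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
        for (int i = 0; i < p; i++)
            printf("process %d received %d from process %d\n", rank, recvbuf[i], i);
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }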
Synchronous Computations
- Synchronous
- Barrier
- Barrier Implementation
- Centralized Counter implementation
- Tree Barrier Implementation
- Butterfly Barrier
- Synchronized Computations
- Fully synchronous
- Data Parallel Computations
- Synchronous Iteration (Synchronous Parallelism)
- Locally synchronous
- Heat Distribution Problem
- Sequential Code
- Parallel Code
Barrier
- A basic mechanism for synchronizing processes - inserted at the point in each process where it must wait.
- All processes can continue from this point when all the processes have reached it.
- Processes may reach the barrier at different times.
[Figure: processes reaching the barrier at different times and waiting until the last one arrives.]
Barrier Implementation
- Centralized counter implementation (linear barrier)
- Tree barrier implementation
- Butterfly barrier
- Local synchronization
- Deadlock
Centralized Counter Implementation
- Has two phases:
- Arrival phase (trapping)
- Departure phase (release)
- A process enters the arrival phase and does not leave this phase until all processes have arrived in it.
- Then the processes move to the departure phase and are released.
Example code
- Master:

    for (i = 0; i < n; i++)      /* count slaves as they reach the barrier */
        recv(P_any);
    for (i = 0; i < n; i++)      /* release slaves */
        send(P_i);

- Slave processes:

    send(P_master);
    recv(P_master);
Tree Barrier Implementation
- Suppose 8 processes: P0, P1, P2, P3, P4, P5, P6, P7
- First stage:
- P1 sends a message to P0 (when P1 reaches its barrier)
- P3 sends a message to P2 (when P3 reaches its barrier)
- P5 sends a message to P4 (when P5 reaches its barrier)
- P7 sends a message to P6 (when P7 reaches its barrier)
- Second stage:
- P2 sends a message to P0 (when P2 and P3 have reached their barriers)
- P6 sends a message to P4 (when P6 and P7 have reached their barriers)
- Third stage:
- P4 sends a message to P0 (when P4, P5, P6, and P7 have reached their barriers)
- P0 terminates the arrival phase (when P0 has reached its barrier and received the message from P4)
Tree Barrier Implementation
- Release is done with a reverse tree construction.
[Figure: tree barrier.]
Butterfly Barrier
- This would be used if data were exchanged between the processes.
Local Synchronization
- Suppose a process Pi needs to be synchronized with, and to exchange data with, process Pi-1 and process Pi+1.
- This is not a perfect three-process barrier, because process Pi-1 will only synchronize with Pi and continue as soon as Pi allows. Similarly, process Pi+1 only synchronizes with Pi.
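A minimal MPI sketch of this neighbour-only synchronization (illustrative; the exchanged value is arbitrary), using MPI_Sendrecv with the processes arranged in a line:

    /* each process exchanges a value with its left and right neighbours only */
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, p, mine, left_val = 0, right_val = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        mine = rank;                                          /* value to exchange (illustrative) */
        int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;   /* no neighbour at the ends */
        int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Sendrecv(&mine, 1, MPI_INT, left,  0,             /* send to P(i-1), receive from P(i+1) */
                     &right_val, 1, MPI_INT, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&mine, 1, MPI_INT, right, 0,             /* send to P(i+1), receive from P(i-1) */
                     &left_val, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }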
Synchronized Computations
- Fully synchronous
- In fully synchronous computations, all processes involved in the computation must be synchronized.
- Data parallel computations
- Synchronous iteration (synchronous parallelism)
- Locally synchronous
- In locally synchronous computations, processes only need to synchronize with a set of logically nearby processes, not with all processes involved in the computation.
- Heat distribution problem
- Sequential code
- Parallel code
Data Parallel Computations
- The same operation is performed on different data elements simultaneously (SIMD).
- Data parallel programming is very convenient for two reasons:
- The first is its ease of programming (essentially only one program).
- The second is that it can scale easily to larger problem sizes.
Synchronous Iteration
- Each iteration is composed of several processes that start together at the beginning of the iteration. The next iteration cannot begin until all processes have finished the previous iteration. Using forall:

    for (j = 0; j < n; j++) {         /* for each synchronous iteration */
        forall (i = 0; i < N; i++) {  /* N processors, each using */
            body(i);                  /* a specific value of i */
        }
    }
Synchronous Iteration
- Solving a general system of linear equations by iteration
- Suppose the equations are of a general form, with n equations and n unknowns x0, x1, x2, ..., xn-1:

    a(n-1,0)x0 + a(n-1,1)x1 + a(n-1,2)x2 + ... + a(n-1,n-1)x(n-1) = b(n-1)
    ...
    a(2,0)x0 + a(2,1)x1 + a(2,2)x2 + ... + a(2,n-1)x(n-1) = b2
    a(1,0)x0 + a(1,1)x1 + a(1,2)x2 + ... + a(1,n-1)x(n-1) = b1
    a(0,0)x0 + a(0,1)x1 + a(0,2)x2 + ... + a(0,n-1)x(n-1) = b0
Synchronous Iteration
- Rearranging the ith equation
    a(i,0)x0 + a(i,1)x1 + a(i,2)x2 + ... + a(i,n-1)x(n-1) = bi
  gives
    xi = (1/a(i,i)) * [bi - (a(i,0)x0 + a(i,1)x1 + ... + a(i,i-1)x(i-1) + a(i,i+1)x(i+1) + ... + a(i,n-1)x(n-1))]
  or, equivalently,
    xi = (1/a(i,i)) * [bi - sum over j != i of a(i,j)xj]
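A short sequential sketch of this iterative (Jacobi-style) update, illustrative only, with a fixed iteration count rather than a convergence test and an assumed system size:

    /* iterative solution of Ax = b: each new x[i] is computed from the previous iteration's values */
    #define N 4                                    /* illustrative system size */

    void iterate(double a[N][N], double b[N], double x[N], int limit) {
        double new_x[N];
        for (int iter = 0; iter < limit; iter++) {
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != i) sum += a[i][j] * x[j];   /* sum of a(i,j)*x(j), j != i */
                new_x[i] = (b[i] - sum) / a[i][i];
            }
            for (int i = 0; i < N; i++)            /* all x values updated together (synchronous) */
                x[i] = new_x[i];
        }
    }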
Heat Distribution Problem
- An area has known temperatures along each of its edges. Find the temperature distribution within it. Divide the area into a fine mesh of points, h(i,j). The temperature at an inside point is taken to be the average of the temperatures of the four neighboring points.
- The temperature of each point is found by iterating the equation
    h(i,j) = (h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1)) / 4
    (0 < i < n, 0 < j < n)
[Figure: the heat distribution mesh with known edge temperatures.]
Sequential Code
- Using a fixed number of iterations:

    for (iteration = 0; iteration < limit; iteration++) {
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
        for (i = 1; i < n; i++)        /* update points */
            for (j = 1; j < n; j++)
                h[i][j] = g[i][j];
    }
Parallel Code
- With a fixed number of iterations, process P(i,j) (except for the boundary points) executes:

    for (iteration = 0; iteration < limit; iteration++) {
        g = 0.25 * (w + x + y + z);
        send(&g, P(i-1,j));     /* non-blocking sends */
        send(&g, P(i+1,j));
        send(&g, P(i,j-1));
        send(&g, P(i,j+1));
        recv(&w, P(i-1,j));     /* synchronous receives */
        recv(&x, P(i+1,j));
        recv(&y, P(i,j-1));
        recv(&z, P(i,j+1));
    }

- The sends and receives together act as a local barrier with the four neighbours.
Load Balancing and Termination Detection
Static Load Balancing
- Round robin algorithm - passes out tasks to processes in sequential order, coming back to the first process when all processes have been given a task
- Randomized algorithms - select processes at random to take tasks
- Recursive bisection - recursively divides the problem into subproblems of equal computational effort while minimizing message passing
- Simulated annealing - an optimization technique
- Genetic algorithm - another optimization technique
Static Load Balancing
- There are several fundamental flaws with static load balancing, even if a mathematical solution exists:
- It is very difficult to estimate accurately the execution times of the various parts of a program without actually executing the parts.
- Communication delays vary under different circumstances.
- Some problems have an indeterminate number of steps to reach their solution.
Dynamic Load Balancing
Centralized Dynamic Load Balancing
- Tasks are handed out from a centralized location, using a master-slave structure.
- The master process(or) holds the collection of tasks to be performed.
- Tasks are sent to the slave processes. When a slave process completes one task, it requests another task from the master process.
- (Terms used: work pool, replicated worker, processor farm.)
[Figure: centralized work pool - the master holds the task queue; slaves send requests and receive tasks.]
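A minimal MPI work-pool sketch of this master-slave scheme (illustrative; tasks and results are reduced to integers, and the tag values and stop marker are arbitrary choices):

    /* master hands out tasks on request; a task index of -1 tells a slave to stop */
    #include <mpi.h>

    #define NTASKS      100
    #define TAG_REQUEST 1
    #define TAG_TASK    2

    int main(int argc, char *argv[]) {
        int rank, p, task, dummy = 0;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        if (rank == 0) {                       /* master: holds the task queue */
            int next = 0, stopped = 0;
            while (stopped < p - 1) {
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &status);
                task = (next < NTASKS) ? next++ : -1;      /* -1 means no more work */
                if (task == -1) stopped++;
                MPI_Send(&task, 1, MPI_INT, status.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
            }
        } else {                               /* slaves: request a task, compute, repeat */
            for (;;) {
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (task == -1) break;
                /* ... perform the task here ... */
            }
        }
        MPI_Finalize();
        return 0;
    }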
Termination
- The computation terminates when:
- The task queue is empty, and
- Every process has made a request for another task without any new tasks being generated.
- It is not sufficient to terminate when the task queue is empty while one or more processes are still running, because a running process may provide new tasks for the task queue.
Decentralized Dynamic Load Balancing
Fully Distributed Work Pool
- Processes execute tasks obtained from each other.
- Tasks can be transferred by:
- Receiver-initiated methods
- Sender-initiated methods
Process Selection
- Algorithms for selecting a process:
- Round robin algorithm - process Pi requests tasks from process Px, where x is given by a counter that is incremented after each request, using modulo n arithmetic (n processes), excluding x = i.
- Random polling algorithm - process Pi requests tasks from process Px, where x is a number selected randomly between 0 and n-1 (excluding i).
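Two tiny helper sketches of these selection rules (illustrative formulations only, not taken from the slides):

    #include <stdlib.h>

    /* round robin: next target process for Pi, skipping i itself */
    int next_round_robin(int *counter, int i, int n) {
        int x = *counter % n;
        if (x == i) x = (x + 1) % n;          /* exclude x == i */
        *counter = x + 1;                     /* counter incremented after each request */
        return x;
    }

    /* random polling: random target in 0..n-1, excluding i */
    int next_random(int i, int n) {
        int x;
        do { x = rand() % n; } while (x == i);
        return x;
    }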
Distributed Termination Detection Algorithms
- Termination conditions at time t:
- Application-specific local termination conditions exist throughout the collection of processes.
- There are no messages in transit between processes.
- The second condition is necessary because a message in transit might restart a terminated process. It is more difficult to recognize, since the time it takes for messages to travel between processes is not known in advance.
Using Acknowledgment Messages
- Each process is in one of two states:
- Inactive - without any task to perform
- Active
- The process that sent the task that made it enter the active state becomes its parent.
Using Acknowledgment Messages
- When a process receives a task, it immediately sends an acknowledgment message, unless the process it received the task from is its parent process. It only sends an acknowledgment message to its parent when it is ready to become inactive, i.e. when:
- Its local termination condition exists (all tasks are completed), and
- It has transmitted all its acknowledgments for tasks it has received, and
- It has received all its acknowledgments for tasks it has sent out.
- A process must become inactive before its parent process. When the first process becomes idle, the computation can terminate.
Load Balancing/Termination Detection Example
Example: finding the shortest distance between two points in a graph.
References
Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition, Prentice Hall, 2005.
Q&A
Thank You!