Title: Embarrassingly Parallel or pleasantly parallel
1 Embarrassingly Parallel (or pleasantly parallel)
- Domain divisible into a large number of independent parts
- Minimal or no communication
- Each processor performs the same calculation independently
- Nearly embarrassingly parallel
  - Small communication/computation ratio
  - Communication limited to the distribution and gathering of data
  - Computation is time consuming and hides the communication
2 Embarrassingly Parallel Examples
[Figure: Embarrassingly Parallel Application, showing independent processes P0, P1, and P2]
[Figure: Nearly Embarrassingly Parallel Application, showing a Send Data step, processes P0 through P3, and a Receive Data step]
3 Low Level Image Processing
- Storage
  - A two dimensional array of pixels
  - One bit, one byte, or three bytes may represent each pixel
- Operations may only involve local data
- Image Applications
  - Shifting: newX = x + delta, newY = y + delta
  - Scaling: newX = x * scale, newY = y * scale
  - Rotating: newX = x cos φ + y sin φ, newY = -x sin φ + y cos φ
  - Clipping: newX = x if minx < x < maxx, 0 otherwise; newY = y if miny < y < maxy, 0 otherwise
- Other Applications
  - Smoothing, Edge Detection, Pattern Matching
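The four coordinate operations above can be sketched in Python as follows. This is a minimal illustration, not code from the text: the function names are made up here, and separate x and y amounts (`dx`, `dy`, `sx`, `sy`) are assumed where the slide writes a single `delta` or `scale`.

```python
import math

def shift(x, y, dx, dy):
    """Translate a pixel coordinate by (dx, dy)."""
    return x + dx, y + dy

def scale(x, y, sx, sy):
    """Scale a pixel coordinate by the factors (sx, sy)."""
    return x * sx, y * sy

def rotate(x, y, phi):
    """Rotate a pixel coordinate about the origin by phi radians."""
    return (x * math.cos(phi) + y * math.sin(phi),
            -x * math.sin(phi) + y * math.cos(phi))

def clip(x, y, minx, maxx, miny, maxy):
    """Keep a coordinate only if it lies strictly inside the window, else 0."""
    return (x if minx < x < maxx else 0,
            y if miny < y < maxy else 0)
```

Each function touches only one pixel's own coordinates, which is why these operations divide so easily among processors.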
4 Process Partitioning
[Figure: an image partitioned among processes; the labels read 1024, 768, 128, 128, and P21]
Partitioning might assign groups of rows or columns to processors
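A minimal Python sketch of row partitioning, assuming contiguous bands of rows and an even split with any remainder spread over the first processors. The function name `partition_rows` is hypothetical.

```python
def partition_rows(num_rows, num_procs):
    """Split rows 0..num_rows-1 into contiguous (start, end) bands, end exclusive."""
    base, extra = divmod(num_rows, num_procs)
    bands = []
    start = 0
    for p in range(num_procs):
        # The first `extra` processors take one additional row each.
        size = base + (1 if p < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands
```

For example, `partition_rows(768, 6)` gives six bands of 128 rows each.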
5 Image Shifting Application (See code on Page 84)
- Master
  - Send starting row number to slaves
  - Initialize a new array to hold shifted image
  - FOR each message received
    - Update new bit map coordinates
- Slave
  - Receive starting row
  - Compute translated coordinates and transmit them back to the master
- Questions
  - Where is the initial image?
  - What happens if a remote processor fails?
  - How does the master decide how much to assign to each processor?
  - Is the load balanced (all processors working equally)?
  - Is the initial transmission of the row numbers needed?
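The master/slave structure above can be sketched in Python. This is not the program from page 84: threads from `concurrent.futures` stand in for slave processes, function arguments and return values stand in for messages, and the image is assumed to be visible to every slave. The names `slave`, `master`, `dx`, `dy`, and `workers` are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def slave(image, start_row, num_rows, dx, dy):
    """Compute translated coordinates for one band of rows."""
    out = []
    for r in range(start_row, start_row + num_rows):
        for c in range(len(image[0])):
            out.append((r + dy, c + dx, image[r][c]))
    return out

def master(image, dx, dy, workers=4):
    """Hand out bands of rows, then place each returned pixel in a new array."""
    rows, cols = len(image), len(image[0])
    shifted = [[0] * cols for _ in range(rows)]
    band = -(-rows // workers)  # ceiling division: rows per slave
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(slave, image, s, min(band, rows - s), dx, dy)
                   for s in range(0, rows, band)]
        for f in futures:
            for r, c, value in f.result():
                # Pixels shifted outside the image are dropped.
                if 0 <= r < rows and 0 <= c < cols:
                    shifted[r][c] = value
    return shifted
```

Returning one triple per pixel mirrors the heavy communication that the analysis slide measures; returning whole rows would send far fewer, larger messages.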
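6 Analysis
Program on Page 84
- Computation
  - Host: $3 \cdot rows \cdot cols$; Slave: $2 \cdot rows \cdot cols / (P-1)$
- Communication ($t_{comm} = t_{startup} + m \cdot t_{data}$)
  - Host: $(t_{startup} + t_{data})(P-1) + rows \cdot cols \cdot (t_{startup} + 4 t_{data})$
  - Slaves: $(t_{startup} + t_{data}) + \frac{rows \cdot cols}{P-1}(t_{startup} + 4 t_{data})$
- Total (taking $t_{startup} = t_{data} = 1$)
  - $T_s = 4 \cdot rows \cdot cols$
  - $T_p = 3 \cdot rows \cdot cols + (t_{startup} + t_{data})(P-1) + rows \cdot cols \cdot (t_{startup} + 4 t_{data}) = 3 \cdot rows \cdot cols + 2(P-1) + 5 \cdot rows \cdot cols = 8 \cdot rows \cdot cols + 2(P-1)$
  - $S(p) = T_s / T_p < 1/2$
- Computation/communication ratio
  - $t_{comp}/t_{comm} = (3 \cdot rows \cdot cols) / (5 \cdot rows \cdot cols + 2(P-1)) \approx 3/5$
- Questions
  - Can the transmission of the rows be done in parallel?
  - How is it possible to reduce the communication cost?
  - Is this an Amdahl or a Gustafson application?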
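To check the arithmetic, the totals above can be evaluated numerically. The sketch below is an illustration in Python; the function name `shift_timing` and the default unit costs for `t_startup` and `t_data` are assumptions made for this example.

```python
def shift_timing(rows, cols, p, t_startup=1.0, t_data=1.0):
    """Evaluate the slide's cost model for the image shifting program."""
    pixels = rows * cols
    t_seq = 4 * pixels                       # sequential time T_s
    t_comp = 3 * pixels                      # host computation
    t_comm = ((t_startup + t_data) * (p - 1)
              + pixels * (t_startup + 4 * t_data))  # host communication
    t_par = t_comp + t_comm                  # parallel time T_p
    return {
        "t_seq": t_seq,
        "t_par": t_par,
        "speedup": t_seq / t_par,
        "comp_comm_ratio": t_comp / t_comm,
    }
```

With unit costs the speedup stays below 1/2 for any image size and processor count, so under this model the parallel version is slower than the sequential one.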
7 Mandelbrot Set
- The Mandelbrot Set is a set of complex plane points that are iterated using a prescribed function over a bounded area
  - The iteration stops when the function value reaches a limit
  - The iteration stops when the iteration count reaches a limit
  - Each point gets a color according to the final iteration count
- Complex numbers
  - $a + bi$ where $i = (-1)^{1/2}$
- Complex plane
  - Horizontal axis: real values
  - Vertical axis: imaginary values
8 Pseudo code
- FOR each point c = cx + i·cy in a bounded area
  - SET z = zreal + i·zimaginary = 0 + i·0
  - SET iterations = 0
  - DO
    - SET z = f(z, c)
    - SET value = (zreal^2 + zimaginary^2)^(1/2)
    - iterations++
  - WHILE value < limit and iterations < max
  - point = cx and cy scaled to the display
  - picture[point] = color[iterations]
- Notes
  - Set each point's color based on its final iteration count
  - Some points exceed the limit quickly, others slowly, and others not at all
  - The points that never exceed the limit (those that reach the maximum iteration count) are said to lie in the Mandelbrot Set (black on the previous slide)
  - A common Mandelbrot function is z = z^2 + c
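A minimal Python version of the per-point loop in the pseudo code, using the common function z = z^2 + c. The default `limit` of 2.0 and `max_iterations` of 256 are assumptions chosen for illustration; the slide does not give values.

```python
def mandelbrot(cx, cy, limit=2.0, max_iterations=256):
    """Return the iteration count at which the point c = cx + i*cy stops."""
    c = complex(cx, cy)
    z = 0j
    iterations = 0
    while True:
        z = z * z + c                  # SET z = f(z, c)
        iterations += 1
        # abs(z) is (zreal^2 + zimaginary^2)^(1/2)
        if abs(z) >= limit or iterations >= max_iterations:
            break
    return iterations
```

A point such as c = 0 never exceeds the limit and returns `max_iterations`, while a point far from the origin stops after one iteration. That uneven cost per point is what makes load balancing matter later.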
9 Scaling and Zooming
- Display range of points
  - From cmin = xmin + i·ymin to cmax = xmax + i·ymax
- Display range of pixels
  - From the pixel at (0, 0) to the pixel at (width, height)
- Pseudo code
  - For pixelx = 0 to width
    - For pixely = 0 to height
      - cx = xmin + pixelx * (xmax - xmin) / width
      - cy = ymin + pixely * (ymax - ymin) / height
      - color = mandelbrot(cx, cy)
      - picture[pixelx][pixely] = color
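The pixel-to-point mapping can be sketched in Python as below. The names `pixel_to_point` and `render` are hypothetical, and the coloring function is passed in as the parameter `color_of` so that the block stands alone. Pixel indices here run from 0 to width - 1 and 0 to height - 1, and the picture is stored row by row, which are choices made for this sketch.

```python
def pixel_to_point(px, py, width, height, xmin, xmax, ymin, ymax):
    """Map a pixel position to a point (cx, cy) in the displayed range."""
    cx = xmin + px * (xmax - xmin) / width
    cy = ymin + py * (ymax - ymin) / height
    return cx, cy

def render(width, height, xmin, xmax, ymin, ymax, color_of):
    """Build the picture as a list of rows, calling color_of(cx, cy) per pixel."""
    return [[color_of(*pixel_to_point(px, py, width, height,
                                      xmin, xmax, ymin, ymax))
             for px in range(width)]
            for py in range(height)]
```

Zooming is then only a matter of calling `render` with a smaller range for `xmin`, `xmax`, `ymin`, and `ymax`.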
10 Parallel Implementation
Static and Dynamic load balancing approaches shown in chapter 3
- Load-balancing
  - Algorithms used to keep processors from becoming idle
  - Note: Does NOT mean that every processor has the same work load
- Static Approach
  - The load is partitioned once at the start of the run
  - Mandelbrot: assign each processor a group of rows
  - Deficiencies of book approach
    - Separate messages per coordinate
    - No accounting for processes that fail
- Dynamic Approach
  - The load is partitioned during the run
  - Mandelbrot: slaves ask for work when they complete a section
  - Improvements from book approach
    - Ask for work before completion (double buffering)
  - Question: How does the program terminate?
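One possible shape for the dynamic approach is a shared work pool, sketched below in Python. Threads and a `queue.Queue` stand in for slave processes and messages, so this shows the control flow rather than real speedup. The function name `run_work_pool` and the use of `None` as a termination marker are assumptions of this sketch, offered as one answer to the termination question, not as the book's method.

```python
import queue
import threading

def run_work_pool(num_rows, compute_row, num_workers=4):
    """Workers repeatedly take a row number, compute it, and store the result."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    for row in range(num_rows):
        tasks.put(row)
    for _ in range(num_workers):
        tasks.put(None)  # one termination marker per worker

    def worker():
        while True:
            row = tasks.get()      # ask for work
            if row is None:        # no work left: terminate
                return
            value = compute_row(row)
            with lock:
                results[row] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[r] for r in range(num_rows)]
```

A worker that draws cheap rows simply comes back for more, so no worker sits idle while work remains in the queue.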
11 Analysis of Static Approach
- Assumptions (Different from the text)
  - Slaves send a row at a time
  - Assume display time is equal to computation time
  - $t_{startup} = t_{data} = 1$
- Master
  - Computation: $height \cdot width$
  - Communication: $height \cdot (t_{startup} + width \cdot t_{data}) \approx height \cdot width$
- Slaves
  - Computation: $avgIterations \cdot height \cdot width / (P-1)$
  - Communication: $\frac{height}{P-1}(t_{startup} + width \cdot t_{data}) \approx height \cdot width / (P-1)$
- Speed-up
  - $S(p) \approx \frac{2 \cdot height \cdot width \cdot avgIterations}{avgIterations \cdot height \cdot width / (P-1) + height \cdot width / (P-1)}$, which is on the order of $P-1$
- Computation/communication ratio
  - $\frac{2 \cdot height \cdot width \cdot avgIterations}{height \cdot (t_{startup} + width \cdot t_{data})}$, which is on the order of $avgIterations$
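The speed-up expression above can be evaluated directly to see how it behaves as P grows. This Python sketch uses the slide's own numerator and denominator with unit costs; the function name `static_speedup` is hypothetical.

```python
def static_speedup(height, width, avg_iterations, p):
    """Evaluate the slide's speed-up expression for the static approach."""
    sequential = 2 * height * width * avg_iterations
    slave_comp = avg_iterations * height * width / (p - 1)
    slave_comm = height * width / (p - 1)
    return sequential / (slave_comp + slave_comm)
```

Algebraically the expression reduces to 2 · avgIterations · (P-1) / (avgIterations + 1), so doubling P-1 doubles the predicted speed-up, and the image size cancels out.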
12 Monte Carlo Methods
Section 3.2.3 of Text
- Pseudo-code (Throw darts to converge on a solution)
  - Compute definite integral
    - While more iterations needed
      - Pick a point
      - Evaluate a function
      - Add to the answer
    - Compute average
  - Calculation of PI
    - While more iterations needed
      - Randomly pick a point
      - If point is in circle
        - within++
    - Compute PI = 4 * within / iterations
- Parallel Implementation
  - Need a parallel pseudo random generator (See notes below)
  - Minimal communication requirements
- Note: We can also use the upper right quadrant
Integral estimate: $\frac{1}{N} \sum_{1}^{N} f(pick.x) \, (x_{max} - x_{min})$
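The definite-integral estimate can be sketched in Python as follows. The function name `mc_integrate` and the optional `rng` parameter (a `random.Random` instance, so that runs can be repeated) are assumptions of this example.

```python
import math
import random

def mc_integrate(f, xmin, xmax, n, rng=None):
    """Average f at n random points in [xmin, xmax], times the interval width."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n):
        x = rng.uniform(xmin, xmax)   # pick a point
        total += f(x)                 # evaluate the function, add to the answer
    return total / n * (xmax - xmin)  # compute the average and scale
```

As a usage example, `4 * mc_integrate(lambda x: math.sqrt(1 - x * x), 0.0, 1.0, 200000)` approximates PI from the upper right quadrant of the unit circle.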
13 Computation of PI
$\int_{0}^{1} \sqrt{1 - x^2} \, dx = \pi / 4$
$2 \int_{-1}^{1} \sqrt{1 - x^2} \, dx = \pi$
Within if (point.x^2 + point.y^2) ≤ 1
Total points / Points within = Total Area / Area in shape
- Questions
  - How to handle the boundary condition?
  - What is the best accuracy that we can achieve?
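A minimal dart-throwing sketch in Python for the ratio above, using the square from -1 to 1 on both axes around the unit circle. The function name `estimate_pi` and the optional `rng` parameter are assumptions; points exactly on the boundary are counted as inside here, which is one possible answer to the boundary question.

```python
import random

def estimate_pi(iterations, rng=None):
    """Estimate PI as 4 * (points within the unit circle) / (total points)."""
    rng = rng or random.Random()
    within = 0
    for _ in range(iterations):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:   # boundary points count as within
            within += 1
    return 4.0 * within / iterations
```

The error of such an estimate shrinks roughly with the square root of the number of iterations, so each extra digit of accuracy costs about 100 times more points.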
14 Parallel Random Number Generator
- Numbers of a pseudo random sequence are
  - Uniformly distributed, with a large period, repeatable, and statistically independent
- Each processor must generate a unique sequence
- Accuracy depends upon random sequence precision
- Sequential linear generator (m is prime, a is a primitive root modulo m, c = 0)
  - $x_{i+1} = (a x_i + c) \bmod m$ (ex: $a = 16807$, $m = 2^{31} - 1$, $c = 0$)
  - Many other generators are possible
- Parallel linear generator with unique sequences
  - $x_{i+k} = (A x_i + C) \bmod m$, with $k = P$
  - $A = a^P \bmod m$, $C = c \, (a^{P-1} + a^{P-2} + \dots + a^1 + a^0) \bmod m$
[Figure: Parallel Random Sequence, showing the values x1, x2, ..., xP-1, xP, xP+1, ..., x2P-2, x2P-1]
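The jump-ahead generator can be sketched in Python with the example constants a = 16807, m = 2^31 - 1, c = 0. The function names and the way the first P values are handed out, one per processor, are assumptions of this sketch.

```python
A_SEQ, M, C_SEQ = 16807, 2**31 - 1, 0

def lcg(x, a=A_SEQ, c=C_SEQ, m=M):
    """One step of the sequential generator: x_{i+1} = (a*x_i + c) mod m."""
    return (a * x + c) % m

def leapfrog_params(p, a=A_SEQ, c=C_SEQ, m=M):
    """Return (A, C) so that x_{i+p} = (A*x_i + C) mod m."""
    big_a = pow(a, p, m)
    big_c = c * sum(pow(a, j, m) for j in range(p)) % m
    return big_a, big_c

def parallel_streams(seed, p, count):
    """Give processor j the values x_{j+1}, x_{j+1+p}, x_{j+1+2p}, ..."""
    big_a, big_c = leapfrog_params(p)
    starts = []
    x = seed
    for _ in range(p):             # the first p values come from the sequential rule
        x = lcg(x)
        starts.append(x)
    streams = []
    for x in starts:
        stream = [x]
        for _ in range(count - 1):
            x = (big_a * x + big_c) % M   # jump ahead p positions
            stream.append(x)
        streams.append(stream)
    return streams
```

Interleaving the P streams reproduces the sequential sequence exactly, so no two processors ever draw the same position in it.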