Title: Partitioning and divide and conquer strategies
1Partitioning and divide and conquer strategies
2Partitioning
- Partitioning simply divides the data into parts
- Data partitioning or domain decomposition
- Functional decomposition
- Divide and Conquer
- Characterized by dividing the problem into sub-problems of the same form as the larger problem. Further division into still smaller sub-problems is usually done by recursion.
- Recursive divide and conquer is amenable to parallelization because separate processes can be used for the divided parts.
- Also, the data is usually naturally localized.
3Partitioning/Divide and conquer examples
- Many possibilities
- Operations on sequences of numbers, such as simply adding them together
- Sorting algorithms can often be partitioned or constructed in a recursive fashion
- Numerical integration (quadrature)
- N-body problem
4Summing numbers
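We have a set of n numbers $x_1, x_2, \ldots, x_n$ that we wish to collapse into a single sum. With p processors, divide the set into p parts of n/p items each.
- Each processor performs $n/p - 1$ additions on its own part (roughly n/p).
- The recombination phase takes $p - 1$ additions.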
5Speedup analysis
Number of steps:
$t_{serial} \approx n - 1$
$t_{parallel} \approx n/p + p - 2$
Speed-up:
$S = t_{serial} / t_{parallel} \approx \dfrac{n - 1}{n/p + p - 2}$
No speedup for $n \approx p$. Parallel performance is worse for $n < p$. For large n, $S \to p$, as expected.
(The analysis ignores start-up and communication times; see text p110.)
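The step counts above can be checked with a small simulation. This is a minimal sketch, not code from the text: the names `partitioned_sum` and `speedup` are made up for illustration, the "processors" are simulated in a single process, and p is assumed to divide n.

```python
def partitioned_sum(xs, p):
    """Sum xs with p simulated processors; return (total, parallel_steps)."""
    n = len(xs)
    assert n % p == 0, "this sketch assumes p divides n"
    size = n // p
    parts = [xs[i * size:(i + 1) * size] for i in range(p)]
    # Phase 1: every processor adds its own n/p numbers.
    # These run concurrently, so they cost n/p - 1 steps in total.
    partials = [sum(part) for part in parts]
    local_steps = size - 1
    # Phase 2: one processor adds the p partial sums (p - 1 additions).
    total = sum(partials)
    combine_steps = p - 1
    return total, local_steps + combine_steps


def speedup(n, p):
    """S = (n - 1) / (n/p + p - 2), ignoring start-up and communication."""
    return (n - 1) / (n / p + p - 2)
```

For example, `partitioned_sum(list(range(1, 17)), 4)` counts 3 local steps plus 3 recombination steps, which matches n/p + p - 2 = 6, and `speedup(16, 16)` evaluates to 1.0, the "no speedup" case.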
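6Divide and conquer implementation
Termination when 2 numbers are left.
Sequential recursive add (pseudocode):
    int add(int *s)                        /* add list of numbers, s */
    {
        if (number(s) <= 2)
            return (n1 + n2);              /* the numbers left in s */
        else {
            divide(s, s1, s2);             /* divide s into two parts, s1 and s2 */
            part_sum1 = add(s1);           /* recursive calls on the sub-lists */
            part_sum2 = add(s2);
            return (part_sum1 + part_sum2);
        }
    }
add calls itself recursively.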
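A runnable version of the recursive add might look like the following. It is only a sketch: it assumes the list is non-empty, splits it at the midpoint (the pseudocode does not say how `divide` chooses the parts), and also accepts a single leftover number in the base case.

```python
def add(s):
    """Recursively sum a non-empty list of numbers by halving it."""
    if len(s) <= 2:
        return sum(s)              # base case: one or two numbers left
    mid = len(s) // 2              # divide s into two parts, s1 and s2
    s1, s2 = s[:mid], s[mid:]
    part_sum1 = add(s1)            # recursive calls on the sub-lists
    part_sum2 = add(s2)
    return part_sum1 + part_sum2
```

The two recursive calls are independent of each other, which is what makes the tree of calls a candidate for parallel execution.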
7Tree diagram
[Figure: tree of recursive add() calls.]
In a parallel implementation, we want to traverse
several parts of the tree simultaneously.
8Obvious (naïve) method
Assign one processor to each node. This requires 2m-1 processors and is very inefficient, as many processors would be idle.
Better: re-use processors at different levels of the tree.
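9[Figure: summing with processors re-used at each level. The original list is divided among P0 to P7 (list division); partial sums are then combined on P0, P2, P4 and P6, next on P0 and P4, and the final sum is formed on P0.]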
10Speedup in divide and conquer
There are $\log_2 p$ levels in the tree.
Number of computational steps: $t_{parallel} = n/p + \log_2 p$
Speedup: $S = \dfrac{n - 1}{n/p + \log_2 p}$
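The tree combination with re-used processors can be simulated as below. This is a sketch under stated assumptions: `tree_sum` is a hypothetical name, p is a power of two that divides n, and the processors are simulated by dictionary entries rather than real processes.

```python
import math


def tree_sum(xs, p):
    """Tree-structured sum on p simulated processors; return (total, levels)."""
    assert p >= 1 and (p & (p - 1)) == 0, "sketch assumes p is a power of two"
    assert len(xs) % p == 0, "sketch assumes p divides n"
    size = len(xs) // p
    # Each processor first adds its own n/p numbers.
    partial = {i: sum(xs[i * size:(i + 1) * size]) for i in range(p)}
    levels, stride = 0, 1
    while stride < p:
        # Processor i absorbs the partial sum of processor i + stride,
        # so the survivors are P0, P2, P4, ... then P0, P4, ... then P0.
        for i in range(0, p, 2 * stride):
            partial[i] += partial.pop(i + stride)
        stride *= 2
        levels += 1
    return partial[0], levels


def speedup(n, p):
    """S = (n - 1) / (n/p + log2 p), as in the formula above."""
    return (n - 1) / (n / p + math.log2(p))
```

With p = 8 the loop runs three times, matching the $\log_2 p$ levels of the tree, and the final sum ends up on processor 0.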
11M-ary divide and conquer
Divide and conquer can also be applied where a task is divided into more than two parts at each stage:
    int add(int *s)
    {
        if (number(s) <= M)
            return (n1 + n2 + ... + nM);
        else {
            divide(s, s1, s2, ..., sM);
            part_sum1 = add(s1);
            part_sum2 = add(s2);
            ...
            part_sumM = add(sM);
            return (part_sum1 + part_sum2 + ... + part_sumM);
        }
    }
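The M-ary version can be sketched in runnable form as follows. The name `add_m` and the way the list is split into nearly equal parts are assumptions made for illustration; the pseudocode leaves `divide` unspecified.

```python
def add_m(s, m):
    """Recursively sum list s by dividing it into m parts at each stage."""
    assert m >= 2, "need at least two parts per stage"
    if len(s) <= m:
        return sum(s)                      # at most m numbers left: add directly
    q, r = divmod(len(s), m)               # part sizes differ by at most one
    parts, start = [], 0
    for k in range(m):
        end = start + q + (1 if k < r else 0)
        parts.append(s[start:end])
        start = end
    # One recursive call per part; the calls are independent of each other.
    return sum(add_m(part, m) for part in parts)
```

With m = 2 this reduces to the binary recursive add; larger m gives a shallower tree with more work at each node.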
12Sorting numbers
Given a sequence of unsorted numbers, one can use a number of O(n log n) algorithms to sort them (e.g. quicksort). How can we parallelize this?
13Sorting using bucket sort
Assume the unsorted numbers are uniformly distributed over some range.
- Divide the range into equal-sized regions.
- Assign one bucket to each region.
- Place each number into the bucket for its region.
- Sort each bucket using some standard algorithm (e.g. quicksort).
- Concatenate the buckets.
(See text p117 (p119 in the old edition) for diagrams and more explanation.)
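The sequential bucket sort can be sketched as below. The function name and parameters are illustrative assumptions: the numbers are assumed to lie in the range from `lo` to `hi`, and Python's built-in `sorted` stands in for quicksort.

```python
def bucket_sort(xs, num_buckets, lo, hi):
    """Sort numbers assumed to lie in [lo, hi] using equal-width buckets."""
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for x in xs:
        # Index of the region containing x; clamp so x == hi lands in the last bucket.
        i = min(int((x - lo) / width), num_buckets - 1)
        buckets[i].append(x)
    out = []
    for b in buckets:
        out.extend(sorted(b))      # sort each bucket, then concatenate in order
    return out
```

Because bucket i only holds values smaller than those in bucket i + 1, concatenating the sorted buckets yields a fully sorted sequence.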
14Parallelizing bucket sort
Obvious method: one processor per bucket!
15Further parallelization
Partition the sequence into p regions, one region for each processor. Each processor maintains p small buckets and separates the numbers in its region into its own small buckets. The small buckets are then emptied into the p final buckets for sorting. This requires each processor to send one small bucket to each of the other processors (bucket i to processor i).
16Recall all-to-all broadcast
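[Figure: the unsorted numbers are split into n/p numbers for each of the p processors; each processor fills p small buckets, which are emptied into the large buckets; the contents of the buckets are sorted and the lists merged to give the sorted numbers.]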
17Steps on one processor
- Take its subset of the numbers.
- Assign p small buckets to cover the range of the numbers. Note: small bucket i covers the same range as the large bucket on processor i.
- Throw the numbers into the small buckets, depending on the value of each number.
- Empty each small bucket into the appropriate large bucket, i.e. send small bucket i to processor i.
- Sort its own large bucket.
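The steps above can be simulated in a single process. This is a sketch only: `parallel_bucket_sort` is a hypothetical name, the p processors are represented by list indices, the exchange of small buckets is modelled by list concatenation rather than real message passing, and the numbers are assumed to lie between `lo` and `hi`.

```python
def parallel_bucket_sort(xs, p, lo, hi):
    """Simulate parallel bucket sort on p processors for values in [lo, hi]."""
    width = (hi - lo) / p

    def bucket_of(x):
        # Small bucket i covers the same range as large bucket i.
        return min(int((x - lo) / width), p - 1)

    # Each processor k takes a contiguous subset of about n/p numbers.
    size = -(-len(xs) // p)        # ceiling division
    subsets = [xs[k * size:(k + 1) * size] for k in range(p)]
    # Each processor throws its numbers into its own p small buckets: small[k][i].
    small = [[[] for _ in range(p)] for _ in range(p)]
    for k, subset in enumerate(subsets):
        for x in subset:
            small[k][bucket_of(x)].append(x)
    # Exchange: small bucket i of every processor is sent to processor i.
    large = [[x for k in range(p) for x in small[k][i]] for i in range(p)]
    # Each processor sorts its own large bucket; the results are then concatenated.
    return [x for b in large for x in sorted(b)]
```

In a real message-passing program the exchange step is where every processor sends one small bucket to every other processor, which is the costly part this simulation hides.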
18Analysis
Sequential bucket sort (with p buckets):
$t \approx n + p\,(n/p)\log(n/p) = n + n\log(n/p)$
Parallel bucket sort (p small buckets plus one big bucket per processor):
$t_i \approx (n/p) + (n/p)\log(n/p)$
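As a rough check, dividing the sequential estimate by the parallel one gives the ideal speedup. This only holds under the assumptions of the estimates themselves, which contain computation terms only and so leave out the cost of exchanging the small buckets:

```latex
\frac{t}{t_i} \approx
\frac{n + n\log(n/p)}{(n/p) + (n/p)\log(n/p)}
= \frac{n\,\bigl(1 + \log(n/p)\bigr)}{(n/p)\,\bigl(1 + \log(n/p)\bigr)}
= p
```

Any real implementation would fall short of this because of the communication in the exchange step.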
19Numerical integration - quadrature
- Just straightforward data partitioning
- Several ways of doing the quadrature
- Using rectangles
- Trapezoidal method
- Adaptive quadrature
See text
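Data partitioning for the trapezoidal method might be sketched as follows. The function names and the equal-width split of the interval are assumptions for illustration; each of the p sub-intervals would go to a separate process in a real parallel program, whereas here they are evaluated one after another.

```python
import math


def trapezoid(f, a, b, steps):
    """Trapezoidal estimate of the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h


def partitioned_trapezoid(f, a, b, p, steps_per_part):
    """Split [a, b] into p equal sub-intervals, one per (simulated) process."""
    width = (b - a) / p
    partial_areas = [
        trapezoid(f, a + i * width, a + (i + 1) * width, steps_per_part)
        for i in range(p)
    ]
    return sum(partial_areas)      # recombination: add the p partial areas
```

For instance, `partitioned_trapezoid(math.sin, 0, math.pi, 4, 250)` approximates the exact value 2 of the integral of sin over [0, pi].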
20N-body simulations
Finding positions and movements of bodies in
space subject to the gravitational forces from
other bodies, using Newtonian laws of physics.
21Gravitational N-body equations
The gravitational force between two bodies of masses $m_1$ and $m_2$ is
$F = \dfrac{G m_1 m_2}{r^2}$
where G is the gravitational constant and r is the distance between the bodies. Subject to forces, a body accelerates according to Newton's second law:
$F = ma$
22Details
Let the time interval be $\Delta t$. For a body of mass m, the force is
$F = \dfrac{m(v^{t+1} - v^t)}{\Delta t}$
so the new velocity is
$v^{t+1} = v^t + \dfrac{F \Delta t}{m}$
where $v^{t+1}$ is the velocity at time t+1 and $v^t$ is the velocity at time t. Over the time interval $\Delta t$, the position changes by
$x^{t+1} - x^t = v \Delta t$
where $x^t$ is the position at time t. Once the bodies move to their new positions, the forces change, so the computation has to be repeated.
23Sequential code
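The overall gravitational N-body computation can be described by:
    for (t = 0; t < tmax; t++) {             /* for each time period */
        for (i = 0; i < N; i++) {            /* for each body */
            F = force_func(i);               /* compute force on body i */
            v_new[i] = v[i] + F * dt / m;    /* compute new velocity */
            x_new[i] = x[i] + v[i] * dt;     /* and new position */
        }
        for (i = 0; i < N; i++) {            /* for each body */
            x[i] = x_new[i];                 /* update position */
            v[i] = v_new[i];                 /* and velocity */
        }
    }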
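One time step of the direct method might look like this in runnable form. It is a sketch under assumptions: two-dimensional positions stored as tuples, hypothetical names `force_on` and `step`, distinct body positions, and the position update using the velocity at time t, as in the loop above.

```python
import math

G = 6.674e-11  # gravitational constant in SI units


def force_on(i, pos, mass):
    """Net gravitational force on body i from all other bodies (2-D)."""
    fx = fy = 0.0
    for j in range(len(pos)):
        if j == i:
            continue
        dx = pos[j][0] - pos[i][0]
        dy = pos[j][1] - pos[i][1]
        r = math.hypot(dx, dy)
        f = G * mass[i] * mass[j] / r**2      # F = G m1 m2 / r^2
        fx += f * dx / r                      # resolve along the line joining the bodies
        fy += f * dy / r
    return fx, fy


def step(pos, vel, mass, dt):
    """Advance all bodies by one time interval dt; returns new lists."""
    new_pos, new_vel = [], []
    for i in range(len(pos)):
        fx, fy = force_on(i, pos, mass)
        # New velocity: v(t+1) = v(t) + F dt / m
        new_vel.append((vel[i][0] + fx * dt / mass[i],
                        vel[i][1] + fy * dt / mass[i]))
        # New position from the velocity at time t: x(t+1) = x(t) + v dt
        new_pos.append((pos[i][0] + vel[i][0] * dt,
                        pos[i][1] + vel[i][1] * dt))
    return new_pos, new_vel
```

Building the new lists before replacing the old ones mirrors the two inner loops of the pseudocode: every force in a step is computed from the positions at time t.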
24Parallel code
The sequential algorithm is an $O(N^2)$ algorithm (for one iteration), as each of the N bodies is influenced by each of the other N - 1 bodies. It is not feasible to use this direct algorithm for the more interesting N-body problems, where N is very large.
25Time complexity can be reduced by using the fact
that a cluster of distant bodies can be
approximated as a single distant body of the
total mass of the cluster sited at the centre of mass of the cluster.
[Figure: a distant cluster of bodies, at distance r, approximated by a single body at its centre of mass.]
26Barnes-Hut Algorithm
- Start with the whole space, in which one cube contains all the bodies (or particles).
- First, this cube is divided into eight subcubes.
- If a subcube contains no particles, it is deleted from further consideration.
- If a subcube contains one body, this subcube is retained.
- If a subcube contains more than one body, it is recursively divided until every subcube contains one body.
27This creates an octree: a tree with up to eight edges from each node. The leaves represent cells containing one body. After the tree has been constructed, the total mass and centre of mass of each subcube are stored at its node.
28The force on each body is obtained by traversing the tree, starting at the root, and stopping at a node when the clustering approximation can be used, i.e. when
$r \geq d / \theta$
where r is the distance to the centre of mass of the node's subcube, d is the side length of that subcube, and $\theta$ is a constant, typically 1.0 or less. Constructing the tree requires a time complexity of O(n log n), and so does computing the forces, so the overall time complexity of the method is O(n log n).
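The tree construction and traversal can be sketched in two dimensions, where each square is divided into four subsquares (a quadtree) instead of eight subcubes. Everything here is an illustrative assumption rather than the text's code: the names `Node` and `force`, bodies stored as (x, y, mass) tuples at distinct positions, G defaulting to 1, and empty cells kept with zero mass instead of being deleted.

```python
import math


class Node:
    def __init__(self, cx, cy, d):
        self.cx, self.cy, self.d = cx, cy, d   # centre and side length of the cell
        self.body = None                        # the single body of a leaf cell
        self.children = None                    # four sub-cells once subdivided
        self.mass = 0.0
        self.com = (0.0, 0.0)                   # centre of mass

    def _child_for(self, b):
        ix = 0 if b[0] < self.cx else 1
        iy = 0 if b[1] < self.cy else 1
        return self.children[2 * ix + iy]

    def insert(self, b):
        if self.children is None and self.body is None:
            self.body = b                       # empty cell: keep the body here
            return
        if self.children is None:               # occupied leaf: divide it
            h = self.d / 2
            self.children = [Node(self.cx + sx * h / 2, self.cy + sy * h / 2, h)
                             for sx in (-1, 1) for sy in (-1, 1)]
            old, self.body = self.body, None
            self._child_for(old).insert(old)
        self._child_for(b).insert(b)

    def summarize(self):
        """Store total mass and centre of mass at every node, bottom-up."""
        if self.children is None:
            if self.body is not None:
                self.mass, self.com = self.body[2], (self.body[0], self.body[1])
            return
        mx = my = 0.0
        self.mass = 0.0
        for c in self.children:
            c.summarize()
            self.mass += c.mass
            mx += c.mass * c.com[0]
            my += c.mass * c.com[1]
        self.com = (mx / self.mass, my / self.mass)


def force(node, b, theta=0.5, G=1.0):
    """Approximate force on body b by traversing the tree from node."""
    if node.mass == 0.0 or node.body is b:
        return 0.0, 0.0                         # empty cell, or b itself
    dx, dy = node.com[0] - b[0], node.com[1] - b[1]
    r = math.hypot(dx, dy)
    if node.children is None or r >= node.d / theta:
        f = G * b[2] * node.mass / r**2         # treat the cell as one distant body
        return f * dx / r, f * dy / r
    fx = fy = 0.0
    for c in node.children:                     # too close: descend into sub-cells
        cfx, cfy = force(c, b, theta, G)
        fx, fy = fx + cfx, fy + cfy
    return fx, fy
```

A typical use is to create a root cell covering all bodies, call `insert` for each body, call `summarize` once, and then call `force(root, b)` for every body b. Smaller values of theta open more cells and approach the direct $O(N^2)$ result.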
29Recursive division of two-dimensional space