Title: Parallel Quicksort
1. Parallel Quicksort
- Parallel Quicksort for a shared-memory computer
- every processor gets n/p elements
- repeat
  - choose a pivot and broadcast it
  - each processor i splits its sequence into Li (< pivot) and Ri (> pivot); see the split sketch below
  - collect all Li's and Ri's into a global L and R
  - split the processors into left and right groups in the ratio |L| : |R|
  - the left processors recursively sort L, the right processors R
- until a single processor is left for the whole (reduced) range
- sort your range sequentially
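As a concrete illustration of the split step, here is a minimal C sketch; the buffer-based scheme and all names are assumptions for illustration, not from the slides (ties with the pivot go to R here):

#include <stddef.h>

/* Split local data into L (< pivot) and R (>= pivot).
 * li/ri are caller-provided buffers of capacity n;
 * the resulting sizes are returned through *ln and *rn. */
static void local_split(const int *data, size_t n, int pivot,
                        int *li, size_t *ln, int *ri, size_t *rn)
{
    *ln = *rn = 0;
    for (size_t k = 0; k < n; k++) {
        if (data[k] < pivot)
            li[(*ln)++] = data[k];
        else
            ri[(*rn)++] = data[k];
    }
}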
2. Sample Sort
- Goal: get rid of the uniformly-distributed-data assumption of Bucket sort
- Idea: sample the input data and compute a good set of bucket intervals
- High-level description
  - sort the local data and choose p-1 evenly spaced pivots (see the sketch after this list)
  - send these pivots to the master process
  - the master collects all p(p-1) pivots, sorts them, and chooses p-1 evenly spaced (among the p(p-1) pivots) master pivots
  - the master broadcasts the master pivots
  - the master pivots define the buckets used by each process
  - continue as in Bucket sort
    - distribute the local data into the buckets (e.g. by inserting the master pivots into the sorted local sequence using binary search)
    - send the buckets to the appropriate processes
    - sort the received data
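A minimal C sketch of the local pivot selection; the names are illustrative assumptions, and the local data is assumed to be already sorted:

#include <stddef.h>

/* From a sorted local array of n keys, pick p-1 evenly spaced
 * samples as this process's local pivot candidates. */
static void choose_local_pivots(const int *sorted, size_t n,
                                int p, int *pivots /* p-1 slots */)
{
    for (int s = 1; s < p; s++)
        pivots[s - 1] = sorted[(size_t)s * n / p];
}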
3. Sorting Summary
- Odd-even transposition sort: O(n) time with n processors; for a linear array; simple and inefficient, but the best we can hope for on a linear array if only communication with direct neighbours is possible
- Shear sort: O(√n (log n + 1)) time with n processors; for a 2D mesh; simple; there is an O(√n) version, which is optimal on a 2D mesh if only communication with direct neighbours is possible
- Bitonic sort: O(log² n) time with n processors (on a hypercube); simple, still not cost-optimal; hypercube and 2D mesh implementations
- Quick Sort: O(log² n) time with n processors (with good pivots); shared-memory and message-passing implementations
- Sample Sort: for n > p²; partially alleviates the problem of uneven data distribution
- There exists an O(log n) sorting algorithm with n processors, but it is very complicated and the constant factors are too high to be practical.
4. Matrix Algorithms
- Matrix-Vector Multiplication
  - with 1D partitioning
  - with 2D partitioning
- Matrix-Matrix Multiplication
  - simple parallel algorithm
  - Cannon's algorithm
  - the DNS algorithm
- Solving Systems of Linear Equations
  - simple Gaussian elimination
  - with pipelining
  - with partial pivoting
  - back substitution
5. Matrix-Vector Multiplication
Sequential code:

Vector MAT_VECT(A, x, y) {
    for (i = 0; i < n; i++) {
        y[i] = 0;
        for (j = 0; j < n; j++)
            y[i] = y[i] + A[i,j] * x[j];
    }
    return y;
}
6. Matrix-Vector Multiplication II
- Analysis (1D partitioning, p processors)
  - all-to-all broadcast: ts log p + tw (n/p)(p-1) ≈ ts log p + tw n
  - local computation: n × n/p = n²/p
  - Total: Tp ≈ n²/p + ts log p + tw n
7. Matrix-Vector Mult., 2D Partitioning
8. Matrix-Vector Mult., 2D Partitioning
- Analysis
  - distributing vector x to the diagonal: ts + tw(n/√p)
  - broadcasting the components of vector x within the columns: (ts + tw(n/√p)) log(√p)
  - local computation: n²/p
  - reduction within the rows: (ts + tw(n/√p)) log(√p)
- Total
  - Tp ≈ n²/p + ts log p + tw(n/√p) log p
- 1D vs 2D
  - difference only in the last term
  - 1D: tw n vs 2D: tw(n/√p) log p
  - conclusion: 2D partitioning can use up to n² processors and has smaller bandwidth requirements, i.e. it can scale better
9. Matrix Algorithms
- Matrix-Vector Multiplication
  - with 1D partitioning
  - with 2D partitioning
- Matrix-Matrix Multiplication
  - simple parallel algorithm
  - Cannon's algorithm
  - the DNS algorithm
- Solving Systems of Linear Equations
  - simple Gaussian elimination
  - with pipelining
  - with partial pivoting
  - back substitution
Matrix algorithms based on Kumar's book.
10. Matrix-Matrix Multiplication
- Sequential code (below)
- Sequential complexity: O(n³)
- slightly better algorithms are possible (e.g. Strassen's algorithm), but the constant factors render them impractical for most realistic values of n

MAT_MULT(A, B, C) {
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            C[i,j] = 0;
            for (k = 0; k < n; k++)
                C[i,j] = C[i,j] + A[i,k] * B[k,j];
        }
}
11. Simple Matrix-Matrix Multiplication
12. Simple Matrix-Matrix Multiplication II
- Complexity
  - initial row-broadcast of blocks of A and column-broadcast of blocks of B: 2(ts log(√p) + tw(n²/p)(√p-1))
  - local computation: √p × (n/√p)³ = n³/p
  - Total: Tp ≈ n³/p + ts log p + 2tw(n²/√p)
- Memory usage
  - a block of rows of A and a block of columns of B
  - 2(n × n/√p) per processor, 2n²√p in total
  - wasteful compared to the sequential algorithm's 2n²
13. Cannon's Matrix-Matrix Multiplication
- Main idea
  - pipeline the simple algorithm by shifting the blocks after each block multiplication in a smart way, so that each process needs to store only one block of A and one block of B
  - rotate row/column i of A/B i steps to the left/up
Initial alignment of A and B (4×4 example): row i of A is rotated i steps left and column j of B is rotated j steps up, so process (i,j) holds the block product A[i,(i+j) mod 4] · B[(i+j) mod 4, j]:

A0,0·B0,0   A0,1·B1,1   A0,2·B2,2   A0,3·B3,3
A1,1·B1,0   A1,2·B2,1   A1,3·B3,2   A1,0·B0,3
A2,2·B2,0   A2,3·B3,1   A2,0·B0,2   A2,1·B1,3
A3,3·B3,0   A3,0·B0,1   A3,1·B1,2   A3,2·B2,3
14. Cannon's Matrix-Matrix Multiplication II
In each step, every block of A shifts one step left and every block of B one step up; after shift s, process (i,j) holds A[i,(i+j+s) mod 4] · B[(i+j+s) mod 4, j]:

After first shift:
A0,1·B1,0   A0,2·B2,1   A0,3·B3,2   A0,0·B0,3
A1,2·B2,0   A1,3·B3,1   A1,0·B0,2   A1,1·B1,3
A2,3·B3,0   A2,0·B0,1   A2,1·B1,2   A2,2·B2,3
A3,0·B0,0   A3,1·B1,1   A3,2·B2,2   A3,3·B3,3

After second shift:
A0,2·B2,0   A0,3·B3,1   A0,0·B0,2   A0,1·B1,3
A1,3·B3,0   A1,0·B0,1   A1,1·B1,2   A1,2·B2,3
A2,0·B0,0   A2,1·B1,1   A2,2·B2,2   A2,3·B3,3
A3,1·B1,0   A3,2·B2,1   A3,3·B3,2   A3,0·B0,3

After third shift:
A0,3·B3,0   A0,0·B0,1   A0,1·B1,2   A0,2·B2,3
A1,0·B0,0   A1,1·B1,1   A1,2·B2,2   A1,3·B3,3
A2,1·B1,0   A2,2·B2,1   A2,3·B3,2   A2,0·B0,3
A3,2·B2,0   A3,3·B3,1   A3,0·B0,2   A3,1·B1,3
- Analysis
  - initial alignment shifts: 2(ts + tw(n²/p))
  - communication in one step: 2(ts + tw(n²/p))
  - local computation in one step: (n/√p)³
  - number of steps: √p
  - Total: Tp = n³/p + 2ts√p + 2tw(n²/√p)
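The shift pattern maps naturally onto MPI's Cartesian topology routines. Below is a minimal MPI sketch of one possible implementation, not the slides' reference code: it assumes p is a perfect square, n is divisible by √p, and each process already holds its nb×nb blocks a and b (with c zeroed); block_mult and all names are illustrative.

#include <mpi.h>
#include <math.h>

static void block_mult(const double *a, const double *b, double *c, int nb)
{
    for (int i = 0; i < nb; i++)           /* c += a * b on one nb x nb block */
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                c[i*nb + j] += a[i*nb + k] * b[k*nb + j];
}

void cannon(double *a, double *b, double *c, int nb, MPI_Comm comm)
{
    int p, rank, coords[2], dims[2], periods[2] = {1, 1};
    int src, dst, left, right, up, down;
    MPI_Comm grid;

    MPI_Comm_size(comm, &p);
    dims[0] = dims[1] = (int)round(sqrt((double)p));   /* sqrt(p) x sqrt(p) torus */
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* initial alignment: row i of A shifts i steps left, column j of B j steps up */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    MPI_Cart_shift(grid, 1, -1, &right, &left);        /* neighbours for the steps */
    MPI_Cart_shift(grid, 0, -1, &down, &up);

    for (int s = 0; s < dims[0]; s++) {                /* sqrt(p) multiply-shift steps */
        block_mult(a, b, c, nb);
        MPI_Sendrecv_replace(a, nb*nb, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(b, nb*nb, MPI_DOUBLE, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}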
15. The DNS Algorithm (still M3)
- Cannon's algorithm with n² processors takes O(n) time.
- Can we have a faster Matrix-Matrix Multiplication algorithm?
- Yes, but we need more processors.
- Algorithm DNS
  - take n³ processors Pi,j,k for i,j,k = 0..n-1
  - Ai,j and Bi,j are initially stored at processor Pi,j,0
  - A and B are broadcast so that processor Pi,j,k gets Ai,k and Bk,j
  - processor Pi,j,k computes Ai,k × Bk,j
  - processors Pi,j,* perform a Reduce() of the results
  - processor Pi,j,0 stores the result Ci,j
- The only nontrivial part: how to efficiently broadcast A and B so that everybody gets what it needs? (a small simulation of the resulting data placement follows below)
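Before the communication details, a tiny sequential C simulation (n = 4, purely illustrative) of what the broadcasts achieve: "processor" (i,j,k) ends up holding A[i][k] and B[k][j], and the reduction over k produces C[i][j].

#include <stdio.h>

enum { N = 4 };

int main(void)
{
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;       /* arbitrary test data */
            B[i][j] = (i == j);    /* identity, so C should equal A */
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];   /* Pi,j,k's product, reduced over k */
    printf("C[1][2] = %g (expect A[1][2] = %g)\n", C[1][2], A[1][2]);
    return 0;
}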
16. The DNS Algorithm II
[Figure: the n×n×n processor cube with axes i, j, k (planes k = 0..3); A and B initially reside in the k = 0 plane.]
17. The DNS Algorithm III
[Figure: distribution of A after Ai,j has been moved to plane k = j and broadcast along the j axis; plane k holds the blocks A0,k .. A3,k, replicated for every j.]
18. The DNS Algorithm IV
- Analysis
  - moving Ai,j (Bi,j) from Pi,j,0 to Pi,j,j (Pi,j,i): ts + tw
  - broadcasting within a row/column: (ts + tw) log n
  - local computation: 1
  - reduction: (ts + tw) log n
  - Total: Tp = 1 + ts(2 log n + 1) + tw(2 log n + 1)
- Note: not cost-optimal, as the processor-time product is O(n³ log n). Can be adapted for p = q³ < n³ processors. If p < n³/log n, then the DNS algorithm is cost-optimal.
19. Matrix Algorithms
- Matrix-Vector Multiplication
  - with 1D partitioning
  - with 2D partitioning
- Matrix-Matrix Multiplication
  - simple parallel algorithm
  - Cannon's algorithm
  - the DNS algorithm
- Solving Systems of Linear Equations
  - simple Gaussian elimination
  - with pipelining
  - with partial pivoting
  - back substitution
20. Solving a Set of Linear Equations
We want to solve this set of equations:

     x1 +  x2 - 2x3 - 3x4 = 0
    3x1 + 4x2 - 4x3 -  x4 = 0
    3x1 +  x2 - 2x3 -  x4 = 0
    4x1 -  x2 + 2x3 + 5x4 = 3

It can be represented by the following augmented matrix:

    [ 1   1  -2  -3 | 0 ]
    [ 3   4  -4  -1 | 0 ]
    [ 3   1  -2  -1 | 0 ]
    [ 4  -1   2   5 | 3 ]
21. Gaussian Elimination - Sequential
The system of equations is first reduced to upper triangular form; the solution is then computed using back substitution. We focus on the first step:

SequentialGE() {
    for (k = 0; k < n; k++) {
        DivideRow(k);                // divide row k by its pivot A[k,k]
        for (i = k+1; i < n; i++)
            EliminateRow(i, k);      // subtract A[i,k] x row k from row i
    }
}

22-30. Gaussian Elimination - Sequential (worked example)
On the example system, the algorithm performs these steps (EliminateRow(i,k) updates row i using pivot row k):
- DivideRow(0)
- EliminateRow(1,0): row1 = row1 - 3 × row0
- EliminateRow(2,0): row2 = row2 - 3 × row0
- EliminateRow(3,0): row3 = row3 - 4 × row0
- DivideRow(1)
- EliminateRow(2,1): row2 = row2 + 2 × row1
- EliminateRow(3,1): row3 = row3 + 5 × row1
- DivideRow(2)
- EliminateRow(3,2): row3 = row3 - 20 × row2
- DivideRow(3)
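For concreteness, here is a runnable C version of the forward-elimination phase on the example's augmented matrix; storing the right-hand side as column N of the array is an assumption for illustration, not the slides' layout.

#include <stdio.h>

enum { N = 4 };

/* Augmented matrix: column N holds the right-hand side. */
static double a[N][N + 1] = {
    {1,  1, -2, -3, 0},
    {3,  4, -4, -1, 0},
    {3,  1, -2, -1, 0},
    {4, -1,  2,  5, 3},
};

int main(void)
{
    for (int k = 0; k < N; k++) {
        double pivot = a[k][k];             /* DivideRow(k): normalize row k */
        for (int j = k; j <= N; j++)
            a[k][j] /= pivot;
        for (int i = k + 1; i < N; i++) {   /* EliminateRow(i,k) */
            double f = a[i][k];
            for (int j = k; j <= N; j++)
                a[i][j] -= f * a[k][j];
        }
    }
    /* a[][] is now upper triangular with a unit diagonal */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j <= N; j++)
            printf("%6.1f ", a[i][j]);
        printf("\n");
    }
    return 0;
}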
31. Gaussian Elimination
- Analysis of the sequential algorithm
  - dividing rows: n(n-1)/2 ≈ n²/2
  - elimination: ≈ 2n³/3
  - total cost: O(n³)
- Quite a lot, let's parallelize it!
- Hmm, where to start?

SequentialGE() {
    for (k = 0; k < n; k++) {
        DivideRow(k);
        for (i = k+1; i < n; i++)
            EliminateRow(i, k);
    }
}
32. Gaussian Elimination
- Idea
  - let's assign each row to one processor
  - we can then do all the elimination steps in parallel, as they would be done by different processors
  - that is good, as elimination is the costly step
33. Simple Parallel Gaussian Elimination
[Figure: one row per processor p0, p1, ...; p0 performs DivideRow(0) while the processors below eliminate in parallel.]
34. Simple Parallel Gaussian Elimination
[Figure: the next iteration; p1 performs DivideRow(1).]
35. Simple Parallel Gaussian Elimination
36. Simple Parallel Gaussian Elimination II
- Analysis (simple parallel)
  - dividing rows: O(n²)
  - broadcast: O(n(ts + tw n) log n) = O(ts n log n + tw n² log n)
  - elimination: O(n²)
  - Total time: Tp = O(ts n log n + tw n² log n)
  - Cost (with n processors): O(n³ log n)
37. Pipelined Parallel Gaussian Elimination (1D)
- Problem
  - simple parallel Gaussian elimination is not cost-optimal because of the cost of broadcasting
- Solution
  - pipeline the broadcasting of rows
  - when a process gets data, it forwards it down the pipeline and only then proceeds to process the data (see the sketch after the analysis below)
38. Pipelined Parallel Gaussian Elimination (1D)
- Analysis
  - dividing rows: O(n²)
  - forwarding: (n-1)(ts + tw n) ≈ ts n + tw n²
  - start-up: (n-1)(ts + tw n)
  - elimination: O(n²)
  - Total time: O(n² + ts n + tw n²)
  - Total cost: O(n³)
- Fewer than n processes
  - block vs striped partitioning
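In the slides' pseudocode style, the per-process loop under pipelining might look as follows; send/recv and the prev/next neighbour naming are assumed primitives, not the slides' notation:

// Pipelined 1D Gaussian elimination, process owning row r (sketch).
for (k = 0; k < r; k++) {
    recv(prev, row_k);        // receive pivot row k from the pipeline
    send(next, row_k);        // forward it downstream immediately
    EliminateRow(r, k);       // only then eliminate with it
}
DivideRow(r);                 // row r has become the pivot row
send(next, row_r);            // start its journey down the pipeline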
39. 2D Partitioning
- Why?
  - to make use of more than n processors
- Issues
  - pipelining, block vs striped partitioning
- Analysis (n² processors)
  - each computation step: O(1)
  - each communication step: ts + tw
  - number of communication/computation steps per process: O(n)
  - Total time: O(n + ts n + tw n)
  - Total cost: O(n³)
- Code for processor Ai,j
  - for i < j
  - for i = j
  - for i > j
40. Partial Pivoting
- Problem
  - if Ak,k is 0 or close to 0, numerical errors are greatly magnified
- Solution: partial pivoting
  - examine row k and choose the pivot column j such that |Ak,j| is largest in row k
  - swap columns j and k (either explicitly or implicitly)
- Issues
  - other rows need to learn the pivot column
  - if row-wise 1D partitioning is used, the pivot can be identified locally and passed along with the data
  - column-wise 1D partitioning requires communication to identify the pivot (and swapping columns can unbalance the load)
  - 2D partitioning cannot be well pipelined with partial pivoting: identifying the pivot for iteration k+1 cannot commence immediately after iteration k
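A minimal C sketch of the pivot-column search for row k (C99 variable-length array parameters; the names are illustrative assumptions):

#include <math.h>

/* Return the column index j >= k whose entry in row k has the
 * largest magnitude; this is the pivot column to swap with k. */
static int pivot_column(int n, double a[n][n], int k)
{
    int best = k;
    for (int j = k + 1; j < n; j++)
        if (fabs(a[k][j]) > fabs(a[k][best]))
            best = j;
    return best;
}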
41. Back Substitution
Sequential solution:

SeqBackSubstitution(U, x, y) {
    for (k = n-1; k >= 0; k--) {
        x[k] = y[k];
        for (i = k-1; i >= 0; i--)
            y[i] = y[i] - x[k] * U[i,k];
    }
}

42-45. Back Substitution (worked example)
On the example's upper triangular system U x = y, the iterations proceed as follows:
- k = 3: x3 = -1; y2 = 0 - (-1)·3 = 3; y1 = 0 - (-1)·8 = 8; y0 = 0 - (-1)·(-3) = -3
- k = 2: x2 = 3; y1 = 8 - 3·2 = 2; y0 = -3 - 3·(-2) = 3
- k = 1: x1 = 2; y0 = 3 - 2·1 = 1
- k = 0: x0 = 1
Sequential cost: O(n²)
46. Back Substitution
How to parallelize it? A simple 1D solution - use row partitioning:

1DParBackSubs(U, x, y) {
    for (k = n-1; k >= 0; k--) {
        x[k] = y[k];
        Broadcast(x[k]);
        for (i = k-1; i >= 0; i--)   // in parallel
            y[i] = y[i] - x[k] * U[i,k];
    }
}

Algorithm for processor i:

for (k = n-1; k > i; k--) {
    RecvBroadcast(x[k]);
    y[i] = y[i] - x[k] * U[i,k];
}
x[i] = y[i];
Broadcast(x[i]);
47. Back Substitution
- Analysis
  - local computation: O(1), n times
  - broadcast: O(log n), n times
  - overall time: O(n log n)
  - overall parallel cost: O(n² log n)
  - not cost-optimal
  - but the overall cost of solving the equations is dominated by Gaussian elimination
  - back substitution can be made cost-optimal using pipelining
- Homework
  - write a pipelined version of back substitution
  - how to do a 2D version of back substitution?
48. Matrix Algorithms Summary
- Matrix-Vector Multiplication
  - with 1D/2D partitioning
- Matrix-Matrix Multiplication
  - simple parallel algorithm
  - Cannon's algorithm
  - the DNS algorithm
- Solving Systems of Linear Equations
  - simple Gaussian elimination
  - with pipelining/partial pivoting
  - back substitution