Title: Parallel Sorting
1. Parallel Sorting
- James Demmel
- www.cs.berkeley.edu/demmel/cs267_Spr10
2. Some Sorting Algorithms
- Choice of algorithm depends on
- Is the data all in memory, or on disk/tape?
- Do we compare operands, or just use their values?
- Sequential or parallel? Shared or distributed memory? SIMD or not?
- We will consider all data in memory, and parallel
- Bitonic Sort
- Naturally parallel, suitable for SIMD architectures
- Sample Sort
- Good generalization of Quick Sort to the parallel case
- Radix Sort
- Data measured on CM-5
- Written in Split-C (precursor of UPC)
- www.cs.berkeley.edu/culler/papers/sort.ps
3. LogP Model to Predict, Understand Performance
[Diagram: P processor/memory modules (P, M) attached to an interconnection network of limited volume (at most L/g messages in flight to or from a processor), with o (overhead), g (gap), and L (latency) labeled.]
- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/BW)
- P: number of processors
4. Bottom Line on CM-5 using Split-C (Preview)
[Figure: time per key for each algorithm and processor count.]
- Good fit between predicted (using LogP model) and measured times (within 10%)
- No single algorithm is always best
- scaling by processor count, input size, sensitivity to key distribution
- All are global/local hybrids
- the local part is hard to implement and model
5. Bitonic Sort (1/2)
- A bitonic sequence is one that is
- 1. monotonically increasing and then monotonically decreasing
- 2. or can be circularly shifted to satisfy 1
- A half-cleaner takes a bitonic sequence and produces two halves such that
- 1. every element of the first half is no larger than the smallest element of the 2nd half
- 2. both halves are bitonic
- Apply recursively to each half to complete sorting
- Where do we get a bitonic sequence to start with?
6. Bitonic Sort (2/2)
- A bitonic sequence is one that is
- Monotonically increasing and then monotonically decreasing
- Or can be circularly shifted to satisfy this
- Any sequence of length 2 is bitonic
- So we can sort it using the bitonic sorter on the last slide
- Sort all length-2 sequences in alternating increasing/decreasing order
- Two such length-2 sequences form a length-4 bitonic sequence
- Recursively
- Sort length-2^k sequences in alternating increasing/decreasing order
- Two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted)
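The recursion above can be sketched serially in Java (illustrative names, not the Split-C code measured on the CM-5; assumes n is a power of two):

```java
// Serial sketch of the bitonic sorting network described above.
class BitonicDemo {
    // Half-cleaner + recursion: sort a bitonic sequence a[lo..lo+n) into
    // ascending (dir = true) or descending order.
    static void bitonicMerge(int[] a, int lo, int n, boolean dir) {
        if (n <= 1) return;
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)        // half-cleaner stage
            if (dir == (a[i] > a[i + m])) {
                int t = a[i]; a[i] = a[i + m]; a[i + m] = t;
            }
        bitonicMerge(a, lo, m, dir);             // both halves are bitonic
        bitonicMerge(a, lo + m, m, dir);
    }
    // Sort halves in alternating order to build a bitonic whole, then merge.
    static void bitonicSort(int[] a, int lo, int n, boolean dir) {
        if (n <= 1) return;
        int m = n / 2;
        bitonicSort(a, lo, m, true);             // ascending half
        bitonicSort(a, lo + m, m, false);        // descending half
        bitonicMerge(a, lo, n, dir);
    }
    static int[] sort(int[] a) { bitonicSort(a, 0, a.length, true); return a; }
}
```

Every compare-exchange within one half-cleaner stage is independent, which is what makes the network suitable for SIMD execution.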
7. Bitonic Sort for n = 16 (all dependencies shown)
[Figure: butterfly network of compare-exchange stages, with block layout.]
Similar pattern as FFT, so similar optimizations are possible
8. Bitonic Sort: Time per Key
[Figure: predicted (using the LogP model) and measured µs/key vs. N/P (16384 to 1048576), one curve per processor count P = 32, 64, 128, 256, 512; y-axis 0 to 80 µs/key.]
9. Sample Sort
1. Compute P-1 key values that split the input into roughly equal pieces:
take S = 64 samples per processor, sort the P*S sample keys (on processor 0),
let sample keys S, 2S, ..., (P-1)*S be the splitters, and broadcast the splitters.
2. Distribute keys based on the splitters:
all keys with Splitter(i-1) < key <= Splitter(i) are sent to proc i.
3. Local sort of keys on each proc.
4. Possibly reshift, so each proc has N/P keys.
If the samples represent the total population, then the splitters should divide the population into P roughly equal pieces.
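Steps 1 and 2 can be sketched serially in Java (hypothetical helper names; the per-processor arrays stand in for distributed data):

```java
import java.util.Arrays;

// Sketch of sample sort's splitter selection and key routing (steps 1-2),
// run serially: S samples per "processor", P-1 splitters from the sorted samples.
class SampleSortDemo {
    static int[] chooseSplitters(int[][] perProcKeys, int S, int P) {
        int[] samples = new int[P * S];
        java.util.Random rng = new java.util.Random(0);
        for (int p = 0; p < P; p++)                 // S samples per processor
            for (int s = 0; s < S; s++)
                samples[p * S + s] = perProcKeys[p][rng.nextInt(perProcKeys[p].length)];
        Arrays.sort(samples);                       // sort P*S keys "on processor 0"
        int[] splitters = new int[P - 1];
        for (int i = 1; i < P; i++)                 // sample keys S, 2S, ..., (P-1)S
            splitters[i - 1] = samples[i * S];
        return splitters;                           // would then be broadcast
    }
    // Destination processor for a key: first i with key <= Splitter(i), else P-1.
    static int dest(int key, int[] splitters) {
        for (int i = 0; i < splitters.length; i++)
            if (key <= splitters[i]) return i;
        return splitters.length;
    }
}
```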
10. Sample Sort Times
[Figure: predicted and measured µs/key vs. N/P (16384 to 1048576), one curve per processor count P = 32, 64, 128, 256, 512; y-axis 0 to 30 µs/key.]
11. Sample Sort Timing Breakdown
[Figure: predicted and measured (-m) times, broken down by phase.]
12. Sequential Radix Sort: Counting Sort
- Idea: build a histogram of the keys and compute the position in the answer array for each element
- A = [3, 5, 4, 1, 3, 4, 1, 4]
- Make a temp array B, and write values into position
- B = [1, 1, 3, 3, 4, 4, 4, 5]
- Cost = O(#keys + size of histogram)
- What if the histogram is too large (e.g., all 32-bit ints? All words?)
13. Radix Sort: Separate Key Into Parts
- Divide keys into parts, e.g., by digits (radix)
- Use counting sort on each of these digits
- Start with the least-significant digit

  sat   run   sat   pin
  saw   pin   saw   run
  tip   tip   pin   sat
  run   sat   tip   saw
  pin   saw   run   tip

(columns: original; after sorting on the 3rd character; then the 2nd; then the 1st)
14. Histo-Radix Sort
- Per pass:
- 1. compute local histogram (r-bit keys, 2^r bins)
- 2. compute position of the 1st member of each bucket in the global array (2^r scans with end-around)
- 3. distribute all the keys
- Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers
[Figure: P processors with n = N/P keys each; steps 2 and 3 of the pass illustrated.]
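The three per-pass steps can be sketched serially in Java (illustrative names; assumes non-negative 32-bit keys, with the global scan collapsed to a sequential prefix sum):

```java
// One pass of histo-radix sort, run serially: extract an r-bit digit,
// build the 2^r-bin histogram, scan it into starting positions, distribute keys.
class HistoRadixDemo {
    static int[] onePass(int[] keys, int r, int pass) {
        int bins = 1 << r, shift = r * pass, mask = bins - 1;
        int[] hist = new int[bins];
        for (int k : keys) hist[(k >>> shift) & mask]++;   // 1. histogram
        int[] pos = new int[bins];                         // 2. position of 1st
        for (int b = 1; b < bins; b++)                     //    member of each bucket
            pos[b] = pos[b - 1] + hist[b - 1];
        int[] out = new int[keys.length];                  // 3. distribute the keys
        for (int k : keys) out[pos[(k >>> shift) & mask]++] = k;
        return out;
    }
    // Full sort = ceil(32/r) passes; with r = 8 that is 4 passes for 32-bit keys.
    static int[] sort(int[] keys, int r) {
        for (int p = 0; p * r < 32; p++) keys = onePass(keys, r, p);
        return keys;
    }
}
```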
15. Histo-Radix Sort (again)
[Figure: local data and local histograms across the P processors.]
Each pass: form local histograms, form the global histogram, globally distribute data
16. Radix Sort Times
[Figure: time per key vs. N/P, one curve per processor count P = 32, 64, 128, 256, 512.]
17. Radix Sort Timing Breakdown
[Figure: predicted and measured (-m) times, broken down by phase.]
18. Local Sort Performance on CM-5
[Figure: time per key vs. entropy, for an 11-bit radix sort of 32-bit numbers.]
Entropy = -Sum_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#different keys)
19. Radix Sort: Timing Dependence on Key Distribution
Entropy = -Sum_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#different keys)
Slowdown due to contention in redistribution
20. Bottom Line on CM-5 using Split-C
[Figure: time per key for each algorithm and processor count.]
- Good fit between predicted (using LogP model) and measured times (within 10%)
- No single algorithm is always best
- scaling by processor count, input size, sensitivity to key distribution
- All are global/local hybrids
- the local part is hard to implement and model
21. Sorting Conclusions
- Distributed memory model leads to hybrid global/local algorithms
- Use the best local algorithm combined with a global part
- LogP model is good enough to model the global part
- bandwidth (g) or overhead (o) matter most
- including end-point contention
- latency (L) only matters when bandwidth doesn't
- Modeling local computational performance is harder
- dominated by effects of the storage hierarchy (e.g., TLBs)
- depends on entropy
- See http://www.cs.berkeley.edu/culler/papers/sort.ps
- See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps (disk-to-disk parallel sorting)
22. Extra Slides
23. Radix Stream Broadcast Problem
- Processor 0 does only sends
- Others receive, then send
- Receives are prioritized over sends
- So processor 0 needs to be delayed by roughly (P-1)(2o + L + (n-1)g)?
Need to slow the first processor to pipeline well
24. What's the Right Communication Mechanism?
- Permutation via writes
- consistency model?
- false sharing?
- Reads?
- Bulk transfers?
- what do you need to change in the algorithm?
- Network scheduling?
25. Comparison
- Good fit between predicted and measured times (within 10%)
- Different sorts for different sorts
- scaling by processor count, input size, sensitivity to key distribution
- All are global/local hybrids
- the local part is hard to implement and model
26. Outline
- Some Performance Laws
- Performance analysis
- Performance modeling
- Parallel Sorting: combining models with measurements
- Reading
- Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text)
- http://www-unix.mcs.anl.gov/dbpp/text/node26.html
- Abbreviated as DBPP in this lecture
- David Bailey's "Twelve Ways to Fool the Masses"
27. Measuring Performance
- Performance criteria may vary with domain
- There may be limits on acceptable running time
- E.g., a climate model must run 1000x faster than real time
- Any performance improvement may be acceptable
- E.g., faster on 4 cores than on 1
- Throughput may be more critical than latency
- E.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system
- Execution time per unit cost
- E.g., GFlop/sec, GFlop/s/$, or GFlop/s/Watt
- Parallel scalability (speedup or parallel efficiency)
- Percent relative to best possible (some kind of peak)
28. Amdahl's Law (review)
- Suppose only part of an application is parallel
- Amdahl's law:
- let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
- P = number of processors
Speedup(P) = Time(1)/Time(P)
           <= 1/(s + (1-s)/P)
           <= 1/s
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part
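The bound can be checked numerically; a minimal sketch:

```java
// Amdahl's law as stated above: Speedup(P) = 1 / (s + (1-s)/P) <= 1/s.
class AmdahlDemo {
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }
}
```

Even with s = 0.1 and 1024 processors, speedup stays below 1/s = 10.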
29. Amdahl's Law (for 1024 processors)
Does this mean parallel computing is a hopeless enterprise?
Source: Gustafson, Montry, Benner
30. Scaled Speedup
See Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024-Processor Hypercube," SIAM J. Sci. Stat. Comput. 9, No. 4, 1988, p. 609.
31. Scaled Speedup (background)
32. Little's Law
- Latency vs. Bandwidth
- Latency is physics (wire length)
- e.g., the network latency on the Earth Simulator is only about 2x the time for light to cross the machine room
- Bandwidth is cost
- add more cables to increase bandwidth (an over-simplification)
- Principle (Little's Law): for a production system in steady state,
- Inventory = Throughput x Flow Time
- For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency
- Required concurrency = Bandwidth x Latency (the bandwidth-delay product)
33. Little's Law
- Example 1: a single processor
- If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line)
- The system must support 50/12.8 ~ 4 outstanding cache line misses to keep things balanced (run at bandwidth speed)
- An application must be able to prefetch 4 cache line misses in parallel (without dependencies between them)
- Example 2: a 1000-processor system
- 1 GHz clock, 100 ns memory latency, 100 words of memory in data paths between CPU and memory
- Main memory bandwidth is
- 1000 x 100 words x 10^9/s = 10^14 words/sec
- To achieve full performance, an application needs
- 10^-7 s x 10^14 words/s = 10^7-way concurrency
- (some of that may be hidden in the instruction stream)
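The arithmetic in both examples is just the bandwidth-delay product; a minimal check:

```java
// Little's Law arithmetic from the two examples above:
// required concurrency = bandwidth x latency (the bandwidth-delay product).
class LittlesLawDemo {
    static double concurrency(double bandwidth, double latency) {
        return bandwidth * latency;
    }
}
```

Example 1 uses bandwidth in cache lines/sec (5 GB/s over 64-byte lines); Example 2 uses words/sec.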
34. In Performance Analysis, Use More Data
- Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.
- E.g., from DBPP:
- Serial algorithm scales as N + N^2
- Speedup of 10.8 on 12 processors with problem size N = 100
- Case 1: T = N + N^2/P
- Case 2: T = (N + N^2)/P + 100
- Case 3: T = (N + N^2)/P + 0.6 P^2
- All have speedup ~10.8 on 12 procs
- Performance graphs (N = 100, 1000) show differences in scaling
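The three cases can be tabulated directly; a minimal sketch (all three agree at N = 100, P = 12, then diverge as N and P grow):

```java
// The three timing models from DBPP quoted above. Each matches the single
// observed point (speedup ~10.8 at P = 12, N = 100) but scales very differently.
class ScalingDemo {
    static double t1(double n, double p) { return n + n * n / p; }
    static double t2(double n, double p) { return (n + n * n) / p + 100; }
    static double t3(double n, double p) { return (n + n * n) / p + 0.6 * p * p; }
    static double speedup(java.util.function.DoubleBinaryOperator t, double n, double p) {
        return t.applyAsDouble(n, 1) / t.applyAsDouble(n, p);
    }
}
```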
35. Example: Immersed Boundary Simulation
- Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC
- How useful is this data? What are ways to make it more useful/interesting?
Joint work with Ed Givelberg, Armando Solar-Lezama
36. Performance Analysis
37. Building a Performance Model
- Based on measurements/scaling of components
- FFT time is
- 5 n log n flops / (flops/sec measured for the FFT)
- Other costs are linear in either material or fluid points
- Measure constants:
- flops/point (independent of machine and problem size)
- flops/sec (measured per machine, per phase)
- Time is a + b * #points
- Communication done similarly
- Find a formula for message size as a function of problem size
- Check the formula using tracing of some kind
- Use the a/b model to predict running time: a + b * size
38. A Performance Model
- 512^3 in < 1 second per timestep: not possible
- Primarily limited by bisection bandwidth
39. Model Success and Failure
40. OSKI SpMV: What We Expect
- Assume
- Cost(SpMV) = time to read the matrix
- 1 double-word = 2 integers
- r, c in {1, 2, 4, 8}
- CSR: 1 int / non-zero
- BCSR(r x c): 1 int / (r*c non-zeros)
- As r*c increases, speedup should
- Increase smoothly
- Approach 1.5x (eliminating all index overhead)
41. What We Get (The Need for Search)
42. Using Multiple Models
43. Multiple Models
44. Multiple Models
45. Extended Example: Using Performance Modeling (LogP) to Explain Data; Application to Sorting
46. Deriving the LogP Model
- Processing
- powerful microprocessor, large DRAM, cache => P
- Communication
- significant latency (100's to 1000's of cycles) => L
- limited bandwidth (1% to 5% of memory BW) => g
- significant overhead (10's to 100's of cycles) => o
- on both ends
- no consensus on topology
- => should not exploit structure
- limited network capacity
- no consensus on programming model
- => should not enforce one
47. LogP
[Diagram: P processor/memory modules (P, M) attached to an interconnection network of limited volume (at most L/g messages in flight to or from a processor), with o (overhead), g (gap), and L (latency) labeled.]
- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/BW)
- P: number of processors
48. Using the LogP Model
[Diagram: timeline of a send, showing o at the sender, L in the network, o at the receiver, with gap g between successive sends.]
- Send n messages from proc to proc in time
- 2o + L + g(n-1)
- each processor does o*n cycles of overhead
- has (g-o)(n-1) + L available compute cycles
- Send n total messages from one to many in the same time
- Send n messages from many to one in the same time
- all but L/g processors block
- so fewer available cycles, unless scheduled carefully
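The point-to-point formula can be evaluated directly; a minimal sketch using the CM-5 parameters quoted on a later slide (L = 6, o = 2.2, g = 4 µs):

```java
// LogP cost of sending n small messages point-to-point, per the formula above:
// time = 2o + L + g*(n-1), with L, o, g in microseconds.
class LogPDemo {
    static double sendTime(int n, double L, double o, double g) {
        return 2 * o + L + g * (n - 1);
    }
}
```

For one message the cost is 2o + L; each additional message adds only g, since sends pipeline behind the gap.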
49. Use of the LogP Model (cont.)
- Two processors sending n words to each other (i.e., exchange):
- 2o + L + (n-1) max(g, 2o) + max(g, 2o) + L
- P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts, e.g., transpose, FFT, cyclic-to-block, block-to-cyclic
- exercise: what's wrong with the formula above?
[Diagram: overlapped send/receive timeline with o, L, g labeled.]
Assumes an optimal pattern of sends/receives, so could underestimate time
50. LogP "Philosophy"
- Think about
- mapping of N words onto P processors
- computation within a processor, its cost, and balance
- communication between processors, its cost, and balance
- given a characterization of processor and network performance
- Do not think about what happens within the network
This should be good enough!
51. Typical Sort
- Exploits the n = N/P grouping
- Significant local computation
- Very general global communication / transformation
- Computation of the transformation
52. Costs of Split-C (UPC Predecessor) Operations
- Read, Write: x = *G, *G = x : 2(L + 2o)
- Store: *G := x : L + 2o
- Get: x := get(G) : o, with the data arriving after 2L + 2o
- sync(): o
- with interval g between successive operations
- Bulk store (n words with w words/message): 2o + (n-1)g + L
- Exchange: 2o + 2L + (ceil(n/w) - 1 - L/g) max(g, 2o)
- One to many
- Many to one
53. LogP Model Parameters
- CM-5
- L = 6 µs
- o = 2.2 µs
- g = 4 µs
- P varies from 32 to 1024
- NOW
- L = 8.9 µs
- o = 3.8 µs
- g = 12.8 µs
- P varies up to 100
- What is the processor performance?
- Application-specific
- 10's of Mflops for these machines
54. LogP Parameters Today
55. Local Computation Parameters (Empirical)

  Sort     Parameter    Operation                      µs per key
  -------  -----------  -----------------------------  ----------------------------------
  Bitonic  swap         simulate cycle of butterfly    0.025 lg N
           mergesort    sort bitonic sequence          1.0
           scatter      move key for cyclic-to-block   0.46
           gather       move key for block-to-cyclic   0.52 if n < 64K or P < 64,
                                                       1.1 otherwise
           local sort   local radix sort (11 bit)      4.5 if n < 64K,
                                                       9.0 - (281000/n) otherwise
  Column   merge        merge sorted lists             1.5
           copy         shift key                      0.5
           zero         clear histogram bin            0.2
  Radix    hist         produce histogram              1.2
           add          produce scan value             1.0
           bsum         adjust scan of bins            2.5
           address      determine destination          4.7
           compare      compare key to splitter        0.9
  Sample   localsort8   local radix sort of samples    5.0
56. Odd-Even Merge - Classic Parallel Sort
N values to be sorted; treat as two lists of M = N/2.
Sort each separately:
  A0 A1 A2 A3 ... AM-1
  B0 B1 B2 B3 ... BM-1
Redistribute into even and odd sublists:
  A0 A2 ... AM-2    A1 A3 ... AM-1
  B0 B2 ... BM-2    B1 B3 ... BM-1
Merge into two sorted lists:
  E0 E1 E2 E3 ... EM-1
  O0 O1 O2 O3 ... OM-1
Pairwise swaps of Ei and Oi will put it in order
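The merge can be sketched serially in Java (assumes the two inputs are sorted and of equal power-of-two length; after interleaving the even and odd merges, adjacent E/O pairs are compare-exchanged, following Batcher's formulation):

```java
// Odd-even merge as described above: split two sorted lists into even- and
// odd-indexed sublists, merge each, interleave, then fix up with pairwise swaps.
class OddEvenMergeDemo {
    static int[] merge(int[] a, int[] b) {
        int[] e = mergeSorted(pick(a, 0), pick(b, 0));   // merge of even sublists
        int[] o = mergeSorted(pick(a, 1), pick(b, 1));   // merge of odd sublists
        int n = a.length + b.length;
        int[] out = new int[n];
        for (int i = 0; i < e.length; i++) out[2 * i] = e[i];      // interleave E...
        for (int i = 0; i < o.length; i++) out[2 * i + 1] = o[i];  // ...with O
        for (int i = 1; i + 1 < n; i += 2)               // pairwise compare-exchange
            if (out[i] > out[i + 1]) { int t = out[i]; out[i] = out[i + 1]; out[i + 1] = t; }
        return out;
    }
    static int[] pick(int[] a, int start) {              // every 2nd element from start
        int[] r = new int[(a.length - start + 1) / 2];
        for (int i = 0; i < r.length; i++) r[i] = a[start + 2 * i];
        return r;
    }
    static int[] mergeSorted(int[] x, int[] y) {         // ordinary two-way merge
        int[] r = new int[x.length + y.length];
        int i = 0, j = 0, k = 0;
        while (i < x.length && j < y.length) r[k++] = x[i] <= y[j] ? x[i++] : y[j++];
        while (i < x.length) r[k++] = x[i++];
        while (j < y.length) r[k++] = y[j++];
        return r;
    }
}
```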
57. Where's the Parallelism?
[Figure: merge tree, from one list of N (1 x N) through 2 x N/2 and 4 x N/4 sublists, down to the merged sublists E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1, recombined into 1 x N.]
58. Mapping to a Butterfly (or Hypercube)
Two sorted sublists; reverse the order of one list via cross edges:
  A0 A1 A2 A3 | B3 B2 B1 B0
then pairwise swaps on the way back.
[Figure: worked example on the butterfly, merging 2 3 4 8 with the reversed list 7 6 5 1 through successive compare-exchange stages to reach 1 2 3 4 5 6 7 8.]
59. Bitonic Sort with N/P per Node
A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.

  all_bitonic(int A[PROCS]::[n])
    sort(tolocal(&A[ME][0]), n, 0)
    for (d = 1; d <= logProcs; d++)
      for (i = d-1; i >= 0; i--) {
        swap(A, T, n, pair(i));      /* swap */
        merge(A, T, n, mode(d, i));  /* merge */
      }
    sort(tolocal(&A[ME][0]), n, mask(i));
60. Bitonic Breakdown
P = 512, random keys
61. Bitonic: Effect of Key Distributions
P = 64, N/P = 1 M
62. Sequential Radix Sort: Counting Sort
- Idea: build a histogram of the keys and compute the position in the answer array for each element
- A = [3, 5, 4, 1, 3, 4, 1, 4]
- Make a temp array B, and write values into position
[Figure: elements of A being written into their sorted positions in B.]
63. Counting Sort Pseudo-Code
- Counting Sort:

  static void countingSort(int[] A) {
      int N = A.length;
      int L = min(A), U = max(A);       // smallest and largest key
      int[] count = new int[U - L + 2];
      for (int i = 0; i < N; i += 1)
          count[A[i] - L + 1] += 1;     // histogram, offset by one
      for (int j = 1; j < count.length; j += 1)
          count[j] += count[j-1];       // prefix sums = start positions
      ...
  }

A = [3, 5, 4, 1, 3, 4, 1, 4]
N = 8
L = 1, U = 5
count = [0, 0, 0, 0, 0, 0]
count = [0, 2, 0, 2, 3, 1]
count = [0, 2, 2, 4, 7, 8]
64. Distribution Sort, Continued

  static void countingSort(int[] A) {
      ...
      int[] B = new int[N];
      for (int i = 0; i < N; i += 1) {
          B[count[A[i] - L]] = A[i];    // place A[i] at its position
          count[A[i] - L] += 1;
      }
      // copy back into A
      for (int i = 0; i < N; i += 1)
          A[i] = B[i];
  }

A = [3, 5, 4, 1, 3, 4, 1, 4]
count = [0, 2, 2, 4, 7, 8]
B = [0, 0, 0, 0, 0, 0, 0, 0]
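The two halves above combine into one runnable version (min/max computed inline, since the slides' min(A)/max(A) are pseudocode helpers):

```java
// The complete countingSort from the two preceding slides, made runnable.
class CountingSortDemo {
    static void countingSort(int[] A) {
        int N = A.length;
        if (N == 0) return;
        int L = A[0], U = A[0];
        for (int x : A) { L = Math.min(L, x); U = Math.max(U, x); }
        int[] count = new int[U - L + 2];
        for (int i = 0; i < N; i += 1)             // histogram, offset by one
            count[A[i] - L + 1] += 1;
        for (int j = 1; j < count.length; j += 1)  // prefix sums = start positions
            count[j] += count[j - 1];
        int[] B = new int[N];
        for (int i = 0; i < N; i += 1) {           // distribute into position
            B[count[A[i] - L]] = A[i];
            count[A[i] - L] += 1;
        }
        for (int i = 0; i < N; i += 1)             // copy back into A
            A[i] = B[i];
    }
}
```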
65. Analysis of Counting Sort
- What is the complexity of each step for an n-element array?
- Find min and max: Theta(n)
- Fill in count histogram: Theta(n)
- Compute sums of count: Theta(max-min)
- Fill in B (run over count): Theta(max-min)
- Copy B to A: Theta(n)
- So this is a Theta(n + m) algorithm, where m = max - min
- Great if the range of keys isn't too large
- If m < n, then Theta(n) overall