Title: Parallel Sorting
1. Parallel Sorting
- James Demmel
- www.cs.berkeley.edu/demmel/cs267_Spr10
2. Some Sorting Algorithms
- Choice of algorithm depends on
- Is the data all in memory, or on disk/tape?
- Do we compare operands, or just use their values?
- Sequential or parallel? Shared or distributed memory? SIMD or not?
- We will consider all data in memory, and parallel
- Bitonic Sort
- Naturally parallel, suitable for SIMD architectures
- Sample Sort
- Good generalization of Quick Sort to the parallel case
- Radix Sort
- Data measured on CM-5
- Written in Split-C (precursor of UPC)
- www.cs.berkeley.edu/culler/papers/sort.ps
3. LogP Model to Predict, Understand Performance
[Diagram: P processor/memory modules (P, M) attached to an interconnection network of limited volume (at most L/g messages in flight to or from a processor), with o (overhead), g (gap), and L (latency) labeled.]
- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/BW)
- P: number of processors
4. Bottom Line on CM-5 using Split-C (Preview)
[Figure: time per key for each algorithm and processor count.]
- Good fit between predicted (using LogP model) and measured times (within 10%)
- No single algorithm is always best
- scaling by processor count, input size, sensitivity to key distribution
- All are global/local hybrids
- the local part is hard to implement and model
5. Bitonic Sort (1/2)
- A bitonic sequence is one that is
- 1. monotonically increasing and then monotonically decreasing
- 2. or can be circularly shifted to satisfy 1
- A half-cleaner takes a bitonic sequence and produces two halves such that
- 1. every element of the first half is no larger than the smallest element of the 2nd half
- 2. both halves are bitonic
- Apply recursively to each half to complete sorting
- Where do we get a bitonic sequence to start with?
6. Bitonic Sort (2/2)
- A bitonic sequence is one that is
- Monotonically increasing and then monotonically decreasing
- Or can be circularly shifted to satisfy this
- Any sequence of length 2 is bitonic
- So we can sort it using the bitonic sorter on the last slide
- Sort all length-2 sequences in alternating increasing/decreasing order
- Two such length-2 sequences form a length-4 bitonic sequence
- Recursively
- Sort length-2^k sequences in alternating increasing/decreasing order
- Two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted)
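The recursion above can be sketched serially in Java (illustrative names, not the Split-C code measured on the CM-5; assumes n is a power of two):

```java
// Serial sketch of the bitonic sorting network described above.
class BitonicDemo {
    // Half-cleaner + recursion: sort a bitonic sequence a[lo..lo+n) into
    // ascending (dir = true) or descending order.
    static void bitonicMerge(int[] a, int lo, int n, boolean dir) {
        if (n <= 1) return;
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)        // half-cleaner stage
            if (dir == (a[i] > a[i + m])) {
                int t = a[i]; a[i] = a[i + m]; a[i + m] = t;
            }
        bitonicMerge(a, lo, m, dir);             // both halves are bitonic
        bitonicMerge(a, lo + m, m, dir);
    }
    // Sort halves in alternating order to build a bitonic whole, then merge.
    static void bitonicSort(int[] a, int lo, int n, boolean dir) {
        if (n <= 1) return;
        int m = n / 2;
        bitonicSort(a, lo, m, true);             // ascending half
        bitonicSort(a, lo + m, m, false);        // descending half
        bitonicMerge(a, lo, n, dir);
    }
    static int[] sort(int[] a) { bitonicSort(a, 0, a.length, true); return a; }
}
```

Every compare-exchange within one half-cleaner stage is independent, which is what makes the network suitable for SIMD execution.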
7. Bitonic Sort for n = 16 (all dependencies shown)
[Figure: butterfly network of compare-exchange stages, with block layout.]
Similar pattern as FFT, so similar optimizations are possible
8. Bitonic Sort: Time per Key
[Figure: predicted (using the LogP model) and measured µs/key vs. N/P (16384 to 1048576), one curve per processor count P = 32, 64, 128, 256, 512; y-axis 0 to 80 µs/key.]
9. Sample Sort
1. Compute P-1 key values that split the input into roughly equal pieces:
take S = 64 samples per processor, sort the P*S sample keys (on processor 0),
let sample keys S, 2S, ..., (P-1)*S be the splitters, and broadcast the splitters.
2. Distribute keys based on the splitters:
all keys with Splitter(i-1) < key <= Splitter(i) are sent to proc i.
3. Local sort of keys on each proc.
4. Possibly reshift, so each proc has N/P keys.
If the samples represent the total population, then the splitters should divide the population into P roughly equal pieces.
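Steps 1 and 2 can be sketched serially in Java (hypothetical helper names; the per-processor arrays stand in for distributed data):

```java
import java.util.Arrays;

// Sketch of sample sort's splitter selection and key routing (steps 1-2),
// run serially: S samples per "processor", P-1 splitters from the sorted samples.
class SampleSortDemo {
    static int[] chooseSplitters(int[][] perProcKeys, int S, int P) {
        int[] samples = new int[P * S];
        java.util.Random rng = new java.util.Random(0);
        for (int p = 0; p < P; p++)                 // S samples per processor
            for (int s = 0; s < S; s++)
                samples[p * S + s] = perProcKeys[p][rng.nextInt(perProcKeys[p].length)];
        Arrays.sort(samples);                       // sort P*S keys "on processor 0"
        int[] splitters = new int[P - 1];
        for (int i = 1; i < P; i++)                 // sample keys S, 2S, ..., (P-1)S
            splitters[i - 1] = samples[i * S];
        return splitters;                           // would then be broadcast
    }
    // Destination processor for a key: first i with key <= Splitter(i), else P-1.
    static int dest(int key, int[] splitters) {
        for (int i = 0; i < splitters.length; i++)
            if (key <= splitters[i]) return i;
        return splitters.length;
    }
}
```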
10. Sample Sort Times
[Figure: predicted and measured µs/key vs. N/P (16384 to 1048576), one curve per processor count P = 32, 64, 128, 256, 512; y-axis 0 to 30 µs/key.]
11. Sample Sort Timing Breakdown
[Figure: predicted and measured (-m) times, broken down by phase.]
12. Sequential Radix Sort: Counting Sort
- Idea: build a histogram of the keys and compute the position in the answer array for each element
- A = [3, 5, 4, 1, 3, 4, 1, 4]
- Make a temp array B, and write values into position
- B = [1, 1, 3, 3, 4, 4, 4, 5]
- Cost = O(#keys + size of histogram)
- What if the histogram is too large (e.g., all 32-bit ints? All words?)
13. Radix Sort: Separate Key Into Parts
- Divide keys into parts, e.g., by digits (radix)
- Use counting sort on each of these digits
- Start with the least-significant digit

  sat   run   sat   pin
  saw   pin   saw   run
  tip   tip   pin   sat
  run   sat   tip   saw
  pin   saw   run   tip

(columns: original; after sorting on the 3rd character; then the 2nd; then the 1st)
14. Histo-Radix Sort
- Per pass:
- 1. compute local histogram (r-bit keys, 2^r bins)
- 2. compute position of the 1st member of each bucket in the global array (2^r scans with end-around)
- 3. distribute all the keys
- Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers
[Figure: P processors with n = N/P keys each; steps 2 and 3 of the pass illustrated.]
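The three per-pass steps can be sketched serially in Java (illustrative names; assumes non-negative 32-bit keys, with the global scan collapsed to a sequential prefix sum):

```java
// One pass of histo-radix sort, run serially: extract an r-bit digit,
// build the 2^r-bin histogram, scan it into starting positions, distribute keys.
class HistoRadixDemo {
    static int[] onePass(int[] keys, int r, int pass) {
        int bins = 1 << r, shift = r * pass, mask = bins - 1;
        int[] hist = new int[bins];
        for (int k : keys) hist[(k >>> shift) & mask]++;   // 1. histogram
        int[] pos = new int[bins];                         // 2. position of 1st
        for (int b = 1; b < bins; b++)                     //    member of each bucket
            pos[b] = pos[b - 1] + hist[b - 1];
        int[] out = new int[keys.length];                  // 3. distribute the keys
        for (int k : keys) out[pos[(k >>> shift) & mask]++] = k;
        return out;
    }
    // Full sort = ceil(32/r) passes; with r = 8 that is 4 passes for 32-bit keys.
    static int[] sort(int[] keys, int r) {
        for (int p = 0; p * r < 32; p++) keys = onePass(keys, r, p);
        return keys;
    }
}
```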
15. Histo-Radix Sort (again)
[Figure: local data and local histograms across the P processors.]
Each pass: form local histograms, form the global histogram, globally distribute data
16. Radix Sort Times
[Figure: time per key vs. N/P, one curve per processor count P = 32, 64, 128, 256, 512.]
17. Radix Sort Timing Breakdown
[Figure: predicted and measured (-m) times, broken down by phase.]
18. Local Sort Performance on CM-5
[Figure: time per key vs. entropy, for an 11-bit radix sort of 32-bit numbers.]
Entropy = -Sum_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#different keys)
19. Radix Sort: Timing Dependence on Key Distribution
Entropy = -Sum_i p_i log2 p_i, where p_i = probability of key i; ranges from 0 to log2(#different keys)
Slowdown due to contention in redistribution
20. Bottom Line on CM-5 using Split-C
[Figure: time per key for each algorithm and processor count.]
- Good fit between predicted (using LogP model) and measured times (within 10%)
- No single algorithm is always best
- scaling by processor count, input size, sensitivity to key distribution
- All are global/local hybrids
- the local part is hard to implement and model
21. Sorting Conclusions
- Distributed memory model leads to hybrid global/local algorithms
- Use the best local algorithm combined with a global part
- LogP model is good enough to model the global part
- bandwidth (g) or overhead (o) matter most
- including end-point contention
- latency (L) only matters when bandwidth doesn't
- Modeling local computational performance is harder
- dominated by effects of the storage hierarchy (e.g., TLBs)
- depends on entropy
- See http://www.cs.berkeley.edu/culler/papers/sort.ps
- See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps (disk-to-disk parallel sorting)
22. Extra Slides
23. Radix Stream Broadcast Problem
- Processor 0 does only sends
- Others receive, then send
- Receives are prioritized over sends
- So processor 0 needs to be delayed by roughly (P-1)(2o + L + (n-1)g)?
Need to slow the first processor to pipeline well
24. What's the Right Communication Mechanism?
- Permutation via writes
- consistency model?
- false sharing?
- Reads?
- Bulk transfers?
- what do you need to change in the algorithm?
- Network scheduling?
25. Comparison
- Good fit between predicted and measured times (within 10%)
- Different sorts for different sorts
- scaling by processor count, input size, sensitivity to key distribution
- All are global/local hybrids
- the local part is hard to implement and model
26. Outline
- Some Performance Laws
- Performance analysis
- Performance modeling
- Parallel Sorting: combining models with measurements
- Reading
- Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text)
- http://www-unix.mcs.anl.gov/dbpp/text/node26.html
- Abbreviated as DBPP in this lecture
- David Bailey's "Twelve Ways to Fool the Masses"
27. Measuring Performance
- Performance criteria may vary with domain
- There may be limits on acceptable running time
- E.g., a climate model must run 1000x faster than real time
- Any performance improvement may be acceptable
- E.g., faster on 4 cores than on 1
- Throughput may be more critical than latency
- E.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system
- Execution time per unit cost
- E.g., GFlop/sec, GFlop/s/$, or GFlop/s/Watt
- Parallel scalability (speedup or parallel efficiency)
- Percent relative to best possible (some kind of peak)
28. Amdahl's Law (review)
- Suppose only part of an application is parallel
- Amdahl's law:
- let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
- P = number of processors
Speedup(P) = Time(1)/Time(P)
           <= 1/(s + (1-s)/P)
           <= 1/s
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part
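The bound can be checked numerically; a minimal sketch:

```java
// Amdahl's law as stated above: Speedup(P) = 1 / (s + (1-s)/P) <= 1/s.
class AmdahlDemo {
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }
}
```

Even with s = 0.1 and 1024 processors, speedup stays below 1/s = 10.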
29. Amdahl's Law (for 1024 processors)
Does this mean parallel computing is a hopeless enterprise?
Source: Gustafson, Montry, Benner
30. Scaled Speedup
See Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024-Processor Hypercube," SIAM J. Sci. Stat. Comput. 9, No. 4, 1988, p. 609.
31. Scaled Speedup (background)
32. Little's Law
- Latency vs. Bandwidth
- Latency is physics (wire length)
- e.g., the network latency on the Earth Simulator is only about 2x the time for light to cross the machine room
- Bandwidth is cost
- add more cables to increase bandwidth (an over-simplification)
- Principle (Little's Law): for a production system in steady state,
- Inventory = Throughput x Flow Time
- For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency
- Required concurrency = Bandwidth x Latency (the bandwidth-delay product)
33. Little's Law
- Example 1: a single processor
- If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line)
- The system must support 50/12.8 ~ 4 outstanding cache line misses to keep things balanced (run at bandwidth speed)
- An application must be able to prefetch 4 cache line misses in parallel (without dependencies between them)
- Example 2: a 1000-processor system
- 1 GHz clock, 100 ns memory latency, 100 words of memory in data paths between CPU and memory
- Main memory bandwidth is
- 1000 x 100 words x 10^9/s = 10^14 words/sec
- To achieve full performance, an application needs
- 10^-7 s x 10^14 words/s = 10^7-way concurrency
- (some of that may be hidden in the instruction stream)
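The arithmetic in both examples is just the bandwidth-delay product; a minimal check:

```java
// Little's Law arithmetic from the two examples above:
// required concurrency = bandwidth x latency (the bandwidth-delay product).
class LittlesLawDemo {
    static double concurrency(double bandwidth, double latency) {
        return bandwidth * latency;
    }
}
```

Example 1 uses bandwidth in cache lines/sec (5 GB/s over 64-byte lines); Example 2 uses words/sec.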
34. In Performance Analysis, Use More Data
- Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.
- E.g., from DBPP:
- Serial algorithm scales as N + N^2
- Speedup of 10.8 on 12 processors with problem size N = 100
- Case 1: T = N + N^2/P
- Case 2: T = (N + N^2)/P + 100
- Case 3: T = (N + N^2)/P + 0.6 P^2
- All have speedup ~10.8 on 12 procs
- Performance graphs (N = 100, 1000) show differences in scaling
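The three cases can be tabulated directly; a minimal sketch (all three agree at N = 100, P = 12, then diverge as N and P grow):

```java
// The three timing models from DBPP quoted above. Each matches the single
// observed point (speedup ~10.8 at P = 12, N = 100) but scales very differently.
class ScalingDemo {
    static double t1(double n, double p) { return n + n * n / p; }
    static double t2(double n, double p) { return (n + n * n) / p + 100; }
    static double t3(double n, double p) { return (n + n * n) / p + 0.6 * p * p; }
    static double speedup(java.util.function.DoubleBinaryOperator t, double n, double p) {
        return t.applyAsDouble(n, 1) / t.applyAsDouble(n, p);
    }
}
```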
35. Example: Immersed Boundary Simulation
- Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC
- How useful is this data? What are ways to make it more useful/interesting?
Joint work with Ed Givelberg, Armando Solar-Lezama
36. Performance Analysis
37. Building a Performance Model
- Based on measurements/scaling of components
- FFT time is
- 5 n log n flops / (flops/sec measured for the FFT)
- Other costs are linear in either material or fluid points
- Measure constants:
- flops/point (independent of machine and problem size)
- flops/sec (measured per machine, per phase)
- Time is a + b * #points
- Communication done similarly
- Find a formula for message size as a function of problem size
- Check the formula using tracing of some kind
- Use the a/b model to predict running time: a + b * size
38. A Performance Model
- 512^3 in < 1 second per timestep: not possible
- Primarily limited by bisection bandwidth
39. Model Success and Failure
40. OSKI SpMV: What We Expect
- Assume
- Cost(SpMV) = time to read the matrix
- 1 double-word = 2 integers
- r, c in {1, 2, 4, 8}
- CSR: 1 int / non-zero
- BCSR(r x c): 1 int / (r*c non-zeros)
- As r*c increases, speedup should
- Increase smoothly
- Approach 1.5x (eliminating all index overhead)
41. What We Get (The Need for Search)
42. Using Multiple Models
43. Multiple Models
44. Multiple Models
45. Extended Example: Using Performance Modeling (LogP) to Explain Data; Application to Sorting
46. Deriving the LogP Model
- Processing
- powerful microprocessor, large DRAM, cache => P
- Communication
- significant latency (100's to 1000's of cycles) => L
- limited bandwidth (1% to 5% of memory BW) => g
- significant overhead (10's to 100's of cycles) => o
- on both ends
- no consensus on topology
- => should not exploit structure
- limited network capacity
- no consensus on programming model
- => should not enforce one
47. LogP
[Diagram: P processor/memory modules (P, M) attached to an interconnection network of limited volume (at most L/g messages in flight to or from a processor), with o (overhead), g (gap), and L (latency) labeled.]
- L: latency in sending a (small) message between modules
- o: overhead felt by the processor on sending or receiving a message
- g: gap between successive sends or receives (1/BW)
- P: number of processors
48. Using the LogP Model
[Diagram: timeline of a send, showing o at the sender, L in the network, o at the receiver, with gap g between successive sends.]
- Send n messages from proc to proc in time
- 2o + L + g(n-1)
- each processor does o*n cycles of overhead
- has (g-o)(n-1) + L available compute cycles
- Send n total messages from one to many in the same time
- Send n messages from many to one in the same time
- all but L/g processors block
- so fewer available cycles, unless scheduled carefully
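The point-to-point formula can be evaluated directly; a minimal sketch using the CM-5 parameters quoted on a later slide (L = 6, o = 2.2, g = 4 µs):

```java
// LogP cost of sending n small messages point-to-point, per the formula above:
// time = 2o + L + g*(n-1), with L, o, g in microseconds.
class LogPDemo {
    static double sendTime(int n, double L, double o, double g) {
        return 2 * o + L + g * (n - 1);
    }
}
```

For one message the cost is 2o + L; each additional message adds only g, since sends pipeline behind the gap.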
49. Use of the LogP Model (cont.)
- Two processors sending n words to each other (i.e., exchange):
- 2o + L + (n-1) max(g, 2o) + max(g, 2o) + L
- P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts, e.g., transpose, FFT, cyclic-to-block, block-to-cyclic
- exercise: what's wrong with the formula above?
[Diagram: overlapped send/receive timeline with o, L, g labeled.]
Assumes an optimal pattern of sends/receives, so could underestimate time
50. LogP "Philosophy"
- Think about
- mapping of N words onto P processors
- computation within a processor, its cost, and balance
- communication between processors, its cost, and balance
- given a characterization of processor and network performance
- Do not think about what happens within the network
This should be good enough!
51. Typical Sort
- Exploits the n = N/P grouping
- Significant local computation
- Very general global communication / transformation
- Computation of the transformation
52. Costs of Split-C (UPC Predecessor) Operations
- Read, Write: x = *G, *G = x : 2(L + 2o)
- Store: *G := x : L + 2o
- Get: x := get(G) : o, with the data arriving after 2L + 2o
- sync(): o
- with interval g between successive operations
- Bulk store (n words with w words/message): 2o + (n-1)g + L
- Exchange: 2o + 2L + (ceil(n/w) - 1 - L/g) max(g, 2o)
- One to many
- Many to one
53. LogP Model Parameters
- CM-5
- L = 6 µs
- o = 2.2 µs
- g = 4 µs
- P varies from 32 to 1024
- NOW
- L = 8.9 µs
- o = 3.8 µs
- g = 12.8 µs
- P varies up to 100
- What is the processor performance?
- Application-specific
- 10's of Mflops for these machines
54. LogP Parameters Today
55. Local Computation Parameters (Empirical)

  Sort     Parameter    Operation                      µs per key
  -------  -----------  -----------------------------  ----------------------------------
  Bitonic  swap         simulate cycle of butterfly    0.025 lg N
           mergesort    sort bitonic sequence          1.0
           scatter      move key for cyclic-to-block   0.46
           gather       move key for block-to-cyclic   0.52 if n < 64K or P < 64,
                                                       1.1 otherwise
           local sort   local radix sort (11 bit)      4.5 if n < 64K,
                                                       9.0 - (281000/n) otherwise
  Column   merge        merge sorted lists             1.5
           copy         shift key                      0.5
           zero         clear histogram bin            0.2
  Radix    hist         produce histogram              1.2
           add          produce scan value             1.0
           bsum         adjust scan of bins            2.5
           address      determine destination          4.7
           compare      compare key to splitter        0.9
  Sample   localsort8   local radix sort of samples    5.0
56. Odd-Even Merge - Classic Parallel Sort
N values to be sorted; treat as two lists of M = N/2.
Sort each separately:
  A0 A1 A2 A3 ... AM-1
  B0 B1 B2 B3 ... BM-1
Redistribute into even and odd sublists:
  A0 A2 ... AM-2    A1 A3 ... AM-1
  B0 B2 ... BM-2    B1 B3 ... BM-1
Merge into two sorted lists:
  E0 E1 E2 E3 ... EM-1
  O0 O1 O2 O3 ... OM-1
Pairwise swaps of Ei and Oi will put it in order
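The merge can be sketched serially in Java (assumes the two inputs are sorted and of equal power-of-two length; after interleaving the even and odd merges, adjacent E/O pairs are compare-exchanged, following Batcher's formulation):

```java
// Odd-even merge as described above: split two sorted lists into even- and
// odd-indexed sublists, merge each, interleave, then fix up with pairwise swaps.
class OddEvenMergeDemo {
    static int[] merge(int[] a, int[] b) {
        int[] e = mergeSorted(pick(a, 0), pick(b, 0));   // merge of even sublists
        int[] o = mergeSorted(pick(a, 1), pick(b, 1));   // merge of odd sublists
        int n = a.length + b.length;
        int[] out = new int[n];
        for (int i = 0; i < e.length; i++) out[2 * i] = e[i];      // interleave E...
        for (int i = 0; i < o.length; i++) out[2 * i + 1] = o[i];  // ...with O
        for (int i = 1; i + 1 < n; i += 2)               // pairwise compare-exchange
            if (out[i] > out[i + 1]) { int t = out[i]; out[i] = out[i + 1]; out[i + 1] = t; }
        return out;
    }
    static int[] pick(int[] a, int start) {              // every 2nd element from start
        int[] r = new int[(a.length - start + 1) / 2];
        for (int i = 0; i < r.length; i++) r[i] = a[start + 2 * i];
        return r;
    }
    static int[] mergeSorted(int[] x, int[] y) {         // ordinary two-way merge
        int[] r = new int[x.length + y.length];
        int i = 0, j = 0, k = 0;
        while (i < x.length && j < y.length) r[k++] = x[i] <= y[j] ? x[i++] : y[j++];
        while (i < x.length) r[k++] = x[i++];
        while (j < y.length) r[k++] = y[j++];
        return r;
    }
}
```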
57. Where's the Parallelism?
[Figure: merge tree, from one list of N (1 x N) through 2 x N/2 and 4 x N/4 sublists, down to the merged sublists E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1, recombined into 1 x N.]
58. Mapping to a Butterfly (or Hypercube)
Two sorted sublists; reverse the order of one list via cross edges:
  A0 A1 A2 A3 | B3 B2 B1 B0
then pairwise swaps on the way back.
[Figure: worked example on the butterfly, merging 2 3 4 8 with the reversed list 7 6 5 1 through successive compare-exchange stages to reach 1 2 3 4 5 6 7 8.]
59. Bitonic Sort with N/P per Node
A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.

  all_bitonic(int A[PROCS]::[n])
    sort(tolocal(&A[ME][0]), n, 0)
    for (d = 1; d <= logProcs; d++)
      for (i = d-1; i >= 0; i--) {
        swap(A, T, n, pair(i));      /* swap */
        merge(A, T, n, mode(d, i));  /* merge */
      }
    sort(tolocal(&A[ME][0]), n, mask(i));
60. Bitonic Breakdown
P = 512, random keys
61. Bitonic: Effect of Key Distributions
P = 64, N/P = 1 M
62. Sequential Radix Sort: Counting Sort
- Idea: build a histogram of the keys and compute the position in the answer array for each element
- A = [3, 5, 4, 1, 3, 4, 1, 4]
- Make a temp array B, and write values into position
[Figure: elements of A being written into their sorted positions in B.]
63. Counting Sort Pseudo-Code
- Counting Sort:

  static void countingSort(int[] A) {
      int N = A.length;
      int L = min(A), U = max(A);       // smallest and largest key
      int[] count = new int[U - L + 2];
      for (int i = 0; i < N; i += 1)
          count[A[i] - L + 1] += 1;     // histogram, offset by one
      for (int j = 1; j < count.length; j += 1)
          count[j] += count[j-1];       // prefix sums = start positions
      ...
  }

A = [3, 5, 4, 1, 3, 4, 1, 4]
N = 8
L = 1, U = 5
count = [0, 0, 0, 0, 0, 0]
count = [0, 2, 0, 2, 3, 1]
count = [0, 2, 2, 4, 7, 8]
64. Distribution Sort, Continued

  static void countingSort(int[] A) {
      ...
      int[] B = new int[N];
      for (int i = 0; i < N; i += 1) {
          B[count[A[i] - L]] = A[i];    // place A[i] at its position
          count[A[i] - L] += 1;
      }
      // copy back into A
      for (int i = 0; i < N; i += 1)
          A[i] = B[i];
  }

A = [3, 5, 4, 1, 3, 4, 1, 4]
count = [0, 2, 2, 4, 7, 8]
B = [0, 0, 0, 0, 0, 0, 0, 0]
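The two halves above combine into one runnable version (min/max computed inline, since the slides' min(A)/max(A) are pseudocode helpers):

```java
// The complete countingSort from the two preceding slides, made runnable.
class CountingSortDemo {
    static void countingSort(int[] A) {
        int N = A.length;
        if (N == 0) return;
        int L = A[0], U = A[0];
        for (int x : A) { L = Math.min(L, x); U = Math.max(U, x); }
        int[] count = new int[U - L + 2];
        for (int i = 0; i < N; i += 1)             // histogram, offset by one
            count[A[i] - L + 1] += 1;
        for (int j = 1; j < count.length; j += 1)  // prefix sums = start positions
            count[j] += count[j - 1];
        int[] B = new int[N];
        for (int i = 0; i < N; i += 1) {           // distribute into position
            B[count[A[i] - L]] = A[i];
            count[A[i] - L] += 1;
        }
        for (int i = 0; i < N; i += 1)             // copy back into A
            A[i] = B[i];
    }
}
```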
65. Analysis of Counting Sort
- What is the complexity of each step for an n-element array?
- Find min and max: Theta(n)
- Fill in count histogram: Theta(n)
- Compute sums of count: Theta(max-min)
- Fill in B (run over count): Theta(max-min)
- Copy B to A: Theta(n)
- So this is a Theta(n + m) algorithm, where m = max - min
- Great if the range of keys isn't too large
- If m < n, then Theta(n) overall