Title: Distributed WHT Algorithms
1. Distributed WHT Algorithms
Kang Chen and Jeremy Johnson, Computer Science, Drexel University
Franz Franchetti, Electrical and Computer Engineering, Carnegie Mellon University
http://www.spiral.net
2. Sponsors
Work supported by DARPA (DSO), Applied Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
3. Objective
- Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
- Explore alternative implementations and optimize using formula generation, manipulation, and search
- Prototype the approach using the WHT
  - Build on the existing sequential package
  - SMP implementation using OpenMP
  - Distributed memory implementation using MPI
  - Sequential package presented at ICASSP 2000 and 2001; OpenMP extension presented at IPDPS 2002
- Incorporate into SPIRAL: automatic performance tuning for DSP transforms
4. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
5. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
6. Walsh-Hadamard Transform
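The body of this slide (presumably the defining equations) did not survive extraction. The standard definition, consistent with the factorizations the WHT package searches over, is:

    % The 2-point transform is the butterfly matrix; larger sizes are
    % n-fold tensor powers.
    \[
      \mathrm{WHT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
      \qquad
      \mathrm{WHT}_{2^n} = \bigotimes_{i=1}^{n} \mathrm{WHT}_2 .
    \]
    % Any split n = n_1 + ... + n_t yields a recursive algorithm:
    \[
      \mathrm{WHT}_{2^n} = \prod_{i=1}^{t}
        \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}}
               \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right).
    \]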
7. SPIRAL WHT Package
- All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration
- Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
- The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar (a recursive sketch follows below)
- Optimization/adaptation to architectures is performed by searching for the fastest algorithm
  - Dynamic Programming (DP)
  - Evolutionary Algorithm (STEER)
Johnson and Püschel, ICASSP 2000
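To make the grammar concrete, here is a minimal sequential sketch (our illustration, not the package's actual API; the name wht and the fixed halving split are assumptions). Each call realizes one factorization rule; the package instead considers every split n = n1 + ... + nt at every node, which is where the O(7^n) count comes from.

    #include <stdio.h>

    /* Apply WHT_{2^n} to x[0], x[s], ..., x[(2^n - 1) * s], using
       WHT_{2^n} = (WHT_{2^n1} (x) I_{2^n2}) (I_{2^n1} (x) WHT_{2^n2}). */
    void wht(double *x, int n, long s) {
        if (n == 1) {                    /* base case: 2-point butterfly */
            double a = x[0], b = x[s];
            x[0] = a + b;
            x[s] = a - b;
            return;
        }
        int n1 = n / 2, n2 = n - n1;     /* one fixed split; the package searches all */
        long N1 = 1L << n1, N2 = 1L << n2;
        for (long i = 0; i < N1; i++)    /* I (x) WHT: contiguous blocks */
            wht(x + i * N2 * s, n2, s);
        for (long j = 0; j < N2; j++)    /* WHT (x) I: strided across blocks */
            wht(x + j * s, n1, N2 * s);
    }

    int main(void) {
        double x[8] = {1, 0, 0, 0, 0, 0, 0, 0};
        wht(x, 3, 1);                    /* WHT_8 of a unit impulse: all ones */
        for (int i = 0; i < 8; i++) printf("%g ", x[i]);
        printf("\n");
        return 0;
    }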
8. Performance of WHT Algorithms (II)
- Automatically generated random algorithms for WHT_{2^16} using SPIRAL
- Only difference: the order of the arithmetic instructions
9. Architecture Dependency
- The best WHT algorithm also depends on the architecture
  - Memory hierarchy
  - Cache structure
  - Cache miss penalty
  - ...
10. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
11. Bit Permutations
- Definition
  - Let σ be a permutation of {0, 1, ..., n-1}, and let (b_{n-1} ... b_1 b_0) be the binary representation of an index 0 ≤ i < 2^n.
  - P_σ is the permutation of {0, 1, ..., 2^n - 1} defined by (b_{n-1} ... b_1 b_0) → (b_{σ(n-1)} ... b_{σ(1)} b_{σ(0)})
- Distributed interpretation
  - P = 2^p processors
  - Block cyclic data distribution: the leading p bits are the pid, the trailing n-p bits are the local offset
  - (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) → (b_{σ(n-1)} ... b_{σ(n-p)} | b_{σ(n-p-1)} ... b_{σ(1)} b_{σ(0)}), with pid | offset on both sides
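A small helper pair showing how P_σ acts on indices and how the distributed reading splits the result (hypothetical names, ours rather than the package's):

    /* Output bit k of the result takes input bit sigma[k] of i, i.e.
       (b_{n-1} ... b_0) -> (b_{sigma(n-1)} ... b_{sigma(0)}). */
    unsigned long apply_bit_perm(unsigned long i, const int *sigma, int n) {
        unsigned long j = 0;
        for (int k = 0; k < n; k++)
            j |= ((i >> sigma[k]) & 1UL) << k;
        return j;
    }

    /* Distributed interpretation with P = 2^p processors: the leading p
       bits of an index are the pid, the trailing n-p bits the offset. */
    void split_index(unsigned long j, int n, int p,
                     unsigned long *pid, unsigned long *offset) {
        *pid    = j >> (n - p);
        *offset = j & ((1UL << (n - p)) - 1);
    }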
12. Stride Permutation
Write at stride 4 (= 8/2):
- 000 → 000
- 001 → 100
- 010 → 001
- 011 → 101
- 100 → 010
- 101 → 110
- 110 → 011
- 111 → 111
(b_2 b_1 b_0) → (b_0 b_2 b_1)
13. Distributed Stride Permutation
- 0000 → 0000
- 0001 → 1000
- 0010 → 0001
- 0011 → 1001
- 0100 → 0010
- 0101 → 1010
- 0110 → 0011
- 0111 → 1011
- 1000 → 0100
- 1001 → 1100
- 1010 → 0101
- 1011 → 1101
- 1100 → 0110
- 1101 → 1110
- 1110 → 0111
- 1111 → 1111
(b_3 b_2 b_1 b_0) → (b_0 b_3 b_2 b_1)
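The table above is the bit permutation with σ(k) = (k+1) mod 4, i.e. output bit k comes from input bit k+1 cyclically. Reusing apply_bit_perm from the slide 11 sketch, and taking p = 2 (four PEs) purely for illustration, one can tabulate where each global index lands:

    #include <stdio.h>
    /* assumes apply_bit_perm from the slide 11 sketch */
    int main(void) {
        int sigma[4] = {1, 2, 3, 0};   /* (b3 b2 b1 b0) -> (b0 b3 b2 b1) */
        for (unsigned long i = 0; i < 16; i++) {
            unsigned long j = apply_bit_perm(i, sigma, 4);
            /* p = 2: leading 2 bits = destination PE, low 2 bits = offset */
            printf("%2lu -> PE %lu, offset %lu\n", i, j >> 2, j & 3UL);
        }
        return 0;
    }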
14. Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs.
Looks nicely regular...
15. Communication Pattern
...but is highly irregular.
16. Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs,
and the pattern gets worse for larger parameters of the stride permutation L.
17. Multi-Swap Permutation
Writes at stride 4; pairwise exchange of data:
- 000 → 000
- 001 → 100
- 010 → 010
- 011 → 110
- 100 → 001
- 101 → 101
- 110 → 011
- 111 → 111
(b_2 b_1 b_0) → (b_0 b_1 b_2)
18. Communication Pattern
Each PE exchanges 1/2 of its data with one other PE (4 All-to-Alls of size 2), as sketched below.
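A minimal sketch of one such pairwise exchange (our code, under the assumption that a multi-swap step transposes pid bit i with local-offset bit j; since both PEs pack in increasing-offset order and flipping bit j preserves relative order, the m-th sent element lands in the m-th vacated slot):

    #include <mpi.h>
    #include <stdlib.h>

    /* One multi-swap step among P = 2^p PEs: transpose pid bit i with
       local-offset bit j. Elements whose offset bit j differs from pid
       bit i move to partner = pid ^ (1 << i); the others stay put. */
    void multi_swap_step(double *x, int local_n, int i, int j,
                         int pid, MPI_Comm comm) {
        int partner = pid ^ (1 << i);
        int keep = (pid >> i) & 1;       /* offset-bit-j value that stays */
        int half = local_n / 2, k, m;
        double *buf = malloc((size_t)half * sizeof(double));
        for (k = 0, m = 0; k < local_n; k++)   /* pack the moving half */
            if (((k >> j) & 1) != keep)
                buf[m++] = x[k];
        MPI_Sendrecv_replace(buf, half, MPI_DOUBLE, partner, 0,
                             partner, 0, comm, MPI_STATUS_IGNORE);
        for (k = 0, m = 0; k < local_n; k++)   /* unpack into the same slots */
            if (((k >> j) & 1) != keep)
                x[k] = buf[m++];
        free(buf);
    }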
19. Communication Pattern
[Figure: exchange pattern among PEs 0-7; data labels X(026) and X(127)]
20. Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs (2 All-to-Alls of size 4).
21. Communication Scheduling
- Order-two Latin square
- Used to schedule the All-to-All permutation
- Uses point-to-point communication
- Simple recursive construction (see the sketch below)
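One simple recursive construction consistent with the slide (our sketch): start from the order-2 Latin square and double. The result is the XOR table L[i][j] = i ^ j, each of whose rows is a perfect matching, so every round of the schedule is a set of concurrent pairwise exchanges.

    #include <stdio.h>

    /* Build an order-m Latin square by recursive doubling, with h = m/2:
         [ A      A + h ]
         [ A + h  A     ]
       starting from the order-2 square {{0,1},{1,0}}. The result equals
       L[i][j] = i ^ j: row r pairs PE j with PE r ^ j, an involution, so
       all exchanges of round r can proceed concurrently. */
    void latin(int *L, int stride, int m) {
        if (m == 1) { L[0] = 0; return; }
        int h = m / 2;
        latin(L, stride, h);
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) {
                int a = L[i * stride + j];
                L[i * stride + (j + h)]       = a + h;
                L[(i + h) * stride + j]       = a + h;
                L[(i + h) * stride + (j + h)] = a;
            }
    }

    int main(void) {
        enum { P = 8 };
        int L[P][P];
        latin(&L[0][0], P, P);
        for (int i = 0; i < P; i++) {          /* print the schedule */
            for (int j = 0; j < P; j++) printf("%d ", L[i][j]);
            printf("\n");
        }
        return 0;
    }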
22. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
23. Parallel WHT Package
- WHT partition tree is parallelized at the root node
- SMP implementation obtained using OpenMP
- Distributed memory implementation using MPI
- Dynamic programming decides when to use parallelism
  - DP decides the best parallel root node
  - DP builds the partition with the best sequential subtrees
Sequential WHT package: Johnson and Püschel, ICASSP 2000 and ICASSP 2001. Dynamic data layout: N. Park and V. K. Prasanna, ICASSP 2001. OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002.
24. Distributed Memory WHT Algorithms
- Distributed split, d_split, as the root node
- Data equally distributed among the processors
- Distributed stride permutations exchange the data; different sequences of permutations are possible
- Parallel form: WHT transform on local data using the sequential algorithm (a simplified sketch follows below)
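A minimal end-to-end sketch under our assumptions (the name dist_wht is ours, wht is the sequential sketch from slide 7, and we substitute hypercube butterflies at the root for the package's permutation sequences):

    #include <mpi.h>
    #include <stdlib.h>

    /* WHT_{2^n} on P = 2^p PEs, leading p index bits = pid.
       Uses WHT_{2^n} = (WHT_{2^p} (x) I)(I (x) WHT_{2^(n-p)}). */
    void dist_wht(double *x, int n, int p, int pid, MPI_Comm comm) {
        int local_n = 1 << (n - p);        /* 2^(n-p) points per PE */
        double *r = malloc((size_t)local_n * sizeof(double));
        wht(x, n - p, 1);                  /* local transform: I (x) WHT */
        for (int i = 0; i < p; i++) {      /* WHT_{2^p} (x) I, one pid bit per stage */
            int partner = pid ^ (1 << i);
            MPI_Sendrecv(x, local_n, MPI_DOUBLE, partner, 0,
                         r, local_n, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int k = 0; k < local_n; k++)  /* 2-point butterfly across pid bit i */
                x[k] = ((pid >> i) & 1) ? r[k] - x[k] : x[k] + r[k];
        }
        free(r);
    }

Note that this naive version ships the whole local array in each of the p stages, N log(P) words in total; the package's d_split instead exchanges data with distributed stride or multi-swap permutations and runs sequential WHTs in between, which is what moves the cost toward the conjectured optimum on slide 26.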
25. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
26. Theoretical Results
- Problem statement: find the sequence of permutations that minimizes communication and congestion
- Pease dataflow
  - Total bandwidth: N log(N) (1 - 1/P)
- Conjectured optimal
  - Total bandwidth: (N/2) log(P) + N (1 - 1/P)
- The optimal dataflow uses independent pairwise exchanges (except for the last permutation)
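A one-line accounting of the two figures (our reading: logs base 2, and we assume a '+' was lost between the two terms of the second formula):

    % Pease: log N global stages, each permuting all N points, of which a
    % fraction (1 - 1/P) crosses processor boundaries:
    \[ B_{\mathrm{Pease}} = N \log_2(N) \left(1 - \tfrac{1}{P}\right) \]
    % Conjectured optimal: log P pairwise-exchange stages, each moving
    % half the data, plus one final all-to-all permutation:
    \[ B_{\mathrm{opt}} = \tfrac{N}{2} \log_2(P) + N \left(1 - \tfrac{1}{P}\right) \]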
27. Pease Dataflow
28. Theoretically Optimal Dataflow
29. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
30. Experimental Results
- Platform
  - 32 Pentium III processors, 450 MHz
  - 512 MB of 8 ns PC-100 memory
  - 2 SMC 100 Mbps Fast Ethernet cards
- Distributed WHT package implemented using MPI
- Experiments
  - All-to-All
  - Distributed stride vs. multi-swap permutations
  - Distributed WHT
31. All-to-All
Three different implementations of the All-to-All permutation were compared.
Point-to-point is fastest (see the sketch below).
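A sketch of how the point-to-point variant can be written (our code and names, following the XOR/Latin-square schedule from slide 21 and assuming P is a power of two); the library baseline would be a single MPI_Alltoall call:

    #include <mpi.h>
    #include <string.h>

    /* All-to-All among P ranks: P-1 pairwise rounds, one MPI_Sendrecv
       each, following the round-r matching partner = rank ^ r. */
    void alltoall_p2p(const double *send, double *recv, int blk,
                      int rank, int P, MPI_Comm comm) {
        memcpy(recv + (long)rank * blk, send + (long)rank * blk,
               blk * sizeof(double));          /* own block: local copy */
        for (int r = 1; r < P; r++) {
            int partner = rank ^ r;            /* round r perfect matching */
            MPI_Sendrecv(send + (long)partner * blk, blk, MPI_DOUBLE, partner, 0,
                         recv + (long)partner * blk, blk, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }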
32. Stride vs. Multi-Swap
33. Distributed WHT of Size 2^30
34. Summary
- Self-adapting WHT package
- Optimizes the distributed WHT over different communication patterns and combinations of sequential code
- Uses point-to-point primitives for the All-to-All
Ongoing work
- Lower bounds
- Use a high-speed interconnect
- Generalize to other transforms
- Incorporate into SPIRAL
http://www.spiral.net