Title: Distributed WHT Algorithms
1. Distributed WHT Algorithms
Kang Chen and Jeremy Johnson, Computer Science, Drexel University
Franz Franchetti, Electrical and Computer Engineering, Carnegie Mellon University
http://www.spiral.net
2. Sponsors
Work supported by DARPA (DSO), Applied Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
3. Objective
- Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
- Explore alternative implementations and optimize using formula generation, manipulation, and search
- Prototype the approach using the WHT
  - Build on the existing sequential package
  - SMP implementation using OpenMP
  - Distributed memory implementation using MPI
  - Sequential package presented at ICASSP 2000 and 2001; OpenMP extension presented at IPDPS 2002
- Incorporate into SPIRAL: automatic performance tuning for DSP transforms
4. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
5. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
6. Walsh-Hadamard Transform
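The body of this slide (presumably the defining equations) did not survive extraction. The standard definition, consistent with the factorizations the WHT package searches over, is:

    % The 2-point transform is the butterfly matrix; larger sizes are
    % n-fold tensor powers.
    \[
      \mathrm{WHT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
      \qquad
      \mathrm{WHT}_{2^n} = \bigotimes_{i=1}^{n} \mathrm{WHT}_2 .
    \]
    % Any split n = n_1 + ... + n_t yields a recursive algorithm:
    \[
      \mathrm{WHT}_{2^n} = \prod_{i=1}^{t}
        \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}}
               \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right).
    \]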
7. SPIRAL WHT Package
- All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration
- Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
- The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar (a recursive sketch follows below)
- Optimization/adaptation to architectures is performed by searching for the fastest algorithm
  - Dynamic Programming (DP)
  - Evolutionary Algorithm (STEER)
Johnson and Püschel, ICASSP 2000
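To make the grammar concrete, here is a minimal sequential sketch (our illustration, not the package's actual API; the name wht and the fixed halving split are assumptions). Each call realizes one factorization rule; the package instead considers every split n = n1 + ... + nt at every node, which is where the O(7^n) count comes from.

    #include <stdio.h>

    /* Apply WHT_{2^n} to x[0], x[s], ..., x[(2^n - 1) * s], using
       WHT_{2^n} = (WHT_{2^n1} (x) I_{2^n2}) (I_{2^n1} (x) WHT_{2^n2}). */
    void wht(double *x, int n, long s) {
        if (n == 1) {                    /* base case: 2-point butterfly */
            double a = x[0], b = x[s];
            x[0] = a + b;
            x[s] = a - b;
            return;
        }
        int n1 = n / 2, n2 = n - n1;     /* one fixed split; the package searches all */
        long N1 = 1L << n1, N2 = 1L << n2;
        for (long i = 0; i < N1; i++)    /* I (x) WHT: contiguous blocks */
            wht(x + i * N2 * s, n2, s);
        for (long j = 0; j < N2; j++)    /* WHT (x) I: strided across blocks */
            wht(x + j * s, n1, N2 * s);
    }

    int main(void) {
        double x[8] = {1, 0, 0, 0, 0, 0, 0, 0};
        wht(x, 3, 1);                    /* WHT_8 of a unit impulse: all ones */
        for (int i = 0; i < 8; i++) printf("%g ", x[i]);
        printf("\n");
        return 0;
    }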
8. Performance of WHT Algorithms (II)
- Automatically generated random algorithms for WHT_{2^16} using SPIRAL
- Only difference: the order of the arithmetic instructions
9. Architecture Dependency
- The best WHT algorithm also depends on the architecture
  - Memory hierarchy
  - Cache structure
  - Cache miss penalty
  - ...
10. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
11. Bit Permutations
- Definition
  - Let σ be a permutation of {0, 1, ..., n-1}, and let (b_{n-1} ... b_1 b_0) be the binary representation of an index 0 ≤ i < 2^n.
  - P_σ is the permutation of {0, 1, ..., 2^n - 1} defined by (b_{n-1} ... b_1 b_0) → (b_{σ(n-1)} ... b_{σ(1)} b_{σ(0)})
- Distributed interpretation
  - P = 2^p processors
  - Block cyclic data distribution: the leading p bits are the pid, the trailing n-p bits are the local offset
  - (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) → (b_{σ(n-1)} ... b_{σ(n-p)} | b_{σ(n-p-1)} ... b_{σ(1)} b_{σ(0)}), with pid | offset on both sides
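A small helper pair showing how P_σ acts on indices and how the distributed reading splits the result (hypothetical names, ours rather than the package's):

    /* Output bit k of the result takes input bit sigma[k] of i, i.e.
       (b_{n-1} ... b_0) -> (b_{sigma(n-1)} ... b_{sigma(0)}). */
    unsigned long apply_bit_perm(unsigned long i, const int *sigma, int n) {
        unsigned long j = 0;
        for (int k = 0; k < n; k++)
            j |= ((i >> sigma[k]) & 1UL) << k;
        return j;
    }

    /* Distributed interpretation with P = 2^p processors: the leading p
       bits of an index are the pid, the trailing n-p bits the offset. */
    void split_index(unsigned long j, int n, int p,
                     unsigned long *pid, unsigned long *offset) {
        *pid    = j >> (n - p);
        *offset = j & ((1UL << (n - p)) - 1);
    }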
12. Stride Permutation
Write at stride 4 (= 8/2):
- 000 → 000
- 001 → 100
- 010 → 001
- 011 → 101
- 100 → 010
- 101 → 110
- 110 → 011
- 111 → 111
(b_2 b_1 b_0) → (b_0 b_2 b_1)
13. Distributed Stride Permutation
- 0000 → 0000
- 0001 → 1000
- 0010 → 0001
- 0011 → 1001
- 0100 → 0010
- 0101 → 1010
- 0110 → 0011
- 0111 → 1011
- 1000 → 0100
- 1001 → 1100
- 1010 → 0101
- 1011 → 1101
- 1100 → 0110
- 1101 → 1110
- 1110 → 0111
- 1111 → 1111
(b_3 b_2 b_1 b_0) → (b_0 b_3 b_2 b_1)
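The table above is the bit permutation with σ(k) = (k+1) mod 4, i.e. output bit k comes from input bit k+1 cyclically. Reusing apply_bit_perm from the slide 11 sketch, and taking p = 2 (four PEs) purely for illustration, one can tabulate where each global index lands:

    #include <stdio.h>
    /* assumes apply_bit_perm from the slide 11 sketch */
    int main(void) {
        int sigma[4] = {1, 2, 3, 0};   /* (b3 b2 b1 b0) -> (b0 b3 b2 b1) */
        for (unsigned long i = 0; i < 16; i++) {
            unsigned long j = apply_bit_perm(i, sigma, 4);
            /* p = 2: leading 2 bits = destination PE, low 2 bits = offset */
            printf("%2lu -> PE %lu, offset %lu\n", i, j >> 2, j & 3UL);
        }
        return 0;
    }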
14. Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs.
Looks nicely regular...
15. Communication Pattern
...but is highly irregular.
16. Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs,
and the pattern gets worse for larger parameters of the stride permutation L.
17. Multi-Swap Permutation
Writes at stride 4; pairwise exchange of data:
- 000 → 000
- 001 → 100
- 010 → 010
- 011 → 110
- 100 → 001
- 101 → 101
- 110 → 011
- 111 → 111
(b_2 b_1 b_0) → (b_0 b_1 b_2)
18. Communication Pattern
Each PE exchanges 1/2 of its data with one other PE (4 All-to-Alls of size 2), as sketched below.
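A minimal sketch of one such pairwise exchange (our code, under the assumption that a multi-swap step transposes pid bit i with local-offset bit j; since both PEs pack in increasing-offset order and flipping bit j preserves relative order, the m-th sent element lands in the m-th vacated slot):

    #include <mpi.h>
    #include <stdlib.h>

    /* One multi-swap step among P = 2^p PEs: transpose pid bit i with
       local-offset bit j. Elements whose offset bit j differs from pid
       bit i move to partner = pid ^ (1 << i); the others stay put. */
    void multi_swap_step(double *x, int local_n, int i, int j,
                         int pid, MPI_Comm comm) {
        int partner = pid ^ (1 << i);
        int keep = (pid >> i) & 1;       /* offset-bit-j value that stays */
        int half = local_n / 2, k, m;
        double *buf = malloc((size_t)half * sizeof(double));
        for (k = 0, m = 0; k < local_n; k++)   /* pack the moving half */
            if (((k >> j) & 1) != keep)
                buf[m++] = x[k];
        MPI_Sendrecv_replace(buf, half, MPI_DOUBLE, partner, 0,
                             partner, 0, comm, MPI_STATUS_IGNORE);
        for (k = 0, m = 0; k < local_n; k++)   /* unpack into the same slots */
            if (((k >> j) & 1) != keep)
                x[k] = buf[m++];
        free(buf);
    }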
19. Communication Pattern
[Figure: exchange pattern among PEs 0-7; data labels X(026) and X(127)]
20. Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs (2 All-to-Alls of size 4).
21. Communication Scheduling
- Order-two Latin square
- Used to schedule the All-to-All permutation
- Uses point-to-point communication
- Simple recursive construction (see the sketch below)
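One simple recursive construction consistent with the slide (our sketch): start from the order-2 Latin square and double. The result is the XOR table L[i][j] = i ^ j, each of whose rows is a perfect matching, so every round of the schedule is a set of concurrent pairwise exchanges.

    #include <stdio.h>

    /* Build an order-m Latin square by recursive doubling, with h = m/2:
         [ A      A + h ]
         [ A + h  A     ]
       starting from the order-2 square {{0,1},{1,0}}. The result equals
       L[i][j] = i ^ j: row r pairs PE j with PE r ^ j, an involution, so
       all exchanges of round r can proceed concurrently. */
    void latin(int *L, int stride, int m) {
        if (m == 1) { L[0] = 0; return; }
        int h = m / 2;
        latin(L, stride, h);
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) {
                int a = L[i * stride + j];
                L[i * stride + (j + h)]       = a + h;
                L[(i + h) * stride + j]       = a + h;
                L[(i + h) * stride + (j + h)] = a;
            }
    }

    int main(void) {
        enum { P = 8 };
        int L[P][P];
        latin(&L[0][0], P, P);
        for (int i = 0; i < P; i++) {          /* print the schedule */
            for (int j = 0; j < P; j++) printf("%d ", L[i][j]);
            printf("\n");
        }
        return 0;
    }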
22. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
23. Parallel WHT Package
- WHT partition tree is parallelized at the root node
- SMP implementation obtained using OpenMP
- Distributed memory implementation using MPI
- Dynamic programming decides when to use parallelism
  - DP decides the best parallel root node
  - DP builds the partition with the best sequential subtrees
Sequential WHT package: Johnson and Püschel, ICASSP 2000 and ICASSP 2001. Dynamic data layout: N. Park and V. K. Prasanna, ICASSP 2001. OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002.
24. Distributed Memory WHT Algorithms
- Distributed split, d_split, as the root node
- Data equally distributed among the processors
- Distributed stride permutations exchange the data; different sequences of permutations are possible
- Parallel form: WHT transform on local data using the sequential algorithm (a simplified sketch follows below)
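A minimal end-to-end sketch under our assumptions (the name dist_wht is ours, wht is the sequential sketch from slide 7, and we substitute hypercube butterflies at the root for the package's permutation sequences):

    #include <mpi.h>
    #include <stdlib.h>

    /* WHT_{2^n} on P = 2^p PEs, leading p index bits = pid.
       Uses WHT_{2^n} = (WHT_{2^p} (x) I)(I (x) WHT_{2^(n-p)}). */
    void dist_wht(double *x, int n, int p, int pid, MPI_Comm comm) {
        int local_n = 1 << (n - p);        /* 2^(n-p) points per PE */
        double *r = malloc((size_t)local_n * sizeof(double));
        wht(x, n - p, 1);                  /* local transform: I (x) WHT */
        for (int i = 0; i < p; i++) {      /* WHT_{2^p} (x) I, one pid bit per stage */
            int partner = pid ^ (1 << i);
            MPI_Sendrecv(x, local_n, MPI_DOUBLE, partner, 0,
                         r, local_n, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int k = 0; k < local_n; k++)  /* 2-point butterfly across pid bit i */
                x[k] = ((pid >> i) & 1) ? r[k] - x[k] : x[k] + r[k];
        }
        free(r);
    }

Note that this naive version ships the whole local array in each of the p stages, N log(P) words in total; the package's d_split instead exchanges data with distributed stride or multi-swap permutations and runs sequential WHTs in between, which is what moves the cost toward the conjectured optimum on slide 26.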
25. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
26. Theoretical Results
- Problem statement: find the sequence of permutations that minimizes communication and congestion
- Pease dataflow
  - Total bandwidth: N log(N) (1 - 1/P)
- Conjectured optimal
  - Total bandwidth: (N/2) log(P) + N (1 - 1/P)
- The optimal dataflow uses independent pairwise exchanges (except for the last permutation)
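A one-line accounting of the two figures (our reading: logs base 2, and we assume a '+' was lost between the two terms of the second formula):

    % Pease: log N global stages, each permuting all N points, of which a
    % fraction (1 - 1/P) crosses processor boundaries:
    \[ B_{\mathrm{Pease}} = N \log_2(N) \left(1 - \tfrac{1}{P}\right) \]
    % Conjectured optimal: log P pairwise-exchange stages, each moving
    % half the data, plus one final all-to-all permutation:
    \[ B_{\mathrm{opt}} = \tfrac{N}{2} \log_2(P) + N \left(1 - \tfrac{1}{P}\right) \]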
27. Pease Dataflow
28. Theoretically Optimal Dataflow
29. Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results
30. Experimental Results
- Platform
  - 32 Pentium III processors, 450 MHz
  - 512 MB of 8 ns PC-100 memory
  - 2 SMC 100 Mbps Fast Ethernet cards
- Distributed WHT package implemented using MPI
- Experiments
  - All-to-All
  - Distributed stride vs. multi-swap permutations
  - Distributed WHT
31. All-to-All
Three different implementations of the All-to-All permutation were compared.
Point-to-point is fastest (see the sketch below).
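A sketch of how the point-to-point variant can be written (our code and names, following the XOR/Latin-square schedule from slide 21 and assuming P is a power of two); the library baseline would be a single MPI_Alltoall call:

    #include <mpi.h>
    #include <string.h>

    /* All-to-All among P ranks: P-1 pairwise rounds, one MPI_Sendrecv
       each, following the round-r matching partner = rank ^ r. */
    void alltoall_p2p(const double *send, double *recv, int blk,
                      int rank, int P, MPI_Comm comm) {
        memcpy(recv + (long)rank * blk, send + (long)rank * blk,
               blk * sizeof(double));          /* own block: local copy */
        for (int r = 1; r < P; r++) {
            int partner = rank ^ r;            /* round r perfect matching */
            MPI_Sendrecv(send + (long)partner * blk, blk, MPI_DOUBLE, partner, 0,
                         recv + (long)partner * blk, blk, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }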
32. Stride vs. Multi-Swap
33. Distributed WHT of Size 2^30
34. Summary
- Self-adapting WHT package
- Optimizes the distributed WHT over different communication patterns and combinations of sequential code
- Uses point-to-point primitives for the All-to-All
Ongoing work
- Lower bounds
- Use a high-speed interconnect
- Generalize to other transforms
- Incorporate into SPIRAL
http://www.spiral.net