Distributed WHT Algorithms

1
Distributed WHT Algorithms
Kang Chen Jeremy Johnson Computer Science Drexel
University
Franz Franchetti Electrical and Computer
Engineering Carnegie Mellon University
http://www.spiral.net
2
Sponsors
Work supported by DARPA (DSO), Applied &
Computational Mathematics Program, OPAL, through
research grant DABT63-98-1-0004 administered by
the Army Directorate of Contracting.
3
Objective
  • Generate high-performance implementations of
    linear computations (signal transforms) from
    mathematical descriptions
  • Explore alternative implementations and optimize
    using formula generation, manipulation and search
  • Prototype implementation using the WHT as the
    prototype transform
  • Build on the existing sequential package
  • SMP implementation using OpenMP
  • Distributed memory implementation using MPI
  • Sequential package presented at ICASSP 2000 and
    2001; OpenMP extension presented at IPDPS 2002
  • Incorporate into SPIRAL: automatic performance
    tuning for DSP transforms

4
Outline
  • Introduction
  • Bit permutations
  • Distributed WHT algorithms
  • Theoretical results
  • Experimental results

5
Outline
  • Introduction
  • Bit permutations
  • Distributed WHT algorithms
  • Theoretical results
  • Experimental results

6
Walsh-Hadamard Transform
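The slide's content was a figure in the original; the transform it defines is standard, so a LaTeX reconstruction of the definition (and of the tensor-product factorizations that generate the algorithm space discussed on the following slides):

```latex
% Walsh-Hadamard transform of size 2^n (tensor-product definition)
WHT_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
WHT_{2^n} = \bigotimes_{i=1}^{n} WHT_2 .
% Algorithms arise from factorizations indexed by compositions
% n = n_1 + \cdots + n_t:
WHT_{2^n} = \prod_{i=1}^{t}
  \left( I_{2^{n_1+\cdots+n_{i-1}}} \otimes WHT_{2^{n_i}}
         \otimes I_{2^{n_{i+1}+\cdots+n_t}} \right).
```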
7
SPIRAL WHT Package
  • All WHT algorithms have the same arithmetic cost
    O(N log N) but different data access patterns
    and varying amounts of recursion and iteration
  • Small transforms (sizes 2^1 to 2^8) are implemented
    with straight-line code to reduce overhead
  • The WHT package allows exploration of the O(7^n)
    different algorithms and implementations using a
    simple grammar
  • Optimization/adaptation to architectures is
    performed by searching for the fastest algorithm
  • Dynamic Programming (DP)
  • Evolutionary Algorithm (STEER)

Johnson and Püschel ICASSP 2000
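Each algorithm the grammar generates corresponds to a composition n = n1 + … + nt (a split tree), evaluated as a product of I ⊗ WHT_(2^ni) ⊗ I factors. A minimal Python sketch of evaluating one such tree (illustrative only; the actual package is C code with unrolled small transforms):

```python
def small_wht(v):
    """Direct recursive WHT of a vector whose length is a power of two."""
    if len(v) == 1:
        return v[:]
    h = len(v) // 2
    a, b = small_wht(v[:h]), small_wht(v[h:])
    return [p + q for p, q in zip(a, b)] + [p - q for p, q in zip(a, b)]

def wht_by_tree(x, parts):
    """Evaluate WHT_(2^n) as a product of I ⊗ WHT_(2^ni) ⊗ I factors,
    one factor per entry of `parts` (a composition of n)."""
    n = sum(parts)
    N = 1 << n
    assert len(x) == N
    y = list(x)
    L = 1                          # size of the left identity factor
    for ni in parts:
        Ni = 1 << ni
        R = N // (L * Ni)          # size of the right identity factor
        for j in range(L):
            for k in range(R):
                idx = [j * Ni * R + m * R + k for m in range(Ni)]
                sub = small_wht([y[t] for t in idx])
                for t, val in zip(idx, sub):
                    y[t] = val
        L *= Ni
    return y

x = [1, 2, 3, 4, 5, 6, 7, 8]
# Different split trees, identical results -- only data access differs.
print(wht_by_tree(x, [3]) == wht_by_tree(x, [1, 2])
      == wht_by_tree(x, [1, 1, 1]))  # True
```

All these trees compute the same transform with the same O(N log N) arithmetic cost; they differ only in recursion depth and memory access pattern, which is what the search exploits.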
8
Performance of WHT Algorithms (II)
  • Automatically generate random algorithms for
    WHT of size 2^16 using SPIRAL
  • Only difference: the order of arithmetic
    instructions

9
Architecture Dependency
  • The best WHT algorithms also depend on
    architecture
  • Memory hierarchy
  • Cache structure
  • Cache miss penalty

10
Outline
  • Introduction
  • Bit permutations
  • Distributed WHT algorithms
  • Theoretical results
  • Experimental results

11
Bit Permutations
  • Definition
  • σ: a permutation of {0, 1, …, n−1}
  • (b_{n−1} … b_1 b_0): binary representation of i,
    0 ≤ i < 2^n
  • P_σ: permutation of {0, 1, …, 2^n−1} defined by
  • (b_{n−1} … b_1 b_0) ↦ (b_{σ(n−1)} … b_{σ(1)} b_{σ(0)})
  • Distributed interpretation
  • P = 2^p processors
  • Block cyclic data distribution
  • Leading p bits are the pid
  • Trailing (n−p) bits are the local offset
  • pid | offset ↦ pid | offset
  • (b_{n−1} … b_{n−p} | b_{n−p−1} … b_1 b_0) ↦
    (b_{σ(n−1)} … b_{σ(n−p)} | b_{σ(n−p−1)} … b_{σ(1)} b_{σ(0)})

12
Stride Permutation
Write at stride 4 (= 8/2)
  • 000 ↦ 000
  • 001 ↦ 100
  • 010 ↦ 001
  • 011 ↦ 101
  • 100 ↦ 010
  • 101 ↦ 110
  • 110 ↦ 011
  • 111 ↦ 111

(b2 b1 b0) ↦ (b0 b2 b1)
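The table on this slide follows directly from the bit-permutation definition; a small sketch (function name is illustrative):

```python
def bit_perm(sigma, i, n):
    """Apply P_sigma to index i: output bit k of the image
    is input bit sigma[k] of i."""
    return sum(((i >> sigma[k]) & 1) << k for k in range(n))

# Stride-by-4 permutation on 8 points: (b2 b1 b0) -> (b0 b2 b1),
# i.e. output bit 2 reads b0, bit 1 reads b2, bit 0 reads b1.
sigma = [1, 2, 0]
print([bit_perm(sigma, i, 3) for i in range(8)])  # [0, 4, 1, 5, 2, 6, 3, 7]
```

The printed list reproduces the slide's table: index 1 (001) goes to 4 (100), index 2 (010) goes to 1 (001), and so on.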
13
Distributed Stride Permutation
  • 0000 ↦ 0000
  • 0001 ↦ 1000
  • 0010 ↦ 0001
  • 0011 ↦ 1001
  • 0100 ↦ 0010
  • 0101 ↦ 1010
  • 0110 ↦ 0011
  • 0111 ↦ 1011
  • 1000 ↦ 0100
  • 1001 ↦ 1100
  • 1010 ↦ 0101
  • 1011 ↦ 1101
  • 1100 ↦ 0110
  • 1101 ↦ 1110
  • 1110 ↦ 0111
  • 1111 ↦ 1111

14
Communication Pattern
Each PE sends 1/2 of its data to 2 different PEs
Looks nicely regular
15
Communication Pattern
but is highly irregular
16
Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs,
and the pattern gets worse for larger stride parameters of L.
17
Multi-Swap Permutation
Writes at stride 4; pairwise exchange of data
  • 000 ↦ 000
  • 001 ↦ 100
  • 010 ↦ 010
  • 011 ↦ 110
  • 100 ↦ 001
  • 101 ↦ 101
  • 110 ↦ 011
  • 111 ↦ 111

(b2 b1 b0) ↦ (b0 b1 b2)
18
Communication Pattern
Each PE exchanges 1/2 of its data with one other PE
(4 size-2 All-to-Alls)
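The contrast between the two patterns can be checked numerically. In this sketch (parameters chosen for illustration: n = 6 address bits, p = 3, so 8 PEs holding 8 points each; function names are hypothetical), the stride permutation rotates all bits, while the multi-swap transposes one pid bit with one offset bit:

```python
def bit_perm(sigma, i, n):
    """Output bit k of the image is input bit sigma[k] of i."""
    return sum(((i >> sigma[k]) & 1) << k for k in range(n))

def partners(sigma, n, p):
    """For each PE, the set of *other* PEs it must send data to."""
    out = {pe: set() for pe in range(1 << p)}
    for i in range(1 << n):
        src, dst = i >> (n - p), bit_perm(sigma, i, n) >> (n - p)
        if dst != src:
            out[src].add(dst)
    return out

n, p = 6, 3
# Stride permutation: rotate all bits, (b5..b0) -> (b0 b5 b4 b3 b2 b1).
rotate = [(k + 1) % n for k in range(n)]
# Multi-swap: transpose pid bit b5 with offset bit b0, all others fixed.
swap = list(range(n))
swap[0], swap[n - 1] = swap[n - 1], swap[0]

print(partners(swap, n, p) == {pe: {pe ^ 4} for pe in range(8)})  # True
print(partners(rotate, n, p))  # irregular: destination sets vary per PE
```

With the multi-swap, every PE has exactly one communication partner (pid XOR 100), giving 4 disjoint pairwise exchanges; with the rotation, the destination sets differ from PE to PE.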
19
Communication Pattern
[Figure: PEs 0-7 arranged in a ring; data blocks X(026) and X(127) exchanged pairwise]
20
Communication Pattern
Each PE sends 1/4 of its data to 4 different PEs
(2 size-4 All-to-Alls)
21
Communication Scheduling
  • Order-2 Latin square
  • Used to schedule the All-to-All permutation
  • Uses point-to-point communication
  • Simple recursive construction

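A sketch of the recursive construction and the schedule it induces. The slide gives no details, so this assumes the standard doubling construction, whose entries satisfy L[i][j] = i XOR j:

```python
def latin(k):
    """Latin square of order 2^k, built recursively:
    L(2m) = [[L(m), L(m)+m], [L(m)+m, L(m)]], from the order-2 base case."""
    if k == 0:
        return [[0]]
    prev = latin(k - 1)
    m = 1 << (k - 1)
    top = [row + [v + m for v in row] for row in prev]
    bot = [[v + m for v in row] + row for row in prev]
    return top + bot

for row in latin(2):
    print(row)

# Since L[i][j] == i ^ j, round r = 1..P-1 of an All-to-All pairs PE i
# with PE i^r -- a perfect matching realized with point-to-point messages.
rounds = [[(i, i ^ r) for i in range(4) if i < (i ^ r)] for r in range(1, 4)]
print(rounds)  # [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(0, 3), (1, 2)]]
```

Each round is a set of disjoint pairwise exchanges, which is why point-to-point primitives suffice to schedule the full All-to-All.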
22
Outline
  • Introduction
  • Bit permutations
  • Distributed WHT algorithms
  • Theoretical results
  • Experimental results

23
Parallel WHT Package
  • WHT partition tree is parallelized at the root
    node
  • SMP implementation obtained using OpenMP
  • Distributed memory implementation using MPI
  • Dynamic programming decides when to use
    parallelism
  • DP decides the best parallel root node
  • DP builds the partition with best sequential
    subtrees

Sequential WHT package: Johnson and Püschel,
ICASSP 2000 and ICASSP 2001. Dynamic data layout:
N. Park and V. K. Prasanna, ICASSP 2001. OpenMP SMP
version: K. Chen and J. Johnson, IPDPS 2002.
24
Distributed Memory WHT Algorithms
  • Distributed split, d_split, as root node
  • Data equally distributed among processors
  • Distributed stride permutation to exchange data
  • Different sequences of permutations are possible
  • Parallel form: WHT transforms applied to local data

[Formula: sequential WHT algorithm]
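The root-node split can be simulated sequentially. This sketch is a simplified transpose-style variant, not the package's actual d_split code (which explores several permutation sequences); it uses the factorization WHT_(PQ) = (WHT_P ⊗ I_Q)(I_P ⊗ WHT_Q). Each of P simulated PEs first transforms its local block, an All-to-All regroups elements with equal local offset, and a small WHT across the PE dimension finishes the job:

```python
def small_wht(v):
    """Direct recursive WHT of a power-of-two-length vector."""
    if len(v) == 1:
        return v[:]
    h = len(v) // 2
    a, b = small_wht(v[:h]), small_wht(v[h:])
    return [p + q for p, q in zip(a, b)] + [p - q for p, q in zip(a, b)]

def distributed_wht(x, P):
    N = len(x)
    Q = N // P                     # local block size
    B = Q // P                     # offset groups per PE; assumes P divides Q
    # Step 1: block data distribution, local WHT_Q on every PE  (I_P ⊗ WHT_Q)
    pe = [small_wht(x[m * Q:(m + 1) * Q]) for m in range(P)]
    # Step 2: All-to-All -- PE j collects, for each offset k in its range,
    # the elements at that offset from all P PEs
    grp = [[[pe[m][k] for m in range(P)] for k in range(j * B, (j + 1) * B)]
           for j in range(P)]
    # Step 3: WHT_P across the former PE dimension  (WHT_P ⊗ I_Q)
    grp = [[small_wht(g) for g in gs] for gs in grp]
    # Step 4: All-to-All back to the natural block layout
    y = [0] * N
    for j in range(P):
        for b, g in enumerate(grp[j]):
            k = j * B + b
            for m in range(P):
                y[k + m * Q] = g[m]
    return y

x = list(range(32))
print(distributed_wht(x, 4) == small_wht(x))  # True
```

Each PE touches only its own Q-element block plus the 1/P-sized slices it exchanges in the two All-to-All steps, which is the communication volume the theoretical results below account for.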
25
Outline
  • Introduction
  • Bit permutations
  • Distributed WHT algorithms
  • Theoretical results
  • Experimental results

26
Theoretical Results
  • Problem statement
  • Find the sequence of permutations that minimizes
    communication and congestion
  • Pease dataflow
  • Total bandwidth: N log(N)(1 − 1/P)
  • Conjectured optimal
  • Total bandwidth: (N/2) log(P) + N(1 − 1/P)
  • The optimal schedule uses independent pairwise
    exchanges (except for the last permutation)

27
Pease Dataflow
28
Theoretically Optimal Dataflow
29
Outline
  • Introduction
  • Bit permutations
  • Distributed WHT algorithms
  • Theoretical results
  • Experimental results

30
Experimental Results
  • Platform
  • 32 Pentium III processors, 450 MHz
  • 512 MB 8 ns PCI-100 memory
  • 2 SMC 100 Mbps Fast Ethernet cards
  • Distributed WHT package implemented using MPI
  • Experiments
  • All-to-All
  • Distributed stride vs. multi-swap permutations
  • Distributed WHT

31
All-to-All
Three different implementations of the All-to-All
permutation were compared; the point-to-point
implementation is fastest.
32
Stride vs. Multi-Swap
33
Distributed WHT of size 2^30
[Figure: performance comparison]
34
Summary
  • Self-adapting WHT package
  • Optimize distributed WHT over different
    communication patterns and combinations of
    sequential code
  • Use of point-to-point primitives for all-to-all

Ongoing work
  • Lower bounds
  • Use high-speed interconnect
  • Generalize to other transforms
  • Incorporate into SPIRAL

http://www.spiral.net