FFTs, Portability, - PowerPoint PPT Presentation

1 / 41

About This Presentation

Title:

FFTs, Portability,

Description:

Allows easy experimentation with different optimizations & algorithms. ... Many Resulting 'Algorithms' INDIRECT TRANSPOSE gives in-place DFTs, ... – PowerPoint PPT presentation

Number of Views:107

Avg rating:3.0/5.0

Slides: 42

Provided by: jdjoann

Category:

more less

Transcript and Presenter's Notes

Title: FFTs, Portability,

1
FFTs, Portability, Performance

Steven G. Johnson, MIT
Matteo Frigo, ITA Software (formerly MIT LCS)

2
Fast Fourier Transforms (FFTs)
Gauss (1805), Cooley Tukey (1965), others
Fast, O(n log n) algorithm(s) to compute the
Discrete Fourier Transform (DFT) of n points
Widely used in many fields, including to
solve partial differential equations e.g.
electromagnetic eigenmodes of photonic
crystals
I couldnt find any FFT code that was
simultaneously fast (near-optimal),
flexible, and portable
3
Whats the fastest algorithm for _____?(computer
science math time math )
Find best asymptotic complexity naïve DFT to
FFT O(n2) to O(n log n)
1
Find best exact operation count? flops 5 n log
n Cooley-Tukey, 1965, radix 2 to 4 n
log n Yavne, 1968, split radix to
2
4
Whats the fastest algorithm for _____?(computer
science math time math )
Find best asymptotic complexity naïve DFT to
FFT O(n2) to O(n log n)
1
Find variant/implementation that runs
fastest hardware-dependent unstable answer!
2
5
Whats the simplest/smallestset of algorithmic
stepswhose compositions alwaysspan the
fastest algorithm?
A question with a more stable answer?
6
FFTW
the Fastest Fourier Tranform in the West
C library for real complex FFTs (arbitrary
size/dimensionality)
( parallel versions for threads MPI)
Computational kernels (80 of code)
automatically generated
Self-optimizes for your hardware (picks best
composition of steps) portability performance
7
FFTW performancepower-of-two sizes, double
precision
833 MHz Alpha EV6
2 GHz PowerPC G5
500 MHz Ultrasparc IIe
2 GHz AMD Opteron
8
FFTW performancenon-power-of-two sizes, double
precision
unusual non-power-of-two sizes receive as much
optimization as powers of two
833 MHz Alpha EV6
2 GHz AMD Opteron
because we let the code do the optimizing
9
FFTW performancedouble precision, 2.8GHz Pentium
IV 2-way SIMD (SSE2)
powers of two
exploiting CPU-specific SIMD instructions (rewriti
ng the code) is easy
non-powers-of-two
because we let the code write itself
10
Why is FFTW fast?three unusual features
FFTW implements many FFT algorithms A planner
picks the best composition by measuring the speed
of different combinations.
The resulting plan is executed with explicit
recursion enhances locality
The base cases of the recursion are
codelets highly-optimized dense
code automatically generated by a special-purpose
compiler
11
FFTW is easy to use
complex xn plan p p plan_dft_1d(n, x,
x, FORWARD, MEASURE) ... execute(p) / repeat
as needed / ... destroy_plan(p)
12
Why is FFTW fast?three unusual features
FFTW implements many FFT algorithms A planner
picks the best composition by measuring the speed
of different combinations.
3
The resulting plan is executed with explicit
recursion enhances locality
1
The base cases of the recursion are
codelets highly-optimized dense
code automatically generated by a special-purpose
compiler
2
13
Cooley-Tukey FFTs
1965 or Gauss, 1802
n pq
1d DFT of size n
2d DFT of size p x q
first DFT columns, size q (non-contiguous)
finally, DFT columns, size p (non-contiguous)
14
Cooley-Tukey FFTs
1965 or Gauss, 1802
n pq
1d DFT of size n
2d DFT of size p x q ( phase rotation
by twiddle factors)
Recursive DFTs of sizes p and q
O(n2)
O(n log n)
(divide-and-conquer algorithm)
15
Cooley-Tukey FFTs
1965 or Gauss, 1802
twiddles
size-p DFTs
size-q DFTs
16
Cooley-Tukey is Naturally Recursive
Size 8 DFT
p 2 (radix 2)
Size 4 DFT
Size 4 DFT
Size 2 DFT
Size 2 DFT
Size 2 DFT
Size 2 DFT
17
Traditional cache solution Blocking
Size 8 DFT
p 2 (radix 2)
Size 4 DFT
Size 4 DFT
Size 2 DFT
Size 2 DFT
Size 2 DFT
Size 2 DFT
breadth-first, but with blocks of size cache
requires program specialized for cache size
18
Recursive Divide Conquer is Good
Singleton, 1967
(depth-first traversal)
Size 8 DFT
p 2 (radix 2)
Size 4 DFT
Size 4 DFT
Size 2 DFT
Size 2 DFT
Size 2 DFT
Size 2 DFT
19
Cache Obliviousness
A cache-oblivious algorithm does not know the
cache size it can be optimal for any machine
for all levels of cache simultaneously
Exist for many other algorithms, too Frigo et
al. 1999
all via the recursive divide conquer approach
20
Why is FFTW fast?three unusual features
FFTW implements many FFT algorithms A planner
picks the best composition by measuring the speed
of different combinations.
3
The resulting plan is executed with explicit
recursion enhances locality
1
The base cases of the recursion are
codelets highly-optimized dense
code automatically generated by a special-purpose
compiler
2
21
The Codelet Generator
a domain-specific FFT compiler
Generates fast hard-coded C for FFTs of
arbitrary size
Necessary to give the planner a large space of
codelets to experiment with (any factorization).
Exploits modern CPU deep pipelines large
register sets.
Allows easy experimentation with different
optimizations algorithms. CPU-specific hacks
(SIMD) feasible
22
The Codelet Generator
written in Objective Caml Leroy, 1998, an ML
dialect
Abstract FFT algorithm
n
Cooley-Tukey npq, Prime-Factor gcd(p,q)
1, Rader n prime,
Symbolic graph (dag)
Simplifications
powerful enough to e.g. derive real-input
FFT from complex FFT algorithm and even find new
algorithms
Cache-oblivious scheduling (cache registers)
Optimized C code (or other language)
23
The Generator Finds Good/New FFTs
24
Symbolic Algorithms are EasyCooley-Tukey in OCaml
25
Simple Simplifications
Well-known optimizations
Algebraic simplification, e.g. a 0 a
Constant folding
Common-subexpression elimination
26
Symbolic Pattern Matching in OCaml
The following actual code fragment is solely
responsible for simplifying multiplications
stimesM function (Uminus a, b) -gt
stimesM (a, b) gtgt suminusM (a, Uminus b)
-gt stimesM (a, b) gtgt suminusM (Num a, Num
b) -gt snumM (Number.mul a b) (Num a, Times
(Num b, c)) -gt snumM (Number.mul a b) gtgt
fun x -gt stimesM (x, c) (Num a, b) when
Number.is_zero a -gt snumM Number.zero (Num
a, b) when Number.is_one a -gt makeNode b
(Num a, b) when Number.is_mone a -gt suminusM b
(a, b) when is_known_constant b not
(is_known_constant a) -gt stimesM (b, a)
(a, b) -gt makeNode (Times (a, b))
(Common-subexpression elimination is implicit
via memoization and monadic programming style.)
27
Simple Simplifications
Well-known optimizations
Algebraic simplification, e.g. a 0 a
Constant folding
Common-subexpression elimination
28
A Quiz Is One Faster?
Both compute the same thing, and have the same
number of arithmetic operations
a 0.5 b c 0.5 d e 1.0 a f 1.0 -
c
a 0.5 b c -0.5 d e 1.0 a f 1.0
c
1015 speedup
29
Non-obvious transformations require
experimentation
30
Quiz 2 Which is Faster?
accessing strided array inside codelet (amid
dense numeric code)
arraystride i
arraystridesi
using precomputed stride array
stridesi stride i
namely, Intel Pentia integer multiplication
conflicts with floating-point
up to 20 speedup
31
Machine-specific hacksare feasibleif you just
generate special code
stride precomputation SIMD instructions (SSE,
Altivec, 3dNow!) fused multiply-add instructions
32
Why is FFTW fast?three unusual features
FFTW implements many FFT algorithms A planner
picks the best composition by measuring the speed
of different combinations.
3
The resulting plan is executed with explicit
recursion enhances locality
1
The base cases of the recursion are
codelets highly-optimized dense
code automatically generated by a special-purpose
compiler
2
33
What does the planner compose?
(new in FFTW 3, spring 2003)
The Cooley-Tukey algorithm presents many
choices which factorization? what order?
memory reshuffling?
Find simple steps that combine without
restriction to form many different algorithms.
34
Some Composable Steps (out of 16)
SOLVE Directly solve a small DFT by a codelet
CT-FACTORr Radix-r Cooley-Tukey step r
(loop) sub-problems of size n/r
VECLOOP Perform one vector loop (can choose
any loop, i.e. loop reordering)
INDIRECT DFT copy in-place
DFT (separates copy/reordering from DFT)
TRANPOSE solve in-place m ? n transpose
35
Dynamic Programmingthe assumption of optimal
substructure
Try all applicable steps
CT-FACTOR2 2 DFT(8) CT-FACTOR4 4 DFT(4)
DFT(16) fastest of
CT-FACTOR2 2 DFT(4) CT-FACTOR4 4
DFT(2) SOLVE1,8
DFT(8) fastest of
If exactly the same problem appears twice, assume
that we can re-use the plan.
36
Many Resulting Algorithms
INDIRECT TRANSPOSE gives in-place DFTs,
bit-reversal product of transpositions no
separate bit-reversal pass Johnson
(unrelated) Burrus (1984)
VECLOOP can push topmost loop to leaves
vector FFT algorithm Swarztrauber (1987)
CT-FACTOR then VECLOOP(s) gives breadth-first
FFT, erases iterative/recursive distinction
37
Actual Plan for size 219524288(2GHz Pentium IV,
double precision, out-of-place)
CT-FACTOR4 (buffered variant)
CT-FACTOR32 (buffered variant)
VECLOOP(b) x32
CT-FACTOR64
INDIRECT VECLOOP(b) ( ) huge
improvements for large 1d sizes
INDIRECT
VECLOOP(b) x64
VECLOOP(a) x4
COPY64
VECLOOP(a) x4
SOLVE64, 64
38
Planner Unpredictability
double-precision, power-of-two sizes, 2GHz
PowerPC G5
FFTW 3
Classic strategy minimize ops fails badly
another test
Use plan from another machine? e.g.
Pentium-IV? lose 2040
heuristic pick plan with fewest adds
multiplies loads/stores
39
Weve Come a Long Way
1965 Cooley Tukey, IBM 7094, 36-bit single
precision size 2048 DFT in 1.2 seconds
2003 FFTW3SIMD, 2GHz Pentium-IV 64-bit double
precision size 2048 DFT in 50 microseconds
(24,000x speedup)
( 30 improvement per year)
40
Weve Come a Long Way?
41
FFTW Homework Problems?
Try an FFTPACK-style back-and-forth solver
Implement Vector-Radix for multi-dimensional n
Pruned FFTs VECLOOP that skips zeros
Better heuristic planner some sort of
optimization of per-solver costs?
Implement convolution solvers

Write a Comment

User Comments (0)