Title: Fast Fourier Transforms on GPUs
1Fast Fourier Transforms on GPUs
- Cory Quammen
- April 11, 2007
2Overview
- Applications exploiting DFT
- Discrete Fourier transform
- FFT algorithm basics
- Papers
- Performance comparison
3Applications
- Signal processing
- Waveform analysis
- Band-pass filtering
- Signal detection
- Convolution
- Signal filtering
- Polynomial multiplication
- Big integer multiplication
- Interpolation
- Padding with zeros in frequency domain implies
increased sampling rate in spatial (time) domain
4Applications
- Solving PDEs
- Finding nth derivative in spatial (time) domain
equivalent to multiplying by in in frequency
domain - Image compression
- Analyze parts of an image, discarding
high-frequency components - Medical image reconstruction
- MRI and ultrasound
- Volume rendering
- Fourier slice projection theorem
5Discrete Fourier Transform
- Transforms discrete, evenly-spaced samples from
spatial (or time) domain to frequency domain - Expresses input signal as a superposition of
sinusoids - Maps complex vectors to complex vectors
- All frequencies in original data are fully
represented in transformed data
6Discrete Fourier Transform
- Linear
- Invertible - original data can be fully recovered
- Important - discrete Fourier transform (DFT) is a
mathematical concept, not an algorithm
7DFT definition
Difference is a change of sign and scale factor
8Multi-dimensional DFT
- Nested summations over each dimension
- Computationally, simplifies to row-column
algorithm - Apply set of 1D DFTs in one dimension, then
subsequent sets of 1D DFTs in remaining dimensions
9Interpretation
- Fourier transform often characterized by two
things - Amplitude
- Phase
10F(impulse)
11F(cos wave)
12F(white noise)
13F(Gaussian)
14Naïve DFT computation
- Implement directly
- Complexity O(N2)
- Much too slow!
15Cooley-Tukey algorithm
- Developed by Gauss in 1805, but forgotten and
developed independently by Cooley and Tukey, and
others before them - Divide-and-conquer algorithm reduces complexity
to O(N log N)
16Faster DFT computation
- Apply divide-and-conquer
- Split input into 2 subarrays
- Even-index subarray
- Odd-index subarray
- Take DFT of subarrays
- Combine results
- Yields complexity O(N log N)
17Danielson-Lanczos Lemma
Separate summation into even-odd parts
Notice
Moreland2003
18Therefore
Recall
Moreland2003
19Recursive implementation
- Recursive formulation of FFT algorithm. Very
inefficient because - of data reordering and replication.
- function Fx myfft(fx)
- N length(fx)
- if (N 1)
- Fx fx
- return
- end
-
- Coefficient
- Wn exp(-i2pi/N)
- even myfft(fx(12N-1))
- odd myfft(fx(22N))
-
- exponents (0(N/2-1))'
- firstHalf even (Wn.exponents.odd)
- secondHalf even - (Wn.exponents.odd)
- Fx firstHalf secondHalf
20Recursive implementation caveats
- Requires stack space
- No stack on GPU
- Array reordering at every step
- Too much memory traffic
21Iterative implementation
- Reorder input
- Details to follow
- Start from smallest partition size first
- Fourier transform of length-one vector is
identity - Start with partition size p 2
- Combine results
- Double partition size p
- Repeat
22Input reordering
Reverse the bits!
23Iterative computation
Unweighted
Weighted
24Spitzer 2003
- Implemented 1D FFT
- Bit reversal is a lookup operation
- Indices and weights for each pass are precomputed
and stored in scramble texture - Pad real inputs to complex by setting imaginary
component to 0
Spitzer2003
25Implementation and results
- OpenGL and Cg
- 1D textures
- Single 2048 FFT
Spitzer2003
26Performance limitations
- Vertex processors not used (5 GFLOPS of unused
processing power) - Many OpenGL calls for ping-ponging
- Each pass must finish before next one starts
- 1D textures - not optimized in hardware
- Additional memory bandwidth for index and weight
lookups - Vector ALUs not exploited
Spitzer2003
27Performance improvements
- Packing complex numbers into 1 or more textures
- RG - first number, BA - second number
- Two textures RGBA1 - real, RGBA2 - imaginary
- Batching
- Fill 2D texture with 1D inputs
- Keep graphics pipeline filled
- Cuts ratio of driver calls to computation
substantially
Spitzer2003
28Mitchell et al. 2003
- Discusses FFT in context of image processing and
DirectX 9 - Similar arrangement and algorithm to
Spitzer2003 - Handles 2D FFTs
Mitchell2003
29FFT over columns
// Horizontal scramble first SetSurfaceAsTexture(
surfaceA) // Input image SetRenderTarget(
surfaceB) LoadShader( HorizontalScramble)
SetTexture( ButterflyTexturelog2(width))
DrawQuad() // Horizontal butterflies
LoadShader( HorizontalButterfly) SetTexture(
ButterflyTexturelog2(width)) for ( i 0 i lt
log2( width) i) SwapSurfacesAandB()
SetShaderConstant( pass, i/log2(width))
DrawQuad()
Mitchell2003
30FFT over rows
// Vertical scramble SwapSurfacesAandB()
LoadShader( VerticalScramble) SetTexture(
ButterflyTexturelog2(height)) DrawQuad()
// Vertical butterflies LoadShader(
VerticalButterfly) SetTexture(
ButterflyTexturelog2(height)) for ( i 0 i
lt log2( height) i) SwapSurfacesAandB()
SetShaderConstant( pass, i/log2(height))
DrawQuad()
Mitchell2003
31Moreland and Angel 2003
- Goal was to enable a real-time filtering stage
during image generation - Developed an indexing scheme to eliminate input
reordering - Optimized for real input signals
Moreland2003
32Optimizations for real input
- Most input images are real
- Symmetry in Fourier transform of real signals
- No need to add 0-valued imaginary components
- Uses half the memory
Moreland2003
33Optimizations for real input
- Can pack two reals into a single complex number
and apply standard complex-gtcomplex FFT - Treat one half of 1D input as real, other half as
imaginary - Must untangle results afterwards
Moreland2003
34Conceptual organization
1D Organization
2D Organization
Real Values
Imaginary Values
- No data reordering
- 2D packing
- Columns 0 and M/2 contain reals
- Other columns can be paired up as real and
imaginary components
Real Values
Imaginary Values
Moreland2003
35Program flow
Moreland2003
36Implementation
- OpenGL and Cg
- NVIDIA GeForce FX 5800 Ultra
- Ping-ponging
Moreland2003
37Results
- Arithmetic-bound
- 2.5 GFLOPS
- 3.4 GB/sec memory bandwidth
- 1024x1024 convolution
- FFTW on 1.7GHz Xeon took 0.625 seconds
- Their implementation took 2.7 seconds (with data
transfer)
Moreland2003
38Sumanaweera and Liu 2005
- Can operate on two complex inputs simultaneously
- Targets 2D images
- Columns first, then rows
- Auto-tunes to balance load among
- Vertex processors
- Rasterizer
- Fragment processors
Sumanaweera2005
39Compute buffers
- Uses one pbuffer with several draw buffers
- Use separate buffers for real and imaginary
complex components - NVIDIA Quadro NV4x family has stereo capability
- Supports up to 8 buffers
- GL_FRONT_LEFT, GL_BACK_LEFT, GL_FRONT_RIGHT,
GL_BACK_RIGHT - GL_AUX0, GL_AUX1, GL_AUX2, GL_AUX3
- Allows parallel processing of two 2D images
- 2 complex images ? 2 components ? (src dst
buffers)
Sumanaweera2005
40Incorporating vertex processors and rasterizer
- In previous implementations, vertex processor is
basically idle - Transforms only 4 vertices for each pass
- Indices and weights stored in textures
- Enhancement
- Fragments can be grouped together into contiguous
chunks within which indices and angular argument
of the weight vary linearly
Sumanaweera2005
41Linear interpolation
- Note the linear variation in indices used to form
F0, F1, F2, F3 - Can exploit interpolation hardware to eliminate
texture accesses for index lookups - Also works for angular arguments of weights
Sumanaweera2005
42Caveats
- In early stages, must draw many small rectangles
- Vertex processors loaded more heavily than
fragment processors - Worse performance than single-quad approach with
index and weight lookups - sincos evaluation in hardware to obtain weights
- Probably enough arithmetic to hide memory latency
Sumanaweera2005
43Load balancing
Quads at step
1
2
log N
- Switch between approaches at some stage decided
at runtime - Auto-tuning code
Index lookups
Interpolated indices
Sumanaweera2005
44Performance
Sumanaweera2005
45Govindaraju et al. 2006
- Emphasis is on cache-efficiency
- Uses Stockham FFT formulation
- Like Moreland2003, does not perform initial
reordering step - Simpler indexing functions
- Supports 1D FFTs
- Data stored in 2D arrays
- Better exploitation of GPU cache designed for 2D
access patterns - Makes larger problem sizes possible
- Divides work up into tiles to improve cache reuse
Govindaraju2006
46Performance
- Problem size - 4 million complex numbers
- On NVIDIA 7800GTX, 64?64 tile size performs best
- 6.1 GFLOPS on NVIDIA 7900 GTX
- 32 GB/s memory bandwidth
- 4-5X faster than two 3.6GHz Xeon processors
w/Hyperthreading and two dual-core Opteron 280s
running threaded FFTW - Claims to be 3X faster than libgpufft from
Stanford
Govindaraju2006
47Other implementations
- CUFFT
- CUDAs black-box FFT implementation
- Mark Harris reports 50 GFLOPS on 8800 GTX for
more than 4096 elements in 1D - GPU FFT
- Brook implementation
- T. Jansen, B von Rymon-Lipinski, N. Hanssen, E.
Keeve, Fourier volume rendering on the GPU using
a split-stream FFT, Proc. Vision, Modeling, and
Visualization, pp. 395-403, 2004. - 9X faster on ATI Radeon 9800 than 2.6GHz Pentium
4 for 10242 real image
48Other implementations
- T. Schiwietz, T. Chang, P. Speier, and R.
Westermann, MR image reconstruction using the
GPU, Proc. SPIE 6142, pp. 1279-1290, 2006. - GPU-FFT on ATI X1600 XT vs. 3.0GHz Pentium 4
- 1.7X faster than FFTW for 10242 complex image
- 1.9X faster for 5122 complex image
- O. Fialka and M. Cadik, FFT and convolution
performance in image filtering on GPU, Proc. of
Information Visualization, 2006. - NVIDIA GeForce 6600GT vs 2.6GHz Intel Celeron D
- Convolution 3.4X faster for 2562 complex image
49GPU FFT shootout
- Tested on NVIDIA GeForce 8800 GTX
- All tests include bus traffic to GPU
- No readback
- Complex inputs
- GFLOPS derived from estimated FFTW operation
count - 5 N log2(N) / time
- N is product of dimension sizes
50GPU FFT shootout
? Dimension not supported
51References
- Spitzer, Implementing a GPU-Efficient FFT,
SIGGRAPH GPGPU Course, 2003. - K. Moreland and E. Angel, The FFT on a GPU,
Graphics Hardware, 2003. - J. Mitchell, M. Ansari, E. Hart, Advanced Image
Processing with DirectX 9 Pixel Shaders, in
ShaderX2 - Shader Programming Tips and Tricks,
2003. - T. Sumanaweera and D. Liu, Medical Image
Reconstruction with the FFT, GPU Gems 2, pp.
765-784, 2005. - N. Govindaraju, S. Larsen, J. Gray, and D.
Manocha, A memory model for scientific
algorithms on graphics processors, Proc.
Supercomputing, p. 6, 2006.