Fast Fourier Transforms on GPUs - PowerPoint PPT Presentation

1 / 51

About This Presentation

Title:

Fast Fourier Transforms on GPUs

Description:

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL. Fast ... Interpolation ... Can exploit interpolation hardware to eliminate texture accesses for index lookups ... – PowerPoint PPT presentation

Number of Views:160

Avg rating:3.0/5.0

Slides: 52

Provided by: cory154

Category:

more less

Transcript and Presenter's Notes

Title: Fast Fourier Transforms on GPUs

1
Fast Fourier Transforms on GPUs

Cory Quammen
April 11, 2007

2
Overview

Applications exploiting DFT
Discrete Fourier transform
FFT algorithm basics
Papers
Performance comparison

3
Applications

Signal processing
Waveform analysis
Band-pass filtering
Signal detection
Convolution
Signal filtering
Polynomial multiplication
Big integer multiplication
Interpolation
Padding with zeros in frequency domain implies
increased sampling rate in spatial (time) domain

4
Applications

Solving PDEs
Finding nth derivative in spatial (time) domain
equivalent to multiplying by in in frequency
domain
Image compression
Analyze parts of an image, discarding
high-frequency components
Medical image reconstruction
MRI and ultrasound
Volume rendering
Fourier slice projection theorem

5
Discrete Fourier Transform

Transforms discrete, evenly-spaced samples from
spatial (or time) domain to frequency domain
Expresses input signal as a superposition of
sinusoids
Maps complex vectors to complex vectors
All frequencies in original data are fully
represented in transformed data

6
Discrete Fourier Transform

Linear
Invertible - original data can be fully recovered
Important - discrete Fourier transform (DFT) is a
mathematical concept, not an algorithm

7
DFT definition

Forward transform

Inverse transform

Difference is a change of sign and scale factor
8
Multi-dimensional DFT

Nested summations over each dimension
Computationally, simplifies to row-column
algorithm
Apply set of 1D DFTs in one dimension, then
subsequent sets of 1D DFTs in remaining dimensions

9
Interpretation

Fourier transform often characterized by two
things
Amplitude
Phase

10
F(impulse)
11
F(cos wave)
12
F(white noise)
13
F(Gaussian)
14
Naïve DFT computation

Implement directly
Complexity O(N2)
Much too slow!

15
Cooley-Tukey algorithm

Developed by Gauss in 1805, but forgotten and
developed independently by Cooley and Tukey, and
others before them
Divide-and-conquer algorithm reduces complexity
to O(N log N)

16
Faster DFT computation

Apply divide-and-conquer
Split input into 2 subarrays
Even-index subarray
Odd-index subarray
Take DFT of subarrays
Combine results
Yields complexity O(N log N)

17
Danielson-Lanczos Lemma
Separate summation into even-odd parts
Notice
Moreland2003
18
Therefore
Recall
Moreland2003
19
Recursive implementation

Recursive formulation of FFT algorithm. Very
inefficient because
of data reordering and replication.
function Fx myfft(fx)
N length(fx)
if (N 1)
Fx fx
return
end
Coefficient
Wn exp(-i2pi/N)
even myfft(fx(12N-1))
odd myfft(fx(22N))
exponents (0(N/2-1))'
firstHalf even (Wn.exponents.odd)
secondHalf even - (Wn.exponents.odd)
Fx firstHalf secondHalf

20
Recursive implementation caveats

Requires stack space
No stack on GPU
Array reordering at every step
Too much memory traffic

21
Iterative implementation

Reorder input
Details to follow
Start from smallest partition size first
Fourier transform of length-one vector is
identity
Start with partition size p 2
Combine results
Double partition size p
Repeat

22
Input reordering
Reverse the bits!
23
Iterative computation
Unweighted
Weighted
24
Spitzer 2003

Implemented 1D FFT
Bit reversal is a lookup operation
Indices and weights for each pass are precomputed
and stored in scramble texture
Pad real inputs to complex by setting imaginary
component to 0

Spitzer2003
25
Implementation and results

OpenGL and Cg
1D textures
Single 2048 FFT

Spitzer2003
26
Performance limitations

Vertex processors not used (5 GFLOPS of unused
processing power)
Many OpenGL calls for ping-ponging
Each pass must finish before next one starts
1D textures - not optimized in hardware
Additional memory bandwidth for index and weight
lookups
Vector ALUs not exploited

Spitzer2003
27
Performance improvements

Packing complex numbers into 1 or more textures
RG - first number, BA - second number
Two textures RGBA1 - real, RGBA2 - imaginary
Batching
Fill 2D texture with 1D inputs
Keep graphics pipeline filled
Cuts ratio of driver calls to computation
substantially

Spitzer2003
28
Mitchell et al. 2003

Discusses FFT in context of image processing and
DirectX 9
Similar arrangement and algorithm to
Spitzer2003
Handles 2D FFTs

Mitchell2003
29
FFT over columns
// Horizontal scramble first SetSurfaceAsTexture(
surfaceA) // Input image SetRenderTarget(
surfaceB) LoadShader( HorizontalScramble)
SetTexture( ButterflyTexturelog2(width))
DrawQuad() // Horizontal butterflies
LoadShader( HorizontalButterfly) SetTexture(
ButterflyTexturelog2(width)) for ( i 0 i lt
log2( width) i) SwapSurfacesAandB()
SetShaderConstant( pass, i/log2(width))
DrawQuad()
Mitchell2003
30
FFT over rows
// Vertical scramble SwapSurfacesAandB()
LoadShader( VerticalScramble) SetTexture(
ButterflyTexturelog2(height)) DrawQuad()
// Vertical butterflies LoadShader(
VerticalButterfly) SetTexture(
ButterflyTexturelog2(height)) for ( i 0 i
lt log2( height) i) SwapSurfacesAandB()
SetShaderConstant( pass, i/log2(height))
DrawQuad()
Mitchell2003
31
Moreland and Angel 2003

Goal was to enable a real-time filtering stage
during image generation
Developed an indexing scheme to eliminate input
reordering
Optimized for real input signals

Moreland2003
32
Optimizations for real input

Most input images are real
Symmetry in Fourier transform of real signals
No need to add 0-valued imaginary components
Uses half the memory

Moreland2003
33
Optimizations for real input

Can pack two reals into a single complex number
and apply standard complex-gtcomplex FFT
Treat one half of 1D input as real, other half as
imaginary
Must untangle results afterwards

Moreland2003
34
Conceptual organization
1D Organization
2D Organization
Real Values
Imaginary Values

No data reordering
2D packing
Columns 0 and M/2 contain reals
Other columns can be paired up as real and
imaginary components

Real Values
Imaginary Values
Moreland2003
35
Program flow
Moreland2003
36
Implementation

OpenGL and Cg
NVIDIA GeForce FX 5800 Ultra
Ping-ponging

Moreland2003
37
Results

Arithmetic-bound
2.5 GFLOPS
3.4 GB/sec memory bandwidth
1024x1024 convolution
FFTW on 1.7GHz Xeon took 0.625 seconds
Their implementation took 2.7 seconds (with data
transfer)

Moreland2003
38
Sumanaweera and Liu 2005

Can operate on two complex inputs simultaneously
Targets 2D images
Columns first, then rows
Auto-tunes to balance load among
Vertex processors
Rasterizer
Fragment processors

Sumanaweera2005
39
Compute buffers

Uses one pbuffer with several draw buffers
Use separate buffers for real and imaginary
complex components
NVIDIA Quadro NV4x family has stereo capability
Supports up to 8 buffers
GL_FRONT_LEFT, GL_BACK_LEFT, GL_FRONT_RIGHT,
GL_BACK_RIGHT
GL_AUX0, GL_AUX1, GL_AUX2, GL_AUX3
Allows parallel processing of two 2D images
2 complex images ? 2 components ? (src dst
buffers)

Sumanaweera2005
40
Incorporating vertex processors and rasterizer

In previous implementations, vertex processor is
basically idle
Transforms only 4 vertices for each pass
Indices and weights stored in textures
Enhancement
Fragments can be grouped together into contiguous
chunks within which indices and angular argument
of the weight vary linearly

Sumanaweera2005
41
Linear interpolation

Note the linear variation in indices used to form
F0, F1, F2, F3
Can exploit interpolation hardware to eliminate
texture accesses for index lookups
Also works for angular arguments of weights

Sumanaweera2005
42
Caveats

In early stages, must draw many small rectangles
Vertex processors loaded more heavily than
fragment processors
Worse performance than single-quad approach with
index and weight lookups
sincos evaluation in hardware to obtain weights
Probably enough arithmetic to hide memory latency

Sumanaweera2005
43
Load balancing
Quads at step
1
2

log N

Switch between approaches at some stage decided
at runtime
Auto-tuning code

Index lookups
Interpolated indices
Sumanaweera2005
44
Performance
Sumanaweera2005
45
Govindaraju et al. 2006

Emphasis is on cache-efficiency
Uses Stockham FFT formulation
Like Moreland2003, does not perform initial
reordering step
Simpler indexing functions
Supports 1D FFTs
Data stored in 2D arrays
Better exploitation of GPU cache designed for 2D
access patterns
Makes larger problem sizes possible
Divides work up into tiles to improve cache reuse

Govindaraju2006
46
Performance

Problem size - 4 million complex numbers
On NVIDIA 7800GTX, 64?64 tile size performs best
6.1 GFLOPS on NVIDIA 7900 GTX
32 GB/s memory bandwidth
4-5X faster than two 3.6GHz Xeon processors
w/Hyperthreading and two dual-core Opteron 280s
running threaded FFTW
Claims to be 3X faster than libgpufft from
Stanford

Govindaraju2006
47
Other implementations

CUFFT
CUDAs black-box FFT implementation
Mark Harris reports 50 GFLOPS on 8800 GTX for
more than 4096 elements in 1D
GPU FFT
Brook implementation
T. Jansen, B von Rymon-Lipinski, N. Hanssen, E.
Keeve, Fourier volume rendering on the GPU using
a split-stream FFT, Proc. Vision, Modeling, and
Visualization, pp. 395-403, 2004.
9X faster on ATI Radeon 9800 than 2.6GHz Pentium
4 for 10242 real image

48
Other implementations

T. Schiwietz, T. Chang, P. Speier, and R.
Westermann, MR image reconstruction using the
GPU, Proc. SPIE 6142, pp. 1279-1290, 2006.
GPU-FFT on ATI X1600 XT vs. 3.0GHz Pentium 4
1.7X faster than FFTW for 10242 complex image
1.9X faster for 5122 complex image
O. Fialka and M. Cadik, FFT and convolution
performance in image filtering on GPU, Proc. of
Information Visualization, 2006.
NVIDIA GeForce 6600GT vs 2.6GHz Intel Celeron D
Convolution 3.4X faster for 2562 complex image

49
GPU FFT shootout

Tested on NVIDIA GeForce 8800 GTX
All tests include bus traffic to GPU
No readback
Complex inputs
GFLOPS derived from estimated FFTW operation
count
5 N log2(N) / time
N is product of dimension sizes

50
GPU FFT shootout
? Dimension not supported
51
References

Spitzer, Implementing a GPU-Efficient FFT,
SIGGRAPH GPGPU Course, 2003.
K. Moreland and E. Angel, The FFT on a GPU,
Graphics Hardware, 2003.
J. Mitchell, M. Ansari, E. Hart, Advanced Image
Processing with DirectX 9 Pixel Shaders, in
ShaderX2 - Shader Programming Tips and Tricks,
2003.
T. Sumanaweera and D. Liu, Medical Image
Reconstruction with the FFT, GPU Gems 2, pp.
765-784, 2005.
N. Govindaraju, S. Larsen, J. Gray, and D.
Manocha, A memory model for scientific
algorithms on graphics processors, Proc.
Supercomputing, p. 6, 2006.