Title: FFT in Hardware and Software
1. FFT in Hardware and Software
2. Background
- Core Algorithm
- Original algorithm: the DFT (Discrete Fourier Transform), O(n^2) complexity
- New algorithm: the FFT (Fast Fourier Transform), O(n log2(n)), depending on implementation.
3. DFT Computation
- A summation over the whole input array for every single element in the output array.
- A VERY computationally inefficient algorithm to implement.
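The summation described above can be sketched directly. This is a minimal illustration, assuming the usual forward-transform convention X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/N); the slides do not state a sign convention, and the function name `dft` is chosen here for illustration.

```python
import cmath

def dft(x):
    """Direct DFT: each output element is a sum over the whole input array."""
    n = len(x)
    return [
        # n multiply-adds for each of the n outputs gives n * n operations.
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# A constant input has all of its energy in output element 0.
spectrum = dft([1, 1, 1, 1])
print([round(abs(v), 6) for v in spectrum])
```

The nested iteration (one pass over the input for each output) is what makes the cost grow as N^2.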
4. FFT Computation
- A much more computationally efficient algorithm
- Works using the divide and conquer principle.
- First developed by Cooley and Tukey in 1965!
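The divide and conquer idea can be sketched as a recursive radix-2 FFT: split the input into even- and odd-indexed halves, transform each half, then combine. This is an illustrative sketch rather than the implementation profiled later in the slides; it assumes the input length is a power of two and uses the exp(-2*pi*i*k/N) sign convention.

```python
import cmath

def fft_recursive(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_recursive(x[0::2])   # transform of the even-indexed samples
    odd = fft_recursive(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # W (twiddle) factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

print([round(abs(v), 6) for v in fft_recursive([1, 1, 1, 1, 0, 0, 0, 0])])
```

Each level of recursion does N/2 combine steps, and there are log2(N) levels, which is where the N log2(N) growth comes from.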
5. DFT vs. FFT (Number of Operations)

| Problem Size (N) | Standard DFT (smaller is better) | FFT (smaller is better) | FFT as % of DFT (smaller is better) |
|---|---|---|---|
| 2 | 4 | 1 | 25 |
| 4 | 16 | 4 | 25 |
| 8 | 64 | 12 | 19 |
| 16 | 256 | 32 | 13 |
| 32 | 1024 | 80 | 8 |
| 64 | 4096 | 192 | 5 |
| 128 | 16384 | 448 | 3 |
| 256 | 65536 | 1024 | 2 |
| 512 | 262144 | 2304 | 1 |
| 1024 | 1048576 | 5120 | <1 |
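The counts in the table appear to follow N^2 for the DFT and (N/2) * log2(N) for the FFT. That formula is inferred from the tabulated values rather than stated on the slide, so the sketch below should be read as a consistency check, with `operation_counts` a name chosen here.

```python
def operation_counts(n):
    """Return (dft_ops, fft_ops, fft_percent_of_dft) for a power-of-two n."""
    stages = n.bit_length() - 1       # log2(n) when n is a power of two
    dft_ops = n * n                   # one term per input, per output
    fft_ops = (n // 2) * stages       # n/2 butterflies in each of log2(n) stages
    return dft_ops, fft_ops, 100.0 * fft_ops / dft_ops

for n in (2, 8, 64, 1024):
    dft_ops, fft_ops, percent = operation_counts(n)
    print(f"N={n}: DFT={dft_ops}, FFT={fft_ops}, {percent:.2f}% of DFT")
```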
6. DFT vs. FFT
7. FFT Butterfly Operations
- Butterfly arrangement of computations
- Repeated on successive pairs of input data
- Then half as many times on alternating pairs
- Then half again as many times on every fourth element
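One way to read the pairing pattern above is as a list of index pairs per stage: neighbours first, then elements two apart, then four apart. The sketch below enumerates those pairs for a power-of-two size; `butterfly_pairs` is an illustrative helper, and the indices refer to positions in the reordered working array, not to the original sample numbers.

```python
def butterfly_pairs(n):
    """For each stage, list the (top, bottom) positions each butterfly combines."""
    stages = []
    span = 1                           # distance between the two inputs of a butterfly
    while span < n:
        pairs = [
            (start + j, start + j + span)
            for start in range(0, n, 2 * span)   # one group of butterflies
            for j in range(span)                 # butterflies within the group
        ]
        stages.append(pairs)
        span *= 2                      # the distance doubles at every stage
    return stages

for number, pairs in enumerate(butterfly_pairs(8), start=1):
    print(f"stage {number}: {pairs}")
```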
8. The Butterfly
- Simple operations repeated many times
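A single radix-2 butterfly can be written in a few lines: one multiplication by the W factor, then one addition and one subtraction. The function name and argument order below are illustrative choices.

```python
def butterfly(a, b, w):
    """One radix-2 butterfly on complex inputs a and b with W factor w."""
    t = w * b              # the multiplication by the W factor
    return a + t, a - t    # the addition (and its mirrored subtraction)

top, bottom = butterfly(1 + 0j, 1 + 0j, 1 + 0j)
print(top, bottom)
```

An N-point radix-2 FFT repeats this operation (N/2) * log2(N) times with different inputs and W factors.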
9. 8-point FFT Demonstration: The Entire Calculation
[Figure: 8-point FFT butterfly diagram. Input array: x0, x4, x2, x6, x1, x5, x3, x7. Output: X0, X1, X2, X3, X4, X5, X6, X7. Legend: multiplication by W factor; addition.]
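The input order in the diagram (x0, x4, x2, x6, x1, x5, x3, x7) is the bit-reversed ordering of the indices 0 to 7. A short sketch that reproduces it, with `bit_reversed_order` a name chosen here and the size assumed to be a power of two greater than one:

```python
def bit_reversed_order(n):
    """Return the input index feeding each position of the reordered array."""
    bits = n.bit_length() - 1          # number of index bits, log2(n)
    return [
        int(format(i, f"0{bits}b")[::-1], 2)   # reverse the binary digits of i
        for i in range(n)
    ]

print(["x%d" % i for i in bit_reversed_order(8)])
```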
10-21. 8-point FFT Demonstration
[Figure: twelve further slides showing the 8-point butterfly diagram of slide 9, with the same input array (x0, x4, x2, x6, x1, x5, x3, x7), output (X0 to X7) and legend (multiplication by W factor; addition).]
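The demonstration can be followed numerically by recording the working array after the reordering and after each of the three butterfly stages. This is a sketch under stated assumptions: the exp(-2*pi*i*k/N) sign convention, a power-of-two length, and a test input chosen here. The slides do not give the data values used in the animation.

```python
import cmath

def fft_stages(x):
    """Return the working array after reordering and after every butterfly stage."""
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversed reordering gives the x0, x4, x2, x6, ... input order.
    data = [complex(x[int(format(i, f"0{bits}b")[::-1], 2)]) for i in range(n)]
    snapshots = [list(data)]
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # W factor
                t = w * data[start + j + span]                  # multiplication
                a = data[start + j]
                data[start + j] = a + t                         # addition
                data[start + j + span] = a - t
        snapshots.append(list(data))
        span *= 2
    return snapshots

for stage, values in enumerate(fft_stages([1, 1, 1, 1, 0, 0, 0, 0])):
    print(stage, [f"{v:.3f}" for v in values])
```

The last snapshot holds X0 to X7 in natural order; the earlier ones correspond to the columns of the diagram.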
22. Why Hardware?
- Even more speed for FFT
- Extremely parallelizable
- A whole layer can be done in two FPGA clock cycles
  - 1 multiply cycle
  - 1 add cycle
  - (Assuming sufficient multipliers)
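A rough sizing of the claim above: if every butterfly in a layer runs at once, an N-point radix-2 FFT needs N/2 butterfly units (each with its own complex multiplier) and log2(N) layers of two cycles each. The helper names are illustrative, and the estimate ignores input/output and pipeline latency.

```python
def butterflies_per_layer(n):
    """Butterfly units (complex multipliers) needed to run one layer at once."""
    return n // 2

def fpga_fft_cycles(n, cycles_per_layer=2):
    """Clock cycles for all layers: 1 multiply cycle + 1 add cycle per layer."""
    layers = n.bit_length() - 1        # log2(n) for a power-of-two n
    return layers * cycles_per_layer

for n in (8, 32, 256):
    print(n, butterflies_per_layer(n), fpga_fft_cycles(n))
```

For the 8-point example this gives 4 parallel butterflies and 6 cycles; the multiplier count is what grows quickly with N.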
23. Hardware Problems
- Complexity
- Input speed
- Output speed
- If the FPGA takes 24.4 ns to compute but it takes 20 µs to transfer the input data, what gain is there?
  - i.e. 24.4 ns + 20 µs + 20 µs ≈ 40 µs!
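The arithmetic on this slide can be checked in a few lines. The sketch assumes the second 20 µs term is the output transfer, which the slide implies but does not label.

```python
NS_PER_US = 1000.0

compute_ns = 24.4                # FPGA computation time quoted on the slide
input_ns = 20 * NS_PER_US        # input transfer time quoted on the slide
output_ns = 20 * NS_PER_US       # assumption: second 20 µs term is the output transfer

total_ns = compute_ns + input_ns + output_ns
compute_share = compute_ns / total_ns

print(f"total: {total_ns / NS_PER_US:.4f} µs")
print(f"share spent computing: {compute_share:.4%}")
```

With these figures the computation is well under a tenth of a percent of the total, so the transfers dominate completely.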
24. Mitigation of Hardware Problems
- Use a faster bus
  - AMD Opteron's HyperTransport
  - 20.8 GB/s (166.4 Gb/s) per link (v. 3)
- Modules that fit into an AMD 64-bit Opteron socket
  - http://www.drccomputer.com/pages/modules.html - Xilinx-based module
  - http://www.xtremedatainc.com/xd1000_brief.html - Altera-based module
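To get a feel for what the quoted link rate means for FFT data, the sketch below estimates the best-case time to move one block of samples at 20.8 GB/s. The 8-byte sample size (a complex value as two 32-bit numbers) is an assumption made here, and the estimate ignores latency and protocol overhead, so real transfers would be slower.

```python
LINK_BYTES_PER_S = 20.8e9    # per-link HyperTransport figure quoted on the slide
BYTES_PER_SAMPLE = 8         # assumption: complex sample stored as two 32-bit values

def transfer_time_ns(points, bytes_per_sample=BYTES_PER_SAMPLE,
                     rate=LINK_BYTES_PER_S):
    """Best-case time in nanoseconds to move `points` samples over one link."""
    return points * bytes_per_sample / rate * 1e9

# 20.8 GB/s is the same figure as 166.4 Gb/s (8 bits per byte).
print(LINK_BYTES_PER_S * 8 / 1e9, "Gb/s")
for n in (8, 256, 1024):
    print(n, "points:", round(transfer_time_ns(n), 1), "ns")
```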
25. Mitigation of Hardware Problems
- Put the FPGA on the die with the DSP
  - Need silicon vendor support
  - FPGA can access memory on a very wide bus (e.g. 128 bits per cycle)
- Implement the entire project in FPGA
  - Time consuming to program
  - Possibly insufficient room on the FPGA
26. 8-point FFT Demonstration: In Hardware
[Figure: 8-point FFT butterfly diagram. Input array: x0, x4, x2, x6, x1, x5, x3, x7. Output: X0, X1, X2, X3, X4, X5, X6, X7. Legend: multiplication by W factor; addition.]
27-29. 8-point FFT Demonstration: In Hardware
[Figure: three further slides showing the 8-point butterfly diagram of slide 26, with the same input array, output and legend.]
30. Why Not Software?
- Each butterfly must be done sequentially
- Only slight parallelism enabled by a DSP like the TigerSHARC
- Each butterfly can be done in 2 cycles (after optimization).
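If butterflies run one after another at 2 cycles each, a lower bound on the software cycle count is 2 * (N/2) * log2(N). The sketch below computes that bound; it is an idealised estimate that ignores loop overhead, memory access and the slight parallelism mentioned above, and `software_fft_cycles` is a name chosen here.

```python
def software_fft_cycles(n, cycles_per_butterfly=2):
    """Idealised cycle count when every butterfly is executed sequentially."""
    stages = n.bit_length() - 1        # log2(n) for a power-of-two n
    butterflies = (n // 2) * stages    # total butterflies in the transform
    return butterflies * cycles_per_butterfly

for n in (8, 32, 256):
    print(n, software_fft_cycles(n))
```

For 8 points this is 24 cycles, against 6 cycles for a fully parallel layer-at-a-time design, and the gap widens by a factor of N/2 as N grows. DSP and FPGA clock periods differ, so cycle counts alone do not give the time ratio.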
31. Results of Testing
- Linear Profiling of FFT Algorithm in C

| Stage | Cycle count (8-point) | Cycle count (32-point) | Cycle count (256-point) | Time (8-point) | Time (32-point) | Time (256-point) |
|---|---|---|---|---|---|---|
| Initialization | 21 | 25 | 25 | 35.07 ns | 41.75 ns | 41.75 ns |
| Computation | 1135 | 6922 | 174222 | 1.895 µs | 11.559 µs | 290.950 µs |
| Butterfly | 91 | 91 | 91 | 151.97 ns | 151.97 ns | 151.97 ns |
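The cycle counts and times in the profiling table imply a fixed clock period, which can be recovered by dividing one by the other. The sketch below does this for four of the table entries; the derived clock rate is an inference from the table, not a figure given on the slides.

```python
# (cycle count, time in ns) pairs copied from the profiling table
measurements = {
    "Initialization, 8-point": (21, 35.07),
    "Initialization, 32-point": (25, 41.75),
    "Butterfly": (91, 151.97),
    "Computation, 256-point": (174222, 290950.0),
}

periods_ns = {name: t / cycles for name, (cycles, t) in measurements.items()}
mean_period_ns = sum(periods_ns.values()) / len(periods_ns)
clock_mhz = 1000.0 / mean_period_ns

for name, period in periods_ns.items():
    print(f"{name}: {period:.4f} ns per cycle")
print(f"implied clock: {clock_mhz:.1f} MHz")
```

All four entries give about 1.67 ns per cycle, so any table entry can be converted between cycles and time with that factor.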
32. Results of Testing
- Profiling of VHDL on FPGA
  - Butterfly takes 24.377 ns to execute
  - 62% is computational, 38% is routing on FPGA
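The split of the FPGA butterfly time can be turned into absolute figures and compared with the 151.97 ns butterfly measured in the C profiling. The sketch reads the 62 and 38 as percentages of the 24.377 ns total, which is the natural reading but is labeled here as an assumption.

```python
fpga_butterfly_ns = 24.377       # FPGA butterfly time from the VHDL profiling
software_butterfly_ns = 151.97   # butterfly time from the C profiling table

# Assumption: 62 and 38 are percentages of the total butterfly time.
compute_ns = 0.62 * fpga_butterfly_ns
routing_ns = 0.38 * fpga_butterfly_ns
speedup = software_butterfly_ns / fpga_butterfly_ns

print(f"computation: {compute_ns:.2f} ns, routing: {routing_ns:.2f} ns")
print(f"FPGA butterfly is about {speedup:.1f}x faster than the C butterfly")
```

This compares a single butterfly only; it does not include the additional gain from running a whole layer of butterflies in parallel.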
33. Product Offerings
- Most DSP vendors
- Many FPGA vendors (IP: Intellectual Property)
- Microcontroller vendors (e.g. Blackfin)
- FFTW: The Fastest Fourier Transform in the West
- AMD Core Math Library
- Intel library
- Highly optimized for the expected hardware
34. Published Results
- The Radix 4 version delivers a 1 K points complex processing time of 25 microseconds at 200-MHz system speeds and uses only about 10 percent of the resources in a mid-range Stratix device. The Radix 2 is half the size of the Radix 4 and offers a 1 K points complex processing time of 50 microseconds at 200-MHz system speeds. Additional versions of the new cores are under development. [6]
FFT IP Core Published Results [7]

| FFT/IFFT length | Texas Instruments C6713 | Single 4DSP FFT core (smaller is better) | Quad 4DSP FFT core (smaller is better) |
|---|---|---|---|
| 256 | 12.3 µs | 3.68 µs | 920 ns |
| 512 | 27.3 µs | 6.24 µs | 1.56 µs |
| 1024 | 60.2 µs | 11.4 µs | 2.85 µs |
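The published figures can be reduced to speedup factors relative to the C6713 DSP. The sketch below only restates the table values (in microseconds) and divides them; `speedups` is an illustrative helper name.

```python
# FFT/IFFT length -> (TI C6713, single 4DSP core, quad 4DSP core), in microseconds
results_us = {
    256: (12.3, 3.68, 0.920),
    512: (27.3, 6.24, 1.56),
    1024: (60.2, 11.4, 2.85),
}

def speedups(length):
    """Return (single-core, quad-core) speedup over the DSP for one length."""
    dsp, single, quad = results_us[length]
    return dsp / single, dsp / quad

for length in results_us:
    single_x, quad_x = speedups(length)
    print(f"{length}: single {single_x:.2f}x, quad {quad_x:.2f}x")
```

In every row the quad-core time is one quarter of the single-core time, and the advantage over the DSP grows with transform length.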
35. References
- [1] Signals, Systems, and Transforms
- [2] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297-301 (1965).
- [3] http://www.drccomputer.com/pages/modules.html - Xilinx-based module
- [4] http://www.xtremedatainc.com/xd1000_brief.html - Altera-based module
- [5] http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_2353,00.html
- [6] http://www.us.design-reuse.com/news/news5650.html
- [7] http://www.4dsp.com/fft.htm