Slides: 36
Provided by: Electr53
1
FFT in Hardware and Software
2
Background
  • Core Algorithm
  • Original Algorithm, the DFT, O(n^2) complexity
  • New Algorithm, the FFT (Fast Fourier Transform),
    O(n log2(n)) depending on implementation.

3
DFT Computation
  • A summation over the whole input array for every
    single element in the output array.
  • A VERY computationally inefficient algorithm to
    implement.
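
The summation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the presenters' code; the function name `dft` and the use of the standard `cmath` module are choices made here.

```python
import cmath

def dft(x):
    """Direct DFT: each output element sums over the whole input array."""
    n = len(x)
    out = []
    for k in range(n):              # one pass per output element
        total = 0j
        for t in range(n):          # full sweep over the input array
            total += x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
        out.append(total)
    return out

# The two nested loops over n elements give the O(n^2) cost.
spectrum = dft([1, 0, 0, 0])        # a unit impulse has a flat spectrum
```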

4
FFT Computation
  • A much more computationally efficient algorithm
  • Works using the divide and conquer principle.
  • First developed by Cooley and Tukey in 1965!
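
As a rough sketch of the divide-and-conquer idea, a recursive radix-2 FFT splits the input into even- and odd-indexed halves, transforms each half, and combines the results. The version below is one common textbook formulation, assumed here for illustration; it requires a power-of-two length, and the name `fft` is chosen for this example.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])             # transform of even-indexed samples
    odd = fft(x[1::2])              # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # the "W factor" (twiddle)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

spectrum = fft([0, 1, 2, 3])
```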

5
DFT vs. FFT (Number of Operations)
(Smaller is better in every column.)

Problem Size (N)   Standard DFT   FFT     % of DFT
2                  4              1       25
4                  16             4       25
8                  64             12      19
16                 256            32      13
32                 1024           80      8
64                 4096           192     5
128                16384          448     3
256                65536          1024    2
512                262144         2304    1
1024               1048576        5120    <1
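
The figures in this table are consistent with counting N^2 operations for the DFT and (N/2)·log2(N) for the FFT, that is, one operation per butterfly. That counting rule is inferred here from the numbers, not stated on the slide. A short sketch that reproduces the rows under that assumption:

```python
def op_counts(n):
    """Return (DFT ops, FFT ops, FFT as a percentage of DFT) for a power-of-two n."""
    stages = n.bit_length() - 1          # log2(n) when n is a power of two
    dft_ops = n * n                      # assumed model: N^2
    fft_ops = (n // 2) * stages          # assumed model: (N/2) * log2(N)
    return dft_ops, fft_ops, 100.0 * fft_ops / dft_ops

for size in (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024):
    dft_ops, fft_ops, pct = op_counts(size)
    print(f"{size:5d} {dft_ops:8d} {fft_ops:5d} {pct:6.2f}")
```
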
6
DFT vs. FFT
7
FFT Butterfly Operations
  • Butterfly arrangement of computations
  • Repeated on successive pairs of input data
  • Then half as many times on alternating pairs
  • Then half again as many times on every fourth
    element

8
The Butterfly
  • Simple operations repeated many times
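
A single radix-2 butterfly can be sketched as one multiplication by the W factor followed by an addition and a subtraction. The function name and argument order below are illustrative choices, not taken from the slides.

```python
def butterfly(a, b, w):
    """Radix-2 butterfly: combine inputs a and b using twiddle factor w."""
    t = w * b               # the single multiplication
    return a + t, a - t     # the add (and subtract) step

top, bottom = butterfly(1 + 0j, 1j, -1j)
```

The whole transform is this small operation repeated on different pairs with different W factors.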

9
8-point FFT Demonstration: The Entire Calculation
[Figure: 8-point FFT butterfly diagram. Input array in the order x0, x4, x2, x6, x1, x5, x3, x7; output X0 through X7. Legend: multiplication by W factor; addition.]
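
The input order shown in the diagram (x0, x4, x2, x6, x1, x5, x3, x7) is the bit-reversed ordering of the indices 0 to 7, which is the usual input arrangement for an in-place decimation-in-time FFT. A small sketch that generates it, with a helper name chosen for this example:

```python
def bit_reversed_order(n):
    """Indices 0..n-1 with their binary digits reversed (n a power of two)."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]

order = bit_reversed_order(8)   # [0, 4, 2, 6, 1, 5, 3, 7]
```
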
22
Why Hardware?
  • Even more speed for FFT
  • Extremely parallelizable
  • A whole layer can be done in two FPGA clock
    cycles
  • 1 multiply cycle
  • 1 add cycle
  • (Assuming sufficient multipliers)
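
Under the slide's assumption of sufficient multipliers, an idealized cycle count for a fully parallel N-point FFT is two cycles per layer times log2(N) layers. The sketch below is only that back-of-the-envelope model; it ignores I/O, routing, and pipelining, and the function name is hypothetical.

```python
def hardware_fft_cycles(n, cycles_per_layer=2):
    """Idealized cycles for an n-point FFT when every layer runs fully in parallel."""
    layers = n.bit_length() - 1          # log2(n) layers for a power-of-two n
    return layers * cycles_per_layer     # 1 multiply cycle + 1 add cycle per layer

cycles_8 = hardware_fft_cycles(8)        # 3 layers * 2 cycles = 6
```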

23
Hardware Problems
  • Complexity
  • Input speed
  • Output speed
  • If the FPGA takes 24.4 ns but takes 20 µs to
    transfer the input data, what gain is there?
  • i.e. 24.4 ns + 20 µs + 20 µs ≈ 40 µs!
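
The arithmetic behind this point can be made explicit. The sketch assumes the two 20 µs terms are the input and output transfers and that they do not overlap with the computation; both are readings of the slide, not stated facts.

```python
COMPUTE_NS = 24.4            # FPGA computation time from the slide
TRANSFER_IN_NS = 20_000.0    # 20 us input transfer
TRANSFER_OUT_NS = 20_000.0   # 20 us output transfer (assumed equal to the input)

def total_latency_ns(compute_ns, transfer_in_ns, transfer_out_ns):
    """End-to-end time when transfers and computation run back to back."""
    return compute_ns + transfer_in_ns + transfer_out_ns

total_ns = total_latency_ns(COMPUTE_NS, TRANSFER_IN_NS, TRANSFER_OUT_NS)
compute_share = COMPUTE_NS / total_ns    # fraction of the time spent computing
print(f"total = {total_ns / 1000:.4f} us, compute share = {compute_share:.4%}")
```

With these figures the computation is well under 0.1% of the total, so the bus dominates.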

24
Mitigation of Hardware Problems
  • Use a faster bus
  • AMD Opteron's HyperTransport
  • 20.8 GB/s (166.4 Gb/s) per link (version 3)
  • Modules that fit into an AMD 64-bit Opteron
    socket
  • http://www.drccomputer.com/pages/modules.html -
    Xilinx-based module
  • http://www.xtremedatainc.com/xd1000_brief.html -
    Altera-based module

25
Mitigation of Hardware Problems
  • Put the FPGA on the die with the DSP
  • Need silicon vendor support
  • FPGA can access memory on a very wide bus (e.g.
    128 bits per cycle)
  • Implement the entire project in FPGA
  • Time consuming to program
  • Possibly insufficient room on the FPGA

26
8-point FFT Demonstration: In Hardware
[Figure: 8-point FFT butterfly diagram. Input array in the order x0, x4, x2, x6, x1, x5, x3, x7; output X0 through X7. Legend: multiplication by W factor; addition.]
30
Why Not Software?
  • Each butterfly must be done sequentially
  • Only slight parallelism enabled by a DSP like the
    TigerSHARC
  • Each Butterfly can be done in 2 cycles (after
    optimization).
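
If every butterfly runs sequentially at 2 cycles each, a rough lower bound on the software cycle count follows from the number of butterflies, (N/2)·log2(N). The sketch below is an idealized estimate under that assumption; it ignores loop overhead, memory access, and twiddle-factor lookup, and the function name is hypothetical.

```python
def software_fft_cycles(n, cycles_per_butterfly=2):
    """Idealized cycles for an n-point FFT with strictly sequential butterflies."""
    butterflies = (n // 2) * (n.bit_length() - 1)   # (N/2) * log2(N)
    return butterflies * cycles_per_butterfly

cycles_8 = software_fft_cycles(8)        # 12 butterflies * 2 cycles = 24
```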

31
Results of Testing
  • Linear Profiling of FFT Algorithm in C

Stage            Cycle count                      Time
                 8-point   32-point   256-point   8-point     32-point    256-point
Initialization   21        25         25          35.07 ns    41.75 ns    41.75 ns
Computation      1135      6922       174222      1.895 µs    11.559 µs   290.950 µs
Butterfly        91        91         91          151.97 ns   151.97 ns   151.97 ns
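
The time columns follow from the cycle counts at a fixed cycle time. Dividing 35.07 ns by 21 cycles gives 1.67 ns per cycle (roughly a 600 MHz clock); this value is inferred from the table and does not appear on the slide. A sketch of the conversion:

```python
CYCLE_NS = 35.07 / 21     # inferred from the Initialization row: 1.67 ns per cycle

def cycles_to_ns(cycles):
    """Convert a profiled cycle count to nanoseconds at the inferred cycle time."""
    return cycles * CYCLE_NS

butterfly_ns = cycles_to_ns(91)              # about 151.97 ns
compute_256_us = cycles_to_ns(174222) / 1000 # about 290.95 us
```
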
32
Results of Testing
  • Profiling of VHDL on FPGA
  • Butterfly takes 24.377ns to execute
  • 62% is computational, 38% is routing on the FPGA
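
Putting the two butterfly measurements side by side gives the per-butterfly speedup of the FPGA over the C implementation. The sketch reads the "62" and "38" figures as percentages of the 24.377 ns, which is an interpretation of the slide.

```python
SOFTWARE_BUTTERFLY_NS = 151.97   # butterfly time from the C profiling results
FPGA_BUTTERFLY_NS = 24.377       # butterfly time from the VHDL profiling
COMPUTE_FRACTION = 0.62          # assumes the slide's "62" means 62 percent

speedup = SOFTWARE_BUTTERFLY_NS / FPGA_BUTTERFLY_NS
compute_ns = FPGA_BUTTERFLY_NS * COMPUTE_FRACTION    # time spent in logic
routing_ns = FPGA_BUTTERFLY_NS - compute_ns          # time spent in routing
print(f"speedup {speedup:.2f}x, compute {compute_ns:.2f} ns, routing {routing_ns:.2f} ns")
```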

33
Product Offerings
  • Most DSP Vendors
  • Many FPGA Vendors (IP: Intellectual Property)
  • Microcontroller Vendors (e.g. Blackfin)
  • FFTW: The Fastest Fourier Transform in the West
  • AMD Math Core Library
  • Intel Library
  • Highly Optimized for the expected hardware

34
Published Results
  • The Radix 4 version delivers a 1 K points complex
    processing time of 25 microseconds at 200-MHz
    system speeds and uses only about 10 percent of
    the resources in a mid-range Stratix device. The
    Radix 2 is half the size of the Radix 4 and
    offers a 1 K points complex processing time of 50
    microseconds at 200-MHz system speeds. Additional
    versions of the new cores are under development. [6]

FFT IP Core Published Results [7] (smaller is better)

FFT/IFFT length   Texas Instruments C6713   Single 4DSP FFT core   Quad 4DSP FFT core
256               12.3 µs                   3.68 µs                920 ns
512               27.3 µs                   6.24 µs                1.56 µs
1024              60.2 µs                   11.4 µs                2.85 µs
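
To make the comparison concrete, the published times can be turned into speedup factors relative to the Texas Instruments C6713. The numbers are copied from the table; the dictionary layout and function name are choices made for this sketch.

```python
# Times in microseconds: (TI C6713, single 4DSP core, quad 4DSP core)
published_us = {
    256: (12.3, 3.68, 0.920),
    512: (27.3, 6.24, 1.56),
    1024: (60.2, 11.4, 2.85),
}

def speedups(row):
    """Speedup of the single and quad FPGA cores over the DSP for one table row."""
    dsp, single, quad = row
    return dsp / single, dsp / quad

for length, row in published_us.items():
    single_x, quad_x = speedups(row)
    print(f"{length:5d}: single {single_x:.2f}x, quad {quad_x:.2f}x")
```

In every row the quad-core time is one quarter of the single-core time, so the quad speedup is four times the single speedup.
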
35
References
  • [1] Signals, Systems, and Transforms
  • [2] James W. Cooley and John W. Tukey, "An
    algorithm for the machine calculation of complex
    Fourier series," Math. Comput. 19, 297-301
    (1965).
  • [3] http://www.drccomputer.com/pages/modules.html
    - Xilinx-based module
  • [4] http://www.xtremedatainc.com/xd1000_brief.html
    - Altera-based module
  • [5] http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_2353,00.html
  • [6] http://www.us.design-reuse.com/news/news5650.html
  • [7] http://www.4dsp.com/fft.htm