Title: ELEC692 VLSI Signal Processing Architecture Lecture 8
1ELEC692 VLSI Signal Processing Architecture Lecture 8
- Architecture for Fourier Transform
2Usage of FFT
- Frequency transformation
- Applications
- OFDM wireless systems
- Speech/Multimedia data processing
- Satellite wireless transmission
- DTV, DAB broadcasting using OFDM
- Real-time requirements call for special hardware
- E.g. COFDM for DTV
- Signal bandwidth: 7.5 MHz
- Useful symbol duration: 1 ms
- Number of parallel subcarriers: 7.5 MHz x 1 ms = 7500
- Needs an 8K-point complex FFT
- Compute an 8K-point complex FFT in 1 ms, i.e. about 8M complex points per second
- Not efficient or practical to implement in software; special hardware is needed for the FFT
- In fact, quite a few off-the-shelf FFT processors are available on the market, but it is better to integrate the hardware within your chip
3DFT review
- The N-point discrete Fourier transform X(k) of an
N-point sequence x(n) (and the inverse DFT) is
given by
4DFT
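For reference, the DFT pair in its standard textbook form is sketched below. The twiddle-factor symbol W_N and its sign convention are assumptions, chosen to match the W notation used on the later slides.

```latex
\begin{aligned}
X(k) &= \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, & k &= 0, 1, \dots, N-1 \\
x(n) &= \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, & n &= 0, 1, \dots, N-1 \\
W_N &= e^{-j 2\pi / N}
\end{aligned}
```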
5Direct Implementation of DFT
- Product of a matrix (W) and a vector (x)
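The matrix-vector view can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual convention W[k][m] = exp(-j*2*pi*k*m/N); the function names `dft_matrix` and `dft_direct` are hypothetical.

```python
import cmath

def dft_matrix(n):
    """Return the n x n matrix W with W[k][m] = exp(-2j*pi*k*m/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def dft_direct(x):
    """Compute X = W x directly: n*n complex multiply-accumulate steps."""
    n = len(x)
    w = dft_matrix(n)
    return [sum(w[k][m] * x[m] for m in range(n)) for k in range(n)]

if __name__ == "__main__":
    # A constant input concentrates all energy in bin 0.
    print(dft_direct([1, 1, 1, 1]))
```

The double loop makes the N^2 cost of the direct form visible, which is what the fast algorithms on the following slides reduce.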
6One-dimensional (1-D) array for DFT, N = 8
7Complex multiplications
8Fast DFT
- Fast DFT (Discrete Fourier Transform) algorithm
- Cooley-Tukey decomposition (1965)
- Radix-2 decimation-in-time (DIT) or decimation-in-frequency (DIF)
- Divide the problem into two interleaved halves at each recursive stage
- Radix-2 decomposition first computes the DFT of the even-indexed samples x_0, x_2, ..., x_{n-2}, then the DFT of the odd-indexed samples x_1, x_3, ..., x_{n-1}, and then combines the two results
- The sequence can be decomposed recursively to reduce the overall runtime to O(n log n)
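The even/odd split can be sketched as a short recursive routine. This is an illustrative radix-2 DIT version, assuming a power-of-two length and the convention W_N = exp(-j*2*pi/N); the name `fft_dit` is hypothetical.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of 2)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_dit(x[0::2])   # DFT of even-indexed samples
    odd = fft_dit(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle W_N^k
        out[k] = even[k] + t            # upper butterfly output
        out[k + n // 2] = even[k] - t   # lower output uses W_N^(N/2) = -1
    return out

if __name__ == "__main__":
    print(fft_dit([1, 2, 3, 4]))
```

Each level does O(n) work and there are log2(n) levels, which gives the O(n log n) bound quoted above.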
9Radix-2 DIF DFT
Since W_N^(N/2) = -1 corresponds to a rotation of 180°,
the factor of the second sum can be reduced even
further. We have
The division of k into even and odd values leads
to the following
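As a sketch of the standard radix-2 DIF derivation that these statements refer to (assuming W_N = e^{-j2\pi/N}; the index r for the half-length outputs is introduced here for illustration):

```latex
\begin{aligned}
X(k) &= \sum_{n=0}^{N/2-1} x(n)\, W_N^{nk}
      + W_N^{(N/2)k} \sum_{n=0}^{N/2-1} x\!\left(n + \tfrac{N}{2}\right) W_N^{nk},
      \qquad W_N^{(N/2)k} = (-1)^k \\
X(2r) &= \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \tfrac{N}{2}\right) \right] W_{N/2}^{nr} \\
X(2r+1) &= \sum_{n=0}^{N/2-1} \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{n}\, W_{N/2}^{nr},
      \qquad r = 0, 1, \dots, \tfrac{N}{2} - 1
\end{aligned}
```

Both results are DFTs of length N/2, so the same split can be applied again to each half.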
10Radix-2 decomposition of 8-point FFT
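[Figure: signal flow graph of the 8-point radix-2 DIF FFT. Inputs x(0) to x(7) enter in natural order; outputs appear in bit-reversed order y(0), y(4), y(2), y(6), y(1), y(5), y(3), y(7). Butterfly branches carry the weights -1 and the twiddle factors W0 to W3.]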
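The flow graph can be mirrored by an iterative in-place routine. The following is a minimal sketch, assuming W_N = exp(-j*2*pi/N); the names `fft_dif_inplace` and `bit_reverse` are hypothetical. The outputs come out in bit-reversed order, as in the 8-point graph.

```python
import cmath

def fft_dif_inplace(x):
    """Iterative radix-2 DIF FFT; results are left in bit-reversed order."""
    a = [complex(v) for v in x]
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    while half >= 1:
        span = 2 * half
        for start in range(0, n, span):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / span)  # twiddle for this stage
                u, v = a[start + j], a[start + j + half]
                a[start + j] = u + v                # sum branch
                a[start + j + half] = (u - v) * w   # difference branch, then twiddle
        half //= 2
    return a

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

if __name__ == "__main__":
    print([bit_reverse(i, 3) for i in range(8)])
```

Position p of the returned list holds X(bit_reverse(p, log2 N)), so a final reordering pass is needed only if natural-order output is required.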
11Implementation of Radix 2 FFT
- Two extreme methods
- Reuse single Butterfly
- Slower
- Smaller area
- More complicated control
- Fully multi-stage straight implementation
- Faster
- Larger area
- More regular control
- Trade-off between the two ends based on
- Speed, area, power
12Comparison of calculation
Operation   DFT        FFT
MUL         (N-2)^2    (N/2) log2 N - (N-1)
ADD         N^2        N log2 N
hardware
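To get a feel for the numbers, the counts can be evaluated for a concrete N. The sketch below simply plugs N into the expressions as read from the slide, (N-2)^2 and N^2 for the DFT, (N/2) log2 N - (N-1) and N log2 N for the FFT; the function name `op_counts` is hypothetical.

```python
def op_counts(n):
    """Evaluate the slide's operation-count expressions for a power-of-two n."""
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two, n >= 2")
    return {
        "dft_mul": (n - 2) ** 2,
        "dft_add": n ** 2,
        "fft_mul": (n // 2) * stages - (n - 1),
        "fft_add": n * stages,
    }

if __name__ == "__main__":
    for n in (8, 1024):
        print(n, op_counts(n))
```

For N = 1024 the multiplication count drops from about one million to a few thousand, which is the motivation for building FFT rather than DFT hardware.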
13Data transport
- One problem for FFT is its less regular data
transport.
If the butterfly PEs are arranged so that PEs
with lower exponents of W come first in each
stage, the resulting configuration has identical
communication networks between stages (perfect
shuffle).
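As an illustration of the permutation itself, the sketch below interprets "perfect shuffle" in the usual sense of interleaving the first and second halves of a sequence. This interpretation and the name `perfect_shuffle` are assumptions; the slide does not define the wiring explicitly.

```python
def perfect_shuffle(seq):
    """Interleave the first half of seq with the second half."""
    if len(seq) % 2:
        raise ValueError("length must be even")
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out.extend((a, b))
    return out

if __name__ == "__main__":
    s = list(range(8))
    for _ in range(3):
        s = perfect_shuffle(s)
        print(s)
```

Applying the same shuffle log2 N times restores the original order, which is one way to see why a single fixed network can serve every stage.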
14Conventional single butterfly FFT implementation
- Strong speed limitation
- Large storage area needed for intermediate results (N complex words)
- If the memory is not partitioned, the number of R/W accesses needed to perform the FFT creates a bottleneck
- An N-point FFT requires (N/r) log_r N radix-r butterfly computations and 2N log_r N R/W RAM accesses
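The access count can be checked with a small instrumented model. The sketch below covers the radix-2 case only and assumes one butterfly that reads two words and writes two words per operation; the class `CountingRam` and the function `single_butterfly_fft` are hypothetical names for illustration.

```python
import cmath

class CountingRam:
    """A word-addressed memory that counts read and write accesses."""
    def __init__(self, data):
        self.data = [complex(v) for v in data]
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.data[addr]

    def write(self, addr, value):
        self.writes += 1
        self.data[addr] = value

def single_butterfly_fft(ram):
    """Run an in-place radix-2 DIF FFT through one butterfly; return butterfly count."""
    n = len(ram.data)
    butterflies = 0
    half = n // 2
    while half >= 1:
        span = 2 * half
        for start in range(0, n, span):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / span)
                u = ram.read(start + j)
                v = ram.read(start + j + half)
                ram.write(start + j, u + v)
                ram.write(start + j + half, (u - v) * w)
                butterflies += 1
        half //= 2
    return butterflies

if __name__ == "__main__":
    ram = CountingRam(range(16))
    print(single_butterfly_fft(ram), ram.reads + ram.writes)
```

For N = 16 this gives 32 butterflies and 128 accesses, matching (N/2) log2 N and 2N log2 N.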
15Single-stage (1-D) implementation - horizontal projection
- Horizontal projection: provide PEs for a single stage
- Use only N/2 PEs, i.e. one stage only
- Throughput is reduced by a factor of log2 N compared with a 2-D array
- Need to take care of the complex communication structure
PEs do not have fixed coefficients; the coefficients must
change after each cycle, and the global
communication network is a disadvantage
16Single-stage (1-D) implementation - horizontal projection
- Pipelining the PEs does not directly increase throughput for this architecture, since the results of the current processing step are required for the next one
- However, sequential data blocks of length N can be processed independently of one another, so several data blocks can be processed by interleaving
- This requires a larger number of registers
17Single-stage (1-D) implementation - horizontal projection
- If N is large, we cannot implement the full number of PEs
- Project the N/2 butterfly PEs onto M PEs, where M is also a power of 2 and M < N/2
- Special registers for input data, intermediate results and result data are required
- The registers cyclically read and write a particular sequence of 2M complex data
18Single-stage (1-D) implementation - vertical projection
- Vertical projection: one PE for each stage (log N PEs in total)
- Circuitry is needed between the PEs to prepare the correct input data
- From stage to stage, the length of the sequence to which the FFT is applied is halved
- If the previous stage led to a DFT of length 2n, then in accordance with the perfect shuffle the sequence of length 2n must be split in half, and the 1st and (n+1)th values must be fed to the following PE. Then the 2nd and (n+2)th values are fed to it
- Hence the sequence must be delayed by n clock cycles, in accordance with the position of the midpoint
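The pairing of the 1st with the (n+1)th value, the 2nd with the (n+2)th, and so on, can be modelled with an n-deep FIFO. The sketch below is a behavioural illustration only, not a description of the actual commutator circuit; the name `pair_with_delay` is hypothetical.

```python
from collections import deque

def pair_with_delay(stream, n):
    """Pair each value of the first half of a 2n-block with the value n cycles later."""
    fifo = deque()
    pairs = []
    for t, value in enumerate(stream):
        if t % (2 * n) < n:
            fifo.append(value)                      # first half: hold for n cycles
        else:
            pairs.append((fifo.popleft(), value))   # second half: meet the delayed value
    return pairs

if __name__ == "__main__":
    print(pair_with_delay(range(8), 4))
```

With a block of 8 values and n = 4 the butterfly sees the pairs (0, 4), (1, 5), (2, 6), (3, 7), which is the ordering the bullet points describe.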
19Data formatting/sorting for Vertical projection
- The block u_{n-1}, ..., u_0 must be delayed by n clock cycles
- When u_n is available, the values from the stream u must be fed to the new lower stream v. The values of u are input in parallel into the next butterfly stage for n clock cycles
- So the values of v are fed in parallel to the next butterfly PE for n clock cycles, and v_{n-1}, ..., v_0 are delayed by 2n cycles and v_{2n-1}, ..., v_n are delayed by n cycles
20Data formatting/sorting for Vertical projection
- A special circuit is necessary for the data input of the 1st stage
- The incoming data stream of N data is divided into 2 parts of N/2 data each. The clock rate is hence halved. We need a demultiplexer followed by a FIFO register
21Overall architecture of linear FFT array based on
butterfly PEs and delay commutators
Consists of log N PEs (one per stage), with delay
commutators located between the PEs. Due to the continued
halving, control signals are extracted using
frequency dividers
22Higher radix FFT
We have
Thus
23Radix-4 DIF algorithm
- Butterfly of Radix-4 Algorithm
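A radix-4 butterfly is a 4-point DFT, and because its internal factors are only 1, -1, j and -j it can be built from additions and real/imaginary swaps. The following is a minimal sketch of that idea, not the specific circuit from the slide; the name `radix4_butterfly` is hypothetical, and twiddle multiplication between stages is left out.

```python
def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) using 8 complex additions and no multiplier."""
    a, b, c, d = complex(a), complex(b), complex(c), complex(d)
    s0, s1 = a + c, a - c
    s2, s3 = b + d, b - d
    mj_s3 = complex(s3.imag, -s3.real)  # -j * s3, done by swapping parts
    return (s0 + s2, s1 + mj_s3, s0 - s2, s1 - mj_s3)

if __name__ == "__main__":
    print(radix4_butterfly(1, 2, 3, 4))
```

The eight additions (four for the partial sums, four for the outputs) are the reason a radix-4 butterfly is larger than a radix-2 one even though it needs no general multiplier.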
24Radix-4 Signal flow graph
25Higher radix FFT
26Some pipeline FFT Processor Architecture
- Assume the input sequence is in normal order and the output is allowed to be in digit-reversed (radix-2 or radix-4) order
- Assume DIF type of decomposition
- Here we assume the additive butterfly has been separated from the multiplier to show the hardware requirements distinctly
27Radix-2 Multi-path Delay Commutator (R2MDC)
N = 16
The input sequence has been broken into 2 parallel
data streams flowing forward, with the correct
distance between data elements entering the
butterfly scheduled by proper delays
- Number of multipliers: log2 N - 2
- Number of butterflies: log2 N
- Number of registers: (3/2)N - 2
28Radix-2 Single-path Delay Feedback (R2SDF)
N = 16
The butterfly outputs are stored in feedback shift
registers. A single data stream goes through the
multiplier at every stage.
- Number of multipliers: log2 N - 2
- Number of butterflies: log2 N
- Number of registers: N - 1
29Radix-4 Single-path Delay Feedback (R4SDF)
N = 256
Uses radix-4 and CORDIC iterations. Utilization of the
multipliers is increased to 75% due to the storage of 3
of the radix-4 butterfly outputs. Utilization of
the radix-4 butterfly (which is more complicated
than the radix-2 butterfly, containing at least 8
complex adders) drops to 25%.
- Number of multipliers: log4 N - 1
- Number of butterflies: log4 N
- Number of registers: N - 1
30Radix-4 Multi-path Delay Commutator (R4MDC)
N = 256
Utilization rate: butterflies 25%, multipliers 25%
- Number of multipliers: 3 log4 N
- Number of butterflies: log4 N
- Number of registers: (5/2)N - 4
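The four architectures can be compared side by side by evaluating their resource expressions for one N. The sketch below uses the expressions as read from the slides above, assuming the minus signs that the transcript appears to have dropped (for example log2 N - 2); the names `ilog` and `pipeline_resources` are hypothetical.

```python
def ilog(n, base):
    """Exact integer logarithm; raises if n is not a power of base."""
    count, value = 0, 1
    while value < n:
        value *= base
        count += 1
    if value != n:
        raise ValueError(f"{n} is not a power of {base}")
    return count

def pipeline_resources(n):
    """Multiplier, butterfly and register counts per architecture (n a power of 4)."""
    l2, l4 = ilog(n, 2), ilog(n, 4)
    return {
        "R2MDC": {"multipliers": l2 - 2, "butterflies": l2, "registers": 3 * n // 2 - 2},
        "R2SDF": {"multipliers": l2 - 2, "butterflies": l2, "registers": n - 1},
        "R4SDF": {"multipliers": l4 - 1, "butterflies": l4, "registers": n - 1},
        "R4MDC": {"multipliers": 3 * l4, "butterflies": l4, "registers": 5 * n // 2 - 4},
    }

if __name__ == "__main__":
    for name, res in pipeline_resources(256).items():
        print(name, res)
```

For N = 256 the delay-feedback variants need 255 registers, against 382 and 636 for the two delay-commutator variants.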
31Some observation
- Delay-feedback architectures are more efficient than the corresponding delay-commutator architectures in terms of memory utilization, since the stored butterfly output can be used directly by the multipliers
- Radix-4 algorithm based single-path architectures have higher multiplier utilization, but radix-2 algorithms have simpler butterflies which are better utilized
32Comparison
Radix / speed: low <-----> high
Control scheme: simple <-----> complex
Processing ability per unit: low <-----> high
Combine the advantages: further decompose the high-radix PE
33Radix-2^2 DIF FFT
- Optimal hardware
- Same number of non-trivial multiplications, at the same positions in the SFG, as radix-4 algorithms
- The same butterfly structure as radix-2 algorithms
- Radix-2^2 DIF FFT (S. He, M. Torkelson, "A New Approach to Pipeline FFT Processor," in Proceedings of IPPS, 1996, pp. 766-780)
34Radix-2^2 DIF FFT
Apply a 3-dimensional linear index map
The common factor algorithm (CFA) has the form of
Summation over n1
35Radix-2^2 DIF FFT
- Proceed to the second step of decomposition on the
remaining DFT coefficients, including the
twiddle factor, to exploit the exceptional
values in multiplication before the next
butterfly is constructed.
After substitution and simplification, we have
[Equation with terms labelled BF I, BF I and BF II]
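The radix-2^2 decomposition can be sketched as follows, based on the formulation in the He and Torkelson paper cited on the earlier slide. The symbol H(k_1, k_2, n_3) and the assignment of terms to BF I and BF II follow that paper as recalled here and should be checked against it.

```latex
\begin{aligned}
n &= \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad
k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N \\
X(k_1 + 2k_2 + 4k_3) &= \sum_{n_3=0}^{N/4-1}
   \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2 k_2)} \right] W_{N/4}^{n_3 k_3} \\
H(k_1, k_2, n_3) &=
   \underbrace{\left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right]}_{\text{BF I}}
   + (-j)^{(k_1 + 2 k_2)}
   \underbrace{\left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right]}_{\text{BF I}}
\end{aligned}
```

The outer combination of the two BF I terms with the trivial factor (-j)^{(k_1 + 2k_2)} is the part labelled BF II; only W_N^{n_3 (k_1 + 2 k_2)} needs a full multiplier.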
36Butterfly with decomposed twiddle factors
Full multipliers are required to compute the
product of the decomposed twiddle factor. The
order of the twiddle factors is different from
that of radix-4 algorithm.
37Complete Radix-2^2 DIF FFT
- Apply the CFA recursively to the remaining DFTs
of length N/4.
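The recursive use of the CFA can be sketched in Python as below. This is an illustrative software model under the assumption W_N = exp(-j*2*pi/N), with BF I as the plus/minus stage and BF II as the stage that includes the trivial factor (-j)^(k1 + 2*k2); the name `fft_radix22` is hypothetical, and a length-2 base case is added so that any power-of-two length works.

```python
import cmath

def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT, natural-order output (power-of-two length)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    if n % 4:
        raise ValueError("length must be a power of two")
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            sub = []
            for n3 in range(q):
                # BF I: two radix-2 butterflies spaced N/2 apart
                top = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                bottom = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                # BF II: combine with the trivial factor (-j)^(k1 + 2*k2)
                h = top + (-1j) ** (k1 + 2 * k2) * bottom
                # the only full multiplication: W_N^(n3 * (k1 + 2*k2))
                sub.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n))
            # remaining DFT of length N/4, handled by the same decomposition
            for k3, value in enumerate(fft_radix22(sub)):
                out[k1 + 2 * k2 + 4 * k3] = value
    return out

if __name__ == "__main__":
    print(fft_radix22([1, 2, 3, 4]))
```

A hardware pipeline would not recurse, but the model shows where the full multiplier sits: once per pair of butterfly stages, as in radix-4.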
39Radix-2^2 Single-path Delay Feedback (R2^2SDF)
Two types of butterflies: one identical to that in R2SDF, the
other also containing the logic to implement the
trivial twiddle factor multiplication
- A log2 N-bit binary counter serves two purposes
- Synchronization controller
- Address generation counter for twiddle factor
reading in each stage
40Radix-2^2 Single-path Delay Feedback (R2^2SDF)
- Structure for BF2I and BF2II
[Figure: structure of butterflies BF2II and BF2I]
Operation scheduling
- First N/2 cycles: the 2-to-1 mux in BF2I switches to 0 and the butterfly is idle. Input data is directed to the shift registers until they are filled.
- Next N/2 cycles: the mux turns to 1, and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers
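The two-phase schedule can be modelled behaviourally. In the sketch below the sum is sent forward and the difference is written back to the shift register, to be read out during the following N/2 cycles; that routing is an assumption based on the usual single-path delay-feedback scheme and is not stated on the slide. The name `sdf_stage` is hypothetical, and twiddle multiplication is omitted.

```python
from collections import deque

def sdf_stage(stream, depth):
    """Simulate one delay-feedback butterfly stage with a `depth`-word shift register."""
    fifo = deque([0] * depth)  # feedback shift register, initially cleared
    out = []
    for t, value in enumerate(stream):
        delayed = fifo.popleft()
        if (t // depth) % 2 == 0:
            # mux = 0: butterfly idle, input is stored, old register content leaves
            fifo.append(value)
            out.append(delayed)
        else:
            # mux = 1: 2-point DFT of stored and incoming data
            out.append(delayed + value)   # sum goes forward
            fifo.append(delayed - value)  # difference is fed back
    return out

if __name__ == "__main__":
    frame = [1, 2, 3, 4, 5, 6, 7, 8]
    print(sdf_stage(frame + [0] * 4, 4))
```

With an 8-point frame and a 4-word register, the first four outputs are the cleared register contents, the next four are the sums x(n) + x(n+4), and the last four are the differences x(n) - x(n+4) leaving the register while the next frame would be loading.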