Processor Architecture Needed to handle FFT algorithm - PowerPoint PPT Presentation

Description:

FFT algorithm M. Smith ... Canada FFT algorithms There are more FFT algorithms performed per day than any other algorithm in the world Therefore, ... – PowerPoint PPT presentation

Slides: 37
Provided by: Micha788

Transcript and Presenter's Notes


1
Processor Architecture Needed to handle FFT
algorithm
  • M. Smith

2
FFT algorithms
  • More FFTs are performed per day than any other
    algorithm in the world
  • Therefore, of course, processors have custom parts
    to handle this situation

3
1 MRI session
http://www.core.org.cn/NR/rdonlyres/Nuclear-Engineering/22-56JFall-2005/EA28B7B3-39E5-4999-B858-5E3B248E5408/0/chp_mri.jpg
4
Magnetic resonance (MR) and DFT (discrete Fourier
transform)
  • Place a body in a steady magnetic field of 1.5
    Tesla (or 3 or 7)
  • All the protons spin (precess) around the magnetic
    field at a frequency of around 10 MHz (a round
    number for these slides; at 1.5 Tesla protons
    actually precess at about 64 MHz)
  • If an RF (90 degree) pulse is sent in at 10 MHz,
    it will be absorbed by the system
  • Some energy is then emitted by the system as a
    sinusoid at 10 MHz
  • Do a DFT: single peak whose height is proportional
    to the number of protons in the body (amount of
    hydrogen in body, hence amount of water in body)
  • Used to non-destructively measure water content
    in wheat
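
The "single peak whose height follows the amount of material" idea can be sketched in a few lines of Python. This is only an illustration, not the presenter's code: the emitted signal is modelled as an ideal complex sinusoid, the amplitude stands in for the number of protons, and the names dft and peak_height plus the values 8 cycles and 64 samples are assumptions chosen for the example.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def peak_height(amplitude, cycles=8, n=64):
    """Return (height, bin) of the largest DFT peak of a sampled sinusoid."""
    # Idealised emitted signal: one complex sinusoid, no noise, no decay.
    signal = [amplitude * cmath.exp(2j * math.pi * cycles * t / n)
              for t in range(n)]
    spectrum = [abs(v) for v in dft(signal)]
    height = max(spectrum)
    return height, spectrum.index(height)

h1, bin1 = peak_height(1.0)   # "one unit" of protons
h3, bin3 = peak_height(3.0)   # three times as many protons
print(bin1, bin3)             # same frequency bin in both cases
print(round(h3 / h1, 6))      # peak height scales with the amplitude
```

Tripling the amplitude leaves the peak in the same bin and triples its height, which is the property the water-content measurement relies on.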

5
Magnetic resonance imaging (MRI) and DFT
(discrete Fourier transform)
  • Place four bodies in a steady magnetic field of
    1.5 Tesla (or 3 or 7)
  • Apply 90 degree pulse
  • Apply DFT on response
  • One signal at 10 MHz
  • Place four bodies in a steady magnetic field of
    1.5 Tesla
  • Apply 90 degree pulse
  • Apply a field gradient G in X direction
  • Apply DFT on response
  • Four signals at 10 + G.X1, 10 + G.X2, 10 + G.X3,
    10 + G.X4 where Xi is the x position of object i
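
A small simulation may make the effect of the gradient concrete. It is a sketch under stated assumptions: frequencies are expressed in DFT bins rather than MHz, and BASE_BIN, GRADIENT and the four positions are hypothetical values picked so that every signal lands exactly on a bin.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

N = 64
BASE_BIN = 10               # stands in for the 10 MHz base frequency
GRADIENT = 2                # bins of frequency shift per unit of x (assumed)
positions = [1, 3, 4, 7]    # assumed x positions of the four bodies

def response(gradient):
    """Sum of the sinusoids emitted by the four bodies."""
    return [sum(cmath.exp(2j * math.pi * (BASE_BIN + gradient * x) * t / N)
                for x in positions)
            for t in range(N)]

def peaks(signal):
    """Bins whose magnitude is well above the (near zero) background."""
    mags = [abs(v) for v in dft(signal)]
    return [k for k, m in enumerate(mags) if m > N / 2]

print(peaks(response(0)))          # no gradient: one shared frequency
print(peaks(response(GRADIENT)))   # gradient on: one frequency per position
```

Without the gradient all four bodies fall on the same bin; with it, each body appears at BASE_BIN + GRADIENT * x, so position can be read off from frequency.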

6
Magnetic resonance imaging (MRI) and DFT
(discrete Fourier transform)
  • Place four bodies in a steady magnetic field of
    1.5 Tesla
  • Apply 90 degree pulse
  • Apply a field gradient Gx in X direction
  • Apply DFT on response
  • Four signals at 10 + Gx.X1, 10 + Gx.X2,
    10 + Gx.X3, 10 + Gx.X4 where Xi is the x position
    of object i
  • Apply 90 degree pulse
  • Now add a field gradient in both X and Y
    directions
  • Apply DFT on response
  • Four signals at 10 + Gx.X1 + Gy.Y1, 10 + Gx.X2 +
    Gy.Y2, 10 + Gx.X3 + Gy.Y3, 10 + Gx.X4 + Gy.Y4
    where Xi is the x position of object i and Yi is
    its y position

7
1 MRI session
http://www.core.org.cn/NR/rdonlyres/Nuclear-Engineering/22-56JFall-2005/EA28B7B3-39E5-4999-B858-5E3B248E5408/0/chp_mri.jpg
  • Lasts about 20 minutes
  • Echo planar imaging (EPI) generates 19 2-D slices
    of the brain in about 60 seconds
  • Each image set is 256 pixels by 256 pixels x 19
    slices
  • Each image requires 256 256 19 DFTs / minute
  • The DFT is only ONE part of the algorithm
  • My research uses MRI for stroke diagnosis

http://www.magnet.fsu.edu/education/tutorials/magnetacademy/mri/images/mri-scanner.jpg
8
Tackled already this term
  • Three types of DSP algorithms
  • Long loops, multiplication and addition
    intensive, regular (simple) memory accesses
    e.g. 300 taps in FIR algorithms
  • Short loops involving multiplications and
    additions e.g. 3 stages in IIR algorithms

9
Comparing IIR and FIR filters
  • Infinite Impulse Response filters: few
    operations to produce output from input for each
    IIR stage; 3 to 7 stages
  • Finite Impulse Response filters: many operations
    to produce output from input. Long FIFO buffer,
    which may require as many operations as the FIR
    calculation itself. Easy to optimize
10
Discrete Fourier Transform
  • FIR and IIR algorithms directly manipulate the
    data in the time domain.
  • FIR -- Processing M data points using an N point
    FIR filter involves M * (N-1) additions, M * N
    multiplications and M N 2 M memory
    accesses. Algorithm takes a time of Order (M * N)
  • Very slow if manipulating a large amount of data
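
The operation counts quoted above can be checked with a direct time-domain FIR that tallies its own arithmetic. This is a minimal sketch, not production DSP code: fir_direct is a hypothetical helper, the filter is a 300 tap moving average, and the FIFO is shifted naively, which is itself the extra buffer cost mentioned on the earlier FIR slide.

```python
def fir_direct(samples, coeffs):
    """Time-domain FIR filter that also counts multiplies and adds."""
    n_taps = len(coeffs)
    delay = [0.0] * n_taps          # FIFO holding the most recent N inputs
    mults = adds = 0
    out = []
    for s in samples:
        delay = [s] + delay[:-1]    # shift the FIFO: N moves per sample
        acc = coeffs[0] * delay[0]
        mults += 1
        for c, d in zip(coeffs[1:], delay[1:]):
            acc += c * d            # one multiply and one add per extra tap
            mults += 1
            adds += 1
        out.append(acc)
    return out, mults, adds

M, N = 1000, 300                    # M data points, N point FIR filter
y, mults, adds = fir_direct([1.0] * M, [1.0 / N] * N)
print(mults == M * N)               # M * N multiplications
print(adds == M * (N - 1))          # M * (N-1) additions
```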

11
Frequency domain analysis
  • Apply discrete Fourier transform (implemented via
    FFT)
  • Transform to frequency domain takes time Order (M
    log M)
  • Performing the FIR in the frequency domain takes
    time Order (M)
  • Transform back to time domain takes time Order (M
    log M)
  • FFT approach (Order (M log M)) can be orders of
    magnitude faster than time-domain FIR (Order
    (M * N)) when the filter is long
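
The three steps (transform, multiply, transform back) can be sketched as follows and compared against the direct time-domain result. The recursive radix-2 fft here is a textbook version written for clarity rather than speed, and the function names and test data are assumptions for illustration only.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT using the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n
            for v in fft([c.conjugate() for c in spectrum])]

def fir_time(x, h):
    """Direct time-domain FIR (linear convolution), Order (M * N)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def fir_freq(x, h):
    """FIR via the frequency domain: FFT, point-wise multiply, inverse FFT."""
    out_len = len(x) + len(h) - 1
    size = 1
    while size < out_len:
        size *= 2                       # zero-pad so the result does not wrap
    X = fft(list(x) + [0.0] * (size - len(x)))
    H = fft(list(h) + [0.0] * (size - len(h)))
    Y = [a * b for a, b in zip(X, H)]   # the Order (M) step
    return [v.real for v in ifft(Y)][:out_len]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [0.5, 0.25, 0.25]
direct = fir_time(x, h)
fast = fir_freq(x, h)
print(max(abs(a - b) for a, b in zip(direct, fast)) < 1e-9)
```

For a filter this short the direct form is of course quicker; the frequency-domain route only pays off once the filter length is large compared with log M.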

12
(No Transcript)
13
(No Transcript)
14
4 point DFT to show concepts
15
Simplify using special complex exponential
properties
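
Since the slide graphics are not in the transcript, here is a hedged sketch of what a 4 point DFT and its simplification typically look like. It assumes the usual definition W_N^k = exp(-2 pi j k / N); the helper names twiddle, dft4 and dft4_simplified are invented for this example. The special properties used are that the four twiddle factors are just 1, -j, -1 and j, and that W^(k + N/2) = -W^k.

```python
import cmath
import math

def twiddle(n, k):
    """W_n^k = exp(-2*pi*j*k/n), rounded to remove floating point dust."""
    w = cmath.exp(-2j * math.pi * k / n)
    return complex(round(w.real, 12), round(w.imag, 12))

def dft4(x):
    """4 point DFT written as a full sum over twiddle factors."""
    return [sum(twiddle(4, (k * t) % 4) * x[t] for t in range(4))
            for k in range(4)]

def dft4_simplified(x):
    """Same transform using only adds, subtracts and a multiply by -j."""
    a, b = x[0] + x[2], x[0] - x[2]
    c, d = x[1] + x[3], x[1] - x[3]
    return [a + c, b - 1j * d, a - c, b + 1j * d]

x = [1, 2, 3, 4]
full = dft4(x)
short = dft4_simplified(x)
print(all(abs(p - q) < 1e-9 for p, q in zip(full, short)))
```

The simplified form needs no general complex multiplications at all, which is the pattern the FFT repeats at larger sizes.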
16
Running FFT on data stored in array
17
8 point FFT with log2 8 (= 3) stages
  • 3 stages with N / 2 butterflies / stage, so Order
    (N log N) in time
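
The stage and butterfly counts can be confirmed with an in-place radix-2 FFT that counts as it goes. This sketch assumes the decimation-in-frequency form (outputs end up in bit-reversed order); the slide's own diagram is not in the transcript, so the choice of form and the name fft_dif are assumptions.

```python
import cmath
import math

def fft_dif(x):
    """In-place radix-2 decimation-in-frequency FFT.

    Returns (spectrum in natural order, number of stages,
    list of butterflies executed in each stage).
    """
    data = list(x)
    n = len(data)                   # must be a power of two
    per_stage = []
    span = n // 2                   # distance between butterfly inputs
    while span >= 1:
        count = 0
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * math.pi * k / (2 * span))
                a, b = data[start + k], data[start + k + span]
                data[start + k] = a + b              # butterfly add
                data[start + k + span] = (a - b) * w  # butterfly subtract
                count += 1
        per_stage.append(count)
        span //= 2
    bits = n.bit_length() - 1
    # Results sit in bit-reversed order; put them back in natural order.
    ordered = [data[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
    return ordered, len(per_stage), per_stage

spectrum, stages, per_stage = fft_dif([1, 2, 3, 4, 5, 6, 7, 8])
print(stages, per_stage)
```

For N = 8 this reports 3 stages of 4 butterflies each, i.e. (N / 2) * log2 N butterflies in total.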

18
Architectural characteristics needed to handle
FFT efficiently
19
Add / subtract in one instruction
  • The following instruction is illegal as a single
    instruction
  • F4 = F2 + F3, F5 = F6 - F7 ILLEGAL
  • Needs bits to describe 6 registers (6 * 4 bits)
  • FFT Butterfly add is a special instruction
  • F4 = F11 + F12, F5 = F11 - F12
  • Uses only 4 registers, 2 in, 2 out (4 2 bits)
  • 2 bits, how come?
  • F4 = F12 + F11, F5 = F12 - F11 ILLEGAL
  • Fx = F11-8 + F15-12, Fy = F11-8 - F15-12
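
One way to read the bit counting is sketched below. It is an interpretation, not a statement of the actual opcode layout: it assumes 16 floating point registers, that each output can be any of the 16, and that each input is restricted to a bank of four (F8 to F11 for the first, F12 to F15 for the second), which is what the last bullet suggests. The names field_bits and butterfly are hypothetical.

```python
def field_bits(choices):
    """Bits needed to select one register out of `choices` candidates."""
    return (choices - 1).bit_length()

def butterfly(a, b):
    """Sum and difference of the same two operands, computed together."""
    return a + b, a - b

# Two unrelated operations: six independent register fields over F0..F15.
general_bits = 6 * field_bits(16)

# Assumed butterfly encoding: two unrestricted outputs, two inputs each
# limited to a bank of four registers.
restricted_bits = 2 * field_bits(16) + 2 * field_bits(4)

print(general_bits, restricted_bits)
print(butterfly(5.0, 3.0))
```

Under these assumptions the restricted form needs half the register bits, which would explain why swapping the two inputs (F12 first, F11 second) is not encodable.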

20
Memory accesses
  • Stage 1
  • Fetch X data at location k and k + N/2
  • Store X data at location k and k + N/2
  • Stage 2
  • Fetch X data at location k and k + N/4
  • Store X data at location k and k + N/4
  • Stage 3 -- Final stage
  • Fetch X data at location k and k + N/8
  • Store X data at bit-reversed location k and k +
    N/4
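
The fetch pattern per stage can be listed explicitly for N = 8. This sketch only enumerates the index pairs touched by each stage of an in-place decimation-in-frequency FFT (an assumed form); it does not model the bit-reversed store of the final stage, and butterfly_pairs is a name made up for the example.

```python
def butterfly_pairs(n):
    """List the (k, k + offset) index pairs touched in each FFT stage."""
    stages = []
    span = n // 2                   # offset N/2, then N/4, then N/8, ...
    while span >= 1:
        pairs = [(start + k, start + k + span)
                 for start in range(0, n, 2 * span)
                 for k in range(span)]
        stages.append((span, pairs))
        span //= 2
    return stages

for span, pairs in butterfly_pairs(8):
    print(f"offset N/{8 // span}: {pairs}")
```

The offset halves every stage, so the memory system has to cope with strides that change from stage to stage rather than one fixed stride.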

21
First issue how do you store complex numbers?
  • One option
  • Use 16-bit values
  • Store real part in top 16 bits
  • Store imaginary part in bottom 16 bits
  • Access data on J-bus
  • Access complex sinusoids on J-bus
  • Access both components (R and I) in one cycle
  • TigerSHARC has the ability to do 16-bit complex
    additions and multiplications as specific
    instructions, INTEGER only (NOT SHARC)
  • Can use both X and Y compute blocks

22
Integer operations are a pain, they tend to overflow --
TigerSHARC syntax
  • Option 2: floating point
  • Store real component in location X and imaginary
    component in location Y
  • Use R10 QJ4 4
  • Store first complex number in X0 and Y0
  • Store second complex number in X1 and Y1
  • FR3 = R1 + R0 performs complex floating point
    addition in single cycle
  • LJ5 R3 stores complex answer back

23
Integer operations are a pain, they tend to overflow --
TigerSHARC syntax
  • Option 3: floating point
  • Access real component along J-bus from data1
    and imaginary component along K-bus from data2
  • Use XR30 QJ4 4 YR30 QK4 4
  • Store first complex number in X0 and Y0
  • Store second, third and fourth complex numbers
    in XR1, YR1; XR2, YR2; XR3, YR3
  • Which option is best? Depends? How do you handle
    bringing in the complex sinusoids?

24
Bit reverse addressing
25
Bit reverse addressing Check manual for
accurate details before MII
  • Only possible with I0, I1, I2? and I3? registers
    (also I8, I9, I10?, I11?)
  • You must start the array on an N-aligned boundary,
    otherwise it does not work
  • I0 address pointer
  • B0 base register point to start of array
  • L0 length of array register
  • M0 special circular buffer modify register ????
  • F4 BR I0 1 // Correct SHARC syntax???
  • Bit-reverse addressing only works on POST-MODIFY
    (permits next address to be calculated in
    parallel)
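
What the bit-reverse address generator computes can be shown in software. This sketch only reproduces the index mapping for an 8 point array; it says nothing about the real register usage or instruction syntax, which the slide itself flags as needing a check against the manual. The name bit_reverse is an assumption.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (e.g. 0b001 -> 0b100)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the mapping twice returns the original index, so the same addressing mode serves for both scrambling and unscrambling the data.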

26
Issues handling FFT Butterfly
27
(No Transcript)
28
(No Transcript)
29
Only possible on TigerSHARC
30
Wrong again
  • This is using the Radix 2 form of the algorithm,
    which breaks down into 2-pt DFTs
  • There is also a Radix 4 form of the algorithm
    which is faster again
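
A rough feel for why radix 4 is faster comes from counting stages and twiddle multiplications. The figures below are nominal textbook counts under simplifying assumptions: N is a power of the radix, every butterfly applies radix - 1 twiddle multiplies, and trivial multiplies by 1 are counted too. nominal_multiplies is a hypothetical helper, and the real gain on a given processor depends on its instruction set.

```python
def nominal_multiplies(n, radix):
    """Return (stages, nominal complex twiddle multiplies) for a radix-r FFT."""
    stages = 0
    size = 1
    while size < n:
        size *= radix
        stages += 1
    if size != n:
        raise ValueError("n must be a power of the radix")
    butterflies_per_stage = n // radix
    twiddles_per_butterfly = radix - 1
    return stages, stages * butterflies_per_stage * twiddles_per_butterfly

print(nominal_multiplies(1024, 2))   # radix 2
print(nominal_multiplies(1024, 4))   # radix 4
```

For N = 1024, radix 4 halves the number of stages (and so the passes over memory) and needs about three quarters of the nominal multiplies.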

31
(No Transcript)
32
(No Transcript)
33
TIGERSHARC
34
DSP Co-processor on SHARC
35
(No Transcript)
36
Will discuss FFT accelerator later
  • Was not available on previous SHARCs
  • Way to go with future processors
  • Cheap processors + co-processors
  • Cheap microcontroller + FPGA component