Title: Distributed Arithmetic
1 Distributed Arithmetic
- Dr Sumam David S.
- Dept. of EC, NITK Surathkal
- Slides courtesy of Xilinx Professors Workshop Resources
2 Objective
- Distributed arithmetic
- What ?
- Where ?
- How ?
3 What is DA?
- Multiplication using look-up tables (LUTs)
- Used to implement multipliers in LUT-rich FPGAs
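The idea of multiplying with a LUT can be sketched in Python as a behavioural model (not a description of any particular Xilinx implementation). A 16-entry table holds the product of a fixed coefficient with every 4-bit value, so an 8-bit sample is multiplied with two table reads and one shift-add. The names `build_lut` and `lut_multiply`, and the restriction to unsigned 8-bit samples, are assumptions made for this illustration.

```python
def build_lut(coeff):
    """Precompute coeff * n for every 4-bit value n (contents of a 16-entry LUT)."""
    return [coeff * n for n in range(16)]


def lut_multiply(lut, sample):
    """Multiply an unsigned 8-bit sample by the LUT's coefficient using only reads and a shift-add."""
    if not 0 <= sample <= 0xFF:
        raise ValueError("sample must fit in 8 unsigned bits")
    low = sample & 0xF           # lower nibble addresses the LUT
    high = (sample >> 4) & 0xF   # upper nibble addresses the same LUT
    # The upper nibble carries a weight of 2**4, applied as a left shift.
    return lut[low] + (lut[high] << 4)


lut = build_lut(-7)
print(lut_multiply(lut, 200))  # same value as -7 * 200
```

The table is filled once, when the coefficient is known; no multiplier is needed while samples are being processed.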
4 Two's Complement Multiplication
5 SDA 1-Tap FIR Filter
[Figure: block diagram of a 1-tap SDA FIR filter. Labelled elements: N-bit-wide sample data X0, parallel-to-serial converter, partial product ROM (address input A0), scaling accumulator (+/-).]
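A minimal behavioural sketch of the 1-tap structure, assuming the sample is serialized LSB first and the ROM holds just two entries (0 and the coefficient). The scaling accumulator is modelled by shifting each ROM output left by its bit position, which is arithmetically equivalent to shifting the accumulator; the ROM output for the sign bit is subtracted because that bit has weight -2^(N-1) in two's complement. The function names are hypothetical.

```python
def serialize(sample, n_bits):
    """Parallel-to-serial converter: yield the two's complement bits of sample, LSB first."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if not lo <= sample <= hi:
        raise ValueError("sample does not fit in n_bits")
    word = sample & ((1 << n_bits) - 1)
    for n in range(n_bits):
        yield (word >> n) & 1


def sda_1tap(coeff, sample, n_bits):
    """Bit-serial product coeff * sample using a 2-entry ROM and a scaling accumulator."""
    rom = [0, coeff]  # partial product ROM addressed by one data bit
    acc = 0
    for n, bit in enumerate(serialize(sample, n_bits)):
        partial = rom[bit] << n      # scale by the weight of bit n
        if n == n_bits - 1:
            acc -= partial           # sign bit: subtract instead of add
        else:
            acc += partial
    return acc


print(sda_1tap(-7, 7, 4))
print(sda_1tap(6, -3, 4))
```

One ROM read and one add (or subtract) happen per clock cycle, so an N-bit sample takes N cycles.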
6 Distributed Arithmetic for a 2-Tap Filter
- Partial products of equal weight are added together before being summed to the next higher partial product weight
- Create a look-up table of summed partial products
Bit weights: -2^3  2^2  2^1  2^0
C0 = 1 0 0 1  (-7)
C1 = 0 1 1 0  ( 6)
X0 = 0 1 1 1  ( 7)
X1 = 0 1 0 1  ( 5)

Conventional multiplication: C0 x X0 = (-49), C1 x X1 = (30), sum = (-19)

Serial-data / tap-parallel multiply (partial sums are sign-extended):
Weight  X0 bit  X1 bit  Summed partial product  Weighted value
2^0     1       1       C0 + C1 = (-1)          (-1)
2^1     1       0       C0 = (-7)               (-14)
2^2     1       1       C0 + C1 = (-1)          (-4)
-2^3    0       0       0                       (0)
Total: 1 1 0 1 1 0 1 = (-19)
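The 2-tap worked example can be checked with a short Python sketch. The four-entry table holds the summed partial products (0, C0, C1, C0 + C1) and is addressed by one bit from each sample; the helper names `to_bits` and `da_2tap` are assumptions made for this illustration.

```python
def to_bits(value, n_bits):
    """Two's complement bits of value, LSB first."""
    word = value & ((1 << n_bits) - 1)
    return [(word >> n) & 1 for n in range(n_bits)]


def da_2tap(c0, c1, x0, x1, n_bits=4):
    """Return the weighted partial sums and their total for c0*x0 + c1*x1."""
    # Look-up table of summed partial products, addressed by (x1 bit, x0 bit).
    lut = [0, c0, c1, c0 + c1]
    b0, b1 = to_bits(x0, n_bits), to_bits(x1, n_bits)
    weighted = []
    for n in range(n_bits):
        address = (b1[n] << 1) | b0[n]
        term = lut[address] << n
        if n == n_bits - 1:
            term = -term  # the sign bit has weight -2**(n_bits - 1)
        weighted.append(term)
    return weighted, sum(weighted)


weighted, total = da_2tap(-7, 6, 7, 5)
print(weighted, total)  # [-1, -14, -4, 0] -19
```

The result agrees with the conventional computation (-7)(7) + (6)(5) = -19.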
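7 SDA 2-Tap FIR Filter
[Figure: block diagram of a 2-tap SDA FIR filter. Labelled elements: N-bit-wide sample data X0 and X1, partial product ROM (address inputs A0 and A1), scaling accumulator (+/-).]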
8 SDA 4-Tap FIR Filter
[Figure: block diagram of a 4-tap SDA FIR filter. Labelled elements: N-bit-wide sample data X0 to X3, partial product ROM (address inputs A0 to A3), ROM entries 0000...0 and coefficients C0 to C3.]
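As a hedged sketch of how the 4-tap case generalizes, the Python model below builds the 16-entry table of coefficient sums and runs a bit-serial FIR over a short signal. The coefficients, the test signal, and all function names are invented for illustration; samples are assumed to fit in `n_bits` two's complement bits.

```python
def build_da_lut(coeffs):
    """Entry `address` holds the sum of the coefficients whose address bit is 1."""
    return [
        sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
        for address in range(1 << len(coeffs))
    ]


def da_dot(lut, samples, n_bits):
    """Bit-serial inner product of the coefficients with `samples`."""
    mask = (1 << n_bits) - 1
    acc = 0
    for n in range(n_bits):
        # Bit n of every tap's sample forms the ROM address (A0 from tap 0, and so on).
        address = 0
        for k, x in enumerate(samples):
            address |= (((x & mask) >> n) & 1) << k
        term = lut[address] << n
        acc += -term if n == n_bits - 1 else term  # subtract on the sign bit
    return acc


def da_fir(coeffs, signal, n_bits):
    """Filter `signal`, starting from an all-zero delay line."""
    lut = build_da_lut(coeffs)
    taps = [0] * len(coeffs)  # taps[0] is the newest sample
    out = []
    for x in signal:
        taps = [x] + taps[:-1]
        out.append(da_dot(lut, taps, n_bits))
    return out


coeffs = [3, -1, 4, 2]          # hypothetical C0..C3
signal = [5, -8, 7, 0, -3, 1]   # hypothetical 4-bit samples
print(build_da_lut(coeffs))
print(da_fir(coeffs, signal, n_bits=4))
```

Address 0 selects no coefficients, so its entry is zero, and address 15 selects all four.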
9 SDA 8-Tap FIR Filter
[Figure: block diagram of an 8-tap SDA FIR filter. Labelled elements: N-bit-wide sample data X0 to X7, two partial product ROMs (each with address inputs A0 to A3), pre-adder, scaling accumulator (+/-).]
Each 4-input LUT contains all possible sums of the partial products.
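The 8-tap arrangement can be sketched as two 16-entry tables whose outputs are summed by a pre-adder before the scaling accumulator, which needs 32 table entries instead of the 256 a single 8-input table would require. This is a behavioural Python model under that reading of the diagram; the function names and the test values are assumptions.

```python
def build_lut4(coeffs):
    """16-entry table of all possible coefficient sums for a group of four taps."""
    if len(coeffs) != 4:
        raise ValueError("expected exactly four coefficients")
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1) for a in range(16)]


def address4(samples, n, n_bits):
    """Form a 4-bit ROM address from bit n of four two's complement samples."""
    mask = (1 << n_bits) - 1
    address = 0
    for k, x in enumerate(samples):
        address |= (((x & mask) >> n) & 1) << k
    return address


def da_8tap(coeffs, samples, n_bits):
    """Bit-serial 8-tap inner product using two 4-input LUTs and a pre-adder."""
    if len(coeffs) != 8 or len(samples) != 8:
        raise ValueError("expected eight coefficients and eight samples")
    lut_lo = build_lut4(coeffs[:4])   # taps 0..3
    lut_hi = build_lut4(coeffs[4:])   # taps 4..7
    acc = 0
    for n in range(n_bits):
        # Pre-adder: combine both ROM outputs before the scaling accumulator.
        pre_sum = (lut_lo[address4(samples[:4], n, n_bits)]
                   + lut_hi[address4(samples[4:], n, n_bits)])
        term = pre_sum << n
        acc += -term if n == n_bits - 1 else term  # subtract on the sign bit
    return acc


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]   # hypothetical coefficients
samples = [7, -8, 3, 0, -1, 5, -6, 2]   # hypothetical 4-bit samples
print(da_8tap(coeffs, samples, n_bits=4))
```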
10 Xilinx DA FIR Performance
[Figure: performance (MMACs/s, axis from 0 to 6000) versus filter length (taps, axis from 0 to 250). Curves: Dual MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR.]
fclk = 200 MHz for both processor and FPGA; B = data sample precision for FPGA
11 Trade Clock Cycles for Logic Area
Multiple bits per clock cycle, from Serial-DA (20 Ms/s) to Parallel-DA (160 Ms/s), for an 8-bit sample (b7 to b0):
- Hardware over-sampling 8: the sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.
- Hardware over-sampling 4: the sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.
- Hardware over-sampling 2: the sample is serialized and processed 4 bits per clock cycle. 2 clock cycles are thus required to process the whole sample.
- Hardware over-sampling 1: the sample is processed in parallel, 8 bits per clock cycle.
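The trade-off can be illustrated with a Python model that processes a configurable number of bits per clock cycle and counts the cycles used. In hardware each extra bit per clock would need its own copy of the LUT and an extra adder, which is where the logic area goes; here the copies are modelled by reading the same table several times within one cycle. The 160 MHz clock in the usage lines is an assumption chosen to be consistent with the 20 Ms/s and 160 Ms/s figures, and all names are hypothetical.

```python
def da_multibit(coeffs, samples, n_bits, bits_per_clock):
    """Return (result, clock_cycles) for a DA inner product with several bits per clock."""
    if n_bits % bits_per_clock:
        raise ValueError("bits_per_clock must divide n_bits")
    lut = [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
           for a in range(1 << len(coeffs))]
    mask = (1 << n_bits) - 1
    acc = 0
    cycles = 0
    for base in range(0, n_bits, bits_per_clock):
        # One clock cycle: bits_per_clock LUT reads are assumed to happen in parallel.
        for n in range(base, base + bits_per_clock):
            address = 0
            for k, x in enumerate(samples):
                address |= (((x & mask) >> n) & 1) << k
            term = lut[address] << n
            acc += -term if n == n_bits - 1 else term  # subtract on the sign bit
        cycles += 1
    return acc, cycles


def sample_rate(clock_hz, n_bits, bits_per_clock):
    """Samples per second when each sample needs n_bits / bits_per_clock cycles."""
    return clock_hz * bits_per_clock // n_bits


coeffs = [3, -1, 4, 2]          # hypothetical coefficients
samples = [100, -128, -1, 37]   # hypothetical 8-bit samples
for m in (1, 2, 4, 8):
    print(m, da_multibit(coeffs, samples, 8, m), sample_rate(160_000_000, 8, m))
```

The number of cycles per sample printed for 1, 2, 4 and 8 bits per clock corresponds to hardware over-sampling factors of 8, 4, 2 and 1.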
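12 Conclusion
- Efficient computation: multipliers are replaced by look-up tables and a shift-accumulate
- Slow, as it is bit-serial
- Memory requirements: the look-up table grows as 2^K for K taps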
13 References
- The Role of Distributed Arithmetic in FPGA-based Signal Processing, www.xilinx.com