Title: Optimizing high speed arithmetic circuits using three-term extraction
1Optimizing high speed arithmetic circuits using three-term extraction
- Anup Hosangadi, Ryan Kastner: ECE Department, University of California, Santa Barbara
- Farzan Fallah: Fujitsu Laboratories of America
2Outline
- Carry Save Arithmetic
- Related Work
- Problem formulation
- Algebraic methods
- Delay aware optimization
- Experimental results
3Carry Save Arithmetic
- Multi-Operand addition
- F = A + B + C + D + E + F
- Carry propagation is the major bottleneck
- Fast adders (Carry Lookahead Adder (CLA), Carry Select Adders) are not fast enough
- Solution: defer carry propagation to the final step
- Generate sums and carries separately
- Treat them as separate numbers
- Keep adding until only two numbers remain
- Add the two remaining numbers using a fast adder (CLA)
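The reduction described on this slide can be sketched in a few lines of Python. This is only an illustration on plain non-negative integers, not the circuit itself: the function names `carry_save_add` and `multi_operand_add` are made up for this sketch, and the built-in `sum` stands in for the final CLA.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair
    such that a + b + c == sum + carry, with no carry propagation."""
    s = a ^ b ^ c                              # per-bit sum
    cy = ((a & b) | (b & c) | (a & c)) << 1    # per-bit majority, moved to its weight
    return s, cy


def multi_operand_add(operands):
    """Add the operands by repeated carry-save reduction, then one final add."""
    nums = list(operands)
    while len(nums) > 2:
        a, b, c = nums.pop(), nums.pop(), nums.pop()
        nums.extend(carry_save_add(a, b, c))   # 3 numbers in, 2 numbers out
    return sum(nums)                           # stands in for the fast adder (CLA)
```

Each pass through the loop removes one number from the list, so six operands need four carry-save steps before the single carry-propagating addition.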
4Carry Save Arithmetic
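[Figure: CSA tree reducing the operands through sum (S) and carry (C) outputs, followed by a CLA that produces F]
Tree height = log1.5(N/2) for N operands
Delay = 3 + log2(M + 3), where 3 = height of CSA tree and M = bitwidth of operands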
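As a rough check on the tree height quoted on this slide, the number of CSA levels can be counted directly. The sketch below assumes every level packs as many 3-input CSAs as it can and passes leftover numbers through unchanged; `csa_tree_height` is a name chosen for this illustration.

```python
def csa_tree_height(n_operands):
    """Count the CSA levels needed to reduce n_operands numbers to two."""
    levels = 0
    n = n_operands
    while n > 2:
        n -= n // 3      # every CSA on this level turns 3 numbers into 2
        levels += 1
    return levels
```

For the six-operand example this gives three levels, which agrees with the height of 3 used in the delay expression. The closed form log1.5(N/2) should be read as an estimate of this count, not an exact formula for every N.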
5Carry Save arithmetic
Using Ripple carry adders (RCAs)
[Figure: chain of RCAs with output widths (M + 1), (M + 2), (M + 3), (M + 4), (M + 5)]
Delay = (M + 5) + 4
Delay through CSA network = 3 + log2(M + 3)
6Related Work
- Kim et al., "Arithmetic optimization using Carry Save Adders", DAC98
7Related Work
- Kim et al., "Optimal allocation of CSAs", ICCAD99
- Delay aware CSA allocation
- Kim et al., "High performance, low power synthesis", DAC2000
- Synopsys™ Behavioral optimization for arithmetic (BOA)
- A. Verma and P. Ienne, "Improved use of the carry save representation for the synthesis of complex arithmetic circuits", ICCAD2004
8Problem formulation
- No methodology for detecting redundancy in CSA computations
- Can reduce the number of CSAs
- Can reduce the number of wires
- Common subexpression elimination
- Standard compiler technique
- Applied to 2-term arithmetic operations
- Polynomial expressions (ICCAD04, VLSI05)
- Constant multiplications (ASAP04, ASPDAC05)
- CSA expressions (Common 3-term subexpressions)
9Problem formulation
Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3
D1 = X1 + X2 + X2<<1
Y1 = (D1S + D1C) + X1<<2 + X2<<2
Y2 = (D1S + D1C)<<2
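The rewrite on this slide can be checked numerically. The sketch below assumes that Y2 reuses D1 shifted left by two, which is what the arithmetic requires (D1<<2 = X1<<2 + X2<<2 + X2<<3). The helper names `csa`, `direct` and `shared` are illustrative only.

```python
def csa(a, b, c):
    """Carry-save add: returns (S, C) with S + C == a + b + c."""
    s = a ^ b ^ c
    cy = ((a & b) | (b & c) | (a & c)) << 1
    return s, cy


def direct(x1, x2):
    """Y1 and Y2 computed term by term, without sharing."""
    y1 = x1 + (x1 << 2) + x2 + (x2 << 1) + (x2 << 2)
    y2 = (x1 << 2) + (x2 << 2) + (x2 << 3)
    return y1, y2


def shared(x1, x2):
    """Y1 and Y2 computed from the common 3-term subexpression D1."""
    d1s, d1c = csa(x1, x2, x2 << 1)          # D1 = X1 + X2 + X2<<1
    y1 = (d1s + d1c) + (x1 << 2) + (x2 << 2)
    y2 = (d1s + d1c) << 2                    # shifted reuse of D1
    return y1, y2
```

In hardware the shift of D1 is only wiring, so the three terms of Y2 cost no additional CSA once D1S and D1C exist.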
10Algebraic methods
- Polynomial transformation
- X<<i = XL^i
- Detects shifted common subexpressions and also extends to multiple variables
C × X = Σ (XL^i)
(14)10 × X = (1110)2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
(100-10)CSD × X = XL^4 - XL^1
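The polynomial transformation amounts to listing the exponents of L for the non-zero digits of the constant. A minimal sketch for positive integer constants follows; `binary_terms` and `csd_terms` are names invented here, and the CSD routine is the usual non-adjacent-form recoding, not necessarily the procedure the authors used.

```python
def binary_terms(c):
    """Exponents i with c == sum(2**i); each one is a term +XL^i."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def csd_terms(c):
    """Signed digits (sign, i) with c == sum(sign * 2**i) and
    no two adjacent non-zero digits (canonical signed digit form)."""
    terms, i = [], 0
    while c != 0:
        if c & 1:
            digit = 2 - (c & 3)   # +1 when c % 4 == 1, -1 when c % 4 == 3
            terms.append((digit, i))
            c -= digit
        c >>= 1
        i += 1
    return terms
```

For the constant 14 this gives exponents 1, 2, 3 in binary (XL^3 + XL^2 + XL^1) and the two signed terms XL^4 - XL^1 in CSD, matching the slide.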
11Algebraic methods
- 3-term divisors: all potential common subexpressions
- Divisor generation
- One for every combination of 3 terms
- e.g. F1 = X1 + X1L^2 + X2 + X2L + X2L^2
- d1 = X1L^2 + X2L + X2L^2
- MinL = L
- Divisor D1 = d1/L = X1L + X2 + X2L
- Number of divisors = C(N, 3) for an expression with N terms
- Theorem
- There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors
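Divisor generation can be sketched as follows, with each term written as a (variable, shift) pair so that X2L^2 becomes ("X2", 2). The representation and the name `divisors` are assumptions of this sketch. It also only intersects divisor sets across two different expressions, so it does not implement the non-overlap check needed when a divisor occurs twice inside one expression.

```python
from itertools import combinations


def divisors(terms):
    """All 3-term divisors of an expression given as (variable, shift) terms.
    Each 3-term combination is divided by its smallest power of L (MinL)."""
    result = set()
    for combo in combinations(sorted(terms), 3):
        min_shift = min(shift for _, shift in combo)
        result.add(tuple((var, shift - min_shift) for var, shift in combo))
    return result


# Expressions from the problem-formulation slide.
Y1 = [("X1", 0), ("X1", 2), ("X2", 0), ("X2", 1), ("X2", 2)]
Y2 = [("X1", 2), ("X2", 2), ("X2", 3)]

common = divisors(Y1) & divisors(Y2)
```

Here `common` holds the single divisor X1 + X2 + X2L, which is the D1 extracted in the problem-formulation example. The set `divisors(Y1)` also contains X1L + X2 + X2L, the divisor obtained from d1 on this slide.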
12Algebraic methods
- Greedy Iterative algorithm
- Extracts the best 3-term divisor
- Rewrites the expressions containing it
- Terminates when there are no more common
subexpressions
F1 = a + b + c + d + e
F2 = a + b + c + d + f
>> D1 = a + b + c
F1 = D1S + D1C + d + e
F2 = D1S + D1C + d + f
>> D2 = D1S + D1C + d
F1 = D2S + D2C + e
F2 = D2S + D2C + f
13Algebraic methods
Optimize({Pi})
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = φ;
    // Step 1. Creating divisors and their frequency statistics
    for each expression Pi in {Pi}
    {
        {Dnew} = Divisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
    }
    // Step 2. Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = divisor in {D} with most number of non-overlapping intersections;
        if (d == NULL) break;
        Rewrite affected expressions in {Pi} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {Dnew} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {Dnew};
    }
}
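A much simplified Python sketch of the greedy loop is given below. It is not the authors' implementation. It assumes terms are plain names without shifts, that no term repeats inside an expression, and it recomputes all divisors on every round instead of updating the divisor set incrementally. The tie-break (alphabetically smallest divisor) and the function name `extract_three_term` are choices made only for this illustration.

```python
from collections import Counter
from itertools import combinations


def extract_three_term(exprs):
    """exprs: dict mapping an expression name to a set of term names.
    Returns (rewritten expressions, list of (divisor name, terms))."""
    exprs = {name: set(terms) for name, terms in exprs.items()}
    extracted = []
    while True:
        # Step 1: divisors and their frequency over all expressions.
        counts = Counter()
        for terms in exprs.values():
            for combo in combinations(sorted(terms), 3):
                counts[combo] += 1
        shared = [(n, d) for d, n in counts.items() if n >= 2]
        if not shared:
            break                                # no common subexpression left
        # Step 2: pick the most frequent divisor, ties broken alphabetically.
        best_n = max(n for n, _ in shared)
        best = min(d for n, d in shared if n == best_n)
        k = len(extracted) + 1
        extracted.append((f"D{k}", best))
        # Rewrite every expression that contains the divisor.
        for terms in exprs.values():
            if terms.issuperset(best):
                terms.difference_update(best)
                terms.update({f"D{k}S", f"D{k}C"})
    return exprs, extracted


result, found = extract_three_term({
    "F1": {"a", "b", "c", "d", "e"},
    "F2": {"a", "b", "c", "d", "f"},
})
```

On the two five-term expressions F1 and F2 from the greedy-algorithm slide, the sketch first extracts D1 = a + b + c, then D2 = D1S + D1C + d, and stops with F1 = D2S + D2C + e and F2 = D2S + D2C + f.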
14Algebraic methods
- Algorithm complexity
- M expressions, each with N terms
- Divisor generation: M × C(N, 3) = O(MN^3)
- Iterative algorithm, worst case
- N terms reduced to 2 terms: (N - 2) steps
- M expressions: O(MN) steps
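The two counts on this slide are easy to evaluate for concrete sizes. The helper names below are hypothetical, and the sketch assumes every expression has exactly N terms.

```python
from math import comb


def divisor_count(m_exprs, n_terms):
    """Number of 3-term divisors generated: one per 3-term combination."""
    return m_exprs * comb(n_terms, 3)


def worst_case_steps(m_exprs, n_terms):
    """Each extraction replaces 3 terms by 2, so an N-term expression
    reaches 2 terms after at most N - 2 steps."""
    return m_exprs * max(n_terms - 2, 0)
```

For two expressions of five terms each, this gives 20 divisors and at most 6 extraction steps.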
15Delay aware optimization
- Sharing subexpressions can increase the total delay
- Traditional high level synthesis approach: reduce delay by Tree Height Reduction (THR)
- Our solution: control delay during optimization itself
- Optimal delay CSA allocation (T. Kim, J. Um, "Timing driven synthesis", ASPDAC2000)
- Use this to get minimum possible delay
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)
(numbers in parentheses are signal arrival times)
16Delay aware optimization
- Optimal allocation: Delay(F1) = Delay(F2) = 3 D(Add)
- Delay ignorant extraction
17Delay aware extraction
- Control delay during optimization
- Evaluate each candidate divisor for delay
- Only consider those divisors that do not increase
the delay
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)
>> D1(3) = a(2) + b(0) + c(0)
F1 = D1S(3) + D1C(3) + d(0) + e(0): Delay = 5 D(Add)
F2 = D1S(3) + D1C(3) + d(0) + f(0): Delay = 5 D(Add)
18Delay aware extraction
- Control delay during optimization
- Evaluate each candidate divisor for delay
- Only consider those divisors that do not increase
the delay
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)
>> D2(1) = b(0) + c(0) + d(0)
F1 = D2S(1) + D2C(1) + e(0) + a(2): Delay = 3 D(Add)
F2 = D2S(1) + D2C(1) + f(0) + a(2): Delay = 3 D(Add)
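The delay check for a candidate divisor can be sketched as below. It assumes one CSA costs one D(Add), measures delay up to the point where two numbers remain, and always combines the three earliest-arriving signals first. That greedy rule is used here as a stand-in for the optimal-delay allocation cited earlier, not as a transcription of it. Both function names are hypothetical.

```python
import heapq


def csa_tree_delay(arrivals, d_add=1):
    """Time at which only two numbers remain when the three
    earliest-arriving signals are always combined first."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 2:
        ready = max(heapq.heappop(heap) for _ in range(3))
        heapq.heappush(heap, ready + d_add)   # sum output
        heapq.heappush(heap, ready + d_add)   # carry output
    return max(heap)


def delay_after_extraction(arrivals, divisor, d_add=1):
    """arrivals: dict term -> arrival time; divisor: three term names.
    Returns the delay of the expression once the divisor is extracted."""
    d_time = max(arrivals[t] for t in divisor) + d_add
    rest = [t for name, t in arrivals.items() if name not in divisor]
    return csa_tree_delay(rest + [d_time, d_time], d_add)


F1 = {"a": 2, "b": 0, "c": 0, "d": 0, "e": 0}
baseline = csa_tree_delay(F1.values())                # 3
with_d1 = delay_after_extraction(F1, ("a", "b", "c"))  # 5
with_d2 = delay_after_extraction(F1, ("b", "c", "d"))  # 3
```

A delay aware extraction would accept a divisor only when its delay does not exceed the baseline, so D2 = b + c + d passes and D1 = a + b + c is rejected, in line with the two examples on these slides.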
19Experimental results
Average 38.4% reduction
20Experimental results
- Synthesis for Standard Cell Designs
- Synopsys™ Design Compiler
- 0.25 micron library
- Synthesized for minimum delay
Avg 32.7% area reduction
Avg 3.7% increase in delay
21Experimental results
- FPGA synthesis
- Virtex II FPGAs
- Synthesized designs and performed place & route
Avg 14.1% reduction in slices and avg 12.9% reduction in LUTs
Avg 5.7% increase in delay
22Experimental results
- Evaluate Delay aware extraction algorithm
- Consider different arrival times of the signals
- Assume delay dominated by gate delay (FA delay)
- Only consider best case delay
Best delay with 15.5% increase in CSAs
23Conclusions
- First methodology for common subexpression elimination for Carry Save Arithmetic
- Significant area/power reduction
- Delay aware optimization algorithm also developed
- Can be combined with CSA tree extraction methods for actual application improvement
24Thank you!!