Title: Lecture 7: Behavioral Synthesis for Low Power
1. Lecture 7: Behavioral Synthesis for Low Power
- Behavioral Level Transforms
- Potential for large power reduction
- Summary
- Michael L. Bushnell
- CAIP Center and WINLAB
- ECE Dept., Rutgers U., Piscataway, NJ
2. Motivation
- Conventional automated layout synthesis method
- Describe design at RTL or higher level
- Generate technology-independent realization
- Map logic-level circuit to technology library
- Optimization goal shifting from low-area to low-power and higher performance
- Need accurate signal probability/activity estimates
- Consider low-power needs at all design levels
3. Behavioral-Level Transformations
4. Algorithm-Level Power Reductions vs. Other Levels
5. Differential Coefficients for Finite Impulse Response (FIR) Filters
- Discrete-time linear time-invariant FIR system
- Ci are the filter coefficients
- N is the number of taps (the filter length)
- Differential Coefficients Method (DCM) reduces computations to save power
- Uses differences between coefficients rather than direct-form computation
- Uses various orders of differences
- Requires more storage devices and storage accesses
6. First-Order Differences
- 3 consecutive outputs Y
- Rewrite product terms
- Except for C0, each coefficient can be expressed as the sum of the preceding coefficient and the difference between it and the preceding coefficient
7. First-Order Differences (cont'd)
- Store product terms and reuse them for the next output time period
- Need only 1 extra addition per product term and 1 storage element
- Store C0 and the differences d
- Trade off a long multiplier for a short one plus storage overhead
- d1k-1/k is the first-order difference between Ck and Ck-1, that is, Ck = Ck-1 + d1k-1/k
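The reuse of stored product terms can be sketched as follows. This is a minimal illustration, not the lecture's own implementation: it assumes the usual direct-form definition Y(j) = sum over k of Ck * X(j-k), treats samples before the start of the input as zero, and uses hypothetical function names. Each product term Ck * X(j-k) is rebuilt from the stored term Ck-1 * X(j-k) of the previous output plus one short multiplication by the difference.

```python
def fir_direct(coeffs, samples):
    """Reference direct-form FIR: Y(j) = sum_k C[k] * X[j - k]."""
    out = []
    for j in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if j - k >= 0:
                acc += c * samples[j - k]
        out.append(acc)
    return out


def fir_first_order_dcm(coeffs, samples):
    """FIR output using first-order coefficient differences.

    Stores C[0] and the differences d[k] = C[k] - C[k-1]; every other
    product term is the stored term of the previous output plus d[k] * X.
    """
    n = len(coeffs)
    deltas = [coeffs[k] - coeffs[k - 1] for k in range(1, n)]
    prev_terms = [0] * n          # product terms saved from the previous output
    out = []
    for j, x_now in enumerate(samples):
        terms = [0] * n
        terms[0] = coeffs[0] * x_now                   # only full-width multiply
        for k in range(1, n):
            x = samples[j - k] if j - k >= 0 else 0
            # C[k]*X[j-k] = C[k-1]*X[j-k] + d[k]*X[j-k]; the first part was
            # product term k-1 of the previous output.
            terms[k] = prev_terms[k - 1] + deltas[k - 1] * x
        out.append(sum(terms))
        prev_terms = terms
    return out
```

For a smoothly varying coefficient set the differences are much smaller than the coefficients, which is where the shorter multiplier comes from; the price is the extra storage for `prev_terms`.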
8. Orders of Differences
9. Second-Order Differences
- Coefficient expressions
- Needs just 2 extra storage variables and 2 extra additions per product to compute the FIR output with 2nd-order differences, compared with direct-form computation
10. Generalized mth-Order and Negative Differences
- mth-order differences require storing m intermediate results for each of the N product terms, so they need mN storage variables and m additions per product term compared with direct form
- Differences can be positive or negative
- Possible to get absolute value of partial product with negative differences
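To see why higher orders of differences can shorten the multiplier, the following sketch computes successive difference sequences and the number of magnitude bits each one needs. The coefficient values and helper names are made up for illustration; they are not taken from the lecture.

```python
def differences(seq):
    """First-order differences of a sequence: seq[k] - seq[k-1]."""
    return [b - a for a, b in zip(seq, seq[1:])]


def difference_orders(coeffs, max_order):
    """Return [coeffs, 1st-order diffs, 2nd-order diffs, ...] up to max_order."""
    orders = [list(coeffs)]
    for _ in range(max_order):
        orders.append(differences(orders[-1]))
    return orders


def magnitude_bits(values):
    """Bits needed for the largest absolute value (sign handled separately)."""
    return max(abs(v) for v in values).bit_length()


# Hypothetical smooth, low-pass-like coefficient envelope.
coeffs = [10, 14, 17, 19, 20, 20, 19, 17, 14, 10]
orders = difference_orders(coeffs, 2)
widths = [magnitude_bits(o) for o in orders]
```

For this smooth sequence the magnitudes shrink with each order, and some differences are negative, which matches the remark that both signs occur.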
11. Sorted Recursive Differences (SRD)
- DCM only applicable to systems where the envelope generated by the coefficient sequence (or its differences) is a smoothly varying continuous function
- Mainly for low-pass FIR filters
- Recursively sort coefficients and use various orders of differences to reduce computation
- Use transposed direct form of FIR output computation
- No restriction on applicable coefficient sequence
- Word length reduction not the same for each coefficient
12. Transposed Direct-Form (TDF) Computation
- Compute all N product terms for a particular data sample before computing terms for the next sequential data sample
- Same throughput as for direct-form computation
- Signal-flow graph
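A minimal sketch of transposed direct-form evaluation, assuming the standard structure in which each incoming sample is multiplied by all N coefficients at once and the products are accumulated in a chain of delayed partial sums. Function and variable names are illustrative only.

```python
def fir_direct(coeffs, samples):
    """Reference direct-form FIR: Y(j) = sum_k C[k] * X[j - k]."""
    return [
        sum(c * samples[j - k] for k, c in enumerate(coeffs) if j - k >= 0)
        for j in range(len(samples))
    ]


def fir_transposed(coeffs, samples):
    """Transposed direct form: all N products of one sample are formed together."""
    n = len(coeffs)
    partial = [0] * n     # partial[i]: sum waiting for i + 1 more sample periods
    out = []
    for x in samples:
        products = [c * x for c in coeffs]       # all N product terms for x
        out.append(products[0] + partial[0])
        # Shift the partial-sum chain one step toward the output.
        for i in range(n - 1):
            partial[i] = products[i + 1] + partial[i + 1]
    return out
```

Because every product for a given sample is formed in the same step, the coefficients can be visited in any order, which is what lets SRD sort them freely.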
13. Signal Flow Graph for TDF Realization
14. Maximum Savings in Adds Using SRD
15. Frequency Response of SRD Low-Pass Filter and Hamming Window
16. Savings in Adds for Low-Pass Filter
- Black: N = 201, Grey: N = 101, White: N = 51
17. Savings in Shifts Using SRD
18. Least-Squares Coefficient Optimization for Filters
- Find the coefficient closest to the desired coefficient, but with the fewest 1 bits; the number of 1s is called the code class
- Goal is to reduce the number of additions
- Use sign-magnitude coefficient representation
- k = maximum code class allowed per coefficient
- Use branch-and-bound method to solve the integer programming problem of selecting coefficient approximations
- Shown to reduce addition computations in low-power filters by more than 40%
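The notion of a code class can be illustrated with a much simpler stand-in for the optimization above: rounding one integer coefficient to the nearest value whose magnitude has at most k one-bits. This brute-force search treats each coefficient in isolation; it is not the least-squares, branch-and-bound formulation the slide refers to, and the function name and parameters are hypothetical.

```python
def closest_with_code_class(target, max_ones, word_bits):
    """Nearest sign-magnitude value whose magnitude has at most max_ones 1 bits.

    Ties are broken in favor of fewer 1 bits, then the smaller magnitude.
    """
    sign = -1 if target < 0 else 1
    magnitude = abs(target)
    candidates = (
        v for v in range(1 << word_bits) if bin(v).count("1") <= max_ones
    )
    best = min(
        candidates,
        key=lambda v: (abs(v - magnitude), bin(v).count("1"), v),
    )
    return sign * best
```

For example, 23 (binary 10111) needs four 1 bits, while the nearby value 24 (binary 11000) needs only two, so a shift-and-add multiplier for 24 uses fewer additions.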
19. Activity-Driven Architectural Transformations
- Basic idea: power consumption in digital filters depends on the order of the addition operators
- Restructure addition tree to move adders with higher-coefficient multiplications towards the output
- Higher-activity circuitry is moved closest to root of addition tree
- Definition of average signal activity over N consecutive time frames
20. Data Flow Graph of IIR Filter
21. Perfectly Balanced Addition Tree
22. Filter Implementations
- Can be bit-serial or word-parallel arithmetic
- W bits fed in parallel to adders and multipliers
- At time t + 1, z of the W bits change from their values at time t
- Activity b(t) = z / W
- b(t) is a random variable; the stochastic process is strict-sense stationary
- Average power dissipation proportional to
- In bit-serial implementations, intra-word bit differences, and not inter-word bit differences, cause node activity
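The word-parallel activity measure z / W from the bullets above can be computed directly from two consecutive words. The sketch below also averages it over a sequence of words; the helper names are illustrative, and the plain arithmetic mean is an assumption about how the average over consecutive time frames is taken.

```python
def word_activity(prev_word, next_word, width):
    """Fraction of the `width` bits that differ between two consecutive words."""
    mask = (1 << width) - 1
    changed = (prev_word ^ next_word) & mask    # 1 wherever a bit toggled
    return bin(changed).count("1") / width


def average_activity(words, width):
    """Mean activity over all consecutive word pairs in a sequence."""
    pairs = list(zip(words, words[1:]))
    return sum(word_activity(a, b, width) for a, b in pairs) / len(pairs)
```

For instance, going from 1010 to 0110 on a 4-bit bus toggles two bits, giving an activity of 0.5.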
23. Architectural Transforms on DSP Filters
- I inputs to word-parallel computation tree
- I 2 l 1, l levels in tree
- y = Σ aj bj, summed over j = 1 to I
- Obtain minimum average value of qi over all balanced adder nodes when a1, a2, ..., aI are sorted in either ascending or descending order
- Minimum average value in a linear array of adders when a1, a2, ..., aI are sorted
24. Linear Array of Adders
- Assume mutually independent inputs, but method works even when signals correlate due to reconvergent fanouts
25. Power Optimization Algorithm
- Simulate circuit at functional level
- Using random, mutually-independent input values
- Note signal activities at all adder inputs
- Restructure adder trees using above 2 hypotheses
- Move additions with high activity closer to root of computation tree
- Recompute average activities
- Iterate until no additional power is saved
- Method shown to save up to 23% of power
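The simulate, restructure, recompute, iterate loop can be sketched for the simplest case, a linear chain of adders. This is only a hedged illustration: it replaces the two ordering hypotheses with a greedy search over adjacent swaps, uses total bit toggles at the adder outputs as a stand-in for power, and all names are hypothetical.

```python
def toggles(a, b, width):
    """Number of bits that differ between two width-bit words."""
    return bin((a ^ b) & ((1 << width) - 1)).count("1")


def chain_activity(streams, order, width):
    """Total bit toggles at the adder outputs of a linear adder chain.

    streams[i][t] is input i at time t; `order` is the sequence in which
    the inputs enter the chain (the last one is closest to the output).
    """
    mask = (1 << width) - 1
    total, prev_sums = 0, None
    for t in range(len(streams[0])):
        acc, sums = streams[order[0]][t], []
        for idx in order[1:]:
            acc = (acc + streams[idx][t]) & mask
            sums.append(acc)                    # output of one adder
        if prev_sums is not None:
            total += sum(toggles(p, s, width) for p, s in zip(prev_sums, sums))
        prev_sums = sums
    return total


def optimize_order(streams, width):
    """Greedy local search: keep adjacent swaps that lower simulated activity."""
    order = list(range(len(streams)))
    best = chain_activity(streams, order, width)
    improved = True
    while improved:                             # iterate until nothing is saved
        improved = False
        for i in range(len(order) - 1):
            trial = order[:]
            trial[i], trial[i + 1] = trial[i + 1], trial[i]
            cost = chain_activity(streams, trial, width)
            if cost < best:
                order, best, improved = trial, cost, True
    return order, best
```

The search stops at a local minimum, so it never increases the simulated activity but is not guaranteed to find the best ordering.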
26. Architecture-Driven Voltage Scaling
- Scale down VDD to save power, but this increases circuit delay
- Reduce delay by scaling down device sizes (less C)
- But interconnect C becomes dominant, not device C
- Need architectural transformation to introduce more parallelism to compensate for increased delay
- Introduce parallel or pipelined architecture
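The trade-off can be made concrete with a back-of-the-envelope model. The sketch below assumes the common first-order relations: dynamic power proportional to C * VDD^2 * f, and gate delay proportional to VDD / (VDD - Vt)^2. The 5 V reference supply, 0.7 V threshold, and 2.15x capacitance overhead for a two-way parallel datapath are illustrative assumptions, not values from these slides.

```python
def gate_delay(vdd, vt):
    """First-order CMOS delay model in arbitrary units: VDD / (VDD - Vt)^2."""
    return vdd / (vdd - vt) ** 2


def vdd_for_slowdown(vdd_ref, slowdown, vt):
    """Bisect for the supply whose delay is `slowdown` times the reference delay."""
    target = slowdown * gate_delay(vdd_ref, vt)
    lo, hi = vt + 1e-9, vdd_ref      # delay decreases as VDD rises above Vt
    for _ in range(200):
        mid = (lo + hi) / 2
        if gate_delay(mid, vt) > target:
            lo = mid                 # too slow: need a higher supply
        else:
            hi = mid
    return hi


def relative_power(cap_factor, vdd, vdd_ref, freq_factor):
    """Power relative to the reference design, using P ~ C * VDD^2 * f."""
    return cap_factor * (vdd / vdd_ref) ** 2 * freq_factor


# Two parallel units, each allowed to run at half speed (assumed numbers).
VDD_REF, VT = 5.0, 0.7
vdd_parallel = vdd_for_slowdown(VDD_REF, slowdown=2.0, vt=VT)
power_parallel = relative_power(2.15, vdd_parallel, VDD_REF, freq_factor=0.5)
```

With these assumed numbers the supply can drop to roughly 3.1 V and the parallel design dissipates about 0.41 of the original power at the same throughput, at the cost of more than twice the area.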
27. Example: Original Data Path Operator
28. Redesigned Parallel Implementation: > 2X Area Increase
29. Redesigned Pipelined Implementation
30. Operation Reduction Methods
- Reduce operators in data flow graph
- Computes X^3 + AX^2 + BX + C
- Reduces C, but may slow down critical paths
- Reduction maintaining throughput
31. Example
- Reduction with less throughput
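One familiar way to reduce the operator count for this polynomial is Horner's rule. Whether the lecture's figures use exactly this factoring is an assumption; the sketch only shows that the nested form computes the same value with fewer multiplications.

```python
def poly_direct(x, a, b, c):
    """X^3 + A*X^2 + B*X + C term by term: 4 multiplies, 3 adds."""
    x2 = x * x                        # multiply 1
    x3 = x2 * x                       # multiply 2
    return x3 + a * x2 + b * x + c    # multiplies 3 and 4, three adds


def poly_horner(x, a, b, c):
    """Same polynomial as ((X + A)*X + B)*X + C: 2 multiplies, 3 adds."""
    return ((x + a) * x + b) * x + c
```

The nested form is a single serial chain of operations, which is consistent with the caution that operator reduction may lengthen the critical path.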
32. Operation Substitution Methods
- Multiplication uses more energy than addition
- Replace multipliers with adders in high-level synthesis
33. Method and Results
- Transformations
- Common sub-expression utilization
- Apply distributive law
- Replace multiplication with repeated shifting and adding
- On an 11-tap FIR filter
- Saved 62% of dissipated power
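Replacing a multiplication by a known constant with shifts and adds can be sketched as below. The helper names are illustrative, and the decomposition shown is the plain binary expansion of a non-negative constant; it does not attempt the common sub-expression sharing listed above.

```python
def shift_add_terms(constant):
    """Bit positions of the 1s in a non-negative constant."""
    return [i for i in range(constant.bit_length()) if (constant >> i) & 1]


def multiply_by_constant(x, constant):
    """Compute x * constant using only shifts and additions."""
    acc = 0
    for shift in shift_add_terms(constant):
        acc += x << shift             # one shift and one add per 1 bit
    return acc
```

Multiplying by 10 (binary 1010), for example, becomes (x << 1) + (x << 3), so the cost depends on the number of 1 bits in the constant.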
34. Precomputation-Based Optimization
- Basic idea: precompute (with low-overhead hardware) circuit output logic values 1 cycle before they are needed
- Use precomputed information in the next clock cycle to disable unneeded hardware, which reduces switching activity
- Must be careful: precomputation hardware can add to area and lengthen clock period
35. Precomputation Architecture
36. Explanation
- When both functions f1 and f2 are 0, no prediction of the output value is possible
- When prediction happens (we know definitely that output is 1 or 0), we turn off R2
- Reduces activity in combinational logic block
- R1 still is active, so Comb still computes correct output
- More effective if P(f1 + f2) is large
- For comparator, saves 50% of the power
37. Specific Comparator Example
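A behavioral sketch of precomputation for an unsigned n-bit comparator C > D. It assumes the usual choice of predictor functions based on the most significant bits: f1 = 1 when C's MSB is 1 and D's is 0 (output known to be 1), and f2 = 1 for the opposite case (output known to be 0). Register names R1 and R2 follow the slides; everything else is illustrative.

```python
def msb(value, n_bits):
    """Most significant bit of an n-bit unsigned value."""
    return (value >> (n_bits - 1)) & 1


def run_comparator(pairs, n_bits):
    """Simulate C > D with precomputation; return (outputs, R2 load count)."""
    low_mask = (1 << (n_bits - 1)) - 1
    r2_c = r2_d = 0        # R2 holds the low-order bits; its load is gated
    r2_loads = 0
    outputs = []
    for c, d in pairs:
        c_msb, d_msb = msb(c, n_bits), msb(d, n_bits)   # R1 is always loaded
        f1 = c_msb & (1 - d_msb)       # predicts output 1
        f2 = (1 - c_msb) & d_msb       # predicts output 0
        if not (f1 or f2):             # no prediction possible: load R2
            r2_c, r2_d = c & low_mask, d & low_mask
            r2_loads += 1
        # The combinational block sees R1 plus the possibly stale R2 contents.
        if c_msb != d_msb:
            outputs.append(c_msb > d_msb)
        else:
            outputs.append(r2_c > r2_d)
    return outputs, r2_loads
```

When the MSBs differ the low-order bits cannot affect the result, so leaving R2 unchanged is safe; for uniformly distributed inputs that happens in half of the cycles.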
38. Can Precompute Outputs Needed 2 or More Clocks Later
- Can reduce switching activity by 12.5%
39. Example: Adder-Comparator
40. Precomputation with Shannon's Expansion Theorem
41. Summary
- Behavioral or architectural level synthesis
- Resynthesize state variable equations to save power
- Scale down supply voltage, and introduce parallelism and pipelining to make up for slow-down of hardware
42. Future Research Directions
- Formal methods for behavioral or data flow level to explore power reduction design space
- Behavioral level power estimation algorithms needed
- Synthesis scheduling and data path allocation algorithms should incorporate power tradeoffs