Title: Optimized Custom Function Evaluation for Embedded Processors
1. Optimized Custom Function Evaluation for Embedded Processors
Dong-U Lee, Image Communications Laboratory, EE Dept., UCLA
CENS Seminar, 07-28-2006
2. Motivation
- Embedded applications can impose severe constraints on platform cost, power, memory resources, and architectural flexibility
- Typical platforms: 8/16/32-bit fixed-point processors
- Function evaluation (e.g. log(x), sin(x)) is common in image/audio processing, sensor networking, etc.
- These evaluations are often the most mathematically complex aspects of an application
- Efficient function evaluation is critical to overall system speed and memory utilization
3. Traditional Methods of Function Evaluation
- Direct table lookup
  - Conceptually simple, easy to implement
  - High memory cost
  - Impractical for high precisions
- Standard math libraries (e.g. math.h in C) via FPU (Floating Point Unit) emulation
  - Easy to use
  - Very slow (example: 5600 cycles on the ATmega128 for the logarithm function)
  - Carries surplus precision (double-precision float) that is inherent in standard libraries
4. Related Work on Function Evaluation
- Function evaluation has been widely studied for processors with dedicated FPUs
- Little work has been done on fixed-point processors, despite the importance of these processors in the overall embedded and sensor markets
- Cheung et al.: 32-bit IBM Embedded PowerPC 405
  - Lack of error analysis → precision not guaranteed
- Texas Instruments: 16-bit TMS320C24x DSP
  - Lack of error analysis
- Iordache and Tang (Intel): 32-bit Intel XScale
  - Optimized for single and double precision, but many applications do not require this level of precision
  - Assembly language routines targeting only the XScale, not general across other processors
5. Main Contributions
- Precision guaranteed to 1 unit in the last place (ulp)
  - e.g. the ulp of a result with 8 fractional bits is 2^-8
- Different approximation methods for trading off precision / latency / code size
- Minimization of operator sizes
- Automated C code generation
6. Resource-Aware Function Evaluation
- Approximate functions via polynomials
- Minimal resources for a given target precision
- Exploit the native processor word-length via fixed-point arithmetic
- To minimize latency, assign the minimal number of bits to each signal in the data path
- Use multi-word arithmetic to emulate operations larger than the natural processor word-length over multiple processor cycles
7. Multi-Word Arithmetic
- Use multiple words to execute operations larger than the natural word-length of the processor
- 2n-bit by 2n-bit multiplication on an n-bit processor:
  - 4 multiplications
  - 6 additions
- Desirable to minimize the number of words involved in each operation
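As a concrete sketch of the scheme above, a 16-bit by 16-bit multiplication on an 8-bit machine decomposes into four 8-bit partial products that are shifted and accumulated (the function name and types below are illustrative, not from the slides):

```c
#include <stdint.h>

/* 16x16 unsigned multiply built from four 8x8 partial products,
 * as an 8-bit processor would execute it. Returns the full 32-bit
 * product. */
uint32_t mul16_multiword(uint16_t a, uint16_t b)
{
    uint8_t al = a & 0xFF, ah = a >> 8;   /* split operands into 8-bit words */
    uint8_t bl = b & 0xFF, bh = b >> 8;

    uint16_t ll = (uint16_t)al * bl;      /* 4 partial products */
    uint16_t lh = (uint16_t)al * bh;
    uint16_t hl = (uint16_t)ah * bl;
    uint16_t hh = (uint16_t)ah * bh;

    /* accumulate with the proper shifts (the additions of the slide) */
    return (uint32_t)ll
         + ((uint32_t)lh << 8)
         + ((uint32_t)hl << 8)
         + ((uint32_t)hh << 16);
}
```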
8. Function Evaluation
- Typically performed in three steps:
  - (1) reduce the input interval [a,b] to a smaller interval [a',b']
  - (2) approximate the function on the range-reduced interval
  - (3) expand the result back to the original range
- Evaluation of log(x): with x = M_x * 2^(E_x),
  log(x) = log(M_x) + E_x * log(2)
  where M_x is the mantissa of x over [1,2) and E_x is the exponent of x
9. Polynomial Approximations
- Single polynomial
  - Approximate the whole interval with a single polynomial
  - Increase the polynomial degree until the error requirement is met
- Splines (piecewise polynomials)
  - Partition the interval into multiple segments; use a different polynomial for each segment
  - Given a polynomial degree, increase the number of segments until the error requirement is met
10. Computation Flow
- The input interval is split into 2^(B_x0) equally sized segments
- The leading B_x0 bits serve as the coefficient table index
- Coefficients are computed in the minimax sense via the Remez algorithm
- Determine minimal bit-widths → minimize execution time
- x1, used for the polynomial arithmetic, is normalized over [0,1)
11. Approximation Methods
- Degree-3 Taylor, Chebyshev, and minimax approximations to log(x)
- We choose minimax approximations due to their superior maximum-error behaviour; they must be computed iteratively via the Remez algorithm
12. Design Flow Overview
- Fully automated within MATLAB
- Approximation methods
  - Single polynomial
  - Degree-d splines
- Range analysis
  - Analytical method based on computing the roots of the derivative of a signal
- Precision analysis
  - Simulated annealing on analytical error expressions
13. Error Sources in Digital Function Evaluation
- Three main error sources:
  - Inherent error E_A due to approximating the function
  - Quantization error E_Q due to finite-precision effects
  - The final output rounding step, which can cause a maximum of 0.5 ulp
- To achieve 1 ulp accuracy at the output: E_A + E_Q ≤ 0.5 ulp
- A large E_A allows polynomial degree reduction (single polynomial) or a reduction in the required number of segments (splines)
- However, it forces a small E_Q, leading to large bit-widths
- Good balance: allocate a maximum of 0.3 ulp for E_A and the rest for E_Q
14. Range Analysis
- Inspect the dynamic range of each signal and compute the required number of integer bits
- Two's complement assumed for each signal x
- For a range x ∈ [x_min, x_max], IB_x is given by
  IB_x = ⌊log2(max(|x_min|, |x_max|))⌋ + 2
15. Range Determination
- Examine local minima, local maxima, and the minimum and maximum input values at each signal
- Works for designs with differentiable signals, which is the case for polynomials
[Figure: range determination example showing the ranges of signals y2 and y5]
16. Range Analysis Example
- Degree-3 polynomial approximation to log(x)
- Able to compute exact ranges
- IB can be negative, as shown for C3, C0, and D4 → leading zeros in the fractional part
17. Precision Analysis
- Determine the minimal FBs of all signals while meeting the error constraint at the output
- Quantization methods:
  - Truncation: 2^-FB (1 ulp) maximum error
  - Round-to-nearest: 2^-(FB+1) (0.5 ulp) maximum error
- To achieve 1 ulp accuracy at the output, round-to-nearest must be performed at the output
- Free to choose either method for internal signals: although round-to-nearest offers smaller errors, it requires an extra adder, hence truncation is chosen
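The two quantization methods can be sketched for a value held in Q16 being reduced to fb fractional bits (function names and formats below are illustrative):

```c
#include <stdint.h>

/* Truncation: just drop low bits; maximum error 1 ulp = 2^-fb. */
int32_t quant_trunc(int32_t x_q16, int fb)
{
    return x_q16 >> (16 - fb);
}

/* Round-to-nearest: add half an ulp first, then drop low bits;
 * maximum error 0.5 ulp, at the cost of the extra addition. */
int32_t quant_round(int32_t x_q16, int fb)
{
    int shift = 16 - fb;
    return (x_q16 + (1 << (shift - 1))) >> shift;
}
```

The extra add in quant_round is exactly the "extra adder" the slide refers to, which is why truncation is preferred for internal signals.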
18. Error Models of Arithmetic Operators
- Let x̂ be the quantized version of a signal x, and ε_x = x - x̂ the error due to quantizing it
- Addition/subtraction: for z = x ± y, ε_z = ε_x ± ε_y
- Multiplication: for z = x·y, ε_z = x·ε_y + y·ε_x - ε_x·ε_y
19. Precision Analysis for Polynomials
- Degree-3 polynomial example, assuming that the coefficients are rounded to the nearest
- [Equation: total error expression for the degree-3 data path, including the inherent approximation error]
20. Uniform Fractional Bit-Width
- Obtain 8 fractional bits with 1 ulp (2^-8) accuracy
- A suboptimal but simple solution is the uniform fractional bit-width (UFB)
21. Non-Uniform Fractional Bit-Width
- Allow the fractional bit-widths to differ
- Use adaptive simulated annealing (ASA), which allows for faster convergence times than traditional simulated annealing
- Constraint function: error inequalities
- Cost function: latency of multi-word arithmetic
- Bit-widths must be a multiple of the natural processor word-length n
- On an 8-bit processor, if a signal has IB_x = 1, then FB_x ∈ {7, 15, 23, ...}
22. Bit-Widths for Degree-3 Example
[Table: integer, fractional, and total bit-widths for each signal, with shifts for binary point alignment]
23. Fixed-Point to Integer Mapping
- Multiplication
- Addition (the binary points of the operands must be aligned)
- Fixed-point libraries for the C language do not provide support for negative integer bit-widths → emulate fixed-point via integers with shifts
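A minimal sketch of this integer emulation, assuming illustrative Q formats (a value v with FB fractional bits is stored as the integer v * 2^FB):

```c
#include <stdint.h>

/* Multiplying a Q8 operand by a Q8 operand yields a Q16 intermediate;
 * a shift realigns it to the desired output format. */
int32_t fx_mul(int16_t a_q8, int16_t b_q8)       /* Q8 * Q8 -> Q8 */
{
    int32_t full = (int32_t)a_q8 * b_q8;         /* Q16 intermediate */
    return full >> 8;                            /* requantize to Q8 */
}

/* Addition requires both operands in the same Q format first:
 * shift one operand so the binary points line up. */
int32_t fx_add(int32_t a_q8, int32_t b_q4)       /* Q8 + Q4 -> Q8 */
{
    return a_q8 + (b_q4 << 4);                   /* align binary points */
}
```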
24. Multiplications in the C Language
- In standard C, a 16-bit by 16-bit multiplication returns the least significant 16 bits of the full 32-bit result → undesirable, since access to the full 32 bits is required
- Solution 1: pad the two operands with 16 leading zeros and perform a 32-bit by 32-bit multiplication
- Solution 2: use special C syntax to extract the full 32-bit result from the 16-bit by 16-bit multiplication; more efficient than Solution 1, and works on both the Atmel and TI compilers
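The slide does not reproduce the exact syntax; the usual idiom, which plausibly corresponds to Solution 2, is a cast applied before the multiply, which embedded compilers commonly map to a single widening multiply instruction rather than a full 32-bit by 32-bit multiplication:

```c
#include <stdint.h>

/* Casting one operand to int32_t promotes the multiplication to 32 bits,
 * so the full product of the two 16-bit values is kept. */
int32_t mul_full(int16_t a, int16_t b)
{
    return (int32_t)a * b;   /* full 32-bit product, no truncation */
}
```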
25. Automatically Generated C Code for Degree-3 Polynomial
- Casting for controlling multi-word arithmetic (inttypes.h)
- Shifts after each operation for quantization
- r is a constant used for final rounding
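The generated code itself is not preserved in this transcript; the sketch below shows only its general shape under the assumptions above (Horner form, casts for multi-word control, a shift after each multiply), with placeholder coefficients rather than the generated values:

```c
#include <inttypes.h>

/* Placeholder Q-format coefficients; the generator emits the real ones. */
static const int32_t C3 = -21, C2 = 87, C1 = -180, C0 = 2466;

/* Degree-3 polynomial in Horner form on the range-reduced input x1 (Q15).
 * Each multiply is widened by a cast, then requantized by a right shift. */
int32_t poly3(int16_t x1)
{
    int32_t y;
    y = ((int32_t)C3 * x1) >> 15;          /* C3*x            */
    y = ((int32_t)(C2 + y) * x1) >> 15;    /* (C2 + C3*x)*x   */
    y = ((int32_t)(C1 + y) * x1) >> 15;    /* next Horner step */
    return C0 + y;                         /* final accumulate */
}
```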
26. Automatically Generated C Code for Degree-3 Splines
- Accurate to 15 fractional bits (2^-15)
- 4 segments used
- The 2 leading bits of x form the table index
- Over 90% of outputs are exactly rounded (less than ½ ulp error)
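The same table-indexing structure can be sketched for degree-1 splines with 4 segments (coefficient values below are placeholders, not the generated ones):

```c
#include <stdint.h>

static const int16_t c1[4] = {100, 90, 80, 70};   /* placeholder slopes  */
static const int16_t c0[4] = {0, 100, 190, 270};  /* placeholder offsets */

/* The 2 leading bits of the 16-bit input select the segment; the
 * remaining 14 bits are renormalized and fed to the segment's
 * degree-1 polynomial. */
int32_t spline1(uint16_t x)
{
    unsigned seg = x >> 14;                /* table index: 2 leading bits */
    uint16_t x1 = (x & 0x3FFF) << 2;       /* remaining bits, renormalized */
    return c0[seg] + (((int32_t)c1[seg] * x1) >> 16);
}
```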
27. Experimental Validation
- Two commonly used embedded processors
- Atmel ATmega128 8-bit MCU
  - Single ALU + a hardware multiplier
  - Instructions execute in one cycle, except the multiplier (2 cycles)
  - 4 KB RAM
  - Atmel AVR Studio 4.12 for cycle-accurate simulation
- TI TMS320C64x 16-bit fixed-point DSP
  - VLIW architecture with six ALUs + two hardware multipliers
  - ALU: multiple additions/subtractions/shifts per cycle
  - Multiplier: 2x 16-bit-by-16-bit / 4x 8-bit-by-8-bit per cycle
  - 32 KB L1 + 2048 KB L2 cache
  - TI Code Composer Studio 3.1 for cycle-accurate simulation
- Same C code used for both platforms
28. Table Size Variation
- Single-polynomial approximation shows little area variation
- Rapid growth with low-degree splines due to the increasing number of segments
29. NFB vs. UFB Comparisons
- Non-uniform fractional bit-widths (NFB) allow reduced latency and code size relative to the uniform fractional bit-width (UFB)
30. Latency Variations
31. Code Size Variations
32. Code Size: Data and Instructions
- Upper part: data; lower part: instructions
33. Comparisons Against Floating-Point
- Significant savings in both latency and code size
34. Pareto-Optimal Points
- Latency / code size space
35. Application to 3D Object Location
- Dot products commonly occur in 3D coordinate computation
- Example: z = x0·y0 + x1·y1 + x2·y2
- 8-bit ATmega128 MCU
- Sufficient for 3D location
- Assuming active power of 20 mW
[Figure: latency/energy results for degree-1 and degree-2 splines]
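The dot product above can be sketched in fixed point with 8-bit Q7 inputs and a 32-bit accumulator for overflow protection (formats and the function name are illustrative):

```c
#include <stdint.h>

/* 3-element fixed-point dot product of the kind used in 3D coordinate
 * computation. Each 8x8 multiply is widened before accumulation; the
 * Q7 x Q7 products make the result Q14. */
int32_t dot3_q7(const int8_t x[3], const int8_t y[3])
{
    int32_t z = 0;
    for (int i = 0; i < 3; i++)
        z += (int32_t)((int16_t)x[i] * y[i]);  /* widening 8x8 multiply */
    return z;
}
```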
36. Application to Gamma Correction
- Evaluation of f(x) = x^0.8 on the ATmega128 MCU
[Figure: gamma-corrected images using degree-1 splines; no visible difference from the reference]
37. Summary
- Automated methodology for accurate and efficient function evaluation on fixed-point processors
- Approximation using polynomials and splines in fixed-point arithmetic via integer operations with shifts
- Multi-word arithmetic with overflow protection and precision accurate to 1 ulp
- Analytical fixed-point signal bit-width determination
- Experimental results on the ATmega128 MCU and TMS320C64x DSP
- Allows improved latency and/or code size → gives substantially more flexibility in system design