Optimized Custom Function Evaluation for Embedded Processors - PowerPoint PPT Presentation

1
Optimized Custom Function Evaluation for
Embedded Processors
Dong-U Lee, Image Communications Laboratory,
EE Dept., UCLA. CENS Seminar, 07-28-2006
2
Motivation
  • Embedded applications can impose severe
    constraints on platform cost, power, memory
    resources and architectural flexibility
  • Typical platforms 8/16/32-bit fixed-point
    processors
  • Function evaluation (e.g. log(x), sin(x)) common
    in image/audio processing, sensor networking,
    etc.
  • These evaluations are often the most
    mathematically complex aspects of an application
  • Efficient function evaluation is critical to
    overall system speed and memory utilization

3
Traditional Methods of Function Evaluation
  • Direct table lookup
  • Conceptually simple, easy to implement
  • High memory cost
  • Impractical for high precisions
  • Standard math libraries (e.g. math.h in C) via
    FPU (floating-point unit) emulation
  • Easy to use
  • Very slow (e.g. 5600 cycles on the ATmega128 for
    the logarithm function)
  • Carries surplus precision (double-precision
    floats) that is inherent in standard libraries

4
Related Work on Function Evaluation
  • Function evaluation has been widely studied for
    processors with dedicated FPUs
  • Little work has been done on fixed-point
    processors, despite the importance of these
    processors in the overall embedded and sensor
    markets
  • Cheung et al.: 32-bit IBM Embedded PowerPC 405
  • Lack of error analysis → precision not guaranteed
  • Texas Instruments: 16-bit TMS320C24x DSP
  • Lack of error analysis
  • Iordache and Tang (Intel): 32-bit Intel XScale
  • Optimized for single and double precision, but
    many applications do not require this level of
    precision
  • Assembly language routines targeting only the
    XScale; not general across other processors

5
Main Contributions
  • Precision guaranteed to 1 unit in the last place
    (ulp)
  • e.g. the ulp of a result with 8 fractional bits
    is 2^-8
  • Different approximation methods for trading off
    precision / latency / code size
  • Minimization of operator sizes
  • Automated C code generation

6
Resource-Aware Function Evaluation
  • Approximate functions via polynomials
  • Minimal resources for given target precision
  • Exploit native processor word-length via
    fixed-point arithmetic
  • To minimize latency, assign the minimal number
    of bits to each signal in the data path
  • Use multi-word arithmetic to emulate operations
    larger than the natural processor word-length,
    over multiple processor cycles

7
Multi-Word Arithmetic
  • Use multiple words to execute operations larger
    than the natural word-length of the processor
  • 2n-bit by 2n-bit multiplication on an n-bit
    processor
  • 4 multiplications
  • 6 additions
  • Desirable to minimize the number of words
    involved for each operation
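The decomposition above can be written out as an illustrative C sketch (not the paper's generated code) for n = 8: a 16-bit by 16-bit product built from the four 8-bit by 8-bit partial products that an 8-bit hardware multiplier can produce.

```c
#include <stdint.h>

/* Illustrative sketch (not the paper's code): a 16-bit x 16-bit
 * multiply built from four 8-bit x 8-bit hardware multiplies and
 * shifted additions, as on an 8-bit processor. */
static uint32_t mul16_via_8bit(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)(a & 0xFF), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xFF), bh = (uint8_t)(b >> 8);

    uint16_t ll = (uint16_t)al * bl;   /* weight 2^0  */
    uint16_t lh = (uint16_t)al * bh;   /* weight 2^8  */
    uint16_t hl = (uint16_t)ah * bl;   /* weight 2^8  */
    uint16_t hh = (uint16_t)ah * bh;   /* weight 2^16 */

    /* combine the partial products with shifted additions */
    return (uint32_t)ll
         + ((uint32_t)lh << 8)
         + ((uint32_t)hl << 8)
         + ((uint32_t)hh << 16);
}
```

The four multiplies, plus the carry-propagating additions needed to combine them, are where the operation counts on this slide come from.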

8
Function Evaluation
  • Typically performed in three steps
  • (1) reduce the input interval [a,b] to a smaller
    interval [a',b']
  • (2) approximate the function on the
    range-reduced interval
  • (3) expand the result back to the original range
  • Evaluation of log(x)

log(x) = log(Mx) + Ex · log(2), where Mx is the
mantissa of x over [1,2) and Ex is the exponent of x
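A minimal sketch of the range-reduction step for log on an integer-only processor; the 16-bit unsigned input and the Q1.15 mantissa format are illustrative assumptions, not the paper's exact formats.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical sketch of range reduction for log2: write x = M * 2^E
 * with M in [1,2), so that log2(x) = log2(M) + E.  x is a plain
 * unsigned 16-bit integer; M is returned in Q1.15 fixed point. */
static int16_t range_reduce_log2(uint16_t x, uint16_t *mantissa_q15)
{
    assert(x != 0);              /* log is undefined at 0 */
    int16_t e = 15;              /* exponent if bit 15 is already set */
    while (!(x & 0x8000u)) {     /* normalize: move the MSB to bit 15 */
        x <<= 1;
        e--;
    }
    *mantissa_q15 = x;           /* Q1.15 value in [1.0, 2.0) */
    return e;
}
```

For example, range_reduce_log2(6, &m) gives E = 2 with m = 0xC000 (1.5 in Q1.15), since 6 = 1.5 · 2^2; the polynomial then only has to approximate log over [1,2).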
9
Polynomial Approximations
  • Single polynomial
  • Approximate the whole interval with a single
    polynomial
  • Increase the polynomial degree until the error
    requirement is met
  • Splines (piecewise polynomials)
  • Partition the interval into multiple segments
    and use a different polynomial for each segment
  • Given a polynomial degree, increase the number of
    segments until the error requirement is met

10
Computation Flow
  • The input interval is split into 2^Bx0 equally
    sized segments
  • The leading Bx0 bits serve as the coefficient
    table index
  • Coefficients are computed in the minimax sense
    via the Remez algorithm
  • Determine minimal bit-widths → minimize
    execution time
  • x1, used for the polynomial arithmetic, is
    normalized over [0,1)
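The index/remainder split can be sketched in C; the choice of Bx0 = 2 and the Q0.16 input format are illustrative, not the tool's actual parameters.

```c
#include <stdint.h>

/* Illustrative sketch: split a Q0.16 input into a segment index
 * (its leading BX0 bits) and x1, the remaining bits renormalized
 * over [0,1).  BX0 = 2, i.e. 4 segments, is an arbitrary example. */
#define BX0 2

static void split_input(uint16_t x_q16, uint16_t *index, uint16_t *x1_q16)
{
    *index  = x_q16 >> (16 - BX0);       /* top BX0 bits select the
                                            coefficient table entry */
    *x1_q16 = (uint16_t)(x_q16 << BX0);  /* low bits shifted up so x1
                                            spans the full [0,1) range */
}
```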

11
Approximation Methods
  • Degree-3 Taylor, Chebyshev, and minimax
    approximations to log(x)
  • We choose minimax approximations due to their
    superior maximum-error behaviour; they must be
    computed iteratively via the Remez algorithm

12
Design Flow Overview
  • Fully automated within MATLAB
  • Approximation methods
  • Single Polynomial
  • Degree-d splines
  • Range analysis
  • Analytical method based on computing the roots
    of the derivative of a signal
  • Precision analysis
  • Simulated annealing on analytical error
    expressions

13
Error Sources in Digital Function Evaluation
  • Three main error sources
  • Inherent error E_A due to approximating the
    function
  • Quantization error E_Q due to finite-precision
    effects
  • Final output rounding step, which can cause a
    maximum of 0.5 ulp
  • To achieve 1 ulp accuracy at the output,
    E_A + E_Q ≤ 0.5 ulp
  • A large E_A permits polynomial-degree reduction
    (single polynomial) or a reduction in the
    required number of segments (splines)
  • However, it leaves only a small budget for E_Q,
    leading to large bit-widths
  • Good balance: allocate a maximum of 0.3 ulp to
    E_A and the rest to E_Q

14
Range Analysis
  • Inspect dynamic range of each signal and compute
    required number of integer bits
  • Two's complement assumed; for a signal x with
    range x ∈ [x_min, x_max], IB_x is given by

IB_x = ⌈log2(max(|x_min|, |x_max|))⌉ + 1

15
Range Determination
  • Examine local minima, local maxima, and minimum
    and maximum input values at each signal
  • Works for designs with differentiable signals,
    which is the case for polynomials

(Figure: example signal y with range [2, 5])
16
Range Analysis Example
  • Degree-3 polynomial approximation to log(x)
  • Able to compute exact ranges
  • IB can be negative, as shown for C3, C0, and D4
    → leading zeros in the fractional part

17
Precision Analysis
  • Determine the minimal FBs (fractional
    bit-widths) of all signals while meeting the
    error constraint at the output
  • Quantization methods
  • Truncation: 2^-FB (1 ulp) maximum error
  • Round-to-nearest: 2^(-FB-1) (0.5 ulp) maximum
    error
  • To achieve 1 ulp accuracy at the output,
    round-to-nearest must be performed at the output
  • Either method may be used for internal signals;
    although round-to-nearest offers smaller errors,
    it requires an extra adder, hence truncation is
    chosen
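The two quantization methods amount to one-liners in C; the shift count here is an arbitrary example, and negative intermediates assume the usual arithmetic-shift behaviour of embedded compilers.

```c
#include <stdint.h>

/* Sketch of the two quantization choices when discarding SHIFT
 * fractional bits from an intermediate value.  SHIFT = 8 is an
 * arbitrary example. */
#define SHIFT 8

static int32_t quant_truncate(int32_t v)
{
    return v >> SHIFT;                 /* free; error in [0, 1 ulp) */
}

static int32_t quant_round_nearest(int32_t v)
{
    /* one extra addition buys error in [-0.5, 0.5] ulp */
    return (v + (1 << (SHIFT - 1))) >> SHIFT;
}
```

The extra addition in quant_round_nearest is exactly the adder the slide refers to, which is why truncation is preferred for internal signals.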

18
Error Models of Arithmetic Operators
  • Let x̂ be the quantized version of a signal x,
    and ε_x = x − x̂ the error due to quantizing it
  • Addition/subtraction: z = x ± y →
    ε_z = ε_x ± ε_y
  • Multiplication: z = x · y →
    ε_z = x·ε_y + y·ε_x − ε_x·ε_y

19
Precision Analysis for Polynomials
  • Degree-3 polynomial example, assuming that
    coefficients are rounded to the nearest

Inherent approximation error
20
Uniform Fractional Bit-Width
  • Obtain 8 fractional bits with 1 ulp (2^-8)
    accuracy
  • A suboptimal but simple solution is the uniform
    fractional bit-width (UFB)

21
Non-Uniform Fractional Bit-Width
  • Allow the fractional bit-widths to differ
  • Use adaptive simulated annealing (ASA), which
    allows faster convergence than traditional
    simulated annealing
  • Constraint function: the error inequalities
  • Cost function: latency of the multi-word
    arithmetic
  • Bit-widths must be a multiple of the natural
    processor word-length n
  • On an 8-bit processor, if a signal has IB_x = 1,
    then FB_x ∈ {7, 15, 23, …}

22
Bit-Widths for Degree-3 Example
Shifts for binary point alignment
Total Bit-Width
Integer Bit-Width
Fractional Bit-Width
23
Fixed-Point to Integer Mapping
Multiplication
Addition (the binary points of the operands must be
aligned)
  • Fixed-point libraries for the C language do not
    provide support for negative integer bit-widths
    → emulate fixed-point via integers with shifts
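The integer-with-shifts emulation can be sketched as follows; the Q-format parameters in the usage example are illustrative.

```c
#include <stdint.h>

/* Illustrative sketch of fixed point emulated with plain integers
 * and shifts: a value with f fractional bits is stored as x * 2^f. */

/* Addition: align the binary points to the larger fractional width
 * before adding. */
static int32_t fx_add(int32_t a, int fa, int32_t b, int fb)
{
    if (fa > fb) return a + (b << (fa - fb));
    else         return (a << (fb - fa)) + b;
}

/* Multiplication: the product carries fa + fb fractional bits;
 * truncate `drop` of them to requantize. */
static int32_t fx_mul(int32_t a, int32_t b, int drop)
{
    return (int32_t)(((int64_t)a * b) >> drop);
}
```

For example, fx_add(384, 8, 4, 4) adds 1.5 (Q.8) and 0.25 (Q.4) to give 448, i.e. 1.75 in Q.8; fx_mul(384, 128, 8) multiplies 1.5 by 0.5 (both Q.8) to give 192, i.e. 0.75 in Q.8.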

24
Multiplications in C language
  • In standard C, a 16-bit by 16-bit multiplication
    returns only the least significant 16 bits of
    the full 32-bit result → undesirable, since
    access to the full 32 bits is required
  • Solution 1: pad the two operands with 16 leading
    zeros and perform a 32-bit by 32-bit
    multiplication
  • Solution 2: use C casts to extract the full
    32-bit result from the 16-bit by 16-bit
    multiplication; more efficient than solution 1,
    and works on both the Atmel and TI compilers
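One common way to express solution 2 in portable C is to cast the operands up before the multiply; embedded compilers typically recognize this pattern and emit the native 16x16 → 32 multiply instruction rather than a 32x32 one. (The slide's exact syntax is not shown; this is the standard idiom.)

```c
#include <stdint.h>

/* Widening multiply: the casts make the multiplication happen at
 * 32-bit width, so the full 32-bit product of two 16-bit operands
 * is available. */
static int32_t widening_mul(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;   /* full 32-bit result */
}
```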

25
Automatically Generated C Code for Degree-3
Polynomial
  • Casting for controlling multi-word arithmetic
    (inttypes.h)
  • Shifts after each operation for quantization
  • r is a constant used for final rounding
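A hand-written sketch in the spirit of the generated code (the bit positions, intermediate widths, and the rounding constant r are illustrative, not the tool's actual output): a degree-3 polynomial evaluated by Horner's rule, with a truncating shift after each multiply and r applied once at the output.

```c
#include <stdint.h>

#define F 15   /* fractional bits of x and of the result (example) */

/* Evaluate c[3]*x^3 + c[2]*x^2 + c[1]*x + c[0] by Horner's rule in
 * fixed point.  Intermediates are truncated with a shift after each
 * multiply; the constant r performs the final round-to-nearest.
 * Assumes arithmetic right shift of negatives, as on typical
 * embedded compilers. */
static int32_t poly3_q15(int32_t x, const int32_t c[4])
{
    const int64_t r = 1 << (F - 1);      /* final rounding constant */
    int64_t y = c[3];
    y = ((y * x) >> F) + c[2];           /* truncate intermediates */
    y = ((y * x) >> F) + c[1];
    y = (y * x) + ((int64_t)c[0] << F);  /* keep 2F bits for the last step */
    return (int32_t)((y + r) >> F);      /* round once at the output */
}
```

With c = {0, 32768, 0, 0} (the identity p(x) = x in Q.15) the routine returns its input, which is a convenient sanity check on the shift placement.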

26
Automatically Generated C Code for Degree-3
Splines
  • Accurate to 15 fractional bits (2^-15)
  • 4 segments used
  • The 2 leading bits of x form the table index
  • Over 90% of results are exactly rounded (less
    than ½ ulp error)

27
Experimental Validation
  • Two commonly-used embedded processors
  • Atmel ATmega128 8-bit MCU
  • Single ALU and a hardware multiplier
  • Instructions execute in one cycle, except the
    multiplier (2 cycles)
  • 4 KB RAM
  • Atmel AVR Studio 4.12 for cycle-accurate
    simulation
  • TI TMS320C64x 16-bit fixed-point DSP
  • VLIW architecture with six ALUs and two hardware
    multipliers
  • ALUs: multiple additions/subtractions/shifts per
    cycle
  • Multipliers: 2x 16-bit-by-16-bit or 4x
    8-bit-by-8-bit per cycle
  • 32 KB L1 and 2048 KB L2 cache
  • TI Code Composer Studio 3.1 for cycle-accurate
    simulation
  • The same C code is used for both platforms

28
Table Size Variation
  • The single-polynomial approximation shows little
    table-size variation
  • Rapid growth with low-degree splines, due to the
    increasing number of segments

29
NFB vs UFB Comparisons
  • Non-uniform fractional bit-widths (NFB) allow
    reduced latency and code size relative to the
    uniform fractional bit-width (UFB)

30
Latency Variations
31
Code Size Variations
32
Code Size Data and Instructions
Upper part: data; lower part: instructions
33
Comparisons Against Floating-Point
  • Significant savings in both latency and code size

34
Pareto-Optimal Points
  • Latency / code size space

35
Application to 3D Object Location
  • Dot products commonly occur in 3D coordinate
    computation
  • Example: z = x0·y0 + x1·y1 + x2·y2
  • 8-bit ATmega128 MCU

(Chart: latency and energy of degree-1 vs degree-2
splines; degree-1 splines are sufficient for 3D
location, assuming an active power of 20 mW)
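A fixed-point version of this dot product can be sketched for the 8-bit case; the Q.7 operand format is an illustrative assumption.

```c
#include <stdint.h>

/* Illustrative sketch: z = x0*y0 + x1*y1 + x2*y2 with Q.7 operands.
 * Each 8x8 multiply produces a 16-bit Q.14 product (2 cycles on the
 * ATmega128 multiplier); the sum is accumulated in 32 bits so the
 * three-term sum cannot overflow. */
static int32_t dot3_q7(const int8_t x[3], const int8_t y[3])
{
    int32_t acc = 0;
    for (int i = 0; i < 3; i++)
        acc += (int16_t)x[i] * y[i];   /* Q.7 * Q.7 -> Q.14 */
    return acc;                        /* Q.14 result */
}
```

With all six operands equal to 64 (0.5 in Q.7), dot3_q7 returns 12288, i.e. 0.75 in Q.14.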
36
Application to Gamma Correction
  • Evaluation of f(x) = x^0.8 on the ATmega128 MCU

(Images: gamma correction via degree-1 splines shows
no visible difference)
37
Summary
  • Automated methodology for accurate and efficient
    function evaluation on fixed-point processors
  • Approximation using polynomials and splines in
    fixed-point arithmetic, via integer operations
    with shifts
  • Multi-word arithmetic with overflow protection
    and precision accurate to 1 ulp
  • Analytical fixed-point signal bit-width
    determination
  • Experimental results on the ATmega128 MCU and
    TMS320C64x DSP
  • Improved latency and/or code size gives
    substantially more flexibility in system design