Title: Optimized Custom Function Evaluation for Embedded Processors
1. Optimized Custom Function Evaluation for Embedded Processors
Dong-U Lee, Image Communications Laboratory, EE Dept., UCLA
CENS Seminar, 07-28-2006
2. Motivation
- Embedded applications can impose severe constraints on platform cost, power, memory resources, and architectural flexibility
- Typical platforms: 8/16/32-bit fixed-point processors
- Function evaluation (e.g. log(x), sin(x)) is common in image/audio processing, sensor networking, etc.
- These evaluations are often the most mathematically complex aspects of an application
- Efficient function evaluation is critical to overall system speed and memory utilization
3. Traditional Methods of Function Evaluation
- Direct table lookup
  - Conceptually simple, easy to implement
  - High memory cost
  - Impractical for high precisions
- Standard math libraries (e.g. math.h in C) via FPU (Floating Point Unit) emulation
  - Easy to use
  - Very slow (example: 5600 cycles on the ATmega128 for the logarithm function)
  - Carries surplus precision (double-precision float) that is inherent in standard libraries
4. Related Work on Function Evaluation
- Function evaluation has been widely studied for processors with dedicated FPUs
- Little work has been done on fixed-point processors, despite the importance of these processors in the overall embedded and sensor markets
- Cheung et al.: 32-bit IBM Embedded PowerPC 405
  - Lack of error analysis → precision not guaranteed
- Texas Instruments: 16-bit TMS320C24x DSP
  - Lack of error analysis
- Iordache and Tang (Intel): 32-bit Intel XScale
  - Optimized for single and double precision, but many applications do not require this level of precision
  - Assembly language routines targeting only the XScale, not general across other processors
5. Main Contributions
- Precision guaranteed to 1 unit in the last place (ulp)
  - e.g. the ulp of a result with 8 fractional bits is 2^-8
- Different approximation methods for trading off precision / latency / code size
- Minimization of operator sizes
- Automated C code generation
6. Resource-Aware Function Evaluation
- Approximate functions via polynomials
- Minimal resources for a given target precision
- Exploit the native processor word-length via fixed-point arithmetic
- To minimize latency, assign the minimal number of bits to each signal in the data path
- Use multi-word arithmetic to emulate operations larger than the natural processor word-length over multiple processor cycles
7. Multi-Word Arithmetic
- Use multiple words to execute operations larger than the natural word-length of the processor
- 2n-bit by 2n-bit multiplication on an n-bit processor:
  - 4 multiplications
  - 6 additions
- Desirable to minimize the number of words involved in each operation
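As a concrete sketch of the scheme above, a 16-bit by 16-bit multiplication on an 8-bit machine decomposes into four 8-bit partial products that are shifted and accumulated (the function name and types below are illustrative, not from the slides):

```c
#include <stdint.h>

/* 16x16 unsigned multiply built from four 8x8 partial products,
 * as an 8-bit processor would execute it. Returns the full 32-bit
 * product. */
uint32_t mul16_multiword(uint16_t a, uint16_t b)
{
    uint8_t al = a & 0xFF, ah = a >> 8;   /* split operands into 8-bit words */
    uint8_t bl = b & 0xFF, bh = b >> 8;

    uint16_t ll = (uint16_t)al * bl;      /* 4 partial products */
    uint16_t lh = (uint16_t)al * bh;
    uint16_t hl = (uint16_t)ah * bl;
    uint16_t hh = (uint16_t)ah * bh;

    /* accumulate with the proper shifts (the additions of the slide) */
    return (uint32_t)ll
         + ((uint32_t)lh << 8)
         + ((uint32_t)hl << 8)
         + ((uint32_t)hh << 16);
}
```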
8. Function Evaluation
- Typically performed in three steps:
  - (1) reduce the input interval [a,b] to a smaller interval [a',b']
  - (2) approximate the function on the range-reduced interval
  - (3) expand the result back to the original range
- Evaluation of log(x): with x = M_x * 2^(E_x),
  log(x) = log(M_x) + E_x * log(2)
  where M_x is the mantissa of x over [1,2) and E_x is the exponent of x
9. Polynomial Approximations
- Single polynomial
  - Approximate the whole interval with a single polynomial
  - Increase the polynomial degree until the error requirement is met
- Splines (piecewise polynomials)
  - Partition the interval into multiple segments; use a different polynomial for each segment
  - Given a polynomial degree, increase the number of segments until the error requirement is met
10. Computation Flow
- The input interval is split into 2^(B_x0) equally sized segments
- The leading B_x0 bits serve as the coefficient table index
- Coefficients are computed in the minimax sense via the Remez algorithm
- Determine minimal bit-widths → minimize execution time
- x1, used for the polynomial arithmetic, is normalized over [0,1)
11. Approximation Methods
- Degree-3 Taylor, Chebyshev, and minimax approximations to log(x)
- We choose minimax approximations due to their superior maximum-error behaviour; they must be computed iteratively via the Remez algorithm
12. Design Flow Overview
- Fully automated within MATLAB
- Approximation methods
  - Single polynomial
  - Degree-d splines
- Range analysis
  - Analytical method based on computing the roots of the derivative of a signal
- Precision analysis
  - Simulated annealing on analytical error expressions
13. Error Sources in Digital Function Evaluation
- Three main error sources:
  - Inherent error E_A due to approximating the function
  - Quantization error E_Q due to finite-precision effects
  - The final output rounding step, which can cause a maximum of 0.5 ulp
- To achieve 1 ulp accuracy at the output: E_A + E_Q ≤ 0.5 ulp
- A large E_A allows polynomial degree reduction (single polynomial) or a reduction in the required number of segments (splines)
- However, it forces a small E_Q, leading to large bit-widths
- Good balance: allocate a maximum of 0.3 ulp for E_A and the rest for E_Q
14. Range Analysis
- Inspect the dynamic range of each signal and compute the required number of integer bits
- Two's complement assumed for each signal x
- For a range x ∈ [x_min, x_max], IB_x is given by
  IB_x = ⌊log2(max(|x_min|, |x_max|))⌋ + 2
15. Range Determination
- Examine local minima, local maxima, and the minimum and maximum input values at each signal
- Works for designs with differentiable signals, which is the case for polynomials
[Figure: range determination example showing the ranges of signals y2 and y5]
16. Range Analysis Example
- Degree-3 polynomial approximation to log(x)
- Able to compute exact ranges
- IB can be negative, as shown for C3, C0, and D4 → leading zeros in the fractional part
17. Precision Analysis
- Determine the minimal FBs of all signals while meeting the error constraint at the output
- Quantization methods:
  - Truncation: 2^-FB (1 ulp) maximum error
  - Round-to-nearest: 2^-(FB+1) (0.5 ulp) maximum error
- To achieve 1 ulp accuracy at the output, round-to-nearest must be performed at the output
- Free to choose either method for internal signals: although round-to-nearest offers smaller errors, it requires an extra adder, hence truncation is chosen
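The two quantization methods can be sketched for a value held in Q16 being reduced to fb fractional bits (function names and formats below are illustrative):

```c
#include <stdint.h>

/* Truncation: just drop low bits; maximum error 1 ulp = 2^-fb. */
int32_t quant_trunc(int32_t x_q16, int fb)
{
    return x_q16 >> (16 - fb);
}

/* Round-to-nearest: add half an ulp first, then drop low bits;
 * maximum error 0.5 ulp, at the cost of the extra addition. */
int32_t quant_round(int32_t x_q16, int fb)
{
    int shift = 16 - fb;
    return (x_q16 + (1 << (shift - 1))) >> shift;
}
```

The extra add in quant_round is exactly the "extra adder" the slide refers to, which is why truncation is preferred for internal signals.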
18. Error Models of Arithmetic Operators
- Let x̂ be the quantized version of a signal x, and ε_x = x - x̂ the error due to quantizing it
- Addition/subtraction: for z = x ± y, ε_z = ε_x ± ε_y
- Multiplication: for z = x·y, ε_z = x·ε_y + y·ε_x - ε_x·ε_y
19. Precision Analysis for Polynomials
- Degree-3 polynomial example, assuming that the coefficients are rounded to the nearest
- [Equation: total error expression for the degree-3 data path, including the inherent approximation error]
20. Uniform Fractional Bit-Width
- Obtain 8 fractional bits with 1 ulp (2^-8) accuracy
- A suboptimal but simple solution is the uniform fractional bit-width (UFB)
21. Non-Uniform Fractional Bit-Width
- Allow the fractional bit-widths to differ
- Use adaptive simulated annealing (ASA), which allows for faster convergence times than traditional simulated annealing
- Constraint function: error inequalities
- Cost function: latency of multi-word arithmetic
- Bit-widths must be a multiple of the natural processor word-length n
- On an 8-bit processor, if a signal has IB_x = 1, then FB_x ∈ {7, 15, 23, ...}
22. Bit-Widths for Degree-3 Example
[Table: integer, fractional, and total bit-widths for each signal, with shifts for binary point alignment]
23. Fixed-Point to Integer Mapping
- Multiplication
- Addition (the binary points of the operands must be aligned)
- Fixed-point libraries for the C language do not provide support for negative integer bit-widths → emulate fixed-point via integers with shifts
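A minimal sketch of this integer emulation, assuming illustrative Q formats (a value v with FB fractional bits is stored as the integer v * 2^FB):

```c
#include <stdint.h>

/* Multiplying a Q8 operand by a Q8 operand yields a Q16 intermediate;
 * a shift realigns it to the desired output format. */
int32_t fx_mul(int16_t a_q8, int16_t b_q8)       /* Q8 * Q8 -> Q8 */
{
    int32_t full = (int32_t)a_q8 * b_q8;         /* Q16 intermediate */
    return full >> 8;                            /* requantize to Q8 */
}

/* Addition requires both operands in the same Q format first:
 * shift one operand so the binary points line up. */
int32_t fx_add(int32_t a_q8, int32_t b_q4)       /* Q8 + Q4 -> Q8 */
{
    return a_q8 + (b_q4 << 4);                   /* align binary points */
}
```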
24. Multiplications in the C Language
- In standard C, a 16-bit by 16-bit multiplication returns the least significant 16 bits of the full 32-bit result → undesirable, since access to the full 32 bits is required
- Solution 1: pad the two operands with 16 leading zeros and perform a 32-bit by 32-bit multiplication
- Solution 2: use special C syntax to extract the full 32-bit result from the 16-bit by 16-bit multiplication; more efficient than Solution 1, and works on both the Atmel and TI compilers
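The slide does not reproduce the exact syntax; the usual idiom, which plausibly corresponds to Solution 2, is a cast applied before the multiply, which embedded compilers commonly map to a single widening multiply instruction rather than a full 32-bit by 32-bit multiplication:

```c
#include <stdint.h>

/* Casting one operand to int32_t promotes the multiplication to 32 bits,
 * so the full product of the two 16-bit values is kept. */
int32_t mul_full(int16_t a, int16_t b)
{
    return (int32_t)a * b;   /* full 32-bit product, no truncation */
}
```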
25. Automatically Generated C Code for Degree-3 Polynomial
- Casting for controlling multi-word arithmetic (inttypes.h)
- Shifts after each operation for quantization
- r is a constant used for final rounding
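The generated code itself is not preserved in this transcript; the sketch below shows only its general shape under the assumptions above (Horner form, casts for multi-word control, a shift after each multiply), with placeholder coefficients rather than the generated values:

```c
#include <inttypes.h>

/* Placeholder Q-format coefficients; the generator emits the real ones. */
static const int32_t C3 = -21, C2 = 87, C1 = -180, C0 = 2466;

/* Degree-3 polynomial in Horner form on the range-reduced input x1 (Q15).
 * Each multiply is widened by a cast, then requantized by a right shift. */
int32_t poly3(int16_t x1)
{
    int32_t y;
    y = ((int32_t)C3 * x1) >> 15;          /* C3*x            */
    y = ((int32_t)(C2 + y) * x1) >> 15;    /* (C2 + C3*x)*x   */
    y = ((int32_t)(C1 + y) * x1) >> 15;    /* next Horner step */
    return C0 + y;                         /* final accumulate */
}
```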
26. Automatically Generated C Code for Degree-3 Splines
- Accurate to 15 fractional bits (2^-15)
- 4 segments used
- The 2 leading bits of x form the table index
- Over 90% of outputs are exactly rounded (less than ½ ulp error)
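The same table-indexing structure can be sketched for degree-1 splines with 4 segments (coefficient values below are placeholders, not the generated ones):

```c
#include <stdint.h>

static const int16_t c1[4] = {100, 90, 80, 70};   /* placeholder slopes  */
static const int16_t c0[4] = {0, 100, 190, 270};  /* placeholder offsets */

/* The 2 leading bits of the 16-bit input select the segment; the
 * remaining 14 bits are renormalized and fed to the segment's
 * degree-1 polynomial. */
int32_t spline1(uint16_t x)
{
    unsigned seg = x >> 14;                /* table index: 2 leading bits */
    uint16_t x1 = (x & 0x3FFF) << 2;       /* remaining bits, renormalized */
    return c0[seg] + (((int32_t)c1[seg] * x1) >> 16);
}
```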
27. Experimental Validation
- Two commonly used embedded processors
- Atmel ATmega128 8-bit MCU
  - Single ALU + a hardware multiplier
  - Instructions execute in one cycle, except the multiplier (2 cycles)
  - 4 KB RAM
  - Atmel AVR Studio 4.12 for cycle-accurate simulation
- TI TMS320C64x 16-bit fixed-point DSP
  - VLIW architecture with six ALUs + two hardware multipliers
  - ALU: multiple additions/subtractions/shifts per cycle
  - Multiplier: 2x 16-bit-by-16-bit / 4x 8-bit-by-8-bit per cycle
  - 32 KB L1 + 2048 KB L2 cache
  - TI Code Composer Studio 3.1 for cycle-accurate simulation
- Same C code used for both platforms
28. Table Size Variation
- Single-polynomial approximation shows little area variation
- Rapid growth with low-degree splines due to the increasing number of segments
29. NFB vs. UFB Comparisons
- Non-uniform fractional bit-widths (NFB) allow reduced latency and code size relative to the uniform fractional bit-width (UFB)
30. Latency Variations
31. Code Size Variations
32. Code Size: Data and Instructions
- Upper part: data; lower part: instructions
33. Comparisons Against Floating-Point
- Significant savings in both latency and code size
34. Pareto-Optimal Points
- Latency / code size space
35. Application to 3D Object Location
- Dot products commonly occur in 3D coordinate computation
- Example: z = x0·y0 + x1·y1 + x2·y2
- 8-bit ATmega128 MCU
- Sufficient for 3D location
- Assuming active power of 20 mW
[Figure: latency/energy results for degree-1 and degree-2 splines]
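The dot product above can be sketched in fixed point with 8-bit Q7 inputs and a 32-bit accumulator for overflow protection (formats and the function name are illustrative):

```c
#include <stdint.h>

/* 3-element fixed-point dot product of the kind used in 3D coordinate
 * computation. Each 8x8 multiply is widened before accumulation; the
 * Q7 x Q7 products make the result Q14. */
int32_t dot3_q7(const int8_t x[3], const int8_t y[3])
{
    int32_t z = 0;
    for (int i = 0; i < 3; i++)
        z += (int32_t)((int16_t)x[i] * y[i]);  /* widening 8x8 multiply */
    return z;
}
```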
36. Application to Gamma Correction
- Evaluation of f(x) = x^0.8 on the ATmega128 MCU
[Figure: gamma-corrected images using degree-1 splines; no visible difference from the reference]
37. Summary
- Automated methodology for accurate and efficient function evaluation on fixed-point processors
- Approximation using polynomials and splines in fixed-point arithmetic via integer operations with shifts
- Multi-word arithmetic with overflow protection and precision accurate to 1 ulp
- Analytical fixed-point signal bit-width determination
- Experimental results on the ATmega128 MCU and TMS320C64x DSP
- Allows improved latency and/or code size → gives substantially more flexibility in system design