Title: Floating Point Computation
1. Floating Point Computation
2. Contents
- Sources of Computational Error
- Computer Representation of (Floating-Point) Numbers
- Efficiency Issues
3. Sources of Computational Error
- Converting a mathematical problem to a numerical problem introduces errors due to limited computational resources:
  - round-off error (limited precision of representation)
  - truncation error (limited time for computation)
- Misc.
  - Error in the original data
  - Blunder: a mistake made through stupidity, ignorance, or carelessness, e.g., a programming or data-input error
  - Propagated error
4. Supplement: Error Classification (Hildebrand)
- Gross error: caused by human or mechanical mistakes.
- Round-off error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification.
- Truncation error: any error which is neither a gross error nor a round-off error.
  - Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.
5. Common Measures of Error
- Definitions
  - total error = round-off error + truncation error
  - absolute error = |numerical - exact|
  - relative error = absolute error / |exact|
  - If the exact value is zero, the relative error is not defined.
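A small illustration of these definitions in C (the numbers here are hypothetical, not from the slides):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double exact  = 1.0 / 3.0;   /* value we are approximating */
        double approx = 0.333;       /* three-digit approximation  */
        double abs_err = fabs(approx - exact);
        double rel_err = abs_err / fabs(exact);  /* undefined if exact == 0 */
        printf("absolute error = %g\n", abs_err);
        printf("relative error = %g\n", rel_err);
        return 0;
    }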
6. Ex: Round-off Error
- A representation consists of a finite number of digits.
- The approximation of real numbers on the number line is discrete!
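A minimal sketch of that discreteness, using C99's nextafterf to find the float immediately after 1.0:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float x = 1.0f;
        float next = nextafterf(x, 2.0f);        /* the very next float */
        printf("gap at 1.0 = %.10e\n", next - x); /* 2^-23, ~1.19e-07   */
        return 0;
    }

Between any two adjacent floats, every real number is rounded to one of the two.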
7. Watch out for printf!!
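The slide's program is not reproduced here; the usual pitfall it warns about is that printf's default %f format rounds to 6 digits and can make an inexact value look exact:

    #include <stdio.h>

    int main(void)
    {
        float x = 0.1f;
        printf("%f\n", x);      /* 0.100000 -- looks exact               */
        printf("%.20f\n", x);   /* 0.10000000149011611938 -- it is not   */
        return 0;
    }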
8. Ex: Numerical Differentiation
- Evaluating the first derivative of f(x), e.g. with the forward difference f'(x) ≈ (f(x+h) - f(x)) / h
- Truncation error: by Taylor's theorem, f(x+h) = f(x) + h f'(x) + (h²/2) f''(ξ), so the formula is off by (h/2) f''(ξ)
9. Numerical Differentiation (cont)
- Select a problem with a known answer
  - so that we can evaluate the error!
10. Numerical Differentiation (cont)
- Error analysis (see the experiment below)
  - As h decreases, the (truncation) error decreases.
  - What happened at h = 0.00001?!
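A sketch of such an experiment, assuming the test problem is f(x) = sin(x) with known derivative cos(x) (the slide's actual program and test function are not shown):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float x = 1.0f;
        float exact = cosf(x);                  /* known answer */
        for (float h = 0.1f; h > 1e-7f; h /= 10.0f) {
            float est = (sinf(x + h) - sinf(x)) / h;  /* forward difference */
            printf("h = %.0e   error = %.3e\n", h, fabsf(est - exact));
        }
        return 0;
    }

The error shrinks with h at first, then grows again once round-off in the subtraction sinf(x+h) - sinf(x) dominates.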
11. Ex: Polynomial Deflation
- F(x) is a polynomial with 20 real roots.
- Use any method to numerically solve for a root, then deflate the polynomial to 19th degree.
- Solve another root, deflate again, and again, ...
- The accuracy of the roots obtained gets worse each time due to error propagation (see the sketch below).
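Deflation itself is just synthetic division by (x - r); a minimal sketch (the names and layout are mine, not the slide's):

    /* Divide p(x) = p[0] + p[1]x + ... + p[n]x^n by (x - r).
       The quotient q has degree n-1; if r is an accurate root,
       the remainder p[0] + r*q[0] is nearly zero.  Any error in r
       contaminates q, and therefore every root found afterwards. */
    void deflate(const double p[], int n, double r, double q[])
    {
        q[n - 1] = p[n];
        for (int i = n - 2; i >= 0; --i)
            q[i] = p[i + 1] + r * q[i + 1];
    }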
12. Computer Representation of Floating-Point Numbers
- Decimal-binary conversion
- Floating point vs. fixed point
- Standard: IEEE 754 (1985)
13. Decimal-Binary Conversion
- 29₁₀ = 11101₂
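The integer part converts by repeated division by 2, collecting remainders; a small sketch of the standard method:

    #include <stdio.h>

    /* Print a non-negative integer in binary: repeatedly divide by 2;
       the remainders are the bits, least significant first. */
    int main(void)
    {
        unsigned n = 29;
        char bits[32];
        int k = 0;
        do {
            bits[k++] = '0' + n % 2;
            n /= 2;
        } while (n > 0);
        while (k--) putchar(bits[k]);   /* reverse: print MSB first */
        putchar('\n');                  /* 29 -> 11101 */
        return 0;
    }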
14. Fraction-Binary Conversion
- Repeatedly multiply the fractional part by 2; the integer part produced at each step is the next binary digit a₁, a₂, a₃, ...
- Worked example on the slide: a₁ = 1, a₂ = 1, a₃ = 1, a₄ = a₅ = 0
15. 0.625₁₀ = 0.101₂
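A sketch of the multiply-by-2 method in code (my own example value; note that many decimal fractions, e.g. 0.9, never terminate in binary):

    #include <stdio.h>

    /* Print the first 10 binary digits of a fraction 0 <= f < 1
       by repeated multiplication by 2. */
    int main(void)
    {
        double f = 0.625;
        printf("0.");
        for (int i = 0; i < 10; ++i) {
            f *= 2.0;
            if (f >= 1.0) { putchar('1'); f -= 1.0; }
            else          { putchar('0'); }
        }
        putchar('\n');              /* 0.625 -> 0.1010000000 */
        return 0;
    }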
16. Floating vs. Fixed Point
- Decimal, 6 digits (positive numbers)
  - fixed point, with 5 digits after the decimal point: 0.00001, ..., 9.99999
  - floating point, 2 digits for the exponent (base 10) and 4 digits for the mantissa (accuracy): 0.001×10^-99, ..., 9.999×10^99
- Comparison
  - Fixed point: fixed accuracy; simple math for computation (used in systems w/o an FPU)
  - Floating point: trades accuracy for a much larger range of representation
17. Floating-Point Representation
- A number is represented as ±f × b^e
- Fraction (mantissa), f
  - usually normalized so that the leading digit is nonzero (for base 2, 1 ≤ f < 2)
- Base, b
  - 2 for personal computers
  - 16 for mainframes
- Exponent, e
18. IEEE 754-1985
- Purpose: make floating-point systems portable
- Defines the number representation, how calculations are performed, exceptions, ...
- Single precision (32-bit)
- Double precision (64-bit)
19. Number Representation
- Fields: s (sign of mantissa), e (exponent), m (mantissa); single precision uses 1 + 8 + 23 bits, double precision 1 + 11 + 52 bits
- Range (roughly)
  - Single: 10^-38 to 10^38
  - Double: 10^-307 to 10^307
- Precision (roughly)
  - Single: 7-8 significant decimal digits
  - Double: 15 significant decimal digits
20. Significant Digits
- In the binary sense, 24 bits are significant (including the implicit one; next page)
- In the decimal sense, roughly 7-8 significant decimal digits (2^24 ≈ 1.7 × 10^7)
- When you write your program, make sure the results you print carry the meaningful number of significant digits.
21. Implicit One
- The normalized mantissa is always ≥ 1.0 (and < 2.0)
- Only the fractional part is stored, gaining one extra bit of precision
- Ex: 3.5 = (1.11)₂ × 2^1; only the bits "11" go into the mantissa field
22. Exponent Bias
- Ex: in single precision, the exponent has 8 bits
  - 0000 0000 (0) to 1111 1111 (255)
- An offset is added so that both + and - exponents can be represented
- Effective exponent = biased exponent - bias
- Bias value: 127 (32-bit), 1023 (64-bit)
- Ex (32-bit): 1000 0000 (128) → effective exponent = 128 - 127 = 1
23. Ex: Convert 3.5 to a 32-bit FP Number
- 3.5 = (11.1)₂ = (1.11)₂ × 2^1
- sign s = 0; biased exponent = 1 + 127 = 128 = 1000 0000; mantissa field = 1100...0
- Result: 0 10000000 11000000000000000000000 (hex 0x40600000)
24. Examine Bits of FP Numbers
- Explain how this program works.
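The slide's program itself is not reproduced here; a minimal sketch of such a bit examiner, assuming a union of a float and a 32-bit unsigned int:

    #include <stdio.h>

    /* Print the sign, exponent, and mantissa bits of a float by
       aliasing it with an unsigned int of the same size. */
    int main(void)
    {
        union { float f; unsigned u; } x;
        x.f = 3.5f;
        for (int i = 31; i >= 0; --i) {
            putchar(((x.u >> i) & 1) ? '1' : '0');
            if (i == 31 || i == 23) putchar(' ');  /* separate s | e | m */
        }
        putchar('\n');   /* 3.5 -> 0 10000000 11000000000000000000000 */
        return 0;
    }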
25. The Examiner
- Use the previous program to
  - observe how ME works
  - test subnormal behaviors on your computer/compiler
  - convince yourself why the subtraction of two nearly equal numbers produces lots of error
  - NaN: Not-a-Number!?
26. Design Philosophy of IEEE 754
- Field order: s, e, m
  - S first: whether the number is +/- can be tested easily
  - E before M: simplifies sorting
- Negative exponents are represented by a bias (not 2's complement) for ease of sorting
  - biased rep.: -1, 0, +1 → 126, 127, 128
  - 2's compl.: -1, 0, +1 → 0xFF, 0x00, 0x01, which needs more complicated math for sorting and increment/decrement
27. Exceptions
- Overflow
  - ±INF: when a number exceeds the range of representation
- Underflow
  - when numbers are too close to zero, they are treated as zeros
- Dwarf
  - the smallest representable number in the FP system
- Machine Epsilon (ME)
  - a number with computational significance (more later)
28. Extremities (more later)
- E = 11...1 (all ones)
  - M = 00...0: ±infinity
  - M not all zeros: NaN (Not a Number)
- E = 00...0 (all zeros)
  - M = 00...0: clean zero
  - M not all zeros: dirty zero (see next page)
29. Not-a-Number
- Numerical exceptions
  - sqrt of a negative number
  - invalid domain for trigonometric functions, ...
- Often causes the program to stop running
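A few operations that produce NaN or INF on an IEEE 754 platform (a small sketch; the exact output text varies by C library):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        printf("%f\n", sqrt(-1.0));   /* nan: invalid operand         */
        printf("%f\n", asin(2.0));    /* nan: argument outside [-1,1] */
        printf("%f\n", 1.0 / 0.0);    /* inf                          */
        printf("%f\n", 0.0 / 0.0);    /* nan: indeterminate form      */
        return 0;
    }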
30. Extremities (32-bit)
- Max: (1.111...1)₂ × 2^(254-127) = (2 - 2^-23) × 2^127 ≈ 2^128
- Min (w/o stepping into dirty zero): (1.000...0)₂ × 2^(1-127) = 2^-126
31. Dirty Zero (a.k.a. denormals)
- a.k.a. = also known as
- No implicit one
- Hardware/compiler support for denormals is not consistent across platforms
- If you are not sure how to handle them, stay away from them; scale your problem properly
- Many problems can be solved by pretending they do not exist
32. Dirty Zero (cont)
  2^-126 : 00000000 10000000 00000000 00000000
  2^-127 : 00000000 01000000 00000000 00000000
  2^-128 : 00000000 00100000 00000000 00000000
  2^-129 : 00000000 00010000 00000000 00000000
(Dwarf: the smallest representable number)
33. Dwarf (32-bit)
- 00000000 00000000 00000000 00000001
- Value: 2^-23 × 2^-126 = 2^-149
34. Machine Epsilon (ME)
- Definition: the smallest non-zero number that makes a difference when added to 1.0 on your working platform
- This is not the same as the dwarf!
35. Computing ME (32-bit)
- Keep halving eps as long as 1 + eps still differs from 1.0 (i.e., 1 + eps keeps getting closer to 1.0)
- ME = (00111111 10000000 00000000 00000001)₂ - 1.0 = 2^-23 ≈ 1.19 × 10^-7
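A sketch of the usual loop, assuming floats; note that platforms which keep intermediates in extended precision can make a naive version misbehave (see the exercise on slide 53):

    #include <stdio.h>

    int main(void)
    {
        float eps = 1.0f;
        /* halve eps while adding half of it to 1.0f still changes the sum */
        while ((float)(1.0f + eps / 2.0f) > 1.0f)
            eps /= 2.0f;
        printf("ME = %e\n", eps);   /* about 1.19e-07 = 2^-23 */
        return 0;
    }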
36. Effect of ME
37. Significance of ME
- Never terminate an iteration by testing whether two FP numbers are equal.
- Instead, test whether |x - y| < ME.
38. Numerical Scaling
- Number density: there are as many IEEE 754 numbers between 1.0 and 2.0 as there are between 256 and 512
- Revisit
  - round-off error
  - ME: a measure of real-number density near 1.0
- Implication
  - Scale your problem so that intermediate results lie between 1.0 and 2.0 (where the numbers are dense and the round-off error is smallest)
39. Scaling (cont)
- Performing computation on denser portions of the real line minimizes the round-off error,
- but don't overdo it: switching to double precision is an easier way to increase precision.
- The densest part is near the subnormals, if density is defined as numbers per unit length.
40. How Subtraction Is Performed on Your PC
- Steps (see the sketch after this list)
  - convert both operands to base 2
  - equalize the exponents by adjusting the mantissa values; truncate the bits that do not fit
  - subtract the mantissas
  - normalize the result
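The exponent-equalization step is where bits get lost; a tiny demonstration (my own numbers):

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f, b = 1e-8f;
        /* To align exponents, b's mantissa is shifted right ~27 places,
           past the end of the 24-bit mantissa, so all its bits are lost. */
        printf("%d\n", (a + b) == a);   /* prints 1 */
        return 0;
    }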
41. Subtraction of Nearly Equal Numbers
- Significant loss of accuracy (most bits are unreliable)
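A demonstration of the cancellation (the values are my own; with ~7 significant decimal digits in a float, only about 3 survive here):

    #include <stdio.h>

    int main(void)
    {
        float x = 1.0002f, y = 1.0001f;
        float d = x - y;           /* exact answer: 0.0001          */
        printf("%.10f\n", d);      /* only the leading digits match */
        return 0;
    }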
42. Theorem of Loss of Precision
- Let x, y be normalized floating-point machine numbers with x > y > 0.
- If 2^-p ≤ 1 - y/x ≤ 2^-q,
- then at most p and at least q significant binary bits are lost in the subtraction x - y.
- Interpretation
  - When two numbers are very close, their subtraction introduces a lot of numerical error.
43Implications
- You should write these instead
Every FP operation introduces error, but the
subtraction of nearly equal numbers is the worst
and should be avoided whenever possible
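The slide's own rewrites are not shown; two classic examples of the idea (assumed, not from the source):

    #include <math.h>

    /* 1 - cos(x) cancels badly for small x ... */
    double bad_one_minus_cos(double x)  { return 1.0 - cos(x); }
    /* ... the half-angle identity avoids the subtraction entirely */
    double good_one_minus_cos(double x) { double s = sin(0.5 * x); return 2.0 * s * s; }

    /* sqrt(x+1) - sqrt(x) cancels for large x ... */
    double bad_diff(double x)  { return sqrt(x + 1.0) - sqrt(x); }
    /* ... multiply by the conjugate to trade it for an addition */
    double good_diff(double x) { return 1.0 / (sqrt(x + 1.0) + sqrt(x)); }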
44. Efficiency Issues
- Horner scheme
- Program examples
45. Horner Scheme
- For polynomial evaluation
- Compare efficiency (see the sketch below)
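A sketch of both versions for p(x) = a[0] + a[1]x + ... + a[n]x^n (the function names are mine):

    /* naive evaluation: recomputes x^i for each term, O(n^2) multiplications */
    double eval_naive(const double a[], int n, double x)
    {
        double sum = 0.0;
        for (int i = 0; i <= n; ++i) {
            double p = 1.0;
            for (int j = 0; j < i; ++j)
                p *= x;
            sum += a[i] * p;
        }
        return sum;
    }

    /* Horner scheme: n multiplications and n additions */
    double eval_horner(const double a[], int n, double x)
    {
        double sum = a[n];
        for (int i = n - 1; i >= 0; --i)
            sum = sum * x + a[i];
        return sum;
    }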
46. Accuracy vs. Efficiency
47. Good Coding Practice
48. Storing Multidimensional Arrays in Linear Memory
- C and others: row-major order (elements of a row are contiguous)
- Fortran, MATLAB: column-major order (elements of a column are contiguous)
49. On Accessing Arrays
- Which one is more efficient?
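The slide's two loops are not reproduced; a sketch of the comparison for a row-major C array:

    #define N 1000
    double a[N][N];

    /* row-wise: consecutive j touches consecutive memory (stride 1),
       which is cache-friendly in row-major C */
    double sum_row_wise(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                s += a[i][j];
        return s;
    }

    /* column-wise: consecutive i jumps N doubles each step,
       defeating the cache */
    double sum_column_wise(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                s += a[i][j];
        return s;
    }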
50. Issues of PI
- 3.14 is often not accurate enough
- 4.0*atan(1.0) is a good substitute
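Since atan(1) = π/4, this yields π to the working precision; a small sketch of the check:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double pi = 4.0 * atan(1.0);
        printf("pi       = %.17g\n", pi);        /* 3.1415926535897931 */
        printf("3.14 off = %.3g\n", pi - 3.14);  /* about 1.6e-3       */
        return 0;
    }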
51. Compare
52. Exercise
- Explain why ...
- Explain why ... converges when implemented numerically
53. Exercise
- Why does Me() not work as advertised?
- Construct the 64-bit version of everything
  - the bit examiner
  - Dme()
- 32-bit int and float: can every int be represented by a float (if converted)?
54Understanding Your Platform
1
2
4
4
8
8
16
4
Memory word 4 bytes on 32-bit machines
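The slide's table can be regenerated on any machine; a sketch (the original's exact list of types is an assumption):

    #include <stdio.h>

    int main(void)
    {
        printf("char        %zu\n", sizeof(char));
        printf("short       %zu\n", sizeof(short));
        printf("int         %zu\n", sizeof(int));
        printf("long        %zu\n", sizeof(long));
        printf("long long   %zu\n", sizeof(long long));
        printf("float       %zu\n", sizeof(float));
        printf("double      %zu\n", sizeof(double));
        printf("long double %zu\n", sizeof(long double));
        printf("void *      %zu\n", sizeof(void *));
        return 0;
    }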
55. Padding
56. Data Alignment (data structure padding)
- Padding is only inserted when a structure member is followed by a member with a larger alignment requirement, or at the end of the structure.
- Alignment requirement
57. Ex: Padding
- sizeof(struct MixedData) = 12 bytes
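The struct itself is not shown on this slide; a definition consistent with the stated 12-byte size on a 32-bit platform (an assumption, following the classic example):

    struct MixedData {
        char  Data1;    /* 1 byte, then 1 byte of padding        */
        short Data2;    /* 2 bytes, aligned on a 2-byte boundary */
        int   Data3;    /* 4 bytes, aligned on a 4-byte boundary */
        char  Data4;    /* 1 byte, then 3 bytes of tail padding  */
    };                  /* total: 12 bytes                       */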
58. Data Alignment (cont)
- By changing the ordering of members in a structure, it is possible to change the amount of padding required to maintain alignment.
- Direct the compiler to ignore data alignment (align on a 1-byte boundary): push the current alignment to a stack, pack, then restore it (see the sketch below).
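With the common #pragma pack directive (supported by MSVC, GCC, and Clang, though not part of standard C):

    #pragma pack(push, 1)   /* push current alignment; pack on 1-byte boundary */
    struct PackedData {
        char  Data1;
        short Data2;
        int   Data3;
        char  Data4;
    };                      /* sizeof(struct PackedData) == 8: no padding */
    #pragma pack(pop)       /* restore the saved alignment */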
59. More on Fixed Point Arithmetic