Title: Floating Point Computation
1. Floating Point Computation
2. Contents
- Sources of Computational Error
- Computer Representation of (Floating-Point) Numbers
- Efficiency Issues
3. Sources of Computational Error
- Converting a mathematical problem into a numerical problem introduces errors due to limited computational resources:
  - round-off error (limited precision of representation)
  - truncation error (limited time for computation)
- Misc.
  - Error in the original data
  - Blunder (programming/data-input error)
  - Propagated error
4. Supplement: Error Classification (Hildebrand)
- Gross error: caused by human or mechanical mistakes.
- Round-off error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification.
- Truncation error: any error which is neither a gross error nor a round-off error.
  - Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.
5. Common Measures of Error
- Definitions
  - total error = round-off error + truncation error
  - absolute error = |numerical - exact|
  - relative error = absolute error / |exact|
  - If the exact value is zero, the relative error is not defined.
6. Ex: Round-off Error
- A representation consists of a finite number of digits.
- Implication: the representable real numbers are discrete (more later).
7. Watch out for printf!!
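The slide's own code is not shown; a minimal sketch of the pitfall, assuming IEEE 754 floats, is that the default %f prints only 6 digits and can make an inexact value look exact:

```c
#include <stdio.h>

int main(void)
{
    float  f = 0.1f;           /* 0.1 is not exactly representable */
    double d = 0.1;

    printf("%f\n",    f);      /* 0.100000: looks exact, is not    */
    printf("%.20f\n", f);      /* reveals the value actually stored */
    printf("%.20f\n", d);      /* double: closer, still not exact  */
    return 0;
}
```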
8. Ex: Numerical Differentiation
- Evaluating the first derivative of f(x) with the forward difference f'(x) ~ (f(x+h) - f(x)) / h
- Truncation error: O(h), from the Taylor expansion of f(x+h)
9. Numerical Differentiation (cont.)
- Select a problem with a known answer
  - so that we can evaluate the error!
10. Numerical Differentiation (cont.)
- Error analysis (see the demo below)
  - As h decreases, does the (truncation) error decrease?
  - What happened at h = 0.00001?!
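The slides' test function is not shown; a sketch assuming f(x) = sin(x) at x = 1 (so the exact answer cos(1) is known) reproduces the effect in single precision: the error first shrinks with h, then blows up near h = 0.00001, where round-off in the subtraction dominates the truncation error.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    float x = 1.0f, exact = cosf(1.0f);
    for (int k = 1; k <= 7; k++) {
        float h  = powf(10.0f, (float)-k);
        float fd = (sinf(x + h) - sinf(x)) / h;   /* forward difference */
        printf("h = %.0e   f'(x) ~ %.7f   error = %.2e\n",
               (double)h, (double)fd, (double)fabsf(fd - exact));
    }
    return 0;
}
```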
11. Ex: Polynomial Deflation
- F(x) is a polynomial with 20 real roots.
- Use any method to numerically solve for one root, then deflate the polynomial to 19th degree.
- Solve another root, deflate again, and again, ...
- The accuracy of the roots obtained gets worse each time due to error propagation (see the sketch below).
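A minimal sketch of the deflation step being described (synthetic division by a known root; `deflate` is a hypothetical helper, not the slides' program). Any error in the root r contaminates every coefficient of the quotient, which is why later roots get worse:

```c
#include <stdio.h>

/* Deflate p(x) = a[n]x^n + ... + a[0] by a known root r:       */
/* synthetic division produces the degree n-1 quotient in q[].  */
void deflate(const double *a, int n, double r, double *q)
{
    q[n - 1] = a[n];
    for (int i = n - 2; i >= 0; i--)
        q[i] = a[i + 1] + r * q[i + 1];
}

int main(void)
{
    /* (x - 1)(x - 2) = x^2 - 3x + 2; deflating by the root 1 */
    double a[] = { 2.0, -3.0, 1.0 };   /* a[0] + a[1]x + a[2]x^2 */
    double q[2];
    deflate(a, 2, 1.0, q);
    printf("quotient: %gx + %g\n", q[1], q[0]);   /* x - 2 */
    return 0;
}
```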
12. Computer Representation of Floating-Point Numbers
- Floating point vs. fixed point
- Decimal-binary conversion
- Standard: IEEE 754 (1985)
13. Floating vs. Fixed Point
- Decimal, 6 digits (positive numbers)
  - Fixed point, with 5 digits after the decimal point: 0.00001, ..., 9.99999
  - Floating point, with 2 digits for the exponent (base 10) and 4 digits for the mantissa (accuracy): 0.001x10^-99, ..., 9.999x10^99
- Comparison
  - Fixed point: fixed accuracy; simple math for computation (sometimes used in graphics programs)
  - Floating point: trades accuracy for a larger range of representation
14. Decimal-Binary Conversion
- Ex: 134 (base 10) = 10000110 (base 2)
- Ex: 0.125 (base 10) = 0.001 (base 2)
- Ex: 0.1 (base 10) = 0.000110011001100... (base 2): a non-terminating fraction!
15. Floating Point Representation
- A value is stored as a fraction times a power of the base: f x b^e
- Fraction (mantissa), f
  - usually normalized so that 1/b <= f < 1
- Base, b
  - 2 for personal computers
  - 16 for mainframes
- Exponent, e
16. Understanding Your Platform
17. Padding
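The programs behind these two slides are not in the extracted text; a sketch of the kind of probe they suggest, using only sizeof, shows the platform's type sizes and the padding a compiler may insert between struct members:

```c
#include <stdio.h>

struct mixed {          /* padding may be inserted between members */
    char   c;           /* 1 byte                                  */
    double d;           /* typically aligned to an 8-byte boundary */
};

int main(void)
{
    printf("sizeof(int)    = %zu\n", sizeof(int));
    printf("sizeof(float)  = %zu\n", sizeof(float));
    printf("sizeof(double) = %zu\n", sizeof(double));
    printf("sizeof(struct mixed) = %zu  (often 16, not 9)\n",
           sizeof(struct mixed));
    return 0;
}
```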
18. IEEE 754-1985
- Purpose: make floating-point systems portable
- Defines the number representation, how calculations are performed, exceptions, ...
- Single precision (32-bit)
- Double precision (64-bit)
19. Number Representation
- S: sign of the mantissa
- Range (roughly)
  - Single: 10^-38 to 10^38
  - Double: 10^-307 to 10^307
- Precision (roughly)
  - Single: 7 significant decimal digits
  - Double: 15 significant decimal digits
- Describe how these are obtained.
20. Implication
- When you write your program, make sure the results you print carry the meaningful significant digits.
21. Implicit One
- The mantissa is normalized so that its leading 1 need not be stored, gaining one extra bit of precision.
- Ex: 3.5 = (1.11)_2 x 2^1; only the ".11" part is stored.
22. Exponent Bias
- Ex: in single precision, the exponent has 8 bits
  - 0000 0000 (0) to 1111 1111 (255)
- Add an offset to represent positive and negative exponents
  - effective exponent = biased exponent - bias
  - bias value: 32-bit (127), 64-bit (1023)
- Ex: 32-bit
  - 1000 0000 (128): effective exp. = 128 - 127 = 1
23. Ex: Convert 3.5 to a 32-bit FP Number
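The worked steps (the slide's figure is reconstructed here from the IEEE 754 rules above):

  3.5 = (11.1)_2 = (1.11)_2 x 2^1
  - sign            S = 0
  - biased exponent E = 1 + 127 = 128 = 1000 0000
  - stored mantissa M = 110 0000 0000 0000 0000 0000 (leading 1 implicit)
  - bit pattern: 0 10000000 11000000000000000000000 = 0x40600000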
24. Examine Bits of FP Numbers
- Explain how this program works (a sketch follows).
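The program itself is not in the extracted text; a minimal bit examiner in the same spirit, assuming a 32-bit unsigned int and IEEE 754 floats, reinterprets the bits through a union and prints them from the most significant bit down:

```c
#include <stdio.h>

union bits { float f; unsigned int u; };   /* same 32 bits, two views */

void examine(float x)
{
    union bits b;
    b.f = x;
    for (int i = 31; i >= 0; i--) {
        putchar((b.u >> i) & 1u ? '1' : '0');
        if (i == 31 || i == 23) putchar(' ');   /* sign | exp | frac */
    }
    printf("  (%g)\n", x);
}

int main(void)
{
    examine(3.5f);
    examine(-0.1f);
    return 0;
}
```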
25. The Examiner
- Use the previous program to
  - observe how ME works
  - test subnormal behaviors on your computer/compiler
  - convince yourself why the subtraction of two nearly equal numbers produces lots of error
  - NaN: Not-a-Number!?
26. Design Philosophy of IEEE 754
- Field order: s, e, m
  - S first: whether the number is +/- can be tested easily
  - E before M: simplifies sorting
- Negative exponents are represented by a bias (not 2's complement) for ease of sorting
  - biased rep.: -1, 0, 1 -> 126, 127, 128
  - 2's compl.: -1, 0, 1 -> 0xFF, 0x00, 0x01
  - 2's complement would need more complicated math for sorting and increment/decrement
27. Exceptions
- Overflow
  - +/-INF: when a number exceeds the range of representation
- Underflow
  - when numbers are too close to zero, they are treated as zeroes
- Dwarf
  - the smallest representable number in the FP system
- Machine Epsilon (ME)
  - a number with computational significance (more later)
28. Extremities (more later)
- E = (11...1)
  - M = (00...0): infinity
  - M not all zeros: NaN (Not a Number)
- E = (00...0)
  - M = (00...0): clean zero
  - M not all zeros: dirty zero (see next page)
29. Not-a-Number
- Numerical exceptions
  - sqrt of a negative number
  - invalid domain of (inverse) trigonometric functions, e.g. asin(2.0)
  - ...
- Often causes the program to stop running
30. Extremities (32-bit)
- Max: (1.111...1)_2 x 2^(254-127) = (10 - 0.00...01)_2 x 2^127 ~ 2^128
- Min (w/o stepping into dirty zero): (1.000...0)_2 x 2^(1-127) = 2^-126
31. Dirty Zero (a.k.a. denormals)
- a.k.a. = also known as
- No implicit one
- IEEE 754 did not specify compatibility for denormals.
- If you are not sure how to handle them, stay away from them. Scale your problem properly.
- Many problems can be solved by pretending that they do not exist.
32. Dirty Zero (cont.)
- 2^-126: 00000000 10000000 00000000 00000000
- 2^-127: 00000000 01000000 00000000 00000000
- 2^-128: 00000000 00100000 00000000 00000000
- 2^-129: 00000000 00010000 00000000 00000000
- (Dwarf: the smallest representable number)
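A short sketch for testing this on your own machine (volatile forces each result into a 32-bit float; note that compilers in flush-to-zero/fast-math mode may print 0 instead):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    volatile float x = FLT_MIN;    /* smallest normalized: 2^-126      */
    float prev = x;
    printf("FLT_MIN = %e\n", (double)x);
    while (x > 0.0f) {             /* halve down through the denormals */
        prev = x;
        x /= 2.0f;                 /* eventually rounds to zero        */
    }
    printf("dwarf   = %e\n", (double)prev);   /* 2^-149 ~ 1.4e-45 */
    return 0;
}
```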
33. Dwarf (32-bit)
- Value = 2^-149
34. Machine Epsilon (ME)
- Definition: the smallest non-zero number that makes a difference when added to 1.0 on your working platform
- This is not the same as the dwarf.
35. Computing ME (32-bit)
- Start from 1 + eps and keep halving eps, getting closer to 1.0.
- ME = (00111111 10000000 00000000 00000001) - 1.0 = 2^-23 ~ 1.19 x 10^-7
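A sketch of the halving loop (not necessarily the slides' Me( ) routine; volatile forces the sum into a 32-bit float, avoiding wider x87 temporaries, which is one reason a naive version may "not work as advertised" in the later exercise):

```c
#include <stdio.h>

int main(void)
{
    volatile float sum;          /* force a genuine 32-bit store */
    float eps = 1.0f;

    for (;;) {
        sum = 1.0f + eps / 2.0f;
        if (sum <= 1.0f) break;  /* eps/2 no longer makes a difference */
        eps /= 2.0f;
    }
    printf("ME = %e  (2^-23 ~ 1.19e-07)\n", (double)eps);
    return 0;
}
```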
36. Effect of ME
37. Significance of ME
- Never terminate an iteration on a test that two FP numbers are equal.
- Instead, test whether |x - y| < ME (see the sketch below).
38. Numerical Scaling
- Number density: there are as many IEEE 754 numbers between 1.0 and 2.0 as there are between 256 and 512.
- Revisit
  - round-off error
  - ME: a measure of the number density near 1.0
- Implication
  - Scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense and where the round-off error is smallest).
39. Scaling (cont.)
- Performing computation on denser portions of the real line minimizes the round-off error.
- But don't overdo it: switching to double precision will increase the precision far more easily.
- The densest part is near the subnormals, if density is defined as numbers per unit length.
40. How Subtraction Is Performed on Your PC
- Steps (a worked example follows)
  - convert to base 2
  - equalize the exponents by adjusting the mantissa values; truncate the bits that do not fit
  - subtract the mantissas
  - normalize the result
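A worked trace of these steps, assuming a hypothetical 4-bit binary mantissa (real hardware keeps guard digits and does better; this only illustrates the textbook steps):

  x = 1.0000 x 2^1,  y = 1.1110 x 2^-2      (x = 2, y = 0.46875)
  align:     y -> 0.0011110 x 2^1 -> 0.0011 x 2^1  (low bits truncated)
  subtract:  1.0000 - 0.0011 = 0.1101  (x 2^1)
  normalize: 1.101 x 2^0 = 1.625       (exact answer: 1.53125)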
41. Subtraction of Nearly Equal Numbers
- Significant loss of accuracy (most bits are unreliable)
42. Theorem of Loss of Precision
- Let x and y be normalized floating-point machine numbers with x > y > 0.
- If 2^-p <= 1 - y/x <= 2^-q,
- then at most p and at least q significant binary bits are lost in the subtraction x - y.
- Interpretation
  - When two numbers are very close, their subtraction introduces a lot of numerical error.
43. Implications
- Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible.
- You should write algebraically equivalent forms instead (one example follows).
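The slide's own rewritten expressions are not in the extracted text; one standard instance of the rewrite is 1 - cos(x) for small x, which subtracts nearly equal numbers, versus the identity 2 sin^2(x/2), which does not:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 1.0e-7;
    double s = sin(x / 2.0);
    double bad  = 1.0 - cos(x);   /* cancellation: operands nearly equal */
    double good = 2.0 * s * s;    /* algebraically identical, stable     */
    printf("1 - cos(x)   = %.17e\n", bad);
    printf("2 sin^2(x/2) = %.17e\n", good);
    return 0;
}
```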
44. Efficiency Issues
- Horner scheme
- Program examples
45. Horner Scheme
- For polynomial evaluation: a_n x^n + ... + a_1 x + a_0 = (...(a_n x + a_(n-1)) x + ...) x + a_0
- Compare efficiency (see the sketch below).
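A sketch of the comparison (function names are illustrative): the naive form recomputes each power of x, costing O(n^2) multiplications, while Horner's nested form needs only n multiplications and n additions.

```c
#include <stdio.h>

/* Naive: recompute x^i for every term */
double eval_naive(const double *a, int n, double x)
{
    double sum = 0.0;
    for (int i = 0; i <= n; i++) {
        double xi = 1.0;
        for (int j = 0; j < i; j++) xi *= x;
        sum += a[i] * xi;
    }
    return sum;
}

/* Horner: n multiplications, n additions */
double eval_horner(const double *a, int n, double x)
{
    double sum = a[n];
    for (int i = n - 1; i >= 0; i--)
        sum = sum * x + a[i];
    return sum;
}

int main(void)
{
    double a[] = { 2.0, -3.0, 0.0, 1.0 };   /* x^3 - 3x + 2 */
    printf("naive : %g\n", eval_naive(a, 3, 2.0));    /* 4 */
    printf("horner: %g\n", eval_horner(a, 3, 2.0));   /* 4 */
    return 0;
}
```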
46. Accuracy vs. Efficiency
47. Good Coding Practice
48. On Arrays
49. Issues of PI
- 3.14 is often not accurate enough.
- 4.0*atan(1.0) is a good substitute.
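A minimal illustration of this tip (since atan(1) = pi/4, the expression yields pi to full double precision):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double pi = 4.0 * atan(1.0);   /* atan(1) = pi/4 */
    printf("pi            = %.17f\n", pi);
    printf("error of 3.14 = %.5f\n", fabs(pi - 3.14));
    return 0;
}
```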
50. Compare
51. Exercise
- Explain why.
- Explain why it converges when implemented numerically.
52. Exercise
- Why does Me( ) not work as advertised?
- Construct the 64-bit version of everything:
  - bit examiner
  - Dme( )
- 32-bit int and float: can every int be represented by a float (if converted)?
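A starting point for the last item (a hint, not the full answer): a 32-bit float has only a 24-bit significand, so some 32-bit ints cannot survive the round trip.

```c
#include <stdio.h>

int main(void)
{
    int   i = 16777217;            /* 2^24 + 1                       */
    float f = (float)i;            /* float has a 24-bit significand */
    printf("int   = %d\n", i);
    printf("float = %.1f\n", (double)f);            /* 16777216.0 */
    printf("round trip equal: %d\n", (int)f == i);  /* 0 */
    return 0;
}
```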