Title: Floating Point Computation
1. Floating Point Computation
2. Contents
- Sources of Computational Error
- Computer Representation of (Floating-Point) Numbers
- Efficiency Issues
3. Sources of Computational Error
- Converting a mathematical problem to a numerical problem introduces errors due to limited computational resources:
  - round-off error (limited precision of representation)
  - truncation error (limited time for computation)
- Misc.
  - Error in the original data
  - Blunder: a mistake made through stupidity, ignorance, or carelessness, e.g., a programming or data-input error
  - Propagated error
4. Supplement: Error Classification (Hildebrand)
- Gross error: caused by human or mechanical mistakes.
- Round-off error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification.
- Truncation error: any error which is neither a gross error nor a round-off error.
  - Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.
5. Common Measures of Error
- Definitions
  - total error = round-off error + truncation error
  - absolute error = |numerical - exact|
  - relative error = absolute error / |exact|
  - If the exact value is zero, the relative error is not defined.
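A small illustration of these definitions in C (the numbers here are hypothetical, not from the slides):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double exact  = 1.0 / 3.0;   /* value we are approximating */
        double approx = 0.333;       /* three-digit approximation  */
        double abs_err = fabs(approx - exact);
        double rel_err = abs_err / fabs(exact);  /* undefined if exact == 0 */
        printf("absolute error = %g\n", abs_err);
        printf("relative error = %g\n", rel_err);
        return 0;
    }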
6. Ex: Round-off Error
- A representation consists of a finite number of digits.
- The approximation of real numbers on the number line is discrete!
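A minimal sketch of that discreteness, using C99's nextafterf to find the float immediately after 1.0:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float x = 1.0f;
        float next = nextafterf(x, 2.0f);        /* the very next float */
        printf("gap at 1.0 = %.10e\n", next - x); /* 2^-23, ~1.19e-07   */
        return 0;
    }

Between any two adjacent floats, every real number is rounded to one of the two.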
7. Watch out for printf!!
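The slide's program is not reproduced here; the usual pitfall it warns about is that printf's default %f format rounds to 6 digits and can make an inexact value look exact:

    #include <stdio.h>

    int main(void)
    {
        float x = 0.1f;
        printf("%f\n", x);      /* 0.100000 -- looks exact               */
        printf("%.20f\n", x);   /* 0.10000000149011611938 -- it is not   */
        return 0;
    }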
8. Ex: Numerical Differentiation
- Evaluating the first derivative of f(x), e.g. with the forward difference f'(x) ≈ (f(x+h) - f(x)) / h
- Truncation error: by Taylor's theorem, f(x+h) = f(x) + h f'(x) + (h²/2) f''(ξ), so the formula is off by (h/2) f''(ξ)
9. Numerical Differentiation (cont)
- Select a problem with a known answer
  - so that we can evaluate the error!
10. Numerical Differentiation (cont)
- Error analysis (see the experiment below)
  - As h decreases, the (truncation) error decreases.
  - What happened at h = 0.00001?!
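A sketch of such an experiment, assuming the test problem is f(x) = sin(x) with known derivative cos(x) (the slide's actual program and test function are not shown):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float x = 1.0f;
        float exact = cosf(x);                  /* known answer */
        for (float h = 0.1f; h > 1e-7f; h /= 10.0f) {
            float est = (sinf(x + h) - sinf(x)) / h;  /* forward difference */
            printf("h = %.0e   error = %.3e\n", h, fabsf(est - exact));
        }
        return 0;
    }

The error shrinks with h at first, then grows again once round-off in the subtraction sinf(x+h) - sinf(x) dominates.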
11. Ex: Polynomial Deflation
- F(x) is a polynomial with 20 real roots.
- Use any method to numerically solve for a root, then deflate the polynomial to 19th degree.
- Solve another root, deflate again, and again, ...
- The accuracy of the roots obtained gets worse each time due to error propagation (see the sketch below).
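Deflation itself is just synthetic division by (x - r); a minimal sketch (the names and layout are mine, not the slide's):

    /* Divide p(x) = p[0] + p[1]x + ... + p[n]x^n by (x - r).
       The quotient q has degree n-1; if r is an accurate root,
       the remainder p[0] + r*q[0] is nearly zero.  Any error in r
       contaminates q, and therefore every root found afterwards. */
    void deflate(const double p[], int n, double r, double q[])
    {
        q[n - 1] = p[n];
        for (int i = n - 2; i >= 0; --i)
            q[i] = p[i + 1] + r * q[i + 1];
    }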
12. Computer Representation of Floating-Point Numbers
- Decimal-binary conversion
- Floating point vs. fixed point
- Standard: IEEE 754 (1985)
13. Decimal-Binary Conversion
- 29₁₀ = 11101₂
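The integer part converts by repeated division by 2, collecting remainders; a small sketch of the standard method:

    #include <stdio.h>

    /* Print a non-negative integer in binary: repeatedly divide by 2;
       the remainders are the bits, least significant first. */
    int main(void)
    {
        unsigned n = 29;
        char bits[32];
        int k = 0;
        do {
            bits[k++] = '0' + n % 2;
            n /= 2;
        } while (n > 0);
        while (k--) putchar(bits[k]);   /* reverse: print MSB first */
        putchar('\n');                  /* 29 -> 11101 */
        return 0;
    }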
14. Fraction-Binary Conversion
- Repeatedly multiply the fractional part by 2; the integer part produced at each step is the next binary digit a₁, a₂, a₃, ...
- Worked example on the slide: a₁ = 1, a₂ = 1, a₃ = 1, a₄ = a₅ = 0
15. 0.625₁₀ = 0.101₂
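A sketch of the multiply-by-2 method in code (my own example value; note that many decimal fractions, e.g. 0.9, never terminate in binary):

    #include <stdio.h>

    /* Print the first 10 binary digits of a fraction 0 <= f < 1
       by repeated multiplication by 2. */
    int main(void)
    {
        double f = 0.625;
        printf("0.");
        for (int i = 0; i < 10; ++i) {
            f *= 2.0;
            if (f >= 1.0) { putchar('1'); f -= 1.0; }
            else          { putchar('0'); }
        }
        putchar('\n');              /* 0.625 -> 0.1010000000 */
        return 0;
    }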
16. Floating vs. Fixed Point
- Decimal, 6 digits (positive numbers)
  - fixed point, with 5 digits after the decimal point: 0.00001, ..., 9.99999
  - floating point, 2 digits for the exponent (base 10) and 4 digits for the mantissa (accuracy): 0.001×10^-99, ..., 9.999×10^99
- Comparison
  - Fixed point: fixed accuracy; simple math for computation (used in systems w/o an FPU)
  - Floating point: trades accuracy for a much larger range of representation
17. Floating-Point Representation
- A number is represented as ±f × b^e
- Fraction (mantissa), f
  - usually normalized so that the leading digit is nonzero (for base 2, 1 ≤ f < 2)
- Base, b
  - 2 for personal computers
  - 16 for mainframes
- Exponent, e
18. IEEE 754-1985
- Purpose: make floating-point systems portable
- Defines the number representation, how calculations are performed, exceptions, ...
- Single precision (32-bit)
- Double precision (64-bit)
19. Number Representation
- Fields: s (sign of mantissa), e (exponent), m (mantissa); single precision uses 1 + 8 + 23 bits, double precision 1 + 11 + 52 bits
- Range (roughly)
  - Single: 10^-38 to 10^38
  - Double: 10^-307 to 10^307
- Precision (roughly)
  - Single: 7-8 significant decimal digits
  - Double: 15 significant decimal digits
20. Significant Digits
- In the binary sense, 24 bits are significant (including the implicit one; next page)
- In the decimal sense, roughly 7-8 significant decimal digits (2^24 ≈ 1.7 × 10^7)
- When you write your program, make sure the results you print carry the meaningful number of significant digits.
21. Implicit One
- The normalized mantissa is always ≥ 1.0 (and < 2.0)
- Only the fractional part is stored, gaining one extra bit of precision
- Ex: 3.5 = (1.11)₂ × 2^1; only the bits "11" go into the mantissa field
22. Exponent Bias
- Ex: in single precision, the exponent has 8 bits
  - 0000 0000 (0) to 1111 1111 (255)
- An offset is added so that both + and - exponents can be represented
- Effective exponent = biased exponent - bias
- Bias value: 127 (32-bit), 1023 (64-bit)
- Ex (32-bit): 1000 0000 (128) → effective exponent = 128 - 127 = 1
23. Ex: Convert 3.5 to a 32-bit FP Number
- 3.5 = (11.1)₂ = (1.11)₂ × 2^1
- sign s = 0; biased exponent = 1 + 127 = 128 = 1000 0000; mantissa field = 1100...0
- Result: 0 10000000 11000000000000000000000 (hex 0x40600000)
24. Examine Bits of FP Numbers
- Explain how this program works.
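The slide's program itself is not reproduced here; a minimal sketch of such a bit examiner, assuming a union of a float and a 32-bit unsigned int:

    #include <stdio.h>

    /* Print the sign, exponent, and mantissa bits of a float by
       aliasing it with an unsigned int of the same size. */
    int main(void)
    {
        union { float f; unsigned u; } x;
        x.f = 3.5f;
        for (int i = 31; i >= 0; --i) {
            putchar(((x.u >> i) & 1) ? '1' : '0');
            if (i == 31 || i == 23) putchar(' ');  /* separate s | e | m */
        }
        putchar('\n');   /* 3.5 -> 0 10000000 11000000000000000000000 */
        return 0;
    }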
25. The Examiner
- Use the previous program to
  - observe how ME works
  - test subnormal behaviors on your computer/compiler
  - convince yourself why the subtraction of two nearly equal numbers produces lots of error
  - NaN: Not-a-Number!?
26. Design Philosophy of IEEE 754
- Field order: s, e, m
  - S first: whether the number is +/- can be tested easily
  - E before M: simplifies sorting
- Negative exponents are represented by a bias (not 2's complement) for ease of sorting
  - biased rep.: -1, 0, +1 → 126, 127, 128
  - 2's compl.: -1, 0, +1 → 0xFF, 0x00, 0x01, which needs more complicated math for sorting and increment/decrement
27. Exceptions
- Overflow
  - ±INF: when a number exceeds the range of representation
- Underflow
  - when numbers are too close to zero, they are treated as zeros
- Dwarf
  - the smallest representable number in the FP system
- Machine Epsilon (ME)
  - a number with computational significance (more later)
28. Extremities (more later)
- E = 11...1 (all ones)
  - M = 00...0: ±infinity
  - M not all zeros: NaN (Not a Number)
- E = 00...0 (all zeros)
  - M = 00...0: clean zero
  - M not all zeros: dirty zero (see next page)
29. Not-a-Number
- Numerical exceptions
  - sqrt of a negative number
  - invalid domain for trigonometric functions, ...
- Often causes the program to stop running
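A few operations that produce NaN or INF on an IEEE 754 platform (a small sketch; the exact output text varies by C library):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        printf("%f\n", sqrt(-1.0));   /* nan: invalid operand         */
        printf("%f\n", asin(2.0));    /* nan: argument outside [-1,1] */
        printf("%f\n", 1.0 / 0.0);    /* inf                          */
        printf("%f\n", 0.0 / 0.0);    /* nan: indeterminate form      */
        return 0;
    }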
30. Extremities (32-bit)
- Max: (1.111...1)₂ × 2^(254-127) = (2 - 2^-23) × 2^127 ≈ 2^128
- Min (w/o stepping into dirty zero): (1.000...0)₂ × 2^(1-127) = 2^-126
31. Dirty Zero (a.k.a. denormals)
- a.k.a. = also known as
- No implicit one
- Hardware/compiler support for denormals is not consistent across platforms
- If you are not sure how to handle them, stay away from them; scale your problem properly
- Many problems can be solved by pretending they do not exist
32. Dirty Zero (cont)
  2^-126 : 00000000 10000000 00000000 00000000
  2^-127 : 00000000 01000000 00000000 00000000
  2^-128 : 00000000 00100000 00000000 00000000
  2^-129 : 00000000 00010000 00000000 00000000
(Dwarf: the smallest representable number)
33. Dwarf (32-bit)
- 00000000 00000000 00000000 00000001
- Value: 2^-23 × 2^-126 = 2^-149
34. Machine Epsilon (ME)
- Definition: the smallest non-zero number that makes a difference when added to 1.0 on your working platform
- This is not the same as the dwarf!
35. Computing ME (32-bit)
- Keep halving eps as long as 1 + eps still differs from 1.0 (i.e., 1 + eps keeps getting closer to 1.0)
- ME = (00111111 10000000 00000000 00000001)₂ - 1.0 = 2^-23 ≈ 1.19 × 10^-7
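A sketch of the usual loop, assuming floats; note that platforms which keep intermediates in extended precision can make a naive version misbehave (see the exercise on slide 53):

    #include <stdio.h>

    int main(void)
    {
        float eps = 1.0f;
        /* halve eps while adding half of it to 1.0f still changes the sum */
        while ((float)(1.0f + eps / 2.0f) > 1.0f)
            eps /= 2.0f;
        printf("ME = %e\n", eps);   /* about 1.19e-07 = 2^-23 */
        return 0;
    }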
36. Effect of ME
37. Significance of ME
- Never terminate an iteration by testing whether two FP numbers are equal.
- Instead, test whether |x - y| < ME.
38. Numerical Scaling
- Number density: there are as many IEEE 754 numbers between 1.0 and 2.0 as there are between 256 and 512
- Revisit
  - round-off error
  - ME: a measure of real-number density near 1.0
- Implication
  - Scale your problem so that intermediate results lie between 1.0 and 2.0 (where the numbers are dense and the round-off error is smallest)
39. Scaling (cont)
- Performing computation on denser portions of the real line minimizes the round-off error,
- but don't overdo it: switching to double precision is an easier way to increase precision.
- The densest part is near the subnormals, if density is defined as numbers per unit length.
40. How Subtraction Is Performed on Your PC
- Steps (see the sketch after this list)
  - convert both operands to base 2
  - equalize the exponents by adjusting the mantissa values; truncate the bits that do not fit
  - subtract the mantissas
  - normalize the result
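The exponent-equalization step is where bits get lost; a tiny demonstration (my own numbers):

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f, b = 1e-8f;
        /* To align exponents, b's mantissa is shifted right ~27 places,
           past the end of the 24-bit mantissa, so all its bits are lost. */
        printf("%d\n", (a + b) == a);   /* prints 1 */
        return 0;
    }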
41. Subtraction of Nearly Equal Numbers
- Significant loss of accuracy (most bits are unreliable)
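A demonstration of the cancellation (the values are my own; with ~7 significant decimal digits in a float, only about 3 survive here):

    #include <stdio.h>

    int main(void)
    {
        float x = 1.0002f, y = 1.0001f;
        float d = x - y;           /* exact answer: 0.0001          */
        printf("%.10f\n", d);      /* only the leading digits match */
        return 0;
    }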
42. Theorem of Loss of Precision
- Let x, y be normalized floating-point machine numbers with x > y > 0.
- If 2^-p ≤ 1 - y/x ≤ 2^-q,
- then at most p and at least q significant binary bits are lost in the subtraction x - y.
- Interpretation
  - When two numbers are very close, their subtraction introduces a lot of numerical error.
43Implications
- You should write these instead
Every FP operation introduces error, but the
subtraction of nearly equal numbers is the worst
and should be avoided whenever possible
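The slide's own rewrites are not shown; two classic examples of the idea (assumed, not from the source):

    #include <math.h>

    /* 1 - cos(x) cancels badly for small x ... */
    double bad_one_minus_cos(double x)  { return 1.0 - cos(x); }
    /* ... the half-angle identity avoids the subtraction entirely */
    double good_one_minus_cos(double x) { double s = sin(0.5 * x); return 2.0 * s * s; }

    /* sqrt(x+1) - sqrt(x) cancels for large x ... */
    double bad_diff(double x)  { return sqrt(x + 1.0) - sqrt(x); }
    /* ... multiply by the conjugate to trade it for an addition */
    double good_diff(double x) { return 1.0 / (sqrt(x + 1.0) + sqrt(x)); }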
44. Efficiency Issues
- Horner scheme
- Program examples
45. Horner Scheme
- For polynomial evaluation
- Compare efficiency (see the sketch below)
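A sketch of both versions for p(x) = a[0] + a[1]x + ... + a[n]x^n (the function names are mine):

    /* naive evaluation: recomputes x^i for each term, O(n^2) multiplications */
    double eval_naive(const double a[], int n, double x)
    {
        double sum = 0.0;
        for (int i = 0; i <= n; ++i) {
            double p = 1.0;
            for (int j = 0; j < i; ++j)
                p *= x;
            sum += a[i] * p;
        }
        return sum;
    }

    /* Horner scheme: n multiplications and n additions */
    double eval_horner(const double a[], int n, double x)
    {
        double sum = a[n];
        for (int i = n - 1; i >= 0; --i)
            sum = sum * x + a[i];
        return sum;
    }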
46. Accuracy vs. Efficiency
47. Good Coding Practice
48. Storing Multidimensional Arrays in Linear Memory
- C and others: row-major order (elements of a row are contiguous)
- Fortran, MATLAB: column-major order (elements of a column are contiguous)
49. On Accessing Arrays
- Which one is more efficient?
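The slide's two loops are not reproduced; a sketch of the comparison for a row-major C array:

    #define N 1000
    double a[N][N];

    /* row-wise: consecutive j touches consecutive memory (stride 1),
       which is cache-friendly in row-major C */
    double sum_row_wise(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                s += a[i][j];
        return s;
    }

    /* column-wise: consecutive i jumps N doubles each step,
       defeating the cache */
    double sum_column_wise(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                s += a[i][j];
        return s;
    }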
50. Issues of PI
- 3.14 is often not accurate enough
- 4.0*atan(1.0) is a good substitute
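Since atan(1) = π/4, this yields π to the working precision; a small sketch of the check:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double pi = 4.0 * atan(1.0);
        printf("pi       = %.17g\n", pi);        /* 3.1415926535897931 */
        printf("3.14 off = %.3g\n", pi - 3.14);  /* about 1.6e-3       */
        return 0;
    }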
51. Compare
52. Exercise
- Explain why ...
- Explain why ... converges when implemented numerically
53. Exercise
- Why does Me() not work as advertised?
- Construct the 64-bit version of everything
  - the bit examiner
  - Dme()
- 32-bit int and float: can every int be represented by a float (if converted)?
54Understanding Your Platform
1
2
4
4
8
8
16
4
Memory word 4 bytes on 32-bit machines
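The slide's table can be regenerated on any machine; a sketch (the original's exact list of types is an assumption):

    #include <stdio.h>

    int main(void)
    {
        printf("char        %zu\n", sizeof(char));
        printf("short       %zu\n", sizeof(short));
        printf("int         %zu\n", sizeof(int));
        printf("long        %zu\n", sizeof(long));
        printf("long long   %zu\n", sizeof(long long));
        printf("float       %zu\n", sizeof(float));
        printf("double      %zu\n", sizeof(double));
        printf("long double %zu\n", sizeof(long double));
        printf("void *      %zu\n", sizeof(void *));
        return 0;
    }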
55. Padding
56. Data Alignment (data structure padding)
- Padding is only inserted when a structure member is followed by a member with a larger alignment requirement, or at the end of the structure.
- Alignment requirement
57. Ex: Padding
- sizeof(struct MixedData) = 12 bytes
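The struct itself is not shown on this slide; a definition consistent with the stated 12-byte size on a 32-bit platform (an assumption, following the classic example):

    struct MixedData {
        char  Data1;    /* 1 byte, then 1 byte of padding        */
        short Data2;    /* 2 bytes, aligned on a 2-byte boundary */
        int   Data3;    /* 4 bytes, aligned on a 4-byte boundary */
        char  Data4;    /* 1 byte, then 3 bytes of tail padding  */
    };                  /* total: 12 bytes                       */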
58. Data Alignment (cont)
- By changing the ordering of members in a structure, it is possible to change the amount of padding required to maintain alignment.
- Direct the compiler to ignore data alignment (align on a 1-byte boundary): push the current alignment to a stack, pack, then restore it (see the sketch below).
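With the common #pragma pack directive (supported by MSVC, GCC, and Clang, though not part of standard C):

    #pragma pack(push, 1)   /* push current alignment; pack on 1-byte boundary */
    struct PackedData {
        char  Data1;
        short Data2;
        int   Data3;
        char  Data4;
    };                      /* sizeof(struct PackedData) == 8: no padding */
    #pragma pack(pop)       /* restore the saved alignment */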
59. More on Fixed Point Arithmetic