1
Floating Point Computation
  • Jyun-Ming Chen

2
Contents
  • Sources of Computational Error
  • Computer Representation of Floating-Point Numbers
  • Efficiency Issues

3
Sources of Computational Error
  • In converting a mathematical problem to a
    numerical problem, one introduces errors due to
    limited computation resources
  • Round-off error (limited precision of
    representation)
  • Truncation error (limited time for computation)
  • Misc.
  • Error in original data
  • Blunder: a mistake made through stupidity,
    ignorance, or carelessness, e.g. a programming or
    data-input error
  • Propagated error

4
Supplement Error Classification (Hildebrand)
  • Gross error: caused by human or mechanical
    mistakes
  • Roundoff error: the consequence of using a number
    specified by n correct digits to approximate a
    number which requires more than n digits
    (generally infinitely many digits) for its exact
    specification.
  • Truncation error: any error which is neither a
    gross error nor a roundoff error.
  • Frequently, a truncation error corresponds to the
    fact that, whereas an exact result would be
    afforded (in the limit) by an infinite sequence
    of steps, the process is truncated after a
    certain finite number of steps.

5
Common Measures of Error
  • Definitions
  • total error = roundoff error + truncation error
  • Absolute error = |numerical - exact|
  • Relative error = absolute error / |exact|
  • If exact is zero, rel. error is not defined

6
Ex: Round-off Error
  • A representation consists of a finite number of
    digits
  • The approximation of real numbers on the number
    line is discrete!

7
Watch out for printf !!
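A minimal C sketch of the kind of surprise meant here (the literals are
illustrative only): the default %f format can hide the fact that the
stored value is not exactly what was written.

  #include <stdio.h>

  int main(void)
  {
      float x = 0.1f;        /* 0.1 has no exact binary representation      */
      printf("%f\n", x);     /* prints 0.100000 -- looks exact              */
      printf("%.20f\n", x);  /* prints the stored value, 0.10000000149...   */
      return 0;
  }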
8
Ex: Numerical Differentiation
  • Evaluating the first derivative of f(x) with a
    finite-difference approximation

(The finite-difference formula carries a truncation error.)
9
Numerical Differentiation (cont)
  • Select a problem with known answer
  • So that we can evaluate the error!

10
Numerical Differentiation (cont)
  • Error analysis
  • As h decreases, the truncation error decreases
  • What happened at h = 0.00001?! (for very small h,
    round-off error in the subtraction takes over;
    see the sketch below)
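A minimal sketch of the experiment, assuming the forward-difference
formula and the test function f(x) = sin(x) at x = 1 (both are
assumptions, not necessarily the slide's choices):

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double x = 1.0;
      double exact = cos(x);   /* known answer: d/dx sin(x) = cos(x) */
      for (double h = 1e-1; h > 1e-12; h /= 10.0) {
          double approx = (sin(x + h) - sin(x)) / h;   /* forward difference */
          printf("h = %.0e   abs. error = %.3e\n", h, fabs(approx - exact));
      }
      return 0;
  }

The printed error shrinks with h at first, then grows again once
round-off in the subtraction sin(x+h) - sin(x) dominates.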

11
Ex: Polynomial Deflation
  • F(x) is a polynomial with 20 real roots
  • Use any method to numerically solve for a root,
    then deflate the polynomial to degree 19
  • Solve for another root, and deflate again, and
    again, ...
  • The accuracy of the roots obtained gets worse
    each time due to error propagation (a deflation
    sketch follows below)
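A minimal sketch of one deflation step by synthetic division (an
illustration; the coefficient ordering p[0] = constant term and the
function name are assumptions):

  /* Divide p(x) (degree n, coefficients p[0..n]) by (x - r), where r is
     an already-computed root.  q receives the degree n-1 quotient; the
     return value is the remainder, which equals p(r). */
  double deflate(const double *p, int n, double r, double *q)
  {
      double b = p[n];
      for (int k = n - 1; k >= 0; --k) {
          q[k] = b;            /* coefficient of x^k in the quotient */
          b = p[k] + r * b;    /* Horner-style accumulation          */
      }
      return b;                /* remainder = p(r)                   */
  }

Because the computed root r carries numerical error, the coefficients of
q inherit that error, so each later root is found from a slightly wrong
polynomial.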

12
Computer Representation of Floating Point Numbers
  • Decimal-binary conversion
  • Floating point vs. fixed point
  • Standard IEEE 754 (1985)

13
Decimal-Binary Conversion
  • Ex: 29 (base 10)

29₁₀ = 11101₂
14
Fraction Binary Conversion
  • Ex: 0.625 (base 10)

0.625 × 2 = 1.25  →  a1 = 1
0.25  × 2 = 0.5   →  a2 = 0
0.5   × 2 = 1.0   →  a3 = 1
a4 = a5 = 0
15
  • Computing the expansion digit by digit gives
    0.625₁₀ = 0.101₂
  • How about 0.1₁₀? (Its binary expansion never
    terminates; see the sketch below)
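A minimal C sketch of the multiply-by-2 procedure (an illustration, not
the slide's own code); it shows 0.625 terminating after three digits and
0.1 repeating indefinitely:

  #include <stdio.h>

  /* Print the first `digits` binary digits of a fraction 0 <= x < 1. */
  static void frac_to_binary(double x, int digits)
  {
      printf("0.");
      for (int i = 0; i < digits; ++i) {
          x *= 2.0;
          if (x >= 1.0) { putchar('1'); x -= 1.0; }
          else          { putchar('0'); }
      }
      putchar('\n');
  }

  int main(void)
  {
      frac_to_binary(0.625, 12);   /* 0.101000000000 -- terminates   */
      frac_to_binary(0.1,   12);   /* 0.000110011001 -- repeats      */
      return 0;
  }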
16
Floating vs. Fixed Point
  • Decimal, 6 digits (positive numbers)
  • Fixed point with 5 digits after the decimal point
  • 0.00001, ..., 9.99999
  • Floating point: 2 digits for the exponent (base
    10) + 4 digits for the mantissa (accuracy)
  • 0.001×10^-99, ..., 9.999×10^99
  • Comparison
  • Fixed point: fixed accuracy; simple math for
    computation (used in systems w/o FPU)
  • Floating point: trades accuracy for a larger
    range of representation

17
Floating Point Representation
  • A floating-point number has the form ±f × b^e
  • Fraction (mantissa), f
  • Usually normalized so that the leading digit of f
    is nonzero
  • Base, b
  • 2 for personal computers
  • 16 for mainframes
  • Exponent, e

18
IEEE 754-1985
  • Purpose: make floating-point systems portable
  • Defines the number representation, how
    calculations are performed, exceptions, ...
  • Single-precision (32-bit)
  • Double-precision (64-bit)

19
Number Representation
  • Fields: S (sign, 1 bit), E (exponent, 8 bits
    single / 11 bits double), M (mantissa, 23 bits
    single / 52 bits double)
  • S: sign of mantissa
  • Range (roughly)
  • Single: 10^-38 to 10^38
  • Double: 10^-307 to 10^307
  • Precision (roughly)
  • Single: 7-8 significant decimal digits
  • Double: 15 significant decimal digits

20
Significant Digits
  • In the binary sense, 24 bits are significant
    (with the implicit one; see next page)
  • In the decimal sense, roughly 7-8 significant
    decimal digits
  • When you write your program, make sure the
    results you print carry the meaningful number of
    significant digits.

21
Implicit One
  • The normalized mantissa is always ≥ 1.0 (and
    < 2.0)
  • Only the fractional part is stored, gaining one
    extra bit of precision
  • Ex: 3.5 = 1.11₂ × 2^1; only ".11" is stored

22
Exponent Bias
  • Ex: in single precision, the exponent has 8 bits
  • 0000 0000 (0) to 1111 1111 (255)
  • Add an offset to represent both + and - exponents
  • Effective exponent = biased exponent - bias
  • Bias value: 32-bit: 127; 64-bit: 1023
  • Ex: 32-bit
  • 1000 0000 (128) → effective exp. = 128 - 127 = 1

23
Ex Convert 3.5 to 32-bit FP Number
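One way to carry out the conversion step by step:

  3.5₁₀ = 11.1₂ = 1.11₂ × 2^1
  Sign             S = 0
  Biased exponent  E = 1 + 127 = 128 = 1000 0000₂
  Stored fraction  M = 110 0000 0000 0000 0000 0000₂ (implicit one dropped)
  Result: 0 10000000 11000000000000000000000  (hex 0x40600000)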
24
Examine Bits of FP Numbers
  • Explain how this program works
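A minimal sketch of one way such an examiner can be written (an
illustration; not necessarily the program the slide refers to):

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  /* Print a 32-bit float as its sign | exponent | fraction bit fields. */
  static void examine(float f)
  {
      uint32_t u;
      memcpy(&u, &f, sizeof u);                 /* reinterpret the 4 bytes */
      printf("%g = ", f);
      for (int i = 31; i >= 0; --i) {
          putchar(((u >> i) & 1u) ? '1' : '0');
          if (i == 31 || i == 23)               /* separate S, E, M fields */
              putchar(' ');
      }
      putchar('\n');
  }

  int main(void)
  {
      examine(3.5f);     /* 0 10000000 11000000000000000000000 */
      examine(1.0f);
      examine(0.1f);
      return 0;
  }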

25
The Examiner
  • Use the previous program to
  • Observe how ME works
  • Test subnormal behavior on your
    computer/compiler
  • Convince yourself why the subtraction of two
    nearly equal numbers produces lots of error
  • NaN: Not-a-Number!?

26
Design Philosophy of IEEE 754
  • Field order: s, e, m
  • S first: whether the number is +/- can be tested
    easily
  • E before M: simplifies sorting
  • Represent negative exponents by a bias (not 2's
    complement) for ease of sorting
  • Biased rep.: -1, 0, 1 → 126, 127, 128
  • 2's compl.: -1, 0, 1 → 0xFF, 0x00, 0x01
  • (more complicated math for sorting,
    increment/decrement)

27
Exceptions
  • Overflow
  • INF: when a number exceeds the range of
    representation
  • Underflow
  • When numbers are too close to zero, they are
    treated as zero
  • Dwarf
  • The smallest representable number in the FP
    system
  • Machine Epsilon (ME)
  • A number with computational significance (more
    later)

28
Extremities
More later
  • E = (11...1)
  • M = (00...0): infinity
  • M not all zeros: NaN (Not a Number)
  • E = (00...0)
  • M = (00...0): clean zero
  • M not all zeros: dirty zero (see next page)

29
Not-a-Number
  • Numerical exceptions
  • Sqrt of a negative number
  • Invalid argument to an inverse trigonometric
    function (e.g. asin(2))
  • Often causes the program to stop running

30
Extremities (32-bit)
  • Max
  • Min (w/o stepping into dirty-zero)

Max = (1.111...1)₂ × 2^(254-127) = (10 - 0.000...1)₂ × 2^127 ≈ 2^128
Min = (1.000...0)₂ × 2^(1-127) = 2^-126
31
Dirty-Zero (a.k.a. denormals)
a.k.a. = also known as
  • No Implicit One
  • IEEE 754 did not specify compatibility for
    denormals
  • If you are not sure how to handle them, stay away
    from them. Scale your problem properly
  • Many problems can be solved by pretending they
    do not exist

32
Dirty-Zero (cont)
2^-126:  00000000 10000000 00000000 00000000
2^-127:  00000000 01000000 00000000 00000000
2^-128:  00000000 00100000 00000000 00000000
2^-129:  00000000 00010000 00000000 00000000
(Dwarf: the smallest representable number)
33
Dwarf (32-bit)
00000000 00000000 00000000 00000001
Value = 2^-149
34
Machine Epsilon (ME)
  • Definition
  • smallest non-zero number that makes a difference
    when added to 1.0 on your working platform
  • This is not the same as the dwarf

35
Computing ME (32-bit)
1 + eps: getting closer to 1.0
ME = (00111111 10000000 00000000 00000001) - 1.0
   = 2^-23 ≈ 1.2 × 10^-7
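A minimal sketch of the usual halving loop for computing ME in single
precision (an illustration; the deck's own Me( ) routine appears in a
later exercise):

  #include <stdio.h>

  int main(void)
  {
      float eps = 1.0f;
      /* Halve eps until adding it to 1.0f no longer makes a difference. */
      while ((float)(1.0f + eps / 2.0f) > 1.0f)
          eps /= 2.0f;
      printf("machine epsilon (float) = %g\n", eps);   /* about 1.19e-07 */
      return 0;
  }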
36
Effect of ME
37
Significance of ME
  • Never terminate an iteration by testing whether
    two FP numbers are exactly equal.
  • Instead, test whether |x - y| < ME (see the
    sketch below)
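A minimal sketch of the idea in C (EPS and the iteration body are
placeholders; here the body is a Newton step for sqrt(2)):

  #include <math.h>

  #define EPS 1e-6   /* tolerance for the problem, no smaller than ME */

  /* Iterate until successive approximations agree to within EPS. */
  double iterate(double x0)
  {
      double x = x0, x_new;
      do {
          x_new = 0.5 * (x + 2.0 / x);   /* example iteration body     */
          if (fabs(x_new - x) < EPS)     /* tolerance test, never ==   */
              break;
          x = x_new;
      } while (1);
      return x_new;
  }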

38
Numerical Scaling
  • Number density: there are as many IEEE 754
    numbers between [1.0, 2.0] as there are in
    [256, 512]
  • Revisit
  • Roundoff error
  • ME: a measure of real-number density near 1.0
  • Implication
  • Scale your problem so that intermediate results
    lie between 1.0 and 2.0 (where numbers are dense
    and where roundoff error is smallest)

39
Scaling (cont)
  • Performing computation on denser portions of the
    real line minimizes the roundoff error
  • But don't overdo it; switching to double
    precision will easily increase the precision
  • The densest part is near the subnormals, if
    density is defined as numbers per unit length

40
How Subtraction is Performed on Your PC
  • Steps
  • Convert to base 2
  • Equalize the exponents by adjusting the mantissa
    values; truncate the bits that do not fit
  • Subtract the mantissas
  • Normalize

41
Subtraction of Nearly Equal Numbers
  • Base 10: 1.24446 - 1.24445 = 0.00001

Significant loss of accuracy (most bits are
unreliable)
42
Theorem on Loss of Precision
  • Let x, y be normalized floating-point machine
    numbers with x > y > 0
  • If 2^-p ≤ 1 - y/x ≤ 2^-q
  • then at most p, at least q significant binary
    bits are lost in the subtraction x - y.
  • Interpretation
  • When two numbers are very close, their
    subtraction introduces a lot of numerical error.

43
Implications
  • When you program, watch for expressions that
    subtract nearly equal values
  • You should rewrite them instead (see the sketch
    below)

Every FP operation introduces error, but the
subtraction of nearly equal numbers is the worst
and should be avoided whenever possible
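Two standard examples of such rewrites (illustrative sketches; they may
differ from the slide's own formula pairs):

  #include <math.h>

  /* sqrt(x*x + 1) - 1 loses digits for small x; multiplying and dividing
     by the conjugate removes the subtraction of nearly equal numbers. */
  double f_naive(double x)     { return sqrt(x * x + 1.0) - 1.0; }
  double f_rewritten(double x) { return x * x / (sqrt(x * x + 1.0) + 1.0); }

  /* 1 - cos(x) loses digits for small x; use the half-angle identity. */
  double g_naive(double x)     { return 1.0 - cos(x); }
  double g_rewritten(double x) { double s = sin(0.5 * x); return 2.0 * s * s; }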
44
Efficiency Issues
  • Horner Scheme
  • program examples

45
Horner Scheme
  • For polynomial evaluation: rewrite
    a[0] + a[1] x + ... + a[n] x^n  as
    ( ... (a[n] x + a[n-1]) x + ... ) x + a[0]
  • Compare efficiency (n multiplications and n
    additions; see the sketch below)
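A minimal sketch in C (the coefficient ordering a[0] = constant term is
an assumption):

  /* Horner's scheme: n multiplications and n additions. */
  double horner(const double *a, int n, double x)
  {
      double result = a[n];
      for (int k = n - 1; k >= 0; --k)
          result = result * x + a[k];
      return result;
  }

  /* Naive evaluation for comparison: recomputes x^k for every term. */
  double naive(const double *a, int n, double x)
  {
      double result = 0.0;
      for (int k = 0; k <= n; ++k) {
          double xk = 1.0;
          for (int i = 0; i < k; ++i)   /* k multiplications just for x^k */
              xk *= x;
          result += a[k] * xk;
      }
      return result;
  }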

46
Accuracy vs. Efficiency
47
Good Coding Practice
48
Storing Multidimensional Arrays in Linear Memory
C and others: row-major order (the last index varies fastest)
Fortran, MATLAB: column-major order (the first index varies fastest)
49
On Accessing Arrays
Which loop ordering is more efficient? (see the sketch below)
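A sketch of the usual comparison for a C (row-major) array (an
illustration; the slide's own code may differ):

  #define N 1000
  double a[N][N];

  /* Row-wise traversal: the inner loop walks consecutive memory
     locations, which is cache-friendly in C's row-major layout. */
  double sum_row_wise(void)
  {
      double s = 0.0;
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j)
              s += a[i][j];
      return s;
  }

  /* Column-wise traversal: the inner loop strides N doubles apart,
     causing many more cache misses. */
  double sum_column_wise(void)
  {
      double s = 0.0;
      for (int j = 0; j < N; ++j)
          for (int i = 0; i < N; ++i)
              s += a[i][j];
      return s;
  }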
50
Issues of PI
  • 3.14 is often not accurate enough
  • 4.0atan(1.0) is a good substitute

51
Compare
52
Exercise
  • Explain why
  • Explain why converge when implemented numerically

53
Exercise
  • Why does Me( ) not work as advertised?
  • Construct the 64-bit version of everything
  • Bit-Examiner
  • Dme( )
  • 32-bit int and float: can every int be
    represented by a float (if converted)?

54
Understanding Your Platform
(Table of sizeof results on the presenter's platform:
1, 2, 4, 4, 8, 8, 16, 4 bytes)
Memory word = 4 bytes on 32-bit machines
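A minimal sketch of how such a table can be produced (which types the
slide lists is an assumption):

  #include <stdio.h>

  int main(void)
  {
      /* Report the size of each basic type on this platform. */
      printf("char        %zu\n", sizeof(char));
      printf("short       %zu\n", sizeof(short));
      printf("int         %zu\n", sizeof(int));
      printf("long        %zu\n", sizeof(long));
      printf("long long   %zu\n", sizeof(long long));
      printf("float       %zu\n", sizeof(float));
      printf("double      %zu\n", sizeof(double));
      printf("long double %zu\n", sizeof(long double));
      printf("void *      %zu\n", sizeof(void *));
      return 0;
  }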
55
Padding
56
Data Alignment (data structure padding)
  • Padding is only inserted when a structure member
    is followed by a member with a larger alignment
    requirement or at the end of the structure.
  • Alignment requirement

57
Ex: Padding
sizeof(struct MixedData) == 12 bytes (see the sketch below)
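A definition consistent with the stated 12-byte size, assuming 4-byte
alignment for int (member names and order are assumptions):

  struct MixedData {
      char  a;   /* 1 byte                                              */
                 /* 1 byte of padding so b starts on a 2-byte boundary  */
      short b;   /* 2 bytes                                             */
      int   c;   /* 4 bytes                                             */
      char  d;   /* 1 byte                                              */
                 /* 3 bytes of trailing padding so the total size is a  */
                 /* multiple of the largest alignment requirement (4)   */
  };
  /* 1 + 1 + 2 + 4 + 1 + 3 = 12 bytes */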
58
Data Alignment (cont)
  • By changing the ordering of members in a
    structure, it is possible to change the amount of
    padding required to maintain alignment.
  • Direct the compiler to ignore data alignment
    (align members on a 1-byte boundary); see the
    sketch below

Push current alignment to stack
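A common way to do this with MSVC/GCC-style pragmas (a sketch; the
slide's exact directives may differ):

  #pragma pack(push, 1)   /* push current alignment to a stack, then
                             pack members on 1-byte boundaries          */
  struct PackedData {
      char  a;
      short b;
      int   c;
      char  d;
  };                      /* sizeof(struct PackedData) == 8: no padding */
  #pragma pack(pop)       /* restore the previous alignment setting     */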
59
More on Fixed Point Arithmetic