Floating Point Computation - PowerPoint PPT Presentation

1
Floating Point Computation
  • Jyun-Ming Chen

2
Contents
  • Sources of Computational Error
  • Computer Representation of (Floating-point)
    Numbers
  • Efficiency Issues

3
Sources of Computational Error
  • Converting a mathematical problem to a numerical
    problem introduces errors due to limited
    computational resources
  • Round-off error (limited precision of
    representation)
  • Truncation error (limited time for computation)
  • Misc.
  • Error in the original data
  • Blunder (programming/data-input error)
  • Propagated error

4
Supplement: Error Classification (Hildebrand)
  • Gross error: caused by human or mechanical
    mistakes
  • Roundoff error: the consequence of using a number
    specified by n correct digits to approximate a
    number which requires more than n digits
    (generally infinitely many digits) for its exact
    specification
  • Truncation error: any error which is neither a
    gross error nor a roundoff error
  • Frequently, a truncation error corresponds to the
    fact that, whereas an exact result would be
    afforded (in the limit) by an infinite sequence
    of steps, the process is truncated after a
    certain finite number of steps

5
Common Measures of Error
  • Definitions
  • total error = round-off error + truncation error
  • Absolute error = |numerical − exact|
  • Relative error = absolute error / |exact|
  • If exact is zero, relative error is not defined

6
Ex: Round-off Error
  • Representation consists of a finite number of
    digits
  • Implication: machine-representable numbers form a
    discrete set (more later)

7
Watch out for printf !!
8
Ex: Numerical Differentiation
  • Evaluating the first derivative of f(x)
  • Forward difference: f'(x) ≈ (f(x+h) − f(x)) / h

Truncation error: O(h), from the dropped higher-order
terms of the Taylor expansion of f(x+h)
9
Numerical Differentiation (cont)
  • Select a problem with known answer
  • So that we can evaluate the error!

10
Numerical Differentiation (cont)
  • Error analysis
  • h ↓ ⇒ (truncation) error ↓
  • What happened at h = 0.00001?!

11
Ex: Polynomial Deflation
  • F(x) is a polynomial with 20 real roots
  • Use any method to numerically solve for one root,
    then deflate the polynomial to degree 19
  • Solve for another root, deflate again, and so on
  • The accuracy of the roots obtained gets worse each
    time due to error propagation

12
Computer Representation of Floating Point Numbers
  • Floating point vs. fixed point
  • Decimal-binary conversion
  • Standard IEEE 754 (1985)

13
Floating vs. Fixed Point
  • Decimal, 6 digits (positive numbers)
  • Fixed point, with 5 digits after the decimal
    point: 0.00001, …, 9.99999
  • Floating point, 2 digits for the exponent (base
    10) and 4 digits for the mantissa (accuracy):
    0.001×10⁻⁹⁹, …, 9.999×10⁹⁹
  • Comparison
  • Fixed point: fixed accuracy; simple math for
    computation (sometimes used in graphics programs)
  • Floating point: trade accuracy for a larger range
    of representation

14
Decimal-Binary Conversion
  • Ex 134 (base 10)
  • Ex 0.125 (base 10)
  • Ex 0.1 (base 10)

15
Floating Point Representation
  • Fraction (mantissa), f
  • Usually normalized so that the leading digit is
    nonzero
  • Base, b
  • 2 for personal computers
  • 16 for mainframes
  • Exponent, e

16
Understanding Your Platform
17
Padding
18
IEEE 754-1985
  • Purpose: make floating-point systems portable
  • Defines the number representation, how
    calculations are performed, exceptions, etc.
  • Single precision (32-bit)
  • Double precision (64-bit)

19
Number Representation
  • S: sign of mantissa
  • Range (roughly)
  • Single: 10⁻³⁸ to 10³⁸
  • Double: 10⁻³⁰⁷ to 10³⁰⁷
  • Precision (roughly)
  • Single: 7 significant decimal digits
  • Double: 15 significant decimal digits

Describe how these are obtained
20
Implication
  • When you write your program, make sure the
    results you print carry only the meaningful
    significant digits

21
Implicit One
  • The mantissa is normalized so that its leading
    bit is always 1; leaving that bit implicit gains
    one extra bit of precision
  • Ex: 3.5 = (1.11)₂ × 2¹; only the fraction bits 11
    are stored

22
Exponent Bias
  • Ex: in single precision, the exponent has 8 bits
  • 0000 0000 (0) to 1111 1111 (255)
  • Add an offset to represent both + and − exponents
  • Effective exponent = biased exponent − bias
  • Bias value: 127 (32-bit), 1023 (64-bit)
  • Ex (32-bit):
  • 1000 0000 (128): effective exp. = 128 − 127 = 1

23
Ex Convert 3.5 to 32-bit FP Number
24
Examine Bits of FP Numbers
  • Explain how this program works

25
The Examiner
  • Use the previous program to
  • Observe how ME work
  • Test subnormal behaviors on your
    computer/compiler
  • Convince yourself why the subtraction of two
    nearly equal numbers produce lots of error
  • NaN Not-a-Number !?

26
Design Philosophy of IEEE 754
  • Field order: s, e, m
  • S first: whether the number is +/− can be tested
    easily
  • E before M: simplifies sorting
  • Negative exponents represented by a bias (not 2's
    complement) for ease of sorting
  • biased rep.: −1, 0, 1 → 126, 127, 128
  • 2's compl.: −1, 0, 1 → 0xFF, 0x00, 0x01
  • More complicated math for sorting,
    increment/decrement

27
Exceptions
  • Overflow
  • INF: when a number exceeds the range of
    representation
  • Underflow
  • When numbers are too close to zero, they are
    treated as zeros
  • Dwarf
  • The smallest representable number in the FP
    system
  • Machine Epsilon (ME)
  • A number with computational significance (more
    later)

28
Extremities
More later
  • E = 1111 1111 (all ones)
  • M all zeros: infinity
  • M not all zeros: NaN (Not a Number)
  • E = 0000 0000 (all zeros)
  • M all zeros: clean zero
  • M not all zeros: dirty zero (see next page)

29
Not-a-Number
  • Numerical exceptions
  • Sqrt of a negative number
  • Invalid domain of trigonometric functions
  • Often causes the program to stop running

30
Extremities (32-bit)
  • Max
  • Min (w/o stepping into dirty zero)

Max = (1.1111…1)₂ × 2^(254−127)
    = (10 − 0.0000001)₂ × 2^127 ≈ 2^128
Min = (1.0000…0)₂ × 2^(1−127) = 2^−126
31
Dirty-Zero (a.k.a. denormals)
a.k.a. = also known as
  • No implicit one
  • IEEE 754 did not specify compatibility for
    denormals
  • If you are not sure how to handle them, stay away
    from them; scale your problem properly
  • Many problems can be solved by pretending they do
    not exist

32
Dirty-Zero (cont)
2^−126 = 00000000 10000000 00000000 00000000
2^−127 = 00000000 01000000 00000000 00000000
2^−128 = 00000000 00100000 00000000 00000000
2^−129 = 00000000 00010000 00000000 00000000
(Dwarf: the smallest representable number)
33
Dwarf (32-bit)
00000000 00000000 00000000 00000001
Value = 2^−149
34
Machine Epsilon (ME)
  • Definition: the smallest non-zero number that
    makes a difference when added to 1.0 on your
    working platform
  • This is not the same as the dwarf

35
Computing ME (32-bit)
Start from 1 + eps and keep getting closer to 1.0
ME = (00111111 10000000 00000000 00000001) − 1.0
   = 2^−23 ≈ 1.19 × 10⁻⁷
36
Effect of ME
37
Significance of ME
  • Never terminate an iteration by testing whether
    two FP numbers are equal
  • Instead, test whether |x − y| < ME

38
Numerical Scaling
  • Number density: there are as many IEEE 754
    numbers between 1.0 and 2.0 as there are between
    256 and 512
  • Revisit
  • roundoff error
  • ME: a measure of density near 1.0
  • Implication
  • Scale your problem so that intermediate results
    lie between 1.0 and 2.0 (where numbers are dense
    and where roundoff error is smallest)

39
Scaling (cont)
  • Performing computation on denser portions of the
    real line minimizes the roundoff error
  • But don't overdo it: switching to double
    precision is an easier way to gain precision
  • The densest part is near the subnormals, if
    density is defined as numbers per unit length

40
How Subtraction Is Performed on Your PC
  • Steps
  • Convert to base 2
  • Equalize the exponents by adjusting the mantissa
    values; truncate the bits that do not fit
  • Subtract mantissas
  • Normalize

41
Subtraction of Nearly Equal Numbers
  • Base 10: 1.24446 − 1.24445

Significant loss of accuracy (most bits are
unreliable)
42
Theorem of Loss of Precision
  • Let x, y be normalized floating-point machine
    numbers with x > y > 0
  • If 2^−p ≤ 1 − y/x ≤ 2^−q
  • then at most p, and at least q, significant
    binary bits are lost in the subtraction x − y
  • Interpretation
  • When two numbers are very close, their
    subtraction introduces a lot of numerical error

43
Implications
  • When you program
  • You should write these instead

Every FP operation introduces error, but the
subtraction of nearly equal numbers is the worst
and should be avoided whenever possible
44
Efficiency Issues
  • Horner Scheme
  • program examples

45
Horner Scheme
  • For polynomial evaluation
  • Compare efficiency

46
Accuracy vs. Efficiency
47
Good Coding Practice
48
On Arrays
49
Issues with PI
  • 3.14 is often not accurate enough
  • 4.0*atan(1.0) is a good substitute

50
Compare
51
Exercise
  • Explain why
  • Explain why it converges when implemented
    numerically

52
Exercise
  • Why does Me( ) not work as advertised?
  • Construct the 64-bit version of everything
  • Bit examiner
  • Dme( )
  • 32-bit int and float: can every int be exactly
    represented by a float (if converted)?